2

I have two SAS data tables. The first has many millions of records, and each record is identified with a sequential record ID, like this:

Table A

Rec  Var1 Var2 ... VarX
1    ...
2
3

The second table specifies which rows from Table A should be assigned a coding variable:

Table B

Code  BegRec    EndRec
AA      1200      4370
AX      7241      9488
BY     12119     14763

So the first row of Table B means any data in Table A that has rec between 1200 and 4370 should be assigned code AA.

I know how to accomplish this with proc sql, but I want to see how this is done with a hash object.

In SQL, it's just:

proc sql;
 select b.code, a.*
 from tableA a, tableB b
 where b.begrec<=a.rec<=b.endrec;
quit;

My actual data contains hundreds of gigabytes of data, so I want to do the processing as efficiently as possible. My understanding is that using a hash object may help here, but I haven't been able to figure out how to map what I'm doing to use that way.

2
  • A very 'SAS' solution to this may be to create a custom format for the Rec field. You can rearrange your 'Table B' into a control data set that proc format understands, and then use PROC FORMAT with the CNTLIN= option to build your format. Then, just apply it. Commented Feb 13, 2012 at 19:24
  • See support.sas.com/documentation/cdl/en/proc/61895/HTML/default/… for a very similar example Commented Feb 13, 2012 at 19:25

4 Answers 4

4

A hash object solution (data input code borrowed from @Rob_Penridge).

    data big;
      do rec = 1 to 20000;
       output;
      end;
    run;

    data lookup;      
      input Code $ BegRec EndRec;
      datalines;
      AA      1200      4370
      AX      7241      9488
      BY     12119     14763
      ;
    run;


    data created;
      format code $4.;
      format begrec endrec best8.;
      if _n_=1 then do;
        declare hash h(dataset:'lookup');
        h.definekey('Code');
        h.definedata('code','begrec','endrec');
        h.definedone();
        call missing(code,begrec,endrec);
        declare hiter iter('h');
      end;

    set big;
    iter.first();
      do until (rc^=0);
       if begrec <= rec <= endrec then do;
       code_dup=code;
      end;
      rc=iter.next();
     end;
    keep rec code_dup;
    run;
Sign up to request clarification or add additional context in comments.

1 Comment

@itzy you may compare the efficiency with sql solution. I don't use hash object often and don't know whether it will deliver superior speed. For me, in most cases, sql is fast enough and capable to cope with massive amount of data.
3

I'm not sure a hash table would even be the most efficient approach here. I would probably solve this problem using a SELECT statement as the conditional logic will be fast and it still only requires 1 parse through the data:

select;
  when ( 1200 <= _n_ <=4370) code = 'AA';
  ...
  otherwise;
end;

Assuming that you will need to run this code multiple times and the data may change each time you may not want to hardcode the select statement. So the best solution would dynamically build it using a macro. I have a utility macro I use for these kinds of situations (included at the bottom):

1) Create the data

data big;
  do i = 1 to 20000;
    output;
  end;
run;

data lookup;      
  input Code $ BegRec EndRec;
  datalines;
AA      1200      4370
AX      7241      9488
BY     12119     14763
;
run;

2) Save the contents of the smaller table into macro variables. You could also do this using call symput or other preferred method. This method assumes you don't have too many rows in your lookup table.

%table_parse(iDs=lookup, iField=code  , iPrefix=code);
%table_parse(iDs=lookup, iField=begrec, iPrefix=begrec);
%table_parse(iDs=lookup, iField=endrec, iPrefix=endrec);

3) Dynamically build the SELECT statement.

%macro ds;
  %local cnt;

  data final;
    set big;

    select;
      %do cnt=1 %to &code;
        when (&&begrec&cnt <= _n_ <= &&endrec&cnt) code = "&&code&cnt";
      %end;
      otherwise;
    end;

  run;
%mend;
%ds;

Here is the utility macro:

/*****************************************************************************
**  MACRO.TABLE_PARSE.SAS
**
**  AS PER %LIST_PARSE BUT IT TAKES INPUT FROM A FIELD IN A TABLE.
**  STORE EACH OBSERVATION'S FIELD'S VALUE INTO IT'S OWN MACRO VARIABLE.
**  THE TOTAL NUMBER OF WORDS IN THE STRING IS ALSO SAVED IN A MACRO VARIABLE.
**
**  THIS WAS CREATED BECAUSE %LIST_PARSE WOULD FALL OVER WITH VERY LONG INPUT
**  STRINGS.  THIS WILL NOT.
**
**  EACH VALUE IS STORED TO ITS OWN MACRO VARIABLE.  THE NAMES
**  ARE IN THE FORMAT <PREFIX>1 .. <PREFIX>N.
**
**  PARAMETERS:
**  iDS        : (LIB.DATASET) THE NAME OF THE DATASET TO USE.
**  iFIELD     : THE NAME OF THE FIELD WITHIN THE DATASET.
**  iPREFIX    : THE PREFIX TO USE FOR STORING EACH WORD OF THE ISTRING TO 
**               ITS OWN MACRO VARIABLE (AND THE TOTAL NUMBER OF WORDS). 
**  iDSOPTIONS : OPTIONAL. ANY DATSET OPTIONS YOU MAY WANT TO PASS IN
**               SUCH AS A WHERE FILTER OR KEEP STATEMENT.
**
******************************************************************************
**  HISTORY:
**  1.0  MODIFIED: 01-FEB-2007  BY: ROBERT PENRIDGE
**  - CREATED.
**  1.1  MODIFIED: 27-AUG-2010  BY: ROBERT PENRIDGE
**  - MODIFIED TO ALLOW UNMATCHED QUOTES ETC IN VALUES BEING RETURNED BY 
**    CHARACTER FIELDS.
**  1.2  MODIFIED: 30-AUG-2010  BY: ROBERT PENRIDGE
**  - MODIFIED TO ALLOW BLANK CHARACTER VALUES AND ALSO REMOVED TRAILING
**    SPACES INTRODUCED BY CHANGE 1.1.
**  1.3  MODIFIED: 31-AUG-2010  BY: ROBERT PENRIDGE
**  - MODIFIED TO ALLOW PARENTHESES IN CHARACTER VALUES.
**  1.4  MODIFIED: 31-AUG-2010  BY: ROBERT PENRIDGE
**  - ADDED SOME DEBUG VALUES TO DETERMINE WHY IT SOMETIMES LOCKS TABLES.
*****************************************************************************/
%macro table_parse(iDs=, iField=, iDsOptions=, iPrefix=);
  %local dsid pos rc cnt cell_value type;

  %let cnt=0;
  /*
  ** OPEN THE TABLE (AND MAKE SURE IT EXISTS)
  */
  %let dsid=%sysfunc(open(&iDs(&iDsOptions),i));
  %if &dsid eq 0 %then %do;
    %put WARNING: MACRO.TABLE_PARSE.SAS: %sysfunc(sysmsg());      
  %end;

  /*
  ** GET THE POSITION OF THE FIELD (AND MAKE SURE IT EXISTS)
  */
  %let pos=%sysfunc(varnum(&dsid,&iField));
  %if &pos eq 0 %then %do;
    %put WARNING: MACRO.TABLE_PARSE.SAS: %sysfunc(sysmsg());      
  %end;
  %else %do;
    /*
    ** DETERMINE THE TYPE OF THE FIELD
    */
    %let type = %upcase(%sysfunc(vartype(&dsid,&pos)));
  %end;

  /*
  ** READ THROUGH EACH OBSERVATION IN THE TABLE
  */
  %let rc=%sysfunc(fetch(&dsid));
  %do %while (&rc eq 0);
    %let cnt = %eval(&cnt + 1);
    %if "&type" = "C" %then %do;
      %let cell_value = %qsysfunc(getvarc(&dsid,&pos));
      %if "%trim(&cell_value)" ne "" %then %do;
        %let cell_value = %qsysfunc(cats(%nrstr(&cell_value)));
      %end;
    %end;
    %else %do;
      %let cell_value = %sysfunc(getvarn(&dsid,&pos));
    %end;

    %global &iPrefix.&cnt ;
    %let &iPrefix.&cnt = &cell_value ;

    %let rc=%sysfunc(fetch(&dsid));
  %end;


  /*
  ** CHECK FOR ABNORMAL TERMINATION OF LOOP
  */
  %if &rc ne -1 %then %do;
    %put WARNING: MACRO.TABLE_PARSE.SAS: %sysfunc(sysmsg());      
  %end;


  /*
  ** ENSURE THE TABLE IS CLOSED SUCCESSFULLY
  */
  %let rc=%sysfunc(close(&dsid));
  %if &rc %then %do;
    %put WARNING: MACRO.TABLE_PARSE.SAS: %sysfunc(sysmsg());      
  %end;

  %global &iPrefix;
  %let &iPrefix = &cnt ;
%mend;

Other examples of calling this macro:

%table_parse(iDs=sashelp.class, iField=sex, iPrefix=myTable, iDsOptions=%str(where=(sex='F')));
%put &mytable &myTable1 &myTable2 &myTable3; *etc...;

2 Comments

Thanks very much Rob. When you say "not too many rows" in step 2, can you give me a sense of how many that is? I generally have about 8000 rows in the lookup table.
You may be ok. I recommend just testing it. If it takes longer than you're willing to wait then it's too many macro vars =P I remember I had a problem when I was doing some dev work. It took say 30 seconds to create 10k, but 30minutes to create 20k (anecdotal numbers). Honestly it may have been due to something else I was doing as I haven't had the problem since but it's something I've kept in the back of my mind. YMMV. Also, the problem may exist with my macro, and there may be no issue if you just used call symput. I simply use the macro for convenience as it is a common task.
2

I'd be tempted to use the direct access method POINT= here, this will only read the required row numbers rather than the whole dataset. Here is the code, which uses the same create data code as in Rob's answer.

    data want;
    set lookup;
    do i=begrec to endrec;
    set big point=i;
    output;
    end;
    drop begrec endrec;
    run;

If you have the code column already in the big dataset and you just wanted to update the values from the lookup dataset, then you could do this using MODIFY.

    data big;
    set lookup (rename=(code=code1));
    do i=begrec to endrec;
    modify big point=i;
    code=code1;
    replace;
    end;
    run;

Comments

2

Here's my solution, using proc format. This is also done in-memory, much like a hash table, but requires less structural code to work.

(Data input code also borrowed from @Rob_Penridge.)

data big;
  do rec = 1 to 20000;
   output;
  end;
run;

data lookup;      
  input Code $ BegRec EndRec;
  datalines;
  ZZ         0        20
  JJ        40        60
  AA      1200      4370
  AX      7241      9488
  BY     12119     14763
  ;
run;

data lookup_f;
    set lookup;    
    rename
        BegRec  = start
        EndRec  = end
        Code    = label;

    retain fmtname 'CodeRecFormat';
run;

proc format library = work cntlin=lookup_f; run;


data big_formatted;
    format rec CodeRecFormat.;
    format rec2 8.;
    length code $5.;

    set big;    

    code = putn(rec, "CodeRecFormat.");
    rec2 = rec;
run;

2 Comments

Note that you can specify a width for your custom format, just as a regular in-built format.
Yes I think a format is the best approach for this problem. Good call.

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.