How to remove duplicates based on multiple columns in SAS?

Question

I have a table in SAS like the one below:

Data types:

COL1 - datetime
COL2 - numeric

The data types of the remaining columns do not matter.

COL1	COL2	...	COLn
22AUG2022:15:46:38.000000	111	...	...
22AUG2022:15:46:38.000000	111	...	...
22AUG2022:15:46:38.000000	111	...	...
22AUG2022:15:46:38.000000	222	...	...
...	...	...	...

I have many columns in my table and I need to delete duplicates in COL2 based on the values in COL1, so the result should look like this:

COL1	COL2	...	COLn
22AUG2022:15:46:38.000000	111	...	...
22AUG2022:15:46:38.000000	222	...	...
...	...	...	...

How can I do that using SAS Procedures or PROC SQL in SAS Enterprise Guide?

You said "delete duplicates in COL2 based on values in COL1". What does "based on" mean?
– whymath, Commented Dec 8, 2022 at 2:19
It means that if the same combination of values in COL1 and COL2 occurs repeatedly, only one record with those values should be kept, as in the example :)
– dingaro, Commented Dec 8, 2022 at 2:28
Try the nodupkey option in proc sort; there is an example (stackoverflow.com/questions/74706049/…) for you.
– whymath, Commented Dec 8, 2022 at 2:51

Kermit · Accepted Answer · 2022-12-08 07:51:13Z

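The question is: which observation do you want to keep for a given BY group? Perhaps the rest of the columns do not matter because they have the same values across rows?

If so, use the noduprecs option in proc sort. It deletes observations that are duplicated across all variables, whereas nodupkey deletes observations that have duplicate BY values. Note that noduprecs only compares each observation with the previous one written to the output data set, so identical rows that are not adjacent after sorting are not removed.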

Take the dataset below as an example:

       col1        col2  col3
 22AUG22:15:46:38   111   ABC
 22AUG22:15:46:38   111   DEF
 22AUG22:15:46:38   111   GHI
 22AUG22:15:46:38   222   JKL
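
To make the examples below reproducible, the sample data can be loaded with a data step. This is only a sketch: the dataset name have matches the code that follows, while the informat, the display format, and the character length of col3 are assumptions chosen to reproduce the listing above.

```sas
/* Build the example dataset used in the rest of this answer. */
data have;
    /* col1 is read as a datetime, col2 as numeric, col3 as character */
    input col1 :datetime20. col2 col3 $;
    /* assumed display format, chosen to match the listing above */
    format col1 datetime16.;
    datalines;
22AUG2022:15:46:38 111 ABC
22AUG2022:15:46:38 111 DEF
22AUG2022:15:46:38 111 GHI
22AUG2022:15:46:38 222 JKL
;
run;
```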

Using proc sort

* delete observations that have duplicated BY values;
proc sort data=have out=want nodupkey equals;
by col1 col2;
run;

22AUG22:15:46:38 111 ABC  --> Keep the first record within the by group
22AUG22:15:46:38 222 JKL

Note that with the equals option, observations with identical BY values retain the same relative order in the output data set as in the input data set, so the record that nodupkey keeps is the first one in input order. See more on Does NODUPKEY Select the First Record in a By Group?
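
If you would rather make the choice of record explicit, a data step with BY-group processing is one way to do it. The sketch below is an alternative, not part of the proc sort approach above; the intermediate dataset name have_sorted is hypothetical, and it assumes the sort preserves input order within each BY group.

```sas
/* Sort by the key columns first; BY-group processing needs sorted input. */
proc sort data=have out=have_sorted equals;
    by col1 col2;
run;

data want;
    set have_sorted;
    by col1 col2;
    /* first.col2 is 1 on the first row of each col1/col2 combination; */
    /* use last.col2 instead to keep the last row of each group.       */
    if first.col2;
run;
```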

* delete observations that have duplicated values;
proc sort data=have out=want noduprecs equals;
by col1 col2;
run;

22AUG22:15:46:38 111 ABC
22AUG22:15:46:38 111 DEF
22AUG22:15:46:38 111 GHI
22AUG22:15:46:38 222 JKL

Using the noduprecs option will not remove any observation from our example dataset, because col3 has different values in every row. You are not showing the rest of the columns in your example, but should they share the same values across rows, you might want to use it.
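
When the goal is to drop rows that are identical in every column, sorting by all variables avoids depending on which duplicates happen to end up next to each other. A minimal sketch, assuming the row order of the output does not matter; the output name want_sql in the second step is hypothetical:

```sas
/* Sort on every variable so that fully identical rows become adjacent, */
/* then let noduprecs drop the repeats.                                 */
proc sort data=have out=want noduprecs;
    by _all_;
run;

/* Same idea in proc sql: keep one copy of each distinct row. */
proc sql;
    create table want_sql as
    select distinct *
    from have;
quit;
```

With the example data both steps return all four rows, since no two rows agree on col3.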

You could alternatively do it using proc summary

proc summary data=have nway;
class col1 col2;
id col3;
output out=want(drop=_type_ _freq_);
run;

22AUG22:15:46:38 111 GHI --> Keep the maximum value of col3 within the group (the default for the id statement; the idmin option keeps the minimum)
22AUG22:15:46:38 222 JKL

Note that proc summary with a class statement can be a more efficient alternative to proc sort, because the input does not need to be sorted in advance.

Or even using proc sql

proc sql;
create table want as
select col1, col2, col3
from have
group by col1, col2
having col3 = max(col3);
quit;

22AUG22:15:46:38 111 GHI --> max value within the group by columns
22AUG22:15:46:38 222 JKL

proc sql;
create table want as
select col1, col2, col3
from have
group by col1, col2
having col3 = min(col3);
quit;

22AUG22:15:46:38 111 ABC --> min value within the group by columns
22AUG22:15:46:38 222 JKL
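
One caveat with the having approach: if several rows in a group share the same maximum (or minimum) col3, all of them are returned. Adding distinct is one way to collapse such ties, as sketched below. It only helps when the selected columns are identical across the tied rows, which is an assumption about your data.

```sas
proc sql;
    create table want as
    /* distinct collapses rows that tie on the maximum col3 */
    select distinct col1, col2, col3
    from have
    group by col1, col2
    having col3 = max(col3);
quit;
```

In both variants SAS writes a note to the log that the query requires remerging summary statistics back with the original data; that is expected here, because col3 is selected without being aggregated.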
