I have a table in SAS like the one below.

Data types:

  • COL1 - datetime

  • COL2 - numeric

  • the data type of the remaining columns does not matter

    COL1 COL2 ... COLn
    22AUG2022:15:46:38.000000 111 ... ...
    22AUG2022:15:46:38.000000 111 ... ...
    22AUG2022:15:46:38.000000 111 ... ...
    22AUG2022:15:46:38.000000 222 ... ...
    ... ... ... ...

I have many columns in my table and I need to delete duplicates in COL2 based on the values in COL1, so that the result looks like this:

    COL1 COL2 ... COLn
    22AUG2022:15:46:38.000000 111 ... ...
    22AUG2022:15:46:38.000000 222 ... ...
    ... ... ... ...

How can I do that using SAS Procedures or PROC SQL in SAS Enterprise Guide?

  • You said "delete duplicates in COL2 based on values in COL1". What does "based on" mean? Commented Dec 8, 2022 at 2:19
  • It means that if the same values of COL1 and COL2 occur repeatedly, only one record with those values should be kept, as in the example :) Commented Dec 8, 2022 at 2:28
  • Try the nodupkey option in proc sort; there is an example (stackoverflow.com/questions/74706049/…) for you. Commented Dec 8, 2022 at 2:51

1 Answer

The question is which observation you would like to keep for a given BY group. Perhaps the rest of the columns do not matter because they have the same values across rows?

If so, use the noduprecs option in proc sort. It deletes observations that are duplicated in their entirety, whereas nodupkey deletes observations that have duplicate BY values.

Take the dataset below as an example:

       col1        col2  col3
 22AUG22:15:46:38   111   ABC
 22AUG22:15:46:38   111   DEF
 22AUG22:15:46:38   111   GHI
 22AUG22:15:46:38   222   JKL
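
To make the examples below reproducible, the sample table can be built with a DATA step. This is a minimal sketch: the dataset name have matches the code that follows, while the informat and the display format are assumptions chosen to reproduce the values shown above.

```sas
/* Hypothetical reconstruction of the example table */
data have;
    /* Read col1 as a datetime value; col3 is a character column */
    input col1 :datetime18. col2 col3 $;
    /* Assumed display format, giving values like 22AUG22:15:46:38 */
    format col1 datetime16.;
    datalines;
22AUG2022:15:46:38 111 ABC
22AUG2022:15:46:38 111 DEF
22AUG2022:15:46:38 111 GHI
22AUG2022:15:46:38 222 JKL
;
run;
```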

Using proc sort

* delete observations that have duplicated BY values;
proc sort data=have out=want nodupkey equals;
by col1 col2;
run;

22AUG22:15:46:38 111 ABC  --> Keep the first record within the by group
22AUG22:15:46:38 222 JKL

Note that the equals option makes observations with identical BY values keep the same relative positions in the output data set as in the input data set, so the record that survives is the first one in input order. See more on Does NODUPKEY Select the First Record in a By Group?

* delete observations that have duplicated values;
proc sort data=have out=want noduprecs equals;
by col1 col2;
run;

22AUG22:15:46:38 111 ABC
22AUG22:15:46:38 111 DEF
22AUG22:15:46:38 111 GHI
22AUG22:15:46:38 222 JKL

Using the noduprecs option will not remove any observation from our example dataset because col3 has a different value on every row. You are not showing the rest of the columns in your example, but should they share the same values across rows, you might want to use it.
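
One caveat, as far as I know: noduprecs compares each observation only with the previous one written to the output, so fully identical rows are only guaranteed to be removed when the sort places them next to each other. A hedged sketch of two ways to handle that, sorting by every variable or using select distinct in proc sql (the output names want_sorted and want_distinct are hypothetical):

```sas
/* Sort by all variables so identical rows end up adjacent
   before noduprecs compares them */
proc sort data=have out=want_sorted noduprecs;
    by _all_;
run;

/* Same idea in PROC SQL: keep one copy of each fully identical row */
proc sql;
    create table want_distinct as
    select distinct *
    from have;
quit;
```

With the example dataset both variants should leave all four rows in place, since no two rows are identical in every column.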


You could alternatively do it using proc summary

proc summary data=have nway;
class col1 col2;
id col3;
output out=want(drop=_type_ _freq_);
run;

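22AUG22:15:46:38 111 GHI --> Keeps the maximum value of the ID variable within the group (here also the last one)
22AUG22:15:46:38 222 JKL

Note that proc summary with a class statement does not need the input to be sorted in advance, which can make it more efficient than proc sort on large tables. Be aware that observations with a missing class value are dropped unless the missing option is specified.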
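
By default the ID statement reports the maximum value of the ID variable in each group. If the smallest value is wanted instead, proc summary accepts an idmin option. A minimal sketch reusing the example dataset (the output name want_min is hypothetical):

```sas
/* idmin makes the ID statement return the minimum col3 per group
   instead of the default maximum */
proc summary data=have nway idmin;
    class col1 col2;
    id col3;
    output out=want_min(drop=_type_ _freq_);
run;
```

For the example data this should keep ABC for the 111 group and JKL for the 222 group.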


Or even using proc sql

proc sql;
create table want as
select col1, col2, col3
from have
group by col1, col2
having col3 = max(col3);
quit;

22AUG22:15:46:38 111 GHI --> max value of col3 within each col1, col2 group
22AUG22:15:46:38 222 JKL

To keep the minimum instead:

proc sql;
create table want as
select col1, col2, col3
from have
group by col1, col2
having col3 = min(col3);
quit;

22AUG22:15:46:38 111 ABC --> min value within the group by columns
22AUG22:15:46:38 222 JKL
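
Note that the having filters above can return more than one row per group when several rows tie on col3, and they only look at a single column. If exactly one row per col1, col2 pair is needed with all other columns intact, a DATA step with BY-group processing is one option. A sketch, assuming a hypothetical intermediate dataset named sorted:

```sas
/* Order the rows so each col1, col2 group is contiguous;
   equals keeps the input order within a group */
proc sort data=have out=sorted equals;
    by col1 col2;
run;

data want;
    set sorted;
    by col1 col2;
    /* first.col2 is 1 on the first row of each col1, col2 group */
    if first.col2;
run;
```

Replacing first.col2 with last.col2 keeps the last row of each group instead.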