I have a dataset like this:

6   AA_A_56_30018678_E  0   30018678    P   A
6   SNP_A_30018678  0   30018678    A   G
6   SNP_A_30018679  0   30018679    T   G
6   SNP_A_30018682  0   30018682    T   G
6   SNP_A_30018695  0   30018695    G   C
6   AA_A_62_30018696_Q  0   30018696    P   A
6   AA_A_62_30018696_G  0   30018696    P   A
6   AA_A_62_30018696_R  0   30018696    P   A

I want to remove all rows whose value in column 4 is duplicated.

I have used the commands below (using sort, awk, uniq and join...) to get the required output. However, is there a better way to do this?

sort -k4,4 example.txt | awk '{print $4}' | uniq -u  > snp_sort.txt

join -1 1 -2 4 snp_sort.txt example.txt | awk '{print $3,$5,$6,$1}' > uniq.txt

Here is the output:

SNP_A_30018679  T   G   30018679
SNP_A_30018682  T   G   30018682
SNP_A_30018695  G   C   30018695

6 Answers

Using awk to filter out duplicate lines and print only those whose fourth field occurs exactly once:

awk '{k=($2 FS $5 FS $6 FS $4)} {a[$4]++;b[$4]=k}END{for(x in a)if(a[x]==1)print b[x]}' input_file

SNP_A_30018682 T G 30018682
SNP_A_30018695 G C 30018695
SNP_A_30018679 T G 30018679

The idea is to:

  1. Maintain a counter for each $4 value in array a, and store the reformatted line for that value in array b.
  2. In the END block, print the stored line for every value that occurs exactly once (expanded with comments below).
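
The same one-liner spread over multiple lines, with comments:

awk '
    {
        a[$4]++                        # count how many times this position occurs
        b[$4] = $2 FS $5 FS $6 FS $4   # remember the reformatted line for it
    }
    END {
        for (x in a)
            if (a[x] == 1)             # keep only positions seen exactly once
                print b[x]
    }
' input_file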

Using command substitution: first print only the values that are unique in the fourth field, then grep for those values.

grep "$(echo  "$(awk '{print $4}' inputfile.txt)" |sort |uniq -u)" inputfile.txt
6   SNP_A_30018679  0   30018679    T   G
6   SNP_A_30018682  0   30018682    T   G
6   SNP_A_30018695  0   30018695    G   C

Note: pipe awk '{NF=4}1' onto the end of the command if you wish to print only the first four columns. Of course you can adapt it by changing $4 (the key field) and NF=4 (the number of columns kept).
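
Assuming bash and GNU grep are available, a variant with process substitution avoids the nested command substitution, and -F -w match the unique positions as whole fixed-string words so they cannot accidentally match inside a marker name:

grep -wFf <(awk '{print $4}' inputfile.txt | sort | uniq -u) inputfile.txt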

1 Comment

You still need to sort: uniq does not eliminate duplicates unless they are on consecutive lines.

$ awk 'NR==FNR{c[$4]++;next} c[$4]<2' file file
6   SNP_A_30018679  0   30018679    T   G
6   SNP_A_30018682  0   30018682    T   G
6   SNP_A_30018695  0   30018695    G   C
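
The file is read twice: during the first pass (while NR==FNR holds) the occurrences of each fourth field are counted; during the second pass a line is printed only if its count is below 2. The same idea, commented:

awk '
    NR==FNR { c[$4]++; next }   # pass 1: count occurrences of each position
    c[$4] < 2                   # pass 2: print lines whose position occurred once
' file file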

Since your 'key' is fixed-width, uniq has a -w option to compare only that many characters:

sort -k4,4 example.txt | uniq -u -f 3 -w 8  > uniq.txt

1 Comment

Hmm, I only got one line out of it.
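
That is likely the character count: uniq skips the first three fields, but the comparison window then starts at the blank(s) preceding the fourth field, so with a single tab as separator -w 8 covers the tab plus only seven of the eight digits. 30018678/30018679 then compare equal, as do 30018695/30018696, and all of them are dropped, leaving only 30018682. Assuming single-tab separators, widening the window by one character fixes it:

sort -k4,4 example.txt | uniq -u -f 3 -w 9 > uniq.txt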

Another in awk:

$ awk '{$1=$1; a[$4]=a[$4] $0} END{for(i in a) if(gsub(FS,FS,a[i])==5) print a[i]}' file
6 SNP_A_30018679 0 30018679 T G
6 SNP_A_30018682 0 30018682 T G
6 SNP_A_30018695 0 30018695 G C

Concatenate records into an array using $4 as the key. $1=$1 rebuilds each record with single-space separators, so a single record contains exactly five of them; if gsub counts more than 5, duplicates were concatenated onto the key and the entry is not printed.
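
The same one-liner expanded, with comments:

awk '
    { $1 = $1; a[$4] = a[$4] $0 }          # rebuild with single spaces, append record under its position
    END {
        for (i in a)
            if (gsub(FS, FS, a[i]) == 5)   # exactly five separators => a single record
                print a[i]
    }
' file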

And yet another version in awk. It expects the file to be sorted on the fourth field, stores only the keys in memory rather than all lines (even that could probably be avoided, since the key field is sorted; may be fixed later), and runs in one pass:

$ cat ananother.awk
++seen[p[4]]==1 && NR>1 && p[4]!=$4 {  # seen count must be 1 and
    print prev                         # this and previous $4 must differ
    delete seen                        # is this enough really?
}
{ 
    q=p[4]                             # previous previous $4 for END
    prev=$0                            # previous is stored for printing
    split($0,p)                        # to get previous $4
} 
END {                                  # last record control
    if(++seen[$4]==1 && q!=$4) 
        print $0
}

Run:

$ sort -k4,4 file | awk -f ananother.awk

A simpler pipeline, with two caveats: cut outputs fields in their original order regardless of the order given to -f, and sort -u keeps one copy of each duplicated line rather than removing every copy, so it does not match the question exactly (see the sketch below):

cut -d, -f1,3,5,6 file.csv | sort -u > uniq.txt
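
If the goal is instead to drop every copy of a duplicated position and reproduce the question's output format, a sketch combining the column selection with the two-pass awk from an earlier answer (file.txt standing in for the whitespace-separated input):

awk 'NR==FNR {c[$4]++; next} c[$4]==1 {print $2, $5, $6, $4}' file.txt file.txt > uniq.txt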
