I have a dataset like this:
6 AA_A_56_30018678_E 0 30018678 P A
6 SNP_A_30018678 0 30018678 A G
6 SNP_A_30018679 0 30018679 T G
6 SNP_A_30018682 0 30018682 T G
6 SNP_A_30018695 0 30018695 G C
6 AA_A_62_30018696_Q 0 30018696 P A
6 AA_A_62_30018696_G 0 30018696 P A
6 AA_A_62_30018696_R 0 30018696 P A
I want to remove every row whose value in column 4 appears more than once, so that only rows with a unique column 4 remain.
I used the commands below (sort, awk, uniq and join) to get the required output. Is there a better way to do this?
sort -k4,4 example.txt | awk '{print $4}' | uniq -u > snp_sort.txt
join -1 1 -2 4 snp_sort.txt example.txt | awk '{print $3,$5,$6,$1}' > uniq.txt
Here is the output (uniq.txt):
SNP_A_30018679 T G 30018679
SNP_A_30018682 T G 30018682
SNP_A_30018695 G C 30018695
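
For comparison, the same filtering can probably be done with a single awk command that reads the file twice. This is only a sketch, assuming whitespace-separated columns and that the per-position counts fit in memory. The heredoc simply recreates the sample rows shown above so the snippet runs on its own.

```shell
# Recreate the sample input from the question (same eight rows).
cat > example.txt <<'EOF'
6 AA_A_56_30018678_E 0 30018678 P A
6 SNP_A_30018678 0 30018678 A G
6 SNP_A_30018679 0 30018679 T G
6 SNP_A_30018682 0 30018682 T G
6 SNP_A_30018695 0 30018695 G C
6 AA_A_62_30018696_Q 0 30018696 P A
6 AA_A_62_30018696_G 0 30018696 P A
6 AA_A_62_30018696_R 0 30018696 P A
EOF

# First pass (NR==FNR is true only while reading the first copy of the file):
#   count how many times each column-4 value occurs.
# Second pass: print the wanted fields only for rows whose column-4 value
#   was seen exactly once.
awk 'NR==FNR { count[$4]++; next }
     count[$4]==1 { print $2, $5, $6, $4 }' example.txt example.txt > uniq.txt

cat uniq.txt
```

Unlike the sort/join pipeline, this version does not need the input to be sorted on column 4, and it keeps the surviving rows in their original order.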