
I have the following file:

2
1
4
3
2
1

I want the output like this (only the lines that have no duplicates, keeping their original order):

4
3

I tried sort file.txt | uniq -u. It works, but the output is sorted:

3
4

I tried awk '!x[$0]++' file.txt. It keeps order, but it prints every value once instead of dropping the duplicated ones:

2
1
4
3
  • The problem is that uniq expects its input to be sorted before you pipe it in: lines with the same content must be consecutive, i.e. next to each other (illustrated below). You can remove the lines that have duplicates while keeping the original order using PowerShell like this:

    Get-Content .\data.txt | Group-Object | Where-Object { $_.Count -le 1 } | Select -ExpandProperty Name
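To see why the sorting matters, run uniq -u on the unsorted file: no two identical lines are adjacent, so every line survives:

uniq -u file.txt
2
1
4
3
2
1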

6 Answers


A couple of ideas to choose from:

a) read the input file twice:

awk '
FNR==NR         { counts[$0]++; next }  # 1st pass: keep count
counts[$0] == 1                         # 2nd pass: print rows with count == 1
' file.txt file.txt

b) read the input file once:

awk '
    { lines[NR] = $0                    # maintain ordering of rows
      counts[$0]++
    }
END { for ( i=1;i<=NR;i++ )             # run thru the indices of the lines[] array and ...
          if ( counts[lines[i]] == 1 )  # if the associated count == 1 then ...
             print lines[i]             # print the array entry to stdout
    }
' file.txt

Both of these generate:

4
3

2 Comments

ehhhh why not just set the value of the de-duped array index to be a chain of row numbers that match it. then in the END { } zone, pre-clear $0, set OFS = "\n", ignore all array entries with more than 1 NR in it, set $NR = dedupe_arr[ x ], squeeze spare OFS seps, then print it in one shot ?
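(For reference, a minimal sketch of what that comment describes; the array name chain[] is made up here, and any POSIX awk should do, since assigning to a field past NF rebuilds $0 with OFS between the fields:)

awk '
{ chain[$0] = chain[$0] (chain[$0] ? "," : "") NR }  # append this row number to the line'\''s chain
END {
    $0 = ""; OFS = "\n"                  # pre-clear the record; separate fields with newlines
    for (x in chain)
        if (chain[x] !~ /,/)             # ignore entries that collected more than one NR
            $(chain[x]) = x              # drop the line back into its original row slot
    gsub(/\n+/, "\n"); sub(/^\n+/, "")   # squeeze the spare OFS separators
    print                                # print it in one shot
}' file.txt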
You'll probably want to use $0 instead of $1 since the OP is asking about "unique lines"

I tried sort file.txt | uniq -u. It works, but the output is sorted

You could take that output and use it as a list of newline-delimited patterns with grep -f on the original file. Use -Fx to treat the patterns as whole-line fixed strings (not regular expressions).

sort file.txt | uniq -u | grep -Fxf- file.txt
4
3
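If your shell has process substitution (bash, zsh), the same pipeline can be written without feeding the patterns through grep's stdin:

grep -Fxf <(sort file.txt | uniq -u) file.txt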



Here is a Ruby way to do that:

ruby -lne 'BEGIN{cnt=Hash.new {|h,k| h[k] = 0} } 
cnt[$_]+=1
END{puts cnt.select{|k,v| v==1}.keys.join("\n") }
' file 

Prints:

4
3

Or, in one read of the file:

ruby -e 'puts $<.read.split(/\R+/).
            group_by{|x| x}.select{|k,v| v.length==1}.keys.join("\n")
' file 
# same output

Unlike awk arrays, Ruby hashes maintain insertion order.

If you want a one pass awk you could do:

awk 'BEGIN{OFS="\t"}
{ if (seen[$0]++) delete order[$0]; else order[$0]=FNR } 
END { for ( e in order ) print order[e], e } ' file | sort -nk 1,1 | cut -f2-
# same output

(Thanks Ed Morton for a better awk!)



Using only Bash built-ins, you can do this in just a few lines:

declare -A SEEN=()
while IFS= read -r LINE; do
    (( ++SEEN[_$LINE] ))
done < file.txt
while IFS= read -r LINE; do
    if [[ ${SEEN[_$LINE]} -eq 1 ]]; then
        printf -- '%s\n' "$LINE"
    fi
done < file.txt

Note: The _$LINE as the subscript is to handle empty lines correctly.

2 Comments

Get out of the habit of using ALLCAPS variable names; leave those as reserved by the shell. One day you'll write PATH=something and then wonder why your script is broken.
Idiomatically, when an array is named seen[] it's used for just one very specific purpose: testing whether its index has been seen before, e.g. if ! (( seen[$line]++ )); then echo "first time for $line"; fi. You're not using your array like that; you're using it to keep a count of how many times its index has been seen and then later testing whether that was 1 time or many. An array used to keep and test a count should be named count[] or similar for clarity.

Using Raku (formerly known as Perl_6)

~$ raku -ne 'BEGIN my %seen; 
             %seen.push: $_ => ++$; 
             END .key.put if .value.elems == 1 for %seen.sort: *.value.head;'  file

Raku is a programming language in the Perl-family that features high-level Unicode support. Above, Raku's -ne non-autoprinting linewise flags are used to act on an input file placed on the command line:

  • BEGIN by declaring a %-sigiled hash (%seen),

  • In the main body of the loop, each line is taken in as a key, with ++$ (an anonymous incrementing variable) as the value. Raku's => fat-arrow is the pair constructor.

  • At the END, iterate through the sorted hash and print (put) the key only if .value.elems == 1. Sorting is done on *.value.head, i.e. the lowest line number recorded per key.

The Raku code above works fine on the OP's numeric input. Since keys and values (line numbers) can get confusing when both are numeric, below is example input/output with alphabetic input:

Sample Input:

B
A
D
C
B
A

Sample Output (1):

D
C

If you really want to see behind the scenes, here's the END statement without filtering (i.e. END .say for %seen.sort: *.value.[0];). Note how the .value output order (first element) goes 1,2,3,4 :

Sample Output (2):

B => [1 5]
A => [2 6]
D => 3
C => 4

https://docs.raku.org

https://raku.org



Here's an approach using only awk, which reads the input only once, and yet doesn't store the entire file in memory:

  • fo stores a line's first occurrence into an array. If the line isn't registered yet (!fo[$0]), save the current line number (fo[$0]=NR).
  • fq counts the frequency of a line, which is incremented for every line read (fq[$0]++).
  • Also, the not-yet-incremented value of fq[$0] is used as the condition: it is 0 (false) on a line's first occurrence, and non-zero (true) on every repeat, in which case the record of the first occurrence is abandoned (delete fo[$0]).
  • Eventually, fo contains only the items of relevance (lines occurring no more than once), with the lines' contents as indices and the line numbers of their first occurrences as values. So, to finish, only the array's indices need to be printed, in ascending order of their numeric values. One way to achieve this is asorti (available in GNU Awk 4+) with the predefined sort specification @val_num_asc, which sorts numerically by value in ascending order.
awk '
  !fo[$0]  {fo[$0]=NR}
  fq[$0]++ {delete fo[$0]}
  END      {asorti(fo,fo,"@val_num_asc"); for (i in fo) print fo[i]}
' file.txt
4
3

6 Comments

just set PROCINFO["sorted_in"] = "@val_num_asc" then you completely skip asorti( ) by printing out i instead of fo[i] using the same for( ) loop
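(A sketch of that variant, assuming GNU awk, where PROCINFO["sorted_in"] controls the traversal order of for (i in fo):)

awk '
  !fo[$0]  {fo[$0]=NR}
  fq[$0]++ {delete fo[$0]}
  END      {PROCINFO["sorted_in"] = "@val_num_asc"  # walk fo[] by value (line number), ascending
            for (i in fo) print i}                  # i is the line itself; no asorti needed
' file.txt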
@RARE Thank you for the suggestion. Setting PROCINFO["sorted_in"] = "@val_num_asc" is a valid alternative, and I had even considered including it, but in the end it has no compatibility benefits (e.g. regarding other implementations of awk) and is even longer than just asorti(fo,fo,"@val_num_asc").
While asorti() will sort the array fo[], for (i in fo) will then visit its indices in an undefined order, so the output may not be in the order you expect. You need for (i=1; i in fo; i++), or to save the return value, n = asorti(...), and then use for (i=1; i<=n; i++) to guarantee the order of i. I personally haven't used asort() or asorti() since PROCINFO["sorted_in"] was introduced, as I find the latter far clearer and easier to understand, with less potential for mistakes.
@EdMorton : PROCINFO["sorted_in"] = "@ind_num_asc" or : PROCINFO["sorted_in"] = "@ind_str_desc" etc.. these can directly visit indices in the desired order without having to asorti()
@EdMorton : for (i=1; i in fo; i++) is a slightly problematic construct - the for( ) loop exits early if there are gaps in the numeric indices (e.g. some cells previously deleted). I know it's dang annoying to have to do another if ( ) statement to check for that condition - I ran into that same problem before.
