
I have the following file:

2
1
4
3
2
1

I want the output like this (only the lines that have no duplicates, keeping their original order):

4
3

I tried sort file.txt | uniq -u. It works, but the output is sorted:

3
4

I tried awk '!x[$0]++' file.txt. It keeps order, but it prints every value once instead of dropping the duplicated ones:

2
1
4
3
  • The problem is that uniq expects its input to be sorted before you pipe it in: lines with the same content must be consecutive, i.e. next to each other (illustrated below). You can remove the lines that have duplicates while keeping the original order using PowerShell like this:

    Get-Content .\data.txt | Group-Object | Where-Object { $_.Count -le 1 } | Select -ExpandProperty Name
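To see why the sorting matters, run uniq -u on the unsorted file: no two identical lines are adjacent, so every line survives:

uniq -u file.txt
2
1
4
3
2
1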

6 Answers


A couple of ideas to choose from:

a) read the input file twice:

awk '
FNR==NR         { counts[$0]++; next }  # 1st pass: keep count
counts[$0] == 1                         # 2nd pass: print rows with count == 1
' file.txt file.txt

b) read the input file once:

awk '
    { lines[NR] = $0                    # maintain ordering of rows
      counts[$0]++
    }
END { for ( i=1;i<=NR;i++ )             # run thru the indices of the lines[] array and ...
          if ( counts[lines[i]] == 1 )  # if the associated count == 1 then ...
             print lines[i]             # print the array entry to stdout
    }
' file.txt

Both of these generate:

4
3

2 Comments

ehhhh why not just set the value of the de-duped array index to be a chain of row numbers that match it. then in the END { } zone, pre-clear $0, set OFS = "\n", ignore all array entries with more than 1 NR in it, set $NR = dedupe_arr[ x ], squeeze spare OFS seps, then print it in one shot ?
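(For reference, a minimal sketch of what that comment describes; the array name chain[] is made up here, and any POSIX awk should do, since assigning to a field past NF rebuilds $0 with OFS between the fields:)

awk '
{ chain[$0] = chain[$0] (chain[$0] ? "," : "") NR }  # append this row number to the line'\''s chain
END {
    $0 = ""; OFS = "\n"                  # pre-clear the record; separate fields with newlines
    for (x in chain)
        if (chain[x] !~ /,/)             # ignore entries that collected more than one NR
            $(chain[x]) = x              # drop the line back into its original row slot
    gsub(/\n+/, "\n"); sub(/^\n+/, "")   # squeeze the spare OFS separators
    print                                # print it in one shot
}' file.txt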
You'll probably want to use $0 instead of $1 since the OP is asking about "unique lines"

I tried sort file.txt | uniq -u. It works, but the output is sorted

You could take that output and use it as a list of newline-delimited patterns with grep -f on the original file. Use -Fx to treat the patterns as whole-line fixed strings (not regular expressions).

sort file.txt | uniq -u | grep -Fxf- file.txt
4
3
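If your shell has process substitution (bash, zsh), the same pipeline can be written without feeding the patterns through grep's stdin:

grep -Fxf <(sort file.txt | uniq -u) file.txt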



Here is a Ruby way to do that:

ruby -lne 'BEGIN{cnt=Hash.new {|h,k| h[k] = 0} } 
cnt[$_]+=1
END{puts cnt.select{|k,v| v==1}.keys.join("\n") }
' file 

Prints:

4
3

Or, in one read of the file:

ruby -e 'puts $<.read.split(/\R+/).
            group_by{|x| x}.select{|k,v| v.length==1}.keys.join("\n")
' file 
# same output

Unlike awk arrays, Ruby hashes maintain insertion order.

If you want a one pass awk you could do:

awk 'BEGIN{OFS="\t"}
{ if (seen[$0]++) delete order[$0]; else order[$0]=FNR } 
END { for ( e in order ) print order[e], e } ' file | sort -nk 1,1 | cut -f2-
# same output

(Thanks Ed Morton for a better awk!)



Using only Bash built-ins, you can do this in just a few lines:

declare -A SEEN=()
while IFS= read -r LINE; do
    (( ++SEEN[_$LINE] ))
done < file.txt
while IFS= read -r LINE; do
    if [[ ${SEEN[_$LINE]} -eq 1 ]]; then
        printf -- '%s\n' "$LINE"
    fi
done < file.txt

Note: The _$LINE as the subscript is to handle empty lines correctly.

2 Comments

Get out of the habit of using ALLCAPS variable names; leave those as reserved by the shell. One day you'll write PATH=something and then wonder why your script is broken.
Idiomatically, when an array is named seen[] it's used for just one very specific purpose: testing whether its index has been seen before, e.g. if ! (( seen[$line]++ )); then echo "first time for $line"; fi. You're not using your array like that; you're using it to keep a count of how many times its index has been seen and then later testing whether that was 1 time or many. An array used to keep and test a count should be named count[] or similar for clarity.

Using Raku (formerly known as Perl_6)

~$ raku -ne 'BEGIN my %seen; 
             %seen.push: $_ => ++$; 
             END .key.put if .value.elems == 1 for %seen.sort: *.value.head;'  file

Raku is a programming language in the Perl-family that features high-level Unicode support. Above, Raku's -ne non-autoprinting linewise flags are used to act on an input file placed on the command line:

  • BEGIN by declaring a %-sigiled hash (%seen),

  • In the main body of the loop, each line is taken in as a key, with ++$ (an anonymous incrementing variable) as the value. Raku's => fat-arrow is the pair constructor.

  • At the END, iterate through the sorted hash and print (put) the key only if .value.elems == 1. Sorting is done on *.value.head, i.e. the lowest line number recorded per key.

The Raku code above works fine on the OP's numeric input. Since keys and values (line numbers) can get confusing when both are numeric, below is example input/output with alphabetic input:

Sample Input:

B
A
D
C
B
A

Sample Output (1):

D
C

If you really want to see behind the scenes, here's the END statement without filtering (i.e. END .say for %seen.sort: *.value.[0];). Note how the .value output order (first element) goes 1,2,3,4 :

Sample Output (2):

B => [1 5]
A => [2 6]
D => 3
C => 4

https://docs.raku.org

https://raku.org



Here's an approach using only awk, which reads the input only once, and yet doesn't store the entire file in memory:

  • fo stores a line's first occurrence into an array. If the line isn't registered yet (!fo[$0]), save the current line number (fo[$0]=NR).
  • fq counts the frequency of a line, which is incremented for every line read (fq[$0]++).
  • Also, the not-yet-incremented value of fq[$0] is used as the condition: it is 0 (false) on a line's first occurrence, and non-zero (true) on every repeat, in which case the record of the first occurrence is abandoned (delete fo[$0]).
  • Eventually, fo contains only the items of relevance (lines occurring no more than once), with the lines' contents as indices and the line numbers of their first occurrences as values. So, to finish, only the array's indices need to be printed, in ascending order of their numeric values. One way to achieve this is asorti (available in GNU Awk 4+) with the predefined sort specification @val_num_asc, which sorts numerically by value in ascending order.
awk '
  !fo[$0]  {fo[$0]=NR}
  fq[$0]++ {delete fo[$0]}
  END      {asorti(fo,fo,"@val_num_asc"); for (i in fo) print fo[i]}
' file.txt
4
3

6 Comments

just set PROCINFO["sorted_in"] = "@val_num_asc" then you completely skip asorti( ) by printing out i instead of fo[i] using the same for( ) loop
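(A sketch of that variant, assuming GNU awk, where PROCINFO["sorted_in"] controls the traversal order of for (i in fo):)

awk '
  !fo[$0]  {fo[$0]=NR}
  fq[$0]++ {delete fo[$0]}
  END      {PROCINFO["sorted_in"] = "@val_num_asc"  # walk fo[] by value (line number), ascending
            for (i in fo) print i}                  # i is the line itself; no asorti needed
' file.txt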
@RARE Thank you for the suggestion. Setting PROCINFO["sorted_in"] = "@val_num_asc" is a valid alternative, and I had even considered including it, but in the end it has no compatibility benefits (e.g. regarding other implementations of awk) and is even longer than just asorti(fo,fo,"@val_num_asc").
While asorti() will sort the array fo[], for (i in fo) will then visit its indices in an undefined order, so the output may not be in the order you expect. You need for (i=1; i in fo; i++), or to save the return value, n = asorti(...), and then use for (i=1; i<=n; i++) to guarantee the order of i. I personally haven't used asort() or asorti() since PROCINFO["sorted_in"] was introduced, as I find the latter far clearer and easier to understand, with less potential for mistakes.
@EdMorton : PROCINFO["sorted_in"] = "@ind_num_asc" or : PROCINFO["sorted_in"] = "@ind_str_desc" etc.. these can directly visit indices in the desired order without having to asorti()
@EdMorton : for (i=1; i in fo; i++) is a slightly problematic construct - the for( ) loop exits early if there are gaps in the numeric indices (e.g. some cells previously deleted). I know it's dang annoying to have to do another if ( ) statement to check for that condition - I ran into that same problem before.
