awk: sort by first column then second; output unique 1st column once but all 2nd column

Question

I have a two-column file that you can create as follows

cat > twocol << EOF
007 03
001 03
003 01
137 12
001 11
002 01
002 02
002 03
001 02
002 04
137 94
010 21
001 01
EOF

The resultant file, twocol, has only the rows of digits.

Desired Result

I want to perform some kind of command on twocol and get the following result. (I think seeing it is much better than trying to restate my somewhat-confusing question title - "sort by first column then second; output unique 1st column once but all 2nd column".)

That's different from what a simple sort will give me, i.e. different from

My Work

The ~~only solution I've come~~ first solution I came up with (before I got a decent awk script going) - which matches the Desired Result above in bold, uses several instances of awk, a bunch of bash, and some help from 1.

col_1_max_len=$(awk '
  BEGIN{maxl=0;}
  {curr=length($1);max1=max1>curr?max1:curr;}
  END{print max1}' \
 twocol);
len1=$col_1_max_len;
len2=$(awk '
  BEGIN{max2=0;}
  {curr=length($2);max2=max2>curr?max2:curr;}
  END{print max2}' \
 twocol);
current_col_1_val="nothing";

while read -r line; do {
  current_row="${line}";
  col_1_val=$(awk '{print $1}' <<< "${current_row}");
  col_2_val=$(awk '{print $2}' <<< "${current_row}");
  if [ ! "${col_1_val}" == "${current_col_1_val}" ]; then
    printf "%0"$len1"d %0"$len2"d\n"  "${col_1_val}"  "${col_2_val}";
  else
    printf "%"$len1"s %0"$len2"d\n"  " "  "${col_2_val}";
  fi;
}; done < <(sort twocol)

I feel like I should be able to use one pass with awk, something like the answers that follow: 2 , 3 , 4 , 5 , ...

I can't seem to get it hammered together without what feel like extra, clunky, memory-eating arrays. The format is also giving me a problem - the numbers in the first and second columns can go to more digits, and it would be preferable for things to look nice.

Can anyone show me how to get this result with some nice awk code - preferably that can be used pretty-easily in the terminal? Perl answers are welcome, too.

Oh, my system

$ uname -a && bash --version | head -1 && awk --version | head -1
CYGWIN_NT-10.0 MY-MACHINE 3.2.0(0.340/5/3) 2021-03-29 08:42 x86_64 Cygwin
GNU bash, version 4.4.12(3)-release (x86_64-unknown-cygwin)
GNU Awk 5.1.0, API: 3.0 (GNU MPFR 4.2.0-p9, GNU MP 6.2.1)

(I get exactly the same behavior on my Fedora and Ubuntu machines.)

Edit

I came up with an awk solution. It looks all nice and short, but I still feel there are problems.

awk '{if (!vals[$1]++) print($0); else print("   ",$2);}' <(sort twocol)

I think I'm using a bunch of memory with the vals array - as of now, my file only has ~10k lines, but I hope to scale it up. I hard-coded in the format, but I don't like it because I could have strings of varying lengths.

I can fix that (the formatting) if I make three passes with awk and pass in variables.

length1=$(awk '
  BEGIN{maxl=0;}
  {curr=length($1);max1=max1>curr?max1:curr;}
  END{print max1}' \
 twocol);

length2=$(awk '
  BEGIN{max2=0;}
  {curr=length($2);max2=max2>curr?max2:curr;}
  END{print max2}' \
 twocol);

awk -vlen1=$length1 -vlen2=$length2 '
{
  if (!vals[$1]++) 
    printf("%0*d %0*d\n",len1,$1,len2,$2); 
  else 
    printf("%*s %0*d\n",len1," ",len2,$2);
}' <(sort twocol)

Result matches the Desired Result exactly (see the part in bold, above), but I hope there's a way to do it all with one pass of awk.

Can anyone share something that matches the characteristics I've mentioned? Any comments about the time performance and/or the memory performance of the different methods would also be appreciated.

I think it might also be possible to do the sorting in awk; I'd like to know, especially if it could be more efficient. Edit: It can be done, as @steeldriver and @markp-fuso show below.

I just re-read the full question and your awk '{if (!vals[$1]++) print($0); else print(" ",$2);}' <(sort twocol) works quite well :-); if you actually find yourself with memory issues you can easily replace the array reference with a 'previous' variable (eg, my 2nd awk script) — markp-fuso
– markp-fuso, Commented Feb 15, 2024 at 0:57
any attempt to sort the data within awk (eg, my 1st awk script, steeldriver's awk script) are going to require storing the file in memory; you can get away from the memory-usage question by using sort to feed a sorted stream to awk, and depending on your sort version there may be some options (memory size, # of cpus) to improve on sort's performance — markp-fuso
– markp-fuso, Commented Feb 15, 2024 at 1:02
Thanks for teaching me about memory stuff, @markp-fuso, as well as for giving the great answer. — bballdave025
– bballdave025, Commented Feb 15, 2024 at 2:39

steeldriver · Accepted Answer · 2024-02-15 12:01:28Z

10

Original awk solution removed - a better solution has been posted

Pragmatically, you could pre-sort the input and just use (any) awk to format it:

sort twocol | awk 'BEGIN{OFS="\t"} {print $1 == last ? "" : $1, $2; last = $1}'
001     01
        02
        03
        11
002     01
        02
        03
        04
003     01
007     03
010     21
137     12
        94

This produces tab separated output - if you want spaces, pipe the result through expand.

Alternatively, you could use perl hash of anonymous arrays to aggregate the second column values, then sort and print:

perl -alne '
  push @{ $h{$F[0]} }, $F[1] 
  }{ 
  foreach $k (sort {$a <=> $b} keys %h) {
    @a = sort {$a <=> $b} @{ $h{$k} };
    print join "\n", map { ($_ == 0 ? $k : "") . "\t" . $a[$_] } 0..$#a;
  }
  ' twocol
001     01
        02
        03
        11
002     01
        02
        03
        04
003     01
007     03
010     21
137     12
        94

The {$a <=> $b} are perhaps not necessary since your zero padded data sort the same lexicographically as numerically.

Just for fun, with Miller:

mlr -S --nidx --ofs tab put -q '
  @m[$1] = is_not_array(@m[$1]) ? [$2] : append(@m[$1],$2); 
  end { 
    @m = sort(apply(@m, func(k,v) { return {k: joinv(sort(v), "\n\t")}; }));
    emit @m, ""
  }
  ' twocol 
001     01
        02
        03
        11
002     01
        02
        03
        04
003     01
007     03
010     21
137     12
        94

edited Feb 15, 2024 at 12:01

answered Feb 14, 2024 at 22:09

steeldriver

83.9k12 gold badges124 silver badges175 bronze badges

I like how you do the sort, and how you just use tab as the OFS. I've got some new things about awk (like PROCINFO and the @ stuff) to learn. Oh, and this has been tested, and it works!

bballdave025
– bballdave025

2024-02-14 22:19:50 +00:00
Commented Feb 14, 2024 at 22:19
2

@bballdave025 someone else may post a better solution - meanwhile I've added a perl version that may be preferable

steeldriver
– steeldriver

2024-02-14 22:43:09 +00:00
Commented Feb 14, 2024 at 22:43
1

@bballdave025 and indeed someone has - assigning the second field to the inner array's index instead of its value so both may be sorted the same way

steeldriver
– steeldriver

2024-02-15 00:08:22 +00:00
Commented Feb 15, 2024 at 0:08
fwiw, you can nest PROCINFO["sorted_in"] options; in this case you could replace n = asort(a[k], b, "@val_num_asc") with PROCINFO["sorted_in"] = "@val_num_asc"; the follow-on for/print loop would then be replaced with something like pfx=k; for (j in a[k]) {print pfx,a[k][j]; pfx="" }

markp-fuso
– markp-fuso

2024-02-15 03:12:39 +00:00
Commented Feb 15, 2024 at 3:12
1

@markp-fuso I removed my awk version - you may wish to edit yours to remove the reference

steeldriver
– steeldriver

2024-02-15 11:17:23 +00:00
Commented Feb 15, 2024 at 11:17

Add a comment |

terdon · Accepted Answer · 2024-02-15 19:24:59Z

7

One awk idea:

awk '
BEGIN { OFS="\t"  }
      { a[$1][$2] }                               # we can sort on both indices to obtain the desired ordering
END   { PROCINFO["sorted_in"] = "@ind_num_asc"    # applies to all follow-on array references (ie, both indices of the a[] array)
        for (i in a) {
            firstcol = i
            for (j in a[i]) {
                print firstcol, j
                firstcol = ""
            }
         }
      }
' twocol

NOTE: this requires GNU awk 4.0+ for PROCINFO["sorted_in"] support

This generates:

If PROCINFO["sorted_in"] is not available we can use sort to feed a simplified awk script:

awk '
BEGIN { OFS="\t" }
      { if ($1 != prev1) {
           print $1,$2
           prev1 = $1
        }
        else
           print "",$2
      }
' < <(sort twocol)

This also generates:

edited Feb 15, 2024 at 19:24

terdon♦

253k69 gold badges481 silver badges719 bronze badges

answered Feb 14, 2024 at 23:30

markp-fuso

1,7101 gold badge9 silver badges11 bronze badges

1

Ah yes use index for both fields - much nicer!

steeldriver
– steeldriver

2024-02-15 00:06:39 +00:00
Commented Feb 15, 2024 at 0:06
1

Apologies for the (now reverted) edit. You had already pointed out your first version was gawk-specific, and there is nothing gawk-specific in the second one. My bad!

terdon
– terdon ♦

2024-02-15 19:25:41 +00:00
Commented Feb 15, 2024 at 19:25
In the second script, if you make it { if ($1 == prev1) { key = "" } else { key = $1; prev1 = $1 } print key, $2 } then a) you remove the negative logic (!=) which turns the else into a double negative (it is NOT true that $1 is NOT equal to prev1), and b) you can avoid having 2 separate print statements so if you need to change the output in future you don't need to do so in 2 places. So, you avoid negative logic and avoid duplicate code, both of which are generally bad in software.

Ed Morton
– Ed Morton

2024-02-17 13:06:16 +00:00
Commented Feb 17, 2024 at 13:06

Add a comment |

jubilatious1 · Accepted Answer · 2024-02-16 00:56:06Z

Using Raku (formerly known as Perl_6)

~$ raku -ne 'BEGIN my %h;  %h.append: .split(/ \s+ /);  END put .key => .value.sort.join("\n\t") for %h.sort;'  file

#OR

~$ raku -ne 'BEGIN my %h;  %h.append: .words;  END put .key => .value.sort.join("\n\t") for %h.sort;'  file

Here's an answer written in Raku, a member of the Perl-family of programming languages. Briefly, the code above is somewhat awk-like and uses Raku's (like Perl's) -ne non-autoprinting command-line flags.

A hash %h is declared in a BEGIN block.
The line is .split on one-or-more \s space characters. Alternatively (second answer), Raku's .words routine is used to split on whitespace. For both answers, resultant (two) elements are understood as a key-value pair, which is appended to the hash.
In the END block, the %h hash (sorted on keys) is output individually, with each .key followed by each .value that have been sort.join("\n\t"). The \t moves values that wrap to the next line into the second column.

Sample Input:

Sample Output:

Note, sometimes it's instructive to look at Raku's defaults, so here's the answer without "column-izing" the output as above (i.e. simpler code below):

~$ raku -ne 'BEGIN my %h; %h.append: .words; END say .key => .value.sort for %h.sort;'  file
001 => (01 02 03 11)
002 => (01 02 03 04)
003 => (01)
007 => (03)
010 => (21)
137 => (12 94)

https://docs.raku.org/type/Hash
https://raku.org

I really like this Perl(-like?) solution. Thanks.

bballdave025
– bballdave025

2024-02-16 11:39:26 +00:00
Commented Feb 16, 2024 at 11:39 — bballdave025
– bballdave025, Commented Feb 16, 2024 at 11:39

Stack Exchange Network

awk: sort by first column then second; output unique 1st column once but all 2nd column

3 Answers 3

You must log in to answer this question.

Linked

Hot Network Questions

awk: sort by first column then second; output unique 1st column once but all 2nd column

3 Answers 3

You must log in to answer this question.

Linked

Related

Hot Network Questions