I have a two-column file that you can create as follows:
cat > twocol << EOF
007 03
001 03
003 01
137 12
001 11
002 01
002 02
002 03
001 02
002 04
137 94
010 21
001 01
EOF
The resultant file, twocol, has only the rows of digits.
Desired Result
I want to perform some kind of command on twocol and get the following result. (I think seeing it is much better than trying to restate my somewhat-confusing question title - "sort by first column then second; output unique 1st column once but all 2nd column".)
001 01
    02
    03
    11
002 01
    02
    03
    04
003 01
007 03
010 21
137 12
    94
That's different from what a simple sort will give me, i.e. different from
001 01
001 02
001 03
001 11
002 01
002 02
002 03
002 04
003 01
007 03
010 21
137 12
137 94
My Work
The first solution I came up with (before I got a decent awk script going), which matches the Desired Result above in bold, uses several instances of awk, a bunch of bash, and some help from 1.
col_1_max_len=$(awk '
BEGIN{max1=0;}
{curr=length($1);max1=max1>curr?max1:curr;}
END{print max1}' \
twocol);
len1=$col_1_max_len;
len2=$(awk '
BEGIN{max2=0;}
{curr=length($2);max2=max2>curr?max2:curr;}
END{print max2}' \
twocol);
current_col_1_val="nothing";
while read -r line; do {
current_row="${line}";
col_1_val=$(awk '{print $1}' <<< "${current_row}");
col_2_val=$(awk '{print $2}' <<< "${current_row}");
if [ ! "${col_1_val}" == "${current_col_1_val}" ]; then
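# 10# forces base 10; bash printf would otherwise read 010 as octal (and reject 08 or 09).
printf "%0${len1}d %0${len2}d\n" "$((10#${col_1_val}))" "$((10#${col_2_val}))";
else
printf "%${len1}s %0${len2}d\n" " " "$((10#${col_2_val}))";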
fi;
}; done < <(sort twocol)
I feel like I should be able to use one pass with awk, something like the answers that follow: 2, 3, 4, 5, ...
I can't seem to get it hammered together without what feel like extra, clunky, memory-eating arrays. The format is also giving me a problem - the numbers in the first and second columns can go to more digits, and it would be preferable for things to look nice.
Can anyone show me how to get this result with some nice awk code, preferably something that can be used fairly easily in the terminal? Perl answers are welcome, too.
Oh, my system
$ uname -a && bash --version | head -1 && awk --version | head -1
CYGWIN_NT-10.0 MY-MACHINE 3.2.0(0.340/5/3) 2021-03-29 08:42 x86_64 Cygwin
GNU bash, version 4.4.12(3)-release (x86_64-unknown-cygwin)
GNU Awk 5.1.0, API: 3.0 (GNU MPFR 4.2.0-p9, GNU MP 6.2.1)
(I get exactly the same behavior on my Fedora and Ubuntu machines.)
Edit
I came up with an awk solution. It looks all nice and short, but I still feel there are problems.
awk '{if (!vals[$1]++) print($0); else print(" ",$2);}' <(sort twocol)
I think I'm using a bunch of memory with the vals array - as of now, my file only has ~10k lines, but I hope to scale it up. I hard-coded in the format, but I don't like it because I could have strings of varying lengths.
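On the memory point: because the input reaches awk already sorted, equal keys are adjacent, so it should be enough to remember only the previous key rather than every key seen. The following is a minimal sketch under that assumption, not a tuned solution. The names `prev`, `pad`, `tmp`, and `result` are illustrative and do not come from the original scripts, and the sample data is written to a temporary file instead of `twocol`.

```shell
# Recreate the sample data in a temporary file (hypothetical stand-in for twocol).
tmp=$(mktemp)
cat > "$tmp" << 'EOF'
007 03
001 03
003 01
137 12
001 11
002 01
002 02
002 03
001 02
002 04
137 94
010 21
001 01
EOF

# sort streams the rows in order; awk keeps only the previous key in memory.
result=$(sort "$tmp" | awk '
    NR == 1 || ($1 "") != prev {      # first row of a new group
        print $1, $2
        prev = $1 ""                  # store the key as a string
        pad = sprintf("%" length(prev) "s", "")
        next
    }
    { print pad, $2 }                 # repeated key: blank out column 1
')
printf '%s\n' "$result"
rm -f "$tmp"
```

The padding here is derived from the length of the current key, so it only lines up when all keys have the same width, as they do in the sample.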
I can fix that (the formatting) if I make three passes with awk and pass in variables.
length1=$(awk '
BEGIN{max1=0;}
{curr=length($1);max1=max1>curr?max1:curr;}
END{print max1}' \
twocol);
length2=$(awk '
BEGIN{max2=0;}
{curr=length($2);max2=max2>curr?max2:curr;}
END{print max2}' \
twocol);
awk -vlen1=$length1 -vlen2=$length2 '
{
if (!vals[$1]++)
printf("%0*d %0*d\n",len1,$1,len2,$2);
else
printf("%*s %0*d\n",len1," ",len2,$2);
}' <(sort twocol)
Result matches the Desired Result exactly (see the part in bold, above), but I hope there's a way to do it all with one pass of awk.
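One way to fold the three awk calls into a single awk process is to hand awk the sorted data twice: the first read measures the column widths and the second read prints. This is a sketch under stated assumptions, not a true single pass, since the data is still read twice. The names `w1`, `w2`, `head`, `cont`, `prev`, `tmp`, and `sorted` are hypothetical, and the sorted rows go to a temporary file so they can be read twice.

```shell
# Recreate the sample data and sort it numerically on both columns.
tmp=$(mktemp)
sorted=$(mktemp)
cat > "$tmp" << 'EOF'
007 03
001 03
003 01
137 12
001 11
002 01
002 02
002 03
001 02
002 04
137 94
010 21
001 01
EOF
sort -k1,1n -k2,2n "$tmp" > "$sorted"

result=$(awk '
    NR == FNR {                       # first read: widest value in each column
        if (length($1) > w1) w1 = length($1)
        if (length($2) > w2) w2 = length($2)
        next
    }
    FNR == 1 {                        # second read starts: build the formats once
        head = "%0" w1 "d %0" w2 "d\n"
        cont = "%" w1 "s %0" w2 "d\n"
    }
    FNR == 1 || ($1 "") != prev {     # first row of a new group
        printf head, $1, $2
        prev = $1 ""
        next
    }
    { printf cont, "", $2 }           # repeated key: blank out column 1
' "$sorted" "$sorted")
printf '%s\n' "$result"
rm -f "$tmp" "$sorted"
```

Building the format strings by concatenation avoids relying on the `*` width modifier, which not every awk implementation supports.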
Can anyone share something that matches the characteristics I've mentioned? Any comments about the time performance and/or the memory performance of the different methods would also be appreciated.
I think it might also be possible to do the sorting in awk; I'd like to know, especially if it could be more efficient. Edit: It can be done, as @steeldriver and @markp-fuso show below.
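For illustration, here is a rough sketch of sorting inside awk using only portable features: every line is stored in an array and ordered with a simple insertion sort in the END block. This is an assumption-laden example rather than a recommendation. Insertion sort is quadratic, the whole file sits in memory, and with GNU awk a built-in such as `asort` would likely be the more natural choice. The names `line`, `cur`, `f`, `prev`, and `pad` are illustrative.

```shell
# Recreate the sample data in a temporary file (hypothetical stand-in for twocol).
tmp=$(mktemp)
cat > "$tmp" << 'EOF'
007 03
001 03
003 01
137 12
001 11
002 01
002 02
002 03
001 02
002 04
137 94
010 21
001 01
EOF

result=$(awk '
    { line[NR] = $0 }                 # the whole file is held in memory
    END {
        # Insertion sort on the stored lines, using string comparison.
        for (i = 2; i <= NR; i++) {
            cur = line[i]
            for (j = i - 1; j >= 1 && line[j] > cur; j--)
                line[j + 1] = line[j]
            line[j + 1] = cur
        }
        # Print the key only on the first row of each group.
        for (i = 1; i <= NR; i++) {
            split(line[i], f, " ")
            if (i == 1 || (f[1] "") != prev) {
                print f[1], f[2]
                prev = f[1] ""
                pad = sprintf("%" length(prev) "s", "")
            } else {
                print pad, f[2]
            }
        }
    }
' "$tmp")
printf '%s\n' "$result"
rm -f "$tmp"
```

Compared with feeding awk from `sort`, this trades an external process for memory proportional to the file size.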
awk '{if (!vals[$1]++) print($0); else print(" ",$2);}' <(sort twocol) works quite well :-); if you actually find yourself with memory issues you can easily replace the array reference with a 'previous' variable (eg, my 2nd awk script)
Solutions that sort within awk (eg, my 1st awk script, steeldriver's awk script) are going to require storing the file in memory; you can get away from the memory-usage question by using sort to feed a sorted stream to awk, and depending on your sort version there may be some options (memory size, # of cpus) to improve on sort's performance