
I have an unknown number of input files that all match a glob pattern, let's say *.dat, and all have 2 columns of data and an equal number of rows. In bash I need to take the 2nd column from each file and append it as a new column in a single merged file.

E.g.:

>>cat File1.dat
1   A
2   B
3   C
>>cat File2.dat
4   D
5   E
6   F
>>cat combined.dat
A   D
B   E
C   F

Here is the code I have tried; the approach I have gone for is to loop and append:

for filename in $(ls *.dat); do paste combined.dat <(awk '{print $2}' $filename) >> combined.dat; done

The output format can be anything so long as it's tab delimited, and the key is that it must work on any number of input files (up to roughly 100), where the number isn't known in advance.

  • Related: Process Substitution For Each Array Entry Commented Jun 4, 2020 at 10:50
  • I fixed two bugs (which only occurred on some systems) in my answer. Hope that everything works now. Please let me know if one of the commands works for you. Commented Jun 5, 2020 at 18:25

2 Answers


Awk

Since you already use awk, you could do the whole work in awk:

rm -f combined.dat
awk 'FNR<NR{d="\t"} {a[FNR]=a[FNR] d $2} END{for(i=1;i<=FNR;i++) print a[i]}' *.dat > combined.dat
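The same program written out with comments, purely for readability (this is just a reformatted version of the one-liner above, not a different approach):

awk '
    FNR < NR { d = "\t" }        # true from the second file onward: put a tab before the value
    { a[FNR] = a[FNR] d $2 }     # append field 2 of the current line to output row FNR
    END {                        # after the last file, print the assembled rows
        for (i = 1; i <= FNR; i++)
            print a[i]
    }
' *.dat > combined.dat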

"Classic" solution by repeated paste

You can repeatedly paste combined.dat and the next found file. The only tricky part is getting the first paste right, when combined.dat does not exist or is empty. You could use an if, but that would be boring. Here we use a trick: paste acts like cat when used with only one argument. With an array we can conveniently specify the optional further argument. We also use sponge from moreutils to make sure that combined.dat is not mangled due to concurrent reads and writes – if you don't want to install sponge you have to use a temporary file or variables instead (a sketch of the temporary-file variant follows the loop below).

rm -f combined.dat
p=()
for f in *.dat; do
  awk '{print $2}' "$f" | paste "${p[@]}" - | sponge combined.dat
  p=(combined.dat)
done
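
If you don't want to install sponge, the temporary-file variant mentioned above could look roughly like this (the name combined.tmp is arbitrary):

rm -f combined.dat
p=()
for f in *.dat; do
  awk '{print $2}' "$f" | paste "${p[@]}" - > combined.tmp   # write to a temp file first ...
  mv combined.tmp combined.dat                               # ... then replace combined.dat
  p=(combined.dat)
done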

Hacky solution using a single paste

Alternatively, you could build a bash command and execute that. No worries, eval is safe here as printf %q ensures correct quoting.

rm -f combined.dat
eval "paste $(printf "<(awk '{printf \$2}' %q) " *.dat) > combined.dat"



A short draft; in particular, the way the newlines and tabs are inserted could be optimized:

#!/bin/bash
# All files have the same number of rows, so take the line count from the first one.
# (xargs trims the leading whitespace that some wc implementations print.)
nrLines=$(wc -l < "$(ls *dat | head -1)" | xargs)
i=1
while [ "$i" -le "$nrLines" ]; do
    for file in *dat; do
        # print column 2 of row i, followed by a tab
        awk -v line="$i" 'NR==line {printf "%s", $2}' "$file" >> consolidatedreport.txt
        printf '\t' >> consolidatedreport.txt
    done
    # finish the row with a newline
    echo "" >> consolidatedreport.txt
    i=$((i+1))
done

Be careful: depending on how you output data to your new file and how you iterate over your existing files, you might end up iterating over your newly created file. So be sure to either use a different extension than *dat if you iterate over all files with that extension (I used .txt in the example), or place the resulting file in a subfolder.
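
A minimal illustration of that pitfall, using the file names from the question (combined.dat stands in for an output file that would match the glob):

touch File1.dat File2.dat

# an output file ending in .dat gets matched by the glob the next time it is expanded
touch combined.dat
echo *dat                      # File1.dat File2.dat combined.dat

# an output with a different extension (or in a subfolder) stays out of the input set
rm combined.dat
touch consolidatedreport.txt
echo *dat                      # File1.dat File2.dat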

