I have the following file:

a3 v2c
v5 a7
a9 v2c
v1c a3 a7c

Desired output (without duplicates in each row):

a3 a7c a9 v1c v2c
a7 v5

What I want is to combine rows that share at least one element. In line 2, both elements are unique, so that row goes to the output as is (in sorted order). Line 1 shares "v2c" with line 3 and "a3" with line 4, so these three lines are combined and sorted. Shared elements can appear in different columns.
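(For reference, this merging rule is the classic connected-components / disjoint-set problem: tokens on the same line belong to one set, and sets that share a token get merged. A minimal union-find sketch in Ruby, with the sample input inlined for illustration and all names hypothetical:)

```ruby
# Union-find sketch: tokens on the same line are merged into one set,
# and each final set becomes one output row.
input = <<~TXT
  a3 v2c
  v5 a7
  a9 v2c
  v1c a3 a7c
TXT

parent = Hash.new { |h, k| h[k] = k }           # each token starts as its own root
find  = ->(x) { x = parent[x] while parent[x] != x; x }
union = ->(a, b) { ra, rb = find.(a), find.(b); parent[ra] = rb unless ra == rb }

input.each_line do |line|
  toks = line.split
  toks.each { |t| parent[t] }                   # register single-token lines too
  toks.each_cons(2) { |a, b| union.(a, b) }     # chain all tokens on the line together
end

clusters = Hash.new { |h, k| h[k] = [] }
parent.keys.each { |t| clusters[find.(t)] << t }

result = clusters.values.map { |c| c.sort.join(" ") }
puts result
```

This runs in roughly linear time in the input size, which is the property the script below is missing.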

My code is very slow for a large file (200000 lines):

Lines=$(awk 'END {print NR}' $1)
bank=$1
while [ $Lines -ge 1 ]; do
    echo "Processing line $Lines"
    awk -v line=$Lines 'NR == line' $bank | awk NF=NF RS= OFS="\n" | sort | uniq > Query.$Lines
    k=0
    while [[ $k != 1 ]]; do
        if [[ $k != "" ]]; then
            grep -f Query.$Lines $bank | awk '{gsub(/\t/,"\n")}1' | awk '{gsub(/ /,"\n")}1' | sort | uniq > Query1.$Lines
            grep -f Query1.$Lines $bank | awk '{gsub(/\t/,"\n")}1' | awk '{gsub(/ /,"\n")}1' | sort | uniq > Query2.$Lines
            grep -f Query2.$Lines $bank | awk '{gsub(/\t/,"\n")}1' | awk '{gsub(/ /,"\n")}1' | sort | uniq > Query3.$Lines
            k=$(diff Query2.$Lines Query3.$Lines)
            if [[ $k != "" ]]; then
                mv Query3.$Lines Query.$Lines
            fi
        else
            awk NF=NF RS= OFS=" " Query3.$Lines >> $1.output.clusters
            grep -v -f Query3.$Lines $bank > NotFound.$Lines
            bank=NotFound.$Lines
            k=1
        fi
    done
    rm Query*
    Lines=$(( $Lines - 1 ))
done
exit
find . -maxdepth 1 -type f -size 0 -delete
rm NotFound.* Query.* Query1.* Query2.* Query3.*

I believe there could be a much simpler and more efficient solution using bash or awk. Thanks in advance!


2 Answers


Using GNU awk for arrays of arrays and sorted_in:

$ cat tst.awk
{
    for ( fldNrA=1; fldNrA<NF; fldNrA++ ) {
        fldValA = $fldNrA
        for ( fldNrB=fldNrA+1; fldNrB<=NF; fldNrB++ ) {
            fldValB = $fldNrB
            val_pairs[fldValA][fldValB]
            val_pairs[fldValB][fldValA]
        }
    }
}

function descend(fldValA,       fldValB) {
    if ( !seen[fldValA]++ ) {
        all_vals[fldValA]
        for ( fldValB in val_pairs[fldValA] ) {
            descend(fldValB)
        }
    }
}

END {
    PROCINFO["sorted_in"] = "@ind_str_asc"
    for ( fldValA in val_pairs ) {
        delete all_vals
        descend(fldValA)
        if ( fldValA in all_vals ) {
            sep = ""
            for ( fldValB in all_vals ) {
                printf "%s%s", sep, fldValB
                sep = OFS
            }
            print ""
        }
    }
}

$ awk -f tst.awk file
a3 a7c a9 v1c v2c
a7 v5

Original answer:

Here's a start using GNU awk for arrays of arrays:

$ cat tst.awk
{
    for ( fldNr=1; fldNr<=NF; fldNr++ ) {
        fldVal = $fldNr
        fldVals_rowNrs[fldVal][NR]
        rowNrs_fldVals[NR][fldVal]
    }
}
END {
    for ( rowNr=1; rowNr<=NR; rowNr++ ) {
        noOverlap[rowNr]
    }

    for ( rowNrA in rowNrs_fldVals ) {
        for ( fldVal in rowNrs_fldVals[rowNrA] ) {
            for ( rowNrB in fldVals_rowNrs[fldVal] ) {
                if ( rowNrB > rowNrA ) {
                    overlap[rowNrA][rowNrB]
                    delete noOverlap[rowNrA]
                    delete noOverlap[rowNrB]
                }
            }
        }
    }

    for ( rowNrA in overlap ) {
        for ( rowNrB in overlap[rowNrA] ) {
            print "Values overlap between lines:", rowNrA, rowNrB
        }
    }

    for ( rowNr in noOverlap ) {
        print "All unique values in line:", rowNr
    }
}

$ awk -f tst.awk file
Values overlap between lines: 1 3
Values overlap between lines: 1 4
All unique values in line: 2

From there I expect you'll need to implement a (recursive descent?) function, which I'm not going to do, to call in place of the print "Values overlap between lines:", rowNrA, rowNrB line. It would find all values in common between all lines that have overlapping values, and use PROCINFO["sorted_in"] to print them in a specific order.

Since you asked in a comment for some info on recursive functions, here are examples of recursive awk functions (all named descend(), but the name is irrelevant) written for different purposes at:

Hopefully those will give you an idea of how to approach writing such a function for this task.
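To illustrate the descend() pattern outside of awk, here is the same idea as a recursive depth-first walk in Ruby: it collects everything transitively reachable from a start value in an adjacency map. The pairs data is hypothetical, built by hand to mirror the sample input's first cluster (the real awk version builds val_pairs from the input):

```ruby
# Recursive descend: depth-first walk over an adjacency map, marking each
# value as seen and recursing into its unvisited neighbours.
pairs = {
  "a3"  => ["v2c", "v1c", "a7c"],
  "v2c" => ["a3", "a9"],
  "a9"  => ["v2c"],
  "v1c" => ["a3"],
  "a7c" => ["a3"],
}

def descend(val, pairs, seen = {})
  return seen if seen[val]            # already visited: stop recursing
  seen[val] = true
  (pairs[val] || []).each { |nxt| descend(nxt, pairs, seen) }
  seen
end

reachable = descend("a9", pairs).keys.sort
puts reachable.join(" ")              # every value connected to "a9"
```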


Here is a ruby to do that:

ruby -e '
require "set"
line_map=Hash.new { |h,k| h[k]=[] }        # token -> list of line numbers it appears on
num_map=Hash.new { |h,k| h[k]=Set.new() }  # line number -> set of tokens on that line
bucket=Hash.new { |h,k| h[k]=Set.new() }   # true/false -> tokens, split by the test below
$<.each {|line| line_all=line.chomp.split
    line_all.each{|sym| line_map[sym] << $. }   # record where each token occurs
    num_map[$.].merge(line_all)
}
line_map.each{|k,v|
    # a token lands in bucket[true] only if it occurs on a single line and
    # every other token on that line also first occurs on that same line
    bucket[num_map[v[0]].all?{|ks| v.length==1 && line_map[ks][0]==v[0]}] << k
}
puts bucket[false].sort.join(" ")
puts bucket[true].sort.join(" ")
' file

Prints:

a3 a7c a9 v1c v2c
a7 v5
