I have the following file:

a3 v2c
v5 a7
a9 v2c
v1c a3 a7c

Desired output (without duplicates in each row):

a3 a7c a9 v1c v2c
a7 v5

What I want is to combine rows that share at least one element. In line 2, both elements are unique, so that row goes to the output as is (in sorted order). Line 1 shares "v2c" with line 3 and "a3" with line 4, so these three lines are combined and sorted. Shared elements can appear in different columns.
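(For reference, this merging rule is the classic connected-components / disjoint-set problem: tokens on the same line belong to one set, and sets that share a token get merged. A minimal union-find sketch in Ruby, with the sample input inlined for illustration and all names hypothetical:)

```ruby
# Union-find sketch: tokens on the same line are merged into one set,
# and each final set becomes one output row.
input = <<~TXT
  a3 v2c
  v5 a7
  a9 v2c
  v1c a3 a7c
TXT

parent = Hash.new { |h, k| h[k] = k }           # each token starts as its own root
find  = ->(x) { x = parent[x] while parent[x] != x; x }
union = ->(a, b) { ra, rb = find.(a), find.(b); parent[ra] = rb unless ra == rb }

input.each_line do |line|
  toks = line.split
  toks.each { |t| parent[t] }                   # register single-token lines too
  toks.each_cons(2) { |a, b| union.(a, b) }     # chain all tokens on the line together
end

clusters = Hash.new { |h, k| h[k] = [] }
parent.keys.each { |t| clusters[find.(t)] << t }

result = clusters.values.map { |c| c.sort.join(" ") }
puts result
```

This runs in roughly linear time in the input size, which is the property the script below is missing.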

My code is very slow for a large file (200000 lines):

Lines=$(awk 'END {print NR}' $1)
bank=$1
while [ $Lines -ge 1 ]; do
    echo "Processing line $Lines"
    awk -v line=$Lines 'NR == line' $bank | awk NF=NF RS= OFS="\n" | sort | uniq > Query.$Lines
    k=0
    while [[ $k != 1 ]]; do
        if [[ $k != "" ]]; then
            grep -f Query.$Lines $bank | awk '{gsub(/\t/,"\n")}1' | awk '{gsub(/ /,"\n")}1' | sort | uniq > Query1.$Lines
            grep -f Query1.$Lines $bank | awk '{gsub(/\t/,"\n")}1' | awk '{gsub(/ /,"\n")}1' | sort | uniq > Query2.$Lines
            grep -f Query2.$Lines $bank | awk '{gsub(/\t/,"\n")}1' | awk '{gsub(/ /,"\n")}1' | sort | uniq > Query3.$Lines
            k=$(diff Query2.$Lines Query3.$Lines)
            if [[ $k != "" ]]; then
                mv Query3.$Lines Query.$Lines
            fi
        else
            awk NF=NF RS= OFS=" " Query3.$Lines >> $1.output.clusters
            grep -v -f Query3.$Lines $bank > NotFound.$Lines
            bank=NotFound.$Lines
            k=1
        fi
    done
    rm Query*
    Lines=$(( $Lines - 1 ))
done
exit
find . -maxdepth 1 -type f -size 0 -delete
rm NotFound.* Query.* Query1.* Query2.* Query3.*

I believe there could be a much simpler and more efficient solution using bash or awk. Thanks in advance!


2 Answers


Using GNU awk for arrays of arrays and sorted_in:

$ cat tst.awk
{
    for ( fldNrA=1; fldNrA<NF; fldNrA++ ) {
        fldValA = $fldNrA
        for ( fldNrB=fldNrA+1; fldNrB<=NF; fldNrB++ ) {
            fldValB = $fldNrB
            val_pairs[fldValA][fldValB]
            val_pairs[fldValB][fldValA]
        }
    }
}

function descend(fldValA,       fldValB) {
    if ( !seen[fldValA]++ ) {
        all_vals[fldValA]
        for ( fldValB in val_pairs[fldValA] ) {
            descend(fldValB)
        }
    }
}

END {
    PROCINFO["sorted_in"] = "@ind_str_asc"
    for ( fldValA in val_pairs ) {
        delete all_vals
        descend(fldValA)
        if ( fldValA in all_vals ) {
            sep = ""
            for ( fldValB in all_vals ) {
                printf "%s%s", sep, fldValB
                sep = OFS
            }
            print ""
        }
    }
}

$ awk -f tst.awk file
a3 a7c a9 v1c v2c
a7 v5

Original answer:

Here's a start using GNU awk for arrays of arrays:

$ cat tst.awk
{
    for ( fldNr=1; fldNr<=NF; fldNr++ ) {
        fldVal = $fldNr
        fldVals_rowNrs[fldVal][NR]
        rowNrs_fldVals[NR][fldVal]
    }
}
END {
    for ( rowNr=1; rowNr<=NR; rowNr++ ) {
        noOverlap[rowNr]
    }

    for ( rowNrA in rowNrs_fldVals ) {
        for ( fldVal in rowNrs_fldVals[rowNrA] ) {
            for ( rowNrB in fldVals_rowNrs[fldVal] ) {
                if ( rowNrB > rowNrA ) {
                    overlap[rowNrA][rowNrB]
                    delete noOverlap[rowNrA]
                    delete noOverlap[rowNrB]
                }
            }
        }
    }

    for ( rowNrA in overlap ) {
        for ( rowNrB in overlap[rowNrA] ) {
            print "Values overlap between lines:", rowNrA, rowNrB
        }
    }

    for ( rowNr in noOverlap ) {
        print "All unique values in line:", rowNr
    }
}

$ awk -f tst.awk file
Values overlap between lines: 1 3
Values overlap between lines: 1 4
All unique values in line: 2

From there I expect you'll need to implement a (recursive descent?) function, which I'm not going to do, to call in place of the print "Values overlap between lines:", rowNrA, rowNrB line. It would find all values in common between all lines that have overlapping values, and use PROCINFO["sorted_in"] to print them in a specific order.

Since you asked in a comment for some info on recursive functions, here are examples of recursive awk functions (all named descend(), but the name is irrelevant) written for different purposes at:

Hopefully those will give you an idea of how to approach writing such a function for this task.
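To illustrate the descend() pattern outside of awk, here is the same idea as a recursive depth-first walk in Ruby: it collects everything transitively reachable from a start value in an adjacency map. The pairs data is hypothetical, built by hand to mirror the sample input's first cluster (the real awk version builds val_pairs from the input):

```ruby
# Recursive descend: depth-first walk over an adjacency map, marking each
# value as seen and recursing into its unvisited neighbours.
pairs = {
  "a3"  => ["v2c", "v1c", "a7c"],
  "v2c" => ["a3", "a9"],
  "a9"  => ["v2c"],
  "v1c" => ["a3"],
  "a7c" => ["a3"],
}

def descend(val, pairs, seen = {})
  return seen if seen[val]            # already visited: stop recursing
  seen[val] = true
  (pairs[val] || []).each { |nxt| descend(nxt, pairs, seen) }
  seen
end

reachable = descend("a9", pairs).keys.sort
puts reachable.join(" ")              # every value connected to "a9"
```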


Here is a ruby to do that:

ruby -e '
require "set"
line_map=Hash.new { |h,k| h[k]=[] }        # token -> list of line numbers it appears on
num_map=Hash.new { |h,k| h[k]=Set.new() }  # line number -> set of tokens on that line
bucket=Hash.new { |h,k| h[k]=Set.new() }   # true/false -> tokens, split by the test below
$<.each {|line| line_all=line.chomp.split
    line_all.each{|sym| line_map[sym] << $. }   # record where each token occurs
    num_map[$.].merge(line_all)
}
line_map.each{|k,v|
    # a token lands in bucket[true] only if it occurs on a single line and
    # every other token on that line also first occurs on that same line
    bucket[num_map[v[0]].all?{|ks| v.length==1 && line_map[ks][0]==v[0]}] << k
}
puts bucket[false].sort.join(" ")
puts bucket[true].sort.join(" ")
' file

Prints:

a3 a7c a9 v1c v2c
a7 v5
