I have the following file:
a3 v2c
v5 a7
a9 v2c
v1c a3 a7c
Desired output (without duplicates in each row):
a3 a7c a9 v1c v2c
a7 v5
I want to combine all rows that share at least one element. In line 2, both elements are unique, so this row goes to the output as is (in sorted order). Line 1 shares "v2c" with line 3 and "a3" with line 4, so these 3 lines are combined and sorted. Merging is transitive: lines 3 and 4 have no element in common, but both end up in the same group through line 1. Shared elements can be in different columns.
My code is very slow for a large file (200000 lines):
Lines=$(awk 'END {print NR}' "$1")
bank=$1
while [ "$Lines" -ge 1 ]
do
    echo "Processing line $Lines"
    # Put the elements of the current line into a query file, one per line
    awk -v line="$Lines" 'NR == line' "$bank" | awk NF=NF RS= OFS="\n" | sort -u > "Query.$Lines"
    k=0
    while [[ $k != 1 ]]
    do
        if [[ $k != "" ]]
        then
            # Grow the set of elements until it stops changing.
            # -w -F match whole elements literally, so "a7" does not also match "a7c"
            # (assumes elements contain only letters, digits and underscores).
            grep -wFf "Query.$Lines" "$bank" | awk '{gsub(/[ \t]+/,"\n")}1' | sort -u > "Query1.$Lines"
            grep -wFf "Query1.$Lines" "$bank" | awk '{gsub(/[ \t]+/,"\n")}1' | sort -u > "Query2.$Lines"
            grep -wFf "Query2.$Lines" "$bank" | awk '{gsub(/[ \t]+/,"\n")}1' | sort -u > "Query3.$Lines"
            k=$(diff "Query2.$Lines" "Query3.$Lines")
            if [[ $k != "" ]]
            then mv "Query3.$Lines" "Query.$Lines"
            fi
        else
            # The set is stable: write it out and drop its rows from the bank
            awk NF=NF RS= OFS=" " "Query3.$Lines" >> "$1.output.clusters"
            grep -vwFf "Query3.$Lines" "$bank" > "NotFound.$Lines"
            bank=NotFound.$Lines
            k=1
        fi
    done
    rm -f Query*
    Lines=$(( Lines - 1 ))
done
# Clean up before exiting (after "exit" these commands would never run)
find . -maxdepth 1 -type f -size 0 -delete
rm -f NotFound.*
exit
I believe there could be a much simpler and more efficient solution using bash or awk. Thanks in advance!
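The grouping described above looks like a connected-components problem: every element is a node, and each row links its elements together. One possible approach is a union-find (disjoint-set) structure in awk, which reads the file once instead of rescanning it for every line. The following is only a minimal sketch, not a tuned solution. The function name merge_rows is hypothetical, and the sketch assumes whitespace-separated elements and that sorting in the C locale is acceptable.

```shell
# merge_rows FILE  (hypothetical helper name)
# Prints one sorted, duplicate-free row per group of rows that share elements.
merge_rows() {
  awk '
    # find(x): return the representative of the group that contains x,
    # compressing the path on the way back
    function find(x,    r, t) {
      r = x
      while (parent[r] != r) r = parent[r]
      while (parent[x] != r) { t = parent[x]; parent[x] = r; x = t }
      return r
    }
    {
      for (i = 1; i <= NF; i++) {
        e = $i ""                          # force string comparison
        if (!(e in parent)) parent[e] = e  # new element starts its own group
        if (i > 1) {                       # join it with the first element of the row
          a = find($1 ""); b = find(e)
          if (a != b) parent[b] = a
        }
      }
    }
    # emit "representative element" pairs, one per distinct element
    END { for (x in parent) print find(x), x }
  ' "$1" |
  LC_ALL=C sort -k1,1 -k2,2 |
  awk '
    # collect consecutive elements with the same representative into one row
    NR == 1 || ($1 "") != prev { if (NR > 1) print line; line = $2; prev = $1 ""; next }
    { line = line " " $2 }
    END { if (NR > 0) print line }
  ' |
  LC_ALL=C sort
}
```

Called as merge_rows file on the sample input, this should print the two rows shown in the desired output. Memory use grows with the number of distinct elements, since they are all held in one awk array.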