
I have a couple of large files (~1 GB) with this structure:

fooA iug9wa
fooA lauie
fooA nwgoieb
fooB wilgb
fooB rqgebepu
fooB ifbqeiu
...
fooN ibfiygb
fooN yvsiy
fooN aeviu

In the shell, I would like to replace each fooX (which may contain letters, digits, "." and "_") with a sequential number from 1 to N. All the fooX values are listed in foo.list.

I've used:

nfoos=$(wc -l < foo.list)

for i in $(seq 1 $nfoos)
do
    currentfoo=$(sed "${i}q;d" foo.list)
    sed -i "s/"${currentfoo}"/$i/g" file1
    sed -i "s/"${currentfoo}"/$i/g" file2
    sed -i "s/"${currentfoo}"/$i/g" filen
done

However, with large files this takes forever. Since each consecutive fooX always appears later in the files than foo(X-1), I thought I could make sed search only the part of fileX after the last match of the previous foo, so that there is less to search with each foo. I've been trying to use labels and some multiline approaches, but the syntax keeps beating me.

Does anyone know how to make this work? (It doesn't have to use sed, but it would be great if it worked with basic shell tools in Bash.)

I'd appreciate any help. If you answer, please explain each function, option, and variable used so that I can figure out where I went wrong.


2 Answers

2

You can use awk.
The first part of the awk command below fills the array a with the line number of each name in foo.list; the second part replaces the first word of each line with that number.

awk 'NR==FNR { a[$1]=NR; next} $1 in a{$1=a[$1]; print}' foo.list file1

If this gives the result you want, you can loop over your files:

for f in file1 file2 filen; do
  awk 'NR==FNR { a[$1]=NR; next} $1 in a{$1=a[$1]; print}' foo.list "${f}" > "${f}.tmp" &&
  mv "${f}.tmp" "${f}"
done

The && makes sure the new file replaces the original only if awk succeeded.
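Since the question asks for an explanation of each part, here is a commented breakdown of the one-liner above, run against a tiny made-up data set. The scratch directory, the sample names foo.A and foo_B, and the variable `result` are assumptions for illustration only.

```shell
# Work in a scratch directory so no real files are touched.
workdir=$(mktemp -d)

# Made-up sample data in the layout described in the question.
printf '%s\n' foo.A foo_B > "$workdir/foo.list"
printf '%s\n' 'foo.A iug9wa' 'foo.A lauie' 'foo_B wilgb' > "$workdir/file1"

result=$(awk '
    # NR counts lines across all inputs; FNR restarts at 1 for each file.
    # NR == FNR is therefore true only while the first file (foo.list) is read.
    NR == FNR { a[$1] = NR; next }   # store the line number of each name

    # Data file: if the first field is a known name, overwrite it with its
    # number and print the rebuilt line (fields are rejoined with one space).
    $1 in a   { $1 = a[$1]; print }
' "$workdir/foo.list" "$workdir/file1")

printf '%s\n' "$result"
```

Because the names are used as array keys rather than as regular expressions, a "." in a name is compared literally, and each large file is read only once.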


2 Comments

Glad I could help. Next time please add more example cases, such as a small foo.list and the result you want from the given input. Example input with dots and underscores might be relevant.
This solution removes lines from the input file that don't have a corresponding entry in your foo.list. If you want to keep those lines and replace their first field with something like foo0, add an if statement to the awk command. Nice training!
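The if statement mentioned in the comment could look like the sketch below. The placeholder foo0 comes from the comment; the sample files and the variable `kept` are made up for illustration.

```shell
workdir=$(mktemp -d)

# Made-up sample data; "other" is deliberately missing from foo.list.
printf '%s\n' foo.A foo_B > "$workdir/foo.list"
printf '%s\n' 'foo.A iug9wa' 'other xyz' 'foo_B wilgb' > "$workdir/file1"

kept=$(awk '
    NR == FNR { a[$1] = NR; next }   # first file: name -> line number
    {
        if ($1 in a) {
            $1 = a[$1]               # known name: replace with its number
        } else {
            $1 = "foo0"              # unknown name: keep the line, mark it
        }
        print
    }
' "$workdir/foo.list" "$workdir/file1")

printf '%s\n' "$kept"
```

With this variant every input line is printed, so the output has the same number of lines as the input file.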
0

Two optimizations:

  1. Use awk to generate a sed script which does all the replacements in a single run.

  2. Run sed -i with N file arguments instead of running sed N times with 1 file argument each.

awk '{ print "s/" $0 "/" NR "/g;" }' foo.list > temp_script
sed -i -f temp_script $(cat foo.list)

Now sed runs only once, instead of once per name per file.

4 Comments

The OP wrote that fooX contains letters, numbers, "." and "_". If the values include val. and valx, the pattern generated for val. will also match valx, because an unescaped dot matches any character. You should replace the dots with [.] or \..
I think you want to replace sed -i -f temp_script $(cat foo.list) with sed -i -f temp_script file1 file2 filen.
Thanks @WalterA! That is exactly the problem I ran into. Replacing the "_"s kind of misses the whole point, as I would have to replace them in the large files too, and that would again take a lot of time. Regarding the second comment: yes, I noticed this too, but I got the point. I'll fix it in the reply.
You never need to have awk generate a sed script and then call sed to execute it; just do whatever you want to do in the one call to awk. What you show will fail for various input values, e.g. if foo.list contains abc.e then the sed command will also replace abcae, abc5e, abcXe, etc.
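If the generated sed script is kept anyway, the points raised in these comments could be addressed roughly as follows. This is only a sketch under several assumptions: GNU sed (for -i without a suffix), names that appear as the first field followed by a single space as in the question's sample, and made-up sample files.

```shell
workdir=$(mktemp -d)

# Made-up sample data containing the problematic pair from the comments.
printf '%s\n' abc.e abcXe > "$workdir/foo.list"
printf '%s\n' 'abc.e one' 'abcXe two' 'abc.e three' > "$workdir/file1"
printf '%s\n' 'abcXe four' > "$workdir/file2"

# Build one substitution per name:
#   - gsub escapes each "." so that it matches only a literal dot
#   - "^" and the trailing space tie the match to the whole first field
#   - "t" jumps to the end of the script after a successful substitution,
#     so a line that already got its number is not touched by later rules
awk '{ gsub(/\./, "\\."); print "s/^" $0 " /" NR " /"; print "t" }' \
    "$workdir/foo.list" > "$workdir/temp_script"

# One sed run over explicitly named data files (not over $(cat foo.list)).
sed -i -f "$workdir/temp_script" "$workdir/file1" "$workdir/file2"

cat "$workdir/file1" "$workdir/file2"
```

Even with these fixes, sed still tries the rules one by one for every line, so the single awk call with an array lookup is likely the simpler option for long name lists.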
