This is a re-post of the same question in Bioinformatics. The given answer didn't work for me, and I thought I'd ask here.
My directories are organized as follows: one main directory, in which I have multiple project directories (start with 'mgp'). Inside each project directory, there are multiple metagenome directories (start with 'mgm' and end with '.3'). Inside each metagenome directory, there is always a single text .fna (fasta) file. Fasta files are defined by the following structure:
>Header1
Sequence1
>Header2
Sequence2
What I want is a loop that would go through all these fasta files, and add to the headers codes corresponding to project and metagenome directory. For example, if the fasta file in project directory mgp83581, metagenome subdirectory mgm4729322.3 had the following headers...
>seq1
>seq2
...then I would want to change it to:
>83581_322_seq1
>83581_322_seq2
Basically, I want to add the number code associated with the project, and the last three digits of the 7-digit code associated with the metagenome (apart from the '.3' part of the file name). That part of the code is okay. I know it works on one file at a time.
This is the full code I have right now. It does not work. (it only works for one of my two test files)
ls mgp*/mgm*.3/*.fna | while read line ; do
header=$(sed -r "s/^mgp([0-9]*)\/mgm[0-9]{4}([0-9]{1,})\.3\/(mgm.*\.fna)/\1_\2/") #this is where I get the code I want from the path
sed -i.bak "s/^>/>${header}_/" $line #this is where I add said code to my file
done
Individually, parts of the code 'seem' to work. If I write
ls mgp*/mgm*.3/*.fna | while read line ; do
echo $line
done
...then I am able to recover the paths for both my test files of interest. If I write:
ls mgp*/mgm*.3/*.fna | while read line ; do
echo $line
header=$(sed -r "s/^mgp([0-9]*)\/mgm[0-9]{4}([0-9]{1,})\.3\/(mgm.*\.fna)/\1_\2/")
echo $header
done
...then I get something rather weird. I get:
path for file#1
header for file #2
Whereas, I'd expect to get:
path for file#1
header for file#1
path for file#2
header for file#2
I've looked at previous questions/answers that discussed how the behaviour of variables in a while loop can be a bit strange, but I'm afraid I don't understand quite enough to make things work.
sedcommand to in your last loop?bash? Try this:for f in mgp*/mgm*.3/*.fna; do (IFS='./'; set -- $f; sed "s|^> *\(seq.*\)|>${1#mgp}_${2#${2%???}}_\1|" "$f"); done(add the the-iswitch tosedif everything's OK).