
I have a DNA covariance model stored in a single file. Input data:

Each record is two lines with the same header: the sequence, then its covariance (structure) mask.

NC_013791.2.2 : GCTCAGCTGGCtAGAG
NC_013791.2.2 : >>>>.........<<<
NC_013791.2.3 : GCTCAGCTGGCtAGAG
NC_013791.2.3 : >>>>..<<<<......
NC_013791.2.4 : GCTCAGCTGGCtAGGA
NC_013791.2.4 : >>>>.........<<<
NC_013791.2.5 : GCTCAGCTGACtACAG
NC_013791.2.5 : >>>>..<<<<......

Expected output for all of the above IDs, from the single file:

NC_013791.2.2 : GAG
NC_013791.2.2 : <<<
NC_013791.2.3 : CTGG
NC_013791.2.3 : <<<<
NC_013791.2.4 : GGA
NC_013791.2.4 : <<<
NC_013791.2.5 : CTGA
NC_013791.2.5 : <<<<

What I have tried so far:

  1. Delete the last character with sed 's/.$//', as suggested on Stack Overflow.

  2. Extract the last three characters with: rev sym.txt | cut -c 1-3 | rev

  3. Extract the < characters with grep: grep -Eo "<.{3}" sym.txt

But I am not able to extract the output shown below:

GAG
<<<
GAGC
<<<<

or, on one line: GAGC <<<<

Could someone help with sed, awk, or grep? Thank you in advance.

  • @Morton: May I know why I am not able to vote for both (@Stuffy and @Potong)? Their contributions will be useful, with future modifications, to people who work in biology. Thank you. Commented Apr 26, 2024 at 11:35
  • As far as I know, you can vote for whoever you like. I don't know what makes you think you can't vote for as many people as you like, and I'm not the right person to ask about that; I'm just a contributor to the site, same as you. Maybe flag a question or an answer to ask a moderator about it? Commented Apr 26, 2024 at 16:15

4 Answers

If your data is always in this format, you can print the first two fields followed by a call to substr, which extracts the part of interest.

Based on the answer provided by @stuffy, you could change the code to match a run of three or more < characters:

awk 'match($0, /<<<+/) { 
  print $1, $2, substr(prev, RSTART, RLENGTH)
  print $1, $2, substr($0, RSTART, RLENGTH)
} { 
  prev = $0
}' file

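Here, $0 is the current line and prev is the previous line.

The match function sets the predefined variables RSTART (the 1-based position where the match starts) and RLENGTH (its length), which you can pass to substr. Because the sequence line and the structure line are aligned character for character, the same offsets select the matching bases from prev.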

Output

NC_013791.2.2 : GAG
NC_013791.2.2 : <<<
NC_013791.2.3 : CTGG
NC_013791.2.3 : <<<<
NC_013791.2.4 : GGA
NC_013791.2.4 : <<<
NC_013791.2.5 : CTGA
NC_013791.2.5 : <<<<
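
To see what match() reports, here is a minimal sketch that runs the same regex on a bare structure string (no ID prefix). The helper name show_match is hypothetical and only used for illustration; in the answer's code the offsets are larger because the match runs on the whole line, ID included.

```shell
# Hypothetical helper: print RSTART and RLENGTH for the first run of 3+ "<".
show_match() {
  printf '%s\n' "$1" | awk 'match($0, /<<<+/) { print RSTART, RLENGTH }'
}

show_match '>>>>.........<<<'    # prints: 14 3
show_match '>>>>..<<<<......'    # prints: 7 4
```

Passing those two numbers to substr on the sequence GCTCAGCTGGCtAGAG gives GAG and CTGG respectively.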

If the field separator is " : " and you also want to check that the ID before it is the same on both lines:

awk -F" : " '
  match($2, /<<<+/) && key == $1 {
    print $1 FS substr(val, RSTART, RLENGTH)
    print $1 FS substr($2, RSTART, RLENGTH)
  }
  { val = $2; key = $1 }
' file

If I understand correctly, you want to print each run of < characters together with the sequence characters directly above it.

I tried this:

$ awk '{
        if (match($0, /<+/)) {
                print $1, $2, substr(prevline, RSTART, RLENGTH)
                print $1, $2, substr($0, RSTART, RLENGTH)
                next
        }
}

{
        prevline = $0
}' file

NC_013791.2.2 : GAG
NC_013791.2.2 : <<<
NC_013791.2.3 : CTGG
NC_013791.2.3 : <<<<
NC_013791.2.4 : GGA
NC_013791.2.4 : <<<
NC_013791.2.5 : CTGA
NC_013791.2.5 : <<<<

4 Comments

Thank you very much, it works. Could you point me to an explanation I can read through (I am new to this)? I also tried the awk command you provided on input where the header and the sequence/covariance are tab-separated, but it removed everything except the required characters. Input: NC_013791.2.trna2 GCTCAGCTGGCtAGAG and NC_013791.2.trna2 >>>>.........<<<; expected: NC_013791.2.trna2 GAG and NC_013791.2.trna2 <<<. Could you please show how to modify the command? Thank you in advance.
I don't understand how the input goes from 8 lines to an output of 4 lines.
@stuffy, sorry for the confusion; see the edit: 8 input lines should give 8 output lines. Would you mind modifying your code so that the output includes the headers, as requested and as shown in Potong's modification? Thank you both.
I added headers and format with -o-
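
For the tab-separated layout described in the comments above, a minimal sketch could look like the following. It assumes exactly two tab-separated columns per line (the ID, then either the sequence or the structure mask) and that the two lines of a record are adjacent; the function name extract_tsv is hypothetical.

```shell
# Hypothetical helper for tab-separated input: ID<TAB>sequence, then ID<TAB>mask.
extract_tsv() {
  awk -F'\t' -v OFS='\t' '
    # Mask line: same ID as the previous line and it contains a run of "<".
    $1 == key && match($2, /<+/) {
      print key, substr(seq, RSTART, RLENGTH)   # bases under the mask
      print $1, substr($2, RSTART, RLENGTH)     # the mask itself
    }
    # Remember the current line for the next iteration.
    { key = $1; seq = $2 }
  ' "$@"
}
```

Usage would be extract_tsv file, or piping the data into extract_tsv. Matching on $2 rather than $0 keeps the header column out of the offsets, so it is printed unchanged.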

Using any awk plus tac:

$ cat tst.awk
match($3,/<+/) {
    start = RSTART
    lgth = RLENGTH
}
{
    $3 = substr($3,start,lgth)
    print
}

$ tac file | awk -f tst.awk | tac
NC_013791.2.2 : GAG
NC_013791.2.2 : <<<
NC_013791.2.3 : CTGG
NC_013791.2.3 : <<<<
NC_013791.2.4 : GGA
NC_013791.2.4 : <<<
NC_013791.2.5 : CTGA
NC_013791.2.5 : <<<<


This might work for you (GNU sed):

sed -E 'N;:a;s/^.(.*\n)[^<]|.(\n.*)[^<]$/\1\2/;ta;' file

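Turn on extended regular expressions with the -E option.

Append the next input line to the pattern space (N), so it holds the sequence line and the mask line together.

Start a loop: :a defines a label, and ta jumps back to it whenever the substitution changed something.

Using the substitution, nibble away at the front and back of both lines until only the mask and the sequence characters it selects remain.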
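
The steps above can be spelled out as a commented multi-line script. This is only a sketch of the same one-liner, wrapped in a hypothetical function named trim_pair, and it still assumes GNU sed. Like the one-liner, it also strips the ID prefix, because the prefix is nibbled away along with the rest of the line.

```shell
# Hypothetical wrapper around the one-liner above; assumes GNU sed.
trim_pair() {
  sed -E '
    # N: append the next input line, so the pattern space is "sequence\nmask".
    N
    # :a defines a label named "a" that the loop jumps back to.
    :a
    # Alternative 1 drops the first character of both lines while the mask
    # does not start with "<". Alternative 2 drops the last character of both
    # lines while the mask does not end with "<". \1 and \2 put back the text
    # captured between the two removed characters.
    s/^.(.*\n)[^<]|.(\n.*)[^<]$/\1\2/
    # ta: if the substitution changed something, branch to label "a".
    ta
  ' "$@"
}
```

For example, piping the two NC_013791.2.2 lines into trim_pair leaves GAG on the first line and <<< on the second.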


Subsequent to further clarification(?), perhaps:

cat <<\! > file
NC_013791.2.2 : GCTCAGCTGGCtAGAG
NC_013791.2.2 : >>>>.........<<<
!

cat <<\! > file1
NC_013791.2.3 : GCTCAGCTGGCtAGAG
NC_013791.2.3 : >>>>..<<<<......
!

cat <<\! > file2
NC_013791.2.2 : GCTCAGCTGGCtAGAG
NC_013791.2.2 : >>>>.........<<<
NC_013791.2.3 : GCTCAGCTGGCtAGAG
NC_013791.2.3 : >>>>..<<<<......
NC_013791.2.4 : GCTCAGCTGGCtAGAG
NC_013791.2.4 : >>>>.........<<<
NC_013791.2.5 : GCTCAGCTGGCtAGAG
NC_013791.2.5 : >>>>..<<<<......
!

sed -E 'N;:a;s/^(.*: ).(.*\n\1)[^<]|.(\n.*)[^<]$/\1\2\3/;ta' file file1
NC_013791.2.2 : GAG
NC_013791.2.2 : <<<
NC_013791.2.3 : CTGG
NC_013791.2.3 : <<<<

sed -E 'N;:a;s/^(.*: ).(.*\n\1)[^<]|.(\n.*)[^<]$/\1\2\3/;ta' file2
NC_013791.2.2 : GAG
NC_013791.2.2 : <<<
NC_013791.2.3 : CTGG
NC_013791.2.3 : <<<<
NC_013791.2.4 : GAG
NC_013791.2.4 : <<<
NC_013791.2.5 : CTGG
NC_013791.2.5 : <<<<

7 Comments

Thank you very much. I tried your command, but it only works on alternate pairs: if a file has 10 pairs, it works on pairs 1, 3, and so on.
@PanduC Perhaps you could show the exact input data and the exact form of the output data (not as jpg but as clear text).
Hi Stuffy and Potong, I have edited the question. I hope it is clear now; please let me know if not. Thank you in advance.
@Potong: Thank you, it works with two separate files (file + file1) as given, but I cannot get it to work on a single file (I have ~46k IDs in a single file). Could you please let me know how to modify it to work on a single file, and explain what N, a, \1, \2\3, and ta do?
@PanduC what does the single file look like?