How to extract symbol (<<) and its corresponding alphabets from a string with sed, awk or grep

Question

Input data: DNA covariance model output, all in a single file

Each line has the form "header : sequence" or "header : covariance":

NC_013791.2.2 : GCTCAGCTGGCtAGAG
NC_013791.2.2 : >>>>.........<<<
NC_013791.2.3 : GCTCAGCTGGCtAGAG
NC_013791.2.3 : >>>>..<<<<......
NC_013791.2.4 : GCTCAGCTGGCtAGGA
NC_013791.2.4 : >>>>.........<<<
NC_013791.2.5 : GCTCAGCTGACtACAG
NC_013791.2.5 : >>>>..<<<<......

Expected output for all of the above IDs, from that single file:

NC_013791.2.2 : GAG
NC_013791.2.2 : <<<
NC_013791.2.3 : CTGG
NC_013791.2.3 : <<<<
NC_013791.2.4 : GGA
NC_013791.2.4 : <<<
NC_013791.2.5 : CTGA
NC_013791.2.5 : <<<<

I am able to delete the last character with: sed 's/.$//' (as suggested on Stack Overflow),
extract the last characters with: rev sym.txt | cut -c 1-3 | rev
and extract only the < characters with grep: grep -Eo "<.{3}" sym.txt

but I am not able to extract the output shown below:

GAG
<<<
GAGC
<<<<

or GAGC <<<<

Could someone help with sed, awk or grep? Thank you in advance.

@Morton: May I know why I am not able to vote for both (@Stuffy and @Potong)? Their contributions are useful, with future modifications, to people who work in biology. Thank you.
– Pandu C, Commented Apr 26, 2024 at 11:35
As far as I know you can vote for whoever you like. I don't know what makes you think you can't vote for as many people as you like, but I'm not the right person to ask about that; I'm just a contributor to the site, same as you. Maybe flag a question or an answer to ask a moderator about it?
– Ed Morton, Commented Apr 26, 2024 at 16:15

The fourth bird · Accepted Answer · 2024-04-27 12:46:39Z

If your data is always in this format, you can print the first 2 fields followed by the call to substr which will print the part of interest.

Based on the answer provided by @stuffy, you could change the code to match 3 or more times a < char:

awk 'match($0, /<<<+/) { 
  print $1, $2, substr(prev, RSTART, RLENGTH)
  print $1, $2, substr($0, RSTART, RLENGTH)
} { 
  prev = $0
}' file

Here, $0 is the current line and prev is the previous line.

The match function sets the predefined variables RSTART and RLENGTH, which you can then pass to substr.

Output

NC_013791.2.2 : GAG
NC_013791.2.2 : <<<
NC_013791.2.3 : CTGG
NC_013791.2.3 : <<<<
NC_013791.2.4 : GGA
NC_013791.2.4 : <<<
NC_013791.2.5 : CTGA
NC_013791.2.5 : <<<<
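
To see what match reports, here is a minimal sketch that prints RSTART, RLENGTH and the matched run for bare mask lines. The function name show_match is hypothetical and only used for illustration; the input is two mask strings from the question without their headers.

```shell
# Print the 1-based start, the length, and the text of the first run of "<".
show_match() {
  awk 'match($0, /<+/) { print RSTART, RLENGTH, substr($0, RSTART, RLENGTH) }' "$@"
}

printf '%s\n' '>>>>.........<<<' '>>>>..<<<<......' | show_match
# 14 3 <<<
# 7 4 <<<<
```

Because the sequence line and the mask line have the same layout, the same RSTART and RLENGTH can be reused to slice the previous line.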

If, for example, the field separator is " : " and you want to check that the ID in front of it is the same on both lines:

awk -F" : " '
  match($2, /<<<+/) && key == $1 {
    print $1 FS substr(val, RSTART, RLENGTH)
    print $1 FS substr($2, RSTART, RLENGTH)
  }
  { val = $2; key = $1 }
' file
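
As a rough illustration of what the key check buys you, the sketch below wraps the same awk program in a hypothetical function, extract_pairs, and feeds it made-up IDs (id1, id2, id3 are not from the question). The mask line for id3 has no sequence line with the same ID directly before it, so it is skipped.

```shell
extract_pairs() {
  awk -F' : ' '
    # Mask line whose ID equals the ID of the line before it: print both slices.
    match($2, /<<<+/) && key == $1 {
      print $1 FS substr(val, RSTART, RLENGTH)
      print $1 FS substr($2, RSTART, RLENGTH)
    }
    # Remember the ID and value of this line for the next comparison.
    { val = $2; key = $1 }
  ' "$@"
}

printf '%s\n' \
  'id1 : GCTCAGCTGGCtAGAG' \
  'id1 : >>>>.........<<<' \
  'id2 : GCTCAGCTGGCtAGAG' \
  'id3 : >>>>..<<<<......' | extract_pairs
# id1 : GAG
# id1 : <<<
```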

stuffy · Accepted Answer · 2024-04-29 00:12:30Z

1

If I understand right, you want to print all the < characters plus the characters above them.

I tried this

$ awk '{
        if (match($0, /<+/)) {
                print $1, $2, substr(prevline, RSTART, RLENGTH)
                print $1, $2, substr($0, RSTART, RLENGTH)
                next
        }
}

{
        prevline = $0
}' file

NC_013791.2.2 : GAG
NC_013791.2.2 : <<<
NC_013791.2.3 : CTGG
NC_013791.2.3 : <<<<
NC_013791.2.4 : GGA
NC_013791.2.4 : <<<
NC_013791.2.5 : CTGA
NC_013791.2.5 : <<<<
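
For tab-separated input (an ID, a tab, then the sequence or the mask), a variation on the same idea might look like the sketch below. It assumes exactly two tab-separated fields per line; the function name extract_tsv and the variable prevseq are hypothetical.

```shell
extract_tsv() {
  awk -F'\t' -v OFS='\t' '
    # Mask line: slice the previous sequence and the mask at the same offsets.
    match($2, /<+/) {
      print $1, substr(prevseq, RSTART, RLENGTH)
      print $1, substr($2, RSTART, RLENGTH)
      next
    }
    # Any other line is treated as a sequence line and remembered.
    { prevseq = $2 }
  ' "$@"
}

printf 'NC_013791.2.trna2\tGCTCAGCTGGCtAGAG\nNC_013791.2.trna2\t>>>>.........<<<\n' | extract_tsv
```

Matching against $2 instead of $0 keeps RSTART relative to the sequence column, so the ID column never shifts the offsets.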

edited Apr 29, 2024 at 0:12

answered Apr 24, 2024 at 18:08


4 Comments

Pandu C Over a year ago

Thank you very much, it works. Could you please give some explanation, or point me to something I can read (as I am new)? I tried the awk command you provided on input with a header and seq+covariance (as below, tab separated), but it wiped all the characters other than the required ones.
NC_013791.2.trna2 GCTCAGCTGGCtAGAG
NC_013791.2.trna2 >>>>.........<<<
expected
NC_013791.2.trna2 GAG
NC_013791.2.trna2 <<<
Could you please help with how to modify the command? Thank you in advance.

stuffy Over a year ago

I don't understand how the input goes from 8 lines to an output of 4 lines.

Pandu C Over a year ago

@stuffy, sorry for the confusion; see the edit: 8 input lines give 8 output lines. Would you mind modifying your code so that the output includes the headers, as requested and as shown/modified by Potong? Thank you to both of you.

stuffy Over a year ago

I added headers and format with -o-

Ed Morton · Accepted Answer · 2024-05-01 19:57:51Z

1

Using any awk plus tac:

$ cat tst.awk
match($3,/<+/) {
    start = RSTART
    lgth = RLENGTH
}
{
    $3 = substr($3,start,lgth)
    print
}

$ tac file | awk -f tst.awk | tac
NC_013791.2.2 : GAG
NC_013791.2.2 : <<<
NC_013791.2.3 : CTGG
NC_013791.2.3 : <<<<
NC_013791.2.4 : GGA
NC_013791.2.4 : <<<
NC_013791.2.5 : CTGA
NC_013791.2.5 : <<<<

answered May 1, 2024 at 19:57


potong · Accepted Answer · 2024-04-28 11:53:04Z

0

This might work for you (GNU sed):

sed -E 'N;:a;s/^.(.*\n)[^<]|.(\n.*)[^<]$/\1\2/;ta;' file

The -E option enables extended regular expressions.

N appends the following line to the pattern space.

:a introduces a loop label.

The substitution nibbles away at the front and back of both lines until only the mask and the part of the sequence under it remain; ta repeats the loop for as long as a substitution succeeds.
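
The steps above can be spelled out as a commented multi-line script. This is only a sketch of the same one-liner, assuming GNU sed and input without headers (a sequence line followed by its mask line); the function name trim_to_mask is hypothetical.

```shell
trim_to_mask() {
  sed -E '
    # N: append the next input line, so the pattern space is "sequence\nmask".
    N
    # :a defines a label named "a" to jump back to.
    :a
    # First alternative: drop the first character of both lines while the
    # mask does not start with "<". Second alternative: drop the last
    # character of both lines while the mask does not end with "<".
    s/^.(.*\n)[^<]|.(\n.*)[^<]$/\1\2/
    # ta: if the substitution changed something, branch to "a" and try again.
    ta
  ' "$@"
}

printf '%s\n' 'GCTCAGCTGGCtAGAG' '>>>>.........<<<' | trim_to_mask
# GAG
# <<<
```

Only one alternative matches per pass, so one of \1 and \2 is always empty; writing both in the replacement keeps whichever part was captured.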

Subsequent to further clarification(?), perhaps:

cat <<\! > file
NC_013791.2.2 : GCTCAGCTGGCtAGAG
NC_013791.2.2 : >>>>.........<<<
!

cat <<\! > file1
NC_013791.2.3 : GCTCAGCTGGCtAGAG
NC_013791.2.3 : >>>>..<<<<......
!

cat <<\! > file2
NC_013791.2.2 : GCTCAGCTGGCtAGAG
NC_013791.2.2 : >>>>.........<<<
NC_013791.2.3 : GCTCAGCTGGCtAGAG
NC_013791.2.3 : >>>>..<<<<......
NC_013791.2.4 : GCTCAGCTGGCtAGAG
NC_013791.2.4 : >>>>.........<<<
NC_013791.2.5 : GCTCAGCTGGCtAGAG
NC_013791.2.5 : >>>>..<<<<......
!

sed -E 'N;:a;s/^(.*: ).(.*\n\1)[^<]|.(\n.*)[^<]$/\1\2\3/;ta' file file1
NC_013791.2.2 : GAG
NC_013791.2.2 : <<<
NC_013791.2.3 : CTGG
NC_013791.2.3 : <<<<

sed -E 'N;:a;s/^(.*: ).(.*\n\1)[^<]|.(\n.*)[^<]$/\1\2\3/;ta' file2
NC_013791.2.2 : GAG
NC_013791.2.2 : <<<
NC_013791.2.3 : CTGG
NC_013791.2.3 : <<<<
NC_013791.2.4 : GAG
NC_013791.2.4 : <<<
NC_013791.2.5 : CTGG
NC_013791.2.5 : <<<<
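
For readers wondering what \1, \2 and \3 refer to in the command above, here is the same script laid out with comments. It is a sketch assuming GNU sed and the "ID : value" layout from the question; the function name trim_with_header is hypothetical.

```shell
trim_with_header() {
  sed -E '
    # N joins each sequence line with the mask line that follows it.
    N
    :a
    # \1 = the "ID : " prefix of line 1; it must reappear after the newline.
    # \2 = the rest of line 1, the newline, and the repeated prefix
    #      (front trimming: one character after each prefix is dropped).
    # \3 = the newline and line 2 without its last character
    #      (back trimming: the last character of each line is dropped).
    s/^(.*: ).(.*\n\1)[^<]|.(\n.*)[^<]$/\1\2\3/
    # Repeat for as long as a substitution was made.
    ta
  ' "$@"
}

printf '%s\n' \
  'NC_013791.2.2 : GCTCAGCTGGCtAGAG' \
  'NC_013791.2.2 : >>>>.........<<<' \
  'NC_013791.2.3 : GCTCAGCTGGCtAGAG' \
  'NC_013791.2.3 : >>>>..<<<<......' | trim_with_header
# NC_013791.2.2 : GAG
# NC_013791.2.2 : <<<
# NC_013791.2.3 : CTGG
# NC_013791.2.3 : <<<<
```

Since N is executed once per cycle, the script consumes the input two lines at a time, which is why all pairs can live in a single file.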

edited Apr 28, 2024 at 11:53

answered Apr 25, 2024 at 7:33


7 Comments

Pandu C Over a year ago

Thank you very much. I tried your command, but it only works on alternate pairs (if a file has 10 pairs, it works on pairs 1, 3 and so on).

potong Over a year ago

@PanduC Perhaps you could show the exact input data and the exact form of the output data (not as jpg but as clear text).

Pandu C Over a year ago

Hi Stuffy and Potong, I have edited the question and hope it is clear now. Please let me know if it is not. Thank you in advance.

Pandu C Over a year ago

@Potong: thank you, it works with two separate files (file + file1) as given, but I cannot get it to work on a single file (I have ~46k IDs in a single file). Could you please let me know how to modify it to work on a single file, and explain what N, a, \1, \2\3 and ta do?

potong Over a year ago

@PanduC what does the single file look like?
