I (think I) am quite experienced in Perl, but I have a nasty problem I'm trying to solve. I have to match a string (produced by bioinformatics software, so I cannot change its format) in this format:

[\+\-][0-9]+[ACGTacgt]+

This would be easy, except that the number of repeats of [ACGTacgt] is not simply "1 or more" but exactly the number given by [0-9]+, so it can be

[...whatever...]+2ac[...whatever...]
+4acta
+3atg

etc.

Now, to test whether the regex works, I'm just playing with a substitution, and I tried the following:

$mystring =~ s/[\+\-]([0-9]+)[ACGTacgt]{\1}//g

Unfortunately this does not work, and I get an error complaining about unescaped braces. If I put a literal number in place of \1, it works:

$mystring =~ s/[\+\-]([0-9]+)[ACGTacgt]{1}//g

I need this to work because the input might contain sequences like ac.,.+2caaa..a.c, from which I have to pick out exactly the +2ca, separately from the rest.

Is it possible in one step, or is there a logical reason I'm missing for why it's not possible?

Thanks for any help or suggestions!

berutti

2 Answers

The {N} part of a regex is a quantifier, and its count can't come from a backreference: {\1} is not parsed as a quantifier at all. You could work around it with an embedded Perl expression:

use strict;
use warnings;
my $string = 'ac.,.+2caaa..a.c';
$string =~ s/[+-]([0-9]+)(??{ "[ACGTacgt]{$1}" })//g;
print "$string\n";

Note that embedded subexpressions are a last resort and, because the subpattern is only known at match time, they prevent the regex from being optimized properly. IMO that is an appropriate tradeoff for this exact case, where the matched substring must be removed, but if your requirements are slightly different, a split-out iterative approach may be more appropriate.
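
If the removed pieces are needed as well, the same construct can sit inside a capture group and the replacement can record each hit as it goes. This is only a minimal sketch under that assumption; the helper name strip_indels is made up for illustration and is not part of any library:

```perl
use strict;
use warnings;

# Hypothetical helper (name is an assumption): removes every [+-]N<letters>
# token from the input and returns the cleaned string followed by the tokens.
sub strip_indels {
    my ($string) = @_;
    my @found;

    # Group 1 is the sign, group 2 the declared length, group 3 the letters.
    # With /e the replacement is code: remember the token, then return ''.
    $string =~ s/([+-])([0-9]+)((??{ "[ACGTacgt]{$2}" }))/push @found, "$1$2$3"; ''/ge;

    return ($string, @found);
}

my ($clean, @tokens) = strip_indels('ac.,.+2caaa..a.c');
print "$clean\n";     # ac.,.aa..a.c
print "@tokens\n";    # +2ca
```

A token whose declared length exceeds the letters that follow (for example +5ac) does not match as a whole, so it is left in the string untouched.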

8 Comments

I think you should mention where constructs like (??{...}) stand in terms of good practice.
I don't think there's anything wrong with the usage here. The only thing to mention would be that it would not be the most performant solution, but I am not sure any other solution would do better, depending on the requirements.
See this post for instance (and another above it, commenting on it). It's not "wrong" it's just tricky and error-prone. And some uses are disallowed. Etc.
I agree it's not ideal to have to resort to subexpressions, but depending on the requirements, the tradeoff is worth it in this case IMO.
Of course, it's a trade-off (there's always some of that!), all I am saying is that I think it should be mentioned in the answer.

You can iterate over the numbers and, in the loop body, match the captured number of letters that follow:

use warnings;
use strict;
use feature 'say';

my $s = q(ac.,.+2caaa..a.c-3acgg+1tt);

while ($s =~ /[+-]([0-9]+)/g) { 
    my $c = $1; 
    $s =~ /\G([acgt]{$c})/i or next;

    say "$c$1";  # or process it further / store it ...
}

The \G assertion makes its regex start matching from where the previous m//g match ended, as needed here. This is a standard way to "chain" global matches and, more generally, to scan text by coordinating multiple regexes. See the docs for it under Assertions in perlre and, in far more detail, in perlop (search for \G).

Prints

2ca
3acg
1t

If the [+-] needs to be extracted as well, add capturing parens around it and renumber the captures (the sign will be in $1 and the number in $2).
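
To see what \G does in isolation, here is a small sketch separate from the loop above. The helper name digits_after_letters and the sample strings are made up for illustration. It relies on a scalar-context m//g match setting pos(), and on a later match without /g still honoring \G while leaving pos() alone:

```perl
use strict;
use warnings;

# Hypothetical helper (name is an assumption): find a run of letters with a
# /g match, then read the digits that must follow immediately after it.
sub digits_after_letters {
    my ($text) = @_;

    # A scalar-context /g match records where it stopped in pos($text).
    scalar($text =~ /[a-z]+/g) or return;
    my $where = pos($text);

    # \G pins this match to pos($text); it will not skip ahead to find digits.
    my ($digits) = $text =~ /\G([0-9]+)/;

    return ($where, $digits);
}

my ($where, $digits) = digits_after_letters('ab12cd');
print "pos=$where digits=$digits\n";                        # pos=2 digits=12

# Digits exist later in 'ab-12', but not at pos(), so the \G match fails.
my (undef, $none) = digits_after_letters('ab-12');
print defined $none ? "matched\n" : "no match at pos\n";    # no match at pos
```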

Please clarify other requirements -- for instance: Do you only need to extract the patterns or should anything in particular happen with the original string as well?


Update: it has been clarified that the matches also need to be removed from the string.

An easy way is to simply remove them with another regex, after they have been collected.

After the same processing as above, the collected matches are used to build an alternation pattern for their removal. This is also efficient since, by construction, the subpatterns in the alternation come in the order of their appearance in the string.

use warnings;
use strict;
use feature 'say';

my $string = q(ac.,.+2caaa..a.c-3acgg+1tt);

my @matches;

while ($string =~ /([+-])([0-9]+)/g) { 
    my ($sign, $count)  = ($1, $2);
    $string =~ /\G([acgt]{$count})/i or next;    
    push @matches, $sign.$count.$1; 
}    
say for @matches;

my $matches_re = '(?:' . join('|', map { quotemeta } @matches) . ')';

$string =~ s/$matches_re//g;    
say $string;

where I've now joined the sign [+-] to the match.

It prints

+2ca
-3acg
+1t
ac.,.aa..a.cgt
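
The removal can also be done without building a second pattern, by remembering where each token sits and cutting it out afterwards. This is a rough sketch of that variant, not a drop-in replacement; the helper name extract_and_remove is made up for illustration. It reads offsets from @- and @+ and deletes with four-argument substr:

```perl
use strict;
use warnings;

# Hypothetical helper (name is an assumption): collect the tokens as above,
# record each token's offset and length, then cut the tokens out by position.
sub extract_and_remove {
    my ($string) = @_;
    my (@matches, @spans);

    while ($string =~ /([+-])([0-9]+)/g) {
        # $-[0] is the offset where this match (the sign) starts.
        my ($sign, $count, $start) = ($1, $2, $-[0]);
        $string =~ /\G([acgt]{$count})/i or next;
        push @matches, $sign . $count . $1;
        # $+[0] is now the offset just past the letters matched above.
        push @spans, [ $start, $+[0] - $start ];
    }

    # Cut from the end backwards so the earlier offsets stay valid.
    substr($string, $_->[0], $_->[1], '') for reverse @spans;

    return ($string, @matches);
}

my ($rest, @found) = extract_and_remove('ac.,.+2caaa..a.c-3acgg+1tt');
print "$_\n" for @found;    # +2ca, -3acg, +1t (one per line)
print "$rest\n";            # ac.,.aa..a.cgt
```

The string is only modified after the loop has finished, because changing it mid-scan would reset pos() and disturb the m//g iteration.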

7 Comments

I tried to write a solution along these lines but it becomes extremely complicated if removing the string is a requirement as in the OP. You should also check whether the second regex succeeds, in case there are instances that don't have sufficient following letters.
@Grinnz Well, yeah, good points but we omitted all error checking (your solution as well). I'd expect them to add that. Here at least it's easy to add any and all checks.
@Grinnz "removing the string is a requirement" -- I didn't see it that way, on the contrary, I think they need to extract the pattern (and never mind the original string). It says (my emphasis) "to test if the regex work I'm just playing with a substitution..." ...
The original code does not extract the string - the exact requirement is unclear, hence why I said "if" it's a requirement.
There is no error checking required in my solution, as any parts of the string that do not match the whole expression will not be affected.