Perl match with regex a number and as many following characters as the number specifies within a string

Question

I consider myself fairly experienced in Perl, but I'm stuck on a nasty problem. I have to match a string in the following format (it is produced by a bioinformatics program, so I cannot change it):

[\+\-][0-9]+[ACGTacgt]+

This would be easy, except that the number of repeats of [ACGTacgt] is not "one or more" but exactly the number given by [0-9]+, so it can be

[...whatever...]+2ac[...whatever...]
+4acta
+3atg

etc..

Now, to test whether the regex works, I'm just playing with a substitution, and I tried the following:

$mystring =~ s/[\+\-]([0-9]+)[ACGTacgt]{\1}//g

Unfortunately this does not work: I get an error complaining about unescaped braces. If I put a literal number in place of \1, it works:

$mystring =~ s/[\+\-]([0-9]+)[ACGTacgt]{1}//g

I need this to work because the input might contain sequences like ac.,.+2caaa..a.c, from which I have to pick out exactly +2ca, separately from the rest.

Is this possible in one step, or is there a logical reason I'm missing for why it isn't?

Thanks for any help or suggestions!

berutti

Grinnz · Accepted Answer · 2019-11-25 23:00:58Z

3

The {N} component of the regex is a quantifier, and its count can't be a backreference. You could work around that with a postponed subexpression, (??{ ... }), whose embedded Perl code returns a pattern that is compiled at match time:

use strict;
use warnings;
my $string = 'ac.,.+2caaa..a.c';
$string =~ s/[+-]([0-9]+)(??{ "[ACGTacgt]{$1}" })//g;
print "$string\n";

Note that embedded subexpressions are a last resort: the subpattern is only known at match time, which prevents the regex from being optimized properly. IMO it is an appropriate tradeoff for this exact case, where the matched substring must be removed, but if your requirements are slightly different, a split-out iterative approach may be more appropriate.
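As a rough illustration of what a split-out iterative approach could look like when the matches must be removed, here is a minimal sketch that avoids (??{ ... }) altogether. The helper name strip_indels is hypothetical (it is not from the question or the answer), and the sketch assumes that a sign and number not followed by enough valid letters should be left untouched, which is the same behaviour as the substitution above.

```perl
use strict;
use warnings;

# Hypothetical helper: remove every [+-]N followed by exactly N letters
# from [ACGTacgt]; anything that does not fit is left as it is.
sub strip_indels {
    my ($in) = @_;
    my $out  = '';
    my $last = 0;    # offset just past the last removed span

    while ($in =~ /[+-]([0-9]+)/g) {
        # $-[0] is where the sign starts, $+[0] is just past the digits
        my ($start, $after_num, $n) = ($-[0], $+[0], $1);

        # Take up to $n characters after the number and validate them
        my $bases = substr($in, $after_num, $n);
        next unless length($bases) == $n && $bases !~ /[^ACGTacgt]/;

        $out .= substr($in, $last, $start - $last);   # keep the text in between
        $last = $after_num + $n;
        pos($in) = $last;    # resume scanning after the removed letters
    }
    return $out . substr($in, $last);
}

print strip_indels('ac.,.+2caaa..a.c'), "\n";    # ac.,.aa..a.c
```

Because the letters are checked with substr and a negated character class rather than a {N} quantifier, this version is also not subject to the upper limit that Perl places on quantifier counts.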

edited Nov 25, 2019 at 23:00

answered Nov 25, 2019 at 21:15

Grinnz


8 Comments

zdim Over a year ago

I think you should mention how constructs like (??{...}) stand with respect to good practice.

Grinnz Over a year ago

I don't think there's anything wrong with the usage here. The only thing to mention would be that it would not be the most performant solution, but I am not sure any other solution would do better, depending on the requirements.

zdim Over a year ago

See this post for instance (and another above it, commenting on it). It's not "wrong" it's just tricky and error-prone. And some uses are disallowed. Etc.

Grinnz Over a year ago

I agree it's not ideal to have to resort to subexpressions, but depending on the requirements, the tradeoff is worth it in this case IMO.

zdim Over a year ago

Of course, it's a trade-off (there's always some of that!), all I am saying is that I think it should be mentioned in the answer.


zdim · Accepted Answer · 2019-12-03 08:27:03Z

1

You can iterate over the numbers and, in the loop body, match the captured number of letters that follow:

use warnings;
use strict;
use feature 'say';

my $s = q(ac.,.+2caaa..a.c-3acgg+1tt);

while ($s =~ /[+-]([0-9]+)/g) { 
    my $c = $1; 
    $s =~ /\G([acgt]{$c})/i or next;

    say "$c$1";  # or process it further / store it ...
}

The \G assertion makes its regex start matching from where the previous m//g match ended, which is what is needed here. This is a standard approach to "chain global matches" and, more generally, to scan text by coordinating multiple regexes. See the docs for it under Assertions in perlre and, in far more detail, in perlop (search for \G).

Prints

2ca
3acg
1t
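To see the \G step in isolation, here is a small sketch on a made-up input string (the string and variable names are assumptions for illustration only). It shows that the m//g match leaves pos() just past the digits, and that the second match, which has no /g, is anchored there by \G without moving pos().

```perl
use strict;
use warnings;

my $s = 'xx+3acgtt';    # made-up input

# First match: find the sign and the count; /g records where it ended
$s =~ /[+-]([0-9]+)/g or die "no count found";
my $count  = $1;
my $resume = pos($s);    # offset just past the digits

# Second match: \G anchors it at pos($s); without /g it does not change pos
my $bases = $s =~ /\G([acgt]{$count})/i ? $1 : undef;

print "count=$count resume=$resume bases=$bases\n";    # count=3 resume=4 bases=acg
print "pos is still ", pos($s), "\n";
```

Since the second match leaves pos() alone, the enclosing while loop in the answer carries on from the end of the number whether or not the letters matched.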

If the [+-] needs to be extracted as well, add capturing parentheses around it and renumber the captures (the sign will then be in $1 and the number in $2).

Please clarify other requirements -- for instance: Do you only need to extract the patterns or should anything in particular happen with the original string as well?

Update: It has been clarified that the matches also need to be removed from the string.

An easy way is to simply remove them with another regex, after they have been collected.

After the same processing as above, the collected matches are used to form a pattern with alternation for their removal. This is also efficient since, by construction, the subpatterns in the alternation come in the order of their appearance in the string:

use warnings;
use strict;
use feature 'say';

my $string = q(ac.,.+2caaa..a.c-3acgg+1tt);

my @matches;

while ($string =~ /([+-])([0-9]+)/g) { 
    my ($sign, $count)  = ($1, $2);
    $string =~ /\G([acgt]{$count})/i or next;    
    push @matches, $sign.$count.$1; 
}    
say for @matches;

my $matches_re = '(?:' . join('|', map { quotemeta } @matches) . ')';

$string =~ s/$matches_re//g;    
say $string;

where I've now joined the sign [+-] to the match.

It prints

+2ca
-3acg
+1t
ac.,.aa..a.cgt

edited Dec 3, 2019 at 8:27

answered Nov 25, 2019 at 21:46

zdim


7 Comments

Grinnz Over a year ago

I tried to write a solution along these lines but it becomes extremely complicated if removing the string is a requirement as in the OP. You should also check whether the second regex succeeds, in case there are instances that don't have sufficient following letters.

zdim Over a year ago

@Grinnz Well, yeah, good points but we omitted all error checking (your solution as well). I'd expect them to add that. Here at least it's easy to add any and all checks.

zdim Over a year ago

@Grinnz "removing the string is a requirement" -- I didn't see it that way, on the contrary, I think they need to extract the pattern (and never mind the original string). It says (my emphasis) "to test if the regex work I'm just playing with a substitution..." ...

Grinnz Over a year ago

The original code does not extract the string - the exact requirement is unclear, hence why I said "if" it's a requirement.

Grinnz Over a year ago

There is no error checking required in my solution, as any parts of the string that do not match the whole expression will not be affected.
