5

I've got a string:

s = ".,-2gg,,,-2gg,-2gg,,,-2gg,,,,,,,,t,-2gg,,,,,,-2gg,t,,-1gtt,,,,,,,,,-1gt,-3ggg"

and a regular expression I'm using:

import re
delre = re.compile('-[0-9]+[ACGTNacgtn]+')  # almost correct: the letter run is greedy and ignores the count
print(delre.findall(s))

This returns:

['-2gg', '-2gg', '-2gg', '-2gg', '-2gg', '-2gg', '-1gtt', '-1gt', '-3ggg']

But -1gtt and -1gt are not the desired matches. The integer defines how many of the following characters belong to the match, so the desired output for those two is -1g and -1g, respectively.

Is there a way to grab the integer after the dash and dynamically define the regex so that it matches that many and only that many subsequent characters?

3
  • 1
    Is there a limit to how big this integer can be for acceptable matches? Commented Aug 20, 2024 at 17:21
  • 1
    @ScottHunter for almost all cases the integer can be assumed to be <50 Commented Aug 20, 2024 at 17:23
  • 2
    So you could make a pattern for each specific integer, and OR them together. Maybe a bit impractical for ~50, but that's your call. @jonrsharpe's suggestion is probably the way to go. Commented Aug 20, 2024 at 17:46
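The "one pattern per integer, OR'd together" idea from the comment above can be sketched roughly as follows. This is only an illustration: the upper bound MAX_LEN = 50 is an assumption taken from the "<50" remark, and the names BASES, MAX_LEN and alternatives are made up for this example.

```python
import re

BASES = "ACGTNacgtn"
MAX_LEN = 50  # assumed upper bound, based on the "<50" comment

# Build one alternative per count: 1 followed by 1 base, 2 followed by 2 bases, ...
# e.g. "1[ACGTNacgtn]{1}|2[ACGTNacgtn]{2}|..."
alternatives = "|".join(f"{n}[{BASES}]{{{n}}}" for n in range(1, MAX_LEN + 1))
delre = re.compile(f"-(?:{alternatives})")

s = ".,-2gg,,,-2gg,-2gg,,,-2gg,,,,,,,,t,-2gg,,,,,,-2gg,t,,-1gtt,,,,,,,,,-1gt,-3ggg"

# The group is non-capturing, so findall returns the whole match
matches = delre.findall(s)
print(matches)
```

Because digits are not in the base alphabet, a short alternative such as 1 cannot accidentally match the start of -10...; the alternation has to use the full number. Counts written with leading zeros (e.g. -02gg) or larger than MAX_LEN would not match with this sketch.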

2 Answers

6

Here is an alternative solution using re.sub that avoids an explicit loop:

import re

# surround [0-9]+ and [ACGTNacgtn]+ in parentheses to create two capture groups;
# the [^-]* on either side consumes the text between matches so that sub drops it
delre = re.compile('[^-]*-([0-9]+)([ACGTNacgtn]+)[^-]*')

s = ".,-2gg,,,-2gg,-2gg,,,-2gg,,,,,,,,t,-2gg,,,,,,-2gg,t,,-1gtt,,,,,,,,,-1gt,-3ggg"

print(delre.sub(lambda m: f"-{m.group(1)}{m.group(2)[:int(m.group(1))]}\n", s))

Output:

-2gg
-2gg
-2gg
-2gg
-2gg
-2gg
-1g
-1g
-3ggg

Or, if you want the output as a list, use:

arr = delre.sub(lambda m: f"-{m.group(1)}{m.group(2)[:int(m.group(1))]} ", s).split()
print(arr)

Output:

['-2gg', '-2gg', '-2gg', '-2gg', '-2gg', '-2gg', '-1g', '-1g', '-3ggg']

1 Comment

Nice! Sometimes I forget how much you can do with sub
4

You can't do this with the pattern alone in Python's re module, because a quantifier can't take its count from a captured group. You can, however, use capture groups to separate the integer and character portions of the match, and then trim the character portion to the appropriate length.

import re

# surround [0-9]+ and [ACGTNacgtn]+ in parentheses to create two capture groups
delre = re.compile('-([0-9]+)([ACGTNacgtn]+)')  

s = ".,-2gg,,,-2gg,-2gg,,,-2gg,,,,,,,,t,-2gg,,,,,,-2gg,t,,-1gtt,,,,,,,,,-1gt,-3ggg"

# each match should be a tuple of (number, letter(s)), e.g. ('1', 'gtt') or ('2', 'gg')
for number, bases in delre.findall(s):
    # print the number, then use slicing to truncate the string portion
    print(f'-{number}{bases[:int(number)]}')

This prints

-2gg
-2gg
-2gg
-2gg
-2gg
-2gg
-1g
-1g
-3ggg

You'll more than likely want to do something other than print, but you can format the matched strings however you need!

NOTE: this fails when the integer is followed by fewer matching characters than it specifies, e.g. -10agcta is still a match even though only 5 characters follow the integer.
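If that edge case matters, one possible stricter variant is to find the "-<number>" header with a regex and then check the following characters by hand, skipping any entry that has too few bases. This is a minimal sketch, not part of the original answer: find_deletions, HEADER and BASES are hypothetical names, and rejecting a count of 0 is an assumption.

```python
import re

HEADER = re.compile(r"-([0-9]+)")   # the dash and the count
BASES = set("ACGTNacgtn")           # allowed characters after the count

def find_deletions(s):
    """Return '-<n><bases>' tokens where exactly n valid bases follow the count."""
    results = []
    pos = 0
    while True:
        m = HEADER.search(s, pos)
        if m is None:
            break
        n = int(m.group(1))
        bases = s[m.end():m.end() + n]
        # accept only if n > 0 and all n characters are present and valid
        if n > 0 and len(bases) == n and all(c in BASES for c in bases):
            results.append(f"-{m.group(1)}{bases}")
            pos = m.end() + n       # continue after the consumed bases
        else:
            pos = m.end()           # skip this header and keep scanning
    return results

s = ".,-2gg,,,-2gg,-2gg,,,-2gg,,,,,,,,t,-2gg,,,,,,-2gg,t,,-1gtt,,,,,,,,,-1gt,-3ggg"
print(find_deletions(s))
```

With this version, an entry such as -10agcta is skipped rather than reported, since only 5 bases follow the count.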

2 Comments

Nice, thanks! It can be safely assumed that the integer and the subsequent number of bases always match up. The string in this case is output from the samtools mpileup command, so it has probably been thoroughly tested.
@Ryan Glad I could help! If you're confident the data coming in will always line up then this should be totally serviceable. I'm not familiar with samtools or mpileup, but those letters screamed DNA to me so I took a guess.
