546

Let's say I have a string 'gfgfdAAA1234ZZZuijjk' and I want to extract just the '1234' part.

I only know the few characters directly before and after the part I am interested in (1234): AAA comes right before it and ZZZ right after it.

With sed it is possible to do something like this with a string:

echo "$STRING" | sed -e "s|.*AAA\(.*\)ZZZ.*|\1|"

And this will give me 1234 as a result.

How to do the same thing in Python?

    one liner with python 3.8 text[text.find(start:='AAA')+len(start):text.find('ZZZ')] Commented Jun 18, 2021 at 19:19

25 Answers 25

886

Using regular expressions - documentation for further reference

import re

text = 'gfgfdAAA1234ZZZuijjk'

m = re.search('AAA(.+?)ZZZ', text)
if m:
    found = m.group(1)

# found: 1234

or:

import re

text = 'gfgfdAAA1234ZZZuijjk'

try:
    found = re.search('AAA(.+?)ZZZ', text).group(1)
except AttributeError:
    # AAA, ZZZ not found in the original string
    found = '' # apply your error handling

# found: 1234

14 Comments

The second solution is better if the pattern matches most of the time, because it's Easier to Ask for Forgiveness than Permission (EAFP).
Doesn't the indexing start at 0? So you would need to use group(0) instead of group(1)?
@Alexander, no, group(0) will return full matched string: AAA1234ZZZ, and group(1) will return only characters matched by first group: 1234
@Bengt: Why is that? The first solution looks quite simple to me, and it has fewer lines of code.
In this expression the ? modifies the + to be non-greedy, i.e. it will match one or more times but as few as possible, expanding only as necessary. Without the ?, the first group would match 2ZZZkeAAA43 in gfgfAAA2ZZZkeAAA43ZZZonife, but with the ? it matches only the 2; searching for multiple matches (or stripping out the first match and searching again) would then also find the 43.
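The greedy versus non-greedy behaviour described in the comment above can be checked directly. This is a minimal sketch using the comment's own example string; the variable names are chosen here for illustration only.

```python
import re

text = 'gfgfAAA2ZZZkeAAA43ZZZonife'

# Greedy: .+ runs to the LAST 'ZZZ' in the string.
greedy = re.search('AAA(.+)ZZZ', text).group(1)

# Non-greedy: .+? stops at the FIRST 'ZZZ' after 'AAA'.
lazy = re.search('AAA(.+?)ZZZ', text).group(1)

# findall with the non-greedy pattern returns every marked section.
all_lazy = re.findall('AAA(.+?)ZZZ', text)

print(greedy)    # 2ZZZkeAAA43
print(lazy)      # 2
print(all_lazy)  # ['2', '43']
```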
166
>>> s = 'gfgfdAAA1234ZZZuijjk'
>>> start = s.find('AAA') + 3
>>> end = s.find('ZZZ', start)
>>> s[start:end]
'1234'

Then you can use regexps with the re module as well, if you want, but that's not necessary in your case.

5 Comments

The question seems to imply that the input text will always contain both "AAA" and "ZZZ". If this is not the case, your answer fails horribly (by that I mean it returns something completely wrong instead of an empty string or throwing an exception; think "hello there" as input string).
@user225312 Is the re method not faster though?
Voteup, but I would use "x = 'AAA' ; s.find(x) + len(x)" instead of "s.find('AAA') + 3" for maintainability.
If any of the tokens can't be found in the s, s.find will return -1. the slicing operator s[begin:end] will accept it as valid index, and return undesired substring.
@confused00 find is much faster than re stackoverflow.com/questions/4901523/…
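The failure mode raised in the comments above can be reproduced in a few lines, followed by one possible guard. This is only a sketch: the helper name between and its None return value are assumptions made for illustration, not part of the answer.

```python
s = 'hello there'  # contains neither marker

# Unchecked version: find() returns -1 for a missing marker,
# and slicing silently accepts the resulting indices.
start = s.find('AAA') + 3    # -1 + 3 == 2
end = s.find('ZZZ', start)   # -1
unchecked = s[start:end]     # s[2:-1]
print(repr(unchecked))       # 'llo ther'


def between(s, start_marker, end_marker):
    """Return the text between the markers, or None if either is missing.

    Hypothetical helper; the None convention is an assumption.
    """
    start = s.find(start_marker)
    if start == -1:
        return None
    start += len(start_marker)
    end = s.find(end_marker, start)
    if end == -1:
        return None
    return s[start:end]


print(between('gfgfdAAA1234ZZZuijjk', 'AAA', 'ZZZ'))  # 1234
print(between(s, 'AAA', 'ZZZ'))                       # None
```

Using len(start_marker) instead of a hard-coded 3 also follows the maintainability suggestion made in the comments.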
136

regular expression

import re

re.search(r"(?<=AAA).*?(?=ZZZ)", your_text).group(0)

The above as-is will fail with an AttributeError if there are no "AAA" and "ZZZ" in your_text.

string methods

your_text.partition("AAA")[2].partition("ZZZ")[0]

The above will return an empty string if "AAA" doesn't exist in your_text. If "AAA" is present but "ZZZ" is not, it returns everything after "AAA".

PS Python Challenge?
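If a missing marker has to be told apart from an empty match, the separator element that str.partition returns can be inspected: it is an empty string when the separator was not found. A minimal sketch, assuming a hypothetical helper name between_partition and a None return value for "not found":

```python
def between_partition(text, start, end):
    """Text between start and end, or None if either marker is missing.

    Hypothetical helper; partition() returns (head, sep, tail) and
    sep is '' when the separator does not occur.
    """
    _, found_start, rest = text.partition(start)
    middle, found_end, _ = rest.partition(end)
    if not found_start or not found_end:
        return None
    return middle


print(between_partition('gfgfdAAA1234ZZZuijjk', 'AAA', 'ZZZ'))  # 1234
print(repr(between_partition('AAAZZZ', 'AAA', 'ZZZ')))          # ''
print(between_partition('xxAAA1234', 'AAA', 'ZZZ'))             # None

# For comparison, the plain one-liner on a string without 'ZZZ':
print('xxAAA1234'.partition('AAA')[2].partition('ZZZ')[0])      # 1234
```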

4 Comments

This answer probably deserves more up votes. The string method is the most robust way. It does not need a try/except.
... nice, though limited. partition is not regex based, so it only works in this instance because the search string was bounded by fixed literals
Great, many thanks! - this works for strings and does not require regex
Upvoting for the string method, there is no need for regex in something this simple, most languages have a library function for this
49

Surprised that nobody has mentioned this which is my quick version for one-off scripts:

>>> x = 'gfgfdAAA1234ZZZuijjk'
>>> x.split('AAA')[1].split('ZZZ')[0]
'1234'

3 Comments

@user1810100 mentioned essentially that almost exactly 5 years to the day before you posted this...
Adding an if s.find("ZZZ") > s.find("AAA"): to it, avoids issues if 'ZZZ' isn't in the string, which would return '1234uuijjk'
@tzot's answer (stackoverflow.com/a/4917004/358532) with partition instead of split seems more robust (depending on your needs), as it returns an empty string if one of the substrings isn't found.
27

You can do this using just one line of code:

>>> import re
>>> re.findall(r'\d{1,5}', 'gfgfdAAA1234ZZZuijjk')
['1234']

The result is a list of all runs of one to five digits; note that this pattern ignores the AAA and ZZZ markers.

Comments

19
import re
print(re.search('AAA(.*?)ZZZ', 'gfgfdAAA1234ZZZuijjk').group(1))

1 Comment

AttributeError: 'NoneType' object has no attribute 'groups' - if there is no AAA, ZZZ in the string...
12

You can use the re module for that:

>>> import re
>>> re.compile(".*AAA(.*)ZZZ.*").match("gfgfdAAA1234ZZZuijjk").groups()
('1234',)

Comments

12

In Python, extracting a substring from a string can be done using the findall method in the regular expression (re) module.

>>> import re
>>> s = 'gfgfdAAA1234ZZZuijjk'
>>> ss = re.findall('AAA(.+)ZZZ', s)
>>> print(ss)
['1234']

Comments

8
text = 'I want to find a string between two substrings'
left = 'find a '
right = 'between two'

print(text[text.index(left)+len(left):text.index(right)])

Gives

string

1 Comment

If the text does not include the markers, this throws a ValueError: substring not found exception. That is good.
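Because str.index raises instead of returning -1, a missing marker can be handled explicitly. A small sketch, assuming a hypothetical wrapper named between_index; passing the start position as the second argument to index is an addition here so that the right marker is only searched for after the left one.

```python
def between_index(text, left, right):
    """Text between left and right; raises ValueError if either is missing.

    Hypothetical wrapper around str.index for illustration.
    """
    start = text.index(left) + len(left)
    end = text.index(right, start)  # search only after the left marker
    return text[start:end]


text = 'I want to find a string between two substrings'
print(repr(between_index(text, 'find a ', 'between two')))  # 'string '

try:
    between_index('hello there', 'AAA', 'ZZZ')
except ValueError as exc:
    print('marker missing:', exc)
```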
6
>>> s = '/tmp/10508.constantstring'
>>> s.split('/tmp/')[1].split('constantstring')[0].strip('.')
'10508'

Comments

6

With sed it is possible to do something like this with a string:

echo "$STRING" | sed -e "s|.*AAA\(.*\)ZZZ.*|\1|"

And this will give me 1234 as a result.

You could do the same with re.sub function using the same regex.

>>> import re
>>> re.sub(r'.*AAA(.*)ZZZ.*', r'\1', 'gfgfdAAA1234ZZZuijjk')
'1234'

In basic sed, capturing groups are written as \(..\), but in Python they are written as (..).
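One behaviour worth knowing with the re.sub approach: when the pattern does not match, re.sub returns the input unchanged, just as the sed substitution would. A short sketch; the use of re.fullmatch to detect the no-match case is a suggestion added here, not part of the answer.

```python
import re

pattern = r'.*AAA(.*)ZZZ.*'

matched = re.sub(pattern, r'\1', 'gfgfdAAA1234ZZZuijjk')
unmatched = re.sub(pattern, r'\1', 'hello there')
print(matched)    # 1234
print(unmatched)  # hello there  (input returned as-is)

# To distinguish "no match" from a real result, match explicitly.
m = re.fullmatch(pattern, 'hello there')
result = m.group(1) if m else None
print(result)     # None
```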

Comments

5

You can find first substring with this function in your code (by character index). Also, you can find what is after a substring.

def FindSubString(strText, strSubString, Offset=None):
    try:
        Start = strText.find(strSubString)
        if Start == -1:
            return -1 # Not Found
        else:
            if Offset is None:
                Result = strText[Start+len(strSubString):]
            elif Offset == 0:
                return Start
            else:
                AfterSubString = Start+len(strSubString)
                Result = strText[AfterSubString:AfterSubString + int(Offset)]
            return Result
    except:
        return -1

# Example:

Text = "Thanks for contributing an answer to Stack Overflow!"
subText = "to"

print("Start of first substring in a text:")
start = FindSubString(Text, subText, 0)
print(start); print("")

print("Exact substring in a text:")
print(Text[start:start+len(subText)]); print("")

print("What is after substring \"%s\"?" %(subText))
print(FindSubString(Text, subText))

# Your answer:

Text = "gfgfdAAA1234ZZZuijjk"
subText1 = "AAA"
subText2 = "ZZZ"

AfterText1 = FindSubString(Text, subText1, 0) + len(subText1)
BeforText2 = FindSubString(Text, subText2, 0) 

print("\nYour answer:\n%s" %(Text[AfterText1:BeforText2]))

Comments

5

Using PyParsing

import pyparsing as pp

word = pp.Word(pp.alphanums)

s = 'gfgfdAAA1234ZZZuijjk'
rule = pp.nestedExpr('AAA', 'ZZZ')
for match in rule.searchString(s):
    print(match)

which yields:

[['1234']]

Comments

5

One liner with Python 3.8 if text is guaranteed to contain the substring:

text[text.find(start:='AAA')+len(start):text.find('ZZZ')]

2 Comments

Does not work if the text does not contain the markers.
Similar solution by fernando-wittmann using text.index throws exception, allowing detection and forgiveness. stackoverflow.com/a/54975532/2719980
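Besides the missing-marker case, the one-liner searches for 'ZZZ' from the beginning of the string, so an earlier 'ZZZ' gives an empty result. The sketch below uses a made-up input to show this, along with an expanded form that starts the second search after 'AAA'; the variable names are illustrative.

```python
# Hypothetical input: the end marker also occurs before the start marker.
text = 'ZZZ_AAA1234ZZZ'

# The one-liner (Python 3.8+): 'ZZZ' is found at index 0.
one_liner = text[text.find(start := 'AAA') + len(start):text.find('ZZZ')]

# Expanded form: look for 'ZZZ' only after the end of 'AAA'.
begin = text.find('AAA') + len('AAA')
expanded = text[begin:text.find('ZZZ', begin)]

print(repr(one_liner))  # ''
print(repr(expanded))   # '1234'
```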
4

Just in case somebody has to do the same thing that I did: I had to extract everything inside parentheses in a line. For example, if I have a line like 'US president (Barack Obama) met with ...' and I want to get only 'Barack Obama', this is the solution:

import re

regex = r'.*\((.*?)\).*'
matches = re.search(regex, line)
line = matches.group(1) + '\n'

I.e. you need to escape the parentheses with a backslash (\). Though this is more a question about regular expressions than about Python.

Also, in some cases you may see 'r' symbols before regex definition. If there is no r prefix, you need to use escape characters like in C. Here is more discussion on that.
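The effect of the r prefix is easiest to see with an escape that Python itself interprets. In this sketch (the sample text is made up), '\b' in a normal string is a single backspace character, while r'\b' reaches the regex engine as a word boundary.

```python
import re

# In a normal string '\b' is one character (backspace);
# in a raw string it stays as backslash + 'b'.
print(len('\b'), len(r'\b'))  # 1 2

sample = 'cat concat cat'

raw = re.findall(r'\bcat\b', sample)     # word-boundary match
not_raw = re.findall('\bcat\b', sample)  # looks for backspace characters

print(raw)      # ['cat', 'cat']
print(not_raw)  # []
```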

Comments

1

Also, you can find all combinations with the function below:

s = 'Part 1. Part 2. Part 3 then more text'
def find_all_places(text, word):
    # Start index of every non-overlapping occurrence of word in text.
    word_places = []
    i = 0
    while True:
        word_place = text.find(word, i)
        if word_place < 0:
            break
        word_places.append(word_place)
        i = word_place + len(word)  # resume the search after this occurrence
    return word_places

def find_all_combination(text, start, end):
    # Every slice that begins at a start marker and stops before a later end marker.
    start_places = find_all_places(text, start)
    end_places = find_all_places(text, end)
    combination_list = []
    for start_place in start_places:
        for end_place in end_places:
            if start_place >= end_place:
                continue
            combination_list.append(text[start_place:end_place])
    return combination_list

print(find_all_combination(s, "Part", "Part"))

result:

['Part 1. ', 'Part 1. Part 2. ', 'Part 2. ']

Comments

1

In case you want to look for multiple occurrences:

content ="Prefix_helloworld_Suffix_stuff_Prefix_42_Suffix_andsoon"
strings = []
for c in content.split('Prefix_'):
    spos = c.find('_Suffix')
    if spos!=-1:
        strings.append( c[:spos])
print( strings )

Or, more concisely:

strings = [ c[:c.find('_Suffix')] for c in content.split('Prefix_') if c.find('_Suffix')!=-1 ]

Comments

0

Here's a solution without regex that also accounts for scenarios where the first substring contains the second substring. This function will only find a substring if the second marker is after the first marker.

def find_substring(string, start, end):
    len_until_end_of_first_match = string.find(start) + len(start)
    after_start = string[len_until_end_of_first_match:]
    return string[string.find(start) + len(start):len_until_end_of_first_match + after_start.find(end)]

Comments

0

Another way of doing it is using lists (supposing the substring you are looking for is made of numbers, only) :

string = 'gfgfdAAA1234ZZZuijjk'
numbersList = ['0', '1', '2', '3', '4', '5', '6', '7', '8', '9']
output = []

for char in string:
    if char in numbersList: output.append(char)

print(f"output: {''.join(output)}")
### output: 1234

Comments

0

Typescript. Gets string in between two other strings.

Searches shortest string between prefixes and postfixes

prefixes - string / array of strings / null (means search from the start).

postfixes - string / array of strings / null (means search until the end).

public getStringInBetween(str: string, prefixes: string | string[] | null,
                          postfixes: string | string[] | null): string {

    if (typeof prefixes === 'string') {
        prefixes = [prefixes];
    }

    if (typeof postfixes === 'string') {
        postfixes = [postfixes];
    }

    if (!str || str.length < 1) {
        throw new Error(str + ' should contain ' + prefixes);
    }

    let start = prefixes === null ? { pos: 0, sub: '' } : this.indexOf(str, prefixes);
    const end = postfixes === null ? { pos: str.length, sub: '' } : this.indexOf(str, postfixes, start.pos + start.sub.length);

    let value = str.substring(start.pos + start.sub.length, end.pos);
    if (!value || value.length < 1) {
        throw new Error(str + ' should contain string in between ' + prefixes + ' and ' + postfixes);
    }

    while (true) {
        try {
            start = this.indexOf(value, prefixes);
        } catch (e) {
            break;
        }
        value = value.substring(start.pos + start.sub.length);
        if (!value || value.length < 1) {
            throw new Error(str + ' should contain string in between ' + prefixes + ' and ' + postfixes);
        }
    }

    return value;
}

Comments

0

a simple approach could be the following:

string_to_search_in = 'could be anything'
start = string_to_search_in.find(str("sub string u want to identify"))
length = len("sub string u want to identify")
First_part_removed = string_to_search_in[start:]
end_coord = length
Extracted_substring=First_part_removed[:end_coord]

1 Comment

could you explain your code, so that it would be more helpful to the readers?
0

If you want to check whether the substrings exist and return an empty string if they don't:

def substr_between(str_all, first_string, last_string):
    pos1 = str_all.find(first_string)
    if pos1 < 0:
        return ""
    pos1 += len(first_string)
    pos2 = str_all[pos1:].find(last_string)
    if pos2 < 0:
        return ""
    return str_all[pos1:pos1 + pos2]

Comments

0

I used this Split method:

main_text="dsfsgaehere535353box"
start = "here"
end = "box"
z1 = main_text.split(start)
z2 = z1[1].split(end)
print(z2[0])

will be "535353"

Comments

0

If your markers might appear multiple times or even be identical, you can safely extract all matches using re.escape():

import re

s = "XXXAAAfirstAAAsecondAAAthirdZZZ"
start, end = "AAA", "AAA"

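# The lookahead (?=...) leaves the closing marker unconsumed,
# so it can also serve as the opening marker of the next match.
pattern = f"{re.escape(start)}(.*?)(?={re.escape(end)})"
matches = re.findall(pattern, s)
print(matches)
# ['first', 'second']

Because the closing marker is matched with a lookahead, this approach handles repeated or identical markers, and re.escape() keeps regex metacharacters in the markers from being interpreted as pattern syntax.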

Comments

-1

One-liners that return a fallback string if there is no match. Edit: the improved version uses the next function; replace "not-found" with something else if needed:

import re
res = next( (m.group(1) for m in [re.search("AAA(.*?)ZZZ", "gfgfdAAA1234ZZZuijjk" ),] if m), "not-found" )

My other method, which is less optimal, runs a regex a second time; I still haven't found a shorter way:

import re
res = ( ( re.search("AAA(.*?)ZZZ", "gfgfdAAA1234ZZZuijjk") or re.search("()","") ).group(1) )
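On Python 3.8 or later, an assignment expression may be a shorter way to get the same fallback behaviour. This is a sketch of an alternative, not the answer's own method; the variable name text is introduced here for readability.

```python
import re

text = "gfgfdAAA1234ZZZuijjk"

# The condition is evaluated first, so m is bound before m.group(1) runs.
res = m.group(1) if (m := re.search("AAA(.*?)ZZZ", text)) else "not-found"
print(res)  # 1234

missing = m.group(1) if (m := re.search("AAA(.*?)ZZZ", "hello there")) else "not-found"
print(missing)  # not-found
```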

Comments
