How to extract the substring between two markers?

Question

Let's say I have a string 'gfgfdAAA1234ZZZuijjk' and I want to extract just the '1234' part.

I only know the few characters directly before (AAA) and after (ZZZ) the part I am interested in (1234).

With sed it is possible to do something like this with a string:

echo "$STRING" | sed -e "s|.*AAA\(.*\)ZZZ.*|\1|"

And this will give me 1234 as a result.

How to do the same thing in Python?

One-liner with Python 3.8: text[text.find(start:='AAA')+len(start):text.find('ZZZ')]
– cookiemonster, Commented Jun 18, 2021 at 19:19

CDMP · Accepted Answer · 2013-10-08 15:50:59Z

886

Using regular expressions - documentation for further reference

import re

text = 'gfgfdAAA1234ZZZuijjk'

m = re.search('AAA(.+?)ZZZ', text)
if m:
    found = m.group(1)

# found: 1234

or:

import re

text = 'gfgfdAAA1234ZZZuijjk'

try:
    found = re.search('AAA(.+?)ZZZ', text).group(1)
except AttributeError:
    # AAA, ZZZ not found in the original string
    found = '' # apply your error handling

# found: 1234

edited Oct 8, 2013 at 15:50

CDMP

answered Jan 12, 2011 at 9:18

eumiro

14 Comments

Bengt Over a year ago

The second solution is better if the pattern matches most of the time, because it's Easier to Ask for Forgiveness than Permission (EAFP).

Alexander Over a year ago

Doesn't the indexing start at 0? So you would need to use group(0) instead of group(1)?

Yurii K Over a year ago

@Alexander, no, group(0) will return full matched string: AAA1234ZZZ, and group(1) will return only characters matched by first group: 1234

HelloGoodbye Over a year ago

@Bengt: Why is that? The first solution looks quite simple to me, and it has fewer lines of code.

Heather Over a year ago

In this expression the ? modifies the + to be non-greedy, i.e. it will match any number of times from 1 upwards but as few as possible, only expanding as necessary. Without the ?, the first group would match gfgfAAA2ZZZkeAAA43ZZZonife as 2ZZZkeAAA43, but with the ? it would only match the 2; searching for multiple matches (or stripping out the first match and searching again) would then match the 43.
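The two points raised in these comments, group numbering and the effect of the lazy ?, can be checked directly. This is a minimal sketch using the sample string from the comment above; the variable names are illustrative only.

```python
import re

sample = 'gfgfAAA2ZZZkeAAA43ZZZonife'

# group(0) is the whole match; group(1) is the first parenthesised group.
m = re.search('AAA(.+?)ZZZ', sample)
whole_match = m.group(0)   # 'AAA2ZZZ'
first_group = m.group(1)   # '2'

# Greedy: .+ runs to the last ZZZ that still allows a match.
greedy = re.search('AAA(.+)ZZZ', sample).group(1)   # '2ZZZkeAAA43'

# Lazy: .+? stops at the nearest ZZZ, so findall sees both pairs.
lazy_all = re.findall('AAA(.+?)ZZZ', sample)        # ['2', '43']
```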

Lennart Regebro · Accepted Answer · 2011-01-12 09:17:23Z

166

>>> s = 'gfgfdAAA1234ZZZuijjk'
>>> start = s.find('AAA') + 3
>>> end = s.find('ZZZ', start)
>>> s[start:end]
'1234'

Then you can use regexps with the re module as well, if you want, but that's not necessary in your case.

answered Jan 12, 2011 at 9:17

Lennart Regebro

5 Comments

tzot Over a year ago

The question seems to imply that the input text will always contain both "AAA" and "ZZZ". If this is not the case, your answer fails horribly (by that I mean it returns something completely wrong instead of an empty string or throwing an exception; think "hello there" as input string).

confused00 Over a year ago

@user225312 Is the re method not faster though?

Alex Over a year ago

Voteup, but I would use "x = 'AAA' ; s.find(x) + len(x)" instead of "s.find('AAA') + 3" for maintainability.

ribamar Over a year ago

If either token can't be found in s, s.find will return -1. The slicing operator s[begin:end] will accept it as a valid index and return an undesired substring.

Claudiu Creanga Over a year ago

@confused00 find is much faster than re stackoverflow.com/questions/4901523/…
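The failure mode described in these comments is easy to reproduce: str.find returns -1 for a missing marker, and the slice silently accepts it. A minimal sketch follows. The helper name between_markers is hypothetical, and returning None when a marker is missing is only one possible choice.

```python
def between_markers(s, start, end):
    """Return the text between start and end, or None if either is missing."""
    i = s.find(start)
    if i == -1:
        return None
    i += len(start)        # use len(start) rather than a hard-coded 3
    j = s.find(end, i)     # look for end only after start
    if j == -1:
        return None
    return s[i:j]

# The unchecked version on an input without markers:
s = 'hello there'
begin = s.find('AAA') + 3              # -1 + 3 == 2
unchecked = s[begin:s.find('ZZZ', begin)]   # s[2:-1], a meaningless substring
```

With the guard in place, the same input gives None instead of a fragment of the original string.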

tzot · Accepted Answer · 2011-02-06 23:43:17Z

136

regular expression

import re

re.search(r"(?<=AAA).*?(?=ZZZ)", your_text).group(0)

The above as-is will fail with an AttributeError if there are no "AAA" and "ZZZ" in your_text

string methods

your_text.partition("AAA")[2].partition("ZZZ")[0]

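The above will return an empty string if "AAA" doesn't exist in your_text. If "AAA" is present but "ZZZ" is not, it returns everything after "AAA", because partition puts the whole string in the first element when the separator is missing.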
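str.partition always returns a 3-tuple (before, separator, after), and the separator element is empty when the marker was not found. The sketch below uses that to tell the cases apart. The name partition_between is hypothetical, and returning an empty string on failure is an assumption about the desired behaviour.

```python
def partition_between(text, start, end):
    """Text between the first start and the next end, or '' if either is absent."""
    _, sep_start, rest = text.partition(start)
    if not sep_start:      # start marker not found
        return ''
    middle, sep_end, _ = rest.partition(end)
    if not sep_end:        # end marker not found after start
        return ''
    return middle

# Chained form without the separator check, on an input lacking the end marker:
unchecked = 'xxAAA1234'.partition('AAA')[2].partition('ZZZ')[0]
```

Without the check on the separator, an input such as 'xxAAA1234' yields '1234' rather than an empty string.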

PS Python Challenge?

answered Feb 6, 2011 at 23:43

tzot

4 Comments

ChaimG Over a year ago

This answer probably deserves more up votes. The string method is the most robust way. It does not need a try/except.

GreenAsJade Over a year ago

... nice, though limited. partition is not regex based, so it only works in this instance because the search string was bounded by fixed literals

Alex Over a year ago

Great, many thanks! - this works for strings and does not require regex

Harry Jones Over a year ago

Upvoting for the string method, there is no need for regex in something this simple, most languages have a library function for this

Uncle Long Hair · Accepted Answer · 2019-02-09 16:57:58Z

49

Surprised that nobody has mentioned this which is my quick version for one-off scripts:

>>> x = 'gfgfdAAA1234ZZZuijjk'
>>> x.split('AAA')[1].split('ZZZ')[0]
'1234'

answered Feb 9, 2019 at 16:57

Uncle Long Hair

3 Comments

John Over a year ago

@user1810100 mentioned essentially that almost exactly 5 years to the day before you posted this...

Rolf of Saxony Over a year ago

Adding an if s.find("ZZZ") > s.find("AAA"): to it avoids issues if 'ZZZ' isn't in the string, which would return '1234uijjk'.

Yann Dìnendal Over a year ago

@tzot's answer (stackoverflow.com/a/4917004/358532) with partition instead of split seems more robust (depending on your needs), as it returns an empty string if one of the substrings isn't found.
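How the split one-liner behaves on bad input depends on which marker is missing. A short sketch with made-up inputs:

```python
x = 'gfgfdAAA1234ZZZuijjk'
ok = x.split('AAA')[1].split('ZZZ')[0]

# End marker missing: split('ZZZ') returns a one-element list,
# so everything after AAA comes back.
no_end = 'gfgfdAAA1234uijjk'.split('AAA')[1].split('ZZZ')[0]

# Start marker missing: split('AAA') has no element [1].
try:
    no_start = 'gfgfd1234ZZZuijjk'.split('AAA')[1].split('ZZZ')[0]
except IndexError:
    no_start = None
```

So the one-liner either raises IndexError or returns too much, depending on the input, which is acceptable for one-off scripts but worth knowing.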

Mahesh Gupta · Accepted Answer · 2018-01-11 11:39:55Z

27

You can do it using just one line of code:

>>> import re
>>> re.findall(r'\d{1,5}', 'gfgfdAAA1234ZZZuijjk')
['1234']

The result is a list. Note that this pattern matches any run of one to five digits; it does not look at the markers.

answered Jan 11, 2018 at 11:39

Mahesh Gupta

Comments

infrared · Accepted Answer · 2011-01-12 09:18:00Z

19

import re
print(re.search('AAA(.*?)ZZZ', 'gfgfdAAA1234ZZZuijjk').group(1))

answered Jan 12, 2011 at 9:18

infrared

1 Comment

eumiro Over a year ago

AttributeError: 'NoneType' object has no attribute 'group' - if there is no AAA, ZZZ in the string...

andreypopp · Accepted Answer · 2011-01-12 09:19:21Z

12

You can use re module for that:

>>> import re
>>> re.compile(".*AAA(.*)ZZZ.*").match("gfgfdAAA1234ZZZuijjk").groups()
('1234',)

answered Jan 12, 2011 at 9:19

andreypopp

Comments

rashok · Accepted Answer · 2018-03-14 09:11:23Z

12

In Python, extracting a substring from a string can be done using the findall method in the regular expression (re) module.

>>> import re
>>> s = 'gfgfdAAA1234ZZZuijjk'
>>> ss = re.findall('AAA(.+)ZZZ', s)
>>> print(ss)
['1234']

answered Mar 14, 2018 at 9:11

rashok

Comments

Fernando Wittmann · Accepted Answer · 2019-03-04 01:31:31Z

8

text = 'I want to find a string between two substrings'
left = 'find a '
right = 'between two'

print(text[text.index(left)+len(left):text.index(right)])

Gives

string

answered Mar 4, 2019 at 1:31

Fernando Wittmann

1 Comment

plpsanchez Over a year ago

If the text does not include the markers, this throws a ValueError: substring not found exception. That is good.
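Because str.index raises ValueError instead of returning -1, a missing marker can be handled explicitly. This is a minimal sketch, assuming the caller wants None on failure. The name extract_between is hypothetical, and passing the start position to the second index call (so the right marker is searched only after the left one) is an addition not present in the answer above.

```python
def extract_between(text, left, right):
    """Return the text between left and right, or None if a marker is missing."""
    try:
        start = text.index(left) + len(left)
        end = text.index(right, start)   # search only after the left marker
    except ValueError:
        return None
    return text[start:end]

text = 'I want to find a string between two substrings'
result = extract_between(text, 'find a ', 'between two')
missing = extract_between(text, 'no such marker', 'between two')
```

Note that the extracted value here keeps the trailing space that sits before the right marker.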

Ashwini Chaudhary · Accepted Answer · 2014-02-11 09:23:44Z

6

>>> s = '/tmp/10508.constantstring'
>>> s.split('/tmp/')[1].split('constantstring')[0].strip('.')
'10508'

edited Feb 11, 2014 at 9:23

Ashwini Chaudhary

answered Feb 8, 2014 at 0:12

user1810100

Comments

Avinash Raj · Accepted Answer · 2015-01-31 08:29:21Z

6

With sed it is possible to do something like this with a string:

echo "$STRING" | sed -e "s|.*AAA\(.*\)ZZZ.*|\1|"

And this will give me 1234 as a result.

You could do the same with re.sub function using the same regex.

>>> import re
>>> re.sub(r'.*AAA(.*)ZZZ.*', r'\1', 'gfgfdAAA1234ZZZuijjk')
'1234'

In basic sed, capturing groups are written as \(..\), but in Python they are written as (..).
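One difference from the search-based answers is worth checking: when the pattern does not match, re.sub returns the input unchanged rather than signalling failure. A small sketch follows. The re.fullmatch variant is an alternative suggested here, not part of the answer above.

```python
import re

pattern = r'.*AAA(.*)ZZZ.*'

hit = re.sub(pattern, r'\1', 'gfgfdAAA1234ZZZuijjk')
miss = re.sub(pattern, r'\1', 'no markers here')   # input comes back as-is

# fullmatch returns None on failure, so the two cases can be told apart.
m = re.fullmatch(pattern, 'no markers here')
found = m.group(1) if m else None
```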

answered Jan 31, 2015 at 8:29

Avinash Raj

Comments

Saeed Zahedian Abroodi · Accepted Answer · 2017-10-21 05:38:35Z

You can find the first occurrence of a substring (by character index) with this function. It can also return what comes after the substring.

def FindSubString(strText, strSubString, Offset=None):
    try:
        Start = strText.find(strSubString)
        if Start == -1:
            return -1  # Not found
        if Offset is None:
            # No offset: return everything after the substring
            return strText[Start + len(strSubString):]
        if Offset == 0:
            # Offset 0: return the index where the substring starts
            return Start
        # Otherwise return Offset characters following the substring
        AfterSubString = Start + len(strSubString)
        return strText[AfterSubString:AfterSubString + int(Offset)]
    except (AttributeError, TypeError, ValueError):
        return -1

# Example:

Text = "Thanks for contributing an answer to Stack Overflow!"
subText = "to"

print("Start of first substring in a text:")
start = FindSubString(Text, subText, 0)
print(start); print("")

print("Exact substring in a text:")
print(Text[start:start+len(subText)]); print("")

print("What is after substring \"%s\"?" %(subText))
print(FindSubString(Text, subText))

# Your answer:

Text = "gfgfdAAA1234ZZZuijjk"
subText1 = "AAA"
subText2 = "ZZZ"

AfterText1 = FindSubString(Text, subText1, 0) + len(subText1)
BeforText2 = FindSubString(Text, subText2, 0) 

print("\nYour answer:\n%s" %(Text[AfterText1:BeforText2]))

Raphael · Accepted Answer · 2020-01-08 23:03:56Z

5

Using PyParsing

import pyparsing as pp

word = pp.Word(pp.alphanums)

s = 'gfgfdAAA1234ZZZuijjk'
rule = pp.nestedExpr('AAA', 'ZZZ')
for match in rule.searchString(s):
    print(match)

which yields:

[['1234']]

answered Jan 8, 2020 at 23:03

Raphael

Comments

cookiemonster · Accepted Answer · 2022-08-20 11:33:11Z

5

One liner with Python 3.8 if text is guaranteed to contain the substring:

text[text.find(start:='AAA')+len(start):text.find('ZZZ')]

edited Aug 20, 2022 at 11:33

answered Jun 18, 2021 at 19:20

cookiemonster

2 Comments

plpsanchez Over a year ago

Does not work if the text does not contain the markers.

plpsanchez Over a year ago

Similar solution by fernando-wittmann using text.index throws exception, allowing detection and forgiveness. stackoverflow.com/a/54975532/2719980

Community · Accepted Answer · 2017-05-23 11:55:07Z

4

Just in case somebody has to do the same thing that I did: I had to extract everything inside parentheses in a line. For example, if I have a line like 'US president (Barack Obama) met with ...' and I want to get only 'Barack Obama', this is the solution:

import re
line = 'US president (Barack Obama) met with ...'
regex = r'.*\((.*?)\).*'
matches = re.search(regex, line)
line = matches.group(1) + '\n'

That is, you need to escape the parentheses with a backslash (\). Though this is more a problem about regular expressions than about Python.

Also, in some cases you may see an 'r' prefix before the regex definition. Without the r prefix, you need to double the backslashes, as in C string literals. Here is more discussion on that.
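To make the point about the r prefix concrete: a raw string keeps backslashes literally, so the pattern reaches the regex engine unchanged. This is a sketch using the example sentence from above; the variable names are illustrative.

```python
import re

# The same pattern written two ways; each contains one backslash
# before each parenthesis once Python has parsed the literal.
raw = r'\((.*?)\)'
escaped = '\\((.*?)\\)'
same = (raw == escaped)

line = 'US president (Barack Obama) met with ...'
m = re.search(raw, line)
name = m.group(1) if m else None
```

Checking the match object before calling group avoids an AttributeError on lines that contain no parentheses.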

edited May 23, 2017 at 11:55

Community Bot

answered Jan 19, 2014 at 19:29

Denis Kutlubaev

Comments

yunus · Accepted Answer · 2021-10-05 19:02:30Z

Also, you can find all combinations with the function below:

s = 'Part 1. Part 2. Part 3 then more text'

def find_all_places(text, word):
    # Collect the start index of every non-overlapping occurrence of word
    word_places = []
    i = 0
    while True:
        word_place = text.find(word, i)
        if word_place < 0:
            break
        word_places.append(word_place)
        i = word_place + len(word)  # resume the search after this occurrence
    return word_places

def find_all_combination(text, start, end):
    # Pair every start position with every later end position
    start_places = find_all_places(text, start)
    end_places = find_all_places(text, end)
    combination_list = []
    for start_place in start_places:
        for end_place in end_places:
            if start_place >= end_place:
                continue
            combination_list.append(text[start_place:end_place])
    return combination_list

print(find_all_combination(s, "Part", "Part"))

result:

['Part 1. ', 'Part 1. Part 2. ', 'Part 2. ']

Adrien Mau · Accepted Answer · 2022-08-02 13:28:35Z

1

In case you want to look for multiple occurrences:

content ="Prefix_helloworld_Suffix_stuff_Prefix_42_Suffix_andsoon"
strings = []
for c in content.split('Prefix_'):
    spos = c.find('_Suffix')
    if spos!=-1:
        strings.append( c[:spos])
print( strings )

Or more quickly :

strings = [ c[:c.find('_Suffix')] for c in content.split('Prefix_') if c.find('_Suffix')!=-1 ]
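The same multi-occurrence extraction can also be written with re.findall. This is a sketch assuming literal markers, which are passed through re.escape so that any special characters are matched verbatim; the names prefix and suffix are illustrative.

```python
import re

content = "Prefix_helloworld_Suffix_stuff_Prefix_42_Suffix_andsoon"
prefix, suffix = 'Prefix_', '_Suffix'

# Lazy group between the escaped markers; findall returns only the group.
pattern = re.escape(prefix) + '(.*?)' + re.escape(suffix)
strings = re.findall(pattern, content)
print(strings)
```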

answered Aug 2, 2022 at 13:28

Adrien Mau

Comments

Foobar · Accepted Answer · 2019-02-23 18:26:39Z

0

Here's a solution without regex that also accounts for scenarios where the first substring contains the second substring. This function will only find a substring if the second marker is after the first marker.

def find_substring(string, start, end):
    len_until_end_of_first_match = string.find(start) + len(start)
    after_start = string[len_until_end_of_first_match:]
    return string[string.find(start) + len(start):len_until_end_of_first_match + after_start.find(end)]

answered Feb 23, 2019 at 18:26

Foobar

Comments

Julio S. · Accepted Answer · 2019-10-12 00:30:49Z

0

Another way of doing it is using lists (supposing the substring you are looking for is made of numbers, only) :

string = 'gfgfdAAA1234ZZZuijjk'
numbersList = ['0', '1', '2', '3', '4', '5', '6', '7', '8', '9']
output = []

for char in string:
    if char in numbersList: output.append(char)

print(f"output: {''.join(output)}")
### output: 1234

answered Oct 12, 2019 at 0:30

Julio S.

Comments

Sergey Gurin · Accepted Answer · 2020-09-04 11:16:46Z

Typescript. Gets string in between two other strings.

Searches shortest string between prefixes and postfixes

prefixes - string / array of strings / null (means search from the start).

postfixes - string / array of strings / null (means search until the end).

public getStringInBetween(str: string, prefixes: string | string[] | null,
                          postfixes: string | string[] | null): string {

    if (typeof prefixes === 'string') {
        prefixes = [prefixes];
    }

    if (typeof postfixes === 'string') {
        postfixes = [postfixes];
    }

    if (!str || str.length < 1) {
        throw new Error(str + ' should contain ' + prefixes);
    }

    let start = prefixes === null ? { pos: 0, sub: '' } : this.indexOf(str, prefixes);
    const end = postfixes === null ? { pos: str.length, sub: '' } : this.indexOf(str, postfixes, start.pos + start.sub.length);

    let value = str.substring(start.pos + start.sub.length, end.pos);
    if (!value || value.length < 1) {
        throw new Error(str + ' should contain string in between ' + prefixes + ' and ' + postfixes);
    }

    while (true) {
        try {
            start = this.indexOf(value, prefixes);
        } catch (e) {
            break;
        }
        value = value.substring(start.pos + start.sub.length);
        if (!value || value.length < 1) {
            throw new Error(str + ' should contain string in between ' + prefixes + ' and ' + postfixes);
        }
    }

    return value;
}

Anonymous · Accepted Answer · 2023-02-20 15:49:58Z

0

a simple approach could be the following:

string_to_search_in = 'could be anything'
start = string_to_search_in.find(str("sub string u want to identify"))
length = len("sub string u want to identify")
First_part_removed = string_to_search_in[start:]
end_coord = length
Extracted_substring=First_part_removed[:end_coord]

answered Feb 20, 2023 at 15:49

Anonymous

1 Comment

Simas Joneliunas Over a year ago

could you explain your code, so that it would be more helpful to the readers?

Feng Jiang · Accepted Answer · 2023-05-28 23:21:08Z

0

If you want to check whether the substrings exist and return an empty string if they don't:

def substr_between(str_all, first_string, last_string):
    pos1 = str_all.find(first_string)
    if pos1 < 0:
        return ""
    pos1 += len(first_string)
    pos2 = str_all[pos1:].find(last_string)
    if pos2 < 0:
        return ""
    return str_all[pos1:pos1 + pos2]

answered May 28, 2023 at 23:21

Feng Jiang

Comments

Mori · Accepted Answer · 2024-06-17 09:16:38Z

0

I used this Split method:

main_text="dsfsgaehere535353box"
start = "here"
end = "box"
z1 = main_text.split(start)
z2 = z1[1].split(end)
print(z2[0])

The output will be "535353".

answered Jun 17, 2024 at 9:16

Mori

Comments

charly_0x13 · Accepted Answer · 2025-10-22 15:27:25Z

0

If your markers might appear multiple times or even be identical, you can safely extract all matches using re.escape():

import re

s = "XXXAAAfirstAAAsecondAAAthirdZZZ"
start, end = "AAA", "AAA"

# A lookahead leaves the end marker unconsumed, so it can also open the next match
pattern = f"{re.escape(start)}(.*?)(?={re.escape(end)})"
matches = re.findall(pattern, s)
print(matches)
# ['first', 'second']

This approach correctly handles repeated or identical markers and avoids regex injection issues.

answered Oct 22 at 15:27

charly_0x13

Comments

MaxLZ · Accepted Answer · 2018-05-03 18:31:44Z

-1

One-liners that return a fallback string if there is no match. Edit: the improved version uses the next function; replace "not-found" with something else if needed:

import re
res = next( (m.group(1) for m in [re.search("AAA(.*?)ZZZ", "gfgfdAAA1234ZZZuijjk" ),] if m), "not-found" )

My other method, less optimal, runs a regex a second time; I still haven't found a shorter way:

import re
res = ( ( re.search("AAA(.*?)ZZZ", "gfgfdAAA1234ZZZuijjk") or re.search("()","") ).group(1) )
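On Python 3.8 and later, an assignment expression gives another one-line form with a fallback value. This is a sketch with "not-found" kept as the placeholder from above; the variable names text and missing are illustrative.

```python
import re

text = "gfgfdAAA1234ZZZuijjk"

# The condition is evaluated first, so m is bound before m.group(1) runs.
res = m.group(1) if (m := re.search("AAA(.*?)ZZZ", text)) else "not-found"

missing = m.group(1) if (m := re.search("AAA(.*?)ZZZ", "no markers")) else "not-found"
```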

edited May 3, 2018 at 18:31

answered Dec 7, 2017 at 0:55

MaxLZ
