I am having trouble extracting a substring from a string using Python regex. I want to pull any substring matching the following format out of a larger string:

some_var:struct<some_variables>

In doing so, I ran into three corner cases, which I describe in detail below.

Scenario 1:

>>> import re
>>> s = 'firstname:string,middlename:double,lastname:struct<last1:int,last2:array<string>>,addr:string'
>>> match = re.search(r'\w[a-zA-Z]*:struct<.*>,', s)
>>> print(match.group())
lastname:struct<last1:int,last2:array<string>>,

The above code works fine.

Scenario 2:

>>> subdtyp = 'firstname:string,middlename:double,lastname:struct<last1:int,last2:array<string>>,last3:array<string>,last4:struct<last41:int,last42:string>'
>>> match = re.search(r'\w[a-zA-Z]*:struct<.*>,',subdtyp)
>>> print(match.group())
lastname:struct<last1:int,last2:array<string>>,last3:array<string>,

In this case, because of greedy matching, the same regex returns more than expected: last3:array<string>, is the extra part. So I switched to non-greedy matching, as below:

>>> match = re.search(r'\w[a-zA-Z]*:struct<.*?>,',subdtyp)
>>> print(match.group())
lastname:struct<last1:int,last2:array<string>>,

This time the result is exactly what I want.

Scenario 3:

>>> subdtyp2 = 'firstname:string,middlename:double,lastname:struct<last4:struct<last41:int,last42:string>,last2:array<string>>,last3:array<string>'
>>> match = re.search(r'\w[a-zA-Z]*:struct<.*?>,',subdtyp2)
>>> print(match.group())
lastname:struct<last4:struct<last41:int,last42:string>,

Here the result is incomplete: non-greedy matching stops at the first ">," it finds, so the last2:array<string>> portion is missing.

Can somebody please help me with a regex that satisfies all of the above cases?

  • According to this answer, regex would not be the best way to handle nested expressions. A better way would be pyparsing. Commented Dec 13, 2022 at 7:26

1 Answer

Starting from this answer, I get something like this:

import pyparsing

string = 'firstname:string,middlename:double,lastname:struct<last1:int,last2:array<string>>,addr:string'
# Tokens allowed between the brackets: names/types, ':' and ','
thecontent = pyparsing.Word(pyparsing.alphanums) | ":" | ","
parens = pyparsing.nestedExpr("<", ">", content=thecontent)

# nestedExpr has to start at an opening bracket, so wrap the input in <...>
a = parens.parseString(f"<{string}>").asList()[0]
print(a[a.index('struct')+1])

# ['last1', ':', 'int', ',', 'last2', ':', 'array', ['string']]

We must define thecontent as every token other than the nesting characters, while parens handles the nesting ones. Additionally, like in JSON, parsing can't start from anything other than a nesting character, hence wrapping the input as f"<{string}>".
As far as I've understood, you want to find the content of the structs; this should allow you to do exactly that.
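
If you would rather get the matched substrings back as text (as in the three scenarios in the question) and avoid an extra dependency, here is a minimal sketch that uses a regex only to find the start of each struct and then counts bracket depth by hand. The function name find_structs and the start pattern \w+:struct< are my own assumptions, not something from the question, and unlike the question's regex the result does not include the trailing comma.

```python
import re

# Assumed start pattern: a name made of word characters, then ":struct<"
STRUCT_START = re.compile(r'\w+:struct<')

def find_structs(text):
    """Return each top-level name:struct<...> substring with balanced brackets."""
    results = []
    pos = 0
    while True:
        m = STRUCT_START.search(text, pos)
        if m is None:
            break
        depth = 1          # the '<' consumed by the start pattern
        i = m.end()
        while i < len(text) and depth > 0:
            if text[i] == '<':
                depth += 1
            elif text[i] == '>':
                depth -= 1
            i += 1
        if depth != 0:
            break          # unbalanced input: stop rather than guess
        results.append(text[m.start():i])
        pos = i            # resume after this struct, skipping nested ones
    return results

subdtyp2 = 'firstname:string,middlename:double,lastname:struct<last4:struct<last41:int,last42:string>,last2:array<string>>,last3:array<string>'
print(find_structs(subdtyp2))
```

Because the scan resumes after the closing bracket, a struct nested inside another one (last4 in Scenario 3) is returned as part of its parent rather than as a separate entry.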
