Issue in pattern matching using python regex

Question

I am having trouble extracting a substring from a string using Python regex. I want to extract any substring that matches the following format from a larger string:

some_var:struct<some_variables>

While doing so, I ran into three corner cases, which I explain in detail below.

Scenario 1:

>>> import re
>>> s = 'firstname:string,middlename:double,lastname:struct<last1:int,last2:array<string>>,addr:string'
>>> match = re.search(r'\w[a-zA-Z]*:struct<.*>,', s)
>>> print(match.group())
lastname:struct<last1:int,last2:array<string>>,

The above code works fine.

Scenario 2:

>>> subdtyp = 'firstname:string,middlename:double,lastname:struct<last1:int,last2:array<string>>,last3:array<string>,last4:struct<last41:int,last42:string>'
>>> match = re.search(r'\w[a-zA-Z]*:struct<.*>,',subdtyp)
>>> print(match.group())
lastname:struct<last1:int,last2:array<string>>,last3:array<string>,

In this case, because of greedy matching, the same regex returns more than expected: the extra part is last3:array<string>, at the end. So I changed it to non-greedy matching, as below:

>>> match = re.search(r'\w[a-zA-Z]*:struct<.*?>,',subdtyp)
>>> print(match.group())
lastname:struct<last1:int,last2:array<string>>,

This time the result is exactly what I want.

Scenario 3:

>>> subdtyp2 = 'firstname:string,middlename:double,lastname:struct<last4:struct<last41:int,last42:string>,last2:array<string>>,last3:array<string>'
>>> match = re.search(r'\w[a-zA-Z]*:struct<.*?>,',subdtyp2)
>>> print(match.group())
lastname:struct<last4:struct<last41:int,last42:string>,

Here the result is incomplete: the last2:array<string>> portion is missing, because the non-greedy match stops at the first ">," it finds.
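
One way to see why the non-greedy match is incomplete is to count how many "<" are still open at the end of the matched text. The following is only a minimal sketch; the helper name depth_at_end is hypothetical and was not part of the original attempt.

```python
import re

subdtyp2 = ('firstname:string,middlename:double,'
            'lastname:struct<last4:struct<last41:int,last42:string>,'
            'last2:array<string>>,last3:array<string>')

def depth_at_end(text):
    """Return how many '<' are still unclosed at the end of text."""
    return text.count('<') - text.count('>')

# Same non-greedy pattern as in Scenario 3.
lazy = re.search(r'\w[a-zA-Z]*:struct<.*?>,', subdtyp2).group()
print(lazy)                # lastname:struct<last4:struct<last41:int,last42:string>,
print(depth_at_end(lazy))  # 1: the outer struct< is still open
```

A depth of 1 means the match ended inside the outer struct. A pattern built only from .* or .*? has no notion of depth, so it cannot tell the inner ">" from the outer one.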

Can somebody please help me with a regex that satisfies all of the above cases?

According to this answer, regex would not be the best way to handle nested expressions. A better way would be pyparsing.
– GregoirePelegrin, Commented Dec 13, 2022 at 7:26

GregoirePelegrin · Accepted Answer · 2022-12-13 07:37:41Z

Starting from this answer, I get something like this:

import pyparsing

string = 'firstname:string,middlename:double,lastname:struct<last1:int,last2:array<string>>,addr:string'
# Tokens allowed between the brackets: words, colons and commas.
thecontent = pyparsing.Word(pyparsing.alphanums) | ":" | ","
parens = pyparsing.nestedExpr("<", ">", content=thecontent)

# nestedExpr has to start at an opening "<", so wrap the whole input in one pair.
a = parens.parseString(f"<{string}>").asList()[0]
print(a[a.index('struct')+1])

# ['last1', ':', 'int', ',', 'last2', ':', 'array', ['string']]

We must define thecontent as every token other than the nesting characters, while parens handles the nesting itself. Additionally, as in JSON, the input can't start with anything other than a nesting character, which is why the string is wrapped as f"<{string}>" before parsing.
As far as I've understood, you want to find the content of the structs; this should allow you to do exactly that.
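
If the goal is the full name:struct<...> substring rather than the parsed tokens, a plain bracket counter may also be enough. The following is a minimal sketch using only the standard library; the function name find_structs and the \w+ pattern for the field name are assumptions made for illustration, not something taken from the question.

```python
import re

def find_structs(text):
    """Return each top-level 'name:struct<...>' substring with balanced brackets."""
    header = re.compile(r'\w+:struct<')  # assumed shape of the field name
    results = []
    pos = 0
    while True:
        m = header.search(text, pos)
        if m is None:
            break
        depth = 1          # the '<' consumed by the header is already open
        i = m.end()
        while i < len(text) and depth > 0:
            if text[i] == '<':
                depth += 1
            elif text[i] == '>':
                depth -= 1
            i += 1
        if depth != 0:
            break          # unbalanced input: stop rather than guess
        results.append(text[m.start():i])
        pos = i            # continue after this struct, skipping nested ones
    return results

subdtyp2 = ('firstname:string,middlename:double,'
            'lastname:struct<last4:struct<last41:int,last42:string>,'
            'last2:array<string>>,last3:array<string>')
print(find_structs(subdtyp2))
```

The regex only locates the start of each struct; the loop finds the matching ">" by tracking depth, so nesting of any depth is handled. Structs nested inside another struct are returned as part of their parent, not as separate entries, and the trailing comma is not included.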
