I have a large string of the following form: [' some text [ARG1: some inner text [1: some more text], and also [ other inner text [TAG: TAG_TYPE (0.99)]] ]', 'some more text ( some text in parentheses [2: words [ARG1: more words [ARGM-TYPE: even more nested words]]] [other text in square brackets []])']

I wish to capture everything within single quotes, and I can manage that with a simple ('(.*?)')

Now within this capture group I want to capture all the other possible groups, including optional nested subgroups.

I can capture some of the subgroups on their own, e.g.: (\[ONTOTYPE: PERSON \((0\.(\d{1,4})\))\])

but I seem to be missing some fairly basic insight into how to deal with optional nesting. If I'm missing some concept, I'd welcome links to any good explanation of it.

I'm aware of the concept of named capture groups, but I think that using them here would only increase the confusion.

For reference, my current attempt is here: https://regex101.com/r/bzSCD0/1/

In particular, group 15 in match 1 has a substring that matches one of the expressions, but it's not being parsed further.

ETA: Here are some examples of inputs and expected outputs. Let's use the one from the regex101 page:

' [ ARG0 : Those ] [ R - ARG0 : who ] [ V : graduated ] [ ARG1 : from [0: the school ] ] were promoted from provincial secretary to titular adviser . '

--> {ARG0:Those, R-ARG0: who, V:graduated, ARG1:from, 0: the school, <rest of text>}

I've turned what would be match 1 into a dict; the key-value pairs are the groups, not in the matching order.

Let's use a larger example from the beginning of the list, and use numbered capture groups: [' 2 Architects : [ V : Stasov ] [ ARG1 : V. P. Melnikov ] [ ARG1 : A. ] [ ARGM - LOC : [2: I. Suzor P. Yu [ONTOTYPE: PERSON (0.851)] ] ] . Year of construction : [1: 1835 ] , 1895 - 1910 [ONTOTYPE: DATE (0.8774)] Style : Classicism [0: School of Law ] Classicism on [3: the Fontanka River [ONTOTYPE: WORK_OF_ART (0.8261)] ] , [4: 6 - Tchaikovsky Street ] , [5: 1 - Oruzhenik Fedorov Street ] , 2 - A. A. Rzhevsky House 1790 [ONTOTYPE: DATE (0.7046)] - [0: School of Law ][1: 1835 ] - arch . Stasov Vasily Petrovich [ONTOTYPE: PERSON (0.4863)] , arch . Melnikov Avraham Ivanovich [ONTOTYPE: PERSON (0.7781)] ( ? ) ' turns into

group 1: 2 Architects
group 2: V
group 3: Stasov
group 4: ARG1
group 5: V.P. Melnikov
group 6: ARG1
group 7: A
group 8: ARGM-LOC
group 9: 2
group 10: I.   Suzor   P.   Yu
group 11: ONTOTYPE: PERSON (0.851)
group 12: Year of construction
group 13: 1
group 14: 1835

and so on

UPDATE: I've now constructed a second version of my regex: https://regex101.com/r/bzSCD0/2/ The hope is to capture all the simple groups first (this mostly happens), and then use backreferences to try to capture them optionally inside other groups. I still have no idea how to apply all this to the string between single quotes (the ('(.*?)') group).

It seems that I need some way to avoid the (.*?) capture-all group: once a match for it is found, the regex engine doesn't check whether the same text also fits a different pattern.
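One way around the capture-all group is to work in two stages instead of one pattern: first pull out each quoted item, then run a separate pattern over every item with finditer. The following is only a minimal sketch under stated assumptions: the sample string is a shortened, hypothetical stand-in for the real data, items contain no single quotes of their own, and the names raw, items, flat_group and pairs are made up for illustration. It only handles groups with no inner brackets.

```python
import re

# Hypothetical sample, shortened from the data in the question.
raw = "[' [ ARG0 : Those ] [ V : graduated ] ', ' [1: Church ] . ']"

# Stage 1: pull out each single-quoted item (assumes no quotes inside items).
items = re.findall(r"'(.*?)'", raw)

# Stage 2: match every "[ LABEL : text ]" group that has no inner brackets.
# finditer keeps scanning after each match, so all such groups are found.
flat_group = re.compile(r"\[\s*([^\[\]:]+?)\s*:\s*([^\[\]]*?)\s*\]")

pairs = [
    [(m.group(1), m.group(2)) for m in flat_group.finditer(item)]
    for item in items
]
print(pairs)
# [[('ARG0', 'Those'), ('V', 'graduated')], [('1', 'Church')]]
```

Because each stage has its own pattern, the lazy (.*?) is only used to delimit items and never competes with the group pattern.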

  • Can you provide sample input and expected output? Commented Sep 6, 2020 at 8:10
  • Let's use the one from the regex101 page: ' [ ARG0 : Those ] [ R - ARG0 : who ] [ V : graduated ] [ ARG1 : from [0: the school ] ] were promoted from provincial secretary to titular adviser . ' --> {ARG0:Those, R-ARG0: who, V:graduated, ARG1:from, 0: the school, <rest of text>}. I've turned what would be match 1 into a dict; the key-value pairs are the groups, not in the matching order. Commented Sep 6, 2020 at 8:17
  • I can provide more input --> output examples, but the comment box seems too small to contain them. Commented Sep 6, 2020 at 8:21
  • Please update the question with that. It will be helpful for others too. Commented Sep 6, 2020 at 8:22
  • @tripleee, very well, but I'd argue that: 1) what I'm dealing with is not arbitrarily nested, but only nested in predictable ways and up to a certain "depth" of nesting; 2) AFAICT there are still regex-based solutions for such cases. But if regexes are the wrong approach for my particular case, what would you recommend? Commented Sep 6, 2020 at 11:41

1 Answer

Please try the program below. It uses this regex:

(?<=[\[\]:])(.*?)(?=[\[\]:])

import re
a="""
['  2  Architects  :  [  V  :  Stasov  ]  [  ARG1  :  V.  P.  Melnikov  ]  [  ARG1  :  A.  ]  [  ARGM  -  LOC  : [2:  I.   Suzor   P.   Yu [ONTOTYPE: PERSON (0.851)] ] ]  .  Year  of  construction  : [1:  1835  ] ,  1895  -  1910 [ONTOTYPE: DATE (0.8774)]  Style  :  Classicism [0:  School   of   Law ] Classicism  on [3:  the   Fontanka   River [ONTOTYPE: WORK_OF_ART (0.8261)] ] , [4:  6    -   Tchaikovsky   Street ] , [5:  1   -   Oruzhenik   Fedorov   Street ] ,  2  -  A.  A.  Rzhevsky  House  1790 [ONTOTYPE: DATE (0.7046)]  - [0:  School   of   Law ][1:  1835  ] -  arch  .  Stasov  Vasily  Petrovich [ONTOTYPE: PERSON (0.4863)]  ,  arch  .  Melnikov  Avraham  Ivanovich [ONTOTYPE: PERSON (0.7781)]  (  ?  )  ', '  -  perestroika  ,  adaptation  1895 [ONTOTYPE: DATE (0.9555)]  ,  1909  -  1910 [ONTOTYPE: DATE (0.927)]  - [2:  archbishop   Suzor   Pavel   Yulievich [ONTOTYPE: PERSON (0.7866)] ] -  perestroika  (  [  ]  .  ', '  C.  ', '  [  ARGM  -  PRD  :  293   )  ( [3:  Fontanka   River [ONTOTYPE: ORG (0.595)] ] ]  [  ARGM  -  MNR  :  , [4:  6    -   Tchaikovsky   Street ] ,  1  ,  the  right  part  - [5:  Oruzheynik   Fedorov   Street [ONTOTYPE: FAC (0.7551)] ] ,  2  ]  ,  [  ARG0  :  the  left  part  )  ]  [  V  :  see  ]  [  ARGM  -  DIS  :  also  ]  [ [ ARG1 ] :  the  building  with  columns  and  corner  domes  -  Imperial  School  of  Law [ONTOTYPE: ORG (0.6317)]  ]  .  ', '  [  ARG2  :  The  house  ]  [  V  :  occupies  ]  [ [ ARG1 ] :  an  entire  block  ]  .  ', '  In  the  XYIII  century [ONTOTYPE: DATE (0.841)]  ,  there  was  a  laundromat  on  this  site  .  
', '  In  1788 [ONTOTYPE: DATE (0.9969)]  ,  [  ARG1  :  Alexey  Andreevia  Rzhevsky [ONTOTYPE: PERSON (0.7615)]  ]  ,  [  R  -  ARG1  :  who  ]  had  [  ARGM  -  TMP  :  recently  ]  [  V  :  married  ]  [  ARG2  :  Glafira  Alymova [ONTOTYPE: PERSON (0.8034)]  (  a  graduate  of  the  Smolny  Institute  )  ] [ONTOTYPE: ORG (0.7244)]  ,  bought  the  buildings  on  the  old  Palace  Spare  Court [ONTOTYPE: FAC (0.9785)]  ,  which  were  to  be  demolished  ,  and  by  1790 [ONTOTYPE: DATE (0.9945)]  he  had  built [8:  a   building   of   two    wings   connected   by   a   gate ] .  ', '  [  ARGM  -  DIS  :  In  1793 [ONTOTYPE: DATE (0.7649)]  ]  ,  [  ARG0  : [7:  the   couple ] ]  [  V  :  sold  ]  [ [ ARG1 ] : [8: [7:  their ]  house ] ]  [  ARG2  :  to  Countess  Maria  Iosifovna  Potocka [ONTOTYPE: PERSON (0.8324)]  ,  née  Mnišek [ONTOTYPE: PERSON (0.9412)]  ]  .  ']
['Arbigny [ONTOTYPE: PERSON (0.9974)]  ,  who  had  leased [1:  it ] to  the  Paget  Corps [ONTOTYPE: ORG (0.9155)]  for  eight  years [ONTOTYPE: DATE (0.9722)]  .  ``  ,  ', '  ,  ``  [ [ ARG1 ] :  The  first [ONTOTYPE: ORDINAL (0.9903)]  major  reconstruction  of [1:  the   house ] ]  [  V  :  took  ]  [  ARG2  :  place  ]  [  ARGM  -  LOC  :  in [3:  1814 ] -  1819 [ONTOTYPE: DATE (0.8943)]  ]  .  [  ARGM  -  ADV  :  In [5:  1835 ] ]  , [4:  Neplyev   `s ] heirs  sold [1:  the   house ] to [6:  Prince   Peter   Georgievich [ONTOTYPE: PERSON (0.9766)]   of   Oldenburg [ONTOTYPE: GPE (0.9829)]   ,   who   decided   to   establish   a   School   of   Law   there ] .  ``  ,  ', '  ,  ', '  ,  ', '  ,  ', '  ,  ']
['  [  ARG0  :  Those  ]  [  R  -  ARG0  :  who  ]  [  V  :  graduated  ]  [  ARG1  :  from [0:  the   school ] ]  were  promoted  from  provincial  secretary  to  titular  adviser  .  ', '  Architect  [  ARG0  :  A.I.  Melnikov [ONTOTYPE: PERSON (0.8799)]  ]  [  V  :  created  ]  [  ARG1  :  a  reconstruction  project  -  building  a  gap  between  the  buildings  ,  decorated  the  facade  with  a  portico  ,  courtyard  outbuildings  ,  and  a  house  ]  .  ', ' [1:  Church ] .  ', ' [2:  The   trustee   of  [0:  the   school ]] is  Prince  Peter  Georgievich [ONTOTYPE: PERSON (0.9901)]  of  Oldenburg [ONTOTYPE: GPE (0.9956)]  ,  a  close  relative  of  the  imperial  family  .  ', '  [  ARGM  -  ADV  :  From  1860 [ONTOTYPE: WORK_OF_ART (0.6234)]  ]  ,  [  ARG0  : [2:  he ] ]  [  V  :  headed  ]  [  ARG1  :  the  IV  Branch  of  the  Imperial  Chancellery [ONTOTYPE: ORG (0.9247)]  ,  a  charitable  agency  ]  .  ', '  [  ARG0  : [2:  He ] ]  [  V  :  invested  ]  [  ARG1  :  energy  and  resources  ]  [  ARG2  :  in  the  creation  of  hospitals  ,  shelters  ,  and  educational  institutions  ]  .  ']
['  [  ARGM  -  LOC  :  In  1836  -  1840 [ONTOTYPE: DATE (0.8964)]  ]  ,  [  ARG0  : [1:  V.P.   Stasov ] ]  [  V  :  completed  ]  [  ARG1  :  a  number  of  interior  spaces  ,  including  a  large  hall  and  a  house  church  ]  .  ', '  (  not  dry  )  .  ', '  In  the  left  and  right  wings  of [2:  the   school ] were  the  apartments  of [0:  teachers   and   employees ] .  ', '  Among [0:  them ] :  [  ARG0  :  writer  I.  S.  Aksakov [ONTOTYPE: PERSON (0.986)]  ,  poet  A.  N.  Apukhtin [ONTOTYPE: PERSON (0.9919)]  ,  biologist  V.  O.  Kovalevsky [ONTOTYPE: PERSON (0.9897)]  ]  ,  [  V  :  composers  ]  [  ARG1  :  A.  N.  Serov [ONTOTYPE: PERSON (0.9888)]  and  P.  I.  Tchaikovsky [ONTOTYPE: PERSON (0.9887)]  , [1:  art   critic   V.   V.   Stasov [ONTOTYPE: PERSON (0.99)] ] and [1:  his ] brother  ,  famous  lawyer  D.  V.  Stasov [ONTOTYPE: PERSON (0.9937)]  ,  architect  P.  Yu  .  Suzor [ONTOTYPE: PERSON (0.8141)]  ,  scientist  V.  O.  Kovalevsky [ONTOTYPE: PERSON (0.9897)]  ,  chess  player  A.  A.  Alekhin [ONTOTYPE: PERSON (0.9898)]  ]  .  ', '  [  ARGM [ONTOTYPE: ORG (0.8209)]  -  LOC  :  In  1893  -  1895 [ONTOTYPE: DATE (0.8547)]  and  1909  -  1910 [ONTOTYPE: DATE (0.8661)]  ]  ,  [  ARG0  :  Pavel  Yulievich  Suzor  ]  [  V  :  rebuilt  ]  [  ARG1  : [2:  the   building ] ]  ,  [  ARGM [ONTOTYPE: ORG (0.8209)]  -  ADV  :  removing  the  two   middle  columns  of  the  portico  and  making  a  new  main  entrance  ]  .  ', '  [  ARG1  :  The  fronton  ]  was  [  V  :  replaced  ]  [  ARGM  -  MNR  :  with  a  stepped  attic  ]  ,  and  the  shape  of  the  dome  was  changed  .  ', '  In  the  second  half  of  1822 [ONTOTYPE: DATE (0.9192)]  ,  the  Decembrist  G.  S [ONTOTYPE: ORG (0.6)].  [  ARG0  :  Batenkov [ONTOTYPE: PERSON (0.9606)]  (  1793  -  1863 [ONTOTYPE: DATE (0.8889)]  )  ]  [  V  :  lived  ]  [  ARGM  -  LOC  :  in  Z [ONTOTYPE: GPE (0.9057)].  Z [ONTOTYPE: GPE (0.9057)] [ONTOTYPE: WORK_OF_ART (0.7346)].  
In  1874  -  1918 [ONTOTYPE: DATE (0.9104)]  ,  V.  G.  Fedorov [ONTOTYPE: PERSON (0.9568)]  (  1874  -  1966 [ONTOTYPE: DATE (0.838)]  )  ,  a  designer  and  gunner  ]  ,  lived  in  Z [ONTOTYPE: GPE (0.9057)].  ']
['  [  ARGM  -  TMP  :  In  the  1920s [ONTOTYPE: DATE (0.7653)]  ]  ,  [ [ ARG1 ] : [1:  the   building ] ]  was  [  V  :  occupied  ]  [  ARG0  :  by [4:  the   Agricultural   Institute [ONTOTYPE: ORG (0.7449)] ] ]  .  ', '  [  ARG0  :  Teachers  ]  [  ARGM  -  DIS  :  also  ]  [  V  :  lived  ]  [  ARGM  -  LOC  :  here  ]  .  ', '  In  square  8  -  1928  -  1948 [ONTOTYPE: DATE (0.7954)]  A.  M.  Innokentyevich [ONTOTYPE: PERSON (0.8596)]  -  founder  of  hematology  in  the  USSR [ONTOTYPE: GPE (0.9975)]  .  ', '  [  ARG1  :  The  main  building  of [0:  the   1960 [ONTOTYPE: DATE (0.9949)]   -   1970 [ONTOTYPE: DATE (0.9065)]   ``   s ] ]  [  V  :  housed  ]  [  ARG2  :  the  NT  and  Lenelectronmash  ]  ,  and  since [0:  the   1960 [ONTOTYPE: DATE (0.9949)]   ``   s ] Lenzhradproekt [ONTOTYPE: PRODUCT (0.6771)]  has  been  located  .  ', '  [  ARGM  -  LOC  :  In  the  left  part  of [1:  the   building ] ]  ,  [  ARG1  :  the  Economic  and  Mathematical  Institute [ONTOTYPE: ORG (0.6906)]  of  the  Russian  Academy  of  Sciences [ONTOTYPE: ORG (0.8781)]  ]  is  [  ARGM  -  TMP  :  now  ]  [  V  :  located  ]  [  ARGM  -  LOC  :  on  the  site  of  former  residential  apartments  ]  .  ', '  To  the  left  of [1:  the   school   building ] ,  on  [ [ ARG1 ] :  the  plot  ]  [  V  :  owned  ]  [  ARG0  :  by  him  ]  ,  was  a  1-storey  building  of  the  school  infirmary  .  ', '  [  ARGM  -  LOC  :  In  1861 [ONTOTYPE: DATE (0.967)]  ]  ,  [  ARG0  : [3:  architect   V.P.   Lvov [ONTOTYPE: PERSON (0.6422)] ] ]  [  V  :  built  ]  [ [ ARG1 ] :  baths  ]  [  ARGM  -  LOC  :  in [3:  his ] place  in  the  1938 [ONTOTYPE: DATE (0.9597)]  ``  s  ]  [  ARGM  -  ADV  :  according  to  the  design  of  Alexander  Ivanovich  Hegello [ONTOTYPE: PERSON (0.8942)]  ,  Chairman  of  the  Union  of  Architects  of  Leningrad [ONTOTYPE: ORG (0.8652)]  ]  .  
', '  1922 [ONTOTYPE: DATE (0.9325)]  :  Petrograd  Cooperative [ONTOTYPE: ORG (0.5727)]  of  [  ARG1  : [4:  the   Agronomic   Institute [ONTOTYPE: ORG (0.5568)] ] ]  ,  [  V  :  registered  ]  [  ARG2  :  with  the  Pepo  Cooperative  Commission [ONTOTYPE: ORG (0.9466)]  ]  ;  Natural  History  and  Agriculture  Museum [ONTOTYPE: ORG (0.8408)]  (  ?  )  ']

"""

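# Raw string so the backslashes reach the regex engine unchanged.
# Keep only the non-empty pieces found between two delimiters ([, ] or :).
pattern = r"(?<=[\[\]:])(.*?)(?=[\[\]:])"
groups = [b.strip() for b in re.findall(pattern, a) if b.strip()]
for i, j in enumerate(groups):
    print("group %d: %s" % (i, j))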

Output

group 0: '  2  Architects
group 1: V
group 2: Stasov
group 3: ARG1
group 4: V.  P.  Melnikov
group 5: ARG1
group 6: A.
group 7: ARGM  -  LOC
group 8: 2
group 9: I.   Suzor   P.   Yu
group 10: ONTOTYPE
group 11: PERSON (0.851)
group 12: .  Year  of  construction
group 13: 1
group 14: 1835
group 15: ,  1895  -  1910
...
...
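The flat list above does not record which group sits inside which. If the nesting matters, one alternative to a regex is to walk the string once and keep a stack of open brackets. This is only a minimal sketch, assuming the brackets in each item are balanced (it raises ValueError otherwise); parse_brackets is a hypothetical helper name, and labels are left unsplit inside the text chunks.

```python
def parse_brackets(text):
    """Turn bracketed text into nested lists of strings and sub-lists."""
    root = []
    stack = [root]   # stack[-1] is the group currently being filled
    buf = []         # characters of the current plain-text run

    def flush():
        # Store the pending text run, if any, in the current group.
        chunk = "".join(buf).strip()
        if chunk:
            stack[-1].append(chunk)
        buf.clear()

    for ch in text:
        if ch == "[":
            flush()
            child = []
            stack[-1].append(child)  # attach the new group to its parent
            stack.append(child)
        elif ch == "]":
            flush()
            if len(stack) == 1:
                raise ValueError("unbalanced ']'")
            stack.pop()
        else:
            buf.append(ch)
    flush()
    if len(stack) != 1:
        raise ValueError("unbalanced '['")
    return root


tree = parse_brackets("[ ARG1 : from [0: the school ] ] were promoted")
print(tree)
# [['ARG1 : from', ['0: the school']], 'were promoted']
```

Each inner list is one bracket group, so the depth of nesting is preserved and a label can be split off afterwards with str.partition(":") if needed.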
2 Comments

That is a great solution. Could you elaborate on why it works, and how you came to this particular sequence of lookbehind and lookahead? As an aside, I've decided that I actually do need to recalculate all the data and store it in some more manageable format like JSON.
JSON is a good idea, as it gives us more control. I arrived at the regex above by analyzing the common pattern in your data: the groups you mentioned are surrounded by square brackets and colons (along with spaces), so I used those as the delimiters in the regex.
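To illustrate the delimiter idea on something small, here is a minimal sketch; the sample string is a hypothetical short input in the same style, and the names sample, pattern and pieces are made up for illustration. The lookbehind and lookahead only assert that a delimiter is present on each side and do not consume it, which is why the same bracket or colon can end one piece and start the next.

```python
import re

# Hypothetical short sample in the same bracket-and-colon style.
sample = "[ V : Stasov ] [1: 1835 ]"

# (?<=...) requires a delimiter just before the piece, (?=...) just after it.
# Neither is consumed, so the next match can reuse the same delimiter.
pattern = re.compile(r"(?<=[\[\]:])(.*?)(?=[\[\]:])")

# Whitespace-only pieces (such as the gap between "]" and "[") are dropped.
pieces = [p.strip() for p in pattern.findall(sample) if p.strip()]
print(pieces)
# ['V', 'Stasov', '1', '1835']
```

Note that every colon counts as a delimiter, so a tag such as [ONTOTYPE: PERSON (0.851)] comes out as two pieces, as groups 10 and 11 in the output show.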
