5

I have something like this:

Othername California (2000) (T) (S) (ok) {state (#2.1)}

Is there a regex code to obtain:

Othername California ok 2.1

I.e. I would like to keep the number inside the round parentheses that are in turn inside {}, and keep the text "ok" that is inside (). I specifically need the string "ok" to be printed if it appears in a line, but I would like to get rid of any other text in parentheses, e.g. (V), (S) or (2002).

I am aware that regex is probably not the most efficient way to handle this kind of problem.

Any help would be appreciated.

EDIT:

The string may vary, since information that is unavailable is simply not included in the line. The text itself also changes (e.g. I don't have "state" in every line). So one can have, for example:

Name1 Name2 Name3 (2000) (ok) {edu (#1.1)}
Name1 Name2 (2002) {edu (#1.1)}
Name1 Name2 Name3 (2000) (V) {variation (#4.12)}
6
  • Is the order of the data strict? (E.g. "Something state (year) (.) (.) (ok?) {state (#number)}"?) In that case I think you need to use the split function: pythonforbeginners.com/python-strings/python-split Commented Jun 18, 2013 at 8:41
  • No, actually it may vary from line to line, information is included only if available Commented Jun 18, 2013 at 8:46
  • Regex metacharacters must be escaped: the characters (){} need a preceding "\", for example \{. You can test it at gskinner.com/RegExr Commented Jun 18, 2013 at 8:57
  • The real challenge is matching 2.1 here. It gets much harder if we want to account for multiple instances of it, for example {state (#2.1) yellow (33)}. In theory there are two ways to solve this: 1) look ahead and behind for the surrounding {}, but lookbehinds must be of fixed length in most regex flavors (Python's included); 2) use subgroup matching, something like \{(?:.*?\((\w+)\).*?)+\}, which isn't available in most regex flavors. So I think your mission is impossible with pure regex power. Commented Jun 18, 2013 at 9:13
  • Can you post more examples of possible inputs? It's unclear what parts of the string stay the same and what may vary. Commented Jun 18, 2013 at 10:11
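
One way around the lookbehind limitation raised in the comments is to stop insisting on a single pattern and run two passes: first isolate each {...} block, then search inside it. This is only a sketch under assumptions: the function name `values_in_braces` is made up for illustration, and it assumes braces are never nested.

```python
import re

# Pass 1: contents of each {...} block (assumes braces are not nested).
BRACES = re.compile(r'\{([^{}]*)\}')
# Pass 2: a parenthesised value, with or without a leading '#'.
PAREN_VALUE = re.compile(r'\(#?([\w.]+)\)')

def values_in_braces(line):
    """Return every parenthesised value found inside {...} blocks."""
    found = []
    for block in BRACES.findall(line):
        found.extend(PAREN_VALUE.findall(block))
    return found

line = 'Othername California (2000) (ok) {state (#2.1) yellow (33)}'
print(values_in_braces(line))  # ['2.1', '33']
```

Parentheses outside the braces, such as (2000) and (ok), are never seen by the second pattern, so no lookbehind is needed.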

4 Answers

8

Regex

(.+)\s+\(\d+\).+?(?:\(([^)]{2,})\)\s+(?={))?\{.+\(#(\d+\.\d+)\)\}

Regular expression image

Text used for test

Name1 Name2 Name3 (2000) {Education (#3.2)}
Name1 Name2 Name3 (2000) (ok) {edu (#1.1)}
Name1 Name2 (2002) {edu (#1.1)}
Name1 Name2 Name3 (2000) (V) {variation (#4.12)}
Othername California (2000) (T) (S) (ok) {state (#2.1)}

Test

>>> import re
>>> regex = re.compile(r"(.+)\s+\(\d+\).+?(?:\(([^)]{2,})\)\s+(?={))?\{.+\(#(\d+\.\d+)\)\}")
>>> r = regex.search(string)
>>> r
<_sre.SRE_Match object at 0x54e2105f36c16a48>
>>> regex.match(string)
<_sre.SRE_Match object at 0x54e2105f36c169e8>

# Run findall
>>> regex.findall(string)
[
   (u'Name1 Name2 Name3'   , u''  , u'3.2'),
   (u'Name1 Name2 Name3'   , u'ok', u'1.1'),
   (u'Name1 Name2'         , u''  , u'1.1'),
   (u'Name1 Name2 Name3'   , u''  , u'4.12'),
   (u'Othername California', u'ok', u'2.1')
]
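
To get from those tuples to the single output line asked for in the question, something like the following might work. It is a sketch, not part of the tested session above: the helper name `summarize` is hypothetical, it processes one line at a time, and it checks for `None` so that a line the pattern does not match cannot raise `AttributeError`. I also assume only the literal "ok" should be kept, since the second group would otherwise capture any parenthesised text of two or more characters.

```python
import re

PATTERN = re.compile(
    r"(.+)\s+\(\d+\).+?(?:\(([^)]{2,})\)\s+(?={))?\{.+\(#(\d+\.\d+)\)\}"
)

def summarize(line):
    """Return 'name [ok] number' for one line, or None if it does not match."""
    match = PATTERN.search(line)
    if match is None:              # search() returns None when nothing matches
        return None
    name, flag, number = match.groups()
    parts = [name]
    if flag == "ok":               # assumption: keep only the literal "ok"
        parts.append(flag)
    parts.append(number)
    return " ".join(parts)

print(summarize("Othername California (2000) (T) (S) (ok) {state (#2.1)}"))
# Othername California ok 2.1
print(summarize("Name1 Name2 (2002) {edu (#1.1)}"))
# Name1 Name2 1.1
```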

7 Comments

Cool. How did you generate the Regex graph?
Unfortunately it doesn't work on all my text lines and gives an error. I guess the problem is that the text strings change all the time. E.g. there might be some other word instead of "state", and there might also be multiple words instead of it. The only recurring pattern is the presence of parentheses.
@phimuemue I used debuggex.com. There is an option on the website for embedding any regular expression on SO.
@user2447387 so try replacing stats\s+ with .+
I don't know why, but testing it with another line in my database does not work: "Name1 Name2 Name3 (2000) {Education (#3.2)}". It gives me "AttributeError: 'NoneType' object has no attribute 'groups'". Unfortunately, information is sometimes missing from the line when it is not available.
2

Try this one:

import re

thestr = 'Othername California (2000) (T) (S) (ok) {state (#2.1)}'

regex = r'''
    ([^(]*)             # match anything but a (
    \                   # a space
    (?:                 # non capturing parentheses
        \([^(]*\)       # parentheses
        \               # a space
    ){3}                # three times
    \(([^(]*)\)         # capture fourth parentheses contents
    \                   # a space
    {                   # opening {
        [^}]*           # anything but }
        \(\#            # opening ( followed by #
            ([^)]*)     # match anything but )
        \)              # closing )
    }                   # closing }
'''

match = re.match(regex, thestr, re.X)

print(match.groups())

Output:

('Othername California', 'ok', '2.1')

And here's the compressed version:

import re

thestr = 'Othername California (2000) (T) (S) (ok) {state (#2.1)}'
regex = r'([^(]*) (?:\([^(]*\) ){3}\(([^(]*)\) {[^}]*\(\#([^)]*)\)}'
match = re.match(regex, thestr)

print(match.groups())
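
Note that `{3}` requires exactly three parenthesised groups before (ok), which holds for the first example but not for the shorter lines in the edited question. A possible relaxation is sketched below; it is my own variation rather than a tested drop-in, and the name `FLEXIBLE` is hypothetical. It allows any number of groups other than (ok) and makes (ok) itself optional.

```python
import re

FLEXIBLE = re.compile(r'''
    ([^(]*?)                          # name: lazily, anything but (
    \s*
    (?: \( (?!ok\)) [^()]* \) \s* )*  # any number of (...) groups except (ok)
    (?: \( (ok) \) \s* )?             # optional (ok), captured
    \{ [^}]* \( \# ([^)]*) \) \}      # {... (#number)}, number captured
''', re.VERBOSE)

for line in [
    'Othername California (2000) (T) (S) (ok) {state (#2.1)}',
    'Name1 Name2 (2002) {edu (#1.1)}',
    'Name1 Name2 Name3 (2000) (V) {variation (#4.12)}',
]:
    match = FLEXIBLE.match(line)
    if match:                         # match() returns None on failure
        print(match.groups())
# ('Othername California', 'ok', '2.1')
# ('Name1 Name2', None, '1.1')
# ('Name1 Name2 Name3', None, '4.12')
```

The second group is `None` when (ok) is absent, so it has to be filtered out before joining the pieces.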


1

Despite what I said in the comments, I've found a way around it:

(?(?=\([^()\w]*[\w.]+[^()\w]*\))\([^()\w]*([\w.]+)[^()\w]*\)|.)(?=[^{]*\})|(?<!\()(\b\w+\b)(?!\()|ok

Explanation:

(?                                  # If
(?=\([^()\w]*[\w.]+[^()\w]*\))      # There is (anything except [()\w] zero or more times, followed by [\w.] one or more times, followed by anything except [()\w] zero or more times)
\([^()\w]*([\w.]+)[^()\w]*\)        # Then match it, and put [\w.] in a group
|                                   # else
.                                   # advance with one character
)                                   # End if
(?=[^{]*\})                         # Look ahead if there is anything except { zero or more times followed by }

|                                   # Or
(?<!\()(\b\w+\b)(?!\()              # Match a word not enclosed between parenthesis
|                                   # Or
ok                                  # Match ok

Online demo

5 Comments

Sorry for asking (I'm a newbie at Python and coding in general), but can you give me a couple of other lines to test this with? I've tried it with re.sub but it gives me an error. Thanks!
I've tried substituting your regex into a re.sub call and into the code from the 1st answer, but it gives me an error... let me try a bit more...
It seems that Python doesn't support this kind of if/else construct; try (?:(?=\([^()\w]*[\w.]+[^()\w]*\))\([^()\w]*([\w.]+)[^()\w]*\)|(?!\([^()\w]*[\w.]+[^()\w]*\)).)(?=[^{]*\})|(?<!\()(\b\w+\b)(?!\()|ok
This time there is no error, but I get the wrong output. With "Name1 Name2 Name3 (2000) (V) {variation (#4.12)}" as the string I get: "Name1 Name2 Name3 (2000) (V) }"
@user2447387 I upvoted your question so that you get 20 rep, you can now maybe ask for help in the python chatroom.
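
Since the comments mention re.sub: Python's re module only supports conditionals that test whether a group matched, `(?(id/name)yes|no)`, not a lookahead as the condition, which is why the first pattern fails there. A simpler route with re.sub is to delete what is not wanted in a few small steps instead of matching what is. The sketch below is a different technique from the regex in this answer; the function name `strip_annotations` is made up, and it assumes each {...} block holds exactly one "(#number)".

```python
import re

def strip_annotations(line):
    """Keep the name, a literal 'ok' and the number from {... (#n.n)}."""
    # 1. Replace "{... (#2.1) ...}" with just the number
    #    (assumption: one "(#number)" per block).
    line = re.sub(r'\{[^{}]*\(#([\d.]+)\)[^{}]*\}', r'\1', line)
    # 2. Unwrap "(ok)" so the next step does not delete it.
    line = line.replace('(ok)', 'ok')
    # 3. Drop every remaining parenthesised group.
    line = re.sub(r'\([^()]*\)', '', line)
    # 4. Collapse the leftover runs of whitespace.
    return ' '.join(line.split())

print(strip_annotations('Othername California (2000) (T) (S) (ok) {state (#2.1)}'))
# Othername California ok 2.1
print(strip_annotations('Name1 Name2 Name3 (2000) (V) {variation (#4.12)}'))
# Name1 Name2 Name3 4.12
```

Because each step is independent, a line that lacks (ok) or has fewer parenthesised groups passes through without raising an error.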
0

Another option is:

^(\w+\s?\w+)\s?\(\d{1,}\)\s?\(\w+\)\s?\(\w+\)\s?\((\w+)\)\s?.*#(\d+\.\d+)

