I have a text file with two types of lines. One type looks like:

'6-digit-primary-id','6-digit-secondary-id',subject,author,text

The other is just words with no specific pattern. In the former case, I want to know the primary id along with the text and in the latter I want to get the words. What I've tried:

PATTERN = r'[1-9]{6},[1-9]{6},?*,?*,*'
match = re.match(PATTERN,input_line)
if match:
    primary_id = match.group()[0]
    text = match.group()[7]
else:
    text = input_line

But obviously I'm doing something wrong (I'm getting 'invalid syntax').

Can anyone please point me in the right direction?

2
  • What do you intend ? to mean in r'[1-9]{6},[1-9]{6},?*,?*,*'? Commented Jan 11, 2014 at 16:05
  • Also, what exactly do you expect match.group()[7] to do? You don't have any groups in your pattern. Commented Jan 11, 2014 at 16:06

2 Answers

2

? has a special meaning in regex patterns. It (greedily) matches 0 or 1 of the preceding regex. So ,? matches a comma or no comma. ,?* raises re.error ("multiple repeat"), because * cannot be applied directly to another quantifier.
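
To see the difference between ,? and ,?* for yourself, here is a small sketch. The variable names are only illustrative, and the exact error message may vary between Python versions, so it is printed rather than assumed.

```python
import re

# ',?' is a valid pattern: an optional comma.
optional_comma = re.compile(r',?')
with_comma = optional_comma.match(',x').group()     # matches the comma
without_comma = optional_comma.match('x').group()   # matches the empty string

# ',?*' stacks '*' directly on top of '?', which the re module rejects
# at compile time.
try:
    re.compile(r',?*')
    compile_error = None
except re.error as exc:
    compile_error = exc

print(repr(with_comma), repr(without_comma))
print(compile_error)
```

Since the failure happens when the pattern is compiled, no input line is ever examined.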

Perhaps you intended . instead of ?. The . matches any character except a newline (unless the re.DOTALL flag is specified).

import re

PATTERN = r'(\d{6}),(\d{6}),(.*?),(.*?),(.*)'
match = re.match(PATTERN, input_line)
if match:
    primary_id = match.group(1)
    text = match.group(5)
else:
    text = input_line

Some other suggestions:

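  • You can use \d in place of the character class [0-9]. Note that this adds 0 to your character class (I assume that is okay). If not, you can stick with [1-9]{6}. In Python 3, \d in a str pattern also matches other Unicode decimal digits unless the re.ASCII flag is set.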
  • If you put groups in your regex pattern, then you can specify the parts using match.group(num) instead of match.group()[num]. (And it looks like you want match.group(5) rather than match.group()[7].)
  • The pattern .* matches as many characters as possible. .*? matches non-greedily. You need to match non-greedily for the subject and author patterns, lest they expand to match the remainder of the entire line.
  • An alternative to .*? here would be [^,]*. This matches 0-or-more characters other than a comma.

    PATTERN = r'(\d{6}),(\d{6}),([^,]*),([^,]*),(.*)'
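
Putting the suggestions above together, a minimal sketch might look like this. The function name parse_line and the sample lines are made up for illustration, and it assumes the ids appear without surrounding quotes.

```python
import re

# Two equivalent ways to stop the subject and author fields at the next comma.
LAZY = re.compile(r'(\d{6}),(\d{6}),(.*?),(.*?),(.*)')
NO_COMMA = re.compile(r'(\d{6}),(\d{6}),([^,]*),([^,]*),(.*)')


def parse_line(input_line, pattern=NO_COMMA):
    """Return (primary_id, text); primary_id is None for free-form lines."""
    match = pattern.match(input_line)
    if match:
        return match.group(1), match.group(5)
    return None, input_line


# Hypothetical sample data; the text field itself contains a comma.
structured = '123456,654321,Greetings,Alice,Hello, world'
free_form = 'just some words'

print(parse_line(structured))
print(parse_line(structured, LAZY))
print(parse_line(free_form))
```

Only the last group uses a greedy .*, so commas inside the text field stay part of the text.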
    
1

In regular expressions, * means zero or more occurrences of the preceding element and ? means zero or one occurrence of the preceding element. So ?* is not a valid expression. You are probably confusing it with .*?, which means "any character, zero or more times, but matching as few as possible" (non-greedy).

You probably want

PATTERN = r'[1-9]{6},[1-9]{6},.*?,.*?,.*'
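
Note that this pattern contains no parentheses, so it only tells you whether a line matches; it captures nothing. A short sketch of what the match object gives you in that case (the sample line is made up):

```python
import re

UNGROUPED = r'[1-9]{6},[1-9]{6},.*?,.*?,.*'
sample = '123456,987654,Greetings,Alice,Hello, world'  # hypothetical input

match = re.match(UNGROUPED, sample)

whole = match.group()            # the entire matched text
groups = match.groups()          # empty tuple: the pattern has no groups
first_char = match.group()[0]    # just the first character, not the primary id

# Asking for a numbered group that does not exist raises IndexError.
try:
    match.group(1)
    missing_group_error = None
except IndexError as exc:
    missing_group_error = exc

print(whole, groups, first_char, missing_group_error)
```

To pull out the primary id and the text, wrap the parts you need in parentheses and read them with match.group(num).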
