When parentheses are used in the pattern in the program below, the output is ['www.google.com'].

import re
teststring = "href=\"www.google.com\""
m=re.findall('href="(.*?)"',teststring)
print(m);

If the parentheses are removed from the pattern passed to findall, the output is ['href="www.google.com"'].

import re
teststring = "href=\"www.google.com\""
m=re.findall('href=".*?"',teststring)
print(m);

It would be helpful if someone could explain how this works.

  • The code you provided is exactly the same in both cases, but you are probably asking about grouping in regular expressions in general. Commented Jan 22, 2013 at 11:37
  • I've fixed your example code so that it actually produces the output you describe (the output was also missing the quotes). I left in the redundant semicolons though; Python does not need those. Commented Jan 22, 2013 at 11:45

1 Answer

The re.findall() documentation is quite clear on the difference:

Return all non-overlapping matches of pattern in string, as a list of strings. […] If one or more groups are present in the pattern, return a list of groups; this will be a list of tuples if the pattern has more than one group.

So .findall() returns a list containing one of three types of values, depending on the number of groups in the pattern:

  • 0 capturing groups in the pattern (no (...) parentheses): each item is the whole matched string ('href="www.google.com"' in your second example).
  • 1 capturing group in the pattern: each item is the text captured by that group ('www.google.com' in your first example).
  • more than 1 capturing group in the pattern: each item is a tuple of all captured groups.
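The three cases can be compared side by side. This is only a minimal sketch: the sample string (with an extra title attribute) and the \w+ patterns are assumptions made up for illustration and are not taken from the question.

```python
import re

# Hypothetical sample string with two attributes (not from the question).
sample = 'href="www.google.com" title="Google"'

# No capturing group: each item is the whole match.
no_group = re.findall(r'\w+=".*?"', sample)

# One capturing group: each item is only the text captured by that group.
one_group = re.findall(r'\w+="(.*?)"', sample)

# Two capturing groups: each item is a tuple with one entry per group.
two_groups = re.findall(r'(\w+)="(.*?)"', sample)

print(no_group)    # ['href="www.google.com"', 'title="Google"']
print(one_group)   # ['www.google.com', 'Google']
print(two_groups)  # [('href', 'www.google.com'), ('title', 'Google')]
```

In every case the pattern matches the same text; only the shape of the returned items changes with the number of capturing groups.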

Use non-capturing groups ((?:...)) if you don't want that behaviour, or add groups if you want more information. For example, adding a group around the href= part would result in a list of tuples with two elements each:

>>> re.findall('(href=)"(.*?)"', teststring)
[('href=', 'www.google.com')]
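Going the other way, a non-capturing group lets you group part of a pattern (for alternation, say) without changing what findall() returns. The following is a rough sketch; the href|src alternation and the sample string are hypothetical and chosen only to show the effect.

```python
import re

# Hypothetical sample string (not from the question).
sample = 'href="www.google.com" src="logo.png"'

# Capturing alternation: findall returns only the group, i.e. the attribute name.
captured = re.findall(r'(href|src)=".*?"', sample)

# Non-capturing alternation: no capturing groups, so the whole match is returned.
whole = re.findall(r'(?:href|src)=".*?"', sample)

# Non-capturing alternation plus one capturing group: only the value is returned.
values = re.findall(r'(?:href|src)="(.*?)"', sample)

print(captured)  # ['href', 'src']
print(whole)     # ['href="www.google.com"', 'src="logo.png"']
print(values)    # ['www.google.com', 'logo.png']
```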

3 Comments

Why should it be? It would only return a list of tuples (if that's what you meant) if there is more than one group.
My doubt is why href= is not included in the output even though it matches the pattern, i.e. how do groups behave in this example? Sorry, I'm new to Python.
@vindhya: The href is not grouped. Only the part matched by (.*?) (a capturing group) is returned. When you remove the group, the whole match is returned.
