1

I have a string defined as,

content = """f(1, 4, 'red', '/color/down1.html');
f(2, 5, 'green', '/color/colorpanel/down2.html');
f(3, 6, 'blue', '/color/colorpanel/colorlibrary/down3.html');"""

Here is the code I tried but it doesn't work:

results = re.findall(r"f(.*?)", content)
for each in results:
    print each

How can I use a regular expression to retrieve the links within the content? Thanks.

5
  • You should show us the code and regexes that you've tried already. Commented Feb 11, 2017 at 8:11
  • Here is the code I tried but it doesn't work. results = re.findall(r"f(.*?)", content) for each in results: print each Commented Feb 11, 2017 at 8:20
  • You probably want to use re.findall(re_pattern, content), where re_pattern is your regex. Commented Feb 11, 2017 at 8:21
  • That is exactly my question. What would be the correct pattern in order to retrieve the link. Commented Feb 11, 2017 at 8:24
  • Which links are you referring to? Is it just the last part, such as down3.html, or the whole link? Commented Feb 11, 2017 at 9:58

3 Answers

1

You can learn the basics of regexes and test patterns interactively at https://regex101.com/ and http://regexr.com/

In [4]: import re

In [5]: content = "f(1, 4, 'red', '/color/down1.html');    \
   ...: f(2, 5, 'green', '/color/colorpanel/down2.html');   \
   ...: f(3, 6, 'blue', '/color/colorpanel/colorlibrary/down3.html');"

In [6]: p = re.compile(r'(?=/).*?(?<=.html)')

In [7]: p.findall(content)
Out[7]: 
['/color/down1.html',
 '/color/colorpanel/down2.html',
 '/color/colorpanel/colorlibrary/down3.html']

. matches any character (except for line terminators).

*? Quantifier: matches between zero and unlimited times, as few times as possible, expanding as needed (lazy).
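To illustrate the difference the lazy quantifier makes, here is a minimal sketch that runs the same pattern with `.*?` and with `.*`. It assumes the content is on a single line, as in the session above, and it escapes the dot before `html`, which is a small tightening that is not in the original pattern.

```python
import re

# Same data as in the question, joined onto one line as in the session above.
content = (
    "f(1, 4, 'red', '/color/down1.html');    "
    "f(2, 5, 'green', '/color/colorpanel/down2.html');    "
    "f(3, 6, 'blue', '/color/colorpanel/colorlibrary/down3.html');"
)

# Lazy: stop at the first position that is preceded by ".html".
lazy = re.findall(r"(?=/).*?(?<=\.html)", content)

# Greedy: run to the end of the line, then back off to the LAST ".html".
greedy = re.findall(r"(?=/).*(?<=\.html)", content)

print(lazy)         # three separate paths
print(len(greedy))  # 1: a single match spanning all three calls
```

With the greedy form, the one match starts at the first `/` and ends at `down3.html`, swallowing everything in between, which is why the lazy `?` is needed here.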

You can also get just the part after the last /

In [8]: p2 = re.compile(r'[^/]*.html')

In [9]: p2.findall(content)
Out[9]: ['down1.html', 'down2.html', 'down3.html']

[^/] matches a single character that is not in the list below

* Quantifier: matches between zero and unlimited times, as many times as possible, giving back as needed (greedy)

/ matches the character / literally (case sensitive)

. matches any character (except for line terminators). To match only a literal dot, escape it as \.

html matches the characters html literally (case sensitive).
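Because an unescaped `.` matches any character, the pattern above also accepts names that merely end in `html` with some other character in front. The following sketch compares the unescaped and escaped forms; the `sample` string is made up for illustration and is not part of the question's data.

```python
import re

# Hypothetical sample: one path ends in "xhtml" (no dot), one in "page.html".
sample = "/docs/xhtml /docs/page.html"

# Unescaped dot: any character may stand before "html".
loose = re.findall(r"[^/]*.html", sample)

# Escaped dot: a literal "." is required before "html".
strict = re.findall(r"[^/]*\.html", sample)

print(loose)   # ['xhtml', 'page.html']
print(strict)  # ['page.html']
```

On the question's data both forms return the same three file names, so the escape only matters if other text ending in `html` can appear.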

Or you can extract each complete f(...) call:

In [15]: p3 = re.compile(r"(?=f\().*?(?<=\);)")

In [16]: p3.findall(content)
Out[16]: 
["f(1, 4, 'red', '/color/down1.html');",
 "f(2, 5, 'green', '/color/colorpanel/down2.html');",
 "f(3, 6, 'blue', '/color/colorpanel/colorlibrary/down3.html');"]

2 Comments

Regarding p = re.compile(r'(?=/).*?(?<=.html)'), why not simply p = re.compile(r'(?=/).*(?<=.html)')? What is the purpose of adding ? after *? Thanks.
@dullboy, I added an explanation to the answer. If you think my answer solved your problem, please consider accepting it. Thanks.
0

You could do something like:

re.findall(r"f\(.*,.*,.*, '(.*)'", content)
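One caveat: the greedy `.*` pieces only stay inside one call when each call sits on its own line, because `.` does not match a newline by default. The sketch below shows that behaviour and one possible tightening using negated character classes; the `narrow` pattern is an illustrative variant, not part of the original answer.

```python
import re

calls = [
    "f(1, 4, 'red', '/color/down1.html');",
    "f(2, 5, 'green', '/color/colorpanel/down2.html');",
    "f(3, 6, 'blue', '/color/colorpanel/colorlibrary/down3.html');",
]
multi_line = "\n".join(calls)     # one call per line, as in the question
single_line = "    ".join(calls)  # all calls on one line

greedy = r"f\(.*,.*,.*, '(.*)'"
# Illustrative variant: each argument may not contain a comma,
# and the captured link may not contain a quote.
narrow = r"f\([^,]*,[^,]*,[^,]*,\s*'([^']*)'"

print(re.findall(greedy, multi_line))   # all three links
print(re.findall(greedy, single_line))  # only the last link
print(re.findall(narrow, single_line))  # all three links
```

The narrower pattern assumes the first three arguments never contain a comma and the link never contains a single quote, which holds for the data shown here.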

1 Comment

That is really a smart one. Thanks.
0

You can try like so:

import re

content = """f(1, 4, 'red', '/color/down1.html');    
    f(2, 5, 'green', '/color/colorpanel/down2.html');    
    f(3, 6, 'blue', '/color/colorpanel/colorlibrary/down3.html');"""

# Parenthesized print works in both Python 2 and Python 3.
print(re.findall(r"(\/[^']+?)'", content))

Output:

['/color/down1.html', '/color/colorpanel/down2.html', '/color/colorpanel/colorlibrary/down3.html']  

Regex:

(\/[^']+?)' - matches / followed by one or more non-' characters up to the first occurrence of ', and captures the link (without the closing ') in group 1.
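Since `[^']` can never run past the closing quote, the lazy `+?`, the capture group, and the trailing `'` are arguably optional for this data. A minimal sketch comparing the pattern above with a shorter equivalent (the shorter form is a suggestion, not part of the original answer):

```python
import re

content = """f(1, 4, 'red', '/color/down1.html');
f(2, 5, 'green', '/color/colorpanel/down2.html');
f(3, 6, 'blue', '/color/colorpanel/colorlibrary/down3.html');"""

original = re.findall(r"(\/[^']+?)'", content)

# Shorter form: a slash, then everything up to (not including) the next quote.
shorter = re.findall(r"/[^']+", content)

print(original == shorter)  # True for this data
```

Both forms rely on the links being the only places in the content where a `/` appears.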
