
I would like to extract multiple URLs from a node and place them into a list of strings. Currently I save all the text from the desired node into a string:

imgsUrl= value.text

Then I slice the string to get a URL:

imgUrlSubString = imgsUrl[imgsUrl.find("http://"):imgsUrl.find(".JPG") + 4]

My issue is that imgsUrl may contain anywhere from 1 to 200 URLs, and this approach only gives me the first one. Is there a less tedious way to collect all of them into a list?

sample input:

sampleStr = '<ul><li><a href="http://website/abc/vcd/HHD00300.JPG">HHD00300.JPG</a></li><li><a href="http://website/abc/vcd//HHD003002.jpg">HHD003002.jpg</a></li></ul>'

output:

print imgUrlSubString
outputs this:  http://website/abc/vcd//HHD003000.JPG

expected output:

['http://website/abc/vcd//HHD003000.JPG','http://website/abc/vcd//HHD003002.JPG',....]
  • Can you post a sample input and the expected output? Commented Nov 20, 2014 at 18:31
  • Regex should do the trick. See this answer: stackoverflow.com/a/6883094/447599 Commented Nov 20, 2014 at 18:51
  • @vikramls alright sample input with corresponding output has been included Commented Nov 20, 2014 at 20:27
  • possible duplicate of Python xml ElementTree from a string source? Commented Nov 20, 2014 at 20:38
  • @Julius This seems to work great. Is this a similar approach that niroyb mentioned below? If so, I'd like to mark one of these as the answer. thanks! Commented Nov 20, 2014 at 21:07

3 Answers


Here's my answer - I used lxml.html to parse the HTML. It is generally a bad idea to use regexes to parse HTML (see @ivan_pozdeev's answer above).

import lxml.html

sampleStr='<ul><li><a href="http://website/abc/vcd/HHD00300.JPG">HHD00300.JPG</a></li><li><a href="http://website/abc/vcd//HHD003002.jpg">HHD003002.jpg</a></li></ul>'
html = lxml.html.fromstring(sampleStr)
print(html.xpath('//a/@href'))  # parenthesized form works on Python 2 and 3

The code uses an XPath expression to retrieve the href attribute of every a tag in the string sampleStr.

Sample output:

['http://website/abc/vcd/HHD00300.JPG', 'http://website/abc/vcd//HHD003002.jpg']
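
If the node can also contain links that are not images, the list returned by xpath can be narrowed afterwards with plain Python. This is only a sketch: keep_jpg_links is a hypothetical helper (not part of lxml), and the third URL is a made-up entry added for illustration.

```python
# Hypothetical helper, not part of lxml: keep only links ending in .jpg,
# compared case-insensitively so that both .JPG and .jpg are accepted.
def keep_jpg_links(hrefs):
    return [h for h in hrefs if h.lower().endswith(".jpg")]

# Stand-in for the result of html.xpath('//a/@href').
href_list = [
    "http://website/abc/vcd/HHD00300.JPG",
    "http://website/abc/vcd//HHD003002.jpg",
    "http://website/abc/readme.txt",  # made-up non-image link
]

jpg_links = keep_jpg_links(href_list)
print(jpg_links)
```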

2 Comments

Is it still possible to access html like an array? For example, would print html[0] print 'website/abc/vcd//HHD003000.JPG'?
Yes, you would store the expression like this: href_list = html.xpath('//a/@href') and you now have a list href_list which you can iterate over or access directly using href_list[0].

You can use the re.findall function. It returns all non-overlapping matches of a regular expression as a list.

import re
print(re.findall(r"http://.*?\.jpg", imgsUrl, re.IGNORECASE))  # IGNORECASE so .JPG and .jpg both match

Using ".*?" instead of ".*" is important here: the string can contain multiple URLs, so you want the non-greedy match, which stops at the first extension it finds rather than the last.
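
To make the difference concrete, here is a small sketch that runs both patterns on the sample string from the question. re.IGNORECASE is an addition here, used because the sample mixes .JPG and .jpg; the variable names are illustrative.

```python
import re

sample = ('<ul><li><a href="http://website/abc/vcd/HHD00300.JPG">HHD00300.JPG</a></li>'
          '<li><a href="http://website/abc/vcd//HHD003002.jpg">HHD003002.jpg</a></li></ul>')

# Greedy: ".*" runs to the LAST ".jpg" in the string, so everything between
# the first "http://" and the final extension becomes one oversized match.
greedy = re.findall(r"http://.*\.jpg", sample, re.IGNORECASE)

# Non-greedy: ".*?" stops at the FIRST ".jpg" after each "http://".
lazy = re.findall(r"http://.*?\.jpg", sample, re.IGNORECASE)

print(len(greedy))  # 1
print(lazy)
```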

The best way to go, though, is to use an HTML/XML parser. For Python, BeautifulSoup and lxml are both popular choices.
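
If installing a third-party package is not an option, the standard library's html.parser module can do the same job. The following is a minimal sketch assuming Python 3; HrefCollector is a hypothetical class name, not something from the libraries mentioned above.

```python
from html.parser import HTMLParser


class HrefCollector(HTMLParser):
    """Collects the href attribute of every <a> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value is None for bare attributes.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.hrefs.append(value)


sample = ('<ul><li><a href="http://website/abc/vcd/HHD00300.JPG">HHD00300.JPG</a></li>'
          '<li><a href="http://website/abc/vcd//HHD003002.jpg">HHD003002.jpg</a></li></ul>')

collector = HrefCollector()
collector.feed(sample)
collector.close()
print(collector.hrefs)
```

Unlike the regex, this reads the URLs from the href attributes themselves, so it does not depend on the scheme or the file extension.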

See these answers:

1 Comment

Read stackoverflow.com/a/1732454/648265 at once. And each time you'll think of providing such an answer ever again.

You can use BeautifulSoup to parse this string.

from bs4 import BeautifulSoup
soup = BeautifulSoup(sampleStr, "html.parser")  # name the parser explicitly to avoid bs4's "no parser specified" warning
links = soup.find_all("a")
output = []
for link in links:
    output.append(link["href"])

And here's the output:

>>> print(output)
['http://website/abc/vcd/HHD00300.JPG', 'http://website/abc/vcd//HHD003002.jpg']

1 Comment

Thanks, this method also works for my problem. Not sure if there's a better choice but both work, thanks a lot!
