Python parsing multiple url into array from XML

Question

I would like to extract multiple URLs from a node and place them into a string array. Currently I'm saving all the text from the desired node into a string:

imgsUrl= value.text

then I am parsing the string and getting the correct url.

imgUrlSubString = imgsUrl[imgsUrl.find("http://"):imgsUrl.find(".JPG")+4]

My issue with this is there could be 1-200 urls I need from imgsUrl, and I'm only able to obtain one of them. Is there a good solution to place all of them into an array that would be less tedious?

sample input:

sampleStr = ('<ul><li><a href="http://website/abc/vcd/HHD00300.JPG">HHD00300.JPG</a></li>'
             '<li><a href="http://website/abc/vcd//HHD003002.jpg">HHD003002.jpg</a></li></ul>')

output:

print imgUrlSubString
outputs this:  http://website/abc/vcd//HHD003000.JPG

expected output:

['http://website/abc/vcd//HHD003000.JPG','http://website/abc/vcd//HHD003002.JPG',....]

Regex should do the trick. See this answer: stackoverflow.com/a/6883094/447599
– Jules Gagnon-Marchand, Commented Nov 20, 2014 at 18:51
@vikramls alright sample input with corresponding output has been included
– BFlint, Commented Nov 20, 2014 at 20:27
possible duplicate of Python xml ElementTree from a string source?
– ivan_pozdeev, Commented Nov 20, 2014 at 20:38
@Julius This seems to work great. Is this a similar approach that niroyb mentioned below? If so, I'd like to mark one of these as the answer. thanks!
– BFlint, Commented Nov 20, 2014 at 21:07

vikramls · Accepted Answer · 2014-11-20 21:18:17Z

0

Here's my answer - I used lxml.html to parse the HTML. It is generally a bad idea to use regexes to parse HTML (see @ivan_pozdeev's answer above).

import lxml.html

sampleStr='<ul><li><a href="http://website/abc/vcd/HHD00300.JPG">HHD00300.JPG</a></li><li><a href="http://website/abc/vcd//HHD003002.jpg">HHD003002.jpg</a></li></ul>'
html = lxml.html.fromstring(sampleStr)
print(html.xpath('//a/@href'))

The code uses an XPath expression to retrieve the href attribute of every a tag in the string sampleStr.

Sample output:

['http://website/abc/vcd/HHD00300.JPG', 'http://website/abc/vcd//HHD003002.jpg']

answered Nov 20, 2014 at 21:18

vikramls

2 Comments

BFlint Over a year ago

Is it still possible to access html like an array, for example...print html[0] would print 'website/abc/vcd//HHD003000.JPG'

vikramls Over a year ago

Yes, you would store the expression like this: href_list = html.xpath('//a/@href') and you now have a list href_list which you can iterate over or access directly using href_list[0].
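
To illustrate the list-style access described above without third-party packages, here is a minimal sketch that swaps lxml for the standard library's xml.etree.ElementTree. It assumes the node text is well-formed XML, as the sample input happens to be; real-world HTML often is not, and then lxml.html or BeautifulSoup is the safer choice. The name sample_str is just a stand-in for the question's sampleStr.

```python
import xml.etree.ElementTree as ET

# Sample markup from the question; it happens to be well-formed XML.
sample_str = (
    '<ul><li><a href="http://website/abc/vcd/HHD00300.JPG">HHD00300.JPG</a></li>'
    '<li><a href="http://website/abc/vcd//HHD003002.jpg">HHD003002.jpg</a></li></ul>'
)

root = ET.fromstring(sample_str)

# Collect the href attribute of every <a> element, in document order.
href_list = [a.get("href") for a in root.iter("a") if a.get("href") is not None]

# href_list is an ordinary Python list, so indexing and iteration both work.
print(href_list[0])
for url in href_list:
    print(url)
```

Because the result is a plain list, it also works with len(href_list) or slicing when there are many URLs.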

Community · Accepted Answer · 2017-05-23 11:49:27Z

0

You can use the re.findall function. It returns all non-overlapping regular expression matches directly in a list.

print( re.findall("http://.*?\.JPG", imgsUrl) )

Using ".*?" instead of ".*" is important in this case: there can be multiple URLs in the string, so you want the non-greedy match, which stops at the first ".JPG" after each "http://".
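
A small sketch of the difference, run against the sample input from the question. One assumption is added here that is not in the original answer: the sample mixes ".JPG" and ".jpg", so the pattern is compiled with re.IGNORECASE; without it the second URL would be missed. The name sample_str is a stand-in for the question's string.

```python
import re

sample_str = (
    '<ul><li><a href="http://website/abc/vcd/HHD00300.JPG">HHD00300.JPG</a></li>'
    '<li><a href="http://website/abc/vcd//HHD003002.jpg">HHD003002.jpg</a></li></ul>'
)

# Greedy: ".*" runs to the LAST ".jpg" in the string, swallowing both links.
greedy = re.findall(r"http://.*\.JPG", sample_str, flags=re.IGNORECASE)

# Non-greedy: ".*?" stops at the FIRST ".jpg" after each "http://".
lazy = re.findall(r"http://.*?\.JPG", sample_str, flags=re.IGNORECASE)

print(len(greedy))  # one over-long match
print(lazy)         # one entry per URL
```

This only behaves well on input as regular as the sample; for arbitrary markup a parser remains the more robust route.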

The best way to go, though, is to use an XML parser. For Python, BeautifulSoup and lxml are pretty popular.

See these answers:

edited May 23, 2017 at 11:49

CommunityBot

answered Nov 20, 2014 at 19:09

niroyb

1 Comment

ivan_pozdeev Over a year ago

Read stackoverflow.com/a/1732454/648265 at once. And each time you'll think of providing such an answer ever again.

xbb · Accepted Answer · 2014-11-20 21:24:07Z

0

You can use BeautifulSoup to parse this string.

from bs4 import BeautifulSoup
soup = BeautifulSoup(sampleStr, "html.parser")  # name the parser explicitly to avoid a warning
links = soup.find_all("a")
output = []
for link in links:
    output.append(link["href"])

And here's the output:

print(output)
>>> ['http://website/abc/vcd/HHD00300.JPG', 'http://website/abc/vcd//HHD003002.jpg']
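
If installing bs4 is not an option, the same loop can be sketched with the standard library's html.parser module (Python 3). This is a swapped-in technique, not what the answer above uses; the class name HrefCollector and the variable sample_str are hypothetical names chosen for illustration.

```python
from html.parser import HTMLParser


class HrefCollector(HTMLParser):
    """Collects the href value of every <a> start tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the current tag.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.hrefs.append(value)


sample_str = (
    '<ul><li><a href="http://website/abc/vcd/HHD00300.JPG">HHD00300.JPG</a></li>'
    '<li><a href="http://website/abc/vcd//HHD003002.jpg">HHD003002.jpg</a></li></ul>'
)

collector = HrefCollector()
collector.feed(sample_str)
collector.close()

output = collector.hrefs
print(output)
```

Unlike an XML parser, html.parser tolerates markup that is not well-formed, which matters if the node text is scraped HTML.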

answered Nov 20, 2014 at 21:24

xbb

1 Comment

BFlint Over a year ago

Thanks, this method also works for my problem. Not sure if there's a better choice but both work, thanks a lot!
