I'm trying to extract particular strings from markup and save them (for more complex processing of the line later). For example, say I've read in a line from a file and the current line is:

<center><img border="0" src="http://www.world-of-waterfalls.com/images/Cascades_04_015L.jpg"  WIDTH="500" HEIGHT="375" alt="Looking up the Merced River Canyon towards Bridalveil Fall from the Big Oak Flat Road" ***PINIT***></center><br clear="all"><br clear="all">

But I want to store:

tempUrl = 'http://www.world-of-waterfalls.com/images/Cascades_04_015L.jpg'

tempWidth = 500

tempHeight = 375

tempAlt = 'Looking up the Merced River Canyon towards Bridalveil Fall from the Big Oak Flat Road'

How would I go about doing that in Python?

Thanks

  • Let me save you the trouble and just tell you that regex is out for this. Don't even think of trying it; you will only hit your head later on. If the data is from a web source, look into BeautifulSoup, Scrapy, or any other "scraping" library. If you already have the markup, you can just use a parser, traverse the nodes, and gather the attribute information. Commented Dec 15, 2016 at 17:16
  • Use HTMLParser (Python 2) or html.parser (Python 3), depending on your Python version. Commented Dec 15, 2016 at 17:16
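
For reference, here is a minimal sketch of the standard-library route mentioned in the comment above, assuming Python 3's html.parser. The class name ImgAttrParser and the variable line are made up for illustration; the temp* names come from the question.

```python
from html.parser import HTMLParser


class ImgAttrParser(HTMLParser):
    """Hypothetical helper: collects the attributes of every <img> tag it sees."""

    def __init__(self):
        super().__init__()
        self.images = []

    def handle_starttag(self, tag, attrs):
        # html.parser lowercases tag and attribute names, so WIDTH arrives as "width".
        if tag == "img":
            self.images.append(dict(attrs))


# The line from the question, split across literals only for readability.
line = (
    '<center><img border="0" '
    'src="http://www.world-of-waterfalls.com/images/Cascades_04_015L.jpg"  '
    'WIDTH="500" HEIGHT="375" '
    'alt="Looking up the Merced River Canyon towards Bridalveil Fall from the Big Oak Flat Road" '
    '***PINIT***></center><br clear="all"><br clear="all">'
)

parser = ImgAttrParser()
parser.feed(line)

img = parser.images[0]  # assumes the line contains at least one <img> tag
tempUrl = img["src"]
tempWidth = int(img["width"])    # attribute values are strings, so cast explicitly
tempHeight = int(img["height"])
tempAlt = img["alt"]
```

If a line might not contain an image, check that parser.images is non-empty before indexing into it.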

2 Answers

Several approaches would work here, but I recommend an HTML parser: it is extensible and tolerates many of the irregularities found in real HTML, such as the upper-case WIDTH and HEIGHT attribute names in your line. Here's a working example with BeautifulSoup:

>>> from bs4 import BeautifulSoup
>>> string = """<center><img border="0" src="http://www.world-of-waterfalls.com/images/Cascades_04_015L.jpg"  WIDTH="500" HEIGHT="375" alt="Looking up the Merced River Canyon towards Bridalveil Fall from the Big Oak Flat Road" ***PINIT***></center><br clear="all"><br clear="all">"""
>>> soup = BeautifulSoup(string, 'html.parser')
>>> for attr in ['width', 'height', 'alt']:
...     print('temp{} = {}'.format(attr.title(), soup.img[attr]))
...
tempWidth = 500
tempHeight = 375
tempAlt = Looking up the Merced River Canyon towards Bridalveil Fall from the Big Oak Flat Road

1 Comment

After finally getting bs4 installed, this is a beautiful solution. Thanks!

And the regex approach:

import re

string = "YOUR STRING"
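# Each non-greedy group captures one quoted attribute value; the attributes must appear in this order and case.
matches = re.findall(r'src="(.*?)".*WIDTH="(.*?)".*HEIGHT="(.*?)".*alt="(.*?)"', string)[0]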
tempUrl = matches[0]
tempWidth = matches[1]
tempHeight = matches[2]
tempAlt = matches[3]

All values are strings, though, so cast them (for example with int()) if you need numbers.

Also be aware that copying and pasting a regex is a bad idea; mistakes creep in easily.
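
As a rough illustration of the casting, and of one way to avoid depending on attribute order and case, here is a sketch that looks up each attribute separately. The helper get_attr is hypothetical (it is not part of re), and it assumes every value is wrapped in double quotes, which holds for the line in the question but not for HTML in general.

```python
import re

# The line from the question, split across literals only for readability.
line = (
    '<center><img border="0" '
    'src="http://www.world-of-waterfalls.com/images/Cascades_04_015L.jpg"  '
    'WIDTH="500" HEIGHT="375" '
    'alt="Looking up the Merced River Canyon towards Bridalveil Fall from the Big Oak Flat Road" '
    '***PINIT***></center><br clear="all"><br clear="all">'
)


def get_attr(name, markup):
    """Hypothetical helper: return the double-quoted value of `name`, or None."""
    # \b stops the name from matching as the tail of a longer word;
    # re.IGNORECASE lets "width" match WIDTH.
    pattern = r'\b{}\s*=\s*"([^"]*)"'.format(re.escape(name))
    match = re.search(pattern, markup, re.IGNORECASE)
    return match.group(1) if match else None


tempUrl = get_attr("src", line)
tempWidth = int(get_attr("width", line))    # int() would raise if the attribute were missing
tempHeight = int(get_attr("height", line))
tempAlt = get_attr("alt", line)
```

This is still a regex over markup, so it shares the fragility the comments warn about; it only removes the dependence on a fixed attribute order.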
