How to extract particular strings in Python

Question

I'm trying to extract particular strings from markup and save them (for more complex processing on this line). For example, say I've read in a line from a file and the current line is:

<center><img border="0" src="http://www.world-of-waterfalls.com/images/Cascades_04_015L.jpg"  WIDTH="500" HEIGHT="375" alt="Looking up the Merced River Canyon towards Bridalveil Fall from the Big Oak Flat Road" ***PINIT***></center><br clear="all"><br clear="all">

But I want to store:

tempUrl = 'http://www.world-of-waterfalls.com/images/Cascades_04_015L.jpg'

tempWidth = 500

tempHeight = 375

tempAlt = 'Looking up the Merced River Canyon towards Bridalveil Fall from the Big Oak Flat Road'

How would I go about doing that in Python?

Thanks

Let me save you the trouble: regex is the wrong tool for this. Don't try it; you will only hit your head against it later on. If the data comes from a web source, look into BeautifulSoup, Scrapy, or another scraping library. If you already have the markup, you can use a parser to traverse the nodes and gather the attribute information.
– Corvus Crypto, Commented Dec 15, 2016 at 17:16

brianpck · Accepted Answer · 2016-12-15 17:16:51Z

3

Though you can get away with several approaches here, I recommend using an HTML parser, which is extensible and can deal with many issues in the HTML. Here's a working example with BeautifulSoup:

>>> from bs4 import BeautifulSoup
>>> string = """<center><img border="0" src="http://www.world-of-waterfalls.com/images/Cascades_04_015L.jpg"  WIDTH="500" HEIGHT="375" alt="Looking up the Merced River Canyon towards Bridalveil Fall from the Big Oak Flat Road" ***PINIT***></center><br clear="all"><br clear="all">"""
>>> soup = BeautifulSoup(string, 'html.parser')
>>> for attr in ['width', 'height', 'alt']:
...     print('temp{} = {}'.format(attr.title(), soup.img[attr]))
...
tempWidth = 500
tempHeight = 375
tempAlt = Looking up the Merced River Canyon towards Bridalveil Fall from the Big Oak Flat Road
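If installing bs4 is not an option, the same attributes can probably be read with the standard library's `html.parser` alone. The following is a minimal sketch, not part of the original answer: the class name `ImgAttrParser` is hypothetical, and it assumes the line contains a single `<img>` tag with all four attributes present. It also stores the values in variables (casting width and height to `int`) rather than printing them.

```python
from html.parser import HTMLParser


class ImgAttrParser(HTMLParser):
    """Collects the attributes of the first <img> tag it sees (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.img_attrs = None

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names, so WIDTH arrives as 'width'.
        if tag == "img" and self.img_attrs is None:
            self.img_attrs = dict(attrs)


line = ('<center><img border="0" src="http://www.world-of-waterfalls.com/images/Cascades_04_015L.jpg"'
        '  WIDTH="500" HEIGHT="375" alt="Looking up the Merced River Canyon towards Bridalveil Fall'
        ' from the Big Oak Flat Road" ***PINIT***></center><br clear="all"><br clear="all">')

parser = ImgAttrParser()
parser.feed(line)
attrs = parser.img_attrs  # stays None if the line has no <img> tag

tempUrl = attrs["src"]
tempWidth = int(attrs["width"])    # attribute values are strings, so cast explicitly
tempHeight = int(attrs["height"])
tempAlt = attrs["alt"]
```

In real input, check that `attrs` is not `None` and that each key exists before indexing, since not every line will contain a complete `<img>` tag.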

answered Dec 15, 2016 at 17:16


1 Comment

Johnny Over a year ago

After finally getting bs4 installed, this is a beautiful solution. Thanks!

A. Steiner · Answer · 2016-12-15 17:54:40Z

0

And the regex approach:

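import re

string = "YOUR STRING"
# Capture each attribute value lazily; the attributes must appear in this order.
pattern = r'src="(.*?)".*?WIDTH="(.*?)".*?HEIGHT="(.*?)".*?alt="(.*?)"'
match = re.search(pattern, string, re.IGNORECASE)
if match:  # re.search returns None when the line has no matching tag
    tempUrl, tempWidth, tempHeight, tempAlt = match.groups()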

All values are strings, though, so cast them (for example with int()) if you need numbers.

Be careful when copying and pasting regexes: it is easy to introduce mistakes.
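One way to make the regex less fragile is to look up each attribute separately, so the order of attributes in the tag no longer matters. This is only a sketch under stated assumptions: `get_attr` is a hypothetical helper, it assumes attribute values are double-quoted and contain no embedded double quotes, and it is not a substitute for a real HTML parser on arbitrary markup.

```python
import re


def get_attr(tag_text, name):
    """Return the double-quoted value of attribute `name`, or None if absent (hypothetical helper)."""
    # \b stops 'alt' from matching the tail of a longer word; IGNORECASE covers WIDTH vs width.
    pattern = r'\b{}="([^"]*)"'.format(re.escape(name))
    match = re.search(pattern, tag_text, re.IGNORECASE)
    return match.group(1) if match else None


line = ('<center><img border="0" src="http://www.world-of-waterfalls.com/images/Cascades_04_015L.jpg"'
        '  WIDTH="500" HEIGHT="375" alt="Looking up the Merced River Canyon towards Bridalveil Fall'
        ' from the Big Oak Flat Road" ***PINIT***></center><br clear="all"><br clear="all">')

tempUrl = get_attr(line, "src")
tempWidth = int(get_attr(line, "width"))    # cast: regex groups are always strings
tempHeight = int(get_attr(line, "height"))
tempAlt = get_attr(line, "alt")
```

Because `get_attr` returns `None` for a missing attribute, the `int(...)` calls would raise a `TypeError` on such a line, so guard them if the input is not uniform.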

edited Dec 15, 2016 at 17:54

answered Dec 15, 2016 at 17:44
