Regex to find a string python

Question

I have a string

<a href="/p/123411/"><img src="/p_img/411/123411/639469aa9f_123411_100.jpg" alt="ABCDXYZ" />

What is the regex to find ABCDXYZ in Python?

Do you know the exact string you are looking for, or is this just a placeholder?
– intelis, Commented Jan 7, 2013 at 5:08

jdotjdot · Accepted Answer · 2013-01-07 05:12:54Z

5

Don't use regex to parse HTML. Use BeautifulSoup.

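from bs4 import BeautifulSoup as BS
text = '''<a href="/p/123411/"><img src="/p_img/411/123411/639469aa9f_123411_100.jpg" alt="ABCDXYZ" />'''
# Name the parser explicitly; otherwise bs4 guesses one and emits a warning.
soup = BS(text, 'html.parser')
# print() is a function in Python 3.
print(soup.find('img').attrs['alt'])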

answered Jan 7, 2013 at 5:12

jdotjdot


6 Comments

abarnert Over a year ago

I think you mean soup.find('img').attrs['alt']. But otherwise, yes, this is exactly what he should do.

jdotjdot Over a year ago

Yeah just fixed it. Didn't realize the tag was in img.

John Over a year ago

I must use regex as a requirement.

jdotjdot Over a year ago

You really shouldn't. Why is that requirement there?

jdotjdot Over a year ago

@John that wasn't really an answer. I was asking why is he requiring regex?

Community · Accepted Answer · 2017-05-23 11:52:33Z

3

If you're looking for the value of that alt attribute, you can do this:

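>>> import re
>>> mystring = '<a href="/p/123411/"><img src="/p_img/411/123411/639469aa9f_123411_100.jpg" alt="ABCDXYZ" />'
>>> r = r'alt="(.*?)"'

Then:

>>> m = re.search(r, mystring)
>>> m.group(1)
'ABCDXYZ'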

And you can use re.findall if you want to find more than one.
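
As a rough sketch of the re.findall variant: the input below is made up for illustration (the name page and the second img tag are hypothetical, only the first tag comes from the question).

```python
import re

# Hypothetical input with two img tags; only the first is from the question.
page = (
    '<a href="/p/123411/"><img src="/p_img/411/123411/639469aa9f_123411_100.jpg" alt="ABCDXYZ" />'
    '<a href="/p/123412/"><img src="/p_img/412/123412/0000000000_123412_100.jpg" alt="EFGH" />'
)

# With a single capture group, findall returns a list of the captured strings.
alts = re.findall(r'alt="(.*?)"', page)
print(alts)  # ['ABCDXYZ', 'EFGH']
```

The non-greedy .*? matters here: a greedy .* would run from the first alt=" to the last double quote in the input.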

However, this code will be easily fooled by something like this:

<span>Here's some text explaining how to do alt="foo" in an img tag.</span>

On the other hand, it'll also fail to pick up something like this:

<img src='/p_img/411/123411/639469aa9f_123411_100.jpg' alt='ABCDXYZ' />
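
Both failure modes are easy to reproduce. This is just a quick check using the two snippets above; the variable names are mine.

```python
import re

pattern = re.compile(r'alt="(.*?)"')

# False positive: prose that merely mentions an alt attribute.
prose = '''<span>Here's some text explaining how to do alt="foo" in an img tag.</span>'''
false_positive = pattern.search(prose)

# False negative: a real img tag that uses single quotes.
single_quoted = "<img src='/p_img/411/123411/639469aa9f_123411_100.jpg' alt='ABCDXYZ' />"
false_negative = pattern.search(single_quoted)

print(false_positive.group(1))  # foo
print(false_negative)           # None
```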

How do you deal with that? The short answer is: You don't. XML and HTML are not regular languages.

It's worth backing up here to point out that Python's re engine is not actually a true regular expression engine, and on top of that, it's embedded in a Turing-complete programming language. So obviously it is possible to build an HTML parser around Python and re. This answer shows part of a parser written in Perl, where regexes do most of the heavy lifting. But that doesn't mean you should do it this way. You shouldn't be writing a parser in the first place, given that perfectly good ones already exist, and if you did, you shouldn't be forcing yourself to use regexes even when there's an easier way to do what you want. For quick-and-dirty playing around, regex is fine. For a production program, it's almost always the wrong answer.

One way to convince your boss to let you use a parser is by crafting a suite of tests that are all obviously valid, and that cannot possibly be handled by any regex-based solution short of a full parser. If you can come up with a test that can be parsed, but only using exponential backtracking, and therefore takes 12 hours with regex vs. 0.1 seconds with bs4, even better, but that's a bit trickier…
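
A small sketch of what such a test suite might look like. To keep it dependency-free I've swapped in the standard library's html.parser instead of bs4; the test cases and all names (AltCollector, CASES, and so on) are made up for illustration, so treat this as a starting point rather than a complete suite.

```python
import re
from html.parser import HTMLParser

ALT_RE = re.compile(r'alt="(.*?)"')


class AltCollector(HTMLParser):
    """Collects the alt attribute of every img tag it sees."""

    def __init__(self):
        super().__init__()
        self.alts = []

    def handle_starttag(self, tag, attrs):
        # Tag and attribute names arrive lowercased; values arrive unquoted.
        if tag == 'img':
            self.alts.extend(value for name, value in attrs if name == 'alt')


def alts_with_parser(html):
    collector = AltCollector()
    collector.feed(html)
    collector.close()
    return collector.alts


def alts_with_regex(html):
    return ALT_RE.findall(html)


# (input, expected alt values); every input is ordinary, valid-looking HTML.
CASES = [
    ('<img src="a.jpg" alt="ABCDXYZ" />', ['ABCDXYZ']),
    ("<img src='a.jpg' alt='ABCDXYZ' />", ['ABCDXYZ']),
    ('<span>Here\'s some text explaining how to do alt="foo" in an img tag.</span>', []),
    ('<IMG SRC="a.jpg" ALT="ABCDXYZ">', ['ABCDXYZ']),
    ('<!-- <img src="old.jpg" alt="OLD"> --><img src="a.jpg" alt="ABCDXYZ">', ['ABCDXYZ']),
]

regex_results = [alts_with_regex(html) == expected for html, expected in CASES]
parser_results = [alts_with_parser(html) == expected for html, expected in CASES]

for (html, _), regex_ok, parser_ok in zip(CASES, regex_results, parser_results):
    print('regex:', regex_ok, 'parser:', parser_ok, '|', html)
```

The regex only passes the first case; the parser passes all five. Each failing case is something a boss can look at and agree is valid input.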

Of course it's also worth looking for articles online (and SO questions like this and this and the 300 other dups) and picking the best ones to show your boss.

If you really can't convince your boss otherwise, then you're done at this point. Given what's been specified, this works. Given what may or may not actually be intended, nothing short of mind-reading will work. As you find more and more real-life cases that fail, you can hack it up by adding more and more complex alternations and/or context onto the regex itself, or possibly use a series of regexes and post-filters, until finally you get sick of it and find yourself a better job.
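
For a sense of what "hacking it up" looks like, here is one possible next iteration. It is only a sketch: the pattern and the names IMG_ALT_RE and find_alts are mine, not anything from the question. It requires an img tag for context and accepts either quote style.

```python
import re

# Hypothetical "hacked up" pattern: must be inside <img ...>, either quote style,
# case-insensitive. Group 1 is the quote character, group 2 is the value.
IMG_ALT_RE = re.compile(
    r'''<img\b[^>]*?\balt\s*=\s*(["'])(.*?)\1''',
    re.IGNORECASE | re.DOTALL,
)


def find_alts(html):
    return [m.group(2) for m in IMG_ALT_RE.finditer(html)]


question = '<a href="/p/123411/"><img src="/p_img/411/123411/639469aa9f_123411_100.jpg" alt="ABCDXYZ" />'
single_quoted = "<img src='/p_img/411/123411/639469aa9f_123411_100.jpg' alt='ABCDXYZ' />"
prose = '''<span>Here's some text explaining how to do alt="foo" in an img tag.</span>'''
# Still broken: a ">" inside an earlier attribute value ends the [^>]* scan early.
tricky = '<img title="1 > 0" alt="ABCDXYZ">'

print(find_alts(question))       # ['ABCDXYZ']
print(find_alts(single_quoted))  # ['ABCDXYZ']
print(find_alts(prose))          # []
print(find_alts(tricky))         # [] (wrong: the alt is there)
```

It fixes the two cases above and immediately exposes the next one, which is the pattern you should expect to repeat.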

edited May 23, 2017 at 11:52

CommunityBot

answered Jan 7, 2013 at 5:12

abarnert

8 Comments

John Over a year ago

Thanks. I actually succeeded with a parser, but my boss wants me to use a regular expression.

jdotjdot Over a year ago

Unless there's some reason I can't see here, your boss is very misguided in asking you to use a regular expression. There's a reason parsers exist.

Amelia Over a year ago

I would advise that you tell your boss that he really doesn't know what he is talking about if he tells you to use regex. See here

abarnert Over a year ago

@John: Too bad you can't fire your boss. :) But maybe you can explain to him why it's impossible. Create some test input that cannot possibly be parsed by any regular expression, explain to him why the code he made you write is wrong, and convince him to let you do it right. (This is one of the many nice things about test-driven development. It's a lot harder to argue with a failing test that's obviously valid than with someone telling you "HTML isn't a regular language".)

abarnert Over a year ago

@jdotjdot: Well, I didn't think you were criticizing me, just criticizing my answer. And I think that putting your example into my answer made it better—if so, the criticism was appropriate (or useful, or whatever the right measure is). So, no problem.

Community · Accepted Answer · 2017-05-23 12:15:54Z

0

First, a disclaimer: You shouldn't be using regular expressions to parse HTML. You can use BeautifulSoup for this.

Next, if you are actually serious about using regular expressions and the above is the exact case you want, then you could do something like:

<a href="[a-zA-Z0-9/]+"><img src="[a-zA-Z0-9/_.]+" alt="([a-zA-Z0-9/]+)" />

and you could access the text via the match object's group(1) (or its groups() method).
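
A minimal runnable sketch of this approach against the string from the question. Note that the character class for src has to allow "_" and "." as well, since the sample path contains both; the variable names are mine.

```python
import re

mystring = '<a href="/p/123411/"><img src="/p_img/411/123411/639469aa9f_123411_100.jpg" alt="ABCDXYZ" />'

# The src class includes "_" and "." because the sample path contains them.
pattern = re.compile(
    r'<a href="[a-zA-Z0-9/]+"><img src="[a-zA-Z0-9/_.]+" alt="([a-zA-Z0-9/]+)" />'
)

m = pattern.search(mystring)
# search() returns None when nothing matches, so guard before calling group().
alt_text = m.group(1) if m else None
print(alt_text)  # ABCDXYZ
```

This only works while the markup keeps exactly this shape: same attribute order, double quotes, and nothing else inside either tag.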

edited May 23, 2017 at 12:15

CommunityBot

answered Jan 7, 2013 at 5:16

Ian Stapleton Cordasco

2 Comments

abarnert Over a year ago

What reason do we have to believe that being inside an a tag—and one with a relative URL, and no other attributes—is at all relevant here? In the absence of a realistic problem statement (which would make the problem impossible), it's probably best to assume the simplest possible interpretation.

Ian Stapleton Cordasco Over a year ago

He was fairly specific. In being as simple as possible you're also not answering his question as accurately as possible. I doubt he would have pulled the example out of nowhere if it weren't something he were dealing with specifically. If he can guarantee the conditions he provided (which he likely cannot) then the above works.
