2

I have a string

<a href="/p/123411/"><img src="/p_img/411/123411/639469aa9f_123411_100.jpg" alt="ABCDXYZ" />

What is the regex to find ABCDXYZ in Python?

2
  • Do you know the exact string you are looking for, or is this just a placeholder? Commented Jan 7, 2013 at 5:08
  • I don't know the exact string; it is just a placeholder. Commented Jan 7, 2013 at 5:11

3 Answers

5

Don't use regex to parse HTML. Use BeautifulSoup.

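from bs4 import BeautifulSoup as BS

text = '''<a href="/p/123411/"><img src="/p_img/411/123411/639469aa9f_123411_100.jpg" alt="ABCDXYZ" />'''
# Name the parser explicitly so bs4 does not have to guess one
soup = BS(text, 'html.parser')
# Look up the alt attribute of the first img tag
print(soup.find('img').attrs['alt'])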

6 Comments

I think you mean soup.find('img').attrs['alt']. But otherwise, yes, this is exactly what he should do.
Yeah just fixed it. Didn't realize the tag was in img.
I must use regex as a requirement.
You really shouldn't. Why is that requirement there?
@John that wasn't really an answer. I was asking why is he requiring regex?
3

If you're looking for the value of that alt attribute, you can do this:

>>> import re
>>> r = r'alt="(.*?)"'

Then:

>>> m = re.search(r, mystring)
>>> m.group(1)
'ABCDXYZ'

And you can use re.findall if you want to find more than one.
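As a minimal sketch of the re.findall variant, assuming an input with two img tags (the sample string and the name `alts` are invented for illustration):

```python
import re

# Hypothetical input with two img tags, made up for this example.
html = ('<img src="/a.jpg" alt="FIRST" />'
        '<img src="/b.jpg" alt="SECOND" />')

# findall returns the captured group of every non-overlapping match.
alts = re.findall(r'alt="(.*?)"', html)
print(alts)
```

With a single capture group, findall returns a list of the captured strings rather than match objects.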

However, this code will be easily fooled by something like this:

<span>Here's some text explaining how to do alt="foo" in an img tag.</span>

On the other hand, it'll also fail to pick up something like this:

<img src='/p_img/411/123411/639469aa9f_123411_100.jpg' alt='ABCDXYZ' />
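Both failure modes are easy to reproduce. A small sketch using the two snippets above (the variable names are only for illustration):

```python
import re

pattern = r'alt="(.*?)"'

# Prose that merely mentions an alt attribute; there is no img tag here.
prose = '''<span>Here's some text explaining how to do alt="foo" in an img tag.</span>'''
# A real img tag, but with single-quoted attribute values.
single_quoted = '''<img src='/p_img/411/123411/639469aa9f_123411_100.jpg' alt='ABCDXYZ' />'''

false_positive = re.findall(pattern, prose)   # matches text that is not an attribute
missed = re.findall(pattern, single_quoted)   # misses a genuine attribute

print(false_positive)
print(missed)
```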

How do you deal with that? The short answer is: You don't. XML and HTML are not regular languages.

It's worth backing up here to point out that Python's re engine is not actually a true regular expression engine and, on top of that, it's embedded in a Turing-complete programming language. So obviously it is possible to build an HTML parser around Python and re. This answer shows part of a parser written in Perl, where regexes do most of the heavy lifting. But that doesn't mean you should do it this way. You shouldn't be writing a parser in the first place, given that perfectly good ones already exist, and if you did, you shouldn't be forcing yourself to use regexes even when there's an easier way to do what you want. For quick-and-dirty playing around, regex is fine. For a production program, it's almost always the wrong answer.

One way to convince your boss to let you use a parser is by crafting a suite of tests that are all obviously valid, and that cannot possibly be handled by any regex-based solution short of a full parser. If you can come up with a test that can be parsed, but only using exponential backtracking, and therefore takes 12 hours with regex vs. 0.1 seconds with bs4, even better, but that's a bit trickier…
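A rough sketch of what such a test suite might look like. To stay self-contained it uses the standard library's html.parser instead of bs4. The `AltCollector` class, the helper functions, and the test cases are all hypothetical, made up for illustration:

```python
import re
from html.parser import HTMLParser


class AltCollector(HTMLParser):
    """Hypothetical helper: collects the alt attribute of every img tag."""

    def __init__(self):
        super().__init__()
        self.alts = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs with quotes already removed.
        if tag == 'img':
            for name, value in attrs:
                if name == 'alt':
                    self.alts.append(value)


def alts_with_parser(html):
    collector = AltCollector()
    collector.feed(html)
    collector.close()
    return collector.alts


def alts_with_regex(html):
    return re.findall(r'alt="(.*?)"', html)


# Each case pairs an input with the alt values a correct tool should report.
CASES = [
    ('<img src="/a.jpg" alt="ABCDXYZ" />', ['ABCDXYZ']),
    ("<img src='/a.jpg' alt='ABCDXYZ' />", ['ABCDXYZ']),
    ('<span>how to do alt="foo" in an img tag</span>', []),
    ('<img alt=ABCDXYZ>', ['ABCDXYZ']),
]

regex_passes = [alts_with_regex(html) == expected for html, expected in CASES]
parser_passes = [alts_with_parser(html) == expected for html, expected in CASES]

for (html, _), r_ok, p_ok in zip(CASES, regex_passes, parser_passes):
    print('regex ok:', r_ok, '| parser ok:', p_ok, '|', html)
```

Every case is ordinary HTML, so a failing row is hard to dismiss as a contrived input.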

Of course it's also worth looking for articles online (and SO questions like this and this and the 300 other dups) and picking the best ones to show your boss.

If you really can't convince your boss otherwise, then you're done at this point. Given what's been specified, this works. Given what may or may not actually be intended, nothing short of mind-reading will work. As you find more and more real-life cases that fail, you can hack it up by adding more and more complex alternations and/or context onto the regex itself, or possibly use a series of regexes and post-filters, until finally you get sick of it and find yourself a better job.

8 Comments

thanks. I actually succeeded with parser, but my boss wants me to use regular expression.
Unless there's some reason I can't see here, your boss is very misguided in asking you to use a regular expression. There's a reason parsers exist.
I would advise that you tell your boss that he really doesn't know what he is talking about if he tells you to use regex. See here
@John: Too bad you can't fire your boss. :) But maybe you can explain to him why it's impossible. Create some test input that cannot possibly be parsed by any regular expression, explain to him why the code he made you write is wrong, and convince him to let you do it right. (This is one of the many nice things about test-driven development. It's a lot harder to argue with a failing test that's obviously valid than with someone telling you "HTML isn't a regular language".)
@jdotjdot: Well, I didn't think you were criticizing me, just criticizing my answer. And I think that putting your example into my answer made it better—if so, the criticism was appropriate (or useful, or whatever the right measure is). So, no problem.
0

First, a disclaimer: You shouldn't be using regular expressions to parse HTML. You can use BeautifulSoup for this.

Next, if you are actually serious about using regular expressions, and the above is the exact case you want, then you could do something like:

<a href="[a-zA-Z0-9/]+"><img src="[a-zA-Z0-9/_.]+" alt="([a-zA-Z0-9/]+)" />

You could then access the captured text with the match object's group(1) method, or with groups() to get all captures as a tuple.
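A minimal sketch of that approach against the string from the question. The src in the question contains `_` and `.`, so this sketch assumes the src character class is widened to `[a-zA-Z0-9/_.]`. The names `pattern`, `m`, and `alt` are only for illustration:

```python
import re

text = '<a href="/p/123411/"><img src="/p_img/411/123411/639469aa9f_123411_100.jpg" alt="ABCDXYZ" />'

# Assumption: the src class also allows "_" and "." so it can match the sample path.
pattern = r'<a href="[a-zA-Z0-9/]+"><img src="[a-zA-Z0-9/_.]+" alt="([a-zA-Z0-9/]+)" />'

m = re.search(pattern, text)
# re.search returns None when nothing matches, so guard before calling group().
alt = m.group(1) if m else None
print(alt)
```

Because the pattern spells out the whole tag layout, any change in attribute order, quoting, or spacing makes it return None.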

2 Comments

What reason do we have to believe that being inside an a tag—and one with a relative URL, and no other attributes—is at all relevant here? In the absence of a realistic problem statement (which would make the problem impossible), it's probably best to assume the simplest possible interpretation.
He was fairly specific. In being as simple as possible you're also not answering his question as accurately as possible. I doubt he would have pulled the example out of nowhere if it weren't something he were dealing with specifically. If he can guarantee the conditions he provided (which he likely cannot) then the above works.
