
Starting from an HTML input like this:

<p>
<a href="http://www.foo.com">this if foo</a>
<a href="http://www.bar.com">this if bar</a>
</p>

using BeautifulSoup, I would like to change this HTML into:

<p>
<a href="http://www.foo.com">this if foo[1]</a>
<a href="http://www.bar.com">this if bar[2]</a>
</p>

saving parsed links in a dictionary with a result like this:

links_dict = {"1":"http://www.foo.com","2":"http://www.bar.com"}

Is it possible to do this using BeautifulSoup? Is there any valid alternative?

1 Answer


This should be easy in Beautiful Soup.

Something like:

from BeautifulSoup import BeautifulSoup, Tag

# text is assumed to hold the HTML input from the question
count = 1
links_dict = {}
soup = BeautifulSoup(text)
for link_tag in soup.findAll('a'):
  # .get() returns None for anchors without an href instead of raising KeyError
  href = link_tag.get('href')
  if href:
    links_dict[count] = href
    # Rebuild the tag with all of its original attributes, not just href
    newTag = Tag(soup, "a", link_tag.attrs)
    newTag.insert(0, ''.join([''.join(link_tag.contents), "[%s]" % count]))
    link_tag.replaceWith(newTag)
    count += 1

Result of executing this on your text:

>>> soup
<p>
  <a href="http://www.foo.com">this if foo[1]</a>
  <a href="http://www.bar.com">this if bar[2]</a>
</p>

>>> links_dict
{1: u'http://www.foo.com', 2: u'http://www.bar.com'}

The only problem I can foresee with this solution is link text that contains subtags: in that case ''.join(link_tag.contents) would not work, and you would instead need to navigate to the rightmost text element.
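For readers on the newer bs4 package (an assumption; the code above targets the older BeautifulSoup 3 module), here is a minimal sketch that appends the marker to the existing tag instead of rebuilding it. This sidesteps the subtag problem and leaves other attributes such as rel untouched. The function name number_links and the sample HTML are hypothetical, chosen only for illustration, and the keys are stored as strings to match the dictionary shown in the question.

```python
from bs4 import BeautifulSoup


def number_links(html):
    """Append "[n]" to each link's content and collect the hrefs by number."""
    soup = BeautifulSoup(html, "html.parser")
    links = {}
    count = 0
    # href=True selects only anchors that carry an href attribute
    for a in soup.find_all("a", href=True):
        if not a["href"]:
            continue  # skip anchors whose href is empty
        count += 1
        links[str(count)] = a["href"]
        # append() adds a new text node after any existing children,
        # so nested tags inside the link are left as they are
        a.append("[%d]" % count)
    return str(soup), links


html = '<p><a href="http://www.foo.com" rel="nofollow">this if <b>foo</b></a></p>'
new_html, links_dict = number_links(html)
```

Because the tag is modified in place, there is no need to copy link_tag.attrs into a new Tag.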


6 Comments

@danben +1 for the effort. Actually, this is like the code I wrote before asking the question. It does not work because you end up with something like <a href="foo.com[1]">this if foo</a>, and this is not what I want.
@danben Do you think it is possible to change the node's content without creating a new tag?
I was not able to do that, and the documentation suggests that it is not possible. Why is creating a new Tag undesirable?
@Danben Um, because I could have other attributes besides href; a rel="nofollow", for example. Please have a look at this other question: stackoverflow.com/questions/2904542/…
@Danben OK, I found it; I replaced newTag = Tag(soup, "a", [("href", link_tag['href'])]) with newTag = Tag(soup, "a", link_tag.attrs). Thanks! Please update your code.