
Hi, I'd like to get movie titles from this website:

[screenshot of the 2019 top-grossing movies table on the-numbers.com]

import requests
from bs4 import BeautifulSoup

url = "https://www.the-numbers.com/market/" + "2019" + "/top-grossing-movies"
raw = requests.get(url, headers={'User-Agent': 'Mozilla/5.0'})
html = BeautifulSoup(raw.text, "html.parser")
movie_list = html.select("#page_filling_chart > table > tbody > tr > td > b > a")
for movie in movie_list:
    print(movie.text)

I get response 200 and have no problem scraping other information, but the problem is with the variable movie_list.

When I print(movie_list), it returns an empty list, which means I'm using the selector wrong.

3 Answers


If you replace:

movie_list = html.select("#page_filling_chart > table > tbody > tr > td > b > a")

With:

movie_list = html.select("#page_filling_chart table tr > td > b > a")

You get what I think you're looking for. The primary change here is replacing child-selectors (parent > child) with descendant selectors (ancestor descendant), which is a lot more forgiving with respect to what the intervening content looks like.
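To see the difference concretely, here is a small sketch; the snippet below is a made-up miniature of the page's structure, not the real markup. With html.parser, an unclosed <p> ends up wrapping the table, so a child selector finds nothing while a descendant selector still matches:

```python
from bs4 import BeautifulSoup

# Made-up miniature of the page structure: an unclosed <p> ends up
# wrapping the <table>, so the table is not a direct child of the div.
snippet = """
<div id="page_filling_chart">
  <p><table><tr><td><b><a href="#">Movie</a></b></td></tr></table>
</div>
"""

soup = BeautifulSoup(snippet, "html.parser")

# Child selector: requires <table> to be an immediate child of the div,
# so it finds nothing here.
print(soup.select("#page_filling_chart > table"))

# Descendant selector: matches at any depth, so it finds the link.
print(soup.select("#page_filling_chart table tr > td > b > a"))
```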


Update: this is interesting. Your choice of BeautifulSoup parser seems to lead to different behavior.

Compare:

>>> html = BeautifulSoup(raw.text, 'html.parser')
>>> html.select('#page_filling_chart > table')
[]

With:

>>> html = BeautifulSoup(raw.text, 'lxml')
>>> html.select('#page_filling_chart > table')
[<table>
<tr><th>Rank</th><th>Movie</th><th>Release<br/>Date</th><th>Distributor</th><th>Genre</th><th>2019 Gross</th><th>Tickets Sold</th></tr>
<tr>
[...]

In fact, using the lxml parser you can almost use your original selector. This works:

html.select("#page_filling_chart > table > tr > td > b > a")

because after lxml parses the page, the table has no tbody, so dropping tbody from your original selector is enough.

After experimenting for a bit, you would have to rewrite your original query like this to get it to work with html.parser:

html.select("#page_filling_chart2 > p > p > p > p > p > table > tr > td > b > a")

It looks like html.parser doesn't synthesize closing </p> elements when they are missing from the source, so all the unclosed <p> tags result in a weird parsed document structure.
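The parser difference can be reproduced on a tiny made-up document (the second parse assumes the separate lxml package is available, so it is guarded):

```python
from bs4 import BeautifulSoup, FeatureNotFound

# Tiny made-up document with unclosed <p> tags, mimicking the page.
doc = "<div><p>one<p>two<table><tr><td>cell</td></tr></table></div>"

# html.parser leaves the unclosed <p> tags open, so each new element
# nests inside the previous <p>, and the table ends up buried in them.
lenient = BeautifulSoup(doc, "html.parser")
print(lenient.select("div > table"))   # the table is not a direct child here
print(lenient.select("div table td"))  # a descendant selector still reaches it

# lxml closes each <p> before the next block-level element starts,
# so the table becomes a direct child of <div>.
try:
    strict = BeautifulSoup(doc, "lxml")
    print(strict.select("div > table"))
except FeatureNotFound:
    pass  # lxml is not installed
```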


Comments

That worked perfectly! But I still wonder why my selector didn't work. As you can see in the uploaded picture, I thought I had selected the 679 movie titles correctly.
Aren't "#page_filling_chart table tr > td > b > a" and "#page_filling_chart > table > tbody > tr > td > b > a" pointing at the same tags?
They're not. #page_filling_chart > table says "find a table element that is a child of the element with id page_filling_chart", whereas #page_filling_chart table says "find a table element that is a descendant of the element with id page_filling_chart".
Ah, right, I understand now; they are different. May I ask just one more thing? I thought I was specifying the tags precisely by using child selectors, but it didn't work. I get how your code works, but I don't understand why mine with child selectors doesn't!
I've updated the question with the result of some experimentation.

This should work:

import requests
from bs4 import BeautifulSoup

url = 'https://www.the-numbers.com/market/2019/top-grossing-movies'
# The site may reject the default requests user agent, so send a browser one.
raw = requests.get(url, headers={'User-Agent': 'Mozilla/5.0'})
html = BeautifulSoup(raw.text, "html.parser")
movie_list = html.select("table > tr > td > b > a")
for movie in movie_list:
    print(movie.text)



Here is a solution for this question:

from bs4 import BeautifulSoup
import requests

url = "https://www.the-numbers.com/market/" + "2019" + "/top-grossing-movies"
raw = requests.get(url, headers={'User-Agent': 'Mozilla/5.0'})
html = BeautifulSoup(raw.text, "html.parser")
movie_table_rows = html.find_all("table")[0].find_all("tr")

movie_list = []
for tr in movie_table_rows[1:]:  # skip the header row
    tds = tr.find_all("td")
    movie_list.append(tds[1].text)  # the movie name is in the second column

print(movie_list)

Basically, the way you were trying to extract the text is fragile, since the selector path is not identical for every movie-name anchor tag; iterating over the table rows sidesteps that.

