
Hi, I'd like to get movie titles from this website:

[screenshot of the 2019 top-grossing movies table on the-numbers.com]

import requests
from bs4 import BeautifulSoup

url = "https://www.the-numbers.com/market/" + "2019" + "/top-grossing-movies"
raw = requests.get(url, headers={'User-Agent': 'Mozilla/5.0'})
html = BeautifulSoup(raw.text, "html.parser")
movie_list = html.select("#page_filling_chart > table > tbody > tr > td > b > a")
for movie in movie_list:
    print(movie.text)

I get response 200 and have no problem scraping other information, but the problem is with the variable movie_list.

When I print(movie_list), it returns an empty list, which means I'm using the selector wrong.

3 Answers


If you replace:

movie_list = html.select("#page_filling_chart > table > tbody > tr > td > b > a")

With:

movie_list = html.select("#page_filling_chart table tr > td > b > a")

You get what I think you're looking for. The primary change here is replacing child-selectors (parent > child) with descendant selectors (ancestor descendant), which is a lot more forgiving with respect to what the intervening content looks like.
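To see the difference concretely, here is a small sketch; the snippet below is a made-up miniature of the page's structure, not the real markup. With html.parser, an unclosed <p> ends up wrapping the table, so a child selector finds nothing while a descendant selector still matches:

```python
from bs4 import BeautifulSoup

# Made-up miniature of the page structure: an unclosed <p> ends up
# wrapping the <table>, so the table is not a direct child of the div.
snippet = """
<div id="page_filling_chart">
  <p><table><tr><td><b><a href="#">Movie</a></b></td></tr></table>
</div>
"""

soup = BeautifulSoup(snippet, "html.parser")

# Child selector: requires <table> to be an immediate child of the div,
# so it finds nothing here.
print(soup.select("#page_filling_chart > table"))

# Descendant selector: matches at any depth, so it finds the link.
print(soup.select("#page_filling_chart table tr > td > b > a"))
```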


Update: this is interesting. Your choice of BeautifulSoup parser seems to lead to different behavior.

Compare:

>>> html = BeautifulSoup(raw.text, 'html.parser')
>>> html.select('#page_filling_chart > table')
[]

With:

>>> html = BeautifulSoup(raw.text, 'lxml')
>>> html.select('#page_filling_chart > table')
[<table>
<tr><th>Rank</th><th>Movie</th><th>Release<br/>Date</th><th>Distributor</th><th>Genre</th><th>2019 Gross</th><th>Tickets Sold</th></tr>
<tr>
[...]

In fact, using the lxml parser you can almost use your original selector. This works:

html.select("#page_filling_chart > table > tr > td > b > a")

because after lxml parses the page, the table has no tbody, so dropping tbody from your original selector is enough.

After experimenting for a bit, you would have to rewrite your original query like this to get it to work with html.parser:

html.select("#page_filling_chart2 > p > p > p > p > p > table > tr > td > b > a")

It looks like html.parser doesn't synthesize closing </p> elements when they are missing from the source, so all the unclosed <p> tags result in a weird parsed document structure.
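The parser difference can be reproduced on a tiny made-up document (the second parse assumes the separate lxml package is available, so it is guarded):

```python
from bs4 import BeautifulSoup, FeatureNotFound

# Tiny made-up document with unclosed <p> tags, mimicking the page.
doc = "<div><p>one<p>two<table><tr><td>cell</td></tr></table></div>"

# html.parser leaves the unclosed <p> tags open, so each new element
# nests inside the previous <p>, and the table ends up buried in them.
lenient = BeautifulSoup(doc, "html.parser")
print(lenient.select("div > table"))   # the table is not a direct child here
print(lenient.select("div table td"))  # a descendant selector still reaches it

# lxml closes each <p> before the next block-level element starts,
# so the table becomes a direct child of <div>.
try:
    strict = BeautifulSoup(doc, "lxml")
    print(strict.select("div > table"))
except FeatureNotFound:
    pass  # lxml is not installed
```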


Comments

That worked perfectly! But I still wonder why my selector didn't work. As you can see in the uploaded picture, I thought I had selected the 679 movie titles correctly.
Aren't "#page_filling_chart table tr > td > b > a" and "#page_filling_chart > table > tbody > tr > td > b > a" pointing at the same tags?
They're not. #page_filling_chart > table says "find a table element that is a child of the element with id page_filling_chart", whereas #page_filling_chart table says "find a table element that is a descendant of the element with id page_filling_chart".
Ah, right, I understand now; they are different. May I ask just one more thing? I thought I was specifying the tags precisely by using child selectors, but it didn't work. I get how your code works, but I don't understand why mine with child selectors doesn't!
I've updated the question with the result of some experimentation.

This should work:

import requests
from bs4 import BeautifulSoup

url = 'https://www.the-numbers.com/market/2019/top-grossing-movies'
# The site may reject the default requests user agent, so send a browser one.
raw = requests.get(url, headers={'User-Agent': 'Mozilla/5.0'})
html = BeautifulSoup(raw.text, "html.parser")
movie_list = html.select("table > tr > td > b > a")
for movie in movie_list:
    print(movie.text)



Here is a solution for this question:

from bs4 import BeautifulSoup
import requests

url = "https://www.the-numbers.com/market/" + "2019" + "/top-grossing-movies"
raw = requests.get(url, headers={'User-Agent': 'Mozilla/5.0'})
html = BeautifulSoup(raw.text, "html.parser")
movie_table_rows = html.find_all("table")[0].find_all("tr")

movie_list = []
for tr in movie_table_rows[1:]:  # skip the header row
    tds = tr.find_all("td")
    movie_list.append(tds[1].text)  # the movie name is in the second column

print(movie_list)

Basically, the way you were trying to extract the text is fragile, since the selector path is not identical for every movie-name anchor tag; iterating over the table rows sidesteps that.

