
I am trying to web-scrape a house listing from a Remax page and save that information to a pandas DataFrame, but for some reason it keeps giving me a KeyError. Here is my code:

import pandas as pd
import requests
from bs4 import BeautifulSoup
url = 'https://www.remax.ca/ab/calgary-real-estate/720-37-st-nw-wp_id251536557-lst'
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')
detail_title = soup.find_all(class_='detail-title')
details_t = pd.DataFrame(detail_title)

Here is the error I am getting:

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<ipython-input-6-3be49b8e4cfc> in <module>
      6 soup = BeautifulSoup(response.text, 'html.parser')
      7 detail_title = soup.find_all(class_='detail-title')
----> 8 details_t = pd.DataFrame(detail_title)

~/anaconda3/lib/python3.7/site-packages/pandas/core/frame.py in __init__(self, data, index, columns, dtype, copy)
    449                 else:
    450                     mgr = init_ndarray(data, index, columns, dtype=dtype,
--> 451                                        copy=copy)
    452             else:
    453                 mgr = init_dict({}, index, columns, dtype=dtype)

~/anaconda3/lib/python3.7/site-packages/pandas/core/internals/construction.py in init_ndarray(values, index, columns, dtype, copy)
    144     # by definition an array here
    145     # the dtypes will be coerced to a single dtype
--> 146     values = prep_ndarray(values, copy=copy)
    147 
    148     if dtype is not None:

~/anaconda3/lib/python3.7/site-packages/pandas/core/internals/construction.py in prep_ndarray(values, copy)
    228         try:
    229             if is_list_like(values[0]) or hasattr(values[0], 'len'):
--> 230                 values = np.array([convert(v) for v in values])
    231             elif isinstance(values[0], np.ndarray) and values[0].ndim == 0:
    232                 # GH#21861

~/anaconda3/lib/python3.7/site-packages/bs4/element.py in __getitem__(self, key)
   1014         """tag[key] returns the value of the 'key' attribute for the tag,
   1015         and throws an exception if it's not there."""
-> 1016         return self.attrs[key]
   1017 
   1018     def __iter__(self):

KeyError: 0

Any help would be greatly appreciated!

2 Answers


You can try this. I assume that you want only the text within the <span> tags, but feel free to adapt my worked example.

import pandas as pd
import requests
from bs4 import BeautifulSoup
url = 'https://www.remax.ca/ab/calgary-real-estate/720-37-st-nw-wp_id251536557-lst'
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')
detail_title = soup.find_all(class_='detail-title')

ls = []

for tag in detail_title:
    ls.append(tag.text)

df = pd.DataFrame(data=ls)

print(df)

Output

                           0
0            Property Type:
1             Property Tax:
2             Last Updated:
3        Property Sub Type:
4                  MLS® #:
5           Ownership-Type:
6               Year Built:
7                     sqft:
8              Date Listed:
9                 Lot Size:
10               Occupancy:
11             Subdivision:
12                 Heating:
13          Heating Source:
14          Full Bathrooms:
15          Half Bathrooms:
16                   Rooms:
17                Basement:
18    Basement Development:
19                Flooring:
20          Parking Spaces:
21                 Parking:
22                    Area:
23                Exterior:
24              Foundation:
25                    Roof:
26                   Faces:
27  Miscellaneous Features:
28         Lot Description:
29                   Condo:
30                Board ID:
31                   Suite:
32                Features:

Edit: print(type(detail_title)) gives <class 'bs4.element.ResultSet'>, which is not an accepted data type. From https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html:

data : ndarray (structured or homogeneous), Iterable, dict, or DataFrame
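To see why the original call blows up: pandas tries to index each element, and indexing a Tag is an attribute lookup, so tag[0] raises KeyError: 0. A minimal sketch (inline HTML standing in for the remax page) showing both the failure and the fix:

```python
import pandas as pd
from bs4 import BeautifulSoup

# Inline HTML standing in for the remax page (hypothetical markup).
html = ('<span class="detail-title">Property Type:</span>'
        '<span class="detail-title">Rooms:</span>')
soup = BeautifulSoup(html, 'html.parser')
detail_title = soup.find_all(class_='detail-title')

# Indexing a Tag does an *attribute* lookup, which is what pandas trips over:
try:
    detail_title[0][0]      # tag[0] -> self.attrs[0] -> KeyError: 0
except KeyError as e:
    print('KeyError:', e)

# Converting to a plain list of strings first works fine:
df = pd.DataFrame([t.text for t in detail_title])
print(df)                   # one column of detail titles
```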


detail_title does not contain something you can put in a DataFrame: it is a ResultSet of bs4.element.Tag objects (see what type(detail_title[0]) gives you). Try the following:

Step 1. Extract the column headings

import pandas as pd
import requests
from bs4 import BeautifulSoup
url = 'https://www.remax.ca/ab/calgary-real-estate/720-37-st-nw-wp_id251536557-lst'
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')
detail_title = soup.find_all(class_='detail-title')

headings = [d.text for d in detail_title]
details_t = pd.DataFrame(columns = headings)

Step 2. Go up one level in the html and get the pairs of detail names and values. (The detail names are what you have already extracted in step 1). Write a helper function to return the value given a name.

details = soup.find_all(class_='detail-row ng-star-inserted')

def get_detail_value(detail_title, details):
    # Return the detail-value text of every row whose detail-title matches.
    return [d.find(class_='detail-value').text
            for d in details
            if d.find(class_='detail-title').text == detail_title]
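For example, with some stand-in markup using the same class names as the remax page (the real page's structure may differ), the helper pairs each title with its value:

```python
from bs4 import BeautifulSoup

# Stand-in HTML using the class names assumed above (hypothetical values).
html = '''
<div class="detail-row ng-star-inserted">
  <span class="detail-title">Year Built:</span>
  <span class="detail-value">1978</span>
</div>
<div class="detail-row ng-star-inserted">
  <span class="detail-title">Rooms:</span>
  <span class="detail-value">7</span>
</div>
'''
soup = BeautifulSoup(html, 'html.parser')
details = soup.find_all(class_='detail-row ng-star-inserted')

def get_detail_value(detail_title, details):
    # Return the detail-value text of every row whose detail-title matches.
    return [d.find(class_='detail-value').text
            for d in details
            if d.find(class_='detail-title').text == detail_title]

print(get_detail_value('Year Built:', details))
```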

This is a bit awkward if you are only scraping one page. What you will probably want to do is run step 1 once to get the detail names, then run step 2 on all pages you want to scrape.

Step 3. For each page you scrape, append the found values of the details to the dataframe.

details_t = details_t.append({deet:get_detail_value(deet, details) for deet in details_t.columns}, ignore_index = True)
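Note that DataFrame.append was deprecated in pandas 1.4 and removed in 2.0; on a current pandas the same step can be written with pd.concat. A sketch with made-up column names and values:

```python
import pandas as pd

# Empty frame whose columns play the role of the scraped headings.
details_t = pd.DataFrame(columns=['Year Built:', 'Rooms:'])

# One scraped row as a dict; in the real code this would be
# {deet: get_detail_value(deet, details) for deet in details_t.columns}
row = {'Year Built:': ['1978'], 'Rooms:': ['7']}

# pd.concat replaces the removed DataFrame.append.
details_t = pd.concat([details_t, pd.DataFrame([row])], ignore_index=True)
print(details_t)
```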

3 Comments

Thanks, this was really helpful!
@SushantDeshpande a gentle tip for you as a new user: stack overflow etiquette is to upvote all answers you found helpful and to put the green tick next to the one which most closely answered your question.
@butterflyknife Sushant Deshpande hasn't unlocked the voting option yet. I will help you. +1 from me.
