I have some code which collects the description, price, and old price (if on sale) of products from an online retailer over multiple pages. I'd like to export this into a DataFrame, but my attempt runs into the following error:

ValueError: Shape of passed values is (1, 3210), indices imply (3, 3210).

from bs4 import BeautifulSoup
import requests
import time
import pandas as pd

# Start Timer
then = time.time()

# Headers
headers = {"User-Agent": "Mozilla/5.0"}

# Set HTTPCode = 200 and Counter = 1
Code = 200
i = 1

scraped_data = []
while Code == 200:

    # Put url together
    url = "https://www.asos.com/women/jumpers-cardigans/cat/?cid=2637&page="
    url = url + str(i)

    # Request URL
    r = requests.get(url, allow_redirects=False, headers=headers)  # No redirects to allow infinite page count
    data = r.text
    Code = r.status_code

    # Soup
    soup = BeautifulSoup(data, 'lxml')

    # For loop each product then scroll through title price, old price and description
    divs = soup.find_all('article', attrs={'class': '_2qG85dG'}) # want to cycle through each of these

    for div in divs:

        # Get Description
        Description = div.find('div', attrs={'class': '_3J74XsK'})
        Description = Description.text.strip()
        scraped_data.append(Description)

        # Fetch TitlePrice
        NewPrice = div.find('span', attrs={'data-auto-id':'productTilePrice'})
        NewPrice = NewPrice.text.strip("£")
        scraped_data.append(NewPrice)

        # Fetch OldPrice
        try:
            OldPrice = div.find('span', attrs={'data-auto-id': 'productTileSaleAmount'})
            OldPrice = OldPrice.text.strip("£")
            scraped_data.append(OldPrice)
        except AttributeError:
            OldPrice = ""
            scraped_data.append(OldPrice)

    print('page', i, 'scraped')
        # Print Array
        #array = {"Description": str(Description), "CurrentPrice": str(NewPrice), "Old Price": str(OldPrice)}
        #print(array)
    i = i + 1
else:
    i = i - 2
    now = time.time()
    pd.DataFrame(scraped_data, columns=["A", "B", "C"])
    print('Parse complete with', i, 'pages' + ' in', now-then, 'seconds')
  • It's almost certainly raised from your constructor in that you're passing data with a different shape from that which you say it ought to be Commented Feb 27, 2020 at 15:33
  • @ifly6 yes it should be 3 columns, Description, Price, Old Price. With n rows for each item found, but I don't know where I'm going wrong Commented Feb 27, 2020 at 15:37
  • Scraped data doesn't have that form though. Your code appends to that list three values on each loop. Consider changing to a dictionary-based representation for each row where you construct the dataframe from a list of dictionaries Commented Feb 27, 2020 at 15:39
  • @ifly6 thanks for the advice. I'm new to Python so I don't fully understand that concept Commented Feb 27, 2020 at 15:41

2 Answers

1

Right now, your code appends data to the list following a pattern that I can describe like this:

  1. Load the web page
  2. Append to list value A
  3. Append to list value B
  4. Append to list value C

What this produces, across all the products scraped, is one flat list:

[A1, B1, C1, A2, B2, C2]
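
To see this concretely, here is a minimal sketch that reproduces the problem. The sample strings are made up for illustration and stand in for the scraped values; the exact wording of the error message may vary between pandas versions.

```python
import pandas as pd

# Hypothetical sample values standing in for the scraped strings
flat = ["Jumper", "20.00", "25.00", "Cardigan", "30.00", ""]

# Without column names, pandas treats the flat list as a single column
single_column = pd.DataFrame(flat)
print(single_column.shape)

# Asking for three columns therefore fails with a ValueError
try:
    pd.DataFrame(flat, columns=["A", "B", "C"])
    error = None
except ValueError as exc:
    error = exc
    print("ValueError:", exc)
```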

That is a single column of data, which is what pandas is telling you. To construct the DataFrame properly, you can either restructure the list so that each row is a tuple of three values, like:

[
    (A1, B1, C1),
    (A2, B2, C2)
]

Or, my preferred way, because it's far more robust to coding errors and to inconsistent lengths in your data: create each row as a dictionary of columns. Thus,

rowdict_list = []
for row in data_source:
    a = extract_a()
    b = extract_b()
    c = extract_c()
    rowdict_list.append({'column_a': a, 'column_b': b, 'column_c': c})

The DataFrame is then constructed easily, without having to specify the columns explicitly in the constructor, with df = pd.DataFrame(rowdict_list).
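
Both options can be sketched in runnable form as follows. The sample values and the column names are assumptions for illustration (the column names are borrowed from the question's commented-out dictionary). Note how a row dictionary with a missing key simply produces a missing value in that column rather than an error.

```python
import pandas as pd

# Hypothetical sample values standing in for the scraped strings
flat = ["Jumper", "20.00", "25.00", "Cardigan", "30.00", ""]
columns = ["Description", "CurrentPrice", "Old Price"]

# Option 1: regroup the flat list into tuples of three values per row
rows = [tuple(flat[i:i + 3]) for i in range(0, len(flat), 3)]
df_tuples = pd.DataFrame(rows, columns=columns)
print(df_tuples)

# Option 2: one dictionary per row; the keys become the column names
rowdict_list = [
    {"Description": "Jumper", "CurrentPrice": "20.00", "Old Price": "25.00"},
    {"Description": "Cardigan", "CurrentPrice": "30.00"},  # no old price
]
df_dicts = pd.DataFrame(rowdict_list)
print(df_dicts)
```

Option 1 only works if every product contributes exactly three values; option 2 keeps each value attached to its column name, so a skipped field cannot shift the remaining values into the wrong column.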

1 Comment

Very clever, king of pandas
0

You can create a DataFrame using the array dictionary.

Set the values of the array dict to empty lists so that you can append the values from the web page to the correct list. Also move the array variable outside of the while loop.

array = {"Description": [], "CurrentPrice": [], "Old Price": []}
scraped_data = []
while Code == 200:
    ...

On the line where you were previously defining the array variable, append the description, price and old price values like so:

array['Description'].append(str(Description))
array['CurrentPrice'].append(str(NewPrice))
array['Old Price'].append(str(OldPrice))

Then you can create a DataFrame using the array variable:

pd.DataFrame(array)
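
As a self-contained sketch of this dictionary-of-lists approach, the sample tuples below are made up and stand in for the values scraped on each pass of the loop. One caveat worth knowing: every list must end up the same length, otherwise the constructor raises a ValueError, so all three appends need to happen for every product.

```python
import pandas as pd

array = {"Description": [], "CurrentPrice": [], "Old Price": []}

# Hypothetical values standing in for what the scraper finds per product
samples = [("Jumper", "20.00", "25.00"), ("Cardigan", "30.00", "")]
for description, new_price, old_price in samples:
    array["Description"].append(description)
    array["CurrentPrice"].append(new_price)
    array["Old Price"].append(old_price)

df = pd.DataFrame(array)
print(df)

# If one list gets an extra value, the lengths no longer match
array["Description"].append("Scarf")
try:
    pd.DataFrame(array)
    uneven_error = None
except ValueError as exc:
    uneven_error = exc
    print("ValueError:", exc)
```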

So the final solution would look something like

array = {"Description": [], "CurrentPrice": [], "Old Price": []}
scraped_data = []
while Code == 200:
    ...
    # For loop
    for div in divs:

        # Get Description
        Description = div.find('h3', attrs={'class': 'product__title'})
        Description = Description.text.strip()

        # Fetch TitlePrice
        try:
            NewPrice = div.find('div', attrs={'class': 'price product__price--current'})
            NewPrice = NewPrice.text.strip()
        except AttributeError:
            NewPrice = div.find('p', attrs={'class': 'price price--reduced'})
            NewPrice = NewPrice.text.strip()

        # Fetch OldPrice
        try:
            OldPrice = div.find('p', attrs={'class': 'price price--previous'})
            OldPrice = OldPrice.text.strip()
        except AttributeError:
            OldPrice = ""

        array['Description'].append(str(Description))
        array['CurrentPrice'].append(str(NewPrice))
        array['Old Price'].append(str(OldPrice))
        # Print Array
        print(array)
    i = i + 1
else:
    i = i - 2
    now = time.time()
    # Build the DataFrame once, after every page has been scraped
    df = pd.DataFrame(array)
    print('Parse complete with', i, 'pages' + ' in', now - then, 'seconds')

Finally, make sure you've imported pandas at the top of the module: import pandas as pd
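
The try/except AttributeError blocks above work because find returns None when nothing matches, and None has no .text attribute. As an alternative sketch, the missing element can be checked for explicitly. The HTML snippet, its class names, and the text_or_default helper are all hypothetical and only illustrate the pattern; they are not the retailer's real markup.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the second product has no old price
html = """
<article class="item"><div class="name">Jumper</div>
<span class="price">£20.00</span><span class="was">£25.00</span></article>
<article class="item"><div class="name">Cardigan</div>
<span class="price">£30.00</span></article>
"""
soup = BeautifulSoup(html, "html.parser")


def text_or_default(tag, default=""):
    """Return the tag's cleaned text, or a default when find() gave None."""
    if tag is None:
        return default
    return tag.text.strip().strip("£")


rows = []
for article in soup.find_all("article", attrs={"class": "item"}):
    rows.append({
        "Description": text_or_default(article.find("div", attrs={"class": "name"})),
        "CurrentPrice": text_or_default(article.find("span", attrs={"class": "price"})),
        "Old Price": text_or_default(article.find("span", attrs={"class": "was"})),
    })

print(rows)
```

A list of dictionaries like rows can be passed straight to pd.DataFrame.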

2 Comments

Thank you schwab! This worked with a little tweaking. I had to move array = {"Description": [], "CurrentPrice": [], "Old Price": []} outside of the while loop but apart from that it's great.
@Rosstopher I've updated the answer with the edit for the while loop, in case someone comes across this in the future
