
I am using BeautifulSoup to scrape some web data into a CSV file. Some of the elements I am scraping are lists of specific items; two lists per record, to be exact. Below is an example of what the data comes through as:

Name, Image_Filename, [2015, 2016, 2017], [12, 55, 74]

What I need is a row for each individual item in each list, like this:

  • Name, Image_Filename, 2015, 12
  • Name, Image_Filename, 2016, 55
  • Name, Image_Filename, 2017, 74

I already have all the data scraped into a CSV file, and I am looking to avoid going through the entire sheet and scrubbing the data manually. I am not opposed to doing that, but if Python can be leveraged to complete this task, I would prefer to go that route.
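For example, I imagine a one-off cleanup pass over the existing file would look roughly like the sketch below, though I am not sure this is the right approach (the output filename is just a placeholder; it assumes the two list columns were written as Python-style list strings, which is what csv.writer does with a list):

import ast
import csv

# hypothetical cleanup pass over the already-scraped file; assumes the two
# list columns look like "['2015', '2016', '2017']" and line up index-wise
with open('train_data.csv', newline='') as src, \
     open('train_data_flat.csv', 'w', newline='') as dst:
    reader = csv.reader(src)
    writer = csv.writer(dst)
    for name, image, waves, bags in reader:
        wave_list = ast.literal_eval(waves)   # turn the string back into a list
        bag_list = ast.literal_eval(bags)
        for i in range(len(wave_list)):       # one output row per list item
            writer.writerow([name, image, wave_list[i], bag_list[i]])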

Here is the entire script I use to scrape the data. I am fairly new to Python, with limited experience in web scraping and browser automation, so I don't know if formatting the data could be included in this script or if it is something I would have to write separately:

from urllib.request import urlopen
from bs4 import BeautifulSoup
from datetime import date
import re
import csv

with open('hyperlinks.csv', 'r') as startFile:

    for line in startFile:
        # each line holds one URL; strip the trailing newline before opening
        url = urlopen(line.strip())
        soup = BeautifulSoup(url, 'html.parser')

        data_container = soup.find('aside')
        image = data_container.find('a', attrs={'class': 'image-thumbnail'})
        image_href = image.get('href')

        # image filename, minus its original extension
        img_container = data_container.find('img')
        data_image_name = img_container.get('data-image-name')
        filename = data_image_name.split('.')
        final_filename = filename[0]
        train_title = data_container.find('h2')
        title_text = train_title.get_text()

        image_filename = final_filename
        full = image_filename + '.jpg'

        # first list: the year ("wave") links inside the series div
        series = data_container.find('div', attrs={'data-source': 'series'})
        wave_links = series.find('div')
        wave_set = []
        wave_links_sep = wave_links.find_all('a')
        for item in wave_links_sep:
            text_only = item.get_text()
            wave_set.append(text_only)

        # second list: bag codes, split wherever a " (year)" marker appears
        bag = data_container.find('div', attrs={'data-source':'bag_code'})
        bag_code = bag.find('div')
        bag_text = bag_code.get_text()
        regex = re.compile(r'\s\((2015|2016|2017|2018|2019)\)')
        bag_numbers = re.sub(regex, ",", bag_text)
        bag_list = []
        for nums in bag_numbers.split(','):
            bag_list.append(nums)

        # drop the empty strings left over from the split
        filtered_bag_list = list(filter(None, bag_list))

        with open('train_data.csv', 'a', newline='') as myFile:
            writer = csv.writer(myFile)
            writer.writerow([title_text, full, wave_set, filtered_bag_list])

1 Answer


You can zip both of your item lists:

for wvs, bgl in zip(wave_set, filtered_bag_list):
    writer.writerow([title_text, full, wvs, bgl])

This works provided your lists are the same length and correspond index-wise.
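If the lists can come out with different lengths, itertools.zip_longest from the standard library pads the shorter one with a fill value instead of silently dropping the extra items (a hypothetical variant, in case your scrape is uneven):

from itertools import zip_longest

wave_set = [2015, 2016, 2017, 2018]
filtered_bag_list = [12, 55, 74]

# zip() would stop after three pairs; zip_longest keeps the fourth year
for wvs, bgl in zip_longest(wave_set, filtered_bag_list, fillvalue=''):
    print(wvs, bgl)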

Full example:

wave_set = [2015, 2016, 2017]
filtered_bag_list = [12, 55, 74]

import csv
with open('train_data.csv', 'a', newline='') as myFile:
    writer = csv.writer(myFile)
    for wvs, bgl in zip(wave_set, filtered_bag_list):
        writer.writerow(["some","text", wvs, bgl])

with open("train_data.csv") as f:
    print(f.read())

Output in file:

some,text,2015,12
some,text,2016,55
some,text,2017,74

zip([1, 2, 3], ["a", "b", "c"])

creates the tuples (1, "a"), (2, "b"), (3, "c") and provides them as an iterator; see e.g. Zip lists in Python for more insights.
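Note that in Python 3 zip returns a lazy iterator, so its pairs can only be traversed once:

pairs = zip([1, 2, 3], ["a", "b", "c"])
print(next(pairs))   # (1, 'a')
print(list(pairs))   # [(2, 'b'), (3, 'c')] - the remaining pairs
print(list(pairs))   # [] - the iterator is exhausted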


1 Comment

Awesome! This worked perfectly for my use case. Now I will need to read up on this function.
