I have successfully managed to scrape the website listed from JS into a local .html file, but the output falls short.

The issues are:

  • It only outputs the last value requested (audioSource), not the others.
  • It finds only episode 1 and stops there. How do I make it repeat until it reaches the end?

Many thanks

import requests
import json
from bs4 import BeautifulSoup

JSONDATA = requests.request("GET", "https://thisiscriminal.com/wp-json/criminal/v1/episodes?posts=1000000&page=1")
JSONDATA = JSONDATA.json()

for line in JSONDATA['posts']:
    soup = BeautifulSoup(line['episodeNumber'],'lxml')
    soup = BeautifulSoup(line['title'],'lxml')
    soup = BeautifulSoup(line['image']['large'],'lxml')
    soup = BeautifulSoup(line['excerpt']['long'],'lxml')
    soup = BeautifulSoup(line['audioSource'],'lxml')
with open("output1.html", "w") as file:
    file.write(str(soup))

2 Answers

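The problems here are:

  1. The file is opened in w mode, which replaces the whole file, and the write happens only once, after the loop has finished.
  2. The same variable name, soup, is reused for every value, so each assignment overwrites the previous one and only the last value (audioSource) is left to write.
  3. You don't need the bs4 module to parse JSON data.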
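
The three points above can be sketched with the standard library alone. This is a minimal sketch, not a definitive implementation: the function name render_episodes, the sample data, and the HTML layout are hypothetical, and the field names are taken from the question's code. It also assumes the excerpt should be escaped as plain text.

```python
from html import escape


def render_episodes(posts):
    """Build one HTML string that contains every field of every episode."""
    blocks = []
    for post in posts:
        # Nested objects may be missing or None, so fall back to empty values.
        image = (post.get("image") or {}).get("large") or ""
        excerpt = (post.get("excerpt") or {}).get("long") or ""
        audio = post.get("audioSource") or ""
        number = escape(str(post.get("episodeNumber", "")))
        title = escape(str(post.get("title", "")))
        # Append instead of reassigning, so earlier episodes are kept.
        blocks.append(
            "<article>\n"
            f"<h2>{number}: {title}</h2>\n"
            f'<img src="{escape(str(image))}">\n'
            f"<p>{escape(str(excerpt))}</p>\n"
            f'<audio controls src="{escape(str(audio))}"></audio>\n'
            "</article>"
        )
    return "\n".join(blocks)


# Hypothetical sample data shaped like JSONDATA['posts'] in the question.
sample_posts = [
    {"episodeNumber": "2", "title": "Second", "image": {"large": "two.jpg"},
     "excerpt": {"long": "About two."}, "audioSource": "two.mp3"},
    {"episodeNumber": "1", "title": "First", "image": {"large": "one.jpg"},
     "excerpt": {"long": "About one."}, "audioSource": "one.mp3"},
]
html_output = render_episodes(sample_posts)
```

With the real response, the result of render_episodes(JSONDATA['posts']) could be written once, after the loop has collected everything, using open("output1.html", "w").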

What you can do:

Install the pandas module with pip (pip install pandas) or conda (conda install pandas), then create a DataFrame from the JSON data.

You can then use the DataFrame however you like.

import requests
import pandas as pd

# Fetch every episode in a single request and decode the JSON body.
response = requests.get("https://thisiscriminal.com/wp-json/criminal/v1/episodes?posts=1000000&page=1")
data = response.json()

df = pd.DataFrame(data)

filename = 'Output.txt'

# open() creates the file if it does not exist; 'a' appends to it.
with open(filename, 'a') as fopen:
    for i in range(len(df)):
        post = df.posts[i]
        # str() guards against values that are not strings (e.g. None).
        fopen.write(str(post['episodeNumber']) + '\n')
        fopen.write(str(post['title']) + '\n')
        fopen.write(str(post['image']['large']) + '\n')
        fopen.write(str(post['excerpt']['long']) + '\n')
        fopen.write(str(post['audioSource']) + '\n')
        fopen.write('\n')

This is the full code you need.
Additionally, you can call print(df.head()) to see how the DataFrame stores each post as a dictionary, and work from there.

2 Comments

Thanks. Running that gives me this error: File "./test4.py", line 19, in <module> fopen.writelines(df.posts[i]['excerpt']['long']+'\n') TypeError: writelines() argument must be a sequence of strings. It creates the file, but writes only the first three values; there is no excerpt or audioSource link. It also produces only the first result, i.e. episode 116. How would I get it to repeat until the very end, i.e. episode 1?
This is self-explanatory: you need to pass string values. Convert the value before concatenating, e.g. fopen.write(str(df.posts[i]['excerpt']['long']) + '\n')
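
The reply above could be wrapped in a small helper so that one unexpected value does not abort the loop on the first episode. This is only a sketch under assumptions: write_field is a hypothetical name, and an in-memory buffer stands in for the output file.

```python
import io


def write_field(handle, value):
    """Write one field per line, converting non-string values first."""
    # None becomes an empty line rather than the literal text "None".
    text = "" if value is None else str(value)
    handle.write(text + "\n")


buffer = io.StringIO()  # stands in for the open output file
for value in ("116", None, 42):
    write_field(buffer, value)
```

Because every value is converted before it is written, the loop can continue through all posts instead of stopping at the first value that is not a string.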

Using the pandas library, save the data to a CSV file in the current working directory:

import requests
import pandas as pd

resp = requests.get("https://thisiscriminal.com/wp-json/criminal/v1/episodes?posts=1000000&page=1").json()
df = pd.DataFrame(resp['posts'], columns=['episodeNumber', 'title', 'image','excerpt','audioSource'])
# Save the data to posts.csv in the current working directory
df.to_csv("posts.csv")
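
Because image and excerpt are nested objects in the JSON, the CSV columns above hold each object's whole dictionary representation. As an alternative to pandas, the sketch below uses the standard csv module to pull the nested values into flat columns. It assumes the nested keys large and long from the question; posts_to_csv, the column names image_large and excerpt_long, and the sample row are hypothetical.

```python
import csv
import io


def posts_to_csv(posts, handle):
    """Write one CSV row per episode, pulling nested values into flat columns."""
    fields = ["episodeNumber", "title", "image_large", "excerpt_long", "audioSource"]
    writer = csv.DictWriter(handle, fieldnames=fields)
    writer.writeheader()
    for post in posts:
        writer.writerow({
            "episodeNumber": post.get("episodeNumber"),
            "title": post.get("title"),
            # Missing or None nested objects yield an empty cell.
            "image_large": (post.get("image") or {}).get("large"),
            "excerpt_long": (post.get("excerpt") or {}).get("long"),
            "audioSource": post.get("audioSource"),
        })


# Hypothetical sample row shaped like one entry of resp['posts'].
sample = [{"episodeNumber": "1", "title": "Demo", "image": {"large": "a.jpg"},
           "excerpt": {"long": "text"}, "audioSource": "a.mp3"}]
out = io.StringIO()
posts_to_csv(sample, out)
```

To write a real file, pass a handle opened with open("posts.csv", "w", newline="") instead of the in-memory buffer.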
