Trying to output JS data from webpage into .html output file

Question

I have managed to scrape the data that the website below serves to its JavaScript front end into a local .html file, but the output is incomplete.

The issues are:

it only produces the last query (audioSource) and not the other requests
it finds only episode 1, and stops there. How do I make it repeat until it finds the end?

Many thanks

import requests
import json
from bs4 import BeautifulSoup

JSONDATA = requests.request("GET", "https://thisiscriminal.com/wp-json/criminal/v1/episodes?posts=1000000&page=1")
JSONDATA = JSONDATA.json()

for line in JSONDATA['posts']:
    soup = BeautifulSoup(line['episodeNumber'],'lxml')
    soup = BeautifulSoup(line['title'],'lxml')
    soup = BeautifulSoup(line['image']['large'],'lxml')
    soup = BeautifulSoup(line['excerpt']['long'],'lxml')
    soup = BeautifulSoup(line['audioSource'],'lxml')
with open("output1.html", "w") as file:
    file.write(str(soup))

Community · Accepted Answer · 2020-06-20 09:12:55Z

1

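The problems here are:

Opening the file in w mode truncates it, so whatever was there is replaced by the new text.
The same variable name, soup, is reused for every value, so each assignment overwrites the previous one and only the last (audioSource) survives.
You don't need the bs4 module here to parse the JSON data.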
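The first two points can be sketched without pandas or a network call. This is only an illustration: render_posts and SAMPLE_POSTS are hypothetical names, the sample values are made up, and every field is assumed to be plain text (hence the escaping). The nested keys follow the ones used in the question.

```python
import html

# Hypothetical sample shaped like the entries in JSONDATA['posts'];
# the values are invented for illustration.
SAMPLE_POSTS = [
    {
        "episodeNumber": "2",
        "title": "Second <Episode>",
        "image": {"large": "https://example.com/2.jpg"},
        "excerpt": {"long": "Summary of episode two."},
        "audioSource": "https://example.com/2.mp3",
    },
    {
        "episodeNumber": "1",
        "title": "First Episode",
        "image": {"large": "https://example.com/1.jpg"},
        "excerpt": {"long": "Summary of episode one."},
        "audioSource": "https://example.com/1.mp3",
    },
]


def render_posts(posts):
    """Return one HTML document with a block for every post."""
    blocks = []
    for post in posts:
        # Each field gets its own name, so nothing is overwritten in the loop.
        number = html.escape(str(post["episodeNumber"]))
        title = html.escape(str(post["title"]))
        image = html.escape(str(post["image"]["large"]))
        excerpt = html.escape(str(post["excerpt"]["long"]))
        audio = html.escape(str(post["audioSource"]))
        blocks.append(
            "<article>\n"
            f"<h2>{number}: {title}</h2>\n"
            f'<img src="{image}">\n'
            f"<p>{excerpt}</p>\n"
            f'<a href="{audio}">Listen</a>\n'
            "</article>"
        )
    # The blocks are joined only after every post has been collected.
    return "<html><body>\n" + "\n".join(blocks) + "\n</body></html>\n"
```

With the real response, the file would then be opened once in w mode and written once, for example file.write(render_posts(JSONDATA['posts'])), so the mode no longer matters.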

What you can do is:

Install the pandas module and create a dataframe. Install it with pip (pip install pandas) or with conda (conda install pandas).

Then you can use the dataframe however you like.

import requests
import pandas as pd

JSONDATA = requests.request("GET", "https://thisiscriminal.com/wp-json/criminal/v1/episodes?posts=1000000&page=1")
JSONDATA = JSONDATA.json()

# Each cell of the 'posts' column holds the dictionary for one episode.
df = pd.DataFrame(JSONDATA)

filename = 'Output.txt'

# open() creates the file if it does not exist, so os.mknod() is not needed.
with open(filename, 'a') as fopen:
    for i in range(len(df)):
        post = df.posts[i]
        # str() guards against values that are not strings.
        fopen.write(str(post['episodeNumber']) + '\n')
        fopen.write(str(post['title']) + '\n')
        fopen.write(str(post['image']['large']) + '\n')
        fopen.write(str(post['excerpt']['long']) + '\n')
        fopen.write(str(post['audioSource']) + '\n')
        fopen.write('\n')
# The with block closes the file, so no explicit close() call is needed.

This is the full code you need.
Additionally, you can use print(df.head()) to see how the dataframe stores each episode as a dictionary, and work with it from there.

Output :

You can see the whole text here

edited Jun 20, 2020 at 9:12


answered Jun 10, 2019 at 1:58

ASHu2


2 Comments

leopheard Over a year ago

Thanks. Running that gives me this error:

File "./test4.py", line 19, in <module>
    fopen.writelines(df.posts[i]['excerpt']['long']+'\n')
TypeError: writelines() argument must be a sequence of strings

It produces the file, but only writes the first 3 outputs, and there's no excerpt or audioSource link. Also, it only produces the first result, i.e. episode 116. How would I get it to repeat until the very end, i.e. episode 1?

ASHu2 Over a year ago

The error is self-explanatory: you need to pass string values. So you can do fopen.writelines(str(df.posts[i]['excerpt']['long']+'\n'))

bharatk · Accepted Answer · 2019-07-16 04:48:41Z

1

Using the pandas library, you can save the data to a CSV file in the current project directory:

import requests
import pandas as pd

resp = requests.get("https://thisiscriminal.com/wp-json/criminal/v1/episodes?posts=1000000&page=1").json()
df = pd.DataFrame(resp['posts'], columns=['episodeNumber', 'title', 'image','excerpt','audioSource'])
# Save the data to posts.csv in the current project directory.
df.to_csv("posts.csv")
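One thing to be aware of: image and excerpt are nested (the question reads line['image']['large'] and line['excerpt']['long']), so the DataFrame above stores each of them as a whole dictionary in a single cell. As a hedged alternative, here is a minimal sketch that uses the standard-library csv module instead of pandas and pulls the nested values into their own columns. The names flatten, posts_to_csv, FIELDS and SAMPLE_POSTS are hypothetical, and the sample values are made up.

```python
import csv
import io

# Hypothetical sample shaped like resp['posts']; the values are invented.
SAMPLE_POSTS = [
    {
        "episodeNumber": "2",
        "title": "Second Episode",
        "image": {"large": "https://example.com/2.jpg"},
        "excerpt": {"long": "Summary, with a comma."},
        "audioSource": "https://example.com/2.mp3",
    },
    {
        "episodeNumber": "1",
        "title": "First Episode",
        "image": {"large": "https://example.com/1.jpg"},
        "excerpt": None,  # a missing excerpt becomes an empty cell
        "audioSource": "https://example.com/1.mp3",
    },
]

# Column names chosen for this sketch, not taken from the API.
FIELDS = ["episodeNumber", "title", "image_large", "excerpt_long", "audioSource"]


def flatten(post):
    """Pull the nested values up so that each CSV cell holds plain text."""
    return {
        "episodeNumber": post.get("episodeNumber"),
        "title": post.get("title"),
        "image_large": (post.get("image") or {}).get("large"),
        "excerpt_long": (post.get("excerpt") or {}).get("long"),
        "audioSource": post.get("audioSource"),
    }


def posts_to_csv(posts, stream):
    """Write a header row plus one row per post to an open text stream."""
    writer = csv.DictWriter(stream, fieldnames=FIELDS)
    writer.writeheader()
    for post in posts:
        writer.writerow(flatten(post))


# Demonstration on an in-memory buffer; a real file works the same way.
buffer = io.StringIO()
posts_to_csv(SAMPLE_POSTS, buffer)
print(buffer.getvalue())
```

To write a real file, pass a stream opened with open("posts.csv", "w", newline="", encoding="utf-8") and resp['posts'] in place of the sample.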

edited Jul 16, 2019 at 4:48

answered Jun 10, 2019 at 11:07

bharatk

