
I need to scrape every article, with its title and paragraphs, from this page: https://portaljuridic.gencat.cat/eli/es-ct/l/2014/12/29/19

The problem is that I tried selecting div, h3 and p elements, but nothing is returned.

import requests  # needed by parse_url below
from bs4 import BeautifulSoup
import lxml
import pandas as pd
from tqdm import tqdm_notebook


def parse_url(url):
    response = requests.get(url)
    content = response.content
    parsed_response = BeautifulSoup(content, "lxml")
    return parsed_response


url = "https://portaljuridic.gencat.cat/eli/es-ct/l/2014/12/29/19"

soup = parse_url(url)


article = soup.find("div", {"class":"article-document"})

article

The page seems to load its content with JavaScript, but I don't know how to retrieve it.

1 Answer


The website makes three API calls to fetch the data.
The code below makes the same calls and retrieves the data.

(In the browser, press F12 and open Network -> XHR to see the API calls.)

import requests

payload1 = {'language':'ca','documentId':680124}
r1 = requests.post('https://portaldogc.gencat.cat/eadop-rest/api/pjc/getListTraceabilityStandard',data = payload1)
if r1.status_code == 200:
  print(r1.json())

print('------------------')
payload2 = {'documentId':680124,'orderBy':'DESC','language':'ca','traceability':'02'}
r2 = requests.post('https://portaldogc.gencat.cat/eadop-rest/api/pjc/getListValidityByDocument',data = payload2)
if r2.status_code == 200:
  print(r2.json())

print('------------------')

payload3 = {'documentId': 680124,'traceabilityStandard': '02','language': 'ca'}
r3 = requests.post('https://portaldogc.gencat.cat/eadop-rest/api/pjc/documentPJC',data=payload3)
if r3.status_code == 200:
  print(r3.json())
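
Once a response has been parsed with .json(), the article bodies still have to be picked out of the nested structure. The following is only a sketch: it assumes the HTML fragments sit under keys named `text` somewhere in the JSON (a `text` key shows up in the comments, but the full layout is not shown here). `iter_values` is a hypothetical helper and `fake_response` is made-up data standing in for `r3.json()`.

```python
def iter_values(node, key="text"):
    """Yield every string stored under `key`, at any nesting depth."""
    if isinstance(node, dict):
        for k, v in node.items():
            if k == key and isinstance(v, str):
                yield v
            else:
                # Keep descending into nested dicts and lists.
                yield from iter_values(v, key)
    elif isinstance(node, list):
        for item in node:
            yield from iter_values(item, key)


# Made-up stand-in for r3.json(); the real layout may differ.
fake_response = {
    "title": "example",
    "articles": [
        {"number": 1, "text": "<p>first</p>"},
        {"number": 2, "sections": [{"text": "<p>second</p>"}]},
    ],
}

texts = list(iter_values(fake_response))
print(texts)
```

Walking the structure recursively avoids hard-coding a path that may not match the real response; if the actual keys differ, only the `key` argument needs to change.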

5 Comments

Hi balderman, thanks for your help and explanation. May I ask one more question? Sorry, I'm really new to this. Some parts of the text have special characters like ' or `, and in the extracted output an & sequence appears instead. How can I change these into the actual characters? Thanks again for the support.
I am not sure I understand the question. Can you come up with a specific example?
Hi Balderman. For example, when I extract the first article, inside 'text': '<p align="JUSTIFY">\n\t1. the text starts with Aquesta llei t&eacute; per objecte, but on the website it appears as 1. Aquesta llei té per objecte: How can I change this so that I see Aquesta llei té per objecte: instead of Aquesta llei t&eacute; per objecte? Thanks for your support.
Well... I have no idea. Sorry.
Hi balderman. OK, I will look around and see what I find. Thanks a lot for your help and support!
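
The `t&eacute;` in the comments above is an HTML character reference: the API appears to return HTML fragments, so the text needs to be decoded as HTML. A minimal sketch using only the Python standard library follows. `html.unescape` decodes the references, and a small `HTMLParser` subclass also drops the tags. `fragment_to_text` is a hypothetical helper name, and `sample` is built from the fragment quoted in the comment (the closing `</p>` is an assumption).

```python
import html
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects text nodes and ignores tags."""

    def __init__(self):
        # convert_charrefs=True makes the parser decode &eacute; and similar.
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def fragment_to_text(fragment):
    """Return the plain text of an HTML fragment with whitespace collapsed."""
    parser = _TextExtractor()
    parser.feed(fragment)
    parser.close()  # flush any buffered text
    return " ".join("".join(parser.parts).split())


sample = '<p align="JUSTIFY">\n\t1. Aquesta llei t&eacute; per objecte:</p>'

# Decoding the references only (tags would be kept):
print(html.unescape("Aquesta llei t&eacute; per objecte"))

# Decoding the references and stripping the tags:
print(fragment_to_text(sample))
```

Since the question already uses BeautifulSoup, `BeautifulSoup(fragment, "lxml").get_text()` should give a similar result; the standard-library version is shown only to keep the example free of extra dependencies.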
