
I want to crawl around 500 articles from the Al Jazeera website and collect 4 fields from each:

  • URL
  • Title
  • Tags
  • Author

I have written a script that collects data from the home page, but it only picks up a couple of articles; the rest are spread across different categories. How can I iterate through 500 articles? Is there an efficient way to do it?

import requests
from bs4 import BeautifulSoup

page = requests.get('https://www.aljazeera.com/')
soup = BeautifulSoup(page.content, "html.parser")

# Only the "More top stories" block on the home page is scraped,
# which is why just a handful of articles come back.
section = soup.find(id='more-top-stories')
articles = section.find_all(class_='mts-article mts-default-article')

article_titles = [a.find(class_='mts-article-title').get_text() for a in articles]
article_descs = [a.find(class_='mts-article-p').get_text() for a in articles]
tags = [a.find(class_='mts-category').get_text() for a in articles]
links = [a.find(class_='mts-article-title').find('a') for a in articles]
  • On the website there are only 6 articles under "More top stories"; there are nowhere near 500 there, and BeautifulSoup can only extract what is in the HTML it parses. Commented Jan 7, 2020 at 11:59
  • The classes differ between sections of the website. What is a better way to approach this problem? Commented Jan 7, 2020 at 12:27
  • Yes, you can get articles from the different categories, but there still aren't 500 articles. Commented Jan 7, 2020 at 12:57
  • Is BeautifulSoup the best approach here, or should I explore other libraries as well? Can you suggest any? Commented Jan 7, 2020 at 12:59
  • Try this: links = [a.find(class_='mts-article-title').find('a')['href'] for a in articles] Commented Jan 7, 2020 at 13:16

1 Answer

You can use Scrapy for this; each Al Jazeera article page embeds its metadata as JSON-LD, which the spider below reads.

import scrapy
import json

class BlogsSpider(scrapy.Spider):
    name = 'blogs'
    start_urls = [
        'https://www.aljazeera.com/news/2020/05/fbi-texas-naval-base-shooting-terrorism-related-200521211619145.html',
    ]

    def parse(self, response):
        for data in response.css('body'):
            # The article metadata (headline, author, canonical URL) is
            # embedded as JSON-LD in a <script> tag that mentions 'mainEntityOfPage'.
            current_script = data.xpath("//script[contains(., 'mainEntityOfPage')]/text()").extract_first()
            json_data = json.loads(current_script)
            yield {
                'name': json_data['headline'],
                'author': json_data['author']['name'],
                'url': json_data['mainEntityOfPage'],
                'tags': data.css('div.article-body-tags ul li a::text').getall(),
            }
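
This spider parses a single hard-coded article. To get closer to 500 articles, you can instead start from listing pages and follow every article link found there, as in the sketch below. The start URL, the '/news/'-plus-'.html' URL filter, and the CSS selectors are assumptions about Al Jazeera's markup at the time of writing, so verify them against the live site. Scrapy's built-in CLOSESPIDER_ITEMCOUNT setting closes the spider once roughly 500 items have been scraped.

import scrapy
import json

class AlJazeeraSpider(scrapy.Spider):
    name = 'aljazeera'
    # Listing pages to harvest article links from; add more category
    # pages if one listing does not surface enough links.
    start_urls = ['https://www.aljazeera.com/news/']
    # Built-in closespider extension: stop after ~500 scraped items.
    custom_settings = {'CLOSESPIDER_ITEMCOUNT': 500}

    def parse(self, response):
        # Follow anything that looks like an article URL. The filter below
        # is an assumption about Al Jazeera's URL scheme; adjust as needed.
        for href in response.css('a::attr(href)').getall():
            if '/news/' in href and href.endswith('.html'):
                yield response.follow(href, callback=self.parse_article)

    def parse_article(self, response):
        # Same JSON-LD extraction as the spider above.
        script = response.xpath("//script[contains(., 'mainEntityOfPage')]/text()").get()
        if not script:
            return
        data = json.loads(script)
        yield {
            'url': data.get('mainEntityOfPage'),
            'title': data.get('headline'),
            # 'author' may be a dict or a list in the JSON-LD; this
            # handles the dict shape that the spider above relies on.
            'author': (data.get('author') or {}).get('name'),
            'tags': response.css('div.article-body-tags ul li a::text').getall(),
        }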

Save whichever spider you use to a file (e.g. file.py) in your Scrapy project's spiders directory and run it by its name:

$ scrapy crawl blogs -o output.json

But set up the Scrapy project structure first.
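
If you have not set up a Scrapy project before, the initial structure is generated for you (the project name here is just an example):

$ pip install scrapy
$ scrapy startproject aljazeera
$ cd aljazeera

Place the spider file under aljazeera/spiders/ and run the crawl command above from the project root.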
