I have a program that takes one URL from a user, crawls the whole site, and returns a list of all URLs with some parsed data for each URL.

It all looks like:

class Crawl:
    def __init__(self, url_from_user):
        # crawls the whole site and collects one Page per URL found
        self.result = [Page("http://example.com/1"),
                       Page("http://example.com/2"),
                       Page("http://example.com/3")]

class Page:
    def __init__(self, url):
        self.url = url
        self.data_1 = "string_1"
        self.data_2 = "string_2"
        self.data_3 = "string_3"

class Crawl - handles threading and all inputs/data common to every page.

class Page - stores the unique data for each page and handles the HTML parsing.

I want to turn this program into a web site. With Django, I want to create pages that take url_from_user and start crawling a site. I want to store the results in a SQL database so I can pass them to different views.

The question is: how can I dynamically display the results during a crawl, while it isn't finished yet? In the middle of a Crawl I can print the partial result to stdout in the console. Can I show an unfinished result in an HTML page?

My first thought is jQuery, but can jQuery hook into stdout output (or better, get access to the result list itself with all the methods of Page, so that I could manipulate individual elements of the list while it is still growing during a running Crawl)?

1 Answer

Here's what you have to do:

  1. Create a Django website that takes the data you want to display from a database (can be SQLite) and shows it in the desired format (see the model sketch after this list).
  2. Create a crawling script
  3. Add a view that renders a form (your desired URL-input functionality) and starts the script. There are actually two ways you can go about it:

    3.1 Have the script run inside the request itself. This freezes your site for the user until crawling is done, but it is easier to do.

    3.2 Have it schedule a crawling job via Celery or cron. This is all-round better: it doesn't freeze anything, allows more flexibility, and lets you see the current progress, but it requires you to set up a job queue and is generally harder the first time (a rough sketch of the view/task split follows this list).

  4. Make your script put the scraped URLs and the required info into the same database Django is taking data from.
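
For steps 1 and 4, here is a minimal sketch of what the shared models could look like. The names CrawlJob and PageData and the exact fields are my own assumptions, not something from the question:

# models.py - a sketch only; model and field names are assumptions
from django.db import models

class CrawlJob(models.Model):
    start_url = models.URLField()
    created = models.DateTimeField(auto_now_add=True)
    finished = models.BooleanField(default=False)

class PageData(models.Model):
    job = models.ForeignKey(CrawlJob, on_delete=models.CASCADE, related_name="pages")
    url = models.URLField()
    data_1 = models.TextField(blank=True)
    data_2 = models.TextField(blank=True)
    data_3 = models.TextField(blank=True)

The crawling script saves one PageData row per parsed page, and any Django view can query job.pages.all() to display partial or final results.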
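
And for option 3.2, a rough sketch of the view/task split, assuming Celery is already configured and that your Crawl class is importable (the module name mycrawler and the URL name job_detail are invented for illustration):

# tasks.py
from celery import shared_task
from mycrawler import Crawl  # your existing crawler; module name assumed
from .models import CrawlJob, PageData

@shared_task
def run_crawl(job_id):
    job = CrawlJob.objects.get(pk=job_id)
    crawl = Crawl(job.start_url)
    for page in crawl.result:
        PageData.objects.create(job=job, url=page.url,
                                data_1=page.data_1,
                                data_2=page.data_2,
                                data_3=page.data_3)
    job.finished = True
    job.save()

# views.py
from django.shortcuts import redirect, render
from .models import CrawlJob
from .tasks import run_crawl

def start_crawl(request):
    if request.method == "POST":
        job = CrawlJob.objects.create(start_url=request.POST["url"])
        run_crawl.delay(job.pk)  # returns immediately; Celery runs the job
        return redirect("job_detail", pk=job.pk)
    return render(request, "start_crawl.html")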

Now for the dynamic progress display. I am by no means a specialist, but I see a few ways:

  1. Have the script keep a log of events (you can do it via a Django model, so that events are stored in the DB), e.g. "parsed url http://foo.bar", and have a page that displays the events for a certain job (a minimal sketch follows this list).
  2. Make the whole interactive crawling process a separate application that runs an async server and sends feedback. For example, do it via websockets: Django serves a JS file; in that JS file the application connects to a websocket application (preferably running on the same host as Django) that does the crawling and reports progress over websockets. Mind you, this is tricky to set up, but possible.
  3. You could have Django display stuff from log files, but I think that could get tricky quickly.
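
Option 1 can be as small as one extra model plus a view. A sketch, reusing the hypothetical CrawlJob from above:

# models.py
from django.db import models

class CrawlEvent(models.Model):
    job = models.ForeignKey("CrawlJob", on_delete=models.CASCADE)
    message = models.CharField(max_length=500)  # e.g. "parsed url http://foo.bar"
    created = models.DateTimeField(auto_now_add=True)

# views.py
from django.shortcuts import render
from .models import CrawlEvent

def job_events(request, pk):
    events = CrawlEvent.objects.filter(job_id=pk).order_by("created")
    return render(request, "job_events.html", {"events": events})

The crawler just calls CrawlEvent.objects.create(job=job, message="parsed url " + url) after each page it finishes.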

For dynamic progress display you will still need some kind of async either way. You could do it with long polling: have a JS script on the Django side poll Django via AJAX GET for new info to display every second or so. This technique is falling out of fashion lately (because it spams the server with expensive requests), but it still works and is quite simple.
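
On the server side, the endpoint that the AJAX GET polls can be a plain JSON view; the client remembers the id of the last event it has seen and asks only for newer ones. A sketch, following the invented names above:

# views.py - polling endpoint; the "after" query parameter is an assumption
from django.http import JsonResponse
from .models import CrawlEvent

def poll_events(request, pk):
    after = int(request.GET.get("after", 0))
    events = CrawlEvent.objects.filter(job_id=pk, id__gt=after).order_by("id")
    return JsonResponse({"events": [{"id": e.id, "message": e.message}
                                    for e in events]})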

I think the best option is to have Celery jobs put the crawled data and logs into a database, and have Django show the logs and data to the user and accept user input.


3 Comments

"put scraped URLs and required info to the same database django" - Is there a way to make tables in database session specific? Ex: I open two tabs with my script and want each tab put data to separate tables in database
@MaximAndreev There is a way. I don't know about separate tables, but you can have one table for scraped data and add a "session_id" column to it, so that you can query per session. I wouldn't make it session-specific, though, because if someone's session dies the data is no longer accessible, but still takes space in the database.
@MaximAndreev I suggest you add an account system (very simple in Django). Then assign ids to scraping jobs and tie the ids to users (so that a user can only access scraping jobs he started). Tie each scraped data row to a scraping job (add a one-to-many relationship between scraping jobs and scraped data rows).
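
To sketch that last suggestion, assuming the hypothetical CrawlJob model from the answer grows an owner field pointing at Django's user model:

# models.py - the owner field is an assumption, not from the question
from django.conf import settings
from django.db import models

class CrawlJob(models.Model):
    owner = models.ForeignKey(settings.AUTH_USER_MODEL, on_delete=models.CASCADE)
    start_url = models.URLField()

# views.py
from django.contrib.auth.decorators import login_required
from django.shortcuts import render

@login_required
def my_jobs(request):
    # a user only ever sees the jobs they started
    jobs = request.user.crawljob_set.all()
    return render(request, "my_jobs.html", {"jobs": jobs})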
