The way I see it, you have two high-level ways of tackling this problem:
- Have separate applications (a "server" and a "crawler") that have some shared datastore (database, Redis, etc). Each application would operate independently and the crawler could just update its status in the shared datastore. This approach could probably scale better: if you spin it up in something like Docker Swarm, you could scale the crawler instances as much as you can afford.
- Have a single application that spawns separate threads for the crawler and the server. Since they're in the same process, you can share information between them a bit more quickly (though if it's just the crawler status, that shouldn't matter much). The main advantage of this option is ease of setup -- you wouldn't need a shared datastore, and you wouldn't need to manage more than one service.
I would personally tend towards (1) here, because each of the pieces is simpler. What follows is a solution to (1), and a quick and dirty sketch of (2).
1. Separate processes with a shared datastore
I would use Docker Compose to handle spinning up all of the services. It adds an extra layer of complexity (as you need to have Docker installed), but it greatly simplifies managing the services.
The whole Docker Compose stack
Building on the example configuration here, I would make a ./docker-compose.yaml file that looks like this:
version: '3'
services:
  server:
    build: ./server
    ports:
      - "80:8000"   # the Flask app listens on 8000 inside the container
    links:
      - redis
    environment:
      - REDIS_HOST=cache
  crawler:
    build: ./crawler
    links:
      - redis
    environment:
      - REDIS_HOST=cache
    restart: unless-stopped   # restart the crawler whenever a run exits
  redis:
    image: "redis:alpine"
    container_name: cache
    expose:
      - 6379
I would organize the applications into separate directories, like ./server and ./crawler, but that's not the only way to do it. However you organize them, the build: paths in the configuration above need to match.
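For reference, the layout that the compose file above and the rest of this section assume looks like:

.
├── docker-compose.yaml
├── server
│   ├── app.py
│   ├── requirements.txt
│   └── Dockerfile
└── crawler
    ├── app.py
    ├── requirements.txt
    └── Dockerfile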
The server
I would write a simple server in ./server/app.py that does something like
import os

from flask import Flask
import redis

app = Flask(__name__)
r_conn = redis.Redis(
    host=os.environ.get('REDIS_HOST', 'localhost'),  # set by docker-compose
    port=6379
)

@app.route('/status')
def index():
    try:
        stat = r_conn.get('crawler_status')
    except redis.RedisError:
        return 'error getting status', 500
    if stat is None:
        return 'no status reported yet', 503
    return stat.decode('utf-8')

app.run(host='0.0.0.0', port=8000)
Along with it, a ./server/requirements.txt file with the dependencies:
Flask
redis
And finally a ./server/Dockerfile that tells Docker how to build your server
FROM alpine:latest

# install Python
RUN apk add --no-cache python3 && \
    python3 -m ensurepip && \
    rm -r /usr/lib/python*/ensurepip && \
    pip3 install --upgrade pip setuptools && \
    rm -r /root/.cache

# copy the app and make it your current directory
RUN mkdir -p /opt/server
COPY ./ /opt/server
WORKDIR /opt/server

# install deps and run the server
RUN pip3 install -qr requirements.txt
EXPOSE 8000
CMD ["python3", "app.py"]
Stop to check things are alright
At this point, if you open a command prompt or terminal in the directory with ./docker-compose.yaml, you should be able to run docker-compose build && docker-compose up to check that everything builds and runs happily. You will need to comment out the crawler section of the YAML file (since it hasn't been written yet), but you should still be able to spin up a server that talks to Redis. Once you're happy with it, uncomment the crawler section and proceed.
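If you prefer to script that sanity check rather than hitting the endpoint in a browser, a minimal sketch using only the standard library (and assuming the 80:8000 port mapping above, so the server is reachable on localhost port 80) could look like:

from urllib.request import urlopen
from urllib.error import HTTPError

# hit the status endpoint exposed by the server container
try:
    with urlopen('http://localhost/status') as resp:
        print(resp.status, resp.read().decode('utf-8'))
except HTTPError as err:
    # until the crawler exists and reports in, the server answers with a 5xx
    print(err.code, err.read().decode('utf-8'))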
The crawler process
Since Docker handles restarting the crawler process (that's what the restart: unless-stopped policy in the compose file is for), you can write this as a very simple Python script that does one run and exits. Something like ./crawler/app.py could look like this:
from time import sleep
import os
import sys

import redis

TIMEOUT = 3600  # seconds between runs

r_conn = redis.Redis(
    host=os.environ.get('REDIS_HOST', 'localhost'),
    port=6379
)

# ... update status and then do the work ...
r_conn.set('crawler_status', 'crawling')
sleep(60)  # stand-in for the actual crawl

# ... okay, it's done, update status ...
r_conn.set('crawler_status', 'sleeping')

# sleep for a while, then exit so Docker can restart the container
sleep(TIMEOUT)
sys.exit(0)
And then like before you need a ./crawler/requirements.txt file
redis
And a (very similar to the server's) ./crawler/Dockerfile
FROM alpine:latest

# install Python
RUN apk add --no-cache python3 && \
    python3 -m ensurepip && \
    rm -r /usr/lib/python*/ensurepip && \
    pip3 install --upgrade pip setuptools && \
    rm -r /root/.cache

# copy the app and make it your current directory
RUN mkdir -p /opt/crawler
COPY ./ /opt/crawler
WORKDIR /opt/crawler

# install deps and run the crawler
RUN pip3 install -qr requirements.txt
# NOTE that no port is exposed
CMD ["python3", "app.py"]
Wrapup
In 7 files, you have two separate applications plus a Redis instance, all managed by Docker. If you want to scale it, you can look into the --scale option for docker-compose up. This is not necessarily the simplest solution, but it handles some of the unpleasant bits of process management for you. For reference, I also made a Git repo for it here.
To run it as a headless service, just run docker-compose up -d.
From here, you can and should add nicer logging to the crawler (see the sketch below). You can of course use Django instead of Flask for the server (though I'm more familiar with Flask, and Django may pull in extra dependencies). And you can always make it more complicated.
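As a sketch of what nicer logging might look like (this only uses the standard logging module; the format and level are my own assumptions, and TIMEOUT is the constant already defined in ./crawler/app.py), the crawler script could gain something like:

import logging

# log to stdout so `docker-compose logs crawler` picks everything up
logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s %(levelname)s %(name)s: %(message)s',
)
log = logging.getLogger('crawler')

log.info('starting crawl')
# ... crawl ...
log.info('crawl finished, sleeping for %d seconds', TIMEOUT)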
2. Single process with threading
This solution does not require any Docker, and should only require a single Python file to manage. I won't write a full solution unless OP wants it, but the basic sketch would be something like
import threading
import time

from flask import Flask

STATUS = ''

# run the server on another thread
def run_server():
    app = Flask(__name__)

    @app.route('/status')
    def index():
        return STATUS

    app.run()

server_thread = threading.Thread(target=run_server, daemon=True)
server_thread.start()

# run the crawler on another thread
def crawler_loop():
    global STATUS  # without this, the assignments below create a local variable
    while True:
        STATUS = 'crawling'
        # ... crawl ...
        STATUS = 'sleeping'
        time.sleep(3600)

crawler_thread = threading.Thread(target=crawler_loop, daemon=True)
crawler_thread.start()

# main thread waits until the app is killed; Python threads cannot be
# killed directly, so the daemon threads simply die with the process
try:
    while True:
        time.sleep(1)
except KeyboardInterrupt:
    pass
This solution doesn't do anything to keep the services alive, does very little error handling, and the block at the end won't handle signals from the OS (like SIGTERM) very well. That said, it's a quick and dirty solution that should get you off the ground.
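If the OS signal handling matters (for instance because something else will stop this process with SIGTERM), a minimal sketch using the standard signal module, and assuming the daemon threads from the code above, could replace that final block:

import signal
import sys
import time

def handle_sigterm(signum, frame):
    # turn SIGTERM into a normal exit; the daemon threads die with the process
    sys.exit(0)

signal.signal(signal.SIGTERM, handle_sigterm)

# main thread waits until it is interrupted (Ctrl-C) or terminated (SIGTERM)
try:
    while True:
        time.sleep(1)
except KeyboardInterrupt:
    pass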