The way I see it, you have two high-level ways of tackling this problem:
- Have separate applications (a "server" and a "crawler") that have some shared datastore (database, Redis, etc). Each application would operate independently and the crawler could just update its status in the shared datastore. This approach could probably scale better: if you spin it up in something like Docker Swarm, you could scale the crawler instances as much as you can afford.
- Have a single application that spawns separate threads for the crawler and the server. Since they're in the same process, you can share information between them a bit more quickly (though if it's just the crawler status, that shouldn't matter much). The main advantage of this option is ease of setup -- you wouldn't need a shared datastore, and you wouldn't need to manage more than one service.
I would personally tend towards (1) here, because each of the pieces is simpler. What follows is a solution to (1), and a quick and dirty sketch of (2).
1. Separate processes with a shared datastore
I would use Docker Compose to handle spinning up all of the services. It adds an extra layer of complexity (as you need to have Docker installed), but it greatly simplifies managing the services.
The whole Docker Compose stack
Building on the example configuration here, I would make a ./docker-compose.yaml file that looks like this:
version: '3'
services:
  server:
    build: ./server
    ports:
      - "80:8000"   # the Flask app listens on 8000 inside the container
    links:
      - redis
    environment:
      - REDIS_HOST=cache
  crawler:
    build: ./crawler
    links:
      - redis
    environment:
      - REDIS_HOST=cache
    restart: unless-stopped   # restart the crawler whenever a run exits
  redis:
    image: "redis:alpine"
    container_name: cache
    expose:
      - 6379
I would organize the applications into separate directories, like ./server and ./crawler, but that's not the only way to do it. However you organize them, the build: paths in the configuration above need to match.
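For reference, the layout that the compose file above and the rest of this section assume looks like:

.
├── docker-compose.yaml
├── server
│   ├── app.py
│   ├── requirements.txt
│   └── Dockerfile
└── crawler
    ├── app.py
    ├── requirements.txt
    └── Dockerfile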
The server
I would write a simple server in ./server/app.py that does something like
import os

from flask import Flask
import redis

app = Flask(__name__)
r_conn = redis.Redis(
    host=os.environ.get('REDIS_HOST', 'localhost'),  # set by docker-compose
    port=6379
)

@app.route('/status')
def index():
    try:
        stat = r_conn.get('crawler_status')
    except redis.RedisError:
        return 'error getting status', 500
    if stat is None:
        return 'no status reported yet', 503
    return stat.decode('utf-8')

app.run(host='0.0.0.0', port=8000)
Along with it, a ./server/requirements.txt file with the dependencies:
Flask
redis
And finally a ./server/Dockerfile that tells Docker how to build your server
FROM alpine:latest

# install Python
RUN apk add --no-cache python3 && \
    python3 -m ensurepip && \
    rm -r /usr/lib/python*/ensurepip && \
    pip3 install --upgrade pip setuptools && \
    rm -r /root/.cache

# copy the app and make it your current directory
RUN mkdir -p /opt/server
COPY ./ /opt/server
WORKDIR /opt/server

# install deps and run the server
RUN pip3 install -qr requirements.txt
EXPOSE 8000
CMD ["python3", "app.py"]
Stop to check things are alright
At this point, if you open a command prompt or terminal in the directory with ./docker-compose.yaml, you should be able to run docker-compose build && docker-compose up to check that everything builds and runs happily. You will need to comment out the crawler section of the YAML file (since it hasn't been written yet), but you should still be able to spin up a server that talks to Redis. Once you're happy with it, uncomment the crawler section and proceed.
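If you prefer to script that sanity check rather than hitting the endpoint in a browser, a minimal sketch using only the standard library (and assuming the 80:8000 port mapping above, so the server is reachable on localhost port 80) could look like:

from urllib.request import urlopen
from urllib.error import HTTPError

# hit the status endpoint exposed by the server container
try:
    with urlopen('http://localhost/status') as resp:
        print(resp.status, resp.read().decode('utf-8'))
except HTTPError as err:
    # until the crawler exists and reports in, the server answers with a 5xx
    print(err.code, err.read().decode('utf-8'))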
The crawler process
Since Docker handles restarting the crawler process (that's what the restart: unless-stopped policy in the compose file is for), you can write this as a very simple Python script that does one run and exits. Something like ./crawler/app.py could look like this:
from time import sleep
import os
import sys

import redis

TIMEOUT = 3600  # seconds between runs

r_conn = redis.Redis(
    host=os.environ.get('REDIS_HOST', 'localhost'),
    port=6379
)

# ... update status and then do the work ...
r_conn.set('crawler_status', 'crawling')
sleep(60)  # stand-in for the actual crawl

# ... okay, it's done, update status ...
r_conn.set('crawler_status', 'sleeping')

# sleep for a while, then exit so Docker can restart the container
sleep(TIMEOUT)
sys.exit(0)
And then like before you need a ./crawler/requirements.txt file
redis
And a (very similar to the server's) ./crawler/Dockerfile
FROM alpine:latest

# install Python
RUN apk add --no-cache python3 && \
    python3 -m ensurepip && \
    rm -r /usr/lib/python*/ensurepip && \
    pip3 install --upgrade pip setuptools && \
    rm -r /root/.cache

# copy the app and make it your current directory
RUN mkdir -p /opt/crawler
COPY ./ /opt/crawler
WORKDIR /opt/crawler

# install deps and run the crawler
RUN pip3 install -qr requirements.txt
# NOTE that no port is exposed
CMD ["python3", "app.py"]
Wrapup
In 7 files, you have two separate applications plus a Redis instance, all managed by Docker. If you want to scale it, you can look into the --scale option for docker-compose up. This is not necessarily the simplest solution, but it handles some of the unpleasant bits of process management for you. For reference, I also made a Git repo for it here.
To run it as a headless service, just run docker-compose up -d.
From here, you can and should add nicer logging to the crawler (see the sketch below). You can of course use Django instead of Flask for the server (though I'm more familiar with Flask, and Django may pull in extra dependencies). And you can always make it more complicated.
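As a sketch of what nicer logging might look like (this only uses the standard logging module; the format and level are my own assumptions, and TIMEOUT is the constant already defined in ./crawler/app.py), the crawler script could gain something like:

import logging

# log to stdout so `docker-compose logs crawler` picks everything up
logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s %(levelname)s %(name)s: %(message)s',
)
log = logging.getLogger('crawler')

log.info('starting crawl')
# ... crawl ...
log.info('crawl finished, sleeping for %d seconds', TIMEOUT)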
2. Single process with threading
This solution does not require any Docker, and should only require a single Python file to manage. I won't write a full solution unless OP wants it, but the basic sketch would be something like
import threading
import time

from flask import Flask

STATUS = ''

# run the server on another thread
def run_server():
    app = Flask(__name__)

    @app.route('/status')
    def index():
        return STATUS

    app.run()

server_thread = threading.Thread(target=run_server, daemon=True)
server_thread.start()

# run the crawler on another thread
def crawler_loop():
    global STATUS  # without this, the assignments below create a local variable
    while True:
        STATUS = 'crawling'
        # ... crawl ...
        STATUS = 'sleeping'
        time.sleep(3600)

crawler_thread = threading.Thread(target=crawler_loop, daemon=True)
crawler_thread.start()

# main thread waits until the app is killed; Python threads cannot be
# killed directly, so the daemon threads simply die with the process
try:
    while True:
        time.sleep(1)
except KeyboardInterrupt:
    pass
This solution doesn't do anything to keep the services alive, does very little error handling, and the block at the end won't handle signals from the OS (like SIGTERM) very well. That said, it's a quick and dirty solution that should get you off the ground.
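If the OS signal handling matters (for instance because something else will stop this process with SIGTERM), a minimal sketch using the standard signal module, and assuming the daemon threads from the code above, could replace that final block:

import signal
import sys
import time

def handle_sigterm(signum, frame):
    # turn SIGTERM into a normal exit; the daemon threads die with the process
    sys.exit(0)

signal.signal(signal.SIGTERM, handle_sigterm)

# main thread waits until it is interrupted (Ctrl-C) or terminated (SIGTERM)
try:
    while True:
        time.sleep(1)
except KeyboardInterrupt:
    pass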