I am setting up an array of PostgreSQL 10 servers on several Android tablets running Ubuntu 18.04 via Linux Deploy. From a local server, I want to send a complete reference table to the remote servers. Then I will send shards of another table that I want to join to the reference table using a wide range of record linkage algorithms. Finally, the results from each server will be sent back to the local server. The various MPP packages I looked at will not work with my requirements, particularly given the wide range of joins I want to use.
The biggest challenge is bandwidth. All of the tablets connect via Wi-Fi, which is slow. They also have limited storage that cannot be expanded. So it would be very helpful to send compressed data directly into the remote servers, and directly back into the local server.
The closest I believe I have gotten is piping the data between psycopg2 COPY commands, following Aryeh Leib Taurog's answer here. But of course that stream is not compressed.
My code using that piping approach is below. Is it possible to compress the stream locally and have the remote machine use its CPU to decompress it? The PostgreSQL community is working on network compression, but it hasn't been released yet, and I don't want to use SSL compression, which I believe is the only compression currently available within the server.
import os
import threading

import psycopg2

# node, table_name, limit and offset are defined earlier in the script.
fromdb = psycopg2.connect("dbname=postgres user=postgres")
todb = psycopg2.connect(f"host={node['host_ip']} dbname=postgres user=postgres")

# The pipe connects the two COPY commands: the main thread writes the local
# COPY TO output into w_fd, while a background thread feeds r_fd to the
# remote COPY FROM.
r_fd, w_fd = os.pipe()

def copy_from():
    cur = todb.cursor()
    cur.copy_expert(f"COPY {table_name} FROM STDIN WITH CSV HEADER", os.fdopen(r_fd))
    cur.close()
    todb.commit()

to_thread = threading.Thread(target=copy_from)
to_thread.start()

copy_to_stmt = (
    f"COPY (SELECT * FROM {table_name} LIMIT {limit} OFFSET {offset}) "
    "TO STDOUT WITH CSV HEADER"
)
cur = fromdb.cursor()
write_f = os.fdopen(w_fd, 'w')
cur.copy_expert(copy_to_stmt, write_f)
write_f.close()

to_thread.join()
fromdb.close()
todb.close()
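One direction I have been considering, but have not tested: keep the two COPY commands, but route the stream through an SSH channel instead of a local pipe, compressing on the local side with Python's gzip module and letting the tablet's CPU decompress with gzip -dc before piping into psql. A rough sketch is below; it assumes key-based SSH login (the 'ubuntu' username is a placeholder), that gzip and psql are installed on the tablet, and that psql can connect locally as postgres without a password.

import gzip
import psycopg2
import paramiko

# Untested sketch: compress the COPY stream locally, decompress on the tablet.
fromdb = psycopg2.connect("dbname=postgres user=postgres")

ssh = paramiko.SSHClient()
ssh.set_missing_host_key_policy(paramiko.AutoAddPolicy())
ssh.connect(node['host_ip'], username='ubuntu')  # placeholder SSH user

# Remote shell pipeline: decompress on the tablet's CPU and feed psql's stdin.
remote_cmd = (
    'gzip -dc | psql -U postgres -d postgres '
    f'-c "COPY {table_name} FROM STDIN WITH CSV HEADER"'
)
chan_stdin, chan_stdout, chan_stderr = ssh.exec_command(remote_cmd)

# Wrap the SSH channel's stdin in a gzip writer; psycopg2 writes the CSV
# into it, so only compressed bytes cross the Wi-Fi link.
gz = gzip.GzipFile(fileobj=chan_stdin, mode='wb')
cur = fromdb.cursor()
cur.copy_expert(
    f"COPY (SELECT * FROM {table_name} LIMIT {limit} OFFSET {offset}) "
    "TO STDOUT WITH CSV HEADER",
    gz,
)
gz.close()                            # flush the gzip trailer
chan_stdin.channel.shutdown_write()   # send EOF so the remote pipeline finishes
print(chan_stderr.read().decode())    # surface any psql errors
ssh.close()
fromdb.close()

If that is sound, nothing would ever be staged on the tablet's storage, but I don't know whether I'm missing something about how psycopg2 or paramiko buffer the stream.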
Right now, my Python code creates zip files on the local machine, then uses paramiko to transfer them via SFTP and run a psql COPY FROM PROGRAM 'zcat filename.zip' command on the remote server. But this slows things down in several ways, including that the zip files have to be fully generated and transferred before they can be imported. The approach also occupies up to twice as much space on the remote machine's storage while the import is running.
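For the results coming back, the same idea in reverse seems possible: have the tablet compress its COPY output and decompress it on the local machine while feeding a local COPY FROM. Another untested sketch; results_table is a placeholder for whatever each tablet produces, and the SSH details are the same placeholders as above.

import gzip
import psycopg2
import paramiko

# Untested sketch: the tablet compresses its COPY output, the local CPU
# decompresses it while loading the local table.
localdb = psycopg2.connect("dbname=postgres user=postgres")

ssh = paramiko.SSHClient()
ssh.set_missing_host_key_policy(paramiko.AutoAddPolicy())
ssh.connect(node['host_ip'], username='ubuntu')  # placeholder SSH user

# results_table is a placeholder for each tablet's output table.
remote_cmd = (
    'psql -U postgres -d postgres '
    '-c "COPY results_table TO STDOUT WITH CSV HEADER" | gzip -c'
)
_, chan_stdout, chan_stderr = ssh.exec_command(remote_cmd)

# Wrap the SSH channel's stdout in a gzip reader; psycopg2 reads plain CSV
# from it and loads it into the local table.
gz = gzip.GzipFile(fileobj=chan_stdout, mode='rb')
cur = localdb.cursor()
cur.copy_expert("COPY results_table FROM STDIN WITH CSV HEADER", gz)
cur.close()
localdb.commit()
print(chan_stderr.read().decode())
ssh.close()
localdb.close()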
The script I am writing runs on the local server, but I'm not opposed to having it interact with Python code on the remote servers. The remote machines are also set up as dispy nodes, if that helps, but the jobs I need to run remotely are all specific to each machine, which I believe makes dispy less useful.
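If keeping both ends in Python is preferable, I imagine a small helper script stored on each tablet could do the decompression there, reading compressed bytes from its stdin and feeding them into the local psycopg2 COPY. The name remote_load.py and its argument handling are made up; it would be launched with paramiko's exec_command, with the local side writing gzip-compressed COPY output into the command's stdin exactly as in the first sketch above.

# remote_load.py -- hypothetical helper stored on each tablet.
# Reads gzip-compressed CSV from stdin, decompresses with the tablet's CPU,
# and streams it into the tablet's local PostgreSQL instance.
import gzip
import sys
import psycopg2

table_name = sys.argv[1]

conn = psycopg2.connect("dbname=postgres user=postgres")
cur = conn.cursor()

# sys.stdin.buffer is the raw byte stream arriving over SSH.
with gzip.GzipFile(fileobj=sys.stdin.buffer, mode='rb') as gz:
    cur.copy_expert(f"COPY {table_name} FROM STDIN WITH CSV HEADER", gz)

cur.close()
conn.commit()
conn.close()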
Notably, this setup doesn't play very well with network shares. I might be willing to use a local FTP server that the remote machines could access, though. The local machine is Windows, but I'm open to using an Ubuntu virtual machine.
Any ideas?