I am setting up an array of PostgreSQL 10 servers on several Android tablets running Ubuntu 18.04 via Linux Deploy. From a local server, I want to send a complete reference table to the remote servers. Then I will send shards of another table that I want to join to the reference table using a wide range of record linkage algorithms. Finally, the results from each server will be sent back to the local server. The various MPP packages I looked at will not work with my requirements, particularly given the wide range of joins I want to use.
The biggest challenge is bandwidth. All of the tablets connect via Wi-Fi, which is slow. They also have limited storage that cannot be expanded. So it would be very helpful to send compressed data directly into the remote servers, and directly back into the local server.
The closest I believe I have gotten is piping the data between psycopg2 COPY commands, following Aryeh Leib Taurog's answer here. But of course that stream is not compressed.
My code using that piping approach is below. Is it possible to compress the stream locally and have the remote machine use its CPU to decompress it? The PostgreSQL community is working on network compression, but it hasn't been released yet, and I don't want to use SSL compression, which I believe is the only compression currently available within the server.
import os
import threading

import psycopg2

# node, table_name, limit and offset are defined earlier in the script.
fromdb = psycopg2.connect("dbname=postgres user=postgres")
todb = psycopg2.connect(f"host={node['host_ip']} dbname=postgres user=postgres")

# The pipe connects the two COPY commands: the main thread writes the local
# COPY TO output into w_fd, while a background thread feeds r_fd to the
# remote COPY FROM.
r_fd, w_fd = os.pipe()

def copy_from():
    cur = todb.cursor()
    cur.copy_expert(f"COPY {table_name} FROM STDIN WITH CSV HEADER", os.fdopen(r_fd))
    cur.close()
    todb.commit()

to_thread = threading.Thread(target=copy_from)
to_thread.start()

copy_to_stmt = (
    f"COPY (SELECT * FROM {table_name} LIMIT {limit} OFFSET {offset}) "
    "TO STDOUT WITH CSV HEADER"
)
cur = fromdb.cursor()
write_f = os.fdopen(w_fd, 'w')
cur.copy_expert(copy_to_stmt, write_f)
write_f.close()

to_thread.join()
fromdb.close()
todb.close()
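One direction I have been considering, but have not tested: keep the two COPY commands, but route the stream through an SSH channel instead of a local pipe, compressing on the local side with Python's gzip module and letting the tablet's CPU decompress with gzip -dc before piping into psql. A rough sketch is below; it assumes key-based SSH login (the 'ubuntu' username is a placeholder), that gzip and psql are installed on the tablet, and that psql can connect locally as postgres without a password.

import gzip
import psycopg2
import paramiko

# Untested sketch: compress the COPY stream locally, decompress on the tablet.
fromdb = psycopg2.connect("dbname=postgres user=postgres")

ssh = paramiko.SSHClient()
ssh.set_missing_host_key_policy(paramiko.AutoAddPolicy())
ssh.connect(node['host_ip'], username='ubuntu')  # placeholder SSH user

# Remote shell pipeline: decompress on the tablet's CPU and feed psql's stdin.
remote_cmd = (
    'gzip -dc | psql -U postgres -d postgres '
    f'-c "COPY {table_name} FROM STDIN WITH CSV HEADER"'
)
chan_stdin, chan_stdout, chan_stderr = ssh.exec_command(remote_cmd)

# Wrap the SSH channel's stdin in a gzip writer; psycopg2 writes the CSV
# into it, so only compressed bytes cross the Wi-Fi link.
gz = gzip.GzipFile(fileobj=chan_stdin, mode='wb')
cur = fromdb.cursor()
cur.copy_expert(
    f"COPY (SELECT * FROM {table_name} LIMIT {limit} OFFSET {offset}) "
    "TO STDOUT WITH CSV HEADER",
    gz,
)
gz.close()                            # flush the gzip trailer
chan_stdin.channel.shutdown_write()   # send EOF so the remote pipeline finishes
print(chan_stderr.read().decode())    # surface any psql errors
ssh.close()
fromdb.close()

If that is sound, nothing would ever be staged on the tablet's storage, but I don't know whether I'm missing something about how psycopg2 or paramiko buffer the stream.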
Right now, my Python code creates zip files on the local machine, then uses paramiko to transfer them via SFTP and run a psql COPY FROM PROGRAM 'zcat filename.zip' command on the remote server. But this slows things down in several ways, including that the zip files have to be fully generated and transferred before they can be imported. The approach also occupies up to twice as much space on the remote machine's storage while the import is running.
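For the results coming back, the same idea in reverse seems possible: have the tablet compress its COPY output and decompress it on the local machine while feeding a local COPY FROM. Another untested sketch; results_table is a placeholder for whatever each tablet produces, and the SSH details are the same placeholders as above.

import gzip
import psycopg2
import paramiko

# Untested sketch: the tablet compresses its COPY output, the local CPU
# decompresses it while loading the local table.
localdb = psycopg2.connect("dbname=postgres user=postgres")

ssh = paramiko.SSHClient()
ssh.set_missing_host_key_policy(paramiko.AutoAddPolicy())
ssh.connect(node['host_ip'], username='ubuntu')  # placeholder SSH user

# results_table is a placeholder for each tablet's output table.
remote_cmd = (
    'psql -U postgres -d postgres '
    '-c "COPY results_table TO STDOUT WITH CSV HEADER" | gzip -c'
)
_, chan_stdout, chan_stderr = ssh.exec_command(remote_cmd)

# Wrap the SSH channel's stdout in a gzip reader; psycopg2 reads plain CSV
# from it and loads it into the local table.
gz = gzip.GzipFile(fileobj=chan_stdout, mode='rb')
cur = localdb.cursor()
cur.copy_expert("COPY results_table FROM STDIN WITH CSV HEADER", gz)
cur.close()
localdb.commit()
print(chan_stderr.read().decode())
ssh.close()
localdb.close()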
The script I am writing runs on the local server, but I'm not opposed to having it interact with Python code on the remote servers. The remote machines are also set up as dispy nodes, if that helps, but the jobs I need to run remotely are all specific to each machine, which I believe makes dispy less useful.
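If keeping both ends in Python is preferable, I imagine a small helper script stored on each tablet could do the decompression there, reading compressed bytes from its stdin and feeding them into the local psycopg2 COPY. The name remote_load.py and its argument handling are made up; it would be launched with paramiko's exec_command, with the local side writing gzip-compressed COPY output into the command's stdin exactly as in the first sketch above.

# remote_load.py -- hypothetical helper stored on each tablet.
# Reads gzip-compressed CSV from stdin, decompresses with the tablet's CPU,
# and streams it into the tablet's local PostgreSQL instance.
import gzip
import sys
import psycopg2

table_name = sys.argv[1]

conn = psycopg2.connect("dbname=postgres user=postgres")
cur = conn.cursor()

# sys.stdin.buffer is the raw byte stream arriving over SSH.
with gzip.GzipFile(fileobj=sys.stdin.buffer, mode='rb') as gz:
    cur.copy_expert(f"COPY {table_name} FROM STDIN WITH CSV HEADER", gz)

cur.close()
conn.commit()
conn.close()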
Notably, this setup doesn't play very well with network shares. I might be willing to use a local FTP server that the remote machines could access, though. The local machine is Windows, but I'm open to using an Ubuntu virtual machine.
Any ideas?