tl;dr I'm trying to figure out the most efficient way to SELECT a record, or INSERT it if it doesn't already exist, that works correctly with multiple concurrent connections.
The situation: I'm constructing a Postgres database (9.3.5, x64) containing a whole bunch of information associated with customers. The database has a "customers" table with an "id" column (SERIAL PRIMARY KEY) and a "system_id" column (VARCHAR(64)). The id column is used as a foreign key in other tables to link back to the customer. The "system_id" column must be unique, if it is not null.
CREATE TABLE customers (
    id SERIAL PRIMARY KEY,
    system_id VARCHAR(64),
    name VARCHAR(256)
);
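Note that nothing in the schema actually enforces that uniqueness yet. If I were to declare it, a plain UNIQUE constraint should be enough, since Postgres never treats two NULLs as equal for UNIQUE purposes, so "unique if not null" comes for free (the constraint name here is arbitrary):

ALTER TABLE customers
    ADD CONSTRAINT customers_system_id_key UNIQUE (system_id);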
An example of a table that references the id in the customers table:
CREATE TABLE tsrs (
    id SERIAL PRIMARY KEY,
    customer_id INTEGER NOT NULL REFERENCES customers(id),
    filename VARCHAR(256) NOT NULL,
    name VARCHAR(256),
    timestamp TIMESTAMP WITHOUT TIME ZONE
);
I have written a Python script that uses the multiprocessing module to push data into the database through multiple connections (one per process).
The first thing each process needs to do when pushing data into the database is to check whether a customer with a particular system_id is in the customers table. If it is, the associated customer.id is cached. If it's not already in the table, a new row is added and the resulting customer.id is cached. I have written an SQL function to do this for me:
CREATE OR REPLACE FUNCTION get_or_insert_customer(
    p_system_id customers.system_id%TYPE,
    p_name customers.name%TYPE
) RETURNS customers.id%TYPE AS $$
DECLARE
    v_id customers.id%TYPE;
BEGIN
    -- Serialise access to the table so that no other transaction can
    -- insert the same system_id between the SELECT and the INSERT below.
    LOCK TABLE customers IN EXCLUSIVE MODE;
    SELECT id INTO v_id FROM customers WHERE system_id = p_system_id;
    IF v_id IS NULL THEN
        INSERT INTO customers(system_id, name)
        VALUES (p_system_id, p_name)
        RETURNING id INTO v_id;
    END IF;
    RETURN v_id;
END;
$$ LANGUAGE plpgsql;
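For reference, each worker process just calls this function and caches the returned id; a call looks like this (the values are made up):

SELECT get_or_insert_customer('sys-00042', 'Example Customer Ltd');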
The problem: the table lock was the only way I found to prevent duplicate system_ids being added to the table by concurrent processes. This isn't really ideal, as it effectively serialises all the processing at this point and roughly doubles the time it takes to push a given amount of data into the db.
I wanted to ask whether there is a more efficient/elegant way to implement the "SELECT or INSERT" mechanism that wouldn't cause as much of a slowdown. I suspect there isn't, but figured it was worth asking, just in case.
Many thanks for reading this far. Any advice is much appreciated!
Comments:

customers.system_id looks like the natural key for the customers table. In which case there should be a UNIQUE constraint on it.

If every call serialises on the exclusive table lock (leaving the other N-1 transactions waiting for the lock), there can be no gain in using multiple connections.
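Following up on that suggestion: with a UNIQUE constraint on system_id in place (as in the ALTER TABLE above), the usual lock-free pattern on 9.x, modelled on the merge_db example in the "Trapping Errors" section of the PostgreSQL documentation, is to let the constraint police duplicates and retry on unique_violation. A sketch, not tested against this exact workload:

CREATE OR REPLACE FUNCTION get_or_insert_customer(
    p_system_id customers.system_id%TYPE,
    p_name customers.name%TYPE
) RETURNS customers.id%TYPE AS $$
DECLARE
    v_id customers.id%TYPE;
BEGIN
    LOOP
        -- Fast path: the customer already exists.
        SELECT id INTO v_id FROM customers WHERE system_id = p_system_id;
        IF FOUND THEN
            RETURN v_id;
        END IF;
        -- Not found: try to insert it ourselves.
        BEGIN
            INSERT INTO customers(system_id, name)
            VALUES (p_system_id, p_name)
            RETURNING id INTO v_id;
            RETURN v_id;
        EXCEPTION WHEN unique_violation THEN
            -- A concurrent transaction inserted the same system_id first;
            -- loop back and pick up its row with the SELECT.
            NULL;
        END;
    END LOOP;
END;
$$ LANGUAGE plpgsql;

With this, the worst case is a brief wait on the unique index entry while a competing transaction commits, rather than every call queueing behind a table lock.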