tl;dr: I'm looking for the most efficient way to SELECT a record, or INSERT it if it doesn't already exist, that works correctly with multiple concurrent connections.

The situation: I'm constructing a Postgres database (9.3.5, x64) containing a whole bunch of information associated with a customer. This database features a "customers" table that contains an "id" column (SERIAL PRIMARY KEY), and a "system_id" column (VARCHAR(64)). The id column is used as a foreign key in other tables to link to the customer. The "system_id" column must be unique, if it is not null.

CREATE TABLE customers (
    id SERIAL PRIMARY KEY,
    system_id VARCHAR(64),
    name VARCHAR(256));
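
The "unique if not null" requirement is not enforced by the definition above. A minimal sketch of how it could be declared, assuming no duplicate system_id values are already present; the constraint name is chosen here for illustration only. A Postgres UNIQUE constraint treats NULLs as distinct from one another, so any number of rows may still have a NULL system_id:

```sql
-- Assumption: the table currently holds no duplicate non-NULL system_id values,
-- otherwise this statement fails.
ALTER TABLE customers
    ADD CONSTRAINT customers_system_id_key UNIQUE (system_id);
```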

An example of a table that references the id in the customers table:

CREATE TABLE tsrs (
    id SERIAL PRIMARY KEY,
    customer_id INTEGER NOT NULL REFERENCES customers(id),
    filename VARCHAR(256) NOT NULL,
    name VARCHAR(256),
    timestamp TIMESTAMP WITHOUT TIME ZONE);

I have written a Python script that uses the multiprocessing module to push data into the database through multiple connections (one per process).

The first thing each process needs to do when pushing data into the database is check whether a customer with a particular system_id is already in the customers table. If it is, the associated customers.id is cached. If it's not in the table yet, a new row is added and the resulting customers.id is cached. I have written an SQL function to do this for me:

CREATE OR REPLACE FUNCTION get_or_insert_customer(
    p_system_id customers.system_id%TYPE,
    p_name customers.name%TYPE
) RETURNS customers.id%TYPE AS $$
DECLARE
    v_id customers.id%TYPE;
BEGIN
    -- Serialise all callers; the lock is held until the calling transaction ends.
    LOCK TABLE customers IN EXCLUSIVE MODE;
    -- Look for an existing customer first.
    SELECT id INTO v_id FROM customers WHERE system_id = p_system_id;
    IF v_id IS NULL THEN
        -- Not found: create the row and capture its generated id.
        INSERT INTO customers(system_id, name)
            VALUES (p_system_id, p_name)
            RETURNING id INTO v_id;
    END IF;
    RETURN v_id;
END;
$$ LANGUAGE plpgsql;

The problem: Locking the table was the only way I could prevent concurrent processes from adding duplicate system_ids. This isn't ideal, as it effectively serialises all the processing at this point and roughly doubles the time it takes to push a given amount of data into the database.

Is there a more efficient or elegant way to implement the "SELECT or INSERT" mechanism that wouldn't cause as much of a slowdown? I suspect there isn't, but figured it was worth asking, just in case.

Many thanks for reading this far. Any advice is much appreciated!

  • customers.system_id looks like the natural key for the customers table. In which case there should be a UNIQUE constraint on it. Commented Dec 17, 2014 at 11:28
  • Hi wildplasser, thanks for your response. I added the UNIQUE constraint (which sounds like it should be there regardless of my issue). Unfortunately this just means that Postgres raises an error if I try to add duplicates (current transaction is aborted, commands ignored until end of transaction block). I can't figure out how to recover gracefully from that. If I catch the exception in Python and try to perform another SELECT, it just raises another exception... Commented Dec 17, 2014 at 13:20
  • BTW: do your frontend threads ever commit? Commented Dec 17, 2014 at 15:28
  • Each process commits after processing one full tarfile (the tarfile contents are used to add 1 customer per tarfile, plus a bunch of other data to related tables). Maybe that's too often? Commented Dec 17, 2014 at 16:06
  • BTW: since the exclusive lock effectively serialises ("funnels") your work to one active transaction (N-1 transactions waiting for the lock), there can be no gain in using multiple connections. Commented Dec 17, 2014 at 17:30
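
The "current transaction is aborted" error mentioned in the comments can be contained with a savepoint, which lets the client undo only the failed statement instead of the whole transaction. A rough sketch of the statement sequence, assuming the UNIQUE constraint on system_id is in place; the savepoint name and the literal values are hypothetical:

```sql
BEGIN;

-- Mark a point to return to if the INSERT fails.
SAVEPOINT customer_insert;

INSERT INTO customers(system_id, name)
VALUES ('SYS-0001', 'Example customer');  -- hypothetical values

-- Success path: the client releases the savepoint and carries on.
RELEASE SAVEPOINT customer_insert;

-- Failure path (the client caught a unique violation): it would instead issue
--     ROLLBACK TO SAVEPOINT customer_insert;
-- which clears the aborted state, so the SELECT below works either way.
SELECT id FROM customers WHERE system_id = 'SYS-0001';

COMMIT;
```

The script as written follows the success path; the choice between RELEASE and ROLLBACK TO SAVEPOINT has to be made by the client code that catches the error.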

1 Answer

I managed to rewrite the function in plain SQL, changing the order of operations (this avoids the IF and the potential race condition between the check and the insert):

CREATE OR REPLACE FUNCTION get_or_insert_customer
        ( p_system_id customers.system_id%TYPE
        , p_name customers.name%TYPE
        )  RETURNS customers.id%TYPE AS $func$

    LOCK TABLE customers IN EXCLUSIVE MODE;

    -- Insert only if no row with this system_id exists yet.
    INSERT INTO customers(system_id, name)
    SELECT p_system_id, p_name
     WHERE NOT EXISTS (SELECT 1 FROM customers WHERE system_id = p_system_id);

    -- An SQL function returns the result of its last statement.
    SELECT id
      FROM customers
     WHERE system_id = p_system_id;
$func$ LANGUAGE sql;
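
For comparison, here is a sketch of a variant that drops the table lock altogether. It is not the function above, but a swapped-in technique: it relies on the UNIQUE constraint on customers.system_id suggested in the comments (an assumption, since the original table definition does not have one) and retries when a concurrent insert wins. The function name get_or_insert_customer_nolock is made up for illustration. A PL/pgSQL block with an EXCEPTION clause runs inside an implicit savepoint, so a unique violation caught there does not abort the caller's transaction:

```sql
-- Assumes: UNIQUE constraint on customers.system_id (not in the original DDL).
CREATE OR REPLACE FUNCTION get_or_insert_customer_nolock(
    p_system_id customers.system_id%TYPE,
    p_name customers.name%TYPE
) RETURNS customers.id%TYPE AS $func$
DECLARE
    v_id customers.id%TYPE;
BEGIN
    LOOP
        -- Common case: the customer already exists.
        SELECT id INTO v_id FROM customers WHERE system_id = p_system_id;
        IF FOUND THEN
            RETURN v_id;
        END IF;

        -- Not there yet: try to create it.
        BEGIN
            INSERT INTO customers(system_id, name)
                VALUES (p_system_id, p_name)
                RETURNING id INTO v_id;
            RETURN v_id;
        EXCEPTION WHEN unique_violation THEN
            -- Another session inserted the same system_id first;
            -- do nothing and loop back to the SELECT.
            NULL;
        END;
    END LOOP;
END;
$func$ LANGUAGE plpgsql;
```

It would be called the same way, e.g. SELECT get_or_insert_customer_nolock('SYS-0001', 'Example customer'); with hypothetical values. Under the default READ COMMITTED isolation level, sessions only wait on each other when they insert the same system_id at the same moment, rather than on every call. Whether this is actually faster for a given workload would need measuring.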

1 Comment

Even if it's not an answer - the improvement is appreciated. Thanks!
