
I'm currently trying to implement a function in Postgres that will calculate the cosine distance between 2 real arrays (real[]). The arrays/vectors are 2000 elements in length.

I'll use that function to do 1-to-n searches against 500,000 vectors (for now).

I'm trying to get the best performance without starting to consider throwing hardware/CPUs at the server.

I already have a working solution outside of Postgres: I cache the data in memory and can run the cosine search there in under 1 second (using .NET Core). But getting that production ready requires a lot of development time, so before going down that road I want to be sure I have exhausted all Postgres options (Postgres is already used in a lot of our microservices).

Below are the options I tested and my results:

1) plpgsql function (Postgres 10.3)

It was a big failure: it took 5 minutes to search 500,000 rows, even with parallelization (2 workers).
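
For reference, a plpgsql version along these lines (not the original code, just a minimal sketch of the approach) would look roughly like this:

CREATE OR REPLACE FUNCTION cosine_distance_plpgsql(a real[], b real[])
RETURNS real
LANGUAGE plpgsql IMMUTABLE STRICT PARALLEL SAFE AS
$$
DECLARE
    dot    double precision := 0;
    norm_a double precision := 0;
    norm_b double precision := 0;
BEGIN
    -- Assumes both arrays are one-dimensional and of equal length.
    FOR i IN 1 .. array_length(a, 1) LOOP
        dot    := dot    + a[i] * b[i];
        norm_a := norm_a + a[i] * a[i];
        norm_b := norm_b + b[i] * b[i];
    END LOOP;
    RETURN (dot / (sqrt(norm_a) * sqrt(norm_b)))::real;
END;
$$;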

2) C function with Postgres 10.3

Huge improvement: it took 10 seconds, including 2-worker parallelization.

Source

#include "postgres.h"
#include "fmgr.h"
#include "math.h"
#include <utils/array.h>

#ifdef PG_MODULE_MAGIC
PG_MODULE_MAGIC;
#endif

PG_FUNCTION_INFO_V1(cosine_distance_vector);

Datum cosine_distance_vector(PG_FUNCTION_ARGS)
{
    ArrayType *input1ValuesArray, *input2ValuesArray;
    float4 *input1Values, *input2Values;
    float4 result;
    float4 dot = 0.0;
    float4 denom_a = 0.0;
    float4 denom_b = 0.0;

    input1ValuesArray = PG_GETARG_ARRAYTYPE_P(0);
    input2ValuesArray = PG_GETARG_ARRAYTYPE_P(1);

    input1Values = (float4 *) ARR_DATA_PTR(input1ValuesArray);
    input2Values = (float4 *) ARR_DATA_PTR(input2ValuesArray);

    /* Iterate over the actual number of array elements
       (assumes 1-D, equal-length inputs with no NULLs). */
    int nelems = ARR_DIMS(input1ValuesArray)[0];

    for (int i = 0; i < nelems; ++i) {
        dot += input1Values[i] * input2Values[i];
        denom_a += input1Values[i] * input1Values[i];
        denom_b += input2Values[i] * input2Values[i];
    }

    result = dot / (sqrt(denom_a) * sqrt(denom_b));
    PG_RETURN_FLOAT4(result);
}
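
For completeness, the C function is registered in SQL roughly like this (a sketch; the library name 'cosine_distance' is an assumed example and has to match your compiled shared object, and PARALLEL SAFE is what allows the parallel workers mentioned above):

-- Registers the compiled C function under the name used in the queries below.
CREATE OR REPLACE FUNCTION cosine_distance_vector(real[], real[])
RETURNS real
AS 'cosine_distance', 'cosine_distance_vector'
LANGUAGE C STRICT PARALLEL SAFE;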

3) C function with Postgres 11.1

Another improvement: it took 9 seconds, including 2-worker parallelization.

My observations about the C function

As far as I can see, 90% of the time is spent in the PG_GETARG_ARRAYTYPE_P calls.

I tested this by comparing two implementations.

Implementation 1 took 9 seconds to complete the search:

Datum cosine_distance_vector(PG_FUNCTION_ARGS)
{
    ArrayType *input1ValuesArray, *input2ValuesArray;

    float4 result = 0.0;

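    /* Fetch (and therefore detoast) both array arguments, but do no computation. */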
    input1ValuesArray = PG_GETARG_ARRAYTYPE_P(0);
    input2ValuesArray = PG_GETARG_ARRAYTYPE_P(1);

    PG_RETURN_FLOAT4(result);
}

Implementation 2 took 1.5 seconds to execute:

Datum cosine_distance_vector(PG_FUNCTION_ARGS)
{
    float4 result = 0.0;

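    /* Return immediately without touching (or detoasting) the arguments at all. */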
    PG_RETURN_FLOAT4(result);
}

Is there a faster or more specific way to get the float4 arrays/pointers into the function than the generic PG_GETARG_ARRAYTYPE_P macro?

I also tried to implement this function using the version-0 calling convention in Postgres 9.6 (versions 10 and 11 no longer support it), since it seemed more efficient (lower level). But I was not able to implement the function successfully; even the samples in the Postgres docs were throwing segmentation faults.

I'll use a separate Dockerized Postgres install for this search functionality, so I'm open to any Postgres version and any sort of config trick.

Some additional info based on @LaurenzAlbe's comments.

This is the SQL query I use to find the best result:

SELECT 
    * 
FROM 
    documents t 
ORDER BY cosine_distance_vector(
    t.vector, 
    ARRAY [1,1,1,....]::real[]) DESC
LIMIT 1

The array is huge, so I didn't paste it fully.

Here is the EXPLAIN (ANALYZE, BUFFERS, VERBOSE) result:

[Screenshot of the EXPLAIN (ANALYZE, BUFFERS, VERBOSE) output, not reproduced here]

2019-01-23 Progress

I dove a little deeper into the Postgres source code and focused on why the cosine function was running slower when the PG_GETARG_ARRAYTYPE_P function was being called.

So I came across this function, which is called at some point from fmgr.c:

struct varlena *
pg_detoast_datum(struct varlena *datum)
{
    if (VARATT_IS_EXTENDED(datum))
        return heap_tuple_untoast_attr(datum);
    else
        return datum;
}

If your column's storage type is EXTENDED, that adds significant overhead.

The total row size of the vector table was more than 8192 bytes, which is the default Postgres block size. That's why the vector column's storage type was automatically set to EXTENDED. I changed it to PLAIN, which worked without any error, and the query then ran in 500 ms!
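
For reference, switching the storage mode looks roughly like this (a sketch using the documents table and vector column from the query above; it only affects how values are stored from then on):

-- Keep the vector inline and uncompressed instead of TOASTing it out of line.
ALTER TABLE documents ALTER COLUMN vector SET STORAGE PLAIN;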

But because my row size was now more than 8192 bytes (even though the storage type had been successfully changed to PLAIN), I couldn't add new rows to the table anymore; INSERTs started failing with a complaint about the row size being too big.

Next step: I compiled Postgres with a 16 KB block size (which took me some time). In the end I was able to create the table I wanted, with PLAIN vector storage and working INSERTs.
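
As a quick sanity check (not part of the original post), a build with a larger block size reports it via the read-only block_size setting:

SHOW block_size;   -- 16384 for a 16 KB build, 8192 for the default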

I tested the query in 100K-row increments. With the first 100K rows it took 50 ms to run; at 200K rows it took 4 seconds! I now think that because of the 16 KB block size I need to find the perfect balance of .conf file settings; I cannot optimize the function itself any further.

  • How can your "Implementation 2" take 1.5 seconds? How often did you call it? Commented Jan 21, 2019 at 12:57
  • @LaurenzAlbe - I faked the search using implementation 2, which means it was called 500,000 times. I measured it again a few hours ago on Postgres 11.1; it took 233 milliseconds with implementation 2. Commented Jan 21, 2019 at 13:06
  • I wonder what part of the execution time is due to the function call and what part is simply fetching the data. Can you try with ORDER BY length(t.vector::text) * random() and compare? It may be worth experimenting with jit = on on PostgreSQL v11. Commented Jan 21, 2019 at 15:10
  • BTW: for(unsigned int i = 0u; i < sizeof(input1Values); ++i) {...} :: sizeof(pointer) will probably be 4 or 8. (The bad news: the real thing will possibly be a bit slower ;-) Commented Jan 21, 2019 at 22:01
  • Final question: do you aim for speed or do you aim for correctness? Commented Jan 23, 2019 at 0:04

1 Answer


I finally came to a conclusion and was able to achieve sub-second speeds. But it was not possible without playing with memory settings.

These made it possible:

1. Block size => Having a single table row fit into a single block is a huge performance benefit. In my case that meant building Postgres from source with a 16 KB block size.

2. C function => The C function is at least 50x faster than a SQL function, for obvious reasons. But the combination of the C function with the correct block size makes an even bigger difference.

3. Correct postgresql.conf params => I used this tool to set up my initial params: https://pgtune.leopard.in.ua/#/ - it helped me a lot (not owned by me or anything; I found it in another SO question which I can't find anymore).

4. Special care for the work_mem parameter => Apparently work_mem is very important for some aggregates like MIN (which I'm using), so I increased it a lot compared to the suggestions from the website in step 3.

5. Special care for the effective_cache_size parameter => This is the real deal; it had the biggest impact. Again, for obvious reasons, having all the data fit into memory is the ultimate speed gain, so I chose this number as 500,000 rows * 16 KB (block size) => 8 GB plus some buffer (a rough settings sketch follows this list). After the 3rd time I executed the query I got 500 ms speeds.

6. Server RAM => Yes, the part where I had to cheat. The right amount of RAM for my case was naturally row count * block size, backed by a correctly set effective_cache_size parameter (step 5).
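
For illustration only, the relevant settings can be applied roughly like this; the values below are assumptions based on this particular data set (500,000 rows at a 16 KB block size), not general recommendations:

ALTER SYSTEM SET effective_cache_size = '9GB';  -- ~500,000 rows * 16 KB, plus some buffer
ALTER SYSTEM SET work_mem = '256MB';            -- increased well beyond the pgtune suggestion
ALTER SYSTEM SET shared_buffers = '4GB';        -- needs a server restart to take effect
SELECT pg_reload_conf();                        -- picks up the reloadable settings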

So, this was the combination of things I had to do to hit my target.
