
I am bulk-inserting some data into a postgres DB on RDS, and it's taking longer than I would like.

A migration to initialise the DB schema would look like this:

CREATE TABLE "addresses" ("address" TEXT NOT NULL);

CREATE UNIQUE INDEX "unique_address" ON "addresses"("address");

CREATE INDEX "autocomplete_index" ON "addresses" USING btree (lower(address) text_pattern_ops);

The data is coming from S3, where I have a collection of around 800 CSV files of roughly 256 MB each. For each CSV file, I use the aws_s3.table_import_from_s3 function to copy the data into a temporary table. This part is very fast. For reference, the import call looks roughly like this (the bucket, key, and region are placeholders):
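SELECT aws_s3.table_import_from_s3(
    'temp_addresses',   -- target table
    'address',          -- column list
    '(FORMAT csv)',     -- COPY options
    aws_commons.create_s3_uri('my-bucket', 'addresses/part-0001.csv', 'eu-west-1')
);

Then I insert from the temporary table into my addresses table like this: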

INSERT INTO addresses
SELECT * FROM temp_addresses
ON CONFLICT (address) DO NOTHING;

This INSERT takes about 90 minutes to import a single 256 MB CSV file.

From the Performance Insights page, it seems like the bottleneck is IO (this is what I infer from the bars here being dominated by "IO:DataFileRead").

[Chart: Average Active Sessions]

The DB instance is a db.t3.small with 2 vCPUs and 2 GB RAM, and 1024 GB of gp3 storage with 12,000 provisioned IOPS and 500 MiB/s throughput.

From what I can tell, I am far below the limit in terms of IO throughput:

[Chart: IO throughput]

...and I also seem to be well below the limit in terms of IOPS:

[Chart: IOPS]

...so I'm struggling to understand what the bottleneck is here. What am I missing?


Extra notes:

Here is a chart of the CPU usage during the load:

[Chart: CPU Usage]

And here's one of Freeable memory during the load:

[Chart: Freeable Memory]

  • What do the CPU metrics look like? Commented Sep 7, 2024 at 9:29
  • Your instance is too small for this kind of operation. Use COPY instead of INSERT, which is relatively fast, and disable non-essential indexes during bulk loading, re-enabling them afterward. There are other options too you can research, but most likely your throughput is not being utilized because of the small instance size. Commented Sep 7, 2024 at 9:31
  • @JohnRotenstein I have added a chart of the CPU usage. Commented Sep 7, 2024 at 9:40
  • @0xn0b174 I don't think COPY will work in my case, because I have a uniqueness constraint that I need to maintain; currently my INSERT uses ON CONFLICT DO NOTHING for this. It's complicated by the fact that in the production version of this system there is another process writing rows to the table, which may include duplicates of the ones from the bulk insert. When you say the instance is too small, which resource in particular do you think makes the difference? CPU, RAM, or something else? Commented Sep 7, 2024 at 9:48
  • Things that you didn't put into your question: (1) how you're inserting these rows (although from the comments it seems to be discrete/bulk INSERT statements), (2) how big the pipe is from whatever is doing the loading (are you running this on the AWS network, or from your office network?), (3) how long it takes to perform a load using a local database with the same existing data (this calls out whether the performance issue is caused by things like indexes on the tables that have to be rewritten). Commented Sep 7, 2024 at 13:08

1 Answer


Your bottleneck is reading index pages in order to update them for the new data. Each of these reads asks for only 8 kB, and the reads are (presumably) randomly scattered. That means you can't max out the throughput, since doing so requires reads that are either large or sequential. You also can't max out IOPS, because that requires multiple IO requests to be in flight at the same time, and a single PostgreSQL process does not use async IO/prefetching when doing index maintenance.
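If you want to confirm this is where the time goes, one sketch is to run a single file's load under EXPLAIN with buffer accounting (track_io_timing is a server setting your role may or may not be allowed to change; rds_superuser can on RDS):

SET track_io_timing = on;  -- adds I/O wait times to the plan output

BEGIN;
EXPLAIN (ANALYZE, BUFFERS)
INSERT INTO addresses
SELECT * FROM temp_addresses
ON CONFLICT (address) DO NOTHING;
ROLLBACK;  -- EXPLAIN ANALYZE really executes the INSERT, so roll it back

A large "shared read" count with high "I/O Timings" on the insert node points at exactly this kind of random index-page read.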

AWS vaguely describes gp3 latency as "single digit millisecond". If we take that to mean 1 millisecond, a single synchronous stream of reads can complete at most about 1,000 requests per second, so you would need at least 12 requests in flight at the same time to approach the limit of 12,000 IOPS.

You could increase your RAM so that more of the index can stay cached and not need to hit disk, but it doesn't seem plausible to add enough RAM to cover 200 GB of data. And almost all of the pages actually read from disk will get dirtied and eventually have to be written back. The kernel might do a good job of absorbing those writes internally and issuing them to the underlying storage asynchronously, but I wouldn't count on that working perfectly.
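You can see how much of the index maintenance actually misses cache with the standard statistics views (a sketch; counters are cumulative since the last stats reset):

SELECT indexrelname, idx_blks_read, idx_blks_hit
FROM pg_statio_user_indexes
WHERE relname = 'addresses';

-- idx_blks_read counts blocks that had to come from outside shared_buffers
-- (disk or OS cache); idx_blks_hit counts blocks found in shared_buffers.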

You could try to reduce the latency of IO requests against gp3, but I have no idea how you would go about doing that; there don't seem to be any provisioning knobs that adjust it.

You could try to load many files simultaneously by launching several workers, each processing a different file. That would be one way to get more async IO requests in flight. However, it may just move the bottleneck, as the multiple workers will now contend with each other for the same buffers and locks.

You could drop the indexes that are not needed (any non-unique ones) and re-add them only once all the loads are done. Or you could sort the data before/while loading, so that rows are inserted in index order. That way index maintenance is directed at the same index leaf pages over and over again, and it finds the hot pages already in cache most of the time, so it doesn't need to read them from disk. You might need to combine these, dropping some indexes and ordering by the non-dropped ones.
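For the index part, a minimal sketch based on your schema (the unique index has to stay, because ON CONFLICT (address) relies on it, so only the autocomplete index is dropped):

DROP INDEX autocomplete_index;

-- ... run all of the per-file loads here ...

CREATE INDEX autocomplete_index ON addresses USING btree (lower(address) text_pattern_ops);

Rebuilding the index afterwards is one sequential pass over the table, which is far friendlier to gp3 than 800 files' worth of random leaf-page updates.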

As for the ordered insert: based on your recent edit, that would look like:

INSERT INTO addresses
SELECT * FROM temp_addresses ORDER BY address
ON CONFLICT (address) DO NOTHING;

Depending on the pattern of capitalization within your data, ordering on "address" might provide a good-enough ordering for lower("address") that the benefit of cacheability carries over to that index too.
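If the capitalization turns out to be too mixed for that to hold, one variant (a sketch; the trade-off is that the unique index on "address" then sees slightly less ordered access) is to order by the indexed expression itself:

INSERT INTO addresses
SELECT * FROM temp_addresses ORDER BY lower(address)
ON CONFLICT (address) DO NOTHING;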
