Timeline for What is the bottleneck of my postgres bulk insert?

Current License: CC BY-SA 4.0

25 events

when	what		by	license	comment
Sep 7, 2024 at 19:52	vote	accept	dipea
Sep 7, 2024 at 15:08	answer	added	jjanes		timeline score: 2
Sep 7, 2024 at 14:36	comment	added	dipea		@0xn0b174 yes that was the Freeable memory during the load
Sep 7, 2024 at 14:35	history	edited	dipea	CC BY-SA 4.0	added 701 characters in body
Sep 7, 2024 at 13:14	comment	added	kdgregory		And lastly, you're almost certainly throwing away money on provisioned IOPS. First, because with that small memory allotment you're going to be frequently hitting the disk, and second, because instance types impose their own limits on IO.
Sep 7, 2024 at 13:13	comment	added	kdgregory		You also said that you can't use COPY, but you might find that COPY to a staging table and insert from that table will also help you.
Sep 7, 2024 at 13:12	comment	added	kdgregory		With only 2 GB of RAM available, you're almost certainly hitting the disk often, shuffling pages in and out of the buffer cache (out because you say there's other processes that are active, and writing pages). Especially if you have multiple indexes on the table. So increasing to a larger instance size is almost certainly the solution (but, as always, test this hypothesis).
Sep 7, 2024 at 13:10	comment	added	kdgregory		That said, I think you're on the right track with suspecting read waits. You have a single server-side thread doing this work. If it needs to read a block from disk, it needs to wait until that block is available. And regardless of your IOPS, individual reads are still in the millisecond range (IOPS primarily indicates how much concurrent activity can take place).
Sep 7, 2024 at 13:08	comment	added	kdgregory		Things that you didn't put into your question: (1) how you're inserting these rows (although from comments it seems to be discrete/bulk INSERT statements), (2) how big the pipe is from whatever is doing the loading (are you running this on the AWS network, or from your office network?), (3) how long it takes to perform a load using a local database with the same existing data (this calls out whether the performance issue is caused by things like indexes on the tables that have to be rewritten).
Sep 7, 2024 at 12:11	comment	added	Dunes		Indexes are always cached in RAM. This is unlike tables that are cached into RAM on demand, and then evicted if other queries need the RAM. Assuming a single database with a single empty table, with a single index on an 8-byte wide column and each csv record being about 1kB, then your final index size will be about 1.6GB. Your instance doesn't have anywhere near enough RAM to work with the schema and data you have.
Sep 7, 2024 at 11:55	comment	added	0xn0b174		is that memory chart during the bulk insert??
Sep 7, 2024 at 11:38	comment	added	dipea		@JohnRotenstein I have added a chart of Freeable memory. That metric seems to hardly have been affected by the load.
Sep 7, 2024 at 11:37	history	edited	dipea	CC BY-SA 4.0	added 113 characters in body
Sep 7, 2024 at 11:24	comment	added	John Rotenstein		Databases love RAM. Can you show that too? I originally suspected it was due to your usage of a T-family instance (that normally has CPU limits), but it seems that RDS databases using T-family instances have Unlimited mode activated, which gives full CPU at an additional charge (I think).
Sep 7, 2024 at 11:20	comment	added	dipea		I can imagine more RAM would improve it and I may have to change the instance type if I can't figure out a cheaper solution using COPY. I'm not sure how more vCPUs would help though; from what I can see from the first screenshot in my post (showing Database Load) it doesn't look like even one of my two vCPUs is being maxed out, and I can see that I was using hardly any CPU credits during the load.
Sep 7, 2024 at 10:45	comment	added	0xn0b174		`db.t3.small` only has 2 vCPUs and 2 GB of RAM, and once your CPU credits run out it will start to slow down. Do you think that will address your bulk insert? @dipea
Sep 7, 2024 at 10:05	comment	added	Bohemian♦		If it's OK to prevent other processes from accessing the table, try dropping all indexes except the unique one that detects the conflict, and `begin; lock table in exclusive mode` before inserting and `commit` after.
Sep 7, 2024 at 9:48	comment	added	dipea		@0xn0b174 I don't think COPY will work in my case, because I have a uniqueness constraint that I need to maintain. Currently my INSERT is using ON CONFLICT DO NOTHING for this. It's a bit complicated by the fact that in the production version of this system there is another process writing rows to the table, which may include duplicates of the ones from the bulk insert. When you say the instance is too small - which resource in particular do you think makes the difference? CPU, RAM or something else?
Sep 7, 2024 at 9:40	comment	added	dipea		@JohnRotenstein I have added a chart of the CPU usage
Sep 7, 2024 at 9:39	history	edited	dipea	CC BY-SA 4.0	added 143 characters in body
Sep 7, 2024 at 9:31	comment	added	0xn0b174		Your instance is too small for this kind of operation. Use `COPY` instead of `INSERT`, which is relatively fast, and disable non-essential indexes during bulk loading, then re-enable them afterward. There are probably other options you can research, but it's most likely that your throughput is not being utilized because of the small instance size.
Sep 7, 2024 at 9:29	comment	added	John Rotenstein		What do the CPU Metrics look like?
Sep 7, 2024 at 9:22	history	edited	John Rotenstein		edited tags
Sep 7, 2024 at 9:16	review	Close votes			Sep 11, 2024 at 0:01
Sep 7, 2024 at 8:34	history	asked	dipea	CC BY-SA 4.0
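
The staging-table idea from kdgregory's comment (bulk-load into an unconstrained table, then insert from it with `ON CONFLICT DO NOTHING`) can be sketched roughly as below. This is only an illustration under stated assumptions: Python's built-in `sqlite3` stands in for Postgres so the sketch runs anywhere, the `executemany` call takes the place of a Postgres `COPY` into the staging table, and the names `events`, `staging_events`, `event_key`, `payload` and `load_via_staging` are hypothetical, not taken from the question.

```python
import sqlite3


def load_via_staging(conn, rows):
    """Load rows into a staging table, then merge them into the target table.

    Rows whose key already exists in the target (or repeats within the batch)
    are skipped by ON CONFLICT DO NOTHING. Returns the number of new rows.
    """
    cur = conn.cursor()
    # Hypothetical staging table: no indexes or constraints, so loading it is cheap.
    cur.execute(
        "CREATE TEMP TABLE IF NOT EXISTS staging_events (event_key TEXT, payload TEXT)"
    )
    cur.execute("DELETE FROM staging_events")
    # Assumption: in Postgres this step would be a COPY into the staging table.
    cur.executemany("INSERT INTO staging_events VALUES (?, ?)", rows)

    before = cur.execute("SELECT COUNT(*) FROM events").fetchone()[0]
    # "WHERE 1" is needed by SQLite to parse INSERT ... SELECT ... ON CONFLICT;
    # Postgres does not require it.
    cur.execute(
        "INSERT INTO events (event_key, payload) "
        "SELECT event_key, payload FROM staging_events WHERE 1 "
        "ON CONFLICT (event_key) DO NOTHING"
    )
    after = cur.execute("SELECT COUNT(*) FROM events").fetchone()[0]
    conn.commit()
    return after - before


# Small demonstration with one pre-existing row and a batch containing duplicates.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (event_key TEXT PRIMARY KEY, payload TEXT)")
conn.execute("INSERT INTO events VALUES ('a', 'existing')")
batch = [("a", "dup"), ("b", "new"), ("b", "dup in batch"), ("c", "new")]
inserted = load_via_staging(conn, batch)
```

The uniqueness constraint stays on the target table the whole time, so a concurrent writer inserting duplicates is still handled by the conflict clause; whether this is actually faster on a small RDS instance would need to be measured.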
