Our Java application stores data for multiple tenants in a single DB instance on AWS RDS. This setup is replicated across regions in prod. There are multiple schemas, and each table within those schemas has a tenantId field. It is not necessarily the primary key, but all tables have an index on the tenantId field.
The requirement now is to select the data stored for one tenant across all tables and re-insert it with a different tenantId.
To clarify: I have 3 tables, tableA, tableB, tableC, each containing R rows for tenant_A. When a new tenant_B gets created in the system, each of those R rows in all three tables must be re-created with tenantId = tenant_B. This has to happen as a bootstrap DB sync step before tenant_B starts its own lifecycle in our system.
One option I was considering is to create a one-time pg_dump of all the tables, upload it to S3 in per-tenant buckets, and whenever a new tenant (like tenant_B) comes along, restore from the SQL of its parent tenant (tenant_A in our example), replacing the tenantId field with tenant_B in all those inserts.
But the concern is that the data is huge, and a daily pg_dump might put unnecessary load on the DB instance (if the overhead is significant at all). I was looking into whether we can stream incremental updates on top of the one-time dump and keep the dump up to date that way. However, streaming WAL has its own disadvantage: what if replication/streaming stops and the logs start filling up the DB storage? Is there a more elegant way of doing this?
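If we did go down the WAL/logical-replication route, one mitigation for the "logs fill up the storage" concern would be to monitor how much WAL each replication slot is retaining and alert (or drop the slot) before it threatens the instance. A minimal monitoring query, using only standard PostgreSQL 10+ catalog views and nothing specific to our schema, could look like this:

```sql
-- How much WAL is each replication slot holding back?
-- An inactive slot with a large retained_wal is exactly the failure mode above:
-- the consumer stopped, but the server keeps WAL around for it.
SELECT slot_name,
       active,
       pg_size_pretty(
           pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)
       ) AS retained_wal
FROM pg_replication_slots
ORDER BY pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn) DESC;
```

On RDS, the TransactionLogsDiskUsage CloudWatch metric should give a similar signal from outside the database.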
Another option is to fire pg_dump only when a child tenant (like tenant_B) gets created: we trigger the dump for its parent tenant_A, upload it to S3, and restore it. However, with more and more concurrent tenants getting created in quick succession, this might not be a feasible solution in the long run.
Why are we uploading the pg_dump to S3: because we might need to restore it across geos. The dump for tenant_A may come from us-west-2 while tenant_B gets created in eu-central-1. That is decided by our internal tenant-creation logic, and the region boundary is not guaranteed.
Any help here would be hugely appreciated. Thanks!
aws_s3.query_export_to_s3: can the aws_s3 extension be of any help here? I haven't worked with it before, so I am not aware of its performance characteristics. Can someone point me to the right resources around it? Thanks.
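For anyone evaluating the same thing, here is roughly what a per-tenant export/import with aws_s3 might look like. This is only a sketch: the bucket, key, schema, table, and staging-table names are made up, and the exact option strings and cross-region behaviour need to be verified against the RDS documentation for your engine version.

```sql
-- Source instance (us-west-2): export one table's rows for tenant_A to S3.
-- Bucket/key/schema/table names here are hypothetical.
SELECT *
FROM aws_s3.query_export_to_s3(
    'SELECT * FROM app_schema.table_a WHERE tenant_id = ''tenant_A''',
    aws_commons.create_s3_uri('my-tenant-dumps', 'tenant_A/table_a.csv', 'us-west-2'),
    options := 'format csv'
);

-- Target instance (eu-central-1): import into a staging table, then rewrite the
-- tenant id with a plain INSERT ... SELECT. The region argument is the bucket's
-- region; whether the import can read a bucket in another region needs checking.
SELECT aws_s3.table_import_from_s3(
    'app_schema.table_a_staging',   -- hypothetical staging table with the same columns
    '',                             -- empty column list = all columns
    '(format csv)',
    aws_commons.create_s3_uri('my-tenant-dumps', 'tenant_A/table_a.csv', 'us-west-2')
);
```

As far as I understand, query_export_to_s3 runs the query on the instance and streams the result to S3, so the cost should be comparable to running the same SELECT plus the upload, and it avoids round-tripping the data through the application.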
One suggestion was to generate insert ... select statements and then execute them in an order that works with your FK relationships. The problem is that I need to run the select ... part on the us-west-2 DB instance (that of tenant_A) and insert the result in eu-central-1 (where tenant_B resides). Each tenant, however, resides in one and only one region. That's why a global S3 bucket that is accessible across geo regions is what I am thinking of going ahead with. The follow-up advice was to use copy and intermediate files, and in other words to treat this as a development project instead of hoping to find a tool.
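To make the insert ... select suggestion concrete: once tenant_A's rows are visible to the target instance (same instance, or staged via S3/copy), the per-table clone itself is just the following, with hypothetical table and column names:

```sql
-- Clone tenant_A's rows as tenant_B for one table (table/column names are made up).
-- Repeat per table, ordered so that FK parent tables are populated before children.
INSERT INTO app_schema.table_a (tenant_id, col_1, col_2)
SELECT 'tenant_B', col_1, col_2
FROM app_schema.table_a
WHERE tenant_id = 'tenant_A';
```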