
I have to insert around 1.5 billion records into Cosmos DB using the Java SDK, broken into batches of 7k documents. I have written code that first generates the data in a loop and then puts it into two containers, Document and Document_attr, using a CosmosClient connection. But it is too slow, around 300 items per second; at that rate 1.5 billion records would take roughly 58 days. Can someone please suggest the best possible way to insert items faster? Throughput is set to autoscale with a max of 4,000 RU/s. Since I'm new to Cosmos DB, I'm unable to optimize the process.

      // Cosmos types below come from com.azure.cosmos and com.azure.cosmos.models.
      CosmosClient cosmosClient = new CosmosClientBuilder()
            .endpoint("<>").key("<>").buildClient();
      CosmosDatabase db = cosmosClient.getDatabase("<>");  // db was undefined in the original snippet
      CosmosContainer documentContainer = db.getContainer("DOCUMENT");
      CosmosContainer attributeContainer = db.getContainer("DOCUMENT_ATT");
    
      CosmosBulkExecutionOptions bulkExecutionOptions = new CosmosBulkExecutionOptions();
      for (int i = 0; i < 86400; i++) {
            List<Document> docInsert = new ArrayList<>();
            List<DocumentAttribute> docAttr = new ArrayList<>();

            for (int j = 0; j < 50; j++) {
                String docId = UUID.randomUUID().toString();
                Date expiryTime = DateUtils.addYears(date, 10);

                docInsert.add(new Document(docId, docId, Math.floor(Math.random() * this.maxFileSize) + 1, number));
                docInsert.add(new Document(docId, docId, Math.floor(Math.random() * this.maxFileSize) + 1, number + 1));
                number = number + 2 > totalNumber ? 1 : number + 2;
                List<User> users = new ArrayList<>();
                users.add(new User(idPrefix + extUserNumber, "EXT", list1));
                users.add(new User(idPrefix + (extUserNumber + 1), "EXT", list1));  // parenthesised so the counter, not the string, is incremented
                extUserNumber = extUserNumber + 2 > totalNumber ? 1 : extUserNumber + 2;
                for (int u = 0; u < 5; u++)
                    users.add(new User(idPrefix + (intNumber + u), "INT", list2));
                intNumber = intNumber + 5 > totalIntNumber ? 1 : intNumber + 5;
                docAttr.add(new DocumentAttribute(docId, users, date));  // missing comma fixed
            }

            List<CosmosItemOperation> documentOperations = docInsert.stream()
                    .map(doc -> CosmosBulkOperations.getCreateItemOperation(doc, new PartitionKey(doc.getId())))
                    .collect(Collectors.toList());
            // Note: the responses returned by executeBulkOperations are discarded
            // here, so failures and 429 throttles go unnoticed.
            documentContainer.executeBulkOperations(documentOperations, bulkExecutionOptions);
            List<CosmosItemOperation> attributeOperations = docAttr.stream()
                    .map(attr -> CosmosBulkOperations.getCreateItemOperation(attr, new PartitionKey(attr.getDocId())))
                    .collect(Collectors.toList());
            attributeContainer.executeBulkOperations(attributeOperations, bulkExecutionOptions);

            date = DateUtils.addSeconds(date, 1);
        }

I referred to this document https://learn.microsoft.com/en-us/azure/cosmos-db/nosql/tutorial-dotnet-bulk-import but it seems CosmosClientOptions is not available in the Java SDK.
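
The CosmosClientOptions.AllowBulkExecution setting in that tutorial is .NET-specific; in the Java v4 SDK the equivalent is to use the bulk APIs, ideally through the async client so the SDK can batch and pipeline operations instead of blocking per call. A rough sketch of what that setup might look like, reusing the "<>" placeholders and the Document type from the code above (untested against this workload):

      // Sketch only: needs reactor.core.publisher.Flux plus the com.azure.cosmos imports.
      CosmosAsyncClient asyncClient = new CosmosClientBuilder()
            .endpoint("<>")
            .key("<>")
            .directMode()                          // direct TCP transport instead of the gateway
            .contentResponseOnWriteEnabled(false)  // don't echo documents back on writes
            .buildAsyncClient();

      CosmosAsyncContainer asyncContainer = asyncClient.getDatabase("<>").getContainer("DOCUMENT");

      Flux<CosmosItemOperation> operations = Flux.fromIterable(docInsert)
            .map(doc -> CosmosBulkOperations.getCreateItemOperation(doc, new PartitionKey(doc.getId())));

      // blockLast() keeps the sketch synchronous; real code would subscribe and
      // requeue any operation whose response indicates failure or throttling.
      asyncContainer.executeBulkOperations(operations)
            .doOnNext(response -> {
                if (response.getResponse() == null || !response.getResponse().isSuccessStatusCode()) {
                    // log / retry response.getOperation()
                }
            })
            .blockLast();

Even with the async client, though, the inserts can only go as fast as the provisioned throughput allows, which is what the comments below focus on.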

  • Where is the bottleneck? Are you seeing throttling? 4,000 RU/s seems extremely meagre for this requirement. How much RU is the average insert taking? If it is ~13.33 RU per insert then 300 items per second is the best you can achieve without either reducing insert cost (e.g. by disabling indexing for the bulk load, or at least not using wildcard indexing; a sketch of that follows these comments) or increasing the RU/s. Commented Jan 15, 2024 at 8:51
  • How much storage do you envisage the 1.5 billion records will consume? You should also ensure that you have enough physical partitions up front rather than having partition splits during the operation. Commented Jan 15, 2024 at 8:55
  • You need to look at the "Metrics" for the account in Azure Monitor. The "Total Requests" metric filtered on Status = 'ClientThrottlingError' shows how much throttling you are seeing. The "Total Requests" metric with filters for Status = 'success' and OperationType = 'Create' gives you a count of successful inserts; then change the metric to "Total Request Units" but keep the filters and time period the same to see what those inserts cost, and divide one by the other to get an average RU cost per successful create. (A client-side way to measure the same thing is sketched after these comments.) Commented Jan 15, 2024 at 11:24
  • For existing containers the logic is different: each physical partition can support up to 10K RU/s, and you will just get partition splits that take you up to ceiling(RU/10,000) rather than ceiling(RU/6,000), so 240,000 RU/s will just give you 24 physical partitions, not 40. If you started off with a single physical partition then 24 physical partitions will mean that some partitions are dealing with twice the key range of others; you should be aiming for 32 or 64. This will take a while, though, as it happens through multiple rounds of binary splits rather than in one go. Commented Jan 16, 2024 at 7:13
  • How many physical partitions is your collection currently using? But anyway, to answer your question: yes, you can do it whilst the insert is still running. You won't actually get access to 240,000 RU/s until you have at least 24 physical partitions, though, as each physical partition can only support up to 10K RU/s, so the available RU for your inserts will increase as and when each partition split happens. Commented Jan 16, 2024 at 7:20
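
On the first comment's suggestion to cut per-insert cost by not indexing during the load: a hedged sketch of what that could look like with the Java SDK, assuming the documentContainer from the question (the policy would need to be restored after the load if the data is to be queried on anything other than point reads):

      // Sketch only: turn indexing off on DOCUMENT for the duration of the bulk load.
      CosmosContainerProperties properties = documentContainer.read().getProperties();

      IndexingPolicy indexingPolicy = new IndexingPolicy();
      indexingPolicy.setAutomatic(false);
      indexingPolicy.setIndexingMode(IndexingMode.NONE);  // no index maintenance on each write
      properties.setIndexingPolicy(indexingPolicy);

      documentContainer.replace(properties);  // creates should now cost noticeably fewer RU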
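
The average RU per create asked about in the comments can also be measured client-side rather than via Azure Monitor: each bulk response carries its request charge. A sketch against the question's sync bulk call, using the documentOperations list built above:

      // Sketch: average RU per successful create for one bulk call.
      double totalRu = 0;
      int succeeded = 0;
      for (CosmosBulkOperationResponse<Object> response :
              documentContainer.executeBulkOperations(documentOperations, bulkExecutionOptions)) {
          CosmosBulkItemResponse itemResponse = response.getResponse();
          if (itemResponse != null && itemResponse.isSuccessStatusCode()) {
              totalRu += itemResponse.getRequestCharge();
              succeeded++;
          }
      }
      System.out.printf("avg RU per create: %.2f%n", succeeded == 0 ? 0.0 : totalRu / succeeded);

If that average comes out around 13 RU, then the observed ~300 inserts per second is exactly the 4,000 RU/s ceiling, as the first comment calculates.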
