Bulk update 100 million+ Cosmos records

Adam Pliska 1 Reputation point
2021-06-01T17:13:20.897+00:00

I have a large collection of data stored in Cosmos, ~100 million rows, though the count fluctuates higher and lower over time. I would like to change the ‘schema’ of the data to support better reporting. In this case I need to add just a single field, which will be calculated from values already on each record.

First, I started by writing a small C# console application to load a record, update it, and save it back to Cosmos. I was using batch updates, and the throughput was poor.
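
The first pass looked roughly like this (MyRecord, the field names, and the derived value are simplified stand-ins for my real schema):

```csharp
// First pass (simplified): point-read each document, compute the new field, replace it.
using System;
using System.Collections.Generic;
using System.Threading.Tasks;
using Microsoft.Azure.Cosmos;

public class MyRecord
{
    public string id { get; set; }
    public string Region { get; set; }
    public DateTime CreatedUtc { get; set; }
    public string ReportKey { get; set; }   // the new, calculated field
}

public static class OneByOneUpdater
{
    public static async Task RunAsync(Container container, IEnumerable<(string Id, string Pk)> work)
    {
        foreach (var (id, pk) in work)
        {
            var record = (await container.ReadItemAsync<MyRecord>(id, new PartitionKey(pk))).Resource;

            // Derive the new field from values already on the record.
            record.ReportKey = $"{record.Region}-{record.CreatedUtc:yyyyMM}";

            await container.ReplaceItemAsync(record, id, new PartitionKey(pk));
        }
    }
}
```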

Second, I tried writing a Cosmos stored procedure that takes the partition key and the record id. I loaded the record inside the stored procedure, made the update, and then saved it. This increased my throughput, but not enough: with 200k RUs provisioned I was still looking at over a week of running.

Third, I modified the stored procedure to grab 500 records from the given partition, do the update, save them, and then return. My C# app calls this method over and over, passing a partition key for a range of records that have not yet been updated. I am still provisioned at 200k RUs. The metrics say I'm maxing that out, and I'm hitting perhaps 2,000 records per second, which is only about four stored-procedure calls per second. The cost of one stored-procedure call that updates a full 500 records is about 7k RUs. Right now it's projecting that it will finish tomorrow at about 4 AM.
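
Roughly, the driver side of that third approach looks like this (the stored procedure name and its parameters are placeholders for what I actually have):

```csharp
// Third pass (simplified): call the batch-update stored procedure per partition key until
// it reports nothing left to update.
using System.Collections.Generic;
using System.Threading.Tasks;
using Microsoft.Azure.Cosmos;

public static class BatchDriver
{
    public static async Task RunAsync(Container container, IEnumerable<string> partitionKeys)
    {
        foreach (var pk in partitionKeys)
        {
            int updated;
            do
            {
                // The stored procedure updates up to 500 not-yet-migrated documents in this
                // partition and returns how many it touched.
                var response = await container.Scripts.ExecuteStoredProcedureAsync<int>(
                    "bulkUpdateBatch", new PartitionKey(pk), new dynamic[] { 500 });

                updated = response.Resource;
            }
            while (updated > 0);
        }
    }
}
```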

My question is: is there a better way to do this sort of transformation? Should I create a new collection and COPY the data from one to the other? Is Cosmos perhaps not the right technology to use when mass updates are required?

Azure Cosmos DB

2 answers

  1. Navtej Singh Saini 4,226 Reputation points Microsoft Employee Moderator
    2021-06-09T00:11:42.617+00:00

    @Adam Pliska

    Our team has conveyed the following:

    The easiest approach would be to use the Spark connector. You can find samples for it in the Azure Cosmos DB Spark connector documentation.

    RU-wise, updating a document has a higher RU charge than inserting a new document, so from the perspective of minimizing RUs it would be “cheaper” to insert the documents into a new container. But we assume the effort to do the migration and cut over with no (or minimal) downtime would outweigh the RU savings, so I would recommend going with the Spark connector, updating the documents in place, and, if necessary, restricting the RUs that can be used for the updates (the samples referenced above show how) so that your normal workloads still work. The duration of the updates will be a function of the RUs you allow for the updates and the size of your Spark cluster (mostly the number of executor cores).
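
    For illustration only, the insert-into-a-new-container option mentioned above could look roughly like this from .NET, using the SDK's bulk mode (endpoint, key, names, and the document shape are placeholders, not a drop-in implementation):

    ```csharp
    // Rough sketch of copying documents into a new container with the v3 SDK's bulk mode.
    using System.Collections.Generic;
    using System.Threading.Tasks;
    using Microsoft.Azure.Cosmos;

    public static class CopyToNewContainer
    {
        public static async Task RunAsync<T>(string endpoint, string key, IEnumerable<T> docsWithNewField)
        {
            // AllowBulkExecution lets the SDK group individual CreateItemAsync calls into bulk requests.
            using var client = new CosmosClient(endpoint, key,
                new CosmosClientOptions { AllowBulkExecution = true });
            var target = client.GetContainer("mydb", "reporting");

            var tasks = new List<Task>();
            foreach (var doc in docsWithNewField)          // documents already carry the calculated field
                tasks.Add(target.CreateItemAsync(doc));    // partition key is read from the document itself
            await Task.WhenAll(tasks);
        }
    }
    ```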

    Regards
    Navtej S


  2. Navtej Singh Saini 4,226 Reputation points Microsoft Employee Moderator
    2021-06-11T01:30:54.3+00:00

    @Adam Pliska

    Our team has also replied:
    "Just for a sense of scale, updating 100M records will cost approximately 1B RUs, at least. To complete that in 24 hours (86,400 seconds), you’d need to be efficiently spending about 12k RU/s. If your documents are larger than 1 KB, the cost goes up correspondingly.

    I have a rule of thumb that a serial query (single thread, no async programming, no notable network latency, no app-side processing time) will consume a little over 300 RU/s, so to grow that to 12k RU/s you’d need about 40 commands in flight (or more, to compensate for latency, app-side processing, etc.). And because that exceeds 10k RU/s, it definitely needs to parallelize across multiple physical partitions.

    There are a couple of things I can think of that would have a ±30% efficiency impact, but at 200k RU/s it should complete in a little over an hour if the docs are all 1 KB or less (1B RUs / 200k RU/s ≈ 5,000 seconds). If you’re seeing it take a full day at 200k RU/s, something is wrong or missing by an order of magnitude. Updating 2,000 docs/sec should cost 20k-26k RU/s (again, unless they’re very large), not 200k."
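
    As a rough illustration of what “40+ commands in flight” looks like from .NET (the helper, the concurrency cap, and the MyRecord type from the question are assumptions, not a prescribed implementation):

    ```csharp
    // Rough illustration of keeping many point updates in flight at once.
    using System.Collections.Generic;
    using System.Threading;
    using System.Threading.Tasks;
    using Microsoft.Azure.Cosmos;

    public static class ParallelUpdater
    {
        public static async Task RunAsync(Container container,
            IEnumerable<(string Id, string Pk)> work, int maxInFlight = 64)
        {
            using var gate = new SemaphoreSlim(maxInFlight);
            var tasks = new List<Task>();

            foreach (var (id, pk) in work)
            {
                await gate.WaitAsync();              // bound how many operations are in flight
                tasks.Add(UpdateOneAsync(id, pk));
            }
            await Task.WhenAll(tasks);

            async Task UpdateOneAsync(string docId, string docPk)
            {
                try
                {
                    var pkValue = new PartitionKey(docPk);
                    var doc = (await container.ReadItemAsync<MyRecord>(docId, pkValue)).Resource;
                    doc.ReportKey = $"{doc.Region}-{doc.CreatedUtc:yyyyMM}";   // the calculated field
                    await container.ReplaceItemAsync(doc, docId, pkValue);
                }
                finally
                {
                    gate.Release();
                }
            }
        }
    }
    ```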

    Regards
    Navtej S

