Our team has conveyed the following:
The easiest approach would be to use the Spark connector. You can find:
- The quickstart here: azure-sdk-for-java/quick-start.md at master · Azure/azure-sdk-for-java (github.com)
- An end-to-end sample showing how to read/query data as well as update data: azure-sdk-for-java/01_Batch.ipynb at master · Azure/azure-sdk-for-java (github.com)
- And if you need to do it in streaming mode (because processing the full dataset at once would require too large a Spark cluster): azure-sdk-for-java/02_StructuredStreaming.ipynb at master · Azure/azure-sdk-for-java (github.com)
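For orientation, a minimal PySpark read along the lines of the quickstart and batch sample above might look like the sketch below. The endpoint, key, database, and container names are placeholders, and it assumes the azure-cosmos-spark connector package is already installed on the cluster:

```python
from pyspark.sql import SparkSession

# Placeholder account details -- replace with your own.
cosmos_config = {
    "spark.cosmos.accountEndpoint": "https://<your-account>.documents.azure.com:443/",
    "spark.cosmos.accountKey": "<your-account-key>",
    "spark.cosmos.database": "<your-database>",
    "spark.cosmos.container": "<your-container>",
}

spark = SparkSession.builder.appName("cosmos-bulk-update").getOrCreate()

# Read the documents that need updating, inferring the schema from sampled items.
df = (
    spark.read.format("cosmos.oltp")
    .options(**cosmos_config)
    .option("spark.cosmos.read.inferSchema.enabled", "true")
    .load()
)
df.show(5)
```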
Regarding RUs: updating an existing document carries a higher RU charge than inserting a new one, so purely from an RU-minimization perspective it would be "cheaper" to insert the documents into a new container. However, we assume the effort of that migration and of a cut-over with no or minimal downtime would outweigh the RU savings. I would therefore recommend using the Spark connector to update the documents in place and, if necessary, restricting the RUs the updates may consume (the sample above shows how) so that your normal workloads keep working. The duration of the updates will be a function of the RUs you allow for them and the size of your Spark cluster (mostly the number of executor cores).
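As a hedged sketch of that update-with-an-RU-cap approach, continuing from the read sketch above: the transformation, the throughput-control group name, the 30% threshold, and the "ThroughputControl" container are illustrative assumptions, not taken from your workload. The connector's global throughput control expects that control container to be created up front (with partition key /groupId and TTL enabled):

```python
from pyspark.sql import functions as F

# Apply whatever change the migration requires; stamping a new schema
# version field here is purely illustrative.
updated_df = df.withColumn("schemaVersion", F.lit(2))

# Write the changed documents back as upserts (ItemOverwrite) while capping
# the share of provisioned RUs this job may consume, so the normal workload
# keeps the rest of the container's throughput.
(
    updated_df.write.format("cosmos.oltp")
    .options(**cosmos_config)
    .option("spark.cosmos.write.strategy", "ItemOverwrite")
    .option("spark.cosmos.throughputControl.enabled", "true")
    .option("spark.cosmos.throughputControl.name", "bulkUpdateGroup")
    # Allow the update job at most ~30% of the container's provisioned RUs.
    .option("spark.cosmos.throughputControl.targetThroughputThreshold", "0.3")
    .option("spark.cosmos.throughputControl.globalControl.database", "<your-database>")
    .option("spark.cosmos.throughputControl.globalControl.container", "ThroughputControl")
    .mode("Append")
    .save()
)
```

Lowering the threshold slows the job down but protects the live workload; raising it (or adding executor cores) shortens the run, which is the RU/cluster-size trade-off described above.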
Regards
Navtej S