Recommended Staging Layer Location in Streaming (ADLS Branch vs. Hyperscale Staging)

Janice Chi 580 Reputation points
2025-08-06T08:03:43.8466667+00:00

In our current project, we are migrating ~80 TB of data from on-prem DB2 to Azure SQL Hyperscale. We are using Kafka + Databricks Structured Streaming to ingest CDC data into Azure. Our ingestion strategy is split into three phases: One-Time Historical Load, Catch-Up (batch CDC), and Real-Time Streaming.

In the Catch-Up phase, we use the following pattern:

  • CDC data is read from Kafka.
  • Flattened and stored in ADLS Raw layer (Delta).
  • Merged (using MERGE INTO) into ADLS Branch layer (which acts as an upserted view with ~80+ TB data).
  • Then loaded into Azure SQL Hyperscale main tables via JDBC.
  • Reconciliation is done after Hyperscale load.

This design works because we maintain the full upserted snapshot in the ADLS Branch layer — allowing efficient downstream loads and retries.


Now, in the Streaming phase, we have a question regarding where the staging/merge should happen:

We are considering two options:

Option A: Use ADLS Branch Layer as Staging (like Catch-Up)

  • This would mean recreating a full 80+ TB upserted state again in ADLS for streaming.
  • However, in real-time use cases, this may be overkill, since streaming only processes micro-batches.

Option B: Use Azure SQL Hyperscale Staging Table as Staging

  • Only micro-batch CDC data is pushed from ADLS Raw to Hyperscale Staging.
  • Merge is done inside Hyperscale itself (Staging → Main).
  • This avoids rebuilding the 80+ TB upserted state in Delta again.


Our Questions:

Is it technically valid to avoid building the full ADLS Branch (~80 TB) in streaming, and still perform correct and consistent merges?

If yes, how can we maintain merge logic (UPSERT/DELETE) without that historical upserted view?

If not, then does that mean the Branch layer is *mandatory* for streaming merge logic to be correct?

**Which staging location is technically preferred for Streaming:** ADLS Branch Layer (Delta format) or Hyperscale Staging Table (relational format)?

**Please explain the pros and cons of each staging location** specifically for **real-time streaming use cases**:

  • Merge reliability
  • Storage cost
  • Reconciliation complexity
  • Retry logic
  • Maintenance overhead
  • Performance impact on Hyperscale

**What is Microsoft's recommended final decision** for staging design in Streaming pipelines, considering long-term scalability, correctness, and maintainability?
                          

Context Constraints:

No intermediate storage is preferred in real-time (per client policy).

Streaming payloads are processed micro-batch-wise via Databricks.

Merge logic is currently implemented using Spark in batch, but in Streaming we are evaluating in-Hyperscale merge (Staging → Main).

Please help us finalize whether a persistent Branch layer in ADLS is needed in Streaming, or if Hyperscale staging is the better-suited approach — and why.

Azure Databricks

1 answer

  1. Venkat Reddy Navari 5,330 Reputation points Microsoft External Staff Moderator
    2025-08-06T09:33:30.71+00:00

    Hi Janice Chi, both staging options have their merits, and the decision largely depends on balancing performance, cost, and complexity in your specific streaming use case. Here’s a breakdown of the two options and their trade-offs to help you finalize the approach for your streaming pipeline:

    Option A: ADLS Branch Layer (Delta format) for Staging

    Pros:

    1. Consistency with Batch Approach: This method aligns with your existing Catch-Up pattern, which can reduce the complexity of the transition from batch to streaming.
    2. Isolation from Hyperscale Load: Staging in ADLS helps isolate the streaming layer from direct load and merge operations within Hyperscale, which may provide better performance for real-time ingestion.
    3. Retries & Resilience: With a full upserted snapshot in ADLS, you have a more resilient mechanism for retrying failed operations. Any missing or erroneous data can be recalculated and reloaded from the ADLS Branch layer.

    Cons:

    1. Storage Cost & Complexity: Rebuilding the full 80+ TB dataset for streaming may incur higher storage costs and significant I/O operations, especially with micro-batches. Additionally, maintaining such a large dataset can increase overhead.
    2. Performance Impact: With real-time streaming, each micro-batch merge runs against the full 80+ TB Delta table rather than a small change set. Repeatedly locating and rewriting the affected data may reduce throughput and increase latency.

    Option B: Azure SQL Hyperscale Staging Table for Staging

    Pros:

    1. Avoids Rebuilding Full ADLS Snapshot: This method allows you to avoid replicating the entire 80+ TB data into Delta format, making it more storage-efficient, especially for real-time use cases.
    2. In-Hyperscale Merge Operations: Merging directly within Azure SQL allows for more streamlined and optimized merges without needing an external staging layer, taking advantage of Hyperscale’s performance.
    3. Simplified Architecture: By keeping everything within Hyperscale, you reduce the complexity of managing separate data stores (ADLS Branch), reducing the overhead of managing the persistence layer for streaming.

    Cons:

    1. Merge Logic Complexity: Implementing the merge within Hyperscale might introduce challenges, particularly if the merge logic needs to handle out-of-order data or frequent reprocessing of historical records.
    2. Potential Impact on Hyperscale Performance: Since the merge logic would occur directly in Hyperscale, there could be a performance hit depending on the volume of CDC data. Heavy real-time ingestion can impact the performance of both the staging and main tables in Hyperscale.
    3. Reconciliation Complexity: You may need to implement more sophisticated reconciliation mechanisms to ensure consistency between your raw data in ADLS and the merged data in Hyperscale.
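The reconciliation concern in the last point can be made concrete. The sketch below is one possible approach, not something this thread prescribes: it compares a row count plus an order-independent checksum of the rows expected from ADLS Raw against the rows read back from Hyperscale. The function names and the row shape (tuples with the primary key first) are assumptions for illustration only.

```python
import hashlib
from typing import Iterable, Tuple

Row = Tuple  # hypothetical row shape: (primary_key, col1, col2, ...)


def row_digest(row: Row) -> int:
    """Hash one row into a 64-bit integer using a stable text encoding."""
    # repr() keeps None, "None", 1 and "1" distinguishable from each other.
    encoded = "\x1f".join(repr(value) for value in row).encode("utf-8")
    return int.from_bytes(hashlib.sha256(encoded).digest()[:8], "big")


def table_fingerprint(rows: Iterable[Row]) -> Tuple[int, int]:
    """Return (row_count, checksum); the checksum does not depend on row order."""
    count = 0
    checksum = 0
    for row in rows:
        count += 1
        # Modular addition (not XOR), so duplicated rows do not cancel out.
        checksum = (checksum + row_digest(row)) % (1 << 64)
    return count, checksum


def reconcile(expected_rows: Iterable[Row], actual_rows: Iterable[Row]) -> bool:
    """True when both sides have the same count and the same checksum."""
    return table_fingerprint(expected_rows) == table_fingerprint(actual_rows)


# Made-up sample data: same rows in a different order, then one changed value.
expected = [(1, "alice", 10), (2, "bob", 20)]
loaded_ok = [(2, "bob", 20), (1, "alice", 10)]
loaded_bad = [(1, "alice", 10), (2, "bob", 21)]

print(reconcile(expected, loaded_ok))
print(reconcile(expected, loaded_bad))
```

In a real pipeline the same fingerprint would more likely be computed per key range or partition on each side, so that a mismatch points to a small slice of data rather than the whole table.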

    Is the Branch Layer Mandatory for Streaming?

    It is not mandatory to rebuild the full ADLS Branch layer for real-time streaming, but it may still be beneficial if you need precise control over merge operations and retry mechanisms, especially in case of complex out-of-order data. However, if the merge logic in Hyperscale can handle the streaming load reliably, you can avoid the additional overhead of maintaining a full Branch layer in ADLS.
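One way to see why the full Branch snapshot is not strictly required: a CDC merge only needs the incoming change set and the current state of the target, keyed by primary key. The sketch below uses plain Python, with a dict standing in for the Hyperscale main table. It collapses a micro-batch to the latest event per key (which handles out-of-order arrival) and then applies it behind a sequence guard. The event fields (`key`, `seq`, `op`, `data`), the tombstone convention, and the assumption that the source supplies an increasing sequence number per key are all hypothetical and do not come from this thread.

```python
from typing import Dict, Iterable, List, NamedTuple, Optional


class CdcEvent(NamedTuple):
    key: int               # primary key of the source row
    seq: int               # assumed increasing change sequence per key
    op: str                # "I" insert, "U" update, "D" delete
    data: Optional[dict]   # row image; None for deletes


def collapse_batch(events: Iterable[CdcEvent]) -> List[CdcEvent]:
    """Keep only the highest-seq event per key, regardless of arrival order."""
    latest: Dict[int, CdcEvent] = {}
    for ev in events:
        if ev.key not in latest or ev.seq > latest[ev.key].seq:
            latest[ev.key] = ev
    return sorted(latest.values(), key=lambda ev: ev.key)


def apply_batch(target: Dict[int, dict], events: Iterable[CdcEvent]) -> None:
    """Upsert/delete into target; stale or replayed events are ignored."""
    for ev in collapse_batch(events):
        current = target.get(ev.key)
        if current is not None and current["seq"] >= ev.seq:
            continue  # target already reflects this change or a newer one
        if ev.op == "D":
            # Keep a tombstone so a late replay of an older event
            # cannot resurrect the deleted row.
            target[ev.key] = {"seq": ev.seq, "deleted": True}
        else:
            target[ev.key] = {**(ev.data or {}), "seq": ev.seq, "deleted": False}


# Made-up example: the dict plays the role of the main table.
main_table: Dict[int, dict] = {1: {"name": "alice", "seq": 5, "deleted": False}}
batch = [
    CdcEvent(1, 7, "U", {"name": "alicia"}),
    CdcEvent(1, 6, "U", {"name": "ali"}),   # arrives late, older than seq 7
    CdcEvent(2, 3, "I", {"name": "bob"}),
    CdcEvent(2, 4, "D", None),
]
apply_batch(main_table, batch)
apply_batch(main_table, batch)  # retrying the same batch changes nothing
print(main_table)
```

Because every write is guarded by the sequence number, replaying a failed micro-batch from ADLS Raw is idempotent in this sketch, which is the property the Branch snapshot would otherwise be relied on to provide.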

    Recommended Approach:

    For real-time streaming pipelines, using an Azure SQL Hyperscale staging table is generally the preferred method, because it simplifies the architecture and avoids the storage cost of maintaining a large ADLS Branch layer for streaming use cases. However, you should consider the following best practices:

    Merge Reliability: Ensure that your merge logic is optimized within Hyperscale. Leveraging Hyperscale's own capabilities can lead to better performance, but you may need to adjust the logic for handling real-time streaming data.
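As a rough illustration of what Staging → Main merge logic might look like, the helper below builds a T-SQL `MERGE` statement as a string. The table and column names (`dbo.orders`, `dbo.orders_staging`, `op`, `seq`) are hypothetical. The sketch assumes the staging table holds at most one row per key (T-SQL `MERGE` fails if several source rows match the same target row) and that both tables carry a change sequence column.

```python
from typing import Sequence


def build_merge_sql(target: str, staging: str, key_cols: Sequence[str],
                    data_cols: Sequence[str], seq_col: str = "seq",
                    op_col: str = "op") -> str:
    """Build a staging-to-main MERGE; identifiers must come from trusted config."""
    on = " AND ".join(f"t.[{c}] = s.[{c}]" for c in key_cols)
    all_cols = list(key_cols) + list(data_cols) + [seq_col]
    set_clause = ", ".join(
        f"t.[{c}] = s.[{c}]" for c in list(data_cols) + [seq_col]
    )
    insert_cols = ", ".join(f"[{c}]" for c in all_cols)
    insert_vals = ", ".join(f"s.[{c}]" for c in all_cols)
    # The seq comparison makes a retried or late batch a no-op for matched rows.
    newer = f"s.[{seq_col}] > t.[{seq_col}]"
    return (
        f"MERGE INTO {target} AS t\n"
        f"USING {staging} AS s\n"
        f"    ON {on}\n"
        f"WHEN MATCHED AND s.[{op_col}] = 'D' AND {newer} THEN\n"
        f"    DELETE\n"
        f"WHEN MATCHED AND s.[{op_col}] <> 'D' AND {newer} THEN\n"
        f"    UPDATE SET {set_clause}\n"
        f"WHEN NOT MATCHED BY TARGET AND s.[{op_col}] <> 'D' THEN\n"
        f"    INSERT ({insert_cols}) VALUES ({insert_vals});"
    )


# Hypothetical table and column names, for illustration only.
merge_sql = build_merge_sql(
    target="dbo.orders",
    staging="dbo.orders_staging",
    key_cols=["order_id"],
    data_cols=["status", "amount"],
)
print(merge_sql)
```

The generated statement would be run against Hyperscale after each micro-batch lands in the staging table, using whatever SQL client the pipeline already has. Note that a hard `DELETE` discards the row's sequence number, so if older events can arrive after a delete, a soft-delete flag is the safer variant.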

    Scalability & Maintainability: With Hyperscale as your staging area, you reduce the need for intermediate storage layers, which makes the pipeline simpler and easier to maintain long-term.

    Storage & Performance Trade-offs: Keep a close eye on the performance of Hyperscale as you increase your ingestion volume. Proper indexing, partitioning, and optimization of your merge operations will be critical to avoid performance bottlenecks.

    Finally: If your primary goal is to minimize storage costs and simplify the architecture for real-time streaming, using the Hyperscale staging table should work well, assuming you have optimized the merge logic for real-time processing. If there’s a need for more flexibility and error resilience, especially in cases of high failure rates, the ADLS Branch layer might still be worth considering.



