Azure OpenAI “On your data”: Stream only one JSON field to the UI while processing the rest in the background (without extra latency)

Kaja Sherief 115 Reputation points
2025-08-11T07:22:07+00:00

I'm running into performance issues while trying to optimize streaming responses from the Azure OpenAI "on your data" feature for selective JSON property extraction.
Current Setup:

  • Using Azure OpenAI "on your data" feature
  • Implementing streaming responses (chunk-by-chunk delivery)
  • UI correctly receives and displays streaming data
  • Azure "on your data" doesn't natively support JSON object responses, so I'm converting string responses to JSON programmatically

Current Challenge: With streaming enabled, I receive the same JSON format as in the non-streaming case, but my UI needs only one specific property from the response; the remaining properties have to be processed for background tasks.
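To make the shape concrete, here is roughly what the converted response looks like (field names such as `answer`, `citations`, and `metadata` are placeholders for illustration, not the actual service output):

```python
import json

# Illustrative shape after converting the streamed string to JSON;
# only "answer" is shown in the UI, the rest feeds background tasks.
example = (
    '{"answer": "The device supports up to four displays.",'
    ' "citations": ["specs.pdf"], "metadata": {"confidence": 0.92}}'
)
payload = json.loads(example)
ui_text = payload["answer"]  # the only field the UI needs immediately
```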

Performance Issue:

  • Direct streaming response: 3.5-4.5 seconds ✅
  • Post-processing approach (wait for complete response → clean up → re-chunk): 6-7 seconds ❌

What I've Tried (the flow is sketched after this list):

  1. Waiting for the complete streaming response to finish
  2. Processing/cleaning the full JSON response
  3. Re-sending the processed data chunk-by-chunk via API
  4. Various optimization attempts on the post-processing pipeline
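A minimal sketch of that post-processing flow (function and field names are illustrative):

```python
import json

def buffered_pipeline(stream):
    """Current slow path: buffer everything, parse, then re-chunk.

    `stream` yields text deltas from the completion API; the extra
    buffer -> parse -> re-chunk pass is what adds the ~2-3 seconds.
    """
    full_text = "".join(stream)          # 1. wait for the complete response
    payload = json.loads(full_text)      # 2. clean up / parse the JSON
    answer = payload["answer"]           # assumed field name
    for i in range(0, len(answer), 20):  # 3. re-send chunk-by-chunk
        yield answer[i : i + 20]
```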

Questions

  1. Is there a supported way in Azure OpenAI “On your data” streaming to separate fields (e.g., stream answer tokens as plain text while delivering citations/metadata as a distinct channel or event type) so that I can render the answer immediately and handle the rest in the background?
  2. If not, what’s the recommended pattern to keep latency near pure streaming (~3.5–4.5s) while still obtaining structured data for background tasks?
  3. What are the best practices for handling mixed streaming scenarios where UI needs immediate data but background processing requires the full response?
  4. Can I implement real-time JSON parsing during streaming to extract the required property without waiting for the complete response? (A sketch of what I have in mind follows this list.)
  5. Are there examples for incremental/streaming JSON parsing that work well with token-level deltas (so I can extract only the answer path without waiting for the full object)?
  6. Would Microsoft recommend a JSON Lines (JSONL) or framed output approach for streaming (e.g., model first emits {"type":"answer","delta":"..."} lines, then {"type":"meta",...}), or is there a better practice for On your data?
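For question 4, this is the kind of incremental extraction I have in mind: a hand-rolled sketch that pulls the `answer` string value out of the deltas as they arrive (the field name is assumed, escape handling is naive, and a production version would want a proper streaming JSON parser):

```python
def stream_answer_field(deltas, field="answer"):
    """Yield pieces of `field`'s string value while chunks arrive,
    without waiting for the complete JSON object.

    Minimal sketch, not a full parser: assumes the field appears once
    and holds a plain string; a backslash simply passes the next
    character through verbatim.
    """
    key = f'"{field}"'
    seen = ""       # text scanned so far while searching for the key
    state = "seek"  # seek -> value-start -> in-value -> done
    escaped = False
    for delta in deltas:
        out = []
        for ch in delta:
            if state == "seek":
                seen += ch
                if seen.endswith(key):
                    state = "value-start"
            elif state == "value-start":
                if ch == '"':    # opening quote of the value
                    state = "in-value"
                # otherwise skip the ':' and whitespace after the key
            elif state == "in-value":
                if escaped:
                    out.append(ch)
                    escaped = False
                elif ch == "\\":
                    escaped = True
                elif ch == '"':  # unescaped quote ends the value
                    state = "done"
                else:
                    out.append(ch)
        if out:
            yield "".join(out)   # emit what this delta contributed
        if state == "done":
            break
```

With simulated deltas, `list(stream_answer_field(['{"ans', 'wer": "Hel', 'lo", "citations": []}']))` returns `['Hel', 'lo']`, so the UI could render the answer while later chunks (citations, metadata) are still in flight.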

Any guidance on optimizing this workflow while maintaining the 3.5-4.5 second response time would be greatly appreciated.

Azure OpenAI Service
