Hey Muhammad Umer!
It looks like you’re digging into Azure OpenAI’s brand-new Responses API. Below is a quick primer on what it is, why you might choose it over plain Chat Completions or the (soon-to-retire) Assistants API, plus a ready-to-paste code snippet and a few gotchas that trip folks up on day one.
- What the heck is the Responses API?
- Stateful by design. Instead of re-sending the entire conversation on every turn, you post `input` once and Azure returns a `response_id`. Pass that ID as `previous_response_id` in the next call and the service stitches the history together for you. Conversation state lives for 30 days (or until you `delete()` it).
- One API ≈ two APIs. It merges the best bits of Chat Completions and the Assistants API—tool calling, JSON-formatted function output, vector-store search, streaming, even the new computer-use-preview model—under a single `/responses` route.
- Token savings. Because you’re not shipping the full transcript every time, you often cut prompt-side tokens by 70-90%.
- Preview-only (for now). You must hit the `api-version=preview` endpoint and use an SDK release dated May 2025 or later. Older clients will 404.
- Where and with which models can I use it?
| Region (Aug 2025) | Core text models* | Special models |
|---|---|---|
| australiaeast, eastus, eastus2, francecentral, japaneast, norwayeast, polandcentral, southindia, swedencentral, switzerlandnorth, uaenorth, uksouth, westus, westus3 | gpt-4o (2024-11-20 / 08-06 / 05-13), gpt-4.1, gpt-4o-mini, gpt-4.1-nano/mini, o1, o3, o3-mini, o4-mini | computer-use-preview, gpt-image-1 |
*Not every model is lit up in every region, so always check your portal first.
- Quick-start (Python, SDK ≥ 1.23.0)
```python
from openai import AzureOpenAI
from azure.identity import DefaultAzureCredential, get_bearer_token_provider

# 1️⃣ Auth
token_provider = get_bearer_token_provider(
    DefaultAzureCredential(), "https://cognitiveservices.azure.com/.default"
)

client = AzureOpenAI(
    base_url="https://<your-resource>.openai.azure.com/openai/v1/",
    azure_ad_token_provider=token_provider,
    api_version="preview",
)

# 2️⃣ First turn – create a response
first = client.responses.create(
    model="gpt-4o",  # your deployment name
    input="Summarise the plot of Dune in a single sentence."
)
print(first.output_text)  # ➜ “In a desert future…”

# 3️⃣ Follow-up turn – reference the previous response
second = client.responses.create(
    model="gpt-4o",
    previous_response_id=first.id,
    input=[{
        "role": "user",
        "content": "Now explain why spice is so valuable."
    }]
)
print(second.output_text)
```
Tip: If you store `first.id` somewhere (Redis, Cosmos DB…), you can pick up the conversation later without re-feeding the transcript.
- Known limitations (August 2025)
- Web Search tool isn’t wired up yet on Azure (still OpenAI-only).
- Image generation editing/streaming: multi-turn edits arrive “soon.”
- File upload: PDFs are okay, but `user_data` uploads and raw image-file inputs are still blocked.
- Retention: 30-day hard limit; export before that if you need long-tail analytics.
- SDK parity: Python, Java, and .NET already support `responses`. Node’s `AzureOpenAI` client gets it in v1.11; until then, call REST or pass `extra_headers={"ms-azure-ai-type": "openai"}`.
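If you do fall back to raw REST, the request shape looks roughly like this. This is a sketch only: the URL pattern is an assumption mirrored from the `base_url` in the quick-start above, and the `api-key` header is the non-Entra auth option, so verify both against your resource before relying on them.

```python
import json


def build_responses_request(resource: str, api_key: str, body: dict) -> tuple[str, dict, str]:
    """Assemble URL, headers, and JSON payload for a raw POST to /responses.

    Assumed URL shape, mirroring the quick-start's base_url.
    """
    url = f"https://{resource}.openai.azure.com/openai/v1/responses?api-version=preview"
    headers = {"api-key": api_key, "Content-Type": "application/json"}
    return url, headers, json.dumps(body)


url, headers, payload = build_responses_request(
    "my-resource", "<key>", {"model": "gpt-4o", "input": "Hello"}
)
# POST with any HTTP client, e.g.:
# requests.post(url, headers=headers, data=payload)
```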
- When should I not use Responses API?
| Use case | Better choice |
|---|---|
| One-shot Q&A or stateless function call | Chat Completions (less overhead) |
| Need first-class vector search & RAG orchestration today | Assistants API (until mid-2026) |
| Heavy multistep tool invocation with long-lived memory across many threads | Wait for Agents SDK + Responses GA (slated 1H 2026) |
- Troubleshooting checklist
- 404 or 409 “OperationNotAllowed” – ensure you’re in a Responses-enabled region and your resource has the preview feature registered.
- `previous_response_id` ignored – are you passing it and re-using the same model deployment? Cross-model chaining isn’t supported yet.
- Tool-call schema errors still show up as 400/500. Validate with `pydantic` locally before sending.
- Slow first token on `o4-mini`? Enable `stream=True` to pipe bytes while the full JSON builds.
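On the streaming tip, here is a sketch of the consumer side. With `stream=True` the Python SDK yields typed events; the accumulator below assumes the `response.output_text.delta` event type used by the current openai SDK and stitches the deltas back into the full reply:

```python
def accumulate_stream(events) -> str:
    """Concatenate text deltas from a Responses stream into the full reply.

    Assumes each delta event has type "response.output_text.delta"
    and a .delta attribute; other event types are ignored.
    """
    parts: list[str] = []
    for event in events:
        if getattr(event, "type", "") == "response.output_text.delta":
            parts.append(event.delta)
    return "".join(parts)


# Real usage (sketch):
# stream = client.responses.create(model="o4-mini", input="...", stream=True)
# reply = accumulate_stream(stream)
```

Printing each delta as it arrives (instead of collecting them) is what gives users the fast first token.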
Hope this clears up the mystery around the Responses API and gets you building quickly. Shout if you hit any other snags!
Best regards,
Jerald Felix