In this article, you learn how to write query requests for foundation models optimized for reasoning tasks, and send them to your Foundation Model API endpoint.
Mosaic AI Foundation Model API provides a unified API to interact with all foundation models, including reasoning models. Reasoning gives foundation models enhanced capabilities to tackle complex tasks. Some models also provide transparency by revealing their step-by-step thought process before delivering a final answer.
## Types of reasoning models
There are two types of reasoning models: reasoning-only and hybrid. The following table describes how each type controls reasoning:
| Reasoning model type | Details | Model examples | Parameters |
|---|---|---|---|
| Hybrid reasoning | Supports both fast, instant replies and deeper reasoning when needed. | Claude models like `databricks-claude-3-7-sonnet` and `databricks-claude-sonnet-4`. | Include the `thinking` parameter with `"type": "enabled"` and a `budget_tokens` value that caps internal reasoning. |
| Reasoning only | These models always use internal reasoning in their responses. | GPT OSS models like `databricks-gpt-oss-120b` and `databricks-gpt-oss-20b`. | Use the `reasoning_effort` parameter in your request: `"low"`, `"medium"` (default), or `"high"`. |
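The two styles differ only in these reasoning parameters. As a minimal sketch of the request bodies, using the same values as the full examples later in this article:

```python
# Hybrid reasoning (Claude): enable thinking and cap the reasoning budget.
hybrid_payload = {
    "messages": [{"role": "user", "content": "Why is the sky blue?"}],
    "max_tokens": 20480,
    "thinking": {"type": "enabled", "budget_tokens": 10240},
}

# Reasoning only (GPT OSS): tune how much internal reasoning the model does.
reasoning_only_payload = {
    "messages": [{"role": "user", "content": "Why is the sky blue?"}],
    "max_tokens": 4096,
    "reasoning_effort": "high",  # "low", "medium" (default), or "high"
}
```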
## Query examples
All reasoning models are accessed through the chat completions endpoint.
### Claude model example

```python
import os

from openai import OpenAI

client = OpenAI(
    api_key=os.environ.get('YOUR_DATABRICKS_TOKEN'),
    base_url=os.environ.get('YOUR_DATABRICKS_BASE_URL'),
)

response = client.chat.completions.create(
    model="databricks-claude-3-7-sonnet",
    messages=[{"role": "user", "content": "Why is the sky blue?"}],
    max_tokens=20480,
    extra_body={
        # Enable hybrid reasoning and cap the tokens spent on internal reasoning.
        "thinking": {
            "type": "enabled",
            "budget_tokens": 10240,
        }
    },
)

msg = response.choices[0].message
# The first content block holds the reasoning summary; the second holds the answer.
reasoning = msg.content[0]["summary"][0]["text"]
answer = msg.content[1]["text"]

print("Reasoning:", reasoning)
print("Answer:", answer)
```
### GPT OSS model example

The `reasoning_effort` parameter accepts `"low"`, `"medium"` (default), or `"high"`. Higher reasoning effort may result in more thoughtful and accurate responses, but may increase latency and token usage.
```bash
curl -X POST "https://<workspace_host>/serving-endpoints/databricks-gpt-oss-120b/invocations" \
  -H "Authorization: Bearer $DATABRICKS_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [
      {
        "role": "user",
        "content": "Why is the sky blue?"
      }
    ],
    "max_tokens": 4096,
    "reasoning_effort": "high"
  }'
```
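You can also send the same request with the OpenAI client configured in the Claude example above. This is a minimal sketch; it passes `reasoning_effort` through `extra_body` so it works regardless of whether your installed SDK version accepts it as a named argument:

```python
response = client.chat.completions.create(
    model="databricks-gpt-oss-120b",
    messages=[{"role": "user", "content": "Why is the sky blue?"}],
    max_tokens=4096,
    # Passed through extra_body for compatibility with older openai SDK versions.
    extra_body={"reasoning_effort": "high"},
)
```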
The API response includes both reasoning and text content blocks:
```python
ChatCompletionMessage(
    role="assistant",
    content=[
        {
            "type": "reasoning",
            "summary": [
                {
                    "type": "summary_text",
                    "text": "The question is asking about the scientific explanation for why the sky appears blue... ",
                    "signature": "EqoBCkgIARABGAIiQAhCWRmlaLuPiHaF357JzGmloqLqkeBm3cHG9NFTxKMyC/9bBdBInUsE3IZk6RxWge...",
                }
            ],
        },
        {
            "type": "text",
            "text": (
                "# Why the Sky Is Blue\n\n"
                "The sky appears blue because of a phenomenon called Rayleigh scattering. Here's how it works..."
            ),
        },
    ],
    refusal=None,
    annotations=None,
    audio=None,
    function_call=None,
    tool_calls=None,
)
```
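Rather than indexing content blocks by position, you can dispatch on each block's `type`, which also handles responses where a reasoning block is absent. A minimal sketch, assuming the block layout shown above:

```python
def split_reasoning_and_answer(message):
    """Separate reasoning summaries from final answer text in a message."""
    reasoning_parts, answer_parts = [], []
    for block in message.content:
        if block["type"] == "reasoning":
            # Each reasoning block carries a list of summary segments.
            reasoning_parts.extend(s["text"] for s in block.get("summary", []))
        elif block["type"] == "text":
            answer_parts.append(block["text"])
    return "\n".join(reasoning_parts), "\n".join(answer_parts)


reasoning, answer = split_reasoning_and_answer(response.choices[0].message)
```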
## Manage reasoning across multiple turns
This section is specific to the `databricks-claude-3-7-sonnet` model.
In multi-turn conversations, only the reasoning blocks associated with the last assistant turn or tool-use session are visible to the model and counted as input tokens.
If you don't want to pass reasoning tokens back to the model (for example, if you don't need it to reason over its prior steps), you can omit the reasoning block entirely. For example:
```python
# text_content is the final answer text from the previous response,
# for example: text_content = response.choices[0].message.content[1]["text"]
response = client.chat.completions.create(
    model="databricks-claude-3-7-sonnet",
    messages=[
        {"role": "user", "content": "Why is the sky blue?"},
        # Pass back only the text answer; the reasoning block is omitted.
        {"role": "assistant", "content": text_content},
        {"role": "user", "content": "Can you explain in a way that a 5-year-old child can understand?"},
    ],
    max_tokens=20480,
    extra_body={
        "thinking": {
            "type": "enabled",
            "budget_tokens": 10240,
        }
    },
)

answer = response.choices[0].message.content[1]["text"]
print("Answer:", answer)
```
However, if you do need the model to reason over its previous reasoning process (for instance, if you're building experiences that surface its intermediate reasoning), you must include the full, unmodified assistant message, including the reasoning block from the previous turn. Here's how to continue a thread with the full assistant message:
```python
assistant_message = response.choices[0].message

response = client.chat.completions.create(
    model="databricks-claude-3-7-sonnet",
    messages=[
        {"role": "user", "content": "Why is the sky blue?"},
        {"role": "assistant", "content": text_content},
        {"role": "user", "content": "Can you explain in a way that a 5-year-old child can understand?"},
        # Pass the full assistant message, including its reasoning block.
        assistant_message,
        {"role": "user", "content": "Can you simplify the previous answer?"},
    ],
    max_tokens=20480,
    extra_body={
        "thinking": {
            "type": "enabled",
            "budget_tokens": 10240,
        }
    },
)

answer = response.choices[0].message.content[1]["text"]
print("Answer:", answer)
```
## How does a reasoning model work?
Reasoning models introduce special reasoning tokens in addition to the standard input and output tokens. These tokens let the model "think" through the prompt, breaking it down and considering different ways to respond. After this internal reasoning process, the model generates its final answer as visible output tokens. Some models, like `databricks-claude-3-7-sonnet`, display these reasoning tokens to users, while others, such as the OpenAI o series, discard them and do not expose them in the final output.
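To see how many tokens a request consumed, you can inspect the `usage` object on the response. This is a minimal sketch assuming the OpenAI-compatible usage format; the nested `reasoning_tokens` field is an assumption and may not be reported by every model:

```python
usage = response.usage
print("Prompt tokens:", usage.prompt_tokens)
print("Completion tokens:", usage.completion_tokens)

# reasoning_tokens is an assumption: some OpenAI-compatible servers report it
# under completion_tokens_details; fall back gracefully if it's absent.
details = getattr(usage, "completion_tokens_details", None)
if details is not None:
    print("Reasoning tokens:", getattr(details, "reasoning_tokens", "n/a"))
```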