Query reasoning models

In this article, you learn how to write query requests for foundation models optimized for reasoning tasks, and send them to your Foundation Model API endpoint.

Mosaic AI Foundation Model API provides a unified API to interact with all Foundation Models, including reasoning models. Reasoning gives foundation models enhanced capabilities to tackle complex tasks. Some models also provide transparency by revealing their step-by-step thought process before delivering a final answer.

Types of reasoning models

There are two types of reasoning models: hybrid and reasoning-only. Each type uses different parameters to control reasoning:

Hybrid reasoning
  • Details: Supports both fast, instant replies and deeper reasoning when needed.
  • Model examples: Claude models, such as databricks-claude-3-7-sonnet and databricks-claude-sonnet-4.
  • Parameters: Include the following parameters to use hybrid reasoning:
      • thinking: an object that turns on extended thinking, for example {"type": "enabled", "budget_tokens": 10240}.
      • budget_tokens: controls how many tokens the model can use for internal thought. Higher budgets can improve quality for complex tasks, though the model might not use the entire budget, especially above 32K. budget_tokens must be less than max_tokens.

Reasoning-only
  • Details: These models always use internal reasoning in their responses.
  • Model examples: GPT OSS models, such as databricks-gpt-oss-120b and databricks-gpt-oss-20b.
  • Parameters: Use the following parameter in your request:
      • reasoning_effort: accepts "low", "medium" (default), or "high". Higher reasoning effort may result in more thoughtful and accurate responses but may increase latency and token usage. This parameter is only accepted by a limited set of models, including databricks-gpt-oss-120b and databricks-gpt-oss-20b.

Query examples

All reasoning models are accessed through the chat completions endpoint.

Claude model example

import os

from openai import OpenAI

client = OpenAI(
    # Databricks personal access token and base URL,
    # for example https://<workspace_host>/serving-endpoints
    api_key=os.environ.get("DATABRICKS_TOKEN"),
    base_url=os.environ.get("DATABRICKS_BASE_URL"),
)

response = client.chat.completions.create(
    model="databricks-claude-3-7-sonnet",
    messages=[{"role": "user", "content": "Why is the sky blue?"}],
    max_tokens=20480,
    extra_body={
        "thinking": {
            "type": "enabled",
            "budget_tokens": 10240
        }
    }
)

msg = response.choices[0].message

# With thinking enabled, content[0] is the reasoning block and content[1] is the text block
reasoning = msg.content[0]["summary"][0]["text"]
answer = msg.content[1]["text"]

print("Reasoning:", reasoning)
print("Answer:", answer)

GPT OSS model example

The reasoning_effort parameter accepts "low", "medium" (default), or "high" values. Higher reasoning effort may result in more thoughtful and accurate responses, but may increase latency and token usage.

curl -X POST "https://<workspace_host>/serving-endpoints/databricks-gpt-oss-120b/invocations" \
  -H "Authorization: Bearer $DATABRICKS_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [
      {
        "role": "user",
        "content": "Why is the sky blue?"
      }
    ],
    "max_tokens": 4096,
    "reasoning_effort": "high"
  }'
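
If you prefer the OpenAI Python client over curl, the following minimal sketch sends the same request. It assumes the client is configured as in the Claude example above and passes reasoning_effort through extra_body so the parameter is forwarded as-is in the request body.

# Minimal sketch using the OpenAI Python client instead of curl.
# Assumes `client` is configured as in the Claude example above.
response = client.chat.completions.create(
    model="databricks-gpt-oss-120b",
    messages=[{"role": "user", "content": "Why is the sky blue?"}],
    max_tokens=4096,
    extra_body={"reasoning_effort": "high"},
)

print(response.choices[0].message.content)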

The API response includes both reasoning and text content blocks:

ChatCompletionMessage(
    role="assistant",
    content=[
        {
            "type": "reasoning",
            "summary": [
                {
                    "type": "summary_text",
                    "text": ("The question is asking about the scientific explanation for why the sky appears blue... "),
                    "signature": ("EqoBCkgIARABGAIiQAhCWRmlaLuPiHaF357JzGmloqLqkeBm3cHG9NFTxKMyC/9bBdBInUsE3IZk6RxWge...")
                }
            ]
        },
        {
            "type": "text",
            "text": (
                "# Why the Sky Is Blue\n\n"
                "The sky appears blue because of a phenomenon called Rayleigh scattering. Here's how it works..."
            )
        }
    ],
    refusal=None,
    annotations=None,
    audio=None,
    function_call=None,
    tool_calls=None
)
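
Rather than reading blocks at fixed positions, you can iterate over the content blocks and branch on each block's type field. This is a small sketch based on the response shape shown above:

msg = response.choices[0].message

# Iterate over content blocks instead of assuming fixed positions
for block in msg.content:
    if block["type"] == "reasoning":
        for part in block["summary"]:
            print("Reasoning:", part["text"])
    elif block["type"] == "text":
        print("Answer:", block["text"])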

Manage reasoning across multiple turns

This section is specific to the databricks-claude-3-7-sonnet model.

In multi-turn conversations, only the reasoning blocks associated with the last assistant turn or tool-use session are visible to the model and counted as input tokens.

If you don't want to pass reasoning tokens back to the model (for example, you don't need it to reason over its prior steps), you can omit the reasoning block entirely. For example:

# text_content is the text answer from the first response, for example:
# text_content = response.choices[0].message.content[1]["text"]
response = client.chat.completions.create(
    model="databricks-claude-3-7-sonnet",
    messages=[
        {"role": "user", "content": "Why is the sky blue?"},
        {"role": "assistant", "content": text_content},
        {"role": "user", "content": "Can you explain in a way that a 5-year-old child can understand?"}
    ],
    max_tokens=20480,
    extra_body={
        "thinking": {
            "type": "enabled",
            "budget_tokens": 10240
        }
    }
)

answer = response.choices[0].message.content[1]["text"]
print("Answer:", answer)

However, if you do need the model to reason over its previous reasoning process (for instance, if you're building experiences that surface its intermediate reasoning), you must include the full, unmodified assistant message, including the reasoning block from the previous turn. Here's how to continue a thread with the full assistant message:

# Reuse the full assistant message, including its reasoning block, from the previous response
assistant_message = response.choices[0].message

response = client.chat.completions.create(
    model="databricks-claude-3-7-sonnet",
    messages=[
        {"role": "user", "content": "Why is the sky blue?"},
        {"role": "assistant", "content": text_content},
        {"role": "user", "content": "Can you explain in a way that a 5-year-old child can understand?"},
        assistant_message,
        {"role": "user", "content": "Can you simplify the previous answer?"}
    ],
    max_tokens=20480,
    extra_body={
        "thinking": {
            "type": "enabled",
            "budget_tokens": 10240
        }
    }
)

answer = response.choices[0].message.content[1]["text"]
print("Answer:", answer)

How does a reasoning model work?

Reasoning models introduce special reasoning tokens in addition to the standard input and output tokens. These tokens let the model "think" through the prompt, breaking it down and considering different ways to respond. After this internal reasoning process, the model generates its final answer as visible output tokens. Some models, like databricks-claude-3-7-sonnet, display these reasoning tokens to users, while others, such as the OpenAI o series, discard them and do not expose them in the final output.
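
One way to observe the effect of reasoning is to inspect the usage object returned with each chat completion. The sketch below compares usage with and without extended thinking for a Claude model; it assumes the client is configured as in the earlier examples and that the response exposes the standard prompt_tokens, completion_tokens, and total_tokens fields.

# Minimal sketch: compare token usage with and without extended thinking.
# Assumes `client` is configured as in the examples above and that the
# response exposes standard usage fields (an assumption to verify per model).
for thinking in (None, {"type": "enabled", "budget_tokens": 10240}):
    extra = {"thinking": thinking} if thinking else {}
    resp = client.chat.completions.create(
        model="databricks-claude-3-7-sonnet",
        messages=[{"role": "user", "content": "Why is the sky blue?"}],
        max_tokens=20480,
        extra_body=extra,
    )
    print("thinking enabled:", bool(thinking), "usage:", resp.usage)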

Additional resources