Query a chat model

In this article, you learn how to write query requests for foundation models that are optimized for chat tasks and send them to your model serving endpoint.

The examples in this article apply to querying foundation models that are made available using either Foundation Model APIs pay-per-token endpoints or external models served by model serving endpoints.

Requirements

Query examples

The examples in this section show how to query the Meta Llama 3.3 70B Instruct model made available by the Foundation Model APIs pay-per-token endpoint, databricks-meta-llama-3-3-70b-instruct, using the different client options.

For a batch inference example, see Perform batch LLM inference using AI Functions.

OpenAI client

To use the OpenAI client, specify the model serving endpoint name as the model input.


from databricks.sdk import WorkspaceClient

# Get an OpenAI-compatible client that is authenticated to your workspace
w = WorkspaceClient()
openai_client = w.serving_endpoints.get_open_ai_client()

response = openai_client.chat.completions.create(
    model="databricks-meta-llama-3-3-70b-instruct",
    messages=[
      {
        "role": "system",
        "content": "You are a helpful assistant."
      },
      {
        "role": "user",
        "content": "What is a mixture of experts model?",
      }
    ],
    max_tokens=256
)
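
The returned object follows the OpenAI chat completions schema, so the generated text and token usage can be read from it. For example:

# Read the generated message and token usage from the OpenAI-style response
print(response.choices[0].message.content)
print(f"Total tokens used: {response.usage.total_tokens}")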

To query foundation models outside of your workspace, you must use the OpenAI client directly. You also need your Databricks workspace instance URL to connect the OpenAI client to Databricks. The following example assumes you have a Databricks API token and the openai package installed on your compute.


import os
import openai
from openai import OpenAI

client = OpenAI(
    api_key="dapi-your-databricks-token",
    # Your workspace instance URL followed by /serving-endpoints
    base_url="https://example.staging.cloud.databricks.com/serving-endpoints"
)

response = client.chat.completions.create(
    model="databricks-meta-llama-3-3-70b-instruct",
    messages=[
      {
        "role": "system",
        "content": "You are a helpful assistant."
      },
      {
        "role": "user",
        "content": "What is a mixture of experts model?",
      }
    ],
    max_tokens=256
)
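
Chat endpoints also support streaming tokens as they are generated. The following is a minimal sketch, reusing the client created above, that passes stream=True and iterates over the returned chunks:

# Minimal streaming sketch: reuses the `client` created above
stream = client.chat.completions.create(
    model="databricks-meta-llama-3-3-70b-instruct",
    messages=[
      {
        "role": "user",
        "content": "What is a mixture of experts model?"
      }
    ],
    max_tokens=256,
    stream=True
)

for chunk in stream:
    # Each chunk carries a delta with the next piece of generated text
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="")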

SQL

Important

The following example uses the built-in SQL function, ai_query. This function is in Public Preview and the definition might change.

SELECT ai_query(
    "databricks-meta-llama-3-3-70b-instruct",
    "Can you explain AI in ten words?"
  )
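
Because ai_query is a SQL function, it can also be applied across the rows of a table for batch-style scoring, as covered in the batch inference article linked above. The following is a minimal sketch from Python, assuming a SparkSession named spark and a hypothetical table samples.demo.reviews with a review_text column:

# Minimal batch-style sketch: the table and column names below are hypothetical
summaries = spark.sql("""
    SELECT
        review_text,
        ai_query(
            'databricks-meta-llama-3-3-70b-instruct',
            CONCAT('Summarize this review in one sentence: ', review_text)
        ) AS summary
    FROM samples.demo.reviews
""")

summaries.show(truncate=False)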

REST API

Important

The following example uses REST API parameters for querying serving endpoints that serve foundation models. These parameters are in Public Preview and the definition might change. See POST /serving-endpoints/{name}/invocations.

curl \
-u token:$DATABRICKS_TOKEN \
-X POST \
-H "Content-Type: application/json" \
-d '{
  "messages": [
    {
      "role": "system",
      "content": "You are a helpful assistant."
    },
    {
      "role": "user",
      "content": " What is a mixture of experts model?"
    }
  ]
}' \
https://<workspace_host>.databricks.com/serving-endpoints/databricks-meta-llama-3-3-70b-instruct/invocations
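
The same request can be sent from Python with any HTTP client. The following is a minimal sketch using the requests library, assuming your personal access token is available in the DATABRICKS_TOKEN environment variable:

import os
import requests

# Invocations URL for the serving endpoint
url = "https://<workspace_host>.databricks.com/serving-endpoints/databricks-meta-llama-3-3-70b-instruct/invocations"

response = requests.post(
    url,
    headers={"Authorization": f"Bearer {os.environ['DATABRICKS_TOKEN']}"},
    json={
        "messages": [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": "What is a mixture of experts model?"}
        ]
    }
)
response.raise_for_status()
print(response.json()["choices"][0]["message"]["content"])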

MLflow Deployments SDK

Important

The following example uses the predict() API from the MLflow Deployments SDK.


import os
import mlflow.deployments

# Only required when running this example outside of a Databricks notebook
os.environ["DATABRICKS_HOST"] = "https://<workspace_host>.databricks.com"
os.environ["DATABRICKS_TOKEN"] = "dapi-your-databricks-token"

client = mlflow.deployments.get_deploy_client("databricks")

chat_response = client.predict(
    endpoint="databricks-meta-llama-3-3-70b-instruct",
    inputs={
        "messages": [
            {
              "role": "user",
              "content": "Hello!"
            },
            {
              "role": "assistant",
              "content": "Hello! How can I assist you today?"
            },
            {
              "role": "user",
              "content": "What is a mixture of experts model??"
            }
        ],
        "temperature": 0.1,
        "max_tokens": 20
    }
)
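
The predict() call returns the endpoint response as a dictionary-like object in the same chat completion format, so the reply can typically be read from the first choice:

# Read the assistant reply from the chat completion response
print(chat_response["choices"][0]["message"]["content"])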

Databricks Python SDK

This code must be run in a notebook in your workspace. See Use the Databricks SDK for Python from an Azure Databricks notebook.

from databricks.sdk import WorkspaceClient
from databricks.sdk.service.serving import ChatMessage, ChatMessageRole

w = WorkspaceClient()
response = w.serving_endpoints.query(
    name="databricks-meta-llama-3-3-70b-instruct",
    messages=[
        ChatMessage(
            role=ChatMessageRole.SYSTEM, content="You are a helpful assistant."
        ),
        ChatMessage(
            role=ChatMessageRole.USER, content="What is a mixture of experts model?"
        ),
    ],
    max_tokens=128,
)
print(f"RESPONSE:\n{response.choices[0].message.content}")

LangChain

To query a foundation model endpoint using LangChain, you can use the ChatDatabricks ChatModel class and specify the endpoint.

%pip install databricks-langchain
from langchain_core.messages import HumanMessage, SystemMessage
from databricks_langchain import ChatDatabricks

messages = [
    SystemMessage(content="You're a helpful assistant"),
    HumanMessage(content="What is a mixture of experts model?"),
]

llm = ChatDatabricks(endpoint="databricks-meta-llama-3-3-70b-instruct")
llm.invoke(messages)
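
invoke() returns a LangChain AIMessage, so the generated text is available on its content attribute. A minimal usage sketch:

# Capture the AIMessage returned by the chat model and print its text
ai_message = llm.invoke(messages)
print(ai_message.content)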

As an example, the following is the expected request format for a chat model when using the REST API. For external models, you can include additional parameters that are valid for a given provider and endpoint configuration. See Additional query parameters.

{
  "messages": [
    {
      "role": "user",
      "content": "What is a mixture of experts model?"
    }
  ],
  "max_tokens": 100,
  "temperature": 0.1
}

The following is an expected response format for a request made using the REST API:

{
  "model": "databricks-meta-llama-3-3-70b-instruct",
  "choices": [
    {
      "message": {},
      "index": 0,
      "finish_reason": null
    }
  ],
  "usage": {
    "prompt_tokens": 7,
    "completion_tokens": 74,
    "total_tokens": 81
  },
  "object": "chat.completion",
  "id": null,
  "created": 1698824353
}

Supported models

See Foundation model types for supported chat models.

Additional resources