In this article, you learn how to write query requests for foundation models optimized for vision tasks, and send them to your model serving endpoint.
Mosaic AI Model Serving provides a unified API to understand and analyze images using a variety of foundation models, unlocking powerful multimodal capabilities. This functionality is available through select Databricks-hosted models as part of Foundation Model APIs and serving endpoints that serve external models.
Requirements
- See Requirements.
- Install the appropriate package on your cluster based on the querying client option you choose.
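The query examples below use the OpenAI Python client together with `httpx` for downloading images; a minimal setup sketch (your environment may already provide these):

```shell
# Install the OpenAI client and httpx used by the examples below
pip install openai httpx
```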
Query examples
from openai import OpenAI
import base64
import httpx

client = OpenAI(
    api_key="dapi-your-databricks-token",
    base_url="https://example.staging.cloud.databricks.com/serving-endpoints"
)

# Encode the image as base64
image_url = "https://upload.wikimedia.org/wikipedia/commons/a/a7/Camponotus_flavomarginatus_ant.jpg"
image_data = base64.standard_b64encode(httpx.get(image_url).content).decode("utf-8")

# OpenAI request
completion = client.chat.completions.create(
    model="databricks-claude-3-7-sonnet",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "what's in this image?"},
                {
                    "type": "image_url",
                    "image_url": {"url": f"data:image/jpeg;base64,{image_data}"},
                },
            ],
        }
    ],
)

print(completion.choices[0].message.content)
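The same request shape works with a local image file instead of a downloaded one; a minimal sketch, where the helper name and the file path are illustrative placeholders:

```python
import base64

def encode_image(path: str) -> str:
    """Read a local image file and return its base64 text for use in a data URL."""
    with open(path, "rb") as f:
        return base64.standard_b64encode(f.read()).decode("utf-8")

# Usage (the path is a placeholder for any JPEG on disk):
# image_data = encode_image("ant.jpg")
# {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{image_data}"}}
```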
The Chat Completions API supports multiple image inputs, allowing the model to analyze each image and synthesize information from all inputs to generate a response to the prompt.
from openai import OpenAI
import base64
import httpx

client = OpenAI(
    api_key="dapi-your-databricks-token",
    base_url="https://example.staging.cloud.databricks.com/serving-endpoints"
)

# Encode multiple images
image1_url = "https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg"
image1_data = base64.standard_b64encode(httpx.get(image1_url).content).decode("utf-8")

image2_url = "https://upload.wikimedia.org/wikipedia/commons/a/a7/Camponotus_flavomarginatus_ant.jpg"
image2_data = base64.standard_b64encode(httpx.get(image2_url).content).decode("utf-8")

# OpenAI request
completion = client.chat.completions.create(
    model="databricks-claude-3-7-sonnet",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What are in these images? Is there any difference between them?"},
                {
                    "type": "image_url",
                    "image_url": {"url": f"data:image/jpeg;base64,{image1_data}"},
                },
                {
                    "type": "image_url",
                    "image_url": {"url": f"data:image/jpeg;base64,{image2_data}"},
                },
            ],
        }
    ],
)

print(completion.choices[0].message.content)
Supported models
See Foundation model types for supported vision models.
Input image requirements
This section applies only to Foundation Model APIs. For external models, refer to the provider's documentation.
Multiple images per request
- Up to 20 images per request in the claude.ai interface
- Up to 100 images per API request
- All provided images are processed in a request, which is useful for comparing or contrasting them.
Size limitations
- Images larger than 8000×8000 px are rejected.
- If more than 20 images are submitted in one API request, the maximum allowed size per image is 2000×2000 px.
Image resizing recommendations
- For optimal performance, resize images before uploading if they are too large.
- If an image's long edge exceeds 1568 pixels, or its size exceeds ~1,600 tokens, it is _automatically scaled down_ while preserving aspect ratio.
- Very small images (under 200 pixels on any edge) may degrade performance.
- To reduce latency, keep images within 1.15 megapixels and at most 1568 pixels in both dimensions.
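The resizing guidance above can be sketched as a small helper that computes target dimensions before you resize; a minimal sketch (the function name is illustrative, and the Pillow usage shown in comments is one possible way to apply it, not part of the Databricks API):

```python
def fit_within(width: int, height: int, max_edge: int = 1568) -> tuple[int, int]:
    """Scale (width, height) down so the long edge is at most max_edge pixels,
    preserving aspect ratio. Returns the dimensions unchanged if already small enough."""
    long_edge = max(width, height)
    if long_edge <= max_edge:
        return width, height
    scale = max_edge / long_edge
    return round(width * scale), round(height * scale)

# Possible usage with Pillow before encoding:
# from PIL import Image
# img = Image.open("photo.jpg")
# img = img.resize(fit_within(*img.size))
```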
Image quality considerations
- Supported formats: JPEG, PNG, GIF, WebP.
- Clarity: Avoid blurry or pixelated images.
- Text in images:
- Ensure text is legible and not too small.
- Avoid cropping out key visual context just to enlarge the text.
Calculate costs
This section applies only to Foundation Model APIs. For external models, refer to the provider's documentation.
Each image in a request to a foundation model adds to your token usage.
Token counts and estimates
If no resizing is needed, estimate tokens with: tokens = (width px × height px) / 750
Approximate token counts for different image sizes:
| Image size | Tokens |
|---|---|
| 200×200 px (0.04 MP) | ~54 |
| 1000×1000 px (1 MP) | ~1334 |
| 1092×1092 px (1.19 MP) | ~1590 |
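The estimates above follow directly from the formula; a quick sketch that rounds up to the nearest token (the function name is illustrative):

```python
import math

def estimate_image_tokens(width_px: int, height_px: int) -> int:
    """Estimate token usage for an image that needs no resizing: (width × height) / 750."""
    return math.ceil(width_px * height_px / 750)

print(estimate_image_tokens(200, 200))    # 54
print(estimate_image_tokens(1000, 1000))  # 1334
print(estimate_image_tokens(1092, 1092))  # 1590
```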
Limitations of image understanding
This section applies only to Foundation Model APIs. For external models, refer to the provider's documentation.
Claude models on Databricks have the following image understanding limitations:
- People identification: Cannot identify or name people in images.
- Accuracy: May misinterpret low-quality, rotated, or very small images (<200 px).
- Spatial reasoning: Struggles with precise layouts, such as reading analog clocks or chess positions.
- Counting: Provides approximate counts, but may be inaccurate for many small objects.
- AI-generated images: Cannot reliably detect synthetic or fake images.
- Inappropriate content: Blocks explicit or policy-violating images.
- Healthcare: Not suited for complex medical scans (for example, CTs and MRIs). It's not a diagnostic tool.
Review all outputs carefully, especially for high-stakes use cases. Avoid using Claude for tasks requiring perfect precision or sensitive analysis without human oversight.