
MLflow 3 for GenAI

This page describes how MLflow 3 for GenAI, integrated with the Databricks platform, helps you build production-grade GenAI apps.

Traditional software and ML tests aren't built for GenAI's free-form language, making it difficult for teams to measure and improve quality. MLflow 3 solves this by combining AI-powered metrics that reliably measure GenAI quality with comprehensive trace observability, enabling you to measure, improve, and monitor quality throughout the entire application lifecycle.

When you use MLflow 3 for GenAI on Databricks, you get all of the advantages of the Databricks platform, including the following:

  • Unified platform. The entire GenAI development process in one place, from development debugging to production monitoring.
  • Open and flexible. Use any LLM provider and any framework.
  • Enterprise-ready. The Databricks platform provides enterprise security, scale, and governance.

Agent Evaluation SDK methods are integrated with Databricks-managed MLflow 3. For information about agent evaluation in MLflow 2, see Mosaic AI Agent Evaluation (MLflow 2) and the migration guide.

For a set of tutorials to get you started, see Get started with MLflow 3 for GenAI.

Note

Open source telemetry collection was introduced in MLflow 3.2.0, and is disabled on Databricks by default. For more details, refer to the MLflow usage tracking documentation.

Observe and debug GenAI apps with tracing

See exactly what your GenAI application is doing with comprehensive observability that captures every step of execution. You need only add a single line of code, and MLflow Tracing captures all prompts, retrievals, tool calls, responses, latency, and token counts throughout your application.

# Just add one line to capture everything
import mlflow

mlflow.openai.autolog()

# Your existing code works unchanged
response = client.chat.completions.create(...)
# Traces are captured automatically


  • Automatic instrumentation. One-line instrumentation for 20+ libraries, including OpenAI, LangChain, LlamaIndex, Anthropic, and DSPy.
  • Review your app's behavior and performance. Complete execution visibility allows you to capture prompts, retrievals, tool calls, responses, latency, and costs.
  • Production observability. Use the same instrumentation in development and production environments for consistent evaluation.
  • OpenTelemetry compatibility. Export traces anywhere while maintaining full data ownership and integration flexibility.

Automated quality evaluation of GenAI apps

Replace manual testing with automated evaluation using built-in and custom LLM-based scorers that match human expertise and can be applied in both development and production.

  • Built-in scorers. Ready-to-use scorers that assess safety, hallucinations, relevance, correctness, and retrieval quality.
  • Custom scorers. Create tailored judges that enforce your specific business requirements and align with domain expert judgment.

Turn production data into improvements

Every production interaction becomes an opportunity to improve with integrated feedback and evaluation workflows.


  • Expert feedback collection. The Review App provides a structured process and UI for collecting domain expert feedback, including ratings, corrections, and guidelines, on real interactions with your application.
  • Live app testing. Subject matter experts can chat with your app and provide instant feedback for continuous improvement.
  • Evaluation datasets from production. Evaluation datasets enable consistent, repeatable evaluation. Problematic production traces become test cases for continuous improvement and regression testing.
  • User feedback collection. Capture and link user feedback to specific traces for debugging and quality improvement insights. Collect thumbs up/down and comments programmatically from your deployed application.
  • Evaluate and improve quality with traces. Analyze traces to identify quality issues, create evaluation datasets from trace data, implement targeted improvements, and measure the impact of your changes.

Manage your GenAI application lifecycle

Version, track, and govern your entire GenAI application with enterprise-grade lifecycle management and governance tools.

  • Application versioning. Track code, parameters, and evaluation metrics for each version.
  • Production trace linking. Link traces, evaluations, and feedback to specific application versions.
  • Prompt Registry. Centralized management for versioning and sharing prompts across your organization, with A/B testing capabilities and Unity Catalog integration.
  • Enterprise integration.
      • Unity Catalog. Unified governance for all AI assets with enterprise security, access control, and compliance features.
      • Data intelligence. Connect your GenAI data to your business data in the Databricks Lakehouse and deliver custom analytics to your business stakeholders.
      • Mosaic AI Agent Serving. Deploy agents to production with scaling and operational rigor.

Get started with MLflow 3 for GenAI

Start building better GenAI applications with comprehensive observability and evaluation tools.

  • Quick start guide. Get up and running in minutes with step-by-step instructions for instrumenting your first application.
  • Databricks Notebook setup. Start in a managed environment with pre-configured dependencies and instant access to MLflow 3 features.
  • Local IDE development. Develop on your local machine with full MLflow 3 capabilities and seamless cloud integration.
  • Data Intelligence integration. Connect your GenAI data to business data in the Databricks Lakehouse for custom analytics and insights.