
MLflow 3 for GenAI

This page describes how MLflow 3 for GenAI, integrated with the Databricks platform, helps you build production-grade GenAI apps.

Traditional software and ML tests aren't built for GenAI's free-form language, making it difficult for teams to measure and improve quality. MLflow 3 solves this by combining AI-powered metrics that reliably measure GenAI quality with comprehensive trace observability, enabling you to measure, improve, and monitor quality throughout the entire application lifecycle.

When you use MLflow 3 for GenAI on Databricks, you get all of the advantages of the Databricks platform, including the following:

  • Unified platform. The entire GenAI development process in one place, from development debugging to production monitoring.
  • Open and flexible. Use any LLM provider and any framework.
  • Enterprise-ready. The Databricks platform provides enterprise security, scale, and governance.

Agent Evaluation SDK methods are integrated with Databricks-managed MLflow 3. For information about agent evaluation in MLflow 2, see Mosaic AI Agent Evaluation (MLflow 2) and the migration guide.

For a set of tutorials to get you started, see Get started with MLflow 3 for GenAI.

Note

Open source telemetry collection was introduced in MLflow 3.2.0, and is disabled on Databricks by default. For more details, refer to the MLflow usage tracking documentation.

Observe and debug GenAI apps with tracing

See exactly what your GenAI application is doing with comprehensive observability that captures every step of execution. You need only add a single line of code, and MLflow Tracing captures all prompts, retrievals, tool calls, responses, latency, and token counts throughout your application.

# Just add one line to capture everything
import mlflow

mlflow.openai.autolog()

# Your existing code works unchanged
response = client.chat.completions.create(...)
# Traces are captured automatically


  • Automatic instrumentation. One-line instrumentation for 20+ libraries, including OpenAI, LangChain, LlamaIndex, Anthropic, and DSPy.
  • Review your app's behavior and performance. Complete execution visibility allows you to capture prompts, retrievals, tool calls, responses, latency, and costs.
  • Production observability. Use the same instrumentation in development and production environments for consistent evaluation.
  • OpenTelemetry compatibility. Export traces anywhere while maintaining full data ownership and integration flexibility.

Automated quality evaluation of GenAI apps

Replace manual testing with automated evaluation using built-in and custom LLM-based scorers that match human expertise and can be applied in both development and production.

  • Built-in scorers. Ready-to-use scorers that assess safety, hallucinations, relevance, correctness, and retrieval quality.
  • Custom scorers. Create tailored judges that enforce your specific business requirements and align with domain expert judgment.

Turn production data into improvements

Every production interaction becomes an opportunity to improve with integrated feedback and evaluation workflows.


  • Expert feedback collection. The Review App provides a structured process and UI for collecting domain expert feedback, including ratings, corrections, and guidelines, on real interactions with your application.
  • Live app testing. Subject matter experts can chat with your app and provide instant feedback for continuous improvement.
  • Evaluation datasets from production. Evaluation datasets enable consistent, repeatable evaluation. Problematic production traces become test cases for continuous improvement and regression testing.
  • User feedback collection. Capture and link user feedback to specific traces for debugging and quality improvement insights. Collect thumbs up/down and comments programmatically from your deployed application.
  • Evaluate and improve quality with traces. Analyze traces to identify quality issues, create evaluation datasets from trace data, implement targeted improvements, and measure the impact of your changes.

Manage your GenAI application lifecycle

Version, track, and govern your entire GenAI application with enterprise-grade lifecycle management and governance tools.

  • Application versioning. Track code, parameters, and evaluation metrics for each version.
  • Production trace linking. Link traces, evaluations, and feedback to specific application versions.
  • Prompt Registry. Centralized management for versioning and sharing prompts across your organization, with A/B testing capabilities and Unity Catalog integration.
  • Enterprise integration.
      • Unity Catalog. Unified governance for all AI assets with enterprise security, access control, and compliance features.
      • Data intelligence. Connect your GenAI data to your business data in the Databricks Lakehouse and deliver custom analytics to your business stakeholders.
      • Mosaic AI Agent Serving. Deploy agents to production with scaling and operational rigor.

Get started with MLflow 3 for GenAI

Start building better GenAI applications with comprehensive observability and evaluation tools.

  • Quick start guide. Get up and running in minutes with step-by-step instructions for instrumenting your first application.
  • Databricks Notebook setup. Start in a managed environment with pre-configured dependencies and instant access to MLflow 3 features.
  • Local IDE development. Develop on your local machine with full MLflow 3 capabilities and seamless cloud integration.
  • Data Intelligence integration. Connect your GenAI data to business data in the Databricks Lakehouse for custom analytics and insights.