Important
This feature is in Beta.
Overview
MLflow enables you to automatically run scorers on your production GenAI application traces to continuously monitor quality. You can schedule any scorer (including custom metrics and built-in/custom LLM Judges) to automatically evaluate a sample of your production traffic.
Key benefits:
- Automated quality assessment without manual intervention.
- Flexible sampling to balance coverage with computational cost.
- Consistent evaluation using the same scorers from development.
- Continuous monitoring with periodic background execution.
Prerequisites
Before setting up quality monitoring, ensure you have:
- MLflow Experiment: An MLflow experiment where traces are being logged. If not specified, the active experiment is used.
- Instrumented production application: Your GenAI app must be logging traces using MLflow Tracing. See the Production Tracing guide.
- Defined scorers: Tested scorers that work with your application's trace format.
Tip
If you used your production app as the predict_fn in mlflow.genai.evaluate() during development, your scorers are likely already compatible.
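If your application is not instrumented yet, the following is a minimal sketch of what tracing setup can look like, assuming an OpenAI-based app; the experiment path is a placeholder, and other autolog integrations work the same way. See the Production Tracing guide for complete instructions.
import mlflow

# Log traces to the experiment that the scheduled scorers will read from (placeholder path)
mlflow.set_experiment("/Shared/my-genai-app-monitoring")

# Enable automatic tracing for OpenAI calls; other flavors expose similar autolog() functions
mlflow.openai.autolog()

# Or trace your own application code explicitly
@mlflow.trace
def answer_question(question: str) -> str:
    # ... call your model or agent here ...
    return "example response"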
Get started with production monitoring
This section includes example code showing how to create the different types of scorers.
Note
At any given time, at most 20 scorers can be associated with an experiment for continuous quality monitoring.
Use predefined scorers
MLflow provides several predefined scorers that you can use out-of-the-box for monitoring.
from mlflow.genai.scorers import Safety, ScorerSamplingConfig
# Register the scorer with a name and start monitoring
safety_scorer = Safety().register(name="my_safety_scorer") # name must be unique to experiment
safety_scorer = safety_scorer.start(sampling_config=ScorerSamplingConfig(sample_rate=0.7))
Use guidelines-based LLM scorers
Guidelines-based LLM scorers can evaluate inputs and outputs using pass/fail natural language criteria.
from mlflow.genai.scorers import Guidelines, ScorerSamplingConfig

# Create and register the guidelines scorer
english_scorer = Guidelines(
    name="english",
    guidelines=["The response must be in English"],
).register(name="is_english")  # name must be unique to experiment
# Start monitoring with the specified sample rate
english_scorer = english_scorer.start(sampling_config=ScorerSamplingConfig(sample_rate=0.7))
Use prompt-based scorers
For more flexibility than guidelines-based LLM scorers, you can use prompt-based scorers, which allow multi-level quality assessment with customizable choice categories (for example, excellent/good/poor) and optional numeric scoring.
from mlflow.genai.scorers import scorer, ScorerSamplingConfig

@scorer
def formality(inputs, outputs, trace):
    # Must be imported inline within the scorer function body
    from mlflow.genai.judges.databricks import custom_prompt_judge
    from mlflow.entities.assessment import DEFAULT_FEEDBACK_NAME

    formality_prompt = """
You will look at the response and determine the formality of the response.

<request>{{request}}</request>
<response>{{response}}</response>

You must choose one of the following categories.

[[formal]]: The response is very formal.
[[semi_formal]]: The response is somewhat formal. The response is somewhat formal if the response mentions friendship, etc.
[[not_formal]]: The response is not formal.
"""
    my_prompt_judge = custom_prompt_judge(
        name="formality",
        prompt_template=formality_prompt,
        numeric_values={
            "formal": 1,
            "semi_formal": 0.5,
            "not_formal": 0,
        },
    )

    result = my_prompt_judge(request=inputs, response=outputs)
    if hasattr(result, "name"):
        result.name = DEFAULT_FEEDBACK_NAME
    return result

# Register the custom scorer and start monitoring
formality_scorer = formality.register(name="my_formality_scorer")  # name must be unique to experiment
formality_scorer = formality_scorer.start(sampling_config=ScorerSamplingConfig(sample_rate=0.1))
Use custom scorer functions
For maximum flexibility, including the option to forego LLM-based scoring, you can define and use a custom scorer function for monitoring.
When defining custom scorers, do not use type hints in the function signature that require imports. If the scorer function body uses packages that need to be imported, import those packages inline within the function so the scorer serializes correctly.
Some packages are available by default without the need for an inline import. These include databricks-agents, mlflow-skinny, openai, and all packages included in Serverless environment version 2.
from mlflow.genai.scorers import scorer, ScorerSamplingConfig

# Custom metric: Check if response mentions Databricks
@scorer
def mentions_databricks(outputs):
    """Check if the response mentions Databricks"""
    return "databricks" in str(outputs.get("response", "")).lower()

# Custom metric: Response length check
@scorer(aggregations=["mean", "min", "max"])
def response_length(outputs):
    """Measure response length in characters"""
    return len(str(outputs.get("response", "")))

# Custom metric with multiple inputs
@scorer
def response_relevance_score(inputs, outputs):
    """Score relevance based on keyword matching"""
    query = str(inputs.get("query", "")).lower()
    response = str(outputs.get("response", "")).lower()

    # Simple keyword matching (replace with your logic)
    query_words = set(query.split())
    response_words = set(response.split())

    if not query_words:
        return 0.0

    overlap = len(query_words & response_words)
    return overlap / len(query_words)

# Register and start monitoring custom scorers
databricks_scorer = mentions_databricks.register(name="databricks_mentions")
databricks_scorer = databricks_scorer.start(sampling_config=ScorerSamplingConfig(sample_rate=0.5))

length_scorer = response_length.register(name="response_length")
length_scorer = length_scorer.start(sampling_config=ScorerSamplingConfig(sample_rate=1.0))

relevance_scorer = response_relevance_score.register(name="response_relevance_score")  # name must be unique to experiment
relevance_scorer = relevance_scorer.start(sampling_config=ScorerSamplingConfig(sample_rate=1.0))
Multiple scorer configuration
For comprehensive monitoring setup, you can register and start multiple scorers individually.
from mlflow.genai.scorers import Safety, RelevanceToQuery, ScorerSamplingConfig
# Configure multiple scorers for comprehensive monitoring
safety_scorer = Safety().register(name="safety_check") # name must be unique within an MLflow experiment
safety_scorer = safety_scorer.start(sampling_config=ScorerSamplingConfig(sample_rate=1.0)) # Check all traces
relevance_scorer = RelevanceToQuery().register(name="relevance_check")
relevance_scorer = relevance_scorer.start(sampling_config=ScorerSamplingConfig(sample_rate=0.5)) # Sample 50%
length_scorer = response_length.register(name="length_analysis")
length_scorer = length_scorer.start(sampling_config=ScorerSamplingConfig(sample_rate=0.3))
Manage scheduled scorers
List current scorers
To view all registered scorers for your experiment:
from mlflow.genai.scorers import list_scorers
# List all registered scorers
scorers = list_scorers()
for scorer in scorers:
    print(f"Name: {scorer._server_name}")
    print(f"Sample rate: {scorer.sample_rate}")
    print(f"Filter: {scorer.filter_string}")
    print("---")
Update a scorer
To modify existing scorer configurations:
from mlflow.genai.scorers import get_scorer, ScorerSamplingConfig
# Get existing scorer and update its configuration (immutable operation)
safety_scorer = get_scorer(name="safety_monitor")
updated_scorer = safety_scorer.update(sampling_config=ScorerSamplingConfig(sample_rate=0.8)) # Increased from 0.5
# Note: The original scorer remains unchanged; update() returns a new scorer instance
print(f"Original sample rate: {safety_scorer.sample_rate}") # Original rate
print(f"Updated sample rate: {updated_scorer.sample_rate}") # New rate
Stop and delete scorers
To stop monitoring or remove a scorer entirely:
from mlflow.genai.scorers import get_scorer, delete_scorer, ScorerSamplingConfig
# Get existing scorer
databricks_scorer = get_scorer(name="databricks_mentions")
# Stop monitoring (sets sample_rate to 0, keeps scorer registered)
stopped_scorer = databricks_scorer.stop()
print(f"Sample rate after stop: {stopped_scorer.sample_rate}") # 0
# Remove scorer entirely from the server
delete_scorer(name=databricks_scorer.name)
# Or restart monitoring from a stopped scorer
restarted_scorer = stopped_scorer.start(sampling_config=ScorerSamplingConfig(sample_rate=0.5))
Evaluate historical traces (metric backfill)
You can retroactively apply new or updated metrics to historical traces.
Basic metric backfill using current sample rates
from databricks.agents.scorers import backfill_scorers
from mlflow.genai.scorers import Safety, ScorerSamplingConfig, scorer

safety_scorer = Safety()
safety_scorer.register(name="safety_check")
safety_scorer.start(sampling_config=ScorerSamplingConfig(sample_rate=0.5))

# Custom scorer
@scorer(aggregations=["mean", "min", "max"])
def response_length(outputs):
    """Measure response length in characters"""
    return len(outputs)

response_length.register(name="response_length")
response_length.start(sampling_config=ScorerSamplingConfig(sample_rate=0.5))

# Use existing sample rates for specified scorers
job_id = backfill_scorers(
    scorers=["safety_check", "response_length"]
)
Metric backfill using custom sample rates and time range
from databricks.agents.scorers import backfill_scorers, BackfillScorerConfig
from datetime import datetime
from mlflow.genai.scorers import Safety, ScorerSamplingConfig, scorer

safety_scorer = Safety()
safety_scorer.register(name="safety_check")
safety_scorer.start(sampling_config=ScorerSamplingConfig(sample_rate=0.5))

# Custom scorer
@scorer(aggregations=["mean", "min", "max"])
def response_length(outputs):
    """Measure response length in characters"""
    return len(outputs)

response_length.register(name="response_length")
response_length.start(sampling_config=ScorerSamplingConfig(sample_rate=0.5))

# Define custom sample rates for backfill
custom_scorers = [
    BackfillScorerConfig(scorer=safety_scorer, sample_rate=0.8),
    BackfillScorerConfig(scorer=response_length, sample_rate=0.9),
]

job_id = backfill_scorers(
    experiment_id=YOUR_EXPERIMENT_ID,
    scorers=custom_scorers,
    start_time=datetime(2024, 6, 1),
    end_time=datetime(2024, 6, 30)
)
Recent data backfill
from databricks.agents.scorers import backfill_scorers, BackfillScorerConfig
from datetime import datetime, timedelta

# Backfill last week's data with higher sample rates
# (reuses the safety_scorer and response_length scorers registered above)
one_week_ago = datetime.now() - timedelta(days=7)

job_id = backfill_scorers(
    scorers=[
        BackfillScorerConfig(scorer=safety_scorer, sample_rate=0.8),
        BackfillScorerConfig(scorer=response_length, sample_rate=0.9),
    ],
    start_time=one_week_ago
)
View results
After scheduling scorers, allow 15-20 minutes for initial processing. Then:
- Navigate to your MLflow experiment.
- Open the Traces tab to see assessments attached to traces.
- Use the monitoring dashboards to track quality trends.
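You can also retrieve scored traces programmatically. The following is a minimal sketch assuming you know your experiment ID (the value below is a placeholder); assessments produced by the scheduled scorers are attached to each returned trace.
import mlflow

# Fetch recent traces from the monitored experiment as a pandas DataFrame
traces = mlflow.search_traces(
    experiment_ids=["<your_experiment_id>"],  # placeholder
    max_results=20,
)
print(traces.columns)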
Best practices
Sampling strategy
Balance coverage with cost, as shown in these examples:
# High-priority scorers: higher sampling
safety_scorer = Safety().register(name="safety")
safety_scorer = safety_scorer.start(sampling_config=ScorerSamplingConfig(sample_rate=1.0)) # 100% coverage for critical safety
# Expensive scorers: lower sampling
complex_scorer = ComplexCustomScorer().register(name="complex_analysis")
complex_scorer = complex_scorer.start(sampling_config=ScorerSamplingConfig(sample_rate=0.05)) # 5% for expensive operations
Custom scorer design
Keep custom scorers self-contained, as shown in the following example:
@scorer
def well_designed_scorer(inputs, outputs):
    # ✅ All imports inside the function
    import re
    import json

    # ✅ Handle missing data gracefully
    response = outputs.get("response", "")
    if not response:
        return 0.0

    # ✅ Return consistent types
    return float(len(response) > 100)
Troubleshooting
Scorers not running
If scorers aren't executing, check the following:
- Check experiment: Ensure that traces are logged to the experiment, not to individual runs.
- Sampling rate: With low sample rates, it may take time to see results.
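To verify both items, the following sketch checks that traces exist in the experiment and that each registered scorer has a nonzero sample rate. The experiment ID is a placeholder, and _server_name mirrors the listing example above.
import mlflow
from mlflow.genai.scorers import list_scorers

# 1. Confirm traces are actually landing in the experiment
traces = mlflow.search_traces(experiment_ids=["<your_experiment_id>"], max_results=5)
print(f"Found {len(traces)} recent traces")

# 2. Confirm each scorer is registered with a nonzero sample rate
for s in list_scorers():
    print(s._server_name, s.sample_rate)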
Serialization issues
When you create a custom scorer, include imports in the function definition.
# ❌ Avoid external dependencies imported outside the function
import external_library  # Outside function

@scorer
def bad_scorer(outputs):
    return external_library.process(outputs)

# ✅ Include imports in the function definition
@scorer
def good_scorer(outputs):
    import json  # Inside function
    return len(json.dumps(outputs))

# ❌ Avoid type hints in the scorer function signature that require imports
from typing import List

@scorer
def scorer_with_bad_types(outputs: List[str]):
    return False
Metric backfill issues
"Scheduled scorer 'X' not found in experiment"
- Ensure the scorer name matches a registered scorer in your experiment.
- Check available scorers with the list_scorers function, as shown below.
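For example, a quick check that reuses the list_scorers call shown earlier can print the registered names so they match what you pass to backfill_scorers (the _server_name attribute mirrors the listing example above).
from mlflow.genai.scorers import list_scorers

# Print registered scorer names so backfill_scorers() receives them exactly as stored
for s in list_scorers():
    print(s._server_name)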
Next steps
Continue your journey with the following tutorial.
- Create custom scorers - Build scorers tailored to your needs.
Reference guides
Explore detailed documentation for concepts and features mentioned in this guide.
- Production Monitoring - Deep dive into monitoring concepts.
- Scorers - Understand the metrics that power monitoring.
- Evaluation Harness - How offline evaluation relates to production.