Production monitoring

Important

This feature is in Beta.

Overview

MLflow enables you to automatically run scorers on your production GenAI application traces to continuously monitor quality. You can schedule any scorer (including custom metrics and built-in/custom LLM Judges) to automatically evaluate a sample of your production traffic.

Key benefits:

  • Automated quality assessment without manual intervention.
  • Flexible sampling to balance coverage with computational cost.
  • Consistent evaluation using the same scorers from development.
  • Continuous monitoring with periodic background execution.

Prerequisites

Before setting up quality monitoring, ensure you have:

  1. MLflow Experiment: An MLflow experiment where traces are being logged. If not specified, the active experiment is used.
  2. Instrumented production application: Your GenAI app must be logging traces using MLflow Tracing; a minimal sketch follows this list. See the Production Tracing guide.
  3. Defined scorers: Tested scorers that work with your application's trace format.
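
For reference, the following is a minimal sketch of an instrumented app; the experiment name and app function are illustrative assumptions, and the Production Tracing guide covers complete setup.

import mlflow

# Log traces to the experiment you plan to monitor
# ("/Shared/genai-monitoring" is a hypothetical name).
mlflow.set_experiment("/Shared/genai-monitoring")

@mlflow.trace
def my_app(query: str) -> str:
    # Replace this stub with your model or agent call.
    return f"Answer to: {query}"

my_app("What is MLflow Tracing?")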

Tip

If you used your production app as the predict_fn in mlflow.genai.evaluate() during development, your scorers are likely already compatible.
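
For reference, a minimal development-time sketch of that pattern is shown below; the dataset, app function, and choice of the Safety scorer are illustrative assumptions.

import mlflow
from mlflow.genai.scorers import Safety

# Hypothetical production entry point used as predict_fn.
def my_app(query: str) -> dict:
    return {"response": f"Answer to: {query}"}

# Illustrative evaluation dataset; inputs are passed to predict_fn as keyword arguments.
eval_data = [{"inputs": {"query": "What is MLflow?"}}]

# Scorers that work here are likely compatible when scheduled for monitoring.
mlflow.genai.evaluate(
    data=eval_data,
    predict_fn=my_app,
    scorers=[Safety()],
)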

Get started with production monitoring

This section includes example code showing how to create the different types of scorers.

Note

At any given time, at most 20 scorers can be associated with an experiment for continuous quality monitoring.

Use predefined scorers

MLflow provides several predefined scorers that you can use out-of-the-box for monitoring.

from mlflow.genai.scorers import Safety, ScorerSamplingConfig

# Register the scorer with a name and start monitoring
safety_scorer = Safety().register(name="my_safety_scorer")  # name must be unique to experiment
safety_scorer = safety_scorer.start(sampling_config=ScorerSamplingConfig(sample_rate=0.7))

Use guidelines-based LLM scorers

Guidelines-based LLM scorers can evaluate inputs and outputs using pass/fail natural language criteria.

from mlflow.genai.scorers import Guidelines, ScorerSamplingConfig

# Create and register the guidelines scorer
english_scorer = Guidelines(
  name="english",
  guidelines=["The response must be in English"]
).register(name="is_english")  # name must be unique to experiment

# Start monitoring with the specified sample rate
english_scorer = english_scorer.start(sampling_config=ScorerSamplingConfig(sample_rate=0.7))

Use prompt-based scorers

For more flexibility than guidelines-based LLM scorers, you can use prompt-based scorers which allow for multi-level quality assessment with customizable choice categories (e.g., excellent/good/poor) and optional numeric scoring.

from mlflow.genai.scorers import scorer, ScorerSamplingConfig


@scorer
def formality(inputs, outputs, trace):
    # Must be imported inline within the scorer function body
    from mlflow.genai.judges.databricks import custom_prompt_judge
    from mlflow.entities.assessment import DEFAULT_FEEDBACK_NAME

    formality_prompt = """
    You will look at the response and determine the formality of the response.

    <request>{{request}}</request>
    <response>{{response}}</response>

    You must choose one of the following categories.

    [[formal]]: The response is very formal.
    [[semi_formal]]: The response is somewhat formal, for example if the response mentions friendship, etc.
    [[not_formal]]: The response is not formal.
    """

    my_prompt_judge = custom_prompt_judge(
        name="formality",
        prompt_template=formality_prompt,
        numeric_values={
            "formal": 1,
            "semi_formal": 0.5,
            "not_formal": 0,
        },
    )

    result = my_prompt_judge(request=inputs, response=outputs)
    if hasattr(result, "name"):
        result.name = DEFAULT_FEEDBACK_NAME
    return result

# Register the custom scorer and start monitoring
formality_scorer = formality.register(name="my_formality_scorer")  # name must be unique to experiment
formality_scorer = formality_scorer.start(sampling_config=ScorerSamplingConfig(sample_rate=0.1))

Use custom scorer functions

For maximum flexibility, including the option to forego LLM-based scoring, you can define and use a custom scorer function for monitoring.

When defining custom scorers, do not use type hints that need to be imported in the function signature. If the scorer function body uses packages that need to be imported, import these packages inline within the function for proper serialization.

Some packages are available by default without the need for an inline import. These include databricks-agents, mlflow-skinny, openai, and all packages included in Serverless environment version 2.

from mlflow.genai.scorers import scorer, ScorerSamplingConfig


# Custom metric: Check if response mentions Databricks
@scorer
def mentions_databricks(outputs):
    """Check if the response mentions Databricks"""
    return "databricks" in str(outputs.get("response", "")).lower()

# Custom metric: Response length check
@scorer(aggregations=["mean", "min", "max"])
def response_length(outputs):
    """Measure response length in characters"""
    return len(str(outputs.get("response", "")))

# Custom metric with multiple inputs
@scorer
def response_relevance_score(inputs, outputs):
    """Score relevance based on keyword matching"""
    query = str(inputs.get("query", "")).lower()
    response = str(outputs.get("response", "")).lower()

    # Simple keyword matching (replace with your logic)
    query_words = set(query.split())
    response_words = set(response.split())

    if not query_words:
        return 0.0

    overlap = len(query_words & response_words)
    return overlap / len(query_words)

# Register and start monitoring custom scorers
databricks_scorer = mentions_databricks.register(name="databricks_mentions")
databricks_scorer = databricks_scorer.start(sampling_config=ScorerSamplingConfig(sample_rate=0.5))

length_scorer = response_length.register(name="response_length")
length_scorer = length_scorer.start(sampling_config=ScorerSamplingConfig(sample_rate=1.0))

relevance_scorer = response_relevance_score.register(name="response_relevance_score")  # name must be unique to experiment
relevance_scorer = relevance_scorer.start(sampling_config=ScorerSamplingConfig(sample_rate=1.0))

Multiple scorer configuration

For comprehensive monitoring setup, you can register and start multiple scorers individually.

from mlflow.genai.scorers import Safety, RelevanceToQuery, ScorerSamplingConfig

# Configure multiple scorers for comprehensive monitoring
safety_scorer = Safety().register(name="safety_check")  # name must be unique within an MLflow experiment
safety_scorer = safety_scorer.start(sampling_config=ScorerSamplingConfig(sample_rate=1.0))  # Check all traces

relevance_scorer = RelevanceToQuery().register(name="relevance_check")
relevance_scorer = relevance_scorer.start(sampling_config=ScorerSamplingConfig(sample_rate=0.5))  # Sample 50%

# response_length is the custom scorer defined in the previous section
length_scorer = response_length.register(name="length_analysis")
length_scorer = length_scorer.start(sampling_config=ScorerSamplingConfig(sample_rate=0.3))

Manage scheduled scorers

List current scorers

To view all registered scorers for your experiment:

from mlflow.genai.scorers import list_scorers

# List all registered scorers
scorers = list_scorers()
for scorer in scorers:
    print(f"Name: {scorer._server_name}")
    print(f"Sample rate: {scorer.sample_rate}")
    print(f"Filter: {scorer.filter_string}")
    print("---")

Update a scorer

To modify existing scorer configurations:

from mlflow.genai.scorers import get_scorer, ScorerSamplingConfig

# Get existing scorer and update its configuration (immutable operation)
safety_scorer = get_scorer(name="safety_monitor")
updated_scorer = safety_scorer.update(sampling_config=ScorerSamplingConfig(sample_rate=0.8))  # Increased from 0.5

# Note: The original scorer remains unchanged; update() returns a new scorer instance
print(f"Original sample rate: {safety_scorer.sample_rate}")  # Original rate
print(f"Updated sample rate: {updated_scorer.sample_rate}")   # New rate

Stop and delete scorers

To stop monitoring or remove a scorer entirely:

from mlflow.genai.scorers import get_scorer, delete_scorer, ScorerSamplingConfig

# Get existing scorer
databricks_scorer = get_scorer(name="databricks_mentions")

# Stop monitoring (sets sample_rate to 0, keeps scorer registered)
stopped_scorer = databricks_scorer.stop()
print(f"Sample rate after stop: {stopped_scorer.sample_rate}")  # 0

# Restart monitoring from a stopped scorer
restarted_scorer = stopped_scorer.start(sampling_config=ScorerSamplingConfig(sample_rate=0.5))

# Or remove the scorer from the server entirely
delete_scorer(name=databricks_scorer.name)

Evaluate historical traces (metric backfill)

You can retroactively apply new or updated metrics to historical traces.

Basic metric backfill using current sample rates

from databricks.agents.scorers import backfill_scorers
from mlflow.genai.scorers import Safety, ScorerSamplingConfig, scorer

safety_scorer = Safety().register(name="safety_check")
safety_scorer = safety_scorer.start(sampling_config=ScorerSamplingConfig(sample_rate=0.5))

# Custom scorer
@scorer(aggregations=["mean", "min", "max"])
def response_length(outputs):
    """Measure response length in characters"""
    return len(str(outputs))

response_length = response_length.register(name="response_length")
response_length = response_length.start(sampling_config=ScorerSamplingConfig(sample_rate=0.5))

# Use existing sample rates for specified scorers
job_id = backfill_scorers(
    scorers=["safety_check", "response_length"]
)

Metric backfill using custom sample rates and time range

from databricks.agents.scorers import backfill_scorers, BackfillScorerConfig
from datetime import datetime
from mlflow.genai.scorers import Safety, ScorerSamplingConfig, scorer

safety_scorer = Safety().register(name="safety_check")
safety_scorer = safety_scorer.start(sampling_config=ScorerSamplingConfig(sample_rate=0.5))

# Custom scorer
@scorer(aggregations=["mean", "min", "max"])
def response_length(outputs):
    """Measure response length in characters"""
    return len(str(outputs))

response_length = response_length.register(name="response_length")
response_length = response_length.start(sampling_config=ScorerSamplingConfig(sample_rate=0.5))

# Define custom sample rates for backfill
custom_scorers = [
    BackfillScorerConfig(scorer=safety_scorer, sample_rate=0.8),
    BackfillScorerConfig(scorer=response_length, sample_rate=0.9)
]

job_id = backfill_scorers(
    experiment_id=YOUR_EXPERIMENT_ID,
    scorers=custom_scorers,
    start_time=datetime(2024, 6, 1),
    end_time=datetime(2024, 6, 30)
)

Recent data backfill

from datetime import datetime, timedelta

# Continues from the previous example: backfill_scorers, BackfillScorerConfig,
# safety_scorer, and response_length are already defined

# Backfill last week's data with higher sample rates
one_week_ago = datetime.now() - timedelta(days=7)

job_id = backfill_scorers(
    scorers=[
        BackfillScorerConfig(scorer=safety_scorer, sample_rate=0.8),
        BackfillScorerConfig(scorer=response_length, sample_rate=0.9)
    ],
    start_time=one_week_ago
)

View results

After scheduling scorers, allow 15-20 minutes for initial processing. Then:

  1. Navigate to your MLflow experiment.
  2. Open the Traces tab to see assessments attached to traces. You can also fetch them programmatically, as sketched after this list.
  3. Use the monitoring dashboards to track quality trends.
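
Beyond the UI, you can pull recent traces and their attached assessments programmatically. The following is a minimal sketch using mlflow.search_traces against the active experiment; the exact columns in the returned DataFrame can vary by MLflow version.

import mlflow

# Assumes the monitored experiment is the active experiment.
traces = mlflow.search_traces(max_results=10)

# Inspect the returned DataFrame, including any assessment-related columns.
print(traces.columns.tolist())
print(traces.head())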

Best practices

Sampling strategy

Balance coverage with cost, as shown in these examples:

from mlflow.genai.scorers import Safety, ScorerSamplingConfig

# High-priority scorers: higher sampling
safety_scorer = Safety().register(name="safety")
safety_scorer = safety_scorer.start(sampling_config=ScorerSamplingConfig(sample_rate=1.0))  # 100% coverage for critical safety

# Expensive scorers: lower sampling (ComplexCustomScorer is a placeholder for your own custom scorer)
complex_scorer = ComplexCustomScorer().register(name="complex_analysis")
complex_scorer = complex_scorer.start(sampling_config=ScorerSamplingConfig(sample_rate=0.05))  # 5% for expensive operations

Custom scorer design

Keep custom scorers self-contained, as shown in the following example:

@scorer
def well_designed_scorer(inputs, outputs):
    # ✅ All imports inside the function
    import re
    import json

    # ✅ Handle missing data gracefully
    response = outputs.get("response", "")
    if not response:
        return 0.0

    # ✅ Return consistent types
    return float(len(response) > 100)

Troubleshooting

Scorers not running

If scorers aren't executing, check the following:

  1. Check experiment: Ensure that traces are logged to the experiment, not to individual runs; a quick check is sketched after this list.
  2. Sampling rate: With low sample rates, it may take time to see results.
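
For the first check, the sketch below is one way to confirm which experiment your app logs to and that traces are arriving there; the experiment name is a hypothetical placeholder.

import mlflow

# Use the same experiment your scorers are registered against
# ("/Shared/genai-monitoring" is a hypothetical name).
mlflow.set_experiment("/Shared/genai-monitoring")
print(mlflow.get_experiment_by_name("/Shared/genai-monitoring").experiment_id)

# Confirm traces are arriving in this experiment.
print(len(mlflow.search_traces(max_results=5)))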

Serialization issues

When you create a custom scorer, include imports in the function definition.

# ❌ Avoid external dependencies
import external_library  # Outside function

@scorer
def bad_scorer(outputs):
    return external_library.process(outputs)

# ✅ Include imports in the function definition
@scorer
def good_scorer(outputs):
    import json  # Inside function
    return len(json.dumps(outputs))

# ❌ Avoid using type hints in scorer function signature that requires imports
from typing import List

@scorer
def scorer_with_bad_types(outputs: List[str]):
    return False

Metric backfill issues

"Scheduled scorer 'X' not found in experiment"

  • Ensure that the scorer name matches a scorer registered in your experiment.
  • Check the available scorers by using the list_scorers() method.

Next steps

Continue your journey with the following tutorial.

Reference guides

Explore detailed documentation for concepts and features mentioned in this guide.