Groundedness judge & scorer

The judges.is_grounded() predefined judge assesses whether your application's response is factually supported by the provided context (either from a RAG system or generated by a tool call), helping detect hallucinations or statements not backed by that context.

This judge is available through the predefined RetrievalGroundedness scorer for evaluating RAG applications that need to ensure responses are grounded in retrieved information.

API Signature

For details, see mlflow.genai.judges.is_grounded().

from mlflow.genai.judges import is_grounded

def is_grounded(
    *,
    request: str,               # User's original query
    response: str,              # Application's response
    context: Any,               # Context the response must be grounded in; can be any Python primitive or a JSON-serializable object
    name: Optional[str] = None  # Optional custom name for display in the MLflow UIs
) -> mlflow.entities.Feedback:
    """Returns Feedback with 'yes' or 'no' value and a rationale"""

Prerequisites for running the examples

  1. Install MLflow and required packages

    pip install --upgrade "mlflow[databricks]>=3.1.0"
    
  2. Create an MLflow experiment by following the setup your environment quickstart.

Direct SDK Usage

from mlflow.genai.judges import is_grounded

# Example 1: Response is grounded in context
feedback = is_grounded(
    request="What is the capital of France?",
    response="Paris",
    context=[
        {"content": "Paris is the capital of France."},
        {"content": "Paris is known for its Eiffel Tower."}
    ]
)
print(feedback.value)  # "yes"
print(feedback.rationale)  # Explanation of groundedness

# Example 2: Response contains hallucination
feedback = is_grounded(
    request="What is the capital of France?",
    response="Paris, which has a population of 10 million people",
    context=[
        {"content": "Paris is the capital of France."}
    ]
)
print(feedback.value)  # "no"
print(feedback.rationale)  # Identifies unsupported claim about population
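
The context argument isn't limited to a list of documents, and the optional name argument controls how the feedback is labeled in the MLflow UIs. A hedged sketch (the custom name "context_groundedness" is only an illustration):

# Example 3: plain-string context with a custom feedback name
feedback = is_grounded(
    request="When was MLflow open-sourced?",
    response="MLflow was open-sourced in 2018.",
    context="MLflow was open-sourced by Databricks in June 2018.",
    name="context_groundedness"
)
print(feedback.value)  # expected: "yes"
print(feedback.rationale)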

Using the prebuilt scorer

The is_grounded judge is available through the RetrievalGroundedness prebuilt scorer.

Requirements:

  • Trace requirements:
    • The MLflow Trace must contain at least one span with span_type set to RETRIEVER
    • inputs and outputs must be on the Trace's root span (a minimal compliant trace is sketched below)
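
In sketch form, a trace that meets these requirements has a root span that records the app's inputs and outputs, plus a child span with span_type RETRIEVER whose outputs serve as the grounding context. The fuller example in step 2 below follows the same shape; the function names and placeholder content here are illustrative:

import mlflow
from mlflow.entities import Document
from typing import List

@mlflow.trace(span_type="RETRIEVER")
def retrieve(query: str) -> List[Document]:
    # Child RETRIEVER span: its outputs become the grounding context
    return [Document(id="doc_1", page_content="...")]

@mlflow.trace
def app(query: str) -> dict:
    # Root span: its inputs and outputs are what the judge evaluates
    docs = retrieve(query)
    return {"response": "..."}
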
  1. Initialize an OpenAI client to connect to either Databricks-hosted LLMs or LLMs hosted by OpenAI.

    Databricks-hosted LLMs

    Use MLflow to get an OpenAI client that connects to Databricks-hosted LLMs. Select a model from the available foundation models.

    import mlflow
    from databricks.sdk import WorkspaceClient
    
    # Enable MLflow's autologging to instrument your application with Tracing
    mlflow.openai.autolog()
    
    # Set up MLflow tracking to Databricks
    mlflow.set_tracking_uri("databricks")
    mlflow.set_experiment("/Shared/docs-demo")
    
    # Create an OpenAI client that is connected to Databricks-hosted LLMs
    w = WorkspaceClient()
    client = w.serving_endpoints.get_open_ai_client()
    
    # Select an LLM
    model_name = "databricks-claude-sonnet-4"
    

    OpenAI-hosted LLMs

    Use the native OpenAI SDK to connect to OpenAI-hosted models. Select a model from the available OpenAI models.

    import mlflow
    import os
    import openai
    
    # Ensure your OPENAI_API_KEY is set in your environment
    # os.environ["OPENAI_API_KEY"] = "<YOUR_API_KEY>" # Uncomment and set if not globally configured
    
    # Enable auto-tracing for OpenAI
    mlflow.openai.autolog()
    
    # Set up MLflow tracking to Databricks
    mlflow.set_tracking_uri("databricks")
    mlflow.set_experiment("/Shared/docs-demo")
    
    # Create an OpenAI client connected to OpenAI-hosted models
    client = openai.OpenAI()
    
    # Select an LLM
    model_name = "gpt-4o-mini"
    
  2. Use the prebuilt scorer:

    from mlflow.genai.scorers import RetrievalGroundedness
    from mlflow.entities import Document
    from typing import List
    
    
    # Define a retriever function with proper span type
    @mlflow.trace(span_type="RETRIEVER")
    def retrieve_docs(query: str) -> List[Document]:
        # Simulated retrieval based on query
        if "mlflow" in query.lower():
            return [
                Document(
                    id="doc_1",
                    page_content="MLflow is an open-source platform for managing the ML lifecycle.",
                    metadata={"source": "mlflow_docs.txt"}
                ),
                Document(
                    id="doc_2",
                    page_content="MLflow provides tools for experiment tracking, model packaging, and deployment.",
                    metadata={"source": "mlflow_features.txt"}
                )
            ]
        else:
            return [
                Document(
                    id="doc_3",
                    page_content="Machine learning involves training models on data.",
                    metadata={"source": "ml_basics.txt"}
                )
            ]
    
    # Define your RAG app
    @mlflow.trace
    def rag_app(query: str):
        # Retrieve relevant documents
        docs = retrieve_docs(query)
        context = "\n".join([doc.page_content for doc in docs])
    
        # Generate response using LLM
        messages = [
            {"role": "system", "content": f"Answer based on this context: {context}"},
            {"role": "user", "content": query}
        ]
    
        response = client.chat.completions.create(
            # model_name was selected in step 1 (a Databricks-hosted LLM such as Claude, or an OpenAI model such as gpt-4o-mini)
            model=model_name,
            messages=messages
        )
    
        return {"response": response.choices[0].message.content}
    
    # Create evaluation dataset
    eval_dataset = [
        {
            "inputs": {"query": "What is MLflow used for?"}
        },
        {
            "inputs": {"query": "What are the main features of MLflow?"}
        }
    ]
    
    # Run evaluation with RetrievalGroundedness scorer
    eval_results = mlflow.genai.evaluate(
        data=eval_dataset,
        predict_fn=rag_app,
        scorers=[RetrievalGroundedness()]
    )
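
    You can also spot-check a single query outside of mlflow.genai.evaluate by calling the judge directly on the retriever output and the app's response. This sketch reuses retrieve_docs and rag_app from above and is illustrative only:

    from mlflow.genai.judges import is_grounded

    query = "What is MLflow used for?"
    docs = retrieve_docs(query)   # also traced as a RETRIEVER span
    result = rag_app(query)       # runs its own retrieval internally

    feedback = is_grounded(
        request=query,
        response=result["response"],
        context=[{"content": doc.page_content} for doc in docs],
    )
    print(feedback.value)
    print(feedback.rationale)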
    

Using in a custom scorer

When evaluating applications whose data structures don't meet the predefined scorer's requirements, wrap the judge in a custom scorer:

import mlflow
from mlflow.genai.judges import is_grounded
from mlflow.genai.scorers import scorer
from typing import Dict, Any

eval_dataset = [
    {
        "inputs": {"query": "What is MLflow used for?"},
        "outputs": {
            "response": "MLflow is used for managing the ML lifecycle, including experiment tracking and model deployment.",
            "retrieved_context": [
                {"content": "MLflow is a platform for managing the ML lifecycle."},
                {"content": "MLflow includes capabilities for experiment tracking, model packaging, and deployment."}
            ]
        }
    },
    {
        "inputs": {"query": "Who created MLflow?"},
        "outputs": {
            "response": "MLflow was created by Databricks in 2018 and has over 10,000 contributors.",
            "retrieved_context": [
                {"content": "MLflow was created by Databricks."},
                {"content": "MLflow was open-sourced in 2018."}
            ]
        }
    }
]

@scorer
def groundedness_scorer(inputs: Dict[Any, Any], outputs: Dict[Any, Any]):
    return is_grounded(
        request=inputs["query"],
        response=outputs["response"],
        context=outputs["retrieved_context"]
    )

# Run evaluation
eval_results = mlflow.genai.evaluate(
    data=eval_dataset,
    scorers=[groundedness_scorer]
)
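
The same pattern adapts to other data layouts; only the field access inside the scorer changes. For example, if a dataset stored the retrieved context alongside the query under inputs rather than outputs (the field names below are hypothetical), the scorer would look like this:

@scorer
def groundedness_scorer_ctx_in_inputs(inputs: Dict[Any, Any], outputs: Dict[Any, Any]):
    # Hypothetical layout: the retrieved context travels with the query in `inputs`
    return is_grounded(
        request=inputs["query"],
        response=outputs["response"],
        context=inputs["retrieved_context"]
    )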

Interpreting Results

The judge returns a Feedback object with:

  • value: "yes" if the response is grounded, "no" if it contains hallucinations (see the usage sketch after this list)
  • rationale: Detailed explanation identifying:
    • Which statements are supported by context
    • Which statements lack support (hallucinations)
    • Specific quotes from context that support or contradict claims
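
Because value is the literal string "yes" or "no", downstream code that needs a boolean (for filtering or alerting) can compare against it directly. A minimal sketch using a feedback object returned by any of the calls above:

# Flag ungrounded responses for follow-up
if feedback.value == "no":
    print(f"Potential hallucination: {feedback.rationale}")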
