Metric Plugins

Metric plugins add custom evaluation metrics to AutoRAG-Research. There are two types: retrieval metrics, which evaluate search quality, and generation metrics, which evaluate answer quality.

Type                Entry Point Group           Base Config Class            Implementation
Retrieval Metric    autorag_research.metrics    BaseRetrievalMetricConfig    Function-based
Generation Metric   autorag_research.metrics    BaseGenerationMetricConfig   Function-based

Both types follow the same pattern: a standalone metric function paired with a dataclass config that wraps it.

Scaffold

Use the CLI to generate a starter plugin:

# Retrieval metric
autorag-research plugin create my_recall --type=metric_retrieval

# Generation metric
autorag-research plugin create my_bleu --type=metric_generation

This creates a project directory with the config class, metric function stub, YAML config, pyproject.toml, and a basic test file.
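The exact layout can vary between versions, but based on the paths used later in this page, expect something roughly like:

my_recall/
├── pyproject.toml
├── src/
│   └── my_recall_plugin/
│       ├── __init__.py
│       ├── metric.py
│       └── retrieval/
│           └── my_recall.yaml
└── tests/
    └── test_metric.py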

First-Party Trust-Align Plugin

For Trust-Align exact refusal/correctness metrics, use the dedicated first-party plugin package rather than reimplementing those metrics yourself.

Retrieval Metric

A retrieval metric is a plain function that computes a score. The config class wraps it via get_metric_func().

from collections.abc import Callable
from dataclasses import dataclass

from autorag_research.config import BaseRetrievalMetricConfig


def my_recall_metric(**kwargs) -> float:
    """Compute custom recall metric."""
    # Your metric logic here; replace this placeholder with a real score
    score = 0.0
    return score


@dataclass
class MyRecallMetricConfig(BaseRetrievalMetricConfig):
    """Configuration for custom recall metric."""

    def get_metric_func(self) -> Callable:
        return my_recall_metric

Key points:

  • The metric is a standalone function, not a class method.
  • The config class wraps the function and exposes it through get_metric_func().
  • BaseRetrievalMetricConfig automatically sets metric_type = MetricType.RETRIEVAL.
  • get_metric_name() is inherited and returns the function name by default; override it to report a custom name, as sketched below.
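A minimal sketch of such an override, assuming get_metric_name() simply returns a display-name string (the class name here is hypothetical; it reuses my_recall_metric from above):

from collections.abc import Callable
from dataclasses import dataclass

from autorag_research.config import BaseRetrievalMetricConfig


@dataclass
class MyNamedRecallMetricConfig(BaseRetrievalMetricConfig):
    """Same wrapper as above, but with an explicit metric name."""

    def get_metric_func(self) -> Callable:
        return my_recall_metric

    def get_metric_name(self) -> str:
        # Reported in results instead of the function name "my_recall_metric"
        return "my_recall"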

Inherited Fields

BaseMetricConfig provides these fields to all metric configs:

Field         Type         Default    Description
description   str          ""         Optional description
metric_type   MetricType   Auto-set   RETRIEVAL or GENERATION

Override get_metric_kwargs() to pass extra arguments to the metric function at evaluation time.
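For example, a config can forward a tunable parameter to its metric function. This is a sketch: the top_k field is hypothetical, and it assumes get_metric_kwargs() returns a dict that is merged into the metric's keyword arguments.

from collections.abc import Callable
from dataclasses import dataclass

from autorag_research.config import BaseRetrievalMetricConfig


@dataclass
class MyRecallAtKMetricConfig(BaseRetrievalMetricConfig):
    """Recall config with a tunable cutoff."""

    top_k: int = 10  # hypothetical field, set here or from YAML

    def get_metric_func(self) -> Callable:
        return my_recall_metric

    def get_metric_kwargs(self) -> dict:
        # Passed to the metric function at evaluation time
        return {"top_k": self.top_k}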

Generation Metric

Generation metrics follow the same pattern but extend BaseGenerationMetricConfig.

from collections.abc import Callable
from dataclasses import dataclass

from autorag_research.config import BaseGenerationMetricConfig


def my_bleu_metric(**kwargs) -> float:
    """Compute custom BLEU metric."""
    # Your metric logic here; replace this placeholder with a real score
    score = 0.0
    return score


@dataclass
class MyBleuMetricConfig(BaseGenerationMetricConfig):
    """Configuration for custom BLEU metric."""

    def get_metric_func(self) -> Callable:
        return my_bleu_metric

The only difference is the base class: BaseGenerationMetricConfig sets metric_type = MetricType.GENERATION.

YAML Configuration

Each metric plugin ships a YAML config file in a subcategory directory.

Retrieval metric:

# src/my_recall_plugin/retrieval/my_recall.yaml
_target_: my_recall_plugin.metric.MyRecallMetricConfig
description: "Custom recall metric"

Generation metric:

# src/my_bleu_plugin/generation/my_bleu.yaml
_target_: my_bleu_plugin.metric.MyBleuMetricConfig
description: "Custom BLEU metric"

The _target_ field must be the fully-qualified path to the config class.
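If instantiation is Hydra-style (as the _target_ key suggests), any additional YAML keys should map onto fields of the config dataclass. For example, using the hypothetical MyRecallAtKMetricConfig and top_k field from the sketch above:

# src/my_recall_plugin/retrieval/my_recall.yaml
_target_: my_recall_plugin.metric.MyRecallAtKMetricConfig
description: "Custom recall metric with a cutoff"
top_k: 20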

Entry Points

Register the plugin in pyproject.toml so AutoRAG-Research can discover it:

[project.entry-points."autorag_research.metrics"]
my_recall = "my_recall_plugin"
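A generation metric plugin registers the same way, under the same entry point group:

[project.entry-points."autorag_research.metrics"]
my_bleu = "my_bleu_plugin"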

After editing pyproject.toml, reinstall the package (pip install -e .) and run autorag-research plugin sync to copy configs into the project.

Use in Experiment

Reference your metric by name in the experiment config:

# configs/experiment.yaml
metrics:
  retrieval:
    - recall       # built-in
    - my_recall    # your plugin
  generation:
    - rouge        # built-in
    - my_bleu      # your plugin

The metric name matches the entry point key defined in pyproject.toml.

Testing

Test that the config class instantiates correctly and returns a callable metric function:

from my_recall_plugin.metric import MyRecallMetricConfig


def test_metric_config():
    config = MyRecallMetricConfig()
    func = config.get_metric_func()
    assert func is not None
    assert callable(func)

For integration tests that call real APIs or require data, use the @pytest.mark.api or @pytest.mark.data markers.
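As a sketch, a data-marked test might exercise the metric end to end. The call below is a placeholder: the real keyword arguments depend on the evaluation contract your metric function expects.

import pytest

from my_recall_plugin.metric import MyRecallMetricConfig


@pytest.mark.data
def test_metric_on_sample_data():
    func = MyRecallMetricConfig().get_metric_func()
    # Placeholder call: pass whatever keyword arguments your metric expects
    score = func()
    assert 0.0 <= score <= 1.0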
