Metric Plugins¶
Metric plugins add custom evaluation metrics to AutoRAG-Research. There are two types: retrieval metrics evaluate search quality, and generation metrics evaluate answer quality.
| Type | Entry Point Group | Base Config Class | Implementation |
|---|---|---|---|
| Retrieval Metric | `autorag_research.metrics` | `BaseRetrievalMetricConfig` | Function-based |
| Generation Metric | `autorag_research.metrics` | `BaseGenerationMetricConfig` | Function-based |
Both types follow the same pattern: a standalone metric function paired with a dataclass config that wraps it.
Scaffold¶
Use the CLI to generate a starter plugin:
```bash
# Retrieval metric
autorag-research plugin create my_recall --type=metric_retrieval

# Generation metric
autorag-research plugin create my_bleu --type=metric_generation
```
This creates a project directory with the config class, metric function stub, YAML config, `pyproject.toml`, and a basic test file.
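The generated layout looks roughly like this (a sketch inferred from the paths used later on this page; the scaffold's exact file names may differ):

```
my_recall/
├── pyproject.toml                  # entry-point registration
├── src/
│   └── my_recall_plugin/
│       ├── metric.py               # config class + metric function stub
│       └── retrieval/
│           └── my_recall.yaml      # YAML config
└── tests/
    └── test_metric.py              # basic test file
```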
First-Party Trust-Align Plugin¶
For Trust-Align exact refusal/correctness metrics, use the dedicated first-party plugin package rather than reimplementing them as a custom metric.
Retrieval Metric¶
A retrieval metric is a plain function that computes a score. The config class wraps it via `get_metric_func()`.
```python
from collections.abc import Callable
from dataclasses import dataclass

from autorag_research.config import BaseRetrievalMetricConfig


def my_recall_metric(**kwargs) -> float:
    """Compute custom recall metric."""
    score = 0.0  # Your metric logic here
    return score


@dataclass
class MyRecallMetricConfig(BaseRetrievalMetricConfig):
    """Configuration for custom recall metric."""

    def get_metric_func(self) -> Callable:
        return my_recall_metric
```
Key points:

- The metric is a standalone function, not a class method.
- The config class wraps the function and exposes it through `get_metric_func()`.
- `BaseRetrievalMetricConfig` automatically sets `metric_type = MetricType.RETRIEVAL`.
- `get_metric_name()` is inherited and returns the function name by default.
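For instance, the inherited behavior can be checked directly (a quick sketch; it assumes `MetricType` is exported from `autorag_research.config` alongside the base config classes):

```python
from autorag_research.config import MetricType  # assumed import path

config = MyRecallMetricConfig()
print(config.get_metric_name())  # "my_recall_metric", derived from the function name
assert config.metric_type == MetricType.RETRIEVAL  # auto-set by the base class
```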
Inherited Fields¶
`BaseMetricConfig` provides these fields to all metric configs:

| Field | Type | Default | Description |
|---|---|---|---|
| `description` | `str` | `""` | Optional description |
| `metric_type` | `MetricType` | Auto-set | `RETRIEVAL` or `GENERATION` |
Override `get_metric_kwargs()` to pass extra arguments to the metric function at evaluation time.
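For example, a config could forward a cutoff to its metric function (a minimal sketch; the `top_k` field is a hypothetical addition, not an inherited field, and the plain-`dict` return type is assumed):

```python
from dataclasses import dataclass

from autorag_research.config import BaseRetrievalMetricConfig


@dataclass
class MyRecallAtKConfig(BaseRetrievalMetricConfig):
    """Hypothetical recall@k config that forwards a cutoff."""

    top_k: int = 10  # hypothetical extra field

    def get_metric_kwargs(self) -> dict:
        # Merged into the metric function call at evaluation time.
        return {"top_k": self.top_k}
```

The metric function then receives `top_k` through its `**kwargs`.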
Generation Metric¶
Generation metrics follow the same pattern but extend `BaseGenerationMetricConfig`.
```python
from collections.abc import Callable
from dataclasses import dataclass

from autorag_research.config import BaseGenerationMetricConfig


def my_bleu_metric(**kwargs) -> float:
    """Compute custom BLEU metric."""
    score = 0.0  # Your metric logic here
    return score


@dataclass
class MyBleuMetricConfig(BaseGenerationMetricConfig):
    """Configuration for custom BLEU metric."""

    def get_metric_func(self) -> Callable:
        return my_bleu_metric
```
The only difference is the base class: `BaseGenerationMetricConfig` sets `metric_type = MetricType.GENERATION`.
YAML Configuration¶
Each metric plugin ships a YAML config file in a subcategory directory.
Retrieval metric:
```yaml
# src/my_recall_plugin/retrieval/my_recall.yaml
_target_: my_recall_plugin.metric.MyRecallMetricConfig
description: "Custom recall metric"
```

Generation metric:

```yaml
# src/my_bleu_plugin/generation/my_bleu.yaml
_target_: my_bleu_plugin.metric.MyBleuMetricConfig
description: "Custom BLEU metric"
```

The `_target_` field must be the fully-qualified path to the config class.
Entry Points¶
Register the plugin in pyproject.toml so AutoRAG-Research can discover it:
```toml
[project.entry-points."autorag_research.metrics"]
my_recall = "my_recall_plugin"
```
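If the generation metric ships as its own package, register it under the same group (assuming a separate `my_bleu_plugin` distribution, as in the YAML example above):

```toml
[project.entry-points."autorag_research.metrics"]
my_bleu = "my_bleu_plugin"
```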
After editing `pyproject.toml`, reinstall the package (`pip install -e .`) and run `autorag-research plugin sync` to copy configs into the project.
Use in Experiment¶
Reference your metric by name in the experiment config:
```yaml
# configs/experiment.yaml
metrics:
  retrieval:
    - recall     # built-in
    - my_recall  # your plugin
  generation:
    - rouge      # built-in
    - my_bleu    # your plugin
```
The metric name matches the entry point key defined in `pyproject.toml`.
Testing¶
Test that the config class instantiates correctly and returns a callable metric function:
```python
from my_recall_plugin.metric import MyRecallMetricConfig


def test_metric_config():
    config = MyRecallMetricConfig()
    func = config.get_metric_func()
    assert func is not None
    assert callable(func)
```
For integration tests that call real APIs or require data, use the `@pytest.mark.api` or `@pytest.mark.data` markers.
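For example, a data-marked test might run the metric end to end (a sketch; the `sample_retrieval_data` fixture and the exact kwargs the metric expects are hypothetical):

```python
import pytest

from my_recall_plugin.metric import MyRecallMetricConfig


@pytest.mark.data
def test_metric_on_sample_data(sample_retrieval_data):  # hypothetical fixture
    func = MyRecallMetricConfig().get_metric_func()
    score = func(**sample_retrieval_data)
    assert 0.0 <= score <= 1.0  # recall is bounded in [0, 1]
```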
Next¶
- Retrieval Pipeline -- build a custom retrieval pipeline plugin
- Generation Pipeline -- build a custom generation pipeline plugin
- Best Practices -- naming, security, and common pitfalls