# Generation Metrics

Metrics for evaluating text generation quality.

## Available Metrics

| Metric | Measures | When to Use |
|--------|----------|-------------|
| BLEU | N-gram overlap | Translation-style tasks |
| METEOR | Alignment | Better for paraphrases |
| ROUGE | N-gram recall | Summarization |
| BERTScore | Semantic similarity | Meaning preservation |
| SemScore | Embedding similarity | Semantic correctness |
| Response Relevancy | Question-answer alignment | RAGAS-style relevance checks |

Trust-Align exact refusal/correctness metrics are available as a plugin: Trust-Align Metrics Plugin.
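To make the n-gram overlap idea behind BLEU and ROUGE concrete, here is a minimal sketch of modified unigram precision (BLEU's core building block). This is an illustrative standalone function, not the library's implementation:

```python
from collections import Counter


def unigram_precision(candidate: str, reference: str) -> float:
    """Fraction of candidate tokens that also appear in the reference,
    with counts clipped to the reference (BLEU's modified precision)."""
    cand_tokens = candidate.split()
    if not cand_tokens:
        return 0.0
    ref_counts = Counter(reference.split())
    cand_counts = Counter(cand_tokens)
    # Clip each token's count by how often it occurs in the reference.
    overlap = sum(min(n, ref_counts[tok]) for tok, n in cand_counts.items())
    return overlap / len(cand_tokens)


print(unigram_precision("the cat sat", "the cat sat on the mat"))  # 1.0
```

Full BLEU additionally combines precisions over higher-order n-grams and applies a brevity penalty, but the clipped-overlap idea is the same.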

## Base Class

Custom metrics subclass `BaseGenerationMetricConfig` and return the metric callable from `get_metric_func`:

```python
from dataclasses import dataclass

from autorag_research.evaluation.metrics import BaseGenerationMetricConfig


@dataclass
class MyMetricConfig(BaseGenerationMetricConfig):
    def get_metric_func(self):
        # Return the callable that computes the metric scores.
        # `my_metric_function` is user-defined.
        return my_metric_function
```
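As a sketch of what such a metric callable might look like, here is a simple exact-match scorer in plain Python. The signature (a prediction string and a reference string, returning a float) is an assumption for illustration; the exact protocol expected by `get_metric_func` depends on the library:

```python
def exact_match(prediction: str, reference: str) -> float:
    """Return 1.0 if the normalized prediction equals the reference, else 0.0."""
    return float(prediction.strip().lower() == reference.strip().lower())


print(exact_match("Paris", "paris "))  # 1.0
print(exact_match("Paris", "London"))  # 0.0
```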