Generation Metrics¶
Metrics for evaluating text generation quality.
Available Metrics¶
| Metric | Measures | When to Use |
|---|---|---|
| BLEU | N-gram overlap | Translation-style tasks |
| METEOR | Unigram alignment (stems, synonyms) | Paraphrase-heavy outputs |
| ROUGE | N-gram recall | Summarization |
| BERTScore | Semantic similarity | Meaning preservation |
| SemScore | Embedding similarity | Semantic correctness |
| Response Relevancy | Question-answer alignment | RAGAS-style relevance checks |
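To make the n-gram metrics concrete, here is a minimal sketch of clipped n-gram precision, the core quantity behind BLEU (a simplification for illustration only; real BLEU also combines multiple n-gram orders and applies a brevity penalty):

```python
from collections import Counter

def ngram_precision(candidate: str, reference: str, n: int = 1) -> float:
    """Clipped n-gram precision: what fraction of the candidate's
    n-grams also appear in the reference (counts capped per n-gram)."""
    def ngrams(text: str, n: int) -> Counter:
        toks = text.split()
        return Counter(tuple(toks[i:i + n]) for i in range(len(toks) - n + 1))

    cand, ref = ngrams(candidate, n), ngrams(reference, n)
    if not cand:
        return 0.0
    overlap = sum(min(count, ref[gram]) for gram, count in cand.items())
    return overlap / sum(cand.values())

# Every candidate unigram appears in the reference:
ngram_precision("the cat sat", "the cat sat on the mat")  # -> 1.0
```

ROUGE inverts the direction (recall over the reference's n-grams), which is why it suits summarization, where covering the reference content matters more than avoiding extra words.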
Trust-Align exact refusal/correctness metrics are available as a plugin: Trust-Align Metrics Plugin.
Base Class¶
Custom metrics subclass `BaseGenerationMetricConfig` and implement `get_metric_func`, which returns the scoring callable:

```python
from dataclasses import dataclass

from autorag_research.evaluation.metrics import BaseGenerationMetricConfig


def my_metric_function(prediction: str, reference: str) -> float:
    """Placeholder scoring function; the (prediction, reference)
    signature is illustrative -- replace with your own logic."""
    return 1.0 if prediction == reference else 0.0


@dataclass
class MyMetricConfig(BaseGenerationMetricConfig):
    def get_metric_func(self):
        # Return the callable the evaluation pipeline will invoke.
        return my_metric_function
```
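The pattern can be exercised end to end with a self-contained sketch; note the base class below is a stand-in written for this example, not the library's actual `BaseGenerationMetricConfig`:

```python
from dataclasses import dataclass


@dataclass
class BaseGenerationMetricConfig:
    # Stand-in for the library base class, for illustration only.
    name: str = "base"

    def get_metric_func(self):
        raise NotImplementedError


def exact_match(prediction: str, reference: str) -> float:
    # Hypothetical metric: 1.0 when strings match exactly, else 0.0.
    return 1.0 if prediction.strip() == reference.strip() else 0.0


@dataclass
class ExactMatchConfig(BaseGenerationMetricConfig):
    name: str = "exact_match"

    def get_metric_func(self):
        return exact_match


config = ExactMatchConfig()
metric = config.get_metric_func()
score = metric("Paris", "Paris")  # -> 1.0
```

Keeping the config as a dataclass means metric hyperparameters (thresholds, model names) become fields that can be declared, serialized, and compared like any other configuration.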