BARTScore¶
Directional semantic evaluation with a pretrained BART sequence-to-sequence model.
Overview¶
| Field | Value |
|---|---|
| Type | Generation |
| Range | (-∞, 0] |
| Higher is better | Yes |
Description¶
BARTScore treats evaluation as conditional text generation and scores the average token log-probability of one text given another. This makes it useful for direction-sensitive checks in RAG:
bart_score_faithfulness: retrieved context → answerbart_score_precision: reference → answerbart_score_recall: answer → referencebart_score_f1: arithmetic mean of precision and recall
The implementation uses the paper's standard facebook/bart-large-cnn
checkpoint by default and keeps the metric deterministic.
Installation¶
BARTScore uses local torch + transformers runtime dependencies.
pip install "autorag-research[gpu]"
In a development checkout, use:
uv sync --all-extras --all-groups
You can still run the metric on CPU by setting device: cpu; the gpu extra
name reflects the shared optional dependency group, not a hard GPU requirement.
Configuration¶
_target_: autorag_research.evaluation.metrics.generation.BartScoreFaithfulnessConfig
checkpoint: facebook/bart-large-cnn
batch_size: 4
max_length: 1024
device: auto
Options¶
| Option | Type | Default | Description |
|---|---|---|---|
| checkpoint | str | facebook/bart-large-cnn |
Hugging Face BART checkpoint |
| batch_size | int | 4 |
Pair scoring batch size |
| max_length | int | 1024 |
Max tokenizer length for source and target |
| device | str | auto |
cuda, mps, cpu, or automatic selection |
Variant selection¶
| Variant | YAML | Required fields | Perspective |
|---|---|---|---|
| Faithfulness | bart_score_faithfulness.yaml |
retrieved_contents, generated_texts |
Does the answer follow the retrieved context? |
| Precision | bart_score_precision.yaml |
generation_gt, generated_texts |
How well is the answer supported by the reference? |
| Recall | bart_score_recall.yaml |
generation_gt, generated_texts |
How much reference content is covered by the answer? |
| F1 | bart_score_f1.yaml |
generation_gt, generated_texts |
Balanced semantic overlap |
When multiple references are available, AutoRAG keeps the best per-example BARTScore for precision and recall before computing F1.
When to Use¶
Good for:
- RAG faithfulness checks without an LLM judge
- Complementing BERTScore / AlignScore-style semantic metrics
- Separating support (
reference → answer) from coverage (answer → reference)
Limitations:
- Slower than lexical metrics
- Requires a local BART checkpoint download on first use
- Scores are negative log-likelihoods, so raw values are less intuitive than bounded metrics