BARTScore¶

Directional semantic evaluation with a pretrained BART sequence-to-sequence model.

Overview¶

Field	Value
Type	Generation
Range	(-∞, 0]
Higher is better	Yes

Description¶

BARTScore treats evaluation as conditional text generation and scores the average token log-probability of one text given another. This makes it useful for direction-sensitive checks in RAG:

bart_score_faithfulness: retrieved context → answer
bart_score_precision: reference → answer
bart_score_recall: answer → reference
bart_score_f1: arithmetic mean of precision and recall

The implementation uses the paper's standard facebook/bart-large-cnn checkpoint by default and keeps the metric deterministic.

Installation¶

BARTScore uses local torch + transformers runtime dependencies.

pip install "autorag-research[gpu]"

In a development checkout, use:

uv sync --all-extras --all-groups

You can still run the metric on CPU by setting device: cpu; the gpu extra name reflects the shared optional dependency group, not a hard GPU requirement.

Configuration¶

_target_: autorag_research.evaluation.metrics.generation.BartScoreFaithfulnessConfig
checkpoint: facebook/bart-large-cnn
batch_size: 4
max_length: 1024
device: auto

Options¶

Option	Type	Default	Description
checkpoint	str	`facebook/bart-large-cnn`	Hugging Face BART checkpoint
batch_size	int	`4`	Pair scoring batch size
max_length	int	`1024`	Max tokenizer length for source and target
device	str	`auto`	`cuda`, `mps`, `cpu`, or automatic selection

Variant selection¶

Variant	YAML	Required fields	Perspective
Faithfulness	`bart_score_faithfulness.yaml`	`retrieved_contents`, `generated_texts`	Does the answer follow the retrieved context?
Precision	`bart_score_precision.yaml`	`generation_gt`, `generated_texts`	How well is the answer supported by the reference?
Recall	`bart_score_recall.yaml`	`generation_gt`, `generated_texts`	How much reference content is covered by the answer?
F1	`bart_score_f1.yaml`	`generation_gt`, `generated_texts`	Balanced semantic overlap

When multiple references are available, AutoRAG keeps the best per-example BARTScore for precision and recall before computing F1.

When to Use¶

Good for:

RAG faithfulness checks without an LLM judge
Complementing BERTScore / AlignScore-style semantic metrics
Separating support (reference → answer) from coverage (answer → reference)

Limitations:

Slower than lexical metrics
Requires a local BART checkpoint download on first use
Scores are negative log-likelihoods, so raw values are less intuitive than bounded metrics