
BLEU

Bilingual Evaluation Understudy - measures n-gram precision.

Overview

Field             Value
Type              Generation
Range             [0, 1]
Higher is better  Yes

Description

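BLEU measures how many n-grams in the generated text also appear in the reference text. The precisions for each n-gram size (1 through max_ngram_order) are combined with a geometric mean and multiplied by a brevity penalty, which lowers the score of outputs shorter than the reference. BLEU was originally designed for machine translation evaluation.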
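To make the computation concrete, here is a minimal sketch of unsmoothed sentence-level BLEU against a single reference. It is an illustration only, not the code this metric runs: the function names `ngrams` and `simple_bleu` are hypothetical, and whitespace tokenization is an assumption made to keep the example short.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return a Counter of all n-grams (as tuples) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def simple_bleu(hypothesis, reference, max_order=4):
    """Unsmoothed sentence-level BLEU in [0, 1] against one reference."""
    hyp, ref = hypothesis.split(), reference.split()
    if not hyp:
        return 0.0
    log_precision_sum = 0.0
    for n in range(1, max_order + 1):
        hyp_counts, ref_counts = ngrams(hyp, n), ngrams(ref, n)
        total = sum(hyp_counts.values())
        # Clipped matches: an n-gram is credited at most as often
        # as it occurs in the reference.
        matched = sum(min(c, ref_counts[g]) for g, c in hyp_counts.items())
        if total == 0 or matched == 0:
            # Without smoothing, a single order with no matches zeroes the score.
            return 0.0
        log_precision_sum += math.log(matched / total)
    # Brevity penalty: hypotheses shorter than the reference are scaled down.
    if len(hyp) >= len(ref):
        brevity_penalty = 1.0
    else:
        brevity_penalty = math.exp(1 - len(ref) / len(hyp))
    # Geometric mean of the per-order precisions, times the brevity penalty.
    return brevity_penalty * math.exp(log_precision_sum / max_order)


reference = "the cat sat on the mat"
print(simple_bleu(reference, reference))
print(simple_bleu("a cat was sitting on the mat", reference, max_order=2))
print(simple_bleu("a cat was sitting on the mat", reference))
```

An exact copy of the reference scores 1.0. The reworded sentence gets a partial score when only unigrams and bigrams are counted, and drops to 0 at the default order of 4 because it shares no 4-gram with the reference. That zero is the case the smoothing option exists to soften.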

Configuration

_target_: autorag_research.evaluation.metrics.generation.BleuConfig
tokenize: default
smooth_method: exp
max_ngram_order: 4
effective_order: true

Options

Option           Type  Default  Description
tokenize         str   default  Tokenization method
smooth_method    str   exp      Smoothing for zero counts
max_ngram_order  int   4        Maximum n-gram size
effective_order  bool  true     Use effective order
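The sketch below illustrates what smooth_method and effective_order might do on short texts. This page does not say which BLEU backend is used. The option names match those of sacreBLEU, so the sketch assumes behaviour similar to sacreBLEU's options of the same name. The function names `count_ngrams` and `sketch_bleu` and the boolean `smooth_exp` parameter are hypothetical stand-ins, not part of this library.

```python
import math
from collections import Counter


def count_ngrams(tokens, n):
    """Count every n-gram (as a tuple) in a list of tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def sketch_bleu(hypothesis, reference, max_ngram_order=4,
                effective_order=True, smooth_exp=True):
    """Illustrative sentence-level BLEU in [0, 1]; not the library's code."""
    hyp, ref = hypothesis.split(), reference.split()
    if not hyp:
        return 0.0
    log_sum = 0.0
    orders_used = max_ngram_order
    smoothing_divisor = 1.0
    for n in range(1, max_ngram_order + 1):
        hyp_counts, ref_counts = count_ngrams(hyp, n), count_ngrams(ref, n)
        total = sum(hyp_counts.values())
        if total == 0:
            # The hypothesis is too short to contain n-grams of this size.
            if not effective_order:
                return 0.0  # the empty order counts as precision 0
            break  # assumed behaviour: average only over the orders seen so far
        if effective_order:
            orders_used = n
        matched = sum(min(c, ref_counts[g]) for g, c in hyp_counts.items())
        if matched > 0:
            precision = matched / total
        elif smooth_exp:
            # Assumed "exp" smoothing: each zero-match order gets a small
            # non-zero precision that halves with every further zero order.
            smoothing_divisor *= 2
            precision = 1.0 / (smoothing_divisor * total)
        else:
            return 0.0
        log_sum += math.log(precision)
    if len(hyp) >= len(ref):
        brevity = 1.0
    else:
        brevity = math.exp(1 - len(ref) / len(hyp))
    return brevity * math.exp(log_sum / orders_used)


# A three-token answer contains no 4-grams at all.
print(sketch_bleu("the cat sat", "the cat sat", effective_order=True))
print(sketch_bleu("the cat sat", "the cat sat", effective_order=False))

# One changed word removes every matching 4-gram.
print(sketch_bleu("the cat is on the mat", "the cat sat on the mat", smooth_exp=True))
print(sketch_bleu("the cat is on the mat", "the cat sat on the mat", smooth_exp=False))
```

Under these assumptions, a perfect three-token answer scores 1.0 with effective order and 0.0 without it, and the one-word substitution keeps a non-zero score only when smoothing is on. Both options matter most when outputs are scored one sentence at a time.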

When to Use

Good for:

  • Translation-style tasks
  • Tasks requiring exact phrase matching
  • Comparing output against a single reference

Limitations:

  • Doesn't capture semantic similarity
  • Penalizes paraphrases
  • Requires exact n-gram matches
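The limitations above can be seen by listing the word pairs two sentences share. The example is a hypothetical illustration with made-up sentences and whitespace tokenization. It does not call this library.

```python
def bigrams(text):
    """Return the set of adjacent word pairs in a whitespace-tokenized string."""
    tokens = text.lower().split()
    return set(zip(tokens, tokens[1:]))


reference = "the meeting was postponed until next week"
paraphrase = "they delayed the meeting by a week"

# Only surface-identical word pairs count as matches.
shared = bigrams(reference) & bigrams(paraphrase)
print(shared)
print(len(shared), "of", len(bigrams(paraphrase)), "bigrams match")
```

Both sentences describe a meeting being put off, yet only one of the paraphrase's six bigrams appears in the reference. An n-gram metric therefore scores the paraphrase far below a verbatim copy.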