BM25

Sparse retrieval based on term frequency.

Overview

Field      Value
Type       Retrieval
Algorithm  TF-IDF variant
Modality   Text
Paper      Robertson & Zaragoza, 2009

How It Works

Ranks documents by:

  1. Term frequency in document
  2. Inverse document frequency
  3. Document length normalization

Uses the VectorChord-BM25 PostgreSQL extension for efficient full-text search.
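
The three ranking factors listed above can be sketched in plain Python. This is a minimal illustration of one common BM25 formulation, not the code that VectorChord-BM25 runs. The function name `bm25_scores`, the parameter values `k1=1.2` and `b=0.75`, the smoothed IDF form, and the whitespace tokenization are all assumptions chosen for the example.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Score each tokenized document in `docs` against a tokenized `query`.

    k1 and b are commonly used defaults, assumed here for illustration.
    """
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: number of documents that contain each term.
    df = Counter(term for d in docs for term in set(d))

    scores = []
    for doc in docs:
        tf = Counter(doc)  # 1. term frequency in this document
        # 3. Length normalization: longer-than-average documents are penalized.
        norm = 1 - b + b * len(doc) / avg_len
        score = 0.0
        for term in query:
            if term not in tf:
                continue
            # 2. Inverse document frequency; the "1 +" keeps it non-negative.
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            score += idf * tf[term] * (k1 + 1) / (tf[term] + k1 * norm)
        scores.append(score)
    return scores


docs = [
    "postgres full text search".split(),
    "dense vector search with embeddings".split(),
    "cooking pasta at home".split(),
]
scores = bm25_scores("postgres search".split(), docs)
ranking = sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)
print(ranking)
```

The first document matches both query terms and ranks first; the third shares no terms with the query and scores zero.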

Configuration

_target_: autorag_research.pipelines.retrieval.bm25.BM25PipelineConfig
name: bm25
tokenizer: bert
top_k: 10
batch_size: 100

Options

Option      Type  Default         Description
name        str   required        Unique pipeline instance name
tokenizer   str   bert            Tokenization method
index_name  str   idx_chunk_bm25  BM25 index name in PostgreSQL
top_k       int   10              Results per query
batch_size  int   100             Queries per batch

Tokenizers

Tokenizer    Description
bert         BERT WordPiece
wiki_tocken  Wikipedia-based
gemma2b      Gemma 2B model
llmlingua2   LLMLingua2
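
The documented options and tokenizer names can be mirrored in a small dataclass to see how the defaults fit together. `BM25OptionsSketch` is a hypothetical stand-in written for this page, not the library's `BM25PipelineConfig`. The validation rules are also assumptions; the real class may check different things or nothing at all.

```python
from dataclasses import dataclass

# Tokenizer names copied from the Tokenizers table on this page.
KNOWN_TOKENIZERS = {"bert", "wiki_tocken", "gemma2b", "llmlingua2"}


@dataclass
class BM25OptionsSketch:
    """Hypothetical mirror of the documented options (not the library class)."""

    name: str                           # required, no default
    tokenizer: str = "bert"
    index_name: str = "idx_chunk_bm25"
    top_k: int = 10
    batch_size: int = 100

    def __post_init__(self):
        # Assumed checks, added only to make misconfiguration visible early.
        if self.tokenizer not in KNOWN_TOKENIZERS:
            raise ValueError(f"unknown tokenizer: {self.tokenizer!r}")
        if self.top_k < 1 or self.batch_size < 1:
            raise ValueError("top_k and batch_size must be positive")


opts = BM25OptionsSketch(name="bm25")
print(opts)
```

Only `name` has to be supplied; every other field falls back to the default shown in the Options table.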

When to Use

Good for:

  • Keyword queries
  • Exact term matching
  • Low latency requirements
  • No embedding model needed

Consider dense retrieval for:

  • Semantic similarity
  • Paraphrase matching
  • Multilingual queries
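
The split above follows from BM25 scoring only the terms that a query and a document share. The sketch below makes that concrete with a hypothetical helper, `shared_terms`, and naive lowercase whitespace tokenization (an assumption; the configured tokenizers split text differently).

```python
def shared_terms(query, document):
    """Return the lowercase whitespace tokens that appear in both texts."""
    return set(query.lower().split()) & set(document.lower().split())


doc = "How to repair a car engine"
keyword_query = "car engine repair"
paraphrase_query = "fixing an automobile motor"

keyword_overlap = shared_terms(keyword_query, doc)
paraphrase_overlap = shared_terms(paraphrase_query, doc)
print(sorted(keyword_overlap))
print(sorted(paraphrase_overlap))
```

The keyword query overlaps with the document on every term, while the paraphrase shares none, so a purely lexical scorer has nothing to rank it by even though the meaning is close.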

Citation

@article{robertson2009probabilistic,
  title={The probabilistic relevance framework: BM25 and beyond},
  author={Robertson, Stephen and Zaragoza, Hugo},
  journal={Foundations and trends{\textregistered} in information retrieval},
  volume={3},
  number={4},
  pages={333--389},
  year={2009},
  publisher={Now Publishers, Inc.}
}

@book{robertson1995okapi,
  title={Okapi at TREC-3},
  author={Robertson, Stephen E and Walker, Steve and Jones, Susan and Hancock-Beaulieu, Micheline M and Gatford, Mike},
  year={1995},
  publisher={British Library Research and Development Department}
}