BM25¶

Sparse lexical retrieval that ranks documents by term frequency, inverse document frequency, and document length.

Overview¶

Field	Value
Type	Retrieval
Algorithm	TF-IDF variant
Modality	Text
Paper	Robertson & Zaragoza, 2009

How It Works¶

Ranks documents by:

Term frequency in document
Inverse document frequency
Document length normalization
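The three factors above can be sketched in plain Python. This is a minimal illustration of the textbook BM25 formula, not the code that runs inside the PostgreSQL extension. The function name `bm25_scores`, the whitespace tokenization, the parameter values `k1=1.2` and `b=0.75`, and the non-negative IDF variant are all assumptions chosen for the example.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Score each tokenized document in `docs` against the tokenized `query`.

    k1 and b are commonly used values, assumed here for illustration.
    """
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: number of documents that contain each term.
    df = Counter(term for d in docs for term in set(d))

    scores = []
    for doc in docs:
        tf = Counter(doc)
        # Length normalization: documents longer than average get a larger
        # denominator, so each term occurrence counts for less.
        norm = k1 * (1 - b + b * len(doc) / avg_len)
        score = 0.0
        for term in query:
            if tf[term] == 0:
                continue
            # Inverse document frequency: rare terms weigh more.
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Term frequency with saturation controlled by k1.
            score += idf * tf[term] * (k1 + 1) / (tf[term] + norm)
        scores.append(score)
    return scores


docs = [
    "the cat sat on the mat".split(),
    "dogs and cats are common pets".split(),
    "the quick brown fox".split(),
]
scores = bm25_scores("cat mat".split(), docs)
# Document indices ordered from highest to lowest score.
ranking = sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)
```

Only the first document contains the exact terms `cat` and `mat`, so it is the only one with a non-zero score. The second document contains `cats`, which does not match `cat` under whitespace tokenization.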

Uses the VectorChord-BM25 PostgreSQL extension for efficient full-text search.

Configuration¶

_target_: autorag_research.pipelines.retrieval.bm25.BM25PipelineConfig
name: bm25
tokenizer: bert
top_k: 10
batch_size: 100

Options¶

Option	Type	Default	Description
name	str	required	Unique pipeline instance name
tokenizer	str	`bert`	Tokenization method
index_name	str	`idx_chunk_bm25`	BM25 index name in PostgreSQL
top_k	int	10	Results per query
batch_size	int	100	Queries per batch
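To make the defaults concrete, the options in the table can be mirrored in a small dataclass. `BM25Options` and `batches` are hypothetical names invented for this sketch; they are not the library's `BM25PipelineConfig`, and the validation rule and batching behavior are assumptions about how such options would typically be used.

```python
from dataclasses import dataclass
from typing import Iterator, List


@dataclass
class BM25Options:
    """Hypothetical stand-in for the documented options (not the real config class)."""

    name: str  # required, no default
    tokenizer: str = "bert"
    index_name: str = "idx_chunk_bm25"
    top_k: int = 10
    batch_size: int = 100

    def __post_init__(self) -> None:
        # Assumed sanity check: both counts must be positive.
        if self.top_k < 1 or self.batch_size < 1:
            raise ValueError("top_k and batch_size must be positive")


def batches(queries: List[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of at most `size` queries."""
    for start in range(0, len(queries), size):
        yield queries[start:start + size]


opts = BM25Options(name="bm25")
queries = [f"query {i}" for i in range(250)]
batch_sizes = [len(batch) for batch in batches(queries, opts.batch_size)]
```

Under these assumptions, 250 queries with the default `batch_size` of 100 are processed as three batches of 100, 100, and 50 queries, and each query returns up to `top_k` results.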

Tokenizers¶

Tokenizer	Description
bert	BERT WordPiece
wiki_tocken	Wikipedia-based
gemma2b	Gemma 2B model
llmlingua2	LLMLingua2
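The tokenizer decides which terms BM25 can match. As an illustration of the `bert` option, the sketch below implements the greedy longest-match-first splitting that WordPiece uses, on a toy vocabulary invented for this example. The function name `wordpiece` is hypothetical, and a real BERT vocabulary holds roughly 30,000 pieces and is applied after lowercasing and punctuation splitting.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of one lowercase word into subword pieces."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, then shrink from the right.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                # Pieces that continue a word carry a "##" prefix.
                candidate = "##" + candidate
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            # No piece matches at this position: the whole word is unknown.
            return [unk]
        pieces.append(piece)
        start = end
    return pieces


# Toy vocabulary, invented for this example only.
vocab = {"token", "##ize", "##r", "##s", "search"}
tokens = wordpiece("tokenizers", vocab)
unknown = wordpiece("xyz", vocab)
```

Because matching happens on pieces rather than whole words, a query containing `tokenize` and a document containing `tokenizers` would share the pieces `token` and `##ize` under this toy vocabulary.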

When to Use¶

Good for:

Keyword queries
Exact term matching
Low latency requirements
Deployments without an embedding model

Consider dense retrieval for:

Semantic similarity
Paraphrase matching
Multilingual queries
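The trade-off above comes down to term overlap: a lexical scorer can only reward terms that appear in both the query and the document. The snippet below is a small illustration with made-up strings and a hypothetical helper `shared_terms`; it uses whitespace tokenization rather than any of the tokenizers listed earlier.

```python
def shared_terms(query, document):
    """Return the lowercase whitespace tokens that appear in both strings."""
    return set(query.lower().split()) & set(document.lower().split())


doc = "how to reset a forgotten password"
keyword_query = "reset password"
paraphrase_query = "recover account login credentials"

keyword_overlap = shared_terms(keyword_query, doc)
paraphrase_overlap = shared_terms(paraphrase_query, doc)
```

The keyword query shares two terms with the document, so BM25 has something to score. The paraphrase shares none, so a purely lexical ranker would give this document a score of zero even though it answers the question.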

Citation¶

@article{robertson2009probabilistic,
  title={The probabilistic relevance framework: BM25 and beyond},
  author={Robertson, Stephen and Zaragoza, Hugo},
  journal={Foundations and Trends{\textregistered} in Information Retrieval},
  volume={3},
  number={4},
  pages={333--389},
  year={2009},
  publisher={Now Publishers, Inc.}
}

@book{robertson1995okapi,
  title={Okapi at TREC-3},
  author={Robertson, Stephen E and Walker, Steve and Jones, Susan and Hancock-Beaulieu, Micheline M and Gatford, Mike},
  year={1995},
  publisher={British Library Research and Development Department}
}