BM25
Sparse lexical retrieval that ranks documents by term frequency, inverse document frequency, and document length.
Overview
| Field | Value |
|---|---|
| Type | Retrieval |
| Algorithm | TF-IDF variant |
| Modality | Text |
| Paper | Robertson & Zaragoza, 2009 |
How It Works
Ranks documents by:
- Term frequency in document
- Inverse document frequency
- Document length normalization
Uses the VectorChord-BM25 PostgreSQL extension for efficient full-text search.
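The three signals listed above can be sketched in plain Python. This is a minimal illustration of the classic Okapi BM25 formula, not the scoring code used by VectorChord-BM25. The `k1=1.2` and `b=0.75` values and the smoothed "+1" IDF variant are common textbook choices assumed here, and `bm25_scores` is a hypothetical helper name.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Score each tokenized document in `docs` against a tokenized `query`."""
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: number of documents that contain each term.
    doc_freq = Counter(term for d in docs for term in set(d))

    scores = []
    for doc in docs:
        tf = Counter(doc)
        # Length normalization: longer-than-average documents are penalized.
        norm = k1 * (1 - b + b * len(doc) / avg_len)
        score = 0.0
        for term in query:
            if term not in tf:
                continue
            # Smoothed IDF (assumed variant; the "+1" keeps it non-negative).
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            # Term frequency saturates as tf grows, controlled by k1.
            score += idf * tf[term] * (k1 + 1) / (tf[term] + norm)
        scores.append(score)
    return scores


docs = [
    "postgres full text search".split(),
    "dense vector search with embeddings".split(),
    "cooking pasta at home".split(),
]
scores = bm25_scores("postgres search".split(), docs)
ranking = sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)
```

The first document matches both query terms and ranks first. The third shares no terms with the query and scores zero.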
Configuration
_target_: autorag_research.pipelines.retrieval.bm25.BM25PipelineConfig
name: bm25
tokenizer: bert
top_k: 10
batch_size: 100
Options
| Option | Type | Default | Description |
|---|---|---|---|
| name | str | required | Unique pipeline instance name |
| tokenizer | str | bert | Tokenization method |
| index_name | str | idx_chunk_bm25 | BM25 index name in PostgreSQL |
| top_k | int | 10 | Results per query |
| batch_size | int | 100 | Queries per batch |
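As a rough illustration of how these options fit together, the sketch below mirrors the table in a dataclass and shows what `batch_size` means for a list of queries. `BM25Options` and `batches` are hypothetical names for this example, not the library's `BM25PipelineConfig`, and the positivity check is an assumption rather than documented behavior.

```python
from dataclasses import dataclass


@dataclass
class BM25Options:
    """Hypothetical stand-in for the options table; not the library's class."""

    name: str  # required: unique pipeline instance name
    tokenizer: str = "bert"
    index_name: str = "idx_chunk_bm25"
    top_k: int = 10
    batch_size: int = 100

    def __post_init__(self):
        # Assumed validation for illustration only.
        if not self.name:
            raise ValueError("name is required")
        if self.top_k < 1 or self.batch_size < 1:
            raise ValueError("top_k and batch_size must be positive")


def batches(queries, size):
    """Yield consecutive slices of at most `size` queries."""
    for start in range(0, len(queries), size):
        yield queries[start:start + size]


opts = BM25Options(name="bm25")
queries = [f"query {i}" for i in range(250)]
batch_sizes = [len(chunk) for chunk in batches(queries, opts.batch_size)]
```

With the default `batch_size` of 100, 250 queries are split into batches of 100, 100, and 50.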
Tokenizers
| Tokenizer | Description |
|---|---|
| bert | BERT WordPiece |
| wiki_tocken | Wikipedia-based |
| gemma2b | Gemma 2B model |
| llmlingua2 | LLMLingua2 |
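The tokenizer decides which units BM25 counts as terms, so it directly affects what can match. The sketch below shows the greedy longest-match-first splitting that WordPiece is known for, using a toy vocabulary. It is an illustration of the general technique only. The actual `bert` tokenizer uses the full BERT vocabulary and extra normalization, and `wordpiece` is a hypothetical helper name.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of one lowercase word into subwords."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        while end > start:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece  # continuation pieces carry a "##" prefix
            if piece in vocab:
                match = piece
                break
            end -= 1
        if match is None:
            return [unk]  # no piece fits: the whole word maps to unknown
        pieces.append(match)
        start = end
    return pieces


# Toy vocabulary for illustration only; the real BERT vocabulary is far larger.
vocab = {"retriev", "##al", "##ing", "rank", "##ed", "search"}
tokens = [p for w in "retrieval ranking searched".split() for p in wordpiece(w, vocab)]
```

Because `searched` is split into `search` and `##ed`, a query containing `search` can still match a document containing `searched` on the shared subword.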
When to Use
Good for:
- Keyword queries
- Exact term matching
- Low latency requirements
- Deployments without an embedding model
Consider dense retrieval for:
- Semantic similarity
- Paraphrase matching
- Multilingual queries
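The trade-off above follows from how BM25 scores: only query terms that literally occur in a document contribute. The short sketch below makes that concrete with plain whitespace tokens. `shared_terms` is a hypothetical helper, and real tokenizers would split text differently.

```python
def shared_terms(query, document):
    """Return the set of lowercase whitespace tokens common to both texts."""
    return set(query.lower().split()) & set(document.lower().split())


doc = "How to repair a car engine"
exact = shared_terms("car engine repair", doc)
paraphrase = shared_terms("fixing an automobile motor", doc)
```

The keyword query shares three terms with the document. The paraphrase shares none, so a purely lexical scorer gives it no credit even though the meaning is the same.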
Citation
@article{robertson2009probabilistic,
title={The probabilistic relevance framework: BM25 and beyond},
author={Robertson, Stephen and Zaragoza, Hugo and others},
journal={Foundations and trends{\textregistered} in information retrieval},
volume={3},
number={4},
pages={333--389},
year={2009},
publisher={Now Publishers, Inc.}
}
@book{robertson1995okapi,
title={Okapi at TREC-3},
author={Robertson, Stephen E and Walker, Steve and Jones, Susan and Hancock-Beaulieu, Micheline M and Gatford, Mike and others},
year={1995},
publisher={British Library Research and Development Department}
}