Hybrid Retrieval

Combine the results of two retrieval pipelines by fusing their rankings or relevance scores. Two fusion strategies are available: Reciprocal Rank Fusion (RRF) and Convex Combination (CC, a weighted sum of normalized scores).

Overview

Field	Value
Type	Retrieval
Methods	RRF, Convex Combination
Pipelines	Any 2 retrieval pipelines

Methods

Reciprocal Rank Fusion (RRF)

Combines results based on rank positions, ignoring raw scores.

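Formula: RRF(d) = sum over pipelines i of 1/(k + rank_i(d)), where rank_i(d) is the rank of document d in pipeline i and k is the rrf_k constant.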

Advantages:

Score-scale independent
Robust to different retrieval methods
No normalization needed

Missing Document Handling:

Documents that appear in only one pipeline are assigned rank fetch_k + 1 for the missing pipeline, where fetch_k is the number of results fetched from each sub-pipeline (top_k * fetch_k_multiplier). This gives them a small but non-zero contribution from that pipeline: 1/(k + fetch_k + 1). As a result, a document that ranks highly in one pipeline but is absent from the other is not unfairly penalized.
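The RRF formula and the missing-document rule can be sketched in a few lines. This is a minimal illustration, not the library's implementation: the function name rrf_fuse, its arguments, 1-based ranks, and the tie-break by document id are assumptions of this sketch.

```python
def rrf_fuse(ranked_1, ranked_2, rrf_k=60, fetch_k=20, top_k=10):
    """Fuse two ranked lists of document ids with Reciprocal Rank Fusion.

    ranked_1 and ranked_2 are lists of doc ids, best first.
    Ranks are assumed to be 1-based (hypothetical choice of this sketch).
    """
    # A document missing from one list is treated as rank fetch_k + 1 there.
    missing_rank = fetch_k + 1

    # Map each doc id to its rank within each pipeline.
    rankings = [
        {doc_id: rank for rank, doc_id in enumerate(ranked, start=1)}
        for ranked in (ranked_1, ranked_2)
    ]

    scores = {}
    for doc_id in set(rankings[0]) | set(rankings[1]):
        scores[doc_id] = sum(
            1.0 / (rrf_k + ranks.get(doc_id, missing_rank)) for ranks in rankings
        )

    # Highest fused score first; ties broken by doc id for determinism.
    fused = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return fused[:top_k]


vector_hits = ["a", "b", "c"]
bm25_hits = ["b", "d", "a"]
fused = rrf_fuse(vector_hits, bm25_hits, rrf_k=60, fetch_k=3, top_k=4)
fused_ids = [doc_id for doc_id, _ in fused]
```

In this example "c" and "d" each appear in only one list, so each receives 1/(60 + 4) from the pipeline that did not return it.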

Convex Combination (CC)

Combines normalized scores with configurable weights.

Formula: combined = weight * norm(score_1) + (1-weight) * norm(score_2)

Normalization Methods:

Method	Description	Missing Score Floor
`mm`	Min-max scaling to [0, 1] using actual min/max	0.0
`tmm`	Theoretical min with actual max (e.g., BM25 min=0, cosine min=-1)	0.0
`z`	Z-score standardization	-3.0
`dbsf`	Distribution-based score fusion: scaling with mean ± 3 standard deviations as the bounds	0.0
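The four methods in the table can be sketched as follows. This is a minimal illustration, not the library's implementation: the function name normalize, the theoretical_min parameter, the use of the population standard deviation, and the handling of identical scores are assumptions of this sketch. dbsf is taken here as scaling between mean - 3*std and mean + 3*std, without clipping.

```python
import statistics


def normalize(scores, method="mm", theoretical_min=None):
    """Normalize a list of raw scores with one of: mm, tmm, z, dbsf."""
    if not scores:
        return []
    mean = statistics.fmean(scores)
    std = statistics.pstdev(scores)  # population std (assumption)

    if method == "mm":
        # Actual min and actual max.
        low, span = min(scores), max(scores) - min(scores)
    elif method == "tmm":
        # Theoretical min (e.g. 0 for BM25, -1 for cosine) and actual max.
        if theoretical_min is None:
            raise ValueError("tmm needs a theoretical minimum score")
        low, span = theoretical_min, max(scores) - theoretical_min
    elif method == "z":
        # Z-score: distance from the mean in standard deviations.
        low, span = mean, std
    elif method == "dbsf":
        # Bounds at mean - 3*std and mean + 3*std.
        low, span = mean - 3 * std, 6 * std
    else:
        raise ValueError(f"unknown method: {method}")

    if span == 0:
        # All scores identical: return zeros (hypothetical choice).
        return [0.0 for _ in scores]
    return [(s - low) / span for s in scores]


raw = [2.0, 4.0, 6.0]
mm_scores = normalize(raw, "mm")
tmm_scores = normalize(raw, "tmm", theoretical_min=0.0)
z_scores = normalize(raw, "z")
dbsf_scores = normalize(raw, "dbsf")
```

With these inputs the middle score sits at 0.5 under mm and dbsf and at 0.0 under z, which shows why the missing-score floor for z differs from the other methods.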

Missing Document Handling:

Documents that appear in only one pipeline receive a method-specific floor value for the missing score, applied after normalization:

Missing scores are excluded from normalization statistics (min/max/mean/std)
After normalization, missing scores are replaced with method-specific floor values
For z-score, -3.0 represents 3 standard deviations below the mean (very low relevance)
For other methods, 0.0 represents the minimum of the normalized range
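The steps above can be sketched for min-max normalization as follows. This is an illustration under stated assumptions rather than the library's code: the names MISSING_FLOOR, min_max, and cc_fuse, the dict-based inputs, and the tie-break by document id are hypothetical. Each pipeline is normalized using only the scores it actually returned, and the floor is substituted afterwards.

```python
# Floor values for missing scores, taken from the normalization table.
MISSING_FLOOR = {"mm": 0.0, "tmm": 0.0, "z": -3.0, "dbsf": 0.0}


def min_max(scores):
    """Min-max scale a non-empty dict of doc_id -> score.

    Only scores that are present take part in the min/max statistics.
    """
    low = min(scores.values())
    span = max(scores.values()) - low
    return {d: ((s - low) / span if span else 0.0) for d, s in scores.items()}


def cc_fuse(scores_1, scores_2, weight=0.5, top_k=10):
    """combined = weight * norm(score_1) + (1 - weight) * norm(score_2)."""
    norm_1, norm_2 = min_max(scores_1), min_max(scores_2)
    floor = MISSING_FLOOR["mm"]

    combined = {}
    for doc_id in set(norm_1) | set(norm_2):
        # A missing score is replaced by the floor after normalization.
        s1 = norm_1.get(doc_id, floor)
        s2 = norm_2.get(doc_id, floor)
        combined[doc_id] = weight * s1 + (1 - weight) * s2

    ranked = sorted(combined.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:top_k]


vector_scores = {"a": 1.0, "b": 0.75, "c": 0.5}
bm25_scores = {"b": 12.0, "d": 8.0, "a": 4.0}
cc_result = cc_fuse(vector_scores, bm25_scores, weight=0.6, top_k=4)
cc_ids = [doc_id for doc_id, _ in cc_result]
```

Here "d" is absent from the vector results, so its combined score is 0.6 * 0.0 + 0.4 * 0.5 = 0.2.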

Configuration

RRF Pipeline

_target_: autorag_research.pipelines.retrieval.hybrid.HybridRRFRetrievalPipelineConfig
name: hybrid_rrf
retrieval_pipeline_1_name: vector_search
retrieval_pipeline_2_name: bm25
rrf_k: 60
fetch_k_multiplier: 2
top_k: 10

CC Pipeline

_target_: autorag_research.pipelines.retrieval.hybrid.HybridCCRetrievalPipelineConfig
name: hybrid_cc
retrieval_pipeline_1_name: vector_search
retrieval_pipeline_2_name: bm25
weight: 0.5
normalize_method: mm
fetch_k_multiplier: 2
top_k: 10

Options

Common Options

Option	Type	Default	Description
name	str	required	Unique pipeline name
retrieval_pipeline_1_name	str	required	First pipeline name
retrieval_pipeline_2_name	str	required	Second pipeline name
fetch_k_multiplier	int	2	Multiplier for top_k when fetching from sub-pipelines. Each sub-pipeline fetches `top_k * fetch_k_multiplier` results before fusion.
top_k	int	10	Results per query
batch_size	int	100	Queries per batch

RRF Options

Option	Type	Default	Description
rrf_k	int	60	RRF constant (lower = more top-rank emphasis; higher values flatten the differences between ranks)
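How rrf_k changes the weight given to top ranks can be checked with a small calculation. The sketch below compares the contribution of rank 1 with that of rank 10 for several values of the constant; the helper names rrf_term and top_rank_ratio are hypothetical. The ratio shrinks toward 1 as rrf_k grows, so larger values flatten the difference between ranks and smaller values favor the top of each list.

```python
def rrf_term(rank, rrf_k):
    """Contribution of a single pipeline for a document at the given rank."""
    return 1.0 / (rrf_k + rank)


def top_rank_ratio(rrf_k, low_rank=10):
    """How many times more rank 1 contributes than low_rank."""
    return rrf_term(1, rrf_k) / rrf_term(low_rank, rrf_k)


# Ratio of rank-1 to rank-10 contribution for a small, default, and large k.
ratios = {k: top_rank_ratio(k) for k in (1, 60, 1000)}
for k, ratio in ratios.items():
    print(f"rrf_k={k}: rank 1 counts {ratio:.3f}x as much as rank 10")
```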

CC Options

Option	Type	Default	Description
weight	float	0.5	Weight for pipeline_1 (0=full pipeline_2, 1=full pipeline_1)
normalize_method	str	mm	Normalization: mm, tmm, z, dbsf
pipeline_1_min	float	None	Theoretical min score for tmm (pipeline_1)
pipeline_2_min	float	None	Theoretical min score for tmm (pipeline_2)

Usage

Programmatic

from autorag_research.pipelines.retrieval import (
    HybridRRFRetrievalPipeline,
    HybridCCRetrievalPipeline,
    VectorSearchRetrievalPipeline,
    BM25RetrievalPipeline,
)

# Create sub-pipelines
vector = VectorSearchRetrievalPipeline(session_factory, "vector")
bm25 = BM25RetrievalPipeline(session_factory, "bm25")

# Create RRF hybrid with instantiated pipelines
hybrid_rrf = HybridRRFRetrievalPipeline(
    session_factory=session_factory,
    name="hybrid_rrf",
    retrieval_pipeline_1=vector,
    retrieval_pipeline_2=bm25,
    rrf_k=60,
    fetch_k_multiplier=2,  # Fetch 2x top_k from each pipeline
)

# Create CC hybrid with pipeline names (auto-loaded from YAML)
hybrid_cc = HybridCCRetrievalPipeline(
    session_factory=session_factory,
    name="hybrid_cc",
    retrieval_pipeline_1="vector_search",  # Loads from configs/
    retrieval_pipeline_2="bm25",
    weight=0.6,  # 60% vector, 40% BM25
    normalize_method="mm",
    fetch_k_multiplier=3,  # Fetch 3x top_k for better fusion
)

results = hybrid_rrf.retrieve("What is machine learning?", top_k=10)

CLI

autorag run --pipeline hybrid_rrf --top-k 10

When to Use

Use RRF when:

Combining different retrieval paradigms (dense + sparse)
Score scales differ significantly
You want robust fusion with little tuning (rrf_k is the only fusion parameter)

Use CC when:

Fine-tuning weight between pipelines
Score distributions are known
You want explicit control over the fusion balance

References

Reciprocal Rank Fusion - Cormack et al., 2009
Hybrid Search - An Analysis of Fusion Functions for Hybrid Retrieval, Bruch et al., 2023

Citation

@inproceedings{cormack2009reciprocal,
  title={Reciprocal rank fusion outperforms condorcet and individual rank learning methods},
  author={Cormack, Gordon V and Clarke, Charles LA and Buettcher, Stefan},
  booktitle={Proceedings of the 32nd international ACM SIGIR conference on Research and development in information retrieval},
  pages={758--759},
  year={2009}
}

@article{bruch2023analysis,
  title={An analysis of fusion functions for hybrid retrieval},
  author={Bruch, Sebastian and Gai, Siyu and Ingber, Amir},
  journal={ACM Transactions on Information Systems},
  volume={42},
  number={1},
  pages={1--35},
  year={2023},
  publisher={ACM New York, NY}
}