
Hybrid Retrieval

Combine multiple retrieval pipelines by fusing their relevance scores. Two fusion strategies are supported: RRF (Reciprocal Rank Fusion) and CC (Convex Combination, also known as weighted sum).

Overview

| Field | Value |
| --- | --- |
| Type | Retrieval |
| Methods | RRF, Convex Combination |
| Pipelines | Any 2 retrieval pipelines |

Methods

Reciprocal Rank Fusion (RRF)

Combines results based on rank positions, ignoring raw scores.

Formula: RRF(d) = sum_i 1/(k + rank_i(d)), where rank_i(d) is document d's rank in pipeline i and k is the rrf_k constant.

Advantages:

  • Score-scale independent
  • Robust to different retrieval methods
  • No normalization needed

Missing Document Handling:

Documents that appear in only one pipeline are assigned rank fetch_k + 1 for the missing pipeline, giving them a small but non-zero contribution: 1/(k + fetch_k + 1). This prevents documents from being unfairly penalized when they're highly relevant in one pipeline but absent from the other.
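The fusion rule above can be sketched in a few lines. This is a hypothetical helper for illustration, not the library's implementation; it treats a document absent from one list as ranked at fetch_k + 1 there, as described above.

```python
def rrf_fuse(ranked_1, ranked_2, k=60, fetch_k=20):
    """Fuse two ranked lists of doc ids with Reciprocal Rank Fusion.

    Documents absent from one list are assigned rank fetch_k + 1 in
    that list, so they still receive a small non-zero contribution.
    """
    rank_1 = {doc: i + 1 for i, doc in enumerate(ranked_1)}
    rank_2 = {doc: i + 1 for i, doc in enumerate(ranked_2)}
    missing = fetch_k + 1
    scores = {}
    for doc in set(ranked_1) | set(ranked_2):
        scores[doc] = (1 / (k + rank_1.get(doc, missing))
                       + 1 / (k + rank_2.get(doc, missing)))
    return sorted(scores, key=scores.get, reverse=True)
```

Note how a document ranked first in one list but missing from the other can still outrank documents that appear low in both.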

Convex Combination (CC)

Combines normalized scores with configurable weights.

Formula: combined = weight * norm(score_1) + (1 - weight) * norm(score_2), where weight is in [0, 1] and norm is the chosen normalization method.

Normalization Methods:

| Method | Description | Missing Score Floor |
| --- | --- | --- |
| mm | Min-max scaling to [0, 1] using actual min/max | 0.0 |
| tmm | Theoretical min with actual max (e.g., BM25 min=0, cosine min=-1) | 0.0 |
| z | Z-score standardization | -3.0 |
| dbsf | 3-sigma distribution-based scaling | 0.0 |

Missing Document Handling:

Documents that appear in only one pipeline receive a semantically correct floor value after normalization:

  • Missing scores are excluded from normalization statistics (min/max/mean/std)
  • After normalization, missing scores are replaced with method-specific floor values
  • For z-score, -3.0 represents 3 standard deviations below the mean (very low relevance)
  • For other methods, 0.0 represents the minimum of the normalized range
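A minimal sketch of CC fusion with mm normalization, assuming the behavior described above (hypothetical helpers, not the library's code): each pipeline's scores are normalized over present documents only, and missing documents fall back to the floor value.

```python
def minmax_norm(scores):
    """Min-max scale present scores to [0, 1]; missing docs are absent."""
    lo, hi = min(scores.values()), max(scores.values())
    span = (hi - lo) or 1.0  # guard against all-equal scores
    return {d: (s - lo) / span for d, s in scores.items()}

def cc_fuse(scores_1, scores_2, weight=0.5, floor=0.0):
    """Convex combination of two score dicts (doc id -> raw score).

    Docs missing from one pipeline are excluded from that pipeline's
    normalization statistics and fall back to the floor afterwards.
    """
    n1, n2 = minmax_norm(scores_1), minmax_norm(scores_2)
    fused = {
        d: weight * n1.get(d, floor) + (1 - weight) * n2.get(d, floor)
        for d in set(n1) | set(n2)
    }
    return sorted(fused.items(), key=lambda kv: kv[1], reverse=True)
```

Because missing scores never enter the min/max computation, a document that one pipeline did not return cannot drag down the normalization range for the documents it did return.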

Configuration

RRF Pipeline

```yaml
_target_: autorag_research.pipelines.retrieval.hybrid.HybridRRFRetrievalPipelineConfig
name: hybrid_rrf
retrieval_pipeline_1_name: vector_search
retrieval_pipeline_2_name: bm25
rrf_k: 60
fetch_k_multiplier: 2
top_k: 10
```

CC Pipeline

```yaml
_target_: autorag_research.pipelines.retrieval.hybrid.HybridCCRetrievalPipelineConfig
name: hybrid_cc
retrieval_pipeline_1_name: vector_search
retrieval_pipeline_2_name: bm25
weight: 0.5
normalize_method: mm
fetch_k_multiplier: 2
top_k: 10
```

Options

Common Options

| Option | Type | Default | Description |
| --- | --- | --- | --- |
| name | str | required | Unique pipeline name |
| retrieval_pipeline_1_name | str | required | First pipeline name |
| retrieval_pipeline_2_name | str | required | Second pipeline name |
| fetch_k_multiplier | int | 2 | Multiplier for top_k when fetching from sub-pipelines. Each sub-pipeline fetches top_k * fetch_k_multiplier results before fusion. |
| top_k | int | 10 | Results per query |
| batch_size | int | 100 | Queries per batch |

RRF Options

| Option | Type | Default | Description |
| --- | --- | --- | --- |
| rrf_k | int | 60 | RRF constant; smaller values put more emphasis on top-ranked documents, larger values flatten the differences between ranks |
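A quick bit of arithmetic shows why rrf_k controls top-rank emphasis: the contribution gap between rank 1 and rank 10 shrinks as k grows.

```python
def rrf_gap(k: int) -> float:
    """Difference in RRF contribution between rank 1 and rank 10."""
    return 1 / (k + 1) - 1 / (k + 10)

# The gap shrinks monotonically as k grows, so a small rrf_k rewards
# top-ranked documents much more than a large one does.
for k in (1, 60, 600):
    print(k, rrf_gap(k))
```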

CC Options

| Option | Type | Default | Description |
| --- | --- | --- | --- |
| weight | float | 0.5 | Weight for pipeline_1 (0 = full pipeline_2, 1 = full pipeline_1) |
| normalize_method | str | mm | Normalization method: mm, tmm, z, or dbsf |
| pipeline_1_min | float | None | Theoretical min score for tmm (pipeline_1) |
| pipeline_2_min | float | None | Theoretical min score for tmm (pipeline_2) |
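To illustrate what pipeline_1_min / pipeline_2_min feed into, here is a sketch of tmm scaling under the definition given earlier (theoretical min with actual max); the helper name is hypothetical.

```python
def tmm_norm(scores, theoretical_min):
    """Scale scores with a theoretical min and the actual max:
    (s - t_min) / (max - t_min)."""
    hi = max(scores.values())
    span = (hi - theoretical_min) or 1.0  # guard against degenerate range
    return {d: (s - theoretical_min) / span for d, s in scores.items()}

# Cosine similarities live in [-1, 1], so the theoretical min is -1;
# for BM25 it would be 0.
cosine_scores = {"a": 0.8, "b": 0.2}
normalized = tmm_norm(cosine_scores, theoretical_min=-1.0)
```

Unlike mm, the lowest-scoring returned document does not get pinned to 0.0, which preserves the information that it was still well above the theoretical floor.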

Usage

Programmatic

```python
from autorag_research.pipelines.retrieval import (
    HybridRRFRetrievalPipeline,
    HybridCCRetrievalPipeline,
    VectorSearchRetrievalPipeline,
    BM25RetrievalPipeline,
)

# Create sub-pipelines
vector = VectorSearchRetrievalPipeline(session_factory, "vector")
bm25 = BM25RetrievalPipeline(session_factory, "bm25")

# Create RRF hybrid with instantiated pipelines
hybrid_rrf = HybridRRFRetrievalPipeline(
    session_factory=session_factory,
    name="hybrid_rrf",
    retrieval_pipeline_1=vector,
    retrieval_pipeline_2=bm25,
    rrf_k=60,
    fetch_k_multiplier=2,  # Fetch 2x top_k from each pipeline
)

# Create CC hybrid with pipeline names (auto-loaded from YAML)
hybrid_cc = HybridCCRetrievalPipeline(
    session_factory=session_factory,
    name="hybrid_cc",
    retrieval_pipeline_1="vector_search",  # Loads from configs/
    retrieval_pipeline_2="bm25",
    weight=0.6,  # 60% vector, 40% BM25
    normalize_method="mm",
    fetch_k_multiplier=3,  # Fetch 3x top_k for better fusion
)

results = hybrid_rrf.retrieve("What is machine learning?", top_k=10)
```

CLI

```bash
autorag run --pipeline hybrid_rrf --top-k 10
```

When to Use

Use RRF when:

  • Combining different retrieval paradigms (dense + sparse)
  • Score scales differ significantly
  • Want robust, parameter-free fusion

Use CC when:

  • Fine-tuning weight between pipelines
  • Score distributions are known
  • Want explicit control over fusion balance

References

Citation

@inproceedings{cormack2009reciprocal,
  title={Reciprocal rank fusion outperforms condorcet and individual rank learning methods},
  author={Cormack, Gordon V and Clarke, Charles LA and Buettcher, Stefan},
  booktitle={Proceedings of the 32nd international ACM SIGIR conference on Research and development in information retrieval},
  pages={758--759},
  year={2009}
}

@article{bruch2023analysis,
  title={An analysis of fusion functions for hybrid retrieval},
  author={Bruch, Sebastian and Gai, Siyu and Ingber, Amir},
  journal={ACM Transactions on Information Systems},
  volume={42},
  number={1},
  pages={1--35},
  year={2023},
  publisher={ACM New York, NY}
}