RETRO*¶
Wrap an existing retrieval pipeline, then rerank its candidates with rubric-based LLM scoring inspired by the RETRO* paper.
Overview¶
| Field | Value |
|---|---|
| Type | Retrieval |
| Algorithm | Pointwise rubric-based reranking |
| Modality | Text |
| Paper | Retro*: Optimizing LLMs for Reasoning-Intensive Document Retrieval |
How It Works¶
- Retrieve an initial candidate set from an existing retrieval pipeline such as BM25, vector search, or hybrid search.
- Prompt an LLM with a task-specific relevance rubric for each query-document pair.
- Parse the final
<score>value from each reasoning trace. - Optionally sample multiple reasoning traces per pair and integrate their scores.
- Return the highest-scoring documents as the wrapper pipeline output.
Scope¶
This implementation covers the paper's practical inference-time reranking pattern only.
Out of scope for this MVP:
- the paper's SFT training stage
- the paper's RL optimization stage
- reproducing the published BRIGHT benchmark numbers end to end
Configuration¶
_target_: autorag_research.pipelines.retrieval.retro_star.RetroStarPipelineConfig
name: retro_star_bm25
llm: openai-gpt5-mini
retrieval_pipeline_name: bm25
candidate_top_k: 100
relevance_definition: >
A document is relevant when it helps answer the query, including evidence
that is indirect but still necessary for the required reasoning.
query_type: query
document_type: document
num_samples: 4
sample_weights: [0.1, 0.2, 0.3, 0.4]
top_k: 10
Options¶
| Option | Type | Default | Description |
|---|---|---|---|
| name | str | required | Unique pipeline instance name |
| llm | str | required | LLM config name (from configs/llm/) |
| retrieval_pipeline_name | str | required | Existing retrieval pipeline config to wrap |
| candidate_top_k | int | 100 | Number of wrapped candidates to rerank |
| relevance_definition | str | generic reasoning-aware definition | Rubric definition inserted into the prompt |
| query_type | str | query |
Label used inside the prompt |
| document_type | str | document |
Label used inside the prompt |
| num_samples | int | 1 | Number of reasoning traces to sample per candidate |
| sample_weights | list[float] | null | null | Optional score-integration weights |
| max_document_tokens | int | 768 | Max candidate document tokens sent to the LLM |
| max_query_tokens | int | 256 | Max query tokens sent to the LLM |
| max_rerank_concurrency | int | 4 | Concurrent candidate-scoring calls per query |
| top_k | int | 10 | Final results per query |
Default Prompt Behavior¶
The built-in prompt asks the model to:
- analyze the query intent
- analyze the candidate document
- justify a 0-100 relevance score
- end with the final score inside
<score>tags
This mirrors the paper's rubric-based inference pattern while staying generic enough for arbitrary datasets.
Usage¶
Python API¶
from langchain_openai import ChatOpenAI
from autorag_research.orm.connection import DBConnection
from autorag_research.pipelines.retrieval.bm25 import BM25RetrievalPipeline
from autorag_research.pipelines.retrieval.retro_star import RetroStarRetrievalPipeline
db = DBConnection.from_config()
session_factory = db.get_session_factory()
wrapped_retriever = BM25RetrievalPipeline(
session_factory=session_factory,
name="bm25",
tokenizer="bert",
)
pipeline = RetroStarRetrievalPipeline(
session_factory=session_factory,
name="retro_star_bm25",
llm=ChatOpenAI(model="gpt-5-mini"),
retrieval_pipeline=wrapped_retriever,
candidate_top_k=100,
num_samples=4,
)
YAML / Executor¶
# configs/pipelines/retrieval/retro_star_bm25.yaml
_target_: autorag_research.pipelines.retrieval.retro_star.RetroStarPipelineConfig
name: retro_star_bm25
llm: openai-gpt5-mini
retrieval_pipeline_name: bm25
candidate_top_k: 100
num_samples: 4
The executor resolves retrieval_pipeline_name, instantiates the wrapped retriever, and injects it automatically.
When to Use¶
Good for:
- reasoning-intensive benchmarks such as BRIGHT
- difficult queries where indirectly useful evidence matters
- comparing a stronger LLM-based reranking baseline against simpler retrievers
Consider other methods when:
- you need a lightweight retriever without LLM latency
- you want fully trained paper-faithful RETRO* checkpoints
- you only need sparse or dense retrieval without reranking