HyDE (Hypothetical Document Embeddings)

Dense retrieval using LLM-generated hypothetical documents.

Overview

Field	Value
Type	Retrieval
Algorithm	Dense (vector similarity with hypothetical documents)
Modality	Text
Paper	Precise Zero-Shot Dense Retrieval without Relevance Labels

How It Works

1. Receives a query.
2. Uses an LLM to generate a hypothetical document that would answer the query.
3. Embeds the hypothetical document (not the original query).
4. Performs a vector similarity search with the hypothetical document's embedding.

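This bridges the semantic gap between queries and documents: the generated text is document-like in length and vocabulary, so its embedding tends to lie closer to relevant documents than the embedding of a short query would. The hypothetical document may contain factual errors; it is used only to locate real documents and is never returned as a result.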
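
The four steps above can be sketched in a few lines of plain Python. This is an illustrative toy, not the AutoRAG-Research implementation: `embed` is a bag-of-words stand-in for a real dense embedding model, `fake_llm` is a hard-coded stand-in for an LLM call, and `hyde_retrieve` is a hypothetical helper name.

```python
import math
import re
from collections import Counter
from typing import Callable


def embed(text: str) -> Counter:
    """Toy bag-of-words embedding (assumption: stands in for a dense model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def hyde_retrieve(
    query: str,
    corpus: list[str],
    generate: Callable[[str], str],
    top_k: int = 10,
    prompt_template: str = (
        "Please write a passage to answer the question.\n"
        "Question: {query}\nPassage:"
    ),
) -> list[str]:
    # Step 2: ask the LLM for a hypothetical answer passage.
    hypothetical = generate(prompt_template.format(query=query))
    # Step 3: embed the hypothetical passage, not the query itself.
    vector = embed(hypothetical)
    # Step 4: rank corpus documents by similarity to that embedding.
    ranked = sorted(corpus, key=lambda doc: cosine(vector, embed(doc)), reverse=True)
    return ranked[:top_k]


def fake_llm(prompt: str) -> str:
    """Hard-coded stand-in for an LLM call (assumption for this sketch)."""
    return "Machine learning is a field of study where algorithms learn patterns from data."


corpus = [
    "Algorithms that learn patterns from data are studied in the field of machine learning.",
    "The stock market closed higher on Tuesday.",
    "Photosynthesis converts light into chemical energy.",
]

results = hyde_retrieve("What is ML?", corpus, fake_llm, top_k=1)
query_only_score = cosine(embed("What is ML?"), embed(corpus[0]))
```

In this toy setup the raw query shares no terms with the relevant document, so `query_only_score` is zero, while the hypothetical passage overlaps heavily with it. A real dense embedding model is less brittle than term overlap, but the same effect motivates HyDE.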

Configuration

_target_: autorag_research.pipelines.retrieval.hyde.HyDEPipelineConfig
name: hyde_gpt4
llm: openai-gpt4
embedding: openai-small
prompt_template: |
  Please write a passage to answer the question.
  Question: {query}
  Passage:
top_k: 10
batch_size: 100

Options

Option	Type	Default	Description
name	str	required	Unique pipeline instance name
llm	str	required	LLM config name (from configs/llm/)
embedding	str	required	Embedding config name (from configs/embedding/)
prompt_template	str	see below	Template with {query} placeholder
top_k	int	10	Results per query
batch_size	int	100	Queries per batch
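
As a rough illustration of what `batch_size` controls, the sketch below splits a query list into consecutive batches. It assumes simple sequential chunking; the `batched` helper is hypothetical, and the pipeline's actual batching logic may differ.

```python
from typing import Iterator, Sequence


def batched(queries: Sequence[str], batch_size: int = 100) -> Iterator[Sequence[str]]:
    """Yield consecutive slices of at most batch_size queries."""
    if batch_size < 1:
        raise ValueError("batch_size must be >= 1")
    for start in range(0, len(queries), batch_size):
        yield queries[start:start + batch_size]


queries = [f"query {i}" for i in range(250)]
batch_lengths = [len(batch) for batch in batched(queries, batch_size=100)]
```

With 250 queries and a batch size of 100, this produces two full batches and one final batch of 50.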

Default prompt template:

Please write a passage to answer the question.
Question: {query}
Passage:
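
To make the role of the `{query}` placeholder concrete, here is a minimal sketch of how a template could be turned into an LLM prompt. The `build_prompt` helper is hypothetical and only illustrates the substitution; the pipeline's own formatting code may work differently.

```python
DEFAULT_PROMPT_TEMPLATE = (
    "Please write a passage to answer the question.\n"
    "Question: {query}\n"
    "Passage:"
)


def build_prompt(template: str, query: str) -> str:
    """Insert the query into the template, rejecting templates without {query}."""
    if "{query}" not in template:
        raise ValueError("prompt_template must contain a {query} placeholder")
    # str.replace is used instead of str.format so that stray braces in the
    # template or the query are passed through unchanged.
    return template.replace("{query}", query)


prompt = build_prompt(DEFAULT_PROMPT_TEMPLATE, "What is machine learning?")
```

Checking for the placeholder up front catches a common configuration mistake: a template without `{query}` would send the same prompt to the LLM for every query.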

Custom Prompts

The paper recommends domain-specific prompts. Examples:

Web search (DL19/20, DBPedia):

prompt_template: |
  Please write a passage to answer the question
  Question: {query}
  Passage:

SciFact:

prompt_template: |
  Please write a scientific paper passage to support/refute the claim
  Claim: {query}
  Passage:

TREC-COVID:

prompt_template: |
  Please write a scientific paper passage to answer the question
  Question: {query}
  Passage:

FiQA:

prompt_template: |
  Please write a financial article passage to answer the question
  Question: {query}
  Passage:

TREC-NEWS:

prompt_template: |
  Please write a news passage about the topic.
  Topic: {query}
  Passage:

ArguAna:

prompt_template: |
  Please write a counter argument for the passage
  Passage: {query}
  Counter Argument:

Mr.TyDi (multilingual; Korean shown, substitute the target language):

prompt_template: |
  Please write a passage in Korean to answer the question in detail.
  Question: {query}
  Passage:

Usage

Python API

from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from autorag_research.orm.connection import DBConnection
from autorag_research.pipelines.retrieval.hyde import HyDERetrievalPipeline

db = DBConnection.from_config()
session_factory = db.get_session_factory()

pipeline = HyDERetrievalPipeline(
    session_factory=session_factory,
    name="hyde_gpt4",
    llm=ChatOpenAI(model="gpt-4"),
    embedding=OpenAIEmbeddings(model="text-embedding-3-small"),
    # Optional: custom prompt for domain-specific documents
    prompt_template="Write a Wikipedia passage about: {query}\n\nPassage:",
)

# Single query
results = pipeline.retrieve("What is machine learning?", top_k=10)

# Batch processing
stats = pipeline.run(top_k=10)

With Config

from autorag_research.pipelines.retrieval.hyde import HyDEPipelineConfig

config = HyDEPipelineConfig(
    name="hyde_gpt4",
    llm="openai-gpt4",      # Auto-converted to LLM instance
    embedding="openai-small", # Auto-converted to Embeddings instance
    top_k=10,
)

When to Use

Good for:

Zero-shot retrieval (no labeled data needed)
Bridging query-document semantic gap
Complex questions requiring reasoning

Consider other methods when:

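Low latency is critical (the LLM generation step adds latency to every query)
LLM cost is a concern (each query requires a generation call in addition to the embedding call)
Pre-computed query embeddings are available (HyDE embeds generated text, so they cannot be reused)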