Text RAG Benchmark¶

Run full RAG: retrieval + generation with LLM.

Prerequisites¶

LLM API key (OpenAI, Anthropic, or local model)
Dataset with generation ground truth (RAGBench recommended)

Download Dataset¶

autorag-research data restore ragbench covidqa_openai-small

RAGBench includes generation ground truth (expected answers).

Create Experiment Config¶

# configs/experiment.yaml
db_name: ragbench_covidqa_test_openai_small

pipelines:
  retrieval:
    - bm25
  generation:
    - basic_rag

metrics:
  retrieval:
    - recall
    - ndcg
  generation:
    - rouge
    - bleu
    - bert_score

Configure LLM¶

# configs/pipelines/generation/basic_rag.yaml
_target_: autorag_research.pipelines.generation.basic_rag.BasicRAGPipelineConfig
name: basic_rag
retrieval_pipeline_name: bm25
llm: gpt-4o-mini
prompt_template: |
  Context:
  {context}

  Question: {query}

  Answer:
top_k: 5

Run¶

autorag-research run --config-name=experiment

Expected Output¶

Pipeline: bm25
  Recall@10: 0.823
  NDCG@10: 0.698

Pipeline: basic_rag
  ROUGE-L: 0.412
  BLEU: 0.287
  BERTScore: 0.856

Recommended Datasets¶

Dataset	Has Generation GT
RAGBench	Yes
BEIR	No (retrieval only)
MTEB	No (retrieval only)

Next¶

Multimodal - Visual documents
Custom Metric - Add evaluation