
Text RAG Benchmark

Run a full RAG benchmark: retrieval followed by answer generation with an LLM.

Prerequisites

  • LLM access: an API key for OpenAI or Anthropic, or a local model
  • Dataset with generation ground truth (RAGBench recommended)
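
Before running anything, it can help to confirm that the API key is visible to the process. The sketch below uses the variable names that the official OpenAI and Anthropic SDKs read by convention; whether autorag-research reads the same names is an assumption, and `missing_key` is a hypothetical helper, not part of the tool.

```python
import os

# Conventional variable names read by the official OpenAI and Anthropic SDKs.
# Assumption: autorag-research picks up the same names.
PROVIDER_ENV_VARS = {
    "openai": "OPENAI_API_KEY",
    "anthropic": "ANTHROPIC_API_KEY",
}


def missing_key(provider, env=None):
    """Return the name of the unset variable for `provider`, or None if it is set."""
    env = os.environ if env is None else env
    var = PROVIDER_ENV_VARS[provider]
    # An empty string counts as unset.
    return None if env.get(var) else var
```

Calling `missing_key("openai")` returns `"OPENAI_API_KEY"` when the variable is absent and `None` when it is set, so a non-None result is a cue to export the key first.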

Download Dataset

autorag-research data restore ragbench covidqa_openai-small

RAGBench includes generation ground truth (expected answers).

Create Experiment Config

# configs/experiment.yaml
db_name: ragbench_covidqa_test_openai_small

pipelines:
  retrieval:
    - bm25
  generation:
    - basic_rag

metrics:
  retrieval:
    - recall
    - ndcg
  generation:
    - rouge
    - bleu
    - bert_score
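
To make the retrieval metrics concrete, here is a minimal sketch of recall@k and binary-relevance NDCG@k using their textbook definitions. It is not the autorag-research implementation, which may differ in details such as graded relevance or tie handling; the passage IDs are made up for illustration.

```python
import math


def recall_at_k(retrieved, relevant, k):
    """Fraction of relevant passage IDs that appear in the top-k retrieved IDs."""
    if not relevant:
        return 0.0
    hits = len(set(retrieved[:k]) & set(relevant))
    return hits / len(relevant)


def ndcg_at_k(retrieved, relevant, k):
    """Binary-relevance NDCG: DCG of the ranking divided by the ideal DCG."""
    relevant = set(relevant)
    dcg = sum(
        1.0 / math.log2(rank + 2)  # rank is 0-based, so position 1 uses log2(2)
        for rank, doc_id in enumerate(retrieved[:k])
        if doc_id in relevant
    )
    # The ideal ranking places every relevant passage first.
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(rank + 2) for rank in range(ideal_hits))
    return dcg / idcg if idcg > 0 else 0.0


retrieved = ["d3", "d1", "d7", "d2"]  # hypothetical ranked passage IDs
relevant = ["d1", "d2"]               # hypothetical ground-truth IDs

print(recall_at_k(retrieved, relevant, k=3))          # 0.5
print(round(ndcg_at_k(retrieved, relevant, k=3), 3))  # 0.387
```

Recall only asks whether the relevant passages made it into the top k, while NDCG also rewards placing them near the top, which is why the two can diverge on the same ranking.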

Configure LLM

# configs/pipelines/generation/basic_rag.yaml
_target_: autorag_research.pipelines.generation.basic_rag.BasicRAGPipelineConfig
name: basic_rag
retrieval_pipeline_name: bm25
llm: gpt-4o-mini
prompt_template: |
  Context:
  {context}

  Question: {query}

  Answer:
top_k: 5
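
The template has two placeholders, `{context}` and `{query}`. A plausible reading is that the top_k retrieved passages are joined into one context string and substituted together with the question. The sketch below illustrates that idea with plain `str.format`; `build_prompt` and the blank-line separator are assumptions for illustration, not the pipeline's actual code.

```python
PROMPT_TEMPLATE = """Context:
{context}

Question: {query}

Answer:"""


def build_prompt(template, query, passages, top_k=5):
    """Join the first top_k passages into one context string and fill the template."""
    # Assumption: passages are separated by a blank line.
    context = "\n\n".join(passages[:top_k])
    return template.format(context=context, query=query)


# Hypothetical retrieved passages, already ranked by the retrieval pipeline.
passages = [f"Passage {i} text." for i in range(1, 8)]
prompt = build_prompt(PROMPT_TEMPLATE, "How does the virus spread?", passages)
print(prompt)
```

With top_k set to 5, passages 6 and 7 never reach the LLM, so raising or lowering top_k trades context coverage against prompt length.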

Run

autorag-research run --config-name=experiment

Expected Output

Pipeline: bm25
  Recall@10: 0.823
  NDCG@10: 0.698

Pipeline: basic_rag
  ROUGE-L: 0.412
  BLEU: 0.287
  BERTScore: 0.856
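
For intuition about the ROUGE-L score above: it is based on the longest common subsequence (LCS) between the generated answer and the expected answer. The sketch below computes an LCS-based F1 over lowercased whitespace tokens. Library implementations usually add their own tokenization and optional stemming, so their numbers can differ; the example sentences are invented.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for tok_a in a:
        curr = [0]
        for j, tok_b in enumerate(b, start=1):
            if tok_a == tok_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(prediction, reference):
    """LCS-based F1 between a generated answer and the expected answer."""
    pred = prediction.lower().split()
    ref = reference.lower().split()
    lcs = lcs_length(pred, ref)
    if lcs == 0:
        return 0.0
    precision = lcs / len(pred)
    recall = lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


prediction = "the virus spreads through droplets"
reference = "the virus spreads mainly through respiratory droplets"
print(round(rouge_l_f1(prediction, reference), 3))  # 0.833
```

Here all five predicted tokens appear in order in the seven-token reference, giving precision 1.0, recall 5/7, and an F1 of about 0.833.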

Dataset    Has Generation Ground Truth
RAGBench   Yes
BEIR       No (retrieval only)
MTEB       No (retrieval only)
