Core Concepts¶
Dataset¶
A dataset contains:
- Documents: Content to search (text or images)
- Queries: Questions to answer
- Ground Truth: Which documents are relevant to each query
Stored in PostgreSQL with vector embeddings.
Retrieval Pipeline¶
Takes a query, returns relevant documents.
results = retrieval_pipeline.retrieve(query="What causes fever?", top_k=10)
# Returns: [{"doc_id": 42, "score": 0.95}, ...]
Generation Pipeline¶
Takes a query + retrieved documents, generates an answer.
answer = generation_pipeline.generate(query="What causes fever?", top_k=5)
# Returns: "Fever is caused by..."
Metric¶
Measures pipeline quality against ground truth.
Retrieval metrics: Did we find the right documents? (Recall, NDCG, MRR)
Generation metrics: Is the answer correct? (ROUGE, BERTScore)
Executor¶
Orchestrates pipeline execution and metric evaluation.
# experiment.yaml
db_name: beir_scifact_test
pipelines:
retrieval: [bm25]
metrics:
retrieval: [recall, ndcg]