Multimodal Retrieval¶
Benchmark visual document retrieval (PDF pages, screenshots).
Prerequisites¶
- GPU recommended for image embeddings
- Multimodal embedding model (ColPali)
Download Dataset¶
autorag-research data restore vidorev3 arxivqa_colpali
ViDoRe v3 contains document images with queries.
Key Differences from Text¶
| Aspect | Text | Multimodal |
|---|---|---|
| Documents | Plain text | Images (PDF pages) |
| Embeddings | Text models | Vision models (ColPali) |
| BM25 | Available | Not available |
Create Experiment Config¶
# configs/experiment.yaml
db_name: vidorev3_arxivqa_test_colpali
pipelines:
retrieval:
- colpali
generation: []
metrics:
retrieval:
- recall
- ndcg
generation: []
Run¶
autorag-research run --config-name=experiment
Recommended Datasets¶
| Dataset | Description |
|---|---|
| ViDoRe v3 | Document images, multiple domains |
| VisRAG | Visual RAG benchmark |
See Multimodal Datasets for all options.
Next¶
- Custom Dataset - Add your own images
- Custom Pipeline - New retrieval method