Researchers systematically evaluated four open-source PDF-to-Markdown frameworks (Docling, MinerU, Marker, DeepSeek OCR) for RAG document preprocessing. On a 50-question benchmark over Portuguese administrative documents, Docling with hierarchical splitting achieved the highest accuracy at 94.1%, beating naive extraction (86.9%) by 7.2 percentage points but falling 3.0 points short of manual curation (97.1%).
Research
From PDF to RAG-Ready: Evaluating Document Conversion Frameworks for Domain-Specific Question Answering
Docling with hierarchical splitting reaches 94.1% accuracy for RAG document preprocessing on Portuguese administrative documents: well ahead of naive extraction (86.9%), but still 3.0 percentage points short of manual curation (97.1%).
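To make "hierarchical splitting" concrete, here is a minimal sketch, assuming it means chunking the converted Markdown along its heading structure so that each chunk keeps the trail of headings above it. This is an illustration only, not Docling's implementation or the paper's pipeline; the names `Chunk`, `split_by_headings`, and the sample text are hypothetical.

```python
import re
from dataclasses import dataclass

# ATX-style Markdown heading: 1 to 6 '#' characters, whitespace, then the title.
HEADING = re.compile(r"^(#{1,6})\s+(.*\S)\s*$")


@dataclass
class Chunk:
    path: tuple[str, ...]  # heading trail, outermost first (hypothetical field)
    text: str              # body text found under the innermost heading


def split_by_headings(markdown: str) -> list[Chunk]:
    """Split Markdown into chunks, one per run of body text under a heading."""
    chunks: list[Chunk] = []
    trail: list[tuple[int, str]] = []  # stack of (level, title) for open headings
    body: list[str] = []

    def flush() -> None:
        # Emit the accumulated body text under the current heading trail.
        text = "\n".join(body).strip()
        if text:
            chunks.append(Chunk(tuple(title for _, title in trail), text))
        body.clear()

    for line in markdown.splitlines():
        match = HEADING.match(line)
        if match:
            flush()
            level = len(match.group(1))
            # Close headings at the same or a deeper level before opening this one.
            while trail and trail[-1][0] >= level:
                trail.pop()
            trail.append((level, match.group(2)))
        else:
            body.append(line)
    flush()
    return chunks


# Hypothetical sample standing in for a converted administrative document.
sample = """# Regulation
Intro text.
## Article 1
Fees are due in March.
## Article 2
Appeals take 30 days.
"""

chunks = split_by_headings(sample)
for chunk in chunks:
    print(" > ".join(chunk.path), "|", chunk.text)
```

Keeping the heading trail with each chunk lets a retriever see which section a passage came from, context that a flat fixed-size split would discard. The sketch does not special-case fenced code blocks, so a `#` line inside a fence would be treated as a heading.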
Wednesday, April 8, 2026 12:00 PM UTC | 2 MIN READ | SOURCE: arXiv CS.LG (Machine Learning) | BY sys://pipeline
Tags
research