[AAAI 2025] Reference-Based Post-OCR Processing with LLM for Precise Diacritic Text in Historical Document Recognition
We have developed a reference-based post-OCR processing pipeline that leverages large language models (LLMs) and ebook references to achieve highly accurate recognition of historical documents in diacritic-rich languages. This work, “Reference-Based Post-OCR Processing with LLM for Precise Diacritic Text in Historical Document Recognition,” was accepted as a full paper at the Thirty-Ninth AAAI Conference on Artificial Intelligence (AAAI-25), one of the most prestigious conferences in artificial intelligence, with an acceptance rate of around 10–20%. AAAI is ranked as a CORE A venue, underscoring the high impact of this contribution.
Our method introduces a post-processing pipeline that goes beyond conventional spell-correction models. Instead of relying solely on statistical or transformer-based correction, we integrate ebooks as semantic references and use LLMs to generate high-fidelity pseudo page-to-page alignments between the scanned document and its reference text. This design directly addresses diacritic variation, noisy scans, archaic vocabulary, and the mismatch between modern and classical corpora, enabling robust recovery of historical texts.
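To make the two-stage idea concrete, here is a minimal sketch of a reference-based correction step: a cheap similarity score stands in for the pseudo page-to-page alignment, and the aligned ebook page is folded into an LLM correction prompt. All function names and the prompt wording are illustrative assumptions, not the paper's actual implementation, which uses an LLM for the alignment itself.

```python
from difflib import SequenceMatcher

def best_reference_page(ocr_page: str, ebook_pages: list[str]) -> tuple[int, float]:
    """Pick the ebook page most similar to a noisy OCR page.

    A simple stand-in for the pseudo page-to-page alignment step:
    similarity here is a plain character-level ratio, whereas the
    actual system generates higher-fidelity alignments with an LLM.
    """
    scores = [SequenceMatcher(None, ocr_page, page).ratio() for page in ebook_pages]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores[best]

def build_correction_prompt(ocr_page: str, reference_page: str) -> str:
    """Assemble an LLM prompt that asks for diacritic-faithful correction
    of the OCR output, using the aligned ebook page as a semantic
    reference. (Prompt wording is illustrative only.)"""
    return (
        "You are correcting OCR output from a historical Vietnamese document.\n"
        "Reference passage from a matching ebook:\n"
        f"{reference_page}\n\n"
        "Noisy OCR output to correct (restore diacritics, keep archaic wording):\n"
        f"{ocr_page}\n"
    )

# Toy demonstration: diacritic-stripped OCR text matched against two candidate pages.
ebook = ["Việt Nam quê hương tôi.", "Một trang khác hoàn toàn."]
ocr = "Viet Nam que huong toi."
idx, score = best_reference_page(ocr, ebook)
prompt = build_correction_prompt(ocr, ebook[idx])
print(idx, round(score, 2))
```

Even this crude character-level matcher selects the right page; the LLM-based alignment in the pipeline plays the same role at much higher fidelity across noisy, archaic text.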
Experimental evaluations highlight the strength of this approach. In human assessments, our system achieved an average score of 8.72/10, significantly outperforming the state-of-the-art Vietnamese transformer-based spell-correction baseline (7.03/10). We also created VieBookRead, a dataset of more than 25,000 pages of high-quality, diacritic-corrected Vietnamese text, which can serve as a valuable resource for future research. Furthermore, fine-tuning OCR baselines such as Tesseract on this dataset reduced the word error rate (WER) from 0.27 to 0.18, demonstrating clear downstream benefits.
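For readers unfamiliar with the metric, WER is the word-level Levenshtein distance between hypothesis and reference, divided by the reference length; a drop from 0.27 to 0.18 means roughly nine fewer errors per hundred reference words. A short self-contained implementation:

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by reference word count."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j]: edits needed to turn the first i reference words
    # into the first j hypothesis words
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + cost)  # substitution
    return dp[len(ref)][len(hyp)] / len(ref)

# Two of three words lose their diacritics -> WER = 2/3
print(word_error_rate("quê hương tôi", "que huong tôi"))
```

As the toy example shows, diacritic loss alone inflates WER sharply in Vietnamese, which is why diacritic-faithful correction yields such a visible gain.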
The applications of this work are broad. In digital humanities and cultural preservation, it enables the digitization and restoration of aged Vietnamese texts and those of other diacritic-rich languages. In language technology, it supports the development of more accurate OCR systems for under-resourced languages. For education, archives, and historical research, it provides tools for historians, social scientists, and heritage institutions to preserve and analyze cultural materials. Finally, by design, the approach is adaptable to cross-lingual OCR research, with potential extensions to Czech and other Slavic languages, French, and further languages rich in diacritics.
Our work shows that combining ebooks with large language models enables high-precision correction of OCR outputs for historical documents, providing both a valuable dataset and a practical solution for preserving diacritic-rich languages.