COMPARATIVE STUDY OF TRANSFORMER-BASED AND INTELLIGENT DOCUMENT ANALYSIS METHODS FOR AUTOMATED EXTRACTION OF MEDICAL DATA FROM PDF DOCUMENTS
DOI:
https://doi.org/10.20998/2079-0023.2026.01.14
Keywords:
optical character recognition, information extraction, medical documents, medical data processing, layout-aware methods, data extraction, document analysis, decision support systems, artificial intelligence tools
Abstract
This paper studies automated processing of medical laboratory reports in PDF format, focusing on text recognition and structured information extraction. It investigates the effectiveness of different approaches to optical character recognition (OCR), including classical methods and transformer-based models, as well as techniques for extracting key medical data from unstructured and semi-structured text. A comparative experimental analysis was conducted on medical documents with different structural characteristics, including tabular and text-based formats. OCR methods and extraction pipelines are evaluated with a set of quantitative metrics: Character Error Rate (CER), Word Error Rate (WER), Exact Match (EM), Precision, Recall, and F1-score. The results show that OCR accuracy alone does not guarantee high-quality structured data extraction, because recognition errors significantly affect downstream processing and reduce the reliability of the extracted information. Special attention is given to layout-aware approaches that exploit the structural properties of PDF documents. The proposed method, based on direct text extraction with pdfplumber, shows superior performance by preserving spatial relationships between document elements and eliminating the need for OCR in documents with an embedded text layer, which yields higher stability and accuracy when processing structured medical data. The findings indicate that the main challenge in processing medical documents lies in the extraction stage rather than in text recognition, and they underline the importance of integrating layout-aware and intelligent extraction methods to improve the reliability, robustness, and scalability of automated data processing systems. The proposed approach can serve as a foundation for medical information systems and decision support tools aimed at efficient and accurate clinical data management.
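The metrics named in the abstract can be illustrated with a minimal sketch. The abstract does not give the exact definitions used in the study, so the code below assumes the conventional ones: CER and WER as Levenshtein distance normalized by reference length, and Precision, Recall, and F1 computed over extracted field/value pairs, with Exact Match taken at the document level. All function names and the sample strings are hypothetical and serve only as illustration.

```python
from typing import Dict, Sequence, Tuple


def edit_distance(ref: Sequence, hyp: Sequence) -> int:
    """Levenshtein distance between two sequences (characters or words)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,           # deletion
                            curr[j - 1] + 1,       # insertion
                            prev[j - 1] + cost))   # substitution or match
        prev = curr
    return prev[-1]


def cer(ref: str, hyp: str) -> float:
    """Character Error Rate: character edits divided by reference length."""
    return edit_distance(ref, hyp) / max(len(ref), 1)


def wer(ref: str, hyp: str) -> float:
    """Word Error Rate: word edits divided by reference word count."""
    ref_words = ref.split()
    return edit_distance(ref_words, hyp.split()) / max(len(ref_words), 1)


def field_scores(gold: Dict[str, str],
                 pred: Dict[str, str]) -> Tuple[float, float, float, float]:
    """Return (exact_match, precision, recall, f1) over field/value pairs.

    Assumption: a predicted field counts as correct only if its value
    matches the gold value exactly; exact_match is 1.0 only when the
    whole predicted record equals the gold record.
    """
    correct = sum(1 for k, v in pred.items() if gold.get(k) == v)
    precision = correct / len(pred) if pred else 0.0
    recall = correct / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    exact_match = 1.0 if gold == pred else 0.0
    return exact_match, precision, recall, f1


# Hypothetical example: a single dropped decimal point in the OCR output.
reference = "Glucose 5.4 mmol/L"
recognized = "Glucose 54 mmol/L"
print(cer(reference, recognized), wer(reference, recognized))
print(field_scores({"glucose": "5.4"}, {"glucose": "54"}))
```

In this hypothetical case one character edit out of eighteen keeps CER low, yet the extracted value is wrong and every field-level score drops to zero, which is consistent with the abstract's observation that OCR accuracy alone does not guarantee reliable structured extraction.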
License

This work is licensed under a Creative Commons Attribution 4.0 International License.