Scientific article
English

Unsupervised extraction of body-text from clinical PDF documents

Published in: Studies in health technology and informatics, vol. 316, p. 214-215
Publication date: 2024-08-22
Abstract

Automatic extraction of body-text from clinical PDF documents is necessary to enhance downstream NLP tasks but remains a challenge. This study presents an unsupervised algorithm designed to extract body-text by leveraging a large volume of data. Using DBSCAN clustering over aggregated pages, our method extracts and organizes text blocks using their content and coordinates. Evaluation results show precision scores ranging from 0.82 to 0.98, recall scores from 0.62 to 0.94, and F1-scores from 0.71 to 0.96 across sources from various medical specialties. Future work includes dynamic parameter adjustment for improved accuracy and evaluation on larger datasets.
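The abstract does not give implementation details, so the following is only a minimal sketch of the general idea, not the authors' algorithm. It assumes that text blocks from many pages are pooled together, that blocks repeating at nearly the same coordinates (such as headers and footers) form dense DBSCAN clusters, and that blocks left as noise are candidate body-text. The block tuples, the sample texts, and the values of `eps` and `min_samples` are hypothetical, and the sketch clusters on coordinates only, whereas the study also uses block content. A small pure-Python DBSCAN is used to keep the example self-contained.

```python
from math import dist


def dbscan(points, eps, min_samples):
    """Return one label per point: a cluster id (0, 1, ...) or -1 for noise."""
    labels = [None] * len(points)

    def neighbors(i):
        # All points within eps of point i (the point itself included).
        return [j for j, q in enumerate(points) if dist(points[i], q) <= eps]

    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_samples:
            labels[i] = -1  # noise for now; may later become a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # border point: joins cluster, not expanded
            if labels[j] is not None:
                continue
            labels[j] = cluster
            reachable = neighbors(j)
            if len(reachable) >= min_samples:  # core point: keep expanding
                queue.extend(reachable)
    return labels


# Hypothetical text blocks pooled from three pages: (page, x, y, text).
blocks = [
    (1, 50.0, 20.0, "Hospital letterhead"),
    (2, 50.5, 20.0, "Hospital letterhead"),
    (3, 50.0, 20.5, "Hospital letterhead"),
    (1, 50.0, 780.0, "Page 1/3"),
    (2, 50.0, 780.5, "Page 2/3"),
    (3, 50.5, 780.0, "Page 3/3"),
    (1, 60.0, 150.0, "Body paragraph A"),
    (2, 60.0, 320.0, "Body paragraph B"),
    (3, 60.0, 540.0, "Body paragraph C"),
]

# Cluster on coordinates only, ignoring the page number (assumed parameters).
coords = [(x, y) for _, x, y, _ in blocks]
labels = dbscan(coords, eps=5.0, min_samples=3)

# Assumption of this sketch: noise points are the body-text candidates.
body_text = [text for (_, _, _, text), lab in zip(blocks, labels) if lab == -1]
print(labels)
print(body_text)
```

In this toy input the letterhead and page-number blocks recur at almost identical positions and end up in two clusters, while the three paragraphs sit at different heights and remain unclustered. In practice `eps` and `min_samples` would need tuning per document source, which is consistent with the dynamic parameter adjustment the abstract lists as future work.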

Keywords
  • DBSCAN
  • Clinical data
  • Information extraction
  • PDF
  • Unsupervised
  • Natural Language Processing
  • Algorithms
  • Data Mining / methods
  • Humans
  • Electronic Health Records
  • Unsupervised Machine Learning
Citation (ISO format)
BENSALAH TALET, Adel Fouad et al. Unsupervised extraction of body-text from clinical PDF documents. In: Studies in health technology and informatics, 2024, vol. 316, p. 214–215. doi: 10.3233/SHTI240382
Identifiers
ISSN of the journal: 0926-9630
