Conference papers

Atténuer les erreurs de numérisation dans la reconnaissance d'entités nommées pour les documents historiques (Mitigating digitization errors in named entity recognition for historical documents)

Abstract : This paper tackles the task of named entity recognition (NER) applied to historical texts obtained by processing digital images of newspapers with optical character recognition (OCR) techniques. The main challenge for this task is that the OCR process introduces misspellings and linguistic errors into the output text, which can degrade NER performance. We conduct a comparative evaluation on two historical datasets, in German and French, against previous state-of-the-art models, and we propose a model based on a hierarchical stack of Transformers to approach the NER task for historical data. Our findings show that the proposed model clearly improves the results on both historical datasets.

https://hal.archives-ouvertes.fr/hal-03320332
Contributor : Antoine Doucet
Submitted on : Sunday, August 15, 2021 - 2:06:46 PM
Last modification on : Tuesday, October 19, 2021 - 6:22:58 PM
Long-term archiving on : Tuesday, November 16, 2021 - 6:05:22 PM

File

main(1).pdf
Files produced by the author(s)

Identifiers

  • HAL Id : hal-03320332, version 1

Citation

Emanuela Boros, Ahmed Hamdi, Elvys Linhares Pontes, Luis Adrián Cabrera-Diego, José G. Moreno, et al. Atténuer les erreurs de numérisation dans la reconnaissance d'entités nommées pour les documents historiques. Conférence en Recherche d'Informations et Applications (CORIA 2021), ARIA : Association Francophone de Recherche d’Information (RI) et Applications, Apr 2021, Grenoble (virtual), France. pp. 1-7. ⟨hal-03320332⟩
