Abstract:
|
[ES] Named entity recognition is a relevant problem in Natural Language Processing tasks, and it is therefore of particular relevance in the recognition of images of handwritten text. This problem is usually addressed on plain electronic texts, which do not tend to present noise problems. When the text to be processed is the result of a prior handwritten text recognition process, it is to be expected that the text will be noisy (misrecognized words, variants of the same word, etc.). In handwritten text recognition, the k best transcriptions can be obtained from a trained model. These k best transcriptions can be obtained both at test time and at training time. This work aims to study how named entity recognition can be made more robust by using the k best transcriptions. The results obtained will be evaluated with standard measures, and the experiments will be carried out on a database of old documents from the Archivo General de Simancas.
[EN] Named Entity Recognition (NER) in ancient handwritten texts is a challenging area of
research in the field of artificial intelligence and natural language processing. It consists
of identifying and classifying specific entities, such as names of people, places, dates,
organizations, etc., in handwritten texts in different ancient languages. This process
involves several challenges due to the variable nature of handwriting, the evolution of
language over time, inconsistent spelling, and the presence of abbreviations, among other
factors. To address these challenges, image processing and deep learning techniques
are applied, including convolutional and recurrent neural networks, which have achieved
competitive results for NER over the last decade.
In this master’s thesis, a recurrent neural network has been designed for
NER in 16th-century manuscript texts belonging to some pages of the ancient
collection of Books of records of royal decrees, located in the General Archive of Simancas,
one of the most important archives narrating the cultural, political and social
evolution of Spain throughout history. Specifically, the neural model learns distributed
representations of words and characters (referred to as embeddings) and is composed of
bidirectional long short-term memory modules (Bi-LSTM) with a conditional random field (CRF)
as the output layer. The results obtained at line level for the manual transcriptions reflect
generally good recognition performance on all types of entities (F1 score of 0.83 and
WTER of 5.4%), and specifically on person names and surnames (F1 scores of 0.94 and
0.89 respectively). The model has also been evaluated with the k-best
transcriptions of each line generated by a handwritten text recognition process, which may
fail to detect certain words present in the manuscripts. A 12% increase in WTER and a
0.20 drop in F1 score were observed for the best decodings (known as the 1-best), and a 7%
increase in WTER and a 0.09 drop in F1 score were observed after considering the 10 best
decodings (10-best).
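
For readers unfamiliar with the F1 figures quoted above, the following is a minimal sketch of a per-label F1 computation. It is not the evaluation code used in the thesis: the function name `f1_per_label`, the token-level comparison, the `"O"` marker for non-entity tokens, and the example labels are all assumptions made for illustration, and the sketch requires the reference and predicted tag sequences to have the same length.

```python
from collections import Counter


def f1_per_label(gold, pred, outside="O"):
    """Token-level F1 per entity label (illustrative, hypothetical helper).

    gold, pred: equal-length lists of tags, one per token.
    outside:    assumed tag for tokens that are not part of any entity.
    """
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")

    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            if g != outside:
                tp[g] += 1          # correct entity tag
        else:
            if p != outside:
                fp[p] += 1          # predicted a label that is wrong here
            if g != outside:
                fn[g] += 1          # missed the reference label

    scores = {}
    for label in set(tp) | set(fp) | set(fn):
        predicted = tp[label] + fp[label]
        actual = tp[label] + fn[label]
        precision = tp[label] / predicted if predicted else 0.0
        recall = tp[label] / actual if actual else 0.0
        total = precision + recall
        scores[label] = 2 * precision * recall / total if total else 0.0
    return scores


# Hypothetical tags for one transcribed line.
gold = ["name", "surname", "O", "place"]
pred = ["name", "O", "O", "place"]
scores = f1_per_label(gold, pred)
```

In this toy line the missed surname yields an F1 of 0 for that label while the other two labels score 1, which is the kind of per-entity breakdown the abstract reports for names and surnames.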
|