Abstract:
|
[EN] Content-based classification of manuscripts is an important task that is generally carried out by expert archivists. Nevertheless, many historical manuscript collections are so vast that in most cases this task is hardly feasible, even for large, well-staffed archives. Nowadays, manuscripts are generally preserved in the form of sets of digital images. Therefore, the technical problem we are interested in is the automatic classification of "image documents", each consisting of a set of untranscribed handwritten text images, by the textual contents of those images. The traditional Pattern Recognition classification paradigm does provide the basic tools to deal with this problem. However, in practice, the set of relevant classes of a large documental series is seldom known in advance. Therefore, a classifier trained with a predefined set of classes will systematically fail when new image documents arrive that do not belong to any of the classes assumed in training. Here we adopt the "Open Set Classification" framework to extend and consolidate our previous work on image document classification in order to adequately handle new documents from unknown classes. The proposed approaches are based on a relatively novel technology for text image representation known as "probabilistic indexing", which proves very effective at characterising the intrinsic word-level uncertainty exhibited by historical handwritten text images. We assess the performance of this approach on a moderately sized but representative dataset extracted from a huge series of complex notarial manuscripts from the Spanish Archivo Histórico Provincial de Cádiz, with good results.
|
Acknowledgements:
|
Work partially supported by: Universitat Politècnica de València under grant FPI-I/SP20190010, Generalitat Valenciana under project DeepPattern (PROMETEO/2019/121), grant PID2020-116813RB-I00 of MCIN/AEI/10.13039/501100011033, and a María Zambrano grant of the Spanish Ministerio de Universidades and the European Union NextGenerationEU/PRTR.
|