
Batch-adaptive rejection threshold estimation with application to OCR post-processing

RiuNet: Institutional repository of the Polytechnic University of Valencia



dc.contributor.author Navarro Cerdán, José Ramón es_ES
dc.contributor.author Arlandis Navarro, Joaquim Francesc es_ES
dc.contributor.author Llobet Azpitarte, Rafael es_ES
dc.contributor.author Perez-Cortes, Juan-Carlos es_ES
dc.date.accessioned 2015-10-29T09:13:57Z
dc.date.available 2015-10-29T09:13:57Z
dc.date.issued 2015-06-24
dc.identifier.issn 0957-4174
dc.identifier.uri http://hdl.handle.net/10251/56691
dc.description.abstract An OCR process is often followed by the application of a language model to find the best transformation of an OCR hypothesis into a string compatible with the constraints of the document, field or item under consideration. The cost of this transformation can be taken as a confidence value and compared to a threshold to decide if a string is accepted as correct or rejected in order to satisfy the need for bounding the error rate of the system. Widespread tools like ROC, precision-recall, or error-reject curves are commonly used along with fixed thresholding in order to achieve that goal. However, those methodologies fail when a test sample has a confidence distribution that differs from that of the sample used to train the system, which is a very frequent case in post-processed OCR strings (e.g., string batches showing particularly careful handwriting styles in contrast to free styles). In this paper, we propose an adaptive method for the automatic estimation of the rejection threshold that overcomes this drawback, allowing the operator to define an expected error rate within the set of accepted (non-rejected) strings of a complete batch of documents (as opposed to trying to establish or control the probability of error of a single string), regardless of its confidence distribution. The operator (expert) is assumed to know the error rate that is acceptable to the user of the resulting data. The proposed system transforms that knowledge into a suitable rejection threshold. The approach is based on the estimation of an expected error vs. transformation cost distribution. First, a model predicting the probability of a cost to arise from an erroneously transcribed string is computed from a sample of supervised OCR hypotheses. Then, given a test sample, a cumulative error vs. cost curve is computed and used to automatically set the appropriate threshold that meets the user-defined error rate on the overall sample.
The results of experiments on batches coming from different writing styles show very accurate error rate estimations where fixed thresholding clearly fails. An original procedure to generate distorted strings from a given language is also proposed and tested, which allows the use of the presented method in tasks where no real supervised OCR hypotheses are available to train the system. es_ES
dc.language English es_ES
dc.publisher Elsevier es_ES
dc.relation.ispartof Expert Systems with Applications es_ES
dc.rights All rights reserved es_ES
dc.subject Rejection threshold es_ES
dc.subject OCR post-processing es_ES
dc.subject Language models es_ES
dc.subject Weighted finite-state transducers es_ES
dc.subject Error vs. cost curve es_ES
dc.subject Cumulative error vs. cost curve es_ES
dc.subject OCR error-generation model es_ES
dc.subject.classification STATISTICS AND OPERATIONS RESEARCH es_ES
dc.subject.classification COMPUTER ARCHITECTURE AND TECHNOLOGY es_ES
dc.subject.classification COMPUTER LANGUAGES AND SYSTEMS es_ES
dc.title Batch-adaptive rejection threshold estimation with application to OCR post-processing es_ES
dc.type Article es_ES
dc.identifier.doi 10.1016/j.eswa.2015.06.022
dc.rights.accessRights Open es_ES
dc.contributor.affiliation Universitat Politècnica de València. Departamento de Estadística e Investigación Operativa Aplicadas y Calidad - Departament d'Estadística i Investigació Operativa Aplicades i Qualitat es_ES
dc.contributor.affiliation Universitat Politècnica de València. Departamento de Informática de Sistemas y Computadores - Departament d'Informàtica de Sistemes i Computadors es_ES
dc.contributor.affiliation Universitat Politècnica de València. Departamento de Sistemas Informáticos y Computación - Departament de Sistemes Informàtics i Computació es_ES
dc.description.bibliographicCitation Navarro Cerdan, JR.; Arlandis Navarro, JF.; Llobet Azpitarte, R.; Perez-Cortes, J. (2015). Batch-adaptive rejection threshold estimation with application to OCR post-processing. Expert Systems with Applications. 42(21):8111-8122. doi:10.1016/j.eswa.2015.06.022 es_ES
dc.description.accrualMethod S es_ES
dc.relation.publisherversion http://dx.doi.org/10.1016/j.eswa.2015.06.022 es_ES
dc.description.upvformatpinicio 8111 es_ES
dc.description.upvformatpfin 8122 es_ES
dc.type.version info:eu-repo/semantics/publishedVersion es_ES
dc.description.volume 42 es_ES
dc.description.issue 21 es_ES
dc.relation.senia 294506 es_ES
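
The two-step procedure outlined in the abstract (fit an error-vs-cost model on supervised hypotheses, then read a threshold off the cumulative expected error curve of a test batch) might be sketched as follows. This is only an illustration under stated assumptions: the abstract does not say how the error model is built, so an equal-width binned estimate with Laplace smoothing is swapped in here, and all function names (`fit_error_model`, `error_probability`, `estimate_threshold`) are hypothetical rather than taken from the paper.

```python
from bisect import bisect_right


def fit_error_model(costs, is_error, n_bins=10):
    """Estimate P(error | cost) from supervised samples.

    ASSUMPTION: equal-width bins with Laplace smoothing; the paper's
    actual model is not described in the abstract.
    """
    lo, hi = min(costs), max(costs)
    width = (hi - lo) / n_bins or 1.0  # fall back to 1.0 if all costs are equal
    edges = [lo + i * width for i in range(1, n_bins)]  # interior bin edges
    totals = [0] * n_bins
    errors = [0] * n_bins
    for cost, err in zip(costs, is_error):
        b = bisect_right(edges, cost)
        totals[b] += 1
        errors[b] += int(err)
    # Smoothing avoids 0/0 in empty bins and keeps estimates strictly in (0, 1).
    probs = [(errors[b] + 1) / (totals[b] + 2) for b in range(n_bins)]
    return edges, probs


def error_probability(model, cost):
    """Look up the estimated error probability for a transformation cost."""
    edges, probs = model
    return probs[bisect_right(edges, cost)]


def estimate_threshold(model, batch_costs, target_error_rate):
    """Return the largest cost threshold such that the expected error rate
    among accepted strings (cost <= threshold) does not exceed the target.
    Returns None if no threshold satisfies the target."""
    ordered = sorted(batch_costs)
    expected_errors = 0.0
    threshold = None
    for i, cost in enumerate(ordered):
        expected_errors += error_probability(model, cost)
        # Only evaluate after the last of a run of equal costs, since
        # accepting one of them means accepting all of them.
        last_of_ties = i + 1 == len(ordered) or ordered[i + 1] != cost
        if last_of_ties and expected_errors / (i + 1) <= target_error_rate:
            threshold = cost
    return threshold


# Toy supervised sample: low costs are correct, high costs are errors.
train_costs = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
train_errors = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
model = fit_error_model(train_costs, train_errors, n_bins=2)

# Unlabelled test batch: the threshold adapts to this batch's cost distribution.
batch = [8, 1, 3, 2]
threshold = estimate_threshold(model, batch, target_error_rate=0.2)
```

Because the threshold is recomputed from each batch's own costs, a batch dominated by low-cost strings can accept more of them than a batch dominated by high-cost strings at the same target error rate, which is the batch-level behaviour the abstract contrasts with fixed thresholding.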

