Batch-adaptive rejection threshold estimation with application to OCR post-processing

Navarro Cerdán, José Ramón; Arlandis Navarro, Joaquim Francesc; Llobet Azpitarte, Rafael; Perez-Cortes, Juan-Carlos

doi:10.1016/j.eswa.2015.06.022

Files in this item

Name: thresholdEstimati ...
Size: 545.5Kb
Format: PDF
Description: Author's version.

Name: 1-s2.0-S095741741 ...
Size: 1.201Mb
Format: PDF
Description: Publisher's version

dc.contributor.author	Navarro Cerdán, José Ramón	es_ES
dc.contributor.author	Arlandis Navarro, Joaquim Francesc	es_ES
dc.contributor.author	Llobet Azpitarte, Rafael	es_ES
dc.contributor.author	Perez-Cortes, Juan-Carlos	es_ES
dc.date.accessioned	2015-10-29T09:13:57Z
dc.date.available	2015-10-29T09:13:57Z
dc.date.issued	2015-06-24
dc.identifier.issn	0957-4174
dc.identifier.uri	http://hdl.handle.net/10251/56691
dc.description.abstract	An OCR process is often followed by the application of a language model to find the best transformation of an OCR hypothesis into a string compatible with the constraints of the document, field or item under consideration. The cost of this transformation can be taken as a confidence value and compared to a threshold to decide whether a string is accepted as correct or rejected, so that the error rate of the system can be bounded. Widespread tools such as ROC, precision-recall, or error-reject curves are commonly used along with fixed thresholding to achieve that goal. However, those methodologies fail when a test sample has a confidence distribution that differs from that of the sample used to train the system, which is a very frequent case in post-processed OCR strings (e.g., string batches showing particularly careful handwriting styles in contrast to free styles). In this paper, we propose an adaptive method for the automatic estimation of the rejection threshold that overcomes this drawback, allowing the operator to define an expected error rate within the set of accepted (non-rejected) strings of a complete batch of documents (as opposed to trying to establish or control the probability of error of a single string), regardless of its confidence distribution. The operator (expert) is assumed to know the error rate that is acceptable to the user of the resulting data. The proposed system transforms that knowledge into a suitable rejection threshold. The approach is based on the estimation of an expected error vs. transformation cost distribution. First, a model predicting the probability that a given cost arises from an erroneously transcribed string is computed from a sample of supervised OCR hypotheses. Then, given a test sample, a cumulative error vs. cost curve is computed and used to automatically set the appropriate threshold that meets the user-defined error rate on the overall sample. 
The results of experiments on batches coming from different writing styles show very accurate error rate estimations where fixed thresholding clearly fails. An original procedure to generate distorted strings from a given language is also proposed and tested, which allows the use of the presented method in tasks where no real supervised OCR hypotheses are available to train the system.	es_ES
dc.language	English	es_ES
dc.publisher	Elsevier	es_ES
dc.relation.ispartof	Expert Systems with Applications	es_ES
dc.rights	All rights reserved	es_ES
dc.subject	Rejection threshold	es_ES
dc.subject	OCR post-processing	es_ES
dc.subject	Language models	es_ES
dc.subject	Weighted finite-state transducers	es_ES
dc.subject	Error vs. cost curve	es_ES
dc.subject	Cumulative error vs. cost curve	es_ES
dc.subject	OCR error-generation model	es_ES
dc.subject.classification	Statistics and Operations Research	es_ES
dc.subject.classification	Computer Architecture and Technology	es_ES
dc.subject.classification	Computer Languages and Systems	es_ES
dc.title	Batch-adaptive rejection threshold estimation with application to OCR post-processing	es_ES
dc.type	Article	es_ES
dc.identifier.doi	10.1016/j.eswa.2015.06.022
dc.rights.accessRights	Open	es_ES
dc.contributor.affiliation	Universitat Politècnica de València. Departamento de Estadística e Investigación Operativa Aplicadas y Calidad - Departament d'Estadística i Investigació Operativa Aplicades i Qualitat	es_ES
dc.contributor.affiliation	Universitat Politècnica de València. Departamento de Informática de Sistemas y Computadores - Departament d'Informàtica de Sistemes i Computadors	es_ES
dc.contributor.affiliation	Universitat Politècnica de València. Departamento de Sistemas Informáticos y Computación - Departament de Sistemes Informàtics i Computació	es_ES
dc.description.bibliographicCitation	Navarro Cerdan, JR.; Arlandis Navarro, JF.; Llobet Azpitarte, R.; Perez-Cortes, J. (2015). Batch-adaptive rejection threshold estimation with application to OCR post-processing. Expert Systems with Applications. 42(21):8111-8122. doi:10.1016/j.eswa.2015.06.022	es_ES
dc.description.accrualMethod	S	es_ES
dc.relation.publisherversion	http://dx.doi.org/10.1016/j.eswa.2015.06.022	es_ES
dc.description.upvformatpinicio	8111	es_ES
dc.description.upvformatpfin	8122	es_ES
dc.type.version	info:eu-repo/semantics/publishedVersion	es_ES
dc.description.volume	42	es_ES
dc.description.issue	21	es_ES
dc.relation.senia	294506	es_ES
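
The two steps described in the abstract (an error-vs-cost model fitted on supervised hypotheses, then a cumulative error-vs-cost curve computed per batch) can be illustrated with a small sketch. This is not the authors' implementation: the histogram estimator, the function names `fit_error_model` and `estimate_threshold`, the bin edges, and the toy data are all assumptions made for illustration only.

```python
from bisect import bisect_left


def fit_error_model(costs, is_error, edges):
    """Estimate P(error | cost) per cost bin from a supervised sample.

    Assumption: a plain histogram stands in for whatever estimator the
    paper uses. `edges` are ascending upper bin edges; costs above the
    last edge fall into a final overflow bin.
    """
    n_bins = len(edges) + 1
    errors = [0] * n_bins
    totals = [0] * n_bins
    for cost, wrong in zip(costs, is_error):
        b = bisect_left(edges, cost)
        totals[b] += 1
        errors[b] += int(wrong)
    # Bins with no training data are treated pessimistically (always wrong).
    probs = [errors[b] / totals[b] if totals[b] else 1.0 for b in range(n_bins)]

    def p_error(cost):
        return probs[bisect_left(edges, cost)]

    return p_error


def estimate_threshold(batch_costs, p_error, target_error_rate):
    """Return the largest cost threshold (accept if cost <= threshold) whose
    expected error rate among accepted strings stays within the target,
    or None if no threshold qualifies."""
    ordered = sorted(batch_costs)
    expected_errors = 0.0
    best = None
    for accepted, cost in enumerate(ordered, start=1):
        expected_errors += p_error(cost)
        # Strings with equal cost are accepted or rejected together.
        if accepted < len(ordered) and ordered[accepted] == cost:
            continue
        if expected_errors / accepted <= target_error_rate:
            best = cost
    return best


# Hypothetical supervised sample: transformation costs and error labels.
train_costs = [0.1, 0.2, 0.3, 0.4, 1.1, 1.2, 1.3, 1.4, 2.1, 2.2, 2.3, 2.4]
train_wrong = [0, 0, 0, 0, 0, 1, 0, 0, 1, 1, 1, 0]
p_error = fit_error_model(train_costs, train_wrong, edges=[1.0, 2.0])

# Two hypothetical batches with different cost distributions.
careful_batch = [0.3, 0.1, 2.5, 0.2, 1.5, 0.5]
free_batch = [0.5, 1.2, 1.5, 1.8, 2.5, 2.6]
print(estimate_threshold(careful_batch, p_error, target_error_rate=0.06))
print(estimate_threshold(free_batch, p_error, target_error_rate=0.06))
```

With the same target error rate, the two toy batches end up with different thresholds, which is the behaviour a single fixed threshold cannot reproduce. Because the per-bin error probabilities need not increase with cost, the sketch scans the whole curve and keeps the largest qualifying threshold instead of stopping at the first violation.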

This item appears in the following collection(s)

Articles, conferences, monographs [48344]

RiuNet: Institutional Repository of the Universidad Politécnica de Valencia