Europarl-ASR: A Large Corpus of Parliamentary Debates for Streaming ASR Benchmarking and Speech Data Filtering/Verbatimization

Garcés Díaz-Munío, Gonçal; Silvestre Cerdà, Joan Albert; Jorge-Cano, Javier; Giménez Pastor, Adrián; Iranzo-Sánchez, Javier; Baquero-Arnal, Pau; Roselló, Nahuel; Pérez-González de Martos, Alejandro Manuel; Civera Saiz, Jorge; Sanchis Navarro, José Alberto; Juan, Alfons

doi:10.21437/Interspeech.2021-1905

Files in this item

Name: GarcesSilvestreJo ...

Size: 189.0Kb

Format: PDF

Description: Publisher's version

dc.contributor.author	Garcés Díaz-Munío, Gonçal	es_ES
dc.contributor.author	Silvestre Cerdà, Joan Albert	es_ES
dc.contributor.author	Jorge-Cano, Javier	es_ES
dc.contributor.author	Giménez Pastor, Adrián	es_ES
dc.contributor.author	Iranzo-Sánchez, Javier	es_ES
dc.contributor.author	Baquero-Arnal, Pau	es_ES
dc.contributor.author	Roselló, Nahuel	es_ES
dc.contributor.author	Pérez-González de Martos, Alejandro Manuel	es_ES
dc.contributor.author	Civera Saiz, Jorge	es_ES
dc.contributor.author	Sanchis Navarro, José Alberto	es_ES
dc.contributor.author	Juan, Alfons	es_ES
dc.date.accessioned	2023-03-08T06:48:28Z
dc.date.available	2023-03-08T06:48:28Z
dc.date.issued	2021-09-03	es_ES
dc.identifier.uri	http://hdl.handle.net/10251/192418
dc.description.abstract	[EN] We introduce Europarl-ASR, a large speech and text corpus of parliamentary debates including 1300 hours of transcribed speeches and 70 million tokens of text in English extracted from European Parliament sessions. The training set is labelled with the Parliament's non-fully-verbatim official transcripts, time-aligned. As verbatimness is critical for acoustic model training, we also provide automatically noise-filtered and automatically verbatimized transcripts of all speeches based on speech data filtering and verbatimization techniques. Additionally, 18 hours of transcribed speeches were manually verbatimized to build reliable speaker-dependent and speaker-independent development/test sets for streaming ASR benchmarking. The availability of manual non-verbatim and verbatim transcripts for dev/test speeches makes this corpus useful for the assessment of automatic filtering and verbatimization techniques. This paper describes the corpus and its creation, and provides off-line and streaming ASR baselines for both the speaker-dependent and speaker-independent tasks using the three training transcription sets. The corpus is publicly released under an open licence.	es_ES
dc.description.abstract	[CA] "Europarl-ASR: A large parliamentary reference corpus for speech recognition and transcript filtering/verbatimization": We present Europarl-ASR, a large speech and text corpus of parliamentary debates with 1300 hours of transcribed speeches and 70 million words of text in English extracted from European Parliament sessions. The European Parliament's official transcripts, which are not verbatim, have been time-aligned for the whole training set. Since acoustic model training requires transcripts that are as verbatim as possible, filtered transcripts and verbatimized transcripts of all speeches are also included, based on automatic filtering and verbatimization techniques. In addition, 18 hours of manually revised verbatim transcripts are included to define two reference development and test sets for streaming automatic speech recognition, one with known speakers and one with unknown speakers. Because it provides both verbatim and non-verbatim transcripts, this corpus is also well suited to the analysis of filtering and verbatimization techniques. This article describes the creation of the corpus and provides reference results for streaming and off-line automatic speech recognition, with known and unknown speakers, using the three training transcription sets. The corpus is publicly released under an open licence.	es_ES
dc.description.sponsorship	This work has received funding from the EU's H2020 research and innovation programme under grant agreements 761758 (X5gon) and 952215 (TAILOR); the Government of Spain's research project Multisub (RTI2018-094879-B-I00, MCIU/AEI/FEDER,EU) and FPU scholarships FPU14/03981 and FPU18/04135; the Generalitat Valenciana's research project Classroom Activity Recognition (PROMETEO/2019/111) and predoctoral research scholarship ACIF/2017/055; and the Universitat Politècnica de València's PAID-01-17 R&D support programme.	es_ES
dc.language	English	es_ES
dc.publisher	International Speech Communication Association (ISCA)	es_ES
dc.relation.ispartof	Proc. Interspeech 2021	es_ES
dc.rights	All rights reserved	es_ES
dc.subject	Automatic speech recognition	es_ES
dc.subject	Speech corpus	es_ES
dc.subject	Speech data filtering	es_ES
dc.subject	Speech data verbatimization	es_ES
dc.subject.classification	COMPUTER LANGUAGES AND SYSTEMS	es_ES
dc.title	Europarl-ASR: A Large Corpus of Parliamentary Debates for Streaming ASR Benchmarking and Speech Data Filtering/Verbatimization	es_ES
dc.type	Conference paper	es_ES
dc.identifier.doi	10.21437/Interspeech.2021-1905	es_ES
dc.relation.projectID	info:eu-repo/grantAgreement/AEI/Plan Estatal de Investigación Científica y Técnica y de Innovación 2017-2020/RTI2018-094879-B-I00/ES/SUBTITULACION MULTILINGUE DE CLASES DE AULA Y SESIONES PLENARIAS/	es_ES
dc.relation.projectID	info:eu-repo/grantAgreement/MIU//FPU18%2F04135/ES/NOVEL CONTRIBUTIONS TO NEURAL SPEECH TRANSLATION/	es_ES
dc.relation.projectID	info:eu-repo/grantAgreement/EC/H2020/761758/EU/X5gon: Cross Modal, Cross Cultural, Cross Lingual, Cross Domain, and Cross Site Global OER Network/X5gon	es_ES
dc.relation.projectID	info:eu-repo/grantAgreement/GVA//PROMETEO%2F2019%2F111/ES/CLASSROOM ACTIVITY RECOGNITION/	es_ES
dc.relation.projectID	info:eu-repo/grantAgreement/EC/H2020/952215/EU/Foundations of Trustworthy AI - Integrating Reasoning, Learning and Optimization/TAILOR	es_ES
dc.relation.projectID	info:eu-repo/grantAgreement/GVA//ACIF%2F2017%2F055/ES/Subvenciones para la contratación de personal investigador de carácter predoctoral	es_ES
dc.relation.projectID	info:eu-repo/grantAgreement/MECD/Plan Estatal de Investigación Científica y Técnica y de Innovación 2013-2016 en I+D+i/FPU14%2F03981/ES/Ayudas para la formación de profesorado universitario de los subprogramas de Formación y Movilidad	es_ES
dc.relation.projectID	info:eu-repo/grantAgreement/UPV/Programas de Apoyo a la I+D+i/PAID-01-17/ES/Ayudas para Contratos de Acceso de personal investigador doctor en estructuras de investigación de la Universitat Politècnica de València 2017- Subprograma 1/	es_ES
dc.rights.accessRights	Open	es_ES
dc.contributor.affiliation	Universitat Politècnica de València. Escuela Politécnica Superior de Alcoy - Escola Politècnica Superior d'Alcoi	es_ES
dc.contributor.affiliation	Universitat Politècnica de València. Escola Tècnica Superior d'Enginyeria Informàtica	es_ES
dc.description.bibliographicCitation	Garcés Díaz-Munío, G.; Silvestre Cerdà, JA.; Jorge-Cano, J.; Giménez Pastor, A.; Iranzo-Sánchez, J.; Baquero-Arnal, P.; Roselló, N.... (2021). Europarl-ASR: A Large Corpus of Parliamentary Debates for Streaming ASR Benchmarking and Speech Data Filtering/Verbatimization. International Speech Communication Association (ISCA). 3695-3699. https://doi.org/10.21437/Interspeech.2021-1905	es_ES
dc.description.accrualMethod	S	es_ES
dc.relation.conferencename	22nd Annual Conference of the International Speech Communication Association (INTERSPEECH 2021)	es_ES
dc.relation.conferencedate	August 30-September 03, 2021	es_ES
dc.relation.conferenceplace	Brno, Czechia	es_ES
dc.relation.publisherversion	https://doi.org/10.21437/Interspeech.2021-1905	es_ES
dc.description.upvformatpinicio	3695	es_ES
dc.description.upvformatpfin	3699	es_ES
dc.type.version	info:eu-repo/semantics/publishedVersion	es_ES
dc.relation.pasarela	S\445607	es_ES
dc.contributor.funder	Universitat Politècnica de València	es_ES

RiuNet: Institutional Repository of the Universitat Politècnica de València