Speech synthesis (TTS)
Automatic speech recognition (ASR)
Text-to-speech
Speech-to-speech translation
Deep learning
Machine learning
Artificial intelligence
Natural language processing
Technology enhanced learning
Video lectures
Accessibility
In recent years, deep learning has fundamentally changed the landscape of a number of areas in artificial intelligence, including computer vision, natural language processing, robotics, and game theory. In particular, the striking success of deep learning in a wide variety of natural language processing (NLP) applications, including automatic speech recognition (ASR), machine translation (MT), and text-to-speech (TTS), has resulted in major accuracy improvements, thus widening the applicability of these technologies in real-life settings. At this point, it is clear that ASR and MT technologies can be used to produce cost-effective, high-quality multilingual subtitles for video content of many kinds. This is particularly true for the transcription and translation of video lectures and other educational materials, where the audio recording conditions are usually favorable for the ASR task and the speech is grammatically well formed. However, although state-of-the-art neural approaches to TTS have been shown to drastically improve the naturalness and quality of synthetic speech over conventional concatenative and parametric systems, it is still unclear whether this technology is mature enough to improve accessibility and engagement in online learning, particularly in the context of higher education. Furthermore, advanced topics in TTS such as cross-lingual voice cloning, incremental TTS, and zero-shot speaker adaptation remain open challenges in the field. This thesis is concerned with enhancing the performance and widening the applicability of modern neural TTS technologies in real-life settings, both in offline and streaming conditions, in the context of improving accessibility and engagement in online learning. Particular emphasis is therefore placed on speaker adaptation and cross-lingual voice cloning, since in this setting the input text corresponds to a translated utterance originally spoken in another language.
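Concretely, the abstract describes a cascaded speech-to-speech pipeline: lecture audio is transcribed by ASR, the transcript is translated by MT, and the translation is synthesized by a TTS model conditioned on the original lecturer's voice. The following minimal Python sketch only illustrates how these stages fit together; every name in it (recognize, translate, embed_speaker, synthesize, Segment) is a hypothetical stub for illustration, not the thesis's actual models or any real library's API.

from dataclasses import dataclass
from typing import List

@dataclass
class Segment:
    text: str
    speaker_embedding: List[float]  # fixed-size vector identifying the voice

def recognize(audio: bytes) -> str:
    """Hypothetical ASR stub: source-language audio -> transcript."""
    return "hola y bienvenidos a esta leccion"

def translate(text: str, src: str, tgt: str) -> str:
    """Hypothetical MT stub: source-language text -> target-language text."""
    return "hello and welcome to this lecture"

def embed_speaker(audio: bytes) -> List[float]:
    """Hypothetical speaker encoder: reference audio -> speaker embedding.
    Zero-shot adaptation means this also works for speakers unseen in training."""
    return [0.1] * 256

def synthesize(segment: Segment, language: str) -> bytes:
    """Hypothetical neural TTS stub conditioned on a speaker embedding, so the
    translated text is spoken in the original lecturer's voice
    (cross-lingual voice cloning)."""
    return b"\x00" * 16000  # placeholder: raw audio bytes

def speech_to_speech(audio: bytes, src: str, tgt: str) -> bytes:
    transcript = recognize(audio)
    translation = translate(transcript, src, tgt)
    segment = Segment(text=translation, speaker_embedding=embed_speaker(audio))
    return synthesize(segment, language=tgt)

if __name__ == "__main__":
    dubbed = speech_to_speech(b"...", src="es", tgt="en")
    print(f"synthesized {len(dubbed)} bytes of target-language audio")

The point of the sketch is the data flow: the speaker embedding is extracted from the source-language audio and reused when synthesizing the target language, which is what makes the voice cloning cross-lingual and, when the speaker encoder generalizes to unseen voices, zero-shot.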
All rights reserved
Open access
Alfonso Juan Císcar
José Alberto Sanchis Navarro
http://hdl.handle.net/10251/184019
Universitat Politècnica de València
English
Abstract
Resumen
Resum
Contents
1 Introduction
  1.1 Framework and motivation
  1.2 Scientific and technological goals
  1.3 Document structure
2 Preliminaries
  2.1 Machine Learning
  2.2 Sequence-to-Sequence with Attention Mechanism
  2.3 Transformer
  2.4 Generative Adversarial Networks
  2.5 Automatic Speech Recognition
  2.6 Machine Translation
  2.7 Text-To-Speech
    2.7.1 Text-to-spectrogram
    2.7.2 Spectrogram-to-wave
    2.7.3 Evaluation metrics
  2.8 Speech-To-Speech Translation
    2.8.1 Streaming ASR
    2.8.2 Simultaneous MT
    2.8.3 Incremental TTS
3 Cross-lingual Voice Cloning with Tacotron 2
  3.1 Introduction
  3.2 Tacotron 2
  3.3 Extending Tacotron 2 with cross-lingual voice cloning capabilities
  3.4 Overcoming the exposure bias and attention failures
  3.5 Improving stop token prediction
  3.6 Proposed model and general training procedure
  3.7 Conclusions
4 Cross-lingual Voice Cloning for UPV[Media]
  4.1 Introduction
  4.2 The UPV[Media] platform
  4.3 The Docència en Xarxa multilingual TTS dataset
  4.4 Model training
  4.5 Evaluation
    4.5.1 Naturalness
    4.5.2 Speaker similarity
    4.5.3 Real or synthetic
    4.5.4 Questionnaire and comments
  4.6 Conclusions
5 Robust, Efficient and Controllable Neural Text-To-Speech
  5.1 Introduction
  5.2 Non-autoregressive TTS with explicit duration modeling
  5.3 GAN-based neural vocoders
  5.4 The Blizzard Challenge 2021
    5.4.1 Introduction
    5.4.2 Data processing
    5.4.3 Forced-aligner autoencoder model
    5.4.4 Acoustic model
    5.4.5 Vocoder model
    5.4.6 Subjective results
  5.5 Conclusions
6 Simultaneous Speech-To-Speech Translation
  6.1 Introduction
  6.2 The Europarl-ST dataset
  6.3 Streaming ASR
  6.4 Simultaneous Machine Translation
  6.5 Incremental Multilingual Text-To-Speech
    6.5.1 Adapted prefix-to-prefix framework
    6.5.2 Model architecture
    6.5.3 Experiments
    6.5.4 Evaluation
  6.6 S2S latency evaluation
  6.7 Conclusions
7 Zero-Shot Speaker Adaptation
  7.1 Introduction
  7.2 Speaker conditioning via transfer learning
  7.3 The LibriTTS multi-speaker English corpus
  7.4 Proposed zero-shot multi-speaker architecture
  7.5 Least Squares Generative Adversarial Networks for TTS acoustic modeling
  7.6 Experiments
  7.7 Evaluation
  7.8 Integration into UPV[Media] transcription and translation pipeline
  7.9 Conclusions
8 Conclusions and future work
  8.1 Scientific and technological achievements
  8.2 Publications
  8.3 Future work
List of figures
List of tables
Bibliography