- -

Query-by-Example Spoken Term Detection ALBAYZIN 2012 evaluation: overview, systems, results and discussion

RiuNet: Institutional repository of the Polithecnic University of Valencia

Share/Send to

Cited by

Statistics

Query-by-Example Spoken Term Detection ALBAYZIN 2012 evaluation: overview, systems, results and discussion

Show simple item record

Files in this item

dc.contributor.author Tejedor, Javier es_ES
dc.contributor.author Toledano, Doroteo T. es_ES
dc.contributor.author Anguera, Xavier es_ES
dc.contributor.author Varona, Amparo es_ES
dc.contributor.author Hurtado Oliver, Lluis Felip es_ES
dc.contributor.author Miguel, Antonio es_ES
dc.contributor.author Colás, José es_ES
dc.date.accessioned 2014-09-29T11:42:50Z
dc.date.available 2014-09-29T11:42:50Z
dc.date.issued 2013-09-17
dc.identifier.issn 1687-4722
dc.identifier.uri http://hdl.handle.net/10251/40402
dc.description The final publication is available at Springer via http://dx.doi.org/10.1186/1687-4722-2013-23 es_ES
dc.description.abstract Query-by-Example Spoken Term Detection (QbE STD) aims at retrieving data from a speech data repository given an acoustic query containing the term of interest as input. Nowadays, it has been receiving much interest due to the high volume of information stored in audio or audiovisual format. QbE STD differs from automatic speech recognition (ASR) and keyword spotting (KWS)/spoken term detection (STD) since ASR is interested in all the terms/words that appear in the speech signal and KWS/STD relies on a textual transcription of the search term to retrieve the speech data. This paper presents the systems submitted to the ALBAYZIN 2012 QbE STD evaluation held as a part of ALBAYZIN 2012 evaluation campaign within the context of the IberSPEECH 2012 Conferencea. The evaluation consists of retrieving the speech files that contain the input queries, indicating their start and end timestamps within the appropriate speech file. Evaluation is conducted on a Spanish spontaneous speech database containing a set of talks from MAVIR workshopsb, which amount at about 7 h of speech in total. We present the database metric systems submitted along with all results and some discussion. Four different research groups took part in the evaluation. Evaluation results show the difficulty of this task and the limited performance indicates there is still a lot of room for improvement. The best result is achieved by a dynamic time warping-based search over Gaussian posteriorgrams/posterior phoneme probabilities. This paper also compares the systems aiming at establishing the best technique dealing with that difficult task and looking for defining promising directions for this relatively novel task. es_ES
dc.language Español es_ES
dc.publisher SpringerOpen es_ES
dc.relation.ispartof EURASIP Journal on Audio, Speech, and Music Processing es_ES
dc.rights Reconocimiento (by) es_ES
dc.subject Query-by-example es_ES
dc.subject Spoken term detection es_ES
dc.subject International evaluation es_ES
dc.subject Search on spontaneous speech es_ES
dc.subject.classification LENGUAJES Y SISTEMAS INFORMATICOS es_ES
dc.title Query-by-Example Spoken Term Detection ALBAYZIN 2012 evaluation: overview, systems, results and discussion es_ES
dc.type Artículo es_ES
dc.identifier.doi 10.1186/1687-4722-2013-23
dc.rights.accessRights Abierto es_ES
dc.contributor.affiliation Universitat Politècnica de València. Departamento de Sistemas Informáticos y Computación - Departament de Sistemes Informàtics i Computació es_ES
dc.description.bibliographicCitation Tejedor, J.; Toledano, DT.; Anguera, X.; Varona, A.; Hurtado Oliver, LF.; Miguel, A.; Colás, J. (2013). Query-by-Example Spoken Term Detection ALBAYZIN 2012 evaluation: overview, systems, results and discussion. EURASIP Journal on Audio, Speech, and Music Processing. (23):1-17. doi:10.1186/1687-4722-2013-23 es_ES
dc.description.accrualMethod S es_ES
dc.relation.publisherversion http://link.springer.com/article/10.1186/1687-4722-2013-23 es_ES
dc.description.upvformatpinicio 1 es_ES
dc.description.upvformatpfin 17 es_ES
dc.type.version info:eu-repo/semantics/publishedVersion es_ES
dc.description.issue 23 es_ES
dc.relation.senia 261848
dc.relation.references Zhang T, Kuo CCJ: Hierarchical classification of audio data for archiving and retrieving. In Proceedings of ICASSP. Phoenix; 15–19 March 1999:3001-3004. es_ES
dc.relation.references Helén M, Virtanen T: Query by example of audio signals using Euclidean distance between Gaussian Mixture Models. In Proceedings of ICASSP. Honolulu; 15–20 April 2007:225-228. es_ES
dc.relation.references Helén M, Virtanen T: Audio query by example using similarity measures between probability density functions of features. EURASIP J. Audio Speech Music Process 2010, 2010: 2:1-2:12. es_ES
dc.relation.references Tzanetakis G, Ermolinskyi A, Cook P: Pitch histograms in audio and symbolic music information retrieval. In Proceedings of the Third International Conference on Music Information Retrieval: ISMIR. Paris; 2002:31-38. es_ES
dc.relation.references Tsai HM, Wang WH: A query-by-example framework to retrieve music documents by singer. In Proceedings of the IEEE International Conference on Multimedia and Expo. Taipei; 27–30 June 2004:1863-1866. es_ES
dc.relation.references Chia TK, Sim KC, Li H, Ng HT: A lattice-based approach to query-by-example spoken document retrieval. In Proceedings of the 31st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. Singapore; 20–24 July 2008:363-370. es_ES
dc.relation.references Tejedor J, Fapšo M, Szöke I, Černocký H, Grézl F: Comparison of methods for language-dependent and language-independent query-by-example spoken term detection. ACM Trans. Inf. Syst 2012, 30(3):18:1-18:34. es_ES
dc.relation.references Muscariello A, Gravier G, Bimbot F: Zero-resource audio-only spoken term detection based on a combination of template matching techniques. In Proceedings of Interspeech. Florence; 27–31 August 2011:921-924. es_ES
dc.relation.references Lin H, Stupakov A, Bilmes J: Spoken keyword spotting via multi-lattice alignment. In 9th International Speech Communication Association Annual Conference. Brisbane, Australia; September 2008:2191-2194. es_ES
dc.relation.references Parada C, Sethy A, Ramabhadran B: Query-by-Example Spoken Term Detection for OOV terms. In Proceedings of ASRU. Merano; 13-17 December 2009:404-409. es_ES
dc.relation.references Shen W, White TJ, Hazen CM: A comparison of Query-by-Example methods for Spoken Term Detection. In Proceedings of Interspeech. Brighton; September 2009:2143-2146. es_ES
dc.relation.references Lin H, Stupakov A, Bilmes J: Improving multi-lattice alignment based spoken keyword spotting. In Proceedings of ICASSP. Taipei; 19–24 April 2009:4877-4880. es_ES
dc.relation.references Barnard E, Davel M, van Heerden C, Kleynhans N, Bali K: Phone recognition for spoken web search. In Proceedings of MediaEval. Pisa; 1–2 September 2011:5-6. es_ES
dc.relation.references Buzo A, Cucu H, Safta M, Ionescu B, Burileanu C: ARF@MediaEval 2012: a Romanian ASR-based approach to spoken term detection. In Proceedings of MediaEval. Pisa; 4–5 October 2012:7-8. es_ES
dc.relation.references Abad A, Astudillo RF: The L2F spoken web search system for MediaEval 2012. In Proceedings of MediaEval. Pisa; 4–5 October 2012:9-10. es_ES
dc.relation.references Varona A, Penagarikano M, Rodríguez-Fuentes L, Bordel L, Diez M: GTTS system for the spoken web search task at MediaEval 2012. In Proceedings of MediaEval. Pisa; 4–5 October 2012:13-14. es_ES
dc.relation.references Szöke I, Faps̆o M, Veselý K: BUT2012 Approaches for spoken web search - MediaEval 2012. In Proceedings of MediaEval. Pisa;4–5October 2012:15-16. es_ES
dc.relation.references Hazen W, Shen TJ, White CM: Query-by-Example spoken term detection using phonetic posteriorgram templates. In Proceedings of ASRU. Merano; 13–17 December 2009:421-426. es_ES
dc.relation.references Zhang Y, Glass JR: Unsupervised spoken keyword spotting via segmental DTW on Gaussian Posteriorgrams. In Proceedings of ASRU. Merano; 13–17 December 2009:398-403. es_ES
dc.relation.references Chan C, Lee L: Unsupervised spoken-term detection with spoken queries using segment-based dynamic time warping. In Proceedings of Interspeech. Makuhari; 26–30 September 2010:693-696. es_ES
dc.relation.references Anguera X, Macrae R, Oliver N: Partial sequence matching using an unbounded dynamic time warping algorithm. In Proceedings of ICASSP. Dallas; 14–19 March 2010:3582-3585. es_ES
dc.relation.references Anguera X: Telefonica system for the spoken web search Task at Mediaeval 2011. In Proceedings of MediaEval. Pisa; 1–2 September 2011:3-4. es_ES
dc.relation.references Muscariello A, Gravier G: Irisa MediaEval 2011 spoken web search system. In Proceedings of MediaEval. Pisa; 1–2 September 2011:9-10. es_ES
dc.relation.references Szöke I, Tejedor J, Faps̆o M, Colás J: BUT-HCTLab approaches for spoken web search - MediaEval 2011. In Proceedings of MediaEval. Pisa; 1–2 September 2011:11-12. es_ES
dc.relation.references Wang H, Lee T: CUHK System for the spoken web search task at Mediaeval 2012. In Proceedings of MediaEval. Pisa; 4–5 October 2012:3-4. es_ES
dc.relation.references Joder C, Weninger F, Wöllmer M, Schuller M: The TUM cumulative DTW approach for the Mediaeval 2012 spoken web search task. In Proceedings of MediaEval. Pisa; 4–5 October 2012:5-6. es_ES
dc.relation.references Vavrek J, Pleva M, Juhár J: TUKE MediaEval 2012: spoken web search using DTW and unsupervised SVM. In Proceedings of MediaEval. Pisa; 4–5 October 2012:11-12. es_ES
dc.relation.references Jansen A, Durme P, Clark BV: The JHU-HLTCOE spoken web search system for MediaEval 2012. In Proceedings of MediaEval. Pisa; 4–5 October 2012:17-18. es_ES
dc.relation.references Anguera X: Telefonica Research System for the spoken web search task at Mediaeval 2012. In Proceedings of MediaEval. Pisa; 4–5 October 2012:19-20. es_ES
dc.relation.references NIST: The Ninth Text REtrieval Conference (TREC 9). 2000. http://trec.nist.gov . Accessed 16 September 2013 es_ES
dc.relation.references NIST: The Spoken Term Detection (STD) 2006 Evaluation Plan. 10 (National Institute of Standards and Technology (NIST), Gaithersburg, 2006). . Accessed 16 September 2013 http://www.nist.gov/speech/tests/std es_ES
dc.relation.references Sakai T, Joho H: Overview of NTCIR-9. Proceedings of NTCIR-9 Workshop 2011, 1-7. es_ES
dc.relation.references Rajput N, Metze F: Spoken web search. In Proceedings of MediaEval. Pisa; 1–2 September 2011:1-2. es_ES
dc.relation.references Metze F, Barnard E, Davel M, van Heerden C, Anguera X, Gravier G, Rajput N: Spoken web search. In Proceedings of MediaEval. Pisa; 4–5 October 2012:1-2. es_ES
dc.relation.references Tokyo University of Technology: Evaluation of information access technologies: information retrieval, question answering and cross-lingual information access. 2013. http://research.nii.ac.jp/ntcir/ntcir-10/ . Accessed 16 September 2013 es_ES
dc.relation.references NIST: The OpenKWS13 evaluation plan. 1, (National Institute of Standards and Technology (NIST), Gaithersburg, 2013). . Accessed 16 September 2013 http://www.nist.gov/itl/iad/mig/openkws13.cfm es_ES
dc.relation.references Taras B, Nadeu C: Audio segmentation of broadcast news in the Albayzin-2010 evaluation: overview, results, and discussion. EURASIP J. Audio Speech Music Process 2011, 1: 1-10. es_ES
dc.relation.references Zelenák M, Schulz H, Hernando J: Speaker diarization of broadcast news in Albayzin 2010 evaluation campaign. EURASIP J. Audio Speech Music Process 2012, 19: 1-9. es_ES
dc.relation.references Rodríguez-Fuentes LJ, Penagarikano M, Varona A, Díez M, Bordel G: The Albayzin 2010 language recognition evaluation. In Proceedings of Interspeech. Florence; 27–31 August 2011:1529-1532. es_ES
dc.relation.references Méndez F, Docío L, Arza M, Campillo F: The Albayzin 2010 text-to-speech evaluation. In Proceedings of FALA. Vigo; November 2010:317-340. es_ES
dc.relation.references Fiscus JG, Ajot J, Garofolo JS, Doddington G: Results of the 2006 spoken term detection evaluation. In Proceedings of SIGIR Workshop Searching Spontaneous Conversational Speech. Rhodes; 22–25 September 2007:45-50. es_ES
dc.relation.references Martin A, Doddington G, Kamm T, Ordowski M, Przybocki M: The DET curve in assessment of detection task performance. In Proceedings of Eurospeech. Rhodes; 22-25 September 1997:1895-1898. es_ES
dc.relation.references NIST: NIST Speech Tools and APIs: 2006 (National Institute of Standards and Technology (NIST), Gaithersburg, 1996). . Accessed 16 September 2013 http://www.nist.gov/speech/tools/index.htm es_ES
dc.relation.references Iberspeech 2012: VII Jornadas en Tecnología del Habla and III Iberian SLTech Workshop. . Accessed 16 September 2013 http://iberspeech2012.ii.uam.es/IberSPEECH2012_OnlineProceedings.pdf es_ES
dc.relation.references Anguera X: Speaker independent discriminant feature extraction for acoustic pattern-matching. In Proceedings of ICASSP. Kyoto; 25–30 March 2012:485-488. es_ES
dc.relation.references Anguera X, Ferrarons M: Memory efficient subsequence DTW for Query-by-Example spoken term detection. Proceedings of ICME 2013. http://www.xavieranguera.com/papers/sdtw_icme2013.pdf es_ES
dc.relation.references Anguera X: Telefonica Research System for the Query-by-example task at Albayzin 2012. In Proceedings of IberSPEECH. Madrid, Spain; 21–23 November 2012:626-632. es_ES
dc.relation.references Schwarz P: Phoneme recognition based on long temporal context. PhD Thesis, FIT, BUT, Brno, Czech Republic. 2008. es_ES
dc.relation.references Stolckem A: SRILM - an extensible language modeling toolkit. In Proceedings of Interspeech. Denver; 2002:901-904. es_ES
dc.relation.references Wang D, King S, Frankel J: Stochastic pronunciation modelling for out-of-vocabulary spoken term detection. IEEE Trans. Audio Speech Language Process 2011, 19(4):688-698. es_ES
dc.relation.references Wang D, Tejedor J, King S, Frankel J: Term-dependent confidence normalization for out-of-vocabulary spoken term detection. J. Comput. Sci. Technol 2012, 27(2):358-375. 10.1007/s11390-012-1228-x es_ES
dc.relation.references Wang D, King S, Frankel J, Vipperla R, Evans N, Troncy R: Direct posterior confidence for out-of-vocabulary spoken term detection. ACM Trans. Inf. Syst 2012, 30(3):1-34. es_ES
dc.relation.references Varona A, Penagarikano M, Rodríguez-Fuentes LJ, Bordel G, Diez M: GTTS systems for the query-by-example spoken term detection task of the Albayzin 2012 search on speech evaluation. In Proceedings of IberSPEECH. Madrid, Spain; 21–23 November 2012:619-625. es_ES
dc.relation.references Gómez J, Sanchis E, Castro-Bleda M: Automatic speech segmentation based on acoustical clustering. Proceedings of the Joint IAPR International Conference on Structural, Syntactic, and Statistical Pattern Recognition 2010, 540-548. es_ES
dc.relation.references Gómez J, Castro M: Automatic segmentation of speech at the phonetic level. Proceedings of the joint IAPR International Workshop on Structural, Syntactic, and Statistical Pattern Recognition 2002, 672-680. es_ES
dc.relation.references Sanchis E, Hurtado LF, Gómez JA, Calvo M, Fabra R: The ELiRF Query-by-example STD systems for the Albayzin 2012 search on speech evaluation. In Proceedings of IberSPEECH. Madrid, Spain; 21–23 November 2012:611-618. es_ES
dc.relation.references Park A, Glass J: Towards unsupervised pattern discovery in speech. In Proceedings of ASRU. Cancun; 27 November to 1 December 2005:53-58. es_ES
dc.relation.references Young S, Evermann G, Gales M, Hain T, Kershaw D, Liu X, Moore G, Odell J, Ollason D, Povey D, Valtchev V, Woodland P: The HTK Book. Engineering Department, Cambridge University; 2006. es_ES
dc.relation.references Miguel A, Villalba J, Ortega A, Lleida E: Albayzin 2012 search on speech @ ViVoLab UZ. In Proceedings of IberSPEECH. Madrid, Spain; 21–23 November 2012:633-642. es_ES
dc.relation.references Boersma P, Weenink D: Praat: Doing Phonetics by Computer. University of Amsterdam, Spuistraat, 210, Amsterdam, Holland. 2007. http://www.fon.hum.uva.nl/praat/ . Accessed 16 September 2013 es_ES
dc.relation.references Goldwater S, Jurafsky D, Maning CD: Which words are hard to recognize? Prosodic, lexical, and disfluency factors that increase speech recognition error rates. Speech Commun 2009, 52(3):181-200. es_ES
dc.relation.references Mertens T, Wallace R, Schneider D: Cross-site combination and evaluation of subword spoken term detection systems. In Proceedings of CBMI. Madrid; 13–15 June 2011:61-66. es_ES


This item appears in the following Collection(s)

Show simple item record