Voice-based information retrieval: how far are we from the text-based information retrieval?

Although network content access is primarily text-based today, almost every role that text plays could also be fulfilled by voice. Voice-based information retrieval refers to the situation in which the user query, the content to be retrieved, or both are in the form of voice. This paper compares voice-based information retrieval with the currently very successful text-based information retrieval, and identifies two major areas in which voice-based information retrieval lags far behind: retrieval accuracy and user-system interaction. These two issues are reviewed, analyzed and discussed in detail. We find that many good approaches have been proposed and substantial improvements have been achieved, although there is still a long way to go. Finally, a few successful prototype systems, among many others, are presented.
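To make the retrieval-accuracy gap concrete: text retrieval matches query terms against exact words, but spoken content is only reachable through an error-prone speech recognizer. The sketch below is a minimal illustration under stated assumptions. It uses a toy confusion network with hypothetical posterior values (not data from this paper) and contrasts counting a term in the recognizer's 1-best transcript with counting its expected occurrences over all competing hypotheses. This is one general family of techniques for indexing uncertain recognition output, not a method defined in this abstract.

```python
from typing import Dict, List

# A confusion network: a sequence of slots, each mapping competing word
# hypotheses to posterior probabilities. The values below are hypothetical.
ConfusionNetwork = List[Dict[str, float]]


def one_best(cn: ConfusionNetwork) -> List[str]:
    """Keep only the highest-posterior word in each slot (the 1-best transcript)."""
    return [max(slot, key=slot.get) for slot in cn]


def one_best_count(cn: ConfusionNetwork, term: str) -> int:
    """Hard term frequency: what exact-match text retrieval sees in the 1-best output."""
    return one_best(cn).count(term)


def expected_count(cn: ConfusionNetwork, term: str) -> float:
    """Soft term frequency: the term's posteriors summed over all slots."""
    return sum(slot.get(term, 0.0) for slot in cn)


# Toy spoken document in which the recognizer slightly preferred "retrial"
# over the word actually spoken, "retrieval".
doc: ConfusionNetwork = [
    {"voice": 0.9, "choice": 0.1},
    {"based": 1.0},
    {"information": 0.8, "formation": 0.2},
    {"retrial": 0.55, "retrieval": 0.45},
]

for term in ["retrieval", "information"]:
    print(term, one_best_count(doc, term), round(expected_count(doc, term), 2))
```

Here a query for "retrieval" gets a 1-best count of 0, so an exact-match index would miss the document entirely, while the expected count keeps partial evidence (0.45) that a ranking function can still use.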
