Automatic key term extraction from spoken course lectures using branching entropy and prosodic/semantic features

This paper proposes a set of approaches for automatically extracting key terms from spoken course lectures, drawing on audio signals, ASR transcriptions, and slides. We divide key terms into two types, key phrases and keywords, and develop a different approach for each, extracting them in that order. Key phrases are extracted using right/left branching entropy; keywords are extracted by learning from three sets of features: prosodic features, lexical features, and semantic features derived from Probabilistic Latent Semantic Analysis (PLSA). The learning approaches include one unsupervised method (K-means exemplar) and two supervised methods (AdaBoost and neural network). Preliminary results on a corpus of course lectures are very encouraging and indicate that all of the proposed approaches and feature sets are useful.
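
To make the branching-entropy idea concrete, the following is a minimal sketch and not the paper's implementation. It assumes the common definition in which the right branching entropy of an n-gram is the entropy of the distribution of tokens that immediately follow it, with the left variant defined symmetrically over preceding tokens. The function names, the whitespace tokenization, and the toy input are all hypothetical. A low entropy suggests the n-gram sits inside a longer phrase, while a high entropy suggests a phrase boundary.

```python
import math
from collections import Counter, defaultdict


def right_branching_entropy(tokens, n):
    """Map each n-gram to the entropy (in bits) of the token that follows it."""
    followers = defaultdict(Counter)
    for i in range(len(tokens) - n):
        gram = tuple(tokens[i:i + n])
        followers[gram][tokens[i + n]] += 1

    entropy = {}
    for gram, counts in followers.items():
        total = sum(counts.values())
        entropy[gram] = -sum(
            (c / total) * math.log2(c / total) for c in counts.values()
        )
    return entropy


def left_branching_entropy(tokens, n):
    """Same quantity over preceding tokens, computed on the reversed sequence."""
    reversed_entropy = right_branching_entropy(tokens[::-1], n)
    return {gram[::-1]: h for gram, h in reversed_entropy.items()}


# Toy token sequence (illustrative only, not taken from the lecture corpus).
tokens = "hidden markov model a hidden markov chain b hidden markov model".split()
right_1 = right_branching_entropy(tokens, 1)
right_2 = right_branching_entropy(tokens, 2)
left_1 = left_branching_entropy(tokens, 1)
```

In this toy sequence "hidden" is always followed by "markov", so its right branching entropy is zero, whereas "hidden markov" is followed by more than one distinct token and gets a positive value. How the entropies are thresholded or combined to fix phrase boundaries is a design choice that the abstract does not specify.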
