Using Ontology-based Approaches to Representing Speech Transcripts for Automated Speech Scoring

This paper presents a thesis proposal on approaches to automatically scoring non-native speech from second language tests. Current speech scoring systems assess speech primarily with acoustic features such as fluency and pronunciation, whereas content features are rarely used. Motivated by this limitation, the study investigates the use of content features in speech scoring systems. For content features, a central question is how to represent speech content in a way that facilitates automated speech scoring. The study proposes using ontology-based representation to represent speech transcripts at the concept level, on the expectation that content features computed from this representation may facilitate speech scoring. One baseline and two ontology-based representations are compared in experiments. Preliminary results show that ontology-based representation slightly improves the performance of one content feature for automated scoring over the baseline system.
