A word prediction methodology for automatic sentence completion

Word prediction generally relies on n-gram occurrence statistics, which can require very large amounts of storage and do not take into account the general meaning of the text. We propose an alternative methodology, based on Latent Semantic Analysis, to address these issues. An asymmetric Word-Word frequency matrix is employed to achieve higher scalability with large training datasets than the classic Word-Document approach. We propose a function for scoring candidate terms for the missing word in a sentence, and we show how this function approximates the probability of occurrence of a given candidate word. Experimental results show that the proposed approach outperforms non-neural-network language models.
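
The general idea can be illustrated with a minimal sketch. It is not the scoring function proposed in the paper, which the abstract does not specify. It assumes that "asymmetric" means counting only the words that precede a target word, and it swaps in cosine similarity to the context centroid as a stand-in score. The function names `build_cooccurrence`, `embed`, and `score_candidates`, the `window` parameter, and the rank `k` are all hypothetical choices made for illustration.

```python
import numpy as np


def build_cooccurrence(sentences, window=2):
    """Count, for each word (row), the words seen up to `window` positions before it (column).

    Counting only preceding words is an assumption; it makes the matrix asymmetric.
    """
    vocab = sorted({w for s in sentences for w in s})
    index = {w: i for i, w in enumerate(vocab)}
    counts = np.zeros((len(vocab), len(vocab)))
    for s in sentences:
        for i, w in enumerate(s):
            for j in range(max(0, i - window), i):
                counts[index[w], index[s[j]]] += 1.0
    return counts, index


def embed(counts, k):
    """Project words into a k-dimensional latent space with a truncated SVD."""
    u, s, _ = np.linalg.svd(counts, full_matrices=False)
    return u[:, :k] * s[:k]


def score_candidates(context, candidates, vectors, index):
    """Stand-in score: cosine similarity between a candidate and the context centroid."""
    ctx = [vectors[index[w]] for w in context if w in index]
    if not ctx:
        return {c: 0.0 for c in candidates}
    centroid = np.mean(ctx, axis=0)
    scores = {}
    for c in candidates:
        if c not in index:
            scores[c] = 0.0  # out-of-vocabulary candidates get a neutral score
            continue
        v = vectors[index[c]]
        denom = np.linalg.norm(v) * np.linalg.norm(centroid)
        scores[c] = float(v @ centroid / denom) if denom > 0 else 0.0
    return scores
```

Under these assumptions, completing a sentence amounts to calling `score_candidates` with the known words of the sentence as `context` and picking the candidate with the highest score. The Word-Word matrix has one row and column per vocabulary entry, so its size does not grow with the number of training documents, which is the scalability argument made above.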
