Corpus analysis without prior linguistic knowledge - unsupervised mining of phrases and subphrase structure

When looking at the structure of natural language, "phrases" and "words" are central notions. We consider the problem of identifying such "meaningful subparts" of language, of any length, together with their underlying composition principles, in a completely corpus-based and language-independent way, without using any kind of prior linguistic knowledge. We describe unsupervised methods for identifying "phrases", mining subphrase structure, and finding words in a fully automated way. This can be considered a step towards automatically computing a "general dictionary and grammar of the corpus". We hope that, in the long run, variants of our approach will turn out to be useful for other kinds of sequence data as well, such as speech, genome sequences, or music annotation. Although we are not primarily interested in immediate applications, results obtained for a variety of languages show that our methods are of interest for many practical tasks in text mining, terminology extraction and lexicography, search engine technology, and related fields.
