Python, performance, and natural language processing

We present a case study of a Python-based workflow for a data-intensive natural language processing problem, namely word classification using vector space model methodology. Problems in natural language processing are typically solved in many steps, each of which transforms the data into a vastly different format (in our case, raw text to sparse matrices to dense vectors). Each of these steps calls for a different Python implementation strategy. We survey existing approaches to using Python for high-performance processing of large volumes of data, and we propose a sample solution for each step of our case study (aspectual classification of Russian verbs), attempting to preserve both efficiency and user-friendliness. For the most computationally intensive part of the workflow, we develop a prototype distributed implementation of the co-occurrence extraction module using an IPython.parallel cluster.
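To make the co-occurrence extraction step concrete, the following is a minimal single-process sketch, not the implementation described in the paper. The function names (`cooccurrence_counts`, `merge_counts`), the window size, and the toy corpus are all assumptions introduced for illustration. The split into per-chunk counting and a final merge only hints at how the work might be divided among workers; no IPython.parallel calls are used here.

```python
from collections import Counter


def cooccurrence_counts(sentences, window=2):
    """Count (target, context) pairs within a symmetric window.

    `sentences` is an iterable of token lists. Hypothetical helper,
    assumed for this sketch only.
    """
    counts = Counter()
    for tokens in sentences:
        for i, target in enumerate(tokens):
            lo = max(0, i - window)
            hi = min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:  # a word is not its own context
                    counts[(target, tokens[j])] += 1
    return counts


def merge_counts(partials):
    """Sum partial counters, e.g. one per corpus chunk or worker."""
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total


# Toy corpus (assumed): each sentence is treated as a separate chunk.
corpus = [
    ["она", "читала", "книгу"],
    ["она", "прочитала", "книгу"],
]
partials = [cooccurrence_counts([sentence], window=1) for sentence in corpus]
counts = merge_counts(partials)
```

Because counting is additive across sentences, merging the per-chunk counters gives the same result as counting over the whole corpus at once, which is what makes this step a natural candidate for distribution. The resulting pair counts could then be loaded into a sparse matrix for the later steps of the workflow.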
