Featurizing Text : Converting Text into Predictors for Regression Analysis

Modern data streams routinely combine text with the familiar numerical data used in regression analysis. For example, listings for real estate that show the price of a property typically include a verbal description. Some descriptions include numerical data, such as the number of rooms or the size of the home. Many others, however, only verbally describe the property, often using an idiosyncratic vernacular. For modeling such data, we describe several methods that that convert such text into numerical features suitable for regression analysis. The proposed featurizing techniques create regressors directly from text, requiring minimal user input. The techniques range naive to subtle. One can simply use raw counts of words, obtain principal components from these counts, or build regressors from counts of adjacent words. Our example that models real estate prices illustrates the surprising success of these methods. To partially explain this success, we offer a motivating probabilistic model. Because the derived regressors are difficult to interpret, we further show how the presence of partial quantitative features extracted from text can elucidate the structure of a model. Key Phrases: sentiment analysis, n-gram, latent semantic analysis, text mining ∗Research supported by NSF grant 1106743

[1]  Richard A. Harshman,et al.  Indexing by Latent Semantic Analysis , 1990, J. Am. Soc. Inf. Sci..

[2]  Dean P. Foster,et al.  α‐investing: a procedure for sequential control of expected false discoveries , 2008 .

[3]  Nathan Halko,et al.  Finding Structure with Randomness: Probabilistic Algorithms for Constructing Approximate Matrix Decompositions , 2009, SIAM Rev..

[4]  Maite Taboada,et al.  Lexicon-Based Methods for Sentiment Analysis , 2011, CL.

[5]  Patrick Pantel,et al.  From Frequency to Meaning: Vector Space Models of Semantics , 2010, J. Artif. Intell. Res..

[6]  David M. Blei,et al.  Probabilistic topic models , 2012, Commun. ACM.

[7]  T. Landauer,et al.  A Solution to Plato's Problem: The Latent Semantic Analysis Theory of Acquisition, Induction, and Representation of Knowledge. , 1997 .

[8]  J. Bullinaria,et al.  Extracting semantic representations from word co-occurrence statistics: A computational study , 2007, Behavior research methods.

[9]  James H. Martin,et al.  Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition , 2000 .

[10]  James H. Martin,et al.  Speech and Language Processing An Introduction to Natural Language Processing , Computational Linguistics , and Speech Recognition Second Edition , 2008 .

[11]  Mark E. J. Newman,et al.  Power-Law Distributions in Empirical Data , 2007, SIAM Rev..

[12]  Geoffrey Sampson,et al.  Word frequency distributions , 2002, Computational Linguistics.

[13]  Hinrich Schütze,et al.  Book Reviews: Foundations of Statistical Natural Language Processing , 1999, CL.

[14]  G. Āllport The Psycho-Biology of Language. , 1936 .