Sparse Machine Learning Methods for Understanding Large Text Corpora

Sparse machine learning has recently emerged as a powerful tool for obtaining models of high-dimensional data that offer a high degree of interpretability at low computational cost. This paper posits that these methods can be extremely useful for understanding large collections of text documents, without requiring user expertise in machine learning. Our approach relies on three main ingredients: (a) multi-document text summarization and (b) comparative summarization of two corpora, both using sparse regression or classification; and (c) sparse principal components and sparse graphical models for unsupervised analysis and visualization of large text corpora. We validate our approach using a corpus of Aviation Safety Reporting System (ASRS) reports and demonstrate that the methods can reveal causal and contributing factors in runway incursions. Furthermore, we show that the methods automatically discover four main tasks that pilots perform during flight, which can aid in further understanding the causal and contributing factors in runway incursions and other drivers of aviation safety incidents.
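To make ingredient (b) concrete, the following is a minimal sketch of comparative summarization by sparse regression: an L1-penalized least-squares fit (the lasso) on word counts, solved by cyclic coordinate descent, where the words that keep a nonzero weight serve as the summary. It is an illustration under stated assumptions, not the authors' implementation: the function names, the toy reports, the ±1 labels, and the penalty value are all hypothetical.

```python
from collections import Counter


def soft_threshold(value, threshold):
    """Shrink value toward zero by threshold (the lasso proximal step)."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def lasso_keywords(docs, labels, penalty=0.2, sweeps=50):
    """Return {word: weight} for words with nonzero lasso weight.

    Minimizes (1/2n) * ||y - Xb||^2 + penalty * ||b||_1 by cyclic
    coordinate descent, where X holds centered word counts and y the
    centered labels (+1 for the target corpus, -1 for the other).
    """
    tokens = [doc.lower().split() for doc in docs]
    counts = [Counter(t) for t in tokens]
    vocab = sorted({word for t in tokens for word in t})
    n = len(docs)

    # Center each word-count column so that no intercept is needed.
    columns = {}
    for word in vocab:
        col = [float(c[word]) for c in counts]
        mean = sum(col) / n
        columns[word] = [x - mean for x in col]

    label_mean = sum(labels) / n
    residual = [y - label_mean for y in labels]
    coef = {word: 0.0 for word in vocab}

    for _ in range(sweeps):
        for word in vocab:
            col = columns[word]
            norm_sq = sum(x * x for x in col)
            if norm_sq == 0.0:
                continue  # word occurs equally often everywhere: uninformative
            # Correlation of this column with the partial residual.
            rho = sum(x * (r + coef[word] * x) for x, r in zip(col, residual))
            new = soft_threshold(rho / n, penalty) / (norm_sq / n)
            delta = new - coef[word]
            if delta != 0.0:
                residual = [r - delta * x for r, x in zip(residual, col)]
                coef[word] = new

    return {word: b for word, b in coef.items() if b != 0.0}


# Hypothetical toy reports: the first three describe runway incursions.
reports = [
    "the aircraft crossed runway without clearance",
    "the vehicle entered runway during landing",
    "the pilot taxied onto runway early",
    "the aircraft encountered turbulence during cruise",
    "the pilot reported engine vibration",
    "the vehicle blocked taxiway briefly",
]
is_incursion = [1, 1, 1, -1, -1, -1]

summary = lasso_keywords(reports, is_incursion, penalty=0.2)
print({word: round(weight, 3) for word, weight in summary.items()})
```

On this toy data, the penalty is large enough that words appearing in a single report are zeroed out, and words shared evenly across both groups carry no signal, so only "runway" keeps a nonzero weight. Lowering the penalty admits more words into the summary, which is the interpretability and sparsity trade-off the abstract refers to.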
