Unsupervised dimensionality reduction versus supervised regularization for classification from sparse data

Unsupervised matrix-factorization-based dimensionality reduction (DR) techniques are widely used for feature engineering with the goal of improving the generalization performance of predictive models, especially with massive, sparse feature sets. Often DR is employed for the same purpose as supervised regularization and other forms of complexity control: exploiting a bias/variance tradeoff to mitigate overfitting. Contrary to this practice, existing expert guidelines agree that supervised regularization is a superior way to improve predictive performance. However, these guidelines are not always followed for this sort of data, and it is not unusual to find DR used with no comparison to modeling with the full feature set. Further, the existing literature does not take into account that DR and supervised regularization are often used in conjunction. We experimentally compare binary classification performance using DR features versus the original features under numerous conditions: a total of 97 binary classification tasks, 6 classifiers, 3 DR techniques, and 4 evaluation metrics. Crucially, we also vary the methodology used to tune and evaluate key hyperparameters. We find a very clear but nuanced result. Using state-of-the-art hyperparameter-selection methods, applying DR does not add value beyond supervised regularization, and can often diminish performance. However, if regularization is not done well (e.g., one just uses the default regularization parameter), DR does perform relatively better, but these approaches result in lower performance overall. These latter results provide an explanation for why practitioners may be continuing to use DR without undertaking the necessary comparison to using the original features. However, this practice seems generally wrongheaded in light of the main results, if the goal is to maximize generalization performance.
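
The kind of comparison described above can be sketched in a few lines. The following is a minimal illustration, not the authors' experimental protocol: it assumes scikit-learn, a small synthetic sparse binary dataset, truncated SVD as the DR technique, L2-regularized logistic regression as the classifier, and AUC as the metric. The function names (`make_sparse_task`, `compare`), the grid values, and the data generator are all hypothetical choices made for the example. Both arms tune the regularization strength C by cross-validation, so the only difference between them is whether DR is applied first.

```python
import numpy as np
from scipy import sparse
from sklearn.decomposition import TruncatedSVD
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline


def make_sparse_task(n_samples=600, n_features=300, density=0.02, seed=0):
    """Synthetic sparse binary features; labels come from a noisy linear rule.

    This generator is an assumption for illustration only.
    """
    rng = np.random.default_rng(seed)
    dense = (rng.random((n_samples, n_features)) < density).astype(float)
    X = sparse.csr_matrix(dense)
    w = rng.normal(size=n_features)
    scores = X @ w + rng.normal(scale=0.5, size=n_samples)
    y = (scores > np.median(scores)).astype(int)  # roughly balanced classes
    return X, y


def compare(X, y, seed=0):
    """Held-out AUC for tuned regularization with and without SVD features."""
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.3, random_state=seed, stratify=y
    )
    c_grid = [0.01, 0.1, 1.0, 10.0]  # hypothetical grid for the L2 strength

    # Arm 1: original sparse features, regularization tuned by cross-validation.
    full = GridSearchCV(
        LogisticRegression(max_iter=1000),
        {"C": c_grid},
        scoring="roc_auc",
        cv=3,
    )
    full.fit(X_tr, y_tr)

    # Arm 2: SVD features; the number of components and C are tuned jointly.
    dr = GridSearchCV(
        Pipeline([
            ("svd", TruncatedSVD(random_state=seed)),
            ("clf", LogisticRegression(max_iter=1000)),
        ]),
        {"svd__n_components": [10, 50], "clf__C": c_grid},
        scoring="roc_auc",
        cv=3,
    )
    dr.fit(X_tr, y_tr)

    return {
        "full_features_auc": roc_auc_score(y_te, full.predict_proba(X_te)[:, 1]),
        "svd_features_auc": roc_auc_score(y_te, dr.predict_proba(X_te)[:, 1]),
    }


if __name__ == "__main__":
    X, y = make_sparse_task()
    for name, auc in compare(X, y).items():
        print(f"{name}: {auc:.3f}")
```

Placing the SVD step inside the pipeline means it is refit on each training fold, so the choice of component count does not leak information from the validation folds. Numbers from a single synthetic task like this one say nothing about the paper's findings; the sketch only shows the shape of the comparison.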
