Multi-tree Monte Carlo methods for fast, scalable machine learning

As modern applications of machine learning and data mining are forced to deal with ever more massive quantities of data, practitioners quickly run into difficulty with the scalability of even the most basic and fundamental methods. We propose to provide scalability through a marriage between classical, empirical-style Monte Carlo approximation and deterministic multi-tree techniques. This union entails a critical compromise: losing determinism in order to gain speed. In the face of large-scale data, such a compromise is arguably often not only the right choice but the only one. We refer to this new approximation methodology as Multi-Tree Monte Carlo. In particular, we have developed the following fast approximation methods:

(1) Fast training for kernel conditional density estimation by injecting Monte Carlo into state-of-the-art dual-tree methods. Speedups as high as 10^5 have been demonstrated on datasets of up to 1 million points.

(2) Fast training for general kernel estimators (kernel density estimation, kernel regression, etc.) by injecting multiple trees into Monte Carlo. Speedups as high as 10^6 have been demonstrated on tens of millions of points.

(3) Fast singular value decomposition using a new form of sampling tree called the cosine tree. Speedups as high as 10^5 have been demonstrated on data matrices containing billions of entries.

(Illustrative sketches of the ideas behind these methods follow the abstract.)

The level of acceleration achieved by our methods improves on the prior state of the art by several orders of magnitude. Such improvement not only speeds up existing applications; it represents a qualitative shift, a commoditization, that opens doors to new applications and method concepts that were previously outside the realm of practicality. Further, we show how the diverse operations of our approximation methods can be unified in a Multi-Tree Monte Carlo meta-algorithm that serves as scaffolding for developing fast approximations of methods we have not yet considered. Thus, our contribution includes not only the particular algorithms we have derived but also the Multi-Tree Monte Carlo methodological framework, which we hope will lead to many more fast algorithms that extend the scalability demonstrated here to other important methods in machine learning and related fields.
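
For context on method (1): kernel conditional density estimation targets the standard double-kernel estimator (this is the textbook form from the conditional density estimation literature, not a construction unique to this work),

```latex
\hat{f}(y \mid x) \;=\;
\frac{\sum_{j=1}^{n} K_{h_1}\!\left(y - y_j\right)\,
      K_{h_2}\!\left(\lVert x - x_j \rVert\right)}
     {\sum_{j=1}^{n} K_{h_2}\!\left(\lVert x - x_j \rVert\right)},
```

where K_h denotes a kernel with bandwidth h. Evaluating this naively costs O(n) per query, so leave-one-out bandwidth selection over all n training points costs O(n^2); that quadratic kernel summation is exactly what the dual-tree-plus-Monte-Carlo machinery accelerates.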
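
The core Monte Carlo ingredient of method (2) can be illustrated in a few lines. The sketch below is a minimal reconstruction, not the thesis algorithm itself: it estimates a Gaussian kernel summation by uniform sampling of reference points, stopping when a CLT-based confidence interval suggests the relative error is at most eps with probability roughly 1 - delta. The tree component, which stratifies samples by tree node to reduce variance, is omitted for brevity, and the function name and parameters are hypothetical.

```python
import numpy as np
from statistics import NormalDist

def mc_kernel_sum(query, refs, bandwidth, eps=0.1, delta=0.05,
                  batch=256, max_samples=100_000, rng=None):
    """Estimate S = sum_j exp(-||query - refs[j]||^2 / (2 h^2)) by uniform
    Monte Carlo sampling, stopping once a normal-approximation confidence
    interval indicates relative error <= eps with probability ~ 1 - delta."""
    rng = np.random.default_rng() if rng is None else rng
    z = NormalDist().inv_cdf(1 - delta / 2)   # two-sided normal quantile
    n = len(refs)
    samples = np.empty(0)
    while samples.size < max_samples:
        idx = rng.integers(0, n, size=batch)           # sample references uniformly
        d2 = np.sum((refs[idx] - query) ** 2, axis=1)  # squared distances to query
        samples = np.concatenate([samples, np.exp(-0.5 * d2 / bandwidth**2)])
        m = samples.mean()
        half = z * samples.std(ddof=1) / np.sqrt(samples.size)
        if m > 0 and half <= eps * m:                  # CLT relative-error stop
            break
    return n * samples.mean()                          # rescale mean to full sum
```

Roughly speaking, the full method replaces this uniform proposal with tree-stratified sampling, concentrating samples where kernel values vary most; the cost then depends on the sample size needed for the target error rather than on n, which is what makes orders-of-magnitude speedups possible.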
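
Method (3)'s cosine tree is particular to this work and is not reproduced here. As a reference point, the sketch below implements the classical Monte Carlo low-rank approximation by length-squared row sampling (in the style of Frieze, Kannan, and Vempala), which cosine-tree sampling refines with an adaptive, tree-guided choice of rows; the function name and parameters are ours.

```python
import numpy as np

def lowrank_by_row_sampling(A, k, c, rng=None):
    """Rank-k approximation of A via length-squared row sampling
    (classical Frieze-Kannan-Vempala-style Monte Carlo SVD)."""
    rng = np.random.default_rng() if rng is None else rng
    sq = np.einsum('ij,ij->i', A, A)            # squared row norms
    p = sq / sq.sum()                           # length-squared distribution
    idx = rng.choice(A.shape[0], size=c, p=p)   # sample c rows by that law
    S = A[idx] / np.sqrt(c * p[idx])[:, None]   # rescale rows for unbiasedness
    _, _, Vt = np.linalg.svd(S, full_matrices=False)
    V = Vt[:k].T                                # approximate top-k right basis
    return (A @ V) @ V.T                        # project A onto that subspace
```

The dense SVD is performed only on the c sampled rows, at O(c^2 d) cost for a d-column matrix, rather than on the full matrix; this is the basic reason such sampling schemes remain viable on matrices with billions of entries.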
