A Sketch-based Sampling Algorithm on Sparse Data

We propose a sketch-based sampling algorithm that effectively exploits data sparsity. Sampling methods have become popular in large-scale data mining and information retrieval, where high data sparsity is the norm. A distinct feature of our algorithm is that it combines the advantages of both conventional random sampling and more modern randomized algorithms such as locality-sensitive hashing (LSH). While most sketch-based algorithms are designed for specific summary statistics, our proposed algorithm is a general-purpose technique, useful for estimating any summary statistic, including two-way and multi-way distances and joint histograms.
