Paper information - A Sketch-based Sampling Algorithm on Sparse Data

We propose a sketch-based sampling algorithm that effectively exploits data sparsity. Sampling methods have become popular in large-scale data mining and information retrieval, where high data sparsity is the norm. A distinct feature of our algorithm is that it combines the advantages of both conventional random sampling and more modern randomized algorithms such as locality-sensitive hashing (LSH). While most sketch-based algorithms are designed for specific summary statistics, our proposed algorithm is a general-purpose technique, useful for estimating any summary statistics, including two-way and multi-way distances and joint histograms.

Kenneth Ward Church | T. Hastie | Ping Li
