Mining of Massive Datasets

The popularity of the Web and Internet commerce provides many extremely large datasets from which information can be gleaned by data mining. This book focuses on practical algorithms that have been used to solve key problems in data mining and that can be applied to even the largest datasets. It begins with a discussion of the map-reduce framework, an important tool for parallelizing algorithms automatically. The authors then explain the tricks of locality-sensitive hashing and of stream-processing algorithms for mining data that arrives too fast for exhaustive processing. The PageRank idea and related tricks for organizing the Web are covered next. Other chapters cover the problems of finding frequent itemsets and clustering. The final chapters cover two applications, recommendation systems and Web advertising, each vital in e-commerce. Written by two authorities in database and Web technologies, this book is essential reading for students and practitioners alike.
