Scaling big data mining infrastructure: the twitter experience

The analytics platform at Twitter has experienced tremendous growth over the past few years in terms of size, complexity, number of users, and variety of use cases. In this paper, we discuss the evolution of our infrastructure and the development of capabilities for data mining on "big data". One important lesson is that successful big data mining in practice is about much more than what most academics would consider data mining: life "in the trenches" is occupied by much preparatory work that precedes the application of data mining algorithms and followed by substantial effort to turn preliminary models into robust solutions. In this context, we discuss two topics: First, schemas play an important role in helping data scientists understand petabyte-scale data stores, but they're insufficient to provide an overall "big picture" of the data available to generate insights. Second, we observe that a major challenge in building data analytics platforms stems from the heterogeneity of the various components that must be integrated together into production workflows---we refer to this as "plumbing". This paper has two goals: For practitioners, we hope to share our experiences to flatten bumps in the road for those who come after us. For academic researchers, we hope to provide a broader context for data mining in production environments, pointing out opportunities for future work.

[1]  Filip Radlinski,et al.  Evaluating the accuracy of implicit feedback from clicks and query reformulations in Web search , 2007, TOIS.

[2]  Ravi Kumar,et al.  Pig latin: a not-so-foreign language for data processing , 2008, SIGMOD Conference.

[3]  Gideon S. Mann,et al.  Efficient Large-Scale Distributed Training of Conditional Maximum Entropy Models , 2009, NIPS.

[4]  Peter Norvig,et al.  The Unreasonable Effectiveness of Data , 2009, IEEE Intelligent Systems.

[5]  Joseph M. Hellerstein,et al.  MAD Skills: New Analysis Practices for Big Data , 2009, Proc. VLDB Endow..

[6]  Kunle Olukotun,et al.  Map-Reduce for Machine Learning on Multicore , 2006, NIPS.

[7]  Efstathios Stamatatos,et al.  Words versus Character n-Grams for Anti-Spam Filtering , 2007, Int. J. Artif. Intell. Tools.

[8]  Michael J. Franklin,et al.  Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing , 2012, NSDI.

[9]  Douglas Aberdeen,et al.  The War Against Spam: A report from the front line , 2007 .

[10]  Jeffrey Heer,et al.  Enterprise Data Analysis and Visualization: An Interview Study , 2012, IEEE Transactions on Visualization and Computer Graphics.

[11]  Zheng Shao,et al.  Data warehousing and analytics infrastructure at facebook , 2010, SIGMOD Conference.

[12]  Jimmy J. Lin,et al.  Book Reviews: Data-Intensive Text Processing with MapReduce by Jimmy Lin and Chris Dyer , 2010, CL.

[13]  Jimmy J. Lin,et al.  Fast, Easy, and Cheap: Construction of Statistical Machine Translation Models with MapReduce , 2008, WMT@ACL.

[14]  Hang Li Learning to Rank for Information Retrieval and Natural Language Processing , 2011, Synthesis Lectures on Human Language Technologies.

[15]  Jay Kreps,et al.  Kafka : a Distributed Messaging System for Log Processing , 2011 .

[16]  John Langford,et al.  Scaling up machine learning: parallel and distributed approaches , 2011, KDD '11 Tutorials.

[17]  Roberto J. Bayardo,et al.  MapReduce and Its Application to Massively Parallel Learning of Decision Tree Ensembles , 2011 .

[18]  Vinay Setty,et al.  Hadoop++: Making a Yellow Elephant Run Like a Cheetah (Without It Even Noticing) , 2010, Proc. VLDB Endow..

[19]  Michele Banko,et al.  Scaling to Very Very Large Corpora for Natural Language Disambiguation , 2001, ACL.

[20]  Sanjay Ghemawat,et al.  MapReduce: Simplified Data Processing on Large Clusters , 2004, OSDI.

[21]  James A. Landay,et al.  Gestalt: integrated support for implementation and analysis in machine learning , 2010, UIST.

[22]  Ron Kohavi,et al.  Practical guide to controlled experiments on the web: listen to your customers not to the hippo , 2007, KDD '07.

[23]  Tony Hey,et al.  The Fourth Paradigm: Data-Intensive Scientific Discovery , 2009 .

[24]  Sanjay Ghemawat,et al.  MapReduce: a flexible data processing tool , 2010, CACM.

[25]  Léon Bottou,et al.  Large-Scale Machine Learning with Stochastic Gradient Descent , 2010, COMPSTAT.

[26]  Miguel Rios,et al.  Distilling Massive Amounts of Data into Simple Visualizations : Twitter Case Studies , 2012 .

[27]  Shirish Tatikonda,et al.  SystemML: Declarative machine learning on MapReduce , 2011, 2011 IEEE 27th International Conference on Data Engineering.

[28]  Anthony J. G. Hey,et al.  The Fourth Paradigm: Data-Intensive Scientific Discovery [Point of View] , 2011 .

[29]  Jeffrey Heer,et al.  SpanningAspectRatioBank Easing FunctionS ArrayIn ColorIn Date Interpolator MatrixInterpola NumObjecPointI Rectang ISchedu Parallel Pause Scheduler Sequen Transition Transitioner Transiti Tween Co DelimGraphMLCon IData JSONCon DataField DataSc Dat DataSource Data DataUtil DirtySprite LineS RectSprite , 2011 .

[30]  Jimmy J. Lin,et al.  Large-scale machine learning at twitter , 2012, SIGMOD Conference.

[31]  Alexander Hall,et al.  Processing a Trillion Cells per Mouse Click , 2012, Proc. VLDB Endow..

[32]  John Langford,et al.  A reliable effective terascale linear learning system , 2011, J. Mach. Learn. Res..

[33]  Beng Chin Ooi,et al.  Llama: leveraging columnar storage for scalable join processing in the MapReduce framework , 2011, SIGMOD '11.

[34]  Tamara G. Kolda,et al.  COMET: A Recipe for Learning and Using Large Ensembles on Massive Data , 2011, 2011 IEEE 11th International Conference on Data Mining.

[35]  Gideon S. Mann,et al.  Distributed Training Strategies for the Structured Perceptron , 2010, NAACL.

[36]  Jun Rao,et al.  Building LinkedIn's Real-time Activity Data Pipeline , 2012, IEEE Data Eng. Bull..

[37]  Chuang Liu,et al.  The Unified Logging Infrastructure for Data Analytics at Twitter , 2012, Proc. VLDB Endow..

[38]  Ron Bekkerman,et al.  High-precision phrase-based document classification on a modern scale , 2011, KDD.

[39]  Christopher Olston,et al.  Building a HighLevel Dataflow System on top of MapReduce: The Pig Experience , 2009, Proc. VLDB Endow..

[40]  Thorsten Brants,et al.  Large Language Models in Machine Translation , 2007, EMNLP.

[41]  Jorge-Arnulfo Quiané-Ruiz,et al.  Trojan data layouts: right shoes for a running elephant , 2011, SoCC.

[42]  Ron Kohavi,et al.  Trustworthy online controlled experiments: five puzzling outcomes explained , 2012, KDD.

[43]  Zhiwei Xu,et al.  RCFile: A fast and space-efficient data placement structure in MapReduce-based warehouse systems , 2011, 2011 IEEE 27th International Conference on Data Engineering.

[44]  Andrey Gubarev,et al.  Dremel : Interactive Analysis of Web-Scale Datasets , 2011 .

[45]  Tie-Yan Liu,et al.  Learning to rank for information retrieval , 2009, SIGIR.

[46]  Christopher J. C. Burges,et al.  Scaling Up Machine Learning: Large-Scale Learning to Rank Using Boosted Decision Trees , 2011 .

[47]  HeerJeffrey,et al.  D3 Data-Driven Documents , 2011 .

[48]  Charles L. A. Clarke,et al.  Efficient and effective spam filtering and re-ranking for large web datasets , 2010, Information Retrieval.

[49]  Jimmy Lin,et al.  Full-text indexing for optimizing selection operations in large-scale data analytics , 2011, MapReduce '11.

[50]  Aart J. C. Bik,et al.  Pregel: a system for large-scale graph processing , 2010, SIGMOD Conference.

[51]  Kun Li,et al.  The MADlib Analytics Library or MAD Skills, the SQL , 2012, Proc. VLDB Endow..

[52]  D. Sculley,et al.  Detecting adversarial advertisements in the wild , 2011, KDD.