WebScalding: A Framework for Big Data Web Services

CareerBuilder (CB) currently has 50 million active resumes and 2 million active job postings. Our team has been working to provide the most relevant jobs for job seekers and the most relevant resumes for employers and recruiters. These goals often lead to Big Data problems. In this paper, we introduce WebScalding, a Big Data framework designed and developed to solve some of the common large-scale data challenges at CB. WebScalding raises the level of abstraction of Twitter's Scalding framework to adapt it to CB's unique challenges. The framework helps users by ensuring that: 1) all internal web services are available as Cascading pipe operations; 2) these pipe operations can read from our common data sources and form a pipe assembly; and 3) the resulting pipe assembly can be executed on the CB Hadoop cluster as well as on local machines without any changes. We describe WebScalding through three case studies taken from actual internal projects, which explain how data scientists at CB who are not well versed in Big Data tools and methodologies use WebScalding to design, implement, and test Big Data applications. We also compare the execution time of a WebScalding program with that of its sequential Python counterpart to illustrate the superlinear speedup of WebScalding programs.
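The idea of exposing a web service as a pipe operation that runs unchanged on a local machine can be illustrated with a minimal sketch. This is not the WebScalding API, which the abstract does not show; it is plain Python, and every name in it (`normalize_title`, `service_op`, `run_local`) is a hypothetical stand-in chosen for illustration. The "service" here is a local function standing in for a call to an internal web service.

```python
from typing import Callable, Iterable, Iterator

# Hypothetical stand-in for an internal web service.
# A real service would be reached over the network.
def normalize_title(raw: str) -> str:
    """Lowercase a job title and collapse repeated whitespace."""
    return " ".join(raw.strip().lower().split())

def service_op(service: Callable[[str], str]) -> Callable[[Iterable[str]], Iterator[tuple]]:
    """Wrap a record-at-a-time service as a pipe operation over a stream."""
    def op(records: Iterable[str]) -> Iterator[tuple]:
        for record in records:
            # Emit the input alongside the service's response.
            yield record, service(record)
    return op

def run_local(source: Iterable, *ops: Callable) -> list:
    """Run a pipe assembly in-process by chaining the operations in order.

    Assumption: a cluster runner would accept the same operations,
    so the assembly itself would not need to change.
    """
    stream = source
    for op in ops:
        stream = op(stream)
    return list(stream)

titles = ["  Senior   Software Engineer ", "DATA scientist"]
result = run_local(titles, service_op(normalize_title))
```

Because the assembly is just a sequence of operations over a stream, the choice of where to run it is separate from its definition, which is the property the third point above describes.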
