A system for scalable and reliable technical-skill testing in online labor markets

The emergence of online labor platforms, online crowdsourcing sites, and even Massive Open Online Courses (MOOCs) has created a growing need to reliably evaluate the skills of participating users (e.g., "does a candidate know Java?") in a scalable way. Many platforms already allow job candidates to take online tests to assess their competence in a variety of technical topics. However, existing approaches face several problems. First, cheating is very common in unsupervised online testing, as test questions often "leak" and become easily available online along with their answers. Second, technical skills, such as programming, require tests to be updated frequently in order to reflect the current state of the art. Third, there is very limited evaluation of the tests themselves and of how effectively they measure the skill that users are being tested for.

In this article we present a platform that continuously generates test questions and evaluates their quality as predictors of user skill level. Our platform leverages content that is already available on question-answering sites such as Stack Overflow and repurposes those questions to generate tests. This approach has two major benefits: we continuously generate new questions, which reduces the impact of cheating, and we create questions that are closer to the real problems the skill holder is expected to solve in practice. Our platform uses Item Response Theory to evaluate the quality of the questions. We also use external signals about worker quality to examine the external validity of the generated test questions: questions that have external validity are also strong predictors for identifying early the workers who have the potential to succeed in online job marketplaces. Our experimental evaluation shows that our system generates questions of comparable or higher quality than existing tests, at a cost of approximately $3 to $5 per question, which is lower than the cost of licensing questions from existing test banks and an order of magnitude lower than the cost of producing such questions from scratch using experts.
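To make the role of Item Response Theory concrete, the sketch below is a minimal Python illustration (the language, the function name fit_2pl_items, and the z-scored-total approximation of worker ability are our own illustrative assumptions, not the authors' calibration pipeline). It fits the two-parameter logistic (2PL) model, P(correct) = 1 / (1 + exp(-a(theta - b))), to a matrix of binary worker answers; the estimated discrimination a and difficulty b are the quantities a platform of this kind can use to decide which generated questions are informative enough to keep.

import numpy as np
from scipy.optimize import minimize
from scipy.special import expit  # logistic function 1 / (1 + exp(-x))


def fit_2pl_items(responses):
    """Estimate 2PL discrimination (a) and difficulty (b) for each question.

    responses: (n_workers, n_items) array of 0/1 answers with no missing data.
    Worker ability theta is approximated by the z-scored raw total, a quick
    stand-in for a full marginal-maximum-likelihood calibration.
    """
    totals = responses.mean(axis=1)
    theta = (totals - totals.mean()) / (totals.std() + 1e-9)

    def neg_log_lik(params, y):
        a, b = params
        p = np.clip(expit(a * (theta - b)), 1e-9, 1 - 1e-9)  # P(correct answer)
        return -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

    return [tuple(minimize(neg_log_lik, x0=[1.0, 0.0],
                           args=(responses[:, j],),
                           method="Nelder-Mead").x)
            for j in range(responses.shape[1])]


if __name__ == "__main__":
    # Simulate 500 workers answering 12 questions of varying difficulty.
    rng = np.random.default_rng(0)
    theta_true = rng.normal(size=500)
    b_true = np.linspace(-1.5, 1.5, 12)
    p_true = expit(1.2 * (theta_true[:, None] - b_true))
    responses = (rng.random(p_true.shape) < p_true).astype(int)

    for j, (a_hat, b_hat) in enumerate(fit_2pl_items(responses)):
        # Rough recovery of (a = 1.2, b = b_true[j]); questions with low a or
        # extreme b would be candidates for removal from the test pool.
        print(f"item {j}: a={a_hat:.2f}, b={b_hat:.2f} (true b={b_true[j]:.2f})")

In practice the per-item log-likelihood above would be replaced by a proper marginal-maximum-likelihood fit, but the ranking of questions by discrimination and difficulty that it produces is the signal relevant to the quality-evaluation step described in the abstract.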
