iClass: Combining Multiple Multi-label Classification with Expert Knowledge

The Roper Center is one of the largest public opinion data archives in the world. It collects data sets of polled survey questions from numerous media outlets and organizations. The volume of data complicates searching over survey questions and makes search trends difficult to analyze. Roper Center question-level retrieval applications have relied on human metadata experts to assign topics to content. This approach has not reached the required level of consistency and provides an inadequate base for an advanced search experience. The objective of this work is to combine the expert teams' knowledge of the survey questions, and of the concepts and topics these questions express, with the ability of multi-label classifiers to learn this knowledge and apply it in an automated, fast and accurate classification mechanism. This approach significantly reduces question analysis and tagging time and improves the consistency and scalability of topic descriptions. In addition, an ensemble of machine learning classifiers combined with expert knowledge is expected to enhance the search experience and provide much-needed analytic capabilities for the survey question databases. In our design, we use classifications from several machine learning algorithms, such as SVM and Decision Trees, combined with expert knowledge in the form of handcrafted rules, data analysis and result review. We consolidate the different techniques into a Multipath Classifier with a Confidence point system that decides on the relevance of topics assigned to survey questions with nearly perfect accuracy.
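
To make the idea of consolidating several classification paths through confidence points concrete, the following is a minimal sketch and not the system described here. It assumes that each path (a trained model or a set of handcrafted rules) proposes a set of topics for a question, that each path carries a fixed number of points, and that a topic is kept when its summed points reach a threshold. The names `keyword_path`, `multipath_classify`, the point values, the threshold and the toy keyword lists are all hypothetical; the keyword matchers merely stand in for trained classifiers such as an SVM or a Decision Tree.

```python
from collections import defaultdict
from typing import Callable, Dict, List, Set, Tuple

# A "path" maps a survey question to the set of topics it proposes.
Path = Callable[[str], Set[str]]


def keyword_path(keywords: Dict[str, List[str]]) -> Path:
    """Toy stand-in for a classifier: propose a topic if any keyword occurs."""
    def classify(question: str) -> Set[str]:
        text = question.lower()
        return {topic for topic, words in keywords.items()
                if any(word in text for word in words)}
    return classify


def multipath_classify(question: str,
                       paths: List[Tuple[Path, int]],
                       threshold: int) -> Dict[str, int]:
    """Sum each path's points per proposed topic; keep topics at or above threshold."""
    scores: Dict[str, int] = defaultdict(int)
    for path, points in paths:
        for topic in path(question):
            scores[topic] += points
    return {topic: score for topic, score in scores.items() if score >= threshold}


# Hypothetical paths: two model stand-ins worth 1 point each,
# and an expert-rule path worth 2 points (an assumed weighting).
model_a = keyword_path({"economy": ["economy", "jobs"],
                        "elections": ["vote", "president"],
                        "leadership": ["handling"]})
model_b = keyword_path({"economy": ["unemployment", "jobs"],
                        "health": ["health"]})
expert_rules = keyword_path({"elections": ["approve of the way"],
                             "economy": ["jobs"]})

paths = [(model_a, 1), (model_b, 1), (expert_rules, 2)]
question = "Do you approve of the way the president is handling jobs and the economy?"
result = multipath_classify(question, paths, threshold=2)
print(result)
```

In this toy run, "leadership" is proposed by only one 1-point path and is dropped, while "economy" and "elections" accumulate enough points to be kept. Giving the expert-rule path more points than a single model is one possible design choice under these assumptions; the actual point scheme would need to be tuned against reviewed results.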