Paper information - Data mining via support vector machines: scalability, applicability, and interpretability

Knowledge discovery and data mining (KDD) has been studied extensively over the last decade as data continues to grow in size and complexity. This thesis introduces three practical data mining problems: (1) classifying with large data sets, (2) classifying without negative data (i.e., single-class classification), and (3) discovering discriminant feature combinations. It presents solutions based on a principled methodology, Support Vector Machines (SVMs), that produce higher-quality results with less human intervention. We first address several challenges in adopting SVM technology in the practice of data mining: (1) scalability: SVMs do not scale well with data size, while common data mining applications often involve millions or billions of data objects; (2) applicability: SVMs are limited to (semi-)supervised learning, which is mostly applied to binary classification problems; and (3) interpretability: it is hard to interpret and extract knowledge from SVM models. We then propose three principled solutions that address these challenges for the problems of large-scale classification, single-class classification, and discriminant feature combination discovery. The contributions of this thesis cover applications in bioinformatics and text and Web mining, as well as methodologies of data mining and machine learning.

Jiawei Han | Hwan-Jo Yu