Combination of feature engineering and ranking models for paper-author identification in KDD Cup 2013

The track 1 problem in KDD Cup 2013 is to discriminate papers confirmed by the given authors from the other, deleted papers. This paper describes the winning solution of team National Taiwan University for track 1 of KDD Cup 2013. First, we conduct feature engineering to transform the various provided text information into 97 features. Second, we train classification and ranking models using these features. Last, we combine our individual models to boost performance, using results on the internal validation set and the official Valid set. We also propose some effective post-processing techniques. Our solution achieves a MAP score of 0.98259 and ranks first on the private leaderboard of the Test set.
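For readers unfamiliar with the MAP (mean average precision) score quoted above, the following is a minimal sketch of how such a score can be computed from per-author rankings of papers. The function names, the toy data, and the choice to normalize by the number of confirmed papers are illustrative assumptions, not the official evaluation code; the sketch also assumes each ranked list contains no duplicate paper IDs.

```python
from typing import Dict, List, Set


def average_precision(ranked: List[int], confirmed: Set[int]) -> float:
    """Average precision of one author's ranked paper list.

    `ranked` holds paper IDs from most to least likely confirmed;
    `confirmed` holds the IDs the author actually confirmed.
    """
    if not confirmed:
        return 0.0
    hits = 0
    total = 0.0
    for rank, paper_id in enumerate(ranked, start=1):
        if paper_id in confirmed:
            hits += 1
            total += hits / rank  # precision at this rank
    return total / len(confirmed)


def mean_average_precision(
    rankings: Dict[int, List[int]], truth: Dict[int, Set[int]]
) -> float:
    """Mean of the per-author average precision values."""
    if not rankings:
        return 0.0
    scores = [
        average_precision(ranked, truth.get(author, set()))
        for author, ranked in rankings.items()
    ]
    return sum(scores) / len(scores)


# Hypothetical toy data: two authors with ranked candidate papers.
rankings = {1: [10, 11, 12], 2: [20, 21]}
truth = {1: {10, 12}, 2: {21}}
map_score = mean_average_precision(rankings, truth)
```

Because the score depends only on the order of each author's papers, any model that outputs a per-paper score (a classifier probability or a ranking model output) can be evaluated this way after sorting.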
