Extracting Researcher Metadata with Labeled Features

Professional homepages of researchers contain metadata that provides crucial evidence for several digital library tasks, such as academic network extraction, record linkage, and expertise search. Because the values of certain metadata fields (e.g., affiliation) are inherently diverse, supervised algorithms require a large number of labeled examples to identify them accurately. We address this issue with feature labeling, a recent semi-supervised machine learning technique. We apply feature labeling to researcher metadata extraction from homepages by combining a small set of expert-provided feature distributions with a few fully labeled examples. We study two types of labeled features: (1) dictionary features, which provide unigram hints related to specific metadata fields, and (2) proximity features, which are applied in a second stage and capture the layout information between metadata fields on a homepage. We experimentally show that this two-stage approach, together with labeled features, provides significant improvements in tagging performance. In one experiment with only ten labeled homepages and 22 expert-specified labeled features, we obtained a 45% relative increase in the F1 value for the affiliation field, while the overall F1 improved by 9%.
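As a rough illustration of what a labeled dictionary feature might look like, the sketch below represents each expert hint as a target distribution over metadata labels and scores a tagger by how far its average predictions on matching tokens drift from that target (a KL-divergence penalty in the spirit of generalized expectation criteria). The label set, the feature words, the probability values, and all function names are assumptions made for illustration; the abstract does not specify the exact objective or features used.

```python
import math

# Hypothetical label set and expert-provided distributions (illustrative values only).
LABELS = ["affiliation", "email", "other"]
LABELED_FEATURES = {
    "university": {"affiliation": 0.80, "email": 0.05, "other": 0.15},
    "edu": {"affiliation": 0.10, "email": 0.70, "other": 0.20},
}


def model_expectation(feature, tokens, predictions):
    """Average the predicted label distributions over tokens equal to `feature`."""
    rows = [p for tok, p in zip(tokens, predictions) if tok == feature]
    if not rows:
        return None  # the feature never fires in this sequence
    return {lab: sum(r[lab] for r in rows) / len(rows) for lab in LABELS}


def kl_divergence(target, predicted, eps=1e-12):
    """KL(target || predicted), with clipping to avoid log of zero."""
    return sum(
        t * math.log(t / max(predicted[lab], eps))
        for lab, t in target.items()
        if t > 0
    )


def labeled_feature_penalty(tokens, predictions):
    """Sum the KL penalties over every labeled feature that fires."""
    total = 0.0
    for feature, target in LABELED_FEATURES.items():
        expected = model_expectation(feature, tokens, predictions)
        if expected is not None:
            total += kl_divergence(target, expected)
    return total


# Usage: a tagger that predicts uniform distributions is penalized.
tokens = ["university", "of", "edu"]
uniform = {lab: 1.0 / len(LABELS) for lab in LABELS}
penalty = labeled_feature_penalty(tokens, [uniform, uniform, uniform])
```

In a semi-supervised setup, a penalty of this kind would typically be added to the supervised loss computed on the few fully labeled homepages, so that unlabeled pages still contribute a training signal through the expert hints.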
