Information Extraction for Computer Science Academic Rankings System

Academic ranking in computer science is a widely discussed and important problem. This paper introduces the Computer Science Academic Rankings System (CSAR), which aims to extract, mine, and rank academic information. We mainly present the approaches used for information extraction and normalization in CSAR. For semi-structured and unstructured web pages such as paper-view pages, we propose a method that combines an n-gram model from natural language processing with web grammar. We generate an optimal matching on a bipartite graph to extract author and organization information with maximum likelihood. CSAR also uses the KM algorithm and the Hungarian algorithm to find the correspondence between authors and emails. For information normalization, we use an n-gram model, the EM algorithm, and a trigram model with linear interpolation to construct a part-of-speech tagger, which is then used to extract useful information from web sources. The TF-IDF model and string edit distance are then applied to normalize organization names. In experiments, the proposed approaches achieve high accuracy and substantial improvements in academic information extraction.
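The author-to-email correspondence step can be illustrated with a minimal sketch. It is not CSAR's implementation: the cost function `match_cost` is a hypothetical stand-in built on string edit distance, and the optimal one-to-one assignment is found by exhaustive search rather than by the KM or Hungarian algorithm, which solve the same assignment problem in polynomial time. The sketch assumes there are at least as many emails as authors.

```python
from itertools import permutations


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed with the standard row-by-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def match_cost(author: str, email: str) -> int:
    """Hypothetical cost: smallest edit distance between the email's local
    part and any single name token or the concatenated full name."""
    local = email.split("@", 1)[0].lower()
    tokens = author.lower().split()
    candidates = tokens + ["".join(tokens)]
    return min(edit_distance(local, c) for c in candidates)


def assign_emails(authors, emails):
    """Minimum-total-cost one-to-one assignment by brute force.

    Assumes len(authors) <= len(emails). Only practical for the handful of
    authors on one paper; the Hungarian algorithm scales to larger inputs.
    """
    best = min(
        permutations(emails, len(authors)),
        key=lambda perm: sum(match_cost(a, e) for a, e in zip(authors, perm)),
    )
    return dict(zip(authors, best))


if __name__ == "__main__":
    authors = ["Alice Wang", "Bob Lee"]
    emails = ["lee@example.org", "alice@example.org"]
    print(assign_emails(authors, emails))
```

Treating the problem as a global assignment, rather than picking the closest email for each author independently, prevents two authors from being mapped to the same address.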
