Paper Information - Information Extraction for Computer Science Academic Rankings System

Information Extraction for Computer Science Academic Rankings System

Academic ranking in computer science is an active and important problem. This paper introduces the Computer Science Academic Rankings System (CSAR), which aims at extracting, mining, and ranking academic information. We mainly present the approaches used for information extraction and normalization in CSAR. For semi-structured and unstructured web pages such as paper-view pages, we propose a method that combines a natural language processing n-gram model with web grammar. We generate an optimal matching in a bipartite graph to extract author and organization information with maximum likelihood. CSAR also uses the KM algorithm and the Hungarian algorithm to find the correspondence between authors and emails. For information normalization, we introduce an n-gram model, the EM algorithm, and a trigram model with linear interpolation to construct a part-of-speech tagger, which is used to extract useful information from web sources. The TF-IDF model and string edit distance are then applied to normalize organization names. In experiments, the proposed approaches achieve a high accuracy rate and substantial improvements in academic information extraction.
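To make the string edit distance step of organization-name normalization concrete, the following is a minimal sketch, not the implementation used in CSAR. It assumes a small list of canonical organization names is already available and maps a noisy name to the closest canonical entry by Levenshtein distance. The names `edit_distance`, `normalize_org`, and `CANONICAL` are hypothetical, and the TF-IDF weighting mentioned in the abstract is left out.

```python
from typing import Sequence


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed with a two-row dynamic program."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to the empty prefix of b
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca
                curr[j - 1] + 1,     # insert cb
                prev[j - 1] + cost,  # substitute ca with cb (or match)
            ))
        prev = curr
    return prev[-1]


def normalize_org(name: str, canonical: Sequence[str]) -> str:
    """Return the canonical name closest to `name`, ignoring case."""
    key = name.strip().lower()
    return min(canonical, key=lambda c: edit_distance(key, c.lower()))


# Hypothetical canonical list, for illustration only.
CANONICAL = [
    "Stanford University",
    "Carnegie Mellon University",
    "Tsinghua University",
]

if __name__ == "__main__":
    print(normalize_org("Standford Univercity", CANONICAL))
```

A real system would probably also reject matches whose distance exceeds a threshold, so that an unknown organization is not forced onto an unrelated canonical name.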

Minglu Li | Chengkai Shi | Jiahui Quan

[1] Jie Tang, et al. ArnetMiner: extraction and mining of academic social networks, 2008, KDD.

[2] Sheng-H. Chuang, et al. Compound feature recognition by web grammar parsing, 1991.

[3] L. Egghe. An improvement of the h-index: the g-index, 2006.

[4] Kun Yu, et al. Resume Information Extraction with Cascaded Hybrid Model, 2005, ACL.

[5] Maarten de Rijke, et al. Finding experts and their details in e-mail corpora, 2006, WWW '06.

[6] Jie Tang, et al. Email data cleaning, 2005, KDD '05.

[7] Azriel Rosenfeld, et al. Web Grammars, 1969, IJCAI.

[8] Alireza Noruzi. Google Scholar: The New Generation of Citation Indexes, 2005.

[9] M. Bernardine Dias, et al. The Dynamic Hungarian Algorithm for the Assignment Problem with Changing Costs, 2007.

[10] Paolo Toth, et al. Algorithm 548: Solution of the Assignment Problem [H], 1980.

[11] Corinna Cortes, et al. Support-Vector Networks, 1995, Machine Learning.