Document Analysis and Retrieval Tasks in Scientific Digital Libraries

Machine Learning (ML) algorithms have opened up new possibilities for the acquisition and processing of documents in Information Retrieval (IR) systems. It is now possible to automate several labor-intensive document-related tasks, such as categorization and entity extraction. Consequently, the application of machine learning techniques to large-scale IR tasks has attracted significant research interest in both the ML and IR communities. This tutorial provides a reference summary of our research on applying machine learning techniques to diverse tasks in Digital Libraries (DL). Digital library portals are specialized IR systems that operate on collections of documents related to particular domains. We focus on open-access scientific digital libraries such as CiteSeerX, which involve a range of crawling, ranking, content analysis, and metadata extraction tasks. We elaborate on the challenges involved in these tasks and highlight how machine learning methods can address them.
