On identifying academic homepages for digital libraries

Academic homepages are rich sources of information on scientific research and researchers. Most researchers provide information about themselves and links to their research publications on their homepages. In this study, we address three questions related to academic homepages: (1) How many academic homepages are there on the Web? (2) Can we accurately discriminate between academic homepages and other webpages? (3) What information about researchers can be extracted from their homepages? To address the first question, we use mark-recapture techniques, which are commonly employed in biometrics to estimate animal population sizes. Our results indicate that academic homepages comprise a small fraction of the Web, which makes automatic methods for discriminating them crucial. For the second question, we study the performance of content-based features for classifying webpages. We propose using topic models to identify content-based features for classification and show that a small set of LDA-based features outperforms term features selected using traditional techniques such as aggregate term frequencies or mutual information. Finally, we address the extraction of names and research interests from an academic homepage. Term-topic associations obtained from topic models are used to design a novel, unsupervised technique for identifying the short segments of an academic homepage that state the researcher's research interests. We show the efficacy of our proposed methods on all three tasks by experimentally evaluating them on multiple publicly available datasets.
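
To make the mark-recapture idea concrete, the sketch below applies the classic two-sample Lincoln-Petersen estimator, together with Chapman's bias-corrected variant, to two independently drawn samples of URLs. The abstract does not say which capture-recapture model the study uses, so this is an illustration of the general technique under that assumption, not the authors' method. The function names and the example URLs are hypothetical.

```python
def lincoln_petersen(n_first, n_second, n_both):
    """Two-sample population estimate: N ~= n1 * n2 / m.

    n_first  -- items "marked" in the first sample
    n_second -- items captured in the second sample
    n_both   -- items seen in both samples (the recaptures)
    """
    if n_both == 0:
        raise ValueError("no recaptures: the estimate is undefined")
    return n_first * n_second / n_both


def chapman(n_first, n_second, n_both):
    """Chapman's bias-corrected variant; defined even with zero recaptures."""
    return (n_first + 1) * (n_second + 1) / (n_both + 1) - 1


def estimate_from_samples(sample_a, sample_b):
    """Estimate population size from two sets of sampled URLs."""
    recaptured = sample_a & sample_b  # URLs that appear in both samples
    return lincoln_petersen(len(sample_a), len(sample_b), len(recaptured))


# Hypothetical samples of homepage URLs, e.g. drawn via two independent query sets.
sample_a = {"u1.example/~a", "u2.example/~b", "u3.example/~c", "u4.example/~d"}
sample_b = {"u3.example/~c", "u4.example/~d", "u5.example/~e",
            "u6.example/~f", "u7.example/~g", "u8.example/~h"}

estimate = estimate_from_samples(sample_a, sample_b)
```

The estimator assumes a closed population and that every page is equally likely to be captured in each sample. Samples obtained through search engines typically violate the second assumption, which is one reason more elaborate capture-recapture models exist.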
