University Louis Pasteur Strasbourg I
LGeCo, INSA Strasbourg
University of West Bohemia in Pilsen, Faculty of Applied Sciences

WEB MINING METHODS FOR THE DETECTION OF AUTHORITATIVE SOURCES

The innovative part of this thesis defines, explains, and tests modifications of the standard PageRank formula adapted to bibliographic networks. The new versions of PageRank take into account not only the citation graph but also the collaboration graph. The applicability of the new algorithms is verified on data from the DBLP digital library by comparing the ranks of the winners of the ACM SIGMOD E. F. Codd Innovations Award. Rankings based on both citation and collaboration information prove better than rankings generated by standard PageRank. Another chapter of the thesis presents a methodology and two case studies on finding authoritative researchers by analyzing academic Web sites.
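To make the idea of ranking on both graphs concrete, the following is a minimal sketch only. The abstract does not state the modified formulas, so the weighting in `combine` (a citation counts for less when the two authors have co-authored papers) is a hypothetical choice made for illustration, not the thesis's definition. The function names, the damping factor of 0.85, and the toy data are likewise assumptions.

```python
def weighted_pagerank(edges, damping=0.85, tol=1e-10, max_iter=200):
    """Power-iteration PageRank on a weighted directed graph.

    edges maps (source, target) -> positive weight.
    """
    nodes = sorted({node for edge in edges for node in edge})
    n = len(nodes)
    out_weight = {node: 0.0 for node in nodes}
    for (src, _dst), weight in edges.items():
        out_weight[src] += weight

    rank = {node: 1.0 / n for node in nodes}
    for _ in range(max_iter):
        # Rank held by nodes without outgoing edges is spread uniformly.
        dangling = sum(rank[v] for v in nodes if out_weight[v] == 0.0)
        base = (1.0 - damping) / n + damping * dangling / n
        new_rank = {node: base for node in nodes}
        for (src, dst), weight in edges.items():
            new_rank[dst] += damping * rank[src] * weight / out_weight[src]
        delta = sum(abs(new_rank[v] - rank[v]) for v in nodes)
        rank = new_rank
        if delta < tol:
            break
    return rank


def combine(citations, coauthored):
    """Hypothetical weighting (an assumption, not the thesis's formula):
    divide each citation count by 1 + number of papers the pair co-authored."""
    return {
        (src, dst): count / (1.0 + coauthored.get(frozenset((src, dst)), 0))
        for (src, dst), count in citations.items()
    }


# Toy author-level data (invented for illustration only).
citations = {("a", "b"): 2, ("a", "c"): 2, ("b", "a"): 1, ("c", "a"): 1}
coauthored = {frozenset(("a", "b")): 1}  # a and b wrote one paper together

plain = weighted_pagerank(citations)
combined = weighted_pagerank(combine(citations, coauthored))
print(plain)
print(combined)
```

On this toy graph, authors `b` and `c` are indistinguishable by citations alone, while the combined weighting ranks `c` above `b` because the citations from `a` to `b` come from a co-author.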
