Focused document gathering on the Web for domain-specific information retrieval (Collecte orientée sur le Web pour la recherche d'information spécialisée)

Vertical search engines, which focus on specific segments of the Web, are becoming increasingly prominent on the Internet. Topical search engines in particular can achieve very good performance by restricting the indexed corpus to a known topic. Language ambiguities are easier to control when the domain is well targeted. Moreover, knowledge of the objects and their properties makes it possible to develop specific analysis techniques for extracting relevant information.

In this thesis, we focus more specifically on the procedure for collecting topical documents from the Web to feed a topical search engine. This collection can be carried out either by relying on an existing general-purpose search engine (focused search) or by following the hyperlinks between Web pages (focused crawling).

We first study focused search. In this context, the classical approach consists in combining keywords from the domain of interest, submitting them to a search engine, and downloading the top results it returns. After empirically evaluating this approach on 340 topics from the OpenDirectory, we propose to improve it in two respects. Upstream of the search engine, we propose formulating topical queries that are more relevant to the topic in order to increase the precision of the collection. We define a metric based on a cooccurrence graph and a random walk algorithm to predict the relevance of a topical query. Downstream of the search engine, we propose filtering the downloaded documents to improve the quality of the resulting corpus. To do so, we model the collection procedure as a tripartite graph and apply a biased random walk algorithm to rank by relevance the documents and the terms that appear in them.

In the second part of this thesis, we turn to focused crawling of the Web. At the core of any focused crawler is a crawl strategy that lets it maximize the retrieval of pages relevant to a topic while minimizing the number of visited pages that are unrelated to it. In practice, this strategy defines the order in which pages are visited. We propose to automatically learn a topic-independent ranking function from existing data that has been annotated automatically.
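As an illustration of the filtering step in the first part, the sketch below ranks the nodes of a small tripartite graph (queries, documents, terms) with a random walk with restart, which is one common form of biased random walk. The abstract does not specify the edge weights, the restart distribution or the damping value, so the unweighted edges, the single seed term, the `damping=0.85` default and the function name `biased_random_walk` are all assumptions made for this example.

```python
from collections import defaultdict


def biased_random_walk(edges, seeds, damping=0.85, iterations=50):
    """Random walk with restart on an undirected graph, by power iteration.

    `seeds` are the nodes the walker jumps back to (assumed to be in the graph).
    """
    neighbors = defaultdict(set)
    for u, v in edges:
        neighbors[u].add(v)
        neighbors[v].add(u)
    nodes = sorted(neighbors)

    # Restart distribution: uniform over the seed nodes, zero elsewhere.
    restart = {n: (1.0 / len(seeds) if n in seeds else 0.0) for n in nodes}
    rank = dict(restart)

    for _ in range(iterations):
        # With probability (1 - damping) the walker returns to a seed.
        nxt = {n: (1.0 - damping) * restart[n] for n in nodes}
        # Otherwise it moves to a uniformly chosen neighbor.
        for u in nodes:
            share = damping * rank[u] / len(neighbors[u])
            for v in neighbors[u]:
                nxt[v] += share
        rank = nxt
    return rank


# Toy tripartite graph (hypothetical data): queries link to the documents they
# returned, and documents link to the terms they contain.
edges = [
    ("q:wine tasting", "d:1"), ("q:wine tasting", "d:2"), ("q:red car", "d:3"),
    ("d:1", "t:wine"), ("d:1", "t:grape"),
    ("d:2", "t:wine"), ("d:2", "t:car"),
    ("d:3", "t:car"),
]
scores = biased_random_walk(edges, seeds={"t:wine"})
documents = sorted((n for n in scores if n.startswith("d:")), key=scores.get, reverse=True)
print(documents)
```

In this toy setting, documents that are close to the seed term accumulate more probability mass than the off-topic document, so thresholding the scores would act as a filter. The same scores also give a ranking of the terms.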
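To make concrete the idea that a crawl strategy defines the visit order, here is a minimal best-first frontier sketch. It is not the thesis implementation: the learned ranking function is replaced by a hypothetical hand-written `toy_score`, the Web is replaced by a small in-memory link graph `TOY_WEB`, and the names `best_first_crawl` and `fetch_links` are introduced only for illustration.

```python
import heapq
from typing import Callable, Iterable, List


def best_first_crawl(
    seeds: Iterable[str],
    fetch_links: Callable[[str], List[str]],
    score: Callable[[str], float],
    budget: int,
) -> List[str]:
    """Visit up to `budget` pages, always expanding the highest-scored URL first."""
    frontier = []  # min-heap of (-score, insertion counter, url)
    seen = set()
    counter = 0

    def push(url: str) -> None:
        nonlocal counter
        if url not in seen:
            seen.add(url)
            # heapq is a min-heap, so the score is negated; the counter
            # breaks ties in insertion order.
            heapq.heappush(frontier, (-score(url), counter, url))
            counter += 1

    for url in seeds:
        push(url)

    visited = []
    while frontier and len(visited) < budget:
        _, _, url = heapq.heappop(frontier)
        visited.append(url)
        for link in fetch_links(url):
            push(link)
    return visited


# Toy in-memory link graph standing in for the Web (hypothetical URLs).
TOY_WEB = {
    "seed": ["wine/a", "cars/x"],
    "wine/a": ["wine/b", "cars/y"],
    "wine/b": [],
    "cars/x": ["cars/z"],
    "cars/y": [],
    "cars/z": [],
}


def toy_score(url: str) -> float:
    """Stand-in for a learned ranking function: prefers URLs under wine/."""
    return 1.0 if url.startswith("wine/") else 0.0


order = best_first_crawl(["seed"], lambda u: TOY_WEB.get(u, []), toy_score, budget=4)
print(order)  # ['seed', 'wine/a', 'wine/b', 'cars/x']
```

Swapping `toy_score` for a model trained on automatically annotated crawl data would change only the ordering of the frontier, which is the part of the crawler the thesis proposes to learn.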
