Focusing on novelty: a crawling strategy to build diverse language models

Word prediction performed by language models plays an important role in many tasks, such as word sense disambiguation, speech recognition, handwriting recognition, query spelling correction and query segmentation. Recent research has exploited the textual content of the Web to create language models. In this paper, we propose a new focused crawling strategy that targets novelty when collecting Web pages, in order to create diverse language models. In each crawling cycle, the crawler tries to fill the gaps present in the current language model built from previous cycles by avoiding pages whose vocabulary is already well represented in the model. It relies on an information-theoretic measure to identify these gaps and then learns link patterns that lead to pages in these regions in order to guide its visitation policy. To handle constantly evolving domains, a key feature of our crawler is its ability to adjust its focus as the crawl progresses. We evaluate our approach in two different scenarios in which our solution can be useful. First, we demonstrate that our approach produces more effective language models than those created by a baseline crawler in the context of a speech recognition task on broadcast news. In some cases, our crawler obtained results similar to the baseline's while crawling only 12.5% of the pages collected by the latter. Second, since avoiding well-represented content in the news domain might lead to novel, i.e. up-to-date, pages, we show that our diversity-based strategy can also help guide the crawler toward the most recent news content. The results show that our approach obtained on average 50% more up-to-date pages than the baseline crawler.
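The idea of scoring pages against the current language model can be illustrated with a minimal sketch. The abstract does not name the information-theoretic measure, so perplexity under an add-one smoothed unigram model is assumed here purely for illustration; the names `UnigramModel`, `tokenize` and `rank_by_novelty` are hypothetical and not taken from the paper.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase whitespace tokenization; a deliberately simple stand-in."""
    return text.lower().split()


class UnigramModel:
    """Add-one smoothed unigram model built from the pages crawled so far."""

    def __init__(self):
        self.counts = Counter()
        self.total = 0

    def update(self, text):
        """Add the tokens of a newly crawled page to the model."""
        tokens = tokenize(text)
        self.counts.update(tokens)
        self.total += len(tokens)

    def prob(self, word):
        # One extra vocabulary slot is reserved for unseen words.
        vocab_size = len(self.counts) + 1
        return (self.counts[word] + 1) / (self.total + vocab_size)

    def perplexity(self, text):
        """Per-token perplexity of a text; higher means less well covered."""
        tokens = tokenize(text)
        if not tokens:
            return 0.0
        log_sum = sum(math.log2(self.prob(t)) for t in tokens)
        return 2 ** (-log_sum / len(tokens))


def rank_by_novelty(model, pages):
    """Order candidate pages from most to least novel (assumed criterion)."""
    return sorted(pages, key=model.perplexity, reverse=True)


model = UnigramModel()
model.update("the election results were announced today")

seen_page = "the election results"
novel_page = "volcano eruption disrupts flights"
ranking = rank_by_novelty(model, [seen_page, novel_page])
```

In this sketch the page made only of unseen words receives the higher perplexity and is ranked first, which mirrors the notion of a gap in the model. A real crawler would also need to predict novelty from links before fetching a page, which this example does not attempt.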
