Web scale NLP: a case study on URL word breaking

This paper uses the URL word breaking task as an example to elaborate on what we identify as crucial in designing statistical natural language processing (NLP) algorithms for Web scale applications: (1) rudimentary multilingual capabilities to cope with the global nature of the Web, (2) multi-style modeling to handle the diverse language styles seen in Web contents, (3) fast adaptation to keep pace with the dynamic changes of the Web, (4) minimal heuristic assumptions for generalizability and robustness, and (5) the possibility of an efficient implementation requiring minimal manual effort, so that massive amounts of data can be processed at a reasonable cost. We first show that state-of-the-art word breaking techniques can be unified and generalized under the Bayesian minimum risk (BMR) framework which, using a Web scale N-gram model, can meet the first three requirements. We discuss how existing techniques can be viewed as introducing additional assumptions into the basic BMR framework, and we describe a generic yet efficient implementation called word synchronous beam search. Testing the framework and its implementation in a series of large scale experiments reveals the following. First, the language style used to build the model plays a critical role in the word breaking task: the style most suitable for URL word breaking appears to be that of document titles, with which the best performance is obtained. Models created from other language styles, such as document body, anchor text, and even queries, exhibit varying degrees of mismatch. Although all styles benefit from increased modeling power, which in our experiments corresponds to using a higher order N-gram, the gain is most pronounced for the title model. The heuristics proposed in prior art do contribute to word breaking performance for mismatched or less powerful models, but they are less effective and, in many cases, lead to poorer performance than the matched model with minimal assumptions. For the matched model based on document titles, an accuracy of 97.18% can already be achieved with a simple trigram model and no heuristics.
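
As a rough sketch of the decision rule behind the framework named above, in our own notation rather than necessarily the paper's: given an unsegmented string $s$, a Bayesian minimum risk segmenter chooses

$$\hat{W} = \arg\min_{W} \sum_{W'} L(W, W')\, P(W' \mid s),$$

which under a 0/1 loss reduces to the maximum a posteriori rule $\hat{W} = \arg\max_{W} P(W)$, the maximization running over segmentations $W$ whose concatenation yields $s$, with $P(W)$ supplied by the N-gram language model.

The word synchronous beam search mentioned above can then be sketched as follows. This is a minimal Python illustration under our own assumptions: the toy vocabulary, probabilities, and function names are hypothetical stand-ins for the paper's Web scale N-gram models, and a unigram scorer replaces the higher order models for brevity.

```python
import math
from heapq import nlargest

# Toy unigram log-probabilities; purely illustrative stand-ins for the
# Web scale N-gram models the paper builds from document titles, bodies,
# anchor text, and queries.
TOY_UNIGRAM_LOGP = {
    "choose": math.log(1e-4),
    "spain": math.log(5e-5),
    "in": math.log(1e-3),
}
OOV_LOGP = math.log(1e-12)  # flat floor for words outside the toy vocabulary


def word_logp(word):
    """Score one candidate word; unseen words get the out-of-vocabulary floor."""
    return TOY_UNIGRAM_LOGP.get(word, OOV_LOGP)


def word_break(s, beam_width=5, max_word_len=12):
    """Segment s by word synchronous beam search.

    Hypotheses ending at the same character position compete in one beam,
    so pruning happens at word boundaries rather than per character.
    """
    # beams[i] holds up to beam_width (log-prob, word list) hypotheses
    # that exactly cover s[:i].
    beams = {0: [(0.0, [])]}
    for i in range(len(s)):
        for score, words in beams.get(i, []):
            # Extend every surviving hypothesis by one more candidate word.
            for j in range(i + 1, min(i + max_word_len, len(s)) + 1):
                word = s[i:j]
                beams.setdefault(j, []).append(
                    (score + word_logp(word), words + [word])
                )
        # Keep only the best hypotheses at each not-yet-expanded position.
        for j in beams:
            if j > i:
                beams[j] = nlargest(beam_width, beams[j], key=lambda h: h[0])
    _, best_words = max(
        beams.get(len(s), [(float("-inf"), [])]), key=lambda h: h[0]
    )
    return best_words


if __name__ == "__main__":
    print(word_break("choosespain"))  # -> ['choose', 'spain']
```

Because hypotheses are extended one whole word at a time, all hypotheses reaching a given character position compete synchronously in a single beam; substituting a bigram or trigram scorer that conditions each word on the preceding words in the hypothesis recovers the higher order setting the paper evaluates.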
