Web scale NLP: a case study on URL word breaking

This paper uses the URL word breaking task as an example to elaborate on what we identify as crucial in designing statistical natural language processing (NLP) algorithms for Web scale applications: (1) rudimentary multilingual capabilities to cope with the global nature of the Web, (2) multi-style modeling to handle the diverse language styles seen in Web contents, (3) fast adaptation to keep pace with the dynamic changes of the Web, (4) minimal heuristic assumptions for generalizability and robustness, and (5) the possibility of an efficient implementation requiring minimal manual effort, so that massive amounts of data can be processed at a reasonable cost. We first show that state-of-the-art word breaking techniques can be unified and generalized under the Bayesian minimum risk (BMR) framework which, using a Web scale N-gram model, can meet the first three requirements. We discuss how existing techniques can be viewed as introducing additional assumptions into the basic BMR framework, and we describe a generic yet efficient implementation called word synchronous beam search. Testing the framework and its implementation in a series of large scale experiments reveals the following. First, the language style used to build the model plays a critical role in the word breaking task: the style most suitable for URL word breaking appears to be that of document titles, with which the best performance is obtained. Models created from other language styles, such as document body, anchor text, and even queries, exhibit varying degrees of mismatch. Although all styles benefit from increased modeling power, which in our experiments corresponds to using a higher order N-gram, the gain is most pronounced for the title model. The heuristics proposed in prior art do contribute to word breaking performance for mismatched or less powerful models, but they are less effective and, in many cases, lead to poorer performance than the matched model with minimal assumptions. For the matched model based on document titles, an accuracy of 97.18% can already be achieved with a simple trigram model and no heuristics.
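
As a rough sketch of the decision rule behind the framework named above, in our own notation rather than necessarily the paper's: given an unsegmented string $s$, a Bayesian minimum risk segmenter chooses

$$\hat{W} = \arg\min_{W} \sum_{W'} L(W, W')\, P(W' \mid s),$$

which under a 0/1 loss reduces to the maximum a posteriori rule $\hat{W} = \arg\max_{W} P(W)$, the maximization running over segmentations $W$ whose concatenation yields $s$, with $P(W)$ supplied by the N-gram language model.

The word synchronous beam search mentioned above can then be sketched as follows. This is a minimal Python illustration under our own assumptions: the toy vocabulary, probabilities, and function names are hypothetical stand-ins for the paper's Web scale N-gram models, and a unigram scorer replaces the higher order models for brevity.

```python
import math
from heapq import nlargest

# Toy unigram log-probabilities; purely illustrative stand-ins for the
# Web scale N-gram models the paper builds from document titles, bodies,
# anchor text, and queries.
TOY_UNIGRAM_LOGP = {
    "choose": math.log(1e-4),
    "spain": math.log(5e-5),
    "in": math.log(1e-3),
}
OOV_LOGP = math.log(1e-12)  # flat floor for words outside the toy vocabulary


def word_logp(word):
    """Score one candidate word; unseen words get the out-of-vocabulary floor."""
    return TOY_UNIGRAM_LOGP.get(word, OOV_LOGP)


def word_break(s, beam_width=5, max_word_len=12):
    """Segment s by word synchronous beam search.

    Hypotheses ending at the same character position compete in one beam,
    so pruning happens at word boundaries rather than per character.
    """
    # beams[i] holds up to beam_width (log-prob, word list) hypotheses
    # that exactly cover s[:i].
    beams = {0: [(0.0, [])]}
    for i in range(len(s)):
        for score, words in beams.get(i, []):
            # Extend every surviving hypothesis by one more candidate word.
            for j in range(i + 1, min(i + max_word_len, len(s)) + 1):
                word = s[i:j]
                beams.setdefault(j, []).append(
                    (score + word_logp(word), words + [word])
                )
        # Keep only the best hypotheses at each not-yet-expanded position.
        for j in beams:
            if j > i:
                beams[j] = nlargest(beam_width, beams[j], key=lambda h: h[0])
    _, best_words = max(
        beams.get(len(s), [(float("-inf"), [])]), key=lambda h: h[0]
    )
    return best_words


if __name__ == "__main__":
    print(word_break("choosespain"))  # -> ['choose', 'spain']
```

Because hypotheses are extended one whole word at a time, all hypotheses reaching a given character position compete synchronously in a single beam; substituting a bigram or trigram scorer that conditions each word on the preceding words in the hypothesis recovers the higher order setting the paper evaluates.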
