Paper information - Syntactic complexity of Web search queries through the lenses of language models, networks and users

We present a holistic view on the syntactic complexity of Web search queries. We use three perspectives: statistical language modeling, complex network analysis, and "native speaker" intuition. The three complementary viewpoints show that the syntactic structure of Web queries is more complex than what n-grams can capture, but simpler than natural language. Queries, thus, seem to represent an intermediate stage between syntactic and non-syntactic communication.

Across the world, millions of users interact with search engines every day to satisfy their information needs. As the Web grows over time, these information needs, manifested through user search queries, also become more complex. However, there has been no systematic study that quantifies the structural complexity of Web search queries. In this research, we attempt to understand and characterize the syntactic complexity of search queries using a multi-pronged approach. We use traditional statistical language modeling techniques to quantify the perplexity of queries and compare it with that of natural language (NL). We then use complex network analysis to compare the topological properties of queries issued by real Web users with those of queries generated by statistical models. Finally, we conduct experiments to study whether search engine users can identify real queries when these are presented alongside model-generated ones. The three complementary studies show that the syntactic structure of Web queries is more complex than what n-grams can capture, but simpler than NL. Queries, thus, seem to represent an intermediate stage between syntactic and non-syntactic communication.
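To make the perplexity comparison mentioned above concrete, the following is a minimal sketch of how the perplexity of a query under a bigram language model might be computed. It is an illustration only, not the authors' setup: the toy queries, the add-one (Laplace) smoothing, the `<unk>` handling, and the function names `tokenize`, `train_bigram`, and `perplexity` are all assumptions introduced here.

```python
import math
from collections import Counter

# Hypothetical boundary and unknown-word markers (assumptions of this sketch).
BOS, EOS, UNK = "<s>", "</s>", "<unk>"


def tokenize(text, vocab=None):
    """Lowercase, split on whitespace, and add boundary markers.

    If a vocabulary is given, words outside it are mapped to UNK.
    """
    tokens = text.lower().split()
    if vocab is not None:
        tokens = [t if t in vocab else UNK for t in tokens]
    return [BOS] + tokens + [EOS]


def train_bigram(corpus):
    """Collect context counts, bigram counts, and the predictable vocabulary."""
    context_counts, bigram_counts, vocab = Counter(), Counter(), set()
    for text in corpus:
        tokens = tokenize(text)
        vocab.update(tokens[1:])  # every symbol that can be predicted, incl. EOS
        for prev, cur in zip(tokens, tokens[1:]):
            context_counts[prev] += 1
            bigram_counts[(prev, cur)] += 1
    vocab.add(UNK)  # reserve probability mass for unseen words
    return context_counts, bigram_counts, vocab


def perplexity(corpus, model):
    """Perplexity of a non-empty list of texts under an add-one-smoothed bigram model."""
    context_counts, bigram_counts, vocab = model
    vocab_size = len(vocab)
    log_prob, n_predictions = 0.0, 0
    for text in corpus:
        tokens = tokenize(text, vocab)
        for prev, cur in zip(tokens, tokens[1:]):
            # Add-one smoothing: an assumption of this sketch, chosen for brevity.
            p = (bigram_counts[(prev, cur)] + 1) / (context_counts[prev] + vocab_size)
            log_prob += math.log2(p)
            n_predictions += 1
    return 2 ** (-log_prob / n_predictions)


# Toy query log (invented for illustration; not data from the paper).
toy_queries = [
    "new york hotels",
    "cheap hotels new york",
    "new york weather",
    "cheap flights london",
]
model = train_bigram(toy_queries)
print(perplexity(["new york hotels"], model))   # query in its observed word order
print(perplexity(["hotels york new"], model))   # same words, scrambled order
```

On this toy data the query in its observed word order receives a lower perplexity than the scrambled one, which is the kind of signal a perplexity comparison relies on: the more predictable the word order, the lower the score.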
