Syntactic complexity of Web search queries through the lenses of language models, networks and users

We present a holistic view of the syntactic complexity of Web search queries. We use three perspectives: statistical language modeling, complex network analysis, and "native speaker" intuition. The three complementary viewpoints show that the syntactic structure of Web queries is more complex than what n-grams can capture, but simpler than natural language. Queries, thus, seem to represent an intermediate stage between syntactic and non-syntactic communication.

Across the world, millions of users interact with search engines every day to satisfy their information needs. As the Web grows, such information needs, manifested through user search queries, also become more complex. However, there has been no systematic study that quantifies the structural complexity of Web search queries. In this research, we attempt to understand and characterize the syntactic complexity of search queries using a multi-pronged approach. We use traditional statistical language modeling techniques to quantify the perplexity of queries and compare it with that of natural language (NL). We then use complex network analysis to compare the topological properties of queries issued by real Web users with those of queries generated by statistical models. Finally, we conduct experiments to study whether search engine users are able to identify real queries when these are presented alongside model-generated ones. The three complementary studies show that the syntactic structure of Web queries is more complex than what n-grams can capture, but simpler than NL. Queries, thus, seem to represent an intermediate stage between syntactic and non-syntactic communication.
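To make the notion of perplexity under an n-gram model concrete, the following is a minimal sketch and not the authors' implementation. It assumes a bigram model with add-one (Laplace) smoothing and whitespace tokenization; the toy corpus and the function names `train_bigram` and `perplexity` are hypothetical and chosen only for illustration.

```python
import math
from collections import Counter


def train_bigram(corpus):
    """Count unigrams and bigrams over whitespace-tokenized lines.

    Each line is padded with <s> and </s> boundary markers (an assumption
    of this sketch, not something stated in the abstract).
    """
    unigrams, bigrams = Counter(), Counter()
    for line in corpus:
        tokens = ["<s>"] + line.split() + ["</s>"]
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    return unigrams, bigrams


def perplexity(line, unigrams, bigrams):
    """Perplexity of one line under an add-one smoothed bigram model."""
    tokens = ["<s>"] + line.split() + ["</s>"]
    # Vocabulary size: all observed types plus one slot for unseen words.
    vocab_size = len(unigrams) + 1
    log_prob = 0.0
    for prev, word in zip(tokens, tokens[1:]):
        p = (bigrams[(prev, word)] + 1) / (unigrams[prev] + vocab_size)
        log_prob += math.log2(p)
    n_predictions = len(tokens) - 1
    return 2 ** (-log_prob / n_predictions)


# Hypothetical toy query log, for illustration only.
toy_queries = [
    "cheap flights to paris",
    "cheap hotels in paris",
    "flights to london",
]
unigrams, bigrams = train_bigram(toy_queries)
pp_ordered = perplexity("cheap flights to paris", unigrams, bigrams)
pp_shuffled = perplexity("paris to flights cheap", unigrams, bigrams)
print(pp_ordered, pp_shuffled)
```

In this sketch a query whose word order matches the training data receives a lower perplexity than a shuffled version of the same words, which is the kind of order sensitivity an n-gram model can measure. Comparing queries with NL text in this way would require corpora and smoothing choices that the abstract does not specify.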
