Characterizing Search Activities on Stack Overflow

To solve programming issues, developers commonly search Stack Overflow for potential solutions. However, there is a gap between the knowledge developers are interested in and the knowledge they are able to retrieve using search engines. To help developers efficiently retrieve relevant knowledge on Stack Overflow, prior studies proposed several techniques to reformulate queries and generate summarized answers. However, few studies have performed a large-scale analysis using real-world search logs. In this paper, we characterize how developers search on Stack Overflow using such logs. By doing so, we identify the challenges developers face when searching on Stack Overflow and seek opportunities for the platform and researchers to help developers efficiently retrieve knowledge. Our analysis uses search log data based on requests to Stack Overflow's web servers. We find that the most common search activity is reformulating the immediately preceding query. Related work on generic search engines identified 13 types of query reformulation strategies; we observe that 71.78% of the reformulations in our data fit those strategies. In terms of how queries are structured, 17.41% of the search sessions search only for fragments of source code artifacts (e.g., class and method names) without specifying the names of programming languages, libraries, or frameworks. Based on our findings, we provide actionable suggestions for Stack Overflow moderators and outline directions for future research. For example, we encourage Stack Overflow to set up a database that captures the relations between the programming terms shared on Stack Overflow, e.g., method names, data structure names, design patterns, and IDE names. By doing so, Stack Overflow could improve the performance of its search engine by considering related programming terms at different levels of granularity.
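As a rough illustration of the suggested term-relation database, the following minimal Python sketch expands a query that contains only a code fragment with broader related terms (library, then language). It is not the method used in the paper: the `BROADER_TERM` table, its three entries, and the function names `broader_terms` and `expand_query` are all hypothetical and chosen only for this example.

```python
from typing import Dict, List

# Hypothetical, hand-written relations for illustration only:
# each term maps to the next broader term (method -> library -> language).
BROADER_TERM: Dict[str, str] = {
    "read_csv": "pandas",
    "pandas": "python",
    "arraylist": "java",
}


def broader_terms(term: str) -> List[str]:
    """Return the chain of broader terms for a term, finest level first."""
    current = term.lower()
    seen = {current}
    chain: List[str] = []
    while current in BROADER_TERM:
        current = BROADER_TERM[current]
        if current in seen:  # guard against accidental cycles in the table
            break
        seen.add(current)
        chain.append(current)
    return chain


def expand_query(query: str) -> List[str]:
    """Append broader terms to the lowercased query tokens, without duplicates."""
    tokens = query.lower().split()
    expanded = list(tokens)
    for token in tokens:
        for broader in broader_terms(token):
            if broader not in expanded:
                expanded.append(broader)
    return expanded


if __name__ == "__main__":
    print(expand_query("read_csv encoding error"))
    print(expand_query("ArrayList remove"))
```

In this sketch, a query such as "read_csv encoding error" gains the terms "pandas" and "python", which is one way a search engine might recover the context missing from sessions that name only a class or method.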
