Predicting semantically linkable knowledge in developer online forums via convolutional neural network

Consider a question and its answers in Stack Overflow as a knowledge unit. Knowledge units often contain semantically relevant knowledge and are thus linkable for different purposes: as duplicate questions, as directly linkable for problem solving, or as indirectly linkable for related information. Recognizing different classes of linkable knowledge would support more targeted information needs when users search or explore the knowledge base. Existing methods focus on binary relatedness (i.e., related or not) and are not robust in recognizing different classes of semantic relatedness when linkable knowledge units share few words in common (i.e., have a lexical gap). In this paper, we formulate the problem of predicting semantically linkable knowledge units as a multiclass classification problem and solve it using deep learning techniques. To overcome the lexical gap, we adopt a neural language model (word embeddings) and a convolutional neural network (CNN) to capture word-level and document-level semantics of knowledge units. Instead of using human-engineered classifier features, which are hard to design for informal user-generated content, we exploit large numbers of user-created knowledge-unit links of different types to train the CNN to learn the most informative word-level and document-level features for the multiclass classification task. Our evaluation shows that our deep-learning-based approach significantly and consistently outperforms traditional methods that use traditional word representations and human-engineered classifier features.
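
To make the pipeline described in the abstract concrete, the following is a minimal sketch of a forward pass: word embeddings feed a convolution over word windows, max-over-time pooling gives a document-level feature vector, and a softmax layer scores the link classes. It is an illustration under stated assumptions, not the authors' architecture. The function names, the toy vocabulary, the random untrained weights, the choice to concatenate the two encoded knowledge units, and the fourth "unrelated" label are all hypothetical.

```python
import math
import random

# Link classes named in the abstract, plus an assumed "unrelated" label.
CLASSES = ["duplicate", "direct", "indirect", "unrelated"]


def conv_max_pool(word_vectors, filt, bias):
    """Slide one filter over consecutive word windows, apply ReLU,
    and keep the strongest response (max-over-time pooling)."""
    window = len(filt)
    best = 0.0  # ReLU outputs are never negative, so 0.0 is a safe floor
    for start in range(len(word_vectors) - window + 1):
        total = bias
        for offset in range(window):
            vec = word_vectors[start + offset]
            total += sum(w * x for w, x in zip(filt[offset], vec))
        best = max(best, max(total, 0.0))
    return best


def encode(tokens, embeddings, filters, biases):
    """Map a token list to one pooled feature per filter."""
    dim = len(next(iter(embeddings.values())))
    # Unknown words fall back to a zero vector (an assumption of this sketch).
    vectors = [embeddings.get(tok, [0.0] * dim) for tok in tokens]
    return [conv_max_pool(vectors, f, b) for f, b in zip(filters, biases)]


def softmax(scores):
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def predict(tokens_a, tokens_b, embeddings, filters, biases, out_weights):
    """Score a pair of knowledge units against every link class."""
    features = (encode(tokens_a, embeddings, filters, biases)
                + encode(tokens_b, embeddings, filters, biases))
    scores = [sum(w * f for w, f in zip(row, features)) for row in out_weights]
    return dict(zip(CLASSES, softmax(scores)))


# Toy setup with random, untrained parameters (illustrative values only).
rng = random.Random(0)
DIM, N_FILTERS, WINDOW = 4, 3, 2
vocab = ["how", "to", "parse", "json", "read", "file", "in", "java"]
embeddings = {w: [rng.uniform(-1, 1) for _ in range(DIM)] for w in vocab}
filters = [[[rng.uniform(-1, 1) for _ in range(DIM)] for _ in range(WINDOW)]
           for _ in range(N_FILTERS)]
biases = [0.0] * N_FILTERS
out_weights = [[rng.uniform(-1, 1) for _ in range(2 * N_FILTERS)]
               for _ in CLASSES]

probs = predict(["how", "to", "parse", "json"],
                ["read", "json", "file", "in", "java"],
                embeddings, filters, biases, out_weights)
```

Because the weights here are random, the resulting probabilities carry no meaning. In a trained model, the embeddings, filters, and output weights would be learned from labeled knowledge-unit links, which is what lets the features respond to semantic similarity even when two units share few words.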
