Automatically Categorizing Software Technologies

Informal language and the absence of a standard taxonomy for software technologies make it difficult to reliably analyze technology trends on discussion forums and other on-line venues. We propose an automated approach called Witt for the categorization of software technologies (an expanded version of the hypernym discovery problem). Witt takes as input a phrase describing a software technology or concept and returns a general category that describes it (e.g., integrated development environment), along with attributes that further qualify it (commercial, php, etc.). By extension, the approach enables the dynamic creation of lists of all technologies of a given type (e.g., web application frameworks). Our approach relies on Stack Overflow and Wikipedia, and involves numerous original domain adaptations and a new solution to the problem of normalizing automatically-detected hypernyms. We compared Witt with six independent taxonomy tools and found that, when applied to software terms, Witt demonstrated better coverage than all evaluated alternative solutions, without a corresponding degradation in false positive rate.
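
To make the input and output described above concrete, the following is a minimal sketch of the shape of a categorization result: a general category plus qualifying attributes. It is not Witt's implementation. The `KNOWN_CATEGORIES` list, the `Categorization` type, and the `split_hypernym` function are hypothetical names introduced only for illustration, and the suffix-matching rule is an assumed simplification of hypernym normalization.

```python
from dataclasses import dataclass
from typing import List, Optional

# Hypothetical, hand-picked category list (an assumption, not from the paper).
KNOWN_CATEGORIES = [
    "integrated development environment",
    "web application framework",
    "library",
]


@dataclass
class Categorization:
    category: str          # general category, e.g. "integrated development environment"
    attributes: List[str]  # qualifiers, e.g. ["commercial", "php"]


def split_hypernym(hypernym: str) -> Optional[Categorization]:
    """Split a raw hypernym phrase into a known category and leftover attributes.

    Assumed simplification: the category is the longest known category that
    ends the phrase; every word before it is treated as an attribute.
    """
    text = hypernym.lower().strip()
    for category in sorted(KNOWN_CATEGORIES, key=len, reverse=True):
        if text == category or text.endswith(" " + category):
            prefix = text[: len(text) - len(category)]
            return Categorization(category, prefix.split())
    return None  # no known category ends this phrase


result = split_hypernym("commercial PHP integrated development environment")
print(result)
```

Under these assumptions, grouping results by their `category` field is one way to build a list of all technologies of a given type.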
