Topic Modeling on User Stories using Word Mover's Distance

Requirements elicitation has recently been complemented with crowd-based techniques, which continuously involve large, heterogeneous groups of users who express their feedback through a variety of media. Crowd-based elicitation has great potential for engaging (potential) users early on, but it also produces large sets of raw, unstructured feedback. Consolidating and analyzing this feedback is a key challenge in turning it into sensible user requirements. In this paper, we focus on topic modeling as a means to identify topics within a large set of crowd-generated user stories and compare three approaches: (1) a traditional approach based on Latent Dirichlet Allocation, (2) a combination of word embeddings and principal component analysis, and (3) a combination of word embeddings and Word Mover's Distance. We evaluate the approaches on a publicly available set of 2,966 user stories written and categorized by crowd workers. We find that the combination of word embeddings and Word Mover's Distance is the most promising. Depending on the word embeddings used, we cluster the user stories in two ways: one that stays closer to the original categorization, and another that offers new insights into the dataset, e.g., by revealing potentially new categories. Unfortunately, no measure exists to rate the quality of our results objectively. Still, our findings provide a basis for future work on analyzing crowd-sourced user stories.
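To make the third approach more concrete, the following minimal sketch illustrates the idea behind comparing two user stories by how far their words must "travel" in embedding space. It is not the implementation evaluated in the paper. It computes the relaxed lower bound on Word Mover's Distance (each word moves all of its mass to its nearest counterpart) rather than the exact optimal-transport solution, and the two-dimensional `TOY_EMBEDDINGS` table and the function names `nbow`, `one_sided_cost`, and `relaxed_wmd` are hypothetical, chosen only for illustration.

```python
import math
from collections import Counter

# Hypothetical 2-D toy embeddings (assumption); a real pipeline would
# load pretrained word vectors instead.
TOY_EMBEDDINGS = {
    "schedule": (1.0, 0.0),
    "plan": (0.9, 0.1),
    "match": (0.0, 1.0),
    "game": (0.1, 0.9),
    "pay": (-1.0, 0.0),
    "invoice": (-0.9, -0.1),
}


def nbow(tokens):
    """Normalized bag-of-words weights over tokens that have an embedding."""
    counts = Counter(t for t in tokens if t in TOY_EMBEDDINGS)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no token has a known embedding")
    return {word: count / total for word, count in counts.items()}


def one_sided_cost(src, dst):
    """Cost of moving each source word's full mass to its nearest target word."""
    return sum(
        weight * min(math.dist(TOY_EMBEDDINGS[w], TOY_EMBEDDINGS[v]) for v in dst)
        for w, weight in src.items()
    )


def relaxed_wmd(tokens_a, tokens_b):
    """Relaxed lower bound on Word Mover's Distance (max of both directions)."""
    a, b = nbow(tokens_a), nbow(tokens_b)
    return max(one_sided_cost(a, b), one_sided_cost(b, a))


# Illustrative token lists standing in for preprocessed user stories.
story_scheduling = ["schedule", "match"]
story_planning = ["plan", "game"]
story_billing = ["pay", "invoice"]

near = relaxed_wmd(story_scheduling, story_planning)
far = relaxed_wmd(story_scheduling, story_billing)
print(f"scheduling vs planning: {near:.3f}")
print(f"scheduling vs billing:  {far:.3f}")
```

In this toy setting the scheduling and planning stories share no words yet end up close together, which is the property that makes an embedding-based distance attractive for short texts such as user stories. A pairwise distance matrix built this way could then be passed to a clustering algorithm of choice.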
