FAST2: An intelligent assistant for finding relevant papers

The SE literature is complex and immense. Most papers are rarely read, so researchers and practitioners routinely miss important related work. Automatic active learners can help readers find more relevant works faster. But those tools offer no guidance on when to stop reading. Nor are they robust to a poor initial selection of papers, which leads to much longer reading time than average. This paper introduces the FAST2 text miner, which addresses these problems as follows. Firstly, FAST2 employs better tactics to more quickly find a good initial set of relevant documents. As shown in this paper, with a little domain knowledge, FAST2 can quickly find highly informative sets of initial papers, which leads to far more robust results while, at the same time, requiring fewer papers to be read (by an average of 10% to 40%). Secondly, as reading progresses, FAST2 offers a new semi-supervised estimation model called SEMI that estimates the remaining number of relevant studies. Users can use SEMI to better understand when they can safely stop reading. When compared with the prior state-of-the-art estimator, SEMI is far more accurate.
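To make the two ideas concrete, the following is a minimal sketch and not FAST2's actual implementation. It assumes that "a little domain knowledge" takes the form of reviewer-supplied keywords used to order the first papers to read, and that an estimator such as SEMI supplies a number for the total count of relevant papers. The names `rank_by_keywords`, `can_stop`, and `target_recall` are hypothetical and do not come from the paper.

```python
from typing import Dict, List


def rank_by_keywords(docs: Dict[str, str], keywords: List[str]) -> List[str]:
    """Order document ids by how often the reviewer's keywords occur (highest first).

    Assumption: domain knowledge is a plain keyword list. This is an
    illustrative stand-in, not the ranking used by FAST2.
    """
    def score(text: str) -> int:
        tokens = text.lower().split()
        return sum(tokens.count(k.lower()) for k in keywords)

    # sorted() is stable, so documents with equal scores keep their input order.
    return sorted(docs, key=lambda doc_id: score(docs[doc_id]), reverse=True)


def can_stop(found_relevant: int, estimated_total: float,
             target_recall: float = 0.95) -> bool:
    """Return True once the papers found cover the target share of the estimate.

    Assumption: estimated_total comes from some external estimator of the
    total number of relevant papers; how SEMI computes it is not shown here.
    """
    if estimated_total <= 0:
        return False
    return found_relevant / estimated_total >= target_recall


if __name__ == "__main__":
    abstracts = {
        "a": "defect prediction with defect metrics",
        "b": "game design",
        "c": "software defect study",
    }
    print(rank_by_keywords(abstracts, ["defect", "prediction"]))
    print(can_stop(found_relevant=48, estimated_total=50.0, target_recall=0.9))
```

Under these assumptions, a reviewer would read papers in the ranked order to seed the active learner, then check `can_stop` after each batch using the current estimate.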
