Using machine learning to disentangle homonyms in large text corpora

Systematic reviews are an increasingly popular decision-making tool that provides an unbiased summary of evidence to support conservation action. These reviews bridge the gap between researchers and managers by presenting a comprehensive overview of all studies relating to a particular topic and identifying specifically where and under which conditions an effect is present. However, several technical challenges can severely hinder the feasibility and applicability of systematic reviews. One such challenge is homonyms (terms that share spelling but differ in meaning), which add noise to search results and cannot be easily identified or removed. We developed a semiautomated approach that can aid in the classification of homonyms among narratives. We used a combination of automated content analysis and artificial neural networks to quickly and accurately sift through large corpora of academic texts and classify them into distinct topics. As an example, we explored the use of the word "reintroduction" in academic texts. Reintroduction is used within the conservation context to indicate the release of organisms to their former native habitat; however, a Web of Science search for this word returned thousands of publications in which the term has other meanings and contexts. Using our method, we automatically classified a sample of 3000 of these publications with over 99% accuracy, relative to a manual classification. Our approach can be used easily with other homonyms and can greatly facilitate systematic reviews or similar work in which homonyms hinder the harnessing of large text corpora. Beyond homonyms, we see great promise in combining automated content analysis and machine-learning methods to handle and screen big data for relevant information in conservation science.
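
The general idea of classifying texts by the sense in which they use a homonym can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' pipeline: the abstract does not describe the implementation, so the sketch substitutes simple word counts for the automated content analysis step and a single logistic neuron for the artificial neural network. The toy abstracts, the labels, and all function names are hypothetical.

```python
import math
import re
from collections import Counter

# Hypothetical toy abstracts. Label 1 = conservation sense of "reintroduction",
# label 0 = any other sense. Real work would use thousands of records.
TRAIN = [
    ("reintroduction of captive-bred lynx into native forest habitat", 1),
    ("survival of wolves after reintroduction to their former range", 1),
    ("species reintroduction and habitat restoration for endangered birds", 1),
    ("reintroduction of the drug after withdrawal in patients", 0),
    ("reintroduction of gluten in the diet of patients with symptoms", 0),
    ("policy reintroduction of the tax after parliamentary debate", 0),
]


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def build_vocab(docs, homonym="reintroduction"):
    """Map each context word to a column index; the homonym itself is dropped
    because it occurs in every document and carries no signal."""
    words = sorted({w for d in docs for w in tokenize(d)} - {homonym})
    return {w: i for i, w in enumerate(words)}


def vectorize(text, vocab):
    """Turn a text into a vector of word counts over the vocabulary."""
    vec = [0.0] * len(vocab)
    for word, count in Counter(tokenize(text)).items():
        if word in vocab:
            vec[vocab[word]] = float(count)
    return vec


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train(X, y, epochs=200, lr=0.5):
    """Fit a single logistic neuron with stochastic gradient descent."""
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            p = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b)
            err = p - yi  # gradient of the log loss w.r.t. the pre-activation
            w = [wj - lr * err * xj for wj, xj in zip(w, xi)]
            b -= lr * err
    return w, b


def predict(text, vocab, w, b):
    """Return 1 for the conservation sense, 0 otherwise."""
    x = vectorize(text, vocab)
    return 1 if sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b) >= 0.5 else 0


texts = [t for t, _ in TRAIN]
labels = [l for _, l in TRAIN]
vocab = build_vocab(texts)
w, b = train([vectorize(t, vocab) for t in texts], labels)

if __name__ == "__main__":
    for query in (
        "reintroduction of lynx into native forest habitat",
        "reintroduction of the drug in patients with symptoms",
    ):
        print(query, "->", predict(query, vocab, w, b))
```

In practice, predictions from such a classifier would be compared against a manually classified sample, as the abstract describes, before being trusted for screening search results.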
