Extraction of phenotypic traits from taxonomic descriptions for the tree of life using natural language processing

Premise of the Study: Phenotypic data sets are necessary to elucidate the genealogy of life, but assembling phenotypic data for taxa across the tree of life can be technically challenging and prohibitively time consuming. We describe a semi‐automated protocol to facilitate and expedite the assembly of phenotypic character matrices of plants from formal taxonomic descriptions. This pipeline uses new natural language processing (NLP) techniques and a glossary of over 9000 botanical terms.

Methods and Results: Our protocol includes the Explorer of Taxon Concepts (ETC), an online application that assembles taxon‐by‐character matrices from taxonomic descriptions, and MatrixConverter, a Java application that enables users to evaluate and discretize the characters extracted by ETC. We demonstrate this protocol using descriptions from Araucariaceae.

Conclusions: The NLP pipeline unlocks the phenotypic data found in taxonomic descriptions and makes them usable for evolutionary analyses.
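To make the discretization step concrete, the following is a minimal sketch of turning a continuous character in a taxon-by-character matrix into discrete states. It is not the MatrixConverter implementation or the ETC output format: the taxon names, the "leaf length" values, the breakpoints, and the midpoint rule are all hypothetical choices made for illustration.

```python
from bisect import bisect_right

# Hypothetical taxon-by-character matrix. The values are made up for
# illustration and are not data from the paper. Each cell holds a
# (min, max) measurement range in centimeters.
matrix = {
    "taxon_a": {"leaf length": (0.5, 1.5)},
    "taxon_b": {"leaf length": (3.0, 6.0)},
    "taxon_c": {"leaf length": (8.0, 12.0)},
}


def discretize(value_range, breakpoints):
    """Map a (min, max) range to a discrete state using its midpoint.

    State k means the midpoint falls in the k-th interval defined by the
    sorted breakpoints (state 0 lies below the first breakpoint).
    The midpoint rule is an assumption, not the paper's method.
    """
    low, high = value_range
    midpoint = (low + high) / 2
    return bisect_right(breakpoints, midpoint)


def discretize_matrix(matrix, breakpoints_by_character):
    """Apply discretize() to every cell, using per-character breakpoints."""
    result = {}
    for taxon, characters in matrix.items():
        result[taxon] = {
            name: discretize(rng, breakpoints_by_character[name])
            for name, rng in characters.items()
        }
    return result


# User-chosen (hypothetical) breakpoints: short < 2 cm <= medium < 7 cm <= long.
states = discretize_matrix(matrix, {"leaf length": [2.0, 7.0]})
print(states)
```

In practice the breakpoints would be chosen by the user after inspecting the extracted values, which is the evaluation role the abstract assigns to MatrixConverter.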
