LAITOR4HPC: A text mining pipeline based on HPC for building interaction networks

Background: The number of published full-text articles has increased dramatically. Text mining tools are an essential means of building biological networks, updating databases and annotating new pathways. PESCADOR is a web server based on the LAITOR and NLProt text mining tools that retrieves protein-protein co-occurrences in tabular format and adds a network schema. Here we present an HPC-oriented version of PESCADOR's native text mining tool, renamed LAITOR4HPC, which aims to process an unrestricted number of abstracts in a short time in order to enrich available networks, build new ones and possibly highlight whether fields of research have been exhaustively studied.

Results: By taking advantage of parallel HPC infrastructure, the full collection of MEDLINE abstracts available until June 2017 was analyzed in 6 days, compared with an estimated 2 years for the original online implementation to process the same data. Additionally, three case studies illustrate possible uses of LAITOR4HPC. The first case study targeted soybean to obtain an overview of published co-occurrences in a single organism, retrieving 15,788 proteins in 7,894 co-occurrences. In the second case study, a target gene family was searched across many organisms by analyzing 15 species under biotic stress; most co-occurrences involved Arabidopsis thaliana and Zea mays. The third case study concerned the construction and enrichment of an available pathway. A. thaliana was chosen for further analysis, and its defensin pathway was enriched with additional signaling and regulation molecules, showing how they respond to each other in modulating this complex plant defense response.

Conclusions: LAITOR4HPC can be used for efficient text-mining-based construction of biological networks derived from big data sources such as MEDLINE abstracts. Run time and input size limits depend on the resources available at the HPC facility. LAITOR4HPC is flexible enough to support different approaches and data volumes, targeted at an organism, a subject or a specific pathway. Additionally, it delivers comprehensive results in which interactions are classified into four types according to their reliability.
