A transparent collaborative framework for efficient data analysis and knowledge annotation on the web

High-throughput experiments and ultrascale computing generate scientific data of growing size and complexity. These trends challenge traditional data analysis environments, most of which are based on scripting languages such as R, MATLAB or IDL, in a number of ways. To address some of these challenges, this research proposes a framework with the overarching goal of enabling large-scale, high-performance data analytics and collaborative knowledge annotation over the Web. The proposed framework has three major components, which parallel the three core steps of the knowledge discovery cycle. (1) For the first step, defining the data analysis pipeline, the research designs and implements a Web-enabled, interactive and collaborative R-based statistical environment. The component implements a memory management system that minimizes memory requirements, thereby enabling multi-user scalability. To the best of our knowledge, this is the first Web-enabled R system that supports interactive remote access to R servers and enables users to share data, results and analysis sessions. (2) For the second step, executing the data analysis pipeline, the research investigates and proposes a transparent, low-overhead means of executing external parallel codes written in compiled languages from within R, thus seamlessly bridging two code development paradigms: efficient, compiled parallel codes and high-abstraction, easy-to-use scripting codes. This component contains three elements: a transparent bidirectional translation of data objects between R and compiled languages such as C, C++ and Fortran; seamless integration of external parallel codes; and automatic parallelization of data-parallel computations in hybrid multi-core and multi-node execution environments.
(3) For the third step, annotating the predictive knowledge derived from community analysis pipelines, the research explores an environment for semantically rich, structured and queryable annotation of facts, relationships between those facts, and complex events reported in scientific literature. The social networking nature of this component allows the community to improve the predictions as well as generate new, higher-level inferences, thus filling in the gaps in the community's understanding of physical phenomena. The environment offers mechanisms for streamlining the annotated and curated knowledge into distributed public databases, thus enabling a feedback loop into the database-publication cycle that allows scientists to make connections between data-driven predictions and published evidence.
