Mining Biological Data on the Cloud - A MapReduce Approach

Over the last decades, bioinformatics has emerged as a field of research that has led to the development of a wide variety of applications. The primary goal of bioinformatics is to detect useful knowledge hidden in large volumes of biological and biomedical data, to gain greater insight into the relationships within these data and, therefore, to enhance the discovery and comprehension of biological processes. To achieve this, a great number of text mining techniques have been developed that efficiently manage such data and uncover meaningful patterns and correlations in biological and biomedical data repositories. However, as the volume of data grows rapidly, these techniques cannot cope with the resulting computational burden, because they run only in centralized environments. Consequently, a shift toward distributed and parallel solutions is indispensable. In this work, we propose an efficient and scalable solution, built on the MapReduce framework, for mining and analyzing biological and biomedical data.
