Towards a Reference Architecture with Modular Design for Large-scale Genotyping and Phenotyping Data Analysis: A Case Study with Image Data

With the rapid advancement of computing technologies, scientific research communities have come to rely extensively on cloud-based software tools and applications. Cloud-based applications let users work from a web browser, relieving them of the need to install software in their desktop environment. For example, Galaxy, GenAP, and iPlant Collaborative are popular cloud-based systems for scientific workflow analysis in the domain of plant Genotyping and Phenotyping. These systems are used for conducting research, devising new techniques, and sharing computer-assisted analysis results among collaborators. Over time, researchers need to integrate new workflows or pipelines, tools, and techniques with the base system. Moreover, large-scale data need to be processed in a timely manner for the analysis to be effective. Big Data technologies have recently emerged to facilitate large-scale data processing on commodity hardware. Among the above-mentioned systems, only GenAP uses Big Data technologies, and only for specific cases. The structure of such a cloud-based system is highly variable and complex. Compared to traditional business or service-oriented systems, software architects and developers must deal with very different properties and challenges during the development and maintenance phases. Recent studies report that software engineers and data engineers face challenges in developing analytic tools that support large-scale and heterogeneous data analysis. Unfortunately, software researchers have given little attention to devising well-defined methodologies and frameworks for the flexible design of cloud systems for the Genotyping and Phenotyping domain. More effective design methodologies and frameworks are therefore urgently needed for developing cloud-based Genotyping and Phenotyping analysis systems that also support large-scale data processing.
In this thesis, we conduct several studies to devise a stable reference architecture and modularity model for software developers and data engineers in the domain of Genotyping and Phenotyping. In the first study, we analyze the architectural changes of existing candidate systems to identify stability issues. We then extract the architectural patterns of the candidate systems and propose a conceptual reference architectural model. Finally, we present a case study on the modularity of computation-intensive tasks as an extension of data-centric development. We show that the data-centric modularity model is at the core of the flexible development of a Genotyping and Phenotyping analysis system. Our proposed model and the case study with thousands of images provide a useful knowledge base for software researchers, developers, and data engineers working on cloud-based Genotyping and Phenotyping analysis system development.
