CBSSD: community-based semantic subgroup discovery

Modern data mining algorithms frequently need to address the task of learning from heterogeneous data, including various sources of background knowledge. A data mining task in which ontologies serve as background knowledge is referred to as semantic data mining. A specific semantic data mining task is semantic subgroup discovery: a rule learning approach that allows ontology terms to be used in subgroup descriptions learned from class-labeled data. This paper presents Community-Based Semantic Subgroup Discovery (CBSSD), a novel approach that advances ontology-based subgroup identification by exploiting the structural properties of induced complex networks related to the studied phenomenon. Following the idea of multi-view learning, which uses different sources of information to obtain better models, the CBSSD approach can leverage different types of nodes of the induced complex network, simultaneously using information from multiple levels of a biological system. The approach was tested on ten data sets consisting of genes related to complex diseases, as well as core metabolic processes. The experimental results demonstrate that the CBSSD approach is scalable, is applicable to large complex networks, and can identify significant combinations of terms that cannot be uncovered by contemporary term enrichment analysis approaches.
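
The general idea of looking for significant combinations of terms within network communities can be illustrated with a minimal sketch. It is not the CBSSD implementation: it assumes the communities have already been detected, scores each conjunction of terms with a plain one-sided hypergeometric test, and omits multiple-testing correction. The gene names, term names, community labels, and the function `enriched_conjunctions` are all hypothetical and chosen only for illustration.

```python
from itertools import combinations
from math import comb


def hypergeom_tail(k, K, n, N):
    """P(X >= k) when drawing n genes out of N, of which K carry the feature."""
    upper = min(K, n)
    hits = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return hits / comb(N, n)


def enriched_conjunctions(communities, annotations, size=2, alpha=0.05):
    """Return (community, terms, k, K, p) for term conjunctions with p <= alpha.

    communities: dict mapping a community label to a set of genes (assumed given)
    annotations: dict mapping each background gene to its set of ontology terms
    size:        number of terms in each conjunction
    """
    genes = set(annotations)
    N = len(genes)
    terms = sorted(set().union(*annotations.values()))
    results = []
    for name, members in communities.items():
        members = members & genes
        n = len(members)
        for combo in combinations(terms, size):
            combo_set = set(combo)
            # Genes annotated with every term of the conjunction.
            covered = {g for g in genes if combo_set <= annotations[g]}
            K = len(covered)
            k = len(covered & members)
            if k == 0:
                continue
            p = hypergeom_tail(k, K, n, N)
            if p <= alpha:
                results.append((name, combo, k, K, p))
    return sorted(results, key=lambda r: r[-1])


# Hypothetical toy data: eight genes, three terms, two communities.
annotations = {
    "g1": {"t1", "t2"}, "g2": {"t1", "t2"},
    "g3": {"t1", "t2"}, "g4": {"t1", "t2"},
    "g5": {"t1"}, "g6": {"t2"}, "g7": {"t3"}, "g8": {"t1", "t3"},
}
communities = {
    "A": {"g1", "g2", "g3", "g4"},
    "B": {"g5", "g6", "g7", "g8"},
}

for name, combo, k, K, p in enriched_conjunctions(communities, annotations):
    print(name, " AND ".join(combo), f"k={k} K={K} p={p:.4f}")
```

In this toy example, calling the function with `size=1` returns no results at the 0.05 level, while the conjunction of `t1` and `t2` is significant for community `A`. A real analysis would also need a correction for the number of conjunctions tested.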
