Finding Top-k Covering Irreducible Contrast Sequence Rules for Disease Diagnosis

Diagnostic genes are commonly used to distinguish between disease phenotypes. Most existing methods for finding diagnostic genes rely on the discriminative power of either individual genes or combinations of genes. However, both approaches ignore the common expression trends among genes. In this paper, we devise a novel type of sequence rule, namely top-k irreducible covering contrast sequence rules (TopkIRs for short), which helps to build a highly accurate sample classifier. Furthermore, we propose an algorithm called MineTopkIRs to discover TopkIRs efficiently. Extensive experiments on synthetic and real datasets show that MineTopkIRs is significantly faster than previous methods and achieves higher classification accuracy. Additionally, many of the diagnostic genes discovered provide new insight into disease diagnosis.
