Algebraic stability indicators for ranked lists in molecular profiling

MOTIVATION: We propose a method for studying the stability of biomarker lists obtained from functional genomics studies. Resampling methods are commonly adopted to tune and evaluate marker-based diagnostic and prognostic systems in order to prevent selection bias. Such caution promotes honest estimation of class prediction, but it yields alternative sets of solutions. In microarray studies, the differences between lists may be bewildering, partly because of the presence of modules of functionally related genes. Methods for assessing stability help in understanding how the markers depend on the data or on the type of predictor, and they help in selecting solutions.

RESULTS: A computational framework for comparing sets of ranked biomarker lists is presented. Notions and algorithms are based on concepts from permutation group theory. We introduce several algebraic indicators and metric methods for symmetric groups, including the Canberra distance, a weighted version of Spearman's footrule. We also consider distances between partial lists and the aggregation of sets of lists into an optimal list based on voting theory (Borda count). The stability indicators are applied in practical situations to several synthetic, cancer microarray and proteomics datasets. The issues addressed are predictive classification, presence of modules, comparison of alternative biomarker lists, outlier removal, control of selection bias by randomization techniques and enrichment analysis.

AVAILABILITY: Supplementary Material and software are available at http://biodcv.fbk.eu/listspy.html
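As a rough illustration of two of the notions named above, the following sketch computes a Canberra distance between two complete ranked lists and a simple Borda aggregation of several lists. It is a minimal example under stated assumptions, not the authors' implementation: the function names (`ranks`, `canberra_distance`, `borda_aggregate`) are hypothetical, ranks are taken as 1-based positions, all lists are assumed to contain the same items, and Borda scores are assumed to be plain rank sums with alphabetical tie-breaking. The definitions used in the paper, in particular for partial lists, may differ.

```python
from typing import Dict, List, Sequence


def ranks(ordered: Sequence[str]) -> Dict[str, int]:
    """Map each item to its 1-based position in the ranked list."""
    return {item: pos for pos, item in enumerate(ordered, start=1)}


def canberra_distance(list_a: Sequence[str], list_b: Sequence[str]) -> float:
    """Canberra distance between two complete rankings of the same items.

    Each rank difference |r_a - r_b| is divided by (r_a + r_b), so
    disagreements near the top of the lists weigh more than those
    near the bottom (a weighted form of Spearman's footrule).
    """
    ra, rb = ranks(list_a), ranks(list_b)
    if set(ra) != set(rb):
        raise ValueError("this sketch assumes both lists rank the same items")
    return sum(abs(ra[g] - rb[g]) / (ra[g] + rb[g]) for g in ra)


def borda_aggregate(lists: Sequence[Sequence[str]]) -> List[str]:
    """Aggregate complete rankings by summing ranks (assumed Borda variant).

    Items with the smallest total rank come first; ties are broken
    alphabetically so the output is deterministic.
    """
    totals: Dict[str, int] = {}
    for ordered in lists:
        for item, r in ranks(ordered).items():
            totals[item] = totals.get(item, 0) + r
    return sorted(totals, key=lambda g: (totals[g], g))


if __name__ == "__main__":
    # Toy gene lists, purely illustrative.
    l1 = ["g1", "g2", "g3"]
    l2 = ["g2", "g1", "g3"]
    l3 = ["g1", "g3", "g2"]
    print(canberra_distance(l1, l2))
    print(borda_aggregate([l1, l2, l3]))
```

In this toy case a swap of the top two items contributes 1/3 per swapped item, whereas the same swap at positions 99 and 100 would contribute only 1/199 per item, which is the weighting behaviour that distinguishes the Canberra distance from the unweighted footrule.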
