Thousands of samples are needed to generate a robust gene list for predicting outcome in cancer.

Predicting the prognosis and metastatic potential of a cancer at the time of discovery is a major challenge in current clinical research. Numerous recent studies have searched for gene expression signatures that outperform the traditionally used clinical parameters in outcome prediction. Finding such a signature would spare many patients the suffering and toxicity of adjuvant chemotherapy, which current protocols give them even though they do not need it. A reliable set of predictive genes would also contribute to a better understanding of the biological mechanism of metastasis. Several groups have published lists of predictive genes and reported good predictive performance based on them. However, the gene lists obtained by different groups for the same clinical types of patients differed widely and had only very few genes in common. This lack of agreement raised doubts about the reliability and robustness of the reported predictive gene lists, and the main source of the problem was shown to be the small number of samples used to generate them. Here, we introduce a previously undescribed mathematical method, probably approximately correct (PAC) sorting, for evaluating the robustness of such lists. For several published data sets, we calculate the number of samples needed to achieve any desired level of reproducibility. For example, to achieve a typical overlap of 50% between two predictive lists of genes, breast cancer studies would need the expression profiles of several thousand early discovery patients.
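The dependence of list overlap on sample size can be illustrated with a small Monte Carlo sketch. This is not the PAC sorting method itself, which is analytical. It is only a simulation under assumed conditions: synthetic Gaussian expression data, a continuous outcome, genes ranked by absolute Pearson correlation with outcome, and hypothetical values for the number of genes, the number of informative genes, the effect sizes, and the list length `k`. The function names `top_genes` and `mean_overlap` are introduced here for illustration only.

```python
import numpy as np


def top_genes(expr, outcome, k):
    """Indices of the k genes with the largest |Pearson correlation| to outcome."""
    x = expr - expr.mean(axis=0)
    y = outcome - outcome.mean()
    corr = (x * y[:, None]).sum(axis=0) / (
        np.sqrt((x ** 2).sum(axis=0)) * np.sqrt((y ** 2).sum())
    )
    return set(np.argsort(-np.abs(corr))[:k].tolist())


def mean_overlap(n_samples, n_genes=2000, n_informative=200,
                 max_effect=0.5, k=70, n_trials=5, seed=0):
    """Average fractional overlap of two top-k lists built from independent cohorts.

    All defaults are hypothetical choices for the simulation, not values
    taken from any published data set.
    """
    rng = np.random.default_rng(seed)
    # Assumed effect sizes: decrease linearly from max_effect to 0.
    effects = np.linspace(max_effect, 0.0, n_informative)
    overlaps = []
    for _ in range(n_trials):
        lists = []
        for _ in range(2):  # two independent cohorts of equal size
            outcome = rng.standard_normal(n_samples)
            expr = rng.standard_normal((n_samples, n_genes))
            expr[:, :n_informative] += outcome[:, None] * effects[None, :]
            lists.append(top_genes(expr, outcome, k))
        overlaps.append(len(lists[0] & lists[1]) / k)
    return float(np.mean(overlaps))


if __name__ == "__main__":
    for n in (50, 200, 1000, 2000):
        print(n, round(mean_overlap(n), 2))
```

Under these assumptions, the overlap between the two lists grows with the number of samples per cohort, which is the qualitative behavior the abstract describes. The absolute numbers depend entirely on the assumed effect sizes and should not be read as estimates for real data.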
