A Gene-centric Human Proteome Project

The success of the Human Genome Project (1, 2) has provided a blueprint for the gene-encoded proteins potentially active in the hundreds of cell types that make up the human body. Yet we still have limited knowledge of the majority of the approximately 20,000 protein-coding human genes discovered through the genome project (3, 4). At present, about 8,000 (38%) of these genes lack experimental evidence at the protein level (UniProt), and for many others there is very little information on protein abundance, distribution, subcellular localization, and function. The diagnostic, prognostic, and therapeutic value of understanding human biology at the protein level argues for a systematic effort to map the human proteome. The proteomic space generated from these gene products is enormous, including up to a million different protein molecules derived by combinatorial recombination of DNA (immunoglobulins and T-cell receptors), alternative splicing of RNAs, and numerous protein modifications of various types that vary with time and with physiological, pathological, and pharmacological perturbations. Hochstrasser (5) therefore recently argued for a protein-centric human proteome project, driven by mass spectrometry technology and focusing on the protein perturbations caused by human diseases. Our goal is to define clear endpoints of a Human Proteome Project, combining the strengths of complementary technology platforms. We therefore propose a gene-centric approach to generate a human proteome map with an “information backbone” about the proteins expressed from each gene locus and to make this information publicly available without restrictions, as was done with the genome sequence data, thereby facilitating in-depth studies to understand human biology and diseases.
In further analogy to the genome project, the gene-centric human proteome map can be complemented with in-depth studies of protein variability relevant to life stages and various diseases. Reasonable endpoints of such a Human Proteome Project would be feasible within a limited time period and achievable without major paradigm shifts in technology. Taking into account recent major advances in mass spectrometry (6) and immunobased methods (7), we propose a systematic three-part approach to ensure that, for each predicted protein-coding gene, at least one of its major representative proteins will be characterized in the context of its major anatomical sites of expression, its abundance, and its interacting protein partners: