Paper information - Has the yo‐yo stopped? An assessment of human protein‐coding gene number

Since the identification of ∼ 25 000 proteins from the draft human genome assembly in 2001, estimates of the total have oscillated between 30 000 and 70 000. The recently announced genome closure has not generated a consensus gene count, despite this being a key parameter for many areas of biology, including drug target discovery and characterization of the human proteome. Contrary to earlier predictions of constitutive under‐detection for eukaryotic genes, the latest model organism updates have produced only minor increases for the worm, while the fly and yeast gene numbers have decreased. The postdraft, precompletion interval has produced large increases in human transcript coverage, continuous improvements in genome assembly and refinements in automated genomic annotation. Notably, these enhancements have resulted in an Ensembl human protein‐coding gene number of 22 184, a decrease of 1862 since the first release. Longitudinal database surveys indicate that redundancy‐reduced human mRNA and protein collections are flattening out at ∼ 28 000, although Ensembl maps ∼ 20 000 known sequences. Observations suggest that high‐throughput cloning projects are predominantly extending known genes or sampling new splice forms, and that novel protein discovery has slowed to a trickle. The hypothesis that substantial numbers of short proteins remain experimentally and computationally undetected in mammalian genomes is supported neither by sequence data nor by the extensive homology between mouse and human proteins. Extrapolating from the aggregated independent annotations for complete transcripts on seven completed human chromosomes gives ∼ 25 000 genes. The inclusion of partial putative genes would increase this to above 30 000, but recent data suggest that these represent predominantly nonprotein‐coding transcripts. Mass spectrometry‐based proteomics has already verified more than 10% of human genes but has not identified significant numbers of unpredicted proteins. The available data are thus converging to a basal protein‐coding gene number well below 30 000, which could even be as low as 25 000.
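
As a quick check of the arithmetic behind the Ensembl figures quoted above, the following Python sketch derives the first‐release gene count implied by the current count (22 184) and the stated decrease (1862), together with the relative size of that decrease. Only these two input numbers come from the abstract; the constant and function names are illustrative assumptions and are not taken from the paper or from Ensembl.

```python
# Figures quoted in the abstract.
ENSEMBL_CURRENT = 22_184   # Ensembl human protein-coding gene count
ENSEMBL_DECREASE = 1_862   # decrease since the first Ensembl release


def first_release_count(current: int, decrease: int) -> int:
    """Gene count of the first release implied by the current count and the decrease."""
    return current + decrease


def percent_decrease(current: int, decrease: int) -> float:
    """Decrease expressed as a percentage of the implied first-release count."""
    return 100.0 * decrease / first_release_count(current, decrease)


if __name__ == "__main__":
    first = first_release_count(ENSEMBL_CURRENT, ENSEMBL_DECREASE)
    pct = percent_decrease(ENSEMBL_CURRENT, ENSEMBL_DECREASE)
    print(f"Implied first-release count: {first}")
    print(f"Relative decrease: {pct:.1f}%")
```

Under these figures the implied first‐release count is 24 046, so the reported drop corresponds to a reduction of roughly 7.7%.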

C. Southan