Between them, these two categories of annotated open reading frames often represent more than half of the potential protein-coding regions of a genome.
Genome annotation efforts emphasize identifying protein-coding regions [37].
More interesting is how the protein-coding regions changed.
It is still not fully understood why the negative selective pressure on these regions is so much stronger than the selection in protein-coding regions.
For example, a recent genomic analysis of 13,799 human genes revealed that approximately 4% harbored retrotransposon sequences within protein-coding regions [13].
Such matches might reveal unpredicted protein-coding regions within the genome.
A more liberal definition is that orphan genes are protein-coding regions that have no recognizable homolog in distantly related species.
We assumed the protein-coding region to be the longest ORF on the forward strand, and required it to span at least 40% of the cDNA length.
When the trinucleotide repeat is present within the protein-coding region, the repeat expansion leads to production of a mutant protein with gain of function.
The human genome, for example, comprises less than 2% protein-coding regions, with the remainder being various types of non-coding DNA (especially transposable elements).
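
The longest-ORF criterion mentioned above (the longest ORF on the forward strand, spanning at least 40% of the cDNA length) could be sketched roughly as follows. This is only an illustration under stated assumptions: the source does not define an ORF, so it is taken here to run from an ATG to the first in-frame stop codon, and the function names `longest_forward_orf` and `coding_region` and the parameter `min_fraction` are hypothetical.

```python
# Assumption: an ORF starts at ATG and ends at the first in-frame stop codon.
STOP_CODONS = {"TAA", "TAG", "TGA"}


def longest_forward_orf(cdna):
    """Return (start, end) of the longest ATG..stop ORF on the forward
    strand (0-based, end exclusive, stop codon included), or None."""
    seq = cdna.upper()
    best = None
    for frame in range(3):  # the three forward reading frames
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i  # open an ORF at the first ATG seen
            elif start is not None and codon in STOP_CODONS:
                end = i + 3
                if best is None or end - start > best[1] - best[0]:
                    best = (start, end)
                start = None  # close the ORF and keep scanning
    return best


def coding_region(cdna, min_fraction=0.4):
    """Return the longest forward ORF only if it spans at least
    min_fraction of the cDNA length; otherwise return None."""
    orf = longest_forward_orf(cdna)
    if orf is None:
        return None
    start, end = orf
    if end - start < min_fraction * len(cdna):
        return None
    return orf


# Hypothetical usage with toy sequences.
accepted = coding_region("CC" + "ATG" + "GCT" * 4 + "TAA" + "GG")
rejected = coding_region("ATGTAA" + "C" * 30)
```

In this sketch an ORF without an in-frame stop codon is ignored; a real pipeline might treat such open-ended ORFs differently.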