Novel genes in genomes

From Shiu Lab

Background

  • Transcriptome sequencing and whole-genome tiling array studies have revealed significant levels of expression in intergenic regions in human (Kapranov et al. 2002), (Rinn et al. 2003), (Bertone et al. 2004), fly (Stolc et al. 2004), A. thaliana (Yamada et al. 2003), (Stolc et al. 2005b), and rice (Stolc et al. 2005a), (Li et al. 2006). These studies demonstrate the presence of genic sequences in unannotated “intergenic” regions.
    • However, in most cases it remains an open question whether these transcripts represent novel protein-coding or RNA genes. Some studies assumed that these sequences represent non-coding RNA genes because they had not been annotated as coding regions. In several other cases, coding regions were designated somewhat arbitrarily, using an ad hoc length threshold, without attempting to distinguish coding from non-coding sequences experimentally or computationally with gene finders.
  • Most ab initio gene prediction programs distinguish coding sequences (CDS) from non-coding sequences (NCDS) by their differences in nucleotide composition, intron splice sites, promoters, translational start/stop sites, and polyadenylation signals. These signals are generally integrated to evaluate the coding likelihood of a sequence (Claverie 1997), (Burge and Karlin 1998), (Brent and Guigo 2004).
    • The integration of multiple criteria decreases the chance that false exons are predicted as true (low false positive rate) but likely increases the chance that true exons are missed (high false negative rate) (Claverie 1997). False negative prediction is particularly serious for smaller CDSs (≤ 300 nucleotides) because it is difficult to distinguish the relatively few biologically meaningful sequences from the very large pool of short ORFs (Basrai et al. 1997), (Wang et al. 2003).
    • Nonetheless, small protein genes (referred to as sORFs) include several classes of important genes. In yeast, sORFs include mating pheromones, proteins involved in energy metabolism, proteolipids, chaperonins, stress proteins, transporters, transcriptional regulators, nucleases, ribosomal proteins, thioredoxins, and metal ion chelators (Basrai et al. 1997). In addition, many yeast sORFs missed by ab initio prediction but supported by evidence of expression have been shown to be translated and functional (Ghaemmaghami et al. 2003), (Huh et al. 2003), (Kastenmayer et al. 2006).
    • In human, 997 known genes from Ensembl (Hubbard et al. 2002) are sORFs, and 593 of them are annotated by RefSeq (Pruitt et al. 2005) or Swiss-Prot (Boeckmann et al. 2005).
    • In A. thaliana, relatively little is known about sORFs, but a number of small, secreted proteins that likely act as receptor ligands have been identified, not by gene-finding programs but by similarity searches and/or functional studies (Cock and McCormick 2001), (Butenko et al. 2003).
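
To give a feel for why short ORFs form such a large pool, the following minimal sketch enumerates ATG-to-stop ORFs of at most 300 nucleotides in the three forward reading frames of a sequence. The function name `short_orfs`, the 30-nucleotide lower bound, and the forward-strand-only scan are assumptions made for illustration; they are not taken from any of the studies cited above.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def short_orfs(seq, min_len=30, max_len=300):
    """Return (start, end, frame) for ATG...stop ORFs on the forward strand.

    Coordinates are 0-based and end-exclusive; lengths include the stop codon.
    min_len is an illustrative assumption; max_len follows the <= 300 nt
    definition of small CDSs used in the text.
    """
    seq = seq.upper()
    found = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i                      # first ATG opens the ORF
            elif start is not None and codon in STOP_CODONS:
                length = i + 3 - start
                if min_len <= length <= max_len:
                    found.append((start, i + 3, frame))
                start = None                   # look for the next ORF
    return found

# Toy example: one 36-nt ORF embedded in flanking sequence.
example = "CC" + "ATG" + "GCT" * 10 + "TAA" + "GG"
print(short_orfs(example))
```

Running such a scan on both strands of a whole genome typically returns far more candidates than there are real small genes, which is why additional evidence (composition, expression, conservation) is needed to separate the two.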

Objectives

  • Noting the relatively high false negative rate of current gene-finding algorithms and the difficulty of identifying small protein genes, we are developing a prediction pipeline based on:
    1. The hexamer composition bias, which has been established as the best measure for distinguishing CDS from NCDS (Farber et al. 1992), (Fickett and Tung 1992).
    2. Expression of intergenic regions, interrogated with genome tiling arrays.
    3. Signature of purifying selection: since most functional genes are subject to purifying selection, a novel gene is expected to show stronger selective constraint on nonsynonymous sites than on synonymous sites (Li 1997), (Makalowski and Boguski 1998).
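
As a rough illustration of the first criterion, hexamer composition bias can be scored as a log-odds ratio between hexamer frequencies in known coding and non-coding training sequences. The sketch below is a generic version of that idea and is not the Coding Index described on this page; the function names, the pseudocount, and the choice to average in-frame hexamer scores are all assumptions.

```python
import math
from collections import Counter
from itertools import product

def hexamer_counts(seqs, step):
    """Count hexamers in each sequence, sliding by `step` nucleotides."""
    counts = Counter()
    for s in seqs:
        s = s.upper()
        for i in range(0, len(s) - 5, step):
            h = s[i:i + 6]
            if set(h) <= set("ACGT"):          # skip ambiguous bases
                counts[h] += 1
    return counts

def log_odds_table(coding, noncoding, pseudocount=1.0):
    """log(P(hexamer | coding) / P(hexamer | non-coding)) for all 4096 hexamers."""
    c = hexamer_counts(coding, step=3)         # in-frame hexamers only
    n = hexamer_counts(noncoding, step=1)      # no frame in non-coding sequence
    c_total = sum(c.values()) + pseudocount * 4096
    n_total = sum(n.values()) + pseudocount * 4096
    table = {}
    for h in map("".join, product("ACGT", repeat=6)):
        p_c = (c[h] + pseudocount) / c_total
        p_n = (n[h] + pseudocount) / n_total
        table[h] = math.log(p_c / p_n)
    return table

def coding_score(seq, table):
    """Mean log-odds of the in-frame hexamers of seq; > 0 leans coding."""
    seq = seq.upper()
    scores = [table[seq[i:i + 6]]
              for i in range(0, len(seq) - 5, 3)
              if seq[i:i + 6] in table]
    return sum(scores) / len(scores) if scores else 0.0

# Toy training sets (hypothetical, far too small for real use).
table = log_odds_table(["ATGGCTGCTGCTGCTGCTGCTGCTTAA"],
                       ["ATATATATATATATATATATATAT"])
print(coding_score("GCTGCTGCTGCT", table) > 0)   # resembles the coding set
print(coding_score("ATATATATATAT", table) < 0)   # resembles the non-coding set
```

In practice the training sets would be annotated CDSs and intergenic or intronic sequences from the genome of interest, and the score threshold would be chosen on known genes.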

Findings

  • We devised a method (Coding Index, CI) that simplifies the approach of current gene finders by relying solely on the composition bias of most coding sequences.
    • Applied to 114 yeast sORFs with evidence of expression at the protein level, the method correctly predicts 98.
    • Applied to 20 Arabidopsis sORFs that are known genes, the method correctly predicts 19.
    • In the Arabidopsis thaliana genome, we identified 3,274 sORFs based on the CI measure.
  • To determine whether these Arabidopsis sORFs represent functional genes, we evaluated each sORF for evidence of transcription or evolutionary conservation.
    • We found that 1,589 sORFs are likely expressed in at least one experimental condition of the Arabidopsis tiling array data.
  • In addition, the evolutionary conservation of each Arabidopsis sORF was examined within Arabidopsis thaliana or between Arabidopsis and four other plants with complete or partial genome sequences available. Of 522 sORFs with readily identifiable homologous sequences, 158 are subject to purifying selection.
  • Therefore, 1,661 sORFs have evidence of either transcription or purifying selection and likely belong to novel protein-coding genes in the Arabidopsis genome. Some of them (13.2%) are possibly missing exons of known genes. However, the remaining 1,275 sORFs (82.8%) likely belong to novel genes in the Arabidopsis genome.
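
The counts reported above can be related by simple arithmetic. The sketch below computes the fraction of known sORFs recovered by the CI and, by inclusion-exclusion, the number of sORFs that would have both kinds of evidence. It assumes that the 1,589 expressed sORFs, the 158 sORFs under purifying selection, and the 1,661 sORFs with either kind of evidence are all subsets of the same 3,274 CI-predicted sORFs; the helper names are hypothetical.

```python
def fraction(part, total):
    """Fraction of `total` accounted for by `part`."""
    return part / total

def implied_overlap(n_a, n_b, n_union):
    """Inclusion-exclusion: |A and B| = |A| + |B| - |A or B|."""
    return n_a + n_b - n_union

# Recovery of known sORFs by the CI measure (counts from the text).
print(f"yeast: {fraction(98, 114):.1%}")        # 98 of 114
print(f"Arabidopsis: {fraction(19, 20):.1%}")   # 19 of 20

# sORFs with both expression and purifying-selection evidence,
# under the same-set assumption stated above.
print(implied_overlap(1589, 158, 1661))

# Share of the 3,274 CI-predicted sORFs with at least one line of evidence.
print(f"{fraction(1661, 3274):.1%}")
```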
