ENCODE Pseudogenes

Status on 6-Oct

We have pseudogene lists from four research groups (GIS, HAVANA, UCSC, and Yale). These lists are our current starting point; we have agreed to add other pseudogenes later.

          GIS  HAVANA  UCSC  Yale
GIS        46      42    45    39
HAVANA     42     165   104   132
UCSC       45     106   163   105
Yale       39     135   104   167

Roadmap for generating a consensus pseudogene annotation for the ENCODE regions

Step I -- Filter the above lists to remove pseudogenes that overlap current GENCODE coding exons/loci. Pseudogenes that overlap introns or noncoding genes are kept.
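The filtering rule in Step I can be sketched as a simple interval-overlap test. This is only an illustration, not the procedure the groups actually ran: the `Interval` type, the function names, and the use of inclusive coordinates are all assumptions, and the coordinates in the example are made up.

```python
from typing import List, NamedTuple


class Interval(NamedTuple):
    """A hypothetical annotation record; coordinates are assumed inclusive."""
    region: str  # ENCODE region or chromosome name
    start: int
    end: int


def overlaps(a: Interval, b: Interval) -> bool:
    """True if the two intervals lie in the same region and share a base."""
    return a.region == b.region and a.start <= b.end and b.start <= a.end


def filter_pseudogenes(pseudogenes: List[Interval],
                       coding_exons: List[Interval]) -> List[Interval]:
    """Keep only pseudogenes that overlap no coding exon.

    Introns and noncoding genes are simply not passed in as coding_exons,
    so pseudogenes overlapping them are kept.
    """
    return [p for p in pseudogenes
            if not any(overlaps(p, e) for e in coding_exons)]


# Made-up example: the first pseudogene overlaps the exon and is removed.
pseudogenes = [
    Interval("ENm001", 100, 200),
    Interval("ENm001", 500, 600),
    Interval("ENm002", 100, 200),
]
coding_exons = [Interval("ENm001", 150, 180)]
kept = filter_pseudogenes(pseudogenes, coding_exons)
```

The quadratic scan is fine for lists of a few hundred entries like the ones here; a sorted sweep or interval tree would only matter for much larger inputs.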

The filtered pseudogenes follow, i.e., those overlapping exons of Known_genes have been removed (except for the HAVANA list):

          GIS  HAVANA  UCSC  Yale
GIS        45      44    42    38
HAVANA     44     185   113   144
UCSC       42     113   138    97
Yale       38     144    95   156

Step II -- Take the union of the above pseudogenes. Where a pseudogenic region is annotated by more than one group, the union boundaries are the smallest start and the largest end.

  • 222 union pseudogenes.
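A minimal sketch of the Step II union, assuming that "annotated by more than one group" means the annotations overlap by at least one base and that overlaps are merged transitively. The tuple layout `(region, start, end)`, the inclusive coordinates, and the function name are assumptions made for illustration; the example coordinates are made up.

```python
from typing import List, Tuple

Annotation = Tuple[str, int, int]  # (region, start, end), inclusive coordinates


def union_pseudogenes(annotations: List[Annotation]) -> List[Annotation]:
    """Merge overlapping annotations from all groups.

    Each merged region takes the smallest start and the largest end
    of the annotations it absorbs.
    """
    merged: List[Annotation] = []
    # Sorting by (region, start) guarantees the running region already
    # has the smallest start, so only the end needs to be extended.
    for region, start, end in sorted(annotations):
        if merged and merged[-1][0] == region and start <= merged[-1][2]:
            prev_region, prev_start, prev_end = merged[-1]
            merged[-1] = (prev_region, prev_start, max(prev_end, end))
        else:
            merged.append((region, start, end))
    return merged


# Made-up example: the first two annotations overlap and are merged.
pooled = [
    ("ENm001", 150, 300),
    ("ENm001", 100, 200),
    ("ENm001", 400, 500),
    ("ENm002", 120, 180),
]
union = union_pseudogenes(pooled)
```

Pooling the four filtered lists into one input and merging once gives the same result as merging the lists pairwise, which is why the sketch does not track which group an annotation came from.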

Step III -- Assign a parent protein to each pseudogene in the union, using a protein set from UniProt. Pseudogenes without a matching protein are excluded.

  • The protein set -- SPROT and TREMBL.
  • 198 updated pseudogenes with their parent proteins identified.
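The exclusion rule in Step III can be sketched as follows. The roadmap does not say how a parent is chosen among several matching proteins, so picking the highest-scoring hit is an assumption, as are the input layout (pseudogene ID mapped to a list of `(protein_id, score)` hits) and all identifiers in the example.

```python
from typing import Dict, List, Tuple

Hit = Tuple[str, float]  # (protein_id, score); higher score assumed better


def assign_parents(hits: Dict[str, List[Hit]]) -> Dict[str, str]:
    """Map each pseudogene to one parent protein.

    Pseudogenes with no matching protein are left out of the result,
    mirroring the exclusion described in Step III.
    """
    parents: Dict[str, str] = {}
    for pseudogene_id, candidates in hits.items():
        if not candidates:
            continue  # no matching protein: excluded
        best_protein, _score = max(candidates, key=lambda hit: hit[1])
        parents[pseudogene_id] = best_protein
    return parents


# Made-up identifiers and scores for illustration only.
hits = {
    "pg1": [("protA", 80.0), ("protB", 95.5)],
    "pg2": [],
    "pg3": [("protC", 60.0)],
}
parents = assign_parents(hits)
```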

Step IV -- Re-align each pseudogene to its parent protein.

  • Resulting Alignments

Step V -- Update the consensus list of pseudogenes with boundaries derived from the alignments in Step IV.

  • The 198 consensus pseudogenes (they differ from the above in their boundaries).
  • A table showing that these consensus pseudogenes intersect with 43 GIS, 177 HAVANA, 128 UCSC-retro, 146 UCSC-duplicated, and 152 Yale pseudogenes (and 19 GENCODE exons).
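The boundary update in Step V amounts to replacing each union boundary with the genomic extent of its alignment. The sketch below assumes the alignment step yields a genomic start and end per pseudogene, and that an entry with no alignment keeps its old boundaries; both are assumptions, as are the names and the made-up coordinates.

```python
from typing import Dict, Tuple

Region = Tuple[str, int, int]  # (region, start, end)
Extent = Tuple[int, int]       # (aligned_start, aligned_end) on the genome


def update_boundaries(union: Dict[str, Region],
                      aligned: Dict[str, Extent]) -> Dict[str, Region]:
    """Return a copy of the union list with alignment-derived boundaries."""
    updated: Dict[str, Region] = {}
    for pseudogene_id, (region, start, end) in union.items():
        if pseudogene_id in aligned:
            # Take the boundaries from the alignment to the parent protein.
            start, end = aligned[pseudogene_id]
        updated[pseudogene_id] = (region, start, end)
    return updated


# Made-up example: pg1 is trimmed to its aligned extent, pg3 is unchanged.
union_list = {"pg1": ("ENm001", 100, 300), "pg3": ("ENm002", 120, 180)}
alignment_extents = {"pg1": (130, 290)}
consensus = update_boundaries(union_list, alignment_extents)
```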

Step VI -- The updated consensus list of pseudogenes with their assigned parent proteins and new classification (processed or non-processed).

  • Consensus pseudogenes with classification in ENCODE coordinates.
  • Consensus pseudogenes with classification in hg17 chromosome coordinates.
  • Alignments of consensus pseudogenes with their parent proteins.