Format of pseudogene annotation files (.gff)
The pseudogene annotation files are tab-delimited, multi-field text files. Each file has a header line describing the content in each column. The columns from left to right are:

  1. ID: unique identifier for each processed pseudogene, in the format chr$a_$b.$c, where $a is the chromosome name, $b is the Swissprot/TrEMBL protein accession number, and $c is the sequential numbering of the pseudogene that matches protein $b on chromosome $a. Example: chr1_P02404.1.
  2. Short_ID: short version of the pseudogene ID, in the format $a_$b, where $a is the Swissprot protein accession number and $b is the sequential numbering of the pseudogene that matches protein $a in the whole genome. Example: P02404_1.
  3. Chr: chromosome name.
  4. Chrom_start: starting coordinate of the pseudogene on the chromosome, based on Build 28 of the GoldenPath assembly.
  5. Chrom_end: end coordinate of the pseudogene on the chromosome.
  6. Chrom_strand: "-" or "+"
  7. Cytogenic_band: chromosomal band as predicted by Ensembl. Example: "1p36.33".
  8. Query_protein: accession number of the closest matching protein in Swissprot/TrEMBL.
  9. Query_start: starting amino acid number on the query protein that the pseudogene matches.
  10. Query_end: end amino acid number on the query protein that the pseudogene matches.
  11. Query_len: sequence length of the query protein in column 8.
  12. Completeness: sequence completeness of the pseudogene compared with the query protein.
  13. E-value: Expect value of the pseudogene in the TBLASTX search.
  14. AA_ident: amino acid sequence identity between the pseudogene and query protein.
  15. DNA_ident: nucleotide sequence identity between the pseudogene and the query protein, coding region only. Some query proteins don't have coding sequence available.
  16. Polya: "0" or "1" or "2" or "3".
    • "0": no polyA tail (> 30 A in a 50 bp window) detected for the pseudogene.
    • "1": has a polyA tail and a polyadenylation signal within 50 bp of the beginning of the tail.
    • "2": has a polyA tail and a polyadenylation signal within 50-100 bp of the beginning of the tail.
    • "3": has a polyA tail but no polyadenylation signal detected.

  17. Disable: "0" or "d" or "D". "0" indicates no disablement (only for RP pseudogenes). "d" indicates disablement in a region of low sequence identity. "D" indicates disablement in a region of high sequence identity.
  18. GC_Pgene: GC content of the pseudogene sequence
  19. GC_Isochore: GC content of the 100K bp window on the chromosome.
  20. Isochore_class: isochore class where the pseudogene resides: L1, L2, H1, H2, or H3.
  21. Kimura_Distance: evolutionary distance of the pseudogene sequence from the present-day sequence.
  22. Class: "PSSD1" indicates "true" processed pseudogenes. "PSSD2" indicates putative processed pseudogenes.
  23. Comment: cytoplasmic ribosomal protein pseudogenes are labeled as "RP".
  24. Protein_name: "Protein name" field of the query protein in the Swissprot/TrEmbl.
  25. Gene_name: "Gene name" field of the query protein in the Swissprot/TrEmbl.
  26. MIM: Entry of the query protein in the MIM database (Mendelian Inheritance in Man).
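
The column layout above can be read with a short script. The following is only a minimal sketch in Python, under two assumptions that the description does not guarantee: that the names on the header line match the column names listed above, and that neither the chromosome name nor the accession number in the ID contains "_" or ".". The names parse_id, read_annotations and POLYA_CODES are made up for this illustration.

```python
import csv
import re

# Meaning of the Polya codes (column 16), paraphrased from the list above.
POLYA_CODES = {
    "0": "no polyA tail detected",
    "1": "polyA tail, polyadenylation signal within 50 bp of the tail start",
    "2": "polyA tail, polyadenylation signal within 50-100 bp of the tail start",
    "3": "polyA tail, no polyadenylation signal detected",
}

# ID layout chr$a_$b.$c, for example chr1_P02404.1.
# Assumption: neither the chromosome name nor the accession contains "_" or ".".
ID_PATTERN = re.compile(r"^chr([^_.]+)_([^_.]+)\.(\d+)$")


def parse_id(pseudogene_id):
    """Split an ID into (chromosome name, protein accession, serial number)."""
    match = ID_PATTERN.match(pseudogene_id)
    if match is None:
        raise ValueError("unexpected pseudogene ID: %r" % pseudogene_id)
    chrom, accession, serial = match.groups()
    return chrom, accession, int(serial)


def read_annotations(handle):
    """Yield one dict per pseudogene, keyed by the names on the header line."""
    # QUOTE_NONE leaves any quote characters in free-text fields untouched.
    reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        yield row
```

Given an open annotation file, each row from read_annotations(handle) is a dictionary of strings; if the header line names the first column "ID", parse_id(row["ID"]) recovers the chromosome, accession and serial number, and POLYA_CODES[row["Polya"]] gives a readable description of the polyA status.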


 

Format of pseudogene DNA sequence file (.dna)

These files contain the nucleotide sequences of the annotated processed pseudogenes in multiple-sequence FASTA format.
Each pseudogene entry has 2 lines. The header line begins with ">", followed by a unique pseudogene ID (field 1 in the corresponding .gff annotation file). Other attributes of the pseudogene are also provided on the header line, including "Chrom", "Chrom-start", "Chrom_end", "Strand", "band", "Query_protein", "Query_start", "Query_end", "Query_len", "Class_new", "Comment" and "Short_ID". Definitions of the attributes are given above. The second line is the nucleotide sequence of the pseudogene.
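
A two-line-per-entry file of this kind can be read without a full FASTA library. The sketch below is in Python and rests on one assumption that the description does not state: that the pseudogene ID is separated from the remaining header attributes by whitespace. Because the exact layout of those attributes is not specified here, the rest of the header is returned as an unparsed string. The function name read_dna_records is made up for this illustration.

```python
def read_dna_records(handle):
    """Yield (pseudogene_id, header_rest, sequence) for each 2-line entry."""
    # Skip blank lines and strip line endings.
    lines = (line.rstrip("\r\n") for line in handle if line.strip())
    for header in lines:
        if not header.startswith(">"):
            raise ValueError("expected a header line, got: %r" % header)
        # The sequence is the single line that follows the header.
        sequence = next(lines, None)
        if sequence is None or sequence.startswith(">"):
            raise ValueError("no sequence line after header: %r" % header)
        # Assumption: the ID is the first whitespace-delimited token.
        parts = header[1:].split(None, 1)
        if not parts:
            raise ValueError("header line without an ID")
        pseudogene_id = parts[0]
        header_rest = parts[1] if len(parts) > 1 else ""
        yield pseudogene_id, header_rest, sequence
```

The function takes any iterable of lines, so an open file handle can be passed directly.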

 

Format of pseudogene amino acid sequence file (.fa)

These files contain the predicted amino acid sequences of the annotated processed pseudogenes in multiple-sequence FASTA format.
Each pseudogene entry has 3 lines. The header line begins with ">", followed by a unique pseudogene ID (field 1 in the corresponding .gff annotation file). Other attributes of the pseudogene are also provided on the header line, including "Chrom", "Chrom-start", "Chrom_end", "Strand", "band", "Query_protein", "Query_start", "Query_end", "Query_len", "Class_new", "Comment" and "Short_ID". Definitions of the attributes are given above.
The second line is the amino acid sequence of the query protein; the third line is the predicted amino acid sequence of the pseudogene. Frameshifts are indicated as "\" or "/", stop codons as "X", and gaps as "-".
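
The three-line entries and the marker characters can be handled as in the following Python sketch. It assumes that every entry occupies exactly three non-blank lines, as described above, and it counts markers on the pseudogene line only. The names AlignedEntry, read_fa_entries and count_marks are made up for this illustration.

```python
from collections import namedtuple

# One entry: header text (without ">"), query protein line, pseudogene line.
AlignedEntry = namedtuple("AlignedEntry", ["header", "query", "pseudogene"])


def read_fa_entries(handle):
    """Yield an AlignedEntry for each 3-line entry."""
    lines = [line.rstrip("\r\n") for line in handle if line.strip()]
    if len(lines) % 3 != 0:
        raise ValueError("number of non-blank lines is not a multiple of 3")
    for i in range(0, len(lines), 3):
        header, query, pseudogene = lines[i:i + 3]
        if not header.startswith(">"):
            raise ValueError("expected a header line, got: %r" % header)
        yield AlignedEntry(header[1:], query, pseudogene)


def count_marks(pseudogene_seq):
    """Count frameshift, stop codon and gap markers in a pseudogene line."""
    return {
        # Frameshifts are written as "\" or "/".
        "frameshifts": pseudogene_seq.count("\\") + pseudogene_seq.count("/"),
        # Stop codons are written as "X".
        "stop_codons": pseudogene_seq.count("X"),
        # Gaps are written as "-".
        "gaps": pseudogene_seq.count("-"),
    }
```

For each entry, count_marks(entry.pseudogene) gives a quick tally of the disablements recorded in the predicted sequence.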

 

Occurrences of processed pseudogenes

The file contains multiple tab-delimited fields:

  1. Rank: ranking of the proteins based on number of processed pseudogenes.
  2. Count: number of processed pseudogenes (excluding putative ones) that closely match the protein.
  3. AC: Swissprot/TrEMBL accession number of the protein.
  4. DB: "SWP" or "TREMBL".
  5. DB_Name: Swissprot entry name.
  6. Comment: Ribosomal protein sequences are labeled as "RP".
  7. Protein_len: Sequence length of the protein.
  8. CDS_len: Sequence length of the coding sequence (CDS) of the protein.
  9. CDS_GC: GC content of the coding sequence.
  10. EBI_Name: Description of the protein provided by EBI.
  11. Secondary_AC: secondary accession number of the protein in Swissprot/TrEMBL.
  12. Protein_Name: protein name described in Swissprot/TrEMBL.
  13. Synonyms: alternative names for the protein, provided by Swissprot.
  14. Gene_Name: Associated gene name for the protein.
  15. MIM: Entry in the MIM (Mendelian Inheritance in Man) database.
  16. Key Words: biological keywords for the protein, provided by Swissprot/TrEMBL.
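
As an illustration of how the fields above might be used, the Python sketch below reads the occurrence file by column position and lists the proteins with the most processed pseudogenes, leaving out ribosomal proteins. Whether the file starts with a header line is not stated here, so the has_header parameter is an assumption to be set by the reader. The names OCCURRENCE_FIELDS, read_occurrences and top_non_rp are made up for this illustration.

```python
import csv

# Column names in file order, taken from the list above.
OCCURRENCE_FIELDS = [
    "Rank", "Count", "AC", "DB", "DB_Name", "Comment", "Protein_len",
    "CDS_len", "CDS_GC", "EBI_Name", "Secondary_AC", "Protein_Name",
    "Synonyms", "Gene_Name", "MIM", "Key Words",
]


def read_occurrences(handle, has_header=False):
    """Yield one dict per protein; Count is converted to an integer."""
    # QUOTE_NONE leaves any quote characters in free-text fields untouched.
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    if has_header:
        next(reader, None)  # Assumption: a single header line to skip.
    for fields in reader:
        if not fields:
            continue
        row = dict(zip(OCCURRENCE_FIELDS, fields))
        row["Count"] = int(row["Count"])
        yield row


def top_non_rp(rows, n=10):
    """Return the n rows with the highest Count that are not labeled "RP"."""
    kept = [row for row in rows if row.get("Comment", "") != "RP"]
    return sorted(kept, key=lambda row: row["Count"], reverse=True)[:n]
```

Reading by position avoids depending on the exact spelling of any header names; all fields other than Count are left as strings.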

 

 

 


Updated 12/03/2002, ZL@bioinfo.mbb.yale.edu
Copyright 2002, All Rights Reserved