Format of pseudogene annotation files (.gff)
The pseudogene annotation files are tab-delimited,
multi-field text files. Each file has a header line describing the content in
each column. The columns from left to right are:
- ID: unique identifier for each processed pseudogene in the
format of chr$a_$b.$c where $a is the chromosome name, $b is the Swissprot/TrEMBL
protein accession number, and $c is the sequential numbering of the pseudogene
that matches protein $b on chromosome $a. Example: chr1_P02404.1.
- Short_ID: short version of the pseudogene ID in the format of $a_$b
where $a is the Swissprot protein accession number and $b is the sequential
numbering of the pseudogene that matches protein $a in the whole genome.
Example: P02404_1.
- Chr: chromosome name.
- Chrom_start: starting coordinate of the pseudogene on the
chromosome, based on Build 28 of the GoldenPath assembly.
- Chrom_end: end coordinate of the pseudogene on the chromosome.
- Chrom_strand: "-" or "+"
- Cytogenic_band: chromosomal band as predicted by Ensembl. Example:
"1p36.33".
- Query_protein: accession number of the closest matching protein in
Swissprot/TrEMBL.
- Query_start: starting amino acid number on the query protein that
the pseudogene matches.
- Query_end: end amino acid number on the query protein that the
pseudogene matches.
- Query_len: sequence length of the query protein in column 8.
- Completeness: sequence completeness of the pseudogene compared with
the query protein.
- E-value: Expect value of the pseudogene in the TBLASTX search.
- AA_ident: amino acid sequence identity between the pseudogene and
query protein.
- DNA_ident: nucleotide sequence identity between the pseudogene and
the coding sequence of the query protein (coding region only). Some query
proteins do not have a coding sequence available.
- Polya: "0" or "1" or "2" or "3".
- Disable: "0" or "d" or "D". "0" indicates no disablement (only for
RP pseudogenes). "d" indicates disablement in a region of low sequence
identity. "D" indicates disablement in a region of high sequence identity.
- GC_Pgene: GC content of the pseudogene sequence
- GC_Isochore: GC content of the 100K bp window on the chromosome.
- Isochore_class: isochore class where the pseudogene resides: L1,
L2, H1, H2 or H3.
- Kimura_Distance: evolutionary distance of the pseudogene sequence from
the present-day sequence.
- Class: "PSSD1" indicates "true" processed pseudogenes. "PSSD2"
indicates putative processed pseudogenes.
- Comment: cytoplasmic ribosomal protein pseudogenes are labeled as
"RP".
- Protein_name: "Protein name" field of the query protein in
Swissprot/TrEMBL.
- Gene_name: "Gene name" field of the query protein in
Swissprot/TrEMBL.
- MIM: Entry of the query protein in the MIM database (Mendelian
Inheritance in Man).
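
The column layout above can be read with a few lines of Python. The following is a minimal sketch, not the authors' own tooling. It assumes that the names on the header line match the column names listed above, and that accession numbers contain neither "_" nor ".". The function names and the sample row (including its coordinates) are hypothetical and exist only for illustration.

```python
import csv
import io
import re

# Pattern for IDs of the form chr$a_$b.$c, e.g. "chr1_P02404.1".
# Assumption: the accession ($b) contains neither "_" nor ".".
ID_PATTERN = re.compile(
    r"^chr(?P<chrom>.+)_(?P<accession>[^_.]+)\.(?P<index>\d+)$"
)


def parse_id(pseudogene_id):
    """Split an ID into (chromosome name, accession, per-chromosome index)."""
    match = ID_PATTERN.match(pseudogene_id)
    if match is None:
        raise ValueError(f"unrecognized pseudogene ID: {pseudogene_id!r}")
    return (
        match.group("chrom"),
        match.group("accession"),
        int(match.group("index")),
    )


def parse_short_id(short_id):
    """Split a Short_ID into (accession, genome-wide index)."""
    accession, index = short_id.rsplit("_", 1)
    return accession, int(index)


def read_annotation(handle):
    """Yield one dict per pseudogene, keyed by the names on the header line."""
    # QUOTE_NONE keeps literal quote characters in fields untouched.
    yield from csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)


# Hypothetical sample with only the first six columns; coordinates are invented.
sample = io.StringIO(
    "ID\tShort_ID\tChr\tChrom_start\tChrom_end\tChrom_strand\n"
    "chr1_P02404.1\tP02404_1\t1\t1000\t1350\t+\n"
)
rows = list(read_annotation(sample))
```

With a real file, `read_annotation(open(path, newline=""))` would be used instead of the in-memory sample. All values come back as strings, so numeric columns such as Chrom_start need an explicit `int()` or `float()` conversion.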
Format of pseudogene DNA sequence file (.dna)
These files contain multiple-sequence, FASTA format, nucleotide sequences of
the annotated processed pseudogenes.
Each pseudogene entry has 2 lines. The
header line begins with ">", followed by the unique pseudogene ID (field 1
in the corresponding .gff annotation file). Other attributes of the
pseudogene are also provided on the header line, including "Chrom",
"Chrom-start", "Chrom_end", "Strand", "band", "Query_protein", "Query_start",
"Query_end", "Query_len", "Class_new", "Comment" and "Short_ID". These
attributes are defined in the .gff format description above. The second line
is the nucleotide sequence of the pseudogene.
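
A reader for the 2-line entries might look like the sketch below. The exact layout of the attributes on the header line is not specified here, so the sketch assumes only that the ID is the first whitespace-delimited token after ">" and returns the rest of the header unparsed. The function name and the sample entry are hypothetical.

```python
import io


def read_dna_entries(handle):
    """Yield (pseudogene_id, other_attributes, sequence) per 2-line entry."""
    # Drop blank lines so trailing newlines do not break the pairing.
    lines = [line.strip() for line in handle if line.strip()]
    if len(lines) % 2 != 0:
        raise ValueError("expected an even number of non-blank lines")
    for header, sequence in zip(lines[0::2], lines[1::2]):
        if not header.startswith(">"):
            raise ValueError(f"header does not start with '>': {header!r}")
        # Assumption: the ID is the first whitespace-delimited token.
        parts = header[1:].split(None, 1)
        if not parts:
            raise ValueError("header line has no pseudogene ID")
        attributes = parts[1] if len(parts) > 1 else ""
        yield parts[0], attributes, sequence


# Hypothetical entry; the attribute text and the sequence are placeholders.
sample = io.StringIO(">chr1_P02404.1 placeholder attributes\nACGTACGT\n")
entries = list(read_dna_entries(sample))
```

The attribute text is kept as a raw string on purpose, since splitting it further would require knowing the separator actually used in the files.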
Format of pseudogene amino acid sequence file (.fa)
These files contain multiple-sequence, FASTA format, predicted amino acid
sequences of the annotated processed pseudogenes.
Each pseudogene entry has
3 lines. The header line begins with ">", followed by the unique pseudogene
ID (field 1 in the corresponding .gff annotation file). Other attributes of
the pseudogene are also provided on the header line, including "Chrom",
"Chrom-start", "Chrom_end", "Strand", "band", "Query_protein", "Query_start",
"Query_end", "Query_len", "Class_new", "Comment" and "Short_ID". These
attributes are defined in the .gff format description above.
The second line is the amino acid
sequence of the query protein, and the third line is the predicted amino acid
sequence of the pseudogene. Frameshifts are indicated as "\" or "/", stop codons
are indicated as "X", and gaps are shown as "-".
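
The 3-line layout and the marker characters can be illustrated with a short sketch. It groups lines into (header, query sequence, pseudogene sequence) triples and counts the frameshift, stop codon and gap symbols in a pseudogene line. The function names and the sample sequences are hypothetical, and the counts are simple character tallies, not a reimplementation of the Disable column.

```python
import io
from collections import Counter


def read_fa_entries(handle):
    """Yield (header, query_protein_seq, pseudogene_seq) per 3-line entry."""
    lines = [line.strip() for line in handle if line.strip()]
    if len(lines) % 3 != 0:
        raise ValueError("expected a multiple of 3 non-blank lines")
    for i in range(0, len(lines), 3):
        header, query_seq, pgene_seq = lines[i:i + 3]
        if not header.startswith(">"):
            raise ValueError(f"header does not start with '>': {header!r}")
        yield header[1:], query_seq, pgene_seq


def count_markers(pgene_seq):
    """Tally the special symbols in a predicted pseudogene sequence."""
    counts = Counter(pgene_seq)
    return {
        "frameshifts": counts["\\"] + counts["/"],  # "\" or "/"
        "stop_codons": counts["X"],
        "gaps": counts["-"],
    }


# Hypothetical entry; both sequences are invented for illustration.
sample = io.StringIO(">chr1_P02404.1\nMKVLAAGIT\nMKV/AXG-T\n")
entries = list(read_fa_entries(sample))
markers = count_markers(entries[0][2])
```

The tally is applied only to the third line of each entry, because "X" in a query protein sequence would not denote a stop codon.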
Occurrences of processed pseudogenes
The file contains multiple tab-delimited fields:
- Rank: ranking of the proteins based on number of processed pseudogenes.
- Count: number of processed pseudogenes (excluding putative ones) that closely
match the protein.
- AC: Swissprot/TrEMBL accession number of the protein.
- DB: "SWP" or "TREMBL".
- DB_Name: Swissprot entry name.
- Comment: Ribosomal protein sequences are labeled as "RP".
- Protein_len: Sequence length of the protein.
- CDS_len: Sequence length of the coding sequence (CDS) of the protein.
- CDS_GC: GC content of the coding sequence.
- EBI_Name: Description of the protein provided by EBI.
- Secondary_AC: Secondary accession number of the protein in Swissprot/TrEMBL.
- Protein_Name: protein name described in Swissprot/TrEMBL.
- Synonyms: alternative names for the protein, provided by Swissprot.
- Gene_Name: Associated gene name for the protein.
- MIM: Entry in the MIM (Mendelian Inheritance in Man) database.
- Key Words: biological keywords of the protein provided by Swissprot/TrEMBL.
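
As an illustration of the field order above, the sketch below maps each row onto the listed field names by position and filters out ribosomal protein entries. Whether the file carries a header line is not stated here, so the sketch assumes none by default and offers a flag to skip one. The function names and the sample rows (placeholder accessions, invented counts) are hypothetical.

```python
import csv
import io

# Field names in the order listed above.
FIELDS = [
    "Rank", "Count", "AC", "DB", "DB_Name", "Comment", "Protein_len",
    "CDS_len", "CDS_GC", "EBI_Name", "Secondary_AC", "Protein_Name",
    "Synonyms", "Gene_Name", "MIM", "Key Words",
]


def read_occurrences(handle, has_header=False):
    """Yield one dict per protein, keyed by position using FIELDS."""
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    if has_header:
        next(reader, None)  # skip the header line if the file has one
    for row in reader:
        if row:
            yield dict(zip(FIELDS, row))


def non_ribosomal(records):
    """Keep only proteins whose Comment field is not "RP"."""
    return [r for r in records if r.get("Comment", "") != "RP"]


# Hypothetical rows with only the first six fields; all values are invented.
sample = io.StringIO(
    "1\t10\tACC_A\tSWP\tNAME_A\tRP\n"
    "2\t7\tACC_B\tTREMBL\tNAME_B\t\n"
)
records = list(read_occurrences(sample))
kept = non_ribosomal(records)
```

Mapping by position keeps the sketch independent of the presence or spelling of a header line; rows shorter than FIELDS simply yield fewer keys.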
Updated 12/03/2002, ZL@bioinfo.mbb.yale.edu
Copyright 2002, All Rights Reserved