Format of Processed Ribosomal Protein Pseudogene Flat Files

The pseudogene annotation files are tab-delimited, multi-field text files. All information relating to the ribosomal proteins was downloaded from the Ribosomal Protein Gene Database (RPDB).

Data Description

Each file has a header line describing the content in each column. The columns from left to right are:

ID: unique identifier for each processed pseudogene in the format chr$a_$b.$c, where $a is the chromosome name, $b is the name of the ribosomal protein as designated in RPDB, and $c is the sequential numbering of the pseudogene that matches protein $b on chromosome $a. Example: chr10_RPL7.2
Short_ID: short version of the pseudogene ID in the format $a_$b, where $a is the Swissprot protein accession number and $b is the sequential numbering of the pseudogene that matches protein $a in the whole genome. Example: RPL7_18
Chr: chromosome name
Chrom_start: starting coordinate of the pseudogene on the chromosome, based on Release 36 of Ensembl.
Chrom_strand: "-" or "+"
Query_protein: name of the query ribosomal protein as named in RPDB
Query_start: starting amino acid number on the query protein that the pseudogene matches.
Query_end: end amino acid number on the query protein that the pseudogene matches.
Query_len: sequence length of the query protein in column 7.
Match_length: fractional length of the pseudogene compared to the query protein.
E-value: expect value of the pseudogene in the TBLASTX search.
AA_ident: amino acid sequence identity between the pseudogene and query protein.
Polya: "0", "1", "2", or "3".
- "0": no polyA tail (> 30 A in a 50 bp window) detected for the pseudogene
- "1": has a polyA tail and a polyadenylation signal within 50 bp of the beginning of the tail
- "2": has a polyA tail and a polyadenylation signal within 50-100 bp of the beginning of the tail
- "3": has a polyA tail but no polyadenylation signal detected
Disable: "0", "d", or "D". "0" indicates no disablement. "d" indicates a disablement in a region of low sequence identity. "D" indicates a disablement in a region of high sequence identity.
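
The layout described above can be read with a few lines of Python. The following is a minimal sketch, not part of the original distribution. It assumes that the header line uses exactly the column names listed above, that chromosome names contain no underscore, and that protein names end before the final dot of the ID. The names parse_id, read_records, POLYA_CODES, DISABLE_CODES and SAMPLE are hypothetical, and the sample row is made up around the two example identifiers given in this document.

```python
import csv
import io
import re

# Assumed shape of the ID column: chr<chromosome>_<protein>.<serial>,
# for example chr10_RPL7.2. Chromosome names that contain "_" would
# need a different pattern.
ID_PATTERN = re.compile(r"^chr(?P<chrom>[^_]+)_(?P<protein>.+)\.(?P<serial>\d+)$")

# Meanings of the Polya and Disable codes, paraphrased from the column list.
POLYA_CODES = {
    "0": "no polyA tail detected",
    "1": "polyA tail, polyadenylation signal within 50 bp of tail start",
    "2": "polyA tail, polyadenylation signal within 50-100 bp of tail start",
    "3": "polyA tail, no polyadenylation signal detected",
}
DISABLE_CODES = {
    "0": "no disablement",
    "d": "disablement in a region of low sequence identity",
    "D": "disablement in a region of high sequence identity",
}


def parse_id(pseudogene_id):
    """Split an ID such as 'chr10_RPL7.2' into (chromosome, protein, serial)."""
    match = ID_PATTERN.match(pseudogene_id)
    if match is None:
        raise ValueError(f"unexpected ID format: {pseudogene_id!r}")
    return match.group("chrom"), match.group("protein"), int(match.group("serial"))


def read_records(handle):
    """Yield one dict per data row, keyed by the names in the header line."""
    reader = csv.DictReader(handle, delimiter="\t")
    for row in reader:
        # Attach human-readable meanings next to the raw codes.
        row["Polya_meaning"] = POLYA_CODES.get(row["Polya"], "unknown code")
        row["Disable_meaning"] = DISABLE_CODES.get(row["Disable"], "unknown code")
        yield row


# Made-up sample containing only four of the columns, for illustration.
SAMPLE = (
    "ID\tShort_ID\tPolya\tDisable\n"
    "chr10_RPL7.2\tRPL7_18\t1\td\n"
)

for record in read_records(io.StringIO(SAMPLE)):
    print(parse_id(record["ID"]), record["Polya_meaning"], record["Disable_meaning"])
```

Rows are looked up by header name rather than by column position, so the sketch does not depend on the exact number or order of columns in a given file. For a real file, pass an open file handle to read_records in place of the io.StringIO sample.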