Format of Processed Ribosomal Protein Pseudogene Flat Files
The pseudogene annotation files are tab-delimited, multi-field text files.
All information relating to the ribosomal proteins was downloaded from the Ribosomal Protein Gene Database (RPDB).
Data Description
Each file has a header line describing the content in each column. The columns
from left to right are:
- ID: unique identifier for each processed pseudogene, in the format chr$a_$b.$c, where $a is the chromosome name, $b is the name of the ribosomal protein as designated in RPDB, and $c is the sequential number of the pseudogene matching protein $b on chromosome $a. Example: chr10_RPL7.2
- Short_ID: short version of the pseudogene ID, in the format $a_$b, where $a is the Swissprot protein accession number and $b is the sequential number of the pseudogene matching protein $a across the whole genome. Example: RPL7_18
- Chr: chromosome name
- Chrom_start: starting coordinate of the pseudogene on the chromosome, based on Release 36 of Ensembl.
- Chrom_strand: "-" or "+"
- Query_protein: name of the query ribosomal protein as named in RPDB
- Query_start: starting amino acid number on the query protein that the pseudogene matches.
- Query_end: end amino acid number on the query protein that the pseudogene matches.
- Query_len: sequence length of the query protein in column 7.
- Match_length: fractional length of the pseudogene compared to the query protein.
- E-value: expect value of the pseudogene in the TBLASTX search.
- AA_ident: amino acid sequence identity between the pseudogene and query protein.
- Polya: "0", "1", "2", or "3".
  - "0": no polyA tail (> 30 A in a 50 bp window) detected for the pseudogene
  - "1": has a polyA tail and a polyadenylation signal within 50 bp of the beginning of the tail
  - "2": has a polyA tail and a polyadenylation signal within 50-100 bp of the beginning of the tail
  - "3": has a polyA tail but no polyadenylation signal detected
- Disable: "0", "d", or "D". "0" indicates no disablement. "d" indicates disablement in a region of low sequence identity. "D" indicates disablement in a region of high sequence identity.
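The column layout above can be read with a short script. The following Python sketch is illustrative only and is not part of the distributed files. It assumes the header line uses exactly the column names listed above, that chromosome names contain no underscore, and that the Polya and Disable fields hold the literal codes described. The function names (parse_id, read_pseudogenes) and every value in the sample row except the two example IDs are invented for the demonstration. Columns are looked up by header name rather than by position, so the sketch does not depend on the exact column order.

```python
import csv
import io
import re

# Pattern for the ID format chr$a_$b.$c (e.g. chr10_RPL7.2).
# Assumption: the chromosome name $a contains no underscore.
ID_PATTERN = re.compile(r"^chr(?P<chrom>[^_]+)_(?P<protein>.+)\.(?P<index>\d+)$")

# Short labels for the coded columns, paraphrasing the descriptions above.
POLYA_MEANING = {
    "0": "no polyA tail detected",
    "1": "polyA tail, signal within 50 bp of tail start",
    "2": "polyA tail, signal within 50-100 bp of tail start",
    "3": "polyA tail, no polyadenylation signal",
}
DISABLE_MEANING = {
    "0": "no disablement",
    "d": "disablement in low-identity region",
    "D": "disablement in high-identity region",
}


def parse_id(pseudogene_id):
    """Split an ID such as 'chr10_RPL7.2' into (chromosome, protein, index)."""
    match = ID_PATTERN.match(pseudogene_id)
    if match is None:
        raise ValueError(f"unexpected ID format: {pseudogene_id!r}")
    return match.group("chrom"), match.group("protein"), int(match.group("index"))


def read_pseudogenes(handle):
    """Yield one dict per data row, keyed by the header line's column names."""
    reader = csv.DictReader(handle, delimiter="\t")
    for row in reader:
        chrom, protein, index = parse_id(row["ID"])
        row["id_chrom"] = chrom
        row["id_protein"] = protein
        row["id_index"] = index
        # Unknown codes are passed through as "unknown" rather than rejected.
        row["polya_meaning"] = POLYA_MEANING.get(row["Polya"], "unknown")
        row["disable_meaning"] = DISABLE_MEANING.get(row["Disable"], "unknown")
        yield row


# Hypothetical sample: only the two IDs come from the examples above;
# all other values are made up for illustration.
HEADER = ["ID", "Short_ID", "Chr", "Chrom_start", "Chrom_strand",
          "Query_protein", "Query_start", "Query_end", "Query_len",
          "Match_length", "E-value", "AA_ident", "Polya", "Disable"]
ROW = ["chr10_RPL7.2", "RPL7_18", "10", "12345", "+",
       "RPL7", "1", "248", "248",
       "1.00", "1e-50", "0.95", "1", "d"]
SAMPLE = "\t".join(HEADER) + "\n" + "\t".join(ROW) + "\n"

rows = list(read_pseudogenes(io.StringIO(SAMPLE)))
for row in rows:
    print(row["ID"], row["id_protein"], row["polya_meaning"], row["disable_meaning"])
```

To read a real file, pass an open file handle instead of the in-memory sample, for example open(path, newline="") where path is the location of one of the flat files.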