The tables chr21.pseudogenes and chr22.pseudogenes contain pseudogene annotations. All annotations are for individual exons, except where labelled 'multi_exon' in the comment field.

This data is unpublished and may be modified again before publication. Please do not disseminate it.

The fields of the tables are as follows:

1 = chromosome label. Either chr21_ or chr22_.
2 = source of the pseudogene. Either pmh (annotations derived here; abbreviation for 'pseudomatch') or sanger (Sanger Centre annotations).
3 = sequence type. Swp (derived from a matching SWISSPROT protein); Ens (derived from a matching Ensembl protein); sanger (derived from a Sanger Centre annotation).
4 = start of sequence on chromosome.
5 = end of sequence on chromosome.
6 = strand of chromosome.
7 = primary name of the sequence.
8 = number of exons for the pseudogene (Sanger Centre chromosome 22 annotations).
9 = classification according to detection of polyadenylation.
     Class '1' if there is an AATAAA signal <50 nt 5' to the polyadenine tail;
     Class '2' if there is an AATAAA signal <100 and >=50 nt 5' to the polyadenine tail (very few of these);
     Class '3' if there is a polyadenine tail but no AATAAA signal;
     otherwise labelled '-1'.
10 = fraction of the length of the closest matching sequence (Ensembl or, if there is no Ensembl matcher, the closest SWISSPROT matcher).
11 = long or short segment.
     equal to 1 if the length is greater than that of 95% of known exons for human chromosome 22 genes (>942 nt);
     equal to -1 otherwise.
12 = whether Ig segment.
     equal to 1 if immunoglobulin gene segment;
     equal to 0 otherwise.
13 = length of the largest gap in the closest sequence (in aa).
14 = whether candidate processed pseudogene.
     equal to 1 if candidate processed pseudogene;
     equal to 0 otherwise.
15 = comment on entry. Labelled 'multi_exon' if the largest gap in the pseudogene is >126 nt (5% of introns would be shorter than this); other labels are 'pssd_pgene' (processed pseudogene), 'ig_segment' (immunoglobulin gene segment) and 'single_exon'.
16 = alternative name for the sequence (sometimes used).
17 = name of the closest Ensembl human protein matcher, if any.
18 = percentage identity to the closest Ensembl human protein matcher, if any.
19 = OLD or NEW (<79% or >=79% identity to the closest Ensembl human protein matcher).
20 = overall name (differs from the primary name for individual pseudogenic matches that have been merged into one pseudogene by inspection).
21 = list of InterPro motif names for this pseudogene.
22 = list of GO classifications for this pseudogene.

#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#

The corresponding sequence files for the pseudogenes that I have derived are listed below. They are in FASTA format, and each sequence is named by the 'primary name' given above.

(chromosome 21)
chr21.pmh.Swp.N.FASTA ---> pseudogenes for which the primary match is to a SWISSPROT protein
chr21.pmh.Ens.N.FASTA ---> pseudogenes for which the primary match is to an Ensembl protein
chr21.riken.pmh.N.FASTA ---> pseudogenes which are primarily Riken Centre annotations, but which are also detected by our procedures

(chromosome 22)
chr22.pmh.Swp.N.FASTA ---> pseudogenes for which the primary match is to a SWISSPROT protein
chr22.pmh.Ens.N.FASTA ---> pseudogenes for which the primary match is to an Ensembl protein
chr22.sanger.pmh.N.FASTA ---> pseudogenes which are primarily Sanger Centre annotations, but which are also detected by our procedures

In the sequences, a frameshift is indicated by a slash ('/' or '\'), a deletion by a '-', and a stop by an '*'.
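
For readers who want to load these files programmatically, the following is a minimal Python sketch, not part of the original distribution. It rests on several assumptions: that the table columns are tab-separated (the delimiter is not stated above), that every row carries all 22 fields, and that each FASTA header line holds only the primary name. The short field names in FIELD_NAMES and the function names are hypothetical labels chosen for illustration.

```python
from collections import Counter

# Hypothetical short names for the 22 columns, in the order documented above.
FIELD_NAMES = [
    "chromosome", "source", "seq_type", "start", "end", "strand",
    "primary_name", "n_exons", "polya_class", "match_fraction",
    "long_segment", "ig_segment", "largest_gap_aa", "processed_candidate",
    "comment", "alt_name", "ensembl_match", "ensembl_pct_id",
    "old_or_new", "overall_name", "interpro", "go",
]


def parse_row(line, sep="\t"):
    """Split one table row into a dict keyed by FIELD_NAMES.

    ASSUMPTION: columns are separated by `sep` (tab by default) and all
    22 fields are present. Only start and end are converted to int; the
    other fields are kept as strings because their missing-value
    conventions are not fully specified.
    """
    values = line.rstrip("\r\n").split(sep)
    if len(values) != len(FIELD_NAMES):
        raise ValueError(
            f"expected {len(FIELD_NAMES)} fields, got {len(values)}"
        )
    row = dict(zip(FIELD_NAMES, values))
    row["start"] = int(row["start"])
    row["end"] = int(row["end"])
    return row


def read_fasta(lines):
    """Yield (name, sequence) pairs from an iterable of FASTA lines.

    ASSUMPTION: the text after '>' is the primary name.
    """
    name, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            name, chunks = line[1:].strip(), []
        else:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def count_disruptions(seq):
    """Count the disruption markers described above in one sequence."""
    counts = Counter(seq)
    return {
        "frameshifts": counts["/"] + counts["\\"],  # either slash
        "deletions": counts["-"],
        "stops": counts["*"],
    }
```

A possible use would be to open one of the FASTA files, pass the file object to read_fasta, and call count_disruptions on each sequence; rows from parse_row could then be matched to sequences through the primary name. If the tables turn out to be space-delimited, a different value of sep would be needed.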