Once gene sequences have been identified in the genome, sequence alignment programs (such as FASTA or BLAST) can be used to detect matching regions in the nucleotide sequence. These matching regions are potential gene homologs and are termed pseudogenes if there is some evidence that either of the causes (see above) applies.
In these analyses, genes from annotated genomes and protein databases have first been clustered into paralog families and then used to survey whole genomes for copies or homologs. For each potential pseudogene (or fragment) match, a number of steps have been taken to assess its validity as a pseudogene. These steps include checking for overcounting and repeat elements, checking for overlap on the genomic DNA with other homologs, and cross-referencing with exon assignments from genome annotations. The resulting pseudogenes or pseudogenic fragments have then been assigned to the paralog family of the most homologous gene (or to a singleton gene if the probe gene has no obvious paralog).
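The validation steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the pipeline used in these analyses: the names `Match`, `filter_and_assign`, `exons`, `repeats`, and `gene_to_family` are hypothetical, "most homologous" is approximated by the lowest alignment e-value, and overlapping matches are resolved greedily in favor of the stronger match.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Match:
    """One candidate pseudogene match on the genomic DNA (hypothetical record)."""
    chrom: str
    start: int     # inclusive
    end: int       # exclusive
    gene: str      # probe gene that produced the match
    evalue: float  # alignment e-value; lower means more homologous (assumption)


def overlaps(a_start, a_end, b_start, b_end):
    """True if two half-open intervals share at least one position."""
    return a_start < b_end and b_start < a_end


def filter_and_assign(matches, exons, repeats, gene_to_family):
    """Filter candidate matches and assign each survivor to a paralog family.

    exons / repeats: dict mapping chromosome -> list of (start, end) intervals.
    gene_to_family:  dict mapping probe gene -> paralog family name.
    Returns a list of (match, family) pairs, strongest match first.
    """
    kept = []
    # Visit the strongest matches first so they claim each locus.
    for m in sorted(matches, key=lambda m: m.evalue):
        masked = exons.get(m.chrom, []) + repeats.get(m.chrom, [])
        # Drop matches that fall on annotated exons or repeat elements.
        if any(overlaps(m.start, m.end, s, e) for s, e in masked):
            continue
        # Avoid overcounting: drop matches overlapping an already kept homolog.
        if any(k.chrom == m.chrom and overlaps(m.start, m.end, k.start, k.end)
               for k in kept):
            continue
        kept.append(m)
    # A probe gene without a family is treated as a singleton.
    return [(m, gene_to_family.get(m.gene, m.gene)) for m in kept]
```

The greedy ordering is a design choice made for brevity; a real survey would likely also merge adjacent fragments of the same probe gene before resolving overlaps.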
In a number of cases, more distant evolutionary and functional relationships between proteins can only be elucidated through analysis of the folds that their structures adopt. Although the function of a gene is often inferred from that of a gene with a homologous sequence, the additional information that protein structures provide is highly desirable in genome annotation.
In the case of pseudogenes, structural information can give extra evolutionary clues and facilitate analysis of the scope of folds in the pseudogene population ("pseudo-folds") in contrast to those observed for the genes themselves. Where possible, i.e., where a gene can be matched to a SCOP domain, the assignment of a fold to a pseudogene or pseudogenic fragment is based upon the fold assignment of the most homologous gene.
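The fold-transfer rule described above amounts to a lookup through the most homologous gene, followed by a tally. The sketch below assumes two precomputed mappings whose names (`pseudogene_to_gene`, `gene_to_fold`) are hypothetical; how a gene is matched to a SCOP domain is outside its scope.

```python
from collections import Counter


def assign_folds(pseudogene_to_gene, gene_to_fold):
    """Transfer a fold to each pseudogene from its most homologous gene.

    pseudogene_to_gene: dict mapping pseudogene id -> most homologous gene id.
    gene_to_fold:       dict mapping gene id -> SCOP fold identifier, present
                        only for genes that could be matched to a SCOP domain.
    Pseudogenes whose gene has no fold assignment are left out.
    """
    return {
        pg: gene_to_fold[gene]
        for pg, gene in pseudogene_to_gene.items()
        if gene in gene_to_fold
    }


def fold_usage(assignments):
    """Count how many pseudogenes were assigned to each fold."""
    return Counter(assignments.values())
```

Running `fold_usage` on the genes' own fold assignments as well gives the two distributions that the comparison between "pseudo-folds" and gene folds would need.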
Our initial goal was to survey a set of eukaryotic genomes for pseudogene sequences and fragments of pseudogene sequences. We have also quantified "pseudo-fold" usage, amino-acid composition, and single-nucleotide polymorphisms (SNPs) to help elucidate the relationships between pseudogene families across these organisms.