Subject: Summary of pseudogene call on Thurs. 27-Oct at 11 AM EDT + Overall Group Summary [PGENE]
Date: Mon, 07 Nov 2005 08:47:43 -0500

Hi,

Deyou and I won't be with you in Santa Cruz. I hope the workshop goes well! If people are interested, we may be able to call in on Fri.

Just to make sure everyone is on the same page, I'm sending around the current list of ENCODE pseudogenes again and am attaching below the minutes from the 27-Oct call plus an overall summary of the group (which I did for Tom and Roderic).

cheers,
marK

##
## Summary of 27-Oct-05 Call
##

1. We are moving towards a very well annotated pseudogene dataset for ENCODE regions.
2. We are very interested in transcribed pseudogenes and in understanding whether there is any activity associated with them.
3. To address point 2, feature integration of various sorts will give a better idea of how strong the evidence for pseudogene transcription is.
4. We need to think about the best way to intersect pseudogene and tag sequence data, as tags are usually in UTRs and pseudogene coordinates will not include those.
5. Roderic, Tom and Mark will decide how to go about including pseudogene representation in the workshop next week.

== Identification of candidate pseudogenes that could be potentially transcribed

France used a 1-nucleotide overlap between TAR and pseudogene coordinates as the criterion for identifying potentially transcribed pseudogenes. For ditags, only the tags at the 5' and 3' ends were considered. Here again, France used the coordinates of the 5' and 3' tags to pull out pseudogenes that had at least 1 bp of overlap with the ditag coordinates.

Approximately 50% of pseudogenes could be potentially transcribed, but a similar analysis shows similar results for gene transcription. We do not expect to see many overlaps this way, as pseudogene coordinates are based entirely on protein-coding homology and do not include UTRs.

*Thus, 5% overlap with CAGE tags is intriguing.
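For anyone wanting to reproduce the overlap criterion, here is a minimal Python sketch of it. It is not France's actual code. The `Interval` type, the function names, the 1-based inclusive coordinate convention, and the example coordinates are all assumptions made up for illustration. The optional `flank` parameter is also hypothetical: it widens each pseudogene by a fixed number of bases on both sides to allow for the UTRs that the pseudogene coordinates lack. For ditags, the 5' and 3' tags would each be passed in as a separate evidence interval.

```python
from typing import Dict, List, NamedTuple


class Interval(NamedTuple):
    chrom: str
    start: int  # assumed 1-based, inclusive
    end: int    # assumed 1-based, inclusive


def overlap_bp(a: Interval, b: Interval) -> int:
    """Number of bases shared by two intervals (0 if none or different chromosomes)."""
    if a.chrom != b.chrom:
        return 0
    return max(0, min(a.end, b.end) - max(a.start, b.start) + 1)


def candidate_transcribed(
    pseudogenes: Dict[str, Interval],
    evidence: List[Interval],
    min_overlap: int = 1,
    flank: int = 0,  # hypothetical allowance for missing UTRs
) -> List[str]:
    """Names of pseudogenes sharing at least min_overlap bases with any evidence interval."""
    hits = []
    for name, pg in pseudogenes.items():
        widened = Interval(pg.chrom, max(1, pg.start - flank), pg.end + flank)
        if any(overlap_bp(widened, ev) >= min_overlap for ev in evidence):
            hits.append(name)
    return hits


# Invented coordinates, for illustration only.
pseudogenes = {
    "pg1": Interval("chr21", 1000, 1500),
    "pg2": Interval("chr21", 5000, 5400),
}
tars = [Interval("chr21", 1500, 1600), Interval("chr21", 5450, 5480)]

print(candidate_transcribed(pseudogenes, tars))             # ['pg1']
print(candidate_transcribed(pseudogenes, tars, flank=100))  # ['pg1', 'pg2']
```

With `flank=0` this is the strict 1 bp criterion. A non-zero `flank` also picks up evidence lying just outside the homology-based coordinates, at the cost of more chance overlaps.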
To ponder:

*What is a good way to intersect pseudogenes with CAGE and ditag data?

France suggested looking at the distribution of distances between tags and pseudogene coordinates and choosing an optimal value as a threshold, i.e. include every pseudogene that lies within 'x' base pairs of a tag sequence as a candidate potentially transcribed pseudogene.

The Singapore participant proposed that genomic features such as EST data, binding site evidence, CAGE, ditags and transfrags can all be integrated to see whether together they make a stronger case for a pseudogene being potentially transcribed.

It would also be of interest to check whether pseudogenes that could be potentially transcribed have high sequence identity to the parent gene. Presumably, high identity could mean that the apparent transcription is an artefact of cross-hybridization.

== OVERALL SUMMARY OF SUBGROUP (provided to Tom and Roderic)

We have carefully identified and annotated 198 pseudogenes in ENCODE regions. Pseudogenes were identified by the various groups using different methods and categorized as processed or non-processed. All the candidate pseudogenes were merged and a consensus list has been obtained. Pseudogenes that were not identified by all the groups have been inspected manually and merged into the consensus list after thorough evaluation of various pieces of evidence such as GENCODE gene annotations and EST data.

Potentially transcribed pseudogenes have been identified by intersecting pseudogene locations on the genome with transcriptional evidence: TARs obtained from expression studies by the Affymetrix and Yale groups, EST evidence, CAGE and ditags. RACE experiments are currently underway in 12 human tissues; multiple RACE products will be pooled and analyzed by hybridization to Affymetrix ENCODE 20-nucleotide resolution arrays to experimentally verify potentially transcribed pseudogenes.
The current consensus list of pseudogenes is integrated into the UCSC browser and is also available from http://www.pseudogene.org/ENCODE/.