See http://bioinfo.mbb.yale.edu/permissions.shtml for usage terms.
Please note that the pipeline is not engineered for casual use by
relatively novice users. You may need to roll up your sleeves and have
a look at the code from time to time.
1) BLAST an organism's genome against its proteome. We typically split
the proteome into bite-size chunks and run a number of concurrent
BLASTs. Use the '-m 8' option to produce tab delimited output.
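The splitting in Step 1 can be done in many ways. The following is a minimal Python sketch, not part of the pipeline itself. The function name 'split_fasta', the chunk size, and the 'splitNNNN' file naming are assumptions made for illustration; pick whatever naming lets your BLAST outputs end up matching the pattern described in Step 2.

```python
from pathlib import Path


def split_fasta(fasta_path, out_dir, records_per_chunk=1000):
    """Write consecutive FASTA records into numbered chunk files.

    Returns the list of chunk paths in the order they were written.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    chunk_paths = []
    handle = None
    n_records = 0
    with open(fasta_path) as fasta:
        for line in fasta:
            if line.startswith(">"):
                # Open a new chunk file every records_per_chunk records.
                if n_records % records_per_chunk == 0:
                    if handle is not None:
                        handle.close()
                    # Hypothetical naming scheme: split0000, split0001, ...
                    path = out_dir / ("split%04d" % len(chunk_paths))
                    handle = open(path, "w")
                    chunk_paths.append(path)
                n_records += 1
            # Lines before the first '>' header are ignored.
            if handle is not None:
                handle.write(line)
    if handle is not None:
        handle.close()
    return chunk_paths
```

Each chunk can then be submitted as its own BLAST job, with the output of each job written to its own file.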
2) 'processBlastOutput.py' will convert the BLAST output into a form
appropriate for use by the pipeline. Note: This script takes two
arguments: a file containing the proteome in FASTA format and a
directory in which the various BLAST split outputs are located. The
script is hardwired to look for the pattern 'splitXXXXOut' where X is
a digit. It would be straightforward to change this pattern if you wish.
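To check which files in a directory would match such a pattern before running the script, something like the sketch below may help. It is not taken from 'processBlastOutput.py'; the four-digit regular expression is an assumption read off the 'splitXXXXOut' description, and 'find_split_outputs' is a hypothetical helper name.

```python
import re
from pathlib import Path

# Four digits assumed from the 'splitXXXXOut' description; adjust as needed.
SPLIT_OUTPUT_PATTERN = re.compile(r"^split\d{4}Out$")


def find_split_outputs(blast_dir, pattern=SPLIT_OUTPUT_PATTERN):
    """Return the files in blast_dir whose names match pattern, sorted by name."""
    return sorted(
        p for p in Path(blast_dir).iterdir()
        if p.is_file() and pattern.match(p.name)
    )
```

If you named your BLAST outputs differently, pass a different compiled pattern here to see what it picks up, then make the corresponding change in the script.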
3) The pipeline needs data to mask out known genes. There are three options here:
i) Provide null exon data files (one per chromosome). No masking.
ii) Provide coordinates that span an entire gene. Intronic regions will then be masked.
iii) Provide coordinates of just exons.
Pseudogene.org uses option iii). The requisite data is extracted from the Ensembl mysql files:
exon.txt.table
exon_transcript.txt.table
seq_region.txt.table
translation.txt.table
translation_stable_id.txt.table
by the script 'extractKPExonLocations.py'. This script assumes that it
is executed in the directory containing the files and is given as an
argument a file containing the proteome in FASTA format.
4) Set environment variables. Here's an example file for bash:
# we tend to gather all files together under one subdirectory
dataDir=/home1/njc2/bioInformatics/genomes/dr/34.5b
# describes a pattern used to find all BLAST output files. Note the
# 'P': Step 1) produced output segregated by chromosome and strand.
# Pseudogene.org runs the pipeline once for the plus ('P') strand and
# once for the minus ('M') strand.
#
# Here and elsewhere, '%s' will be replaced by chromosome identifiers
# during the execution of the script. E.g., if the script is run with
# the arguments '1 X', then it will look for files named
# 'chr1_P_blastHits.sorted' and 'chrX_P_blastHits.sorted'.
export BlastoutSortedTemplate=${dataDir}/pgpipe/chr%s_P_blastHits.sorted
# Location of chromosome DNA files from Ensembl.
export ChromosomeFastaTemplate=${dataDir}/dna/Danio_rerio.ZFISH5.oct.dna.chromosome.%s.fa
# Location of mask files (see Step 3 above)
export ExonMaskTemplate=${dataDir}/mysql/chr%s_exLocs
# The columns in the mask file that provide start and stop data (0-based).
export ExonMaskFields='2 3'
# Location of the FASTA program tfasty34
export FastaProgram=/home1/njc2/fromColossusHome/bioInformatics/fasta/tfasty34
# The proteome in FASTA format.
export ProteinQueryFile=${dataDir}/pep/Danio_rerio.ZFISH5.oct.pep.known.fa
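Before launching a run, it can be useful to confirm that the templates expand to files that actually exist. The Python sketch below mimics the '%s' substitution described in the comments above; it is not the pipeline's own code, and the helper names ('expand_templates', 'missing_files', 'mask_columns') are hypothetical. It assumes the variables have already been exported into the environment.

```python
import os

# The three variables from the example file that contain a '%s' placeholder.
TEMPLATE_VARS = (
    "BlastoutSortedTemplate",
    "ChromosomeFastaTemplate",
    "ExonMaskTemplate",
)


def expand_templates(chromosomes, env=None):
    """Map each template variable to the per-chromosome paths it expands to."""
    env = os.environ if env is None else env
    return {
        name: [env[name] % chrom for chrom in chromosomes]
        for name in TEMPLATE_VARS
    }


def missing_files(chromosomes, env=None):
    """List the expanded paths that do not exist on disk."""
    expanded = expand_templates(chromosomes, env)
    return [
        path
        for paths in expanded.values()
        for path in paths
        if not os.path.exists(path)
    ]


def mask_columns(env=None):
    """Parse ExonMaskFields (e.g. '2 3') into 0-based start and stop columns."""
    env = os.environ if env is None else env
    start, stop = (int(field) for field in env["ExonMaskFields"].split())
    return start, stop
```

For example, after sourcing the environment file, 'missing_files(["1", "X"])' returns the paths for chromosomes 1 and X that could not be found; an empty list means every templated file is in place.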
5) In a fresh directory, run the pipeline via the script
'runScripts.py'. We typically create directories 'plus' and 'minus' to
keep the two runs separate and then merge the results later.
'runScripts.py' expects a list of chromosomes as arguments. Note: the
analysis of one chromosome is independent of all the others, so you may
choose to run one or two small ones to test the setup before running
with a list of all of the chromosomes.
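E.g. (note that ${PipelinePath} is not set in the example file from
Step 4; it must point to the directory containing the pipeline scripts):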
mkdir plus
cd plus
source plusEnvVariables
${PipelinePath}/runScripts.py 22
# if the results look good...
${PipelinePath}/runScripts.py 1 2 3 ...
6) The final output lives in the subdirectory 'pgenes'. Other
directories contain intermediate results.