Protocol:Pseudogene

From Shiu Lab

Jump to: navigation, search

Contents

Purpose

  • Identification of pseudogenes in intergenic space.

Requirements

  • Datasets
    1. BLAST tabular output using a set of protein sequences as the query and a set of intergenic genome sequences as the subject. In the example: e3_vs_aly.
    2. The protein sequence file in fasta format. In the example: e3.fa.
    3. The intergenic genome sequence. In the example: aly_genome.
    4. A BLOSUM50 matrix (in the pseudogene script folder).
  • Scripts
    • For local user with access to our cluster - calculon:/home/public/scripts/pseudo/
    • For others, you need:
      • Pseudo_pkg.tgz: Standalone script package in a tgz file with scripts, the parameter file, and BLOSUM50 table.
      • BLAST: You need to download/install Fasta and specify where the executable is in the parameter file (see below).
      • Fasta 3.x: You need to download/install Fasta and specify where the executable is in the parameter file (see below).
  • Examples
    • calculon:/home/shiu/project/e3/2_pseudo/

Running The Pipeline

Using the wrapper, Pseudo_wrap.py

  • This is the wrapper for the pseudogene pipeline. For some help, type:
python pseudo_wrap.py

Run the blast

  • Run a tblastn of the protein sequence (query) vs. the intergenic genome sequence (subject/database)
  • If you don't want to use our standards for filtering the blast output (E value < 1e-5, Identity > 40%, Match length > 30aa, Match length > 5% of the query sequence) filter the blast output using ParseBlast.py.
 python /home/shiu/codes/ParseBlast.py -f get_qualified4 -blast e3_vs_aly -fasta e3.fa -E 5 -I 40 -L 30 -P 0.05 -Q 1

Create a soft link to shorten the name (the names are going to get quite long)

ln -s e3_vs_aly_E5I40L30P5Q1.qlines e3_vs_aly.q

Make the parameter file

  • The wrapper requires a parameter file and the minimum required lines are:
    • b_out: BLAST ouput in tabular format.
    • p_seq: Query protein sequences.
    • g_seq: Subject DNA sequenec.
    • il_t: intron length threshold for linking pseudoexons.
    • b_filter: filter BLAST result based on E-value, identity, length, and proportion match or not.
    • p_codes: Path to pseudogene pipeline script.
    • o_codes: Path to ParseBlast.py and its dependencies.
    • f_dir: Path to Fasta 3.x executables, particularly tfasty.
    • blosum: Path to the BLOSUM50 table.
  • Here is the content of an example parameter file:
b_out=e3_alygenome
p_seq=e3.fa
g_seq=aly_genome
il_t=500
b_filter=y
p_codes=/export/home/shiu/codes/_misc/pseudo/
o_codes=/export/home/shiu/codes/
f_dir=/export/home/shiu/bin/fasta3/
blosum=/export/home/shiu/codes/_misc/pseudo/blosum50.matrix

Run the wrapper

  • To run the wrapper, issue the following command:
python /home/shiu/codes/_misc/pseudo/pseudo_wrap.py parameters
  • Go to the last section of this webpage to see how to interpret the results

Perform repeat masking on the results

  • Our lab uses RepeatMasker for this task. It's open source and fairly simple.

If you want to run the whole thing step by step

  • note: any script made by the Shiu lab will give you help with its purpose and parameters if you type "python" followed by the name of the script in unix (that is, running the script with no parameters)

Run the blast

  • Run a tblastn of the protein sequence (query) vs. the intergenic genome sequence (subject/database)

Filter BLAST output

  • Use the ParseBlast module to get qualified matches. The parameters used for our 2009 paper:
    1. E value < 1e-5
    2. Identity > 40%
    3. Match length > 30aa
    4. Match length > 5% of the query sequence
python /home/shiu/codes/ParseBlast.py -f get_qualified4 -blast e3_vs_aly -fasta e3.fa -E 5 -I 40 -L 30 -P 0.05 -Q 1

output: e3_vs_aly_E5I40L30P5Q1.qlines

  • Create a soft link to shorten the name (the names are going to get quite long)
ln -s e3_vs_aly_E5I40L30P5Q1.qlines e3_vs_aly.q

output: e3_vs_aly.q

Get phase 1 pseudoexons

  • You need to know the max gap size for linking pseudoexons. In this case, we set it to 500.
python /export/home/public/scripts/pseudo/script_step2e.py e3_vs_aly.q 500

output: e3_vs_aly.q_G500.PE

Link pseudoexons into phase 1 pseudogenes

  • Intron length set at 500
python /export/home/public/scripts/pseudo/script_step3b.py e3_vs_aly.q_G500.PE 500

output: e3_vs_aly.q_G500.PE_I500.PS1

Run SW algorithm to detect framshifts and premature stops

  • First we need to get a pairs file and a subj coordinate file for the phase 1 pseudogenes:
python /home/shiu/codes/_misc/pseudo/script_step3.5.py e3_vs_aly.q_G500.PE_I500.PS1

output: e3_vs_aly.q_G500.PE_I500.PS1.pairs, e3_vs_aly.q_G500.PE_I500.PS1.subj_coord

  • Get the putative pseudogene regions out of the genome sequence using the coordinate file just generated:
python /home/shiu/codes/FastaManager.py -f get_stretch4 -fasta aly_genome -coords e3_vs_aly.q_G500.PE_I500.PS1.subj_coord

output: e3_vs_aly.q_G500.PE_I500.PS1.subj_coord.fa, e3_vs_aly.q_G500.PE_I500.PS1.subj_coord.missing

  • Now run tfasty with:
    1. The pair file just generated two steps before (-g).
    2. The protein sequences used in the original BLAST search (-i).
    3. The phase 1 pseudogene sequences from the previous step (-j) (.fa, not .missing).
python /home/shiu/codes/BlastUtility.py -f batch_sw -p tfasty35 -g e3_vs_aly.q_G500.PE_I500.PS1.pairs -i e3.fa -j e3_vs_aly.q_G500.PE_I500.PS1.subj_coord.fa -bdir /home/shiu/bin/fasta3/bin -d 1

output: e3_vs_aly.q_G500.PE_I500.PS1.pairs.err, e3_vs_aly.q_G500.PE_I500.PS1.pairs.out, e3_vs_aly.q_G500.PE_I500.PS1.pairs.raw

  • Get the final output (.disable_count) file which contains the pseudogenes and info on frameshift and stop:
python /home/shiu/codes/_misc/pseudo/script_step6.py /home/shiu/codes/_misc/pseudo/blosum50.matrix e3_vs_aly.q_G500.PE_I500.PS1.pairs.sw.out

output: e3_vs_aly.q_G500.PE_I500.PS1.pairs.sw.out.disable_count

Perform repeat masking on the results

  • Our lab uses RepeatMasker for this task. It's open source and fairly simple.

The Final Output

  • The .disable_count file looks like:
#Bradi1g22780.1 scaffold_1|5439851-5440363 99-237:419-4 1 2 0 1 0
NILLAYELTHFLQRKRSGSAGYAALILDISKAYDRVDWNF----LRDMMIKLGFHRSWVELIMKCITSVRYQIKVNGDAT
DVIIPQRGLRQGDPLSPYLFLICAEGFSAIWDKL-SKPKLTGGLGFRDLHSFNLAMLARQAWRL
NKLIDGPVTHLIIQLTHNSESPAATKVSSRKSHLIRSYAFFKSSLRSIQPFLPFLR-LIE*TASCTTTILSVLNLPGIKL
DCAGPINFIIMPFNLPTIVFVMT---LKSTLHKL/SRPKLSNFLRLHHLTN*DHNRIICIHWEL
  • The first line with # contains the following info:
    • Query name: Bradi1g22780.1
    • Subject name: scaffold_1|5439851-5440363
      • The coordinates on the right side of the "|" is the region on the scaffold that matched the query protein in the original BLAST.
    • Matching region: 99-237:419-4
      • the matching pseudogene region on the protein and scaffold, respectively, separated by the ":".
    • Expect value: 1
      • tfasty E value
    • 2 0 1 0 - pseudogene evidence code.
      • 1st - Type 1 Stop: Type 1 stop is a stop codon found in subject and is far away from intronic region. High confidence.
      • 2nd - Type 2 Stop: Type 2 stop is a stop codon found in subject and is close to intronic region. Low confidence.
      • 3rd - Type 1 frameshift: Type 1 frameshift is a frameshift found in subject and is far away from intronic region. High confidence.
      • 4th - Type 2 frameshift: Type 2 stop is a frameshift found in subject and is close to intronic region. Low confidence.
  • The following lines: alignments between query and subject (-: gap, *: stop, | or /: frameshift).</pre>
Personal tools