HI,

I wrote to Cole for how Tophat tell the junctions (with CIGAR line xxxMxxxNxxxM) from deletions (xxxMxxxDxxxM). I guess they might be busy to answer all emails, so I dig the code myself a bit further and found a related part (I guess):

-------------- long_spanning_reads.cpp --------------

/*

* FIXME, currently just differentiating between a deletion and a

* reference skip based on length. However, would probably be better

* to denote the difference explicitly, this would allow the user

* to supply their own (very large) deletions

*/

if ((lb->right - lb->left - 1) <= max_deletion_length)

{

new_cigar.push_back(CigarOp(DEL, lb->right - lb->left - 1));

antisense_closure = prev_hit->is_spliced() ? prev_hit->antisense_splice() : curr_hit->antisense_splice();

}

else

{

new_cigar.push_back(CigarOp(REF_SKIP, lb->right - lb->left - 1));

antisense_closure = lb->antisense;

}

However, when I change the value of max_deletion_length option, the output does not change.

To reproduce what I said, I've made a small example file here:

-- test.fa (http://zlab.umassmed.edu/~dongx/temp/test.fa) contais 4 reference sequences for index (which are identical in sequences with different ID, in order to get multiple hit)

-- test.reads.fa (http://zlab.umassmed.edu/~dongx/temp/test.reads.fa) contains 3 reads (as below), where reads "read00" is directly extracted from the reference sequence (so should be perfectly matched). As you see, read1 and read2 are part of read00 (I marked the common part as upper case and the part deleted from read1/2 are marked as lower case). So, ideally, read1 should have CIGAR line as xxxM9DxxxxM, read2 has xxxM8DxxxM and read00 has xxxM.

$ cat test.reads.fa

>read1

CCATCTGTAGAAGTGGAAGGGCTGAAGGAGgAATACTAGAGGCCACTGCAT

>read2

CCATCTGTAGAAGTGGAAGGGCTGAAGGAGagAATACTAGAGGCCACTGCAT

>read00

CCATCTGTAGAAGTGGAAGGGCTGAAGGAGtagagaggagAATACTAGAGGCCACTGCAT

I made bowtie2 index from test.fa, called 'test', then run Tophat (v2.0.4):

tophat -o test --max-deletion-length 10 --no-convert-bam -p 8 test test.reads.fa; grep read test/accepted_hits.sam

read2 256 116 1 1 29M8D23M * 0 0 CCATCTGTAGAAGTGGAAGGGCTGAAGGAGAGAATACTAGAGGCCACTGCAT IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII AS:i:-29 XN:i:0 XM:i:0 XO:i:1 XG:i:8 NM:i:8 MD:Z:29^GTAGAGAG23 YT:Z:UU NH:i:4 CC:Z:1168 CP:i:1 HI:i:0

read00 256 116 1 1 60M * 0 0 CCATCTGTAGAAGTGGAAGGGCTGAAGGAGTAGAGAGGAGAATACTAGAGGCCACTGCAT IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII AS:i:0 XN:i:0 XM:i:0 XO:i:0 XG:i:0 NM:i:0 MD:Z:60 YT:Z:UU NH:i:4 CC:Z:1168 CP:i:1 HI:i:0

read2 0 1168 1 1 29M8D23M * 0 0 CCATCTGTAGAAGTGGAAGGGCTGAAGGAGAGAATACTAGAGGCCACTGCAT IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII AS:i:-29 XN:i:0 XM:i:0 XO:i:1 XG:i:8 NM:i:8 MD:Z:29^GTAGAGAG23 YT:Z:UU NH:i:4 CC:Z:1176 CP:i:1 HI:i:1

read00 256 1168 1 1 60M * 0 0 CCATCTGTAGAAGTGGAAGGGCTGAAGGAGTAGAGAGGAGAATACTAGAGGCCACTGCAT IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII AS:i:0 XN:i:0 XM:i:0 XO:i:0 XG:i:0 NM:i:0 MD:Z:60 YT:Z:UU NH:i:4 CC:Z:1176 CP:i:1 HI:i:1

read2 256 1176 1 1 29M8D23M * 0 0 CCATCTGTAGAAGTGGAAGGGCTGAAGGAGAGAATACTAGAGGCCACTGCAT IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII AS:i:-29 XN:i:0 XM:i:0 XO:i:1 XG:i:8 NM:i:8 MD:Z:29^GTAGAGAG23 YT:Z:UU NH:i:4 CC:Z:1916 CP:i:1 HI:i:2

read00 256 1176 1 1 60M * 0 0 CCATCTGTAGAAGTGGAAGGGCTGAAGGAGTAGAGAGGAGAATACTAGAGGCCACTGCAT IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII AS:i:0 XN:i:0 XM:i:0 XO:i:0 XG:i:0 NM:i:0 MD:Z:60 YT:Z:UU NH:i:4 CC:Z:1916 CP:i:1 HI:i:2

read2 256 1916 1 1 29M8D23M * 0 0 CCATCTGTAGAAGTGGAAGGGCTGAAGGAGAGAATACTAGAGGCCACTGCAT IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII AS:i:-29 XN:i:0 XM:i:0 XO:i:1 XG:i:8 NM:i:8 MD:Z:29^GTAGAGAG23 YT:Z:UU NH:i:4 HI:i:3

read00 0 1916 1 1 60M * 0 0 CCATCTGTAGAAGTGGAAGGGCTGAAGGAGTAGAGAGGAGAATACTAGAGGCCACTGCAT IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII AS:i:0 XN:i:0 XM:i:0 XO:i:0 XG:i:0 NM:i:0 MD:Z:60 YT:Z:UU NH:i:4 HI:i:3

As you see, read1 is not in the output. I also change to use different value of --max-deletion-length, but never can get read1 included. I also tried to use "-i 5" for example, but still no read1 there. I've also tested different deletion size in read2, and found only when the size < or = 8, it can be output; otherwise, it will be filtered. Why??

Hopefully I can get your help on this.

I also try to compile the source code myself. Everything works fine except an error in the make step. I made a note here:

Appreciated if you can help by the way.

Best,

-sterding

I wrote to Cole for how Tophat tell the junctions (with CIGAR line xxxMxxxNxxxM) from deletions (xxxMxxxDxxxM). I guess they might be busy to answer all emails, so I dig the code myself a bit further and found a related part (I guess):

-------------- long_spanning_reads.cpp --------------

/*

* FIXME, currently just differentiating between a deletion and a

* reference skip based on length. However, would probably be better

* to denote the difference explicitly, this would allow the user

* to supply their own (very large) deletions

*/

if ((lb->right - lb->left - 1) <= max_deletion_length)

{

new_cigar.push_back(CigarOp(DEL, lb->right - lb->left - 1));

antisense_closure = prev_hit->is_spliced() ? prev_hit->antisense_splice() : curr_hit->antisense_splice();

}

else

{

new_cigar.push_back(CigarOp(REF_SKIP, lb->right - lb->left - 1));

antisense_closure = lb->antisense;

}

However, when I change the value of max_deletion_length option, the output does not change.

To reproduce what I said, I've made a small example file here:

-- test.fa (http://zlab.umassmed.edu/~dongx/temp/test.fa) contais 4 reference sequences for index (which are identical in sequences with different ID, in order to get multiple hit)

-- test.reads.fa (http://zlab.umassmed.edu/~dongx/temp/test.reads.fa) contains 3 reads (as below), where reads "read00" is directly extracted from the reference sequence (so should be perfectly matched). As you see, read1 and read2 are part of read00 (I marked the common part as upper case and the part deleted from read1/2 are marked as lower case). So, ideally, read1 should have CIGAR line as xxxM9DxxxxM, read2 has xxxM8DxxxM and read00 has xxxM.

$ cat test.reads.fa

>read1

CCATCTGTAGAAGTGGAAGGGCTGAAGGAGgAATACTAGAGGCCACTGCAT

>read2

CCATCTGTAGAAGTGGAAGGGCTGAAGGAGagAATACTAGAGGCCACTGCAT

>read00

CCATCTGTAGAAGTGGAAGGGCTGAAGGAGtagagaggagAATACTAGAGGCCACTGCAT

I made bowtie2 index from test.fa, called 'test', then run Tophat (v2.0.4):

tophat -o test --max-deletion-length 10 --no-convert-bam -p 8 test test.reads.fa; grep read test/accepted_hits.sam

read2 256 116 1 1 29M8D23M * 0 0 CCATCTGTAGAAGTGGAAGGGCTGAAGGAGAGAATACTAGAGGCCACTGCAT IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII AS:i:-29 XN:i:0 XM:i:0 XO:i:1 XG:i:8 NM:i:8 MD:Z:29^GTAGAGAG23 YT:Z:UU NH:i:4 CC:Z:1168 CP:i:1 HI:i:0

read00 256 116 1 1 60M * 0 0 CCATCTGTAGAAGTGGAAGGGCTGAAGGAGTAGAGAGGAGAATACTAGAGGCCACTGCAT IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII AS:i:0 XN:i:0 XM:i:0 XO:i:0 XG:i:0 NM:i:0 MD:Z:60 YT:Z:UU NH:i:4 CC:Z:1168 CP:i:1 HI:i:0

read2 0 1168 1 1 29M8D23M * 0 0 CCATCTGTAGAAGTGGAAGGGCTGAAGGAGAGAATACTAGAGGCCACTGCAT IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII AS:i:-29 XN:i:0 XM:i:0 XO:i:1 XG:i:8 NM:i:8 MD:Z:29^GTAGAGAG23 YT:Z:UU NH:i:4 CC:Z:1176 CP:i:1 HI:i:1

read00 256 1168 1 1 60M * 0 0 CCATCTGTAGAAGTGGAAGGGCTGAAGGAGTAGAGAGGAGAATACTAGAGGCCACTGCAT IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII AS:i:0 XN:i:0 XM:i:0 XO:i:0 XG:i:0 NM:i:0 MD:Z:60 YT:Z:UU NH:i:4 CC:Z:1176 CP:i:1 HI:i:1

read2 256 1176 1 1 29M8D23M * 0 0 CCATCTGTAGAAGTGGAAGGGCTGAAGGAGAGAATACTAGAGGCCACTGCAT IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII AS:i:-29 XN:i:0 XM:i:0 XO:i:1 XG:i:8 NM:i:8 MD:Z:29^GTAGAGAG23 YT:Z:UU NH:i:4 CC:Z:1916 CP:i:1 HI:i:2

read00 256 1176 1 1 60M * 0 0 CCATCTGTAGAAGTGGAAGGGCTGAAGGAGTAGAGAGGAGAATACTAGAGGCCACTGCAT IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII AS:i:0 XN:i:0 XM:i:0 XO:i:0 XG:i:0 NM:i:0 MD:Z:60 YT:Z:UU NH:i:4 CC:Z:1916 CP:i:1 HI:i:2

read2 256 1916 1 1 29M8D23M * 0 0 CCATCTGTAGAAGTGGAAGGGCTGAAGGAGAGAATACTAGAGGCCACTGCAT IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII AS:i:-29 XN:i:0 XM:i:0 XO:i:1 XG:i:8 NM:i:8 MD:Z:29^GTAGAGAG23 YT:Z:UU NH:i:4 HI:i:3

read00 0 1916 1 1 60M * 0 0 CCATCTGTAGAAGTGGAAGGGCTGAAGGAGTAGAGAGGAGAATACTAGAGGCCACTGCAT IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII AS:i:0 XN:i:0 XM:i:0 XO:i:0 XG:i:0 NM:i:0 MD:Z:60 YT:Z:UU NH:i:4 HI:i:3

As you see, read1 is not in the output. I also change to use different value of --max-deletion-length, but never can get read1 included. I also tried to use "-i 5" for example, but still no read1 there. I've also tested different deletion size in read2, and found only when the size < or = 8, it can be output; otherwise, it will be filtered. Why??

Hopefully I can get your help on this.

I also try to compile the source code myself. Everything works fine except an error in the make step. I made a note here:

Appreciated if you can help by the way.

Best,

-sterding