**henry** · 09-07-2009, 03:25 AM

Originally posted by Mark

Hi All

I'm trying to use tophat with the --GFF argument so as to get RPKM data for some yeast experiments. My question is that the .junc file produced by tophat seems not to be consistent with the exon data supplied in the GFF file. For example, when the GFF specifies

Scchr01 SGD gene 87287 87753 . + . ID=YAL030W

Scchr01 SGD mRNA 87287 87753 . + . ID=YAL030WmRNA;Parent=YAL030W

Scchr01 SGD exon 87287 87388 . + 0 ID=YAL030Wexon1;Parent=YAL030WmRNA

Scchr01 SGD exon 87502 87753 . + 0 ID=YAL030Wexon2;Parent=YAL030WmRNA

the .junc file specifies

Scchr01 87387 87501 +

The position 87387 appears incorrect if it is supposed to indicate the first base of the intron (as 87501 appears to indicate the last position of the intron), or even the last base of the exon. Am I misinterpreting this, or is there a problem here?

Thanks for your help

I have no idea. I'm trying to install tophat, but errors are occurring during installation. Maybe I will run into the same problem you have in the near future. I'm also hoping someone will fix it. ^ ^

**sdriscoll** · 09-23-2009, 03:31 PM

Don't know if you got this sorted out, but from what I have seen in my runs with Tophat, it isn't SUPER accurate when it comes to positions. Output tends to vary a little. What I see from your post is that the junction specified in your .junc file is a junction between those two exons (lines 3 and 4). I'm not surprised that Tophat has it a click or two off. I have sequencing from several lanes, and when I compare the junction.bed files in UCSC's browser I can easily see that a junction found in one lane is the same as one found in another lane. However, if I look at the numbers in the junction.bed files, the start and end points of those junctions are not equal. They are sometimes up to 10 positions off from each other.

**Cole Trapnell** · 09-23-2009, 08:27 PM

A splice junction identified in two different runs may look slightly different in the bed file. This is not due to alignment accuracy; it's actually a feature of the output format.

Each bed record in junctions.bed contains two blocks, one on the left side of the intron and one on the right side. The length of these blocks is determined by looking at all the alignments that span the junction and measuring how far the left and right "overhangs" extend for each read. That is, suppose a 75bp read spans a junction in such a way that the first 20bp of the read fall on the left exon and the last 55bp fall on the right exon. If there is only one alignment spanning this intron, then the bed record for it will have a first block of 20bp and a second block of 55bp, and the distance between them in the genomic coordinate space will be the length of the intron.

If there are multiple alignments across the junction, then each block is as big as the biggest overhang from any read, on each side. Does this make sense?
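To make the block rule concrete, here is a minimal Python sketch of it. The function name `junction_blocks` and the intron coordinates (1000 to 1200) are made up for illustration, and the intron is assumed to be given as a zero-based, half-open interval; this is not TopHat's own code.

```python
def junction_blocks(intron_start, intron_end, overhangs):
    """Build BED-style fields for one junction (illustrative only).

    intron_start, intron_end: zero-based, half-open intron interval.
    overhangs: list of (left_bp, right_bp), one pair per spanning read.
    """
    # Each block is as long as the largest overhang seen on that side.
    left = max(l for l, _ in overhangs)
    right = max(r for _, r in overhangs)
    chrom_start = intron_start - left
    chrom_end = intron_end + right
    block_sizes = (left, right)
    # The second block starts right after the intron.
    block_starts = (0, left + (intron_end - intron_start))
    return chrom_start, chrom_end, block_sizes, block_starts


# Same hypothetical intron, different sets of spanning reads.
run_a = junction_blocks(1000, 1200, [(20, 55)])
run_b = junction_blocks(1000, 1200, [(20, 55), (30, 45), (10, 65)])
print(run_a)
print(run_b)
```

The two records start and end at different positions and have different block lengths, even though both describe the same intron.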

Thus, since the number of reads spanning a given junction will naturally vary from run to run, as will how they fall across it, the length of the blocks will vary. However, the actual intron coordinates reflected by a given bed record should be consistent from run to run, at least as long as there are any alignments at all spanning that intron.

It's straightforward to extract the actual intron coordinates from the bed records after a run, and in the upcoming version of TopHat (1.0.11), I provide a script to do so.
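For anyone who wants to do this by hand in the meantime, the following is a rough Python sketch, not the script mentioned above. It assumes junctions.bed uses the standard 12-column BED layout with exactly two blocks per record (block sizes in column 11, block starts in column 12). The function name `intron_from_bed` and the example record are made up for illustration.

```python
def intron_from_bed(line):
    """Return (chrom, intron_start, intron_end, strand) for one BED12 record.

    Coordinates are zero-based, half-open, as in BED itself.
    Assumes the record has exactly two blocks.
    """
    fields = line.rstrip("\n").split("\t")
    chrom = fields[0]
    chrom_start = int(fields[1])
    strand = fields[5]
    # BED allows a trailing comma in the block lists, so strip it first.
    block_sizes = [int(x) for x in fields[10].rstrip(",").split(",")]
    block_starts = [int(x) for x in fields[11].rstrip(",").split(",")]
    intron_start = chrom_start + block_sizes[0]   # first base after left block
    intron_end = chrom_start + block_starts[1]    # first base of right block
    return chrom, intron_start, intron_end, strand


# Hypothetical record: 20bp left block, 55bp right block, 200bp intron.
example = "chr1\t980\t1255\tJUNC1\t1\t+\t980\t1255\t255,0,0\t2\t20,55\t0,220"
intron = intron_from_bed(example)
print(intron)
```

Because the block lengths cancel out of the calculation, the result depends only on where the intron sits, not on how far the reads overhang it.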

**Cole Trapnell** · 09-23-2009, 08:36 PM

I should have posted a reply to Mark's earlier question as well. The .juncs file format is zero-based (as opposed to the 1-based GTF file), and the left coordinate marks the rightmost base of the *left* exon. The right coordinate in each line marks the leftmost base of the *right* exon. Think of it as "each line says concatenate right base to the left base, leaving out everything in between".
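Applied to Mark's example, the arithmetic works out as in this small Python sketch. The helper name `juncs_from_exons` is hypothetical; it takes 1-based, inclusive exon coordinates as they appear in the GFF and subtracts 1 from each to get the zero-based positions.

```python
def juncs_from_exons(chrom, strand, exons):
    """Turn 1-based inclusive exon (start, end) pairs into .juncs-style lines."""
    exons = sorted(exons)
    lines = []
    # Pair each exon with the next one along the chromosome.
    for (_, left_end), (right_start, _) in zip(exons, exons[1:]):
        left = left_end - 1        # rightmost base of the left exon, zero-based
        right = right_start - 1    # leftmost base of the right exon, zero-based
        lines.append(f"{chrom}\t{left}\t{right}\t{strand}")
    return lines


# The two YAL030W exons from the GFF quoted earlier in the thread.
juncs = juncs_from_exons("Scchr01", "+", [(87287, 87388), (87502, 87753)])
print(juncs[0])
```

This prints `Scchr01`, `87387`, `87501`, `+` separated by tabs, which matches the line in Mark's .junc file: 87388 and 87502 in 1-based GFF terms become 87387 and 87501 in zero-based terms.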

**sdriscoll** · 09-23-2009, 09:04 PM

Thanks Cole. Your responses are very helpful in understanding the outputs. I'm actually a programmer working for a lab and they have charged me with learning how to use Tophat and Bowtie. From what you wrote here it sounds like if I were to compare intron coordinates between two runs in the .bed files I should be able to filter out matching junctions and reveal junctions from one run that did not show up in another.
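A rough Python sketch of that comparison follows. It assumes 12-column BED records with two blocks each, and it skips blank lines and any `track` or `#` header lines. The function name `intron_keys` and all records below are made up for illustration.

```python
def intron_keys(bed_lines):
    """Collect (chrom, intron_start, intron_end, strand) for each record."""
    keys = set()
    for line in bed_lines:
        # Skip blank lines and header lines, if present.
        if not line.strip() or line.startswith(("track", "#")):
            continue
        f = line.rstrip("\n").split("\t")
        start = int(f[1])
        sizes = [int(x) for x in f[10].rstrip(",").split(",")]
        starts = [int(x) for x in f[11].rstrip(",").split(",")]
        # Intron runs from the end of block 1 to the start of block 2.
        keys.add((f[0], start + sizes[0], start + starts[1], f[5]))
    return keys


# Hypothetical records from two runs; block lengths differ between runs.
run_a = [
    "track name=junctions",
    "chr1\t980\t1255\tJ1\t1\t+\t980\t1255\t255,0,0\t2\t20,55\t0,220",
    "chr1\t4970\t5340\tJ2\t4\t+\t4970\t5340\t255,0,0\t2\t30,40\t0,330",
]
run_b = [
    "chr1\t970\t1265\tJ1\t3\t+\t970\t1265\t255,0,0\t2\t30,65\t0,230",
    "chr2\t1980\t2550\tJ2\t2\t-\t1980\t2550\t255,0,0\t2\t20,50\t0,520",
]

a, b = intron_keys(run_a), intron_keys(run_b)
shared = a & b    # junctions found in both runs
only_a = a - b    # junctions seen only in run A
only_b = b - a    # junctions seen only in run B
print(shared, only_a, only_b)
```

Keying on the intron coordinates rather than on the raw start and end columns is what lets the first record in each run match despite their different block lengths.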


tophat .junc file
