Unconfigured Ad

**Cole Trapnell** · 02-08-2010, 01:03 PM

Multiplying the average depth of coverage by the transcript length will give you the *estimated* number of reads assigned to each transcript. Note that Cufflinks' statistical model probabilistically assigns reads to individual isoforms, because when isoforms share exons, reads from those exons can't be unambiguously assigned to one particular isoform.

**shurjo** · 02-08-2010, 01:11 PM

OK, cool, thanks.

**steven** · 02-09-2010, 12:01 PM

Watch out when considering these read densities if they come from overlapping transcripts though: although it is counter-intuitive, they cannot be directly compared to each other as if they could indicate a relative "expression" or "splicing" value.

Indeed, since a read that overlaps a common exonic part of two splicing variants is used to compute both density values (although it comes from one of the variants only), it would be mathematically incorrect to assume that read densities can be compared between such overlapping splicing variants. If reads are assigned to each variant independently, cases can appear where a variant A that is present at a lower number of copy than a variant B is given a higher density.. so RPKM are only relevant across "genes" (non overlapping transcripts).

Therefore, reads should actually be assigned across transcripts (considering the whole set of variants) in a discriminative way, and that is why proper transcriptome reconstruction requires linear programming algorithms in order to find the optimal distribution.

Unless of course i am totally wrong

**thinkRNA** · 03-30-2010, 11:02 AM

Originally posted by Cole Trapnell View Post

Multiplying the average depth of coverage by the transcript length will give you the *estimated* number of reads assigned to each transcript. Note that Cufflinks' statistical model probabilistically assigns reads to individual isoforms, because when isoforms share exons, reads from those exons can't be unambiguously assigned to one particular isoform.

So, how do you determine the average depth. For example, consider
exon1| exon2|exon3

Lets say that Sample A exhibits
exon1|exon3 only 20% of the times
exon1|exon2|exon3 only 80% of the times

while Sample B exhibits
exon1|exon3 100% of the times

how will you determine average depth of this gene in sampleA?
I understand sampleB should be easier as you should see no reads spanning exon1| exon2 or exon2|exon3

How does cufflinks handle or calculate relative abundance of isoforms within a sample?

**Cole Trapnell** · 03-30-2010, 03:02 PM

Originally posted by thinkRNA View Post

So, how do you determine the average depth. For example, consider
exon1| exon2|exon3

Lets say that Sample A exhibits
exon1|exon3 only 20% of the times
exon1|exon2|exon3 only 80% of the times

while Sample B exhibits
exon1|exon3 100% of the times

how will you determine average depth of this gene in sampleA?
I understand sampleB should be easier as you should see no reads spanning exon1| exon2 or exon2|exon3

How does cufflinks handle or calculate relative abundance of isoforms within a sample?

In the upcoming Cufflinks paper, we have a large section in the supplement devoted to the math of how this is done. This isn't the best medium for explaining how the algorithm and the math work, so keep an eye out for the paper. We hope it will be out within a few weeks, and I'll post a message to our mailing list and link to the paper from the site when it appears.

For now, let me roughly describe how it works to give you some intuition. Suppose you have two isoforms A and B of a gene (we'll assume they are the same length for now). We want to calculate the abundances of A and B by counting up how many fragments came from A and how many came from B, and then dividing each count by the total number of fragments. This gives us the relative abundances for A and B. Now consider a fragment that could have come from either of two isoforms, A or B. We can't just assign this fragment to A or B - we need to "probabilistically assign" to both of them. The probability it came from A in this example is just equal to the relative abundance of A (and the same goes for B). Now we're in a bind, because to get the abundances we need the counts, and to get the (probabilistic) counts, we need the abundances! There's an additional complication that happens because of course transcripts have differing lengths.

Cufflinks calculates the abundances of transcripts by assuming that the fragments in an experiment were generated according to a basic model of how RNA-Seq works. With these assumptions one can write down a function that takes an assignment of relative abundances to each transcript as its argument. This function outputs a single number - the likelihood of observing the data in the experiment given the abundances supplied as the argument. The function is built by writing down the probability of observing each fragment in the actual data set, given some assignment of abundances to the transcript. It turns out that with this model, there's a single assignment of abundances to transcripts that *best* explains the data that came off of the sequencer. Cufflinks is able to find this value by systematically exploring the space of possible abundances for each transcript.

There are a lot of details in the paper, but this is the core idea. Once we have the abundances, we can use them to estimate how many fragments came from each transcript, and thus we can estimate the average depth of sequencing coverage for each one.

**thinkRNA** · 03-30-2010, 03:35 PM

Originally posted by Cole Trapnell View Post

This function outputs a single number - the likelihood of observing the data in the experiment given the abundances supplied as the argument. The function is built by writing down the probability of observing each fragment in the actual data set, given some assignment of abundances to the transcript. It turns out that with this model, there's a single assignment of abundances to transcripts that *best* explains the data that came off of the sequencer. Cufflinks is able to find this value by systematically exploring the space of possible abundances for each transcript.
.

Great explanation and definitely helps intuitively understand the process! Since numbers/examples help a lot, if we continue with your example of two isoforms A and B of a gene. Lets say that reads spanning isoform A = 100 reads and reads spanning isoform B=10 reads. So, cufflinks explores all possible *isoforms* using these numbers and figures out that probabilistically isoform A is more prevalent than isoform B.
Did I get this right?

I think, it takes into account what are all the possible isoform for this gene by taking in the gene annotation that we provide (is this where the GTF file comes in as input). It then taken the counts of reads and determines which combination of exons (or which isoform) is most probable.

**marcora** · 05-15-2010, 11:04 AM

Originally posted by Cole Trapnell View Post

Multiplying the average depth of coverage by the transcript length will give you the *estimated* number of reads assigned to each transcript.

Please correct me if I am wrong (which most definitively I am) in interpreting how to calculate the number of mapped reads per transcript. I am total noob at this RNAseq stuff

if a transcript looks like this in transcripts.gtf:

Code:

chr1	Cufflinks	transcript	3204563	3661579	1000	-	.	gene_id "Xkr4"; transcript_id "Xkr4"; FPKM "0.0459178326"; frac "1.000000"; conf_lo "0.000000"; conf_hi "0.110855"; cov "0.020363";
chr1	Cufflinks	exon	3204563	3207049	1000	-	.	gene_id "Xkr4"; transcript_id "Xkr4"; exon_number "1"; FPKM "0.0459178326"; frac "1.000000"; conf_lo "0.000000"; conf_hi "0.110855"; cov "0.020363";
chr1	Cufflinks	exon	3411783	3411982	1000	-	.	gene_id "Xkr4"; transcript_id "Xkr4"; exon_number "2"; FPKM "0.0459178326"; frac "1.000000"; conf_lo "0.000000"; conf_hi "0.110855"; cov "0.020363";
chr1	Cufflinks	exon	3660633	3661579	1000	-	.	gene_id "Xkr4"; transcript_id "Xkr4"; exon_number "3"; FPKM "0.0459178326"; frac "1.000000"; conf_lo "0.000000"; conf_hi "0.110855"; cov "0.020363";

then

transcript length = sum of length of exons = 3,631
cov = 0.020363

number of mapped reads/transcript = 3,631 * 0.020363 = 74

Is doesn't seem to be correct!

It would be great if cufflinks could report not only normalized FPKM values but also the raw number of mapped reads per transcript and exon (to use in software like DESeq, which I am using since I couldn't figure out how to use cuffcompare/cuffdiff with biological replicates for each conditition

)

Thanx in advance for your help and consideration.

**carmeyeii** · 09-07-2012, 10:06 AM

marcora,

it's been a while since you posted this, but it appears to me that 74 is the number or *bases* that mapped to that transcript, as cufflinks coverage is i think on a base-level. so to get the number of reads you would have to divide this number by your read length, i.e, by 50 or 75, or the length of your reads.

Does this make sense?

Carmen

**hubery_Bio** · 10-28-2013, 06:20 AM

Originally posted by carmeyeii View Post

marcora,

it's been a while since you posted this, but it appears to me that 74 is the number or *bases* that mapped to that transcript, as cufflinks coverage is i think on a base-level. so to get the number of reads you would have to divide this number by your read length, i.e, by 50 or 75, or the length of your reads.

Does this make sense?

Carmen

yeah,it makes sense!

Topics	Statistics	Last Post
Study Captures the First Moments of DNA Replication by SEQadmin2 Started by SEQadmin2, 07-24-2026, 12:17 PM	0 responses 25 views 0 reactions	Last Post by SEQadmin2 07-24-2026, 12:17 PM
Chemotherapy Leaves Detectable DNA Signatures in Childhood Tumors by SEQadmin2 Started by SEQadmin2, 07-23-2026, 11:41 AM	0 responses 20 views 0 reactions	Last Post by SEQadmin2 07-23-2026, 11:41 AM
Single-Cell Atlases Skew Toward European Ancestry, Analysis Finds by SEQadmin2 Started by SEQadmin2, 07-20-2026, 11:10 AM	0 responses 28 views 0 reactions	Last Post by SEQadmin2 07-20-2026, 11:10 AM
UC San Diego Bioengineers Map Gene Function in Human Stem Cells by SEQadmin2 Started by SEQadmin2, 07-13-2026, 10:26 AM	0 responses 38 views 0 reactions	Last Post by SEQadmin2 07-13-2026, 10:26 AM

Unconfigured Ad

Cufflinks transcripts.expr file

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Latest Articles

ad_right_rmr

News