Unconfigured Ad

**RockChalkJayhawk** · 04-05-2010, 08:23 AM

Originally posted by lpachter View Post

Regarding absolute counts, the whole point of Cufflinks is that it is not possible to obtain absolute read counts per transcript, because for many reads there is ambiguity as to which transcript they belong to. Cufflinks is probabilistically assigning reads to transcripts and thereby able to estimate expression of individual transcripts.

Why not just take the FPKM values categorized by cufflinks for each individual transcripts and multiply them by transcript length and number of aligned sequences? That would give you the tag count per transcript.

**lpachter** · 04-05-2010, 08:58 AM

Thats correct- the procedure RCJ suggests will give you an estimate of the actual tag count for each transcript.

**Siva** · 04-05-2010, 07:32 PM

Originally posted by lpachter View Post

Dear Siva,

I'd like to understand why your statistics collaborator cannot use the Cufflinks FPKM values together with their confidence intervals?

Regarding absolute counts, the whole point of Cufflinks is that it is not possible to obtain absolute read counts per transcript, because for many reads there is ambiguity as to which transcript they belong to. Cufflinks is probabilistically assigning reads to transcripts and thereby able to estimate expression of individual transcripts.

Dear L Pachter
Thank you very much for your reply thanks also to RCJ. I understand your point about probabilistically assigning reads to transcripts. I will get back to our statistical group about this. However the procedure suggested by RCJ and seconded by you has me a bit confused. I used to think that the algorithm that Cufflinks uses assigns reads based on probability and it will be different for each transcript and also vary according to genome location. So how can a mere multiplication of FPKM by transcript length and aligned sequences give us tag count per transcript? Am I missing something too obvious here?

thanks
Siva

**RockChalkJayhawk** · 04-06-2010, 04:46 AM

To answer your question, see this page <a href="http://cufflinks.cbcb.umd.edu/howitworks.html"> here</a>

To estimate isoform-level abundances, one must assign fragments to individual transcripts, which may be difficult because a read may align to multiple isoforms of the same gene. Cufflinks uses a statistical model of paired-end sequencing experiments to derive a likelihood for the abundances of a set of transcripts given a set of fragments.

Therefore, cufflinks is using probabilities to assign individual fragments to individual transcripts. Since the units of the counts are in fragments per KB transcript per million reads, you can convert them back to raw integers by the multiplying the length of a given transcript and number of reads in millions.

**RockChalkJayhawk** · 04-06-2010, 05:08 AM

Unfortunately I just noticed that when cufflinks calls transcripts, it doesn't report thier length. Only in the tmap reference files. All the other files only show the genomic coordinates, which isn't as helpful.

I will e-mail the author to see if this functional;ity can be easily added or worked around.

**Siva** · 04-06-2010, 09:28 AM

Hi RCJ
Thanks. Btw, I did single end sequencing and not paired ends. I don't think that that should be a problem in calculating integer counts per transcript. I have contacted our statistics collaborator regarding FPKM and will get back after I get a reply.

best
Siva

**Simon Anders** · 04-06-2010, 10:01 AM

In case this got lost in my lengthy post #12:

The reason why raw counts are advantageous to FPKM values for statistical inference is discussed in this thread, from post #6 onwards: http://seqanswers.com/forums/showthread.php?t=4349

**lpachter** · 04-07-2010, 07:56 AM

I'd like to point out again that in fact raw counts are not advantageous to FPKM.

FPKM is a unit for reporting an estimate of expression. Reports of expression estimates (whether in FPKM or other units) are by definition based on statistical inference from raw read counts.

**Simon Anders** · 04-07-2010, 08:54 AM

Of course, FPKM is a proper unit to report expression, more so than raw counts.

However, if you want to perform a statistical test to decide the question whether expression in two conditions is significantly different, then a test that works on the level of raw counts is more powerful than one that works on the level of normalized expression, for the reasons that we discuss in our paper.

Of course, if I know transcript lengths and library sizes, I can always convert between one type of information and the other. If I have two test procedures, one that only asks for raw read counts, or one that only asks for FPKM values, with neither asking for transcript lengths, this two procedures will necessarily do something different.

Actually, not only DESeq but also cuffdiff works on the level of raw counts, as Cole commented here: http://seqanswers.com/forums/showpos...59&postcount=9

Finally, on the risk of being nit-picking: If one divides counts by transcript length or total reads, one not only changes the unit but also the type of quantity reported. If I tell you either that I traveled 10 miles or 13 km, this is the same quantity in different units. If it took me 2 hours and I only tell you that made 5 miles per hour, this is a different quantity, namely velocity rather than distance. Similarly, FPKM is a unit of expression, i.e., it is deemed to be proportional to molar concentration of transcript molecules. Counts is not a unit of expression. Nevertheless, it is advantageous to compare counts rather then expression values to do inference on differential expression.

**lpachter** · 04-07-2010, 10:00 AM

Perhaps I'll risk being nitpicky as well regarding a sentence in your nitpick.

You write that "Nevertheless, it is advantageous to compare counts rather then expression values to do inference on differential expression."

Now before you wrote _raw_ counts and now just counts. I agree with the latter (counts) but not with your statement before of using raw counts.

Consider the following simple example:
A gene with two isoforms A and B (with equal lengths), with the property that in experiment 1 isoform A is expressed at double the amount of B, vs. experiment 2 in which B is double A. The number of read counts from this gene may be identical in both experiments. However it may be possible to infer the changes in the expression levels of the individual transcripts. That is because of where the reads occur (e.g. one transcript may contain a splice junction and reads there may help to infer its abundance).

Now given _expression values_ rather than the raw counts, it may be clear that there has been differential expression. But looking at counts alone there would be no way to tell.

Having said this, I'm aware of your paper and I think your approach is valuable (and I agree with your argument that its better than the Poisson assumption). But what needs to be used is the probabilistic assignment of read counts to transcripts, rather than just raw read counts. And these _can_ be obtained from estimates of expression together with transcript lengths (as you point out). But again, they cannot be obtained from raw read counts alone.

**Simon Anders** · 04-07-2010, 12:51 PM

Maybe our discussion revolved a bit on a misunderstanding regarding what we are aiming for. I want to get expression of genes, summing over transcripts, while you want to distinguish isoforms. (I may have worked too much with yeast and so got used to not having much alternative splicing going on in my data. )

If I don't care about isoforms or think that my coverage is too low to distinguish isoforms anyway, I expect to get optimal power by simply summing everything up.

Our (and edgeR's) negative binomial approach depends on the counts being "raw" (not normalized by anything) to get the shot noise right. So, if we want to use DESeq to distinguish isoforms, we would have to count reads for each exon and work on this level. (Actually, this might not work too well, as we would run into trouble with exons having different lengths in different isoforms.)

Cuffdiff is, as I understand it, designed to deal with such issues, while our approach ignores them. I expect that DESeq, in compensation for being unsuitable to detect differences in isoform proportions as in your example, achieves much better detection power for differences in total expression (per gene, summing over isoforms), especially at very low counts.

As I am not clear on how biological noise is taken into account by cuffdiff I cannot be fully sure whether this expectation will hold (and I'm quite curious to learn more about cuffdiff once your paper is out).

**Siva** · 04-07-2010, 09:08 PM

We are all learning a great deal from your discussions. Thanks, I now clearly understand many aspects of RNA-seq data normalization. Unfortunately, I am unable to convince our statistician to use cufflinks FPKM values even with the confidence intervals.

So I have two questions:
1. I would like to know if we can get transcript length in Cufflinks outputs so I can convert FPKM values to absolute read counts? I would assume this would need a script to automate.

2. Has someone got a script that uses the Bowtie SAM output to directly report counts per transcript and transcript length? Is there a software that I can use?

Siva

**Simon Anders** · 04-08-2010, 01:39 AM

Hi Siva

Originally posted by Siva View Post

2. Has someone got a script that uses the Bowtie SAM output to directly report counts per transcript and transcript length?

I do, and I've just written up the documentation for it yesterday: http://www-huber.embl.de/users/ander...doc/count.html

It is part of HTSeq, a Python framework to process HTS data, which I am working on currently. I'm still rounding some corners and will announce its release soon, but you can already use it, of course. Just tell me if you find any bugs or problems. For full documentation, see here: http://www-huber.embl.de/users/anders/HTSeq

Simon

**mgogol** · 04-08-2010, 05:26 AM

one way to do it

Code:

bamToBed -i accepted_hits.sorted.bam > accepted_hits.sorted.bed
coverageBed -a accepted_hits.sorted.bed -b mm9.bed | sort -r -n -k7 > counts.txt

This will get the per gene counts from a sam file, but doesn't consider introns/exons.

bamToBed and coverageBed are from bedtools and mm9.bed is a bed file describing gene positions. I'm using this for genes though, not transcripts.

On second thought, maybe this is too crude.

**lpachter** · 04-08-2010, 05:54 AM

My personal advice to biologists performing RNA-Seq experiments is to care about isoforms. But that issue aside, as the paper by Li et al. from the Dewey lab recently pointed out, probabilistic assignment of reads is necessary even for genes, because many reads map to multiple genes. I therefore contest the statement that DESeq outperforms other methods simply because it uses raw read counts, even in yeast.

Topics	Statistics	Last Post
Engineered Protein Motor Takes Its First Steps Along DNA Track by SEQadmin2 Started by SEQadmin2, Yesterday, 11:05 AM	0 responses 7 views 0 reactions	Last Post by SEQadmin2 Yesterday, 11:05 AM
High-Resolution Sequencing Exposes Hidden Toxoplasma Diversity by SEQadmin2 Started by SEQadmin2, 07-02-2026, 11:08 AM	0 responses 28 views 0 reactions	Last Post by SEQadmin2 07-02-2026, 11:08 AM
New AI Model Captures Long-Range Genomic Signals to Improve RNA Splice Site Prediction by SEQadmin2 Started by SEQadmin2, 06-30-2026, 05:37 AM	0 responses 28 views 0 reactions	Last Post by SEQadmin2 06-30-2026, 05:37 AM
Large-Scale Protein Screen Uncovers Hidden Regulators of Alternative Polyadenylation by SEQadmin2 Started by SEQadmin2, 06-26-2026, 11:10 AM	0 responses 27 views 0 reactions	Last Post by SEQadmin2 06-26-2026, 11:10 AM

Unconfigured Ad

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Latest Articles

ad_right_rmr

News