How do I convert the FPKM values output by Cufflinks in genes.expr to raw counts to import into DESeq?
-
counting reads
hi,
I struggled to figure this out myself for a while.
I've wound up using Simon Anders' HT-Seq:
http://www-huber.embl.de/user...unt.html#count
The starting point for this, however, is the sam file, not FPKM.
Chris
-
@Chrisbala:
We've discussed this issue on a number of other threads before. It doesn't make sense to convert FPKM values to read count estimates for the purposes of running DESeq.
If you want to test gene-level differential expression, you can just use HT-Seq as you noted. However, you should know that differential expression analysis of count data will not allow you to compare expression values of different transcripts (or genes) within a single experiment, nor will you be able to test for isoform-level differential expression. Unfortunately, the DESeq approach is simply not suited to that. Furthermore, even for gene-level differential expression, DESeq can be inaccurate when genes overlap (or when reads cannot be assigned uniquely to a gene due to duplicates, etc.).
-
Yes, this issue has been discussed a number of times. One way to convert FPKM values is to multiply each FPKM value by the transcript length in kilobases and by the number of mapped reads in millions. Transcript length can be obtained using HTSeq.
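As a rough illustration of the arithmetic described above, here is a minimal sketch that inverts the usual definition FPKM = fragments / (length in kb x mapped fragments in millions). The function names and all numbers are hypothetical, and the result is a fractional estimate rather than an observed count.

```python
def count_to_fpkm(fragments, transcript_length_bp, mapped_fragments):
    """FPKM: fragments per kilobase of transcript per million mapped fragments."""
    length_kb = transcript_length_bp / 1_000
    mapped_millions = mapped_fragments / 1_000_000
    return fragments / (length_kb * mapped_millions)


def fpkm_to_approx_count(fpkm, transcript_length_bp, mapped_fragments):
    """Invert the FPKM definition to get an approximate fragment count."""
    length_kb = transcript_length_bp / 1_000
    mapped_millions = mapped_fragments / 1_000_000
    return fpkm * length_kb * mapped_millions


# Hypothetical transcript: 2 kb long, FPKM of 12.5, 20 million mapped fragments.
approx = fpkm_to_approx_count(12.5, 2_000, 20_000_000)
print(approx)  # 500.0

# Round trip back to FPKM as a sanity check.
print(count_to_fpkm(approx, 2_000, 20_000_000))  # 12.5
```

The back-calculated value only recovers what the FPKM already encodes; it says nothing about how many reads were actually assigned unambiguously to the transcript.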
I can't understand why it is not valid to convert FPKM values into counts and use edgeR or DESeq to test for differential expression, especially when you want to use biological replicates. Get the FPKM values for the different samples from cufflinks, then convert the FPKM values into read counts and use any of the R programs to test for differential expression.
-
Originally posted by lpachter:
Furthermore, even for gene level differential expression, DESeq can be inaccurate when genes overlap (or when reads cannot be assigned uniquely to a gene due to duplicates, etc.)
This requires, of course, that one properly discard reads mapping to overlaps. This is what htseq-count ensures. htseq-count is no more than a simple script, but it is important to note that it does this. On the Bioconductor website, you will find (in the workshop materials section) several explanations of how to do the counting in R, and unfortunately, none of these properly takes care of overlaps.
Last edited by Simon Anders; 08-15-2010, 09:20 AM.
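To make the point about discarding reads in overlaps concrete, here is a toy sketch. It is not htseq-count's actual implementation; the gene coordinates, the read list and the function name are all made up, and everything is assumed to lie on a single chromosome and strand with half-open intervals.

```python
def count_reads(genes, reads):
    """Count reads per gene, discarding reads that overlap more than one gene.

    genes: dict mapping gene name -> (start, end), half-open intervals.
    reads: list of (start, end) alignments, half-open intervals.
    """
    counts = {name: 0 for name in genes}
    ambiguous = 0   # read overlaps two or more genes: not counted for any
    no_feature = 0  # read overlaps no gene at all
    for r_start, r_end in reads:
        hits = [name for name, (g_start, g_end) in genes.items()
                if r_start < g_end and g_start < r_end]
        if len(hits) == 1:
            counts[hits[0]] += 1
        elif hits:
            ambiguous += 1
        else:
            no_feature += 1
    return counts, ambiguous, no_feature


# Hypothetical annotation: geneA and geneB overlap between 450 and 500.
genes = {"geneA": (100, 500), "geneB": (450, 900)}
reads = [(120, 170), (460, 510), (600, 650), (950, 1000)]

counts, ambiguous, no_feature = count_reads(genes, reads)
print(counts)      # {'geneA': 1, 'geneB': 1}
print(ambiguous)   # 1
print(no_feature)  # 1
```

The read at 460-510 falls in the overlap and is dropped instead of being credited to both genes, so every remaining count corresponds to exactly one read.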
-
Originally posted by Balat:
I can't understand why it is not valid to convert FPKM values into counts and use edgeR or DESeq to test for differential expression.
However, the whole point of cufflinks is to deal with the fact that most reads will map to several transcripts, and each read can hence influence the FPKM values of all these transcripts, and it will definitely not augment each count by one.
A crucial aspect of DESeq and edgeR is that they assess the shot noise by assuming that each counting unit is evidence of one sequencing read, and hence the counting noise follows a Poisson distribution. In cufflinks' output this is not the case. Instead, cufflinks calculates the uncertainty for you using more involved math.
DESeq and edgeR use a simple way to include counting noise and go to some lengths to assess biological noise. cufflinks offers you a more sophisticated estimate for the counting noise that can deal with reads mapping to multiple transcripts, but, at least so far, no means of assessing biological noise.
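The difference between counting noise and biological noise can be illustrated with a small simulation. This is only a sketch under stated assumptions: the mean of 20, the gamma shape of 5 and the sample size are arbitrary, and the gamma-Poisson mixture is a common way to model overdispersed counts, not a description of what any of the tools above do internally.

```python
import math
import random
import statistics


def poisson_sample(rng, lam):
    """Draw one Poisson variate with Knuth's method (fine for small means)."""
    threshold = math.exp(-lam)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= threshold:
            return k
        k += 1


def variance_to_mean(counts):
    """Ratio of sample variance to sample mean; about 1 for pure Poisson noise."""
    return statistics.variance(counts) / statistics.mean(counts)


rng = random.Random(0)
mean_expr = 20.0  # hypothetical expected count for one gene
n = 5000

# Counting (shot) noise only: the underlying abundance is identical every time.
technical = [poisson_sample(rng, mean_expr) for _ in range(n)]

# Biological noise on top: the abundance itself varies between replicates
# (gamma-distributed here), and sequencing then adds Poisson counting noise.
shape = 5.0
biological = [poisson_sample(rng, rng.gammavariate(shape, mean_expr / shape))
              for _ in range(n)]

print("variance/mean, counting noise only:", variance_to_mean(technical))
print("variance/mean, with biological noise:", variance_to_mean(biological))
```

With counting noise alone the ratio stays close to 1, while the second ratio is several times larger; a test that assumes only Poisson noise would therefore underestimate the spread between biological replicates.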
There are two fundamentally different tasks that you should not mix up, and that are served by different tools:
1. If you want to compare the abundance of two different transcripts in the same sample, cufflinks will allow you to do this even if these transcripts overlap with other transcripts that should stay out of the comparison.
2. If you have two different experimental conditions (or tissues or genotypes) and you want to know whether a given gene changes its expression strength due to the condition, you need to assess biological noise between replicates to know whether the observed difference is significantly stronger than the difference between replicates, i.e., whether it is really due to the change in the experimental condition and not just due to biological variability.
This is what DESeq and edgeR do. Of course they cannot see alternative splicing because they require you to lump all transcripts of a gene together.
The tool that sits awkwardly in between the two use cases is cuffdiff. It tests whether a transcript has a different concentration in two samples. The problem with this is that many of its users forget that there is a big difference between asking (i) "Is the concentration of transcript X in the two samples different?" and (ii) "Is the difference in concentration sufficiently large to make it unlikely that it is only due to biological variability?" Only if you can answer yes to (ii) can you attribute your observation to the fact that your samples received different experimental treatments.
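The gap between questions (i) and (ii) can be shown with made-up numbers. The counts below are hypothetical, equal sequencing depth is assumed for all samples, and neither statistic is the test that cuffdiff, DESeq or edgeR actually performs; the sketch only contrasts a Poisson-only view with one that uses the spread between replicates.

```python
import math
import statistics

# Hypothetical counts for one gene, three biological replicates per condition.
control = [100, 180, 60]
treated = [200, 310, 150]


def poisson_only_z(a, b):
    """Question (i): pool the replicates and assume pure counting noise,
    so the variance of each pooled total equals the total itself."""
    total_a, total_b = sum(a), sum(b)
    return (total_b - total_a) / math.sqrt(total_a + total_b)


def replicate_aware_t(a, b):
    """Question (ii): Welch-style statistic whose standard error is
    estimated from the observed variability between replicates."""
    se = math.sqrt(statistics.variance(a) / len(a)
                   + statistics.variance(b) / len(b))
    return (statistics.mean(b) - statistics.mean(a)) / se


z = poisson_only_z(control, treated)
t = replicate_aware_t(control, treated)
print(f"Poisson-only z:    {z:.2f}")
print(f"replicate-aware t: {t:.2f}")
```

The Poisson-only statistic comes out around 10, which looks overwhelmingly significant, while the replicate-aware statistic is below 2: the same data give a clear yes to (i) and no convincing answer to (ii).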
This is why I stick to my claim that, as of now, there are no tools suitable to reliably associate changes in splicing isoforms with changes in experimental condition. Of course, this gap will be filled very soon.
-
Thanks Simon for the explanation. I am looking at the effect of a treatment on gene expression between samples with 3 biological replicates. I can test for the differential expression of genes under different treatments with the available tools but as you suggested there are no tools at this stage for measuring differential expression of isoforms under different treatments.
-
Originally posted by Simon Anders:
Of course they cannot see alternative splicing because they require you to lump all transcripts of a gene together.
-
Simon: I have read your posts in a number of threads and message boards / mailing lists, and they have been helpful in clarifying some points I was questioning. I'm doing preliminary research for a possible RNA-Seq project for differential expression based on multiple experimental treatments later this year, and am trying to outline possible workflows. I have eventually settled on two general possibilities: RNA-Seq -> tophat -> htseq-count -> DESeq or RNA-Seq -> tophat -> cufflinks -> cuffdiff. Your comments here and in other places to the effect that cuffdiff is inappropriate for differentiating biological from treatment-induced differences in expression seem logical (at least with only a minimal understanding of the differences in statistical methods each employs), so it seems the former would be appropriate.
Originally posted by Simon Anders:
This is why I stick to my claim that, as of now, there are no tools suitable to reliably associate changes in splicing isoforms with changes in experimental condition. Of course, this gap will be filled very soon.
Thanks,
Jeremy
-
Hi Jeremy
Originally posted by jdsv:
Would you still consider this to be true? I haven't been able to find anything in my search indicating that cuffdiff has changed in its method of handling biological vs. treatment variation, but researchers seem to be using it for this purpose (e.g. http://dx.doi.org/10.1371/journal.pone.0016266).
However, to my knowledge, they have not yet offered any more specific explanations. Does this mean that cuffdiff now estimates biological variation and accordingly tests in a more stringent way? If so, how does it do that?
I hope that the cuffdiff authors will soon publish some methods paper elaborating on this, but until then, there is no way to judge whether it is sound.
Should you wait for that? Personally, I consider it improper to use a tool before its method has been published; after all, you just trust that the tool's authors did a good job without anybody having double-checked it yet. On the other hand, I understand that a practitioner without sufficient expertise in statistics could not judge the soundness of a method anyway, so a publication doesn't help too much (unless you have great trust in peer review).
Concerning your problem: We are working on a method to test for alternative isoform regulation, and I am aware of at least two other groups who work on similar projects. We hope to release our tool soon, and I guess our competitors are not that far behind. So, you will soon have several methods to choose from, and if somebody can come up with an idea for how to construct a suitable gold-standard test data set, we could even resolve by testing which of the methods are sound.
Simon
-
Simon,
I agree that it's fantastic that the DESeq paper is already published:
However I believe it was used by many people long before it appeared in print on the 20th of October 2010. If I recall, it was already being discussed long before that.
Now I know you submitted it on the 20th of April 2010, and it took a long time for it to get accepted (perhaps even longer than a standard paper?), but I still think it's good that the method was available for people to use before it appeared in print.
With Cufflinks, we have been completely open and clear about exactly what our methods are doing. In fact, we hold ourselves to the highest standard of openness: namely open source. Any user can look at our software and know exactly what it is doing, and in fact many people have. We do not hide anything, contrary to your suggestions that we do.
Furthermore, we publish our methods in our code before they appear in print, always. This allows users to benefit from our methods before the peer review, and the open source means they can see what we are doing if they are interested in details. For example, we have been distributing the code to do bias correction even before the paper was published. It just appeared yesterday:
Cufflinks has performed isoform-specific expression estimates since the first version was released, well over a year ago. Already the original version took into account variability in isoform estimates due to uncertainty from ambiguous read counts when performing differential expression. It is true that, at the time, our underlying model was Poisson. This is in fact a very good model for a wide range of expression values, even when used for biological replicates. It is even better when coupled with bias correction, which we now do.
In our more recent versions of Cufflinks, we have been working on improving our approach to differential expression even beyond the methods in our two papers so far. And to my knowledge, currently our software provides a solution to this (which many people are using), while your software doesn't. Until it does, and until there is code to do it, or a paper, I don't think your comments are of much use to either biologists or statisticians.
-
lpachter: Why don't you update the cuffdiff documentation to describe how it works and how biological replicates are handled? While it is very nice that the code is open source, it would be much easier to understand what the code is doing if you could give some hints on the underlying model.
-
kopi-o: The preliminary replicate support introduced in Cufflinks 0.9.0 is currently being overhauled and expanded to address many of the concerns that Simon and others have raised. As part of Cufflinks 1.0, which will be released soon, we will include a complete description of what the software is doing on the website. We haven't decided whether to write a standalone paper about these enhancements.