  • Converting FPKM from Cufflinks to raw counts for DESeq

    How do I convert the FPKM values output by Cufflinks in genes.expr to raw counts to import into DESeq?

  • #2
    counting reads

    hi,

    I struggled to figure this out myself for a while.

    I've wound up using Simon Anders' HT-Seq

    http://www-huber.embl.de/user...unt.html#count

    The starting point for this, however, is the SAM file, not FPKM.

    Chris


    • #3
      @Chrisbala:

      We've discussed this issue on a number of other threads before. It doesn't make sense to convert FPKM values to read count estimates for the purposes of running DESeq.

      If you want to test gene-level differential expression, you can just use HT-Seq as you noted. However, you should know that differential expression based on count data will not allow you to compare expression values of different transcripts (or genes) within a single experiment, nor will you be able to test for isoform-level differential expression. Unfortunately, the DESeq approach is simply not suited to that. Furthermore, even for gene-level differential expression, DESeq can be inaccurate when genes overlap (or when reads cannot be assigned uniquely to a gene due to duplicates, etc.).


      • #4
        Yes, this issue has been discussed a number of times. One way to convert FPKM values is to multiply them by the transcript length (in kilobases) and by the number of mapped reads (in millions). Transcript length can be obtained using HTSeq.
        I can't understand why it is not valid to convert FPKM values into counts and use edgeR or DESeq to test for differential expression, especially when you want to use biological replicates for the test. Get the FPKM values by comparing expression between the different samples with cufflinks. Then convert the FPKM values into read counts and use any of the R packages to test for differential expression.
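
For reference, the arithmetic described in this post can be sketched in Python as below. The function name and the example numbers are made up for illustration. It assumes the usual FPKM definition (fragments per kilobase of transcript per million mapped fragments), so the length has to be expressed in kilobases. It only inverts the normalisation and yields a non-integer pseudo-count; whether such values are valid input for DESeq or edgeR is exactly what the rest of the thread disputes.

```python
def fpkm_to_pseudo_count(fpkm, length_bp, mapped_fragments):
    """Invert the FPKM normalisation (hypothetical helper, illustration only).

    FPKM = fragments / (length in kb * mapped fragments in millions)
    """
    length_kb = length_bp / 1_000
    mapped_millions = mapped_fragments / 1_000_000
    return fpkm * length_kb * mapped_millions


# Made-up numbers: FPKM 10, a 2 kb transcript, 20 million mapped fragments.
example = fpkm_to_pseudo_count(10.0, 2_000, 20_000_000)
print(example)
```

The result is a pseudo-count, not a tally of observed reads, which matters for the noise models discussed in the replies.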


        • #5
          Originally posted by lpachter:
          Furthermore, even for gene level differential expression, DESeq can be inaccurate when genes overlap (or when reads cannot be assigned uniquely to a gene due to duplicates, etc.)
          This is incorrect. If I have several samples, the probability that a read gets discarded due to it falling on overlapping genes is the same in all samples. Hence, under the null hypothesis of no differential expression, the count in each sample gets reduced by the same proportion, so that the effect cancels out. One might lose power because of the overlap, but the result of the test is correct.

          This requires, of course, that one properly discards reads mapping to overlaps. This is what htseq-count ensures. htseq-count is no more than a simple script, but it is important to note that it does this. On the Bioconductor website, you will find (in the workshop materials section) several explanations of how to do the counting in R, and unfortunately, none of these properly takes care of overlaps.
          Last edited by Simon Anders; 08-15-2010, 09:20 AM.
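
The kind of counting described in this post, where a read is kept only if it can be assigned to exactly one gene, might look roughly like the Python sketch below. This is not htseq-count's implementation: it assumes one chromosome, ignores strand, exons and spliced reads, and the function name, gene names and coordinates are invented for illustration.

```python
def count_reads_per_gene(reads, genes):
    """Count reads that overlap exactly one gene (simplified sketch).

    reads: list of (start, end) half-open intervals on one chromosome
    genes: dict mapping gene name -> (start, end) half-open interval
    Returns (counts, n_ambiguous, n_no_feature).
    """
    counts = {name: 0 for name in genes}
    n_ambiguous = 0
    n_no_feature = 0
    for r_start, r_end in reads:
        # Genes whose interval shares at least one position with the read.
        hits = [
            name
            for name, (g_start, g_end) in genes.items()
            if r_start < g_end and g_start < r_end
        ]
        if len(hits) == 1:
            counts[hits[0]] += 1
        elif hits:
            n_ambiguous += 1  # overlaps several genes: discard
        else:
            n_no_feature += 1  # overlaps no gene
    return counts, n_ambiguous, n_no_feature


# Made-up example: geneA and geneB overlap between positions 80 and 100.
genes = {"geneA": (0, 100), "geneB": (80, 200)}
reads = [(10, 40), (85, 95), (150, 180), (300, 320)]
counts, n_ambiguous, n_no_feature = count_reads_per_gene(reads, genes)
print(counts, n_ambiguous, n_no_feature)
```

Because the same rule is applied to every sample, the fraction of reads lost to the overlap is expected to be the same in each sample under the null hypothesis, which is the cancellation argument made above.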


          • #6
            Originally posted by Balat:
            I can't understand why it is not valid to convert FPKM values into counts and use edgeR or DESeq to test for differential expression.
            If each read mapped to exactly one transcript, you could multiply the FPKM values by the transcript length to get raw counts again.

            However, the whole point of cufflinks is to deal with the fact that most reads will map to several transcripts, and each read can hence influence the FPKM values of all these transcripts, and it will definitely not augment each count by one.
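
To see why a read that is compatible with several transcripts does not add one to any single count, consider the toy expectation-maximisation sketch below. It is not Cufflinks' algorithm (it ignores transcript length, fragment length distribution and bias correction), and the function name and read numbers are invented for illustration. Each read is split fractionally among its compatible transcripts in proportion to the current abundance estimates.

```python
def em_expected_counts(compat, n_transcripts, n_iter=200):
    """Toy EM: fractional assignment of ambiguous reads (illustration only).

    compat: list with one entry per read, each a tuple of the transcript
            indices that the read is compatible with
    Returns the expected (fractional) read count per transcript.
    """
    n_reads = len(compat)
    theta = [1.0 / n_transcripts] * n_transcripts  # relative abundances
    expected = [0.0] * n_transcripts
    for _ in range(n_iter):
        # E-step: split every read among its compatible transcripts.
        expected = [0.0] * n_transcripts
        for transcripts in compat:
            total = sum(theta[t] for t in transcripts)
            for t in transcripts:
                expected[t] += theta[t] / total
        # M-step: abundances proportional to the expected counts.
        theta = [e / n_reads for e in expected]
    return expected


# Made-up data: 30 reads unique to transcript 0, 10 unique to transcript 1,
# and 60 reads compatible with both.
compat = [(0,)] * 30 + [(1,)] * 10 + [(0, 1)] * 60
print(em_expected_counts(compat, n_transcripts=2))
```

In this example the 60 shared reads end up divided between the two transcripts, so the expected counts are fractional estimates that depend on each other rather than independent tallies of observed reads.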

            A crucial aspect of DESeq and edgeR is that they assess the shot noise by assuming that each counting unit is evidence of one sequencing read, and hence the counting noise follows a Poisson distribution. In cufflinks' output this is not the case. Instead, cufflinks calculates the uncertainty for you using more involved math.

            DESeq and edgeR use a simple way to include counting noise and go to some lengths to assess biological noise. cufflinks offers you a more sophisticated estimate of the counting noise that can deal with reads mapping to multiple transcripts, but, at least so far, no means of assessing biological noise.
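
The difference between counting noise and biological noise can be made concrete with a small Python sketch. For pure Poisson counting noise the variance across replicates equals the mean; extra variance is commonly described by a negative binomial model with variance = mean + dispersion * mean^2. The moment-based estimate below is only an illustration with made-up counts and a hypothetical function name; it is not the estimation procedure used by DESeq or edgeR.

```python
from statistics import mean, variance


def moment_dispersion(counts):
    """Crude method-of-moments dispersion estimate (illustration only).

    Under Poisson noise, variance == mean and the estimate is about 0.
    Under a negative binomial model, variance = mean + alpha * mean**2.
    Returns (mean, sample variance, alpha clipped at 0).
    """
    m = mean(counts)
    v = variance(counts)  # unbiased sample variance
    if m == 0:
        return m, v, 0.0
    alpha = max(0.0, (v - m) / (m * m))
    return m, v, alpha


# Made-up replicate counts for one gene.
print(moment_dispersion([90, 100, 110]))   # spread matching Poisson noise
print(moment_dispersion([50, 100, 150]))   # spread well beyond Poisson noise
```

With only a few replicates such a per-gene estimate is very noisy, which is one reason dedicated tools put effort into estimating this quantity.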

            There are two fundamentally different tasks that you should not mix up, and that are served by different tools:

            1. If you want to compare the abundance of two different transcripts in the same sample, cufflinks will allow you to do this even if these transcripts overlap with other transcripts that should stay out of the comparison.

            2. If you have two different experimental conditions (or tissues or genotypes) and you want to know whether a given gene changes its expression strength due to the condition, you need to assess biological noise between replicates to know whether the observed difference is significantly stronger than the difference between replicates, i.e., whether it is really due to the change in the experimental condition and not just due to biological variability.

            This is what DESeq and edgeR do. Of course they cannot see alternative splicing because they require you to lump all transcripts of a gene together.

            The tool that sits awkwardly in between the two use cases is cuffdiff. It tests whether a transcript has a different concentration in two samples. The problem with this is that many of its users forget that there is a big difference between asking (i) "Is the concentration of transcript X in the two samples different?" and (ii) "Is the difference in concentration sufficiently large to make it unlikely that it is only due to biological variability?" Only if you can say yes to (ii) can you attribute your observation to the fact that your samples received different experimental treatments.

            This is why I stick to my claim that, as of now, there are no tools suitable to reliably associate changes in splicing isoforms with changes in experimental condition. Of course, this gap will be filled very soon.


            • #7
              Thanks Simon for the explanation. I am looking at the effect of a treatment on gene expression between samples with 3 biological replicates. I can test for the differential expression of genes under different treatments with the available tools but as you suggested there are no tools at this stage for measuring differential expression of isoforms under different treatments.


              • #8
                Originally posted by Simon Anders:
                Of course they cannot see alternative splicing because they require you to lump all transcripts of a gene together.
                Could you just use exon level counts? I guess this depends on the depth of sequencing and expression level of that exon, but has anyone tested it out?


                • #9
                  I have had success using the R limma package on FPKM values.


                  • #10
                    Simon: I have read your posts in a number of threads and message boards / mailing lists, and they have been helpful in clarifying some points I was questioning. I'm doing preliminary research for a possible RNA-Seq project for differential expression based on multiple experimental treatments later this year, and am trying to outline possible workflows. I have eventually settled on two general possibilities: RNA-Seq -> tophat -> htseq-count -> DESeq or RNA-Seq -> tophat -> cufflinks -> cuffdiff. Your comments here and in other places to the effect that cuffdiff is inappropriate for differentiating biological from treatment-induced differences in expression seem logical (at least with only a minimal understanding of the differences in statistical methods each employs), so it seems the former would be appropriate.

                    Originally posted by Simon Anders:
                    This is why I stick to my claim that, as of now, there are no tools suitable to reliably associate changes in splicing isoforms with changes in experimental condition. Of course, this gap will be filled very soon.
                    Would you still consider this to be true? I haven't been able to find anything in my search indicating that cuffdiff has changed in its method of handling biological vs. treatment variation, but researchers seem to be using it for this purpose (e.g. http://dx.doi.org/10.1371/journal.pone.0016266).

                    Thanks,
                    Jeremy


                    • #11
                      Hi Jeremy

                      Originally posted by jdsv:
                      Would you still consider this to be true? I haven't been able to find anything in my search indicating that cuffdiff has changed in its method of handling biological vs. treatment variation, but researchers seem to be using it for this purpose (e.g. http://dx.doi.org/10.1371/journal.pone.0016266).
                      A while ago, the cufflinks authors announced that their new version of cuffdiff now handles biological replicates as well. This is stated on the cufflinks web page and was also said by the cufflinks authors here on SeqAnswers.

                      However, to my knowledge, they have not yet offered any more specific explanations. Does this mean that cuffdiff now estimates biological variation and accordingly tests in a more stringent way? If so, how does it do that?

                      I hope that the cuffdiff authors will soon publish some methods paper elaborating on this, but until then, there is no way to judge whether it is sound.

                      Should you wait for that? Personally, I consider it improper to use a tool before its method has been published; after all, you just trust that the tool's authors did a good job without anybody having double-checked it yet. On the other hand, I understand that a practitioner without sufficient expertise in statistics could not judge the soundness of a method anyway, so a publication doesn't help too much (unless you have great trust in peer review).

                      Concerning your problem: We are working on a method to test for alternative isoform regulation, and I am aware of at least two other groups who are working on similar projects. We hope to release our tool soon, and I guess our competitors are not that far behind. So, you will soon have several methods to choose from, and if somebody can come up with an idea for how to construct a suitable gold-standard test data set, we could even settle by testing the question of which methods are sound.

                      Simon


                      • #12
                        Simon,
                        I agree that it's fantastic that the DESeq paper is already published:

                        However I believe it was used by many people long before it appeared in print on the 20th of October 2010. If I recall, it was already being discussed long before that.

                        Now I know you submitted it on the 20th of April 2010, and it took a long time for it to get accepted (perhaps even longer than a standard paper?) but I still think it's good that the method was available for people to use before it appeared in print.

                        With Cufflinks, we have been completely open and clear about exactly what our methods are doing. In fact, we hold ourselves to the highest standard of openness: namely open source. Any user can look at our software and know exactly what it is doing, and in fact many people have. We do not hide anything, contrary to your suggestions that we do.
                        Furthermore, we publish our methods in our code before they appear in print, always. This allows users to benefit from our methods before the peer review, and the open source means they can see what we are doing if they are interested in details. For example, we have been distributing the code to do bias correction even before the paper was published. It just appeared yesterday:


                        Cufflinks has performed isoform-specific expression estimates since the first version was released, well over a year ago. Already the original version took into account variability in isoform estimates due to uncertainty from ambiguous read counts when performing differential expression. It is true that, at the time, our underlying model was Poisson. This is in fact a very good model for a wide range of expression values, even when used for biological replicates. It is even better when coupled with bias correction, which we now do.

                        In our more recent versions of Cufflinks, we have been working on improving our approach to differential expression even beyond the methods in our two papers so far. And to my knowledge, currently our software provides a solution to this (which many people are using), while your software doesn't. Until it does, and until there is code to do it, or a paper, I don't think your comments are of much use to either biologists or statisticians.


                        • #13
                          lpachter: Why don't you update the cuffdiff documentation to describe how it works and how biological replicates are handled? While it is very nice that the code is open source, it would be much easier to understand what the code is doing if you could give some hints on the underlying model.


                          • #14
                            kopi-o: The preliminary replicate support introduced in Cufflinks 0.9.0 is currently being overhauled and expanded to address many of the concerns that Simon and others have raised. As part of Cufflinks 1.0, which will be released soon, we will include a complete description of what the software is doing on the website. We haven't decided whether to write a standalone paper about these enhancements.


                            • #15
                              Thanks, that's good to know.
