Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • cuffdiff overcompensating normalization of samples?

    Hello-

    first off, I've been using cuffliniks/cuffdiff since it was first released so I'm familiar enough with the software. I have only recently discovered this issue because we've had some RNA-Seq runs on an Illumina Hiseq 2000 machine where we barcoded samples and ran 6 samples per "lane" to get a total of 48 samples sequenced at a time. We've had a large amount of variability in the # of reads per sample with this method.

    I'm sharing a less dramatic example here but one that's good enough to point out a freaky issue.

    I've got a "wt" verses "mutant" experiment with 2 biological replicates per condition. The aligned read counts for the samples is roughly 40M for three of them and 18M for the fourth. So the "wt" condition has 40M+18M and the "mutant" has 40M+40M.

    I've run cuffdiff including both replicates per condition and created a scatter plot of the FPKM values. I've also run cuffdiff excluding the sample with lower reads (single "wt" sample verses two "mutant" samples) and created another scatterplot. In both cases I've tried upper quartile normalization and the default. The attached image shows both plots (1 vs 2 on the left and 2 vs 2 on the right). The diagonal blue line indicates the "1:1" line. If the normalization step is done correctly the main body of the data should be centered on this line as we would expect most genes to have a fold change near 1. Points in red are genes identified significantly differentially expressed by cuffdiff.

    You can see in the 2 vs 2 plot, on the right, the main body is NOT centered on the blue line - in fact it's pulled in the direction of the "wt" condition and there's a large body of red points within the main body of the scatter. The plot on the left shows proper normalization and that large body of points is correctly shown as not significant. On the log scale, those points furthest to the upper-right have FPKM values 1000's higher in the "wt" condition in the plot on the right verses the plot on the left. By including the low-read count sample cuffdiff artificially increased all of the gene expression values for those samples.

    So why does cuffdiff fail at properly normalizing these samples? We're talking about a simple division step and I can see it correctly identifies the different normalization factors during the start of its run. I haven't read anywhere that it's dangerous to use samples with different read counts. If I take this same data and run it through a DESeq based pipeline I get results that look very much like the plot on the left even if I include the sample with a low read count.

    I've seen a much more extreme example where the entire body of points in the scatter were pulled so far to one side of the blue line that cuffdiff called over 13,000 genes significant. Again running through the DESeq pipeline I got a more sensible output where the normalization correctly centered the scatter on the blue line.

    Has anyone else seen this issue? This seems very misleading.
    Attached Files
    /* Shawn Driscoll, Gene Expression Laboratory, Pfaff
    Salk Institute for Biological Studies, La Jolla, CA, USA */

  • #2
    An update to this issue. If I run the low coverage sample as a third condition then cuffdiff seems to correctly perform the normalization. But then my replicate sample is no longer a replicate. So it's not a solution.
    /* Shawn Driscoll, Gene Expression Laboratory, Pfaff
    Salk Institute for Biological Studies, La Jolla, CA, USA */

    Comment


    • #3
      Wow, seems like a pretty big issue. Perhaps you should email the authors, and if you do, please let post the response. I'm also about to get some replicates back that have a lower read count, and I didn't think that would be an issue, but now I'm worried.

      Comment


      • #4
        It IS a big issue. It's not so bad for me though because I don't use cuffdiff as a primary tool for differential expression. Every time we get a sequencing run done at our lab I check in and see if cufflinks is "useable" for me yet and so far I always run into something that turns me away. We don't like the decisions it makes and since it's basically a black box you can't really track why it made the decisions it made. What we like to do is generate coverageBed files from the aligned reads and have a look at them in the Ucsc genome browser. We'll go through the differential expression results and look at the coverage at the same time. It's always been the case that the cufflinks or cuffdiff output does not follow the raw coverage Ina way that makes sense. Not on a massive scale, mind you, but there seem to be a large percentage of the DE hits from cuffdiff that we have to toss out after observing the coverage. Sometime you get hig expressions for genes that clearly don't have many reads aligned to them. Overall I'd say it makes me nervous at best.

        I like to use a simple approach and I'm willing to accept that differential expression at the isoform level just isn't there yet. At least not when you go in and look closely at the results. I use a read counter like HTSeq and the use count based DE tools like DESeq and edgeR. I've found the output of DESeq to make a lot of sense when compared to the coverage data. plus getting used to using R is a good thing for anybody that's going to do this stuff in any amount.
        /* Shawn Driscoll, Gene Expression Laboratory, Pfaff
        Salk Institute for Biological Studies, La Jolla, CA, USA */

        Comment


        • #5
          Great post, I'll certainly take a look at the software you mentioned.

          Comment


          • #6
            bump! i'm still wondering if anyone has seen this - and if it'll be fixed in cufflinks 1.4
            /* Shawn Driscoll, Gene Expression Laboratory, Pfaff
            Salk Institute for Biological Studies, La Jolla, CA, USA */

            Comment


            • #7
              it would appear this has been fixed in Cufflinks 2.0! thanks guys.
              /* Shawn Driscoll, Gene Expression Laboratory, Pfaff
              Salk Institute for Biological Studies, La Jolla, CA, USA */

              Comment


              • #8
                Originally posted by sdriscoll View Post
                it would appear this has been fixed in Cufflinks 2.0! thanks guys.
                I'm not so sure. Please take a look at these plots.

                The first is a plot generated using Cufflinks 1.3 without any duplicates. There were three samples run on one lane. The genes match up nicely along the 1:1 line.

                The second is a plot generated from the newest cufflinks with duplicates that are half the size. So I took the three samples on one lane, and then ran duplicates with six samples on one lane. Meaning I had different sizes for my duplicates. I see this huge branching off.

                Any ideas or thoughts on how to get around this?
                Attached Files

                Comment


                • #9
                  This was actually a problem with my data, not Cufflinks. Cuffdiff is now good on this.

                  Comment


                  • #10
                    Hi,
                    I read with interst an old post of yours...I assume bythen you have solve dyour issue.

                    My question:
                    I have used cuffdiff to look as t DE between wt (no replicates) and one mutant (no replicates). I have done this for 7 different mutats, but all taken as "single" and not replicates.

                    Well, I am struggling to explain how cuffdiff calculate the DE. In some examples clearly there is a big fold change between the two samples, but somehow, cuffdiff says it is not significant..

                    any suggestion??
                    thanks,
                    ib

                    Comment


                    • #11
                      I guess you should start by asking yourself this: in what other type of experiment would you attempt to assign significance in a 1 vs 1 comparison?

                      The only reason it's possible to assign p-values to genes in a 1 vs 1 condition has to do with some assumptions. One of the simplest to grasp is the idea that similarly expressed genes tend to have similar variance. Therefore for any gene you can look at the distribution of expression of genes similarly expresses between the two samples. Another part of this assumption that makes it work is that we can usually assume that most genes are not differentials expressed. So you can cheat a little and use these assumptions to figure out if the observed change for a gene is typical for genes with similar expressions or not. If not then its potentially significant. The visual version of this is to plot the expressions of your two samples against one another and spot the points that appear to stand out from the mass or from the 1:1 line that should run up the center of the mass of points.

                      On a finer scale it's likely that similarly expressed genes do not have the same variance across replicates but without replicated its not possible to model those.

                      If the DE results don't make sense it's probably because there is more going on then a gene by gene comparison which makes it difficult to visually check without knowing their statistic test and where the variance value comes from. I'd try using more than one tool to test for de genes and compare the lists.
                      /* Shawn Driscoll, Gene Expression Laboratory, Pfaff
                      Salk Institute for Biological Studies, La Jolla, CA, USA */

                      Comment


                      • #12
                        hi sdriscoll,
                        thanks a lot for the reply. I understand more now than before.

                        In the manual it is explained how cuffidiff calculates DE, but I do not have strong statistic bases, thus I really did not get it. As far as I can get, I know I have set up a cut off of .05 (in the program I set FDR= 0.05 and significance is given looking at the q value and not p value. I understood the q values is the p value FDR adjusted) (http://cufflinks.cbcb.umd.edu/howitworks.html#hdif).

                        but, whatever the program is choosing, is it still ok to do statistic on 2 samples and draw some conclusion? can some odd results be explained by the fact I do not have repeats?

                        well, any answer is appreciated!
                        ib

                        Comment


                        • #13
                          I guess aside from the theoretical issue with doing a 1 vs 1 comparison I can say that I do indeed often see lots of variation between replicate samples. Gene expression is the most varied across samples for low expressed genes (say maybe the bottom 10 to 20% of genes when ranked by expression or read counts). I think that percentile goes down when you sequence deeper because, obviously, there's a cap on the total number of genes that can be expressed in a sample so at some point you should stop seeing more genes with low expression and start seeing the expression levels of all genes go up and become very stable.

                          Variance of expression of genes above the median expression level is usually very tight across replicates. Up at that level a 3 fold change or possibly a 2 fold change (in the upper quartile) in expression is pretty unlikely.

                          These aren't proven facts - just my own observations from the various experiments I've done analysis for.
                          /* Shawn Driscoll, Gene Expression Laboratory, Pfaff
                          Salk Institute for Biological Studies, La Jolla, CA, USA */

                          Comment

                          Latest Articles

                          Collapse

                          • seqadmin
                            Essential Discoveries and Tools in Epitranscriptomics
                            by seqadmin


                            The field of epigenetics has traditionally concentrated more on DNA and how changes like methylation and phosphorylation of histones impact gene expression and regulation. However, our increased understanding of RNA modifications and their importance in cellular processes has led to a rise in epitranscriptomics research. “Epitranscriptomics brings together the concepts of epigenetics and gene expression,” explained Adrien Leger, PhD, Principal Research Scientist on Modified Bases...
                            Today, 07:01 AM
                          • seqadmin
                            Current Approaches to Protein Sequencing
                            by seqadmin


                            Proteins are often described as the workhorses of the cell, and identifying their sequences is key to understanding their role in biological processes and disease. Currently, the most common technique used to determine protein sequences is mass spectrometry. While still a valuable tool, mass spectrometry faces several limitations and requires a highly experienced scientist familiar with the equipment to operate it. Additionally, other proteomic methods, like affinity assays, are constrained...
                            04-04-2024, 04:25 PM

                          ad_right_rmr

                          Collapse

                          News

                          Collapse

                          Topics Statistics Last Post
                          Started by seqadmin, 04-11-2024, 12:08 PM
                          0 responses
                          35 views
                          0 likes
                          Last Post seqadmin  
                          Started by seqadmin, 04-10-2024, 10:19 PM
                          0 responses
                          38 views
                          0 likes
                          Last Post seqadmin  
                          Started by seqadmin, 04-10-2024, 09:21 AM
                          0 responses
                          33 views
                          0 likes
                          Last Post seqadmin  
                          Started by seqadmin, 04-04-2024, 09:00 AM
                          0 responses
                          54 views
                          0 likes
                          Last Post seqadmin  
                          Working...
                          X