Unconfigured Ad

Collapse
X
 
  • Filter
  • Time
  • Show
Clear All
new posts
  • ETHANol
    Senior Member
    • Feb 2010
    • 308

    #16
    This thread was pretty informative to me, so thanks everyone for your comments.

    I have been using DESeq. I wrote a little script to wrap up the process, so I thought I would share it. Maybe someone will find it useful.

    Anyway, it's here:
    UPDATE: I fixed a few bugs in the script and it should be running better now.  Also you can run it from sam files or a counts table now, if you reads have already been mapped. No time to blog these…
    --------------
    Ethan

    Comment

    • turnersd
      Senior Member
      • May 2011
      • 115

      #17
      Thanks Ethan. Looking forward to trying that one out, or possibly modifying it to work with DEXSeq.

      Comment

      • sudders
        Member
        • Dec 2011
        • 32

        #18
        Originally posted by Cole Trapnell View Post
        DEseq and edgeR (and other tools that test raw gene-level counts) assume that the features are the same length. However, if you have any differential splicing or isoform switching going on between conditions, this assumption gets you into trouble. Suppose for example, you have an gene with two isoforms, one twice as long as the other. In both conditions being compared, you have 100 reads on the gene. However, in condition A, all the reads are on the shorter one. In B, they're all on the longer one. Comparing the raw gene-level count will show no change, even though the expression of the gene is two-fold higher in condition A! This is a contrived example (the reads wouldn't *all* move like that in a real situation), but it illustrates the point.
        I noted this in my original post. Unfortunately, several of the experimental designs I use require more complex designs than just testing group A against group B. Most often this is because of a paired design, but there are several experiments that involve testing interactions between factors etc. which just can't be done in cufflinks. What we can do with the tag based system is test differential expression of exons, and then find some way of selecting an exon to represent a gene, with the understanding that this won't be valid if there is a differential splicing event that changes the usage of that exon.

        Comment

        • Simon Anders
          Senior Member
          • Feb 2010
          • 995

          #19
          Well, Cole is certainly write with both that this is an issue and that it is a contrived example. In practice, it will be rare that a change in isoform usage and a change in overall expression cancel out that well, and hence I would not worry about it. It is true, however, that DESeq and edgeR cannot distinguish changes in number of transcripts from changes in the average transcript length, and hence, if the tools report significant evidence for differential expression, this should be read as "significant evidence for differential expression and/or differential isoform usage". In many cases, you will be able to live with that -- either because you intend to have a closer look at the top hits anyway and would then discover this by manual inspection, or because you just want an overview and don't mind if a few splicing changes are among your expression changes.

          And if you are interested in splicing changes, have a look at DEXSeq, too.

          Comment

          • lpachter
            Member
            • Feb 2010
            • 40

            #20
            I'd like to point out that Cole's example is not only showing that DESeq and edgeR's assumption of features being the same length is problematic because it might lead to a canceling out that leads to non-DE calls when they should be made. In fact, the reverse can happen as well. In the example, suppose that in condition B you actually have 200 reads (on the longer isoform), instead of 100 on the shorter. In this case, the 2 fold change in read count is certain to look like differential expression, but in reality, there has been no change in expression at all. So its not only that you might get a few splicing changes among your expression changes with DESeq or edgeR. You will also miss significant expression changes.

            The numbers of counts in Cole's example are made up (to make the point), but the differences in transcript length are not. Its common for transcripts in a gene to have variable lengths, even much more than the factor of 2 in the example. In DESeq/edgeR every time that happens in the structure of a gene, the computed fold change in expression, which is the basis for calling differential expression, is incorrect.

            In summary, I agree with Simon that DESeq and edgeR are useful tools for analysis of differential expression in single isoform genes where all reads map uniquely.

            Comment

            • areyes
              Senior Member
              • Aug 2010
              • 165

              #21
              I think the combination of several methods its the way to go, depending on the question one may ask. I think a single method won't answer all questions. At the end all the methods have advantages and limitations. (e.g. edgeR and DESeq have very good control over in the false discovery rate aspect compared other methods)

              Regarding the last point of longer/shorter isoforms as possible confounders of differential expression, maybe one could also look at differences in exon usage. e.g. DEXSeq would detect the exons that are differentially used in between the shorter (condition A) and longer isoform (condition B), independently if the gene is differentially expressed between A and B.
              Last edited by areyes; 04-23-2012, 12:34 AM.

              Comment

              • gringer
                David Eccles (gringer)
                • May 2011
                • 845

                #22
                A good genomic reference sequence also helps. Isoforms are much easier to find and identify when you know what the genomic sequence looks like.

                Comment

                • Xi Wang
                  Senior Member
                  • Oct 2009
                  • 317

                  #23
                  Hi Simon, Cole & Lior,

                  Glad to find you'er discussing the two main strategies in analysing RNA-seq data for differential expression. I think either exon-centric or isoform-centric strategy has its own pros and cons, and both perform reasonably well in real-world RNA-seq studies. What I concern is, however, which strategy is more preferred in paired-end (PE) RNA-sequencing. PE design will become the unique in RNA-seq, as PE is widely recognised better for identifying splice variants in transcriptomes. The advantage of PE can be easily and fully modelled in isoform-centric methods including Cufflinks and RSEM, but how exon-centric methods can take PE into account? Any idea? Thanks.
                  Xi Wang

                  Comment

                  • epi
                    Member
                    • Jan 2012
                    • 38

                    #24
                    Originally posted by Cole Trapnell View Post
                    The thread you linked to doesn't show a bug, just an incorrect use of Cufflinks, as I just noted in that thread.
                    Hi Cole, as reflected in the thread, it seems this not an incorrect use.

                    From the cufflinks manual: -g option
                    Output will include all reference transcripts as well as any novel genes and isoforms that are assembled.
                    I was wondering if there is any solution or resolution, as many others have reported the same issue from this thread
                    HTML Code:
                    http://seqanswers.com/forums/showthread.php?t=18070
                    .

                    I also reported another issue in the same thread, can you please look and comment on it.

                    Originally posted by epi View Post
                    Hi Cole, Thanks again for replying. While your comment clarifies some things, it also raises some more question. First here is my command
                    I am still unclear why there are new predictions (with CUFF. IDs) that are identical to existing annotations. I have manually looked at one of the test runs and all new predictions (with CUFF. IDs) were original genes/transcripts in the source GTF annotation (I must have looked only about 20-25 ). So far I have not encountered any novel predictions which do not correspond to existing annotation. In some cases, original annotation still exists and in some cases not. Similar to the example about Trib3

                    Comment

                    • lshen
                      Member
                      • Jan 2008
                      • 30

                      #25
                      I got another issue with new CUFFLINK 2:

                      When I directly quantify against ensembl gtf, the cufflinks returned 0 expression for most of them. This only occurred when I used replicates. single sample group is fine. And seems only when transcripts matched to known gene's annotation.


                      #command:
                      cufflinks-2.0.0.Linux_x86_64/cuffdiff -p 8 -L P1,P2 -c 1 -b anFam2.fa -o cuffdiff.P1.P2.ensembl canFam2.67.gtf TOPHAT2.C1.bam,TOPHAT2.C2.bam,TOPHAT2.C3.bam,TOPHAT2.C4.bam TOPHAT2.C5.bam,TOPHAT2.C6.bam,TOPHAT2.C7.bam,TOPHAT2.C8.bam,TOPHAT2.C9.bam

                      Here are the number of genes returned FPKM 0 in cuffdiff:

                      $8 is for treatment P1, $9 is for treatment P2 in output.

                      awk ' $8 ==0 { i++}; END {print i " of " NR " = " i/NR*100 "%"} ' cuffdiff.P1.P2.ensembl/gene_exp.diff

                      cufflinks-2.0.0:

                      P1: 24649 of 24661 = 99.9513%
                      P2: 24645 of 24661 = 99.9351%

                      cufflinks 1.3.0 seems right:

                      P1: 6345 of 24661 = 25.7289%
                      P2: 6564 of 24661 = 26.6169%

                      Comment

                      • natstreet
                        Member
                        • Nov 2009
                        • 83

                        #26
                        Cuffdiff zero FPKM with reference GFF

                        I have the exact same problem of having zero FPKM values for all genes that have a reference transcript. In my case, the GFF reference file comes from phytozome. I have run it through gffread to be sure it is being interpreted correctly.

                        If I run the samples through as individual samples rather than biological replicates in groups then things seem OK. Using previous versions of cufflinks/diff this worked fine.

                        My process was tophat2-->cufflinks2-->cuffmerge-->cuffdiff

                        I'm using version v2.0.0 of cufflinks and tophat-2.0.0.Linux_x86_64
                        Last edited by natstreet; 05-20-2012, 02:18 AM.

                        Comment

                        • DJParker
                          Junior Member
                          • Jan 2012
                          • 7

                          #27
                          Cufflinks2 and DeSeq

                          Hello

                          I have been working with the Cufflinks2 Vs. DESeq issue for a while now and thus decided to test how both packages performed with my own data.

                          Essentially DESeq appears to be more conservative producing 73 DE genes to 173 from cufflinks at a cutoff of p=0.05

                          However of the 73 genes detected by DESeq 55 of them also appeared in the list produced by Cufflinks.

                          Thus, it looks to me that the 2 methods are producing broadly the same result.

                          I am still unsure which of the results to 'trust' more, however, I think areyes point about combining the approaches is a good move as I now have a fair bit of confidence that in the genes that at DE in both lists really are DE.


                          ALSO:

                          lshen and natstreet

                          The Binary release of cufflinks 2.0.0 has an issue with cuffdiff where it assigns lots of zeros and other strange counts. This has been fixed for the binary release in 2.0.1 and later see: http://cufflinks.cbcb.umd.edu/

                          Comment

                          • billstevens
                            Senior Member
                            • Mar 2012
                            • 120

                            #28
                            Wow, DESeq must be very, very conservative because the new version of Cufflinks is incredibly conservative. The number of differentially expressed genes went from the high hundreds to just 5 or so when I upgraded to the new version of Cufflinks.

                            Comment

                            • billstevens
                              Senior Member
                              • Mar 2012
                              • 120

                              #29
                              Sorry Simon, would you mind commenting on the new Cufflinks and Cuffdiff? I don't mean to put you on the spot, since there is clearly a competition aspect of this, but in this nascent field, there are so few experts that can comment authoritatively on this.

                              Apart from increased knowledge, I ask for my own selfish purposes because I've fallen victim to what turnersd has said: "I've heard far too many examples of people using tophat-cufflinks-cuffmerge-cuffdiff for their RNA-seq workflow, and getting extremely different results from one version of cufflinks to the next."

                              Comment

                              • turnersd
                                Senior Member
                                • May 2011
                                • 115

                                #30
                                Originally posted by billstevens View Post
                                Sorry Simon, would you mind commenting on the new Cufflinks and Cuffdiff? I don't mean to put you on the spot, since there is clearly a competition aspect of this, but in this nascent field, there are so few experts that can comment authoritatively on this.
                                +1.

                                My current strategy is to use cufflinks/cuffdiff for isoform level differences, DEXSeq for differential exon usage, and DESeq for differential expression at the gene level, then presenting all the results to the core's users while explaining the limitations, caveats, gotchas, and known reproducibility issues with each software/method.

                                Further, I've spoken with others recently about alternative workflows. E.g., tophat for alignment, cufflinks or trinity for assembly, RSEM or eXpress for getting counts for isoforms, then edgeR or DESeq for differential expression testing. There are several permutations of this workflow, and I'm hesitant to use any of them in my core until I see something published, but the ideas are worth entertaining.
                                Last edited by turnersd; 08-14-2012, 10:22 AM. Reason: clarity

                                Comment

                                Latest Articles

                                Collapse

                                • SEQadmin2
                                  From Collection to Sequencing: Why Sample Preparation and Preservation Define Sequencing Data
                                  by SEQadmin2


                                  Data variability is still an issue in sequencing technologies despite the advances in reproducibility and accuracy of these platforms. But the problem does not originate in the sequencing itself, but in the previous steps, before the sample reaches the sequencer.


                                  The first step is collection, followed by preservation and sample preparation for analysis. Most scientists overlook those steps, but not being careful might just be skewing the experiment’s results.
                                  ...
                                  06-02-2026, 10:05 AM
                                • SEQadmin2
                                  Single-Cell Sequencing at an Inflection Point: Early Impacts of New Platforms and Emerging Trends
                                  by SEQadmin2


                                  With the launch of new single-cell sequencing platforms in 2026, the field stands at an exciting inflection point. This article surveys the most impactful advances in the field and discusses how they’re reshaping research in cancer, immunology, and beyond.


                                  Introduction

                                  Single-cell sequencing technologies have undergone remarkable advances over the past decade, transitioning from low-throughput experimental approaches to highly scalable platforms capable of...
                                  05-22-2026, 06:42 AM
                                • SEQadmin2
                                  Environmental Genomics in the Age of NGS: From Microbes to Conservation Strategies
                                  by SEQadmin2

                                  Studying ecosystems means dealing with complex, multi-species communities that are hard to observe at scale. This complexity, however, hides many important questions to be answered, from how biogeochemical cycles work and how climate change can affect species distribution to how conservation strategies can work best.


                                  Genomics, particularly since the expansion of NGS, has transformed ecosystem ecology. By sequencing environmental DNA, we can now assess biodiversity without direct...
                                  05-06-2026, 09:04 AM

                                ad_right_rmr

                                Collapse

                                News

                                Collapse

                                Topics Statistics Last Post
                                Started by SEQadmin2, 06-02-2026, 12:03 PM
                                0 responses
                                19 views
                                0 reactions
                                Last Post SEQadmin2  
                                Started by SEQadmin2, 06-02-2026, 11:40 AM
                                0 responses
                                14 views
                                0 reactions
                                Last Post SEQadmin2  
                                Started by SEQadmin2, 05-28-2026, 11:40 AM
                                0 responses
                                29 views
                                0 reactions
                                Last Post SEQadmin2  
                                Started by SEQadmin2, 05-26-2026, 10:12 AM
                                0 responses
                                31 views
                                0 reactions
                                Last Post SEQadmin2  
                                Working...