Seqanswers Leaderboard Ad



No announcement yet.
  • Filter
  • Time
  • Show
Clear All
new posts

  • #16
    This thread was pretty informative to me, so thanks everyone for your comments.

    I have been using DESeq. I wrote a little script to wrap up the process, so I thought I would share it. Maybe someone will find it useful.

    Anyway, it's here:
    UPDATE: I fixed a few bugs in the script and it should be running better now.  Also you can run it from sam files or a counts table now, if you reads have already been mapped. No time to blog these…


    • #17
      Thanks Ethan. Looking forward to trying that one out, or possibly modifying it to work with DEXSeq.


      • #18
        Originally posted by Cole Trapnell View Post
        DEseq and edgeR (and other tools that test raw gene-level counts) assume that the features are the same length. However, if you have any differential splicing or isoform switching going on between conditions, this assumption gets you into trouble. Suppose for example, you have an gene with two isoforms, one twice as long as the other. In both conditions being compared, you have 100 reads on the gene. However, in condition A, all the reads are on the shorter one. In B, they're all on the longer one. Comparing the raw gene-level count will show no change, even though the expression of the gene is two-fold higher in condition A! This is a contrived example (the reads wouldn't *all* move like that in a real situation), but it illustrates the point.
        I noted this in my original post. Unfortunately, several of the experimental designs I use require more complex designs than just testing group A against group B. Most often this is because of a paired design, but there are several experiments that involve testing interactions between factors etc. which just can't be done in cufflinks. What we can do with the tag based system is test differential expression of exons, and then find some way of selecting an exon to represent a gene, with the understanding that this won't be valid if there is a differential splicing event that changes the usage of that exon.


        • #19
          Well, Cole is certainly write with both that this is an issue and that it is a contrived example. In practice, it will be rare that a change in isoform usage and a change in overall expression cancel out that well, and hence I would not worry about it. It is true, however, that DESeq and edgeR cannot distinguish changes in number of transcripts from changes in the average transcript length, and hence, if the tools report significant evidence for differential expression, this should be read as "significant evidence for differential expression and/or differential isoform usage". In many cases, you will be able to live with that -- either because you intend to have a closer look at the top hits anyway and would then discover this by manual inspection, or because you just want an overview and don't mind if a few splicing changes are among your expression changes.

          And if you are interested in splicing changes, have a look at DEXSeq, too.


          • #20
            I'd like to point out that Cole's example is not only showing that DESeq and edgeR's assumption of features being the same length is problematic because it might lead to a canceling out that leads to non-DE calls when they should be made. In fact, the reverse can happen as well. In the example, suppose that in condition B you actually have 200 reads (on the longer isoform), instead of 100 on the shorter. In this case, the 2 fold change in read count is certain to look like differential expression, but in reality, there has been no change in expression at all. So its not only that you might get a few splicing changes among your expression changes with DESeq or edgeR. You will also miss significant expression changes.

            The numbers of counts in Cole's example are made up (to make the point), but the differences in transcript length are not. Its common for transcripts in a gene to have variable lengths, even much more than the factor of 2 in the example. In DESeq/edgeR every time that happens in the structure of a gene, the computed fold change in expression, which is the basis for calling differential expression, is incorrect.

            In summary, I agree with Simon that DESeq and edgeR are useful tools for analysis of differential expression in single isoform genes where all reads map uniquely.


            • #21
              I think the combination of several methods its the way to go, depending on the question one may ask. I think a single method won't answer all questions. At the end all the methods have advantages and limitations. (e.g. edgeR and DESeq have very good control over in the false discovery rate aspect compared other methods)

              Regarding the last point of longer/shorter isoforms as possible confounders of differential expression, maybe one could also look at differences in exon usage. e.g. DEXSeq would detect the exons that are differentially used in between the shorter (condition A) and longer isoform (condition B), independently if the gene is differentially expressed between A and B.
              Last edited by areyes; 04-23-2012, 12:34 AM.


              • #22
                A good genomic reference sequence also helps. Isoforms are much easier to find and identify when you know what the genomic sequence looks like.


                • #23
                  Hi Simon, Cole & Lior,

                  Glad to find you'er discussing the two main strategies in analysing RNA-seq data for differential expression. I think either exon-centric or isoform-centric strategy has its own pros and cons, and both perform reasonably well in real-world RNA-seq studies. What I concern is, however, which strategy is more preferred in paired-end (PE) RNA-sequencing. PE design will become the unique in RNA-seq, as PE is widely recognised better for identifying splice variants in transcriptomes. The advantage of PE can be easily and fully modelled in isoform-centric methods including Cufflinks and RSEM, but how exon-centric methods can take PE into account? Any idea? Thanks.
                  Xi Wang


                  • #24
                    Originally posted by Cole Trapnell View Post
                    The thread you linked to doesn't show a bug, just an incorrect use of Cufflinks, as I just noted in that thread.
                    Hi Cole, as reflected in the thread, it seems this not an incorrect use.

                    From the cufflinks manual: -g option
                    Output will include all reference transcripts as well as any novel genes and isoforms that are assembled.
                    I was wondering if there is any solution or resolution, as many others have reported the same issue from this thread
                    HTML Code:

                    I also reported another issue in the same thread, can you please look and comment on it.

                    Originally posted by epi View Post
                    Hi Cole, Thanks again for replying. While your comment clarifies some things, it also raises some more question. First here is my command
                    I am still unclear why there are new predictions (with CUFF. IDs) that are identical to existing annotations. I have manually looked at one of the test runs and all new predictions (with CUFF. IDs) were original genes/transcripts in the source GTF annotation (I must have looked only about 20-25 ). So far I have not encountered any novel predictions which do not correspond to existing annotation. In some cases, original annotation still exists and in some cases not. Similar to the example about Trib3


                    • #25
                      I got another issue with new CUFFLINK 2:

                      When I directly quantify against ensembl gtf, the cufflinks returned 0 expression for most of them. This only occurred when I used replicates. single sample group is fine. And seems only when transcripts matched to known gene's annotation.

                      cufflinks-2.0.0.Linux_x86_64/cuffdiff -p 8 -L P1,P2 -c 1 -b anFam2.fa -o cuffdiff.P1.P2.ensembl canFam2.67.gtf TOPHAT2.C1.bam,TOPHAT2.C2.bam,TOPHAT2.C3.bam,TOPHAT2.C4.bam TOPHAT2.C5.bam,TOPHAT2.C6.bam,TOPHAT2.C7.bam,TOPHAT2.C8.bam,TOPHAT2.C9.bam

                      Here are the number of genes returned FPKM 0 in cuffdiff:

                      $8 is for treatment P1, $9 is for treatment P2 in output.

                      awk ' $8 ==0 { i++}; END {print i " of " NR " = " i/NR*100 "%"} ' cuffdiff.P1.P2.ensembl/gene_exp.diff


                      P1: 24649 of 24661 = 99.9513%
                      P2: 24645 of 24661 = 99.9351%

                      cufflinks 1.3.0 seems right:

                      P1: 6345 of 24661 = 25.7289%
                      P2: 6564 of 24661 = 26.6169%


                      • #26
                        Cuffdiff zero FPKM with reference GFF

                        I have the exact same problem of having zero FPKM values for all genes that have a reference transcript. In my case, the GFF reference file comes from phytozome. I have run it through gffread to be sure it is being interpreted correctly.

                        If I run the samples through as individual samples rather than biological replicates in groups then things seem OK. Using previous versions of cufflinks/diff this worked fine.

                        My process was tophat2-->cufflinks2-->cuffmerge-->cuffdiff

                        I'm using version v2.0.0 of cufflinks and tophat-2.0.0.Linux_x86_64
                        Last edited by natstreet; 05-20-2012, 02:18 AM.


                        • #27
                          Cufflinks2 and DeSeq


                          I have been working with the Cufflinks2 Vs. DESeq issue for a while now and thus decided to test how both packages performed with my own data.

                          Essentially DESeq appears to be more conservative producing 73 DE genes to 173 from cufflinks at a cutoff of p=0.05

                          However of the 73 genes detected by DESeq 55 of them also appeared in the list produced by Cufflinks.

                          Thus, it looks to me that the 2 methods are producing broadly the same result.

                          I am still unsure which of the results to 'trust' more, however, I think areyes point about combining the approaches is a good move as I now have a fair bit of confidence that in the genes that at DE in both lists really are DE.


                          lshen and natstreet

                          The Binary release of cufflinks 2.0.0 has an issue with cuffdiff where it assigns lots of zeros and other strange counts. This has been fixed for the binary release in 2.0.1 and later see:


                          • #28
                            Wow, DESeq must be very, very conservative because the new version of Cufflinks is incredibly conservative. The number of differentially expressed genes went from the high hundreds to just 5 or so when I upgraded to the new version of Cufflinks.


                            • #29
                              Sorry Simon, would you mind commenting on the new Cufflinks and Cuffdiff? I don't mean to put you on the spot, since there is clearly a competition aspect of this, but in this nascent field, there are so few experts that can comment authoritatively on this.

                              Apart from increased knowledge, I ask for my own selfish purposes because I've fallen victim to what turnersd has said: "I've heard far too many examples of people using tophat-cufflinks-cuffmerge-cuffdiff for their RNA-seq workflow, and getting extremely different results from one version of cufflinks to the next."


                              • #30
                                Originally posted by billstevens View Post
                                Sorry Simon, would you mind commenting on the new Cufflinks and Cuffdiff? I don't mean to put you on the spot, since there is clearly a competition aspect of this, but in this nascent field, there are so few experts that can comment authoritatively on this.

                                My current strategy is to use cufflinks/cuffdiff for isoform level differences, DEXSeq for differential exon usage, and DESeq for differential expression at the gene level, then presenting all the results to the core's users while explaining the limitations, caveats, gotchas, and known reproducibility issues with each software/method.

                                Further, I've spoken with others recently about alternative workflows. E.g., tophat for alignment, cufflinks or trinity for assembly, RSEM or eXpress for getting counts for isoforms, then edgeR or DESeq for differential expression testing. There are several permutations of this workflow, and I'm hesitant to use any of them in my core until I see something published, but the ideas are worth entertaining.
                                Last edited by turnersd; 08-14-2012, 10:22 AM. Reason: clarity


                                Latest Articles


                                • seqadmin
                                  Current Approaches to Protein Sequencing
                                  by seqadmin

                                  Proteins are often described as the workhorses of the cell, and identifying their sequences is key to understanding their role in biological processes and disease. Currently, the most common technique used to determine protein sequences is mass spectrometry. While still a valuable tool, mass spectrometry faces several limitations and requires a highly experienced scientist familiar with the equipment to operate it. Additionally, other proteomic methods, like affinity assays, are constrained...
                                  04-04-2024, 04:25 PM
                                • seqadmin
                                  Strategies for Sequencing Challenging Samples
                                  by seqadmin

                                  Despite advancements in sequencing platforms and related sample preparation technologies, certain sample types continue to present significant challenges that can compromise sequencing results. Pedro Echave, Senior Manager of the Global Business Segment at Revvity, explained that the success of a sequencing experiment ultimately depends on the amount and integrity of the nucleic acid template (RNA or DNA) obtained from a sample. “The better the quality of the nucleic acid isolated...
                                  03-22-2024, 06:39 AM





                                Topics Statistics Last Post
                                Started by seqadmin, 04-11-2024, 12:08 PM
                                0 responses
                                Last Post seqadmin  
                                Started by seqadmin, 04-10-2024, 10:19 PM
                                0 responses
                                Last Post seqadmin  
                                Started by seqadmin, 04-10-2024, 09:21 AM
                                0 responses
                                Last Post seqadmin  
                                Started by seqadmin, 04-04-2024, 09:00 AM
                                0 responses
                                Last Post seqadmin