HTseq to DeSeq/EdgeR to Heatmap


  • #31
    Originally posted by dpryan View Post
    With the first method you wouldn't use real data. You would need to simulate data (i.e., generate some with a model of some sort).
    How do I generate simulated data for this comparison?
    Do I generate a counts table like the HTSeq output?
    And how do I define the "real" DE genes before running edgeR? I am sorry for this naive question!

    Comment


    • #32
      Hi all,
      I am working with DESeq and got to the point of calling nbinomTest. However, I am getting the error below and cannot see why.
      A quick primer on my work: I have 7 samples with no replicates.
      Here is what I used:
      > cds1<-estimateDispersions(cds1,method="blind",sharingMode="fit-only",fitType="local")
      > res<-nbinomTest(cds1,"A","B","C","D","E","F","G")
      Error in nbinomTest(cds1, "A", "B", "C", "D", :
      unused arguments ("E", "F", "G")
      Any suggestions?
      My condition argument matches the sample names, and there are 7 of them. I do not understand what "unused arguments" means or why they are unused. I don't have replicates and have asked it to "fit". Please help!
      geneart.

      Comment


      • #33
        Originally posted by dpryan View Post
        ....FPKM/RPKM values have some downsides in that they lead you to believe that you've properly corrected for library size differences, when you actually haven't robustly.
        ...
        Some people mistakenly believe that the count-based method is "biased" toward longer genes. This is incorrect since that's not a real bias (in the common rather than technical meaning), that's a benefit. Etc.
        ...
        .
        Could you please elaborate on these two points?

        Comment


        • #34
          Originally posted by super0925 View Post
          What I get to know so far is that:
          1. Cuffdiff cannot do unbalanced analyses, e.g. 3 control samples vs 4 case samples, or different time points or replicates. It can only do pairwise analysis, but edgeR/DESeq can do it.
          2. Cuffdiff is more conservative than DESeq/edgeR (if you don't adjust the FDR).
          3. Cuffdiff provides more information, e.g. TSS, transcripts and CDS, but DESeq/edgeR only export DE genes.
          Am I right? I am not sure.
          And I haven't summarized the advantages vs disadvantages of RPKM vs counts.
          Do you have any ideas?
          1. I haven't heard about this, and there was another thread here that seems to work with no problem: http://seqanswers.com/forums/showthread.php?t=16528
          Likewise, in my own work I haven't experienced any problems with unbalanced data sets.
          2. I have heard about this too, from the Biostars forums. Here's a good paper that compared this as well: http://genomebiology.com/2013/14/9/R95
          3. Here are two good references that address this: http://www.nature.com/nbt/journal/v3.../nbt.2450.html and http://gettinggeneticsdone.blogspot....cuffdiff2.html

          I believe both methods have their pros and cons, so I run both: using two different approaches gives you more independent statistical tests to draw on when validating your assessment of which genes are differentially expressed.

          I know a bioinformatician who runs the following pipelines when publishing their data:
          NGS - RNA-seq sequencing
          SHRiMP - Tuxedo
          TopHat - Tuxedo
          SHRiMP - HTSeq - edgeR/DESeq
          TopHat - HTSeq - edgeR/DESeq

          That is, they used six different tools in total.

          Then, on top of that, they did qPCR. The problem with qPCR is that you have to design the primers from scratch, which can become a very difficult task, and you have to hope everything worked properly with RT-PCR first before doing the qPCR.

          FPKM/RPKM accounts for a gene's introns and exon junctions. A good way to visualize this is to download IGV or to check FlyBase's genome browser. In IGV, load a BAM file together with the Cufflinks GTF file. This will illustrate that certain regions of a gene can have more highly expressed exons than other exons of the same gene.

          As part of my research, I am investigating this by looking at certain large (1 kb+) introns that are conserved alongside differential expression of exons within the same annotated gene. This should lead to better gene annotation and a better understanding of splice junctions.

          Count-based methods assume that the genome annotation you provide when building the counts file from your BAM files is correct, but in reality certain exons are more highly expressed than other exons of the same gene.

          Originally posted by geneart View Post
          Hi all,
          I am working with DESeq and got to the point of calling nbinomTest. However I am getting this error ...
          > res<-nbinomTest(cds1,"A","B","C","D","E","F","G")
          Error in nbinomTest(cds1, "A", "B", "C", "D", :
          unused arguments ("E", "F", "G")
          ...
          Are your seven samples divided into 2 groups, e.g. control vs treated, or different organs?

          Please check here at step 9: http://cartwrightlab.wikispaces.com/DESeq
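
          For reference, nbinomTest() in the original DESeq compares exactly two condition labels at a time, so the seven samples need to be grouped into a two-level condition factor. A minimal sketch, assuming a genes-by-samples counts matrix `countTable` and hypothetical group labels:

          ```r
          library(DESeq)  # the original DESeq package, not DESeq2

          # Hypothetical: 'countTable' is a counts matrix for the 7 samples
          conds <- c("ctrl", "ctrl", "ctrl", "trt", "trt", "trt", "trt")  # exactly two levels
          cds <- newCountDataSet(countTable, conds)
          cds <- estimateSizeFactors(cds)
          # Without replicates, the blind/fit-only settings from the post are needed
          cds <- estimateDispersions(cds, method = "blind",
                                     sharingMode = "fit-only", fitType = "local")
          res <- nbinomTest(cds, "ctrl", "trt")  # the CountDataSet plus TWO condition labels
          ```

          If you need every pairwise contrast instead of one two-group test, you would call nbinomTest() once per pair of labels.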

          Comment


          • #35
            DESeq analysis using HTSeq count data: Problems
            I'm following the DESeq guide to perform a differential expression analysis on my HTSeq count data across 7 samples together, without biological replicates. I am successful up to the estimateDispersions step. Then I tried the nbinomTest command as indicated in the DESeq guide, using my HTSeq count data as follows:
            > cds <- newCountDataSetFromHTSeqCount( countTable, condition1 )
            Error in file(file, "rt") : invalid 'description' argument
            > cds2 = cds[ ,c( "A","B","C","D","E","F","G" ) ]
            > cds2 = estimateDispersions(cds2,method="blind",sharingMode="fit-only",fitType="local")
            > res2=nbinomTest(cds2, "A","B","C","D","E","F","G")
            Error in nbinomTest(cds2, "A", "B", "C", "D", :
            unused arguments ("E", "F", "G")
            # I had previously described my condition1
            > res2=nbinomTest(cds2,condition1)
            Error in match(x, table, nomatch = 0L) :
            argument "condB" is missing, with no default

            My question is: this typically works if I use just two samples without biological replicates. However, if I have more than two samples and no biological replicates, but still need to see differential expression between them, why do I get those errors? All the samples are the same condition (for example, all treated), but I want to see whether there is any differential expression amongst them.
            Please can anyone help?
            geneart.

            Comment


            • #36
              Originally posted by geneart View Post
              > cds <- newCountDataSetFromHTSeqCount( countTable, condition1 )
              Error in file(file, "rt") : invalid 'description' argument
              You should fix that problem before concerning yourself with anything else. Presumably you have a wrong filename somewhere. Perhaps fixing that solves everything else.
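
              For what it's worth, newCountDataSetFromHTSeqCount() does not take a pre-built counts matrix; it expects a sample table (a data.frame naming each htseq-count output file) plus the directory containing those files. A sketch with hypothetical file and condition names:

              ```r
              library(DESeq)

              # Columns must be sampleName, fileName, condition; names here are hypothetical
              sampleTable <- data.frame(
                sampleName = c("A", "B", "C", "D"),
                fileName   = c("A.counts", "B.counts", "C.counts", "D.counts"),
                condition  = c("ctrl", "ctrl", "trt", "trt")
              )
              cds <- newCountDataSetFromHTSeqCount(sampleTable, directory = "counts/")
              ```

              The "invalid 'description' argument" error in file() is consistent with passing something other than file names in that second column.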

              Comment


              • #37
                Originally posted by dpryan View Post
                With the first method you wouldn't use real data. You would need to simulate data (i.e., generate some with a model of some sort).
                Hi dpryan
                I found a blog post discussing RPKM/FPKM vs counts. Counts, as I understand them, are simply how many reads map to each gene.
                http://liorpachter.wordpress.com/201...-significance/

                The Cuffdiff author argues there that count-based methods calculate fold-changes incorrectly.

                (1)
                Is he correct? If he is, why are there still lots of works/pipelines based on count-based methods? Does it mean that although the fold change is not correct, the ranking of DE genes is still correct (e.g. the top 500 lowest p-values from DESeq/edgeR)?

                (2)
                BTW, I am still confused about how to generate simulated data to observe the FPR of each method. Is it, as I said, a matter of generating a counts table like the HTSeq output? But how do I define the "real" DE genes before running edgeR? I am sorry for this naive question! ...
                Last edited by super0925; 03-18-2014, 09:09 AM.

                Comment


                • #38
                  (1) To steal a word from my German colleagues, "Jein" (yes and no). The newer versions of cuffdiff2 are likely able to produce more accurate fold-changes than the older (this is a pivotal word here) count-based methods. The reality is that (A) newer count-based methods like DESeq2 use shrinkage when calculating fold-changes, meaning this may no longer be the case and (B) the earlier versions of cuffdiff produced some truly terrible results (leading to a lot of "once bitten twice shy" behaviour in the community). I should also note that the fold-changes were more likely to be off when they were more extreme. The count-based methods are also MUCH faster and generally yield accurate enough results. The saying "don't let the perfect be the enemy of the good" comes to mind. I'll also add that the count-based methods are vastly easier to debug when you get weird results.

                  (2) This is a non-trivial problem and (not to be rude) very likely beyond your capabilities at the moment. You're best off just trying a few different programs and seeing what validates down-stream to judge what's working best with your data.
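
                  That said, the basic shape of such a simulation can be sketched: draw counts from a negative binomial, spike a known fold-change into a chosen subset of genes, and check how much of a method's DE list falls inside that known subset. A minimal, hypothetical example in base R (all parameter values are arbitrary):

                  ```r
                  set.seed(1)
                  ngenes <- 10000
                  de     <- sample(ngenes, 500)          # the "true" DE genes, known by construction
                  mu     <- rexp(ngenes, rate = 1/500)   # baseline mean count per gene
                  fc     <- rep(1, ngenes); fc[de] <- 4  # 4-fold up in group B for the DE genes

                  sim <- function(means, n = 3, size = 0.25) {
                    # negative binomial counts; 'size' controls the dispersion
                    replicate(n, rnbinom(length(means), mu = means, size = size))
                  }
                  counts <- cbind(sim(mu), sim(mu * fc))  # 3 simulated samples per group
                  # Feed 'counts' to edgeR/DESeq with groups c("A","A","A","B","B","B") and then
                  # compare the reported DE list against 'de' to estimate the FPR/FDR.
                  ```

                  Real data are messier than this (library-size differences, dispersion trends, and so on), which is part of why such comparisons are non-trivial to do convincingly.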

                  Comment


                  • #39
                    Originally posted by dpryan View Post
                    (1) To steal a word from my German colleagues, "Jein" (yes and no). The newer versions of cuffdiff2 are likely able to produce more accurate fold-changes than the older (this is a pivotal word here) count-based methods. ...

                    (2) This is a non-trivial problem ... You're best off just trying a few different programs and seeing what validates down-stream.

                    Thank you so much! So there isn't any "perfect" pipeline.
                    (1)
                    For my real data, do you suggest running 2-4 pipelines simultaneously, observing the number of common DE genes, and then choosing the pipeline most suitable for my data for downstream work, e.g. GO enrichment, pathway analysis and wet-lab validation?


                    (2)
                    BTW, I will try DESeq2 instead of DESeq; I have found that they are different packages. And for Cuffdiff2, is the usage the same as for Cuffdiff (just updated to the latest version)? I haven't found a download or usage page for Cuffdiff2; at the moment I just use the latest version of Cufflinks and run Cuffdiff. http://cufflinks.cbcb.umd.edu/tutorial.html

                    Comment


                    • #40
                      (1) Yeah, that's a good idea when you're getting started.

                      (2) Cuffdiff2 is just the most recent version of cuffdiff (unlike DESeq/DESeq2).

                      Comment


                      • #41
                        Originally posted by dpryan View Post
                        (1) Yeah, that's a good idea when you're getting started.

                        (2) Cuffdiff2 is just the most recent version of cuffdiff (unlike DESeq/DESeq2).
                        Thank you dpryan. You really are an expert.
                        Do you have any thoughts on preprocessing?
                        It is controversial: after we get the reads (single-end/paired-end), some people suggest that we don't need preprocessing (like trimming or removing adapters), while others say it is essential before TopHat. I am still confused. What is your opinion, or can you illustrate it in more detail? Thank you!
                        Last edited by super0925; 03-25-2014, 04:13 AM.

                        Comment


                        • #42
                          It's mostly a question of how aggressively to quality-trim after removing adapter sequences. I fall into the "trim gently" camp, so I trim off adapters and bases with a phred score of 5 or below. While more aggressive trimming can probably increase the mapping rate, it will generally drastically decrease the overall number of mapped reads (and generally not improve accuracy that much). This is at least the case for RNA-seq; other experiment types may be different.
                          Last edited by dpryan; 03-25-2014, 08:39 AM. Reason: Someday I'll actually start proof reading these things before hitting submit...

                          Comment


                          • #43
                            Originally posted by dpryan View Post
                            It's mostly a question of how aggressively to quality trim after removing adapter sequences. I fall into the "trim gently" camp, so I trim off adapters and bases with a phred score of 5 or below. ...
                            Thank you. I have also read a blog post about "gentle trimming" before, but I don't remember its address.
                            I trimmed my Ion Proton data with cutadapt, removing the adapter "GGCCAAGGCG" as recommended by the Ion Community.
                            But I am not sure whether the procedure is the same for Illumina data. Also, how do I remove bases with a phred score < 5, and with which software? I am sorry to ask such naive questions.

                            Comment


                            • #44
                              I think cutadapt can do quality trimming too, but if not then trim_galore can do both (it's a wrapper around cutadapt, in fact). There's also Trimmomatic, which is relatively popular. I haven't used Ion Proton reads myself, so I can't make any specific recommendations there.
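
                              As a concrete illustration, gentle trimming in the spirit of the advice above might look like the following commands. The file names are placeholders, and the adapter shown is the Ion one mentioned earlier in the thread, so substitute your own for Illumina:

                              ```shell
                              # cutadapt: remove a 3' adapter, then quality-trim 3' ends below phred 5
                              cutadapt -a GGCCAAGGCG -q 5 -o sample.trimmed.fastq sample.fastq

                              # trim_galore (a wrapper around cutadapt): same idea, with paired-end support
                              trim_galore --quality 5 --paired reads_1.fastq reads_2.fastq
                              ```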

                              Comment


                              • #45
                                I would recommend Trimmomatic as well.


                                It's very good in my experience, and it can handle paired-end reads properly, unlike some other trimmers.

                                Other trimmers I have seen and used are Scythe/Sickle and the FASTX-Toolkit.

                                Also, trim the first 10 to 14 bp at the start of each read, depending on the primer length.

                                Personally, I trim adapters, primers and any overrepresented sequences flagged by FastQC, as best I can. I take particular care over overrepresented runs of Ns.

                                All the best with your project.

                                Comment
