Announcement

Collapse
No announcement yet.

HTseq to DeSeq/EdgeR to Heatmap

Collapse
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • #16
    Originally posted by Zapages View Post
    I am currently debating, which statistical tests will be best to use. I am thinking of doing a negative binomial test as I am comparing the expression presence between two organs. Cuffdiff already does a statistical test via its algorithm.

    On top of this, I am thinking of doing a normalization of the counts to see how the data appears in normalized data sets. I will be using the Benjamini-Hochberg false discovery rate of 0.01 for EdgeR based on a recent paper . I am going to apply the same principle on DeSEQ as well.

    Overall goal is to illustrate that these three genes based on three independent RNA-Seq Analysis tools yield the same results of these certain genes are being highly expressed in Organ A versus Organ B. I am working on other in situ models and analysis to indicate that these genes share a common expression.

    Also I created the counts using HTSEQ.



    Thank you, I really appreciate it here. I am actually looking at about 3 to 5 genes. There a few duplicates of certain genes. Is following something like this good idea for recalling the toptags. I think I have been using the iplant's edgeR and it has not provided me with tags information. So I am going to try it on my laptop (16 GB of ram), that should be fine?

    Source for EdgeR: http://rnajournal.cshlp.org/content/....DC1/edgeR.pdf

    Thank you again for all the great help.

    Where exactly the confidence intervals, SD, Dispersion estimates in the Cuffdiff output files?
    Hi I found a big problem that gene ID output from Cuffdiff and egdeR are different so that I cannot find the common DE genes.
    I use the data set with the Drosophila melanogaster genome.
    The edgeR output gene ID (e.g. FBgn0000370,FBgn0000500,...)
    But the Cuffdiff output gene ID is something like (XLOC_000028,XLOC_000038,...) and gene symbol (e.g.,KH1,RpLP1,...).
    Could you please recommend any software or command to translate them automatically? or Could you have any good idea for me to generate the same gene ID?
    Thank you!
    Last edited by super0925; 03-07-2014, 06:40 AM.

    Comment


    • #17
      Originally posted by super0925 View Post
      Hi I found a big problem that gene ID output from Cuffdiff and egdeR are different so that I cannot find the common DE genes.
      I use the data set with the Drosophila melanogaster genome.
      The edgeR output gene ID (e.g. FBgn0000370,FBgn0000500,...)
      But the Cuffdiff output gene ID is something like (XLOC_000028,XLOC_000038,...) and gene symbol (e.g.,KH1,RpLP1,...).
      Could you please recommend any software or command to translate them automatically? or Could you have any good idea for me to generate the same gene ID?
      Thank you!
      Please use HT-Seq and use it on the BAM file outputs from Tophat2 with the GTF reference file from Ensembl. This is for EdgeR.

      Furthermore, please use Ensembl Fasta reference chromosome and reference GTF for Tophat2/Cufflinks/Cuffmerge/Cuffdiff/cummeRbund.

      This will allow you to have the same transcript ids. (FBtr#) across both methods.

      I hope this helps a bit.
      Last edited by Zapages; 03-11-2014, 06:51 AM.

      Comment


      • #18
        Originally posted by Zapages View Post
        Please use HT-Seq and use it on the BAM file outputs from Tophat2 with the GTF reference file from Ensembl. This is for EdgeR.

        Furthermore, please use Ensembl Fasta reference chromosome and reference GTF for Tophat2/Cufflinks/Cuffmerge/Cuffdiff/cummeRbund.

        This will allow you to have the same transcript ids. (FBtr#) across both methods.

        I hope this helps a bit.
        That won't help the situation at all. Yes, the Ensembl annotations are usually better, but that's not the source of the problem, which is (1) summarizing expression on a different feature between the two and (2) the introduction of novel features by cufflinks. The appropriate solution is to simply use the same annotation file with htseq-count as is used by cuffdiff (i.e., merged.gtf).

        Comment


        • #19
          Originally posted by dpryan View Post
          That won't help the situation at all. Yes, the Ensembl annotations are usually better, but that's not the source of the problem, which is (1) summarizing expression on a different feature between the two and (2) the introduction of novel features by cufflinks. The appropriate solution is to simply use the same annotation file with htseq-count as is used by cuffdiff (i.e., merged.gtf).
          That should work as well. Thank you for pointing it out.

          Anyway, I have used the Ensembl reference files (fasta and GTF) for my D. melanogaster data sets and it appears to be fine with flybase genome transcripts as their indicators for different genes in both DESeq and Cuffdiff.

          Comment


          • #20
            Originally posted by Zapages View Post
            I am currently debating, which statistical tests will be best to use. I am thinking of doing a negative binomial test as I am comparing the expression presence between two organs. Cuffdiff already does a statistical test via its algorithm.

            On top of this, I am thinking of doing a normalization of the counts to see how the data appears in normalized data sets. I will be using the Benjamini-Hochberg false discovery rate of 0.01 for EdgeR based on a recent paper . I am going to apply the same principle on DeSEQ as well.

            Overall goal is to illustrate that these three genes based on three independent RNA-Seq Analysis tools yield the same results of these certain genes are being highly expressed in Organ A versus Organ B. I am working on other in situ models and analysis to indicate that these genes share a common expression.

            Also I created the counts using HTSEQ.



            Thank you, I really appreciate it here. I am actually looking at about 3 to 5 genes. There a few duplicates of certain genes. Is following something like this good idea for recalling the toptags. I think I have been using the iplant's edgeR and it has not provided me with tags information. So I am going to try it on my laptop (16 GB of ram), that should be fine?

            Source for EdgeR: http://rnajournal.cshlp.org/content/....DC1/edgeR.pdf

            Thank you again for all the great help.

            Where exactly the confidence intervals, SD, Dispersion estimates in the Cuffdiff output files?

            What I get to know so far is that:
            1. Cuffdiff cannot do unbalance analysis:
            e.g. 3 control samples vs 4 case samples
            or different time point or replicates. It could only do the pairwise analysis but edgeR/DESeq could do it.
            2. Cuffdiff is more conservative than DESeq/edgeR (if you don't adjust FDR )
            3. Cuffdiff provides more information e.g. TSS, Transcripts, CDS but DESeq/edgeR only export DE genes.
            Am I right? I am not sure.
            And I haven't summarize advantage vs disadvantage of RFKM vs Count.
            Do you have any ideas?
            Last edited by super0925; 03-12-2014, 02:56 AM.

            Comment


            • #21
              Originally posted by dpryan View Post
              That won't help the situation at all. Yes, the Ensembl annotations are usually better, but that's not the source of the problem, which is (1) summarizing expression on a different feature between the two and (2) the introduction of novel features by cufflinks. The appropriate solution is to simply use the same annotation file with htseq-count as is used by cuffdiff (i.e., merged.gtf).
              Hi I have another naive question.
              In the classic paper introduced Cufflinks,http://www.nature.com/nprot/journal/....2012.016.html
              tophat is used as
              tophat -p 8 -G genes.gtf -o C1_R1_thout genome C1_R1_1.fq C1_R1_2.fq
              tophat -p 8 -G genes.gtf -o C1_R2_thout genome C1_R2_1.fq C1_R2_2.fq
              tophat -p 8 -G genes.gtf -o C2_R1_thout genome C2_R1_1.fq C1_R1_2.fq
              tophat -p 8 -G genes.gtf -o C2_R2_thout genome C2_R2_1.fq C1_R2_2.fq
              they mapping the reads per condition per replicate ,and then cuffmerge,to generate the merged.gtf and then do the Cuffdiff,

              But on the classic paper introduced edgeR, http://www.nature.com/nprot/journal/....2013.099.html
              tophat is used as
              tophat2 -G genes.gtf -p 5 -o Untreated genome \ SRR031714_1.fastq,SRR031715_1.fastq SRR031714_2.fastq,SRR031715_2.fastq
              tophat2 -G genes.gtf -p 5 -o treated genome \ SRR031724_1.fastq,SRR031725_1.fastq SRR031724_2.fastq,SRR031725_2.fastq

              I think SRR031714,SRR031715 are replicates.

              They generated the mapping result per sample and then go to HTseq to calculate the Counts and then do the edgeR.

              So my question is what is the difference between these two tophat strategy (per sample per replicate VS per sample ) and is it really essential to mapping with replicate 1 by 1 when we use tophat-cufflinks-cuffmerge-cuffdiff pipeline?
              Last edited by super0925; 03-12-2014, 04:29 AM.

              Comment


              • #22
                SRR031714 and SRR031715 are technical replicates (this is not obvious, have a look here), so merging the alignments together prior to counting makes sense.

                In both of the cases you mentioned they processed each biological sample as a unit, so they end up being the same. Remember that in RNAseq technical replicates can simply be merged together.

                Comment


                • #23
                  Originally posted by dpryan View Post
                  SRR031714 and SRR031715 are technical replicates (this is not obvious, have a look here), so merging the alignments together prior to counting makes sense.

                  In both of the cases you mentioned they processed each biological sample as a unit, so they end up being the same. Remember that in RNAseq technical replicates can simply be merged together.
                  So do you mean C1_R1 and C1_R2 are two samples but not the replicates of one sample in condition1(C1)?
                  And the replicates in each sample are better to be merged in Tophat before doing the Cufflinks. Thank you!

                  Comment


                  • #24
                    Yes, C1_R1 and C1_R2 are biological replicates (i.e., independent samples) of the C1 group. At least that's what appears to be the case from a cursory look through the paper. That would also make sense given how the data ought to normally be processed.

                    Comment


                    • #25
                      Originally posted by dpryan View Post
                      Yes, C1_R1 and C1_R2 are biological replicates (i.e., independent samples) of the C1 group. At least that's what appears to be the case from a cursory look through the paper. That would also make sense given how the data ought to normally be processed.
                      Hi dpryan
                      For my summary compare the software, what I get to know so far is that:
                      1. Cuffdiff cannot do unbalance analysis:
                      e.g. 3 control samples vs 4 case samples
                      or different time point or replicates. It could only do the pairwise analysis but edgeR/DESeq could do it.
                      2. Cuffdiff is more conservative than DESeq/edgeR (if you don't adjust FDR )
                      3. Cuffdiff provides more information e.g. TSS, Transcripts, CDS but DESeq/edgeR only export DE genes.
                      Am I right? I am not sure.
                      And I haven't summarize advantage vs disadvantage of RFKM vs Count.
                      Do you have any ideas?

                      The second question is that : I found many paper's are talking false positive rate . It is another method to measure the software. So far I only know we could draw Venny Plot to observe the sharing significant Differential expressed(DE) genes.
                      What is the false positive? I think it represent number of non DE genes divided by total number of genes.
                      But so far I don't know how to find the FPR from the output of edgeR/Cuffdiff, cos they only output the gene ID and their P-value.
                      Thank you!

                      Comment


                      • #26
                        1. I've never heard that cuffdiff can't handle unbalanced designs before, though I guess I've never tried.
                        2. I'm not sure how true that is (also, it'll be completely version dependent).
                        3. That's true, DESeq/edgeR are only targeted toward DE analysis. The actual reality is that it's often biologically unclear what it means when a given gene switches isoform use...

                        RPKM/FPKM are largely going to be replaced by TPM (transcripts per million). The benefit of RPKM/FPKM and TPM is that you can use an EM algorithm to come up with values, which is really helpful if you're interested in looking at transcripts. FPKM/RPKM values have some downsides in that they lead you to believe that you've properly corrected for library size differences, when you actually haven't robustly. Counts have the benefit that you're not masking information by dividing by transcript/gene/whatever length. Some people mistakenly believe that the count-based method is "biased" toward longer genes. This is incorrect since that's not a real bias (in the common rather than technical meaning), that's a benefit. Etc.

                        You're understanding of the false positive rate is incorrect. A false positive is a feature (gene, transcript, whatever) declared to be significantly different that isn't actually. There's also "true positive", "false negative" and "true negative". These are general statistical terms that you'd do well to familiarize yourself with. To actually find these values you either need to perform a simulation or, better yet, use real data and then perform independent validation (whether you use independent samples or not will depend on the goals of the validation, though using completely independent samples will yield more predictive results).

                        Comment


                        • #27
                          Originally posted by dpryan View Post
                          1. I've never heard that cuffdiff can't handle unbalanced designs before, though I guess I've never tried.
                          2. I'm not sure how true that is (also, it'll be completely version dependent).
                          3. That's true, DESeq/edgeR are only targeted toward DE analysis. The actual reality is that it's often biologically unclear what it means when a given gene switches isoform use...

                          RPKM/FPKM are largely going to be replaced by TPM (transcripts per million). The benefit of RPKM/FPKM and TPM is that you can use an EM algorithm to come up with values, which is really helpful if you're interested in looking at transcripts. FPKM/RPKM values have some downsides in that they lead you to believe that you've properly corrected for library size differences, when you actually haven't robustly. Counts have the benefit that you're not masking information by dividing by transcript/gene/whatever length. Some people mistakenly believe that the count-based method is "biased" toward longer genes. This is incorrect since that's not a real bias (in the common rather than technical meaning), that's a benefit. Etc.

                          You're understanding of the false positive rate is incorrect. A false positive is a feature (gene, transcript, whatever) declared to be significantly different that isn't actually. There's also "true positive", "false negative" and "true negative". These are general statistical terms that you'd do well to familiarize yourself with. To actually find these values you either need to perform a simulation or, better yet, use real data and then perform independent validation (whether you use independent samples or not will depend on the goals of the validation, though using completely independent samples will yield more predictive results).
                          Thank you , I got the False positive meaning . It is helpful.
                          But I am still confused how to get the FPR value (what you said "use real data and then perform independent validation")? because from the output of edgeR/DESeq I only find the DE gene list with their P-value. Could you please illustrate it in detail?
                          Last edited by super0925; 03-13-2014, 05:39 AM.

                          Comment


                          • #28
                            Sure, so there are two approaches. The computational approach would be to generate some data where you know the underlying genes that are DE and then feed that into the programs and see how their output matches what you "know" to be the case. This method suffers from the down-side that it's completely dependent on how well your simulation data matches what's realistic...and getting this really correct is not always trivial.

                            The other method is the wet-lab approach. Take a handful of candidate genes that the various packages have called DE and then do some qPCR. To get a better idea how predictive that is, use different samples than you did for the sequencing. Also, don't bother with genes with a small fold-change, you're unlikely to reliably see small differences with qPCR.

                            Comment


                            • #29
                              Originally posted by dpryan View Post
                              Sure, so there are two approaches. The computational approach would be to generate some data where you know the underlying genes that are DE and then feed that into the programs and see how their output matches what you "know" to be the case. This method suffers from the down-side that it's completely dependent on how well your simulation data matches what's realistic...and getting this really correct is not always trivial.

                              The other method is the wet-lab approach. Take a handful of candidate genes that the various packages have called DE and then do some qPCR. To get a better idea how predictive that is, use different samples than you did for the sequencing. Also, don't bother with genes with a small fold-change, you're unlikely to reliably see small differences with qPCR.
                              Thank you!Could you please talk about the first method in detail.
                              If now my data is real data (e.g. mouse,cow) How could I know the “real” DE genes before doing the DE analysis by edgeR/DESeq?and then compare the false positive numbers? I am still confused.

                              Comment


                              • #30
                                With the first method you wouldn't use real data. You would need to simulate data (i.e., generate some with a model of some sort).

                                Comment

                                Working...
                                X