Announcement

Collapse
No announcement yet.

HTseq to DeSeq/EdgeR to Heatmap

Collapse
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • HTseq to DeSeq/EdgeR to Heatmap

    Hi Everyone,

    I am new with DeSeq and EdgeR. I have done the following:

    1) Tophat2 (provided Ensembl GTF and Fasta files)
    2) Used Samtools to convert the files to Sam format (offline)
    3) HTseq and used the same GTF file from Tophat2. (offline)
    4) Used Iplant Collaborative Join tab delimited to merge the Counts into one file. -Edited/removed no_feature, ambiguous, too_low_aQual:, not_aligned, alignment_not_unique in the output file and added Organs headers for each rep.
    5) Used EdgeR and DeSeq respectively on Iplant Collaborative.

    Now I was wondering how could I possibly create heat map based on the genes that I interested based on EdgeR and DeSeq. Or is there any better method to analyze these genes via EdgeR and DeSeq.

    I have already used Tuxedo pipeline to create a heat map using CummeRbund for the species.

    I have 2 replicates for Organ A and 2 replicates for Organ B. The p values and log information seem to indicate high values for the genes that I am interested in.

    In total I hope to have 3 Heatmaps with the specific genes of interests only:

    1) EdgeR
    2) DeSeq
    3) Tuxedo Package - Cuffdiff/CummeRbund.

    If someone could kindly help me on how I could go about this. I would really appreciate it. Thank you.
    Last edited by Zapages; 02-02-2014, 07:20 PM.

  • #2
    There are instructions for making heatmaps in the DESeq(2) vignette (and likely the edgeR vignette as well).

    In addition, there are the heatmap and heatmap.2 functions in R (the latter is in the gplot package).

    Comment


    • #3
      Thank you, dpryan. Its my first time trying to create a heat map via R for edgeR and DeSeq. Is there any guide that I should follow?

      The CummeRbund guide was very helpful when I was following the tuxedo workflow.

      I did find this guide for using heat map.2: http://davetang.org/muse/2010/12/06/...eatmap-with-r/

      Is there anything specific I should do with my output counts file or edgeR and DeSeq output file from Iplant Collaborative to make it ready for creating a heat map via R. Is there normalization step or something like that?

      My original combined tab limited values have something like this

      Gene ID | Organ A | Organ A | Organ B | Organ B|
      GeneA | 1 | 2 | 2520 | 2414 |

      I did see that there was command in EdgeR manual but it was not helpful for a new timer like myself. As it does not really show an output of a heat map.

      http://www.bioconductor.org/packages...UsersGuide.pdf

      Sorry for so many amateur questions.

      Comment


      • #4
        Hi guys,

        I have been able to figure out how to create a heat maps for the 100 genes or something like that. But I am still struggling on how to create your own heat maps based on the genes of interest.

        Some one could kindly help me on this. I would really appreciate it.

        So far this has been for DESeq.

        This is the source that I have been following: http://cartwrightlab.wikispaces.com/DESeq

        EDIT: I just figured it out.

        But I have question, did I do it properly? Should I have normalized the data before creating the heat map?



        Code:
        > rawCounts <- read.delim("file.txt")
        > conditions <-as.factor(c("Organ1","Organ1","Organ2","Organ2"))
        > countDataSet <- newCountDataSet(rawCounts,conditions)
        > countDataSet <- estimateSizeFactors(countDataSet)
        > countDataSet <- estimateDispersions(countDataSet)
        > colors <- colorRampPalette(c("white","darkblue"))(100)
        > heatmap( vsd[select,], col = colors, scale = "none")
        > cdsBlind <- estimateDispersions( cds)
        > countDataSet -> cds
        > cdsBlind <- estimateDispersions( cds)
        > vsd <- getVarianceStabilizedData( cdsBlind )
        > select <- order(res$pval)[1:100] #for 1st hundred genes
        > heatmap( vsd[select,], col = colors, scale = "none")
        > Genes_of_Interest <-c("gene1","gene2","gene3") #for genes of interest
        > heatmap.2( vsd[Genes_of_Interest,], col=rev(heat.colors(25)),trace="none",scale="none",cexRow=0.7,cexCol=0.7)
        To anyone who is new to this, please be aware that you should install gsplot library before creating a heat map.
        Last edited by Zapages; 02-03-2014, 05:19 PM.

        Comment


        • #5
          That's likely fine. Lior Pachter had a recent blog post regarding distance metrics in the creation of heatmaps and I addressed a bit about clustering in a recent answer on biostars.

          The most important part is that you did a variance stabilising transform, which is good. I'm not sure how informative a heatmap using 3 genes would be, but I'll leave that to you.

          Comment


          • #6
            Thank you for the great help. I really appreciate it. We have other in situ analysis for these genes of interest. The other in situ techniques postulate that these genes should be highly expressed and have certain expression.

            Do you think if I do normalization in Deseq and then create a heatmap will that still be good?

            Or creating a bar graph with error will be good approach to visualizing the gene expression variance between the two organs of interest.

            Thank you.

            Comment


            • #7
              For just a few datapoints you might either just create a table or directly make a graph (dot or box plots would probably suffice). The biggest problem with heatmaps is that it's annoying to try to tell exact levels from the coloring. For large number of genes you wouldn't actually do that, so this isn't a problem, but for just a few genes of a priori biological interest...

              Comment


              • #8
                Thank you dpryan. I was wondering what should the table include?

                Should it be based on the normalized values, Binomial Test, or specific statistical test output from EdgeR/DeSeq. What's the preferred statistical test?

                Also would the table from gene_exp.diff will be fine from CummeRbund?

                As for box plots or histograms in CummeRbund its easy to create them. How should I go about creating them in R? I mean what should I base the box plot or histograms normalized values or specific statistical test values?

                Sorry for so many questions. Many Thanks for your great help and advice.

                EDIT: Could I use graphpad Prism or Excel to create the bar graphs?
                Last edited by Zapages; 02-09-2014, 06:41 AM.

                Comment


                • #9
                  Depends on what you want to show. Normalized values might be the simplest, but you could also just put up the log2 fold-change and p-value and be done with it (or you could do both in the same table, since it's small enough).

                  I can't say I recall the format of gene_exp.diff, so I couldn't say, though I think it gives confidence intervals, which would actually be quite nice.

                  Histograms use the hist() functions and I think box plots use the boxplot() function. CummeRbund is actually an R package, it doesn't do any of the graphing itself. It very likely uses the ggplot2 package, which makes quite nice graphs (I use this myself). I should note that ggplot2 is more complicated to use, but if you're making a graph for publication then it might be worth it.

                  Having said that, it'd be simpler to do with graphpad prism if you're unfamiliar with R. I would avoid Excel, it's not a very powerful program and its graphs are atrocious.

                  Comment


                  • #10
                    Thank you dpryan for all the great advice. I really appreciate it.

                    So something like this should be fine for both DESeq and EdgeR tables:

                    Gene ID|baseMean|Base Mean Sample 1|Base Mean Sample 2| Normalized value|log2 fold-change | p value | q value |

                    the gene_exp.diff file has the following information:

                    Gene ID | locus | sample_1 Name | sample_2 Name| status | value_1 |value_2 log2(fold_change) |test_stat| p_value |q_value | significant (Yes/No)

                    Yes its true, we are hoping to publish this data.

                    As for creating bar graphs. Should I just use the normalized values for the genes in Graphpad? I am learning R, any advice on creating these graphs with error bars. Although Cummerbund is fairly easy to use to make these graphs for the tuxedo pipeline. I don't think graph pad will allow me to create bar graphs with error bars unless I do a statistical test. Any statistical test that you would recommend for Graphpad or through DESeq/EdgeR?

                    Is the binomial test fine for DESeq? Can that be then be used for creating a bar/histogram?

                    Sorry for so many questions. Thank you again for the great help.

                    Comment


                    • #11
                      How Zapages How do you compare your DE result from DEseq/edgeR/cuffdiff? By which statistical variable? I am a rookie in RNA-seq. Thank you!

                      Comment


                      • #12
                        Are you going untargeted? Or have you decided an a priori set of genes?

                        This is a ok I guess:

                        edgeR: (You can use the "togTags" command to get these)

                        Gene ID, logCPM, logFC, test-stat, p-value, (FDR if untargeted)


                        Cuffdiff:

                        Gene ID, value_1, value_2 log2(fold_change) |test_stat| p_value, (FDR if untargeted)



                        Cuffdiff also gives confidence intervals, standard deviation and dispersion estimates. Might be nice if you only look a some genes. Might be overkill if >1000 genes.

                        Comment


                        • #13
                          Originally posted by super0925 View Post
                          How Zapages How do you compare your DE result from DEseq/edgeR/cuffdiff? By which statistical variable? I am a rookie in RNA-seq. Thank you!
                          I am currently debating, which statistical tests will be best to use. I am thinking of doing a negative binomial test as I am comparing the expression presence between two organs. Cuffdiff already does a statistical test via its algorithm.

                          On top of this, I am thinking of doing a normalization of the counts to see how the data appears in normalized data sets. I will be using the Benjamini-Hochberg false discovery rate of 0.01 for EdgeR based on a recent paper . I am going to apply the same principle on DeSEQ as well.

                          Overall goal is to illustrate that these three genes based on three independent RNA-Seq Analysis tools yield the same results of these certain genes are being highly expressed in Organ A versus Organ B. I am working on other in situ models and analysis to indicate that these genes share a common expression.

                          Also I created the counts using HTSEQ.

                          Originally posted by sindrle View Post
                          Are you going untargeted? Or have you decided an a priori set of genes?

                          This is a ok I guess:

                          edgeR: (You can use the "togTags" command to get these)

                          Gene ID, logCPM, logFC, test-stat, p-value, (FDR if untargeted)


                          Cuffdiff:

                          Gene ID, value_1, value_2 log2(fold_change) |test_stat| p_value, (FDR if untargeted)



                          Cuffdiff also gives confidence intervals, standard deviation and dispersion estimates. Might be nice if you only look a some genes. Might be overkill if >1000 genes.
                          Thank you, I really appreciate it here. I am actually looking at about 3 to 5 genes. There a few duplicates of certain genes. Is following something like this good idea for recalling the toptags. I think I have been using the iplant's edgeR and it has not provided me with tags information. So I am going to try it on my laptop (16 GB of ram), that should be fine?

                          Source for EdgeR: http://rnajournal.cshlp.org/content/....DC1/edgeR.pdf

                          Thank you again for all the great help.

                          Where exactly the confidence intervals, SD, Dispersion estimates in the Cuffdiff output files?

                          Comment


                          • #14
                            Hi!
                            Depends on the version you use, but check out the _tracking - files

                            Comment


                            • #15
                              Thank you! I will try your method to compare the performance of EdgeR/DeSeq/Tuxedo

                              Comment

                              Working...
                              X