RNA-seq, RPKM and heatmap???

  • RNA-seq, RPKM and heatmap???

    I calculated RPKM values from my RNA-seq data. I am trying to cluster the data and explain gene expression across a time series (the time points at which my samples were taken).

    Could anybody recommend a good method for doing so?

    I am thinking of log-transforming the RPKM data and then making a heatmap, as is commonly done for microarray data. What do you think of this approach?
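
    A minimal sketch of the transformation proposed above, assuming a pseudocount of 1 to avoid log(0) and per-gene z-scoring before plotting (both are common conventions, not something stated in this thread). The gene names and RPKM values are hypothetical, and the actual heatmap drawing is left to whatever plotting tool you prefer.

```python
import math
from statistics import mean, pstdev

def log_transform(rpkm, pseudocount=1.0):
    """Return log2(RPKM + pseudocount) for every value in a gene -> values table."""
    return {gene: [math.log2(v + pseudocount) for v in values]
            for gene, values in rpkm.items()}

def row_zscore(table):
    """Centre and scale each gene across samples so rows are comparable in a heatmap."""
    scaled = {}
    for gene, values in table.items():
        mu, sd = mean(values), pstdev(values)
        # A flat profile has sd == 0; report zeros instead of dividing by zero.
        scaled[gene] = [(v - mu) / sd if sd > 0 else 0.0 for v in values]
    return scaled

# Hypothetical RPKM values for three genes at four time points.
rpkm = {
    "geneA": [0.0, 1.0, 3.0, 7.0],
    "geneB": [15.0, 15.0, 15.0, 15.0],
    "geneC": [255.0, 63.0, 15.0, 3.0],
}
logged = log_transform(rpkm)   # input for clustering
scaled = row_zscore(logged)    # input for the heatmap colour scale
```

    Scaling each row separately keeps a few very highly expressed genes from dominating the colour scale, at the cost of hiding absolute expression levels.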


  • #2
    I have a similar task and would be interested in a professional answer. Naively, I'll try HTSeq and DESeq on simple read count data and compare my samples pairwise.


    • #3

      We ran into similar problems when looking at this kind of data. The resulting dendrograms for the large sets of gene lists that come out of the next generation sequencing data can be difficult to visualize. We used both a heatmap approach and a combination of a dendrogram with boxplots over a time series in the paper we just published (RNA-Seq atlas of Glycine max --


      • #4
        I have exactly the same question. Can anybody offer some ideas?


        • #5
          some non-professional answers

          The resulting dendrograms for the large sets of gene lists that come out of the next generation sequencing data can be difficult to visualize.
          Definitely, but this applies to all large data sets. A heatmap and dendrogram of 20,000 genes in 20 samples will never look very nice, and I personally think it also does not tell you much about what is going on biologically. So one either takes a subset (as severin did in the paper) or groups the genes in a meaningful way prior to plotting (e.g. GO terms / gene families / Pfam domains). Depending on the experiment, some groups may not be of interest anyway and can be left out.

          So, in my opinion, I would first think about what I want to show. If I have a time course and am interested in what makes the difference, I would first search for genes or gene sets (grouped in a meaningful way, e.g. by function) that show the major differences between the samples, and plot only these. This reduces the amount of data plotted and, in the case of groups, links bare gene names to a term one understands (e.g. 'ABC transporters' tells me personally more than 'ATXGXXXXX' or a '.' in a picture).

          However, this requires some time-course analysis, which is not without problems (e.g. because of correlation between time points). There is also the question of what is being tested and what you would like to know. I guess there is helpful literature on time courses and ANOVA (not that you need to use ANOVA, but I think it is a good way to learn the general principles and problems of time-course studies).


          • #6

            Originally posted by schmima
            group the genes in a meaningful way prior to plotting (e.g. GO terms / gene families / Pfam domains).
            I agree with schmima here. One of the easiest ways to group genes is to look at the following groups: most highly expressed (row sum across the time points), time-point-specific expression, and expression in one time point significantly higher than in all other time points (this is what we did for seed versus all other tissues in the paper I mentioned before).

            Genes that show no expression at any time point can be removed from the analysis, which sometimes reduces your gene list substantially.
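
            The groupings above could be sketched roughly as follows. The gene names and values are hypothetical, and the fixed fold threshold in `peak_specific` is only a crude stand-in for "significantly higher", which would properly need a statistical test on replicated data.

```python
def drop_unexpressed(rpkm, min_value=0.0):
    """Keep genes whose expression exceeds min_value in at least one time point."""
    return {g: v for g, v in rpkm.items() if max(v) > min_value}

def top_by_row_sum(rpkm, n):
    """Return the n gene names with the largest summed expression across time points."""
    return sorted(rpkm, key=lambda g: sum(rpkm[g]), reverse=True)[:n]

def peak_specific(rpkm, fold=4.0):
    """Map gene -> index of the time point that is at least `fold` times every other one.

    The fold cutoff is an assumption made for illustration, not a significance test.
    """
    hits = {}
    for gene, values in rpkm.items():
        peak = max(range(len(values)), key=values.__getitem__)
        others = [v for i, v in enumerate(values) if i != peak]
        if values[peak] > 0 and all(values[peak] >= fold * v for v in others):
            hits[gene] = peak
    return hits

# Hypothetical RPKM values at four time points.
rpkm = {
    "geneA": [0.0, 0.0, 0.0, 0.0],
    "geneB": [5.0, 6.0, 5.0, 4.0],
    "geneC": [1.0, 40.0, 2.0, 1.0],
    "geneD": [100.0, 90.0, 110.0, 95.0],
}
kept = drop_unexpressed(rpkm)          # geneA is removed
highest = top_by_row_sum(kept, 2)      # ranked by row sum
specific = peak_specific(kept)         # time point indices are 0-based
```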

            I have also seen analyses that cluster expression profiles with K-means to try to identify the major themes in the expression.
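
            For illustration, here is a small K-means sketch over per-gene time profiles in plain Python. It assumes the profiles are already on a comparable scale (for example log-transformed), and it uses a deterministic farthest-point initialisation purely so that the example is reproducible; real analyses usually rely on a library implementation with several random restarts. The profiles are made up.

```python
import math

def kmeans(profiles, k, max_iter=100):
    """Cluster gene profiles (dict: gene -> list of values) into k groups.

    Returns a dict gene -> cluster index.
    """
    genes = list(profiles)
    points = [profiles[g] for g in genes]

    # Farthest-point initialisation: start with the first profile, then
    # repeatedly add the profile farthest from its nearest chosen centroid.
    centroids = [list(points[0])]
    while len(centroids) < k:
        far = max(points, key=lambda p: min(math.dist(p, c) for c in centroids))
        centroids.append(list(far))

    labels = None
    for _ in range(max_iter):
        # Assignment step: each profile goes to its nearest centroid.
        new_labels = [min(range(k), key=lambda j: math.dist(p, centroids[j]))
                      for p in points]
        if new_labels == labels:
            break  # assignments are stable, so the algorithm has converged
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old centroid if a cluster empties
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return dict(zip(genes, labels))

# Hypothetical log-scale profiles: rising, falling, and flat genes.
profiles = {
    "up1": [0.0, 1.0, 2.0, 3.0],
    "up2": [0.2, 1.1, 2.1, 2.9],
    "down1": [3.0, 2.0, 1.0, 0.0],
    "down2": [2.8, 2.1, 0.9, 0.1],
    "flat1": [1.5, 1.5, 1.5, 1.5],
    "flat2": [1.4, 1.6, 1.5, 1.5],
}
clusters = kmeans(profiles, k=3)
```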

            As with most data, I strongly recommend just playing with the data, seeing what jumps out at you, and then following up on it. Look closely at the subgroups I mentioned above, and also at transcription factors and tissue-related gene families in the time series.

            You can also look at the change in expression rather than the expression values: how does expression change between points 1 and 2, 2 and 3, 1 and 3, and so on?
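
            One way to express "change in expression" is a log2 ratio for every pair of time points. The sketch below assumes a pseudocount of 1 (my choice, to keep zero values finite) and uses a made-up profile.

```python
import math
from itertools import combinations

def pairwise_log2_change(values, pseudocount=1.0):
    """Return {(i, j): log2 ratio of time point j over time point i} for all i < j."""
    logged = [math.log2(v + pseudocount) for v in values]
    return {(i, j): logged[j] - logged[i]
            for i, j in combinations(range(len(values)), 2)}

# Hypothetical RPKM values for one gene at three time points (0-based indices).
changes = pairwise_log2_change([1.0, 3.0, 15.0])
```

            Here `changes[(0, 1)]` is the change from the first to the second time point, `changes[(1, 2)]` from the second to the third, and `changes[(0, 2)]` from the first to the third.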


            • #7
              I need to pick out the differentially expressed genes in the MA-plot generated by DEGseq. Is there software or a script that does this?


              • #8
                graphs and figures

                Any R command that produces a figure can typically be wrapped to write PDF, TIFF, or JPEG output rather than drawing to the R graphics window. See the R help for each output type for more information.

                Here is a really simple pdf wrapper function


                An example of how to use it.



                • #9
                  Good morning. Do I need to normalize the data that comes out of the SOLiD analysis software, BioScope?


                  • #10
                    time courses and heat maps

                    From my previous experience with time-course experiments (in proteomics, however), I recommend the following:
                    - First decide which time point is your reference. This should already be clear when you design the experimental protocol.
                    - Use the data from this time point as the "background" / zero / reference (whatever you would like to call it), then calculate the ratio of every other time point to it.
                    - Once you have fold-change or log-ratio values per gene and time point, you can visualize them in a heatmap (I did this once with RPKMs using Gitools @
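
                    The reference-based ratios described in the steps above can be sketched like this. The pseudocount of 1 is an assumption added so that a reference value of zero does not cause a division by zero; the gene names and RPKM values are hypothetical.

```python
import math

def log_ratio_to_reference(table, ref=0, pseudocount=1.0):
    """For each gene, return log2((value + pc) / (reference + pc)) at every time point."""
    out = {}
    for gene, values in table.items():
        base = values[ref] + pseudocount
        out[gene] = [math.log2((v + pseudocount) / base) for v in values]
    return out

# Hypothetical RPKM values; the first column (index 0) is the reference time point.
ratios = log_ratio_to_reference({
    "geneA": [3.0, 7.0, 1.0],
    "geneB": [0.0, 0.0, 3.0],
})
```

                    The reference column is always 0 after this step, so a diverging colour scale centred on 0 shows up- and down-regulation relative to the chosen time point.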


                    • #11
                      I would recommend clustering the time-course expression profiles of each gene using fuzzy c-means clustering. I am pretty sure this can be done in R fairly easily. Then you can look for enrichment of specific pathways or GO terms in each cluster. You may also be able to see which genes are regulated early, in the middle, and late. Perhaps middle or late genes are regulated by a transcription factor that you see increased in the early group. Just an idea.
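
                      To show what fuzzy c-means computes, here is a bare-bones sketch in plain Python. The fuzzifier m = 2, the fixed number of iterations, and the use of two data points as starting centroids are all assumptions made for this example; the profiles are made up, and a maintained library implementation is the better choice for real data.

```python
import math

def fuzzy_cmeans(points, centroids, m=2.0, n_iter=50):
    """Fuzzy c-means on a list of profiles, starting from the given centroids.

    Returns (centroids, memberships), where memberships[i][j] is the degree
    to which profile i belongs to cluster j (each row sums to 1).
    """
    centroids = [list(c) for c in centroids]
    power = 2.0 / (m - 1.0)
    memberships = []
    for _ in range(n_iter):
        memberships = []
        for p in points:
            dists = [math.dist(p, c) for c in centroids]
            if min(dists) == 0.0:
                # Profile sits exactly on a centroid: full membership there.
                row = [1.0 if d == 0.0 else 0.0 for d in dists]
                total = sum(row)
                row = [r / total for r in row]
            else:
                row = [1.0 / sum((d / other) ** power for other in dists)
                       for d in dists]
            memberships.append(row)
        # Update each centroid as the membership-weighted mean of all profiles.
        for j in range(len(centroids)):
            weights = [row[j] ** m for row in memberships]
            total = sum(weights)
            if total > 0:
                centroids[j] = [
                    sum(w * p[d] for w, p in zip(weights, points)) / total
                    for d in range(len(points[0]))
                ]
    return centroids, memberships

# Hypothetical log-scale profiles: two rising genes, two falling genes.
points = [
    [0.0, 1.0, 2.0, 3.0],
    [0.2, 1.1, 2.1, 2.9],
    [3.0, 2.0, 1.0, 0.0],
    [2.8, 2.1, 0.9, 0.1],
]
centroids, memberships = fuzzy_cmeans(points, centroids=[points[0], points[2]])
```

                      Unlike K-means, each gene gets a membership value for every cluster, so genes with ambiguous profiles can be spotted by their lack of a dominant membership.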

                      But I would definitely look into fuzzy c-means clustering. Look at figure 7 in this paper for the type of output you can expect from it.

                      Rigbolt KT, Prokhorova TA, Akimov V, Henningsen J, Johansen PT, Kratchmarova
                      I, Kassem M, Mann M, Olsen JV, Blagoev B. System-wide temporal characterization
                      of the proteome and phosphoproteome of human embryonic stem cell differentiation.
                      Sci Signal. 2011 Mar 15;4(164):rs3. PubMed PMID: 21406692.