Seqanswers Leaderboard Ad

Collapse
X
 
  • Filter
  • Time
  • Show
Clear All
new posts
  • ETHANol
    Senior Member
    • Feb 2010
    • 308

    Hierarchical clustering of RNA-seq data

    I'm clearly out of my comfort zone here but it needs to be done. So I've taken a cursory look into clustering some RNA-seq data and I am more confused then when I started. Does anyone know of any good reviews that could help my mathematically limited knowledge-base understand some of the complexities of hierarchical clustering, especially with respect to RNA-seq data.

    I have taken a look at a couple things I have found.
    1. DESeq has some clustering abilities but seems pretty limited.
    2. MBCluster.Seq ( http://cran.r-project.org/web/packag...luster.Seq.pdf ) looks interesting but seems to lack human friendly documentation.
    3. hopach ( http://bioc.ism.ac.jp/2.6/bioc/html/hopach.html ) looks interesting but there is no documentation on how to feed it RNA-seq data (or any data for that matter).

    Data preprocessing steps I've considered but may or may not be correct.
    1. Seems best to select genes with high variance.
    2. Perhaps values should be log2 transformed??
    3. Filtering genes with low read counts seems like a good idea.
    4. Perhaps depending on what you want to see, divide by lowest expressed condition for each gene so you are looking a fold change.
    5. Are there tools that take into account variance within a condition???

    It all looks pretty complicated so any insights would be greatly appreciated.

    Thanks!!
    --------------
    Ethan
  • themerlin
    Member
    • Feb 2010
    • 51

    #2
    I would try out the multiple experiment viewer.

    Comment

    • ETHANol
      Senior Member
      • Feb 2010
      • 308

      #3
      That looks like a great resource. Thanks!!!!!
      --------------
      Ethan

      Comment

      • Wolfgang Huber
        Senior Member
        • Aug 2009
        • 109

        #4
        Dear ETHANol,

        regarding the question about log-transformation, that is not a good idea since then you will create missing values in those cases where a gene had 0 counts in some of the samples. Genes that have 0 counts in some samples but not in others may be among the most interesting ones.

        Now you can of course use log2(n+c), with some suitable choice of c. If one explores the question 'what is the best choice of c?' a bit deeper, one place to end up is variance stabilising transformation. This is explained in the DESeq vignette.

        You said "DESeq has some clustering abilities but seems pretty limited". In fact, DESeq is a package for the R language and environment, which probably offers the most wide-ranging spectrum of clustering and machine learning methods on the planet. See the CRAN task view for an overview. DESeq itself does not do clustering, it 'just' preprocesses the data (and offers statistical tests and regression modelling).

        Best wishes
        Wolfgang
        Wolfgang Huber
        EMBL

        Comment

        • kopi-o
          Senior Member
          • Feb 2008
          • 319

          #5
          Excellent points by Wolfgang and even more to the point with regard to the original question: the CRAN Cluster Analysis Task View

          Comment

          • ETHANol
            Senior Member
            • Feb 2010
            • 308

            #6
            Thanks for the comments.

            I have already replaced all the 1's in my data set with 0's. Seems like a reasonable thing to do with RNA-seq data. The log2(n+c) sounds like a better option then just cutting off the low expressed genes at some arbitrary quantile. I'll have to read the DESeq vignette a little more carefully, but when it got to clustering it was pretty much over my head and I don't really see a way to take it further (not saying there isn't - I just don't have the background).

            The links look like a jungle and sort of reinforce the idea that this is really difficult. But with anything there has to be a way in.
            --------------
            Ethan

            Comment

            • kopi-o
              Senior Member
              • Feb 2008
              • 319

              #7
              If you want a few R functions/packages to get started, you could try "agnes()" or good old "heatmap()".

              Comment

              • ETHANol
                Senior Member
                • Feb 2010
                • 308

                #8
                Originally posted by Wolfgang Huber View Post
                DESeq itself does not do clustering, it 'just' preprocesses the data (and offers statistical tests and regression modelling).
                I completely missed that point when looking at the vignette. So that is really nice as you can take the preprocessed data from DESeq and use it how you wish. Thanks for the heads up.

                Code:
                library(DESeq)
                
                countsTable <- read.delim("countstable.txt", header=TRUE, stringsAsFactors=TRUE)
                rownames( countsTable ) <- countsTable$gene
                countsTable <- countsTable[ , -1 ]
                conds <- factor(c("pos","pos","pos","pos","neg","neg","neg","neg"))
                cds <- newCountDataSet( countsTable, conds )
                cds <- estimateSizeFactors( cds )
                normcds <- counts( cds, normalized=TRUE )
                
                cds <- estimateDispersions( cds )
                res <- nbinomTest( cds, "pso", "neg" )
                write.table(res, file="DEG_list.txt", quote=FALSE, sep="\t", row.names=FALSE)
                
                
                
                cdsBlind <- estimateDispersions( cds, method="blind" )
                vsd <- getVarianceStabilizedData( cdsBlind )
                
                select <- order(res$pval)[1:40]
                colors <- colorRampPalette(c("white","darkblue"))(100)
                heatmap( vsd[select,], col = colors, scale = "none")
                A question on this code, which is taken directly out of the DESeq vigentte.
                The object 'cds' is a counts data set in which the conditions are ascribed to each column. If I have several conditions, can I just write this code:
                Code:
                conds <- factor(c("pos","pos","het","het","het","neg","neg","neg"))
                The the top 40 genes are then selected using this line:
                Code:
                select <- order(res$pval)[1:40]
                In the case of an experiment that has several conditions and thus not being able to identify differentially expressed genes by one simple pairwise comparison, would there be some way to select the best genes, say using the genes that show the highest variance?

                Thanks again for the help.
                Last edited by ETHANol; 04-07-2012, 10:34 PM.
                --------------
                Ethan

                Comment

                • Simon Anders
                  Senior Member
                  • Feb 2010
                  • 995

                  #9
                  each column. If I have several conditions, can I just write this code:
                  Code:
                  conds <- factor(c("pos","pos","het","het","het","neg","neg","neg"))
                  Yes, that shoul work. Try it out.

                  In the case of an experiment that has several conditions and thus not being able to identify differentially expressed genes by one simple pairwise comparison, would there be some way to select the best genes, say using the genes that show the highest variance?
                  Exactly. You could try
                  Code:
                  library(genefilter)
                  select <- order( -rowVars(vsd) )[1:40]
                  The '-' is to get the highest, not the lowest, variances.

                  And maybe use a few more than just 40 genes.

                  Simon

                  Comment

                  • ETHANol
                    Senior Member
                    • Feb 2010
                    • 308

                    #10
                    Thanks Simon and everyone else. This has all been a lot of help!!!
                    --------------
                    Ethan

                    Comment

                    • ETHANol
                      Senior Member
                      • Feb 2010
                      • 308

                      #11
                      I'm back to clustering. I wish there was someone else here more qualified to do this but no such luck.

                      I'll describe an experiment. I have four conditions which I will call:
                      Control
                      Treatment-A
                      Treatment-B
                      Treatment-C

                      I have three replicates for each condition.
                      What I would like to find out is the genes that are differentially expressed comparing the control to Treatment-A, Treatment-B and Treatment C. These binary comparisons are easy to do with DESeq.

                      I would also like to know which genes are using the control as a baseline up/down regulated in Treatment-A AND Treatment-B, Treatment-B AND Treatment-C, Treatment-A AND Treatment-C, and all three.

                      I believe throwing the gene lists from each pairwise comparison into Venn diagrams would produce a lot of false differences.

                      Is clustering the correct approach to getting these lists of genes? Anybody have any suggestion on how to best do this? Should I use the replicates or just the averages for each condition?
                      --------------
                      Ethan

                      Comment

                      • vyellapa
                        Member
                        • Oct 2011
                        • 59

                        #12
                        After the normalizing step

                        Code:
                        data <-counts(cds, normalized=TRUE)
                        My samples are all tumor only.Then, is it appropriate to select genes based on a Standard deviation cutoff or a list of top 'n' genes with highest SD?

                        Code:
                        dataLog2=log2(data+1)
                        for (i in (1:94137)) {if(sd(dataLog2[i,])>3.5 {geneList=union(geneList, rownames (dataLog2)[i]);}}
                        Last edited by vyellapa; 08-01-2013, 03:15 PM.

                        Comment

                        • vyellapa
                          Member
                          • Oct 2011
                          • 59

                          #13
                          I'm trying to identify subtypes in my tumor RNA-Seq data through unsupervised hierarchical clustering. For this, I'm curious if counts should normalized (using counts( cds, normalized=TRUE )) before estimateDispersions and Variance Stabilization or should counts not be normalized before these steps.

                          Thank you,
                          Teja

                          Code:
                          countsTable <- read.delim("countstable.txt", header=TRUE, stringsAsFactors=TRUE) 
                          rownames( countsTable ) <- countsTable$gene 
                          countsTable <- countsTable[ , -1 ] 
                          conds <- factor(c("tumor","tumor","tumor","tumor","tumor","tumor","tumor","tumor")) 
                          cds <- newCountDataSet( countsTable, conds ) 
                          cds <- estimateSizeFactors( cds ) 
                          cds <- counts( cds, normalized=TRUE )  
                          cdsBlind <- estimateDispersions( cds, method="blind" ) 
                          vsd <- getVarianceStabilizedData( cdsBlind )  
                          select <- order(res$pval)[1:250] 
                          colors <- colorRampPalette(c("white","darkblue"))(100) heatmap( vsd[select,], col = colors, scale = "none")

                          Comment

                          Latest Articles

                          Collapse

                          • seqadmin
                            New Genomics Tools and Methods Shared at AGBT 2025
                            by seqadmin


                            This year’s Advances in Genome Biology and Technology (AGBT) General Meeting commemorated the 25th anniversary of the event at its original venue on Marco Island, Florida. While this year’s event didn’t include high-profile musical performances, the industry announcements and cutting-edge research still drew the attention of leading scientists.

                            The Headliner
                            The biggest announcement was Roche stepping back into the sequencing platform market. In the years since...
                            03-03-2025, 01:39 PM
                          • seqadmin
                            Investigating the Gut Microbiome Through Diet and Spatial Biology
                            by seqadmin




                            The human gut contains trillions of microorganisms that impact digestion, immune functions, and overall health1. Despite major breakthroughs, we’re only beginning to understand the full extent of the microbiome’s influence on health and disease. Advances in next-generation sequencing and spatial biology have opened new windows into this complex environment, yet many questions remain. This article highlights two recent studies exploring how diet influences microbial...
                            02-24-2025, 06:31 AM

                          ad_right_rmr

                          Collapse

                          News

                          Collapse

                          Topics Statistics Last Post
                          Started by seqadmin, Today, 05:03 AM
                          0 responses
                          15 views
                          0 reactions
                          Last Post seqadmin  
                          Started by seqadmin, Yesterday, 07:27 AM
                          0 responses
                          12 views
                          0 reactions
                          Last Post seqadmin  
                          Started by seqadmin, 03-18-2025, 12:50 PM
                          0 responses
                          15 views
                          0 reactions
                          Last Post seqadmin  
                          Started by seqadmin, 03-03-2025, 01:15 PM
                          0 responses
                          185 views
                          0 reactions
                          Last Post seqadmin  
                          Working...