Announcement

Collapse
No announcement yet.

HTseq to DeSeq/EdgeR to Heatmap

Collapse
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • I'd be surprised if leaving them in makes a difference. I use almost identical methods and never bother to remove miRNA sequences/counts before running the statistics.

    If you really wanted to go through the hassle of removing them then you can probably use Biomart to get a list of gene IDs for miRNAs (e.g., this query I just made for mice) and then simply "grep -v -f IDs.txt counts.txt > counts_without_miRNAs.txt".

    Comment


    • Originally posted by dpryan View Post
      I'd be surprised if leaving them in makes a difference. I use almost identical methods and never bother to remove miRNA sequences/counts before running the statistics.

      If you really wanted to go through the hassle of removing them then you can probably use Biomart to get a list of gene IDs for miRNAs (e.g., this query I just made for mice) and then simply "grep -v -f IDs.txt counts.txt > counts_without_miRNAs.txt".
      Thank you D, useful!
      For my second question, is it could be happen that the some genes are predicted as logFC positive in RPKM based pipeline (Cuffdiff) while predicted as logFC negative in count based pipeline(edgeR/DESeq2)?

      Comment


      • I'd be pretty surprised to see that on anything that's differentially expressed, at least. If it's not DE then the logFC estimate is pretty uncertain, so getting a positive value with one tool and a negative with another wouldn't seem that unusual.

        Comment


        • Originally posted by geneart View Post
          DESeq analysis using HTSeq count data: Problems
          Im following the DESeq guide to perform a differential expression analysis on my HTSeq count data across 7 samples together, without biological replicates. I am successful until getting to the estimatedispersions part. Now I wanted to try using the nbinomial command as indicated in the DESeq guide , using my HTSeq count data as follows:
          > cds <- newCountDataSetFromHTSeqCount( countTable, condition1 )
          Error in file(file, "rt") : invalid 'description' argument
          > cds2 = cds[ ,c( "A","B","C","D","E","F","G" ) ]
          > cds2 = estimateDispersions(cds2,method="blind",sharingMode="fit-only",fitType="local")
          > res2=nbinomTest(cds2, "A","B","C","D","E","F","G")
          Error in nbinomTest(cds2, "A", "B", "C", "D", :
          unused arguments ("E", "F", "G")
          # I had previously described my condition1
          > res2=nbinomTest(cds2,condition1)
          Error in match(x, table, nomatch = 0L) :
          argument "condB" is missing, with no default

          My question is: this typically works if I used just two samples, without biological replicates. However if I have more than two samples and have no biological replicates but I need to see differential expression between them anyhow, then why do I get those errors?All those samples are the same condition like for example all are treated, but I want to see if there is any differencial expression amongst them.
          Please can anyone help?
          geneart.
          Hi,
          I have the same question with u. Now I am working on 4 groups of miRNA data with no replicates, and also I wanna compare the differential expression between them but do not know how to do?
          Have u already solved your problem?

          Comment


          • Originally posted by dpryan View Post
            I'd be pretty surprised to see that on anything that's differentially expressed, at least. If it's not DE then the logFC estimate is pretty uncertain, so getting a positive value with one tool and a negative with another wouldn't seem that unusual.
            Hi D, I remember that clearly that you told me in RNA-Seq, bofore doing the mapping we need 'gentle' trimming. So I have trimmed the very short reads, but is that essential to remove the low quality score reads as well?
            Another question is that after I finish the mapping and get the .bam file, is that essential to do quality filtering again? e.g. unique mapped reads filtering. And then go to DE analysis (e.g. HTseq-count for DESeq2/edgeR and Cuffdiff2)
            Thank you so much!

            Comment


            • By "low quality score", do you mean base call Phred scores or MAPQ? Bases can be trimmed fairly conservatively (i.e., don't be aggressive). I generally only include alignments with MAPQ above 5 or so, which is a convenient way of removing multimappers from many RNAseq-specific aligners. HTseq-count has reasonable defaults, so it'll do its best to ignore multimappers already, so you likely only need to change the default settings there if your aligner doesn't produce alignments with the NH auxiliary tag.

              BTW, for cuffdiff2 you can keep multimappers. This is also the case for quantification methods like RSEM or eXpress.

              Comment


              • Originally posted by dpryan View Post
                By "low quality score", do you mean base call Phred scores or MAPQ? Bases can be trimmed fairly conservatively (i.e., don't be aggressive). I generally only include alignments with MAPQ above 5 or so, which is a convenient way of removing multimappers from many RNAseq-specific aligners. HTseq-count has reasonable defaults, so it'll do its best to ignore multimappers already, so you likely only need to change the default settings there if your aligner doesn't produce alignments with the NH auxiliary tag.

                BTW, for cuffdiff2 you can keep multimappers. This is also the case for quantification methods like RSEM or eXpress.
                Hi D
                Helpful! And
                1.Before trimming, the quality score means Phred scores. So do you mean I don't need to trim too much (so far I only remove the very short reads)
                2. After mapping, for the DESeq2/edgeR, do you mean when I did HTseq-count to generate the counts table, it has been considering the multimappers ? Otherwise I will first remove MAPQ<5 , am I right?
                3. For Cuffdiff2, I could keep multimappers? Could you explain why just briefly?
                Thank you!
                Q
                Last edited by super0925; 10-13-2014, 06:59 AM.

                Comment


                • 1. Correct
                  2. Correct
                  3. Cufflinks/cuffquant/etc. use an expectation-maximization algorithm that will assign fractional counts to multiple genes/transcripts. Cuffdiff can handle this, which is good since it'd need to to do transcript-level analyses.

                  Comment


                  • Originally posted by dpryan View Post
                    1. Correct
                    2. Correct
                    3. Cufflinks/cuffquant/etc. use an expectation-maximization algorithm that will assign fractional counts to multiple genes/transcripts. Cuffdiff can handle this, which is good since it'd need to to do transcript-level analyses.
                    Hi D after alignment.
                    (1)I found that some mapped reads in the SAM file
                    are '23M1I25M1I15M1I10M' or '21M1I13M1I8M2I8M' (CIGAR). are these normal?
                    (2) After alignment, we could IGV to visualize it. Do we need to do another trim for the aligned file if there are some 'bad' mapping?

                    Comment


                    • 1) It's not unusual to see a few instances of things like that.
                      2) Depends on if you want to see those or not. It's unlikely that IGV will have any problem with bad alignments.

                      Comment


                      • Originally posted by blancha View Post
                        heatmap.2 is actually not a function of FactoMineR or DESeq2, but a function of the package gplots.
                        You just need to set the columns names of the matrix given in input to heatmap.2 to the samples names for the samples labels to appear on the plot.

                        Here is the R code to make a PCA plot with FactoMineR.
                        You don't need to use quanti.sup or quali.sup.

                        Code:
                        library(DESeq2)
                        library(FactoMineR)
                        library(RColorBrewer)
                        
                        # dds is the DESeqDataSet object created with DESeq2.
                        # rlog: "Regularized" log transformation
                        rld <- rlog(dds)
                        
                        # Transpose of the matrix of the count data.
                        # It's important to remember to give into input to FactoMineR the transpose of the matrix.
                        assay.rld.t <- t(assay(rld))
                        
                        ##################
                        # FactoMineR PCA #
                        ##################
                        pca <- PCA(assay.rld.t, graph=FALSE) 
                        
                        # Colors. You'll have to adjust this to the number of conditions and replicates in your experiment.
                        # I highly recommend using the brewer palettes.
                        colors.brewer <- brewer.pal(n=4, name="Set1")
                        colors <- c(rep(colors.brewer[1], 3),
                                    rep(colors.brewer[2], 3),
                                    rep(colors.brewer[3], 3),
                                    rep(colors.brewer[4], 3))
                        
                        # FactoMineR PCA plot           
                        pdf(file.path(outputDirectoryPlots, "PCA_with_colors.pdf"))
                        plot.PCA(pca, habillage="ind", col.hab=colors)
                        dev.off()
                        Hi Blancha, thanks for your codes for PCA using FactoMineR.
                        I am very new to R. I have a query regarding graph of variables. As FactoMineR generates two graphs: graph of individuals and also graph of variables. As per the example given by you, if we plot graph of variables, it will plot a lot of values (for all contigs) and of-course will not be informative. Could you please give me inputs on how to do a radial plot on 'assay.rld.t', just like the way as shown in case of example data (decathlon), the "graph of variables". Or I'm just thinking conceptually wrong. Thanks
                        Last edited by a0909; 11-12-2014, 08:11 AM.

                        Comment


                        • Is there a way to get the genomic coordinates (including strand) in DESeq2 from the list of DE genes? I used HTseq to start with.

                          Comment


                          • paper used HTSEQ and DEseq2

                            Dear all,

                            Where can I found some published paper which used HTSEQ and DEseq2 in their data analysis? Thank you very much in advance.

                            Comment

                            Working...
                            X