Announcement

Collapse
No announcement yet.

HTseq to DeSeq/EdgeR to Heatmap

Collapse
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • fabrice
    replied
    paper used HTSEQ and DEseq2

    Dear all,

    Where can I found some published paper which used HTSEQ and DEseq2 in their data analysis? Thank you very much in advance.

    Leave a comment:


  • nbahlis
    replied
    Is there a way to get the genomic coordinates (including strand) in DESeq2 from the list of DE genes? I used HTseq to start with.

    Leave a comment:


  • a0909
    replied
    Originally posted by blancha View Post
    heatmap.2 is actually not a function of FactoMineR or DESeq2, but a function of the package gplots.
    You just need to set the columns names of the matrix given in input to heatmap.2 to the samples names for the samples labels to appear on the plot.

    Here is the R code to make a PCA plot with FactoMineR.
    You don't need to use quanti.sup or quali.sup.

    Code:
    library(DESeq2)
    library(FactoMineR)
    library(RColorBrewer)
    
    # dds is the DESeqDataSet object created with DESeq2.
    # rlog: "Regularized" log transformation
    rld <- rlog(dds)
    
    # Transpose of the matrix of the count data.
    # It's important to remember to give into input to FactoMineR the transpose of the matrix.
    assay.rld.t <- t(assay(rld))
    
    ##################
    # FactoMineR PCA #
    ##################
    pca <- PCA(assay.rld.t, graph=FALSE) 
    
    # Colors. You'll have to adjust this to the number of conditions and replicates in your experiment.
    # I highly recommend using the brewer palettes.
    colors.brewer <- brewer.pal(n=4, name="Set1")
    colors <- c(rep(colors.brewer[1], 3),
                rep(colors.brewer[2], 3),
                rep(colors.brewer[3], 3),
                rep(colors.brewer[4], 3))
    
    # FactoMineR PCA plot           
    pdf(file.path(outputDirectoryPlots, "PCA_with_colors.pdf"))
    plot.PCA(pca, habillage="ind", col.hab=colors)
    dev.off()
    Hi Blancha, thanks for your codes for PCA using FactoMineR.
    I am very new to R. I have a query regarding graph of variables. As FactoMineR generates two graphs: graph of individuals and also graph of variables. As per the example given by you, if we plot graph of variables, it will plot a lot of values (for all contigs) and of-course will not be informative. Could you please give me inputs on how to do a radial plot on 'assay.rld.t', just like the way as shown in case of example data (decathlon), the "graph of variables". Or I'm just thinking conceptually wrong. Thanks
    Last edited by a0909; 11-12-2014, 08:11 AM.

    Leave a comment:


  • dpryan
    replied
    1) It's not unusual to see a few instances of things like that.
    2) Depends on if you want to see those or not. It's unlikely that IGV will have any problem with bad alignments.

    Leave a comment:


  • super0925
    replied
    Originally posted by dpryan View Post
    1. Correct
    2. Correct
    3. Cufflinks/cuffquant/etc. use an expectation-maximization algorithm that will assign fractional counts to multiple genes/transcripts. Cuffdiff can handle this, which is good since it'd need to to do transcript-level analyses.
    Hi D after alignment.
    (1)I found that some mapped reads in the SAM file
    are '23M1I25M1I15M1I10M' or '21M1I13M1I8M2I8M' (CIGAR). are these normal?
    (2) After alignment, we could IGV to visualize it. Do we need to do another trim for the aligned file if there are some 'bad' mapping?

    Leave a comment:


  • dpryan
    replied
    1. Correct
    2. Correct
    3. Cufflinks/cuffquant/etc. use an expectation-maximization algorithm that will assign fractional counts to multiple genes/transcripts. Cuffdiff can handle this, which is good since it'd need to to do transcript-level analyses.

    Leave a comment:


  • super0925
    replied
    Originally posted by dpryan View Post
    By "low quality score", do you mean base call Phred scores or MAPQ? Bases can be trimmed fairly conservatively (i.e., don't be aggressive). I generally only include alignments with MAPQ above 5 or so, which is a convenient way of removing multimappers from many RNAseq-specific aligners. HTseq-count has reasonable defaults, so it'll do its best to ignore multimappers already, so you likely only need to change the default settings there if your aligner doesn't produce alignments with the NH auxiliary tag.

    BTW, for cuffdiff2 you can keep multimappers. This is also the case for quantification methods like RSEM or eXpress.
    Hi D
    Helpful! And
    1.Before trimming, the quality score means Phred scores. So do you mean I don't need to trim too much (so far I only remove the very short reads)
    2. After mapping, for the DESeq2/edgeR, do you mean when I did HTseq-count to generate the counts table, it has been considering the multimappers ? Otherwise I will first remove MAPQ<5 , am I right?
    3. For Cuffdiff2, I could keep multimappers? Could you explain why just briefly?
    Thank you!
    Q
    Last edited by super0925; 10-13-2014, 06:59 AM.

    Leave a comment:


  • dpryan
    replied
    By "low quality score", do you mean base call Phred scores or MAPQ? Bases can be trimmed fairly conservatively (i.e., don't be aggressive). I generally only include alignments with MAPQ above 5 or so, which is a convenient way of removing multimappers from many RNAseq-specific aligners. HTseq-count has reasonable defaults, so it'll do its best to ignore multimappers already, so you likely only need to change the default settings there if your aligner doesn't produce alignments with the NH auxiliary tag.

    BTW, for cuffdiff2 you can keep multimappers. This is also the case for quantification methods like RSEM or eXpress.

    Leave a comment:


  • super0925
    replied
    Originally posted by dpryan View Post
    I'd be pretty surprised to see that on anything that's differentially expressed, at least. If it's not DE then the logFC estimate is pretty uncertain, so getting a positive value with one tool and a negative with another wouldn't seem that unusual.
    Hi D, I remember that clearly that you told me in RNA-Seq, bofore doing the mapping we need 'gentle' trimming. So I have trimmed the very short reads, but is that essential to remove the low quality score reads as well?
    Another question is that after I finish the mapping and get the .bam file, is that essential to do quality filtering again? e.g. unique mapped reads filtering. And then go to DE analysis (e.g. HTseq-count for DESeq2/edgeR and Cuffdiff2)
    Thank you so much!

    Leave a comment:


  • xunong
    replied
    Originally posted by geneart View Post
    DESeq analysis using HTSeq count data: Problems
    Im following the DESeq guide to perform a differential expression analysis on my HTSeq count data across 7 samples together, without biological replicates. I am successful until getting to the estimatedispersions part. Now I wanted to try using the nbinomial command as indicated in the DESeq guide , using my HTSeq count data as follows:
    > cds <- newCountDataSetFromHTSeqCount( countTable, condition1 )
    Error in file(file, "rt") : invalid 'description' argument
    > cds2 = cds[ ,c( "A","B","C","D","E","F","G" ) ]
    > cds2 = estimateDispersions(cds2,method="blind",sharingMode="fit-only",fitType="local")
    > res2=nbinomTest(cds2, "A","B","C","D","E","F","G")
    Error in nbinomTest(cds2, "A", "B", "C", "D", :
    unused arguments ("E", "F", "G")
    # I had previously described my condition1
    > res2=nbinomTest(cds2,condition1)
    Error in match(x, table, nomatch = 0L) :
    argument "condB" is missing, with no default

    My question is: this typically works if I used just two samples, without biological replicates. However if I have more than two samples and have no biological replicates but I need to see differential expression between them anyhow, then why do I get those errors?All those samples are the same condition like for example all are treated, but I want to see if there is any differencial expression amongst them.
    Please can anyone help?
    geneart.
    Hi,
    I have the same question with u. Now I am working on 4 groups of miRNA data with no replicates, and also I wanna compare the differential expression between them but do not know how to do?
    Have u already solved your problem?

    Leave a comment:


  • dpryan
    replied
    I'd be pretty surprised to see that on anything that's differentially expressed, at least. If it's not DE then the logFC estimate is pretty uncertain, so getting a positive value with one tool and a negative with another wouldn't seem that unusual.

    Leave a comment:


  • super0925
    replied
    Originally posted by dpryan View Post
    I'd be surprised if leaving them in makes a difference. I use almost identical methods and never bother to remove miRNA sequences/counts before running the statistics.

    If you really wanted to go through the hassle of removing them then you can probably use Biomart to get a list of gene IDs for miRNAs (e.g., this query I just made for mice) and then simply "grep -v -f IDs.txt counts.txt > counts_without_miRNAs.txt".
    Thank you D, useful!
    For my second question, is it could be happen that the some genes are predicted as logFC positive in RPKM based pipeline (Cuffdiff) while predicted as logFC negative in count based pipeline(edgeR/DESeq2)?

    Leave a comment:


  • dpryan
    replied
    I'd be surprised if leaving them in makes a difference. I use almost identical methods and never bother to remove miRNA sequences/counts before running the statistics.

    If you really wanted to go through the hassle of removing them then you can probably use Biomart to get a list of gene IDs for miRNAs (e.g., this query I just made for mice) and then simply "grep -v -f IDs.txt counts.txt > counts_without_miRNAs.txt".

    Leave a comment:


  • super0925
    replied
    Originally posted by dpryan View Post
    1) A few reads here and there being mapped to miRNAs is pretty common. You'll generally not get enough reads mapping to miRNAs when you look at mRNA to have decent statistical power. However, if you do, then I suppose you can go ahead and use those reads. Have a look at them and see if they're of the mature form or the pri-miRNA, which would seem more likely.

    2) It could be useful, though that's most common among poorly expressed genes (rather than those with some level of detectable baseline expression). In any case, the real fold-change is unlikely to be infinite, that's just the estimate given the input data.

    Thank you D, very useful.
    We extracted the RNAs from the cells using tryzol method followed by a clean up using RNAeasy Mini kit. And then remove most of the RNA shorter than 200 nts in length. Thus, most of the miRNA reads may come from contaminant and then are not relevant for our analysis. So do you think it will affect our analysis, do we need to do separately analysis without miRNA? If the result is still OK as you said and will not impact the DE human genes, we are glad. If not , So far I don't know about how to remove those miRNA from the analysis.

    Leave a comment:


  • dpryan
    replied
    1) A few reads here and there being mapped to miRNAs is pretty common. You'll generally not get enough reads mapping to miRNAs when you look at mRNA to have decent statistical power. However, if you do, then I suppose you can go ahead and use those reads. Have a look at them and see if they're of the mature form or the pri-miRNA, which would seem more likely.

    2) It could be useful, though that's most common among poorly expressed genes (rather than those with some level of detectable baseline expression). In any case, the real fold-change is unlikely to be infinite, that's just the estimate given the input data.

    Leave a comment:

Working...
X