Header Leaderboard Ad


DESeq2 - understanding multi-factor DE analysis



No announcement yet.
  • Filter
  • Time
  • Show
Clear All
new posts

  • DESeq2 - understanding multi-factor DE analysis

    Hello again dear SeqAnswers,

    I have another question regarding RNA-seq data processing, this time with DESeq. I had count tables from my experiment generated successfully by DEXSeq counting script. Since I wanted to use DESeq2 this time, without having to re-count all the samples, I used geneCountTable function on my ExonCountSet in R to "compress" all exon bins into genes.

    1) I hope this is correct approach to obtaining gene count tables for DESeq ?

    Now the main question, my study is about gender differences in response to some knock-out. So I have 4 sample groups:

    Wild-type male
    Wild-type female
    Knock-out male
    Knock-out female

    I generated all the necessary tables following the DESeq2 vignette:

    sex strain
    WTM_1.counts male WT
    WTM_2.counts male WT
    WTM_3.counts male WT
    WTF_1.counts female WT
    WTF_2.counts female WT
    WTF_3.counts female WT
    KOM_1.counts male KO
    KOM_2.counts male KO
    KOM_3.counts male KO
    KOF_1.counts female KO
    KOF_2.counts female KO
    KOF_3.counts female KO

    2) What is then a proper approach to see differences between gene expression of KO samples that would take into consideration the natural variance between genders in wild-type (like sex-specific genes) ?

    I know I can manipulate the GLM with the formula, but to be honest I am unsure if I understand the entire meaning behind it.

    Currently, just for fun, I used formula (design = ~ sex + strain). Now what I got back are 3 tables with fold-changes - Intercept, male vs female, KO vs WT.

    3) What is intercept exactly ? Are values in tables male vs. female and KO vs. WT simply pair-wise comparisons, like ones I would get by using formulas: design = ~ sex and design = ~ strain separately ?

    I know this might take a longer answer so I will patiently wait for somebody's time. I will be extremely thankful for any insight, I am a beginner in RNA-seq , hence those questions might seem trivial for some.

    All the best!

  • #2
    Summing the counts from DEXSeq may lead to spurious results since you'll be double counting reads overlapping multiple exons. If you found any differential exon usage with DEXSeq, then you're even more likely to get spurious results here. You can get counts pretty quickly with featureCounts(), so give that a try.

    If you're interested in "gender differences in response to some knock-out" then I would think you'd be most interested in the sex:strain interaction (so "design = ~ sex*strain").

    The intercept is the base level of the model, which appears to be WT females (presumably you releveled() the strain factor in your design, since normally they're set alphabetically, meaning that KO females would be the base level), unless I have that backward. The "male vs female" result would be interpreted as "differences due to gender after accounting for differences due to strain". The "KO vs. WT" results would have a similar interpretation.


    • #3
      That's very concise and informative. Thank you for the answer !

      However, one more thing troubles me, I have an Ensembl GRCm38 .gtf file that contains in 3rd column values such as exon, CDS, start_codon, stop_codon, but no "gene". Since I want to use the .gtf file for HTseq counting of reads based on their overall mapping to genes, I thought I should use the -t gene option. Otherwise the default -t exon would count the multi-exon reads twice into same gene, like the DEXseq counting script, right ? Or maybe the -t exon option is fine for preparing count tables for DESeq ?
      Last edited by kajot; 04-07-2014, 12:18 AM.


      • #4
        htseq-count won't double count reads mapping to multiple exons, since it's designed with gene-level analyses in mind. The scripts that come with DEXSeq are different in this regard, since there it makes sense to double count in these instances (the DESeq/DEXSeq authors put enough thought into things that you won't normally shoot yourself in the foot doing this sort of thing). If you used the default "-t exon -i gene_id" then what you're telling the script to do is to look at reads mapping within exons and then count them according to how they overlap genes denoted by their gene_id. This is normally the appropriate way to go about things.


        • #5
          I understand, I wasn't sure since in my .gtf file there are entries like this:

          1 protein_coding exon 4830268 4830315 . + . exon_id "ENSMUSE00001118471"; exon_number "4"; gene_biotype "protein_coding"; gene_id "ENSMUSG00000025903"; gene_name "Lypla1"; p_id "P31648"; transcript_id "ENSMUST00000150971"; transcript_name "Lypla1-003"; tss_id "TSS49998";
          1 protein_coding exon 4830268 4830315 . + . exon_id "ENSMUSE00001118471"; exon_number "4"; gene_biotype "protein_coding"; gene_id "ENSMUSG00000025903"; gene_name "Lypla1"; p_id "P31294"; transcript_id "ENSMUST00000119612"; transcript_name "Lypla1-009"; tss_id "TSS45053";

          As you can see, there is a double entry for two transcripts from the same gene, but the exon position and gene IDs are the same, therefore I was a bit concerned that the script might count a read mapped to such exon twice for the same ENSMUSG00000025903.


          • #6
            Ah, yeah, that won't cause any double counting or other problems


            • #7
              Ok, all is clear, I am deeply grateful for your help