No announcement yet.
  • Filter
  • Time
  • Show
Clear All
new posts

  • Weird DE-results

    I'm doing some differential expression analysis with DESeq2, limma(voom) and edgeR, and I get some weird results. There is a gene that I know should be highly expressed in one of my samples, but my results tell me it's the other way around. Looking at both the counts and the FPKM values (from Cufflinks) for said gene, they both agree that I should get a log2 fold change around 0.8, but I get around -0.8. Setting aside the fact that eyeballing a fold change from counts or FPKM isn't the right way to do things, it should at the very least be of the correct sign, right?

    I have not noticed this until now, when I happened to check this particular gene, and now I seem to see this in other genes as well. I can do some other eyeballing for random genes where I see a difference in counts/FPKM, and while it's not always so pronounced, I do seem to get some weird values. So, I started looking in my code for possible reasons for this. I checked my dds structure in DESeq2, and this is what it looks like:

    class: DESeqDataSet 
    dim: 20477 7 
    assays(1): counts
    rownames(20477): ENSG00000000003 ENSG00000000005 ... ENSG00000273431
    rowData metadata column names(0):
    colnames(7): 1 2 ... 6 7
    colData names(2): condition
    Now, comparing this to the DESeq2 vignette, everything seems fine, except the "colnames(7): 1 2 ... 6 7" part - instead of numbers in my data, the vignette has "treated1, treated2", etc instead. I figured it must then be due to some fault in me creating the dds construct, but I can't figure out what. This is (the relevant parts of) my code:

    # Load data
    cat('reading data\n')
    data = read.table("counts/collected_counts.txt", header=TRUE, row.names=1)
    # Filter for desired samples
    samples = strsplit(args$input_samples, ",")[[1]]
    data = data[ , c(grep(samples[1], names(data), value=TRUE), grep(samples[2], names(data), value=TRUE))]
    # Load data in DESeq2
    number_samples_1 = length(grep(samples[1], names(data), value=TRUE))
    number_samples_2 = length(grep(samples[2], names(data), value=TRUE))
    condition = as.factor(c(rep(samples[1], number_samples_1), rep(samples[2], number_samples_2)))
    organization = data.frame(names(data), condition=condition)
    #DESeq2 calculations
    dds = DESeqDataSetFromMatrix(countData=data, colData=organization, design=~condition)
    dds = DESeq(dds, parallel=TRUE)
    res_DESeq2 = results(dds, parallel=TRUE)
    I call my script from the terminal as an Rscript like so:

    de_analysis.R cell_line_1,cell_line_2
    ... mainly so that I can do different analyses on different combinations of cell lines without having to do different sets of code. I am guessing that there's something wrong with part of this that is making the fold change go bonkers. It is my intent that the fold change would here be cell_line_1 / cell_line_2. Given an experiment with 3 and 4 replicates per cell line (as above) the "condition" variable looks like this:

    [1] hct hct hct rko rko rko rko
    Levels: hct rko
    And the "organization" variable like this:

    Code: condition
    1       hct_a       hct
    2       hct_b       hct
    3       hct_c       hct
    4       rko_a       rko
    5       rko_b       rko
    6       rko_c       rko
    7       rko_d       rko
    ... which looks right to me. I am stumped, and I would VERY much appreciate any help with this!
    Last edited by ErikFas; 02-17-2015, 06:16 AM.

  • #2
    The lack of the colnames slot having the actual sample names isn't an issue. If you set the row.names() of "organization" to the sample names then that slot will get filled in.

    It's likely that all of the fold-changes just have the opposite sign to what you expect because the base level is different from what you expect. Unless you explicitly specify a factor level order, R will set the lexicographic first level as a factor as the base level. You might consider using the contrast= argument to results() just so that you can specify the base level in the fold-change more conveniently (that, or just have that set further up when you make the "organization" data.frame).


    • #3
      Like dpryan said, it might just be the case of having the two conditions switched around that leads to the inverse fold change. You could check this using something like head(res_DESeq2) for the DESeq2 results which would tell you the order in which it is comparing your conditions. I hope this helps.


      • #4
        Okay, i tried head(res_DESeq2) and this is what I get:

        log2 fold change (MAP): condition rko vs hct 
        Wald test p-value: condition rko vs hct
        ... and I'm calling the script as hct,rko. Does that mean that it is doing fold change = hct / rko (like I want) or the other way around?

        I haven't really used contrasts much, as I read in the vignette that it's mainly used for cases where you have more than 2 comparisons (i.e. A vs B vs C and the combinations thereof), or am I misreading that? Or do you mean some sort of "hct vs rko vs base level"? (What is base level here, anyway?)


        • #5
          The other way around. rko vs hct means log2(rko/hct). You can specify the groups in any order and this will still be how the fold-change is computed due to how factors are constructed in R.

          Regarding contrasts, yes, those are mostly used with more groups, but they can also allow you to arbitrarily set which comparison is used for the fold changes. For a baselevel, R will always use the lexicographicly first factor level. Since "hct" would come before "rct" in a dictionary, it's the base level used for comparisons. Similarly, if your groups were "control" and "cancer", then the fold-change would be control/cancer, even though that's the opposite of what you want. So either set the base level manually:
          groups <- factor(groups, levels=c("rko", "hct"))
          or use a contrast.


          • #6
            Okay, so setting groups as you said (levels=c("rko","hct")) would make the fold change be hct / rko? I don't have any groups-parameter in any of my function calls that I know of; where is it supposed to go, and where does the already existing groups that you use come from?


            • #7
              Originally posted by ErikFas View Post
              Okay, so setting groups as you said (levels=c("rko","hct")) would make the fold change be hct / rko?
              Exactly. the levels= part is a convenient way to reset how R would normally handle things.

              I don't have any groups-parameter in any of my function calls that I know of; where is it supposed to go, and where does the already existing groups that you use come from?
              "groups" was just an example name. I guess it's called "organization" in your script.


              • #8
                That did the trick! Although the thing I needed to add was condition, like this:

                condition = as.factor(c(rep(samples[1], number_samples_1), rep(samples[2], number_samples_2)))  # as previously
                condition = factor(condition, levels=c(samples[2], samples[1]))  # new line
                ... rather than organization, which I started with. I then checked the DESeq2 vignette, and they did:

                dds$condition = factor(dds$condition, levels=c(samples[2], samples[1]))
                ... which also works just fine, except it doesn't do anything for my downstream analyses of limma(voom) and edgeR - changing the condition parameter does. So, thanks again for all your help!


                • #9
                  hi Erik,

                  Do we actually have the line of code with "levels=c(samples[2], samples[1])" somewhere? I can't find it. I try to encourage explicitly writing out the level names as character, because sample order can change.


                  • #10
                    Hey, Michael! Sorry, I wasn't being clear. You have the line written explicitly as "levels=c("untreated","treated")", just like you say you do - I was just writing the equivelant for my code for clarity of the discussion. Sorry for the confusion!


                    Latest Articles


                    • seqadmin
                      Advanced Tools Transforming the Field of Cytogenomics
                      by seqadmin

                      At the intersection of cytogenetics and genomics lies the exciting field of cytogenomics. It focuses on studying chromosomes at a molecular scale, involving techniques that analyze either the whole genome or particular DNA sequences to examine variations in structure and behavior at the chromosomal or subchromosomal level. By integrating cytogenetic techniques with genomic analysis, researchers can effectively investigate chromosomal abnormalities related to diseases, particularly...
                      09-26-2023, 06:26 AM
                    • seqadmin
                      How RNA-Seq is Transforming Cancer Studies
                      by seqadmin

                      Cancer research has been transformed through numerous molecular techniques, with RNA sequencing (RNA-seq) playing a crucial role in understanding the complexity of the disease. Maša Ivin, Ph.D., Scientific Writer at Lexogen, and Yvonne Goepel Ph.D., Product Manager at Lexogen, remarked that “The high-throughput nature of RNA-seq allows for rapid profiling and deep exploration of the transcriptome.” They emphasized its indispensable role in cancer research, aiding in biomarker...
                      09-07-2023, 11:15 PM
                    • seqadmin
                      Methods for Investigating the Transcriptome
                      by seqadmin

                      Ribonucleic acid (RNA) represents a range of diverse molecules that play a crucial role in many cellular processes. From serving as a protein template to regulating genes, the complex processes involving RNA make it a focal point of study for many scientists. This article will spotlight various methods scientists have developed to investigate different RNA subtypes and the broader transcriptome.

                      Whole Transcriptome RNA-seq
                      Whole transcriptome sequencing...
                      08-31-2023, 11:07 AM





                    Topics Statistics Last Post
                    Started by seqadmin, Yesterday, 06:57 AM
                    0 responses
                    Last Post seqadmin  
                    Started by seqadmin, 09-26-2023, 07:53 AM
                    0 responses
                    Last Post seqadmin  
                    Started by seqadmin, 09-25-2023, 07:42 AM
                    0 responses
                    Last Post seqadmin  
                    Started by seqadmin, 09-22-2023, 09:05 AM
                    0 responses
                    Last Post seqadmin