Unconfigured Ad

Collapse
X
 
  • Filter
  • Time
  • Show
Clear All
new posts
  • ErikFas
    Member
    • Jun 2014
    • 86

    Weird DE-results

    I'm doing some differential expression analysis with DESeq2, limma(voom) and edgeR, and I get some weird results. There is a gene that I know should be highly expressed in one of my samples, but my results tell me it's the other way around. Looking at both the counts and the FPKM values (from Cufflinks) for said gene, they both agree that I should get a log2 fold change around 0.8, but I get around -0.8. Setting aside the fact that eyeballing a fold change from counts or FPKM isn't the right way to do things, it should at the very least be of the correct sign, right?

    I have not noticed this until now, when I happened to check this particular gene, and now I seem to see this in other genes as well. I can do some other eyeballing for random genes where I see a difference in counts/FPKM, and while it's not always so pronounced, I do seem to get some weird values. So, I started looking in my code for possible reasons for this. I checked my dds structure in DESeq2, and this is what it looks like:

    Code:
    dds
    class: DESeqDataSet 
    dim: 20477 7 
    exptData(0):
    assays(1): counts
    rownames(20477): ENSG00000000003 ENSG00000000005 ... ENSG00000273431
      ENSG00000273457
    rowData metadata column names(0):
    colnames(7): 1 2 ... 6 7
    colData names(2): names.data. condition
    Now, comparing this to the DESeq2 vignette, everything seems fine, except the "colnames(7): 1 2 ... 6 7" part - instead of numbers in my data, the vignette has "treated1, treated2", etc instead. I figured it must then be due to some fault in me creating the dds construct, but I can't figure out what. This is (the relevant parts of) my code:

    Code:
    # Load data
    cat('reading data\n')
    data = read.table("counts/collected_counts.txt", header=TRUE, row.names=1)
    
    # Filter for desired samples
    samples = strsplit(args$input_samples, ",")[[1]]
    data = data[ , c(grep(samples[1], names(data), value=TRUE), grep(samples[2], names(data), value=TRUE))]
    
    # Load data in DESeq2
    number_samples_1 = length(grep(samples[1], names(data), value=TRUE))
    number_samples_2 = length(grep(samples[2], names(data), value=TRUE))
    condition = as.factor(c(rep(samples[1], number_samples_1), rep(samples[2], number_samples_2)))
    organization = data.frame(names(data), condition=condition)
    
    #DESeq2 calculations
    dds = DESeqDataSetFromMatrix(countData=data, colData=organization, design=~condition)
    dds
    dds = DESeq(dds, parallel=TRUE)
    res_DESeq2 = results(dds, parallel=TRUE)
    I call my script from the terminal as an Rscript like so:

    Code:
    de_analysis.R cell_line_1,cell_line_2
    ... mainly so that I can do different analyses on different combinations of cell lines without having to do different sets of code. I am guessing that there's something wrong with part of this that is making the fold change go bonkers. It is my intent that the fold change would here be cell_line_1 / cell_line_2. Given an experiment with 3 and 4 replicates per cell line (as above) the "condition" variable looks like this:

    Code:
    [1] hct hct hct rko rko rko rko
    Levels: hct rko
    And the "organization" variable like this:

    Code:
      names.data. condition
    1       hct_a       hct
    2       hct_b       hct
    3       hct_c       hct
    4       rko_a       rko
    5       rko_b       rko
    6       rko_c       rko
    7       rko_d       rko
    ... which looks right to me. I am stumped, and I would VERY much appreciate any help with this!
    Last edited by ErikFas; 02-17-2015, 06:16 AM.
  • dpryan
    Devon Ryan
    • Jul 2011
    • 3478

    #2
    The lack of the colnames slot having the actual sample names isn't an issue. If you set the row.names() of "organization" to the sample names then that slot will get filled in.

    It's likely that all of the fold-changes just have the opposite sign to what you expect because the base level is different from what you expect. Unless you explicitly specify a factor level order, R will set the lexicographic first level as a factor as the base level. You might consider using the contrast= argument to results() just so that you can specify the base level in the fold-change more conveniently (that, or just have that set further up when you make the "organization" data.frame).

    Comment

    • aggp11
      Member
      • Jun 2011
      • 87

      #3
      Like dpryan said, it might just be the case of having the two conditions switched around that leads to the inverse fold change. You could check this using something like head(res_DESeq2) for the DESeq2 results which would tell you the order in which it is comparing your conditions. I hope this helps.

      Comment

      • ErikFas
        Member
        • Jun 2014
        • 86

        #4
        @aggp11
        Okay, i tried head(res_DESeq2) and this is what I get:

        Code:
        log2 fold change (MAP): condition rko vs hct 
        Wald test p-value: condition rko vs hct
        ... and I'm calling the script as hct,rko. Does that mean that it is doing fold change = hct / rko (like I want) or the other way around?

        @dpryan
        I haven't really used contrasts much, as I read in the vignette that it's mainly used for cases where you have more than 2 comparisons (i.e. A vs B vs C and the combinations thereof), or am I misreading that? Or do you mean some sort of "hct vs rko vs base level"? (What is base level here, anyway?)

        Comment

        • dpryan
          Devon Ryan
          • Jul 2011
          • 3478

          #5
          The other way around. rko vs hct means log2(rko/hct). You can specify the groups in any order and this will still be how the fold-change is computed due to how factors are constructed in R.

          Regarding contrasts, yes, those are mostly used with more groups, but they can also allow you to arbitrarily set which comparison is used for the fold changes. For a baselevel, R will always use the lexicographicly first factor level. Since "hct" would come before "rct" in a dictionary, it's the base level used for comparisons. Similarly, if your groups were "control" and "cancer", then the fold-change would be control/cancer, even though that's the opposite of what you want. So either set the base level manually:
          Code:
          groups <- factor(groups, levels=c("rko", "hct"))
          or use a contrast.

          Comment

          • ErikFas
            Member
            • Jun 2014
            • 86

            #6
            Okay, so setting groups as you said (levels=c("rko","hct")) would make the fold change be hct / rko? I don't have any groups-parameter in any of my function calls that I know of; where is it supposed to go, and where does the already existing groups that you use come from?

            Comment

            • dpryan
              Devon Ryan
              • Jul 2011
              • 3478

              #7
              Originally posted by ErikFas View Post
              Okay, so setting groups as you said (levels=c("rko","hct")) would make the fold change be hct / rko?
              Exactly. the levels= part is a convenient way to reset how R would normally handle things.

              I don't have any groups-parameter in any of my function calls that I know of; where is it supposed to go, and where does the already existing groups that you use come from?
              "groups" was just an example name. I guess it's called "organization" in your script.

              Comment

              • ErikFas
                Member
                • Jun 2014
                • 86

                #8
                That did the trick! Although the thing I needed to add was condition, like this:

                Code:
                condition = as.factor(c(rep(samples[1], number_samples_1), rep(samples[2], number_samples_2)))  # as previously
                condition = factor(condition, levels=c(samples[2], samples[1]))  # new line
                ... rather than organization, which I started with. I then checked the DESeq2 vignette, and they did:

                Code:
                dds$condition = factor(dds$condition, levels=c(samples[2], samples[1]))
                ... which also works just fine, except it doesn't do anything for my downstream analyses of limma(voom) and edgeR - changing the condition parameter does. So, thanks again for all your help!

                Comment

                • Michael Love
                  Senior Member
                  • Jul 2013
                  • 333

                  #9
                  hi Erik,

                  Do we actually have the line of code with "levels=c(samples[2], samples[1])" somewhere? I can't find it. I try to encourage explicitly writing out the level names as character, because sample order can change.

                  Comment

                  • ErikFas
                    Member
                    • Jun 2014
                    • 86

                    #10
                    Hey, Michael! Sorry, I wasn't being clear. You have the line written explicitly as "levels=c("untreated","treated")", just like you say you do - I was just writing the equivelant for my code for clarity of the discussion. Sorry for the confusion!

                    Comment

                    Latest Articles

                    Collapse

                    • GATTACAT
                      Reply to Nine Things a Sample Prep Scientist Thinks About Before Sequencing
                      by GATTACAT
                      Love this - good data definitely starts from good input, and poor input can only give relatively poor data. I particularly like the mention of Nanodrop/absorbance based methods for quantification. It's such a toss up if you'll get an accurate reading or what amounts to a randomly generated number, and a lot of library/sequencing related issues can be traced back to poor quant.
                      07-01-2026, 11:43 AM
                    • SEQadmin2
                      Nine Things a Sample Prep Scientist Thinks About Before Sequencing
                      by SEQadmin2


                      I’m not a sequencing expert. I’m a purification scientist who uses NGS to evaluate workflows my group develops. With this perspective, we think about the sample first and the NGS workflow second. The sequencer is an exceptionally honest reporter, but it can only report on what you give it, so whether you get clean, interpretable data from an NGS workflow is largely determined before you begin.

                      Here are nine questions we think about, in roughly the order they matter, before...
                      06-18-2026, 07:11 AM

                    ad_right_rmr

                    Collapse

                    News

                    Collapse

                    Topics Statistics Last Post
                    Started by SEQadmin2, 07-02-2026, 11:08 AM
                    0 responses
                    16 views
                    0 reactions
                    Last Post SEQadmin2  
                    Started by SEQadmin2, 06-30-2026, 05:37 AM
                    0 responses
                    17 views
                    0 reactions
                    Last Post SEQadmin2  
                    Started by SEQadmin2, 06-26-2026, 11:10 AM
                    0 responses
                    20 views
                    0 reactions
                    Last Post SEQadmin2  
                    Started by SEQadmin2, 06-17-2026, 06:09 AM
                    0 responses
                    54 views
                    0 reactions
                    Last Post SEQadmin2  
                    Working...