Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • RNAseq analysis by DESeq : can't find a gene previously published as important

    Hi,

    I'm new to this forum, but It's been a while i'm looking on it to find some solutions about my problem. I have, for training purpose, to analyze data comming from a publication on Candida parapsilosis's differential expression in two condition : normoxia and hypoxia.

    We were provided with two files : a Chip with log ratios for each spot and a RNAseq count file

    Here's my problem : the original publication makes it very clear that a peculiar gene, CPAR2_403510 should be found as differentially expressed by RNAseq.
    They used FPKM and a "hte statistical test to compute significance of FPKM observed changes" (I don't have any clue about what that could be)
    EDIT : I guess "hte" is just a keyboard tipping error, but they don't give any further information. The paper says they used Quantile-based normalization to identify differentially expressed genes and that they corrected using fdr.
    So, using DESeq vignette's, as instructed by my teacher (she even provided a barebone script to begin with, in case we would make mistakes), I started sorting interesting genes by adjusted pvalues.

    Here's the data for my peculiar gene :
    > res[1948,]
    id baseMean baseMeanA baseMeanB foldChange log2FoldChange pval padj
    CPAR2_403510 12900.24 2287.847 28818.83 12.59649 3.654949 0.05991648 0.1682318

    And here's the data extracted from the raw count file :
    id N1 N2 N3 N4 N5 N6 H1 H2 H3 H4
    CPAR2_403510 1004 1424 1196 1388 6315 6779 5047 89097 3246 11171

    There is, as you can see 4 Hypoxia duplicates and 6 Normoxia duplicates. I already tried to remove some of these duplicates that appeared as weird on my heatmaps (N5 and N6 looked clusterized and made me think about a batch problem, but stripping them or not, it doesn't get my results better)

    Being totally new to that kind of analyses, and almost to statistics in fact, I don't really know what I should do :
    1) Should I rely on padj only ?
    2) Should I rely on log2FoldChange only ?
    3) Should I create a hybrid statistic taking both into account ? (is that even possible ? I sense that I would introduce some bias)
    4) Is there any explanation ?
    5) Should I discard the paper's information and just go on with those results ?

    I believed that cutting below 0.1 padj was already kinda laxist. But then again, I don't really know.
    I don't know if I provided enough information to be helped, but the goal of my exercise is to come up with a list of the "20" most differentially expressed genes, but cross-linking chip analysis (this one comes fine and I have all the genes I should have in it) and RNAseq results. But I have to admit that I often have like 700 genes of interest, and I'm not sure about the relevence of cutting pvalues always lower.

    Would you have any clue ? I can't use DESeq2 as DESeq was imposed (well I could, but eventually I would have to do it using DESeq). I've tried method = "pooled" on my estimateDispersions as I read that on another thread, but it seems like the results are likewise.

    Thank you for your replies !
    Last edited by elipsoid; 10-30-2013, 02:14 AM.

  • #2
    What are the normalized (not raw) counts for that gene? If the unadjusted p-value isn't itself below 0.05 then I'd be surprised if it ever popped up as being DE. BTW, just because a paper reports something as being significant, that doesn't mean it actually is...there are a LOT of crap analyses in the literature.

    Comment


    • #3
      Thank you for your reply.

      As I am really really new to this, I have a tendency to believe the papers has the right method and something's wrong with mine. But maybe they just analyzed their data in a bad way :/
      It's been 8 hours i'm trying to obtain the same results for their "key gene" (they mention three in the paper, I assumed those three weren't obtained by miss-interpretation).

      Here are the normalized counts (well I think those are, they are the results of :
      > cdsNorm = counts(cds, normalized = TRUE)

      > cdsNorm[1948,]
      countN_rep1 countN_rep2 countN_rep3 countN_rep4 countN_rep5 countN_rep6
      1039.511 1218.836 1043.076 1150.849 3892.303 5382.503
      countH_rep1 countH_rep2 countH_rep3 countH_rep4
      6668.934 90394.665 8004.270 10207.432
      Last edited by elipsoid; 10-30-2013, 02:17 AM.

      Comment


      • #4
        I'm not a statistician either, but to call a comparison significant in t test, the important thing is that the difference between group means is larger than within group variations. From your data, the raw and normalized counts, both showed difference between two groups. But when you used "pooled" estimate for dispersion, you will get larger variation for both groups, therefore eliminated the difference between group means. Since you have enough replicates, you should use "per-condition" estimate of dispersion, then the p value will surely be significant.

        Comment


        • #5
          Originally posted by elipsoid View Post
          Thank you for your reply.

          As I am really really new to this, I have a tendency to believe the papers has the right method and something's wrong with mine. But maybe they just analyzed their data in a bad way :/
          It's been 8 hours i'm trying to obtain the same results for their "key gene" (they mention three in the paper, I assumed those three weren't obtained by miss-interpretation).

          Here are the normalized counts (well I think those are, they are the results of :
          > cdsNorm = counts(cds, normalized = TRUE)

          > cdsNorm[1948,]
          countN_rep1 countN_rep2 countN_rep3 countN_rep4 countN_rep5 countN_rep6
          1039.511 1218.836 1043.076 1150.849 3892.303 5382.503
          countH_rep1 countH_rep2 countH_rep3 countH_rep4
          6668.934 90394.665 8004.270 10207.432
          As bye suggested, using per-condition dispersion might fix this. There's also the issue of how or whether they shared information across genes (see the "sharingMode" option in estimateDispersions). The default is "maximum", which is a good idea unless you have a lot of samples (more than are here). I suspect that using "per-condition" for the dispersion estimation will resolve this without monkeying around with the sharingMode.

          Comment


          • #6
            Can you make a MDS plot of your samples ? Which samples are samples 5 and 6 close to ? For the gene, samples 5 and 6 look more like the hypoxia samples. You should explore if that is the case when using hundreds of genes.

            Comment


            • #7
              The source of the issue is sample H2, which is 10-fold above all the over samples. This makes DESeq comclude that this gene is extremely variable from sample to sample and hence should not be considered significant.

              In other words: Why should we believe that the 3-fold change between the averages of the normal and the hypoxia samples has been caused by the hypoxic treatment, if the data shows us that one can see changes of up to 10-fold even between samples which were not treated differently?

              Comment


              • #8
                Hi, sorry for the lateness of my answers. I do not believe sample H2 to be wrong, but I believe their H2 experiment wasn't a great success on a reproducibility level (see PCA graph). But I strongly believe samples N5 and N6 are batch biaised. I chose to discard them for my analyses.
                I will attach a pdf showing some graphs I generated. I saw on showing that conditions N5 and N6 succombed to something like batch effect. I would gladly done

                pdf1 : with all conditions included : diffAnalysis-AllConditions.pdf
                pdf2 : with N5 and N6 condition removed : diffAnalysis.pdf
                pdf3 : with H2 condition removed : diffAnalysis-H.pdf
                pdf4 : with N5, N6 and H2 conditions removed : diffAnalysis-HNN.pdf

                In order you get :
                1) plotDispEsts(cds, name= "H")
                2) plotDispEsts(cds, name= "N")
                3) plotMA(res)
                4&5) The two next heatmaps where generated with method=blind while computing estimateDispersion
                6) plotPCA(vsdFull, intgroup=c("condition"))
                7) Venn Diagram of my results cutting them only with two parameters : for Chip : pval < 0.1% and for RNAseq adjusted pval < 0.1%

                After using "per-condition" parameter, pvalues got better (and it helped a lot, thanks guys).

                I can't really know how to plot a MDS, I will have to look about it if it is necessary.
                I must admit this dataset is puzzling me.
                Is it a common thing to get that kind of data ?

                Thanks again for all your answers !

                Comment

                Latest Articles

                Collapse

                • seqadmin
                  Recent Advances in Sequencing Analysis Tools
                  by seqadmin


                  The sequencing world is rapidly changing due to declining costs, enhanced accuracies, and the advent of newer, cutting-edge instruments. Equally important to these developments are improvements in sequencing analysis, a process that converts vast amounts of raw data into a comprehensible and meaningful form. This complex task requires expertise and the right analysis tools. In this article, we highlight the progress and innovation in sequencing analysis by reviewing several of the...
                  05-06-2024, 07:48 AM
                • seqadmin
                  Essential Discoveries and Tools in Epitranscriptomics
                  by seqadmin




                  The field of epigenetics has traditionally concentrated more on DNA and how changes like methylation and phosphorylation of histones impact gene expression and regulation. However, our increased understanding of RNA modifications and their importance in cellular processes has led to a rise in epitranscriptomics research. “Epitranscriptomics brings together the concepts of epigenetics and gene expression,” explained Adrien Leger, PhD, Principal Research Scientist...
                  04-22-2024, 07:01 AM

                ad_right_rmr

                Collapse

                News

                Collapse

                Topics Statistics Last Post
                Started by seqadmin, Yesterday, 06:35 AM
                0 responses
                15 views
                0 likes
                Last Post seqadmin  
                Started by seqadmin, 05-09-2024, 02:46 PM
                0 responses
                21 views
                0 likes
                Last Post seqadmin  
                Started by seqadmin, 05-07-2024, 06:57 AM
                0 responses
                18 views
                0 likes
                Last Post seqadmin  
                Started by seqadmin, 05-06-2024, 07:17 AM
                0 responses
                19 views
                0 likes
                Last Post seqadmin  
                Working...
                X