  • frymor
    Senior Member
    • May 2010
    • 151

    clarification of rna-seq normalization

    Hi everybody,

    Over the last few days I have read a lot of differing opinions about RNA-seq normalization methods.
    To be honest, I'm quite confused at the moment, so I would like to ask for your help in clarifying which normalization method to use in which situation.

    I'm sure there is no straightforward answer to such a question, but I would really appreciate hearing opinions, even contradictory ones, if that helps explain the problem to other users as well.

    As far as I understand, there is no "standard" method for normalizing RNA-seq data.

    We have one RNA-seq experiment with only one sample for control and one for treatment. Although the lack of replicates means we cannot assess significance, I would like to understand how to work with RNA-seq data in general.

    We would like to look at both differential expression and differences in splice variants between the two conditions.
    I have read opinions about how best to normalize the data for identifying differentially expressed genes and for identifying isoforms.
    Apparently these two goals call for different analyses.
    The best example of that was the discussion between Simon and lpachter about when to normalize and how, here: http://seqanswers.com/forums/showthr...p?t=586&page=1

    I think it shows how controversial this can be. I was interested in this discussion, though it is quite old and a lot has probably changed since.

    RPKM measures the relative level of gene expression between experiments, but apparently some people are against it because of certain biases it cannot compensate for. In the post above, Simon mentions DESeq (and edgeR), which are supposed to work better for differential expression.

    So my questions are:
    (I will probably have many more, but these are a start)

    1. Would it be better to normalize the data twice, separately for each of the two goals?

    2. Does it make sense to apply one normalization after the other?

    3. Can I rely on cuffdiff/cuffcompare to give me a good estimate of the splice variants, and on DESeq/DEGseq to give me a good estimate of the differentially expressed genes?

    I would appreciate any comments or discussion.

    Thanks

    A.
  • eslondon
    Member
    • Jul 2009
    • 21

    #2
    Clearly it is important to follow the assumptions and models within each of the tools you mention.

    If you want to compile a simple "table of expression", you can produce RPKMs, fold changes, etc. If, however, you use a specific tool such as edgeR, which has its own methodology for normalizing and estimating differences in expression (bearing in mind that edgeR implements a variety of models, as explained in its manual), then you should give it what it expects, i.e. raw read counts.
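
To make the "raw read counts" point concrete, here is a minimal Python sketch of the median-of-ratios idea that DESeq describes for estimating per-sample size factors from raw counts. It is an illustration only, not the package's code; the gene names, the counts, and the function name `size_factors` are made up for the example.

```python
import math
from statistics import median


def size_factors(counts):
    """Estimate one size factor per sample from raw counts.

    counts: dict mapping gene name -> list of raw read counts, one per sample.
    Genes with a zero count in any sample are skipped, because their
    geometric mean across samples is zero.
    """
    n_samples = len(next(iter(counts.values())))
    log_ratios = [[] for _ in range(n_samples)]
    for row in counts.values():
        if min(row) <= 0:
            continue
        # Log of the geometric mean of this gene across all samples.
        log_geo_mean = sum(math.log(c) for c in row) / n_samples
        for j, c in enumerate(row):
            log_ratios[j].append(math.log(c) - log_geo_mean)
    # The size factor is the median ratio to the per-gene geometric mean.
    return [math.exp(median(r)) for r in log_ratios]


# Hypothetical toy data: sample 2 was sequenced twice as deeply as sample 1.
raw_counts = {
    "geneA": [100, 200],
    "geneB": [50, 100],
    "geneC": [10, 20],
    "geneD": [0, 5],  # skipped: zero count in sample 1
}

factors = size_factors(raw_counts)
normalized = {
    gene: [c / f for c, f in zip(row, factors)]
    for gene, row in raw_counts.items()
}
print(factors)
print(normalized["geneA"])
```

The tool needs the untransformed counts to do this kind of estimation itself, which is why feeding it RPKM values instead would break its assumptions.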

    Since these are still early days, lab validation of results is clearly the key to understanding which tools give you the best answers in the end.
    --------------------------------------
    Elia Stupka
    Co-Director and Head of Unit
    Center for Translational Genomics and Bioinformatics
    San Raffaele Scientific Institute
    Via Olgettina 58
    20132 Milano
    Italy
    ---------------------------------------

    • sphil
      Senior Member
      • Apr 2010
      • 192

      #3
      Hey,

      you are more or less asking for the 'holy grail': how do I normalize my data?
      In my opinion the most crucial step is to know where your data comes from. DE detection between technical replicates needs to be handled differently from DE detection between biological replicates (Poisson vs. negative binomial; see Marioni et al.). In addition, as mentioned above, every method assumes a different distribution of reads.
      RPKM 'just' normalizes for gene length and the total number of reads. It does not correct biases arising from transcript abundance in the library. Your RPKM values should therefore follow a normal distribution, and they should not show a linear correlation between gene length and transcript abundance. However, since housekeeping genes contribute a large share of the transcripts, one should also consider an additional normalization, for instance quantile normalization. DESeq (and similar tools) want the raw counts so that they can estimate dispersion and fit their distributional assumptions to the given data. So I would run several analyses (i.e. DESeq as well as an RPKM/fold-change analysis) and compare the results. From that comparison you can get at least some idea of which distribution fits your data best.
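
For reference, the RPKM/fold-change analysis mentioned above can be sketched in a few lines of Python. RPKM is reads per kilobase of gene per million mapped reads; the numbers and function names below are hypothetical and only illustrate the arithmetic.

```python
import math


def rpkm(read_count, gene_length_bp, total_mapped_reads):
    """Reads per kilobase of gene per million mapped reads."""
    if gene_length_bp <= 0 or total_mapped_reads <= 0:
        raise ValueError("gene length and library size must be positive")
    return read_count * 1e9 / (gene_length_bp * total_mapped_reads)


def log2_fold_change(rpkm_treatment, rpkm_control):
    """log2 ratio of treatment to control; both values must be > 0."""
    return math.log2(rpkm_treatment / rpkm_control)


# Hypothetical gene of 2,000 bp in two libraries of different depth.
control = rpkm(read_count=500, gene_length_bp=2000, total_mapped_reads=10_000_000)
treatment = rpkm(read_count=1500, gene_length_bp=2000, total_mapped_reads=15_000_000)

print(control, treatment, log2_fold_change(treatment, control))
```

Note that the only library-level quantity here is the total read count, so a few very abundant transcripts can shift every other gene's RPKM; that is the bias described above.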

      • DZhang
        Senior Member
        • Jun 2010
        • 177

        #4
        Hi frymor,

        You may try different methods, but ultimately you must rely on follow-up experiments to validate the results. Say you try 2-3 analysis methods/models: some DE genes will be identified by all methods and others by only some. You need to validate them with an independent method, e.g. qPCR. The field needs enough validation results to see which method is better suited to a given application.
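
One simple way to organize such a comparison is to count how many methods call each gene, which gives a consensus list and a list of method-specific calls to prioritize for qPCR. The Python sketch below assumes each method has already produced a set of DE gene IDs; the method labels and gene IDs are invented for illustration.

```python
from collections import Counter


def split_by_support(calls):
    """Split DE genes into those called by every method and the rest.

    calls: dict mapping method name -> iterable of gene IDs called DE.
    Returns (consensus, partial, support), where support maps each gene
    to the number of methods that called it.
    """
    support = Counter()
    for genes in calls.values():
        support.update(set(genes))  # count each gene once per method
    n_methods = len(calls)
    consensus = sorted(g for g, k in support.items() if k == n_methods)
    partial = sorted(g for g, k in support.items() if k < n_methods)
    return consensus, partial, support


# Hypothetical DE calls from three analyses of the same data set.
de_calls = {
    "deseq": {"g1", "g2", "g3"},
    "edger": {"g2", "g3", "g4"},
    "rpkm_fold_change": {"g3", "g5"},
}

consensus, partial, support = split_by_support(de_calls)
print("called by all methods:", consensus)
print("called by only some:", partial)
```

Agreement between methods is not validation in itself; it only helps decide which candidates to test first.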
