Which normalization methods are used for RNASeqV2 data at TCGA?

  • Which normalization methods are used for RNASeqV2 data at TCGA?

    Hi,

    I have downloaded the RNASeqV2 data for BRCA. There are different versions of the expression values inside the RNASeqV2 Level 3 folder, depending on the file extension:
    .rsem.isoforms.results: raw_count and scaled_estimate
    .rsem.genes.normalized_results: normalized_count

    My question is: what is the difference between the normalized count and the scaled estimate? Which normalization methods were used?

  • #2
    The wiki explains how the data was handled - basically, there are two pipelines that were used, and the file names should tell you which files came out of which pipeline.

    see https://wiki.nci.nih.gov/display/TCGA/RNASeq+Version+2
    Michael Black, Ph.D.
    ScitoVation LLC. RTP, N.C.

    • #3
      It took me a while to get my head around this, since the column names in the rsem.genes/isoforms.results files don't match the default output of RSEM, neither the version they claim to have used nor the most current version.

      The (first) RSEM paper explains that the program calculates two values. One represents the (estimated) number of reads that aligned to a transcript. This value is not an integer because RSEM can only estimate how many of the ambiguously mapping reads belong to each transcript/gene. This number is what the TCGA, slightly misleadingly, calls raw counts.

      The scaled estimate value, on the other hand, is the estimated frequency of the gene/transcript among the total number of transcripts that were sequenced. Newer versions of RSEM call this value (multiplied by 1e6) TPM - Transcripts Per Million. It's closely related to FPKM, as explained on the RSEM website. The important point is that TPM, like FPKM, is already corrected for transcript length, whereas "raw" counts are not!
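As a minimal sketch of the relationship described above (scaled_estimate multiplied by 1e6 gives TPM): the function name and the toy values below are made up for illustration and are not taken from any TCGA file.

```python
def scaled_estimate_to_tpm(scaled_estimates):
    """Convert RSEM scaled_estimate fractions to TPM (fraction * 1e6)."""
    return [value * 1e6 for value in scaled_estimates]


# Hypothetical scaled_estimate values for a toy sample with four transcripts.
# They are fractions of the whole transcript pool, so they sum to 1.
toy_scaled = [0.5, 0.25, 0.125, 0.125]
toy_tpm = scaled_estimate_to_tpm(toy_scaled)

for fraction, tpm in zip(toy_scaled, toy_tpm):
    print(f"scaled_estimate={fraction:<6} TPM={tpm:.1f}")
```

Because the fractions of a complete sample sum to 1, the TPM values sum to one million, which is what makes them comparable between samples.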

      The *.normalized_results files, on the other hand, just contain a rescaled version of the raw_count column. The values are divided by the sample's 75th percentile and multiplied by 1000. This should make the values a bit more comparable between experiments. The Perl code for this upper-quartile normalisation can be found here.
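A rough sketch of that rescaling, for anyone who wants to reproduce it on the raw counts. This is not the TCGA Perl script: the function name, the nearest-rank percentile, and the choice to ignore zero counts when finding the percentile are all my assumptions, so expect small differences from the published values.

```python
import math


def upper_quartile_normalize(raw_counts, percentile=75, target=1000.0, skip_zeros=True):
    """Divide counts by the sample's chosen percentile and multiply by `target`."""
    # Assumption: zero counts are left out when locating the percentile.
    values = sorted(c for c in raw_counts if c > 0 or not skip_zeros)
    if not values:
        raise ValueError("no usable counts to compute the percentile from")
    # Assumption: nearest-rank percentile (no interpolation between values).
    rank = max(1, math.ceil(percentile / 100 * len(values)))
    quartile = values[rank - 1]
    if quartile == 0:
        raise ValueError("percentile value is zero; cannot normalize")
    return [c / quartile * target for c in raw_counts]


# Hypothetical raw_count values for one sample (toy numbers).
toy_raw = [0, 10, 20, 30, 40]
toy_normalized = upper_quartile_normalize(toy_raw)
print(toy_normalized)
```

In this toy sample the 75th percentile of the non-zero counts is 30, so the gene with a raw count of 30 ends up at exactly 1000.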

      In conclusion, I would strongly recommend using the TPM/scaled_estimate values for all intents and purposes. It seems to me to be the more robust and mathematically sound value.

      Hope that helps, best wishes,

      Benjamin

      • #4
        Hi,

        Yep, as Benjamin has pointed out, we have found the data in the *.normalized_results files to be the most robust and comparable across samples and experiments. We've also tested them against values we generate ourselves by applying a standard 75th-percentile normalization to the raw counts, and we find our normalized values and the values presented in *.normalized_results to be in very high concordance (assessed by Pearson correlation of gene-by-gene value comparisons).

        In fact, we chose to import the raw counts into our software platform, GenePool. When users of GenePool work with the RNA-Seq data, they can choose between different normalization methods, one of which is the standard 75th-percentile normalization method.

        If you're interested in checking out what we've done to bring TCGA data into GenePool, here are some related posts:

        http://seqanswers.com/forums/showthread.php?t=48485
        http://seqanswers.com/forums/showthread.php?t=42471

        Good luck!

        ------------------------------
        GenePool is making genomics data management, analysis, and sharing easier!
        Products @ www.stationxinc.com
        Last edited by GenePool; 11-23-2014, 09:25 PM.

        • #5
          Hello
          I am interested in TCGA analysis. I would like to see the differences in my gene of interest across multiple groups of lung cancer patients. Do you guys think I should use scaled_estimate or raw_count?
          Thank you for your help

          • #6
            Originally posted by dreamer2001
            Hello
            I am interested in TCGA analysis. I would like to see the differences in my gene of interest across multiple groups of lung cancer patients. Do you guys think I should use scaled_estimate or raw_count?
            Thank you for your help

            If you are only interested in checking one (or a few) genes, then you may want to do that at cBioPortal (http://www.cbioportal.org/) or the GenePool site mentioned above (if it is really free).

            • #7
              Hello again,
              Sorry guys, I am facing an issue here. I used the scaled estimate from the TCGA data to correlate two genes across 550 patients. One reviewer said I should use the normalized count, as used by cBioPortal. Which one is better? And how can I justify using the scaled estimate over the normalized count? To me the scaled estimate made more sense, so I just used it because I could understand how it is generated from the raw count.
              Thanks for your help.

              • #8
                Scaled estimate and normalised count are similar ways of normalising the reads of each sample. Neither one is better and both are fine. Make a scatterplot of scaled estimate vs. normalised count to show the reviewer that they basically provide the same information and complain that there's no good reason to change your analysis and figures.
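To put a number next to that scatterplot, a Pearson correlation between the two measures is a simple complement. The sketch below uses only the standard library; the function name and the toy values are hypothetical and are not real TCGA data.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long lists."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two lists of equal length with at least 2 values")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical values for one gene across five patients (toy numbers):
# TPM derived from scaled_estimate, and the matching normalized_count.
toy_tpm = [1.0, 2.0, 3.0, 4.0, 5.0]
toy_normalized_count = [10.0, 21.0, 29.0, 41.0, 50.0]

r = pearson(toy_tpm, toy_normalized_count)
print(f"Pearson r = {r:.4f}")
```

With real data you would pass the per-patient values of the same gene from both file types; a coefficient close to 1 would back up the argument that the two measures carry essentially the same information for that gene.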
