  • Which normalization methods are used for RNAseqV2 data at TCGA?

    Hi,

    I have downloaded the RNAseqV2 data for BRCA. There are different versions of the expression values inside the RNAseqV2 Level 3 folder.
    In the files with the extensions:
    .rsem.isoforms.results: we have raw_count and scaled_estimate
    .rsem.genes.normalized_results: we have the normalized count

    My question is: what is the difference between the normalized count and the scaled estimate? Which normalization methods were used?

  • #2
    The wiki explains how the data was handled - basically, there are two pipelines that were used, and the file names should tell you which files came out of which pipeline.

    see https://wiki.nci.nih.gov/display/TCGA/RNASeq+Version+2
    Michael Black, Ph.D.
    ScitoVation LLC. RTP, N.C.



    • #3
      It took me a while to get my head around this, since the column names in the rsem.genes/isoforms.results files don't match the default output of RSEM, neither the version they claim to have used nor the most current version.

      The (first) RSEM paper explains that the program calculates two values. One represents the (estimated) number of reads that aligned to a transcript. This value is not an integer, because RSEM only reports an estimate of how many ambiguously mapping reads belong to a transcript/gene. This number is what TCGA, slightly misleadingly, calls raw counts.

      The scaled estimate value, on the other hand, is the estimated fraction of all sequenced transcripts that come from the gene/transcript. Newer versions of RSEM call this value (multiplied by 1e6) TPM - Transcripts Per Million. It's closely related to FPKM, as explained on the RSEM website. The important point is that TPM, like FPKM, is normalised for transcript length, whereas the "raw" counts are not!
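To make the scaled_estimate/TPM relationship concrete, here is a minimal Python sketch. The function name is hypothetical and the input values are made up; it only assumes that scaled_estimate values are fractions that sum to roughly 1 within a sample.

```python
def scaled_estimate_to_tpm(scaled_estimates):
    """Convert scaled_estimate fractions to TPM by multiplying by 1e6.

    Hypothetical helper for illustration only; it assumes the input
    values are per-sample fractions as described in the post above.
    """
    return [value * 1e6 for value in scaled_estimates]
```

For example, a toy sample with fractions 0.5, 0.25 and 0.25 gives TPM values that add up to one million, which is what makes TPM comparable between samples as a relative abundance.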

      The *.normalized_results files, on the other hand, just contain a scaled version of the raw_count column. The values are divided by the sample's 75th percentile and multiplied by 1000. This should make the values a bit more comparable between experiments. The Perl code for this upper-quartile normalisation can be found here.
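The scaling described above can be sketched in a few lines of Python. This is not the TCGA Perl script, only an illustration of "divide by the 75th percentile, multiply by 1000". The function names, the linear-interpolation percentile, and the choice to ignore zero counts when computing the percentile (the `skip_zeros` flag) are all assumptions and may differ from what the pipeline actually does.

```python
def percentile(values, q):
    """Percentile (q in 0..100) of a non-empty list, linear interpolation."""
    ordered = sorted(values)
    pos = (len(ordered) - 1) * q / 100.0
    lower = int(pos)
    upper = min(lower + 1, len(ordered) - 1)
    frac = pos - lower
    return ordered[lower] + (ordered[upper] - ordered[lower]) * frac


def upper_quartile_normalize(raw_counts, q=75, target=1000.0, skip_zeros=True):
    """Scale one sample's counts so its q-th percentile equals `target`.

    skip_zeros is an assumption: when True, genes with zero counts are
    left out of the percentile calculation (they are still returned).
    """
    basis = [c for c in raw_counts if c > 0] if skip_zeros else list(raw_counts)
    if not basis:
        raise ValueError("no counts available to compute the percentile")
    uq = percentile(basis, q)
    if uq == 0:
        raise ValueError("percentile is zero; cannot normalize")
    return [c / uq * target for c in raw_counts]
```

With the toy counts `[0, 10, 20, 30, 40, 50]` the non-zero 75th percentile is 40, so a count of 40 maps to 1000 and a count of 20 maps to 500. Note that no gene or transcript length enters this calculation.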

      In conclusion, I would strongly recommend using the TPM/scaled_estimate values for all intents and purposes. It seems to me to be the more robust and mathematically sound value.

      Hope that helps, best wishes,

      Benjamin



      • #4
        Hi,

        Yep, as Benjamin has pointed out, we have found the data in the *.normalized_results files to be the most robust and comparable across samples and experiments. We've also done some testing against values we generate using a standard 75th-percentile normalization approach on the raw counts, and we find our normalized values and the values presented in *.normalized_results to be in very high concordance (assessed by Pearson correlation of gene-by-gene value comparisons).

        In fact, we chose to import the raw counts into our software platform, GenePool. When users work with the RNA-Seq data in GenePool, they can choose between different normalization methods, one of which is the standard 75th-percentile normalization method.

        If you're interested in checking out what we've done to bring TCGA data into GenePool, here are some related posts:

        Registered SEQanswers sponsors/vendors can post commercial content here. Please support our sponsors!


        Good luck!

        ------------------------------
        GenePool is making genomics data management, analysis, and sharing easier!
        Products @ www.stationxinc.com
        Last edited by GenePool; 11-23-2014, 09:25 PM.



        • #5
          Hello
          I am interested in TCGA analysis. I would like to see the differences in my gene of interest across multiple groups of lung cancer patients. Do you think I should use scaled_estimate or raw_count?
          Thank you for your help



          • #6
            Originally posted by dreamer2001:
            Hello
            I am interested in TCGA analysis. I would like to see the differences in my gene of interest across multiple groups of lung cancer patients. Do you think I should use scaled_estimate or raw_count?
            Thank you for your help
            If you are only interested in checking one (or a few) genes, then you may want to do that at cBioPortal (http://www.cbioportal.org/) or the GenePool site mentioned above (if it is really free).



            • #7
              Hello again,
              sorry guys, I am facing an issue here. I used the scaled estimate from the TCGA data to correlate two genes across 550 patients. One reviewer said I should use the normalized count, as used by cBioPortal. Which one is better? And how can I justify using the scaled estimate over the normalized count? To me the scaled estimate made more sense, so I used it because I could understand how it is generated from the raw count.
              Thanks for your help.



              • #8
                Scaled estimate and normalised count are similar ways of normalising the reads of each sample. Neither one is better and both are fine. Make a scatterplot of scaled estimate vs. normalised count to show the reviewer that they basically provide the same information and complain that there's no good reason to change your analysis and figures.
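One way to back up the scatterplot with a number is to compute the correlation between the two measures for a gene across patients. The sketch below is a plain-Python illustration; the function names are hypothetical, and the log2(x + 1) transform is an assumption (a common choice for expression values, not something the thread prescribes).

```python
import math


def log2_plus_one(values):
    """Apply log2(x + 1) to tame the wide dynamic range of expression values."""
    return [math.log2(v + 1) for v in values]


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)
```

Usage would look like `pearson(log2_plus_one(tpm_for_gene), log2_plus_one(normalized_for_gene))`, where both lists hold one value per patient in the same order. A coefficient close to 1 supports the claim that, for that gene, the two measures carry essentially the same information.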
