  • puggie
    Member
    • Nov 2011
    • 52

    How to normalize RNA-seq coverage data for multiple samples?

    Dear Forum,

    I'm analyzing several RNA-seq samples from the same Illumina run. I obtained 50-70 million reads per sample, and after mapping, 40-50 million reads align, depending on the sample (an arbitrary example). I would then make bedGraph or bigWig files from the BAM alignments for visualization in UCSC, but these are not "normalized", e.g. for direct sample comparison. How should I scale these files? For example, if one sample has 50 million aligned reads and another has 40 million, should this be taken into account when making the visualization files (BED, WIG, etc.)? I believe it should, but what is the correct procedure, and how is it done?

    TIA
  • Dario1984
    Senior Member
    • Jun 2011
    • 166

    #2
    Using a combination of Bioconductor packages will give you what you need. You can make raw counts in gene regions, then use the function calcNormFactors in edgeR to work out the compositional bias of each sample. To get raw coverage, you can use the coverage function from GenomicRanges, then multiply each sample's coverage by the scale factor you calculated with calcNormFactors. To export the coverage to a BedGraph or BigWig file, the export function from rtracklayer can be used.
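For anyone who wants to see the scaling step outside R, here is a minimal Python sketch. It assumes one common convention rather than anything stated above: coverage is expressed per million reads of the effective library size, which edgeR defines as lib.size times norm.factors. The function name `scale_bedgraph_lines`, the `per` parameter, and the example intervals are all made up for illustration.

```python
def scale_bedgraph_lines(lines, lib_size, norm_factor, per=1_000_000):
    """Rescale bedGraph coverage to reads per `per` effective library reads."""
    # Assumption: effective library size = lib.size * norm.factors (edgeR convention).
    scale = per / (lib_size * norm_factor)
    scaled = []
    for line in lines:
        # Leave blank lines and track/browser/comment header lines untouched.
        if not line.strip() or line.startswith(("track", "browser", "#")):
            scaled.append(line)
            continue
        chrom, start, end, value = line.split()[:4]
        scaled.append(f"{chrom}\t{start}\t{end}\t{float(value) * scale:.4f}")
    return scaled


# Hypothetical intervals from a sample with 40 million aligned reads.
example = ["chr1\t0\t100\t20", "chr1\t100\t250\t5"]
scaled = scale_bedgraph_lines(example, lib_size=40_000_000, norm_factor=1.0)
for row in scaled:
    print(row)
```

With a normalisation factor of 1.0 this reduces to plain reads-per-million scaling; a factor other than 1.0 shifts the whole track up or down by the same proportion.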


    • puggie
      Member
      • Nov 2011
      • 52

      #3
      Thanks for your reply,

      I have now done the analysis in R after calculating raw counts against Ensembl genes. For three of the samples (A, B, C) I get the following from calcNormFactors:

      sample  lib.size  norm.scaling
      A       5604846   0.9273452
      B       4433633   1.0454615
      C       6520510   1.0314556

      Intuitively, I would have thought that C should have been scaled down somewhat relative to B, given their library sizes.

      The raw counts were calculated over the exons and introns of genes, excluding intervals with <50 counts.
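A quick arithmetic check on the three rows above may help. In edgeR the normalisation factors are applied on top of the library sizes (effective library size = lib.size times norm.factors) and are scaled so that they multiply to 1, so a factor above 1 for C does not mean C escapes library-size scaling. The sketch below only redoes that arithmetic in Python; the per-million scale is an assumed convention, not something calcNormFactors reports.

```python
# (lib.size, normalisation factor) for samples A, B, C as posted above.
samples = {
    "A": (5604846, 0.9273452),
    "B": (4433633, 1.0454615),
    "C": (6520510, 1.0314556),
}

# Effective library size combines sequencing depth with the composition factor.
effective = {name: lib * nf for name, (lib, nf) in samples.items()}

# Assumed convention: multiply coverage by this to get "per million effective reads".
per_million_scale = {name: 1_000_000 / eff for name, eff in effective.items()}

# The factors are relative to each other, so their product should be close to 1.
factor_product = 1.0
for _, nf in samples.values():
    factor_product *= nf

for name in samples:
    print(name, round(effective[name]), f"{per_million_scale[name]:.4f}")
print("product of normalisation factors:", round(factor_product, 4))
```

C still ends up with the largest effective library size and therefore the smallest coverage multiplier, which matches the intuition in the question.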


      • Dario1984
        Senior Member
        • Jun 2011
        • 166

        #4
        Make an MA plot before and after normalisation. The function is maPlot in edgeR. You will see that the data points will be centred around M = 0 after normalisation. This is based on the biological assumption that, between conditions, the majority of genes don't change in expression.
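To make the M and A axes concrete, here is a small Python sketch that computes them for two samples. It is not edgeR's maPlot (which also handles zero counts); the function `ma_values` and the counts are hypothetical, with sample y sequenced at exactly twice the depth of sample x.

```python
import math
from statistics import median


def ma_values(x, y, lib_x=1, lib_y=1):
    """Return (M, A) per gene; genes with a zero count in either sample are skipped."""
    points = []
    for cx, cy in zip(x, y):
        if cx == 0 or cy == 0:
            continue  # log2 of zero is undefined
        lx = math.log2(cx / lib_x)
        ly = math.log2(cy / lib_y)
        # M is the log ratio, A is the average log abundance.
        points.append((lx - ly, (lx + ly) / 2))
    return points


# Hypothetical counts: identical expression, but y has twice the depth.
x = [10, 100, 1000, 50]
y = [20, 200, 2000, 100]

raw_m = median(m for m, _ in ma_values(x, y))
norm_m = median(m for m, _ in ma_values(x, y, lib_x=sum(x), lib_y=sum(y)))
print("median M before:", raw_m)
print("median M after: ", norm_m)
```

Before scaling, every point sits near M = -1 purely because of depth; after dividing by library size the cloud is centred near M = 0, which is the pattern to look for in the real plot.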


        • puggie
          Member
          • Nov 2011
          • 52

          #5
          Okay I will try this.

          Regarding the raw counts table, what would be the best procedure for selecting regions for normalization? Let's say I have an Ensembl annotation file of 30,000 regions in total (isoforms merged etc.). When I do the raw counting I get something like <10,000 regions per sample with reads, each of which may contain up to several thousand reads. I also see a pattern between the samples; e.g. picking lines at random, I could get something like this for 4 samples:

          0 0 2 0
          0 15 0 0
          96 143 71 132
          1 0 5 0
          850 1201 1171 907
          1 0 0 1

          Hence the same genes seem to be active, which makes sense as the samples are from the same tissue type.

          What would be the best way to build this table, e.g. taking into account in edgeR only those intervals (genes) that 1. are expressed and 2. do not deviate in expression by more than a preset factor?

          Or is there some recommended "general gene list" that is considered stable, like we know from the qPCR days?


          • Dario1984
            Senior Member
            • Jun 2011
            • 166

            #6
            It's a good idea to get rid of lowly expressed genes before calculating the normalisation factors. There is no safe gene list. I use all of them and don't filter on variance.
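As a rough illustration of dropping lowly expressed genes before computing normalisation factors, the Python sketch below filters the six example rows posted in #5. The rule used (at least 10 reads in at least 2 samples) and the name `keep_expressed` are arbitrary assumptions for the example, not a recommended cutoff.

```python
# The six example rows from post #5 (one row per gene, one column per sample).
counts = [
    [0, 0, 2, 0],
    [0, 15, 0, 0],
    [96, 143, 71, 132],
    [1, 0, 5, 0],
    [850, 1201, 1171, 907],
    [1, 0, 0, 1],
]


def keep_expressed(rows, min_count=10, min_samples=2):
    """Keep rows with at least `min_count` reads in at least `min_samples` samples."""
    return [row for row in rows if sum(c >= min_count for c in row) >= min_samples]


kept = keep_expressed(counts)
for row in kept:
    print(row)
```

Only the two consistently expressed rows survive; the row with a single count of 15 in one sample is removed along with the near-zero rows.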


            • Richard Finney
              Senior Member
              • Feb 2009
              • 701

              #7
              "It's a good idea to get rid of lowly expressed genes"

              Why is that?


              • Dario1984
                Senior Member
                • Jun 2011
                • 166

                #8
                The estimates for fold change aren't stable for those genes. A couple of extra reads here or there could change the fold change calculation drastically for a lowly expressed gene. Also, a rough rule is that about ten percent of genes are being reproducibly expressed in a cell at any one time, so unstable fold changes from spurious, low transcription would contribute the most to the calculation.
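The instability is easy to see with a little arithmetic. In this Python sketch two extra reads are added to a gene with 1 read and to a gene with 1000 reads; the pseudocount of 0.5 and the function name `log2_fold_change` are assumptions made only to keep the example simple.

```python
import math


def log2_fold_change(a, b, pseudocount=0.5):
    """log2 ratio of b over a, with a small pseudocount to avoid dividing by zero."""
    return math.log2((b + pseudocount) / (a + pseudocount))


# Two extra reads on a lowly expressed gene versus a highly expressed gene.
low_after = log2_fold_change(1, 3)
high_after = log2_fold_change(1000, 1002)

print(f"low gene  (1 -> 3 reads):       {low_after:.3f}")
print(f"high gene (1000 -> 1002 reads): {high_after:.3f}")
```

The same two reads move the lowly expressed gene by more than one log2 unit, while the highly expressed gene barely changes.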


                • sisterdot
                  Junior Member
                  • Apr 2013
                  • 6

                  #9
                  Two options that have not been tested:

                  1) genomeCoverageBed has a -scale option (the factor could come from, e.g., DESeq's estimateSizeFactors), although I guess Dario1984's suggestion might be easier: "To get raw coverage, you can use the coverage function from GenomicRanges, then multiply each sample's coverage by the scale factor you calculated with calcNormFactors. To export the coverage to a BedGraph or BigWig file, the export function from rtracklayer can be used."

                  2) Using normalize_bigwig.py (RSeQC package)
                  Last edited by sisterdot; 04-09-2013, 03:44 AM.
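For option 1, the number passed to -scale has to come from somewhere. The Python sketch below follows the median-of-ratios idea behind DESeq's estimateSizeFactors, as an untested illustration rather than a reimplementation of the package. The function name `size_factors`, the count table, and the choice to pass 1 / size factor to -scale are all assumptions.

```python
import math
from statistics import median


def size_factors(counts):
    """counts: list of genes, each a list of per-sample read counts."""
    n_samples = len(counts[0])
    # Only genes with non-zero counts in every sample have a defined log geometric mean.
    usable = [row for row in counts if all(c > 0 for c in row)]
    log_geo_means = [sum(math.log(c) for c in row) / n_samples for row in usable]
    factors = []
    for j in range(n_samples):
        # Ratio of each gene's count to its geometric mean across samples, on the log scale.
        log_ratios = [math.log(row[j]) - g for row, g in zip(usable, log_geo_means)]
        factors.append(math.exp(median(log_ratios)))
    return factors


# Hypothetical table: sample 2 has twice the depth of sample 1.
table = [[100, 200], [50, 100], [10, 20], [0, 5]]
sf = size_factors(table)
scales = [1 / f for f in sf]  # assumed value to hand to -scale for each sample
print([round(f, 4) for f in sf])
print([round(s, 4) for s in scales])
```

Counts are divided by the size factor in DESeq, so the reciprocal is what would multiply the coverage; check the result on a known region before trusting the tracks.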
