
  • Nonuniformity of reads across transcript length

    Hello,
    I have been looking at the alignment of RNA-seq reads (Illumina) from a library that preserved both polyA+ and polyA- transcripts. As expected, a majority (~80%) of the reads appear to come from rRNA (18S, 28S) fragments. When I map these reads to the rRNA sequences (18S is around 1800 bp, 28S around 5500 bp), I obtain an extremely uneven distribution of reads. Some relatively large regions have very few reads, while other regions have many. In addition, even within regions with large numbers of reads, the read starts are not evenly distributed (or even roughly Poisson distributed); instead, many reads pile up at a specific base, which may have 10X the number of alignments of a neighboring base.
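One way to put a number on the pileups described above is the variance-to-mean ratio of per-base read-start counts: for Poisson-distributed starts it should be close to 1, and strong pileups push it far above 1. The following is a minimal sketch, assuming the 0-based alignment start positions have already been extracted into a list; the function name and arguments are hypothetical.

```python
from collections import Counter


def start_site_dispersion(start_positions, region_start, region_end):
    """Variance-to-mean ratio of per-base read-start counts.

    Considers positions in the half-open interval [region_start, region_end).
    A value near 1 is what Poisson-distributed starts would give; values
    much larger than 1 indicate pileups at individual bases.
    """
    counts = Counter(p for p in start_positions if region_start <= p < region_end)
    n_sites = region_end - region_start
    per_site = [counts.get(pos, 0) for pos in range(region_start, region_end)]
    mean = sum(per_site) / n_sites
    if mean == 0:
        return 0.0  # no reads start in this region
    variance = sum((c - mean) ** 2 for c in per_site) / n_sites
    return variance / mean


# Toy data: 40 reads all starting at the same base of a 4 bp window.
print(start_site_dispersion([0] * 40, 0, 4))
```

Computing this in sliding windows along the 18S and 28S references would separate the broad low-coverage regions (low mean) from the sharp single-base peaks (high ratio).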

    The overall unevenness I can perhaps understand (degradation?), but the more local drastic peaks and valleys I find more difficult to explain. Some possibilities appear to be sequencing bias (GC bias), or differential PCR amplification. Any ideas from users with more experience than myself would be greatly appreciated.

    Also, if anyone is aware of any human sequencing data (publicly available) where the PolyA Minus fraction has been maintained - which I can look at for comparison - this would be very helpful.

    Thanks for any ideas.

  • #2
    Mapping "deserts" can be due to non-unique kmers in the genomic sequence, where no read can be unambiguously aligned.
    As for unevenness of coverage, I would say the cause could be amplification bias (too many PCR cycles) or fragmentation/shearing bias (cleavage "hotspots", protected regions). Does that make sense?
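The non-unique k-mer explanation for mapping "deserts" can be checked directly against the reference. The sketch below is a simplification that assumes exact matches on the forward strand only and a fixed read length; the function name is hypothetical. It lists every start position whose read-length substring occurs more than once in the reference, so a read starting there could not be placed unambiguously.

```python
from collections import Counter


def unmappable_starts(reference, read_length):
    """Return 0-based start positions whose read_length-mer is not unique.

    Only exact, forward-strand matches are considered (an assumption of
    this sketch; a real aligner also allows mismatches and both strands).
    """
    kmers = [
        reference[i:i + read_length]
        for i in range(len(reference) - read_length + 1)
    ]
    occurrences = Counter(kmers)
    return [i for i, kmer in enumerate(kmers) if occurrences[kmer] > 1]


# Toy reference in which "ACG" occurs twice.
print(unmappable_starts("ACGTTTACGA", 3))
```

If the low-coverage regions do not overlap the positions returned here, ambiguous alignment is unlikely to be the cause.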

    Also, you may find this paper interesting; it mentions coverage biases (among others).


    • #3
      It seems that some of the non-uniformity or uneven coverage comes from library preparation. During cDNA library prep, a number of factors can contribute to non-uniformity. Shearing by enzymatic cleavage or sonication tends to break some sequences more frequently than others. This is less of a problem with chemical cleavage (and indeed we observe less extreme non-uniformity when we use chemical cleavage). Any preparation that uses random k-mers to attach an adapter sequence to the library will introduce some non-uniformity at that step, as not all k-mers have the same melting point (e.g. GCCGCC has a different melting point than ACCTAA, despite both being 6-mers). There are probably other factors that influence the non-uniformity.
      Ambiguous alignments can cause some deserts, but in my experience do not account for all the non-uniformity.
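The melting-point difference between the two hexamers mentioned above can be illustrated with the Wallace rule, a rule of thumb for short oligos that assigns 2 °C per A/T and 4 °C per G/C. This is only a rough sketch: the rule ignores salt concentration, nearest-neighbor effects, and reaction conditions, so the absolute values should not be read as real annealing temperatures. The function name is hypothetical.

```python
def wallace_tm(oligo):
    """Rough melting temperature (deg C) via the Wallace rule: 2*(A+T) + 4*(G+C)."""
    oligo = oligo.upper()
    at = oligo.count("A") + oligo.count("T")
    gc = oligo.count("G") + oligo.count("C")
    return 2 * at + 4 * gc


for hexamer in ("GCCGCC", "ACCTAA"):
    print(hexamer, wallace_tm(hexamer))
```

Even under this crude estimate the GC-only hexamer comes out several degrees higher than the AT-rich one, which is consistent with the point that random k-mers do not all anneal equally well.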

      We have observed, however, that the pattern of non-uniformity is very highly conserved for SOLiD data, even across different conditions. This leads us to believe that the non-uniformity does not influence our specific RNA-Seq experimental design. That may not be true of Illumina reads and all experimental designs, however, so take that with a grain of salt.


      • #4
        Just in case: you may find some non-polyA-selected RNA-seq data in the UCSC Table Browser (assembly=hg18, group=expression, track=CSHL Long RNA-seq).
        The tables whose names end with "CellTotal" are from whole-cell extracts, not just from the cytosol, so I guess they could contain polyA- transcripts. It may be worth a try.


        • #5
          Daniel,

          The sequence-specific bias correction method we've implemented in Cufflinks 0.9.x takes some of these issues into account when estimating abundances. There are some details on the method on the "How It Works" page.

          -Adam


          • #6
            I thank everyone for their helpful ideas/suggestions/references. As for the mapping "deserts" being due to repeats in genomic regions, this is something I have already examined, and it does not appear to be an issue here. Fragmentation bias certainly seems to be a possibility; I am just surprised by the magnitude of the difference that a 1 bp shift (i.e. the number of reads starting at site x compared to the number starting at site x+1) seems to make in the number of aligned reads. See the article http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2532726/, which discusses some of these issues for Solexa reads and does not find that the start sequence makes a significant difference in reads. I will try to examine all these possibilities in greater detail.

            Daniel


            • #7
              The paper you mention refers to DNA sequencing. In RNA sequencing there is an additional step in which the single-stranded RNA is reverse transcribed and made into double-stranded cDNA. A substantial sequence-specific bias is introduced at this step, especially when random hexamer priming is used. See nar.oxfordjournals.org/cgi/content/abstract/38/12/e131 for more details. We have since found similar biases in numerous other protocols and will be publishing a paper on our correction method shortly.
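A simple way to look for this kind of priming bias in one's own data is to tabulate nucleotide frequencies at each of the first few positions of the reads and compare them with the composition further into the read. The sketch below is only a diagnostic under stated assumptions, not the correction method referred to above: the function name is hypothetical, the reads are assumed to be plain sequence strings in sequencing orientation, and the default of 13 positions is an arbitrary choice.

```python
from collections import Counter


def start_base_frequencies(reads, n_positions=13):
    """Per-position A/C/G/T frequencies over the first n_positions of each read.

    Returns a list of dicts, one per position. With unbiased priming the
    frequencies should look roughly the same at every position.
    """
    counts = [Counter() for _ in range(n_positions)]
    for read in reads:
        for i, base in enumerate(read[:n_positions].upper()):
            counts[i][base] += 1
    freqs = []
    for position_counts in counts:
        total = sum(position_counts.values())
        freqs.append(
            {b: (position_counts[b] / total if total else 0.0) for b in "ACGT"}
        )
    return freqs


# Toy reads: position 0 is A-rich, position 1 is always C.
toy_reads = ["ACGT", "ACGA", "TCGA", "ACTT"]
for pos, f in enumerate(start_base_frequencies(toy_reads, n_positions=2)):
    print(pos, f)
```

A strong position dependence in the first bases, fading further into the read, would point toward priming bias rather than fragmentation or PCR effects.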


              • #8
                Yes, I see clearly from this article the bias in RNA-seq (as opposed to DNA-seq) that you are referring to. The specific bias they find at the start sites of Illumina reads appears to correspond very closely to at least some of the unevenness we are seeing in our read alignments. I will certainly watch out for your correction method when it comes out.
