
  • Nonuniformity of reads across transcript length

    Hello,
    I have been looking at the alignment of RNA-seq reads (Illumina) from a library that preserved both polyA+ and polyA- transcripts. As expected, a majority (~80%) of the reads appear to come from rRNA (18S, 28S) fragments. When I map these reads to the rRNA sequences (18S is around 1,800 bp, 28S around 5,500 bp), I obtain an extremely uneven distribution of reads. Some relatively large regions have very few reads, while other regions have many. Additionally, even within regions that have large numbers of reads, the read positions are not evenly distributed (or even roughly Poisson-distributed); instead, many reads pile up at a specific base position, which might have 10X the number of alignments of a neighboring position.
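One way to put a number on the "not even roughly Poisson" observation is the variance-to-mean ratio of read-start counts per base, which should be close to 1 if starts were Poisson-distributed. The following is only a minimal sketch: the function name, the 0-based half-open interval, and the toy start positions are assumptions for illustration, not data from this thread.

```python
from collections import Counter


def start_site_dispersion(start_positions, region_start, region_end):
    """Return (mean, variance, variance/mean) of read-start counts per base
    over the half-open interval [region_start, region_end)."""
    counts = Counter(
        p for p in start_positions if region_start <= p < region_end
    )
    length = region_end - region_start
    # One count per base, including bases where no read starts.
    per_base = [counts.get(pos, 0) for pos in range(region_start, region_end)]
    mean = sum(per_base) / length
    variance = sum((c - mean) ** 2 for c in per_base) / length
    ratio = variance / mean if mean > 0 else float("nan")
    return mean, variance, ratio


# Toy data (made up): one read starts at most positions, ten start at position 5.
toy_starts = [0, 1, 2, 3, 4] + [5] * 10 + [6, 7, 8, 9]
mean, variance, ratio = start_site_dispersion(toy_starts, 0, 10)
print(mean, variance, ratio)
```

A ratio well above 1 indicates over-dispersion, i.e. pile-ups at particular start sites. Real start positions would come from the alignment file; how they are extracted is left out here.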

    The overall unevenness I can perhaps understand (degradation?), but the more local drastic peaks and valleys I find more difficult to explain. Some possibilities appear to be sequencing bias (GC bias), or differential PCR amplification. Any ideas from users with more experience than myself would be greatly appreciated.

    Also, if anyone is aware of any publicly available human sequencing data in which the polyA-minus fraction was retained, which I could look at for comparison, this would be very helpful.

    Thanks for any ideas.

  • #2
    Mapping "deserts" can be due to non-unique k-mers in the genomic sequence, where no read can be aligned unambiguously.
    As for the unevenness of coverage, I would say the cause could be some amplification bias (too many PCR cycles) or some fragmentation/shearing bias (cleavage "hotspots", protected regions). Does that make sense?
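To check whether a desert coincides with non-unique k-mers, one can flag every position of the reference whose k-mer occurs more than once. This is a rough sketch only: it looks at exact matches on the forward strand of a single sequence, ignores reverse complements and mismatches, and the function name and toy sequence are made up for illustration.

```python
from collections import Counter


def non_unique_kmer_mask(reference, k):
    """Return one boolean per k-mer start position: True where that k-mer
    occurs more than once in the reference (exact match, forward strand)."""
    kmers = [reference[i:i + k] for i in range(len(reference) - k + 1)]
    counts = Counter(kmers)
    return [counts[kmer] > 1 for kmer in kmers]


# Toy reference (made up): "ACGT" occurs at positions 0 and 4.
toy_reference = "ACGTACGTTTGCA"
mask = non_unique_kmer_mask(toy_reference, 4)
print(mask)
```

Setting k to the read length gives a crude mappability track that can be laid next to the coverage profile.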

    Also, you may find this paper interesting; it mentions coverage biases (among others).



    • #3
      It seems that some of the non-uniformity, or uneven coverage, comes from library preparation. During cDNA library prep there are a number of factors that can contribute to non-uniformity. Shearing by enzymatic cleavage or sonication tends to cause breaks in some sequences more frequently than in others. This is less of a problem with chemical cleavage (and indeed we observe less extreme non-uniformity when we use chemical cleavage). Any preparation that uses random k-mers to attach an adapter sequence to the library will have some non-uniformity introduced in that step, as not all k-mers have the same melting point (e.g. GCCGCC has a different melting point than ACCTAA, despite both being 6-mers). There are probably other factors that influence the non-uniformity.
      Ambiguous alignments can cause some deserts, but in my experience do not account for all the non-uniformity.
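To illustrate the melting-point difference between the two 6-mers mentioned above, here is a small sketch using the Wallace rule, Tm ≈ 2·(A+T) + 4·(G+C) °C. That rule is a crude approximation meant for short oligos, so the absolute values for hexamers should not be taken literally; only the difference is the point. The function name is an assumption for illustration.

```python
def wallace_tm(oligo):
    """Approximate melting temperature (deg C) by the Wallace rule:
    2 * (number of A/T) + 4 * (number of G/C)."""
    oligo = oligo.upper()
    at = oligo.count("A") + oligo.count("T")
    gc = oligo.count("G") + oligo.count("C")
    if at + gc != len(oligo):
        raise ValueError("oligo must contain only A, C, G, T")
    return 2 * at + 4 * gc


print(wallace_tm("GCCGCC"))  # 24
print(wallace_tm("ACCTAA"))  # 16
```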

      We have observed, however, that the pattern of non-uniformity is very highly conserved for SOLiD data, even across different conditions. This leads us to believe that the non-uniformity does not influence our specific RNA-Seq experimental design. That may not be true of Illumina reads and all experimental designs, however, so take that with a grain of salt.
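One simple way to check how conserved the pattern is between two samples is the Pearson correlation of their per-base coverage over the same transcript. A minimal sketch, with made-up coverage vectors and a hypothetical helper name:

```python
import math


def pearson(x, y):
    """Pearson correlation between two equal-length coverage profiles."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("profiles must have the same length (at least 2)")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if sd_x == 0 or sd_y == 0:
        raise ValueError("a constant profile has no defined correlation")
    return cov / (sd_x * sd_y)


# Toy profiles (made up): the second sample has exactly twice the depth.
coverage_a = [5, 50, 8, 120, 10]
coverage_b = [10, 100, 16, 240, 20]
r = pearson(coverage_a, coverage_b)
print(r)
```

Because the correlation is unaffected by scaling, a difference in sequencing depth between the two samples does not need to be normalized away first.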



      • #4
        Just in case, you may find some non-polyA-selected RNA-seq data in the UCSC Table Browser: assembly=hg18, group=expression, track=CSHL Long RNA-seq.
        The tables whose names end with "CellTotal" are from whole-cell extracts, not just from the cytosol, so I guess they could contain polyA- transcripts. It may be worth a try.



        • #5
          Daniel,

          The sequence-specific bias correction method we've implemented in Cufflinks 0.9.x takes some of these issues into account when estimating abundances. There are some details on the method on the "How It Works" page.

          -Adam



          • #6
            I thank everyone for their helpful ideas, suggestions and references. As for the mapping "deserts" being due to repeats in genomic regions, this is something I have already examined, and it does not appear to be an issue here. The fragmentation bias certainly seems to be a possibility; I am just surprised by how much difference a 1 bp shift seems to make in the number of aligned reads (i.e. the number of reads starting at site x compared to the number starting at site x+1). See the article http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2532726/, which discusses some of these issues for Solexa reads and does not find that the start sequence makes a significant difference in the number of reads. I will try to examine all these possibilities in greater detail.

            Daniel



            • #7
              The paper you mention refers to DNA sequencing. In RNA sequencing there is an additional step in which the single-stranded RNA is reverse transcribed and made into double-stranded cDNA. A substantial sequence-specific bias is introduced at this step, especially when random hexamer priming is used. See nar.oxfordjournals.org/cgi/content/abstract/38/12/e131 for more details. We have since found similar biases in numerous other protocols and will be publishing a paper on our correction method shortly.
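A quick way to look for this kind of bias in one's own data is to tabulate the base composition at each of the first few positions of the reads; without start-site bias, every position should look roughly like the overall composition. The sketch below is only an illustration with made-up reads and a hypothetical function name, and it is not the correction method mentioned above.

```python
from collections import Counter


def start_base_frequencies(reads, n_positions):
    """For each of the first n_positions of the reads, return a dict mapping
    A, C, G, T to the fraction of reads carrying that base at that position."""
    tables = []
    for i in range(n_positions):
        # Only reads long enough to have a base at position i contribute.
        counts = Counter(read[i] for read in reads if len(read) > i)
        total = sum(counts.values())
        tables.append(
            {base: counts.get(base, 0) / total if total else 0.0
             for base in "ACGT"}
        )
    return tables


# Toy reads (made up): G is over-represented at the first two positions.
toy_reads = ["GGACTT", "GCACGA", "GGTCAA", "TGACGT"]
freqs = start_base_frequencies(toy_reads, 2)
print(freqs[0])
print(freqs[1])
```

In practice the reads would be taken in transcript orientation from the alignments, and the first dozen or so positions compared against positions further into the read.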



              • #8
                Yes, I see clearly from this article the bias in RNA-seq (as opposed to DNA-seq) that you are referring to. It appears that the specific bias they find at the start sites of Illumina reads corresponds very closely to at least some of the unevenness we are seeing in our read alignments. I will certainly watch out for your correction method when it comes out.
