
High duplicates in mRNA-seq data


  • High duplicates in mRNA-seq data

    I extracted total RNA from drug and vehicle treated primary neurons (mouse) and used Kapa Stranded mRNA-Seq kit to generate libraries.

    Goal is differential expression analysis - primarily looking at roughly 60 neuronal genes and also a more general effect of our drugs on transcriptional output of neuronal genes.

    Input RNA: 1.5 ug; PCR cycles: 8; RIN was always over 8 with a good electropherogram trace.

    Sequencing info: Illumina HiSeq2100 - 5 libraries multiplexed into 1 lane.

    So the problem: a duplication rate of 55-60% for all libraries, very consistent across the board. According to QC data from the sequencing core, the highest numbers of duplicates come from poly-A and poly-T tracts.

    I could really use some advice here. Is this rate of duplication a problem for a DE experiment such as this? What rate of duplication would be more acceptable?

    Thanks so much for any input, I'm really worried that my whole PhD project is toast...

  • #2
    This sounds fairly typical; one expects a high level of apparent duplication in RNA-seq. Note that I wrote "apparent duplications", since these are likely not real PCR or optical duplicates. A bias toward the 3' end is also not that uncommon, at least if you did any polyA enrichment (I'm not familiar with the Kapa kit).

    BTW, it's a bit premature to worry that your PhD is toast after one experiment (hint, most experiments don't work).
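To illustrate the "apparent duplications" point above, here is a rough sketch of how many reads would share a start position purely by chance. It assumes read starts are drawn uniformly at random from a fixed number of possible positions on a transcript, which is a simplification (real coverage is not uniform). The function name and the example numbers are hypothetical and do not come from the thread.

```python
def expected_duplicate_fraction(n_reads: int, n_positions: int) -> float:
    """Expected fraction of reads that share a start position with an
    earlier read, assuming n_reads starts are drawn uniformly and
    independently from n_positions possible sites (a simplification)."""
    if n_reads <= 0:
        return 0.0
    # Expected number of distinct start positions that get hit at least once.
    expected_distinct = n_positions * (1.0 - (1.0 - 1.0 / n_positions) ** n_reads)
    # Every read beyond the first at a given position looks like a duplicate.
    return 1.0 - expected_distinct / n_reads


if __name__ == "__main__":
    # Hypothetical transcript offering 2,000 possible start positions.
    for n_reads in (100, 2_000, 20_000):
        frac = expected_duplicate_fraction(n_reads, 2_000)
        print(f"{n_reads} reads: {frac:.1%} apparent duplicates")
```

Under this toy model a lowly expressed transcript shows almost no positional duplicates, while a highly expressed one is dominated by them even with no PCR duplication at all.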



    • #3
      Originally posted by dpryan
      Note that I wrote "apparent duplications", since these are likely not real PCR or optical duplicates
      Slightly off-topic... I've been wondering why Illumina or any other company didn't commercialize a library prep kit where each read gets its own random barcode. In principle it shouldn't be that difficult to generate adapters with a random kmer long enough to distinguish millions of reads. Not saying that it's going to be easy in practice but this issue of what to do with positional duplicates recurs so often and it seems to me that any work around it is not ideal.
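As a back-of-the-envelope check on "a random kmer long enough", the sketch below computes the shortest barcode length k for which 4^k is at least the number of molecules, plus a birthday-problem approximation of the chance that any two molecules draw the same barcode. Both function names are hypothetical, and the model assumes barcodes are drawn uniformly at random. In practice a barcode probably only needs to separate molecules that share the same mapping position, so much shorter barcodes may be enough.

```python
import math


def min_barcode_length(n_molecules: int) -> int:
    """Smallest k such that 4**k >= n_molecules (integer arithmetic only)."""
    k = 0
    while 4 ** k < n_molecules:
        k += 1
    return k


def collision_probability(n_molecules: int, k: int) -> float:
    """Birthday-problem approximation of the probability that at least two
    of n_molecules share a barcode when each draws one of 4**k barcodes
    uniformly at random."""
    n_barcodes = 4 ** k
    pairs = n_molecules * (n_molecules - 1) / 2
    # 1 - exp(-pairs / n_barcodes), computed accurately for small values.
    return -math.expm1(-pairs / n_barcodes)


if __name__ == "__main__":
    n = 1_000_000  # hypothetical number of molecules to tell apart
    k = min_barcode_length(n)
    print(k, collision_probability(n, k))
```

Having 4^k at least as large as the number of molecules does not make collisions unlikely; under this approximation the barcode has to be considerably longer before two molecules rarely share one.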



      • #4
        In a sense that's what 10x is doing, but for whole genome sequencing, so presumably it's possible.



        • #5
          Originally posted by dariober
          Slightly off-topic... I've been wondering why Illumina or any other company didn't commercialize a library prep kit where each read gets its own random barcode. In principle it shouldn't be that difficult to generate adapters with a random kmer long enough to distinguish millions of reads. Not saying that it's going to be easy in practice but this issue of what to do with positional duplicates recurs so often and it seems to me that any work around it is not ideal.
          At least one kit has implemented molecular tagging, but I can think of a few reasons why this approach has not been adopted more widely:
          1- With the majority of current kits, the adapter ends that ligate to the insert are double stranded, so using random sequences would result in less complementary ends and lower ligation efficiency.
          2- It seems a logical approach at first glance, but its practical value is questionable. For more info look at these: http://journals.plos.org/plosone/art...l.pone.0119123 and http://www.pnas.org/content/109/21/E1330.full



          • #6
            % of duplicates per gene

            One thing I've looked at is the % of duplicates per gene. If you have a high number of duplicates in only a few genes you should be fine, but if you have low-expression genes with high duplication then you should look into this a bit more closely, as you might have PCR amplification biases. All of this depends on whether the data are paired-end (PE) and on coverage, but calculating the % of duplicates per gene (as opposed to the library total) should help show whether you have a problem or not.
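A minimal sketch of the per-gene calculation described above. It assumes you already have, for each aligned read, the gene it was assigned to and whether a duplicate-marking step flagged it; how you obtain those records is up to your pipeline. The function name, the record layout, and the example data are hypothetical.

```python
from collections import defaultdict


def duplicate_percent_per_gene(records):
    """records: iterable of (gene_id, is_duplicate) pairs, one per read.
    Returns {gene_id: (total_reads, percent_duplicates)}."""
    total = defaultdict(int)
    dups = defaultdict(int)
    for gene, is_dup in records:
        total[gene] += 1
        if is_dup:
            dups[gene] += 1
    return {g: (total[g], 100.0 * dups[g] / total[g]) for g in total}


if __name__ == "__main__":
    # Made-up records for illustration only.
    example = [
        ("geneA", True), ("geneA", True), ("geneA", False), ("geneA", True),
        ("geneB", False), ("geneB", False),
    ]
    for gene, (n, pct) in sorted(duplicate_percent_per_gene(example).items()):
        print(f"{gene}\treads={n}\tdup%={pct:.1f}")
```

Returning the read count next to the percentage makes it easier to spot the worrying case from the post: genes with few reads but a high duplicate percentage.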

            Check this out:
            http://www.nature.com/articles/srep25533

            By the way, here they use the "random" barcode method mentioned above (better known as a UMI, or unique molecular identifier).
            Last edited by aleferna; 11-28-2016, 02:05 AM.
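To make the UMI idea concrete, here is a simplified sketch contrasting positional deduplication with UMI-aware deduplication. It treats two reads as the same original molecule only if they share chromosome, start, strand, and an identical UMI. The function name and read tuples are hypothetical, and the sketch ignores sequencing errors in the UMI, which real deduplication tools typically have to handle.

```python
def dedup_counts(reads):
    """reads: iterable of (chrom, start, strand, umi) tuples.
    Returns read counts before and after two deduplication strategies."""
    reads = list(reads)
    # Positional dedup: keep one read per (chrom, start, strand).
    unique_positions = {(chrom, start, strand) for chrom, start, strand, _ in reads}
    # UMI-aware dedup: reads at the same position with different UMIs
    # are kept as distinct molecules.
    unique_molecules = set(reads)
    return {
        "total": len(reads),
        "unique_positions": len(unique_positions),
        "unique_molecules": len(unique_molecules),
    }


if __name__ == "__main__":
    # Made-up reads for illustration only.
    example = [
        ("chr1", 100, "+", "ACGT"),
        ("chr1", 100, "+", "ACGT"),  # same position and UMI: likely PCR copy
        ("chr1", 100, "+", "TTGA"),  # same position, different UMI: kept
        ("chr1", 250, "-", "ACGT"),
    ]
    print(dedup_counts(example))
```

The gap between `unique_positions` and `unique_molecules` is the number of reads that positional deduplication would have discarded even though they came from distinct molecules.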

