
  • 26% duplicates

    Hi, is 26% duplicates an extraordinarily high number for single-end SureSelect targeted SOLiD reads?
    Also, I presume the duplicates are counted among the mapped reads as well?

    I used Picard's MarkDuplicates to arrive at the rmdup BAM.

    ## METRICS CLASS net.sf.picard.sam.DuplicationMetrics
    LIBRARY                       Unknown Library
    UNPAIRED_READS_EXAMINED       61303170
    READ_PAIRS_EXAMINED           0
    UNMAPPED_READS                40652492
    UNPAIRED_READ_DUPLICATES      26844757
    READ_PAIR_DUPLICATES          0
    READ_PAIR_OPTICAL_DUPLICATES  0
    PERCENT_DUPLICATION           0.437902
    ESTIMATED_LIBRARY_SIZE

    101955662 in total
    0 QC failure
    26844757 duplicates
    61303170 mapped (60.13%)
    0 paired in sequencing
    0 read1
    0 read2
    0 properly paired (nan%)
    0 with itself and mate mapped
    0 singletons (nan%)
    0 with mate mapped to a different chr
    0 with mate mapped to a different chr (mapQ>=5)
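
    As a sanity check, the "26%" in the thread title and Picard's 0.437902 are consistent with each other; they just use different denominators. A few lines of Python reproduce both from the counts above:

```python
# Reproduce both duplicate percentages from the flagstat/Picard counts above.
total_reads = 101_955_662   # "in total"
mapped      =  61_303_170   # "mapped (60.13%)"
duplicates  =  26_844_757   # "duplicates"

# Fraction of ALL reads flagged as duplicates -> the "26%" in the title
print(f"{duplicates / total_reads:.1%}")   # 26.3%

# Picard's PERCENT_DUPLICATION divides only by the reads it examined
# (the mapped, unpaired reads), since unmapped reads cannot be duplicates
print(f"{duplicates / mapped:.6f}")        # 0.437902
```

    So yes, the duplicates are entirely among the mapped reads.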
    Last edited by KevinLam; 08-16-2010, 09:36 PM.
    http://kevin-gattaca.blogspot.com/

  • #2
    I wouldn't say it's extraordinary, although it is quite high; I've certainly seen higher. It also depends on the sample: if your coverage is very high relative to the amount of distinct DNA represented in the sample, you will get many duplicates (you start sequencing the same things over and over again).



    • #3
      Hi Kopi-o,
      From my understanding, the PCR duplicates are marked by exact sequence and by the physical proximity of the beads, based on the platform-specific read names.

      I can understand if it is just extra coverage due to randomness, but I am concerned that I might need to optimise the emulsion PCR step.
      Or should I forget about removing duplicates altogether (since it is not actually marking PCR duplicates specifically, just duplicates)?



      • #4
        I really wouldn't dare to suggest a specific course of action ... it depends on the application you have (standard answer!). You might want to check how many of the duplicates have the exact same sequence (by using Unix sort, for example) and how many just map to the same locations (with sequence differences). That would at least tell you something.
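
        The suggested check can also be done with a few lines of Python instead of Unix sort. This is just a sketch: `duplicate_breakdown` and the inline example reads are made up for illustration, and real input would be the non-header lines of a SAM file.

```python
from collections import Counter

def duplicate_breakdown(sam_lines):
    """Count reads sharing an exact sequence vs. just a mapping position.

    A rough sketch of the check suggested above, over SAM-format lines.
    """
    seq_counts, pos_counts = Counter(), Counter()
    for line in sam_lines:
        if line.startswith("@"):          # skip header lines
            continue
        f = line.rstrip("\n").split("\t")
        if int(f[1]) & 4:                 # skip unmapped reads (FLAG 0x4)
            continue
        pos_counts[(f[2], f[3])] += 1     # (RNAME, POS)
        seq_counts[f[9]] += 1             # SEQ
    exact = sum(n - 1 for n in seq_counts.values() if n > 1)
    positional = sum(n - 1 for n in pos_counts.values() if n > 1)
    return exact, positional

# Tiny made-up example: three reads at chr1:100, two with identical sequence
reads = [
    "r1\t0\tchr1\t100\t37\t5M\t*\t0\t0\tACGTA\tIIIII",
    "r2\t0\tchr1\t100\t37\t5M\t*\t0\t0\tACGTA\tIIIII",
    "r3\t0\tchr1\t100\t37\t5M\t*\t0\t0\tACGTT\tIIIII",
]
print(duplicate_breakdown(reads))  # (1, 2): 1 exact-sequence, 2 same-position
```

        A large gap between the two counts would suggest genuine library saturation rather than pure PCR duplication.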



        • #5
          Probably not ridiculous if this is only a 50 bp SE frag run, which after you remove duplicates means you can get at most 50x coverage. If you apply the birthday problem to this type of probability situation, to infer the chance that a mapped read encompassing a given base is unique, you will find it gets extremely discouraging after you achieve 20x unique coverage. Unfortunately, this is a situation where PE runs make a huge difference to the number/percentage of duplicates.
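
          The saturation argument above can be sketched with a toy Poisson model (an assumption, not an exact birthday computation): with 50 bp single-end reads and uniform start sites, each start site yields only one distinct read, so any extra read at an already-hit site is a duplicate.

```python
import math

# Toy model: reads land on start sites as Poisson(lam), where lam is the
# mean number of reads per possible start site. Only the first read at a
# site is unique; everything else counts as a duplicate.
READ_LEN = 50  # 50 bp SE reads, as in the run discussed above

def dup_fraction(lam):
    """Expected duplicate fraction: 1 - (distinct sites hit) / (total reads)."""
    return 1 - (1 - math.exp(-lam)) / lam

for lam in (0.25, 0.5, 1.0, 2.0):
    raw    = lam * READ_LEN                      # raw coverage
    unique = (1 - math.exp(-lam)) * READ_LEN     # unique coverage after rmdup
    print(f"raw {raw:5.1f}x -> unique {unique:4.1f}x, "
          f"duplicates {dup_fraction(lam):.0%}")
```

          Under these assumptions, reaching ~20x unique coverage already implies roughly a fifth of reads being duplicates, and it worsens steeply from there, which matches the "discouraging after 20x" observation.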
