  • 26% duplicates

    Hi, is 26% duplicates an extraordinarily high number for single-end, SureSelect-targeted SOLiD reads?
    Also, I presume the duplicates are counted among the mapped reads as well?

    I used Picard's MarkDuplicates to produce the rmdup BAM.

    ## METRICS CLASS net.sf.picard.sam.DuplicationMetrics
    LIBRARY                       Unknown Library
    UNPAIRED_READS_EXAMINED       61303170
    READ_PAIRS_EXAMINED           0
    UNMAPPED_READS                40652492
    UNPAIRED_READ_DUPLICATES      26844757
    READ_PAIR_DUPLICATES          0
    READ_PAIR_OPTICAL_DUPLICATES  0
    PERCENT_DUPLICATION           0.437902
    ESTIMATED_LIBRARY_SIZE        (no value in the output)

    101955662 in total
    0 QC failure
    26844757 duplicates
    61303170 mapped (60.13%)
    0 paired in sequencing
    0 read1
    0 read2
    0 properly paired (nan%)
    0 with itself and mate mapped
    0 singletons (nan%)
    0 with mate mapped to a different chr
    0 with mate mapped to a different chr (mapQ>=5)
    Last edited by KevinLam; 08-16-2010, 09:36 PM.
    http://kevin-gattaca.blogspot.com/
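
A quick arithmetic check on the numbers posted above, as a sketch rather than anything tool-specific: the "26%" is duplicates divided by all reads (mapped plus unmapped), while dividing by mapped reads only gives roughly 44%, which is the figure Picard reports as PERCENT_DUPLICATION. The variable and function names below are made up for illustration; the counts are copied from the post.

```python
# Counts copied from the Picard metrics and flagstat output in the post above.
total_reads = 101_955_662      # flagstat "in total"
mapped_reads = 61_303_170      # flagstat "mapped" / UNPAIRED_READS_EXAMINED
duplicate_reads = 26_844_757   # flagstat "duplicates" / UNPAIRED_READ_DUPLICATES


def fraction(part: int, whole: int) -> float:
    """Return part / whole, or NaN when the denominator is zero."""
    return part / whole if whole else float("nan")


# Duplicates as a share of every read in the file (mapped + unmapped).
dup_of_total = fraction(duplicate_reads, total_reads)

# Duplicates as a share of mapped reads only.
dup_of_mapped = fraction(duplicate_reads, mapped_reads)

print(f"duplicates / all reads:    {dup_of_total:.4f}")
print(f"duplicates / mapped reads: {dup_of_mapped:.4f}")
```

Since 101955662 - 61303170 = 40652492 matches UNMAPPED_READS, the duplicates here all sit among the mapped reads, so the mapped-only ratio is probably the more informative one.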

  • #2
    I wouldn't say it's extraordinary, although it is quite high. I've certainly seen higher. It depends a bit on the sample too: if your coverage is very high relative to the amount of DNA represented in the sample, you will get many duplicates (you start to sequence the same things over and over again).



    • #3
      Hi Kopi-o,
      From my understanding, PCR duplicates are marked by exact sequence and by physical proximity of the beads, based on the platform-specific read names.

      I can understand it if the extra coverage is due to randomness, but I am concerned that I may need to optimise the emulsion PCR step.
      Or should I forget about removing duplicates altogether, since the tool is not actually marking PCR duplicates specifically, just duplicates?
      http://kevin-gattaca.blogspot.com/



      • #4
        I really wouldn't dare to suggest a specific course of action ... it depends on the application you have (standard answer!). You might want to check how many of the duplicates have the exact same sequence (by using Unix sort, for example) and how many just map to the same locations (with sequence differences). That would at least tell you something.
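
The check suggested in #4 could be sketched roughly as follows. This is only an illustration under some assumptions: reads are given as plain SAM text lines, duplicates are grouped by reference name, leftmost POS, and strand (a simplification that ignores clipping), and the function name `classify_duplicates` is hypothetical. It is not how Picard itself decides what to mark.

```python
from collections import Counter, defaultdict


def classify_duplicates(sam_lines):
    """Count duplicates that are exact sequence copies vs. same-position only.

    Returns (exact, position_only), where the two numbers add up to the
    number of reads beyond the first at each (chrom, pos, strand).
    """
    # (chrom, pos, strand) -> Counter mapping read sequence -> occurrences
    groups = defaultdict(Counter)
    for line in sam_lines:
        if line.startswith("@"):          # skip SAM header lines
            continue
        fields = line.rstrip("\n").split("\t")
        flag, chrom, pos, seq = int(fields[1]), fields[2], int(fields[3]), fields[9]
        if flag & 0x4:                    # FLAG bit 0x4: read is unmapped
            continue
        strand = "-" if flag & 0x10 else "+"   # FLAG bit 0x10: reverse strand
        groups[(chrom, pos, strand)][seq] += 1

    exact = 0
    position_only = 0
    for seqs in groups.values():
        n_reads = sum(seqs.values())
        # Copies beyond the first of each distinct sequence are exact duplicates.
        exact_here = sum(count - 1 for count in seqs.values())
        exact += exact_here
        # Remaining duplicates share the position but differ in sequence.
        position_only += (n_reads - 1) - exact_here
    return exact, position_only


# Tiny made-up example: three forward reads at chr1:100 (two identical),
# one reverse read at the same position, one read elsewhere, one unmapped.
demo = [
    "@HD\tVN:1.0",
    "r1\t0\tchr1\t100\t60\t5M\t*\t0\t0\tACGTA\tIIIII",
    "r2\t0\tchr1\t100\t60\t5M\t*\t0\t0\tACGTA\tIIIII",
    "r3\t0\tchr1\t100\t60\t5M\t*\t0\t0\tACGTT\tIIIII",
    "r4\t16\tchr1\t100\t60\t5M\t*\t0\t0\tACGTA\tIIIII",
    "r5\t0\tchr1\t200\t60\t5M\t*\t0\t0\tGGGGG\tIIIII",
    "r6\t4\t*\t0\t0\t*\t*\t0\t0\tACGTA\tIIIII",
]
exact_count, position_only_count = classify_duplicates(demo)
print(exact_count, position_only_count)
```

A large share of position-only duplicates with differing sequences would hint at independent fragments or sequencing errors rather than straightforward copies of one molecule.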



        • #5
          Probably not ridiculous if this is only a 50 bp SE fragment run, which means that after you remove duplicates you can only get 50x coverage at most. If you apply the birthday problem to this situation to infer the chance that a mapped read encompassing a given base is unique, you will find it gets extremely discouraging once you achieve 20x unique coverage. Unfortunately, this is a situation where PE runs make a huge difference to the number/percentage of duplicates.
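
The birthday-problem argument in #5 can be put into numbers with a simple occupancy model. This is a sketch under stated assumptions, not a description of any real library: reads covering a given base are taken to start uniformly at random at one of 50 positions (one strand only, matching the 50x ceiling in the post), and any read sharing a start position with an earlier read counts as a duplicate. All function names are invented for illustration.

```python
import math

READ_LENGTH = 50  # from the post: 50 bp single-end fragment run


def expected_unique_coverage(n_reads: float, starts: int = READ_LENGTH) -> float:
    """Expected number of distinct start positions hit by n_reads reads
    that all cover the same base (classic occupancy formula)."""
    return starts * (1.0 - (1.0 - 1.0 / starts) ** n_reads)


def reads_needed(unique_cov: float, starts: int = READ_LENGTH) -> float:
    """Invert the occupancy formula: raw coverage needed so that the
    expected unique coverage equals unique_cov."""
    if not 0 <= unique_cov < starts:
        raise ValueError("unique coverage must be in [0, starts)")
    return math.log(1.0 - unique_cov / starts) / math.log(1.0 - 1.0 / starts)


def duplicate_fraction(n_reads: float, starts: int = READ_LENGTH) -> float:
    """Expected share of the n_reads reads that are duplicates."""
    if n_reads <= 0:
        return 0.0
    return 1.0 - expected_unique_coverage(n_reads, starts) / n_reads


for target in (10, 20, 30, 40):
    raw = reads_needed(target)
    print(f"{target}x unique needs ~{raw:.0f}x raw coverage, "
          f"~{duplicate_fraction(raw):.0%} duplicates")
```

Under this model, 20x unique coverage needs about 25x raw coverage (roughly a fifth of reads are duplicates), and 40x unique needs about 80x raw (roughly half are duplicates). Paired-end reads help because the second read end multiplies the number of distinguishable fragments per position.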
