  • Nebetbastet
    Junior Member
    • Apr 2016
    • 7

    Optical duplicates Hiseq4000

    Dear all,

    I am working with RNA data sequenced on the HiSeq 4000. I am trying to quantify the number of "optical duplicates" (or "clustering duplicates"). These duplicates appear when reads in nearby wells result from secondary ExAmp seeding from a primary well at sub-optimal loading concentrations.

    I used MarkDuplicates (Picard 2.1.1) and followed this procedure: http://gatkforums.broadinstitute.org...swithmatecigar

    But each time, MarkDuplicates finds "0 optical duplicate clusters"...

    I tested two alignment tools, TopHat and BWA, but each time MarkDuplicates finds no optical duplicates.

    I tried this on 96 samples.

    Do you have any idea why I cannot find any optical duplicates?

    Thank you very much
  • GenoMax
    Senior Member
    • Feb 2008
    • 7142

    #2
    Can you provide some additional information? Is this a PE dataset? What was the PF% for the lanes (I assume these 96 samples came from one flowcell)? What are the alignment % for the aligners you have used?


    • Nebetbastet
      Junior Member
      • Apr 2016
      • 7

      #3
      Thank you GenoMax for your answer.

      - It is a 50bp single-end dataset
      - Bcl2fastq tells me that the "%PF Clusters" is 100% for all the samples
      - Using TopHat, the percentage of mapped reads ranges from 73.3% to 96.4%, with a median of 93.5%.
      - I used BWA on only one sample: 93.3% of reads mapped to the reference genome

      Thank you in advance for your help


      • GenoMax
        Senior Member
        • Feb 2008
        • 7142

        #4
        That seems a bit odd. Based on the training for HiSeq 4000 we were told that the sweet spot for PF is around 70%. Any more (once you get closer to 75%) would indicate that there will be a lot of duplicates.

        When running Picard MarkDuplicates, did you set OPTICAL_DUPLICATE_PIXEL_DISTANCE=2500 as recommended in the link you posted above?

        Perhaps you got lucky (and/or you have a library of excellent quality) and there are no duplicates. Though that seems a bit too good to be true.
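
For anyone following along, here is a minimal sketch of how the MarkDuplicates call discussed above could be assembled, using the KEY=VALUE argument style of Picard 2.x. The file names and the `picard.jar` path are placeholders, not taken from this thread, and the helper function itself is hypothetical.

```python
def markduplicates_command(bam_in, bam_out, metrics,
                           pixel_distance=2500, picard_jar="picard.jar"):
    """Return the argv list for a Picard MarkDuplicates run.

    All paths are placeholders; adjust them to your own setup.
    """
    return [
        "java", "-jar", picard_jar, "MarkDuplicates",
        f"INPUT={bam_in}",
        f"OUTPUT={bam_out}",
        f"METRICS_FILE={metrics}",
        # Picard's default is 100; 2500 is the value suggested for
        # patterned flowcells such as the HiSeq 4000.
        f"OPTICAL_DUPLICATE_PIXEL_DISTANCE={pixel_distance}",
    ]


cmd = markduplicates_command("sample.sorted.bam", "sample.dedup.bam",
                             "sample.metrics.txt")
print(" ".join(cmd))
# To actually run it: subprocess.run(cmd, check=True)
```

Building the command as a list keeps it safe to hand to `subprocess.run` without shell quoting issues.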


        • Nebetbastet
          Junior Member
          • Apr 2016
          • 7

          #5
          Thank you for your answer.

          Yes, I adjusted at 2500 as indicated in the link.

          As you say, it seems a little too good to be true...


          • GenoMax
            Senior Member
            • Feb 2008
            • 7142

            #6
            Have you contacted tech support? It may be worth getting their take on this.

            I am finding 100% PF hard to believe. Are there really 3.2B reads in your dataset? Does the quality look fine?


            • kmcarr
              Senior Member
              • May 2008
              • 1181

              #7
              Originally posted by Nebetbastet:
                  - Bcl2fastq tells me that the "%PF Clusters" is 100% for all the samples

              Originally posted by GenoMax:
                  Have you contacted tech support? It may be worth getting their take on this.

                  I am finding 100% PF hard to believe. Are there really 3.2B reads in your dataset? Does the quality look fine?

              This is just a reporting quirk when you run Bcl2fastq without using the "--with-failed-reads" option. Since it is only converting and demultiplexing PF reads it reports them as 100% PF.

              NOTE: This is true for Bcl2fastq v1.8.4. I have never tested the newer, 2.x versions of Bcl2fastq.
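
To make the arithmetic behind this quirk concrete, here is a small sketch. The cluster counts are invented for illustration and the function is not part of any Illumina tool.

```python
def percent_pf(pf_clusters, total_clusters):
    """Percentage of clusters passing filter."""
    return 100.0 * pf_clusters / total_clusters


# Hypothetical counts, for illustration only.
raw_clusters = 1_000_000   # every cluster the instrument detected
pf_clusters = 710_000      # clusters that passed the chastity filter

# Instrument-level view: PF clusters over all raw clusters.
run_level = percent_pf(pf_clusters, raw_clusters)

# Conversion-level view when failed reads are never written out:
# the denominator only contains PF clusters, so the ratio is always 100.
conversion_level = percent_pf(pf_clusters, pf_clusters)

print(run_level, conversion_level)
```

So a 100% figure from the conversion step says nothing about the real PF rate of the run; that has to come from the instrument-level run metrics.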


              • GenoMax
                Senior Member
                • Feb 2008
                • 7142

                #8
                It would be odd if bcl2fastq v.2 was run with the "--with-failed-reads" option, but that may be a logical explanation for the 100% PF observation.


                • Nebetbastet
                  Junior Member
                  • Apr 2016
                  • 7

                  #9
                  Hi,

                  Sorry for my slow reply. I was investigating the 100% PF figure, and it turns out that number is wrong. The actual %PF is 71%.


                  • GenoMax
                    Senior Member
                    • Feb 2008
                    • 7142

                    #10
                    That sounds more logical. Any update on optical duplicates? I have not been able to replicate the settings recommended on the GATK site for the small number of samples I have tried.

                    See this for an update on how samtools/GATK may handle this in future.


                    • Nebetbastet
                      Junior Member
                      • Apr 2016
                      • 7

                      #11
                      No, no update.
                      Thank you for the link to this discussion!


                      • Nebetbastet
                        Junior Member
                        • Apr 2016
                        • 7

                        #12
                        Hi,
                        I figured out what my problem was. It is quite trivial, but I am posting it in case someone else runs into the same problem...

                        I used single-end data (most of the projects in my team are single-end). I just noticed that MarkDuplicates needs paired-end data. I read the documentation too quickly and simply assumed that MarkDuplicates could detect optical duplicates in both single-end and paired-end data.

                        I just ran it on paired-end data and it detected "optical" duplicates!


                        • GenoMax
                          Senior Member
                          • Feb 2008
                          • 7142

                          #13
                          Where does it say that paired-end reads are required for this procedure (unless I am missing something)?

                          The tutorial you originally linked does say the following:

                              For single end reads, duplicates are considered singly for the read, increasing the likelihood of being identified as a duplicate.


                          • Nebetbastet
                            Junior Member
                            • Apr 2016
                            • 7

                            #14
                            In the command line overview, I can read:

                                Identifies duplicate reads. This tool locates and tags duplicate reads (both PCR and optical/sequencing-driven) in a BAM or SAM file, where duplicate reads are defined as originating from the same original fragment of DNA. Duplicates are identified as read pairs having identical 5' positions (coordinate and strand) for both reads in a mate pair (and optionally, matching unique molecular identifier reads; see BARCODE_TAG option).

                            When I read that, I thought "OK, it is not stated clearly, but it seems to need paired-end data, as there is no mention of single-end reads". And when I used paired-end reads, it worked (i.e., I found optical duplicates).

                            But indeed, the tutorial says single-end reads can be used... In fact, when I used single-end reads, duplicates were found (which means MarkDuplicates can use single-end reads to detect duplicates), but MarkDuplicates was unable to find "optical duplicates" (in any sample of any single-end dataset I used). It's quite confusing :s .

                            I left comments on the tutorial, so maybe I will get some answers.
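
One way to check this observation against the output is to look at the metrics file itself. As far as I know, the Picard duplication metrics report optical duplicates only in a read-pair column (READ_PAIR_OPTICAL_DUPLICATES), with no unpaired counterpart, which would be consistent with single-end runs always showing zero. Below is a minimal sketch of a parser; the column names are my assumption about Picard's DuplicationMetrics output, and the sample content is invented and trimmed to a few columns.

```python
def read_duplication_metrics(text):
    """Parse the metrics table of a Picard-style metrics file into dicts."""
    lines = text.splitlines()
    # The table header follows the '## METRICS CLASS' line.
    start = next(i for i, line in enumerate(lines)
                 if line.startswith("## METRICS CLASS"))
    header = lines[start + 1].split("\t")
    rows = []
    for line in lines[start + 2:]:
        if not line.strip():      # a blank line ends the table
            break
        rows.append(dict(zip(header, line.split("\t"))))
    return rows


# Invented single-end example (values are illustrative, not real results).
columns = ["LIBRARY", "UNPAIRED_READS_EXAMINED", "READ_PAIRS_EXAMINED",
           "UNPAIRED_READ_DUPLICATES", "READ_PAIR_DUPLICATES",
           "READ_PAIR_OPTICAL_DUPLICATES"]
values = ["libA", "1000", "0", "150", "0", "0"]
sample_text = "\n".join([
    "## METRICS CLASS\tpicard.sam.DuplicationMetrics",
    "\t".join(columns),
    "\t".join(values),
    "",
])

rows = read_duplication_metrics(sample_text)
for row in rows:
    print(row["LIBRARY"],
          "unpaired duplicates:", row["UNPAIRED_READ_DUPLICATES"],
          "optical (pairs only):", row["READ_PAIR_OPTICAL_DUPLICATES"])
```

With a real metrics file, pass the file contents (for example `open(path).read()`) instead of `sample_text`.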


                            • GenoMax
                              Senior Member
                              • Feb 2008
                              • 7142

                              #15
                              Both reads would need to start at identical 5' co-ordinates to be certain that they represent an identical fragment so that makes sense as far as optical duplicates go.
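
For reference, a simplified sketch of the second half of the test: once two reads are already known to be duplicates (same 5' positions and strand), they are treated as "optical" if they sit on the same lane and tile within the configured pixel distance. This is my reading of the general approach, not Picard's actual code, and the read names below are made up in the standard Illumina colon-separated layout.

```python
from collections import namedtuple

Loc = namedtuple("Loc", "lane tile x y")


def parse_location(read_name):
    """Extract lane, tile, x, y from 'INSTR:RUN:FLOWCELL:LANE:TILE:X:Y'."""
    fields = read_name.split(":")
    lane, tile, x, y = (int(v) for v in fields[-4:])
    return Loc(lane, tile, x, y)


def is_optical_pair(name_a, name_b, pixel_distance=2500):
    """True if two (already duplicate) reads are close on the same tile."""
    a, b = parse_location(name_a), parse_location(name_b)
    return (a.lane == b.lane and a.tile == b.tile
            and abs(a.x - b.x) <= pixel_distance
            and abs(a.y - b.y) <= pixel_distance)


# Made-up read names for illustration.
r1 = "K00123:45:HABCDEFXX:1:1101:10000:2000"
r2 = "K00123:45:HABCDEFXX:1:1101:11200:2900"   # same tile, 1200 x 900 away
r3 = "K00123:45:HABCDEFXX:1:1102:10000:2000"   # different tile

print(is_optical_pair(r1, r2))                      # within 2500 pixels
print(is_optical_pair(r1, r2, pixel_distance=100))  # too far for the default
print(is_optical_pair(r1, r3))                      # different tile
```

The same pair flips from optical to non-optical when the distance drops from 2500 to 100, which is why the setting matters on patterned flowcells.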
