**GenoMax** · 04-01-2016, 06:24 AM

Can you provide some additional information? Is this a PE dataset? What was the PF% for the lanes (I assume these 96 samples came from one flowcell)? What is the alignment % for each of the aligners you have used?

**Nebetbastet** · 04-04-2016, 12:02 AM

Thank you GenoMax for your answer.

- It is a 50bp single-end dataset
- Bcl2fastq tells me that the "%PF Clusters" is 100% for all the samples
- Using Tophat, the percentage of mapped reads ranges from 73.3% to 96.4%, with a median equal to 93.5%.
- I used BWA only on one sample: I found that 93.3% of reads mapped to the reference genome

Thank you in advance for your help

**GenoMax** · 04-04-2016, 05:01 AM

That seems a bit odd. In the training for the HiSeq 4000 we were told that the sweet spot for PF is around 70%. Any more (once you get closer to 75%) would indicate that there will be a lot of duplicates.

When running Picard MarkDuplicates did you adjust the OPTICAL_DUPLICATE_PIXEL_DISTANCE=2500 as recommended in the link you had posted above?
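For readers who want to see what that setting looks like on the command line, here is a minimal sketch that only assembles a Picard MarkDuplicates call in the legacy `KEY=value` syntax. The file names and the `picard.jar` path are placeholders, not anything taken from this thread, and the helper function is hypothetical.

```python
import shlex


def markduplicates_command(bam_in, bam_out, metrics,
                           pixel_distance=2500, picard_jar="picard.jar"):
    """Build (but do not run) a Picard MarkDuplicates command line.

    All paths are hypothetical placeholders. 2500 is the pixel distance
    discussed in this thread for patterned flowcells.
    """
    return [
        "java", "-jar", picard_jar, "MarkDuplicates",
        f"I={bam_in}",        # coordinate-sorted input BAM
        f"O={bam_out}",       # output BAM with duplicates flagged
        f"M={metrics}",       # duplication metrics file
        f"OPTICAL_DUPLICATE_PIXEL_DISTANCE={pixel_distance}",
    ]


cmd = markduplicates_command("sample.bam", "sample.dedup.bam", "sample.metrics.txt")
print(" ".join(shlex.quote(part) for part in cmd))
```

The list can be passed to `subprocess.run(cmd, check=True)` once the placeholder paths point at real files.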

Perhaps you got lucky (and/or you have a library of excellent quality) and there are no duplicates. Though that seems a bit too good to be true.

**Nebetbastet** · 04-04-2016, 05:03 AM

Thank you for your answer.

Yes, I set it to 2500 as indicated in the link.

As you say, it seems a little too good to be true...

**GenoMax** · 04-04-2016, 05:10 AM

Have you contacted tech support? It may be worth getting their take on this.

I am finding 100% PF hard to believe. Are there really 3.2B reads in your dataset? Does the quality look fine?

**kmcarr** · 04-04-2016, 06:07 AM

> Originally posted by Nebetbastet:
>
> - Bcl2fastq tells me that the "%PF Clusters" is 100% for all the samples

> Originally posted by GenoMax:
>
> Have you contacted tech support? It may be worth getting their take on this.
>
> I am finding 100% PF hard to believe. Are there really 3.2B reads in your dataset? Does the quality look fine?

This is just a reporting quirk when you run Bcl2fastq without using the "--with-failed-reads" option. Since it is only converting and demultiplexing PF reads it reports them as 100% PF.

NOTE: This is true for Bcl2fastq v1.8.4. I have never tested the newer, 2.x versions of Bcl2fastq.
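The arithmetic behind that quirk can be sketched in a few lines. The cluster counts below are invented for illustration (chosen to give the 71% figure reported later in the thread), and `percent_pf` is a hypothetical helper, not part of Bcl2fastq.

```python
def percent_pf(pf_clusters, total_clusters):
    """Return the percentage of clusters that passed the chastity filter."""
    if total_clusters <= 0:
        raise ValueError("total_clusters must be positive")
    return 100.0 * pf_clusters / total_clusters


# Hypothetical counts for one sample.
raw_clusters = 1_000_000   # every cluster the instrument detected
pf_clusters = 710_000      # clusters that passed filter

# PF% measured against all raw clusters.
true_pf = percent_pf(pf_clusters, raw_clusters)

# If failed reads are never converted, the only clusters the report
# can count are the PF ones, so the ratio is PF / PF.
reported_pf = percent_pf(pf_clusters, pf_clusters)

print(f"true PF%: {true_pf:.1f}, reported PF%: {reported_pf:.1f}")
```

Under that assumption the report can only ever show 100%, whatever the real pass-filter rate was.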

**GenoMax** · 04-04-2016, 06:16 AM

It would be odd if bcl2fastq v.2 was run with "--with-failed-reads" option but that may be a logical explanation for the 100% PF observation.

**Nebetbastet** · 04-19-2016, 04:13 AM

Hi,

Sorry for my slow reply. I was investigating the 100% PF figure, and it turns out that number is wrong. The %PF is 71%.

**GenoMax** · 04-19-2016, 04:24 AM

That sounds more logical. Any update on optical duplicates? I have not been able to replicate the settings recommended on the GATK site for the small number of samples I have tried.

See this for an update on how samtools/GATK may handle this in future.

**Nebetbastet** · 04-19-2016, 04:30 AM

No, no update.

Thank you for the link to this discussion!

**Nebetbastet** · 05-18-2016, 02:38 AM

Hi,
I figured out what my problem was. It's quite trivial, but I'm letting you know in case someone else runs into the same problem...

I used single-end data (most of the projects in my team are single-end). I just noticed that MarkDuplicates needs paired-end data. I read the documentation too quickly and simply assumed that MarkDuplicates could detect optical duplicates in both single-end and paired-end data.

I just ran it on paired-end data and I could detect "optical" duplicates!

**GenoMax** · 05-18-2016, 04:16 AM

Where does it say that paired-end reads are required for this procedure (unless I am missing something)?

The tutorial you originally linked does say the following:

> For single end reads, duplicates are considered singly for the read, increasing the likelihood of being identified as a duplicate.

**Nebetbastet** · 05-18-2016, 04:29 AM

In the command line overview, I can read:

Identifies duplicate reads. This tool locates and tags duplicate reads (both PCR and optical/sequencing-driven) in a BAM or SAM file, where duplicate reads are defined as originating from the same original fragment of DNA. Duplicates are identified as read pairs having identical 5' positions (coordinate and strand) for both reads in a mate pair (and optionally, matching unique molecular identifier reads; see BARCODE_TAG option).

When I read that, I thought "OK, it is not stated clearly, but it seems it needs paired-end data, as there is no mention of single-end reads". And when I used paired-end reads, it worked (i.e., I found optical duplicates).

But indeed, the tutorial says single-end reads can be used... In fact, when I used single-end reads, duplicates were found (which means MarkDuplicates can use single-end reads to detect duplicates), but MarkDuplicates was unable to find "optical duplicates" (in all the samples of all the single-end datasets I used). It's quite confusing :s .

I left comments on the tutorial, so maybe I will get some answers.

**GenoMax** · 05-18-2016, 05:16 AM

Both reads would need to start at identical 5' co-ordinates to be certain that they represent an identical fragment so that makes sense as far as optical duplicates go.
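To make the "optical" part concrete, here is a simplified sketch of a position test between two reads that have already been flagged as duplicates by alignment. It assumes Illumina-style read names of the form `instrument:run:flowcell:lane:tile:x:y` and treats two clusters as optical duplicates when they share a lane and tile and lie within the pixel distance on both axes. This is an illustration of the idea, not Picard's actual implementation, and the read names are made up.

```python
from typing import NamedTuple


class ClusterPosition(NamedTuple):
    lane: int
    tile: int
    x: int
    y: int


def parse_read_name(name):
    """Extract lane, tile, x and y from an Illumina-style read name.

    Assumes seven colon-separated fields:
    instrument:run:flowcell:lane:tile:x:y (an optional comment after
    whitespace is ignored).
    """
    fields = name.split()[0].lstrip("@").split(":")
    if len(fields) < 7:
        raise ValueError(f"unexpected read name format: {name!r}")
    lane, tile, x, y = (int(value) for value in fields[3:7])
    return ClusterPosition(lane, tile, x, y)


def is_optical_duplicate(name_a, name_b, pixel_distance=2500):
    """Return True if two duplicate reads sit close together on one tile."""
    a, b = parse_read_name(name_a), parse_read_name(name_b)
    return (
        a.lane == b.lane
        and a.tile == b.tile
        and abs(a.x - b.x) <= pixel_distance
        and abs(a.y - b.y) <= pixel_distance
    )


# Hypothetical read names, for illustration only.
near = is_optical_duplicate(
    "K00123:45:HABCDEFXX:1:1101:10000:20000",
    "K00123:45:HABCDEFXX:1:1101:10500:21000",
)
far = is_optical_duplicate(
    "K00123:45:HABCDEFXX:1:1101:10000:20000",
    "K00123:45:HABCDEFXX:1:1102:10000:20000",
)
print(near, far)
```

The first pair shares a tile and is a few hundred pixels apart, so it counts as optical. The second pair sits on different tiles, so it would be treated as an ordinary (library) duplicate.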

Optical duplicates Hiseq4000