**GenoMax** · 12-23-2016, 02:06 PM

@Brian: Can this be easily extended to provide "MarkDuplicates" functionality? Currently only Picard tools does this.

I am referring to optical duplicates that form due to "pad hopping" in the case of patterned HiSeq 4000 flowcells. Since you are using smaller rectangular tiles to look at reads in the neighborhood, would it be possible to identify clusters that may be optical dups and mark them?

**Brian Bushnell** · 12-23-2016, 02:14 PM

It sounds like it might (or could) be a straightforward extension, yes. Could require writing temp files if all reads don't fit into memory, though. I'd have to study the optical duplicate problem in detail first because I don't currently know how to identify them confidently.

**GenoMax** · 12-23-2016, 02:25 PM

If you can look in the positional neighborhood for clusters having identical sequence then those would be it.

**SNPsaurus** · 12-23-2016, 08:53 PM

GenoMax, my first thought was also that an optical dup filter was needed! I would love to remove optical duplicates as a first step in a pipeline (before mapping). Seems like between clumpify and this filterbytile he almost has it written already. (Sorry Brian, I'm sure it is annoying for us to besiege you with requests and then add how it should be easy to do.)

**Brian Bushnell** · 12-29-2016, 02:08 PM

I decided it would work best to add deduplication features to Clumpify. All approaches work perfectly for error-free reads, but Clumpify is affected slightly more by errors than mapping-based approaches, in situations where it is desirable to remove "duplicates" with mismatches. Here's a comparison showing Clumpify's paired-read deduplication compared to DedupeByMapping on some real HiSeq data, allowing reads with various numbers of mismatches to the reference (DedupeByMapping is considered the gold standard in each case, though that does not necessarily mean it is more correct). Clumpify is run with different settings (C and D have higher removal rates because they use 3 passes; A is at default settings).

I also added the ability to restrict duplicate removal to only clusters within a specific number of pixels of each other on the flowcell, to avoid removal of PCR duplicates or coincidental duplicates due to high coverage. In so doing I noticed some interesting things... firstly, that most optical duplicates are on different tiles (inter-tile duplicates) rather than the same tile, and secondly, that in the data I tested, NextSeq has a WAY higher optical duplicate rate (~1%) than HiSeq 2500/1T (0.05%). The way you can distinguish between an inter-tile optical duplicate and a PCR duplicate is that inter-tile optical duplicates will share an X or Y coordinate (within some number of pixels, typically under 40). Intra-tile optical duplicates will of course share both coordinates as well as the tile number.
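To make the rule above concrete, here is a minimal Python sketch of the coordinate test. It is not Clumpify's code. It assumes Casava 1.8+ style read headers (`@instrument:run:flowcell:lane:tile:x:y ...`), assumes both reads must come from the same lane, and uses 40 pixels as the default distance from the "typically under 40" remark. The names `ClusterPos`, `parse_position`, and `classify_duplicate` are made up for illustration.

```python
from typing import NamedTuple


class ClusterPos(NamedTuple):
    lane: int
    tile: int
    x: int
    y: int


def parse_position(header: str) -> ClusterPos:
    """Extract lane, tile, x, y from a Casava 1.8+ style header (assumed format)."""
    # Example header: @M01:23:FC1:1:1101:15589:1331 1:N:0:1
    name = header.lstrip("@").split()[0]
    fields = name.split(":")
    lane, tile, x, y = (int(f) for f in fields[3:7])
    return ClusterPos(lane, tile, x, y)


def classify_duplicate(a: ClusterPos, b: ClusterPos, dist: int = 40) -> str:
    """Classify two reads that are already known to have matching sequence."""
    if a.lane != b.lane:
        # Assumption: clusters in different lanes cannot be optical duplicates.
        return "non-optical"
    near_x = abs(a.x - b.x) <= dist
    near_y = abs(a.y - b.y) <= dist
    if a.tile == b.tile:
        # Same tile: both coordinates must be close.
        return "intra-tile optical" if (near_x and near_y) else "non-optical"
    # Different tiles: sharing either X or Y (within dist) is enough.
    return "inter-tile optical" if (near_x or near_y) else "non-optical"
```

Anything labelled "non-optical" here would be a PCR or coincidental duplicate; the sequence comparison itself (including any mismatch tolerance) is left out of the sketch.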

I'll release this once I'm done testing.

Attached Files

DedupeClumpify.png (39.0 KB, 556 views)

**GenoMax** · 12-29-2016, 02:57 PM

Am I reading the graph above correctly in that you were not able to find true optical dups (perfect matches on the read) in data you tested? These should be present in problematic HiSeq 3K, 4K data.

You may also want to grab some HiSeq 4000 (or HiSeq X) data from SRA to test since we expect this to be a problem there.

**Brian Bushnell** · 12-29-2016, 03:46 PM

Hi Genomax,

There were plenty of identical pairs. Everything is scaled to 100% in that graph, but here is the raw data:

Code:

	A	B	C	D	DBM
0	3868	3868	3868	3868	3870
1	6260	6316	6402	6454	6470
2	6534	6562	6720	6728	6746
3	6590	6628	6794	6806	6826
4	6622	6662	6832	6840	6858
5	6652	6690	6864	6874	6888

56% of the optical duplicates are perfectly identical, and those were found without problems. Only the reads with mismatches to each other pose challenges, but most of those were still found as well. I still consider them optical duplicates even though they are not technically identical. I've been kind of struggling with the definition of "optical duplicates", but I will use this:

Code:

Reads originating from the same fragment, called multiple times despite originating from nearly the same physical flowcell location.

Since the reads are called multiple times, they can have different errors despite being the same physical cluster.
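For anyone who wants to redo the arithmetic on the raw table above, here is a small Python sketch. It assumes each row is the cumulative number of duplicates found when allowing up to that many mismatches (my reading of the table, not something stated explicitly), with A-D being the Clumpify settings and DBM being DedupeByMapping.

```python
# Duplicate counts copied from the table above; list index = mismatches allowed (0-5).
# Assumption: counts are cumulative, so the last entry is the total found.
counts = {
    "A":   [3868, 6260, 6534, 6590, 6622, 6652],
    "B":   [3868, 6316, 6562, 6628, 6662, 6690],
    "C":   [3868, 6402, 6720, 6794, 6832, 6864],
    "D":   [3868, 6454, 6728, 6806, 6840, 6874],
    "DBM": [3870, 6470, 6746, 6826, 6858, 6888],
}

# Fraction of all duplicates found (at 5 mismatches) that are perfectly identical.
identical_fraction = {name: c[0] / c[-1] for name, c in counts.items()}

# Fraction of the DedupeByMapping total that each Clumpify setting recovers.
dbm_total = counts["DBM"][-1]
recovered_vs_dbm = {name: c[-1] / dbm_total for name, c in counts.items()}

for name in counts:
    print(f"{name}: identical={identical_fraction[name]:.1%} "
          f"recovered_vs_DBM={recovered_vs_dbm[name]:.1%}")
```

Under that reading, the identical fraction comes out at roughly 56% for C, D, and DBM, and slightly higher for A and B because their totals are smaller.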

**Brian Bushnell** · 01-04-2017, 03:09 PM

OK, the new version of Clumpify is out, adding the "dedupe" and "optical" flags (as well as a few other related flags), so you can do optical or full deduplication. Also related to FilterByTile, BBDuk now has xmin, ymin, xmax, and ymax flags for large-scale location-based read filtering; essentially, you can eliminate tile-edge effects using a bounding box. For our NextSeq I was able to eliminate tile-edge duplicates with "xmin=1600 xmax=26300". There did not seem to be any on the Y edges. But, the exact values may vary by machine or run.
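As a rough illustration of what a bounding box on cluster coordinates does, here is a Python sketch that applies the "xmin=1600 xmax=26300" example to read headers. It is not BBDuk's implementation. It assumes Casava 1.8+ style headers with X and Y in the sixth and seventh colon-separated fields, and it treats the bounds as inclusive, which is a guess. The function name `in_bounding_box` is hypothetical.

```python
from typing import Optional


def in_bounding_box(header: str,
                    xmin: Optional[int] = None, xmax: Optional[int] = None,
                    ymin: Optional[int] = None, ymax: Optional[int] = None) -> bool:
    """Return True if the read's cluster lies inside the box (bounds assumed inclusive)."""
    # Assumed header layout: @instrument:run:flowcell:lane:tile:x:y [comment]
    name = header.lstrip("@").split()[0]
    fields = name.split(":")
    x, y = int(fields[5]), int(fields[6])
    if xmin is not None and x < xmin:
        return False
    if xmax is not None and x > xmax:
        return False
    if ymin is not None and y < ymin:
        return False
    if ymax is not None and y > ymax:
        return False
    return True


# Made-up headers: one cluster near the left tile edge, one in the interior.
headers = [
    "@NS500:1:FC1:1:11101:1200:5000 1:N:0:1",
    "@NS500:1:FC1:1:11101:14000:5000 1:N:0:1",
]
kept = [h for h in headers if in_bounding_box(h, xmin=1600, xmax=26300)]
```

Leaving ymin and ymax unset keeps every Y value, matching the observation that the Y edges showed no such duplicates in that run.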

**mcmc** · 01-30-2017, 10:35 AM

Brian et al., would you recommend filterbytile.sh be done first, before adapter trimming & quality filtering?
Thanks,
MC

**Brian Bushnell** · 01-30-2017, 11:11 AM

Hi MC,

FilterByTile should be run on raw data, before anything else that changes or removes any reads. If you do any quality-related filtering or trimming steps before FilterByTile, you will remove some of the lowest-quality reads, which will disrupt the statistics.

**mcmc** · 02-11-2017, 03:52 PM

Hello Brian,
I get the message "Warning: Zero reads processed." using indump=dump.flowcell. But it looks like making the dump file worked ok (it says it processed 780m reads). Is it safe to ignore this warning?

Thanks,
mcmc

**Brian Bushnell** · 02-11-2017, 04:10 PM

Yeah, sorry about that, it's a known bug when you are using already-created flowcell files. If you do everything in a single pass it won't print that. Do you mind posting the filtering statistics, platform, and read length, just out of curiosity? The defaults generally remove around 2-5% of the reads in my testing, but I've only tested it on HiSeq 2500 and NextSeq data.

**mcmc** · 02-11-2017, 04:58 PM

Thanks. This was HiSeq 2500 2x250 RapidRun (with both lanes concatenated, which I assume is ok, since you refer to "flowcell" and not "lane"). I used the entire dataset (15 metagenomes) to calc the stats, then used dump.flowcell to filter each sample. Here is one example:

Code:

Flagged 36407 of 519552 micro-tiles, containing 50578608 reads:
0 exceeded uniqueness thresholds.
30332 exceeded quality thresholds.
34084 exceeded error-free probability thresholds.
0 had too few reads to calculate statistics.

Filtering reads:        988.159 seconds.

Time:                           988.671 seconds.

Reads Processed:      56900k    57.55k reads/sec
Bases Processed:      14225m    14.39m bases/sec

Reads Discarded:       3094k    5.438%
Bases Discarded:        773m    5.438%

I used "lowqualityonly=t usekmers=f"

**Brian Bushnell** · 02-11-2017, 06:32 PM

Thanks! Yes, FilterByTile processes lanes independently, so you can use them all at once.

Introducing FilterByTile: Remove Low-Quality Reads Without Adding Bias
