
**SNPsaurus** · 03-03-2017, 01:04 PM

Brian, we were talking about this and wondered whether you could test the breakage model by looking at the locations of the duplicates. The breakage would be happening during flow, right? If so, the duplicates should all lie in the direction of flow, with little orthogonal movement. Could you either pull up sets of duplicates and look at their coordinates, or add separate per-dimension distance limits for the duplicate check? Then you could allow a short pixel distance in one dimension and a long one in the other, and vice versa, and see how that affects the results.
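A rough sketch of the per-axis check being proposed, assuming the two reads are on the same tile and their coordinates have already been parsed into (x, y) pairs. The function name, thresholds, and coordinates are made up for illustration; this is not how Clumpify performs its distance test.

```python
def within_axis_limits(pos_a, pos_b, max_dx, max_dy):
    """Return True if two (x, y) positions differ by at most max_dx in X
    and at most max_dy in Y. Hypothetical helper, not part of any tool."""
    dx = abs(pos_a[0] - pos_b[0])
    dy = abs(pos_a[1] - pos_b[1])
    return dx <= max_dx and dy <= max_dy

# Made-up pair of positions on one tile, displaced mostly along Y.
a, b = (16224, 3004), (16230, 4900)

# Short limit in X, long limit in Y: accepts pairs displaced along Y.
along_y = within_axis_limits(a, b, max_dx=50, max_dy=2500)
# Long limit in X, short limit in Y: accepts pairs displaced along X.
along_x = within_axis_limits(a, b, max_dx=2500, max_dy=50)
```

If breakage during flow is the cause, the duplicate count should stay high under one of the two settings and drop sharply under the other.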

**Brian Bushnell** · 03-03-2017, 03:11 PM

Oh, that's an interesting suggestion... theoretically, they shouldn't go upstream at all, or sideways very far. It might not be possible to differentiate between upstream and downstream duplicates (though theoretically, the downstream ones should have a weaker signal), but I can certainly add the ability to differentiate between the X and Y axis. I'll do that and post here when it's done.

I'd imagine that they should make a kind of cone-shaped pattern like the debris field from an airplane crash or tornado, but plotting that kind of thing is tricky since it's an all-or-nothing proposition that doesn't let you see the diminishing probability over the region.

**GenoMax** · 03-03-2017, 06:43 PM

Hasn't the plotting kind of been done already in this blog post? https://sequencing.qcfail.com/articl...ted-sequences/ I had posted this over in the clumpify thread.

I am wondering if the odd FC-wide duplicates are showing up due to oversampling of libraries (especially for NovaSeq data). Am I completely off-target in suggesting that as a possible cause?

**Brian Bushnell** · 03-03-2017, 07:34 PM

I'm not really sure about the causes, but they do not seem to correspond with the artifacts associated with oversampling.

**SNPsaurus** · 03-03-2017, 11:37 PM

GenoMax, I was looking at that blog post again and I thought, but couldn't be sure, that the HiSeq 4000 optical duplicates had a bias along the Y axis. I was hoping Brian could replicate that or not. The blog post also noted, "Significantly, 99% of HiSeq 4000 duplicates comprised di-tags originating from the same tile", which seems to be in contrast to the NovaSeq plot Brian produced, with its steady increase in duplicates over long distances. Maybe the seeding of fragments onto a NovaSeq's flow cell is different and the problem is greater? But your link is clearly relevant!

**nucacidhunter** · 03-04-2017, 03:46 AM

Originally posted by Brian Bushnell

I wonder if this is a fundamental limitation of patterned flowcells, and made more pronounced as the dots shrink. When the colony is growing, once a dot is filled, the amplification continues but there is nowhere for the clones on the edges to attach, so some of them break off and drift around. In that case, presumably increasing the loading concentration would reduce the duplicate rate...

I think this is a limitation of ExAmp cluster amplification rather than of the patterned flow cell. With ExAmp, reducing the loading concentration increases the duplication rate, because a fragment seeding one nanowell has more chance to seed other wells as well. Once there is more data from NovaSeq, this can be investigated further.

**GenoMax** · 03-04-2017, 05:14 AM

Originally posted by Brian Bushnell

I'm not really sure about the causes, but they do not seem to correspond with the artifacts associated with oversampling.

I was thinking (perhaps simplistically) that the test NovaSeq data on BaseSpace is probably the same library loaded on multiple lanes. If we were to run clumpify across more than 2 (or even all) lanes then that would basically give us a collection of fragments that all have the same sequence.

As we add more lanes there is a diminishing return of new fragments. If that happens, are we basically capturing all the sequenceable fragments that are in this library?
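To make the diminishing-return idea concrete, here is a minimal sketch that counts how many previously unseen fragments each additional lane contributes. It assumes each lane has already been reduced to a collection of fragment keys (for example, read sequences); the function name and the toy data are hypothetical.

```python
def new_fragments_per_lane(lanes):
    """For each lane (an iterable of fragment keys), count the keys that
    did not appear in any earlier lane."""
    seen = set()
    counts = []
    for lane in lanes:
        lane_keys = set(lane)
        counts.append(len(lane_keys - seen))
        seen |= lane_keys
    return counts

# Toy data: three lanes loaded from the same (tiny) library.
lanes = [["A", "B", "C", "D"], ["A", "B", "E"], ["A", "C", "E"]]
print(new_fragments_per_lane(lanes))  # prints [4, 1, 0]
```

Counts that fall toward zero as lanes are added would be consistent with the library being sampled close to exhaustion.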

**GW_OK** · 03-16-2017, 06:06 AM

Can anyone share exactly how they're getting X/Y coordinates from the patterned flowcell fastq? I'm only seeing a single number, which I am guessing corresponds to a well ID.

**Brian Bushnell** · 03-16-2017, 09:07 AM

The NovaSeq read headers look like this:

Code:

@VP2-06:112:H7LNDMCVY:2:1105:16224:3004 1:N:0:TCCGGAGA+GGGTCTGA

In this case, 2:1105:16224:3004 is the positional information, in the format "lane:tile:X:Y". I got this data from BaseSpace; it's possible that SRA data has the read headers changed.
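A minimal sketch of pulling those four fields out of a header, assuming they are always the last four colon-separated fields of the read name (the part before the first space). The names `ReadPosition` and `parse_position` are just for illustration.

```python
from typing import NamedTuple

class ReadPosition(NamedTuple):
    lane: int
    tile: int
    x: int
    y: int

def parse_position(header: str) -> ReadPosition:
    """Parse lane, tile, X and Y from an Illumina-style FASTQ header."""
    # Drop the comment after the first space and the leading '@'.
    name = header.split()[0].lstrip("@")
    # Assumption: the positional fields are the last four in the name.
    lane, tile, x, y = (int(field) for field in name.split(":")[-4:])
    return ReadPosition(lane, tile, x, y)

pos = parse_position(
    "@VP2-06:112:H7LNDMCVY:2:1105:16224:3004 1:N:0:TCCGGAGA+GGGTCTGA"
)
print(pos)  # prints ReadPosition(lane=2, tile=1105, x=16224, y=3004)
```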

**GenoMax** · 03-16-2017, 10:58 AM

@GW_OK: Where did you get your data from? I got mine from BaseSpace and it looks like normal Illumina fastq data.

**GW_OK** · 03-16-2017, 04:46 PM

Well, since I have a 3000 (but no extant PCR-free data) I wanted to look at the public 4000 data from BaseSpace, specifically:
NA12878-PCRfree450_S3_L003_R1_001.fastq.gz
NA12878-PCRfree450_S3_L003_R2_001.fastq.gz
from the "HiSeq4000: TruSeq PCRfree and Nano (350bp to 550bp insert size)" data set.

From what I can see in my survey of the fastq headers all of the X coordinates are set at '0', hence my confusion.

Edit:
Focusing on tile 1101, the headers go from
@196:2371:H7MF5BBXX:3:1101:0:15712 1:N:0:3
to
@196:2371:H7MF5BBXX:3:1101:0:4312392 1:N:0:3

**GW_OK** · 03-18-2017, 06:20 AM

deleted due to duplication (hah)

**GW_OK** · 03-20-2017, 05:12 AM

The forums aren't letting me post a big post so I'm going to break this into three posts.

I've been intrigued with the question of duplicate-well directionality. Does it follow the direction of reagent flow? Setting aside the 4000 data set for a bit I moved over to the NovaSeq data, specifically NA12878-rep1. I pulled down the fastq files from BaseSpace and decided to initially plot (using ggplot) the actual XY coordinates for each read just to see what it looked like. To make visualization easier I focused solely on tile 1105. I still had to use a 10000x10000 png to get the wells spaced out enough.

It's pretty cool to look at. You can make out the ring fiducials quite clearly.


There's no way to make out the ordered array, since not every well had a read, though there are what look like tracks of reads.

**GenoMax** · 03-20-2017, 05:14 AM

@GW_OK: You may have missed this post from QC Fail. They did something similar.

I don't think you need to include "spantiles=t" for NovaSeq (or 4000 data). We have been keeping that off. That is a specific issue with NextSeq and the large clusters it has.

There is some oddity about the "dupedist=" setting as well. We have not been able to nail that one down for NovaSeq.

**GW_OK** · 03-20-2017, 05:17 AM

I then ran clumpify

Code:

markduplicates=t dupedist=2500 spantiles=t

to coalesce the duplicates. I then used a simple Perl script to parse the fastq headers into a "tile1 x1 y1 tile2 x2 y2" text file I could use in ggplot to draw lines between duplicate wells. The first coordinates are the "initial" reads as given by clumpify and the second coordinates are the "duplicate" reads as labeled by clumpify. I pulled out all of the duplicate sets where both wells were within tile 1105. I was quite struck by how many wells duplicated over to the 2xxx tileset, which is the bottom surface, while 1105 is on the top.
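The original script was Perl and isn't shown, so what follows is only a Python sketch of the same idea. It assumes the "initial" and "duplicate" headers have already been paired up (how that pairing is read out of the clumpify output is not covered here), and the second header in the example is made up.

```python
def tile_xy(header):
    """Extract (tile, x, y) from an Illumina-style FASTQ header line."""
    name = header.split()[0]
    tile, x, y = name.split(":")[-3:]
    return int(tile), int(x), int(y)

def pair_line(initial_header, duplicate_header):
    """Format one duplicate pair as 'tile1 x1 y1 tile2 x2 y2'."""
    t1, x1, y1 = tile_xy(initial_header)
    t2, x2, y2 = tile_xy(duplicate_header)
    return f"{t1} {x1} {y1} {t2} {x2} {y2}"

def same_surface(tile_a, tile_b):
    """Tiles 1xxx are the top surface and 2xxx the bottom, so comparing
    the first digit tells whether a pair stayed on one surface."""
    return str(tile_a)[0] == str(tile_b)[0]

initial = "@VP2-06:112:H7LNDMCVY:2:1105:16224:3004 1:N:0:TCCGGAGA+GGGTCTGA"
# Hypothetical duplicate header, invented for this example.
duplicate = "@VP2-06:112:H7LNDMCVY:2:2105:16300:3100 1:N:0:TCCGGAGA+GGGTCTGA"

line = pair_line(initial, duplicate)
crossed = not same_surface(1105, 2105)
```

Here `line` is "1105 16224 3004 2105 16300 3100" and `crossed` is True, which is the top-to-bottom case mentioned above.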


What a giant hairball! You can clearly see that there are libraries duplicating in both the horizontal and vertical direction. Another striking thing is just how long some of the lines are. Since I don't think there's an a priori way of telling which well came first I refrained from assuming directionality. That being said, I'm looking at triplicate (or higher) wells to see if there is a "shotgun" pattern that could indicate a directional "spray".
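One way to put rough numbers on the horizontal versus vertical impression without assuming which well came first is to compare absolute displacements per axis. This is a sketch only, assuming input lines in the "tile1 x1 y1 tile2 x2 y2" format and that X/Y are only comparable within a tile; the function name and sample lines are made up.

```python
def axis_summary(pair_lines):
    """Count same-tile duplicate pairs by their dominant displacement axis."""
    counts = {"mostly_x": 0, "mostly_y": 0, "equal": 0}
    for line in pair_lines:
        t1, x1, y1, t2, x2, y2 = (int(value) for value in line.split())
        if t1 != t2:
            continue  # skip cross-tile pairs; coordinates are per tile
        # Absolute values ignore which read is called the "initial" one.
        dx, dy = abs(x2 - x1), abs(y2 - y1)
        if dx > dy:
            counts["mostly_x"] += 1
        elif dy > dx:
            counts["mostly_y"] += 1
        else:
            counts["equal"] += 1
    return counts

pairs = [
    "1105 100 100 1105 110 900",  # displaced mainly along Y
    "1105 100 100 1105 800 120",  # displaced mainly along X
    "1105 5 5 2105 9 9",          # different tiles, skipped
]
summary = axis_summary(pairs)
```

A strong imbalance between the two counts would hint at a preferred direction; roughly equal counts would match the "hairball" picture.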
