Brian, we were talking about this and wondered if you could test the breakage model by looking at the location of duplicates. This would be happening during flow, right? So, if it is breakage then the duplicates should all happen in the direction of flow, with little orthogonal movement. Could you either pull up sets of duplicates and look at the coordinates, or add separate dimension distances for checking for dups and then allow a short pixel distance in one dimension and long in the other and vice versa and see how that affects the results?
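A minimal sketch of the per-axis test suggested above, assuming each read sharing a sequence has already been reduced to an (x, y) pixel coordinate on one tile. The function names and the coordinates are hypothetical, and this is not how Clumpify implements its distance check; it only illustrates swapping a narrow window and a long window between the two axes.

```python
from itertools import combinations

def is_positional_duplicate(a, b, max_dx, max_dy):
    """True when two clusters lie within max_dx pixels in X and max_dy in Y."""
    (ax, ay), (bx, by) = a, b
    return abs(ax - bx) <= max_dx and abs(ay - by) <= max_dy

def count_pairs(coords, max_dx, max_dy):
    """Count pairs of same-sequence clusters that pass the per-axis window."""
    return sum(
        is_positional_duplicate(a, b, max_dx, max_dy)
        for a, b in combinations(coords, 2)
    )

# Hypothetical coordinates of reads that share one sequence on one tile.
coords = [(1000, 1000), (1010, 3000), (1005, 5200), (3000, 1020)]

tall = count_pairs(coords, max_dx=50, max_dy=2500)   # narrow in X, long in Y
wide = count_pairs(coords, max_dx=2500, max_dy=50)   # long in X, narrow in Y
print(tall, wide)
```

If duplicates really spread along one axis only, the count from one window shape should dominate the other across a whole tile.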
-
Oh, that's an interesting suggestion... theoretically, they shouldn't go upstream at all, or sideways very far. It might not be possible to differentiate between upstream and downstream duplicates (though theoretically, the downstream ones should have a weaker signal), but I can certainly add the ability to differentiate between the X and Y axis. I'll do that and post here when it's done.
I'd imagine that they should make a kind of cone-shaped pattern like the debris field from an airplane crash or tornado, but plotting that kind of thing is tricky since it's an all-or-nothing proposition that doesn't let you see the diminishing probability over the region.
Last edited by Brian Bushnell; 03-03-2017, 03:15 PM.
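One way to see the diminishing probability rather than an all-or-nothing call is to bin the offsets between paired duplicates. The following is only a sketch under assumptions: the pairs, the bin size, and the function name are hypothetical, and because the original and the copy cannot be told apart, each offset is folded so that its Y component is non-negative.

```python
from collections import Counter

def offset_histogram(pairs, bin_size=500):
    """Bin (dx, dy) offsets between paired duplicates into square bins."""
    hist = Counter()
    for (x1, y1), (x2, y2) in pairs:
        dx, dy = x2 - x1, y2 - y1
        # Which read came first is unknown, so fold every pair onto dy >= 0.
        if dy < 0 or (dy == 0 and dx < 0):
            dx, dy = -dx, -dy
        hist[(dx // bin_size, dy // bin_size)] += 1
    return hist

# Hypothetical duplicate pairs, each as ((x1, y1), (x2, y2)).
pairs = [
    ((100, 100), (150, 900)),
    ((200, 2000), (180, 1300)),
    ((50, 50), (1700, 60)),
]
hist = offset_histogram(pairs)
for (bx, by), n in sorted(hist.items()):
    print(bx, by, n)
```

Plotting the bin counts as a heat map would show a cone, if there is one, as a wedge of bins whose counts fall off with distance.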
-
Hasn't the plotting more or less been done in this blog post: https://sequencing.qcfail.com/articl...ted-sequences/ ? I had posted it over in the clumpify thread.
I am wondering if the odd FC-wide duplicates are showing up due to oversampling of libraries (especially for NovaSeq data). Am I completely off-target in suggesting that as a possible cause?
-
GenoMax, I was looking at that blog post again and I thought, but couldn't be sure, that the HS4000 optical duplicates had a bias along the Y axis. I was hoping Brian could either replicate that or rule it out. The blog post also noted, "Significantly, 99% of HiSeq 4000 duplicates comprised di-tags originating from the same tile", which seems to contrast with the NovaSeq plot Brian produced, with its steady increase in duplicates over long distances. Maybe the seeding of fragments onto a NovaSeq flow cell is different and the problem is greater? But your link is clearly relevant!
Providing nextRAD genotyping and PacBio sequencing services. http://snpsaurus.com
-
Originally posted by Brian Bushnell:
I wonder if this is a fundamental limitation of patterned flowcells, and made more pronounced as the dots shrink. When the colony is growing, once a dot is filled, the amplification continues but there is nowhere for the clones on the edges to attach, so some of them break off and drift around. In that case, presumably increasing the loading concentration would reduce the duplicate rate...
-
Originally posted by Brian Bushnell:
I'm not really sure about the causes, but they do not seem to correspond with the artifacts associated with oversampling.
As we add more lanes there is diminishing return of new fragments. If that happens then we are basically capturing all sequenceable fragments that are in this library?
-
The Novaseq read headers look like this:
Code:
@VP2-06:112:H7LNDMCVY:2:1105:16224:3004 1:N:0:TCCGGAGA+GGGTCTGA
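For anyone wanting to pull the coordinates out of these headers, here is a small sketch. It assumes the usual Illumina layout of instrument:run:flowcell:lane:tile:x:y before the first space; the class and function names are made up for illustration.

```python
from typing import NamedTuple

class ReadPosition(NamedTuple):
    instrument: str
    run: str
    flowcell: str
    lane: int
    tile: int
    x: int
    y: int

def parse_header(header: str) -> ReadPosition:
    """Split the part before the first space into its seven colon-separated fields."""
    name = header.split()[0]
    if name.startswith("@"):
        name = name[1:]
    instrument, run, flowcell, lane, tile, x, y = name.split(":")
    return ReadPosition(instrument, run, flowcell, int(lane), int(tile), int(x), int(y))

pos = parse_header("@VP2-06:112:H7LNDMCVY:2:1105:16224:3004 1:N:0:TCCGGAGA+GGGTCTGA")
print(pos.tile, pos.x, pos.y)
```

For the header above this gives tile 1105 with X = 16224 and Y = 3004.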
-
Well, since I have a 3000 (but no extant PCR-free data) I wanted to look at the public 4000 data from Basespace, specifically:
NA12878-PCRfree450_S3_L003_R1_001.fastq.gz
NA12878-PCRfree450_S3_L003_R2_001.fastq.gz
from the
HiSeq4000: TruSeq PCRfree and Nano (350bp to 550bp insert size)
data set.
From what I can see in my survey of the fastq headers all of the X coordinates are set at '0', hence my confusion.
Edit:
Focusing on tile 1101, the headers go from
@196:2371:H7MF5BBXX:3:1101:0:15712 1:N:0:3
to
@196:2371:H7MF5BBXX:3:1101:0:4312392 1:N:0:3
Last edited by GW_OK; 03-17-2017, 05:34 AM.
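The kind of header survey described above can be sketched as follows, assuming the tile, X, and Y values are the fifth, sixth, and seventh colon-separated fields. The function name is hypothetical; the two headers are the ones quoted in the post.

```python
def coordinate_ranges(headers):
    """Return {tile: (min_x, max_x, min_y, max_y)} over Illumina-style headers."""
    ranges = {}
    for header in headers:
        fields = header.split()[0].split(":")
        tile, x, y = int(fields[4]), int(fields[5]), int(fields[6])
        if tile not in ranges:
            ranges[tile] = (x, x, y, y)
        else:
            lo_x, hi_x, lo_y, hi_y = ranges[tile]
            ranges[tile] = (min(lo_x, x), max(hi_x, x), min(lo_y, y), max(hi_y, y))
    return ranges

headers = [
    "@196:2371:H7MF5BBXX:3:1101:0:15712 1:N:0:3",
    "@196:2371:H7MF5BBXX:3:1101:0:4312392 1:N:0:3",
]
ranges = coordinate_ranges(headers)
print(ranges[1101])
```

A tile whose minimum and maximum X are both 0 while Y runs into the millions would confirm that the X field carries no position information in that file.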
-
The forums aren't letting me post a big post so I'm going to break this into three posts.
I've been intrigued by the question of duplicate-well directionality. Does it follow the direction of reagent flow? Setting aside the 4000 data set for a bit, I moved over to the NovaSeq data, specifically NA12878-rep1. I pulled down the fastq files from BaseSpace and decided to start by plotting (using ggplot) the actual XY coordinates for each read, just to see what it looked like. To make visualization easier I focused solely on tile 1105. I still had to use a 10000x10000 png to get the wells spaced out enough.
It's pretty cool to look at. You can make out the ring fiducials quite clearly.
There is no way to make out the ordered array, since not every well had a read, though there are what look like tracks of reads.
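The plot above was made with ggplot; the sketch below only covers the extraction step, in Python, under the assumption that the FASTQ uses unwrapped four-line records. The function name is hypothetical, and the second record in the example (tile 1106) is invented to show the filter working.

```python
import io
from itertools import islice

def tile_coordinates(lines, tile):
    """Yield (x, y) for each FASTQ record whose header names the given tile."""
    # In a four-line FASTQ record the header is every fourth line.
    for header in islice(lines, 0, None, 4):
        fields = header.split()[0].split(":")
        if int(fields[4]) == tile:
            yield int(fields[5]), int(fields[6])

# Two records: the header quoted earlier in the thread and an invented one.
fastq = io.StringIO(
    "@VP2-06:112:H7LNDMCVY:2:1105:16224:3004 1:N:0:TCCGGAGA+GGGTCTGA\n"
    "ACGT\n+\nFFFF\n"
    "@VP2-06:112:H7LNDMCVY:2:1106:100:200 1:N:0:TCCGGAGA+GGGTCTGA\n"
    "ACGT\n+\nFFFF\n"
)
coords = list(tile_coordinates(fastq, 1105))
print(coords)
```

For a real file, passing `gzip.open(path, "rt")` in place of the `StringIO` object should work the same way, and the resulting pairs can be written out for whatever plotting tool is preferred.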
-
@GW_OK: You may have missed this post from QC Fail. They did something similar.
I don't think you need to include "spantiles=t" for NovaSeq (or 4000 data). We have been keeping that off. That is a specific issue with NextSeq and the large clusters it has.
There is some oddity about the "dupedist=" setting as well. We have not been able to nail that one down for NovaSeq.
Last edited by GenoMax; 03-20-2017, 05:20 AM.
-
I then ran clumpify with these settings:
Code:
markduplicates=t dupedist=2500 spantiles=t
What a giant hairball! You can clearly see that there are libraries duplicating in both the horizontal and vertical direction. Another striking thing is just how long some of the lines are. Since I don't think there's an a priori way of telling which well came first I refrained from assuming directionality. That being said, I'm looking at triplicate (or higher) wells to see if there is a "shotgun" pattern that could indicate a directional "spray".
Last edited by GW_OK; 03-20-2017, 09:16 AM.
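One possible way to look at triplicate-or-higher groups is to compute the orientation of each group's main axis from the covariance of its coordinates. This is a sketch, not the analysis actually run here: the group names and coordinates are hypothetical, and the angle says only which axis a group is stretched along, not which well came first.

```python
import math

def principal_axis_degrees(points):
    """Angle of the main axis of spread, in degrees from the X axis."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    sxx = sum((x - mx) ** 2 for x, _ in points) / n
    syy = sum((y - my) ** 2 for _, y in points) / n
    sxy = sum((x - mx) * (y - my) for x, y in points) / n
    return math.degrees(0.5 * math.atan2(2 * sxy, sxx - syy))

def group_orientations(groups, min_size=3):
    """Orientation for every duplicate group with at least min_size wells."""
    return {
        name: principal_axis_degrees(pts)
        for name, pts in groups.items()
        if len(pts) >= min_size
    }

# Hypothetical duplicate groups keyed by an arbitrary label.
groups = {
    "dupA": [(0, 0), (1, 100), (-1, 200)],   # stretched along Y
    "dupB": [(0, 0), (300, 2), (600, -1)],   # stretched along X
    "pair": [(0, 0), (5, 5)],                # skipped: only two wells
}
orientations = group_orientations(groups)
for name, angle in orientations.items():
    print(name, round(angle, 1))
```

Angles near 0 mean a group runs along X and angles near plus or minus 90 mean it runs along Y, so a histogram of these angles over all groups on a tile would show whether one direction dominates.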