NovaSeq from Illumina

GenoMax replied

03-16-2017, 10:58 AM
@GW_OK: Where did you get your data from? I got mine from BaseSpace and it looks like normal Illumina fastq data.
Brian Bushnell replied

03-16-2017, 09:07 AM
The NovaSeq read headers look like this:

Code:

@VP2-06:112:H7LNDMCVY:2:1105:16224:3004 1:N:0:TCCGGAGA+GGGTCTGA

In this case, 2:1105:16224:3004 is the positional information, in the format "lane:tile:X:Y". I got this data from BaseSpace; it's possible that the read headers are changed in SRA data.
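For anyone who wants to pull the coordinates out programmatically, here is a minimal sketch, assuming the header follows the usual colon-separated Illumina layout (instrument, run, flowcell, lane, tile, X, Y) shown above. The names `ReadPosition` and `parse_position` are made up for illustration and do not come from any existing tool.

```python
from typing import NamedTuple


class ReadPosition(NamedTuple):
    """Positional fields from an Illumina-style read header (illustrative type)."""
    instrument: str
    run: str
    flowcell: str
    lane: int
    tile: int
    x: int
    y: int


def parse_position(header: str) -> ReadPosition:
    """Split the first whitespace-delimited token of a header into its fields."""
    name = header.split()[0].lstrip("@")
    fields = name.split(":")
    if len(fields) != 7:
        # Headers rewritten by an archive may not carry all seven fields.
        raise ValueError(f"expected 7 colon-separated fields, got {len(fields)}")
    instrument, run, flowcell = fields[:3]
    lane, tile, x, y = (int(f) for f in fields[3:])
    return ReadPosition(instrument, run, flowcell, lane, tile, x, y)


pos = parse_position(
    "@VP2-06:112:H7LNDMCVY:2:1105:16224:3004 1:N:0:TCCGGAGA+GGGTCTGA"
)
print(pos.lane, pos.tile, pos.x, pos.y)
```

If the headers have been rewritten (as may be the case for SRA downloads), the field count check fails instead of silently returning wrong coordinates.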
GW_OK replied

03-16-2017, 06:06 AM
Can anyone share exactly how they're getting X/Y coordinates from the patterned flowcell fastq? I'm only seeing a single number, which I am guessing corresponds to a well ID.
GenoMax replied

03-04-2017, 05:14 AM
Originally posted by Brian Bushnell

I'm not really sure about the causes, but they do not seem to correspond with the artifacts associated with oversampling.

I was thinking (perhaps simplistically) that the test NovaSeq data on BaseSpace is probably the same library loaded on multiple lanes. If we were to run clumpify across more than 2 (or even all) lanes then that would basically give us a collection of fragments that all have the same sequence.

As we add more lanes, there is a diminishing return of new fragments. If that happens, are we basically capturing all the sequenceable fragments in this library?
nucacidhunter replied

03-04-2017, 03:46 AM
Originally posted by Brian Bushnell

I wonder if this is a fundamental limitation of patterned flowcells, and made more pronounced as the dots shrink. When the colony is growing, once a dot is filled, the amplification continues but there is nowhere for the clones on the edges to attach, so some of them break off and drift around. In that case, presumably increasing the loading concentration would reduce the duplicate rate...

I think this is a limitation of ExAmp cluster amplification rather than of the patterned flow cell. With ExAmp, reducing the loading concentration increases the duplication rate, because a fragment that seeds one nanowell has a greater chance of seeding other wells as well. Once there are more data from NovaSeq, this can be investigated further.
SNPsaurus replied

03-03-2017, 11:37 PM
GenoMax, I was looking at that blog post again and I thought, but couldn't be sure, that the HS4000 optical duplicates had a bias along the Y axis. I was hoping Brian could replicate that or not. The blog post also noted, "Significantly, 99% of HiSeq 4000 duplicates comprised di-tags originating from the same tile", which seems to be in contrast to the NovaSeq plot Brian produced, with its steady increase in duplicates over long distances. Maybe the seeding of fragments onto a NovaSeq flow cell is different and the problem is greater? But your link is clearly relevant!
Brian Bushnell replied

03-03-2017, 07:34 PM
I'm not really sure about the causes, but they do not seem to correspond with the artifacts associated with oversampling.
GenoMax replied

03-03-2017, 06:43 PM
Hasn't the plotting kind of been done in this blog post: https://sequencing.qcfail.com/articl...ted-sequences/ ? I had posted this over in the clumpify thread.

I am wondering if the odd FC-wide duplicates are showing up due to oversampling of libraries (especially for NovaSeq data). Am I completely off-target in suggesting that as a possible cause?
Brian Bushnell replied

03-03-2017, 03:11 PM
Oh, that's an interesting suggestion... theoretically, they shouldn't go upstream at all, or sideways very far. It might not be possible to differentiate between upstream and downstream duplicates (though theoretically, the downstream ones should have a weaker signal), but I can certainly add the ability to differentiate between the X and Y axis. I'll do that and post here when it's done.

I'd imagine that they should make a kind of cone-shaped pattern like the debris field from an airplane crash or tornado, but plotting that kind of thing is tricky since it's an all-or-nothing proposition that doesn't let you see the diminishing probability over the region.

Last edited by Brian Bushnell; 03-03-2017, 03:15 PM.
SNPsaurus replied

03-03-2017, 01:04 PM
Brian, we were talking about this and wondered if you could test the breakage model by looking at the location of duplicates. This would be happening during flow, right? So, if it is breakage, the duplicates should all occur in the direction of flow, with little orthogonal movement. Could you either pull up sets of duplicates and look at their coordinates, or add separate per-dimension distance limits to the duplicate check? You could then allow a short pixel distance in one dimension and a long one in the other, and vice versa, and see how that affects the results.
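To make the per-dimension idea concrete, here is a rough sketch of the kind of check being proposed. It is not how any existing tool does it: the function names, the coordinates, and the limits are all made up for illustration, and it assumes the reads in a duplicate set sit on the same tile so that their X/Y values are comparable. Which axis corresponds to the direction of flow is left open.

```python
from itertools import combinations


def axis_offsets(coords):
    """Return (|dx|, |dy|) for every pair of reads in one duplicate set."""
    return [
        (abs(x1 - x2), abs(y1 - y2))
        for (x1, y1), (x2, y2) in combinations(coords, 2)
    ]


def count_within(offsets, max_dx, max_dy):
    """Count pairs whose offsets fall inside separate X and Y limits."""
    return sum(1 for dx, dy in offsets if dx <= max_dx and dy <= max_dy)


# Hypothetical coordinates for three reads with identical sequence on one tile.
dupes = [(16224, 3004), (16230, 9004), (16100, 3010)]
offsets = axis_offsets(dupes)

# Short limit in X, long limit in Y, and then the reverse.
narrow_x = count_within(offsets, max_dx=50, max_dy=10000)
narrow_y = count_within(offsets, max_dx=10000, max_dy=50)
print(narrow_x, narrow_y)
```

If duplicates mostly drift along one axis, one of the two counts should be consistently much larger than the other across many duplicate sets.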
Brian Bushnell replied

03-02-2017, 09:12 AM
Originally posted by pmiguel

Hi Brian,
Are you scoring the same number of reads with HiSeq/NovaSeq? If the number of reads for the NovaSeq were an order of magnitude higher, then for repetitive or mitochondrial DNA you might be able to use up all of the possible start sites.

I might try running again after removing the mito, but it's not like mito accounts for >12% of the reads anyway. The number of reads was different, but this NovaSeq library only has twice the reads of the HiSeq library, so that doesn't explain the result.

Are you scoring clusters as a duplicate only if both forward and reverse reads are the same? Or are you only checking one side?

As GenoMax indicated, yes, with this methodology both reads in a pair are required to match for the pair to be considered a duplicate. Due to the large insert size and variance, this is unlikely to occur by chance.
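As a simplified illustration of the "both reads must match" rule, the sketch below groups pairs by the exact sequences of read 1 and read 2. It is only a toy version under that exact-match assumption, not the actual algorithm used for the numbers in this thread; the function name and the example reads are made up.

```python
from collections import defaultdict


def group_duplicate_pairs(pairs):
    """Group read pairs whose R1 and R2 sequences are both identical.

    `pairs` is an iterable of (name, r1_sequence, r2_sequence) tuples.
    Returns a list of name lists, one per group with more than one member.
    """
    groups = defaultdict(list)
    for name, r1, r2 in pairs:
        # The key uses both mates, so a match on R1 alone is not enough.
        groups[(r1, r2)].append(name)
    return [names for names in groups.values() if len(names) > 1]


example = [
    ("a", "ACGT", "TTGA"),
    ("b", "ACGT", "TTGA"),  # same R1 and R2 as "a": a duplicate pair
    ("c", "ACGT", "GGGG"),  # same R1 only: not a duplicate
]
print(group_duplicate_pairs(example))
```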

Originally posted by misterc

Brian, your hypothesis is reasonable as there is no other possibility to explain the duplicate rate. Not surprisingly, we see similar duplicates on HiSeq 4000, as this 'characteristic' of ExAmp isn't limited to NovaSeq.

I wonder if this is a fundamental limitation of patterned flowcells, and made more pronounced as the dots shrink. When the colony is growing, once a dot is filled, the amplification continues but there is nowhere for the clones on the edges to attach, so some of them break off and drift around. In that case, presumably increasing the loading concentration would reduce the duplicate rate...

But, it makes me wonder what the duplicate rates of the high-throughput flowcells will look like.
misterc replied

03-02-2017, 08:12 AM
Brian, your hypothesis is reasonable as there is no other possibility to explain the duplicate rate. Not surprisingly, we see similar duplicates on HiSeq 4000, as this 'characteristic' of ExAmp isn't limited to NovaSeq.
GenoMax replied

03-02-2017, 04:35 AM
Originally posted by pmiguel

Hi Brian,
Are you scoring the same number of reads with HiSeq/NovaSeq? If the number of reads for the NovaSeq were an order of magnitude higher, then for repetitive or mitochondrial DNA you might be able to use up all of the possible start sites.

Probably not, since the best NovaSeq sample posted on BaseSpace has 1.6 billion reads. (The individual R1 and R2 files are 300G each when uncompressed! We could have uncompressed read files of 1TB each when S4 flowcells roll around later this year.)

Are you scoring clusters as a duplicate only if both forward and reverse reads are the same? Or are you only checking one side?

That should be a yes since @Brian is probably using clumpify which takes both reads into account.

I am wondering if we are sampling the libraries so thoroughly on a NovaSeq that we have duplicates showing up due to oversampling.

Last edited by GenoMax; 03-02-2017, 08:14 AM.
pmiguel replied

03-02-2017, 04:10 AM
Hi Brian,
Are you scoring the same number of reads with HiSeq/NovaSeq? If the number of reads for the NovaSeq were an order of magnitude higher, then for repetitive or mitochondrial DNA you might be able to use up all of the possible start sites.

Are you scoring clusters as a duplicate only if both forward and reverse reads are the same? Or are you only checking one side?

BTW, yes, a typical DNA prep from cell culture would yield enough DNA to make it unnecessary to amplify the library.

--
Phillip
Brian Bushnell replied

03-02-2017, 12:21 AM
Here is a zoomed-in image of HiSeq 2500 duplicates for the same genome (it's an immortal human cell line that does not need amplification, or so I'm told).

This is not the same as the other image, as the x-axis is logarithmic rather than linear. But the important point in my opinion is that there is a rapid increase in duplicates detected up to a point (~45) and subsequently it is completely flat for a long time. That is what I expect from a platform that occasionally identifies oddly-shaped clusters as two clusters, or in which a well occasionally migrates to an adjacent well.

At ~1000, it starts going up again. I'm not sure about that - I would expect it to be sub-linear on the log scale, but then, I'm not sure what's happening in that region. The salient point is that there is a sharp increase over roughly the width of a cluster, and then a plateau, and finally another increase due to the increasing range. After dist=1000, I can't explain the slope. But, the graph only shows duplicates of less than 0.02% of reads, so it's not very important in practice. Still, it would be great if there was one less unsolved mystery.
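For readers who want to reproduce this kind of curve on their own data, here is a small sketch of the bookkeeping: given the distance between each duplicate pair, count how many fall at or below each distance cutoff. The distances and cutoffs below are invented purely to show a rise followed by a plateau; they are not taken from the attached plot, and the function name is hypothetical.

```python
from bisect import bisect_right


def cumulative_counts(distances, thresholds):
    """For each threshold, count the distances that are <= that threshold."""
    ordered = sorted(distances)
    return [bisect_right(ordered, t) for t in thresholds]


# Hypothetical pair distances: four close pairs and one far-apart pair.
distances = [5, 12, 40, 44, 3000]
thresholds = [10, 45, 1000, 10000]  # roughly log-spaced cutoffs

for cutoff, count in zip(thresholds, cumulative_counts(distances, thresholds)):
    print(cutoff, count)
```

With these made-up values the count stops growing between the cutoffs of 45 and 1000, which is the kind of plateau described above.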
Attached Files

HighSeq_Duplicates.png (32.8 KB)
Last edited by Brian Bushnell; 03-02-2017, 12:32 AM.