  • Brian Bushnell
    replied
    Originally posted by GW_OK View Post
    What a giant hairball! You can clearly see that there are libraries duplicating in both the horizontal and vertical direction. Another striking thing is just how long some of the lines are. Since I don't think there's an a priori way of telling which well came first I refrained from assuming directionality. That being said, I'm looking at triplicate (or higher) wells to see if there is a "shotgun" pattern that could indicate a directional "spray".
    That's downright strange...


  • GW_OK
    replied
    A few more interesting points.

    The tile map for the S2 flowcell:
    - The first digit is the lane number: 1 or 2.
    - The second digit represents the surface: 1 for top or 2 for bottom.
    - The third digit represents the swath number: 1, 2, 3, or 4.
    - The last 2 digits represent the tile number, 01 through 88. Tile numbering starts with 01 at the outlet end of the flow cell and runs through 88 at the inlet end.
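A minimal sketch of decoding a tile name, for anyone who wants to group reads by position. The list above describes five digits, but the tile names used in this thread (1105, 1240) have four, so this assumes the lane sits in its own FASTQ header field and the four-digit tile field carries surface, swath, and tile. That assumption matches 1105 being on the top surface and 2xxx on the bottom. The names `TilePosition` and `decode_tile` are made up for illustration.

```python
from typing import NamedTuple


class TilePosition(NamedTuple):
    surface: str  # "top" or "bottom"
    swath: int    # 1 through 4 on an S2 flowcell, per the list above
    tile: int     # two-digit tile number within the swath


def decode_tile(tile_field: str) -> TilePosition:
    """Split a four-digit tile name such as '1105' into its parts.

    ASSUMPTION: digit 1 = surface, digit 2 = swath, digits 3-4 = tile,
    with the lane stored separately in the read header.
    """
    if len(tile_field) != 4 or not tile_field.isdigit():
        raise ValueError(f"expected a four-digit tile name, got {tile_field!r}")
    surfaces = {"1": "top", "2": "bottom"}
    if tile_field[0] not in surfaces:
        raise ValueError(f"unknown surface digit in {tile_field!r}")
    return TilePosition(
        surface=surfaces[tile_field[0]],
        swath=int(tile_field[1]),
        tile=int(tile_field[2:]),
    )


print(decode_tile("1105"))
print(decode_tile("1240"))
```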

    The stuff on BaseSpace is from a pre-release flowcell, but I think we can assume the tiling map holds true, except that their tiles range from 05 to 90.

    Mapping the number of intra-tile, inter-tile, and inter-surface-tile duplicates shows that libraries will jump tiles in the direction of flow but less so horizontally or diagonally. There's also a large amount of inter-surface jumping from one surface to the direct opposite. My previous analysis was done on tile 1105, which you can see in the table below in the upper left corner. I also mapped the duplicates in a more centrally located tile (1240) where the duplications are more pronounced surrounding the tile in question.

    This seems to reinforce the previous observations that most duplicates stay (relatively) close to their origin with some bias in the Y direction. However, there also appears to be a Z-axis bias.
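The intra-tile / inter-tile / inter-surface tally described above could be sketched roughly like this. It assumes four-digit tile names where the first digit is the surface, and it treats two tiles as "directly opposite" when they differ only in that first digit; both are assumptions, not something confirmed in the thread. The tile pairs in the example are invented for illustration and are not real data.

```python
from collections import Counter
from typing import Iterable, Tuple


def classify_pair(tile_a: str, tile_b: str) -> str:
    """Label a duplicate pair by how its two tiles relate.

    ASSUMPTION: four-digit tile names, first digit = surface (1 top, 2 bottom).
    """
    if tile_a == tile_b:
        return "intra-tile"
    if tile_a[0] != tile_b[0]:
        # Same swath and tile number on the other surface.
        if tile_a[1:] == tile_b[1:]:
            return "inter-surface, directly opposite"
        return "inter-surface, other"
    return "inter-tile"


def tally(pairs: Iterable[Tuple[str, str]]) -> Counter:
    """Count duplicate pairs per category."""
    return Counter(classify_pair(a, b) for a, b in pairs)


# Invented example pairs, only to show the output shape.
example_pairs = [("1105", "1105"), ("1105", "1106"), ("1105", "2105"), ("1105", "2206")]
for label, count in sorted(tally(example_pairs).items()):
    print(label, count)
```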
    Attached Files
    Last edited by GW_OK; 03-20-2017, 12:49 PM.


  • SNPsaurus
    replied
    Thanks for the plots, GW_OK!


  • GW_OK
    replied
    Maybe we need to start talking in terms of "library coverage" instead of "genome coverage"...

    Also, don't confuse read numbers with cluster numbers. There'll be 2 reads to every cluster. And on this NovaSeq data you'll have to split their cluster counts across both lanes.


  • GenoMax
    replied
    I was thinking about "cluster-able/sequence-able" fragments present in the library. In any case this is just a hypothesis and your calculations may be spot-on.

    We are approaching data volumes never seen before (e.g. there are 1.6 B reads in the largest example data lane and this could potentially go up 4x with S4 cells).
    Last edited by GenoMax; 03-20-2017, 07:55 AM.


  • GW_OK
    replied
    Regarding sampling depth, I am dubious.

    The TruSeq PCR-free protocol (which I have to assume they're using) has you start with 1 ug for 350 bp inserts and 2 ug for 550 bp inserts. Since they say they're using 450 bp inserts I'll split the difference and say they started with 1.5 ug of DNA. The entire human genome weighs 3.6 pg (as per IDT), so that 1.5 ug is ~417k human genome equivalents. If they shear it on a Covaris (which is fairly random), then to get a duplicate by chance you would have to rely on two copies of the genome shearing at the exact same base pair on both ends.

    Then, from the fairly random fragment assortment of ~417k genomes, you take ~1.8E10 molecules (assuming they loaded at 200 pM) and from that sample 599M molecules.

    Someone with more statistical chops than me on a Monday morning can do the actual math but I have a feeling we're not close to oversampling these libraries. I could be wrong, though, so don't hold me to it.
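For reference, the back-of-envelope numbers above can be reproduced with a few lines of Python. This only redoes the arithmetic from the post and does not answer the duplicate question itself. The 150 uL loading volume is an assumption: it is not stated in the post, it is simply the volume at which 200 pM works out to roughly 1.8E10 molecules.

```python
AVOGADRO = 6.022e23  # molecules per mole

# Figures taken from the post.
input_dna_g = 1.5e-6         # 1.5 ug of input DNA
genome_mass_g = 3.6e-12      # 3.6 pg per human genome (as per IDT, per the post)
loading_conc_molar = 200e-12  # 200 pM loading concentration
clusters_sampled = 599e6     # 599M molecules sampled

# ASSUMPTION: 150 uL loaded; chosen because it reproduces ~1.8E10 molecules.
loading_volume_l = 150e-6

genome_equivalents = input_dna_g / genome_mass_g
molecules_loaded = loading_conc_molar * loading_volume_l * AVOGADRO
fraction_sampled = clusters_sampled / molecules_loaded

print(f"genome equivalents: {genome_equivalents:,.0f}")
print(f"molecules loaded:   {molecules_loaded:.2e}")
print(f"fraction sampled:   {fraction_sampled:.1%}")
```

Under those assumptions only a few percent of the loaded molecules end up as clusters.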
    Last edited by GW_OK; 03-20-2017, 06:44 AM.


  • GW_OK
    replied
    It's from the BaseSpace project
    Code:
    NovaSeq: WGS TruSeq PCR-Free 450 (6plex)
    So I reckon it must be PCR-free.

    I'm using NA12878-rep1, only looking at data from lane 1.


  • GenoMax
    replied
    Can you confirm that the three posts from today use data from NovaSeq? (Is that data PCR-free? I don't recollect.) spantiles=t actually spans across all tiles. This was essential to capture the edge-duplicate effect that appears to be specific to NextSeq flowcells.

    Since the NovaSeq flowcells should have more nanowells (3-4x?), using a dupedist of 2500 is probably not optimal (though, as I said before, we have not been able to pin a distance down based on the data available).

    I think Illumina is sampling this library so deeply that we are starting to see duplicates across the FC/tiles just because there are only so many sequenceable fragments in the library. I have tried to test this by pooling two lanes of NovaSeq data together to see whether the number of clumps goes up appreciably. Unfortunately I have not been able to get clumpify to work with this pooled data (and @Brian has not had a chance to look at why that is happening).


  • GW_OK
    replied
    Originally posted by GenoMax View Post
    @GW_OK: You may have missed this post from QC Fail. They did something similar.

    I don't think you need to include "spantiles=t" for NovaSeq (or 4000 data). We have been keeping that off. That is a specific issue with NextSeq and the large clusters it has.

    There is some oddity about the "dupedist=" setting as well. We have not been able to nail that one down for NovaSeq.

    So, yeah. I didn't miss that qcfail blog post. I've read through it several times. I wanted to recapitulate their analysis on a data set that:
    (A) had not undergone PCR amplification,
    (B) was across an entire tile, not just a small region of a tile, and
    (C) was performed by Illumina and/or someone with a vested interest in having their data set show the theoretical "best" of what the machine can do. It's all well and good to throw a library on two machines but I don't know what that library looked like prior to loading.

    I did want to use spantiles to demonstrate the 'mode' of duplication. Are the duplicates moving from well to well, or across the whole tile, or from tile to tile and surface to surface? Based off what I've seen here they're not just moving across interconnected wells.

    I picked dupedist 2500 based solely on what people have used for the 4000, as given in the clumpify thread in the bioinformatics subforum.


  • GW_OK
    replied
    Finally, I graphed duplicate read coordinates relative to the initial read coordinates (x1-x2/y1-y2 from the file I made above). I've attached that file below since it's not tremendously large. Most clump fairly close together, as others have shown, but there does seem to be a Y-bias to my eyes. Perhaps this means that there is some merit to the direction of flow duplicate theory.
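A rough way to put a number on the apparent Y-bias is to compare the typical X and Y offsets between the initial and duplicate wells. The sketch below assumes input lines in the "tile1 x1 y1 tile2 x2 y2" layout described in this thread and assumes X and Y are on the same scale; the two example lines are invented, not taken from the attached file.

```python
from statistics import mean
from typing import Iterable, List, Tuple


def offsets(lines: Iterable[str]) -> List[Tuple[int, int]]:
    """Return (x1 - x2, y1 - y2) for each 'tile1 x1 y1 tile2 x2 y2' line."""
    result = []
    for line in lines:
        _tile1, x1, y1, _tile2, x2, y2 = line.split()
        result.append((int(x1) - int(x2), int(y1) - int(y2)))
    return result


def mean_abs_offsets(pairs: List[Tuple[int, int]]) -> Tuple[float, float]:
    """Mean absolute offset along X and along Y."""
    return (
        mean(abs(dx) for dx, _ in pairs),
        mean(abs(dy) for _, dy in pairs),
    )


# Invented example lines, only to show the calculation.
example = [
    "1105 1000 2000 1105 1010 2300",
    "1105 500 800 1105 480 700",
]
mean_dx, mean_dy = mean_abs_offsets(offsets(example))
print(f"mean |dx| = {mean_dx}, mean |dy| = {mean_dy}")
```

A mean |dy| well above mean |dx| on the real file would be consistent with a Y-bias, though it says nothing about which well came first.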
    Attached Files


  • GW_OK
    replied
    I then ran clumpify
    Code:
    markduplicates=t dupedist=2500 spantiles=t
    to coalesce the duplicates. I then used a simple Perl script to parse the fastq headers into a "tile1 x1 y1 tile2 x2 y2" text file I could use in ggplot to draw lines between duplicate wells. The first coordinates are the "initial" reads as given by clumpify and the second coordinates are the "duplicate" reads as labeled by clumpify. I pulled out all of the duplicate sets where both wells were within tile 1105. I was quite struck by how many wells duplicated over to the 2xxx tileset, which is the bottom surface while 1105 is on the top.
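The header-parsing step could look something like the following in Python; this is a stand-in sketch, not the Perl script used for the post. It assumes the seven-field, colon-separated read name seen in the HiSeq 4000 headers quoted elsewhere in this thread (instrument, run, flowcell, lane, tile, x, y). How clumpify marks which read is the "initial" and which is the "duplicate" is not covered here, so the function simply takes the two headers as arguments. The function names are made up for illustration.

```python
from typing import Tuple


def parse_header(header: str) -> Tuple[str, int, int]:
    """Extract (tile, x, y) from an Illumina-style FASTQ header line.

    ASSUMPTION: read name has seven colon-separated fields:
    instrument:run:flowcell:lane:tile:x:y
    """
    read_name = header.lstrip("@").split()[0]  # drop '@' and the comment part
    fields = read_name.split(":")
    if len(fields) != 7:
        raise ValueError(f"unexpected header layout: {header!r}")
    return fields[4], int(fields[5]), int(fields[6])


def pair_line(initial_header: str, duplicate_header: str) -> str:
    """Format one 'tile1 x1 y1 tile2 x2 y2' line for plotting."""
    tile1, x1, y1 = parse_header(initial_header)
    tile2, x2, y2 = parse_header(duplicate_header)
    return f"{tile1} {x1} {y1} {tile2} {x2} {y2}"


# Two headers quoted in this thread, used only to show the format;
# they are not a real duplicate pair.
first = "@196:2371:H7MF5BBXX:3:1101:0:15712 1:N:0:3"
second = "@196:2371:H7MF5BBXX:3:1101:0:4312392 1:N:0:3"
print(pair_line(first, second))
```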


    Link to bigger

    What a giant hairball! You can clearly see that there are libraries duplicating in both the horizontal and vertical direction. Another striking thing is just how long some of the lines are. Since I don't think there's an a priori way of telling which well came first I refrained from assuming directionality. That being said, I'm looking at triplicate (or higher) wells to see if there is a "shotgun" pattern that could indicate a directional "spray".
    Last edited by GW_OK; 03-20-2017, 09:16 AM.


  • GenoMax
    replied
    @GW_OK: You may have missed this post from QC Fail. They did something similar.

    I don't think you need to include "spantiles=t" for NovaSeq (or 4000 data). We have been keeping that off. That is a specific issue with NextSeq and the large clusters it has.

    There is some oddity about the "dupedist=" setting as well. We have not been able to nail that one down for NovaSeq.
    Last edited by GenoMax; 03-20-2017, 05:20 AM.


  • GW_OK
    replied
    The forums aren't letting me post a big post so I'm going to break this into three posts.

    I've been intrigued by the question of duplicate-well directionality. Does it follow the direction of reagent flow? Setting aside the 4000 data set for a bit, I moved over to the NovaSeq data, specifically NA12878-rep1. I pulled down the fastq files from BaseSpace and decided to initially plot (using ggplot) the actual XY coordinates for each read just to see what it looked like. To make visualization easier I focused solely on tile 1105. I still had to use a 10000x10000 png to get the wells spaced out enough.

    It's pretty cool to look at. You can make out the ring fiducials quite clearly.

    Link to bigger

    There's no way to make out the ordered array, since not every well had a read, though there are what look like tracks of reads.


  • GW_OK
    replied
    deleted due to duplication (hah)
    Last edited by GW_OK; 03-20-2017, 05:46 AM.


  • GW_OK
    replied
    Well, since I have a 3000 (but no extant PCR-free data) I wanted to look at the public 4000 data from BaseSpace, specifically:
    NA12878-PCRfree450_S3_L003_R1_001.fastq.gz
    NA12878-PCRfree450_S3_L003_R2_001.fastq.gz
    from the
    HiSeq4000: TruSeq PCRfree and Nano (350bp to 550bp insert size)
    data set.

    From what I can see in my survey of the fastq headers all of the X coordinates are set at '0', hence my confusion.

    Edit:
    Focusing on tile 1101, the headers go from
    @196:2371:H7MF5BBXX:3:1101:0:15712 1:N:0:3
    to
    @196:2371:H7MF5BBXX:3:1101:0:4312392 1:N:0:3
    Last edited by GW_OK; 03-17-2017, 05:34 AM.
