NextSeq Data

  • NextSeq Data

    We recently acquired a NextSeq machine and are not very impressed with the data. I've uploaded a spreadsheet containing some of the statistics here:

    https://drive.google.com/file/d/0B3l...ew?usp=sharing

    The first tab is a HiSeq2000 2x150bp run. The insert size was below target, so I trimmed adapters before analyzing the data (no other preprocessing was run). The HS2000 is not really spec'd for 2x150, so as you might imagine, the quality suffers toward the end. Regardless, it's pretty good: looking at the mapping stats, 99.55% of the reads mapped, and overall 79.85% of the reads were error-free.

    The next two tabs contain a couple of lanes of NextSeq bacterial sequence. Lane 1 generally seems to be the best, with quality dropping to a minimum at lane 4. But even for lane 1, only 96.47% of the reads mapped and 49.3% were perfect matches; by lane 4, 95.91% mapped and 38.91% were perfect. So the rate of reads with errors rose 2.5- to 3-fold from HS2000 (20.15%, on a platform that does not support 2x150bp runs) to NextSeq (50.7% to 61.09%, on a platform that supposedly does). As you can see on the "Average Quality by Position" and "Error Rate vs Read Position" graphs, the comparison would be brutal - an order of magnitude or more - if you consider 2x100bp reads. Also, if you look at the "Quality Score Accuracy" graph, the HS2000 quality scores are fairly accurate and typically underestimate quality, while the NextSeq ones are inaccurate and overestimate quality by about 10 dB (and are quantized), so you can't easily quality-trim the NextSeq data to improve it.
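The arithmetic behind these comparisons can be checked with a short sketch. The percentages are the ones quoted above; the function names are made up for illustration, the HiSeq "error-free" and NextSeq "perfect" figures are assumed to measure the same thing, and the Phred relation Q = -10*log10(p) is the standard definition rather than anything taken from the spreadsheet.

```python
def error_read_pct(perfect_pct):
    """Percent of reads with at least one error, given the percent that are perfect."""
    return 100.0 - perfect_pct


def phred_to_error_prob(q):
    """Standard Phred definition: Q = -10 * log10(p), so p = 10^(-Q/10)."""
    return 10 ** (-q / 10.0)


hiseq = error_read_pct(79.85)      # 20.15% of HiSeq reads had an error
ns_lane1 = error_read_pct(49.3)    # 50.7% for NextSeq lane 1
ns_lane4 = error_read_pct(38.91)   # 61.09% for NextSeq lane 4

ratio_lane1 = ns_lane1 / hiseq
ratio_lane4 = ns_lane4 / hiseq
print(f"lane 1: {ratio_lane1:.2f}x, lane 4: {ratio_lane4:.2f}x")

# A quality score overstated by 10 means the true error probability is
# ten times the claimed one. Q30 is only an illustrative value here.
claimed_q = 30
actual_q = claimed_q - 10
inflation = phred_to_error_prob(actual_q) / phred_to_error_prob(claimed_q)
print(f"true error rate is {inflation:.0f}x the claimed rate")
```

The 10x inflation is why trimming on the reported scores does not help much: a base reported at a given quality behaves like one ten points lower.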

    The "Library Uniqueness" graph, generated by sampling a kmer from each read and hashing it to see if it was seen before, is also very odd for NextSeq: it is wavy. The graph should decrease monotonically, and any increase indicates a sudden error burst. So it seems the period (~625000 reads) may correspond to an image frame, with the clusters around the edges of the frame being blurry, as one might expect from low-quality or miscalibrated optics.
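As a rough illustration of the idea (not bbcountunique.sh's actual implementation, which is not described here), a uniqueness curve can be sketched as below. The function name, the choice of the first kmer as the sample, and the window size are all assumptions. A read with a sequencing error inside the sampled kmer produces a kmer that has never been seen, which is why an error burst pushes the curve back up.

```python
def uniqueness_curve(reads, k=25, interval=25000):
    """Percent of reads in each window of `interval` reads whose sampled kmer is new.

    Assumption: the sampled kmer is simply the first k bases of the read.
    """
    seen = set()          # Python sets hash their members internally
    curve = []
    new_in_window = 0
    processed = 0
    for read in reads:
        if len(read) < k:
            continue      # too short to sample a kmer
        processed += 1
        kmer = read[:k]
        if kmer not in seen:
            seen.add(kmer)
            new_in_window += 1
        if processed % interval == 0:
            curve.append(100.0 * new_in_window / interval)
            new_in_window = 0
    return curve


# Tiny example: the third read repeats the first, so the second window
# contains only one new kmer out of two reads.
reads = ["ACGTACGT", "TTTTCCCC", "ACGTACGT", "GGGGAAAA"]
curve = uniqueness_curve(reads, k=8, interval=2)
print(curve)  # [100.0, 50.0]
```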

    The Base Frequency vs Position graph is also interesting - NextSeq has a clear A/T ratio bias that is not present in HS data. The 3bp-wavelength sawtooth pattern probably has something to do with codon structure.
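For anyone wanting to reproduce this kind of plot on their own reads, here is a minimal sketch of a per-position base-frequency table and the A/T ratio derived from it. It is not how BBMap computes its bhist output; the function names are hypothetical, and reads are assumed to be plain sequence strings already parsed from FASTQ.

```python
from collections import Counter


def base_frequency_by_position(reads):
    """Return one dict per read position mapping base -> fraction of reads."""
    counts = []
    for read in reads:
        for i, base in enumerate(read.upper()):
            if i == len(counts):
                counts.append(Counter())   # first read long enough to reach position i
            counts[i][base] += 1
    freqs = []
    for c in counts:
        total = sum(c.values())
        freqs.append({b: c[b] / total for b in "ACGTN" if c[b]})
    return freqs


def at_ratio(freq):
    """A/T ratio at one position, or None when no T was observed there."""
    t = freq.get("T", 0.0)
    return freq.get("A", 0.0) / t if t else None


reads = ["ACGT", "AAGT", "TCGA"]
freqs = base_frequency_by_position(reads)
ratios = [at_ratio(f) for f in freqs]
print(ratios)
```

In unbiased double-stranded data the A and T fractions should be close to equal at every position, so a ratio that stays away from 1 across the read is the kind of bias described above.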

    Does anyone else have data they'd like to share on NextSeq machines?

    P.S. Command lines I used:

    Code:
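    # Library uniqueness curve from the first 100M reads
    bbcountunique.sh in=reads.fq.gz reads=100000000 out=uniqueness.txt
    
    # Adapter-trim the first 4M reads against the Nextera and TruSeq adapter sequences
    bbduk.sh in=reads.fq.gz reads=4000000 ktrim=r k=25 hdist=1 mink=12 tbo tpe ref=nextera.fa,truseq.fa out=ktrimmed.fq.gz ow
    
    # Map the trimmed reads and write the match, insert-size, base-composition, identity, error, and quality histograms
    bbmap.sh in=ktrimmed.fq.gz reads=4000000 mhist=mhist.txt ihist=ihist.txt bhist=bhist.txt idhist=idhist.txt ehist=ehist.txt qhist=qhist.txt idbins=200 qahist=qahist.txt aqhist=aqhist.txt indelhist=indelhist.txt gchist=gchist.txt
    
    # Merge overlapping pairs to get an insert-size histogram without mapping
    bbmerge.sh in=ktrimmed.fq.gz reads=4000000 ihist=ihist_merge.txt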

  • #2
    Thanks, Brian, for posting your analysis results. I wonder whether the HiSeq reads are also from a bacterial DNA library and were prepared using the same protocol as the NextSeq ones.



    • #3
      The HiSeq reads are bacterial, but from a collection of 26 different isolates mixed together to form a synthetic metagenomic community. I don't know much about the preparation protocols, but certainly the insert sizes differ substantially, so at least size selection was probably different; maybe shearing too.



      • #4
        Interesting, thanks very much for the detailed analysis and your thoughts. So the data looks a little worse than HiSeq, I agree, but they're at an early stage with the NextSeq chemistry. Far more serious would be the use of low quality optics, which would be understandable at that price point.

        Any thoughts or observations on de novo assembly or SNP calling? I believe I saw a post on SeqAnswers saying SNP calling works fine on the NextSeq at the expense of a few more indel errors (compared to HiSeq data).

        We are interested in a direct comparison against the Ion Proton. These details indicate the indel error rate is a lot lower here than what I've heard comes off the Proton. This is very important for getting good de novo assemblies, of course.

        Thanks again.



        • #5
          Hi Brian,

          We are looking into purchasing a NextSeq, but we have a concern about the quality of the reads it generates. Has your experience with the NextSeq improved since then?

          Your input is highly appreciated.

          James



          • #6
            V2 chemistry has substantially higher quality than V1; it's basically fine. However, it still has some issues with the barcode-reading cycles, which have caused problems with multiplexed runs; we've had some runs in which certain barcodes are misread ~95% of the time, and thus get demultiplexed into the unknown bin. Last I heard, Illumina was aware of this issue and working on it; not sure what the current status is.
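To make the "unknown bin" behaviour concrete, here is a minimal sketch of mismatch-tolerant barcode assignment. It is not Illumina's demultiplexing code; the function names, the sample sheet, and the one-mismatch tolerance are assumptions for illustration. A barcode read that lands more than the allowed number of mismatches from every expected barcode, or equally close to two of them, cannot be assigned and falls through to "unknown".

```python
def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def assign_barcode(observed, barcodes, max_mismatches=1):
    """Return the sample with the unique closest barcode, else "unknown"."""
    best = []
    best_dist = max_mismatches + 1      # anything at or beyond this is rejected
    for sample, bc in barcodes.items():
        if len(bc) != len(observed):
            continue
        d = hamming(observed, bc)
        if d < best_dist:
            best, best_dist = [sample], d
        elif d == best_dist and d <= max_mismatches:
            best.append(sample)         # tie: ambiguous between samples
    return best[0] if len(best) == 1 else "unknown"


# Hypothetical sample sheet with two 6bp barcodes.
barcodes = {"sampleA": "ATCACG", "sampleB": "CGATGT"}

exact = assign_barcode("ATCACG", barcodes)       # perfect read of sampleA's barcode
one_off = assign_barcode("ATCACC", barcodes)     # one miscalled base, still rescued
garbled = assign_barcode("TTTTTT", barcodes)     # badly misread, unassignable
print(exact, one_off, garbled)
```

With barcodes misread ~95% of the time, most reads from the affected samples would take the last path regardless of how good the insert reads are.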



            • #7
              Brian,

              Thanks for your reply. Were the misread barcodes from Illumina, or were they custom ones prepared by you or your end user?

              Thanks

              James



              • #8
                I think they were Illumina TruSeq, but it's possible they were custom. They worked fine on HiSeq and MiSeq, though, and on NextSeq with V1 chemistry.
