A first look at Illumina’s new NextSeq 500


  • #16
    Originally posted by GenoMax
    @nucacidhunder: All those appear to be "standard" (gold?) samples.

    Brian: If PhiX standard does not look good then that is worrisome.
    I agree, which is why it's surprising that Illumina has made it publicly available - and indeed, that's what they pointed me to when I first started questioning them about NextSeq's quality. Currently they are claiming that a NextSeq machine is 'in spec' as long as at least 70% of the bases are labeled by the machine as at least Q30, regardless of whether those bases are actually correct! I don't know whether this is a group of employees misinterpreting the specification, or if it really is Illumina's official policy, but it's worrisome either way.

    Our machine self-reports 87% of bases as having quality above 30, and therefore Illumina claims it is in-spec, but the true quality as measured by mapping for the highest-rated bases (claimed Q37) is only Q28, and the majority are much lower. In other words, bases the machine assigns Q37 are wrong 0.16% of the time rather than the claimed 0.02%, so the true error rate is understated by a factor of 8. In reality, 0% of the output is at least Q30, either from our machine or from Illumina's official PhiX data, which I used because they calibrate their machines on PhiX so it should represent the best-case scenario.
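For readers who want to check the arithmetic, here is a minimal sketch of the standard Phred conversion (Q = -10 * log10(p)). The function and variable names are made up for illustration; only the Q37 and Q28 figures come from the post.

```python
import math

def phred_to_error(q):
    """Error probability implied by a Phred quality score."""
    return 10 ** (-q / 10)

def error_to_phred(p):
    """Phred quality score implied by an error probability."""
    return -10 * math.log10(p)

claimed_q = 37   # quality the machine assigns (figure from the post)
observed_q = 28  # quality measured by mapping (figure from the post)

claimed_err = phred_to_error(claimed_q)
observed_err = phred_to_error(observed_q)
inflation = observed_err / claimed_err  # how much the error rate is understated

print(f"claimed error rate:  {claimed_err:.4%}")
print(f"observed error rate: {observed_err:.4%}")
print(f"understated by a factor of about {inflation:.1f}")
```

A 9-point gap in Q always corresponds to a ratio of 10^0.9 in error rate, which is where the factor of roughly 8 comes from.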

    Does anyone have a different experience?
    Last edited by Brian Bushnell; 12-04-2014, 09:30 PM.



    • #17
      Sorry, no NextSeq data to discuss. But on the issue of quality values vs. empirical error rate -- it always seemed to me that this would depend heavily on the alignment engine and the parameters used, specifically on how gaps (indels) were handled.
      A single indel in a read results in nearly all the bases downstream of that indel being scored as "mismatch" unless a gap is introduced into the alignment.
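A toy illustration of that point, using a made-up reference string and a hypothetical error-free read with one deleted base. It is only a sketch of the effect, not how any real aligner scores reads.

```python
def count_mismatches(read, ref):
    """Count positions where two equal-length strings differ."""
    return sum(a != b for a, b in zip(read, ref))

ref = "ACGTTGCAAGCTTAGGCTAA"  # made-up reference
del_pos = 5                   # position of the deleted base
read = ref[:del_pos] + ref[del_pos + 1:]  # otherwise perfect read

# Ungapped: compare the read base-by-base against the start of the reference.
ungapped = count_mismatches(read, ref[:len(read)])

# Gapped: skip the deleted reference base, then compare the two halves.
gapped = (count_mismatches(read[:del_pos], ref[:del_pos])
          + count_mismatches(read[del_pos:], ref[del_pos + 1:]))

print("ungapped mismatches:", ungapped)
print("gapped mismatches:  ", gapped, "(plus one deletion)")
```

Without the gap, most positions downstream of the deletion are counted as mismatches even though the read contains no substitution errors at all.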

      Seems like how gaps are handled could easily explain what Illumina (and I) would call a Q37 base showing up as only Q30 in your analysis. Depending on how you did your alignments...

      --
      Phillip



      • #18
        The alignments were done by an indel-capable aligner. That's not the problem. In fact, the actual quality scores are calculated separately for bases impacted by mismatches only and for bases impacted by indels or SNPs. Furthermore, the exact same analysis was done for HiSeq, MiSeq, and NextSeq, and NextSeq is the only one with the major quality issues.

        Here, let me show you. These graphs were all generated by mapping after adapter-trimming the input reads. This is from a HiSeq2500, which shows low error rates and accurate (generally conservative) quality scores:



        And this is from a NextSeq, which shows extremely high error rates and vastly inflated quality scores:



        You can plainly see that something is very wrong without any mapping whatsoever, just by looking at the base frequency histogram:


        Possibly, the high error rate is driven by the A/T ratio divergence, and thus due to a fundamental base-calling or dye-system issue, but I don't know. At any rate, the base frequency divergence, the inflated Q-scores, and the high error rates have now been seen on 3 different independent NextSeq platforms at 3 different facilities (ours, Illumina's, and one of our collaborators') with unrelated organisms and libraries. I have yet to see a NextSeq run from anywhere that did not exhibit these characteristics, but now that I have 3 independent confirmations, I don't really expect that I will see one.
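As a rough sketch of what a per-position base frequency histogram measures, the snippet below tallies base fractions at each read position and reports the gap between the A and T fractions. The reads are made-up toy data and the function names are hypothetical; this is not BBMap's implementation. For a shotgun library read from both strands, one would normally expect the A and T fractions to track each other closely.

```python
from collections import Counter

def base_frequencies(reads):
    """Return one dict per read position mapping base -> fraction."""
    longest = max(len(r) for r in reads)
    counts = [Counter() for _ in range(longest)]
    for read in reads:
        for pos, base in enumerate(read):
            counts[pos][base] += 1
    freqs = []
    for c in counts:
        total = sum(c.values())
        freqs.append({b: c[b] / total for b in "ACGTN"})
    return freqs

def at_divergence(freqs):
    """Absolute difference between the A and T fractions at each position."""
    return [abs(f["A"] - f["T"]) for f in freqs]

reads = ["ACGT", "ACGA", "TCGA", "ACTA"]  # made-up toy reads
freqs = base_frequencies(reads)
print(at_divergence(freqs))
```

On real data the interesting signal is a divergence that persists or grows across cycles rather than a value at any single position.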

        The way I produced these graphs (starting with interleaved reads, and using BBTools):

        bbduk.sh in=reads.fastq.gz out=trimmed.fq.gz ktrim=r k=23 hdist=1 mink=11 tpe tbo minlen=90 ref=truseq.fa.gz,nextera.fa.gz

        bbmap.sh maxindel=200 in=trimmed.fq.gz mhist=mhist.txt bhist=bhist.txt qhist=qhist.txt qahist=qahist.txt

        I encourage anyone who is unable to share their raw data to do the same, and share the histograms. Ideally, do this for the same library sequenced on both a NextSeq and a HiSeq/MiSeq, to eliminate as many variables as possible.
        Attached Files



        • #19
          Yow, that is not good news! I can't help but want to blame it on the two-color chemistry, even though I have no basis to do so.

          Except -- I mean it still could be an indel issue -- if indels were more common with the NextSeq. What I would fear about this instrument would be bubbles in the flowcell. Seems like it would be hard to distinguish no signal (bubble) from no signal (G?). Although I have been assured that the two do look different.

          Also, bubbles in the flowcell may be a HiSeq-only thing; I don't know that NextSeqs would have any.

          --
          Phillip



          • #20
            @Brian: Are these results from one NextSeq or do you have an n of > 1?



            • #21
              The graphs I posted in this thread are from one NextSeq, but I have generated similar graphs from multiple libraries run on 3 independent NextSeq machines at 3 different facilities (one being Illumina), and they all look about the same.



              • #22
                Damn, thanks Brian. I woke up this morning thinking that maybe I should try a NextSeq run instead of HiSeq 2000 for this chapter of my dissertation. It seemed like I might be able to get a slightly better assembly for the money, given the longer PE reads available. I don't so much think so, now.



                • #23
                  You can (get the long reads)

                  Provided you have access to the right HiSeq 2500. One can now do 2 x 250 PE runs.



                  • #24
                    Data quality is definitely inferior to both the MiSeq and HiSeq.
                    It's quick though, and perhaps more suited to counting applications, such as RNA-Seq and ChIP-Seq, than to variant calling.
                    The question is whether the error is systematic or random. Random error can be somewhat compensated for by a decent sequence depth.
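To make the depth argument concrete, here is a small sketch under simplifying assumptions: errors are independent between reads, and (pessimistically) every wrong read shows the same wrong base, so a site is miscalled when more than half the reads are wrong. The 2% per-base error rate is an illustrative value, and the function name is hypothetical.

```python
from math import comb

def majority_error(per_base_error, depth):
    """Probability that more than half of `depth` reads are wrong at a site.

    Assumes independent errors and that all wrong reads agree on the
    same wrong base (a worst-case simplification).
    """
    need = depth // 2 + 1  # smallest number of wrong reads that wins the vote
    return sum(
        comb(depth, k) * per_base_error**k * (1 - per_base_error)**(depth - k)
        for k in range(need, depth + 1)
    )

for depth in (1, 5, 15, 31):
    print(depth, majority_error(0.02, depth))
```

The probability falls quickly with depth for random errors. It does not fall at all for systematic errors, where the same position is miscalled in most reads, which is why the distinction matters.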

                    I'll attempt to post the QC from the Illumina PhiX we sequenced during training.



                    • #25
                      Originally posted by TonyBrooks
                      The question is whether the error is systematic or random. Random error can be somewhat compensated for by a decent sequence depth.
                      Yep, I plan to plot the error rate across a genome and see if I can see some kind of pattern, but I have not had time to do that yet.

                      I'll attempt to post the QC from the Illumina PhiX we sequenced during training.
                      That would be great!



                      • #26
                        Originally posted by GenoMax
                        You can (get the long reads)

                        Provided you have access to the right HiSeq 2500. One can now do 2 x 250 PE runs.


                        One can... provided one has access to the machine, and more than a tiny pilot grant to work with. Sadly, when one works on non-model insects for non-agricultural/biomedical purposes, and one is only a wee third-year, one might only have ~$2500 to spend on the run itself. (Not that one is complaining. One is really super pleased about that.)

                        Enough of my de-railing, though -- really looking forward to updates from TonyBrooks, because my application is a de novo transcriptome project, primarily interested in the gene expression. Thanks one and all.



                        • #27
                          So I've been getting some initial test data back from a NextSeq 500 and I'm really not happy with it compared to the HiSeq.

                          The data quality from the NextSeq is substantially worse than that from a HiSeq, with enough additional errors that I'm not certain the data is usable for low- to medium-coverage whole-genome sequence variant calling (1-20x).

                          Attached are a number of different PDFs showing the data compared to HiSeq data from the same facility (the same experienced technicians did all the sequencing, library prep and everything). We resequenced our HiSeq libraries (PCR-free, 550bp insert) to compare like with like, and you can clearly see the difference.

                          Two of the files show GATK's BQSR before-and-after plots for one of our typical HiSeq libraries (recalQC-randomHiseq.pdf) and a HiSeq library sequenced on the NextSeq (BQSR-NextSeq-Before-After.pdf). The difference is substantial, and while these are not the same library, the HiSeq one is representative of what we usually get.

                          The second two files show the same library with 4 lanes of NextSeq sequence vs. the same library sequenced on the HiSeq; you'll clearly be able to determine which comes from which machine (Nxt, Nxt, Nxt, Nxt, Hiseq).

                          Finally, here are some alignment stats from Picard tools for the same library sequenced twice on the NextSeq (two different runs) vs. the stats for the same library from the HiSeq, showing a 1-2 percentage point reduction in reads aligned and a 66-89% increase in mismatch rate:

                          Seq         PCT_PF_READS_ALIGNED  PF_MISMATCH_RATE  PF_HQ_ERROR_RATE
                          NextSeq_R2  0.964464              0.022694          0.021512
                          NextSeq_R1  0.955108              0.025834          0.024588
                          HiSeq       0.973545              0.013678          0.013063
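The comparison against the HiSeq row can be reproduced with a few lines of arithmetic. The numbers below are copied from the table above; the dictionary layout and function name are just illustrative choices.

```python
# Values copied from the Picard table in the post.
stats = {
    "NextSeq_R2": {"aligned": 0.964464, "mismatch": 0.022694},
    "NextSeq_R1": {"aligned": 0.955108, "mismatch": 0.025834},
    "HiSeq":      {"aligned": 0.973545, "mismatch": 0.013678},
}

def compare_to_baseline(stats, baseline="HiSeq"):
    """Express each run's alignment drop and mismatch increase vs. the baseline."""
    base = stats[baseline]
    result = {}
    for name, row in stats.items():
        if name == baseline:
            continue
        result[name] = {
            # absolute drop in aligned fraction, in percentage points
            "aligned_drop_points": (base["aligned"] - row["aligned"]) * 100,
            # relative increase in mismatch rate, in percent
            "mismatch_increase_pct": (row["mismatch"] / base["mismatch"] - 1) * 100,
        }
    return result

for name, row in compare_to_baseline(stats).items():
    print(name, {k: round(v, 1) for k, v in row.items()})
```

Note that the alignment figure is an absolute difference in percentage points, while the mismatch figure is a relative increase.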


                          Now, the data isn't entirely unusable for WGS: if you have enough coverage you can still get variant calls out of it. However, they're likely to have a higher false-positive rate, and if you were looking for rare variants I would be very hesitant to use the data (especially for de novo mutations). For other uses this may be fine, but I only have experience with WGS and RNA-seq, so I'll leave that for others to decide.
                          Attached Files



                          • #28
                            Originally posted by aeonsim
                            The data quality from the NextSeq is substantially worse than that from a HiSeq, with enough additional errors that I'm not certain the data is usable for low- to medium-coverage whole-genome sequence variant calling (1-20x).
                            I would certainly not want to use it for low-coverage variant calling!

                            Incidentally, though, it seems the NextSeq platform may have a silver lining. Though all of the standard data quality metrics are much worse than HiSeq in my testing, it appears to have a drastically lower cross-contamination rate (reads from one library assigned to a different library) for dual-index pooled libraries, to the point that we are considering using NextSeq over HiSeq for projects in which index cross-contamination is more important than error rate. We are still investigating why the rate is lower.



                            • #29
                              Originally posted by Brian Bushnell
                              I would certainly not want to use it for low-coverage variant calling!

                              Incidentally, though, it seems the NextSeq platform may have a silver lining. Though all of the standard data quality metrics are much worse than HiSeq in my testing, it appears to have a drastically lower cross-contamination rate (reads from one library assigned to a different library) for dual-index pooled libraries, to the point that we are considering using NextSeq over HiSeq for projects in which index cross-contamination is more important than error rate. We are still investigating why the rate is lower.
                              One possible trivial reason could be whether mismatches between an index read and the index sequence are allowed. HiSeq and MiSeq allow 1 mismatch by default. But we demultiplex off-instrument and allow zero mismatches.

                              --
                              Phillip



                              • #30
                                Originally posted by pmiguel
                                One possible trivial reason could be whether mismatches between an index read and the index sequence are allowed. HiSeq and MiSeq allow 1 mismatch by default. But we demultiplex off-instrument and allow zero mismatches.

                                --
                                Phillip
                                We are also allowing 0 mismatches in both cases (and typically end up with >20% of reads in the unknown bin, as a result). Right now our 2 leading candidate hypotheses are:

                                1) NextSeq has much lower cluster density;
                                2) NextSeq has a different order of {read1, read2, index1, index2, resynthesis} compared to HiSeq/MiSeq.
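For anyone unfamiliar with the mismatch setting discussed above, here is a minimal sketch of index assignment by Hamming distance. The index sequences, sample names and function names are hypothetical, and it handles a single index rather than a dual-index pair; it is not the logic of any particular demultiplexing tool.

```python
def hamming(a, b):
    """Number of differing positions between two equal-length strings."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(x != y for x, y in zip(a, b))

def assign_index(observed, known, max_mismatches=0):
    """Return the one sample whose index is within max_mismatches.

    Reads matching no sample, or more than one, go to 'unknown'.
    """
    hits = [sample for sample, idx in known.items()
            if hamming(observed, idx) <= max_mismatches]
    return hits[0] if len(hits) == 1 else "unknown"

known = {"sampleA": "ATCACG", "sampleB": "CGATGT"}  # hypothetical indexes

print(assign_index("ATCACG", known))                    # exact match
print(assign_index("ATCACT", known))                    # one error, strict mode
print(assign_index("ATCACT", known, max_mismatches=1))  # one error, lenient mode
```

With zero mismatches allowed, any sequencing error in the index read sends the read to the unknown bin, which trades yield for a lower chance of assigning a read to the wrong library.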
