For TL;DR, just skip to the pictures at the bottom of the post.
Not sure if everyone even knows what "low diversity" means in this context. Let me give you a worst case scenario: we use the MiSeq to sequence PCR product derived from 16S V3 loop primers. What this implies is that if we take no other action, and just cluster and run these amplicons, over the first 20 bases of sequence every single cluster will read exactly the same base -- those bases from the V3 loop primer itself. That is low sample diversity -- zero sample diversity in this extreme case.
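One way to make "low diversity" concrete is to look at the base composition cycle by cycle. The sketch below is only an illustration on made-up toy reads (the 8-base `PRIMER` string and the four tails are hypothetical, not real V3 primer sequence). It shows the fraction of clusters reading the most common base dropping from 100% inside the primer to something much lower once the reads reach variable sequence.

```python
from collections import Counter

def per_cycle_base_fractions(reads, n_cycles):
    """For each cycle, return a dict mapping base -> fraction of reads showing it."""
    fractions = []
    for cycle in range(n_cycles):
        counts = Counter(read[cycle] for read in reads if len(read) > cycle)
        total = sum(counts.values())
        if total:
            fractions.append({base: n / total for base, n in counts.items()})
        else:
            fractions.append({})
    return fractions

def dominant_fraction(fractions):
    """Fraction of reads showing the most common base at each cycle."""
    return [max(f.values()) if f else 0.0 for f in fractions]

# Hypothetical toy amplicons: identical "primer" followed by variable sequence.
PRIMER = "ACGTACGT"
reads = [PRIMER + tail for tail in ("AAGC", "CTGA", "GGTT", "TCAG")]

dom = dominant_fraction(per_cycle_base_fractions(reads, 12))
for cycle, frac in enumerate(dom, start=1):
    print(f"cycle {cycle:2d}: dominant base in {frac:.0%} of reads")
```

Any cycle where the dominant fraction sits at or near 100% is a cycle where the instrument sees essentially one base across the whole flow cell.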
No need to suggest work-arounds to me, I think I am familiar with them all. Here I just want to give you a "case study" and a little background on what I would call the current state-of-the-art.
Please note that this topic has been addressed in other threads. Nothing here is particularly new or shocking. But I think an additional data point will be helpful.
If one had to pick the perennial Illumina issue, it would be the problems one encounters when sequencing low diversity libraries.
While Illumina generally tackles major issues head on and eventually solves them, the low diversity sequencing issue for some reason seems to be the one they just can't find the fortitude to directly address.
To tell you the truth, on the HiSeq it is less of an issue because only a tiny percentage of our libraries are low diversity by necessity for this instrument.
However one of the stated goals of the MiSeq is to entirely obsolete the 454. Obviously to reach that goal you have to be able to do what they call "amplicon" work. And this can include sequencing amplicons derived from a single PCR primer pair.
This is not possible on the MiSeq without using some of the workarounds. (Note I am talking v2 2x250 base MiSeq reads here.) But I wanted none of them to involve telling an investigator they had to change the way they were constructing the libraries to increase diversity. So here are the ones that remain:
(1) Spike in a percentage of some genomic DNA library (or several of them). For a zero diversity library I would pick 50%, but it is said one can drop to lower amounts using the "hard coding" workaround I will mention below.
(2) Lower cluster density. I chose 8 pM. This gets me into the 700-800 K Clusters/mm^2 range. Not sure how important this is.
(3) Hard code the matrix and phasing/prephasing values. This is the most "hard core" of the hacks. Basically it allows you to use a previous run as a "control lane" for your current run.
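To put a rough number on option (1), here is a back-of-the-envelope sketch. It assumes the spike-in library contributes each base at about 25% per cycle and that spike-in and amplicon molecules cluster equally well. Both are my simplifying assumptions, not measured values.

```python
def primer_cycle_composition(spike_fraction):
    """Expected base fractions at a cycle where every amplicon reads the same base.

    Assumes (for illustration only) a balanced spike-in library with 25% of
    each base, and equal clustering efficiency for spike-in and amplicon.
    Returns (fraction showing the primer base, fraction showing each other base).
    """
    other = spike_fraction * 0.25
    dominant = (1.0 - spike_fraction) + other
    return dominant, other

for spike in (0.0, 0.25, 0.5):
    dominant, other = primer_cycle_composition(spike)
    print(f"{spike:.0%} spike-in: primer base {dominant:.1%}, each other base {other:.1%}")
```

Under those assumptions a 50% spike-in still leaves the primer base in well over half of the clusters at each primer cycle, so the spike-in softens the imbalance rather than removing it.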
While Illumina will gladly recommend the first 2 options, as well as attempting to browbeat you into different library prep methodologies, the 3rd option is one they seem loath to offer at all. I think this is partially because the "heavy" version of this requires converting the format of some data contained in files from a previous run into the appropriate xml format and embedding that in a MiSeq configuration file. Lots of ways this can go wrong and not work at all, I think.
Anyway, for a good description of the issue and both the "heavy" and the "lite" solutions, there is a canonical site you can peruse.
To run 500 cycle kits you use a v2 MiSeq. Somewhat disconcertingly, the above mentioned site seems to make zero mention of v2 MiSeqs. Neither do documents I was able to obtain from Illumina. It does mention what I am referring to as the "lite" hard coding method. Instead of actually hacking your MiSeq configuration xml, you just copy and rename a couple of files from your control run into RTA's root directory. Then, ostensibly, RTA will make some sort of assessment of your data early in the run. Should it deem it "low diversity", it will use the data from those files to set the matrix and phasing/pre-phasing values.
Illumina tech support seemed unaware of this capability initially. They suggested I use the "heavy" method to make sure the hard coding actually happened.
Here are the results from a "worst case low diversity amplicon set"
without hard coding:
with hard coding:
Anyway, a couple of final points. First, the run using only 2 of the 3 workarounds still produced usable data. Also, much of the data assessment is the instrument's own, not really empirically determined. However, the "error rate" is said to be the result of real alignment to the phiX genome. There are some disturbing things going on there in both runs, although the hard coded run looks much better. Finally, this is a single run pair I am comparing. We all understand that makes the information presented anecdotal and that "Your Mileage May Vary".
--
Phillip