It has been reported several times that low diversity at the start of Illumina sequencing libraries can lead to a large-scale loss of data because the standard pipeline gets the initial cluster identification wrong. Researchers at our institute have generated such low-diversity libraries on numerous occasions, including libraries that were digested with restriction enzymes prior to sequencing, libraries with custom barcode tags at the start of all sequences, RRBS libraries and so on. We have developed a simple method called barcode-back-processing (bareback-processing for short), which defers cluster identification to later cycles, and are happy to announce that this study has just been published in PLoS One (bareback manuscript).
With this study we would like to raise awareness within the sequencing community that certain types of experiments can be associated with tremendous problems on the Illumina platform (up to complete failures of entire sequencing lanes), and report a potential fix for this problem. We are aware that constant new software releases and hardware requirements might mean that the solution we present is only temporary, but hopefully our findings will motivate Illumina to include a proper solution for low-diversity libraries in one of their future pipeline versions.
Our method in theory does something comparable to the unofficial and undocumented Illumina option --image-flags, which has been mentioned here on Seqanswers before (undocumented option --image_flags, changes in Illumina Pipeline SCS 2.8/RTA 1.8, multiplexing on HiSeq). We have done a couple of comparisons between different standard Illumina pipeline versions, bareback-processing and --image-flags processing, using real-world datasets from ongoing research projects. In some extreme cases, bareback-processing was able to recover more than 33 million good-quality sequencing reads from a lane which produced literally 0 sequences with the standard Illumina pipeline SCS v2.6/OLB v1.6 processing and around 30,000 sequences with OLB v1.8 processing (see Supplementary Figures 1 and 2 for comparisons between --image-flags and bareback-processing). Interestingly, --image-flags was also very good at recovering extra sequences. However, we found that something odd is going on when this undocumented option is used: a fairly large percentage of reads contain poor-quality base calls and/or many more Ns in the sequences. In summary, it appears that bareback-processing often produces more high-quality reads than the built-in but undocumented option --image-flags.
Bareback-processing works by moving the raw cluster image files containing the initially biased sequences to the end of the reads before invoking the Illumina pipeline. After the analysis has been completed, the cycles containing the low-diversity sequences are moved back to the start of the sequence reads. This of course implies that it can only be applied if the actual image files are being stored (so it will not work for HiSeq machines, even though they will still suffer from exactly the same problems!). For Illumina GAIIx machines one either needs to run SCS v2.6 (which allows storing image information) and then reprocess from the images, preferably with OLB 1.8 (although this option will soon be unavailable), or upgrade the instrument PC hardware to at least a T7500. It will be interesting to see what future versions of the Illumina pipeline are going to offer...
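To make the cycle shuffling described above a bit more concrete, here is a minimal Python sketch of the bookkeeping only. It is not the published bareback implementation and it does not touch any Illumina files or directory layouts; the function names `shifted_cycle_order` and `restore_read` and the parameter `biased_cycles` are made up for illustration. It assumes a single read in which the first `biased_cycles` cycles are the low-diversity ones.

```python
def shifted_cycle_order(total_cycles, biased_cycles):
    """Return the original 1-based cycle numbers in the order the pipeline
    would see them once the first `biased_cycles` cycles are moved to the end.
    (Hypothetical helper, for illustration only.)"""
    if not 0 <= biased_cycles <= total_cycles:
        raise ValueError("biased_cycles must be between 0 and total_cycles")
    cycles = list(range(1, total_cycles + 1))
    # Diverse cycles come first, low-diversity cycles are appended at the end.
    return cycles[biased_cycles:] + cycles[:biased_cycles]


def restore_read(processed, biased_cycles):
    """Move the last `biased_cycles` characters of a processed read (or of its
    quality string) back to the start, restoring the original cycle order."""
    if not 0 <= biased_cycles <= len(processed):
        raise ValueError("biased_cycles must be between 0 and the read length")
    if biased_cycles == 0:
        return processed
    return processed[-biased_cycles:] + processed[:-biased_cycles]


# Toy example: a 6-cycle read whose first 2 cycles are the shared tag "AT".
order = shifted_cycle_order(6, 2)        # cycles as presented to the pipeline
original = "ATCCGG"
processed = "".join(original[c - 1] for c in order)  # what base calling yields
restored = restore_read(processed, 2)    # tag moved back to the front
```

The same `restore_read` step would have to be applied to the quality string of each read so that bases and qualities stay in register.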
The images of two lanes will soon be available for download from the SRA, one being a highly diverse control library (PhiX), the other being a library with very low initial diversity (all sequences are supposed to have the first 12 bp in common). These are lanes 1 and 4 from Supplementary Figure 2.
If you have any questions or comments please get in touch!