  • fkrueger
    Senior Member
    • Sep 2009
    • 627

    Loss of data in low-diversity libraries can be recovered by deferred cluster calling

    It has been reported several times that low diversity at the start of Illumina sequencing libraries can lead to large-scale loss of data because the standard pipeline gets the initial cluster identification wrong. Researchers at our institute have generated such low-diversity libraries on numerous occasions, including libraries digested with restriction enzymes prior to sequencing, libraries with custom barcode tags at the start of all sequences, RRBS libraries and so on. We have developed a simple method called barcode-back-processing (or bareback-processing for short) which defers cluster identification to later cycles, and we are happy to announce that this study has just been published in PLoS One (bareback manuscript).

    With this study we would like to raise awareness within the sequencing community that certain types of experiments can be associated with tremendous problems on the Illumina platform (up to complete failures of entire sequencing lanes), and to report a potential fix for this problem. We are aware that constant new software releases and hardware requirements might mean that the solution we present is only temporary, but hopefully our findings will motivate Illumina to include a proper solution for low-diversity libraries in one of their future pipeline versions.

    In theory, our method does something comparable to the unofficial and undocumented Illumina option --image-flags, which has been mentioned here on Seqanswers before (undocumented option --image_flags, changes in Illumina Pipeline SCS 2.8/RTA 1.8, multiplexing on HiSeq). We have run a couple of comparisons between different standard Illumina pipeline versions, bareback-processing and --image-flags processing, using real-world datasets from ongoing research projects. In some extreme cases, bareback-processing was able to recover more than 33 million good-quality sequencing reads from a lane which produced literally 0 sequences with the standard Illumina pipeline SCS v2.6/OLB v1.6 processing and around 30,000 sequences with OLB v1.8 processing (see Supplementary Figures 1 and 2 for comparisons between --image-flags and bareback-processing). Interestingly, --image-flags was also very good at recovering extra sequences. However, something odd seems to be going on when this undocumented option is used, as quite a large percentage of reads contain poor-quality base calls and/or many more Ns in the sequences. In summary, it appears that bareback-processing often produces more high-quality reads than the built-in but undocumented option --image-flags.

    Bareback-processing works by moving the raw cluster image files containing the initially biased sequences to the end of the reads before invoking the Illumina pipeline. After the analysis has been completed, the cycles containing the low-diversity sequences are moved back to the start of the sequence reads. This of course implies that it can only be applied if the actual image files are being stored (so it will not work for HiSeq machines, even though they will still suffer from exactly the same problems!). For Illumina GAIIx machines one either needs to run SCS v2.6 (which allows storing image information) and then reprocess from the images, preferably with OLB 1.8 (although this option will soon be unavailable), or upgrade the instrument PC hardware to at least a T7500. It will be interesting to see what future versions of the Illumina pipeline are going to offer...
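The reordering idea described above can be illustrated with a minimal Python sketch. This is not the published bareback script: the function names `bareback_cycle_order` and `restore_read` are hypothetical, and the sketch only works on cycle numbers and already-called reads, not on the actual image files or the pipeline's file layout.

```python
def bareback_cycle_order(n_cycles, n_biased):
    """Return the original cycle numbers in the order shown to the pipeline.

    The first n_biased (low-diversity) cycles are moved to the end, so that
    cluster identification happens on the diverse cycles that follow them.
    """
    if not 0 <= n_biased <= n_cycles:
        raise ValueError("n_biased must be between 0 and n_cycles")
    diverse = list(range(n_biased + 1, n_cycles + 1))
    biased = list(range(1, n_biased + 1))
    return diverse + biased


def restore_read(seq, qual, n_biased):
    """Move the last n_biased calls of a reordered read back to the front."""
    if len(seq) != len(qual):
        raise ValueError("sequence and quality strings differ in length")
    if not 0 <= n_biased <= len(seq):
        raise ValueError("n_biased must be between 0 and the read length")
    if n_biased == 0:
        return seq, qual
    return (seq[-n_biased:] + seq[:-n_biased],
            qual[-n_biased:] + qual[:-n_biased])


# A 6-cycle read whose first 2 cycles are a shared tag:
order = bareback_cycle_order(6, 2)
fixed_seq, fixed_qual = restore_read("ACGATT", "IIIIHH", 2)
```

Here `order` lists which original cycle ends up in each position of the reordered run, and `restore_read` undoes the shuffle on the called bases and their quality scores once the analysis has finished.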

    The images of two lanes will soon be available for download from the SRA: one is a highly diverse control library (PhiX), the other a library with very low initial diversity (all sequences are supposed to have the first 12 bp in common). These are lanes 1 and 4 from Supplementary Figure 2.

    If you have any questions or comments please get in touch!
  • GERALD
    Member
    • Jun 2010
    • 20

    #2
    bareback

    Actually, I have tried this myself and found it to be true. I just wrote a Perl script to copy all the files and rename them (I called it goatfooler). Then I ran CASAVA and used another script to copy the tags and qscores back to the front of each call. After re-calling, my base calling went from utter failure to complete success. I also tried the undocumented --image-flags option and, just as you described, it didn't work very well. My Illumina rep was utterly baffled by my results. It would be really nice if Illumina provided more documentation of how they do their basecalling. I'm glad to hear that someone else obtained similar results from their analysis.


    • C.R.
      Member
      • Jun 2010
      • 25

      #3
      I strongly agree. This is a big problem and Illumina does not pay attention to it. In general my libraries are OK, since they worked in a test run on a Genome Analyzer. Now I have had 5 RRBS samples sequenced on a HiScanSQ, but all reads are trash due to the problem which is nicely described in your paper. Illumina tech support has not helped so far. For more than a week now they have only kept telling us that there was no technical problem during sequencing. Well, this is true, because the control lane and 2 ChIP-Seq lanes are OK. Unfortunately, it seems that no high-resolution images were recorded, so I cannot use your software. Thank you very much for your helpful comments so far, Felix!
      Is there anybody else who can tell me what needs to be considered for a successful Illumina HiSeq / HiScanSQ sequencing of RRBS libraries?


      • NextGenSeq
        Senior Member
        • Apr 2009
        • 482

        #4
        We just had this same issue with our HiSeq 2000. How can we reanalyze this without the image files? Can this be done using the CIF files?


        • fkrueger
          Senior Member
          • Sep 2009
          • 627

          #5
          I am afraid this won't work if you don't have the saved images. Did you lose entire lanes or just a certain fraction of it?
          Last edited by fkrueger; 04-27-2011, 10:37 AM.


          • NextGenSeq
            Senior Member
            • Apr 2009
            • 482

            #6
            A fraction, the data quality drops off quickly after the barcode.

            It's infuriating that Illumina has done nothing about this when they've known about this for years.


            • HESmith
              Senior Member
              • Oct 2009
              • 512

              #7
              I'll be the first to admit that Illumina has made some mistakes (for example, generating a file format that its aligner cannot read), and they could do a better job of advertising the issue, but the decision not to save the image files seems a reasonable trade off (although it would be nice to have the option to save). Transferring the images to the server had become the bottleneck for sequencing runs, and the problem was exacerbated when they rolled out the HiSeq. There are a couple of straightforward non-computational solutions: use custom sequencing primers if there's no diversity, or design multiple balanced barcodes for each sample to introduce diversity.
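To make the "balanced barcodes" suggestion concrete, here is a rough Python sketch that reports the base composition at each barcode position and flags positions dominated by one base. The function names and the 0.5 cut-off are assumptions chosen for illustration, not an Illumina specification, and the calculation assumes all samples are pooled in equal amounts.

```python
from collections import Counter


def per_cycle_base_fractions(barcodes):
    """Return one dict per barcode position with the fraction of A, C, G, T."""
    if len({len(b) for b in barcodes}) != 1:
        raise ValueError("need at least one barcode, all of the same length")
    fractions = []
    for column in zip(*barcodes):
        counts = Counter(column)
        fractions.append({base: counts[base] / len(column) for base in "ACGT"})
    return fractions


def low_diversity_cycles(barcodes, max_fraction=0.5):
    """Return 1-based positions where one base exceeds max_fraction.

    max_fraction is an arbitrary illustrative threshold.
    """
    return [
        cycle
        for cycle, frac in enumerate(per_cycle_base_fractions(barcodes), start=1)
        if max(frac.values()) > max_fraction
    ]


balanced = ["ACGT", "CGTA", "GTAC", "TACG"]
unbalanced = ["ACGT", "ACGA", "ACTC", "AGGG"]
print(low_diversity_cycles(balanced))
print(low_diversity_cycles(unbalanced))
```

In the balanced set every position carries all four bases in equal proportion, whereas the unbalanced set starts with the same base in every barcode, which is the situation the thread is about.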


              • protist
                Senior Member
                • Jan 2009
                • 101

                #8
                Has anyone tried the "Configurable Template Generation Cycles" option in the new SCS2.9/RTA1.9 when running indexed samples on a GAIIx? It allows deferred cluster calling for low-complexity or in-adapter bar-coded samples. We have got the script from our FAS but have not tried it yet... wondering if there is anyone out there who has?

                From SCS2.9/RTA1.9 Release notes:
                Configurable Template Generation Cycles: The SCS CIF file generation feature cannot start until RTA has generated the tile templates. This takes 5 cycles after the declared template generation cycle.
                Normally template generation begins on cycle 1 and ends on cycle 5. However template generation requires a diversity of bases in the clusters of the template generation cycles. Some users have custom sample preparation procedures that place arbitrary sequences on the clusters, adapters or indexing "spikes", etc. The required diversity of bases may not be present in this case, and it is possible to delay template generation until the actual sample is being sequenced.
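As a small worked example of the cycle arithmetic in the release notes quoted above: if template generation spans 5 cycles and is delayed until just after a shared prefix, the window can be computed as below. The helper names are hypothetical, and how the start cycle is actually set in SCS/RTA is not shown in this thread, so this sketch only does the counting.

```python
TEMPLATE_CYCLES = 5  # per the quoted notes: normally cycle 1 through cycle 5


def first_cycle_after_prefix(prefix_length):
    """First cycle that no longer belongs to the low-diversity prefix."""
    if prefix_length < 0:
        raise ValueError("prefix_length must not be negative")
    return prefix_length + 1


def template_window(start_cycle):
    """Return (first, last) cycle used when template generation starts at start_cycle."""
    if start_cycle < 1:
        raise ValueError("cycles are numbered from 1")
    return start_cycle, start_cycle + TEMPLATE_CYCLES - 1


# Default behaviour, and a library whose first 12 bp are identical in all reads:
default_window = template_window(1)
delayed_window = template_window(first_cycle_after_prefix(12))
```

With a 12 bp shared prefix, as in the low-diversity lane described in the first post, the delayed window would cover cycles 13 to 17 instead of cycles 1 to 5.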


                • fkrueger
                  Senior Member
                  • Sep 2009
                  • 627

                  #9
                  Originally posted by protist
                  Has anyone tried the "Configurable Template Generation Cycles" option in the new SCS2.9/RTA1.9 when running indexed samples on a GAIIx? It allows deferred cluster calling for low-complexity or in-adapter bar-coded samples. We have got the script from our FAS but have not tried it yet... wondering if there is anyone out there who has?

                  I would also be interested if anyone had used this "new" option. After talking to our Illumina rep we don't have any reason to believe that the "Configurable Template Generation Cycles" option is any different from the previous unofficial option "--image-flags". Thus, I would imagine that the basecalls would still suffer from mysteriously bad qualities, see the Supplementary Figures linked in the first post of this thread.

                  I am not quite sure, but I also think that this option can only be applied to the entire flowcell and not on a per-lane basis, right?


                  • DNAANDDAN
                    Junior Member
                    • Jan 2010
                    • 2

                    #10
                    how about PE data

                    Hi, I have the same issue with my data. However, my data are paired-end Solexa data: cycles 1-81 are read 1 and cycles 82-162 are read 2, and cycles 1-7 and 82-88 are barcodes with low diversity.
                    Could bareback handle this kind of data?


                    • fkrueger
                      Senior Member
                      • Sep 2009
                      • 627

                      #11
                      Hi Lan,

                      Yes, in theory bareback-processing should be able to handle this kind of data. Cluster coordinates are determined for read 1 only, so it will be sufficient if you shuffle the first 7 bp of read 1 towards the back and leave read 2 untouched (the bareback-script will do just that).
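For the paired-end layout in this question (read 1 in cycles 1-81, read 2 in cycles 82-162, a 7-cycle barcode at the start of read 1), the shuffle described above amounts to the following renumbering. This is only an illustrative sketch with a hypothetical function name; the real bareback script operates on the image files themselves.

```python
def paired_end_cycle_map(read1_len, read2_len, n_biased):
    """Map each original cycle number to its position after the shuffle.

    Only the first n_biased cycles of read 1 move (to the end of read 1);
    read 2 keeps its cycle numbers because cluster coordinates are
    determined from read 1 only.
    """
    if not 0 <= n_biased <= read1_len:
        raise ValueError("n_biased must be between 0 and read1_len")
    mapping = {}
    for cycle in range(1, read1_len + read2_len + 1):
        if cycle <= n_biased:
            mapping[cycle] = cycle + (read1_len - n_biased)  # barcode goes to the back
        elif cycle <= read1_len:
            mapping[cycle] = cycle - n_biased                # diverse cycles shift forward
        else:
            mapping[cycle] = cycle                           # read 2 untouched
    return mapping


cycle_map = paired_end_cycle_map(81, 81, 7)
```

With these numbers, original cycle 8 becomes the first cycle seen by the pipeline, the barcode cycles 1-7 land in positions 75-81, and cycles 82-162 are unchanged.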

                      Good luck!


                      • Horacio G
                        Junior Member
                        • Nov 2010
                        • 1

                        #12
                        First try on low-diversity libraries

                        Hi guys,

                        I'm trying to run my first flow cell on a GAIIx with low-diversity libraries. I'm still not sure whether to go ahead and save the images and do the post-analysis with bareback (my Illumina rep does not encourage that alternative) or to use delayed template generation. However, with the latter I don't know whether I'll get an early report on the quality of the run (i.e. focusing, intensities).
                        Any suggestions would be greatly appreciated.

                        Horacio


                        • fkrueger
                          Senior Member
                          • Sep 2009
                          • 627

                          #13
                          Hi Horacio,

                          Why am I not surprised that your rep does not recommend anything but using the standard pipeline... If you've got the option to save the images I would definitely vote for that. If you still have the images you can choose to use the standard pipeline, use --image-flags (which is the Illumina deferred cluster calling option) or even bareback-processing. However, if you don't save images you will have to go with whatever the standard analysis pipeline gives you, and this can be shocking: 0 sequences in the worst-case scenario, which we experienced several times. This depends heavily on your experimental setup, the number of low-diversity sequences, the cluster density and so on.

                          If you have further questions don't hesitate to ask via email.

                          Best,
                          Felix


                          • pmiguel
                            Senior Member
                            • Aug 2008
                            • 2328

                            #14
                            Originally posted by HESmith
                            [...] the decision not to save the image files seems a reasonable trade off (although it would be nice to have the option to save). Transferring the images to the server had become the bottleneck for sequencing runs, and the problem was exacerbated when they rolled out the HiSeq. [...]
                            There is an option to save the images. We tried it out on a recent run. This is using the standard HiSeq run software and v3 chemistry: 6.24 TB of TIFFs for a 2x101+7 run (PE + index). That was only 1 surface of one flow cell though, so it would be 2x or 4x more for a HiSeq 1000 or HiSeq 2000. Also, we save the runs to an offsite server during the run, not to the console machine itself.

                            What? You don't have 25 TBs handy to store image data?
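The figure of roughly 25 TB follows directly from scaling the reported 6.24 TB per surface. A short Python sketch of that arithmetic, assuming storage scales linearly with the number of surfaces imaged (the function name is hypothetical):

```python
TB_PER_SURFACE = 6.24          # reported: one surface of one flow cell
TOTAL_CYCLES = 2 * 101 + 7     # 2x101 paired-end cycles plus a 7-cycle index read


def image_storage_tb(scale_factor, tb_per_surface=TB_PER_SURFACE):
    """Estimated TIFF storage in TB for scale_factor times the measured run."""
    if scale_factor < 1:
        raise ValueError("scale_factor must be at least 1")
    return scale_factor * tb_per_surface


for factor in (1, 2, 4):
    print(f"{factor}x: {image_storage_tb(factor):.2f} TB")

# Average image volume per cycle for the measured surface, in TB
tb_per_cycle = TB_PER_SURFACE / TOTAL_CYCLES
```

The 4x case gives 24.96 TB, which is where the 25 TB above comes from.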

                            What are you going to do with it? You can tell the instrument console (a Dell server running Windows Vista) to reprocess the image data. But that is going to be a slow process. You probably don't want to tie up your instrument that long reprocessing a run. Maybe clone the console server into a virtual machine and run it off-site?

                            --
                            Phillip


                            • fkrueger
                              Senior Member
                              • Sep 2009
                              • 627

                              #15
                              Thanks for this piece of information, Phillip. So far the general consensus seemed to be that it is absolutely impossible to store image data (apart from thumbnails) from the HiSeq (and probably also the HiScan) at all. Storing this amount of data, let alone reprocessing a whole flowcell (which would likely take a couple of days), is a whole different matter, though...
