
Sequencing low complexity libraries: effects on data


  • Sequencing low complexity libraries: effects on data

    I am planning some experiments that involve sequencing products that have a standard adaptor sequence at the start.

    I know that cluster identification uses bases 1-5, so I have thought about putting an NNNNN stretch straight after the sequencing primer. This should ensure that clusters are identified correctly.

    However, for bases 6..15 all clusters have the same base. This will produce a single colour per flow, and there will potentially be optical effects due to saturation. Now, I don't really care about these bases, I am only interested in the genomic bases after the adaptor. So my question is: will the later bases be sequenced OK given that the early bases may have these problems?

    Also, what will happen for the paired end read if that also has low complexity bases at the start? Since the cluster identification happens during the first read, the effect should be the same?
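
    The layout described above (5 random bases, then fixed adaptor bases at cycles 6..15, then genomic sequence) can be checked on any set of reads by looking at per-cycle base composition. The following is only a minimal sketch: the adaptor sequence, the toy reads and the 0.9 threshold are made up for illustration and are not taken from any Illumina software.

```python
from collections import Counter


def per_cycle_composition(reads):
    """Return, for each cycle (read position), the fraction of A, C, G and T."""
    n_cycles = min(len(read) for read in reads)
    fractions = []
    for cycle in range(n_cycles):
        counts = Counter(read[cycle] for read in reads)
        total = sum(counts.values())
        fractions.append({base: counts[base] / total for base in "ACGT"})
    return fractions


def low_diversity_cycles(reads, threshold=0.9):
    """1-based cycles where one base makes up more than `threshold` of reads."""
    composition = per_cycle_composition(reads)
    return [i + 1 for i, frac in enumerate(composition)
            if max(frac.values()) > threshold]


# Toy reads: 5 "random" bases, a hypothetical 10-base adaptor, 5 genomic bases.
ADAPTOR = "ACGTACGTAC"  # made-up sequence, for illustration only
reads = [
    "ACGTA" + ADAPTOR + "TTGCA",
    "CGTAC" + ADAPTOR + "GCATG",
    "GTACG" + ADAPTOR + "CATGC",
    "TACGT" + ADAPTOR + "ATGCA",
]

print(low_diversity_cycles(reads))
```

    For these toy reads the flagged cycles are 6 to 15, i.e. exactly the fixed adaptor bases.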

  • #2
    PS this thread was useful:

    but that deals with deferring cluster identification till after the low complexity bases. I want to know the effect of low complexity bases after a successful cluster identification.


    • #3
      Hi casbon,

      if all of your sequences have the same kind of adapter sequence at the start, can't you just avoid the whole low complexity issue by using a custom sequencing primer for that lane so that you start reading straight into the genomic sequence?

      In our experience, low complexity after the initial bases is not much of a problem, and is certainly not nearly as bad as having it right at the start. If uniform base composition were a serious problem in general, the shuffling process would not work very well either. It does work quite well, even though the qualities generally do not quite reach the standards of a normal run (this is most likely due to phasing/prephasing, though).

      And yes, paired ends would only suffer slightly from technical issues with basecalling, but not from any influence on cluster detection.


      • #4
        Thanks, fkrueger.

        There are slight complications with dealing with a custom sequencing primer that I didn't disclose.

        In light of your comments, I think I might just try a lane and see how it turns out.


        • #5
          In any case, if you could convince your sequencing provider to keep hold of the images of the run this might possibly help you if you want to reprocess the data, e.g. only including cycles 1-5 and 16-end for the basecalling procedure. Or bareback shuffling of the first 15 bp for that matter... Good luck!


          • #6
            The HiSeq doesn't save any of the images so the above suggestion would only work on the GAIIX.


            • #7
              You can also try the following:
              - increase the amount of PhiX you spike into your library prior to hybridization on the flow cell. For some really low complexity libraries, you can go up to 50% PhiX. This should be really useful when sequencing libraries where all your fragments start with the same bases.
              - try to dilute your libraries a bit more than usual before you hybridize them on the flow cell (4 pM as opposed to the usual 6 to 8 pM, for example). You will end up with fewer sequences but you should avoid some of the identification problems.
              Both these methods were given to us by Illumina's tech support. We have tried the second one so far with some success and we are going to try the first one soon.
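
              To get a feel for what the two options above cost in yield, here is a rough back-of-the-envelope sketch. The cluster count per lane is a made-up figure, and the assumption that cluster number scales linearly with loading concentration is a crude simplification, not something Illumina states.

```python
def library_reads(passing_clusters, phix_fraction):
    """Reads left for your own library after a PhiX spike-in."""
    if not 0.0 <= phix_fraction < 1.0:
        raise ValueError("phix_fraction must be in [0, 1)")
    return round(passing_clusters * (1.0 - phix_fraction))


def diluted_clusters(usual_clusters, usual_pm, diluted_pm):
    """Clusters expected after dilution, ASSUMING linear scaling with pM."""
    return round(usual_clusters * diluted_pm / usual_pm)


USUAL_CLUSTERS = 20_000_000  # hypothetical clusters per lane at the usual loading

# Option 1: 50% PhiX at the usual loading concentration.
print(library_reads(USUAL_CLUSTERS, 0.5))

# Option 2: dilute from 6 pM to 4 pM, no PhiX.
print(diluted_clusters(USUAL_CLUSTERS, 6, 4))
```

              Under these assumptions a 50% spike-in halves the reads from your own library, while diluting from 6 pM to 4 pM costs about a third.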


              • #8
                There are basically two problems with biased libraries. Firstly, a lack of diversity in the first few bases means that overlapping clusters aren't able to be separated so the region of measurement identified can span two clusters, leading to mixed signals when the sequences later diverge. Secondly the highly biased sequence composition messes up the signal intensity calibration so that the quality of called bases can suffer.

                The solution to the first problem is either to dilute your library to the point where very few overlapping clusters are found, or to do the cluster calling from a later set of cycles, either by specifying the cycles to use when setting up the run (with a limited range of options), or by saving images and using something like bareback to shuffle the order in which they're presented to the cluster calling program.

                The solution to the second problem is either to increase the diversity of your library through the introduction of more random sequences, or to use an external calibration, either a standard fixed one, or one derived from a different diverse lane on the same flowcell.

                Adding PhiX attempts to solve both of these problems in one step - reducing the effective concentration of the biased library, and introducing some added diversity. Alternatively you could just dilute your library more and use a control lane elsewhere on the flowcell. Either of these approaches will yield substantially less data than a deferred cluster calling but they're much better than doing a standard analysis on a biased high density library which can, in extreme cases, return no data at all.

                In your specific case, if you introduce random bases at the start so that the clusters are called correctly, you may still find that all of your sequences end up being rejected due to the compositional bias later in the read. Actually the calls for your later bases will probably be OK, but one of the Illumina filters looks for deteriorating quality and then flags all remaining bases with low quality scores, even if the quality later improves (the so-called 'killer Bs'). You can turn this off using the undocumented parameter NO-EAMSS when processing, which will preserve the original qualities. If you then trim your sequences to just your bases of interest, the qualities there should be OK.
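
                As a rough illustration of the last two steps (spotting a run of 'B' qualities and trimming to the bases of interest), here is a minimal sketch. It assumes Phred+64 quality strings, where the flagged tail appears as 'B' characters, and the 5 random + 10 fixed leading bases from the original question. The function names and the example read are made up for illustration.

```python
def b_tail_length(quality):
    """Number of trailing 'B' characters in a Phred+64 quality string."""
    return len(quality) - len(quality.rstrip("B"))


def trim_leading(seq, qual, n_skip=15):
    """Drop the first n_skip cycles and keep only the bases after them."""
    if len(seq) != len(qual):
        raise ValueError("sequence and quality strings differ in length")
    return seq[n_skip:], qual[n_skip:]


# Hypothetical 25-cycle read: 5 random bases, 10 fixed bases, 10 genomic bases.
seq = "ACGTA" + "ACGTACGTAC" + "TTGCATGCAA"
# Qualities as they might look with the filter left on: everything after
# the random bases has been flagged as 'B'.
qual = "hhhhh" + "B" * 20

print(b_tail_length(qual))
print(trim_leading(seq, qual))
```

                Here the whole genomic part is still flagged after trimming, which is why the original qualities need to be preserved before the trim is of any use.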