My lab is having quality problems in the first few cycles of our runs, affecting mainly Gs and Ts.
Our sequencing is performed at a core facility (we are the client) on a HiSeq2000. These are 50 SE runs. We do the library prep ourselves. We multiplex 16 samples per lane using our own barcoded adapters. The barcodes are four bases long, followed by a T. We are careful to balance all four bases at each of the first four positions. The fifth position is always a T due to the T/A ligation used to ligate the adapters. We are sequencing yeast genomic DNA and our insert sizes are in the range of 300 bp.
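For concreteness, here's a small Python sketch of what we mean by "balanced" — these are made-up barcodes, not our actual sequences, constructed so that each of the four bases appears exactly four times at each of the first four positions:

```python
from collections import Counter
from itertools import combinations

BASES = "ACGT"

# 16 made-up 4-base barcodes (not our real set), constructed so that
# every base appears exactly 4 times at each of the four positions.
barcodes = [BASES[i] + BASES[j] + BASES[(i + j) % 4] + BASES[(i + 2 * j) % 4]
            for i in range(4) for j in range(4)]

# Base balance at each position: each of A/C/G/T occurs 16/4 = 4 times.
for pos in range(4):
    counts = Counter(bc[pos] for bc in barcodes)
    assert all(counts[b] == 4 for b in BASES), (pos, counts)

# Minimum pairwise Hamming distance (our real set guarantees >= 2).
def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))

min_dist = min(hamming(a, b) for a, b in combinations(barcodes, 2))
print(min_dist)  # -> 2 for this particular construction

# In the actual reads, each barcode is followed by the ligation T:
read_prefixes = [bc + "T" for bc in barcodes]
```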
Since the beginning of 2012, some of our runs have had a surprisingly low number of bases passing a quality score of 30 (see attachment...I wanted to paste it into my post but am not sure how), while other runs have high scores across the board. In discussing this with the core, it appears the "good" runs were performed at much lower cluster density (around 100 million clusters), whereas the "bad" runs were in the range of 190 million clusters. The tables below include only reads passing the Illumina filter.
As you can see, Gs and Ts are much more dramatically affected than Cs and As. Also, in some runs mainly the first two cycles are affected, whereas in others it's the third and/or fourth cycle.
This is a serious problem for us because it affects our ability to de-multiplex the data using the barcode sequences. In the "bad" runs, we get millions of reads where the first four bases don't correspond to any of our expected barcodes. This reduces our coverage, but is actually not a huge problem because we can easily discard those reads. What's worse is that even though each barcode differs from every other barcode in at least two positions, it's clear after full analysis of the data that some reads are getting placed into the wrong file when we de-multiplex. In other words, the base-calling is so bad that we can't confidently assign reads to the individual samples based on the barcodes. The files are cross-contaminated and we get uninterpretable results.
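To spell out why the cross-contamination worries us: with a minimum pairwise Hamming distance of 2, a single miscalled base can only turn a valid barcode into an invalid 4-mer (which we detect and discard), never into a different valid barcode — so reads landing in the wrong file imply at least two miscalls in the first four cycles. A minimal exact-match demultiplexer (Python sketch, with made-up barcodes rather than our real ones) illustrates the logic:

```python
from collections import defaultdict

# Made-up 4-base barcodes (not our real set); each read starts with the
# barcode, then the ligation "T", then the insert sequence.
barcodes = {"ACGT", "CATG", "GTAC", "TGCA"}

def demultiplex(reads):
    """Assign each read to a sample bin by exact match on the first 4 bases.

    A single miscall produces a 4-mer outside the barcode set and the read
    goes to "unassigned"; only two or more miscalls can silently move a
    read into the wrong sample's bin.
    """
    bins = defaultdict(list)
    for read in reads:
        tag = read[:4]
        bins[tag if tag in barcodes else "unassigned"].append(read)
    return bins

reads = ["ACGTTTTACCA",   # exact match -> bin ACGT
         "AGGTTTTACCA",   # one miscall -> unassigned (detected)
         "CATGTGGATCC"]   # exact match -> bin CATG
bins = demultiplex(reads)
print(sorted(bins))  # -> ['ACGT', 'CATG', 'unassigned']
```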
Has anyone seen this problem before...or do you have any guesses about what could be happening? Lowering the cluster density appears to alleviate the problem, but we're confused about the root cause. In the past we were able to get high quality data even at higher cluster density and with shorter barcodes (two bases). Any ideas about what could be happening (either on our end or in the sequencing core) would be welcome.
I should also mention that we have ruled out adapter-dimer as a cause of the problem. We have quantified the number of adapter-dimer reads in a large number of experiments and found no correlation between the amount of adapter contamination and the base quality problems. In fact, some of our highest quality data came from samples with the worst adapter-dimer contamination, and vice versa.