Hello all, this is my first post. I have been trying for several weeks now to figure out an issue with a dataset. I have discussed this with a number of local experts and am in contact with Illumina support, but no one has come up with an answer yet. My advisor suggested SEQanswers as a good, knowledgeable forum.
Our reads should start with a 4-base degenerate sequence (which rarely aligns to the genome and is there to identify PCR duplicates), then an invariant C at the 5th base, then genomic sequence.
To visualize, the start of each read should be: NNNNC followed by 30-80 nt of genomic sequence.
Before even sending the library to be sequenced, I cloned a bit of library into pBluescript and sequenced 10 clones. All 10 had this correct structure, so we went ahead with sequencing.
However, when the library was sequenced on an Illumina HiScan SQ, only 33% of all reads had a C in the 5th position. Worse, when I randomly selected 30 reads and aligned them manually, anywhere from 0 to 5 of the first 5 bases aligned to the genome, with no obvious pattern. To put this another way, about 67% of all reads appear to have lost 1-5 nt from the beginning of the read.
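For anyone who wants to run the same kind of check on their own data, here is a minimal sketch in Python. It assumes the reads are already loaded as plain sequence strings, and the function name `c_fraction_by_position` is just something made up for illustration. The idea: if bases are being lost from the start of reads, the invariant C should show up enriched at positions 1-4 rather than only at position 5. Keep in mind the degenerate bases alone would give roughly 25% C at positions 1-4, so this is a rough diagnostic, not proof.

```python
def c_fraction_by_position(reads, window=5):
    """Return the fraction of reads carrying a C at each of the first
    `window` positions (index 0 of the result is read position 1)."""
    # Ignore reads too short to cover the whole window.
    usable = [r.upper() for r in reads if len(r) >= window]
    if not usable:
        return [0.0] * window
    return [
        sum(r[i] == "C" for r in usable) / len(usable)
        for i in range(window)
    ]


if __name__ == "__main__":
    # Toy reads (hypothetical): the first has the intended NNNNC structure,
    # the others mimic reads missing 1, 3 and 4 leading bases.
    toy = ["ACGTCAAAA", "GGTCAAAAA", "TCAAAAAAA", "CAAAAAAAA"]
    for pos, frac in enumerate(c_fraction_by_position(toy), start=1):
        print(f"position {pos}: {frac:.2f} of reads have C")
```

On real data, a clear excess of C at positions 1-4 over the ~25% background would be consistent with truncation at the start of the read.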
I can still work with the data by just aligning it without the first 5 bases and accepting that there will be PCR biases. However, I would prefer to use the degenerate bases to limit PCR biases and thus make the analysis a bit more quantitative.
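To make the two options concrete, here is a rough sketch in Python. Everything in it is an assumption for illustration: the helper names `split_read` and `collapse_duplicates` are hypothetical, and alignments are represented as simple `(chrom, pos, strand, barcode)` tuples rather than the output of any particular aligner. Reads that do not show the NNNNC structure are returned as `None`, so they could be handled separately (for example, trimmed and aligned without duplicate collapsing).

```python
def split_read(seq, barcode_len=4, invariant="C"):
    """Split a read into (barcode, insert) if it has the expected
    NNNNC structure; return None if the invariant base is missing."""
    if len(seq) <= barcode_len or seq[barcode_len].upper() != invariant:
        return None
    return seq[:barcode_len], seq[barcode_len + 1:]


def collapse_duplicates(alignments):
    """Keep the first alignment for each unique
    (chrom, pos, strand, barcode) tuple, preserving input order."""
    seen = set()
    unique = []
    for rec in alignments:
        if rec not in seen:
            seen.add(rec)
            unique.append(rec)
    return unique


if __name__ == "__main__":
    print(split_read("ACGTCTTTT"))   # read with the intended structure
    print(split_read("ACGTATTTT"))   # no C at position 5
    hits = [
        ("chr1", 100, "+", "ACGT"),
        ("chr1", 100, "+", "ACGT"),  # same position and barcode: likely PCR duplicate
        ("chr1", 100, "+", "TTTT"),  # same position, different barcode: kept
    ]
    print(len(collapse_duplicates(hits)))
```

The design choice here is that two reads count as PCR duplicates only when both the alignment position and the degenerate barcode match, which is what the degenerate bases were meant to allow.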
Thanks for any help anyone can provide.