Errors and SNPs are easier to detect in color-space (CS).
Let's assume I have a SNP. Then the color-space reads would look like:
(CS 1) T1113113
(CS 2) T1112013
Note that *two* color-space numbers are different for a single SNP. In base-space these reads are:
(BS 1) GTGCACG
(BS 2) GTGAACG
Note the SNP in the middle which is a C to A.
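For anyone who wants to check the conversion by hand, here is a minimal sketch of a color-space decoder. It assumes the standard SOLiD di-base code (0 = same base, 1 = A<->C or G<->T, 2 = A<->G or C<->T, 3 = A<->T or C<->G); the names `TRANSITIONS` and `decode_color_space` are made up for this illustration and do not come from any particular tool.

```python
# Assumed standard SOLiD di-base code: for each previous base, the string
# gives the next base for colors 0, 1, 2, 3 (in that order).
TRANSITIONS = {
    "A": "ACGT",
    "C": "CATG",
    "G": "GTAC",
    "T": "TGCA",
}


def decode_color_space(read):
    """Decode a color-space read such as 'T1113113' into base-space.

    The first character is the primer base; it is used to start the
    decoding but is not part of the returned sequence.
    """
    previous = read[0]
    bases = []
    for color in read[1:]:
        previous = TRANSITIONS[previous][int(color)]
        bases.append(previous)
    return "".join(bases)


print(decode_color_space("T1113113"))  # GTGCACG
print(decode_color_space("T1112013"))  # GTGAACG
```

Each base depends on the base before it, which is why the primer base is needed to get started.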
Double encoded (the numbers 0, 1, 2, 3 written as A, C, G, T) and primer trimmed (the primer base and the first color removed), these sequences look like:
(DET 1) CCTCCT
(DET 2) CCGACT
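The double-encoded strings above can be reproduced with a few lines of Python. This is only a sketch, assuming the usual 0/1/2/3 to A/C/G/T mapping and that trimming drops both the primer base and the first color (which is what the examples here show); `double_encode_trimmed` is a hypothetical helper name.

```python
# Assumed double-encoding table: color digits rewritten as letters.
# The letters are NOT real bases, just a disguise for the colors.
DOUBLE_ENCODING = str.maketrans("0123", "ACGT")


def double_encode_trimmed(read):
    """Drop the primer base and the first color, then double encode."""
    colors = read[2:]  # read[0] is the primer base, read[1] the first color
    return colors.translate(DOUBLE_ENCODING)


print(double_encode_trimmed("T1113113"))  # CCTCCT
print(double_encode_trimmed("T1112013"))  # CCGACT
```

The output looks like DNA, which is exactly why a traditional aligner will accept it without complaint.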
Put these into a traditional alignment program and you will get an alignment, but it now looks like a "double SNP" instead of the true single SNP.
Likewise, sequencing errors show up easily in color-space but not in traditional programs working on double-encoded reads. Let's say that our reads differ by a single number:
(CS 3) T1113113
(CS 4) T1112113
Double encoded and primer trimmed, these look like:
(DET 3) CCTCCT
(DET 4) CCGCCT
A *traditional* program will happily work with these and report a SNP. But what is the actual base-space alignment?
(BS 3) GTGCACG
(BS 4) GTGACAT
Oops! Every base from the changed color onward is different.
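To see why one wrong color wrecks the rest of the read, it helps to write the di-base code as an XOR. The sketch below assumes the 2-bit labelling A=0, C=1, G=2, T=3, under which a color is the XOR of two neighbouring bases; `decode` and `base_mismatches` are illustrative names only.

```python
BASES = "ACGT"  # assumed labelling: A=0, C=1, G=2, T=3


def decode(read):
    """Decode a color-space read (primer base first) by running XOR."""
    code = BASES.index(read[0])
    decoded = []
    for color in read[1:]:
        code ^= int(color)  # next base = previous base XOR color
        decoded.append(BASES[code])
    return "".join(decoded)


def base_mismatches(read_a, read_b):
    """Return the 0-based base positions where the two decoded reads differ."""
    a, b = decode(read_a), decode(read_b)
    return [i for i, (x, y) in enumerate(zip(a, b)) if x != y]


print(decode("T1112113"))                       # GTGACAT
print(base_mismatches("T1113113", "T1112113"))  # [3, 4, 5, 6]
```

Because every base is the running XOR of all colors before it, a single changed color shifts every base after it, while two adjacent changes with the same XOR cancel out and leave only one base different.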
Note that this is one of the great strengths of color-space: sequencing errors stand out as a single number change and can be discarded or corrected. In the above case, a color-space-aware program would throw out the read that does not match the reference or, in a de novo assembly project, the read(s) that do not match other reads.
In fact, I think the above is so important that I will repeat it: in color-space, sequencing errors look different from SNPs (one changed number rather than two adjacent ones) and are thus easily detected as errors. This is an immense advantage over traditional sequence representations.
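As a rough illustration of what a color-space-aware program can do with this, here is a deliberately simplified classifier. It is a sketch under my own assumptions, not how any real aligner works: it ignores quality values, indels and adjacent SNPs, and the function name and labels are invented for the example. It treats two adjacent color mismatches as a SNP only if their XOR matches the reference pair's XOR (so the flanking bases are unchanged), and anything else as a likely sequencing error.

```python
def classify_color_mismatches(read, reference):
    """Label color mismatches between two color-space reads.

    Both inputs include the primer base, e.g. 'T1113113'.
    Returns (index, label) pairs, where index is 0-based into the colors.
    """
    read_colors = [int(c) for c in read[1:]]
    ref_colors = [int(c) for c in reference[1:]]
    mismatches = [
        i for i, (r, f) in enumerate(zip(read_colors, ref_colors)) if r != f
    ]

    calls = []
    k = 0
    while k < len(mismatches):
        pos = mismatches[k]
        has_adjacent = k + 1 < len(mismatches) and mismatches[k + 1] == pos + 1
        # A single-base change alters two adjacent colors but keeps their XOR.
        if has_adjacent and (
            read_colors[pos] ^ read_colors[pos + 1]
            == ref_colors[pos] ^ ref_colors[pos + 1]
        ):
            calls.append((pos, "SNP"))
            k += 2
        else:
            calls.append((pos, "likely sequencing error"))
            k += 1
    return calls


print(classify_color_mismatches("T1112013", "T1113113"))
# [(3, 'SNP')]
print(classify_color_mismatches("T1112113", "T1113113"))
# [(3, 'likely sequencing error')]
```

On double-encoded reads fed to a traditional aligner, both cases would just be reported as mismatches, and this distinction is lost.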