I would like to find some information on the distribution of errors in SOLiD data. I'm planning to use it to simulate a pooling sequencing strategy like the "DNA Sudoku" approach, and assess how badly SOLiD's error rate hurts the capacity to uniquely resolve variants in this scheme.
If I just assume errors are uniformly distributed along reads with a frequency of 0.03%, I am pretty sure the answer will be "Not much, go for it!" But I suspect that error model is too optimistic, and that real errors correlate with position in the read and with sequence context. Ideally, I'd like to find a paper like "Substantial biases in ultra-short read data sets from high-throughput DNA sequencing", but for SOLiD rather than Illumina. Is there such a paper?
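To make the two error models concrete, here is a minimal sketch of what I have in mind. It works in base space with plain substitutions, which is a simplification (SOLiD reads are actually in color space), and the `linear_ramp` model and its parameters are purely hypothetical placeholders for whatever the real position dependence turns out to be.

```python
import random

BASES = "ACGT"

def add_errors(read, rate_at, rng):
    """Return a copy of `read` with random substitution errors.

    rate_at(i) gives the error probability at 0-based position i.
    """
    out = []
    for i, base in enumerate(read):
        if rng.random() < rate_at(i):
            # Substitute one of the three other bases, chosen uniformly.
            out.append(rng.choice([b for b in BASES if b != base]))
        else:
            out.append(base)
    return "".join(out)

def uniform(rate):
    """Naive model: the same error probability at every position."""
    return lambda i: rate

def linear_ramp(start, end, length):
    """Hypothetical model: error probability rises linearly along the read."""
    return lambda i: start + (end - start) * i / max(length - 1, 1)

rng = random.Random(0)          # fixed seed so runs are repeatable
read = "ACGT" * 12              # toy 48-base read
naive = add_errors(read, uniform(0.0003), rng)                    # 0.03% everywhere
skewed = add_errors(read, linear_ramp(0.0001, 0.01, len(read)), rng)
```

Swapping `uniform` for a position- or context-aware `rate_at` is the only change the pooling simulation would need, which is why I am after an empirical error profile rather than a single number.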
Another possibility would be a large corpus of public SOLiD data covering loci that have also been sequenced by other methods, so I could compare the two and characterize the errors myself.
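If such a corpus exists, the comparison I would run is roughly the following sketch. It assumes the reads have already been aligned without gaps and decoded to base space, so each read can be paired with the trusted sequence it covers; the function name and input format are my own, not from any existing tool.

```python
def mismatch_rate_by_position(pairs):
    """Per-position mismatch rate over (read, trusted_sequence) pairs.

    Each pair is two strings aligned without gaps; position 0 is the
    first base of the read. Reads may differ in length.
    """
    mismatches, totals = [], []
    for read, trusted in pairs:
        for i, (r, t) in enumerate(zip(read, trusted)):
            if i == len(totals):
                # First time this read position is seen: start its counters.
                totals.append(0)
                mismatches.append(0)
            totals[i] += 1
            mismatches[i] += (r != t)
    return [m / n for m, n in zip(mismatches, totals)]

# Toy input: the second read disagrees with the trusted sequence at its last base.
pairs = [("ACGT", "ACGT"), ("ACGA", "ACGT")]
rates = mismatch_rate_by_position(pairs)
```

A flat profile would support the uniform assumption, while a rate that climbs toward the end of the read would not. Note that true variants in the sample would also show up as mismatches here, so they would need to be excluded before treating the result as an error profile.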