(1) How can I estimate the error rate for GAII short reads?
(2) Some papers, such as the panda genome paper, filter out "base-calling duplicates". I am not familiar with this issue. Can someone give me some clues on how to filter out these base-calling duplicates? The paper describes the filtering as follows:
"(1) Base-calling duplicate, this is a unique characteristic for each lane,
caused by the solexa-pipeline, and they are not real sequences. The higher the raw
cluster density, the more severe this problem is. The redundant reads were filtered at a
threshold of euclid distance <= 3 and a mismatch rate of <= 0.1. We observed that the
average rate of base-calling duplicates for each lane was about 0.83%, ranging from
0.00% to 8.52%. (2) Adapter contamination, another unique characteristic of the
specific library, is caused by DNA adaptor dimerization, the empty loading or too
small an insert size (less than the read length)."
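
To make the quoted thresholds concrete, here is a minimal Python sketch of one way they could be applied. It rests on assumptions the excerpt does not confirm: that "euclid distance" means the distance between the cluster x/y coordinates of two reads on the same tile (Illumina read names carry lane, tile, x and y), and that "mismatch rate" means the fraction of differing bases between the two reads. The function names and the input layout are hypothetical, and this is not the pipeline the paper's authors used.

```python
from math import hypot


def mismatch_rate(a, b):
    """Fraction of compared positions at which two reads differ."""
    n = min(len(a), len(b))
    if n == 0:
        return 1.0
    return sum(x != y for x, y in zip(a, b)) / n


def filter_base_calling_duplicates(reads, max_dist=3.0, max_mismatch=0.1):
    """Drop reads that look like base-calling duplicates of an earlier read.

    reads: list of (name, x, y, seq) tuples, all from the SAME lane and tile
           (assumed input layout, not a real pipeline format).
    A read is dropped when an already kept read lies within max_dist
    (Euclidean distance of cluster coordinates) and differs in at most
    max_mismatch of its bases.
    """
    kept = []
    # Process reads in order of x so that nearby clusters are adjacent.
    for name, x, y, seq in sorted(reads, key=lambda r: r[1]):
        is_dup = False
        # kept is ordered by x, so scan backwards and stop once too far away.
        for _, kx, ky, kseq in reversed(kept):
            if x - kx > max_dist:
                break
            if (hypot(x - kx, y - ky) <= max_dist
                    and mismatch_rate(seq, kseq) <= max_mismatch):
                is_dup = True
                break
        if not is_dup:
            kept.append((name, x, y, seq))
    return kept


if __name__ == "__main__":
    tile_reads = [
        ("r1", 100, 200, "ACGTACGTAC"),
        ("r2", 101, 201, "ACGTACGTAC"),  # close to r1, identical sequence
        ("r3", 102, 200, "ACGTACGTAT"),  # close to r1, one mismatch in ten
        ("r4", 500, 500, "ACGTACGTAC"),  # same sequence, far away
        ("r5", 101, 200, "TTTTTTTTTT"),  # close to r1, different sequence
    ]
    for read in filter_base_calling_duplicates(tile_reads):
        print(read[0])
```

Reads with the same sequence but distant clusters (r4 here) are kept, since under this interpretation only spatially adjacent near-identical reads are treated as artifacts of base calling rather than as real sequences.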