Hello everyone! I am a newbie in the NGS field and need your help.
I was looking at part of the short jump library from a Staphylococcus aureus study, ~1.7 million reads (~20x its genome size) generated on an Illumina GAII, and got confused while trying to find out how many of these reads are "error-free".
I don't quite understand the definition of an "error-free" read. I think "error-free" should describe the highest read quality, i.e. a guarantee that the read Illumina outputs is exactly the same as the input fragment. But how can I know this? To determine whether a short read is "error-free", should I (1) align it back to the known reference genome sequence(s) and look for a perfect local match, or (2) check that the quality scores of all its bases are above a certain threshold?
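To make option (2) concrete, here is a minimal sketch of what I have in mind. It assumes the FASTQ quality strings are Phred+33 encoded (older Illumina pipelines used a +64 offset, so that needs checking for this dataset), and the function names and the threshold of 20 are just my own placeholders. It relies on the standard Phred relation, where a score Q corresponds to a per-base error probability of 10^(-Q/10).

```python
def phred_scores(quality, offset=33):
    """Decode a FASTQ quality string into integer Phred scores.

    offset=33 is an assumption; use 64 for old Illumina-encoded files.
    """
    return [ord(ch) - offset for ch in quality]


def all_bases_at_least(quality, threshold=20, offset=33):
    """True if every base in the read has a Phred score >= threshold."""
    return all(q >= threshold for q in phred_scores(quality, offset))


def prob_error_free(quality, offset=33):
    """Probability that no base in the read is wrong, taking the
    quality scores at face value and treating bases as independent."""
    p = 1.0
    for q in phred_scores(quality, offset):
        p *= 1.0 - 10 ** (-q / 10)
    return p
```

Note that this only gives a probability that a read is error-free according to the base caller; it cannot prove that the read matches the original fragment.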
For (1), I tried aligning several short reads that occur with very high frequency (>3000, such as @SRR022865.8852) against the reference genome sequences (NC_010079, NC_010063.1, and NC_012417.1), but I could not find any perfect matches. I thought "error-free" reads should show up in their reference sequences, but I didn't see any.
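For option (1), this is roughly the check I mean, written as a small sketch rather than a real aligner. It assumes the reference has already been loaded as a plain string, and the function names are hypothetical. It looks for the read on both strands, since a read can come from the reverse strand, but it ignores matches that wrap around the origin of a circular sequence.

```python
_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence (A, C, G, T, N)."""
    return seq.upper().translate(_COMPLEMENT)[::-1]


def matches_reference_exactly(read, reference):
    """True if the read occurs verbatim in the reference on either strand."""
    read = read.upper()
    reference = reference.upper()
    return read in reference or reverse_complement(read) in reference
```

For ~1.7 million reads a plain substring search would be slow, so in practice a short-read aligner set to allow zero mismatches would do the same job; the sketch is only meant to pin down the definition.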
The reads data set (and description file) is freely available at,
I used the above dataset from the following website,
Please forgive me for any naive questions. Thanks very much!