
**jeffgao** · 06-16-2011, 09:32 AM

I must have ignored the difference between real reads and simulated reads. The word "error-free" only describes the error-free simulated reads that are used to evaluate the performance of sequence assemblers. It seems there are no absolutely error-free "real" reads, although the error-correction procedures in sequence assemblers can correct many, but not all, of these errors (e.g., substitutions in Illumina data). Please correct me if I am wrong again.

**westerman** · 06-16-2011, 11:04 AM

Does GAGE use the term "error free" on their web page? I couldn't find it. In any case I would suspect that there are many 'absolutely error-free "real" reads' -- especially if you trim off poor quality bases from either end -- in an NGS data set. In other words "error free" should mean your (2) ... look at the quality scores.
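
A minimal sketch of the end-trimming idea mentioned above, assuming Phred+33 quality encoding and a cutoff of Q20. Both are illustrative assumptions (some 2011-era Illumina pipelines used Phred+64), and the function name `trim_low_quality_ends` is hypothetical, not taken from any trimming tool.

```python
def trim_low_quality_ends(seq, qual, min_q=20, offset=33):
    """Trim bases with Phred score below min_q from both ends of a read.

    seq    -- the base calls, e.g. "ACGT..."
    qual   -- the FASTQ quality string, same length as seq
    min_q  -- assumed quality cutoff (illustrative, not a recommendation)
    offset -- ASCII offset of the quality encoding (33 assumed here)
    """
    scores = [ord(c) - offset for c in qual]
    start, end = 0, len(seq)
    # Walk inward from the 5' end while bases are below the cutoff.
    while start < end and scores[start] < min_q:
        start += 1
    # Walk inward from the 3' end while bases are below the cutoff.
    while end > start and scores[end - 1] < min_q:
        end -= 1
    return seq[start:end], qual[start:end]
```

Only the ends are trimmed; a low-quality base in the middle of the read is left in place, so a trimmed read can still contain errors.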

As for (1) and your 1:32 PM message, I suggest that you turn your thinking around and ask: why does the reference sequence -- which is a computer file full of ACGTs and in many ways a simulation -- not match my real-world sample?

Note that many of the reads you are using are exact matches to each other. This would mean that they are either PCR duplications or "real" reads. I.e., "output reads [that are] the exact same as the input fragments". I suspect the latter.

**jeffgao** · 06-16-2011, 12:47 PM

Hello Westerman,

Thank you very much for your reply!

No, GAGE doesn't mention the term "error free". I saw this term in the EULER paper (see below):

"We measure the rate of error correction by measuring the number of error-free reads, that is, reads that appear exactly in the genome (68% of the reads error-free)". Cited from Chaisson MJ, Pevzner PA, "Short read fragment assembly of bacterial genomes", Genome Research, Jan 2008. http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2203630/
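
The definition quoted above ("reads that appear exactly in the genome") could be checked roughly as follows. This is only a sketch under stated assumptions: the reference is a single in-memory string, both strands are checked, and a naive substring search is used, which is not how EULER or any aligner does it. The function names are hypothetical.

```python
def reverse_complement(seq):
    """Return the reverse complement of a DNA string (A, C, G, T, N only)."""
    table = str.maketrans("ACGTN", "TGCAN")
    return seq.translate(table)[::-1]


def error_free_fraction(reads, reference):
    """Fraction of reads that occur exactly in the reference on either strand."""
    if not reads:
        return 0.0
    hits = sum(
        1
        for read in reads
        if read in reference or reverse_complement(read) in reference
    )
    return hits / len(reads)
```

For a data set of millions of reads, an indexed approach (or an aligner run with zero mismatches allowed) would be far faster than repeated substring searches.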

Yes, I agree there are some 'error-free' real reads, and I just found some of them in the reference genome sequences (I assume the reference sequences are correct, since that is how the term 'error-free' is defined).

In the first post, I thought that reads with high frequency (those that appear many times in the reads FASTQ file) might come from repeat regions, so they might have a better chance of being 'error-free' (i.e., of mapping perfectly back to the reference sequences). So I tested several of the most frequent reads (with frequency > 3000), but I didn't find any matches in the ref seq.

I felt confused at first, but it is not surprising, since such high-frequency reads are only a tiny proportion of the whole read data set (~1.7 million). 88% of all reads are unique; among the other 12% of reads, which appear more than once, only around 1000 have a frequency greater than 10. This number is small and, just as you said, such duplicates may be introduced by library preparation (PCR duplication, etc.). Although I didn't find any matches for the most frequent reads, I did find some matches for other reads in the ref seq (my code is still running, so I cannot give a number right now). I believe there will be more matches once I trim all reads by quality score.
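
The frequency figures above (share of unique reads, number of reads above a frequency cutoff) can be tallied with a sketch like the one below. It assumes the read sequences are already loaded into a list, and it counts identical strings only, so a read and its reverse complement are treated as different. The function name and the returned keys are hypothetical.

```python
from collections import Counter


def read_frequency_summary(reads, threshold=10):
    """Summarize how often each distinct read sequence occurs.

    reads     -- list of read sequences (strings)
    threshold -- count sequences seen more than this many times
    """
    counts = Counter(reads)
    total = len(reads)
    seen_once = sum(1 for c in counts.values() if c == 1)
    above = sum(1 for c in counts.values() if c > threshold)
    return {
        "total_reads": total,
        "distinct_sequences": len(counts),
        # Share of all reads whose sequence occurs exactly once.
        "fraction_seen_once": seen_once / total if total else 0.0,
        "sequences_above_threshold": above,
        "most_common": counts.most_common(3),
    }
```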

In addition, a difference between the reads and the ref seq doesn't necessarily mean the reads are incorrect. Many things, such as polymorphisms, SNPs, and indels, can cause differences. The ref seq cannot be absolutely correct either; it is just a guideline to help people with re-sequencing, assembly, and variant detection.

Please correct me if I said something wrong.

**westerman** · 06-17-2011, 03:56 AM

Originally posted by jeffgao

In addition, a difference between the reads and the ref seq doesn't necessarily mean the reads are incorrect. Many things, such as polymorphisms, SNPs, and indels, can cause differences. The ref seq cannot be absolutely correct either; it is just a guideline to help people with re-sequencing, assembly, and variant detection.

Please correct me if I said something wrong.

Your last paragraph is what I was trying to get at -- the reference is probably not from the same individual that was sequenced and therefore one should not expect that the reads have exact matches but rather near matches. Heck, even inside the same individual (or clonal population) the genome will vary a bit. Granted not much but some.
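
One simple way to look for "near matches" instead of exact ones is to count mismatches against every reference window, as in the sketch below. It is a brute-force illustration under stated assumptions: substitutions only (no indels), forward strand only, and a reference small enough to scan directly. The function names are hypothetical, and a real analysis would use an aligner.

```python
def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def best_mismatch_count(read, reference):
    """Smallest number of substitutions needed to place the read anywhere
    on the forward strand of the reference, or None if it cannot fit."""
    n = len(read)
    if n == 0 or n > len(reference):
        return None
    return min(
        hamming(read, reference[i:i + n])
        for i in range(len(reference) - n + 1)
    )
```

A read could then be called a near match if `best_mismatch_count` is at or below some chosen cutoff, where 0 corresponds to the exact-match definition of "error-free".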

On this forum it is hard to know the background of people's knowledge. We get the whole range from absolute novice to expert; from pure biology to pure computer science. I often find that the latter (the CS people ... and I am one of them) have a hard time transitioning from the exactness of computers to the messiness that is biology. It does sound like you know (or are learning quickly) about uncertainties in bioinformatics.

BTW: I am off on vacation for the next week thus if you do not see any further posts from me then it just means I am having fun away from my computer instead of ignoring you or SeqAnswers.

**jeffgao** · 06-20-2011, 07:34 AM

Sure. The biology story is far more complex than a piece of computer code. I have a computer science background with a little bit of biology, and I am trying to extend my knowledge to keep pace with the fast-growing field of bioinformatics.

Basically, I was just trying to explore the relationship between the frequency and the correctness (matches with the ref seq) of the "real" reads. I have to think carefully about how to define a "good" match.

Thank you very much again for your reply, and I hope you have a nice vacation!


What do "error-free" reads mean?
