  • What do "error-free" reads mean?

    Hello everyone! I am a newbie in the NGS field and need your help.

    I was looking at part of the short-jump library from a Staphylococcus aureus study, ~1.7 million reads (~20x the genome size) generated on an Illumina GAII, and got confused while trying to find out how many of these reads are "error-free".

    I don't quite understand the definition of "error-free" reads. I think "error-free" should describe the highest read quality, meaning the reads output by the Illumina instrument are exactly the same as the input fragments. But how can I verify this? To determine whether a short read is "error-free", should I (1) align it back to the known reference genome sequence(s) and look for a perfect local match, or (2) check that the quality scores of all its bases are above a certain threshold?

    For (1), I tried to align several very high-frequency short reads (>3000 copies, such as @SRR022865.8852) against the reference genome sequences (NC_010079, NC_010063.1, and NC_012417.1), and I failed to find any perfect matches. I thought "error-free" reads should show up in their reference sequences, but I didn't see any.
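A minimal sketch of option (1), assuming each reference record has already been loaded into a plain Python string. One thing worth checking in a test like this is the reverse strand: a read can come from either strand, so it may only match after reverse-complementing. The function names and the toy sequences are hypothetical, not taken from the S. aureus data set.

```python
# Sketch: does a read occur exactly in a reference, on either strand?
# All sequences here are toy data chosen for illustration.

COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of an upper-case DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def is_exact_match(read: str, reference: str) -> bool:
    """True if the read, or its reverse complement, is a substring of the reference."""
    read = read.upper()
    return read in reference or reverse_complement(read) in reference


reference = "ACGTTGCATGCCGATAGCTAGGCTA"
forward_read = "GCATGCCG"                      # copied from the forward strand
reverse_read = reverse_complement("CGATAGCT")  # read taken from the reverse strand
mutated_read = "GCATGACG"                      # forward_read with one substitution

for name, read in [("forward", forward_read),
                   ("reverse", reverse_read),
                   ("mutated", mutated_read)]:
    print(name, is_exact_match(read, reference))
```

With several reference records (for example a chromosome plus plasmids), the same check would be repeated against each record in turn.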

    The reads data set (and description file) is freely available at,



    I used the above dataset from the following website,


    Please forgive me for any naive questions. Thanks very much!

  • #2
    I must have overlooked the difference between real reads and simulated reads. The term "error-free" only describes simulated reads that are used to evaluate the performance of sequence assemblers. It seems there are no absolutely error-free "real" reads, although the error correction procedures in sequence assemblers can correct many, but not all, of these errors (e.g., substitutions in Illumina data). Please correct me if I am wrong again.

    • #3
      Does GAGE use the term "error free" on their web page? I couldn't find it. In any case I would suspect that there are many 'absolutely error-free "real" reads' -- especially if you trim off poor quality bases from either end -- in an NGS data set. In other words "error free" should mean your (2) ... look at the quality scores.

      As for (1) and your 1:32 PM message, I suggest that you turn your thinking around and ask: why does the reference sequence -- which is a computer file full of ACGTs and in many ways a simulation -- not match my real-world sample?

      Note that many of the reads you are using are exact matches to each other. This would mean that they are either PCR duplications or "real" reads. I.e., "output reads [that are] the exact same as the input fragments". I suspect the latter.
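The quality-score view, option (2), can be sketched as follows. This assumes Phred+33 encoded quality strings and an arbitrary cutoff of Q20; both are assumptions, since older Illumina pipelines wrote Phred+64 and the right cutoff depends on the analysis. The function names and the example read are hypothetical.

```python
# Sketch: decode FASTQ quality strings, trim poor-quality ends, and apply a threshold.
# Assumes Phred+33 encoding; pass offset=64 for old Illumina-style files.


def phred_scores(quality: str, offset: int = 33) -> list:
    """Convert a FASTQ quality string into a list of integer Phred scores."""
    return [ord(ch) - offset for ch in quality]


def trim_low_quality_ends(seq: str, quality: str, min_q: int = 20, offset: int = 33):
    """Remove bases below min_q from both ends; interior bases are left untouched."""
    scores = phred_scores(quality, offset)
    start, end = 0, len(scores)
    while start < end and scores[start] < min_q:
        start += 1
    while end > start and scores[end - 1] < min_q:
        end -= 1
    return seq[start:end], quality[start:end]


def passes_quality(quality: str, min_q: int = 20, offset: int = 33) -> bool:
    """True if the read is non-empty and every base has a score of at least min_q."""
    return bool(quality) and all(q >= min_q for q in phred_scores(quality, offset))


seq, qual = "ACGTACGT", "##IIII#!"   # '#' is Q2, 'I' is Q40, '!' is Q0
trimmed_seq, trimmed_qual = trim_low_quality_ends(seq, qual)
print(trimmed_seq, passes_quality(qual), passes_quality(trimmed_qual))
```

A read that passes such a filter is only likely to be error-free; the scores are probability estimates, not a guarantee.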

      • #4
        Originally posted by westerman
        Does GAGE use the term "error free" on their web page? I couldn't find it. In any case I would suspect that there are many 'absolutely error-free "real" reads' -- especially if you trim off poor quality bases from either end -- in an NGS data set. In other words "error free" should mean your (2) ... look at the quality scores.

        As for (1) and your 1:32 PM message, I suggest that you turn your thinking around and ask: why does the reference sequence -- which is a computer file full of ACGTs and in many ways a simulation -- not match my real-world sample?

        Note that many of the reads you are using are exact matches to each other. This would mean that they are either PCR duplications or "real" reads. I.e., "output reads [that are] the exact same as the input fragments". I suspect the latter.
        Hello Westerman,

        Thank you very much for your reply!

        No, GAGE doesn't mention the term "error free". I saw this term in an EULER paper (see below):

        "We measure the rate of error correction by measuring the number of error-free reads, that is, reads that appear exactly in the genome (68% of the reads error-free)". cited from Chaisson MJ, Pevzner PA, "Short read fragment assembly of bacterial genomes", Genome Research Jan 2008. http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2203630/

        Yes, I agree there are some 'error-free' real reads, and I just found some of them in the reference genome sequences (I am assuming the reference sequences are correct, since that is how the term 'error-free' is defined).

        In the first post, I thought that reads with high frequency (those that appear many times in the FASTQ file) might come from repeat regions and so might have a better chance of being 'error-free' (that is, of mapping perfectly back to the reference sequences). So I tested several of the most frequent reads (frequency > 3000), but I didn't find any matches in the reference sequences.

        I was confused at first, but it is not surprising, since such high-frequency reads are only a tiny proportion of the whole data set (~1.7 million reads). 88% of all reads are unique; among the other 12%, which appear more than once, only around 1000 reads have a frequency above 10. This number is small and, just as you said, such duplicates may be introduced during library preparation (PCR duplication, etc.). Although I didn't find any matches for the most frequent reads, I did find matches for other reads in the reference sequences (my code is still running, so I cannot give a number right now). I believe there will be more matches once all reads are trimmed by quality score.
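The frequency figures above could be gathered with a simple counter over read sequences. This is only a sketch on toy data: `frequency_summary` and its `high_freq` parameter are hypothetical names, and reverse-complement duplicates are not collapsed.

```python
# Sketch: how often does each read sequence occur in a data set?
from collections import Counter


def frequency_summary(reads, high_freq=10):
    """Summarise duplication: distinct sequences, singletons, and frequent sequences."""
    counts = Counter(reads)
    total = len(reads)
    seen_once = sum(1 for c in counts.values() if c == 1)
    frequent = sorted(seq for seq, c in counts.items() if c > high_freq)
    return {
        "total_reads": total,
        "distinct_sequences": len(counts),
        "fraction_seen_once": seen_once / total if total else 0.0,
        "high_frequency_sequences": frequent,
    }


# Toy input: one sequence seen three times and two singletons.
toy_reads = ["ACGT", "ACGT", "ACGT", "TTGA", "CCAT"]
summary = frequency_summary(toy_reads, high_freq=2)
print(summary)
```

For a real FASTQ file the list of sequences would come from every second line of each four-line record; a `Counter` over ~1.7 million short reads should fit comfortably in memory.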

        In addition, a difference between a read and the reference sequence doesn't necessarily mean the read is incorrect. Many things, such as polymorphisms, SNPs, and indels, can cause differences. The reference sequence cannot be absolutely correct either; it is just a guideline to help people with re-sequencing, assembly, and variant detection.

        Please correct me if I said something wrong.

        • #5
          Originally posted by jeffgao
          In addition, a difference between a read and the reference sequence doesn't necessarily mean the read is incorrect. Many things, such as polymorphisms, SNPs, and indels, can cause differences. The reference sequence cannot be absolutely correct either; it is just a guideline to help people with re-sequencing, assembly, and variant detection.

          Please correct me if I said something wrong.
          Your last paragraph is what I was trying to get at -- the reference is probably not from the same individual that was sequenced and therefore one should not expect that the reads have exact matches but rather near matches. Heck, even inside the same individual (or clonal population) the genome will vary a bit. Granted not much but some.
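One way to make "near match" concrete is to allow a small number of substitutions when comparing a read with each window of the reference. The sketch below is a naive scan on toy data, with hypothetical names and an arbitrary limit of two mismatches. It handles substitutions only; indels would need a real aligner or an edit-distance method, and a real genome would need an index rather than a linear scan.

```python
# Sketch: find the best placement of a read in a reference, allowing substitutions.


def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def best_near_match(read: str, reference: str, max_mismatches: int = 2):
    """Return (position, mismatches) of the best window, or None if none qualifies.

    Ties are resolved in favour of the leftmost position.
    """
    best = None
    n = len(read)
    for pos in range(len(reference) - n + 1):
        d = hamming(read, reference[pos:pos + n])
        if d <= max_mismatches and (best is None or d < best[1]):
            best = (pos, d)
            if d == 0:
                break  # an exact match cannot be improved on
    return best


reference = "ACGTTGCATGCCGATAGCTAGGCTA"
print(best_near_match("GCATGCCG", reference))  # exact copy of a reference window
print(best_near_match("GCATGACG", reference))  # same window with one substitution
print(best_near_match("TTTTTTTT", reference))  # nothing within two mismatches
```

Whether one or two mismatches reflect a sequencing error or a genuine variant cannot be decided from a single read; that takes coverage from many reads at the same position.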

          On this forum it is hard to know the background of people's knowledge. We get the whole range from absolute novice to expert; from pure biology to pure computer science. I often find that the latter (the CS people ... and I am one of them) have a hard time transitioning from the exactness of computers to the messiness that is biology. It does sound like you know (or are learning quickly) about uncertainties in bioinformatics.

          BTW: I am off on vacation for the next week thus if you do not see any further posts from me then it just means I am having fun away from my computer instead of ignoring you or SeqAnswers.

          • #6
            Originally posted by westerman
            Your last paragraph is what I was trying to get at -- the reference is probably not from the same individual that was sequenced and therefore one should not expect that the reads have exact matches but rather near matches. Heck, even inside the same individual (or clonal population) the genome will vary a bit. Granted not much but some.

            On this forum it is hard to know the background of people's knowledge. We get the whole range from absolute novice to expert; from pure biology to pure computer science. I often find that the latter (the CS people ... and I am one of them) have a hard time transitioning from the exactness of computers to the messiness that is biology. It does sound like you know (or are learning quickly) about uncertainties in bioinformatics.

            BTW: I am off on vacation for the next week thus if you do not see any further posts from me then it just means I am having fun away from my computer instead of ignoring you or SeqAnswers.
            Sure. The biology story is far more complex than a piece of computer code. I have a computer science background with a little bit of biology, and I am trying to extend the boundaries of my knowledge to keep pace with the fast-growing field of bioinformatics.

            Basically, I was just trying to explore the relationship between the frequency and correctness (matches with ref seq) of the "real" reads. I have to carefully think about how to define a "good" match.

            Thank you very much for your reply again, and I hope you have a nice vacation!
