  • cedance
    Senior Member
    • Feb 2011
    • 108

    repetitive/duplicate reads

    Hi,
    I am working on NGS data: paired-end reads in FASTQ format. I clipped the adapters (allowing at most 1 mismatch) and quality-trimmed the reads, then analyzed both the original and the clipped/quality-trimmed files with FastQC. About 20% of the sequences are present more than 10 times, in both the original and the clipped/quality-trimmed reads. Using the ShortRead Bioconductor package in R, I could see that some of these reads indeed occur 300+ times. In FastQC, the graph seems pretty nice: the number of sequences that occur 1, 2, ... 9 times gradually decreases to about 0, and then for 10+ repeats it rises to about 20%.

    Is this something to worry about? Or rather, how would one characterize this behavior? My understanding was that sequences containing adapters were the main source of these repeats, so after removing the adapters (with 1 mismatch) there should be no repeated sequences, or at least significantly fewer.

    Thank you!

    PS: Just to be clear, here I mean sequence repeats as the number of times the same sequence is found.
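
For anyone who wants to reproduce this kind of count outside R, here is a minimal Python sketch. It assumes a plain, uncompressed FASTQ file with exactly four lines per record, and the function names are made up for illustration. It is not what ShortRead or FastQC do internally, just a direct exact-match count of identical sequences.

```python
from collections import Counter


def read_fastq_sequences(handle):
    """Yield the sequence line of each record (assumes 4 lines per record)."""
    for i, line in enumerate(handle):
        if i % 4 == 1:
            yield line.strip()


def highly_repeated(handle, min_count=10):
    """Return sequences seen at least min_count times and the
    fraction of all reads that those sequences account for."""
    counts = Counter(read_fastq_sequences(handle))
    total = sum(counts.values())
    repeated = {seq: n for seq, n in counts.items() if n >= min_count}
    fraction = sum(repeated.values()) / total if total else 0.0
    return repeated, fraction
```

Passing an open file handle (for example from `open()` on one of the paired FASTQ files) returns the repeated sequences with their counts, which can then be sorted to look at the most frequent ones by eye.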
  • gaffa
    Member
    • Oct 2010
    • 82

    #2
    What is your read length? Only one mismatch is not super strict if you have longer reads.

    You should probably be able to tell if there's still a lot of adapter in there. For example, the nucleotide distribution plot from FastQC would be spiky, and you might even be able to correlate the spikes with the sequence of your adapters. Also, if you can, take a manual look at some of the reads that are present in many copies and see if their sequence is close to your adapter or if there's something else going on.
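
A rough way to do that manual check in bulk is sketched below, assuming simple substitution-only (Hamming) matching and no indels. The function names, the `min_overlap` default, and the mismatch threshold are illustrative choices, not the behaviour of any particular clipping tool.

```python
def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def find_adapter(read, adapter, max_mismatches=1, min_overlap=8):
    """Return the leftmost position where the adapter, or an adapter
    prefix running off the 3' end of the read, matches with at most
    max_mismatches substitutions. Return -1 if there is no such position."""
    for start in range(len(read) - min_overlap + 1):
        window = read[start:start + len(adapter)]
        if hamming(window, adapter[:len(window)]) <= max_mismatches:
            return start
    return -1
```

Running this over the most frequent duplicated reads with `max_mismatches` set to 1 and then 2 shows whether a looser threshold would have caught more adapter. Note that a fixed mismatch count is more permissive on a short partial match at the read end than on a full-length match.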

    • cedance
      Senior Member
      • Feb 2011
      • 108

      #3
      The read length is 84. After clipping the adapters, I used the ShortRead package to find the sequences that occur more than 10 times and checked them again for adapter sequences with 0 or 1 mismatch, but to no avail. Maybe I could run the adapter clipping again with 2 mismatches, just to compare the results.

      Thank you.

      • simonandrews
        Simon Andrews
        • May 2009
        • 870

        #4
        Originally posted by cedance
        In FastQC, the graph seems pretty nice, with the number of sequences that occur 1, 2, ... 9 times gradually decreasing to about 0 and then rising to about 20% for 10+ repeats.
        That doesn't sound like a particularly nice graph. A nice graph drops almost immediately to zero and stays there. What you're describing is a pervasive low-level duplication. This isn't likely to be caused by any kind of contamination, but is more likely the result of either general oversequencing or PCR duplication artefacts. It could be that you have a particularly enriched library where this would be expected, but we'd need to know more about your experiment, and preferably see the QC results, to be able to comment more specifically.
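
To make the shape of that graph concrete, here is a simplified sketch that bins reads by how many times their exact sequence occurs (1, 2, ... 9, 10+). It is only an illustration under the assumption of exact full-length matching over all reads. FastQC's own duplication module takes shortcuts, so its percentages will not match this exactly, and the function name is hypothetical.

```python
from collections import Counter


def duplication_levels(sequences, cap=10):
    """Return {level: percent of reads} for levels 1..cap, where the
    last level collects every sequence seen cap or more times."""
    per_sequence = Counter(sequences)
    total = sum(per_sequence.values())
    if total == 0:
        return {}
    reads_at_level = Counter()
    for n in per_sequence.values():
        # A sequence seen n times contributes n reads to its level.
        reads_at_level[min(n, cap)] += n
    return {level: 100.0 * reads_at_level[level] / total
            for level in range(1, cap + 1)}
```

A library with little duplication puts nearly everything at level 1. The pattern described in this thread would show small values trailing off through levels 2 to 9 and then a jump in the 10+ bin.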

        • cedance
          Senior Member
          • Feb 2011
          • 108

          #5
          Hi Simon, thanks for your reply.
          The data is from RNA-Seq of tomatoes. The FASTQ files contain reads of length 84 nucleotides. I am working on data from paired-end reads at the moment. Here, I attach the link to zipped FastQC results of just the forward read (I think it's sufficient).

          It would be nice to know your interpretations.

          Thank you.

          • kmcarr
            Senior Member
            • May 2008
            • 1181

            #6
            Cedance,

            I've looked at quite a few FastQC reports for mRNA-Seq runs from plants, and based on my experience your duplication report doesn't look bad. Depending on the tissue or developmental stage, the transcript diversity in a plant can be low, so, as Simon suggested, you have probably reached the saturation point for sequencing.

            More concerning to me would be the drop-off in Q-scores at the ends of your reads. Based on that plot, plan on doing some quality-based trimming of your reads.
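
As a rough illustration of what quality-based end trimming means, the sketch below removes trailing bases whose Phred score falls under a threshold. The threshold of 20 and the Phred+33 offset are assumptions. Some older Illumina pipelines wrote Phred+64, so check the encoding first. Real trimmers usually use windowed or running-sum criteria instead of this simple rule, and the function name is made up.

```python
def trim_3prime(seq, qual, min_q=20, offset=33):
    """Drop bases from the 3' end while their quality is below min_q.
    seq and qual are the sequence and quality strings of one read."""
    end = len(seq)
    while end > 0 and ord(qual[end - 1]) - offset < min_q:
        end -= 1
    return seq[:end], qual[:end]
```

For paired-end data both mates need to be kept in sync afterwards, for example by dropping a pair when either mate becomes too short.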

            • simonandrews
              Simon Andrews
              • May 2009
              • 870

              #7
              Originally posted by cedance
              Hi Simon, thanks for your reply.
              The data is from RNA-Seq of tomatoes. The FASTQ files contain reads of length 84 nucleotides. I am working on data from paired-end reads at the moment. Here, I attach the link to zipped FastQC results of just the forward read (I think it's sufficient).

              It would be nice to know your interpretations.
              As kmcarr said, the duplication doesn't look terrible - it's pretty low level and may just represent oversequencing of the most abundant transcripts in your library.

              The bigger concern (which you may easily be able to explain) is the strong initial bias in your sequences. Your first few bases show very strong bias - which is particularly obvious at position 4. Is this a barcoded sample? If not then you might have some kind of adapter contamination at the start of your sample.
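
If you want to look at that bias numerically, a per-position base composition over the first few cycles can be tallied as in the sketch below. This is a minimal illustration with a hypothetical function name, assuming the sequences are already in memory as plain strings. It is not FastQC's implementation.

```python
from collections import Counter


def base_composition(sequences, n_positions=10):
    """Return one dict per position (0-based) giving the fraction of
    A, C, G, T and N among reads long enough to cover that position."""
    counts = [Counter() for _ in range(n_positions)]
    for seq in sequences:
        for i, base in enumerate(seq[:n_positions]):
            counts[i][base] += 1
    result = []
    for c in counts:
        total = sum(c.values())
        result.append({b: c[b] / total for b in "ACGTN"} if total else {})
    return result
```

A position where one base sits far from the roughly even mix seen later in the read, such as the fourth base here (index 3 in the returned list), is what you would expect from an inline barcode or leftover adapter at the read start.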

              The quality is somewhat poor at the end of your sequence and you might want to trim the ends back a bit if you're going to assemble, but it's not too bad, and the overall per-read quality looks pretty good.

              • cedance
                Senior Member
                • Feb 2011
                • 108

                #8
                kmcarr, simonandrews,
                Yes, I provided the raw data. And yes, those are barcodes. I have clipped the adapters, trimmed for quality and for barcodes, and split the reads by barcode as well; I did not provide those files here. In the individual barcode-split files, the duplication drops to less than 10% for the 10+ level. If you would like to have a look at the individual paired-end reads, I can link them as well. I guess that since the sequenced reads were otherwise fine, it should be all right.

                Thank you once again!
