  • repetitive/duplicate reads

    Hi,
    I am working on NGS data: paired-end reads in FASTQ format. I clipped the adapters (allowing a maximum of 1 mismatch) and quality-trimmed the reads, then analyzed both the original and the clipped/quality-trimmed files with FastQC. About 20% of the sequences are present more than 10 times (in both the original and the clipped/quality-trimmed reads). Using the ShortRead Bioconductor package in R, I could see that some of these reads indeed occur 300+ times. In FastQC, the graph seems pretty nice with the number of sequences that occur 1, 2, ... 9 times gradually decreasing to about 0 and then for 10+ repeats it rises to about 20%.

    Is this something to worry about? Or rather, how would one characterize this behavior? My understanding was that reads containing adapters were the main source of these repeats, so after removing the adapters (with 1 mismatch) the repeats should disappear or at least drop significantly.

    Thank you!

    PS: Just to be clear, by sequence repeats I mean the number of times the same sequence is found.
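
For readers who want to reproduce the kind of count described in this post outside of R, here is a minimal Python sketch. It assumes plain 4-line FASTQ records and exact-match duplicates; the function names `read_sequences` and `duplication_levels` are made up for illustration. FastQC samples and normalises its duplication plot differently, so the percentages will not match its report exactly.

```python
from collections import Counter


def read_sequences(fastq_lines):
    """Yield the sequence line of each record, assuming strict 4-line FASTQ."""
    for i, line in enumerate(fastq_lines):
        if i % 4 == 1:
            yield line.strip()


def duplication_levels(sequences):
    """Percentage of reads whose exact sequence occurs 1, 2, ..., 9 or 10+ times."""
    counts = Counter(sequences)
    total = sum(counts.values())
    levels = {str(k): 0 for k in range(1, 10)}
    levels["10+"] = 0
    if total == 0:
        return {k: 0.0 for k in levels}
    for n in counts.values():
        key = str(n) if n < 10 else "10+"
        levels[key] += n  # weight each distinct sequence by its read count
    return {k: 100.0 * v / total for k, v in levels.items()}


# Tiny in-memory example: 12 copies of one read, 2 of another, 1 singleton.
example = duplication_levels(["A"] * 12 + ["C"] * 2 + ["G"])
```

With a real file, something like `duplication_levels(read_sequences(open(path)))` would do the same thing, where `path` is whatever FASTQ file is being checked.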

  • #2
    What is your read length? Only one mismatch is not super strict if you have longer reads.

    You should be able to tell if there's still a lot of adapter in there: for example, the nucleotide distribution plot from FastQC would be spiky, and you might even be able to correlate the spikes with the sequence of your adapters. Also, if you can, take a manual look at some of the reads that are present in many copies and see whether their sequence is close to your adapter or whether something else is going on.
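
The "spiky nucleotide distribution" check can also be done by hand. The sketch below tabulates base counts per read position and flags positions dominated by a single base. The function names and the 0.6 threshold are arbitrary assumptions for illustration, not anything FastQC uses.

```python
from collections import Counter


def per_position_composition(sequences):
    """Return one Counter per read position with the bases observed there."""
    columns = []
    for seq in sequences:
        while len(columns) < len(seq):
            columns.append(Counter())
        for pos, base in enumerate(seq):
            columns[pos][base] += 1
    return columns


def biased_positions(columns, threshold=0.6):
    """List (1-based position, base, fraction) where one base exceeds `threshold`."""
    flagged = []
    for pos, counts in enumerate(columns, start=1):
        total = sum(counts.values())
        base, n = counts.most_common(1)[0]
        if n / total > threshold:
            flagged.append((pos, base, n / total))
    return flagged


# Toy input: positions 1 and 2 are strongly biased, positions 3 and 4 are not.
reads = ["ACGT", "ACGA", "ACTC", "AGAG"]
flagged = biased_positions(per_position_composition(reads))
```

If the flagged bases, read in order, spell out part of an adapter or barcode, that points to leftover adapter or barcode sequence.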



    • #3
      The read length is 84. After clipping the adapters, I used the ShortRead package to find the sequences that occur more than 10 times and checked them again for adapter sequences with 0 or 1 mismatch, but to no avail. Maybe I could run the adapter clipping again with 2 mismatches just to compare the results.

      Thank you.
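
A rough sketch of the kind of check described in this post: scanning a read for an adapter while tolerating a fixed number of substitutions. It only handles mismatches (no indels), and the `min_overlap` parameter, the function names and the adapter string in the example are placeholders, since the thread does not say which adapter or tool was used.

```python
def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def find_adapter(read, adapter, max_mismatches=1, min_overlap=8):
    """Return the 0-based start of the first adapter hit in `read`, or None.

    A hit is the full adapter inside the read, or an adapter prefix of at
    least `min_overlap` bases running off the 3' end, with at most
    `max_mismatches` substitutions.
    """
    for start in range(len(read)):
        window = read[start:start + len(adapter)]
        if len(window) < min(min_overlap, len(adapter)):
            break  # remaining overlap is too short to call
        if hamming(window, adapter[:len(window)]) <= max_mismatches:
            return start
    return None


# Placeholder adapter for illustration only.
ADAPTER = "AGATCGGAAGAGC"
read_with_error = "CCCCCCCCCC" + "AGATCGGTAGAGC"  # one substitution in the adapter
hit_strict = find_adapter(read_with_error, ADAPTER, max_mismatches=0)
hit_relaxed = find_adapter(read_with_error, ADAPTER, max_mismatches=1)
```

Comparing the hits at 0, 1 and 2 allowed mismatches on the highly duplicated reads is one way to see whether a looser setting would actually catch anything more.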



      • #4
        Originally posted by cedance
        In FastQC, the graph seems pretty nice with the number of sequences that occur 1, 2, ... 9 times gradually decreasing to about 0 and then for 10+ repeats it rises to about 20%.
        That doesn't sound like a particularly nice graph. A nice graph drops almost immediately to zero and stays there. What you're describing is a pervasive low level duplication. This isn't likely to be caused by any kind of contamination, but is more likely the result either of general oversequencing or of PCR duplication artefacts. It could be that you have a particularly enriched library where this would be expected but we'd need to know more about your experiment and preferably see the QC results to be able to comment more specifically.



        • #5
          Hi Simon, thanks for your reply.
          The data is from RNA-Seq of tomatoes. The FASTQ files contain reads of length 84 nucleotides. I am working on data from paired-end reads at the moment. Here, I attach the link to the zipped FastQC results of just the forward read (I think it's sufficient).

          It would be nice to know your interpretations.

          Thank you.



          • #6
            Cedance,

            I've looked at quite a few FastQC reports for mRNA-Seq runs from plants and, based on my experience, your duplication report doesn't look bad. Depending on the tissue or developmental stage, the transcript diversity in a plant can be low, so, as Simon suggested, you have probably reached the saturation point for sequencing.

            More concerning to me would be the drop-off in Q-scores at the ends of your reads. Based on that plot, plan on doing some quality-based trimming of your reads.
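
As an illustration of what quality-based trimming means, here is a deliberately simple sketch that removes low-quality bases from the 3' end of a single read. It assumes Phred+33 quality encoding and a cutoff of Q20, both of which are assumptions to adjust for the actual data; dedicated trimmers use more robust sliding-window or running-sum schemes.

```python
def trim_3prime(seq, qual, min_q=20, offset=33):
    """Drop bases from the 3' end until a base with Phred quality >= min_q is reached.

    `offset` is the ASCII offset of the quality encoding (33 assumed here).
    Low-quality bases inside the read are left untouched.
    """
    end = len(seq)
    while end > 0 and ord(qual[end - 1]) - offset < min_q:
        end -= 1
    return seq[:end], qual[:end]


# 'I' encodes Q40 and '#' encodes Q2 under Phred+33.
trimmed_seq, trimmed_qual = trim_3prime("ACGTACGT", "IIIII###")
```

For paired-end data, both mates need to be handled together so that pairs stay in sync after trimming.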



            • #7
              Originally posted by cedance
              Hi Simon, thanks for your reply.
              The data is from RNA-Seq of tomatoes. The FASTQ files contain reads of length 84 nucleotides. I am working on data from paired-end reads at the moment. Here, I attach the link to the zipped FastQC results of just the forward read (I think it's sufficient).

              It would be nice to know your interpretations.
              As KMCarr said the duplication doesn't look terrible - it's pretty low level and may just represent oversequencing of the most abundant transcripts in your library.

              The bigger concern (which you may easily be able to explain) is the strong initial bias in your sequences. Your first few bases show very strong bias - which is particularly obvious at position 4. Is this a barcoded sample? If not then you might have some kind of adapter contamination at the start of your sample.

              The quality is somewhat poor at the end of your sequence and you might want to trim the ends back a bit if you're going to assemble, but it's not too bad, and the overall per-read quality looks pretty good.



              • #8
                kmcarr, simonandrews,
                Yes, I provided the raw data, and yes, they are barcoded. I have clipped the adapters, trimmed for quality and for barcodes, and split the reads by barcode as well; I did not provide those files here. The duplication drops to less than 10% for 10+ reads in the individual barcode-split files. If you want to have a look at the individual paired-end reads, I can link them as well. I guess that since the sequenced reads were otherwise fine, it should be alright.

                Thank you once again!
