
  • Sequence Duplication Levels failure


    Good [morning | afternoon | evening | night]

    I used FastQC to check the quality of my data. At first, four modules failed (Per base sequence content, Per base GC content, Per sequence GC content and Sequence duplication levels). I noticed that most of the errors were due to the first 9 bases, so I trimmed them with Trimmomatic. After that, I still get failures in Per sequence GC content and Sequence duplication levels.

    For Per sequence GC content, the GC content is higher than normal.

    For Sequence duplication levels, the graph rises after 9.

    (1) What should I do about them? Is this due to contamination?

    By the way, my "Sequence duplication levels" plot has only one red line and no blue line. (2) Why is that? Is it related to the version? My FastQC is version v0.10.1.

    I attached both results in a pdf file.

    (3) I know Trimmomatic removes noise, but how much can I trim my sequences without affecting my downstream analysis? (Of course I could cut a 90 bp read down to 20 bp, but then further analysis would not be reliable, for example using Cufflinks to measure differential gene expression.) So what is the limit for trimming?

    I am so sorry for so many questions.

    Thank you in advance for helping me

  • #2
    FastQC frequently worries people when there's no need to worry, and doesn't always point out the things that are most important. I've got a few questions:
    • Are these RNA reads?
    • What is the expected GC fraction of your target genome?
    • How much DNA was present in the sample?
    • Have spike-ins (e.g. ERCC, lambda) been used?
    • What are the overrepresented sequences?

    In a best-case scenario, the double peak in the GC graph and the over-represented sequences could be explained by a spike-in taking up a large proportion of the reads, which would happen if the DNA hadn't been accurately quantified. Alternatively, a targeted sequencing of multiple genes might produce a similar effect.
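
    If you want to look at the GC distribution outside of FastQC, here is a minimal sketch that tallies the GC percentage of each read. It assumes an uncompressed FASTQ with exactly four lines per record; the function names and the tiny example reads are made up for illustration and are not part of FastQC.

```python
import io
from collections import Counter


def read_gc_percent(sequence):
    """Return the GC percentage of one read, ignoring N calls (None if no called bases)."""
    seq = sequence.upper()
    called = sum(seq.count(base) for base in "ACGT")
    if called == 0:
        return None
    gc = seq.count("G") + seq.count("C")
    return 100.0 * gc / called


def gc_histogram(fastq_handle):
    """Count reads per rounded GC percentage from a 4-line-per-record FASTQ handle."""
    counts = Counter()
    for line_number, line in enumerate(fastq_handle):
        if line_number % 4 != 1:  # the sequence is the 2nd line of each record
            continue
        gc = read_gc_percent(line.strip())
        if gc is not None:
            counts[round(gc)] += 1
    return counts


# Hypothetical toy input: three short reads, not real data
example = io.StringIO(
    "@r1\nACGT\n+\nIIII\n"
    "@r2\nGGCC\n+\nIIII\n"
    "@r3\nAACC\n+\nIIII\n"
)
histogram = gc_histogram(example)
for gc_percent in sorted(histogram):
    print(gc_percent, histogram[gc_percent])
```

    On real data you would pass an open file handle instead of the toy `io.StringIO`. A second peak away from the main one would be consistent with a second population of reads, such as a spike-in or a contaminant.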


    • #3
      These are cDNA reads (made from RNA)
      I don't know the expected GC fraction of target genome (The data is for someone else and I should analyze it and enhance it).
      No spike-ins were used.
      There are three overrepresented sequences:


      • #4
        Well, a BLAST of all those sequences returns 100% identity matches to chloroplast genomes (probably rice).

        My guess is that what you're seeing here is cDNA reads that haven't been properly depleted for high-abundance transcripts, so there is a large amount of contaminant sequences in the data. My ball-park assumption from looking at the GC graph would be that there is about 30% chloroplast sequence in there.

        If at all possible, I'd recommend that your collaborator re-sequences these samples including a RiboZero preparation.

        Otherwise, run a mapping only to the chloroplast sequence of the target (e.g. Oryza sativa) and exclude those sequences (e.g. HISAT2 has "--un-conc" and "--un" options for doing precisely that), then re-run FastQC to see if it changes things. Even with that 30% contamination (assuming it's expected), you still should get reasonable results.
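
        As a rough sketch of that filtering step, the Python below only assembles the two command lines (build a chloroplast-only index, then map and keep the reads that do not align). All file names and the index prefix are hypothetical placeholders, the thread count is arbitrary, and writing the alignments to `/dev/null` assumes a Unix-like system. Nothing is executed until you uncomment the `subprocess.run` lines.

```python
import shlex


def hisat2_build_command(chloroplast_fasta, index_prefix):
    """Command to build a HISAT2 index from the chloroplast reference only."""
    return ["hisat2-build", chloroplast_fasta, index_prefix]


def hisat2_filter_command(index_prefix, reads_1, reads_2=None,
                          unaligned_prefix="non_chloroplast", threads=4):
    """Command to map reads to the chloroplast index and keep the unaligned reads."""
    command = ["hisat2", "-p", str(threads), "-x", index_prefix]
    if reads_2 is None:
        # Single-end: --un writes reads that failed to align
        command += ["-U", reads_1, "--un", unaligned_prefix + ".fastq"]
    else:
        # Paired-end: --un-conc writes pairs that failed to align concordantly;
        # HISAT2 replaces '%' in the path with 1 or 2 for the two mates
        command += ["-1", reads_1, "-2", reads_2,
                    "--un-conc", unaligned_prefix + ".%.fastq"]
    # The chloroplast alignments themselves are not needed here
    command += ["-S", "/dev/null"]
    return command


# Hypothetical file names, for illustration only
build = hisat2_build_command("rice_chloroplast.fa", "rice_cp_index")
filt = hisat2_filter_command("rice_cp_index", "sample_R1.fastq", "sample_R2.fastq")
print(shlex.join(build))
print(shlex.join(filt))

# To actually run them (requires HISAT2 on your PATH):
# import subprocess
# subprocess.run(build, check=True)
# subprocess.run(filt, check=True)
```

        The resulting `non_chloroplast` FASTQ files are what you would feed back into FastQC and into your normal mapping to the rice genome.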


        • #5
          Your answer surprised me. Yep, it's rice (Oryza sativa). And the way you found the source of contamination made me excited. Smart answer!

          So now I should find the rice chloroplast sequence and then exclude it from the reads, but I don't know how to do that with HISAT as you mentioned. I have to learn it first.

          Thank you~Thank you~Thank you


          • #6
            Originally posted by Saeideh
            And the way you found the source of contamination made me excited.
            Yes, BLAST is very useful. I'm glad that NCBI still provides a service for "where is this sequence from", despite all the newer locally-faster search tools that are available.

            I don't know how to do it with HISAT as you mentioned. I have to learn it first.
            Learning HISAT2 would be a good idea, as it's the latest in a new generation of ultra-fast mappers, and has almost identical command-line parameters to Bowtie2. Another option would be STAR, which has a really great manual and might be easier to pick up and use as a naive high-throughput sequencing bioinformatician.

