  • Sequence Duplication Levels failure

    Hi,

    Good [morning | afternoon | evening | night]

    I used FastQC to check the quality of my data. At first, four modules failed: Per base sequence content, Per base GC content, Per sequence GC content and Sequence duplication levels. I noticed that most of the problems came from the first 9 bases, so I trimmed them with Trimmomatic. After that, two modules still fail: Per sequence GC content and Sequence duplication levels.

    For Per sequence GC content, the GC content is higher than the expected distribution.

    For Sequence duplication levels, the graph rises after duplication level 9.

    (1) What should I do about these failures? Are they due to contamination?

    By the way, my "Sequence duplication levels" plot has only one red line and no blue line. (2) Why is that? Is it related to the version? My FastQC is version v0.10.1.

    I attached both results in a pdf file.

    (3) I know Trimmomatic removes noisy bases, but how much can I trim my sequences without affecting my downstream analysis? (Of course I could cut a 90 bp read down to 20 bp, but the result would not be reliable for further analysis, for example for measuring differential gene expression with Cufflinks.) So what is the limit for trimming?

    I am so sorry for so many questions.

    Thank you in advance for helping me
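
    For reference, the trimming described in this post (dropping the first 9 bases of every read) can be sketched in plain Python. This is only an illustration of the idea, not Trimmomatic's implementation; the function name `head_crop` and the minimum-length cutoff of 36 bases are assumptions chosen for the example, not values from the thread.

    ```python
    from typing import Optional, Tuple


    def head_crop(seq: str, qual: str, n: int = 9,
                  min_len: int = 36) -> Optional[Tuple[str, str]]:
        """Remove the first n bases of a read and its quality string.

        Returns None when the trimmed read is shorter than min_len
        (min_len=36 is an illustrative threshold, not a recommendation).
        """
        if len(seq) != len(qual):
            raise ValueError("sequence and quality strings differ in length")
        trimmed_seq, trimmed_qual = seq[n:], qual[n:]
        if len(trimmed_seq) < min_len:
            return None
        return trimmed_seq, trimmed_qual


    # A made-up 90-base read with uniform quality characters.
    read_seq = "ACGT" * 22 + "AC"
    read_qual = "I" * len(read_seq)
    trimmed = head_crop(read_seq, read_qual)
    print(len(read_seq), "->", len(trimmed[0]))
    ```

    Tracking how many reads fall below the minimum length after each trimming setting is one simple way to see how aggressive a setting is before running the downstream analysis.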

  • #2
    FastQC frequently worries people when there's no need to worry, and doesn't always point out the things that are most important. I've got a few questions:
    • Are these RNA reads?
    • What is the expected GC fraction of your target genome?
    • How much DNA was present in the sample?
    • Have spike-ins (e.g. ERCC, lambda) been used?
    • What are the overrepresented sequences?


    In a best-case scenario, the double peak in the GC graph and the over-represented sequences could be explained by a spike-in taking up a large proportion of the reads, which would happen if the DNA hadn't been accurately quantified. Alternatively, a targeted sequencing of multiple genes might produce a similar effect.
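
    To look at the double peak outside FastQC, a per-read GC histogram can be computed directly from the FASTQ file. The sketch below is a minimal illustration that assumes unwrapped four-line FASTQ records; the function names and the 5% bin width are choices made for this example, and the two demo reads are made up.

    ```python
    import io
    from collections import Counter


    def gc_percent(seq: str) -> float:
        """GC percentage over called bases (A, C, G, T); N is ignored."""
        seq = seq.upper()
        called = sum(seq.count(base) for base in "ACGT")
        if called == 0:
            return 0.0
        return 100.0 * (seq.count("G") + seq.count("C")) / called


    def gc_histogram(handle, bin_width: int = 5) -> Counter:
        """Count reads per GC bin; keys are the lower edge of each bin."""
        hist = Counter()
        for i, line in enumerate(handle):
            if i % 4 == 1:  # the sequence line of each four-line record
                pct = gc_percent(line.strip())
                # Reads with exactly 100% GC go into the top bin.
                bin_start = min(int(pct // bin_width) * bin_width,
                                100 - bin_width)
                hist[bin_start] += 1
        return hist


    # Two made-up reads: one with 0% GC and one with 75% GC.
    demo = io.StringIO("@r1\nATATATAT\n+\nIIIIIIII\n"
                       "@r2\nGCGCGCAT\n+\nIIIIIIII\n")
    hist = gc_histogram(demo)
    for bin_start in sorted(hist):
        print(f"{bin_start:3d}-{bin_start + 5:3d}%  {hist[bin_start]}")
    ```

    On real data, pass an open file handle instead of the `io.StringIO` demo. Two separate clusters of well-populated bins would correspond to the double peak discussed above.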



    • #3
      These are cDNA reads (made from RNA)
      I don't know the expected GC fraction of target genome (The data is for someone else and I should analyze it and enhance it).
      No spike-ins were used.
      There are three overrepresented sequences:
      1. CGCTCGCCGCTACTACGGGAATCGCTTTTGCTTTCTTTTCCTCTGGCTAC
      2. GATACCTAGGTACCCAGAGACGAGGAAGGGCGTAGCAAGCGACGAAATGC
      3. TGGATACCTAGGTACCCAGAGACGAGGAAGGGCGTAGCAAGCGACGAAAT
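
      A quick check that can be run on the three sequences listed above before searching them against a database: compute their GC fraction and test whether any of them overlap one another. This is a minimal sketch; the helper names and the 20-base minimum overlap are assumptions made for the example.

      ```python
      OVERREPRESENTED = [
          "CGCTCGCCGCTACTACGGGAATCGCTTTTGCTTTCTTTTCCTCTGGCTAC",
          "GATACCTAGGTACCCAGAGACGAGGAAGGGCGTAGCAAGCGACGAAATGC",
          "TGGATACCTAGGTACCCAGAGACGAGGAAGGGCGTAGCAAGCGACGAAAT",
      ]


      def gc_fraction(seq: str) -> float:
          """Fraction of G and C bases in a sequence."""
          return (seq.count("G") + seq.count("C")) / len(seq)


      def overlap_offset(a: str, b: str, min_overlap: int = 20):
          """Smallest k such that a[k:] is a prefix of b, else None.

          Only overlaps of at least min_overlap bases are considered
          (20 is an arbitrary illustrative cutoff).
          """
          for k in range(len(a) - min_overlap + 1):
              if b.startswith(a[k:]):
                  return k
          return None


      for i, seq in enumerate(OVERREPRESENTED, start=1):
          print(f"sequence {i}: length {len(seq)}, GC {gc_fraction(seq):.2f}")

      # Sequence 3 shifted by two bases matches the start of sequence 2.
      print(overlap_offset(OVERREPRESENTED[2], OVERREPRESENTED[1]))
      ```

      Sequences 2 and 3 are the same stretch offset by two bases, so they most likely come from a single abundant source rather than two independent ones.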



      • #4
        Well, a BLAST of all those sequences returns 100% identity matches to chloroplast genomes (probably rice).

        My guess is that what you're seeing here is cDNA reads that haven't been properly depleted for high-abundance transcripts, so there is a large amount of contaminant sequences in the data. My ball-park assumption from looking at the GC graph would be that there is about 30% chloroplast sequence in there.

        If at all possible, I'd recommend that your collaborator re-sequence these samples with a RiboZero depletion step included in the library preparation.


        Otherwise, run a mapping only to the chloroplast sequence of the target (e.g. Oryza sativa) and exclude those sequences (e.g. HISAT2 has "--un-conc" and "--un" options for doing precisely that), then re-run FastQC to see if it changes things. Even with that 30% contamination (assuming it's expected), you still should get reasonable results.
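
        The filtering step could be set up roughly as follows. This is a hedged sketch that only assembles the command lines; it does not run them. All file names and the index name are hypothetical placeholders, the thread count is arbitrary, and the exact naming of the `--un-conc` output files should be checked against the HISAT2 manual.

        ```python
        import shlex


        def hisat2_build_cmd(chloroplast_fasta: str, index_base: str) -> list:
            """Command that builds a HISAT2 index from the chloroplast FASTA."""
            return ["hisat2-build", chloroplast_fasta, index_base]


        def hisat2_filter_cmd(index_base: str, out_prefix: str, reads1: str,
                              reads2: str = None, threads: int = 4) -> list:
            """Command that maps reads to the chloroplast index and keeps
            the reads that do not align.

            Single-end input uses --un; paired-end input uses --un-conc,
            which keeps pairs that fail to align concordantly.
            """
            kept = f"{out_prefix}.non_chloroplast.fastq"
            cmd = ["hisat2", "-p", str(threads), "-x", index_base]
            if reads2 is None:
                cmd += ["-U", reads1, "--un", kept]
            else:
                cmd += ["-1", reads1, "-2", reads2, "--un-conc", kept]
            # Alignments to the chloroplast go to a SAM file for inspection.
            cmd += ["-S", f"{out_prefix}.chloroplast.sam"]
            return cmd


        # Hypothetical file names, for illustration only.
        build = hisat2_build_cmd("rice_chloroplast.fa", "rice_cp")
        filt = hisat2_filter_cmd("rice_cp", "sample1",
                                 "sample1_R1.fastq", "sample1_R2.fastq")
        print(shlex.join(build))
        print(shlex.join(filt))
        ```

        The printed commands can be pasted into a shell, or the lists passed to `subprocess.run`, once HISAT2 is installed. The reads written by `--un` or `--un-conc` are the ones to feed back into FastQC.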



        • #5
          Your answer surprised me. Yes, the samples are from rice (Oryza sativa). The way you found the source of the contamination got me excited. Smart answers!

          So now I should find the rice chloroplast sequence and then exclude matching reads from my data. But I don't know how to do that with HISAT as you mentioned, so I have to learn it first.

          Thank you~Thank you~Thank you



          • #6
            Originally posted by Saeideh:
            "And the way you found the source of contamination made me excited."
            Yes, BLAST is very useful. I'm glad that NCBI still provides a service for "where is this sequence from", despite all the newer locally-faster search tools that are available.

            "I don't know how to do it with HISAT as you mentioned. I have to learn it first."
            Learning HISAT2 would be a good idea, as it's the latest in a new generation of ultra-fast mappers, and has almost identical command-line parameters to Bowtie2. Another option would be STAR, which has a really great manual and might be easier to pick up and use as a naive high-throughput sequencing bioinformatician.
