  • Shotgun Metagenomics of Environmental Samples: Per Base Sequence Content and Per Sequence GC Content failed after trimming

    Dear all,

    I am very new to analyzing shotgun metagenomics data, and I ran into some issues when I checked the quality of my data. I am posting my concerns here in the hope that someone can help.

    DNA samples: Genomic DNA isolated from environmental samples (soil, sewage, or freshwater). We are interested in the community structures of bacteria and archaea in those samples as well as detecting functional genes.

    Sequencing platform: Illumina, Shallow Metagenomics, Shotgun sequencing of DNA, Paired-end sequencing

    Library: Nextera kits (I got this information when running TrimGalore!)

    Concern-1: Per Base Sequence Content
    Before trimming, I checked the quality of the raw data using FastQC + MultiQC. Many samples failed the Per Base Sequence Content test with biased composition at the 5' end (see the attached Per Base Sequence Content-No trimming.jpg), and all samples failed the Adapter Content test (see the attached Adapter Content--No trimming.jpg). I therefore decided to trim the 5' end by removing 15 bp from each read and also to trim the adapters. I trimmed all the raw reads with TrimGalore! using the following command:
    ===============
    ~/TrimGalore-0.6.5/trim_galore --clip_R1 15 --clip_R2 15 --paired read_1_sample_1.fastq.gz read_2_sample_1.fastq.gz read_1_sample_2.fastq.gz read_2_sample_2.fastq.gz … read_1_sample_N.fastq.gz read_2_sample_N.fastq.gz
    ===============
    After trimming, I ran FastQC + MultiQC again and found that, surprisingly, all samples failed the Per Base Sequence Content test. All samples share the same pattern: the 3' end is strongly biased, with the C content being very low (see the attached Per Base Sequence Content-After trimming.jpg).
    My question is: should I worry about the bias at the 3' end? Or should I trim the 3' end further? Specifically, the line for C is roughly horizontal before trimming. Why did it drop to almost zero after trimming? An online discussion (https://github.com/FelixKrueger/Trim...-auto-detectio) mentioned that [Note that the sharp decrease of A at the last position is a result of removing the adapter sequence very stringently, i.e. even a single trailing A at the end is removed.] However, as far as I understand, trimming at the 3' end just removes the adapter sequence (if there is sequencing read-through). It should not affect the sequence that is kept. If the line for C is horizontal before trimming, it should also be horizontal after trimming. I am a bit confused.

    Concern-2: Per Sequence GC Content
    Before trimming, I found that many samples failed the Per Sequence GC Content test because of the multiple peaks in the plot (see the attached Per Sequence GC Content--No trimming.jpg). I thought that this failure was due to adapter contamination. However, after trimming, many samples still have the issue (see the attached Per Sequence GC Content--After trimming.jpg).

    My question is: why do my samples show multiple peaks? Is it possible that my samples contain more than one dominant species? Or are the multiple peaks due to sequencing or processing errors? How should I fix this issue?

    Question-3: The sequencing I did is shallow sequencing. Also, my samples are not pure cultures; they contain millions of different species of microbes. We will examine the microbial community structure and look for functional genes. In this case, should I do assembly before the downstream analysis? I have read some online discussions. Some suggest assembly, and some say that it is better to skip it. I am really new to this area and do not know which (with or without assembly) is the better choice.

    Thanks for reading this posting!

  • #2
    Rule #1: Do not get hung up on the big red X's in FastQC.

    The thresholds which delineate Pass|Warn|Fail for the various metrics in FastQC were set using beautiful, single species, perfectly random and uniform genomic DNA libraries. Things that deviate from this in terms of sampling method, library content and library construction produce false failures. It is likely that the data is perfectly good for your organism(s), given that you are performing a metagenomic experiment with widely variable samples.

    You stated that you made these libraries using a Nextera kit. The tagmentation in Nextera library kits is not perfectly random; there is a sequence composition bias at the tagmentation site. Your original (untrimmed) Per Base Sequence Content is perfectly normal for Nextera libraries; the bias at the 5' end simply reflects the bias of the tagmentation enzyme. There is no need to trim the 5' end, but go ahead if you want to.

    I have seen the highly skewed 3' end in the Per Base Sequence Content plot before with trimmed reads. I'm not sure if it is an artifact of trimming or of the grouping algorithm in FastQC when it doesn't have enough bases left to include in its default group size of 5 bp. (This is purely speculation.)
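
    A toy simulation can show how stringent adapter trimming alone could produce this kind of pattern, along the lines of the Trim Galore note quoted in the first post. This is only a sketch under assumptions not confirmed in this thread: that the Nextera adapter being trimmed starts with C (CTGTCTCTTATA), and that a single matching base at the 3' end is enough to trigger trimming. The function `trim_3prime` is a hypothetical stand-in that only does exact suffix matching, which is far simpler than what the real trimmer does.

```python
import random

# Assumption: Nextera adapter sequence used for 3' trimming (starts with C).
NEXTERA_ADAPTER = "CTGTCTCTTATA"


def trim_3prime(read, adapter=NEXTERA_ADAPTER, min_overlap=1):
    """Remove the longest read suffix that exactly equals an adapter prefix.

    Toy model only: no mismatches, no adapter inside the read.
    """
    max_k = min(len(read), len(adapter))
    for k in range(max_k, max(min_overlap, 1) - 1, -1):
        if read.endswith(adapter[:k]):
            return read[:-k]
    return read


def base_fraction_at(reads, position, base):
    """Fraction of reads covering `position` (0-based) that carry `base` there."""
    covering = [r for r in reads if len(r) > position]
    if not covering:
        return 0.0
    return sum(r[position] == base for r in covering) / len(covering)


rng = random.Random(0)
read_len = 50
# Adapter-free random reads with uniform base composition.
raw = ["".join(rng.choice("ACGT") for _ in range(read_len)) for _ in range(5000)]
trimmed = [trim_3prime(r) for r in raw]

last = read_len - 1
c_before = base_fraction_at(raw, last, "C")
c_after = base_fraction_at(trimmed, last, "C")
print(f"C at last position before trimming: {c_before:.3f}")
print(f"C at last position after trimming:  {c_after:.3f}")
```

    In this toy model, every read that ends in C loses at least that base, so the only reads still reaching the final position are ones that do not end in C. The retained bases are unchanged, but the set of reads contributing to the last positions of the plot is not, which would be consistent with the C line collapsing at the 3' end.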

    Regarding the GC content plots, you are sampling a large diversity of bacteria from a variety of very distinct environments. It is totally expected that the bacterial populations in your different environments would have widely variable GC content distributions. This has nothing to do with adapters. Again, the failure is due to FastQC's expectations not matching the reality of the experiment you are performing.
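
    To illustrate why a mixed community gives a multi-peaked Per Sequence GC Content plot, here is a minimal sketch. It assumes a made-up community of just two organisms with genome-wide GC of 35% and 65% (hypothetical values, not taken from your data) and reads of 150 bp, with each base drawn independently.

```python
import random

READ_LEN = 150  # read length mentioned in this thread


def simulate_read(gc_fraction, length, rng):
    """Draw one read whose bases are G or C with probability gc_fraction."""
    return "".join(
        rng.choice("GC") if rng.random() < gc_fraction else rng.choice("AT")
        for _ in range(length)
    )


def gc_content(read):
    """Fraction of G and C bases in a read."""
    return (read.count("G") + read.count("C")) / len(read)


def fraction_in_range(values, lo, hi):
    """Fraction of values v with lo <= v < hi."""
    return sum(lo <= v < hi for v in values) / len(values)


rng = random.Random(1)
# Hypothetical two-member community: 2000 reads from each organism.
reads = [
    simulate_read(gc, READ_LEN, rng)
    for gc in (0.35, 0.65)
    for _ in range(2000)
]
gc_values = [gc_content(r) for r in reads]

low = fraction_in_range(gc_values, 0.30, 0.40)
mid = fraction_in_range(gc_values, 0.45, 0.55)
high = fraction_in_range(gc_values, 0.60, 0.70)
print(f"reads with GC 30-40%: {low:.2f}")
print(f"reads with GC 45-55%: {mid:.2f}")
print(f"reads with GC 60-70%: {high:.2f}")
```

    The reads pile up around the GC content of each organism with a valley in between, which is the opposite of the single bell curve FastQC expects. A real soil or sewage sample has many more members at uneven abundance, so the real plot will be messier than this.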

    The Adapter content plot is the only one which really shows something you need to address. It is normal (especially for libraries prepared using Nextera kits) to have some fragments shorter than your read length (150bp in your case). Your particular libraries vary from ~20% to 35% in the percentage of fragments < 150bp. Performing 3' adapter trimming is required to remove adapter sequences from these reads.
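
    The arithmetic behind read-through can be sketched as follows. The fragment lengths are invented for illustration and the helper names are hypothetical; only the 150 bp read length comes from this thread.

```python
def adapter_bases_in_read(fragment_len, read_len=150):
    """Number of sequencing cycles that run past the insert into the adapter."""
    return max(0, read_len - fragment_len)


def fraction_with_adapter(fragment_lengths, read_len=150):
    """Fraction of fragments shorter than the read length."""
    short = sum(1 for f in fragment_lengths if f < read_len)
    return short / len(fragment_lengths)


# Hypothetical fragment (insert) lengths in bp, for illustration only.
fragments = [80, 120, 149, 150, 200, 300, 450, 600, 250, 180]

print(adapter_bases_in_read(100))        # 50 bases of read-through
print(adapter_bases_in_read(300))        # 0, the read ends inside the insert
print(fraction_with_adapter(fragments))  # 0.3
```

    Only the reads from short fragments carry adapter sequence, and only at their 3' end, which is why 3' trimming is the step that matters here.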
    Last edited by kmcarr; 03-10-2020, 11:19 AM. Reason: Correct 5'/3' mixup



    • #3
      Dear kmcarr,

      Thanks a lot for the reply and explaining the details. Appreciate that!

      After reading your response, I understand that the adapter contamination is the only thing that I need to worry about. I have used TrimGalore! to remove the adapters from the 3'-end of the raw reads. However, you also suggested that "Performing 5' adapter trimming is required to remove adapter sequences from these reads." I am a bit confused. Based on my current understanding (maybe I am wrong), in my case, I only have adapters at the 3'-end of the reads. Do we have adapters at both ends (3'- and 5'-)?

      Thanks again!



      • #4
        Sorry, that was an error. I meant to type "Performing 3' adapter trimming...."

        I have edited my original post to fix this.



        • #5
          Dear kmcarr,

          Thanks for the quick response and the clarification.

          May I ask a few more questions? For my specific case, should I perform assembly before downstream analysis?

          Also, after quality control, which software or pipeline would you suggest I begin with (for assembly, annotation, taxonomic analysis, and finding functional genes)? There are numerous tools and pipelines, and as a real newbie I am having a hard time deciding which one to start with.

          Thanks!



          • #6
            yy2,

            The downstream analysis part is a bit outside my area so I'll have to leave that to others to help you.

            Cheers.
