  • ronton
    Member
    • Jun 2014
    • 34

    Best adapter trimming?

    As I understand it, only one-sided adapter trimming is necessary, since those are the unbound, free-floating ends that are read in the flow cell.

    I would like to trim the adapters from the reads as a first step. I have .fastq.gz files from Illumina HiSeq 2500 rapid runs. I have the actual adapter sequences and can provide the Illumina reagents used for the sequencer if necessary.

    I would like to use 'best practices' for trimming the adapters. Is it necessary and/or helpful to supply the trimming software with a comprehensive list of adapter and/or contaminant sequences? Where would be the best place to find one? I did some searching, but I would appreciate your advice. Thank you.
  • GenoMax
    Senior Member
    • Feb 2008
    • 7142

    #2
    In general, adapters should not be present in your reads unless you have a poor-quality library or adapter dimers. But I suppose you have determined from FastQC analysis that adapters are present in your reads.

    http://seqanswers.com/forums/showthread.php?t=42776 describes just about the simplest tool you can use to trim adapters. Trimmomatic and cutadapt (or Trim Galore, a wrapper around cutadapt) are other good options, but their command-line parameters involve a bit of a learning curve. There are separate threads for those tools.

    • ronton
      Member
      • Jun 2014
      • 34

      #3
      I guess I should try to rephrase my question. Filtering out and/or trimming as much as possible that is not sample DNA would be a logical first step with the files from the sequencer, wouldn't it?

      Would e.g. Trim galore or BBDuk be a good way to accomplish this?

      You said that in general adapters should not be present. What would you recommend, a size selection step to get rid of the short fragments? The actual fragment size going into the sequencer has a peak right around 350bp and doesn't appear to be 'too broad,' using 100bp paired end Illumina rapid runs.

      I posted examples of the FastQC Adapter and Kmer graphs in a FastQC thread. Your advice is appreciated.


      • Brian Bushnell
        Super Moderator
        • Jan 2014
        • 2709

        #4
        BBDuk can run in trimming mode or filtering mode. Adapters should be trimmed, while other artifacts such as spike-ins should be filtered.

        bbduk.sh in=reads.fq out=trimmed.fq ref=adapters.fa ktrim=r

        ...will trim adapters to the right (3' end), while

        bbduk.sh in=trimmed.fq out=filtered.fq ref=contam.fa stats=statistics.txt

        ...will filter out sequences that share kmers with that reference, and write a file "statistics.txt" telling you what was detected. For greater sensitivity you can add 'hdist=1' to allow up to 1 mismatch (or a higher value, if you want). Normally I trim adapters from fragment libraries like this:

        bbduk.sh -Xmx1g in=reads.fq out=trimmed.fq ref=adapters.fa ktrim=r k=28 mink=13 hdist=1 tbo tpe

        The extra flags adjust the sensitivity and are documented in the shellscript.

        If you have paired reads in two files, you should trim both at the same time using the in1, in2, out1, and out2 flags, to prevent the loss of pairing information. From looking at your pictures, you probably DO have adapter contamination.

        You do need to provide files with contaminant sequences for filtering, and it's best to provide adapter sequences for trimming, though the 'tbo' flag will allow most adapters to be trimmed even without specifying what they are. The BBMap package includes Illumina's TruSeq adapters in the /resources/ directory.
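For paired reads in two files, the trimming command above might be extended roughly as follows. This is a minimal sketch that only prints the command so it can be inspected before running. The function name `paired_trim_cmd`, the `BBDUK` and `ADAPTERS` variables, and all file names are placeholders introduced for illustration; they are not part of BBTools.

```shell
# Sketch: build a paired-end BBDuk adapter-trimming command.
# BBDUK and ADAPTERS are hypothetical overrides; the defaults assume that
# bbduk.sh is on PATH and adapters.fa is in the current directory.
paired_trim_cmd() {
    # $1 = R1 input, $2 = R2 input, $3 = R1 output, $4 = R2 output
    echo "${BBDUK:-bbduk.sh} -Xmx1g in1=$1 in2=$2 out1=$3 out2=$4" \
         "ref=${ADAPTERS:-adapters.fa} ktrim=r k=28 mink=13 hdist=1 tbo tpe"
}

# Print the command for one pair of (placeholder) files.
paired_trim_cmd reads_R1.fq reads_R2.fq trimmed_R1.fq trimmed_R2.fq
```

Processing both files in one call is what keeps the pairs in sync: a read dropped from one file is dropped from the other as well.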

        I also like to remove human contamination (when working with non-mammalian data), which is very common.
        Last edited by Brian Bushnell; 08-15-2014, 11:53 AM.

        • GenoMax
          Senior Member
          • Feb 2008
          • 7142

          #5
          Both of your questions have a simple answer: "yes". Your libraries look normal. Most libraries have some adapter contamination (from short inserts, dimers, etc.) because fragment size selection is imperfect and does not retain only 300 bp+ fragments.

          Use any of the trimming programs you feel comfortable with and check the results with FastQC afterwards.

          • ronton
            Member
            • Jun 2014
            • 34

            #6
            The data we receive is actually several .fastq.gz files per sample; FastQC calls this Casava. That is, each sample has several 'left' .fastq.gz files for the pair and matching 'right' .fastq.gz files, so there may be 3 left and 3 right .fastq.gz files for one sample.

            I would have to unzip them first, correct? I know which are the left and right files of each pair, so I can enter that information into BBDuk. I would not need to merge all of the 'lefts' into one file though, right? Rather, I could just run BBDuk on one pair of files, then on the next pair, and so on.

            • Brian Bushnell
              Super Moderator
              • Jan 2014
              • 2709

              #7
              BBDuk will accept gzipped input and output. And yes, you can just run 3 times, one pair at a time; no need to merge ahead of time.
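Running one pair at a time could look roughly like the loop below. It assumes Casava-style names in which the 'left' file contains `_R1_` and its mate contains `_R2_` in the same position (for example `sample_L001_R1_001.fastq.gz`). That naming, the `trimmed` output directory, the adapter file name, and the helper `mate_of` are assumptions for illustration. The gzipped files are passed to BBDuk directly, without unzipping.

```shell
# Sketch: trim each R1/R2 pair separately, keeping the files gzipped.
OUTDIR="${OUTDIR:-trimmed}"   # hypothetical output directory

# Derive the R2 file name from an R1 file name (assumes "_R1_" marks the mate).
mate_of() {
    echo "$1" | sed 's/_R1_/_R2_/'
}

for r1 in *_R1_*.fastq.gz; do
    [ -e "$r1" ] || continue          # the glob matched nothing
    r2="$(mate_of "$r1")"
    if [ ! -e "$r2" ]; then
        echo "No mate found for $r1, skipping" >&2
        continue
    fi
    mkdir -p "$OUTDIR"
    # Outputs keep the input names but go to a separate directory,
    # so a second run does not pick them up as inputs.
    bbduk.sh in1="$r1" in2="$r2" \
        out1="$OUTDIR/$r1" out2="$OUTDIR/$r2" \
        ref=adapters.fa ktrim=r k=28 mink=13 hdist=1 tbo tpe
done
```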

              • ronton
                Member
                • Jun 2014
                • 34

                #8
                I tried a few stepwise passes with BBDuk as a kind of experiment, and it seems to be a definite improvement.

                The original FastQC report for my sample:





                After BBDuk to trim the read length to 100bp from 101 and to trim adapters from the reads:

                /path/to/bbmap/bbduk.sh -Xmx2g in1=/path/to/sample_R1.fastq in2=/path/to/sample_R2.fastq out1=/path/to/sample_R1_no_adapters.fastq out2=/path/to/sample_R2_no_adapters.fastq ref=/path/to/bbmap/resources/adapter_sequences.fa.gz ktrim=r k=28 mink=13 hdist=1 stats=/path/to/sample_stats.txt


                FastQC report:





                Next, BBDuk to filter contaminants:

                /path/to/bbmap/bbduk.sh -Xmx2g in1=/path/to/sample_R1_no_adapters.fastq in2=/path/to/sample_R2_no_adapters.fastq out1=/path/to/sample_R1_clean.fastq out2=/path/to/sample_R2_clean.fastq ref=/path/to/bbmap/resources/artifacts.fa.gz k=24 hdist=1 stats=/path/to/sample_stats2.txt


                FastQC report:





                Overall this looks much better, although there still appears to be some kmer content. One more pass to filter PhiX removed a tiny fraction of the total reads but did not seem to change much.

                Any ideas on the kmer content? We can try to address this as well as the duplication in the sample preparation also.

                So, what would be a next logical step to analyze the sample? Quality trimming, deduplication, mapping, and/or quality recalibration?

                • Brian Bushnell
                  Super Moderator
                  • Jan 2014
                  • 2709

                  #9
                  If you know what organism this is, and have a reference, you can try mapping to the reference and BLASTing the unmapped reads to something like nt or some database of synthetic oligos to see what they are, then filter them. Alternately, you can assemble and BLAST the contigs to see what those are; potentially some will be contaminants, which you can then remove from the reads.
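The map-then-BLAST idea could be sketched as the three commands below, printed rather than executed. `bbmap.sh` with `outu=` (unmapped reads), `reformat.sh` with `samplereadstarget=`, and `blastn -remote` are existing options of BBTools and BLAST+, but the file names, the sample size of 1000 reads, and the helper name `unmapped_blast_cmds` are hypothetical. Paired input in two files would also need `in2=`.

```shell
# Sketch: print commands to collect unmapped reads and BLAST a subsample.
unmapped_blast_cmds() {
    # $1 = reads (FASTQ), $2 = reference (FASTA)
    # 1. Map to the reference, writing reads that fail to map to unmapped.fq.
    echo "bbmap.sh in=$1 ref=$2 outu=unmapped.fq"
    # 2. Subsample the unmapped reads and convert them to FASTA.
    echo "reformat.sh in=unmapped.fq out=unmapped_sample.fa samplereadstarget=1000"
    # 3. BLAST the subsample against nt on NCBI's servers, tabular output.
    echo "blastn -query unmapped_sample.fa -db nt -remote -outfmt 6 -out unmapped_hits.tsv"
}

unmapped_blast_cmds trimmed.fq reference.fa
```

Subsampling first keeps the BLAST step small; a few hundred to a few thousand reads is usually enough to see which organisms or synthetic sequences dominate the unmapped fraction.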

                  BBTools does include a deduplication tool, Dedupe, that does reference-free pair-based deduplication, but it requires a lot of memory (1 kB per read). Whether that would help you is unclear, but you can try it like this:

                  dedupe.sh in1=r1.fq in2=r2.fq out1=dd1.fq out2=dd2.fq -Xmx30g

                  I also have a quality recalibration tool, but I don't see how that would help you; and you can use BBDuk to do quality-trimming or just remove the last few bases, which seem to have unusual kmer frequencies. But before doing additional preprocessing, I think it's important to know your goal - what kind of organism is this, what kind of data, what are you going to use it for, and do you have a reference? Even deduplication is inadvisable in many cases (like quantification), and it's possible that the remaining FastQC anomalies are not important, or perhaps expected from your data type. Also, posting the per-base quality profile and base frequencies would be useful.
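The two BBDuk options mentioned here, quality-trimming and removing the last few bases, might look like this. The sketch only prints the commands. `qtrim`, `trimq`, and `ftr` (forcetrimright) are BBDuk parameters, but the cutoff of 10 and the length of 100 are arbitrary example values, and the function names, the `BBDUK` variable, and the file names are hypothetical.

```shell
# Sketch: print BBDuk commands for quality-trimming and fixed-length trimming.
quality_trim_cmd() {
    # $1/$2 = paired inputs, $3/$4 = paired outputs, $5 = quality cutoff
    echo "${BBDUK:-bbduk.sh} in1=$1 in2=$2 out1=$3 out2=$4 qtrim=r trimq=$5"
}

force_trim_cmd() {
    # $5 = number of bases to keep. ftr is treated here as a 0-based position,
    # so keeping N bases means ftr=N-1 (check the bbduk.sh help text for your
    # version before relying on this).
    echo "${BBDUK:-bbduk.sh} in1=$1 in2=$2 out1=$3 out2=$4 ftr=$(($5 - 1))"
}

quality_trim_cmd r1.fq r2.fq qt1.fq qt2.fq 10
force_trim_cmd   r1.fq r2.fq ft1.fq ft2.fq 100
```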

                  • ronton
                    Member
                    • Jun 2014
                    • 34

                    #10
                    This is a human sample: normal tissue to be compared with tumor. I figure, why not start right with the easy stuff =)

                    We have a pipeline that we use, but I want to try going through the analysis to have a better understanding of what's going on.

                    Here are the quality and base profiles before trimming:




                    And after:




                    The goal is to call variants, and ultimately to identify whatever anomalies are responsible for or driving the mutations.

                    • Brian Bushnell
                      Super Moderator
                      • Jan 2014
                      • 2709

                      #11
                      If you want to remove more contaminants, you can try mapping to human and BLASTing some of the unmapped reads. Depending on what you discover, it may be prudent to do another filtering step.

                      The quality looks excellent and probably does not need any quality-trimming; for mapping + variant calling I think it's best to do quality-trimming after mapping to allow maximal information for the mapper, though that operation is slightly trickier. There is still a drift in the base frequencies toward the read tails, and that's probably due to residual adapter sequence. You can try to get rid of it by adding "tbo" and "tpe" to your adapter-trimming command:

                      /path/to/bbmap/bbduk.sh -Xmx2g in1=/path/to/sample_R1.fastq in2=/path/to/sample_R2.fastq out1=/path/to/sample_R1_no_adapters.fastq out2=/path/to/sample_R2_no_adapters.fastq ref=/path/to/bbmap/resources/adapter_sequences.fa.gz ktrim=r k=28 mink=13 hdist=1 stats=/path/to/sample_stats.txt tbo tpe

                      The problem is that adapter remnants shorter than 13bp were not trimmed, because BBDuk was run with 'mink=13'. It's not good to go much shorter than that, as you will incur false positives. But if you have a paired-end fragment library, the 'tbo' flag will additionally trim by overlapping the two reads; this can catch adapter sequence down to 1bp long and gets rid of virtually all adapters, if the reads are high quality. The 'tpe' flag means "trim pairs evenly", so if an adapter is detected on one read, it will be assumed to be in the same place on the other. These flags are not on by default because they are library-specific and should only be used with paired-end fragment libraries, not (for example) long-mate-pair libraries. Sorry for not mentioning them before!
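A rough back-of-envelope illustration of why much shorter kmers incur false positives: under the simplifying assumptions of random sequence with equal base frequencies and exact matching (real reads and 'hdist=1' satisfy neither), a specific kmer of length k matches at a given position by chance about once in 4^k tries. The helper `pow4` is introduced here for illustration only.

```shell
# Sketch: the chance of a random exact kmer match is roughly 1 in 4^k.
pow4() {
    # $1 = k; prints 4^k using integer arithmetic
    n=1
    i=0
    while [ "$i" -lt "$1" ]; do
        n=$((n * 4))
        i=$((i + 1))
    done
    echo "$n"
}

for k in 8 11 13; do
    echo "k=$k: about 1 in $(pow4 "$k")"
done
```

Across tens of millions of read ends, odds of 1 in 65536 (k=8) would produce many spurious trims, while 1 in 67108864 (k=13) would produce very few, which is consistent with not going much below 13.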

                      • ronton
                        Member
                        • Jun 2014
                        • 34

                        #12
                        Thank you so much for all of your help, Brian. I am wondering about this too: whether preprocessing is necessary and how much of a difference it will make. Hopefully I can compare the two approaches after calling variants, for example, and see any difference. I am going to try to read through Best Practices For Variant Calling With The GATK.

                        • Brian Bushnell
                          Super Moderator
                          • Jan 2014
                          • 2709

                          #13
                          There's no way of telling how much of a difference it will make without trying both ways. At a minimum, it should give you higher coverage and lower file sizes while allowing the mapping and variant-calling to go faster, but hopefully it will give better results, as well. A single false-positive variant due to a contaminant can waste hundreds of hours of analysis if it is in the wrong place.
