  • Best adapter trimming?

    To my understanding, only one-sided adapter trimming is necessary, since the adapter is read at the free, unbound 3' end of the fragment in the flow cell.

    I would like to trim the adapters from the reads as a first step. I have .fastq.gz files from Illumina HiSeq 2500 rapid runs. I have the actual adapter sequences and can provide the Illumina reagents used for the sequencer if necessary.

    I would like to follow 'best practices' for trimming the adapters. Is it necessary and/or helpful to supply the trimming software with a comprehensive list of adapter and/or contaminant sequences? Where would be the best place to find such a list? I did some searching, but I would appreciate your advice. Thank you.

  • #2
    In general, adapters should not be present in your reads unless you have a poor-quality library or adapter dimers. But I suppose you may have determined that adapters are present in your reads based on a FastQC analysis.

    http://seqanswers.com/forums/showthread.php?t=42776 describes just about the simplest tool you can use to trim adapters. Trimmomatic and cutadapt (or Trim Galore, cutadapt's wrapper) are other good options, but they involve a bit of a learning curve with the command-line parameters. There are separate threads for those tools.



    • #3
      I guess I should rephrase my question. Filtering out and/or trimming away as much as possible of what is not sample DNA would be a logical first step with the files from the sequencer, wouldn't it?

      Would e.g. Trim galore or BBDuk be a good way to accomplish this?

      You said that in general adapters should not be present. What would you recommend: a size-selection step to get rid of the short fragments? The actual fragment size going into the sequencer peaks right around 350bp and doesn't appear to be 'too broad,' using 100bp paired-end Illumina rapid runs.

      I posted examples of the FastQC Adapter and Kmer graphs in a FastQC thread. Your advice is appreciated.




      • #4
        BBDuk can run in trimming mode or filtering mode. Adapters should be trimmed, while other artifacts such as spike-ins should be filtered.

        bbduk.sh in=reads.fq out=trimmed.fq ref=adapters.fa ktrim=r

        ...will trim adapters to the right (3' end), while

        bbduk.sh in=trimmed.fq out=filtered.fq ref=contam.fa stats=statistics.txt

        ...will filter out sequences that share kmers with that reference, and write a file "statistics.txt" telling you what was detected. For greater sensitivity you can add 'hdist=1' to allow up to 1 mismatch (or a higher value, if you want). Normally I trim adapters from fragment libraries like this:

        bbduk.sh -Xmx1g in=reads.fq out=trimmed.fq ref=adapters.fa ktrim=r k=28 mink=13 hdist=1 tbo tpe

        The extra flags adjust the sensitivity and are documented in the shellscript.

        If you have paired reads in two files, you should trim both at the same time using the in1, in2, out1, and out2 flags, to prevent the loss of pairing information. From looking at your pictures, you probably DO have adapter contamination.
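        For example, a paired version of the adapter-trimming command above might look like the sketch below. It's a dry-run: the file names are placeholders, and the echo just prints the assembled command instead of running it.

```shell
# Dry-run: paired version of the fragment-library trimming command.
# File names are placeholders; echo prints the command instead of running it.
cmd="bbduk.sh -Xmx1g in1=reads_R1.fq in2=reads_R2.fq out1=trimmed_R1.fq out2=trimmed_R2.fq ref=adapters.fa ktrim=r k=28 mink=13 hdist=1 tbo tpe"
echo "$cmd"
```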

        You do need to provide files with contaminant sequences for filtering, and it's best to provide adapter sequences for trimming, though the 'tbo' flag will allow most adapters to be trimmed even without specifying what they are. The BBMap package includes Illumina's truseq adapters in the /resources/ directory.

        I also like to remove human contamination (when working with non-mammalian data), which is very common.
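        One way to sketch that removal with BBMap is to map against a human reference and keep only the reads that do NOT map. This is a dry-run sketch: the reference path, memory setting, and minid threshold are placeholders, not recommendations.

```shell
# Dry-run: map to a human reference; outu1/outu2 receive the unmapped
# (presumed non-human) read pairs. Paths and thresholds are placeholders.
cmd="bbmap.sh -Xmx24g ref=human_masked.fa in1=r1.fq in2=r2.fq outu1=nonhuman_r1.fq outu2=nonhuman_r2.fq minid=0.9"
echo "$cmd"
```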
        Last edited by Brian Bushnell; 08-15-2014, 11:53 AM.



        • #5
          Both of your questions have a simple answer of "yes". Your libraries look normal: most libraries have some adapter contamination from short inserts, dimers, etc., since the size-selection process is not perfect at keeping only 300bp+ fragments.

          Use any of the trimming programs you feel comfortable with and check the results with FastQC afterwards.



          • #6
            The data we receive is actually several .fastq.gz files per sample; FastQC calls this Casava format. That is, several 'left' .fastq.gz files and matching 'right' .fastq.gz files per sample, so there may be 3 left files and 3 right files for one sample.

            I would have to unzip them first, correct? I know which are the left and right files of each pair, so I can enter that information into BBDuk. I would not need to merge all of the 'lefts' into one file though, right? Rather, I could just run BBDuk on one pair of files, then on the next pair, and so on.



            • #7
              BBDuk will accept gzipped input and output. And yes, you can just run it 3 times, one pair at a time; no need to merge ahead of time.
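              A loop like the sketch below runs BBDuk once per pair, directly on the .gz files. The naming scheme (sample_N_R1.fastq.gz paired with sample_N_R2.fastq.gz) is an assumption; adjust it to your actual file names. The echo just prints each command so you can inspect them before running anything.

```shell
# Dry-run: print one BBDuk command per read pair, directly on the .gz files.
# The naming scheme (sample_N_R1.fastq.gz / sample_N_R2.fastq.gz) is an assumption.
trim_pairs() {
    for r1 in "$@"; do
        r2=$(printf '%s' "$r1" | sed 's/_R1/_R2/')   # matching right-hand file
        echo "bbduk.sh in1=$r1 in2=$r2" \
             "out1=trimmed_$r1 out2=trimmed_$r2" \
             "ref=adapters.fa ktrim=r k=28 mink=13 hdist=1"
    done
}
trim_pairs sample_1_R1.fastq.gz sample_2_R1.fastq.gz sample_3_R1.fastq.gz
```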



              • #8
                I tried a few stepwise passes with BBDuk as a kind of experiment, and it seems to be a definite improvement.

                The original FastQC report for my sample:





                After running BBDuk to trim the read length from 101bp to 100bp and to trim adapters from the reads:

                /path/to/bbmap/bbduk.sh -Xmx2g in1=/path/to/sample_R1.fastq in2=/path/to/sample_R2.fastq out1=/path/to/sample_R1_no_adapters.fastq out2=/path/to/sample_R2_no_adapters.fastq ref=/path/to/bbmap/resources/adapter_sequences.fa.gz ktrim=r k=28 mink=13 hdist=1 stats=/path/to/sample_stats.txt


                FastQC report:





                Next, BBDuk to filter contaminants:

                /path/to/bbmap/bbduk.sh -Xmx2g in1=/path/to/sample_R1_no_adapters.fastq in2=/path/to/sample_R2_no_adapters.fastq out1=/path/to/sample_R1_clean.fastq out2=/path/to/sample_R2_clean.fastq ref=/path/to/bbmap/resources/artifacts.fa.gz k=24 hdist=1 stats=/path/to/sample_stats2.txt


                FastQC report:





                Overall this looks much better, although there still appears to be some kmer content. One more pass to filter PhiX removed a tiny fraction of the total reads but did not seem to change much.

                Any ideas on the kmer content? We can also try to address it, along with the duplication, during sample preparation.

                So, what would be a next logical step to analyze the sample? Quality trimming, deduplication, mapping, and/or quality recalibration?



                • #9
                  If you know what organism this is, and have a reference, you can try mapping to the reference and BLASTing the unmapped reads to something like nt or some database of synthetic oligos to see what they are, then filter them. Alternately, you can assemble and BLAST the contigs to see what those are; potentially some will be contaminants, which you can then remove from the reads.
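                  The map-then-BLAST idea can be sketched as three commands (a dry-run; the paths, the 'nt' database name, and the 1% subsample rate are placeholders): keep the unmapped reads, subsample them to fasta, then BLAST them.

```shell
# Dry-run of map-then-BLAST: outu= keeps unmapped reads, reformat.sh
# subsamples them to fasta, blastn searches them against nt.
# Paths, database name, and sample rate are placeholders.
map_cmd="bbmap.sh ref=reference.fa in1=r1.fq in2=r2.fq outu=unmapped.fq"
sample_cmd="reformat.sh in=unmapped.fq out=unmapped_sample.fa samplerate=0.01"
blast_cmd="blastn -query unmapped_sample.fa -db nt -outfmt 6 -max_target_seqs 5 -out hits.tsv"
printf '%s\n' "$map_cmd" "$sample_cmd" "$blast_cmd"
```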

                  BBTools does include a deduplication tool, Dedupe, that does reference-free pair-based deduplication, but it requires a lot of memory (1kb per read). Whether that would help you is unclear, but you can try it like this:

                  dedupe.sh in1=r1.fq in2=r2.fq out1=dd1.fq out2=dd2.fq -Xmx30g

                  I also have a quality recalibration tool, but I don't see how that would help you; and you can use BBDuk to do quality-trimming or just remove the last few bases, which seem to have unusual kmer frequencies. But before doing additional preprocessing, I think it's important to know your goal - what kind of organism is this, what kind of data, what are you going to use it for, and do you have a reference? Even deduplication is inadvisable in many cases (like quantification), and it's possible that the remaining FastQC anomalies are not important, or perhaps expected from your data type. Also, posting the per-base quality profile and base frequencies would be useful.
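                  For reference, the quality-trimming and end-removal options mentioned above might look like the dry-run sketch below; the paths and the Q10 threshold are placeholders, not recommendations.

```shell
# Dry-run: qtrim=rl quality-trims both read ends to Q10 (trimq=10), and
# ftr=99 force-trims everything after base position 99, keeping the first 100bp.
# Paths and the Q10 threshold are placeholders.
cmd="bbduk.sh in1=r1.fq in2=r2.fq out1=qt_r1.fq out2=qt_r2.fq qtrim=rl trimq=10 ftr=99"
echo "$cmd"
```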



                  • #10
                    This is a human sample, normal tissue to be compared with tumor, I figure why not start right with the easy stuff =)

                    We have a pipeline that we use, but I want to try going through the analysis to have a better understanding of what's going on.

                    Here are the quality and base profiles before trimming:




                    And after:




                    The goal is to call variants, and ultimately to identify whatever anomalies are responsible for or driving the mutations.



                    • #11
                      If you want to remove more contaminants, you can try mapping to human and BLASTing some of the unmapped reads. Depending on what you discover, it may be prudent to do another filtering step.

                      The quality looks excellent and probably does not need any quality-trimming; for mapping + variant calling I think it's best to do quality-trimming after mapping to allow maximal information for the mapper, though that operation is slightly trickier. There is still a drift in the base frequencies toward the read tails, and that's probably due to residual adapter sequence. You can try to get rid of it by adding "tbo" and "tpe" to your adapter-trimming command:

                      /path/to/bbmap/bbduk.sh -Xmx2g in1=/path/to/sample_R1.fastq in2=/path/to/sample_R2.fastq out1=/path/to/sample_R1_no_adapters.fastq out2=/path/to/sample_R2_no_adapters.fastq ref=/path/to/bbmap/resources/adapter_sequences.fa.gz ktrim=r k=28 mink=13 hdist=1 stats=/path/to/sample_stats.txt tbo tpe

                      The problem is that, with your current command, adapters shorter than 13bp were not trimmed, because BBDuk was run with 'mink=13'. It's not good to go much shorter than that, as you will incur false positives. But if you have a paired-end fragment library, the 'tbo' flag will additionally trim by overlapping the two reads; this can catch adapter sequence down to 1bp long and gets rid of virtually all adapters, if the reads are high quality. The 'tpe' flag means "trim pairs evenly": if an adapter is detected on one read, it will be assumed to be in the same place on the other. These flags are not on by default because they are library-specific and should only be used with paired-end fragment libraries, not (for example) long-mate-pair libraries. Sorry for not mentioning them before!



                      • #12
                        Thank you so much for all of your help, Brian. I am wondering this too: whether preprocessing is necessary, and/or how much of a difference it will make. Hopefully I can compare the two after calling variants, for example, and see any difference. I am going to try to read through Best Practices For Variant Calling With The GATK.



                        • #13
                          There's no way of telling how much of a difference it will make without trying both ways. At a minimum, it should give you higher coverage and lower file sizes while allowing the mapping and variant-calling to go faster, but hopefully it will give better results, as well. A single false-positive variant due to a contaminant can waste hundreds of hours of analysis if it is in the wrong place.
