  • Low-abundance kmer - what to do?

    Dear members,


    I need some advice or clarification on the topic of “Low-abundance k-mer trimming”.


    Background:
    source: bacterial genomes (isolates with genomes <6 Mb). The GC content varies by isolate; for the following sample it is ~63%.
    sequencing technology: HiSeq 2500 PE125 Rapid Run mode, with Nextera XT library preparation. Average insert size of 293.46 bp.
    objective: assembly, mapping and annotation of sequenced genomes


    The raw data (kmer frequency histogram; see figure 1 in the attachment) was processed with the BBMap package, following the steps below:

    ****************************************************************************
    1. Removal of adapters
    ****************************************************************************

    java -Xmx29g -cp /path/to/current jgi.BBDukF in1=raw_r1.fastq.gz in2=raw_r2.fastq.gz out1=step01_clean_r1.fastq.gz out2=step01_clean_r2.fastq.gz outm1=step01_adapters_r1.fastq.gz outm2=step01_adapters_r2.fastq.gz ref=/path/to/resources/nextera.fa.gz ktrim=r k=28 mink=13 hdist=1 tbo tpe

    ****************************************************************************
    2. Removal of phix and artifacts
    ****************************************************************************

    java -Xmx29g -cp /path/to/current jgi.BBDukF in1=step01_clean_r1.fastq.gz in2=step01_clean_r2.fastq.gz out1=step02_clean_r1.fastq.gz out2=step02_clean_r2.fastq.gz outm1=artifacts_r1.fastq.gz outm2=artifacts_r2.fastq.gz ref=/path/to/resources/phix174_ill.ref.fa.gz,/path/to/resources/Illumina.artifacts.2013.12.fa.gz k=31 hdist=1

    ****************************************************************************
    3. Removal of human/plants/fungus/ribosomal
    ****************************************************************************

    java -Xmx29g -cp /path/to/current align2.BBMap minid=0.95 maxindel=3 bwr=0.16 bw=12 quickmatch fast minhits=2 ref=/path/to/resources/hg19_main_mask_ribo_animal_allplant_allfungus.fa.gz build=1 qtrim=rl trimq=10 untrim in1=step02_clean_r1.fastq.gz in2=step02_clean_r2.fastq.gz outu1=step03_clean_r1.fastq.gz outu2=step03_clean_r2.fastq.gz outm=contaminants.fastq.gz

    ****************************************************************************
    4. Trimming
    ****************************************************************************

    java -Xmx29g -cp /path/to/current jgi.BBDukF in1=step03_clean_r1.fastq.gz in2=step03_clean_r2.fastq.gz out1=step04_clean_r1.fastq.gz out2=step04_clean_r2.fastq.gz outm1=step04_trimm_r1.fastq.gz outm2=step04_trimm_r2.fastq.gz qtrim=rl trimq=10 minlength=100


    After processing the raw data, I noticed a high peak at the origin of the histogram (see figures 2A and 2B); perhaps most of these kmers are unique? A search of the literature and forums (not too many hits) identified this observation as sequencing errors, or as a mixture of real and erroneous kmers (under variable coverage).

    In those cases, two solutions are offered prior to assembly or mapping: (1) trimming low-abundance kmers to remove errors from reads, e.g. with the khmer scripts, and (2) error correction, e.g. using the "ecc" flag of the BBMap package, or alternatively the error-correction tool built into the SPAdes assembler (see the sketch below for option 1).
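
    For concreteness, option (1) with the khmer scripts would look roughly like this (a sketch only - the kmer size, table size, and abundance cutoff are illustrative choices, not recommendations):

    load-into-counting.py -k 20 -x 8e9 counts.ct step04_clean_r1.fastq.gz step04_clean_r2.fastq.gz
    filter-abund.py -C 2 counts.ct step04_clean_r1.fastq.gz step04_clean_r2.fastq.gz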


    For a test, I followed a suggestion posted by a member (see the command below) to remove low-coverage reads (<5X) without a reference.


    java -Xmx29g -cp /path/to/current jgi.KmerNormalize in1=step04_clean_r1.fastq.gz in2=step04_clean_r2.fastq.gz out1=step05_clean_r1.fastq.gz out2=step05_clean_r2.fastq.gz target=99999999 min=5 passes=1 hist=kmer_histogram.txt

    I noticed a reduction in the peak, but not a significant one (see figures 3A and 3B).


    My questions are: Do I need to worry about these "unique" or low-coverage reads? Is this something that error correction will 'correct' prior to assembly (in the case of SPAdes) or with BBMap (ecc)? And is error correction a viable solution, given that JGI does not perform or advise this step in their analyses?


    Sorry for the long post; I wanted to include as much of the available information as possible. Thanks for your time and patience.

    Vicente
    Attached Files
    Last edited by vingomez; 02-19-2015, 08:26 AM.

  • #2
    Error correction is usually a good idea, and very safe for haploid high coverage samples. I'd try all three alternatives (trimming as you did, error correction with BayesHammer/SPAdes, error correction with BBTools), run the assembly for all the alternatives and look at the assembly stats afterwards.
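
    For instance, a comparison could look like this (file names are placeholders; SPAdes runs its BayesHammer error correction by default, while --only-assembler skips it for reads you already corrected elsewhere):

    spades.py -1 step04_clean_r1.fastq.gz -2 step04_clean_r2.fastq.gz -o asm_bayeshammer
    spades.py --only-assembler -1 ecc_r1.fastq.gz -2 ecc_r2.fastq.gz -o asm_bbtools_ecc
    quast.py asm_bayeshammer/contigs.fasta asm_bbtools_ecc/contigs.fasta -o asm_comparison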



    • #3
      Hi Vicente,

      I'd like to clarify a couple of things. First, "hg19_main_mask_ribo_animal_allplant_allfungus.fa.gz" was designed to remove human contaminants only, and pass everything else - so, the "ribo_animal_allplant_allfungus" part means that all areas of the human genome similar to those things were masked so that they will not be detected.

      Second - A peak for kmers with depth 1 is generally going to be present in all data no matter how you process it, due to the presence of errors. You can't make it go away entirely, and if you have sufficient coverage it typically does not negatively affect the assembly process much, other than making it take more time and memory, because assemblers generally ignore kmers with depth 1. It's the error kmers at higher depth that are most troublesome.

      Running BBNorm with "min=5" will only get rid of reads in which most of their kmers are below depth 5; it will not get rid of reads that have a handful of kmers with low depth due to a single error. So, it does not have much impact on the depth-1 peak. Running with the "ecc" flag will have a much greater impact on reducing that peak.
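
      For example, adding it to the command you already ran (output names are placeholders; the remaining flags are unchanged):

      java -Xmx29g -cp /path/to/current jgi.KmerNormalize in1=step04_clean_r1.fastq.gz in2=step04_clean_r2.fastq.gz out1=step05_ecc_r1.fastq.gz out2=step05_ecc_r2.fastq.gz target=99999999 min=5 passes=1 ecc=t hist=kmer_histogram.txt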

      You can also reduce the peak by quality-trimming the data. For example, adding "qtrim=rl trimq=10" during phiX/artifact removal will do that.
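
      That is, step 2 would become (only the two trimming flags are new):

      java -Xmx29g -cp /path/to/current jgi.BBDukF in1=step01_clean_r1.fastq.gz in2=step01_clean_r2.fastq.gz out1=step02_clean_r1.fastq.gz out2=step02_clean_r2.fastq.gz outm1=artifacts_r1.fastq.gz outm2=artifacts_r2.fastq.gz ref=/path/to/resources/phix174_ill.ref.fa.gz,/path/to/resources/Illumina.artifacts.2013.12.fa.gz k=31 hdist=1 qtrim=rl trimq=10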



      • #4
        Thanks

        Originally posted by sarvidsson View Post
        Error correction is usually a good idea, and very safe for haploid high coverage samples. I'd try all three alternatives (trimming as you did, error correction with BayesHammer/SPAdes, error correction with BBTools), run the assembly for all the alternatives and look at the assembly stats afterwards.
        Thanks for the rapid response.

        Vicente



        • #5
          Originally posted by Brian Bushnell View Post
          Hi Vicente,

          I'd like to clarify a couple of things. First, "hg19_main_mask_ribo_animal_allplant_allfungus.fa.gz" was designed to remove human contaminants only, and pass everything else - so, the "ribo_animal_allplant_allfungus" part means that all areas of the human genome similar to those things were masked so that they will not be detected.
          Thanks for the clarification.


          Originally posted by Brian Bushnell View Post
          Second - A peak for kmers with depth 1 is generally going to be present in all data no matter how you process it, due to the presence of errors. You can't make it go away entirely, and if you have sufficient coverage it typically does not negatively affect the assembly process much, other than making it take more time and memory, because assemblers generally ignore kmers with depth 1. It's the error kmers at higher depth that are most troublesome.
          Maybe this is a topic for another post, but is there an approach or basic principle for detecting or addressing error kmers at higher depth?


          Originally posted by Brian Bushnell View Post
          Running BBNorm with "min=5" will only get rid of reads in which most of their kmers are below depth 5; it will not get rid of reads that have a handful of kmers with low depth due to a single error. So, it does not have much impact on the depth-1 peak. Running with the "ecc" flag will have a much greater impact on reducing that peak.
          The 'ecc' flag definitely reduces the depth-1 peak.

          Step                             Total unique kmers    Unique kmers (depth=1)
          Raw data                         113,306,625           106,166,190
          Removal of adapters              105,694,714            96,663,826
          Removal of phix and artifacts     19,649,281            13,639,215
          Removal of human contaminants     19,645,918            13,635,884
          ecc                                7,164,977             1,576,790



          Originally posted by Brian Bushnell View Post
          You can also reduce the peak by quality-trimming the data. For example, adding "qtrim=rl trimq=10" during phiX/artifact removal will do that.
          I added the quality-trimming flags to step #2 (phiX/artifact removal). This produced no change in the result (a similar number of unique kmers at the end of the analysis), but it made one step unnecessary (step #4).

          Thanks again for your response and the effort to develop and maintain these software packages.
          Last edited by vingomez; 02-20-2015, 07:15 AM. Reason: Add adapters data



          • #6
            Originally posted by vingomez View Post
            Maybe this is a topic for another post, but is there an approach or basic principle for detecting or addressing error kmers at higher depth?
            BBNorm does this already; I designed it for use with single-cell amplified data and other scenarios with super-high coverage, where the exact same error can occur many times purely by chance. The principle is, basically: if a kmer in a read has depth X, and the adjacent kmer has depth Y, then the last base in the second kmer is probably an error if the ratio X/Y is greater than some constant. BBNorm uses 140 for this ratio by default. This assumption will generally hold unless your genome contains a sequence of length at least K (31 by default) present in at least 140 copies.
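
            A toy sketch of that rule in Python (purely illustrative - this is not BBNorm's actual code, and "depth_of" stands in for whatever kmer-counting structure you have):

            K = 31        # kmer length (BBNorm's default)
            RATIO = 140   # depth-ratio threshold (BBNorm's default)

            def likely_error_positions(read, depth_of):
                # depth_of: callable mapping a kmer string to its observed depth.
                # Each step right adds exactly one new base, read[i + K - 1];
                # a sharp drop in depth between adjacent kmers implicates it.
                suspects = []
                prev = depth_of(read[0:K])
                for i in range(1, len(read) - K + 1):
                    cur = depth_of(read[i:i + K])
                    if cur == 0 or prev / cur > RATIO:
                        suspects.append(i + K - 1)
                    prev = cur
                return suspects

            # Usage sketch, given a list of reads:
            #   from collections import Counter
            #   depths = Counter(r[i:i+K] for r in reads for i in range(len(r)-K+1))
            #   likely_error_positions(read, lambda kmer: depths[kmer])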

