  • Low-abundance kmer - what to do?

    Dear members,


    I need some advice or clarification on the topic of “Low-abundance k-mer trimming”.


    Background:
    source: bacterial genomes (isolates with genomes of <6 Mb). The GC content varies by isolate; for the sample below it is ~63%.
    sequencing technology: HiSeq2500 PE125 Rapid Mode sequencing using the Nextera XT library prep. Average insert size of 293.46 bp.
    objective: assembly, mapping and annotation of sequenced genomes


    The raw data (k-mer frequency histogram, see figure 1 in the attachment) was processed with the following steps using the BBMap package:

    ****************************************************************************
    1. Removal of adapters
    ****************************************************************************

    java -Xmx29g -cp /path/to/current jgi.BBDukF in1=raw_r1.fastq.gz in2=raw_r2.fastq.gz out1=step01_clean_r1.fastq.gz out2=step01_clean_r2.fastq.gz outm1=step01_adapters_r1.fastq.gz outm2=step01_adapters_r2.fastq.gz ref=/path/to/resources/nextera.fa.gz ktrim=r k=28 mink=13 hdist=1 tbo tpe

    ****************************************************************************
    2. Removal of phix and artifacts
    ****************************************************************************

    java -Xmx29g -cp /path/to/current jgi.BBDukF in1=step01_clean_r1.fastq.gz in2=step01_clean_r2.fastq.gz out1=step02_clean_r1.fastq.gz out2=step02_clean_r2.fastq.gz outm1=artifacts_r1.fastq.gz outm2=artifacts_r2.fastq.gz ref=/path/to/resources/phix174_ill.ref.fa.gz,/path/to/resources/Illumina.artifacts.2013.12.fa.gz k=31 hdist=1

    ****************************************************************************
    3. Removal of human/plants/fungus/ribosomal
    ****************************************************************************

    java -Xmx29g -cp /path/to/current align2.BBMap minid=0.95 maxindel=3 bwr=0.16 bw=12 quickmatch fast minhits=2 ref=/path/to/resources/hg19_main_mask_ribo_animal_allplant_allfungus.fa.gz build=1 qtrim=rl trimq=10 untrim in1=step02_clean_r1.fastq.gz in2=step02_clean_r2.fastq.gz outu1=step03_clean_r1.fastq.gz outu2=step03_clean_r2.fastq.gz outm=contaminants.fastq.gz

    ****************************************************************************
    4. Trimming
    ****************************************************************************

    java -Xmx29g -cp /path/to/current jgi.BBDukF in1=step03_clean_r1.fastq.gz in2=step03_clean_r2.fastq.gz out1=step04_clean_r1.fastq.gz out2=step04_clean_r2.fastq.gz outm1=step04_trimm_r1.fastq.gz outm2=step04_trimm_r2.fastq.gz qtrim=rl trimq=10 minlength=100


    After processing the raw data I noticed a high peak at the origin of the histogram (see figures 2A and 2B); perhaps most of the k-mers are unique? A search of the literature and forums (not too many results) identified this observation as sequencing errors, or as a mixture of real and erroneous k-mers (for variable coverage).

    In those cases, two solutions are offered prior to assembly or mapping: (1) trimming low-abundance k-mers, useful for removing errors from reads, using the khmer script, and (2) error correction, either using the "ecc" flag during trimming with the BBMap package, or alternatively the error-correction tool incorporated in the SPAdes assembler.
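
    As an illustration of those two options (a sketch only; the khmer and SPAdes parameters below, the k value, memory limit, and output names are my assumptions and should be checked against each tool's documentation):

    # Option 1: trim low-abundance k-mers with khmer (illustrative parameters)
    trim-low-abund.py -k 31 -C 3 -M 16e9 -o step05_khmer_trimmed.fastq step04_clean_r1.fastq.gz step04_clean_r2.fastq.gz

    # Option 2: run only SPAdes' built-in read error correction (BayesHammer)
    spades.py --only-error-correction -1 step04_clean_r1.fastq.gz -2 step04_clean_r2.fastq.gz -o spades_ecc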


    As a test, I followed a suggestion posted by a member (see below) to remove low-coverage reads (<5X) without a reference.


    java -Xmx29g -cp /path/to/current jgi.KmerNormalize in1=step04_clean_r1.fastq.gz in2=step04_clean_r2.fastq.gz out1=step05_clean_r1.fastq.gz out2=step05_clean_r2.fastq.gz target=99999999 min=5 passes=1 hist=kmer_histogram.txt

    I noticed a reduction in the peak, but not a substantial one (see figures 3A and 3B).


    My questions are: Do I need to worry about these "unique" or low-coverage reads? Is this something that error correction will 'correct' prior to assembly (in the case of SPAdes) or using BBMap (ecc)? Is error correction a viable solution, given that JGI does not perform or advise this step for analysis?


    Sorry for the long post; I wanted to include as much information as possible. Thanks for your time and patience.

    Vicente
    Last edited by vingomez; 02-19-2015, 08:26 AM.

  • #2
    Error correction is usually a good idea, and very safe for haploid high coverage samples. I'd try all three alternatives (trimming as you did, error correction with BayesHammer/SPAdes, error correction with BBTools), run the assembly for all the alternatives and look at the assembly stats afterwards.
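
    A minimal sketch of that comparison, assuming SPAdes for assembly and QUAST for the statistics (the directory names and the choice of QUAST are illustrative, not something prescribed in this thread):

    # Assemble each pre-processed read set (SPAdes runs BayesHammer error correction by default)
    spades.py -1 step04_clean_r1.fastq.gz -2 step04_clean_r2.fastq.gz -o asm_qtrim_only
    spades.py -1 step05_clean_r1.fastq.gz -2 step05_clean_r2.fastq.gz -o asm_lowabund_trim

    # Compare N50, contig counts, total length, etc. across the alternatives
    quast.py asm_qtrim_only/scaffolds.fasta asm_lowabund_trim/scaffolds.fasta -o quast_comparison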



    • #3
      Hi Vicente,

      I'd like to clarify a couple of things. First, "hg19_main_mask_ribo_animal_allplant_allfungus.fa.gz" was designed to remove human contaminants only, and pass everything else - so, the "ribo_animal_allplant_allfungus" part means that all areas of the human genome similar to those things were masked so that they will not be detected.

      Second - A peak for kmers with depth 1 is generally going to be present in all data no matter how you process it, due to the presence of errors. You can't make it go away entirely, and, if you have sufficient coverage, it typically does not negatively affect the assembly process too much beyond making it take more time and memory, because assemblers generally ignore kmers with depth 1. It's the error kmers at higher depth that are most troublesome.

      Running BBNorm with "min=5" will only get rid of reads in which most of their kmers are below depth 5; it will not get rid of reads that have a handful of kmers with low depth due to a single error. So, it does not have much impact on the depth-1 peak. Running with the "ecc" flag will have a much greater impact on reducing that peak.
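
      For example, the normalization command from the first post could be rerun with error correction switched on; this is only a sketch, with placeholder output names, and the flags should be checked against the BBNorm documentation for your BBTools version:

      java -Xmx29g -cp /path/to/current jgi.KmerNormalize in1=step04_clean_r1.fastq.gz in2=step04_clean_r2.fastq.gz out1=step05_ecc_r1.fastq.gz out2=step05_ecc_r2.fastq.gz ecc=t target=99999999 min=5 passes=1 hist=kmer_histogram_ecc.txt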

      You can also reduce the peak by quality-trimming the data. For example, adding "qtrim=rl trimq=10" during phiX/artifact removal will do that.
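
      With that change, step 2 would look something like this (a sketch; only qtrim=rl trimq=10 is new, the rest is copied from the original command):

      java -Xmx29g -cp /path/to/current jgi.BBDukF in1=step01_clean_r1.fastq.gz in2=step01_clean_r2.fastq.gz out1=step02_clean_r1.fastq.gz out2=step02_clean_r2.fastq.gz outm1=artifacts_r1.fastq.gz outm2=artifacts_r2.fastq.gz ref=/path/to/resources/phix174_ill.ref.fa.gz,/path/to/resources/Illumina.artifacts.2013.12.fa.gz k=31 hdist=1 qtrim=rl trimq=10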



      • #4
        Thanks

        Originally posted by sarvidsson View Post
        Error correction is usually a good idea, and very safe for haploid high coverage samples. I'd try all three alternatives (trimming as you did, error correction with BayesHammer/SPAdes, error correction with BBTools), run the assembly for all the alternatives and look at the assembly stats afterwards.
        Thanks for the rapid response.

        Vicente



        • #5
          Originally posted by Brian Bushnell View Post
          Hi Vicente,

          I'd like to clarify a couple of things. First, "hg19_main_mask_ribo_animal_allplant_allfungus.fa.gz" was designed to remove human contaminants only, and pass everything else - so, the "ribo_animal_allplant_allfungus" part means that all areas of the human genome similar to those things were masked so that they will not be detected.
          Thanks for the clarification.


          Originally posted by Brian Bushnell View Post
          Second - A peak for kmers with depth 1 is generally going to be present in all data no matter how you process it, due to the presence of errors. You can't make it go away entirely, and, if you have sufficient coverage, it typically does not negatively affect the assembly process too much beyond making it take more time and memory, because assemblers generally ignore kmers with depth 1. It's the error kmers at higher depth that are most troublesome.
          Maybe this is a topic for another post, but is there an approach or basic principle for detecting or addressing error kmers at higher depth?


          Originally posted by Brian Bushnell View Post
          Running BBNorm with "min=5" will only get rid of reads in which most of their kmers are below depth 5; it will not get rid of reads that have a handful of kmers with low depth due to a single error. So, it does not have much impact on the depth-1 peak. Running with the "ecc" flag will have a much greater impact on reducing that peak.
          The 'ecc' flag definitely reduces the depth-1 peak.

          Step                             Total unique k-mers    Unique k-mers (depth=1)
          Raw data                         113,306,625            106,166,190
          Removal of adapters              105,694,714             96,663,826
          Removal of phiX and artifacts     19,649,281             13,639,215
          Removal of human contaminants     19,645,918             13,635,884
          ecc                                7,164,977              1,576,790



          Originally posted by Brian Bushnell View Post
          You can also reduce the peak by quality-trimming the data. For example, adding "qtrim=rl trimq=10" during phiX/artifact removal will do that.
          I added the quality-trimming flags to step #2 (phiX/artifact removal). This produced no change in the final number of unique k-mers, but it allowed one step (step #4) to be dropped.

          Thanks again for your response and the effort to develop and maintain these software packages.
          Last edited by vingomez; 02-20-2015, 07:15 AM. Reason: Add adapters data



          • #6
            Originally posted by vingomez View Post
            Maybe this is a topic for another post, but is there an approach or basic principle for detecting or addressing error kmers at higher depth?
            BBNorm does this already, as I designed it for use with single-cell amplified data or other scenarios that have super-high coverage, so the exact same error can occur many times, purely by chance. The principle is, basically, if a kmer in a read has depth X, and the adjacent kmer has depth Y, then the last base in the second kmer is probably an error if the ratio X/Y is greater than some constant. BBNorm, by default, uses 140 for this ratio. This assumption will generally hold unless you have a genomic sequence in your DNA of at least length K (31 by default) with at least 140 copies.
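
            As a hypothetical illustration (these depths are made up, not taken from your data): if one kmer in a read has depth X = 4,200 and the adjacent kmer has depth Y = 20, the ratio X/Y = 210 exceeds the default threshold of 140, so the last base of the second kmer would be treated as a probable error; with Y = 40 the ratio is only 105 and the base would be left alone.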
