Dear members,
I need some advice or clarification on the topic of “Low-abundance k-mer trimming”.
Background:
source: bacterial genomes (isolates of <6MB genome). The GC content vary according to the isolate. For the following sample the GC content is ~63%.
sequencing technology: HiSeq2500 PE125 Rapid Mode sequencing using the Nextera XT. Avg insert size of 293.46
objective: assembly, mapping and annotation of sequenced genomes
The raw data (kmer frequency histogram, see figure 1 in attachment) was processed following the steps (see below) using the BBMap package:
After processing of the raw data I noticed a high peak on the origin of the histogram (see figures 2A and 2B), maybe most of the kmers are unique? A search in literature and forums (not to many) identified this observation as sequencing errors or a mixture of real and erroneous kmers (for variable coverage).
In those cases they offer two solutions prior to assembly or mapping: (1) trimming the low-abundance kmer - useful for removing errors from reads using the khmer script and (2) error correction – maybe using the “ecc” flag during trimming with the BBMap packager or alternative the error correction tool incorporated in the SPAdes assembler.
For a test, I followed a suggestion (see below) posted by a member to remove low-coverage reads (<5X) without a reference.
I noticed a reduction in the peak, but not significantly (see figure 3A and 3B).
The questions are: Do I need to worry about this “unique” or low coverage reads? Is something that error correction will ‘correct’ prior to assembly (in the case of SPAdes) or using BBmap (ecc)? Is error correction a viable solution, since JGI do not perform or advise this step for analysis?
Sorry for the long post, I want to include the more information available. Thanks for your time and patience.
Vicente
I need some advice or clarification on the topic of “Low-abundance k-mer trimming”.
Background:
source: bacterial genomes (isolates of <6MB genome). The GC content vary according to the isolate. For the following sample the GC content is ~63%.
sequencing technology: HiSeq2500 PE125 Rapid Mode sequencing using the Nextera XT. Avg insert size of 293.46
objective: assembly, mapping and annotation of sequenced genomes
The raw data (kmer frequency histogram, see figure 1 in attachment) was processed following the steps (see below) using the BBMap package:
****************************************************************************
1. Removal of adapters
****************************************************************************
java -Xmx29g -cp /path/to/current jgi.BBDukF in1=raw_r1.fastq.gz in2= raw_r2.fastq.gz out1= step01_clean_r1.fastq.gz out2= step01_clean_r2.fastq.gz outm1=step01_adapters_r1.fastq.gz outm2= step01_adapters_r2.fastq.gz ref=/path/to/resources/nextera.fa.gz ktrim=r k=28 mink=13 hdist=1 tbo tpe
****************************************************************************
2. Removal of phix and artifacts
****************************************************************************
java -Xmx29g -cp /path/to/current jgi.BBDukF in1=step01_clean_r1.fastq.gz in2=step01_clean_r2.fastq.gz out1=step02_clean_r1.fastq.gz out2=step02_clean_r2.fastq.gz outm1=artifacts_r1.fastq.gz outm2=artifacts_r2.fastq.gz ref=/path/to/resources/phix174_ill.ref.fa.gz,/path/to/resources/Illumina.artifacts.2013.12.fa.gz k=31 hdist=1
****************************************************************************
3. Removal of human/plants/fungus/ribosomal
****************************************************************************
java -Xmx29g -cp /path/to/current align2.BBMap minid=0.95 maxindel=3 bwr=0.16 bw=12 quickmatch fast minhits=2 ref=/path/to/resources/hg19_main_mask_ribo_animal_allplant_allfungus.fa.gz build=1 qtrim=rl trimq=10 untrim in1=step02_clean_r1.fastq.gz in2=step02_clean_r2.fastq.gz outu1=step03_clean_r1.fastq.gz outu2=step03_clean_r2.fastq.gz outm=contaminants.fastq.gz
****************************************************************************
4. Trimming
****************************************************************************
java -Xmx29g -cp /path/to/current jgi.BBDukF in1=step03_clean_r1.fastq.gz in2=step03_clean_r2.fastq.gz out1=step04_clean_r1.fastq.gz out2=step04_clean_r2.fastq.gz outm1=step04_trimm_r1.fastq.gz outm2=step04_trimm_r2.fastq.gz qtrim=rl trimq=10 minlength=100
1. Removal of adapters
****************************************************************************
java -Xmx29g -cp /path/to/current jgi.BBDukF in1=raw_r1.fastq.gz in2= raw_r2.fastq.gz out1= step01_clean_r1.fastq.gz out2= step01_clean_r2.fastq.gz outm1=step01_adapters_r1.fastq.gz outm2= step01_adapters_r2.fastq.gz ref=/path/to/resources/nextera.fa.gz ktrim=r k=28 mink=13 hdist=1 tbo tpe
****************************************************************************
2. Removal of phix and artifacts
****************************************************************************
java -Xmx29g -cp /path/to/current jgi.BBDukF in1=step01_clean_r1.fastq.gz in2=step01_clean_r2.fastq.gz out1=step02_clean_r1.fastq.gz out2=step02_clean_r2.fastq.gz outm1=artifacts_r1.fastq.gz outm2=artifacts_r2.fastq.gz ref=/path/to/resources/phix174_ill.ref.fa.gz,/path/to/resources/Illumina.artifacts.2013.12.fa.gz k=31 hdist=1
****************************************************************************
3. Removal of human/plants/fungus/ribosomal
****************************************************************************
java -Xmx29g -cp /path/to/current align2.BBMap minid=0.95 maxindel=3 bwr=0.16 bw=12 quickmatch fast minhits=2 ref=/path/to/resources/hg19_main_mask_ribo_animal_allplant_allfungus.fa.gz build=1 qtrim=rl trimq=10 untrim in1=step02_clean_r1.fastq.gz in2=step02_clean_r2.fastq.gz outu1=step03_clean_r1.fastq.gz outu2=step03_clean_r2.fastq.gz outm=contaminants.fastq.gz
****************************************************************************
4. Trimming
****************************************************************************
java -Xmx29g -cp /path/to/current jgi.BBDukF in1=step03_clean_r1.fastq.gz in2=step03_clean_r2.fastq.gz out1=step04_clean_r1.fastq.gz out2=step04_clean_r2.fastq.gz outm1=step04_trimm_r1.fastq.gz outm2=step04_trimm_r2.fastq.gz qtrim=rl trimq=10 minlength=100
After processing of the raw data I noticed a high peak on the origin of the histogram (see figures 2A and 2B), maybe most of the kmers are unique? A search in literature and forums (not to many) identified this observation as sequencing errors or a mixture of real and erroneous kmers (for variable coverage).
In those cases they offer two solutions prior to assembly or mapping: (1) trimming the low-abundance kmer - useful for removing errors from reads using the khmer script and (2) error correction – maybe using the “ecc” flag during trimming with the BBMap packager or alternative the error correction tool incorporated in the SPAdes assembler.
For a test, I followed a suggestion (see below) posted by a member to remove low-coverage reads (<5X) without a reference.
java -Xmx29g -cp /path/to/current jgi.KmerNormalize in1=step04_clean_r1.fastq.gz in2=step04_clean_r2.fastq.gz out1=step05_clean_r1.fastq.gz out2=step05_clean_r2.fastq.gz target=99999999 min=5 passes=1 hist=kmer_histogram.txt
I noticed a reduction in the peak, but not significantly (see figure 3A and 3B).
The questions are: Do I need to worry about this “unique” or low coverage reads? Is something that error correction will ‘correct’ prior to assembly (in the case of SPAdes) or using BBmap (ecc)? Is error correction a viable solution, since JGI do not perform or advise this step for analysis?
Sorry for the long post, I want to include the more information available. Thanks for your time and patience.
Vicente
Comment