**sarvidsson** · 02-19-2015, 08:17 AM

Error correction is usually a good idea, and very safe for haploid high coverage samples. I'd try all three alternatives (trimming as you did, error correction with BayesHammer/SPAdes, error correction with BBTools), run the assembly for all the alternatives and look at the assembly stats afterwards.

**Brian Bushnell** · 02-19-2015, 10:44 AM

Hi Vicente,

I'd like to clarify a couple of things. First, "hg19_main_mask_ribo_animal_allplant_allfungus.fa.gz" was designed to remove human contaminants only, and pass everything else - so, the "ribo_animal_allplant_allfungus" part means that all areas of the human genome similar to those things were masked so that they will not be detected.

Second - a peak for kmers with depth 1 is generally going to be present in all data no matter how you process it, due to the presence of errors. You can't make it go away entirely, and if you have sufficient coverage it typically does not hurt the assembly much, other than making it take more time and memory, because assemblers generally ignore kmers with depth 1. It's the error kmers at higher depth that are most troublesome.
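To illustrate why errors produce a depth-1 peak, here is a minimal Python sketch on toy data. The sequence, the read set, and the function names (`kmer_depths`, `depth_histogram`) are made up for this example, and k=5 is used only to keep it small; it is not how BBNorm or any assembler counts kmers internally.

```python
from collections import Counter

def kmer_depths(reads, k):
    """Count how many times each kmer occurs across all reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts

def depth_histogram(counts):
    """Map depth -> number of distinct kmers observed at that depth."""
    return Counter(counts.values())

# Toy data (hypothetical): 20 error-free copies of one sequence,
# plus a single read carrying one substituted base at index 17.
GENOME = "ACGTTGCATGGCCATAGCTAGGATCCGATTACAGT"
K = 5
clean_reads = [GENOME] * 20
bad_read = GENOME[:17] + "T" + GENOME[18:]

hist = depth_histogram(kmer_depths(clean_reads + [bad_read], K))
for depth in sorted(hist):
    print(f"depth {depth}: {hist[depth]} distinct kmers")
```

In this toy case the single substitution creates K new kmers, all at depth 1, while the true kmers stay at high depth. With a real k of 31, one error can add up to 31 depth-1 kmers, which is why the peak is so large in raw data.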

Running BBNorm with "min=5" will only get rid of reads in which most of their kmers are below depth 5; it will not get rid of reads that have a handful of kmers with low depth due to a single error. So, it does not have much impact on the depth-1 peak. Running with the "ecc" flag will have a much greater impact on reducing that peak.
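The difference between a read that is mostly low-depth and a read with one error can be sketched in Python as follows. This is only an illustration under stated assumptions: the "more than half of the kmers are below the minimum depth" rule, the 0.5 cutoff, and the names `low_depth_fraction` and `keep_read` are invented for the example and are not BBNorm's actual decision rule.

```python
from collections import Counter

def kmer_depths(reads, k):
    """Count how many times each kmer occurs across all reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts

def low_depth_fraction(read, counts, k, min_depth):
    """Fraction of the read's kmers whose depth is below min_depth."""
    kmers = [read[i:i + k] for i in range(len(read) - k + 1)]
    low = sum(1 for kmer in kmers if counts[kmer] < min_depth)
    return low / len(kmers)

def keep_read(read, counts, k, min_depth, max_low_fraction=0.5):
    """Assumed rule: discard a read only if most of its kmers are low-depth."""
    return low_depth_fraction(read, counts, k, min_depth) <= max_low_fraction

# Toy data (hypothetical): a well-covered sequence, one read with a single
# substitution, and one unrelated low-coverage read.
GENOME = "ACGTTGCATGGCCATAGCTAGGATCCGATTACAGT"
K = 5
one_error_read = GENOME[:17] + "T" + GENOME[18:]
junk_read = "CCCTAAAGTCGGTGTTCAACGCGAA"
reads = [GENOME] * 20 + [one_error_read, junk_read]
counts = kmer_depths(reads, K)

keeps_one_error_read = keep_read(one_error_read, counts, K, min_depth=5)
keeps_junk_read = keep_read(junk_read, counts, K, min_depth=5)
print(keeps_one_error_read, keeps_junk_read)
```

Under this assumed rule the read with a single error survives, because only the few kmers spanning the error are low-depth, so its depth-1 kmers stay in the data. Removing them requires correcting the base rather than filtering the read.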

You can also reduce the peak by quality-trimming the data. For example, adding "qtrim=rl trimq=10" during phiX/artifact removal will do that.

**vingomez** · 02-19-2015, 12:54 PM

Thanks

Originally posted by sarvidsson

Error correction is usually a good idea, and very safe for haploid high coverage samples. I'd try all three alternatives (trimming as you did, error correction with BayesHammer/SPAdes, error correction with BBTools), run the assembly for all the alternatives and look at the assembly stats afterwards.

Thanks for the rapid response.

Vicente

**vingomez** · 02-19-2015, 01:16 PM

Originally posted by Brian Bushnell

Hi Vicente,

I'd like to clarify a couple of things. First, "hg19_main_mask_ribo_animal_allplant_allfungus.fa.gz" was designed to remove human contaminants only, and pass everything else - so, the "ribo_animal_allplant_allfungus" part means that all areas of the human genome similar to those things were masked so that they will not be detected.

Thanks for the clarification.

Originally posted by Brian Bushnell

Second - a peak for kmers with depth 1 is generally going to be present in all data no matter how you process it, due to the presence of errors. You can't make it go away entirely, and if you have sufficient coverage it typically does not hurt the assembly much, other than making it take more time and memory, because assemblers generally ignore kmers with depth 1. It's the error kmers at higher depth that are most troublesome.

Maybe this is a topic for another post, but is there an approach or basic principle for detecting or addressing error kmers at higher depth?

Originally posted by Brian Bushnell

Running BBNorm with "min=5" will only get rid of reads in which most of their kmers are below depth 5; it will not get rid of reads that have a handful of kmers with low depth due to a single error. So, it does not have much impact on the depth-1 peak. Running with the "ecc" flag will have a much greater impact on reducing that peak.

The 'ecc' flag definitely reduces the depth-1 peak.

| Step | Total unique kmers | Unique kmers (depth=1) |
|---|---|---|
| Raw data | 113,306,625 | 106,166,190 |
| Removal of adapters | 105,694,714 | 96,663,826 |
| Removal of phiX and artifacts | 19,649,281 | 13,639,215 |
| Removal of human contaminants | 19,645,918 | 13,635,884 |
| ecc | 7,164,977 | 1,576,790 |
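For reference, the share of unique kmers that sit at depth 1 after each step can be computed from the counts above with a short Python snippet. The variable names are only for illustration; the numbers are the ones reported in this post.

```python
# (step, total unique kmers, unique kmers at depth 1), copied from the counts above
steps = [
    ("raw data", 113_306_625, 106_166_190),
    ("removal of adapters", 105_694_714, 96_663_826),
    ("removal of phiX and artifacts", 19_649_281, 13_639_215),
    ("removal of human contaminants", 19_645_918, 13_635_884),
    ("ecc", 7_164_977, 1_576_790),
]

# Fraction of unique kmers that occur exactly once, per step
fractions = [depth1 / total for _, total, depth1 in steps]

for (name, _, _), frac in zip(steps, fractions):
    print(f"{name:32s} {frac:6.1%}")
```

The depth-1 share falls from roughly 94% of unique kmers in the raw data to roughly 22% after error correction.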

Originally posted by Brian Bushnell

You can also reduce the peak by quality-trimming the data. For example, adding "qtrim=rl trimq=10" during phiX/artifact removal will do that.

I added the quality-trimming flags to step #2 (phiX/artifact removal). This had no effect (a similar number of unique kmers at the end of the analysis), but one step was removed (Step #4).

Thanks again for your response and the effort to develop and maintain these software packages.

**Brian Bushnell** · 02-19-2015, 01:31 PM

Originally posted by vingomez

Maybe this is a topic for another post, but is there an approach or basic principle for detecting or addressing error kmers at higher depth?

BBNorm does this already, as I designed it for use with single-cell amplified data and other scenarios with super-high coverage, where the exact same error can occur many times purely by chance. The principle is, basically, that if a kmer in a read has depth X, and the adjacent kmer has depth Y, then the last base in the second kmer is probably an error if the ratio X/Y is greater than some constant. BBNorm, by default, uses 140 for this ratio. This assumption will generally hold unless you have a genomic sequence in your DNA of at least length K (31 by default) with at least 140 copies.
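The depth-ratio principle can be sketched in a few lines of Python. This is a simplified illustration, not BBNorm's implementation: the function name `flag_suspect_bases`, the toy sequence, k=5, and the left-to-right-only scan are all assumptions made for the example.

```python
from collections import Counter

def kmer_depths(reads, k):
    """Count how many times each kmer occurs across all reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts

def flag_suspect_bases(read, counts, k, ratio=140):
    """Return 0-based positions of bases that look like errors.

    For each pair of adjacent kmers with depths X (first) and Y (second),
    the last base of the second kmer is flagged when X / Y > ratio.
    """
    suspects = []
    for i in range(1, len(read) - k + 1):
        x = counts[read[i - 1:i - 1 + k]]
        y = counts[read[i:i + k]]
        if x > ratio * y:  # same test as X / Y > ratio, safe when y == 0
            suspects.append(i + k - 1)
    return suspects

# Toy data (hypothetical): 200 error-free copies plus one read with a
# substitution at index 17, so the depth drops from 201 to 1 at the error.
GENOME = "ACGTTGCATGGCCATAGCTAGGATCCGATTACAGT"
K = 5
bad_read = GENOME[:17] + "T" + GENOME[18:]
counts = kmer_depths([GENOME] * 200 + [bad_read], K)

suspects = flag_suspect_bases(bad_read, counts, K)
print(suspects)
```

In this sketch only the first kmer that includes the wrong base triggers the flag, because the following error kmers have a low-depth neighbor on their left. The same test still works when an error has been seen several times, as long as the true kmer is more than 140 times deeper.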


Low-abundance kmer - what to do?
