  • Correcting for high coverage in alignments

    Hi all,

    I am trying to obtain relative abundances of assembled contigs in a number of metagenomes. My plan was to align my reads to the assemblies, calculate the average coverage of each assembly, and normalize for the sequencing depth of each metagenome. Unfortunately, I don't have a way of correcting for high-coverage regions in these assemblies, where coverage can increase by orders of magnitude and potentially skew my results. Is anyone familiar with an available tool that would help with this?
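
    To make the last step concrete, here is a minimal sketch of what I mean by normalizing for sequencing depth (scaling each contig's average coverage to coverage per million reads; the numbers and names below are placeholders rather than output from any particular tool):

    def depth_normalized(avg_cov, total_reads):
        # Scale per-contig average coverage to "coverage per million reads"
        # so that metagenomes sequenced to different depths are comparable.
        scale = total_reads / 1e6
        return {contig: cov / scale for contig, cov in avg_cov.items()}

    # Toy example with made-up numbers:
    print(depth_normalized({"contig_1": 12.4, "contig_2": 310.0}, total_reads=25_000_000))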

  • #2
    I don't understand the problem: high-coverage regions in assemblies are high-coverage either because that organism is abundant, because the region is repetitive, or because it is a homologous region shared by many organisms. Either way, it is still an 'abundant contig' in that environment and should be classified as such. If you want to calculate abundance by counting reads, then you need to count reads.

    There is a GC-bias in Illumina reads... is that what you mean? So you could normalize by GC content of the contigs if you want, but otherwise I can't think of any processing that would help.

    By normalizing, I mean, for example:

    Contig_5 has 68% GC and a coverage of 70x.
    The sequencing platform overrepresents 68% GC sequences by 47%.
    Thus the normalized coverage of Contig_5 is 70x*100/(100+47)=47.6x.
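
    In code, that correction is a one-liner; here is a rough sketch (the bias table is invented for illustration, and the real values would have to be estimated from your own libraries):

    # Hypothetical GC-bias correction following the Contig_5 example above.
    # The bias table maps GC% to percent over-representation; these values are
    # made up and would need to be estimated from your own data.
    bias_by_gc = {68: 47.0}

    def gc_normalized_coverage(raw_coverage, gc_percent):
        bias = bias_by_gc.get(round(gc_percent), 0.0)
        return raw_coverage * 100.0 / (100.0 + bias)

    print(gc_normalized_coverage(70.0, 68))   # ~47.6x, matching the example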

    Your mapping should be done with ambiguously-mapped reads going to random locations. Incidentally, I have a neat tool called "pileup.sh" that can generate the fold coverage and percentage of bases covered for all scaffolds in an assembly from a SAM file, without needing conversion to BAM and sorting:

    pileup.sh in=mapped.sam out=coverage.txt -Xmx31g

    (The -Xmx flag should not be necessary on Linux systems, but if you need it, set it to around 85% of available memory.)
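
    If you want to pull the per-scaffold values out of coverage.txt downstream, for example to feed a depth normalization like the one described earlier in this thread, something along these lines should work. I'm assuming a tab-delimited table whose first two columns are the scaffold ID and its average fold coverage, so check the header of your own output in case the layout differs:

    # Rough parser for a tab-delimited coverage table such as coverage.txt.
    # Column positions are an assumption; check the header line of your own file.
    avg_fold = {}
    with open("coverage.txt") as handle:
        for line in handle:
            if line.startswith("#") or not line.strip():
                continue                      # skip header and blank lines
            fields = line.rstrip("\n").split("\t")
            avg_fold[fields[0]] = float(fields[1])
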
    Last edited by Brian Bushnell; 04-25-2014, 04:43 PM.

    • #3
      Hey Brian,

      Thanks for the response! Your pileup script sounds very useful, and I'll give it a try.

      However, the issue I'm concerned with is how small regions of exceptionally high coverage within assembled contigs will affect the coverage calculations. In a 50 kb contig, for example, the average coverage over most of the contig may be around 10x, but there may be one or two sections within the same contig, 200-1000 bp long, with much higher coverage (1000x or more). Presumably these mapped reads correspond to things like transposases or conserved domains found in many organisms in the environment, and don't actually come from the assembled contig. The problem is that if I calculate coverage over the entire contig, these high-coverage regions will skew the result. Since we are using coverage as a proxy for relative abundance, this will bias our results. I'm just wondering if there is a high-throughput way to mask or limit read mapping in these regions to remove this bias.
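
      Just to put numbers on it, here is a toy sketch of how much a single spike inflates the contig-wide mean (the depth values are invented, roughly matching the situation described above):

      # Toy illustration: 49 kb at 10x plus a 1 kb repeat at 1000x.
      from statistics import mean, median

      depths = [10] * 49_000 + [1000] * 1_000   # invented per-base depths
      print(mean(depths))     # ~29.8x: the spike nearly triples the apparent coverage
      print(median(depths))   # 10x: a robust estimator ignores the spike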

      • #4
        They could come from conserved domains like 16S rRNA genes, or they could come from collapsed repeats. You might try BLASTing a few of the high-coverage areas to see what they are. If they are ribosomal, you could just filter out all the reads that map to a ribosomal database like Silva.
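
        As a rough sketch of that filtering step, supposing you already have a list of read IDs that hit Silva (say, the first column of a BLAST tabular report) and want to drop those reads from a FASTQ before re-mapping; the file names here are just placeholders:

        # Drop reads whose IDs appear in a list of ribosomal hits before re-mapping.
        # "ribosomal_read_ids.txt" is assumed to hold one read ID per line.
        with open("ribosomal_read_ids.txt") as handle:
            ribosomal = {line.strip() for line in handle}

        with open("reads.fastq") as fin, open("filtered.fastq", "w") as fout:
            while True:
                record = [fin.readline() for _ in range(4)]   # FASTQ records are 4 lines
                if not record[0]:
                    break
                read_id = record[0][1:].split()[0]            # strip '@' and any description
                if read_id not in ribosomal:
                    fout.writelines(record)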

        You can also avoid bias from small areas of abnormal coverage by using the median instead of the average. My pileup program does not currently do that, but it seems useful, so I'll add it soon.
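
        In the meantime, a per-contig median is easy to compute from a per-base depth table, such as the three-column contig/position/depth output of 'samtools depth'; a quick sketch, with the file name as a placeholder:

        # Median fold coverage per contig from a contig<TAB>position<TAB>depth table.
        from collections import defaultdict
        from statistics import median

        per_contig = defaultdict(list)
        with open("per_base_depth.txt") as handle:       # e.g. 'samtools depth' output
            for line in handle:
                contig, _pos, depth = line.rstrip("\n").split("\t")
                per_contig[contig].append(int(depth))

        for contig, depths in per_contig.items():
            print(contig, median(depths))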

        You could also eliminate all ambiguously-mapped reads from the coverage calculation; areas of super-high coverage often have all of their reads marked as ambiguously mapped. With BBMap, for example, you would include the 'ambig=toss' flag.
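
        For example, the mapping command might look something like this (paths are placeholders; double-check the flags against the BBMap documentation for your version):

        bbmap.sh ref=assembly.fa in=reads.fq out=mapped.sam ambig=toss
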
        Last edited by Brian Bushnell; 04-28-2014, 03:46 PM.
