Unconfigured Ad

Collapse
X
 
  • Filter
  • Time
  • Show
Clear All
new posts
  • bossanova352
    Junior Member
    • Mar 2013
    • 9

    Correcting for high coverage in alignments

    Hi all,

    I am trying to obtain relative abundances of assembled contigs in a number of metagenomes. My plan was to align my reads to the assemblies, calculate average coverage of each assembly, and normalize for the sequence depth of each metagenome. Unfortunately, I don't have a way of correcting for high-coverage regions in these assemblies, where it can increase by orders of magnitude and potentially skew my results. Is anyone familiar with an available tool that would help with this?
  • Brian Bushnell
    Super Moderator
    • Jan 2014
    • 2709

    #2
    I don't understand the problem - the high-coverage regions in assemblies are high-coverage because either that organism is abundant, or repetitive, or it is a homologous region shared by many organisms. Either way, it is still an 'abundant contig' in that environment and should be classified as such. If you want to calculate abundance by counting reads, then you need to count reads.

    There is a GC-bias in Illumina reads... is that what you mean? So you could normalize by GC content of the contigs if you want, but otherwise I can't think of any processing that would help.

    By normalizing, I mean, for example:

    Contig_5 has 68% GC and a coverage of 70x.
    The sequencing platform overrepresents 68% GC sequences by 47%.
    Thus the normalized coverage of Contig_5 is 70x*100/(100+47)=47.6x.

    Your mapping should be done with ambiguously-mapped reads going to random locations. Incidentally, I have a neat tool called "pileup.sh" that can generate the fold-coverage and %bases covered for all scaffolds in an assembly from a sam file, without needing conversion to bam and sorting:

    pileup.sh in=mapped.sam out=coverage.txt -Xmx31g

    (the -Xmx flag should not be necessary on Linux systems. But if you need it, set it at around 85% of available memory)
    Last edited by Brian Bushnell; 04-25-2014, 04:43 PM.

    Comment

    • bossanova352
      Junior Member
      • Mar 2013
      • 9

      #3
      Hey Brian,

      Thanks for the response! Your pileup script sounds very useful, and I'll give it a try.

      However, the issue I'm concerned with is how small regions of exceptionally high coverage in assembled contigs will affect coverage calculations. In a 50 kb contig, for example, the average coverage over most of the contig may be around ~10. However there will be 1-2 sections within the same contig, between 200-1000 bp long, that have a much higher coverage (1000x or more). Presumably, these mapped reads correspond to things like transposases or conserved domains found in many organisms in the environment, and don't actually come from the assembled contig. The problem with this, is that if I am calculating coverage based on the entire contig, these high coverage regions will skew the calculated coverages. Since we are using coverage as a means of relative abundance, this will bias our results. I'm just wondering if there is a high-throughput way to mask or limit read mapping in these regions to remove this bias.

      Comment

      • Brian Bushnell
        Super Moderator
        • Jan 2014
        • 2709

        #4
        They could come from conserved domains like 16s, or they could come from collapsed repeats. You might try blasting a few of the high-coverage areas to see what they are. If they are ribosomal you could just filter out all the reads that map to a ribosomal database like silva.

        You can also avoid bias from small areas with abnormal coverage by using the median instead of average. My pileup program does not currently do that, but it seems useful, so I'll add it in soon.

        You could also eliminate all ambiguously-mapped reads from the coverage calculation; often areas of super-high coverage have all of the reads marked as ambiguously mapped. With bbmap, for example, you would include the 'ambig=toss' flag.
        Last edited by Brian Bushnell; 04-28-2014, 03:46 PM.

        Comment

        Latest Articles

        Collapse

        ad_right_rmr

        Collapse

        News

        Collapse

        Topics Statistics Last Post
        Started by SEQadmin2, 06-09-2026, 11:58 AM
        0 responses
        21 views
        0 reactions
        Last Post SEQadmin2  
        Started by SEQadmin2, 06-05-2026, 10:09 AM
        0 responses
        27 views
        0 reactions
        Last Post SEQadmin2  
        Started by SEQadmin2, 06-04-2026, 08:59 AM
        0 responses
        38 views
        0 reactions
        Last Post SEQadmin2  
        Started by SEQadmin2, 06-02-2026, 12:03 PM
        0 responses
        61 views
        0 reactions
        Last Post SEQadmin2  
        Working...