  • Correcting for high coverage in alignments

    Hi all,

    I am trying to obtain relative abundances of assembled contigs in a number of metagenomes. My plan was to align my reads to the assemblies, calculate the average coverage of each assembly, and normalize for the sequencing depth of each metagenome. Unfortunately, I don't have a way of correcting for high-coverage regions in these assemblies, where coverage can increase by orders of magnitude and potentially skew my results. Is anyone familiar with an available tool that would help with this?
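
    To make the last step concrete, here is a minimal sketch of what I mean by normalizing for sequencing depth (scaling each contig's average coverage to coverage per million reads; the numbers and names below are placeholders rather than output from any particular tool):

    def depth_normalized(avg_cov, total_reads):
        # Scale per-contig average coverage to "coverage per million reads"
        # so that metagenomes sequenced to different depths are comparable.
        scale = total_reads / 1e6
        return {contig: cov / scale for contig, cov in avg_cov.items()}

    # Toy example with made-up numbers:
    print(depth_normalized({"contig_1": 12.4, "contig_2": 310.0}, total_reads=25_000_000))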

  • #2
    I don't understand the problem: high-coverage regions in assemblies are high-coverage either because that organism is abundant, because the region is repetitive, or because it is a homologous region shared by many organisms. Either way, it is still an 'abundant contig' in that environment and should be classified as such. If you want to calculate abundance by counting reads, then you need to count reads.

    There is a GC-bias in Illumina reads... is that what you mean? So you could normalize by GC content of the contigs if you want, but otherwise I can't think of any processing that would help.

    By normalizing, I mean, for example:

    Contig_5 has 68% GC and a coverage of 70x.
    The sequencing platform overrepresents 68% GC sequences by 47%.
    Thus the normalized coverage of Contig_5 is 70x*100/(100+47)=47.6x.
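
    In code, that correction is a one-liner; here is a rough sketch (the bias table is invented for illustration, and the real values would have to be estimated from your own libraries):

    # Hypothetical GC-bias correction following the Contig_5 example above.
    # The bias table maps GC% to percent over-representation; these values are
    # made up and would need to be estimated from your own data.
    bias_by_gc = {68: 47.0}

    def gc_normalized_coverage(raw_coverage, gc_percent):
        bias = bias_by_gc.get(round(gc_percent), 0.0)
        return raw_coverage * 100.0 / (100.0 + bias)

    print(gc_normalized_coverage(70.0, 68))   # ~47.6x, matching the example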

    Your mapping should be done with ambiguously-mapped reads going to random locations. Incidentally, I have a neat tool called "pileup.sh" that can generate the fold coverage and percentage of bases covered for all scaffolds in an assembly from a SAM file, without needing conversion to BAM and sorting:

    pileup.sh in=mapped.sam out=coverage.txt -Xmx31g

    (The -Xmx flag should not be necessary on Linux systems, but if you need it, set it to around 85% of available memory.)
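
    If you want to pull the per-scaffold values out of coverage.txt downstream, for example to feed a depth normalization like the one described earlier in this thread, something along these lines should work. I'm assuming a tab-delimited table whose first two columns are the scaffold ID and its average fold coverage, so check the header of your own output in case the layout differs:

    # Rough parser for a tab-delimited coverage table such as coverage.txt.
    # Column positions are an assumption; check the header line of your own file.
    avg_fold = {}
    with open("coverage.txt") as handle:
        for line in handle:
            if line.startswith("#") or not line.strip():
                continue                      # skip header and blank lines
            fields = line.rstrip("\n").split("\t")
            avg_fold[fields[0]] = float(fields[1])
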
    Last edited by Brian Bushnell; 04-25-2014, 04:43 PM.

    • #3
      Hey Brian,

      Thanks for the response! Your pileup script sounds very useful, and I'll give it a try.

      However, the issue I'm concerned with is how small regions of exceptionally high coverage within assembled contigs will affect the coverage calculations. In a 50 kb contig, for example, the average coverage over most of the contig may be around 10x, but there may be one or two sections within the same contig, 200-1000 bp long, with much higher coverage (1000x or more). Presumably these mapped reads correspond to things like transposases or conserved domains found in many organisms in the environment, and don't actually come from the assembled contig. The problem is that if I calculate coverage over the entire contig, these high-coverage regions will skew the result. Since we are using coverage as a proxy for relative abundance, this will bias our results. I'm just wondering if there is a high-throughput way to mask or limit read mapping in these regions to remove this bias.
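
      Just to put numbers on it, here is a toy sketch of how much a single spike inflates the contig-wide mean (the depth values are invented, roughly matching the situation described above):

      # Toy illustration: 49 kb at 10x plus a 1 kb repeat at 1000x.
      from statistics import mean, median

      depths = [10] * 49_000 + [1000] * 1_000   # invented per-base depths
      print(mean(depths))     # ~29.8x: the spike nearly triples the apparent coverage
      print(median(depths))   # 10x: a robust estimator ignores the spike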

      • #4
        They could come from conserved domains like 16S rRNA genes, or they could come from collapsed repeats. You might try BLASTing a few of the high-coverage areas to see what they are. If they are ribosomal, you could just filter out all the reads that map to a ribosomal database like Silva.
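
        As a rough sketch of that filtering step, supposing you already have a list of read IDs that hit Silva (say, the first column of a BLAST tabular report) and want to drop those reads from a FASTQ before re-mapping; the file names here are just placeholders:

        # Drop reads whose IDs appear in a list of ribosomal hits before re-mapping.
        # "ribosomal_read_ids.txt" is assumed to hold one read ID per line.
        with open("ribosomal_read_ids.txt") as handle:
            ribosomal = {line.strip() for line in handle}

        with open("reads.fastq") as fin, open("filtered.fastq", "w") as fout:
            while True:
                record = [fin.readline() for _ in range(4)]   # FASTQ records are 4 lines
                if not record[0]:
                    break
                read_id = record[0][1:].split()[0]            # strip '@' and any description
                if read_id not in ribosomal:
                    fout.writelines(record)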

        You can also avoid bias from small areas of abnormal coverage by using the median instead of the average. My pileup program does not currently do that, but it seems useful, so I'll add it soon.
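
        In the meantime, a per-contig median is easy to compute from a per-base depth table, such as the three-column contig/position/depth output of 'samtools depth'; a quick sketch, with the file name as a placeholder:

        # Median fold coverage per contig from a contig<TAB>position<TAB>depth table.
        from collections import defaultdict
        from statistics import median

        per_contig = defaultdict(list)
        with open("per_base_depth.txt") as handle:       # e.g. 'samtools depth' output
            for line in handle:
                contig, _pos, depth = line.rstrip("\n").split("\t")
                per_contig[contig].append(int(depth))

        for contig, depths in per_contig.items():
            print(contig, median(depths))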

        You could also eliminate all ambiguously-mapped reads from the coverage calculation; areas of super-high coverage often have all of their reads marked as ambiguously mapped. With BBMap, for example, you would include the 'ambig=toss' flag.
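
        For example, the mapping command might look something like this (paths are placeholders; double-check the flags against the BBMap documentation for your version):

        bbmap.sh ref=assembly.fa in=reads.fq out=mapped.sam ambig=toss
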
        Last edited by Brian Bushnell; 04-28-2014, 03:46 PM.
