Seqanswers Leaderboard Ad

Collapse
X
 
  • Filter
  • Time
  • Show
Clear All
new posts
  • Alphred
    Junior Member
    • Feb 2013
    • 3

    Automate nucleotide-specific coverage retrieval from BAMs?

    I've read previous threads on computing coverage from BAMs, and I can't seem to find an answer to my specific quandary. I am aware that coverage can be quite easily computed from a BAM file over intervals/genes/exons/etc. provided in a BED or GFF file using Samtools Pileup, Depth, or Bedtools Coverage, or GTK, but I would also like to be able to pull out average nucleotide-specific coverages within each of the intervals specified.

    Specifically, I have a bunch of aligned, indexed BAM files from whole-exome sequencing data, and I have a BED file containing the locations of coding regions of all RefSeq genes. For each BAM, I want to be able to pull out average coverage per coding region of each specific nucleotide, that is, the average coverage of A's, T's, G's and C's. Furthermore, I would like to be able query the coverage of nucleotides within specific contexts for each coding region, such as, for example, what is the average coverage of all G's in CG contexts? Similarly, I would also like to get nucleotide specific lengths for each region in the BED file.

    Does anyone know if this is doable using the basic tools provided in Samtools or Bedtools? If not, how difficult would it be to pipe some output (like a pileup) into a custom python script to get this done without making my vastly under-powered machine explode?

    Thanks in advance for your help!
  • swbarnes2
    Senior Member
    • May 2008
    • 910

    #2
    CoverageBed in BEDtools will do most of what you want. You may want a bit of scripting to massage its output to exactly what you want.

    Comment

    • EricHaugen
      Member
      • Sep 2009
      • 13

      #3
      The reference base and read depth at each position could be obtained in an easily-parseable VCF format with
      samtools mpileup -uf reference.fasta alignment.bam | bcftools view -cg -

      I use the BEDOPS tools for other interval slicing/dicing such as you describe, they generally use minimal memory by using only sorted inputs, and the new "--max-mem" option in sort-bed may help make anything doable on your machine.

      bedtools has the advantage of working with that VCF output directly though.

      Comment

      • Alphred
        Junior Member
        • Feb 2013
        • 3

        #4
        Thanks for the responses!

        I'm still a little concerned about the amount of memory I will need to obtain read depth at each position. I'm not trying to get coverage of just a list of SNPs or sequences, but literally every base in every exon in the genome. I was hoping there might be some way to obtain averages of base-specific coverage across intervals without having to output the depth at every base. I think (though I could be wrong) Bedtools genomeCoverageBed -d option can output the number of reads covering each and every base in each chromosome . . . for the entire genome this would equate to 3 billion rows of text. Is it possible to narrow the genomeCoverageBed command to cover only coding regions? Or is there a similar function (like CoverageBed) that can take a BED file of exon locations (as opposed to just chromosomes) and compute their per base coverage? In this case, maybe the number of rows would be narrowed down to 0.75 billion or so . . .

        Am I approaching this problem correctly, or is there a more efficient solution?

        Ultimately, the results I want would be something like this:

        Gene_____Avg.Coverage.A_____Avg.Coverage.T_____Avg.Coverage.G.in.CG
        a___________51________________73________________67
        b___________39________________89________________100
        Last edited by Alphred; 02-28-2013, 06:32 PM.

        Comment

        • EricHaugen
          Member
          • Sep 2009
          • 13

          #5
          The number of rows will not be a memory problem if you only process one line at a time, and the final program in the piped command-line (e.g. your python script or maybe "groupby") just saves the counts per gene.

          That bcftools command gives every base, not just variants. The only reason I suggested it instead of "bedtools coverage" is that it also gives you the reference base, which is not in the BAM file.

          Comment

          • AlexReynolds
            Member
            • Feb 2013
            • 45

            #6
            Originally posted by EricHaugen View Post
            bedtools has the advantage of working with that VCF output directly though.
            The vcf2bed script that is part of BEDOPS can be used to pipe BED data as standard input into other BEDOPS tools, like sort-bed and bedops, e.g.:

            $ vcf2bed < my.vcf \
            | sort-bed --max-mem 2G - \
            | bedops --element-of -1 - foo.bed \
            > answer.bed


            The hyphen character denotes the use of standard input, in place of a regular file. (Of course, standard output from vcf2bed can be piped in bedtools utilities or any other app that takes in standard input.)

            Another option is to use named pipes, if there is more than one input, two or more of which are of non-BED formats, e.g.:

            $ mkfifo my_vcf_pipe
            $ vcf2bed < my.vcf | sort-bed - > my_vcf_pipe
            $ mkfifo my_gff_pipe
            $ gff2bed < my.gff | sort-bed - > my_gff_pipe
            $ bedops --intersect my_vcf_pipe my_gff_pipe > answer.bed
            $ rm my_vcf_pipe my_gff_pipe
            Last edited by AlexReynolds; 02-28-2013, 11:16 PM.

            Comment

            • Alphred
              Junior Member
              • Feb 2013
              • 3

              #7
              Cool, thanks guys!

              Just to make sure I understand correctly, based on the comments, I could use:

              samtools mpileup -u -l exons.bed -f ref.fasta my.bam | bcftools view | “some tool to annotate gene names to positions” | “simple script to sum rows by gene and nucleotide” > output.vcf

              Should this accomplish what I’m after? If so, what would be the simplest (no frills) method to annotate gene names to Chrom + Pos?

              Comment

              • swbarnes2
                Senior Member
                • May 2008
                • 910

                #8
                If you have a .bed file with gene positions, you can use that to assign each line in the .bam to a gene position. You might just want to use that, and then use something like cut | sort | uniq -c to count how many reads hit every gene. With read counts + gene lengths, you can calculate average coverage.

                Comment

                Latest Articles

                Collapse

                • seqadmin
                  New Genomics Tools and Methods Shared at AGBT 2025
                  by seqadmin


                  This year’s Advances in Genome Biology and Technology (AGBT) General Meeting commemorated the 25th anniversary of the event at its original venue on Marco Island, Florida. While this year’s event didn’t include high-profile musical performances, the industry announcements and cutting-edge research still drew the attention of leading scientists.

                  The Headliner
                  The biggest announcement was Roche stepping back into the sequencing platform market. In the years since...
                  03-03-2025, 01:39 PM
                • seqadmin
                  Investigating the Gut Microbiome Through Diet and Spatial Biology
                  by seqadmin




                  The human gut contains trillions of microorganisms that impact digestion, immune functions, and overall health1. Despite major breakthroughs, we’re only beginning to understand the full extent of the microbiome’s influence on health and disease. Advances in next-generation sequencing and spatial biology have opened new windows into this complex environment, yet many questions remain. This article highlights two recent studies exploring how diet influences microbial...
                  02-24-2025, 06:31 AM

                ad_right_rmr

                Collapse

                News

                Collapse

                Topics Statistics Last Post
                Started by seqadmin, 03-20-2025, 05:03 AM
                0 responses
                23 views
                0 reactions
                Last Post seqadmin  
                Started by seqadmin, 03-19-2025, 07:27 AM
                0 responses
                28 views
                0 reactions
                Last Post seqadmin  
                Started by seqadmin, 03-18-2025, 12:50 PM
                0 responses
                22 views
                0 reactions
                Last Post seqadmin  
                Started by seqadmin, 03-03-2025, 01:15 PM
                0 responses
                190 views
                0 reactions
                Last Post seqadmin  
                Working...