Header Leaderboard Ad


count reads spanning multiple variants from BAM file



No announcement yet.
  • Filter
  • Time
  • Show
Clear All
new posts

  • count reads spanning multiple variants from BAM file

    Hello, I have a BAM and a phased VCF for a sample and my goal is to count how many individual reads (not one variant covered by a read and other variant covered by mate pair, but both variants covered by an individual read) confirm the phased genotypes for a pair of variant sites.

    For example, if I have two sites in a phase block at coordinates chr11:128839353 and chr11:128839368, with genotypes 0|1 and 0|1 respectively, I want to know how many individual reads that have the reference allele at the first site also have the reference allele at the second site.

    Any help is much appreciated!

  • #2
    For posterity:
    import argparse
    import pysam
    import os

    # Define the command-line arguments
    parser = argparse.ArgumentParser()
    parser.add_argument("bam_file", help="Path to the BAM file")
    parser.add_argument("nucleotide1", help="The nucleotide to count at position 1 (e.g., T)")
    parser.add_argument("pos1", help="Position 1 (e.g., chr11:128839353)")
    parser.add_argument("nucleotide2", help="The nucleotide to count at position 2 (e.g., C)")
    parser.add_argument("pos2", help="Position 2 (e.g., chr11:128839368)")
    args = parser.parse_args()

    # Split the positions into chromosome and position
    pos1_chrom, pos1_pos = args.pos1.split(":")
    pos2_chrom, pos2_pos = args.pos2.split(":")

    # Convert the position values to integers
    pos1_pos = int(pos1_pos)
    pos2_pos = int(pos2_pos)

    # Open the BAM file
    bam_file = pysam.AlignmentFile(args.bam_file, "rb")

    # Initialize a counter for the number of reads
    count = 0

    # Iterate through the reads in the BAM file
    for read in bam_file.fetch():
    # Check if the read is on the same chromosome as pos1 and pos2
    if read.reference_name == pos1_chrom and read.reference_name == pos2_chrom:
    # Check if the read overlaps with either pos1 or pos2
    if (read.reference_start <= pos1_pos < read.reference_end) or (read.reference_start <= pos2_pos < read.reference_end):
    # Get the positions of the bases in the reference genome that correspond to the bases in the read
    ref_positions = read.get_reference_positions(full_length=True)
    # Initialize a flag for whether the read has the specified nucleotides at both positions
    has_nucleotides_at_both_positions = True
    # Check if the read has the specified nucleotide1 at pos1
    if pos1_pos in ref_positions:
    has_nucleotides_at_both_positions = has_nucleotides_at_both_positions and (read.query_sequence[ref_positions.index(pos1_pos)] == args.nucleotide1)
    has_nucleotides_at_both_positions = False
    # Check if the read has the specified nucleotide2 at pos2
    if pos2_pos in ref_positions:
    has_nucleotides_at_both_positions = has_nucleotides_at_both_positions and (read.query_sequence[ref_positions.index(pos2_pos)] == args.nucleotide2)
    has_nucleotides_at_both_positions = False
    # If the read has the specified nucleotides at both positions, increment the counter
    if has_nucleotides_at_both_positions:
    count += 1

    # Print the count

    # get read depth
    read_depth = bam_file.count(pos1_chrom, pos1_pos, pos1_pos+1)

    # print read depth

    # Close the BAM file

    To run:
    python phaseCount.py test.bam C chr11:128839352 C chr11:128839367
    where test.bam is subset to include reads overlapping BOTH sites (can be generate with bedtools intersect -wa -a original.bam -b sites.bed -F 1.0 > test.bam)


    Latest Articles


    • seqadmin
      A Brief Overview and Common Challenges in Single-cell Sequencing Analysis
      by seqadmin

      ​​​​​​The introduction of single-cell sequencing has advanced the ability to study cell-to-cell heterogeneity. Its use has improved our understanding of somatic mutations1, cell lineages2, cellular diversity and regulation3, and development in multicellular organisms4. Single-cell sequencing encompasses hundreds of techniques with different approaches to studying the genomes, transcriptomes, epigenomes, and other omics of individual cells. The analysis of single-cell sequencing data i...

      01-24-2023, 01:19 PM
    • seqadmin
      Introduction to Single-Cell Sequencing
      by seqadmin
      Single-cell sequencing is a technique used to investigate the genome, transcriptome, epigenome, and other omics of individual cells using high-throughput sequencing. This technology has provided many scientific breakthroughs and continues to be applied across many fields, including microbiology, oncology, immunology, neurobiology, precision medicine, and stem cell research.

      The advancement of single-cell sequencing began in 2009 when Tang et al. investigated the single-cell transcriptomes
      01-09-2023, 03:10 PM