Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • count reads spanning multiple variants from BAM file

    Hello, I have a BAM and a phased VCF for a sample and my goal is to count how many individual reads (not one variant covered by a read and other variant covered by mate pair, but both variants covered by an individual read) confirm the phased genotypes for a pair of variant sites.

    For example, if I have two sites in a phase block at coordinates chr11:128839353 and chr11:128839368, with genotypes 0|1 and 0|1 respectively, I want to know how many individual reads that have the reference allele at the first site also have the reference allele at the second site.

    Any help is much appreciated!

  • #2
    For posterity:
    import argparse
    import pysam
    import os

    # Define the command-line arguments
    parser = argparse.ArgumentParser()
    parser.add_argument("bam_file", help="Path to the BAM file")
    parser.add_argument("nucleotide1", help="The nucleotide to count at position 1 (e.g., T)")
    parser.add_argument("pos1", help="Position 1 (e.g., chr11:128839353)")
    parser.add_argument("nucleotide2", help="The nucleotide to count at position 2 (e.g., C)")
    parser.add_argument("pos2", help="Position 2 (e.g., chr11:128839368)")
    args = parser.parse_args()

    # Split the positions into chromosome and position
    pos1_chrom, pos1_pos = args.pos1.split(":")
    pos2_chrom, pos2_pos = args.pos2.split(":")

    # Convert the position values to integers
    pos1_pos = int(pos1_pos)
    pos2_pos = int(pos2_pos)

    # Open the BAM file
    bam_file = pysam.AlignmentFile(args.bam_file, "rb")

    # Initialize a counter for the number of reads
    count = 0

    # Iterate through the reads in the BAM file
    for read in bam_file.fetch():
    # Check if the read is on the same chromosome as pos1 and pos2
    if read.reference_name == pos1_chrom and read.reference_name == pos2_chrom:
    # Check if the read overlaps with either pos1 or pos2
    if (read.reference_start <= pos1_pos < read.reference_end) or (read.reference_start <= pos2_pos < read.reference_end):
    # Get the positions of the bases in the reference genome that correspond to the bases in the read
    ref_positions = read.get_reference_positions(full_length=True)
    # Initialize a flag for whether the read has the specified nucleotides at both positions
    has_nucleotides_at_both_positions = True
    # Check if the read has the specified nucleotide1 at pos1
    if pos1_pos in ref_positions:
    has_nucleotides_at_both_positions = has_nucleotides_at_both_positions and (read.query_sequence[ref_positions.index(pos1_pos)] == args.nucleotide1)
    else:
    has_nucleotides_at_both_positions = False
    # Check if the read has the specified nucleotide2 at pos2
    if pos2_pos in ref_positions:
    has_nucleotides_at_both_positions = has_nucleotides_at_both_positions and (read.query_sequence[ref_positions.index(pos2_pos)] == args.nucleotide2)
    else:
    has_nucleotides_at_both_positions = False
    # If the read has the specified nucleotides at both positions, increment the counter
    if has_nucleotides_at_both_positions:
    count += 1

    # Print the count
    print(count)

    # get read depth
    read_depth = bam_file.count(pos1_chrom, pos1_pos, pos1_pos+1)

    # print read depth
    print(read_depth)

    # Close the BAM file
    bam_file.close()


    To run:
    python phaseCount.py test.bam C chr11:128839352 C chr11:128839367
    where test.bam is subset to include reads overlapping BOTH sites (can be generate with bedtools intersect -wa -a original.bam -b sites.bed -F 1.0 > test.bam)

    Comment

    Latest Articles

    Collapse

    • seqadmin
      Genetic Variation in Immunogenetics and Antibody Diversity
      by seqadmin



      The field of immunogenetics explores how genetic variations influence immune responses and susceptibility to disease. In a recent SEQanswers webinar, Oscar Rodriguez, Ph.D., Postdoctoral Researcher at the University of Louisville, and Ruben Martínez Barricarte, Ph.D., Assistant Professor of Medicine at Vanderbilt University, shared recent advancements in immunogenetics. This article discusses their research on genetic variation in antibody loci, antibody production processes,...
      11-06-2024, 07:24 PM
    • seqadmin
      Choosing Between NGS and qPCR
      by seqadmin



      Next-generation sequencing (NGS) and quantitative polymerase chain reaction (qPCR) are essential techniques for investigating the genome, transcriptome, and epigenome. In many cases, choosing the appropriate technique is straightforward, but in others, it can be more challenging to determine the most effective option. A simple distinction is that smaller, more focused projects are typically better suited for qPCR, while larger, more complex datasets benefit from NGS. However,...
      10-18-2024, 07:11 AM

    ad_right_rmr

    Collapse

    News

    Collapse

    Topics Statistics Last Post
    Started by seqadmin, 11-08-2024, 11:09 AM
    0 responses
    34 views
    0 likes
    Last Post seqadmin  
    Started by seqadmin, 11-08-2024, 06:13 AM
    0 responses
    28 views
    0 likes
    Last Post seqadmin  
    Started by seqadmin, 11-01-2024, 06:09 AM
    0 responses
    32 views
    0 likes
    Last Post seqadmin  
    Started by seqadmin, 10-30-2024, 05:31 AM
    0 responses
    23 views
    0 likes
    Last Post seqadmin  
    Working...
    X