Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • count reads spanning multiple variants from BAM file

    Hello, I have a BAM and a phased VCF for a sample and my goal is to count how many individual reads (not one variant covered by a read and other variant covered by mate pair, but both variants covered by an individual read) confirm the phased genotypes for a pair of variant sites.

    For example, if I have two sites in a phase block at coordinates chr11:128839353 and chr11:128839368, with genotypes 0|1 and 0|1 respectively, I want to know how many individual reads that have the reference allele at the first site also have the reference allele at the second site.

    Any help is much appreciated!

  • #2
    For posterity:
    import argparse
    import pysam
    import os

    # Define the command-line arguments
    parser = argparse.ArgumentParser()
    parser.add_argument("bam_file", help="Path to the BAM file")
    parser.add_argument("nucleotide1", help="The nucleotide to count at position 1 (e.g., T)")
    parser.add_argument("pos1", help="Position 1 (e.g., chr11:128839353)")
    parser.add_argument("nucleotide2", help="The nucleotide to count at position 2 (e.g., C)")
    parser.add_argument("pos2", help="Position 2 (e.g., chr11:128839368)")
    args = parser.parse_args()

    # Split the positions into chromosome and position
    pos1_chrom, pos1_pos = args.pos1.split(":")
    pos2_chrom, pos2_pos = args.pos2.split(":")

    # Convert the position values to integers
    pos1_pos = int(pos1_pos)
    pos2_pos = int(pos2_pos)

    # Open the BAM file
    bam_file = pysam.AlignmentFile(args.bam_file, "rb")

    # Initialize a counter for the number of reads
    count = 0

    # Iterate through the reads in the BAM file
    for read in bam_file.fetch():
    # Check if the read is on the same chromosome as pos1 and pos2
    if read.reference_name == pos1_chrom and read.reference_name == pos2_chrom:
    # Check if the read overlaps with either pos1 or pos2
    if (read.reference_start <= pos1_pos < read.reference_end) or (read.reference_start <= pos2_pos < read.reference_end):
    # Get the positions of the bases in the reference genome that correspond to the bases in the read
    ref_positions = read.get_reference_positions(full_length=True)
    # Initialize a flag for whether the read has the specified nucleotides at both positions
    has_nucleotides_at_both_positions = True
    # Check if the read has the specified nucleotide1 at pos1
    if pos1_pos in ref_positions:
    has_nucleotides_at_both_positions = has_nucleotides_at_both_positions and (read.query_sequence[ref_positions.index(pos1_pos)] == args.nucleotide1)
    else:
    has_nucleotides_at_both_positions = False
    # Check if the read has the specified nucleotide2 at pos2
    if pos2_pos in ref_positions:
    has_nucleotides_at_both_positions = has_nucleotides_at_both_positions and (read.query_sequence[ref_positions.index(pos2_pos)] == args.nucleotide2)
    else:
    has_nucleotides_at_both_positions = False
    # If the read has the specified nucleotides at both positions, increment the counter
    if has_nucleotides_at_both_positions:
    count += 1

    # Print the count
    print(count)

    # get read depth
    read_depth = bam_file.count(pos1_chrom, pos1_pos, pos1_pos+1)

    # print read depth
    print(read_depth)

    # Close the BAM file
    bam_file.close()


    To run:
    python phaseCount.py test.bam C chr11:128839352 C chr11:128839367
    where test.bam is subset to include reads overlapping BOTH sites (can be generate with bedtools intersect -wa -a original.bam -b sites.bed -F 1.0 > test.bam)

    Comment

    Latest Articles

    Collapse

    • seqadmin
      Essential Discoveries and Tools in Epitranscriptomics
      by seqadmin




      The field of epigenetics has traditionally concentrated more on DNA and how changes like methylation and phosphorylation of histones impact gene expression and regulation. However, our increased understanding of RNA modifications and their importance in cellular processes has led to a rise in epitranscriptomics research. “Epitranscriptomics brings together the concepts of epigenetics and gene expression,” explained Adrien Leger, PhD, Principal Research Scientist...
      04-22-2024, 07:01 AM
    • seqadmin
      Current Approaches to Protein Sequencing
      by seqadmin


      Proteins are often described as the workhorses of the cell, and identifying their sequences is key to understanding their role in biological processes and disease. Currently, the most common technique used to determine protein sequences is mass spectrometry. While still a valuable tool, mass spectrometry faces several limitations and requires a highly experienced scientist familiar with the equipment to operate it. Additionally, other proteomic methods, like affinity assays, are constrained...
      04-04-2024, 04:25 PM

    ad_right_rmr

    Collapse

    News

    Collapse

    Topics Statistics Last Post
    Started by seqadmin, Yesterday, 11:49 AM
    0 responses
    13 views
    0 likes
    Last Post seqadmin  
    Started by seqadmin, 04-24-2024, 08:47 AM
    0 responses
    16 views
    0 likes
    Last Post seqadmin  
    Started by seqadmin, 04-11-2024, 12:08 PM
    0 responses
    61 views
    0 likes
    Last Post seqadmin  
    Started by seqadmin, 04-10-2024, 10:19 PM
    0 responses
    60 views
    0 likes
    Last Post seqadmin  
    Working...
    X