Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • count reads spanning multiple variants from BAM file

    Hello, I have a BAM and a phased VCF for a sample and my goal is to count how many individual reads (not one variant covered by a read and other variant covered by mate pair, but both variants covered by an individual read) confirm the phased genotypes for a pair of variant sites.

    For example, if I have two sites in a phase block at coordinates chr11:128839353 and chr11:128839368, with genotypes 0|1 and 0|1 respectively, I want to know how many individual reads that have the reference allele at the first site also have the reference allele at the second site.

    Any help is much appreciated!

  • #2
    For posterity:
    import argparse
    import pysam
    import os

    # Define the command-line arguments
    parser = argparse.ArgumentParser()
    parser.add_argument("bam_file", help="Path to the BAM file")
    parser.add_argument("nucleotide1", help="The nucleotide to count at position 1 (e.g., T)")
    parser.add_argument("pos1", help="Position 1 (e.g., chr11:128839353)")
    parser.add_argument("nucleotide2", help="The nucleotide to count at position 2 (e.g., C)")
    parser.add_argument("pos2", help="Position 2 (e.g., chr11:128839368)")
    args = parser.parse_args()

    # Split the positions into chromosome and position
    pos1_chrom, pos1_pos = args.pos1.split(":")
    pos2_chrom, pos2_pos = args.pos2.split(":")

    # Convert the position values to integers
    pos1_pos = int(pos1_pos)
    pos2_pos = int(pos2_pos)

    # Open the BAM file
    bam_file = pysam.AlignmentFile(args.bam_file, "rb")

    # Initialize a counter for the number of reads
    count = 0

    # Iterate through the reads in the BAM file
    for read in bam_file.fetch():
    # Check if the read is on the same chromosome as pos1 and pos2
    if read.reference_name == pos1_chrom and read.reference_name == pos2_chrom:
    # Check if the read overlaps with either pos1 or pos2
    if (read.reference_start <= pos1_pos < read.reference_end) or (read.reference_start <= pos2_pos < read.reference_end):
    # Get the positions of the bases in the reference genome that correspond to the bases in the read
    ref_positions = read.get_reference_positions(full_length=True)
    # Initialize a flag for whether the read has the specified nucleotides at both positions
    has_nucleotides_at_both_positions = True
    # Check if the read has the specified nucleotide1 at pos1
    if pos1_pos in ref_positions:
    has_nucleotides_at_both_positions = has_nucleotides_at_both_positions and (read.query_sequence[ref_positions.index(pos1_pos)] == args.nucleotide1)
    else:
    has_nucleotides_at_both_positions = False
    # Check if the read has the specified nucleotide2 at pos2
    if pos2_pos in ref_positions:
    has_nucleotides_at_both_positions = has_nucleotides_at_both_positions and (read.query_sequence[ref_positions.index(pos2_pos)] == args.nucleotide2)
    else:
    has_nucleotides_at_both_positions = False
    # If the read has the specified nucleotides at both positions, increment the counter
    if has_nucleotides_at_both_positions:
    count += 1

    # Print the count
    print(count)

    # get read depth
    read_depth = bam_file.count(pos1_chrom, pos1_pos, pos1_pos+1)

    # print read depth
    print(read_depth)

    # Close the BAM file
    bam_file.close()


    To run:
    python phaseCount.py test.bam C chr11:128839352 C chr11:128839367
    where test.bam is subset to include reads overlapping BOTH sites (can be generate with bedtools intersect -wa -a original.bam -b sites.bed -F 1.0 > test.bam)

    Comment

    Latest Articles

    Collapse

    • seqadmin
      Non-Coding RNA Research and Technologies
      by seqadmin




      Non-coding RNAs (ncRNAs) do not code for proteins but play important roles in numerous cellular processes including gene silencing, developmental pathways, and more. There are numerous types including microRNA (miRNA), long ncRNA (lncRNA), circular RNA (circRNA), and more. In this article, we discuss innovative ncRNA research and explore recent technological advancements that improve the study of ncRNAs.

      Nobel Prize for MicroRNA Discovery
      This week,...
      10-07-2024, 08:07 AM
    • seqadmin
      Recent Developments in Metagenomics
      by seqadmin





      Metagenomics has improved the way researchers study microorganisms across diverse environments. Historically, studying microorganisms relied on culturing them in the lab, a method that limits the investigation of many species since most are unculturable1. Metagenomics overcomes these issues by allowing the study of microorganisms regardless of their ability to be cultured or the environments they inhabit. Over time, the field has evolved, especially with the advent...
      09-23-2024, 06:35 AM

    ad_right_rmr

    Collapse

    News

    Collapse

    Topics Statistics Last Post
    Started by seqadmin, 10-02-2024, 04:51 AM
    0 responses
    104 views
    0 likes
    Last Post seqadmin  
    Started by seqadmin, 10-01-2024, 07:10 AM
    0 responses
    112 views
    0 likes
    Last Post seqadmin  
    Started by seqadmin, 09-30-2024, 08:33 AM
    1 response
    116 views
    0 likes
    Last Post EmiTom
    by EmiTom
     
    Started by seqadmin, 09-26-2024, 12:57 PM
    0 responses
    23 views
    0 likes
    Last Post seqadmin  
    Working...
    X