Unconfigured Ad

Collapse
X
 
  • Filter
  • Time
  • Show
Clear All
new posts
  • kbhit
    Junior Member
    • Sep 2011
    • 9

    Problems with BWA XA output on

    I believe the XA information is wrong in many of our reads but I can't understand why. Any help will be greatly appreciated. Essentially I want to be able to rely on the XA information (I'm not only concerned about uniq reads).

    History:
    - these are single-end reads from SOLiD
    - bwa flags are as follows:
    bwa aln -c -n 0.04 -N -q 30 -t 60 -f ${OUTFILE}.sai $INDEX ${FASTQ}
    bwa samse -n 1000000 -f ${OUTFILE}.txt $INDEX ${OUTFILE}.sai ${FASTQ}
    - input file that is indexed has 1314 sequences and is a total of 2.3G
    - i'm getting lots of problematic reads (> 74,000) over 16 samples. Could be much more than 74,000 but those were easy to verify
    - one of the problem reads are below with questions/concerns I'm hoping to be able to get assistance with

    A sample problem read:
    208_44_1461 0 gi|307149945|ref|NC_014501.1|_Cyanothece_sp._PCC_7822_chromosome__complete_genome 4159110 25 20M * 0 0 AAATACCGTTCAAATAATGG A!!57A?:@CHQH%=O79:< XT:A:U CM:i:0 X0:i:1 X1:i:1 XM:i:2 XO:i:0 XG:i:0 MD:Z:20 XA:Z:gi|258513327|ref|NC_013213.1|_Acetobacter_pasteurianus_IFO_3283-01_plasmid_pAPA01-040__complete_sequence,+3199,7M1I1M2I4M1D5M,2;

    Some concerns with the sample read above:
    1. Cyanothece seems to map, but Acetobacter does not. When looking at the Acetobacter sequence, I don't see anything that even closely resembles AAATACCGTTCAAATAATGG. Not even when taking into account the 3 insertions and 1 deletion mentioned in the CIGAR score. As you will see from the X1 score, Acetobacter is a reported suboptimal hit.
    2. The Acetobacter has a start coordinate of +3199 (which makes the coordinates 3199 - 3218). However, the actual sequence of Acetobacter is only 3205 in length, so it can't go to 3218. I don't believe it's a problem with boundries because even when looking at 3199-3205 (that are in the sequence), nothing resembles AAATACCGTTCAAATAATGG. I also looked at the sequences surrounding that sequence and didn't see anything it could map to (even taking the reverse complement). My bigger fear is there are also reported reads that don't exceed the sequence boundry that don't map at all.
    3. Acetobacter has a CIGAR score of 7M1I1M2I4M1D5M which I believe should have been 'filtered out' given the bwa flags shown above. While I still don't see the alignment, the CIGAR score likely should have been filtered despite it being a suboptimal hit (unless I'm not understanding the suboptimal hit thresholds correctly).

    Any help will be greatly appreciated. I'm mainly trying to understand if we should continue to use BWA for reading 'non-uniq' reads, use some other method, or apply some work-arounds. Could the problems be only with sub-optimal hits? Or could it be more wide-spread and effect all repeat-hits or even uniq hits?

    Thank you.
  • jenli
    Junior Member
    • Feb 2012
    • 1

    #2
    I have a similar problem with bwa 0.6.1 and XA.

    I have been checking SRA reads against a portion of the mouse genome.
    One of hits is:

    SRR067605.10526006_0 0 R16_3260025_3260114;8547975 37 23 50M * 0 0 ACTTATTATATAATATTATTTCATTTACAGAGGAACATATAATACCTTAT GGGGGGGGGGGGGGGGGGGGGFGGGGGGGGGGGGEFDFFFGGGFGFGGGG XT:A:U NM:i:0 X0:i:1 X1:i:1 XM:i:0 XO:i:0 XG:i:0 MD:Z:50 XA:Z:U16_3259971_3260068;8547974,+91,6M2I42M,2;

    which means this read is mapped to R16_3260025_3260114;8547975 and U16_3259971_3260068;8547974. But U16_* was too short for an alignment
    that started at base 91.

    So, I isolated R16_* and U16_* (and another one) from my 1 Gb subject fasta
    into a file of three records:

    >U16_3259971_3260068;8547974
    CTTCTTCTATTTTTTGATTATGGTTTTTGGTATGTATATTATATCCAGGGAATGTGTTTA
    TTACAGAAGAAATAATCATCATTTATTAGGACTTATTA
    >R16_3260025_3260114;8547975
    TGTTTATTACAGAAGAAATAATCATCATTTATTAGGACTTATTATATAATATTATTTCAT
    TTACAGAGGAACATATAATACCTTATAAAT
    >U16_3260071_3260149;8547976
    TAATATTATTTCATTTACAGAGGAACATATAATACCTTATAAATGTATATGAAATTGATA
    AAAGTTTTTACAGTGTAAG

    Isolated the read into a fastq file:

    @SRR067605.10526006_0
    ACTTATTATATAATATTATTTCATTTACAGAGGAACATATAATACCTTAT
    +SRR067605.10526006_0
    GGGGGGGGGGGGGGGGGGGGGFGGGGGGGGGGGGEFDFFFGGGFGFGGGG

    And reran the alignment just between the two, the hit from bwa now becomes:

    SRR067605.10526006_0 0 R16_3260025_3260114;8547975 37 37 50M * 0 0
    ACTTATTATATAATATTATTTCATTTACAGAGGAACATATAATACCTTAT GGGGGGGGGGGGGGGGGGGGGFGGGGGGGGGGGGEFDFFFGGGFGFGGGG XT:A:U NM:i:0 X0:i:1 X1:i:0 XM:i:0 XO:i:0 XG:i:0 MD:Z:50

    No U16 hit...very perplexing...
    I can't post the 1 Gb subject fasta file. Don't quite know why it incorrectly
    picked U16_* when aligning against the larger subject file, but not when the
    subject file is only 3 records.

    Any recommendations ?

    Thanks

    Comment

    Latest Articles

    Collapse

    • GATTACAT
      Reply to Nine Things a Sample Prep Scientist Thinks About Before Sequencing
      by GATTACAT
      Love this - good data definitely starts from good input, and poor input can only give relatively poor data. I particularly like the mention of Nanodrop/absorbance based methods for quantification. It's such a toss up if you'll get an accurate reading or what amounts to a randomly generated number, and a lot of library/sequencing related issues can be traced back to poor quant.
      07-01-2026, 11:43 AM
    • SEQadmin2
      Nine Things a Sample Prep Scientist Thinks About Before Sequencing
      by SEQadmin2


      I’m not a sequencing expert. I’m a purification scientist who uses NGS to evaluate workflows my group develops. With this perspective, we think about the sample first and the NGS workflow second. The sequencer is an exceptionally honest reporter, but it can only report on what you give it, so whether you get clean, interpretable data from an NGS workflow is largely determined before you begin.

      Here are nine questions we think about, in roughly the order they matter, before...
      06-18-2026, 07:11 AM

    ad_right_rmr

    Collapse

    News

    Collapse

    Topics Statistics Last Post
    Started by SEQadmin2, 07-02-2026, 11:08 AM
    0 responses
    9 views
    0 reactions
    Last Post SEQadmin2  
    Started by SEQadmin2, 06-30-2026, 05:37 AM
    0 responses
    13 views
    0 reactions
    Last Post SEQadmin2  
    Started by SEQadmin2, 06-26-2026, 11:10 AM
    0 responses
    20 views
    0 reactions
    Last Post SEQadmin2  
    Started by SEQadmin2, 06-17-2026, 06:09 AM
    0 responses
    54 views
    0 reactions
    Last Post SEQadmin2  
    Working...