Problems with BWA XA output on

kbhit

Junior Member

Join Date: Sep 2011

Posts: 9
- Share
- Tweet
#1

Problems with BWA XA output on

12-15-2011, 08:34 AM

I believe the XA information is wrong in many of our reads but I can't understand why. Any help will be greatly appreciated. Essentially I want to be able to rely on the XA information (I'm not only concerned about uniq reads).

History:
- these are single-end reads from SOLiD
- bwa flags are as follows:
bwa aln -c -n 0.04 -N -q 30 -t 60 -f ${OUTFILE}.sai $INDEX ${FASTQ}
bwa samse -n 1000000 -f ${OUTFILE}.txt $INDEX ${OUTFILE}.sai ${FASTQ}
- input file that is indexed has 1314 sequences and is a total of 2.3G
- i'm getting lots of problematic reads (> 74,000) over 16 samples. Could be much more than 74,000 but those were easy to verify
- one of the problem reads are below with questions/concerns I'm hoping to be able to get assistance with

A sample problem read:
208_44_1461 0 gi|307149945|ref|NC_014501.1|_Cyanothece_sp._PCC_7822_chromosome__complete_genome 4159110 25 20M * 0 0 AAATACCGTTCAAATAATGG A!!57A?:@CHQH%=O79:< XT:A:U CM:i:0 X0:i:1 X1:i:1 XM:i:2 XO:i:0 XG:i:0 MD:Z:20 XA:Z:gi|258513327|ref|NC_013213.1|_Acetobacter_pasteurianus_IFO_3283-01_plasmid_pAPA01-040__complete_sequence,+3199,7M1I1M2I4M1D5M,2;

Some concerns with the sample read above:
1. Cyanothece seems to map, but Acetobacter does not. When looking at the Acetobacter sequence, I don't see anything that even closely resembles AAATACCGTTCAAATAATGG. Not even when taking into account the 3 insertions and 1 deletion mentioned in the CIGAR score. As you will see from the X1 score, Acetobacter is a reported suboptimal hit.
2. The Acetobacter has a start coordinate of +3199 (which makes the coordinates 3199 - 3218). However, the actual sequence of Acetobacter is only 3205 in length, so it can't go to 3218. I don't believe it's a problem with boundries because even when looking at 3199-3205 (that are in the sequence), nothing resembles AAATACCGTTCAAATAATGG. I also looked at the sequences surrounding that sequence and didn't see anything it could map to (even taking the reverse complement). My bigger fear is there are also reported reads that don't exceed the sequence boundry that don't map at all.
3. Acetobacter has a CIGAR score of 7M1I1M2I4M1D5M which I believe should have been 'filtered out' given the bwa flags shown above. While I still don't see the alignment, the CIGAR score likely should have been filtered despite it being a suboptimal hit (unless I'm not understanding the suboptimal hit thresholds correctly).

Any help will be greatly appreciated. I'm mainly trying to understand if we should continue to use BWA for reading 'non-uniq' reads, use some other method, or apply some work-arounds. Could the problems be only with sub-optimal hits? Or could it be more wide-spread and effect all repeat-hits or even uniq hits?

Thank you.
Tags: None
jenli

Junior Member

Join Date: Feb 2012

Posts: 1
- Share
- Tweet
#2

02-29-2012, 02:16 PM

I have a similar problem with bwa 0.6.1 and XA.

I have been checking SRA reads against a portion of the mouse genome.
One of hits is:

SRR067605.10526006_0 0 R16_3260025_3260114;8547975 37 23 50M * 0 0 ACTTATTATATAATATTATTTCATTTACAGAGGAACATATAATACCTTAT GGGGGGGGGGGGGGGGGGGGGFGGGGGGGGGGGGEFDFFFGGGFGFGGGG XT:A:U NM:i:0 X0:i:1 X1:i:1 XM:i:0 XO:i:0 XG:i:0 MD:Z:50 XA:Z:U16_3259971_3260068;8547974,+91,6M2I42M,2;

which means this read is mapped to R16_3260025_3260114;8547975 and U16_3259971_3260068;8547974. But U16_* was too short for an alignment
that started at base 91.

So, I isolated R16_* and U16_* (and another one) from my 1 Gb subject fasta
into a file of three records:

>U16_3259971_3260068;8547974
CTTCTTCTATTTTTTGATTATGGTTTTTGGTATGTATATTATATCCAGGGAATGTGTTTA
TTACAGAAGAAATAATCATCATTTATTAGGACTTATTA
>R16_3260025_3260114;8547975
TGTTTATTACAGAAGAAATAATCATCATTTATTAGGACTTATTATATAATATTATTTCAT
TTACAGAGGAACATATAATACCTTATAAAT
>U16_3260071_3260149;8547976
TAATATTATTTCATTTACAGAGGAACATATAATACCTTATAAATGTATATGAAATTGATA
AAAGTTTTTACAGTGTAAG

Isolated the read into a fastq file:

@SRR067605.10526006_0
ACTTATTATATAATATTATTTCATTTACAGAGGAACATATAATACCTTAT
+SRR067605.10526006_0
GGGGGGGGGGGGGGGGGGGGGFGGGGGGGGGGGGEFDFFFGGGFGFGGGG

And reran the alignment just between the two, the hit from bwa now becomes:

SRR067605.10526006_0 0 R16_3260025_3260114;8547975 37 37 50M * 0 0
ACTTATTATATAATATTATTTCATTTACAGAGGAACATATAATACCTTAT GGGGGGGGGGGGGGGGGGGGGFGGGGGGGGGGGGEFDFFFGGGFGFGGGG XT:A:U NM:i:0 X0:i:1 X1:i:0 XM:i:0 XO:i:0 XG:i:0 MD:Z:50

No U16 hit...very perplexing...
I can't post the 1 Gb subject fasta file. Don't quite know why it incorrectly
picked U16_* when aligning against the larger subject file, but not when the
subject file is only 3 records.

Any recommendations ?

Thanks
Comment

Previous template Next

Reply to Nine Things a Sample Prep Scientist Thinks About Before Sequencing

by GATTACAT

Love this - good data definitely starts from good input, and poor input can only give relatively poor data. I particularly like the mention of Nanodrop/absorbance based methods for quantification. It's such a toss up if you'll get an accurate reading or what amounts to a randomly generated number, and a lot of library/sequencing related issues can be traced back to poor quant.
- Channel: Articles
07-01-2026, 11:43 AM
Nine Things a Sample Prep Scientist Thinks About Before Sequencing

by SEQadmin2

I’m not a sequencing expert. I’m a purification scientist who uses NGS to evaluate workflows my group develops. With this perspective, we think about the sample first and the NGS workflow second. The sequencer is an exceptionally honest reporter, but it can only report on what you give it, so whether you get clean, interpretable data from an NGS workflow is largely determined before you begin.

Here are nine questions we think about, in roughly the order they matter, before...
- Channel: Articles
06-18-2026, 07:11 AM

Topics	Statistics	Last Post
High-Resolution Sequencing Exposes Hidden Toxoplasma Diversity by SEQadmin2 Started by SEQadmin2, 07-02-2026, 11:08 AM	0 responses 9 views 0 reactions	Last Post by SEQadmin2 07-02-2026, 11:08 AM
New AI Model Captures Long-Range Genomic Signals to Improve RNA Splice Site Prediction by SEQadmin2 Started by SEQadmin2, 06-30-2026, 05:37 AM	0 responses 13 views 0 reactions	Last Post by SEQadmin2 06-30-2026, 05:37 AM
Large-Scale Protein Screen Uncovers Hidden Regulators of Alternative Polyadenylation by SEQadmin2 Started by SEQadmin2, 06-26-2026, 11:10 AM	0 responses 20 views 0 reactions	Last Post by SEQadmin2 06-26-2026, 11:10 AM
Whole-Genome Sequencing Traces Faroe Islands Ancestry to a North Atlantic Founder Population by SEQadmin2 Started by SEQadmin2, 06-17-2026, 06:09 AM	0 responses 54 views 0 reactions	Last Post by SEQadmin2 06-17-2026, 06:09 AM

Unconfigured Ad

Problems with BWA XA output on

Comment

Latest Articles

ad_right_rmr

News