I'm working with metagenomic bacterial samples from various human body sites, and these samples have varying levels of human contamination. I have a large number of 100mer Illumina reads, and I'm trying to set up a pipeline that will use BWA to identify human contaminant reads in these metagenomic communities. The reference I am using is essentially the NCBI build 36 human genome. I have build 37 as well, but it's > 4 GB in size, so for now I'm sticking with b36.
I've built a control query file of reads of known bacterial origin, into which I have spiked a set of known human reads, and I'm trying to assess various BWA parameters against that control set.
BWA is doing surprisingly poorly in what I had thought would be a simple application of the aligner. My control set contains:
40 million bacterial reads
4 million human reads
(about 1 lane's worth of data from a 50 Gb Illumina run, which is our current standard).
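To score each parameter set, I just count which reads BWA maps to the human reference. Here's a rough sketch of that counting step (the read-name prefixes "human_"/"bact_" are an assumption about how my control set is labeled, and it uses only the SAM 0x4 "unmapped" FLAG bit):

```python
# Score a BWA run against the labeled control set: a mapped human read is a
# true positive, a mapped bacterial read is a false positive.

def score_sam(lines):
    """Return (human_found, bact_false) counts from SAM lines."""
    human_found = bact_false = 0
    for line in lines:
        if line.startswith("@"):          # skip SAM header lines
            continue
        fields = line.rstrip("\n").split("\t")
        name, flag = fields[0], int(fields[1])
        if flag & 0x4:                    # 0x4 set means the read is unmapped
            continue
        if name.startswith("human_"):
            human_found += 1              # human spike read correctly caught
        elif name.startswith("bact_"):
            bact_false += 1               # bacterial read falsely mapped to human
    return human_found, bact_false

if __name__ == "__main__":
    sam = [
        "@SQ\tSN:chr1\tLN:247249719",
        "human_1\t0\tchr1\t100\t37\t100M\t*\t0\t0\t*\t*",
        "human_2\t4\t*\t0\t0\t*\t*\t0\t0\t*\t*",
        "bact_1\t0\tchr1\t500\t25\t100M\t*\t0\t0\t*\t*",
        "bact_2\t4\t*\t0\t0\t*\t*\t0\t0\t*\t*",
    ]
    print(score_sam(sam))  # -> (1, 1)
```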
My first thought was to set a simple mismatch/edit-distance cutoff. I've tried:
(parameter) - (result)
-n 15 - found 73.1% of human spike, falsely identified 3593/40mil bacteria
-n 20 - found 76.3% of human spike, falsely identified 3628/40mil bacteria
-n 25 - found 79.3% of human spike, falsely identified 3710/40mil bacteria
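Expressed as rates, the numbers above show the false-positive side is tiny while the miss rate is huge, which is why the sensitivity bothers me:

```python
# Turn the spike counts above into rates: FP% of 40M bacterial reads,
# and the absolute number of the 4M human spike reads missed.
trials = {15: (73.1, 3593), 20: (76.3, 3628), 25: (79.3, 3710)}
for n, (pct_human, bact_fp) in sorted(trials.items()):
    fp_rate = bact_fp / 40_000_000 * 100
    missed = 4_000_000 * (100 - pct_human) / 100
    print(f"-n {n}: {fp_rate:.4f}% false positives, ~{missed:,.0f} human reads missed")
```

Even at -n 25, roughly 800k human reads per lane would slip through.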
This seems quite low. Then I found cases where a read with only 3 mismatches (confirmed using BLASTN) was NOT being found by BWA. After some head scratching, I realized that those 3 mismatches were occurring within the first 30 bases of the read, and it looks like BWA's default seeding behavior for 100mer reads allows only 2 mismatches in the first 30 bases. To catch the cases where > 2 mismatches occur early in the read, I tried setting the seed to the full length of the read and allowing the same number of mismatches in the seed as in the full read. I tried:
-n 25 -k 25 -l 99 (for 100mer reads)
found 62.1% of human spike, falsely identified 31/40mil bacteria
but as you can see, those parameters did significantly worse than -n 25 alone. This doesn't make sense to me: I had thought those parameters would push the seed length up to 99 bp, allow 25 mismatches in the seed, and then allow those same 25 mismatches across the full read. So I had expected to find MORE of the human spike than the -n 25 setting by itself did.
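For what it's worth, here is my mental model of the seed filter as a toy sketch. This is only a mismatch-counting caricature (the real bwa aln does a backward search over the BWT index, not a direct scan), but it shows the case that bit me: a read whose 3 mismatches all land in the seed region fails under the default 2-mismatch seed even though the whole-read limit would accept it:

```python
def passes_filters(read, ref, seed_len=32, seed_mm=2, total_mm=25):
    """Toy model: a hit must satisfy both the seed mismatch limit
    (first seed_len bases) and the whole-read mismatch limit."""
    mism = [a != b for a, b in zip(read, ref)]
    return sum(mism[:seed_len]) <= seed_mm and sum(mism) <= total_mm

ref = "A" * 100
# 3 mismatches at positions 0, 10, and 20 -- all inside the seed region
read = "T" + "A" * 9 + "T" + "A" * 9 + "T" + "A" * 79

print(passes_filters(read, ref))             # rejected: 3 > 2 seed mismatches
print(passes_filters(read, ref, seed_mm=3))  # accepted once the seed limit is relaxed
```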
Basically I just need a set of parameters that can do this screening job. I'd love a deeper explanation of why the -n 25 -k 25 -l 99 parameter set found fewer hits than -n 25 alone. But mostly I'm hoping someone here can suggest parameters I can use for this human screen. Any suggestions?
Thanks,
John Martin