I'm working with metagenomic bacterial samples from various human body sites, and these samples have varying levels of human contamination. I have a large number of 100mer Illumina reads, and I'm trying to set up a pipeline that will use BWA to identify human contaminant reads in these metagenomic communities. The reference I am using is essentially the NCBI build 36 human genome. I have build 37 as well, but it's > 4 GB in size, so for now I'm sticking with b36.
I've built a control query file of reads of known bacterial origin, into which I have spiked a set of known human reads, and I'm trying to assess various BWA parameters against that control set.
BWA is doing surprisingly poorly in what I had thought would be a simple application of the aligner. My control set contains:
40 million bacterial reads
4 million human reads
(about 1 lane's worth of data from a 50 Gb Illumina run, which is our current standard).
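To score each parameter set, I just count which reads BWA maps to the human reference. Here's a rough sketch of that counting step (the read-name prefixes "human_"/"bact_" are an assumption about how my control set is labeled, and it uses only the SAM 0x4 "unmapped" FLAG bit):

```python
# Score a BWA run against the labeled control set: a mapped human read is a
# true positive, a mapped bacterial read is a false positive.

def score_sam(lines):
    """Return (human_found, bact_false) counts from SAM lines."""
    human_found = bact_false = 0
    for line in lines:
        if line.startswith("@"):          # skip SAM header lines
            continue
        fields = line.rstrip("\n").split("\t")
        name, flag = fields[0], int(fields[1])
        if flag & 0x4:                    # 0x4 set means the read is unmapped
            continue
        if name.startswith("human_"):
            human_found += 1              # human spike read correctly caught
        elif name.startswith("bact_"):
            bact_false += 1               # bacterial read falsely mapped to human
    return human_found, bact_false

if __name__ == "__main__":
    sam = [
        "@SQ\tSN:chr1\tLN:247249719",
        "human_1\t0\tchr1\t100\t37\t100M\t*\t0\t0\t*\t*",
        "human_2\t4\t*\t0\t0\t*\t*\t0\t0\t*\t*",
        "bact_1\t0\tchr1\t500\t25\t100M\t*\t0\t0\t*\t*",
        "bact_2\t4\t*\t0\t0\t*\t*\t0\t0\t*\t*",
    ]
    print(score_sam(sam))  # -> (1, 1)
```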
My first thought was to set a simple mismatch/edit-distance cutoff. I've tried:
(parameter) - (result)
-n 15 - found 73.1% of human spike, falsely identified 3593/40mil bacteria
-n 20 - found 76.3% of human spike, falsely identified 3628/40mil bacteria
-n 25 - found 79.3% of human spike, falsely identified 3710/40mil bacteria
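Expressed as rates, the numbers above show the false-positive side is tiny while the miss rate is huge, which is why the sensitivity bothers me:

```python
# Turn the spike counts above into rates: FP% of 40M bacterial reads,
# and the absolute number of the 4M human spike reads missed.
trials = {15: (73.1, 3593), 20: (76.3, 3628), 25: (79.3, 3710)}
for n, (pct_human, bact_fp) in sorted(trials.items()):
    fp_rate = bact_fp / 40_000_000 * 100
    missed = 4_000_000 * (100 - pct_human) / 100
    print(f"-n {n}: {fp_rate:.4f}% false positives, ~{missed:,.0f} human reads missed")
```

Even at -n 25, roughly 800k human reads per lane would slip through.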
This seems quite low. Then I found cases where a read with only 3 mismatches (confirmed using BLASTN) was NOT being found by BWA. After some head scratching, I realized that those 3 mismatches were occurring within the first 30 bases of the read, and it looks like BWA's default seeding behavior for 100mer reads allows only 2 mismatches in the first 30 bases. To catch the cases where > 2 mismatches occur early in the read, I tried setting the seed to the full length of the read and allowing the same number of mismatches in the seed as in the full read. I tried:
-n 25 -k 25 -l 99 (for 100mer reads)
found 62.1% of human spike, falsely identified 31/40mil bacteria
but as you can see, those parameters did significantly worse than -n 25 alone. This doesn't make sense to me: I had thought those parameters would push the seed length up to 99 bp, allow 25 mismatches in the seed, and then allow those same 25 mismatches across the full read. So I had expected to find MORE of the human spike than the -n 25 setting by itself did.
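For what it's worth, here is my mental model of the seed filter as a toy sketch. This is only a mismatch-counting caricature (the real bwa aln does a backward search over the BWT index, not a direct scan), but it shows the case that bit me: a read whose 3 mismatches all land in the seed region fails under the default 2-mismatch seed even though the whole-read limit would accept it:

```python
def passes_filters(read, ref, seed_len=32, seed_mm=2, total_mm=25):
    """Toy model: a hit must satisfy both the seed mismatch limit
    (first seed_len bases) and the whole-read mismatch limit."""
    mism = [a != b for a, b in zip(read, ref)]
    return sum(mism[:seed_len]) <= seed_mm and sum(mism) <= total_mm

ref = "A" * 100
# 3 mismatches at positions 0, 10, and 20 -- all inside the seed region
read = "T" + "A" * 9 + "T" + "A" * 9 + "T" + "A" * 79

print(passes_filters(read, ref))             # rejected: 3 > 2 seed mismatches
print(passes_filters(read, ref, seed_mm=3))  # accepted once the seed limit is relaxed
```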
Basically I just need a set of parameters that can do this screening job. I'd love a deeper explanation of why the -n 25 -k 25 -l 99 parameter set found fewer hits than -n 25 alone. But mostly I'm hoping someone here can suggest parameters I can use for this human screen. Any suggestions?
Thanks,
John Martin