  • Need help defining BWA parameters for human screening

    I'm working with metagenomic bacterial samples from various human body sites, and these samples have varying levels of human contamination. I have a large number of 100mer Illumina reads, and I'm trying to set up a pipeline that will use BWA to identify human contaminant reads in these metagenomic communities. The reference I am using is basically the NCBI build 36 human genome. I have build 37 as well, but it's > 4 GB in size, so for now I'm sticking with b36.
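
    For context, the skeleton of the pipeline looks roughly like this (a sketch; file names are placeholders, and I'm using single-end bwa aln/samse with samtools to pull out the reads that hit human):

        # index the human reference once (-a bwtsw for genome-sized references)
        bwa index -a bwtsw human_b36.fa

        # align the 100mer reads and generate SAM
        bwa aln -t 8 human_b36.fa reads.fastq > reads.sai
        bwa samse human_b36.fa reads.sai reads.fastq > reads.sam

        # mapped reads (SAM flag 4 unset) are called human contaminant
        samtools view -S -F 4 reads.sam | cut -f1 | sort -u > human_read_ids.txt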

    I've built a control query file of reads of known bacterial origin, and into that I have spiked a set of known human reads. I'm trying to assess various BWA parameter settings using that control set.
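
    For what it's worth, the control set was assembled along these lines (again a sketch; the HUMAN_ prefix is just my labeling convention so the spike reads stay identifiable downstream, not anything BWA needs):

        # relabel the FASTQ header line of each spiked human read
        awk 'NR % 4 == 1 { sub(/^@/, "@HUMAN_") } { print }' human_subset.fastq > human_spike.fastq

        # combine with the known bacterial reads
        cat bacterial_reads.fastq human_spike.fastq > control.fastq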

    BWA is doing surprisingly poorly at what I had thought would be a simple application of the aligner. My control set contains:

    40 million bacterial reads
    4 million human reads

    (about one lane's worth of data from a 50 Gb Illumina run, which is our current standard)

    My first thought was simply to set a mismatch/edit-distance cutoff. I've tried:

    parameter    human spike found    false positives (of 40M bacterial reads)
    -n 15        73.1%                3593
    -n 20        76.3%                3628
    -n 25        79.3%                3710
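
    (The numbers above come from intersecting the IDs of the mapped reads with the known spike IDs; roughly:)

        # IDs of control reads that mapped to human
        samtools view -S -F 4 control.sam | cut -f1 | sort -u > mapped_ids.txt

        # sensitivity: how many of the 4M spiked human reads were found
        grep -c '^HUMAN_' mapped_ids.txt

        # false positives: mapped reads that are not from the spike
        grep -vc '^HUMAN_' mapped_ids.txt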

    This seems quite low. I then found some cases where a read that clearly had only 3 mismatches (confirmed using BLASTN) was NOT being found by BWA. After some head-scratching, I realized that those 3 mismatches were occurring within the first 30 bases of the read, and it looks like the default seeding behavior of BWA for 100mer reads is to allow only 2 mismatches in the first 30 bases. To catch these cases where > 2 mismatches occur early in the read, I tried setting the seed to the full length of my read and allowing the same number of mismatches in the seed as in the full read. I tried:

    -n 25 -k 25 -l 99 (for 100mer reads)
    found 62.1% of human spike, falsely identified 31/40mil bacteria

    but as you can see, those parameters did significantly worse than just using -n 25. This doesn't make sense to me: I had thought those parameters would push the seed length up to 99bp, allow 25 mismatches within the seed, and then allow those same 25 mismatches across the full read. So I had expected to find MORE of the human spike than the -n 25 setting found by itself.
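
    (For concreteness, that run was invoked along these lines, with placeholder paths:)

        bwa aln -n 25 -k 25 -l 99 human_b36.fa control.fastq > control.sai
        bwa samse human_b36.fa control.sai control.fastq > control.sam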


    Basically, I just need to come up with a set of parameters that can do this screening job. I'd love a deeper explanation of why the -n 25 -k 25 -l 99 parameter set found fewer hits than -n 25 by itself, but mostly I am hoping that someone here can suggest parameters I can use for this human screen. Any suggestions?


    Thanks,
    John Martin

  • #2
    Firstly, please DO NOT use the toplevel file from Ensembl. It has >1 Gbp of ambiguous bases as placeholders. You may use the b37 reference file here: ftp://ftp.ncbi.nih.gov/1000genomes/f...cal/reference/

    Secondly, bwa-short does not work well with large -n. This is by design. You may consider disabling seeding with -l 10000, but I guess it will be impractically slow. What is the sequencing error rate you are simulating? It seems very high. The typical Illumina sequencing error rate is only ~1%, which means a 100bp read only has a couple of errors.
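
    (Disabling seeding would look something like the following sketch; any -l larger than the read length means no seed constraint is applied:)

        # no seeding: the whole read is aligned under the -n limit; expect this to be slow
        bwa aln -l 10000 human_b37.fa reads.fastq > reads.sai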



    • #3
      The expected real variation between individuals is < 1%, but the sequencing error we're seeing in the Illumina 100mer reads varies quite a bit. I used a collection of 100mer reads from a different project being done here at WashU for my human control data; I did not simulate errors. They were real 100mer Illumina reads from a different individual (I did not use the Ensembl/NCBI human build for the control).

      I am not sure what sequencing error rate I should expect in these Illumina 100mers. I do see some fraction of the reads containing a significant number of ambiguous bases (N); maybe ~1-2% of the reads have > 3 ambiguous bases. I was trying such a high -n setting to allow reads containing Ns to align to human.
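
      (I estimated that fraction by counting Ns on the sequence line of each FASTQ record; a rough sketch:)

          # count reads with more than 3 ambiguous bases
          # (gsub with an identical replacement leaves the line intact and returns the match count)
          awk 'NR % 4 == 2 { if (gsub(/N/, "N", $0) > 3) c++ } END { print c }' reads.fastq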

      What is the largest value of -n that bwa-short can safely use in alignments?



      • #4
        I usually recommend the default. For 100bp reads, the default allows 5 mismatches/gaps. You may use -n 6 or 7, but not more. If I were analyzing the data, I would throw away those ~1-2% of reads in the first place. In addition, are these purity-filtered reads? Do the reads have poor-quality tails? You may try -q15 to trim the tails. In general, I guess your data is atypical. For single-end 100bp reads, bwa should map more than 90% to human; with paired-end data, even more.
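
        (In other words, something along these lines, with placeholder file names:)

            # default -n (0.04) corresponds to ~5 mismatches/gaps on a 100bp read;
            # -q 15 trims the poor-quality tail before alignment
            bwa aln -q 15 human_b37.fa reads.fastq > reads.sai
            bwa samse human_b37.fa reads.sai reads.fastq > reads.sam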



        • #5
          I will give your suggestion a try. I am not sure whether the tails of these 100mer Illumina reads are poor quality or not, but I can use my control set to play around with the -q setting.

          It's very interesting to me to hear that the maximum suggested -n value is 7. The fact that we are doing alignments with settings as high as -n 25 may explain some of the weird behavior we are seeing from the aligner.

          Thanks for the help.
