No announcement yet.

Introducing RemoveHuman: Human Contaminant Removal

  • Filter
  • Time
  • Show
Clear All
new posts

  • Introducing RemoveHuman: Human Contaminant Removal

    Contaminated DNA leads to lower-quality results, no matter what study you're doing. One of the most, if not the most, common contaminants in DNA sequencing is human. If you are doing human research, then there's not much you can do about it, but for everyone researching non-vertebrate organisms, read on -

    JGI processes mainly plants, fungi, and microbes; never vertebrates. In an attempt to create the cleanest assemblies possible, I wrote a pipeline for removing human contamination. My first naive implementation mapped data to the human reference, and kept only the unmapped reads. However, this caused problems:

    1) The human reference (HG19) appears to contain some contaminant sequences from other organisms.
    2) Certain features, particularly ribosomes, are so highly conserved at the nucleotide level that humans have 100% identity with even plants and fungi over ~100bp window.
    3) There are many low-complexity sequences like ACACACACACTCTCTCTC... that share near-100% identity with many other organisms.

    Therefore, this approach yielded false positives, which could result in worse assemblies (because they would break at the places homologous to human). Plus, they'd be missing important features.

    So, I created a masked version of HG19 to prevent false positives when removing human contamination. It was masked in 3 ways, using a program I developed called BBMask:

    1) Areas containing multiple repeat short kmers.
    2) Windows of low entropy, calculated using pentamer frequencies.
    3) Via mapping from sam files:
    The mapping is the more interesting approach. I used every fungal assembly on JGI's Mycocosm, every plant genome on JGI's phytozome, and a handful of others, including zebra danio (the closest relative to human I included), and the entire silva ribosomal database. First, I completely removed the assemblies that contained human contamination (a handful). Then, I shredded the remaining data into 70-80bp pieces, mapped it to human requiring a minimum of ~85% identity, and used BBMask to mask everything the shreds hit.

    This masked a total of under 1.4% of the genome, implying that the remaining sequence should still capture 98.4% of human reads.

    To test if it works, I shredded (to 80bp) the wild rice and drosophila references (neither included in my screening) and mapped them to masked human with BBMap, with low sensitivity settings.

    1 drosophila shred mapped at 92.5% identity, and 2 wild rice shreds mapped with 93.75% identity. After increasing the shred length to 90, or min identity to 95%, nothing mapped. So I have increased the minimum percent identity to 95%; for reads of 150bp or longer, I don't expect any false positives on non-animals, and probably none on invertebrates.

    So, this is the final command line: minid=0.95 maxindel=3 bwr=0.16 bw=12 quickmatch fast minhits=2 path=/path/to/hg19masked/ qtrim=rl trimq=10 untrim -Xmx23g in=reads.fq outu=clean.fq outm=human.fq

    You first have to index the reference, like this: ref=hg19_main_mask_ribo_animal_allplant_allfungus.fa.gz -Xmx23g

    The masked reference (880 MB) is available here:

    I hope this will help others! Feedback is welcome. BBMask is very flexible, too, if you want to do your own masking on other organisms.
    Last edited by Brian Bushnell; 10-29-2015, 08:37 PM.

  • #2
    Hi Brian,
    That looks very interesting.
    Did you do any test with Phix.
    We routinely remove Phix read by aligning all reads the the reference.
    Obviously Phix is very small so the chance of false positive is much smaller but I've not seen this tested.
    Especially when using long read and more sensitive aligner like bwa mem.



    • #3
      I use BBDuk to remove phiX, with 31mers and a max hamming distance of 1. phiX is so tiny that I don't expect any false positives - I have never found any in my synthetic tests with those settings. Obviously, it might be a different matter if you were specifically studying viruses. -Xmx1g in=reads.fq out=clean.fq ref=phix.fasta k=31 hdist=1

      You could of course use the same procedure I used with human to mask phix, and map to it rather than doing kmer-based removal, but that's probably a waste of time.


      • #4
        Hi Brian,

        Would you recommend to use BBDuk for pre-filtering reads from microbiome samples? I thought about cleaning out phiX contaminants as described above, followed by extracting 16S rDNA reads by mapping against the qiime rep_set (~ 400k rDNAs) and retrieving matching reads via outmatch=xxx, hdist=0, k=31.



        • #5

          BBDuk is great for removal of known sequences. As for specifically isolating rDNA, I expect it would catch most of it, but I'm not sure about the highly variable regions (it depends on how long and how variable they are). If the insert size is long enough so that at least one of the reads is usually expected to be in a highly conserved region then it should work fine.

          I would suggest filtering as you describe above, but also catching the non-matching reads with "outu=xxx". Then you can map those against the database with a low identity requirement and see if anything was missed.

          BBMap also supports the "outm" and "outu" flags. In this case I don't know whether mapping or kmer-filtering would do better. You could even do both, first running BBDuk then mapping the leftovers and merging the resultant files:

 in=reads.fq outm=matched.fq outu=unmatched.fq ref=ribo.fa k=31

 in=unmatched.fq ref=ribo.fa nodisk outm=matched2.fq outu=unmatched2.fq maxindel=20 minid=0.7

          cat matched.fq matched2.fq > combined.fq

          I guess it depends on whether you are more worried about false positives or false negatives. Oh, and mapping might be incredibly slow on that kind of reference where a read can align to 400k different places equally well.


          • #6
            the phiX filtering step worked well. It reported 5% phiX content, which is pretty close to our requested 4% spike for this run.

            The second step mapped 99.7% of the filtered reads to 16S, which would be fantastic.



            • #7
              indexing the reference

              I have indexed the referenced as you described . Now a folder, "ref" is generated which seems only include until chromosome 7 ... is it how it should be ?! ((is it a right place to ask these sort of questions?)


              • #8
                Originally posted by Naarkhoo View Post
                I have indexed the referenced as you described . Now a folder, "ref" is generated which seems only include until chromosome 7 ... is it how it should be ?! ((is it a right place to ask these sort of questions?)
                The "ref" folder stores the indexes in its own unique way in "genome" and "index" directories. You can't just look in those folder to see what is there nor depend on the file names you see there.

                For example in hg19 BBMap index there are these files on my system:

                Have you tried doing a masking run and found results only from chromosome 7?


                • #9
                  Masked ref for plant organelles?

                  Hi Brian,

                  I am looking for a good way to ensure all mito- and chloro-genome reads are removed from my read data prior to a de novo assembly. Any chance that you have already generated a masked reference file for the genomes from plant mitochondria or chloroplasts?



                  • #10
                    You can create them yourself using Not sure if you would need to if you are just looking to remove reads mapping to mito and chloroplast.

                    I assume you have seen BBsplit, which can be used for this purpose.

                    Description: Masks sequences of low-complexity, or containing repeat kmers, or covered by mapped reads.
                    By default this program will mask using entropy with a window=80 and entropy=0.75

                    Usage: in=<file> out=<file> sam=<file,file,...file>

                    Input may be stdin or a fasta or fastq file, raw or gzipped.[

                    window=80 (w) Window size for entropy calculation.
                    entropy=0.70 (e) Mask windows with entropy under this value (0-1). 0.0001 will mask only homopolymers and 1 will mask everything.
                    Last edited by GenoMax; 08-15-2016, 10:05 AM.


                    • #11
                      And, sorry, but nope - I have not. Chloroplasts are one thing (and I don't know much about them), but mito is another... there are a lot of genes that can move back and forth between mito and the host organism.

                      I find the best way to handle this is de-novo assembly chloroplast and mitochondria independently based on depth-filtering, then you can pull out the reads mapping to those asemblies prior to main-genome assembly. Both should be present at much higher depth than the rest of the genome.

                      For assembling mitochondria from fungal genomes I wrote a script like this:

             in=reads.fq.gz khist=khist_raw.txt peaks=peaks_raw.txt
                      primary=`grep "haploid_fold_coverage" peaks_raw.txt | sed "s/^.*\t//g"`
                      cutoff=$(( $primary * 3 ))
             in=reads.fq.gz out=highpass.fq.gz pigz passes=1 bits=16 min=$cutoff target=9999999
             in=highpass.fq.gz out=highpass_gc.fq.gz maxgc=0.45
             in=highpass_gc.fq.gz khist=khist_100.txt k=100 peaks=peaks_100.txt smooth ow smoothradius=1 maxradius=1000 progressivemult=1.06 maxpeaks=16 prefilter=2
                      mitopeak=`grep "main_peak" peaks_100.txt | sed "s/^.*\t//g"`
                      upper=$((mitopeak * 6 / 3))
                      lower=$((mitopeak * 3 / 7))
                      mcs=$((mitopeak * 3 / 4))
                      mincov=$((mitopeak * 2 / 3))
             in=highpass_gc.fq.gz out=contigs100.fa prefilter=2 mincr=$lower maxcr=$upper mcs=$mcs mincov=$mincov k=100
             in=highpass.fq.gz ref=contigs100.fa outm=bbd005.fq.gz k=31 mm=f mkf=0.05
             in=bbd005.fq.gz out=contigs_bbd.fa prefilter=2 mincr=$((mitopeak * 3 / 8)) maxcr=$((upper * 2)) mcs=$mcs mincov=$mincov k=100 bm1=6
                      ln -s contigs_bbd.fa contigs.fa
                      Note that this is for 1050bp reads; for shorter ones you may need shorter kmers. And you may need to adjust the GC cutoff for chloroplasts as well. Also, fungi are simpler as they are haploid or diploid and only have mitochondria rather than chloroplasts also.
                      Last edited by Brian Bushnell; 08-15-2016, 10:30 AM.


                      • #12
                        I've uploaded a few more files to the Google drive:


                        Those are masked versions of the cat, dog, and mouse genomes. I also added two of bacteria:


                        Those are common contaminant microbes that we encounter in sequencing. For Eukaryotic genomes, I suggest mapping against fusedEPmasked2, in which the bacteria are masked for entropy and plastids (e.g. chloroplast) only. The other one (fusedERPBBmasked2) is intended for Prokaryotic assembly and is masked for conserved regions in bacteria, including ribosomes. If you want to use it for filtering out common laboratory/human/reagent-associated microbes, it's useful to ensure that your bacteria of interest is not on the list If you know what organism you are sequencing, you can use the tool "" to create a filtered version that file after removing all sequences from organisms in the same family, like this:

               in=fusedERPBBmasked2.fa.gz out=taxfiltered.fa.gz include=f ids=1234 tree=tree.taxtree.gz
                        ...where "1234" is the NCBI ID of the organism and tree.taxtree.gz is made from NCBI's taxdump like this:

               names.dmp nodes.dmp tree.taxtree.gz


                        • #13
                          Hi Brian,

                          First, thanks for your continuous effort.

                          Those are common contaminant microbes that we encounter in sequencing
                          Just for curiosity, How do you (or the center) compiled this list of common bacteria contaminants?

                          Thanks again
                          Last edited by vingomez; 12-08-2016, 07:03 AM. Reason: Deleted sentence -mistake in names of files


                          • #14
                            Hi Vicente,

                            Every project we sequence gets BLASTed to nt/nr and various RefSeq databases. All of the non-target hits are tracked. Then, a few months ago, someone manually went through and examined the non-target hits to make a list of the ones that commonly occurred.

                            Then, I took the list and expanded it slightly to include other microbes which have very high identity to the microbes on the list. For example, E.coli was detected as a common contaminant; but Shigella and Klebsiella are 100% identical to E.coli over large portions of the genome (they are basically strains of E.coli), meaning there is no way to ensure a Shigella library is uncontaminated, for example; and it's difficult to ensure that our BLAST hits to E.coli were not, in fact, Shigella or Klebsiella. So, the final list is 35 microbes plus Lambda phage which is a contaminant in one of our reagents (and shares sequence with E.coli so they are hard to distinguish), but many of them are just different strains (3 strains of Pseudomonas fluorescens, for example). They are generally either human-associated (like E.coli) or associated with laboratories or reagents (like, again, E.coli).


                            • #15
                              Hi Brian,

                              is it appropriate to use the bacterial contaminant files, fusedEPmasked2.fa.gz
                              and/or fusedERPBBmasked2.fa.gz, to filter soil metagenomics reads?

                              Thank you for your very useful tools and discussions!