BBSplit is a tool that bins reads by mapping to multiple references simultaneously, using BBMap. The reads go to the bin of the reference they map to best. There are also disambiguation options, such that reads that map to multiple references can be binned with all of them, none of them, one of them, or put in a special "ambiguous" file for each of them. Paired reads will always be kept together.
For example, if you had a library of something that was contaminated with E. coli and Salmonella, you could do this:
bbsplit.sh in=reads.fq ref=ecoli.fa,salmonella.fa basename=out_%.fq outu=clean.fq int=t
This will produce 3 output files:
out_ecoli.fq (E. coli reads)
out_salmonella.fq (Salmonella reads)
clean.fq (unmapped reads)
In this case, "int=t" means that the input file is paired and interleaved. For single-end reads you would leave that out. For paired reads in 2 files, you would do this:
bbsplit.sh in1=reads1.fq in2=reads2.fq ref=ecoli.fa,salmonella.fa basename=out_%.fq outu1=clean1.fq outu2=clean2.fq
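The output names listed earlier follow from the basename pattern: the % is replaced by the name of each reference. As a rough sketch, the helper below predicts those names before a run. The function name expected_outputs is hypothetical, and it assumes the reference name is the file name with its directory and final extension stripped, which matches the example in this post but is not checked against BBSplit itself.

```shell
# Hypothetical helper: list the per-reference files implied by a basename pattern.
# Assumption: % becomes the reference file name minus directory and extension.
expected_outputs() (
    pattern=$1
    set -f          # do not glob-expand the reference list
    IFS=,           # split the ref= value on commas (subshell keeps this local)
    for ref in $2; do
        name=${ref##*/}     # drop any leading directory
        name=${name%.*}     # drop the final extension, e.g. .fa
        printf '%s\n' "$pattern" | sed "s/%/$name/g"
    done
)

expected_outputs 'out_%.fq' 'ecoli.fa,salmonella.fa'
```

This only manipulates strings, so it is safe to run without BBSplit installed. Reference names containing / or & would need escaping before being passed to sed.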
You can get more information about parameters by running bbsplit.sh with no arguments, or reading /bbmap/docs/readme.txt. But I will mention here the inter-reference ambiguity modes, which decide what to do with reads that map to multiple references and pairs where each read maps to a different reference:
ambig2=best
Default. Ambiguous reads go to the first best site.
ambig2=toss
Ambiguous reads are considered unmapped.
ambig2=all
Write a copy to the output for each reference to which it maps.
ambig2=split
Write a copy to the AMBIGUOUS_ output file for each reference to which it maps.
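To compare the four modes on the same data, one option is to run BBSplit once per mode with separate output names. The sketch below only prints the commands (a dry run) so they can be inspected first. The function name bbsplit_cmd and the mode-prefixed output names are assumptions made for illustration; the flags themselves are the ones from the example in this post.

```shell
# Hypothetical dry-run helper: print a BBSplit command for one ambig2 mode.
bbsplit_cmd() {
    case $1 in
        best|toss|all|split) ;;   # the four modes described above
        *) echo "unknown ambig2 mode: $1" >&2; return 1 ;;
    esac
    # Output names are prefixed with the mode so the runs do not overwrite each other.
    echo "bbsplit.sh in=reads.fq ref=ecoli.fa,salmonella.fa basename=${1}_%.fq outu=${1}_clean.fq int=t ambig2=$1"
}

for mode in best toss all split; do
    bbsplit_cmd "$mode"
done
```

Piping the printed lines to a shell would actually run them, assuming bbsplit.sh is on the PATH.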
If your OS cannot process bash shell scripts, replace "bbsplit.sh" with "java -Xmx29g -cp /path/to/current align2.BBSplitter", where /path/to/current is the location of the 'current' directory (a subdirectory of bbmap), and -Xmx29g specifies the amount of memory to use (so this would be the command line for a 32GB computer). This should be set to about 85% of physical memory.
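The 85% rule of thumb can be worked out with integer arithmetic. This is a minimal sketch with a hypothetical function name, xmx_flag, that takes physical memory in whole gigabytes and rounds down. Note that for a 32GB machine it gives 27g, a little below the 29g used in the command above, so treat the exact figure as adjustable.

```shell
# Hypothetical helper: turn physical memory (whole GB) into a -Xmx flag at about 85%.
xmx_flag() {
    # Integer arithmetic, so the result is rounded down to a whole gigabyte.
    echo "-Xmx$(( $1 * 85 / 100 ))g"
}

# Dry run: print the direct Java invocation for a 32GB machine.
echo "java $(xmx_flag 32) -cp /path/to/current align2.BBSplitter in=reads.fq ref=ecoli.fa,salmonella.fa basename=out_%.fq outu=clean.fq int=t"
```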
BBSplit is extremely fast and highly sensitive, using BBMap for the mapping. So, all flags and features supported by BBMap can be used with BBSplit (aside from sam output).
BBSplit is available here:
P.S. Some people have asked why BBSplit has a lower alignment rate than BBMap. That is because it has a lower default sensitivity, as the original intent was to bin reads using known assemblies. The sensitivity can be raised to be equivalent to BBMap with these flags: "minratio=0.56 minhits=1 maxindel=16000"
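Put together with the earlier example, a higher-sensitivity run might look like the following. This is only a sketch: it appends the three flags quoted above to the interleaved example command and prints the result rather than running it; the variable names are illustrative.

```shell
# Flags quoted in the P.S. for raising sensitivity to BBMap's level.
sensitive_flags="minratio=0.56 minhits=1 maxindel=16000"

# The interleaved example from earlier in the post, with the extra flags appended.
cmd="bbsplit.sh in=reads.fq ref=ecoli.fa,salmonella.fa basename=out_%.fq outu=clean.fq int=t $sensitive_flags"

# Dry run: print the command so it can be checked before use.
echo "$cmd"
```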