Thanks for noting this. Some tools (like reformat) support the "extin" and "extout" flags, which let you override the format that would otherwise be inferred from the file extension, so you could do this:
reformat.sh in=file.dat out=file2.dat extin=.sam extout=.fq
BBMap doesn't support those flags right now, though; I'll add them. I also don't particularly recommend sam -> fastq conversion, because the read names change: in sam format, read 1 and read 2 must have identical names, whereas in fastq format they will typically end in "/1" and "/2" or similar to differentiate them. You can still do that conversion if you want.
I have not used Galaxy and don't know what's possible, but until I make this change, I would suggest one of these:
1) Use BBDuk for this filtering; its default output format is fastq and it's probably faster than BBMap anyway in this case. The syntax is very similar. On the command line, it would be something like "bbduk.sh in=reads.fq outu=clean.fq ref=ecoli.fasta".
2) Tell BBMap "outu=stdout.fq" and pipe that to a file, if Galaxy supports pipes.
As for your question about pairing, the normal behavior in paired-mapping mode is:
"out=" will get everything.
"outm=" will get all pairs in which either of the reads mapped to the reference.
"outu=" will get all pairs in which neither read mapped to the reference.
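To make those three rules concrete, here is a minimal Python sketch of how a pair would be routed in paired-mapping mode. It is only a model of the behavior described above, not BBMap code; the function name and the boolean inputs are hypothetical.

```python
def route_bbmap_pair(read1_mapped, read2_mapped):
    """Return the set of output streams that receive a pair (illustrative model only)."""
    streams = {"out"}  # "out=" gets everything
    if read1_mapped or read2_mapped:
        streams.add("outm")  # either read mapped to the reference
    else:
        streams.add("outu")  # neither read mapped to the reference
    return streams


# Show the routing for every combination of mapped/unmapped reads.
for r1, r2 in [(True, True), (True, False), (False, True), (False, False)]:
    print(r1, r2, sorted(route_bbmap_pair(r1, r2)))
```

Note that the function takes the pair as a unit and returns one decision for both reads, which is why the two reads can never end up in different streams.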
For BBDuk, it's slightly different but essentially the same:
"out=" is the same as "outu=".
"outu", aka "out", will get all pairs in which neither read had a kmer match to the reference.
"outm" will get all pairs in which either read had a kmer match to the reference.
For BBDuk, this behavior can be changed with the "rieb" (removeIfEitherBad) flag. The flag's name assumes that the reference contains contaminants to be filtered out, so the default "rieb=true" means any pair in which either read matches the contaminant is removed.
So, for both tools, if the input data is paired, the output data will also be paired - pairs are always kept together in all streams.
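The BBDuk rules, including the removeIfEitherBad flag, can be modeled with a similar Python sketch. This is not BBDuk code and the function name is hypothetical. The behavior when the flag is false (both reads must match before the pair counts as a match) is an assumption on my part, since the text above only says the default can be changed.

```python
def route_bbduk_pair(read1_match, read2_match, remove_if_either_bad=True):
    """Return the single stream that receives both reads of a pair (illustrative model only)."""
    if remove_if_either_bad:
        # Default: a kmer match in either read marks the whole pair as matching.
        pair_matches = read1_match or read2_match
    else:
        # Assumption: with the flag off, both reads must match.
        pair_matches = read1_match and read2_match
    # "out" is the same as "outu" for BBDuk, so only two destinations exist.
    return "outm" if pair_matches else "outu"


# Compare the two flag settings for a pair where only read 1 matches.
print(route_bbduk_pair(True, False, remove_if_either_bad=True))
print(route_bbduk_pair(True, False, remove_if_either_bad=False))
```

Either way the decision is made once per pair, so both reads always travel to the same output stream.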