Seqanswers Leaderboard Ad

**Brian Bushnell** · 03-12-2015, 02:32 PM

Originally posted by vingomez View Post

(1) Is there a flag/command to generate paired reads for mapped reads (e.g. out_ecoli_r1.fq out_ecoli_r2.fq, out_salmonella_r1 out_salmonella_r2.fq.fq)?

Hi Vicente,

I released a new version of BBTools, 34.64. BBSplit now has the ability to output paired reads in dual files using the # symbol. For example:

bbsplit.sh ref=x.fa,y.fa in1=read1.fq in2=read2.fq basename=o%_#.fq

will produce ox_1.fq, ox_2.fq, oy_1.fq, and oy_2.fq

You can use the # symbol for input also, like "in=read#.fq", and it will get expanded into 1 and 2.

**manuelkleiner** · 03-13-2015, 12:02 PM

Problem with argument length

Dear Brian,
I have run into a problem when trying to create an index for bbsplit using a large number of files (~20,000). In the example below the variable $files contains 20,000 comma delimited, file names.

Code:

kleiner@breck-Precision-T7610[BBMap] bbsplit.sh -Xmx100g threads=20 ref=$files                                                                                                                         [ 2:47PM]
zsh: argument list too long: bbsplit.sh

What I have found out so far is that the linux kernel has a upper limit for the argument length in the shell, which causes this problem. I have tried various workarounds e.g. trying to pipe the file names in from STDIN or creating a bash script. But nothing has worked.
Any suggestions how to get bbsplit to indext such a large number of files before running the splitting?
Maybe one solution would be if one could hand the ref= argument a folder containing all fasta files and then bbsplit writes an index for each .fasta file in the folder!?

Thanks a lot in advance for your help,
Manuel

**Brian Bushnell** · 03-13-2015, 01:00 PM

Hi Manuel,

I will add the possibility to point "ref=" to a directory in the next release; that's a pretty good idea.

-Brian

**GenoMax** · 03-13-2015, 01:35 PM

Brian: Concatenating some files to make the number smaller should make it work now, correct?

**Brian Bushnell** · 03-13-2015, 02:10 PM

Well... the easiest way to use BBSplit is to have it split input into one output file per reference file. Once you concatenate them, you lose that ability. It's technically possible to specify sets of named sequences within a file but that way is too complicated. Using a tiered approach is possible (though a pain):

for references a, b, c, d, e, f:

abc=cat a b c
def=cat d e f

bbsplit ref=abc,def

then re-split the abc pile between a, b, and c, and the def pile between d, e, and f. That will basically cut the command line lengths (and open file handles) by a factor of the square root of the number of references.

But, I will add this feature quickly since it's hard to do otherwise.

Note that if all you care about are statistics of which reads mapped to which scaffolds, then if all the ref sequences have unique names, you can just run BBMap on the concatenated file with the "scafstats" output flag. But for actually splitting the files easily, the references currently need to be in individual files.

**GenoMax** · 03-13-2015, 02:44 PM

@Manuel: Are these 20000 sequences related (contigs)?

@Brian: Thanks for the additional explanation.

**Brian Bushnell** · 03-13-2015, 04:41 PM

Manuel,

I released a new version of BBTools, 34.65, in which you can now specify a directory for the "ref=" argument. If anything in the list is a directory, it will use all fasta files in that directory. They need a fasta extension, like .fa or .fasta, but can be compressed with an additional .gz after that. Have a nice weekend!

**manuelkleiner** · 03-16-2015, 02:09 PM

Wow that was fast! Thank you Brian!
I will test it tomorrow and let you know how it works out for me.

@GenoMax: the 20,000 Mulit-fasta files are the product of a rather strict binning of multiple metagenomes. So some files contain multiple sequences, some only contain one. Concatenating would thus not be an option and the scaffold/chromosome statistics function would also not work to get the statistics per file.

**luc** · 09-08-2015, 11:36 PM

Hi Brian,

when using BBsplit (also latest version) with long reads (Moleculo data) the program shreds these long reads into 500 bp pieces. It seems to me that the distributing of the reads according to reference is not working in this case. For me about 0.5% of the genomic reads are mapping to mitochondria according to the BBsplit report, but the out-files contain about 99% of the reads. Perhaps I am misinterpreting anything?

Thanks!

**Brian Bushnell** · 09-08-2015, 11:54 PM

Hi Luc,

It's quite possible that you are not misinterpreting anything, but the results are incorrect. Would you mind giving the exact parameters and version (I assume 35.43) you used? I will probably be able to replicate the bug without your specific data.

Thanks,
Brian

P.S. By the way, "mapPacBio.sh" will use approximately the same algorithm but only shred them into 6000bp shreds.

**luc** · 09-09-2015, 11:45 PM

Hi Brian,

yes I am using version 35.43.

I ran for example:
bbsplit.sh in=moleculo.fasta.gz ref=mitochondrion1.fasta,mitochondrion2.fasta build=1 maxindel=100 minratio=0.8 refstats=mitorefstats.txt out=MolToMitoreads.fasta &

The results look like this:
------------------ Results ------------------

Genome: 1
Key Length: 13
Max Indel: 100
Minimum Score Ratio: 0.8
Mapping Mode: normal
Reads Used: 887008 (419139365 bases)

Mapping: 18.567 seconds.
Reads/sec: 47772.92
kBases/sec: 22574.22

Read 1 data: pct reads num reads pct bases num bases

mapped: 0.3272% 2902 0.3307% 1386018
unambiguous: 0.2006% 1779 0.2065% 865395
ambiguous: 0.1266% 1123 0.1242% 520623
low-Q discards: 0.0000% 0 0.0000% 0

perfect best site: 0.0046% 41 0.0024% 9939
semiperfect site: 0.0046% 41 0.0024% 9939

Match Rate: NA NA 94.2618% 1332227
Error Rate: 43.3616% 2861 5.7382% 81100
Sub Rate: 43.2707% 2855 2.4932% 35237
Del Rate: 29.7818% 1965 1.9322% 27309
Ins Rate: 28.8118% 1901 1.3128% 18554
N Rate: 0.0000% 0 0.0000% 0

The problems is that the out-file contains almost all of the Moleculo data - in 500bp pieces - instead of the expected 0.33 %.

**Brian Bushnell** · 09-10-2015, 12:18 AM

Hi Luc!

The syntax for BBSplit is different from BBMap. It's like this:

bbsplit.sh in=moleculo.fasta.gz ref=mitochondrion1.fasta,mitochondrion2.fasta basename=out_%.fa outu=unmapped.fa

That will give you 3 output files, "out_mitochondrion1.fa", "out_mitochondrion2.fa", and "unmapped.fa".

Unfortunately, BBSplit by default breaks reads longer than 500bp into 500bp pieces. It's technically possible to change this to 6kbp pieces with the use of a couple extra parameters (specifically "msa=MultiStateAligner9PacBio fastareadlen=6000"), but I suggest you try Seal instead. Seal does not do alignment, just kmer-matching, but I tend to use it in this kind of situation as it is much faster and can handle reads of unlimited length without breaking them up. The default is a hamming distance of zero and considering something a match if even a single kmer is shared (and by default k=31).

Syntax:

seal.sh in=moleculo.fasta.gz ref=mitochondrion1.fasta,mitochondrion2.fasta pattern=out_%.fa outu=unmatched.fa mkf=0.05 hdist=1

The last 2 parameters are optional, but specify a hamming distance of 1 (which is fine in the case of mitochondria, which are tiny) and require a read to share 5% of its kmers with a reference sequence to be considered matching. You can increase that to a substantially higher fraction (like 0.5), and eliminate the hamming distance, if you expect a very close match between the reads and the reference.

**RubenD** · 03-05-2017, 03:51 PM

how to change default bbsplit parameters?

Hi all,

I would like to use bbsplit to separate my fastq files.

First I generated an index and now I am running a for loop for all my fastq's.

However when I run the command it seems that the default parameter values are not changed, see for example parameters minhits and maxindel in bold below. Which parameter will it take?

Am I doing something wrong here?

Merged reference file /gcdata/gcproj/Ruben/GENOMES/BBmap_indices/ref/genome/2/merged_ref_208959345802405.fa.gz already exists; skipping merge.
Ref merge time: 0.028 seconds.
Executing align2.BBMap [ow=t, fastareadlen=500, minhits=1, minratio=0.56, maxindel=20, qtrim=rl, untrim=t, trimq=6, minhits=2, maxindel=200000, t=10, ambiguous=best, ambiguous2=best, build=2, in=../Fastq_files/20161223_CDX10_CC3737_S7_R1_001.fastq.gz, out_human=20161223_CDX10_CC3737_S7_R1_001_human.fastq, out_mouse=20161223_CDX10_CC3737_S7_R1_001_mouse.fastq, scafstats=scafstats_20161223_CDX10_CC3737_S7_R1_001.txt, refstats=refstats_20161223_CDX10_CC3737_S7_R1_001.txt, ref=/gcdata/gcproj/Ruben/GENOMES/BBmap_indices/ref/genome/2/merged_ref_208959345802405.fa.gz]

BBMap version 37.00
Set MINIMUM_ALIGNMENT_SCORE_RATIO to 0.560
Set threads to 10
Scaffold statistics will be written to scafstats_20161223_CDX10_CC3737_S7_R1_001.txt
Reference set statistics will be written to refstats_20161223_CDX10_CC3737_S7_R1_001.txt
Retaining first best site only for ambiguous mappings.
NOTE: Ignoring reference file because it already appears to have been processed.
NOTE: If you wish to regenerate the index, please manually delete /gcdata/gcproj/Ruben/GENOMES/BBmap_indices/ref/genome/2/summary.txt
Set genome to 2

**GenoMax** · 03-05-2017, 05:34 PM

@RubenD: Please provide the exact BBSplit command you are using. You need to provide more than one reference for BBSplit to work right.

**RubenD** · 03-05-2017, 05:47 PM

@GenoMax: it seems to be running correctly and produces sensible results, however I'm not sure whether it uses the default or my custom parameter settings.

I have first generated an index for mouse and human:

bbsplit.sh build=2 maxindel=200000 minhits=2 ambiguous=best ambiguous2=best \
ref_human=/gcdata/gcproj/Ruben/GENOMES/Human/Sequences/Ensembl/GRCh37/Release75/Genome/Homo_sapiens.GRCh37.75.dna.primary_assembly.fa \
ref_mouse=/gcdata/gcproj/Ruben/GENOMES/Mouse/Sequences/NCBIM37.67/Mus_musculus.NCBIM37.67.dna.toplevel.fa

Then I split my fastq files, again this works but I'm not sure whether it actually used the parameters that I provide...

bbsplit.sh minhits=2 maxindel=200000 t=10 ambiguous=best ambiguous2=best path=/gcdata/gcproj/Ruben/GENOMES/BBmap_indices/ build=2 in=sample.fastq.gz out_human=sample_human.fastq out_mouse=sample_mouse.fastq scafstats=scafstats.txt refstats=refstats.txt

Thanks for your help,
Ruben

Topics	Statistics	Last Post
Expanding the Horizons of Cellular Research with the Single Cell Atlas by seqadmin Started by seqadmin, Today, 11:49 AM	0 responses 11 views 0 likes	Last Post by seqadmin Today, 11:49 AM
Genetic Variants and Diabetes Risk in Childhood Cancer Survivors by seqadmin Started by seqadmin, Yesterday, 08:47 AM	0 responses 16 views 0 likes	Last Post by seqadmin Yesterday, 08:47 AM
Cancer Metastasis: A Deep Dive into Cellular Plasticity by seqadmin Started by seqadmin, 04-11-2024, 12:08 PM	0 responses 61 views 0 likes	Last Post by seqadmin 04-11-2024, 12:08 PM
Proteogenomic Profiles Offer New Clues in Prostate Cancer by seqadmin Started by seqadmin, 04-10-2024, 10:19 PM	0 responses 60 views 0 likes	Last Post by seqadmin 04-10-2024, 10:19 PM

Seqanswers Leaderboard Ad

Announcement

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Latest Articles

ad_right_rmr

News