Seqanswers Leaderboard Ad

**mcauchy** · 03-09-2016, 03:34 PM

I'm not sure what the syntax would be in DOS. I've tried ./bbmap.sh, /bbmap.sh and bbmap.sh

Could you tell me what the command would be?

**GenoMax** · 03-09-2016, 03:51 PM

Originally posted by mcauchy View Post

I'm not sure what the syntax would be in DOS. I've tried ./bbmap.sh, /bbmap.sh and bbmap.sh

Could you tell me what the command would be?

This information is in the BBMap thread. Here is how (note: you need to put the right path to the "current" directory on your machine and the space between that and align2.BBMap):

Code:

c:\> java -Xmx3g -cp c:\path_to\current align2.BBMap in=reads.fq out=mapped.sam

**Brian Bushnell** · 03-11-2016, 07:25 PM

Hi mcauchy,

Were you able to resolve this? The shellscripts (anything.sh) do not work in Windows, so the full syntax is needed. If you are still having trouble, please tell me the location of bbmap.sh (which is probably something like C:\something\bbmap\current\bbmap.sh).

-Brian

**GenoMax** · 03-12-2016, 04:16 AM

@Brian: How much memory does repair.sh need?

**Brian Bushnell** · 03-12-2016, 08:22 AM

Originally posted by GenoMax View Post

@Brian: How much memory does repair.sh need?

That depends. If you have an interleaved file and the interleaving was broken because some reads were discarded, you can run it with the flag "fixinterleaving" and it only needs a trivial amount of memory (at any given time at most 2 reads need to be remembered).

For an arbitrarily disordered file or pair of files, in the worst case, it would store all reads in memory, so the amount of memory needed would be somewhat greater than the size of the uncompressed files.

But in the common case of a pair of files that are ordered correctly but some reads were deleted in each file without removing their mate, the amount of memory needed is proportional to the number of singleton reads.

**GenoMax** · 03-12-2016, 02:59 PM

@mcauchy: Consider @Brian's reply above when choosing a memory setting.

If you know that your files fall in the second category then you may want to re-do the trimming with a paired-end aware trimmer (so the reads don't get out of sync in first place) e.g. use bbduk.sh from BBMap. That option will not require a lot of RAM and would prove convenient.

**jazz710** · 03-14-2016, 06:49 AM

Kmer Counting

Hi Brian, tried to send you a private message, but I guess you're popular:

"Brian Bushnell has exceeded their stored private messages quota and cannot accept further messages until they clear some space."

Anyhow!

New odd question for you Brian: I'm doing some kmer counting with kmercountexact.sh. I'm running a kmer series on 100bp DNA-Seq reads from k=21 to k=55. Two things I noticed:

1) After k=31, the program only calculates kmer counts for even numbers. The output reads (for the example of k=55): "K was changed from 55 to 54", and it did that for every odd value of k from 33 to 55

2) The unique kmer rises as the value of k rises, but only to a point. In my two samples, it seems like after k=31 that as k increases, the number of unique kmers decreases. How can this be? Am I missing something conceptual about how kmer counts should be expected?

Best,
Bob

EDIT: Added kmer graph

**GenoMax** · 03-14-2016, 07:01 AM

@jazz710: kmercountexact.sh is only designed to accept a max of k=31. I am not sure what it is doing once you go past that point.

On a different note: @Brian now works at google and continues to support BBTools in spare time (don't worry, BBTools are going to remain supported/be developed, just at a slower pace than what we were used to last year). So look for an official answer from him late tonight.

Note: People can still contact Brian via his JGI/LBL email address. That email address remains valid and can be found in the help menu for any BBTools program. Same constraint about time applies to this method as well.

**jazz710** · 03-14-2016, 07:05 AM

Thanks GenoMax!

**Brian Bushnell** · 03-16-2016, 09:14 PM

Let me clarify this:

As GenoMax stated, KmerCountExact originally only supported K=1 to 31. It now supports unlimited kmer lengths, with the provision that K=32 to 62 must be multiples of 2, 63 to 93 must be multiples of 3, and generally (N-1)*31+1 to (N)*31 must be multiples of N, which makes reverse-complementing much faster.

It is expected that the kmer counts form an arc like that with a peak at some specific K. As K increases, the number of unique kmers increases in general as longer repeats are spanned. But as K approaches read length, the number of unique kmers will drop to zero because reads will contain fewer kmers. So there is some peak between the zeros on each end, and thus, the graph basically looks as expected.

What should never happen is the discontinuity between K=31 and K=32, which is clearly an artifact of the program. I will investigate that this weekend; in the meantime, could you tell me the exact command you used, and version of BBTools? It's possible that this is due to the way kmers are filtered to exclude error kmers, which can be set in the command line, but I'll find out.

Sorry about my inbox; I've cleared some space!

**jazz710** · 03-18-2016, 06:34 AM

They were just basic kmercountexact.sh commands:

kmercountexact.sh -k <21-55> in=<File1> in2=<File2> mincount=2 prefilter=2

The graph included in my post above is the output from the Unique kmers for each value of K. I should note that while both of my species showed the same 'jump' it wasn't at k=31 each time. I trashed most of that output because I went another way with that part of the analysis, but if it would help, I could always re-run it quickly to get you the specific output.

**jazz710** · 03-24-2016, 07:38 AM

Cross Species Mapping

Hello Geno and Brian,

I have seen BBMap referenced on a handful of forums as a good option for mapping short DNA-seq reads onto a closely related species. Bowtie et al. can't seem to handle very many SNPs.

Looking at the BBMap documentation, I'm not exactly sure what parameters I should tweak to loosen the mapping conditions. I was hoping there would simply be an hdist/edist option that I could set to 4 or 5 or so but I don't see that directly stated.

Are there a set of parameters you would recommend tweaking to get optimal mapping across species (given the obvious biological limitations)?

Best,
Bob

EDIT: Would it be editfilter=4 or 5? That would allow for 4 or 5 SNPs or indels, correct?

**Brian Bushnell** · 03-25-2016, 10:47 AM

Hi Bob,

BBMap does not have an hdist/edist option because it evaluates alignments using a scoring system based on evolutionary probability. For example, two adjacent deletions are much more likely than two non-adjacent deletions. So these are considered as "events", and BBMap makes alignments optimizing the most probable layout of events that would lead to the resulting read. For example, a 100bp read containing a 100bp deletion event could be aligned with 70 matches, and 30 bases in which 1/4 of them are matches and 3/4 of them are mismatches. Or, it could be aligned as 70 matches and 30 bases clipped (which is typical). Or, it could be aligned as 70 matches, a 100bp deletion, and 30 matches. Which is what BBMap does by default.

Every alignment in BBMap has a score, and scores below the minimum are discarded. You can specify that with, for example, "minid=70", which will discard alignments with identity below roughly 70% (it is "roughly" because BBMap discards things based on probability rather than percent identity, which is an inferior metric. It still offers filters like "idfilter" for situations in which percent identity is an important metric, but I don't recommend using them for real research; they are just for publications in which people need to state "I excluded any alignment with more than X of property Y". It also offers specific filters to discard reads with more than X SNPs, or Y insertions, or Z deletions, or whatever. But again, I do not think these are useful in your case because you want cross-species alignments, which are best modeled by evolutionary distance, which BBMap was designed for.

You can increase BBMap's sensitivity using various means.

For preprocessing:

1) Adapter-trim your reads prior to alignment, using BBDuk.
2) Quality-trim your reads to ~Q10, if they have low-quality tails, also with BBDuk.
3) Error-correct reads (if you have sufficient coverage, typically >10x) using Tadpole.

For mapping:

1) Set the "slow" flag, which greatly increases sensitivity: bbmap.sh in=x.fq out=y.fq slow or the "vslow" flag, which increases it even more.
2) Set sensitivity very high: "vslow minid=0 k=11 maxindel=200"

Each of those flags is a bit different. Maxindel is default 16000 but for inter-species mapping reducing it increases sensitivity. You can further reduce k down to, say, 9, but it greatly affects speed and values below 10 start to cause issues in repetitive areas. The default is 13. k is basically the seed length - no alignments will be considered unless the read shares k consecutive exact matches with the reference. For cross-species mapping, using "slow" or "vslow", and varying the parameters maxindel, k, and minid are the best ways of adjusting sensitivity.

**jazz710** · 03-30-2016, 05:32 AM

Thank you, as always for your thorough response! New question time:

I'm running a bbduk contamination removal and with normal settings, and the run takes ~30m-45m (it's a big dataset). However, when I set hdist=1 (rather than 0) now the job has been running for 14h and is still only to this point:

BBDuk version 35.82
Set threads to 40
Initial:
Memory: max=823202m, free=788842m, used=34360m

I know hdist affects the time of the run, but is that amount what you would expect?

Best and many thanks,

Bob

**GenoMax** · 03-30-2016, 05:47 AM

That sounds unusually long (especially if the original job took only 45 min). Are the trimmed output files still increasing in size? Are you using more than one thread for this job?

Topics	Statistics	Last Post
ASHG 2024 Highlights – Part Two by seqadmin Started by seqadmin, 11-08-2024, 11:09 AM	0 responses 35 views 0 likes	Last Post by seqadmin 11-08-2024, 11:09 AM
ASHG 2024 Highlights – Part One by seqadmin Started by seqadmin, 11-08-2024, 06:13 AM	0 responses 28 views 0 likes	Last Post by seqadmin 11-08-2024, 06:13 AM
Seq-Scope Expands Possibilities for High-Resolution Gene Expression Analysis by seqadmin Started by seqadmin, 11-01-2024, 06:09 AM	0 responses 32 views 0 likes	Last Post by seqadmin 11-01-2024, 06:09 AM
New Model Aims to Explain Polygenic Diseases by Connecting Genomic Mutations and Regulatory Networks by seqadmin Started by seqadmin, 10-30-2024, 05:31 AM	0 responses 23 views 0 likes	Last Post by seqadmin 10-30-2024, 05:31 AM

Seqanswers Leaderboard Ad

Announcement

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Latest Articles

ad_right_rmr

News