BBMap (aligner for DNA/RNAseq) is now open-source and available for download.

YeastGuy replied

02-07-2017, 08:13 AM
Hi Brian,

I have a discontinuous reference genome in a single fasta file (6000 coding sequences) and I would like to use bbmap to align my paired reads to the coding sequence only. I want to know how many reads align to each ORF.

Now I realised that the order of the reference genome affects the alignment because bbmap seems to ignore the ORF borderes in the reference.
Do you have a suggestion how to constrain the alignment within each ORF?

Thank you!
Leave a comment:
Mat29 replied

02-07-2017, 12:15 AM
Thank you for developing BBMap. Could you please advise how to preprocess the output VCF to make it compatible with VCFAnnotator, SNPEff. I recieve "No gene feature" and "Chromososme is missing" errors on annotation. Tried Pilon also VCFs and errors persist.
Leave a comment:
boulund replied

01-26-2017, 02:21 AM
Originally posted by Brian Bushnell View Post

That's correct. If the index is present on disk, it gets loaded at startup; no random-access is ever performed. So, the only difference is a faster startup when the index is already on disk.

Perfect, thank you!
Leave a comment:
Brian Bushnell replied

01-25-2017, 10:06 AM
Originally posted by GenoMax View Post

@Brian will confirm later but I think BBMap loads pre-made indexes on disk in memory if you provide path= option. Note: "nodisk" option builds indexes in memory from fasta files but am not sure if it can (or needs to) be used with path= option.

That's correct. If the index is present on disk, it gets loaded at startup; no random-access is ever performed. So, the only difference is a faster startup when the index is already on disk.
Leave a comment:
boulund replied

01-25-2017, 06:45 AM
Originally posted by GenoMax View Post

@Brian will confirm later but I think BBMap loads pre-made indexes on disk in memory if you provide path= option. Note: "nodisk" option builds indexes in memory from fasta files but am not sure if it can (or needs to) be used with path= option.

I think so too, it feels natural considering the memory usage profile of BBMap .
Leave a comment:
GenoMax replied

01-25-2017, 06:08 AM
Originally posted by boulund View Post

How does BBMap make use of an index on disk? I'm on a shared cluster system and I'm essentially wondering if BBMap performs a nice single pass of the index to read it into memory, or if it performs a lot of random access to the index on disk?

If it's the latter, I'll just copy it to node-local disks, so no worries. Just interested in how it works.

@Brian will confirm later but I think BBMap loads pre-made indexes on disk in memory if you provide path= option. Note: "nodisk" option builds indexes in memory from fasta files but am not sure if it can (or needs to) be used with path= option.
Leave a comment:
boulund replied

01-25-2017, 05:21 AM
How does BBMap make use of an index on disk? I'm on a shared cluster system and I'm essentially wondering if BBMap performs a nice single pass of the index to read it into memory, or if it performs a lot of random access to the index on disk?

If it's the latter, I'll just copy it to node-local disks, so no worries. Just interested in how it works.
Leave a comment:
JVGen replied

01-06-2017, 07:28 AM
Originally posted by Brian Bushnell View Post

How many files do you have after BBMap, and what are they named?

The file name is: "1-JL08-P1-A3 assembled to HXB2 Nested Amplified Region extraction". Within the file, there are thousands of reads with the naming convention that I shared in the previous post.

I doubt it - BBTools should be able to handle reads named like that.

For the default window=50 entropyk=5, reads must be at least 50bp long to be processed by the entropy filter (you can reduce that by making the window smaller). And entropy=0.01 will remove any sequence that is a singly mononucleotide, as long as it's at least 50bp long. Note that if there are some errors so that it is no longer a pure mononucleotide you'd need a higher value for entropy. Something like "AAAAAAAAAAGGGGGGGGGGGGGGGG" would also need a higher value (50 A's and 50 G's appears to need entropy=0.21). Don't set it too high, though, or you'll lose the low complexity parts of your genome.

I think I might try something like window = 15 and entropy 0.01. That should get rid of the mononucleotide strings. The HIV genome generally doesn't have anything like that, so it should work.

***Update. Brian, I contacted Geneious and they seem to be aware of the problem. They gave me a macro/workflow that extracts the reads from the BBMap'd contig file, and now they are feeding into Tadpole without a problem. Thanks for your help on this, you're getting all the gold stars!

Last edited by JVGen; 01-06-2017, 08:51 AM.
Leave a comment:
Brian Bushnell replied

01-05-2017, 02:42 PM
Originally posted by JVGen View Post

That's unfortunately about as much as the error messages says - that Tadpole cannot use a mixture of paired and unpaired reads.

How many files do you have after BBMap, and what are they named?

It might be the read-name format that is throwing it off? For instance, my reads are named in the following format after BBMap (In Geneious):

MN00123:91:000H22WH3:1:22104:14105:5672_1:N:0:1/2
MN00123:91:000H22WH3:1:22104:14105:5672_1:N:0:1/1

I doubt it - BBTools should be able to handle reads named like that.

I didn't know about that feature with BBDuk! Will entropy of 0.01 remove any string of a mononucleotide? Or, how many must be present in a string to flag it? Is this with a window size of 50 and kmer size of 5?

For the default window=50 entropyk=5, reads must be at least 50bp long to be processed by the entropy filter (you can reduce that by making the window smaller). And entropy=0.01 will remove any sequence that is a singly mononucleotide, as long as it's at least 50bp long. Note that if there are some errors so that it is no longer a pure mononucleotide you'd need a higher value for entropy. Something like "AAAAAAAAAAGGGGGGGGGGGGGGGG" would also need a higher value (50 A's and 50 G's appears to need entropy=0.21). Don't set it too high, though, or you'll lose the low complexity parts of your genome.
Leave a comment:
JVGen replied

01-05-2017, 02:13 PM
Originally posted by Brian Bushnell View Post

I'm not really sure what the problem is in this case. Tadpole really doesn't care what the input reads look like, whether they are paired, or what format they are in. Can you post the complete error message?

What you are planning to do should work fine. On the command line, it would be something like this:

Code:

bbmap.sh ref=ref.fa in=reads.fq outm=mapped.fq outu=junk.fq tadpole.sh in=mapped.fq out=contigs.fa k=62

I'm not sure how that would translate to Geneious. Note, by the way, that you can remove homopolymer reads with BBDuk like this:

Code:

bbduk.sh in=reads.fq out=filtered.fq entropy=0.01

That will remove polymonomer reads only (AAAA...). At entropy=0.19 it will also remove polydimer reads (ACACAC...). At entropy=0.29 it will remove polytrimers (ACTACT...), and at 0.34 it will remove polytetramers. E.coli reads seem to have an average entropy of around 0.985 by my calculation method.

Hi Brian,

That's unfortunately about as much as the error messages says - that Tadpole cannot use a mixture of paired and unpaired reads. It might be the read-name format that is throwing it off? For instance, my reads are named in the following format after BBMap (In Geneious):

MN00123:91:000H22WH3:1:22104:14105:5672_1:N:0:1/2
MN00123:91:000H22WH3:1:22104:14105:5672_1:N:0:1/1

I didn't know about that feature with BBDuk! Will entropy of 0.01 remove any string of a mononucleotide? Or, how many must be present in a string to flag it? Is this with a window size of 50 and kmer size of 5?

Thanks,
Jake
Leave a comment:
Brian Bushnell replied

01-05-2017, 01:57 PM
I'm not really sure what the problem is in this case. Tadpole really doesn't care what the input reads look like, whether they are paired, or what format they are in. Can you post the complete error message?

What you are planning to do should work fine. On the command line, it would be something like this:

Code:

bbmap.sh ref=ref.fa in=reads.fq outm=mapped.fq outu=junk.fq tadpole.sh in=mapped.fq out=contigs.fa k=62

I'm not sure how that would translate to Geneious. Note, by the way, that you can remove homopolymer reads with BBDuk like this:

Code:

bbduk.sh in=reads.fq out=filtered.fq entropy=0.01

That will remove polymonomer reads only (AAAA...). At entropy=0.19 it will also remove polydimer reads (ACACAC...). At entropy=0.29 it will remove polytrimers (ACTACT...), and at 0.34 it will remove polytetramers. E.coli reads seem to have an average entropy of around 0.985 by my calculation method.
Leave a comment:
JVGen replied

01-05-2017, 01:05 PM
Save Reads for Tadpole

Hi Brian, Geno,

I'm wondering if it's possible to save the reads that BBMap uses in its assembly, for subsequent use with Tadpole?

I have some junk reads that I think might be interfering with de novo assembly. I have ample coverage, and won't miss the crappy reads. I thought a clever way of eliminating them would be to restrict the reads used during de novo assembly to those that had previously mapped to the reference with BBMap. I'm getting an error at the moment when I try to use Tadpole with the reads used by BBMap that says it cannot take a mixture of paired and unpaired reads as input (working in Geneious). Do you think what I'm trying to do is possible?

I've already quality trimmed to Q20, and have confirmed that the junk reads are indeed high quality (>Q35). They are internal, repetitive strings of a single nucleotide. Not sure where they're coming from, but such a nuisance.

Thanks for any help.

P.S. Any idea where strings of a single nucleotide might be originating? I'm using 2-color chemistry on a MiniSeq, but the strings can be any nucleotide, not just G. Samples are prep'd with Nextera. My samples are PCR amplicons, however, if I Sanger sequence, I don't get these strings PolyN's
Leave a comment:
Brian Bushnell replied

12-17-2016, 05:21 PM
It would be useful to know what the ambiguous reads are hitting. It's likely that it's something with many copies, such as ribosomal elements. Ribo "contamination" is common in libraries even when some kind of ribo-depletion is used. You can catch the ambiguous reads with a second mapping pass using "ambig=toss outu=unmapped.fq" if you start with just the mapped reads. Then, you can BLAST them, or map them again and look at an annotated version of the reference to see what they're hitting. But it's likely ribosomal.
Leave a comment:
chiayi replied

12-17-2016, 08:41 AM
For maaping diploid maize data, I concatenated all the chromosomes for a given species (e.g. 10 chr in maize) and used it as the mapping reference. I then mapped the BBDuk-trimmed reads to this reference (genome, not transcriptome) using the default setting of BBMap.
For the maize data generated from B73 (B73 is the sequeced ref genome), the ambig rate rages from 3% to 15%. I saw the variation occurred between biologica rep, not within a pair. For example, the ambig rates of Read 1 and Read 2 of a replicate are close to each other, but the ambig rates of different biologica replica could be various, (12%, 3%, 3%; 11%, 11%, 3%). The variatoin between replica almost made me feel like it was due to the technical issue of the libraries and/or biological nature of maize. I could try to map it to the transcriptome and see if there's any improvement. Other than that I wasn't sure how to trace down, and/or if this should be a concern.
For a different set of maize data with mixed background, as for now I still used the same B73 base reference genome for mapping. The ambig ranges went up to 6% - 46%. The increase was expected given the SNPs/indels present between different cultivars. I also observed variation between biological replicates similar to described above.
On a related note, I also have diploid Arabidopsis data which I also mapped to its own genome (TAIR9 genome fasta). I added a -maxindel=2000 to accomodate the compact size of the genome. The ambig rate on average was lower (mostly below 4%). I did not see that radical variation between replicates either. This is consistent with my guess about the ambig rates in maize was partly due to the nature of maize.
Please let me know if any furtehr details would be helpful for you to diagnose. Thank you as always for your input.
Leave a comment:
Brian Bushnell replied

12-16-2016, 08:44 PM
Normally... I see ambig rates of 3% or lower in diploids like human, and 0.1% or lower in haploid bacteria. I have little experience in mapping RNA to plant genomes, aside from feedback from co-workers. And they mainly map to plant transcriptomes rather than genomes.

Can you describe your mapping protocol in more detail? For example, are you concatenating all genomes and mapping to all simultaneously, or are you experiencing a high claimed ambig rate when mapping to a single reference alone?
Leave a comment:

Previous 1 12 19 20 21 22 23 24 25 32 34 template Next

Essential Discoveries and Tools in Epitranscriptomics

by seqadmin

The field of epigenetics has traditionally concentrated more on DNA and how changes like methylation and phosphorylation of histones impact gene expression and regulation. However, our increased understanding of RNA modifications and their importance in cellular processes has led to a rise in epitranscriptomics research. “Epitranscriptomics brings together the concepts of epigenetics and gene expression,” explained Adrien Leger, PhD, Principal Research Scientist...
- Channel: Articles
04-22-2024, 07:01 AM
Current Approaches to Protein Sequencing

by seqadmin

Proteins are often described as the workhorses of the cell, and identifying their sequences is key to understanding their role in biological processes and disease. Currently, the most common technique used to determine protein sequences is mass spectrometry. While still a valuable tool, mass spectrometry faces several limitations and requires a highly experienced scientist familiar with the equipment to operate it. Additionally, other proteomic methods, like affinity assays, are constrained...
- Channel: Articles
04-04-2024, 04:25 PM

Topics	Statistics	Last Post
Expanding the Horizons of Cellular Research with the Single Cell Atlas by seqadmin Started by seqadmin, 04-25-2024, 11:49 AM	0 responses 20 views 0 likes	Last Post by seqadmin 04-25-2024, 11:49 AM
Genetic Variants and Diabetes Risk in Childhood Cancer Survivors by seqadmin Started by seqadmin, 04-24-2024, 08:47 AM	0 responses 20 views 0 likes	Last Post by seqadmin 04-24-2024, 08:47 AM
Cancer Metastasis: A Deep Dive into Cellular Plasticity by seqadmin Started by seqadmin, 04-11-2024, 12:08 PM	0 responses 62 views 0 likes	Last Post by seqadmin 04-11-2024, 12:08 PM
Proteogenomic Profiles Offer New Clues in Prostate Cancer by seqadmin Started by seqadmin, 04-10-2024, 10:19 PM	0 responses 61 views 0 likes	Last Post by seqadmin 04-10-2024, 10:19 PM

Seqanswers Leaderboard Ad

Announcement

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment:

Latest Articles

ad_right_rmr

News