BBMap (aligner for DNA/RNAseq) is now open-source and available for download.

chiayi replied

12-16-2016, 08:22 PM
BBMap is very tolerant of SNPs and indels so generally you don't need to do any kind of correction, but aligning to a SNP-corrected reference (assuming the SNPs are homozygous, or at least a majority) will be more accurate than aligning to the base reference.

I share your thoughts so I will at least try a few background to see how it goes. On the other hand, I'm not sure how to compare the accuracy between the results generated from base reference vs SNP corrected reference. Do you have any suggestion?

I'm not really sure where the high ambig rate is coming from. Is this WGS, or are you doing some kind of enrichment, RNA-seq, etc? And are you mapping to the genome or transcriptome?

The reads were generated by NEBNext Ultra RNA library prep kit for Illumina (no enrichment) and sequenced on NextSeq. I mapped the reads to the latest genome annotation (B73 RefGen_v4 reference genome). I should also mention that these are mixed background (hybrid of B73, reference genome, and other cultivar). For B73 alone, the ambig rate was lower at about ~10%.
Is there a 'standard' for ambiguous rate when you look at the stats from BBMap?
Leave a comment:
Brian Bushnell replied

12-16-2016, 06:30 PM
Maize is the kind of organism where de-novo assembly is extremely difficult. I recommend avoiding that; it will make things much more complicated. I don't think there's any reason to do that, either, unless you find that structural variations are causing problems.

BBMap is very tolerant of SNPs and indels so generally you don't need to do any kind of correction, but aligning to a SNP-corrected reference (assuming the SNPs are homozygous, or at least a majority) will be more accurate than aligning to the base reference.

I'm not really sure where the high ambig rate is coming from. Is this WGS, or are you doing some kind of enrichment, RNA-seq, etc? And are you mapping to the genome or transcriptome?
Leave a comment:
chiayi replied

12-14-2016, 10:11 PM
Hi Brian and other members,

I'm trying to map the Illumina reads generated from 19 inbred and hybrid mazie cultivars under two conditions for downstream differential expression analysis. Considering the SNPs and indels in the mixed genetic background, my original plan was to generate a 'corrected refernece' using tools like Quiver or Pilon. Then I came across to BBMap and was happy to find its capability to handle SNPs and indels. I used the default setting and the mapped rate so far was over 97%. Meanwhile, I did notice a huge variaion of ambiguous rate, ranging from 6% to >50%. Therefore, I was wondering if it is possible to evaluate which method (directly map to a single ref vs map to 19 SNP corrected ref) would be more accurate for the DE analysis.

The third option would be to re-assemble de-novo assembly for each genetic background. However, I wasn's sure if it's possible to come up with a consensus contig conrrespondance (contigX_background1, contigX_background2..contigX_background19) so that I could monitor each contig/gene of interest in multiple backgrounds.

Any input/thoughts you may have on this would be much appreciated. Thank you in advance.

Last edited by chiayi; 12-15-2016, 05:46 AM.
Leave a comment:
skbrimer replied

12-14-2016, 10:55 AM
Thanks for the input guys! I will start playing with the updated BBmap as well!
Leave a comment:
Brian Bushnell replied

12-14-2016, 10:09 AM
Originally posted by skbrimer View Post

Thank you Brain!

Is there a variant calling pipeline that works best with BBmap?

Funny you should ask! I just finished developing a variant-caller, because I was unhappy with the speed, usability, and some other aspects of Samtools mpileup and GATK (I've never used Freebayes, though). Also, it has a default ploidy of 1 which better reflects projects at JGI, though the ploidy can be changed to an arbitrary number.

I just uploaded a new version of BBTools (36.71) with the variant caller (older versions do not support vcf output). You can run it like this:

Code:

callvariants.sh in=mapped.sam vcf=vars.vcf ref=ref.fasta ploidy=1

CallVariants works on sam or bam, sorted or unsorted, and it's really fast. Also, it does not care what format the cigar strings are in or whether you trimmed sequence names after the first whitespace. I'd be eager to hear how it compares to FreeBayes, if you'd like to try it out. So far, other people at JGI have indicated that it does a much better job of identifying low-depth variants in high-ploidy organisms than Samtools or GATK (if you adjust the ploidy and minallelefraction flags).

That said, no, I don't have a recommended variant-calling workflow for BBMap, because I have not done rigorous benchmarking of all the possibilities yet. Oh, and to modify my previous post, for variant calling I suggest you also add the "mdtag" flag, which some variant-callers like to have. So, for Freebayes:

Code:

bbmap.sh in=reads.fq ref=ref.fa out=mapped.sam mdtag trd sam=1.3

Originally posted by GenoMax View Post

BBMap suite now has a variant caller (callvariants.sh) :-)

Indeed But if you use it, I highly recommend downloading BBMap 36.71, as prior versions lack proper VCF output.
Last edited by Brian Bushnell; 12-14-2016, 10:12 AM.
Leave a comment:
GenoMax replied

12-14-2016, 10:00 AM
If you are going to re-do mapping then clean up the fasta ID's (and regenerate BBMap index) in your reference genome, if there are spaces in the words there.

BBMap suite now has a variant caller (callvariants.sh) :-)
Leave a comment:
skbrimer replied

12-14-2016, 09:37 AM
Thank you Brain!

Is there a variant calling pipeline that works best with BBmap?
Leave a comment:
Brian Bushnell replied

12-14-2016, 09:31 AM
Unfortunately, it sounds like Freebayes is trimming the names of the sequences after the first whitespace when loading them from the fasta file, but it's not doing that when looking at the sam file. To fix that problem, unfortunately you'd need to rerun the mapping, adding the "trd" flag (trim read description). Additionally, Freebayes is not compliant with the current Sam specification and does not understand cigar strings with "=" and "X" symbols, so you need to output in legacy Sam format with M symbols. Therefore, for Freebayes, you need to add two extra flags when mapping:

Code:

trd sam=1.3
Leave a comment:
skbrimer replied

12-14-2016, 06:21 AM
Hi Brian,

I like to use freebayes for the vcf creation because all of our stuff is haploid, however I do not know how to work with the ref file that BBMap makes during a mapping. Freebayes accepts a fasta for the ref. The ref folder contains a compressed chromosome file which freebayes doesn't like and when I use the fasta that I feed to BBmap it doesn't like the header mismatch.

Can you tell me how to overcome this issue?
Leave a comment:
Brian Bushnell replied

12-11-2016, 09:36 AM
Originally posted by moistplus

Can you confirm that pileup doesn't need sam files to be sorted ?

Yes, that's correct.
Leave a comment:
Brian Bushnell replied

11-29-2016, 10:22 AM
Pileup uses the header lines to determine the names and lengths of the sequences in the reference. So, they must be present. However, it doesn't really care whether there are extra ones randomly in the file; they just get ignored. So, cat works fine.
Leave a comment:
Brian Bushnell replied

11-28-2016, 02:34 PM
Pileup won't accept multiple input files, but it will accept standard in. So if the files were sam rather than bam, you could do this:

Code:

cat lib1.sam lib2.sam | pileup.sh in=stdin.sam out=stats.txt hist=histogram.txt

or for gzipped files:

Code:

cat lib1.sam.gz lib2.sam.gz | pileup.sh in=stdin.sam.gz out=stats.txt hist=histogram.txt

However, I don't see a way to do that with bam files. BBMap will produce sam files instead of bam files if you name the output *.sam instead of *.bam.
Leave a comment:
Brian Bushnell replied

11-28-2016, 01:31 PM
BBMap can output coverage, by the way. Just add the flag "covstats=covstats.txt". You may also want to add the flag "delcov=f" for this purpose.
Leave a comment:
Brian Bushnell replied

11-28-2016, 10:32 AM
The default is roughly 76% identity. You can adjust this with the "minid" flag (e.g. "minid=0.80" for 80% identity.) If you want to restrict alignments to a maximum number of substitutions, you can use "subfilter"; e.g., "subfilter=5" will discard alignments with more than 5 substitutions.

That said, what exactly are you trying to do?

Last edited by Brian Bushnell; 11-28-2016, 10:35 AM.
Leave a comment:
nicklisbon replied

11-26-2016, 08:48 AM
So, the first alignment is run in global alignment mode, while the second one is run in local mode allowing indels up to 400 kb, isn't it?
Sounds like a good solution thanks.
We will try this option and check whether Lumpy likes it.
Leave a comment:

Previous 1 13 20 21 22 23 24 25 26 33 34 template Next

Essential Discoveries and Tools in Epitranscriptomics

by seqadmin

The field of epigenetics has traditionally concentrated more on DNA and how changes like methylation and phosphorylation of histones impact gene expression and regulation. However, our increased understanding of RNA modifications and their importance in cellular processes has led to a rise in epitranscriptomics research. “Epitranscriptomics brings together the concepts of epigenetics and gene expression,” explained Adrien Leger, PhD, Principal Research Scientist...
- Channel: Articles
04-22-2024, 07:01 AM
Current Approaches to Protein Sequencing

by seqadmin

Proteins are often described as the workhorses of the cell, and identifying their sequences is key to understanding their role in biological processes and disease. Currently, the most common technique used to determine protein sequences is mass spectrometry. While still a valuable tool, mass spectrometry faces several limitations and requires a highly experienced scientist familiar with the equipment to operate it. Additionally, other proteomic methods, like affinity assays, are constrained...
- Channel: Articles
04-04-2024, 04:25 PM

Topics	Statistics	Last Post
Expanding the Horizons of Cellular Research with the Single Cell Atlas by seqadmin Started by seqadmin, 04-25-2024, 11:49 AM	0 responses 20 views 0 likes	Last Post by seqadmin 04-25-2024, 11:49 AM
Genetic Variants and Diabetes Risk in Childhood Cancer Survivors by seqadmin Started by seqadmin, 04-24-2024, 08:47 AM	0 responses 20 views 0 likes	Last Post by seqadmin 04-24-2024, 08:47 AM
Cancer Metastasis: A Deep Dive into Cellular Plasticity by seqadmin Started by seqadmin, 04-11-2024, 12:08 PM	0 responses 62 views 0 likes	Last Post by seqadmin 04-11-2024, 12:08 PM
Proteogenomic Profiles Offer New Clues in Prostate Cancer by seqadmin Started by seqadmin, 04-10-2024, 10:19 PM	0 responses 61 views 0 likes	Last Post by seqadmin 04-10-2024, 10:19 PM

Seqanswers Leaderboard Ad

Announcement

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment:

Latest Articles

ad_right_rmr

News