Thanks for the input, guys! I will start playing with the updated BBMap as well!
-
Hi Brian and other members,
I'm trying to map the Illumina reads generated from 19 inbred and hybrid maize cultivars under two conditions for downstream differential expression analysis. Considering the SNPs and indels in the mixed genetic background, my original plan was to generate a 'corrected reference' using tools like Quiver or Pilon. Then I came across BBMap and was happy to find that it can handle SNPs and indels. I used the default settings, and the mapping rate so far is over 97%. Meanwhile, I noticed a huge variation in the ambiguous rate, ranging from 6% to >50%. Therefore, I was wondering if it is possible to evaluate which method (mapping directly to a single reference vs. mapping to 19 SNP-corrected references) would be more accurate for the DE analysis.
The third option would be to build a de novo assembly for each genetic background. However, I wasn't sure if it's possible to come up with a consensus contig correspondence (contigX_background1, contigX_background2..contigX_background19) so that I could monitor each contig/gene of interest in multiple backgrounds.
Any input/thoughts you may have on this would be much appreciated. Thank you in advance.
Last edited by chiayi; 12-15-2016, 05:46 AM.
Comment
-
Maize is the kind of organism where de-novo assembly is extremely difficult. I recommend avoiding that; it will make things much more complicated. I don't think there's any reason to do that, either, unless you find that structural variations are causing problems.
BBMap is very tolerant of SNPs and indels so generally you don't need to do any kind of correction, but aligning to a SNP-corrected reference (assuming the SNPs are homozygous, or at least a majority) will be more accurate than aligning to the base reference.
I'm not really sure where the high ambig rate is coming from. Is this WGS, or are you doing some kind of enrichment, RNA-seq, etc? And are you mapping to the genome or transcriptome?
Comment
-
Is there a 'standard' for ambiguous rate when you look at the stats from BBMap?
Comment
-
Normally... I see ambig rates of 3% or lower in diploids like human, and 0.1% or lower in haploid bacteria. I have little experience in mapping RNA to plant genomes, aside from feedback from co-workers. And they mainly map to plant transcriptomes rather than genomes.
Can you describe your mapping protocol in more detail? For example, are you concatenating all genomes and mapping to all simultaneously, or are you experiencing a high claimed ambig rate when mapping to a single reference alone?
Comment
-
For mapping diploid maize data, I concatenated all the chromosomes for a given species (e.g. 10 chromosomes in maize) and used that as the mapping reference. I then mapped the BBDuk-trimmed reads to this reference (genome, not transcriptome) using the default settings of BBMap.
For the maize data generated from B73 (B73 is the sequenced reference genome), the ambig rate ranges from 3% to 15%. The variation occurs between biological replicates, not within a pair. For example, the ambig rates of Read 1 and Read 2 of a replicate are close to each other, but the ambig rates of different biological replicates can vary widely (12%, 3%, 3%; 11%, 11%, 3%). The variation between replicates almost made me feel it was due to technical issues with the libraries and/or the biological nature of maize. I could try mapping to the transcriptome and see if there's any improvement. Other than that, I wasn't sure how to track it down, or whether this should be a concern.
For a different set of maize data with mixed backgrounds, for now I still used the same B73 base reference genome for mapping. The ambig rates went up to 6% - 46%. The increase was expected given the SNPs/indels present between different cultivars. I also observed variation between biological replicates similar to that described above.
On a related note, I also have diploid Arabidopsis data, which I mapped to its own genome (TAIR9 genome fasta). I added -maxindel=2000 to accommodate the compact size of the genome. The ambig rate was lower on average (mostly below 4%). I did not see that radical variation between replicates either. This is consistent with my guess that the ambig rates in maize are partly due to the nature of maize.
Please let me know if any further details would help you diagnose this. Thank you as always for your input.
Comment
-
It would be useful to know what the ambiguous reads are hitting. It's likely that it's something with many copies, such as ribosomal elements. Ribo "contamination" is common in libraries even when some kind of ribo-depletion is used. You can catch the ambiguous reads with a second mapping pass using "ambig=toss outu=unmapped.fq" if you start with just the mapped reads. Then, you can BLAST them, or map them again and look at an annotated version of the reference to see what they're hitting. But it's likely ribosomal.
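For the "look at what they're hitting" step, here is a minimal sketch, assuming the ambiguous reads have been re-mapped and written as SAM text. It simply tallies the reference name (column 3) of every mapped record. The function name and the reference names in the example (such as `rDNA_45S`) are made up for illustration and are not part of BBTools.

```python
from collections import Counter

def count_hits(sam_lines):
    """Count mapped alignments per reference sequence from SAM text lines."""
    hits = Counter()
    for line in sam_lines:
        # Skip blank lines and SAM header lines (they start with '@').
        if not line.strip() or line.startswith("@"):
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3:
            continue
        flag = int(fields[1])   # column 2: bitwise FLAG
        rname = fields[2]       # column 3: reference sequence name
        # FLAG bit 0x4 means "unmapped"; '*' means no reference assigned.
        if flag & 0x4 or rname == "*":
            continue
        hits[rname] += 1
    return hits

# Hypothetical records, for illustration only.
example = [
    "@SQ\tSN:rDNA_45S\tLN:1000",
    "r1\t0\trDNA_45S\t10\t3\t4M\t*\t0\t0\tACGT\tIIII",
    "r2\t16\trDNA_45S\t40\t3\t4M\t*\t0\t0\tACGT\tIIII",
    "r3\t0\tchr1\t500\t3\t4M\t*\t0\t0\tACGT\tIIII",
    "r4\t4\t*\t0\t0\t*\t*\t0\t0\tACGT\tIIII",
]

for name, n in count_hits(example).most_common():
    print(name, n)
```

If one or two names dominate the tally, checking them against the annotation should show whether they are ribosomal or some other multi-copy element.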
Comment
-
Save Reads for Tadpole
Hi Brian, Geno,
I'm wondering if it's possible to save the reads that BBMap uses in its assembly, for subsequent use with Tadpole?
I have some junk reads that I think might be interfering with de novo assembly. I have ample coverage, and won't miss the crappy reads. I thought a clever way of eliminating them would be to restrict the reads used during de novo assembly to those that had previously mapped to the reference with BBMap. I'm getting an error at the moment when I try to use Tadpole with the reads used by BBMap that says it cannot take a mixture of paired and unpaired reads as input (working in Geneious). Do you think what I'm trying to do is possible?
I've already quality trimmed to Q20, and have confirmed that the junk reads are indeed high quality (>Q35). They are internal, repetitive strings of a single nucleotide. Not sure where they're coming from, but such a nuisance.
Thanks for any help.
P.S. Any idea where strings of a single nucleotide might be originating? I'm using 2-color chemistry on a MiniSeq, but the strings can be any nucleotide, not just G. Samples are prepped with Nextera. My samples are PCR amplicons; however, if I Sanger sequence them, I don't get these mononucleotide strings.
Comment
-
I'm not really sure what the problem is in this case. Tadpole really doesn't care what the input reads look like, whether they are paired, or what format they are in. Can you post the complete error message?
What you are planning to do should work fine. On the command line, it would be something like this:
Code:
bbmap.sh ref=ref.fa in=reads.fq outm=mapped.fq outu=junk.fq
tadpole.sh in=mapped.fq out=contigs.fa k=62
Code:
bbduk.sh in=reads.fq out=filtered.fq entropy=0.01
Comment
-
That's unfortunately about as much as the error message says: that Tadpole cannot use a mixture of paired and unpaired reads. Might the read-name format be throwing it off? For instance, my reads are named in the following format after BBMap (in Geneious):
MN00123:91:000H22WH3:1:22104:14105:5672_1:N:0:1/2
MN00123:91:000H22WH3:1:22104:14105:5672_1:N:0:1/1
I didn't know about that feature with BBDuk! Will entropy of 0.01 remove any string of a mononucleotide? Or, how many must be present in a string to flag it? Is this with a window size of 50 and kmer size of 5?
Thanks,
Jake
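One way to check whether a read set really is a mixture of paired and unpaired reads is to group the names by their `/1` and `/2` suffixes. This is only a sketch, assuming the mate number is the last thing after a final slash, as in the names above. The function name and the orphan read in the example are hypothetical.

```python
from collections import defaultdict

def split_by_pairing(names):
    """Return (paired, single): base names with both mates, and the rest."""
    mates = defaultdict(set)
    for name in names:
        # Split on the last '/', so colons and underscores in the name are untouched.
        base, sep, mate = name.rpartition("/")
        if sep and mate in ("1", "2"):
            mates[base].add(mate)
        else:
            # No /1 or /2 suffix: treat the whole name as an unpaired read.
            mates[name].add("?")
    paired = sorted(b for b, m in mates.items() if m == {"1", "2"})
    single = sorted(b for b, m in mates.items() if m != {"1", "2"})
    return paired, single

names = [
    "MN00123:91:000H22WH3:1:22104:14105:5672_1:N:0:1/2",
    "MN00123:91:000H22WH3:1:22104:14105:5672_1:N:0:1/1",
    # Hypothetical orphan whose mate did not map, for illustration only.
    "MN00123:91:000H22WH3:1:22104:9999:1111_1:N:0:1/1",
]

paired, single = split_by_pairing(names)
print(len(paired), "complete pairs;", len(single), "reads without a mate")
```

A non-empty second list would mean the export contains reads whose mates were dropped, which would be one possible source of a "mixture of paired and unpaired reads" complaint.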
Comment
-
Originally posted by Brian Bushnell: How many files do you have after BBMap, and what are they named?
The file name is: "1-JL08-P1-A3 assembled to HXB2 Nested Amplified Region extraction". Within the file, there are thousands of reads with the naming convention that I shared in the previous post.
I doubt it - BBTools should be able to handle reads named like that.
For the default window=50 entropyk=5, reads must be at least 50bp long to be processed by the entropy filter (you can reduce that by making the window smaller). And entropy=0.01 will remove any sequence that is a single mononucleotide, as long as it's at least 50bp long. Note that if there are some errors, so that it is no longer a pure mononucleotide, you'd need a higher value for entropy. Something like "AAAAAAAAAAGGGGGGGGGGGGGGGG" would also need a higher value (50 A's and 50 G's appears to need entropy=0.21). Don't set it too high, though, or you'll lose the low-complexity parts of your genome.
***Update. Brian, I contacted Geneious and they seem to be aware of the problem. They gave me a macro/workflow that extracts the reads from the BBMap'd contig file, and now they are feeding into Tadpole without a problem. Thanks for your help on this, you're getting all the gold stars!
Last edited by JVGen; 01-06-2017, 08:51 AM.
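To make the window and k-mer idea above a bit more concrete, here is a rough sketch of a normalized Shannon entropy over the k-mers in a window. It is not BBDuk's actual implementation, and the values it gives should not be expected to match BBDuk's thresholds (for instance the 0.21 figure mentioned above). The function names and the normalization are assumptions made for illustration.

```python
import math
from collections import Counter

def kmer_entropy(window, k=5):
    """Normalized Shannon entropy (0..1) of the k-mers in one window."""
    n = len(window) - k + 1          # number of k-mers in the window
    if n <= 1:
        return 0.0
    counts = Counter(window[i:i + k] for i in range(n))
    h = -sum((c / n) * math.log2(c / n) for c in counts.values())
    # Assumed normalization: divide by the largest entropy this window could have.
    h_max = math.log2(min(n, 4 ** k))
    return max(0.0, h / h_max)

def window_entropies(seq, window=50, k=5):
    """Entropy of every full window; empty if the read is shorter than the window."""
    return [kmer_entropy(seq[i:i + window], k)
            for i in range(len(seq) - window + 1)]

print(kmer_entropy("A" * 50))              # pure mononucleotide window
print(kmer_entropy("A" * 25 + "G" * 25))   # two homopolymer blocks
print(window_entropies("A" * 40))          # shorter than the window
```

In this sketch a pure mononucleotide window has exactly one distinct k-mer, so its entropy is zero and it falls under any positive cutoff. A window made of two homopolymer blocks has a handful of distinct k-mers and scores a little above zero, which is why it would need a higher cutoff.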
Comment
-
How does BBMap make use of an index on disk? I'm on a shared cluster system and I'm essentially wondering if BBMap performs a nice single pass of the index to read it into memory, or if it performs a lot of random access to the index on disk?
If it's the latter, I'll just copy it to node-local disks, so no worries. Just interested in how it works.
Comment
-
Originally posted by GenoMax: @Brian will confirm later, but I think BBMap loads pre-made indexes from disk into memory if you provide the path= option. Note: the "nodisk" option builds indexes in memory from fasta files, but I am not sure if it can (or needs to) be used with the path= option.
Comment