Originally posted by Brian Bushnell
View Post
Seqanswers Leaderboard Ad
Collapse
Announcement
Collapse
No announcement yet.
X
-
-
I noticed this line:
BBM: bases mapped (cigar): 29307 # more accurate
Also, I find it odd that the insert size average and stdev differ so much. Often the mode or median is more stable as it is less influenced by outliers.
By inverted pairs, do you mean both reads map to the same side of the reference? If not, I wonder if the fact that the reference is very small and circular could play some role in the differences.
Anyway, please feel free to email me both bam files and the reference, and I'll look at them. It's possible that the circularity, or the fact that bwa looks for local rather than global alignments, is the primary factor behind the difference.
Leave a comment:
-
High error rate with BBmap?
Hi Brian,
while trying to call variants on a human MT DNA I noticed the unusually high error rate in mapping with bbmap (37.10 and 36.92 tested) compared to bwa (0.7.12-r1039). Both were run with the default parameters, and below is the output of samtools stats (after sorting and indexing bams). IGV shows many more inverted pairs with bbmap mapping, and this came as a surprise.
I can provide details, however I prefer to share them offline due to the sensitivity of some of the data. Any hints on how to track down the issue?
Many thanks!
BBmap
Code:BBM: raw total sequences: 537764 BBM: filtered sequences: 0 BBM: sequences: 537764 BBM: is sorted: 1 BBM: 1st fragments: 268882 BBM: last fragments: 268882 BBM: reads mapped: 256366 BBM: reads mapped and paired: 230042 # paired-end technology bit set + both mates mapped BBM: reads unmapped: 281398 BBM: reads properly paired: 228748 # proper-pair bit set BBM: reads paired: 537764 # paired-end technology bit set BBM: reads duplicated: 0 # PCR or optical duplicate bit set BBM: reads MQ0: 0 # mapped and MQ=0 BBM: reads QC failed: 4282 BBM: non-primary alignments: 0 BBM: total length: 67353071 # ignores clipping BBM: bases mapped: 32229667 # ignores clipping BBM: bases mapped (cigar): 29307 # more accurate BBM: bases trimmed: 0 BBM: bases duplicated: 0 BBM: mismatches: 1672477 # from NM fields BBM: error rate: 5.706749e+01 # mismatches / bases mapped (cigar) BBM: average length: 125 BBM: maximum length: 151 BBM: average quality: 30.4 BBM: insert size average: 255.7 BBM: insert size standard deviation: 117.0 BBM: inward oriented pairs: 60121 BBM: outward oriented pairs: 2977 BBM: pairs with other orientation: 48 BBM: pairs on different chromosomes: 0
Code:BWA: raw total sequences: 537764 BWA: filtered sequences: 0 BWA: sequences: 537764 BWA: is sorted: 1 BWA: 1st fragments: 268882 BWA: last fragments: 268882 BWA: reads mapped: 266324 BWA: reads mapped and paired: 248064 # paired-end technology bit set + both mates mapped BWA: reads unmapped: 271440 BWA: reads properly paired: 245908 # proper-pair bit set BWA: reads paired: 537764 # paired-end technology bit set BWA: reads duplicated: 0 # PCR or optical duplicate bit set BWA: reads MQ0: 178 # mapped and MQ=0 BWA: reads QC failed: 0 BWA: non-primary alignments: 0 BWA: total length: 67353071 # ignores clipping BWA: bases mapped: 33736838 # ignores clipping BWA: bases mapped (cigar): 32160973 # more accurate BWA: bases trimmed: 0 BWA: bases duplicated: 0 BWA: mismatches: 603462 # from NM fields BWA: error rate: 1.876380e-02 # mismatches / bases mapped (cigar) BWA: average length: 125 BWA: maximum length: 151 BWA: average quality: 30.4 BWA: insert size average: 384.2 BWA: insert size standard deviation: 1040.7 BWA: inward oriented pairs: 67383 BWA: outward oriented pairs: 3645 BWA: pairs with other orientation: 112 BWA: pairs on different chromosomes: 0
Leave a comment:
-
You can reduce kmer length and increase sensitivity - change "k=13" to "k=11" and add "slow". But ultimately, since BBMap is a global aligner, it does not like large structural rearrangements when aligning long queries, since they do not fit into the model of "match/substitution/insertion/deletion" reported in single alignments.
Leave a comment:
-
Originally posted by Brian Bushnell View Post@minja - I replied via email and still have not had a chance to look into that; sorry - I will when I have a chance.
@catagui - BBMap cannot handle query sequences over 600bp. When you feed it a fasta input file, by default, sequences longer than 500bp are broken into 500bp pieces and mapped individually (and not recombined afterward). Some of these pieces could be quite short (say, 15bp) and will usually map if you have a big enough reference. You can discard them with "minlen=40" or so. Or you could simply use MapPacBio for mapping as it can accommodate much longer reads. In that case, though, you should increase "maxindel" which has different defaults between BBMap and MapPacBio, and furthermore, a longer transcript could contain multiple long introns.
Try "mapPacBio.sh k=13 maxindel=200000 maxlen=60000 minlen=40 (other arguments)" and see if that improves its mapping rate. Ultimately, though, longer query sequences will always have lower alignment rates than shorter ones when using a global or glocal aligner, so shredding the sequences (which BBMap does) artificially increases the alignment rate.
I haven't been able to improve the mapping percentage. I tried your suggestion (bbmap/mapPacBio.sh k=13 maxindel=200000 maxlen=60000 minlen=40) and still get 13%.
Is it because the mapping is too stringent? As I am dealing with different species within the same genus. Is there another flag aside from "minid" that I could change to improve the mapping.
Thanks again
Leave a comment:
-
Normally, the variant quality is the best indicator of whether a variant is real. I think, in this case, it's likely that most of them are real - viruses mutate rapidly, so viral "isolates" often contain multiple different viral genomes (which is why they are hard to assemble). The possibilities are, essentially...
1) Sequencing error randomly matched somewhere distant in the genome, causing the false appearance of a deletion. This is extremely unlikely with small, non-redundant genomes, and would usually only be represented by a single read.
2) Chimeric reads formed from different genomic fragments fusing together during sequencing may look like long deletion events. These should universally be represented by a single read (or pair) with a PCR-free or deduplicated library.
3) Real deletion events. Hopefully these will be represented by multiple reads.
So, you may want to deduplicate your library (if it was PCR'd, otherwise don't) and then call variants using a threshold of, say, 3 reads (minreads=3 flag, placed after "clearfilters").
Leave a comment:
-
Originally posted by Brian Bushnell View PostNormally, I would just map the reads with the flag "maxindel=100k" (or whatever is appropriate, given the length of the genome) and then call variants with "callvariants.sh" using the flag "rarity=0.01" to find low-frequency events. But alternatively, you can separate the reads with multiple BBMap passes, like this:
bbmap.sh in=reads.fq outm=mapped.fq maxindel=100k
bbmap.sh in=mapped.fq outm=normal.fq outu=longdeletion.fq maxindel=100k dellenfilter=20
bbmap.sh in=longdeletion.fq out=mapped_longdeletion.sam maxindel=100k
Thanks again.
Leave a comment:
-
Normally, I would just map the reads with the flag "maxindel=100k" (or whatever is appropriate, given the length of the genome) and then call variants with "callvariants.sh" using the flag "rarity=0.01" to find low-frequency events. But alternatively, you can separate the reads with multiple BBMap passes, like this:
bbmap.sh in=reads.fq outm=mapped.fq maxindel=100k
bbmap.sh in=mapped.fq outm=normal.fq outu=longdeletion.fq maxindel=100k dellenfilter=20
bbmap.sh in=longdeletion.fq out=mapped_longdeletion.sam maxindel=100k
Leave a comment:
-
Originally posted by Brian Bushnell View PostHi James,
Sorry for the slow reply. You could, in your case, do this:
Assume the barcode is from position 10 to 19 in the read (inclusive, zero-based). Then you would do:
reformat.sh in=reads.fq ftl=10 ftr=19 out=trimmed.fq
kmercountexact.sh in=trimmed.fq k=10 rcomp=f out=kmers.fa
"kmers.fa" will contain all the barcodes and their counts.
Awesome Brian, thanks for the response. It seems like that is going to work. It's a little more complicated since there are actually three amplicons with barcodes in different areas in the same file but if I can pull them out using one of your other tools the kmercountexact.sh should work just fine.
I thought maybe I would ask you a question about another problem I'm currently having. I have been trying to use BBmap to detect large deletions in viral sequencing data (defective genomes) but so far haven't had good luck. Is there a way I can sort out specifically these reads that have "break points" or deletions in the viral genome away from the reads that map normally?
Thanks again for all of your work.
James
Leave a comment:
-
Hi James,
Sorry for the slow reply. You could, in your case, do this:
Assume the barcode is from position 10 to 19 in the read (inclusive, zero-based). Then you would do:
reformat.sh in=reads.fq ftl=10 ftr=19 out=trimmed.fq
kmercountexact.sh in=trimmed.fq k=10 rcomp=f out=kmers.fa
"kmers.fa" will contain all the barcodes and their counts.
Leave a comment:
-
Hello Brian,
I made some viruses that have a set of degenerate nucleotides in the genome to look at bottlenecks - we're calling these barodes. I'm wondering if any of your tools have a way of extracting the specific region where the barcodes are and then listing the different sequences that are present and possibly how many times they are present. I've been able to do pretty much everything I've needed to with your tools so I figured I would ask. Thanks for all of the work you've done.
James
Leave a comment:
-
@minja - I replied via email and still have not had a chance to look into that; sorry - I will when I have a chance.
@catagui - BBMap cannot handle query sequences over 600bp. When you feed it a fasta input file, by default, sequences longer than 500bp are broken into 500bp pieces and mapped individually (and not recombined afterward). Some of these pieces could be quite short (say, 15bp) and will usually map if you have a big enough reference. You can discard them with "minlen=40" or so. Or you could simply use MapPacBio for mapping as it can accommodate much longer reads. In that case, though, you should increase "maxindel" which has different defaults between BBMap and MapPacBio, and furthermore, a longer transcript could contain multiple long introns.
Try "mapPacBio.sh k=13 maxindel=200000 maxlen=60000 minlen=40 (other arguments)" and see if that improves its mapping rate. Ultimately, though, longer query sequences will always have lower alignment rates than shorter ones when using a global or glocal aligner, so shredding the sequences (which BBMap does) artificially increases the alignment rate.
Leave a comment:
-
Bbmap for mapping a transcriptome to a genome
Hi Brian,
I would like to use Bbmap to map a transcriptome to a genome.
My first questions would be in the Results it saids:
Reads Used: 165317 (61187186 bases)
but the transcriptome that I am using has 95,389 transcripts. Shouldn't the number of transcripts be the same as the reads used?
2nd. This transcriptome has a N50 of 363 and N75 of 696, the manual saids to use BBMapPacBio for reads longer than 700bp. However, when I use BBMapPacBio instead of bbmap I get a 28% mapping percentage, that is lower than when using bbmap (38%).This transcriptome has transcripts from other organisms and the mapping percentage should be around 50%.
What else should I take into account when mapping a transcriptome to the genome, since is it different than mapping mRNA reads?
Thanks
Leave a comment:
-
Hello Brain,
I'm trying to make use of bbmapskimmer.
Here is a simple fasta example (read.fasta):
Code:>1 GAACATGATCCCCTTGTACTA[B]C[/B]TACATTATCACTAGCTTTATGTTTTCTA >2 TAGAAAACATAAAGCTAGTGATAATGTAGTAGTACAAGGGGATCATGTTC >3 GAACATGATCCCCTTGTACTA[B]T[/B]TACATTATCACTAGCTTTATGTTTTCTA
I'm aligning to mm10 genome:
~/Distr/bbmap/bbmapskimmer.sh in=~/tmp/HiC_repeats/input/temp/read.fasta out=test_bbmap1.sam ref=~/tmp/HiC_repeats/fasta/mm10.fa ambiguous=all vslow k=8 minid=0.6 maxsites2=8000 maxsites=8000 local
Here is test_bbmap1.sam:
Code:@HD VN:1.4 SO:unsorted @SQ SN:chr10 LN:130694993 .... @SQ SN:chrM LN:16299 @SQ SN:chrX LN:171031299 @SQ SN:chrY LN:91744698 @PG ID:BBMap PN:BBMap VN:37.02 CL:java -Djava.library.path=/mnt/storage/home/vsfishman/Distr/bbmap/jni/ -ea -Xmx98546m align2.BBMapPacBioSkimmer build=1 overwrite=true minratio=0.40 fastareadlen=6000 ambig=all minscaf=100 startpad=10000 stoppad=10000 midpad=6000 in=/mnt/storage/home/vsfishman/tmp/HiC_repeats/input/temp/read.fasta out=test_bbmap1.sam ref=/mnt/storage/home/vsfishman/tmp/HiC_repeats/fasta/mm10.fa ambiguous=all vslow k=8 minid=0.6 maxsites2=8000 maxsites=8000 local 1 0 chr16 39205299 43 50= * 0 0 GAACATGATCCCCTTGTACTACTACATTATCACTAGCTTTATGTTTTCTA * NM:i:0 AM:i:43 NH:i:2 1 272 chr10 38707112 39 28=1X21= * 0 0 * * NM:i:1 AM:i:39 NH:i:2 2 16 chr16 39205299 43 50= * 0 0 GAACATGATCCCCTTGTACTACTACATTATCACTAGCTTTATGTTTTCTA * NM:i:0 AM:i:43 NH:i:2 2 256 chr10 38707112 39 28=1X21= * 0 0 * * NM:i:1 AM:i:39 NH:i:2 3 16 chr10 38707112 3 50= * 0 0 TAGAAAACATAAAGCTAGTGATAATGTAATAGTACAAGGGGATCATGTTC * XT:A:R NM:i:0 AM:i:3 NH:i:80 3 272 chr10 72154315 3 50= * 0 0 * * NM:i:0 AM:i:3 NH:i:80 3 256 chr10 77806115 3 50= * 0 0 * * NM:i:0 AM:i:3 NH:i:80 3 272 chr10 127797927 3 50= * 0 0 * * NM:i:0 AM:i:3 NH:i:80 3 256 chr13 4007635 3 50= * 0 0 * * NM:i:0 AM:i:3 NH:i:80 3 256 chr14 17647057 3 50= * 0 0 * * NM:i:0 AM:i:3 NH:i:80 3 256 chr14 40019581 3 50= * 0 0 * * NM:i:0 AM:i:3 NH:i:80 3 256 chr15 50069753 3 50= * 0 0 * * NM:i:0 AM:i:3 NH:i:80 3 272 chr15 72113945 3 50= * 0 0 * * NM:i:0 AM:i:3 NH:i:80 3 256 chr15 75259455 3 50= * 0 0 * * NM:i:0 AM:i:3 NH:i:80 3 256 chr17 69579150 3 50= * 0 0 * * NM:i:0 AM:i:3 NH:i:80 3 256 chr19 60830365 3 50= * 0 0 * * NM:i:0 AM:i:3 NH:i:80 3 256 chr1 95272134 3 50= * 0 0 * * NM:i:0 AM:i:3 NH:i:80 3 256 chr1 102465771 3 50= * 0 0 * * NM:i:0 AM:i:3 NH:i:80 ....
And, just to note, without "local" option I got same results.
As you can see, there are many perfect alignmens for read3, but only 2 alignments of reads 1 and 2. As I understand, with minid=0.6 all alignments of read3 should be also reported for reads 1 and 2?
Could you, please, help to clarify this?
And one more question, why reads 1 and 2 do not have "XT:A:" tag?
Best,
MinjaLast edited by minja; 03-16-2017, 06:07 AM.
Leave a comment:
-
While you wait for @Brian to respond (he is on California Daylight Time) you could turn off "pigz=f unpigz=f" and see if that allows the job to move forward.
Leave a comment:
Latest Articles
Collapse
-
by seqadmin
The field of epigenetics has traditionally concentrated more on DNA and how changes like methylation and phosphorylation of histones impact gene expression and regulation. However, our increased understanding of RNA modifications and their importance in cellular processes has led to a rise in epitranscriptomics research. “Epitranscriptomics brings together the concepts of epigenetics and gene expression,” explained Adrien Leger, PhD, Principal Research Scientist...-
Channel: Articles
04-22-2024, 07:01 AM -
-
by seqadmin
Proteins are often described as the workhorses of the cell, and identifying their sequences is key to understanding their role in biological processes and disease. Currently, the most common technique used to determine protein sequences is mass spectrometry. While still a valuable tool, mass spectrometry faces several limitations and requires a highly experienced scientist familiar with the equipment to operate it. Additionally, other proteomic methods, like affinity assays, are constrained...-
Channel: Articles
04-04-2024, 04:25 PM -
ad_right_rmr
Collapse
News
Collapse
Topics | Statistics | Last Post | ||
---|---|---|---|---|
Started by seqadmin, 04-25-2024, 11:49 AM
|
0 responses
19 views
0 likes
|
Last Post
by seqadmin
04-25-2024, 11:49 AM
|
||
Started by seqadmin, 04-24-2024, 08:47 AM
|
0 responses
20 views
0 likes
|
Last Post
by seqadmin
04-24-2024, 08:47 AM
|
||
Started by seqadmin, 04-11-2024, 12:08 PM
|
0 responses
62 views
0 likes
|
Last Post
by seqadmin
04-11-2024, 12:08 PM
|
||
Started by seqadmin, 04-10-2024, 10:19 PM
|
0 responses
61 views
0 likes
|
Last Post
by seqadmin
04-10-2024, 10:19 PM
|
Leave a comment: