BBMap (aligner for DNA/RNAseq) is now open-source and available for download.

SNPsaurus replied

03-21-2018, 03:26 PM
The reference in this case is a list of RAD loci sequences, each at 150 bp. The reads are 150 bp and should match one of the RAD loci. Since there are no N (or degenerate nucleotides) in the reference, and only a small number of reads with an N, it must have to do with the padding. A trimmed read would not have N padding, would it? This sample was degraded DNA, so plenty of reads were trimmed to less than 150 nucleotides. I doubt that 72% of the loci from this sample had an insertion that made the read extend past the reference (generated from consensus of many samples). Odd, but I guess it is not critical for me to fully figure this out.
Leave a comment:
Brian Bushnell replied

03-21-2018, 03:24 PM
Originally posted by kimng View Post

Hello, I've been using different parts of the BBMap suite and mixing them with other softwares for WGS of bacterial samples (primarily for the purpose of automated QC). I'm a fan of callvariants.sh as it works quickly and doesn't require me to sort .sam files before hand and it's simple to clearfilters to return all values. Unfortunately when using callvariants.sh with .sam files created from minimap2 (https://github.com/lh3/minimap2) using illumina nextseq reads. I receive the error seen below. I was curious if this was a minor bug that could be corrected?

I'm running callvariants.sh from bbmap v37.90 obtained from conda (https://anaconda.org/bioconda/bbmap). Appreciate any assistance given.

-Kim

Error:
java -ea -Xmx48779m -Xms48779m -cp /home/ssi.ad/kimn/.conda/envs/env_kim/opt/bbmap-37.90/current/ var2.CallVariants in=aln.sam ref=contigs.fasta vcf=minimap.vcf ploidy=1 clearfilters
Executing var2.CallVariants [in=aln.sam, ref=contigs.fasta, vcf=minimap.vcf, ploidy=1, clearfilters]

Loading reference.
Time: 0.304 seconds.
Processing input files.
Exception in thread "Thread-3" Exception in thread "Thread-4" Exception in thread "Thread-5" java.lang.AssertionError: TODO: Encountered a read with 'M' in cigar string but no MD tag.
at stream.SamLine.toShortMatch(SamLine.java:1200)
at stream.SamLine.toRead(SamLine.java:1972)
at stream.SamLine.toRead(SamLine.java:1832)
at stream.SamReadStreamer$ProcessThread.makeReads(SamReadStreamer.java:206)
at stream.SamReadStreamer$ProcessThread.run(SamReadStreamer.java:135)
java.lang.AssertionError: TODO: Encountered a read with 'M' in cigar string but no MD tag.
at stream.SamLine.toShortMatch(SamLine.java:1200)
at stream.SamLine.toRead(SamLine.java:1972)
at stream.SamLine.toRead(SamLine.java:1832)
at stream.SamReadStreamer$ProcessThread.makeReads(SamReadStreamer.java:206)
at stream.SamReadStreamer$ProcessThread.run(SamReadStreamer.java:135)
java.lang.AssertionError: TODO: Encountered a read with 'M' in cigar string but no MD tag.
at stream.SamLine.toShortMatch(SamLine.java:1200)
at stream.SamLine.toRead(SamLine.java:1972)
at stream.SamLine.toRead(SamLine.java:1832)
at stream.SamReadStreamer$ProcessThread.makeReads(SamReadStreamer.java:206)
at stream.SamReadStreamer$ProcessThread.run(SamReadStreamer.java:135)

The problem here is that minimap uses old-style cigar strings (M symbol instead of = and X) and also does not produce MD tags. I've added the ability to handle reads in that situation and it will be released in v37.95 (soon).

Prior to that you could probably add MD tags with "samtools calmd".
Leave a comment:
Brian Bushnell replied

03-21-2018, 03:13 PM
This means 72 percent of the reads mapped with an "N" symbol in the match string, an internal data structure similar to a cigar string. The "N" symbol denotes either an N in the read or an N in the reference, which can include sequences that go off the end of scaffolds (the ends are internally padded with Ns). Additionally, degenerate IUPAC codes are counted as Ns.
Leave a comment:

SNPsaurus replied

03-21-2018, 11:26 AM

I have a question about the bbmap summary output. Here is an example:

Code:

Reads Used:             1061591 (151284618 bases)

Mapping:                15.250 seconds.
Reads/sec:              69611.57
kBases/sec:             9920.17


Read 1 data:            pct reads       num reads       pct bases          num bases

mapped:                  76.2432%          809391        77.6086%          117409909
unambiguous:             75.3885%          800318        76.8084%          116199316
ambiguous:                0.8547%            9073         0.8002%            1210593
low-Q discards:           0.0000%               0         0.0000%                  0

perfect best site:       11.4049%          121073         9.7282%           14717263
semiperfect site:        58.9222%          625513        60.0540%           90852423

Match Rate:                   NA               NA        99.1429%          116413381
Error Rate:              20.6634%          183594         0.2527%             296692
Sub Rate:                19.9279%          177059         0.2369%             278137
Del Rate:                 0.6244%            5548         0.0084%               9819
Ins Rate:                 0.6302%            5599         0.0074%               8736
N Rate:                  72.0298%          639985         0.6044%             709655

What does an N rate of 72% mean in this context? The reference has 0 bases that are N. There are 20,000 out of the 1 million reads that contain an N. The N Rate percent bases of 0.6% seems high compared to 20,000 reads with 1 or 2 Ns, but closer to what I think I see than the 72% N for reads.

Leave a comment:

SNPsaurus replied

03-16-2018, 09:41 AM
Originally posted by darthsequencer View Post

Yeah - I have but from what I can tell it maxes out at 6 reports unless maxsites is increased. I am wondering if there is a setting that avoids maxsites and reports everything above some score.

You might try the align2.BBMapPacBioSkimmer mapper with ambig=all and expectedsites= some high number. Not exactly what you were asking but I get dozens of hits with that.
Leave a comment:
darthsequencer replied

03-15-2018, 09:03 PM
Originally posted by GenoMax View Post

Have you tried ambig=all?

Yeah - I have but from what I can tell it maxes out at 6 reports unless maxsites is increased. I am wondering if there is a setting that avoids maxsites and reports everything above some score.
Leave a comment:
GenoMax replied

03-15-2018, 04:00 PM
Have you tried ambig=all?
Leave a comment:
darthsequencer replied

03-15-2018, 12:38 PM
Hi Brian - What are the right flags to set if I want all of the top scoring mappings for paired end reads?

Thanks
Leave a comment:
kimng replied

03-15-2018, 06:52 AM
Thanks for the input but same issue using sam=1.3, I think Geno's comment on something he intended to do is on the mark.
Leave a comment:
HESmith replied

03-15-2018, 06:48 AM
You can try the reformat.sh command per @GenoMax but with 'sam=1.3' flag. That worked for me with a different variant caller (FreeBayes).
Leave a comment:
kimng replied

03-15-2018, 05:36 AM
Thanks for the info didn't realise that was the site for tickets so appreciate it, I'll create a ticket there.
Leave a comment:
GenoMax replied

03-15-2018, 05:25 AM
It was worth a try.

Based on the `TODO` tag I see perhaps this is something @Brian intends to work on. Unfortunately he has not been participating in forums of late so there is no guarantee as to when this question may be addressed. You could try creating a ticket at BBMap SF repository for this question.
Leave a comment:
kimng replied

03-15-2018, 04:34 AM
Appreciate the reply but unfortunately it leads to the same error:

java -ea -Xmx200m -cp /home/ssi.ad/kimn/.conda/envs/env_kim/opt/bbmap-37.90/current/ jgi.ReformatReads in=minimap.sam out=new.bam sam=1.4
Executing jgi.ReformatReads [in=minimap.sam, out=new.bam, sam=1.4]

Input is being processed as unpaired
java.lang.AssertionError: TODO: Encountered a read with 'M' in cigar string but no MD tag.
java.lang.AssertionError: TODO: Encountered a read with 'M' in cigar string but no MD tag.
at stream.SamLine.toShortMatch(SamLine.java:1200)
at stream.SamLine.toRead(SamLine.java:1972)
at stream.SamLine.toRead(SamLine.java:1832)
at stream.SamReadInputStream.toReadList(SamReadInputStream.java:118)
at stream.SamReadInputStream.fillBuffer(SamReadInputStream.java:89)
at stream.SamReadInputStream.hasMore(SamReadInputStream.java:53)
at stream.ConcurrentGenericReadInputStream$ReadThread.readLists(ConcurrentGenericReadInputStream.java:664)
at stream.ConcurrentGenericReadInputStream$ReadThread.run(ConcurrentGenericReadInputStream.java:653)
Leave a comment:
GenoMax replied

03-15-2018, 04:21 AM
You can try reformatting your alignment file using "reformat.sh in=you.bam out=new.bam sam=1.4" to see if that helps.
Leave a comment:
kimng replied

03-15-2018, 12:52 AM
Bug in call variants with relation to sam files created from minimap2

Hello, I've been using different parts of the BBMap suite and mixing them with other softwares for WGS of bacterial samples (primarily for the purpose of automated QC). I'm a fan of callvariants.sh as it works quickly and doesn't require me to sort .sam files before hand and it's simple to clearfilters to return all values. Unfortunately when using callvariants.sh with .sam files created from minimap2 (https://github.com/lh3/minimap2) using illumina nextseq reads. I receive the error seen below. I was curious if this was a minor bug that could be corrected?

I'm running callvariants.sh from bbmap v37.90 obtained from conda (https://anaconda.org/bioconda/bbmap). Appreciate any assistance given.

-Kim

Error:
java -ea -Xmx48779m -Xms48779m -cp /home/ssi.ad/kimn/.conda/envs/env_kim/opt/bbmap-37.90/current/ var2.CallVariants in=aln.sam ref=contigs.fasta vcf=minimap.vcf ploidy=1 clearfilters
Executing var2.CallVariants [in=aln.sam, ref=contigs.fasta, vcf=minimap.vcf, ploidy=1, clearfilters]

Loading reference.
Time: 0.304 seconds.
Processing input files.
Exception in thread "Thread-3" Exception in thread "Thread-4" Exception in thread "Thread-5" java.lang.AssertionError: TODO: Encountered a read with 'M' in cigar string but no MD tag.
at stream.SamLine.toShortMatch(SamLine.java:1200)
at stream.SamLine.toRead(SamLine.java:1972)
at stream.SamLine.toRead(SamLine.java:1832)
at stream.SamReadStreamer$ProcessThread.makeReads(SamReadStreamer.java:206)
at stream.SamReadStreamer$ProcessThread.run(SamReadStreamer.java:135)
java.lang.AssertionError: TODO: Encountered a read with 'M' in cigar string but no MD tag.
at stream.SamLine.toShortMatch(SamLine.java:1200)
at stream.SamLine.toRead(SamLine.java:1972)
at stream.SamLine.toRead(SamLine.java:1832)
at stream.SamReadStreamer$ProcessThread.makeReads(SamReadStreamer.java:206)
at stream.SamReadStreamer$ProcessThread.run(SamReadStreamer.java:135)
java.lang.AssertionError: TODO: Encountered a read with 'M' in cigar string but no MD tag.
at stream.SamLine.toShortMatch(SamLine.java:1200)
at stream.SamLine.toRead(SamLine.java:1972)
at stream.SamLine.toRead(SamLine.java:1832)
at stream.SamReadStreamer$ProcessThread.makeReads(SamReadStreamer.java:206)
at stream.SamReadStreamer$ProcessThread.run(SamReadStreamer.java:135)
Leave a comment:

Previous 1 6 7 8 9 10 11 12 19 34 template Next

Essential Discoveries and Tools in Epitranscriptomics

by seqadmin

The field of epigenetics has traditionally concentrated more on DNA and how changes like methylation and phosphorylation of histones impact gene expression and regulation. However, our increased understanding of RNA modifications and their importance in cellular processes has led to a rise in epitranscriptomics research. “Epitranscriptomics brings together the concepts of epigenetics and gene expression,” explained Adrien Leger, PhD, Principal Research Scientist...
- Channel: Articles
04-22-2024, 07:01 AM
Current Approaches to Protein Sequencing

by seqadmin

Proteins are often described as the workhorses of the cell, and identifying their sequences is key to understanding their role in biological processes and disease. Currently, the most common technique used to determine protein sequences is mass spectrometry. While still a valuable tool, mass spectrometry faces several limitations and requires a highly experienced scientist familiar with the equipment to operate it. Additionally, other proteomic methods, like affinity assays, are constrained...
- Channel: Articles
04-04-2024, 04:25 PM

Topics	Statistics	Last Post
Genetic Variants and Diabetes Risk in Childhood Cancer Survivors by seqadmin Started by seqadmin, Yesterday, 08:47 AM	0 responses 13 views 0 likes	Last Post by seqadmin Yesterday, 08:47 AM
Cancer Metastasis: A Deep Dive into Cellular Plasticity by seqadmin Started by seqadmin, 04-11-2024, 12:08 PM	0 responses 60 views 0 likes	Last Post by seqadmin 04-11-2024, 12:08 PM
Proteogenomic Profiles Offer New Clues in Prostate Cancer by seqadmin Started by seqadmin, 04-10-2024, 10:19 PM	0 responses 60 views 0 likes	Last Post by seqadmin 04-10-2024, 10:19 PM
Novel Diagnostic Assay Enhances Ovarian Cancer Detection by seqadmin Started by seqadmin, 04-10-2024, 09:21 AM	0 responses 54 views 0 likes	Last Post by seqadmin 04-10-2024, 09:21 AM

Seqanswers Leaderboard Ad

Announcement

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment:

Latest Articles

ad_right_rmr

News