Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • SNPsaurus
    replied
    The reference in this case is a list of RAD loci sequences, each at 150 bp. The reads are 150 bp and should match one of the RAD loci. Since there are no N (or degenerate nucleotides) in the reference, and only a small number of reads with an N, it must have to do with the padding. A trimmed read would not have N padding, would it? This sample was degraded DNA, so plenty of reads were trimmed to less than 150 nucleotides. I doubt that 72% of the loci from this sample had an insertion that made the read extend past the reference (generated from consensus of many samples). Odd, but I guess it is not critical for me to fully figure this out.

    Leave a comment:


  • Brian Bushnell
    replied
    Originally posted by kimng View Post
    Hello, I've been using different parts of the BBMap suite and mixing them with other softwares for WGS of bacterial samples (primarily for the purpose of automated QC). I'm a fan of callvariants.sh as it works quickly and doesn't require me to sort .sam files before hand and it's simple to clearfilters to return all values. Unfortunately when using callvariants.sh with .sam files created from minimap2 (https://github.com/lh3/minimap2) using illumina nextseq reads. I receive the error seen below. I was curious if this was a minor bug that could be corrected?

    I'm running callvariants.sh from bbmap v37.90 obtained from conda (https://anaconda.org/bioconda/bbmap). Appreciate any assistance given.

    -Kim

    Error:
    java -ea -Xmx48779m -Xms48779m -cp /home/ssi.ad/kimn/.conda/envs/env_kim/opt/bbmap-37.90/current/ var2.CallVariants in=aln.sam ref=contigs.fasta vcf=minimap.vcf ploidy=1 clearfilters
    Executing var2.CallVariants [in=aln.sam, ref=contigs.fasta, vcf=minimap.vcf, ploidy=1, clearfilters]

    Loading reference.
    Time: 0.304 seconds.
    Processing input files.
    Exception in thread "Thread-3" Exception in thread "Thread-4" Exception in thread "Thread-5" java.lang.AssertionError: TODO: Encountered a read with 'M' in cigar string but no MD tag.
    at stream.SamLine.toShortMatch(SamLine.java:1200)
    at stream.SamLine.toRead(SamLine.java:1972)
    at stream.SamLine.toRead(SamLine.java:1832)
    at stream.SamReadStreamer$ProcessThread.makeReads(SamReadStreamer.java:206)
    at stream.SamReadStreamer$ProcessThread.run(SamReadStreamer.java:135)
    java.lang.AssertionError: TODO: Encountered a read with 'M' in cigar string but no MD tag.
    at stream.SamLine.toShortMatch(SamLine.java:1200)
    at stream.SamLine.toRead(SamLine.java:1972)
    at stream.SamLine.toRead(SamLine.java:1832)
    at stream.SamReadStreamer$ProcessThread.makeReads(SamReadStreamer.java:206)
    at stream.SamReadStreamer$ProcessThread.run(SamReadStreamer.java:135)
    java.lang.AssertionError: TODO: Encountered a read with 'M' in cigar string but no MD tag.
    at stream.SamLine.toShortMatch(SamLine.java:1200)
    at stream.SamLine.toRead(SamLine.java:1972)
    at stream.SamLine.toRead(SamLine.java:1832)
    at stream.SamReadStreamer$ProcessThread.makeReads(SamReadStreamer.java:206)
    at stream.SamReadStreamer$ProcessThread.run(SamReadStreamer.java:135)
    The problem here is that minimap uses old-style cigar strings (M symbol instead of = and X) and also does not produce MD tags. I've added the ability to handle reads in that situation and it will be released in v37.95 (soon).

    Prior to that you could probably add MD tags with "samtools calmd".

    Leave a comment:


  • Brian Bushnell
    replied
    This means 72 percent of the reads mapped with an "N" symbol in the match string, an internal data structure similar to a cigar string. The "N" symbol denotes either an N in the read or an N in the reference, which can include sequences that go off the end of scaffolds (the ends are internally padded with Ns). Additionally, degenerate IUPAC codes are counted as Ns.

    Leave a comment:


  • SNPsaurus
    replied
    I have a question about the bbmap summary output. Here is an example:

    Code:
    Reads Used:             1061591 (151284618 bases)
    
    Mapping:                15.250 seconds.
    Reads/sec:              69611.57
    kBases/sec:             9920.17
    
    
    Read 1 data:            pct reads       num reads       pct bases          num bases
    
    mapped:                  76.2432%          809391        77.6086%          117409909
    unambiguous:             75.3885%          800318        76.8084%          116199316
    ambiguous:                0.8547%            9073         0.8002%            1210593
    low-Q discards:           0.0000%               0         0.0000%                  0
    
    perfect best site:       11.4049%          121073         9.7282%           14717263
    semiperfect site:        58.9222%          625513        60.0540%           90852423
    
    Match Rate:                   NA               NA        99.1429%          116413381
    Error Rate:              20.6634%          183594         0.2527%             296692
    Sub Rate:                19.9279%          177059         0.2369%             278137
    Del Rate:                 0.6244%            5548         0.0084%               9819
    Ins Rate:                 0.6302%            5599         0.0074%               8736
    N Rate:                  72.0298%          639985         0.6044%             709655
    What does an N rate of 72% mean in this context? The reference has 0 bases that are N. There are 20,000 out of the 1 million reads that contain an N. The N Rate percent bases of 0.6% seems high compared to 20,000 reads with 1 or 2 Ns, but closer to what I think I see than the 72% N for reads.

    Leave a comment:


  • SNPsaurus
    replied
    Originally posted by darthsequencer View Post
    Yeah - I have but from what I can tell it maxes out at 6 reports unless maxsites is increased. I am wondering if there is a setting that avoids maxsites and reports everything above some score.
    You might try the align2.BBMapPacBioSkimmer mapper with ambig=all and expectedsites= some high number. Not exactly what you were asking but I get dozens of hits with that.

    Leave a comment:


  • darthsequencer
    replied
    Originally posted by GenoMax View Post
    Have you tried ambig=all?
    Yeah - I have but from what I can tell it maxes out at 6 reports unless maxsites is increased. I am wondering if there is a setting that avoids maxsites and reports everything above some score.

    Leave a comment:


  • GenoMax
    replied
    Have you tried ambig=all?

    Leave a comment:


  • darthsequencer
    replied
    Hi Brian - What are the right flags to set if I want all of the top scoring mappings for paired end reads?

    Thanks

    Leave a comment:


  • kimng
    replied
    Thanks for the input but same issue using sam=1.3, I think Geno's comment on something he intended to do is on the mark.

    Leave a comment:


  • HESmith
    replied
    You can try the reformat.sh command per @GenoMax but with 'sam=1.3' flag. That worked for me with a different variant caller (FreeBayes).

    Leave a comment:


  • kimng
    replied
    Thanks for the info didn't realise that was the site for tickets so appreciate it, I'll create a ticket there.

    Leave a comment:


  • GenoMax
    replied
    It was worth a try.

    Based on the `TODO` tag I see perhaps this is something @Brian intends to work on. Unfortunately he has not been participating in forums of late so there is no guarantee as to when this question may be addressed. You could try creating a ticket at BBMap SF repository for this question.

    Leave a comment:


  • kimng
    replied
    Appreciate the reply but unfortunately it leads to the same error:

    java -ea -Xmx200m -cp /home/ssi.ad/kimn/.conda/envs/env_kim/opt/bbmap-37.90/current/ jgi.ReformatReads in=minimap.sam out=new.bam sam=1.4
    Executing jgi.ReformatReads [in=minimap.sam, out=new.bam, sam=1.4]

    Input is being processed as unpaired
    java.lang.AssertionError: TODO: Encountered a read with 'M' in cigar string but no MD tag.
    java.lang.AssertionError: TODO: Encountered a read with 'M' in cigar string but no MD tag.
    at stream.SamLine.toShortMatch(SamLine.java:1200)
    at stream.SamLine.toRead(SamLine.java:1972)
    at stream.SamLine.toRead(SamLine.java:1832)
    at stream.SamReadInputStream.toReadList(SamReadInputStream.java:118)
    at stream.SamReadInputStream.fillBuffer(SamReadInputStream.java:89)
    at stream.SamReadInputStream.hasMore(SamReadInputStream.java:53)
    at stream.ConcurrentGenericReadInputStream$ReadThread.readLists(ConcurrentGenericReadInputStream.java:664)
    at stream.ConcurrentGenericReadInputStream$ReadThread.run(ConcurrentGenericReadInputStream.java:653)

    Leave a comment:


  • GenoMax
    replied
    You can try reformatting your alignment file using "reformat.sh in=you.bam out=new.bam sam=1.4" to see if that helps.

    Leave a comment:


  • kimng
    replied
    Bug in call variants with relation to sam files created from minimap2

    Hello, I've been using different parts of the BBMap suite and mixing them with other softwares for WGS of bacterial samples (primarily for the purpose of automated QC). I'm a fan of callvariants.sh as it works quickly and doesn't require me to sort .sam files before hand and it's simple to clearfilters to return all values. Unfortunately when using callvariants.sh with .sam files created from minimap2 (https://github.com/lh3/minimap2) using illumina nextseq reads. I receive the error seen below. I was curious if this was a minor bug that could be corrected?

    I'm running callvariants.sh from bbmap v37.90 obtained from conda (https://anaconda.org/bioconda/bbmap). Appreciate any assistance given.

    -Kim

    Error:
    java -ea -Xmx48779m -Xms48779m -cp /home/ssi.ad/kimn/.conda/envs/env_kim/opt/bbmap-37.90/current/ var2.CallVariants in=aln.sam ref=contigs.fasta vcf=minimap.vcf ploidy=1 clearfilters
    Executing var2.CallVariants [in=aln.sam, ref=contigs.fasta, vcf=minimap.vcf, ploidy=1, clearfilters]

    Loading reference.
    Time: 0.304 seconds.
    Processing input files.
    Exception in thread "Thread-3" Exception in thread "Thread-4" Exception in thread "Thread-5" java.lang.AssertionError: TODO: Encountered a read with 'M' in cigar string but no MD tag.
    at stream.SamLine.toShortMatch(SamLine.java:1200)
    at stream.SamLine.toRead(SamLine.java:1972)
    at stream.SamLine.toRead(SamLine.java:1832)
    at stream.SamReadStreamer$ProcessThread.makeReads(SamReadStreamer.java:206)
    at stream.SamReadStreamer$ProcessThread.run(SamReadStreamer.java:135)
    java.lang.AssertionError: TODO: Encountered a read with 'M' in cigar string but no MD tag.
    at stream.SamLine.toShortMatch(SamLine.java:1200)
    at stream.SamLine.toRead(SamLine.java:1972)
    at stream.SamLine.toRead(SamLine.java:1832)
    at stream.SamReadStreamer$ProcessThread.makeReads(SamReadStreamer.java:206)
    at stream.SamReadStreamer$ProcessThread.run(SamReadStreamer.java:135)
    java.lang.AssertionError: TODO: Encountered a read with 'M' in cigar string but no MD tag.
    at stream.SamLine.toShortMatch(SamLine.java:1200)
    at stream.SamLine.toRead(SamLine.java:1972)
    at stream.SamLine.toRead(SamLine.java:1832)
    at stream.SamReadStreamer$ProcessThread.makeReads(SamReadStreamer.java:206)
    at stream.SamReadStreamer$ProcessThread.run(SamReadStreamer.java:135)

    Leave a comment:

Latest Articles

Collapse

  • seqadmin
    Essential Discoveries and Tools in Epitranscriptomics
    by seqadmin




    The field of epigenetics has traditionally concentrated more on DNA and how changes like methylation and phosphorylation of histones impact gene expression and regulation. However, our increased understanding of RNA modifications and their importance in cellular processes has led to a rise in epitranscriptomics research. “Epitranscriptomics brings together the concepts of epigenetics and gene expression,” explained Adrien Leger, PhD, Principal Research Scientist...
    04-22-2024, 07:01 AM
  • seqadmin
    Current Approaches to Protein Sequencing
    by seqadmin


    Proteins are often described as the workhorses of the cell, and identifying their sequences is key to understanding their role in biological processes and disease. Currently, the most common technique used to determine protein sequences is mass spectrometry. While still a valuable tool, mass spectrometry faces several limitations and requires a highly experienced scientist familiar with the equipment to operate it. Additionally, other proteomic methods, like affinity assays, are constrained...
    04-04-2024, 04:25 PM

ad_right_rmr

Collapse

News

Collapse

Topics Statistics Last Post
Started by seqadmin, Yesterday, 08:47 AM
0 responses
13 views
0 likes
Last Post seqadmin  
Started by seqadmin, 04-11-2024, 12:08 PM
0 responses
60 views
0 likes
Last Post seqadmin  
Started by seqadmin, 04-10-2024, 10:19 PM
0 responses
60 views
0 likes
Last Post seqadmin  
Started by seqadmin, 04-10-2024, 09:21 AM
0 responses
54 views
0 likes
Last Post seqadmin  
Working...
X