Yes .. BBMap can do that!

mcauchy replied

03-09-2016, 05:30 AM
Newbie here! I have unzipped and untared bbmap but it wont run any commands. I have a Linux virtual box in windows 10. Am I missing some software to use BBMap?
Leave a comment:
FridaJoh replied

03-09-2016, 02:25 AM
Hi

I came across this when searching for a way to demultiplex non-overlapping paired end reads that were sequenced using combinatorial barcodes. I don't suppose there is a way of doing that somehow using seal (or other tools?).

Originally posted by Brian Bushnell View Post

It is almost possible to do this with Seal, which outputs reads into bins based on kmer matching.

seal.sh in=reads.fq pattern=%.fq k=6 restrictleft=6 mm=f ref=barcodes.fa rcomp=f

That would require a file "barcodes.fa" like this:
>AACTGA
AACTGA
>GGCCTT
GGCCTT

etc., with one fasta entry per barcode, so the output reads would be in file AACTGA.fq and so forth. This is sort of a common request, so maybe I will make it unnecessary to provide a fasta file of the barcodes. Does that matter to you either way?

However, BBDuk has the flags "skipr1" and "skipr2", which allow it to only do kmer operations on one read or the other. Seal currently lacks this, but it's essential for processing inline barcodes. I'll add it for the next release.
Leave a comment:
Brian Bushnell replied

02-23-2016, 09:20 PM
Reformat should never run out of memory with the default settings and short (<200kbp) reads. I think the input file is corrupt, and should be re-downloaded. The corruption probably occurs somewhere around the 32.77 millionth read, but it's hard to be sure...
Leave a comment:
GenoMax replied

02-23-2016, 06:21 AM
Looks like you only ran this with 200MB of RAM. Can you try with -Xmx2g?

How old is this data BTW (in years)?
Leave a comment:
DNA Sorcerer replied

02-23-2016, 05:51 AM
See below. I run it for only one fo the files because doing both would go over my storage quota.

java -da -Xmx200m -cp /home/cslamovi/CLARKSCV1.2.2-b/bbmap/current/ jgi.ReformatReads -da ibq qin=33 in=scratch/s_3_1_sequence.fastq out=scratch/fixed_1.fq
Executing jgi.ReformatReads [-da, ibq, qin=33, in=scratch/s_3_1_sequence.fastq, out=scratch/fixed_1.fq]

Input is being processed as unpaired
java.lang.OutOfMemoryError: Java heap space
at java.util.Arrays.copyOf(Arrays.java:2786)
at fileIO.ByteFile1.fillBuffer(ByteFile1.java:180)
at fileIO.ByteFile1.nextLine(ByteFile1.java:136)
at stream.FASTQ.toReadList(FASTQ.java:648)
at stream.FastqReadInputStream.fillBuffer(FastqReadInputStream.java:111)
at stream.FastqReadInputStream.nextList(FastqReadInputStream.java:96)
at stream.ConcurrentGenericReadInputStream$ReadThread.readLists(ConcurrentGenericReadInputStream.java:656)
at stream.ConcurrentGenericReadInputStream$ReadThread.run(ConcurrentGenericReadInputStream.java:635)
Input: 32775200 reads 3539721600 bases
Output: 32775200 reads (100.00%) 3539721600 bases (100.00%)

Time: 807.191 seconds.
Reads Processed: 32775k 40.60k reads/sec
Bases Processed: 3539m 4.39m bases/sec
Exception in thread "main" java.lang.RuntimeException: ReformatReads terminated in an error state; the output may be corrupt.
at jgi.ReformatReads.process(ReformatReads.java:1032)
at jgi.ReformatReads.main(ReformatReads.java:45)
Leave a comment:
Brian Bushnell replied

02-22-2016, 09:16 PM
Hmm... can you post the errors? Reformat by default uses very little memory, which is all that should be needed for a correctly-formatted file containing reads. It will run out of memory if you use it without increasing the default memory allocation on extremely long sequences (over tens of megabases) such as the human genome. It will never run out of memory on a correctly-formatted Illumina fastq file.

So, it would also be helpful if you could post the results of "head" (the first 10 lines of the file).
Leave a comment:
DNA Sorcerer replied

02-22-2016, 01:22 PM
testformat says: illumina fastq raw single-ended 108bp

As far as I remember this was a HiSeq run.

I tried the reformat line suggested by Brian but the process stops after a while with errors. Apparently short of memory. Will try to improve that and try again.
Leave a comment:
GenoMax replied

02-22-2016, 10:09 AM
Originally posted by DNA Sorcerer View Post

Thank Brian,

You are probably right in your suspicion. The read files are old, probably 2011 or so, and were originally .txt and I changed to .fastq. Would this have confused BBmap?

Historically there has been more than one format for Q-scores for fastq files so you need to let BBMap know which score scale to use (most new data is in Sanger fastq format). More on the encoding here: https://en.wikipedia.org/wiki/FASTQ_format#Encoding

@Brian also has a program to identify the correct encoding for your files. That example is in post # 1 of this thread.

Code:

$ testformat.sh in=seq.fq.gz
Leave a comment:
DNA Sorcerer replied

02-22-2016, 09:49 AM
Thank Brian,

You are probably right in your suspicion. The read files are old, probably 2011 or so, and were originally .txt and I changed to .fastq. Would this have confused BBmap?
Leave a comment:
Brian Bushnell replied

02-19-2016, 07:04 PM
Your error message indicates a problem with the quality scores (admittedly, I should improve the error message from that assertion, but I've never seen it happen before!). Specifically, there are some N bases (no-calls) that have a quality of greater than 0. This could mean that you have an ASCII-64 file being processed as ASCII-33, or it could mean... well, something else, like bug in the base-calling software.

If the problem is just Ns with nonzero quality scores, you can fix the files by running them through Reformat (this command assumes the file is ASCII-33/Sanger):

reformat.sh -da ibq qin=33 in=scratch/s_3_#_sequence.fastq out=fixed_#.fq

Do you happen to know what the quality encoding is of these files? For example, what platform are they from, and how old are they?
Leave a comment:
GenoMax replied

02-19-2016, 08:12 AM
You don't have to use all of the cores/threads available on a physical server (some clusters may be setup in a way that require you to reserve exclusive access to a node to do that).

I generally find that threads=6 works plenty fast (BBMap is well written). You may run into some I/O limits unless you have a high performance file system that goes along with your HPC cluster.
Leave a comment:
DNA Sorcerer replied

02-19-2016, 08:06 AM
Originally posted by GenoMax View Post

Having a multi-threaded job spread out across physical nodes would not be advisable and may lead to the kind of errors you are seeing above.

Thanks, I'll take a closer look at the cluster's manual and try again.
Leave a comment:
GenoMax replied

02-19-2016, 08:02 AM
Do you know how many cores there are on the node that you are running this on (16 was an example, you may have more or less available on the hardware you have access to)? Are you providing the necessary request for using all available cores on a node in your SGE command line?

Having a multi-threaded job spread out across physical nodes would not be advisable and may lead to the kind of errors you are seeing above.
Leave a comment:
DNA Sorcerer replied

02-19-2016, 07:42 AM
Thanks Brian,
I got it run on 16 threads, but after a little while I get these sort of errors:

Started 16 mapping threads.
Exception in thread "Thread-7" java.lang.AssertionError
at stream.Read.expectedErrors(Read.java:2099)
at stream.Read.avgQualityByProbability(Read.java:1713)
at stream.Read.avgQualityByProbability(Read.java:1706)
at stream.Read.avgQuality(Read.java:1700)
at align2.AbstractMapThread.quickMap(AbstractMapThread.java:688)
at align2.BBMapThread.processReadPair(BBMapThread.java:953)
at align2.AbstractMapThread.run(AbstractMapThread.java:508)
Detecting finished threads: 0Exception in thread "Thread-13" java.lang.AssertionError
at stream.Read.expectedErrors(Read.java:2099)
at stream.Read.avgQualityByProbability(Read.java:1713)
at stream.Read.avgQualityByProbability(Read.java:1706)
at stream.Read.avgQuality(Read.java:1700)
at align2.AbstractMapThread.quickMap(AbstractMapThread.java:688)
at align2.BBMapThread.processReadPair(BBMapThread.java:953)
at align2.AbstractMapThread.run(AbstractMapThread.java:508)
Exception in thread "Thread-12" java.lang.AssertionError
at stream.Read.expectedErrors(Read.java:2099)
at stream.Read.avgQualityByProbability(Read.java:1713)
at stream.Read.avgQualityByProbability(Read.java:1706)
at stream.Read.avgQuality(Read.java:1700)
at align2.AbstractMapThread.quickMap(AbstractMapThread.java:688)
at align2.BBMapThread.processReadPair(BBMapThread.java:953)
at align2.AbstractMapThread.run(AbstractMapThread.java:508)

Before this run the program suggested using ignorebadquality, which I did for this run.
Leave a comment:
Brian Bushnell replied

02-18-2016, 07:52 PM
Also, note this line:

Code:

Started 1 mapping thread.

This means it is running using only 1 thread. Depending on your input size, it could take a while! If that's intentional, then it's fine. But BBMap is fully multithreaded so normally I request all available slots on a node when scheduling, and tell BBMap to use all of them with the flag "threads=16" (for example). It will normally autodetect the number of available processors and use all of them, but on SGE/UGE (which I think you are using), if the NSLOTS environment variable is set, it will cap the max threads at that value (so for example on a 16-core machine, if you request 1 slot, such that "echo $NSLOTS" returns 1, then unless you manually specify the number of threads it will only use 1, to ensure fairness).

I recommend you ssh into the node it's running on and run "top -c", then look for a java process and ensure it is using CPU resources (should be at 100%). You can also examine the output file; it should gradually be growing.
Last edited by Brian Bushnell; 02-18-2016, 07:55 PM.
Leave a comment:

Previous 1 3 10 11 12 13 14 15 16 17 template Next

Recent Advances in Sequencing Analysis Tools

by seqadmin

The sequencing world is rapidly changing due to declining costs, enhanced accuracies, and the advent of newer, cutting-edge instruments. Equally important to these developments are improvements in sequencing analysis, a process that converts vast amounts of raw data into a comprehensible and meaningful form. This complex task requires expertise and the right analysis tools. In this article, we highlight the progress and innovation in sequencing analysis by reviewing several of the...
- Channel: Articles
05-06-2024, 07:48 AM
Essential Discoveries and Tools in Epitranscriptomics

by seqadmin

The field of epigenetics has traditionally concentrated more on DNA and how changes like methylation and phosphorylation of histones impact gene expression and regulation. However, our increased understanding of RNA modifications and their importance in cellular processes has led to a rise in epitranscriptomics research. “Epitranscriptomics brings together the concepts of epigenetics and gene expression,” explained Adrien Leger, PhD, Principal Research Scientist...
- Channel: Articles
04-22-2024, 07:01 AM

Topics	Statistics	Last Post
The Role of Spliceosomes in RNA Splicing and Genome Evolution by seqadmin Started by seqadmin, 05-14-2024, 07:03 AM	0 responses 24 views 0 likes	Last Post by seqadmin 05-14-2024, 07:03 AM
A Closer Look at the Enigmatic Genomes of Oikopleura dioica by seqadmin Started by seqadmin, 05-10-2024, 06:35 AM	0 responses 44 views 0 likes	Last Post by seqadmin 05-10-2024, 06:35 AM
Advanced Epigenome Editing Platform Explores Gene Regulation Mechanisms by seqadmin Started by seqadmin, 05-09-2024, 02:46 PM	0 responses 58 views 0 likes	Last Post by seqadmin 05-09-2024, 02:46 PM
Telomere Maintenance by PARP1: A New Perspective in Cancer Research by seqadmin Started by seqadmin, 05-07-2024, 06:57 AM	0 responses 44 views 0 likes	Last Post by seqadmin 05-07-2024, 06:57 AM

Seqanswers Leaderboard Ad

Announcement

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment:

Latest Articles

ad_right_rmr

News