Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • mcauchy
    replied
    Newbie here! I have unzipped and untared bbmap but it wont run any commands. I have a Linux virtual box in windows 10. Am I missing some software to use BBMap?

    Leave a comment:


  • FridaJoh
    replied
    Hi

    I came across this when searching for a way to demultiplex non-overlapping paired end reads that were sequenced using combinatorial barcodes. I don't suppose there is a way of doing that somehow using seal (or other tools?).

    Originally posted by Brian Bushnell View Post
    It is almost possible to do this with Seal, which outputs reads into bins based on kmer matching.

    seal.sh in=reads.fq pattern=%.fq k=6 restrictleft=6 mm=f ref=barcodes.fa rcomp=f

    That would require a file "barcodes.fa" like this:
    >AACTGA
    AACTGA
    >GGCCTT
    GGCCTT

    etc., with one fasta entry per barcode, so the output reads would be in file AACTGA.fq and so forth. This is sort of a common request, so maybe I will make it unnecessary to provide a fasta file of the barcodes. Does that matter to you either way?

    However, BBDuk has the flags "skipr1" and "skipr2", which allow it to only do kmer operations on one read or the other. Seal currently lacks this, but it's essential for processing inline barcodes. I'll add it for the next release.

    Leave a comment:


  • Brian Bushnell
    replied
    Reformat should never run out of memory with the default settings and short (<200kbp) reads. I think the input file is corrupt, and should be re-downloaded. The corruption probably occurs somewhere around the 32.77 millionth read, but it's hard to be sure...

    Leave a comment:


  • GenoMax
    replied
    Looks like you only ran this with 200MB of RAM. Can you try with -Xmx2g?

    How old is this data BTW (in years)?

    Leave a comment:


  • DNA Sorcerer
    replied
    See below. I run it for only one fo the files because doing both would go over my storage quota.

    java -da -Xmx200m -cp /home/cslamovi/CLARKSCV1.2.2-b/bbmap/current/ jgi.ReformatReads -da ibq qin=33 in=scratch/s_3_1_sequence.fastq out=scratch/fixed_1.fq
    Executing jgi.ReformatReads [-da, ibq, qin=33, in=scratch/s_3_1_sequence.fastq, out=scratch/fixed_1.fq]

    Input is being processed as unpaired
    java.lang.OutOfMemoryError: Java heap space
    at java.util.Arrays.copyOf(Arrays.java:2786)
    at fileIO.ByteFile1.fillBuffer(ByteFile1.java:180)
    at fileIO.ByteFile1.nextLine(ByteFile1.java:136)
    at stream.FASTQ.toReadList(FASTQ.java:648)
    at stream.FastqReadInputStream.fillBuffer(FastqReadInputStream.java:111)
    at stream.FastqReadInputStream.nextList(FastqReadInputStream.java:96)
    at stream.ConcurrentGenericReadInputStream$ReadThread.readLists(ConcurrentGenericReadInputStream.java:656)
    at stream.ConcurrentGenericReadInputStream$ReadThread.run(ConcurrentGenericReadInputStream.java:635)
    Input: 32775200 reads 3539721600 bases
    Output: 32775200 reads (100.00%) 3539721600 bases (100.00%)

    Time: 807.191 seconds.
    Reads Processed: 32775k 40.60k reads/sec
    Bases Processed: 3539m 4.39m bases/sec
    Exception in thread "main" java.lang.RuntimeException: ReformatReads terminated in an error state; the output may be corrupt.
    at jgi.ReformatReads.process(ReformatReads.java:1032)
    at jgi.ReformatReads.main(ReformatReads.java:45)

    Leave a comment:


  • Brian Bushnell
    replied
    Hmm... can you post the errors? Reformat by default uses very little memory, which is all that should be needed for a correctly-formatted file containing reads. It will run out of memory if you use it without increasing the default memory allocation on extremely long sequences (over tens of megabases) such as the human genome. It will never run out of memory on a correctly-formatted Illumina fastq file.

    So, it would also be helpful if you could post the results of "head" (the first 10 lines of the file).

    Leave a comment:


  • DNA Sorcerer
    replied
    testformat says: illumina fastq raw single-ended 108bp

    As far as I remember this was a HiSeq run.

    I tried the reformat line suggested by Brian but the process stops after a while with errors. Apparently short of memory. Will try to improve that and try again.

    Leave a comment:


  • GenoMax
    replied
    Originally posted by DNA Sorcerer View Post
    Thank Brian,

    You are probably right in your suspicion. The read files are old, probably 2011 or so, and were originally .txt and I changed to .fastq. Would this have confused BBmap?
    Historically there has been more than one format for Q-scores for fastq files so you need to let BBMap know which score scale to use (most new data is in Sanger fastq format). More on the encoding here: https://en.wikipedia.org/wiki/FASTQ_format#Encoding

    @Brian also has a program to identify the correct encoding for your files. That example is in post # 1 of this thread.

    Code:
    $ testformat.sh in=seq.fq.gz

    Leave a comment:


  • DNA Sorcerer
    replied
    Thank Brian,

    You are probably right in your suspicion. The read files are old, probably 2011 or so, and were originally .txt and I changed to .fastq. Would this have confused BBmap?

    Leave a comment:


  • Brian Bushnell
    replied
    Your error message indicates a problem with the quality scores (admittedly, I should improve the error message from that assertion, but I've never seen it happen before!). Specifically, there are some N bases (no-calls) that have a quality of greater than 0. This could mean that you have an ASCII-64 file being processed as ASCII-33, or it could mean... well, something else, like bug in the base-calling software.

    If the problem is just Ns with nonzero quality scores, you can fix the files by running them through Reformat (this command assumes the file is ASCII-33/Sanger):

    reformat.sh -da ibq qin=33 in=scratch/s_3_#_sequence.fastq out=fixed_#.fq

    Do you happen to know what the quality encoding is of these files? For example, what platform are they from, and how old are they?

    Leave a comment:


  • GenoMax
    replied
    You don't have to use all of the cores/threads available on a physical server (some clusters may be setup in a way that require you to reserve exclusive access to a node to do that).

    I generally find that threads=6 works plenty fast (BBMap is well written). You may run into some I/O limits unless you have a high performance file system that goes along with your HPC cluster.

    Leave a comment:


  • DNA Sorcerer
    replied
    Originally posted by GenoMax View Post
    Having a multi-threaded job spread out across physical nodes would not be advisable and may lead to the kind of errors you are seeing above.
    Thanks, I'll take a closer look at the cluster's manual and try again.

    Leave a comment:


  • GenoMax
    replied
    Do you know how many cores there are on the node that you are running this on (16 was an example, you may have more or less available on the hardware you have access to)? Are you providing the necessary request for using all available cores on a node in your SGE command line?

    Having a multi-threaded job spread out across physical nodes would not be advisable and may lead to the kind of errors you are seeing above.

    Leave a comment:


  • DNA Sorcerer
    replied
    Thanks Brian,
    I got it run on 16 threads, but after a little while I get these sort of errors:

    Started 16 mapping threads.
    Exception in thread "Thread-7" java.lang.AssertionError
    at stream.Read.expectedErrors(Read.java:2099)
    at stream.Read.avgQualityByProbability(Read.java:1713)
    at stream.Read.avgQualityByProbability(Read.java:1706)
    at stream.Read.avgQuality(Read.java:1700)
    at align2.AbstractMapThread.quickMap(AbstractMapThread.java:688)
    at align2.BBMapThread.processReadPair(BBMapThread.java:953)
    at align2.AbstractMapThread.run(AbstractMapThread.java:508)
    Detecting finished threads: 0Exception in thread "Thread-13" java.lang.AssertionError
    at stream.Read.expectedErrors(Read.java:2099)
    at stream.Read.avgQualityByProbability(Read.java:1713)
    at stream.Read.avgQualityByProbability(Read.java:1706)
    at stream.Read.avgQuality(Read.java:1700)
    at align2.AbstractMapThread.quickMap(AbstractMapThread.java:688)
    at align2.BBMapThread.processReadPair(BBMapThread.java:953)
    at align2.AbstractMapThread.run(AbstractMapThread.java:508)
    Exception in thread "Thread-12" java.lang.AssertionError
    at stream.Read.expectedErrors(Read.java:2099)
    at stream.Read.avgQualityByProbability(Read.java:1713)
    at stream.Read.avgQualityByProbability(Read.java:1706)
    at stream.Read.avgQuality(Read.java:1700)
    at align2.AbstractMapThread.quickMap(AbstractMapThread.java:688)
    at align2.BBMapThread.processReadPair(BBMapThread.java:953)
    at align2.AbstractMapThread.run(AbstractMapThread.java:508)
    Before this run the program suggested using ignorebadquality, which I did for this run.

    Leave a comment:


  • Brian Bushnell
    replied
    Also, note this line:

    Code:
    Started 1 mapping thread.
    This means it is running using only 1 thread. Depending on your input size, it could take a while! If that's intentional, then it's fine. But BBMap is fully multithreaded so normally I request all available slots on a node when scheduling, and tell BBMap to use all of them with the flag "threads=16" (for example). It will normally autodetect the number of available processors and use all of them, but on SGE/UGE (which I think you are using), if the NSLOTS environment variable is set, it will cap the max threads at that value (so for example on a 16-core machine, if you request 1 slot, such that "echo $NSLOTS" returns 1, then unless you manually specify the number of threads it will only use 1, to ensure fairness).

    I recommend you ssh into the node it's running on and run "top -c", then look for a java process and ensure it is using CPU resources (should be at 100%). You can also examine the output file; it should gradually be growing.
    Last edited by Brian Bushnell; 02-18-2016, 07:55 PM.

    Leave a comment:

Latest Articles

Collapse

  • seqadmin
    Recent Advances in Sequencing Analysis Tools
    by seqadmin


    The sequencing world is rapidly changing due to declining costs, enhanced accuracies, and the advent of newer, cutting-edge instruments. Equally important to these developments are improvements in sequencing analysis, a process that converts vast amounts of raw data into a comprehensible and meaningful form. This complex task requires expertise and the right analysis tools. In this article, we highlight the progress and innovation in sequencing analysis by reviewing several of the...
    05-06-2024, 07:48 AM
  • seqadmin
    Essential Discoveries and Tools in Epitranscriptomics
    by seqadmin




    The field of epigenetics has traditionally concentrated more on DNA and how changes like methylation and phosphorylation of histones impact gene expression and regulation. However, our increased understanding of RNA modifications and their importance in cellular processes has led to a rise in epitranscriptomics research. “Epitranscriptomics brings together the concepts of epigenetics and gene expression,” explained Adrien Leger, PhD, Principal Research Scientist...
    04-22-2024, 07:01 AM

ad_right_rmr

Collapse

News

Collapse

Topics Statistics Last Post
Started by seqadmin, 05-14-2024, 07:03 AM
0 responses
23 views
0 likes
Last Post seqadmin  
Started by seqadmin, 05-10-2024, 06:35 AM
0 responses
44 views
0 likes
Last Post seqadmin  
Started by seqadmin, 05-09-2024, 02:46 PM
0 responses
57 views
0 likes
Last Post seqadmin  
Started by seqadmin, 05-07-2024, 06:57 AM
0 responses
44 views
0 likes
Last Post seqadmin  
Working...
X