Seqanswers Leaderboard Ad

**EssigSchurke** · 09-17-2017, 10:01 PM

Thanks for the fast response. I will try reformat.

**boulund** · 09-28-2017, 01:40 AM

Hi,

I got the idea to use BBDuk as a tool to filter out kmers that are shared between two samples. Is this a good idea? It's metagenomics, and the concept is that one sample is representing the background composition of bacteria, and I want to remove that background composition from another sample. One challenge to the problem is that the samples are rather big, about 2x35 GB compressed fastq (paired end), each containing about 1 billion reads total (both read pairs combined).

**Brian Bushnell** · 09-28-2017, 09:48 AM

Originally posted by boulund View Post

Hi,

I got the idea to use BBDuk as a tool to filter out kmers that are shared between two samples. Is this a good idea? It's metagenomics, and the concept is that one sample is representing the background composition of bacteria, and I want to remove that background composition from another sample. One challenge to the problem is that the samples are rather big, about 2x35 GB compressed fastq (paired end), each containing about 1 billion reads total (both read pairs combined).

You can find kmers that are shared between two samples like this (bearing in mind that it may take a lot of memory):

kcompress.sh in=a.fq.gz out=a_kmers.fa.gz
kcompress.sh in=b.fq.gz out=b_kmers.fa.gz
kcompress.sh in=a_kmers.fa.gz,b_kmers.fa.gz out=shared_kmers.fa.gz mincount=2

However, I think I'd probably tend to just assemble what you consider to be the background, and then map reads to the assembly requiring fairly high identity, keeping the reads that don't map. Either approach works (and also has disadvantages) but whole reads tend to be more specific than kmers.

Originally posted by EssigSchurke View Post

Thanks for the fast response. I will try reformat.

BBDuk is fixed now, by the way

**GenoMax** · 09-28-2017, 10:13 AM

"just assemble what you consider to be the background"

Assemble as?

**Brian Bushnell** · 09-28-2017, 10:27 AM

To clarify, I meant running Megahit (for example) on the reads of the sample considered to be the background, then mapping the reads of the other sample to that assembly.

**boulund** · 09-28-2017, 10:32 PM

Ok, that's interesting. Our current approach was just that; assembly of the background sample with Megahit, and then mapping the sample to be filtered against the background assembly to remove anything that matches. I was hoping it'd be possible to do it without the massive overhead of assembling the background sample, as that's fairly time consuming and memory hungry for these large samples. I will have to evaluate the different approaches against each other to see which one fits our setup the best. Thanks for your input, and thanks for pointing me to kcompress.sh!

**dariober** · 10-15-2017, 06:30 AM

Different output with interleaved input

Hello again-

I think there is an inconsistent behavior in how bbduk handles interleaved input depending on whether the interleaved option is set to "true" or "auto".

Consider the case where only one read in a pair is discarded because too short.
With "interleaved=auto" we get in (interleaved) output only the read passing the filter, thus appearing as a single-end read. With "interleaved=true" both reads are discarded.

Is this difference intentional? In my opinion, interleaved=auto does the right thing in discarding only the bad read and keeping the other. However, this creates an interleaved output with single-end reads (which are actually paired-end but with the mate gone) intercalated in pair-end ones. I'm not sure if the interleaved format has ever been defined to allow such a case (for the record, bwa seems to handle it correctly).

I just thought useful to point this out...

This should reproduce the issue wih BBDuk version 37.54:

Code:

bbduk.sh in=int2.fq out=stdout.fq qtrim=rl minlength=35 trimq=15 interleaved=auto
bbduk.sh in=int2.fq out=stdout.fq qtrim=rl minlength=35 trimq=15 interleaved=true

With int2.fq being:

Code:

@r1
ATGGCATGCACCTGTAATCCCGCTACTTGTGAGGCTGAAGCAGGAGAAT
+
JJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJ
@r1
TAATTATATGTTTAAGTAAATGAGTAAAATTCAAGATTGCTATCGGATT
+
JJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJ
@s1
ATGGCATGCACCTGTAATCCCGCTACTTGTGAGGCTGAAGCAGGAGAAT
+
AAFFFJJJJJFJFJJJJJFAFJJJFJ7<FAJJJJJJJJAAFJJJJJJJJ
@s1
TAATTATATGTTTAAGTAAATGAGTAAAATTCAAGATTGCTATCGGATT
+
#################################################

**GenoMax** · 10-15-2017, 02:25 PM

@Dario: Have you tried this with a file that has the (/1 of old Illumina style of 1:N:0 of new read headers)? Interleaved files may need those identifiers to be there. They can be added by reformat.sh.

**dariober** · 10-16-2017, 12:28 AM

Originally posted by GenoMax View Post

@Dario: Have you tried this with a file that has the (/1 of old Illumina style of 1:N:0 of new read headers)? Interleaved files may need those identifiers to be there. They can be added by reformat.sh.

Hi Genomax- These are R1 and R2 real identifiers:

Code:

@E00295:75:H7LLTALXX:8:1101:4553:1643 1:N:0:8
@E00295:75:H7LLTALXX:8:1101:4553:1643 2:N:0:8

I think one issue is that the interleaved format has never been formally defined so it's not clear which of option of interleaved, auto or true, is doing the right thing.

By the way, with interleaved=auto bbduk gives to stderr the message:

Code:

Input is being processed as unpaired

If I append /1 and /2 to the read names then I get the same output as with interleaved=true (both reads discarded) and the message becomes:

Code:

Input is being processed as paired

so you are right /1 and /2 must be there for interleaved=auto to work as expected. Maybe it would be useful to make this behavior explicit in the documentation (unless I missed it)?

Having said that, I think discarding both reads in a pair when only one read fails is unnecessary.

**GenoMax** · 10-16-2017, 04:39 AM

Having said that, I think discarding both reads in a pair when only one read fails is unnecessary.

Not having reads in sync causes issues with mapping with most aligners (you note that bwa is an exception). I assume only case where R2 would be eliminated (while R1 remains) would be if you were trimming based on quality. If R2 contains adapter contamination then likely the entire insert is small and may result in elimination of R1 as well. I generally never have to trim data based on quality so have not run into this issue so far.

**boulund** · 10-25-2017, 10:51 PM

Hi!

Is it possible to get the various quality histograms for both before and after e.g. trimming with BBDuk in a single run, or do I need to run BBDuk it twice to produce metrics for before and after trimming?
That is; run once without any trimming, just outputting histograms, and then again to trim and output histograms? Or am I missing something? The histograms output by BBDuk normally show metrics after trimming/contaminant removal, right?

By the way, I might mention that I finally tried to assemble my very large background sample using Tadpole, and then align my primary sample to that to remove 'background/contamination' reads. It produced a fairly poor assembly overall, but at least it ran to completion on the 500GB background sample on my memory constrained machine (64GB). The kmer-based approach was just too memory consuming.

**mcmc** · 11-20-2017, 10:38 AM

Seal not printing outu file

Hi - I'm running seal to map reads to ref genomes. this is the command I ran:

Code:

seal in="${samplename}_nonribo.fq.gz" ref="${all4genomes}" pattern="${samplename}_out_%.fq.gz" outu="${samplename}_unmapped.fq.gz" ambig=all stats="${samplename}_mapstats.txt"

It's outputting the matched reads per reference sequence using the 'pattern', but it's not making the 'outu' file of unmapped reads (and there should be some unmapped reads, according to my stderr output).

Is there another trick to making this file?

Thanks,
MC

**catagui** · 01-19-2018, 12:06 PM

Hi I would like to use bbduk to filter the reads that map to a genome. I have 48 pair samples and want to make sure I understand how to input the file.
My sample are are Pair end and are labeled as F and R (plus _1 and _2).
Should I put them all after in= ? or use in2= for the reverse? Does interleave means that I put each pair together?
eg. in= S1_F_paired_1.fq,S1_R_paired_2.fq,S10_F_paired_1.fq,S10_R_paired_2.fq...

I have them as two separate lines at the moment:
S1_F_paired_1.fq,S10_F_paired_1.fq,S11_F_paired_1.fq,S12_F_paired_1.fq,S13_F_paired_1.fq..
S1_R_paired_2.fq,S10_R_paired_2.fq,S11_R_paired_2.fq,S12_R_paired_2.fq,S13_R_paired_2.fq..

Thanks,

Catalina

**DrYak** · 02-12-2018, 01:56 AM

java error when trying to filter on entropy

Hi,

I am getting an error when trying to use BBDuk to filter based on entropy. I have previously filtered the dataset for phix and for adapters with no issues before so I'm a bit confused as to why it won't work now.

I'm working on a node with 12 Gb RAM so the 8 Gb called shouldn't be an issue. I got a similar error without the -Xmx flag.

Running CentOS Linux release 7.3.1611
java version "1.7.0_131"

My fq file contains ~135 million 100 bp unpaired sequences.

command and error logs as follows:

Code:

$ bbduk.sh -Xmx8g in=seq.fq out=seq_0-1-entrop-filtered.fq outm=low_complexity-0-1.fq entropy=0.1
java -Djava.library.path=/apps/chpc/bio/bbmap/jni/ -ea -Xmx8g -Xms8g -cp /apps/chpc/bio/bbmap/current/ jgi.BBDukF -Xmx8g in=fish-coral_1_filtered_clean.fq out=fish-coral_1_filtered_clean_0-1-entrop-filtered.fq outm=low_complexity-0-1.fq entropy=0.1
Executing jgi.BBDukF [-Xmx8g, in=seq.fq, out=seq_0-1-entrop-filtered.fq, outm=low_complexity-0-1.fq, entropy=0.1]
Version 37.90 [-Xmx8g, in=seq.fq, out=seq_0-1-entrop-filtered.fq, outm=low_complexity-0-1.fq, entropy=0.1]

Initial:
Memory: max=8232m, free=8061m, used=171m

Input is being processed as unpaired
Started output streams: 0.038 seconds.
Exception in thread "Thread-6" java.lang.ArrayIndexOutOfBoundsException: 39
        at structures.EntropyTracker.averageEntropy(EntropyTracker.java:302)
        at structures.EntropyTracker.passes(EntropyTracker.java:348)
        at jgi.BBDukF$ProcessThread.run(BBDukF.java:2583)
Exception in thread "Thread-28" java.lang.ArrayIndexOutOfBoundsException
Exception in thread "Thread-25" java.lang.ArrayIndexOutOfBoundsException
Exception in thread "Thread-8" java.lang.ArrayIndexOutOfBoundsException
Exception in thread "Thread-17" java.lang.ArrayIndexOutOfBoundsException
Exception in thread "Thread-9" java.lang.ArrayIndexOutOfBoundsException
Exception in thread "Thread-24" java.lang.ArrayIndexOutOfBoundsException
Exception in thread "Thread-13" java.lang.ArrayIndexOutOfBoundsException
Exception in thread "Thread-15" java.lang.ArrayIndexOutOfBoundsException
Exception in thread "Thread-23" java.lang.ArrayIndexOutOfBoundsException
Exception in thread "Thread-11" java.lang.ArrayIndexOutOfBoundsException
Exception in thread "Thread-14" java.lang.ArrayIndexOutOfBoundsException
Exception in thread "Thread-7" java.lang.ArrayIndexOutOfBoundsException
Exception in thread "Thread-10" java.lang.ArrayIndexOutOfBoundsException
Exception in thread "Thread-29" java.lang.ArrayIndexOutOfBoundsException
Exception in thread "Thread-20" java.lang.ArrayIndexOutOfBoundsException
Exception in thread "Thread-22" java.lang.ArrayIndexOutOfBoundsException
Exception in thread "Thread-16" java.lang.ArrayIndexOutOfBoundsException
Exception in thread "Thread-26" java.lang.ArrayIndexOutOfBoundsException
Exception in thread "Thread-19" java.lang.ArrayIndexOutOfBoundsException
Exception in thread "Thread-12" java.lang.ArrayIndexOutOfBoundsException
Exception in thread "Thread-18" java.lang.ArrayIndexOutOfBoundsException
Exception in thread "Thread-27" java.lang.ArrayIndexOutOfBoundsException
Exception in thread "Thread-21" java.lang.ArrayIndexOutOfBoundsException
Processing time:                0.192 seconds.

Input:                          34841 reads             3436691 bases.
Low entropy discards:           2157 reads (6.19%)      215168 bases (6.26%)
Total Removed:                  2181 reads (6.26%)      216121 bases (6.29%)
Result:                         32660 reads (93.74%)    3220570 bases (93.71%)

Time:                           0.255 seconds.
Reads Processed:       34841    136.55k reads/sec
Bases Processed:       3436k    13.47m bases/sec

Any suggestions?

Thank you.

**DrYak** · 02-14-2018, 02:28 AM

Entropy filtering: Java ArrayIndexOutOfBoundsException

Hi,

I'm having an issue while trying to filter a fastq file using an entropy filter. The library protocol used ribozero so there are a lot of poly T sequences that I would like to remove.

I have successfully removed adapter and phiX contamination from the file but when I try the entropy filter (with various -Xmx settings or none) I get a java array error.

There are ~137 million 100 bp unpaired reads in the fastq file and they have been filtered for adapters, low quality and phiX (using BBDuk).

I'm working on a node with 24 cores and 128 GiB of RAM running CentOS Linux release 7.3.1611 and java version "1.7.0_131".

Command and error messages follow:

$ bbduk.sh -Xmx8g in=seq.fq out=seq_0-1-entrop-filtered.fq outm=low_complexity-0-1.fq entropy=0.1
java -Djava.library.path=/apps/chpc/bio/bbmap/jni/ -ea -Xmx8g -Xms8g -cp /apps/chpc/bio/bbmap/current/ jgi.BBDukF -Xmx8g in=fish-coral_1_filtered_clean.fq out=fish-coral_1_filtered_clean_0-1-entrop-filtered.fq outm=low_complexity-0-1.fq entropy=0.1
Executing jgi.BBDukF [-Xmx8g, in=seq.fq, out=seq_0-1-entrop-filtered.fq, outm=low_complexity-0-1.fq, entropy=0.1]
Version 37.90 [-Xmx8g, in=seq.fq, out=seq_0-1-entrop-filtered.fq, outm=low_complexity-0-1.fq, entropy=0.1]

Initial:
Memory: max=8232m, free=8061m, used=171m

Input is being processed as unpaired
Started output streams: 0.038 seconds.
Exception in thread "Thread-6" java.lang.ArrayIndexOutOfBoundsException: 39
at structures.EntropyTracker.averageEntropy(EntropyTracker.java:302)
at structures.EntropyTracker.passes(EntropyTracker.java:348)
at jgi.BBDukF$ProcessThread.run(BBDukF.java:2583)
Exception in thread "Thread-28" java.lang.ArrayIndexOutOfBoundsException
Exception in thread "Thread-25" java.lang.ArrayIndexOutOfBoundsException
Exception in thread "Thread-8" java.lang.ArrayIndexOutOfBoundsException
Exception in thread "Thread-17" java.lang.ArrayIndexOutOfBoundsException
Exception in thread "Thread-9" java.lang.ArrayIndexOutOfBoundsException
Exception in thread "Thread-24" java.lang.ArrayIndexOutOfBoundsException
Exception in thread "Thread-13" java.lang.ArrayIndexOutOfBoundsException
Exception in thread "Thread-15" java.lang.ArrayIndexOutOfBoundsException
Exception in thread "Thread-23" java.lang.ArrayIndexOutOfBoundsException
Exception in thread "Thread-11" java.lang.ArrayIndexOutOfBoundsException
Exception in thread "Thread-14" java.lang.ArrayIndexOutOfBoundsException
Exception in thread "Thread-7" java.lang.ArrayIndexOutOfBoundsException
Exception in thread "Thread-10" java.lang.ArrayIndexOutOfBoundsException
Exception in thread "Thread-29" java.lang.ArrayIndexOutOfBoundsException
Exception in thread "Thread-20" java.lang.ArrayIndexOutOfBoundsException
Exception in thread "Thread-22" java.lang.ArrayIndexOutOfBoundsException
Exception in thread "Thread-16" java.lang.ArrayIndexOutOfBoundsException
Exception in thread "Thread-26" java.lang.ArrayIndexOutOfBoundsException
Exception in thread "Thread-19" java.lang.ArrayIndexOutOfBoundsException
Exception in thread "Thread-12" java.lang.ArrayIndexOutOfBoundsException
Exception in thread "Thread-18" java.lang.ArrayIndexOutOfBoundsException
Exception in thread "Thread-27" java.lang.ArrayIndexOutOfBoundsException
Exception in thread "Thread-21" java.lang.ArrayIndexOutOfBoundsException
Processing time: 0.192 seconds.

Input: 34841 reads 3436691 bases.
Low entropy discards: 2157 reads (6.19%) 215168 bases (6.26%)
Total Removed: 2181 reads (6.26%) 216121 bases (6.29%)
Result: 32660 reads (93.74%) 3220570 bases (93.71%)

Time: 0.255 seconds.
Reads Processed: 34841 136.55k reads/sec
Bases Processed: 3436k 13.47m bases/sec

Any suggestions?

Thanks.
Dave

Topics	Statistics	Last Post
Gene Misexpression in the Healthy Human Population by seqadmin Started by seqadmin, 07-25-2024, 06:46 AM	0 responses 9 views 0 likes	Last Post by seqadmin 07-25-2024, 06:46 AM
New Method for Rapid Genetic Diagnosis of Mendelian Disorders by seqadmin Started by seqadmin, 07-24-2024, 11:09 AM	0 responses 26 views 0 likes	Last Post by seqadmin 07-24-2024, 11:09 AM
Advancing Nanopore Technology for Portable Sensing Devices by seqadmin Started by seqadmin, 07-19-2024, 07:20 AM	0 responses 160 views 0 likes	Last Post by seqadmin 07-19-2024, 07:20 AM
New RNA-Based Gene Writing Technology Achieves Precise Gene Integration by seqadmin Started by seqadmin, 07-16-2024, 05:49 AM	0 responses 127 views 0 likes	Last Post by seqadmin 07-16-2024, 05:49 AM

Seqanswers Leaderboard Ad

Announcement

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Latest Articles

ad_right_rmr

News