Thanks for the fast response. I will try reformat.
-
Hi,
I got the idea to use BBDuk as a tool to filter out kmers that are shared between two samples. Is this a good idea? It's metagenomics, and the concept is that one sample is representing the background composition of bacteria, and I want to remove that background composition from another sample. One challenge to the problem is that the samples are rather big, about 2x35 GB compressed fastq (paired end), each containing about 1 billion reads total (both read pairs combined).
Comment
-
Originally posted by boulund:
Hi,
I got the idea to use BBDuk as a tool to filter out kmers that are shared between two samples. Is this a good idea? It's metagenomics, and the concept is that one sample is representing the background composition of bacteria, and I want to remove that background composition from another sample. One challenge to the problem is that the samples are rather big, about 2x35 GB compressed fastq (paired end), each containing about 1 billion reads total (both read pairs combined).
kcompress.sh in=a.fq.gz out=a_kmers.fa.gz
kcompress.sh in=b.fq.gz out=b_kmers.fa.gz
kcompress.sh in=a_kmers.fa.gz,b_kmers.fa.gz out=shared_kmers.fa.gz mincount=2
However, I think I'd probably tend to just assemble what you consider to be the background, and then map reads to the assembly requiring fairly high identity, keeping the reads that don't map. Either approach works (and also has disadvantages) but whole reads tend to be more specific than kmers.
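To make the k-mer idea concrete, here is a toy Python sketch of "find the k-mers present in both samples, then drop reads that contain any of them". It only illustrates the concept: it is not how kcompress.sh or BBDuk are implemented, the reads and the value of k are made up, and in-memory sets like these would not scale to samples of a billion reads.

```python
# Toy illustration of filtering reads by k-mers shared with a background sample.
# Assumptions: tiny made-up reads, k=5, everything held in memory.

COMPLEMENT = str.maketrans("ACGT", "TGCA")

def revcomp(seq):
    """Reverse complement of a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]

def canonical(kmer):
    """Strand-independent form of a k-mer: the smaller of it and its reverse complement."""
    return min(kmer, revcomp(kmer))

def canonical_kmers(reads, k):
    """Set of canonical k-mers over all reads, skipping k-mers with non-ACGT symbols."""
    kmers = set()
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            if set(kmer) <= set("ACGT"):
                kmers.add(canonical(kmer))
    return kmers

def keep_read(read, shared, k):
    """True if the read contains none of the shared k-mers."""
    return not any(
        canonical(read[i:i + k]) in shared
        for i in range(len(read) - k + 1)
    )

k = 5
background = ["ACGTACGTGG", "TTTTGGGCCA"]
sample = ["ACGTACGTAA", "GGCATCATCA"]

# A k-mer seen in both sets is "shared".
shared = canonical_kmers(background, k) & canonical_kmers(sample, k)
filtered = [read for read in sample if keep_read(read, shared, k)]
```

Here the first sample read shares k-mers with the background and is dropped, while the second is kept. A single shared k-mer is enough to discard a read in this sketch, which shows why whole-read mapping tends to be more specific.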
Originally posted by EssigSchurke:
Thanks for the fast response. I will try reformat.
Comment
-
Ok, that's interesting. Our current approach was just that; assembly of the background sample with Megahit, and then mapping the sample to be filtered against the background assembly to remove anything that matches. I was hoping it'd be possible to do it without the massive overhead of assembling the background sample, as that's fairly time consuming and memory hungry for these large samples. I will have to evaluate the different approaches against each other to see which one fits our setup the best. Thanks for your input, and thanks for pointing me to kcompress.sh!
Comment
-
Different output with interleaved input
Hello again-
I think there is an inconsistent behavior in how bbduk handles interleaved input depending on whether the interleaved option is set to "true" or "auto".
Consider the case where only one read in a pair is discarded because it is too short.
With "interleaved=auto", the (interleaved) output contains only the read that passes the filter, which then appears as a single-end read. With "interleaved=true", both reads are discarded.
Is this difference intentional? In my opinion, interleaved=auto does the right thing by discarding only the bad read and keeping the other. However, this creates an interleaved output in which single-end reads (actually paired-end reads whose mate is gone) are interspersed among the paired-end ones. I'm not sure whether the interleaved format has ever been defined to allow such a case (for the record, bwa seems to handle it correctly).
I just thought it would be useful to point this out.
This should reproduce the issue with BBDuk version 37.54:
Code:
bbduk.sh in=int2.fq out=stdout.fq qtrim=rl minlength=35 trimq=15 interleaved=auto
bbduk.sh in=int2.fq out=stdout.fq qtrim=rl minlength=35 trimq=15 interleaved=true
Code:
@r1
ATGGCATGCACCTGTAATCCCGCTACTTGTGAGGCTGAAGCAGGAGAAT
+
JJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJ
@r1
TAATTATATGTTTAAGTAAATGAGTAAAATTCAAGATTGCTATCGGATT
+
JJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJ
@s1
ATGGCATGCACCTGTAATCCCGCTACTTGTGAGGCTGAAGCAGGAGAAT
+
AAFFFJJJJJFJFJJJJJFAFJJJFJ7<FAJJJJJJJJAAFJJJJJJJJ
@s1
TAATTATATGTTTAAGTAAATGAGTAAAATTCAAGATTGCTATCGGATT
+
#################################################
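The two behaviours described above can be modelled with a small Python sketch. This is only a model of the observed outputs, not BBDuk's code. The function name and the `drop_whole_pair` switch are made up for illustration, and the records hold read lengths after trimming, with the second `s1` mate assumed to be trimmed away entirely because all its qualities are `#`.

```python
# Model of two ways to apply a minimum-length filter to interleaved reads.
# Assumption: records are (name, trimmed_sequence) tuples with mates adjacent,
# and the list holds an even number of records.

def filter_interleaved(records, min_length, drop_whole_pair):
    """Return the records that survive the length filter.

    drop_whole_pair=True  -> a pair is kept only if both mates pass.
    drop_whole_pair=False -> each mate is judged on its own, so orphans can remain.
    """
    kept = []
    for mate1, mate2 in zip(records[0::2], records[1::2]):
        ok1 = len(mate1[1]) >= min_length
        ok2 = len(mate2[1]) >= min_length
        if ok1 and ok2:
            kept.extend([mate1, mate2])
        elif not drop_whole_pair:
            if ok1:
                kept.append(mate1)
            if ok2:
                kept.append(mate2)
    return kept

records = [
    ("r1", "A" * 49), ("r1", "T" * 49),  # both mates pass
    ("s1", "A" * 49), ("s1", ""),        # second mate trimmed to nothing
]

read_wise = [name for name, _ in filter_interleaved(records, 35, drop_whole_pair=False)]
pair_wise = [name for name, _ in filter_interleaved(records, 35, drop_whole_pair=True)]
```

The read-wise result leaves a lone `s1` record in the stream, which is the orphan-in-an-interleaved-file situation raised in the post. The pair-wise result keeps the file strictly paired at the cost of losing a good read.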
Comment
-
Originally posted by GenoMax:
@Dario: Have you tried this with a file whose read headers carry pair identifiers (/1 in the old Illumina style, or 1:N:0 in the new style)? Interleaved files may need those identifiers to be present. They can be added with reformat.sh.
Code:
@E00295:75:H7LLTALXX:8:1101:4553:1643 1:N:0:8
@E00295:75:H7LLTALXX:8:1101:4553:1643 2:N:0:8
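As a rough illustration of why such identifiers matter, the Python sketch below checks whether two consecutive FASTQ headers look like mates 1 and 2 of the same fragment. The helper names and regular expressions are assumptions made for this example. This is not BBDuk's actual auto-detection logic, which may differ.

```python
import re

# Hypothetical helpers: guess the mate number from a FASTQ header line.
NEW_STYLE = re.compile(r"^(\S+)\s+([12]):[YN]:\d+")  # e.g. "name 1:N:0:8"
OLD_STYLE = re.compile(r"^(\S+)/([12])$")             # e.g. "name/1"

def split_header(header):
    """Return (base_name, mate_number) with mate_number None if not marked."""
    text = header.lstrip("@").strip()
    match = NEW_STYLE.match(text)
    if match:
        return match.group(1), int(match.group(2))
    first_token = text.split()[0] if text else ""
    match = OLD_STYLE.match(first_token)
    if match:
        return match.group(1), int(match.group(2))
    return first_token, None

def looks_like_pair(header1, header2):
    """True if the headers share a base name and are marked mate 1 then mate 2."""
    name1, mate1 = split_header(header1)
    name2, mate2 = split_header(header2)
    return name1 == name2 and mate1 == 1 and mate2 == 2

new_style_ok = looks_like_pair(
    "@E00295:75:H7LLTALXX:8:1101:4553:1643 1:N:0:8",
    "@E00295:75:H7LLTALXX:8:1101:4553:1643 2:N:0:8",
)
bare_names_ok = looks_like_pair("@r1", "@r1")
```

Under this check the headers above are recognised as a pair, while bare names such as `@r1` / `@r1` carry no mate marker and are not.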
By the way, with interleaved=auto bbduk gives to stderr the message:
Code:Input is being processed as unpaired
Code:Input is being processed as paired
Having said that, I think discarding both reads in a pair when only one read fails is unnecessary.
Comment
-
Hi!
Is it possible to get the various quality histograms for both before and after e.g. trimming with BBDuk in a single run, or do I need to run BBDuk twice to produce metrics for before and after trimming?
That is, run once without any trimming, just outputting histograms, and then again to trim and output histograms? Or am I missing something? The histograms output by BBDuk normally show metrics after trimming/contaminant removal, right?
By the way, I might mention that I finally tried to assemble my very large background sample using Tadpole, and then align my primary sample to that to remove 'background/contamination' reads. It produced a fairly poor assembly overall, but at least it ran to completion on the 500GB background sample on my memory-constrained machine (64GB). The kmer-based approach was just too memory consuming.
Last edited by boulund; 10-25-2017, 10:54 PM.
Comment
-
Seal not printing outu file
Hi - I'm running seal to map reads to reference genomes. This is the command I ran:
Code:seal in="${samplename}_nonribo.fq.gz" ref="${all4genomes}" pattern="${samplename}_out_%.fq.gz" outu="${samplename}_unmapped.fq.gz" ambig=all stats="${samplename}_mapstats.txt"
Is there another trick to making this file?
Thanks,
MC
Comment
-
Hi, I would like to use bbduk to filter the reads that map to a genome. I have 48 paired samples and want to make sure I understand how to input the files.
My samples are paired-end and are labeled F and R (plus _1 and _2).
Should I put them all after in=, or use in2= for the reverse reads? Does interleaved mean that I put each pair together?
eg. in= S1_F_paired_1.fq,S1_R_paired_2.fq,S10_F_paired_1.fq,S10_R_paired_2.fq...
I have them as two separate lines at the moment:
S1_F_paired_1.fq,S10_F_paired_1.fq,S11_F_paired_1.fq,S12_F_paired_1.fq,S13_F_paired_1.fq..
S1_R_paired_2.fq,S10_R_paired_2.fq,S11_R_paired_2.fq,S12_R_paired_2.fq,S13_R_paired_2.fq..
Thanks,
Catalina
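One way to keep the forward and reverse files matched up is to pair them by sample name and build one command per pair, passing the forward file to `in=` and the reverse file to `in2=`. The Python sketch below does only the pairing and prints nothing by itself. It assumes the file naming shown above, one BBDuk run per sample, a hypothetical reference called `genome.fa`, and made-up output file names. Whether `in=` accepts a comma-separated list is not confirmed here, so the sketch avoids relying on it.

```python
import re

# Assumed naming scheme from the post: S<number>_F_paired_1.fq / S<number>_R_paired_2.fq
FILE_PATTERN = re.compile(r"^(S\d+)_[FR]_paired_([12])\.fq$")

def pair_files(filenames):
    """Group file names into (sample, forward, reverse) tuples; incomplete pairs are skipped."""
    by_sample = {}
    for name in filenames:
        match = FILE_PATTERN.match(name)
        if match:
            by_sample.setdefault(match.group(1), {})[match.group(2)] = name
    return [
        (sample, mates["1"], mates["2"])
        for sample, mates in sorted(by_sample.items())
        if "1" in mates and "2" in mates
    ]

def bbduk_command(sample, forward, reverse, ref="genome.fa"):
    """Build one command line; ref and all output names are placeholders."""
    return (
        f"bbduk.sh in={forward} in2={reverse} ref={ref} "
        f"out={sample}_unmatched_1.fq out2={sample}_unmatched_2.fq "
        f"outm={sample}_matched_1.fq outm2={sample}_matched_2.fq"
    )

files = [
    "S1_F_paired_1.fq", "S1_R_paired_2.fq",
    "S10_F_paired_1.fq", "S10_R_paired_2.fq",
]
pairs = pair_files(files)
commands = [bbduk_command(*pair) for pair in pairs]
```

In BBDuk, `outm`/`outm2` receive the reads that match the reference and `out`/`out2` the ones that do not, so which set to keep depends on whether the genome is the target or the contaminant.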
Comment
-
java error when trying to filter on entropy
Hi,
I am getting an error when trying to use BBDuk to filter based on entropy. I have previously filtered the dataset for phiX and for adapters with no issues, so I'm a bit confused as to why it won't work now.
I'm working on a node with 12 GB of RAM, so the 8 GB requested shouldn't be an issue. I got a similar error without the -Xmx flag.
Running CentOS Linux release 7.3.1611
java version "1.7.0_131"
My fq file contains ~135 million 100 bp unpaired sequences.
The command and error log follow:
Code:
$ bbduk.sh -Xmx8g in=seq.fq out=seq_0-1-entrop-filtered.fq outm=low_complexity-0-1.fq entropy=0.1
java -Djava.library.path=/apps/chpc/bio/bbmap/jni/ -ea -Xmx8g -Xms8g -cp /apps/chpc/bio/bbmap/current/ jgi.BBDukF -Xmx8g in=fish-coral_1_filtered_clean.fq out=fish-coral_1_filtered_clean_0-1-entrop-filtered.fq outm=low_complexity-0-1.fq entropy=0.1
Executing jgi.BBDukF [-Xmx8g, in=seq.fq, out=seq_0-1-entrop-filtered.fq, outm=low_complexity-0-1.fq, entropy=0.1]
Version 37.90 [-Xmx8g, in=seq.fq, out=seq_0-1-entrop-filtered.fq, outm=low_complexity-0-1.fq, entropy=0.1]
Initial:
Memory: max=8232m, free=8061m, used=171m
Input is being processed as unpaired
Started output streams: 0.038 seconds.
Exception in thread "Thread-6" java.lang.ArrayIndexOutOfBoundsException: 39
at structures.EntropyTracker.averageEntropy(EntropyTracker.java:302)
at structures.EntropyTracker.passes(EntropyTracker.java:348)
at jgi.BBDukF$ProcessThread.run(BBDukF.java:2583)
Exception in thread "Thread-28" java.lang.ArrayIndexOutOfBoundsException
Exception in thread "Thread-25" java.lang.ArrayIndexOutOfBoundsException
Exception in thread "Thread-8" java.lang.ArrayIndexOutOfBoundsException
Exception in thread "Thread-17" java.lang.ArrayIndexOutOfBoundsException
Exception in thread "Thread-9" java.lang.ArrayIndexOutOfBoundsException
Exception in thread "Thread-24" java.lang.ArrayIndexOutOfBoundsException
Exception in thread "Thread-13" java.lang.ArrayIndexOutOfBoundsException
Exception in thread "Thread-15" java.lang.ArrayIndexOutOfBoundsException
Exception in thread "Thread-23" java.lang.ArrayIndexOutOfBoundsException
Exception in thread "Thread-11" java.lang.ArrayIndexOutOfBoundsException
Exception in thread "Thread-14" java.lang.ArrayIndexOutOfBoundsException
Exception in thread "Thread-7" java.lang.ArrayIndexOutOfBoundsException
Exception in thread "Thread-10" java.lang.ArrayIndexOutOfBoundsException
Exception in thread "Thread-29" java.lang.ArrayIndexOutOfBoundsException
Exception in thread "Thread-20" java.lang.ArrayIndexOutOfBoundsException
Exception in thread "Thread-22" java.lang.ArrayIndexOutOfBoundsException
Exception in thread "Thread-16" java.lang.ArrayIndexOutOfBoundsException
Exception in thread "Thread-26" java.lang.ArrayIndexOutOfBoundsException
Exception in thread "Thread-19" java.lang.ArrayIndexOutOfBoundsException
Exception in thread "Thread-12" java.lang.ArrayIndexOutOfBoundsException
Exception in thread "Thread-18" java.lang.ArrayIndexOutOfBoundsException
Exception in thread "Thread-27" java.lang.ArrayIndexOutOfBoundsException
Exception in thread "Thread-21" java.lang.ArrayIndexOutOfBoundsException
Processing time: 0.192 seconds.
Input: 34841 reads 3436691 bases.
Low entropy discards: 2157 reads (6.19%) 215168 bases (6.26%)
Total Removed: 2181 reads (6.26%) 216121 bases (6.29%)
Result: 32660 reads (93.74%) 3220570 bases (93.71%)
Time: 0.255 seconds.
Reads Processed: 34841 136.55k reads/sec
Bases Processed: 3436k 13.47m bases/sec
Thank you.
Comment
-
Entropy filtering: Java ArrayIndexOutOfBoundsException
Hi,
I'm having an issue while trying to filter a fastq file using an entropy filter. The library protocol used Ribo-Zero, so there are a lot of poly-T sequences that I would like to remove.
I have successfully removed adapter and phiX contamination from the file, but when I try the entropy filter (with various -Xmx settings or none) I get a Java array error.
There are ~137 million 100 bp unpaired reads in the fastq file and they have been filtered for adapters, low quality and phiX (using BBDuk).
I'm working on a node with 24 cores and 128 GiB of RAM running CentOS Linux release 7.3.1611 and java version "1.7.0_131".
Command and error messages follow:
$ bbduk.sh -Xmx8g in=seq.fq out=seq_0-1-entrop-filtered.fq outm=low_complexity-0-1.fq entropy=0.1
java -Djava.library.path=/apps/chpc/bio/bbmap/jni/ -ea -Xmx8g -Xms8g -cp /apps/chpc/bio/bbmap/current/ jgi.BBDukF -Xmx8g in=fish-coral_1_filtered_clean.fq out=fish-coral_1_filtered_clean_0-1-entrop-filtered.fq outm=low_complexity-0-1.fq entropy=0.1
Executing jgi.BBDukF [-Xmx8g, in=seq.fq, out=seq_0-1-entrop-filtered.fq, outm=low_complexity-0-1.fq, entropy=0.1]
Version 37.90 [-Xmx8g, in=seq.fq, out=seq_0-1-entrop-filtered.fq, outm=low_complexity-0-1.fq, entropy=0.1]
Initial:
Memory: max=8232m, free=8061m, used=171m
Input is being processed as unpaired
Started output streams: 0.038 seconds.
Exception in thread "Thread-6" java.lang.ArrayIndexOutOfBoundsException: 39
at structures.EntropyTracker.averageEntropy(EntropyTracker.java:302)
at structures.EntropyTracker.passes(EntropyTracker.java:348)
at jgi.BBDukF$ProcessThread.run(BBDukF.java:2583)
Exception in thread "Thread-28" java.lang.ArrayIndexOutOfBoundsException
Exception in thread "Thread-25" java.lang.ArrayIndexOutOfBoundsException
Exception in thread "Thread-8" java.lang.ArrayIndexOutOfBoundsException
Exception in thread "Thread-17" java.lang.ArrayIndexOutOfBoundsException
Exception in thread "Thread-9" java.lang.ArrayIndexOutOfBoundsException
Exception in thread "Thread-24" java.lang.ArrayIndexOutOfBoundsException
Exception in thread "Thread-13" java.lang.ArrayIndexOutOfBoundsException
Exception in thread "Thread-15" java.lang.ArrayIndexOutOfBoundsException
Exception in thread "Thread-23" java.lang.ArrayIndexOutOfBoundsException
Exception in thread "Thread-11" java.lang.ArrayIndexOutOfBoundsException
Exception in thread "Thread-14" java.lang.ArrayIndexOutOfBoundsException
Exception in thread "Thread-7" java.lang.ArrayIndexOutOfBoundsException
Exception in thread "Thread-10" java.lang.ArrayIndexOutOfBoundsException
Exception in thread "Thread-29" java.lang.ArrayIndexOutOfBoundsException
Exception in thread "Thread-20" java.lang.ArrayIndexOutOfBoundsException
Exception in thread "Thread-22" java.lang.ArrayIndexOutOfBoundsException
Exception in thread "Thread-16" java.lang.ArrayIndexOutOfBoundsException
Exception in thread "Thread-26" java.lang.ArrayIndexOutOfBoundsException
Exception in thread "Thread-19" java.lang.ArrayIndexOutOfBoundsException
Exception in thread "Thread-12" java.lang.ArrayIndexOutOfBoundsException
Exception in thread "Thread-18" java.lang.ArrayIndexOutOfBoundsException
Exception in thread "Thread-27" java.lang.ArrayIndexOutOfBoundsException
Exception in thread "Thread-21" java.lang.ArrayIndexOutOfBoundsException
Processing time: 0.192 seconds.
Input: 34841 reads 3436691 bases.
Low entropy discards: 2157 reads (6.19%) 215168 bases (6.26%)
Total Removed: 2181 reads (6.26%) 216121 bases (6.29%)
Result: 32660 reads (93.74%) 3220570 bases (93.71%)
Time: 0.255 seconds.
Reads Processed: 34841 136.55k reads/sec
Bases Processed: 3436k 13.47m bases/sec
Any suggestions?
Thanks.
Dave
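For readers unfamiliar with what an entropy filter measures, the Python sketch below computes a normalised Shannon entropy over the k-mers in sliding windows of a read. It is a generic illustration of the idea, not BBDuk's EntropyTracker implementation, and it does not explain the exception above. The window size of 50, the k of 5, and the example sequences are assumptions chosen for the example.

```python
import math
from collections import Counter

def kmer_entropy(seq, k=5):
    """Shannon entropy of the k-mer distribution in seq, scaled to the range 0..1."""
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    if not kmers:
        return 0.0
    total = len(kmers)
    counts = Counter(kmers)
    entropy = -sum((c / total) * math.log2(c / total) for c in counts.values())
    # The most even distribution possible is limited by both the number of
    # k-mers in the window and the number of distinct k-mers of length k.
    max_entropy = math.log2(min(total, 4 ** k))
    return entropy / max_entropy if max_entropy > 0 else 0.0

def average_window_entropy(seq, window=50, k=5):
    """Mean k-mer entropy over all windows of the given size (whole read if shorter)."""
    if len(seq) <= window:
        return kmer_entropy(seq, k)
    values = [
        kmer_entropy(seq[i:i + window], k)
        for i in range(len(seq) - window + 1)
    ]
    return sum(values) / len(values)

poly_t = "T" * 100
mixed = "ACGTTGCAAGCTAGCTAGGATCCGATCGTACGATCGGCTAAGCTTGCAATCGGATCCGTA"

poly_t_score = average_window_entropy(poly_t)
mixed_score = average_window_entropy(mixed)
```

A poly-T read contains a single distinct k-mer, so its score is zero and it would fall below a threshold such as 0.1. A read with varied sequence scores close to 1 and would pass.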
Comment