Seqanswers Leaderboard Ad
Collapse
Announcement
Collapse
No announcement yet.
X
-
Originally posted by willd View PostDear Brian,
I have many reads from a very specific scRNA contaminant of length 299 bp. Would BBDuk's contaminant filtering be a suitable tool for removing these, and what parameters would you recommend? (my reads are 100 bp PE)
Would it be possible to do it in one run along with adapter and quality trimming, e.g. by adding this 299bp sequence to the current adapters.fa file that you have provided in the resources folder?
Many thanks for any help
Assuming the reads are named "read1.fq" and "read2.fq", and your 299bp sequence is in sequence.fa:
bbduk.sh in=read#.fq out=filtered#.fq ref=sequence.fa k=31
You can adjust the sensitivity/specificity with other flags (like adding hdist=1, to catch the very-low-quality reads) but the command above should be adequate. It's technically possible to do this at the same time as adapter-trimming by adding the sequence to the adapters file; reads of that sequence should end up fully trimmed and thus will be discarded because they are too short (shorter than minlen which is 10 by default, I think). But you would achieve higher sensitivity and specificity by filtering in a different pass, so I'd recommend doing it in 2 steps.
Comment
-
Dear Brian - thanks very much!
In case it's of interest: I tried both 1 pass (contaminant filtering at same time as adapter-trimming) and 2 pass (adapter trimming then contaminant filtering) on two of my samples, and the output was close to identical: the number of remaining reads was ~0.4% lower for 1 pass v 2 pass (for both samples); differences in plots of per-base sequence quality, GC curves, etc. were almost unnoticeable.
Comment
-
Dear Brian, for one fastq file I got the error output below. FastQC did not show any particular problems with this file. The error refers to a specific read - could there be a problem for BBDuk to parse this read?
Code:Executing jgi.BBDukF [in1=../../C467_RE_B00GGOO_8_1_C68T3ACXX.IND1.fastq.gz, in2=../../C467_RE_B00GGOO_8_2_C68T3ACXX.IND1.fastq.gz, out1=B00GGOO_1_TrimPass1.fastq, out2=B00GGOO_2_TrimPass1.fastq, minlen=25, qtrim=r, trimq=10, ktrim=r, k=25, mink=11, ref=my_adapters.fa, hdist=1, tpe, tbo] BBDuk version 35.82 maskMiddle was disabled because useShortKmers=true Initial: Memory: max=777m, free=754m, used=23m Added 64944 kmers; time: 0.212 seconds. Memory: max=777m, free=717m, used=60m Input is being processed as paired Started output streams: 0.176 seconds. java.lang.AssertionError: Error in ../../C467_RE_B00GGOO_8_1_C68T3ACXX.IND1.fastq.gz, line 1210999, with these 4 lines: CTCGGTCACAACCTGAGGGCTCAGCCCCAGGGCACGGCCATAGCCGGCGCGAGGCCCAAGGCTGCGGTGCTGCGGCTACTGGCGGCGACGAGCCGATCGGA + CCCFFFFFHGHFHIJIJJJJJJIICHJIIGIICHIEGGBHHIHIJJJHFDBD:@BDDDDDDD>ABDD>@@CACDD9>@CDDDBBD################ @HISEQ10:C68T3ACXX:8:1101:19859:76022/1 at stream.FASTQ.quadToRead(FASTQ.java:722) at stream.FASTQ.toReadList(FASTQ.java:653) at stream.FastqReadInputStream.fillBuffer(FastqReadInputStream.java:111) at stream.FastqReadInputStream.nextList(FastqReadInputStream.java:96) at stream.ConcurrentGenericReadInputStream$ReadThread.readLists(ConcurrentGenericReadInputStream.java:656) at stream.ConcurrentGenericReadInputStream$ReadThread.run(ConcurrentGenericReadInputStream.java:635) Exception in thread "Thread-19" java.lang.NullPointerException at jgi.BBDukF$ProcessThread.run(BBDukF.java:2399) Exception in thread "Thread-20" java.lang.NullPointerException at jgi.BBDukF$ProcessThread.run(BBDukF.java:2399) Exception in thread "Thread-18" java.lang.NullPointerException at jgi.BBDukF$ProcessThread.run(BBDukF.java:2399) Exception in thread "Thread-23" java.lang.NullPointerException at jgi.BBDukF$ProcessThread.run(BBDukF.java:2399) Exception in thread "Thread-21" Exception in thread "Thread-24" java.lang.NullPointerException at jgi.BBDukF$ProcessThread.run(BBDukF.java:2399) java.lang.NullPointerException at jgi.BBDukF$ProcessThread.run(BBDukF.java:2399) Exception in thread "Thread-22" java.lang.NullPointerException at jgi.BBDukF$ProcessThread.run(BBDukF.java:2399) Processing time: 3.675 seconds. Input: 603200 reads 60923200 bases. QTrimmed: 67817 reads (11.24%) 3329362 bases (5.46%) KTrimmed: 52402 reads (8.69%) 1521198 bases (2.50%) Trimmed by overlap: 36830 reads (6.11%) 190222 bases (0.31%) Result: 575826 reads (95.46%) 55882418 bases (91.73%) Time: 4.073 seconds. Reads Processed: 603k 148.08k reads/sec Bases Processed: 60923k 14.96m bases/sec Exception in thread "main" java.lang.RuntimeException: jgi.BBDukF terminated in an error state; the output may be corrupt. at jgi.BBDukF.process(BBDukF.java:822) at jgi.BBDukF.main(BBDukF.java:70)
Comment
-
Originally posted by willd View PostDear Brian, for one fastq file I got the error output below. FastQC did not show any particular problems with this file. The error refers to a specific read - could there be a problem for BBDuk to parse this read?
Code:CTCGGTCACAACCTGAGGGCTCAGCCCCAGGGCACGGCCATAGCCGGCGCGAGGCCCAAGGCTGCGGTGCTGCGGCTACTGGCGGCGACGAGCCGATCGGA + CCCFFFFFHGHFHIJIJJJJJJIICHJIIGIICHIEGGBHHIHIJJJHFDBD:@BDDDDDDD>ABDD>@@CACDD9>@CDDDBBD################ @HISEQ10:C68T3ACXX:8:1101:19859:76022/1#ID Median_fold
Code:@HISEQ10:C68T3ACXX:8:1101:19859:76022/1#ID Median_fold CTCGGTCACAACCTGAGGGCTCAGCCCCAGGGCACGGCCATAGCCGGCGCGAGGCCCAAGGCTGCGGTGCTGCGGCTACTGGCGGCGACGAGCCGATCGGA + CCCFFFFFHGHFHIJIJJJJJJIICHJIIGIICHIEGGBHHIHIJJJHFDBD:@BDDDDDDD>ABDD>@@CACDD9>@CDDDBBD################
Actually, "#ID Median_fold" should not be in there at all. It looks like you somehow got unrelated text mixed into your fastq file, which is what is corrupting it.
Comment
-
Dear Brian, indeed there seems to be a "@HISEQ....." line missing in this region of the file, and I found that this has happened in at least one other part of the file as well. I guess I will check with the platform who generated this if they have another version of the file, though the md5sum checks out so I guess this sample might be lost if we can't trust it.
Anyway... NO PROBLEM WITH YOUR EXCELLENT BBDuk! (sorry to have suggested that, and thanks again for your help).
Code:@HISEQ10:C68T3ACXX:8:1101:20187:76019/1 ATTTATTTATTTTATTATTTTTTTGAAACATTTTTACATGTTTATTTATCAAAAGTGAAAAAGCATACAATTCAACACAATTTCAATAGCATTTGACACCA + CCCFFFFFHHHHHJJJJJJJJJJJJJJJJJJJJJJJJJJJIJJJJJJJJJJJJJJFHIJJJJJIGGIJJHHHHGHEDFFDEEEEEEEEDCDEDDEEDDDBB @HISEQ10:C68T3ACXX:8:1101:20097:76043/1 GGTTAGTATAGCTTAGTTAAACTTTCGTTTATTGCTAAAGGTTAATCACTGCTGTTTCCCGTGGGGGTGTGGCTAGGCTAAGCGTTTTGAGCTGCATTGCT + @@?DDFADFHHHHJIJHGIJJIJJIJIFHIJGJIAC,=?B01:19795:76016/1 CTCGGTCACAACCTGAGGGCTCAGCCCCAGGGCACGGCCATAGCCGGCGCGAGGCCCAAGGCTGCGGTGCTGCGGCTACTGGCGGCGACGAGCCGATCGGA + CCCFFFFFHGHFHIJIJJJJJJIICHJIIGIICHIEGGBHHIHIJJJHFDBD:@BDDDDDDD>ABDD>@@CACDD9>@CDDDBBD################ @HISEQ10:C68T3ACXX:8:1101:19859:76022/1 GTACATTTAACCCAGTTTAGTGGCAAGTTCTTTAGCCTTTGCCTTTTCGAGCTTGGCGATACGAGCCACAGACTTAGGACCCAGGACGTTGCCACCCCAGT + C@BFFFFFHHHHHJJHIJJIJIJIJJJHIIIIJIJIJJJJJJIIIJJJJIJIIJJJJJJIJIJHHHFFFEEEEEDCCCCDDDDDDDDDDDDDDDDDDDDD> @HISEQ10:C68T3ACXX:8:1101:19768:76027/1 ACACGCTGGACTCCCCTGGCCATTTCTGGGTTATATAAATGTTTGATGAAAAATCTCTCAGGGGAAATCAAATTTTTGTAAGAAGTTAATGACAGCCCATC + @@CFFFFFHHHHHJJJJJJDIJJJJJJIIJDHIJIIFIIJJGHIJIJJGIFIBHJJJJJHIIJJGHFHHFFFFFF;CDEEEDDDDCCDEEDDDDDDDDDDC @HISEQ10:C68T3ACXX:8:1101:19877:76028/1 AGCCATCCAACCACTCAGTCTTGGCAGTGCAGATGAAAAACTGGGAACCATTTGTGTTGGATCGGAAGAGCACACGACTGAACTCAAGTCACATCACGATC + ;;?DDDDDFFFAFD3C@:<+C<CBEEG@9CFFCAFGI9DD)?F<2)9?1D3??4).8)==C@)'5C1=263?;@@<',;8:@B(5(5>>:,>AABBB####
Comment
-
Filtering different contaminants with different k-mer sizes
Dear Brian,
I use BBDuk for filtering reads based on quality, adapter content, and chloroplast content. It is a great tool!
For some applications, I also want to filter out reads that contain a restriction enzyme motif (a sequence of length 4-8).
Currently, this leads to up to four BBduk jobs that are connected by a pipe:
Code:bbduk.sh in=in_R1.fastq in2=in_R2.fastq removeifeitherbad=t out=stdout.fq ref=restriction_motifs.fa k=4 rcomp=t maskmiddle=f qout=33 | bbduk.sh in=stdin.fq int=t removeifeitherbad=t out=stdout.fq minlength=30 qout=33 ref=adapters.fa hammingdistance=2 k=20 mink=10 hammingdistance2=1 ktrim=n maxns=0 maxbadkmers=0 rcomp=t | bbduk.sh in=stdin.fq int=t removeifeitherbad=t out=stdout.fq ref=chloroplast.fa k=31 rcomp=t qout=33 | bbduk.sh in=stdin.fq int=t out1=out_R1.fastq removeifeitherbad=t out2=out_R2.fastq minlength=30 qtrim=r trimq=15 minavgquality=15 qout=33
Can I condense these four jobs into less (ideally one) jobs?
I assume that if a FASTA entry in the file given with "ref=" is smaller than the k-mer size, it is effectively not used for the filtering.
Therefore, the main problem that I see is to define different k-mer sizes for different references (e.g. 4 for the restriction motif, and 31 for the chloroplast).
Thanks!
Chris
Comment
-
Hi Chris,
You can merge the last two stages:
Code:bbduk.sh in=stdin.fq int=t out=out_R#.fastq removeifeitherbad=t minlength=30 qtrim=r trimq=15 minavgquality=15 qout=33 ref=chloroplast.fa k=31 rcomp=t
Code:bbduk.sh in=stdin.fq int=t out=out_R#.fastq minlength=30 qtrim=r trimq=15 minavgquality=15 ref=chloroplast.fa k=31
Note that (depending on the environment) BBDuk will try to grab around half of available memory for each process, which may lead to unpredictable results. I suggest using the -Xmx flag in each stage to tell it how much memory to use (-Xmx1g should be fine in each case) whenever piping BBDuk.
The way you accomplished variable-kmer-length adapter-filtering via the "ktrim=n" and "maxns=0" combo was clever, by the way!
Comment
-
Some pairs breaking after multiple steps
I've been doing some multi-step trimming of paired libraries that has been causing a low frequency of pairs breaking, and I'm trying to figure out if there's a way to avoid having this happen.
These libraries have a structure where read1 contains exclusively sample barcoding information, and read2 contains the sequences that will eventually be aligned. Ideally, I would like to preserve every base of read1, but on read2, trim several bases from the 5' of the read, as well as do some kmer and quality trimming on read2.
my solution was to do this in two steps, first doing the 5' trim on just read2, then doing the kmer trim with the skipr1 flag active. (I found out the hard way that the skipr1 flag doesn't skip ftl operations)
ie
Code:bbduk.sh in=read2.fq.gz ftl=6 out=read2_ftl6.fq.gz bbduk.sh in1=read1.fq in2=read2_ftl6.fq.gz out=read#_trimmed.fq.gz literal=<adapter sequence> ktrim=r k=23 mink=11 hdist=1 skipr1=TRUE
Thanks
Comment
-
BBDuk is multithreaded, and will reorder reads unless you explicitly tell it not to via the "ordered" flag. So in this case, you could do this:
Code:bbduk.sh in=read2.fq.gz ftl=6 out=read2_ftl6.fq.gz [U][B]ordered minlength=1[/B][/U] bbduk.sh in1=read1.fq in2=read2_ftl6.fq.gz out=read#_trimmed.fq.gz literal=<adapter sequence> ktrim=r k=23 mink=11 hdist=1 skipr1=TRUE
It's possible to repair the pairing with "repair.sh" but of course it's always most efficient to just avoid the problem in the first place.
Comment
-
Hi, I'm trying to trim fasta reads w/ quals but get this error.
Thanks,
Pat
~pminx/bin/bbmap/bbduk.sh in=Cat_BES_325_500.fasta qfin=Cat_BES_325_500.qual out=Cat_BES_325_500_trim.fa qfout=Cat_BES_325_500_trim.qual ftl=125 ftr=125
error:
java -Djava.library.path=/gscuser/pminx/bin/bbmap/jni/ -ea -Xmx991m -Xms991m -cp /gscuser/pminx/bin/bbmap/current/ jgi.BBDukF in=Cat_BES_325_500.fasta qfin=Cat_BES_325_500.qual out=Cat_BES_325_500_trim.fa qfout=Cat_BES_325_500_trim.qual ftl=125 ftr=125
Executing jgi.BBDukF [in=Cat_BES_325_500.fasta, qfin=Cat_BES_325_500.qual, out=Cat_BES_325_500_trim.fa, qfout=Cat_BES_325_500_trim.qual, ftl=125, ftr=125]
BBDuk version 36.30
Exception in thread "main" java.lang.RuntimeException: Unknown parameter qfout=Cat_BES_325_500_trim.qual
at jgi.BBDukF.<init>(BBDukF.java:421)
at jgi.BBDukF.main(BBDukF.java:66)
Comment
-
Hi Pat,
My apologies, I somehow forgot to include support for qual file output with BBDuk! And, reformat does not support ftl/ftr. I will fix BBDuk's support for qual files in the next release. What you can do now is this (and apologies for it being tedious):
reformat.sh in=Cat_BES_325_500.fasta qfin=Cat_BES_325_500.qual out=x.fq
bbduk.sh in=x.fq out=y.fq ftl=125 ftr=125
reformat.sh in=y.fq out=Cat_BES_325_500_trim.fa qfout=Cat_BES_325_500_trim.qual
However, please note that "ftl=125 ftr=125" will trim everything to the left of position 125, and everything to the right of position 125, leaving you with only 1bp reads, which are not very useful. What are you trying to accomplish with this command?
Comment
Latest Articles
Collapse
-
by seqadmin
The field of immunogenetics explores how genetic variations influence immune responses and susceptibility to disease. In a recent SEQanswers webinar, Oscar Rodriguez, Ph.D., Postdoctoral Researcher at the University of Louisville, and Ruben Martínez Barricarte, Ph.D., Assistant Professor of Medicine at Vanderbilt University, shared recent advancements in immunogenetics. This article discusses their research on genetic variation in antibody loci, antibody production processes,...-
Channel: Articles
11-06-2024, 07:24 PM -
-
by seqadmin
Next-generation sequencing (NGS) and quantitative polymerase chain reaction (qPCR) are essential techniques for investigating the genome, transcriptome, and epigenome. In many cases, choosing the appropriate technique is straightforward, but in others, it can be more challenging to determine the most effective option. A simple distinction is that smaller, more focused projects are typically better suited for qPCR, while larger, more complex datasets benefit from NGS. However,...-
Channel: Articles
10-18-2024, 07:11 AM -
ad_right_rmr
Collapse
News
Collapse
Topics | Statistics | Last Post | ||
---|---|---|---|---|
Started by seqadmin, Today, 11:09 AM
|
0 responses
23 views
0 likes
|
Last Post
by seqadmin
Today, 11:09 AM
|
||
Started by seqadmin, Today, 06:13 AM
|
0 responses
20 views
0 likes
|
Last Post
by seqadmin
Today, 06:13 AM
|
||
Started by seqadmin, 11-01-2024, 06:09 AM
|
0 responses
30 views
0 likes
|
Last Post
by seqadmin
11-01-2024, 06:09 AM
|
||
New Model Aims to Explain Polygenic Diseases by Connecting Genomic Mutations and Regulatory Networks
by seqadmin
Started by seqadmin, 10-30-2024, 05:31 AM
|
0 responses
21 views
0 likes
|
Last Post
by seqadmin
10-30-2024, 05:31 AM
|
Comment