Seqanswers Leaderboard Ad

**GenoMax** · 01-11-2016, 04:01 AM

Originally posted by willd View Post

Dear Brian,

I have many reads from a very specific scRNA contaminant of length 299 bp. Would BBDuk's contaminant filtering be a suitable tool for removing these, and what parameters would you recommend? (my reads are 100 bp PE)

Would it be possible to do it in one run, e.g. by adding this 299bp sequence to the current adapters.fa file that you have provided in the resources folder?

Many thanks for any help

While you can probably use BBDuk, you may want to look at BBSplit which should be able to bin reads that map to your sequence away from the rest of the data.

**Brian Bushnell** · 01-11-2016, 07:23 PM

Originally posted by willd View Post

Dear Brian,

I have many reads from a very specific scRNA contaminant of length 299 bp. Would BBDuk's contaminant filtering be a suitable tool for removing these, and what parameters would you recommend? (my reads are 100 bp PE)

Would it be possible to do it in one run along with adapter and quality trimming, e.g. by adding this 299bp sequence to the current adapters.fa file that you have provided in the resources folder?

Many thanks for any help

You can do this with either mapping (BBMap/BBSplit) or kmer-filtering (BBDuk/Seal); the result should be the same. BBDuk will be faster though.

Assuming the reads are named "read1.fq" and "read2.fq", and your 299bp sequence is in sequence.fa:

bbduk.sh in=read#.fq out=filtered#.fq ref=sequence.fa k=31

You can adjust the sensitivity/specificity with other flags (like adding hdist=1, to catch the very-low-quality reads) but the command above should be adequate. It's technically possible to do this at the same time as adapter-trimming by adding the sequence to the adapters file; reads of that sequence should end up fully trimmed and thus will be discarded because they are too short (shorter than minlen which is 10 by default, I think). But you would achieve higher sensitivity and specificity by filtering in a different pass, so I'd recommend doing it in 2 steps.

**willd** · 01-13-2016, 05:11 AM

Dear Brian - thanks very much!

In case it's of interest: I tried both 1 pass (contaminant filtering at same time as adapter-trimming) and 2 pass (adapter trimming then contaminant filtering) on two of my samples, and the output was close to identical: the number of remaining reads was ~0.4% lower for 1 pass v 2 pass (for both samples); differences in plots of per-base sequence quality, GC curves, etc. were almost unnoticeable.

**willd** · 01-14-2016, 03:08 AM

Dear Brian, for one fastq file I got the error output below. FastQC did not show any particular problems with this file. The error refers to a specific read - could there be a problem for BBDuk to parse this read?

Code:

    Executing jgi.BBDukF [in1=../../C467_RE_B00GGOO_8_1_C68T3ACXX.IND1.fastq.gz, in2=../../C467_RE_B00GGOO_8_2_C68T3ACXX.IND1.fastq.gz, out1=B00GGOO_1_TrimPass1.fastq, out2=B00GGOO_2_TrimPass1.fastq, minlen=25, qtrim=r, trimq=10, ktrim=r, k=25, mink=11, ref=my_adapters.fa, hdist=1, tpe, tbo]
    
    BBDuk version 35.82
    maskMiddle was disabled because useShortKmers=true
    Initial:
    Memory: max=777m, free=754m, used=23m
    
    Added 64944 kmers; time: 	0.212 seconds.
    Memory: max=777m, free=717m, used=60m
    
    Input is being processed as paired
    Started output streams:	0.176 seconds.
    java.lang.AssertionError: 
    Error in ../../C467_RE_B00GGOO_8_1_C68T3ACXX.IND1.fastq.gz, line 1210999, with these 4 lines:
    CTCGGTCACAACCTGAGGGCTCAGCCCCAGGGCACGGCCATAGCCGGCGCGAGGCCCAAGGCTGCGGTGCTGCGGCTACTGGCGGCGACGAGCCGATCGGA
    +
    CCCFFFFFHGHFHIJIJJJJJJIICHJIIGIICHIEGGBHHIHIJJJHFDBD:@BDDDDDDD>ABDD>@@CACDD9>@CDDDBBD################
    @HISEQ10:C68T3ACXX:8:1101:19859:76022/1
    
    	at stream.FASTQ.quadToRead(FASTQ.java:722)
    	at stream.FASTQ.toReadList(FASTQ.java:653)
    	at stream.FastqReadInputStream.fillBuffer(FastqReadInputStream.java:111)
    	at stream.FastqReadInputStream.nextList(FastqReadInputStream.java:96)
    	at stream.ConcurrentGenericReadInputStream$ReadThread.readLists(ConcurrentGenericReadInputStream.java:656)
    	at stream.ConcurrentGenericReadInputStream$ReadThread.run(ConcurrentGenericReadInputStream.java:635)
    Exception in thread "Thread-19" java.lang.NullPointerException
    	at jgi.BBDukF$ProcessThread.run(BBDukF.java:2399)
    Exception in thread "Thread-20" java.lang.NullPointerException
    	at jgi.BBDukF$ProcessThread.run(BBDukF.java:2399)
    Exception in thread "Thread-18" java.lang.NullPointerException
    	at jgi.BBDukF$ProcessThread.run(BBDukF.java:2399)
    Exception in thread "Thread-23" java.lang.NullPointerException
    	at jgi.BBDukF$ProcessThread.run(BBDukF.java:2399)
    Exception in thread "Thread-21" Exception in thread "Thread-24" java.lang.NullPointerException
    	at jgi.BBDukF$ProcessThread.run(BBDukF.java:2399)
    java.lang.NullPointerException
    	at jgi.BBDukF$ProcessThread.run(BBDukF.java:2399)
    Exception in thread "Thread-22" java.lang.NullPointerException
    	at jgi.BBDukF$ProcessThread.run(BBDukF.java:2399)
    Processing time:   		3.675 seconds.
    
    Input:                  	603200 reads 		60923200 bases.
    QTrimmed:               	67817 reads (11.24%) 	3329362 bases (5.46%)
    KTrimmed:               	52402 reads (8.69%) 	1521198 bases (2.50%)
    Trimmed by overlap:     	36830 reads (6.11%) 	190222 bases (0.31%)
    Result:                 	575826 reads (95.46%) 	55882418 bases (91.73%)
    
    Time:   			4.073 seconds.
    Reads Processed:        603k 	148.08k reads/sec
    Bases Processed:      60923k 	14.96m bases/sec
    Exception in thread "main" java.lang.RuntimeException: jgi.BBDukF terminated in an error state; the output may be corrupt.
    	at jgi.BBDukF.process(BBDukF.java:822)
    	at jgi.BBDukF.main(BBDukF.java:70)

**Brian Bushnell** · 01-14-2016, 07:14 AM

Originally posted by willd View Post

Dear Brian, for one fastq file I got the error output below. FastQC did not show any particular problems with this file. The error refers to a specific read - could there be a problem for BBDuk to parse this read?

It looks like the fastq file is corrupt and is missing a line, or something similar.

Code:

CTCGGTCACAACCTGAGGGCTCAGCCCCAGGGCACGGCCATAGCCGGCGCGAGGCCCAAGGCTGCGGTGCTGCGGCTACTGGCGGCGACGAGCCGATCGGA
+
CCCFFFFFHGHFHIJIJJJJJJIICHJIIGIICHIEGGBHHIHIJJJHFDBD:@BDDDDDDD>ABDD>@@CACDD9>@CDDDBBD################
@HISEQ10:C68T3ACXX:8:1101:19859:76022/1#ID Median_fold

should be more like

Code:

@HISEQ10:C68T3ACXX:8:1101:19859:76022/1#ID Median_fold
CTCGGTCACAACCTGAGGGCTCAGCCCCAGGGCACGGCCATAGCCGGCGCGAGGCCCAAGGCTGCGGTGCTGCGGCTACTGGCGGCGACGAGCCGATCGGA
+
CCCFFFFFHGHFHIJIJJJJJJIICHJIIGIICHIEGGBHHIHIJJJHFDBD:@BDDDDDDD>ABDD>@@CACDD9>@CDDDBBD################

Try looking at the file in that region, by piping zcat to grep around the term "@HISEQ10:C68T3ACXX:8:1101:19859:76022/1" with plus and minus 10 lines, then posting the output.

Actually, "#ID Median_fold" should not be in there at all. It looks like you somehow got unrelated text mixed into your fastq file, which is what is corrupting it.

**willd** · 01-15-2016, 03:02 AM

Dear Brian, indeed there seems to be a "@HISEQ....." line missing in this region of the file, and I found that this has happened in at least one other part of the file as well. I guess I will check with the platform who generated this if they have another version of the file, though the md5sum checks out so I guess this sample might be lost if we can't trust it.

Anyway... NO PROBLEM WITH YOUR EXCELLENT BBDuk! (sorry to have suggested that, and thanks again for your help).

Code:

@HISEQ10:C68T3ACXX:8:1101:20187:76019/1
ATTTATTTATTTTATTATTTTTTTGAAACATTTTTACATGTTTATTTATCAAAAGTGAAAAAGCATACAATTCAACACAATTTCAATAGCATTTGACACCA
+
CCCFFFFFHHHHHJJJJJJJJJJJJJJJJJJJJJJJJJJJIJJJJJJJJJJJJJJFHIJJJJJIGGIJJHHHHGHEDFFDEEEEEEEEDCDEDDEEDDDBB
@HISEQ10:C68T3ACXX:8:1101:20097:76043/1
GGTTAGTATAGCTTAGTTAAACTTTCGTTTATTGCTAAAGGTTAATCACTGCTGTTTCCCGTGGGGGTGTGGCTAGGCTAAGCGTTTTGAGCTGCATTGCT
+
@@?DDFADFHHHHJIJHGIJJIJJIJIFHIJGJIAC,=?B01:19795:76016/1
CTCGGTCACAACCTGAGGGCTCAGCCCCAGGGCACGGCCATAGCCGGCGCGAGGCCCAAGGCTGCGGTGCTGCGGCTACTGGCGGCGACGAGCCGATCGGA
+
CCCFFFFFHGHFHIJIJJJJJJIICHJIIGIICHIEGGBHHIHIJJJHFDBD:@BDDDDDDD>ABDD>@@CACDD9>@CDDDBBD################
@HISEQ10:C68T3ACXX:8:1101:19859:76022/1
GTACATTTAACCCAGTTTAGTGGCAAGTTCTTTAGCCTTTGCCTTTTCGAGCTTGGCGATACGAGCCACAGACTTAGGACCCAGGACGTTGCCACCCCAGT
+
C@BFFFFFHHHHHJJHIJJIJIJIJJJHIIIIJIJIJJJJJJIIIJJJJIJIIJJJJJJIJIJHHHFFFEEEEEDCCCCDDDDDDDDDDDDDDDDDDDDD>
@HISEQ10:C68T3ACXX:8:1101:19768:76027/1
ACACGCTGGACTCCCCTGGCCATTTCTGGGTTATATAAATGTTTGATGAAAAATCTCTCAGGGGAAATCAAATTTTTGTAAGAAGTTAATGACAGCCCATC
+
@@CFFFFFHHHHHJJJJJJDIJJJJJJIIJDHIJIIFIIJJGHIJIJJGIFIBHJJJJJHIIJJGHFHHFFFFFF;CDEEEDDDDCCDEEDDDDDDDDDDC
@HISEQ10:C68T3ACXX:8:1101:19877:76028/1
AGCCATCCAACCACTCAGTCTTGGCAGTGCAGATGAAAAACTGGGAACCATTTGTGTTGGATCGGAAGAGCACACGACTGAACTCAAGTCACATCACGATC
+
;;?DDDDDFFFAFD3C@:<+C<CBEEG@9CFFCAFGI9DD)?F<2)9?1D3??4).8)==C@)'5C1=263?;@@<',;8:@B(5(5>>:,>AABBB####

(PS: the "#ID Median_fold" that you mentioned was not in my output)

**chrism** · 03-02-2016, 12:15 AM

Filtering different contaminants with different k-mer sizes

Dear Brian,

I use BBDuk for filtering reads based on quality, adapter content, and chloroplast content. It is a great tool!

For some applications, I also want to filter out reads that contain a restriction enzyme motif (a sequence of length 4-8).

Currently, this leads to up to four BBduk jobs that are connected by a pipe:

Code:

bbduk.sh  in=in_R1.fastq in2=in_R2.fastq removeifeitherbad=t out=stdout.fq ref=restriction_motifs.fa k=4 rcomp=t maskmiddle=f qout=33  | bbduk.sh  in=stdin.fq int=t removeifeitherbad=t out=stdout.fq minlength=30 qout=33 ref=adapters.fa hammingdistance=2 k=20 mink=10 hammingdistance2=1 ktrim=n maxns=0 maxbadkmers=0 rcomp=t | bbduk.sh  in=stdin.fq int=t removeifeitherbad=t out=stdout.fq ref=chloroplast.fa k=31 rcomp=t qout=33  | bbduk.sh in=stdin.fq int=t out1=out_R1.fastq removeifeitherbad=t out2=out_R2.fastq minlength=30 qtrim=r trimq=15 minavgquality=15 qout=33

My question is:
Can I condense these four jobs into less (ideally one) jobs?

I assume that if a FASTA entry in the file given with "ref=" is smaller than the k-mer size, it is effectively not used for the filtering.
Therefore, the main problem that I see is to define different k-mer sizes for different references (e.g. 4 for the restriction motif, and 31 for the chloroplast).

Thanks!

Chris

**Brian Bushnell** · 03-02-2016, 08:05 PM

Hi Chris,

You can merge the last two stages:

Code:

bbduk.sh in=stdin.fq int=t out=out_R#.fastq removeifeitherbad=t minlength=30 qtrim=r trimq=15 minavgquality=15 qout=33 ref=chloroplast.fa k=31 rcomp=t

And, you don't actually need to set "qout=33 rcomp=t removeifeitherbad=t" since those are all defaults, so the last two stages shorten to:

Code:

bbduk.sh in=stdin.fq int=t out=out_R#.fastq minlength=30 qtrim=r trimq=15 minavgquality=15 ref=chloroplast.fa k=31

Unfortunately, you cannot combine stages 1, 2, and 3, as they each need different kmer lengths or hamming distances. So the best way to do it is in a pipeline. Your assumption about ignoring sequences shorter than kmer length is correct, but unfortunately it does not help in this situation.

Note that (depending on the environment) BBDuk will try to grab around half of available memory for each process, which may lead to unpredictable results. I suggest using the -Xmx flag in each stage to tell it how much memory to use (-Xmx1g should be fine in each case) whenever piping BBDuk.

The way you accomplished variable-kmer-length adapter-filtering via the "ktrim=n" and "maxns=0" combo was clever, by the way!

**chrism** · 03-07-2016, 09:40 AM

Dear Brian,

thanks for your help!

Best,
Chris

**cmbetts** · 06-14-2016, 01:08 PM

Some pairs breaking after multiple steps

I've been doing some multi-step trimming of paired libraries that has been causing a low frequency of pairs breaking, and I'm trying to figure out if there's a way to avoid having this happen.
These libraries have a structure where read1 contains exclusively sample barcoding information, and read2 contains the sequences that will eventually be aligned. Ideally, I would like to preserve every base of read1, but on read2, trim several bases from the 5' of the read, as well as do some kmer and quality trimming on read2.

my solution was to do this in two steps, first doing the 5' trim on just read2, then doing the kmer trim with the skipr1 flag active. (I found out the hard way that the skipr1 flag doesn't skip ftl operations)
ie

Code:

bbduk.sh in=read2.fq.gz ftl=6 out=read2_ftl6.fq.gz
bbduk.sh in1=read1.fq in2=read2_ftl6.fq.gz out=read#_trimmed.fq.gz literal=<adapter sequence> ktrim=r k=23 mink=11 hdist=1 skipr1=TRUE

However, even though there's the same number of reads before and after 5' trimming, and the same number of reads in the final trimmed reads, some of the pairing breaks at a low frequency (I haven't picked out all the broken pairs yet, but the head and tail of the file have the same read names, and the first broken pair is ~2000 reads into the files). Is there a way to run this sort of trimming on only read2 while keeping the pairs properly mated? My other alternative would be to do the trimming on just read2, and then write a custom python script to pull out the correct ordered reads in the read1 fastq.

Thanks

**Brian Bushnell** · 06-14-2016, 01:15 PM

BBDuk is multithreaded, and will reorder reads unless you explicitly tell it not to via the "ordered" flag. So in this case, you could do this:

Code:

bbduk.sh in=read2.fq.gz ftl=6 out=read2_ftl6.fq.gz [U][B]ordered minlength=1[/B][/U]
bbduk.sh in1=read1.fq in2=read2_ftl6.fq.gz out=read#_trimmed.fq.gz literal=<adapter sequence> ktrim=r k=23 mink=11 hdist=1 skipr1=TRUE

It does not matter whether the reads are re-ordered in the second pass, because pairs will be kept together; but when processing just one of the files alone, it's important to retain the original order, and not discard any reads (which is the reason for the minlength=1 flag; the default is 10).

It's possible to repair the pairing with "repair.sh" but of course it's always most efficient to just avoid the problem in the first place.

**cmbetts** · 06-14-2016, 02:41 PM

Thanks!
I thought it was probably something simple I was leaving out. Adding in the "ordered minlength=1" to the first command preserved the pairs as checked with reformat.sh vpair=TRUE

**Mizzou55** · 09-09-2016, 12:04 PM

Hi, I'm trying to trim fasta reads w/ quals but get this error.

Thanks,

Pat

~pminx/bin/bbmap/bbduk.sh in=Cat_BES_325_500.fasta qfin=Cat_BES_325_500.qual out=Cat_BES_325_500_trim.fa qfout=Cat_BES_325_500_trim.qual ftl=125 ftr=125

error:
java -Djava.library.path=/gscuser/pminx/bin/bbmap/jni/ -ea -Xmx991m -Xms991m -cp /gscuser/pminx/bin/bbmap/current/ jgi.BBDukF in=Cat_BES_325_500.fasta qfin=Cat_BES_325_500.qual out=Cat_BES_325_500_trim.fa qfout=Cat_BES_325_500_trim.qual ftl=125 ftr=125
Executing jgi.BBDukF [in=Cat_BES_325_500.fasta, qfin=Cat_BES_325_500.qual, out=Cat_BES_325_500_trim.fa, qfout=Cat_BES_325_500_trim.qual, ftl=125, ftr=125]
BBDuk version 36.30
Exception in thread "main" java.lang.RuntimeException: Unknown parameter qfout=Cat_BES_325_500_trim.qual
at jgi.BBDukF.<init>(BBDukF.java:421)
at jgi.BBDukF.main(BBDukF.java:66)

**Brian Bushnell** · 09-09-2016, 12:38 PM

Hi Pat,

My apologies, I somehow forgot to include support for qual file output with BBDuk! And, reformat does not support ftl/ftr. I will fix BBDuk's support for qual files in the next release. What you can do now is this (and apologies for it being tedious):

reformat.sh in=Cat_BES_325_500.fasta qfin=Cat_BES_325_500.qual out=x.fq
bbduk.sh in=x.fq out=y.fq ftl=125 ftr=125
reformat.sh in=y.fq out=Cat_BES_325_500_trim.fa qfout=Cat_BES_325_500_trim.qual

However, please note that "ftl=125 ftr=125" will trim everything to the left of position 125, and everything to the right of position 125, leaving you with only 1bp reads, which are not very useful. What are you trying to accomplish with this command?

**Mizzou55** · 09-12-2016, 06:54 PM

Thanks for the quick response. I'll give it a try.

Topics	Statistics	Last Post
ASHG 2024 Highlights – Part Two by seqadmin Started by seqadmin, Today, 11:09 AM	0 responses 23 views 0 likes	Last Post by seqadmin Today, 11:09 AM
ASHG 2024 Highlights – Part One by seqadmin Started by seqadmin, Today, 06:13 AM	0 responses 20 views 0 likes	Last Post by seqadmin Today, 06:13 AM
Seq-Scope Expands Possibilities for High-Resolution Gene Expression Analysis by seqadmin Started by seqadmin, 11-01-2024, 06:09 AM	0 responses 30 views 0 likes	Last Post by seqadmin 11-01-2024, 06:09 AM
New Model Aims to Explain Polygenic Diseases by Connecting Genomic Mutations and Regulatory Networks by seqadmin Started by seqadmin, 10-30-2024, 05:31 AM	0 responses 21 views 0 likes	Last Post by seqadmin 10-30-2024, 05:31 AM

Seqanswers Leaderboard Ad

Announcement

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Latest Articles

ad_right_rmr

News