  • duplicate read removal

    Hello again:

    I have a question regarding duplicate removal. Previously I detected in my files (PE library, 2x100bp, HiSeq 2000) some reads that match perfectly with the "Process Controls for TruSeq® Sample Preparation Kits".

    It seems that these reads biased the Kmer content in a strange way (I posted before in http://seqanswers.com/forums/showthread.php?t=45539 in case you want to see the FastQC report).

    But I don't know whether I should be concerned about removing these reads before the de novo assembly. Mapping against a reference genome is not an option, as almost none of my reads map to the closest reference genome (which isn't so close).

    I do think that these controls shouldn't affect the assembly, but I'm still not sure about what should be done.

    So, is it worth removing these reads? Do you do this step in your assembly pipelines?

    Thank you very much

    Gabriel

  • #2
    You can quickly remove them with BBDuk, like this:

    bbduk.sh -Xmx1g in1=read1.fq in2=read2.fq out1=clean1.fq out2=clean2.fq ref=control.fa stats=stats.txt

    They may not affect the assembly much, but there's no reason to risk it. You'll have to remove them either before or after assembly, so you may as well remove them before. Also, that will let you run FastQC again on a cleaner dataset so you can look for other potential problems in the real data. The "stats.txt" output is optional, but it will let you see what fraction of reads hit which artifact sequence.
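    The k-mer filtering idea can be sketched in a few lines of Python (a toy illustration of the concept only, not BBTools code; all names here are made up):

```python
# Toy sketch of k-mer-based read filtering (the idea behind BBDuk's
# ref= filtering mode). Illustrative only, not BBTools code.

def kmers(seq, k):
    """All length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def filter_reads(reads, ref, k):
    """Discard any read sharing at least one k-mer with the reference."""
    ref_kmers = kmers(ref, k)
    return [r for r in reads if not (kmers(r, k) & ref_kmers)]

# Tiny demo with k=5 so the sequences stay short; BBDuk itself
# typically uses k in the 25-31 range.
control = "ACGTACGTAC"
reads = ["ACGTAGGGGG", "TTTTTTTTTT"]
print(filter_reads(reads, control, k=5))   # ['TTTTTTTTTT']
```

    The real tool streams reads and uses a compact kmer table rather than Python sets, but the matching logic is the same idea.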

    P.S. The BBTools package also includes a duplicate read removal tool, which works on paired reads, though it requires a lot of memory (~1kb per read). It deduplicates based on sequence, not mapping (so it does not require a reference), and requires both read 1 and read 2 to be equal to the read 1 and read 2 of the other pair. It's insanely fast.

    dedupe.sh in1=read1.fq in2=read2.fq out1=deduped1.fq out2=deduped2.fq ac=f
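    The pair-matching rule can be sketched like this (an illustrative toy under the description above, not the actual Dedupe implementation):

```python
# Minimal sketch of pair-based deduplication as described above, with
# ac=f (exact duplicates only, no absorbing of contained sequences):
# a pair is dropped only when BOTH mates match both mates of a
# previously seen pair. Illustrative only, not BBTools code.

def dedupe_pairs(pairs):
    seen = set()
    kept = []
    for r1, r2 in pairs:
        if (r1, r2) not in seen:
            seen.add((r1, r2))
            kept.append((r1, r2))
    return kept

pairs = [
    ("ACGT", "TTGC"),
    ("ACGT", "TTGC"),   # exact duplicate of the first pair: dropped
    ("ACGT", "GGAA"),   # read 1 matches but read 2 differs: kept
]
print(dedupe_pairs(pairs))   # [('ACGT', 'TTGC'), ('ACGT', 'GGAA')]
```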
    Last edited by Brian Bushnell; 08-17-2014, 08:16 AM.



    • #3
      Originally posted by Brian Bushnell View Post
      You can quickly remove them with BBDuk [...]
      Hi Brian:

      Thanks a lot for your post. It turns out I had already installed your package on my computer, and indeed I noticed the duplicate read removal tool, but I didn't want to go that way (or only as a last resort), since it would eliminate true read duplicates, and I'm interested in removing only the control sequences.

      As my reads are 101 bp and the control sequences are longer than 101 bp (150 bp, 250 bp, and so on), is it okay to put the full-length control sequences in the fasta file (control.fa), or should I reduce their length to 101 bp?

      Before posting for help, I tried BBDuk, but as I didn't know exactly how to use it, I used trimming parameters (wrong move, d'oh :P).

      These were the lines that I executed previously:
      bbduk.sh in1=(file) in2=(file) out1=(file) out2=(file) ref=adaptadores.fa ktrim=r minlength=20
      bbduk.sh in1=(file) in2=(file) out1=(file) out2=(file) ref=adaptadores.fa ktrim=r minlength=20 k=80

      I'll try again with the exact command line that you suggested.

      Thanks again Brian!!
      Last edited by gab0; 08-17-2014, 12:36 PM.



      • #4
        Originally posted by gab0 View Post
        Before posting for help, I tried BBDuk but as I didn't know that the control sequences were included in the package database, I gave a file with substrings of the control sequence. Also, when I tested it, I used trimming parameters (wrong move, d'ouh :P).
        The BBTools distribution only includes the TruSeq adapter sequences, which should generally be right-trimmed (since they occur at the 3' end of reads with insert size shorter than read length, but the sequence prior to that is still valid); other non-genomic sequences should be filtered out entirely, but I don't include a file containing those.

        As for "k=80" in your command line... BBDuk has a maximum kmer length of 31. If you tell it to use a longer kmer length, it will emulate this by requiring consecutive matching kmers; for example, k=80 would require 50 consecutive matching 31-mers, which in practice gives almost identical results to using true 80-mers, but is not quite the same.

        Most importantly, "k=N" will not filter sequences shorter than N bp. Since adapters and primers are generally shorter than 80 bp, k=80 is not useful for removing them. The information content of a random 31 bp sequence is 62 bits, so it has a 1 in 4,611,686,018,427,387,904 chance of matching another random 31-mer. The reason I included the ability to handle k>31 in BBDuk is purely for homologous genes, low-complexity sequences, and highly conserved sequences like ribosomal RNA; the chance of a 31-mer from an artificial sequence accidentally matching a real organism is negligible. So, I recommend using k=31 or less when filtering/trimming synthetic sequences (typically, I use k=25 for adapter-trimming and k=27 for filtering, because some synthetic sequences are shorter than 31 bp); I only use k>31 when trying to distinguish between different organisms.
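        The arithmetic above can be checked directly (just the numbers, not BBDuk code):

```python
# Checking the k-mer arithmetic from the post. A base over {A,C,G,T}
# carries 2 bits, and a long kmer is emulated with consecutive 31-mers.

k = 31
print(2 * k)              # 62 bits of information in a random 31-mer
print(4 ** k)             # 4611686018427387904 possible 31-mers

emulated_k = 80
print(emulated_k - k + 1) # 50 consecutive 31-mers span an 80-mer
```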

        As for read length - since BBDuk operates on 31-mers (or shorter), it's best to include the entire sequence in the reference, no matter how long. As an aside, since you are using 101 bp reads, I advise you to trim the last 1 bp before using them, as it has a very high error rate (up to 7% in my tests). You can do this with BBDuk using the "ftr=99" flag (which means force-trim-right after position 99, using 0-based coordinates, so only the last 1 bp gets trimmed).
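        The effect of "ftr=99" can be sketched as a simple slice (illustrative only, not BBDuk's implementation):

```python
# Sketch of what the ftr (force-trim-right) flag does, per the
# description above: keep 0-based positions 0 through ftr, trimming
# everything after. Illustrative only, not BBDuk code.

def force_trim_right(read, ftr):
    return read[:ftr + 1]

read = "A" * 101                    # a 101 bp read
trimmed = force_trim_right(read, 99)
print(len(trimmed))                 # 100: only the last base is removed
```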
        Last edited by Brian Bushnell; 08-17-2014, 12:51 PM.



        • #5
          Hi Brian

          Your suggestion worked like a charm; now I have my files without these control sequences, and FastQC does not show any Kmer warning at all. I ran it exactly as you suggested, using a fasta file with the full-length control sequences.

          Thanks for your help!!



          • #6
            Originally posted by Brian Bushnell View Post
            other non-genomic sequences should be filtered out entirely, but I don't include a file containing those.
            I've seen this suggestion before; where can I find, or how can I generate, such a file?



            • #7
              I installed BBMap on my Windows 7 machine (Java RE 1.7 is installed), but I can't seem to get Dedupe to read any of my input files. They're standard paired fastq files I named test1.fq and test2.fq. I put them in the bbmap directory.

              Here's the command:
              dedupe.sh in1=test1.fq in2=test2.fq out1=dedup1.fq out2=dedup2.fq ac=f

              Here is the output:
              Code:
              java -Djava.library.path=/c/Users/Steven/bbmap/jni/ -ea -Xmx1310m -Xms1310m -cp /c/Users/Steven/bbmap/current/ jgi.Dedupe in1=test1.fq in2=test2.fq  out1=dd1.fq out2=dd2.fq ac=f
              
              Executing jgi.Dedupe [in1=test1.fq, in2=test2.fq, out1=dedup1.fq, out2=dedup2.fq, ac=f]
              
              Exception in thread "main" java.lang.RuntimeException: Unknown parameter in1=test1.fq
              
              at jgi.Dedupe.<init>(Dedupe.java:342)
              at jgi.Dedupe.main(Dedupe.java:56)
              Using -in1=test1.fq etc., or in=test#.fq etc., was no better.
              Last edited by ssully; 09-29-2014, 04:59 PM.



              • #8
                ssully,

                I forgot to add the "in1" and "in2" flags to Dedupe's parser, since we use interleaved files for everything. I'll fix that tomorrow. For now, please do this:

                java -ea -Xmx200m -cp C:\Users\Steven\bbmap\current\ jgi.ReformatReads in1=test1.fq in2=test2.fq out=interleaved.fq

                ...then use the resulting interleaved file for Dedupe:

                java -da -Xmx1310m -Xms1310m -cp C:\Users\Steven\bbmap\current\ jgi.Dedupe in=interleaved.fq out=dd.fq ac=f interleaved=t

                Then you can deinterleave it with Reformat, if you wish:

                java -ea -Xmx200m -cp C:\Users\Steven\bbmap\current\ jgi.ReformatReads in=dd.fq out1=dd1.fq out2=dd2.fq

                Sorry for the inconvenience!

                -Brian
                Last edited by Brian Bushnell; 09-30-2014, 09:58 AM.



                • #9
                  Tried it, but 'java' was not recognized as a command. I added the path to 'my' java.exe and reran, but then it threw this error:
                  Code:
                  Error: Could not find or load main class jgi.ReformatReads
                  There is a 'ReformatReads.class' under /c/Users/Steven/bbmap/current/jgi/, but that path doesn't work either:
                  Code:
                  java -ea -Xmx200m -cp /c/Users/Steven/bbmap/current/jgi/ jgi.ReformatReads in1=test1.fq in2=test2.fq out=interleaved.fq
                  Error: Could not find or load main class jgi.ReformatReads

                  As a last resort, this command (removing the space)
                  java -ea -Xmx200m -cp /c/Users/Steven/bbmap/current/jgi/jgi.ReformatReads in1=test1.fq in2=test2.fq out=interleaved.fq

                  returns

                  Code:
                  Error: Could not find or load main class in1=test1.fq
                  Obviously I'm not familiar with Java syntax, so any help is appreciated.
                  Last edited by ssully; 09-30-2014, 09:11 AM.



                  • #10
                    I updated my prior post. The command should be:

                    java -ea -Xmx200m -cp C:\Users\Steven\bbmap\current\ jgi.ReformatReads in1=test1.fq in2=test2.fq out=interleaved.fq

                    Note that Dedupe uses a lot of memory, ~1kb per read. How big are your input files?
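                    The ~1kb-per-read figure gives a quick way to estimate the heap needed (dedupe_mem_gb is a hypothetical helper for illustration, not a BBTools API):

```python
# Back-of-the-envelope memory estimate from the ~1 kB-per-read figure
# mentioned above. Illustrative helper, not part of BBTools.

def dedupe_mem_gb(n_reads, bytes_per_read=1024):
    return n_reads * bytes_per_read / 1024 ** 3

# e.g. 140 million reads would need on the order of 130+ GB of heap
print(round(dedupe_mem_gb(140_000_000)))
```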
                    Last edited by Brian Bushnell; 09-30-2014, 09:59 AM.



                    • #11
                      That works! Thanks.



                      • #12
                        Hmm... the interleaved file is 140 million reads, and I don't have 140 GB of RAM. I guess I'll have to run this on a Unix cluster. Is the corrected package (for running the Dedupe command) ready for download?
                        Last edited by ssully; 10-01-2014, 11:22 AM.



                        • #13
                          In that case... you will need to deduplicate using a sorting-based approach, such as mapping to a reference, sorting by position, and then marking duplicates with a tool like Picard. Without a reference, you can still sort the pairs alphabetically, but I don't know of any tools that do that (for paired reads) except a deprecated one I wrote a while ago. Perhaps someone else will have a suggestion.
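                          The reference-free, sort-based approach can be sketched like this (a toy illustration; at 140M reads you would need an external, disk-based sort rather than an in-memory one):

```python
# Sketch of reference-free, sort-based pair deduplication as described
# above: sort pairs lexicographically, then drop adjacent duplicates.
# Illustrative only; sorted() holds everything in memory, so real use
# at this scale would swap in an external sort (e.g. a disk-based one).

def sort_dedupe_pairs(pairs):
    out = []
    prev = None
    for pair in sorted(pairs):
        if pair != prev:
            out.append(pair)
        prev = pair
    return out

pairs = [("TTGC", "ACGT"), ("ACGT", "GGAA"), ("TTGC", "ACGT")]
print(sort_dedupe_pairs(pairs))  # [('ACGT', 'GGAA'), ('TTGC', 'ACGT')]
```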



                          • #14
                            When I wrote 'I don't have 140 GB of RAM' I meant on my workstation (it has 32 GB). But I do have access to a cluster where I can allocate >140 GB to a job. Should I try that, or are you suggesting I shouldn't bother?
                            Last edited by ssully; 10-02-2014, 02:04 PM.



                            • #15
                              The latest version (33.54, uploaded yesterday) allows dual input files for Dedupe, though it will still only output a single interleaved file. So on your high-memory cluster, you can do this:

                              java -ea -Xmx140g -Xms140g -cp /path/ jgi.Dedupe in1=read1.fq in2=read2.fq out=deduped.fq ac=f

                              You will still need to de-interleave the output with Reformat, if you need 2 output files:

                              java -ea -Xmx200m -cp /path/ jgi.ReformatReads in=deduped.fq out1=dd1.fq out2=dd2.fq

                              Whether or not you SHOULD deduplicate the data depends on your experiment. What are you doing? It's often a good idea if you expect lots of PCR duplicates, but not necessarily otherwise.
                              Last edited by Brian Bushnell; 01-12-2015, 09:49 AM.
