duplicate read removal

Brian Bushnell replied

10-02-2014, 03:20 PM
The latest version (33.54, uploaded yesterday) allows dual input files for Dedupe, though it will still only output a single interleaved file. So on your high-memory cluster, you can do this:

java -ea -Xmx140g -Xms140g -cp /path/ jgi.Dedupe in1=read1.fq in2=read2.fq out=deduped.fq ac=f

You will still need to de-interleave the output with Reformat, if you need 2 output files:

java -ea -Xmx200m -cp /path/ jgi.ReformatReads in=deduped.fq out1=dd1.fq out2=dd2.fq

Whether or not you SHOULD deduplicate the data depends on your experiment. What are you doing? It's often a good idea if you expect lots of PCR duplicates, but not necessarily otherwise.

Last edited by Brian Bushnell; 01-12-2015, 09:49 AM.
Leave a comment:
ssully replied

10-02-2014, 01:18 PM
When I wrote 'I don't have 140 Gb of RAM' I meant on my workstation (it has 32Gb). But I do have access to a cluster where I can allocate >140 gb to a job . Should I try that or are you suggesting I shouldn't bother?

Last edited by ssully; 10-02-2014, 02:04 PM.
Leave a comment:
Brian Bushnell replied

10-01-2014, 11:24 AM
In that case... you will need to deduplicate using a sorting-based approach, such as mapping to a reference, sorting by position, then marking duplicates with a tool like Picard. Without a reference, you can still sort alphabetically, but I don't know of any tools that do that (for paired reads) except one I wrote a while ago that is deprecated. Probably, someone else will have a suggestion.
Leave a comment:
ssully replied

10-01-2014, 11:17 AM
Hmm..the interleaved file is 140 million reads. I don't have 140 Gb of RAM. Guess I have to run this on a unix cluster. Is the corrected package (for running the dedup command) ready for download?

Last edited by ssully; 10-01-2014, 11:22 AM.
Leave a comment:
ssully replied

09-30-2014, 01:35 PM
That works! Thanks.
Leave a comment:
Brian Bushnell replied

09-30-2014, 09:18 AM
I updated my prior post. The command should be:

java -ea -Xmx200m -cp C:\Users\Steven\bbmap\current\ jgi.ReformatReads in1=test1.fq in2=test2.fq out=interleaved.fq

Note that Dedupe uses a lot of memory, ~1kb per read. How big are your input files?

Last edited by Brian Bushnell; 09-30-2014, 09:59 AM.
Leave a comment:
ssully replied

09-30-2014, 09:08 AM
tried it but 'java' was not recognized as a command. I added the path to 'my' java.exe
and reran, but then it threw this error

Code:

Error: Could not find or load main class jgi.ReformatReads

There is a 'ReformatReads.class' under
java -ea -Xmx200m -cp /c/Users/Steven/bbmap/current/jgi/
but that path doesn't work either.

Code:

java -ea -Xmx200m -cp /c/Users/Steven/bbmap/current/jgi/ jgi.ReformatReads in1=test1.fq in2=test2.fq out=interleaved.fq Error: Could not find or load main class jgi.ReformatReads

as a last resort this command (removing the space)
java -ea -Xmx200m -cp /c/Users/Steven/bbmap/current/jgi/jgi.ReformatReads in1=test1.fq in2=test2.fq out=interleaved.fq

returns

Code:

Error: Could not find or load main class in1=test1.fq

Obviously I'm not familiar with java syntax so any help is appreciated.
Last edited by ssully; 09-30-2014, 09:11 AM.
Leave a comment:
Brian Bushnell replied

09-29-2014, 06:12 PM
ssully,

I forgot to add the "in1" and "in2" flags to Dedupe's parser, since we use interleaved files for everything. I'll fix that tomorrow. For now, please do this:

java -ea -Xmx200m -cp C:\Users\Steven\bbmap\current\ jgi.ReformatReads in1=test1.fq in2=test2.fq out=interleaved.fq

...then use the resulting interleaved file for Dedupe:

java -da -Xmx1310m -Xms1310m -cp C:\Users\Steven\bbmap\current\ jgi.Dedupe in=interleaved.fq out=dd.fq ac=f interleaved=t

Then you can deinterleave it with Reformat, if you wish:

java -ea -Xmx200m -cp C:\Users\Steven\bbmap\current\ jgi.ReformatReads in=dd.fq out1=dd1.fq out2=dd2.fq

Sorry for the inconvenience!

-Brian

Last edited by Brian Bushnell; 09-30-2014, 09:58 AM.
Leave a comment:
ssully replied

09-29-2014, 04:49 PM
I installed BBmap on my Windows 7 machine (Java RE 1.7 is installed). But I can't seem to get dedup to read any of my input files. They're standard paired fastq files I named test1.f1 and test2.fq. I put them in the bbmaps directory.

here's the command
dedup.sh in1=testq1.fq in2-test2.fq out1=dedup1.fq out2=dedup2.fq ac=f

here is the output:

Code:

java -Djava.library.path=/c/Users/Steven/bbmap/jni/ -ea -Xmx1310m -Xms1310m -cp /c/Users/Steven/bbmap/current/ jgi.Dedupe in1=test1.fq in2=test2.fq out1=dd1.fq out2=dd2.fq ac=f Executing jgi.Dedupe [in1=test1.fq, in2=test2.fq, out1=dedup1.fq, out2=dedup2.fq, ac=f] Exception in thread "main" java.lang.RuntimeException: Unknown parameter in1=test1.fq at jgi.Dedupe.<init>(Dedupe.java:342) at jgi.Dedupe.main(Dedupe.java:56)

using -in1=test1.fq etc, or in=test#.fq etc was no better.
Last edited by ssully; 09-29-2014, 04:59 PM.
Leave a comment:
ronton replied

08-19-2014, 06:51 AM
Originally posted by Brian Bushnell View Post

other non-genomic sequences should be filtered out entirely, but I don't include a file containing those.

I've seen this suggestion before; where to find or how to generate such a file?
Leave a comment:
gab0 replied

08-19-2014, 04:46 AM
Hi Brian

Your suggestion worked as a charm, now I have my files without these control sequences and FastQC does not show any Kmer warning at all. I ran it exactly as you suggested, by using a fasta file with the whole length of the control sequences.

Thanks for your help!!
Leave a comment:
Brian Bushnell replied

08-17-2014, 12:49 PM
Originally posted by gab0 View Post

Before posting for help, I tried BBDuk but as I didn't know that the control sequences were included in the package database, I gave a file with substrings of the control sequence. Also, when I tested it, I used trimming parameters (wrong move, d'ouh :P).

The BBTools distribution only includes the trueseq adapter sequences, which should generally be right-trimmed (since they occur at the 5' tail of reads with insert size shorter than read length, but sequence prior to that should be valid); other non-genomic sequences should be filtered out entirely, but I don't include a file containing those.

As for "k=80" in your command line... BBDuk has a max kmer length of 31. If you tell it to use a longer kmer length it will emulate this by requiring consecutive kmers; for example, k=80 would require 49 consecutive matching 31-mers, which in practice gives almost identical results to using true 80-mers but is not quite the same.

Most importantly, "k=N" will not filter sequences shorter than N bp. Since adapters and primers are generally shorter than 80 bp, it's not useful for removing those. The information content of a random 31bp sequence is 62 bits, and thus has a 1/4,611,686,018,427,387,904 chance of matching another random 31-mer. The reason I included the ability in BBDuk to handle k>31 is purely for homologous genes, low-complexity sequences, and highly-conserved sequences like ribosomes; the chances of a 31-mer from an artificial sequence accidentally matching a real organism is negligible. So, I recommend using k=31 or less when filtering/trimming synthetic compounds (typically, I use k=25 for adapter-trimming and k=27 for filtering, because some synthetic compounds are shorter than 31bp); I only use k>31 when trying to distinguish between different organisms.

As for read length - since BBDuk operates on 31-mers (or less), it's best to include the entire sequence in the reference, no matter how long. As an aside, since you are using 101-bp reads, I advise you to trim the last 1bp before using them, as it will have a very high error rate (up to 7% in my tests). You can do this with BBDuk using the "ftr=99" flag (which means force-trim-right after 99bp, using 0-based coordinates, so only the last 1bp gets trimmed).

Last edited by Brian Bushnell; 08-17-2014, 12:51 PM.
Leave a comment:
gab0 replied

08-17-2014, 12:17 PM
Originally posted by Brian Bushnell View Post

You can quickly remove them with BBDuk, like this:

bbduk.sh -Xmx1g in1=read1.fq in2=read2.fq out1=clean1.fq out2=clean2.fq ref=control.fa stats=stats.txt

They may not affect the assembly much, but no reason to risk it. You'll have to remove them either before assembly or after assembly so you may as well remove them before. Also, that will let you run FastQC again on a cleaner dataset so you can look for other potential problems in the real data. The "stats.txt" output is optional but will let you see what fraction of reads hit which artifact sequence.

P.S. The BBTools package also includes a duplicate read removal tool, which works on paired reads, though it requires a lot of memory (~1kb per read). It deduplicates based on sequence, not mapping (so it does not require a reference), and requires both read 1 and read 2 to be equal to the read 1 and read 2 of the other pair. It's insanely fast.

dedupe.sh in1=read1.fq in2=read2.fq out1=deduped1.fq out2=deduped2.fq ac=f

Hi Brian:

Thanks a lot for your post. Turns out that I had installed in my computer your package, and indeed I noticed the duplicate read removal tool, but didn't want to go that way (or use it as last resort) as it will eliminate true read duplicates, and I'm interested in removing only the control sequences.

As my reads are of 101bp, and the control sequences are longer that 101 bp (150bp, 250bp, and so on), is it okay to give the full lenght control sequences in the fasta file (control.fa)? or should I reduce the lenght to 101bp?

Before posting for help, I tried BBDuk but as I didn't know exactly how to use it, I used trimming parameters (wrong move, d'ouh :P).

This was the line that I executed previously.
bbduk.sh in1=(file) in2=(file) out1=(file) out2=(file) ref=adaptadores.fa ktrim=r minlength=20
bbduk.sh in1=(file) in2=(file) out1=(file) out2=(file) ref=adaptadores.fa ktrim=r minlength=20 k=80

I'll try again with the exact command line that you suggested.

Thanks again Brian!!

Last edited by gab0; 08-17-2014, 12:36 PM.
Leave a comment:
Brian Bushnell replied

08-17-2014, 08:08 AM
You can quickly remove them with BBDuk, like this:

bbduk.sh -Xmx1g in1=read1.fq in2=read2.fq out1=clean1.fq out2=clean2.fq ref=control.fa stats=stats.txt

They may not affect the assembly much, but no reason to risk it. You'll have to remove them either before assembly or after assembly so you may as well remove them before. Also, that will let you run FastQC again on a cleaner dataset so you can look for other potential problems in the real data. The "stats.txt" output is optional but will let you see what fraction of reads hit which artifact sequence.

P.S. The BBTools package also includes a duplicate read removal tool, which works on paired reads, though it requires a lot of memory (~1kb per read). It deduplicates based on sequence, not mapping (so it does not require a reference), and requires both read 1 and read 2 to be equal to the read 1 and read 2 of the other pair. It's insanely fast.

dedupe.sh in1=read1.fq in2=read2.fq out1=deduped1.fq out2=deduped2.fq ac=f

Last edited by Brian Bushnell; 08-17-2014, 08:16 AM.
Leave a comment:
gab0 started a topic duplicate read removal

08-17-2014, 05:58 AM
duplicate read removal

Hello again:

I have a question regarding duplicate removal. Previously I had detected in my files (PE library, 2x100bp, hiseq 2000) some reads that match perfectly with the "Process Controls for TruSeq® Sample Preparation Kits"

It seems that these reads biased the Kmer content in a strange way (I posted before in http://seqanswers.com/forums/showthread.php?t=45539 in case you want to see the fastqc report).

But I don't know whether I should concern about removing these reads before the de novo assembly. Mapping against a genome reference is not an option as my reads almost don't map with the closest reference genome (which isn't so close).

I do think that these controls shouldn't affect the assembly, but I'm still not sure about what should be done.

So, is it worth to remove these reads? Do you do this step in your assembly pipelines?

Thank you very much

Gabriel
Tags: None

Previous template Next

Exploring the Dynamics of the Tumor Microenvironment

by seqadmin

The complexity of cancer is clearly demonstrated in the diverse ecosystem of the tumor microenvironment (TME). The TME is made up of numerous cell types and its development begins with the changes that happen during oncogenesis. “Genomic mutations, copy number changes, epigenetic alterations, and alternative gene expression occur to varying degrees within the affected tumor cells,” explained Andrea O’Hara, Ph.D., Strategic Technical Specialist at Azenta. “As...
- Channel: Articles
07-08-2024, 03:19 PM

Topics	Statistics	Last Post
Gene Misexpression in the Healthy Human Population by seqadmin Started by seqadmin, Yesterday, 06:46 AM	0 responses 9 views 0 likes	Last Post by seqadmin Yesterday, 06:46 AM
New Method for Rapid Genetic Diagnosis of Mendelian Disorders by seqadmin Started by seqadmin, 07-24-2024, 11:09 AM	0 responses 26 views 0 likes	Last Post by seqadmin 07-24-2024, 11:09 AM
Advancing Nanopore Technology for Portable Sensing Devices by seqadmin Started by seqadmin, 07-19-2024, 07:20 AM	0 responses 160 views 0 likes	Last Post by seqadmin 07-19-2024, 07:20 AM
New RNA-Based Gene Writing Technology Achieves Precise Gene Integration by seqadmin Started by seqadmin, 07-16-2024, 05:49 AM	0 responses 127 views 0 likes	Last Post by seqadmin 07-16-2024, 05:49 AM

Seqanswers Leaderboard Ad

Announcement

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment: