Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • Brian Bushnell
    replied
    The latest version (33.54, uploaded yesterday) allows dual input files for Dedupe, though it will still only output a single interleaved file. So on your high-memory cluster, you can do this:

    java -ea -Xmx140g -Xms140g -cp /path/ jgi.Dedupe in1=read1.fq in2=read2.fq out=deduped.fq ac=f

    You will still need to de-interleave the output with Reformat, if you need 2 output files:

    java -ea -Xmx200m -cp /path/ jgi.ReformatReads in=deduped.fq out1=dd1.fq out2=dd2.fq

    Whether or not you SHOULD deduplicate the data depends on your experiment. What are you doing? It's often a good idea if you expect lots of PCR duplicates, but not necessarily otherwise.
    Last edited by Brian Bushnell; 01-12-2015, 09:49 AM.

    Leave a comment:


  • ssully
    replied
    When I wrote 'I don't have 140 Gb of RAM' I meant on my workstation (it has 32Gb). But I do have access to a cluster where I can allocate >140 gb to a job . Should I try that or are you suggesting I shouldn't bother?
    Last edited by ssully; 10-02-2014, 02:04 PM.

    Leave a comment:


  • Brian Bushnell
    replied
    In that case... you will need to deduplicate using a sorting-based approach, such as mapping to a reference, sorting by position, then marking duplicates with a tool like Picard. Without a reference, you can still sort alphabetically, but I don't know of any tools that do that (for paired reads) except one I wrote a while ago that is deprecated. Probably, someone else will have a suggestion.

    Leave a comment:


  • ssully
    replied
    Hmm..the interleaved file is 140 million reads. I don't have 140 Gb of RAM. Guess I have to run this on a unix cluster. Is the corrected package (for running the dedup command) ready for download?
    Last edited by ssully; 10-01-2014, 11:22 AM.

    Leave a comment:


  • ssully
    replied
    That works! Thanks.

    Leave a comment:


  • Brian Bushnell
    replied
    I updated my prior post. The command should be:

    java -ea -Xmx200m -cp C:\Users\Steven\bbmap\current\ jgi.ReformatReads in1=test1.fq in2=test2.fq out=interleaved.fq

    Note that Dedupe uses a lot of memory, ~1kb per read. How big are your input files?
    Last edited by Brian Bushnell; 09-30-2014, 09:59 AM.

    Leave a comment:


  • ssully
    replied
    tried it but 'java' was not recognized as a command. I added the path to 'my' java.exe
    and reran, but then it threw this error
    Code:
    Error: Could not find or load main class jgi.ReformatReads
    There is a 'ReformatReads.class' under
    java -ea -Xmx200m -cp /c/Users/Steven/bbmap/current/jgi/
    but that path doesn't work either.
    Code:
    java -ea -Xmx200m -cp /c/Users/Steven/bbmap/current/jgi/ jgi.ReformatReads in1=test1.fq in2=test2.fq out=interleaved.fq
    Error: Could not find or load main class jgi.ReformatReads

    as a last resort this command (removing the space)
    java -ea -Xmx200m -cp /c/Users/Steven/bbmap/current/jgi/jgi.ReformatReads in1=test1.fq in2=test2.fq out=interleaved.fq

    returns

    Code:
    Error: Could not find or load main class in1=test1.fq
    Obviously I'm not familiar with java syntax so any help is appreciated.
    Last edited by ssully; 09-30-2014, 09:11 AM.

    Leave a comment:


  • Brian Bushnell
    replied
    ssully,

    I forgot to add the "in1" and "in2" flags to Dedupe's parser, since we use interleaved files for everything. I'll fix that tomorrow. For now, please do this:

    java -ea -Xmx200m -cp C:\Users\Steven\bbmap\current\ jgi.ReformatReads in1=test1.fq in2=test2.fq out=interleaved.fq

    ...then use the resulting interleaved file for Dedupe:

    java -da -Xmx1310m -Xms1310m -cp C:\Users\Steven\bbmap\current\ jgi.Dedupe in=interleaved.fq out=dd.fq ac=f interleaved=t

    Then you can deinterleave it with Reformat, if you wish:

    java -ea -Xmx200m -cp C:\Users\Steven\bbmap\current\ jgi.ReformatReads in=dd.fq out1=dd1.fq out2=dd2.fq

    Sorry for the inconvenience!

    -Brian
    Last edited by Brian Bushnell; 09-30-2014, 09:58 AM.

    Leave a comment:


  • ssully
    replied
    I installed BBmap on my Windows 7 machine (Java RE 1.7 is installed). But I can't seem to get dedup to read any of my input files. They're standard paired fastq files I named test1.f1 and test2.fq. I put them in the bbmaps directory.

    here's the command
    dedup.sh in1=testq1.fq in2-test2.fq out1=dedup1.fq out2=dedup2.fq ac=f

    here is the output:
    Code:
    java -Djava.library.path=/c/Users/Steven/bbmap/jni/ -ea -Xmx1310m -Xms1310m -cp /c/Users/Steven/bbmap/current/ jgi.Dedupe in1=test1.fq in2=test2.fq  out1=dd1.fq out2=dd2.fq ac=f
    
    Executing jgi.Dedupe [in1=test1.fq, in2=test2.fq, out1=dedup1.fq, out2=dedup2.fq, ac=f]
    
    Exception in thread "main" java.lang.RuntimeException: Unknown parameter in1=test1.fq
    
    at jgi.Dedupe.<init>(Dedupe.java:342)
    at jgi.Dedupe.main(Dedupe.java:56)
    using -in1=test1.fq etc, or in=test#.fq etc was no better.
    Last edited by ssully; 09-29-2014, 04:59 PM.

    Leave a comment:


  • ronton
    replied
    Originally posted by Brian Bushnell View Post
    other non-genomic sequences should be filtered out entirely, but I don't include a file containing those.
    I've seen this suggestion before; where to find or how to generate such a file?

    Leave a comment:


  • gab0
    replied
    Hi Brian

    Your suggestion worked as a charm, now I have my files without these control sequences and FastQC does not show any Kmer warning at all. I ran it exactly as you suggested, by using a fasta file with the whole length of the control sequences.

    Thanks for your help!!

    Leave a comment:


  • Brian Bushnell
    replied
    Originally posted by gab0 View Post
    Before posting for help, I tried BBDuk but as I didn't know that the control sequences were included in the package database, I gave a file with substrings of the control sequence. Also, when I tested it, I used trimming parameters (wrong move, d'ouh :P).
    The BBTools distribution only includes the trueseq adapter sequences, which should generally be right-trimmed (since they occur at the 5' tail of reads with insert size shorter than read length, but sequence prior to that should be valid); other non-genomic sequences should be filtered out entirely, but I don't include a file containing those.

    As for "k=80" in your command line... BBDuk has a max kmer length of 31. If you tell it to use a longer kmer length it will emulate this by requiring consecutive kmers; for example, k=80 would require 49 consecutive matching 31-mers, which in practice gives almost identical results to using true 80-mers but is not quite the same.

    Most importantly, "k=N" will not filter sequences shorter than N bp. Since adapters and primers are generally shorter than 80 bp, it's not useful for removing those. The information content of a random 31bp sequence is 62 bits, and thus has a 1/4,611,686,018,427,387,904 chance of matching another random 31-mer. The reason I included the ability in BBDuk to handle k>31 is purely for homologous genes, low-complexity sequences, and highly-conserved sequences like ribosomes; the chances of a 31-mer from an artificial sequence accidentally matching a real organism is negligible. So, I recommend using k=31 or less when filtering/trimming synthetic compounds (typically, I use k=25 for adapter-trimming and k=27 for filtering, because some synthetic compounds are shorter than 31bp); I only use k>31 when trying to distinguish between different organisms.

    As for read length - since BBDuk operates on 31-mers (or less), it's best to include the entire sequence in the reference, no matter how long. As an aside, since you are using 101-bp reads, I advise you to trim the last 1bp before using them, as it will have a very high error rate (up to 7% in my tests). You can do this with BBDuk using the "ftr=99" flag (which means force-trim-right after 99bp, using 0-based coordinates, so only the last 1bp gets trimmed).
    Last edited by Brian Bushnell; 08-17-2014, 12:51 PM.

    Leave a comment:


  • gab0
    replied
    Originally posted by Brian Bushnell View Post
    You can quickly remove them with BBDuk, like this:

    bbduk.sh -Xmx1g in1=read1.fq in2=read2.fq out1=clean1.fq out2=clean2.fq ref=control.fa stats=stats.txt

    They may not affect the assembly much, but no reason to risk it. You'll have to remove them either before assembly or after assembly so you may as well remove them before. Also, that will let you run FastQC again on a cleaner dataset so you can look for other potential problems in the real data. The "stats.txt" output is optional but will let you see what fraction of reads hit which artifact sequence.

    P.S. The BBTools package also includes a duplicate read removal tool, which works on paired reads, though it requires a lot of memory (~1kb per read). It deduplicates based on sequence, not mapping (so it does not require a reference), and requires both read 1 and read 2 to be equal to the read 1 and read 2 of the other pair. It's insanely fast.

    dedupe.sh in1=read1.fq in2=read2.fq out1=deduped1.fq out2=deduped2.fq ac=f
    Hi Brian:

    Thanks a lot for your post. Turns out that I had installed in my computer your package, and indeed I noticed the duplicate read removal tool, but didn't want to go that way (or use it as last resort) as it will eliminate true read duplicates, and I'm interested in removing only the control sequences.

    As my reads are of 101bp, and the control sequences are longer that 101 bp (150bp, 250bp, and so on), is it okay to give the full lenght control sequences in the fasta file (control.fa)? or should I reduce the lenght to 101bp?

    Before posting for help, I tried BBDuk but as I didn't know exactly how to use it, I used trimming parameters (wrong move, d'ouh :P).

    This was the line that I executed previously.
    bbduk.sh in1=(file) in2=(file) out1=(file) out2=(file) ref=adaptadores.fa ktrim=r minlength=20
    bbduk.sh in1=(file) in2=(file) out1=(file) out2=(file) ref=adaptadores.fa ktrim=r minlength=20 k=80

    I'll try again with the exact command line that you suggested.

    Thanks again Brian!!
    Last edited by gab0; 08-17-2014, 12:36 PM.

    Leave a comment:


  • Brian Bushnell
    replied
    You can quickly remove them with BBDuk, like this:

    bbduk.sh -Xmx1g in1=read1.fq in2=read2.fq out1=clean1.fq out2=clean2.fq ref=control.fa stats=stats.txt

    They may not affect the assembly much, but no reason to risk it. You'll have to remove them either before assembly or after assembly so you may as well remove them before. Also, that will let you run FastQC again on a cleaner dataset so you can look for other potential problems in the real data. The "stats.txt" output is optional but will let you see what fraction of reads hit which artifact sequence.

    P.S. The BBTools package also includes a duplicate read removal tool, which works on paired reads, though it requires a lot of memory (~1kb per read). It deduplicates based on sequence, not mapping (so it does not require a reference), and requires both read 1 and read 2 to be equal to the read 1 and read 2 of the other pair. It's insanely fast.

    dedupe.sh in1=read1.fq in2=read2.fq out1=deduped1.fq out2=deduped2.fq ac=f
    Last edited by Brian Bushnell; 08-17-2014, 08:16 AM.

    Leave a comment:


  • gab0
    started a topic duplicate read removal

    duplicate read removal

    Hello again:

    I have a question regarding duplicate removal. Previously I had detected in my files (PE library, 2x100bp, hiseq 2000) some reads that match perfectly with the "Process Controls for TruSeq® Sample Preparation Kits"

    It seems that these reads biased the Kmer content in a strange way (I posted before in http://seqanswers.com/forums/showthread.php?t=45539 in case you want to see the fastqc report).

    But I don't know whether I should concern about removing these reads before the de novo assembly. Mapping against a genome reference is not an option as my reads almost don't map with the closest reference genome (which isn't so close).

    I do think that these controls shouldn't affect the assembly, but I'm still not sure about what should be done.

    So, is it worth to remove these reads? Do you do this step in your assembly pipelines?

    Thank you very much

    Gabriel

Latest Articles

Collapse

  • seqadmin
    Multiomics Techniques Advancing Disease Research
    by seqadmin


    New and advanced multiomics tools and technologies have opened new avenues of research and markedly enhanced various disciplines such as disease research and precision medicine1. The practice of merging diverse data from various ‘omes increasingly provides a more holistic understanding of biological systems. As Maddison Masaeli, Co-Founder and CEO at Deepcell, aptly noted, “You can't explain biology in its complex form with one modality.”

    A major leap in the field has
    ...
    02-08-2024, 06:33 AM

ad_right_rmr

Collapse

News

Collapse

Topics Statistics Last Post
Started by seqadmin, Yesterday, 08:52 AM
0 responses
21 views
0 likes
Last Post seqadmin  
Started by seqadmin, 02-20-2024, 08:57 AM
0 responses
14 views
0 likes
Last Post seqadmin  
Started by seqadmin, 02-14-2024, 09:19 AM
0 responses
49 views
0 likes
Last Post seqadmin  
Started by seqadmin, 02-12-2024, 03:37 PM
0 responses
436 views
0 likes
Last Post seqadmin  
Working...
X