  • duplicate read removal

    Hello again:

    I have a question regarding duplicate removal. Previously I detected in my files (PE library, 2x100bp, HiSeq 2000) some reads that match perfectly with the "Process Controls for TruSeq® Sample Preparation Kits".

    It seems that these reads biased the Kmer content in a strange way (I posted before in http://seqanswers.com/forums/showthread.php?t=45539 in case you want to see the FastQC report).

    But I don't know whether I should be concerned about removing these reads before the de novo assembly. Mapping against a reference genome is not an option, as almost none of my reads map to the closest reference genome (which isn't so close).

    I do think that these controls shouldn't affect the assembly, but I'm still not sure about what should be done.

    So, is it worth removing these reads? Do you do this step in your assembly pipelines?

    Thank you very much

    Gabriel

  • #2
    You can quickly remove them with BBDuk, like this:

    bbduk.sh -Xmx1g in1=read1.fq in2=read2.fq out1=clean1.fq out2=clean2.fq ref=control.fa stats=stats.txt

    They may not affect the assembly much, but there's no reason to risk it. You'll have to remove them either before or after assembly, so you may as well remove them before. Also, that will let you run FastQC again on a cleaner dataset so you can look for other potential problems in the real data. The "stats.txt" output is optional, but it will let you see what fraction of reads hit which artifact sequence.
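    The k-mer filtering idea can be sketched in a few lines of Python (a toy illustration of the concept only, not BBTools code; all names here are made up):

```python
# Toy sketch of k-mer-based read filtering (the idea behind BBDuk's
# ref= filtering mode). Illustrative only, not BBTools code.

def kmers(seq, k):
    """All length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def filter_reads(reads, ref, k):
    """Discard any read sharing at least one k-mer with the reference."""
    ref_kmers = kmers(ref, k)
    return [r for r in reads if not (kmers(r, k) & ref_kmers)]

# Tiny demo with k=5 so the sequences stay short; BBDuk itself
# typically uses k in the 25-31 range.
control = "ACGTACGTAC"
reads = ["ACGTAGGGGG", "TTTTTTTTTT"]
print(filter_reads(reads, control, k=5))   # ['TTTTTTTTTT']
```

    The real tool streams reads and uses a compact kmer table rather than Python sets, but the matching logic is the same idea.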

    P.S. The BBTools package also includes a duplicate read removal tool, which works on paired reads, though it requires a lot of memory (~1kb per read). It deduplicates based on sequence, not mapping (so it does not require a reference), and requires both read 1 and read 2 to be equal to the read 1 and read 2 of the other pair. It's insanely fast.

    dedupe.sh in1=read1.fq in2=read2.fq out1=deduped1.fq out2=deduped2.fq ac=f
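    The pair-matching rule can be sketched like this (an illustrative toy under the description above, not the actual Dedupe implementation):

```python
# Minimal sketch of pair-based deduplication as described above, with
# ac=f (exact duplicates only, no absorbing of contained sequences):
# a pair is dropped only when BOTH mates match both mates of a
# previously seen pair. Illustrative only, not BBTools code.

def dedupe_pairs(pairs):
    seen = set()
    kept = []
    for r1, r2 in pairs:
        if (r1, r2) not in seen:
            seen.add((r1, r2))
            kept.append((r1, r2))
    return kept

pairs = [
    ("ACGT", "TTGC"),
    ("ACGT", "TTGC"),   # exact duplicate of the first pair: dropped
    ("ACGT", "GGAA"),   # read 1 matches but read 2 differs: kept
]
print(dedupe_pairs(pairs))   # [('ACGT', 'TTGC'), ('ACGT', 'GGAA')]
```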
    Last edited by Brian Bushnell; 08-17-2014, 08:16 AM.



    • #3
      Originally posted by Brian Bushnell View Post
      You can quickly remove them with BBDuk [...]
      Hi Brian:

      Thanks a lot for your post. It turns out I had already installed your package on my computer, and indeed I noticed the duplicate read removal tool, but I didn't want to go that way (or only as a last resort), since it would eliminate true read duplicates, and I'm interested in removing only the control sequences.

      As my reads are 101 bp and the control sequences are longer than 101 bp (150 bp, 250 bp, and so on), is it okay to put the full-length control sequences in the fasta file (control.fa), or should I reduce their length to 101 bp?

      Before posting for help, I tried BBDuk, but as I didn't know exactly how to use it, I used trimming parameters (wrong move, d'oh :P).

      These were the lines that I executed previously:
      bbduk.sh in1=(file) in2=(file) out1=(file) out2=(file) ref=adaptadores.fa ktrim=r minlength=20
      bbduk.sh in1=(file) in2=(file) out1=(file) out2=(file) ref=adaptadores.fa ktrim=r minlength=20 k=80

      I'll try again with the exact command line that you suggested.

      Thanks again Brian!!
      Last edited by gab0; 08-17-2014, 12:36 PM.



      • #4
        Originally posted by gab0 View Post
        Before posting for help, I tried BBDuk but as I didn't know that the control sequences were included in the package database, I gave a file with substrings of the control sequence. Also, when I tested it, I used trimming parameters (wrong move, d'ouh :P).
        The BBTools distribution only includes the TruSeq adapter sequences, which should generally be right-trimmed (since they occur at the 3' end of reads with insert size shorter than read length, but the sequence prior to that is still valid); other non-genomic sequences should be filtered out entirely, but I don't include a file containing those.

        As for "k=80" in your command line... BBDuk has a maximum kmer length of 31. If you tell it to use a longer kmer length, it will emulate this by requiring consecutive matching kmers; for example, k=80 would require 50 consecutive matching 31-mers, which in practice gives almost identical results to using true 80-mers, but is not quite the same.

        Most importantly, "k=N" will not filter sequences shorter than N bp. Since adapters and primers are generally shorter than 80 bp, k=80 is not useful for removing them. The information content of a random 31 bp sequence is 62 bits, so it has a 1 in 4,611,686,018,427,387,904 chance of matching another random 31-mer. The reason I included the ability to handle k>31 in BBDuk is purely for homologous genes, low-complexity sequences, and highly conserved sequences like ribosomal RNA; the chance of a 31-mer from an artificial sequence accidentally matching a real organism is negligible. So, I recommend using k=31 or less when filtering/trimming synthetic sequences (typically, I use k=25 for adapter-trimming and k=27 for filtering, because some synthetic sequences are shorter than 31 bp); I only use k>31 when trying to distinguish between different organisms.
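        The arithmetic above can be checked directly (just the numbers, not BBDuk code):

```python
# Checking the k-mer arithmetic from the post. A base over {A,C,G,T}
# carries 2 bits, and a long kmer is emulated with consecutive 31-mers.

k = 31
print(2 * k)              # 62 bits of information in a random 31-mer
print(4 ** k)             # 4611686018427387904 possible 31-mers

emulated_k = 80
print(emulated_k - k + 1) # 50 consecutive 31-mers span an 80-mer
```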

        As for read length - since BBDuk operates on 31-mers (or shorter), it's best to include the entire sequence in the reference, no matter how long. As an aside, since you are using 101 bp reads, I advise you to trim the last 1 bp before using them, as it has a very high error rate (up to 7% in my tests). You can do this with BBDuk using the "ftr=99" flag (which means force-trim-right after position 99, using 0-based coordinates, so only the last 1 bp gets trimmed).
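        The effect of "ftr=99" can be sketched as a simple slice (illustrative only, not BBDuk's implementation):

```python
# Sketch of what the ftr (force-trim-right) flag does, per the
# description above: keep 0-based positions 0 through ftr, trimming
# everything after. Illustrative only, not BBDuk code.

def force_trim_right(read, ftr):
    return read[:ftr + 1]

read = "A" * 101                    # a 101 bp read
trimmed = force_trim_right(read, 99)
print(len(trimmed))                 # 100: only the last base is removed
```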
        Last edited by Brian Bushnell; 08-17-2014, 12:51 PM.



        • #5
          Hi Brian

          Your suggestion worked like a charm; now I have my files without these control sequences, and FastQC does not show any Kmer warning at all. I ran it exactly as you suggested, using a fasta file with the full-length control sequences.

          Thanks for your help!!



          • #6
            Originally posted by Brian Bushnell View Post
            other non-genomic sequences should be filtered out entirely, but I don't include a file containing those.
            I've seen this suggestion before; where can I find, or how can I generate, such a file?



            • #7
              I installed BBMap on my Windows 7 machine (Java RE 1.7 is installed), but I can't seem to get Dedupe to read any of my input files. They're standard paired fastq files I named test1.fq and test2.fq. I put them in the bbmap directory.

              Here's the command:
              dedupe.sh in1=test1.fq in2=test2.fq out1=dedup1.fq out2=dedup2.fq ac=f

              Here is the output:
              Code:
              java -Djava.library.path=/c/Users/Steven/bbmap/jni/ -ea -Xmx1310m -Xms1310m -cp /c/Users/Steven/bbmap/current/ jgi.Dedupe in1=test1.fq in2=test2.fq  out1=dd1.fq out2=dd2.fq ac=f
              
              Executing jgi.Dedupe [in1=test1.fq, in2=test2.fq, out1=dedup1.fq, out2=dedup2.fq, ac=f]
              
              Exception in thread "main" java.lang.RuntimeException: Unknown parameter in1=test1.fq
              
              at jgi.Dedupe.<init>(Dedupe.java:342)
              at jgi.Dedupe.main(Dedupe.java:56)
              Using -in1=test1.fq etc., or in=test#.fq etc., was no better.
              Last edited by ssully; 09-29-2014, 04:59 PM.



              • #8
                ssully,

                I forgot to add the "in1" and "in2" flags to Dedupe's parser, since we use interleaved files for everything. I'll fix that tomorrow. For now, please do this:

                java -ea -Xmx200m -cp C:\Users\Steven\bbmap\current\ jgi.ReformatReads in1=test1.fq in2=test2.fq out=interleaved.fq

                ...then use the resulting interleaved file for Dedupe:

                java -da -Xmx1310m -Xms1310m -cp C:\Users\Steven\bbmap\current\ jgi.Dedupe in=interleaved.fq out=dd.fq ac=f interleaved=t

                Then you can deinterleave it with Reformat, if you wish:

                java -ea -Xmx200m -cp C:\Users\Steven\bbmap\current\ jgi.ReformatReads in=dd.fq out1=dd1.fq out2=dd2.fq

                Sorry for the inconvenience!

                -Brian
                Last edited by Brian Bushnell; 09-30-2014, 09:58 AM.



                • #9
                  Tried it, but 'java' was not recognized as a command. I added the path to 'my' java.exe and reran, but then it threw this error:
                  Code:
                  Error: Could not find or load main class jgi.ReformatReads
                  There is a 'ReformatReads.class' under /c/Users/Steven/bbmap/current/jgi/, but that path doesn't work either:
                  Code:
                  java -ea -Xmx200m -cp /c/Users/Steven/bbmap/current/jgi/ jgi.ReformatReads in1=test1.fq in2=test2.fq out=interleaved.fq
                  Error: Could not find or load main class jgi.ReformatReads

                  As a last resort, this command (removing the space)
                  java -ea -Xmx200m -cp /c/Users/Steven/bbmap/current/jgi/jgi.ReformatReads in1=test1.fq in2=test2.fq out=interleaved.fq

                  returns

                  Code:
                  Error: Could not find or load main class in1=test1.fq
                  Obviously I'm not familiar with Java syntax, so any help is appreciated.
                  Last edited by ssully; 09-30-2014, 09:11 AM.



                  • #10
                    I updated my prior post. The command should be:

                    java -ea -Xmx200m -cp C:\Users\Steven\bbmap\current\ jgi.ReformatReads in1=test1.fq in2=test2.fq out=interleaved.fq

                    Note that Dedupe uses a lot of memory, ~1kb per read. How big are your input files?
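                    The ~1kb-per-read figure gives a quick way to estimate the heap needed (dedupe_mem_gb is a hypothetical helper for illustration, not a BBTools API):

```python
# Back-of-the-envelope memory estimate from the ~1 kB-per-read figure
# mentioned above. Illustrative helper, not part of BBTools.

def dedupe_mem_gb(n_reads, bytes_per_read=1024):
    return n_reads * bytes_per_read / 1024 ** 3

# e.g. 140 million reads would need on the order of 130+ GB of heap
print(round(dedupe_mem_gb(140_000_000)))
```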
                    Last edited by Brian Bushnell; 09-30-2014, 09:59 AM.



                    • #11
                      That works! Thanks.



                      • #12
                        Hmm... the interleaved file is 140 million reads, and I don't have 140 GB of RAM. I guess I'll have to run this on a Unix cluster. Is the corrected package (for running the Dedupe command) ready for download?
                        Last edited by ssully; 10-01-2014, 11:22 AM.



                        • #13
                          In that case... you will need to deduplicate using a sorting-based approach, such as mapping to a reference, sorting by position, and then marking duplicates with a tool like Picard. Without a reference, you can still sort the pairs alphabetically, but I don't know of any tools that do that (for paired reads) except a deprecated one I wrote a while ago. Perhaps someone else will have a suggestion.
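                          The reference-free, sort-based approach can be sketched like this (a toy illustration; at 140M reads you would need an external, disk-based sort rather than an in-memory one):

```python
# Sketch of reference-free, sort-based pair deduplication as described
# above: sort pairs lexicographically, then drop adjacent duplicates.
# Illustrative only; sorted() holds everything in memory, so real use
# at this scale would swap in an external sort (e.g. a disk-based one).

def sort_dedupe_pairs(pairs):
    out = []
    prev = None
    for pair in sorted(pairs):
        if pair != prev:
            out.append(pair)
        prev = pair
    return out

pairs = [("TTGC", "ACGT"), ("ACGT", "GGAA"), ("TTGC", "ACGT")]
print(sort_dedupe_pairs(pairs))  # [('ACGT', 'GGAA'), ('TTGC', 'ACGT')]
```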



                          • #14
                            When I wrote 'I don't have 140 GB of RAM' I meant on my workstation (it has 32 GB). But I do have access to a cluster where I can allocate >140 GB to a job. Should I try that, or are you suggesting I shouldn't bother?
                            Last edited by ssully; 10-02-2014, 02:04 PM.



                            • #15
                              The latest version (33.54, uploaded yesterday) allows dual input files for Dedupe, though it will still only output a single interleaved file. So on your high-memory cluster, you can do this:

                              java -ea -Xmx140g -Xms140g -cp /path/ jgi.Dedupe in1=read1.fq in2=read2.fq out=deduped.fq ac=f

                              You will still need to de-interleave the output with Reformat, if you need 2 output files:

                              java -ea -Xmx200m -cp /path/ jgi.ReformatReads in=deduped.fq out1=dd1.fq out2=dd2.fq

                              Whether or not you SHOULD deduplicate the data depends on your experiment. What are you doing? It's often a good idea if you expect lots of PCR duplicates, but not necessarily otherwise.
                              Last edited by Brian Bushnell; 01-12-2015, 09:49 AM.
