Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • Hi Brian,

    I just started to try BBDuk after being too used to the alternative trimmer and would like some stimulant. I found the functions in BBTools comprehensive and sophisticated. Thank you!

    I have a couple hundred of PE libries to trim. I started by using the default for one of them,

    bbduk.sh -Xmx1g in1=in1.fastq in2=in2.fastq out1=out1.fastq out2.fastq ref=~/package/bbmap/resources/adapters.fa ktrim=r k=23 mink=11 hdist=1 tpe tbo
    The standard err output looks like this,

    Input: 50775382 reads 7667082682 bases.
    KTrimmed: 5738256 reads (11.30%) 185600890 bases (2.42%)
    Trimmed by overlap: 2313702 reads (4.56%) 17335164 bases (0.23%)
    Total Removed: 16926 reads (0.03%) 202936054 bases (2.65%)
    Result: 50758456 reads (99.97%) 7464146628 bases (97.35%)
    Is the total removed rate (0.03%) expected to be smaller than the KTrimmed and overlay, which are 11.30% and 4.56% respectively?

    Also, I found the size of output file somehow larger than the input file (2454285909 vs 2385789236), is is expected?

    Thank you in advance for your answers.

    Comment


    • Originally posted by chiayi View Post
      Hi Brian,

      I just started to try BBDuk after being too used to the alternative trimmer and would like some stimulant. I found the functions in BBTools comprehensive and sophisticated. Thank you!

      I have a couple hundred of PE libries to trim. I started by using the default for one of them, (...)
      The standard err output looks like this,

      Input: 50775382 reads 7667082682 bases.
      KTrimmed: 5738256 reads (11.30%) 185600890 bases (2.42%)
      Trimmed by overlap: 2313702 reads (4.56%) 17335164 bases (0.23%)
      Total Removed: 16926 reads (0.03%) 202936054 bases (2.65%)
      Result: 50758456 reads (99.97%) 7464146628 bases (97.35%)
      Is the total removed rate (0.03%) expected to be smaller than the KTrimmed and overlay, which are 11.30% and 4.56% respectively?
      Hi Chiayi,

      The reads that were trimmed will only be removed if their length drops below minlength, which is by default 10, so virtually nothing gets totally discarded except for adapter-dimers; you just lose 2.65% of your bases to trimming. That's the advantage of adapter-trimming rather than adapter-filtering. However, feel free to set minlength to a higher value; length-20 reads are not usually very useful. Typically, if I am going to do assembly, I set minlength to the shortest kmer length I plan to use.

      Also, I found the size of output file somehow larger than the input file (2454285909 vs 2385789236), is is expected?
      Yes, that's because BBDuk defaults to lower compression (level 2) than gzip normally does (level 6), to increase speed. You can increase compression to gzip's default level by adding the flag "zl=6". That will make it run a little slower, unless you are on a multi-core computer and have pigz installed - which I highly recommend, especially if you will be processing hundreds of libraries.

      Thank you in advance for your answers.
      You're welcome! Incidentally, if you are concerned about file size and want the files to be as small as possible, give Clumpify a try. It only supports interleaved files for paired reads (rather than dual files) but for libraries with sufficient coverage, meaning >10x or so, it can reduce filesize by around 30% losslessly by reordering the reads. I've found that this also typically accelerates subsequent analysis pipelines by a similar factor (up to 30%). Usage:

      clumpify.sh in=reads.fastq.gz out=clumped.fastq.gz

      Comment


      • The reads that were trimmed will only be removed if their length drops below minlength, which is by default 10, so virtually nothing gets totally discarded except for adapter-dimers; you just lose 2.65% of your bases to trimming. That's the advantage of adapter-trimming rather than adapter-filtering. However, feel free to set minlength to a higher value; length-20 reads are not usually very useful. Typically, if I am going to do assembly, I set minlength to the shortest kmer length I plan to use.
        Thank you, Brian. I guess I was confused with adapter trimming and adapter filtering, and it's clear now with your explanation. I have a follow-up question. I thought the percentage of reads with adapters with be higher than 10%, is not correct? Aren't all reads fused with adapters during library preparation? Thank you.
        Last edited by chiayi; 12-03-2016, 04:00 PM.

        Comment


        • Hi Chiayi,

          Reads are not usually supposed to have adapters. The DNA fragments that are being sequenced are fused with adapters. But sequencing starts at the adapter and proceeds in the opposite direction, into the genomic portion of the molecule. So you do not see adapter sequence from the end of the molecule you are sequencing; rather, if the genomic part of the read is shorter than the number of cycles you run sequencing, adapter sequence from the other end of the molecule will appear at the end of the read.

          In other words, if you do 2x150bp sequencing, you try to design the library with an insert size of over 150bp; perhaps 250bp so that the reads overlap by 50bp. Only the short fragments, under 150bp, will have adapter sequence appear in the reads which needs trimming.

          Comment


          • Hi Brian,

            Thank you again. That was very helpful. One last quesiton, is bz2 not compatible with BBDuk? Some of my files are bz2 and for those runs it looked like the files were not loaded and the std err stuck at memory usage line without going further.

            Comment


            • Bzip2 should work fine if you have the bzip2 or pbzip2 executable in your path, and files are named with the .bz2 extension. Can you post your exact command line, the complete screen output, and the results when you type from the command line "bzip2"?

              Comment


              • Sure.

                $ bbduk.sh -Xmx1g in=in.fastq.bz2 out=clean.fastq ref=adapters.fa ktrim=r k=25 mink=11 hdist=1 lhist=length-hist
                java -Djava.library.path=jni/ -ea -Xmx1g -Xms1g -cp current/ jgi.BBDukF -Xmx1g in=in.fastq.bz2 out=clean.fastq ref=adapters.fa ktrim=r k=25 mink=11 hdist=1 lhist=length-hist
                Executing jgi.BBDukF [-Xmx1g, in=in.fastq.bz2, out=clean.fastq, ref=adapters.fa, ktrim=r, k=25, mink=11, hdist=1, lhist=length-hist]

                BBDuk version 36.62
                Set length histogram output to length-hist
                maskMiddle was disabled because useShortKmers=true
                Initial:
                Memory: max=1029m, free=1004m, used=25m

                Added 252967 kmers; time: 0.367 seconds.
                Memory: max=1029m, free=956m, used=73m
                This is where the run would stuck.

                Comment


                • I just tested it, and both compression and decompression of .bz2 files is working fine for me. It should work as long as either bzip2 or pbzip2 is in your path. What happens when you type "bzip2" and "pbzip2"? Also, bzip2 decompression is pretty slow. When you run "top -c" do you see a java process hanging around with 0 cpu utilization?

                  Comment


                  • Hi Brian,

                    bzip2 was in my path yet pbzip2 was not. As you predicted, when I checked the status using "top -c" a java was hanging around with 0 cpu utilization. I intalled pbzip2 and the issue was resolved. Thank you so much for your patience and all the guidance.

                    Comment


                    • Hi Brian,
                      I have a question regarding bbduk singleton reads output (outs). My impression was that outs should contain singleton reads which passed filtering criteria, while their mates did not pass the filtering criteria.
                      But when I filter for a contaminant (35x polyG string) and at the same time I trim and filter for quality and minimum read length ( qtrim=rl trimq=15 maq=15 minlength=42), the outs file contains also reads shorter than 42bp with very poor quality (e.g. 35bp long reads consisting of Ns with 0 phred scores). Is this result supposed to happen?

                      the bbduk stand. output looks like this:
                      Input: 6882588 reads 1013379854 bases.
                      Contaminants: 320048 reads (4.65%) 42774629 bases (4.22%)
                      QTrimmed: 1971637 reads (28.65%) 71416735 bases (7.05%)
                      Low quality discards: 0 reads (0.00%) 0 bases (0.00%)
                      Total Removed: 707746 reads (10.28%) 114191364 bases (11.27%)
                      Result: 6174842 reads (89.72%) 899188490 bases (88.73%)

                      Thank you in advance for an advice,
                      dejsha

                      Comment


                      • Hi Brian,
                        I also have a question. I downloaded today the latest version of BBmap and I have a strange behaviour with left- and right-trimming

                        For example:
                        Code:
                        TCCGAGGCAGTAGGCT[COLOR="Red"]CAACAGCAATATACCTTCTCGAGAGGTCT[/COLOR]CTGTCCACAGATCGGAAGAGCACACGTCTGAACTCCAGTCACGTTTCGGAATCTC
                        being a read (in red is the adapter I want to look for).

                        I used this command line:
                        Code:
                        ~/Applications/bbmap/bbduk.sh in=./Data/To_scan.fastq out=./Data/Rejected.fastq outm=./out/No_mm_No_Indel/${Samplename}.fastq literal=CAACAGCAATATACCTTCTCGAGAGGTCT k=29 mm=f hdist=0 edist=0 ktrim=l rcomp=f
                        and I don't get
                        Code:
                        CTGTCCACAGATCGGAAGAGCACACGTCTGAACTCCAGTCACGTTTCGGAATCTC
                        as expected...


                        EDIT:
                        Funnily, I have to set maxlength=1 to make it work...
                        Last edited by SylvainL; 01-12-2017, 05:52 AM.

                        Comment


                        • Here is what I get. Using your sequence as a fasta file. Looks right to me.

                          Code:
                          $ cat nn.fa
                          
                          >test
                          TCCGAGGCAGTAGGCTCAACAGCAATATACCTTCTCGAGAGGTCTCTGTCCACAGATCGGAAGAGCACACGTCTGAACTCCAGTCACGTTTCGGAATCTC
                          
                          $ bbduk.sh in=nn.fa  out=stdout.fa mm=f hdist=0 edist=0 ktrim=l rcomp=f k=29 literal=CAACAGCAATATACCTTCTCGAGAGGTCT
                          
                          java -Djava.library.path=/path_to/bbmap-36.81/bbmap/jni/ -ea -Xmx19502m -Xms19502m -cp /path_to/bbmap-36.81/bbmap/current/ jgi.BBDukF in=nn.fa out=stdout.fa mm=f hdist=0 edist=0 ktrim=l rcomp=f k=29 literal=CAACAGCAATATACCTTCTCGAGAGGTCT
                          Executing jgi.BBDukF [in=nn.fa, out=stdout.fa, mm=f, hdist=0, edist=0, ktrim=l, rcomp=f, k=29, literal=CAACAGCAATATACCTTCTCGAGAGGTCT]
                          
                          BBDuk version 36.81
                          Initial:
                          Memory: max=19597m, free=19188m, used=409m
                          
                          Added 1 kmers; time:    0.027 seconds.
                          Memory: max=19597m, free=18472m, used=1125m
                          
                          Input is being processed as unpaired
                          Started output streams: 0.015 seconds.
                          
                          >test
                          CTGTCCACAGATCGGAAGAGCACACGTCTGAACTCCAGTCACGTTTCGGAATCTC
                          
                          Processing time:                0.006 seconds.
                          
                          Input:                          1 reads                 100 bases.
                          KTrimmed:                       1 reads (100.00%)       45 bases (45.00%)
                          Total Removed:                  0 reads (0.00%)         45 bases (45.00%)
                          Result:                         1 reads (100.00%)       55 bases (55.00%)
                          
                          Time:                           0.055 seconds.
                          Reads Processed:           1    0.02k reads/sec
                          Bases Processed:         100    0.00m bases/sec
                          Last edited by GenoMax; 01-12-2017, 06:04 AM.

                          Comment


                          • Yes I know...

                            As I edited, it works only if I set maxlength=1 !!

                            At least for the version I downloaded this morning

                            Comment


                            • It worked for me with your original command (minus the maxlength). I used your sequence as a fasta file though. Let me try as fastq.

                              Edit: Worked as fastq as well.

                              Code:
                              $ cat nn.fq
                              @test
                              TCCGAGGCAGTAGGCTCAACAGCAATATACCTTCTCGAGAGGTCTCTGTCCACAGATCGGAAGAGCACACGTCTGAACTCCAGTCACGTTTCGGAATCTC
                              +
                              CCCCCGGGGGFFFGGGGGGGGGGGGGGGGE7FGGGGGGGGGGGGGFGGGCEGGGGIIIIIIIIIIIIIIIIGIIIIIIIIIIIIIGIIGFIIFIIIIIII
                              
                              $ bbduk.sh in=nn.fq out=stdout.fq mm=f hdist=0 edist=0 ktrim=l rcomp=f k=29 literal=CAACAGCAATATACCTTCTCGAGAGGTCT
                              
                              java -Djava.library.path=/path_to/bbmap-36.81/bbmap/jni/ -ea -Xmx19502m -Xms19502m -cp /path_to/bbmap-36.81/bbmap/current/ jgi.BBDukF in=nn.fq out=stdout.fq mm=f hdist=0 edist=0 ktrim=l rcomp=f k=29 literal=CAACAGCAATATACCTTCTCGAGAGGTCT
                              Executing jgi.BBDukF [in=nn.fq, out=stdout.fq, mm=f, hdist=0, edist=0, ktrim=l, rcomp=f, k=29, literal=CAACAGCAATATACCTTCTCGAGAGGTCT]
                              
                              BBDuk version 36.81
                              Initial:
                              Memory: max=19597m, free=19188m, used=409m
                              
                              Added 1 kmers; time:    0.031 seconds.
                              Memory: max=19597m, free=18472m, used=1125m
                              
                              Input is being processed as unpaired
                              Started output streams: 0.019 seconds.
                              
                              @test
                              CTGTCCACAGATCGGAAGAGCACACGTCTGAACTCCAGTCACGTTTCGGAATCTC
                              +
                              FGGGCEGGGGIIIIIIIIIIIIIIIIGIIIIIIIIIIIIIGIIGFIIFIIIIIII
                              Processing time:                0.009 seconds.
                              
                              Input:                          1 reads                 100 bases.
                              KTrimmed:                       1 reads (100.00%)       45 bases (45.00%)
                              Total Removed:                  0 reads (0.00%)         45 bases (45.00%)
                              Result:                         1 reads (100.00%)       55 bases (55.00%)
                              
                              Time:                           0.066 seconds.
                              Reads Processed:           1    0.02k reads/sec
                              Bases Processed:         100    0.00m bases/sec
                              Last edited by GenoMax; 01-12-2017, 06:29 AM.

                              Comment


                              • I am using BBDuk version 36.84, I guess that's the main difference...

                                Comment

                                Latest Articles

                                Collapse

                                • seqadmin
                                  Current Approaches to Protein Sequencing
                                  by seqadmin


                                  Proteins are often described as the workhorses of the cell, and identifying their sequences is key to understanding their role in biological processes and disease. Currently, the most common technique used to determine protein sequences is mass spectrometry. While still a valuable tool, mass spectrometry faces several limitations and requires a highly experienced scientist familiar with the equipment to operate it. Additionally, other proteomic methods, like affinity assays, are constrained...
                                  04-04-2024, 04:25 PM
                                • seqadmin
                                  Strategies for Sequencing Challenging Samples
                                  by seqadmin


                                  Despite advancements in sequencing platforms and related sample preparation technologies, certain sample types continue to present significant challenges that can compromise sequencing results. Pedro Echave, Senior Manager of the Global Business Segment at Revvity, explained that the success of a sequencing experiment ultimately depends on the amount and integrity of the nucleic acid template (RNA or DNA) obtained from a sample. “The better the quality of the nucleic acid isolated...
                                  03-22-2024, 06:39 AM

                                ad_right_rmr

                                Collapse

                                News

                                Collapse

                                Topics Statistics Last Post
                                Started by seqadmin, 04-11-2024, 12:08 PM
                                0 responses
                                23 views
                                0 likes
                                Last Post seqadmin  
                                Started by seqadmin, 04-10-2024, 10:19 PM
                                0 responses
                                24 views
                                0 likes
                                Last Post seqadmin  
                                Started by seqadmin, 04-10-2024, 09:21 AM
                                0 responses
                                21 views
                                0 likes
                                Last Post seqadmin  
                                Started by seqadmin, 04-04-2024, 09:00 AM
                                0 responses
                                52 views
                                0 likes
                                Last Post seqadmin  
                                Working...
                                X