Announcement

Collapse
No announcement yet.

Introducing Tadpole: an assembler, error-corrector, and read-extender

Collapse
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • #31
    Hi, Brian,

    Thanks for building this amazing assembler. I have been using it from within the Geneious software package for a few weeks now with great success. Quick question: is it possible to enter a command line under 'custom Tadpole options' that will cause the assembler to only use a specific percentage of the data? Also, is it possible to specify usage of maximum number of reads when performing the assembly (as opposed to a percent)?

    I ask because I am trying to assembly a large number of plasmids from MiSeq FASTQ data, and I've found that the results are much better when only using 10% of the data. However, there is a bug wherein this option is missing from the workflow version of Tadpole; and another where it defaults to 100% when attempting to assemble each sequence list separately. They know about the first bug and are fixing it, but I thought there might be a quick workaround via the command line.

    Thanks!

    Rob

    Comment


    • #32
      Hi Rob,

      Tadpole does not do subsampling; you'd have to first sample 10% of the reads with another tool (such as Reformat, with the "samplerate=0.1" flag). However, you CAN restrict the input to a certain number of reads. "reads=400k", for example, will only use the first 400,000 reads (or if the reads are paired, the first 400,000 pairs).

      Tadpole also has "mincr" and "maxcr" flags which allow you to ignore kmers with counts outside of some band, and is sometimes useful for ignoring low-depth junk than can occur when you have too much data:

      mincountretain=0 (mincr) Discard kmers with count below this.
      maxcountretain=INF (maxcr) Discard kmers with count above this.
      Also, I recommend you try normalization (targeting maybe 100x coverage) - there are many situations where normalization gives a better assembly than subsampling, particularly when the coverage of features is highly variable.

      Good luck!

      Comment


      • #33
        Hi, Brian,

        Thanks very much for your quick feedback. That does help, thank you. I was able to specify reads=1000 and that approximated the 10% for some test cases. I took a quick shot at normalizing, and I'm having some issues with it, but they may have more to do with Geneious than anything.

        Thanks again!

        Rob

        Comment


        • #34
          Originally posted by Brian Bushnell View Post
          Don't feel like you have to use all aspects of Tadpole in order to use it effectively! I am currently using it for mitochondrial assembly also, because it's easy to set a specific depth band to assemble, and thus pull out the mito without the main genome after identifying it on a kmer frequency histogram (in fact, I wrote a script to do this automatically). But in that case I don't actually use the error-correction or extension capabilities, as they are not usually necessary as the coverage is already incredibly high and low-depth kmers are being ignored. I use those more for single-cell work, which has lots of very-low-depth regions.
          How do you identify the mitochondrial band? Could you share the script?

          Comment


          • #35
            Sure:

            Code:
            #First link reference as ref.fa and reads as reads.fa
            
            /global/projectb/sandbox/gaag/bbtools/jgi-bbtools/kmercountexact.sh in=reads.fq.gz khist=khist_raw.txt peaks=peaks_raw.txt
            
            primary=`grep "haploid_fold_coverage" peaks_raw.txt | sed "s/^.*\t//g"`
            cutoff=$(( $primary * 3 ))
            
            bbnorm.sh in=reads.fq.gz out=highpass.fq.gz pigz passes=1 bits=16 min=$cutoff target=9999999
            reformat.sh in=highpass.fq.gz out=highpass_gc.fq.gz maxgc=0.45
            
            kmercountexact.sh in=highpass_gc.fq.gz khist=khist_100.txt k=100 peaks=peaks_100.txt smooth ow smoothradius=1 maxradius=1000 progressivemult=1.06 maxpeaks=16 prefilter=2
            
            mitopeak=`grep "main_peak" peaks_100.txt | sed "s/^.*\t//g"`
            
            upper=$((mitopeak * 6 / 3))
            lower=$((mitopeak * 3 / 7))
            mcs=$((mitopeak * 3 / 4))
            mincov=$((mitopeak * 2 / 3))
            
            tadpole.sh in=highpass_gc.fq.gz out=contigs78.fa prefilter=2 mincr=$lower maxcr=$upper mcs=$mcs mincov=$mincov k=78
            tadpole.sh in=highpass_gc.fq.gz out=contigs100.fa prefilter=2 mincr=$lower maxcr=$upper mcs=$mcs mincov=$mincov k=100
            tadpole.sh in=highpass_gc.fq.gz out=contigs120.fa prefilter=2 mincr=$lower maxcr=$upper mcs=$mcs mincov=$mincov k=120
            
            bbduk.sh in=highpass.fq.gz ref=contigs100.fa outm=bbd005.fq.gz k=31 mm=f mkf=0.05
            
            tadpole.sh in=bbd005.fq.gz out=contigs_bbd.fa prefilter=2 mincr=$((mitopeak * 3 / 8)) maxcr=$((upper * 2)) mcs=$mcs mincov=$mincov k=100 bm1=6
            
            ln -s contigs_bbd.fa contigs.fa
            This works well with error-corrected PacBio reads, which have a very unbiased coverage distribution. It was able to isolate and assemble the mitochondria in 4/4 fungi tested. I'm not sure how well it works on Illumina reads, though. Also, note that the default k=100 is designed for reads of at least 150bp.

            Comment


            • #36
              Originally posted by Brian Bushnell View Post
              This works well with error-corrected PacBio reads, which have a very unbiased coverage distribution. It was able to isolate and assemble the mitochondria in 4/4 fungi tested. I'm not sure how well it works on Illumina reads, though. Also, note that the default k=100 is designed for reads of at least 150bp.
              ok, noted. I will test on Illumina reads and report back later. Many thanks.

              Comment


              • #37
                Hi! Although this is a very helpful post, there is still so much unclarities about Tadpole.
                As input, I have fasta file with ~11 mil reads, and I would like for Tadpole to take 50% random reads from this fasta file.
                I see that there is reads=-1 input parameter, where -1 (default) stands for all reads, so I am curious what could stand for 50% etc? Cannot find any info on this parameter, so I would appreciate any help!

                Comment


                • #38
                  Originally posted by JeanLove View Post
                  Hi! Although this is a very helpful post, there is still so much unclarities about Tadpole.
                  As input, I have fasta file with ~11 mil reads, and I would like for Tadpole to take 50% random reads from this fasta file.
                  I see that there is reads=-1 input parameter, where -1 (default) stands for all reads, so I am curious what could stand for 50% etc? Cannot find any info on this parameter, so I would appreciate any help!
                  Provide a number that you want to use e.g. reads=55000000. I don't know if that randomly subsamples though. It may be first 5.5M reads. @Brian will have to clarify.

                  If you want to independently subsample the reads you could also use reformat.sh or seqtk (sample) from Heng Li.

                  Comment


                  • #39
                    Hi, I am using tadpole from BBMap v36.11 and received some java error messages. The command I used is:

                    tadpole.sh -Xmx1000g in1=read1.corrected.fq.gz in2=read2.corrected.fq.gz out=contigs.fa.gz

                    Code:
                    java.io.EOFException: Unexpected end of ZLIB input stream
                            at java.util.zip.InflaterInputStream.fill(InflaterInputStream.java:240)
                            at java.util.zip.InflaterInputStream.read(InflaterInputStream.java:158)
                            at java.util.zip.GZIPInputStream.read(GZIPInputStream.java:117)
                            at fileIO.ByteFile1.fillBuffer(ByteFile1.java:217)
                            at fileIO.ByteFile1.nextLine(ByteFile1.java:136)
                            at fileIO.ByteFile2$BF1Thread.run(ByteFile2.java:264)
                    open=true
                    
                    java.io.EOFException: Unexpected end of ZLIB input stream
                            at java.util.zip.InflaterInputStream.fill(InflaterInputStream.java:240)
                            at java.util.zip.InflaterInputStream.read(InflaterInputStream.java:158)
                            at java.util.zip.GZIPInputStream.read(GZIPInputStream.java:117)
                            at fileIO.ByteFile1.fillBuffer(ByteFile1.java:217)
                            at fileIO.ByteFile1.nextLine(ByteFile1.java:136)
                            at fileIO.ByteFile2$BF1Thread.run(ByteFile2.java:264)
                    open=true
                    
                    Exception in thread "Thread-4" java.lang.AssertionError:
                    There appear to be different numbers of reads in the paired input files.
                    The pairing may have been corrupted by an upstream process.  It may be fixable by running repair.sh.
                            at stream.ConcurrentGenericReadInputStream.pair(ConcurrentGenericReadInputStream.java:480)
                            at stream.ConcurrentGenericReadInputStream.readLists(ConcurrentGenericReadInputStream.java:345)
                            at stream.ConcurrentGenericReadInputStream.run(ConcurrentGenericReadInputStream.java:189)
                            at java.lang.Thread.run(Thread.java:745)

                    Comment


                    • #40
                      From the error messages, your input read files are corrupted:

                      ...Unexpected end of ZLIB input stream
                      ...Unexpected end of ZLIB input stream
                      ...The pairing may have been corrupted by an upstream process.
                      Do you have any other program that can read the files to check for errors? At the very least, a simple count of lines would be useful:

                      Code:
                      wc -l read1.corrected.fq.gz; wc -l read2.corrected.fq.gz
                      [numbers should be identical, and this should produce no errors]

                      Comment


                      • #41
                        Originally posted by gringer View Post
                        From the error messages, your input read files are corrupted:



                        Do you have any other program that can read the files to check for errors? At the very least, a simple count of lines would be useful:

                        Code:
                        wc -l read1.corrected.fq.gz; wc -l read2.corrected.fq.gz
                        [numbers should be identical, and this should produce no errors]
                        Gringer is correct; the files appear to be corrupt. But just to add a small correction, it does not look like wc can handle gzipped files, so I'd suggest either:

                        Code:
                        zcat read1.corrected.fq.gz | wc -l; zcat read2.corrected.fq.gz | wc -l
                        ...which I have no idea how it would behave on a corrupt file, but should produce a correct answer for a correct file. Or this:
                        Code:
                        gzip -t read1.corrected.fq.gz; gzip -t read2.corrected.fq.gz
                        ...which will test the integrity of the files. It seems to print nothing if the file is OK; if the file is truncated, it will print:
                        Code:
                        gzip: x.gz: unexpected end of file
                        Probably, the process generating the corrected files (probably Tadpole) was killed or timed out before it was finished.

                        Comment


                        • #42
                          Hi gringer, thanks, I didn't realize the multiple fastq files wasn't concatenated properly before inputting them to tadpole. I'll fix it and try again.

                          Comment


                          • #43
                            I just saw this comment from Brian Bushnell:

                            P.S. DO NOT use read-extension or error-correction for metagenomic 16S or other amplicon studies! It is intended only for randomly-sheared fragment libraries. Error-correction or read-extension using any algorithm are a bad idea for any amplicon library with a long primer. For normal metagenomic fragment libraries, these operations should be useful and safe if you specify a sufficiently long K.

                            I recently used Tadpole for contig extension. After reading this comment, I am not sure Tadpole is a reliable tool for this. I have full-length 16S rRNA gene sequences from clone libraries. I took the variable region of 16S rRNA gene sequences (first 300 nt nucleotides) and mapped on metagenome contigs which are taxonomically classified as taxa of interest. These metagenome contigs were assembled from randomly-sheared metagenomic fragment libraries (sequencing platfrom: HiSeq). 300 nt nucleotide of 16S rRNA gene sequence perfectly mapped on a contig. I then run Tadpole to extend this contig using all pair-end reads in this metagenome. I could extend the contig up to 6500 nucleotide.

                            Is Tadpole reliable for such usages?

                            Comment


                            • #44
                              Since you are pulling the kmer information from randomly-sheared metagenomic reads, and using the variable region as a seed, it should be safe. However, there is a possibility of running into a tandem repeat region during extension when you give it a very high extension limit, so I recommend also adding the flag "ibb=f". It probably won't make any difference though.

                              Comment


                              • #45
                                Originally posted by Brian Bushnell View Post
                                BBTools generally don't care whether paired read input is interleaved or in 2 files, so you don't need to explicitly interleave them. For example, either of these:

                                tadpole.sh mode=correct in=reads.fq out=corrected.fq

                                tadpole.sh mode=correct in1=read1.fq in2=read2.fq oute1=corrected1.fq oute2=corrected2.fq
                                For output, this is out1 and out2 or oute1 and oute2 (I don't see oute in the manual).

                                Comment

                                Working...
                                X