Introducing Tadpole: an assembler, error-corrector, and read-extender

Brian Bushnell replied

12-16-2016, 08:43 AM
Thanks for noting that; I've corrected it. Originally, for error-correction, you had to say "oute" for the output but I changed that a while ago so the current syntax is "out1" and "out2".
Leave a comment:
Brian Bushnell replied

12-01-2016, 10:24 AM
Since you are pulling the kmer information from randomly-sheared metagenomic reads, and using the variable region as a seed, it should be safe. However, there is a possibility of running into a tandem repeat region during extension when you give it a very high extension limit, so I recommend also adding the flag "ibb=f". It probably won't make any difference though.
Leave a comment:
bavci replied

12-01-2016, 07:22 AM
I just saw this comment from Brian Bushnell:

P.S. DO NOT use read-extension or error-correction for metagenomic 16S or other amplicon studies! It is intended only for randomly-sheared fragment libraries. Error-correction or read-extension using any algorithm are a bad idea for any amplicon library with a long primer. For normal metagenomic fragment libraries, these operations should be useful and safe if you specify a sufficiently long K.

I recently used Tadpole for contig extension. After reading this comment, I am not sure Tadpole is a reliable tool for this. I have full-length 16S rRNA gene sequences from clone libraries. I took the variable region of 16S rRNA gene sequences (first 300 nt nucleotides) and mapped on metagenome contigs which are taxonomically classified as taxa of interest. These metagenome contigs were assembled from randomly-sheared metagenomic fragment libraries (sequencing platfrom: HiSeq). 300 nt nucleotide of 16S rRNA gene sequence perfectly mapped on a contig. I then run Tadpole to extend this contig using all pair-end reads in this metagenome. I could extend the contig up to 6500 nucleotide.

Is Tadpole reliable for such usages?
Leave a comment:
ycl6 replied

06-30-2016, 04:00 PM
Hi gringer, thanks, I didn't realize the multiple fastq files wasn't concatenated properly before inputting them to tadpole. I'll fix it and try again.
Leave a comment:
Brian Bushnell replied

06-30-2016, 09:15 AM
Originally posted by gringer View Post

From the error messages, your input read files are corrupted:

Do you have any other program that can read the files to check for errors? At the very least, a simple count of lines would be useful:

Code:

wc -l read1.corrected.fq.gz; wc -l read2.corrected.fq.gz

[numbers should be identical, and this should produce no errors]

Gringer is correct; the files appear to be corrupt. But just to add a small correction, it does not look like wc can handle gzipped files, so I'd suggest either:

Code:

zcat read1.corrected.fq.gz | wc -l; zcat read2.corrected.fq.gz | wc -l

...which I have no idea how it would behave on a corrupt file, but should produce a correct answer for a correct file. Or this:

Code:

gzip -t read1.corrected.fq.gz; gzip -t read2.corrected.fq.gz

...which will test the integrity of the files. It seems to print nothing if the file is OK; if the file is truncated, it will print:

Code:

gzip: x.gz: unexpected end of file

Probably, the process generating the corrected files (probably Tadpole) was killed or timed out before it was finished.
Leave a comment:
gringer replied

06-30-2016, 02:06 AM
From the error messages, your input read files are corrupted:

...Unexpected end of ZLIB input stream
...Unexpected end of ZLIB input stream
...The pairing may have been corrupted by an upstream process.

Do you have any other program that can read the files to check for errors? At the very least, a simple count of lines would be useful:

Code:

wc -l read1.corrected.fq.gz; wc -l read2.corrected.fq.gz

[numbers should be identical, and this should produce no errors]
Leave a comment:

ycl6 replied

06-30-2016, 12:37 AM

Hi, I am using tadpole from BBMap v36.11 and received some java error messages. The command I used is:

tadpole.sh -Xmx1000g in1=read1.corrected.fq.gz in2=read2.corrected.fq.gz out=contigs.fa.gz

Code:

java.io.EOFException: Unexpected end of ZLIB input stream
        at java.util.zip.InflaterInputStream.fill(InflaterInputStream.java:240)
        at java.util.zip.InflaterInputStream.read(InflaterInputStream.java:158)
        at java.util.zip.GZIPInputStream.read(GZIPInputStream.java:117)
        at fileIO.ByteFile1.fillBuffer(ByteFile1.java:217)
        at fileIO.ByteFile1.nextLine(ByteFile1.java:136)
        at fileIO.ByteFile2$BF1Thread.run(ByteFile2.java:264)
open=true

java.io.EOFException: Unexpected end of ZLIB input stream
        at java.util.zip.InflaterInputStream.fill(InflaterInputStream.java:240)
        at java.util.zip.InflaterInputStream.read(InflaterInputStream.java:158)
        at java.util.zip.GZIPInputStream.read(GZIPInputStream.java:117)
        at fileIO.ByteFile1.fillBuffer(ByteFile1.java:217)
        at fileIO.ByteFile1.nextLine(ByteFile1.java:136)
        at fileIO.ByteFile2$BF1Thread.run(ByteFile2.java:264)
open=true

Exception in thread "Thread-4" java.lang.AssertionError:
There appear to be different numbers of reads in the paired input files.
The pairing may have been corrupted by an upstream process.  It may be fixable by running repair.sh.
        at stream.ConcurrentGenericReadInputStream.pair(ConcurrentGenericReadInputStream.java:480)
        at stream.ConcurrentGenericReadInputStream.readLists(ConcurrentGenericReadInputStream.java:345)
        at stream.ConcurrentGenericReadInputStream.run(ConcurrentGenericReadInputStream.java:189)
        at java.lang.Thread.run(Thread.java:745)

Leave a comment:

GenoMax replied

04-18-2016, 05:29 AM
Originally posted by JeanLove View Post

Hi! Although this is a very helpful post, there is still so much unclarities about Tadpole.
As input, I have fasta file with ~11 mil reads, and I would like for Tadpole to take 50% random reads from this fasta file.
I see that there is reads=-1 input parameter, where -1 (default) stands for all reads, so I am curious what could stand for 50% etc? Cannot find any info on this parameter, so I would appreciate any help!

Provide a number that you want to use e.g. reads=55000000. I don't know if that randomly subsamples though. It may be first 5.5M reads. @Brian will have to clarify.

If you want to independently subsample the reads you could also use reformat.sh or seqtk (sample) from Heng Li.
Leave a comment:
JeanLove replied

04-18-2016, 05:09 AM
Hi! Although this is a very helpful post, there is still so much unclarities about Tadpole.
As input, I have fasta file with ~11 mil reads, and I would like for Tadpole to take 50% random reads from this fasta file.
I see that there is reads=-1 input parameter, where -1 (default) stands for all reads, so I am curious what could stand for 50% etc? Cannot find any info on this parameter, so I would appreciate any help!
Leave a comment:
nepossiver replied

11-06-2015, 11:10 AM
Originally posted by Brian Bushnell View Post

This works well with error-corrected PacBio reads, which have a very unbiased coverage distribution. It was able to isolate and assemble the mitochondria in 4/4 fungi tested. I'm not sure how well it works on Illumina reads, though. Also, note that the default k=100 is designed for reads of at least 150bp.

ok, noted. I will test on Illumina reads and report back later. Many thanks.
Leave a comment:

Brian Bushnell replied

11-06-2015, 10:52 AM

Sure:

Code:

#First link reference as ref.fa and reads as reads.fa

/global/projectb/sandbox/gaag/bbtools/jgi-bbtools/kmercountexact.sh in=reads.fq.gz khist=khist_raw.txt peaks=peaks_raw.txt

primary=`grep "haploid_fold_coverage" peaks_raw.txt | sed "s/^.*\t//g"`
cutoff=$(( $primary * 3 ))

bbnorm.sh in=reads.fq.gz out=highpass.fq.gz pigz passes=1 bits=16 min=$cutoff target=9999999
reformat.sh in=highpass.fq.gz out=highpass_gc.fq.gz maxgc=0.45

kmercountexact.sh in=highpass_gc.fq.gz khist=khist_100.txt k=100 peaks=peaks_100.txt smooth ow smoothradius=1 maxradius=1000 progressivemult=1.06 maxpeaks=16 prefilter=2

mitopeak=`grep "main_peak" peaks_100.txt | sed "s/^.*\t//g"`

upper=$((mitopeak * 6 / 3))
lower=$((mitopeak * 3 / 7))
mcs=$((mitopeak * 3 / 4))
mincov=$((mitopeak * 2 / 3))

tadpole.sh in=highpass_gc.fq.gz out=contigs78.fa prefilter=2 mincr=$lower maxcr=$upper mcs=$mcs mincov=$mincov k=78
tadpole.sh in=highpass_gc.fq.gz out=contigs100.fa prefilter=2 mincr=$lower maxcr=$upper mcs=$mcs mincov=$mincov k=100
tadpole.sh in=highpass_gc.fq.gz out=contigs120.fa prefilter=2 mincr=$lower maxcr=$upper mcs=$mcs mincov=$mincov k=120

bbduk.sh in=highpass.fq.gz ref=contigs100.fa outm=bbd005.fq.gz k=31 mm=f mkf=0.05

tadpole.sh in=bbd005.fq.gz out=contigs_bbd.fa prefilter=2 mincr=$((mitopeak * 3 / 8)) maxcr=$((upper * 2)) mcs=$mcs mincov=$mincov k=100 bm1=6

ln -s contigs_bbd.fa contigs.fa

This works well with error-corrected PacBio reads, which have a very unbiased coverage distribution. It was able to isolate and assemble the mitochondria in 4/4 fungi tested. I'm not sure how well it works on Illumina reads, though. Also, note that the default k=100 is designed for reads of at least 150bp.

Leave a comment:

nepossiver replied

11-06-2015, 05:45 AM
Originally posted by Brian Bushnell View Post

Don't feel like you have to use all aspects of Tadpole in order to use it effectively! I am currently using it for mitochondrial assembly also, because it's easy to set a specific depth band to assemble, and thus pull out the mito without the main genome after identifying it on a kmer frequency histogram (in fact, I wrote a script to do this automatically). But in that case I don't actually use the error-correction or extension capabilities, as they are not usually necessary as the coverage is already incredibly high and low-depth kmers are being ignored. I use those more for single-cell work, which has lots of very-low-depth regions.

How do you identify the mitochondrial band? Could you share the script?
Leave a comment:
RobMaher replied

10-28-2015, 08:07 PM
Hi, Brian,

Thanks very much for your quick feedback. That does help, thank you. I was able to specify reads=1000 and that approximated the 10% for some test cases. I took a quick shot at normalizing, and I'm having some issues with it, but they may have more to do with Geneious than anything.

Thanks again!

Rob
Leave a comment:
Brian Bushnell replied

10-28-2015, 03:59 PM
Hi Rob,

Tadpole does not do subsampling; you'd have to first sample 10% of the reads with another tool (such as Reformat, with the "samplerate=0.1" flag). However, you CAN restrict the input to a certain number of reads. "reads=400k", for example, will only use the first 400,000 reads (or if the reads are paired, the first 400,000 pairs).

Tadpole also has "mincr" and "maxcr" flags which allow you to ignore kmers with counts outside of some band, and is sometimes useful for ignoring low-depth junk than can occur when you have too much data:

mincountretain=0 (mincr) Discard kmers with count below this.
maxcountretain=INF (maxcr) Discard kmers with count above this.

Also, I recommend you try normalization (targeting maybe 100x coverage) - there are many situations where normalization gives a better assembly than subsampling, particularly when the coverage of features is highly variable.

Good luck!
Leave a comment:
RobMaher replied

10-28-2015, 01:17 PM
Hi, Brian,

Thanks for building this amazing assembler. I have been using it from within the Geneious software package for a few weeks now with great success. Quick question: is it possible to enter a command line under 'custom Tadpole options' that will cause the assembler to only use a specific percentage of the data? Also, is it possible to specify usage of maximum number of reads when performing the assembly (as opposed to a percent)?

I ask because I am trying to assembly a large number of plasmids from MiSeq FASTQ data, and I've found that the results are much better when only using 10% of the data. However, there is a bug wherein this option is missing from the workflow version of Tadpole; and another where it defaults to 100% when attempting to assemble each sequence list separately. They know about the first bug and are fixing it, but I thought there might be a quick workaround via the command line.

Thanks!

Rob
Leave a comment:

Previous 1 2 3 4 5 6 7 template Next

Essential Discoveries and Tools in Epitranscriptomics

by seqadmin

The field of epigenetics has traditionally concentrated more on DNA and how changes like methylation and phosphorylation of histones impact gene expression and regulation. However, our increased understanding of RNA modifications and their importance in cellular processes has led to a rise in epitranscriptomics research. “Epitranscriptomics brings together the concepts of epigenetics and gene expression,” explained Adrien Leger, PhD, Principal Research Scientist...
- Channel: Articles
04-22-2024, 07:01 AM
Current Approaches to Protein Sequencing

by seqadmin

Proteins are often described as the workhorses of the cell, and identifying their sequences is key to understanding their role in biological processes and disease. Currently, the most common technique used to determine protein sequences is mass spectrometry. While still a valuable tool, mass spectrometry faces several limitations and requires a highly experienced scientist familiar with the equipment to operate it. Additionally, other proteomic methods, like affinity assays, are constrained...
- Channel: Articles
04-04-2024, 04:25 PM

Topics	Statistics	Last Post
Expanding the Horizons of Cellular Research with the Single Cell Atlas by seqadmin Started by seqadmin, Yesterday, 11:49 AM	0 responses 15 views 0 likes	Last Post by seqadmin Yesterday, 11:49 AM
Genetic Variants and Diabetes Risk in Childhood Cancer Survivors by seqadmin Started by seqadmin, 04-24-2024, 08:47 AM	0 responses 16 views 0 likes	Last Post by seqadmin 04-24-2024, 08:47 AM
Cancer Metastasis: A Deep Dive into Cellular Plasticity by seqadmin Started by seqadmin, 04-11-2024, 12:08 PM	0 responses 61 views 0 likes	Last Post by seqadmin 04-11-2024, 12:08 PM
Proteogenomic Profiles Offer New Clues in Prostate Cancer by seqadmin Started by seqadmin, 04-10-2024, 10:19 PM	0 responses 60 views 0 likes	Last Post by seqadmin 04-10-2024, 10:19 PM

Seqanswers Leaderboard Ad

Announcement

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment:

Latest Articles

ad_right_rmr

News