sub-setting input for tadpole.sh
Hey there,
I have been using tadpole for error correction and I am extremely happy with its results and performance. Much appreciated!
I am looking for support/advice on what seems a relatively simple thing to do, but I can't seem to solve in a simple manner: is it possible to run the tadpole.sh family of commands on sub-sets of a given input file?
Contrary to the problem one has when assembling genomes (millions of small partitions of a single large object), the fields of single-cell genomics and single-molecule sequencing present a different paradigm: one has thousands of smaller objects (cells or molecules) scattered around the reads, usually determined by the inclusion of unique molecular identifiers (UMI).
It would be fantastic to be able to run the tadpole.sh suite of algorithms in this different type of problem. Two applications for which I have successfully used tadpole with this set of mind are 1) on the assembly randomly fragmented mRNA libraries (where UMIs serve as a handle) basically making virtual long reads out of Illumina experiments and 2) the correction of clouds of reads from amplicons (again - held together by a shared UMI) in order to detect SNPs or and indels.
The challenge I face now is that of scalability. Even though the generation of thousand of small sets of reads into individual fastq for passing to tadpole works well in principle, in practice is a big challenge for even mid-sizes data sets. The IO overhead for this paradigm results into week-long runtimes. Where as running tadpole on the whole fastq takes less than a minute (using 100 cores)
From a naive point of view, it feels as if having the option to sub-set input files (regex or list of items to name a couple) would be a solution to this problem of scalability since iterating through a single file thousands of times sound more efficient (at logistically simple) than generating thousands of little mini jobs. I wonder if you have any suggestion on how to tackle this situation.
Thanks again for the great work here!
Seqanswers Leaderboard Ad
Collapse
Announcement
Collapse
No announcement yet.
X
-
Originally posted by GenoMax View PostInstead of trying to extend the reads use tadpole in assembly mode. I linked a guide for how to use "tadpole" and its modes in my last comment.
You may have too much coverage for small genomes such as these (if the libraries were made from isolated mitochondria/chloroplast DNA). You may have to use "bbnorm.sh" to normalize your data. A guide to use BBNorm is available at this link.
Hi GenoMax! Thank you for your reply!
My libraries were not made from isolated mitochondria and chloroplast DNA. We made a whole genome sequencing of purple maize. Then I mapped all the reads agains the reference genome (10 chromosomes + mitochondria + chloroplast) using bowtie2. Then, using samtools, I extracted the alignments (bams) only for mitochondria and chloroplast and with samtools fastq I extracted the reads mapped agains the reference mitochondria and chloroplast. I used repair.sh to sort the reads by name and to have the same number of reads per fastq file.
I want to use these reads to do de novo assembly.
For chloroplast, I have a total coverage of 5600x. Doing sampling to have a 60x or 90x of coverage, and kmers of 37,47,57,67 or close, I get a highly fragmented assembly.
For mitochondria, I have a total coverage of 1400x. Doing sampling to have a 60x of coverage and kmers 47,57,67,77 I got an assembly of 46 contigs and using kmers of 45,65,85,95, I got 31 contigs.
I was thinking of using tadpole for assembly but I wasn't sure if I should extend my reads and what would be the correct coverage and kmer length to use in that case...
Thank you so much in advance for your adviceLast edited by silverfox; 07-27-2019, 04:05 PM.
Leave a comment:
-
Instead of trying to extend the reads use tadpole in assembly mode. I linked a guide for how to use "tadpole" and its modes in my last comment.
You may have too much coverage for small genomes such as these (if the libraries were made from isolated mitochondria/chloroplast DNA). You may have to use "bbnorm.sh" to normalize your data. A guide to use BBNorm is available at this link.
Leave a comment:
-
Originally posted by Brian Bushnell View PostIn default mode, Tadpole assembles reads and produces contigs. In "extend" or "correct" mode, it will extend or correct input sequences - which can be reads or contigs, but it's designed for reads. When I use Tadpole for assembly, I often first correct the reads, then assemble them, which takes two passes. Tadpole will build contigs unless you explicitly add the flag "mode=extend" or "mode=correct", regardless of whether you have 1 or 2 inputs. In extend or correct mode, it will modify the input reads, and not make contigs.
I'm glad to hear that you've achieved more contiguous assemblies after extending the reads and assembling them with longer kmers - that was my goal in designing that mode (and particularly, to allow merging of non-overlapping reads), but unfortunately I've been too busy to test it thoroughly. You've given me a second data point about it being beneficial, though, so thanks!
"shave" and "rinse" are what some assemblers call "tip removal" and "bubble removal". But, they are implemented a bit differently and occur before the graph is built, rather than as graph simplification routines. As such, they pose virtually no risk of causing misassemblies, and reduce the risk of misassemblies due to single chimeric reads. But unfortunately, in my experience, they also only provide very minor improvements in continuity or error-correction. Sometimes they make subsequent operations faster, though. By default, adding the flag "shave" will remove dead-end kmer paths of depth no more than 1 and length no more than 150 that branch out of a path with substantially greater depth. "rinse", similarly, only removes short paths of depth no more than 1 in which each end terminates in a branch node of substantially greater depth. Because these operations are so conservative, they seem to have little impact. Assemblers like Velvet and AllPaths-LG can collapse bubbles with a 50-50 split as to the path depth, which greatly increases the continuity (particularly with diploid organisms), but poses the risk of misassemblies when there are repeat elements. Tadpole always errs on the side of caution, preferring lower continuity to possible misassemblies.
Tadpole is still not pair-aware and does not perform scaffolding, though that's certainly my next goal, when I get a chance. When you generate contigs, Tadpole automatically runs AssemblyStats (which you can run as standalone using stats.sh). This mentions scaffolds in various places, because it's designed for assemblies that are potentially scaffolded, but you'll note that for Tadpole the scaffold statistics and contig statistics are identical.
Don't feel like you have to use all aspects of Tadpole in order to use it effectively! I am currently using it for mitochondrial assembly also, because it's easy to set a specific depth band to assemble, and thus pull out the mito without the main genome after identifying it on a kmer frequency histogram (in fact, I wrote a script to do this automatically). But in that case I don't actually use the error-correction or extension capabilities, as they are not usually necessary as the coverage is already incredibly high and low-depth kmers are being ignored. I use those more for single-cell work, which has lots of very-low-depth regions.
I was using the reads that mapped agains a close reference for both, and doing de novo assembly. I plan to extend my read length and perform an assembly with longer kmer length like it is suggested here
A question please, what coverage over the reference genome of my mitocondria do you suggest?
I have a total coverage of 1400x but my advisor told me to use 90x.
Also, reading your comments, can you point to a guide to learn how to "Don't feel like you have to use all aspects of Tadpole in order to use it effectively! I am currently using it for mitochondrial assembly also, because it's easy to set a specific depth band to assemble, and thus pull out the mito without the main genome after identifying it on a kmer frequency histogram"?
Also, what do you mean by "depth band to assemble" ?
Sorry, i'm very very new to this field.
Thank you very much in advanceLast edited by silverfox; 07-27-2019, 09:52 AM.
Leave a comment:
-
Originally posted by GenoMax View PostIs this data in addition to contigs file or are they reads that were used to make the contig assembly?
Take a look at this tadpole usage guide to see if it helps.
These reads data that I gave tadpole are the whole set of reads (interleaved), including the ones used in the assembly but also new ones.Last edited by silverfox; 07-27-2019, 07:55 AM.
Leave a comment:
-
mit_1.fastq y mit_2.fastq (I'm assembling a mitogenome) and I want to use them to extend my contig
Take a look at this tadpole usage guide to see if it helps.
Leave a comment:
-
Originally posted by Brian Bushnell View PostBBTools generally don't care whether paired read input is interleaved or in 2 files, so you don't need to explicitly interleave them. For example, either of these:
tadpole.sh mode=correct in=reads.fq out=corrected.fq
tadpole.sh mode=correct in1=read1.fq in2=read2.fq out1=corrected1.fq out2=corrected2.fq
...will give identical results, but this:
tadpole.sh mode=correct in=read1.fq out=corrected1.fq ordered
tadpole.sh mode=correct in=read2.fq out=corrected2.fq ordered
...would give inferior results. Furthermore, corrected1 and corrected2 in that case would end up with reads in different orders if you forget to add the "ordered" flag.
Many programs - such as BBDuk, BBNorm, BBMap, Seal, Tadpole, Dedupe, CalcTrueQuality - will give superior output when processing paired reads together rather than separately, and some, like BBMerge, require them to be processed together. There are a few, like Reformat, that don't care, but generally I recommend processing pairs together whenever possible. Again, though, it doesn't matter if they are in 2 files or interleaved into 1 file. If you are reading compressed files, then dual files have a higher theoretical max speed, but I normally find using a single interleaved file more convenient.
HTML Code:Extending contigs with reads could be done like this: tadpole.sh in=contigs.fa out=extended.fa el=100 er=100 mode=extend extra=reads.fq k=62
How should I put my two sets of reads? (mit_1.fastq y mit_2.fastq)
Perhaps: extra=mit_1.fastq,mit_2.fastq ?
or
extra1=mit_1.fastq extra2=mit_2.fastq ?
or
should I interleaved my paired-end fastq files?
edit:
I interleaved my two fastq files and I executed:
> tadpole.sh in=contigs.fasta out=extended.fa el=1000 er=1000 mode=extend extra=all.mt.interleaved.fq k=62
It gaves me this error:
Exception in thread "main" java.lang.RuntimeException: Can't read file 'all.mt.interleaved.fq'
at shared.Tools.testInputFiles(Tools.java:1121)
at shared.Tools.testInputFiles(Tools.java:1089)
at ukmer.KmerTableSetU.<init>(KmerTableSetU.java:345)
at assemble.Tadpole2.<init>(Tadpole2.java:70)
at assemble.Tadpole.makeTadpole(Tadpole.java:74)
at assemble.Tadpole.main(Tadpole.java:57)
Thank you in advance for you helpLast edited by silverfox; 07-26-2019, 09:55 PM.
Leave a comment:
-
Hello there,
I am running into an error whenever I tried to run Tadpole:
Performing error-correction with Tadpole
java -ea -Xmx117184m -Xms117184m -cp /scratch/shiming/tools/bbmap-v36.92/current/ assemble.Tadpole in1=CFC280618_S3_R1_001.fastq.gz in2=CFC280618_S3_R2_001.fastq.gz out1=tadpole/CFC280618_R1.tadpole.fastq.gz out2=tadpole/CFC280618_R2.tadpole.fastq.gz mode=correct ziplevel=9
Picked up _JAVA_OPTIONS: -Xmx1g
Error occurred during initialization of VM
Initial heap size set to a larger value than the maximum heap size
java -ea -Xmx117184m -Xms117184m -cp /scratch/shiming/tools/bbmap-v36.92/current/ assemble.Tadpole in1=MFC280618_S2_R1_001.fastq.gz in2=MFC280618_S2_R2_001.fastq.gz out1=tadpole/MFC280618_R1.tadpole.fastq.gz out2=tadpole/MFC280618_R2.tadpole.fastq.gz mode=correct ziplevel=9
Picked up _JAVA_OPTIONS: -Xmx1g
Error occurred during initialization of VM
Initial heap size set to a larger value than the maximum heap size
java -ea -Xmx117184m -Xms117184m -cp /scratch/shiming/tools/bbmap-v36.92/current/ assemble.Tadpole in1=SBW280618_S1_R1_001.fastq.gz in2=SBW280618_S1_R2_001.fastq.gz out1=tadpole/SBW280618_R1.tadpole.fastq.gz out2=tadpole/SBW280618_R2.tadpole.fastq.gz mode=correct ziplevel=9
Picked up _JAVA_OPTIONS: -Xmx1g
Error occurred during initialization of VM
Initial heap size set to a larger value than the maximum heap size
(END)
This was the command that I ran with:
echo "Performing error-correction with Tadpole"
echo
for x in *_R1_001.fastq.gz
do tadpole.sh in1=$x in2=${x%_R1_001*}_R2_001.fastq.gz out1=tadpole/${x%_S*_R1_001.*}_R1.tadpole.fastq.gz out2=tadpole/${x%_S*_R1_001.*}_R2.tadpole.fastq.gz mode=correct ziplev$
done
Is there something wrong that I am doing here?
Thanks
Leave a comment:
-
Thanks for your responce to
Originally posted by gringer View PostWow! Joined in 2014, only one post, and you chose to make your first post something about mutant clouds in response to a 3-year old post of mine from the first page of this thread. I'm not sure if I should be pleased, or concerned. If you want a more recent update on that, try here in this thread (where I got a 3.6kb sequence with crazy-high coverage):
Discussion of next-gen sequencing related bioinformatics: resources, algorithms, open source efforts, etc
But to be frank, it's been so long ago that I've forgotten the virus and host that I was looking at. Feel free to expand on what you meant (preferably with a reference to a preprint or other form of publication), but bear in mind that it's probably better in the future to try to respond only to the most recent posts in a thread.
But thank you: You have given me an interesting idea. You have been a mirror for me. While responding to you now I found a solution for myself, So thank you in many countless times
Leave a comment:
-
Wow! Joined in 2014, only one post, and you chose to make your first post something about mutant clouds in response to a 3-year old post of mine from the first page of this thread. I'm not sure if I should be pleased, or concerned. If you want a more recent update on that, try here in this thread (where I got a 3.6kb sequence with crazy-high coverage):
Discussion of next-gen sequencing related bioinformatics: resources, algorithms, open source efforts, etc
But to be frank, it's been so long ago that I've forgotten the virus and host that I was looking at. Feel free to expand on what you meant (preferably with a reference to a preprint or other form of publication), but bear in mind that it's probably better in the future to try to respond only to the most recent posts in a thread.
Leave a comment:
-
Originally posted by gringer View PostFor a first-pass effort, I tried just assembling after only trimming (i.e. no host sequence filtering), working off MiSeq 250bp paired-end data:
Code:tadpole.sh in=trimmed_NZGL01795_both_1P.fq.gz in2=trimmed_NZGL01795_both_2P.fq.gz out=extended.fq mode=extend el=50 er=50 k=31 ecc=t
Leave a comment:
-
RQCFilter Norm and ECC
Hello!
Does RQCFilter has a parameter to normalise and error correct? Or I should do it afterwards?
Leave a comment:
-
Originally posted by susanklein View PostHi,
yes I get your point. I have reduced it now.. Will update progress.
S.
Leave a comment:
-
Originally posted by SNPsaurus View PostDo you need to use k=96? That is a large kmer size and will increase memory demand. A metagenome plus sequence errors will create many 96-mers. What happens with the default k=31... does it run with that? I guess tadpole is estimating memory for k=96 and it should work but I would try a smaller k and see what happens.
yes I get your point. I have reduced it now.. Will update progress.
S.
Leave a comment:
-
Do you need to use k=96? That is a large kmer size and will increase memory demand. A metagenome plus sequence errors will create many 96-mers. What happens with the default k=31... does it run with that? I guess tadpole is estimating memory for k=96 and it should work but I would try a smaller k and see what happens.
Leave a comment:
Latest Articles
Collapse
-
by seqadmin
The field of immunogenetics explores how genetic variations influence immune responses and susceptibility to disease. In a recent SEQanswers webinar, Oscar Rodriguez, Ph.D., Postdoctoral Researcher at the University of Louisville, and Ruben Martínez Barricarte, Ph.D., Assistant Professor of Medicine at Vanderbilt University, shared recent advancements in immunogenetics. This article discusses their research on genetic variation in antibody loci, antibody production processes,...-
Channel: Articles
11-06-2024, 07:24 PM -
ad_right_rmr
Collapse
News
Collapse
Topics | Statistics | Last Post | ||
---|---|---|---|---|
Started by seqadmin, 11-22-2024, 07:36 AM
|
0 responses
61 views
0 likes
|
Last Post
by seqadmin
11-22-2024, 07:36 AM
|
||
Started by seqadmin, 11-22-2024, 07:04 AM
|
0 responses
81 views
0 likes
|
Last Post
by seqadmin
11-22-2024, 07:04 AM
|
||
Started by seqadmin, 11-21-2024, 09:19 AM
|
0 responses
76 views
0 likes
|
Last Post
by seqadmin
11-21-2024, 09:19 AM
|
||
Started by seqadmin, 11-08-2024, 11:09 AM
|
0 responses
320 views
0 likes
|
Last Post
by seqadmin
11-08-2024, 11:09 AM
|
Leave a comment: