Hey there,
I have been using tadpole for error correction (and BBtools in general) and I am extremely happy with its results and performance. Much appreciated!
I am looking for support/advice on what seems a relatively simple thing to do, but which I can't seem to solve in a simple manner: is it possible to run the tadpole.sh family of commands on subsets of a given input file?
Contrary to the problem one has when assembling genomes (millions of small partitions of a single large object), the fields of single-cell genomics and single-molecule sequencing present a different paradigm: one has thousands of smaller objects (cells or molecules) scattered around the reads, usually determined by the inclusion of unique molecular identifiers (UMI).
It would be fantastic to be able to run the tadpole.sh suite of algorithms on this different type of problem. Two applications for which I have successfully used tadpole with this mindset are 1) the assembly of randomly fragmented mRNA libraries using UMIs as a handle, basically making virtual long reads out of Illumina experiments, and 2) the correction of clouds of reads from amplicons (again, held together by a shared UMI) in order to detect SNPs and/or indels.
The challenge I face now is scalability. Even though splitting the reads into thousands of small sets, each written to its own fastq file and passed to tadpole, works well in principle, in practice it is a big challenge for even mid-sized data sets. The IO overhead of this approach results in week-long runtimes, whereas running tadpole on the whole fastq takes less than a minute (using 100 cores) ...
From a naive point of view, it feels as if an option to subset the input file (by regex or by a list of items, to name a couple) would solve this scalability problem, since iterating through a single file thousands of times sounds more efficient (or at least logistically simpler) than generating thousands of little mini jobs. I wonder if you have any suggestions on how to tackle this situation. Another idea is to cat fastq | grep UMI | tadpole ... but I am not sure whether one can pipe input to tadpole.sh (my guess, based on the documentation, is no).
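To make the grouping idea concrete, here is a minimal Python sketch of what I mean by collecting the reads of each UMI in a single pass over the file, rather than scanning the file once per UMI. It is not part of BBTools and does not call tadpole at all. The helper names (`read_fastq`, `umi_of`, `group_by_umi`) are hypothetical, and it assumes, purely for illustration, that the UMI is the text after the last underscore in the read name.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple

# One FASTQ record: (header, sequence, separator, quality)
FastqRecord = Tuple[str, str, str, str]


def read_fastq(lines: Iterable[str]) -> Iterator[FastqRecord]:
    """Yield records from strict 4-line FASTQ input (no wrapped sequences)."""
    it = iter(lines)
    # zip over the same iterator consumes four lines per record;
    # a trailing incomplete record is silently dropped.
    for header, seq, plus, qual in zip(it, it, it, it):
        yield (header.rstrip("\n"), seq.rstrip("\n"),
               plus.rstrip("\n"), qual.rstrip("\n"))


def umi_of(header: str) -> str:
    """Extract the UMI from a read header.

    Assumption for this sketch: the UMI is the text after the last
    underscore of the read name, e.g. '@read1_AAAC' -> 'AAAC'.
    """
    read_name = header.split()[0]
    return read_name.rsplit("_", 1)[-1]


def group_by_umi(records: Iterable[FastqRecord]) -> Dict[str, List[FastqRecord]]:
    """Collect all records sharing a UMI in one pass over the input."""
    groups: Dict[str, List[FastqRecord]] = defaultdict(list)
    for record in records:
        groups[umi_of(record[0])].append(record)
    return dict(groups)


# Tiny in-memory example standing in for a real fastq file.
demo = [
    "@read1_AAAC\n", "ACGT\n", "+\n", "IIII\n",
    "@read2_GGTT\n", "TTGA\n", "+\n", "IIII\n",
    "@read3_AAAC\n", "ACGA\n", "+\n", "IIII\n",
]
groups = group_by_umi(read_fastq(demo))
for umi, recs in groups.items():
    print(umi, len(recs))
```

This keeps every read in memory, so it only illustrates the grouping step; for large data sets, sorting the file by UMI first would probably be the more realistic route. How each group would then be handed to tadpole without writing thousands of small files is exactly the part I am unsure about.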
Thanks again for the great work here!