Hi Luc,
Tadpole does not replace BBMerge, though I plan to further integrate them in the future - mainly, so that BBMerge can first attempt to merge, then if unsuccessful extend the reads using Tadpole, then attempt to merge again, and if still unsuccessful, undo all the changes.
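To make the planned merge, extend, merge again, undo flow concrete, here is a minimal Python sketch. It is not BBMerge or Tadpole code: `merge_with_fallback`, `toy_merge`, and the `extend_pair` callable are hypothetical names, and the toy merger assumes both reads are written on the same strand and overlap exactly.

```python
from typing import Callable, Optional, Tuple

Pair = Tuple[str, str]


def toy_merge(a: str, b: str, min_overlap: int = 4) -> Optional[str]:
    """Toy merger (assumption): exact suffix/prefix overlap, same strand."""
    for n in range(min(len(a), len(b)), min_overlap - 1, -1):
        if a[-n:] == b[:n]:
            return a + b[n:]
    return None


def merge_with_fallback(
    pair: Pair,
    try_merge: Callable[[str, str], Optional[str]],
    extend_pair: Callable[[Pair], Pair],
) -> Tuple[Optional[str], Pair]:
    """Return (merged read or None, original pair).

    The original pair is always returned unchanged, so a failed second
    attempt "undoes" the extension simply by discarding it.
    """
    r1, r2 = pair
    merged = try_merge(r1, r2)          # first attempt on the raw reads
    if merged is not None:
        return merged, pair
    e1, e2 = extend_pair(pair)          # extend, then try again
    merged = try_merge(e1, e2)
    if merged is not None:
        return merged, pair
    return None, pair                   # still unmerged: keep the originals
```

Keeping the input pair immutable is one simple way to get the "undo all the changes" behaviour; a real implementation may track this differently.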
The main reason I made Tadpole is that I need to quantify the insert size of libraries to determine whether they are acceptable. For example, if a project needs a 2x150bp library with a 500bp insert size for an unknown organism, how do you determine whether it passes? BBMerge only works when the reads are largely overlapping. So, I wrote Tadpole to extend the right end of non-overlapping reads so that they will overlap and can be merged. So far, it works really well on 2x150bp single-cell data with a 350bp insert size, but I have not tested it further than that.
Tadpole is a complete (and very fast) assembler; you can run "tadpole.sh in=reads.fq out=contigs.fa" and it will give you a conservative assembly with a low error rate. The main drawback is that the max kmer length is currently 31, so the continuity is poor. I'm evaluating it for making a quick assembly to map reads against and recalibrate their quality, prior to feeding them to a more sophisticated assembler; for recalibration, a low error rate and a low misassembly rate are more important than continuity.
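The arithmetic behind this can be sketched in a few lines of Python. This is only an illustration: the function names are made up, the `min_overlap` of 12 is an arbitrary assumption rather than a BBMerge setting, and it assumes both reads of a pair are extended by the same amount toward each other.

```python
def insert_size_from_overlap(len1: int, len2: int, overlap: int) -> int:
    """Insert size of a merged pair: the overlapping bases are counted once."""
    return len1 + len2 - overlap


def min_extension_per_read(insert: int, read_len: int, min_overlap: int) -> int:
    """Bases each read must gain so the pair overlaps by at least min_overlap.

    Assumes both reads are extended equally toward each other.
    """
    gap = insert - 2 * read_len         # unsequenced middle; negative if overlapping
    needed = gap + min_overlap          # total bases the two reads must gain
    if needed <= 0:
        return 0                        # the reads already overlap enough
    return -(-needed // 2)              # ceiling division, split across both reads


print(insert_size_from_overlap(150, 150, 20))   # 280
print(min_extension_per_read(350, 150, 12))     # 31
print(min_extension_per_read(500, 150, 12))     # 106
```

So under these assumptions a 2x150bp pair with a 350bp insert needs only a modest extension, while a 500bp insert needs over 100bp per read.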
For extending reads, the command would be:
tadpole.sh in=reads.fq extend=reads.fq oute=extended.fq mode=extend extendleft=100 extendright=100
That will extend the reads by up to 100bp in each direction, stopping early if a branch is hit. For extending paired reads so that they overlap, only “extendright” is needed, so “extendleft” should be set to zero.
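To illustrate what "stopping early if a branch is hit" means, here is a toy Python sketch of kmer-based right extension. It is not Tadpole's implementation: it ignores reverse complements, kmer counts, and error handling, and the names `build_successors` and `extend_right` are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set


def build_successors(reads: Iterable[str], k: int) -> Dict[str, Set[str]]:
    """Map each kmer to the set of bases seen immediately after it."""
    succ: Dict[str, Set[str]] = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k):
            succ[read[i:i + k]].add(read[i + k])
    return succ


def extend_right(read: str, succ: Dict[str, Set[str]], k: int, max_ext: int) -> str:
    """Append bases while the last kmer has exactly one possible next base."""
    out = read
    for _ in range(max_ext):
        nxt = succ.get(out[-k:], ())
        if len(nxt) != 1:               # dead end (0) or branch (>1): stop early
            break
        out += next(iter(nxt))
    return out


reads = ["AACCGGT", "CGGTTAG", "TTAGA", "TTAGC"]
succ = build_successors(reads, k=3)
print(extend_right("AACCG", succ, k=3, max_ext=100))  # AACCGGTTAG (TAG branches to A or C)
print(extend_right("AACCG", succ, k=3, max_ext=2))    # AACCGGT (limit reached first)
```

The `max_ext` cap plays the same role as the 100 in the command above: it bounds the extension even when no branch is encountered.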
BBMap (aligner for DNA/RNAseq) is now open-source and available for download.
-
Hi Brian,
What are the applications/assembly operations for which tadpole.sh is designed? Does it partly replace BBMerge?
Thanks!
-
Oh, yes, that sounds perfectly fine. You may want to adapter-trim the unknown reads first (treating them as singletons) by specifying the junction-adapter as the adapter sequence, like this:
bbduk.sh in=unknown.fq int=f out=trimmed.fq adapter=CTGTCTCTTATACACATCTAGATGTGTATAAGAGACAG k=21 mink=5 ktrim=r hdist=1
That will ensure you trim junction adapter from the end of the read, in case it was present but not detected because it was just a few bases. Otherwise the unknown bin may be enriched for reads with junction adapter at the end that was just too short to be positively identified. Technically it could be present on the left side, as well. Making a base-frequency histogram of the unknown bin might be useful; I have not done that.
Last edited by Brian Bushnell; 06-01-2015, 10:10 AM.
-
Originally posted by Brian Bushnell
You can... whether that's a good idea depends on the assembler. Some assemblers probably use LMP reads for kmers just like the paired-end reads. Particularly if you don't have enough coverage of PE reads, using the LMP reads that way is probably fine, though they will probably have a more biased coverage distribution than the PE reads, which could interfere with the heuristics of some assemblers.
Virtually all of the unknown-binned reads are non-overlapping LMP reads that you can't assemble into single reads (in the data I have examined). They end up in the unknown bin because the junction adapter was not visible in either read. But that usually means that the junctions were in the unsequenced portion between the two reads. This really depends on your insert size distribution (the physical insert size of the sequenced fragments, not the insert size of the long transposased pieces). If you fragment to substantially longer than 2x read length, a lot of LMP pairs will end up in the unknown bin because the junction is in the unsequenced middle. If you fragment to a shorter insert size, such that most of your pairs overlap, the unknown bin would consist more of PE reads.
The latest version of splitnextera has an option to attempt to merge the reads by overlap before looking for the junction. That way, it is better able to determine whether a pair belongs in the unknown bin - if they overlap, and do not contain a junction, they go to singleton rather than unknown; if they do contain a junction, they go to LMP; so the only pairs that end up in unknown are the ones that don't overlap AND don't have a junction adapter. So, fewer reads end up in unknown; but it will only be useful on libraries that have overlapping fragments.
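The binning rule described in that paragraph can be written down as a tiny decision function. This Python sketch covers only the three outcomes mentioned there; `classify_pair` is a hypothetical name, and the real tool may use more bins and more checks than this.

```python
def classify_pair(has_junction: bool, overlaps: bool) -> str:
    """Simplified bin assignment after an overlap-merge attempt."""
    if has_junction:
        return "LMP"            # junction adapter found: long mate pair
    if overlaps:
        return "singleton"      # merged by overlap and no junction inside
    return "unknown"            # no overlap AND no visible junction


for junction in (True, False):
    for overlap in (True, False):
        print(junction, overlap, classify_pair(junction, overlap))
```

Only the last case, no overlap and no junction, lands in unknown, which is why the option shrinks that bin only for libraries whose fragments are short enough to overlap.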
Thanks for your reply!
I understand your points about the differences between "unknown" and "single-end" reads. However, I am not going to merge the "unknown" pairs into one single read, but rather treat each pair as two single reads, and only for the contig assembly step. By doing this, we neither throw the "unknown" reads away nor rely on their mate-pair information for scaffolding. Does that seem valid to you?
-
Originally posted by blsfoxfox
Hi Brian,
These days I am thinking about several assembly questions related to what you said above.
1) After removing those "pair-end" and "single-end" contamination, why can't we reverse_complement those LMP reads and treat them as pair-end reads for the contig assembly stage? Maybe due to the larger variance of the LMP insert sizes?
2) For those "unknown" reads, I think it is also helpful to assemble them as single ends, right?
The latest version of splitnextera has an option to attempt to merge the reads by overlap before looking for the junction. That way, it is better able to determine whether a pair belongs in the unknown bin - if they overlap, and do not contain a junction, they go to singleton rather than unknown; if they do contain a junction, they go to LMP; so the only pairs that end up in unknown are the ones that don't overlap AND don't have a junction adapter. So, fewer reads end up in unknown; but it will only be useful on libraries that have overlapping fragments.
Last edited by Brian Bushnell; 06-01-2015, 09:18 AM.