Unconfigured Ad

**JohnN** · 07-07-2014, 04:13 PM

The Staden package's gap5 is what I use.

**Brian Bushnell** · 07-07-2014, 06:13 PM

Lizzy,

I wrote a tool, Dedupe, specifically for this purpose. We run it before Minimus to prevent it from taking forever and/or crashing (sometimes it does anyway, so we just use the output of Dedupe). It does NOT merge overlaps in which both extend beyond the other's end (because that could introduce misassemblies at branches), but it does merge overlaps in which one is fully contained by the other (or have identical boundaries). It's extremely fast, and you can specify thresholds for minimum overlap, minimum identity, and maximum edit distance for merging, and various other options (run the shellscript without arguments for more details).

There's no thread for it, so I'll start one tomorrow. The command line in your case would be something like this:

dedupe.sh -Xmx28g in=contigs.fa out=clean.fa minoverlap=4000 maxedits=50 minidentity=99

If you want to allow indels (and with PacBio data, you should), then you must set "maxedits" as that defines the length of a buffers used for alignment. "minidentity" is optional.

The -Xmx flag should be set to about 85% of physical memory (though it's not strictly required, as it is normally autodetected. Microbial assemblies need under 1g).

-Brian

**ewilbanks** · 07-07-2014, 09:18 PM

Sweet, thanks! I'll give this a whirl tomorrow.

**rhinoceros** · 07-07-2014, 09:49 PM

I've merged contigs successfully with SSPACE..

**ewilbanks** · 07-08-2014, 08:51 AM

Hey Brian,

I tried testing dedupe.sh on two contigs that I checked out with nucmer first (see below). I've tried playing with the settings (overlap length, maxedits, etc...) but I can't seem to convince it to merge them. Any thoughts? I attached the align output as well if that's helpful.

Thanks!!
Lizzy

NUCMER

[S1] [E1] | [S2] [E2] | [LEN 1] [LEN 2] | [% IDY] | [LEN R] [LEN Q] | [COV R] [COV Q] | [TAGS]
===============================================================================================================================
2121976 2128241 | 1 6272 | 6266 6272 | 99.11 | 2128241 1076355 | 0.29 0.58 | unitig_109 unitig_111

Attached Files

unitig109_unitig111.align.txt.gz (4.4 KB, 133 views)

**ewilbanks** · 07-08-2014, 08:58 AM

Fasta file of contigs I'm trying to merge is here in case that's helpful...

Dropbox - 404

https://dl.dropboxusercontent.com/u/2796343/unitig109_unitig111.fasta.gz

**Brian Bushnell** · 07-08-2014, 09:10 AM

Lizzy,

Dedupe only discards contigs that are fully contained by other contigs (within the set limits of %id, etc). It will not merge reads or make a consensus, and will not touch contigs that overlap like this:

____AGCTAGCTACGT
AGCTAGCTAGCT____

...where each one goes off the end of the other. Sorry if I gave you false hopes! But generally, I don't really think it's safe to forcibly merge such contigs, as they could represent imperfect repeats in the genome. If the assembler does not merge them, it probably believes they are independent (or, of course, has a bug).

I suggest that you map the raw reads to all of the assembled contigs, using random placement of ambiguously-mapping reads. If these two contigs have roughly half the coverage of other contigs, then they are assembler artifacts and you can manually delete one of them. But if the coverage is similar to other contigs, you should keep both of them.

**ewilbanks** · 07-10-2014, 02:26 PM

Originally posted by JohnN View Post

The Staden package's gap5 is what I use.

Hey @JohnN,

I've been looking into using gap5 but can't figure out how to get it to accept / open my existing assembly. Do you have any pointers or suggestions? Links to other documentation? Tried running pre-gap4 but couldn't seem to find right settings for importing an existing assembly...

Thanks for the suggestion!

**ewilbanks** · 07-10-2014, 02:28 PM

Originally posted by Brian Bushnell View Post

Lizzy,

Dedupe only discards contigs that are fully contained by other contigs (within the set limits of %id, etc). It will not merge reads or make a consensus, and will not touch contigs that overlap like this:

____AGCTAGCTACGT
AGCTAGCTAGCT____

...where each one goes off the end of the other. Sorry if I gave you false hopes! But generally, I don't really think it's safe to forcibly merge such contigs, as they could represent imperfect repeats in the genome. If the assembler does not merge them, it probably believes they are independent (or, of course, has a bug).

I suggest that you map the raw reads to all of the assembled contigs, using random placement of ambiguously-mapping reads. If these two contigs have roughly half the coverage of other contigs, then they are assembler artifacts and you can manually delete one of them. But if the coverage is similar to other contigs, you should keep both of them.

Ah OK, thanks anyways! Yeah, I've looked into these contig regions and see a drop in coverage, and no other competing overlaps with other contigs in the asm. Most are in regions w/ mobile genetic elements, and this is population consensus (metagenomic) sample, so some inaccuracy is OK with me. Just want to make a consensus in one piece for a reference - looks do-able and reasonable, just looking for the first program to jam them together.... (hopefully avoiding doing it manually!!!)

**JohnN** · 07-11-2014, 12:00 PM

Originally posted by ewilbanks View Post

Hey @JohnN,

I've been looking into using gap5 but can't figure out how to get it to accept / open my existing assembly. Do you have any pointers or suggestions? Links to other documentation? Tried running pre-gap4 but couldn't seem to find right settings for importing an existing assembly...

Thanks for the suggestion!

Put all of your contigs into a single fasta file (or use the SAM or CAF file which an assembler spits out).

Then use the "tg_index" program which comes with the Staden package to make the gap5 database:

$ tg_index -o output_gap5 -f contigs.fasta

(-f for fasta input , -C for CAF input, -s for SAM input, etc. Running "tg_index" w/o params gives the params).

Then start gap5 and load your output_gap5 database.

Click on internal joins, and wait a few seconds/minutes. Click on the area in the internal join window to get to the aligned contigs (if any). In the new edit window, you may need to click on align again to smooth out the bumps.

If the join looks good, join it.

HTH
J

PS I use this method a lot for joining PacBio contigs but I'm a bit paranoid about raggedy ends, and usually reference map an illumina run of the same genome against the pacbio output before I join contigs. But I checked and just using fasta files works.

**ewilbanks** · 07-11-2014, 12:22 PM

Originally posted by JohnN View Post

Put all of your contigs into a single fasta file (or use the SAM or CAF file which an assembler spits out).

Then use the "tg_index" program which comes with the Staden package to make the gap5 database:

$ tg_index -o output_gap5 -f contigs.fasta

(-f for fasta input , -C for CAF input, -s for SAM input, etc. Running "tg_index" w/o params gives the params).

Then start gap5 and load your output_gap5 database.

Click on internal joins, and wait a few seconds/minutes. Click on the area in the internal join window to get to the aligned contigs (if any). In the new edit window, you may need to click on align again to smooth out the bumps.

If the join looks good, join it.

HTH
J

PS I use this method a lot for joining PacBio contigs but I'm a bit paranoid about raggedy ends, and usually reference map an illumina run of the same genome against the pacbio output before I join contigs. But I checked and just using fasta files works.

Ah ha! Thank you!! I got lost in the massive manual pdf for the staden pacakge - found the documentation for doing what you describe on p. 28 Will try this now

thanks again.

**dgscofield** · 07-11-2014, 03:32 PM

CAP3 (http://seq.cs.iastate.edu/cap3.html) is also good for this, you can set identity percentage cutoffs and, though it does end-clipping by default I believe this can be turned off with -k 0. It produces useful output showing you exactly where the overlaps are.

**yifangt** · 12-12-2014, 09:40 AM

SSPACE merge contigs

Hello, rhinoceros!
Could you please post your script or cammand line to merge contigs with SSPACE? I'm doing similar thing. Thanks in advance!

Topics	Statistics	Last Post
Engineered Protein Motor Takes Its First Steps Along DNA Track by SEQadmin2 Started by SEQadmin2, Yesterday, 11:05 AM	0 responses 7 views 0 reactions	Last Post by SEQadmin2 Yesterday, 11:05 AM
High-Resolution Sequencing Exposes Hidden Toxoplasma Diversity by SEQadmin2 Started by SEQadmin2, 07-02-2026, 11:08 AM	0 responses 28 views 0 reactions	Last Post by SEQadmin2 07-02-2026, 11:08 AM
New AI Model Captures Long-Range Genomic Signals to Improve RNA Splice Site Prediction by SEQadmin2 Started by SEQadmin2, 06-30-2026, 05:37 AM	0 responses 28 views 0 reactions	Last Post by SEQadmin2 06-30-2026, 05:37 AM
Large-Scale Protein Screen Uncovers Hidden Regulators of Alternative Polyadenylation by SEQadmin2 Started by SEQadmin2, 06-26-2026, 11:10 AM	0 responses 27 views 0 reactions	Last Post by SEQadmin2 06-26-2026, 11:10 AM

Unconfigured Ad

Merging contigs

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Latest Articles

ad_right_rmr

News