SSPACE merge contigs
Hello, rhinoceros!
Could you please post your script or cammand line to merge contigs with SSPACE? I'm doing similar thing. Thanks in advance!
Seqanswers Leaderboard Ad
Collapse
Announcement
Collapse
No announcement yet.
X
-
CAP3 (http://seq.cs.iastate.edu/cap3.html) is also good for this, you can set identity percentage cutoffs and, though it does end-clipping by default I believe this can be turned off with -k 0. It produces useful output showing you exactly where the overlaps are.
Leave a comment:
-
Originally posted by JohnN View PostPut all of your contigs into a single fasta file (or use the SAM or CAF file which an assembler spits out).
Then use the "tg_index" program which comes with the Staden package to make the gap5 database:
$ tg_index -o output_gap5 -f contigs.fasta
(-f for fasta input , -C for CAF input, -s for SAM input, etc. Running "tg_index" w/o params gives the params).
Then start gap5 and load your output_gap5 database.
Click on internal joins, and wait a few seconds/minutes. Click on the area in the internal join window to get to the aligned contigs (if any). In the new edit window, you may need to click on align again to smooth out the bumps.
If the join looks good, join it.
HTH
J
PS I use this method a lot for joining PacBio contigs but I'm a bit paranoid about raggedy ends, and usually reference map an illumina run of the same genome against the pacbio output before I join contigs. But I checked and just using fasta files works.
Ah ha! Thank you!! I got lost in the massive manual pdf for the staden pacakge - found the documentation for doing what you describe on p. 28 Will try this now thanks again.
Leave a comment:
-
Originally posted by ewilbanks View PostHey @JohnN,
I've been looking into using gap5 but can't figure out how to get it to accept / open my existing assembly. Do you have any pointers or suggestions? Links to other documentation? Tried running pre-gap4 but couldn't seem to find right settings for importing an existing assembly...
Thanks for the suggestion!
Then use the "tg_index" program which comes with the Staden package to make the gap5 database:
$ tg_index -o output_gap5 -f contigs.fasta
(-f for fasta input , -C for CAF input, -s for SAM input, etc. Running "tg_index" w/o params gives the params).
Then start gap5 and load your output_gap5 database.
Click on internal joins, and wait a few seconds/minutes. Click on the area in the internal join window to get to the aligned contigs (if any). In the new edit window, you may need to click on align again to smooth out the bumps.
If the join looks good, join it.
HTH
J
PS I use this method a lot for joining PacBio contigs but I'm a bit paranoid about raggedy ends, and usually reference map an illumina run of the same genome against the pacbio output before I join contigs. But I checked and just using fasta files works.Last edited by JohnN; 07-11-2014, 12:03 PM.
Leave a comment:
-
Originally posted by Brian Bushnell View PostLizzy,
Dedupe only discards contigs that are fully contained by other contigs (within the set limits of %id, etc). It will not merge reads or make a consensus, and will not touch contigs that overlap like this:
____AGCTAGCTACGT
AGCTAGCTAGCT____
...where each one goes off the end of the other. Sorry if I gave you false hopes! But generally, I don't really think it's safe to forcibly merge such contigs, as they could represent imperfect repeats in the genome. If the assembler does not merge them, it probably believes they are independent (or, of course, has a bug).
I suggest that you map the raw reads to all of the assembled contigs, using random placement of ambiguously-mapping reads. If these two contigs have roughly half the coverage of other contigs, then they are assembler artifacts and you can manually delete one of them. But if the coverage is similar to other contigs, you should keep both of them.
Leave a comment:
-
Originally posted by JohnN View PostThe Staden package's gap5 is what I use.
I've been looking into using gap5 but can't figure out how to get it to accept / open my existing assembly. Do you have any pointers or suggestions? Links to other documentation? Tried running pre-gap4 but couldn't seem to find right settings for importing an existing assembly...
Thanks for the suggestion!
Leave a comment:
-
Lizzy,
Dedupe only discards contigs that are fully contained by other contigs (within the set limits of %id, etc). It will not merge reads or make a consensus, and will not touch contigs that overlap like this:
____AGCTAGCTACGT
AGCTAGCTAGCT____
...where each one goes off the end of the other. Sorry if I gave you false hopes! But generally, I don't really think it's safe to forcibly merge such contigs, as they could represent imperfect repeats in the genome. If the assembler does not merge them, it probably believes they are independent (or, of course, has a bug).
I suggest that you map the raw reads to all of the assembled contigs, using random placement of ambiguously-mapping reads. If these two contigs have roughly half the coverage of other contigs, then they are assembler artifacts and you can manually delete one of them. But if the coverage is similar to other contigs, you should keep both of them.
Leave a comment:
-
Fasta file of contigs I'm trying to merge is here in case that's helpful...
Leave a comment:
-
Hey Brian,
I tried testing dedupe.sh on two contigs that I checked out with nucmer first (see below). I've tried playing with the settings (overlap length, maxedits, etc...) but I can't seem to convince it to merge them. Any thoughts? I attached the align output as well if that's helpful.
Thanks!!
Lizzy
NUCMER
[S1] [E1] | [S2] [E2] | [LEN 1] [LEN 2] | [% IDY] | [LEN R] [LEN Q] | [COV R] [COV Q] | [TAGS]
===============================================================================================================================
2121976 2128241 | 1 6272 | 6266 6272 | 99.11 | 2128241 1076355 | 0.29 0.58 | unitig_109 unitig_111Attached Files
Leave a comment:
-
Lizzy,
I wrote a tool, Dedupe, specifically for this purpose. We run it before Minimus to prevent it from taking forever and/or crashing (sometimes it does anyway, so we just use the output of Dedupe). It does NOT merge overlaps in which both extend beyond the other's end (because that could introduce misassemblies at branches), but it does merge overlaps in which one is fully contained by the other (or have identical boundaries). It's extremely fast, and you can specify thresholds for minimum overlap, minimum identity, and maximum edit distance for merging, and various other options (run the shellscript without arguments for more details).
There's no thread for it, so I'll start one tomorrow. The command line in your case would be something like this:
dedupe.sh -Xmx28g in=contigs.fa out=clean.fa minoverlap=4000 maxedits=50 minidentity=99
If you want to allow indels (and with PacBio data, you should), then you must set "maxedits" as that defines the length of a buffers used for alignment. "minidentity" is optional.
The -Xmx flag should be set to about 85% of physical memory (though it's not strictly required, as it is normally autodetected. Microbial assemblies need under 1g).
-Brian
Leave a comment:
-
Merging contigs
Hi folks,
I'm parsing through the results of a pacbio assembly and have found places where two contigs could be merged into one - (>99%id, ~5k overlap) - but weren't by the assembler for whatever reason. I'd like to merge these but can't seem to find the right tool. Anyone have suggestions?
I've tried nucmer-based minimus2 but its crashing when I just feed it these two sequences (not sure why) and phrap is just hanging.
Anyone else have ideas?
Thanks!
LizzyTags: None
Latest Articles
Collapse
-
by seqadmin
The field of immunogenetics explores how genetic variations influence immune responses and susceptibility to disease. In a recent SEQanswers webinar, Oscar Rodriguez, Ph.D., Postdoctoral Researcher at the University of Louisville, and Ruben Martínez Barricarte, Ph.D., Assistant Professor of Medicine at Vanderbilt University, shared recent advancements in immunogenetics. This article discusses their research on genetic variation in antibody loci, antibody production processes,...-
Channel: Articles
11-06-2024, 07:24 PM -
-
by seqadmin
Next-generation sequencing (NGS) and quantitative polymerase chain reaction (qPCR) are essential techniques for investigating the genome, transcriptome, and epigenome. In many cases, choosing the appropriate technique is straightforward, but in others, it can be more challenging to determine the most effective option. A simple distinction is that smaller, more focused projects are typically better suited for qPCR, while larger, more complex datasets benefit from NGS. However,...-
Channel: Articles
10-18-2024, 07:11 AM -
ad_right_rmr
Collapse
News
Collapse
Topics | Statistics | Last Post | ||
---|---|---|---|---|
Started by seqadmin, 11-08-2024, 11:09 AM
|
0 responses
166 views
0 likes
|
Last Post
by seqadmin
11-08-2024, 11:09 AM
|
||
Started by seqadmin, 11-08-2024, 06:13 AM
|
0 responses
131 views
0 likes
|
Last Post
by seqadmin
11-08-2024, 06:13 AM
|
||
Started by seqadmin, 11-01-2024, 06:09 AM
|
0 responses
77 views
0 likes
|
Last Post
by seqadmin
11-01-2024, 06:09 AM
|
||
New Model Aims to Explain Polygenic Diseases by Connecting Genomic Mutations and Regulatory Networks
by seqadmin
Started by seqadmin, 10-30-2024, 05:31 AM
|
0 responses
26 views
0 likes
|
Last Post
by seqadmin
10-30-2024, 05:31 AM
|
Leave a comment: