Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • yifangt
    replied
    SSPACE merge contigs

    Hello, rhinoceros!
    Could you please post your script or cammand line to merge contigs with SSPACE? I'm doing similar thing. Thanks in advance!

    Leave a comment:


  • dgscofield
    replied
    CAP3 (http://seq.cs.iastate.edu/cap3.html) is also good for this, you can set identity percentage cutoffs and, though it does end-clipping by default I believe this can be turned off with -k 0. It produces useful output showing you exactly where the overlaps are.

    Leave a comment:


  • ewilbanks
    replied
    Originally posted by JohnN View Post
    Put all of your contigs into a single fasta file (or use the SAM or CAF file which an assembler spits out).

    Then use the "tg_index" program which comes with the Staden package to make the gap5 database:

    $ tg_index -o output_gap5 -f contigs.fasta

    (-f for fasta input , -C for CAF input, -s for SAM input, etc. Running "tg_index" w/o params gives the params).

    Then start gap5 and load your output_gap5 database.

    Click on internal joins, and wait a few seconds/minutes. Click on the area in the internal join window to get to the aligned contigs (if any). In the new edit window, you may need to click on align again to smooth out the bumps.

    If the join looks good, join it.

    HTH
    J

    PS I use this method a lot for joining PacBio contigs but I'm a bit paranoid about raggedy ends, and usually reference map an illumina run of the same genome against the pacbio output before I join contigs. But I checked and just using fasta files works.

    Ah ha! Thank you!! I got lost in the massive manual pdf for the staden pacakge - found the documentation for doing what you describe on p. 28 Will try this now thanks again.

    Leave a comment:


  • JohnN
    replied
    Originally posted by ewilbanks View Post
    Hey @JohnN,

    I've been looking into using gap5 but can't figure out how to get it to accept / open my existing assembly. Do you have any pointers or suggestions? Links to other documentation? Tried running pre-gap4 but couldn't seem to find right settings for importing an existing assembly...

    Thanks for the suggestion!
    Put all of your contigs into a single fasta file (or use the SAM or CAF file which an assembler spits out).

    Then use the "tg_index" program which comes with the Staden package to make the gap5 database:

    $ tg_index -o output_gap5 -f contigs.fasta

    (-f for fasta input , -C for CAF input, -s for SAM input, etc. Running "tg_index" w/o params gives the params).

    Then start gap5 and load your output_gap5 database.

    Click on internal joins, and wait a few seconds/minutes. Click on the area in the internal join window to get to the aligned contigs (if any). In the new edit window, you may need to click on align again to smooth out the bumps.

    If the join looks good, join it.

    HTH
    J

    PS I use this method a lot for joining PacBio contigs but I'm a bit paranoid about raggedy ends, and usually reference map an illumina run of the same genome against the pacbio output before I join contigs. But I checked and just using fasta files works.
    Last edited by JohnN; 07-11-2014, 12:03 PM.

    Leave a comment:


  • ewilbanks
    replied
    Originally posted by Brian Bushnell View Post
    Lizzy,

    Dedupe only discards contigs that are fully contained by other contigs (within the set limits of %id, etc). It will not merge reads or make a consensus, and will not touch contigs that overlap like this:

    ____AGCTAGCTACGT
    AGCTAGCTAGCT____

    ...where each one goes off the end of the other. Sorry if I gave you false hopes! But generally, I don't really think it's safe to forcibly merge such contigs, as they could represent imperfect repeats in the genome. If the assembler does not merge them, it probably believes they are independent (or, of course, has a bug).

    I suggest that you map the raw reads to all of the assembled contigs, using random placement of ambiguously-mapping reads. If these two contigs have roughly half the coverage of other contigs, then they are assembler artifacts and you can manually delete one of them. But if the coverage is similar to other contigs, you should keep both of them.
    Ah OK, thanks anyways! Yeah, I've looked into these contig regions and see a drop in coverage, and no other competing overlaps with other contigs in the asm. Most are in regions w/ mobile genetic elements, and this is population consensus (metagenomic) sample, so some inaccuracy is OK with me. Just want to make a consensus in one piece for a reference - looks do-able and reasonable, just looking for the first program to jam them together.... (hopefully avoiding doing it manually!!!)

    Leave a comment:


  • ewilbanks
    replied
    Originally posted by JohnN View Post
    The Staden package's gap5 is what I use.
    Hey @JohnN,

    I've been looking into using gap5 but can't figure out how to get it to accept / open my existing assembly. Do you have any pointers or suggestions? Links to other documentation? Tried running pre-gap4 but couldn't seem to find right settings for importing an existing assembly...

    Thanks for the suggestion!

    Leave a comment:


  • Brian Bushnell
    replied
    Lizzy,

    Dedupe only discards contigs that are fully contained by other contigs (within the set limits of %id, etc). It will not merge reads or make a consensus, and will not touch contigs that overlap like this:

    ____AGCTAGCTACGT
    AGCTAGCTAGCT____

    ...where each one goes off the end of the other. Sorry if I gave you false hopes! But generally, I don't really think it's safe to forcibly merge such contigs, as they could represent imperfect repeats in the genome. If the assembler does not merge them, it probably believes they are independent (or, of course, has a bug).

    I suggest that you map the raw reads to all of the assembled contigs, using random placement of ambiguously-mapping reads. If these two contigs have roughly half the coverage of other contigs, then they are assembler artifacts and you can manually delete one of them. But if the coverage is similar to other contigs, you should keep both of them.

    Leave a comment:


  • ewilbanks
    replied
    Fasta file of contigs I'm trying to merge is here in case that's helpful...

    Leave a comment:


  • ewilbanks
    replied
    Hey Brian,

    I tried testing dedupe.sh on two contigs that I checked out with nucmer first (see below). I've tried playing with the settings (overlap length, maxedits, etc...) but I can't seem to convince it to merge them. Any thoughts? I attached the align output as well if that's helpful.

    Thanks!!
    Lizzy




    NUCMER

    [S1] [E1] | [S2] [E2] | [LEN 1] [LEN 2] | [% IDY] | [LEN R] [LEN Q] | [COV R] [COV Q] | [TAGS]
    ===============================================================================================================================
    2121976 2128241 | 1 6272 | 6266 6272 | 99.11 | 2128241 1076355 | 0.29 0.58 | unitig_109 unitig_111
    Attached Files

    Leave a comment:


  • rhinoceros
    replied
    I've merged contigs successfully with SSPACE..

    Leave a comment:


  • ewilbanks
    replied
    Sweet, thanks! I'll give this a whirl tomorrow.

    Leave a comment:


  • Brian Bushnell
    replied
    Lizzy,

    I wrote a tool, Dedupe, specifically for this purpose. We run it before Minimus to prevent it from taking forever and/or crashing (sometimes it does anyway, so we just use the output of Dedupe). It does NOT merge overlaps in which both extend beyond the other's end (because that could introduce misassemblies at branches), but it does merge overlaps in which one is fully contained by the other (or have identical boundaries). It's extremely fast, and you can specify thresholds for minimum overlap, minimum identity, and maximum edit distance for merging, and various other options (run the shellscript without arguments for more details).

    There's no thread for it, so I'll start one tomorrow. The command line in your case would be something like this:

    dedupe.sh -Xmx28g in=contigs.fa out=clean.fa minoverlap=4000 maxedits=50 minidentity=99

    If you want to allow indels (and with PacBio data, you should), then you must set "maxedits" as that defines the length of a buffers used for alignment. "minidentity" is optional.

    The -Xmx flag should be set to about 85% of physical memory (though it's not strictly required, as it is normally autodetected. Microbial assemblies need under 1g).

    -Brian

    Leave a comment:


  • JohnN
    replied
    The Staden package's gap5 is what I use.

    Leave a comment:


  • ewilbanks
    started a topic Merging contigs

    Merging contigs

    Hi folks,

    I'm parsing through the results of a pacbio assembly and have found places where two contigs could be merged into one - (>99%id, ~5k overlap) - but weren't by the assembler for whatever reason. I'd like to merge these but can't seem to find the right tool. Anyone have suggestions?

    I've tried nucmer-based minimus2 but its crashing when I just feed it these two sequences (not sure why) and phrap is just hanging.

    Anyone else have ideas?

    Thanks!
    Lizzy

Latest Articles

Collapse

  • seqadmin
    Genetic Variation in Immunogenetics and Antibody Diversity
    by seqadmin



    The field of immunogenetics explores how genetic variations influence immune responses and susceptibility to disease. In a recent SEQanswers webinar, Oscar Rodriguez, Ph.D., Postdoctoral Researcher at the University of Louisville, and Ruben Martínez Barricarte, Ph.D., Assistant Professor of Medicine at Vanderbilt University, shared recent advancements in immunogenetics. This article discusses their research on genetic variation in antibody loci, antibody production processes,...
    11-06-2024, 07:24 PM
  • seqadmin
    Choosing Between NGS and qPCR
    by seqadmin



    Next-generation sequencing (NGS) and quantitative polymerase chain reaction (qPCR) are essential techniques for investigating the genome, transcriptome, and epigenome. In many cases, choosing the appropriate technique is straightforward, but in others, it can be more challenging to determine the most effective option. A simple distinction is that smaller, more focused projects are typically better suited for qPCR, while larger, more complex datasets benefit from NGS. However,...
    10-18-2024, 07:11 AM

ad_right_rmr

Collapse

News

Collapse

Topics Statistics Last Post
Started by seqadmin, 11-08-2024, 11:09 AM
0 responses
166 views
0 likes
Last Post seqadmin  
Started by seqadmin, 11-08-2024, 06:13 AM
0 responses
131 views
0 likes
Last Post seqadmin  
Started by seqadmin, 11-01-2024, 06:09 AM
0 responses
77 views
0 likes
Last Post seqadmin  
Started by seqadmin, 10-30-2024, 05:31 AM
0 responses
26 views
0 likes
Last Post seqadmin  
Working...
X