Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • Merging contigs

    Hi folks,

    I'm parsing through the results of a pacbio assembly and have found places where two contigs could be merged into one - (>99%id, ~5k overlap) - but weren't by the assembler for whatever reason. I'd like to merge these but can't seem to find the right tool. Anyone have suggestions?

    I've tried nucmer-based minimus2 but its crashing when I just feed it these two sequences (not sure why) and phrap is just hanging.

    Anyone else have ideas?

    Thanks!
    Lizzy

  • #2
    The Staden package's gap5 is what I use.

    Comment


    • #3
      Lizzy,

      I wrote a tool, Dedupe, specifically for this purpose. We run it before Minimus to prevent it from taking forever and/or crashing (sometimes it does anyway, so we just use the output of Dedupe). It does NOT merge overlaps in which both extend beyond the other's end (because that could introduce misassemblies at branches), but it does merge overlaps in which one is fully contained by the other (or have identical boundaries). It's extremely fast, and you can specify thresholds for minimum overlap, minimum identity, and maximum edit distance for merging, and various other options (run the shellscript without arguments for more details).

      There's no thread for it, so I'll start one tomorrow. The command line in your case would be something like this:

      dedupe.sh -Xmx28g in=contigs.fa out=clean.fa minoverlap=4000 maxedits=50 minidentity=99

      If you want to allow indels (and with PacBio data, you should), then you must set "maxedits" as that defines the length of a buffers used for alignment. "minidentity" is optional.

      The -Xmx flag should be set to about 85% of physical memory (though it's not strictly required, as it is normally autodetected. Microbial assemblies need under 1g).

      -Brian

      Comment


      • #4
        Sweet, thanks! I'll give this a whirl tomorrow.

        Comment


        • #5
          I've merged contigs successfully with SSPACE..
          savetherhino.org

          Comment


          • #6
            Hey Brian,

            I tried testing dedupe.sh on two contigs that I checked out with nucmer first (see below). I've tried playing with the settings (overlap length, maxedits, etc...) but I can't seem to convince it to merge them. Any thoughts? I attached the align output as well if that's helpful.

            Thanks!!
            Lizzy




            NUCMER

            [S1] [E1] | [S2] [E2] | [LEN 1] [LEN 2] | [% IDY] | [LEN R] [LEN Q] | [COV R] [COV Q] | [TAGS]
            ===============================================================================================================================
            2121976 2128241 | 1 6272 | 6266 6272 | 99.11 | 2128241 1076355 | 0.29 0.58 | unitig_109 unitig_111
            Attached Files

            Comment


            • #7
              Fasta file of contigs I'm trying to merge is here in case that's helpful...

              Comment


              • #8
                Lizzy,

                Dedupe only discards contigs that are fully contained by other contigs (within the set limits of %id, etc). It will not merge reads or make a consensus, and will not touch contigs that overlap like this:

                ____AGCTAGCTACGT
                AGCTAGCTAGCT____

                ...where each one goes off the end of the other. Sorry if I gave you false hopes! But generally, I don't really think it's safe to forcibly merge such contigs, as they could represent imperfect repeats in the genome. If the assembler does not merge them, it probably believes they are independent (or, of course, has a bug).

                I suggest that you map the raw reads to all of the assembled contigs, using random placement of ambiguously-mapping reads. If these two contigs have roughly half the coverage of other contigs, then they are assembler artifacts and you can manually delete one of them. But if the coverage is similar to other contigs, you should keep both of them.

                Comment


                • #9
                  Originally posted by JohnN View Post
                  The Staden package's gap5 is what I use.
                  Hey @JohnN,

                  I've been looking into using gap5 but can't figure out how to get it to accept / open my existing assembly. Do you have any pointers or suggestions? Links to other documentation? Tried running pre-gap4 but couldn't seem to find right settings for importing an existing assembly...

                  Thanks for the suggestion!

                  Comment


                  • #10
                    Originally posted by Brian Bushnell View Post
                    Lizzy,

                    Dedupe only discards contigs that are fully contained by other contigs (within the set limits of %id, etc). It will not merge reads or make a consensus, and will not touch contigs that overlap like this:

                    ____AGCTAGCTACGT
                    AGCTAGCTAGCT____

                    ...where each one goes off the end of the other. Sorry if I gave you false hopes! But generally, I don't really think it's safe to forcibly merge such contigs, as they could represent imperfect repeats in the genome. If the assembler does not merge them, it probably believes they are independent (or, of course, has a bug).

                    I suggest that you map the raw reads to all of the assembled contigs, using random placement of ambiguously-mapping reads. If these two contigs have roughly half the coverage of other contigs, then they are assembler artifacts and you can manually delete one of them. But if the coverage is similar to other contigs, you should keep both of them.
                    Ah OK, thanks anyways! Yeah, I've looked into these contig regions and see a drop in coverage, and no other competing overlaps with other contigs in the asm. Most are in regions w/ mobile genetic elements, and this is population consensus (metagenomic) sample, so some inaccuracy is OK with me. Just want to make a consensus in one piece for a reference - looks do-able and reasonable, just looking for the first program to jam them together.... (hopefully avoiding doing it manually!!!)

                    Comment


                    • #11
                      Originally posted by ewilbanks View Post
                      Hey @JohnN,

                      I've been looking into using gap5 but can't figure out how to get it to accept / open my existing assembly. Do you have any pointers or suggestions? Links to other documentation? Tried running pre-gap4 but couldn't seem to find right settings for importing an existing assembly...

                      Thanks for the suggestion!
                      Put all of your contigs into a single fasta file (or use the SAM or CAF file which an assembler spits out).

                      Then use the "tg_index" program which comes with the Staden package to make the gap5 database:

                      $ tg_index -o output_gap5 -f contigs.fasta

                      (-f for fasta input , -C for CAF input, -s for SAM input, etc. Running "tg_index" w/o params gives the params).

                      Then start gap5 and load your output_gap5 database.

                      Click on internal joins, and wait a few seconds/minutes. Click on the area in the internal join window to get to the aligned contigs (if any). In the new edit window, you may need to click on align again to smooth out the bumps.

                      If the join looks good, join it.

                      HTH
                      J

                      PS I use this method a lot for joining PacBio contigs but I'm a bit paranoid about raggedy ends, and usually reference map an illumina run of the same genome against the pacbio output before I join contigs. But I checked and just using fasta files works.
                      Last edited by JohnN; 07-11-2014, 12:03 PM.

                      Comment


                      • #12
                        Originally posted by JohnN View Post
                        Put all of your contigs into a single fasta file (or use the SAM or CAF file which an assembler spits out).

                        Then use the "tg_index" program which comes with the Staden package to make the gap5 database:

                        $ tg_index -o output_gap5 -f contigs.fasta

                        (-f for fasta input , -C for CAF input, -s for SAM input, etc. Running "tg_index" w/o params gives the params).

                        Then start gap5 and load your output_gap5 database.

                        Click on internal joins, and wait a few seconds/minutes. Click on the area in the internal join window to get to the aligned contigs (if any). In the new edit window, you may need to click on align again to smooth out the bumps.

                        If the join looks good, join it.

                        HTH
                        J

                        PS I use this method a lot for joining PacBio contigs but I'm a bit paranoid about raggedy ends, and usually reference map an illumina run of the same genome against the pacbio output before I join contigs. But I checked and just using fasta files works.

                        Ah ha! Thank you!! I got lost in the massive manual pdf for the staden pacakge - found the documentation for doing what you describe on p. 28 Will try this now thanks again.

                        Comment


                        • #13
                          CAP3 (http://seq.cs.iastate.edu/cap3.html) is also good for this, you can set identity percentage cutoffs and, though it does end-clipping by default I believe this can be turned off with -k 0. It produces useful output showing you exactly where the overlaps are.

                          Comment


                          • #14
                            SSPACE merge contigs

                            Hello, rhinoceros!
                            Could you please post your script or cammand line to merge contigs with SSPACE? I'm doing similar thing. Thanks in advance!

                            Comment

                            Latest Articles

                            Collapse

                            • seqadmin
                              Current Approaches to Protein Sequencing
                              by seqadmin


                              Proteins are often described as the workhorses of the cell, and identifying their sequences is key to understanding their role in biological processes and disease. Currently, the most common technique used to determine protein sequences is mass spectrometry. While still a valuable tool, mass spectrometry faces several limitations and requires a highly experienced scientist familiar with the equipment to operate it. Additionally, other proteomic methods, like affinity assays, are constrained...
                              04-04-2024, 04:25 PM
                            • seqadmin
                              Strategies for Sequencing Challenging Samples
                              by seqadmin


                              Despite advancements in sequencing platforms and related sample preparation technologies, certain sample types continue to present significant challenges that can compromise sequencing results. Pedro Echave, Senior Manager of the Global Business Segment at Revvity, explained that the success of a sequencing experiment ultimately depends on the amount and integrity of the nucleic acid template (RNA or DNA) obtained from a sample. “The better the quality of the nucleic acid isolated...
                              03-22-2024, 06:39 AM

                            ad_right_rmr

                            Collapse

                            News

                            Collapse

                            Topics Statistics Last Post
                            Started by seqadmin, 04-11-2024, 12:08 PM
                            0 responses
                            24 views
                            0 likes
                            Last Post seqadmin  
                            Started by seqadmin, 04-10-2024, 10:19 PM
                            0 responses
                            25 views
                            0 likes
                            Last Post seqadmin  
                            Started by seqadmin, 04-10-2024, 09:21 AM
                            0 responses
                            22 views
                            0 likes
                            Last Post seqadmin  
                            Started by seqadmin, 04-04-2024, 09:00 AM
                            0 responses
                            52 views
                            0 likes
                            Last Post seqadmin  
                            Working...
                            X