  • maasha
    Senior Member
    • Apr 2009
    • 153

    De-novo assembly of bacteria genomes - which tools?

    Hi all,

    I have been given the task of assembling three bacterial genomes.

    Each has been sequenced with 454 (read length around 250 nt), and also with Solexa mate-pair (2.5 kb spacing, 35 nt reads) because the initial assembly of the 454 data gave too many contigs (the Newbler assembler produced a couple of thousand contigs per genome, these being GC rich).

    Which assembly software would be appropriate? And are there any recommendations on how to proceed with assembly having two types of data?

    Cheers,

    Martin
  • BaCh
    Member
    • May 2008
    • 81

    #2
    Originally posted by maasha
    Which assembly software would be appropriate? And are there any recommendations on how to proceed with assembly having two types of data?
    You could try out release candidate 1 of MIRA 3 (or wait for rc2 which will be put online this weekend): http://chevreux.org/projects_mira.html (comes with extensive docs)

    Off the top of my head: Velvet and Euler-SR will also work with 454 and Solexa. I have certainly missed some. Also have a look at: http://seqanswers.com/forums/showthread.php?t=43

    Regards,
    B.

    PS: I'm a bit biased as I wrote MIRA

    Comment

    • maasha
      Senior Member
      • Apr 2009
      • 153

      #3
      Thanks BaCh,

      So where to start? The Solexa Data or the 454? It occurs to me that the 454 - with the longer reads - are best for initial assembly - and then gaps can be closed with the mate-pair Solexa data.

      And can you feed the resulting contigs of one of these assemblies into the software as reads?


      Martin

      Comment

      • BaCh
        Member
        • May 2008
        • 81

        #4
        Originally posted by maasha
        So where to start? The Solexa Data or the 454? It occurs to me that the 454 - with the longer reads - are best for initial assembly - and then gaps can be closed with the mate-pair Solexa data.
        If you can't feed both read types at the same time, 454 will indeed give you a better initial assembly, though this may slowly change with 76-mers and, soon, reads twice that length.

        Originally posted by maasha
        And can you feed the resulting contigs of one of these assemblies into the software as reads?
        Most of the time not: contigs are normally pretty long sequences, which assemblers have problems with (they're currently built to expect reads of up to 1-2k bases).

        Try feeding the assembler with all reads at first. Compare what looks best to you. If that does not work at all, then resort to assembling the Solexas, shredding the resulting contigs and feeding those shreds alongside the 454 reads to "normal" assemblers (Newbler, CABOG, MIRA, and others that grok 454).

        At least that's what I'd do.
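A minimal sketch of the "shred" step, assuming contigs are available as plain strings. The function name `shred` and the default piece length and step (500 and 250 bases) are illustrative assumptions, not values from the thread; the only constraint taken from the post is that pieces should stay well below the 1-2k bases assemblers expect.

```python
def shred(contig, length=500, step=250):
    """Cut a contig into overlapping pseudo-reads of at most `length` bases.

    `length` and `step` are hypothetical defaults chosen for illustration.
    """
    if length <= 0 or step <= 0:
        raise ValueError("length and step must be positive")
    if len(contig) <= length:
        # Short contigs are passed through unchanged.
        return [contig]
    last_start = len(contig) - length
    starts = list(range(0, last_start + 1, step))
    if starts[-1] != last_start:
        # Add one final piece so the end of the contig is covered too.
        starts.append(last_start)
    return [contig[s:s + length] for s in starts]


if __name__ == "__main__":
    demo_contig = "ACGT" * 300  # 1200-base stand-in for a Solexa contig
    pieces = shred(demo_contig)
    print(len(pieces), [len(p) for p in pieces])
```

Overlapping pieces (step smaller than length) give the downstream assembler something to join on; whether that overlap is appropriate for a given assembler is something to check in its documentation.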

        HOWEVER ...

        I am skeptical that the Solexas will greatly reduce contigs in your project. First, having "thousands" of contigs for a small bacterium with 454 data is *extremely* unusual (unless your average coverage is a low single digit number). Something doesn't feel right.

        Second, you wrote "GC rich", which might be problematic, as Solexa sometimes has problems with GGC.G motifs. What I've often seen is this:

        refseq .....GGCGGCGGCxxxxxxxxGCCGCCGGC.......

        Now, the GGC motif in forward and reverse direction pretty much leads to a complete depletion of correct Solexa reads (they all have errors in the bases marked with x).
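One way to screen a reference or draft contig for candidate regions of this kind is sketched below. It simply looks for a GGC followed, within a limited gap, by GCC (the reverse complement of GGC). The function name and the `max_gap` default of 20 bases are assumptions made for illustration; the observation itself is anecdotal, so hits are places to inspect, not confirmed problem sites.

```python
import re


def find_ggc_inverted_pairs(seq, max_gap=20):
    """Return (start, end) spans where GGC is followed by GCC within max_gap bases.

    max_gap is a hypothetical threshold; adjust it to the read length in use.
    """
    # The lookahead lets overlapping candidates (e.g. GGCGGCGGC...) all be reported.
    pattern = re.compile(r"(?=(GGC[ACGTN]{0,%d}?GCC))" % max_gap)
    return [(m.start(1), m.end(1)) for m in pattern.finditer(seq.upper())]


if __name__ == "__main__":
    # Same shape as the refseq line above, with T standing in for the x bases.
    demo = "AAAAGGCGGCGGCTTTTTTTTGCCGCCGGCAAAA"
    for start, end in find_ggc_inverted_pairs(demo):
        print(start, end, demo[start:end])
```

Comparing Solexa read depth inside and outside the reported spans would show whether the depletion described above is actually present in a given data set.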

        B.

        Comment

        • ewilbanks
          Member
          • Mar 2009
          • 83

          #5
          Try Velvet: http://www.ebi.ac.uk/~zerbino/velvet/

          It has support for using both short (Illumina) and long (454) reads in the assembly.

          Comment

          • maasha
            Senior Member
            • Apr 2009
            • 153

            #6
            @BaCh,

            Thanks,

            I am skeptical that the Solexas will greatly reduce contigs in your project. First, having "thousands" of contigs for a small bacterium with 454 data is *extremely* unusual (unless your average coverage is a low single digit number). Something doesn't feel right.
            Remember that the Solexa data is mate pair. So the question is also - is there any software that will take both single reads and mate pair reads as input?

            @ewilbanks,

            Thanks, Velvet is at the top of my list of tools to try (along with MIRA, of course).


            Martin

            Comment

            • BaCh
              Member
              • May 2008
              • 81

              #7
              Originally posted by maasha
              Remember that the Solexa data is mate pair. So the question is also - is there any software that will take both single reads and mate pair reads as input?
              Any software able to treat paired-end should be very well able to handle single reads. It needs to anyway for cases where a mate is not present or unusable.

              B.

              Comment

              • maasha
                Senior Member
                • Apr 2009
                • 153

                #8
                Now I have battled extensively to assemble one bacterial genome sequenced with 454 and Solexa (mate pair).

                454 data: 478840 reads covering 117689186 nt, with a mean length of 246 nt and 33% GC content (the other two genomes are the ones with high GC content).

                Solexa data: mate-pair reads (distance ~2500 nt) of 35 nt and 44 nt length, in total 3308974 reads covering 130704473 nt, with a GC content of 38% (can anyone explain this bias?).

                Expected genome size: 2.8Mb
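As a quick sanity check on these figures, the nominal coverage of each data set can be worked out as total bases divided by expected genome size. The sketch below uses only the numbers quoted in this post; the variable and function names are illustrative.

```python
# Figures quoted in the post above.
GENOME_SIZE = 2_800_000  # expected genome size, 2.8 Mb
READS_454, BASES_454 = 478_840, 117_689_186
READS_SOLEXA, BASES_SOLEXA = 3_308_974, 130_704_473


def mean_read_length(total_bases, read_count):
    """Average read length in nucleotides."""
    return total_bases / read_count


def coverage(total_bases, genome_size):
    """Nominal fold coverage: sequenced bases per genome base."""
    return total_bases / genome_size


cov_454 = coverage(BASES_454, GENOME_SIZE)
cov_solexa = coverage(BASES_SOLEXA, GENOME_SIZE)
len_454 = mean_read_length(BASES_454, READS_454)
len_solexa = mean_read_length(BASES_SOLEXA, READS_SOLEXA)

print(f"454:    mean length {len_454:.1f} nt, coverage {cov_454:.1f}x")
print(f"Solexa: mean length {len_solexa:.1f} nt, coverage {cov_solexa:.1f}x")
```

Both data sets come out at roughly 40x or more, so by this measure neither is a low-coverage data set.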

                Using the Newbler assembler on the 454 data alone results in ~50 contigs.

                Using Velvet on both data types, I could at best get down to 459 contigs using the commands below:

                velveth test_both 31 -long M1_454.fna -shortPaired M1_solexa_paired.fna
                velvetg test_both/ -ins_length 2500 -cov_cutoff 10 -exp_cov 41 -max_coverage 300
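A note on the coverage parameters: the Velvet manual expresses coverage in k-mer terms, related to nucleotide coverage C by Ck = C * (L - k + 1) / L for read length L and hash length k. The sketch below applies that formula to the read lengths in this post with k = 31 as used above; the function names are illustrative, and the 42x figure is the approximate 454 coverage implied by the numbers given (an assumption of this sketch, not something stated in the commands).

```python
def kmer_fraction(read_length, k):
    """Fraction of nucleotide coverage that survives as k-mer coverage."""
    if read_length < k:
        return 0.0  # reads shorter than k contribute no k-mers
    return (read_length - k + 1) / read_length


def kmer_coverage(base_coverage, read_length, k):
    """k-mer coverage Ck = C * (L - k + 1) / L."""
    return base_coverage * kmer_fraction(read_length, k)


K = 31  # hash length passed to velveth above

for label, length in (("454, 246 nt", 246), ("Solexa, 44 nt", 44), ("Solexa, 35 nt", 35)):
    print(f"{label}: {kmer_fraction(length, K):.2f} of nucleotide coverage")

# Approximate 454 nucleotide coverage of 42x is an assumption derived from the post.
print(f"454 k-mer coverage at 42x: {kmer_coverage(42.0, 246, K):.1f}")
```

With k = 31, a 35 nt read yields only 5 k-mers, so most of the Solexa nucleotide coverage does not carry over into k-mer coverage. It may be worth checking whether the -exp_cov and -cov_cutoff values above were intended as nucleotide or k-mer coverage, and trying a smaller hash length.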

                This is not too good :(


                (Installing MIRA is not going well because of the libboost requirement - can't libboost be included in MIRA?)



                Martin

                Comment

                • BaCh
                  Member
                  • May 2008
                  • 81

                  #9
                  Originally posted by maasha
                  Using the Newbler assembler on the 454 data alone results in ~50 contigs.
                  ...
                  This is not too good :(
                  I beg to differ: this is already not too bad for 40x+ coverage. Now, having a 3MB data set that assembles into 700 contigs ... that's "not too good".

                  Originally posted by maasha
                  (Installing MIRA is not going well because of the libboost requirement - can't libboost be included in MIRA?)
                  Won't happen: including an almost standard library with a program is not a good idea, especially when all major distributions have it included. You could give the binary packages a try though, or don't they run on your machine?

                  Regards,
                  B.

                  Comment
