  • Tom_C
    Member
    • Aug 2012
    • 16

    Help towards closing a genome?

    Hello All,

    I am a graduate student trying to learn NGS as I wrap up my PhD. That said, we have sequenced our pet bacterial genome (Illumina HiSeq 2500, PE 101 bp) and I have so far managed to produce what looks to me like a good assembly. Reads were cleaned up with Trimmomatic and assembled using Ray-2.3.1 with a default kmer of 31. The output is as follows:

    Contigs >= 100 nt
    Number: 28
    Total length: 4963730
    Average: 177276
    N50: 246178
    Median: 162206
    Largest: 771798
    Contigs >= 500 nt
    Number: 28
    Total length: 4963730
    Average: 177276
    N50: 246178
    Median: 162206
    Largest: 771798
    Scaffolds >= 100 nt
    Number: 22
    Total length: 4965242
    Average: 225692
    N50: 338745
    Median: 115189
    Largest: 1908686
    Scaffolds >= 500 nt
    Number: 22
    Total length: 4965242
    Average: 225692
    N50: 338745
    Median: 115189
    Largest: 1908686
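
For readers who want to recompute these summary numbers from a list of contig or scaffold lengths, here is a minimal Python sketch. It uses the common definition of N50 (the length at which the longest sequences together cover at least half of the total), which may differ in edge cases from what Ray reports. The function names `n50` and `summarize` are illustrative, not part of Ray, and the average is floored to an integer because that matches the figures above (4963730 / 28 gives 177276).

```python
import statistics


def n50(lengths):
    """Length L such that sequences of length >= L cover at least half the total."""
    ordered = sorted(lengths, reverse=True)
    if not ordered:
        raise ValueError("no lengths given")
    half = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half:
            return length


def summarize(lengths, min_len=100):
    """Summary statistics for sequences of at least min_len nt."""
    kept = [n for n in lengths if n >= min_len]
    if not kept:
        raise ValueError("no sequences pass the length cutoff")
    total = sum(kept)
    return {
        "Number": len(kept),
        "Total length": total,
        "Average": total // len(kept),  # floored, as in the report above
        "N50": n50(kept),
        "Median": statistics.median(kept),
        "Largest": max(kept),
    }


# Toy lengths only; these are not the real contig lengths.
toy = [10, 8, 6, 4, 2]
print(n50(toy))                    # 8
print(summarize(toy, min_len=1))
```

With the toy list, the two longest sequences (10 + 8 = 18) are the first to reach half of the total of 30, so the N50 is 8.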

    The total length is in good agreement with other sequenced genomes of the same species (ranging from 4.8 to 5.0 Mb). But I am now beyond what anyone at my institute has experience with. I would like to go as far as possible towards closing the genome, but I am unsure what next steps to take. Can anyone provide some input as to what logical next steps I should take? Thank you very much!
  • maubp
    Peter (Biopython etc)
    • Jul 2009
    • 1544

    #2
    First ask the question: why do you want or need a finished genome? How much time and money can you spend on getting one?

    If you only care about one or two regions of interest, it may be cost effective to do it the old fashioned way (PCR and "Sanger" capillary sequencing to close gaps).

    • Tom_C
      Member
      • Aug 2012
      • 16

      #3
      Thanks for the reply!

      I had assumed a closed, or mostly closed genome would make downstream applications much easier. We plan to do ChIP-Seq and possibly RNA-Seq with this bacterium later on, and figured having a mostly closed genome would be best.

      That being said, if a closed genome is not required for these experiments we would still like to join as many contigs as possible to publish a decent draft genome. And that is where we need some expert advice.

      • maubp
        Peter (Biopython etc)
        • Jul 2009
        • 1544

        #4
        A closed circle is of course nice, but if all you care about is gene content you may be fine as it is. Finishing it will cost time and money whichever route you take.

        • JohnN
          Member
          • Jan 2011
          • 31

          #5
          There are several approaches of varying complexity and cost:

          The easiest, in my recent experience, is to get PacBio sequencing done. With the Illumina reads mapped to a PacBio assembly, you can close and finish the genome in about two days of solid work. But (and there are at least two big buts) it will cost you about $1500 for the sequencing, and the PacBio assembly process is not that easy or automated, so you may have to outsource that too. But it works, and we have done it for about 30 reference genomes needed for diagnostic purposes.

          You can find a very closely related genome or two, and use synteny to help you arrange your contigs (Mauve, MUMmer, or reference mapping would help here), and then you can PCR-close the smaller, PCRable gaps. The rRNA regions will be difficult, and you could either ignore them, because they are not really that important for many studies, or generate primer sets to stitch the rRNA reads together. I've done it, it's a pain, but that's what we did in the old days.

          Or, as mentioned above, you can simply use your contig set in your downstream experiments. A large proportion of the genes involved with virulence, etc., are there already. The assembler typically quits when the length of the extending reads is less than the size of a repeated region. A quick way of assessing the quality of your assembly is to auto-annotate the genome with something like 'prokka' and look at what you have. You could probably use gap5 to join a few contigs which have some overlap, and to fix the odd frameshift, but you likely have what you need to continue your studies.
          Last edited by JohnN; 09-19-2014, 07:02 AM. Reason: typos

          • Brian Bushnell
            Super Moderator
            • Jan 2014
            • 2709

            #6
            You already have a very good assembly, and closing the 28 remaining gaps probably won't affect many downstream programs. You will almost certainly need more data for a significant improvement: either a long-mate-pair library for better scaffolding, or PacBio for gap-filling. If you go PacBio, you may as well just run 2-3 SMRT cells and try for a complete single-contig PacBio-only assembly.

            • bastianwur
              Member
              • Feb 2014
              • 98

              #7
              I'd first try to scaffold it against a reference, and determine from that how much could be missing and whether it is relevant.
              Because if, e.g., 3/4 of the gaps probably consist of 23S rRNA or stretches of tRNA, then just ignore them.
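
As a rough illustration of "how much could be missing", the sketch below orders contigs by where they land on a reference and reports the gap between neighbours. It assumes you have already reduced your alignments (from whatever tool you use) to one best placement per contig as 1-based reference coordinates. The `Placement` type and `order_and_gaps` function are hypothetical names for this example, and the approach ignores rearrangements relative to the reference.

```python
from typing import NamedTuple


class Placement(NamedTuple):
    contig: str
    ref_start: int  # 1-based, inclusive
    ref_end: int    # 1-based, inclusive


def order_and_gaps(placements, ref_length, circular=True):
    """Sort contigs by reference position and estimate the gap between neighbours.

    A negative gap means the two placements overlap on the reference.
    """
    ordered = sorted(placements, key=lambda p: p.ref_start)
    gaps = [
        (left.contig, right.contig, right.ref_start - left.ref_end - 1)
        for left, right in zip(ordered, ordered[1:])
    ]
    if circular and ordered:
        # Gap that wraps around the origin of a circular chromosome.
        first, last = ordered[0], ordered[-1]
        wrap = (ref_length - last.ref_end) + (first.ref_start - 1)
        gaps.append((last.contig, first.contig, wrap))
    return ordered, gaps


# Made-up coordinates on a made-up 10 kb reference.
hits = [
    Placement("contig_2", 5001, 9000),
    Placement("contig_1", 1, 4000),
    Placement("contig_3", 8900, 9900),
]
ordered, gaps = order_and_gaps(hits, ref_length=10000)
for left, right, size in gaps:
    print(left, right, size)
```

Gaps of a few kb that line up with rRNA operons in the reference are the ones you can probably ignore; large gaps with gene content are the ones worth chasing.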

              If the missing parts seem to be more relevant, then there are a few things to consider:
              - is repeat structure a problem (doesn't seem so)
              - how much is missing? If it's a larger amount, then you might need to consider a second run with reasonably high coverage
              - is the raw material still there? Because I think (not a lab person) that a PE jumping library (4 - 8 kb should get over the rRNAs; as suggested above) can be made from the same input material, so that would save time.


              You should also do some QC on your genome. It can happen (I have had that with Ray, HGAP and with other assemblers as well) that parts are duplicated, which might not be obvious at first. E.g. it turned out during some other processing of one of our genomes that it had the right size (5 Mb), the right number of proteins (5k), but not the right number of "unique" proteins (4k). Why? One of the scaffolds was just duplicated in the output.
              Check as well that there's no obvious contamination in the assembly. It doesn't help you if a good part is E. coli (or whatever).
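
A quick way to catch the "whole scaffold duplicated" case is to hash every sequence in the assembly FASTA and look for collisions. The sketch below is standard-library Python with illustrative function names (`read_fasta`, `canonical`, `find_duplicates`). It only detects exact, full-length duplicates (including reverse complements), so partial duplications or near-identical copies would need an alignment-based check instead.

```python
import hashlib
from collections import defaultdict

_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def read_fasta(lines):
    """Yield (name, sequence) pairs from an iterable of FASTA lines."""
    name, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            name, chunks = (line[1:].split() or [""])[0], []
        elif name is not None:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def canonical(seq):
    """Strand-independent form: the smaller of the sequence and its reverse complement."""
    seq = seq.upper()
    return min(seq, seq.translate(_COMPLEMENT)[::-1])


def find_duplicates(records):
    """Return lists of names whose sequences are identical on either strand."""
    groups = defaultdict(list)
    for name, seq in records:
        digest = hashlib.sha256(canonical(seq).encode()).hexdigest()
        groups[digest].append(name)
    return [names for names in groups.values() if len(names) > 1]


# Toy input: b is the reverse complement of a, d is a copy of a split over two lines.
toy = [">a", "ACGTT", ">b", "AACGT", ">c", "GGGG", ">d copy", "ACG", "TT"]
print(find_duplicates(read_fasta(toy)))  # [['a', 'b', 'd']]
```

For a real assembly, pass an open file handle instead of the toy list, e.g. `find_duplicates(read_fasta(open("scaffolds.fasta")))`, where the file name is a placeholder.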

              • Tom_C
                Member
                • Aug 2012
                • 16

                #8
                Thanks for the input everyone!

                Unfortunately, additional large-scale sequencing is not in the budget for this project, so we will not be able to use mate-pair or PacBio reads to close the genome. The number of Illumi... However, we now know to use PacBio for all future genome projects.

                Running the initial assembly through RAST indicates it is a fairly complete genome, with the correct number of proteins and a full complement of rRNAs and tRNAs. At the suggestion of those in this thread, we plan to go ahead with ChIP-Seq and RNA-Seq using the current assembly.
