
  • Your perspectives on assembling bacterial genomes with one set of reads

    Hello Everyone,

    I am in the early stages of a comparative analysis of several strains of E. coli, and I will admit up-front that I am not a bioinformatician. We already published the draft genome sequences (de novo assembly with Velvet of Illumina 2x101 PE reads), and the quality of the assemblies based on N50 and number of contigs was okay--the assemblies gave our lab most of the information we were interested in at an early date, such as protein-encoding genes and metabolic reconstructions.
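For readers unfamiliar with the two metrics mentioned above, here is a minimal sketch of how N50 and contig count can be computed from a list of contig lengths. The function name and the example lengths are made up for illustration; assemblers and tools such as QUAST report these values directly.

```python
def n50(lengths):
    """Return the N50 of a list of contig lengths.

    N50 is the length L such that contigs of length >= L together
    contain at least half of all assembled bases.
    """
    if not lengths:
        return 0
    total = sum(lengths)
    running = 0
    # Walk from the longest contig down until half the assembly is covered.
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    return 0


# Hypothetical contig lengths, for illustration only.
contigs = [500, 400, 300, 200, 100]
print("contigs:", len(contigs), "N50:", n50(contigs))
```

A higher N50 with fewer contigs generally indicates a more contiguous assembly, but neither number says anything about correctness.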

    However, in an effort to answer some more interesting questions about variable phenotypes between strains, we are performing a comparative genomics analysis. I am concerned about information that is lost due to the lack of synteny that afflicts many draft genomes. Furthermore, I have encountered a few algorithms thus far that require fully closed genomes as input (because they take synteny into account), and it has led me to ask the following questions for bacterial genomes:


    1. Is there a proven pipeline for closing "simple" bacterial genomes (E. coli, where many reference strains have already been closed) with a single set of reads, say 2x101 PE Illumina reads? In this case, I mean complete closure with no Sanger sequencing.
    2. Besides synteny, what other information is missing from an unfinished bacterial genome? Stated another way, without closure, what sort of information that would be relevant to a comparative genomics study is impossible/difficult to deduce? (For instance, some genomic/pathogenicity island prediction algorithms require fully closed genomes, presumably due to their association with direct repeats, insertion sequences, etc.) Perhaps my concern regarding closing the genomes is unwarranted?


    For these reasons, I hesitate to move forward with the comparative analysis until the genome assemblies are closed. In an effort to explore closing the draft genomes with the original Illumina reads, I have tried the following pipeline with decent results, although the genome still has ~100 scaffolds containing ~1000 N's:

    Velvet de novo assembly of random sampling of paired-end reads (and optimization for low # of contigs and high N50) while maintaining sufficient coverage >30x ----> GapFiller to fill ambiguous bases ----> SIS (Scaffolds from Inversion Signatures) against a closely related, closed reference genome
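The random-sampling step depends on how much coverage remains after subsampling. As a rough sketch (not part of the original pipeline), expected coverage for paired-end data can be estimated as 2 x pairs x read length / genome size, and the sampling fraction follows from the target coverage. The read-pair count and the 5 Mb genome size below are assumed values for illustration.

```python
import random


def paired_end_coverage(n_pairs, read_length, genome_size):
    """Expected fold coverage from n_pairs read pairs of a given read length."""
    return 2 * n_pairs * read_length / genome_size


def subsample_fraction(n_pairs, read_length, genome_size, target_coverage):
    """Fraction of read pairs to keep to reach target_coverage (capped at 1.0)."""
    full = paired_end_coverage(n_pairs, read_length, genome_size)
    if full <= 0:
        raise ValueError("coverage must be positive")
    return min(1.0, target_coverage / full)


def choose_pairs(n_pairs, fraction, seed=0):
    """Pick pair indices to keep; both mates of a pair are kept or dropped together."""
    rng = random.Random(seed)
    return [i for i in range(n_pairs) if rng.random() < fraction]


# Assumed numbers: 5 million 2x101 pairs, ~5 Mb genome.
full = paired_end_coverage(5_000_000, 101, 5_000_000)
frac = subsample_fraction(5_000_000, 101, 5_000_000, target_coverage=50)
print(f"full coverage: {full:.0f}x, keep fraction for 50x: {frac:.3f}")
```

Sampling by pair index rather than by individual read keeps mates together, which the assembler and scaffolder need for the insert-size information.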

    I have also considered:

    Bowtie2 to map reads to the aforementioned reference genome ----> SSPACE ----> GapFiller ----> SIS ----> back to bowtie2

    or some permutation of this.
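Whichever permutation is used, progress can be tracked with the two figures quoted above: the number of scaffolds and the number of remaining N's. The following is a minimal sketch of such a check using only the standard library; the function names and the toy FASTA are illustrative and not part of any of the tools mentioned.

```python
import re
from io import StringIO


def read_fasta(handle):
    """Yield (name, sequence) pairs from an open FASTA file handle."""
    name, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            # Use the first word of the header as the record name.
            name, chunks = (line[1:].split() or [""])[0], []
        else:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def gap_summary(handle):
    """Count scaffolds, runs of N (gaps), and total N bases in a FASTA file."""
    n_scaffolds = n_runs = n_bases = 0
    for _, seq in read_fasta(handle):
        n_scaffolds += 1
        runs = re.findall(r"[Nn]+", seq)
        n_runs += len(runs)
        n_bases += sum(len(run) for run in runs)
    return {"scaffolds": n_scaffolds, "gap_runs": n_runs, "n_bases": n_bases}


# Toy input; with a real assembly, pass open("scaffolds.fasta") instead.
toy = StringIO(">s1\nACGTNNNNACGT\nNNAC\n>s2\nACGT\n")
print(gap_summary(toy))
```

Running this after each step shows whether a tool actually reduced the gap count or only shuffled scaffolds around.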

    Has anybody had success in completely closing a "simple" bacterial genome in this manner? If so, what were your strategies?

    Many thanks for your assistance.

    Best,
    Brady

  • #2
    Hi Brady,

    I closed two small bacterial genomes (about 2 Mbp each) from 454 paired-end pyrosequencing reads using a similar strategy (without SIS, but with manual closure based on viewing alignments of the reads to the scaffolds generated by Newbler).

    1. Newbler assembly of the 454 sff files to generate scaffolds
    2. GapFiller
    3. Bowtie alignment of the reads to the gap-filled scaffolds
    4. Manually close some gaps based on the alignment
    5. Iterate steps 2-4.
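The iteration over steps 2-4 can be sketched as a loop that repeats a gap-closing round until no further gaps are closed. This is only an outline under stated assumptions: `close_round` and `count_gaps` are hypothetical callables standing in for the GapFiller, Bowtie, and manual-closure steps, and the stopping rule (stop when a round no longer reduces the gap count) is an assumption, not something described in the post.

```python
import re


def iterate_gap_closing(scaffolds, close_round, count_gaps, max_rounds=10):
    """Repeat close_round until the gap count stops decreasing.

    close_round: hypothetical callable, takes scaffolds and returns new scaffolds.
    count_gaps:  hypothetical callable, takes scaffolds and returns an int.
    Returns the final scaffolds and the gap count after each accepted round.
    """
    gaps = count_gaps(scaffolds)
    history = [gaps]
    for _ in range(max_rounds):
        if gaps == 0:
            break
        candidate = close_round(scaffolds)
        new_gaps = count_gaps(candidate)
        if new_gaps >= gaps:
            # No improvement: keep the previous scaffolds and stop.
            break
        scaffolds, gaps = candidate, new_gaps
        history.append(gaps)
    return scaffolds, history


# Toy stand-ins: a "round" removes the first run of N's from a single string.
def toy_round(seq):
    return re.sub(r"N+", "", seq, count=1)


def toy_count(seq):
    return len(re.findall(r"N+", seq))


final, history = iterate_gap_closing("ACNNGTNNNAC", toy_round, toy_count)
print(final, history)
```

In practice each round would write intermediate files and call the external tools; the point of the sketch is only the accept-if-improved loop and the cap on the number of rounds.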

    Your case is somewhat different, but this strategy (plus your SIS step) would definitely be an efficient one. I would like to try SIS in the future.

    Justin



    • #3
      Thanks for your response, Justin. Can you tell me the read length and insert size of your library? I think the short read length and insert size of my Illumina library puts me at a disadvantage when compared to the longer reads and insert sizes from a typical 454 library that can span longer repetitive regions.

      SIS appears to be a nice tool, but it looks like it threw away ~100kb from my de novo assembly. I haven't looked at the content, so I can't really comment on that aspect yet. I'd be interested to hear if any seasoned veterans have used SIS yet. It would be nice to know what pitfalls to avoid with SIS with respect to bacterial genome assemblies.

      Brady



      • #4
        454 read length and insert size

        Originally posted by bcress:
        Thanks for your response, Justin. Can you tell me the read length and insert size of your library? I think the short read length and insert size of my Illumina library puts me at a disadvantage when compared to the longer reads and insert sizes from a typical 454 library that can span longer repetitive regions.

        SIS appears to be a nice tool, but it looks like it threw away ~100kb from my de novo assembly. I haven't looked at the content, so I can't really comment on that aspect yet. I'd be interested to hear if any seasoned veterans have used SIS yet. It would be nice to know what pitfalls to avoid with SIS with respect to bacterial genome assemblies.

        Brady
        My 454 reads are 300 to 500 bp long. I can't say the insert size for sure, but I can give you an estimate from the Newbler output: 7523.38, std 8572.91.

        For de novo assembly, Illumina paired-end reads alone usually won't give you nice long scaffolds, so you end up with a large number of contigs.

        Let's keep each other updated on any progress with the SIS tool. I will try it out with my next genome.



        • #5
          You should map out the cost of any of these strategies and then compare it to running long libraries on the PacBio. For E. coli-sized genomes, with the new single-library pipelines (e.g. http://www.ncbi.nlm.nih.gov/pubmed/23644548 ) and the new RS II instrument, it should take just 2-3 SMRTcells per genome, or about $1K-1.5K/genome at typical core facility charges. As shown recently (http://arxiv.org/abs/1304.3752), for the majority of bacterial genomes this strategy will give you a single contig; only a few known bacterial repeats are too large to resolve.

          Illumina paired ends will be cheaper, but you'll have many more contigs. In my experience, Velvet is not as aggressive as other assemblers out there such as Ray or MIRA, and it would seem you can tolerate an aggressive assembler. You can get some large contigs this way, but you will have quite a few of them.

          454 should do better than Illumina, but will be inferior to PacBio and cost more.
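To make the suggested cost mapping concrete, here is a small sketch of the arithmetic. The 2-3 SMRTcells per genome and the $1K-1.5K per genome come from the estimate above, which works out to roughly $500 per SMRTcell; that per-cell figure and the number of strains are assumptions for illustration, and actual core facility charges will vary.

```python
def pacbio_cost_range(n_genomes, cells_per_genome=(2, 3), cost_per_cell=500):
    """Return (low, high) total cost in dollars for sequencing n_genomes.

    cells_per_genome: (min, max) SMRTcells needed per genome.
    cost_per_cell:    assumed charge per SMRTcell, in dollars.
    """
    low_cells, high_cells = cells_per_genome
    low = n_genomes * low_cells * cost_per_cell
    high = n_genomes * high_cells * cost_per_cell
    return low, high


# Hypothetical project size: six strains.
low, high = pacbio_cost_range(6)
print(f"6 genomes: ${low:,} to ${high:,}")
```

The same function can be reused for the other strategies by swapping in their per-run costs, so the options can be compared on one scale before committing to more sequencing.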



          • #6
            Just saw your post. Thanks for that. These are the kinds of insights that would have been useful before we started sequencing. I will try MIRA and Ray, but I think resequencing is out of the question at this point. In your experience, what benefits have you seen from closing bacterial genomes in terms of informational content (other than the satisfaction of knowing that the genome is tidy, of course)?



            • #7
              For my work, I don't actually require the genome to be closed. But I do need large gene clusters to each reside on a single contig, and it turns out that natural product gene clusters are extremely hard to assemble. So I still have a bit of a local view (I need specific regions fully assembled, not the whole bug), but a tough standard there.

              Another assembler to look at is MaSuRCA, though in my initial test Ray beat it (see also the GAGE-B paper).
