
  • Looking for advice on bacterial de novo genome sequencing

    Hi all,

    I want to obtain the complete genome sequences of three closely related bacterial strains in order to identify genes involved in toxin metabolism.

    We have a reference genome of the type strain, which consists of a single (unannotated) contig of about 2.8 Mb.

    So far the most cost-effective solution seems to be a partial-throughput run on a HiSeq using 2x100 bp reads. The service provider I've talked to tells me that they can do this for $700 per sample. This should correspond to about 1-2 Gb per sample, which means several hundred-fold coverage.
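As a quick sanity check on the coverage figure, here is a minimal sketch that divides the expected yield by the genome size. It assumes the strains are about the same size as the 2.8 Mb reference, and the function name `fold_coverage` is just an illustrative helper, not part of any tool.

```python
def fold_coverage(total_bases: float, genome_size: float) -> float:
    """Average fold coverage = total sequenced bases / genome size."""
    return total_bases / genome_size


# Assumption: each strain is roughly the size of the 2.8 Mb reference.
GENOME_SIZE = 2.8e6

low = fold_coverage(1e9, GENOME_SIZE)   # 1 Gb of sequence per sample
high = fold_coverage(2e9, GENOME_SIZE)  # 2 Gb of sequence per sample

print(f"Expected coverage: {low:.0f}x to {high:.0f}x")
```

So the quoted 1-2 Gb per sample works out to roughly 350-700x, which is consistent with "several hundred-fold".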

    The price is right but I'm unsure if short Illumina reads are going to present problems when we try to assemble the genomes, even with such high coverage. I have no experience with genome assembly, so I am hoping people here might be able to share their experience and tell me if they think we should be looking at alternative platforms to get longer reads for assembly.

    Edit: I realised too late that my title says 'de novo'. This is not really the case as we have a reference genome. The accuracy of this reference is an unknown quantity however.
    Last edited by coralnerd; 07-25-2013, 09:59 PM.

  • #2
    If you really need to do genome assembly, I'd look into what PacBio data would cost. It seems ideally suited to bacterial genome assembly, though I have no experience with it yet; I've just been eagerly awaiting that technology.

    My experience with assembling Illumina reads has not been very successful. Many assemblers are fine for producing contigs of a few kb to tens of kb, but none comes close to a full genome assembly automatically. This might be good enough to see if there are new genes not represented in the type strain though.

    Illumina reads are great for just aligning to a reference and producing an updated consensus sequence, which you might then use as the reference in another iteration. Discordantly mapping read pairs will point to larger differences (or errors in the reference assembly).
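To illustrate the idea of discordant pairs, here is a toy sketch that flags pairs whose mapped span is far from the expected insert size or whose mates are not in the usual facing orientation. The `ReadPair` record, the 300 bp insert size, and the 100 bp tolerance are all assumptions made up for the example; real aligners record this information in their own output, and this is not how any particular tool does it.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ReadPair:
    """Hypothetical, simplified record of one mapped read pair."""
    name: str
    left_start: int           # leftmost mapped position of the pair
    right_end: int            # rightmost mapped position of the pair
    proper_orientation: bool  # True if the mates face each other


def discordant_pairs(pairs: List[ReadPair],
                     expected_insert: int,
                     tolerance: int) -> List[str]:
    """Return names of pairs with an unexpected span or orientation."""
    flagged = []
    for pair in pairs:
        span = pair.right_end - pair.left_start
        wrong_size = abs(span - expected_insert) > tolerance
        if wrong_size or not pair.proper_orientation:
            flagged.append(pair.name)
    return flagged


# Made-up example data: one normal pair, one spanning far too much of the
# reference (e.g. a deletion in the sample), one with odd orientation.
pairs = [
    ReadPair("pair1", 1000, 1300, True),
    ReadPair("pair2", 5000, 9000, True),
    ReadPair("pair3", 12000, 12310, False),
]
flagged = discordant_pairs(pairs, expected_insert=300, tolerance=100)
print(flagged)
```

Clusters of flagged pairs at the same location are what would hint at a rearrangement or a reference error; isolated ones are usually just noise.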



    • #3
      On PacBio using HGAP, Celera assembler (or MIRA) and Quiver, you can probably get a very high quality sequence (quite likely a single contig, with probably <10 base substitutions or indels) from a single SMRT cell, for around $1000 USD. I'm not sure about access to providers Down Under -- it looks like Millennium Science is a commercial provider (expect to go up a bit on my price estimate if they are like most commercial shops). The long reads will give you a good ability to detect structural variations, especially changes in repeats.

      My standard advice for these projects can be found at http://omicsomics.blogspot.com/2013/...-for-help.html . In particular, I'd advise you to think about what questions you are going to ask & how the continuity of the sequence might affect those.

      Depending on the G+C content of your organism, HiSeq may generate quite good data, but it won't be able to span long repeats such as Insertion Sequences and ribosomal RNA genes. For a moderate-G+C organism at 100X coverage, you may get some contigs well over 200 kb, but the N50 of the assembly will probably be more in the 20-50 kb range or so -- but that's a rough guess. Some of that also depends on which assembler you use.
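For anyone unfamiliar with N50: it is the contig length at which contigs of that length or longer contain at least half of the assembled bases. A minimal sketch of the calculation follows; the contig lengths are made-up numbers for illustration, not from any real assembly.

```python
from typing import List


def n50(contig_lengths: List[int]) -> int:
    """Length L such that contigs >= L hold at least half the total bases."""
    if not contig_lengths:
        raise ValueError("need at least one contig")
    total = sum(contig_lengths)
    running = 0
    # Walk from the longest contig down, accumulating bases.
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    raise AssertionError("unreachable: running total always reaches total")


# Made-up contig lengths in kb: total is 290, so half is 145.
example_lengths = [80, 70, 50, 40, 30, 20]
example_n50 = n50(example_lengths)
print(example_n50)
```

Here the two longest contigs (80 + 70 = 150) already pass the halfway mark, so the N50 is 70. A few very long contigs can coexist with a modest N50, which is the situation described above.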



      • #4
        If your goal is to identify the genes on a presence/absence basis, then the Illumina reads will be more than sufficient. The best route would be to map them against your reference, and use a variant caller to look for differences between each strain and your reference. You can also assemble the reads de novo and get a pretty good idea of what's present in your strains but not in the reference.
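To give a feel for what "map, then call variants" means, here is a deliberately naive sketch: it piles up already-aligned reads over a reference and reports positions where a clear majority of reads disagree with the reference base. This is a toy majority-vote caller, not how real variant callers work (they model base and mapping qualities, indels, and so on). The function name and the `min_depth` and `min_fraction` thresholds are assumptions chosen for illustration.

```python
from collections import Counter
from typing import List, Tuple


def call_snps(reference: str,
              aligned_reads: List[Tuple[int, str]],
              min_depth: int = 3,
              min_fraction: float = 0.8) -> List[Tuple[int, str, str, int]]:
    """Toy SNP caller over gap-free alignments.

    aligned_reads holds (0-based start position, read sequence) tuples.
    Returns (position, reference base, called base, depth) tuples.
    """
    pileup = [Counter() for _ in reference]
    for start, seq in aligned_reads:
        for offset, base in enumerate(seq):
            pos = start + offset
            if 0 <= pos < len(reference):
                pileup[pos][base] += 1

    snps = []
    for pos, counts in enumerate(pileup):
        depth = sum(counts.values())
        if depth < min_depth:
            continue  # too few reads to say anything
        base, count = counts.most_common(1)[0]
        if base != reference[pos] and count / depth >= min_fraction:
            snps.append((pos, reference[pos], base, depth))
    return snps


# Made-up data: every read carries a T where the reference has an A (pos 4).
reference = "ACGTACGTAC"
reads = [(0, "ACGTTCGT"), (2, "GTTCGTAC"), (1, "CGTTCGTA"), (3, "TTCG")]
snps = call_snps(reference, reads)
print(snps)
```

With several hundred-fold coverage, the depth at almost every position would be far above any sensible threshold, which is part of why Illumina data is well suited to this kind of comparison.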

        Where PacBio is best suited is if you need to know physical arrangement of the genes. The Illumina data may be able to tell you if there are any rearrangements in the genome that affect your genes of interest, but it's a lot harder than if you have the long reads that PacBio can give you.

        Having said all that, $700/strain seems like a pretty high price to me. I don't know what sort of facilities are available to you and what their overhead is, but you could easily make all of the libraries using Nextera XT (~$150/library) and sequence them on a MiSeq for ~$1000. That would give you 2x250 bp reads, which would give you better rearrangement information than the 2x100 bp reads from the HiSeq, and if you're only doing 3 strains then you should be able to get ~2 Gbp of data per sample with the MiSeq.



        • #5
          Hi guys,

          Thanks for the helpful replies. Just to give you a better idea of what I'm hoping to achieve here are some more details of what I'm working with.

          Ideally I'd be comparing a strain that can't degrade the plant-derived compounds we're interested in to one that can, but unfortunately that isn't quite the case. The three strains that I want to sequence are all phenotypically similar in that they are all able to degrade one or both of the two plant toxins, but they do it to differing degrees. One strain is particularly bad at it and is barely able to degrade the second compound at all.

          As a very simple first step to look for genomic differences between the strains I've tried producing restriction digest fingerprints, which show subtle, but noticeable differences.

          Our thinking therefore is that the genes involved in this pathway are present in all of the strains, but that there might be SNPs or other small differences between them. Of course these differences might be located elsewhere, such as in regulatory genes. At this stage we have virtually no information at all about where these genes might be located in the genome or what homologous sequences we should be looking for.

          So - whatever sequencing method we end up using needs to be able to produce data that allows us to resolve potentially small differences between the genomes. With my limited experience in genome assembly and analysis I don't know how feasible it would be to map short Illumina reads to the reference genome and use this to try to identify SNPs.

          Based on what I've read and the replies I've received here so far, it sounds like PacBio might be a good option. I'll happily jump on the latest technological bandwagon if it can produce the results we're after. I've contacted Millennium to see what they can do for us in terms of price.



          • #6
            Illumina data is fine for SNPs. With enough coverage, you could likely de novo assemble any genes present in your samples not present in the reference.
