
  • Variant discovery in experimental evolution

    I have some fairly basic questions about software tools to use on my data, which is from a population of yeast evolved experimentally over about 2-3 weeks. I'm asking this because most of the tools I've been finding discuss applications that sound very different from mine (diploid, especially human, data from natural populations that have evolved for a very long time away from the reference sequence), so I would appreciate any guidance from those more knowledgeable.

    My data consists of reads (76 bp) from a novel strain of yeast evolved over a fairly short time under stressed conditions, so presumably many variants are low-frequency (due to the short time) and driven by selection (due to the stress). I've already aligned them using Bowtie 2. Coverage is in the neighborhood of 100x in most places.

    Let's say I am just interested in identifying variants (rather than inferring more detailed things like allele frequencies) -- basically I would like a list of where these variants are, how many reads they appeared on, some statistical criterion to evaluate their quality, etc. I have tried SAMtools and FreeBayes for this, but they have a lot of technical details -- about the allele frequency spectra, genotyping, Bayesian analysis, neutral evolution priors -- that I don't yet understand, and I don't know if they are even relevant for this level of analysis, which I believe to be pretty simple. Ideally, I would prefer something that just searches through the alignment and reports everywhere reads don't match the reference, with each mismatch then scored in some way using that read's mapping quality and base quality. I would prefer not to impose any prior knowledge about a model (e.g., neutral evolution), but based on the documentation I've read, I can't tell if this is relevant here or not.
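
    To make the "report every mismatch, filtered by quality" idea concrete, here is a minimal sketch in plain Python. It is not how SAMtools, FreeBayes, or any other tool works internally. It assumes ungapped alignments already reduced to (start, sequence, base qualities, mapping quality) tuples, and the function name `tally_mismatches` and the thresholds of 20 are hypothetical choices for illustration.

```python
from collections import Counter, defaultdict


def tally_mismatches(reference, reads, min_mapq=20, min_baseq=20):
    """List positions where reads disagree with the reference.

    reads: iterable of (start, sequence, base_qualities, mapq) with a
    0-based start and no indels (a simplifying assumption).
    Returns rows of (position, ref_base, alt_base, alt_reads, depth),
    where depth counts only bases that passed both quality filters.
    """
    depth = Counter()
    alt_counts = defaultdict(Counter)
    for start, seq, quals, mapq in reads:
        if mapq < min_mapq:
            continue  # ignore reads whose placement is doubtful
        for offset, (base, qual) in enumerate(zip(seq, quals)):
            pos = start + offset
            if pos >= len(reference) or qual < min_baseq:
                continue  # ignore low-confidence base calls
            depth[pos] += 1
            if base != reference[pos]:
                alt_counts[pos][base] += 1
    rows = []
    for pos in sorted(alt_counts):
        for alt, n in sorted(alt_counts[pos].items()):
            rows.append((pos, reference[pos], alt, n, depth[pos]))
    return rows


# Toy data, invented for illustration only.
reference = "ACGTACGT"
reads = [
    (0, "ACGTAC", [30] * 6, 40),                  # matches the reference
    (2, "GTTCGT", [30] * 6, 40),                  # A>T mismatch at position 4
    (2, "GTTCGT", [30, 30, 10, 30, 30, 30], 40),  # same mismatch, low base quality
    (1, "CGTTCG", [30] * 6, 5),                   # low mapping quality, skipped
]
rows = tally_mismatches(reference, reads)
print(rows)
```

    Each row gives the position, the supporting read count, and the filtered depth, which is roughly the kind of list asked for above; a real pipeline would also need to handle indels, soft clipping, and strand bias.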

    Again, thanks for help anyone can provide!

  • #2
    Maybe VarScan2? But how to validate that one?

    Hi!

    I have a very similar setup and been playing around with VarScan2. I think this would give you the information you want, i.e. a list of variants at a certain coverage (or other filter) threshold.
    On the other hand, I stumbled upon your post when coming here to ask what to use instead of VarScan2, or, to put it the other way around, which program to use to get a second 'opinion' alongside VarScan. I just started playing with FreeBayes, but it seems to be doing a lot of things I am not really interested in, and maybe the information I want gets obscured that way.
    What do people think about GATK?

    So, maybe I put my question here as well:
    Given a self-assembled reference genome, how to best and most reliably call and quantify SNPs (and indels) from some mapped (Bowtie 2, most stringent settings) reference genomes (Illumina sequencing)?


    Thanks

    Phil

    • #3
      VarScan does seem to do what I suggested -- thanks very much for the suggestion!

      So I guess now the question is what are the strengths and weaknesses of this approach -- which is based on heuristic filtering of variants according to coverage, mapping/base quality, etc. -- and the Bayesian approach that, from my understanding, is implemented in SAMtools, FreeBayes, GATK, and most of the other ones I looked at.

      My impression is that the Bayesian approach is more kosher on statistical grounds, but it critically requires a prior distribution of allele frequencies, which has to come from a model. So if you think your SNPs are truly neutral, a standard Wright-Fisher AFS prior may be adequate, but if you're looking for SNPs that are selected (as I am in my experimental evolution data), then I don't think that is valid. So in the absence of a reliable model to serve as a prior, the heuristic filtering approach employed by VarScan seems to be the only alternative.

      Am I on the right track here? I'd love to hear from those more experienced with this analysis.

      • #4
        For the case you are talking about, in particular resequencing yeast from experimental evolution, the best tool to use out of the box is BreSeq



        It misses some mutations, but catches almost all of them, has an easy to use setup, and outputs files for you to evaluate things.

        As for GATK, etc., it is great, but definitely not designed with yeast in mind. I will be posting a blog entry discussing how to use it for resequencing in experimental evolution projects in the next couple of days at

        evolutionary biologist, statistician, nice guy


        As for SNP calling, if you have ~100X coverage, the choice between a Bayesian and another approach to SNP calling won't matter much (I assume you are working with isolated plates and are not re-sequencing the whole population). The prior has little effect with that much data for a haploid organism. I would try a couple of packages and make sure they agree on the called variants.
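
        A rough way to see why the prior washes out at high depth is a two-hypothesis calculation for a single haploid site. This is only a sketch under stated assumptions, not the model used by any particular caller: the clone is either reference or alternate, every read shows the wrong allele independently with probability `error_rate`, and the function name `posterior_alt` and the prior values are hypothetical.

```python
import math


def posterior_alt(n_alt, depth, prior_alt, error_rate=0.01):
    """Posterior probability that a haploid clone carries the alternate base.

    Assumes only two hypotheses (reference or alternate) and independent
    read errors at a fixed rate. The binomial coefficient is the same under
    both hypotheses, so it cancels and is left out.
    """
    n_ref = depth - n_alt
    log_like_alt = n_alt * math.log(1 - error_rate) + n_ref * math.log(error_rate)
    log_like_ref = n_ref * math.log(1 - error_rate) + n_alt * math.log(error_rate)
    log_post_alt = math.log(prior_alt) + log_like_alt
    log_post_ref = math.log(1 - prior_alt) + log_like_ref
    # Normalise in log space so high depth does not underflow.
    top = max(log_post_alt, log_post_ref)
    w_alt = math.exp(log_post_alt - top)
    w_ref = math.exp(log_post_ref - top)
    return w_alt / (w_alt + w_ref)


# Compare very different priors at high and at low depth.
for prior in (1e-3, 1e-6, 1e-9):
    high = posterior_alt(n_alt=95, depth=100, prior_alt=prior)
    low = posterior_alt(n_alt=3, depth=4, prior_alt=prior)
    print(f"prior={prior:g}  depth 100: {high:.6f}  depth 4: {low:.6f}")
```

        With 95 of 100 reads supporting the variant, the posterior is essentially 1 whichever prior is used, while with 3 of 4 reads the answer swings with the prior. That holds for a clonal isolate; a pooled population with low-frequency variants is a different problem.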

        • #5
          Thanks for the reply!

          I had looked into BreSeq before, having been familiar with Lenski's work, but I rejected it because I thought it was ill-suited to a genome of our size -- the main BreSeq documentation



          suggests it for genomes < 10 Mb, while the Google Code page you sent suggests it for genomes < 20 Mb. But I suppose these aren't hard cut-offs, so yeast (~12 Mb) should still be fine?

          It does sound like a good idea to compare a few tools, especially to see if the priors have no appreciable effect like you say. I look forward to reading your blog entry when it comes up!
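
          One simple way to compare tools is to reduce each caller's VCF output to (chromosome, position, ref, alt) keys and take set intersections. The sketch below assumes standard tab-separated VCF records, whose first five columns are CHROM, POS, ID, REF, ALT; the function names and the toy records are hypothetical.

```python
def variant_keys(vcf_lines):
    """Return the set of (chrom, pos, ref, alt) keys from VCF text lines."""
    keys = set()
    for line in vcf_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and header lines
        chrom, pos, _id, ref, alts = line.rstrip("\n").split("\t")[:5]
        for alt in alts.split(","):  # one key per alternate allele
            keys.add((chrom, int(pos), ref, alt))
    return keys


def compare_calls(vcf_a, vcf_b):
    """Split two call sets into shared and caller-specific variants."""
    a, b = variant_keys(vcf_a), variant_keys(vcf_b)
    return {"shared": a & b, "only_a": a - b, "only_b": b - a}


# Toy records, invented for illustration only.
caller_a = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "chrI\t100\t.\tA\tG\t50\tPASS\t.",
    "chrI\t250\t.\tC\tT\t12\tPASS\t.",
]
caller_b = [
    "chrI\t100\t.\tA\tG\t99\tPASS\t.",
    "chrII\t40\t.\tG\tA,C\t30\tPASS\t.",
]
result = compare_calls(caller_a, caller_b)
for label, variants in result.items():
    print(label, sorted(variants))
```

          In practice you would pass open file handles instead of lists. Note that indels can be written differently by different callers, so exact-key matching is most trustworthy for SNPs unless the files are normalised first.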

          • #6
            I almost finished the blog entry tonight, but it will have to wait for another day. Glad to hear I have at least one reader, though! (I spent some time building a website on my research while looking for a job last year, but it turned out all the traffic to the website went to one blog post I wrote that I thought nobody would care about, a great joke on me. It did, however, motivate me to start posting other possibly useful tidbits online.)

            Actually, if enough people think it would be useful for calling SNPs, I can also bundle all the GATK tools into one command-line program for people working with genomes under a gigabase in experimental evolution. I really like the GATK, but the one substantial advantage of breseq is that it makes a solid attempt to discover structural variants (like insertion sequences and large deletions). These are too hard to find systematically in large genomes, but can be found in microbes. As for breseq, I feel 99.9% certain that a 12 Mb genome will be absolutely fine. There are two possible caveats to this:

            1- They actually coded it so that it has hard memory limits. However, this is extremely unlikely; it almost certainly has completely dynamic memory allocation.

            2- You do not have enough RAM on your computer. This is more likely, but you can diagnose it pretty easily (you will get something along the lines of an out-of-memory error). The last time I ran the program on a ~6 Mb genome it took ~1072 MB of RAM. Assuming that scales linearly with genome size, a computer with at least 4 GB of RAM should be fine.
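
            The back-of-the-envelope estimate in caveat 2 can be written out as below. The linear scaling is the assumption stated above, not a measured property of breseq, and the function name `estimate_ram_mb` is hypothetical.

```python
def estimate_ram_mb(genome_mb, ref_genome_mb=6.0, ref_ram_mb=1072.0):
    """Linearly extrapolate RAM use from one observed run.

    Defaults come from the single data point in this post (~6 Mb genome,
    ~1072 MB of RAM); proportional scaling is an assumption.
    """
    return ref_ram_mb * genome_mb / ref_genome_mb


yeast_ram_mb = estimate_ram_mb(12.0)  # ~12 Mb yeast genome
print(f"Estimated RAM for yeast: {yeast_ram_mb:.0f} MB")
```

            That comes to roughly 2.1 GB for yeast, which is why 4 GB leaves comfortable headroom if the scaling assumption holds.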

            Which is to say, give it a go!
