
**PHSchi** · 01-27-2013, 03:25 AM

Maybe VarScan2? But how to validate that one?

Hi!

I have a very similar setup and have been playing around with VarScan2. I think this would give you the information you want, i.e. a list of variants at a certain coverage (or other filter) threshold.
On the other hand, I stumbled upon your post when coming here to ask the reverse question: which program should I use to get a second 'opinion' alongside VarScan? I just started playing with freebayes, but it seems to be doing a lot of things I am not really interested in, and the information I want may get obscured that way.
What do people think about GATK?

So, maybe I should put my question here as well:
Given a self-assembled reference genome, how do I best and most reliably call and quantify SNPs (and indels) from some resequenced genomes (Illumina sequencing) mapped to it with bowtie2 at the most stringent settings?

Thanks

Phil

**mmanhart** · 02-01-2013, 02:17 PM

VarScan does seem to do what I suggested -- thanks very much for the suggestion!

So I guess the question now is what the strengths and weaknesses of this approach are -- it is based on heuristic filtering of variants according to coverage, mapping/base quality, etc. -- relative to the Bayesian approach that, from my understanding, is implemented in SAMtools, FreeBayes, GATK, and most of the other tools I looked at.
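The heuristic approach described above amounts to applying hard thresholds to each candidate site. Here is a minimal Python sketch of that idea. It is not VarScan's actual code: the `Candidate` fields, the function name, and every threshold value are hypothetical placeholders, chosen with a haploid clone in mind.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    """One candidate variant site (all field names are illustrative)."""
    chrom: str
    pos: int
    ref: str
    alt: str
    depth: int             # total reads covering the site
    alt_reads: int         # reads supporting the alternate allele
    mean_base_qual: float  # mean Phred base quality of alt-supporting bases
    mean_map_qual: float   # mean mapping quality of alt-supporting reads


def passes_filters(c, min_depth=20, min_alt_reads=4, min_alt_freq=0.8,
                   min_base_qual=20.0, min_map_qual=30.0):
    """Return True if the site clears every hard threshold.

    The default thresholds are assumptions for illustration only.
    """
    if c.depth < min_depth or c.alt_reads < min_alt_reads:
        return False
    if c.alt_reads / c.depth < min_alt_freq:
        return False
    return c.mean_base_qual >= min_base_qual and c.mean_map_qual >= min_map_qual
```

No prior enters anywhere in this scheme, which is its appeal when no population model is trusted. The cost is that the thresholds themselves are arbitrary and have to be tuned per data set.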

My impression is that the Bayesian approach is more kosher on statistical grounds, but it critically requires a prior distribution of allele frequencies, which has to come from a model. So if you think your SNPs are truly neutral, a standard Wright-Fisher AFS prior may be adequate, but if you're looking for SNPs that are under selection (as I am in my experimental evolution data), then I don't think that prior is valid. So in the absence of a reliable model to serve as a prior, the heuristic filtering approach employed by VarScan seems to be the only alternative.

Am I on the right track here? I'd love to hear from those more experienced with this analysis.

**ndelaney** · 02-01-2013, 04:07 PM

For the case you are talking about, in particular resequencing yeast from experimental evolution, the best tool to use out of the box is BreSeq:

http://code.google.com/p/breseq/

It misses some mutations but catches almost all of them, is easy to set up, and outputs files that let you evaluate the calls.

As for GATK and similar tools: they are great, but definitely not designed with yeast in mind. I will be posting a blog entry discussing how to use GATK for resequencing in experimental evolution projects in the next couple of days at

https://www.evolvedmicrobe.com/blogs

As for SNP calling, if you have ~100X coverage, whether you use a Bayesian or some other approach won't matter much (I assume you are working with isolated plates and are not re-sequencing the whole population). The prior has little effect with that much data for a haploid organism. I would try a couple of packages and make sure they agree on the called variants.
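The claim that the prior washes out at high coverage can be checked with a toy calculation. The sketch below is a deliberately simplified model and not what SAMtools, FreeBayes, or GATK actually compute. It assumes a haploid site with only two possible genotypes (reference or alternate), independent reads, and a single symmetric per-read error rate; the function name and the default error rate are illustrative.

```python
import math


def posterior_alt(n_reads, n_alt, prior_alt, error_rate=0.01):
    """Posterior probability that a haploid site carries the alternate allele.

    Toy model: each read independently shows the true allele with
    probability 1 - error_rate and the other allele with probability
    error_rate. The binomial coefficient cancels in the likelihood ratio.
    """
    # log [ P(data | alt) / P(data | ref) ]
    log_lr = (2 * n_alt - n_reads) * math.log((1 - error_rate) / error_rate)
    # add the prior log-odds to get the posterior log-odds
    log_odds = log_lr + math.log(prior_alt / (1 - prior_alt))
    # numerically stable logistic transform
    if log_odds >= 0:
        return 1.0 / (1.0 + math.exp(-log_odds))
    odds = math.exp(log_odds)
    return odds / (1.0 + odds)


# Compare a very sceptical prior with a flat one at high and low coverage.
for n_reads, n_alt in [(100, 98), (3, 2)]:
    for prior in (1e-9, 1e-3, 0.5):
        p = posterior_alt(n_reads, n_alt, prior)
        print(f"reads={n_reads} alt={n_alt} prior={prior:g} posterior={p:.6f}")
```

Under these assumptions, 98 alternate reads out of 100 give a posterior indistinguishable from 1 whether the prior is 1e-9 or 0.5, because the likelihood ratio dwarfs any plausible prior odds. With 2 alternate reads out of 3, the same change of prior moves the call from near-certain to very unlikely, which is where the choice of prior genuinely matters.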

**mmanhart** · 02-01-2013, 04:30 PM

Thanks for the reply!

I had looked into BreSeq before, being familiar with Lenski's work, but I rejected it because I thought it was ill-suited to a genome of our size. The main BreSeq documentation

http://barricklab.org/twiki/pub/Lab/ToolsBacterialGenomeResequencing/documentation/introduction.html

suggests it for genomes < 10 Mb, while the Google Code page you sent suggests it for genomes < 20 Mb. But I suppose these aren't hard cut-offs, so yeast (~12 Mb) should still be fine?

It does sound like a good idea to compare a few tools, especially to check that the priors have no appreciable effect, as you say. I look forward to reading your blog entry when it comes up!
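Comparing tools comes down to set operations once each call set is reduced to a common key. A minimal Python sketch, assuming each caller's output has already been parsed into `(chrom, pos, ref, alt)` tuples; the parsing step, the function name, and the example calls below are all made up for illustration.

```python
def compare_calls(calls_a, calls_b):
    """Split two call sets into shared and caller-specific variants.

    Each call is a (chrom, pos, ref, alt) tuple. Indels may be represented
    differently by different callers, so they should be normalised before
    comparison; that step is not shown here.
    """
    a, b = set(calls_a), set(calls_b)
    return {"shared": a & b, "only_a": a - b, "only_b": b - a}


# Made-up example calls, purely to show the shape of the input and output.
caller_a = [("chrI", 1000, "A", "G"), ("chrII", 2500, "C", "T")]
caller_b = [("chrI", 1000, "A", "G"), ("chrIV", 777, "G", "A")]

result = compare_calls(caller_a, caller_b)
for label in ("shared", "only_a", "only_b"):
    print(label, sorted(result[label]))
```

Variants in `shared` are the high-confidence set; the caller-specific ones are the sites worth inspecting by eye in the alignments.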

**ndelaney** · 02-01-2013, 06:29 PM

Almost finished the blog entry tonight, but it will have to wait for another day. Glad to hear I have at least one reader, though! (I spent some time building a website on my research while looking for a job last year, but it turned out all the traffic to the website went to one blog post I wrote that I thought nobody would care about, a great joke on me. It did, however, motivate me to start posting other possibly useful tidbits online.)

Actually, if enough people think it would be useful for calling SNPs, I can also bundle all the GATK tools into one command line program for people working with genomes under a gigabase in experimental evolution. I really like the GATK, but the one substantial advantage of breseq is that it makes a solid attempt to discover structural variants (like insertion sequences and large deletions). These are too hard to find systematically in large genomes, but can be found in microbes. As for breseq, I feel 99.9% certain that a 12 Mb genome will be absolutely fine. There are two possible caveats to this:

1- They actually coded it so that it has hard memory limits. However, this is extremely unlikely; it almost certainly uses completely dynamic memory allocation.

2- You do not have enough RAM on your computer. This is more likely, but you can diagnose it pretty easily (you will get something along the lines of an out-of-memory error). The last time I ran the program on a ~6 Mb genome it took ~1072 MB of RAM. Assuming that scales linearly with genome size, a ~12 Mb genome would need roughly 2 GB, so I would say find a computer with at least 4 GB and you should be fine to run things.
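The memory estimate is a one-line linear extrapolation from the single data point quoted above (~6 Mb genome, ~1072 MB of RAM). A small Python sketch, with the caveat that linear scaling is an assumption and that actual usage will also depend on read depth and program version; the function name is illustrative.

```python
def estimate_ram_mb(genome_mb, ref_genome_mb=6.0, ref_ram_mb=1072.0):
    """Linearly extrapolate RAM use (in MB) from one observed run.

    Defaults are the figures quoted in the post; linear scaling with
    genome size is an assumption, not a measured property of breseq.
    """
    return ref_ram_mb * genome_mb / ref_genome_mb


print(estimate_ram_mb(12.0))  # yeast-sized genome, ~12 Mb
```

For a ~12 Mb genome this gives 2144 MB, so a 4 GB machine leaves roughly a factor of two of headroom.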

Which is to say, give it a go!

Variant discovery in experimental evolution