Hi all,
We've developed a generic, sensitive and fast method (quite a mouthful, but it's true; see for yourself) for predicting very rare variants, called LoFreq, and I'd be happy if other people would try it out. LoFreq can be used to predict variants in any type of data, e.g. viral or bacterial populations, but also pooled data, human genome data etc. It is restricted to neither haploid nor diploid data. Its main area of application is high-coverage Illumina data (it makes use of base-call quality scores), but as long as you have a SAM/BAM file (ideally with recalibrated qualities) you can simply use it.
LoFreq takes a samtools [m]pileup (via file or pipe) as input and produces a simple list of variants, including associated frequencies, p-values (use these for filtering!), base counts, strand-bias info etc. The format is self-explanatory, plain CSV at the moment, but I will add VCF support soon. For now, if your data consists of several chromosomes, you have to run it once per chromosome.
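Since LoFreq currently has to be run once per chromosome, here is a rough sketch of how that could be scripted around the mpileup-to-LoFreq pipe used in this post. It assumes a coordinate-sorted, indexed BAM (so that samtools idxstats and mpileup -r work). The function names list_chroms and call_chrom and the snp-out.<chrom>.txt output naming are made up for illustration and are not part of LoFreq.

```shell
# Sketch only: run the LoFreq pipe once per chromosome.
# Assumptions (not from the post): the BAM is sorted and indexed, and the
# per-chromosome output file names below are hypothetical.

# Print the reference names stored in the BAM index, one per line.
# samtools idxstats ends with a "*" line for unplaced reads, which we skip.
list_chroms() {
    samtools idxstats "$1" | awk '$1 != "*" { print $1 }'
}

# Call variants on one chromosome by restricting the pileup with -r.
call_chrom() {
    ref=$1
    bam=$2
    chrom=$3
    samtools mpileup -r "$chrom" -f "$ref" "$bam" \
        | lofreq_snpcaller.py -b 1 -o "snp-out.$chrom.txt" -v
}

# Usage (commented out so nothing runs when the snippet is sourced):
# for chrom in $(list_chroms your-bam.bam); do
#     call_chrom your-ref.fa your-bam.bam "$chrom"
# done
```

Each chromosome then ends up in its own output file, which keeps the results easy to filter separately.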
An example call would look like this:
Code:
samtools mpileup -f your-ref.fa your-bam.bam | lofreq_snpcaller.py -b 1 -o snp-out.txt -v
Note that you might want to play with samtools' options, most importantly the depth cap and BAQ. For example, if you have high-coverage data, it's probably best to effectively switch samtools' depth cap off by setting it very high (e.g. -d 100000), so that all the available data is used. Make sure to use a reference FASTA for the pileup, as no calls can be made on columns that have an N as reference.
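For high-coverage data, the depth-cap advice in this post could be wrapped up roughly as follows. This is only a sketch: the function name pileup_highcov is made up, and -d 100000 is just the example value mentioned in the post, not a tuned recommendation.

```shell
# Sketch only: pileup with samtools' depth cap raised so that it is
# effectively off for most data sets (-d 100000, the value from the post).
# Adding -B would disable BAQ; whether that helps depends on your data.
pileup_highcov() {
    ref=$1
    bam=$2
    samtools mpileup -d 100000 -f "$ref" "$bam"
}

# Usage (commented out so nothing runs when the snippet is sourced):
# pileup_highcov your-ref.fa your-bam.bam | lofreq_snpcaller.py -b 1 -o snp-out.txt -v
```

Keeping the pileup step in its own function makes it easy to compare runs with and without BAQ by changing a single line.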
The code is accessible from https://github.com/andreaswilm/LoFreq
(requires Python 2.6 or Python 2.7).
I'd be happy to receive any type of feedback!
Cheers,
Andreas