GATK is a standard tool for calling SNPs; however, its authors do not provide reference genomes or reference SNPs for non-human organisms such as mouse. Here is my quick tutorial for building an mm10 mouse reference genome and a dbSNP reference SNP set from scratch. It's not automated, and I'd appreciate any input to make this workflow more efficient.
1. Build reference mm10 genome.
1.1 Download the reference here: http://ccb.jhu.edu/software/tophat/igenomes.shtml. Make sure you download the "Mus musculus UCSC MM10" reference.
1.2 Untar the file and find the directory that contains the sequence for each individual chromosome. The directory looks like this: "Mus_musculus_UCSC_mm10/Mus_musculus/UCSC/mm10/Sequence/Chromosomes"
Enter the directory.
1.3 Change the chromosome header:
sed -i 's/^>chr/>/' *.fa # strip the "chr" prefix from the header line of every chromosome file
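As a rough sketch of step 1.3, the header edit can be wrapped in a small function that touches only lines starting with ">" and avoids relying on the in-place flag, which behaves differently across sed implementations. The function name strip_chr_prefix is my own, not part of any tool.

```shell
# Hypothetical helper: strip a leading "chr" from FASTA header lines only.
strip_chr_prefix() {
    for fa in "$@"; do
        # Only lines starting with ">" are headers; sequence lines are untouched.
        sed 's/^>chr/>/' "$fa" > "$fa.tmp" && mv "$fa.tmp" "$fa"
    done
}

# Usage (from inside the Chromosomes directory):
#   strip_chr_prefix *.fa
#   grep -h '^>' *.fa    # list the headers to confirm the prefix is gone
```

Listing the headers afterwards is a cheap way to confirm that every file was changed before moving on to the merge.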
1.4 Combine the chromosomes into a full genome:
cat chr1.fa chr2.fa ... chrX.fa chrY.fa > mm10.fa # Make sure you combine the chromosomes in karyotypic order and do not include random or unmapped chromosomes.
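Typing all the file names by hand is error-prone, so here is a minimal sketch of step 1.4 as a loop. It assumes the files are named chr<N>.fa and that the wanted order is the 19 mouse autosomes followed by X and Y; the function name concat_karyotypic is made up for this example.

```shell
# Hypothetical helper: concatenate chr1..chr19, chrX, chrY in karyotypic order.
# Assumes the per-chromosome files are named chr<N>.fa inside the given directory.
concat_karyotypic() {
    dir=$1
    out=$2
    : > "$out"    # start from an empty output file
    for c in 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 X Y; do
        fa="$dir/chr$c.fa"
        if [ ! -f "$fa" ]; then
            echo "missing $fa" >&2
            return 1
        fi
        cat "$fa" >> "$out"
    done
}

# Usage (from inside the Chromosomes directory):
#   concat_karyotypic . mm10.fa
```

Because the loop names each file explicitly, random and unmapped chromosome files in the same directory are simply never read.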
1.5 Index the genome and build the dictionary file:
samtools faidx mm10.fa
java -jar CreateSequenceDictionary.jar R=mm10.fa O=mm10.dict
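Once the index exists, it may be worth checking that the contig names and order came out as intended, since later steps depend on them. The sketch below reads the first column of the .fai file (the sequence name) and compares it against the order 1 to 19, X, Y. The function name and the expected list are my assumptions, so adjust them if you keep other contigs.

```shell
# Hypothetical helper: confirm the .fai index lists contigs as 1..19, X, Y.
check_karyotypic_fai() {
    fai=$1
    expected="1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 X Y"
    # Column 1 of the tab-separated .fai file is the sequence name.
    actual=$(cut -f1 "$fai" | tr '\n' ' ' | sed 's/ $//')
    if [ "$actual" = "$expected" ]; then
        echo "contig order OK"
    else
        echo "unexpected contig order: $actual" >&2
        return 1
    fi
}

# Usage:
#   check_karyotypic_fai mm10.fa.fai
```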
1.6 Create the BWA index:
bwa index -a bwtsw mm10.fa
2. Build the reference mouse SNP set.
2.1 Download the VCF files (reference mouse SNPs):
wget ftp://ftp.ncbi.nih.gov/snp/organisms...f_chr_*.vcf.gz
# Discard the Un, MT, and random chromosome files, then unzip the rest.
# Remove the redundant header (delete the first 14 rows) from every file except chr1:
sed -i "1,14d" chr2.vcf # repeat for every chromosome except chr1
# Merge all VCF files:
cat chr1.vcf chr2.vcf ... chrX.vcf chrY.vcf > dbsnp.vcf
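Deleting a fixed number of rows only works if every file has a header of exactly that length. As an alternative sketch (a different technique from the fixed-row delete above), the merge can keep the first file whole and drop every line starting with "#" from the others, since all VCF header lines begin with that character. The function name merge_vcfs is hypothetical.

```shell
# Hypothetical helper: merge per-chromosome VCFs, keeping only the first file's header.
merge_vcfs() {
    out=$1
    first=$2
    shift 2
    cat "$first" > "$out"    # header plus records of the first file
    for vcf in "$@"; do
        # Keep data records only; grep exits 1 when a file has no records,
        # which is tolerated here, while real errors (exit 2) still fail.
        grep -v '^#' "$vcf" >> "$out" || [ $? -eq 1 ] || return 1
    done
}

# Usage (list the files in the same order as the reference genome):
#   merge_vcfs dbsnp.vcf chr1.vcf chr2.vcf ... chrX.vcf chrY.vcf
```

The original per-chromosome files are left untouched, so the merge can be rerun if the order turns out to be wrong.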
Now you can use BWA to align the raw reads first, and then use GATK to call the SNPs.
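Before running GATK, it may help to confirm that the merged VCF and the reference use the same contig names in the same order, since a mismatch is a common source of errors. The sketch below is a strict check under that assumption: it also flags a VCF that is merely missing a contig, which may be acceptable in practice. The function name check_vcf_contigs is my own.

```shell
# Hypothetical helper: compare contig order in a VCF with the reference .fai index.
check_vcf_contigs() {
    vcf=$1
    fai=$2
    # Contig names in order of first appearance in the VCF records (column 1).
    vcf_contigs=$(grep -v '^#' "$vcf" | cut -f1 | awk '!seen[$0]++')
    # Contig names in the order recorded by samtools faidx.
    ref_contigs=$(cut -f1 "$fai")
    if [ "$vcf_contigs" = "$ref_contigs" ]; then
        echo "contigs match"
    else
        echo "contig names or order differ between $vcf and $fai" >&2
        return 1
    fi
}

# Usage:
#   check_vcf_contigs dbsnp.vcf mm10.fa.fai
```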