Hello,
I'm working on a pipeline for obtaining phased haplotype sequences from diploid organisms. The input data are Illumina reads from reduced-representation libraries, and the goal is to use the phased sequences to estimate gene trees for coalescent analyses. I am working with non-model species, so I have neither a reference genome nor any reference panels for phasing SNPs. I've worked up a pipeline (see below), but given the proliferation of tools out there, I was hoping to get feedback on whether better alternatives exist to the tools I've selected. Pipeline:
1) Demultiplex and clean reads (Casava, Illumiprocessor)
2) De novo assembly (ABySS)
3) Map contigs to reference sequences of interest (in some cases we have a set of reference loci we are interested in recovering; Python scripts already written for this step)
4) Map reads to consensus (BWA)
5) Call SNPs and phase using read information (GATK)
6) Output phased haplotype sequences (custom Python scripts?)
In addition to advice on alternative tools, I would appreciate any input on step (6) above. Are there any tools that can do this? From what I can tell, samtools can output sequences from VCF files of phased SNPs, but these will just contain ambiguity codes rather than two phased haplotype sequences. I don't think GATK has this functionality yet. Will I just have to write a script to take the SNP phasing information from the phased VCF from GATK and add it back into the consensus sequences?
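For the final step, here is a minimal sketch of the kind of script I have in mind. It is not a tested tool, and it makes several assumptions: the VCF is single-sample (or the sample is selected with the hypothetical `sample_index` argument), phased genotypes are written with `|` in the GT field, only SNPs are handled, and all phased sites on a contig are treated as one phase block (phase-set information is ignored). Unphased heterozygous sites are left as the consensus base. The function names `read_fasta` and `phased_haplotypes` are my own, not part of any library.

```python
def read_fasta(lines):
    """Parse FASTA text lines into {name: sequence}."""
    seqs, name = {}, None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].split()[0]  # keep only the ID before any whitespace
            seqs[name] = []
        elif name is not None:
            seqs[name].append(line)
    return {n: "".join(parts) for n, parts in seqs.items()}


def phased_haplotypes(consensus, vcf_lines, sample_index=0):
    """Return {contig: (hap1, hap2)} with SNP alleles substituted.

    Assumption: every phased site on a contig is in the same phase block.
    """
    haps = {n: (list(s), list(s)) for n, s in consensus.items()}
    for line in vcf_lines:
        if line.startswith("#") or not line.strip():
            continue
        fields = line.rstrip("\n").split("\t")
        chrom, pos, ref, alt, fmt = fields[0], int(fields[1]), fields[3], fields[4], fields[8]
        if chrom not in haps:
            continue
        alleles = [ref] + alt.split(",")
        sample = fields[9 + sample_index]
        gt = sample.split(":")[fmt.split(":").index("GT")]
        phased = "|" in gt
        idx = gt.replace("/", "|").split("|")
        if len(idx) != 2 or "." in idx:
            continue  # missing or non-diploid call
        if idx[0] != idx[1] and not phased:
            continue  # unphased heterozygote: keep the consensus base
        calls = [alleles[int(i)] for i in idx]
        if any(a.upper() not in ("A", "C", "G", "T") for a in [ref] + calls):
            continue  # SNPs only; skip indels and symbolic alleles
        for hap, allele in zip(haps[chrom], calls):
            hap[pos - 1] = allele  # VCF positions are 1-based
    return {n: ("".join(a), "".join(b)) for n, (a, b) in haps.items()}


if __name__ == "__main__":
    consensus = read_fasta([">locus1", "ACGTACGT"])
    vcf = [
        "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tsample1",
        "locus1\t2\t.\tC\tT\t50\tPASS\t.\tGT\t0|1",
        "locus1\t5\t.\tA\tG\t50\tPASS\t.\tGT\t1|0",
    ]
    for name, (h1, h2) in phased_haplotypes(consensus, vcf).items():
        print(">%s_hap1\n%s\n>%s_hap2\n%s" % (name, h1, name, h2))
```

The main weakness of this approach is the single-phase-block assumption: with read-backed phasing, SNPs farther apart than a read or read pair can span will usually not be phased relative to each other, so a real script would need to respect phase blocks.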
Thanks,
Mike