Hi Everyone,
I am working with pooled-seq data from several plant populations. From these data we are primarily interested in obtaining nuclear SNP frequencies by population, but we would also like to recover whatever information we can about the chloroplast and mitochondrial genomes. I have the following questions:
1) Is it standard to always include the mitochondrial and chloroplast genomes as part of the reference genome, or do people usually align reads only to the nuclear genome?
2) Do we have to set any of the parameters in BWA differently if we have multiple reference genomes in the same FASTA file (e.g. the nuclear, chloroplast and mitochondrial genomes)?
3) Is it straightforward to separate the results for the nuclear and organellar genomes downstream? For example, once the combined reference genome is indexed, can we partition the SAM/BAM file into data for our analysis of nuclear SNP frequencies and data for our analysis of the chloroplast and mitochondrial genomes?
4) Finally, does anyone have a sense of how common plastid pseudogenes are in plant nuclear genomes? My thinking is that aligning our reads to all three genomes will (in part) let us detect and avoid these types of pseudogenes, but how important is this?
Thank you in advance for the help! Sorry if some of these are naive questions; I am still new to all of this!