I am working on a project that seeks to call SNPs for a non-model organism with no existing reference genome or transcriptome using multiplexed Illumina RNA-seq data.
I used Trinity to assemble a partial 'reference' transcriptome of the most highly expressed transcripts for which we had sufficient coverage, as well as many fragments of lower-expressed transcripts. Then I used BWA to map all data for multiple individuals back to that reference, and finally used GATK to call SNPs.
However, I am running into an issue where reads derived from paralogous genes or a multigene family are mapping back to the same reference contig, creating false SNPs in divergent positions. My evidence of this is that in general one 'allele' (actually a slightly divergent gene) is supported by significantly fewer than half of the reads for a given individual that is called a heterozygote. These 'SNPs' are also generally observed across several individuals, leading me to believe that these are not sequencing/library prep errors.
I think that I will be able to identify these cases with some statistic, but I am wondering if there is a good way to modify the corresponding SAM files to remove the mis-mapped reads, then re-genotype. Has anyone else encountered similar issues, and if so how did you deal with it?
I used Trinity to assemble a partial 'reference' transcriptome of the most highly expressed transcripts for which we had sufficient coverage, as well as many fragments of lower-expressed transcripts. Then I used BWA to map all data for multiple individuals back to that reference, and finally used GATK to call SNPs.
However, I am running into an issue where reads derived from paralogous genes or a multigene family are mapping back to the same reference contig, creating false SNPs in divergent positions. My evidence of this is that in general one 'allele' (actually a slightly divergent gene) is supported by significantly fewer than half of the reads for a given individual that is called a heterozygote. These 'SNPs' are also generally observed across several individuals, leading me to believe that these are not sequencing/library prep errors.
I think that I will be able to identify these cases with some statistic, but I am wondering if there is a good way to modify the corresponding SAM files to remove the mis-mapped reads, then re-genotype. Has anyone else encountered similar issues, and if so how did you deal with it?
Comment