Hello,
I've been doing variant calling on RNA-seq data and have noticed a somewhat troubling trend in how the called variants are distributed along genes. For each variant, I compute the "fraction" of the gene at which it falls, where 0 is the Transcription Start Site (TSS) and 1 is the Transcription End Site (both taken from knownGene.txt from UCSC). When we plot the distribution of these gene fractions (combining data from 36 samples), we get this:
I was expecting a relatively uniform distribution, so I decided to investigate further. My current thought is that the 5' and 3' UTRs have a higher mutation rate, which causes more variants to be called at the ends of the gene than in the middle. In general, 3' UTRs are longer than 5' UTRs (and are somewhat less involved in regulation, possibly making mutations more common), which is how I'm trying to explain the larger number of variants at the end of the gene.
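For reference, here is a minimal sketch of how a gene fraction like the one described above could be computed. It assumes UCSC-style coordinates (0-based txStart, exclusive txEnd) and measures the fraction in genomic coordinates, introns included. The function name and arguments are hypothetical, not taken from any particular pipeline. The main point is that on the minus strand the TSS sits at txEnd, so the offset has to be measured from the other side:

```python
def gene_fraction(pos, tx_start, tx_end, strand):
    """Fraction of the gene at which a variant falls (0 = TSS, approaching 1 at the end).

    Assumes pos is 0-based, tx_start is 0-based inclusive and tx_end is
    exclusive, as in UCSC knownGene.txt. Introns are counted in the length.
    """
    length = tx_end - tx_start
    if length <= 0:
        raise ValueError("tx_end must be greater than tx_start")
    if not tx_start <= pos < tx_end:
        raise ValueError("position lies outside the transcript")
    if strand == "+":
        offset = pos - tx_start          # TSS is the leftmost base
    elif strand == "-":
        offset = (tx_end - 1) - pos      # TSS is the rightmost base
    else:
        raise ValueError("strand must be '+' or '-'")
    return offset / length
```

If strand is ignored, minus-strand genes are flipped and the 5' and 3' ends get mixed together in the combined plot.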
To test this hypothesis, I split each gene into 5' UTR, coding region, and 3' UTR (using the UTR lengths from foldUTR3/5 from UCSC) and then plotted the distribution of variants within the coding region only. The peaks at the edges decrease in magnitude, but they are still quite prominent:
Additionally, I calculated (number of variants)/(total nucleotides) for each of the three regions, getting:
5' UTR: 0.0003928025
Coding region: 0.00008306061
3' UTR: 0.001019351
This makes sense, since the coding region is more conserved than the UTRs.
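To put these densities on a common scale, each one can be expressed relative to the coding region. The sketch below only redoes that arithmetic with the values quoted above; the dictionary and function names are illustrative.

```python
# Variants per nucleotide, copied from the figures quoted above.
density = {
    "5' UTR": 0.0003928025,
    "coding": 0.00008306061,
    "3' UTR": 0.001019351,
}

def fold_over_coding(density):
    """Return each region's variant density divided by the coding-region density."""
    baseline = density["coding"]
    return {region: value / baseline for region, value in density.items()}

for region, fold in fold_over_coding(density).items():
    print(f"{region}: {fold:.1f}x coding")
```

With these numbers the 5' UTR comes out at roughly 4.7 times the coding-region density and the 3' UTR at roughly 12.3 times.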
However, I'm unsure why variants are still so strongly biased towards the end of the coding region. My guess is that the UTR annotations in UCSC are not always completely accurate, so some of the "coding regions" actually include portions of 3' UTRs, which have higher mutation rates and would explain the trend in the data.
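One way to cross-check the boundaries might be to classify each variant directly from the cdsStart and cdsEnd columns of knownGene.txt and compare the result with the regions derived from the foldUTR3/5 lengths. The sketch below is only an assumption about how that could look: the function name is hypothetical, it works in genomic coordinates, and it does not check whether the position falls in an exon or an intron.

```python
def classify_region(pos, cds_start, cds_end, strand):
    """Label a 0-based genomic position as 5'UTR, CDS, or 3'UTR.

    Assumes UCSC-style half-open intervals (cds_start inclusive,
    cds_end exclusive). Transcripts with cds_start == cds_end are
    treated as noncoding. Exon/intron structure is ignored.
    """
    if cds_start >= cds_end:
        return "noncoding"
    if cds_start <= pos < cds_end:
        return "CDS"
    left_of_cds = pos < cds_start
    if strand == "+":
        return "5'UTR" if left_of_cds else "3'UTR"
    # On the minus strand the transcript runs right to left,
    # so the side of the CDS that is 5' is reversed.
    return "3'UTR" if left_of_cds else "5'UTR"
```

Variants whose label changes between the two annotation sources would be the ones worth looking at near the end of the coding region.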
Does anyone have experience with how trustworthy the UTR annotations in UCSC are (or have a better source for them)? Alternatively, has anyone seen trends like this before?
Thanks in advance.