I recently ran an RNA-seq experiment that was mapped with STAR through a package called zUMIs. Our reads are typically 66 bp, and our past experiments were mapped to the human genome, but this time we ended up with 50 bp reads mapped to the mouse genome. Our final counts dataset seems to be dominated by pseudogenes: close to 25% of the UMIs are assigned to pseudogenes.
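For context, the pseudogene share quoted here is simply pseudogene UMIs divided by total UMIs. Below is a minimal sketch of that calculation, assuming a per-gene UMI total and a gene-to-biotype lookup taken from the annotation (Ensembl/GENCODE biotypes such as `processed_pseudogene` contain the substring "pseudogene"). The function name, gene names, and counts are made up for illustration and are not zUMIs output.

```python
def pseudogene_umi_fraction(umi_counts, biotypes):
    """Return the fraction of all UMIs assigned to pseudogene-annotated genes.

    umi_counts: dict mapping gene ID -> total UMI count (summed over cells)
    biotypes:   dict mapping gene ID -> biotype string from the annotation
    """
    total = sum(umi_counts.values())
    if total == 0:
        return 0.0
    # Any biotype containing "pseudogene" is counted (processed, unprocessed, ...).
    pseudo = sum(
        n for gene, n in umi_counts.items()
        if "pseudogene" in biotypes.get(gene, "")
    )
    return pseudo / total


# Hypothetical toy data, for illustration only.
umi_counts = {"GeneA": 600, "GeneA-ps1": 200, "GeneB": 150, "GeneB-ps1": 50}
biotypes = {
    "GeneA": "protein_coding",
    "GeneA-ps1": "processed_pseudogene",
    "GeneB": "protein_coding",
    "GeneB-ps1": "unprocessed_pseudogene",
}

fraction = pseudogene_umi_fraction(umi_counts, biotypes)
print(f"Pseudogene UMI fraction: {fraction:.1%}")
```

Genes missing from the biotype lookup are treated as non-pseudogenes in this sketch, which would understate the fraction if the annotation is incomplete.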
Pseudogenes can of course be transcribed, but in our case it seems more likely that we have a mapping issue (I'm guessing due to the shorter read length). Given our biological context, we certainly don't expect a massive number of transcribed pseudogenes. My question: is there any way to coalesce the counts between genes and their corresponding pseudogenes (without remapping)? And if there isn't a good way to handle this, which STAR settings should we try adjusting to favor mapping to canonical genes over pseudogenes?
Thanks!