I am aligning reads to the GRCh37 human reference genome using STAR and am unsure how to get the most reliable results. At first I aligned my reads with STAR without any pre-processing and got essentially two groups: one that aligned well, with 85-95+% of reads mapping uniquely, and another with only about 60% mapping uniquely. I then used sickle to quality-trim the reads (q<20). The group that mapped 85-95+% uniquely lost about 1 million reads and still maps at a similar percentage. The group that mapped about 60% uniquely now maps >80% uniquely, but lost millions of uniquely mapped reads. Although millions of reads are lost, on average ~90% of reads survive quality trimming, with a maximum of 21% of reads lost. My question is: which pipeline should I trust? Should I care more about the raw number of uniquely mapped reads or about the percentage? Are the reads lost to quality trimming untrustworthy anyway?
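To make the count-versus-percentage comparison concrete, here is a minimal Python sketch of how the two numbers could be pulled from STAR's `Log.final.out` and put on a common denominator. It assumes the summary contains the fields "Number of input reads" and "Uniquely mapped reads number" separated from their values by `|`; the function names and the `raw_read_count` parameter are hypothetical, and this is only an illustration, not a recommendation of either pipeline.

```python
def parse_star_log(text):
    """Return {field name: value string} from the text of a STAR Log.final.out file.

    Assumes each metric line has the form "<field name> | <value>".
    """
    fields = {}
    for line in text.splitlines():
        if "|" not in line:
            continue  # section headers and blank lines carry no value
        key, value = line.split("|", 1)
        fields[key.strip()] = value.strip()
    return fields


def unique_mapping_summary(log_text, raw_read_count):
    """Report uniquely mapped reads as a count and as two percentages.

    raw_read_count (hypothetical parameter) is the number of reads in the
    sample before any trimming, so trimmed and untrimmed runs can be
    compared against the same denominator.
    """
    fields = parse_star_log(log_text)
    input_reads = int(fields["Number of input reads"])
    unique_reads = int(fields["Uniquely mapped reads number"])
    return {
        "unique_reads": unique_reads,
        # Percentage of the reads STAR actually received (post-trimming, if trimmed)
        "pct_of_star_input": 100.0 * unique_reads / input_reads,
        # Percentage of the reads originally sequenced for the sample
        "pct_of_raw_reads": 100.0 * unique_reads / raw_read_count,
    }
```

Called once on the untrimmed run and once on the trimmed run with the same `raw_read_count`, this shows whether a higher percentage after trimming reflects more uniquely mapped reads or only a smaller denominator.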
I have done a lot of reading on this and have yet to find a definitive consensus. Any help or guidance would be much appreciated!