Originally posted by lh3
Thanks a lot for posting the data. I have looked at it and I think the results are quite instructive.
First, I re-ran STAR on this subset of 503 erroneously spliced alignments with the following parameters:
--scoreDelOpen -1 --scoreDelBase -1 --scoreInsOpen -1 --scoreInsBase -1 --scoreGap -2 --scoreGapNoncan -100 --outFilterMatchNmin 95 --alignIntronMax 100000 --seedSearchStartLmax 25
Compared to the default parameters, this lowers the penalty for indels, introduces a non-zero penalty for canonical junctions, and sets a very large penalty for non-canonical junctions. It also reduces the maximum junction gap to 100kb (from ~600kb by default) and requires at least 95 out of 101 bases to match, to avoid poor-quality alignments.
The --seedSearchStartLmax 25 option increases STAR's sensitivity to alignments with short overhangs over indels or junctions.
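For anyone who wants to repeat this, here is a rough sketch of how the command line could be assembled in Python. The tuned options are exactly the ones listed above; the index directory "genome_index" and the read file "subset_503.fastq" are made-up placeholders, and the helper star_command is just something I named for illustration, not part of STAR.

```python
import shlex

# Options copied from the post; values are the ones given above.
TUNED_OPTIONS = {
    "--scoreDelOpen": -1,
    "--scoreDelBase": -1,
    "--scoreInsOpen": -1,
    "--scoreInsBase": -1,
    "--scoreGap": -2,
    "--scoreGapNoncan": -100,
    "--outFilterMatchNmin": 95,
    "--alignIntronMax": 100000,
    "--seedSearchStartLmax": 25,
}


def star_command(genome_dir, read_files, options=TUNED_OPTIONS):
    """Build a STAR argument list (hypothetical helper, not part of STAR)."""
    cmd = ["STAR", "--genomeDir", genome_dir, "--readFilesIn", *read_files]
    for flag, value in options.items():
        cmd += [flag, str(value)]
    return cmd


# Placeholder paths: substitute your own index and reads.
cmd = star_command("genome_index", ["subset_503.fastq"])
print(shlex.join(cmd))
```

The resulting list can be handed to subprocess.run(cmd, check=True) once the placeholder paths point at real files.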
With these parameters STAR produces just 29 spliced reads. I have compared these 29 spliced alignments with the bwa-mem alignments in the attached Excel files. Interestingly, for 23 out of 29 alignments, STAR finds spliced alignments with the same or smaller edit distance (marked in green in the Excel file). If these were truly RNA-seq data, these would be the cases where it is impossible to differentiate between spliced and unspliced alignments when reads are considered separately from each other. Possibly, they can be filtered out by requiring that their junctions are supported by more than one read.
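The "more than one read" filter could be sketched roughly as follows, assuming you work from STAR's SJ.out.tab file, where column 7 is the number of uniquely mapping reads crossing each junction. The function names and the two example lines are invented for illustration, and the threshold of 2 is just one reading of "more than one read".

```python
from collections import namedtuple

# Column layout of STAR's SJ.out.tab (tab-separated, 9 columns).
Junction = namedtuple(
    "Junction",
    "chrom start end strand motif annotated unique_reads multi_reads max_overhang",
)


def parse_sj_line(line):
    """Parse one SJ.out.tab line into a Junction record."""
    f = line.rstrip("\n").split("\t")
    return Junction(f[0], *(int(x) for x in f[1:9]))


def well_supported(lines, min_reads=2):
    """Keep junctions crossed by at least min_reads uniquely mapped reads."""
    return [j for j in map(parse_sj_line, lines) if j.unique_reads >= min_reads]


# Made-up example lines, not taken from the data discussed in this thread.
example = [
    "chr1\t1000\t2000\t1\t1\t0\t1\t0\t30",  # supported by a single read
    "chr1\t5000\t9000\t2\t1\t1\t7\t0\t45",  # supported by seven reads
]
kept = well_supported(example)
for j in kept:
    print(j.chrom, j.start, j.end, j.unique_reads)
```

Reads whose junctions do not survive this filter would then be the candidates to treat as unspliced.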