I have tried to align my paired-end RNA-Seq reads to the genome using TopHat. I ran a sample dataset from the SRA (SR018268_1 and _2) and the data looked fine. However, when I run my own datasets, I get a lot of spurious junctions. In the attached example, I show the junctions and coverage for one sample. All the exons map beautifully and have coverage > 200X, but the junctions between almost all of these exons were not called, so the exons are not joined. The majority of "junctions" (>80%) in the dataset are intergenic (or intragenic), even with low coverage. For example, the far left junction is supported by 92 reads, the middle by 83, and the right by 2.
I have tried changing the alignment parameters, such as setting -r to either 165 or 41. These values come from the 230 bp fragment size measured on the Bioanalyzer, minus the two reads alone (230-35-35, which strictly gives 160 rather than the 165 I used) or minus the reads and the primer sequences as well (230-35-35-119=41). Neither setting changed the results much.
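For reference, the arithmetic behind the two -r values can be written out as a small Python sketch. The function name is made up for illustration, and it assumes that 35 is the length of each read and 119 is the combined length of the primer/adapter sequences, which is how the numbers above appear to be used. Note that 230-35-35 evaluates to 160.

```python
def mate_inner_distance(fragment_len, read_len, adapter_len=0):
    """Expected gap (bp) between the two mates of a pair.

    Hypothetical helper: fragment size minus both reads, and
    optionally minus adapter/primer sequence included in that size.
    """
    return fragment_len - 2 * read_len - adapter_len


FRAGMENT_LEN = 230  # size reported by the Bioanalyzer
READ_LEN = 35       # assumed length of each mate
ADAPTER_LEN = 119   # assumed total primer/adapter length

# Inner distance if the 230 bp is all insert.
without_adapters = mate_inner_distance(FRAGMENT_LEN, READ_LEN)

# Inner distance if the 230 bp also includes the primers.
with_adapters = mate_inner_distance(FRAGMENT_LEN, READ_LEN, ADAPTER_LEN)

print("-r ignoring primers:", without_adapters)
print("-r subtracting primers:", with_adapters)
```

If the Bioanalyzer trace was run on the adapter-ligated library, the smaller value is probably the more appropriate one to pass to -r.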
So my questions are:
1) Why aren't these junctions being called by tophat?
2) Why would the junction on the right show up?
3) How do I get past this?