Only 0.5% of my reads were correctly paired with Tophat, and a huge number were mapping to a different chromosome.
I discovered that the problem only occurred when I quality-trimmed the reads first. (?)
I then dug into Bowtie and found that it expects paired-end reads to be in the same order in the two input files. Thus if you remove a read from R1 but not from R2, every pair from that point in the file onward will be mismatched.
As far as I knew, FASTQ records were matched up by read ID, NOT by position in the file. This requirement is not specified on the Bowtie page[1], and there is NO warning when it happens.
I haven't seen this discussed before, so a warning: don't do a quality trim with discard on paired-end data, unless you process the files in sync and, when a read is bad in either file, throw it out of both so they remain matched.
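A minimal sketch of the in-sync approach, assuming plain (uncompressed) FASTQ with Sanger (+33) quality encoding; the filter function, threshold, and mean-quality criterion here are hypothetical examples, not what any particular trimmer does. The key point is that a pair is kept only if BOTH mates pass, so R1 and R2 stay in the same order:

```python
# Hypothetical synchronized filter for paired-end FASTQ files.
# A pair is discarded from BOTH output files if either mate fails,
# so the two files never fall out of sync.

def read_fastq(path):
    """Yield (header, seq, plus, qual) records from a FASTQ file."""
    with open(path) as fh:
        while True:
            header = fh.readline().rstrip()
            if not header:
                return
            seq = fh.readline().rstrip()
            plus = fh.readline().rstrip()
            qual = fh.readline().rstrip()
            yield header, seq, plus, qual

def mean_quality(qual, offset=33):
    """Mean Phred quality, assuming Sanger (+33) encoding."""
    return sum(ord(c) - offset for c in qual) / len(qual)

def filter_pairs(r1_in, r2_in, r1_out, r2_out, min_mean_q=20):
    """Write only pairs where both mates meet the quality threshold."""
    with open(r1_out, "w") as o1, open(r2_out, "w") as o2:
        for rec1, rec2 in zip(read_fastq(r1_in), read_fastq(r2_in)):
            # Discard the WHOLE pair if either mate fails.
            if (mean_quality(rec1[3]) >= min_mean_q
                    and mean_quality(rec2[3]) >= min_mean_q):
                o1.write("\n".join(rec1) + "\n")
                o2.write("\n".join(rec2) + "\n")
```

Real trimmers that support paired mode (writing matched "paired" outputs plus separate "unpaired" files) do essentially this bookkeeping for you.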
I understand why they do this from an algorithm point of view, but it would be good to make the requirement explicit.
Or, it would be pretty easy to test for this: in PairedDualPatternSource::nextReadPair(), check that the read IDs are the same before returning true.
[1] Actually, I was looking at the Bowtie 1 page. The Bowtie 2 manual says:
"Pairs are often stored in a pair of files, one file containing the mate 1s and the other containing the mate 2s. The first mate in the file for mate 1 forms a pair with the first mate in the file for mate 2, the second with the second, and so on."
So I guess it does say that... however, it could be more explicit ("the first FASTQ record in the file"), or it could do the read ID comparison test.