To clear up this confusion:
1. If you have paired-end data, htseq-count expects the SAM file to be sorted by read name. This is because the SAM format prescribes that paired reads have the same read ID, and hence sorting causes the two reads of a pair to appear in adjacent lines.
Furthermore, the format specifies that each line states whether the read was aligned, whether its mate was aligned, and where each of the two aligned. For the partner, another line has to be present that shows a "mirror" of the same alignment, i.e., the information on whether and where the read and its mate align should match if one exchanges read and mate information.
If you now perform any filtering that might remove reads but not their mates, this rule gets broken, and HTSeq issues a warning (but proceeds anyway, treating the missing partner as not aligned).
Some aligners manage to produce SAM files with missing mates even without filtering.
2. If a SAM line's flag field says that the read was not aligned then, in my opinion, the fields for where the read aligned should not contain any coordinates. HTSeq issues a warning in this case but treats the read as not aligned. BWA has a habit of sometimes producing such reads, and this is even documented somewhere.
3. Finally, a "proper pair" is, as I read the standard, a pair in which both reads align to opposite strands. If a SAM line's flag field indicates that a pair is proper but the alignment information disagrees with this, a warning is issued.
It is rather frustrating that the SAM format allows so many ways of storing self-contradictory information in a SAM file, without giving clear rules in the specification, because this forces software to expect all these cases.
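
To make the three checks above concrete, here is a minimal Python sketch. It is not HTSeq's actual code: the function names and warning strings are made up for illustration, and it assumes a name-sorted SAM file with at most one line per read (no secondary or supplementary alignments). The "opposite strands" test follows my reading of "proper pair" in point 3, not a rule spelled out in the specification.

```python
# Illustrative sketch only; this is NOT HTSeq's implementation.
# All function names and warning strings are hypothetical.

# FLAG bits as defined in the SAM specification.
PAIRED, PROPER_PAIR, UNMAPPED, MATE_UNMAPPED = 0x1, 0x2, 0x4, 0x8
REVERSE, MATE_REVERSE = 0x10, 0x20


def parse_sam_line(line):
    """Pull out the mandatory SAM fields needed for the checks."""
    f = line.rstrip("\n").split("\t")
    return {"qname": f[0], "flag": int(f[1]), "rname": f[2],
            "pos": int(f[3]), "rnext": f[6], "pnext": int(f[7])}


def self_view(rec):
    """What a line says about its own alignment."""
    if rec["flag"] & UNMAPPED:
        return ("unmapped",)
    return ("mapped", bool(rec["flag"] & REVERSE), rec["rname"], rec["pos"])


def mate_view(rec):
    """What a line says about its mate's alignment."""
    if rec["flag"] & MATE_UNMAPPED:
        return ("unmapped",)
    rnext = rec["rname"] if rec["rnext"] == "=" else rec["rnext"]
    return ("mapped", bool(rec["flag"] & MATE_REVERSE), rnext, rec["pnext"])


def mates_mirror_each_other(a, b):
    """Point 1: exchanging read and mate information must give a match."""
    return mate_view(a) == self_view(b) and mate_view(b) == self_view(a)


def unmapped_with_coordinates(rec):
    """Point 2: flagged as not aligned, yet RNAME or POS is filled in."""
    return bool(rec["flag"] & UNMAPPED) and (rec["rname"] != "*" or rec["pos"] != 0)


def proper_pair_contradicted(rec):
    """Point 3: proper-pair flag set, but a read is unmapped or strands agree."""
    flag = rec["flag"]
    if not flag & PROPER_PAIR:
        return False
    if flag & (UNMAPPED | MATE_UNMAPPED):
        return True
    return bool(flag & REVERSE) == bool(flag & MATE_REVERSE)


def check_name_sorted_sam(lines):
    """Return a list of (read name, problem) tuples for name-sorted SAM lines."""
    records = [parse_sam_line(l) for l in lines
               if l.strip() and not l.startswith("@")]
    found = []
    # Per-line checks (points 2 and 3).
    for rec in records:
        if unmapped_with_coordinates(rec):
            found.append((rec["qname"], "unmapped but has coordinates"))
        if proper_pair_contradicted(rec):
            found.append((rec["qname"], "proper-pair flag contradicted"))
    # Pairing check (point 1): mates must sit on adjacent lines.
    i = 0
    while i < len(records):
        rec = records[i]
        nxt = records[i + 1] if i + 1 < len(records) else None
        if not rec["flag"] & PAIRED:
            i += 1
        elif nxt is None or nxt["qname"] != rec["qname"]:
            found.append((rec["qname"], "mate missing"))
            i += 1
        else:
            if not mates_mirror_each_other(rec, nxt):
                found.append((rec["qname"], "mate records do not mirror"))
            i += 2
    return found


def sam(qname, flag, rname, pos, rnext, pnext):
    """Build a made-up SAM line with placeholder MAPQ, CIGAR, SEQ and QUAL."""
    return "\t".join([qname, str(flag), rname, str(pos), "60", "50M",
                      rnext, str(pnext), "0", "*", "*"])


example = [
    sam("r1", 99, "chr1", 100, "=", 300),   # consistent pair
    sam("r1", 147, "chr1", 300, "=", 100),
    sam("r2", 99, "chr1", 400, "=", 600),   # mate was filtered out
    sam("r3", 4, "chr1", 500, "*", 0),      # unmapped, yet has coordinates
    sam("r4", 67, "chr1", 700, "=", 900),   # "proper" pair, both on forward strand
    sam("r4", 131, "chr1", 900, "=", 700),
]
found = check_name_sorted_sam(example)
for qname, problem in found:
    print(qname, problem)
```

The reads r1 to r4 are invented test data. A real tool would also have to cope with multiple alignments per read, which is where most of the complications come from.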