I would like to suggest a minor but important update/clarification for the SAM format specification regarding unmapped reads.
The SAM spec (http://samtools.github.io/hts-specs/SAMv1.pdf) says the following about the flag:
0x10 SEQ being reverse complemented
and further down in the notes:
Bit 0x4 is the only reliable place to tell whether the read is unmapped. If 0x4 is set, no assumptions can be made about RNAME, POS, CIGAR, MAPQ, bits 0x2, 0x10, 0x100 and 0x800, and the bit 0x20 of the previous read in the template.
Thus, for an unmapped read, we don't know whether the reverse complement flag is valid. This has important implications for extracting the original read data from a BAM file, since we don't know whether unmapped reads are stored as-is in the BAM or have been reverse complemented. It would be fairly simple to add the requirement that 0x10 is always valid, regardless of 0x4.
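To make the extraction problem concrete, here is a minimal sketch of how a tool might recover the read as it came off the sequencer. The function and constant names are hypothetical, made up for illustration; only the flag values come from the spec. The key point is that the sketch is only correct under the assumption that 0x10 can be trusted when 0x4 is set, which is exactly what the spec currently does not guarantee.

```python
# Flag bit values as defined in the SAM spec; the names are illustrative only.
FLAG_UNMAPPED = 0x4   # segment unmapped
FLAG_REVERSE = 0x10   # SEQ being reverse complemented

_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq):
    """Reverse complement a nucleotide string (other characters pass through)."""
    return seq.translate(_COMPLEMENT)[::-1]


def original_read(flag, seq, qual):
    """Return (seq, qual) in sequencer orientation.

    ASSUMPTION: bit 0x10 is valid even when bit 0x4 is set. The current
    spec wording does not promise this for unmapped reads.
    """
    if flag & FLAG_REVERSE:
        # Undo the reverse complement; qualities are reversed to match.
        return reverse_complement(seq), qual[::-1]
    return seq, qual


# An unmapped read stored as-is, and the same read stored reverse complemented.
print(original_read(FLAG_UNMAPPED, "ACGTT", "IIII#"))
print(original_read(FLAG_UNMAPPED | FLAG_REVERSE, "AACGT", "#IIII"))
```

If 0x10 were guaranteed valid, both calls would return the same original read. Without that guarantee, the second result cannot be trusted for unmapped records.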
If I understand the bwa code correctly, the reverse complement flag is always set for a reverse complemented read (though, confusingly, an unmapped read takes on the same flag value as its mate if the mate is mapped, so the read is reverse complemented unnecessarily). If we updated the spec, bwa would already be within spec. I am not sure about other aligners.
In the communities I work with, BAM files are ubiquitous for storing sequencing data, and with good reason, since they are smaller than compressed FASTQs. For people who rely on BAMs as a storage format and want to use unmapped reads to predict, for instance, structural variants, it is important that BAMs allow consistent extraction of the read sequences as they came off the sequencer.
I am posting this to get feedback from the community and, hopefully, to change the spec. If I have misunderstood the spec, let me know.
Andrew