I am using tophat2 (2.0.10) with bowtie2 to map Illumina HiSeq 2000 2x100 paired-end RNA sequencing data to the human genome/transcriptome.
Examination of the tophat-generated 'accepted_hits.bam' files, for both stranded and non-stranded paired-end RNA sequencing, shows many alignments in which the XS:A tag appears twice, even though the SAM format specification states that a tag can appear only once per alignment line. The XS:A tag takes a value of "+" or "-", indicating the genomic strand from which the RNA that produced the read was transcribed.
Is this a bug in tophat2 or bowtie2? I see it with tophat 2.0.9 as well, which was supposed to have fixed such a bug.
Example command (non-stranded sequencing):
Code:
samtools view accepted_hits.bam | head -n 10000 | grep -P "(XS:A:[+-]).+?(XS:A:[+-])"
Output:
Code:
SRR452328.33456849 129 1 17368 50 1M237N77M 6 74227619 0 CAGGTTCTCGGTGGTGTTGAAGAGCAGCAAGGAGCTGACAGAGCTGATGTTGCTGGGAAGACCCCCAAGTCCCTCTTC =?+A=:BDFAA<AE8+AEGG9CEE3?;E3?F;:)?;GDCAADFHB@@B<)8=@AHG=F9@;7@EA6<?=;@>;6==>@ AS:i:0 XN:i:0 XM:i:0 XO:i:0 XG:i:0 NM:i:0 MD:Z:78 YT:Z:UU XS:A:- XS:A:- NH:i:1
SRR452328.25125866 177 1 17368 50 1M237N98M X 155252048 0 CAGGTTCTCGGTGGTGTTGAAGAGCAGCAAGGAGCTGACAGAGCTGATGTTGCTGGGAAGCCCCCCAAGTCCCTCTTCTGCATCGTCCTCGGGCTCCGG ?<BADBD>@BA<<B@:+(@CC>CCADCCCCCCCDCCACAACCCDCCA8DCC?C>?DC<00&FHFIHDBGDIIG@IHDFG?IHIGF6GHHDBFFDDFCCB AS:i:0 XN:i:0 XM:i:1 XO:i:0 XG:i:0 NM:i:1 MD:Z:60A38 YT:Z:UU XS:A:- XS:A:- NH:i:1
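
The grep above only looks for a repeated XS:A tag. To see whether any other optional tag is duplicated too, a more general check might look like the sketch below. It is only an illustration: `dup_tags` is a helper name I made up, and it assumes tab-separated SAM text on stdin (as `samtools view` writes it), with optional fields starting at column 12 as laid out in the SAM specification.

```shell
# Hypothetical helper (not part of samtools or tophat): read SAM text on
# stdin and print the read name followed by every optional-field tag that
# occurs more than once in that record.
dup_tags() {
  awk -F'\t' '
    /^@/ { next }                       # skip SAM header lines
    {
      split("", seen)                   # reset the per-record tag counts
      dups = ""
      for (i = 12; i <= NF; i++) {      # optional fields start at column 12
        tag = substr($i, 1, 2)          # two-character tag name, e.g. XS
        if (++seen[tag] == 2)           # report each duplicated tag once
          dups = dups " " tag
      }
      if (dups != "") print $1 dups
    }'
}
```

It would be used in the same way as the grep, for example `samtools view accepted_hits.bam | head -n 10000 | dup_tags`. Records without a repeated tag produce no output, so an empty result would mean the first 10000 alignments are clean.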