No announcement yet.

How to infer Illumina paired-end strand specificity from SAM output?

  • Filter
  • Time
  • Show
Clear All
new posts

  • How to infer Illumina paired-end strand specificity from SAM output?

    Hello, everyone, I'm new here.

    I usually analyze unpaired-read data from SOLiD sequencing of RNA-Seq libraries. The output from mapping these reads is simple and unambiguously specifies whether each read was mapped to a chromosome's POS or NEG strand. Now for the first time I am looking at paired-end reads from Solexa sequencing of an RNA-Seq sample. The library was prepared via a dUTP second-strand method. These reads were mapped (to the mouse genome) by someone else using the program BWA. I've been handed a .bam file, and have since uncompressed it to SAM and familiarized myself with SAMtools and the SAM format.

    At the moment I'm only concerned with characterizing individual reads, not with their pairing per se. My question concerns how to infer the strand to which each read maps, specifically whether a read is sense or antisense to a gene the read overlaps. I know that the 0x10 bit of SAM's FLAG field indicates reverse complementation (rc), which I take to mean that the read mapped to the NEG strand of the reference sequence. The RNA-Seq sample in question should almost entirely map to genes in the sense direction. I find that almost all -- but not quite all -- "left"-end reads of a pair do map to the POS reference strand ... but that almost all "right"-end reads of a pair map to the NEG strand.

    Is it correct to assume that the right-most member of a pair, when rc'ed and thereby mapped to the NEG strand, actually originated on the POS strand, i.e., the same strand as the other member of its pair? In other words, if the left pair member is not rc'ed and the right pair member is rc'ed, do they both "really" map to the POS strand? And if the left member is rc'ed and the right member is not rc'ed, do they both map to the NEG strand? I can't find any reference that specifies "paired reads from the left and right [or 5' and 3'] ends of a fragment map by convention to the same and opposite strand, respectively, as the originating RNA fragment" or some such thing.

    Looking at two highly expressed genes, one annotated on its chromosome's POS strand and the other on a NEG strand, I find that for both of them almost all the left-end reads are not rc'ed and almost all the right-end reads are rc'ed. If this holds for all genes one would conclude, depending on the convention used, that either every read maps to a POS strand, every read maps to a NEG strand, or close to half of reads are found on each strand irrespective of gene orientation. This just doesn't make sense to me, since reads stemming from transcription of these and other genes should align to the genes in the sense orientation -- i.e., they should map to the genes' strands in the reference genome.

    I need to sort this out before deciding whether each read derives from an RNA fragment that was transcribed sense or antisense to a gene.


    Addendum: I don't quite see the conistency between the sign of SAM's TLEN field and bit 0x40 bit of the FLAG field, where the flag indicates "first fragment in the template". ... Before I used TLEN >= 0 to identify "leftmost" (i.e., first) fragments. However, if I use instead use a set x40 bit to indicate first fragment, I find the following. For the gene whose reads should align to the POS strand, most reads in the SAM output are either ( 1st fragment and reverse-complemented ) or ( not 1st fragment and not rc'ed ). For the NEG-strand gene, I find the opposite trend, either ( 1st fragment and not rc'ed ) or ( not 1st fragment and rc'ed ). Good that these are now different. However, they each seem to suggest the wrong strand if one goes by whether the first fragment is reverse-complemented.

    But the "wrong" strand would be reported if the whole sample actually consisted of reverse-complemented copies of RNA transcripts. This does seem to be the case in the dUTP second-strand sample prep: uridine digestion leaves the strand complementary to the initial transcript. Does this make sense?

    If that's correct, then for this library prep a read should be assigned the NEG strand if and only if

    -- it is a "first" fragment (FLAG x40 bit = 1) and its SEQ is not reverse-complemented (FLAG x10 bit = 0), or

    -- it isn't a first fragment (x40 bit = 0) and its SEQ is reverse-complemented (x10 bit = 1).

    Otherwise it's on the POS strand.
    Last edited by David Harmin; 02-16-2011, 08:26 PM. Reason: Update of test cases