
  • SAM flag idioms

    "There are 10 types of people in this world: those who assimilated binary numbers and those who didn't."

    I definitely belong to the 10'th type and hence SAM Flags are a chore. They may be a very compact way of communicating a lot of info about an alignment, but how do we humans learn them? I know it is kind of nerdy to actually look through SAM files but, what can I say? Mea culpa.

    Anyway, this post is my attempt to understand them like a natural language i.e. recognize some idiomatic representations in flags. If you already know these, you are a "binar" and way ahead of us humans on this topic.

    You can use this handy little web page for specific flags:


    However, to "speak SAM", we must know these flags without having to refer to a web page for each line. So, here are some simple idioms.

    Unpaired Reads

    For unpaired reads, the flags are very easy to recognize because there are only 3 values:
    • 4 - 0000000100 - means "this is an unpaired read and is not mapped".
    • 16 - 0000010000 - "this unpaired read is mapped in the reverse orientation".
    • 0 - 0000000000 - "this unpaired read is mapped in the forward orientation".
    I guess it is theoretically possible to have a flag of 20 meaning "unpaired, unmapped read presented in reverse orientation" - however, I doubt any software will do that. Perhaps, that is our first SAM joke: Did you hear about AnnoyingAlign? It is the software that 20's all unpaired, unmapped reads - just to get on users' nerves.
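For anyone who would rather let the computer do the translation, here is a minimal Python sketch of a flag decoder. The names `SAM_BITS`, `decode_flag` and `as_binary` are made up for this illustration, and the bit descriptions are paraphrased from the SAM specification; only the ten bits discussed in this post are covered.

```python
# Bit values and paraphrased meanings for the ten bits discussed in this post.
# The table and helper names are illustrative, not part of any library.
SAM_BITS = [
    (1, "paired"),
    (2, "proper pair"),
    (4, "read unmapped"),
    (8, "mate unmapped"),
    (16, "read reverse strand"),
    (32, "mate reverse strand"),
    (64, "first in pair"),
    (128, "second in pair"),
    (256, "secondary alignment"),
    (512, "fails quality checks"),
]


def decode_flag(flag):
    """Return the names of all bits that are set in a SAM flag."""
    return [name for value, name in SAM_BITS if flag & value]


def as_binary(flag):
    """Ten-digit binary string, matching the notation used in this post."""
    return format(flag, "010b")


# The three unpaired idioms, plus AnnoyingAlign's favourite.
for flag in (0, 4, 16, 20):
    print(flag, as_binary(flag), decode_flag(flag))
```

Flag 0 decodes to an empty list, which is the whole point of that idiom: nothing is set, so the read is unpaired, mapped and forward.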

    Paired Reads

    For paired reads, the 0'th bit (value 1) HAS to be set. Hence all flags for paired reads HAVE to be odd. In other words, if we set aside the bits from 256 upwards (secondary alignment, QC failure and so on), all even-numbered flags other than 0, 4 and 16 (and AnnoyingAlign's 20) are meaningless. (Good progress. We can recognize nonsense words. Writing a Jabberwocky poem with these flags is left as an exercise for the reader).

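    For paired reads, the 64 bit marks the first read of a pair and the 128 bit marks the second. Among flags below 256, odd flags in the interval [65-127] therefore relate to the first read of a pair and odd flags in [129-191] relate to the second. Odd flags in [193-255] have both bits set, and odd flags below 64 have neither, so either of those should make you suspicious.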
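Whether a flag belongs to the first or the second read comes down to the 64 and 128 bits, so it can be tested directly instead of memorizing intervals. A small sketch (the function name `which_read` and its return strings are invented for this example):

```python
def which_read(flag):
    """Classify a SAM flag by its paired (1), first (64) and second (128) bits."""
    if not flag & 1:
        return "unpaired"
    first = bool(flag & 64)
    second = bool(flag & 128)
    if first and second:
        return "both first and second bits set"
    if first:
        return "first in pair"
    if second:
        return "second in pair"
    return "paired, but neither first nor second bit set"


for flag in (0, 99, 147, 193, 1):
    print(flag, which_read(flag))
```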


    "All Good"

    Some values mean "all good" i.e. that both reads in the pair have aligned:
    • 65 - 0001000001 - "this is the first read in the pair and both reads aligned to the forward strand".
    • 129 - 0010000001 - "this is the second read in the pair and both reads aligned to the forward strand".

    NOTE: 67 (0001000011) and 131 (0010000011) also mean the same as 65 and 129 with the added assurance that "the pair is properly aligned" meaning that they mapped within a proper distance from each other.
    Sometimes both reads of a pair are flipped (reverse complemented) before mapping. If so, you get 113 or 177.
    • 113 - 0001110001 - "this is the first read of a pair, both reads in pair were flipped and both mapped".
    • 177 - 0010110001 - "this is the second read of a pair, both reads in pair were flipped and both mapped".

    Other times only one of the reads in a pair is flipped though both of them map:
    • 81 - 0001010001 - "this is the first read of pair, both reads mapped, we had to flip this read, but mate is in forward orientation".
    • 161 - 0010100001 - "this is second read, this one is forward but we flipped its mate and both reads mapped".

    NOTE: 163 (0010100011) and 83 (0001010011) are the same as 161 and 81 except "it is in a proper pair".
    • 97 - 0001100001 - "this is first read, its mate is flipped but this is forward. Both mapped".
    • 145 - 0010010001 - "this is second read. it is flipped but its mate is not. Both mapped".

    NOTE: 99 (0001100011) and 147 (0010010011) are the same as 97 and 145 except with "proper mapping in pair".
    Exercise: Can you see why the number of reads with flag 113 must be equal to the number of reads with flag 177? Similarly, 81=161 and 97=145. If those numbers don't match, something went wrong with your aligner.
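One way to see the pairing: a read's flag and its mate's flag describe the same pair from two sides, so the mate's flag is obtained by swapping the "this read" bits with the "mate" bits (4 with 8, 16 with 32, 64 with 128). A minimal sketch, assuming only primary alignments; `swap_bits` and `mate_flag` are names invented here:

```python
def swap_bits(flag, a, b):
    """Exchange the two bits with values a and b inside flag."""
    has_a, has_b = bool(flag & a), bool(flag & b)
    flag &= ~(a | b)  # clear both bits first
    if has_a:
        flag |= b
    if has_b:
        flag |= a
    return flag


def mate_flag(flag):
    """Flag we would expect on the mate of a read carrying `flag`.

    Swaps unmapped/mate-unmapped (4, 8), reverse/mate-reverse (16, 32)
    and first/second (64, 128). All other bits are left unchanged,
    which is an assumption that only holds for simple primary alignments.
    """
    for a, b in ((4, 8), (16, 32), (64, 128)):
        flag = swap_bits(flag, a, b)
    return flag


for flag in (113, 81, 97, 83, 99):
    print(flag, "<->", mate_flag(flag))
```

Every read flagged 113 has exactly one mate flagged 177, which is why the two counts have to agree.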

    "All Bad"
    At the other end of the spectrum we have "all bad" i.e. neither the read nor its mate mapped:

    • 77 - 0001001101 - First in pair, both reads in pair unmapped. "All bad".
    • 141 - 0010001101 - Second in pair and "all bad".

    • Exercise: Just like with 20, AnnoyingAlign puts flags of 93 or 125 on all unmapped pairs. What other flags can AnnoyingAlign use to maximize user annoyance?
    • Exercise: Why are 79 and 143 particularly good words for Jabberwocky?
    Only one read maps

    Next, we have the cases when only one read in a pair is mapped.
    • 69 - 0001000101 - First read in pair. This read is unmapped but its mate is mapped.
    • 137 - 0010001001 - second in pair. Read is mapped but mate is unmapped.
    • 73 - 0001001001 - First read in pair. This read is mapped but its mate is not.
    • 133 - 0010000101 - 2nd in pair. Read unmapped but mate is mapped.

    Can you again see why the number of reads with a flag of 69 must be the same as the number of reads with a flag of 137? The same goes for 73 and 133.
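If you want to run that sanity check on real data, the counts are easy to compare once you have the flags (column 2 of each SAM line) in a list. A rough sketch, assuming secondary and supplementary alignments have already been filtered out, since those would break the one-to-one correspondence; `MATE_PAIRS` and `unbalanced_flags` are hypothetical names:

```python
from collections import Counter

# Flag pairs from this post that should occur equally often.
MATE_PAIRS = {69: 137, 73: 133, 113: 177, 81: 161, 97: 145, 77: 141}


def unbalanced_flags(flags):
    """Return {(flag, mate_flag): (count, mate_count)} for every mismatch."""
    counts = Counter(flags)
    return {
        (a, b): (counts[a], counts[b])
        for a, b in MATE_PAIRS.items()
        if counts[a] != counts[b]
    }


# Toy input: one read flagged 69 is missing its 137 partner.
example = [69, 137, 73, 133, 69]
print(unbalanced_flags(example))
```

An empty dictionary means the counts agree for all six idioms.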

    There are of course many other combinations. The purpose here is not to enumerate them but to simply have some fun with the structure of these flags.

    What is your favorite flag? Do you have other ways of remembering what these things mean as you look through SAM files?
    Kamalakar Gulukota,
    Director,
    Center for Bioinformatics and Computational Biology
    NorthShore University Health System, [email protected]

  • #2
    Personally, the ones and zeros aren't helpful to me. I don't think of 147 as "0010010011", but as "128+16+2+1", and I remember what all those numbers stand for. And in most contexts, having both reads map in the forward direction or both map in the reverse direction is not all good, it's weird.

    The four good numbers to remember are 64+16+2+1=83, 64+32+2+1=99, 128+16+2+1=147 and 128+32+2+1=163. Something is very wrong if you ever see both 128 and 64 together, and with most current technologies, you should see 16 or 32, but not both. If you see both, or don't see either, your reads are paired strangely.
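The "sum of powers of two" view is also easy to reproduce. A small sketch (`components` is a made-up helper name) that splits a flag into its parts, largest first:

```python
def components(flag):
    """Powers of two that add up to flag, largest first."""
    return [
        1 << i
        for i in range(flag.bit_length() - 1, -1, -1)
        if flag & (1 << i)
    ]


for flag in (83, 99, 147, 163):
    print(flag, "=", "+".join(str(c) for c in components(flag)))
```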


    • #3
      Thank you so much! This is so useful!
      Why did not see it hot?


      • #4
        Originally posted by liu_xt005
        Thank you so much! This is so useful!
        Why did not see it hot?
        Liu_xt005 -
        I am glad you found it useful. I am not sure why this did not show up hot. But your reply did promote it there. So, thanks!

        Gulu
        Kamalakar Gulukota,
        Director,
        Center for Bioinformatics and Computational Biology
        NorthShore University Health System, [email protected]


        • #5
          Very useful! Thanks


          • #6
            I assume kgulukota is giving examples for a mate-pair library (SOLiD) and swbarnes2 is giving examples for a paired-end library (Illumina). I think both of them are correct. Please correct me if I am wrong.


            • #7
              That is true, seq_lover. Which combinations you consider "all good" and which ones "weird" depends on how you constructed your library. Thank you for putting it so succinctly.
              Kamalakar Gulukota,
              Director,
              Center for Bioinformatics and Computational Biology
              NorthShore University Health System, [email protected]


              • #8
                Thanks for this post, do you accept doge tips?
                Homepage: Dan Bolser
                MetaBase the database of biological databases.


                • #9
                  Years later, this is still pretty darn useful.
                  Thanks!


                  • #10
                    What's going on here?

                    I don't know if I correctly understand the meaning of the read index in SAM files.

                    This information is present in the Flags description:

                    'Next, we have the cases when only one read in a pair is mapped.
                    69 - 0001000101 - First read in pair. This read is unmapped but its mate is mapped.
                    133 - 0010000101 - 2nd in pair. Read unmapped but mate is mapped.'


                    So, does it mean that if I have a read with flag 133 or 69, its mate can't be present in the unmapped reads file?
                    I am assuming that reads with the same index (in this case "M03092:8:000000000-AG2GN:1:2117:2591:14346") are paired. Am I correct? If I am wrong, then I understand what happened, but I'd like to know what these lines with the same index are.

                    Following this line of thought (same index, paired reads), why are there so many pairs of lines like this in my unmapped reads file?

                    M03092:8:000000000-AG2GN:1:2117:2591:14346 69
                    M03092:8:000000000-AG2GN:1:2117:2591:14346 133

                    Can anyone explain what's going on with these reads?


                    • #11
                      Your aligner seems to have a bug: the flags should be 77 and 141 if both mates are unmapped.


                      • #12
                        Crazy TopHat unmapped reads

                        Did anyone have the same problem with these unmapped reads?


                        • #13
                          It's a bug in TopHat (all versions); it doesn't set the 0x8 bit ("next segment in the template unmapped") when both reads are unmapped.

                          This is one of the issues TopHat-Recondition fixes (https://bmcbioinformatics.biomedcent...859-016-1058-x , https://github.com/cbrueffer/tophat-recondition).
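To make the arithmetic of the fix concrete: setting the 0x8 bit on a read whose mate is unmapped turns 69 into 77 and 133 into 141. The sketch below only illustrates that bit operation on a pair of flags; it is not how TopHat-Recondition is implemented, and `repair_pair` is a name made up for this example.

```python
UNMAPPED = 0x4       # this read is unmapped
MATE_UNMAPPED = 0x8  # "next segment in the template unmapped"


def repair_pair(flag1, flag2):
    """Set the mate-unmapped bit on each read whose mate is unmapped.

    Takes the flags of the two reads of one pair and returns the
    corrected flags. Flags that are already consistent are unchanged.
    """
    if flag1 & UNMAPPED:
        flag2 |= MATE_UNMAPPED
    if flag2 & UNMAPPED:
        flag1 |= MATE_UNMAPPED
    return flag1, flag2


print(repair_pair(69, 133))  # the pair from the question above
print(repair_pair(99, 147))  # a normal mapped pair stays as it is
```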


                          • #14
                            Yes, it is a bug in TopHat. Didn't they fix it in TopHat2? I recently used it and the flags were alright.


                            • #15
                              It's still unfixed (as of TopHat 2.1.1) and unlikely to be fixed at all, since TopHat is not really being developed anymore (the developers focus on HISAT2, its successor).

                              Did you use TopHat via bcbio-nextgen by any chance? That fixes the unmapped reads file for you automatically; other frameworks may do the same.
