Header Leaderboard Ad

Collapse

bitwise flag in sam format and others

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • #16
    what would a flag 0 imply in sam output? I used novo align and the only flags I see are 0 4 and 16. 4 is unmapped read. 16 is for the strand, but how to interpret 0.

    Also, How can I ascertain the reads that are *not* uniquely mapped. I read that the 5th column MAPQ should be of help to determine multiply-mapped reads. Is MAPQ=0 an indication that the read is multiply-mapped?

    Thanks
    --
    bioinfosm

    Comment


    • #17
      Originally posted by bioinfosm View Post
      what would a flag 0 imply in sam output? I used novo align and the only flags I see are 0 4 and 16. 4 is unmapped read. 16 is for the strand, but how to interpret 0.

      Also, How can I ascertain the reads that are *not* uniquely mapped. I read that the 5th column MAPQ should be of help to determine multiply-mapped reads. Is MAPQ=0 an indication that the read is multiply-mapped?

      Thanks
      Flag 0 means "the read is not paired and mapped, forward strand"

      You are right, MAPQ =0 is an indication that this read has multiple best hits, which means it is not unique match.

      Comment


      • #18
        Actually, I always have the question why SAM format using a binary FLAG field instead of text/char field to indicate the mapping status. IMHO a human readable FLAG field would make much more sense, and maybe the strand information could be listed as a separated field. It's also a headache to look at those aux fields. Excuse me but I have the feeling that SAM format is somewhat over complicated, is there any possibility the SAM format could be modified so that people could read it more easily?

        Thanks.

        Comment


        • #19
          @yxi

          Please use "samtools view -X" to see a human readable FLAG. I agree that not specifying a better FLAG field initially is a shortcoming, but it is too late to change the spec at the moment. samtools view -X comes as a temporary hack which I find useful.

          Could you suggest a better format for the aux fields or to make SAM simpler? Note that SAM should be both human readable and machine readable. The current form is the best we can come to so far. Genbank/EMBL files are human readable, but they cause a lot of troubles in parsing, and we do not want to go in that way again. I think the best solution to human readability is not to change the spec, but to write a script to print a SAM alignment in multiple lines in a beautiful way. If you want to contribute to such a script, that would be great. Thanks.

          Comment


          • #20
            Currently in most of sam/bam files, the header is empty. However, this could potentially be a good place to store all the info people are interested. From the SAM format specification, HEADER SECTION has four default record types - @HD, @SQ, @RG, @RG.
            Can we define our own record type, such as @MN.... or it won't be recognized by samtools.

            In addition, each record type has predefined data field, such as VN, SO, GO in @HD. Can we add more data field?

            thanks

            Comment


            • #21
              Wow, the more I dig into sam the more it looks like a horse put together by a committee. These bit flags are ridiculous for a human readable text alignment format.

              Comment


              • #22
                Originally posted by Nix View Post
                Wow, the more I dig into sam the more it looks like a horse put together by a committee. These bit flags are ridiculous for a human readable text alignment format.
                It was put together by a committee. The flag field, in my opinion, is left over from the internal C representation, which was created for compactness (then compressed into SAM). You will encounter the internal format if you program using the SAMTools C API but nonetheless the internal (compact) data structure should not have been shown to the user when viewing SAM (not BAM files). Just wait until you try to get anything clarified/changed/included in the specification.

                Nils

                Comment


                • #23
                  If SAM were designed by one person, it would not be widely accepted. There are pros and cons. As to the bit flag, have you tried "samtools view -X"?

                  Comment


                  • #24
                    I particularly like the TCGA BAM specification for the BAM headers. It uses most of the header types and their associated tags to capture things like: sequencing platform, reference contig details(MD5s, url..), samples, institution, library, etc...
                    -drd

                    Comment


                    • #25
                      Originally posted by lh3 View Post
                      If SAM were designed by one person, it would not be widely accepted. There are pros and cons. As to the bit flag, have you tried "samtools view -X"?
                      It is widely accepted because good tools (i.e. lh3's implementation in SAMTools and now Picard) were available early and was used by the first major sequencing efforts (aka 1000 Genomes). I no way mean to lambaste the creators of SAM, only to be honest about its development and implementation.

                      Comment


                      • #26
                        OK, 4hrs later and I can parse the strand.....

                        Here's some java code so folks don't have to go through this again:

                        <pre>
                        /**
                        * Index Index^2 MethodName Flag DescriptionFromSamSpec
                        * 0 1 isReadPartOfAPairedAlignment 0x0001 the read is paired in sequencing, no matter whether it is mapped in a pair
                        * 1 2 isReadAProperPairedAlignment 0x0002 the read is mapped in a proper pair (depends on the protocol, normally inferred during alignment) 1
                        * 2 4 isQueryUnmapped 0x0004 the query sequence itself is unmapped
                        * 3 8 isMateUnMapped 0x0008 the mate is unmapped
                        * 4 16 isQueryReverseStrand 0x0010 strand of the query (false for forward; true for reverse strand)
                        * 5 32 isMateReverseStrand 0x0020 strand of the mate
                        * 6 64 isReadFirstPair 0x0040 the read is the first read in a pair
                        * 7 128 isReadSecondPair 0x0080 the read is the second read in a pair
                        * 8 256 isAlignmentNotPrimary 0x0100 the alignment is not primary (a read having split hits may have multiple primary alignment records)
                        * 9 512 doesReadFailVendorQC 0x0200 the read fails platform/vendor quality checks
                        * 10 1024 isReadADuplicate 0x0400 the read is either a PCR duplicate or an optical duplicate
                        */
                        public boolean testBitwiseFlag(int testValue){
                        return ((flags & testValue) == testValue);
                        }
                        /**The read is paired in sequencing, no matter whether it is mapped in a pair*/
                        public boolean isPartOfAPairedAlignment(){
                        return testBitwiseFlag(1);
                        }
                        /**The read is mapped in a proper pair (depends on the protocol, normally inferred during alignment)*/
                        public boolean isAProperPairedAlignment(){
                        return testBitwiseFlag(2);
                        }
                        /**The query sequence itself is unmapped*/
                        public boolean isUnmapped(){
                        return testBitwiseFlag(4);
                        }
                        /**The mate is unmapped*/
                        public boolean isMateUnMapped(){
                        return testBitwiseFlag(8);
                        }
                        /**Strand of the query (false for forward; true for reverse strand)*/
                        public boolean isReverseStrand(){
                        return testBitwiseFlag(16);
                        }
                        /**Strand of the mate*/
                        public boolean isMateReverseStrand(){
                        return testBitwiseFlag(32);
                        }
                        /**The read is the first read in a pair*/
                        public boolean isFirstPair(){
                        return testBitwiseFlag(64);
                        }
                        /**The read is the second read in a pair*/
                        public boolean isSecondPair(){
                        return testBitwiseFlag(128);
                        }
                        /**The alignment is not primary (a read having split hits may have multiple primary alignment records)*/
                        public boolean isNotAPrimaryAlignment(){
                        return testBitwiseFlag(265);
                        }
                        /**The read fails platform/vendor quality checks*/
                        public boolean failedQC(){
                        return testBitwiseFlag(512);
                        }
                        /**The read is either a PCR duplicate or an optical duplicate*/
                        public boolean isADuplicate(){
                        return testBitwiseFlag(1024);
                        }
                        </pre>

                        Comment


                        • #27
                          Originally posted by Nix View Post
                          OK, 4hrs later and I can parse the strand.....

                          Here's some java code so folks don't have to go through this again:

                          <pre>
                          /**
                          * Index Index^2 MethodName Flag DescriptionFromSamSpec
                          * 0 1 isReadPartOfAPairedAlignment 0x0001 the read is paired in sequencing, no matter whether it is mapped in a pair
                          * 1 2 isReadAProperPairedAlignment 0x0002 the read is mapped in a proper pair (depends on the protocol, normally inferred during alignment) 1
                          * 2 4 isQueryUnmapped 0x0004 the query sequence itself is unmapped
                          * 3 8 isMateUnMapped 0x0008 the mate is unmapped
                          * 4 16 isQueryReverseStrand 0x0010 strand of the query (false for forward; true for reverse strand)
                          * 5 32 isMateReverseStrand 0x0020 strand of the mate
                          * 6 64 isReadFirstPair 0x0040 the read is the first read in a pair
                          * 7 128 isReadSecondPair 0x0080 the read is the second read in a pair
                          * 8 256 isAlignmentNotPrimary 0x0100 the alignment is not primary (a read having split hits may have multiple primary alignment records)
                          * 9 512 doesReadFailVendorQC 0x0200 the read fails platform/vendor quality checks
                          * 10 1024 isReadADuplicate 0x0400 the read is either a PCR duplicate or an optical duplicate
                          */
                          public boolean testBitwiseFlag(int testValue){
                          return ((flags & testValue) == testValue);
                          }
                          /**The read is paired in sequencing, no matter whether it is mapped in a pair*/
                          public boolean isPartOfAPairedAlignment(){
                          return testBitwiseFlag(1);
                          }
                          /**The read is mapped in a proper pair (depends on the protocol, normally inferred during alignment)*/
                          public boolean isAProperPairedAlignment(){
                          return testBitwiseFlag(2);
                          }
                          /**The query sequence itself is unmapped*/
                          public boolean isUnmapped(){
                          return testBitwiseFlag(4);
                          }
                          /**The mate is unmapped*/
                          public boolean isMateUnMapped(){
                          return testBitwiseFlag(8);
                          }
                          /**Strand of the query (false for forward; true for reverse strand)*/
                          public boolean isReverseStrand(){
                          return testBitwiseFlag(16);
                          }
                          /**Strand of the mate*/
                          public boolean isMateReverseStrand(){
                          return testBitwiseFlag(32);
                          }
                          /**The read is the first read in a pair*/
                          public boolean isFirstPair(){
                          return testBitwiseFlag(64);
                          }
                          /**The read is the second read in a pair*/
                          public boolean isSecondPair(){
                          return testBitwiseFlag(128);
                          }
                          /**The alignment is not primary (a read having split hits may have multiple primary alignment records)*/
                          public boolean isNotAPrimaryAlignment(){
                          return testBitwiseFlag(265);
                          }
                          /**The read fails platform/vendor quality checks*/
                          public boolean failedQC(){
                          return testBitwiseFlag(512);
                          }
                          /**The read is either a PCR duplicate or an optical duplicate*/
                          public boolean isADuplicate(){
                          return testBitwiseFlag(1024);
                          }
                          </pre>
                          Have you tried picard for a Java API? Here's the link to the javadoc.

                          Comment


                          • #28
                            Yes I did try Picard's SAMFileReader but it has a major memory management problem when trying to parse RNA-Seq data that contains alignments to chromosomes and splice junctions. These alignment files contain headers with 2.3 million splice junction 'chromosomes' that cause the SAMFileReader to quickly throw an out of memory error. The SAM format requires one '@SQ' line for each splice junction. Ouch.

                            Thus, I had to create a light weight file parser that would skip such header info. I also wanted to get a feel for whether sam could be used as a universal short read alignment format. I'm part of two projects (GenoViz and a $100M NHLBI cardiac project) that are looking to see if SAM could serve that function.

                            I should say that now I understand bitwise flags, they are a pretty clever trick for compressing a bunch of boolean flags in a binary file. For SAM spec 2 though, they should be removed from the text format.

                            Comment


                            • #29
                              Hello, still on the flag meaning ...
                              I am trying to understand those (by bwa):

                              HWI-EAS255_8282_FC30G79_PE:4:1:1:1979 69 * 0 0 * * 0 0 TAGATGGCAAAATGAACTTCCATAAANNNNNCTGTATGCTCTGGGNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN BBBBB*2??9<[email protected]@BBB;[email protected]=;'''''<BCCB6+?-B??;3''''''''''''''''''''''''''''''
                              HWI-EAS255_8282_FC30G79_PE:4:1:1:1979 133 * 0 0 * * 0 0 NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN '''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''
                              ^C

                              HWI-EAS255_8282_FC30G79_PE:4:1:1:1423 69 * 0 0 * * 0 0 TACACCAAATATATATTATATGCTGTACATAATATATTAAAGTCGACCAAATATATATTATATACTGTACATAAA BBBBBBBBBBBBBBBBBBBBBBCBBBB?BBBBBBBBBCBBBBBBBBBB9;ABBBBBBBB=B=>>>BBBB?;<BB=
                              HWI-EAS255_8282_FC30G79_PE:4:1:1:1423 133 * 0 0 * * 0 0 TATACTTTGGGTATTTTGATATTTTATTTACAGTATATAATACATACTTTGGGTACTTCGATATTTTATGTGCAG BBBBCBCBBBBBBBCCBCBBBBCCBBBBBBBBBBBBCCBBCBBBBBBBBBBBBBBBB<*<BBBBBBBBB8BBBBB

                              HWI-EAS255_8282_FC30G79_PE:4:1:1:1908 69 * 0 0 * * 0 0 ATATTTTCGTTGGAAAAAAGCTCTTGTACTACAGAACAGATCCAGTTTTACGGATACGTGGCTTGAGAGGTAGAT B909B-6<'.?A;A>;6><[email protected]?*4+?+<B=22=2?42+*9?-''326B'''')3)6,'66'.3.72?
                              HWI-EAS255_8282_FC30G79_PE:4:1:1:1908 133 * 0 0 * * 0 0 AAGGGATCCATCCGACGTAGCGTCCGAAACAGTTAAGATGAGAGCACTTAGAAAACTCATCCAGCGAAGGATACA B92<?+<446'.9B;BB<4?7*4,1'?B6'39'.',<6),94;6,'=9'96=BBB.')3-<@+3),>49',B2*9

                              as I understand the flag meaning for 69, the query is unmapped, but the mate is mapped ... ????????

                              Moreover, I get some lines with MAPQ to 0 and flag to 99/147 with coherent insert size ... what could it be ??

                              Many Thanks

                              Aurélie

                              Comment


                              • #30
                                Hello again,
                                I have another question about this :
                                HWI-EAS255_8282_FC30G79_PE:4:1:402:627 97 chr4 49179087 0 75M chr21 9663639 0 GAACTGGATTCAAGCCAGTTAACTACTGGATCAAATCTGATCCTGGGCCCTGTCCCGTTTCTGTCATAACTTCTA BBB?BB>[email protected]<;<BBBBBB?BCBB6?BB=BB;@<B?BBBB0B6BB=BBBCBB8BBBBB;CBBB;BBB=8;BB?C? XT:A:R NM:i:2 SM:i:0 AM:i:0 X0:i:4 X1:i:7 XM:i:2 XO:i:0 XG:i:0 MD:Z:46A3G24
                                HWI-EAS255_8282_FC30G79_PE:4:1:402:627 145 chr21 9663639 37 75M chr4 49179087 0 TGGAACAGAAAATTGCTCAAGGAAACTCAGAGCTCAAAACACAAATCCATGGAGCTATGAACTCCGAGAGAGAAT @BBB?BB8BBBBBBBBB;;;[email protected]@AB;B<[email protected]=??BB;6'3BBBB;0=BBBBB XT:A:U NM:i:2 SM:i:37 AM:i:0 X0:i:1 X1:i:0 XM:i:2 XO:i:0 XG:i:0 MD:Z:20A40A13


                                how can one pair have a MAPQ at 37 and the other at 0 ..

                                and I have different flag value for other pairs mapped on different chr.
                                HWI-EAS255_8282_FC30G79_PE:4:1:24:1302 65 chr12 82786110 0 75M chr4 22777722 0 CACCTATGGGTGAGAATATGTGGTGTTTGGTTTTTTGTTCTTGTGATAGTTTACTGAGAATGATGATTTCCAATT BBBBBBBBBBBBB=BBA>ABBBBBBBBBBBBBCBBBBBBBCBBBBBBBBBBBBBB?+?=BBBBBBBBBBBBBBBB XT:A:R NM:i:1 SM:i:0 AM:i:0 X0:i:71 XM:i:1 XO:i:0 XG:i:0 MD:Z:8A66
                                HWI-EAS255_8282_FC30G79_PE:4:1:24:1302 129 chr4 22777722 0 75M chr12 82786110 0 TCCAAAAATGATAGACTGGATTAAGAAACTGTGGCACATATACACCATGGAATACTATGCAGCCATAAAAAATGA BCBBBBBBBBBBBBBBBBBBCCCBBBBBBBBBBBCCBBCCBCBBBBBBB?BBBBBCCBBBBBBBB;?2?BBBBBB XT:A:R NM:i:1 SM:i:0 AM:i:0 X0:i:55 XM:i:1 XO:i:0 XG:i:0 MD:Z:28A46

                                thanks
                                Aurélie

                                Comment

                                Working...
                                X