  • Very confused about the FLAG in SAM files

    I know the generic description for each flag value from looking it up on Google, but I'm still confused.

    Why does the flag sometimes describe the mate and at other times the read itself?

    Aren't we only concerned with the read?

    Read unmapped vs. mate unmapped? What is the relevance of the mate?

    Read reverse strand vs. mate reverse strand? What is the relevance of the mate?

    First in pair and second in pair?



    I am specifically dealing with paired-end reads that I mapped to a reference to generate these SAM files. Now that I am supposed to analyze them, I feel lost and do not know what to do.


    Edit:

    I used these links to try and understand the flag part better, but the descriptions are very generic. I feel as if a picture would possibly help me understand better?
    Last edited by prs321; 10-15-2013, 09:51 AM.

  • #2
    All the annotations are indeed with respect to the read in question.

    However, knowing something about the mate can be important. For instance, a pair with one read mapped and the mate unmapped may point to things like a viral insertion (say your reference is E. coli and a virus has inserted itself into the host genome: one read maps to the reference and its mate does not). If you want to work with "clean" reads that map properly to the reference, you would filter out the pairs where one read mapped and the other did not. And if you were mapping FASTQ reads to a reference in order to remove contaminating reads that come from that reference, you would keep the pairs that did not map.

    Read reverse strand and mate reverse strand: for example, suppose you are working with DNA regions that have been inverted (DNA break repair?). Instead of a standard paired-end layout

    ->-------
    -------<-

    you might have
    R1
    ->-------
    ------->-
    R2
    If you are R1, your mate is /not/ on the reverse strand (it actually looks like the forward strand of the reference, since your sample had the inversion). Or if you are R2, /you/ are /not/ on the reverse strand. Perhaps it is more appropriate to say "read /looks like reverse strand of reference/" or "read's mate /looks like reverse strand of reference/".
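Since the FLAG column is just a sum of bits, it can help to decode it bit by bit. Here is a minimal Python sketch of that idea. The bit values are the ones defined in the SAM specification; the function name and the labels are only illustrative.

```python
# Bit values are defined by the SAM specification; the labels are informal.
SAM_FLAG_BITS = [
    (0x1, "paired"),
    (0x2, "proper_pair"),
    (0x4, "read_unmapped"),
    (0x8, "mate_unmapped"),
    (0x10, "read_reverse"),
    (0x20, "mate_reverse"),
    (0x40, "first_in_pair"),
    (0x80, "second_in_pair"),
    (0x100, "secondary"),
    (0x200, "qc_fail"),
    (0x400, "duplicate"),
    (0x800, "supplementary"),
]


def decode_flag(flag):
    """Return the labels of every bit that is set in a SAM FLAG value."""
    return [label for bit, label in SAM_FLAG_BITS if flag & bit]


if __name__ == "__main__":
    for flag in (99, 147, 73, 133):
        print(flag, decode_flag(flag))
```

For example, a flag of 73 describes a mapped first-in-pair read whose mate is unmapped, and 133 is what the unmapped second-in-pair mate's own record would typically carry. Every bit describes the record it sits on; some of those bits just happen to be statements about that record's mate.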

    Edit: "First in pair" is usually called "R1"; "second in pair" is your second read, often called "R2". You can double-check by parsing a SAM file as follows:

    Code:
    samtools view -bS -f 64 sam.sam | bamtools convert -format fastq -in - -out sam_f64.fastq
    Explanation: samtools view to get reads that have the 64 bit (everything that is R1). Convert to bam, feed to bamtools, convert to fastq.
    Code:
    samtools view -bS -f 128 sam.sam | bamtools convert -format fastq -in - -out sam_f128.fastq
    Explanation: samtools view to get reads that have the 128 bit (everything that is R2). Convert to bam, feed to bamtools, convert to fastq.

    You can run head on your FASTQ files and double-check that the flags really do separate read one from read two. If your FASTQ read names come out in the style ending in /1 and /2, you can grep each file for the suffix you don't expect to see in it.
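If you would rather check the 64 and 128 bits without converting anything, the same split can be sketched in plain Python on the SAM text. This is only an illustration: the function name and the example records are made up, and it assumes an uncompressed, tab-separated SAM file.

```python
def split_by_mate(sam_lines):
    """Sort (name, flag) pairs into first-in-pair (64) and second-in-pair (128)."""
    first, second = [], []
    for line in sam_lines:
        if not line.strip() or line.startswith("@"):
            continue  # skip blank lines and header lines
        fields = line.rstrip("\n").split("\t")
        name, flag = fields[0], int(fields[1])
        if flag & 64:
            first.append((name, flag))
        if flag & 128:
            second.append((name, flag))
    return first, second


# Made-up records for illustration only (11 tab-separated SAM columns).
EXAMPLE_SAM = [
    "@HD\tVN:1.0\tSO:unsorted",
    "pairA\t99\tchr1\t100\t60\t4M\t=\t200\t104\tACGT\tIIII",
    "pairA\t147\tchr1\t200\t60\t4M\t=\t100\t-104\tTTGA\tIIII",
    "pairB\t73\tchr1\t500\t60\t4M\t=\t500\t0\tGGCA\tIIII",
    "pairB\t133\tchr1\t500\t0\t*\t=\t500\t0\tCCAT\tIIII",
]

if __name__ == "__main__":
    first, second = split_by_mate(EXAMPLE_SAM)
    print("R1:", first)
    print("R2:", second)
```

On a real file you would pass an open file handle instead of the example list, since the function only needs something it can iterate over line by line.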

    Edit: An older thread of mine where I ask for help on the subject, and receive few answers.

    Edit 2: What are you trying to do with your reads? Is the reference "bad" and you want the reads that don't map to it? Is the reference "good" and you want the reads that map to it? Are your data mate-pair/jumping-read with outie orientation (in which case default mappers can sometimes flag the pairs as improper, especially if they assume innie is proper)?
    Last edited by winsettz; 10-15-2013, 11:48 AM.


    • #3
      I am just trying to compare how well the mapping worked for raw reads, reads processed with cutadapt, and reads processed with Scythe.

      I'm not too sure how good the assembly is. I've been told that it could be better.

      What are innie and outie?


      • #4
        Innie and outie refer to the orientation of the reads. In standard paired-end sequencing, the DNA is sheared and each fragment is read 5' to 3' from both strands. Since the 5' ends are on the outside of both strands, the reads point towards the middle (-> <-).

        With mate-pair, a long section of DNA is taken and circularized, then the region around the join is excised, so that the two reads represent points that are far apart in genomic space. But due to the nature of the cut, the reads actually point outwards.

        5'start-----------------------------end3'
        circle------end3'5'start------circle
        snip

        R1>----end3'5'start----<R2

        Which when compared to actual genome is
        5'start-----<R2---------------R1>-------end3'
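To tie the innie/outie picture back to the flags, here is a rough Python sketch that classifies a pair from a single record's FLAG, POS and PNEXT columns. It is a simplification under stated assumptions: both reads map to the same reference sequence, only leftmost coordinates are compared, and the function name is made up for illustration. Real mappers also consider insert size when deciding what counts as a proper pair.

```python
def pair_orientation(flag, pos, pnext):
    """Classify a mapped pair from one record's FLAG, POS and PNEXT.

    Simplified: assumes both reads are on the same reference sequence and
    compares leftmost mapping coordinates only.
    """
    if flag & 0x4 or flag & 0x8:
        return "unmapped read or mate"
    read_reverse = bool(flag & 0x10)
    mate_reverse = bool(flag & 0x20)
    if read_reverse == mate_reverse:
        return "same strand"  # e.g. the inversion case, -> ->
    # Pick the strand of whichever read starts further left on the reference.
    leftmost_is_reverse = read_reverse if pos <= pnext else mate_reverse
    # Left read forward, right read reverse: -> <- (innie).
    # Left read reverse, right read forward: <- -> (outie).
    return "outie" if leftmost_is_reverse else "innie"


if __name__ == "__main__":
    print(pair_orientation(99, 100, 200))   # forward read on the left, mate reverse
    print(pair_orientation(83, 100, 200))   # reverse read on the left, mate forward
    print(pair_orientation(65, 100, 200))   # both reads forward
```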

        If you're assessing how well your reads mapped, samtools flagstat is your friend.

        Code:
        samtools flagstat mybam.bam
        Example output
        Code:
        in total (QC-passed reads + QC-failed reads)	18808442
        duplicates	0
        mapped	345583
        paired in sequencing	18808442
        read1	9404221
        read2	9404221
        properly paired	329846
        with itself and mate mapped	332866
        singletons	12717
        w mate mapped to different chr	636
        w mate mapped to different chr (mapQ >= 5)	188
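For comparing runs (raw vs. trimmed reads, say), it is usually easier to turn those counts into percentages. A small Python sketch using the numbers from the example output above; the dictionary keys and the helper name are just illustrative.

```python
# Counts copied from the example flagstat output above.
counts = {
    "total": 18808442,
    "mapped": 345583,
    "properly_paired": 329846,
    "both_mapped": 332866,   # "with itself and mate mapped"
    "singletons": 12717,     # read mapped, mate unmapped
}


def percent(part, whole):
    """Return part as a percentage of whole."""
    return 100.0 * part / whole


if __name__ == "__main__":
    for key in ("mapped", "properly_paired", "singletons"):
        print(f"{key}: {percent(counts[key], counts['total']):.2f}% of all reads")
    # Mapped reads are either in a pair where both mapped, or singletons.
    print(counts["both_mapped"] + counts["singletons"] == counts["mapped"])
```

In this example only about 1.8% of the reads mapped at all, so the mapped fraction is the first number to compare between your raw, cutadapt and Scythe runs.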
