  • offspring
    replied
    It's still unfixed (as of TopHat 2.1.1) and unlikely to be fixed at all, since TopHat is not really being developed anymore (the developers focus on HISAT2, its successor).

    Did you use TopHat via bcbio-nextgen by any chance? That fixes the unmapped reads file for you automatically; other frameworks may do the same.



  • Macspider
    replied
    Yes, it is a bug in TopHat. Didn't they fix it in TopHat2? I recently used it and the flags were alright.



  • offspring
    replied
    It's a bug in TopHat (all versions); it doesn't set the 0x8 bit ("next segment in the template unmapped") when both reads are unmapped.

    This is one of the issues TopHat-Recondition fixes (https://bmcbioinformatics.biomedcent...859-016-1058-x , https://github.com/cbrueffer/tophat-recondition).
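The repair described above can be sketched in a few lines. This is only an illustration of the bit logic, not how TopHat-Recondition is actually implemented; the function name `repair_pair_flags` is made up for this example, and it assumes you already have the two flags of one read pair.

```python
READ_UNMAPPED = 0x4  # this read is unmapped
MATE_UNMAPPED = 0x8  # next segment in the template is unmapped


def repair_pair_flags(flag1, flag2):
    """Set 0x8 on each record whose mate carries 0x4 (hypothetical helper)."""
    fixed1 = flag1 | MATE_UNMAPPED if flag2 & READ_UNMAPPED else flag1
    fixed2 = flag2 | MATE_UNMAPPED if flag1 & READ_UNMAPPED else flag2
    return fixed1, fixed2


# The pair from the question below: both reads unmapped, 0x8 missing on both.
print(repair_pair_flags(69, 133))
# A correctly flagged pair is left alone.
print(repair_pair_flags(99, 147))
```

With both mates unmapped, 69 and 133 become 77 and 141, the values dpryan mentions.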



  • caiosuz
    replied
    Crazy TopHat unmapped reads

    Did anyone have the same problem with these unmapped reads?



  • dpryan
    replied
    Your aligner seems to have a bug: the flags should be 77 and 141 if both mates are unmapped.



  • caiosuz
    replied
    What's going on here?

    I don't know if I correctly understand the meaning of the read index in SAM files.

    This information is present in the Flags description:

    'Next, we have the cases when only one read in a pair is mapped.
    69 - 0001000101 - First read in pair. This read is unmapped but its mate is mapped.
    133 - 0010000101 - 2nd in pair. Read unmapped but mate is mapped.'


    So, does this mean that if I have a read with flag 133 or 69, its mate can't be present in the unmapped reads file?
    I am assuming that reads with the same index (in this case "M03092:8:000000000-AG2GN:1:2117:2591:14346") are paired. Am I correct? If I am wrong, I understand what happened, but I'd like to know what these lines with the same index are.

    Following this line of thought (same index, paired reads), why are there so many lines of my unmapped paired reads like this?

    M03092:8:000000000-AG2GN:1:2117:2591:14346 69
    M03092:8:000000000-AG2GN:1:2117:2591:14346 133

    Can anyone explain what's going on with these reads?



  • Smurali
    replied
    Years later, this is still pretty darn useful.
    Thanks!



  • dan
    replied
    Thanks for this post, do you accept doge tips?



  • kgulukota
    replied
    That is true seq_lover. Which combinations you consider "all good" and which ones "weird" depends on how you constructed your library. Thank you for putting it so succinctly.



  • seq_lover
    replied
    I assume kgulukota is giving an example for a mate-pair library (SOLiD) and swbarnes2 is giving an example for paired-end reads (Illumina). I think both of them are correct. Please correct me if I am wrong.



  • sterding
    replied
    Very useful! Thanks



  • kgulukota
    replied
    Originally posted by liu_xt005:
    Thank you so much! This is so useful!
    Why did not see it hot?
    Liu_xt005 -
    I am glad you found it useful. I am not sure why this did not show up hot. But your reply did promote it there. So, thanks!

    Gulu



  • liu_xt005
    replied
    Thank you so much! This is so useful!
    Why did not see it hot?



  • swbarnes2
    replied
    Personally, the ones and zeros aren't helpful to me. I don't think of 147 as "0010010011", but as "128+16+2+1", and I remember what all those numbers stand for. And in most contexts, having both reads map in the forward direction or both map in the reverse direction is not all good, it's weird.

    The four good numbers to remember are 64+16+2+1=83, 64+32+2+1=99, 128+16+2+1=147 and 128+32+2+1=163. Something is very wrong if you ever see both 128 and 64 together, and with most current technologies, you should see 16 or 32, but not both. If you see both, or don't see either, your reads are paired strangely.
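The "sum of powers of two" reading can be automated with a small lookup. This is a minimal sketch covering only the eight low bits discussed in this thread; the names `FLAG_BITS` and `decompose` are made up for the example, and the descriptions are paraphrased rather than quoted from the SAM specification.

```python
# Low eight SAM flag bits, with paraphrased meanings.
FLAG_BITS = {
    1: "read is paired",
    2: "mapped in a proper pair",
    4: "read unmapped",
    8: "mate unmapped",
    16: "read on reverse strand",
    32: "mate on reverse strand",
    64: "first in pair",
    128: "second in pair",
}


def decompose(flag):
    """Return the set bits of a flag, largest first, e.g. 147 -> [128, 16, 2, 1]."""
    return [bit for bit in sorted(FLAG_BITS, reverse=True) if flag & bit]


for flag in (83, 99, 147, 163):
    bits = decompose(flag)
    print(flag, "=", "+".join(str(bit) for bit in bits))
    for bit in bits:
        print("   ", bit, "-", FLAG_BITS[bit])
```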



  • kgulukota
    started a topic SAM flag idioms

    SAM flag idioms

    "There are 10 types of people in this world: those who assimilated binary numbers and those who didn't."

    I definitely belong to the 10'th type and hence SAM Flags are a chore. They may be a very compact way of communicating a lot of info about an alignment, but how do we humans learn them? I know it is kind of nerdy to actually look through SAM files but, what can I say? Mea culpa.

    Anyway, this post is my attempt to understand them like a natural language i.e. recognize some idiomatic representations in flags. If you already know these, you are a "binar" and way ahead of us humans on this topic.

    You can use this handy little web page for specific flags:


    However, to "speak SAM", we must know these flags without having to refer to a web page for each line. So, here are some simple idioms.

    Unpaired Reads

    For unpaired reads, the flags are very easy to recognize because there are only 3 values:
    • 4 - 0000000100 - means "this is an unpaired read and is not mapped".
    • 16 - 0000010000 - "this unpaired read is mapped in the reverse orientation".
    • 0 - 0000000000 - "this unpaired read is mapped in the forward orientation".
    I guess it is theoretically possible to have a flag of 20 meaning "unpaired, unmapped read presented in reverse orientation" - however, I doubt any software will do that. Perhaps, that is our first SAM joke: Did you hear about AnnoyingAlign? It is the software that 20's all unpaired, unmapped reads - just to get on users' nerves.
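If you would rather check the ten-digit strings than trust them, Python's built-in `format` does the conversion. A tiny sketch; the helper name `as_bits` is made up, and the width of 10 simply matches the notation used in this post.

```python
def as_bits(flag, width=10):
    """Render a flag as a zero-padded binary string."""
    return format(flag, f"0{width}b")


for flag in (0, 4, 16, 20):
    print(flag, "-", as_bits(flag))

# Going the other way: binary string back to the decimal flag.
print(int("0000010100", 2))
```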

    Paired Reads

    For paired reads, bit 0 (value 1) HAS to be set. Hence all flags for paired reads HAVE to be odd. In other words, as long as only the bits covered in this post are in play, every even-numbered flag other than 0, 4 and 16 (and AnnoyingAlign's 20) is meaningless; higher bits such as 256 ("not primary alignment") are a separate story. (Good progress. We can recognize nonsense words. Writing a Jabberwocky poem with these flags is left as an exercise for the reader).

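    For paired reads, odd flags in the interval [65-127] have bit 64 set and belong to the first read of a pair; odd flags in [129-191] have bit 128 set and belong to the second read. Flags in [193-255] have both 64 and 128 set, and odd flags below 64 have neither, so neither of those ranges identifies a read as first or second. (These ranges ignore the bits above 128.)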
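Rather than memorizing intervals, the read number can be read straight off bits 64 and 128. A minimal sketch; `read_in_pair` and its return strings are made up for this example.

```python
def read_in_pair(flag):
    """Classify a flag by its 0x1, 0x40 and 0x80 bits (hypothetical helper)."""
    if not flag & 0x1:
        return "unpaired"
    first = bool(flag & 0x40)
    second = bool(flag & 0x80)
    if first and second:
        return "both 64 and 128 set"
    if first:
        return "first in pair"
    if second:
        return "second in pair"
    return "neither 64 nor 128 set"


for flag in (16, 65, 99, 129, 147, 193, 3):
    print(flag, "-", read_in_pair(flag))
```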


    "All Good"

    Some values mean "all good" i.e. that both reads in the pair have aligned:
    • 65 - 0001000001 - this is the first read in the pair and both reads aligned to the forward strand.
    • 129 - 0010000001 - this is the second read of the pair and both reads aligned to the forward strand.

    NOTE: 67 (0001000011) and 131 (0010000011) also mean the same as 65 and 129 with the added assurance that "the pair is properly aligned" meaning that they mapped within a proper distance from each other.
    Sometimes both reads of a pair are flipped (reverse complemented) before mapping. If so, you get 113 or 177.
    • 113 - 0001110001 - "this is the first read of a pair, both reads in pair were flipped and both mapped".
    • 177 - 0010110001 - "this is the second read of a pair, both reads in pair were flipped and both mapped".

    Other times only one of the reads in a pair is flipped though both of them map:
    • 81 - 0001010001 - "this is the first read of pair, both reads mapped, we had to flip this read, but mate is in forward orientation".
    • 161 - 0010100001 - "this is second read, this one is forward but we flipped its mate and both reads mapped".

    NOTE: 163 (0010100011) and 83 (0001010011) are the same as 161 and 81 except "it is in a proper pair".
    • 97 - 0001100001 - "this is first read, its mate is flipped but this is forward. Both mapped".
    • 145 - 0010010001 - "this is second read. it is flipped but its mate is not. Both mapped".

    NOTE: 99 (0001100011) and 147 (0010010011) are the same as 97 and 145 except with "proper mapping in pair".
    Exercise: Can you see why the number of reads with flag 113 must be equal to the number of reads with flag 177? Similarly, the count for 81 must match the count for 161, and 97 must match 145. If those numbers don't match, something went wrong with your aligner.

    "All Bad"
    At the other end of the spectrum we have "all bad" i.e. neither the read nor its mate mapped:

    • 77 - 0001001101 - First in pair, both reads in pair unmapped. "All bad".
    • 141 - 0010001101 - Second in pair and "all bad".

    • Exercise: Just like with 20, AnnoyingAlign puts flags of 93 or 125 on all unmapped pairs. What other flags can AnnoyingAlign use to maximize user annoyance?
    • Exercise: Why are 79 and 143 particularly good words for Jabberwocky?
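For anyone who wants to hunt for Jabberwocky words mechanically, here is a rough sketch of a plausibility check. The three rules are just my reading of how the low eight bits relate to each other, not an official validator, and the function name `contradictions` is made up.

```python
def contradictions(flag):
    """List the ways the low eight bits of a flag contradict each other."""
    problems = []
    # Bits 2, 8, 32, 64 and 128 only make sense when the read is paired.
    if not flag & 0x1 and flag & (0x2 | 0x8 | 0x20 | 0x40 | 0x80):
        problems.append("pair-only bits set on an unpaired read")
    # A proper pair needs both reads mapped.
    if flag & 0x2 and flag & (0x4 | 0x8):
        problems.append("proper pair claimed although a read is unmapped")
    # A read should not be first and second at the same time.
    if flag & 0x40 and flag & 0x80:
        problems.append("both first-in-pair and second-in-pair set")
    return problems


for flag in (77, 79, 143, 99, 195):
    print(flag, contradictions(flag) or "looks consistent")
```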
    Only one read maps

    Next, we have the cases when only one read in a pair is mapped.
    • 69 - 0001000101 - First read in pair. This read is unmapped but its mate is mapped.
    • 137 - 0010001001 - second in pair. Read is mapped but mate is unmapped.
    • 73 - 0001001001 - First read in pair. This read is mapped but its mate is not.
    • 133 - 0010000101 - 2nd in pair. Read unmapped but mate is mapped.

    Can you again see why the number of reads with flag 69 must be the same as the number of reads with flag 137, and likewise for 73 and 133?
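The matching counts fall out of a simple symmetry: a read's flag and its mate's flag are mirror images in three bit pairs (4/8, 16/32, 64/128). A minimal sketch, assuming the aligner flags both records of a pair consistently and that any other bits are shared by both mates (which need not hold for higher bits such as "duplicate"); `expected_mate_flag` is a made-up name.

```python
def expected_mate_flag(flag):
    """Swap the read/mate bit pairs to get the flag the mate should carry."""
    swapped_pairs = ((0x4, 0x8), (0x10, 0x20), (0x40, 0x80))
    # Start from the bits that are not part of any swapped pair (e.g. 1 and 2).
    mate = flag & ~(0x4 | 0x8 | 0x10 | 0x20 | 0x40 | 0x80)
    for read_bit, mate_bit in swapped_pairs:
        if flag & read_bit:
            mate |= mate_bit
        if flag & mate_bit:
            mate |= read_bit
    return mate


for flag in (113, 81, 97, 69, 73, 77, 83, 99):
    print(flag, "<->", expected_mate_flag(flag))
```

Every read with flag 69 therefore has a mate with flag 137, so the two counts have to agree; the same goes for each of the other pairings.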

    There are of course many other combinations. The purpose here is not to enumerate them but to simply have some fun with the structure of these flags.

    What is your favorite flag? Do you have other ways of remembering what these things mean as you look through SAM files?
