Hi,
I've been using Tophat to align 76bp paired-end data to the human genome.
I used samtools to convert the .BAM file to .SAM format and when looking at the aligned reads I came across the following questions.
- I understand that for unstranded libraries (like mine) the XS tag on reads aligned across splice junctions is essential for Cufflinks. What I don't understand is why some reads have multiple alignments at the same position that differ only in the XS tag. Here is an example:
IL28_5635:5:100:10006:3569 137 chr1 160325534 3 14M105N62M * 0 0 CAGACACTGCCAAGGCCCTGGCAGATGTGGCCACGGTGCTGGGACGTGCTCTGTATGAGCTTGCAGGAGGAACCAA %%%%%%%$%%%%%%%%%%%%%%%%%#%"%%%%%#%%!%$#%$$"#$"$#$""$""#"####$#####""""#"""# NM:i:0 XS:A:- NH:i:2 CC:Z:= CP:i:160325534
IL28_5635:5:100:10006:3569 137 chr1 160325534 3 14M105N62M * 0 0 CAGACACTGCCAAGGCCCTGGCAGATGTGGCCACGGTGCTGGGACGTGCTCTGTATGAGCTTGCAGGAGGAACCAA %%%%%%%$%%%%%%%%%%%%%%%%%#%"%%%%%#%%!%$#%$$"#$"$#$""$""#"####$#####""""#"""# NM:i:0 XS:A:+ NH:i:2
When I use the genomeCoverageBed function from BEDtools aren't these reads counted twice? Can this somehow be fixed?
- Do people filter the Tophat output according to flags or MAPQ qualities?
I would keep the reads that are properly paired (flags: 83, 99, 147, 163) and any singletons (flags: 73, 89, 137, 153), but I don't understand the other flags that appear (flags: 65, 81, 97, 113, 115, 129, 145, 161, 177, 179). What do they mean, and should these be filtered out?
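
The flag values can be unpacked bit by bit using the FLAG bit definitions from the SAM specification. A minimal Python sketch follows; `FLAG_BITS` and `decode_flag` are illustrative names made up for this example, not part of samtools or Tophat.

```python
# SAM FLAG bits as defined in the SAM specification.
# The short labels are this example's own, chosen for readability.
FLAG_BITS = [
    (0x1, "paired"),
    (0x2, "proper_pair"),
    (0x4, "unmapped"),
    (0x8, "mate_unmapped"),
    (0x10, "reverse"),
    (0x20, "mate_reverse"),
    (0x40, "first_in_pair"),
    (0x80, "second_in_pair"),
    (0x100, "secondary"),
    (0x200, "qc_fail"),
    (0x400, "duplicate"),
    (0x800, "supplementary"),
]


def decode_flag(flag):
    """Return the labels of all bits that are set in a SAM FLAG value."""
    return [name for bit, name in FLAG_BITS if flag & bit]


if __name__ == "__main__":
    # Flags mentioned in the question.
    for flag in (83, 99, 147, 163, 73, 89, 137, 153,
                 65, 81, 97, 113, 115, 129, 145, 161, 177, 179):
        print(flag, ",".join(decode_flag(flag)))
```

Decoded this way, 65, 81, 97, 113, 129, 145, 161 and 177 are paired reads with both mates mapped but without the proper-pair bit (0x2), while 115 and 179 have the proper-pair bit set with both mates on the reverse strand. Whether to drop them depends on the downstream analysis.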
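
On the double-counting question, one possible workaround is to collapse records that are identical in everything except their optional tags before computing coverage. The sketch below is only an illustration under that assumption: `drop_xs_only_duplicates` is a hypothetical helper, not a Tophat or BEDtools feature, and it simply keeps the first record it sees, so which XS strand survives is arbitrary.

```python
def drop_xs_only_duplicates(sam_lines):
    """Yield SAM lines, skipping repeats of the same alignment.

    Two records count as the same alignment when QNAME, FLAG, RNAME,
    POS, CIGAR and SEQ all match; optional tags such as XS are ignored.
    Header lines (starting with '@') are passed through unchanged.
    """
    seen = set()
    for line in sam_lines:
        if line.startswith("@"):
            yield line
            continue
        fields = line.rstrip("\n").split("\t")
        # Columns: 0 QNAME, 1 FLAG, 2 RNAME, 3 POS, 5 CIGAR, 9 SEQ
        key = (fields[0], fields[1], fields[2], fields[3],
               fields[5], fields[9])
        if key in seen:
            continue  # same alignment already emitted
        seen.add(key)
        yield line


if __name__ == "__main__":
    import sys
    # Usage sketch: samtools view -h in.bam | python this_script.py > out.sam
    for kept in drop_xs_only_duplicates(sys.stdin):
        sys.stdout.write(kept)
```

Note that the NH tag on the surviving record is left untouched, so it will still report the original number of hits.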