Recently i've been optimising Hisat2 to work with the latest version of the human genome. According to its manual, Hisat2 reports the number of valid alignments within the tag NH:i:N, where N is the number of valid alignments.
At the same time, Hisat2 also outputs an alignment summary at the end of a run, here's an example:
1000000 reads; of these:
1000000 (100.00%) were unpaired; of these:
436566 (43.66%) aligned 0 times
403452 (40.35%) aligned exactly 1 time
159982 (16.00%) aligned >1 times
56.34% overall alignment rate
I decided to isolate uniquely mapping reads from multimappers and based myself on the NH:i tag, here's the code i used to isolate the uniquely mapping reads:
cat logs/run0000.o7572309 | awk '{if($2 < 256) print $0}' | grep -P '^@|NH:i:1\b'
here, the awk section gets rid of all secondary alignments, while grep is supposed to isolate all the reads containing the NH:i:1 tags (i.e. uniquely mapping).
There seems to be a discrepancy between the number of uniquely mapping reads reported at the end of the alignment:
403452 (40.35%) aligned exactly 1 time
and the ones that contain the NH:i:1 tag:
500082
is there something i'm overlooking? why the discrepancy?
At the same time, Hisat2 also outputs an alignment summary at the end of a run, here's an example:
1000000 reads; of these:
1000000 (100.00%) were unpaired; of these:
436566 (43.66%) aligned 0 times
403452 (40.35%) aligned exactly 1 time
159982 (16.00%) aligned >1 times
56.34% overall alignment rate
I decided to isolate uniquely mapping reads from multimappers and based myself on the NH:i tag, here's the code i used to isolate the uniquely mapping reads:
cat logs/run0000.o7572309 | awk '{if($2 < 256) print $0}' | grep -P '^@|NH:i:1\b'
here, the awk section gets rid of all secondary alignments, while grep is supposed to isolate all the reads containing the NH:i:1 tags (i.e. uniquely mapping).
There seems to be a discrepancy between the number of uniquely mapping reads reported at the end of the alignment:
403452 (40.35%) aligned exactly 1 time
and the ones that contain the NH:i:1 tag:
500082
is there something i'm overlooking? why the discrepancy?