I have whole-genome paired-end DNA sequence data from an Illumina HiSeq 2000. I aligned the reads to the reference genome with BWA v0.5.9 using the default paired-end parameters.
I have detected an unusual region in the alignment. The region is around 430 base pairs long, has excessive coverage (>40-fold in a sample sequenced to ~6x), has an excessive proportion of orphan reads (~40%), and includes only one known repeat (an RNA repeat covering around 1/3 of the region). GC content is 54%. Each edge of the region is demarcated by partially mapped reads that are all clipped at the same base position (clipped at the start of the read at the 5' edge of the region, and at the end of the read at the 3' edge). The clipped, unmapped portions all agree with one another in sequence, and BLAT matches them to repeat elements.
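In case it helps anyone check the clipping pattern described above, here is a minimal sketch of how the clip boundaries could be tallied from SAM text. It assumes standard SAM columns (FLAG in column 2, POS in column 4, CIGAR in column 6) and that the clipping is recorded as soft clips (`S`) in the CIGAR string; the function name `clip_boundaries` is just an illustrative name, not part of any tool.

```python
import re
from collections import Counter

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")
REF_CONSUMING = set("MDN=X")  # CIGAR operations that advance along the reference


def clip_boundaries(sam_lines):
    """Count soft-clip boundaries by reference position (1-based).

    Returns (left, right): left counts reads soft-clipped at their start,
    keyed by the first aligned base; right counts reads soft-clipped at
    their end, keyed by the last aligned base.
    """
    left, right = Counter(), Counter()
    for line in sam_lines:
        if line.startswith("@"):  # skip header lines
            continue
        fields = line.rstrip("\n").split("\t")
        flag, pos, cigar = int(fields[1]), int(fields[3]), fields[5]
        if flag & 0x4 or cigar == "*":  # read itself is unmapped
            continue
        # Ignore hard clips so that a soft clip next to one is still seen.
        ops = [(int(n), op) for n, op in CIGAR_OP.findall(cigar) if op != "H"]
        if not ops:
            continue
        ref_len = sum(n for n, op in ops if op in REF_CONSUMING)
        if ops[0][1] == "S":
            left[pos] += 1
        if ops[-1][1] == "S":
            right[pos + ref_len - 1] += 1
    return left, right
```

If nearly all of the counts pile up on a single position at each edge, that matches the "truncated to the same base position" pattern I see in tview.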
Here is what I am puzzled about:
At the 5' end of this region, as viewed in samtools tview, 100% of the orphans map to the reverse strand, while ~70% of the non-orphan reads map to the forward strand. The 3' end of the region shows the opposite trend: 100% of the orphans map to the forward strand, and ~70% of the non-orphans map to the reverse strand. The unmapped mates of the orphan reads all include repeat sequence (usually simple DNA repeats, some LINE elements).
I can understand that sequence-specific strand bias may exist due to technicalities of the library prep and sequencing process. What I don't understand is why I have a seemingly opposite bias between orphan reads and non-orphan reads.
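For anyone who wants to reproduce the strand tallies above rather than eyeballing tview, here is a rough sketch that reads the FLAG field of SAM text. It assumes "orphan" means a paired read that is itself mapped while its mate is unmapped (FLAG bit 0x8 set, bit 0x4 clear); the function name `strand_tally` is only illustrative.

```python
from collections import Counter

# Standard SAM FLAG bits
PAIRED = 0x1
UNMAPPED = 0x4
MATE_UNMAPPED = 0x8
REVERSE = 0x10


def strand_tally(sam_lines):
    """Count mapped paired reads by (orphan status, strand)."""
    tally = Counter()
    for line in sam_lines:
        if line.startswith("@"):  # skip header lines
            continue
        flag = int(line.split("\t")[1])
        # Only paired reads that are themselves mapped are of interest.
        if not flag & PAIRED or flag & UNMAPPED:
            continue
        kind = "orphan" if flag & MATE_UNMAPPED else "paired"
        strand = "reverse" if flag & REVERSE else "forward"
        tally[(kind, strand)] += 1
    return tally
```

The input could be the lines printed by `samtools view` for a coordinate window at each edge of the region, run on a sorted and indexed BAM.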
All comments greatly appreciated.