I am using Illumina sequencing to verify the sequences of constructs made in the lab. What I need to get from the data are positions on the reference sequence where the Illumina reads diverge from the expectation so I can flag potential errors in the constructs.
I have written a pipeline in Perl using Bio:: DB::Sam to traverse my alignments and identify SNPs and short INDELs, which is working fine for my case but I am struggling a bit to find a way of flagging long differences from the reference to the sequenced reads ("structural variations" basically).
Where I have such differences, e.g. a long fragment of the cloned insert is deleted, I can see them clearly in bam alignment viewers by the lack of read coverage and the fact that hardclipped reads accumulate at the site where the deletion starts. I thought I could use this fact to identify the deletion borders but there is no easy way (unless I have missed something) to get a list of the hard-clipped positions in reference sequence coordinates.
I can do this in Bio:: DB::sam if I follow the cigar alignment string for each read to the current mapping position to the reference in a "pileup" callback but a) it's very slow and b) I'm worried that I am overlooking edge cases.
So, before I reinvent the wheel, could anybody recommend a better tool for the job? I have looked at Dellyrecently, which looks great but wouldn't be useful for me because it identifies deletions by the insert-lengths distribution of paired end reads which I can't use (longish explanation...)
Thanks so much for your help!
I have written a pipeline in Perl using Bio:: DB::Sam to traverse my alignments and identify SNPs and short INDELs, which is working fine for my case but I am struggling a bit to find a way of flagging long differences from the reference to the sequenced reads ("structural variations" basically).
Where I have such differences, e.g. a long fragment of the cloned insert is deleted, I can see them clearly in bam alignment viewers by the lack of read coverage and the fact that hardclipped reads accumulate at the site where the deletion starts. I thought I could use this fact to identify the deletion borders but there is no easy way (unless I have missed something) to get a list of the hard-clipped positions in reference sequence coordinates.
I can do this in Bio:: DB::sam if I follow the cigar alignment string for each read to the current mapping position to the reference in a "pileup" callback but a) it's very slow and b) I'm worried that I am overlooking edge cases.
So, before I reinvent the wheel, could anybody recommend a better tool for the job? I have looked at Dellyrecently, which looks great but wouldn't be useful for me because it identifies deletions by the insert-lengths distribution of paired end reads which I can't use (longish explanation...)
Thanks so much for your help!