I have mapped reads from illumina sequencing onto a reference genome using bwa aln and I wanted to get some regions of the bam files using samtools view:
samtools view -o test_region.sam merged.sort.bam 1:100-1000
however I kept getting the error:
[bam_parse_region] fail to determine the sequence name.
I have realized that I used an NCBI version of the reference genome which resulted in the reference sequence name in the bam file to be that of the NCBI sequence names. For example, here is bit of the bam file:
EBRI093151_0050:4:117:5648:11807#0 99 gi|358485511|ref|NC_006088.3| 66454 60 101M = 66940 587 CTCCAAGATCATCCAGTCCACCCATCCACCCACCACCAATGTAACCCCATTAAAACACGTCCCTCAGTACCACATCTAAATGTTTCTTGAGCACTGGCAGG fffffdffffefffffefeffffeffffffffffffffcdefedffdfeffdfaeffffdeffffdedfffeffe`f`ccdfadfffdaYd^cd`bbd`dc XT:A:U NM:i:0 SM:i:37 AM:i:37 X0:i:1 X1:i:0 XM:i:0 XO:i:0 XG:i:0 MD:Z:101
Notice how instead of "chr1" it has "gi|358485511|ref|NC_006088.3|" as the sequence name. I believe this is the cause of the parsing issue. However, now I cannot seem to figure out how to pull a region from the bam file. I tried using:
samtools view -o test_region.sam merged.sort.bam NC_006088.3:1-1000
But I get the same error as before:
[bam_parse_region] fail to determine the sequence name.
If I put this instead:
samtools view -o test_region.sam merged.sort.bam NC_006088.3
It seems to do as one would expect and pull all read mappings for that sequence region, however, I would like to pull specific coordinates from each chromosome.
Does anybody have some ideas of how I could do this? I was thinking about replacing all the reference sequence names with their generic equivalent (ie "chr1,chr2 etc") however I am not certain that would solve my issue here.
Thank you!
samtools view -o test_region.sam merged.sort.bam 1:100-1000
however I kept getting the error:
[bam_parse_region] fail to determine the sequence name.
I have realized that I used an NCBI version of the reference genome which resulted in the reference sequence name in the bam file to be that of the NCBI sequence names. For example, here is bit of the bam file:
EBRI093151_0050:4:117:5648:11807#0 99 gi|358485511|ref|NC_006088.3| 66454 60 101M = 66940 587 CTCCAAGATCATCCAGTCCACCCATCCACCCACCACCAATGTAACCCCATTAAAACACGTCCCTCAGTACCACATCTAAATGTTTCTTGAGCACTGGCAGG fffffdffffefffffefeffffeffffffffffffffcdefedffdfeffdfaeffffdeffffdedfffeffe`f`ccdfadfffdaYd^cd`bbd`dc XT:A:U NM:i:0 SM:i:37 AM:i:37 X0:i:1 X1:i:0 XM:i:0 XO:i:0 XG:i:0 MD:Z:101
Notice how instead of "chr1" it has "gi|358485511|ref|NC_006088.3|" as the sequence name. I believe this is the cause of the parsing issue. However, now I cannot seem to figure out how to pull a region from the bam file. I tried using:
samtools view -o test_region.sam merged.sort.bam NC_006088.3:1-1000
But I get the same error as before:
[bam_parse_region] fail to determine the sequence name.
If I put this instead:
samtools view -o test_region.sam merged.sort.bam NC_006088.3
It seems to do as one would expect and pull all read mappings for that sequence region, however, I would like to pull specific coordinates from each chromosome.
Does anybody have some ideas of how I could do this? I was thinking about replacing all the reference sequence names with their generic equivalent (ie "chr1,chr2 etc") however I am not certain that would solve my issue here.
Thank you!
Comment