Hi
I have aligned (and filtered) my paired-end sequence data to two alternative reference genomes (e.g. different species). I would like to get a list of reads that align to both references and a list of reads that only align to one. My approach so far is as follows:
1) convert my final bam files to sam files
samtools view -h -o Ref1.sam Ref1.bam
samtools view -h -o Ref2.sam Ref2.bam
2) extract column 1 (sequence IDs) from both files (additional grep step to remove the header)
awk '{print $1}' Ref1.sam > Ref1Ids
grep -v '^@' Ref1Ids > Ref1Ids_Final
(repeat for Ref2 sam file)
3) ?
Now I'm thinking about using basic linux commands to compare the text files listing sequence IDs. But I'm having trouble figuring out how. I think the "comm" command requires that the files be "sorted" but when I use the "sort" command on my files, the resulting order of read IDs in the files doesn't make sense (e.g. read names are composed of both strings and numbers; the strings are constant but the numbers vary and are out of a logical order following the sort command).
Anyone have any suggestions? Or a better way to get a list of read IDs that differ between two bam alignment files?
Thanks!
I have aligned (and filtered) my paired-end sequence data to two alternative reference genomes (e.g. different species). I would like to get a list of reads that align to both references and a list of reads that only align to one. My approach so far is as follows:
1) convert my final bam files to sam files
samtools view -h -o Ref1.sam Ref1.bam
samtools view -h -o Ref2.sam Ref2.bam
2) extract column 1 (sequence IDs) from both files (additional grep step to remove the header)
awk '{print $1}' Ref1.sam > Ref1Ids
grep -v '^@' Ref1Ids > Ref1Ids_Final
(repeat for Ref2 sam file)
3) ?
Now I'm thinking about using basic linux commands to compare the text files listing sequence IDs. But I'm having trouble figuring out how. I think the "comm" command requires that the files be "sorted" but when I use the "sort" command on my files, the resulting order of read IDs in the files doesn't make sense (e.g. read names are composed of both strings and numbers; the strings are constant but the numbers vary and are out of a logical order following the sort command).
Anyone have any suggestions? Or a better way to get a list of read IDs that differ between two bam alignment files?
Thanks!
Comment