Hi all,
I am new to bioinformatics. I am parallelizing a code that use BWA for alignment.
Can someone please explain to me in general how the indexing is happening?
I ask because the previous coder index and run alignment by calling BWA. then, the code "reads-in the reference geneome"; the code extract subsequence from the read-in reference by using the starting position that appears in the alignment file. For instance;
assume we have the following read: ACCCA
the reference has two chromosomes:
>chr1
GGGGGTTTTT
>chr2
AAACCCCCCCC
and assume the read is aligned to ACCCC in the second chromosome, where the starting position is: 13--I think BWA will produce 13.
the code reads in the reference and stores every chromosomes in a separate data structure. The problem is when when the code gets the subsequence "ACCCA" from the reference, it uses the starting position (13) to extract the subsequence. The way the reference is stored leads to out of range (13 is more than the 11 bases stored in chr2) and the program crashes. and even if the 13 is in range, we will be reading the wrong subsequence.
I need to know the BWA indexing mechanism so I get the correct subsequence from the read-in reference. I tried to add the length prefix to get the correct subsequence, but it did not work, where there are some dependencies in the code that may led to the problem.
Thanks.
I am new to bioinformatics. I am parallelizing a code that use BWA for alignment.
Can someone please explain to me in general how the indexing is happening?
I ask because the previous coder index and run alignment by calling BWA. then, the code "reads-in the reference geneome"; the code extract subsequence from the read-in reference by using the starting position that appears in the alignment file. For instance;
assume we have the following read: ACCCA
the reference has two chromosomes:
>chr1
GGGGGTTTTT
>chr2
AAACCCCCCCC
and assume the read is aligned to ACCCC in the second chromosome, where the starting position is: 13--I think BWA will produce 13.
the code reads in the reference and stores every chromosomes in a separate data structure. The problem is when when the code gets the subsequence "ACCCA" from the reference, it uses the starting position (13) to extract the subsequence. The way the reference is stored leads to out of range (13 is more than the 11 bases stored in chr2) and the program crashes. and even if the 13 is in range, we will be reading the wrong subsequence.
I need to know the BWA indexing mechanism so I get the correct subsequence from the read-in reference. I tried to add the length prefix to get the correct subsequence, but it did not work, where there are some dependencies in the code that may led to the problem.
Thanks.
Comment