Dear fellow Bioinformaticians,
To put it briefly: my work deals with de novo de Bruijn graph-based assembly tools like Velvet and SOAPdenovo.
Since my background is more on the computer science side, I have a question regarding the underlying biology.
I am just going to sketch my understanding here; please correct me where it's wrong.
The DNA we get as input consists of the two separated nucleotide strands, cut into many pieces which are then amplified and sequenced (the average number of times each genomic position ends up being read is known as 'coverage').
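As a side note, the way I understand it, expected coverage is usually estimated with the Lander-Waterman relation C = N * L / G. A minimal sketch, with made-up numbers purely for illustration:

```python
# Expected coverage C = N * L / G (Lander-Waterman):
# N = number of reads, L = read length, G = genome size.
# All numbers below are hypothetical, purely for illustration.
N = 20_000_000   # reads
L = 100          # bases per read
G = 50_000_000   # genome size in bases

coverage = N * L / G
print(f"expected coverage: {coverage:.1f}x")  # -> 40.0x
```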
These pieces are then read out by sequencers (resulting in 'reads'), which have a distinct error rate because of the noisy nature of the read-out process. So we cut the reads into even smaller pieces, called k-mers, to verify their correctness: if a k-mer is counted only a small number of times, then given our coverage we can say it most likely does not occur in the actual genomic material.
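To make this concrete, here is a minimal sketch of the k-mer counting and filtering step as I understand it (the reads, k, and the cutoff are hypothetical toy values; real counters do this far more efficiently, but the principle should be the same):

```python
from collections import Counter

def kmers(read, k):
    """Yield all overlapping k-mers of a read."""
    for i in range(len(read) - k + 1):
        yield read[i:i + k]

# Hypothetical toy reads; in practice these come from a FASTQ file.
# The last read carries a sequencing error (G instead of T).
reads = ["ACGTACGTGA", "CGTACGTGAT", "ACGTACGAGA"]
k = 5
counts = Counter(km for read in reads for km in kmers(read, k))

# k-mers seen only once are likely sequencing errors at sufficient coverage.
cutoff = 2
trusted = {km: c for km, c in counts.items() if c >= cutoff}
print(trusted)  # the error-derived k-mers (count 1) are filtered out
```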
And here comes my main question:
Since we build the genome (in an ideal world), or rather our contigs and scaffolds, from this very large set of k-mers, how can we be sure we don't mix k-mers from the reverse complement strand with those from the other one?
It seems like these algorithms just try to find the "longest possible" solution, which might very well include "bridge elements" that did not occur in the actual genomic material.
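To make the concern concrete, here is a toy sketch (my own illustration, not any assembler's code) of why the two strands produce different k-mer strings for the same locus:

```python
# Reads may come from either strand, so the same genomic locus
# can yield two different k-mer strings.
COMP = str.maketrans("ACGT", "TGCA")

def revcomp(seq):
    """Reverse complement of a DNA sequence."""
    return seq.translate(COMP)[::-1]

kmer = "ACGTG"                # k-mer as read from the forward strand
print(revcomp(kmer))          # 'CACGT': the same locus seen from the other strand
print(kmer == revcomp(kmer))  # False: naively, two distinct graph nodes
```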
Can someone help me?
Greetz from Germany
Berend