  • Illumina paired end assembly; in silico normalization

    Dear all,
    I want to assemble a fungal genome of approx. 35 Mbp. My data are paired-end reads of 101 bp in length with two different insert lengths (300 bp and 600 bp).
    I decided to use Velvet for the assembly and searched for the best k-mer using VelvetOptimizer. I searched for the "best" k-mer on a small subset of my NGS data, using N50 and longest contig as the criteria for the best assembly.
    I have of course already read about the stagnation of assembly statistics once a specific level of coverage is reached; one cannot assemble more than the "real" genome in the end. BUT: what I observed is that the statistics drop when increasing the coverage (using the full data set).
    Inspecting the results of the full-set assembly and the subset assembly shows that long contigs from the subset assembly (median coverage 25x) are split into smaller contigs in the other assembly (median coverage 250x).
    Initially the genome was "over-sequenced", so the calculated expected coverage is about 1000x.
    Do you have any suggestions why that happens? And do you think in silico normalization might get rid of it? Using DigiNorm, for example, I would lose the PE information and even the quality information, if I'm right, because the output is always FASTA.
    I would be very happy if we could start a discussion about this. Has anyone of you observed that phenomenon before?

  • #2

    You can either subsample or normalize; sometimes one gives a better assembly than the other. BBNorm will not lose pairing or quality information.

    That package also contains Reformat, a tool that can do subsampling: reformat.sh in=reads.fq.gz out=sampled.fq.gz samplerate=0.04

    ...will subsample to 4%.
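    What samplerate does can be sketched with a toy Python model (not BBTools code; the function and read names here are made up for illustration): each pair is kept independently with the given probability, so pairing is preserved and the expected depth scales by the rate.

    ```python
    import random

    def subsample_pairs(pairs, rate, seed=0):
        """Keep each read pair with probability `rate`; mates stay together,
        so pairing information is preserved."""
        rng = random.Random(seed)
        return [p for p in pairs if rng.random() < rate]

    # toy data: each pair is (read 1 name, read 2 name)
    pairs = [(f"r{i}/1", f"r{i}/2") for i in range(10000)]
    kept = subsample_pairs(pairs, 0.04)
    print(len(kept) / len(pairs))  # roughly 0.04
    ```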

    With BBNorm, to normalize to 35x coverage: bbnorm.sh in=reads.fq.gz out=normalized.fq.gz target=35 -Xmx29g

    You may or may not need to set the -Xmx flag, depending on your environment. If you do, then set it to about 85% of the machine's physical memory.

    In my experience, normalization normally yields a better assembly than subsampling, particularly if you have variable coverage. The disadvantages of normalization compared to subsampling are that it is slower, uses more memory, and destroys information about repeat content, but I consider these disadvantages unimportant if the resultant assembly is better.

    Also, I find that Velvet, with default parameters, yields the best assembly at around 35-40x coverage.


    • #3
      Thank you for the reply! I will try again with BBNorm.


      • #4
        BBnorm target depth setting

        Hi Thread/Brian,

        I ran BBNorm on a pre-BBduk'd pair of read files. I plan to use the normalized reads with assemblers (Velvet or A5, well, A4 in this case).

        bbnorm.sh in1=R1_bbduk20.fq in2=R2_bbduk20.fq out1=R1_bbduk20_norm35.fq out2=R2_bbduk20_norm35.fq target=35

        The opening dialog states that the target depth is 140 on pass 1, then 35 on pass 2.

        Why the two passes?

        What are the "error reads/pairs/types"?

        Do these normalizations typically help a BBmap (or other) run as well?



        • #5

          There are two passes because normalization can selectively enrich the dataset for reads with errors, if no precautions are taken, since reads with errors appear to have rare kmers. To combat this, I first normalize down to a higher-than-requested level; or specifically, reads that appear error-free are normalized to the high level (140 in this case), and reads that appear to contain errors are normalized to the low level (35 in this case) on the first pass.

          So, after the first pass all reads will still have minimum depth 35 (so nothing was lost), but the new dataset will be selectively enriched for error-free reads. The second pass normalizes the remaining reads to the final target regardless of whether errors are suspected.

          It's not really possible to do this in a single pass because if (for example) half your reads contain errors, and error-free reads are randomly discarded at the target rate but error-containing reads are discarded at a higher rate, you will ultimately achieve only half of the desired final coverage.
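          The two-pass scheme can be modeled with a toy simulation (hypothetical numbers, not BBNorm's actual implementation; here a read at apparent depth d is kept with probability target/d):

          ```python
          import random

          def normalize(reads, depth, target, rng):
              """Keep each read with probability min(1, target/depth)."""
              return [r for r in reads if rng.random() < min(1.0, target / depth)]

          rng = random.Random(1)
          # toy region sequenced to 1000x: 600 error-free reads, 400 reads with errors
          clean = ["clean"] * 600
          errors = ["error"] * 400

          # pass 1: reads that look error-free are normalized to 140x,
          # reads that look error-containing to 35x
          p1 = normalize(clean, 1000, 140, rng) + normalize(errors, 1000, 35, rng)

          # pass 2: everything to the final 35x target, errors ignored
          p2 = normalize(p1, len(p1), 35, rng)

          print(len(p2))                      # about 35 reads
          print(p1.count("clean") / len(p1))  # clean fraction enriched well above 0.6
          ```

          After pass 1 the minimum depth is still at least the final target, but the surviving reads are disproportionately error-free; pass 2 then flattens everything to the requested depth.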

          You can set "passes=1" when running normalization, and look at the kmer frequency histogram afterward with "histout=x.txt". The histogram from a run with 2-pass normalization will have far fewer error kmers, which is obvious from a quick visual examination of the graph.

          Normalization is not useful for mapping in most circumstances. The only advantage it would convey is a decrease in runtime by reducing the size of the input, which is useful if you are using a huge reference (like nt) or a slow algorithm (like BLAST), but I don't use it for that. Error-correction (done by the same program as normalization, with the "ecc" flag) may be useful before mapping, though - it will increase mapping rates, though always with the possibility that the ultimate output may be slightly altered. So, I would not error-correct data before mapping, either, except when using a mapping algorithm that is very intolerant of errors.

          I designed BBNorm as a preprocessing step for assembly. Normalizing and/or error-correcting data with highly uneven coverage - single cell amplified data and metagenomes (and possibly transcriptomes, though I have not tried it) - can yield a much better assembly, particularly with assemblers that are designed for isolates with a fairly flat distribution, but even on assemblers designed for single-cell or metagenomic assembly. Also, normalization can allow assembly of datasets that would otherwise run out of memory or take too long. Depending on the circumstance, it often yields better assemblies with isolates too, but not always.

          I'm sure there are other places where BBNorm can be useful, but I don't recommend it as a general-purpose preprocessing step prior to any analysis - just for cases where you can achieve better results by reducing data volume, or flattening the coverage, or reducing the error rate.

          Oh... and as for the error reads/pairs/types, that shows a summary of the input data quality, and fractions of reads or pairs that appear to contain errors. The "types" indicate which heuristic classified the read as appearing to contain an error; that's really there for testing and I may remove it.


          • #6
            BBduk goose.

            Thanks Brian,

            That's much clearer and makes sense.

            I noticed that FastQC complained more about kmers after the normalization, and I can see why that could happen.


            • #7
              Hi Brian,
              I used the following command to normalize my over-sequenced genome:
               bbnorm.sh in=species_L004_paired.fastq out=normalized_L004_80x.fastq target=80
              Maybe this is the same question as mentioned before, but I expect that errors, represented by low-frequency kmers, carry more weight after normalization. Is that so?
              By the way, even after the first run of your tool my assembly statistics improved. You did a great job! Thank you!
              Also, did I miss giving the pairing argument?


              • #8

                1) BBNorm was designed to prioritize discarding of reads that appear to contain errors, so typically, there are fewer low-frequency kmers after normalization. When you normalize, you can set "hist=khist_input.txt" and "histout=khist_output.txt". The first file will get the frequency histogram of kmers before normalization, and the second one after normalization, so you can see how the process changed the distribution.

                2) I'm glad it was helpful!

                3) If the input is interleaved, the program will autodetect that, as long as the reads follow the standard Illumina naming patterns, so it should be fine. I will update it to print a message indicating whether it is processing the data as paired or not. You can force a file to be interpreted as interleaved with the "interleaved=t" flag.
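                 The autodetection heuristic amounts to checking that consecutive read names pair up; a simplified version for two common Illumina naming styles (an illustration only, not BBNorm's actual code):

                 ```python
                 def mates_match(name1, name2):
                     """Heuristic pair check: 'id/1' + 'id/2', or
                     'id 1:...' + 'id 2:...' (same id, mate 1 then mate 2)."""
                     if name1.endswith("/1") and name2.endswith("/2"):
                         return name1[:-2] == name2[:-2]
                     p1, p2 = name1.split(maxsplit=1), name2.split(maxsplit=1)
                     if len(p1) == 2 and len(p2) == 2:
                         return p1[0] == p2[0] and p1[1][0] == "1" and p2[1][0] == "2"
                     return False

                 def looks_interleaved(names):
                     """True if consecutive records pair up as mate1/mate2."""
                     return len(names) % 2 == 0 and all(
                         mates_match(names[i], names[i + 1])
                         for i in range(0, len(names), 2))

                 print(looks_interleaved(["r1/1", "r1/2", "r2/1", "r2/2"]))  # True
                 print(looks_interleaved(["r1/1", "r2/1"]))                  # False
                 ```

                 Names that follow neither convention defeat the heuristic, which is why forcing the mode with a flag is safer when you already know the layout.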


                • #9
                  Hi Brian,
                  I just do not know if it was successful with respect to the pairing... I got the following exception using DigiNorm:
                  ** ERROR: Error: Improperly interleaved pairs
                  That's why I was wondering if maybe there are some problems in the interleaved mode. Would your software give a hint if it is not possible to run in interleaved mode?
                  Can you also tell me what the best practice is for running two or more iterative steps with BBNorm?


                  • #10

                    You can run Reformat to verify that interleaving is valid, like this:

           reformat.sh in=reads.fq out=null vint

                     It will print a message indicating whether or not the interleaving is correct, based on the read names. So you can run it on the file from before normalization, and on the file after, to check. Another program, BBSplitPairs, can fix files with broken interleaving:

           bbsplitpairs.sh in=reads.fq out=fixed.fq fint

                    This is very fast and requires very little memory, but will only work on files that were interleaved, then some of the reads were thrown away. If the reads are arbitrarily disordered, you can run this:

           repair.sh in=reads.fq out=fixed.fq repair

                    This requires a lot of memory, though, as it may potentially store up to half of the reads in memory.
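                     The memory cost comes from buffering unmatched mates until their partner shows up; a minimal repair sketch (assumes 'id/1'/'id/2' names and toy data, purely illustrative):

                     ```python
                     def repair(reads):
                         """Re-pair arbitrarily ordered reads: buffer unmatched mates in a
                         dict (worst case about half the reads), and emit (read1, read2)
                         once both mates have been seen."""
                         pending = {}
                         pairs = []
                         for name, seq in reads:
                             base, mate = name[:-2], name[-1]  # assumes 'id/1' / 'id/2' naming
                             if base in pending:
                                 other = pending.pop(base)
                                 pairs.append((other, (name, seq)) if mate == "2"
                                              else ((name, seq), other))
                             else:
                                 pending[base] = (name, seq)
                         singles = list(pending.values())  # mates that never found a partner
                         return pairs, singles

                     reads = [("a/1", "ACGT"), ("b/2", "TTTT"), ("a/2", "GGGG"), ("c/1", "AAAA")]
                     pairs, singles = repair(reads)
                     print(pairs)    # [(('a/1', 'ACGT'), ('a/2', 'GGGG'))]
                     print(singles)  # [('b/2', 'TTTT'), ('c/1', 'AAAA')]
                     ```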

                     If BBNorm broke pairing, it is much better to redo normalization with the "int=t" flag to force reads to be treated as pairs than to try to fix the corrupted output. If you run it with "int=t" it will completely ignore the read names, and not give any sort of error if they don't match, since the names could have been changed or could be in some format that is not recognized. If you run with "int=f" then, similarly, the input will be forced to be treated as single-ended. And if you run without the flag, it will autodetect, but unfortunately it doesn't currently tell you whether the data was processed single-ended or not; you can run this program:


                     ...which will print whether or not a file is detected as interleaved (and various other things). If it says a file contains single-ended ASCII-33 fastq reads, then all of my programs will treat that file as such (with the exception of the pair-fixing tools run with the 'fint' flag, because those are designed for corrupted files).

                    I'm not really sure what you mean by running iterative steps with BBNorm - it runs two passes by default, transparently, as that yields the best assemblies in my testing. You can run multiple iterations yourself if you want (using the output as input), but there's no reason to do that unless the coverage comes out above your target. I have not found that multiple iterations of error-correction are particularly useful, either; one seems to be fine. So if you use this command line:

           bbnorm.sh in=reads.fq out=normalized.fq int=t ecc=t hist=khist.txt histout=khist2.txt ow

                    ...then it will do three passes:

                    1a) Kmer counting of input kmers for frequency histogram
                    1b) Error correction
                    1c) Normalization biased against reads with errors

                    2) Unbiased normalization

                    3) Kmer counting for output frequency histogram

                    ...which should give you optimal results. The "ow" flag is optional and tells it to overwrite output files if they already exist; the "ecc" flag is optional and usually makes things better, but not always; and of course the "int" flag is optional but I always include it if I know whether or not the reads are paired, since autodetection is reliant on reads having Illumina standard names.

                    I hope that helps!

