Dear all,
working with SOAPdenovo2, I noticed a (to my mind) strange behaviour.
I am trying to assemble simulated MiSeq data of a fungal genome. With about 12 million 2x250 bp reads as input, I expect the assembly length to reach around 38 Mbp.
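As a quick sanity check on sequencing depth, here is a minimal sketch of the coverage these numbers imply. It is not stated above whether "12 million" counts individual reads or read pairs, so both cases are computed; the function name `expected_coverage` is purely illustrative.

```python
def expected_coverage(n_reads: int, read_len: int, genome_size: int) -> float:
    """Mean sequencing depth: total sequenced bases divided by genome size."""
    return n_reads * read_len / genome_size


GENOME_SIZE = 38_000_000  # expected assembly length (about 38 Mbp)
READ_LEN = 250            # 2x250 bp MiSeq reads

# Assumption A: "12 million" counts individual reads.
cov_if_reads = expected_coverage(12_000_000, READ_LEN, GENOME_SIZE)

# Assumption B: "12 million" counts read pairs, i.e. 24 million reads.
cov_if_pairs = expected_coverage(2 * 12_000_000, READ_LEN, GENOME_SIZE)

print(f"{cov_if_reads:.0f}x if 12 M reads, {cov_if_pairs:.0f}x if 12 M pairs")
```

Under either reading the depth is roughly 79x or 158x, so the input should not be limited by coverage.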
Yet, with rd_len_cutoff set to 250, I observe distressing results:
- total scaffold length: 100,681
- average scaffold length: 466
- N50: 146 (!!)
With rd_len_cutoff = 100, the results are significantly better:
- total scaffold length: 34,867,163
- average scaffold length: 13,657
- N50: 29,759
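For anyone who wants to compare the two runs on equal terms, the sketch below shows how N50 can be computed from a list of scaffold lengths, and how the number of scaffolds follows from the totals and averages reported above. It assumes the usual N50 definition (the length of the scaffold at which the cumulative sum, taken from longest to shortest, first reaches half of the total). The helper names are illustrative and are not part of SOAPdenovo2.

```python
def n50(lengths):
    """Length L such that scaffolds of length >= L hold at least half of all bases."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    return 0  # empty input


def implied_scaffold_count(total_length, average_length):
    """Scaffold count implied by a reported total and (rounded) average length."""
    return round(total_length / average_length)


# Figures reported above, keyed by rd_len_cutoff: (total length, average length).
reported = {250: (100_681, 466), 100: (34_867_163, 13_657)}

for cutoff, (total, average) in reported.items():
    count = implied_scaffold_count(total, average)
    print(f"rd_len_cutoff={cutoff}: about {count} scaffolds")
```

Because the reported averages are rounded, the implied counts are approximate: about 216 scaffolds for rd_len_cutoff = 250 and about 2,553 for rd_len_cutoff = 100.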
(In all cases, kmer length is set to 63, which I empirically found to yield the best results for any read length.)
As far as I can tell after inspecting the read length range between 100 and 250, there seems to be a negative linear correlation between read length and assembly quality.
(Number of scaffolds decreases with rd_len_cutoff, while N50, assembly size and average scaffold length increase.)
Is there any explanation for this behaviour?
Thank you in advance and best regards!