Hello,
We are a Belgian research team studying the approximately 4 Mbp genome of a bacterial plant pathogen (and we are new to NGS data analysis). We are getting some unexpected results during de novo assembly of our target genome using a combination of a paired-end and a mate-pair library. No good reference genome is available, so de novo assembly is our only option. We would like to share some of our results for your consideration. Maybe some of you can tell us whether this is a normal result, or whether we are doing something wrong here.
First off, the data-sets:
1. One Illumina GA paired-end short-read set (50 bp reads, 350 Mb, 375 bp insert), which gives us a theoretical 70x coverage.
2. One Illumina HiSeq mate-pair short-read set (100 bp reads, 500 Mb, 5 kb insert), which gives us a combined 160x coverage.
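As a sanity check on the coverage figures, here is a minimal sketch of the usual estimate (total sequenced bases divided by genome size). The function name is hypothetical, and the sketch assumes that "Mb" means megabases of raw sequence and that the genome is exactly 4 Mbp. Under those assumptions the raw yields give somewhat higher values than the figures quoted above, which may simply reflect trimming, filtering, or a different genome size estimate.

```python
def estimated_coverage(total_bases: float, genome_size: float) -> float:
    """Average sequencing depth: total sequenced bases / genome size."""
    if genome_size <= 0:
        raise ValueError("genome_size must be positive")
    return total_bases / genome_size


# Assumed values taken from the post: ~4 Mbp genome, yields in bases.
GENOME_SIZE = 4_000_000
pe_cov = estimated_coverage(350_000_000, GENOME_SIZE)  # paired-end set
mp_cov = estimated_coverage(500_000_000, GENOME_SIZE)  # mate-pair set

print(f"PE: {pe_cov:.1f}x, MP: {mp_cov:.1f}x, combined: {pe_cov + mp_cov:.1f}x")
```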
When we used the PE set alone for de novo assembly in CLC-Bio, we got 478 contigs with an N50 of about 20 kbp. Looking at the contigs, we saw that repetitive fragments (IS sequences) were the major cause of the contig break-up. Based on the literature, we thought most of these gaps could be closed if we combined the PE set with an extra MP dataset.
However, when we combine both sets in a de novo assembly with CLC-Bio’s beta assembler (plugin in v4.8), we get 493 contigs and an N50 of 22 kb.
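For readers comparing the two assemblies, a minimal sketch of how N50 is commonly computed from a list of contig lengths: sort the contigs from longest to shortest and report the length at which the running total first reaches half of the assembly size. The function name `n50` and the toy lengths are illustrative only, not our actual contigs, and individual assemblers may report the statistic slightly differently.

```python
from typing import Iterable


def n50(lengths: Iterable[int]) -> int:
    """Return the N50 of a set of contig lengths (0 for an empty set)."""
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        # First contig at which the cumulative length covers half the assembly.
        if running >= half_total:
            return length
    return 0


# Toy example: total = 30, half = 15; 10 + 8 = 18 reaches it at the 8 bp contig.
toy_contigs = [2, 10, 4, 8, 6]
print(n50(toy_contigs))
```

Because N50 depends only on the length distribution, two assemblies with nearly the same contig count, such as 478 and 493 here, can be expected to have similar values.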
When we try to scaffold the 478 contigs of the PE-only assembly with the MP set in SSPACE, we can reduce them to 63 scaffolds, but the program has to introduce some 300,000 Ns into the sequence (total 4.2 Mb) to accomplish this. DNAStar also has problems with the Illumina 1.9 format from the HiSeq 2000. Does anybody have experience using HiSeq data with this software?
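To put the number of Ns in perspective, here is a minimal sketch that counts N bases and gap runs in a scaffold sequence and reports the gap fraction. The function name `gap_stats` and the toy sequence are hypothetical; the last lines simply apply the same ratio to the approximate figures above (about 300,000 Ns in 4.2 Mb).

```python
import re


def gap_stats(sequence: str) -> tuple[int, int, float]:
    """Return (number of N bases, number of N runs, fraction of N bases)."""
    seq = sequence.upper()
    runs = re.findall(r"N+", seq)  # each match is one contiguous gap
    n_bases = sum(len(run) for run in runs)
    fraction = n_bases / len(seq) if seq else 0.0
    return n_bases, len(runs), fraction


# Toy scaffold with two gaps, for illustration only.
print(gap_stats("ACGTNNNNACGNNA"))

# Approximate figures from the post: ~300,000 Ns in a 4.2 Mb scaffold set.
approx_fraction = 300_000 / 4_200_000
print(f"about {approx_fraction:.1%} of the scaffolded sequence is N")
```

With these approximate figures, roughly 7% of the scaffolded sequence consists of gaps.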
Does anybody here have a clue what we are doing wrong and how we could improve this, or is there a logical explanation why the MP-set is not giving us a better gap closure?
Thank you for any remarks/suggestions!