Hello everyone,
I’m assembling a eukaryotic genome for the first time, working with a non-model plant species, and I could use some insight. My data consist of a full lane of Illumina HiSeq V4 2x125 reads with an insert size of ~350 bp. Before starting the assembly, I used flow cytometry to estimate nuclear DNA content, which returned 2C = 0.82 pg, or about 800 Mb, for a haploid genome size of about 400 Mb. However, k-mer counting with programs such as Jellyfish predicts a genome size of less than half that, about 190 Mb, and sure enough, when I run the assemblies, the sum of scaffold lengths is always in the range of 170-215 Mb.
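For anyone who wants to check my arithmetic, here is a minimal sketch of the conversion. It assumes the commonly used factor of 1 pg of DNA ≈ 978 Mb and that the haploid size is simply half of 2C; the variable names are just for illustration.

```python
# Rough sanity check of the genome size numbers in this post.
# Assumption: 1 pg of DNA corresponds to roughly 978 Mb.
MB_PER_PG = 978.0


def pg_to_mb(picograms):
    """Convert a DNA mass in picograms to megabases."""
    return picograms * MB_PER_PG


two_c_pg = 0.82                  # flow cytometry estimate (2C)
two_c_mb = pg_to_mb(two_c_pg)    # about 800 Mb
one_c_mb = two_c_mb / 2          # haploid size, assuming 1C = 2C / 2

kmer_estimate_mb = 190.0         # genome size predicted from k-mer counts
fraction_recovered = kmer_estimate_mb / one_c_mb

print(f"2C: {two_c_mb:.0f} Mb, 1C: {one_c_mb:.0f} Mb")
print(f"k-mer estimate covers {fraction_recovered:.0%} of the 1C size")
```

So the k-mer estimate accounts for a bit less than half of the flow cytometry value, and the remainder is what I am trying to explain.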
Does anyone have any idea why the nuclear genome size is so much larger than what I’ve been able to assemble? My first hypothesis is heavy repeat content, but I need a way to show that my reads support it, and I’m brand new to repeat analysis. I’m sure my organism’s genome contains a sizeable set of repeats. Is there a way to estimate the approximate proportion of the total genome made up of repeats, given that I’m confident in my nuclear genome size?
Any related thoughts or comments would be much appreciated!