Dear all,
I would like to hear your suggestions on what amount of low-frequency k-mers in Illumina reads is normal.
I am asking because I am having a hard time finding a good assembly strategy for two ~100 Mb invertebrate genomes I just received. Most of what worked for my first genome from a similar species does not work now: I get very different results from different assemblers (MaSuRCA, dipSPAdes, Platanus), and sometimes they crash.
The differences between the datasets (all genomes around 100 Mb):
- old Illumina dataset: 80x coverage, 150 bp PE reads, 450 bp insert
- new Illumina datasets: 160x coverage, 125 bp PE reads, 450 bp insert
The main difference seems to be the amount of low-frequency k-mers in the reads. To give you an idea: after trimming one sample with platanus_trim, the 32-mer histogram from Platanus shows 400 million k-mers that occur only once. The Hammer correction module of dipSPAdes likewise reports that 80% of k-mers are singletons. A Platanus run with the old (trimmed) dataset showed only 400k singleton 32-mers.
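(In case it helps to make "singleton 32-mers" concrete, this is what I mean by the histogram, as a toy Python sketch for tiny inputs. Real datasets of course need a dedicated counter such as the one built into Platanus or a standalone k-mer counter; the function name here is just illustrative.)

```python
from collections import Counter

def kmer_histogram(reads, k=32):
    """Count canonical k-mers across reads, then bucket them by multiplicity.

    Toy illustration only: an in-memory Counter does not scale to
    real Illumina datasets.
    """
    counts = Counter()
    comp = str.maketrans("ACGT", "TGCA")
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            if "N" in kmer:
                continue
            # canonical form: lexicographic min of k-mer and its reverse complement
            rc = kmer[::-1].translate(comp)
            counts[min(kmer, rc)] += 1
    # histogram: hist[m] = number of distinct k-mers seen exactly m times
    hist = Counter(counts.values())
    return hist

hist = kmer_histogram(["ACGTACGTAC", "ACGTACGTAC", "TTTTGGGGCC"], k=4)
print(hist[1])  # number of singleton 4-mers
```

The number I am worried about is `hist[1]`: 400 million singletons in the new data versus 400k in the old data.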
So I am going back and forth with trimming (Trimmomatic, platanus_trim), correction (Hammer) and normalization (BBNorm). MaSuRCA, however, has its own built-in pipeline for correction and trimming, so I feed it the reads exactly as I received them. But while MaSuRCA gave me the best assembly last time, with the new datasets it gives me by far the worst.
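(For clarity, by normalization I mean the digital-normalization idea that BBNorm implements: drop reads from regions that are already covered deeply enough. Roughly like this toy sketch; the real tool streams the data with probabilistic counting over multiple passes, and the names below are just illustrative.)

```python
from collections import Counter
from statistics import median

def normalize(reads, k=20, target=40):
    """Discard reads whose median k-mer abundance already exceeds target.

    Toy one-pass digital normalization: counts are updated online, so
    high-coverage regions stop accumulating reads once they saturate,
    while low-coverage (and error-containing) reads are kept.
    """
    counts = Counter()
    kept = []
    for read in reads:
        kmers = [read[i:i + k] for i in range(len(read) - k + 1)]
        if not kmers:
            continue
        if median(counts[km] for km in kmers) < target:
            kept.append(read)
            for km in kmers:
                counts[km] += 1
    return kept
```

My hope was that flattening coverage this way would also make the error k-mers easier for the assemblers to handle, but so far it has not rescued the assemblies.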
Are there reasons other than sequencing errors or metagenomic contamination for such a large amount of low-frequency k-mers? From my experience, at least, I don't think contamination of the genomic DNA during isolation is responsible here.
Any suggestions for a better assembly?
Thank you!