Dear all,
I would like to hear your suggestions on what amount of low-frequency k-mers in Illumina reads is normal.
I am asking because I am having a hard time finding a good assembly strategy for two ~100 Mb invertebrate genomes I just received. Most of what worked for my first genome from a similar species does not work now: I get very different results from different assemblers (MaSuRCA, dipSPAdes, Platanus), and sometimes they crash.
The differences between the datasets (all genomes around 100 Mb):
- old Illumina dataset: 80x coverage, 150 bp PE reads, 450 bp insert
- new Illumina datasets: 160x coverage, 125 bp PE reads, 450 bp insert
The main difference seems to be the amount of low-frequency k-mers in the reads. To give you an idea: after trimming one sample with platanus_trim, the 32-mer histogram from Platanus shows 400 million k-mers that occur only once. The Hammer correction module of dipSPAdes likewise reports that 80% of k-mers are singletons. A Platanus run with the old (trimmed) dataset showed only 400k singleton 32-mers.
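(In case it helps to make "singleton 32-mers" concrete, this is what I mean by the histogram, as a toy Python sketch for tiny inputs. Real datasets of course need a dedicated counter such as the one built into Platanus or a standalone k-mer counter; the function name here is just illustrative.)

```python
from collections import Counter

def kmer_histogram(reads, k=32):
    """Count canonical k-mers across reads, then bucket them by multiplicity.

    Toy illustration only: an in-memory Counter does not scale to
    real Illumina datasets.
    """
    counts = Counter()
    comp = str.maketrans("ACGT", "TGCA")
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            if "N" in kmer:
                continue
            # canonical form: lexicographic min of k-mer and its reverse complement
            rc = kmer[::-1].translate(comp)
            counts[min(kmer, rc)] += 1
    # histogram: hist[m] = number of distinct k-mers seen exactly m times
    hist = Counter(counts.values())
    return hist

hist = kmer_histogram(["ACGTACGTAC", "ACGTACGTAC", "TTTTGGGGCC"], k=4)
print(hist[1])  # number of singleton 4-mers
```

The number I am worried about is `hist[1]`: 400 million singletons in the new data versus 400k in the old data.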
So I am going back and forth with trimming (Trimmomatic, platanus_trim), correction (Hammer) and normalization (BBNorm). MaSuRCA, however, has its own built-in pipeline for correction and trimming, so I feed it the reads exactly as I received them. But while MaSuRCA gave me the best assembly last time, with the new datasets it gives me by far the worst.
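(For clarity, by normalization I mean the digital-normalization idea that BBNorm implements: drop reads from regions that are already covered deeply enough. Roughly like this toy sketch; the real tool streams the data with probabilistic counting over multiple passes, and the names below are just illustrative.)

```python
from collections import Counter
from statistics import median

def normalize(reads, k=20, target=40):
    """Discard reads whose median k-mer abundance already exceeds target.

    Toy one-pass digital normalization: counts are updated online, so
    high-coverage regions stop accumulating reads once they saturate,
    while low-coverage (and error-containing) reads are kept.
    """
    counts = Counter()
    kept = []
    for read in reads:
        kmers = [read[i:i + k] for i in range(len(read) - k + 1)]
        if not kmers:
            continue
        if median(counts[km] for km in kmers) < target:
            kept.append(read)
            for km in kmers:
                counts[km] += 1
    return kept
```

My hope was that flattening coverage this way would also make the error k-mers easier for the assemblers to handle, but so far it has not rescued the assemblies.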
Are there reasons other than sequencing errors or metagenomic contamination for such a large amount of low-frequency k-mers? From my experience, at least, I don't think contamination of the genomic DNA during isolation is responsible here.
Any suggestions for a better assembly?
Thank you!