Hi All,
I'm sequencing 10 kb PCR products using Illumina 150 x 2 paired-end reads. I'm trying to optimize a de novo assembly workflow, and was hoping that I would find some help here. I outline the process below. My goal is to de novo assemble the PCR product into a single, accurate contig. Questions are in red, interleaved with the step in the protocol they refer to. Thanks for any help and feedback.
Program: Geneious
Input: Trimmed reads (FASTQ). The sequencing core trims the adapters and barcodes for me.
1) Pair the reads. This generates a single file, in which the pairs are now interleaved.
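For anyone unsure what "interleaved" means here, this is a minimal Python sketch of the idea. It is not Geneious's own implementation; it assumes plain 4-line FASTQ records and that R1 and R2 list the reads in the same order. The function names `read_fastq` and `interleave` are made up for illustration.

```python
from itertools import islice


def read_fastq(lines):
    """Yield FASTQ records as 4-tuples: (header, sequence, plus, quality)."""
    it = iter(lines)
    while True:
        record = tuple(islice(it, 4))
        if len(record) < 4:
            break  # end of input (or a truncated trailing record)
        yield tuple(line.rstrip("\n") for line in record)


def interleave(r1_lines, r2_lines):
    """Alternate R1 and R2 records so that each pair sits next to its mate.

    Assumes both inputs hold the same reads in the same order.
    """
    for rec1, rec2 in zip(read_fastq(r1_lines), read_fastq(r2_lines)):
        yield rec1
        yield rec2


# Tiny in-memory example instead of real files.
r1 = ["@a/1", "ACGT", "+", "IIII", "@b/1", "GGCC", "+", "IIII"]
r2 = ["@a/2", "TTGA", "+", "IIII", "@b/2", "CCAA", "+", "IIII"]
headers = [rec[0] for rec in interleave(r1, r2)]
print(headers)
```

With real data the two arguments would be open file handles for the R1 and R2 files.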
2) BBNorm (Default settings). Normalize reads to 100x coverage.
Would error correcting be beneficial?
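To put the 100x target in perspective, here is a back-of-envelope calculation in Python. It is only a rough sketch: BBNorm normalizes on k-mer depth rather than on a simple read count, so the real number of retained reads will differ. The function names are hypothetical, and the numbers are the ones from this post (10 kb amplicon, 150 bp reads).

```python
import math


def reads_for_coverage(target_coverage, amplicon_length, read_length):
    """Approximate number of reads needed to reach a mean coverage."""
    return math.ceil(target_coverage * amplicon_length / read_length)


def mean_coverage(n_reads, amplicon_length, read_length):
    """Approximate mean coverage given a read count (ignores trimming)."""
    return n_reads * read_length / amplicon_length


n = reads_for_coverage(100, 10_000, 150)
print(n)                              # individual reads, not pairs
print(mean_coverage(n, 10_000, 150))  # sanity check on the result
```

So roughly 6,700 reads (about 3,350 pairs) already give 100x over a 10 kb product, which is a small fraction of a typical run.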
3) BBMerge. Merge the reads remaining after normalization. Merge rate is set to "normal".
4) De Novo Assembly. I'm currently using the Geneious assembler. There are a lot of parameters that can be manipulated. The ones I'm using are attached as a screenshot. Please share if you think these parameters could be improved, and how.
How frequent are miscalls in Illumina sequencing? I'm not sure how much overlap between neighboring reads I should require, or how many mismatches I should allow within that overlap. Also, how often are insertions introduced during the Illumina process? Should I allow gaps within reads?
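As background to the miscall question: the quality string in each FASTQ record already encodes an estimated per-base error probability through the Phred scale, Q = -10 * log10(p). The sketch below converts qualities into an expected number of miscalls per read. It assumes Phred+33 encoding, and the function names are my own, not from any particular library.

```python
def phred_to_error_prob(q):
    """Convert a Phred quality score to an error probability."""
    return 10 ** (-q / 10)


def expected_errors(quality_string, offset=33):
    """Sum of per-base error probabilities for one read (Phred+33 assumed)."""
    return sum(phred_to_error_prob(ord(c) - offset) for c in quality_string)


print(phred_to_error_prob(30))      # Q30 means 1 error in 1,000 calls
print(expected_errors("I" * 150))   # a 150 bp read at Q40 throughout
```

Running this over the real reads would give an empirical feel for how many mismatches to tolerate in an overlap of a given length.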
5) Extract contig consensus sequences. Minimum Coverage = 15, otherwise a gap is called.
What is an acceptable minimum coverage? Since I'm sequencing a PCR product, I imagine I could increase this significantly. The benefit would be removing possible contaminating sequences.
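To make the minimum-coverage rule concrete, here is a small Python sketch of what I understand the consensus step to be doing: any position whose depth falls below the threshold is replaced by a gap. This is an assumption about the behaviour, not Geneious's actual code, and `mask_low_coverage` is a hypothetical name.

```python
def mask_low_coverage(consensus, coverage, min_coverage=15, gap_char="-"):
    """Replace consensus bases whose read depth is below min_coverage."""
    if len(consensus) != len(coverage):
        raise ValueError("consensus and coverage must have the same length")
    return "".join(
        base if depth >= min_coverage else gap_char
        for base, depth in zip(consensus, coverage)
    )


# Toy example: positions with depth 14 and 3 fall below the cutoff of 15.
print(mask_low_coverage("ACGTAC", [20, 15, 14, 3, 40, 15]))
```

Raising `min_coverage` would blank out low-depth contaminant contigs entirely, at the cost of also masking any genuinely thin regions of the amplicon.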
6) Map to reference. In an ideal world, only one of the contigs maps to the reference, and the others are background genomic DNA (my PCR reaction starts with a small amount of genomic DNA as template). If multiple contigs map, it could be because there were multiple viral genomes in the initial PCR reaction. In that case I have to throw the data out, as the individual genomes are too similar to differentiate accurately. So it's important that the de novo assembly is stringent, but not so stringent that true neighboring reads fail to assemble into a single contig.
After this I intend to look for Open Reading Frames.
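For the ORF step, this is the kind of simple scan I have in mind, as a hedged Python sketch rather than a replacement for a proper ORF finder. It assumes the standard stop codons, ATG as the only start codon, and no introns; `find_orfs` and `min_codons` are illustrative names.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Reverse-complement a DNA string (non-ACGT characters are kept as-is)."""
    return seq.translate(COMPLEMENT)[::-1]


def find_orfs(seq, min_codons=30):
    """Return (strand, start, end, orf) tuples for ATG...stop ORFs.

    Coordinates are 0-based, end-exclusive, and relative to the strand
    that was scanned. min_codons counts the ATG but not the stop codon.
    """
    seq = seq.upper()
    orfs = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            start = None
            for i in range(frame, len(s) - 2, 3):
                codon = s[i:i + 3]
                if start is None and codon == "ATG":
                    start = i  # first ATG opens the ORF in this frame
                elif start is not None and codon in STOP_CODONS:
                    if (i - start) // 3 >= min_codons:
                        orfs.append((strand, start, i + 3, s[start:i + 3]))
                    start = None  # look for the next ATG after this stop
    return orfs


toy = "CC" + "ATG" + "GCT" * 3 + "TAA" + "GG"
print(find_orfs(toy, min_codons=4))
```

On a real 10 kb contig a larger `min_codons` would be needed to avoid reporting many short spurious ORFs.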
Thanks again for any feedback/suggestions!
Jake