Hi all,
We are using Illumina 150bp paired-end reads to perform de novo assembly for a bacterial genome (~5Mb). Our procedure goes like this:
1. merge the paired-end reads into a single file
2. trim the reads using Q20 as the cutoff (i.e., remove all positions following the first low quality base)
3. discard reads that are <70bp after trimming
4. separate the reads into two files, one for paired-end reads and one for single-end reads (i.e., one of the PE reads was removed in the previous step)
5. feed the two files to Velvet (v1.1.02), test all possible k-mer values, and pick the one that produces the best N50/max contig length
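For anyone who wants to reproduce steps 2 to 4, here is a minimal Python sketch of the trimming and pair-splitting logic. It is not our actual script. The function names are made up for illustration, and it assumes Phred+33 quality encoding, reads already held in memory as (sequence, quality) tuples, and that the first low-quality base is dropped together with everything after it.

```python
MIN_QUAL = 20      # Q20 cutoff from step 2
MIN_LEN = 70       # minimum read length from step 3
PHRED_OFFSET = 33  # assumption: Phred+33 quality encoding


def trim_at_first_low_quality(seq, qual, min_qual=MIN_QUAL, offset=PHRED_OFFSET):
    """Keep only the bases before the first position with quality < min_qual.

    Assumption: the low-quality base itself is removed as well.
    """
    for i, ch in enumerate(qual):
        if ord(ch) - offset < min_qual:
            return seq[:i], qual[:i]
    return seq, qual


def split_pairs(pairs, min_len=MIN_LEN):
    """Trim both mates of each pair, then sort survivors into two groups.

    pairs: iterable of ((seq1, qual1), (seq2, qual2)).
    Returns (paired, singles): pairs where both mates pass the length
    cutoff, and lone reads whose mate was discarded.
    """
    paired, singles = [], []
    for read1, read2 in pairs:
        t1 = trim_at_first_low_quality(*read1)
        t2 = trim_at_first_low_quality(*read2)
        keep1 = len(t1[0]) >= min_len
        keep2 = len(t2[0]) >= min_len
        if keep1 and keep2:
            paired.append((t1, t2))
        elif keep1:
            singles.append(t1)
        elif keep2:
            singles.append(t2)
    return paired, singles
```

The two returned lists correspond to the paired-end and single-end files described in step 4.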
The initial result looks reasonably good. However, when we simulated the effect of using shorter reads by first trimming all reads to 100bp, we found that the assembly actually became much better! The N50 increased from ~175kb to ~341kb and the max contig length increased from ~512kb to ~937kb (the total genome size and the number of reads used didn't change much). Blastn confirmed that the improvement comes from the merging of contigs.
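To be clear about the two numbers being compared, this is a small sketch of N50 and max computed from a list of contig lengths, using the usual definition of N50 (the length of the contig at which the running total, going from longest to shortest, first reaches half of the total assembly length). The function name is just for illustration, and the values Velvet reports may be computed slightly differently.

```python
def n50_and_max(contig_lengths):
    """Return (N50, longest contig) for a list of contig lengths."""
    if not contig_lengths:
        return 0, 0
    ordered = sorted(contig_lengths, reverse=True)
    total = sum(ordered)
    running = 0
    for length in ordered:
        running += length
        # N50 is the first contig length at which the cumulative sum
        # covers at least half of the total assembly length.
        if 2 * running >= total:
            return length, ordered[0]
    return 0, ordered[0]
```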
I find this really puzzling because I was expecting the opposite result. Could this be due to higher error rates toward the 3' end of the reads (even though the quality scores look just fine)?