Hi,
We recently received one lane of HiSeq 8 kb mate-pair data: 200 million 100 bp reads. The data is intended for de novo assembly and scaffolding of a ~800 Mb genome. An initial FastQC assessment indicates good data quality, except for an unusually high level of duplicated reads: 96.77%! Please find the relevant FastQC images attached.
Searching for relevant posts turned up other reported duplication levels as high as 80-85%, which could be due to PCR bias. The sequencing service provider assured us that this level of duplication is common for Illumina mate-pair libraries. When we combined this data with our existing data (Illumina, 2 lanes of 400 bp PE + 1 lane of 700 bp PE), we got either worse results than before (using CLC) or only minimal improvements (using SOAPdenovo) in terms of N50, number of contigs/scaffolds, etc.
Now we wonder:
Is it common to have such a high duplication level?
Do we need to discard the duplicated reads? If yes, what are the best tools (rmdup? Picard?)
And finally, what strategy would improve the assembly with the data we have?
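For context on the duplication question: as far as we understand, FastQC evaluates each read file on its own rather than read pairs, so the pair-level duplication could differ from the 96.77% it reports. Below is a rough sketch of how exact-duplicate pairs could be counted without an alignment. It is an illustration under stated assumptions, not a replacement for rmdup or Picard: the function names and the 50 bp `prefix` are our own choices, the FASTQ files are assumed to use standard 4-line records with R1 and R2 in the same order, and only identical sequences are treated as duplicates (sequencing errors will make some true duplicates look unique).

```python
import gzip
import hashlib
from itertools import islice


def open_fastq(path):
    """Open a plain or gzip-compressed FASTQ file in text mode."""
    return gzip.open(path, "rt") if path.endswith(".gz") else open(path, "r")


def read_sequences(handle):
    """Yield the sequence line of each 4-line FASTQ record."""
    # The bases sit on the 2nd line of every 4-line record (index 1, step 4).
    for line in islice(handle, 1, None, 4):
        yield line.strip()


def pair_duplication(r1_path, r2_path, prefix=50):
    """Return (total_pairs, unique_pairs, duplicated_fraction).

    A pair counts as a duplicate when the first `prefix` bases of both
    mates exactly match a pair seen earlier (an assumed, simple criterion).
    """
    seen = set()
    total = 0
    with open_fastq(r1_path) as r1, open_fastq(r2_path) as r2:
        for s1, s2 in zip(read_sequences(r1), read_sequences(r2)):
            total += 1
            # Store a 16-byte digest instead of the sequences to save memory.
            key = hashlib.md5((s1[:prefix] + "/" + s2[:prefix]).encode()).digest()
            seen.add(key)
    unique = len(seen)
    duplicated = 1 - unique / total if total else 0.0
    return total, unique, duplicated
```

It could be called as `pair_duplication("mp_R1.fastq.gz", "mp_R2.fastq.gz")`, where both file names are placeholders. Memory use grows with the number of unique pairs, since one digest is kept per distinct pair.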
Thanks for your advice.
Cheers.