Hi all,
I recently had the genome of a bacterial strain I am working with sequenced using both PacBio and Illumina paired end.
I have managed to assemble the Illumina data into ~200 contigs using Soap2. The PacBio data came back already assembled into 22 contigs, which I was a little disappointed with, especially because other people in my lab have sequenced different strains of the same species and got their data back as a single contig! The original idea was to map the Illumina reads to the PacBio assembly to look for errors.
But anyway, now I am not sure what to do with the data I have. The four longest PacBio contigs cover ~97% of my estimated 4.5 Mb genome size, but all the other contigs also map to the same species in the BLASR output, although some have low coverage. Now I'm not sure what is "real", and I don't want to underestimate the genome size.
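For anyone wanting to check a figure like the ~97% above, the cumulative coverage of the contigs can be tabulated roughly as follows. This is only a sketch: it assumes the contigs are in plain FASTA format, the function names `fasta_lengths` and `cumulative_coverage` are made up for illustration, and the sequences in the demo are toy data, not my assembly.

```python
from typing import Dict, Iterable, List, Tuple


def fasta_lengths(lines: Iterable[str]) -> Dict[str, int]:
    """Return {contig_name: sequence_length} from FASTA-formatted lines."""
    lengths: Dict[str, int] = {}
    name = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Use the first word of the header as the contig name.
            name = line[1:].split()[0]
            lengths[name] = 0
        elif name is not None:
            lengths[name] += len(line)
    return lengths


def cumulative_coverage(
    lengths: Dict[str, int], genome_size: int
) -> List[Tuple[str, int, float]]:
    """Sort contigs longest first and report the running fraction of genome_size."""
    rows = []
    total = 0
    for name, length in sorted(lengths.items(), key=lambda kv: kv[1], reverse=True):
        total += length
        rows.append((name, length, total / genome_size))
    return rows


# Toy example (hypothetical sequences and a made-up "genome size" of 20 bp).
toy_fasta = ">contig1\nACGTACGT\nACGT\n>contig2\nACGT\n".splitlines()
for name, length, fraction in cumulative_coverage(fasta_lengths(toy_fasta), 20):
    print(f"{name}\t{length}\t{fraction:.1%}")
```

With a real assembly you would pass an open file handle instead of `toy_fasta` and 4,500,000 as the genome size. Note that the running total can exceed 100% if the small contigs are redundant with the large ones, which is part of what I am unsure about.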
I have read that you can use PacBio sequences to scaffold Illumina contigs, so I am wondering if I should try that. But I can't really find any helpful tutorials/resources on how to do this. I'm not sure which PacBio data I should use (I have the CCS.fastq, the filtered subread fastq and the longest subread fastq files), whether I need to do anything to the data before using it, which program to use, etc.
Any help would be appreciated, even if it's just a link to a good resource.
Thanks in advance!