I have been using MIRA to assemble PacBio data for a very small circular genome, and I have been seeing a strange result in the output. For several datasets, when the contigs are compared to the closest available reference, certain regions of the genome are covered by a large number of contigs that all represent the same sequence.
Even though these contigs overlap to a high degree, they are not joined into single contigs.
The problem is especially obvious in one dataset, where the whole genome can be represented as two contigs that overlap substantially at both ends but are not collapsed into a single contig (shown in the attached MUMmer mapview output).
I've been running MIRA with just the most basic settings: whole genome, denovo, accurate.
My best theory for why this is happening is that errors are prevalent enough in the PacBio data that the assembler can end up with two distinct versions of the same sequence, each as its own contig.
I would love to hear any suggestions on how to properly collapse these contigs, as I am worried that I am losing valuable read and quality information by having identical regions represented by different contigs.
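To make the "overlap at both ends" description concrete, here is a minimal Python sketch of the kind of check I mean. It is not part of MIRA or MUMmer; the function names `end_overlap` and `overlaps_both_ends` and the thresholds are hypothetical choices for illustration. It also only counts substitutions, whereas PacBio errors are mostly insertions and deletions, so on real contigs a proper aligner would be needed instead.

```python
def end_overlap(a, b, min_len=20, max_error=0.15):
    """Length of the longest suffix of `a` that matches a prefix of `b`.

    Assumption: only substitutions are tolerated (no indels), with at most
    `max_error` mismatches per compared base. Returns 0 if no overlap of at
    least `min_len` bases is found.
    """
    longest = min(len(a), len(b))
    # Try the longest candidate overlap first and shrink until one fits.
    for length in range(longest, min_len - 1, -1):
        suffix = a[-length:]
        prefix = b[:length]
        mismatches = sum(1 for x, y in zip(suffix, prefix) if x != y)
        if mismatches <= max_error * length:
            return length
    return 0


def overlaps_both_ends(a, b, **kwargs):
    """Return (end of a onto start of b, end of b onto start of a).

    Two non-zero values are what I would expect from two contigs that
    together cover a circular genome.
    """
    return end_overlap(a, b, **kwargs), end_overlap(b, a, **kwargs)


if __name__ == "__main__":
    # Toy circular "genome" split into two contigs that overlap at both ends.
    genome = "AAAACCCCGGGGTTTT"
    contig1 = genome[0:10]
    contig2 = genome[6:16] + genome[0:4]
    print(overlaps_both_ends(contig1, contig2, min_len=4, max_error=0.0))
```

If both numbers come back non-zero, the two contigs wrap around onto each other, which is the situation in my mapview plot.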