Hi all,
I have 454 transcriptome data, and I am a bit puzzled by some assemblies I have been doing. In a first assembly (on CLC Workbench), the ~600,000 reads assembled into 13,000 contigs and 157,000 singletons. Just out of curiosity, I took the set of singletons and ran a new assembly on them alone. Since they were supposedly unmatched reads, I expected them not to assemble well, if at all. Instead, the singletons assembled into a new set of 14,000 contigs, leaving only 70,000 remaining singletons! I used the same stringency for both assemblies (0.5 length fraction and 0.90 identity).
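For reference, here is roughly how I understand those two cutoffs. This is just an illustrative sketch of the semantics I am assuming (length fraction = portion of the read that must align, identity = match rate within the aligned part), not anything taken from CLC itself:

def read_passes_thresholds(read_len, aligned_len, matches,
                           min_length_fraction=0.5, min_identity=0.90):
    # Assumed semantics: 'length fraction' is the portion of the read that
    # must align to a contig, and 'identity' is the match rate within that
    # aligned region. Purely illustrative, not CLC's implementation.
    if aligned_len == 0:
        return False
    length_fraction = aligned_len / read_len
    identity = matches / aligned_len
    return length_fraction >= min_length_fraction and identity >= min_identity

# e.g. a 400 bp read aligning over 220 bp with 205 matching bases passes:
print(read_passes_thresholds(400, 220, 205))  # True (0.55 >= 0.5, ~0.93 >= 0.90)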
When I look at the coverage of the contigs built from the re-assembled singletons, they seem to have coverage and length distributions comparable to those from the original assembly, so they look OK.
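In case it helps to see what I compared: this is roughly how I pulled the length distributions out of the two exported contig FASTA files (the file names are just placeholders for my exports):

import statistics

def contig_lengths(fasta_path):
    # Collect sequence lengths from a (possibly multi-line) FASTA file.
    lengths, current = [], 0
    with open(fasta_path) as handle:
        for line in handle:
            if line.startswith(">"):
                if current:
                    lengths.append(current)
                current = 0
            else:
                current += len(line.strip())
    if current:
        lengths.append(current)
    return lengths

# Placeholder file names for the contigs exported from each assembly.
for label, path in [("first-pass contigs", "first_assembly_contigs.fasta"),
                    ("singleton re-assembly contigs", "singleton_contigs.fasta")]:
    lengths = contig_lengths(path)
    print(label, "n =", len(lengths),
          "median =", statistics.median(lengths),
          "max =", max(lengths))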
I guess I don't really understand the assembly algorithm, because I am puzzled as to why it would leave so many reads unassembled in the first pass when they evidently did match other reads.
So my question is: is re-assembly of 'left-over' singletons from a first assembly a reasonable approach? Or does that somehow force 'bad' contigs to be formed?
Any insight would be extremely helpful, since the structure of my dataset will vary tremendously depending on the answer!
Thanks!
Felipe