First, let me say that I've contacted Roche tech support multiple times and still have not received an answer. If I could talk to a Newbler programmer, perhaps they would shed some light, but you'll never reach one when you contact them.
I've run 30-40 different transcriptome assemblies with Newbler Ver 2.3 on an Ubuntu box (Ver 9.04) and completed both 454 and 454/Sanger hybrid assemblies.
When running in cDNA mode, Newbler outputs two fasta files (454Isotigs.fna and 454AllContigs.fna) and also 454Isotigs.ace. I think everyone knows that the Newbler ace file isn't exactly "standard" relative to other assembler ace files. Anyway, here's what I get in EVERY case, and I'm curious if anyone has a clue as to why Roche does this:
Roche says you should use the 454Isotigs.fna file as your assembly "contig" file for downstream work, e.g. blast... which is what I did. This file contains only isotig seqs. After parsing the 454Isotigs.ace file into a MySQL db, I noticed that: i) there are both "contig" and isotig seqs in the ace file, ii) thousands of the contig seqs are shorter than 50 nt, going all the way down to contigs of 2 nt, iii) the total number of contig + isotig seqs in the ace file does not equal the number of seqs found in either fasta file, iv) there are many large "contigs" that did not get annotated, since they are not found in the 454Isotigs.fna file.
Here are the numbers from an assembly of ca. 4 million reads (~280,000 of which are Sanger):
454Isotigs.fna = 48,882
454AllContigs.fna = 72,475
454Isotigs.ace = 62,003
454Isotigs.ace (filtering out all seqs called "contig") = 47,729
454Isotigs.ace (filtering out all seqs ≤ 99 bp) = 54,619
As you can see, even the number of isotigs in the ace file doesn't equal the number in the fasta file... WTF!
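For anyone who wants to check counts like these without loading everything into MySQL, something along the following lines should work. It's only a sketch: it assumes the ace file uses the usual `CO <name> <bases> <reads> <segments> <U|C>` header line for each consensus, that the names start with "contig" or "isotig" as they do in my files, and that the length on the CO line (which includes pad characters) is close enough for a rough filter. The function names are made up for illustration.

```python
from collections import Counter


def count_fasta_records(lines):
    """Count FASTA records by counting header lines (those starting with '>')."""
    return sum(1 for line in lines if line.startswith(">"))


def ace_consensus_lengths(lines):
    """Return {name: padded_length} for every CO record in an ACE file.

    Assumes the standard header: CO <name> <bases> <reads> <segments> <U|C>.
    """
    lengths = {}
    for line in lines:
        fields = line.split()
        if len(fields) >= 3 and fields[0] == "CO" and fields[2].isdigit():
            lengths[fields[1]] = int(fields[2])
    return lengths


def summarize(lengths, min_len=100):
    """Tally consensus sequences by name prefix and by a minimum length."""
    kinds = Counter()
    for name in lengths:
        if name.startswith("isotig"):
            kinds["isotig"] += 1
        elif name.startswith("contig"):
            kinds["contig"] += 1
        else:
            kinds["other"] += 1
    return {
        "total": len(lengths),
        "isotig": kinds["isotig"],
        "contig": kinds["contig"],
        "other": kinds["other"],
        "at_least_min_len": sum(1 for n in lengths.values() if n >= min_len),
    }


# Usage on real files (paths are whatever your assembly directory holds):
#   with open("454Isotigs.ace") as fh:
#       print(summarize(ace_consensus_lengths(fh)))
#   with open("454Isotigs.fna") as fh:
#       print(count_fasta_records(fh))
```

Both functions stream the file line by line, so they should cope with large ace files without needing much memory.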
So, which number does one use when describing the contig metrics of an assembly? What about all the contig seqs in the ace file that are 2, 3, and 4 kb but are not in the isotig file? And why write all the detritus, e.g. 2,3,4,5,6,7 nt seqs to the ace file? Inquiring minds want to know!
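To pull out exactly which large ace "contigs" never make it into the fasta file, a set comparison along these lines might help. Again, this is a rough sketch under assumptions: it takes the first whitespace-delimited token of each fasta header as the sequence ID and assumes that ID matches the name on the ace CO line; the 1000 bp cutoff and the function names are arbitrary choices for illustration.

```python
def fasta_ids(lines):
    """Collect sequence IDs: the first token after '>' on each header line."""
    ids = set()
    for line in lines:
        if line.startswith(">"):
            tokens = line[1:].split()
            if tokens:
                ids.add(tokens[0])
    return ids


def ace_entries_missing_from_fasta(ace_lines, fasta_lines, min_len=1000):
    """List (name, padded_length) for ACE CO records absent from the FASTA.

    Only records with length >= min_len are reported, longest first.
    """
    known = fasta_ids(fasta_lines)
    missing = []
    for line in ace_lines:
        fields = line.split()
        # Assumed header layout: CO <name> <bases> <reads> <segments> <U|C>
        if len(fields) >= 3 and fields[0] == "CO" and fields[2].isdigit():
            name, length = fields[1], int(fields[2])
            if length >= min_len and name not in known:
                missing.append((name, length))
    return sorted(missing, key=lambda item: item[1], reverse=True)


# Usage on real files:
#   with open("454Isotigs.ace") as ace, open("454Isotigs.fna") as fna:
#       for name, length in ace_entries_missing_from_fasta(ace, fna):
#           print(name, length)
```

Running the same comparison against 454AllContigs.fna instead would show whether those large contigs at least turn up there.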
I see the same type of results whether I run only 454 data (half a plate, a full plate, or several plates) or hybrid assemblies like this one.
If anyone has an explanation (any Roche software engineers out there?), I'd love to hear it!