Why does Newbler do what it does with .ace files?

  • Why does Newbler do what it does with .ace files?

    First, let me say that I've contacted Roche tech support multiple times and still have not received an answer. If I could talk to a Newbler programmer, perhaps they would shed some light, but you'll never reach one when you contact them.

    I've run 30-40 different transcriptome assemblies with Newbler Ver 2.3 on an Ubuntu box (Ver 9.04) and completed both 454 and 454/Sanger hybrid assemblies.

    When running in cDNA mode, Newbler outputs two fasta files (454Isotigs.fna and 454AllContigs.fna) and also 454Isotigs.ace. I think everyone knows that the Newbler ace file isn't exactly "standard" relative to other assembler ace files. Anyway, here's what I get in EVERY case, and I'm curious if anyone has a clue as to why Roche does this:

    Roche says you should use the 454Isotigs.fna file as your assembly "contig" file for downstream work, e.g. blast, which is what I did. There are only isotig seqs found in this file. After parsing the 454Isotigs.ace file into a MySQL db, I noticed that: i) there are both "contig" and isotig seqs in the ace file, ii) thousands of the contig seqs are shorter than 50 nt, with some as short as 2 nt, iii) the total number of contig + isotig seqs in the ace file does not equal the number of seqs found in either fasta file, iv) there are many large "contigs" that did not get annotated since they are not found in the 454Isotigs.fna file.

    Here are the numbers from an assembly of ca. 4 million reads (~280,000 of which are Sanger):

    454Isotigs.fna = 48,882
    454AllContigs.fna = 72,475

    454Isotigs.ace = 62,003
    454Isotigs.ace (filtering out all seqs called "contig") = 47,729
    454Isotigs.ace (filtering out all seqs ≤ 99 bp) = 54,619

    As you can see, even the number of isotigs in the ace file doesn't equal the number in the fasta file... WTF!
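
    For anyone who wants to reproduce counts like these without loading the ace file into a database, here is a minimal sketch. It assumes the file uses standard ACE contig headers (`CO <name> <padded_length> <n_reads> <n_segments> <U|C>`) and that record names start with "isotig" or "contig", as described above. The function name `ace_summary` and the default cutoff of 100 are only illustrative, and the length in a CO line is the padded consensus length, so it may differ slightly from the unpadded fasta length.

```shell
# Sketch only: summarise the contig (CO) records of an ACE file.
# Assumes standard ACE headers: CO <name> <padded_length> <n_reads> <n_segments> <U|C>
# ace_summary is a hypothetical helper name; the cutoff defaults to 100.
ace_summary() {
    ace_file=$1
    min_len=${2:-100}
    awk -v min="$min_len" '
        # Sequence lines are a single field, so NF >= 5 restricts us to real CO headers.
        $1 == "CO" && NF >= 5 {
            total++
            if ($2 ~ /^isotig/) isotig++
            else if ($2 ~ /^contig/) contig++
            if ($3 + 0 >= min + 0) at_least_min++
        }
        END {
            printf "total\t%d\n", total
            printf "isotig\t%d\n", isotig
            printf "contig\t%d\n", contig
            printf "at_least_min\t%d\n", at_least_min
        }
    ' "$ace_file"
}
```

    Calling `ace_summary 454Isotigs.ace 100` would print one tab-separated line each for the total record count, the isotig count, the contig count, and the number of records at or above the cutoff.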

    So, which number does one use when describing the contig metrics of an assembly? What about all the contig seqs in the ace file that are 2, 3, and 4 kb but are not in the isotig file? And why write all the detritus, e.g. seqs of 2, 3, 4, 5, 6, or 7 nt, to the ace file? Inquiring minds want to know!

    I see the same type of results whether I run only 454 (1/2, full or several plates) or hybrid assemblies like this one.

    If anyone has an explanation (any Roche software engineers out there?), I'd love to hear it!

  • #2
    First, have you read my blog entry on newbler cDNA output (pardon the shameless self-promotion)? http://contig.wordpress.com/2010/09/...-output-files/

    Originally posted by WaltL:
    "Roche says you should use the 454Isotigs.fna file as your assembly "contig" file for downstream work, e.g. blast... which is what I did. There are only isotig seqs found in this file."

    Are you sure? If there are isogroups that did not become isotigs, the contigs of these isogroups should be in the 454Isotigs.fna file...

    The cDNA module is somewhat buggy, as noted in several posts at SeqAnswers.

    I am a bit surprised about the large contigs in the ace file missing from the other fna files. I did find equal numbers of isotigs in the 454Isotigs.ace and 454Isotigs.fna files for the assembly I checked:

    grep -c isotig 454Isotigs.ace
    32541
    grep -c isotig 454Isotigs.fna
    32541
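
    A slightly stricter variant of the same check, as a sketch: plain `grep -c isotig` counts every line that contains the word anywhere, so anchoring the pattern to the record header (`>` in fasta, `CO ` in ace) makes sure only records are counted. The function names below are made up for illustration, and the patterns assume record names begin with the given prefix.

```shell
# Sketch only: count records whose name starts with a given prefix.
# count_fasta_headers and count_ace_contigs are hypothetical helper names.
count_fasta_headers() {
    # $1 = fasta file, $2 = name prefix such as isotig or contig
    # grep -c still prints 0 when nothing matches but exits non-zero, hence "|| true".
    grep -c "^>$2" "$1" || true
}

count_ace_contigs() {
    # $1 = ace file, $2 = name prefix; matches only contig header (CO) lines
    grep -c "^CO $2" "$1" || true
}
```

    For example, `count_fasta_headers 454Isotigs.fna contig` and `count_ace_contigs 454Isotigs.ace isotig` give per-prefix record counts that can be compared directly between the two files.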

    About the short contigs: these also exist in a de novo genome assembly but are not reported because, by default, the lower length limit for the 454AllContigs.fna and 454Contigs.ace files is 100 bp. These short ones are the result of the way newbler builds the contig graph (also explained on my blog). Some of them are repeats, some are small differences (indels) between transcript variants, etc.

    On the metrics: the number of isogroups should potentially tell you how many 'genes' there are. Splice variants (the different isotigs) could actually be just small sequence variants. Collapsing (i.e. clustering) these with CD-HIT or a similar tool might help recover the real splice variants and reduce the number of isotigs. Contigs not in isotigs are a bit of a problem, but if you have a reference genome, maybe you can deduce the real transcript by aligning the contigs (or reads) to the reference?

    Hope this helps,

    Lex
    Last edited by flxlex; 12-06-2010, 12:53 AM. Reason: Typos

    • #3
      Lex,

      Thanks for your response. So I went back and double checked and you are correct, there are contigs in the 454Isotigs.fna file. I grepped for contig and found 1,113 instances... the difference of the total 48,882 being 47,729, which is the # of isotigs found in the ace file.

      I still, however, do not understand why Roche chooses to write all the short (<100 bp) contigs to the ace file. I mean, this is just junk sequence. Since I am using someone else's scripts to parse the ace file into my database, I have no way to filter them out. Seems like writing these bits to a separate debris/boneyard file would be a smarter way to go. Oh well... maybe on the next version!
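
      One possible workaround until then, as a rough sketch: pre-filter the ace file before handing it to the parsing scripts. The function below assumes a standard ACE layout (an `AS <contigs> <reads>` header, then one block per `CO` record), reads the file twice so it needs a regular file rather than a pipe, and compares against the padded length given in the CO line. The name `filter_ace_min_len` is made up, and any tag blocks that follow a dropped contig are dropped along with it, so check the output before relying on it.

```shell
# Sketch only: drop ACE contig blocks shorter than a cutoff and fix the AS header.
# filter_ace_min_len is a hypothetical helper name; the cutoff defaults to 100.
filter_ace_min_len() {
    ace_file=$1
    min_len=${2:-100}
    awk -v min="$min_len" '
        # Pass 1: count the contigs and reads that survive the cutoff.
        NR == FNR {
            if ($1 == "CO" && NF >= 5 && $3 + 0 >= min + 0) {
                kept_contigs++
                kept_reads += $4
            }
            next
        }
        # Pass 2: lines before the first CO record are always copied.
        FNR == 1 { keep = 1 }
        # Rewrite the AS header with the surviving totals.
        $1 == "AS" && !header_done {
            printf "AS %d %d\n", kept_contigs, kept_reads
            header_done = 1
            next
        }
        # Each CO header decides whether its whole block is kept.
        $1 == "CO" && NF >= 5 { keep = ($3 + 0 >= min + 0) }
        keep { print }
    ' "$ace_file" "$ace_file"
}
```

      Usage would be something like `filter_ace_min_len 454Isotigs.ace 100 > 454Isotigs.min100.ace`, with the output file name being just an example.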

      Also, thank you for the suggestion on collapsing the assembly. I have tried running some of my miraEST assemblies from the same dataset (> 180K multi-read contigs) through CAP3, but that didn't help very much. The isogroup count for Newbler is ~ 26K isogroups and, given that this particular conifer species has a genome 7X larger than human (no reference yet), it is actually the most collapsed assembly when compared to the other assemblers I've used. Right now, I think it may be collapsing things too much.

      Thanks again!

      Walt
