Bug in gsAssembler 2.6, some contigs missing

  • Bug in gsAssembler 2.6, some contigs missing

    tl;dr: Some large contigs are missing from the output contig file, 454AllContigs.fna.
    I believe this is a bug in the current version of gsAssembler, 2.6 (20110517_1502).
    I contacted Roche three weeks ago but still have not heard back. Maybe they have everyone making kits.
    Has anyone else here come across this bug? Any solutions?

    I am assembling one plate plus titrations of an 8kb paired end run.
    Here are the exact parameters for the assembly, run from the command line:

    runProject -siod -het -urt -cpu 28 -info -m -noace -a 1 -l 2000 -large -scaffold /454/assemblydir

    Notice the parameter -a 1, which sets the minimum contig length to 1. I should get ALL contigs, no matter how short (and I do get ones that short).

    Here are a few FASTA headers (omitting the sequence lines) from 454AllContigs.fna. Notice that some contigs are missing, such as 75:
    >contig00068 length=2566 numreads=160
    >contig00071 length=2081 numreads=130
    >contig00072 length=776 numreads=27
    >contig00073 length=1145 numreads=310
    >contig00074 length=187 numreads=41
    >contig00076 length=1834 numreads=456
    >contig00077 length=1922 numreads=219
    >contig00078 length=432 numreads=45
    >contig00080 length=128 numreads=17
    >contig00081 length=3488 numreads=454
    >contig00082 length=2433 numreads=353
    >contig00083 length=4226 numreads=351

    >contig109403 length=1 numreads=7
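A quick way to list gaps like this is to scan the header lines for contig numbers. The following is a minimal sketch, assuming headers look like the ones quoted above; the function names are made up for illustration and are not part of any Roche tool. A gap in the numbering only flags a candidate, since the assembler may skip numbers for other reasons.

```python
import re

# Matches headers such as ">contig00068 length=2566 numreads=160".
HEADER_RE = re.compile(r"^>contig(\d+)")

def contig_numbers(fasta_lines):
    """Return the integer contig numbers found in FASTA header lines."""
    numbers = []
    for line in fasta_lines:
        match = HEADER_RE.match(line.strip())
        if match:
            numbers.append(int(match.group(1)))
    return numbers

def numbering_gaps(numbers):
    """Return numbers absent between the smallest and largest number seen."""
    if not numbers:
        return []
    present = set(numbers)
    return [n for n in range(min(present), max(present) + 1) if n not in present]

# The headers quoted in the post above.
headers = [
    ">contig00068 length=2566 numreads=160",
    ">contig00071 length=2081 numreads=130",
    ">contig00072 length=776 numreads=27",
    ">contig00073 length=1145 numreads=310",
    ">contig00074 length=187 numreads=41",
    ">contig00076 length=1834 numreads=456",
    ">contig00077 length=1922 numreads=219",
    ">contig00078 length=432 numreads=45",
    ">contig00080 length=128 numreads=17",
    ">contig00081 length=3488 numreads=454",
    ">contig00082 length=2433 numreads=353",
    ">contig00083 length=4226 numreads=351",
]
print(numbering_gaps(contig_numbers(headers)))  # [69, 70, 75, 79]
```

To run it on a real assembly, pass an open file handle for 454AllContigs.fna instead of the `headers` list.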

    Here is the corresponding section of 454ContigGraph.txt. Note that contig00075 IS there, but out of order:
    68 contig00068 2566 11.2
    71 contig00071 2081 11.0
    72 contig00072 776 6.0
    73 contig00073 1145 43.3
    74 contig00074 187 20.3
    76 contig00076 1834 45.1
    77 contig00077 1922 16.5
    78 contig00078 432 12.2
    80 contig00080 128 12.4
    81 contig00081 3488 23.4
    82 contig00082 2433 26.1
    75 contig00075 187 22.4
    79 contig00079 18 7.3
    83 contig00083 4226 15.8
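The comparison above can be automated by collecting the contig names from the node lines of 454ContigGraph.txt and reporting those with no matching FASTA header. This is only a sketch under assumptions: it treats any line with exactly four whitespace-separated fields (number, name, length, coverage) as a node line, as in the excerpt above, and the helper names are hypothetical.

```python
def graph_nodes(graph_lines):
    """Map contig name -> (length, coverage) for node lines of the graph file."""
    nodes = {}
    for line in graph_lines:
        fields = line.split()
        # Assumed node-line layout: number, contig name, length, coverage.
        if len(fields) == 4 and fields[0].isdigit() and fields[1].startswith("contig"):
            nodes[fields[1]] = (int(fields[2]), float(fields[3]))
    return nodes

def fasta_names(fasta_lines):
    """Return the set of sequence names taken from FASTA header lines."""
    return {
        line[1:].split()[0]
        for line in fasta_lines
        if line.startswith(">") and line[1:].split()
    }

def missing_from_fasta(graph_lines, fasta_lines):
    """List (name, length, coverage) for graph nodes that have no FASTA record."""
    nodes = graph_nodes(graph_lines)
    names = fasta_names(fasta_lines)
    return [(name, *nodes[name]) for name in sorted(nodes) if name not in names]

# Excerpts quoted in the post above.
graph_excerpt = """\
68 contig00068 2566 11.2
71 contig00071 2081 11.0
72 contig00072 776 6.0
73 contig00073 1145 43.3
74 contig00074 187 20.3
76 contig00076 1834 45.1
77 contig00077 1922 16.5
78 contig00078 432 12.2
80 contig00080 128 12.4
81 contig00081 3488 23.4
82 contig00082 2433 26.1
75 contig00075 187 22.4
79 contig00079 18 7.3
83 contig00083 4226 15.8
""".splitlines()

fasta_excerpt = [
    ">contig%05d" % n
    for n in (68, 71, 72, 73, 74, 76, 77, 78, 80, 81, 82, 83)
]

for name, length, coverage in missing_from_fasta(graph_excerpt, fasta_excerpt):
    print(name, length, coverage)
```

On these excerpts it reports contig00075 and contig00079, the two nodes that appear out of order in the graph file.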


    Later on in that same file is the connection information. Here is a summary:
    $ bb.454contiginfo --in=../assembly --contig=75 --out=-
    Length 187
    Average Coverage 22.4
    Edge 5' Connects to contig 73 3' with 28 reads
    Edge 3' Connects to contig 76 5' with 25 reads
    28 reads flow from 5' end of contig75 and terminate in contig 73
    25 reads flow from 3' end of contig75 and terminate in contig 76
    2 paired end reads flow from 5' end of contig75 and terminate in contig 105881 after passing through 7605.0 b.p. in other contig(s)
    No paired end reads flow from 3' end of contig75

    I want that contig! It goes between 73 and 76. Where is it?
    I tried without the -scaffold parameter; the contig numbers change, but there are still missing contigs.

  • #2
    Not sure, it does look like a bug. What happens if you try without "-large" ?

    Also, if you instead output ACE files, can you extract the contigs from there?


    • #3
      Those are excellent suggestions, I will report back when the assembly is finished

      Edit: No, the contig is not in the .ace file either. Trying without -large now.
      I have this sudden fear: what if it is a memory error? I once had a bad chip give crazy errors.
      So I am replicating the assembly, exactly the same way with the same data, on a different computer to see.
      Last edited by dsenalik; 11-11-2011, 12:02 PM.


      • #4
        I think if it was a memory error you'd be more likely to see segfaults or intermittent problems. If it's reproducible between runs then I think a logical error is more likely...

        I have some developer contacts at Roche I can send this link to if you are still struggling.


        • #5
          The replication on a different computer also shows missing contigs. So I think I can rule out memory problems.

          The assembly without -large was taking forever for some reason, so nothing to report on that.

          So, yes, please let your developer contacts know about this. Thanks ever so much!


          • #6
            Just an update on this bug, it was officially submitted to Roche back on Nov. 23, but I have not heard a word back.

            For anyone else who encounters it, here is another aspect of this bug (or a different bug?): the contig numbers in the graphical environment do NOT correspond to the contig numbers in the generated FASTA file.
            For example, contig 00008 in the graphical environment is the contig numbered 00009 in the FASTA file, as can be seen from the sequence lengths.


            • #7
              My suspicion is that -large causes the algorithm to take shortcuts and not traverse the entire contig graph (since it is too large). It may be that the missing contigs are those that are too large, so it doesn't bother traversing that part of the graph or generating the actual contigs.


              • #8
                This is probably relevant to the topic of this thread (bugs in v2.6) but not relevant to the original poster.
                I performed a cDNA assembly using version 2.6 and also found what appeared to be contigs missing from 454AllContigs.fna. But when I look in 454ContigGraph.txt, they are contigs that were not used in the assembly and that somehow have zero length but a read depth greater than zero, some over 100.
                I think they are safe to ignore, but it seems strange.
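To tell this case apart from the one in the original post, the node lines can be filtered for zero length with nonzero depth. A rough sketch, again assuming four whitespace-separated fields per node line; the sample lines below are invented for illustration and are not from a real assembly.

```python
def zero_length_nodes(graph_lines):
    """Return (name, depth) for nodes with length 0 but read depth above 0."""
    flagged = []
    for line in graph_lines:
        fields = line.split()
        # Assumed node-line layout: number, contig name, length, depth.
        if len(fields) != 4 or not fields[0].isdigit():
            continue
        if not fields[1].startswith("contig"):
            continue
        name, length, depth = fields[1], int(fields[2]), float(fields[3])
        if length == 0 and depth > 0:
            flagged.append((name, depth))
    return flagged

# Hypothetical node lines, made up to show the filter.
sample = [
    "101 contig00101 0 112.5",
    "102 contig00102 940 8.1",
    "103 contig00103 0 0.0",
]
print(zero_length_nodes(sample))  # [('contig00101', 112.5)]
```

Contigs flagged here match the zero-length pattern; a contig absent from the FASTA file but with a real length, like the 187 bp contig00075 above, does not.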