Unconfigured Ad

Collapse
X
 
  • Filter
  • Time
  • Show
Clear All
new posts
  • susanklein
    Senior Member
    • Feb 2014
    • 116

    assembler outputs

    Hi,

    I am trying to use various assemblers to see which works best for my data. I am having some trouble with viewing outputs / stats.

    masurca outputs .asm and frg (for superreads, not the original reads as far as I can tell so it migh tnot be possible to get stats?) I can't install amos or consed on my debian machine.. error after error after error and a never-ending list of dependencies. I basically have 'tablet' and 'geneious' to view assemblies, which restricts me to .caf and .ace and some other formats.

    velvet also outputs only .afg (amos?) files, so can't convert or view those either!

    Any help really appreciated!

    Thanks,

    S.
  • Brian Bushnell
    Super Moderator
    • Jan 2014
    • 2709

    #2
    Assemblers output fasta files. I have not used masurca, but Velvet certainly does, as does Soap, Ray, AllPaths, Spades, Minia, etc. Basically, if it doesn't output fasta, it's not an assembler.

    So! If you are using something that claims to be an assembler and you can't find the final fasta file, you probably are looking in the wrong place, or did not run the program to completion. Velvet, for example, annoyingly requires you to run 'velveth' and 'velvetg' iteratively before it will yield useful output.

    Once you have a fasta file, you can run stats.sh on it to get a rough estimation of continuity, and map the reads to the assembly to get an indication of correctness.
    Last edited by Brian Bushnell; 05-21-2014, 09:01 PM.

    Comment

    • susanklein
      Senior Member
      • Feb 2014
      • 116

      #3
      Originally posted by Brian Bushnell View Post
      Assemblers output fasta files. I have not used masurca, but Velvet certainly does, as does Soap, Ray, AllPaths, Spades, Minia, etc. Basically, if it doesn't output fasta, it's not an assembler.

      So! If you are using something that claims to be an assembler and you can't find the final fasta file, you probably are looking in the wrong place, or did not run the program to completion. Velvet, for example, annoyingly requires you to run 'velveth' and 'velvetg' iteratively before it will yield useful output.

      Once you have a fasta file, you can run stats.sh on it to get a rough estimation of continuity, and map the reads to the assembly to get an indication of correctness.
      ok, I see. although some assemblers also do the mapping for you! Newbler outputs .ace, Mira .caf etc. So, I need to map the reads to the contig.fasta (or whatever) files using bwa or similar.. then I can get the coverage and retrieve the reads that went into each contig? Although it seems to me that the assembler should do that too since it used those reads (or kmers from them) to make the assembly?

      Comment

      • susanklein
        Senior Member
        • Feb 2014
        • 116

        #4
        erm.. thinking about it more.. there must be some stats output somewhere from masurca? it inputs raw reads, but I can't judge how good the assembly was if I don't know how many it used / rejected. I have asked the developer, but no reply yet on that. The documentation doe snot mention stats at all. Perhaps because masurca uses superreads, its just not possible.

        anyone?

        thanks,

        s.

        Comment

        • susanklein
          Senior Member
          • Feb 2014
          • 116

          #5
          hmm.. thinking about it more, perhaps you have misunderstood me.. I know assemblers output a fasta of contigs or scaffolds, but they do normally output some sort of mapping of those reads that made the assembly. Perhaps you weren't aware. If you map the reads with another program to the contigs.. they might bear no relation to the coverage or position they had in the assembly...

          s.

          Comment

          • Brian Bushnell
            Super Moderator
            • Jan 2014
            • 2709

            #6
            There is not usually a strict relationship recorded between the reads and the assembly for De Bruijn graph assemblers, because the reads are not directly used. Some of them can output which contigs were used where, at the expense of vastly more memory during assembly. OLC assemblers should always be able to tell you exactly which reads were used where, for free.

            JGI does hundreds of organism assemblies per year, and we use mapping to determine how the reads relate to the assembly. Even if the assembler gives you an answer, it's not necessarily correct. Not that mapping is strictly correct either, but it will have different biases than the assembler, and thus be more trustworthy than accepting the opinion of an assembler on its own assembly.

            BBMap has a special mode for evaluating assemblies, enabled with the flag 'kfilter=K' where K is an integer. This will prohibit mapping of reads to any location where there are fewer than K consecutive matching bases. Therefore, if you do a DBG assembly with kmer length K, any read that maps to some location was actually used to create that part of the contig. So, I consider it more accurate (with that flag) for alignment to De Bruijn graph assemblies, as compared to other aligners. It WILL bear a strict relation to the coverage of that position in the assembly because the read actually contained that kmer.

            BBMap also displays a lot of useful statistics such as the mapping rate, perfect match rate, substitution rate, N rate, ambiguous mapping rate, pairing rate, and indel rate that are useful in evaluating comparative assemblies to see which is best represented by the reads.

            Comment

            • Brian Bushnell
              Super Moderator
              • Jan 2014
              • 2709

              #7
              Furthermore, tools like Quast and ALE are both useful for evaluating the relative quality of different assemblies.

              Comment

              • susanklein
                Senior Member
                • Feb 2014
                • 116

                #8
                Thanks Brian, I understand now. I'll try those.

                s

                Comment

                • eyeziko
                  Junior Member
                  • May 2013
                  • 6

                  #9
                  MaSuRCA assembler error help

                  This is an error I got from MaSuRCA assembler. I am not sure what next. My reads are hiseq = 2X100 bp and Miseq= 2X250bp. It says --default-read-length 92 which should not be I guess.

                  Anyhelp will be highly appreciated. Thank you.
                  eyeziko

                  OUTPUT
                  [Wed May 21 11:06:42 CDT 2014] Processing pe library reads
                  Average PE read length 92
                  choosing kmer size of 31 for the graph
                  MIN_Q_CHAR: 33
                  [Wed May 21 11:18:31 CDT 2014] Creating mer database for Quorum.
                  [Wed May 21 11:37:22 CDT 2014] Error correct PE.
                  [Wed May 21 13:02:26 CDT 2014] Estimating genome size.
                  Estimated genome size: 1372852159
                  [Wed May 21 13:25:57 CDT 2014] Creating k-unitigs with k=31
                  [Wed May 21 14:44:12 CDT 2014] Computing super reads from PE
                  Linking PE reads 65185532
                  [Wed May 21 19:30:14 CDT 2014] Celera Assembler
                  ovlMerThreshold=75
                  Overlap/unitig success
                  /ihome/MaSuRCA-2.2.1/bin/getSuperReadPlacements.perl -dir . > tplacements.txt
                  /ihome/MaSuRCA-2.2.1/bin/getATBiasInCoverageForIllumina_v2 --sequence-file localUnitigSequenceFile.fasta --default-read-length 92 --placement-file tplacements.txt --interval-len 200 --min-from-end 200 > coverageCountsByGC.intermFile.txt
                  totalSequenceLength = 1161744854
                  counts at end of consensus sequence: 'A': 360153758, 'C': 220704505, 'G': 220635409, 'T': 360251182, invalid: 0
                  sh: line 1: 32520 Segmentation fault (core dumped) /ihome/MaSuRCA-2.2.1/bin/getATBiasInCoverageForIllumina_v2 --sequence-file localUnitigSequenceFile.fasta --default-read-length 92 --placement-file tplacements.txt --interval-len 200 --min-from-end 200 > coverageCountsByGC.intermFile.txt
                  cat coverageCountsByGC.intermFile.txt | /ihome/MaSuRCA-2.2.1/bin/getMeanAndStdevByGCCount.perl --interval-len 200 > adjustmentFactorsForGCPct.txt
                  Overlap/unitig success
                  Unitig consensus success
                  /ihome/MaSuRCA-2.2.1/bin/getSuperReadPlacements.perl -dir . > tplacements.txt
                  /ihome/MaSuRCA-2.2.1/bin/getATBiasInCoverageForIllumina_v2 --sequence-file localUnitigSequenceFile.fasta --default-read-length 92 --placement-file tplacements.txt --interval-len 200 --min-from-end 200 > coverageCountsByGC.intermFile.txt
                  totalSequenceLength = 1186002253
                  counts at end of consensus sequence: 'A': 368076306, 'C': 224907414, 'G': 224844721, 'T': 368173812, invalid: 0
                  sh: line 1: 23704 Segmentation fault (core dumped) /ihome/MaSuRCA-2.2.1/bin/getATBiasInCoverageForIllumina_v2 --sequence-file localUnitigSequenceFile.fasta --default-read-length 92 --placement-file tplacements.txt --interval-len 200 --min-from-end 200 > coverageCountsByGC.intermFile.txt
                  cat coverageCountsByGC.intermFile.txt | /ihome/MaSuRCA-2.2.1/bin/getMeanAndStdevByGCCount.perl --interval-len 200 > adjustmentFactorsForGCPct.txt
                  CA failed, check output under CA/ and runCA3.out

                  Comment

                  • susanklein
                    Senior Member
                    • Feb 2014
                    • 116

                    #10
                    I'm no expert, but it says 'segmentation fault' so probably out of memory...

                    s.

                    Comment

                    • dandence
                      Junior Member
                      • May 2011
                      • 1

                      #11
                      Hi Eyeziko, did you ever resolve your segmentation fault error? I'm encountering the same problem with assembling an AT-rich genome with MaSurCA.

                      ~D

                      Comment

                      • francicco
                        Member
                        • Jul 2010
                        • 28

                        #12
                        Hi guys,

                        I'm also trying to use MaSuRCA to assembly a Drosophila species, since SOAPdenovo and IDBA-UD fail giving me a N50 smaller than 1k. It looks like MaSuRCA some problem.

                        These are my errors:

                        Computing super reads from PE
                        Super reads failed, check super1.err and files in ./work1/

                        Allocating 309311856 bytes for startOverlapByUnitig.
                        Allocating 309311856 bytes for startOverlapIndexByUnitig2.
                        Allocating 154655928 bytes for unitigLengths.
                        terminate called after throwing an instance of 'std::runtime_error'
                        what(): kmer TATACATACATACATACATACATACATACATACATACATACATACATACATACATACATACATACATGCA already present in map
                        Allocating 215326848 bytes for unitig2OverlapIndex.
                        Num pairs with both reads in same unitig: 0
                        Num pairs uniquely joinable: 0
                        .....
                        Output file "work1/readPositionsInSuperReads" is of size 0, must be at least of size 1. Bye!
                        mv work1/readPositionsInSuperReads work1/joinKUnitigs.Failed

                        Can anybody help?

                        I'm also running SPAdes and Cabog, the latter is very very very slow!

                        Thanks guys

                        Comment

                        Latest Articles

                        Collapse

                        • SEQadmin2
                          Nine Things a Sample Prep Scientist Thinks About Before Sequencing
                          by SEQadmin2


                          I’m not a sequencing expert. I’m a purification scientist who uses NGS to evaluate workflows my group develops. With this perspective, we think about the sample first and the NGS workflow second. The sequencer is an exceptionally honest reporter, but it can only report on what you give it, so whether you get clean, interpretable data from an NGS workflow is largely determined before you begin.

                          Here are nine questions we think about, in roughly the order they matter, before...
                          06-18-2026, 07:11 AM
                        • SEQadmin2
                          From Collection to Sequencing: Why Sample Preparation and Preservation Define Sequencing Data
                          by SEQadmin2


                          Data variability is still an issue in sequencing technologies despite the advances in reproducibility and accuracy of these platforms. But the problem does not originate in the sequencing itself, but in the previous steps, before the sample reaches the sequencer.


                          The first step is collection, followed by preservation and sample preparation for analysis. Most scientists overlook those steps, but not being careful might just be skewing the experiment’s results.
                          ...
                          06-02-2026, 10:05 AM

                        ad_right_rmr

                        Collapse

                        News

                        Collapse

                        Topics Statistics Last Post
                        Started by SEQadmin2, Today, 05:37 AM
                        0 responses
                        5 views
                        0 reactions
                        Last Post SEQadmin2  
                        Started by SEQadmin2, 06-26-2026, 11:10 AM
                        0 responses
                        16 views
                        0 reactions
                        Last Post SEQadmin2  
                        Started by SEQadmin2, 06-17-2026, 06:09 AM
                        0 responses
                        49 views
                        0 reactions
                        Last Post SEQadmin2  
                        Started by SEQadmin2, 06-09-2026, 11:58 AM
                        0 responses
                        109 views
                        0 reactions
                        Last Post SEQadmin2  
                        Working...