Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • assembler outputs

    Hi,

    I am trying to use various assemblers to see which works best for my data. I am having some trouble with viewing outputs / stats.

    masurca outputs .asm and frg (for superreads, not the original reads as far as I can tell so it migh tnot be possible to get stats?) I can't install amos or consed on my debian machine.. error after error after error and a never-ending list of dependencies. I basically have 'tablet' and 'geneious' to view assemblies, which restricts me to .caf and .ace and some other formats.

    velvet also outputs only .afg (amos?) files, so can't convert or view those either!

    Any help really appreciated!

    Thanks,

    S.

  • #2
    Assemblers output fasta files. I have not used masurca, but Velvet certainly does, as does Soap, Ray, AllPaths, Spades, Minia, etc. Basically, if it doesn't output fasta, it's not an assembler.

    So! If you are using something that claims to be an assembler and you can't find the final fasta file, you probably are looking in the wrong place, or did not run the program to completion. Velvet, for example, annoyingly requires you to run 'velveth' and 'velvetg' iteratively before it will yield useful output.

    Once you have a fasta file, you can run stats.sh on it to get a rough estimation of continuity, and map the reads to the assembly to get an indication of correctness.
    Last edited by Brian Bushnell; 05-21-2014, 09:01 PM.

    Comment


    • #3
      Originally posted by Brian Bushnell View Post
      Assemblers output fasta files. I have not used masurca, but Velvet certainly does, as does Soap, Ray, AllPaths, Spades, Minia, etc. Basically, if it doesn't output fasta, it's not an assembler.

      So! If you are using something that claims to be an assembler and you can't find the final fasta file, you probably are looking in the wrong place, or did not run the program to completion. Velvet, for example, annoyingly requires you to run 'velveth' and 'velvetg' iteratively before it will yield useful output.

      Once you have a fasta file, you can run stats.sh on it to get a rough estimation of continuity, and map the reads to the assembly to get an indication of correctness.
      ok, I see. although some assemblers also do the mapping for you! Newbler outputs .ace, Mira .caf etc. So, I need to map the reads to the contig.fasta (or whatever) files using bwa or similar.. then I can get the coverage and retrieve the reads that went into each contig? Although it seems to me that the assembler should do that too since it used those reads (or kmers from them) to make the assembly?

      Comment


      • #4
        erm.. thinking about it more.. there must be some stats output somewhere from masurca? it inputs raw reads, but I can't judge how good the assembly was if I don't know how many it used / rejected. I have asked the developer, but no reply yet on that. The documentation doe snot mention stats at all. Perhaps because masurca uses superreads, its just not possible.

        anyone?

        thanks,

        s.

        Comment


        • #5
          hmm.. thinking about it more, perhaps you have misunderstood me.. I know assemblers output a fasta of contigs or scaffolds, but they do normally output some sort of mapping of those reads that made the assembly. Perhaps you weren't aware. If you map the reads with another program to the contigs.. they might bear no relation to the coverage or position they had in the assembly...

          s.

          Comment


          • #6
            There is not usually a strict relationship recorded between the reads and the assembly for De Bruijn graph assemblers, because the reads are not directly used. Some of them can output which contigs were used where, at the expense of vastly more memory during assembly. OLC assemblers should always be able to tell you exactly which reads were used where, for free.

            JGI does hundreds of organism assemblies per year, and we use mapping to determine how the reads relate to the assembly. Even if the assembler gives you an answer, it's not necessarily correct. Not that mapping is strictly correct either, but it will have different biases than the assembler, and thus be more trustworthy than accepting the opinion of an assembler on its own assembly.

            BBMap has a special mode for evaluating assemblies, enabled with the flag 'kfilter=K' where K is an integer. This will prohibit mapping of reads to any location where there are fewer than K consecutive matching bases. Therefore, if you do a DBG assembly with kmer length K, any read that maps to some location was actually used to create that part of the contig. So, I consider it more accurate (with that flag) for alignment to De Bruijn graph assemblies, as compared to other aligners. It WILL bear a strict relation to the coverage of that position in the assembly because the read actually contained that kmer.

            BBMap also displays a lot of useful statistics such as the mapping rate, perfect match rate, substitution rate, N rate, ambiguous mapping rate, pairing rate, and indel rate that are useful in evaluating comparative assemblies to see which is best represented by the reads.

            Comment


            • #7
              Furthermore, tools like Quast and ALE are both useful for evaluating the relative quality of different assemblies.

              Comment


              • #8
                Thanks Brian, I understand now. I'll try those.

                s

                Comment


                • #9
                  MaSuRCA assembler error help

                  This is an error I got from MaSuRCA assembler. I am not sure what next. My reads are hiseq = 2X100 bp and Miseq= 2X250bp. It says --default-read-length 92 which should not be I guess.

                  Anyhelp will be highly appreciated. Thank you.
                  eyeziko

                  OUTPUT
                  [Wed May 21 11:06:42 CDT 2014] Processing pe library reads
                  Average PE read length 92
                  choosing kmer size of 31 for the graph
                  MIN_Q_CHAR: 33
                  [Wed May 21 11:18:31 CDT 2014] Creating mer database for Quorum.
                  [Wed May 21 11:37:22 CDT 2014] Error correct PE.
                  [Wed May 21 13:02:26 CDT 2014] Estimating genome size.
                  Estimated genome size: 1372852159
                  [Wed May 21 13:25:57 CDT 2014] Creating k-unitigs with k=31
                  [Wed May 21 14:44:12 CDT 2014] Computing super reads from PE
                  Linking PE reads 65185532
                  [Wed May 21 19:30:14 CDT 2014] Celera Assembler
                  ovlMerThreshold=75
                  Overlap/unitig success
                  /ihome/MaSuRCA-2.2.1/bin/getSuperReadPlacements.perl -dir . > tplacements.txt
                  /ihome/MaSuRCA-2.2.1/bin/getATBiasInCoverageForIllumina_v2 --sequence-file localUnitigSequenceFile.fasta --default-read-length 92 --placement-file tplacements.txt --interval-len 200 --min-from-end 200 > coverageCountsByGC.intermFile.txt
                  totalSequenceLength = 1161744854
                  counts at end of consensus sequence: 'A': 360153758, 'C': 220704505, 'G': 220635409, 'T': 360251182, invalid: 0
                  sh: line 1: 32520 Segmentation fault (core dumped) /ihome/MaSuRCA-2.2.1/bin/getATBiasInCoverageForIllumina_v2 --sequence-file localUnitigSequenceFile.fasta --default-read-length 92 --placement-file tplacements.txt --interval-len 200 --min-from-end 200 > coverageCountsByGC.intermFile.txt
                  cat coverageCountsByGC.intermFile.txt | /ihome/MaSuRCA-2.2.1/bin/getMeanAndStdevByGCCount.perl --interval-len 200 > adjustmentFactorsForGCPct.txt
                  Overlap/unitig success
                  Unitig consensus success
                  /ihome/MaSuRCA-2.2.1/bin/getSuperReadPlacements.perl -dir . > tplacements.txt
                  /ihome/MaSuRCA-2.2.1/bin/getATBiasInCoverageForIllumina_v2 --sequence-file localUnitigSequenceFile.fasta --default-read-length 92 --placement-file tplacements.txt --interval-len 200 --min-from-end 200 > coverageCountsByGC.intermFile.txt
                  totalSequenceLength = 1186002253
                  counts at end of consensus sequence: 'A': 368076306, 'C': 224907414, 'G': 224844721, 'T': 368173812, invalid: 0
                  sh: line 1: 23704 Segmentation fault (core dumped) /ihome/MaSuRCA-2.2.1/bin/getATBiasInCoverageForIllumina_v2 --sequence-file localUnitigSequenceFile.fasta --default-read-length 92 --placement-file tplacements.txt --interval-len 200 --min-from-end 200 > coverageCountsByGC.intermFile.txt
                  cat coverageCountsByGC.intermFile.txt | /ihome/MaSuRCA-2.2.1/bin/getMeanAndStdevByGCCount.perl --interval-len 200 > adjustmentFactorsForGCPct.txt
                  CA failed, check output under CA/ and runCA3.out

                  Comment


                  • #10
                    I'm no expert, but it says 'segmentation fault' so probably out of memory...

                    s.

                    Comment


                    • #11
                      Hi Eyeziko, did you ever resolve your segmentation fault error? I'm encountering the same problem with assembling an AT-rich genome with MaSurCA.

                      ~D

                      Comment


                      • #12
                        Hi guys,

                        I'm also trying to use MaSuRCA to assembly a Drosophila species, since SOAPdenovo and IDBA-UD fail giving me a N50 smaller than 1k. It looks like MaSuRCA some problem.

                        These are my errors:

                        Computing super reads from PE
                        Super reads failed, check super1.err and files in ./work1/

                        Allocating 309311856 bytes for startOverlapByUnitig.
                        Allocating 309311856 bytes for startOverlapIndexByUnitig2.
                        Allocating 154655928 bytes for unitigLengths.
                        terminate called after throwing an instance of 'std::runtime_error'
                        what(): kmer TATACATACATACATACATACATACATACATACATACATACATACATACATACATACATACATACATGCA already present in map
                        Allocating 215326848 bytes for unitig2OverlapIndex.
                        Num pairs with both reads in same unitig: 0
                        Num pairs uniquely joinable: 0
                        .....
                        Output file "work1/readPositionsInSuperReads" is of size 0, must be at least of size 1. Bye!
                        mv work1/readPositionsInSuperReads work1/joinKUnitigs.Failed

                        Can anybody help?

                        I'm also running SPAdes and Cabog, the latter is very very very slow!

                        Thanks guys

                        Comment

                        Latest Articles

                        Collapse

                        • seqadmin
                          Essential Discoveries and Tools in Epitranscriptomics
                          by seqadmin




                          The field of epigenetics has traditionally concentrated more on DNA and how changes like methylation and phosphorylation of histones impact gene expression and regulation. However, our increased understanding of RNA modifications and their importance in cellular processes has led to a rise in epitranscriptomics research. “Epitranscriptomics brings together the concepts of epigenetics and gene expression,” explained Adrien Leger, PhD, Principal Research Scientist...
                          04-22-2024, 07:01 AM
                        • seqadmin
                          Current Approaches to Protein Sequencing
                          by seqadmin


                          Proteins are often described as the workhorses of the cell, and identifying their sequences is key to understanding their role in biological processes and disease. Currently, the most common technique used to determine protein sequences is mass spectrometry. While still a valuable tool, mass spectrometry faces several limitations and requires a highly experienced scientist familiar with the equipment to operate it. Additionally, other proteomic methods, like affinity assays, are constrained...
                          04-04-2024, 04:25 PM

                        ad_right_rmr

                        Collapse

                        News

                        Collapse

                        Topics Statistics Last Post
                        Started by seqadmin, Yesterday, 08:47 AM
                        0 responses
                        16 views
                        0 likes
                        Last Post seqadmin  
                        Started by seqadmin, 04-11-2024, 12:08 PM
                        0 responses
                        60 views
                        0 likes
                        Last Post seqadmin  
                        Started by seqadmin, 04-10-2024, 10:19 PM
                        0 responses
                        60 views
                        0 likes
                        Last Post seqadmin  
                        Started by seqadmin, 04-10-2024, 09:21 AM
                        0 responses
                        54 views
                        0 likes
                        Last Post seqadmin  
                        Working...
                        X