Unconfigured Ad

**Brian Bushnell** · 05-21-2014, 08:57 PM

Assemblers output fasta files. I have not used masurca, but Velvet certainly does, as does Soap, Ray, AllPaths, Spades, Minia, etc. Basically, if it doesn't output fasta, it's not an assembler.

So! If you are using something that claims to be an assembler and you can't find the final fasta file, you probably are looking in the wrong place, or did not run the program to completion. Velvet, for example, annoyingly requires you to run 'velveth' and 'velvetg' iteratively before it will yield useful output.

Once you have a fasta file, you can run stats.sh on it to get a rough estimation of continuity, and map the reads to the assembly to get an indication of correctness.

**susanklein** · 05-21-2014, 09:49 PM

Originally posted by Brian Bushnell View Post

Assemblers output fasta files. I have not used masurca, but Velvet certainly does, as does Soap, Ray, AllPaths, Spades, Minia, etc. Basically, if it doesn't output fasta, it's not an assembler.

So! If you are using something that claims to be an assembler and you can't find the final fasta file, you probably are looking in the wrong place, or did not run the program to completion. Velvet, for example, annoyingly requires you to run 'velveth' and 'velvetg' iteratively before it will yield useful output.

Once you have a fasta file, you can run stats.sh on it to get a rough estimation of continuity, and map the reads to the assembly to get an indication of correctness.

ok, I see. although some assemblers also do the mapping for you! Newbler outputs .ace, Mira .caf etc. So, I need to map the reads to the contig.fasta (or whatever) files using bwa or similar.. then I can get the coverage and retrieve the reads that went into each contig? Although it seems to me that the assembler should do that too since it used those reads (or kmers from them) to make the assembly?

**susanklein** · 05-21-2014, 09:55 PM

erm.. thinking about it more.. there must be some stats output somewhere from masurca? it inputs raw reads, but I can't judge how good the assembly was if I don't know how many it used / rejected. I have asked the developer, but no reply yet on that. The documentation doe snot mention stats at all. Perhaps because masurca uses superreads, its just not possible.

anyone?

thanks,

s.

**susanklein** · 05-21-2014, 10:03 PM

hmm.. thinking about it more, perhaps you have misunderstood me.. I know assemblers output a fasta of contigs or scaffolds, but they do normally output some sort of mapping of those reads that made the assembly. Perhaps you weren't aware. If you map the reads with another program to the contigs.. they might bear no relation to the coverage or position they had in the assembly...

s.

**Brian Bushnell** · 05-21-2014, 10:35 PM

There is not usually a strict relationship recorded between the reads and the assembly for De Bruijn graph assemblers, because the reads are not directly used. Some of them can output which contigs were used where, at the expense of vastly more memory during assembly. OLC assemblers should always be able to tell you exactly which reads were used where, for free.

JGI does hundreds of organism assemblies per year, and we use mapping to determine how the reads relate to the assembly. Even if the assembler gives you an answer, it's not necessarily correct. Not that mapping is strictly correct either, but it will have different biases than the assembler, and thus be more trustworthy than accepting the opinion of an assembler on its own assembly.

BBMap has a special mode for evaluating assemblies, enabled with the flag 'kfilter=K' where K is an integer. This will prohibit mapping of reads to any location where there are fewer than K consecutive matching bases. Therefore, if you do a DBG assembly with kmer length K, any read that maps to some location was actually used to create that part of the contig. So, I consider it more accurate (with that flag) for alignment to De Bruijn graph assemblies, as compared to other aligners. It WILL bear a strict relation to the coverage of that position in the assembly because the read actually contained that kmer.

BBMap also displays a lot of useful statistics such as the mapping rate, perfect match rate, substitution rate, N rate, ambiguous mapping rate, pairing rate, and indel rate that are useful in evaluating comparative assemblies to see which is best represented by the reads.

**Brian Bushnell** · 05-21-2014, 10:45 PM

Furthermore, tools like Quast and ALE are both useful for evaluating the relative quality of different assemblies.

**susanklein** · 05-21-2014, 11:06 PM

Thanks Brian, I understand now. I'll try those.

s

**eyeziko** · 05-24-2014, 08:27 AM

MaSuRCA assembler error help

This is an error I got from MaSuRCA assembler. I am not sure what next. My reads are hiseq = 2X100 bp and Miseq= 2X250bp. It says --default-read-length 92 which should not be I guess.

Anyhelp will be highly appreciated. Thank you.
eyeziko

OUTPUT
[Wed May 21 11:06:42 CDT 2014] Processing pe library reads
Average PE read length 92
choosing kmer size of 31 for the graph
MIN_Q_CHAR: 33
[Wed May 21 11:18:31 CDT 2014] Creating mer database for Quorum.
[Wed May 21 11:37:22 CDT 2014] Error correct PE.
[Wed May 21 13:02:26 CDT 2014] Estimating genome size.
Estimated genome size: 1372852159
[Wed May 21 13:25:57 CDT 2014] Creating k-unitigs with k=31
[Wed May 21 14:44:12 CDT 2014] Computing super reads from PE
Linking PE reads 65185532
[Wed May 21 19:30:14 CDT 2014] Celera Assembler
ovlMerThreshold=75
Overlap/unitig success
/ihome/MaSuRCA-2.2.1/bin/getSuperReadPlacements.perl -dir . > tplacements.txt
/ihome/MaSuRCA-2.2.1/bin/getATBiasInCoverageForIllumina_v2 --sequence-file localUnitigSequenceFile.fasta --default-read-length 92 --placement-file tplacements.txt --interval-len 200 --min-from-end 200 > coverageCountsByGC.intermFile.txt
totalSequenceLength = 1161744854
counts at end of consensus sequence: 'A': 360153758, 'C': 220704505, 'G': 220635409, 'T': 360251182, invalid: 0
sh: line 1: 32520 Segmentation fault (core dumped) /ihome/MaSuRCA-2.2.1/bin/getATBiasInCoverageForIllumina_v2 --sequence-file localUnitigSequenceFile.fasta --default-read-length 92 --placement-file tplacements.txt --interval-len 200 --min-from-end 200 > coverageCountsByGC.intermFile.txt
cat coverageCountsByGC.intermFile.txt | /ihome/MaSuRCA-2.2.1/bin/getMeanAndStdevByGCCount.perl --interval-len 200 > adjustmentFactorsForGCPct.txt
Overlap/unitig success
Unitig consensus success
/ihome/MaSuRCA-2.2.1/bin/getSuperReadPlacements.perl -dir . > tplacements.txt
/ihome/MaSuRCA-2.2.1/bin/getATBiasInCoverageForIllumina_v2 --sequence-file localUnitigSequenceFile.fasta --default-read-length 92 --placement-file tplacements.txt --interval-len 200 --min-from-end 200 > coverageCountsByGC.intermFile.txt
totalSequenceLength = 1186002253
counts at end of consensus sequence: 'A': 368076306, 'C': 224907414, 'G': 224844721, 'T': 368173812, invalid: 0
sh: line 1: 23704 Segmentation fault (core dumped) /ihome/MaSuRCA-2.2.1/bin/getATBiasInCoverageForIllumina_v2 --sequence-file localUnitigSequenceFile.fasta --default-read-length 92 --placement-file tplacements.txt --interval-len 200 --min-from-end 200 > coverageCountsByGC.intermFile.txt
cat coverageCountsByGC.intermFile.txt | /ihome/MaSuRCA-2.2.1/bin/getMeanAndStdevByGCCount.perl --interval-len 200 > adjustmentFactorsForGCPct.txt
CA failed, check output under CA/ and runCA3.out

**susanklein** · 05-24-2014, 09:45 AM

I'm no expert, but it says 'segmentation fault' so probably out of memory...

s.

**dandence** · 09-12-2014, 11:09 AM

Hi Eyeziko, did you ever resolve your segmentation fault error? I'm encountering the same problem with assembling an AT-rich genome with MaSurCA.

~D

**francicco** · 10-29-2014, 12:37 AM

Hi guys,

I'm also trying to use MaSuRCA to assembly a Drosophila species, since SOAPdenovo and IDBA-UD fail giving me a N50 smaller than 1k. It looks like MaSuRCA some problem.

These are my errors:

Computing super reads from PE
Super reads failed, check super1.err and files in ./work1/

Allocating 309311856 bytes for startOverlapByUnitig.
Allocating 309311856 bytes for startOverlapIndexByUnitig2.
Allocating 154655928 bytes for unitigLengths.
terminate called after throwing an instance of 'std::runtime_error'
what(): kmer TATACATACATACATACATACATACATACATACATACATACATACATACATACATACATACATACATGCA already present in map
Allocating 215326848 bytes for unitig2OverlapIndex.
Num pairs with both reads in same unitig: 0
Num pairs uniquely joinable: 0
.....
Output file "work1/readPositionsInSuperReads" is of size 0, must be at least of size 1. Bye!
mv work1/readPositionsInSuperReads work1/joinKUnitigs.Failed

Can anybody help?

I'm also running SPAdes and Cabog, the latter is very very very slow!

Thanks guys

Topics	Statistics	Last Post
New AI Model Captures Long-Range Genomic Signals to Improve RNA Splice Site Prediction by SEQadmin2 Started by SEQadmin2, Today, 05:37 AM	0 responses 5 views 0 reactions	Last Post by SEQadmin2 Today, 05:37 AM
Large-Scale Protein Screen Uncovers Hidden Regulators of Alternative Polyadenylation by SEQadmin2 Started by SEQadmin2, 06-26-2026, 11:10 AM	0 responses 16 views 0 reactions	Last Post by SEQadmin2 06-26-2026, 11:10 AM
Whole-Genome Sequencing Traces Faroe Islands Ancestry to a North Atlantic Founder Population by SEQadmin2 Started by SEQadmin2, 06-17-2026, 06:09 AM	0 responses 49 views 0 reactions	Last Post by SEQadmin2 06-17-2026, 06:09 AM
Sequencing the Two-Toed Sloth Genome Reveals Jumping Genes Tied to Its Extreme Metabolism by SEQadmin2 Started by SEQadmin2, 06-09-2026, 11:58 AM	0 responses 109 views 0 reactions	Last Post by SEQadmin2 06-09-2026, 11:58 AM

Unconfigured Ad

assembler outputs

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Latest Articles

ad_right_rmr

News