Seqanswers Leaderboard Ad

**nickloman** · 02-23-2010, 06:11 AM

My experience with assembling bacterial genomes from 454 data is that Newbler with default parameters will do better than any other assembly engine, when dealing purely with 454 reads.

I don't know why this is, but I presume there are some built-in heuristics in Newbler that suit the error model for this technology (which is very different to Solexa or Sanger-ABI).

However, if you really want to try a different assembler then I would suggest MIRA and CLC Genomics Workbench are possible options.

How many contigs are you getting right now and is this consistent with the repeat profile (i.e. number of repeat regions >= read length) predicted from the genome in question?

**maubp** · 02-23-2010, 06:18 AM

I'd second the suggestion to try MIRA 3 on the assembly. The manual takes you through preparing the SFF file using sff_extract etc. If you start using the tool seriously, do sign up to their mailing list.

**pmiguel** · 02-23-2010, 08:59 AM

Originally posted by nickloman View Post

How many contigs are you getting right now and is this consistent with the repeat profile (i.e. number of repeat regions >= read length) predicted from the genome in question?

gsAssembler (aka "Newbler") gives me 52 contigs in a single scaffold, with lots of extraneous contigs--most or all likely deriving from contaminating eukaryotic DNA.

I looks to me like there are very few repeats in this genome. The read lengths are standard Titanium read lengths (~400 mean) but in cases where the read contains the F/R mate split, each right/left reads will be shorter, of course. The pair-ends look to be 2-4 kb apart--consistent with 3kb paired-ends.

--
Phillip

**nickloman** · 02-23-2010, 09:16 AM

Originally posted by pmiguel View Post

gsAssembler (aka "Newbler") gives me 52 contigs in a single scaffold, with lots of extraneous contigs--most or all likely deriving from contaminating eukaryotic DNA.

I looks to me like there are very few repeats in this genome. The read lengths are standard Titanium read lengths (~400 mean) but in cases where the read contains the F/R mate split, each right/left reads will be shorter, of course. The pair-ends look to be 2-4 kb apart--consistent with 3kb paired-ends.

--
Phillip

Oh right! Sounds like you have a pretty good result then if ended up with a single scaffold. Be cautious, doing another assembly with a different program may be more confounding than confirmatory! Perhaps it would be better, if you want to verify the genome order to make some confirmatory PCRs, perhaps primers to amplify the entire genome in 10kb sections.

**pmiguel** · 02-23-2010, 09:21 AM

Originally posted by nickloman View Post

Oh right! Sounds like you have a pretty good result then if ended up with a single scaffold. Be cautious, doing another assembly with a different program may be more confounding than confirmatory! Perhaps it would be better, if you want to verify the genome order to make some confirmatory PCRs, perhaps primers to amplify the entire genome in 10kb sections.

Well, I neglected to add that a cursory examination of the optical map bore no similarity that I could perceive to the scaffold restriction map. But that might be for a trivial reason--like the optical map was done with some other restriction enzyme than the one we were told.

It is the right bacterium--everything fits previous sequencing results.

In part I'm just trying to transition or include the tools I've long used for eukaryotic BAC assembly (phred/phrap/consed) into this brave new next-generation world.

Thanks for your advice,
Phillip

**bio-x** · 02-23-2010, 06:09 PM

for bacteria genome, newbler is a good choice;
for " 52 contigs in a single scaffold", you can try to close gaps with the pair-end information.

**pmiguel** · 02-24-2010, 04:37 AM

Originally posted by bio-x View Post

for bacteria genome, newbler is a good choice;
for " 52 contigs in a single scaffold", you can try to close gaps with the pair-end information.

Yes, that is the plan. But I am a little suspicious of the gaps.

I have now done assemblies with both gsAssembler and phrap and examined the assemblies in consed.

BTW, I had to do a perl -i -pe 's/_left/.f/ if /^>/;s/_right/.r/ if /^>/;' on the fasta and qual files before doing phrap to get consed to show the F/R paired ends in assembly view.

As I mentioned earlier, the phrap assembly produced a larger number of contigs by far than gsAssembler. This was somewhat mitigated when I cut all the quality values for the reads in half. Nevertheless, the assemblies appear qualitatively different.

Phrap contig consensus sequence extends fairly far into questionable areas towards the ends of contigs. This is normal for phrap. The bases at the ends of contigs tend to have very low quality values. Newbler, however, is far more parsimonious with how far it allows the consensus sequence to extend at the ends of contigs.

While I can understand the purpose behind this more stringent consensus sequence generating behavior, I want to turn it off! As it is, none of the contigs show overlap at their ends--so I have no basis to join them. Anyone know a way to get gsAssembler to become less censorious?

--
Phillip

**kmcarr** · 02-24-2010, 06:08 AM

Phillip,

Here is something I did once when working on a herpes virus genome. I first performed the assembly with gsAssembler. I then took the ACE output and loaded that into consed. Once in consed you can use the use the reassembly and contig joining tools to try to bring them together. You will need to (re)run gsAssembler with the ACE output mode set to generate a full consed folder and you will need the latest version of consed to properly recognize a 454 project.

I also noted that consed ignores any scaffolding information from the gsAssembler; it calculates its own scaffolding based on the read pairing information in the contigs. I ran a number of independent assemblies of the viral genome in gsAssembler (I had vastly more raw data than needed for a single assembly). While the size and sequence of the contigs was consistent from one assembly to the next, the scaffolding produced by gsAssembler varied. When the assemblies were loaded into consed it recalculated the scaffolding and produced consistent arrangements for all the assemblies.

**pmiguel** · 02-24-2010, 07:13 AM

Originally posted by kmcarr View Post

Phillip,

Here is something I did once when working on a herpes virus genome. I first performed the assembly with gsAssembler. I then took the ACE output and loaded that into consed. Once in consed you can use the use the reassembly and contig joining tools to try to bring them together.

Yes, that is what I am doing as well. The caveat being that I normally use the sequences matches produced in consed assembly view to drive joins. But there are virtually no sequence matches among the contigs produced by gsAssembler.

But it looks to me as if gsAssembler has deliberately excluded sequence at the ends of contigs that might match the adjacent contigs. Lacking these regions I don't know how to drive joins.

--
Phillip

**nickloman** · 02-24-2010, 08:20 AM

What I find very helpful is the 454ContigGraph.txt file which should give you the information you are looking for. However, unhelpfully this is not produced by default by Newbler. Run Newbler with the -g option set on to get this file. This will give you a contig graph which will allow you to make joins between contigs.

However, you must be aware that contigs that are not included in scaffolds are likely to be "repeat consensus" contigs, so they may not reflect a true biological sequence but a consensus of two or more repetitive regions.

You'd want to do confirmatory Sanger sequencing on these contigs to be sure you get the right sequence.

Hope that helps

**westerman** · 02-24-2010, 09:36 AM

Going along with Phillip's questions (since we are working on the same project), I am having problems running the Newbler assembly through amosvalidate. I get to the point (step 600+) where amosvalidate is looking at the singletons. nucmer/mummer then crashes because the singletons file is empty.

Has anyone run a newbler ace file through amosvalidate?

Thanks,

**pmiguel** · 02-25-2010, 08:17 AM

Originally posted by nickloman View Post

What I find very helpful is the 454ContigGraph.txt file which should give you the information you are looking for. However, unhelpfully this is not produced by default by Newbler. Run Newbler with the -g option set on to get this file. This will give you a contig graph which will allow you to make joins between contigs.

However, you must be aware that contigs that are not included in scaffolds are likely to be "repeat consensus" contigs, so they may not reflect a true biological sequence but a consensus of two or more repetitive regions.

You'd want to do confirmatory Sanger sequencing on these contigs to be sure you get the right sequence.

Hope that helps

With v. 2.3 "-g" is deprecated because the ContigGraph.txt file is created by default.

The issue I was describing--where Newbler appeared to be censoring the ends of contigs was more bizarre than I initially thought. Near as I can tell this censorship hides reads spanning contig junctions. That is, the contigs should be joined on the basis of reads spanning the pseudo-gap, but for a mysterious reason described below, a 0 base gap is created in the assembly.

The reason? These breaks apparently denote multiple branch points. Got that? Contiguous reads join a stretch of sequence, but newbler segments the stretch into multiple contigs because, if the reads were not contiguous, there would be multiple possible branch points there? Using a method I describe below I am able to visualize the "suppressed" overlapping sequence in Assembly View of Consed, using "What to Show" "Run cross_match". Very few of them appear to branch at all--there is only one place they match to--the contig adjacent to them!

How to see the suppressed sequence that shows that Newber mistakenly failed to join adjacent contigs?

Check the "reads limited to one Contig" box of the Parameters:Output pane of the GS de novo assembler GUI.

--
Phillip

**sklages** · 02-25-2010, 12:46 PM

Just to give some other recommendation .. as maubp has already mentioned, I'd give MIRA3 a try. It is (usually) doing a good job on this size of genome. It also handles repetitive sequence quite effective.

In the 1-100Mb class we successfully use Celera Assembler (currently v6beta), results can be converted to ACE/CAF ([1], still "beta").

Just my 2p,
Sven

[1]=https://sourceforge.net/projects/asm2ace/

**pmiguel** · 02-26-2010, 10:55 AM

As a postscript. The bacterial genome I was assembling, after examination and merging inside consed dropped from 52 contigs in a scaffold down to a single contig.

Still not sure what Newbler was doing breaking the contigs down like that.

--
Phillip

Topics	Statistics	Last Post
Gene Misexpression in the Healthy Human Population by seqadmin Started by seqadmin, Yesterday, 06:46 AM	0 responses 9 views 0 likes	Last Post by seqadmin Yesterday, 06:46 AM
New Method for Rapid Genetic Diagnosis of Mendelian Disorders by seqadmin Started by seqadmin, 07-24-2024, 11:09 AM	0 responses 26 views 0 likes	Last Post by seqadmin 07-24-2024, 11:09 AM
Advancing Nanopore Technology for Portable Sensing Devices by seqadmin Started by seqadmin, 07-19-2024, 07:20 AM	0 responses 159 views 0 likes	Last Post by seqadmin 07-19-2024, 07:20 AM
New RNA-Based Gene Writing Technology Achieves Precise Gene Integration by seqadmin Started by seqadmin, 07-16-2024, 05:49 AM	0 responses 127 views 0 likes	Last Post by seqadmin 07-16-2024, 05:49 AM

Seqanswers Leaderboard Ad

Announcement

454 paired end bacterial whole genome assembly

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Latest Articles

ad_right_rmr

News