**JKing** · 11-20-2008, 05:59 PM

SeqMan NGen

Update: I visit this site quite a bit to understand the tools available and where this technology is taking us, but I haven't actually posted in some time. Here's an update if you are interested in DNAStar development.

SeqMan Genome Assembler was an in-house name during development of the assembly program. SeqMan NGen is the name going forward, as it really is an engine that produces these assemblies for end users, who need no special computer specs such as 64-bit operating systems and lots of RAM for subsequent assembly analysis. Normal computers do the end-user job. Assemblies include siRNA targeting, ChIP-Seq, mRNA alignment to genomic templates, etc., so "Genome Assembler" was a limiting name for the program.

The last few posts dealt with strategies for sequence aligners to compensate for underreporting SNPs in areas heavy with mutation. This gets to the heart of the difference between aligners like MAQ and ELAND and actual contig assemblers that produce .ace files or their equivalent. An aligner program throws one read against the template, records where it sticks, and then proceeds to the next read. The big problem is that if there are more than two differences between any read and the reference, the read is thrown out. The output is a big text file.
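A minimal sketch of the read-by-read behaviour described above, assuming ungapped matching and a fixed mismatch cutoff. It is not how MAQ or ELAND are implemented (the real tools use far more efficient methods); the function names, the toy reference, and the `max_mismatches` parameter are made up for illustration.

```python
def hamming(a, b):
    """Count positions where two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def align_read(read, reference, max_mismatches=2):
    """Return (position, mismatches) for the best ungapped placement.

    Returns None when every placement exceeds max_mismatches,
    i.e. the read is discarded. Brute force, for illustration only.
    """
    best = None
    for pos in range(len(reference) - len(read) + 1):
        mm = hamming(read, reference[pos:pos + len(read)])
        if mm <= max_mismatches and (best is None or mm < best[1]):
            best = (pos, mm)
    return best


# Toy reference and reads (hypothetical data).
reference = "ACGTACGTTAGCCGATAGGCTTAACGGATCCA"
reads = [
    "TAGCCGATAG",  # exact copy of reference[8:18]
    "TAGCCGTTAG",  # one difference from reference[8:18]
    "TCGCAGATCG",  # three differences from reference[8:18]
]
for read in reads:
    print(read, align_read(read, reference))
```

With a cutoff of two, the third read finds no acceptable placement and is dropped, which is the situation the paragraph above describes.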

NGen performs several passes during the assembly process (and quickly). The first pass does something resembling what aligners do, in that it takes care of the easy reads. In subsequent passes the assembly is completely de novo. All reads are incorporated in the context of the existing reads of the experimental strain. There is no limit to the number of differences between the reference sequence and any given read of the experimental strain, as it is a de novo assembly that disregards the template entirely. No reads are thrown out. There could be eight true SNP differences between your strain and your reference strain within a 35 bp span, for instance, and those SNPs will be reported and can be visually confirmed in the alignment view.

The end user can also filter out false SNPs based on quality score, percent of SNPs in reads at each locus, depth of coverage, and known vs. novel SNPs, using the normal SeqMan interface. SNP reporting also includes subsequent silent or non-silent amino acid mutations at specific aa positions at the protein level. The end user actually has a fairly easy job of discerning those SNPs that matter. The strategy for dealing with large indels or transpositions is exquisite, and you will just have to contact us for that, as it is beyond the scope of a board post.
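The filter criteria listed above (quality score, percent of reads showing the SNP, depth of coverage, known vs. novel) can be sketched as a simple predicate over SNP calls. This is only an illustration of the idea, not SeqMan's interface: the `SnpCall` fields, the `filter_snps` function, and every threshold are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class SnpCall:
    # All field names are hypothetical, chosen to mirror the criteria in the post.
    position: int
    quality: float           # quality score of the variant call
    variant_fraction: float  # fraction of reads at this locus showing the SNP
    depth: int               # number of reads covering the locus
    known: bool              # True if the SNP is already annotated


def filter_snps(calls, min_quality=30.0, min_fraction=0.25,
                min_depth=10, novel_only=False):
    """Keep calls that pass every threshold (threshold values are arbitrary)."""
    kept = []
    for call in calls:
        if call.quality < min_quality:
            continue
        if call.variant_fraction < min_fraction:
            continue
        if call.depth < min_depth:
            continue
        if novel_only and call.known:
            continue
        kept.append(call)
    return kept


calls = [
    SnpCall(position=101, quality=45.0, variant_fraction=0.90, depth=40, known=True),
    SnpCall(position=230, quality=12.0, variant_fraction=0.80, depth=35, known=False),
    SnpCall(position=512, quality=50.0, variant_fraction=0.05, depth=60, known=False),
    SnpCall(position=777, quality=38.0, variant_fraction=0.55, depth=22, known=False),
]
print([c.position for c in filter_snps(calls)])
print([c.position for c in filter_snps(calls, novel_only=True)])
```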

Aligners like MAQ are actually very effective if one uses a reference sequence that is "the answer", but that is not necessarily the case in many projects. We are actually introducing a MAQ-like aligner in a couple of weeks for next-gen RNA-Seq comparative gene expression analysis, and the results feed directly into the tools traditionally used for microarray analysis like scatter plots and heat maps. Of course, RNA-Seq is orders of magnitude more sensitive and accurate than microarray.

For sequence assembly, nothing beats an actual assembly rather than a read-by-read alignment text file. Due to computer limitations, aligners that throw reads one at a time at a reference sequence are a necessary evil right now for higher level eukaryotes, but that will soon change and end users will soon be able to visualize actual assemblies at any position along the genome.

**cgb** · 11-20-2008, 11:46 PM

A few brief comments on this post:

The short read aligners like MAQ and ELAND don't "throw reads one at a time". In fact they do very efficient batch-based inexact matching and some fancy maths to determine the best match where there is an ambiguity, running at around, say, 5000 reads per second per CPU (with <1 GB RAM per CPU) against a human genome. This was a very significant computational challenge that many people said was impossible, and it is now entirely feasible on small computers using these algorithms.

More than two mismatches isn't a problem. If you had more than two on the majority of your reads, then your sequencer isn't working and you should send it back, because that would equate to something like an 8-10% error rate on average. Actual runs have sub-1% error rates (generally), and thus most of the 25-35-mers have only 0-2 errors. In fact very few reads have more than 2 mismatches, and in the case of MAQ they aren't thrown away. The number of poor matches chucked by ELAND on a normal run is in the <1-4% range, and many of these reads arise from data collection/imaging artifacts (or they are contaminants), i.e. they aren't from the sample. Hence chucking high-error reads also has some benefits in terms of false positives.
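A rough check of the arithmetic above, assuming sequencing errors are independent and occur at the same rate at every base (a simplification; real error rates rise along the read). Under that assumption the number of errors in a read is binomial. The function name and the chosen read length of 35 are illustrative only.

```python
from math import comb


def prob_more_than(k, read_length, error_rate):
    """P(more than k errors in a read) under a simple binomial model."""
    p_at_most_k = sum(
        comb(read_length, i)
        * error_rate ** i
        * (1 - error_rate) ** (read_length - i)
        for i in range(k + 1)
    )
    return 1 - p_at_most_k


# Compare a typical sub-1% run with a hypothetical 8% error rate.
for rate in (0.01, 0.08):
    p = prob_more_than(2, read_length=35, error_rate=rate)
    print(f"error rate {rate:.0%}: P(>2 errors in a 35-mer) = {p:.3f}")
```

Under this model, a 1% per-base error rate leaves well under 1% of 35-mers with more than two errors, while it takes an error rate of roughly 8% before more than half of the reads exceed two.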

'Even 8 SNPs in a single 25mer'. Do we know how often that occurs in the human genome? It must be under 0.001%. There are more errors, and missing sections, in the reference itself to worry about.

If you are saying that a short read assembly on a big genome is going to give better coverage, better consensus and better mutation detection than a resequencing run, I think that has yet to be shown, and the assembly problems of short read sequencing dwarf the minor side-effects of read mapping. I'm a tad skeptical that this can be done on a 'smaller' computer and more quickly than re-aligning the same coverage level of short read data with the modern algorithms.

For sure - on genomes without a reference you will ideally assemble.

**JKing** · 11-21-2008, 07:25 AM

You're right...

Agreed, I was too simplistic in my critique of MAQ for a next-generation sequencing board. Aligners can exceed 2 mismatches in certain situations, but in general the rule is that >2 mismatches lead to a statistically insignificant match.

For human genotyping purposes, where the reference sequence is essentially the answer give or take a few SNPs, alignment algorithms are a very efficient approach. I didn't mean to come off negatively in any way regarding them.

However, every genome will eventually be sequenced, and there are only references for a small fraction of worldwide species. Heck, there are only references for a small fraction of E. coli strains, and that is the most studied bacterium. There are obvious limitations to using short reads in a de novo fashion to tackle these genomes. This approach allows one to use the best available reference, and the end result of the assembly takes one further than an alignment algorithm could achieve.

Plus, the ability to actually visualize the assembly at every locus provides a certain level of confidence.

**dan** · 12-04-2008, 01:44 AM

Put the info in this thread into a wiki?

Hi,

Can we put the info in this thread into a Wiki page to allow better structuring of the data?

It's a bit of a monster thread, and it isn't clear where in the thread important info will come up...

Is there a SEQanswers Wiki?

If not I can suggest:

* Somewhere on http://wiki.bioinformatics.org/Wiki_Main_Page
* http://bioinformatist.org/index.php/Main_Page

Or just Wikipedia (I'm sure there is a suitable location).

Dan.

**Wolfgang Gerlach** · 12-17-2008, 12:34 PM

more software

Hi all,

I have two programs here that might fit into the software list. Maybe somebody can add them to the list?

-----------------
The SWIFT suit is a software collection for fast index-based sequence comparison. It contains the following programs: SWIFT — fast local alignment search, guaranteeing to find epsilon-matches between two sequences; SWIFT BALSAM — a very fast program to find semiglobal non-gapped alignments based on k-mer seeds.
----------
Link:

BiBiServ2 - SWIFT Suit

http://bibiserv.techfak.uni-bielefeld.de/swift/

best
Wolfgang

**joa_ds** · 12-18-2008, 08:43 AM

Hi, has anyone used Pyrobayes?

It just seems a bit weird that it only needs sff files to improve quality?

I can't get hold of the article (no Nature license here). What is the program based on?

**xuer** · 01-14-2009, 09:02 AM

vmatch is also Good!

**francesco.vezzi** · 01-22-2009, 07:13 AM

Hi to everybody,
I have been reading this interesting discussion on next-generation sequencing technology for a long time, but until now I have never posted.
I just started my PhD, and this morning I was reading an article that compares the performance of short read assemblers (in particular Edena and Velvet). In this article I also found a reference to another tool: ALLPATHS. For more or less a year my interest has been mainly focused on de novo assembly with short reads, and the first article I read on this topic was the one on ALLPATHS. The problem is that I am not able to find a site where it is possible to download this tool. Everybody says it is the best tool, but it seems impossible to find.
Can somebody help me?

**Stegger** · 01-22-2009, 10:46 AM

Originally posted by francesco.vezzi:

> Can somebody help me?

Hi,
Not sure if you have seen the following link, but I found it through a Google search:

http://www.broad.mit.edu/events/recomb2005/posters/posters/5cafa40d906f58654e51973729bf4bfe_ButlerJ-RECOMB2.pdf

There are two email addresses in that publication that you could try emailing.
Sorry if you have already tried that.

Stegger

**kmcarr** · 01-22-2009, 01:13 PM

Francesco,

There is a source code download available through the supplementary materials page for the publication (http://genome.cshlp.org/content/18/5/810/suppl/DC1); however, this is an old file and likely out of date.

This site (http://www.broad.mit.edu/crd/wiki/index.php/Main_Page) serves as the main portal for the Broad Institute's software projects, but there is no download link for ALLPATHS there.

**francesco.vezzi** · 01-23-2009, 01:38 AM

Thanks to both,
The file in the supplementary material is too old. I have already sent an email to the authors, and they said they would put the new version online as soon as possible, but that was some months ago...

What surprises me is that all the articles on de novo assembly cite this tool, but I have never seen experimental results for it except in the ALLPATHS article itself.

**RudyS** · 01-23-2009, 01:10 PM

likewise there is no download available on the main page

http://www.broad.mit.edu/science/programs/genome-biology/computational-rd/computational-research-and-development

though they state that it is available

but they also indicate that ALLPATHS has only been tested with simulated data ... so maybe better to wait until it's been tested on something real?

rudyS

**francesco.vezzi** · 01-25-2009, 11:37 PM

Yes, for sure it is better to wait until it has been tested on a real data set. What seems strange to me is that in the last year I have found dozens of articles that cite ALLPATHS as an excellent assembler, but it seems that nobody has actually used it.

**rociobm** · 01-26-2009, 04:02 AM

I am trying to perform a de novo assembly of sequences (454 + Sanger) using the program wgs-assembler (Celera Assembler). This program needs a *.frg file to begin the assembly. Can someone tell me how to obtain the *.frg file from the *.fasta and *.qual files without using the *.xml file?

Thank you for everything.

Rocio

**bioinfosm** · 02-14-2009, 09:24 PM

Most conservative aligner

I wish to find the best alignment of my 32bp and 100bp reads (separately, not as single input), and also determine the next best matches. Which algorithm would be most suitable for this purpose?

I understand that it may take longer to run than some of the much faster algorithms.

Any thoughts?
