**twaddlac** · 03-14-2012, 01:03 PM

Hey Meli,

The first thing would be to trim the primers/adapters/barcodes. I do this by mapping the known sequences (primers/adapters/barcodes) to the reads and then trimming them with a Perl script or something like that.
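As a rough illustration of the trimming step, here is a minimal Python sketch (not the Perl script mentioned above). It assumes exact matches only and a 3' adapter; the adapter string is a placeholder, so substitute whatever your sequencing facility supplies. `trim_adapter` and `min_overlap` are names made up for this example.

```python
def trim_adapter(read: str, adapter: str, min_overlap: int = 5) -> str:
    """Remove `adapter` and everything after it from the 3' end of `read`.

    Handles a complete adapter inside the read, and a partial adapter
    (at least `min_overlap` bases) at the very end. Exact matches only.
    """
    # Case 1: the whole adapter occurs somewhere in the read.
    pos = read.find(adapter)
    if pos != -1:
        return read[:pos]
    # Case 2: the read ends with a prefix of the adapter (longest first).
    longest = min(len(adapter), len(read)) - 1
    for overlap in range(longest, min_overlap - 1, -1):
        if read.endswith(adapter[:overlap]):
            return read[:len(read) - overlap]
    return read


# Placeholder adapter, for illustration only.
ADAPTER = "AGATCGGAAGAGC"
print(trim_adapter("ACGTACGTTTGACC" + ADAPTER + "TTTT", ADAPTER))
print(trim_adapter("ACGTACGTTTGACC" + ADAPTER[:7], ADAPTER))
```

In a real FASTQ file the quality string has to be cut to the same length as the trimmed sequence, and sequencing errors inside the adapter call for approximate matching, which this sketch does not attempt.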

Next would be to get the closest possible reference sequence (if known and/or available) and map your paired reads to it to filter out the good, the bad, and the ugly. If no reference is known, or none is close enough, then it may be worthwhile to skip this step.

After that I generally filter my reads based on quality score. Trimming the actual reads to a shorter size has also produced very good results, so if you're not getting the assemblies you want with the full reads, I STRONGLY recommend trying it out.
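A minimal Python sketch of those two ideas (truncating reads to a fixed length, then dropping reads with a low mean quality). The thresholds are arbitrary, and the Phred+33 offset is an assumption: older Illumina pipelines wrote Phred+64, so check your files first. All function names here are made up for the example.

```python
import io
from typing import Iterable, Iterator, TextIO, Tuple

Record = Tuple[str, str, str]  # (header, sequence, quality string)


def read_fastq(handle: TextIO) -> Iterator[Record]:
    """Yield records from a FASTQ stream with exactly four lines per read."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            return
        seq = handle.readline().rstrip("\n")
        handle.readline()  # the '+' separator line
        qual = handle.readline().rstrip("\n")
        yield header, seq, qual


def mean_quality(qual: str, offset: int = 33) -> float:
    """Mean Phred score of a quality string (offset 33 is an assumption)."""
    return sum(ord(c) - offset for c in qual) / len(qual)


def clip_and_filter(records: Iterable[Record], keep_len: int = 50,
                    min_mean_q: float = 20.0, offset: int = 33) -> Iterator[Record]:
    """Truncate each read to keep_len bases, then drop low-quality reads."""
    for header, seq, qual in records:
        seq, qual = seq[:keep_len], qual[:keep_len]
        if qual and mean_quality(qual, offset) >= min_mean_q:
            yield header, seq, qual


# Tiny made-up example: one good read ('I' = Q40) and one bad read ('#' = Q2).
demo = io.StringIO(
    "@good\nACGTACGTAC\n+\nIIIIIIIIII\n"
    "@bad\nACGTACGTAC\n+\n##########\n"
)
kept = list(clip_and_filter(read_fastq(demo), keep_len=6))
print(kept)
```

With paired-end data, a read and its mate should be kept or dropped together so the two files stay in sync; this sketch treats each read independently.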

SOAPdenovo is a good program, but there are many others and, as is usually the case, you really have to pick an assembler that fits your data. I would recommend Velvet and ABySS for starters. There are also a lot of good papers about how assemblers perform. Here are some of my favorites:

GAGE: A critical evaluation of genome assemblies and assembly algorithms - PubMed

http://www.ncbi.nlm.nih.gov/pubmed/22147368

New sequencing technology has dramatically altered the landscape of whole-genome sequencing, allowing scientists to initiate numerous projects to decode the genomes of previously unsequenced organisms. The lowest-cost technology can generate deep coverage of most species, including mammals, in just …

Feature-by-Feature – Evaluating De Novo Sequence Assembly

http://www.plosone.org/article/info%3Adoi%2F10.1371%2Fjournal.pone.0031002

The whole-genome sequence assembly (WGSA) problem is among one of the most studied problems in computational biology. Despite the availability of a plethora of tools (i.e., assemblers), all claiming to have solved the WGSA problem, little has been done to systematically compare their accuracy and power. Traditional methods rely on standard metrics and read simulation: while on the one hand, metrics like N50 and number of contigs focus only on size without proportionately emphasizing the information about the correctness of the assembly, comparisons performed on simulated dataset, on the other hand, can be highly biased by the non-realistic assumptions in the underlying read generator. Recently the Feature Response Curve (FRC) method was proposed to assess the overall assembly quality and correctness: FRC transparently captures the trade-offs between contigs' quality against their sizes. Nevertheless, the relationship among the different features and their relative importance remains unknown. In particular, FRC cannot account for the correlation among the different features. We analyzed the correlation among different features in order to better describe their relationships and their importance in gauging assembly quality and correctness. In particular, using multivariate techniques like principal and independent component analysis we were able to estimate the “excess-dimensionality” of the feature space. Moreover, principal component analysis allowed us to show how poorly the acclaimed N50 metric describes the assembly quality. Applying independent component analysis we identified a subset of features that better describe the assemblers performances. We demonstrated that by focusing on a reduced set of highly informative features we can use the FRC curve to better describe and compare the performances of different assemblers. 
Moreover, as a by-product of our analysis, we discovered how often evaluation based on simulated data, obtained with state of the art simulators, lead to not-so-realistic results.

De novo assembly of short sequence reads - PubMed

http://www.ncbi.nlm.nih.gov/pubmed/20724458

A new generation of sequencing technologies is revolutionizing molecular biology. Illumina's Solexa and Applied Biosystems' SOLiD generate gigabases of nucleotide sequence per week. However, a perceived limitation of these ultra-high-throughput technologies is their short read-lengths. De novo assem …

Also, it would be beneficial to install AMOScmp so that you can use its tools to help analyze your assemblies. This technique has a learning curve but it's so fun! Be patient and don't be scared to ask questions... there's a lot of data out there.

I hope this helps and good luck!

**Meligethes** · 03-14-2012, 01:07 PM

Thank you very much, I will take a look at all this tomorrow, because right now I am a bit upset about all that :/

**rahularjun86** · 03-14-2012, 02:46 PM

Hi,
1) You can use the Sickle tool (https://github.com/najoshi/sickle) for data preprocessing, and then view the data statistics with FastQC or the FASTX tools (http://hannonlab.cshl.edu/fastx_toolkit/).
2) You can try the Velvet assembler (http://www.ebi.ac.uk/~zerbino/velvet/) with k-mer sizes from 21 to 65 in increments of 2. For the expected coverage you can use auto, or calculate it using R as explained in the manual; try coverage cutoffs from 2 to 15. Or try other assemblers like SOAPdenovo or ABySS.
3) Choose the assembly with the best N50 and other parameters (genome size, largest contig, reads used, number of contigs).
4) Use Minimus2 or Minimus2_blat (http://sourceforge.net/apps/mediawik...etting_Started) for merging assemblies, and Bambus2 or SSPACE (http://www.baseclear.com/landingpages/sspacev12/) for scaffolding. SSPACE is very easy to use, with very simple input options.
5) Check the completeness of the genome using the CEGMA pipeline (http://korflab.ucdavis.edu/Datasets/cegma/).
6) Use RepeatMasker (http://www.repeatmasker.org/) or other tools for repeat element prediction, and AUGUSTUS (http://augustus.gobics.de/) or other tools such as Genscan or geneid for gene prediction.
7) Finally, use MUMmer (http://mummer.sourceforge.net/) for comparative analysis.

Best Wishes,
Rahul
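For step 3, N50 is easy to compute yourself from the contig lengths of each assembly. A minimal Python sketch, using the common definition (the contig length at which the running total, taken from the longest contig down, first reaches half of all assembled bases); the contig lengths below are invented for illustration.

```python
from typing import Iterable, List


def n50(contig_lengths: Iterable[int]) -> int:
    """Return the N50 of a set of contig lengths (0 if there are none)."""
    lengths: List[int] = sorted(contig_lengths, reverse=True)
    total = sum(lengths)
    running = 0
    for length in lengths:
        running += length
        # Stop at the first contig that brings us to half the total bases.
        if running * 2 >= total:
            return length
    return 0


# The k-mer sweep from step 2: 21, 23, ..., 65.
k_values = list(range(21, 66, 2))

# Invented contig lengths standing in for one assembly.
example_contigs = [2, 3, 4, 5, 6, 7, 8, 9, 10]
print(len(k_values), n50(example_contigs))
```

As the FRC paper linked earlier in the thread points out, N50 measures size and not correctness, so it is worth comparing it alongside the other parameters in step 3 instead of picking the assembly on N50 alone.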

**Meligethes** · 03-16-2012, 08:36 AM

OK thanks for all this help !

I asked for the primer and adapter sequences in order to cut them off (I thought this was already done when I received the FASTQ files, but actually I have such a high percentage of sequence duplication (92%!!) that I suppose they are still in the reads).
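One way to look into a duplication figure like this is to count exact duplicate reads directly and inspect the most frequent sequences, which often turn out to be adapters or primers. A minimal Python sketch, assuming the sequences fit in memory; FastQC estimates its figure from a subset of the file, so the numbers need not agree exactly. The reads below are invented.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def duplication_rate(sequences: Iterable[str]) -> float:
    """Fraction of reads that are exact copies of an earlier read."""
    counts = Counter(sequences)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return 1.0 - len(counts) / total


def most_common_reads(sequences: Iterable[str], n: int = 5) -> List[Tuple[str, int]]:
    """The n most frequent read sequences with their counts."""
    return Counter(sequences).most_common(n)


# Invented reads: one sequence repeated 8 times plus two unique ones.
reads = ["AAAA"] * 8 + ["ACGT", "TTTT"]
print(duplication_rate(reads))
print(most_common_reads(reads, 1))
```

If the top sequences match the primer or adapter sequences once you receive them, that would support the suspicion that they were never trimmed.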

I have been told to try to find a reference genome close enough to rely on for the assembly.
I am currently on the NCBI taxonomy browser but I still can't find anything close to any insect.

The software suggested for this kind of assembly is:
- Velvet
- Mira
- SOAPdenovo
- Bowtie (?)

I am looking into installing them.

**mjp** · 03-16-2012, 10:35 PM

Why don't you have a look at the wiki?

SEQanswers

http://seqanswers.com/wiki/How-to/de_novo_assembly

**Meligethes** · 03-17-2012, 03:15 AM

Originally posted by mjp:

http://seqanswers.com/wiki/How-to/de_novo_assembly

+1, thank you

**Meligethes** · 03-17-2012, 11:24 AM

[Translated from French] I am waiting to get the primer and adapter sequences, as well as the access codes for the remote server (a kind of supercomputer: UPPMAX, UPNEXT).

In the meantime I am a bit "stuck": what other kinds of information (besides the quality analyses provided by FASTX Toolkit and FastQC) can I get from my "plain" FASTQ files?

Thanks again for your help.

**Meligethes** · 03-19-2012, 02:19 AM

Oops, I just realized I wrote in French, sorry; anyway, it was not important.

I just cannot understand why all reads are EXACTLY the same length (76).
Reads come from lane 5, but I have files "lane-5-1" and "lane-5-2". Why is this split in two? Because of the paired-end sequencing? I mean, one is 5'-3' and the other 3'-5'?
All reads from lane 5-1 and lane 5-2 are the same length and the numbers of reads are equal...?

**krobison** · 03-19-2012, 04:56 AM

Originally posted by Meligethes:

I just cannot understand why all reads are EXACTLY the same length (76).
Reads come from lane 5, but I have files "lane-5-1" and "lane-5-2". Why is this split in two? Because of the paired-end sequencing? I mean, one is 5'-3' and the other 3'-5'?
All reads from lane 5-1 and lane 5-2 are the same length and the numbers of reads are equal...?

Illumina (and SOLiD) technology inherently generates reads of exactly the same length, unless you have trimmed them. The machine reads the data in cycles, and each cycle can acquire one and only one base.

If the two files are paired ends, then the identifiers should be the same or very similar (perhaps with /1 and /2, or something similar, as the only difference); look at the first read identifier in each file.
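That identifier check can be automated. A minimal Python sketch, assuming the two common header styles (a trailing /1 or /2, or a space followed by the mate number); the header strings are invented, and `base_id` and `first_mismatch` are names made up for this example.

```python
from typing import Iterable, Optional


def base_id(header: str) -> str:
    """Read name without '@', without anything after whitespace, and
    without a trailing /1 or /2 mate suffix."""
    name = header[1:] if header.startswith("@") else header
    parts = name.split()
    name = parts[0] if parts else ""
    if name.endswith(("/1", "/2")):
        name = name[:-2]
    return name


def first_mismatch(headers1: Iterable[str], headers2: Iterable[str]) -> Optional[int]:
    """Index of the first pair of headers that disagree, or None if all agree."""
    for index, (h1, h2) in enumerate(zip(headers1, headers2)):
        if base_id(h1) != base_id(h2):
            return index
    return None


# Invented headers in both naming styles.
file1 = ["@HWUSI:5:1:1000:2000#0/1", "@HWUSI:5:1:1001:2050 1:N:0:ATCACG"]
file2 = ["@HWUSI:5:1:1000:2000#0/2", "@HWUSI:5:1:1001:2050 2:N:0:ATCACG"]
print(first_mismatch(file1, file2))
```

To run it on real files, pass every fourth line (the header line of each record) from each FASTQ file; since `zip` stops at the shorter input, it is worth comparing the read counts separately.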

**Meligethes** · 03-19-2012, 05:06 AM

OK, thank you, I got it, but how does the machine manage to know that "sequence xx" in this position is the same as "sequence xx" in this other position in the other file??

I searched on the internet but I could not find anything that helped with this...

**krobison** · 03-19-2012, 01:39 PM

Originally posted by Meligethes:

OK, thank you, I got it, but how does the machine manage to know that "sequence xx" in this position is the same as "sequence xx" in this other position in the other file??

I searched on the internet but I could not find anything that helped with this...

Optically -- the system uses high-precision imagery & aligns images between the first read & the second read. Indeed, it takes a set of images for each cycle and must align these to call the bases for a single end.

**Meligethes** · 03-19-2012, 01:44 PM

Do you mean that the machine has 2 main phases:
1) only forward cycles at each cluster position
2) only reverse cycles at each cluster position

and then "aligns" the images, so that the same points come from the same cluster and therefore the same fragment?

Sorry, I feel a bit of an idiot about this, but I really can't figure out how this works, and "because this is a paired-end technique" or "because these are high-end optical lasers" is really not sufficient for me.

**krobison** · 03-20-2012, 10:36 AM

Yes.

The system runs through all of read 1. Then there is a clever molecular biology scheme which flips things around and then read 2 is generated.

Tech Summary: Illumina's Solexa Sequencing Technology - SEQanswers

http://seqanswers.com/forums/showthread.php?t=21

Bridged amplification & clustering followed by sequencing by synthesis. (Genome Analyzer / HiSeq / MiSeq)
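To make the relationship between the two reads concrete, here is a deliberately simplified Python model of one cluster: read 1 is the start of the fragment, and read 2 is the start of the opposite strand, i.e. the reverse complement of the fragment's other end. Both reads are reported 5' to 3'. The fragment sequence is invented, and the model ignores adapters and sequencing errors.

```python
from typing import Tuple

# Translation table for complementing unambiguous DNA bases.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Reverse complement of a DNA sequence made of A, C, G, T."""
    return seq.translate(COMPLEMENT)[::-1]


def paired_reads(fragment: str, read_len: int) -> Tuple[str, str]:
    """Simulate the two reads obtained from one fragment (one cluster)."""
    read1 = fragment[:read_len]                      # first pass
    read2 = reverse_complement(fragment)[:read_len]  # after the strand flip
    return read1, read2


# Invented 20-base fragment, read 6 bases from each end.
fragment = "AACCGGTTAGCTAGGATCCA"
read1, read2 = paired_reads(fragment, 6)
print(read1, read2)
```

Because both reads come from the same physical cluster on the flow cell, record number n in the first file and record number n in the second file belong to the same fragment, which is why the two files have identical read counts.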

**seb567** · 04-05-2012, 11:38 AM

Originally posted by Meligethes:

Hi there.

I am completely new to the world of (de novo) genome assembly and I don't know where to begin. When I asked for help at the department they said "go to seqanswers", so here I am to get some help...

I have been given some sequencing data from an insect (colza pollen beetle) and have to make a genome assembly. This is Illumina data in paired-end format.

There are 3 fastq files :
- lane 5/1 : 11 423 167 reads of length 76
- lane 5/2 : 11 423 167 reads of length 76
- lane 7 : 9 294 857 reads of length 152

An average beetle genome size is said to be about 650Mbp.

Apparently "we" have a server with 192GB RAM where SOAPdenovo is/will be installed.

I have been told to first check the sequence quality, so after a bit of surfing I found "FASTQC" (with a good YouTube tutorial). I don't know what I have to do after that... at all.

I am not here to ask you to do the job for me, and I know I will have a lot of reading and research to do, but I would like to know what the main guideline to follow is, what things to watch out for, what traps to avoid, etc.

Thank you in advance for any kind of help,

M.

(PS: according to the FASTQC tutorial, the data quality is quite poor; I can post the output on demand)

Hello,

You may want to try Ray, an easy-to-use distributed assembler.

Ray -- Parallel genome assemblies for parallel DNA sequencing

http://denovoassembler.sf.net
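Whichever assembler is used, it may help to first estimate the sequencing depth from the numbers in the quoted post. A minimal Python sketch, assuming every base in the three files counts toward coverage and taking the 650 Mbp figure at face value (the real genome size is unknown).

```python
from typing import Iterable, Tuple


def expected_coverage(libraries: Iterable[Tuple[int, int]], genome_size: int) -> float:
    """Average depth: total sequenced bases divided by genome size."""
    total_bases = sum(n_reads * read_len for n_reads, read_len in libraries)
    return total_bases / genome_size


# (number of reads, read length) as given in the original post.
libraries = [
    (11_423_167, 76),   # lane 5/1
    (11_423_167, 76),   # lane 5/2
    (9_294_857, 152),   # lane 7
]
genome_size = 650_000_000  # rough beetle average quoted in the post

coverage = expected_coverage(libraries, genome_size)
print(f"{coverage:.1f}x")
```

This comes out at a little under 5x before any trimming or duplicate removal, which is low for short-read de novo assembly, so it is worth keeping in mind when judging the results.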


[help] de novo Genome Assembly : beginner
