
**Zigster** · 12-16-2009, 07:54 PM

From sections 4.2 and 4.3 of the new CLC white paper, it appears that the old CLC assembler made slightly longer contigs (unpaired max: CLC 69 kbp vs VEL 60 kbp; N50: CLC 23 kbp vs VEL 16 kbp) at the expense of more incorrect ones (CLC: 36 wrong; VEL: 1 wrong). The newer one leans too far the other way. Who knows what Velvet parameters were used; probably the ones that most closely matched the total CLC assembly size.

http://www.clcbio.com/files/whitepapers/white_paper_on_de_novo_assembly_on_the_CLC_NGS_Cell.pdf

I'm not so sure there is a free lunch here.

Marta, what cvCut and expCov parameters did you use in your Velvet assemblies? The cvCut parameter has a huge effect on N50, assembly size, and read usage.

**Marta** · 12-16-2009, 10:22 PM

The experiment CLC did for this white paper does not reflect the actual performance of the CLC assembler. I think the assembler is much better than what the paper claims.

I use CLC Genomics Workbench on Windows with 32 GB RAM. A few days ago I started to test the latest (beta) version of the assembler for the Workbench. It performs much better than the older one. My input is 92.5 million single-end transcriptome reads of up to 85 nt (IGA, filtered FASTA).

About Velvet: my understanding is that there is not much sense in changing expCov for transcriptome reads. We work with normalized mRNA libraries, but the coverage still varies a lot between transcripts. About cvCut, you need to contact alex_kozik (he is a member here). He is the one who ran all the Velvet assemblies on the same set.

**MarcelS** · 02-05-2010, 08:12 PM

Originally posted by Neil

Hi all,
also, what software would you recommend for this?
hope someone can help
best regards
neil

Hi Neil,
I would recommend our new software, Oases; see the thread "Oases: De novo transcriptome assembly of very short reads" or http://www.ebi.ac.uk/~zerbino/oases/.
The software is designed to cope with alternative splicing and repetitive regions that normally break up contigs (for example, when genome assemblers are used). Oases can produce full-length transcripts if the coverage allows it, and it also supports and exploits paired-end information. And yes, paired-end information does improve the results. Oases already supports the longer reads (e.g. 75 bp) produced by current technologies.

Bests,
Marcel

**blackgore** · 05-17-2010, 04:20 AM

How are people evaluating their transcriptome assemblies? The standard N50 assessment can't be that useful, as the goal here isn't exactly to generate a tiny set of huge contigs...?

**Melissa** · 05-17-2010, 06:20 AM

Interesting question, Blackgore! Without a reference, gene models, or ESTs, how do you evaluate a de novo transcriptome assembly?

Originally posted by Marta

Since we assembled correctly the longest genes in plants including BIG (>15 kb) we believe the approach works.

More technical notes on filtering the reads and Velvet parameters used are here:
http://atgc-illumina.googlecode.com/...k_090910_D.pdf

I also found a 15 kb contig homologous to Arabidopsis BIG/ubiquitin-protein ligase in my plant transcriptome. I was told that a similar result was obtained in P. trichocarpa. Therefore, I think BIG/ubiquitin-protein ligase can serve as an indicator for plant transcriptome assembly. Long genes like BIG/ubiquitin-protein ligase won't be assembled from a poorly sequenced transcriptome.

Anyway, neither method (including N50) says much about scaffold quality. There can be scaffolds with lots of Ns due to poorly sequenced insert gaps. If you compare two datasets with the same N50 and the same longest contig, but one has lots of Ns, how can you tell the difference?
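To make the point about Ns concrete, here is a minimal Python sketch, not taken from any assembler's own reporting code. It computes N50 and the fraction of N bases for two toy scaffold sets; the names `n50` and `n_fraction` and the toy sequences are made up for illustration.

```python
def n50(lengths):
    """Return the N50: the length L such that sequences of
    length >= L together cover at least half of all bases."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    return 0  # empty input


def n_fraction(seq):
    """Fraction of positions in a sequence that are N (gap filler)."""
    return seq.upper().count("N") / len(seq) if seq else 0.0


# Two toy scaffold sets with identical lengths (200 and 100 bases),
# so N50 and the longest scaffold are the same in both.
clean = ["ACGT" * 50, "ACGT" * 25]
gappy = ["ACGT" * 10 + "N" * 160, "ACGT" * 25]

for name, scaffolds in (("clean", clean), ("gappy", gappy)):
    lengths = [len(s) for s in scaffolds]
    total_n = sum(s.upper().count("N") for s in scaffolds)
    print(name, "N50 =", n50(lengths),
          "N fraction =", round(total_n / sum(lengths), 3))
```

Both sets report the same N50, but the second one is mostly gap. Reporting the N fraction (or an N50 computed on scaffolds split at runs of N) alongside N50 is one way to tell such assemblies apart.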

**KevinLam** · 08-23-2010, 03:01 AM

This might be an interesting read

Transcriptome sequencing in an ecologically important tree species: assembly, annotation, and marker discovery

http://www.biomedcentral.com/1471-2164/11/180

Background: Massively parallel sequencing of cDNA is now an efficient route for generating enormous sequence collections that represent expressed genes. This approach provides a valuable starting point for characterizing functional genetic variation in non-model organisms, especially where whole genome sequencing efforts are currently cost and time prohibitive. The large and complex genomes of pines (Pinus spp.) have hindered the development of genomic resources, despite the ecological and economical importance of the group. While most genomic studies have focused on a single species (P. taeda), genomic level resources for other pines are insufficiently developed to facilitate ecological genomic research. Lodgepole pine (P. contorta) is an ecologically important foundation species of montane forest ecosystems and exhibits substantial adaptive variation across its range in western North America. Here we describe a sequencing study of expressed genes from P. contorta, including their assembly and annotation, and their potential for molecular marker development to support population and association genetic studies.

Results: We obtained 586,732 sequencing reads from a 454 GS XLR70 Titanium pyrosequencer (mean length: 306 base pairs). A combination of reference-based and de novo assemblies yielded 63,657 contigs, with 239,793 reads remaining as singletons. Based on sequence similarity with known proteins, these sequences represent approximately 17,000 unique genes, many of which are well covered by contig sequences. This sequence collection also included a surprisingly large number of retrotransposon sequences, suggesting that they are highly transcriptionally active in the tissues we sampled. We located and characterized thousands of simple sequence repeats and single nucleotide polymorphisms as potential molecular markers in our assembled and annotated sequences. High quality PCR primers were designed for a substantial number of the SSR loci, and a large number of these were amplified successfully in initial screening.

Conclusions: This sequence collection represents a major genomic resource for P. contorta, and the large number of genetic markers characterized should contribute to future research in this and other pines. Our results illustrate the utility of next generation sequencing as a basis for marker development and population genomics in non-model species.

DNASTAR also lists de novo transcriptome assembly

http://www.dnastar.com/t-sub-nextgen-genome-solutions-de-novo-transcriptome.aspx

I am curious whether anyone has managed to do de novo transcriptome assembly with the shorter SOLiD reads.

**petang** · 03-05-2011, 05:39 AM

I am working on transcriptome sequencing (GAII, 75 bp paired-end) of a nematode without a reference genome. We tried CLC V4.5 and SOAPdenovo to assemble the reads. My approach is to limit the number of mismatches to zero and the overlapping region to 50% for the first round of assembly, to obtain reliable reference contigs. Then I use the reference contigs from the first assembly to re-assemble with the unassembled reads, and allow 2 mismatches for the second round of assembly. The largest contig we assembled is an 18 kbp titin gene.
Depending on the intron/exon complexity of the organism and the depth of sequencing, around 30,000-70,000 transcripts supported by more than 100 reads each should be identified.
The remaining steps are the same as for classical EST annotation, starting with BLASTx against UniProt and NCBI nr, followed by InterPro.
The last problem for RNA-seq of an organism without a genome is how to identify the splice variants.

**edge** · 03-30-2011, 11:52 PM

Hi petang, I'm also just starting my research on RNA-seq.
Do you have any idea how to identify alternative splice variants from RNA-seq data without a reference genome right now?

**Rachel** · 03-31-2011, 12:07 AM

Just a quick question on de novo transcriptome assembly: say I have Illumina 2x100 bp RNA-seq reads. Once they are assembled by a de novo assembler, how do we annotate the transcripts? By running BLAST or Blast2GO? Is that sufficient?

**petang** · 03-31-2011, 02:17 AM

In my experience, the number of contigs assembled from 20 million 2x100 bp reads varies from 30,000 to 50,000, depending on the complexity of the genome. The first question is how many contigs you want to annotate and what the purpose of your experiment is. If you are aiming at gene discovery, the 10,000 most highly expressed contigs should be good enough. Alternatively, you can choose the long contigs (let's say, longer than 1000 bp).
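As a rough illustration of the "long contigs" option, here is a minimal Python sketch that keeps contigs of at least 1000 bp from a FASTA file. The helper names `read_fasta` and `long_contigs` and the toy input are made up for the example; in practice you would pass an open handle to your assembler's contig file instead of the `io.StringIO` stand-in.

```python
import io


def read_fasta(handle):
    """Yield (header, sequence) pairs from a FASTA file handle."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def long_contigs(handle, min_length=1000):
    """Keep contigs at least min_length bases long, longest first."""
    kept = [(h, s) for h, s in read_fasta(handle) if len(s) >= min_length]
    return sorted(kept, key=lambda rec: len(rec[1]), reverse=True)


# Toy input standing in for an assembler's contig FASTA file.
# contig_3 is wrapped over two lines to show multi-line records.
example = io.StringIO(
    ">contig_1\n" + "A" * 1500 + "\n"
    ">contig_2\n" + "C" * 300 + "\n"
    ">contig_3\n" + "G" * 600 + "\n" + "T" * 600 + "\n"
)
selected = long_contigs(example, min_length=1000)
print([(h, len(s)) for h, s in selected])
```

Selecting by expression instead would need per-contig read counts from mapping the reads back to the assembly, which this sketch does not cover.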

If you are doing comparative transcriptomics, you can obviously choose the differentially expressed contigs.

In either case, it is impossible to annotate all contigs without the support of a bioinformatics group.

The quickest way to annotate the transcripts is BLASTx against UniProt, then retrieve all the information (Pfam, GO, KEGG, etc.) from the hits. However, you will miss all the conserved hypothetical proteins, which are only available from NCBI. So I would start with BLASTx against UniProt, then run the contigs without hits through BLASTx against NCBI nr.
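The second step needs the list of contigs that had no UniProt hit. Here is a minimal Python sketch of that bookkeeping, assuming the first BLASTx run was saved in BLAST's tab-separated tabular format, where the first column is the query ID. The function names, contig IDs, and subject names below are made up for illustration.

```python
def ids_with_hits(blast_tab_lines):
    """Collect query IDs (first column) from BLAST tabular output lines."""
    hits = set()
    for line in blast_tab_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        hits.add(line.split("\t")[0])
    return hits


def contigs_without_hits(contig_ids, blast_tab_lines):
    """Return contig IDs, in input order, that produced no hit."""
    hit_ids = ids_with_hits(blast_tab_lines)
    return [cid for cid in contig_ids if cid not in hit_ids]


# Toy data: three contigs, two of which hit something in the first search.
# Subject names and scores are invented placeholders.
contig_ids = ["contig_1", "contig_2", "contig_3"]
uniprot_hits = [
    "contig_1\thypothetical_subject_A\t87.5\t120\t15\t0\t1\t360\t10\t129\t1e-60\t210",
    "contig_1\thypothetical_subject_B\t60.0\t100\t40\t0\t1\t300\t5\t104\t1e-20\t95",
    "contig_3\thypothetical_subject_C\t92.1\t200\t16\t0\t1\t600\t1\t200\t1e-90\t380",
]

remaining = contigs_without_hits(contig_ids, uniprot_hits)
print(remaining)  # these would go on to the search against NCBI nr
```

With real data, `contig_ids` would come from the contig FASTA headers and `uniprot_hits` from iterating over the saved BLAST output file.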

**petang** · 03-31-2011, 02:19 AM

Originally posted by edge

Hi petang, I'm also just starting my research on RNA-seq.
Do you have any idea how to identify alternative splice variants from RNA-seq data without a reference genome right now?

Still no idea on this.
Sorry

**Rachel** · 03-31-2011, 04:30 PM

Hi petang. Thanks for your reply. Appreciate it.

**edge** · 04-06-2011, 07:18 PM

Dear petang,

That's OK, no worries.
I'm just not sure whether my question about identifying alternative splice variants without a reference genome sequence makes sense right now.
Does it sound logical or not?
It seems really quite difficult to identify alternative splicing variation without a reference genome sequence.

Thanks in advance for any advice.

**jmartin127** · 04-06-2011, 08:19 PM

We've tried a few different programs for de novo transcriptome assembly; you can see our paper that came out about a year ago here: http://www.biomedcentral.com/1471-2164/11/663.

As Marcel pointed out, Oases seems to do a pretty good job with the latest updates. In our paper we introduce Rnnotator, which adds some additional pre- and post-processing steps to further improve the assembly. We've been able to assemble a plant transcriptome, but we are still evaluating the result.

Originally posted by Neil

Hi all,
We are planning to perform an mRNA-seq run using the Illumina GAII platform. We are worried about assembling the transcriptome when we get our data back. Most of the RNA-seq papers I read are assembling to a reference genome/transcriptome, we don't have either of these! Is there anyone out there that has assembled cDNA short reads de novo? If so, are paired reads as important as they are with genome assembly?
also, what software would you recommend for this?
hope someone can help
best regards
neil

**dnusol** · 05-10-2011, 08:21 AM

Hi,

Is web-based BLASTx able to digest the full contig output from Velvet or Oases, or is it better to download both BLAST and the UniProt database and work locally?

Best,

Dave
