De Novo Assembly of a transcriptome

edge replied

03-30-2011, 11:52 PM
Hi petang, I'm also just starting my research on RNA-seq.
Do you have any ideas on how to identify alternative splice variants from RNA-seq data without a reference genome?
petang replied

03-05-2011, 05:39 AM
I am working on transcriptome sequencing (GAII, 75 bp paired-end) of a nematode without a reference genome. We tried CLC V4.5 and SOAPdenovo to assemble the reads. My approach is to limit the number of mismatches to zero and require a 50% overlapping region in the first round of assembly, to obtain reliable reference contigs. We then use the reference contigs from the first assembly to re-assemble with the unassembled reads, and set mis-2 for the second round of assembly. The largest contig we assembled is an 18 kbp titin gene.
Depending on the complexity of the intron/exon structure of the organism and the depth of sequencing, around 30,000-70,000 transcripts supported by more than 100 reads should be identified.
The remaining steps are the same as for classical EST annotation, starting with Blastx against UniProt, NCBI nr and InterPro.
The last problem for RNA-seq of an organism without a genome is how to identify the splice variants.
KevinLam replied

08-23-2010, 03:01 AM
This might be an interesting read

Transcriptome sequencing in an ecologically important tree species: assembly, annotation, and marker discovery - BMC Genomics

http://www.biomedcentral.com/1471-2164/11/180

Background: Massively parallel sequencing of cDNA is now an efficient route for generating enormous sequence collections that represent expressed genes. This approach provides a valuable starting point for characterizing functional genetic variation in non-model organisms, especially where whole genome sequencing efforts are currently cost and time prohibitive. The large and complex genomes of pines (Pinus spp.) have hindered the development of genomic resources, despite the ecological and economical importance of the group. While most genomic studies have focused on a single species (P. taeda), genomic level resources for other pines are insufficiently developed to facilitate ecological genomic research. Lodgepole pine (P. contorta) is an ecologically important foundation species of montane forest ecosystems and exhibits substantial adaptive variation across its range in western North America. Here we describe a sequencing study of expressed genes from P. contorta, including their assembly and annotation, and their potential for molecular marker development to support population and association genetic studies.

Results: We obtained 586,732 sequencing reads from a 454 GS XLR70 Titanium pyrosequencer (mean length: 306 base pairs). A combination of reference-based and de novo assemblies yielded 63,657 contigs, with 239,793 reads remaining as singletons. Based on sequence similarity with known proteins, these sequences represent approximately 17,000 unique genes, many of which are well covered by contig sequences. This sequence collection also included a surprisingly large number of retrotransposon sequences, suggesting that they are highly transcriptionally active in the tissues we sampled. We located and characterized thousands of simple sequence repeats and single nucleotide polymorphisms as potential molecular markers in our assembled and annotated sequences. High quality PCR primers were designed for a substantial number of the SSR loci, and a large number of these were amplified successfully in initial screening.

Conclusions: This sequence collection represents a major genomic resource for P. contorta, and the large number of genetic markers characterized should contribute to future research in this and other pines. Our results illustrate the utility of next generation sequencing as a basis for marker development and population genomics in non-model species.

DNASTAR also lists de novo transcriptome assembly

http://www.dnastar.com/t-sub-nextgen-genome-solutions-de-novo-transcriptome.aspx

I am curious whether anyone has managed to do de novo transcriptome assembly with the shorter SOLiD reads.

Last edited by KevinLam; 08-23-2010, 03:03 AM.
Melissa replied

05-17-2010, 06:20 AM
Interesting question, blackgore! Without a reference, gene models, or ESTs, how do you evaluate a de novo transcriptome assembly?

Originally posted by Marta

Since we assembled correctly the longest genes in plants including BIG (>15 kb) we believe the approach works.

More technical notes on filtering the reads and Velvet parameters used are here:
http://atgc-illumina.googlecode.com/...k_090910_D.pdf

I also found a 15 kb contig homologous to Arabidopsis BIG/ubiquitin-protein ligase in my plant transcriptome. I was told that a similar result was obtained in P. trichocarpa. Therefore, I think BIG/ubiquitin-protein ligase can serve as an indicator for plant transcriptome assembly: long genes like BIG/ubiquitin-protein ligase won't be assembled from a poorly sequenced transcriptome.

Anyway, neither method (including N50) says much about scaffold quality. There can be scaffolds with lots of Ns due to poorly sequenced insert gaps. If you compare two datasets with the same N50 and longest contig, but one has lots of Ns, how can you tell the difference?
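
One possible way to see that difference is to report the fraction of N bases next to N50. The snippet below is only a minimal sketch in Python on made-up toy sequences; the helper names `n50` and `n_fraction` are hypothetical and not part of any assembler.

```python
def n50(lengths):
    """Return the N50: the length L such that sequences of length >= L
    together hold at least half of all assembled bases."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    return 0  # empty input


def n_fraction(seqs):
    """Return the fraction of all bases that are N (gap filler)."""
    total = sum(len(s) for s in seqs)
    if total == 0:
        return 0.0
    return sum(s.upper().count("N") for s in seqs) / total


# Toy scaffold sets (invented data) with identical lengths, so N50 and the
# longest scaffold are the same; only the N content differs.
clean = ["ACGT" * 25, "ACGT" * 15, "ACGT" * 10]
gappy = ["ACGT" * 5 + "N" * 60 + "ACGT" * 5, "ACGT" * 15, "ACGT" * 10]

for name, seqs in (("clean", clean), ("gappy", gappy)):
    lengths = [len(s) for s in seqs]
    print(name, n50(lengths), max(lengths), round(n_fraction(seqs), 3))
```

Both toy sets give the same N50 and longest scaffold, while the N fraction separates them. On real data the sequences would be read from the scaffold FASTA file instead of being typed in.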
blackgore replied

05-17-2010, 04:20 AM
How are people evaluating their transcriptome assemblies? The standard N50 assessment can't be that useful, as the goal here isn't exactly to generate a tiny set of huge contigs...?
MarcelS replied

02-05-2010, 08:12 PM
Originally posted by Neil

Hi all,
also, what software would you recommend for this?
hope someone can help
best regards
neil

Hi Neil,
I would recommend our new software, Oases; see the thread "Oases: De novo transcriptome assembly of very short reads" or http://www.ebi.ac.uk/~zerbino/oases/.
The software is designed to cope with alternative splicing and repetitive regions that normally break up contigs (for example, when genome assemblers are used). Oases can produce full-length transcripts if the coverage allows it, and it also supports and exploits paired-end information. And yes, paired-end information does improve the results. Oases already supports the longer reads (e.g. 75 bp) produced by current technologies.

Bests,
Marcel
Marta replied

12-16-2009, 10:22 PM
The experiment CLC did for this white paper does not reflect the actual performance of the CLC assembler. I think the assembler is much better than what the paper claims.

I use CLC Genomics Workbench on Windows with 32 GB RAM. A few days ago I started to test the latest (beta) version of the assembler for Workbench. It performs much better than the older one. My input is 92.5 million single transcriptome reads that are up to 85 nt long (IGA, filtered FASTA).

About Velvet: my understanding is that there is not much sense in changing expCov for transcriptome reads. We work with normalized mRNA libraries, but the coverage still varies a lot between different transcripts. About cvCut, you need to contact alex_kozik (he is a member here). He is the one who ran all the Velvet assemblies on the same set.
Zigster replied

12-16-2009, 07:54 PM
From sections 4.2 and 4.3 of the new CLC white paper, it appears that the old CLC assembler made slightly longer contigs (unpaired max: CLC 69 kbp vs. Velvet 60 kbp; N50: CLC 23 kbp vs. Velvet 16 kbp) at the expense of more incorrect ones (CLC: 36 wrong; Velvet: 1 wrong). The newer one leans too far the other way. Who knows what Velvet parameters were used; probably the ones that most closely matched the total CLC assembly size.

http://www.clcbio.com/files/whitepapers/white_paper_on_de_novo_assembly_on_the_CLC_NGS_Cell.pdf

I'm not so sure there is a free lunch here.

Marta, what cvCut and expCov parameters did you use in your Velvet assemblies? The cvCut parameter has a huge effect on N50, assembly size, and read usage.
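
For anyone who wants to see how strongly a coverage cutoff can move these numbers, here is a rough sketch in Python. It does not call Velvet or reproduce what velvetg does internally; it only applies a cutoff after the fact to contig headers, assuming they follow the `NODE_<id>_length_<n>_cov_<x>` naming pattern. The headers below, the `min_cov` parameter and the `summarize` helper are all made up for illustration, and lengths are used in whatever unit the header reports.

```python
import re

# Assumed header pattern: NODE_<id>_length_<n>_cov_<x>
HEADER = re.compile(r"NODE_(\d+)_length_(\d+)_cov_([\d.]+)")


def parse_header(header):
    """Extract (length, coverage) from one contig header line."""
    match = HEADER.search(header)
    if match is None:
        raise ValueError("unexpected header format: " + header)
    return int(match.group(2)), float(match.group(3))


def n50(lengths):
    """Length L such that contigs of length >= L hold at least half the bases."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    return 0


def summarize(headers, min_cov):
    """Contig count, total length and N50 of contigs with coverage >= min_cov."""
    kept = [length for length, cov in map(parse_header, headers) if cov >= min_cov]
    return {"contigs": len(kept), "total_length": sum(kept), "n50": n50(kept)}


# Invented headers, for illustration only.
toy_headers = [
    ">NODE_1_length_1000_cov_30.0",
    ">NODE_2_length_800_cov_12.0",
    ">NODE_3_length_700_cov_3.0",
    ">NODE_4_length_600_cov_1.5",
    ">NODE_5_length_500_cov_1.2",
]

for cutoff in (0, 2, 5):
    print(cutoff, summarize(toy_headers, cutoff))
```

Even on this toy set, raising the cutoff shrinks the assembly size and contig count and eventually shifts N50, which is why two assemblies are hard to compare unless the cutoff used for each is reported.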

Last edited by Zigster; 12-16-2009, 08:16 PM.
Marta replied

12-16-2009, 11:02 AM
KevinLam,

The data is unpublished. We are re-assembling the reads using the latest version of the CLC assembler and Velvet with adjusted parameters. The number of transcriptome contigs in our latest assemblies went down from ~70K to ~57K. I have a presentation online with results from last summer's assemblies here:

https://docs.google.com/fileview?id=0B9g4lIAKxQTaMjViZDJkYTMtYmEyYS00MjI1LWFhMjEtNDlhNDllMWYzNjkz&hl=en

Since we assembled correctly the longest genes in plants including BIG (>15 kb) we believe the approach works.

More technical notes on filtering the reads and Velvet parameters used are here:

http://atgc-illumina.googlecode.com/files/ILLUPA_Overview_AKozik_090910_D.pdf
Peter Bjarke Olsen replied

12-16-2009, 08:43 AM
We have done several de novo transcriptome projects, mainly using Illumina technology and the ABySS assembler. In general it works, but the problem is getting full-length sequences (from start to stop codon). We recently learned that some labs co-ligate the transcripts prior to nebulization, which should increase the number of full-length genes. The reason is that fragmentation is non-random at the ends, leaving the ends underrepresented in the library.
KevinLam replied

12-15-2009, 11:04 PM
Originally posted by Marta

We assembled lettuce transcriptome using 85 nt IGA single reads. We used CLC and Velvet followed by CAP3.

Are your results published in a paper already? Would love to read it!
krobison replied

12-02-2009, 03:00 PM
While this is not de novo assembly of a novel transcriptome, in some ways it is better because it can be compared against a known transcriptome (which was not used in the assembly, as far as I know):

http://bioinformatics.oxfordjournals.org/cgi/pmidlookup?view=long&pmid=19528083

Bioinformatics. 2009 Nov 1;25(21):2872-7. Epub 2009 Jun 15.
De novo transcriptome assembly with ABySS.
Birol I, Jackman SD, Nielsen CB, Qian JQ, Varhol R, Stazyk G, Morin RD, Zhao Y, Hirst M, Schein JE, Horsman DE, Connors JM, Gascoyne RD, Marra MA, Jones SJ.

Genome Sciences Centre, Vancouver, BC V5Z 4S6, Canada. [email protected]
MOTIVATION: Whole transcriptome shotgun sequencing data from non-normalized samples offer unique opportunities to study the metabolic states of organisms. One can deduce gene expression levels using sequence coverage as a surrogate, identify coding changes or discover novel isoforms or transcripts. Especially for discovery of novel events, de novo assembly of transcriptomes is desirable. RESULTS: Transcriptome from tumor tissue of a patient with follicular lymphoma was sequenced with 36 base pair (bp) single- and paired-end reads on the Illumina Genome Analyzer II platform. We assembled approximately 194 million reads using ABySS into 66 921 contigs 100 bp or longer, with a maximum contig length of 10 951 bp, representing over 30 million base pairs of unique transcriptome sequence, or roughly 1% of the genome. AVAILABILITY AND IMPLEMENTATION: Source code and binaries of ABySS are freely available for download at http://www.bcgsc.ca/platform/bioinfo/software/abyss. Assembler tool is implemented in C++. The parallel version uses Open MPI. ABySS-Explorer tool is implemented in Java using the Java universal network/graph framework. CONTACT: [email protected].

PMID: 19528083
Marta replied

12-02-2009, 07:36 AM
We assembled lettuce transcriptome using 85 nt IGA single reads. We used CLC and Velvet followed by CAP3.
KevinLam replied

12-02-2009, 12:28 AM
So in short, has no one done de novo transcriptome assembly for a new organism before?
Can we use a closely related species (a fish, for example) to help with the de novo assembly?

How about taking it further by doing expression profiling on the new organism?
Melissa replied

05-12-2009, 09:18 PM
Originally posted by jordi

Oh, sorry. I found repetitive elements which are reverse transcriptases, located in the 3' UTR of different genes. How can I differentiate the origin of my blast results?

The only way to tell a 3' UTR is the presence of a polyA tail at the sequence end. Considering our contigs are short, are you sure these are not misassemblies? How long is the repetitive element you found, and what is the similarity?
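
As a rough illustration of the polyA check, here is a minimal Python sketch. It assumes a contig may be in either orientation, so it looks for a run of A at the end or a run of T at the start; the `min_run` threshold of 10 and the function names are arbitrary choices for the example, not established values.

```python
def trailing_run(seq, base):
    """Number of consecutive `base` characters at the end of seq."""
    seq = seq.upper()
    return len(seq) - len(seq.rstrip(base))


def leading_run(seq, base):
    """Number of consecutive `base` characters at the start of seq."""
    seq = seq.upper()
    return len(seq) - len(seq.lstrip(base))


def has_polya_signal(seq, min_run=10):
    """True if seq ends in a polyA run, or starts with a polyT run
    (the same tail seen on the reverse-complemented strand)."""
    return trailing_run(seq, "A") >= min_run or leading_run(seq, "T") >= min_run


# Invented toy contigs, for illustration only.
contigs = {
    "with_tail": "ACGTACGTGC" + "A" * 12,
    "revcomp_tail": "T" * 12 + "GCACGTACGT",
    "no_tail": "ACGTACGTAAAC",
}

for name, seq in contigs.items():
    print(name, has_polya_signal(seq))
```

A contig that fails this check may still cover a 3' UTR whose tail was trimmed or never sequenced, so the result is a hint rather than proof.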

If you are using blast to annotate your contigs, using the 3' UTR is not a good idea because that region can vary even within the same species.

I have used CENSOR to find repeats in my ESTs, but there were no significant hits. Most hits are around 100 bp with 80% similarity (the original genomic repeat is several kb long), and each occurs only once in the ESTs. Maybe plant repeat databases are not well characterized. In the end, I just ignored them.

I found a related thread on repeats at

Masked/Unmasked Reference Genome - SEQanswers

http://seqanswers.com/forums/showthread.php?t=1504

