De Novo Assembly of a transcriptome


  • dnusol
    replied
    Hi Aurelien

    I was told that Trinity would work with non-strand-specific paired-end reads as if they were single reads.
    I don't know about the interpretation of strand-specific versus non-strand-specific data in Velvet/Oases, but they seem to perform better with paired reads anyway.

    HTH

    Dave



  • lletourn
    replied
    Originally posted by Aurelien Mazurie View Post
    - most of the tools mentioned for transcriptome assembly (Rnnotator, Oases, ABySS, Multiple-k) use Velvet internally
    ABySS is a separate tool and doesn't use Velvet.

    Originally posted by Aurelien Mazurie View Post
    My first question would be: are paired-ends a big plus, or are they not worth the extra cost?
    In my experience, when assembling, it does make a big difference. I mainly use Oases, and I seem to identify splice-site events a lot more accurately with pairs.



  • Aurelien Mazurie
    replied
    Very interesting thread. I am collecting information about the best strategy to perform de novo transcriptome assembly for a plant for which we have no reference genome. From what I read here it seems that most people are going for Illumina rather than 454 reads (which answers my first question, about which NGS technology should be used for this task). However, I am still wondering about the following choices:

    - most of the tools mentioned for transcriptome assembly (Rnnotator, Oases, ABySS, Multiple-k) use Velvet internally; the only exception appears to be Trinity, which has its own assembly algorithm. This means those tools can make use of both single- and paired-end reads. However, there is little information about which of these tools actually use pairing information to improve the results (e.g., to detect splice variants). My first question would be: are paired-ends a big plus, or are they not worth the extra cost?

    - some tools explicitly state they work best with strand-specific data (e.g., Trinity). Others mention using it, but do not say whether strand-specific data is mandatory (e.g., Rnnotator). My second question is: should I prefer strand-specific sequencing?

    Best,
    Aurelien



  • dnusol
    replied
    Thanks Petang,

    Do you mean netblast from the Wisconsin Package? How do I download it? I cannot find the download page within Accelrys.

    Best,

    Dave
    Last edited by dnusol; 05-11-2011, 12:36 AM.



  • petang
    replied
    Originally posted by dnusol View Post
    Hi,

    is web-based blastx able to digest a full contig output from velvet or oases, or is it better to download both BLAST and the UniProt database and work locally?

    Best,

    Dave
    First of all, the UniProt database can only provide proteins with a conserved sequence/motif/domain. You will miss all the HYPOTHETICAL PROTEINS if you work on an organism without an available genome. The best thing about UniProt is that you can get all the related information (Pfam, GO, SignalP, TMHMM...) in a single run.

    Running local NCBI BLAST is a nightmare unless you have a good computing facility. I suggest you run netblast (blastx) and limit the output to 10 hits or fewer. Using the tabular or XML output will make the results easier to parse. Your computer should have at least 6-8 GB of memory for netblast.
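    For the parsing step, a minimal sketch that reads BLAST tabular (-outfmt 6) output and keeps at most N hits per query might look like this; the file path and function name are placeholders, and the column positions follow the standard outfmt 6 layout (query, subject, % identity, ..., e-value, bit score):

```python
import csv
from collections import defaultdict

def top_hits(tabular_path, max_hits=10):
    """Parse BLAST tabular (-outfmt 6) output and keep the first
    max_hits hits per query; BLAST lists hits best-first."""
    hits = defaultdict(list)
    with open(tabular_path) as fh:
        for row in csv.reader(fh, delimiter="\t"):
            query, subject = row[0], row[1]
            pct_identity, evalue = float(row[2]), float(row[10])
            if len(hits[query]) < max_hits:
                hits[query].append((subject, pct_identity, evalue))
    return hits
```

    The same loop works on netblast's tabular output as long as the column order matches; for XML output you would use a proper parser instead.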



  • Wallysb01
    replied
    Thanks lletourn, we'd have some ESTs available too, though from my initial searches through them, coverage in the EST library is pretty poor. So I think we'd basically be in the same situation, using them for validation but not assembly.



  • lletourn
    replied
    I've used velvet+oases with GAIIx 108PE data. We actually mixed in different samples of the same species for the assembly.

    We got pretty good results when comparing with available ESTs. We didn't put them in the assembly because we weren't sure how "good" the ESTs were. It turned out we found 93% of the full-length ESTs in the assembly.
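    A recovery check like the 93% figure could be sketched as below, assuming the ESTs were BLASTed against the assembly and reduced to (EST id, alignment length) pairs; the 90% coverage cutoff is an illustrative assumption, not necessarily what was used here:

```python
def fraction_recovered(est_lengths, blast_hits, min_cov=0.9):
    """est_lengths: {est_id: length}. blast_hits: iterable of
    (est_id, alignment_length) pairs from BLASTing the ESTs
    against the assembly. An EST counts as recovered if a single
    hit covers at least min_cov of its length."""
    best = {}
    for est_id, aln_len in blast_hits:
        best[est_id] = max(best.get(est_id, 0), aln_len)
    recovered = sum(1 for est_id, length in est_lengths.items()
                    if best.get(est_id, 0) >= min_cov * length)
    return recovered / len(est_lengths)
```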

    We also used blastx locally against NR to try to identify the genes. This took a long, long time; it was by far the step that took the most time.

    Having the mixed samples, we used in-house software on the oases output to extract how many reads were used per transcript for each sample, to get a feel for the variation in expression. This is in no way precise, given that a read can be in multiple transcripts (isoforms, for example), but it gives insight into differences between the samples.
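    The counting step might look roughly like this sketch, under two assumptions not stated in the post (the actual in-house tool isn't described): reads have been assigned back to transcripts, e.g. by mapping against the assembly, and each read name encodes its sample as a prefix:

```python
from collections import Counter

def reads_per_transcript(assignments, sample_of):
    """assignments: iterable of (read_id, transcript_id) pairs.
    sample_of: function mapping a read id to its sample label.
    Returns a Counter keyed by (sample, transcript). A read placed
    on several isoforms is counted once per transcript, so the
    totals are indicative, not precise, as the post notes."""
    counts = Counter()
    for read_id, transcript in assignments:
        counts[(sample_of(read_id), transcript)] += 1
    return counts
```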



  • Wallysb01
    replied
    We're still waiting for the reads, but we were planning on using the trinity package from the Broad: http://trinityrnaseq.sourceforge.net/

    Basically, we figure if it's good enough for the Broad, it's good enough for us. But we're green at this and trying to assemble a vertebrate transcriptome, so I'm certainly open to suggestions. Can anyone compare runtimes, processing requirements, and the like for ABySS and other programs? The Broad suggests 2 GB of memory per million reads, for example. We expect roughly 100M paired-end 100bp reads. Do we really need 200 GB of memory? We have access to a cluster that would make that possible, but it sounds like ABySS may run on less, with that breast cancer paper saying they used 20 nodes with 2 GB each for 194M reads of 36bp?
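    For what it's worth, the 200 GB figure is just the quoted rule of thumb applied directly:

```python
def estimated_memory_gb(n_reads, gb_per_million_reads=2.0):
    """Rule of thumb quoted above: roughly 2 GB of RAM
    per million reads."""
    return n_reads / 1_000_000 * gb_per_million_reads

print(estimated_memory_gb(100_000_000))  # prints 200.0
```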

    Also, for assessing quality, I'd guess the best way would be to simply compare to the distribution of a related but more fully annotated species. I don't expect that would be easy, however, requiring a large batch-BLAST-type analysis while accounting for sequence divergence and gene duplication/deletion issues. Other than that, I just don't know how telling these kinds of k-mer analyses really are. So you got X contigs bigger than 100bp, or a max of 10kb; who cares, exactly? Especially when you look through RNA-seq data that was aligned to a reference genome and see all kinds of regions coming up outside gene regions, even in well-annotated species like mouse. How much of this is just genomic contamination, or a kind of "phantom" or random transcription of regions that do nothing? Basically, I just want to know how well you covered the ~20K genes in a vertebrate genome. After you show me that, I can start caring about micro-RNAs, or your k-mers.
    Last edited by Wallysb01; 05-10-2011, 09:29 AM.



  • dnusol
    replied
    Hi,

    is web-based blastx able to digest a full contig output from velvet or oases, or is it better to download both BLAST and the UniProt database and work locally?

    Best,

    Dave



  • jmartin127
    replied
    We've tried a few different programs for de novo transcriptome assembly; you can see our paper that came out about a year ago here: http://www.biomedcentral.com/1471-2164/11/663.

    As Marcel pointed out, Oases seems to do a pretty good job with the latest updates. In our paper we introduce Rnnotator, which adds some additional pre/post-processing steps to further improve the assembly. We've been able to assemble a plant transcriptome, but we are still evaluating the result.

    Originally posted by Neil View Post
    Hi all,
    We are planning to perform an mRNA-seq run using the Illumina GAII platform. We are worried about assembling the transcriptome when we get our data back. Most of the RNA-seq papers I read assemble against a reference genome/transcriptome; we don't have either of these! Is there anyone out there who has assembled cDNA short reads de novo? If so, are paired reads as important as they are for genome assembly?
    also, what software would you recommend for this?
    hope someone can help
    best regards
    neil



  • edge
    replied
    Dear petang,

    It's OK...
    No worries about it...
    I'm just not sure whether my idea of identifying alternative splicing variants without using a reference genome sequence is feasible.
    Does it sound logical or not?
    It seems really quite difficult to identify alternative splicing variants without a reference genome sequence.
    Thanks in advance for any advice.



  • Rachel
    replied
    Hi petang. Thanks for your reply. Appreciate it.



  • petang
    replied
    Originally posted by edge View Post
    Hi petang, I'm also just starting my research on RNA-seq.
    Do you have any ideas right now for identifying alternative splice variants from RNA-seq data without a reference genome?

    Still no idea on this.
    Sorry



  • petang
    replied
    In my experience, the number of contigs assembled from 20 million 2x100bp reads varies from 30,000 to 50,000, depending on the complexity of the genome. The first question is how many contigs you want to annotate and what the purpose of your experiment is. If you are aiming at gene discovery, the 10,000 most highly expressed contigs should be good enough. Alternatively, you can choose the long contigs (say, longer than 1000bp).
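    Selecting the long contigs can be sketched as a simple FASTA length filter; the file path and the 1000bp threshold are just placeholders matching the example above:

```python
def contigs_longer_than(fasta_path, min_len=1000):
    """Yield (header, sequence) for contigs of at least min_len
    bases from a FASTA file of assembled contigs."""
    header, seq = None, []
    with open(fasta_path) as fh:
        for line in fh:
            line = line.rstrip()
            if line.startswith(">"):
                if header is not None and sum(map(len, seq)) >= min_len:
                    yield header, "".join(seq)
                header, seq = line[1:], []
            else:
                seq.append(line)
    # don't forget the last record in the file
    if header is not None and sum(map(len, seq)) >= min_len:
        yield header, "".join(seq)
```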

    If you are doing comparative transcriptomics, you can obviously choose the differentially expressed contigs.

    In either case, it is impossible to annotate all contigs without the support of a bioinformatics group.

    The quickest way to annotate the transcripts is blastx against UniProt, then retrieve all the information (Pfam, GO, KEGG, etc.) from the hits. However, you will miss all the conserved hypothetical proteins, which are only available from NCBI. So I would start with blastx against UniProt, then run the contigs without a hit through blastx against NCBI nr.
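    The second stage, collecting the contigs with no UniProt hit for a follow-up search against NCBI nr, might be sketched like this, assuming the standard -outfmt 6 tabular layout with the query id in the first column (file path and function name are illustrative):

```python
def unhit_contigs(all_contig_ids, blast_tab_path):
    """Given every contig id and the BLAST tabular (-outfmt 6)
    results of the UniProt search, return the ids with no hit,
    i.e. the set to send on to an NCBI nr search."""
    hit = set()
    with open(blast_tab_path) as fh:
        for line in fh:
            if line.strip():
                hit.add(line.split("\t")[0])
    return [cid for cid in all_contig_ids if cid not in hit]
```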



  • Rachel
    replied
    Just a quick question on this de novo assembly for transcriptomics: say I have Illumina 2x100bp RNA-seq reads; once they are assembled by a de novo assembler, how do we annotate the transcripts? By doing BLAST or Blast2GO? Is that sufficient?

