De Novo Assembly of a transcriptome


  • grassgirl
    replied
    Thanks, all, for the replies to my questions and great suggestions!



  • lletourn
    replied
    Originally posted by kmcarr View Post
    I would suggest not BLASTing against nr.
I totally agree. I meant to say: BLAST against protein, not nucleotide.

Having a smaller db will yield more precise results (score-wise) too.



  • kmcarr
    replied
    Originally posted by lletourn View Post
I highly suggest nr, using blastx with your transcripts. Conserved proteins are easier to find this way.
Originally posted by grassgirl
    I have about 6600 isotigs and I'm not sure how to find out how much memory it would take (and that I would request on the cluster) to blast them all to nr. I have heard that I can test a subset (10, 100, 1000), but don't know how to go about doing this. Any suggestions? Should I split my isotig file up before blasting?
I don't know what BLAST setup you have (mpiblast or some home-made solution), but if your BLAST clone doesn't split the query for you, then yes, I would break it up. I usually break it into as many parts as the number of nodes I'm allowed to use.

I've done this with 33,000 assembled transcripts on a 192-node cluster. It took a few days (2-4, I don't remember exactly) to get all the XML results. Basically, I broke the 33k transcripts into 192 parts and ran one per node.
I would suggest not BLASTing against nr. Everyone is tempted to BLAST against the whole universe, but that is not the best idea. BLAST against a reference database matched to your queries and one which isn't highly redundant. I've done lots of de novo plant transcriptome assembly, and I typically run two BLAST jobs on the output: against TAIR and against the green plant subdivision of RefSeq. You should also tweak the BLAST options appropriately for the experiment you are performing (yes, think of running BLAST as performing an experiment, an in silico Northern). The parameters I typically use for BLASTing transcript contigs against a protein database are:

    Code:
    # blastall -p blastx -d <db> -i <contigFile.fasta> -U -f 14 -F "m S" -e 1e-10 -b 20 -v 20 -a <#_cpus>
    
    Adapted from BLAST by Korf, Yandell & Bedell
    
    Sorry that this is the command using the old BLAST toolset.
    If you are using BLAST+, as NCBI is urging people to do, you'll have to translate this to the new command/options.
    The BLAST+ package has a perl script which can do the translation for you.
    Limiting the size of your database, the number of hits reported and adjusting the word threshold will reduce the time of your BLAST job. It's been a while since I've done much BLASTing but I believe that 6,600 isotigs should take < 12 hours on 8 cpus against RefSeq plants. Against TAIR it will take under an hour.
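For reference, here is a rough, untested sketch of what the blastall command above would look like in BLAST+ syntax. The option mapping is my own approximation, not from this thread; double-check it against your BLAST+ version, or let the bundled legacy_blast.pl script do the translation for you:

```shell
# Approximate BLAST+ equivalent of the legacy blastall command
# (placeholders <db>, <contigFile.fasta>, <num_cpus> as above):
blastx -db <db> -query <contigFile.fasta> \
       -lcase_masking -threshold 14 \
       -seg yes -soft_masking true \
       -evalue 1e-10 \
       -num_descriptions 20 -num_alignments 20 \
       -num_threads <num_cpus>
```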

    Also, the memory requirement is independent of the size of your query set. BLAST does not store the query or the results in RAM. The major contributor to RAM consumption is the size of the target database. Here again, sticking to more narrowly targeted DBs will help, but by today's standards of RAM even nr should not be a problem.



  • Jenzo
    replied
    Hi there,
we just did some Velvet/Oases assemblies on several non-normalized 60 bp paired-end libraries, and I would like to share the resources needed:

Set 1 with 101 million reads: up to 38 GB RAM for Velvet, up to 63 GB for Oases with k=25
Set 2 with 87 million reads: up to 25 GB for Velvet, up to 46 GB for Oases, also k=25
Set 3 with 100 million reads: up to 37 GB for Velvet and 27 GB for Oases, k=25

Runtimes were about 4.5 h for Velvet and up to 1 h for Oases.
We explored more k-mers, and the resources needed were smaller for higher k-mer values (as expected).
We also had a set of 454 reads and assembled them with MIRA, which took 13 h and not more than 7 GB of RAM. The N50 value here was about 450 bp for 60k contigs.
In addition, all transcripts have an N50 value of around 670 bp after clustering all sets together (including the 454 contigs).

We plan to do the same assemblies with transAbyss and Trinity as well; I can post the resources needed here if you are interested.



  • lletourn
    replied
    Originally posted by grassgirl View Post
    but was told by a researcher with much experience that the 454 de novo data would assemble better than Illumina because of the long reads.
True, but because 454 gives you fewer reads, you will need more sequencing to capture the transcripts that are expressed at lower levels.

    Originally posted by grassgirl View Post
    Also I was told by Roche that there is no protocol for paired end cDNA libraries because the reads are so long and that it isn't a necessity.
That is very true. If you take into account that transcripts average ~1 kb (your mileage will vary by species), 500 bp reads are long enough. And 454 mates (or pairs, whatever you call them) would be useless, since that protocol basically starts at 3 kb inserts.

    Originally posted by grassgirl View Post
    As for assembly, a fellow researcher runs GS de novo Assembler followed by cap3.
This is very common when using Velvet or other de Bruijn graph assemblers. Sometimes you don't need cap3, since the assemblers do a good enough job on their own.

In your case, with 454 reads, I would suggest an overlap-based assembler like MIRA. MIRA has always given good results with 454 and Sanger-type reads for EST and transcriptome analysis. It also works well with Illumina, but is *very* resource-demanding.

    Originally posted by grassgirl View Post
    I have access to a cluster with blastall and would like to blast to nt or nr.
I highly suggest nr, using blastx with your transcripts. Conserved proteins are easier to find this way.

    Originally posted by grassgirl View Post
    I have about 6600 isotigs and I'm not sure how to find out how much memory it would take (and that I would request on the cluster) to blast them all to nr. I have heard that I can test a subset (10, 100, 1000), but don't know how to go about doing this. Any suggestions? Should I split my isotig file up before blasting?
I don't know what BLAST setup you have (mpiblast or some home-made solution), but if your BLAST clone doesn't split the query for you, then yes, I would break it up. I usually break it into as many parts as the number of nodes I'm allowed to use.

I've done this with 33,000 assembled transcripts on a 192-node cluster. It took a few days (2-4, I don't remember exactly) to get all the XML results. Basically, I broke the 33k transcripts into 192 parts and ran one per node.
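Splitting a query file before a cluster BLAST run, as described above, can be done with a few lines of script. Below is a minimal sketch (my own illustration, not from the thread; function and file names are hypothetical) that round-robins FASTA records into N chunk files, one per node:

```python
# Hypothetical helper: split a FASTA file of assembled transcripts into
# n_parts roughly equal chunks before submitting one BLAST job per node.

def split_fasta(path, n_parts, prefix="chunk"):
    """Round-robin FASTA records into n_parts files named <prefix>_<i>.fasta."""
    # Collect each record (header line plus its sequence lines) as one string.
    records, current = [], []
    with open(path) as fh:
        for line in fh:
            if line.startswith(">") and current:
                records.append("".join(current))
                current = []
            current.append(line)
    if current:
        records.append("".join(current))

    # Distribute records round-robin so every chunk gets a similar count.
    out_paths = ["%s_%d.fasta" % (prefix, i) for i in range(n_parts)]
    handles = [open(p, "w") for p in out_paths]
    for i, rec in enumerate(records):
        handles[i % n_parts].write(rec)
    for h in handles:
        h.close()
    return out_paths
```

Round-robin distribution keeps chunk sizes balanced by record count; if your contig lengths vary wildly, balancing by total bases per chunk would even out per-node runtimes better.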



  • grassgirl
    replied
    Originally posted by Aurelien Mazurie View Post
    I am collecting information about the best strategy to perform de novo transcriptome assembly for a plant for which we have no reference genome. From what I read here it seems that most people are going for Illumina rather than 454 reads (which answers my first question, about which NGS technology should be used for this task).
    I am doing de novo cDNA library assembly on a plant using 454 Junior reads. I am totally new to sequencing, but was told by a researcher with much experience that the 454 de novo data would assemble better than Illumina because of the long reads.

    Also I was told by Roche that there is no protocol for paired end cDNA libraries because the reads are so long and that it isn't a necessity.

    As for assembly, a fellow researcher runs GS de novo Assembler followed by cap3.

    Regarding BLASTING: I have questions regarding the best way to do this with my isotigs. I have access to a cluster with blastall and would like to blast to nt or nr. I have about 6600 isotigs and I'm not sure how to find out how much memory it would take (and that I would request on the cluster) to blast them all to nr. I have heard that I can test a subset (10, 100, 1000), but don't know how to go about doing this. Any suggestions? Should I split my isotig file up before blasting?



  • Celia
    replied
Wallysb01,

Thanks for answering. As soon as Trinity stops running, I will have a look at what you said about the 5,000th and 10,000th contig.

Our computer has an 8-core processor and 16 GB of memory plus 18 GB of swap,
and, as I said before, ABySS always finished overnight (max. 10 hours).



  • Wallysb01
    replied
    Celia,

From my understanding, your N50 for a transcriptome assembly should be pretty low. You're just not going to have large contigs when the average spliced gene is something like 1,500-2,000 base pairs. Then, depending on your RNA extraction and purification methods, you might also have captured microRNAs. To a certain extent you probably have some genomic contamination, and certainly a lot of unspliced pre-mRNAs. Then, what's your coverage on lowly expressed genes? For larger but lowly expressed genes, you probably have a bunch of small contigs. Anyway, I wouldn't worry about the N50 for a transcriptome assembly so much. I'd much rather see what the size of the 5,000th and 10,000th contig is, to get a measure of how "complete" the transcriptome is. And unless you have the whole body of the animal and a variety of ages (including embryonic), I wouldn't expect you to get much more than 10K reasonably well assembled genes, if that.
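The two metrics discussed above are simple to compute from a list of contig lengths. A minimal sketch (my own illustration; the function names and numbers are made up, and `lengths` would come from your assembly's FASTA):

```python
def n50(lengths):
    """N50: the length of the contig at which contigs that long or longer
    cover at least half of the total assembled bases."""
    total = sum(lengths)
    running = 0
    for size in sorted(lengths, reverse=True):
        running += size
        if running * 2 >= total:
            return size
    return 0  # empty assembly

def nth_contig_length(lengths, n):
    """Length of the nth-largest contig (1-based), e.g. n=5000 or n=10000;
    returns 0 if the assembly has fewer than n contigs."""
    ranked = sorted(lengths, reverse=True)
    return ranked[n - 1] if n <= len(ranked) else 0
```

As the post argues, `nth_contig_length(lengths, 5000)` tells you more about transcriptome completeness than `n50(lengths)` does, since a transcriptome's N50 is capped by typical transcript length rather than by assembly quality.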

I can't really help with the other question, though; I'm still behind on the actual "doing" of the analysis. But do you mind sharing what kind of specs (processors/RAM) the computer you were running ABySS on had?



  • Celia
    replied
Hi. We are running tests on different de novo assemblers, like ABySS and even CLC, which are both fairly quick (maximum 8 hours). Now we are running Trinity, which is taking days already.
Has anyone done a comparison of the different programs and can comment on (first of all) how long Trinity actually takes, and also what seems to be the best program to use? (I cannot run Velvet, as we don't have enough memory.)

We have about 48 million 109 bp Illumina reads.


Also, I am having issues with the N50, as it seems to be one quality measure for assessing de novo assemblies, but mine are really low (my reads were trimmed and everything is above at least a quality score of 30). What N50 values are good for a de novo assembly?



  • Wallysb01
    replied
    ikim,

Do you mind sharing what kind of computational power your Velvet/Oases and Trinity assemblies are taking?

We're gearing up to start some jobs on a shared campus computer, but they are being kind of fussy about letting us run jobs that might take many tens or even hundreds of GB of RAM and run for more than a couple of days. Right now they are basically saying we can have either 64 GB of RAM for a few days or 8 GB of RAM for 14 days. (Makes me wonder why we even have this supercomputer and why they brag about having TBs of RAM.)

All ranting aside, do you have any idea if 64 GB for 3-4 days would be enough? Or, if that's all we can get out of them, what programs might work within those constraints?
    Last edited by Wallysb01; 06-02-2011, 10:21 AM.



  • ikim
    replied
From their advanced guide online:
    "FPKM_all: expression value for this transcript computed based on all fragment pairs corresponding to this path.
    FPKM_rel: expression value accounting for fragments that map to multiple reported paths (fragment count is equally divided among paths, yes not optimal... we're working on more advanced methods ala cufflinks to better estimate expression values.)"
Guess it's still better to use a mapper plus Cufflinks for now.



  • lletourn
    replied
I can't comment on Trinity vs. BWA, but using BWA on Oases assemblies has always been problematic for me, since there are hairpins in the assembly, and the isoforms in transcripts.fa need to be filtered out first.

    I generate the FPKM values using the read tracking option from velvet/oases.

With the contig-ordering and LastGraph files you can get the reads per transcript. It's not perfect, because if Oases decides to cut a contig in the final transcript, you can't know which reads not to count. I have rarely seen this happen, though.

    I'll try trinity to compare.



  • ikim
    replied
We have been using Velvet/Oases for de novo transcriptome assembly of several large eukaryotes. I'm running Trinity tests at the moment, and it seems to need computational resources similar to our current pipeline (we run multiple Velvet assemblies in parallel). I'm hopeful the FPKM values generated will greatly reduce the mapping and expression-estimation effort in terms of time and resources. Does anyone understand whether the Trinity FPKM calculations will be more or less accurate than, say, those from a BWA mapping?



  • Wallysb01
    replied
    Aurelien

    Originally posted by Aurelien Mazurie View Post
    This is something I am wondering: is there any way to come up with an RPKM-like measure of expression level when doing de novo transcriptome assembly? Counting the number of reads per contig (cDNA) appears to be a crude way, but that's the only one I can think of. Any better suggestion? Normalizing by the library's length (number of reads), maybe?
Looks like yes. I'm still getting acquainted with the various programs, but the Trinity package outputs FPKM values for each assembled contig in the FASTA header, and apparently sorts the output by them.

    I know less about the other programs given that it appears they all require a little more knowledge than I currently have. Trinity seems pretty much plug and chug so long as you have plenty of computing power.



  • Aurelien Mazurie
    replied
    Originally posted by lletourn View Post
Having the mixed samples, we used in-house software on the Oases output to extract how many reads were used per transcript for each sample, to get a feel for the variation in expression... this is in no way precise, given that a read can be in multiple transcripts (isoforms, for example), but it gives insight into differences between the samples.
    This is something I am wondering: is there any way to come up with an RPKM-like measure of expression level when doing de novo transcriptome assembly? Counting the number of reads per contig (cDNA) appears to be a crude way, but that's the only one I can think of. Any better suggestion? Normalizing by the library's length (number of reads), maybe?
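For what it's worth, the crude read-count idea above is essentially RPKM if you also normalize by contig length. A minimal sketch (my own illustration; the function name is made up), which inherits exactly the multi-mapping imprecision discussed in this thread, since each read is counted once per contig:

```python
def rpkm(reads_on_contig, contig_length_bp, total_mapped_reads):
    """Reads Per Kilobase of contig per Million mapped reads:
    normalizes a raw per-contig read count by contig length (in kb)
    and by library size (in millions of mapped reads)."""
    return (reads_on_contig * 1.0e9) / (contig_length_bp * total_mapped_reads)
```

For example, 100 reads on a 1,000 bp contig in a library of 1 million mapped reads gives an RPKM of 100, so a contig twice as long with the same read count scores half as much.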

    Aurelien

