  • berath
    replied
    I would like to hear the community's opinions on differential expression analysis of a de novo assembled transcriptome.

    We are studying a non-model organism with no genome sequence information; we have 99-bp single-end Illumina reads and are testing differential expression between two experimental conditions with two biological replicates each.

    For the de novo transcriptome assembly, we used all four lanes and ran both the Velvet/Oases (multi-k) and Trinity packages. Both assembly metrics and biological annotation suggested that Velvet/Oases produced a (slightly) better assembly.

    For the DE analysis, is it better to use an aligner to map the quality-checked reads from each condition to the annotated contigs of the combined assembly (built from all conditions) and calculate RPKM values,

    or

    to construct a separate de novo assembly for each condition, extract the read and fragment counts for each annotated contig, and compare them with the counts for the contig coding the same gene in the other assembly?

    The second approach seems to be built into the Trinity package (as FPKM values per contig); however, as noted earlier in this thread, the authors agree that these values are approximate. I assume the read_trkg and -amos_file options in Velvet would allow similar information to be extracted.
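
    For reference, the first approach boils down to the standard RPKM formula. A minimal Python sketch (the function name and the example numbers are just illustrations, not from any particular package):

    def rpkm(contig_reads, contig_length_bp, total_mapped_reads):
        """Reads per kilobase of contig per million mapped reads."""
        return contig_reads * 1e9 / (contig_length_bp * total_mapped_reads)

    # e.g. 500 reads on a 2,000-bp contig out of 10M mapped reads:
    # rpkm(500, 2000, 10_000_000) -> 25.0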

    Any thoughts?
    Thanks..

  • Bueller_007
    replied
    Why aren't people using STM for combining runs from multiple k values?

  • boetsie
    replied
    I don't think it is a good idea to use SSPACE for merging assemblies. Contigs can of course be joined where read pairs link them, but SSPACE will not merge full assemblies: you will still end up with the combined size of the assemblies from the different k-mers.

    The best way to go is to use a tool designed for merging assemblies, such as Zorro or GAM. Have a look at this thread for a list of these tools:



    Boetsie

    Originally posted by dnusol View Post
    Hi, just some more info on memory use

    velvetg with k-mer 31 and 127M reads peaked at 250 GB of RAM across 18 cores, took half an hour to run, and produced about 320 GB of output data.

    Regarding merging the output from different k-mers, how about Minimus2 or SSPACE?

    HTH,

    D

  • dnusol
    replied
    Hi, just some more info on memory use

    velvetg with k-mer 31 and 127M reads peaked at 250 GB of RAM across 18 cores, took half an hour to run, and produced about 320 GB of output data.

    Regarding merging the output from different k-mers, how about Minimus2 or SSPACE?

    HTH,

    D

  • Jenzo
    replied
    Originally posted by ikim View Post
    For people running multiple k-mers of Velvet, any suggestions on how to combine the assemblies? I used to use vmatch, but its 'nonredundant' setting seems to cluster together much more than just redundant sequences.
    Dear ikim,
    I have also been trying to combine different assemblies and was not satisfied with the results of vmatch and cd-hit-est. For me, assembling all the contigs together with CAP3 or the TIGR assembler works much better than clustering with vmatch or cd-hit-est.
    To get an idea of how redundant my final dataset is, I think I will BLAST it against itself.
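    A minimal sketch of that redundancy check, assuming a tabular self-BLAST report produced beforehand (the file name and thresholds are hypothetical):

    # Assumed prior commands:
    #   makeblastdb -in contigs.fa -dbtype nucl
    #   blastn -query contigs.fa -db contigs.fa -outfmt 6 -out self_hits.tsv
    redundant = set()
    with open("self_hits.tsv") as hits:
        for line in hits:
            query, subject, pct_id, aln_len = line.split("\t")[:4]
            # skip the trivial self-match; flag strong hits elsewhere in the set
            if query != subject and float(pct_id) >= 95 and int(aln_len) >= 200:
                redundant.add(query)
    print(len(redundant), "contigs hit another contig at >=95% identity over >=200 bp")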
    If you have found a good solution for efficient clustering to obtain a nonredundant set of contigs, please let me know :-)
    Best wishes!

  • ikim
    replied
    Originally posted by dnusol View Post
    Hi Apexy, thanks for your input,

    I thought small k-mers would work worse for long reads (105 bp), which is why I chose the 31-45 range.
    Since my last post I have some more news: velvetg peaked at 56 GB of RAM for k-mer 31 and about 40M reads (keep in mind that read_trkg was on, as suggested in the manual, and it seems to be memory-hungry).

    Best,

    David
    My experience is similar: very small k-mer settings for longer reads are far from optimal and take a great deal of resources. My runs generally span k = 31-61, with memory usage between 8 and 28 GB for our typical ~60M 90-bp paired-end reads, and they finish in 5-6 hours.
    A single equivalent Trinity run seems to top out at 68 GB (4 days using 5 processors, 3 days using 8). We set the Butterfly memory allocation to 10 GB, so with -CPU 8 the maximum would have been 80 GB, though it never came close to using that much. Our latest run, on a 150M-read mixed library at -CPU 10, took 5 days.
    Initial annotations suggest a single Trinity run yields better results than a single Velvet/Oases run (N50, assembly size, RefSeq matches, number of CDSs).
    I also like how Trinity's split into three programs allows better handling of recovery runs.
    For people running multiple k-mers of Velvet, any suggestions on how to combine the assemblies? I used to use vmatch, but its 'nonredundant' setting seems to cluster together much more than just redundant sequences.

  • dnusol
    replied
    Hi Mbandi,

    I am setting the k-mer length using velveth's automatic option for multiple k-mers: I first ran velveth and then tried velvetg on just the first k-mer length to assess memory usage, so I still have to run velvetg on the three other k-mers specified. I am not intending to run everything simultaneously, but I do plan to try velvetg on my full set of reads (127M) to gauge memory needs for the future.

    I have already preprocessed my read set and then selected a random subset to reduce its size, but I don't think going below 30% of the full set is a good idea.
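
    (In case it helps anyone, a minimal sketch of that kind of random subsetting via reservoir sampling; the file name and target count are illustrative, and note that the selected records are held in memory:)

    import random

    def subsample_fastq(path, n_keep, seed=1):
        """Uniformly sample n_keep records (4 lines each) from a FASTQ file."""
        random.seed(seed)
        reservoir = []
        with open(path) as handle:
            # zip(handle, handle, handle, handle) yields one 4-line record at a time
            for i, record in enumerate(zip(handle, handle, handle, handle)):
                if i < n_keep:
                    reservoir.append(record)
                else:
                    j = random.randint(0, i)
                    if j < n_keep:
                        reservoir[j] = record
        return reservoir

    # e.g. keep roughly 30% of a 127M-read set:
    # subset = subsample_fastq("reads.fastq", 38_000_000)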

    There is a thread on the Oases user list regarding memory usage that may be of interest to some of you.




    Best,

    David

  • Apexy
    replied
    Hi David,

    Just to add to my previous post: -read_trkg and -amos_file yes were on for Oases. My reads were of varying lengths (min = 30, max = 60). I don't dispute the memory usage you require; I was just amazed, given my limited experience. I will run mine this time on 31M reads and see whether it crashes. Are you running one k at a time? Does velveth precede velvetg immediately (in a script), or do you run them separately? From what I gather, unprocessed reads increase the complexity of the de Bruijn graph and hence the memory footprint. You could also pay a visit to the Velvet and Oases mailing lists and benefit from more experienced hands.

    HTH,

    Mbandi

  • dnusol
    replied
    Hi Apexy, thanks for your input,

    I thought small k-mers would work worse for long reads (105 bp), which is why I chose the 31-45 range.
    Since my last post I have some more news: velvetg peaked at 56 GB of RAM for k-mer 31 and about 40M reads (keep in mind that read_trkg was on, as suggested in the manual, and it seems to be memory-hungry).

    Best,

    David

  • Apexy
    replied
    Hi dnusol,
    I'm not experienced with assembly, but I started running a velveth -> velvetg -> oases pipeline (iterating over k) with 10,601,688 reads (paired and single). The memory constraint was severe and it always crashed. I was advised to abstain from very low k values. I now iterate over 19 <= k <= 29 with only 5 GB of memory allocated to the whole process (although not all of it is used, judging by the job's log file), and it takes 31.55 minutes. I use a 31 GB, 16-processor machine that I share with others. With your 40M reads, you would obviously need more memory; however, I advise you to start with k=19.
    Cheers
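
    For reference, a minimal sketch of that kind of k iteration (the file names, k range, and options are illustrative; velveth can also generate the directories for a whole k range in one call with its m,M,step syntax):

    import subprocess

    READS = "reads.fa"            # hypothetical preprocessed input
    for k in range(19, 31, 2):    # odd k from 19 to 29, one at a time
        outdir = f"velvet_k{k}"
        # hash the reads, build the graph, then assemble transcripts
        subprocess.run(["velveth", outdir, str(k), "-fasta", "-short", READS], check=True)
        subprocess.run(["velvetg", outdir, "-read_trkg", "yes"], check=True)
        subprocess.run(["oases", outdir], check=True)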

  • dnusol
    replied
    Hi, here are my two cents: the idea I am following is to use both Trinity and Velvet/Oases at different k-mers for the de novo transcriptome. I will run both and then assemble the results into a consensus transcriptome. So far I have run Trinity on 127M 105-bp reads (a mix of paired and single, but all used as single-end, since Trinity seems to use pairing information only for mate pairs, not paired reads) on my 24 GB RAM, 8-processor box with no problems on default parameters (I think it took two days or so).

    I am now trying to run Velvet on a subset of those reads (40M mixed single/paired) and am running out of memory, so I am moving to a larger computer. I guess I will also run into problems when it is Oases' turn.

    Best.

  • lletourn
    replied
    For those interested, this just came out:
    Short read Illumina data for the de novo assembly of a non-model snail species transcriptome (Radix balthica, Basommatophora, Pulmonata), and a comparison of assembler performance.

    Background: Until recently, read lengths on the Solexa/Illumina system were too short to reliably assemble transcriptomes without a reference sequence, especially for non-model organisms. However, with read lengths of up to 100 nucleotides available in the current version, an assembly without a reference genome should be possible. For this study we created an EST data set for the common pond snail Radix balthica by Illumina sequencing of a normalized transcriptome, and compared the performance of three different short-read assemblers with respect to the number of contigs, their length, depth of coverage, their quality in various BLAST searches, and the alignment to mitochondrial genes.

    Results: A single sequencing run of a normalized RNA pool resulted in 16,923,850 paired-end reads with a median read length of 61 bases. The assemblies generated by VELVET, OASES, and SeqMan NGEN differed in the total number of contigs, contig length, the number and quality of gene hits obtained by BLAST searches against various databases, and contig performance in the mt genome comparison. While VELVET produced the highest overall number of contigs, a large fraction of these were of small size (< 200 bp) and gave redundant hits in BLAST searches and the mt genome alignment. The best overall contig performance resulted from the NGEN assembly: it produced the second largest number of contigs, which on average were comparable to the OASES contigs but gave the highest number of gene hits in two out of four BLAST searches against different reference databases. A subsequent meta-assembly of the four contig sets resulted in larger contigs, less redundancy, and a higher number of BLAST hits.

    Conclusion: Our results document the first de novo transcriptome assembly of a non-model species using Illumina sequencing data. We show that de novo transcriptome assembly using this approach yields results useful for downstream applications, in particular if a meta-assembly of contig sets is used to increase contig quality. These results highlight the ongoing need for improvements in assembly methodology.

  • Wallysb01
    replied
    Some data for those trying to figure out which programs to run for transcriptome data:

    I tried to run Trinity on one lane from the HiSeq (~100M 105-bp paired-end reads) on a machine with 64 GB of RAM and 4 Xeon processors (though the processors are not the problem), and it crashed after creating all the k-mers in the de Bruijn graph, while trying to build contigs.

    I'll be moving on to ABySS, as it seems to be much more memory-efficient; despite having access to one of the world's largest supercomputers, I can't get more than 64 GB of RAM (which makes me wonder what's so super about it).

  • Apexy
    replied
    Evaluating transcriptome assemblies from k-mer iterations

    Originally posted by blackgore View Post
    How are people evaluating their transcriptome assemblies? The standard N50 assessment can't be that useful, as the goal here isn't exactly to generate a tiny set of huge contigs...?
    Hi,

    A comparative approach was suggested by a user on the Oases mailing list.


    HTH

    Mbandi
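
    (If N50 does stay among your metrics, it is at least cheap to compute from the contig lengths; a minimal sketch:)

    def n50(contig_lengths):
        """Smallest length L such that contigs of length >= L hold half the assembly."""
        total = sum(contig_lengths)
        running = 0
        for length in sorted(contig_lengths, reverse=True):
            running += length
            if running * 2 >= total:
                return length

    # e.g. n50([2, 2, 2, 3, 3, 4, 8, 8]) -> 8  (8 + 8 = 16, half of 32)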

  • panos_ed
    replied
    Originally posted by Celia View Post
    Wallysby01,

    Thanks for answering... As soon as Trinity stops running I will have a look at what you said about the 5,000th and 10,000th contigs.
    Celia,

    I don't know if Trinity is still running, but if it is taking too long at the Butterfly step, you might find interesting this note that I found in the Trinity FAQ.

    They say, however, that this shouldn't be an issue after version 2011-05-19...
