De Novo Assembly of a transcriptome


  • #46
    Celia,

    From my understanding your N50 for transcriptome assembly should be pretty low. You're just not going to have large contigs when the average spliced gene is something like 1500-2000 base pairs. Then, depending on your RNA extraction and purification methods, you might have also captured microRNAs. To a certain extent you probably have some genomic contamination, and certainly a lot of pre-spliced mRNAs. Then, what's your coverage on lowly expressed genes? For larger, but lowly expressed genes, you probably have a bunch of small contigs. Anyway, I wouldn't worry about the N50 for transcriptome assembly so much. I'd much rather see what the size of the 5,000th and 10,000th contig is to get a measure of how "complete" the transcriptome is. And unless you have the whole body of the animal and a variety of ages (including embryonic), I wouldn't expect you to get much more than 10K reasonably well assembled genes, if that.

    I can't really help with the other question though, I'm still behind on the actual "doing" of the analysis, but do you mind sharing what kind of specs the computer you were running ABySS on had for processors/RAM?
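    The "size of the 5,000th and 10,000th contig" check above is easy to script; here is a minimal sketch in plain Python (file names are illustrative):

```python
# Sketch: compute N50 and the length of the contig at given ranks
# (e.g. the 5,000th and 10,000th largest) for an assembly FASTA.

def contig_lengths(fasta_path):
    """Collect contig lengths from a FASTA file."""
    lengths, current = [], 0
    with open(fasta_path) as fh:
        for line in fh:
            if line.startswith(">"):
                if current:
                    lengths.append(current)
                current = 0
            else:
                current += len(line.strip())
    if current:
        lengths.append(current)
    return lengths

def assembly_stats(lengths, ranks=(5000, 10000)):
    """Return (N50, {rank: length of rank-th largest contig or None})."""
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2.0
    running, n50 = 0, 0
    for size in ordered:
        running += size
        if running >= half:   # first contig where cumulative sum passes half
            n50 = size
            break
    at_rank = {r: (ordered[r - 1] if len(ordered) >= r else None)
               for r in ranks}
    return n50, at_rank
```

    For example, assembly_stats(contig_lengths("assembly.fasta"), ranks=(5000, 10000)) reports the N50 alongside the contig lengths at those two ranks.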



    • #47
      Wallysby01,

      Thanks for answering. As soon as Trinity stops running I will have a look at what you said about the 5,000th and 10,000th contig.

      Our computer has an 8-core processor and 16GB of memory plus 18GB of swap.
      As I said before, ABySS always finished overnight (10 hours max).



      • #48
        Originally posted by Aurelien Mazurie View Post
        I am collecting information about the best strategy to perform de novo transcriptome assembly for a plant for which we have no reference genome. From what I read here it seems that most people are going for Illumina rather than 454 reads (which answers my first question, about which NGS technology should be used for this task).
        I am doing de novo cDNA library assembly on a plant using 454 Junior reads. I am totally new to sequencing, but was told by a researcher with much experience that the 454 de novo data would assemble better than Illumina because of the long reads.

        Also I was told by Roche that there is no protocol for paired end cDNA libraries because the reads are so long and that it isn't a necessity.

        As for assembly, a fellow researcher runs GS de novo Assembler followed by cap3.

        Regarding BLASTING: I have questions regarding the best way to do this with my isotigs. I have access to a cluster with blastall and would like to blast to nt or nr. I have about 6600 isotigs and I'm not sure how to find out how much memory it would take (and that I would request on the cluster) to blast them all to nr. I have heard that I can test a subset (10, 100, 1000), but don't know how to go about doing this. Any suggestions? Should I split my isotig file up before blasting?



        • #49
          Originally posted by grassgirl View Post
          but was told by a researcher with much experience that the 454 de novo data would assemble better than Illumina because of the long reads.
          True, but because of the lower number of reads you will need more sequencing to get the transcripts that are less expressed.

          Originally posted by grassgirl View Post
          Also I was told by Roche that there is no protocol for paired end cDNA libraries because the reads are so long and that it isn't a necessity.
          That is very true. If you take into account that transcripts average ~1kb (your mileage will vary by species), 500bp reads are long enough. And 454 mates (or pairs, whatever you call them) would be useless since the protocol basically starts at 3kb inserts.

          Originally posted by grassgirl View Post
          As for assembly, a fellow researcher runs GS de novo Assembler followed by cap3.
          This is very common when using Velvet or other de Bruijn graph assemblers. Sometimes you don't need cap3 since the assemblers do a good enough job.

          In your case, with 454 reads, I would suggest an overlap-based assembler like MIRA. MIRA has always given good results with 454 and Sanger-type reads for ESTs and transcriptome analysis. It also works well with Illumina but is *very* resource-demanding.

          Originally posted by grassgirl View Post
          I have access to a cluster with blastall and would like to blast to nt or nr.
          I highly suggest nr, running blastx with your transcripts. Conserved proteins are easier to find this way.

          Originally posted by grassgirl View Post
          I have about 6600 isotigs and I'm not sure how to find out how much memory it would take (and that I would request on the cluster) to blast them all to nr. I have heard that I can test a subset (10, 100, 1000), but don't know how to go about doing this. Any suggestions? Should I split my isotig file up before blasting?
          I don't know what BLAST setup you have (mpiBLAST or some home-made solution), but if your BLAST clone doesn't split the query for you then yes, I would break it up. I usually break it into the number of nodes I'm allowed to use.

          I've done this with 33,000 assembled transcripts on a 192-node cluster. It took a few days (2-4, I don't remember) to get all the XML results. Basically I broke the 33k transcripts into 192 parts and ran one per node.
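          The split step above can be sketched like this (plain Python; file naming is illustrative, and a round-robin split keeps the chunks roughly even):

```python
# Sketch: split a multi-FASTA query file into n_parts roughly equal
# chunks (round-robin over records), one chunk per cluster node.

def read_fasta(path):
    """Yield (header, sequence) records from a FASTA file."""
    header, seq = None, []
    with open(path) as fh:
        for line in fh:
            line = line.rstrip("\n")
            if line.startswith(">"):
                if header is not None:
                    yield header, "".join(seq)
                header, seq = line, []
            elif line:
                seq.append(line)
    if header is not None:
        yield header, "".join(seq)

def split_round_robin(records, n_parts):
    """Distribute records over n_parts chunks, round-robin."""
    chunks = [[] for _ in range(n_parts)]
    for i, rec in enumerate(records):
        chunks[i % n_parts].append(rec)
    return chunks

def write_chunks(chunks, prefix="query_part"):
    """Write each chunk as prefix_NNN.fasta; returns the file names."""
    names = []
    for i, chunk in enumerate(chunks):
        name = "%s_%03d.fasta" % (prefix, i)
        with open(name, "w") as out:
            for header, seq in chunk:
                out.write(header + "\n" + seq + "\n")
        names.append(name)
    return names
```

          Each resulting file can then be submitted as one BLAST job per node.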



          • #50
            Hi there,
            we just did some Velvet/Oases assemblies on several non-normalized 60bp PE libraries and I would like to share the resources needed:

            Set 1 with 101 million reads: up to 38 GB RAM for Velvet, up to 63 GB for Oases with k=25
            Set 2 with 87 million reads: up to 25 GB for Velvet, up to 46 GB for Oases, also k=25
            Set 3 with 100 million reads: up to 37 GB for Velvet and 27 GB for Oases, k=25

            Runtimes were up to 4.5 h for Velvet and up to 1 h for Oases.
            We explored more k-mers, and the resources needed were smaller for higher k-mers (as expected).
            We also had a set of 454 reads and assembled them with MIRA, which took 13 h and not more than 7 GB of RAM. The N50 value here was about 450 bp for 60k contigs.
            In addition, all transcripts have an N50 value around 670 bp after clustering all sets (together with the 454 contigs).

            We plan to do the same assemblies with Trans-ABySS and Trinity as well; I can post the resources needed here if you are interested.



            • #51
              Originally posted by lletourn View Post
              I highly suggest nr, with blastx with your transcripts. Conserved proteins are easier to find this way.
              Originally Posted by grassgirl
              I have about 6600 isotigs and I'm not sure how to find out how much memory it would take (and that I would request on the cluster) to blast them all to nr. I have heard that I can test a subset (10, 100, 1000), but don't know how to go about doing this. Any suggestions? Should I split my isotig file up before blasting?
              I don't know what blast setup you have (mpiblast or some home made solution), but if your blast clone doesn't split the query for you then yes I would break it up. I usually break it in the amount of nodes I'm allowed, or that I can use.

              I've done this with 33,000 assembled transcripts on a 192 node cluster. It took a few days (2-4 I don't remember) to get all the xml results. Basically I broke down the 33k transcripts in 192 parts. And ran 1 per node.
              I would suggest not BLASTing against nr. Everyone is tempted to BLAST against the whole universe, but that is not the best idea. BLAST against a reference database matched to your queries, and one which isn't highly redundant. I've done lots of de novo plant transcriptome assembly, and I typically run two BLAST jobs on the output: against TAIR and against the green plant subdivision of RefSeq.

              You should also tweak the BLAST options appropriately for the experiment you are performing (yes, think of running BLAST as performing an experiment, an in silico Northern). The parameters I typically use for BLASTing transcript contigs against a protein database are:

              Code:
              # blastall -p blastx -d <db> -i <contigFile.fasta> -U -f 14 -F "m S" -e 1e-10 -b 20 -v 20 -a <#_cpus>
              
              Adapted from BLAST by Korf, Yandell & Bedell
              
              Sorry that this is the command using the old BLAST toolset.
              If you are using BLAST+, as NCBI is urging people to do, you'll have to translate this to the new command/options.
              The BLAST+ package has a perl script which can do the translation for you.
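              For reference, a possible BLAST+ equivalent of the command above (based on the legacy_blast.pl option mapping; verify each option against your BLAST+ version before relying on it):

              Code:
              # blastx -query <contigFile.fasta> -db <db> -lcase_masking -threshold 14 -seg yes -soft_masking true -evalue 1e-10 -num_alignments 20 -num_descriptions 20 -num_threads <#_cpus>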
              Limiting the size of your database, limiting the number of hits reported, and adjusting the word threshold will all reduce the time of your BLAST job. It's been a while since I've done much BLASTing, but I believe that 6,600 isotigs should take < 12 hours on 8 CPUs against RefSeq plants. Against TAIR it will take under an hour.

              Also, the memory requirement is independent of the size of your query set. BLAST does not store the query or the results in RAM. The major contributor to RAM consumption is the size of the target database. Here again, sticking to more narrowly targeted DBs will help, but by today's standards of RAM even nr should not be a problem.



              • #52
                Originally posted by kmcarr View Post
                I would suggest not BLASTing against nr.
                Totally agree. What I meant was: BLAST against protein, not nucleotide.

                Having a smaller db will yield more precise results (score wise) too.



                • #53
                  Thanks, all, for the replies to my questions and great suggestions!



                  • #54
                    Originally posted by Celia View Post
                    Wallysby01,

                    thanks for answering...as soon as trinity stops running I will have a look at what you said about the 5000 and 10000th contig.
                    Celia,

                    I don't know if Trinity is still running, but if it is taking too long at the Butterfly step, then you might find this note from the Trinity FAQ interesting.

                    They, however, say this shouldn't be an issue after version 2011-05-19...



                    • #55
                      Evaluating transcriptome assembly from k-mer iterations

                      Originally posted by blackgore View Post
                      How are people evaluating their transcriptome assemblies? The standard N50 assessment can't be that useful, as the goal here isn't exactly to generate a tiny set of huge contigs...?
                      Hi,

                      A comparative approach was suggested by a user on the oases mailing list.
                      http://listserver.ebi.ac.uk/pipermai...ry/000008.html

                      HTH

                      Mbandi



                      • #56
                        Some data for those trying to figure out which programs to run for transcriptome data:

                        I tried to run Trinity on 1 lane from the HiSeq, ~100M 105bp paired-end reads, on a machine with 64 GB of RAM and 4 Xeon processors (though the processors are not the problem), and it crashed after creating all the k-mers in the de Bruijn graph and then trying to create contigs.

                        I'll be moving on to ABySS, as it seems to be much more memory-efficient. Despite having access to one of the world's largest supercomputers, I can't get more than 64 GB of RAM (makes me wonder what's so super about it).



                        • #57
                          For those interested this just came out:
                          Short read Illumina data for the de novo assembly of a non-model snail species transcriptome (Radix balthica, Basommatophora, Pulmonata), and a comparison of assembler performance.
                          http://www.biomedcentral.com/1471-2164/12/317/abstract



                          • #58
                            Hi, here are my two cents: my plan is to use Trinity and then Velvet/Oases with different k-mers for de novo transcriptome assembly. I will run both and then merge the results to create a consensus transcriptome. So far I have run Trinity on 127M 105bp reads (mixed paired/single, but used as single-end, since Trinity seems to use pairing information only for mate-pairs, not paired reads) on my 24GB RAM, 8-processor box and had no problems with default parameters (I think it took two days or so).

                            I am now trying to run Velvet on a subset of those (40M mixed single/paired) and am running out of memory, so I am trying a larger computer. I guess I will also run into problems when Oases comes.

                            Best.



                            • #59
                              Hi dnusol,
                              I'm not experienced with assembly, but I started running a velveth -> velvetg -> oases pipeline (iterating over k) with 10,601,688 reads (paired and single). The memory constraint was profound and it always crashed. I was advised to abstain from very low k values. I do my iterations over 19 <= k <= 29 with only 5 GB of memory allocated to the whole process (although not all of it is used, judging by the log file for the job), and it takes 31.55 minutes. I use a 31 GB, 16-processor machine which I share with others. With your 40M reads, it is obvious you would need more memory. However, I advise you to start with k=19.
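                              A sweep like this (velveth -> velvetg -> oases over a range of odd k values) can be driven from a short script; here is a sketch in Python (tool names are assumed to be on your PATH, file names are illustrative):

```python
# Sketch: build the velveth -> velvetg -> oases command lines for a
# sweep of odd k values; run each with subprocess.call if the Velvet
# and Oases binaries are installed.

def kmer_sweep_commands(reads="reads.fa", k_min=19, k_max=29):
    """One (velveth, velvetg, oases) command triple per odd k."""
    jobs = []
    for k in range(k_min, k_max + 1, 2):
        outdir = "oases_k%d" % k
        jobs.append((
            ["velveth", outdir, str(k), "-fasta", "-short", reads],
            ["velvetg", outdir, "-read_trkg", "yes"],  # Oases needs read tracking
            ["oases", outdir],
        ))
    return jobs

# Example: print the commands instead of running them.
for velveth_cmd, velvetg_cmd, oases_cmd in kmer_sweep_commands():
    for cmd in (velveth_cmd, velvetg_cmd, oases_cmd):
        print(" ".join(cmd))
```

                              Each k value gets its own output directory, so the per-k assemblies can be compared or merged afterwards.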
                              Cheers



                              • #60
                                Hi Apexy, thanks for your input,

                                I thought small k-mers would work worse for long reads (105bp), which is why I chose the 31-45 range.
                                Since my last post I have some more news: velvetg peaked at 56GB RAM for k-mer 31 and about 40M reads (keep in mind read_trkg was on, as suggested in the manual, which seems to be memory-hungry).

                                Best,

                                David
