De Novo Assembly of a transcriptome


  • dnusol
    replied
    Hi Aurelien

    I was told that Trinity would work with non-strand-specific paired-end reads as if they were single reads.
    I don't know about the interpretation of strand-specific versus non-strand-specific data in Velvet/Oases, but they seem to perform better with paired reads anyway.

    HTH

    Dave



  • lletourn
    replied
    Originally posted by Aurelien Mazurie View Post
    - most of the tools mentioned for transcriptome assembly (Rnnotator, Oases, ABySS, Multiple-k) use Velvet internally
    ABySS is a separate tool and doesn't use Velvet.

    Originally posted by Aurelien Mazurie View Post
    My first question would be: are paired-ends a big plus, or are they not worth the extra cost?
    In my experience, when assembling, it does make a big difference. I mainly use Oases, and I seem to identify splice-site events a lot more accurately with pairs.



  • Aurelien Mazurie
    replied
    Very interesting thread. I am collecting information about the best strategy to perform de novo transcriptome assembly for a plant for which we have no reference genome. From what I read here it seems that most people are going for Illumina rather than 454 reads (which answers my first question, about which NGS technology should be used for this task). However, I am still wondering about the following choices:

    - most of the tools mentioned for transcriptome assembly (Rnnotator, Oases, ABySS, Multiple-k) use Velvet internally; the only exception appears to be Trinity, which has its own assembly algorithm. This means those tools can make use of both single- and paired-end reads. However, there is little information about which of these tools actually use pairing information to improve the results (e.g., to detect splice variants). My first question would be: are paired-ends a big plus, or are they not worth the extra cost?

    - some tools explicitly state they work best with strand-specific data (e.g., Trinity). Others mention using it, but do not say whether strand-specific data is mandatory (e.g., Rnnotator). My second question is: should I prefer strand-specific sequencing?

    Best,
    Aurelien



  • dnusol
    replied
    Thanks Petang,

    Do you mean netblast from the Wisconsin Package? How do I download it? I cannot find the download page within Accelrys.

    Best,

    Dave
    Last edited by dnusol; 05-11-2011, 12:36 AM.



  • petang
    replied
    Originally posted by dnusol View Post
    Hi,

    is web-based blastx able to digest a full contig output from velvet or oases, or is it better to download both BLAST and the UniProt database and work locally?

    Best,

    Dave
    First of all, the UniProt database can only provide proteins with a conserved sequence/motif/domain. You will miss all the HYPOTHETICAL PROTEINS if you work on an organism without an available genome. The best thing about UniProt is that you can get all the related information (Pfam, GO, SignalP, TMHMM...) in a single run.

    Running local NCBI BLAST is a nightmare unless you have a good computing facility. I suggest you run netblast (blastx) and limit the output to 10 hits or fewer. Using the tabular or XML output will make the results easier to parse. Your computer should have at least 6-8 GB of memory for netblast.
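    For the parsing step, a minimal sketch that reads BLAST tabular (-outfmt 6) output and keeps at most N hits per query might look like this; the file path and function name are placeholders, and the column positions follow the standard outfmt 6 layout (query, subject, % identity, ..., e-value, bit score):

```python
import csv
from collections import defaultdict

def top_hits(tabular_path, max_hits=10):
    """Parse BLAST tabular (-outfmt 6) output and keep the first
    max_hits hits per query; BLAST lists hits best-first."""
    hits = defaultdict(list)
    with open(tabular_path) as fh:
        for row in csv.reader(fh, delimiter="\t"):
            query, subject = row[0], row[1]
            pct_identity, evalue = float(row[2]), float(row[10])
            if len(hits[query]) < max_hits:
                hits[query].append((subject, pct_identity, evalue))
    return hits
```

    The same loop works on netblast's tabular output as long as the column order matches; for XML output you would use a proper parser instead.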



  • Wallysb01
    replied
    Thanks lletourn, we'd have some ESTs available too, though from my initial searches through them, coverage in the EST library is pretty poor. So I think we'd basically be in the same situation, using them for validation but not assembly.



  • lletourn
    replied
    I've used velvet+oases with GAIIx 108PE data. We actually mixed in different samples of the same species for the assembly.

    We got pretty good results when comparing with available ESTs. We didn't put them in the assembly because we weren't sure how "good" the ESTs were. It turned out we found 93% of the full-length ESTs in the assembly.
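    A recovery check like the 93% figure could be sketched as below, assuming the ESTs were BLASTed against the assembly and reduced to (EST id, alignment length) pairs; the 90% coverage cutoff is an illustrative assumption, not necessarily what was used here:

```python
def fraction_recovered(est_lengths, blast_hits, min_cov=0.9):
    """est_lengths: {est_id: length}. blast_hits: iterable of
    (est_id, alignment_length) pairs from BLASTing the ESTs
    against the assembly. An EST counts as recovered if a single
    hit covers at least min_cov of its length."""
    best = {}
    for est_id, aln_len in blast_hits:
        best[est_id] = max(best.get(est_id, 0), aln_len)
    recovered = sum(1 for est_id, length in est_lengths.items()
                    if best.get(est_id, 0) >= min_cov * length)
    return recovered / len(est_lengths)
```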

    We also used blastx locally against NR to try to identify the genes. This took a long, long time; it was by far the step that took the most time.

    Having the mixed samples, we used in-house software on the oases output to extract how many reads were used per transcript for each sample, to get a feel for the variation in expression. This is in no way precise, given that a read can be in multiple transcripts (isoforms, for example), but it gives insight into differences between the samples.
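    The counting step might look roughly like this sketch, under two assumptions not stated in the post (the actual in-house tool isn't described): reads have been assigned back to transcripts, e.g. by mapping against the assembly, and each read name encodes its sample as a prefix:

```python
from collections import Counter

def reads_per_transcript(assignments, sample_of):
    """assignments: iterable of (read_id, transcript_id) pairs.
    sample_of: function mapping a read id to its sample label.
    Returns a Counter keyed by (sample, transcript). A read placed
    on several isoforms is counted once per transcript, so the
    totals are indicative, not precise, as the post notes."""
    counts = Counter()
    for read_id, transcript in assignments:
        counts[(sample_of(read_id), transcript)] += 1
    return counts
```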



  • Wallysb01
    replied
    We're still waiting for the reads, but we were planning on using the trinity package from the Broad: http://trinityrnaseq.sourceforge.net/

    Basically, we figure if it's good enough for the Broad, it's good enough for us. But we're green at this and trying to assemble a vertebrate transcriptome, so I'm certainly open to suggestions. Can anyone compare runtimes, processing requirements, and the like for ABySS and other programs? The Broad suggests 2 GB of memory per million reads, for example. We expect roughly 100M paired-end 100bp reads. Do we really need 200 GB of memory? We have access to a cluster that would make that possible, but it sounds like ABySS may run on less, with that breast cancer paper saying they used 20 nodes with 2 GB each for 194M reads of 36bp?
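    For what it's worth, the 200 GB figure is just the quoted rule of thumb applied directly:

```python
def estimated_memory_gb(n_reads, gb_per_million_reads=2.0):
    """Rule of thumb quoted above: roughly 2 GB of RAM
    per million reads."""
    return n_reads / 1_000_000 * gb_per_million_reads

print(estimated_memory_gb(100_000_000))  # prints 200.0
```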

    Also, for assessing quality, I'd guess the best way would be to simply compare to the distribution of a related but more fully annotated species. I don't expect that would be easy, however, requiring a large batch-BLAST-type analysis while accounting for sequence divergence and gene duplication/deletion issues. Other than that, I just don't know how telling these kinds of k-mer analyses really are. So you got X contigs bigger than 100bp, or a max of 10kb; who cares, exactly? Especially when you look through RNA-seq data that was aligned to a reference genome and see all kinds of regions coming up outside gene regions, even in well-annotated species like mouse. How much of this is just genomic contamination, or a kind of "phantom" or random transcription of regions that do nothing? Basically, I just want to know how well you covered the ~20K genes in a vertebrate genome. After you show me that, I can start caring about micro-RNAs, or your k-mers.
    Last edited by Wallysb01; 05-10-2011, 09:29 AM.



  • dnusol
    replied
    Hi,

    is web-based blastx able to digest a full contig output from velvet or oases, or is it better to download both BLAST and the UniProt database and work locally?

    Best,

    Dave



  • jmartin127
    replied
    We've tried a few different programs for de novo transcriptome assembly; you can see our paper that came out about a year ago here: http://www.biomedcentral.com/1471-2164/11/663.

    As Marcel pointed out, Oases seems to do a pretty good job with the latest updates. In our paper we introduce Rnnotator, which adds some additional pre/post-processing steps to further improve the assembly. We've been able to assemble a plant transcriptome, but we are still evaluating the result.

    Originally posted by Neil View Post
    Hi all,
    We are planning to perform an mRNA-seq run using the Illumina GAII platform. We are worried about assembling the transcriptome when we get our data back. Most of the RNA-seq papers I read assemble against a reference genome/transcriptome; we don't have either of these! Is there anyone out there who has assembled cDNA short reads de novo? If so, are paired reads as important as they are for genome assembly?
    also, what software would you recommend for this?
    hope someone can help
    best regards
    neil



  • edge
    replied
    Dear petang,

    It's OK...
    No worries about it...
    I'm just not sure whether my idea of identifying alternative splicing variants without using a reference genome sequence is feasible.
    Does it sound logical or not?
    It seems really quite difficult to identify alternative splicing variants without a reference genome sequence.
    Thanks in advance for any advice.



  • Rachel
    replied
    Hi petang. Thanks for your reply. Appreciate it.



  • petang
    replied
    Originally posted by edge View Post
    Hi petang, I'm also just starting my research on RNA-seq.
    Do you have any ideas right now for identifying alternative splice variants from RNA-seq data without a reference genome?

    Still no idea on this.
    Sorry



  • petang
    replied
    In my experience, the number of contigs assembled from 20 million 2x100bp reads varies from 30,000 to 50,000, depending on the complexity of the genome. The first question is how many contigs you want to annotate and what the purpose of your experiment is. If you are aiming at gene discovery, the 10,000 most highly expressed contigs should be good enough. Alternatively, you can choose the long contigs (say, longer than 1000bp).
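    Selecting the long contigs can be sketched as a simple FASTA length filter; the file path and the 1000bp threshold are just placeholders matching the example above:

```python
def contigs_longer_than(fasta_path, min_len=1000):
    """Yield (header, sequence) for contigs of at least min_len
    bases from a FASTA file of assembled contigs."""
    header, seq = None, []
    with open(fasta_path) as fh:
        for line in fh:
            line = line.rstrip()
            if line.startswith(">"):
                if header is not None and sum(map(len, seq)) >= min_len:
                    yield header, "".join(seq)
                header, seq = line[1:], []
            else:
                seq.append(line)
    # don't forget the last record in the file
    if header is not None and sum(map(len, seq)) >= min_len:
        yield header, "".join(seq)
```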

    If you are doing comparative transcriptomics, you can obviously choose the differentially expressed contigs.

    In either case, it is impossible to annotate all contigs without the support of a bioinformatics group.

    The quickest way to annotate the transcripts is blastx against UniProt, then retrieve all the information (Pfam, GO, KEGG, etc.) from the hits. However, you will miss all the conserved hypothetical proteins, which are only available from NCBI. So I would start with blastx against UniProt, then run the contigs without a hit through blastx against NCBI nr.
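    The second stage, collecting the contigs with no UniProt hit for a follow-up search against NCBI nr, might be sketched like this, assuming the standard -outfmt 6 tabular layout with the query id in the first column (file path and function name are illustrative):

```python
def unhit_contigs(all_contig_ids, blast_tab_path):
    """Given every contig id and the BLAST tabular (-outfmt 6)
    results of the UniProt search, return the ids with no hit,
    i.e. the set to send on to an NCBI nr search."""
    hit = set()
    with open(blast_tab_path) as fh:
        for line in fh:
            if line.strip():
                hit.add(line.split("\t")[0])
    return [cid for cid in all_contig_ids if cid not in hit]
```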



  • Rachel
    replied
    Just a quick question on this de novo assembly for transcriptomics: say I have Illumina 2x100bp RNA-seq reads; once they are assembled by a de novo assembler, how do we annotate the transcripts? By doing BLAST or Blast2GO? Is that sufficient?

