De Novo Assembly of a transcriptome

  • #31
    We're still waiting for the reads, but we were planning on using the Trinity package from the Broad: http://trinityrnaseq.sourceforge.net/

    Basically, we figure if it's good enough for the Broad, it's good enough for us. But we're green at this and trying to assemble a vertebrate transcriptome, so I'm certainly open to suggestions. Can anyone compare runtimes, memory requirements, and the like for ABySS and the other programs? The Broad suggests 2 GB of memory per million reads, for example, and we expect roughly 100M paired-end 100 bp reads. Do we really need 200 GB of memory? We have access to a cluster that would make that possible, but it sounds like ABySS may run on less: the breast cancer paper reported using 20 nodes with 2 GB each for 194M reads of 36 bp.
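    That 200 GB figure follows directly from the Broad's rule of thumb; a quick sketch of the arithmetic (the 2 GB per million reads value is only their published guideline, so treat the result as a rough upper bound, not a promise):

```python
# Rough peak-RAM estimate using the Broad's guideline of
# ~2 GB of memory per million reads (a rule of thumb only;
# actual usage depends on read length, error rate, and k-mer size).
GB_PER_MILLION_READS = 2

def estimate_memory_gb(n_reads):
    """Approximate peak RAM (GB) for an assembly of n_reads reads."""
    return (n_reads / 1_000_000) * GB_PER_MILLION_READS

# 100M paired-end 100 bp reads, as above:
print(estimate_memory_gb(100_000_000))  # -> 200.0
```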

    Also, for assessing quality, I'd guess the best approach would be to simply compare against the distribution of a related but more fully annotated species. I don't expect that would be easy, however: it requires a large batch-BLAST-type analysis while accounting for sequence divergence and gene duplication/deletion issues. Beyond that, I just don't know how informative these k-mer statistics really are. So you got X contigs bigger than 100 bp, or a maximum of 10 kb; who cares, exactly? Especially when you look through RNA-seq data aligned to a reference genome and see all kinds of signal outside annotated gene regions, even in well-annotated species like mouse. How much of that is genomic contamination, or a kind of "phantom" random transcription of regions that do nothing? Basically, I just want to know how well you covered the ~20K genes in a vertebrate genome. After you show me that, I can start caring about micro-RNAs, or your k-mers.
    Last edited by Wallysb01; 05-10-2011, 09:29 AM.



    • #32
      I've used Velvet + Oases with GAIIx 108 bp paired-end data. We actually mixed in different samples of the same species for the assembly.

      We got pretty good results when comparing against the available ESTs. We didn't put the ESTs into the assembly itself because we weren't sure how good they were. It turned out we found 93% of the full-length ESTs in the assembly.
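      A rough version of that EST check can be scripted: BLAST the ESTs against the assembly in tabular format and count how many are hit over nearly their full length. A sketch (the 90% coverage cutoff and the use of alignment length as a coverage proxy are my own assumptions; it expects standard BLAST tabular `-outfmt 6` columns):

```python
# Count ESTs recovered near full-length in an assembly, from BLAST
# tabular output (e.g. blastn -query ests.fa -db assembly -outfmt 6).
# Columns: qseqid sseqid pident length mismatch gapopen
#          qstart qend sstart send evalue bitscore
def full_length_ests(blast_tab, est_lengths, min_cov=0.90):
    """Return the set of EST ids with a hit whose alignment length
    covers at least min_cov of the EST length."""
    covered = set()
    for line in blast_tab.splitlines():
        if not line.strip():
            continue
        fields = line.split("\t")
        qid, aln_len = fields[0], int(fields[3])
        if aln_len >= min_cov * est_lengths[qid]:
            covered.add(qid)
    return covered

# Toy example: est1 is covered 950/1000 bp, est2 only 300/800 bp.
tab = ("est1\tcontig7\t99.0\t950\t5\t0\t1\t950\t1\t950\t0.0\t1700\n"
       "est2\tcontig3\t98.0\t300\t4\t1\t1\t300\t10\t310\t1e-90\t500")
print(full_length_ests(tab, {"est1": 1000, "est2": 800}))  # {'est1'}
```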

      We also ran blastx locally against NR to try to identify the genes. This took a long, long time; it was by far the slowest step.

      Since we had mixed samples, we used in-house software on the Oases output to extract how many reads were used per transcript for each sample, to get a feel for the variation in expression. This is in no way precise, given that a read can belong to multiple transcripts (isoforms, for example), but it gives insight into the differences between the samples.



      • #33
        Thanks lletourn. We'd have some ESTs available too, though from my initial searching through them, coverage in the EST library is pretty poor. So I think we'd basically be in the same situation: using them for validation, but not for assembly.



        • #34
          Originally posted by dnusol
          Hi,

          is web-based blastx able to digest the full contig output from Velvet or Oases, or is it better to download both BLAST and the UniProt database and work locally?

          Best,

          Dave
          First of all, the UniProt database only covers proteins with conserved sequences/motifs/domains, so you will miss all the HYPOTHETICAL PROTEINS if you work on an organism without an available genome. The best thing about UniProt is that you can get all the related information (Pfam, GO, SignalP, TMHMM...) in a single run.

          Running a local NCBI BLAST is a nightmare unless you have a good computing facility. I would suggest you run blastx through netblast and limit the output to 10 hits or fewer. Using the tabular or XML output will make it easier to parse the results. Your computer should have at least 6-8 GB of memory for netblast.



          • #35
            Thanks Petang,

            do you mean netblast from the Wisconsin Package? How do I download it? I cannot find the download page on the Accelrys site.

            Best,

            Dave
            Last edited by dnusol; 05-11-2011, 12:36 AM.



            • #36
              Very interesting thread. I am collecting information about the best strategy for performing de novo transcriptome assembly of a plant for which we have no reference genome. From what I read here, it seems that most people are going with Illumina rather than 454 reads (which answers my first question, about which NGS technology to use for this task). However, I am still wondering about the following choices:

              - most of the tools mentioned for transcriptome assembly (Rnnotator, Oases, ABySS, Multiple-k) use Velvet internally; the only exception appears to be Trinity, which has its own assembly algorithm. That means those tools can make use of both single-end and paired-end reads. However, there is little information about which of them actually use the pairing information to improve the results (e.g., to detect splice variants). My first question would be: are paired ends a big plus, or are they not worth the extra cost?

              - some tools explicitly state that they work best with strand-specific data (e.g., Trinity). Others mention using it, but do not say whether strand-specific data is mandatory (e.g., Rnnotator). My second question is: should I prefer strand-specific sequencing?

              Best,
              Aurelien



              • #37
                Originally posted by Aurelien Mazurie
                - most tools that are mentioned for the transcriptome assembly (Rnnotator, Oases, ABySS, Multiple-k) use Velvet internally
                ABySS is a separate tool and doesn't use Velvet.

                Originally posted by Aurelien Mazurie
                My first question would be: are paired-ends a big plus, or are they not worth the extra cost?
                In my experience, pairing does make a big difference when assembling. I mainly use Oases, and I seem to identify splice events a lot more accurately with pairs.



                • #38
                  Hi Aurelien

                  I was told that Trinity treats non-strand-specific paired-end reads as if they were single reads.
                  I don't know how Velvet/Oases interpret strand-specific versus non-strand-specific data, but they seem to perform better with paired reads anyway.

                  HTH

                  Dave



                  • #39
                    Originally posted by lletourn
                    Having the mixed samples, we used an in-house software on the oases output to extract how many reads were used per transcript for each sample to get a feel of variation of expression...this is in no way precise given that a read can be in multiple transcripts (isoforms for example) but it gives insight into differences between the samples.
                    This is something I have been wondering: is there any way to come up with an RPKM-like measure of expression when doing de novo transcriptome assembly? Counting the number of reads per contig (cDNA) seems crude, but it is the only approach I can think of. Any better suggestions? Normalizing by the library size (total number of reads), maybe?
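                    The crude version I have in mind would be the standard RPKM normalization applied per contig: reads on the contig, divided by contig length in kilobases and by total mapped reads in millions. A sketch (how the per-contig read counts are obtained, e.g. handling of multi-mapping reads, is left open):

```python
def rpkm(reads_on_contig, contig_len_bp, total_mapped_reads):
    """Reads Per Kilobase of contig per Million mapped reads."""
    return reads_on_contig / ((contig_len_bp / 1e3) *
                              (total_mapped_reads / 1e6))

# 500 reads on a 2 kb contig, out of 10M mapped reads overall:
print(rpkm(500, 2000, 10_000_000))  # -> 25.0
```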

                    Aurelien



                    • #40
                      Aurelien

                      Originally posted by Aurelien Mazurie
                      This is something I am wondering: is there any way to come up with an RPKM-like measure of expression level when doing de novo transcriptome assembly? Counting the number of reads per contig (cDNA) appears to be a crude way, but that's the only one I can think of. Any better suggestion? Normalizing by the library's length (number of reads), maybe?
                      Looks like yes. I'm still getting acquainted with the various programs, but the Trinity package outputs FPKM values for each assembled contig in the FASTA header, and apparently sorts the output by them.

                      I know less about the other programs, since they all seem to require a bit more knowledge than I currently have. Trinity seems pretty much plug-and-chug as long as you have plenty of computing power.



                      • #41
                        We have been using Velvet/Oases for de novo transcriptome assembly of several large eukaryotes. I'm running Trinity tests at the moment, and it seems to need computational resources similar to our current pipeline (we run multiple Velvet assemblies in parallel). I'm hopeful the FPKM values it generates will greatly reduce the time and resources spent on mapping and expression analysis. Does anyone understand whether Trinity's FPKM calculations will be more or less accurate than, say, those from a BWA mapping?



                        • #42
                          I can't comment on Trinity vs. BWA, but using BWA on Oases assemblies has always been problematic for me, since there are hairpins in the assembly, and the isoforms in transcripts.fa need to be filtered out first.

                          I generate the FPKM values using the read-tracking option in Velvet/Oases.

                          With the contig-ordering and LastGraph files you can get the reads per transcript. It's not perfect, because if Oases decides to cut a contig in the final transcript you can't know which reads not to count; I have rarely seen that happen, though.

                          I'll try Trinity and compare.



                          • #43
                            From their advanced guide online:
                            "FPKM_all: expression value for this transcript computed based on all fragment pairs corresponding to this path.
                            FPKM_rel: expression value accounting for fragments that map to multiple reported paths (fragment count is equally divided among paths, yes not optimal... we're working on more advanced methods ala cufflinks to better estimate expression values.)"
                            I guess it's still better to use a mapper plus Cufflinks for now.
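                            The equal-division scheme described there is easy to picture: a fragment that maps to k reported paths contributes 1/k of a count to each. A toy sketch of that bookkeeping (the data structures are my own; Trinity's internals are not exposed like this):

```python
from collections import defaultdict

def split_counts(fragment_hits):
    """fragment_hits: dict mapping fragment id -> list of path ids it
    maps to.  Each fragment contributes 1/k of a count to each of its
    k paths (the FPKM_rel-style equal division quoted above)."""
    counts = defaultdict(float)
    for paths in fragment_hits.values():
        weight = 1.0 / len(paths)
        for path in paths:
            counts[path] += weight
    return dict(counts)

# Two fragments unique to path_a, one shared between path_a and path_b:
hits = {"f1": ["path_a"], "f2": ["path_a"], "f3": ["path_a", "path_b"]}
print(split_counts(hits))  # path_a gets 2.5, path_b gets 0.5
```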



                            • #44
                              ikim,

                              Do you mind sharing what kind of computational power your Velvet/Oases and Trinity assemblies are taking?

                              We're gearing up to start some jobs on a shared campus computer, but they are being kind of fussy about letting us run jobs that might take tens or even hundreds of GB of RAM and run for more than a couple of days. Right now they're basically saying we can have 64 GB of RAM for a few days, or 8 GB of RAM for 14 days. (Makes me wonder why we even have this supercomputer and why they brag about having TBs of RAM.)

                              All ranting aside, do you have any idea whether 64 GB for 3-4 days would be enough? Or, if that's all we can get out of them, which programs might work within those constraints?
                              Last edited by Wallysb01; 06-02-2011, 10:21 AM.



                              • #45
                                Hi. We are running tests on different de novo assemblers, like ABySS and even CLC, both of which are fairly quick (eight hours at most). Now we are running Trinity, which is already taking days.
                                Has anyone done a comparison of the different programs and can comment on (first of all) how long Trinity actually takes, and also which seems to be the best program to use? (I cannot run Velvet, as we don't have enough memory.)

                                We have about 48 million 109 bp Illumina reads.

                                Also, I am having issues with N50, as it seems to be one of the quality measures for assessing de novo assemblies, but mine are really low (my reads were trimmed, and everything is above a quality score of at least 30). What N50 values are good for a de novo assembly?
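                                For reference, N50 is computed by sorting the contig lengths in descending order and walking down until the cumulative length first reaches half the total assembly size; the contig length at that point is the N50. A minimal sketch:

```python
def n50(contig_lengths):
    """Return the length L such that contigs of length >= L together
    contain at least half of the total assembled bases."""
    lengths = sorted(contig_lengths, reverse=True)
    half = sum(lengths) / 2
    running = 0
    for length in lengths:
        running += length
        if running >= half:
            return length

# Toy assembly: total 2900 bp, half is 1450; 1000 + 800 crosses it.
print(n50([1000, 800, 500, 300, 200, 100]))  # -> 800
```

Note that N50 rewards long contigs regardless of correctness, which is part of why it is a weak quality measure on its own.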

