De Novo Assembly of a transcriptome


  • #61
    Hi David,

    Just to add to my previous post: -read_trkg and -amos_file were both set to yes for Oases. My reads were of varying lengths (min=30, max=60). I do not disagree with the memory usage you require, but I was amazed given my limited experience. I will run mine on 31M reads this time and see if it crashes. Are you running one k at a time? Does velveth immediately precede velvetg (in a script), or do you run them separately? From what I gather, unprocessed reads increase the complexity of the de Bruijn graph and therefore the memory footprint. You can also post on the Velvet and Oases mailing lists and benefit from more experienced hands.

    HTH,

    Mbandi

    Comment


    • #62
      hi Mbandi,

      I am setting the kmer length using the automatic option for multiple kmers in velveth. I first ran velveth and then tried only the first kmer length with velvetg to assess memory usage, so I still have to run velvetg on the three other kmers specified. I am not intending to run everything simultaneously, but I do plan to try velvetg on my full set of reads (127M) to test memory needs for the future.
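For anyone scripting the same workflow, here is a minimal Python sketch of running velvetg one k at a time after a multi-k velveth run. It assumes velveth was called with the `m,M,s` hash-length form (M not included) and that it wrote one directory per k named `<prefix>_<k>`; the prefix `asm` and the values 31,45,4 are hypothetical, chosen only because they give four k-mers in the 31-45 range. The sketch only builds the command lines, it does not run anything.

```python
def kmer_series(k_min, k_max, step):
    """Hash lengths from k_min up to, but not including, k_max."""
    if k_min % 2 == 0 or step % 2 != 0:
        raise ValueError("use an odd k_min and an even step so every k stays odd")
    return list(range(k_min, k_max, step))


def velvetg_commands(prefix, k_values, read_tracking=True):
    """Build one velvetg command line per k-mer directory.

    Assumption: velveth created directories named '<prefix>_<k>'.
    """
    commands = []
    for k in k_values:
        cmd = ["velvetg", f"{prefix}_{k}"]
        if read_tracking:
            # Read tracking is memory-hungry, so it is easy to switch off here.
            cmd += ["-read_trkg", "yes"]
        commands.append(cmd)
    return commands


ks = kmer_series(31, 45, 4)  # hypothetical range and step
for cmd in velvetg_commands("asm", ks):
    print(" ".join(cmd))
```

Each command list could be passed to `subprocess.run` in turn, so that only one k-mer graph is held in memory at any time.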

      I already preprocessed my set and then selected a random subset to reduce size, but I don't think going down below 30% of my full set is a good idea.

      There is a thread on Oases user-list regarding memory usage that may be of interest to someone.

      http://listserver.ebi.ac.uk/pipermai...ne/000190.html


      Best,

      David

      Comment


      • #63
        Originally posted by dnusol View Post
        Hi Apexy, thanks for your input,

        I thought small kmers would work worse for long reads (105bp) that is why I chose in the 31-45 range.
        Since my last post I got some more news: velvetg peaked at 56Gb RAM for kmer 31 and about 40M reads (keep in mind read_trkg was on, as suggested in the manual, which seems to be memory-hungry).

        Best,

        David
        My experience is similar: very small kmer settings for longer reads are far from optimal and take a great deal of resources. My runs are generally 31-61 mer. Memory usage is between 8 and 28 GB for our typical ~60M, 90bp paired-end reads, and runs take 5-6 hours.
        A single equivalent run of Trinity seems to top out at 68 GB (4 days to run using 5 processors, 3 days using 8). We set the butterfly memory allocation to 10GB, so when run with -CPU 8 the max memory would have been 80GB, though it never got to using that much. Our latest 150M mixed library run at CPU 10 took 5 days.
        Initial annotations suggest a single Trinity run yields better results than one Velvet/Oases run (n50 size, assembly size, refseq matches, cds numbers).
        I'm liking how Trinity being three programs allows better handling of recovery runs.
        For ppl running mult-kmers of Velvet, any suggestions on how to combine the assemblies? I used to use vmatch but it seems that their 'nonredundant' setting clusters together much more than just nonredundant data.

        Comment


        • #64
          Originally posted by ikim View Post
          For ppl running mult-kmers of Velvet, any suggestions on how to combine the assemblies? I used to use vmatch but it seems that their 'nonredundant' setting clusters together much more than just nonredundant data.
          Dear ikim,
          I have also tried to combine different assemblies and was not satisfied with the results of vmatch and cd-hit-est. For me, assembling all the contigs with cap3 or tigr works much better than clustering with VMatch or cd-hit-est.
          To get an idea of how redundant my final dataset is, I think I will blast it against itself.
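A rough Python sketch of how the self-BLAST idea could be scored, assuming the search was saved in the default 12-column tabular format (query, subject, percent identity, alignment length, ...). The thresholds, contig names and the `contig_lengths` dictionary are hypothetical. A contig is flagged as redundant when a hit to a different contig covers most of its length at high identity.

```python
def redundant_contigs(blast_lines, contig_lengths, min_identity=95.0, min_coverage=0.9):
    """Return query contigs that are largely contained in another contig.

    blast_lines    -- lines of tab-separated BLAST output (default 12 columns)
    contig_lengths -- dict mapping contig ID to its length in bp
    """
    redundant = set()
    for line in blast_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        query, subject = fields[0], fields[1]
        if query == subject:
            continue  # every contig trivially hits itself
        identity = float(fields[2])
        # Alignment length includes gaps, so this is only an approximate coverage.
        coverage = int(fields[3]) / contig_lengths[query]
        if identity >= min_identity and coverage >= min_coverage:
            redundant.add(query)
    return redundant


hits = [
    "c1\tc1\t100.00\t500\t0\t0\t1\t500\t1\t500\t0.0\t900",
    "c1\tc2\t98.50\t480\t7\t0\t1\t480\t21\t500\t0.0\t850",
    "c2\tc1\t98.50\t480\t7\t0\t21\t500\t1\t480\t0.0\t850",
    "c3\tc3\t100.00\t300\t0\t0\t1\t300\t1\t300\t0.0\t550",
]
lengths = {"c1": 500, "c2": 1000, "c3": 300}
flagged = redundant_contigs(hits, lengths)
print(sorted(flagged), f"{len(flagged) / len(lengths):.0%} of contigs")
```

Here c1 is flagged because 96% of it aligns to c2, while c2 is kept because only 48% of it is covered. Note that two near-identical contigs of equal length would both be flagged, so the fraction is an upper bound on what could safely be dropped.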
          If you got a good solution for efficient clustering to gain a nonredundant set of contigs, please let me know :-)
          Best wishes!

          Comment


          • #65
            Hi, just some more info on memory use

            velvetg k-mer 31 with 127M reads peaked at 250Gb RAM for 18 cores, took half an hour to run, and produced about 320Gb of output data.

            Regarding merging output from different kmers, how about Minimus2 or SSPACE?

            HTH,

            D

            Comment


            • #66
              I don't think it is a good idea to use SSPACE for merging assemblies. Of course contigs can be combined if pairs can be found, however it will not merge full assemblies. You will still end up with the initial size of the total assembly of different k-mers.

              Best way to go is using a tool that merges assemblies like Zorro or GAM. Have a look at this thread for a list of these tools;

              http://seqanswers.com/forums/showthr...ighlight=zorro

              Boetsie

              Originally posted by dnusol View Post
              Hi, just some more info on memory use

              velvetg k-mer 31 with 127M reads peaked at 250Gb RAM for 18 cores, took half an hour to run, and produced about 320Gb of output data.

              Regarding merging output from different kmers, how about Minimus2 or SSPACE?

              HTH,

              D

              Comment


              • #67
                Why aren't people using STM for combining runs from multiple k values?
                http://genome.cshlp.org/content/20/10/1432.full

                Comment


                • #68
                  I would like to take the community's opinions on the differential expression analysis of a de novo assembled transcriptome.

                  We are studying a non-model organism with no genome sequence information; we have 99-bp SE Illumina reads and testing the differential expression for two experimental conditions with two biological replicates each.

                  For the de novo transcriptome assembly, we have utilized all four lanes and used velvet-oases (multi-k) and trinity packages. Both assembly metrics and biological annotation suggested that the velvet-oases produced a (slightly) better assembly.

                  For the DE analysis, is it a better approach to use alignment software to map the quality-checked sequencing reads (from each individually tested condition) to the annotated contigs constructed by the combined assembly (from all conditions) and calculate RPKM values

                  or

                  construct two separate de novo assemblies for each experimental condition, extract the number of reads & fragments of an annotated contig and compare it to those of the same gene coding contig from the other assembly?

                  The second approach seems to be integrated in the Trinity package (as FPKM values for each contig); however, as noted earlier in this thread, the authors agree that the values are approximate. I assume the read tracking (-read_trkg) and AMOS file (-amos_file) options from velvet would allow similar info to be extracted.
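For reference, the RPKM value discussed here is the usual reads per kilobase of contig per million mapped reads, RPKM = 10^9 * C / (N * L), with C the reads mapped to the contig, N the total mapped reads in the sample and L the contig length in bp. A minimal Python sketch, with made-up numbers in the example:

```python
def rpkm(read_count, contig_length_bp, total_mapped_reads):
    """Reads per kilobase of contig per million mapped reads."""
    if contig_length_bp <= 0 or total_mapped_reads <= 0:
        raise ValueError("contig length and total mapped reads must be positive")
    return 1e9 * read_count / (total_mapped_reads * contig_length_bp)


# Hypothetical contig: 500 reads on a 2,000 bp contig, 10 million mapped reads.
print(rpkm(500, 2000, 10_000_000))
```

Because N differs between lanes, the same raw count gives different RPKM values in different samples, which is why the per-sample totals have to be kept alongside the counts.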

                  Any thoughts?
                  Thanks..

                  Comment


                  • #69
                    I'm also really interested in the opinions of others on this topic.

                    I've used the approach of making one assembly (with velvet-oases), annotating it, and comparing RPKMs from that assembly. It seems to work well for "obvious" differences in genes (the ones that are way more abundant in one condition than the other).

                    What I was going to try is to fit the RPKMs in edgeR or DESeq and check the results. I'm not too sure what to expect though.

                    BTW, I had to write my own script to extract read counts as you mentioned. With read_trkg on, plus LastGraph and the contig-ordering file, it's not too hard to count reads per transcript.
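This is not the script mentioned above, only a hedged Python sketch of the general idea. It assumes, based on my reading of the Velvet manual, that with read tracking on the LastGraph file contains blocks starting with `NR <node_id> <number_of_reads>` followed by one line per read whose first field is the read ID, and that a negative node ID refers to the reverse strand of the same node. The node-to-transcript mapping is a hypothetical dictionary here; in practice it would have to be built from the Oases contig-ordering output, whose format is not parsed in this sketch.

```python
from collections import defaultdict


def reads_per_node(lastgraph_lines):
    """Collect the set of read IDs listed under each node's NR blocks."""
    reads = defaultdict(set)
    current = None
    for line in lastgraph_lines:
        fields = line.split()
        if not fields:
            continue
        if fields[0] == "NR":
            # Assumption: both strands of a node are pooled under abs(node_id).
            current = abs(int(fields[1]))
        elif fields[0] in ("NODE", "ARC", "SEQ"):
            current = None  # left the NR section
        elif current is not None:
            reads[current].add(fields[0])
    return reads


def reads_per_transcript(node_reads, transcript_nodes):
    """Count distinct reads over all nodes of each transcript."""
    counts = {}
    for transcript, nodes in transcript_nodes.items():
        pooled = set()
        for node in nodes:
            pooled |= node_reads.get(abs(node), set())
        counts[transcript] = len(pooled)
    return counts


# Tiny made-up excerpt in the assumed NR layout.
example = [
    "NR\t1\t2", "10\t0\t0", "11\t5\t0",
    "NR\t-1\t1", "12\t3\t0",
    "NR\t2\t2", "11\t0\t20", "13\t4\t0",
]
node_reads = reads_per_node(example)
print(reads_per_transcript(node_reads, {"t1": [1, 2], "t2": [2]}))
```

Pooling read IDs in a set means a read spanning two nodes of the same transcript is counted once, which is what raw-count based DE tools need.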

                    What did you use to merge assemblies? Passed them back into velvet, CAP3, Zorro, GAM, other?

                    Comment


                    • #70
                      Dear berath & Illetourn,
                      we also mapped the reads from four conditions back to the set of contigs from all lanes and will now calculate the RPKM values. Since we found no tool that can do DE on de novo transcriptomes, I think we have to write a script of our own. Or are any tools available now for this special purpose?

                      Again, to merge assemblies, we used the TGICL/CAP3 package. In retrospect, I'm not sure it does a good job, because sometimes an annotation is found on the (+)-strand of one transcript and the same annotation on the (-)-strand of another. After reverse complementing a sample (-)-transcript and aligning it against the (+)-transcript, the sequences turn out to be similar, but with a few small gaps of between 30 and 60 bp. Could these be real isoforms?
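The reverse-complement step described above is easy to script before the alignment. A minimal Python sketch that handles upper and lower case bases and N; other IUPAC ambiguity codes are left unchanged, which is a simplification:

```python
# Translation table for the bases handled in this sketch.
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


print(reverse_complement("ATGCCN"))  # NGGCAT
```

Applying the function twice returns the original sequence, which is a quick sanity check when orienting transcripts.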

                      Comment


                      • #71
                        Hi lletourn,

                        In #69, for DESeq and edgeR, you should use the raw counts, not RPKM.
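Both packages model integer read counts, so the input is typically a contig-by-sample table of raw counts rather than normalised values. A minimal Python sketch of writing such a table from per-sample count dictionaries; the sample and contig names are hypothetical, and contigs missing from a sample are filled with 0:

```python
def count_matrix(sample_counts):
    """Build a tab-separated raw count table: one row per contig, one column per sample."""
    samples = sorted(sample_counts)
    contigs = sorted({c for counts in sample_counts.values() for c in counts})
    rows = ["contig\t" + "\t".join(samples)]
    for contig in contigs:
        values = (str(sample_counts[s].get(contig, 0)) for s in samples)
        rows.append(contig + "\t" + "\t".join(values))
    return "\n".join(rows)


counts = {
    "condA_rep1": {"c1": 10, "c2": 0},
    "condB_rep1": {"c1": 3, "c3": 7},
}
print(count_matrix(counts))
```

The resulting text can be saved to a file and read into R as a table with contig IDs as row names.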

                        Comment


                        • #72
                          Yeah I know, I computed RPKM separately. I mis-phrased what I meant.
                          I should go back and edit.
                          Thanks.

                          Comment


                          • #73
                            So far, we have tried vmatch and CD-HIT-EST (which performed better) to merge our assemblies. CAP3 is the next one to try.

                            @Iletourn,

                            Would you mind sharing your script to extract read counts from the velvet-oases assemblies?

                            I guess there is no way of choosing a DE approach without comparing the results. One way to do it is probably try it on a set of contigs representing a sample set of genes and compare the fold changes from the two approaches.

                            What I am wondering is whether a de novo assembly using the reads from one experimental condition would be as comprehensive as the one constructed from all the reads at hand. The second approach I mentioned is certainly shorter and maybe less prone to errors than mapping the reads back to the combined assembly.

                            Comment


                            • #74
                              I wanted to add parameters to allow for -long or not. Right now the script assumes -long is passed (in 4 of our experiments we added 454 RNA-Seq reads to the mix).

                              I'll fix this and post the code...somewhere :-)

                              Comment


                              • #75
                                Originally posted by boetsie View Post
                                I don't think it is a good idea to use SSPACE for merging assemblies. Of course contigs can be combined if pairs can be found, however it will not merge full assemblies. You will still end up with the initial size of the total assembly of different k-mers.

                                Best way to go is using a tool that merges assemblies like Zorro or GAM. Have a look at this thread for a list of these tools;

                                http://seqanswers.com/forums/showthr...ighlight=zorro

                                Boetsie

                                Just a quick update on GAM: at the moment it is Sanger based only, but we are modifying it to accept NGS (BAM) files as well!

                                http://services.appliedgenomics.org/software/gam/

                                Comment
