De Novo Assembly of a transcriptome


  • #46
    Celia,

    From my understanding your N50 for transcriptome assembly should be pretty low. You're just not going to have large contigs when the average spliced gene is something like 1500-2000 base pairs. Then, depending on your RNA extraction and purification methods, you might have also captured microRNAs. To a certain extent you probably have some genomic contamination, and certainly a lot of pre-spliced mRNAs. Then, what's your coverage on lowly expressed genes? For larger, but lowly expressed genes, you probably have a bunch of small contigs. Anyway, I wouldn't worry about the N50 for transcriptome assembly so much. I'd much rather see what the size of the 5,000th and 10,000th contig is to get a measure of how "complete" the transcriptome is. And unless you have the whole body of the animal and a variety of ages (including embryonic), I wouldn't expect you to get much more than 10K reasonably well assembled genes, if that.

    I can't really help with the other question though, I'm still behind on the actual "doing" of the analysis, but do you mind sharing what kind of specs the computer you were running ABySS on had for processors/RAM?
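    The "size of the 5,000th and 10,000th contig" check above is easy to script; here is a minimal sketch in plain Python (file names are illustrative):

```python
# Sketch: compute N50 and the length of the contig at given ranks
# (e.g. the 5,000th and 10,000th largest) for an assembly FASTA.

def contig_lengths(fasta_path):
    """Collect contig lengths from a FASTA file."""
    lengths, current = [], 0
    with open(fasta_path) as fh:
        for line in fh:
            if line.startswith(">"):
                if current:
                    lengths.append(current)
                current = 0
            else:
                current += len(line.strip())
    if current:
        lengths.append(current)
    return lengths

def assembly_stats(lengths, ranks=(5000, 10000)):
    """Return (N50, {rank: length of rank-th largest contig or None})."""
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2.0
    running, n50 = 0, 0
    for size in ordered:
        running += size
        if running >= half:   # first contig where cumulative sum passes half
            n50 = size
            break
    at_rank = {r: (ordered[r - 1] if len(ordered) >= r else None)
               for r in ranks}
    return n50, at_rank
```

    For example, assembly_stats(contig_lengths("assembly.fasta"), ranks=(5000, 10000)) reports the N50 alongside the contig lengths at those two ranks.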



    • #47
      Wallysby01,

      Thanks for answering. As soon as Trinity stops running I will have a look at what you said about the 5,000th and 10,000th contig.

      Our computer has an 8-core processor and 16GB of memory plus 18GB of swap.
      As I said before, ABySS always finished overnight (10 hours max).



      • #48
        Originally posted by Aurelien Mazurie View Post
        I am collecting information about the best strategy to perform de novo transcriptome assembly for a plant for which we have no reference genome. From what I read here it seems that most people are going for Illumina rather than 454 reads (which answers my first question, about which NGS technology should be used for this task).
        I am doing de novo cDNA library assembly on a plant using 454 Junior reads. I am totally new to sequencing, but was told by a researcher with much experience that the 454 de novo data would assemble better than Illumina because of the long reads.

        Also I was told by Roche that there is no protocol for paired end cDNA libraries because the reads are so long and that it isn't a necessity.

        As for assembly, a fellow researcher runs GS de novo Assembler followed by cap3.

        Regarding BLASTING: I have questions regarding the best way to do this with my isotigs. I have access to a cluster with blastall and would like to blast to nt or nr. I have about 6600 isotigs and I'm not sure how to find out how much memory it would take (and that I would request on the cluster) to blast them all to nr. I have heard that I can test a subset (10, 100, 1000), but don't know how to go about doing this. Any suggestions? Should I split my isotig file up before blasting?



        • #49
          Originally posted by grassgirl View Post
          but was told by a researcher with much experience that the 454 de novo data would assemble better than Illumina because of the long reads.
          True, but because of the lower number of reads you will need more sequencing to get the transcripts that are less expressed.

          Originally posted by grassgirl View Post
          Also I was told by Roche that there is no protocol for paired end cDNA libraries because the reads are so long and that it isn't a necessity.
          That is very true. If you take into account that transcripts average ~1kb (your mileage will vary by species), 500bp reads are long enough. And 454 mates (or pairs, whatever you call them) would be useless since the protocol basically starts at 3kb inserts.

          Originally posted by grassgirl View Post
          As for assembly, a fellow researcher runs GS de novo Assembler followed by cap3.
          This is very common when using Velvet or other de Bruijn graph assemblers. Sometimes you don't need cap3 since the assemblers do a good enough job.

          In your case, with 454 reads, I would suggest an overlap-based assembler like MIRA. MIRA has always given good results with 454 and Sanger-type reads for ESTs and transcriptome analysis. It also works well with Illumina but is *very* resource-demanding.

          Originally posted by grassgirl View Post
          I have access to a cluster with blastall and would like to blast to nt or nr.
          I highly suggest nr, running blastx with your transcripts. Conserved proteins are easier to find this way.

          Originally posted by grassgirl View Post
          I have about 6600 isotigs and I'm not sure how to find out how much memory it would take (and that I would request on the cluster) to blast them all to nr. I have heard that I can test a subset (10, 100, 1000), but don't know how to go about doing this. Any suggestions? Should I split my isotig file up before blasting?
          I don't know what BLAST setup you have (mpiBLAST or some home-made solution), but if your BLAST clone doesn't split the query for you then yes, I would break it up. I usually break it into the number of nodes I'm allowed to use.

          I've done this with 33,000 assembled transcripts on a 192-node cluster. It took a few days (2-4, I don't remember) to get all the XML results. Basically I broke the 33k transcripts into 192 parts and ran one per node.
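          The split step above can be sketched like this (plain Python; file naming is illustrative, and a round-robin split keeps the chunks roughly even):

```python
# Sketch: split a multi-FASTA query file into n_parts roughly equal
# chunks (round-robin over records), one chunk per cluster node.

def read_fasta(path):
    """Yield (header, sequence) records from a FASTA file."""
    header, seq = None, []
    with open(path) as fh:
        for line in fh:
            line = line.rstrip("\n")
            if line.startswith(">"):
                if header is not None:
                    yield header, "".join(seq)
                header, seq = line, []
            elif line:
                seq.append(line)
    if header is not None:
        yield header, "".join(seq)

def split_round_robin(records, n_parts):
    """Distribute records over n_parts chunks, round-robin."""
    chunks = [[] for _ in range(n_parts)]
    for i, rec in enumerate(records):
        chunks[i % n_parts].append(rec)
    return chunks

def write_chunks(chunks, prefix="query_part"):
    """Write each chunk as prefix_NNN.fasta; returns the file names."""
    names = []
    for i, chunk in enumerate(chunks):
        name = "%s_%03d.fasta" % (prefix, i)
        with open(name, "w") as out:
            for header, seq in chunk:
                out.write(header + "\n" + seq + "\n")
        names.append(name)
    return names
```

          Each resulting file can then be submitted as one BLAST job per node.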



          • #50
            Hi there,
            we just did some Velvet/Oases assemblies on several non-normalized 60bp PE libraries and I would like to share the resources needed:

            Set 1 with 101 million reads: up to 38 GB RAM for Velvet, up to 63 GB for Oases with k=25
            Set 2 with 87 million reads: up to 25 GB for Velvet, up to 46 GB for Oases, also k=25
            Set 3 with 100 million reads: up to 37 GB for Velvet and 27 GB for Oases, k=25

            Runtimes were up to 4.5 h for Velvet and up to 1 h for Oases.
            We explored more k-mers, and the resources needed were smaller for higher k-mers (as expected).
            We also had a set of 454 reads and assembled them with MIRA, which took 13 h and not more than 7 GB of RAM. The N50 value here was about 450 bp for 60k contigs.
            In addition, all transcripts have an N50 value around 670 bp after clustering all sets (together with the 454 contigs).

            We plan to do the same assemblies with Trans-ABySS and Trinity as well; I can post the resources needed here if you are interested.



            • #51
              Originally posted by lletourn View Post
              I highly suggest nr, with blastx with your transcripts. Conserved proteins are easier to find this way.
              Originally Posted by grassgirl
              I have about 6600 isotigs and I'm not sure how to find out how much memory it would take (and that I would request on the cluster) to blast them all to nr. I have heard that I can test a subset (10, 100, 1000), but don't know how to go about doing this. Any suggestions? Should I split my isotig file up before blasting?
              I don't know what blast setup you have (mpiblast or some home made solution), but if your blast clone doesn't split the query for you then yes I would break it up. I usually break it in the amount of nodes I'm allowed, or that I can use.

              I've done this with 33,000 assembled transcripts on a 192 node cluster. It took a few days (2-4 I don't remember) to get all the xml results. Basically I broke down the 33k transcripts in 192 parts. And ran 1 per node.
              I would suggest not BLASTing against nr. Everyone is tempted to BLAST against the whole universe, but that is not the best idea. BLAST against a reference database matched to your queries, and one which isn't highly redundant. I've done lots of de novo plant transcriptome assembly, and I typically run two BLAST jobs on the output: against TAIR and against the green plant subdivision of RefSeq.

              You should also tweak the BLAST options appropriately for the experiment you are performing (yes, think of running BLAST as performing an experiment, an in silico Northern). The parameters I typically use for BLASTing transcript contigs against a protein database are:

              Code:
              # blastall -p blastx -d <db> -i <contigFile.fasta> -U -f 14 -F "m S" -e 1e-10 -b 20 -v 20 -a <#_cpus>
              
              Adapted from BLAST by Korf, Yandell & Bedell
              
              Sorry that this is the command using the old BLAST toolset.
              If you are using BLAST+, as NCBI is urging people to do, you'll have to translate this to the new command/options.
              The BLAST+ package has a perl script which can do the translation for you.
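              For reference, a possible BLAST+ equivalent of the command above (based on the legacy_blast.pl option mapping; verify each option against your BLAST+ version before relying on it):

              Code:
              # blastx -query <contigFile.fasta> -db <db> -lcase_masking -threshold 14 -seg yes -soft_masking true -evalue 1e-10 -num_alignments 20 -num_descriptions 20 -num_threads <#_cpus>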
              Limiting the size of your database, limiting the number of hits reported, and adjusting the word threshold will all reduce the time of your BLAST job. It's been a while since I've done much BLASTing, but I believe that 6,600 isotigs should take < 12 hours on 8 CPUs against RefSeq plants. Against TAIR it will take under an hour.

              Also, the memory requirement is independent of the size of your query set. BLAST does not store the query or the results in RAM. The major contributor to RAM consumption is the size of the target database. Here again, sticking to more narrowly targeted DBs will help, but by today's standards of RAM even nr should not be a problem.



              • #52
                Originally posted by kmcarr View Post
                I would suggest not BLASTing against nr.
                Totally agree. What I meant was: BLAST against protein, not nucleotide.

                Having a smaller db will yield more precise results (score wise) too.



                • #53
                  Thanks, all, for the replies to my questions and great suggestions!



                  • #54
                    Originally posted by Celia View Post
                    Wallysby01,

                    thanks for answering...as soon as trinity stops running I will have a look at what you said about the 5000 and 10000th contig.
                    Celia,

                    I don't know if Trinity is still running, but if it is taking too long at the Butterfly step, then you might find this note from the Trinity FAQ interesting.

                    They, however, say this shouldn't be an issue after version 2011-05-19...



                    • #55
                      Evaluating transcriptome assembly from k-mer iterations

                      Originally posted by blackgore View Post
                      How are people evaluating their transcriptome assemblies? The standard N50 assessment can't be that useful, as the goal here isn't exactly to generate a tiny set of huge contigs...?
                      Hi,

                      A comparative approach was suggested by a user on the oases mailing list.
                      http://listserver.ebi.ac.uk/pipermai...ry/000008.html

                      HTH

                      Mbandi



                      • #56
                        Some data for those trying to figure out which programs to run for transcriptome data:

                        I tried to run Trinity on 1 lane from the HiSeq, ~100M 105bp paired-end reads, on a machine with 64 GB of RAM and 4 Xeon processors (though the processors are not the problem), and it crashed after creating all the k-mers in the de Bruijn graph and then trying to create contigs.

                        I'll be moving on to ABySS, as it seems to be much more memory-efficient. Despite having access to one of the world's largest supercomputers, I can't get more than 64 GB of RAM (makes me wonder what's so super about it).



                        • #57
                          For those interested this just came out:
                          Short read Illumina data for the de novo assembly of a non-model snail species transcriptome (Radix balthica, Basommatophora, Pulmonata), and a comparison of assembler performance.
                          http://www.biomedcentral.com/1471-2164/12/317/abstract



                          • #58
                            Hi, here are my two cents: my plan is to use Trinity and then Velvet/Oases with different k-mers for de novo transcriptome assembly. I will run both and then merge the results to create a consensus transcriptome. So far I have run Trinity on 127M 105bp reads (mixed paired/single, but used as single-end, since Trinity seems to use pairing information only for mate-pairs, not paired reads) on my 24GB RAM, 8-processor box and had no problems with default parameters (I think it took two days or so).

                            I am now trying to run Velvet on a subset of those (40M mixed single/paired) and am running out of memory, so I am trying a larger computer. I guess I will also run into problems when Oases comes.

                            Best.



                            • #59
                              Hi dnusol,
                              I'm not experienced with assembly, but I started running a velveth -> velvetg -> oases pipeline (iterating over k) with 10,601,688 reads (paired and single). The memory constraint was profound and it always crashed. I was advised to abstain from very low k values. I do my iterations over 19 <= k <= 29 with only 5 GB of memory allocated to the whole process (although not all of it is used, judging by the log file for the job), and it takes 31.55 minutes. I use a 31 GB, 16-processor machine which I share with others. With your 40M reads, it is obvious you would need more memory. However, I advise you to start with k=19.
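                              A sweep like this (velveth -> velvetg -> oases over a range of odd k values) can be driven from a short script; here is a sketch in Python (tool names are assumed to be on your PATH, file names are illustrative):

```python
# Sketch: build the velveth -> velvetg -> oases command lines for a
# sweep of odd k values; run each with subprocess.call if the Velvet
# and Oases binaries are installed.

def kmer_sweep_commands(reads="reads.fa", k_min=19, k_max=29):
    """One (velveth, velvetg, oases) command triple per odd k."""
    jobs = []
    for k in range(k_min, k_max + 1, 2):
        outdir = "oases_k%d" % k
        jobs.append((
            ["velveth", outdir, str(k), "-fasta", "-short", reads],
            ["velvetg", outdir, "-read_trkg", "yes"],  # Oases needs read tracking
            ["oases", outdir],
        ))
    return jobs

# Example: print the commands instead of running them.
for velveth_cmd, velvetg_cmd, oases_cmd in kmer_sweep_commands():
    for cmd in (velveth_cmd, velvetg_cmd, oases_cmd):
        print(" ".join(cmd))
```

                              Each k value gets its own output directory, so the per-k assemblies can be compared or merged afterwards.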
                              Cheers



                              • #60
                                Hi Apexy, thanks for your input,

                                I thought small k-mers would work worse for long reads (105bp), which is why I chose the 31-45 range.
                                Since my last post I have some more news: velvetg peaked at 56GB RAM for k-mer 31 and about 40M reads (keep in mind read_trkg was on, as suggested in the manual, which seems to be memory-hungry).

                                Best,

                                David
