De Novo Assembly of a transcriptome


  • #16
    From sections 4.2 and 4.3 of the new CLC white paper, it appears that the old CLC assembler made slightly longer contigs (unpaired max: CLC 69 kbp vs. Velvet 60 kbp; N50: CLC 23 kbp vs. Velvet 16 kbp) at the expense of more incorrect ones (CLC: 36 wrong; Velvet: 1 wrong). The newer one leans too far the other way. Who knows what Velvet parameters were used; probably the ones that most closely matched the total CLC assembly size.
    http://www.clcbio.com/files/whitepap...C_NGS_Cell.pdf

    I'm not so sure there is a free lunch here.

    Marta, what cvCut and expCov parameters did you use in your Velvet assemblies? The cvCut parameter has a huge effect on N50, assembly size, and read usage.
    Last edited by Zigster; 12-16-2009, 08:16 PM.
    --
    Jeremy Leipzig
    Bioinformatics Programmer



    • #17
      The experiment CLC did for this white paper does not reflect the actual performance of the CLC assembler. I think the assembler is much better than what the paper claims.

      I use CLC Genomics Workbench on Windows with 32 GB of RAM. A few days ago I started testing the latest (beta) version of the assembler for the Workbench. It performs much better than the older one. My input is 92.5 million single transcriptome reads up to 85 nt long (IGA, filtered FASTA).

      About Velvet: my understanding is that there is not much sense in changing expCov for transcriptome reads. We work with normalized mRNA libraries, but the coverage still varies a lot between transcripts. About cvCut, you need to contact alex_kozik (he is a member here). He is the one who ran all the Velvet assemblies on the same set.



      • #18
        Originally posted by Neil
        Hi all,
        also, what software would you recommend for this?
        hope someone can help
        best regards
        neil

        Hi Neil,
        I would recommend our new software, Oases; see the thread "Oases: De novo transcriptome assembly of very short reads" or http://www.ebi.ac.uk/~zerbino/oases/.
        The software is designed to cope with alternative splicing and repetitive regions, which normally break up contigs (for example, when genome assemblers are used). Oases can produce full-length transcripts if the coverage allows it, and it also exploits paired-end information. And yes, paired-end information does improve the results. Oases already supports the longer reads (e.g. 75 bp) produced by current technologies.

        Bests,
        Marcel



        • #19
          How are people evaluating their transcriptome assemblies? The standard N50 assessment can't be that useful, as the goal here isn't exactly to generate a tiny set of huge contigs...?



          • #20
            Interesting question, Blackgore! Without a reference, gene models, or ESTs, how do you evaluate a de novo transcriptome assembly?

            Originally posted by Marta
            Since we assembled correctly the longest genes in plants including BIG (>15 kb) we believe the approach works.

            More technical notes on filtering the reads and Velvet parameters used are here:
            http://atgc-illumina.googlecode.com/...k_090910_D.pdf
            I also found a 15 kb contig homologous to Arabidopsis BIG/ubiquitin-protein ligase in my plant transcriptome. I was told that a similar result was obtained in P. trichocarpa. Therefore, I think BIG/ubiquitin-protein ligase can serve as an indicator for plant transcriptome assembly: long genes like it won't be assembled from a poorly sequenced transcriptome.

            Anyway, neither method (N50 included) says much about scaffold quality. There can be scaffolds with lots of Ns due to poorly sequenced insert gaps. Compare two datasets with the same N50 and longest contig, but one with lots of Ns: how can you tell the difference?
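To make the point about Ns concrete, here is a minimal sketch that reports N50 alongside the fraction of N bases, so two assemblies with the same N50 can still be told apart. The function names and the toy sequences are made up for illustration; they do not come from any assembler's output.

```python
def n50(lengths):
    """Return the N50: the length L such that sequences of length >= L
    together contain at least half of all assembled bases."""
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half:
            return length
    return 0  # empty assembly


def n_fraction(sequences):
    """Fraction of all bases that are N (gap placeholders in scaffolds)."""
    total = sum(len(s) for s in sequences)
    if total == 0:
        return 0.0
    n_count = sum(s.upper().count("N") for s in sequences)
    return n_count / total


# Two toy assemblies with identical sequence lengths (20, 12, 8),
# but the second has 12 Ns padding its longest scaffold.
clean = ["ACGT" * 5, "ACGT" * 3, "ACGT" * 2]
gappy = ["ACGT" * 2 + "N" * 12, "ACGT" * 3, "ACGT" * 2]

for name, seqs in (("clean", clean), ("gappy", gappy)):
    lengths = [len(s) for s in seqs]
    print(name, "N50 =", n50(lengths), "N fraction =", n_fraction(seqs))
```

Both toy assemblies give the same N50 and the same longest sequence; only the N fraction separates them, which is the kind of extra number worth reporting next to N50.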



            • #21
              This might be an interesting read

              Transcriptome sequencing in an ecologically important tree species: assembly, annotation, and marker discovery
              http://www.biomedcentral.com/1471-2164/11/180

              DNASTAR also lists de novo transcriptome assembly
              http://www.dnastar.com/t-sub-nextgen...scriptome.aspx

              I am curious whether anyone has managed to do de novo transcriptome assembly with the shorter SOLiD reads?
              Last edited by KevinLam; 08-23-2010, 03:03 AM.
              http://kevin-gattaca.blogspot.com/



              • #22
                I am working on transcriptome sequencing (GAII, 75 bp paired-end) of a nematode without a reference genome. We tried CLC V4.5 and SOAPdenovo to assemble the reads. My approach is to limit the number of mismatches to zero and the overlapping region to 50% for the first round of assembly, to obtain reliable reference contigs. Then I use the reference contigs from the first round to re-assemble with the unassembled reads, setting mis-2 for the second round. The largest contig we assembled is an 18 kbp titin gene.
                Depending on the complexity of the intron/exon structure of the organism and the depth of sequencing, around 30,000-70,000 transcripts with more than 100 reads each should be identified.
                The remaining steps are the same as for classical EST annotation, starting with BLASTx against UniProt and NCBI nr, plus InterPro.
                The last problem for RNA-seq of an organism without a genome is how to identify the splice variants.



                • #23
                  Hi petang, I'm also just starting my research on RNA-seq.
                  Do you have any idea how to identify alternative splice variants from RNA-seq data without a reference genome?



                  • #24
                    Just a quick question on de novo transcriptome assembly: say I have Illumina 2x100 bp RNA-seq reads. Once they have been assembled by a de novo assembler, how do we annotate the transcripts? By running BLAST or Blast2GO? Is that sufficient?



                    • #25
                      In my experience, the number of contigs assembled from 20 million 2x100 bp reads varies from 30,000 to 50,000, depending on the complexity of the genome. The first question is how many contigs you want to annotate and what the purpose of your experiment is. If you are aiming at gene discovery, the 10,000 most highly expressed contigs should be good enough. Alternatively, you can choose the long contigs (say, longer than 1000 bp).
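As a rough illustration of those two selection rules, here is a small sketch. The dictionaries, contig IDs, and function names are hypothetical, and ranking "highly expressed" contigs by reads per kilobase is my own assumption rather than something stated in this thread; raw read counts would work with the same code if you prefer them.

```python
def longest_contigs(lengths, min_length=1000):
    """IDs of contigs longer than min_length bp, longest first."""
    keep = [cid for cid, n in lengths.items() if n > min_length]
    return sorted(keep, key=lambda cid: lengths[cid], reverse=True)


def top_expressed(read_counts, lengths, top_n=10000):
    """IDs of the top_n contigs ranked by mapped reads per kilobase.

    Normalising by contig length is an assumption made here so that
    long contigs are not favoured just for being long.
    """
    def reads_per_kb(cid):
        return read_counts[cid] / (lengths[cid] / 1000)

    ranked = sorted(read_counts, key=reads_per_kb, reverse=True)
    return ranked[:top_n]


# Toy data: contig ID -> length in bp, and contig ID -> mapped read count.
lengths = {"c1": 2500, "c2": 800, "c3": 1200}
read_counts = {"c1": 500, "c2": 400, "c3": 120}

print(longest_contigs(lengths))                      # contigs longer than 1000 bp
print(top_expressed(read_counts, lengths, top_n=2))  # two most expressed contigs
```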

                      If you are doing comparative transcriptomics, you can obviously choose the differentially expressed contigs.

                      In either case, it is impossible to annotate all contigs without the support of a bioinformatics group.

                      The quickest way to annotate the transcripts is BLASTx against UniProt, then retrieve all the information (Pfam, GO, KEGG, etc.) from the hit. However, you will miss all the conserved hypothetical proteins, which are only available from NCBI. So I would start with BLASTx against UniProt, then run the contigs without hits through BLASTx against NCBI nr.
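The step of pulling out the contigs without a hit after the first round could look roughly like this. It is only a sketch: it assumes the first BLASTx run was saved in tab-separated tabular format with the query ID in the first column, and that the contig ID is the first word of each FASTA header. The function names and the toy records (including the subject names) are invented for the example.

```python
def hit_ids(blast_tabular_lines):
    """Query IDs that have at least one hit in tabular BLAST output."""
    ids = set()
    for line in blast_tabular_lines:
        line = line.strip()
        if line and not line.startswith("#"):  # skip blanks and comments
            ids.add(line.split("\t")[0])
    return ids


def record_id(header):
    """First whitespace-delimited word of a FASTA header (without '>')."""
    parts = header.split()
    return parts[0] if parts else ""


def unhit_records(fasta_lines, hits):
    """Yield (header, sequence) for FASTA records whose ID is not in hits."""
    header, chunks = None, []
    for line in fasta_lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            if header is not None and record_id(header) not in hits:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line.strip())
    # Flush the final record.
    if header is not None and record_id(header) not in hits:
        yield header, "".join(chunks)


# Toy inputs; real ones would be read from the contig FASTA file and
# from the tabular report of the first (UniProt) search.
fasta = [">c1 len=9", "ATGAAATTT", ">c2", "ATGCCC", "GGG", ">c3", "ATGTTT"]
blast = [
    "c1\tMADE_UP_PROTEIN_A\t80.0\t3\t0\t0\t1\t9\t1\t3\t1e-5\t30.0",
    "c3\tMADE_UP_PROTEIN_B\t75.0\t2\t0\t0\t1\t6\t1\t2\t1e-4\t28.0",
]

for header, seq in unhit_records(fasta, hit_ids(blast)):
    print(">" + header)
    print(seq)
```

The printed records are the ones to send on to the second search against nr, which keeps the slower, larger database search limited to contigs that actually need it.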



                      • #26
                        Originally posted by edge
                        Hi petang, I'm also just starting my research on RNA-seq.
                        Do you have any idea how to identify alternative splice variants from RNA-seq data without a reference genome?

                        Still no idea on this.
                        Sorry



                        • #27
                          Hi petang. Thanks for your reply. Appreciate it.



                          • #28
                            Dear petang,

                            That's OK, no worries.
                            I'm just not sure whether my idea of identifying alternative splicing variants without a reference genome sequence is workable right now.
                            Does it sound logical or not?
                            It seems really quite difficult to identify alternative splicing variants without the reference genome sequence.
                            Thanks in advance for any advice.



                            • #29
                              We've tried a few different programs for de novo transcriptome assembly; you can see our paper, which came out about a year ago, here: http://www.biomedcentral.com/1471-2164/11/663.

                              As Marcel pointed out, Oases seems to do a pretty good job with the latest updates. In our paper we introduce Rnnotator, which adds some pre- and post-processing steps to further improve the assembly. We've been able to assemble a plant transcriptome, but we are still evaluating the result.

                              Originally posted by Neil
                              Hi all,
                              We are planning to perform an mRNA-seq run using the Illumina GAII platform. We are worried about assembling the transcriptome when we get our data back. Most of the RNA-seq papers I read are assembling to a reference genome/transcriptome, we don't have either of these! Is there anyone out there that has assembled cDNA short reads de novo? If so, are paired reads as important as they are with genome assembly?
                              also, what software would you recommend for this?
                              hope someone can help
                              best regards
                              neil



                              • #30
                                Hi,

                                Is web-based BLASTx able to digest the full contig output from Velvet or Oases, or is it better to download both BLAST and the UniProt database and work locally?

                                Best,

                                Dave
