
De Novo Assembly of a transcriptome


  • De Novo Assembly of a transcriptome

    Hi all,
    We are planning an mRNA-seq run on the Illumina GAII platform, and we are worried about assembling the transcriptome once the data comes back. Most of the RNA-seq papers I have read assemble against a reference genome or transcriptome, and we have neither! Has anyone out there assembled short cDNA reads de novo? If so, are paired reads as important as they are for genome assembly?
    Is there an example dataset of paired-end mRNA-seq short reads that I can download to test an assembly?
    Also, what software would you recommend for this?
    Hope someone can help.
    Best regards,
    Neil
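
If no public paired-end dataset is at hand, one way to rehearse an assembly is to simulate read pairs from a few known cDNA sequences. The following is a minimal sketch, not a realistic sequencer model: the function names, the 200 bp mean fragment size and the error-free reads are all assumptions made for illustration; only the 72 bp read length comes from this thread.

```python
import random

def reverse_complement(seq):
    """Return the reverse complement of a DNA string (A, C, G, T, N only)."""
    comp = {"A": "T", "C": "G", "G": "C", "T": "A", "N": "N"}
    return "".join(comp[b] for b in reversed(seq.upper()))

def simulate_pairs(transcript, n_pairs, read_len=72,
                   insert_mean=200, insert_sd=20, seed=0):
    """Sample error-free paired-end reads from one transcript.

    Fragment sizes are drawn from a normal distribution (an assumption),
    then clamped to [read_len, len(transcript)].
    """
    if len(transcript) < read_len:
        raise ValueError("transcript is shorter than the read length")
    rng = random.Random(seed)
    pairs = []
    for _ in range(n_pairs):
        frag_len = int(rng.gauss(insert_mean, insert_sd))
        frag_len = max(read_len, min(frag_len, len(transcript)))
        start = rng.randint(0, len(transcript) - frag_len)
        fragment = transcript[start:start + frag_len]
        # Read 1 is the fragment's 5' end; read 2 is the reverse
        # complement of its 3' end, as in a standard paired-end library.
        read1 = fragment[:read_len]
        read2 = reverse_complement(fragment[-read_len:])
        pairs.append((read1, read2))
    return pairs
```

Feeding the simulated pairs to an assembler and comparing the contigs with the input sequences gives a rough idea of how much the pairing information helps.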

  • #2
    Checking for existing ESTs may help with the assembly.
    De novo assembly of a transcriptome... what about misassemblies?



    • #3
      Though I haven't finished the project (the reads aren't all in yet), I'm doing something similar right now: no reference transcriptome, but looking for SNPs in cDNA reads from two subspecies. The first was sequenced with single-end reads, which resulted in pretty short contigs, and only roughly 1/10 of the total transcriptome was assembled. I'm recommending paired ends for the second sample, so I may have a quantitative answer for you in a couple of weeks.

      The transcriptome may have more unique, assemblable sequence than the genome... but homologous domains will be a problem, and paired ends would definitely help there. That's why I'd guess that a small-insert library should help quite a bit.

      I'd recommend Velvet; it still seems to be the best option out there for Illumina reads. Not sure about simulation...



      • #4
        A year ago, de novo transcriptome sequencing based solely on the Illumina GAII was a bad idea. With 72 bp PE reads and higher coverage, nothing is impossible now.

        As Rao suggested, EST data will be helpful for the assembly. But the fact is that most organisms of interest don't have comprehensive EST information, and often there is no reference genome or transcriptome available (not even from a related species). You don't know the exact size of the transcriptome, and you have to deal with repeats, paralogous genes and isoforms. It's tricky even to tell whether your assembly went wrong. Like I said, it depends on the purpose of the sequencing. Things are a lot easier if the goal is to discover SNPs. If the results are not satisfying, try alternatives such as sequencing with longer reads.



        • #5
          Hi all!
          I'm annotating the transcriptome of a non-reference organism, similar to what you are doing. My assembly was made with GS de novo assembler, but I got short contigs...
          I'm trying the assembly with Mosaik, but first I have another problem: what about transposable elements? Have you tried windowmasker or RepeatMasker? For an organism without a database of these repetitive elements, which program do you think is better?
          Thanks!



          • #6
            Originally posted by jordi
            Hi all!
            I'm annotating the transcriptome of a non-reference organism, similar to what you are doing. My assembly was made with GS de novo assembler, but I got short contigs...
            I'm trying the assembly with Mosaik, but first I have another problem: what about transposable elements? Have you tried windowmasker or RepeatMasker? For an organism without a database of these repetitive elements, which program do you think is better?
            Thanks!
            Why would you worry about transposable/repetitive elements in the transcriptome? The common repeats found in a transcriptome are SSRs and low-complexity regions. I'm not referring to repeats that are several kb long (as in the genome). But if those repeats are transcribed, then yes, you will find them in the transcriptome.
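
For the short repeats mentioned above (SSRs), a quick scan of the contigs does not need a repeat database. This is a rough sketch only: the function name and the thresholds (motifs of 1 to 6 bp repeated at least 5 times) are assumptions for illustration, not values taken from this thread or from any particular masking tool.

```python
import re

def find_ssrs(seq, max_motif=6, min_repeats=5):
    """Locate simple sequence repeats (SSRs) in a DNA string.

    Returns a list of (start, end, motif, repeat_count) tuples, with
    0-based start and exclusive end. Only perfect tandem repeats are found.
    """
    # A lazily matched motif of 1..max_motif bases, followed by at least
    # (min_repeats - 1) further copies of that same motif.
    pattern = re.compile(
        r"([ACGT]{1,%d}?)\1{%d,}" % (max_motif, min_repeats - 1)
    )
    hits = []
    for m in pattern.finditer(seq.upper()):
        motif = m.group(1)
        count = (m.end() - m.start()) // len(motif)
        hits.append((m.start(), m.end(), motif, count))
    return hits
```

The reported intervals could then be soft-masked or simply flagged before running blast, depending on how aggressive you want the filtering to be.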



            • #7
              Because if you don't have high coverage and the same repetitive element can appear in different genes, how do I know which protein has been translated? So I would mask these elements.
              Low coverage has been my problem with the standard GS de novo assembler: contig lengths of approx. 200 bp and coverage from 4X to 6X.
              Thanks!
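
For reference, the coverage figures quoted above are usually computed as the total number of read bases placed on a contig divided by the contig length. A minimal sketch, assuming every read lies entirely within the contig (the function name and the example numbers are made up for illustration):

```python
def mean_coverage(read_lengths, contig_length):
    """Average depth of coverage: total read bases / contig length.

    Assumes each read aligns fully inside the contig.
    """
    if contig_length <= 0:
        raise ValueError("contig_length must be positive")
    return sum(read_lengths) / contig_length

# Ten 100 bp reads on a 200 bp contig give 5X, within the 4X to 6X range.
example = mean_coverage([100] * 10, 200)
```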



              • #8
                Oh, sorry. I found repetitive elements that are reverse transcriptases, located in the 3' UTR of different genes. How can I differentiate the origin of my blast results?



                • #9
                  Originally posted by jordi
                  Oh, sorry. I found repetitive elements that are reverse transcriptases, located in the 3' UTR of different genes. How can I differentiate the origin of my blast results?
                  The only way to identify a 3' UTR is the presence of a polyA tail at the end of the sequence. Considering our contigs are short, are you sure these are not misassemblies? How long is the repetitive element you found, and what is the similarity?

                  If you are using blast to annotate your contigs, relying on the 3' UTR is not a good idea because that region can vary even within the same species.

                  I have used CENSOR to find repeats in my ESTs, but there were no significant hits. Most hits are around 100 bp with 80% similarity (the original genomic repeat is several kb long), and each occurs only once in the ESTs. Maybe plant repeat databases are not well characterized. In the end, I just ignored them.

                  Found a related thread on repeats at
                  http://seqanswers.com/forums/showthread.php?t=1504
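
The polyA check described in this post can be sketched in a few lines. This is only an illustration: the function name, the 10-base threshold and the handling of reverse-strand contigs (a leading run of T instead of a trailing run of A) are assumptions, not something specified in the thread.

```python
def polya_evidence(contig, min_len=10):
    """Report whether a contig carries a poly-A tail signature.

    Returns "forward" if the contig ends in at least min_len A's,
    "reverse" if it starts with at least min_len T's (the same tail
    seen on the opposite strand), and None otherwise.
    """
    s = contig.upper()
    tail_a = len(s) - len(s.rstrip("A"))  # length of the trailing A run
    head_t = len(s) - len(s.lstrip("T"))  # length of the leading T run
    if tail_a >= min_len:
        return "forward"
    if head_t >= min_len:
        return "reverse"
    return None
```

A contig that returns None may still contain a 3' UTR; it only means the assembled end does not reach the tail.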



                  • #10
                    So, in short, no one has done a de novo transcriptome assembly for a new organism before?
                    Could we use a closely related species, like fish, to help with the de novo assembly?

                    How about taking it further and doing expression profiling on the new organism?
                    http://kevin-gattaca.blogspot.com/



                    • #11
                      We assembled the lettuce transcriptome using 85 nt IGA single reads. We used CLC and Velvet, followed by CAP3.



                      • #12
                        While this is not a de novo assembly of a novel transcriptome, in some ways it is better because it can be compared against a known transcriptome (which was not used in the assembly, as far as I know).

                        http://bioinformatics.oxfordjournals...&pmid=19528083
                        Bioinformatics. 2009 Nov 1;25(21):2872-7. Epub 2009 Jun 15.
                        De novo transcriptome assembly with ABySS.
                        Birol I, Jackman SD, Nielsen CB, Qian JQ, Varhol R, Stazyk G, Morin RD, Zhao Y, Hirst M, Schein JE, Horsman DE, Connors JM, Gascoyne RD, Marra MA, Jones SJ.

                        Genome Sciences Centre, Vancouver, BC V5Z 4S6, Canada. [email protected]
                        MOTIVATION: Whole transcriptome shotgun sequencing data from non-normalized samples offer unique opportunities to study the metabolic states of organisms. One can deduce gene expression levels using sequence coverage as a surrogate, identify coding changes or discover novel isoforms or transcripts. Especially for discovery of novel events, de novo assembly of transcriptomes is desirable.
                        RESULTS: Transcriptome from tumor tissue of a patient with follicular lymphoma was sequenced with 36 base pair (bp) single- and paired-end reads on the Illumina Genome Analyzer II platform. We assembled approximately 194 million reads using ABySS into 66 921 contigs 100 bp or longer, with a maximum contig length of 10 951 bp, representing over 30 million base pairs of unique transcriptome sequence, or roughly 1% of the genome.
                        AVAILABILITY AND IMPLEMENTATION: Source code and binaries of ABySS are freely available for download at http://www.bcgsc.ca/platform/bioinfo/software/abyss. Assembler tool is implemented in C++. The parallel version uses Open MPI. ABySS-Explorer tool is implemented in Java using the Java universal network/graph framework.
                        CONTACT: [email protected].

                        PMID: 19528083
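
The summary numbers in the abstract (contigs of 100 bp or longer, maximum contig length, total assembled bases, fraction of the genome) are easy to reproduce for your own assembly. A small sketch follows; the function names are made up, and the human genome size of about 3.1 Gb is an assumption that does not appear in the abstract.

```python
def summarize_contigs(lengths, min_len=100):
    """Count, longest length and total bases of contigs >= min_len bp."""
    kept = [n for n in lengths if n >= min_len]
    return {
        "count": len(kept),
        "max_length": max(kept, default=0),
        "total_bp": sum(kept),
    }

def fraction_of_genome(assembled_bp, genome_bp):
    """Assembled bases as a fraction of the genome size."""
    return assembled_bp / genome_bp

# Assumed human genome size (~3.1 Gb); not stated in the abstract.
HUMAN_GENOME_BP = 3.1e9

# 30 million assembled bp is just under 1% of that, which is
# consistent with the "roughly 1% of the genome" in the abstract.
abyss_fraction = fraction_of_genome(30e6, HUMAN_GENOME_BP)
```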



                        • #13
                          Originally posted by Marta
                          We assembled the lettuce transcriptome using 85 nt IGA single reads. We used CLC and Velvet, followed by CAP3.
                          Are your results published in a paper already? I would love to read it!
                          http://kevin-gattaca.blogspot.com/



                          • #14
                            We have done several de novo transcriptome projects, mainly using Illumina technology and the ABySS assembler. In general it works, but the problem is getting full-length sequences (from start to stop codon). We have recently learned that some labs co-ligate the transcripts prior to nebulization, which should increase the number of full-length genes. The reason is that fragmentation is non-random at the ends, which leaves the ends underrepresented in the library.



                            • #15
                              KevinLam,

                              The data is unpublished. We are re-assembling the reads using the latest version of the CLC assembler and Velvet with adjusted parameters. The number of transcriptome contigs in our latest assemblies went down from ~70K to ~57K. I have a presentation online with results from last summer's assemblies here:
                              https://docs.google.com/fileview?id=...MWYzNjkz&hl=en

                              Since we correctly assembled the longest genes in plants, including BIG (>15 kb), we believe the approach works.

                              More technical notes on filtering the reads and Velvet parameters used are here:
                              http://atgc-illumina.googlecode.com/...k_090910_D.pdf

