
  • Create longer contigs from transcriptome assembly

    Hi everyone,

    I am doing transcriptome assembly on short Illumina reads (both single-end and paired-end, with an average length of 100 bp).
    I have used several transcriptome assemblers: MIRA, Velvet/Oases, and SOAPdenovo-Trans.
    The average contig length I get varies from 400 to 1000 bp.
    I tried different parameters, such as different k-mer sizes and minimum contig sizes, but the results do not improve much.
    I am interested in making much longer contigs (a few thousand bp), so I was wondering whether there is any way to increase the current contig length.
    I would also like to know why this length varies so much between different transcriptome assemblers (there is a huge difference between 400 bp and 1000 bp).
    I know that Velvet and SOAPdenovo are based on de Bruijn graphs, while MIRA is OLC-based.

    I would really appreciate it if someone could share a similar experience with me.
    Thank you very much,
    Best regards,
    Natasha

  • #2
    So ... what average contig length do you expect? And why? While the transcriptome projects that come through my hands do generate some transcripts in the 'few thousands', most of them are much shorter.

    The average length *may* vary because of the number of short transcripts kept by the various programs. If one program keeps all transcripts while another throws away transcripts shorter than 200 bases, then your average will vary even if the longest transcripts do not. Really, you cannot say much about average lengths unless you also know the shortest/longest lengths and the distribution.
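
    As a rough illustration of that point, here is a minimal Python sketch. The contig lengths are made up, and `length_summary` and its `min_length` parameter are hypothetical names, not part of any assembler. It only shows how a minimum-length cutoff shifts the mean while the longest contig stays the same.

```python
from statistics import mean, median


def length_summary(lengths, min_length=0):
    """Summarize contig lengths, keeping only contigs >= min_length."""
    kept = sorted(n for n in lengths if n >= min_length)
    if not kept:
        raise ValueError("no contigs at or above min_length")
    return {
        "count": len(kept),
        "min": kept[0],
        "median": median(kept),
        "mean": mean(kept),
        "max": kept[-1],
    }


# Made-up contig lengths (bp), as if read from one assembly.
contig_lengths = [150, 180, 190, 400, 900, 2500]

# One program reports everything; another drops contigs under 200 bp.
all_contigs = length_summary(contig_lengths)
filtered = length_summary(contig_lengths, min_length=200)

print(all_contigs)
print(filtered)
```

    The same set of contigs gives a much higher mean once the short ones are dropped, which is why the min, max, and median are worth reporting next to the mean.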



    • #3
      Also, if you are going to do de novo transcriptome assembly, then you really owe it to yourself to try out Trinity instead of using non-transcriptome assemblers.



      • #4
        The length of your assemblies will be greatly affected by the expression level of the genes you're assembling. Even in very deeply sequenced samples, and with replicates, it's going to be very hard to assemble very many genes from TSS to polyA, or even from start to stop codon. Simple statistics like N50 or mean length just don't mean much for transcriptomes.

        You need to do some sort of orthology assignment to get an idea of how complete your assembly is, or how one assembly compares to another. If you're going for simple statistics about your assembly, I'd much rather just look at the number of transcripts >1 kbp than N50/average length, because it's usually in the 500-1000 bp range that you start getting meaningful information for downstream analysis.
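
        For anyone who wants to compute those simple numbers side by side, a minimal Python sketch follows. The lengths are made up, and `n50` and `count_longer_than` are hypothetical helper names. N50 is taken here in its usual sense: the length of the contig at which the running total, summed from the longest contig down, first reaches half of the total assembled bases.

```python
def n50(lengths):
    """Return the N50 of a collection of contig lengths."""
    ordered = sorted(lengths, reverse=True)
    if not ordered:
        raise ValueError("no contig lengths given")
    half_total = sum(ordered) / 2
    running = 0
    for n in ordered:
        running += n
        # First contig at which the cumulative length covers half the assembly.
        if running >= half_total:
            return n


def count_longer_than(lengths, threshold=1000):
    """Count contigs strictly longer than threshold (bp)."""
    return sum(1 for n in lengths if n > threshold)


# Made-up transcript lengths (bp) for illustration only.
transcript_lengths = [300, 450, 600, 1200, 2500, 4000]

assembly_n50 = n50(transcript_lengths)
over_1kb = count_longer_than(transcript_lengths, threshold=1000)

print("N50:", assembly_n50)
print("transcripts > 1 kbp:", over_1kb)
```

        Neither number says anything about completeness, so they are only a quick sanity check next to the orthology-based comparison.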

        And to try to answer your question about how to improve assembly length, I would just say try trans-ABySS (which uses a multiple k-mer approach and might be the best assembler in terms of completeness) and Trinity (which does a nice job with length and ease of downstream analysis). From your use of Velvet Oases, it sounds like you're doing this on microbes, but you may still find success with those two.
