  • edge
    Senior Member
    • Sep 2009
    • 199

    Minimum short read required for transcriptome assembly

    I have Illumina short reads, 2X50bp, around 14Gb of data right now.
    I am just curious whether there is any parameter or formula for calculating the minimum amount of short-read data a transcriptome assembler needs in order to assemble comprehensive transcripts.
    e.g. you must have at least 1Mb of Illumina short reads in order to assemble a transcript.

    Do we also need to consider the coverage and depth of the data when determining or calculating the minimum amount of short-read data required for transcriptome assembly?

    Many thanks for any advice.
  • westerman
    Rick Westerman
    • Jun 2008
    • 1104

    #2
    Ah, I should have noted that you are a "Senior Member" and thus undoubtedly already know more about sequencing than many of us. My response below was more aimed towards the many new people we get on SeqAnswers thus it may not be applicable to you. Wish I did have more than a rough guide on an actual formula to use.

    -------------------

    Originally posted by edge:
    Do we need consider coverage and depth of data...
    Yes you do. In particular, for a non-normalized transcriptome or a sample that has not been rRNA-depleted, you need to be concerned with picking up low-expression genes.

    You do not give enough information for us to make an intelligent decision for your particular case (e.g., we would need information on the organism you are sequencing, the complexity of the genes for the organism, whether your sequence sample is normalized or not, etc.). However, we can play around with some very rough numbers.

    Let us assume that your sample is completely normalized. In other words each transcript (gene) is present once and only once in your sample. Assume a complex eukaryotic organism. Then our numbers could look like:

    100,000 genes at 1000 bases each ... equals a sequence space of 100 Mbase

    Desire 30x sequencing coverage ... means we need 3 Gbase of sequence.

    Your 14 Gb will do quite nicely.

    On the other hand, let us assume that you do not have a normalized sample. Then some genes will be present thousands of times, others only once. I am sure that there is some graph out there that describes this behavior and provides a multiplication factor, but I'll make a wild guess that this increases the amount of sequence needed by at least a factor of 10. Thus you would need 30 Gbase of sequence.
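For anyone who wants to plug in their own numbers, the two scenarios can be written as a small Python sketch. The function name and the `abundance_factor` parameter are made up for illustration; the factor of 10 is just the wild guess from this post, not a measured value.

```python
def required_bases(n_transcripts, mean_length_bp, coverage, abundance_factor=1):
    """Rough number of bases to sequence for a target mean coverage.

    abundance_factor is a hypothetical multiplier for non-normalized
    samples, where highly expressed genes soak up most of the reads.
    """
    sequence_space = n_transcripts * mean_length_bp  # total transcript bases
    return sequence_space * coverage * abundance_factor


# Normalized sample: 100,000 genes x 1000 bases at 30x coverage.
normalized = required_bases(100_000, 1_000, 30)

# Non-normalized sample: same sequence space, guessed factor of 10.
non_normalized = required_bases(100_000, 1_000, 30, abundance_factor=10)

print(f"normalized:     {normalized / 1e9:.0f} Gbase")
print(f"non-normalized: {non_normalized / 1e9:.0f} Gbase")
```

Changing the gene count, mean length or target coverage shows how quickly the estimate moves, which is the real point of the exercise.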

    The numbers above are very, very rough, so do not base your research on them. They are meant more as a way to say "... it depends ..."
    Last edited by westerman; 09-21-2011, 10:34 AM. Reason: Realized that 'edge' is not a newbie.


    • tbanks
      Member
      • Mar 2010
      • 11

      #3
      The following publication shows a number of simulations on transcriptome assembly and the effects of coverage and sequencing technology. It's a bit dated now but should help you out. I believe they also have some online software so you can do your own rough simulation.

      Wall PK, Leebens-Mack J, Chanderbali AS, Barakat A, Wolcott E, Liang H, Landherr L, Tomsho LP, Hu Y, Carlson JE, Ma H, Schuster SC, Soltis DE, Soltis PS, Altman N, dePamphilis CW. Comparison of next generation sequencing technologies for transcriptome characterization. BMC Genomics. 2009 Aug 1;10:347.


      • edge
        Senior Member
        • Sep 2009
        • 199

        #4
        Many thanks, westerman.

        I have an RNA-seq human lung sample, 2X100bp paired-end reads, with a total file size of 14GB right now.
        I plan to map my RNA-seq data against a transcriptome database downloaded from NCBI.
        After that, I plan to cluster the short reads according to the transcript group they map to.
        My problem is deciding the minimum number of paired-end reads to use as a cut-off for assembly.
        From the mapping results, some of the transcript groups have only about a thousand read pairs mapped to them.
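One rough way to think about such a cut-off is to convert read-pair counts into per-transcript coverage. The Python sketch below does that under stated assumptions: the 2,000 bp transcript length is hypothetical, the 30x target is only the rough figure used earlier in this thread, and reads are assumed to be spread evenly along the transcript, which real data rarely is.

```python
import math


def per_transcript_coverage(read_pairs, read_length_bp, transcript_length_bp):
    """Mean coverage of one transcript from paired-end reads mapped to it."""
    return read_pairs * 2 * read_length_bp / transcript_length_bp


def min_read_pairs(target_coverage, read_length_bp, transcript_length_bp):
    """Smallest number of read pairs that reaches the target mean coverage."""
    bases_needed = target_coverage * transcript_length_bp
    return math.ceil(bases_needed / (2 * read_length_bp))


# 1,000 pairs of 2x100 bp reads on a hypothetical 2,000 bp transcript.
coverage = per_transcript_coverage(1_000, 100, 2_000)

# Pairs needed for 30x on the same hypothetical transcript.
cutoff = min_read_pairs(30, 100, 2_000)

print(f"coverage with 1,000 pairs: {coverage:.0f}x")
print(f"pairs needed for 30x:      {cutoff}")
```

Because the cut-off scales with transcript length, a single fixed read-pair threshold treats short and long transcripts very differently; a per-transcript coverage threshold may be easier to justify.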

        Thanks for any advice.


        • mruizm
          Member
          • Apr 2013
          • 22

          #5
          Minimum depth of coverage in transcriptome assembly

          Hi everyone, I have 4.46 Gb of data from sequencing transcripts in various tissues, as Illumina MiSeq paired-end reads. I have assembled all these reads and found that the mean depth of coverage is 27.9X (depth of coverage = efficiency of sequencing / efficiency of assembly).
          My question is: what is the minimum depth of coverage needed to obtain robust information from the assembled transcriptome in a de novo transcriptome analysis?
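As a sanity check on a figure like this, one common definition of mean depth is total read bases divided by total assembled length. That is not necessarily the formula used in this post, so treat the Python sketch below as an assumption. It also assumes the 4.46 figure is in gigabases rather than file size.

```python
def mean_depth(total_read_bases, assembly_length_bp):
    """Mean depth as total read bases over total assembled length."""
    if assembly_length_bp <= 0:
        raise ValueError("assembly length must be positive")
    return total_read_bases / assembly_length_bp


def implied_assembly_length(total_read_bases, depth):
    """Assembly length that would give the stated mean depth."""
    if depth <= 0:
        raise ValueError("depth must be positive")
    return total_read_bases / depth


# Assumes 4.46 gigabases of reads and the stated 27.9x mean depth.
length = implied_assembly_length(4.46e9, 27.9)
print(f"implied assembly length: {length / 1e6:.0f} Mbase")
```

A mean hides the spread: in a non-normalized transcriptome a few abundant transcripts can carry most of the depth, so the per-transcript distribution is likely more informative than the average.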

          Thanks!
          Best regards!
