Header Leaderboard Ad

Collapse

How to estimating the genome size

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • #16
    Originally posted by aaronrjex View Post
    Hi yanij

    I don't know how useful this will be to you give the time since your post, but just in case....

    The BGI method is based around the observation that the coverage achieved for a genome is based on the size of the genome and the total amount of sequence data generated. So if you sequence 100 Mb of data for a 10 Mb genome, you should get ~10-fold coverage.

    Or as a simple equation: depth of coverage = total data / genome length.

    If you have any two of these parameters (i.e., you know the amount of data you generated and you know the genome size) obviously you can calculate the third.

    Usually when doing de novo genome sequencing you don't know the genome size, and since you don't have the genome, you don't know the coverage, but you do know how much data you've generated (i.e., the 'total sequence length' to use BGI's term). To estimate the genome size, you then need to estimate the coverage depth (N).

    To do this, you can calculate the kmer frequency within your read data (most people will do this for one of their small insert libraries for which they have the most information). Meaning you chop all of the reads you've generated up in to kmers (a kmer of 17 is the most common, as it is long enough to yield fairly specific sequences (meaning that its unlikely the kmer is repeated throughout the genome by chance), but short enough to give you lots of data). You then count the frequency with which each 17-mer represented by your data is found among all of the reads generated and create a frequency histogram of this information. For non-repetitive regions of the genome, this histogram should be normally distributed around a single peak (although in real data you will have a asymptote near 1 because of rare sequencing errors etc). This peak value (or peak depth) is the mean kmer coverage for your data.

    You can relate this value to the actual coverage of your genome using the formula M = N * (L – K + 1) / L, where M is the mean kmer coverage, N is the actual coverage of the genome, L is the mean read length and k is the kmer size.

    L -k +1 gives you the number of kmers created per read.

    So basically what the formula says is the kmer coverage for a genome is equal to the mean read coverage * the number of kmers per read divided by the read length.

    Because you know L (your mean read length) and k (the kmer you used to estimate peak kmer coverage) and you've calculated M (soapdenovo comes with a script called kmerfreq that will this), you simply solve the equation for N as:

    N = M/((L-k+1)/L) and calculate N.

    Once you have that, divide your total sequence data by N and you have your genome estimate.

    Hope that helps.
    This is a very helpful description of BGI's methods but I have one question. How do you determine M, what you call the mean k-mer coverage, from the k-mer frequency histogram produced by soapdenovo's pregraph program (the file with the *.kmerFreq extension), or with any other method? The first post in this thread quotes the methods from the Panda genome paper as saying M is the peak k-mer coverage (I guess on a normal distribution), and that is what is challenging to calculate.

    Comment


    • #17
      Hello folks,

      I am trying to understand the genome size estimation method, and here is the part that is not clear to me.

      BGI is estimating coverage depth by peak of frequency histogram, but in highly repetitive genomes the peak shifts down. E.g. check the second chart here, where we generated simulated reads from sea urchin genome at 50X coverage, but the histogram peak came at ~40 due to repeats.



      If that is true, how does BGI estimate genome size so precisely? In later section of bat paper, they claimed that the assembly size is close to the 'estimated size', which is puzzling given the impreciseness in genome size estimation.

      Maybe I am missing something. Please help !!

      ---------------

      Edit. I am rerunning the above simulation to make sure everything is done correctly. Results will be reported here.
      Last edited by samanta; 04-10-2013, 04:59 PM.
      http://homolog.us

      Comment


      • #18
        Lizhenyu from BGI explained where I made the error in thinking. When a read has 100 nucleotides and is split into 21-mers, the read will produce only 80 k-mers, not 100. Here is his full response, which agrees with aaronrjex's post above.



        "Hi,

        I think you mixed the concepts of base coverage depth and kmer coverage depth.
        When you refered 50X genome coverage, it meant base coverage which is obtained by

        total_base_num/genome_size=(read_num*read_length)/genome_size.

        Similarly, the kmer coverage depth, the peak value in kmer frequency curve, is calculated by

        total_kmer_num/genome_size=read_num*(read_length-kmer_size+1)/genome_size.

        So the relationship between base coverage depth and kmer coverage depth is:

        kmer_coverage_depth = base_coverage_depth*(read_length-kmer_size+1)/read_length.

        In your case, kmer_coverage_depth = 50 * (100 - 21 + 1)/100 = 40, which is exactly
        the peak value in you plot.

        best,"
        http://homolog.us

        Comment


        • #19
          Ok, thanks for the nice explanation
          Last edited by mflderks; 09-27-2013, 12:01 AM.

          Comment

          Latest Articles

          Collapse

          • seqadmin
            How RNA-Seq is Transforming Cancer Studies
            by seqadmin



            Cancer research has been transformed through numerous molecular techniques, with RNA sequencing (RNA-seq) playing a crucial role in understanding the complexity of the disease. Maša Ivin, Ph.D., Scientific Writer at Lexogen, and Yvonne Goepel Ph.D., Product Manager at Lexogen, remarked that “The high-throughput nature of RNA-seq allows for rapid profiling and deep exploration of the transcriptome.” They emphasized its indispensable role in cancer research, aiding in biomarker...
            09-07-2023, 11:15 PM
          • seqadmin
            Methods for Investigating the Transcriptome
            by seqadmin




            Ribonucleic acid (RNA) represents a range of diverse molecules that play a crucial role in many cellular processes. From serving as a protein template to regulating genes, the complex processes involving RNA make it a focal point of study for many scientists. This article will spotlight various methods scientists have developed to investigate different RNA subtypes and the broader transcriptome.

            Whole Transcriptome RNA-seq
            Whole transcriptome sequencing...
            08-31-2023, 11:07 AM

          ad_right_rmr

          Collapse

          News

          Collapse

          Topics Statistics Last Post
          Started by seqadmin, Today, 07:42 AM
          0 responses
          10 views
          0 likes
          Last Post seqadmin  
          Started by seqadmin, 09-22-2023, 09:05 AM
          0 responses
          23 views
          0 likes
          Last Post seqadmin  
          Started by seqadmin, 09-21-2023, 06:18 AM
          0 responses
          16 views
          0 likes
          Last Post seqadmin  
          Started by seqadmin, 09-20-2023, 09:17 AM
          0 responses
          16 views
          0 likes
          Last Post seqadmin  
          Working...
          X