  • Estimating heterozygosity from kmer frequency distribution

    Is there a program that can estimate the heterozygosity of a sample from the k-mer frequency distribution of the raw reads? I have whole-genome Illumina data (100 bp PE reads from 300 bp fragments). The k-mer frequency plot has a clear bimodal distribution, so I can get a rough estimate by eyeballing the areas under the curves for the two peaks. I am hoping to find a more robust and more automated method, since I have over 100 samples.
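
The "areas under the two peaks" estimate can be written down explicitly. What follows is only a rough sketch under a simple model that the post does not state: a diploid genome, heterozygous sites that are independent SNPs, no repeats, and error k-mers already excluded. Under that model a k-mer is homozygous with probability (1 - h)^k, where h is the per-base heterozygosity, so the half-coverage peak holds about 2G(1 - (1 - h)^k) distinct k-mers and the full-coverage peak about G(1 - h)^k. The function names and the peak boundaries (`lo`, `hi`) are hypothetical; the boundaries have to be picked from the plot.

```python
def peak_area(hist, lo, hi):
    """Sum the distinct k-mer counts for depths lo..hi (inclusive).

    hist maps depth -> number of distinct k-mers seen at that depth.
    """
    return sum(count for depth, count in hist.items() if lo <= depth <= hi)


def het_rate_from_areas(n_het, n_hom, k):
    """Rough per-base heterozygosity from the two peak areas.

    n_het: distinct k-mers in the half-coverage (heterozygous) peak
    n_hom: distinct k-mers in the full-coverage (homozygous) peak
    Assumes a diploid genome with independent SNPs and no repeats.
    """
    if n_het < 0 or n_hom <= 0 or k <= 0:
        raise ValueError("need n_het >= 0, n_hom > 0 and k > 0")
    # Each heterozygous genome position contributes two distinct k-mers,
    # so n_het / 2 converts the het peak back to genome positions.
    p_hom = n_hom / (n_hom + n_het / 2.0)
    # p_hom = (1 - h)^k  =>  h = 1 - p_hom^(1/k)
    return 1.0 - p_hom ** (1.0 / k)
```

For example, with a histogram loaded into a dict, `het_rate_from_areas(peak_area(hist, 8, 18), peak_area(hist, 19, 40), k=21)` gives the estimate for boundaries 8-18 and 19-40, which are made-up values here. Repeats and indels violate the model, so this is a sanity check rather than a replacement for a fitted model.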

  • #2
    Bump.

    I'm afraid I don't have an answer either.
    I have been asking myself the same question and wondered whether you were able to solve it?


    • #3
      Perhaps you want to look into Ka/Ks estimation.


      • #4
        I just came across this paper on arXiv: "Estimation of genomic characteristics by analyzing k-mer frequency in de novo genome projects"

        Background: With the fast development of next generation sequencing technologies, increasing numbers of genomes are being de novo sequenced and assembled. However, most are in fragmental and incomplete draft status, and thus it is often difficult to know the accurate genome size and repeat content. Furthermore, many genomes are highly repetitive or heterozygous, posing problems to current assemblers utilizing short reads. Therefore, it is necessary to develop efficient assembly-independent methods for accurate estimation of these genomic characteristics.

        Results: Here we present a framework for modeling the distribution of k-mer frequency from sequencing data and estimating the genomic characteristics such as genome size, repeat structure and heterozygous rate. By introducing novel techniques of k-mer individuals, float precision estimation, and proper treatment of sequencing error and coverage bias, the estimation accuracy of our method is significantly improved over existing methods. We also studied how the various genomic and sequencing characteristics affect the estimation accuracy using simulated sequencing data, and discussed the limitations on applying our method to real sequencing data.

        Conclusion: Based on this research, we show that the k-mer frequency analysis can be used as a general and assembly-independent method for estimating genomic characteristics, which can improve our understanding of a species genome, help design the sequencing strategy of genome projects, and guide the development of assembly algorithms. The programs developed in this research are written using C/C++, and freely accessible at Github URL (https://github.com/fanagislab/GCE) or BGI ftp (ftp://ftp.genomics.org.cn/pub/gce).

        I have not tried their tool though! It is available at ftp://ftp.genomics.org.cn/pub/gce/
        Best,
        ~wormSeeq.


        • #5
          Hmmm, I wrote a program that does this. Well, two, actually. Their usage is about the same.

          khist.sh in=reads.fq khist=khist.txt peaks=peaks.txt
          or
          kmercountexact.sh in=reads.fq khist=khist.txt peaks=peaks.txt

          The first uses approximate counts, while the second uses exact counts (and thus potentially more memory). The header of the peaks file contains estimates of genome size and heterozygosity. You can also add the flag "ploidy=2" for diploid organisms, so that the program does not have to autodetect the ploidy (and potentially get it wrong).

          These are both distributed with BBTools.
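
For the 100+ samples in the original question, the command above could be wrapped in a small batch script. This is a hedged sketch, not part of BBTools: it uses only the `in=`, `khist=`, `peaks=` and `ploidy=` arguments quoted in this post, and the helper names (`build_command`, `parse_peaks_header`, `run_all`) are hypothetical. The parser assumes that peaks-file header lines look like `#name<TAB>value`; that layout is an assumption, so check it against a real peaks file before relying on it.

```python
import subprocess
from pathlib import Path


def build_command(reads, out_dir, tool="kmercountexact.sh", ploidy=2):
    """Build the command line for one sample; returns (cmd, peaks_path)."""
    reads = Path(reads)
    out_dir = Path(out_dir)
    stem = reads.name.split(".")[0]  # "sampleA.fq.gz" -> "sampleA"
    khist = out_dir / f"{stem}.khist.txt"
    peaks = out_dir / f"{stem}.peaks.txt"
    cmd = [
        tool,
        f"in={reads}",
        f"khist={khist}",
        f"peaks={peaks}",
        f"ploidy={ploidy}",
    ]
    return cmd, peaks


def parse_peaks_header(text):
    """Collect header lines of the ASSUMED form '#name<TAB>value' into a dict.

    Lines that do not start with '#' or that have more than two
    tab-separated fields (such as a column-title row) are skipped.
    """
    fields = {}
    for line in text.splitlines():
        if not line.startswith("#"):
            continue
        parts = line[1:].split("\t")
        if len(parts) == 2:
            fields[parts[0].strip()] = parts[1].strip()
    return fields


def run_all(read_files, out_dir):
    """Run the tool on every file and gather the parsed peaks headers."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    results = {}
    for reads in read_files:
        cmd, peaks = build_command(reads, out_dir)
        subprocess.run(cmd, check=True)  # raises if the tool exits non-zero
        results[str(reads)] = parse_peaks_header(peaks.read_text())
    return results
```

Calling `run_all(sorted(Path("reads").glob("*.fq")), "kmer_out")` would process every `.fq` file in a hypothetical `reads/` directory. Passing `tool="khist.sh"` to `build_command` switches to the approximate counter, given that the two programs are described as having about the same usage.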
