  • Genome size estimation

    Hi,
    I have a de novo assembly made with CLC, and I have also mapped all the read files back to that de novo assembly. Now I want to estimate the genome size from this information.

    Can anybody tell me how I could do that?

  • #2
    The sum of the size of your contigs/scaffolds from the assembly should be your genome size.
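A minimal sketch of that sum, assuming the assembly is a plain FASTA file of contigs/scaffolds. The function name and the file name `contigs.fa` in the usage note are made up for illustration; they do not come from CLC.

```python
def assembly_size(fasta_path):
    """Return (total_bases, n_sequences) for a FASTA file."""
    total_bases = 0
    n_sequences = 0
    with open(fasta_path) as handle:
        for line in handle:
            line = line.strip()
            if line.startswith(">"):
                # Header line: one per contig/scaffold.
                n_sequences += 1
            else:
                # Sequence line: count its bases (blank lines add 0).
                total_bases += len(line)
    return total_bases, n_sequences
```

Calling `assembly_size("contigs.fa")` would give the summed contig length and the number of contigs. Scaffold gaps written as N are counted as bases here, which may or may not be what you want.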


    • #3
      Originally posted by lletourn View Post
      The sum of the size of your contigs/scaffolds from the assembly should be your genome size.
      The 'true' target genome size will likely differ from the summed length of the assembly. The assembly can come out smaller because repeats and near-repeats collapse, or because low-coverage regions are missing.

      It can also come out larger because of 'extras' such as contaminants (sequence in the assembly that is not in the target genome) and possibly the "over-expansion" of heterozygous regions into separate contigs.


      • #4
        For sure, but it won't be 3, 4 or 5 times bigger. It's a pretty close approximation.

        Another way of computing genome size without an assembly is to count k-mer coverage. An example of this can be found here:


        The problem with this approach arises if you have plasmids or chloroplasts (small genomes at high coverage). These will skew the graph.


        • #5
          Originally posted by lletourn View Post
          For sure, but it won't be 3, 4 or 5 times bigger. It's a pretty close approximation.

          Another way of computing genome size without an assembly is to count k-mer coverage. An example of this can be found here:


          The problem with this approach arises if you have plasmids or chloroplasts (small genomes at high coverage). These will skew the graph.
          Basically, you want to find the inflection point.

          Then, find the minimum left of this inflection point and discard all k-mers on the left of this minimum.

          This usually works well.

          Finally, sum the k-mer counts at every coverage value greater than or equal to that minimum.

          That sum is your estimated number of genome k-mers multiplied by 2. Divide it by 2 and there you go.
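The steps above can be sketched as follows, assuming the coverage distribution is already loaded as a dict mapping coverage to the number of distinct k-mers at that coverage. The function and parameter names are hypothetical. The sketch locates the minimum first, by walking down the initial run of erroneous low-coverage k-mers, and then takes the peak to its right, because the error k-mers at coverage 1 are usually the tallest bar. The division by 2 only applies when the counter reports the two strands separately, so it is a switch here.

```python
def estimate_genome_kmers(hist, strands_counted_separately=True):
    """Return (minimum, peak, estimated_genome_kmers).

    hist maps coverage -> number of distinct k-mers seen at that coverage.
    """
    depths = sorted(d for d in hist if d > 0)
    # Walk down the initial descending run of erroneous k-mers;
    # the walk stops at the first local minimum.
    i = 0
    while i + 1 < len(depths) and hist[depths[i + 1]] < hist[depths[i]]:
        i += 1
    minimum = depths[i]
    # The peak is the tallest bar at or to the right of the minimum.
    peak = max(depths[i:], key=lambda d: hist[d])
    # Keep every k-mer whose coverage is >= the minimum.
    genome_kmers = sum(hist[d] for d in depths if d >= minimum)
    if strands_counted_separately:
        genome_kmers //= 2
    return minimum, peak, genome_kmers


# Toy distribution, not real data.
toy = {1: 1000, 2: 200, 3: 50, 4: 80, 5: 150, 6: 200, 7: 160, 8: 90, 9: 30}
print(estimate_genome_kmers(toy))
```

A noisy real histogram can have small bumps that stop the walk too early, so it is worth plotting the distribution and checking the minimum by eye.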

          Example: http://postimage.org/image/1p5t8wmsk/

          As highlighted in your link, you want to discard the erroneous k-mers by fitting a distribution with a known equation.


          The Ray assembler generates such a coverage distribution.

          see http://denovoassembler.sf.net

          -seb


          • #6
            Thanks seb, but I want to know how to find the exact k-mer coverage of my genome. Should it be the inflection point you mentioned or the peak of the histogram?

            Then why "This will be your estimated genome k-mers multiplied by 2. Divide this number by 2 and there you go"? I am a little bit confused here.

            Why can't I divide the sum of all coverage contributions by this k-mer coverage (if it really is the inflection point)?

            Thanks again
            Moinul


            • #7
              Originally posted by moinul View Post
              Thanks seb, but I want to know how to find the exact k-mer coverage of my genome. Should it be the inflection point you mentioned or the peak of the histogram?

              Then why "This will be your estimated genome k-mers multiplied by 2. Divide this number by 2 and there you go"? I am a little bit confused here.

              Why can't I divide the sum of all coverage contributions by this k-mer coverage (if it really is the inflection point)?

              Thanks again
              Moinul
              The peak occurs at the coverage depth I called the inflection point. Strictly speaking it is a local maximum, where the derivative is equal to 0 (a true inflection point is where the second derivative changes sign).

              However, if the peak is at 55, then we can say that on average, a unique region of the genome has a k-mer coverage of 55. But any unique region can also have a k-mer coverage of 54.


              In the k-mer coverage depth distribution, both erroneous k-mers and genome k-mers are present. If you take the sum of the number of k-mers at each coverage depth, you will obtain a number that includes erroneous k-mers.

              Using the minimum before the peak or fitting a gamma distribution are ways to eliminate the erroneous k-mers, although neither method is perfect.

              Your DNA reads originate from either the forward or the reverse strand of the genome. Since we can't know for sure which, k-mer counters consider both strands.

              So, the count will include both strands. Therefore you must divide by 2.
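A small illustration of the strand argument, using a made-up 11-base "genome" and k = 5 (all names here are invented for the example). Counting the sequence and its reverse complement separately gives twice as many distinct k-mers as merging each k-mer with its reverse complement under one key. That merged form is usually called canonical counting, and it is what the `-C` flag in the jellyfish command later in this thread requests, so a canonical count should not be halved again.

```python
from collections import Counter

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def kmers(seq, k):
    """List every k-mer of seq, left to right."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def count_both_strands(seq, k):
    """Count k-mers of seq and of its reverse complement as separate keys."""
    return Counter(kmers(seq, k) + kmers(reverse_complement(seq), k))


def count_canonical(seq, k):
    """Count each k-mer and its reverse complement under a single key."""
    return Counter(min(km, reverse_complement(km)) for km in kmers(seq, k))


toy_genome = "ACGGTCATTGC"  # invented sequence, 7 k-mers for k = 5
print(len(count_both_strands(toy_genome, 5)))  # distinct k-mers, strands separate
print(len(count_canonical(toy_genome, 5)))     # distinct canonical k-mers
```

With an odd k no k-mer can equal its own reverse complement, which keeps the factor of 2 exact for a sequence without repeats.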
              Last edited by seb567; 06-01-2011, 06:17 AM. Reason: typo


              • #8
                Jellyfish-kmer, genome size estimation

                Dear all,

                I am running Jellyfish (jellyfish-2.1.1) for the first time to estimate genome size. Although I followed the manual, I am a bit confused about how to estimate the genome size. Below are my steps for k = 27. Did I get the correct genome size estimate? If I want to try different k-mer sizes to find the best k and genome size, how do I plot the results? If anybody has a script that plots different k-mer sizes and finds the best k and genome size, please share it with me.


                Quote:
                jellyfish count -m 27 -s 100M -t 10 -C sample.filtered.fastq

                jellyfish histo -f mer_counts.jf > histogram.txt

                jellyfish stats -v -o stats.txt mer_counts.jf
                less stats.txt
                Unique: 659211049
                Distinct: 2297173537
                Total: 31359408599
                Max_count: 16054234
                (END)


                less histogram.txt (first 10 rows)
                0 0
                1 659211049
                2 94535838
                3 109738065
                4 125218564
                5 126564348
                6 117188987
                7 103591231
                8 90823407
                9 80950377
                10 74112334

                Genome size estimation = total number of k-mers - distinct error k-mers
                Genome size estimation = 31359408599 - 2297173537 = 29062235062
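For reference, a sketch of how the histogram relates to the stats fields, using the rows posted above. Unique is the number of k-mers seen exactly once (the coverage-1 row), Distinct is the sum of the second column, and Total is the sum of coverage times count. The last function is the commonly used depth-weighted estimator (k-mer occurrences at or above the error minimum, divided by the peak depth). It is a different calculation from the subtraction above, and the function names and the `minimum`/`peak` arguments are assumptions you would have to read off the full histogram.

```python
POSTED_ROWS = """\
0 0
1 659211049
2 94535838
3 109738065
4 125218564
5 126564348
6 117188987
7 103591231
8 90823407
9 80950377
10 74112334
"""


def parse_histo(text):
    """Parse 'coverage count' lines into a dict, skipping coverage 0."""
    hist = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 2:
            continue
        depth, count = int(fields[0]), int(fields[1])
        if depth > 0:
            hist[depth] = count
    return hist


def summarize(hist):
    """Return (unique, distinct, total) recomputed from the histogram."""
    unique = hist.get(1, 0)
    distinct = sum(hist.values())
    total = sum(depth * count for depth, count in hist.items())
    return unique, distinct, total


def genome_size_from_depth(hist, minimum, peak):
    """Depth-weighted estimate: k-mer occurrences at coverage >= minimum,
    divided by the peak coverage."""
    kept = sum(d * n for d, n in hist.items() if d >= minimum)
    return kept / peak


hist = parse_histo(POSTED_ROWS)
print(summarize(hist)[0])  # coverage-1 row, i.e. Unique
```

Only the first rows were posted, so Distinct, Total and any genome size computed from them here would be meaningless; the full histogram.txt is needed. As far as I know, jellyfish histo also folds everything above its upper limit into the last row, so with a Max_count this large the recomputed Total may not match stats.txt exactly.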


                • #9
                  You should probably drop k lower, but it’s hard to tell when only seeing coverage out to 10. If that’s your “real” peak at coverage=5, k is much too big for your sequencing depth. Of course, I’ve seen double dips in the peaks before for different reasons (meaning you might have a second local maximum at a higher coverage), such as heterozygosity or contaminants, but dropping k should be the first step.
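The link between k and k-mer coverage behind this advice can be written down. For error-free reads of length L at base coverage C, the expected k-mer coverage is C * (L - k + 1) / L, so a smaller k moves the peak to the right. The sketch below uses an assumed read length of 100 and an assumed base coverage of 10; neither number comes from this thread, and the function name is hypothetical.

```python
def expected_kmer_coverage(base_coverage, read_length, k):
    """Expected k-mer coverage for error-free reads of a fixed length."""
    if not 1 <= k <= read_length:
        raise ValueError("k must be between 1 and the read length")
    # Each read of length L contributes L - k + 1 k-mers instead of L bases.
    return base_coverage * (read_length - k + 1) / read_length


# Assumed values for illustration only.
for k in (17, 21, 27):
    print(k, expected_kmer_coverage(10, 100, k))
```

Sequencing errors lower the real peak further, because a single error spoils up to k overlapping k-mers, and that effect also grows with k.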


                  • #10
                    @wallysbo1, thanks for the suggestion.

                    My data were sequenced on a HiSeq 2000 (whole-genome shotgun approach). They are Illumina paired-end reads, filtered with Trimmomatic and then interleaved using the velvet shuffleseq.pl script. I checked them with FastQC: they passed all the tests, with warnings for 'per sequence content' and 'sequence duplication level'.

                    I have tried a lower k-mer size of 17 (histor_17.txt included). I have tried 15, 17, 21, 25, 27 and 30, and I see the peak at 5 each time. Any suggestions? Are the reads filtered properly? How do I estimate the genome size?

                    histor_17.txt: please see the attachment at the link below
                    http://www.fileswap.com/dl/rSoQKp5sLi/
