  • Coverage & insert size estimation

    I have Illumina paired-end reads (1.fastq and 2.fastq) from genomic DNA sequenced on a HiSeq 2000. Each read file is around 100 Gb, and I have limited computing power (short on memory and disk space; I am working on a Dell workstation). I need to know the coverage and insert size for my genome without doing a de novo assembly and mapping the reads to it. I know of some tools that calculate insert size and coverage from SAM/BAM files, such as Qualimap and qatools.

    1. Is there any tool that gives an approximate estimate of coverage and insert size without assembling and mapping?

    2. If no such tool is available, can I extract a random 10% of the reads, assemble them de novo, and map the reads back to find coverage and insert size?

  • #2
    Depending on the length and insert size of the reads, you can get an insert size histogram via overlap, which is fast and does not require assembly or mapping. You can do that like this:

    bbmerge.sh in1=1.fastq in2=2.fastq ihist=ihist.txt reads=2000000

    ...which will just process the first two million reads. However, if the insert size is long enough that they don't overlap, it won't work and you need to assemble and map. Whether or not you can assemble only 10% of the reads depends on how much coverage you have. Do you know what kind of organism it is, or is it a metagenome?
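If you want a quick summary of the insert-size histogram, something like the sketch below might help. It assumes ihist.txt is a tab-separated table of insert size and count, with any header or summary lines starting with '#'; check the real file before relying on that. The function name ihist_summary is made up for illustration and is not part of BBTools.

```shell
# Hypothetical helper (not a BBTools command): print the mode, the
# count-weighted mean, and the number of pairs from a two-column
# (insert size, count) histogram. Lines starting with '#' are skipped.
ihist_summary() {
    awk '
        /^#/ || NF < 2 { next }
        {
            size = $1 + 0; count = $2 + 0
            total += count
            weighted += size * count
            if (count > best) { best = count; mode = size }
        }
        END {
            if (total > 0)
                printf "mode=%d mean=%.1f pairs=%d\n", mode, weighted / total, total
        }
    ' "$1"
}
```

Usage would be `ihist_summary ihist.txt`. Keep in mind the histogram only contains pairs that actually merged, so if few pairs overlap the numbers are biased towards short inserts.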

    You can estimate coverage via kmer-counting, like this:

    khist.sh in1=1.fastq in2=2.fastq hist=hist.txt


    Then you look at the histogram and find the first major peak, which tells you the approximate coverage. You could also speed it up by limiting it to some fraction of the total reads and then scaling the result by a factor.
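A rough sketch of the peak-finding and scaling steps, in case you would rather not eyeball the histogram. It assumes the first column of hist.txt is the k-mer depth and that the count column is column 2 (pass a different column number if your file differs); both are assumptions to check against the header of your own file. The function names are hypothetical. Note that it reports the tallest peak after the low-depth error valley, which on a heterozygous genome is not necessarily the first peak, so still look at the plot.

```shell
# Hypothetical helper: skip the initial falling slope (error k-mers),
# then report the depth with the highest count after the valley.
# $1 = histogram file, $2 = count column (defaults to 2, an assumption).
main_kmer_peak() {
    awk -v col="${2:-2}" '
        /^#/ || NF < col { next }
        {
            depth = $1 + 0; count = $col + 0
            if (!past_valley) {
                # The valley ends at the first depth whose count rises again.
                if (seen && count > prev) past_valley = 1
                prev = count; seen = 1
            }
            if (past_valley && count > best) { best = count; peak = depth }
        }
        END { if (past_valley) print peak }
    ' "$1"
}

# Scale a peak found on a subsample back up to the full dataset.
# $1 = peak depth from the subsample, $2 = fraction of reads used (e.g. 0.25).
scale_coverage() {
    awk -v p="$1" -v f="$2" 'BEGIN { printf "%.1f\n", p / f }'
}
```

For example, a peak at depth 7 found using a quarter of the reads would scale as `scale_coverage 7 0.25`. Real histograms are noisy, so a small bump can fool the simple valley test; treat the result as a starting point.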

    Both of these are in the BBTools package. Note that these command lines are for Linux. If your computer uses Windows, the commands would be slightly different.



    • #3
      @Brian Bushnell - Thanks for your suggestion.

      I am working with the genome of a fruit crop plant. Can I use 10% of the reads for a de novo assembly, then map those same reads back to estimate coverage, insert size, and heterozygosity? Would this analysis be enough for an approximate estimate of these metrics?
      Another thing: is there a tool to estimate the heterozygosity rate from mapped reads?



      • #4
        I would rather do this with the whole dataset. If you have enough coverage (approx. >30-fold) the k-mer graph should not only be able to give you a hint about the genome size and coverage but also heterozygosity.

        I haven't worked with the BBTools package yet, only with Jellyfish and SOAPec. There is also a tool available for estimating these characteristics (see the attached paper). The figures in there might also be helpful for understanding the k-mer graph:

        Estimation of genomic characteristics by analyzing k-mer frequency in de novo genome projects
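To illustrate the genome size hint: the usual back-of-the-envelope estimate is the total number of k-mers divided by the k-mer depth at the main peak. The sketch below assumes a Jellyfish-style two-column histogram (depth, number of distinct k-mers at that depth) and that you have already picked the peak depth and a minimum depth below which k-mers are treated as sequencing errors. The function name and its arguments are hypothetical.

```shell
# Hypothetical helper: genome size ~ (sum of depth * distinct k-mers) / peak depth.
# $1 = two-column histogram (depth, distinct k-mer count; an assumed format)
# $2 = depth of the main peak
# $3 = minimum depth to include (lower depths are treated as error k-mers)
genome_size_estimate() {
    awk -v peak="$2" -v min="$3" '
        /^#/ || NF < 2 { next }
        ($1 + 0) >= min { kmers += $1 * $2 }
        END { if (peak > 0) printf "%.0f\n", kmers / peak }
    ' "$1"
}
```

Usage would look like `genome_size_estimate hist.txt 30 5`. This ignores heterozygosity and repeats, so it is only a rough figure.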



        • #5
          Originally posted by bioman1:
          @Brian Bushnell - Thanks for your suggestion.

          I am working with the genome of a fruit crop plant. Can I use 10% of the reads for a de novo assembly, then map those same reads back to estimate coverage, insert size, and heterozygosity? Would this analysis be enough for an approximate estimate of these metrics?
          Another thing: is there a tool to estimate the heterozygosity rate from mapped reads?

          First, try merging the reads by overlap; you will know in under a minute whether the reads overlap or not (based on the percentage merged). If they do, then the insert size question is solved.

          The kmer histogram can give you an estimate of the genome size, repetitiveness, AND the heterozygosity. There's really no way to tell whether 10% is enough for assembly without a genome size estimate. If you have 200 Gbp in total, 10% of it (20 Gbp) would give roughly 30x coverage for a ~700 Mbp organism, which is very small for a tree (even ignoring the ploidy).
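The arithmetic behind that 30x figure can be sketched as below: coverage is simply the bases you keep divided by the genome size. The function name is made up, and the genome size is whatever estimate you have, not a known value.

```shell
# Hypothetical helper: expected coverage after subsampling.
# $1 = total sequenced bases, $2 = fraction of reads kept, $3 = genome size in bases
expected_coverage() {
    awk -v total="$1" -v frac="$2" -v genome="$3" \
        'BEGIN { printf "%.1f\n", total * frac / genome }'
}

# 10% of 200 Gbp over a 700 Mbp genome:
expected_coverage 200000000000 0.1 700000000   # prints 28.6
```

Plugging in a larger genome size shows how quickly a 10% subsample drops below a useful depth.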

          By the way, you can also do normalization and subsampling with BBTools, either of which will reduce the read count. For example, you could normalize to approximately 30x coverage like this:

          bbnorm.sh in1=1.fastq in2=2.fastq hist=hist.txt out=normalized.fq target=30

          ...which will automatically determine how many reads you need to get a uniform 30x coverage. It's slower than sampling, but not too bad. The output from that command would be interleaved.
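Since the normalized output is interleaved, tools that expect two separate files will need it split again. A minimal awk sketch is below, assuming standard 4-line FASTQ records with read 1 and read 2 of each pair alternating; the function name is hypothetical, and BBTools itself can do this kind of format conversion, so this is only to show what "interleaved" means in practice.

```shell
# Hypothetical helper: split an interleaved FASTQ into two files.
# Assumes 4-line records, alternating read 1 / read 2 of each pair.
# $1 = interleaved input, $2 = output for read 1, $3 = output for read 2
deinterleave() {
    awk -v r1="$2" -v r2="$3" '
        {
            record = int((NR - 1) / 4)      # 0-based record index
            if (record % 2 == 0) { print > r1 } else { print > r2 }
        }
    ' "$1"
}
```

Usage would be `deinterleave normalized.fq norm_1.fastq norm_2.fastq`. It does no validation, so a truncated or multi-line FASTQ would silently produce mismatched files.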
          Last edited by Brian Bushnell; 06-11-2014, 09:22 AM.
