
  • Coverage & insert size estimation

    I have Illumina paired-end genomic reads (1.fastq and 2.fastq) sequenced on a HiSeq 2000. The read files are large (around 100 Gb each) and my computing power is limited (short on memory and disk space; I am working on a Dell workstation). I need to know the coverage and insert size for my genome without doing a de novo assembly and mapping the reads to it. I know of some tools that calculate insert size and coverage from a SAM/BAM file, such as Qualimap and qatools.

    1. Is there any tool that approximately estimates coverage and insert size without assembling and mapping?

    2. If no such tool is available, can I extract a random 10% of the reads, assemble them de novo, and map the reads back to find coverage and insert size?

  • #2
    Depending on the length and insert size of the reads, you can get an insert size histogram via overlap, which is fast and does not require assembly or mapping. You can do that like this:

    bbmerge.sh in1=1.fastq in2=2.fastq ihist=ihist.txt reads=2000000

    ...which will just process the first two million reads. However, if the insert size is long enough that they don't overlap, it won't work and you need to assemble and map. Whether or not you can assemble only 10% of the reads depends on how much coverage you have. Do you know what kind of organism it is, or is it a metagenome?
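    To illustrate the idea behind overlap-based insert size estimation, here is a minimal Python sketch. It is not BBMerge's actual algorithm; the function names, the `min_overlap` default, and the mismatch tolerance are assumptions chosen for illustration. It only handles inserts at least as long as one read, and returns `None` when the mates do not overlap.

```python
def revcomp(seq):
    """Reverse-complement a DNA string (A, C, G, T, N only)."""
    comp = {"A": "T", "C": "G", "G": "C", "T": "A", "N": "N"}
    return "".join(comp[base] for base in reversed(seq))


def insert_size_by_overlap(read1, read2, min_overlap=12, max_mismatch_frac=0.1):
    """Infer the insert size from the overlap between two mates.

    read2 is reverse-complemented so both reads are on the same strand,
    then the suffix of read1 is compared with the prefix of read2.
    Returns None if no acceptable overlap is found.
    """
    r2 = revcomp(read2)
    longest = min(len(read1), len(r2))
    # Try the longest candidate overlap first, then shrink.
    for overlap in range(longest, min_overlap - 1, -1):
        tail = read1[-overlap:]
        head = r2[:overlap]
        mismatches = sum(a != b for a, b in zip(tail, head))
        if mismatches <= max_mismatch_frac * overlap:
            # Fragment length = both reads minus the shared part.
            return len(read1) + len(r2) - overlap
    return None
```

    For example, two 100 bp mates that overlap by 50 bp imply a 150 bp insert. A histogram of these values over a few million pairs is the kind of information an insert size histogram file summarizes.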

    You can estimate coverage via kmer-counting, like this:

    khist.sh in1=1.fastq in2=2.fastq hist=hist.txt


    Then you look at the histogram and find the first major peak, which tells you the approximate coverage. You could also speed it up by limiting it to some fraction of the total reads and then scaling the result by a factor.
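    As a rough sketch of reading such a histogram programmatically, the Python below assumes a whitespace-separated text file whose first two columns are depth and count, with `#` marking header lines (check your actual file before relying on this). It finds the low-depth valley left by error k-mers and then the tallest peak after it. In a heterozygous genome the first major peak can be a smaller one at about half that depth, so it is still worth looking at the plot. The helper names, and the read length and k used in the conversion, are illustrative assumptions.

```python
def read_hist(lines):
    """Parse depth/count pairs, skipping blank and '#' header lines."""
    hist = {}
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        fields = line.split()
        hist[int(fields[0])] = int(fields[1])
    return hist


def first_major_peak(hist):
    """Return (valley_depth, peak_depth).

    The valley is where counts stop falling after the initial spike of
    low-depth (mostly error) k-mers; the peak is the tallest bin at or
    beyond the valley.
    """
    depths = sorted(hist)
    i = 0
    while i + 1 < len(depths) and hist[depths[i + 1]] < hist[depths[i]]:
        i += 1
    valley = depths[i]
    peak = max((d for d in depths if d >= valley), key=lambda d: hist[d])
    return valley, peak


def scale_to_full(peak_depth, fraction_used):
    """Scale a peak measured on a fraction of the reads to the full dataset."""
    return peak_depth / fraction_used


def kmer_to_base_coverage(kmer_cov, read_len, k):
    """Convert k-mer coverage to base coverage: C = Ck * L / (L - k + 1)."""
    return kmer_cov * read_len / (read_len - k + 1)
```

    With a peak at depth 7 measured on a quarter of the reads, `scale_to_full(7, 0.25)` gives a k-mer coverage of 28 for the full dataset.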

    Both of these are in the BBTools package. Note that these command lines are for Linux. If your computer uses Windows, the commands would be slightly different.



    • #3
      @Brian Bushnell- Thanks for your suggestion.

      I am working with the genome of a fruit crop plant. Can I assemble 10% of the reads de novo and map those same reads back to the assembly to estimate coverage, insert size and heterozygosity? Would this analysis be enough for an approximate estimate of these metrics?
      Another thing: is there a tool to estimate the heterozygosity rate from mapped reads?



      • #4
        I would rather do this with the whole dataset. If you have enough coverage (approx. >30-fold) the k-mer graph should not only be able to give you a hint about the genome size and coverage but also heterozygosity.

        I haven't worked with the BBTools package yet, only with Jellyfish and SOAPec. There is also a tool available for estimating these characteristics (see the attached paper). The figures in it might also help with understanding the k-mer graph:

        Estimation of genomic characteristics by analyzing k-mer frequency in de novo genome projects



        • #5
          Originally posted by bioman1
          @Brian Bushnell- Thanks for your suggestion.

          I am working with the genome of a fruit crop plant. Can I assemble 10% of the reads de novo and map those same reads back to the assembly to estimate coverage, insert size and heterozygosity? Would this analysis be enough for an approximate estimate of these metrics?
          Another thing: is there a tool to estimate the heterozygosity rate from mapped reads?

          First, try merging the reads by overlap; you will know in under a minute whether the reads overlap or not (based on the percentage merged). If they do, then the insert size question is solved.

          The kmer histogram can give you an estimate of the genome size, repetitiveness, AND the heterozygosity. There's really no way to tell whether 10% is enough for assembly without a genome size estimate. If you have 200Gbp in total, 10% of that (20Gbp) would give 30x coverage for a ~700Mbp organism, which is very small for a tree (even ignoring the ploidy).
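          The arithmetic behind that estimate can be written out as a small Python sketch. The function names are made up for illustration, and the 200 Gbp total is the assumption used in this post, not a measured value.

```python
def coverage(total_bases, genome_size):
    """Average depth = sequenced bases / genome size."""
    return total_bases / genome_size


def max_genome_size(total_bases, fraction, target_coverage):
    """Largest genome that a given fraction of the data covers at the target depth."""
    return total_bases * fraction / target_coverage


# 10% of an assumed 200 Gbp dataset, aiming for 30x:
size = max_genome_size(200e9, 0.10, 30)
print(f"{size / 1e6:.0f} Mbp")  # about 667 Mbp, i.e. the ~700 Mbp figure
```

          If the genome turns out to be larger than that, 10% of the reads gives proportionally less than 30x, which is why a genome size estimate has to come first.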

          By the way, you can also do normalization and subsampling with BBTools, either of which will reduce the read count. For example, you could normalize to approximately 30x coverage like this:

          bbnorm.sh in1=1.fastq in2=2.fastq hist=hist.txt out=normalized.fq target=30

          ...which will automatically determine how many reads you need to get a uniform 30x coverage. It's slower than sampling, but not too bad. The output from that command would be interleaved.
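          For the plain subsampling alternative (as opposed to normalization), the key point is that both files must be sampled in lockstep so mates stay paired. Here is a minimal Python sketch of that idea, assuming standard 4-line FASTQ records in the same order in both files. The function names and the seed are illustrative, and this is not how BBTools implements sampling.

```python
import random
from itertools import islice


def fastq_records(handle):
    """Yield FASTQ records as lists of 4 lines (header, bases, '+', quals)."""
    while True:
        record = list(islice(handle, 4))
        if len(record) < 4:
            return
        yield record


def subsample_pairs(fq1, fq2, out1, out2, fraction, seed=42):
    """Keep each read pair with probability `fraction`.

    One random draw decides the fate of both mates, so the output files
    stay in sync. Returns the number of pairs written.
    """
    rng = random.Random(seed)
    kept = 0
    for rec1, rec2 in zip(fastq_records(fq1), fastq_records(fq2)):
        if rng.random() < fraction:
            out1.writelines(rec1)
            out2.writelines(rec2)
            kept += 1
    return kept
```

          With real data you would pass file objects from `open("1.fastq")` and `open("2.fastq")` plus two output files opened for writing. Memory use stays constant because only one pair is held at a time, though the whole input still has to be read once.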
          Last edited by Brian Bushnell; 06-11-2014, 09:22 AM.
