  • coverage calculation

    Hey

    I am trying to sequence an exome, and the capture kit targets 100 Mb.

    The sequencing core promised 120 million reads per lane. We are using paired-end 100 bp reads, and our fragment size is 250 bp.

    My calculation was: 120 million reads * 200 bp = 24 billion bases read

    so coverage = 24 billion bases / 100 Mb = 240x coverage (average)

    But some people say I will get a coverage of only 120x. What could be the reason? Or is the coverage actually 240x?

  • #2
    Why are you multiplying 120 million reads by 200, if each read is 100 bases long? A read is one end, a cluster has two reads.

    It's 120x by those calculations, but obviously not every read will fall on target, so it will be lower than that.

    • #3
      It can read 120 million fragments, and each fragment will be read twice at 100 bp length. So I thought I would get twice that.

      • #4
        I think you are conflating fragments and clusters and reads.

        One read is just one read. One fragment generates one cluster on the Illumina flow cell, and two reads come from that one cluster.

        If you were told 120 million reads, like you write in your first post, then you don't double that again. If you were told 120 million clusters, that's 240 million reads at 100 bp each.
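A minimal sketch of the arithmetic discussed above, assuming every sequenced base lands on target (which, as noted in #2, it will not). The function name and the `units_are_clusters` flag are illustrative, not from any library.

```python
def mean_coverage(units, read_length_bp, target_size_bp, units_are_clusters=False):
    """Upper-bound mean depth, assuming 100% of bases fall on target.

    `units` is the number promised by the sequencing core. For paired-end
    runs, one cluster yields two reads, so the count is doubled only when
    the promise was stated in clusters.
    """
    reads = units * 2 if units_are_clusters else units
    return reads * read_length_bp / target_size_bp


# 120 million reads of 100 bp over a 100 Mb target
print(mean_coverage(120_000_000, 100, 100_000_000))

# 120 million clusters (= 240 million reads) over the same target
print(mean_coverage(120_000_000, 100, 100_000_000, units_are_clusters=True))
```

The two calls give 120x and 240x respectively, so the answer depends entirely on whether the core quoted reads or clusters.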

        • #5
          It's worth remembering that with 100bp reads you'll get a reasonable proportion of your library where there will be an overlap between the ends of reads 1 and 2 so this will reduce your effective coverage. There will even be plenty of sequences where read 2 provides no additional coverage (where read1 reads right through the insert into the other end adapter).
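As a rough illustration of the overlap effect described above, the sketch below counts how many insert bases a read pair covers twice and how many distinct bases it covers. It assumes both mates have the same length, ignores trimming and soft-clipping, and treats any bases read into the adapter as contributing nothing. The function names are made up for this example.

```python
def pair_overlap_bp(insert_size_bp, read_length_bp):
    """Insert bases sequenced by both read 1 and read 2."""
    # A read cannot cover more of the insert than the insert is long;
    # anything beyond that is adapter read-through.
    covered_per_read = min(read_length_bp, insert_size_bp)
    return max(0, 2 * covered_per_read - insert_size_bp)


def unique_bases_per_pair(insert_size_bp, read_length_bp):
    """Distinct insert bases covered by the pair."""
    covered_per_read = min(read_length_bp, insert_size_bp)
    return 2 * covered_per_read - pair_overlap_bp(insert_size_bp, read_length_bp)


for insert in (250, 150, 80):
    print(insert, pair_overlap_bp(insert, 100), unique_bases_per_pair(insert, 100))
```

With 100 bp reads, a 250 bp insert has no overlap, a 150 bp insert has 50 overlapping bases, and an 80 bp insert is fully covered by read 1 alone, so read 2 adds no new bases.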

          • #6
            Originally posted by simonandrews:
            It's worth remembering that with 100bp reads you'll get a reasonable proportion of your library where there will be an overlap between the ends of reads 1 and 2 so this will reduce your effective coverage. There will even be plenty of sequences where read 2 provides no additional coverage (where read1 reads right through the insert into the other end adapter).
            "Coverage", to me, means average read depth. Like "my 1.5 billion bases of reads gives me 10x coverage of the Arabidopsis genome." By this definition, two 100 nt reads from a 100 bp insert would provide double the effective coverage of just one read.

            You seem to be referring to what I would call "% of genome covered".

            --
            Phillip

            • #7
              Originally posted by pmiguel:
              "coverage", to me, means average read depth. Like "my 1.5 billion bases of reads gives me 10x coverage of the arabidopsis genome." By this definition, two 100 nt reads from a 100 bp insert would provide double the effective coverage of just one read.
              I suppose this comes down to where you think your errors will occur. Resequencing the same fragment multiple times will help to correct sequencing errors, but won't help if the fragment picked up a PCR error during library preparation.

              I guess I tend to think in terms of epigenetics where there isn't a single fixed epigenome to measure, so the distinction between two reads from the same fragment and two reads from different fragments actually matters. If you're only concerned with sequencing errors then I guess you count overlapping reads equally.

              • #8
                A quick and dirty estimation of final coverage in a sequence capture experiment using a hybridization based method is to assume about 50% efficiency.

                Looking at the summary data over a few dozen different custom captures and a few thousand exome captures from Agilent and Nimblegen, a reasonable estimation of depth of coverage from total sequence data is to assume about a 50% efficiency in the entire process.

                For example, if your capture region is 100Mb and your total sequence yield is 5Gb, your coverage would be 50x if every sequence read aligned within the capture region and everything was 100% efficient and evenly distributed. In reality, you will see median coverages in the 25x range once all of the inefficiencies are accounted for.
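The rule of thumb in the example above can be written as a short sketch. The 50% default is the empirical figure quoted in this post for human exome capture, not a universal constant, and the function name is illustrative.

```python
def expected_median_coverage(total_yield_bp, capture_size_bp, efficiency=0.5):
    """Rough median depth for a hybridization capture experiment.

    `efficiency` lumps together off-target reads, duplicates, and uneven
    coverage; 0.5 is the rule-of-thumb value from the post.
    """
    ideal_depth = total_yield_bp / capture_size_bp
    return ideal_depth * efficiency


# 5 Gb of sequence over a 100 Mb capture region
print(expected_median_coverage(5_000_000_000, 100_000_000, efficiency=1.0))  # ideal case
print(expected_median_coverage(5_000_000_000, 100_000_000))                  # rule of thumb
```

This reproduces the 50x ideal and roughly 25x realistic figures from the example.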

                If you want to calculate the amount of sequence needed for a particular scenario, say to cover at least 80% of the capture region to at least 20x, the relationship is not linear but more exponential and can be approximated by:

                To have at least 70% of the capture region covered at 'Y' coverage, multiply 'Y' by 2 to estimate the median coverage needed.
                To have at least 80% of the capture region covered at 'Y' coverage, multiply 'Y' by 4 to estimate the median coverage needed.
                To have at least 90% of the capture region covered at 'Y' coverage, multiply 'Y' by 7 to estimate the median coverage needed.

                All of the above are based on human exome capture. YRMV.
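A sketch of how the multipliers above might be used for planning. The lookup table holds exactly the three values quoted in the post. Converting the required median coverage into raw sequence yield by dividing by the 50% efficiency figure is an assumption made for this example, as are the function names.

```python
# Breadth of capture region (%) -> multiplier on the minimum depth Y,
# as quoted in the post for human exome capture.
BREADTH_MULTIPLIER = {70: 2, 80: 4, 90: 7}


def median_coverage_needed(min_depth, breadth_pct):
    """Median coverage needed so that `breadth_pct` % of the target reaches `min_depth`."""
    return min_depth * BREADTH_MULTIPLIER[breadth_pct]


def raw_yield_needed_bp(min_depth, breadth_pct, capture_size_bp, efficiency=0.5):
    """Total bases to sequence, assuming median depth = efficiency * ideal depth."""
    median = median_coverage_needed(min_depth, breadth_pct)
    return median * capture_size_bp / efficiency


# At least 80% of a 100 Mb capture region at 20x or better
print(median_coverage_needed(20, 80))
print(raw_yield_needed_bp(20, 80, 100_000_000) / 1e9, "Gb")
```

For the 80% at 20x scenario this gives a median coverage target of 80x, or about 16 Gb of raw sequence under the assumed 50% efficiency.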

                A number of factors influence the final numbers including sequencing read length, insert size, specificity of the capture reagent/region, etc. The 50% is a very good estimation for mammalian species. Really don't know how well it would apply to other organisms, but suspect it would be close.


                Similar to Simon, we have found mostly minor issues introduced in variant calling when the same physical fragment is sequenced twice, resulting in overstatement of variant quality scores. The effect of sequencing the same fragment twice on data produced for sequencing census methods (ChIPseq, RNAseq, Methylseq) is substantially more pronounced, in that you double count short fragments and introduce an insert-length-dependent bias in the data.

                If the paired reads overlap following duplicate removal, we trim them back at the BAM stage to allow the reads to meet end to end. During the trim, the exact proportion of overlapping bases can be tracked to provide a summary report of the total bases removed.
                HudsonAlpha Institute for Biotechnology
                http://www.hudsonalpha.org/gsl
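The overlap trimming described above can be sketched on reference coordinates alone. This is not the poster's pipeline: the post does not say where the two reads are made to meet, so splitting the overlap at its midpoint is an assumption, and the sketch handles only the common layout where the forward mate starts and ends to the left of the reverse mate. Intervals are half-open `(start, end)` tuples and the function name is hypothetical.

```python
def trim_overlap(fwd, rev):
    """Trim an overlapping read pair so the mates meet end to end.

    Returns (trimmed_fwd, trimmed_rev, bases_removed).
    """
    fwd_start, fwd_end = fwd
    rev_start, rev_end = rev
    if fwd_start > rev_start or fwd_end > rev_end:
        raise ValueError("sketch only handles forward mate left of reverse mate")

    overlap = fwd_end - rev_start
    if overlap <= 0:
        return fwd, rev, 0  # mates do not overlap, nothing to trim

    # Assumption: cut at the midpoint of the overlapping region.
    cut = rev_start + overlap // 2
    return (fwd_start, cut), (cut, rev_end), overlap


trimmed_fwd, trimmed_rev, removed = trim_overlap((1000, 1100), (1050, 1150))
total_bases = (1100 - 1000) + (1150 - 1050)
print(trimmed_fwd, trimmed_rev, removed, removed / total_bases)
```

Summing `removed` over all pairs and dividing by the total aligned bases would give the kind of summary figure the post mentions.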
