  • shimbalama
    bioinformatics-help.com
    • Jul 2014
    • 9

    Comparing read depths per gene/exon between samples

    Hi all, this is my first post, so apologies if it's inappropriate. It's a pretty simple question, but I need a little help. I promise I've googled far and wide to try to figure it out myself.

    I use short reads from a MiSeq to make clinical variant calls using GATK. I use various panels (TruSight Cancer, etc.). Some exons/genes always have low coverage (due to GC content, etc.), and others just fail in one sample, which is often clinically relevant.

    I would like to compare the mean coverage of each exon/gene in each sample to the same from a 'gold standard' derived from what my lab scientists tell me is a 'good run'. Currently I am doing a t-test comparing the mean of the gold against the read depths at each base in the exon/gene that I am doing variant calling on. Basically, I only want a low mean read depth flagged if it is significantly different from the mean of the gold.

    It made sense at first because I am comparing two means. Is that right? It seems wrong because I'm really only comparing two samples. So I thought I should do a Z-test...
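
A minimal sketch of the test described above, assuming the gold mean is treated as a fixed reference value and the per-base depths of one exon are the sample. The function name and the depth values are made up for illustration, and the p-value uses a normal approximation to the t distribution, which is only reasonable when the exon has many bases.

```python
import math
from statistics import NormalDist, mean, stdev


def one_sample_test(depths, gold_mean):
    """Return (t statistic, approximate two-sided p-value) for
    H0: the mean of `depths` equals `gold_mean`."""
    n = len(depths)
    if n < 2:
        raise ValueError("need at least two per-base depths")
    standard_error = stdev(depths) / math.sqrt(n)
    if standard_error == 0:
        raise ValueError("per-base depths have zero variance")
    t = (mean(depths) - gold_mean) / standard_error
    # Assumption: normal approximation to the t distribution (large n only).
    p = 2 * NormalDist().cdf(-abs(t))
    return t, p


# Made-up per-base depths for one exon and a made-up gold mean of 40x.
exon_depths = [18, 22, 19, 21, 20, 17, 23, 20, 19, 21]
t_stat, p_value = one_sample_test(exon_depths, gold_mean=40.0)
```

One caveat with this setup: neighbouring bases are covered by largely the same reads, so per-base depths are not independent observations, and a p-value computed this way will tend to be smaller than it should be.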

    Has anyone done anything similar? How did you implement it?
    LM
  • Brian Bushnell
    Super Moderator
    • Jan 2014
    • 2709

    #2
    For human data, I suggest calling mutations against the standard human genome, then comparing them against known databases such as the 1000 Genomes Project.

    There are gold standards, but gold is a relative and dynamic term in any advancing industry. In particular, exon capture is not at all replicable between different platforms.


    • bt27uk
      Junior Member
      • Aug 2011
      • 7

      #3
      I think you are asking about coverage, where the first reply seemed to be talking about something a bit different.

      I wonder if it's not so much a statistical comparison you are after here, but rather a cutoff level. In this case, the challenge becomes what regions to measure and how to set the cutoffs for those regions.

      From your past experience, does the mean depth tell you what you need to know? If you are working with panels, then perhaps it would be relevant to choose a few regions where you know the coverage range you would consider normal or good and check whether the coverage from a given run is at that level?

      Setting cutoffs, which would act as warnings that a sample may not be of the quality you need, could start with basic exploratory data analysis: tables and plots of the coverage of your gold sample, looking at the distribution of coverage over the mapping (or over the regions you work with). From this, determine values that would be meaningful to check for in your samples. I would then validate the resulting test by running it against other samples you know were considered good or bad in the past, to see whether it would have flagged the samples you hope it would.
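
The cutoff idea above could be sketched roughly as follows, assuming mean depths per region are available from several runs the lab judged good. The region names, depth values, and the choice of the 5th percentile as the cutoff are all illustrative assumptions, not something stated in the thread.

```python
from statistics import quantiles


def region_cutoffs(good_runs):
    """Map each region to the 5th percentile of its mean depth across good runs.

    Assumption: the 5th percentile is an arbitrary starting point for a cutoff.
    """
    return {
        region: quantiles(depths, n=20, method="inclusive")[0]
        for region, depths in good_runs.items()
    }


def flag_low_regions(sample, cutoffs):
    """Return the regions whose mean depth in `sample` is below the cutoff."""
    return sorted(
        region
        for region, depth in sample.items()
        if region in cutoffs and depth < cutoffs[region]
    )


# Made-up mean depths per region from five runs considered good.
good_runs = {
    "BRCA1_ex2": [210, 195, 220, 205, 199],
    "TP53_ex4": [60, 55, 70, 58, 65],
}
cutoffs = region_cutoffs(good_runs)

# Check the rule against past samples whose verdict is already known.
past_good = {"BRCA1_ex2": 208, "TP53_ex4": 62}
past_bad = {"BRCA1_ex2": 202, "TP53_ex4": 12}
flagged_good = flag_low_regions(past_good, cutoffs)
flagged_bad = flag_low_regions(past_bad, cutoffs)
```

Running the rule over past samples with a known verdict, as in the last few lines, is the "test the test" step: a useful cutoff flags nothing in the good sample and flags the failed region in the bad one.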

      Having said all that, my suspicion is that this question may be a solved problem and that others in the forum will have more mature ideas about processes and tools to use for this purpose.

      Guess we'll find out, right? :-)


      • shimbalama
        bioinformatics-help.com
        • Jul 2014
        • 9

        #4
        Thanks Brian.

        I do all that. What I am trying to do is QC on the negative variant calls, so every base in every gene of interest (GOI).

        What I am interested in is the mean read depth of every GOI that comes off my machine and whether it is significantly different to the mean depth I have defined as 'gold'. So the question is about statistical analysis only.
        LM


        • shimbalama
          bioinformatics-help.com
          • Jul 2014
          • 9

          #5
          Thanks bt27uk,

          Much more on point.

          I have implemented an approach similar to what you suggest, i.e. if the sample mean is < 20x but the gold isn't, we want to know. My boss wants a p-value, though.
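
A rule like that might look something like this sketch. Only the 20x threshold comes from the post; the function name, region names, and depth values are hypothetical, and a region missing from the sample is assumed to have zero depth.

```python
MIN_DEPTH = 20  # the 20x threshold mentioned in the post


def dropped_regions(sample_means, gold_means, min_depth=MIN_DEPTH):
    """Return regions that reach `min_depth` in the gold run but not in the sample.

    Assumption: a region absent from `sample_means` counts as depth 0.
    """
    return sorted(
        region
        for region, gold_depth in gold_means.items()
        if gold_depth >= min_depth and sample_means.get(region, 0) < min_depth
    )


# Made-up mean depths per region.
gold_means = {"GENE_A_ex1": 150, "GENE_A_ex2": 12, "GENE_B_ex1": 85}
sample_means = {"GENE_A_ex1": 140, "GENE_A_ex2": 9, "GENE_B_ex1": 14}
to_review = dropped_regions(sample_means, gold_means)
```

Here `GENE_A_ex2` is not reported because it is already below 20x in the gold run, which matches the "always low" regions described in the first post.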

          Cheers,
          Liam
          LM


          • bt27uk
            Junior Member
            • Aug 2011
            • 7

            #6
            If your supervisor wants a p-value, then I have likely missed the point.

            I originally assumed the aim was to ask a question like "does this sample have adequate coverage for my purposes?". For noting samples that might not have adequate coverage for downstream analysis, I think a set of coverage cutoffs for the various genes of interest, with lower limits set from your knowledge of a "good" sample, would be a reasonable way forward.

            To me, a p-value suggests questions more along the lines of "does this sample have (any, some, all?) genes whose coverage falls outside the range seen in the population of what are considered good samples?" That is a rather more complex question to approach.
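
One possible way to frame that population question, offered only as a sketch: fit a normal distribution to each gene's mean depth across several good runs, then ask how far into the lower tail the new sample falls. The normality assumption, the Bonferroni correction for testing several genes, the names, and the numbers are all assumptions made for illustration.

```python
from statistics import NormalDist, mean, stdev


def low_coverage_p_values(sample, good_runs):
    """Per gene, the lower-tail probability of the sample's mean depth
    under a normal distribution fitted to the good runs (an assumption)."""
    p_values = {}
    for gene, depths in good_runs.items():
        fitted = NormalDist(mean(depths), stdev(depths))
        p_values[gene] = fitted.cdf(sample[gene])
    return p_values


# Made-up mean depths per gene from five runs considered good.
good_runs = {
    "BRCA1_ex2": [210, 195, 220, 205, 199],
    "TP53_ex4": [60, 55, 70, 58, 65],
}
sample = {"BRCA1_ex2": 202, "TP53_ex4": 12}

p_values = low_coverage_p_values(sample, good_runs)

# Bonferroni correction: divide the significance level by the number of genes tested.
alpha = 0.05 / len(p_values)
significantly_low = sorted(gene for gene, p in p_values.items() if p < alpha)
```

With only a handful of good runs the fitted mean and standard deviation are rough, so the p-values from a sketch like this are indicative at best.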
