  • Jane M
    Senior Member
    • Aug 2011
    • 239

    Correlation between sequencing depth and false positives

    Hello,

    I am looking for documentation about the correlation between sequencing depth and false discovery rate.
    I mean, if there is a SNP at a position and the coverage at this position is low, is the probability of detecting this SNP lower than if the coverage were high?
    Or, if a differentially expressed gene has low average coverage, is the probability of detecting the gene as differentially expressed lower than if the coverage were high?

    Do you know if there are studies or papers about this point?
    Thanks in advance,
    Jane
  • A Oshlack
    Member
    • Jun 2010
    • 17

    #2

    Comment

    • Jane M
      Senior Member
      • Aug 2011
      • 239

      #3
      Thanks for the paper.
      I am particularly interested in the effect of sequencing depth on SNP and indel detection. Are there papers on this topic?

      I would like to know: if there is a SNP at a position, do we have the same probability of detecting it when the coverage is low as when the coverage is high, assuming the same sequencing error rate?

      Thanks,
      Jane

      Comment

      • gringer
        David Eccles (gringer)
        • May 2011
        • 845

        #4
        Originally posted by Jane M
        I would like to know: if there is a SNP at a position, do we have the same probability of detecting it when the coverage is low as when the coverage is high, assuming the same sequencing error rate?
        All other things being equal, the probability of detecting a SNP at high coverage will be higher than the probability of detecting a SNP at low coverage. The error rate at a particular location will decrease with repeated sampling, increasing the reliability of measurement.

        This is not a particularly meaningful statement. Of course there will be some point where an increase in quality won't significantly increase the reliability of measurement (e.g. phred score of 40 or so, considering repeated sampling). However, in almost all cases the actual SNP frequency will play a greater role in detection, and the difference in detection probability will be insignificant for high-frequency SNPs and for very low-frequency SNPs.

        If the chance of a polymorphism is near 50%, then you'd need a coverage of less than 6 or so over a region (my ball-park guess) to miss repeated observations of both variants of a dimorphic SNP. Conversely, for a SNP (depending on the definition of SNP) with frequency less than 1%, you'd have to be quite lucky to get any sample that has the variant of interest.
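To put rough numbers on the ball-park guess above, here is a minimal sketch that treats the reads at one position as independent binomial draws. The function names and the `min_reads=2` threshold ("repeated observations") are illustrative assumptions, and the model ignores sequencing error, mapping bias and PCR duplicates.

```python
from math import comb


def prob_at_least(k: int, depth: int, fraction: float) -> float:
    """P(X >= k) for X ~ Binomial(depth, fraction): chance of seeing the
    variant in at least k of `depth` reads."""
    return sum(
        comb(depth, i) * fraction**i * (1 - fraction) ** (depth - i)
        for i in range(k, depth + 1)
    )


def prob_both_alleles_seen(depth: int, fraction: float, min_reads: int = 2) -> float:
    """Chance that the variant AND the reference allele are each seen in at
    least `min_reads` reads (hypothetical detection rule)."""
    return sum(
        comb(depth, i) * fraction**i * (1 - fraction) ** (depth - i)
        for i in range(min_reads, depth - min_reads + 1)
    )


if __name__ == "__main__":
    # Variant fraction of 50%: how quickly does detection saturate with depth?
    for depth in (4, 6, 10, 20, 30):
        print(depth, round(prob_both_alleles_seen(depth, 0.5), 3))
    # Variant fraction of 1%: chance of seeing it even once in 100 reads.
    print(round(prob_at_least(1, 100, 0.01), 3))
```

Under these assumptions, a 50% variant at depth 6 shows both alleles at least twice with probability 50/64 (about 0.78), and the curve flattens quickly as depth grows, while a 1% variant is seen at least once in 100 reads only about 63% of the time.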

        Comment

        • gringer
          David Eccles (gringer)
          • May 2011
          • 845

          #5
          Or if a differentially expressed gene has a low coverage on the average, is the probability to detect the gene as differentially expressed lower than if the coverage was high?
          This is quite a different question from the SNP question, because there are two dimensions of measurement that influence the probability that differential expression is significant even when just considering the read counts at a single base-pair location (number of raw reads, and fold-change difference). A low number of raw reads increases the measurement error, increasing the fold-change difference that would need to be observed for a differential expression to be considered significant (note: raw read counts, not normalised read counts).

          Again, with all other things equal, a high coverage will increase the reliability of the result, but this time it has a much greater role to play in determining whether the expression difference is significant.
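A rough way to see why raw counts matter: if read counts are treated as Poisson (an assumption; real differential expression tools model extra biological variation between replicates), the standard error of a log fold change is approximately sqrt(1/a + 1/b) for raw counts a and b. The helper names below are hypothetical.

```python
from math import exp, sqrt


def log_fold_change_se(count_a: int, count_b: int) -> float:
    """Approximate standard error of log(count_a / count_b) for two
    independent Poisson counts (delta-method approximation)."""
    if count_a <= 0 or count_b <= 0:
        raise ValueError("counts must be positive")
    return sqrt(1 / count_a + 1 / count_b)


def min_detectable_fold_change(count: int, z: float = 1.96) -> float:
    """Smallest fold change distinguishable from 1 at roughly the 95% level
    when both conditions have about `count` raw reads."""
    return exp(z * log_fold_change_se(count, count))


if __name__ == "__main__":
    for count in (10, 100, 1000):
        print(count, round(min_detectable_fold_change(count), 2))
```

With about 10 raw reads per condition the fold change has to exceed roughly 2.4 to stand out from counting noise alone, whereas with about 1000 reads a change of around 9% is already outside it. This only covers sampling noise, not the confounding factors listed in the next paragraph.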

          Unfortunately, there are plenty of other confounding factors, such that differential expression analysis by NGS can really only be used for fishing / hypothesis generation. Off the top of my head, there's multiply-mapped reads, multiple isoforms / splice variants, incomplete coverage of the gene / transcript, PCR duplicates, and incorrect gene annotation. Some of these situations can be identified by looking at coverage plots at a transcript level, but that requires too much effort and human intervention to work at a genome-wide scale.

          If you really want to doubt the reliability of your results, look at the coefficient of variation for coverage in all transcripts (SD of coverage divided by mean coverage). The last time I looked at that, I think about 70~125% described a "good" coverage, and most transcripts were over something like 300%. I'd be interested to know other people's experience regarding this matter.
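The coefficient of variation described above can be computed per transcript from a list of per-base depths. This is a minimal sketch; the function name is hypothetical, the input is assumed to be already extracted from a depth file, and the population SD is used here because the post does not say which SD was meant.

```python
from statistics import mean, pstdev


def coverage_cv(depths: list[float]) -> float:
    """Coefficient of variation of coverage: SD of per-base depth divided by
    mean depth. Returns NaN for a transcript with zero mean coverage."""
    avg = mean(depths)
    if avg == 0:
        return float("nan")
    return pstdev(depths) / avg


if __name__ == "__main__":
    even = [30, 28, 33, 31, 29, 30]   # fairly uniform coverage
    spiky = [0, 0, 2, 150, 3, 0]      # one pile-up, little elsewhere
    print(f"even:  {coverage_cv(even):.0%}")
    print(f"spiky: {coverage_cv(spiky):.0%}")
```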

          Comment

          • Jane M
            Senior Member
            • Aug 2011
            • 239

            #6
            Thanks a lot for your answer gringer!

            I must admit that, at the moment, I'm particularly interested in the detection of SNPs, so I would like to have an idea of the reliability of my results when coverage is low.
            For example, I detect variants in these two extreme cases:
            -3 reads for the reference and 3 reads for the variant
            -100 reads for the reference and 100 reads for the variant

            Has anyone estimated the reliability of results as a function of sequencing depth? Gringer, can you suggest publications about it?

            Jane

            Comment

            • gringer
              David Eccles (gringer)
              • May 2011
              • 845

              #7
              For example, I detect variants in these two extreme cases:
              -3 reads for the reference and 3 reads for the variant
              -100 reads for the reference and 100 reads for the variant
              That's not a particularly extreme case. It suggests SNP frequencies of 50%, which means coverage is not going to matter. Of course for a heterozygous sample, this is expected. Are these reads for a single sample (i.e. you're looking at a heterozygous sample), or for multiple samples? You should be doing your SNP detection using pooled reads for all samples, and then type according to this. A more interesting case (for a single sample) would be something like the following:

              SNP 1: 1 read for the reference and 5 reads for the variant [probably homozygous variant, but small possibility of heterozygote]
              SNP 2: 20 reads for the reference and 80 reads for the variant [small possibility of heterozygote, but the imbalance of counts suggests there might be multiple read hits in the genome]
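One way to compare the genotype calls suggested for these counts is a simple binomial likelihood for each diploid genotype. This is a sketch under loud assumptions: a flat prior over the three genotypes, independent reads, and a single per-read error rate (`error_rate`, a made-up parameter name) that turns one allele into the other. It is not how any particular caller works.

```python
from math import comb


def genotype_posteriors(ref_reads: int, var_reads: int, error_rate: float = 0.01) -> dict[str, float]:
    """Posterior probability of each diploid genotype given read counts,
    assuming a flat prior and binomial sampling of variant reads."""
    depth = ref_reads + var_reads
    # Expected fraction of variant reads under each genotype.
    expected_fraction = {
        "hom_ref": error_rate,
        "het": 0.5,
        "hom_var": 1 - error_rate,
    }
    likelihood = {
        genotype: comb(depth, var_reads) * p**var_reads * (1 - p) ** ref_reads
        for genotype, p in expected_fraction.items()
    }
    total = sum(likelihood.values())
    return {genotype: value / total for genotype, value in likelihood.items()}


if __name__ == "__main__":
    for ref, var in ((3, 3), (100, 100), (1, 5), (20, 80)):
        for err in (0.01, 0.05):
            post = genotype_posteriors(ref, var, err)
            print(ref, var, err, {g: round(p, 4) for g, p in post.items()})
```

The result for 1 reference / 5 variant reads is sensitive to the assumed error rate: with 1% error the heterozygote comes out more likely than the homozygous variant, while with 5% error the ranking flips. For 20 / 80, neither genotype fits the counts well under this model, which is consistent with the suggestion that something else (such as multiple read hits) is going on.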

              With Sanger sequencing, two observations of a variant (in a population) are typically enough to consider the variant as being present, bearing in mind that a typical definition of a SNP is for a frequency greater than 1% (or possibly 5%). I expect it would be similar for NGS. I think the SNP microarrays use a few replicate sequences per variant (e.g. see here), just to be safe.

              Edit:

              can you suggest me publications about it?
              I'm not aware of any NGS publications relating to SNP discovery (because I haven't looked), but for "classical" SNP detection I guess you could look at the Wikipedia references:
              Last edited by gringer; 02-28-2012, 03:36 AM. Reason: added wikipedia link, affy reference

              Comment

              • Jane M
                Senior Member
                • Aug 2011
                • 239

                #8
                Originally posted by gringer
                That's not a particularly extreme case. It suggests SNP frequencies of 50%, which means coverage is not going to matter. Of course for a heterozygous sample, this is expected. Are these reads for a single sample (i.e. you're looking at a heterozygous sample), or for multiple samples? You should be doing your SNP detection using pooled reads for all samples, and then type according to this. A more interesting case (for a single sample) would be something like the following:

                SNP 1: 1 read for the reference and 5 reads for the variant [probably homozygous variant, but small possibility of heterozygote]
                SNP 2: 20 reads for the reference and 80 reads for the variant [small possibility of heterozygote, but the imbalance of counts suggests there might be multiple read hits in the genome]
                The examples I gave are not necessarily cases from my own data (though I may well have them, since I have hundreds of variants).

                My questions relate both to the examples I gave and to the ones you gave. It's easier to start with my cases.
                From what you said, I understand that I can trust my two cases equally.
                That was my question: I thought I could be more confident in (100 reads for the reference and 100 reads for the variant) than in (3 reads for the reference and 3 reads for the variant), all other things being equal, because 3 erroneous reads are more likely than 100.

                Then, for the cases you mentioned, it's more complicated. But, it's the same idea. We calculate a proportion of variant and this proportion is probably more reliable if it has been estimated from a big sample, all other things equal.
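Both intuitions can be checked with a small sketch: a Wilson score interval for the variant proportion, and a binomial tail for the chance that all the variant reads are errors. The function names and the 1% per-read error rate are assumptions for illustration, and reads are treated as independent.

```python
from math import comb, sqrt


def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval (about 95% for z = 1.96) for a binomial proportion."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half_width = z * sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return centre - half_width, centre + half_width


def prob_errors_at_least(k: int, depth: int, error_rate: float = 0.01) -> float:
    """Chance that at least k of `depth` reads are sequencing errors,
    assuming independent errors at `error_rate` per read."""
    return sum(
        comb(depth, i) * error_rate**i * (1 - error_rate) ** (depth - i)
        for i in range(k, depth + 1)
    )


if __name__ == "__main__":
    for var, depth in ((3, 6), (100, 200)):
        low, high = wilson_interval(var, depth)
        print(f"{var}/{depth}: variant fraction in [{low:.3f}, {high:.3f}], "
              f"P(all variant reads are errors) = {prob_errors_at_least(var, depth):.2e}")
```

Under these assumptions, 3 of 6 reads gives an interval of roughly 0.19 to 0.81 for the variant fraction, while 100 of 200 narrows it to roughly 0.43 to 0.57. Both cases are very unlikely to be pure sequencing error, but the estimated proportion is far less precise at low depth, which matters when the expected fraction is not simply 0, 50 or 100%.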

                I'm studying the mutations occurring in cells of patients suffering from leukaemia. As a first study, I am looking for somatic mutations that occur at homozygous positions.
                I'm using tools like VarScan 2 and JointSNVMix for detection.
                I know that my samples have a purity of 1 (or very close to 1), but I shouldn't expect exactly 0, 50 or 100% variant reads, because not all of my cells will be mutated...

                So to filter my (big) list of variants, I use quality criteria, and that is why I'm looking for publications about this.

                Comment

                • rlopez
                  Junior Member
                  • Nov 2010
                  • 1

                  #9
                  > I'm using tools like VarScan 2 and JointSNVMix for detection.

                  Hello Jane M,

                  This might not be the right thread, but I was wondering if you would like to share your experience with VarScan 2, JointSNVMix, Strelka, and any others you might have tried (e.g. SomaticSniper, MuTect, etc.).

                  Many thanks,

                  Rene L

                  Comment
