Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • How to calculate nucleotide diversity from SNP data from multiple samples

    Hi all,
    I have a collection of SNP data from a tetraploid population, and I would like to calculate the nucleotide diversity. The generel forumula is:


    where xi and xj are the respective frequencies of the ith and jth sequences, (PI)ij is the number of nucleotide differences per nucleotide site between the ith and jth sequences, and n is the number of sequences in the sample.
    This is fairly simple if we have sequence data (and not SNP data).
    If we have these 3 sequences in a (very small) population, and we have sampled them each 1 time, we need to compare 1vs2, 1vs3, and 2vs3

    >1
    ATGCGTTTTT
    >2
    ATGGGTTTTT
    >3
    ATGCGTTTTA

    We now get:
    (PI)12 = 1/10 (they differ at pos. 4)
    (PI)13 = 1/10 (they differ at pos. 10)
    (PI)23 = 2/10 (they differ at pos. 4 and 10)
    n = 3
    The respective frequencies are always 1/3 in this case this means that:

    PI = 2* ( (1/3)*(1/3)*(1/10) + (1/3)*(1/3)*(1/10) + (1/3)*(1/3)*(2/10) )
    =>
    PI = 2*(1/3^2)*(1/10+1/10+2/10)
    =>
    PI = (2/9)*(2/5)
    =>
    PI = 4/45 = 0.0888

    I have a (VERY small popoluation of 2), and I have measured 2 SNPs:

    Individual 1: a G/C SNP at position 3.
    In
    dividual 2: a T/A SNP at position 10.
    Lets say the reference sequence is:
    >ref
    ATGCGTTTTT

    Q1: Now I want to calculate the nucleotide diversity, would it then be correct to state that I have actually 4 sequences (4 alleles), meaning that I can calculate PI as follows:

    Based on the SNP information:
    >1 Individual 1 - allele 1
    ATGCGTTTTT
    >2
    Individual 1 - allele 2
    ATCCGTTTTT
    >
    3 Individual 2 - allele 1
    ATGCGTTTTT
    >4 Individual 2 - allele 2
    ATGCGTTTTA
    We now get:
    (PI)12 = 1/10 (they differ at pos. 3)
    (PI)13 = 0 (they do not differ)
    (PI)14 = 1/10 (they differ at pos. 10)
    (PI)23 = 1/10 (they differ at pos. 3)
    (PI)24 =
    2/10 (they differ at pos. 3 and 10)
    (PI)34 =
    1/10 (they differ at pos. 10)
    n = 4
    The respective frequencies are always 1/4, and can be moved out of the parenthesis and squared (see above). in this case PI is:

    PI = 2* ( (1/3)*(1/3)*(1/10) + (1/3)*(1/3)*(1/10) + (1/3)*(1/3)*(2/10) )
    =>
    PI = 2*(1/4^2)*(1/10+0+1/10+1/10+2/10+1/10)
    =>
    PI = (2/16)*(5/10)
    =>
    PI = 1/16

    The question is whether this splitting of SNP information into alleles, and hence sequences is correct?

    Now, as I stated I have a tetraploid population, and I therefore plan to do the same, but expand each SNP into 4 alleles:

    I have a (VERY small population of 2), and I have measured 2 SNPs:

    Individual 1: a G/G/C/C SNP at position 3.
    In
    dividual 2: a T/T/T/A SNP at position 10.
    Lets say the reference sequence is still:
    >ref
    ATGCGTTTTT

    Based on the SNP information:
    >1 Individual 1 - allele 1
    ATGCGTTTTT
    >2
    Individual 1 - allele 2
    ATGCGTTTTT
    >
    3 Individual 1 - allele 3
    ATCCGTTTTT
    >4 Individual 1 - allele 4
    ATCCGTTTTT
    >5 Individual 2 - allele 1
    ATGCGTTTTT
    >6
    Individual 2 - allele 2
    ATGCGTTTTT
    >7
    Individual 2 - allele 3
    ATGCGTTTTT
    >8 Individual 2 - allele 4
    ATGCGTTTTA
    n = 8
    (PI)12 = 0 (they do not differ)
    (PI)13 =
    1/10 (they differ at pos. 3)
    (PI)14 =
    1/10 (they differ at pos. 3)
    (PI)15 = 0 (they do not differ)
    (PI)16 = 0 (they do not differ)
    (PI)17 = 0 (they do not differ)
    (PI)18 = 1/10 (they differ at pos. 10)

    (PI)23 =
    1/10 (they differ at pos. 3)
    (PI)24 = 1/10 (they differ at pos. 3)
    (PI)25 = 0 (they do not differ)
    (PI)26 = 0 (they do not differ)
    (PI)27 = 0 (they do not differ)
    (PI)28 = 1/10 (they differ at pos. 10)

    (PI)34 =
    0 (they do not differ)
    (PI)35 =
    1/10 (they differ at pos. 3)
    (PI)36 =
    1/10 (they differ at pos. 3)
    (PI)37 =
    1/10 (they differ at pos. 3)
    (PI)38 = 2/10 (they differ at pos. 3 and 10)

    (PI)45 = 1/10 (they differ at pos. 3)
    (PI)46 =
    1/10 (they differ at pos. 3)
    (PI)47 =
    1/10 (they differ at pos. 3)
    (PI)48 = 2/10 (they differ at pos. 3 and 10)

    (PI)56 =
    0 (they do not differ)
    (PI)57 = 0 (they do not differ)
    (PI)58 = 1/10 (they differ at pos. 10)

    (PI)67 = 0 (they do not differ)
    (PI)68 = 1/10 (they differ at pos. 10)


    (PI)78 = 1/10 (they differ at pos. 10)

    Now, the respective frequencies are always 1/8, and can be moved out of the parenthesis and squared (see above). in this case PI is:

    SUM( PI(XJ) ) = 19/10

    PI = 2* ( (1/3)*(1/3)*(1/10) + (1/3)*(1/3)*(1/10) + (1/3)*(1/3)*(2/10) )
    =>
    PI = 2*(1/8^2)*(19/10)
    =>
    PI = (1/32)*(19/10)
    =>
    PI = 19/320 = 0.059375

    The question is (again) whether this splitting of SNP information into alleles, and hence sequences is correct?

  • #2
    I think there is a mistake in your formula for pi, you are missing a factor n/n-1, which is negligible when n is large and often omitted, but this is not your case. The nucleotide diversity in your 1st example should be equal to 2/15 (not 4/45, total 4 differences out of 30 comparisons=4/30=2/15).

    Comment


    • #3
      I am doing a very similar analysis. However, my SNP data is based off RAD data and the number of individuals genotyped at any given base may change (i.e., the number of "chromosomes" considered per each site). Therefore, using the n/(n-1) factor can't work for my data; instead, I am going through each base and calculating pairwise differences, then averaging over my arbitrary window size.

      For example, if I have the following five sites, the first two may only be genotyped for four 'chromosomes' and the last three for six.

      site: 1 2 3 4 5
      chr1 A A T G C
      chr2 A A T G C
      chr3 A A T C C
      chr4 A T T C C
      chr5 [ ][ ]T C G
      chr6 [ ][ ]A C G

      site 1: 0 pairwise differences (4 chr sampled)
      site 2: 3/6 differences (4 chr sampled)
      site 3: 5/15 (6 chr sampled)
      site 4: 8/15 (6 sampled)
      site 5: 8/15 (6 sampled)

      pi = 0.38 for this five-base window.

      Another thing that I figured out while working with pi and RAD data is that you have to be really careful if you're using a vcf file! vcfs obviously (usually) only contain variants; however, pi must consider the number of GENOTYPED bases, INCLUDING invariants! Other tools such as vcftools only use the pi positional data and assume all bases not included in the vcf are invariants. This will give you a biased estimate.

      As far as your specific questions, I think that yes, you are on the right track. I don't think that pi cares if your data is diploid or tetraploid; it simply samples the number of pairwise differences among chromosomes considered, correct?

      The only real difference between my example here and yours above are that you are considering differences among stretches of bases and I am walking along the chromosome. I think each method should give you the same answer as long as you have the same 'n' along the stretch of bases (and, importantly, you use the n/(n-1) factor as mentioned by bondsergy). I don't, therefore the base-by-base works best for me.

      You might also find this blog post useful:
      http://binhe.org/2011/12/29/calculat...mation-method/

      I'm not sure this will help you, but I'd like to get it out on the forums in case it helps someone.

      Comment


      • #4
        I thought I'd offer an improvement.
        sum( 2j(n-j) / n(n-1) ) only works for bi-allelic loci, but generalizing is easy.

        If you take a list of allele counts (Ji values) INCLUDING THE MAJOR ALLELE,
        sum( sum_i( Ji (n-Ji) ) / (n(n-1)) ) ) / L
        where L is the number of sites/positions compared.
        (BTW: sum_i is read as "sum over i", and Ji should really be "J sub i")

        I haven't rigorously done the proof, but it intuitively makes sense and agrees with the value you get by exhaustively looking at all the pair values. So I'm pretty certain it is correct.

        Ex:
        3 A, 2 C, 2 G
        = (12+10+10)/42 = 16/21

        Some python testing code:
        Code:
        #!/usr/bin/python
        from collections import Counter
        import itertools
        
        def calcPi(seq_list, transposed=True):
            """
            seq_list is an iteratable of iterables containing alleles by [position][seq_number]
            if transposed=False,
                then seq_list is by [seq_number][position]
                this format is slightly less efficient
            """
            if( transposed == False ):
                seq_list = itertools.izip(*seq_list)
            num_sites = 0
            sum_pi = 0
            for allele_list in seq_list:
                counts = Counter(allele_list)
                n = sum(counts.values())
                num_sites += 1
                sum_pi += sum(( j*(n-j) for j in counts.values() )) / (n*(n-1.0))
            return sum_pi / float(num_sites) # sequence pi is just mean of per-position pi values
        
        # Test it!
        import random
        import time
        
        LEN = 1000
        NUM_SEQ = 10
        NUM_ITERATIONS = 10
        NUCLEOTIDES = ['A','T','C','G']
        NUCLEOTIDES.extend(['A']*8) # over-represent 'A' to make sequences more similar
        #random.seed(0)
        CMP_EPSILON = 1e-10 # how close do results have to be to call them equal
        
        tics = 0
        exhaustive_tics = 0
        
        for _ in range(NUM_ITERATIONS):
            # generate NUM_SEQ random sequences of length LEN (trasposed so it is by [position][seq_number]
            At = [ [ random.choice(NUCLEOTIDES) for _ in range(NUM_SEQ) ] for _ in range(LEN) ]
        
            tic = time.clock()
            seq_pi = calcPi(At, transposed=True)
            tics += time.clock()-tic
        
            # exhaustive (all-pairs) per-sequence method
            A = map(list, itertools.izip(*At)) # transpose At so it is [seq_num][pos]
            tic = time.clock()
            pair_pi = []
            for i in range(len(A)):
                for j in range(i):
                    t = 0
                    d = 0
                    for k in range(len(A[i])):
                        t += 1
                        if( A[i][k] != A[j][k] ):
                            d += 1
                    pair_pi.append( 2.0*((1.0/float(len(A)))**2)*(d/float(t)) )
                    #print i,j, A[i],A[j], d,t, pair_pi[-1]
            seq_pi_exhaustive = (len(A)/(len(A)-1.0))*sum(pair_pi)
            exhaustive_tics += time.clock()-tic
        
            # output
            if( abs(seq_pi-seq_pi_exhaustive) > CMP_EPSILON ):
                print "MISMATCH", seq_pi,'!=', seq_pi_exhaustive
        
        print NUM_ITERATIONS, "cycles"
        print "calcuation took",tics
        print "exhaustive took",exhaustive_tics

        Comment


        • #5
          Adenin base insertion

          Hi All,

          Iam Irma, new comer from Bogor Agriculture University, Indonesia. I have 180 DNA sequence samples from duck population. When i analyze the data, why insertion of Adenin base always uniform in all of DNA samples? Another case is, when there are C to A mutation, in another place we found A deletion. Its like an equilibrium to balancq A composition. Any one who can explain this phenomenon? Thank you All

          Comment


          • #6
            unbiased nucleotide diversity estimation from SNP data

            Thanks for all your useful discussion about the topic!
            The only doubt I have regarding your formulas is a small discrepancy in the denominator of the base pair summation method.
            Both Travis Collier and Bin He propose the following formula: 2j(n-j)/(n(n-1)) here and in the blog entry mentioned above (http://binhe.org/2011/12/29/calculat...mation-method/).
            However, Begun et al., 2007 uses a formula that can be transformed to 2j(n-j)/(n(n-j)) (if I didn't make any mistake in the process...). I don't understand where this discrepancy comes from and which formula is the correct one...
            Another point that is not clear for me is if j reffers to (i)the count of the minor allele, (ii) the count of the alternative allele (either major or minor) or (iii) the count of the derived allele (if ancestral state is available).

            I could not find any publication explicitly showing the applicability of the base pair summation method for estimating pi... Could you find any? or have you published your results so that I can cite you?
            The purpose of my analysis is to accurately estimate nucleotide diversity from a cohort of 43 genomes (in fact, two populations of 16 and 27 genomes).

            Thank you very much!

            Begun's paper: (http://www.plosbiology.org/article/i...bio.0050310#s3)

            Comment


            • #7
              Hi Dmelnet,

              I think there is a typo in the Begun paper. If you compare the first part of the formular with the second, I think it should be 2j(n-j)/(n(n-1)).

              I could not find any publication explicitly showing the applicability of the base pair summation method for estimating pi... Could you find any? or have you published your results so that I can cite you?
              This is something I'm also interested in!

              Comment


              • #8
                typo found

                Thank you dschika!

                I wanted to use vcf tools for calculating pi but I have realized that, not only treates missing positions in the vcf file as invariant sites but it also treats missing genotypes as invariant sites (hom ref)! I think the correct way would be using the total number of genotyped alleles per position instead of just the total number of alleles per position (genotyped or missing data).
                Do you know if this is already implemented in any other tool?

                I will have a look at this: "PopGenome: An Efficient Swiss Army Knife for Population Genomic Analyses in R" (http://mbe.oxfordjournals.org/conten.../molbev.msu136) and I will let you know. If you have experience with it, share it also!

                Comment


                • #9
                  I also use vcftools a lot, but, because of the reasons you mentioned, it is not a good solution for pi calculation in this case.
                  Unfortunately, I have no experience with PopGenome.
                  I also searched a bit, but could not find an existing tool for this task.

                  I implemented now the Begun method (with c(c-1)) in the denominator of the nominator of the pi formula) in a python script using a vcf file and a 'coverage' file generated with bedtools for each individual. Those files report the coverage of each position to a given reference. From the bedtools files I can calculate how many bases in total were covered by c individuals and from the vcf file I get the variants with coverage c and the number of j (equals the number of 1's) in it.

                  With this approach, I do not use missing data or uncovered positions.

                  Comment


                  • #10
                    Me too!

                    I cannot belive it! I have done exactly the same in perl!!! Well... a slightly different implementation calculating a pi per site that is stored in the INFO field of the vcf. Then, it can be used to calculate a pi per window (summation of pi per site and then division by the number of positions analysed).
                    I was afraid I was wasting my time if this was already implemented somewhere else... but I have specific filtering criteria for the vcf calls and I also want to calculate other statistics such as Fst, watterson's theta, and all that stuff... and I have to finish all this in a couple of weeks!

                    Comment


                    • #11

                      Sounds good!

                      What exactly is your divisor in
                      (summation of pi per site and then division by the number of positions analysed)
                      ?

                      The number of variant bases or a window size of e.g. 1000 bp?
                      If you divide by a certain window size: Do you adjust somewhere for different coverages in the non-variant sites?

                      This was one of my problems, as I had different coverages over my sample and could not assume that all non-variant sites had same coverage. I mean, if I have a position that is covered by 5 of 20 individuals, I cannot assume that the remaining 15 individuals all show the reference nucleotide at that position. Therefore, I used the Begun method which takes coverages into account.

                      Comment


                      • #12
                        my calculation of pi per site is the following:

                        sub PixSite(){
                        my $c=$_[0];
                        my $j=$_[1];
                        my $pi=0;
                        $pi=(2*$j*($c-$j))/($c*($c-1));
                        return $pi;
                        }

                        where c is the number of covered positions (e.g 15 out of 20 individuals would give 30 alleles at this site, it corresponds to AN field in GATK INFO column of your vcf file) and j is the number of alternate alleles (AC field in the GATK INFO column of your vcf file, e.g if 5 individuals are homozygous for the alternate allele, j would be 10).

                        This comes from a small rearrangement of Begun's formula and from the definition itself.

                        Once I have a pi per site adjusted for different coverages, I can calculate average pi per window.

                        For doing this, I count the number of analysed positions in the window of 1000bp that pass certain filtering criteria taking into account both variant and invariant sites (e.g 990 out of 1000bp have good coverage in 5 or more genomes, and are not part of repetitive elements, and and and...). Then I divide the summatory of pi per site by the number of analysed positions in the window (990).
                        It is not necessary to adjust for the heterogeneous coverage in the invariant sites because pi in those sites is 0 and it doesn't matter the value of the denominator c(c-1).

                        I hope it helps!

                        By the way, are you working on drosophila?

                        Comment


                        • #13
                          It is not necessary to adjust for the heterogeneous coverage in the invariant sites because pi in those sites is 0 and it doesn't matter the value of the denominator c(c-1).
                          Yes, pi is zero in those regions. But the coverages of invariant sites are important (as far as I understood it), as the coverage of the invariant sites influences the total amount of possible pairwise comparisions. Looking again at the denominator of the Begun method, they explain that:
                          n_c is the number of aligned base pairs in the genomic region
                          which I interpreted to be coverage depended. Maybe I miss something here?

                          By the way, are you working on drosophila?
                          I work with plants, mainly a conifer species and crops.

                          Have you published your complete script already somewhere? Or do you plan to do so?

                          Comment


                          • #14
                            If I interpreted correctly the formula, you can expand the summatories and reorganize them such that you get what I posted above, where the coverage in invariant sites does not influence the final result. I did the calculations by hand for a few examples and it worked...
                            When I finish editing the code I can upload it somewhere so that other people can use it (at their own risk). The paper still needs a lot of work to get published but hopefully it will be available in a mid-term future ;-)

                            I will leave the lab where I am currently doing research on December and I will defend my thesis on July 2015, so I will try to have something more or less finished by then...

                            Comment


                            • #15
                              I have used the formula suggested by Travis Collier to obtain Pi values on a per-site basis from a vcf in Python. I now want to filter the sites and calculate mean and median pi values in windows, similar to what Dmelnet did. I was wondering if anybody has any suggestions about how to efficiently obtain these windowed results. I am new to programming and bioinformatics, so your help is greatly appreciated.

                              Also, I don't know if anyone here is analyzing multiple populations, or would like to calculate Dxy, but I think this formula works using the variables stored for the pi calculations and will give you per-site Dxy between two populations.
                              ((n1*n2)-sum_i(J1_i*J2_i))/(n1*n2)
                              for population 1 and population 2 where J_i is the allele count (# of 0,1,2, or 3's) at a given site in the vcf for either population.
                              Code:
                                 if n_seq_POP_1 > 0 and n_seq_POP_2 >0 :
                              	POP_1_POP_2_dxy=((n_seq_POP_1 * n_seq_POP_2)- ((j0_POP_1 * j0_POP_2)+(j1_POP_1 * j1_POP_2)+(j2_POP_1 * j2_POP_2)+(j3_POP_1*j3_POP_2)))/(n_seq_POP_1*n_seq_POP_2)
                              Thank you,
                              Jimmy

                              Comment

                              Latest Articles

                              Collapse

                              • seqadmin
                                Strategies for Sequencing Challenging Samples
                                by seqadmin


                                Despite advancements in sequencing platforms and related sample preparation technologies, certain sample types continue to present significant challenges that can compromise sequencing results. Pedro Echave, Senior Manager of the Global Business Segment at Revvity, explained that the success of a sequencing experiment ultimately depends on the amount and integrity of the nucleic acid template (RNA or DNA) obtained from a sample. “The better the quality of the nucleic acid isolated...
                                03-22-2024, 06:39 AM
                              • seqadmin
                                Techniques and Challenges in Conservation Genomics
                                by seqadmin



                                The field of conservation genomics centers on applying genomics technologies in support of conservation efforts and the preservation of biodiversity. This article features interviews with two researchers who showcase their innovative work and highlight the current state and future of conservation genomics.

                                Avian Conservation
                                Matthew DeSaix, a recent doctoral graduate from Kristen Ruegg’s lab at The University of Colorado, shared that most of his research...
                                03-08-2024, 10:41 AM

                              ad_right_rmr

                              Collapse

                              News

                              Collapse

                              Topics Statistics Last Post
                              Started by seqadmin, Yesterday, 06:37 PM
                              0 responses
                              11 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, Yesterday, 06:07 PM
                              0 responses
                              10 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, 03-22-2024, 10:03 AM
                              0 responses
                              51 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, 03-21-2024, 07:32 AM
                              0 responses
                              68 views
                              0 likes
                              Last Post seqadmin  
                              Working...
                              X