Announcement

Collapse
No announcement yet.

How to Estimate the Average Coverage per RAD Locus

Collapse
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • How to Estimate the Average Coverage per RAD Locus

    Hi all,

    I am trying to estimate the sequencing effort needed for a valid SNP call using RADtags on a non-model organism.

    To have an idea of the average cover per locus I used the following equation:

    [No. Illumina Reads x read length] / [No. Individuals x Expected RAD sites across the genome x 2 RADtags per RAD site x average length of each RAD tag]


    For example, let's say we have a non-model species with a genome size of 350 Mb. We digest the DNA with and an 8-cutter RE such as SbfI. Assuming equal proportion of each base across (A=T=G=C=0.25) the genome we'll expect:

    (0.25^8) x 350,000,000 = 5,340 RAD sites


    After the size selection step the average RAD tag size will be 500 bp. If we use one Lane of an Illumina GAIIx (35 million reads of 75bp) to sequence 100 pooled individuals then the coverage per RAD locus would be:

    (35,000,000 x 75) / (500 x 2 x 5340 x 100) = 4.9 X

    As for de novo species is recommended at least 60 X (Davey et al. 2011 Nature Genetics) I think it will be necessary to use 8 individuals per lane.

    Are my calculations right? what do you think about the final decision?

  • #2
    Hello fcr,
    Your calculation of the RAD sites is correct, however instead of using the size of the RAD tags, since you won't be sequencing the tags, only the ends, this is how the math would look on coverage:

    5,340 RAD fragments
    10,680 RAD Sites (cut ends of fragments)


    If you wanted 60x coverage of these, you would need a minimum of:

    10,680 x 60 = 640,800 sequences needed for each sample
    (At 2x75 bp reads, this would be 48 Mb of sequence needed per sample)

    At this level of sequencing you would be able to multiplex approximately:

    35,000,000 / 640,800 = 54 samples per lane

    You can calculate all of the the math in sequence space. You do not need to go into base sequence to make these calculations, which is what you were doing above by including the length of the fragment. The sequence read length does not make a difference on how many RAD ends you will sample (sample = sequence), this is determined by the output of the machine. The read length will define how far you read into the fragments for SNP detection. Also, I would definitely recommend sequencing these as paired-ends.

    Please let me know if you have any other questions.

    jon

    Comment


    • #3
      Thanks for your answer Jon,

      I have two more questions ...

      First-To make sure I got it right. I calculated the coverage after sequencing in the next example using Holenhole's data for Stickleback. In their case 8 individuals were barcoded and sequenced in one lane. Holenlohe et al. provided a supplementary table, where for a single run they provide the following information:

      Individuals= 16
      Total # reads= 8895289
      Barcoded reads = 8269024
      Aligned reads= 6497736
      RAD tags= 41590
      Sequence length= 26
      Nucleotides=1094434

      Then following your explanation above I would say that the coverage per RAD site and indivual would be calculated as Aligned Reads/RAD tags x Individuals, thus resulting in 6497736/ (41590 x 16) = 9.76 X coverage

      Second: I think your explanation applies to single-end reads. What happen if you use pair-end reads? (barcoding and pooling individuals on a lane)


      Cheers,
      Fernando

      Comment


      • #4
        fcr,
        This is correct. One thing that can be confusing to some is the difference between physical coverage and sequence coverage. Physical coverage is what I used for the math above, and in general, can be used to figure the coverage on the ends. Sequence coverage is mainly used when DNA (or RNA) is randomly fragmented and sequenced to a certain depth of coverage against a reference. Sequence coverage is useful to consider when sequence reads are generated from fragment ends that are assumed to be randomly distributed across a genome. Therefore, one can figure the average sequence coverage by taking the number of aligned bases divided by the total number of reference bases. RAD-seq sequences ends that are generated in a very non-random way and thus can be modeled by the physical coverage, since all of the reads should have the same start site from the P1 adapter. After writing this I realized that you gain nothing from sequencing these as paired-ends since the 2 read will only be randomly distributed through the fragment, and not very useful for comparative SNPs.

        Hope this helps.

        jon

        Comment


        • #5
          Hi Jon,

          Actually, I thought that the main purpose of using pair-end seq. with RADs was helping to remove PCR duplicates. As fragments of equal size with identical sequence at both ends are unlikely to occur with random shearing.


          Thanks a lot!
          Fernando

          Comment


          • #6
            Dear Fernando,
            Good point! I was only considering the utility of comparative SNPs, which of course would need to have deduplication. Great point. Paired-end it is.

            jon

            Comment


            • #7
              Hi Jon,

              Using pair-end reads in Illumina platform will also double the coverage (~70 million pair end-reads instead of 35 million single-reads for a GAIIx Lane). Do you know if using pair-end reads is far more expensive?

              I was discussing with my colleagues about using RADs through a technical service. However, we think is worthy to perform a pilot experiment in order to fine-tune the sequencing effort. Do you think to digest DNA for one individual with SbfI and estimate the number of RAD fragments obtained after size selection? I guess it can't be be done without sequencing because is necessary to identify he sequences containing the adapters.

              I was checking your website but it seems that your company don't do RADs. Do you know any company offering this service? If they have an European headquarter will be better for us...

              Cheers,
              Fernando

              Comment


              • #8
                fcr,
                sorry for my silence, I was traveling.

                Using pair-end reads in Illumina platform will also double the coverage (~70 million pair end-reads instead of 35 million single-reads for a GAIIx Lane).

                It will double the coverage, however it does not give you any additional RAD SNP power. The P1 adapter end is the one with the RAD flanking region and with single end sequencing you would generate ~30 million reads from those ends, however when you add paired-end reads, you then generate ~60 million reads, split across both ends of the fragment (P1 and P2 adapter), which still gives you ~30 million reads at the P1 end.

                I was discussing with my colleagues about using RADs through a technical service. However, we think is worthy to perform a pilot experiment in order to fine-tune the sequencing effort. Do you think to digest DNA for one individual with SbfI and estimate the number of RAD fragments obtained after size selection? I guess it can't be be done without sequencing because is necessary to identify he sequences containing the adapters.

                My initial thought would be to go ahead and use 3-4 RE to cut the DNA, if DNA quantity is not a concern. That part will take the longest time anyway, so you might as well run several in parallel. You can then barcode the libraries and sequence them in one lane and gain more than n=1.

                I was checking your website but it seems that your company don't do RADs. Do you know any company offering this service? If they have an European headquarter will be better for us..

                Floragenex in the US has IP wrapped around the original protocol, thus they are one of the few suppliers. We are in the process of developing a novel and more cost effective protocol, however it is not prime-time yet.

                Please let me know if you have additional questions. I am glad to help.

                Jon
                Last edited by Cofactor Genomics; 02-08-2012, 12:25 AM.

                Comment


                • #9
                  Hi Jon,

                  Thanks a lot for your answer. No worries for the delay

                  In mean time, we already have contacted Floragenex and we contemplate GBS instead of RADs. But I guess we'll obtain lower number of fragments than with RAD for same enzyme...

                  I'll definitely come back to you for advice.

                  Cheers,
                  FCR

                  Comment


                  • #10
                    Dear fcr,
                    With the current cost per base of the high output machines such as the HiSeq2000, it may make sense to perform GBS as opposed to RAD depending on your goals as RAD was originally developed for arrays and then ported over to sequencing. RRS (reduced representation sequencing) can be an excellent strategy if hardware changes are in place to compensate for the low-complexity cut sites that can cause problems on the ILMN machines. Please feel free to contact me off the board if you would like some suggestion as to coverage and SNP/indel cutoffs as well as strategies to reduce low quality due to unequal base distribution at the cut sites. Best of luck!
                    jon

                    Comment


                    • #11
                      Dear fcr,
                      One question that I did not address above is the cost of paired-end versus single-ends. Of course, it depends on the platform, however on the ILMN, it is ~1.5x cost of single-ends.

                      jon

                      Comment


                      • #12
                        Hi Jon,

                        Actually with current costs we discussed doing GBS using PstI 48 individuals. It brings associated sequencing on a HiSeq Lane (~100 million single reads). So assuming we'll get 10,000 RAD fragments for our 350Mb genome, the coverage per RAD site will be ~195 X. I don't know in the end too much coverage will complicate things, such as identifying individual locus and calling SNPs...what do you think?

                        On top of that, GBS will produce less fragments than RADs, I guess half of them, doubling the coverage. making it even worse, right?

                        I don't really know what you actually meant with "if hardware changes are in place to compensate for the low-complexity cut sites that can cause problems on the ILMN machines." could you clarify this?

                        I would really appreciate suggestions to coverage and SNP/indel cutoffs as well as strategies to reduce low quality due to unequal base distribution at the cut sites.


                        Thanks again,
                        Fernando

                        Comment

                        Working...
                        X