  • Statistical model for RNA-Seq sensitivity estimation

    Dear All,

    I apologize if an existing answer to the following, basic question is somewhere buried in the forum - if yes, then a quick search did not reveal it.

    I'm looking for the right statistical model to compute the required sequencing depth for detecting a rare isoform with a certain probability in RNA-Seq data. Or, in other words, I would like to compute the sensitivity of an RNA-Seq experiment for finding minority isoforms at a given sequencing depth and isoform characteristic.

    The problem is closely related to differential expression analysis, but I have serious problems combining the right models (Poisson, beta-binomial) at the right positions. Perhaps one of the statistically minded people working on RNA-Seq has an idea. Of course, partial solutions or caveats are also very welcome.

    Here is a contrived example with rough numbers: Let's assume that I want to look for a rare isoform that only occurs as n (=10) of the N (=100,000) overall mRNA transcripts per cell. How many reads of length r (=100bp) do I need to sequence from my library derived from the total mRNA of k (=1,000,000) cells so that I will sequence at least m (=3) reads from my rare isoform with a probability P >= p (=0.999)?

    Bonus points: of course, a useful estimate may also depend on how easily I can distinguish the rare isoform from its more abundant brethren originating from the same gene. After all, I may receive reads from my rare isoform with probability P, but only ones that are indistinguishable from the other isoforms, since the isoforms are identical for most of the sequence. For simplicity, let's assume that all isoforms of the gene are L (=1000bp) long and can be differentiated from each other by one single stretch of length l (=200bp), which encodes an alternatively spliced exon and uniquely tags an isoform.

    I realize this is a complex example, but perhaps it's not without merit. Also, who better to ask it than you guys. Anyways, thanks for any insights!

    Cheers, Sven

    --
    Sven-Eric Schelhorn - http://mpi-inf.mpg.de/~sven
    Max Planck Institute for Informatics, Saarbrücken
    D3 - Computational Biology & Applied Algorithmics

  • #2
    very interesting question..
    any replies please... :-)


    • #3
      You are halfway there, and a few extra preparations make things easier. First, how many cDNA fragments do we get from your N transcript molecules? Let's say we fragment to lf=200 bp pieces, and the average length of the genes is L=1000 bp. Then we should get roughly N' = N*L/lf = 500,000 cDNA molecules out of that. How many of these tell us about the isoform that you want to detect? If only a single stretch of l=200 bp is useful to ascertain that it is the isoform of interest and not another one, and if your gene has length 1000 bp, then it fragments, on average, into 5 pieces, only one of which is useful, i.e., we get n'=10 useful cDNA molecules which we have to fish out from N'=500,000 molecules. So, taking a random read, the probability that it is from the transcript stretch you are looking for is p = n'/N' = 1/50,000.
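
The arithmetic above can be sketched in a few lines of Python. This is only an illustration under the same simplifying assumptions as the post (every transcript has the average length, fragments are non-overlapping pieces of exactly lf bp, and exactly one piece per isoform copy is informative); the function and parameter names are made up for this example.

```python
from fractions import Fraction

def per_read_probability(n, N, transcript_len, frag_len, informative_len):
    """Probability that a random read comes from the isoform-specific stretch.

    Assumes (as in the post) that every transcript breaks into
    transcript_len / frag_len equal pieces and that a piece is either
    fully informative or not informative at all.
    """
    total_fragments = Fraction(N * transcript_len, frag_len)        # N' = N * L / lf
    useful_per_copy = Fraction(informative_len, frag_len)           # informative pieces per isoform copy
    useful_fragments = n * useful_per_copy                          # n'
    return useful_fragments / total_fragments                       # p = n' / N'

p = per_read_probability(n=10, N=100_000, transcript_len=1000,
                         frag_len=200, informative_len=200)
print(p)  # 1/50000
```

Using exact fractions keeps the result readable as 1/50,000 instead of a floating-point value.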

      Note that we did not need the number of cells; we just assume that we have enough so that we can ignore the possibility that the few cells we are looking at happen to not contain the rare isoform, or that we lose it during sample prep because there are so few.

      Hence, the answer is simple now: If you get a total of, say, NR = 2,000,000 reads from your sequencing run, the probability that none of these contains the stretch of sequence that proves existence of your isoform is given by (1-p)^NR = 4e-18, i.e., it is nearly certain that you find it, using the numbers you suggested. This is because the expected number of reads from the stretch, p*NR, is 40, which is a lot. If n' were smaller, say, 1, this would look different.
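
To connect this back to the original question (at least m reads with a given probability), here is a minimal sketch that treats the number of informative reads as Poisson-distributed with mean p*NR, which is a close approximation to the binomial for such a small p. The function names are hypothetical, and `target` stands for the desired detection probability that the original question calls p (0.999).

```python
import math

def prob_zero_reads(p, num_reads):
    """Exact binomial probability of seeing no informative read: (1 - p)^NR."""
    return math.exp(num_reads * math.log1p(-p))

def prob_at_least(m, p, num_reads):
    """P(X >= m) for X ~ Poisson(p * num_reads) (approximation to the binomial)."""
    lam = p * num_reads
    term = math.exp(-lam)          # P(X = 0)
    cdf = 0.0
    for k in range(m):             # sum P(X = 0) ... P(X = m - 1)
        cdf += term
        term *= lam / (k + 1)
    return 1.0 - cdf

def min_reads(m, p, target):
    """Smallest number of reads with P(X >= m) >= target, found by bisection."""
    lo, hi = 0, 1
    while prob_at_least(m, p, hi) < target:
        hi *= 2
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if prob_at_least(m, p, mid) >= target:
            hi = mid
        else:
            lo = mid
    return hi

p = 1 / 50_000
print(prob_zero_reads(p, 2_000_000))   # on the order of 4e-18
print(min_reads(3, p, 0.999))          # reads needed for >= 3 informative reads
```

Under these assumptions the required depth corresponds to an expected count p*NR of a little over 11 informative reads, i.e. somewhat more than half a million reads in total.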

      Finally, if you want to quantify the abundance n/N, the relative precision of your quantification is roughly 1/sqrt(p*NR), due to Poisson noise. Here, 1/sqrt(40) = 16%.
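
The same rule of thumb can be inverted to ask how many reads a given relative precision would need. The sketch below assumes pure Poisson counting noise and ignores any biological or technical variability beyond it; the function names are illustrative only.

```python
import math

def relative_precision(p, num_reads):
    """Relative standard error of the count under Poisson noise: 1 / sqrt(p * NR)."""
    return 1.0 / math.sqrt(p * num_reads)

def reads_for_precision(p, rel_error):
    """Approximate number of reads so that 1 / sqrt(p * NR) <= rel_error."""
    return math.ceil(1.0 / (p * rel_error ** 2))

p = 1 / 50_000
print(round(relative_precision(p, 2_000_000), 3))  # 0.158
print(reads_for_precision(p, 0.10))                # roughly 5 million reads for 10%
```

Because the error shrinks only with the square root of the read count, halving the relative error costs about four times the sequencing depth.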


      • #4
        I think it's time to write up a book of Simon's replies. I'm constantly stunned by the things I still do not understand but greatly appreciate being educated.


        • #5
          Thanks, Simon, for setting this straight for me. And I agree with Jon's suggestion. Your answers are usually both precise and comprehensive, which makes you a great asset for this forum.


          • #6
            Thanks

            Hi - a belated thanks for this informative thread.
            I am setting up some course notes and the arithmetic was helpful.
            I think there is a term switch between the OP and Simon, who use 'p' for two different quantities (OP: the desired detection probability; Simon: the probability of a random read coming from the target transcript stretch).
            Cheers, Doug
