  • #31
    Originally posted by pengchy:
    Hi dpryan,

    Sorry, I don't follow your meaning.
    The p-value or adjusted p-value will be used to detect differentially expressed genes. If differences in p-value are caused by gene length, the ordering of p-values will misguide the follow-up biology experiments.

    Thank you.
    If the biologists (I am one myself) are thrown off by this, then they should find a different profession. This concept is incredibly common in molecular biology and in whatever other sub-field your colleagues might work in.

    The p-value isn't being caused by gene length. Rather, longer genes will tend to have more reads, meaning that there can be more evidence for or against differential expression. Other factors affecting the p-value are within-group variability, between-group variability (i.e., fold change) and general level of expression (depending on exactly how the statistics are done). The p-value, then, is just one factor that should affect their decision of which follow-up experiment(s) to perform.


    • #32
      Hi dpryan,

      Thank you for your reply.

      I agree with all of your viewpoints.

      From your explanation, it can be concluded that the p-value is influenced not only by gene length but also by other factors. Here the focus is gene length: if we can reduce its influence at the DEG detection step, why not try?


      • #33
        After reading so many posts, I feel that length-adjusted counts should be a better way of doing DE analysis. The manual suggests that length has an equal effect across samples, but when we do DE analysis we care about the top of the DE list, which does involve comparison between genes.

        Since RNA-seq cannot give us the exact counts of genes anyway, length-normalized and rounded counts are just another approximation of the data.

        Just my two cents.


        • #34
          Originally posted by pengchy:
          Hi dpryan,

          Thank you for your reply.

          I agree with all of your viewpoints.

          From your explanation, it can be concluded that the p-value is influenced not only by gene length but also by other factors. Here the focus is gene length: if we can reduce its influence at the DEG detection step, why not try?
          The underlying problem here is the way that RNA-Seq data is collected: by random sampling of fragmented cDNA. This intrinsically means that if you have 2 genes with a copy number of 100, but one is 10X the length of the other, then on average you will get 10X the number of RNA-Seq reads from the longer gene even though both exist at the same expression level.
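
As a rough illustration of the arithmetic above, here is a minimal Python sketch. It assumes fragments are sampled uniformly from the total transcript mass (copies times length) and ignores every other bias; the function name and the gene lengths are made up for the example.

```python
def expected_read_share(copies, lengths):
    """Fraction of fragments expected from each gene when fragments are
    sampled uniformly from total transcript mass (copies x length)."""
    mass = [c * l for c, l in zip(copies, lengths)]
    total = sum(mass)
    return [m / total for m in mass]

# Two hypothetical genes at the same copy number; gene B is 10x longer.
copies = [100, 100]
lengths = [1_000, 10_000]  # bases, illustrative values only

share_a, share_b = expected_read_share(copies, lengths)
ratio = share_b / share_a  # expected read ratio, long gene vs short gene
print(f"long/short expected read ratio: {ratio:.1f}")
```

Under this simplified model the ratio of expected reads equals the ratio of lengths, whatever the shared copy number is.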

          This difference in observation then passes through to the statistical analysis, where what matters is how accurately the expression of each gene is measured, as well as the level of change in expression. The more observations you have, the more accurately you can infer the true expression level and the easier it becomes to detect differential expression at a given fold change. You'll therefore find that, for the same fold change, it's easier to detect changes in longer genes, and the associated p-values will be lower.
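
To make that concrete, here is a small Python sketch under strong simplifying assumptions: counts carry Poisson noise only (no biological variability), one sample per condition, and a normal approximation to the difference of two counts. This is not how DESeq computes p-values; the function name and the counts are hypothetical.

```python
import math

def approx_two_sided_p(mean_a, mean_b):
    """Normal-approximation p-value for the difference of two Poisson
    counts with the given means (variance of the difference = sum)."""
    z = (mean_b - mean_a) / math.sqrt(mean_a + mean_b)
    return math.erfc(abs(z) / math.sqrt(2))

fold_change = 1.5
short_gene = 20   # expected reads in condition A, short gene (illustrative)
long_gene = 200   # same expression level, 10x the length

p_short = approx_two_sided_p(short_gene, short_gene * fold_change)
p_long = approx_two_sided_p(long_gene, long_gene * fold_change)
print(f"short gene p = {p_short:.3g}, long gene p = {p_long:.3g}")
```

Both genes change by the same fold, but the gene with ten times the reads gets a far smaller p-value in this toy model, simply because its counts are measured more precisely.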

          You mentioned the idea of correcting for this observation bias, and I guess you could do this, but the problem is that you can only do it by making the well-observed data worse. There's nothing you can do to make the poorly observed (shorter) genes better. Pretty much all of the statistical approaches use some direct transformation of read counts in their statistical tests, since this provides the most direct and relevant measure. You could run your statistics on counts which have been length-normalised (RPKM), but all you end up doing is mixing together, at the same value in your data, different observation levels with very different levels of noise. That is, a high RPKM value could be a long gene with a large number of observations, for which you can be very sure of the value, or a short gene with a low number of observations, where the true expression level is not known with any certainty. Taking this approach won't help you improve your analysis (quite the opposite) and won't make it any fairer; it will just put the biases in a different place.
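
A short Python sketch of that last point, using the usual RPKM definition (reads per kilobase of gene per million mapped reads) and assuming pure Poisson counting noise, so the relative error of a count n is about 1/sqrt(n). The gene lengths, read counts and library size are invented for illustration.

```python
import math

def rpkm(reads, length_bp, total_mapped_reads):
    """Reads per kilobase of gene per million mapped reads."""
    return reads * 1e9 / (length_bp * total_mapped_reads)

def poisson_relative_error(reads):
    """Approximate relative standard error of a Poisson count."""
    return 1.0 / math.sqrt(reads)

total_mapped = 20_000_000  # hypothetical library size

# Two hypothetical genes that end up with the same RPKM.
long_rpkm = rpkm(reads=1000, length_bp=10_000, total_mapped_reads=total_mapped)
short_rpkm = rpkm(reads=50, length_bp=500, total_mapped_reads=total_mapped)

long_err = poisson_relative_error(1000)
short_err = poisson_relative_error(50)
print(f"RPKM: long={long_rpkm:.2f}, short={short_rpkm:.2f}")
print(f"relative error: long={long_err:.3f}, short={short_err:.3f}")
```

The two genes are indistinguishable by RPKM, yet the short gene's value rests on 20 times fewer reads and is several times noisier, which is the information a count-based test keeps and an RPKM-based test discards.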

          I guess the ultimate solution to this will come when we lose the length restriction on sequence reads, so that every transcript is read in its entirety, but I'm not holding my breath for this.

          One thing we have been doing to help make DE analysis fairer is to use the intensity difference analysis approach in SeqMonk to help order the hits coming out of DE analysis. This doesn't change the set of hits you'd get out of something like DESeq, but it can be useful in helping to prioritise which are the most interesting. The basic approach is that we construct a local distribution of differences for genes with similar average expression to the gene being tested. We can then compute z-scores for each DE hit using the local level of noise, which provides an improved 'fold change' type measure that we've found useful for ranking hits and selecting the top hits to follow up.
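
The general idea of a local z-score can be sketched in Python as below. This is only an illustration of the description above, not SeqMonk's actual implementation: the window definition (a fixed number of neighbours on each side in the expression ranking), the function name and the toy data are all assumptions.

```python
import statistics

def local_z_scores(mean_expr, diffs, window=2):
    """Z-score each gene's difference against the differences of up to
    `window` genes on either side of it in the average-expression ranking
    (the gene itself is excluded from its own background)."""
    order = sorted(range(len(mean_expr)), key=lambda i: mean_expr[i])
    z = [0.0] * len(diffs)
    for rank, i in enumerate(order):
        lo = max(0, rank - window)
        hi = min(len(order), rank + window + 1)
        neighbours = [diffs[j] for j in order[lo:hi] if j != i]
        if len(neighbours) < 2:
            continue  # not enough neighbours to estimate local noise
        sd = statistics.stdev(neighbours)
        if sd > 0:
            z[i] = (diffs[i] - statistics.mean(neighbours)) / sd
    return z

# Toy data: log2 average expression and log2 differences between conditions.
# Low-expression genes are noisy; high-expression genes are tight.
mean_expr = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
diffs = [1.0, -1.2, 0.9, -0.8, 1.1, 0.1, -0.1, 0.05, 1.0, -0.05]

z = local_z_scores(mean_expr, diffs)
print(f"z (low expression, diff 1.0):  {z[0]:.2f}")
print(f"z (high expression, diff 1.0): {z[8]:.2f}")
```

In this toy data the same difference of 1.0 is unremarkable among noisy low-expression genes but stands out strongly among tightly measured high-expression genes, which is the kind of ranking signal described above.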


          • #35
            I feel the length differences between most genes are not huge, generally within 10-fold. How much noise could that introduce?


            • #36
              I am stuck on normalization of RNA-seq data using DESeq.
              I used the command "counts( data, normalized=TRUE )" but got the error "Error in .local(object, ...) : unused argument (normalized = TRUE)".
              How can I get rid of this problem?
