  • DESeq with SPIA

    Hello all,

    I've been using SPIA (Signaling Pathway Impact Analysis), and I think it's great. I'm surprised I don't see it on the forum more often, so I thought I would start this thread for two reasons: one, to discuss SPIA and how to use it, and two, to discuss how DESeq (which I'm guessing most of us would be using if we were to use SPIA) works with it.

    Let me start with DESeq. When I write out the res file that I generate from DESeq, I get entries that are Inf or #Name when a gene doesn't have any transcripts. SPIA doesn't know how to handle these, and it crashes. So far, I've just been manually changing them (I guess I could use R) to numbers on the higher end. Does anyone else have a more scientific solution?
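The manual workaround described above can be automated. The following is only a minimal sketch, assuming the Inf entries are log2 fold changes for genes with zero counts in one condition and that undefined (NaN) entries come from genes with no counts at all. The function name `cap_nonfinite`, the `margin` parameter, and the gene names are hypothetical, and capping is a pragmatic choice rather than a statistically justified one.

```python
import math


def cap_nonfinite(log2fc, margin=1.0):
    """Cap +/-Inf at (largest finite magnitude + margin) and drop NaN entries.

    `log2fc` maps a gene identifier to its log2 fold change.
    """
    finite = [abs(v) for v in log2fc.values() if math.isfinite(v)]
    cap = (max(finite) if finite else 0.0) + margin

    cleaned = {}
    for gene, value in log2fc.items():
        if math.isnan(value):
            continue  # no counts in either condition: nothing to estimate
        if math.isinf(value):
            value = math.copysign(cap, value)  # keep the direction of change
        cleaned[gene] = value
    return cleaned


# Hypothetical input, not real data.
example = {
    "geneA": 1.5,
    "geneB": float("inf"),
    "geneC": float("-inf"),
    "geneD": float("nan"),
    "geneE": -3.0,
}
cleaned = cap_nonfinite(example)
```

Capping relative to the largest finite value keeps the affected genes at the extreme end of the ranking without introducing an arbitrary constant, but the capped values should still be treated as placeholders.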

    Also, my annotation GTF uses gene names for human, hg19, but SPIA only uses ENTREZ gene IDs. So I convert them using Clone|Gene ID converter (this gets maybe 70% of them). Does anyone have a GTF file that uses just ENTREZ gene IDs, or know where I can get one?

  • #2
    CAVE: selection bias in RNA-seq

    SPIA and all other gene over-representation analysis methods suffer from the gene length and expression level biases of RNA-seq data.

    This means that longer genes, and even more so highly expressed genes, are more easily called statistically significant by nearly all statistical RNA-seq methods. This impairs all methods that use p- or q-value cut-offs.

    see:
    Gene ontology analysis for RNA-seq: accounting for selection bias
    Matthew D. Young, Matthew J. Wakefield, Gordon K. Smyth, Alicia Oshlack
    Genome Biology 2010, 11:R14 (4 February 2010)

    Gene set enrichment analysis methods are better, as they don't use p- or q-value cut-offs.

    I also used SPIA with RNA-seq data. Like you, I set Inf values to high numbers, and I averaged the expression values of all genes mapped to one ENTREZ ID.
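The averaging step mentioned above could look roughly like this. It is a sketch only: the function name `average_by_entrez`, the symbols, and the IDs are made up for illustration, and the symbol-to-ID table is assumed to come from whatever converter is used. Genes without a mapping are dropped.

```python
from collections import defaultdict


def average_by_entrez(value_by_symbol, symbol_to_entrez):
    """Average the values of all gene symbols that map to the same ID."""
    grouped = defaultdict(list)
    for symbol, value in value_by_symbol.items():
        entrez = symbol_to_entrez.get(symbol)
        if entrez is None:
            continue  # symbol could not be converted; skip it
        grouped[entrez].append(value)
    return {entrez: sum(vals) / len(vals) for entrez, vals in grouped.items()}


# Hypothetical symbols and IDs, not real annotations.
values = {"SYM_A": 2.0, "SYM_A_ALT": 1.0, "SYM_B": -0.5, "SYM_C": 0.7}
mapping = {"SYM_A": "1001", "SYM_A_ALT": "1001", "SYM_B": "1002"}
averaged = average_by_entrez(values, mapping)
```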

    • #3
      I and my collaborators also find SPIA very useful. Maybe this is naive, but doesn't FPKM fix this issue? If you don't like what Cufflinks is doing, can't you use something like eXpress to get FPKMs instead of counts (a la HTSeq-count)?

      • #4
        Originally posted by dietmar13
        SPIA and all other gene over-representation analysis methods suffer from the gene length and gene expression height bias ... this impairs all methods which use p- or q-value cut-offs....better are gene set enrichment analysis methods, as they don't use p- or q-value cut-offs.
        The fact that over-representation or enrichment analysis methods can be confounded by detection power is not new with RNA-Seq. This has been true for these methods all along. The key point is to use a 'background' set against which you compare that is controlled for that. So don't just use all genes that someone happened to find somewhere in a table as background for these types of analyses, but use a matched set of genes that in your experiment would have had roughly the same chance of detection as those that actually did make it to the top of your list.

        Over-representation (i.e. using a fixed cutoff and a hypergeometric test or the like) and enrichment analysis (i.e. looking for a trend in the test statistic that is associated with annotation) have fundamentally the same issue. Again, the key issue is choosing the right background set.

        The focus in some of these discussions on gene length is peculiar. Really the total number of counts is the dominant variable here, on which detection power depends.

        Best wishes
        Wolfgang
        Wolfgang Huber
        EMBL
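The effect of the background choice on a hypergeometric over-representation test can be sketched as follows. This is only an illustration under stated assumptions: the gene sets are synthetic, the function names are hypothetical, and "matched" is approximated here by simply restricting the background to genes assumed to have had enough counts to be detectable.

```python
from math import comb


def hypergeom_upper_tail(k, N, K, n):
    """P(X >= k) for X ~ Hypergeometric(N genes, K in pathway, n drawn)."""
    total = comb(N, n)
    return sum(
        comb(K, i) * comb(N - K, n - i) for i in range(k, min(K, n) + 1)
    ) / total


def overrepresentation_p(de_genes, pathway, background):
    """Over-representation p-value with everything restricted to the background."""
    background = set(background)
    de = set(de_genes) & background
    path = set(pathway) & background
    return hypergeom_upper_tail(len(de & path), len(background), len(path), len(de))


# Synthetic example: 100 annotated genes, of which only the first 40 are
# assumed to have enough counts to be detectable in this experiment.
all_genes = [f"g{i}" for i in range(100)]
detectable = all_genes[:40]
pathway = all_genes[:10]                 # pathway genes, all detectable
de_genes = all_genes[:5] + all_genes[10:15]

p_all = overrepresentation_p(de_genes, pathway, all_genes)
p_matched = overrepresentation_p(de_genes, pathway, detectable)
```

In this toy setup the p-value against the full annotation is smaller than the one against the detectable genes only, which is the kind of inflation a poorly chosen background can produce.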

        • #5
          Originally posted by Wolfgang Huber
          The fact that over-representation or enrichment analysis methods can be confounded by detection power is not new with RNA-Seq. This has been true for these methods all along. The key point is to use a 'background' set against which you compare that is controlled for that. So don't just use all genes that someone happened to find somewhere in a table as background for these types of analyses, but use a matched set of genes that in your experiment would have had roughly the same chance of detection as those that actually did make it to the top of your list.

          Over-representation (i.e. using a fixed cutoff and a hypergeometric test or the like) and enrichment analysis (i.e. looking for a trend in the test statistic that is associated with annotation) have fundamentally the same issue. Again, the key issue is choosing the right background set.

          The focus in some of these discussions on gene length is peculiar. Really the total number of counts is the dominant variable here, on which detection power depends.

          Best wishes
          Wolfgang
          I never really understood the focus on gene length; perhaps that is down to my naivete about the finer aspects of the statistics. I understand the need to adjust for this when comparing expression of gene A to gene B, but not when comparing expression of gene A in condition A to gene A in condition B. The ability to detect DE in smaller genes can be improved by increasing sequencing depth, so that, as you said, "total number of counts is the dominant variable".

          I tried incorporating the methodology of GOseq with very confounding results, but we had also sequenced to depths where we were not seeing significant increases in the detection of lowly expressed genes.

          • #6
            What I imagine is that gene length matters for actually calling something differentially expressed: if longer genes give higher counts, then DESeq requires a smaller fold change to call a longer gene differentially expressed, even though it actually has the same level of expression as a shorter gene.



            This is where Simon explains that longer genes can show up as more expressed, but he says this bias cannot be dealt with by the test for differential expression. He then states it should be taken into account by the gene enrichment test, but I'm not quite understanding how.
            Last edited by billstevens; 09-19-2012, 01:31 PM.
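The link between total counts and detection power discussed in this thread can be illustrated with a toy calculation. This is a sketch under strong assumptions: it uses a simple exact binomial test on two single samples with equal library sizes, not DESeq's negative binomial model, and the counts are invented. Under the null hypothesis of no change, the count in condition A given the total is Binomial(total, 0.5).

```python
from math import comb


def binomial_two_sided_p(count_a, count_b):
    """Two-sided exact binomial p-value for H0: both conditions have equal rates."""
    n = count_a + count_b
    total = 2 ** n
    lower = sum(comb(n, i) for i in range(0, count_a + 1)) / total   # P(X <= a)
    upper = sum(comb(n, i) for i in range(count_a, n + 1)) / total   # P(X >= a)
    return min(1.0, 2 * min(lower, upper))


# Same two-fold change, ten times more reads (e.g. a longer or more
# highly expressed gene). Invented counts.
p_low_counts = binomial_two_sided_p(10, 20)
p_high_counts = binomial_two_sided_p(100, 200)
```

With the same fold change, the gene with more reads gets a much smaller p-value, so any fixed p-value cut-off favors genes with high counts regardless of whether those counts come from length or from expression level.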

            • #7
              Sorry, I think I should have asked that question more explicitly. How can we deal with the bias in counting methods in the gene enrichment?

              • #8
                bump?

                Simon? Wolfgang? Bueller?
