
  • #16
    I'm having some fun with it. What it seems like to me is that I need very detailed knowledge of the transcriptome within the context of a sequencing run. For example, we know there are families of genes at different loci that are 50 or 60% similar, which to a biologist makes them sound fairly separable. To an aligner with 50bp reads, however, those features can share a lot of data when one or the other is expressed. Since most mappers assign equally good hits randomly, that's going to be messy.

    So you need to know how much data can be shared between which genes for a given sequencing type and read length, as sketched below.
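
    A minimal sketch of what that sharing could look like, in base R, assuming two hypothetical cDNA sequences geneA and geneB:

    ## Fraction of geneA's 50bp read windows that also occur in geneB, i.e.
    ## reads an aligner could place equally well in either gene.
    shared_read_windows <- function(a, b, k = 50) {
      kmers <- function(s) {
        n <- nchar(s) - k + 1
        unique(vapply(seq_len(n), function(i) substr(s, i, i + k - 1), ""))
      }
      ka <- kmers(a)
      length(intersect(ka, kmers(b))) / length(ka)
    }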
    /* Shawn Driscoll, Gene Expression Laboratory, Pfaff
    Salk Institute for Biological Studies, La Jolla, CA, USA */

    Comment


    • #17
      sdriscoll: I've been arguing for some time that there is no such thing as "alignment noise", but your post seems to contradict my claim. So I would like to hear more about your simulations.

      These 2% wrongly aligned reads, do they really look like correctly mapped ones? Do they have good mapping quality (MAPQ value in the SAM file)? Are they mapped uniquely? Are they unspliced? If you run the aligner a second time, do they end up at the same wrong position?
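
      One way to tabulate these properties, sketched in base R, assuming a Unix shell and a SAM file named aligned.sam:

      ## Pull the mandatory FLAG, MAPQ and CIGAR columns, skipping header lines.
      sam <- read.table(pipe("grep -v '^@' aligned.sam | cut -f2,5,6"),
                        sep = "\t", quote = "", comment.char = "",
                        col.names = c("flag", "mapq", "cigar"))
      table(cut(sam$mapq, breaks = c(-1, 0, 10, 30, 255)))  # MAPQ bands
      mean(grepl("N", sam$cigar))          # fraction spliced (CIGAR 'N' operation)
      mean(bitwAnd(sam$flag, 256) == 0)    # fraction of primary alignments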

      Comment


      • #18
        Originally posted by sadiexiaoyu View Post
        I would like to keep genes with a p value below 0.05, so according to my result, if I cut around 10%, then no genes with a p value below 0.05 (10^-1.3) will be lost
        Of course, you cannot call genes with a raw p value below 0.05 as significant, due to the multiple-testing problem. Rather, you want the adjusted p value to be below some threshold (with 0.05 or 0.1 being commonly chosen values), and an adjusted p value of 0.05 typically (though not always) corresponds to a raw p value which is a good deal smaller. This is why Wolfgang suggested something like 0.003.

        As the relation between raw and adjusted p values depends on your data set, some experimenting with the threshold is often helpful to get optimal power.
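
        For instance, a quick way to see which raw p value corresponds to your chosen adjusted cutoff (a sketch in R, where pvals stands for your vector of raw per-gene p values):

        padj <- p.adjust(pvals, method = "BH")  # Benjamini-Hochberg adjustment
        max(pvals[padj < 0.05])   # largest raw p value that still passes FDR 0.05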

        Comment


        • #19
          Originally posted by Simon Anders View Post
          sdriscoll: I've been arguing for some time that there is no such thing as "alignment noise", but your post seems to contradict my claim. So I would like to hear more about your simulations.

          These 2% wrongly aligned reads, do they really look like correctly mapped ones? Do they have good mapping quality (MAPQ value in the SAM file)? Are they mapped uniquely? Are they unspliced? If you run the aligner a second time, do they end up at the same wrong position?
          I'll see what I can put together for you. Naturally some of this will depend on which aligner I use and how I'm mapping the reads... but I can come up with some answers for you. I'm working on a transcriptome analysis that I suspect will explain a lot of it. I'm positive there are many connections between genes at the 100bp window of resolution that go beyond the names, IDs, and even genomic locations of the features. Once I have this map I expect to see some of the false gene counts from the misaligned reads fall away.
          /* Shawn Driscoll, Gene Expression Laboratory, Pfaff
          Salk Institute for Biological Studies, La Jolla, CA, USA */

          Comment


          • #20
            Originally posted by Simon Anders View Post
            Of course, you cannot call genes with a raw p value below 0.05 as significant, due to the multiple-testing problem. Rather, you want the adjusted p value to be below some threshold (with 0.05 or 0.1 being commonly chosen values), and an adjusted p value of 0.05 typically (though not always) corresponds to a raw p value which is a good deal smaller. This is why Wolfgang suggested something like 0.003.

            As the relation between raw and adjusted p values depends on your data set, some experimenting with the threshold is often helpful to get optimal power.
            Dear Simon,

            Thank you very much for your suggestion. I was also confused about whether I should use the raw p value or the adjusted p value (in edgeR, this is the FDR). You suggested that "some experimenting with the threshold is often helpful to get optimal power", so I plan to run the data without filtering to see which raw p value corresponds to an FDR of 0.05, and then use this to decide what percentage of the data should be filtered out, following Fig. 1 in the paper. Do you think this will be helpful?

            Best,

            Sadiexiaoyu

            Comment


            • #21
              Yes.

              But you are confused about terminology: an "adjusted p value" is a p value that has been "adjusted" for multiple testing. If the adjustment method is one that is designed to control the false discovery rate (FDR), such as the methods by Benjamini and Hochberg or by Storey and Tibshirani, and if the original p values were sound, then the following holds: if one considers all genes with an adjusted p value below some threshold ϑ as "hits", then the proportion of false positives in this list of hits, the so-called false discovery rate, is expected to be at most ϑ.
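
              A minimal simulation illustrating this property (a sketch in R; the effect sizes and gene counts are invented):

              set.seed(1)
              m0 <- 9000; m1 <- 1000                     # true nulls / true effects
              p <- c(runif(m0),                          # null p values are uniform
                     pnorm(rnorm(m1, mean = 3), lower.tail = FALSE))
              is_null <- rep(c(TRUE, FALSE), c(m0, m1))
              hits <- p.adjust(p, method = "BH") < 0.1
              sum(is_null & hits) / max(sum(hits), 1)    # realized FDP; ~0.1 or less on average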

              Comment


              • #22
                Right... so it's not necessary to do any kind of analysis comparing raw p values to adjusted p values. By the rules of statistics, when multiple-testing correction is necessary you're supposed to ignore the raw p values and take the adjusted ones as "truth". Then you do as Simon suggested: pick a threshold and understand that your results may contain that proportion of false positives.
                /* Shawn Driscoll, Gene Expression Laboratory, Pfaff
                Salk Institute for Biological Studies, La Jolla, CA, USA */

                Comment


                • #23
                  I think you have to use a Chinese restaurant process or a biological-diversity estimate to justify your thresholds, since read counts aren't independent. I.e., you are fiddling with the number of *types* of things that were observed, which depends on the number of observations and on the relative expression of each of the genes. It could be that your arbitrary threshold throws away much more of the reads in some cases than in others, depending on how the reads were distributed throughout the sample (see the toy example below).

                  Also, you need to take into account that many of the most interesting and relevant genes will be expressed at much lower levels.
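
                  A toy example of that dependence (a sketch in R; the dispersions are invented): the same count cutoff discards very different fractions of genes depending on how the reads are distributed.

                  set.seed(1)
                  even   <- rnbinom(10000, mu = 50, size = 5)    # reads spread evenly
                  skewed <- rnbinom(10000, mu = 50, size = 0.1)  # a few genes dominate
                  mean(even <= 10)     # fraction of genes lost at a count cutoff of 10
                  mean(skewed <= 10)   # far larger under the skewed distribution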

                  Comment


                  • #24
                    No need to make things overly complicated.

                    The point of the Bourgon et al. paper is that the following is perfectly fine: try different thresholds on the count sums (by simply scanning through a grid of values), always adjust the p values of the genes above the count-sum threshold with BH, and then use the threshold that gives the largest absolute number of genes with an adjusted p value below your chosen FDR. (It may sound as if such post-hoc choosing of the threshold by peeking at the test outcome is "cheating" and breaks FDR control, but this is, somewhat surprisingly, not the case, as Bourgon et al. showed.)

                    Of course, if you are specifically interested in lowly expressed genes, then such a way of choosing the filter may be permissible but disadvantageous, because your goal is not to optimise power to get many hits but to learn about the small genes. Then it might be better to choose a lower threshold, just low enough that you do not lose any hits at all compared to the no-filtering case.
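
                    A sketch of this scan in R (the Bioconductor genefilter package offers similar functionality; here counts is assumed to be a gene-by-sample matrix and pvals the vector of raw per-gene p values):

                    cutoffs <- quantile(rowSums(counts), probs = seq(0, 0.9, by = 0.05))
                    n_hits <- sapply(cutoffs, function(cut) {
                      keep <- rowSums(counts) > cut      # filter on the count sum only
                      sum(p.adjust(pvals[keep], method = "BH") < 0.1, na.rm = TRUE)
                    })
                    cutoffs[which.max(n_hits)]           # cutoff giving the most hits at FDR 0.1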

                    Comment


                    • #25
                      Originally posted by Simon Anders View Post
                      No need to make things overly complicated.

                      The point of the Bourgon et al. paper is that the following is perfectly fine: try different thresholds on the count sums (by simply scanning through a grid of values), always adjust the p values of the genes above the count-sum threshold with BH, and then use the threshold that gives the largest absolute number of genes with an adjusted p value below your chosen FDR. (It may sound as if such post-hoc choosing of the threshold by peeking at the test outcome is "cheating" and breaks FDR control, but this is, somewhat surprisingly, not the case, as Bourgon et al. showed.)

                      Of course, if you are specifically interested in lowly expressed genes, then such a way of choosing the filter may be permissible but disadvantageous, because your goal is not to optimise power to get many hits but to learn about the small genes. Then it might be better to choose a lower threshold, just low enough that you do not lose any hits at all compared to the no-filtering case.
                      I just don't think that is the correct framework to begin with since they aren't doing multiple testing in the first place.

                      Comment


                      • #26
                        Sorry, I've lost the thread of the discussion now. Whom do you mean by "they"?

                        Comment


                        • #27
                          Bourgon et al.

                          Comment


                          • #28
                            Originally posted by Simon Anders View Post
                            Yes.

                            But you are confused about terminology: an "adjusted p value" is a p value that has been "adjusted" for multiple testing. If the adjustment method is one that is designed to control the false discovery rate (FDR), such as the methods by Benjamini and Hochberg or by Storey and Tibshirani, and if the original p values were sound, then the following holds: if one considers all genes with an adjusted p value below some threshold ϑ as "hits", then the proportion of false positives in this list of hits, the so-called false discovery rate, is expected to be at most ϑ.
                            Hi, Simon,

                            Thank you so much for the correction! I also noticed your nice explanation in this thread: http://seqanswers.com/forums/showthread.php?t=17011

                            Best,

                            Sadiexiaoyu

                            Comment


                            • #29
                              I just don't think that is the correct framework to begin with since they aren't doing multiple testing in the first place.
                              Maybe we are talking about different papers. I'm referring to this one:

                              R. Bourgon, R. Gentleman, W. Huber: Independent filtering increases detection power for high-throughput experiments. PNAS 2010, 107(21):9546-51. doi:10.1073/pnas.0914005107.

                              This paper discusses which kind of filtering is permissible in the sense that it does not invalidate subsequent adjustment for multiple testing.

                              So, yes, of course, they do multiple testing. It's the whole point of the paper.

                              Comment


                              • #30
                                What they were doing was trying to adapt an existing analysis framework to their problem.

                                Anyway, it just strikes me as wrong to adjust the significance of the sample by selecting the number of genes to test, when it seems the p values could be derived from first principles.

                                Comment
