  • #46
    Nobody knows what rskr is debating about. But thanks, sdriscoll, for chiming in; I didn't expect anybody to still be reading. ;-)

    Just to clarify two points in case somebody reads through this mess:

    - Independent filtering is not restricted to microarrays. The paper derives and proves a general result and merely illustrates it with microarray data. Our application here does not follow that example but uses the general result. (The paper's example filters on variance; in this thread we use the mean.)

    - It is in fact true that you can use the p values from the DE test to choose, sort of post hoc, the filter threshold, as long as the filter criterion fulfills the permutation-invariance condition described in the paper. Intuition might make one sceptical here, so read the proof if needed.
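
    To make the second point concrete, here is a rough sketch of such a post-hoc threshold choice (my own toy code, not taken from the paper or from any package): scan candidate cutoffs on the filter statistic (here, something like the mean normalized count per gene) and keep the cutoff that maximises the number of BH rejections. The quantile grid and function names are just illustrative choices.

    Code:

import numpy as np

def bh_rejections(pvals, alpha=0.1):
    """Number of Benjamini-Hochberg rejections at FDR level alpha."""
    p = np.sort(np.asarray(pvals, dtype=float))
    m = len(p)
    passed = p <= alpha * np.arange(1, m + 1) / m
    return int(passed.nonzero()[0].max() + 1) if passed.any() else 0

def choose_filter_threshold(filter_stat, pvals, alpha=0.1):
    """Scan quantiles of the filter statistic (e.g. mean normalized count per
    gene) and return the cutoff that gives the most BH rejections."""
    filter_stat = np.asarray(filter_stat, dtype=float)
    pvals = np.asarray(pvals, dtype=float)
    cutoffs = np.quantile(filter_stat, np.linspace(0.0, 0.95, 20))
    hits = [bh_rejections(pvals[filter_stat > c], alpha) for c in cutoffs]
    best = int(np.argmax(hits))
    return cutoffs[best], hits[best]

    Because the filter statistic ignores the sample labels, peeking at the p values while picking the cutoff does not invalidate the BH adjustment on the retained genes; that is exactly the permutation-invariance condition mentioned above.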

    Comment


    • #47
      Instinctively a biologist tends to think there SHOULD be a cutoff because they often want to know "what's really expressed" in their samples. It makes them uneasy that RNA-seq is kind of grey in answering this when it comes to low-count features since those hits could be false-positives or they could be features with VERY low expression which may resolve with additional sequencing depth. It's great that people are working on this question because I think it's very useful due to the extreme amount of information you get out of an RNA-seq run. I remember in 2009 we just ignored features with RPKM values less than 1 which, obviously, doesn't make a lick of sense today.
      /* Shawn Driscoll, Gene Expression Laboratory, Pfaff
      Salk Institute for Biological Studies, La Jolla, CA, USA */

      Comment


      • #48
        Originally posted by sdriscoll View Post
        rskr--- are you suggesting that one should NOT filter out low-count features prior to running a differential expression test? I thought that's what we were discussing in this thread - filtering out low count genes prior to running DE in order to eliminate a lot of extreme fold change values that such low-count data tend to produce. Or did you guys somehow get into a debate over filtering p-values AFTER the DE test?
        I am advocating for one test rather than a filter test plus a DE test in this case, because the data are discrete, and arguing that the previous work was a poor adaptation of microarray theory.

        Comment


        • #49
          Originally posted by sdriscoll View Post
          Instinctively a biologist tends to think there SHOULD be a cutoff because they often want to know "what's really expressed" in their samples. It makes them uneasy that RNA-seq is kind of grey in answering this when it comes to low-count features since those hits could be false-positives or they could be features with VERY low expression which may resolve with additional sequencing depth. It's great that people are working on this question because I think it's very useful due to the extreme amount of information you get out of an RNA-seq run. I remember in 2009 we just ignored features with RPKM values less than 1 which, obviously, doesn't make a lick of sense today.
          This is why I brought up the Chinese restaurant process and the UNTB (unified neutral theory of biodiversity), which give ways to generate expectations over the abundances of genes.
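
          For anyone who hasn't met it, here is a minimal toy version of the Chinese restaurant process (my own sketch, not code from rskr): each new read either opens a new "table" (gene) with probability alpha / (n + alpha) or joins an existing one proportionally to its current size, which produces the familiar few-very-abundant-genes-plus-long-tail shape one could use as a neutral expectation for abundances.

          Code:

import random

def chinese_restaurant(n_reads, alpha=5.0, seed=0):
    """Assign n_reads to 'tables' (genes); return table sizes, largest first."""
    rng = random.Random(seed)
    tables = []
    for n in range(n_reads):
        # open a new table with probability alpha / (n + alpha) ...
        if rng.random() < alpha / (n + alpha):
            tables.append(1)
        # ... otherwise join an existing table proportionally to its size
        else:
            i = rng.choices(range(len(tables)), weights=tables)[0]
            tables[i] += 1
    return sorted(tables, reverse=True)

print(chinese_restaurant(10000)[:10])  # a few very abundant "genes" plus a long tail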

          Comment


          • #50
            Originally posted by rskr View Post
            Anyway it just strikes me as wrong to adjust the significance of the sample by selecting the number of genes to test, when it seems the p-values could be derived from first principles.
            This part I get. I've felt a little odd about that as well. Reducing the number of genes being tested by arbitrary filtering has a direct impact on the results of the FDR correction for multiple testing, and in essence you mean to point out the absurdity of exploiting that. Take a classic multiple-testing case: I've made 10 comparisons against a single sample, but due to the number of comparisons my FDR-corrected p-value is not significant, so I opt to remove a few comparisons to push my p-value into significance (a toy illustration of the arithmetic follows below). I get that point... and filtering the gene list with the intention of improving the corrected p-values does seem inherently wrong. Therefore, filtering based on "what's probably expressed" versus "what's probably noise" seems more sound.
            /* Shawn Driscoll, Gene Expression Laboratory, Pfaff
            Salk Institute for Biological Studies, La Jolla, CA, USA */
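
            Here is the toy arithmetic behind that scenario (made-up numbers, only to show the BH mechanics; bh_adjust is a plain reference implementation, not from any package):

            Code:

def bh_adjust(pvals):
    """Plain Benjamini-Hochberg adjusted p-values (step-up, with monotonicity)."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):          # walk from the largest p-value down
        i = order[rank - 1]
        running_min = min(running_min, pvals[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

# ten comparisons against one sample; only the first one looks interesting
pvals = [0.008, 0.20, 0.35, 0.40, 0.55, 0.60, 0.70, 0.80, 0.90, 0.95]
print(bh_adjust(pvals)[0])       # 0.08 -> not significant at 0.05
print(bh_adjust(pvals[:5])[0])   # 0.04 -> "significant" after quietly dropping 5 tests

            The smallest raw p value never changes; only the number of tests handed to the correction does, which is why choosing what to drop based on the outcome would be cheating, and why an independence criterion for the filter is needed.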

            Comment


            • #51
              Originally posted by sdriscoll View Post
              Instinctively a biologist tends to think there SHOULD be a cutoff because they often want to know "what's really expressed" in their samples. It makes them uneasy that RNA-seq is kind of grey in answering this when it comes to low-count features since those hits could be false-positives or they could be features with VERY low expression which may resolve with additional sequencing depth. It's great that people are working on this question because I think it's very useful due to the extreme amount of information you get out of an RNA-seq run. I remember in 2009 we just ignored features with RPKM values less than 1 which, obviously, doesn't make a lick of sense today.
              I think you are confusing two issues here:

              1. For genes with very low counts we have strong Poisson noise and therefore insufficient information to distinguish true differential expression from just sampling noise. Therefore, any correct test for differential expression should not call any of these differentially expressed. The test needs to be able to notice this by itself; if it needed to rely on a filtering step beforehand, the test would be incorrect.

              This is one of the reasons why methods such as DESeq and edgeR insist on getting raw counts rather than RPKM values and the like: the test needs to be able to infer the expected strength of sampling noise from the absolute number of reads. We then model with a negative binomial distribution (or something similar) for inference. Using normal-distribution models, as one did in microarray times, will not work as well.

              Again: The point of independent filtering is not to "help" the test to avoid type-I errors (false positives). If this were the case, rskr would actually be justified in his ridicule (and I am starting to wonder whether he misunderstood the whole discussion along these lines). The test should be, and typical methods are, perfectly capable of noticing on their own when the read count is too low to provide enough information for a DE call.

              2. The point of the filtering step is to increase power, i.e. to reduce false negatives (slightly). This is due to a quirk of multiple testing adjustments such as Benjamini-Hochberg. If you apply them to p values from a mix of tested hypotheses with widely varying power, the cases with low power (here: weakly expressed genes) will make the adjustment overly conservative and so "drag down" the cases with good power (here: strong genes). Hence, it may pay to "not present" the weak cases to the BH procedure, i.e., to run it on only the genes with sufficiently strong expression. The question is how to preselect these "weak cases" without violating the assumptions of the adjustment procedure. The paper by Bourgon et al. gives a criterion to answer this question, and according to this criterion, prefiltering on the sum of read counts across samples, taken for each gene, is permissible -- and it does increase the overall number of hits, at least a bit.
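
              To put a number on point 1, here is a rough stand-in for a count-based comparison (a sketch only; it is not DESeq's negative-binomial test and it ignores biological dispersion): conditional on the total, one of two Poisson counts is Binomial(total, 0.5) under the null of equal rates, so a simple exact binomial test already shows why tiny counts cannot yield a DE call however large the apparent fold change.

              Code:

from scipy.stats import binomtest   # needs scipy >= 1.7

a, b = 2, 8                              # a "4-fold change", but only 10 reads in total
print(binomtest(a, a + b, 0.5).pvalue)   # ~0.11: no call possible
a, b = 200, 800                          # same ratio with 100x the reads
print(binomtest(a, a + b, 0.5).pvalue)   # vanishingly small: a clear difference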

              Comment


              • #52
                Originally posted by Simon Anders View Post
                I think you are confusing two issues here:

                1. For genes with very low counts we have strong Poisson noise and therefore insufficient information to distinguish true differential expression from just sampling noise. Therefore, any correct test for differential expression should not call any of these differentially expressed. The test needs to be able to notice this by itself; if it needed to rely on a filtering step beforehand, the test would be incorrect.

                This is one of the reasons why methods such as DESeq and edgeR insist on getting raw counts rather than RPKM values and the like: the test needs to be able to infer the expected strength of sampling noise from the absolute number of reads. We then model with a negative binomial distribution (or something similar) for inference. Using normal-distribution models, as one did in microarray times, will not work as well.

                Again: The point of independent filtering is not to "help" the test to avoid type-I errors (false positives). If this were the case, rskr would actually be justified in his ridicule (and I am starting to wonder whether he misunderstood the whole discussion along these lines). The test should be, and typical methods are, perfectly capable of noticing on their own when the read count is too low to provide enough information for a DE call.

                2. The point of the filtering step is to increase power, i.e. to reduce false negatives (slightly). This is due to a quirk of multiple testing adjustments such as Benjamini-Hochberg. If you apply them to p values from a mix of tested hypotheses with widely varying power, the cases with low power (here: weakly expressed genes) will make the adjustment overly conservative and so "drag down" the cases with good power (here: strong genes). Hence, it may pay to "not present" the weak cases to the BH procedure, i.e., to run it on only the genes with sufficiently strong expression. The question is how to preselect these "weak cases" without violating the assumptions of the adjustment procedure. The paper by Bourgon et al. gives a criterion to answer this question, and according to this criterion, prefiltering on the sum of read counts across samples, taken for each gene, is permissible -- and it does increase the overall number of hits, at least a bit.
                This is what you meant by "intuition might make one sceptical here", yes?

                Maybe what's going on here is simply confusion of these issues, combined with what one's intuition naturally leads one to think about the filtering. Finding a logical threshold to define what's expressed versus what's not expressed is not the focus of these filtering steps for DE. Instead it's specifically about the test itself and some interesting insight into the BH correction. I like the explanation that the filtering is simply seeking to give the BH correction a more theoretically correct set of tests to correct, rather than mixing low power and high power. I think I could use that explanation to explain this process to the biologists I work with.
                /* Shawn Driscoll, Gene Expression Laboratory, Pfaff
                Salk Institute for Biological Studies, La Jolla, CA, USA */

                Comment


                • #53
                  Originally posted by Simon Anders View Post
                  sdriscoll: I've been arguing for some time that there is no such thing as "alignment noise", but your post seems to contradict my claim. So I would like to hear more about your simulations.

                  These 2% wrongly aligned reads, do they really look like correctly mapped ones? Do they have good mapping quality (MAPQ value in the SAM file)? Are they mapped uniquely? Are they unspliced? If you run the aligner a second time, do they end up at the same wrong position?
                  Simon, I haven't gotten back to this question yet, but I've had a few peeks at some misaligned reads from the "subread" aligner, and I think what's going on is maybe not so easily avoided. When I benchmark the aligners I've been testing (bowtie1/2, tophat, STAR, bwa aln and mem), I count the incorrectly mapped reads in a few categories: wrong reference, wrong orientation and, finally, wrong position (a sketch of that tallying is at the end of this post). This gives me a few things to look at.

                  Again, I haven't done a really thorough investigation into these misaligned reads, but I did look at some reads that fell into the "wrong orientation" category from a run with subread. These are reads mapped to the correct chromosome but, for some reason, aligned flipped from their proper orientation and potentially (or probably) not in the correct genomic region either. A single example, which based on what I've been seeing is probably a frequent issue for reads from certain genes, is a read whose correct alignment is across a splice junction with 0 mismatches. The subread-align command doesn't find that kind of alignment, but it did align the read (calling it a unique alignment, I might add) elsewhere in the chromosome, flipped in orientation; the alignment it selected was one where it could align the entire read with 6 mismatches. What seems likely is that this is a common mistake: I think aligners will select a continuous end-to-end alignment over a partial or spliced alignment, unless I've misunderstood their alignment logic.

                  In this case the read came from ENSMUST00000075388 but the alignment was made to ENSMUST00000177917, which have different 'gene_id' values in Ensembl, and based on subread's alignment score it was a decent quality alignment. So this is one read that should be assigned to one feature but would be counted towards another. This, I believe, is what I'm seeing as "aligner noise", though maybe a better name is "genome noise". I'll work through more of my simulated results as I have time and report back.
                  /* Shawn Driscoll, Gene Expression Laboratory, Pfaff
                  Salk Institute for Biological Studies, La Jolla, CA, USA */
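
                  The tallying mentioned above could look something like this sketch (it assumes, as a convention of my own, that simulated read names encode their true origin as "chrom:pos:strand", and it uses pysam; the BAM file name is hypothetical):

                  Code:

import pysam
from collections import Counter

def classify_alignments(bam_path, tolerance=10):
    """Tally wrong reference / wrong orientation / wrong position for primary
    alignments of simulated reads whose names encode 'chrom:pos:strand'."""
    tallies = Counter()
    with pysam.AlignmentFile(bam_path, "rb") as bam:
        for aln in bam:
            if aln.is_unmapped or aln.is_secondary or aln.is_supplementary:
                continue
            true_chrom, true_pos, true_strand = aln.query_name.split(":")[:3]
            if aln.reference_name != true_chrom:
                tallies["wrong_reference"] += 1
            elif ("-" if aln.is_reverse else "+") != true_strand:
                tallies["wrong_orientation"] += 1
            elif abs(aln.reference_start - int(true_pos)) > tolerance:
                tallies["wrong_position"] += 1
            else:
                tallies["correct"] += 1
    return tallies

print(classify_alignments("simulated_vs_subread.bam"))  # hypothetical file name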

                  Comment


                  • #54
                    It's reassuring that the alternative, wrong alignment had 6 mismatches and hence presumably a low alignment quality. My hope would be that if one uses only high-quality alignments, one is reasonably safe (see the short filtering sketch at the end of this post).

                    You might find, by the way, this paper useful:

                    Mapping Reads on a Genomic Sequence: An Algorithmic Overview and a Practical Comparative Analysis
                    Sophie Schbath, Véronique Martin, Matthias Zytnicki, Julien Fayolle, Valentin Loux, and Jean-François Gibrat.
                    Journal of Computational Biology. June 2012, 19(6): 796-813. doi:10.1089/cmb.2012.0022.

                    They compare several genomic aligners by taking read-length pieces from the genome, feeding them to the aligners, and checking whether each aligner finds the position the piece was taken from. Interestingly, some aligners make quite a few mistakes there, while others manage to keep the number of false mappings at zero.

                    So, it seems possible to avoid false mappings -- if the aligner is done well.

                    A caveat is that this paper does not study spliced alignment. It would be very interesting to perform a similar study with splice-aware aligners and simulated reads taken from the transcriptome.
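
                    As for keeping only high-quality alignments, here is a minimal pysam sketch of that filter (the MAPQ cutoff of 30 is an arbitrary choice, and MAPQ scales differ between aligners):

                    Code:

import pysam

MIN_MAPQ = 30   # arbitrary cutoff; tune per aligner

with pysam.AlignmentFile("aligned.bam", "rb") as inbam, \
     pysam.AlignmentFile("high_conf.bam", "wb", template=inbam) as outbam:
    for aln in inbam:
        # keep only mapped, primary alignments with a decent mapping quality
        if aln.is_unmapped or aln.is_secondary or aln.is_supplementary:
            continue
        if aln.mapping_quality >= MIN_MAPQ:
            outbam.write(aln)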

                    Comment


                    • #55
                      That's what I've been thinking, though I haven't seen that paper - thanks for the link. It seems common to benchmark aligners with genomic reads, but since the sequencing data I work with is nearly 100% RNA-seq, those benchmarks are only partially useful for me.

                      I've been compiling iterations of both position-based and read-count-based benchmarking. The read-count-based version is probably the most directly applicable to my actual projects: I use simulated, transcriptome-sourced reads, run them through about a dozen different alignment scenarios and read-count options (depending on whether the alignments were to the genome or to the transcriptome), and then simply compare the returned list of features with counts, and the count values themselves, to the control list and counts (a rough sketch of that comparison is at the end of this post). I've been working on this as a way to decide which pipeline seems best for me to use in our lab, plus I can re-run it any time something new comes out.

                      In addition to gene-level counting I'm also running eXpress, RSEM and cufflinks, with cufflinks running on both STAR and Tophat alignments (and both of those running with and without a database of known junctions). Since my simulated reads are transcript specific, and I can control the depth of sequencing at those transcripts to make sure they are sufficiently expressed, I can benchmark these so-called isoform expression quantifiers under the condition of randomly expressed isoforms from throughout the transcriptome.

                      So far the results of those isoform counts aren't too pretty, but I can report that RSEM stands out by a lot, and also that cufflinks performed slightly better with STAR alignments than with Tophat alignments. Both of those seem to be outperforming eXpress, which appears to be plagued with a lot of false positives; I've recently changed my alignment settings for eXpress and maybe that'll have a positive impact on its results. By "not too pretty" I mean that, with the exception of RSEM, more than 50% of the returned isoforms showing non-zero expression are false positives, and even among those with counts higher than 10 reads more than 50% are still false positives... and they aren't all low-count features either. Many of them have expression levels quite comparable to those of features that were correctly identified. It's not super inspiring. Their recall/sensitivity rates are pretty good though: they tend to correctly identify over 90% of the features that my controls say should be present.

                      Endless fun...
                      /* Shawn Driscoll, Gene Expression Laboratory, Pfaff
                      Salk Institute for Biological Studies, La Jolla, CA, USA */
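
                      The comparison against the control list boils down to something like this sketch (transcript IDs and counts below are placeholders, not results from the benchmark):

                      Code:

def benchmark(truth_counts, estimated_counts, min_count=0):
    """truth_counts / estimated_counts: dicts of transcript_id -> read count."""
    called = {t for t, c in estimated_counts.items() if c > min_count}
    real = {t for t, c in truth_counts.items() if c > 0}
    true_pos = called & real
    sensitivity = len(true_pos) / len(real) if real else 0.0
    false_positive_fraction = 1 - len(true_pos) / len(called) if called else 0.0
    return sensitivity, false_positive_fraction

# placeholder numbers, only to show the bookkeeping
truth = {"tx1": 120, "tx2": 300, "tx3": 0, "tx4": 0}
est   = {"tx1": 115, "tx2": 290, "tx3": 40, "tx4": 0}
print(benchmark(truth, est))   # (1.0, 0.33...): both real ones found, one false positive

                      Raising min_count gives the "counts higher than 10 reads" variant of the same numbers.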

                      Comment


                      • #56
                        This sounds partly worrying. You really should write a paper about your benchmarks.

                        Comment


                        • #57
                          Originally posted by Simon Anders View Post
                          I think you are confusing two issues here:

                          1. For genes with very low counts we have strong Poisson noise and therefore insufficient information to distinguish true differential expression from just sampling noise. Therefore, any correct test for differential expression should not call any of these differentially expressed. The test needs to be able to notice this by itself; if it needed to rely on a filtering step beforehand, the test would be incorrect.

                          This is one of the reasons why methods such as DESeq and edgeR insist on getting raw counts rather than RPKM values and the like: the test needs to be able to infer the expected strength of sampling noise from the absolute number of reads. We then model with a negative binomial distribution (or something similar) for inference. Using normal-distribution models, as one did in microarray times, will not work as well.

                          Again: The point of independent filtering is not to "help" the test to avoid type-I errors (false positives). If this were the case, rskr would actually be justified in his ridicule (and I am starting to wonder whether he misunderstood the whole discussion along these lines). The test should be, and typical methods are, perfectly capable of noticing on their own when the read count is too low to provide enough information for a DE call.

                          2. The point of the filtering step is to increase power, i.e. to reduce false negatives (slightly). This is due to a quirk of multiple testing adjustments such as Benjamini-Hochberg. If you apply them to p values from a mix of tested hypotheses with widely varying power, the cases with low power (here: weakly expressed genes) will make the adjustment overly conservative and so "drag down" the cases with good power (here: strong genes). Hence, it may pay to "not present" the weak cases to the BH procedure, i.e., to run it on only the genes with sufficiently strong expression. The question is how to preselect these "weak cases" without violating the assumptions of the adjustment procedure. The paper by Bourgon et al. gives a criterion to answer this question, and according to this criterion, prefiltering on the sum of read counts across samples, taken for each gene, is permissible -- and it does increase the overall number of hits, at least a bit.
                          Yet, you use a paper based on microarrays, using normally distributed simulations to justify the filtering and threshold.

                          Comment


                          • #58
                            I do precisely what I have written up there. None of this involves any "paper based on microarrays", nor any results that are justified by "simulations" rather than proofs or whose validity assumes normality.

                            Comment


                            • #59
                              Originally posted by Simon Anders View Post
                              I do precisely what I have written up there. None of this involves any "paper based on microarrays" nor any results which are justified by "simulations" rather than proofs or whose validity assume normality.
                               You didn't refer to this paper?

                               http://www.pnas.org/content/107/21/9546.long

                              Comment


                              • #60
                                Originally posted by rskr View Post
                                You didn't refer to this paper?

                                http://www.pnas.org/content/107/21/9546.long
                                You realise how disingenuous that reply is, I hope. He already directly addressed this above.

                                Comment
