Unconfigured Ad

Collapse
X
 
  • Filter
  • Time
  • Show
Clear All
new posts
  • dpryan
    Devon Ryan
    • Jul 2011
    • 3478

    #16
    Originally posted by ThePresident View Post
    It's somewhat near but not exactly the same thing. I did it with just a couple of values but still: 8vs9 , 146vs300, 13vs18.
    As I mentioned above, you shouldn't expect to get the same numbers as cufflinks (while FPKM==RPKM for SE reads by definition, you and cufflinks will likely have vastly different conceptions of how many reads there are and how long each gene is).

    BTW, the ability of htseq-count to discriminate between unique and multimappers is aligner dependent. You mention using bowtie (I'll take that to mean tophat, unless your organism doesn't do splicing), so it should then have the NH auxiliary tag properly set. When in doubt, use the "-o" option of htseq-count to follow how reads are being treated. Then you can determine if there's anything really amiss.

    Comment

    • ThePresident
      Member
      • Jun 2012
      • 72

      #17
      Originally posted by dpryan View Post
      As I mentioned above, you shouldn't expect to get the same numbers as cufflinks (while FPKM==RPKM for SE reads by definition, you and cufflinks will likely have vastly different conceptions of how many reads there are and how long each gene is).

      BTW, the ability of htseq-count to discriminate between unique and multimappers is aligner dependent. You mention using bowtie (I'll take that to mean tophat, unless your organism doesn't do splicing), so it should then have the NH auxiliary tag properly set. When in doubt, use the "-o" option of htseq-count to follow how reads are being treated. Then you can determine if there's anything really amiss.
      I'm dealing with bacterial transcriptome, so no splicing here. That's why I'm simply using bowtie instead of tophat. And, somebody told me that bowtie does not include the NH auxiliary tag (which is used to discriminate for multimappers / unique reads). What I have been told is that by default, bowtie should assign a mapping quality of 255 to reads that map once. So I could extract them with samtools:

      Code:
      samtools view -hb -q 255 input.bam > output.bam
      That way, I have BAM files with uniquely-mapped reads. Then perform HTSeq-count and manual calculation of RPKM. What's your opinion on that?

      Comment

      • AdrianP
        Senior Member
        • Apr 2011
        • 130

        #18
        Originally posted by dpryan View Post
        As I mentioned above, you shouldn't expect to get the same numbers as cufflinks (while FPKM==RPKM for SE reads by definition, you and cufflinks will likely have vastly different conceptions of how many reads there are and how long each gene is).
        You mentioned taking into account partial reads, which seems okay. But you also mentioned differences in gene length, so I am wondering why would cufflinks assume a different gene length than the actual gene length given.

        From the manual calculations, it seems that one number is always above the other. Is the number that is higher done by cufflinks? In that case, it should be no surprise why it is higher, due to partial read calculations. Except, sometimes it's much higher... like the case where it is double, which is scary.

        Comment

        • dpryan
          Devon Ryan
          • Jul 2011
          • 3478

          #19
          I vaguely remember bowtie1 using a MAPQ of 255 for unique mappers (similar to older versions of tophat, since the equivalent is now 50 there). If you're using bowtie2, then just use -q 2 (0 and 1 are for multimappers in bowtie2). If you did in fact use bowtie1, then what you wrote should work. BTW, it's faster to just:

          Code:
          samtools view -q 255 input.bam | htseq-count ...
          Just use a dash ("-") as the file name. If what you wrote above is giving you the apparent 2x increase in read counts then something went seriously wrong at some point

          Comment

          • ThePresident
            Member
            • Jun 2012
            • 72

            #20
            Originally posted by AdrianP View Post
            Is the number that is higher done by cufflinks? In that case, it should be no surprise why it is higher, due to partial read calculations. Except, sometimes it's much higher... like the case where it is double, which is scary.
            Yup, it's cufflinks RPKM value that is always higher. But I can't guarantee it would've been for all the values.

            Comment

            • dpryan
              Devon Ryan
              • Jul 2011
              • 3478

              #21
              Originally posted by AdrianP View Post
              You mentioned taking into account partial reads, which seems okay. But you also mentioned differences in gene length, so I am wondering why would cufflinks assume a different gene length than the actual gene length given.

              From the manual calculations, it seems that one number is always above the other. Is the number that is higher done by cufflinks? In that case, it should be no surprise why it is higher, due to partial read calculations. Except, sometimes it's much higher... like the case where it is double, which is scary.
              Part of the point of cufflinks is to reconstruct the transcripts that the organism is actually expressing, rather than just directly using those in an annotation file. If the observed transcript size after assembly is different from the annotation file then the denominators will differ as well. This is more frequently the case in eukaryotes, where there are multiple transcripts and cufflinks will calculate the gene's length as their weighted average (or something like that, I'd have to reread the paper to refresh my memory). Perhaps cufflinks is observing a shorter or longer transcript in these cases, in which case the RPKMs could differ by 2x.

              Comment

              • bswhite
                Junior Member
                • Jan 2014
                • 5

                #22
                TP,

                I'm also trying to calculate RPKM manually (though for paired end
                reads) and to compare against cufflinks FPKM.

                Are you seeing a bimodal distribution in your cufflinks RPKMs? Are
                you recapitulating that manually? Was this dependent on your only using
                uniquely mapped reads?

                I can't reproduce the bimodal distribution I see with cufflinks. In particular,
                the mode I see at higher cufflinks FPKMs is not present with my manual
                htseq-count-based approach. I understand there are plenty of caveats (e.g.,
                with respect to FPKMs vs RPKMs, transcripts vs gene, etc).

                I'm also using htseq-count. As the length of the gene, I'm taking the number
                of exonic bases in any transcript of associated with a gene. These lengths
                are universally not longer than those cufflinks is reporting in the locus field.

                I have tried passing --no-length-correction to cufflinks(2.0.2) as suggested by
                sdriscoll here:

                Discussion of next-gen sequencing related bioinformatics: resources, algorithms, open source efforts, etc


                thinking the second mode might be an artifact of cufflinks. But I retain it.

                Any ideas or particulars I'm missing? It sounds as if you are much closer
                than I am.

                Hope this newbie isn't hijacking your thread ...

                Comment

                • ThePresident
                  Member
                  • Jun 2012
                  • 72

                  #23
                  Hi whiteman, I'm still trying to figure this out.

                  Even when I try to extract only uniquely mapped reads from BAM files with samtools view -q 255 option, I still see a 2x number of reads for one of my samples. Crazy... I must have confused files or something since this makes no sense whatsoever. Let me figure this out

                  TP

                  Comment

                  • bswhite
                    Junior Member
                    • Jan 2014
                    • 5

                    #24
                    Originally posted by ThePresident View Post
                    Hi whiteman, I'm still trying to figure this out.

                    Even when I try to extract only uniquely mapped reads from BAM files with samtools view -q 255 option, I still see a 2x number of reads for one of my samples. Crazy... I must have confused files or something since this makes no sense whatsoever. Let me figure this out

                    TP
                    Did you mean samtools view -F 256? i.e., exclude any read that
                    has the 'not primary alignment' flag set?



                    I'm also trying to simplify by excluding both mates if either is multi mapped
                    (i.e., has flag 256 set or has NH field set with something other than NH:i:1).
                    This should be extremely conservative in giving uniquely mapped reads.

                    I plan to run these through cufflinks, htseq-count, and featureCounts
                    to compare results.

                    Comment

                    • Monika_bioinf
                      Junior Member
                      • Sep 2011
                      • 7

                      #25
                      It should be probably mentioned that author of RPKM concept (which has been later replaced by FPKM) deprecated it in his last talk and encouraged not using it.

                      Comment

                      • dpryan
                        Devon Ryan
                        • Jul 2011
                        • 3478

                        #26
                        Originally posted by Monika_bioinf View Post
                        It should be probably mentioned that author of RPKM concept (which has been later replaced by FPKM) deprecated it in his last talk and encouraged not using it.

                        http://www.youtube.com/watch?v=5NiFibnbE8o
                        Thanks for pointing out that video. His slide around minute 23 is particularly interesting.

                        One unfortunate thing about the second part of his talk is that many of us hesitate using estimated values when look at gene-level changes because the earlier methods (cufflinks chiefly among them) simply produced unreliable results. Granted, this seems to be going away as the programs mature, but "once bitten twice shy" as the saying goes.

                        Comment

                        • ThePresident
                          Member
                          • Jun 2012
                          • 72

                          #27
                          Originally posted by bswhite View Post
                          Did you mean samtools view -F 256? i.e., exclude any read that
                          has the 'not primary alignment' flag set?



                          I'm also trying to simplify by excluding both mates if either is multi mapped
                          (i.e., has flag 256 set or has NH field set with something other than NH:i:1).
                          This should be extremely conservative in giving uniquely mapped reads.

                          I plan to run these through cufflinks, htseq-count, and featureCounts
                          to compare results.
                          So, just to give you a brief update: finally, I aligned with bowtie2 using --sensitive parameter. Then, I used
                          Code:
                          grep -v XS:i: YourFile.sam > YourFileSingleMappers.sam
                          That was the easiest way of dealing with it (bowtie2 prints XS: tag for multimappers). Then I run HTseq-count with -intersection-nonempty parameter. Everything went smooth. I calculated RPKM values manually, still didn't compare it with Cufflinks RPKMs, but visually I have a lognormal distribution. Sounds logical for me...

                          TP

                          Comment

                          • dpryan
                            Devon Ryan
                            • Jul 2011
                            • 3478

                            #28
                            @ThePresident: Depending on your definition of multimapper, that grep command may remove a lot more than multimappers. What you're doing is removing any read that has at least one other "valid alignment". But keep in mind that bowtie2 defines a valid alignment as any yielding an alignment score >= that defined by the --score-min function. So, if you have a 100bp read that has 1 alignment with no mismatches (an alignment score of 0 with default settings) and a second with 10 mismatches (minimum alignment score of -60, but possibly higher), then you're calling that a multimapper. The question becomes if you're comfortable having the rate of multimappers change if you tweak --score-min. The more conventional definition of a multimapper is one where the best and second best alignments have equal scores. In bowtie2, those will be given MAPQs of either 0 or 1 and the NH:i auxiliary tag will be set accordingly.

                            So yeah, you're getting rid of multimappers. But you're probably getting rid of a LOT of accurate alignments in the process.

                            BTW, --sensitive is the default, you needn't specify it.

                            Comment

                            • bruce01
                              Senior Member
                              • Mar 2011
                              • 160

                              #29
                              Can I ask, what do people think of using flags 99/163 or 83/147, eg a 'primary' alignment with paired mate, as another way to do this. Of course this removes all multimappers. I have a script to use the alignment score from STAR to determine which multimapper to keep, but again, wondering is this a safe method? Any potential pitfalls for using those flags, or for using alignment score out of STAR?

                              Comment

                              • dpryan
                                Devon Ryan
                                • Jul 2011
                                • 3478

                                #30
                                If you just want to keep the first reported alignment of a multimapper then that'll work. To just ignore them, just use MAPQ==255 for STAR.

                                Comment

                                Latest Articles

                                Collapse

                                ad_right_rmr

                                Collapse

                                News

                                Collapse

                                Topics Statistics Last Post
                                Started by SEQadmin2, Today, 10:09 AM
                                0 responses
                                9 views
                                0 reactions
                                Last Post SEQadmin2  
                                Started by SEQadmin2, Yesterday, 08:59 AM
                                0 responses
                                17 views
                                0 reactions
                                Last Post SEQadmin2  
                                Started by SEQadmin2, 06-02-2026, 12:03 PM
                                0 responses
                                25 views
                                0 reactions
                                Last Post SEQadmin2  
                                Started by SEQadmin2, 06-02-2026, 11:40 AM
                                0 responses
                                21 views
                                0 reactions
                                Last Post SEQadmin2  
                                Working...