Unconfigured Ad

Collapse
X
 
  • Filter
  • Time
  • Show
Clear All
new posts
  • priya
    Member
    • Apr 2013
    • 57

    Problems with FPKM values

    Hi,

    I see problems with the fpkm values calculated by cufflinks program. I dont see correlation between counts and fpkms for some genes.

    To be more clear I explain my scenario
    I have RNA-SEQ data for two conditions (let say condA and condB) with 3 replicates each
    for a particular gene : say gene Tagap

    Cond A:
    Type Rep1 Rep2 Rep3 mean_of_replicates
    Count 169 171 124 154
    fpkms 4.44 3.8 3.2 3.8

    Cond B:
    Type Rep1 Rep2 Rep3 mean_of_replicates
    Count 17 38 21 25
    fpkms 5.0 4.7 4.3 4.6

    From above :
    CondA (counts) > CondB (counts)
    CondA (fpkms) < CondB (fpkms)


    I tried to understand the fpkm calculations from cufflinks supplementary methods, but I couldnt completely get their calculations. After googling abit, I thought I can use this simple formulae to do normalization

    Nc = (10^9*Counts)/(Transcript_length*Total no. of uniquely mapped reads)
    with this formulae, I made use of counts to make normalizations and got below values

    Cond A:
    Type Rep1 Rep2 Rep3 mean_of_replicates
    Nc 1.72 1.88 1.26 1.62

    Cond B:
    Type Rep1 Rep2 Rep3 mean_of_replicates
    Nc 0.18 0.4 0.26 0.28


    CondA (Nc) > CondB (Nc)

    Nc values correlates with the counts.

    Can I use this approach (Nc calculations) to get the normalized values instead of FPKMs calculated from cufflinks?

    Why the direction of fold change is completely reversed by cufflinks fpkms?
  • blancha
    Senior Member
    • May 2013
    • 367

    #2
    Which version of Cufflinks are you using?
    Also, what is the total number of uniquely mapped reads for each sample?
    Have you tried comparing your results with the output of featureCounts and DESeq2?

    Comment

    • priya
      Member
      • Apr 2013
      • 57

      #3
      Thanks for your reply!
      Originally posted by blancha View Post
      Which version of Cufflinks are you using?
      Also, what is the total number of uniquely mapped reads for each sample?
      Have you tried comparing your results with the output of featureCounts and DESeq2?
      Which version of Cufflinks are you using?
      cufflinks/2.1.1 is used on bam files to calculate the fpkms

      Also, what is the total number of uniquely mapped reads for each sample?
      This number i got from this command
      samtools view -F 4 accepted_hits_sorted_dupRemoved_P1827_101.bam | wc -l
      31837064 + 0 in total (QC-passed reads + QC-failed reads)
      0 + 0 duplicates
      31837064 + 0 mapped (100.00%:-nan%)
      0 + 0 paired in sequencing
      0 + 0 read1
      0 + 0 read2
      0 + 0 properly paired (-nan%:-nan%)
      0 + 0 with itself and mate mapped
      0 + 0 singletons (-nan%:-nan%)
      0 + 0 with mate mapped to a different chr
      0 + 0 with mate mapped to a different chr (mapQ>=5)

      Have you tried comparing your results with the output of featureCounts and DESeq2?
      No, But the counts I used for comparision is from htSeqCount package

      Comment

      • yzzhang
        Member
        • Jan 2013
        • 67

        #4
        So you used different bam file for cufflinks and htseqcount? accepted_hits.bam for cufflinks and accepted_hits_sorted_dupRemoved_P1827_101.bam for htseqcount?

        Comment

        • blancha
          Senior Member
          • May 2013
          • 367

          #5
          I would just start by updating Cufflinks.
          I have noticed many bugs with older versions of Cufflinks, which often result in incorrect results.
          I wonder in fact how much of an impact the Cufflinks bugs have had on genomic research given its very widespread use.
          I'd be curious to know if your problem disappears with the latest version of Cufflinks.

          Comment

          • priya
            Member
            • Apr 2013
            • 57

            #6
            Originally posted by blancha View Post
            I would just start by updating Cufflinks.
            I have noticed many bugs with older versions of Cufflinks, which often result in incorrect results.
            I wonder in fact how much of an impact the Cufflinks bugs have had on genomic research given its very widespread use.
            I'd be curious to know if your problem disappears with the latest version of Cufflinks.
            Thanks for the suggestion! As the final output tables for htseq and cufflinks are delivered for us by sequencing facility, I am not sure which files they used as input for htseqcount and cufflinks (with duplicates or without), now I am rerunning all the results of Cufflinks and htSeq-Count with duplicated_removed_bam files and update you if I overcome the issue.

            Comment

            • blancha
              Senior Member
              • May 2013
              • 367

              #7
              Yes, please let me know if the issue is resolved.
              I would not remove the duplicates, though.

              Comment

              • priya
                Member
                • Apr 2013
                • 57

                #8
                Originally posted by blancha View Post
                Yes, please let me know if the issue is resolved.
                I would not remove the duplicates, though.
                Dear blancha,
                I rerun htseq-count (0.6.1) and cufflinks(version 2.2.1) on my RNA-Seq samples without removing the duplicates. But I see the same results now, without much changes

                First 3 values from Condition A with 3 replicates (red) and last 3 from Condition B with 3 replicates (Green)
                Counts
                ENSMUSG00000033450 Tagap 169 171 124 17 38 21
                FPKMS
                ENSMUSG00000033450 Tagap 4.43869 3.79506 3.20408 4.99311 4.66142 4.2739


                Counts = Cond A (mean) > Cond B (mean)
                FPKMs = Cond A (mean) < Cond B (mean)

                Actually this is not for all the genes, for fewer genes (say 15 genes as far I noticed)


                Commands used for htseq and cufflinks:

                samtools sort $raw_dir/$sam $bam_dir/$nnm'.sorted'

                cufflinks --library-type fr-firststrand -p 8 -G $main_dir/data/Mus_musculus.GRCm38.82.gtf -M $main_dir/data/mask_MR.gtf $bam_dir/$nnm'.sorted.bam' -o $cuff_dir

                samtools view $bam_dir/$nnm'.sorted.bam' | htseq-count -s reverse - $main_dir/data/Mus_musculus.GRCm38.82.gtf> $ht_dir/$nnm"_count.txt"

                Comment

                • dpryan
                  Devon Ryan
                  • Jul 2011
                  • 3478

                  #9
                  Have a look at the BAM files in IGV. It's likely that the files with low raw counts have a bunch of multimappers that htseq-count is (correctly, given its purpose) excluding while cufflinks (possibly correctly, given its purpose) is including. You should be able to tell by eye whether cufflinks is being reasonable or not.

                  Comment

                  • blancha
                    Senior Member
                    • May 2013
                    • 367

                    #10
                    Yes, @dpryan has a point.

                    @priya, I just want to point out that Cufflinks doesn't provide you only with the FPKM values, but also with the raw counts.
                    I initially thought you were giving me the raw counts for Cufflinks, and that there was a discrepancy between the raw counts given by Cufflinks and the FPKM given by Cufflinks.

                    If there is a difference between the counts for htseq-count and the counts for Cufflinks, this is a different matter altogether, and not necessarily indicative of a bug, since they may count reads differently, as pointed out by @dpryan.

                    The file "genes.count_tracking" contains the counts for Cuffdiff.

                    As pointed out by @dpryan, the simplest solution is always to check the reads in the BAM file yourself with IGV, if you are getting different counts with different software programs.

                    I hope I have not given you any misleading information
                    Last edited by blancha; 10-13-2015, 12:58 PM.

                    Comment

                    • priya
                      Member
                      • Apr 2013
                      • 57

                      #11
                      Thanks dpryan and blancha for your suggestions!

                      @dpryan is correct, I noticed it when I viewed my bam files in IGV, actually its issue with multiple mapped reads for low raw counts.

                      @blancha: My aim of the analysis is to identify differentially expressed genes between wild-type and knockout lines, after going through this old posts from seqanswers


                      Two ways - htseqcount -> DE-tests (DESEq/EdgeR/Voom)
                      Cufflinks -> Cuffdiff
                      I stick to the first approach and also in my case, problem arises with lower counts and genes with higher counts has no issues (even i compare htseq_counts and fpkms) ; so for downstream analysis I am considering to eliminate low counts with considerable cutoff and perform differential expression analysis.

                      Another question is ;
                      If we want to compare htseq_counts between two different genes, we need to normalize by gene-lengths ; so can I do by this formulae: (my beginning question of this post)
                      for gene i; normalized count (Nc)
                      Nci = (htseq_count)i * 10^9/ (gene length * Total reads in the sample)

                      so by this way both gene lengths and sample sizes are corrected (If i am not wrong??)

                      Comment

                      • dpryan
                        Devon Ryan
                        • Jul 2011
                        • 3478

                        #12
                        So that will sort of but not really make things comparable. There are a couple issues:

                        1) Unless there's only 1 isoform for each of the genes, there's always the question of what the most appropriate gene length is. In an ideal world you'd quantify things at the transcript level and then have an "effective gene length" to use, but that sort of thing takes a bit more work.

                        2) Typically things like GC content affect what gets sequenced, so you end up needing to correct for that as well.

                        3) Inevitably there are things not listed here that also need to be accounted for.

                        The safest method would be to try to do sequencing with some spike-ins that are similar so you can determine how to correct everything, but that'll take a significant time and resource commitment.

                        Comment

                        • priya
                          Member
                          • Apr 2013
                          • 57

                          #13
                          Good to know about those points! Thanks !

                          Comment

                          • blancha
                            Senior Member
                            • May 2013
                            • 367

                            #14
                            @priya

                            The most recent versions of DESeq2 have a function fpkm() which will compute the FPKM values.
                            You do need to provide the gene lengths yourself.
                            If you use featureCounts to count the reads, it will compute the gene lengths from the GTF file.
                            The computations performed by fpkm() are very similar to your formula, but there are some differences explained in the vignette.

                            The results for the lower count genes are always less reliable, regardless of the program used. The slight differences in the algorithms used to count the reads will also be magnified in lower count genes.

                            Comment

                            • priya
                              Member
                              • Apr 2013
                              • 57

                              #15
                              Originally posted by blancha View Post
                              @priya

                              The most recent versions of DESeq2 have a function fpkm() which will compute the FPKM values.
                              You do need to provide the gene lengths yourself.
                              If you use featureCounts to count the reads, it will compute the gene lengths from the GTF file.
                              The computations performed by fpkm() are very similar to your formula, but there are some differences explained in the vignette.

                              The results for the lower count genes are always less reliable, regardless of the program used. The slight differences in the algorithms used to count the reads will also be magnified in lower count genes.
                              @blancha Thanks for bringing into my notice about the latest version, its just updated yesterday .. (https://www.bioconductor.org/package...man/DESeq2.pdf)
                              I see there are lot of new functions added into DESeq2 , I need to explore more of it esp. fpkms() and normalizeGeneLength() .

                              Comment

                              Latest Articles

                              Collapse

                              • SEQadmin2
                                From Collection to Sequencing: Why Sample Preparation and Preservation Define Sequencing Data
                                by SEQadmin2


                                Data variability is still an issue in sequencing technologies despite the advances in reproducibility and accuracy of these platforms. But the problem does not originate in the sequencing itself, but in the previous steps, before the sample reaches the sequencer.


                                The first step is collection, followed by preservation and sample preparation for analysis. Most scientists overlook those steps, but not being careful might just be skewing the experiment’s results.
                                ...
                                06-02-2026, 10:05 AM
                              • SEQadmin2
                                Single-Cell Sequencing at an Inflection Point: Early Impacts of New Platforms and Emerging Trends
                                by SEQadmin2


                                With the launch of new single-cell sequencing platforms in 2026, the field stands at an exciting inflection point. This article surveys the most impactful advances in the field and discusses how they’re reshaping research in cancer, immunology, and beyond.


                                Introduction

                                Single-cell sequencing technologies have undergone remarkable advances over the past decade, transitioning from low-throughput experimental approaches to highly scalable platforms capable of...
                                05-22-2026, 06:42 AM
                              • SEQadmin2
                                Environmental Genomics in the Age of NGS: From Microbes to Conservation Strategies
                                by SEQadmin2

                                Studying ecosystems means dealing with complex, multi-species communities that are hard to observe at scale. This complexity, however, hides many important questions to be answered, from how biogeochemical cycles work and how climate change can affect species distribution to how conservation strategies can work best.


                                Genomics, particularly since the expansion of NGS, has transformed ecosystem ecology. By sequencing environmental DNA, we can now assess biodiversity without direct...
                                05-06-2026, 09:04 AM

                              ad_right_rmr

                              Collapse

                              News

                              Collapse

                              Topics Statistics Last Post
                              Started by SEQadmin2, Yesterday, 08:59 AM
                              0 responses
                              14 views
                              0 reactions
                              Last Post SEQadmin2  
                              Started by SEQadmin2, 06-02-2026, 12:03 PM
                              0 responses
                              22 views
                              0 reactions
                              Last Post SEQadmin2  
                              Started by SEQadmin2, 06-02-2026, 11:40 AM
                              0 responses
                              19 views
                              0 reactions
                              Last Post SEQadmin2  
                              Started by SEQadmin2, 05-28-2026, 11:40 AM
                              0 responses
                              32 views
                              0 reactions
                              Last Post SEQadmin2  
                              Working...