  • Cufflinks FPKM range

    Like many of you, I am seeing high FPKMs in my Cufflinks results.
    From the literature, it seems that small genes and upper-quartile normalization may be involved. I do find this to be true, but I am also finding FPKMs as high as ~4000 for some other genes (for example, a 3.5 kb gene). There are hundreds of such genes in the dataset. I viewed a few in IGV: they have a large number of reads, but certainly not as many as the FPKM suggests.

    Please comment on as many of my questions as you can.

    1. Have you observed such cases, and what could be the reason for them?

    2. What range of FPKMs is normally observed? Is there a normal range?

    3. What should I do with the small novel genes that Cufflinks finds? Should I just ignore them? Are there any command-line settings to prevent them?

    4. For non-novel genes (from the GTF annotation) with such high FPKMs, would you exclude them from cuffdiff or include them?

    Thank you for responding

  • #2
    I need an answer to this too...


    • #3
      FPKMs are simply rate measurements. A gene that only got 20 reads can still have an FPKM of 100. It all depends on the normalization, and in particular on that last part: per million mapped reads.
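
To make the rate idea concrete, here is a minimal sketch of the standard FPKM formula (fragments per kilobase of transcript per million mapped fragments). The function names and the 20-million library size are assumptions for illustration, not values from this thread, and this is not Cufflinks' actual estimator, which assigns fragments to isoforms probabilistically.

```python
def fpkm(fragments, length_bp, total_mapped):
    """Fragments per kilobase of transcript per million mapped fragments."""
    if length_bp <= 0 or total_mapped <= 0:
        raise ValueError("length_bp and total_mapped must be positive")
    return fragments * 1e9 / (length_bp * total_mapped)


def fragments_for_fpkm(target_fpkm, length_bp, total_mapped):
    """Invert the formula: how many fragments a given FPKM implies."""
    return target_fpkm * length_bp * total_mapped / 1e9


# 20 fragments on a 200 bp transcript, 1 million mapped fragments in the library
print(fpkm(20, 200, 1_000_000))  # 100.0

# Fragments implied by FPKM 4000 on a 3.5 kb gene,
# ASSUMING a library of 20 million mapped fragments (hypothetical size)
print(fragments_for_fpkm(4000, 3500, 20_000_000))  # 280000.0
```

Running the inverse calculation with your own library size gives a number you can compare directly with what you see in IGV.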

      There is no logical bottom end cutoff for FPKM where you can say "these genes are not expressed", other than 0 of course.

      If you mean that most of the genes in your results seem right but a subset of them have higher FPKMs than others with similar amounts of coverage, then you're probably seeing an artifact from the Cufflinks pipeline. I have seen that many times myself for small genes such as single-exon ones. It doesn't make much sense. I recommend trying the -b option on cufflinks and/or cuffdiff. That turns on the bias correction step within Cufflinks, and it seems to fix those erroneous FPKMs.
      /* Shawn Driscoll, Gene Expression Laboratory, Pfaff
      Salk Institute for Biological Studies, La Jolla, CA, USA */


      • #4
        I have a different problem: all the RPKM values in one of my datasets are shifted by about 10,000 relative to another! Both are described as having been prepared the same way. Here is the message I sent to the Cufflinks developers:

        I wanted to compare this dataset

        to available Encode datasets on other cell lines:


        My expression analysis with Cufflinks looks odd. In particular, the whole RPKM
        distribution seems to be shifted up for the samples in the first dataset (HMEC and
        HCC1954). For example, the minimum for both HMEC and H1HESC is 0, but the maxima
        are 3*10^9 and 3*10^4 respectively. In log space, the average RPKM for
        the other cell lines is around 2-3, while for HMEC and HCC1954 it is 10-12. At this
        point I went all the way back to the FASTQ files, realigned to hg19 with Bowtie,
        and used Cufflinks to compute RPKM; the difference remains. Any ideas why?

        It is true that one library may have more reads. But isn't FPKM supposed to normalize for the total number of reads in the library? If so, how can the entire distribution be shifted?

        2) On another note, I also do not understand how I am getting some really small non-zero values from both datasets when the total number of reads would not seem to permit this:

        total reads HMEC_expression:
        2.2983e+10

        min HMEC_expression >0
        3.0939e-312
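
One check that may help here, offered as an observation about floating-point numbers and not about Cufflinks internals: the reported minimum of 3.0939e-312 is smaller than the smallest normal double-precision value (about 2.2e-308), so it is a subnormal number. Reading that as numerical residue and not as a measured abundance is an assumption. The helper name below is made up for illustration.

```python
import sys


def is_subnormal(x):
    """True if x is nonzero but smaller in magnitude than the smallest normal double."""
    return x != 0.0 and abs(x) < sys.float_info.min


print(sys.float_info.min)         # smallest positive normal double, about 2.2e-308
print(is_subnormal(3.0939e-312))  # True
print(is_subnormal(0.1))          # False
```

Filtering out values that fail this check before taking logs or minima would show whether the rest of the distribution behaves sensibly.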


        I would really appreciate your help.


        • #5
          I've seen cuffdiff blow the read count normalization, but not cufflinks. In my case I saw a 10-fold increase in the baseline of one group's mean expression versus the other, which caused almost all genes to be tagged as significantly differentially expressed.

          Have you tried the different normalization options that Cufflinks provides? For example, the --compatible-hits-norm option, or the -N option for upper-quartile normalization?

          You can also look in the isoforms.fpkm_tracking files and check the "length" and "coverage" columns. Multiplying those two columns together gives roughly the number of bases aligned to each transcript, which is a proxy for its raw read count. Sum the column of products to get a rough "total bases aligned to genes" count, then divide each product by that total to roughly normalize the counts. Try that for each sample and see if you still have that massive offset between samples.
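
The manual check described above could be sketched like this. It assumes a tab-separated table with header columns named tracking_id, length and coverage, and numeric values in those columns; check this against your own files. The function name and the three-row table are made up for illustration.

```python
import csv
import io


def normalized_base_counts(handle):
    """Return {tracking_id: share of total aligned bases} from a tracking table."""
    products = {}
    for row in csv.DictReader(handle, delimiter="\t"):
        # length * coverage approximates the bases aligned to this transcript
        products[row["tracking_id"]] = float(row["length"]) * float(row["coverage"])
    total = sum(products.values())
    if total == 0:
        raise ValueError("no aligned bases found")
    return {tid: bases / total for tid, bases in products.items()}


# Made-up table, for illustration only
example = io.StringIO(
    "tracking_id\tlength\tcoverage\tFPKM\n"
    "tx1\t1000\t10\t5\n"
    "tx2\t2000\t15\t7\n"
    "tx3\t500\t120\t60\n"
)
shares = normalized_base_counts(example)
for tid, share in shares.items():
    print(tid, round(share, 3))
```

For a real run, open each sample's isoforms.fpkm_tracking file and pass the file handle in place of the example; comparing the shares for the same transcript across samples shows whether the offset survives this crude normalization.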
          /* Shawn Driscoll, Gene Expression Laboratory, Pfaff
          Salk Institute for Biological Studies, La Jolla, CA, USA */


          • #6
            Thanks, I will try this. But I am now worried that this software works erratically. Do you have any idea why the normalization blows up like this? Can I trust results that other people have computed with this software?


            • #7
              I don't use it as my primary quantification tool or as my primary differential expression tool. I've never seen DESeq or edgeR blow the normalization step. We are only talking about a division step, so it doesn't make sense for any software to mess it up. To me Cufflinks is very desirable, but I don't trust it, so I don't use it. I have explored it quite a lot, though, because I very much want to be able to use it.

              In your case it COULD be a result of the normalization being based on total reads aligned instead of the more robust upper-quartile method. But you should check the coverages to make sure. If your manual normalizations give you the same result, then you've got some small population of highly expressed genes biasing the normalization. The -N option should fix that, as should normalizing by the upper quartile of the genes' read counts yourself. I'd also try the -b option, because it seems to help fix some other things that Cufflinks does that make me not trust it. I still don't trust it, though. Maybe I'm just not smart enough to understand it.
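
A toy illustration of why a handful of highly expressed genes can bias total-count normalization while an upper-quartile factor stays put. The counts and function names are made up, and this is only a sketch of the idea, not how Cufflinks or the -N option is implemented.

```python
import statistics


def total_count_factor(counts):
    """Library size taken as the sum of all gene counts."""
    return sum(counts)


def upper_quartile_factor(counts):
    """Library size taken as the 75th percentile of the nonzero gene counts."""
    nonzero = [c for c in counts if c > 0]
    return statistics.quantiles(nonzero, n=4, method="inclusive")[2]


# Made-up counts: sample B matches sample A except for one hugely expressed gene
sample_a = [10, 20, 30, 40, 50, 60, 70, 80, 90]
sample_b = [10, 20, 30, 40, 50, 60, 70, 80, 100_000]

# Total-count factors differ by a factor of roughly 223
print(total_count_factor(sample_b) / total_count_factor(sample_a))

# Upper-quartile factors are identical, so the ratio is 1.0
print(upper_quartile_factor(sample_b) / upper_quartile_factor(sample_a))
```

With total-count scaling, every other gene in sample B would look about 223 times less expressed than in sample A, even though its raw count is unchanged.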
              /* Shawn Driscoll, Gene Expression Laboratory, Pfaff
              Salk Institute for Biological Studies, La Jolla, CA, USA */


              • #8
                Very low FPKM?

                sdriscoll-

                Nearly all of my FPKM values are very low. The median across all of my replicates is ~0.1, and I have between 50 and 60 million mapped reads per sample. Very few genes are above 10. See the attached graphs boxplot2.pdf and testdensity.pdf. Are these values too low, or are they, as you said, caused by a larger denominator and therefore okay? I've also attached a .pdf of a volcano plot, which is strange: I have ~870 significantly differentially expressed genes, but they all show up at the top of the graph, where they don't belong (the p-values are not that small). Perhaps cummeRbund is just doing something improperly.

                The sequencing is RNA-seq of ribosomal-RNA-depleted RNA. Could this lower the FPKMs? I did mask all repetitive regions when using cuffdiff.

                The sequencing was performed on a HiSeq. The data was generated with the Tuxedo suite: TopHat 2, Cufflinks, Cuffmerge, cummeRbund.
