  • honey
    Senior Member
    • Feb 2010
    • 151

    Very high RPKM values from Cufflinks

    I ran an RNA-seq experiment and analyzed it with TopHat > Cufflinks. It is a time-series experiment. When I looked at the RPKM values, some transcripts go as high as 32765.04908 or 2073.978485. Is this reasonable? I also looked at the BAM files, and there were certainly very large numbers of reads. Any feedback or suggestions on how to explain these high RPKM values?
  • peromhc
    Senior Member
    • Sep 2009
    • 108

    #2
    I think that these values should be log10-transformed. This is not documented; it is only my suspicion.

    The log10 of the values from Cufflinks roughly equals the FPKM values from Cuffdiff.


    • honey
      Senior Member
      • Feb 2010
      • 151

      #3
      Very high RPKM values, from 4.5 to several thousand

      However, the problem I have is that the RPKM values are high: approximately 5% are > 1000 RPKM (like 3245, 4356 and so on) in the same sample. If I change to log10, what will happen to the values around zero? Is it usual to log-transform RPKM values?
      Any feedback is welcome
      Thanks
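On the question of values around zero: log10(0) is undefined, so a log transform is usually applied after adding a small pseudocount. The sketch below assumes a pseudocount of 1, which is a common convention and not something Cufflinks prescribes; the function name `log10_fpkm` is made up for illustration.

```python
import math


def log10_fpkm(value, pseudocount=1.0):
    """Return log10(value + pseudocount).

    The pseudocount (an assumption, here 1.0) keeps zero expression finite:
    a value of 0 maps to log10(1) = 0 instead of minus infinity.
    """
    if value < 0:
        raise ValueError("expression values must be non-negative")
    return math.log10(value + pseudocount)


# Values taken from the posts above, plus an explicit zero.
values = [0.0, 4.5, 2073.978485, 32765.04908]
for v in values:
    print(v, round(log10_fpkm(v), 3))
```

The transform only compresses the scale for plotting or clustering. It does not change the ranking of transcripts, so it cannot by itself explain or fix unexpectedly high values.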


      • peromhc
        Senior Member
        • Sep 2009
        • 108

        #4
        Honey, I could be totally wrong here about the log10 thing, but I don't think I am.

        Can you look at the mappings for some of those transcripts where the 'raw' FPKM is about 0? Do they have few reads mapped?

        See:
        Discussion of next-gen sequencing related bioinformatics: resources, algorithms, open source efforts, etc


        • honey
          Senior Member
          • Feb 2010
          • 151

          #5
          High RPKM

          Thanks. I looked at the BAM files and can say that there are very few reads wherever the RPKM is 0, whereas the regions with very high values are hot spots with large numbers of reads. Now the question is whether this is an artifact. How do we handle both extremes, the very high and the very low RPKM values?


          • pbluescript
            Senior Member
            • Nov 2009
            • 224

            #6
            honey, it might be a good idea to look a bit more in depth into that specific gene. You can certainly get high FPKMs mapping to genes like actin that make up a lot of the mRNA percentage of a cell. I had huge numbers of reads mapping to one region of a miRNA gene once that all turned out to be within a LINE and a SINE. For that gene at least, it was clear the repeat regions skewed the results.


            • yumtaoist
              Member
              • Dec 2011
              • 10

              #7
              Is your reference very short?


              • honey
                Senior Member
                • Feb 2010
                • 151

                #8
                Very large RPKM

                It is the human genome, so it is not small.
                The genes with very high RPKM values are relevant to the biology of the tissue samples, but my problem is how to provide a scientific rationale that our results are not nonspecific.
                Thanks for the input


                • Nicolas
                  Member
                  • Apr 2009
                  • 41

                  #9
                  Originally posted by peromhc View Post
                  I think that these values should be taken to the log(10).. this is not documented, but my suspicion.

                  log(10) values from cufflinks roughly equals FPKM values from cuffdiff..
                  That does not make sense to me, unless it is an option in either Cufflinks or Cuffdiff. I have never seen a log relationship between Cufflinks and Cuffdiff outputs.

                  Honey, how did you run Cufflinks? RABT mode or simple "quantification" mode? How long are the genes with super-high RPKM?

                  It seems to me that Cufflinks has a tendency to report super-high RPKM for very short transcripts (such as microRNAs). I now routinely filter out the transcripts shorter than the expected fragment size (from the GTF annotation file). I think there is a good rationale for filtering them out, because they cannot be accurately captured by the RNA-Seq protocol.

                  In RABT mode, Cufflinks also reports a large number of short transcripts with crazy high values. A solution could be to re-quantify the discovered transcripts with something like BEDtools or HTSeq-count...


                  • Xiaobin
                    Junior Member
                    • Jun 2011
                    • 2

                    #10
                    Are those genes very short?
                    Because Cufflinks subtracts the fragment length from the gene length when calculating FPKM, very short genes can sometimes give this kind of result.
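A rough illustration of that effect, assuming the simplified effective length L - m + 1 for a transcript of length L and a mean fragment length m. Cufflinks itself works with the full fragment length distribution, so this is only a sketch; the function names and all numbers are made up for illustration.

```python
def fpkm(fragments, length_bp, total_mapped_fragments):
    """Fragments per kilobase of transcript per million mapped fragments."""
    return fragments * 1e9 / (length_bp * total_mapped_fragments)


def effective_length(transcript_length, mean_fragment_length):
    """Simplified effective length: the number of start positions at which a
    fragment of the mean length fits inside the transcript (floored at 1)."""
    return max(transcript_length - mean_fragment_length + 1, 1)


total = 20_000_000   # hypothetical number of mapped fragments
mean_frag = 200      # hypothetical mean fragment length in bp
fragments = 100      # same fragment count for every transcript

for length in (220, 500, 2000):
    naive = fpkm(fragments, length, total)
    corrected = fpkm(fragments, effective_length(length, mean_frag), total)
    # The ratio shows how much the length correction inflates the value.
    print(length, round(naive, 2), round(corrected, 2), round(corrected / naive, 2))
```

With these made-up numbers the inflation is small for the 2000 bp transcript and large for the 220 bp one, because its effective length shrinks to only 21 bp.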


                    • honey
                      Senior Member
                      • Feb 2010
                      • 151

                      #11
                      Very high RPKM

                      Here are three examples with genomic coordinates:

                      CGA: 87795222-87804824
                      KISS1: 204159469-204165619
                      TFPI2: 93515745-93520065

                      Should I then go back to a count-based method?

                      Thanks for all your help.


                      • honey
                        Senior Member
                        • Feb 2010
                        • 151

                        #12
                        High RPKM

                        I used simple quantification mode.

                        So you mean that a count-based method is probably better?


                        • Xiaobin
                          Junior Member
                          • Jun 2011
                          • 2

                          #13
                          These genes don't seem to be that short. There must be other reasons.
                          I suggest you try a count-based method first. Cufflinks is just too complex to understand fully.


                          • Cole Trapnell
                            Senior Member
                            • Nov 2008
                            • 213

                            #14
                            This issue has been discussed elsewhere on this board. As Nicolas points out, RNA-Seq really isn't reliable for very short transcripts. The reason is that all the fragments that map to these transcripts come from the "tail" of the distribution of library fragment lengths. That is, fragments that map to microRNAs are much, much shorter than most fragments in the library - by design in the RNA-Seq protocol, which size-selects away very short inserts. Thus, Cufflinks infers that even though relatively few fragments actually mapped to the microRNAs, there were probably TONS of individual microRNA molecules in the transcriptome before all of the various size selection parts of the protocol kicked in. Cufflinks accordingly increases the FPKM of these short transcripts to compensate for the bias against short fragments in the library.

                            This compensation was designed to improve accuracy for transcripts that are in the 500bp-1kb range - for longer transcripts, the "edge effects" due to library fragment size aren't much of an issue. However, I wouldn't trust FPKM values for transcripts shorter than your average fragment length. There's really just not enough data in most standard RNA-Seq libraries to say much about small RNA abundance.

                            I should also point out that other methods use this same bias correction technique (RSEM for example). As far as I'm aware, the "count-based" methods don't, but that doesn't mean they shouldn't. Most of those methods are strictly for differential analysis, where any edge effects are assumed to be affecting each condition the same way. That may or may not be the case in your data.

                            In any case, the quick answer to this problem is to simply remove or ignore transcripts shorter than around 300bp from your GTF. In a future version, we will be flagging these transcripts as too short for reliable quantification where appropriate.
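A minimal sketch of that filtering step, assuming a GTF in which exon records carry a `transcript_id` attribute and taking the summed exon length as the transcript length. The function names and the 300 bp default are illustrative and are not part of Cufflinks.

```python
import re
from collections import defaultdict

TRANSCRIPT_ID = re.compile(r'transcript_id "([^"]+)"')


def transcript_lengths(gtf_lines):
    """Sum exon lengths per transcript_id (GTF coordinates are 1-based, inclusive)."""
    lengths = defaultdict(int)
    for line in gtf_lines:
        if line.startswith("#") or not line.strip():
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9 or fields[2] != "exon":
            continue
        match = TRANSCRIPT_ID.search(fields[8])
        if match:
            lengths[match.group(1)] += int(fields[4]) - int(fields[3]) + 1
    return dict(lengths)


def filter_short_transcripts(gtf_lines, min_length=300):
    """Drop every record of a transcript whose summed exon length is below
    min_length. Records without a transcript_id are kept; a transcript with
    no exon records counts as length 0 and is dropped."""
    gtf_lines = list(gtf_lines)
    lengths = transcript_lengths(gtf_lines)
    kept = []
    for line in gtf_lines:
        match = TRANSCRIPT_ID.search(line)
        if match and lengths.get(match.group(1), 0) < min_length:
            continue
        kept.append(line)
    return kept


# Tiny made-up annotation: one 81 bp transcript and one 500 bp transcript.
example = [
    'chr1\tdemo\texon\t100\t180\t.\t+\t.\tgene_id "g1"; transcript_id "short_tx";\n',
    'chr1\tdemo\texon\t1000\t1299\t.\t+\t.\tgene_id "g2"; transcript_id "long_tx";\n',
    'chr1\tdemo\texon\t2000\t2199\t.\t+\t.\tgene_id "g2"; transcript_id "long_tx";\n',
]
for line in filter_short_transcripts(example):
    print(line, end="")
```

For a real annotation, the same functions could be fed the lines of the GTF file and the result written to a new file that is then passed to the quantification step.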


                            • epi
                              Member
                              • Jan 2012
                              • 38

                              #15
                              Hi Cole, thanks for your post. I keep reading your comments here, which are useful for many people, including me. I asked a similar question, with a twist, here: http://seqanswers.com/forums/showthread.php?t=17992

                              Can you comment, please? In short, it is about how to deal with larger (>300 bp) transcripts with high FPKMs.




