
How do I get one FPKM value per gene?


  • #16
    Batch ORF finder for Cufflinks-assembled transcripts (mRNA)

    Hi,
    I have used Cufflinks to assemble transcripts (mRNA) from an RNA-Seq experiment.
    My goal is to check the likely length of the UTRs of each transcript, so I first need to find the best ORF for each transcript. Is there a tool that can find the best ORF in batch?

    Comment


    • #17
      The multiple FPKM problem occurs when genes have transcripts that do not overlap with any other transcripts in the gene. For example, this occurs in the ENSG00000125388 gene from ENSEMBL/hg19. We are aware of this issue and will eventually change the behavior, but for now a simple solution is just to sum the FPKMs since the gene FPKMs are just the sum of the transcript FPKMs anyways. The issue should not occur in Cuffdiff.

      I would not draw any conclusions about the FPKM of the FAILED genes.
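A minimal sketch of the summing workaround described above, assuming a tab-separated tracking table whose header includes columns named `gene_id`, `FPKM` and `FPKM_status`. The rows are invented for illustration, the function name `sum_fpkm_per_gene` is hypothetical, and skipping every row whose status is not `OK` is an editorial choice made here, not Cufflinks behavior. A real genes.fpkm_tracking file carries more columns than this toy table.

```python
import csv
import io
from collections import defaultdict

# Hypothetical rows in the general layout of a tracking file.
# Gene IDs, loci and FPKM values are invented for illustration only.
TRACKING_TEXT = (
    "tracking_id\tgene_id\tgene_short_name\tlocus\tFPKM\tFPKM_status\n"
    "GENE_A\tGENE_A\tGeneA\tchr1:100-500\t10.0\tOK\n"
    "GENE_A\tGENE_A\tGeneA\tchr1:900-1500\t2.5\tOK\n"
    "GENE_B\tGENE_B\tGeneB\tchr2:50-800\t7.0\tOK\n"
    "GENE_C\tGENE_C\tGeneC\tchr3:10-90\t4.0\tFAIL\n"
)


def sum_fpkm_per_gene(handle, id_column="gene_id"):
    """Return {gene id: summed FPKM}, skipping rows whose status is not OK."""
    totals = defaultdict(float)
    for row in csv.DictReader(handle, delimiter="\t"):
        # Assumption: rows flagged anything other than OK are left out.
        if row.get("FPKM_status", "OK") != "OK":
            continue
        totals[row[id_column]] += float(row["FPKM"])
    return dict(totals)


gene_totals = sum_fpkm_per_gene(io.StringIO(TRACKING_TEXT))
for gene, fpkm in sorted(gene_totals.items()):
    print(f"{gene}\t{fpkm:.4f}")
```

To run it on a real file, pass an open file handle instead of the `io.StringIO` wrapper. Only the point estimates are summed; the confidence bounds cannot be combined this way.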

      Comment


      • #18
        Originally posted by adarob View Post
        The multiple FPKM problem occurs when genes have transcripts that do not overlap with any other transcripts in the gene. For example, this occurs in the ENSG00000125388 gene from ENSEMBL/hg19. We are aware of this issue and will eventually change the behavior, but for now a simple solution is just to sum the FPKMs since the gene FPKMs are just the sum of the transcript FPKMs anyways. The issue should not occur in Cuffdiff.

        I would not draw any conclusions about the FPKM of the FAILED genes.
        Hi Adam,
        I ran tophat (1.1.0) without a mouse gtf file, then ran cufflinks (0.9.1), also without a mouse gtf file. Then I ran cuffcompare with a mouse gtf file and the two gtf files generated by cufflinks for my two samples. Finally, I ran cuffdiff with compare.combined.gtf and the two accepted_hits.bam files.

        However, when I checked gene_exp.diff, I found that some genes still have the multiple FPKM problem (see below):

        XLOC_000009 Cspp1 chr1:10053629-10189988 q1 q2 OK 44.5012 58.359 0.271096 -2.93789 0.00330457 yes
        XLOC_000010 Arfgef1 chr1:10053629-10189988 q1 q2 OK 10.0582 7.68137 -0.269589 4.88261 1.04688e-06 yes
        XLOC_000011 Arfgef1 chr1:10053629-10189988 q1 q2 OK 40.66 31.8566 -0.244 17.6406 0 yes
        XLOC_000013 Arfgef1 chr1:10053629-10189988 q1 q2 OK 2.7768 40.8059 2.68753 -144.972 0 yes
        XLOC_000015 Arfgef1 chr1:10053629-10189988 q1 q2 OK 54.0345 65.0081 0.18489 -12.9339 0 yes
        XLOC_000016 Arfgef1 chr1:10053629-10189988 q1 q2 OK 23.4654 43.6672 0.62107 -29.4492 0 yes
        XLOC_000031 Tram2 chr1:20986216-20997026 q1 q2 OK 5.8219 2.96147 -0.67594 3.70609 0.000210487 yes
        XLOC_000032 Tram2 chr1:20986216-20997026 q1 q2 OK 3.33419 14.9065 1.49757 -29.7646 0 yes
        XLOC_000057 Tmem131 chr1:36849038-36996484 q1 q2 OK 37.3723 30.8444 -0.191975 5.03247 4.84195e-07 yes

        Did I do something wrong?

        I have another question regarding the gene_exp.diff file. As you can see, the first gene, Cspp1, has the same coordinates (chr1:10053629-10189988) as the second gene, Arfgef1. But in my mouse gtf file (from Ensembl), the coordinates for those two genes are:
        Cspp1: Chromosome 1: 10,028,299-10,126,849
        Arfgef1: Chromosome 1: 10,127,652-10,222,751

        Those two genes do not overlap. Why do they have the same coordinates in the gene_exp.diff file?

        Thank you very much!
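The coordinate question can be checked with a small interval test. This is only a sketch using the numbers quoted in the post; the helper name `overlaps` is hypothetical. It confirms that the two annotated genes do not overlap each other, while each of them overlaps the single locus reported in gene_exp.diff. One possible reading, which this thread does not confirm, is that the locus column describes a larger assembled region rather than each gene's annotated span.

```python
def overlaps(a, b):
    """True if closed intervals a=(start, end) and b=(start, end) share a base."""
    return a[0] <= b[1] and b[0] <= a[1]


# Coordinates quoted in the post (mouse chr1).
cspp1 = (10028299, 10126849)    # Ensembl annotation
arfgef1 = (10127652, 10222751)  # Ensembl annotation
locus = (10053629, 10189988)    # locus column shown in gene_exp.diff

print("Cspp1 vs Arfgef1:", overlaps(cspp1, arfgef1))
print("Cspp1 vs locus:", overlaps(cspp1, locus))
print("Arfgef1 vs locus:", overlaps(arfgef1, locus))
```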

        Comment


        • #19
          If one has to sum the FPKMs for a gene, should one use the FPKM gene tracking file or the gene expr file from cuffdiff? Mgogol's perl script uses the FPKM lo, FPKM hi and FPKM values, which are only in the tracking file. Is it OK to sum the FPKM values for a gene?
          Thanks

          Comment


          • #20
            Originally posted by adarob View Post
            The multiple FPKM problem occurs when genes have transcripts that do not overlap with any other transcripts in the gene. For example, this occurs in the ENSG00000125388 gene from ENSEMBL/hg19. We are aware of this issue and will eventually change the behavior, but for now a simple solution is just to sum the FPKMs since the gene FPKMs are just the sum of the transcript FPKMs anyways. The issue should not occur in Cuffdiff.

            I would not draw any conclusions about the FPKM of the FAILED genes.
            Does this mean that I will have to download the cuffcompare file, edit it, upload it on galaxy and then run cuffdiff on this gtf file? Thanks for your help!

            Comment


            • #21
              Sorry, in my previous post I asked whether the cuffcompare file needs to be edited. I just looked at a cuffcompare file, and it seems to have only annotation information and no FPKM values. So how (or where) is one supposed to combine the FPKM values from the different transcripts of a gene and run cuffdiff?

              Comment


              • #22
                Read

                It is not clear what you want to say. However, I agree that FPKM per gene is an area of ongoing research.

                Comment


                • #23
                  Hi Honey,
                  Sorry if I am not being clear. This is what I have done so far and I am struggling to make some sense of the information I am getting:
                  1. I have 2 .bam files (1 control and 1 disease). I am trying to identify gene expression differences.
                  2. Using galaxy I ran the cufflinks-cuffcompare-cuffdiff workflow.
                  3. For running cufflinks, I took the .bam files and ran cufflinks with the defaults.
                  4. I ran cuffcompare (with assembled transcripts file from each of the sample, along with the reference).
                  5. I fed the output (transcript file) of cuffcompare along with the two original bam files into cuffdiff.
                  6. I was looking at the output of cuffdiff and am seeing a few things I don't quite understand:
                  There is more than one row per gene for most of the genes in the output file (I would have thought that differential expression would be reported at the gene level). I read in some other threads on SEQanswers (including this one) that summing the FPKM values of the transcripts should give me the gene-level value (which is fine). What I don't understand is which output file from the workflow I should perform the operation on:
                  a) The cufflinks output has the FPKM, but no gene annotations
                  b) The cuffcompare output has the annotations, but not the FPKM values (unless I am missing them).
                  c) The cuffdiff output has both the FPKM and gene annotation values, but the "statistical" analysis is already done.
                  So should I take the cuffdiff output, edit it and then feed it back into the workflow (again, at what point?)
                  This is where my first confusion is coming from.

                  There is another (possibly related) issue that some of the transcripts in the cuffdiff output have FPKM = 0, so when diff analysis is run, the FC are ridiculous.
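To illustrate why a zero FPKM breaks the fold change, here is a small sketch. The pseudocount approach shown is a common general-purpose workaround swapped in for illustration; it is not what cuffdiff does, and both function names and the default pseudocount of 1.0 are assumptions made for this example.

```python
import math


def raw_log2_fold_change(fpkm_a, fpkm_b):
    """log2(b / a), or None when either value is zero and the ratio is undefined."""
    if fpkm_a == 0 or fpkm_b == 0:
        return None
    return math.log2(fpkm_b / fpkm_a)


def log2_fold_change(fpkm_a, fpkm_b, pseudocount=1.0):
    """log2 fold change after adding a pseudocount to both values (an assumption)."""
    return math.log2((fpkm_b + pseudocount) / (fpkm_a + pseudocount))


print(raw_log2_fold_change(0.0, 3.0))  # undefined without a pseudocount
print(log2_fold_change(0.0, 3.0))      # finite once the pseudocount is added
```

The pseudocount shrinks fold changes for weakly expressed transcripts, so the value chosen affects the ranking; treat the result as descriptive, not as a substitute for the test statistics in the cuffdiff output.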

                  What is making this all the more frustrating is that I am trying to use published data (with a paper that lists genes that are differentially expressed between conditions, analyzed using Galaxy) in a bid to educate myself, and I am going in circles.

                  As you pointed out in one of my other threads, I have a lot of reading to do. But at the risk of sounding like a nag and unbelievably dense, I have been unsuccessful in finding material that might help me understand these things.

                  Any help from anybody is greatly appreciated.

                  Comment


                  • #24
                    Look at the cuffdiff output files gene.expr and isoform.expr, which are the diff files, and the combined GTF file. To get one FPKM per gene, the suggestion is to sum the FPKMs that share the same gene name and the same location. However, as Adam has also suggested, if a gene has more than one location (overlap) it may not be possible to sum those FPKMs. It is an ongoing area of research. I am not very convinced that summing the FPKMs of all rows per gene is a good idea, though several publications, including a recent one, have reported doing the same (http://genome.cshlp.org/content/earl...d-4783a31b68c6). My suggestion is that if you are trying to learn RNA-seq, start with isoform.expr, not the gene level.
                    Best.
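A sketch of the "same gene name and same location" grouping, applied to the gene_exp.diff rows posted in #18. It assumes whitespace-separated fields with the gene name in column 2, the locus in column 3, and the two samples' FPKM values in columns 7 and 8; that column reading and the function name `sum_by_gene_and_locus` are assumptions for illustration. Whether the summed value is meaningful is exactly what is debated in this thread.

```python
from collections import defaultdict

# Rows copied from the gene_exp.diff excerpt in post #18.
ROWS = """\
XLOC_000009 Cspp1 chr1:10053629-10189988 q1 q2 OK 44.5012 58.359 0.271096 -2.93789 0.00330457 yes
XLOC_000010 Arfgef1 chr1:10053629-10189988 q1 q2 OK 10.0582 7.68137 -0.269589 4.88261 1.04688e-06 yes
XLOC_000011 Arfgef1 chr1:10053629-10189988 q1 q2 OK 40.66 31.8566 -0.244 17.6406 0 yes
XLOC_000013 Arfgef1 chr1:10053629-10189988 q1 q2 OK 2.7768 40.8059 2.68753 -144.972 0 yes
XLOC_000015 Arfgef1 chr1:10053629-10189988 q1 q2 OK 54.0345 65.0081 0.18489 -12.9339 0 yes
XLOC_000016 Arfgef1 chr1:10053629-10189988 q1 q2 OK 23.4654 43.6672 0.62107 -29.4492 0 yes
XLOC_000031 Tram2 chr1:20986216-20997026 q1 q2 OK 5.8219 2.96147 -0.67594 3.70609 0.000210487 yes
XLOC_000032 Tram2 chr1:20986216-20997026 q1 q2 OK 3.33419 14.9065 1.49757 -29.7646 0 yes
XLOC_000057 Tmem131 chr1:36849038-36996484 q1 q2 OK 37.3723 30.8444 -0.191975 5.03247 4.84195e-07 yes
"""


def sum_by_gene_and_locus(text):
    """Sum the two sample values for rows sharing a (gene name, locus) pair."""
    totals = defaultdict(lambda: [0.0, 0.0])
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 8:
            continue  # skip blank or truncated lines
        key = (fields[1], fields[2])
        totals[key][0] += float(fields[6])  # assumed: sample 1 FPKM
        totals[key][1] += float(fields[7])  # assumed: sample 2 FPKM
    return {key: tuple(values) for key, values in totals.items()}


summed = sum_by_gene_and_locus(ROWS)
for (gene, locus), (q1, q2) in sorted(summed.items()):
    print(f"{gene}\t{locus}\t{q1:.4f}\t{q2:.4f}")
```

Keying on the pair instead of the gene name alone keeps rows at different loci apart, which matches the caution above about genes reported at more than one location.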

                    Comment


                    • #25
                      Hi yjlui,

                      Have you already figured out the meaning of the "test status" values "OK", "LOWDATA", and "FAIL"?
                      Should I delete those transcripts before downstream analysis and treat them as poorly assembled transcripts?
                      Apart from that, do you have any idea what an FPKM of 0 means?
                      Does it mean that those transcripts are poorly assembled as well?
                      Thanks in advance.

                      Comment


                      • #26
                        Collapse duplicate FPKMs for a gene

                        Originally posted by mgogol View Post
                        I ended up writing a script to sum the FPKMS for a given gene id, which I think is right...

                        Here's my (unpolished) code (a perl script and a shell script).

                        This botches the confidence intervals, by the way.

                        The format of the cufflinks output (genes.fpkm_tracking files) is now different from before. I updated the code written by mgogol and published it on sourceforge.net: https://sourceforge.net/projects/col...?source=navbar . I hope it will facilitate your work.

                        Comment


                        • #27
                          I'm using Cufflinks 2.2.1 but still seeing duplicate genes in the tracking file. Has the issue ever been fixed?

                          Comment
