cummeRbund - how to get gene name in diffData output

  • cummeRbund - how to get gene name in diffData output

    I'm going through the tophat/cufflinks/cuffmerge/cuffdiff/cummeRbund protocol just published at Nature Protocols. I'm wondering how to get a different identifier (e.g. the gene symbol instead of XLOC) in the diffData(genes(cuff)) output for printing tables, annotating plots, etc.

    Code:
    > cuff <- readCufflinks()
    > gene_diff_data <- diffData(genes(cuff))
    > sig_gene_data <- subset(gene_diff_data, significant=="yes")
    > head(sig_gene_data)
            gene_id sample_1 sample_2 status   value_1   value_2 ln_fold_change test_stat     p_value     q_value significant
    3   XLOC_000003       C1       C2     OK   48.7791   77.9603       0.676476 -11.81010 0.00000e+00 0.00000e+00         yes
    60  XLOC_000060       C1       C2     OK   65.0337  107.4470       0.724360 -39.42460 0.00000e+00 0.00000e+00         yes
    81  XLOC_000081       C1       C2     OK   26.8576   23.4374      -0.196521   4.68763 2.76389e-06 1.26818e-04         yes
    175 XLOC_000175       C1       C2     OK 2724.4700 2514.0400      -0.115971   7.49516 6.61693e-14 4.70596e-12         yes
    237 XLOC_000237       C1       C2     OK   23.3386   19.1596      -0.284646   3.73237 1.89689e-04 6.18840e-03         yes
    250 XLOC_000250       C1       C2     OK   24.0661   42.1466       0.808413  -9.23741 0.00000e+00 0.00000e+00         yes
    >

  • #2
    Hi turnersd,
    The workflow for cummeRbund has been simplified a bit since the paper was submitted. The recommended approach (for cummeRbund 1.1.3 or greater) is as follows:

    Code:
    > cuff <- readCufflinks()
    
    #Retrieve significant gene IDs (XLOC) with a pre-specified alpha
    > diffGeneIDs <- getSig(cuff,level="genes",alpha=0.05)
    
    #Use returned identifiers to create a CuffGeneSet object with all relevant info for given genes
    > diffGenes<-getGenes(cuff,diffGeneIDs)
    
    #gene_short_name values (and corresponding XLOC_* values) can be retrieved from the CuffGeneSet by using:
    > featureNames(diffGenes)

    fpkm(), fpkmMatrix(), features(), and diffData() are all available methods for the CuffGeneSet object as well.

    Cheers,
    Loyal


    • #3
      weird...

      cummeRbund tells me this:

      > diffGeneIDs <-getSig(cuff_data_Input,level="genes",alpha=0.05)
      Error: could not find function "getSig"

      Did I do something wrong?


      • #4
        OK, got it, I got an older version (0.1.3)....
        But where can I find the latest version then?

        K.
        Last edited by kareldegendt; 03-25-2012, 10:45 PM. Reason: new problem


        • #5
          Hi,
          you can find the freshest cummeRbund at:

          http://compbio.mit.edu/cummeRbund/

          Also, be sure to sign up for the bowtie-bio-announce mailing list if you would like to be updated to new releases/features.

          Cheers,
          Loyal


          • #6
            Hi Loyal,

            I'm facing the same situation as kareldegendt.

            Can you provide a set of instructions for upgrading to a newer version when cummeRbund is already on a system (64-bit Linux)? This would be very helpful for R novices like myself. I've tried downloading and unzipping the tarball and using make, but that doesn't seem to work.

            Thanks,

            Shurjo


            • #7
              OK, here's what I did:

              I first downloaded the cummeRbund Mac OS X binary (for version 1.1.5)

              Then I did this in R:
              > install.packages('/Users/kareldegendt/Downloads/cummeRbund_1.1.5.tgz',repos = NULL)

              then I added cummeRbund to the current session:

              >library(cummeRbund)

              It loaded a bunch of things and told me that the package cummeRbund was built under R version 2.15.00
              No clue if that's gonna hurt anything. I'll test and if so, I'll probably upgrade R...

              best,
              Karel
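
After an install like the one above, it can help to confirm which version R actually picks up, since getSig() is described earlier in this thread as available in cummeRbund 1.1.3 or greater. The following is a minimal sketch, not part of cummeRbund: `meets_minimum` is a hypothetical helper name, and the package lookup is guarded so the block also runs on a machine without cummeRbund.

```r
# Hypothetical helper: TRUE when a version string meets a required minimum.
meets_minimum <- function(installed, required) {
  package_version(installed) >= package_version(required)
}

# Versions mentioned in this thread
old_ok <- meets_minimum("0.1.3", "1.1.3")  # the older install from post #4
new_ok <- meets_minimum("1.1.5", "1.1.3")  # the binary installed in post #7
print(c(old = old_ok, new = new_ok))

# Check the copy R actually finds, if any. system.file() returns ""
# when the package is not installed, so nothing is loaded here.
if (nzchar(system.file(package = "cummeRbund"))) {
  installed <- as.character(utils::packageVersion("cummeRbund"))
  message("cummeRbund ", installed, " meets 1.1.3: ",
          meets_minimum(installed, "1.1.3"))
}
```

Comparing with package_version() rather than plain strings avoids surprises such as "1.10.0" sorting before "1.9.0".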


              • #8
                OK, this works BUT:

                I still did not get gene symbols (e.g. Akt or Bact), only the XLOC_000.... and uc00.... names...
                Not really a solution :-/

                K.
                Last edited by kareldegendt; 03-27-2012, 01:04 PM. Reason: typo


                • #9
                  Hi Karel,

                  Thanks for the tips. They worked for me as well.

                  Regards,

                  Shurjo


                  • #10
                    You could do it with merge:
                    cuff <- readCufflinks()

                    #Retrieve significant gene IDs (XLOC) with a pre-specified alpha
                    diffGeneIDs <- getSig(cuff,level="genes",alpha=0.05)

                    #Use returned identifiers to create a CuffGeneSet object with all relevant info for given genes
                    diffGenes<-getGenes(cuff,diffGeneIDs)

                    #gene_short_name values (and corresponding XLOC_* values) can be retrieved from the CuffGeneSet by using:
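                    # 'geneNames' avoids masking base R's names() function
                    geneNames<-featureNames(diffGenes)
                    row.names(geneNames)<-geneNames$tracking_id
                    # drop the tracking_id column; drop=FALSE keeps a data.frame and the gene_short_name column name
                    diffGenesNames<-geneNames[,-1,drop=FALSE]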

                    # get the data for the significant genes
                    diffGenesData<-diffData(diffGenes)
                    row.names(diffGenesData)=diffGenesData$gene_id
                    diffGenesData<-diffGenesData[,-1]

                    # merge the two matrices by row names
                    diffGenesOutput<-merge(diffGenesNames,diffGenesData,by="row.names")


                    • #11
                      Hi All,
                      Sorry I missed the earlier posts in this thread. Sorry for the troubles in getting updates to cummeRbund installed, but it appears that you were both successful.

                      The reason for this setup is that a 'stable' version of cummeRbund (v1.0.0) has to be maintained with the current 'release' version of Bioconductor. Active development of new features (including getSig, etc.) is done on the 'development' version of Bioconductor, which is attached to the 'development' release of R (currently v2.15). When the new version of Bioconductor is released (in the next few weeks), all of the development features of cummeRbund will be available using the standard BioC install methods. The benefit of using the development version is earlier access to newer features; the drawback is a moderate amount of instability and growing pains. You can also install the development version of R, and the most recent version of cummeRbund will then be installed by biocLite() by default.

                      The way I am trying to write cummeRbund, the 'development' versions should also be compatible with earlier versions of R (at least 2.13 or greater).

                      To answer the question of gene names directly, they can always be accessed as part of the 'features' data.frame returned from a call to features() on a CuffData or CuffGeneSet object. The 'featureNames()' function is just a shorthand that in most cases just returns the gene_id and gene_short_names (when present) only.

                      Another common way to represent the FPKM data is in a 'matrix' format of featuresXconditions. You can use fpkmMatrix(myGeneSet,fullnames=T) to generate this matrix.

                      The general problem with using gene names is that they are inherently non-unique (despite efforts to enforce this). This causes significant problems for a lot of the behind-the-scenes data wrangling in both cufflinks/cuffdiff and cummeRbund. This is why the XLOC_* and TCONS_* ids are essential to track individual features. Our suggestion, as mentioned above, is to use the 'features()' method to get all annotation associated with features in a CuffData, CuffGeneSet, or CuffGene object. The output of this method should be a standard R data.frame on which you can do any manipulations/merges that you would like. Please let me know if you have specific workflows in which you are having difficulty mapping these ids to gene names and I can help with the syntax.

                      Cheers!

                      Loyal
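
The suggestion above (take the data.frame from features() and merge it yourself, keyed on the XLOC id) can be sketched with mock data so that it runs without cummeRbund. The two data.frames below are stand-ins for features(diffGenes) and diffData(diffGenes). The column names `gene_id` and `gene_short_name` are assumptions based on the names used in this thread, and the gene names are made up, with one repeated on purpose to show why names are unsafe as keys.

```r
# Mock stand-in for features(diffGenes); column names are assumed
annot <- data.frame(
  gene_id         = c("XLOC_000003", "XLOC_000060", "XLOC_000081"),
  gene_short_name = c("GeneA", "GeneB", "GeneA"),  # made-up names; "GeneA" repeats
  stringsAsFactors = FALSE
)

# Mock stand-in for diffData(diffGenes), using values from the first post
dd <- data.frame(
  gene_id        = c("XLOC_000003", "XLOC_000060", "XLOC_000081"),
  sample_1       = "C1",
  sample_2       = "C2",
  ln_fold_change = c(0.676476, 0.724360, -0.196521),
  q_value        = c(0, 0, 1.26818e-04),
  stringsAsFactors = FALSE
)

# Join on the unique XLOC id, never on the gene name
annotated <- merge(annot, dd, by = "gene_id")

# Gene names that map to more than one XLOC id
dup_names <- unique(annot$gene_short_name[duplicated(annot$gene_short_name)])

print(annotated)
print(dup_names)
```

With real data, `annot` and `dd` would come from the features() and diffData() calls. If more than two samples are compared, diffData() may contain several rows per gene, which merge() handles by repeating the annotation for each row.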


                      • #12
                        Originally posted by Thomas Doktor:
                        You could do it with merge:
                        Thanks Thomas for posting this solution...

                        Cheers,
                        Loyal


                        • #13
                          Hi everyone,
                          very helpful responses so far! Retrieving the gene names works great for me, but unfortunately just for differentially expressed genes. If I want to look at splicing or isoforms I cannot get it to work. I just get a list of identifiers or, if I try with the above path and use "Isoforms" or "Splicing" instead of "Genes" the list is empty.
                          I am very new to all these things so not sure what I am doing wrong or what I have to do to determine the significantly differentially expressed isoforms, splicing, promoters,...and get a list including the gene names.
                          Will be very grateful for any help!
                          Cheers,
                          K


                          • #14
                            Hi all,
                            It finally worked for me too. I re-ran tophat and cufflinks with the refFlat file as an annotation file and that did the trick :-)

                            Thanks for all your help!!

                            Karel


                            • #15
                              Originally posted by Kittykat22:
                              Hi everyone,
                              very helpful responses so far! Retrieving the gene names works great for me, but unfortunately just for differentially expressed genes. If I want to look at splicing or isoforms I cannot get it to work. I just get a list of identifiers or, if I try with the above path and use "Isoforms" or "Splicing" instead of "Genes" the list is empty.
                              I am very new to all these things so not sure what I am doing wrong or what I have to do to determine the significantly differentially expressed isoforms, splicing, promoters,...and get a list including the gene names.
                              Will be very grateful for any help!
                              Cheers,
                              K

                              Hi Kittykat22,
                              This is something that I'm actively working on including in a future release of cummeRbund. Right now, it's very easy to retrieve all information for a particular gene, however, several people have asked for a 'getFeatures()' method (similar to getGenes()) that would retrieve just the information you are looking for. I will try to post to this thread when I have it working, but also, please keep checking the website for an update.

                              As an alternative, you can get the significantly different isoforms list by using getSig (level='isoforms'). And you can retrieve ALL isoform annotation by using 'features(isoforms(cuff))' and/or 'fpkm(isoforms(cuff))'. You should be able to filter those data.frames using the list generated from the call to getSig().

                              Cheers,
                              Loyal
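
The filtering step described above can be sketched as follows, again with mock data so it runs without cummeRbund. `sig_ids` stands in for the result of getSig(cuff, level='isoforms') and is assumed here to be a character vector of TCONS ids. `iso_annot` stands in for features(isoforms(cuff)). Its column names and all ids and gene names are made up for illustration.

```r
# Mock stand-in for getSig(cuff, level = "isoforms", alpha = 0.05);
# assumed to be a character vector of isoform ids
sig_ids <- c("TCONS_00000002", "TCONS_00000005")

# Mock stand-in for features(isoforms(cuff)); column names are assumed
iso_annot <- data.frame(
  isoform_id      = sprintf("TCONS_%08d", 1:5),
  gene_id         = c("XLOC_000001", "XLOC_000001", "XLOC_000002",
                      "XLOC_000003", "XLOC_000003"),
  gene_short_name = c("GeneA", "GeneA", "GeneB", "GeneC", "GeneC"),
  stringsAsFactors = FALSE
)

# Keep only the annotation rows whose isoform id is in the significant list
sig_iso <- iso_annot[iso_annot$isoform_id %in% sig_ids, ]

print(sig_iso)
```

The same `%in%` filter should apply to the data.frame returned by fpkm(isoforms(cuff)), provided it carries the isoform id in a column.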
