Announcement

Collapse

Welcome to the New Seqanswers!

Welcome to the new Seqanswers! We'd love your feedback, please post any you have to this topic: New Seqanswers Feedback.
See more
See less

CummeRbund questions

Collapse
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • CummeRbund questions

    Hi all.
    I'm new to R and really enjoying the cummerbund package. I have three questions regarding the graphs cummerbund produces.

    1) Even though I have several biological replicates, my error bars in the expression plots for genes are huge. When looking at a specific gene, having run cuffdiff with 4+4 samples, the error bars are really long, all the way to the ground and high up. Why is this and why is there no difference when I have several replicates as opposed to none? Shouldn't the error bars decrease? How do I edit the graph so the error bars accurately can display the variance between replicates?

    2) I cannot find the argument for y limits in the csVolcano plot. I have used the argument: xlimits = c(-10, 10) and it works perfectly but what is the equivalent for y? I have tried ylimits, ymin, ymax and nothing works.

    3) When testing 3 conditions against each other, I get different number of differential expressed genes depending on if I use the getSig() function or counting the number of rows as they did in the Protocol paper. Which one is more correct?

    Code:
    diffGeneIDs <- getSig(cuff, level="genes", alpha=0.05)
    gives 315 genes, while
    Code:
    gene_diff_data <- diffData(genes(cuff))
    sig_gene_data <- subset(gene_diff_data, (significant == 'yes'))
    nrow(sig_gene_data)
    gives 372 genes.

    Thank you for any help!

  • #2
    For 3, the getSig way is the preferred method now. You can set your own significance, while the old method had cuffdiff do it.

    Loyal explains it here:
    http://seqanswers.com/forums/showthr...ghlight=getSig

    Following up on this, does CummeRbund account for the selection bias the way GoSeq does?

    Comment


    • #3
      [QUOTE=glados;70775]Hi all.
      I'm new to R and really enjoying the cummerbund package. I have three questions regarding the graphs cummerbund produces.

      1) Even though I have several biological replicates, my error bars in the expression plots for genes are huge. When looking at a specific gene, having run cuffdiff with 4+4 samples, the error bars are really long, all the way to the ground and high up. Why is this and why is there no difference when I have several replicates as opposed to none? Shouldn't the error bars decrease? How do I edit the graph so the error bars accurately can display the variance between replicates?
      [\QUOTE]
      This is very dependent on the depth of your library and the sequencing quality in general. What sort of libraries are you using and how deep?

      [QUOTE=glados;70775]
      2) I cannot find the argument for y limits in the csVolcano plot. I have used the argument: xlimits = c(-10, 10) and it works perfectly but what is the equivalent for y? I have tried ylimits, ymin, ymax and nothing works.
      [\QUOTE]
      This was not included in the arguments. The reason that the x-limits argument is included is because this is a filtering step on the database backend. For the y-axis rescaling try this:

      Code:
      cuff<-readCufflinks()
      v<-csVolcano(genes(cuff),'Sample_A','Sample_B')
      v+ scale_y_continuous(limits = c(0, 20))
      #This is the 'ggplot2' way to control the limits.  This is one of the conveniences of making these plotting methods using ggplot2 is that you have complete control over the returned object.
      Originally posted by glados View Post
      3) When testing 3 conditions against each other, I get different number of differential expressed genes depending on if I use the getSig() function or counting the number of rows as they did in the Protocol paper. Which one is more correct?

      Code:
      diffGeneIDs <- getSig(cuff, level="genes", alpha=0.05)
      gives 315 genes, while
      Code:
      gene_diff_data <- diffData(genes(cuff))
      sig_gene_data <- subset(gene_diff_data, (significant == 'yes'))
      nrow(sig_gene_data)
      gives 372 genes.

      Thank you for any help!
      bill is correct, the 'appropriate' way to do this now is to use the getSig function. This gives you more control and conducts an appropriate multiple-testing-correction depending on the specific comparisons you are making for the differential test.

      Hope this helps!

      Cheers,
      Loyal

      Comment


      • #4
        Thank you Loyal! This is very helpful. My Volcano plot looks much better now.

        Originally posted by lgoff View Post
        This is very dependent on the depth of your library and the sequencing quality in general. What sort of libraries are you using and how deep?
        I think I will have to wait with this question nr 1. After a day of testing and comparing with 2+2 samples and receiving an awful lot more significant genes with fewer samples (blue volcano instead of red), it is apparent that I have made some mistake somewhere and I will try to look into this more.

        Comment


        • #5
          Loyal, if you have time I would appreciate some help. I've been trying to figure out this thing for the last few days. I have looked carefully through every script and everything has been run exactly the same with only different input and output files. I have 2 conditions which I want to see differences between. When I run CuffDiff with 4+4 samples (4 biological replicates in each condition) I get a total of only 140 significant genes with function getSig() in cummerbund. When I decrease the replicates to 3+3 samples I get 11 200 sig. genes. When I decrease it one more step to 2+2 samples I get 20 000 sig. genes. A dramatic difference to say the least. Do you have any idea of why this is?

          Additionally, the volcano plots look very different depending on the run. For the 4+4 run, the "limit" on the y-axis where the dots become blue (significant) is very high, 3.5 (which equals to p much less than 0.001). For the 2+2 run the same value on the y-axis is 1.5. Another 2+2 run I did with different samples gave a volcano plot with y-value of precisely 2. I'm using the same R code for every run (simply: csVolcano(genes(cuff), 'x-value', 'y-value') so I'm not sure why I get so drastically different number of sig. genes and why the volcano plots y-axis significance limit is very different between different runs. I have attached 2 volcano plots if you wonder what I mean that are zoomed in somewhat (with cropped out header), one from 2+2 samples, one from 4+4 samples . edit: every sample has approximately around 15-20 million reads mapped (in pairs). Thank you in advance.
          Attached Files
          Last edited by glados; 04-24-2012, 05:20 AM.

          Comment


          • #6
            Originally posted by glados View Post
            Hi all.
            I'm new to R and really enjoying the cummerbund package. I have three questions regarding the graphs cummerbund produces.

            1) Even though I have several biological replicates, my error bars in the expression plots for genes are huge. When looking at a specific gene, having run cuffdiff with 4+4 samples, the error bars are really long, all the way to the ground and high up. Why is this and why is there no difference when I have several replicates as opposed to none? Shouldn't the error bars decrease? How do I edit the graph so the error bars accurately can display the variance between replicates?
            Have you solved the question 1, I am having the same problem as you have, I have 3 replicates at each condiction, but my error bars is huge. whats the problem?

            Comment


            • #7
              Hi All,
              In general, these remaining questions pertain to cuffdiff and the statistical model that it uses to determine variance, and should probably be asked of the cufflinks people. The error bars plotted by cummeRbund directly reflect the variability of gene:condition as determined by cuffdiff and cannot be changed from within cummeRbund. I would suggest however, that with the newest versions of cuffdiff and cummeRbund, you can expose replicate fpkm values on most plots but using the replicates=T argument where applicable. This should be a good place to start to find out what the distribution of FPKM values for your replicates looks like for a given gene:condition. If the distribution of FPKM values is broad, then the error bars would reflect this variability, if not, then perhaps the cufflinks variance model is a poor fit (and is perhaps reflected in the 'status' field) for a particular gene:condition.

              Cheers,
              Loyal

              Comment


              • #8
                Thank you Loyal. Yes I understand this is more of a cufflinks problem, not so much cummerbund. I still have not solved it though. I get huge variances in cufflinks 2 also even with many replicates. I don't know why. Will update here if I find out.

                Comment


                • #9
                  Boxplot in CummeRbund

                  I got a weird thing with the Boxplot. No box with medians displayed, only dots. I tried with or without replicates= T option, both produced the exactly same image as the attached image.
                  I have three replicates at four time points of a cell line after drug treatment,3+3+3+3. I ran cuffdiff in this way, T0(3 replicates at time point 0) versus T1(3 replicates), T0(3 replicates)versus T2(3 re), T0(3 replicates)versus T3(3 re). I did't use time series option, which seems a better way to do it after I checked on this forum.
                  I use CummeRbund analyze the three comparison seperately, all three boxplots I got are like the attached image without box and medians.
                  Can you tell me where is wrong or I did some mistakes?


                  #########################################################

                  I move this question to cuffdiff thread.
                  Attached Files
                  Last edited by zhuya5607; 06-04-2012, 01:20 AM. Reason: Wrong place to post the question

                  Comment


                  • #10
                    ExpressionBarPlot in cummeRbund

                    BTW, what do the error bars in the expressionBar plot represent for each gene? I don't find any documentation specific to this? Any help?

                    Comment


                    • #11
                      Originally posted by pravee1216 View Post
                      BTW, what do the error bars in the expressionBar plot represent for each gene? I don't find any documentation specific to this? Any help?
                      Based on the Nature Protocol paper I believe it is confidence intervals, but I agree it is not very clear. Also, I don't know if it's possible to change the error bars to represent something else, like standard error, etc.

                      Comment


                      • #12
                        I know this thread has been dead for a while...here goes nothing.

                        I'm relatively new to R and to the fantastic cummeRbund package but I have a question regarding the diffData() and getSig() functions.
                        As stated:
                        3) When testing 3 conditions against each other, I get different number of differential expressed genes depending on if I use the getSig() function or counting the number of rows as they did in the Protocol paper. Which one is more correct?

                        Code:
                        diffGeneIDs <- getSig(cuff, level="genes", alpha=0.05)
                        gives 315 genes, while
                        Code:
                        gene_diff_data <- diffData(genes(cuff))
                        sig_gene_data <- subset(gene_diff_data, (significant == 'yes'))
                        nrow(sig_gene_data)
                        gives 372 genes.
                        Based on the replies, the getSig() function is more accurate due to multiple-testing corrections, etc. However, the getSig() function only returns a vector of the genes rather than a data.frame that includes the log fold change, q-value, FPKM values, etc. (which is very useful and is what diffData() returns). The question I have is if a way exists to get that information from the list of getSig() genes?

                        Thanks,
                        Derik

                        Comment


                        • #13
                          There is a getGenes-Funktion that will give you a subset of the whole database for only those genes. If cuff is you dataset and sigGenes the list you got with getSig use getGenes(cuff, sigGenes).

                          Comment


                          • #14
                            I have similar problem, getGenes(cuff, sigGenes) produced sig_diff_genes but the error bars are very high and FPKM value is very low.
                            In my attached file;
                            1st group has low error bar and high fpkm value clearly indicates diff_gene. but this was not identified by getGenes(cuff, sigGenes). I manually got it.
                            2nd and remaining all have high error bars identified by getGenes(cuff, sigGenes), however, does not seems to be really important.
                            3rd How much the error bar should be ? Can I select diff_genes with high error bars ? How much FPKM should be ?
                            please do reply, if possible.
                            Thank you in advanced.
                            seq.png

                            Comment


                            • #15
                              Heatmap scale

                              Hi guys,

                              I am trying to get a heatmap for all my differentially expressed genes (500 total). However I am experiencing difficulties with this. I tried to manually input all gene codes to R, use the function getGenes and then find the heatmap, but I was hopeless. The problem is with the scale of the heat map. I don't know how to modify the scale and it gives me 3 ranges (from 0 to 0.3, 0.3 to 2 and 2 to anything above that). Most of my genes have FPKM values higher than 2, so my heatmap has only one color all across the map

                              Any idea on how to change the scale of my heatmap?

                              Comment

                              Working...
                              X