Announcement

Collapse
No announcement yet.

DEGseq

Collapse
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • #31
    Thanks Likun,

    I was confused by the q-value. It all makes sense now.

    - Nick

    Comment


    • #32
      Hi Xi

      Thanks for putting your work in the open source. I am quite happy to see many different methods at my hand to play around with numbers.

      I think I have some repeated questions that have been asked on this thread in some way or the other. I may be getting down to very basics and hope that will help some others also.

      As a pilot DEgseq usage run, I used the LRT method to find expression values for two samples(control/mutant). I am taking one specific gene count example to ask my questions. The input was bed format, alignment was done using TopHat.

      HTML Code:
      Control
      
      Gene    Count      RPKM         total no. reads      gene length
      ABAT    811         1.57049      87082676          5930
      
      Mutant
      ABAT     4264      10.3046        69779755           5930
      While I understand if we ignore the isoforms of the same gene, normalizing for the gene len wont make a difference. I am not sure why we should not normalize for the difference in reads counts. Also I am not able to generate the RPKM result using the read count /gene length. Could you please explain.

      1. why we are not accounting for diff read numbers in the output_score.txt
      2. How is the RPKM being calculated here. Somehow I am not able to find the same number shown in the example taken above.

      Thanks!
      -Abhi

      Comment


      • #33
        Hi Abhi,

        Thanks for your questions.

        Originally posted by apratap View Post
        While I understand if we ignore the isoforms of the same gene, normalizing for the gene len wont make a difference. I am not sure why we should not normalize for the difference in reads counts. Also I am not able to generate the RPKM result using the read count /gene length. Could you please explain.
        Yes, if the isoforms between samples are changed, we should count all the reads mapping to those isoforms and then normalize against the isoform lengths, respectively. However, our DEGseq package is highly dependent on the annotated gene model, and is not able to infer (novel) isoforms with isoform expression levels at the current stage. We count the read number for each gene only the read mapped to gene exons based on gene annotation. In other words, we only take the reads from the same region with the same length for comparison.

        Originally posted by apratap View Post
        1. why we are not accounting for diff read numbers in the output_score.txt
        Could you please show me the line for the gene you mentioned in the output_score.txt and what command you used to generate the output_score.txt? Thanks.
        Originally posted by apratap View Post
        2. How is the RPKM being calculated here. Somehow I am not able to find the same number shown in the example taken above.
        We calculate RPKM following its definition: reads per kilo-base perl million reads. Take your control data for example: RPKM = 811 / 87082676 / 5930 * 1e+9 = 1.57049.

        Hope this helps. If further questions, just let me know. I am glad to solve all the problems.

        Best,
        Xi
        Xi Wang

        Comment


        • #34
          DEGseq err

          Hi Xi,

          I'm trying to use DEGseq and my RNA-seq reads have already mapped. The results are the "eland" format. Does DEGseq support this kind of format?

          I just follow these commands as the manual metioned:

          library(DEGseq)
          sample1 <- system.file("data","D:/data/sample1.txt",package = "DEGseq")
          sample2 <- system.file("data","D:/data/sample2.txt",package = "DEGseq")
          refFlat <- system.file("data","D:/data/refFlat.txt",package = "DEGseq")
          mapResultBatch1 <- c(sample1)
          mapResultBatch2 <- c(sample2)
          outputDir <- file.path(tempdir(),"D:/data/DEGseqExample")
          DEGseq(mapResultBatch1, mapResultBatch2, fileFormat = "eland",refFlat = refFlat, outputDir = outputDir, method = "LRT")

          However, it reported:

          Loading required package: qvalue
          Loading required package: tcltk
          Loading Tcl/Tk interface ... done
          Loading required package: samr
          Loading required package: impute
          Error in file(file, "r") : cannot open the connection
          Calls: DEGseq -> DEGexp -> read.table -> file
          In addition: Warning message:
          In file(file, "r") :
          cannot open file 'D:/data/DEGseqExample/group1.exp': No such file or directory
          Execution halted


          Is my CMD all right? Any kind of suggetion will be appreciated.
          Thanks.

          lix
          Last edited by lix; 01-28-2010, 04:55 AM.

          Comment


          • #35
            Sorry for the late reply. I was just back home.

            You can try the following script. If the problem remains, just let me know.

            Code:
            library(DEGseq)
            sample1 <- "D:/data/sample1.txt"
            sample2 <- "D:/data/sample2.txt"
            refFlat <- "D:/data/refFlat.txt"
            mapResultBatch1 <- c(sample1)
            mapResultBatch2 <- c(sample2)
            outputDir <- "D:/data/DEGseqExample"
            DEGseq(mapResultBatch1, mapResultBatch2, fileFormat = "eland",refFlat = refFlat, outputDir = outputDir, method = "LRT")
            if you suspect the eland format problem, show a few lines of your eland file. thanks.
            Xi Wang

            Comment


            • #36
              Hi Xi,

              Thank you for your so kind reply.

              My mapped reads are the "eland" format like this:
              26 CCTTTCCACATCTTTCTCCCTCGCT U1 0 1 1 chr12 81865484 R

              So, my data should convert to the "eland" format that DEGseq supports like this:
              26 CCTTTCCACATCTTTCTCCCTCGCT 81865484 U1 R

              I'm just wondering whether my conversion was right.

              BTW, after I used the getGeneExp() function, if all of the RPKM values in the expression value files are "0", does it mean that the DEGexp() will fail to read the expCol1 or expCol2 value?
              And, is there any difference between the "valCol" in readGeneExp() and the "expCol" in DEGexp()?

              Thanks.

              lix

              Comment


              • #37
                Hi lix,

                Originally posted by lix View Post
                My mapped reads are the "eland" format like this:
                26 CCTTTCCACATCTTTCTCCCTCGCT U1 0 1 1 chr12 81865484 R

                So, my data should convert to the "eland" format that DEGseq supports like this:
                26 CCTTTCCACATCTTTCTCCCTCGCT 81865484 U1 R

                I'm just wondering whether my conversion was right.
                I am wondering how you convert the format. If you used a script to implement the conversion, you can check the result after conversion directly. Certainly, you need to make this step work.

                BTW, after I used the getGeneExp() function, if all of the RPKM values in the expression value files are "0", does it mean that the DEGexp() will fail to read the expCol1 or expCol2 value?
                Sure, even if DEGexp() successfully reads the values, the values are all equal to 0.

                And, is there any difference between the "valCol" in readGeneExp() and the "expCol" in DEGexp()?
                You can take they the same. But "valCol" could be any col while "expCol" should only be expression cols.
                Xi Wang

                Comment


                • #38
                  Hi,

                  I was wondering what density is a measure of in the first output graph of DEGseq,

                  thanks,
                  Amy

                  Comment


                  • #39
                    Originally posted by AmyL View Post
                    Hi,

                    I was wondering what density is a measure of in the first output graph of DEGseq,

                    thanks,
                    Amy
                    The plot is generated by:

                    Code:
                    hist(LogVal(Sample1),main=label1,xlab="log2(Number of reads mapped to a gene)",col=4,breaks=100,freq=FALSE,ylim=c(0,0.5))
                    Using "freq=FALSE" means, component density are plotted, so that the histogram has a total area of one: sum(density * bin_width) = 1
                    Xi Wang

                    Comment


                    • #40
                      FET method

                      Hi everybody,

                      I'm using DEGseq to identify gene differentially expressed genes from expression values that I already have.

                      I would like to know how many time does it takes to run the DEGexp function with FET method. Because I recieve the result for the LRT and MARS method in a few minutes and for the FET method I let it run more than one night and it was still running. Is it normal?

                      I have an other question concerning the expression value. For the moment I calculate these values like the sum of reads on each base of a gene and not the number of reads mapped on the gene and next I transform these values in RPKM. Do you think that it will change anything in the differentially expressed genes analysis? What do you use to calculate thiss expression values?

                      Thanks for you help

                      Maria

                      Comment


                      • #41
                        Hi, Maria

                        Originally posted by maria.b View Post
                        Hi everybody,
                        I'm using DEGseq to identify gene differentially expressed genes from expression values that I already have.
                        Thanks for using DEGseq.

                        Originally posted by maria.b View Post
                        I would like to know how many time does it takes to run the DEGexp function with FET method. Because I recieve the result for the LRT and MARS method in a few minutes and for the FET method I let it run more than one night and it was still running. Is it normal?
                        What is you data size? I don't think it is normal primarily. But we need to confirm what caused this time consuming problem.

                        Originally posted by maria.b View Post
                        I have an other question concerning the expression value. For the moment I calculate these values like the sum of reads on each base of a gene and not the number of reads mapped on the gene and next I transform these values in RPKM. Do you think that it will change anything in the differentially expressed genes analysis? What do you use to calculate thiss expression values?
                        The values by your means roughly equals to (read count) * (read length) * 10⁹ / (gene length) / (total reads) = RPKM * (read length)
                        It is ok to use you method, but when counting RPKM, you need divide the values by the read length, further.
                        Last edited by Xi Wang; 02-16-2010, 06:54 PM. Reason: a typo corrected
                        Xi Wang

                        Comment


                        • #42
                          Ok thanks for your reply,

                          I have three values per gene:
                          - the sum of read on each base (count)
                          - the average coverage on each base(mean)
                          - the expression value in RPKM (rpkm) (rpkm = count * 10⁹ /length*nbreads) (maybe this is false, i will change my calculation of count to have the number of reads mapped per gene, but it's an other problem)

                          I have 13661 genes and three sample in 2 replicats
                          Here you have the value min and max for each replicat and for each type of expression value (I don'tif it's important)
                          count values :
                          's1_1': [0, 5983478],
                          's1_2': [0, 17697854],
                          's2_2': [0, 14879008],
                          's2_1': [0, 14369451],
                          's3_2': [0, 11717714],
                          's3_1': [0, 11696411]

                          mean_values:
                          's1_1': [0.0, 5942.0],
                          's1_2': [0.0, 65791.0],
                          's2_2': [0.0, 14776.0],
                          's2_1': [0.0, 14270.0],
                          's3_2': [0.0, 11636.0],
                          's3_1': [0.0, 11615.0]

                          rpkm values:
                          's1_1': [0.0, 1075393.0],
                          's1_2': [0.0, 4577530.0],
                          's2_2': [0.0, 1072475.0],
                          's2_1': [0.0, 1145468.0],
                          's3_2': [0.0, 1064802.0],
                          's3_1': [0.0, 869385.0]

                          I want to run the FET method to compare S1 and S2, S1 and S3 using the different expression value type.
                          Is the command DEGexp() different when we use the method MARS than when we use the method FET?

                          My output looks like this (exemple comparing s1 and s2):

                          #############analyse differentielle, methode FET, s2 vs s1, count#############
                          Please wait...

                          geneExpFile1: fileEXPR
                          gene id column in geneExpFile1: 1
                          expression value column(s) in geneExpFile1: 9 10
                          total number of reads uniquely mapped to genome obtained from sample1: 468022379 515938474

                          geneExpFile2: fileEXPR
                          gene id column in geneExpFile2: 1
                          expression value column(s) in geneExpFile2: 7 8
                          total number of reads uniquely mapped to genome obtained from sample2: 204283936 231306719

                          method to identify differentially expressed genes: FET
                          pValue threshold: 0.001
                          output directory: out

                          Please wait ...
                          Identifying differentially expressed genes ...
                          Please wait patiently ...

                          and it never ends.

                          Thanks for your help.

                          Maria
                          Last edited by maria.b; 02-16-2010, 08:56 AM.

                          Comment


                          • #43
                            Hi, Maria

                            Maybe the figures are too large for DEGseq to calculate. But I need to check if it is the reason, or it's a bug of DEGseq.

                            From how we model the RNA-seq data, we strongly recommend you use the read counts (instead of the sum of read counts on every base) as the gene expression level estimate. You can just simply try DEGseq function in the package.

                            Thanks,
                            Xi Wang

                            Comment


                            • #44
                              Hello,

                              Finally it ends during this night.(maybe I don't wait enough patiently) I will recalculate the expression values like you said, and I will tell you if it work's better.

                              Thanks for your advices

                              Maria
                              Last edited by maria.b; 02-17-2010, 01:01 AM.

                              Comment


                              • #45
                                Hi,

                                I am trying to use samWrapper to analyze my RNA-seq data on Mac OS X.
                                Is there a simple way to specify the path to the files?

                                I noticed that you can use the following on Windows:
                                >geneExpFile <- "D:/data/sample1.txt"

                                Thanks!

                                Comment

                                Working...
                                X