Announcement

Collapse
No announcement yet.

DEGseq

Collapse
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • #46
    Originally posted by m!x View Post
    Hi,

    I am trying to use samWrapper to analyze my RNA-seq data on Mac OS X.
    Is there a simple way to specify the path to the files?

    I noticed that you can use the following on Windows:
    >geneExpFile <- "D:/data/sample1.txt"

    Thanks!
    It's not very difficult. For example:
    Code:
    >geneExpFile <- "/PATH/TO/FILE"
    but you should know where you file is. maybe "pwd" command can help you.
    Xi Wang

    Comment


    • #47
      Hi,
      I wanted to use DEGseq to identify differentially expressed genes between wildtype and mutant experiments. I have 4 datasets for each condition (WT,mutant) : 2 biological replicates and for each of them 2 technical replicates.

      First I tried the method "CTR" to check the variation between the replicates, but I couldn't find a code example for this method in package material. On the last output plot produced by the method "CTR" one can see the difference between the standard deviation of M according to the RSM and the theoretical four-fold local standard deviation of M by the comparison of technical replicates. But what does it mean when there is a distance between these to lines (read and blue)? Can I use the method "MATR" which is based on technical replicates anyway?

      Because I have these 4 datasets per condition, I wanted to uses them in the correct way, not simple adding the raw counts of each gene. Which method would you propose in this case and how should the correct code of the function DEGexp(..?..)
      look like?

      Many thanks for your help!
      Steffen

      Comment


      • #48
        Hi Steffen,

        Thanks for using DEGseq.

        First I tried the method "CTR" to check the variation between the replicates, but I couldn't find a code example for this method in package material.
        Actually, the "CTR" expample code is similar to the example for DEGexp function. The only modification is to specify: method="CTR". So, a code example is:
        Code:
        DEGexp(geneExpFile1=geneExpFile, geneCol1=1, expCol1=2, groupLabel1="R1", geneExpFile2=geneExpFile, geneCol2=1, expCol2=3, groupLabel2="R2",
        method="CTR", outputDir=outputDir)
        Note: geneExpFile contains the expression values for the two replicates, where gene names are listed in column 1, expresssion values for replicate 1listed in column 2, and expresssion values for replicate 2 listed in column 3.


        On the last output plot produced by the method "CTR" one can see the difference between the standard deviation of M according to the RSM and the theoretical four-fold local standard deviation of M by the comparison of technical replicates. But what does it mean when there is a distance between these to lines (read and blue)? Can I use the method "MATR" which is based on technical replicates anyway?
        This phenomenon means the two replicates do not match well. Yes, you can use MATR method anyway. Besides, you may also use other methods to get the corresponding results. An extra validation step should be done (if feasible) and then you can jude which method is better.

        Because I have these 4 datasets per condition, I wanted to uses them in the correct way, not simple adding the raw counts of each gene. Which method would you propose in this case and how should the correct code of the function DEGexp(..?..)
        look like?
        There is another function "samWrapper" between two samples with biological replicates. You can try this on your 4 datasets. But, theoretically, the technical replicates cannot be treated as biological replicates.
        Xi Wang

        Comment


        • #49
          Originally posted by svl View Post
          I seem to be unable to install the package...anyone had succes?
          ----
          source("http://bioconductor.org/biocLite.R")
          biocLite("DEGseq")
          ----

          Also their site is unavailable: http://bioinfo.au.tsinghua.edu.cn/software/degseq
          Here is the page at Bioc: http://www.bioconductor.org/packages...ml/DEGseq.html
          Thanks for your information! It`s usefull!
          Harbin Institute of Technology

          Comment


          • #50
            I successfully ran the SamWrapper, but would appreciate further explanation about the output columns.

            What do the "numerator" and "denominator" columns represent?
            How do I tell whether the genes are upregulated or downregulated?
            eg. if I set the min. foldchange = 2, will this include the genes that are downregulated by 2 times?

            I also got different result when I used "filtered" data as an input.
            First, I used "getGeneExp" to get the raw read counts for each gene.
            From the output, I removed all genes that have less than 2 reads. Then, I filtered for the genes that are common between the biological replicates for each sample.
            I used this filtered set as the input for samWrapper.
            Is this a valid approach?
            Or, should I have used the unfiltered set?

            Thank you!

            Comment


            • #51
              Originally posted by m!x View Post
              I successfully ran the SamWrapper, but would appreciate further explanation about the output columns.

              What do the "numerator" and "denominator" columns represent?
              How do I tell whether the genes are upregulated or downregulated?
              eg. if I set the min. foldchange = 2, will this include the genes that are downregulated by 2 times?

              I also got different result when I used "filtered" data as an input.
              First, I used "getGeneExp" to get the raw read counts for each gene.
              From the output, I removed all genes that have less than 2 reads. Then, I filtered for the genes that are common between the biological replicates for each sample.
              I used this filtered set as the input for samWrapper.
              Is this a valid approach?
              Or, should I have used the unfiltered set?

              Thank you!
              Hi,

              For samWrapper function, the output file contains some columns related to the T statistic, such as score(d) for the T-statistic value, numerator(r) for the numerator of the T-statistic, and denominator(s + s0) for the denominator of the T-statistic. For more details, please find in the Section 12.2 of sam manual.
              http://www-stat.stanford.edu/~tibs/SAM/sam.pdf

              The Signature column indicates each gene is differentially expressed or not. If it is a "TURE", the gene can be either upregulated or downregulated. The default foldchange = 2 including both cases. I.e., including foldchange > 2 and foldchange < 0.5. You validate it easily by comparing the two columns "Fold Change" and "Signature".

              What do you mean by "the genes that are common"? I think it si no need to do this filtering. Also, you can compare the results, and pick up the difference (genes appear in either results), and analyze what cause the difference. If you could show me an example, I may give you some clue. Thanks.
              Xi Wang

              Comment


              • #52
                Hi Xi,

                Thanks for the explanation about the foldchange.

                The weird thing about my data is that all the downregulated genes has at least q-value of 0.5, so they are counted as "FALSE" in the signature column. I set the max qValue = 0.1.

                When I just tried running the data with DEGexp and MARS method (setting p value = 0.001), I get the genes that are downregulated to have very low p-value and q-values.
                I believe the calculation of the q-values in samWrapper and DEGexp is different. But, how can it be so different for the downregulated genes in my dataset?

                I understand that it's best to use samWrapper for biological replicates (as in my case).
                If I may know, what's the fundamental difference between using DEGexp and samWrapper for biological replicates?
                If the fold-change in samWrapper is not calculated by summing the raw read counts, how then the comparison is made?

                Below are the lines that I used for the samWrapper analysis (3 biological replicates and unpaired data).

                > geneExpFile <- "/Users/lilyanamargaretha/Desktop/ICMgene_filtered.exp"
                > set.seed(100)
                > geneExpFile1 <- geneExpFile
                > geneExpFile2 <- geneExpFile
                > output <- "/Users/lilyanamargaretha/Desktop/samWrapperOutICM.txt"
                > expCol1 = c(2, 5, 8)
                > expCol2 = c(11, 14, 17)
                > measure1 = rep(1, length(expCol1))
                > measure2 = rep(2, length(expCol2))

                > samWrapper(geneExpFile1 = geneExpFile1, geneCol1 = 1, expCol1 = expCol1, measure1 = measure1, geneExpFile2 = geneExpFile2, geneCol2 = 1, expCol2 = expCol2, measure2 = measure2, nperms = 100, min.foldchange = 2, max.qValue = 0.1, logged2=FALSE, output = output, paired = FALSE)



                Thanks for the answering the questions.

                Comment


                • #53
                  Originally posted by m!x View Post
                  Hi Xi,

                  Thanks for the explanation about the foldchange.

                  The weird thing about my data is that all the downregulated genes has at least q-value of 0.5, so they are counted as "FALSE" in the signature column. I set the max qValue = 0.1.

                  When I just tried running the data with DEGexp and MARS method (setting p value = 0.001), I get the genes that are downregulated to have very low p-value and q-values.
                  I believe the calculation of the q-values in samWrapper and DEGexp is different. But, how can it be so different for the downregulated genes in my dataset?

                  I understand that it's best to use samWrapper for biological replicates (as in my case).
                  If I may know, what's the fundamental difference between using DEGexp and samWrapper for biological replicates?
                  If the fold-change in samWrapper is not calculated by summing the raw read counts, how then the comparison is made?

                  Below are the lines that I used for the samWrapper analysis (3 biological replicates and unpaired data).

                  > geneExpFile <- "/Users/lilyanamargaretha/Desktop/ICMgene_filtered.exp"
                  > set.seed(100)
                  > geneExpFile1 <- geneExpFile
                  > geneExpFile2 <- geneExpFile
                  > output <- "/Users/lilyanamargaretha/Desktop/samWrapperOutICM.txt"
                  > expCol1 = c(2, 5, 8)
                  > expCol2 = c(11, 14, 17)
                  > measure1 = rep(1, length(expCol1))
                  > measure2 = rep(2, length(expCol2))

                  > samWrapper(geneExpFile1 = geneExpFile1, geneCol1 = 1, expCol1 = expCol1, measure1 = measure1, geneExpFile2 = geneExpFile2, geneCol2 = 1, expCol2 = expCol2, measure2 = measure2, nperms = 100, min.foldchange = 2, max.qValue = 0.1, logged2=FALSE, output = output, paired = FALSE)



                  Thanks for the answering the questions.
                  Hi,

                  Sorry for the late reply.

                  1. By downregulated genes, did you mean the genes have "Fold Change“ value less than 0.5?

                  2. In samWrapper, the fold-change is calculated based on the average expression values from biological replicates. And with other statistical metric to determine a gene is differentially expressed or not. If you are interested in the methods, the SAM paper helps.
                  http://www.ncbi.nlm.nih.gov/pubmed/11309499

                  Many thanks for pointing it out.
                  Last edited by Xi Wang; 03-18-2010, 01:03 AM. Reason: No bug found
                  Xi Wang

                  Comment


                  • #54
                    hi m!x,

                    We investigated the results given by samWrapper on our test data and as well samWrapper code carefully, and don't find a bug. I said we found some similar results days ago, but the results were not that extreme as yours. We just set the q-value cutoff too stringent. In your case, as you only have 3 biological replicates, the q-value obtained by permutation could not have a high resolution.

                    However, I still felt weird about your results. I noticed that you used the raw read counts as the input for samWrapper. This maybe the reson. Read counts are not normalized against the sequencing depth. If the sequencing depth is quite different from the two samples, say the total read counts of sample B is as twice as that of sample A, the results could be lots of genes have a higher expression level in sample B than sample A. We included a normalization step in function DEGexp, so the results given by DEGexp are normal. I think the difference between the two results doesn't have much to do with the calculation of q-values.

                    I suggest you use RPKM to rerun samWrapper. It will give more reasonable and promising results.

                    Hope this helps.
                    Xi Wang

                    Comment


                    • #55
                      Hi Xi,

                      Thank you for the reply. Yes, you are right about the sequencing depth.
                      They can vary up to 2 times in my biological replicates.

                      Just to make sure I will do the right thing...here's what I should do with my samples:
                      1. Find the RPKM values generated in the getGeneExp function
                      2. Use the DEGexp with MARS method and set the p-value cutoff.

                      As you've said before about using the SAMWrapper, I assumed then the permutation method included in the SAMWrapper method only works well for the data that have more biological replicates?

                      Again, the same question as I asked before: for this DEGExp method, how is the fold-change calculated here?

                      Thanks a lot once again!

                      Comment


                      • #56
                        Originally posted by m!x View Post
                        Hi Xi,

                        Thank you for the reply. Yes, you are right about the sequencing depth.
                        They can vary up to 2 times in my biological replicates.

                        Just to make sure I will do the right thing...here's what I should do with my samples:
                        1. Find the RPKM values generated in the getGeneExp function
                        2. Use the DEGexp with MARS method and set the p-value cutoff.

                        As you've said before about using the SAMWrapper, I assumed then the permutation method included in the SAMWrapper method only works well for the data that have more biological replicates?

                        Again, the same question as I asked before: for this DEGExp method, how is the fold-change calculated here?

                        Thanks a lot once again!
                        Hi,

                        First, I need to clarify here that (1) for DEGseq or DEGexp function, it is recommended to take the raw read counts as gene expression values; but (2) for samWrapper, it is however only RPKM can be used, as raw read counts don't normalize against the sequencing depth.


                        Actually, for DEGseq and GEGexp functions, there are two types of fold-change values in the recent version of DEGseq. The output are both log2 values. The one labeled by "log2(Fold_change)" is calculated as follows:
                        Code:
                        log2 ( EXP_G_A / EXP_G_B )
                        The other one labeled by "log2(Fold_change) normalized" is defined as:
                        Code:
                        log2 [ (EXP_G_A/T.EXP_A) / (EXP_G_B/T.EXP_B) ]
                        where EXP_G_A stands the expression value for gene G in sample A, T.EXP_A stands the total expression values (sum up the expression values for all genes) in sample A; similar for sample B.
                        The two types of fold-change values are different from each other in whether or not normalizing against the total expression values. "log2(Fold_change)" is suitable for RPKM expression values, while "log2(Fold_change) normalized" for raw read counts.

                        It is true that the q-value estimate is not that precise if the number of biological replicates is small. But this method still works. If you have more repliactes, the estimated q-vaules are more reliable.
                        Last edited by Xi Wang; 03-21-2010, 01:20 AM. Reason: Change the description for fold-change for DEGseq and DEGexp functions.
                        Xi Wang

                        Comment


                        • #57
                          unused arguments error (DEGseq)

                          I tried DEGexp in R with example data, but got error "Error in DEGexp(geneExpMatrix1 = geneExpMatrix1, geneCol1 = 1, expCol1 = c(2, :
                          unused argument(s) (geneExpMatrix1 = geneExpMatrix1, geneExpMatrix2 = geneExpMatrix2)
                          "

                          The CMD

                          > geneExpFile <- "/home/Bio1/MMi/RNASeq/Example1000.txt"
                          > geneExpMatrix2 <- readGeneExp (file=geneExpFile, geneCol=1,valCol =c(7,9,10,12))
                          > geneExpMatrix1 <- readGeneExp (file=geneExpFile, geneCol=1,valCol =c(6,8,11,14))
                          > DEGexp(geneExpMatrix1=geneExpMatrix1, geneCol1=1, expCol1=c(2,3,4,5), groupLabel1="WT_uni", geneExpMatrix2=geneExpMatrix2, geneCol2=1, expCol2=c(2,3,4,5), groupLabel2="TR_uni",method= "MARS")
                          Error in DEGexp(geneExpMatrix1 = geneExpMatrix1, geneCol1 = 1, expCol1 = c(2, :
                          unused argument(s) (geneExpMatrix1 = geneExpMatrix1, geneExpMatrix2 = geneExpMatrix2)
                          ################

                          Did anyone face the same problem?

                          > sessionInfo()
                          R version 2.10.1 (2009-12-14)
                          x86_64-unknown-linux-gnu

                          locale:
                          [1] LC_CTYPE=en_GB.UTF-8 LC_NUMERIC=C
                          [3] LC_TIME=en_GB.UTF-8 LC_COLLATE=en_GB.UTF-8
                          [5] LC_MONETARY=C LC_MESSAGES=en_GB.UTF-8
                          [7] LC_PAPER=en_GB.UTF-8 LC_NAME=C
                          [9] LC_ADDRESS=C LC_TELEPHONE=C
                          [11] LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C

                          attached base packages:
                          [1] tcltk stats graphics grDevices utils datasets methods
                          [8] base

                          other attached packages:
                          [1] DEGseq_1.0.0 samr_1.28 impute_1.20.0 qvalue_1.20.0


                          cheers

                          Comment


                          • #58
                            Originally posted by middlemale View Post
                            I tried DEGexp in R with example data, but got error "Error in DEGexp(geneExpMatrix1 = geneExpMatrix1, geneCol1 = 1, expCol1 = c(2, :
                            unused argument(s) (geneExpMatrix1 = geneExpMatrix1, geneExpMatrix2 = geneExpMatrix2)
                            "

                            The CMD

                            > geneExpFile <- "/home/Bio1/MMi/RNASeq/Example1000.txt"
                            > geneExpMatrix2 <- readGeneExp (file=geneExpFile, geneCol=1,valCol =c(7,9,10,12))
                            > geneExpMatrix1 <- readGeneExp (file=geneExpFile, geneCol=1,valCol =c(6,8,11,14))
                            > DEGexp(geneExpMatrix1=geneExpMatrix1, geneCol1=1, expCol1=c(2,3,4,5), groupLabel1="WT_uni", geneExpMatrix2=geneExpMatrix2, geneCol2=1, expCol2=c(2,3,4,5), groupLabel2="TR_uni",method= "MARS")
                            Error in DEGexp(geneExpMatrix1 = geneExpMatrix1, geneCol1 = 1, expCol1 = c(2, :
                            unused argument(s) (geneExpMatrix1 = geneExpMatrix1, geneExpMatrix2 = geneExpMatrix2)
                            ################

                            Did anyone face the same problem?

                            > sessionInfo()
                            R version 2.10.1 (2009-12-14)
                            x86_64-unknown-linux-gnu

                            locale:
                            [1] LC_CTYPE=en_GB.UTF-8 LC_NUMERIC=C
                            [3] LC_TIME=en_GB.UTF-8 LC_COLLATE=en_GB.UTF-8
                            [5] LC_MONETARY=C LC_MESSAGES=en_GB.UTF-8
                            [7] LC_PAPER=en_GB.UTF-8 LC_NAME=C
                            [9] LC_ADDRESS=C LC_TELEPHONE=C
                            [11] LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C

                            attached base packages:
                            [1] tcltk stats graphics grDevices utils datasets methods
                            [8] base

                            other attached packages:
                            [1] DEGseq_1.0.0 samr_1.28 impute_1.20.0 qvalue_1.20.0


                            cheers
                            Hi,
                            Thanks for using DEGseq for RNA-seq differential expression analysis.
                            I would like to notice you that DEGseq has a newer release with version number 1.0.5. You can download it through
                            http://www.bioconductor.org/packages...ml/DEGseq.html

                            The arguments are changed a lot for the using convenience. Please refer to the newest manual for details.

                            Cheers.
                            Xi Wang

                            Comment


                            • #59
                              thanks xi, I updated R (2.10.1) and DEGseq now , but the errors are the same
                              "Error in DEGexp(geneExpMatrix1 = geneExpMatrix1, geneCol1 = 1, expCol1 = c(2, :
                              unused argument(s) (geneExpMatrix1 = geneExpMatrix1, geneExpMatrix2 = geneExpMatrix2)".

                              another point is example file contains column "ExternalID" would introduce error
                              "Error in scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings, :
                              line 5 did not have 20 elements " . after deleting this column, that is fine.

                              any further suggestion?

                              Comment


                              • #60
                                Hi middlemale,

                                Did you installed the release version of DEGseq follow this link http://www.bioconductor.org/packages...ml/DEGseq.html ?

                                If so, please also use the examples in the manual of the release version, which can be downloaded at:
                                http://www.bioconductor.org/packages...man/DEGseq.pdf

                                Hope this helps. Further questions, let me know.
                                Xi Wang

                                Comment

                                Working...
                                X