DEGseq - SEQanswers

ikremsky replied

04-02-2010, 05:41 PM
I am starting to think the problem has to do with the large file size. I tried simply scanning the bed file into R using function read.delim(), but it took quite about 20 minutes. Accessing parts of the table was also slow.

What is the best way to deal with very large bed files with millions of reads? R seems to be inherently very slow in dealing with this type of file.

Thanks!

Last edited by ikremsky; 04-02-2010, 05:44 PM.
Leave a comment:
ikremsky replied

04-02-2010, 03:26 PM
Originally posted by Xi Wang View Post

Hi,

1. Don't need to convert the files to .TXT, but .TXT could also work.
2. As you also got errors when you run the examples, it seems that the DEGseq was not correctly installed. It is recommended to re-install the DEGseq package, and the release version is better. It can be got via

DEGseq

http://bioconductor.org/packages/release/bioc/html/DEGseq.html

DEGseq is an R package to identify differentially expressed genes from RNA-Seq data.

I'd like to know if DEGseq works well this time.

Good luck!

Ok I reinstalled DEGseq. I was able to run the example, but I am still unable to run it on the files I have produced. I noticed that on the readID section in your sample bed file, that all id's have the '>' symbol before the id, whereas this symbol is not in the bed files I have produced. For example, the entry in your sample file looks like ">CMLIVERKIDNEY_7:1:1:67:406" (without quotes) whereas in the file I've created, they look like "SRR001357.7181352". Could that be the problem? What do those numbers mean on your sample id's?

Thanks!
Leave a comment:
Xi Wang replied

04-02-2010, 07:29 AM
Originally posted by ikremsky View Post

Hi Xi,

Thank you for your quick answer. Unfortunately, it did not solve the problem. I verified the file format, and used the full file addresses, but the problem persists. In addtion, I ruled out any potential compatibility issues, as I tried running it in R version 2.10.0 (actually it is still running, but it has been running for over 6 hours in the same spot it always gets stuck). Actually the same thing happens when I run DEGseq, it never moves past counting the number of reads.

Is there a special way the files should be converted to .txt? Initially the files have .bed extensions, but I simply added the .txt extension to the end to match more closely with your examples. It does the samet thing whether or not I add the .txt at the end actually. Any other ideas what the problem might be? The bed files are the output read files from Rmap. I tried going through the example from your guide, but was getting some error message there as well (although that was coming before being put into the getGeneExp function).

Thanks!

Hi,

1. Don't need to convert the files to .TXT, but .TXT could also work.
2. As you also got errors when you run the examples, it seems that the DEGseq was not correctly installed. It is recommended to re-install the DEGseq package, and the release version is better. It can be got via

DEGseq

http://bioconductor.org/packages/release/bioc/html/DEGseq.html

DEGseq is an R package to identify differentially expressed genes from RNA-Seq data.

I'd like to know if DEGseq works well this time.

Good luck!
Leave a comment:
ikremsky replied

04-02-2010, 12:22 AM
Originally posted by Xi Wang View Post

Thanks for your question.

Usually, it takes several minutes for the function getGeneExp to get a result. It seems weird in your case. Could you please make sure two things? First, I am wondering if the program really found the input files. Using the full path instead of the file name could be helpful. Second, make sure your BED and REFFLAT files following the instruction of file formats in DEGseq manual. BED file format here is a little bit different from its original version. An example of BED format:

Code:

chr12 7 38 readID 2 +

which means: chromosome, start_pos, end_pos, read_ID, mis_matches (actually this column not used), strand.

I don't think there are some compatibility problems here.

Hope this helps. Any further questions, just let me know.

Hi Xi,

Thank you for your quick answer. Unfortunately, it did not solve the problem. I verified the file format, and used the full file addresses, but the problem persists. In addtion, I ruled out any potential compatibility issues, as I tried running it in R version 2.10.0 (actually it is still running, but it has been running for over 6 hours in the same spot it always gets stuck). Actually the same thing happens when I run DEGseq, it never moves past counting the number of reads.

Is there a special way the files should be converted to .txt? Initially the files have .bed extensions, but I simply added the .txt extension to the end to match more closely with your examples. It does the samet thing whether or not I add the .txt at the end actually. Any other ideas what the problem might be? The bed files are the output read files from Rmap. I tried going through the example from your guide, but was getting some error message there as well (although that was coming before being put into the getGeneExp function).

Thanks!
Leave a comment:
Xi Wang replied

04-01-2010, 07:22 PM
Originally posted by ikremsky View Post

Hello,

I am attempting to run the function getGeneExp but am running into problems. After loading the DEGseq package and setting the working directory to the one containing the bed and refFlat files, I type:
>getGeneExp(c("nameofbedfile.bed.txt"), refFlat = "nameofrefFlatfile.txt")

The output is as follows:
SampleFiles:
nameofbedfile.bed.txt
Count the number of reads mapped to each gene.
This will take several minutes.
Please wait...

It then continues indefinitely, as if it is in an infinite loop. I know it is not a memory issue, as I have run this on a large Unix cluster, and I also tried using only a small fraction of the original .bed file, but after up to 12 hours of runtime the job eventually times out and nothing is returned.

I am running the latest version of R, version 2.10.1. I noticed earlier in the thread that you recommend using version 2.10.0, so perhaps there is a compatibility problem with the new version? This is just a passing thought; I would appreciate some advice though.

Thanks!

Thanks for your question.

Usually, it takes several minutes for the function getGeneExp to get a result. It seems weird in your case. Could you please make sure two things? First, I am wondering if the program really found the input files. Using the full path instead of the file name could be helpful. Second, make sure your BED and REFFLAT files following the instruction of file formats in DEGseq manual. BED file format here is a little bit different from its original version. An example of BED format:

Code:

chr12 7 38 readID 2 +

which means: chromosome, start_pos, end_pos, read_ID, mis_matches (actually this column not used), strand.

I don't think there are some compatibility problems here.

Hope this helps. Any further questions, just let me know.
Last edited by Xi Wang; 04-01-2010, 07:27 PM.
Leave a comment:
ikremsky replied

04-01-2010, 04:43 PM
Hello,

I am attempting to run the function getGeneExp but am running into problems. After loading the DEGseq package and setting the working directory to the one containing the bed and refFlat files, I type:
>getGeneExp(c("nameofbedfile.bed.txt"), refFlat = "nameofrefFlatfile.txt")

The output is as follows:
SampleFiles:
nameofbedfile.bed.txt
Count the number of reads mapped to each gene.
This will take several minutes.
Please wait...

It then continues indefinitely, as if it is in an infinite loop. I know it is not a memory issue, as I have run this on a large Unix cluster, and I also tried using only a small fraction of the original .bed file, but after up to 12 hours of runtime the job eventually times out and nothing is returned.

I am running the latest version of R, version 2.10.1. I noticed earlier in the thread that you recommend using version 2.10.0, so perhaps there is a compatibility problem with the new version? This is just a passing thought; I would appreciate some advice though.

Thanks!

Last edited by ikremsky; 04-01-2010, 04:46 PM.
Leave a comment:
Xi Wang replied

03-30-2010, 04:52 AM
Hi middlemale,

Did you installed the release version of DEGseq follow this link http://www.bioconductor.org/packages...ml/DEGseq.html ?

If so, please also use the examples in the manual of the release version, which can be downloaded at:

http://www.bioconductor.org/packages/2.5/bioc/manuals/DEGseq/man/DEGseq.pdf

Hope this helps. Further questions, let me know.
Leave a comment:
middlemale replied

03-30-2010, 02:48 AM
thanks xi, I updated R (2.10.1) and DEGseq now , but the errors are the same
"Error in DEGexp(geneExpMatrix1 = geneExpMatrix1, geneCol1 = 1, expCol1 = c(2, :
unused argument(s) (geneExpMatrix1 = geneExpMatrix1, geneExpMatrix2 = geneExpMatrix2)".

another point is example file contains column "ExternalID" would introduce error
"Error in scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings, :
line 5 did not have 20 elements " . after deleting this column, that is fine.

any further suggestion?
Leave a comment:
Xi Wang replied

03-29-2010, 08:24 AM
Originally posted by middlemale View Post

I tried DEGexp in R with example data, but got error "Error in DEGexp(geneExpMatrix1 = geneExpMatrix1, geneCol1 = 1, expCol1 = c(2, :
unused argument(s) (geneExpMatrix1 = geneExpMatrix1, geneExpMatrix2 = geneExpMatrix2)
"

The CMD

> geneExpFile <- "/home/Bio1/MMi/RNASeq/Example1000.txt"
> geneExpMatrix2 <- readGeneExp (file=geneExpFile, geneCol=1,valCol =c(7,9,10,12))
> geneExpMatrix1 <- readGeneExp (file=geneExpFile, geneCol=1,valCol =c(6,8,11,14))
> DEGexp(geneExpMatrix1=geneExpMatrix1, geneCol1=1, expCol1=c(2,3,4,5), groupLabel1="WT_uni", geneExpMatrix2=geneExpMatrix2, geneCol2=1, expCol2=c(2,3,4,5), groupLabel2="TR_uni",method= "MARS")
Error in DEGexp(geneExpMatrix1 = geneExpMatrix1, geneCol1 = 1, expCol1 = c(2, :
unused argument(s) (geneExpMatrix1 = geneExpMatrix1, geneExpMatrix2 = geneExpMatrix2)
################

Did anyone face the same problem?

> sessionInfo()
R version 2.10.1 (2009-12-14)
x86_64-unknown-linux-gnu

locale:
[1] LC_CTYPE=en_GB.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_GB.UTF-8 LC_COLLATE=en_GB.UTF-8
[5] LC_MONETARY=C LC_MESSAGES=en_GB.UTF-8
[7] LC_PAPER=en_GB.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] tcltk stats graphics grDevices utils datasets methods
[8] base

other attached packages:
[1] DEGseq_1.0.0 samr_1.28 impute_1.20.0 qvalue_1.20.0

cheers

Hi,
Thanks for using DEGseq for RNA-seq differential expression analysis.
I would like to notice you that DEGseq has a newer release with version number 1.0.5. You can download it through

Bioconductor - DEGseq

http://www.bioconductor.org/packages/2.5/bioc/html/DEGseq.html

The arguments are changed a lot for the using convenience. Please refer to the newest manual for details.

Cheers.
Leave a comment:
middlemale replied

03-29-2010, 07:54 AM
unused arguments error (DEGseq)

I tried DEGexp in R with example data, but got error "Error in DEGexp(geneExpMatrix1 = geneExpMatrix1, geneCol1 = 1, expCol1 = c(2, :
unused argument(s) (geneExpMatrix1 = geneExpMatrix1, geneExpMatrix2 = geneExpMatrix2)
"

The CMD

> geneExpFile <- "/home/Bio1/MMi/RNASeq/Example1000.txt"
> geneExpMatrix2 <- readGeneExp (file=geneExpFile, geneCol=1,valCol =c(7,9,10,12))
> geneExpMatrix1 <- readGeneExp (file=geneExpFile, geneCol=1,valCol =c(6,8,11,14))
> DEGexp(geneExpMatrix1=geneExpMatrix1, geneCol1=1, expCol1=c(2,3,4,5), groupLabel1="WT_uni", geneExpMatrix2=geneExpMatrix2, geneCol2=1, expCol2=c(2,3,4,5), groupLabel2="TR_uni",method= "MARS")
Error in DEGexp(geneExpMatrix1 = geneExpMatrix1, geneCol1 = 1, expCol1 = c(2, :
unused argument(s) (geneExpMatrix1 = geneExpMatrix1, geneExpMatrix2 = geneExpMatrix2)
################

Did anyone face the same problem?

> sessionInfo()
R version 2.10.1 (2009-12-14)
x86_64-unknown-linux-gnu

locale:
[1] LC_CTYPE=en_GB.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_GB.UTF-8 LC_COLLATE=en_GB.UTF-8
[5] LC_MONETARY=C LC_MESSAGES=en_GB.UTF-8
[7] LC_PAPER=en_GB.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] tcltk stats graphics grDevices utils datasets methods
[8] base

other attached packages:
[1] DEGseq_1.0.0 samr_1.28 impute_1.20.0 qvalue_1.20.0

cheers
Leave a comment:
Xi Wang replied

03-20-2010, 09:45 AM
Originally posted by m!x View Post

Hi Xi,

Thank you for the reply. Yes, you are right about the sequencing depth.
They can vary up to 2 times in my biological replicates.

Just to make sure I will do the right thing...here's what I should do with my samples:
1. Find the RPKM values generated in the getGeneExp function
2. Use the DEGexp with MARS method and set the p-value cutoff.

As you've said before about using the SAMWrapper, I assumed then the permutation method included in the SAMWrapper method only works well for the data that have more biological replicates?

Again, the same question as I asked before: for this DEGExp method, how is the fold-change calculated here?

Thanks a lot once again!

Hi,

First, I need to clarify here that (1) for DEGseq or DEGexp function, it is recommended to take the raw read counts as gene expression values; but (2) for samWrapper, it is however only RPKM can be used, as raw read counts don't normalize against the sequencing depth.

Actually, for DEGseq and GEGexp functions, there are two types of fold-change values in the recent version of DEGseq. The output are both log2 values. The one labeled by "log2(Fold_change)" is calculated as follows:

Code:

log2 ( EXP_G_A / EXP_G_B )

The other one labeled by "log2(Fold_change) normalized" is defined as:

Code:

log2 [ (EXP_G_A/T.EXP_A) / (EXP_G_B/T.EXP_B) ]

where EXP_G_A stands the expression value for gene G in sample A, T.EXP_A stands the total expression values (sum up the expression values for all genes) in sample A; similar for sample B.
The two types of fold-change values are different from each other in whether or not normalizing against the total expression values. "log2(Fold_change)" is suitable for RPKM expression values, while "log2(Fold_change) normalized" for raw read counts.

It is true that the q-value estimate is not that precise if the number of biological replicates is small. But this method still works. If you have more repliactes, the estimated q-vaules are more reliable.
Last edited by Xi Wang; 03-21-2010, 01:20 AM. Reason: Change the description for fold-change for DEGseq and DEGexp functions.
Leave a comment:
m!x replied

03-19-2010, 03:02 PM
Hi Xi,

Thank you for the reply. Yes, you are right about the sequencing depth.
They can vary up to 2 times in my biological replicates.

Just to make sure I will do the right thing...here's what I should do with my samples:
1. Find the RPKM values generated in the getGeneExp function
2. Use the DEGexp with MARS method and set the p-value cutoff.

As you've said before about using the SAMWrapper, I assumed then the permutation method included in the SAMWrapper method only works well for the data that have more biological replicates?

Again, the same question as I asked before: for this DEGExp method, how is the fold-change calculated here?

Thanks a lot once again!
Leave a comment:
Xi Wang replied

03-18-2010, 01:01 AM
hi m!x,

We investigated the results given by samWrapper on our test data and as well samWrapper code carefully, and don't find a bug. I said we found some similar results days ago, but the results were not that extreme as yours. We just set the q-value cutoff too stringent. In your case, as you only have 3 biological replicates, the q-value obtained by permutation could not have a high resolution.

However, I still felt weird about your results. I noticed that you used the raw read counts as the input for samWrapper. This maybe the reson. Read counts are not normalized against the sequencing depth. If the sequencing depth is quite different from the two samples, say the total read counts of sample B is as twice as that of sample A, the results could be lots of genes have a higher expression level in sample B than sample A. We included a normalization step in function DEGexp, so the results given by DEGexp are normal. I think the difference between the two results doesn't have much to do with the calculation of q-values.

I suggest you use RPKM to rerun samWrapper. It will give more reasonable and promising results.

Hope this helps.
Leave a comment:
Xi Wang replied

03-17-2010, 06:43 AM
Originally posted by m!x View Post

Hi Xi,

Thanks for the explanation about the foldchange.

The weird thing about my data is that all the downregulated genes has at least q-value of 0.5, so they are counted as "FALSE" in the signature column. I set the max qValue = 0.1.

When I just tried running the data with DEGexp and MARS method (setting p value = 0.001), I get the genes that are downregulated to have very low p-value and q-values.
I believe the calculation of the q-values in samWrapper and DEGexp is different. But, how can it be so different for the downregulated genes in my dataset?

I understand that it's best to use samWrapper for biological replicates (as in my case).
If I may know, what's the fundamental difference between using DEGexp and samWrapper for biological replicates?
If the fold-change in samWrapper is not calculated by summing the raw read counts, how then the comparison is made?

Below are the lines that I used for the samWrapper analysis (3 biological replicates and unpaired data).

> geneExpFile <- "/Users/lilyanamargaretha/Desktop/ICMgene_filtered.exp"
> set.seed(100)
> geneExpFile1 <- geneExpFile
> geneExpFile2 <- geneExpFile
> output <- "/Users/lilyanamargaretha/Desktop/samWrapperOutICM.txt"
> expCol1 = c(2, 5, 8)
> expCol2 = c(11, 14, 17)
> measure1 = rep(1, length(expCol1))
> measure2 = rep(2, length(expCol2))

> samWrapper(geneExpFile1 = geneExpFile1, geneCol1 = 1, expCol1 = expCol1, measure1 = measure1, geneExpFile2 = geneExpFile2, geneCol2 = 1, expCol2 = expCol2, measure2 = measure2, nperms = 100, min.foldchange = 2, max.qValue = 0.1, logged2=FALSE, output = output, paired = FALSE)

Thanks for the answering the questions.

Hi,

Sorry for the late reply.

1. By downregulated genes, did you mean the genes have "Fold Change“ value less than 0.5?

2. In samWrapper, the fold-change is calculated based on the average expression values from biological replicates. And with other statistical metric to determine a gene is differentially expressed or not. If you are interested in the methods, the SAM paper helps.

Significance analysis of microarrays applied to the ionizing radiation response - PubMed

http://www.ncbi.nlm.nih.gov/pubmed/11309499

Microarrays can measure the expression of thousands of genes to identify changes in expression between different biological states. Methods are needed to determine the significance of these changes while accounting for the enormous number of genes. We describe a method, Significance Analysis of Micr …

Many thanks for pointing it out.

Last edited by Xi Wang; 03-18-2010, 01:03 AM. Reason: No bug found
Leave a comment:
m!x replied

03-15-2010, 12:39 PM
Hi Xi,

Thanks for the explanation about the foldchange.

The weird thing about my data is that all the downregulated genes has at least q-value of 0.5, so they are counted as "FALSE" in the signature column. I set the max qValue = 0.1.

When I just tried running the data with DEGexp and MARS method (setting p value = 0.001), I get the genes that are downregulated to have very low p-value and q-values.
I believe the calculation of the q-values in samWrapper and DEGexp is different. But, how can it be so different for the downregulated genes in my dataset?

I understand that it's best to use samWrapper for biological replicates (as in my case).
If I may know, what's the fundamental difference between using DEGexp and samWrapper for biological replicates?
If the fold-change in samWrapper is not calculated by summing the raw read counts, how then the comparison is made?

Below are the lines that I used for the samWrapper analysis (3 biological replicates and unpaired data).

> geneExpFile <- "/Users/lilyanamargaretha/Desktop/ICMgene_filtered.exp"
> set.seed(100)
> geneExpFile1 <- geneExpFile
> geneExpFile2 <- geneExpFile
> output <- "/Users/lilyanamargaretha/Desktop/samWrapperOutICM.txt"
> expCol1 = c(2, 5, 8)
> expCol2 = c(11, 14, 17)
> measure1 = rep(1, length(expCol1))
> measure2 = rep(2, length(expCol2))

> samWrapper(geneExpFile1 = geneExpFile1, geneCol1 = 1, expCol1 = expCol1, measure1 = measure1, geneExpFile2 = geneExpFile2, geneCol2 = 1, expCol2 = expCol2, measure2 = measure2, nperms = 100, min.foldchange = 2, max.qValue = 0.1, logged2=FALSE, output = output, paired = FALSE)

Thanks for the answering the questions.
Leave a comment:

Previous 1 4 5 6 7 8 9 10 11 template Next

Essential Discoveries and Tools in Epitranscriptomics

by seqadmin

The field of epigenetics has traditionally concentrated more on DNA and how changes like methylation and phosphorylation of histones impact gene expression and regulation. However, our increased understanding of RNA modifications and their importance in cellular processes has led to a rise in epitranscriptomics research. “Epitranscriptomics brings together the concepts of epigenetics and gene expression,” explained Adrien Leger, PhD, Principal Research Scientist...
- Channel: Articles
04-22-2024, 07:01 AM
Current Approaches to Protein Sequencing

by seqadmin

Proteins are often described as the workhorses of the cell, and identifying their sequences is key to understanding their role in biological processes and disease. Currently, the most common technique used to determine protein sequences is mass spectrometry. While still a valuable tool, mass spectrometry faces several limitations and requires a highly experienced scientist familiar with the equipment to operate it. Additionally, other proteomic methods, like affinity assays, are constrained...
- Channel: Articles
04-04-2024, 04:25 PM

Topics	Statistics	Last Post
Genetic Variants and Diabetes Risk in Childhood Cancer Survivors by seqadmin Started by seqadmin, Yesterday, 08:47 AM	0 responses 12 views 0 likes	Last Post by seqadmin Yesterday, 08:47 AM
Cancer Metastasis: A Deep Dive into Cellular Plasticity by seqadmin Started by seqadmin, 04-11-2024, 12:08 PM	0 responses 60 views 0 likes	Last Post by seqadmin 04-11-2024, 12:08 PM
Proteogenomic Profiles Offer New Clues in Prostate Cancer by seqadmin Started by seqadmin, 04-10-2024, 10:19 PM	0 responses 59 views 0 likes	Last Post by seqadmin 04-10-2024, 10:19 PM
Novel Diagnostic Assay Enhances Ovarian Cancer Detection by seqadmin Started by seqadmin, 04-10-2024, 09:21 AM	0 responses 54 views 0 likes	Last Post by seqadmin 04-10-2024, 09:21 AM

Seqanswers Leaderboard Ad

Announcement

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment:

Latest Articles

ad_right_rmr

News