DEGseq - SEQanswers

lix replied

01-29-2010, 06:59 PM
Hi Xi,

Thank you for your so kind reply.

My mapped reads are the "eland" format like this:
26 CCTTTCCACATCTTTCTCCCTCGCT U1 0 1 1 chr12 81865484 R

So, my data should convert to the "eland" format that DEGseq supports like this:
26 CCTTTCCACATCTTTCTCCCTCGCT 81865484 U1 R

I'm just wondering whether my conversion was right.

BTW, after I used the getGeneExp() function, if all of the RPKM values in the expression value files are "0", does it mean that the DEGexp() will fail to read the expCol1 or expCol2 value?
And, is there any difference between the "valCol" in readGeneExp() and the "expCol" in DEGexp()?

Thanks.

lix
Leave a comment:

Xi Wang replied

01-29-2010, 06:44 AM

Sorry for the late reply. I was just back home.

You can try the following script. If the problem remains, just let me know.

Code:

library(DEGseq)
sample1 <- "D:/data/sample1.txt"
sample2 <- "D:/data/sample2.txt"
refFlat <- "D:/data/refFlat.txt"
mapResultBatch1 <- c(sample1)
mapResultBatch2 <- c(sample2)
outputDir <- "D:/data/DEGseqExample"
DEGseq(mapResultBatch1, mapResultBatch2, fileFormat = "eland",refFlat = refFlat, outputDir = outputDir, method = "LRT")

if you suspect the eland format problem, show a few lines of your eland file. thanks.

Leave a comment:

lix replied

01-27-2010, 09:54 PM
DEGseq err

Hi Xi,

I'm trying to use DEGseq and my RNA-seq reads have already mapped. The results are the "eland" format. Does DEGseq support this kind of format?

I just follow these commands as the manual metioned:

library(DEGseq)
sample1 <- system.file("data","D:/data/sample1.txt",package = "DEGseq")
sample2 <- system.file("data","D:/data/sample2.txt",package = "DEGseq")
refFlat <- system.file("data","D:/data/refFlat.txt",package = "DEGseq")
mapResultBatch1 <- c(sample1)
mapResultBatch2 <- c(sample2)
outputDir <- file.path(tempdir(),"D:/data/DEGseqExample")
DEGseq(mapResultBatch1, mapResultBatch2, fileFormat = "eland",refFlat = refFlat, outputDir = outputDir, method = "LRT")

However, it reported:

Loading required package: qvalue
Loading required package: tcltk
Loading Tcl/Tk interface ... done
Loading required package: samr
Loading required package: impute
Error in file(file, "r") : cannot open the connection
Calls: DEGseq -> DEGexp -> read.table -> file
In addition: Warning message:
In file(file, "r") :
cannot open file 'D:/data/DEGseqExample/group1.exp': No such file or directory
Execution halted

Is my CMD all right? Any kind of suggetion will be appreciated.
Thanks.

lix

Last edited by lix; 01-28-2010, 04:55 AM.
Leave a comment:
Xi Wang replied

12-11-2009, 08:36 PM
Hi Abhi,

Thanks for your questions.

Originally posted by apratap View Post

While I understand if we ignore the isoforms of the same gene, normalizing for the gene len wont make a difference. I am not sure why we should not normalize for the difference in reads counts. Also I am not able to generate the RPKM result using the read count /gene length. Could you please explain.

Yes, if the isoforms between samples are changed, we should count all the reads mapping to those isoforms and then normalize against the isoform lengths, respectively. However, our DEGseq package is highly dependent on the annotated gene model, and is not able to infer (novel) isoforms with isoform expression levels at the current stage. We count the read number for each gene only the read mapped to gene exons based on gene annotation. In other words, we only take the reads from the same region with the same length for comparison.

Originally posted by apratap View Post

1. why we are not accounting for diff read numbers in the output_score.txt

Could you please show me the line for the gene you mentioned in the output_score.txt and what command you used to generate the output_score.txt? Thanks.

Originally posted by apratap View Post

2. How is the RPKM being calculated here. Somehow I am not able to find the same number shown in the example taken above.

We calculate RPKM following its definition: reads per kilo-base perl million reads. Take your control data for example: RPKM = 811 / 87082676 / 5930 * 1e+9 = 1.57049.

Hope this helps. If further questions, just let me know. I am glad to solve all the problems.

Best,
Xi
Leave a comment:
apratap replied

12-11-2009, 07:31 PM
Hi Xi

Thanks for putting your work in the open source. I am quite happy to see many different methods at my hand to play around with numbers.

I think I have some repeated questions that have been asked on this thread in some way or the other. I may be getting down to very basics and hope that will help some others also.

As a pilot DEgseq usage run, I used the LRT method to find expression values for two samples(control/mutant). I am taking one specific gene count example to ask my questions. The input was bed format, alignment was done using TopHat.

HTML Code:

Control Gene Count RPKM total no. reads gene length ABAT 811 1.57049 87082676 5930 Mutant ABAT 4264 10.3046 69779755 5930

While I understand if we ignore the isoforms of the same gene, normalizing for the gene len wont make a difference. I am not sure why we should not normalize for the difference in reads counts. Also I am not able to generate the RPKM result using the read count /gene length. Could you please explain.

1. why we are not accounting for diff read numbers in the output_score.txt
2. How is the RPKM being calculated here. Somehow I am not able to find the same number shown in the example taken above.

Thanks!
-Abhi
Leave a comment:
ngcrawford replied

12-01-2009, 10:33 AM
Thanks Likun,

I was confused by the q-value. It all makes sense now.

- Nick
Leave a comment:
Likun Wang replied

11-22-2009, 05:13 AM
Originally posted by Xi Wang View Post

AmyL,

Thanks for your question.
If you only care the fold changes, you can use a normalization as you mentioned. Or, in DEGseq, you can use the option normalMethod="median".

Xi

BTW: For the fold changes you can do normalization as you methioned or use the option normalMethod="median". But for the methods "LRT", "FET(fisher's exact test)" and "MARS", the row count and normalMethod="none" are recommended for your example. .
Leave a comment:
Likun Wang replied

11-22-2009, 05:01 AM
Originally posted by ngcrawford View Post

How do you set it? It's mentioned in the paper, but I can only find ways to adjust the p-value cut-off.

Thanks in advance.

- Nick

Hi ngcrawford and Xi,
I think ngcrawford want to find a way to set the fdr cut-off.
The following is an example to set it.
DEGexp(geneExpFile1=geneExpFile,geneExpFile2=geneExpFile2,thresholdKind=4,qValue=0.001)
Please type ?DEGexp for detail.
---------------
Likun
Leave a comment:
Xi Wang replied

11-20-2009, 07:08 PM
Originally posted by AmyL View Post

Question:

I am attempting to analyze samples that do not have the same number of reads. For example, one has 800K and another has 1.3M. With the analysis from DEGseq, it is obvious that the fold changes between samples are due to the difference in total read numbers. With this particular example, would you recommend using a reads/million normalization?

Thanks in advance!

AmyL,

Thanks for your question.
If you only care the fold changes, you can use a normalization as you mentioned. Or, in DEGseq, you can use the option normalMethod="median".

Xi
Leave a comment:
Xi Wang replied

11-20-2009, 06:45 PM
Originally posted by ngcrawford View Post

How do you set it? It's mentioned in the paper, but I can only find ways to adjust the p-value cut-off.

Thanks in advance.

- Nick

Sorry, I cannot catch what your meaning. What "it" refers to? Thanks.

Xi
Leave a comment:
AmyL replied

11-20-2009, 06:30 PM
Question:

I am attempting to analyze samples that do not have the same number of reads. For example, one has 800K and another has 1.3M. With the analysis from DEGseq, it is obvious that the fold changes between samples are due to the difference in total read numbers. With this particular example, would you recommend using a reads/million normalization?

Thanks in advance!
Leave a comment:
ngcrawford replied

11-20-2009, 01:21 PM
Fdr?

How do you set it? It's mentioned in the paper, but I can only find ways to adjust the p-value cut-off.

Thanks in advance.

- Nick
Leave a comment:
ngcrawford replied

11-20-2009, 07:04 AM
Xi,

That worked like a charm. Thanks!

- Nick
Leave a comment:
Xi Wang replied

11-19-2009, 09:48 PM
Originally posted by ngcrawford View Post

DEGseq look really nice, but I'm having trouble getting my data file read. Do I just need to substitute:

with

and then run DEGexp(commands)?

I'm getting the following error:

Thanks in advance,
Nick

Nick,

You can specify the gene expression file in this way:

Suppose the file path is "D:/data/MyData.txt" (windows platform), then

Code:

geneExpFile <- "D:/data/MyData.txt"

Xi
Leave a comment:
ngcrawford replied

11-19-2009, 04:37 PM
I can't get DEGseq to run my data

DEGseq look really nice, but I'm having trouble getting my data file read. Do I just need to substitute:

>geneExpFile <- system.file("data", "GeneExpExample5000.txt", package = "DEGseq")

with

>geneExpFile <- system.file("data", "MyData.txt", package = "DEGseq")

and then run DEGexp(commands)?

I'm getting the following error:

Error in read.table(geneExpFile1, header = header, sep = sep) :
no lines available in input
In addition: Warning message:
In file(file, "rt") :
file("") only supports open = "w+" and open = "w+b": using the former

Thanks in advance,
Nick
Leave a comment:

Previous 1 6 7 8 9 10 11 template Next

Essential Discoveries and Tools in Epitranscriptomics

by seqadmin

The field of epigenetics has traditionally concentrated more on DNA and how changes like methylation and phosphorylation of histones impact gene expression and regulation. However, our increased understanding of RNA modifications and their importance in cellular processes has led to a rise in epitranscriptomics research. “Epitranscriptomics brings together the concepts of epigenetics and gene expression,” explained Adrien Leger, PhD, Principal Research Scientist...
- Channel: Articles
04-22-2024, 07:01 AM
Current Approaches to Protein Sequencing

by seqadmin

Proteins are often described as the workhorses of the cell, and identifying their sequences is key to understanding their role in biological processes and disease. Currently, the most common technique used to determine protein sequences is mass spectrometry. While still a valuable tool, mass spectrometry faces several limitations and requires a highly experienced scientist familiar with the equipment to operate it. Additionally, other proteomic methods, like affinity assays, are constrained...
- Channel: Articles
04-04-2024, 04:25 PM

Topics	Statistics	Last Post
Expanding the Horizons of Cellular Research with the Single Cell Atlas by seqadmin Started by seqadmin, 04-25-2024, 11:49 AM	0 responses 20 views 0 likes	Last Post by seqadmin 04-25-2024, 11:49 AM
Genetic Variants and Diabetes Risk in Childhood Cancer Survivors by seqadmin Started by seqadmin, 04-24-2024, 08:47 AM	0 responses 20 views 0 likes	Last Post by seqadmin 04-24-2024, 08:47 AM
Cancer Metastasis: A Deep Dive into Cellular Plasticity by seqadmin Started by seqadmin, 04-11-2024, 12:08 PM	0 responses 62 views 0 likes	Last Post by seqadmin 04-11-2024, 12:08 PM
Proteogenomic Profiles Offer New Clues in Prostate Cancer by seqadmin Started by seqadmin, 04-10-2024, 10:19 PM	0 responses 61 views 0 likes	Last Post by seqadmin 04-10-2024, 10:19 PM

Seqanswers Leaderboard Ad

Announcement

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment:

Latest Articles

ad_right_rmr

News