Seqanswers Leaderboard Ad

**ngcrawford** · 12-01-2009, 10:33 AM

Thanks Likun,

I was confused by the q-value. It all makes sense now.

- Nick

**apratap** · 12-11-2009, 07:31 PM

Hi Xi

Thanks for putting your work in the open source. I am quite happy to see many different methods at my hand to play around with numbers.

I think I have some repeated questions that have been asked on this thread in some way or the other. I may be getting down to very basics and hope that will help some others also.

As a pilot DEgseq usage run, I used the LRT method to find expression values for two samples(control/mutant). I am taking one specific gene count example to ask my questions. The input was bed format, alignment was done using TopHat.

HTML Code:

Control

Gene    Count      RPKM         total no. reads      gene length
ABAT    811         1.57049      87082676          5930

Mutant
ABAT     4264      10.3046        69779755           5930

While I understand if we ignore the isoforms of the same gene, normalizing for the gene len wont make a difference. I am not sure why we should not normalize for the difference in reads counts. Also I am not able to generate the RPKM result using the read count /gene length. Could you please explain.

1. why we are not accounting for diff read numbers in the output_score.txt
2. How is the RPKM being calculated here. Somehow I am not able to find the same number shown in the example taken above.

Thanks!
-Abhi

**Xi Wang** · 12-11-2009, 08:36 PM

Hi Abhi,

Thanks for your questions.

Originally posted by apratap View Post

While I understand if we ignore the isoforms of the same gene, normalizing for the gene len wont make a difference. I am not sure why we should not normalize for the difference in reads counts. Also I am not able to generate the RPKM result using the read count /gene length. Could you please explain.

Yes, if the isoforms between samples are changed, we should count all the reads mapping to those isoforms and then normalize against the isoform lengths, respectively. However, our DEGseq package is highly dependent on the annotated gene model, and is not able to infer (novel) isoforms with isoform expression levels at the current stage. We count the read number for each gene only the read mapped to gene exons based on gene annotation. In other words, we only take the reads from the same region with the same length for comparison.

Originally posted by apratap View Post

1. why we are not accounting for diff read numbers in the output_score.txt

Could you please show me the line for the gene you mentioned in the output_score.txt and what command you used to generate the output_score.txt? Thanks.

Originally posted by apratap View Post

2. How is the RPKM being calculated here. Somehow I am not able to find the same number shown in the example taken above.

We calculate RPKM following its definition: reads per kilo-base perl million reads. Take your control data for example: RPKM = 811 / 87082676 / 5930 * 1e+9 = 1.57049.

Hope this helps. If further questions, just let me know. I am glad to solve all the problems.

Best,
Xi

**lix** · 01-27-2010, 09:54 PM

DEGseq err

Hi Xi,

I'm trying to use DEGseq and my RNA-seq reads have already mapped. The results are the "eland" format. Does DEGseq support this kind of format?

I just follow these commands as the manual metioned:

library(DEGseq)
sample1 <- system.file("data","D:/data/sample1.txt",package = "DEGseq")
sample2 <- system.file("data","D:/data/sample2.txt",package = "DEGseq")
refFlat <- system.file("data","D:/data/refFlat.txt",package = "DEGseq")
mapResultBatch1 <- c(sample1)
mapResultBatch2 <- c(sample2)
outputDir <- file.path(tempdir(),"D:/data/DEGseqExample")
DEGseq(mapResultBatch1, mapResultBatch2, fileFormat = "eland",refFlat = refFlat, outputDir = outputDir, method = "LRT")

However, it reported:

Loading required package: qvalue
Loading required package: tcltk
Loading Tcl/Tk interface ... done
Loading required package: samr
Loading required package: impute
Error in file(file, "r") : cannot open the connection
Calls: DEGseq -> DEGexp -> read.table -> file
In addition: Warning message:
In file(file, "r") :
cannot open file 'D:/data/DEGseqExample/group1.exp': No such file or directory
Execution halted

Is my CMD all right? Any kind of suggetion will be appreciated.
Thanks.

lix

**Xi Wang** · 01-29-2010, 06:44 AM

Sorry for the late reply. I was just back home.

You can try the following script. If the problem remains, just let me know.

Code:

library(DEGseq)
sample1 <- "D:/data/sample1.txt"
sample2 <- "D:/data/sample2.txt"
refFlat <- "D:/data/refFlat.txt"
mapResultBatch1 <- c(sample1)
mapResultBatch2 <- c(sample2)
outputDir <- "D:/data/DEGseqExample"
DEGseq(mapResultBatch1, mapResultBatch2, fileFormat = "eland",refFlat = refFlat, outputDir = outputDir, method = "LRT")

if you suspect the eland format problem, show a few lines of your eland file. thanks.

**lix** · 01-29-2010, 06:59 PM

Hi Xi,

Thank you for your so kind reply.

My mapped reads are the "eland" format like this:
26 CCTTTCCACATCTTTCTCCCTCGCT U1 0 1 1 chr12 81865484 R

So, my data should convert to the "eland" format that DEGseq supports like this:
26 CCTTTCCACATCTTTCTCCCTCGCT 81865484 U1 R

I'm just wondering whether my conversion was right.

BTW, after I used the getGeneExp() function, if all of the RPKM values in the expression value files are "0", does it mean that the DEGexp() will fail to read the expCol1 or expCol2 value?
And, is there any difference between the "valCol" in readGeneExp() and the "expCol" in DEGexp()?

Thanks.

lix

**Xi Wang** · 01-29-2010, 10:17 PM

Hi lix,

Originally posted by lix View Post

My mapped reads are the "eland" format like this:
26 CCTTTCCACATCTTTCTCCCTCGCT U1 0 1 1 chr12 81865484 R

So, my data should convert to the "eland" format that DEGseq supports like this:
26 CCTTTCCACATCTTTCTCCCTCGCT 81865484 U1 R

I'm just wondering whether my conversion was right.

I am wondering how you convert the format. If you used a script to implement the conversion, you can check the result after conversion directly. Certainly, you need to make this step work.

BTW, after I used the getGeneExp() function, if all of the RPKM values in the expression value files are "0", does it mean that the DEGexp() will fail to read the expCol1 or expCol2 value?

Sure, even if DEGexp() successfully reads the values, the values are all equal to 0.

And, is there any difference between the "valCol" in readGeneExp() and the "expCol" in DEGexp()?

You can take they the same. But "valCol" could be any col while "expCol" should only be expression cols.

**AmyL** · 02-10-2010, 01:58 PM

Hi,

I was wondering what density is a measure of in the first output graph of DEGseq,

thanks,
Amy

**Xi Wang** · 02-10-2010, 07:36 PM

Originally posted by AmyL View Post

Hi,

I was wondering what density is a measure of in the first output graph of DEGseq,

thanks,
Amy

The plot is generated by:

Code:

hist(LogVal(Sample1),main=label1,xlab="log2(Number of reads mapped to a gene)",col=4,breaks=100,freq=FALSE,ylim=c(0,0.5))

Using "freq=FALSE" means, component density are plotted, so that the histogram has a total area of one: sum(density * bin_width) = 1

**maria.b** · 02-16-2010, 04:46 AM

FET method

Hi everybody,

I'm using DEGseq to identify gene differentially expressed genes from expression values that I already have.

I would like to know how many time does it takes to run the DEGexp function with FET method. Because I recieve the result for the LRT and MARS method in a few minutes and for the FET method I let it run more than one night and it was still running. Is it normal?

I have an other question concerning the expression value. For the moment I calculate these values like the sum of reads on each base of a gene and not the number of reads mapped on the gene and next I transform these values in RPKM. Do you think that it will change anything in the differentially expressed genes analysis? What do you use to calculate thiss expression values?

Thanks for you help

Maria

**Xi Wang** · 02-16-2010, 07:48 AM

Hi, Maria

Originally posted by maria.b View Post

Hi everybody,
I'm using DEGseq to identify gene differentially expressed genes from expression values that I already have.

Thanks for using DEGseq.

Originally posted by maria.b View Post

I would like to know how many time does it takes to run the DEGexp function with FET method. Because I recieve the result for the LRT and MARS method in a few minutes and for the FET method I let it run more than one night and it was still running. Is it normal?

What is you data size? I don't think it is normal primarily. But we need to confirm what caused this time consuming problem.

Originally posted by maria.b View Post

I have an other question concerning the expression value. For the moment I calculate these values like the sum of reads on each base of a gene and not the number of reads mapped on the gene and next I transform these values in RPKM. Do you think that it will change anything in the differentially expressed genes analysis? What do you use to calculate thiss expression values?

The values by your means roughly equals to (read count) * (read length) * 10⁹ / (gene length) / (total reads) = RPKM * (read length)
It is ok to use you method, but when counting RPKM, you need divide the values by the read length, further.

**maria.b** · 02-16-2010, 08:44 AM

Ok thanks for your reply,

I have three values per gene:
- the sum of read on each base (count)
- the average coverage on each base(mean)
- the expression value in RPKM (rpkm) (rpkm = count * 10⁹ /length*nbreads) (maybe this is false, i will change my calculation of count to have the number of reads mapped per gene, but it's an other problem)

I have 13661 genes and three sample in 2 replicats
Here you have the value min and max for each replicat and for each type of expression value (I don'tif it's important)
count values :
's1_1': [0, 5983478],
's1_2': [0, 17697854],
's2_2': [0, 14879008],
's2_1': [0, 14369451],
's3_2': [0, 11717714],
's3_1': [0, 11696411]

mean_values:
's1_1': [0.0, 5942.0],
's1_2': [0.0, 65791.0],
's2_2': [0.0, 14776.0],
's2_1': [0.0, 14270.0],
's3_2': [0.0, 11636.0],
's3_1': [0.0, 11615.0]

rpkm values:
's1_1': [0.0, 1075393.0],
's1_2': [0.0, 4577530.0],
's2_2': [0.0, 1072475.0],
's2_1': [0.0, 1145468.0],
's3_2': [0.0, 1064802.0],
's3_1': [0.0, 869385.0]

I want to run the FET method to compare S1 and S2, S1 and S3 using the different expression value type.
Is the command DEGexp() different when we use the method MARS than when we use the method FET?

My output looks like this (exemple comparing s1 and s2):

#############analyse differentielle, methode FET, s2 vs s1, count#############
Please wait...

geneExpFile1: fileEXPR
gene id column in geneExpFile1: 1
expression value column(s) in geneExpFile1: 9 10
total number of reads uniquely mapped to genome obtained from sample1: 468022379 515938474

geneExpFile2: fileEXPR
gene id column in geneExpFile2: 1
expression value column(s) in geneExpFile2: 7 8
total number of reads uniquely mapped to genome obtained from sample2: 204283936 231306719

method to identify differentially expressed genes: FET
pValue threshold: 0.001
output directory: out

Please wait ...
Identifying differentially expressed genes ...
Please wait patiently ...

and it never ends.

Thanks for your help.

Maria

**Xi Wang** · 02-16-2010, 11:36 PM

Hi, Maria

Maybe the figures are too large for DEGseq to calculate. But I need to check if it is the reason, or it's a bug of DEGseq.

From how we model the RNA-seq data, we strongly recommend you use the read counts (instead of the sum of read counts on every base) as the gene expression level estimate. You can just simply try DEGseq function in the package.

Thanks,

**maria.b** · 02-17-2010, 12:23 AM

Hello,

Finally it ends during this night.(maybe I don't wait enough patiently) I will recalculate the expression values like you said, and I will tell you if it work's better.

Thanks for your advices

Maria

**m!x** · 02-18-2010, 06:22 PM

Hi,

I am trying to use samWrapper to analyze my RNA-seq data on Mac OS X.
Is there a simple way to specify the path to the files?

I noticed that you can use the following on Windows:
>geneExpFile <- "D:/data/sample1.txt"

Thanks!

Topics	Statistics	Last Post
Expanding the Horizons of Cellular Research with the Single Cell Atlas by seqadmin Started by seqadmin, Today, 11:49 AM	0 responses 11 views 0 likes	Last Post by seqadmin Today, 11:49 AM
Genetic Variants and Diabetes Risk in Childhood Cancer Survivors by seqadmin Started by seqadmin, Yesterday, 08:47 AM	0 responses 16 views 0 likes	Last Post by seqadmin Yesterday, 08:47 AM
Cancer Metastasis: A Deep Dive into Cellular Plasticity by seqadmin Started by seqadmin, 04-11-2024, 12:08 PM	0 responses 61 views 0 likes	Last Post by seqadmin 04-11-2024, 12:08 PM
Proteogenomic Profiles Offer New Clues in Prostate Cancer by seqadmin Started by seqadmin, 04-10-2024, 10:19 PM	0 responses 60 views 0 likes	Last Post by seqadmin 04-10-2024, 10:19 PM

Seqanswers Leaderboard Ad

Announcement

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Latest Articles

ad_right_rmr

News