DEGseq - SEQanswers

Xi Wang replied

03-13-2010, 12:41 AM
Originally posted by m!x

I successfully ran the SamWrapper, but would appreciate further explanation about the output columns.

What do the "numerator" and "denominator" columns represent?
How do I tell whether the genes are upregulated or downregulated?
e.g., if I set the min. foldchange = 2, will this include the genes that are downregulated 2-fold?

I also got different results when I used "filtered" data as input.
First, I used "getGeneExp" to get the raw read counts for each gene.
From the output, I removed all genes that have fewer than 2 reads. Then, I filtered for the genes that are common between the biological replicates for each sample.
I used this filtered set as the input for samWrapper.
Is this a valid approach?
Or, should I have used the unfiltered set?

Thank you!

Hi,

For the samWrapper function, the output file contains several columns related to the T-statistic: score(d) is the T-statistic value, numerator(r) is the numerator of the T-statistic, and denominator(s + s0) is the denominator of the T-statistic. For more details, please see Section 12.2 of the SAM manual.

http://www-stat.stanford.edu/~tibs/SAM/sam.pdf
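To make the relation between the three columns concrete, here is a minimal sketch in base R for a single gene. The expression values are made up, the formulas for r and s are the usual two-class unpaired form as I understand it (please check them against Section 12.2 of the manual), and s0 is fixed to an arbitrary number here, whereas SAM estimates it from the data.

```r
# Toy expression values for one gene, 3 replicates per group (made-up numbers).
group1 <- c(10, 12, 11)
group2 <- c(20, 22, 24)

n1 <- length(group1)
n2 <- length(group2)

# numerator(r): difference between the two group means
r <- mean(group2) - mean(group1)

# s: pooled standard error of that difference
pooled_ss <- sum((group1 - mean(group1))^2) + sum((group2 - mean(group2))^2)
s <- sqrt((1 / n1 + 1 / n2) * pooled_ss / (n1 + n2 - 2))

# s0: small positive constant added to the denominator.
# ASSUMPTION: fixed arbitrarily here for illustration; SAM chooses it from the data.
s0 <- 0.5

# score(d) = numerator(r) / denominator(s + s0)
d <- r / (s + s0)

print(c(numerator = r, denominator = s + s0, score = d))
```

Because s0 is positive, the score is always somewhat smaller in magnitude than the plain t-statistic r / s, which damps genes with very small variance.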

The Signature column indicates whether each gene is differentially expressed. If it is "TRUE", the gene may be either upregulated or downregulated. The default min. foldchange = 2 covers both cases, i.e., fold change > 2 and fold change < 0.5. You can verify this easily by comparing the two columns "Fold Change" and "Signature".
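A small sketch of the two-sided fold-change rule described above, in base R. The gene names and fold-change values are invented, and the real Signature column may also depend on other thresholds (such as the q-value), so this only illustrates the fold-change part. Which sample is the numerator of the fold change determines what "up" means; that is an assumption here.

```r
# Made-up fold changes for four genes.
fold_change <- c(geneA = 4.0, geneB = 0.25, geneC = 1.3, geneD = 0.8)
min_foldchange <- 2

# A gene passes if it changed by at least the threshold in either direction.
passes <- fold_change > min_foldchange | fold_change < 1 / min_foldchange

# Direction of the change for the genes that pass.
direction <- ifelse(!passes, "unchanged",
                    ifelse(fold_change > 1, "up", "down"))

print(data.frame(fold_change, passes, direction))
```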

What do you mean by "the genes that are common"? I don't think this filtering is necessary. You can also compare the two results, pick out the differences (genes that appear in only one of the results), and analyze what causes them. If you could show me an example, I may be able to give you some clues. Thanks.
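One way to do that comparison is with set operations in base R. This is only a sketch: the two gene lists are placeholders standing in for the genes flagged as significant in the unfiltered and the filtered run.

```r
# Placeholder gene lists (hypothetical names) from the two runs.
sig_unfiltered <- c("gene1", "gene2", "gene3", "gene5")
sig_filtered   <- c("gene2", "gene3", "gene4")

# Genes reported in only one of the two results.
only_unfiltered <- setdiff(sig_unfiltered, sig_filtered)
only_filtered   <- setdiff(sig_filtered, sig_unfiltered)

# Genes reported in both results.
in_both <- intersect(sig_unfiltered, sig_filtered)

print(list(only_unfiltered = only_unfiltered,
           only_filtered = only_filtered,
           in_both = in_both))
```

The genes in the two "only" lists are the ones worth inspecting by hand, for example by looking at their raw read counts in each replicate.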
m!x replied

03-11-2010, 03:47 PM
I successfully ran the SamWrapper, but would appreciate further explanation about the output columns.

What do the "numerator" and "denominator" columns represent?
How do I tell whether the genes are upregulated or downregulated?
e.g., if I set the min. foldchange = 2, will this include the genes that are downregulated 2-fold?

I also got different results when I used "filtered" data as input.
First, I used "getGeneExp" to get the raw read counts for each gene.
From the output, I removed all genes that have fewer than 2 reads. Then, I filtered for the genes that are common between the biological replicates for each sample.
I used this filtered set as the input for samWrapper.
Is this a valid approach?
Or, should I have used the unfiltered set?

Thank you!
qqsvery replied

03-04-2010, 08:29 PM
Originally posted by svl

I seem to be unable to install the package... has anyone had success?
----
source("http://bioconductor.org/biocLite.R")
biocLite("DEGseq")
----

Also their site is unavailable: http://bioinfo.au.tsinghua.edu.cn/software/degseq
Here is the page at Bioc: http://www.bioconductor.org/packages...ml/DEGseq.html

Thanks for the information! It's useful!
Xi Wang replied

03-04-2010, 07:02 PM
Hi Steffen,

Thanks for using DEGseq.

First I tried the method "CTR" to check the variation between the replicates, but I couldn't find a code example for this method in package material.

Actually, the "CTR" example code is similar to the example for the DEGexp function. The only modification is to specify method="CTR". So, a code example is:

Code:

DEGexp(geneExpFile1=geneExpFile, geneCol1=1, expCol1=2, groupLabel1="R1", geneExpFile2=geneExpFile, geneCol2=1, expCol2=3, groupLabel2="R2", method="CTR", outputDir=outputDir)

Note: geneExpFile contains the expression values for the two replicates, where gene names are listed in column 1, expression values for replicate 1 are listed in column 2, and expression values for replicate 2 are listed in column 3.
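For readers who want to see the file layout described in the note, here is a sketch in base R that writes a tiny three-column file and sets the two variables used in the call above. The gene names, counts, file name, and output directory name are all made up, and a tab-delimited file with a header row is an assumption about the expected format.

```r
# Toy table: column 1 = gene names, column 2 = replicate 1, column 3 = replicate 2.
toy <- data.frame(gene = c("geneA", "geneB", "geneC"),
                  R1   = c(120, 15, 0),
                  R2   = c(131, 9, 2))

# Hypothetical locations in the session's temporary directory.
geneExpFile <- file.path(tempdir(), "toy_replicates.txt")
outputDir   <- file.path(tempdir(), "DEGseq_CTR_out")

# ASSUMPTION: tab-delimited text with a header row.
write.table(toy, file = geneExpFile, sep = "\t",
            quote = FALSE, row.names = FALSE)

# Read the file back to confirm the column order matches
# geneCol1 = 1, expCol1 = 2 and expCol2 = 3.
check <- read.delim(geneExpFile)
print(check)
```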

On the last output plot produced by the method "CTR" one can see the difference between the standard deviation of M according to the RSM and the theoretical four-fold local standard deviation of M by the comparison of technical replicates. But what does it mean when there is a distance between these two lines (red and blue)? Can I use the method "MATR" which is based on technical replicates anyway?

This means the two replicates do not match well. Yes, you can use the MATR method anyway. You may also use other methods and compare the corresponding results. If feasible, an extra validation step should be done, and then you can judge which method is better.

Because I have these 4 datasets per condition, I wanted to use them in the correct way, not simply add the raw counts of each gene. Which method would you propose in this case, and what should the correct call to the function DEGexp(..?..) look like?

There is another function, "samWrapper", for comparing two samples with biological replicates. You can try it on your 4 datasets. Theoretically, though, technical replicates cannot be treated as biological replicates.
steffenp replied

03-04-2010, 04:43 AM
Hi,
I wanted to use DEGseq to identify differentially expressed genes between wildtype and mutant experiments. I have 4 datasets for each condition (WT,mutant) : 2 biological replicates and for each of them 2 technical replicates.

First I tried the method "CTR" to check the variation between the replicates, but I couldn't find a code example for this method in package material. On the last output plot produced by the method "CTR" one can see the difference between the standard deviation of M according to the RSM and the theoretical four-fold local standard deviation of M by the comparison of technical replicates. But what does it mean when there is a distance between these two lines (red and blue)? Can I use the method "MATR" which is based on technical replicates anyway?

Because I have these 4 datasets per condition, I wanted to use them in the correct way, not simply add the raw counts of each gene. Which method would you propose in this case, and what should the correct call to the function DEGexp(..?..) look like?

Many thanks for your help!
Steffen
Xi Wang replied

02-18-2010, 08:07 PM
Originally posted by m!x

Hi,

I am trying to use samWrapper to analyze my RNA-seq data on Mac OS X.
Is there a simple way to specify the path to the files?

I noticed that you can use the following on Windows:
>geneExpFile <- "D:/data/sample1.txt"

Thanks!

It's not very difficult. For example:

Code:

>geneExpFile <- "/PATH/TO/FILE"

but you need to know where your file is. The "pwd" command in a terminal may help you find out.
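Inside R itself, a few base functions can help with paths on Mac OS X. This is only a sketch; the directory and file name below are hypothetical, so replace them with your own.

```r
# Show the directory R is currently working in (the R counterpart of "pwd").
print(getwd())

# Build a path from its parts; "~" stands for your home directory.
# The "data/sample1.txt" location is a made-up example.
geneExpFile <- file.path("~", "data", "sample1.txt")

# Turn "~" into the full home directory path.
geneExpFile <- path.expand(geneExpFile)
print(geneExpFile)

# Check that the file really exists before passing it to any function.
found <- file.exists(geneExpFile)
print(found)
```

If `found` is FALSE, the path is wrong (or the file is elsewhere), and it is worth fixing that before running the analysis.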
m!x replied

02-18-2010, 06:22 PM
Hi,

I am trying to use samWrapper to analyze my RNA-seq data on Mac OS X.
Is there a simple way to specify the path to the files?

I noticed that you can use the following on Windows:
>geneExpFile <- "D:/data/sample1.txt"

Thanks!
maria.b replied

02-17-2010, 12:23 AM
Hello,

It finally finished during the night (maybe I did not wait patiently enough). I will recalculate the expression values as you suggested, and I will let you know whether it works better.

Thanks for your advice

Maria

Last edited by maria.b; 02-17-2010, 01:01 AM.
Xi Wang replied

02-16-2010, 11:36 PM
Hi, Maria

Maybe the values are too large for DEGseq to handle. But I need to check whether that is the reason or whether it is a bug in DEGseq.

Based on how we model RNA-seq data, we strongly recommend using the read counts (rather than the sum of read counts over every base) as the gene expression level estimate. You can simply try the DEGseq function in the package.

Thanks,
maria.b replied

02-16-2010, 08:44 AM
Ok thanks for your reply,

I have three values per gene:
- the sum of reads on each base (count)
- the average coverage on each base (mean)
- the expression value in RPKM (rpkm) (rpkm = count * 10⁹ / (length * nbreads)) (maybe this is wrong; I will change my calculation of count to use the number of reads mapped per gene, but that is another problem)

I have 13661 genes and three samples with 2 replicates each.
Here are the min and max values for each replicate and for each type of expression value (I don't know if it's important)
count values :
's1_1': [0, 5983478],
's1_2': [0, 17697854],
's2_2': [0, 14879008],
's2_1': [0, 14369451],
's3_2': [0, 11717714],
's3_1': [0, 11696411]

mean_values:
's1_1': [0.0, 5942.0],
's1_2': [0.0, 65791.0],
's2_2': [0.0, 14776.0],
's2_1': [0.0, 14270.0],
's3_2': [0.0, 11636.0],
's3_1': [0.0, 11615.0]

rpkm values:
's1_1': [0.0, 1075393.0],
's1_2': [0.0, 4577530.0],
's2_2': [0.0, 1072475.0],
's2_1': [0.0, 1145468.0],
's3_2': [0.0, 1064802.0],
's3_1': [0.0, 869385.0]

I want to run the FET method to compare S1 and S2, S1 and S3 using the different expression value type.
Is the command DEGexp() different when we use the method MARS than when we use the method FET?

My output looks like this (example comparing s1 and s2):

#############analyse differentielle, methode FET, s2 vs s1, count#############
Please wait...

geneExpFile1: fileEXPR
gene id column in geneExpFile1: 1
expression value column(s) in geneExpFile1: 9 10
total number of reads uniquely mapped to genome obtained from sample1: 468022379 515938474

geneExpFile2: fileEXPR
gene id column in geneExpFile2: 1
expression value column(s) in geneExpFile2: 7 8
total number of reads uniquely mapped to genome obtained from sample2: 204283936 231306719

method to identify differentially expressed genes: FET
pValue threshold: 0.001
output directory: out

Please wait ...
Identifying differentially expressed genes ...
Please wait patiently ...

and it never ends.

Thanks for your help.

Maria

Last edited by maria.b; 02-16-2010, 08:56 AM.
Xi Wang replied

02-16-2010, 07:48 AM
Hi, Maria

Originally posted by maria.b

Hi everybody,
I'm using DEGseq to identify differentially expressed genes from expression values that I already have.

Thanks for using DEGseq.

Originally posted by maria.b

I would like to know how much time it takes to run the DEGexp function with the FET method. I receive the results for the LRT and MARS methods in a few minutes, but for the FET method I let it run for more than one night and it was still running. Is that normal?

What is your data size? At first glance, I don't think this is normal, but we need to find out what is causing the long run time.

Originally posted by maria.b

I have another question concerning the expression values. For the moment I calculate these values as the sum of reads on each base of a gene, not as the number of reads mapped to the gene, and then I transform these values into RPKM. Do you think that this will change anything in the differential expression analysis? What do you use to calculate these expression values?

The values computed your way are roughly equal to (read count) * (read length) * 10⁹ / (gene length) / (total reads) = RPKM * (read length)
It is OK to use your method, but to obtain RPKM you also need to divide the values by the read length.
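The relation above can be checked with a few lines of base R. All numbers are invented for illustration, and the sketch assumes every read lies entirely within the gene, so that the per-base sum equals read count times read length.

```r
# Made-up numbers for one gene.
read_count  <- 500    # reads mapped to the gene
gene_length <- 2000   # gene length in bases
total_reads <- 1e7    # total mapped reads in the sample
read_length <- 36     # length of each read in bases

# Standard RPKM from read counts.
rpkm <- read_count * 1e9 / (gene_length * total_reads)

# Summing coverage over every base counts each read once per base it covers.
# ASSUMPTION: all reads fall fully inside the gene.
per_base_sum <- read_count * read_length

# Applying the RPKM formula to the per-base sum inflates it by the read length.
value_from_sum <- per_base_sum * 1e9 / (gene_length * total_reads)

print(c(rpkm = rpkm,
        from_per_base_sum = value_from_sum,
        corrected = value_from_sum / read_length))
```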

Last edited by Xi Wang; 02-16-2010, 06:54 PM. Reason: a typo corrected
maria.b replied

02-16-2010, 04:46 AM
FET method

Hi everybody,

I'm using DEGseq to identify differentially expressed genes from expression values that I already have.

I would like to know how much time it takes to run the DEGexp function with the FET method. I receive the results for the LRT and MARS methods in a few minutes, but for the FET method I let it run for more than one night and it was still running. Is that normal?

I have another question concerning the expression values. For the moment I calculate these values as the sum of reads on each base of a gene, not as the number of reads mapped to the gene, and then I transform these values into RPKM. Do you think that this will change anything in the differential expression analysis? What do you use to calculate these expression values?

Thanks for your help

Maria
Xi Wang replied

02-10-2010, 07:36 PM
Originally posted by AmyL

Hi,

I was wondering what density is a measure of in the first output graph of DEGseq,

thanks,
Amy

The plot is generated by:

Code:

hist(LogVal(Sample1),main=label1,xlab="log2(Number of reads mapped to a gene)",col=4,breaks=100,freq=FALSE,ylim=c(0,0.5))

Using "freq=FALSE" means that probability densities are plotted, so that the histogram has a total area of one: sum(density * bin_width) = 1
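The area property can be verified without drawing anything by asking hist() to return its values. This sketch uses simulated counts, and plain log2 with a +1 offset stands in for the package's LogVal helper, which is an assumption on my part.

```r
# Simulated read counts per gene (not real data).
set.seed(1)
counts <- rpois(1000, lambda = 50) + 1   # +1 avoids log2(0)

# Compute the histogram without plotting it.
h <- hist(log2(counts), breaks = 100, plot = FALSE)

# Each bar's area is its density times its bin width;
# the areas of all bars add up to one.
bin_width <- diff(h$breaks)
area <- sum(h$density * bin_width)
print(area)
```

So a density of 0.4 on the y axis does not mean 40% of the genes; the fraction of genes in a bin is the density multiplied by the width of that bin.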
AmyL replied

02-10-2010, 01:58 PM
Hi,

I was wondering what density is a measure of in the first output graph of DEGseq,

thanks,
Amy
Xi Wang replied

01-29-2010, 10:17 PM
Hi lix,

Originally posted by lix

My mapped reads are the "eland" format like this:
26 CCTTTCCACATCTTTCTCCCTCGCT U1 0 1 1 chr12 81865484 R

So, my data should convert to the "eland" format that DEGseq supports like this:
26 CCTTTCCACATCTTTCTCCCTCGCT 81865484 U1 R

I'm just wondering whether my conversion was right.

How did you convert the format? If you used a script for the conversion, you can check the converted output directly. In any case, you need to get this step working.

BTW, after I used the getGeneExp() function, if all of the RPKM values in the expression value files are "0", does it mean that the DEGexp() will fail to read the expCol1 or expCol2 value?

Sure. Even if DEGexp() reads the values successfully, they will all be equal to 0.

And, is there any difference between the "valCol" in readGeneExp() and the "expCol" in DEGexp()?

You can treat them as the same. However, "valCol" can be any column, while "expCol" should only refer to expression columns.