**Proteos** · 03-07-2012, 06:43 AM

Is there anyone who can help me?

**Simon Anders** · 03-07-2012, 07:39 AM

RNA-Seq read counts usually show quite good proportionality to the concentration of transcripts, as several authors demonstrated with spike-ins. Hence, the counts are a good measure of expression strength, once you normalize for sequencing depth. For more information, see this thread, e.g. my post #13.

It might also be appropriate to divide by transcript length. If you have multiple isoforms, figuring out what length to use is a highly non-trivial (but nevertheless often ignored) problem.

There are also bias effects due to GC content, transcript length, etc., which you may want to look into, but these are not always that important.

**Proteos** · 03-08-2012, 06:29 AM

Thank you very much Simon!
You saved me!

So, I've read your 'post #13' and used the function estimateSizeFactors() to estimate the factors instead of using the library sizes.
The full logic is as follows:
* get counts
* get newCountDataSet from counts and conditions (in my case simply the samples)
* calculate size factors
* calculate base means

I realized that dividing the counts by the size factor gives what would be 'baseMeanA' if you ran nbinomTest() (with one sample per condition, as in my case).
(baseMeanB is simply the second condition's base mean.)
I'm not sure, though, how to get 'baseMean' itself, i.e. the first column in nbinomTest's result, or whether it is the value I need.

I guess not, and what I need is the counts divided by the size factor, i.e. baseMeanA. Is that correct?

So, with my scarce R knowledge and with the help of Google, I created a little script that I am sharing:

Code:

```r
#!/usr/bin/env Rscript
# Accepts a counts file as input (fileIn)
# Outputs a file with the base means (fileOut)
#
# In the R shell, create the variable
#   fileIn = "filename.counts"
# and run this script, for example:
#   fileIn="counts_hs.tab"
#   source("tool.counts_to_bmeans.r")

fileOut=paste(fileIn,"_bmeans.csv",sep="")

library( DESeq )

# Read the counts data from file
countsTable <- read.delim( fileIn, header=TRUE, stringsAsFactors=TRUE )

# Convert column 1 to row names (assumes that column is called "name")
rownames( countsTable ) <- countsTable$name
countsTable <- countsTable[ , -1 ]

# Get conditions from column names.
# 'conds' should describe the conditions, but in our case, where every
# sample is separate, it is the same as the sample names
conds <- colnames(countsTable)

# Calculate size factors to normalize the count data
cds <- newCountDataSet( countsTable, conds )
cds <- estimateSizeFactors( cds )

# Calculate base means: divide each sample's counts by its size factor
nfeatures <- nrow(assayData(cds)$counts)
nfactors  <- NROW(pData(cds)$sizeFactor)
mfactors <- matrix(pData(cds)$sizeFactor,nfeatures,nfactors, byrow=TRUE)
bmeans <- assayData(cds)$counts / mfactors

# Output to tab file
#write.table(bmeans, file="cn.tab", sep="\t")

# Output to CSV file
write.csv(bmeans, file=fileOut)
```

I've checked the script. It is working correctly.
(compared the results to baseMeanA of nbinomTest() results and they are the same.)

Now I will take these base means and simply divide them by the summed length of the exons of the corresponding gene.
(My case is simple: I don't need isoform-level expression; I just need pretty much a Yes/No value.)

Right now, I will simply import the base means into Excel and do the length normalization there. Later on, maybe I will do it in R, or better in Python, and will share that too.
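
For illustration, the length normalization described above could look roughly like this in Python. This is only a sketch under assumptions: the function name `per_kb`, the dictionary layout, and the gene names and exon lengths are invented for the example, and the input is assumed to be the size-factor-normalized counts produced by the script.

```python
def per_kb(norm_counts, exon_lengths):
    """Divide each gene's normalized counts by its summed exon length in kb.

    norm_counts:  dict mapping gene name -> list of normalized counts (one per sample)
    exon_lengths: dict mapping gene name -> summed exon length in bases
    Both layouts are assumptions made for this sketch.
    """
    result = {}
    for gene, values in norm_counts.items():
        kb = exon_lengths[gene] / 1000.0
        result[gene] = [v / kb for v in values]
    return result


# Made-up example values, not real data
norm_counts = {"geneA": [200.0, 100.0], "geneB": [50.0, 0.0]}
exon_lengths = {"geneA": 2000, "geneB": 500}
density = per_kb(norm_counts, exon_lengths)
```

Dividing by kilobases rather than bases only rescales the numbers; for a Yes/No call, what matters is that the same unit is used for every gene.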

Please let me know if I am doing something wrong.
I hope not, because I have to present these data next Wednesday before my group and I don't have much time

Thanks again!

P.S. I think it would be a good idea to create a Python script for all this.
If you are interested and think it is worthwhile, at some point, when it is more mature, it could even be added to your HTSeq package; I will be more than glad to share what I have.

**Simon Anders** · 03-08-2012, 06:39 AM

'baseMean' itself is just the mean of the normalized counts over all samples, ignoring the condition.

If you want to get rid of your script's dependency on R, just reimplement the scheme I explained in the post in Python. With numpy, this should be just three or four lines.
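
A rough sketch of such a port, assuming the scheme meant here is the median-of-ratios estimate that estimateSizeFactors() computes: each sample's size factor is the median, over genes, of the count divided by that gene's geometric mean across samples. It is written with the standard library only instead of numpy so that it runs anywhere; the function names and the toy counts are made up for the example.

```python
import math
from statistics import median


def size_factors(counts):
    """Median-of-ratios size factors.

    counts: list of rows (one per gene), each a list of per-sample counts.
    Genes with a zero count in any sample are skipped, because their
    geometric mean is zero and the ratio would be undefined.
    """
    n_samples = len(counts[0])
    usable = [row for row in counts if all(c > 0 for c in row)]
    geo_means = [
        math.exp(sum(math.log(c) for c in row) / n_samples) for row in usable
    ]
    factors = []
    for j in range(n_samples):
        ratios = [row[j] / gm for row, gm in zip(usable, geo_means)]
        factors.append(median(ratios))
    return factors


def base_means(counts, factors):
    """Mean of the size-factor-normalized counts over all samples, per gene."""
    return [
        sum(c / f for c, f in zip(row, factors)) / len(row) for row in counts
    ]


# Toy data: sample 2 is sequenced twice as deeply as sample 1
counts = [[10, 20], [100, 200], [5, 10], [0, 7]]
factors = size_factors(counts)
means = base_means(counts, factors)
```

With these toy counts the second size factor comes out twice as large as the first, so after normalization the first three genes have equal values in both samples.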

**Proteos** · 03-08-2012, 06:47 AM

And what would you say about the algorithm and the logic?
Is everything correct, or am I missing something?

Right now, because the task is simple, it is not worth porting to Python and R is fine.
(I have very limited time in these weeks)

But at some point, when I am more of a 'hardcore HTS guy', I will probably try to do that.

**Proteos** · 03-08-2012, 06:52 AM

Oops, forgot to ask:
If I have replicates, should I do something in addition to this? (variance, dispersion etc)
At least for my simple task?

**ugolino** · 10-30-2012, 04:30 PM

Hi Proteos,

I am also interested in read coverage of intergenic regions and have only BED-formatted coordinates for those. Could you possibly share, or give instructions on, how to modify htseq-count to accept a BED file with, say, 4 columns (chrom, start, end, gene_name)? BED-formatted coordinates are 1-based start and 0-based end, but it is the opposite in HTSeq, correct? I am decent in Perl, but clueless in Python.

thanks much!

**ugolino** · 10-30-2012, 05:09 PM

Correction: BED format is 0-based start and 1-based end, so it's the same as HTSeq.
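
One possible route that avoids modifying htseq-count at all is to convert the BED file into GTF-style lines first. The sketch below assumes a tab-separated, 4-column BED line (chrom, start, end, name); the function name, the `region` feature type and the `bed` source label are arbitrary choices for this example, and the strand is written as "." because a 4-column BED file carries no strand.

```python
def bed_to_gtf_line(bed_line, source="bed", feature="region"):
    """Convert one 4-column BED line into a GTF-style line."""
    chrom, start, end, name = bed_line.rstrip("\n").split("\t")[:4]
    gtf_start = int(start) + 1  # BED start is 0-based, GTF start is 1-based
    gtf_end = int(end)          # BED end is exclusive, which equals the 1-based inclusive end
    attributes = 'gene_id "%s";' % name
    fields = [chrom, source, feature, str(gtf_start), str(gtf_end),
              ".", ".", ".", attributes]
    return "\t".join(fields)


# Made-up coordinates, for illustration only
example = bed_to_gtf_line("chr1\t999\t2000\tregionA")
print(example)
```

The resulting file could then presumably be passed to htseq-count with `-t region -i gene_id`, and with `-s no` since no strand information is available.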

HTSeq-count and BED-formatted coordinates
