**Bukowski** · 08-16-2014, 12:54 PM

Originally posted by thickrick99

Hi Everyone,

I am a beginner in RNA-seq and am working on a project dealing with differential gene expression. I am trying to compare the gene expression between two populations each having 2-5 samples.

I am looking for genes that are differentially expressed between these two populations. I thought that it would be interesting to look at genes that have the exact same expression, but have different isoform expression levels. I ran Tophat, and Cufflinks to get my gene.fpkm_tracking and isoform.fpkm_tracking.

You know that cuffdiff basically does this for you? I'm not sure why you're seeking a statistical test here (maybe I'm missing the point). It's not uncommon to find genes that are not significantly differentially expressed at the gene level, because there is no net change, but that have significantly differentially expressed isoforms, where one goes up by the same amount as the other goes down. This is just a matter of checking what is significant in each case in the cuffdiff outputs.

The tracking files are not where you need to be looking; I'd advise you to look at the .diff files.

Another place to find these cases is the splicing.diff file, which will show you the genes whose distribution of isoforms changes between conditions.
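To make the "check what is significant in each case" step concrete, here is a minimal sketch in Python. It assumes the cuffdiff outputs `gene_exp.diff` and `isoform_exp.diff` are tab-separated with `test_id`, `gene_id` and `significant` columns; the function names and the toy rows are made up for illustration, and real files carry more columns, which the reader simply ignores.

```python
import csv
import io

def significant_ids(diff_text, id_column):
    """Return the IDs whose 'significant' column is 'yes'."""
    reader = csv.DictReader(io.StringIO(diff_text), delimiter="\t")
    return {row[id_column] for row in reader if row["significant"] == "yes"}

def isoform_switch_candidates(gene_diff_text, isoform_diff_text):
    """Map gene_id -> significant isoform IDs, for genes that are
    NOT significant at the gene level."""
    sig_genes = significant_ids(gene_diff_text, "gene_id")
    reader = csv.DictReader(io.StringIO(isoform_diff_text), delimiter="\t")
    hits = {}
    for row in reader:
        if row["significant"] == "yes" and row["gene_id"] not in sig_genes:
            hits.setdefault(row["gene_id"], []).append(row["test_id"])
    return hits

# Toy stand-ins for gene_exp.diff and isoform_exp.diff (invented rows).
GENE_DIFF = (
    "test_id\tgene_id\tsignificant\n"
    "XLOC_1\tXLOC_1\tno\n"
    "XLOC_2\tXLOC_2\tyes\n"
)
ISO_DIFF = (
    "test_id\tgene_id\tsignificant\n"
    "TCONS_1\tXLOC_1\tyes\n"
    "TCONS_2\tXLOC_1\tyes\n"
    "TCONS_3\tXLOC_2\tyes\n"
)

print(isoform_switch_candidates(GENE_DIFF, ISO_DIFF))
```

With real data you would pass the contents of the two files instead of the toy strings, for example `open("gene_exp.diff").read()`.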

**thickrick99** · 08-16-2014, 01:04 PM

Ok thanks for the advice! I haven't run cuffdiff yet but I will now. So what you're saying is, I can look for genes with similar expression in the .diff file and then look at the isoforms that have a difference using the cuffdiff output files right?

Do you (or anyone else) have a suggestion regarding what type of plots I can use? It seems like 5000+ genes have similar expression, so what would be the best way to plot such a large amount of data?

Thanks!

**thickrick99** · 08-16-2014, 01:33 PM

I understand that I don't actually need to compute a test statistic myself, since programs like cuffdiff can do it for me. However, as part of the project I am working on, I need to program a test statistic in R in order to get and compare the p-values rather than having the programs do it for me. So how should I choose the test statistic for my design (2 unpaired populations, each with 2-5 samples)? I am trying to compare differential gene expression, and I am not sure how to choose a statistical test for this using the FPKM values from cufflinks.

**dpryan** · 08-17-2014, 01:02 AM

The most common test statistic in your circumstance would be the T-statistic. That ends up being similar to what cuffdiff is using internally anyway.
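For reference, a minimal sketch of a two-sample t-statistic for one gene, written in plain Python. The Welch (unequal variance) form and the log2(FPKM + 1) transform are assumptions made for this illustration, not something cuffdiff is confirmed to do; the function names are hypothetical.

```python
import math
from statistics import mean, variance

def log_fpkm(values, pseudocount=1.0):
    """log2-transform FPKMs; the pseudocount is an assumed choice
    that keeps zero FPKMs finite."""
    return [math.log2(v + pseudocount) for v in values]

def welch_t(group_a, group_b):
    """Welch's t-statistic and Welch-Satterthwaite degrees of freedom.

    Needs at least 2 samples per group, and fails with a division by
    zero if both groups have zero variance.
    """
    na, nb = len(group_a), len(group_b)
    va, vb = variance(group_a), variance(group_b)  # sample variances
    se2 = va / na + vb / nb                        # squared standard error
    t = (mean(group_a) - mean(group_b)) / math.sqrt(se2)
    df = se2 ** 2 / ((va / na) ** 2 / (na - 1) + (vb / nb) ** 2 / (nb - 1))
    return t, df

# Invented FPKMs for one gene in two unpaired populations.
pop1 = log_fpkm([0.0, 3.0, 7.0])
pop2 = log_fpkm([15.0, 31.0, 63.0])
t_stat, dof = welch_t(pop1, pop2)
print(t_stat, dof)
```

The p-value then comes from a t distribution with `dof` degrees of freedom. If SciPy is available, `scipy.stats.ttest_ind(pop1, pop2, equal_var=False)` performs the same Welch test and returns the p-value directly.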

**thickrick99** · 08-17-2014, 04:59 AM

Yeah, that's what I was planning to do as well. But I read online about other tests based on Poisson or negative binomial models, and I wasn't sure whether these are better than the T-test.

**dpryan** · 08-17-2014, 06:14 AM

Only the T-test is compatible with FPKMs.

**thickrick99** · 08-17-2014, 06:28 AM

So if I wanted to use more complicated tests based on Poisson or negative binomial models, I would have to use the raw read count data, right? Where do I access this information, assuming I used TopHat and then Cufflinks? Do I have to convert FPKM to read counts, or are the read counts in the accepted_hits.bam file?

**dpryan** · 08-17-2014, 07:14 AM

The general workflow is to map with tophat (or STAR or whatever else you want to use) and then quantify with htseq-count or featureCounts. The latter two will give you counts that you can use in a negative binomial model. If you used cufflinks to find new features, then just run it first and use the merged GTF file with the aforementioned counting programs. I wouldn't bother with a Poisson model, it's not worth your time.

BTW, there's no great way to convert between FPKM and raw counts, since the latter doesn't use multimappers while the former does.

**thickrick99** · 08-17-2014, 08:20 AM

OK, thanks Devon! So I have my count data, but how do I go about using the negative binomial model? I am not very familiar with these types of models (since I just started in this field very recently). For the project I don't want to use existing tools like edgeR, which use negative binomial models, but would prefer programming my own model in R. Is there a way to do this, and how would I get the p-values to compare the differential expression of the genes?

It would be great if you could give me some advice on how to program the negative binomial model into R and any resources that you think would help me do this in order to compare differential gene expression between the two populations.

Thanks for all your help!

**dpryan** · 08-17-2014, 08:44 AM

The simplest way would be to use glm.nb() from the MASS library.

**Gordon Smyth** · 08-18-2014, 12:51 AM

Originally posted by thickrick99

I am not very familiar with these types of models (since I just started in this field very recently). For the project I don't want to use existing tools like edgeR which use negative binomial models but would prefer programming my own model into R.

If you are not familiar with the relevant models, wouldn't it make sense to use existing tools?

Anyway, the edgeR methods, dispersion estimation especially, are sufficiently sophisticated that it is very unlikely you could reproduce them yourself in any reasonable amount of time.

For example, glm.nb() implements simple maximum likelihood for the dispersion parameter, which will markedly underestimate the true dispersions for RNA-seq data, and hence give overly liberal DE results. You need software like edgeR to do better; there's no easy way around it.

**thickrick99** · 08-18-2014, 06:09 AM

OK, thanks for your help Gordon. So I tried using the t-test on the FPKM values, but I realized that I can't do the test because of the 0 FPKM values. Is there an accepted way to get around this so that I can do the t-test on the FPKM values?

**Gordon Smyth** · 08-18-2014, 04:10 PM

It is impossible to do a high performance statistical test on FPKM values alone, because they have varying precisions, and the precision depends on the original count size rather than on the FPKM value itself.
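A small numeric sketch of that point, in Python. It assumes the textbook FPKM definition (fragments per kilobase of transcript per million mapped fragments) rather than Cufflinks' actual model-based estimate, and it uses pure Poisson counting noise as a rough lower bound on variability; the gene lengths, counts and library size are invented.

```python
import math

def fpkm(fragments, length_bp, total_fragments):
    """Textbook FPKM: fragments per kilobase per million mapped fragments."""
    return fragments * 1e9 / (length_bp * total_fragments)

def poisson_cv(fragments):
    """Coefficient of variation of a Poisson count with this mean."""
    return 1.0 / math.sqrt(fragments)

TOTAL = 20_000_000  # assumed library size (mapped fragments)

# Two hypothetical genes with identical FPKM but very different counts.
short_gene = {"fragments": 10, "length_bp": 500}
long_gene = {"fragments": 1000, "length_bp": 50_000}

for name, g in (("short", short_gene), ("long", long_gene)):
    value = fpkm(g["fragments"], g["length_bp"], TOTAL)
    cv = poisson_cv(g["fragments"])
    print(f"{name}: FPKM={value:.2f}, counting-noise CV={cv:.3f}")
```

Both genes come out at an FPKM of 1, yet the relative counting noise of the short gene is ten times that of the long one. A test that sees only the FPKM values cannot tell the two situations apart.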


Advice for Statistics in Gene Expression??
