**Davis McC** · 05-26-2010, 07:31 PM

DE Analysis Pipeline with edgeR

Hi zorph

I am one of the developers of the Bioconductor package edgeR, which is designed for differential expression analysis of count data (such as RNA-seq). The User's Guide has more details, along with case studies that show how to use the package.

I'm not familiar with Wig files and can't tell what sort of analysis you've carried out already, but colleagues of mine suggest the following steps to go from raw RNA-seq short-read data (the raw fasta files) through to GO category testing. You may find this sort of analysis pipeline useful.

Steps with required tools & files

To perform the entire analysis, the following steps and tools will be needed:

1. Get some short read RNA-seq data, for at least two different experimental conditions you wish to compare

2. Choose a reference to map against, and map your data using a short read mapper that outputs in SAM format. We tend to use bowtie. Other options are bwa, SOAP2, novoalign, shrimp.

3. Use SAMtools to convert SAM output into the binary BAM format, which is both smaller on disk and allows for fast indexing.

4. Summarize reads on the gene/transcript/exon level. We use the R platform with the Rsamtools and GenomicFeatures packages.

5. Calculate DE genes from counts summarized on the gene level. We use the R package edgeR, which we have developed, although there are other tools out there. edgeR can account for biological variation in the data (using a negative binomial model), separate biological from technical variation, produce an MDS plot, and conduct exact testing procedures.

6. Perform GO category testing on the results of the differential expression analysis, using the R package goseq.
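To make step 4 more concrete, here is a minimal Python sketch of what "summarizing reads on the gene level" means. It is only an illustration, not what Rsamtools or GenomicFeatures do internally: the gene coordinates, the SAM records and the function name `count_reads_per_gene` are all made up for the example, and a read is assigned to a gene simply by its start position.

```python
from collections import Counter

# Hypothetical annotation: gene name -> (chromosome, start, end), 1-based inclusive.
GENES = {
    "geneA": ("chr1", 100, 500),
    "geneB": ("chr1", 800, 1200),
}

def count_reads_per_gene(sam_lines, genes):
    """Count mapped reads whose start position falls inside each gene."""
    counts = Counter({name: 0 for name in genes})
    for line in sam_lines:
        if line.startswith("@"):          # skip SAM header lines
            continue
        fields = line.rstrip("\n").split("\t")
        flag, chrom, pos = int(fields[1]), fields[2], int(fields[3])
        if flag & 4:                      # FLAG bit 0x4: read is unmapped
            continue
        for name, (g_chrom, start, end) in genes.items():
            if chrom == g_chrom and start <= pos <= end:
                counts[name] += 1
    return dict(counts)

# Made-up SAM records (tab-separated): QNAME, FLAG, RNAME, POS, ...
sam = [
    "@HD\tVN:1.0",
    "r1\t0\tchr1\t150\t255\t36M\t*\t0\t0\t*\t*",
    "r2\t16\tchr1\t900\t255\t36M\t*\t0\t0\t*\t*",
    "r3\t4\t*\t0\t0\t*\t*\t0\t0\t*\t*",
    "r4\t0\tchr1\t450\t255\t36M\t*\t0\t0\t*\t*",
]
gene_counts = count_reads_per_gene(sam, GENES)
print(gene_counts)
```

Running this once per sample gives one column of the counts table that step 5 takes as input. Real tools also have to handle reads that overlap several features, spliced alignments and strandedness, which this sketch ignores.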

Considerations for DE Analysis

Extra-Poisson variation (or overdispersion) is typical of RNA-seq data, especially if there is biological replication amongst your samples. If you only have technical replicates then this may not be an issue, but I would recommend running your data through edgeR to get some idea of the inter-library variability. If your data are overdispersed, then using a Poisson model will *drastically* overestimate the levels of differential expression in your data. Using a negative binomial (NB) model, as in edgeR, can account for this extra variation and give a much better assessment of DE.

edgeR can deal with overdispersion in the data, investigate inter-library (incl. biological) variability and get exact p-values for DE based on the NB model.
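As a rough illustration of what overdispersion looks like, the sketch below compares the sample variance of one gene's counts across replicates with its mean. Under a Poisson model the variance equals the mean; under the NB model the variance is mean + dispersion × mean². The counts are invented, the libraries are assumed to be of equal size, and the method-of-moments estimate used here is a crude stand-in, not the estimator edgeR uses.

```python
from statistics import mean, variance

def moment_dispersion(counts):
    """Return (mean, sample variance, crude NB dispersion) for one gene.

    Under the NB model, var = mu + phi * mu**2, so a simple
    method-of-moments estimate is phi = (var - mu) / mu**2,
    truncated at zero (zero corresponds to the Poisson case).
    """
    m = mean(counts)
    v = variance(counts)
    phi = max(0.0, (v - m) / (m * m)) if m > 0 else 0.0
    return m, v, phi

# Hypothetical counts for one gene in four biological replicates.
replicate_counts = [80, 120, 95, 150]
m, v, phi = moment_dispersion(replicate_counts)
print(f"mean={m:.2f} variance={v:.2f} dispersion={phi:.4f}")
```

If the variance is clearly larger than the mean, as it is for these made-up counts, a Poisson test would treat ordinary replicate-to-replicate scatter as evidence of differential expression.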

Hope that is helpful and good luck with your data analysis. Please ask if you have any more questions I might be able to help with.

Best regards
Davis

**Simon Anders** · 05-26-2010, 11:06 PM

Hi

Davis gave a nice summary of how to do it.

Two additional points (which I mainly put to advertise our software):

- An alternative to edgeR is our package, DESeq. DESeq's method is based on edgeR's, but different in a number of points (and we think, of course, that this makes it better). See our paper for the exact differences.

The main point, however, is that you get a proper analysis only if you have a method that can, as Davis writes, "deal with overdispersion in the data, investigate inter-library (incl. biological) variability", and to my knowledge, edgeR and its derivative, DESeq, are the only tools currently available that do this properly.

- While both edgeR and DESeq are easy enough to use that even users unfamiliar with R will manage, the summarization step might be a bit more tricky. An alternative is htseq-count.

Simon

**Livi81** · 05-31-2010, 09:25 AM

I'm also looking for a biologist-friendly way to analyse RNA-seq output files. I saw that Partek seems to have some nice software. How does it compare to edgeR and DESeq?
Thanks

**jgibbons1** · 05-31-2010, 12:02 PM

Hi Livi81,
I've played around with the trial version of the Partek software quite a bit and was not thrilled with it. My major problem was speed. I was working with Illumina data sets consisting of 25 million reads each. Even with 4 GB of RAM, the software stalled and froze my computer a few times. You may be able to avoid this by bringing in already-mapped output rather than doing the mapping in the Partek software. I found working in a UNIX or R environment to be much better for me. It's worth calling the company for a free trial, though.

**townway** · 07-14-2010, 08:15 AM

Before moving to step 4 (summarize reads on the gene/transcript/exon level):

Do you think it is necessary to remove the reads mapping to rRNA and pseudogene regions? And do you know how to do this from a SAM file?

Thank you

**Simon Anders** · 07-14-2010, 08:41 AM

Originally posted by townway:

Do you think it is necessary to remove the reads mapping to rRNA and pseudogene regions? And do you know how to do this from a SAM file?

No, why should this be necessary? After step 4, you have a table of counts, with one row for each gene. Provided the rRNA and pseudogenes were in your annotation, there will also be some rows for these, and you can conveniently kick out these rows if you don't like them. You can also leave them in. After all, if you get counts for a pseudogene, the gene may not be that 'pseudo' and you may want to look at it. And the counts for rRNA may be informative for judging the effectiveness of the rRNA removal step of your sample prep. Of course, any differential expression that edgeR or DESeq may report for them will be biologically meaningless.
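A small Python sketch of this idea, assuming the counts table is held as a dictionary and that each row has a biotype label taken from the annotation. The gene names, biotype labels, counts and function names are all hypothetical.

```python
# Hypothetical counts table: one row per gene, one column per sample.
counts_table = {
    "geneA":    [120, 98, 210, 250],
    "rRNA_1":   [50000, 42000, 61000, 58000],
    "pseudo_1": [3, 0, 5, 2],
}
# Hypothetical biotype labels taken from the annotation.
biotype = {
    "geneA": "protein_coding",
    "rRNA_1": "rRNA",
    "pseudo_1": "pseudogene",
}

def drop_biotypes(table, biotype, unwanted):
    """Return a copy of the table without rows of the unwanted biotypes."""
    return {g: c for g, c in table.items() if biotype.get(g) not in unwanted}

def rrna_fraction(table, biotype):
    """Per-sample fraction of all counts that fall on rRNA rows."""
    n_samples = len(next(iter(table.values())))
    totals = [sum(row[i] for row in table.values()) for i in range(n_samples)]
    rrna = [
        sum(row[i] for g, row in table.items() if biotype.get(g) == "rRNA")
        for i in range(n_samples)
    ]
    return [r / t if t else 0.0 for r, t in zip(rrna, totals)]

# Look at the rRNA share first (a check on the sample prep), then filter.
fractions = rrna_fraction(counts_table, biotype)
filtered = drop_biotypes(counts_table, biotype, {"rRNA", "pseudogene"})
print(fractions)
print(filtered)
```

Filtering at the table level like this means the SAM files never need to be touched, and the rRNA rows remain available as a quality check before they are dropped.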

Simon

**greigite** · 07-15-2010, 11:11 AM

Hopefully this question is allowed on the RNA-seq forum

I'm interested to get opinions from the developers of edgeR and DESeq (and others) about whether the statistical analyses in these packages are appropriate for a couple of other types of biological count data. Specifically, I work on metagenomic analyses of complex microbial communities (in soils, plants, water, etc.). The type of data I'm working with is sequencing reads, typically produced on 454, that are then annotated through various pipelines. The outcome is a set of counts of genes with particular annotations or in specific functional categories. The genome space of the community is certainly greatly undersampled, as in many RNA-seq experiments, but the magnitude of difference in counts is smaller. Could I apply one of these packages to my data? I have biological but not technical replicates at the moment. The second type of data is counts of the number of organisms in particular phylogenetic categories, and this is closer to RNA-seq data in that there are a few highly abundant categories and a long tail of low-abundance types. Again, I have biological but not technical replicates.

**Simon Anders** · 07-20-2010, 08:39 AM

Hi

In principle, both edgeR and DESeq are suitable for any kind of count data for which the model fits. The assumption of a negative binomial distribution is quite robust; the crucial question is the variance-mean relation.

To do proper statistics, you need to have a reasonable estimate of the variance for each gene (or gene category, or species, or clade, or whatever it is you count in meta-genomics). As one typically has only few replicates, one needs to assume that genes of similar expression strength (or clades of similar abundance, or whatever) have similar variance.
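To show what "borrowing" variance information from features of similar abundance could look like, here is a deliberately simple Python sketch. It sorts features by mean count and replaces each per-feature variance with the median variance among its neighbours in that ordering. This is a simplified stand-in, not the procedure that DESeq or edgeR actually implement; the table, the function name and the `window` parameter are made up.

```python
from statistics import mean, median, variance

def pooled_variances(table, window=1):
    """Median of per-feature variances over neighbours of similar mean count.

    Features are ranked by mean; each one is pooled with the `window`
    features on either side of it in that ranking.
    """
    stats = sorted((mean(c), variance(c), name) for name, c in table.items())
    pooled = {}
    for i, (_, _, name) in enumerate(stats):
        lo, hi = max(0, i - window), min(len(stats), i + window + 1)
        pooled[name] = median(v for _, v, _ in stats[lo:hi])
    return pooled

# Hypothetical counts, three replicates per feature.
table = {
    "g1": [10, 12, 11],
    "g2": [11, 13, 15],
    "g3": [12, 30, 3],      # noisy estimate from only three replicates
    "g4": [16, 18, 17],
    "g5": [200, 220, 210],
}
pooled = pooled_variances(table, window=1)
print(pooled)
```

With only three replicates, the raw variance of `g3` is dominated by chance; pooling pulls it towards the values seen for features of similar abundance. Whether that is justified depends on how well the shared variance-mean relation holds for the data at hand.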

In case of DESeq, there are diagnostics (the variance residuals, visualized with 'residualEcdfPlot' and also used to find 'variance outliers') that allow you to check how well this model fits, so that you know whether you can put trust in your results.

So, yes, it is worth a try, and I'd be very interested to hear how it goes.

Cheers
Simon

**greigite** · 07-20-2010, 01:32 PM

Thank you, Simon. I will try out DESeq on my data and let you know how it goes. BTW I would also be very interested in a way to compare multiple treatments. In the present project I have 3 treatments each with 3 biological replicates.

**crh** · 08-20-2010, 03:01 AM

Simon and Davis,

I have 4 sets of SOLiD reads (control and 3 experimental) that I'd like to generate DE for. There are no replicates for these samples.

I was initially planning to simply normalize against the control (RPKM), but this now seems like not the way to go. Will either edgeR or DESeq generate DE for non-replicated data sets?

thanks

Charles

**Simon Anders** · 08-20-2010, 03:24 AM

Hi Charles

Short answer: no. You cannot get useful results from an experiment without replication, no matter what tool you use. (Why do people keep wasting their time and money on producing such data?)

Longer answer: DESeq has a mode for working with data without replicates that can give you at least those genes which really stick out by having far larger fold-changes than the rest. However, you might see only a small part of your potential hits.

Simon

**bioinfosm** · 08-23-2010, 10:18 AM

Simon, I am curious, what kind and how many replicates are you suggesting?
Best I have seen is tumor normal pairs for rna-seq data; but by replicates do you mean redundant lanes of data per sample which are somehow averaged to get a more accurate read-out?

**Simon Anders** · 08-23-2010, 11:57 PM

Originally posted by bioinfosm:

Simon, I am curious, what kind and how many replicates are you suggesting?
Best I have seen is tumor normal pairs for rna-seq data; but by replicates do you mean redundant lanes of data per sample which are somehow averaged to get a more accurate read-out?

Not quite. Imagine Charles finds a couple of genes which are, in one of his treatments, upregulated by 50% in comparison to the value in the controls, and he writes in his paper that these genes are obviously responding to the treatment.

Somebody else performs the same control experiment but does it twice, with two independent samples, and notices that Charles' genes differ between the two control samples by around 50%, too. This invalidates the initial conclusion that the genes' upregulation is due to the treatment, as it happens without treatment as well. Without replicates, you would never know.

So, all I am talking about is the old-fashioned rule that you should do every experiment several times in order to see how much the measured quantities change even if you don't change anything. While this is considered absolutely required in most subfields of biology, for some reason people forget about it once they use high-throughput sequencing.
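The argument can be put into a few lines of Python. This is a crude heuristic for illustration only, not a statistical test and not what DESeq or edgeR compute; the counts are invented, assumed to be already normalized, and the function name is hypothetical.

```python
from statistics import mean

def effect_exceeds_replicate_noise(treated, controls):
    """Compare the treatment fold change with the spread among controls.

    Returns (treatment fold change vs. mean control,
             largest fold change between any two control replicates,
             whether the former is larger than the latter).
    """
    if len(controls) < 2 or min(controls) <= 0:
        raise ValueError("need at least two positive control replicates")
    treatment_fc = treated / mean(controls)
    control_fc = max(controls) / min(controls)
    return treatment_fc, control_fc, treatment_fc > control_fc

# One gene: two independent control samples already differ by 50%.
controls = [100, 150]
print(effect_exceeds_replicate_noise(150, controls))   # treated looks "50% up" vs. first control
print(effect_exceeds_replicate_noise(400, controls))   # a much larger change
```

With a single control of 100, a treated value of 150 looks like a 50% increase. Once the second control (150) is available, the same value is indistinguishable from ordinary sample-to-sample variation, whereas a value of 400 still stands out.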

What you suggested, i.e., spreading a given sample over several lanes (called "technical replicates" by some), will not help at all with this; nevertheless, it might be necessary in addition if you work with organisms with large exomes.

Tumor-normal sample pairs are proper replicates, of course, if you have several pairs. The specific issue with paired samples is that DESeq cannot deal with them at the moment (and neither can edgeR) but we are working on it.

Simon

**quix** · 08-24-2010, 06:33 PM

Thanks to Davis for the great advice! Such information is really useful for beginners like me. I learned a lot from the discussion here.

I am fine with steps 1-3 and 6. However, I am not very clear about the Rsamtools software mentioned in step 4 or the DE gene calculation in step 5. Can anybody give a little more detail about these?

Is it possible to run this software on my PC?

One more question: how do I assess the quality of RNA-seq output data?

I don't major in bio-informatics and I know these questions look naive.... Thanks for your answers

Quix