RNA-seq output - SEQanswers

wangxj replied

10-30-2012, 06:43 PM
Originally posted by Davis McC:

Hi zorph

I am one of the developers of the Bioconductor package edgeR, which is designed for differential expression analysis of count data (such as RNA-seq). Check out the User's Guide for more details and for case studies that show how to use the package.

I'm not familiar with Wig files and can't tell what sort of analysis you've carried out already, but colleagues of mine suggest the following steps to go from raw RNA-seq short-read data (the raw fasta files) through to GO category testing. You may find this sort of analysis pipeline useful.

Steps with required tools & files

To perform the entire analysis, the following steps and tools will be needed:

1. Get some short read RNA-seq data, for at least two different experimental conditions you wish to compare

2. Choose a reference to map against, and map your data using a short read mapper that outputs in SAM format. We tend to use bowtie. Other options are bwa, SOAP2, novoalign, shrimp.

3. Use SAMtools to convert SAM output into the binary BAM format, which is both smaller on disk and allows for fast indexing.

4. Summarize reads on the gene/transcript/exon level. We use the R platform with the Rsamtools and GenomicFeatures packages.

5. Calculate DE genes from counts summarized on the gene level. We use the R package edgeR, which we have developed, although there are other tools out there. edgeR can account for biological variation in the data (using a negative binomial model), separate biological from technical variation, produce an MDS plot, and conduct exact testing procedures.

6. Perform GO category testing on the results of the differential expression analysis, using the R package goseq.
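As a rough illustration of what step 4 produces, here is a toy Python sketch of summarizing reads at the gene level. It is not how Rsamtools or GenomicFeatures work internally; the gene coordinates, the read list and the function name `count_reads_per_gene` are all made up for this example, and a read is assigned by its start position only.

```python
from collections import Counter

# Hypothetical annotation: gene name -> (chromosome, start, end), 1-based inclusive.
GENES = {
    "geneA": ("chr1", 100, 500),
    "geneB": ("chr1", 800, 1200),
    "geneC": ("chr2", 50, 300),
}

def count_reads_per_gene(reads, genes):
    """Count reads whose start position falls inside exactly one gene.

    `reads` is an iterable of (chromosome, position) pairs, e.g. taken from
    the RNAME and POS columns of a SAM file. Reads that hit no gene, or more
    than one gene, are skipped.
    """
    counts = Counter({name: 0 for name in genes})
    for chrom, pos in reads:
        hits = [name for name, (g_chrom, start, end) in genes.items()
                if g_chrom == chrom and start <= pos <= end]
        if len(hits) == 1:
            counts[hits[0]] += 1
    return dict(counts)

# Made-up alignments; the chr3 read hits no annotated gene and is dropped.
toy_reads = [("chr1", 120), ("chr1", 450), ("chr1", 900), ("chr2", 60), ("chr3", 10)]
gene_counts = count_reads_per_gene(toy_reads, GENES)
print(gene_counts)
```

The resulting table of raw counts per gene (one column per library in a real experiment) is the input that step 5 expects.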

Considerations for DE Analysis

Extra-Poisson variation (or overdispersion) is typical of RNA-seq data, especially if there is biological replication amongst your samples. If you only have technical replicates then this may not be an issue, but I would recommend running your data through edgeR to get some idea of the inter-library variability. If you have overdispersed data, then using a Poisson model will *drastically* overestimate the levels of differential expression in your data. Using an NB model, as in edgeR, can account for this extra variation in the data and give a much better assessment of DE.

edgeR can deal with overdispersion in the data, investigate inter-library (incl. biological) variability and get exact p-values for DE based on the NB model.
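To put a number on the overdispersion point, here is a small Python sketch comparing the spread a Poisson model expects with the spread under a negative binomial model. It assumes the common parameterisation variance = mu + dispersion * mu^2; the mean of 100 and the dispersion of 0.1 are made-up values, not estimates from any data set.

```python
import math

def poisson_variance(mu):
    """Variance of a Poisson count with mean mu."""
    return mu

def nb_variance(mu, dispersion):
    """Variance of a negative binomial count, assuming var = mu + dispersion * mu^2."""
    return mu + dispersion * mu ** 2

mu = 100.0        # hypothetical mean count for one gene
dispersion = 0.1  # hypothetical dispersion (biological CV of roughly 32%)

sd_poisson = math.sqrt(poisson_variance(mu))
sd_nb = math.sqrt(nb_variance(mu, dispersion))

# How many standard deviations is a difference of 30 counts under each model?
difference = 30.0
print(difference / sd_poisson)  # 3 Poisson standard deviations
print(difference / sd_nb)       # under 1 NB standard deviation
```

With these numbers a difference of 30 counts looks striking under the Poisson model but is well within the expected noise under the NB model, which is the sense in which a Poisson analysis overstates differential expression.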

Hope that is helpful and good luck with your data analysis. Please ask if you have any more questions I might be able to help with.

Best regards
Davis

Hi Davis,

Is edgeR suitable for data with biological replicates?
morellr replied

02-01-2012, 12:39 PM
Back to RNA-Seq output

Can we go back and address a question that was posed at the beginning of this thread? I'm sorry to digress from this interesting discussion of replicates and comparing RNA-seq data, but zorph has SOLiD reads, and I'm not sure I understand how those reads get into the pipeline that's now under discussion. As I understand it, the preferred analysis pipeline for RNA-seq passes through Bowtie/TopHat, but the most recent versions don't analyze colorspace. Also, I gather that, for DNA fragment reads, LifeScope does the best job of mapping colorspace reads, and the output is a BAM file. But is this also true of mapping RNA (cDNA) colorspace reads, i.e. does it do a good job of finding novel splice junctions etc.? If so, what is the best way to get the resulting BAM file into a format suitable for the TopHat/Cufflinks/DESeq downstream analyses? Can it be used as input for step 4 of the process outlined by Davis McC (post #2)? If not, what would people recommend as the best first step in getting SOLiD data mapped and converted into BAM format?
emilyjia2000 replied

02-01-2012, 10:37 AM
Hello,
I am not sure if this is the right place to raise my question. I would like to look at exon-level reads in RNA-seq data, but I don't know which tools are best for getting exon expression levels. The options on my list are DEXSeq, DESeq, Rsamtools and HTSeq. If anybody knows, please give me some suggestions.
Thanks in advance.
Davis McC replied

06-26-2011, 10:18 PM
Hi

Using "classic" edgeR (i.e. not the newer GLM methods), the abundance of each gene is estimated as a "concentration", i.e. the fraction of the total RNA sample that that particular gene accounts for. See the d$conc element of a DGEList object after applying estimateCommonDisp(). If you have just two groups, then logConc is log2 of the d$conc$conc.common element of the DGEList object, i.e. the average concentration across all samples. The logFC is log2( d$conc$conc.group[,1] / d$conc$conc.group[,2] ), i.e. the log2 fold change of the estimated concentrations of the two groups being compared. This is why you see that logFC is not just the counts divided by lib.sizes.
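To make the relationship between the concentrations, logConc and logFC concrete, here is a minimal Python sketch. Be aware of the assumption: edgeR estimates the concentrations under its negative binomial model, whereas this sketch uses pooled counts divided by pooled library sizes as a crude stand-in. The function name `log_conc_and_fc` and all numbers are made up for illustration.

```python
import math

def log_conc_and_fc(counts_g1, libs_g1, counts_g2, libs_g2):
    """Return (logConc, logFC) for one gene in a two-group comparison.

    Concentrations are approximated as pooled counts / pooled library size,
    which is NOT edgeR's estimator, only a simple illustration.
    """
    conc_g1 = sum(counts_g1) / sum(libs_g1)
    conc_g2 = sum(counts_g2) / sum(libs_g2)
    conc_common = (sum(counts_g1) + sum(counts_g2)) / (sum(libs_g1) + sum(libs_g2))
    log_conc = math.log2(conc_common)     # average concentration over all samples
    log_fc = math.log2(conc_g1 / conc_g2) # ratio of the two group concentrations
    return log_conc, log_fc

# Hypothetical counts for one gene, two libraries per group.
counts_a, libs_a = [40, 60], [1_000_000, 2_000_000]
counts_b, libs_b = [10, 15], [1_000_000, 1_500_000]

log_conc, log_fc = log_conc_and_fc(counts_a, libs_a, counts_b, libs_b)

# Naive alternative: average the per-library counts / lib.size within each group.
naive_a = sum(c / n for c, n in zip(counts_a, libs_a)) / len(counts_a)
naive_b = sum(c / n for c, n in zip(counts_b, libs_b)) / len(counts_b)
naive_log_fc = math.log2(naive_a / naive_b)

print(log_conc, log_fc, naive_log_fc)
```

Even in this simplified form the group-level logFC differs from the value obtained by averaging per-library ratios, because libraries of different size carry different weight.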

The software will certainly allow you to test your own normalization with edgeR; you can set the library sizes to a fixed value if you wish. However, I cannot guarantee that the results will be sensible! I would need to know more about what sort of normalization you want to do before I could offer any more advice on that.

Cheers
Davis
sdm replied

06-26-2011, 07:37 AM
Originally posted by Davis McC:

Hi zorph

I am one of the developers of the Bioconductor package edgeR, which is designed for differential expression analysis of count data (such as RNA-seq). Check out the User's Guide for more details and for case studies that show how to use the package.

Hi,

I have just started to use edgeR. From the package PDF it is not clear to me how the logFC value (topTags) is actually calculated; it seems not to be based on, e.g., the counts divided by lib.sizes. It would be great to know exactly how the logConc and logFC values from the topTags function are calculated.
I also ask because I wonder whether it is possible to test my own normalization with edgeR, using only the statistical negative binomial test. Can I do this by setting the library sizes to a fixed value, e.g. (1000000, 1000000 ... )?

Thanks !
Simon Anders replied

10-20-2010, 04:29 AM
Originally posted by Bob Settlage:

First, you mention no normalization, however, there is some mention of TMM normalization. Is this appropriate?

DESeq has an implicit normalization (the function 'estimateSizeFactors') that is conceptually nearly the same as what Robinson and Oshlack call "TMM". (The difference is that we use a median instead of a trimmed mean, and we consider it more appropriate to look at ratios rather than differences of counts.)
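For readers who want to see the median-of-ratios idea spelled out, here is a small Python sketch of it. It is an illustration of the principle described above, not the DESeq source code; the function name `estimate_size_factors` and the count table are made up, and genes with a zero count in any sample are simply skipped.

```python
import math
from statistics import median

def estimate_size_factors(counts):
    """Median-of-ratios size factors.

    `counts` is a list of genes, each a list of raw counts (one per sample).
    For every gene the geometric mean across samples serves as a reference;
    a sample's size factor is the median, over genes, of count / reference.
    """
    n_samples = len(counts[0])
    # Genes with any zero count have a zero geometric mean and are left out.
    usable = [row for row in counts if all(c > 0 for c in row)]
    log_geo_means = [sum(math.log(c) for c in row) / n_samples for row in usable]
    factors = []
    for j in range(n_samples):
        log_ratios = [math.log(row[j]) - lgm
                      for row, lgm in zip(usable, log_geo_means)]
        factors.append(math.exp(median(log_ratios)))
    return factors

# Hypothetical table: sample 2 was sequenced twice as deeply as sample 1,
# and the fourth gene is strongly differentially expressed.
toy_counts = [
    [10, 20],
    [100, 200],
    [50, 100],
    [5, 80],
    [0, 3],
]
size_factors = estimate_size_factors(toy_counts)
print(size_factors)
```

Because a median is used, the single strongly changed gene does not pull the size factors away from the twofold depth difference shared by the other genes.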

Second, there are some recent papers on length normalization, I guess this is more of a scaling, would this be appropriate?

Lastly, as TMM normalization is a global normalization and length scaling is per gene, is there an order preference?

You should not divide by length. See earlier threads here for explanation why. In case you refer to the biasing effect that length is supposed to have, I've discussed this here:

http://article.gmane.org/gmane.science.biology.informatics.conductor/30671

Last edited by Simon Anders; 10-20-2010, 05:54 AM.
Bob Settlage replied

10-20-2010, 04:19 AM
Two more questions.

First, you mention no normalization, however, there is some mention of TMM normalization. Is this appropriate?

Second, there are some recent papers on length normalization, I guess this is more of a scaling, would this be appropriate?

Lastly, as TMM normalization is a global normalization and length scaling is per gene, is there an order preference?
Simon Anders replied

08-30-2010, 12:25 AM
Originally posted by yh253:

I've been a bit confused by this 'sample pairs' concept. I have RNA-seq data for two samples, wild-type and knock-down, with 4 'biological replicates' each. The four replicates of each sample were sequenced on different lanes of a single flow cell, with two samples multiplexed on each lane. Is my data regarded as 'pairs', which can't be analyzed by DESeq or edgeR? So far, I have the gene-level read counts (RPKM values) from ERANGE and am going to proceed to DE analysis. If I can't use either of the two packages, do you have a suggestion of other tools for this purpose?

"Sample pairs" means that your samples come in pairs, each pair containing one treatment and one control, such that the two samples within a pair might be more similar than two control samples or two treatment samples. For example, if you have several patients, and from each patient you have one sample of normal tissue and one of tumor tissue, the differences between the patients might obscure the differences between tumor and normal, and you drastically lose power to make statistical discoveries if your method is not informed about which healthy sample is paired with which tumor sample.

BTW: You need raw, unnormalized counts to use edgeR or DESeq. RPKM values are not suitable.
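One way to see why the raw counts matter (a rough sketch, not taken from either package): count-based models judge how precise a measurement is from the number of reads behind it, and RPKM throws that number away. The gene lengths, counts and library size below are made up, and the Poisson relative error 1/sqrt(count) is used only as the simplest stand-in for sampling noise.

```python
import math

def rpkm(count, gene_length_bp, total_mapped_reads):
    """Reads per kilobase of gene per million mapped reads."""
    return count * 1e9 / (gene_length_bp * total_mapped_reads)

def poisson_relative_error(count):
    """Relative standard deviation of a Poisson count, i.e. 1 / sqrt(count)."""
    return 1.0 / math.sqrt(count)

total_reads = 1_000_000  # hypothetical library size
short_gene = {"count": 10, "length": 500}
long_gene = {"count": 100, "length": 5000}

rpkm_short = rpkm(short_gene["count"], short_gene["length"], total_reads)
rpkm_long = rpkm(long_gene["count"], long_gene["length"], total_reads)

# Identical RPKM values, but very different amounts of evidence behind them.
print(rpkm_short, rpkm_long)
print(poisson_relative_error(short_gene["count"]),
      poisson_relative_error(long_gene["count"]))
```

Both genes come out at the same RPKM, yet the one backed by 10 reads is far noisier than the one backed by 100, and that information can only be recovered from the raw counts.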

Simon
yh253 replied

08-29-2010, 03:15 PM
Originally posted by Simon Anders:

Tumor-normal sample pairs are proper replicates, of course, if you have several pairs. The specific issue with paired samples is that DESeq cannot deal with them at the moment (and neither can edgeR) but we are working on it.

Simon

Hi Simon,

I've been a bit confused by this 'sample pairs' concept. I have RNA-seq data for two samples, wild-type and knock-down, with 4 'biological replicates' each. The four replicates of each sample were sequenced on different lanes of a single flow cell, with two samples multiplexed on each lane. Is my data regarded as 'pairs', which can't be analyzed by DESeq or edgeR? So far, I have the gene-level read counts (RPKM values) from ERANGE and am going to proceed to DE analysis. If I can't use either of the two packages, do you have a suggestion of other tools for this purpose?
Simon Anders replied

08-25-2010, 10:35 AM
Originally posted by quix:

About the replicates, I have a further question.
I have submitted my samples for RNA-seq (1, control; 2, protein treatment for 1 hr; 3, protein treatment for 2 hrs). What I have done is to pool the RNA samples from three independent experiments (ctrl x3, 1 hr x3, 2 hrs x3). For each experiment, I have verified that the protein works on my cells.

Is this biological replication?

Not quite. Perhaps re-read my post #16. You will only see the average of your three replicates. How would you know that the spread within replicates (the within-group variance, in the terminology of ANOVA) is not as large as the differences that you observe between conditions (the between-groups variance)? Without this, you cannot calculate a p value and can only make wild guesses about the statistical significance of your findings.
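The within-group versus between-groups comparison can be sketched in a few lines of Python. This is plain one-way ANOVA arithmetic on made-up expression values for a single gene, meant only to show what pooling removes; the function name `anova_mean_squares` is hypothetical.

```python
def anova_mean_squares(groups):
    """Return (between-groups mean square, within-group mean square).

    `groups` maps a condition name to its list of measurements. The
    within-group mean square is None when no residual degrees of freedom
    are left, i.e. when every condition has a single measurement.
    """
    values = [v for g in groups.values() for v in g]
    grand_mean = sum(values) / len(values)
    k, n = len(groups), len(values)
    ss_between = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2
                     for g in groups.values())
    ss_within = sum((v - sum(g) / len(g)) ** 2
                    for g in groups.values() for v in g)
    ms_between = ss_between / (k - 1)
    ms_within = ss_within / (n - k) if n > k else None
    return ms_between, ms_within

# Three independent experiments kept separate (hypothetical values).
separate = {"ctrl": [100, 140, 120], "1hr": [150, 110, 160], "2hr": [170, 130, 180]}
# The same experiments pooled before sequencing: one value per condition.
pooled = {"ctrl": [120], "1hr": [140], "2hr": [160]}

print(anova_mean_squares(separate))  # both mean squares available
print(anova_mean_squares(pooled))    # within-group part is None
```

With the replicates kept separate, the between-groups mean square can be compared against the within-group one; after pooling, the within-group term cannot be computed at all, so there is nothing to test against.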

Why didn't you use multiplexing (i.e., bar-coding tags next to the sequencing primer) to keep your samples separable before pooling them into a sequencing lane?

Simon
quix replied

08-25-2010, 07:03 AM
Thanks Simon for your kind reply,

About the replicates, I have a further question.
I have submitted my samples for RNA-seq (1, control; 2, protein treatment for 1 hr; 3, protein treatment for 2 hrs). What I have done is to pool the RNA samples from three independent experiments (ctrl x3, 1 hr x3, 2 hrs x3). For each experiment, I have verified that the protein works on my cells.
Is this biological replication?

Thanks
Quix
Simon Anders replied

08-25-2010, 12:23 AM
Originally posted by quix:

I am good with steps 1-3 and 6. However, I am not very clear about the Rsamtools software indicated in step 4 and the DE gene calculation of step 5. Can anybody give a little more detail about these?

To my knowledge, there is no good explanation yet on how to use Rsamtools for this task. The ones found on the web have a few issues (see this post).

Hence, maybe you want to give my htseq-count tool a try.

Is it possible to run this software on my PC?

Usually yes.

One more question: how do I analyze the quality of RNA-seq output data?

For a first look, use htseq-qa or FastQC. Once you have counts, compare your replicates to see how well they agree. (I plan to add a section to the DESeq documentation on how to do that.)

Simon

Last edited by Simon Anders; 08-25-2010, 12:23 AM. Reason: fmt
quix replied

08-24-2010, 06:33 PM
Thanks to Davis for the great advice! Such information is really useful for beginners like me. I have learned a lot from the discussion here.

I am good with steps 1-3 and 6. However, I am not very clear about the Rsamtools software indicated in step 4 and the DE gene calculation of step 5. Can anybody give a little more detail about these?

Is it possible to run this software on my PC?

One more question: how do I analyze the quality of RNA-seq output data?

I am not a bioinformatics specialist and I know these questions look naive. Thanks for your answers.

Quix

Last edited by quix; 08-24-2010, 06:42 PM.
Simon Anders replied

08-23-2010, 11:57 PM
Originally posted by bioinfosm:

Simon, I am curious, what kind and how many replicates are you suggesting?
Best I have seen is tumor normal pairs for rna-seq data; but by replicates do you mean redundant lanes of data per sample which are somehow averaged to get a more accurate read-out?

Not quite. Imagine Charles finds a couple of genes which are, in one of his treatments, upregulated by 50% in comparison to the value in the controls, and he writes in his paper that these genes are obviously responding to the treatment.

Somebody else performs the same control experiment but does it twice, with two independent samples, and notices that Charles' genes differ between the two control samples by around 50%, too. This invalidates the initial conclusion that the genes' upregulation is due to the treatment, as it happens without treatment as well. Without replicates, you would never know.

So, all I am talking about is the old-fashioned rule that you should do every experiment several times in order to see how much the measured quantities change even if you don't change anything. While this is considered absolutely required in most subfields of biology, for some reason people forget about it once they use high-throughput sequencing.

What you suggested, i.e., spreading a given sample over several lanes (called "technical replicates" by some), will not help at all with this; nevertheless, it might be necessary in addition if you work with organisms with large exomes.

Tumor-normal sample pairs are proper replicates, of course, if you have several pairs. The specific issue with paired samples is that DESeq cannot deal with them at the moment (and neither can edgeR) but we are working on it.
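A rough illustration of why paired samples need a method that knows about the pairing, written as a Python sketch with ordinary t statistics rather than anything DESeq or edgeR does. The values are made-up log-expression measurements of one gene in four hypothetical patients, where patients differ a lot from each other but tumor is consistently about 2 units above normal.

```python
import math
from statistics import mean, stdev

# Hypothetical log-expression of one gene; index i is the same patient in both lists.
normal = [10.0, 20.0, 30.0, 40.0]
tumor = [12.1, 21.9, 32.2, 41.8]

def paired_t(a, b):
    """t statistic computed on the within-pair differences b - a."""
    diffs = [y - x for x, y in zip(a, b)]
    return mean(diffs) / (stdev(diffs) / math.sqrt(len(diffs)))

def unpaired_t(a, b):
    """Welch-style t statistic that ignores which samples belong together."""
    se = math.sqrt(stdev(a) ** 2 / len(a) + stdev(b) ** 2 / len(b))
    return (mean(b) - mean(a)) / se

t_paired = paired_t(normal, tumor)
t_unpaired = unpaired_t(normal, tumor)
print(t_paired, t_unpaired)
```

The same data give a very large statistic when the pairing is used and a negligible one when it is ignored, because the patient-to-patient spread swamps the consistent tumor versus normal shift.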

Simon
bioinfosm replied

08-23-2010, 10:18 AM
Simon, I am curious, what kind and how many replicates are you suggesting?
Best I have seen is tumor normal pairs for rna-seq data; but by replicates do you mean redundant lanes of data per sample which are somehow averaged to get a more accurate read-out?
