Seqanswers Leaderboard Ad

**gen2prot** · 04-06-2010, 06:33 AM

Hi Xi,

Thank you for the explanation. I'll try plotting the p-values. Another question. I have the absolute read counts for 4 male and 4 female drosophila flies. I tried both the MARS and the FET method on them. The results are similar (the FET method catches maybe 50-100 more genes than MARS) Is there a reason why u would use the MARS method over FET or vice-versa?

Thanks
Abhijit

**Xi Wang** · 04-07-2010, 10:16 PM

Originally posted by gen2prot View Post

Hi Xi,

Thank you for the explanation. I'll try plotting the p-values. Another question. I have the absolute read counts for 4 male and 4 female drosophila flies. I tried both the MARS and the FET method on them. The results are similar (the FET method catches maybe 50-100 more genes than MARS) Is there a reason why u would use the MARS method over FET or vice-versa?

Thanks
Abhijit

Sorry for the late reply.

Its quite right that those methods give different results. This is because different methods build on different statistical models, and will reflect different aspects of statistical significance. The p-value derived from different methods are not comparable. If the gene ranks are the same, the tradeoff between sensitivity and specificity will be the same between two methods. Then only thing he needs to do it choose the appropriate threshold based on the empirical distribution of p-value.

**m!x** · 04-15-2010, 11:09 AM

Hi Xi,

I got a question regarding the p-value generated by MARS method. Is it corrected for multiple comparisons by any methods, eg. Bonferroni correction?
I have zero background about this statistical analysis and would appreciate if you could give some insight into this.

**Xi Wang** · 04-15-2010, 06:31 PM

Originally posted by m!x View Post

Hi Xi,

I got a question regarding the p-value generated by MARS method. Is it corrected for multiple comparisons by any methods, eg. Bonferroni correction?
I have zero background about this statistical analysis and would appreciate if you could give some insight into this.

Hi m!x,

Actually, in our package, for the methods MARS, FET, LRT, and MATR, both p-values and q-values are calculated. Q-value is the corrected p-value with respect to the multiple testing. Two kinds of q-values are given in DEGseq package, one is given by Benjamini correction, and the other by Storey correction. The two methods take different aspects for correction. The Benjamini correction gives more false negatives, while the Storey correction more false positives.

More insight of multiple testing correction, please refer to http://en.wikipedia.org/wiki/Multiple_comparisons

**RSK** · 04-30-2010, 02:25 PM

Default FDR

Hi Xi,

What is the default FDR % used for the Benjamini and Storey corrections for DEGexp? Also where can I find the vignette for the DEGseq package that member Simon Anders refers to in another post?

Thanks,
RSK

**Xi Wang** · 05-01-2010, 04:37 AM

Hi RSK,

The default FDR used for the Benjamini or the Storey correction is 0.001. The documents for DEGseq can be found at:

DEGseq

http://www.bioconductor.org/packages/release/bioc/html/DEGseq.html

DEGseq is an R package to identify differentially expressed genes from RNA-Seq data.

Thanks.

**RSK** · 05-01-2010, 05:44 PM

Thanks Xi!

**Davis McC** · 05-23-2010, 10:43 PM

Gene/transcript length bias

Hi Adam

Originally posted by adamreid View Post

I guess I'm asking whether there is/ought to be a correction for gene length because more reads are expected to map to longer genes.

This is definitely something to be taken into consideration. See Oshlack and Wakefield, 'Transcript length bias in RNA-seq data confounds systems biology' (http://www.biology-direct.com/content/4/1/14) for further discussion.

The more we look into RNA-seq data, the more it seems that various normalization steps are required to get the most out of the data.

Cheers
Davis

**Xi Wang** · 05-23-2010, 10:57 PM

Originally posted by Davis McC View Post

Hi Adam

This is definitely something to be taken into consideration. See Oshlack and Wakefield, 'Transcript length bias in RNA-seq data confounds systems biology' (http://www.biology-direct.com/content/4/1/14) for further discussion.

The more we look into RNA-seq data, the more it seems that various normalization steps are required to get the most out of the data.

Cheers
Davis

Hi Davis,

You are quite right that the gene length and also expression levels affect the statistical power of detecting DEGs. Goseq provides a good example on how to solve this problem. Please note that the normalization itself cannot completely eliminate the biases.

**apratap** · 05-26-2010, 07:51 AM

Just following up on this educating thread on RNA-Seq calculation.

Is it fine to normalize by gene length or should we use transcript length( i.e sum of exons) and NOT sum of (exons+introns) for a gene

Thanks !
-Abhi

**Xi Wang** · 05-26-2010, 08:38 AM

Originally posted by apratap View Post

Just following up on this educating thread on RNA-Seq calculation.

Is it fine to normalize by gene length or should we use transcript length( i.e sum of exons) and NOT sum of (exons+introns) for a gene

Thanks !
-Abhi

Hi Abhi,

If you just want to identify differentially expressed genes, as is suggested in the DEGseq manuscript, you'd better not normalize the raw read counts. The reason is that the hypothesis tests (including Fisher's exact test, likelihood ratio test, and our MARS) are based on the random sampling model, and this model requires the raw counts as input to correctly calculate p-values. You are right that, according to the known gene annotation, only reads in the exonic regions should be counted.
Alternatively, if you want to estimate gene expression levels, you can refer to the concept RPKM. A newly published NBT paper made great efforts to clarify the correct way to estimate gene expression.
Hope this helps.

**David Lyon** · 05-26-2010, 09:59 AM

Hi Xi or anyone

I have the tophat and cufflinks output files from mapping RNA seq against the annotated genome from different tissues. As I am a newbie can someone provide the recommended method in a step by step guide how to use the results for input into DEGseq and process them by DEGseq to get the differential expression from the different tissues.

Thank you

**Xi Wang** · 05-26-2010, 05:33 PM

Hi David,

You can first use this script to convert the SAM files output by Tophat to BED files, and then feed the BED files to DEGseq. See the example of function DEGseq in the manual.
Any further questions, just let me know.

**Davis McC** · 05-26-2010, 07:25 PM

DE Analysis pipeline

Hi David

I am one of the developers for the Bioconductor package edgeR , which is designed for carrying out differential expression analysis of count data (like RNA-seq). Check out the User's Guide for more details and case studies to provide examples on how to use the package.

Colleagues of mine suggest the following sort of steps to go from raw RNA-seq short read data from the raw fasta files, through to GO category testing.

It sounds like you would jump in somewhere in the middle, since you have output files already from tophat and cufflinks. NB: similar sort of approach to that suggested by Xi, just a little more fleshed-out.

Steps with required tools & files

To perform the entire analysis, the following steps and tools will be needed:

1. Get some short read RNA-seq data, for at least two different experimental conditions you wish to compare

2. Choose a reference to map against, and map your data using a short read mapper that outputs in SAM format. We tend to use bowtie. Other options are bwa, SOAP2, novoalign, shrimp.

3. Use SAMtools to convert SAM output into the binary BAM format, which is both smaller on disk and allows for fast indexing.

4. Summarize reads on the gene/transcript/exon level. We use the R platform with the Rsamtools and GenomicFeatures packages.

5. Calculate DE genes from counts summarized on the gene level. We use the R package edgeR, which we have developed, although there are other tools out there, such as DEGseq. edgeR can account for biological variation in the data (using a negative binomial model), separate biological from technical variation, produce an MDS plot, and conduct exact testing procedures.

6. Perform GO category testing on the results of the differential expression analysis, using the R package goseq.

Considerations for DE Analysis

You should find the visualization tools in DEGseq useful, but you may find that the Poisson/variable sampling rate model does not adequately model the variability in the data. Extra-Poisson variation (or overdispersion) is typical of RNA-seq data, especially if there is biological replication amongst your samples. If you only have technical replicates then this may not be an issue, but I would recommend running your data through edgeR to get some idea of the inter-library variability. If you have overdispersed data, then using a Poisson model will *drastically* overestimate the levels of differential expression in your data. Using a NB model like in edgeR can account for this extra variation in the data and give much better assessment of DE.

DEGseq links in quite well with edgeR, so you might look at your data first with DEGseq then use edgeR to deal with overdispersion in the data, investigate inter-library (incl. biological) variability and get exact p-values for DE based on the NB model.

Hope that is helpful and good luck with your data analysis. Please ask if you have any more questions I might be able to help with.

Best regards
Davis

**Xi Wang** · 05-27-2010, 06:59 PM

Thanks, Davis. You give us a very clear and detailed way to analyze RNA-seq data, especially if one want to compare genes’ differential expression in two experimental conditions. Here, I still have some minor points to complete.

1. To process RNA-seq short read data, one may be suggested to use Tophat or SpliceMap for mapping the short reads against the reference genome (e.g., the human genome). These tools have the ability to detect splicing junctions, even novel junctions. As mentioned by Daves, Bowtie is a very fast mapper, but if one wants to apply Bowtie directly to RNA-seq data, one should prepared junction sequences (put two near ends of adjacent exons together to form a new sequence), in order to detect potential junctions.

2. SAM format is recommended to save the read mapping results, as it can be used in general cases and reserves lots of information. If one only focuses on DE, we provide a very simple BED-like format for user to choose. We also provide a PERL script (if cannot access, contact me) to convert the general SAM format to the BED-like format. This BED-like format can be directly fed to DEGseq.

3. As DEGseq can directly deal with 6-colunm BED-like files, no others additional tools are required. The embedded C++ code is able to process read counting swiftly.

4. DEGseq can not only give users a list of DE genes, also provide figures showing the loci of DE genes in the M-A plot, which offers the researchers a direct visual impression on the DE genes identified.

5. In the DEGseq Bioconductor package, the function samWrapper, which imports the very popular SAM method in microarray studies, can deal with the RNA-seq data with biological replicates. Rather than the strategy involved in edgeR, samWrapper uses an empirical approach to estimate the FDR of DE gene identification.

6. For high level analyses, given the DE gene list by DEGseq, users may have very flexible options to choose tools according to their specific scientific questions. GO term enrichment, pathway enrichment, and gene set enrichment analyses are all under users’ choice. GOseq, specifically designed for RNA-seq (and other techniques may result in selection bias), is recommended.

Hope the above information helps. I’d like to help researchers in the community with data analysis or other problems that I am able to.

Cheers!

Topics	Statistics	Last Post
Gene Misexpression in the Healthy Human Population by seqadmin Started by seqadmin, Yesterday, 06:46 AM	0 responses 9 views 0 likes	Last Post by seqadmin Yesterday, 06:46 AM
New Method for Rapid Genetic Diagnosis of Mendelian Disorders by seqadmin Started by seqadmin, 07-24-2024, 11:09 AM	0 responses 26 views 0 likes	Last Post by seqadmin 07-24-2024, 11:09 AM
Advancing Nanopore Technology for Portable Sensing Devices by seqadmin Started by seqadmin, 07-19-2024, 07:20 AM	0 responses 159 views 0 likes	Last Post by seqadmin 07-19-2024, 07:20 AM
New RNA-Based Gene Writing Technology Achieves Precise Gene Integration by seqadmin Started by seqadmin, 07-16-2024, 05:49 AM	0 responses 127 views 0 likes	Last Post by seqadmin 07-16-2024, 05:49 AM

Seqanswers Leaderboard Ad

Announcement

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Latest Articles

ad_right_rmr

News