  • wangxj
    replied
    Originally posted by Davis McC View Post
    Hi zorph

    I am one of the developers for the Bioconductor package edgeR, which is designed for carrying out differential expression analysis of count data (like RNA-seq). Check out the User's Guide for more details and case studies to provide examples on how to use the package.

    I'm not familiar with Wig files and can't tell what sort of analysis you've carried out already, but colleagues of mine suggest the following sort of steps to go from raw RNA-seq short read data from the raw fasta files, through to GO category testing. You may find this sort of analysis pipeline useful.

    Steps with required tools & files

    To perform the entire analysis, the following steps and tools will be needed:

    1. Get some short read RNA-seq data, for at least two different experimental conditions you wish to compare

    2. Choose a reference to map against, and map your data using a short read mapper that outputs in SAM format. We tend to use bowtie. Other options are bwa, SOAP2, novoalign, shrimp.

    3. Use SAMtools to convert SAM output into the binary BAM format, which is both smaller on disk and allows for fast indexing.

    4. Summarize reads on the gene/transcript/exon level. We use the R platform with the Rsamtools and GenomicFeatures packages.

    5. Calculate DE genes from counts summarized on the gene level. We use the R package edgeR, which we have developed, although there are other tools out there. edgeR can account for biological variation in the data (using a negative binomial model), separate biological from technical variation, produce an MDS plot, and conduct exact testing procedures.

    6. Perform GO category testing on the results of the differential expression analysis, using the R package goseq.

    Considerations for DE Analysis

    Extra-Poisson variation (or overdispersion) is typical of RNA-seq data, especially if there is biological replication amongst your samples. If you only have technical replicates then this may not be an issue, but I would recommend running your data through edgeR to get some idea of the inter-library variability. If you have overdispersed data, then using a Poisson model will *drastically* overestimate the levels of differential expression in your data. Using a NB model like in edgeR can account for this extra variation in the data and give much better assessment of DE.

    edgeR can deal with overdispersion in the data, investigate inter-library (incl. biological) variability and get exact p-values for DE based on the NB model.

    Hope that is helpful and good luck with your data analysis. Please ask if you have any more questions I might be able to help with.

    Best regards
    Davis
    Hi Davis,

    Is edgeR suitable for data with biological replicates?


  • morellr
    replied
    Back to RNA-Seq output

    Can we go back and address a question that was posed at the beginning of this thread? I'm sorry to digress from this interesting discussion of replicates and comparing RNA-seq data, but zorph has SOLiD reads, and I'm not sure I understand how those reads get into the pipeline that's now under discussion. As I understand it, the preferred analysis pipeline for RNA-seq passes through Bowtie/TopHat, but the most recent versions don't analyze colorspace. Also, I gather that, for DNA fragment reads, LifeScope does the best job of mapping colorspace reads, and the output is a BAM file. But is this also true of mapping RNA (cDNA) colorspace reads, i.e. does it do a good job of novel splice finding etc.? If so, then what is the best way to get the resulting BAM file into a format for the TopHat/Cufflinks/DESeq downstream analyses? Can it be used as input for step 4 of the process outlined by Davis McC (post #2)? If not, then what would people recommend as the best first step in getting SOLiD data mapped and converted into BAM format?


  • emilyjia2000
    replied
    Hello,
    I am not sure if this is the right place to raise my question. I would like to look at exon-level reads in RNA-seq data, but I don't know which tools are better for getting exon expression levels. The options on my list are DEXSeq, DESeq, Rsamtools and HTSeq. If anybody happens to know, please give me some suggestions.
    Thanks in advance.


  • Davis McC
    replied
    Hi

    Using "classic" edgeR (i.e. not the newer GLM methods) the abundance of each gene is estimated as a "concentration", i.e. the proportion of the total RNA sample that that particular gene accounts for. See the d$conc element of a DGEList object after applying estimateCommonDisp(). If you have just two groups, then logConc is log2 of the d$conc$conc.common element of the DGEList object, i.e. the average concentration across all samples. The logFC is log2( d$conc$conc.group[,1] / d$conc$conc.group[,2] ), i.e. the log2 fold change of the estimated concentrations of the two groups being compared. This is why you see that logFC is not just the counts divided by lib.sizes.

    The software will certainly allow you to test your own normalization with edgeR; you can set the library sizes to a fixed value if you wish. However, I cannot guarantee that the results will be sensible! I would need to know more about what sort of normalization you want to do to be able to offer any more advice on that.

    Cheers
    Davis
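
As a rough illustration of why a group-level fold change differs from per-library counts divided by lib.sizes, the sketch below computes a pooled "concentration" per group and takes the log2 ratio. This is a simplification and not edgeR's estimator: edgeR estimates concentrations under its negative binomial model, whereas here a group's concentration is assumed to be total counts over total library size. All function names and numbers are hypothetical.

```python
import math

def group_concentration(counts, lib_sizes):
    """Pooled proportion of reads assigned to one gene within a group."""
    return sum(counts) / sum(lib_sizes)

def log2_fold_change(counts_a, libs_a, counts_b, libs_b):
    """log2 ratio of group A's pooled concentration to group B's."""
    return math.log2(group_concentration(counts_a, libs_a) /
                     group_concentration(counts_b, libs_b))

# Hypothetical counts for one gene in two groups of two libraries each.
counts_a, libs_a = [100, 300], [1_000_000, 2_000_000]
counts_b, libs_b = [50, 50], [1_000_000, 1_000_000]

log_fc = log2_fold_change(counts_a, libs_a, counts_b, libs_b)

# Overall concentration across all samples, analogous in spirit to logConc.
log_conc = math.log2(group_concentration(counts_a + counts_b, libs_a + libs_b))

print(f"logFC = {log_fc:.3f}, logConc = {log_conc:.3f}")
```

Setting every library size to the same fixed value, as asked above, would make the pooled concentration depend on the counts alone, so any normalization would then have to be applied to the counts beforehand.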


  • sdm
    replied
    Originally posted by Davis McC View Post
    Hi zorph

    I am one of the developers for the Bioconductor package edgeR, which is designed for carrying out differential expression analysis of count data (like RNA-seq). Check out the User's Guide for more details and case studies to provide examples on how to use the package.

    Hi,

    I have just started to use edgeR. From the package PDF it is not clear to me how the logFC value (topTags) is actually calculated; it does not seem to be based on, e.g., the counts divided by lib.sizes. It would be great to know exactly how the values logConc and logFC from the topTags function are calculated.
    I am also asking because I wonder whether it is possible to test my own normalization with edgeR, using only the negative binomial test. Can I do this by setting the library sizes to a fixed value, e.g. (1000000, 1000000 ... )?

    Thanks !


  • Simon Anders
    replied
    Originally posted by Bob Settlage View Post
    First, you mention no normalization, however, there is some mention of TMM normalization. Is this appropriate?
    DESeq has an implicit normalization (the function 'estimateSizeFactors') that is conceptually nearly the same as what Robinson and Oshlack call "TMM". (The difference is that we use a median instead of a trimmed mean, and we consider it more appropriate to look at ratios rather than differences of counts.)

    Second, there are some recent papers on length normalization, I guess this is more of a scaling, would this be appropriate?

    Lastly, as TMM normalization is a global normalization and length scaling is per gene, is there an order preference?
    You should not divide by length. See earlier threads here for explanation why. In case you refer to the biasing effect that length is supposed to have, I've discussed this here:
    Last edited by Simon Anders; 10-20-2010, 05:54 AM.
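
The median-of-ratios idea described above can be sketched in a few lines. This is a minimal illustration of the general technique, assuming a plain list-of-lists count table, and not the DESeq source code; the function name `size_factors` and the example counts are hypothetical.

```python
import math
from statistics import median

def size_factors(counts):
    """Median-of-ratios size factors.

    counts: list of per-gene rows, each row a list of raw counts per sample.
    """
    n_samples = len(counts[0])
    # Use only genes seen in every sample, so the geometric mean is nonzero.
    usable = [row for row in counts if all(c > 0 for c in row)]
    # Per-gene log geometric mean across samples (a pseudo-reference sample).
    log_geo_means = [sum(math.log(c) for c in row) / n_samples for row in usable]
    factors = []
    for j in range(n_samples):
        # Ratio of each gene's count to the reference, on the log scale.
        log_ratios = [math.log(row[j]) - lg
                      for row, lg in zip(usable, log_geo_means)]
        factors.append(math.exp(median(log_ratios)))
    return factors

# Hypothetical table: sample 2 was sequenced twice as deeply as sample 1.
counts = [
    [10, 20],
    [100, 200],
    [1000, 2000],
    [0, 5],      # skipped: zero count in one sample
    [50, 400],   # strongly differentially expressed gene
]

factors = size_factors(counts)
print(factors)
```

Because the median is taken over ratios, the one strongly changed gene does not pull the estimate away from the twofold depth difference shared by the other genes.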


  • Bob Settlage
    replied
    Two more questions.

    First, you mention no normalization, however, there is some mention of TMM normalization. Is this appropriate?

    Second, there are some recent papers on length normalization, I guess this is more of a scaling, would this be appropriate?

    Lastly, as TMM normalization is a global normalization and length scaling is per gene, is there an order preference?


  • Simon Anders
    replied
    Originally posted by yh253 View Post
    I've been a bit confused by this 'sample pairs' concept. I have RNA-seq data for two samples, wild-type and knocked-down, with 4 'biological replicates' for each. All four replicates of each sample were sequenced on different lanes of a single flow cell, with two samples on each lane using multiplexing. Is my data regarded as 'pairs', which can't be analyzed by DESeq or edgeR? So far, I have gene-level read counts (RPKM values) from ERANGE, and I am going to proceed to DE analysis. If I can't use either of the two packages, do you have a suggestion of other tools for this purpose?
    "Sample pairs" means that your samples come in pairs, each pair containing one treatment and one control, such that the two samples within a pair might be more similar than two control samples or two treatment samples. For example, if you have several patients, and from each patient you have one sample of normal tissue and one of tumor tissue, the differences between the patients might obscure the differences between tumor and normal. You then drastically lose power to make statistical discoveries if your method is not informed about which healthy sample is paired with which tumor sample.

    BTW: You need raw, unnormalized counts to use edgeR or DESeq. RPKM values are not suitable.

    Simon
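
To make the loss of power concrete, here is a small sketch comparing a test statistic that ignores the pairing with one that uses it. It uses ordinary t statistics on made-up log2 expression values purely for illustration; it is not how DESeq or edgeR test for differential expression, and the patient values are hypothetical.

```python
import math
from statistics import mean, stdev

def unpaired_t(a, b):
    """Welch t statistic treating the two groups as independent samples."""
    se = math.sqrt(stdev(a) ** 2 / len(a) + stdev(b) ** 2 / len(b))
    return (mean(b) - mean(a)) / se

def paired_t(a, b):
    """t statistic on within-pair differences (a[i] and b[i] share a patient)."""
    diffs = [y - x for x, y in zip(a, b)]
    return mean(diffs) / (stdev(diffs) / math.sqrt(len(diffs)))

# Hypothetical log2 expression of one gene in four patients.
# Patients differ from each other far more than tumor differs from normal.
normal = [5.0, 8.0, 11.0, 6.5]
tumor = [6.1, 8.9, 12.2, 7.3]

t_unpaired = unpaired_t(normal, tumor)
t_paired = paired_t(normal, tumor)

print(f"unpaired t = {t_unpaired:.2f}, paired t = {t_paired:.2f}")
```

With these numbers the consistent tumor-versus-normal shift of about one log2 unit is swamped by patient-to-patient differences in the unpaired statistic, but stands out clearly once each tumor sample is compared with its own normal sample.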


  • yh253
    replied
    Originally posted by Simon Anders View Post
    Tumor-normal sample pairs are proper replicates, of course, if you have several pairs. The specific issue with paired samples is that DESeq cannot deal with them at the moment (and neither can edgeR) but we are working on it.

    Simon
    Hi Simon,

    I've been a bit confused by this 'sample pairs' concept. I have RNA-seq data for two samples, wild-type and knocked-down, with 4 'biological replicates' for each. All four replicates of each sample were sequenced on different lanes of a single flow cell, with two samples on each lane using multiplexing. Is my data regarded as 'pairs', which can't be analyzed by DESeq or edgeR? So far, I have gene-level read counts (RPKM values) from ERANGE, and I am going to proceed to DE analysis. If I can't use either of the two packages, do you have a suggestion of other tools for this purpose?


  • Simon Anders
    replied
    Originally posted by quix View Post
    About the replicates, I have a further question.
    I have submitted my samples for RNA-seq (1, control; 2, protein treatment for 1 hr; 3, protein treatment for 2 hrs). What I have done is to pool the RNA samples from three independent experiments (ctrl x3, 1 hr x3, 2 hrs x3). For each experiment, I have verified that the protein works on my cells.

    Is this biological replication?
    Not quite. Maybe re-read my post #16. You will only see the average of your three replicates. How would you know that the spread within replicates (the within-group variance, in the terminology of ANOVA) is not as large as the differences that you observe between conditions (the between-groups variance)? Without this, you cannot calculate a p value and can only make wild guesses about the statistical significance of your findings.

    Why didn't you use multiplexing (i.e., bar-coding tags next to the sequencing primer) to keep your samples separable before pooling them into a sequencing lane?

    Simon
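
The effect of pooling can be shown with a toy example: once the three experiments are mixed into one library, only something like their average is measured, and the within-group variance can no longer be estimated. The values below are hypothetical, and the helper name is an illustration only.

```python
from statistics import mean, variance, StatisticsError

# Hypothetical normalized expression of one gene in three independent experiments.
control = [100.0, 150.0, 95.0]

def within_group_variance(values):
    """Sample variance, or None when a single (pooled) value is all there is."""
    try:
        return variance(values)
    except StatisticsError:
        return None

# Samples kept separate (e.g. by multiplexing): mean and spread are both available.
separate = (mean(control), within_group_variance(control))

# Samples pooled before sequencing: roughly the mean is measured, and nothing else.
pooled_control = [mean(control)]
pooled = (mean(pooled_control), within_group_variance(pooled_control))

print("separate:", separate)
print("pooled:  ", pooled)
```

Both designs report the same group mean, but only the separate design gives a variance to compare against the difference between conditions.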


  • quix
    replied
    Thanks Simon for your kind reply,

    About the replicates, I have a further question.
    I have submitted my samples for RNA-seq (1, control; 2, protein treatment for 1 hr; 3, protein treatment for 2 hrs). What I have done is to pool the RNA samples from three independent experiments (ctrl x3, 1 hr x3, 2 hrs x3). For each experiment, I have verified that the protein works on my cells.
    Is this biological replication?

    Thanks
    Quix


  • Simon Anders
    replied
    Originally posted by quix View Post
    I am good with steps 1-3 and 6. However, I am not very clear about the Rsamtools software indicated in step 4 or the DE gene calculation of step 5. Can anybody give a few more details about these?
    To my knowledge, there is no good explanation yet on how to use Rsamtools for this task. The ones found on the web have a few issues (see this post).

    Hence, maybe you want to give my htseq-count tool a try.

    Is it possible to run this software on my PC?
    Usually yes.

    One more question: how do I analyze the quality of RNA-seq output data?
    For a first look, use htseq-qa or FastQC. Once you have counts, compare your replicates to see how well they agree. (I plan to add a section to the DESeq documentation on how to do that.)

    Simon
    Last edited by Simon Anders; 08-25-2010, 12:23 AM. Reason: fmt
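
One simple way to compare replicates once counts are available is to correlate their log-transformed counts. The sketch below is only one possible check, assuming per-gene raw counts in the same gene order for each library; the choice of Pearson correlation on log2(count + 1) and all the numbers are illustrative assumptions, not a recommendation taken from DESeq or HTSeq.

```python
import math

def log2_counts(counts):
    """log2(count + 1), so that zero counts remain defined."""
    return [math.log2(c + 1) for c in counts]

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)

# Hypothetical raw counts for six genes.
rep1 = [0, 12, 150, 980, 43, 7]
rep2 = [1, 15, 140, 1010, 39, 9]      # a well-behaved replicate of rep1
suspect = [30, 2, 900, 45, 0, 310]    # a library that does not resemble rep1

r_good = pearson(log2_counts(rep1), log2_counts(rep2))
r_suspect = pearson(log2_counts(rep1), log2_counts(suspect))

print(f"rep1 vs rep2:    r = {r_good:.3f}")
print(f"rep1 vs suspect: r = {r_suspect:.3f}")
```

A replicate that correlates much worse with its group than the others do is worth inspecting (sample swap, failed library, contamination) before any differential expression testing.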


  • quix
    replied
    Originally posted by Davis McC View Post
    Hi zorph

    I am one of the developers for the Bioconductor package edgeR, which is designed for carrying out differential expression analysis of count data (like RNA-seq). Check out the User's Guide for more details and case studies to provide examples on how to use the package.

    I'm not familiar with Wig files and can't tell what sort of analysis you've carried out already, but colleagues of mine suggest the following sort of steps to go from raw RNA-seq short read data from the raw fasta files, through to GO category testing. You may find this sort of analysis pipeline useful.

    Steps with required tools & files

    To perform the entire analysis, the following steps and tools will be needed:

    1. Get some short read RNA-seq data, for at least two different experimental conditions you wish to compare

    2. Choose a reference to map against, and map your data using a short read mapper that outputs in SAM format. We tend to use bowtie. Other options are bwa, SOAP2, novoalign, shrimp.

    3. Use SAMtools to convert SAM output into the binary BAM format, which is both smaller on disk and allows for fast indexing.

    4. Summarize reads on the gene/transcript/exon level. We use the R platform with the Rsamtools and GenomicFeatures packages.

    5. Calculate DE genes from counts summarized on the gene level. We use the R package edgeR, which we have developed, although there are other tools out there. edgeR can account for biological variation in the data (using a negative binomial model), separate biological from technical variation, produce an MDS plot, and conduct exact testing procedures.

    6. Perform GO category testing on the results of the differential expression analysis, using the R package goseq.

    Considerations for DE Analysis

    Extra-Poisson variation (or overdispersion) is typical of RNA-seq data, especially if there is biological replication amongst your samples. If you only have technical replicates then this may not be an issue, but I would recommend running your data through edgeR to get some idea of the inter-library variability. If you have overdispersed data, then using a Poisson model will *drastically* overestimate the levels of differential expression in your data. Using a NB model like in edgeR can account for this extra variation in the data and give much better assessment of DE.

    edgeR can deal with overdispersion in the data, investigate inter-library (incl. biological) variability and get exact p-values for DE based on the NB model.

    Hope that is helpful and good luck with your data analysis. Please ask if you have any more questions I might be able to help with.

    Best regards
    Davis

    Thanks to Davis for your great advice! Such information is really useful for beginners like me. I learned a lot from the discussion here.

    I am good with steps 1-3 and 6. However, I am not very clear about the Rsamtools software indicated in step 4 or the DE gene calculation of step 5. Can anybody give a few more details about these?

    Is it possible to run this software on my PC?

    One more question: how do I analyze the quality of RNA-seq output data?

    I don't major in bioinformatics, and I know these questions look naive. Thanks for your answers.

    Quix
    Last edited by quix; 08-24-2010, 06:42 PM.
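
The overdispersion point in the quoted pipeline post can be illustrated numerically. Under a Poisson model the variance equals the mean, while under a negative binomial model it is the mean plus dispersion times the squared mean. The sketch below compares how significant the same difference looks under each model, using a crude normal approximation rather than edgeR's exact test; the means, replicate number and dispersion value are hypothetical.

```python
import math

def poisson_variance(mu):
    """Poisson: variance equals the mean."""
    return mu

def nb_variance(mu, dispersion):
    """Negative binomial: mu + dispersion * mu^2."""
    return mu + dispersion * mu ** 2

def z_score(mean_a, mean_b, var_a, var_b, n):
    """Approximate z for a difference in group means, n replicates per group."""
    return (mean_b - mean_a) / math.sqrt((var_a + var_b) / n)

mu_a, mu_b, n = 1000.0, 1300.0, 3
dispersion = 0.1  # hypothetical; corresponds to roughly 32% biological CV

z_poisson = z_score(mu_a, mu_b,
                    poisson_variance(mu_a), poisson_variance(mu_b), n)
z_nb = z_score(mu_a, mu_b,
               nb_variance(mu_a, dispersion), nb_variance(mu_b, dispersion), n)

print(f"Poisson z = {z_poisson:.2f}, negative binomial z = {z_nb:.2f}")
```

With these numbers a 30% increase looks overwhelmingly significant under the Poisson assumption and unremarkable once biological variation is included, which is the overestimation of differential expression that the post warns about.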


  • Simon Anders
    replied
    Originally posted by bioinfosm View Post
    Simon, I am curious, what kind and how many replicates are you suggesting?
    The best I have seen is tumor-normal pairs for RNA-seq data; but by replicates do you mean redundant lanes of data per sample which are somehow averaged to get a more accurate read-out?
    Not quite. Imagine Charles finds a couple of genes which are, in one of his treatments, upregulated by 50% in comparison to the value in the controls, and he writes in his paper that these genes are obviously responding to the treatment.

    Somebody else performs the same control experiment but does it twice, with two independent samples, and notices that Charles' genes differ between the two control samples by around 50%, too. This invalidates the initial conclusion that the genes' upregulation is due to the treatment, as it happens without treatment as well. Without replicates, you would never know.

    So, all I am talking about is the old-fashioned rule that you should do every experiment several times in order to see how much the measured quantities change even if you don't change anything. While this is considered absolutely required in most subfields of biology, for some reason people forget about it once they use high-throughput sequencing.

    What you suggested, i.e., spreading a given sample over several lanes (called "technical replicates" by some), will not help at all with this; nevertheless, it might be necessary in addition if you work with organisms with large exomes.

    Tumor-normal sample pairs are proper replicates, of course, if you have several pairs. The specific issue with paired samples is that DESeq cannot deal with them at the moment (and neither can edgeR) but we are working on it.

    Simon
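
The Charles example can be written down as a tiny check: a treatment effect only means something if it is larger than the differences seen among untreated replicates. The rule below is a crude heuristic for illustration, not a statistical test, and the function names and expression values are hypothetical.

```python
import math

def abs_log2_fc(a, b):
    """Absolute log2 fold change between two expression values."""
    return abs(math.log2(b / a))

def exceeds_control_spread(treated, controls):
    """True if the treated value differs from every control by more than
    the controls differ among themselves."""
    spread = max(abs_log2_fc(x, y) for x in controls for y in controls)
    effect = min(abs_log2_fc(c, treated) for c in controls)
    return effect > spread

treated = 300.0

# One control only: a 50% increase looks like a treatment effect.
single_control = exceeds_control_spread(treated, [200.0])

# Two controls that themselves differ by 50%: the "effect" is within the noise.
noisy_controls = exceeds_control_spread(treated, [200.0, 300.0])

# Two controls that agree closely: the increase stands out.
tight_controls = exceeds_control_spread(treated, [200.0, 210.0])

print(single_control, noisy_controls, tight_controls)
```

With a single control the spread is zero by construction, so any difference passes the check, which is exactly the trap described above.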


  • bioinfosm
    replied
    Simon, I am curious, what kind and how many replicates are you suggesting?
    The best I have seen is tumor-normal pairs for RNA-seq data; but by replicates do you mean redundant lanes of data per sample which are somehow averaged to get a more accurate read-out?
