
  • how to study differential expression?

    My data come from two tissues, sequenced on Illumina with 75 nt reads. Are there any standard ways to study differential expression?
    Is it necessary to calculate RPKM for each gene? If so what is the best tool to calculate RPKM? ERANGE, TopHat or Cufflinks?
    Is simply counting and comparing the number of reads mapping to each gene between tissues also acceptable for studying differential gene expression?

    Thank you for your time!

  • #2
    You may use DEGseq to do the analysis you want. The input for DEGseq can be mapped reads rather than RPKM values.
    Have a look:

    and the related paper:


    best,
    Xi
    Xi Wang



    • #3
      Originally posted by beliefbio View Post
      Is it necessary to calculate RPKM for each gene?
      Because transcripts (or genes) vary in length (kilobases) and sequence-runs vary in the amount of reads produced, you would somehow like to account for these variations if you want to compare runs/samples. RPKM is a measure that (up to a certain degree of course) accounts for these.
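A minimal sketch of the arithmetic behind RPKM as described above, in Python. The function name and the example numbers are illustrative assumptions and are not taken from ERANGE, TopHat or Cufflinks.

```python
def rpkm(read_count, transcript_length_bp, total_mapped_reads):
    """Reads per kilobase of transcript per million mapped reads."""
    kilobases = transcript_length_bp / 1_000        # corrects for transcript length
    millions = total_mapped_reads / 1_000_000       # corrects for sequencing depth
    return read_count / (kilobases * millions)


# Hypothetical numbers: 500 reads on a 2,000 bp transcript,
# in a run with 10 million mapped reads.
example_rpkm = rpkm(500, 2_000, 10_000_000)
print(example_rpkm)  # 500 / (2 kb * 10 M) = 25.0
```

Dividing by kilobases handles the length differences between transcripts, and dividing by millions of mapped reads handles the differences in yield between runs.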

      If so what is the best tool to calculate RPKM? ERANGE, TopHat or Cufflinks?
      I haven't used ERANGE yet. TopHat is for mapping, not counting (it does count, but its creator has said this will be removed from future versions now that Cufflinks exists), so Cufflinks is the tool meant for RPKM determination.

      So, you could map with TopHat and then feed the resulting "accepted_hits.sam" file to Cufflinks, which will count and return RPKM values. But do realize that TopHat does more than just mapping: it tries to find exon-exon splice junctions (and is therefore potentially slow if all you want is mapping).

      -svl

      update: and btw, when you have the RPKM values from Cufflinks you could also use the mentioned DEGseq for determining which transcripts are differentially expressed.
      Last edited by svl; 12-01-2009, 03:25 AM.



      • #4
        Thanks a lot svl!!



        • #5
          Originally posted by svl View Post
          Because transcripts (or genes) vary in length (kilobases) and sequence-runs vary in the amount of reads produced, you would somehow like to account for these variations if you want to compare runs/samples. RPKM is a measure that (up to a certain degree of course) accounts for these.
          If you are examining differential expression of genes between samples, you don't really need to normalize for transcript length. When comparing the same gene between samples, the length of the transcript is constant (let's ignore the possibility of differential isoform expression). In this case you only need to normalize for the total number of reads in each sample pool.
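A small Python sketch of why the length term drops out when the same gene is compared between two samples. All names and counts here are hypothetical and chosen only to make the arithmetic easy to follow.

```python
def rpm(read_count, total_mapped_reads):
    """Reads per million mapped reads (depth normalization only)."""
    return read_count / (total_mapped_reads / 1_000_000)


def rpkm(read_count, length_bp, total_mapped_reads):
    """RPM additionally divided by transcript length in kilobases."""
    return rpm(read_count, total_mapped_reads) / (length_bp / 1_000)


# Hypothetical gene of 3,000 bp measured in two tissues.
length_bp = 3_000
reads_a, total_a = 600, 20_000_000
reads_b, total_b = 150, 10_000_000

fold_rpm = rpm(reads_a, total_a) / rpm(reads_b, total_b)
fold_rpkm = rpkm(reads_a, length_bp, total_a) / rpkm(reads_b, length_bp, total_b)

# The same length divides numerator and denominator, so both ratios agree.
print(fold_rpm, fold_rpkm)
```

The length only matters again when different genes (or different isoforms) are compared with each other.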



          • #6
            Originally posted by kmcarr View Post
            If you are examining differential expression of genes between samples, you don't really need to normalize for transcript length. When comparing the same gene between samples, the length of the transcript is constant (let's ignore the possibility of differential isoform expression). In this case you only need to normalize for the total number of reads in each sample pool.
            I totally agree with your point. DEGseq follows this to identify differentially expressed genes.
            Xi Wang



            • #7
              Agreed. Looking at other things, though, like the top 100 expressed genes/transcripts, is then impossible, so for the sake of future comparisons it's nice to use RPKM instead of RPM; it's not hard to calculate anyway. But you're absolutely right!
              Last edited by svl; 12-01-2009, 02:04 PM.



              • #8
                CuffCompare

                Originally posted by Xi Wang View Post
                I totally agree with your point. DEGseq follows this to identify differentially expressed genes.
                Cuffcompare (which is part of the Cufflinks package) could be used to identify differentially expressed genes.



                • #9
                  Hello everybody,

                  Some quick questions about the topic, I number them as they are quite different from each other. Any input appreciated!

                  1. Can tophat/cufflinks be used with a de-novo transcriptome assembly if no good genome is available (assuming that SOME contigs are actually long isoforms containing most exons)?

                  2. Is it correct that the model behind Cufflinks tries to allocate reads that map to multiple locations, thus giving a more precise result in cases where two isoforms are almost identical (e.g. premature stops)?

                  3. I understand that the RPKM (Reads Per Kilobase exon Model per million mapped reads) is:
                  3a. number of reads normalized per kilobase exon (to make it more comparable to qPCR results... although with caveats --> good for relative comparison of transcripts abundance in one sample)
                  3b. per million mapped reads (to normalize between different sequenced libraries)
                  (3c. limited to uniquely mapped reads except in the case of cufflinks???)

                  I think that point 3a cannot really be detrimental, although it can give a false sense of absolute quantitation, for example in the case of premature stops if only unambiguously mapped reads are taken into account. However, it can be useful, as mentioned above by svl.

                  On 3b, which is my main question: I am not sure that normalizing on the total number of mapped reads is fully satisfying in cases where gene expression is massively altered for highly expressed transcripts. Does anybody know if a package for RNA-seq (or adapted from microarrays) allows quantile regression, even better with outlier removal? Or would this method perform worse than normalization on the total mapped count in certain cases?
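To make the concern in 3b concrete, here is a toy Python sketch. The counts are invented, and the rescaling step is a simple median-ratio scaling that I am swapping in for illustration; it is not the quantile method asked about and not the procedure of any specific package.

```python
from statistics import median


def reads_per_million(counts):
    """Scale each gene's count by the library total, in reads per million."""
    total = sum(counts.values())
    return {gene: count * 1_000_000 / total for gene, count in counts.items()}


# Hypothetical libraries of equal depth. Only g4 truly changes (5x up in B),
# but at fixed depth it takes reads away from the unchanged genes g1..g3.
sample_a = {"g1": 250_000, "g2": 250_000, "g3": 250_000, "g4": 250_000}
sample_b = {"g1": 125_000, "g2": 125_000, "g3": 125_000, "g4": 625_000}

rpm_a = reads_per_million(sample_a)
rpm_b = reads_per_million(sample_b)

# Total-count normalization: unchanged genes look 2-fold down.
apparent_fold = {gene: rpm_b[gene] / rpm_a[gene] for gene in sample_a}

# Median-ratio scaling: assume most genes are unchanged and rescale B
# so that the median B/A ratio becomes 1.
scale = median(sample_b[gene] / sample_a[gene] for gene in sample_a)
rescaled_fold = {gene: (sample_b[gene] / scale) / sample_a[gene] for gene in sample_a}

print(apparent_fold)
print(rescaled_fold)
```

In this toy case the total-count normalization reports g1 to g3 as halved, while the median-based scaling recovers no change for them and a 5-fold increase for g4. It relies on the assumption that most genes do not change.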

                  Cheers,

                  Yvan



                  • #10
                    Cuffcompare output for DE genes

                    Originally posted by tebuffer View Post
                    Cuffcompare (which is part of the Cufflinks package) could be used to identify differentially expressed genes.
                    Can Cuffcompare directly give out the list of differentially expressed genes?

                    If not, how can its output be used for the identification of DE genes?



                    • #11
                      Originally posted by svl View Post
                      Agreed. Looking at other things, though, like the top 100 expressed genes/transcripts, is then impossible, so for the sake of future comparisons it's nice to use RPKM instead of RPM; it's not hard to calculate anyway. But you're absolutely right!
                      If you are interested in differential expression then once you calculate the log ratio, you may be more interested in the top 100 induced/repressed transcripts rather than 100 most highly expressed transcripts.
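A short Python sketch of ranking by log ratio rather than by absolute expression. The gene names and RPM values are hypothetical, and the pseudocount of 1 is an assumption I add only to avoid dividing by zero for genes with no reads.

```python
import math


def log2_ratios(rpm_a, rpm_b, pseudocount=1.0):
    """log2(B/A) per gene; the pseudocount is an illustrative assumption."""
    return {
        gene: math.log2((rpm_b[gene] + pseudocount) / (rpm_a[gene] + pseudocount))
        for gene in rpm_a
    }


def top_changed(ratios, n):
    """Return the n most induced and the n most repressed genes."""
    ranked = sorted(ratios, key=ratios.get)   # ascending log ratio
    repressed = ranked[:n]
    induced = ranked[::-1][:n]
    return induced, repressed


# Hypothetical depth-normalized values for two tissues.
rpm_a = {"geneA": 7, "geneB": 63, "geneC": 15, "geneD": 31}
rpm_b = {"geneA": 63, "geneB": 7, "geneC": 15, "geneD": 63}

ratios = log2_ratios(rpm_a, rpm_b)
induced, repressed = top_changed(ratios, 1)
print(induced, repressed)
```

Because each ratio compares a gene with itself, transcript length cancels and RPM is enough for this ranking.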



                      • #12
                        Originally posted by jiwu2573 View Post
                        Can Cuffcompare directly give out the list of differentially expressed genes?

                        If not, how can its output be used for the identification of DE genes?
                        I just wanted to point out that we just released a standalone tool, "cuffdiff", as part of the Cufflinks package to help you test for differential expression and regulation in your samples. Cuffdiff does differential expression on genes and transcripts, and a few other tests you may find helpful.



                        • #13
                          Hi,

                          as already pointed out, it is not necessary to normalize for transcript length. It is even advantageous to not do so, as you can then use a statistical test that takes the specificities of count data into account, which gives you much better power at low count rates.

                          We have recently released a tool to do this, called DESeq: http://www-huber.embl.de/users/anders/DESeq/

                          DESeq is based on the so-called negative binomial distribution, which allows a powerful test for count data. Furthermore, it can estimate the variance between the samples from the data and uses this information in the test. The basic idea is older and has, e.g., already been used in the edgeR package (Robinson and Smyth), but we added an improved variance estimation that does a better job if the amount of noise depends on the expression strength as is often the case.

                          Note that this variance estimation is crucial. It is often claimed (e.g. by the DEGseq package suggested above) that a Poisson-based test, such as the binomial or the chi-squared test, is suitable, but then the p value will only tell you whether your difference is stronger than what to expect between _technical_ replicates, which is not biologically meaningful.
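To illustrate the difference between Poisson and negative binomial noise, here is a rough Python sketch using the usual mean-dispersion form of the negative binomial variance. The method-of-moments estimator and the replicate counts are my own illustrative assumptions; this is not the variance estimation that DESeq or edgeR actually implement.

```python
from statistics import mean, variance


def nb_variance(mu, dispersion):
    """Negative binomial variance; dispersion = 0 reduces to Poisson (var = mu)."""
    return mu + dispersion * mu ** 2


def moment_dispersion(counts):
    """Crude method-of-moments dispersion from replicate counts of one gene."""
    mu = mean(counts)
    excess = variance(counts) - mu          # variance beyond the Poisson level
    return max(excess, 0.0) / mu ** 2


# Hypothetical counts for one gene, already scaled to a common depth.
technical_replicates = [98, 102, 95, 105]    # spread no larger than Poisson
biological_replicates = [60, 140, 85, 115]   # clearly overdispersed

disp_technical = moment_dispersion(technical_replicates)
disp_biological = moment_dispersion(biological_replicates)
print(disp_technical, disp_biological)
print(nb_variance(100, disp_biological))
```

With these numbers the technical replicates give a dispersion of zero, so a Poisson model fits them, while the biological replicates need a variance far above the mean. A Poisson-based test applied to the second set would treat that extra spread as signal.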



                          • #14
                            You would need biological replicates to assess biological variability. One sample in each group limits your ability to see how much biological variability you should expect in future experiments, irrespective of the statistical test being used.

                            Regarding benchmarking of statistical methods for RNA-Seq data, I would recommend this paper from the Dudoit lab:



                            On the practical side of things, we have recently released a set of tools that includes a program to estimate various statistics of differential expression. It can compute RPKMs, run Fisher exact tests to compare low counts across groups, and run t-tests when you have several samples per group. All statistics are corrected for multiple testing with a Benjamini-Hochberg FDR correction. We've tried to make it easy and fast to go from reads to differential expression results.
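For readers unfamiliar with the Benjamini-Hochberg correction mentioned above, a minimal plain-Python sketch follows. It is a generic textbook version with hypothetical p values and is not Goby's code.

```python
def benjamini_hochberg(p_values):
    """Return BH-adjusted p values in the same order as the input."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])   # indices by ascending p
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        index = order[rank - 1]
        running_min = min(running_min, p_values[index] * m / rank)
        adjusted[index] = running_min
    return adjusted


# Hypothetical per-gene p values from some differential expression test.
raw_p = [0.01, 0.04, 0.03, 0.5]
adjusted_p = benjamini_hochberg(raw_p)
print(adjusted_p)  # approximately [0.04, 0.0533, 0.0533, 0.5]
```

Genes whose adjusted value falls below the chosen FDR threshold (0.05 is a common choice) would be reported as differentially expressed.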

                            See the Goby home page at http://icbtools.med.cornell.edu/goby/ and a tutorial at http://icb.med.cornell.edu/wiki/index.php/Goby/DE



                            • #15
                              I'd just like to clarify some of the discussion on this thread regarding how to normalize reads, how to measure expression, and then how to find differential expression.

                              First of all, RPKM is a unit, not a method. It stands for "reads per kilobase of transcript per million sequenced reads". As we point out in the Cufflinks paper (to appear shortly), this unit is flawed, as the objects being sequenced are fragments, not reads. We use the unit FPKM (expected fragments per kilobase of transcript per million fragments sequenced). This is not just a technicality: it is crucial to use units that are proportional to (i.e. a scalar multiple of) the estimated proportion of each transcript. FPKM has this property; RPKM does not.

                              Secondly, regarding expression estimates, a currently favored method is to "count" the reads that map to a gene and normalize by length. If the gene has a single isoform, this is well-defined, but it's problematic with multiple isoforms that may have different lengths and share different exons. The currently favored method I allude to, counting all reads that map somewhere in the locus and dividing by the number of exonic bases, _provably underestimates gene expression_. It is essential not only to normalize by transcript length but also to probabilistically assign fragments to isoforms. This is what Cufflinks does.

                              Regarding differential expression tests, one has to keep in mind that in genes with multiple isoforms the relative abundances may change, making it crucial to have correctly estimated individual expression levels.
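A toy Python sketch of the underestimation argument above. The two-isoform gene, its lengths and its abundances are hypothetical, fragment counts are taken as exactly abundance times length, the per-million scaling is left out because it is the same constant for both estimates, and the isoform of origin of each fragment is assumed to be known. Cufflinks has to infer that assignment probabilistically, which this sketch does not attempt.

```python
# Hypothetical gene: the long isoform contains every exon of the short one,
# so the union of exonic bases equals the long isoform's length.
isoform_length_bp = {"short": 1_000, "long": 3_000}
union_exonic_bp = 3_000


def simulate_fragments(abundance):
    """Idealized fragment counts: proportional to abundance * isoform length."""
    return {iso: level * isoform_length_bp[iso] for iso, level in abundance.items()}


def union_model_estimate(fragments):
    """Count everything in the locus and divide by the union of exonic bases."""
    return sum(fragments.values()) / union_exonic_bp


def isoform_model_estimate(fragments):
    """Normalize each isoform by its own length, then sum over isoforms."""
    return sum(count / isoform_length_bp[iso] for iso, count in fragments.items())


true_abundance = {"short": 10, "long": 5}          # true gene level = 15
fragments = simulate_fragments(true_abundance)

union_estimate = union_model_estimate(fragments)
isoform_estimate = isoform_model_estimate(fragments)
print(union_estimate, isoform_estimate)
```

Here the isoform-aware sum recovers the true gene level of 15, while the union model reports roughly 8.3. The two agree only when every fragment comes from an isoform as long as the exon union.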
