Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • Expression quantification/differential expression gene analysis by RNA-Seq

    Hey. RNA-Seq is very efficient to estimate gene expression level in terms of RPKM/FPKM and to perform differential expression analysis then. which of the following method do you prefer to calculate the RPKM/FPKM value/perform DEG analysis especially when a single gene has multiple isoforms?

    RPKM calculation
    a). Reads falling on any exon of the gene are counted and then normalized by the total length of all exons and library size;
    b). Only reads falling on the constitutive exons are counted and normalized by the length of constitutive exons and library size;
    c). Reads falling on the epresentative isoform (eg. longest isoform) of a gene are counted and normalized by isoform length and library size
    d). by ERANGE software (http://www.nature.com/nmeth/journal/...meth.1226.html)
    e). by cufflinks/cuffdiff software, which made effort to locate reads in isoform/transcrip level.

    Differential Expression Gene Analysis
    a) DEGseq, b) EdgeR, c), Cuffdiff

    By the way, what's your comments on Cufflinks/Cuffidff software? Actually, I heard of both positive and critical opinions, which makes me coufused on whether it is reliable.

  • #2
    DESeq(2), edgeR, or limma/voom are your best bets, so you can ignore the RPKM calculation issue. Regarding cufflinks/cuffdiff, the earlier versions were pretty suboptimal, though the more recent versions seem vastly improved.

    Comment


    • #3
      Originally posted by dpryan View Post
      DESeq(2), edgeR, or limma/voom are your best bets, so you can ignore the RPKM calculation issue. Regarding cufflinks/cuffdiff, the earlier versions were pretty suboptimal, though the more recent versions seem vastly improved.
      Thanks for you reply!

      Can you explain why previous versions of cufflinks/cuffdiff were suboptimal. Is it because cufflinks/cuffdiff works on transcript level, while it is not possible to accurately define all the transcripts based on RNA-Seq? or because the underlying statistic model is not good enough?

      Comment


      • #4
        It seems to have been more of a change in implementation, though I can't say for sure. Early on, different point versions were giving very different results, some of which were non-sensical. From reading the forums here, it seems that the most recent versions are more stable and use a somewhat different approach, though I'm not an expert on cufflinks/cuffdiff.

        Comment


        • #5
          You should read the paper "Computational methods for transcriptome annotation and quantification using RNA-seq" in Nature Methods.

          In that paper the cuffdiff authors argue why transcript expression method is conceptually better than the exon union/intersection method.

          Comment


          • #6
            Originally posted by NGSfan View Post
            You should read the paper "Computational methods for transcriptome annotation and quantification using RNA-seq" in Nature Methods.

            In that paper the cuffdiff authors argue why transcript expression method is conceptually better than the exon union/intersection method.
            Actually, the recent publication on Nature Biotechnology titled "Differential analysis of gene regulation at transcript resolution with RNA-seq" explicitly pointed out the advantage of transcript-level method over union/intersection method. However, Transcript-level expression quantification requires accurately allocating each read to specific transcript. I am not sure whether cuffdiff performs well regarding this. Is it possible to know the exact transcript where the read come from on the basis of RNA-Seq with short reads and uneven distribution across the whole transcript?
            Last edited by chenjy; 07-31-2013, 07:59 AM.

            Comment


            • #7
              Originally posted by dpryan View Post
              It seems to have been more of a change in implementation, though I can't say for sure. Early on, different point versions were giving very different results, some of which were non-sensical. From reading the forums here, it seems that the most recent versions are more stable and use a somewhat different approach, though I'm not an expert on cufflinks/cuffdiff.
              I see. and the new version of Cuffdiff may require further evaluations.

              Comment


              • #8
                Originally posted by chenjy View Post
                Actually, the recent publication on Nature Biotechnology titled "Differential analysis of gene regulation at transcript resolution with RNA-seq" explicitly pointed out the advantage of transcript-level method over union/intersection method. However, Transcript-level expression quantification requires accurate allocate each read to specific transcript. I am not sure whether cuffdiff performs well regarding this. Is it possible to know the exact transcript where the read come from on the basis of RNA-Seq with short reads and uneven distribution across the whole transcript?
                The paper only provides a theoretical explanation for an advantage: There's no information whatsoever regarding whether or not those type of reads actually exist in data. While I assume they occur, I find it difficult to assume that they occur in greater proportion than the error rate of transcript-level read assignment, especially since i've seen replicate data (from *technical* reps of the exact same library, no less) where cufflinks assigns 100% of reads to transcript X in overlapping transcripts X/Y in rep A but 90% of them to transcript Y in rep B.

                Comment


                • #9
                  Originally posted by jparsons View Post
                  The paper only provides a theoretical explanation for an advantage: There's no information whatsoever regarding whether or not those type of reads actually exist in data. While I assume they occur, I find it difficult to assume that they occur in greater proportion than the error rate of transcript-level read assignment, especially since i've seen replicate data (from *technical* reps of the exact same library, no less) where cufflinks assigns 100% of reads to transcript X in overlapping transcripts X/Y in rep A but 90% of them to transcript Y in rep B.
                  Thanks for sharing this! What were the RPKMs of these transcripts?

                  RNA-seq suffers from random sampling so if the overall gene expression level is low, then the chance is higher that reads will get assigned to another transcript even in technical replicates. Eg. if you only have 2-4 supporting reads for a splice junction, and the next time you sample it, you get 0 supporting reads for that same junction, I can imagine the reads being flipped to another transcript in the other technical replicate.

                  What was the FPKM status of this transcript? If it is "OK" then that would be troubling...

                  Comment


                  • #10
                    The two transcripts happen to be a regular transcript and its accompanying snoRNA ; it's certainly understandable that the reads are misassigned. At least it's not being called DE, but when the FPKM varies between 0 and 20 (or 20 and 40) it casts considerable doubt on the accuracy of the assignment. There are over 1000 reads in the 12kb region.


                    Code:
                    SNORA5A	SNORA5A	-	chr7:45139698-45151317	1a	1aU	OK	42.1291	0	-1.79769e+308	-1.79769e+308	0.402342	1	no
                    TBRG4	TBRG4	-	chr7:45139698-45151317	1a	1aU	OK	20.5787	22.3127	0.116716	-0.308804	0.75747	1	no
                    I have no idea why the sequences are 100% overlapping like this; in the GTF, the SNO is only 45143948-45144081.

                    Anyway, this was one of my more depressing observations while attempting to figure out how the heck to validate RNA-seq quantifications. It may just be a bug in the software (given that the genomic regions are changing,it seems like it must be), but even conceptually, no matter how good(/complicated) you make your statistical models, you're going to misassign some reads.

                    I'll go ahead and re-run this with 2.11 tonight just for kicks. It was done in 2.02 originally.
                    Last edited by jparsons; 07-31-2013, 08:54 AM. Reason: adding last bit.

                    Comment


                    • #11
                      Originally posted by jparsons View Post
                      The two transcripts happen to be a regular transcript and its accompanying snoRNA ; it's certainly understandable that the reads are misassigned. At least it's not being called DE, but when the FPKM varies between 0 and 20 (or 20 and 40) it casts considerable doubt on the accuracy of the assignment. There are over 1000 reads in the 12kb region.


                      Code:
                      SNORA5A	SNORA5A	-	chr7:45139698-45151317	1a	1aU	OK	42.1291	0	-1.79769e+308	-1.79769e+308	0.402342	1	no
                      TBRG4	TBRG4	-	chr7:45139698-45151317	1a	1aU	OK	20.5787	22.3127	0.116716	-0.308804	0.75747	1	no
                      I have no idea why the sequences are 100% overlapping like this; in the GTF, the SNO is only 45143948-45144081.

                      Anyway, this was one of my more depressing observations while attempting to figure out how the heck to validate RNA-seq quantifications. It may just be a bug in the software (given that the genomic regions are changing,it seems like it must be), but even conceptually, no matter how good(/complicated) you make your statistical models, you're going to misassign some reads.

                      I'll go ahead and re-run this with 2.11 tonight just for kicks. It was done in 2.02 originally.

                      Wow.. that one is a tricky case. They are both transcribed off the minus strand. I looked at it in the genome browser:



                      But I'm confused about the annotation, the Refseq track shows SNORA5A is sitting between TBRG4 exons. While UCSC genes track shows overlapping of TBRG4 exons with (a larger) SNORA5C.

                      Ensembl also has the smaller SNORA5A and SNORA5C in between exons.

                      What are your settings for cuffdiff? are you using --compatible-hits-norm ?
                      I would hope it would make things agree more with the GTF annotation. The defaults are not always the best and are actually flipped between cufflinks and cuffdiff (I have found from personal experience). Also the defaults claimed are not always the true defaults. For example --max-frag-multihits is listed as unlimited in cuffdiff, but it's actually set to 1 in the program.
                      Last edited by NGSfan; 07-31-2013, 09:29 AM.

                      Comment


                      • #12
                        Unless "default=TRUE" isn't actually true for compatible-hits-norm, yes i am using it. I wasn't using the flag, however.

                        Interestingly, RSEM also says that the SNO disappears in 1aU. It gets the lengths correct, though.

                        (Different reference because obviously these programs are picky about references so you can't directly compare them without doing way too much extra work):

                        Code:
                         
                        sample	gene_id	transcript_id			length	effectivelength	exp.count	TPM	FPKM 
                        1a:	ENSG00000206838	ENST00000384111		134.00	84.34	1.00	1.58	0.95
                        1aU:	ENSG00000206838	ENST00000384111		0.00	0.00	0.00	0.00	0.00
                        
                        1a:	ENSG00000136270	ENST00000258770...	2020.44	1966.21	690.00	44.79	27.03
                        
                        1aU:	ENSG00000136270	ENST00000258770...	1953.33	1900.96	585.00	47.69	28.96
                        Cuffdiff 2.11, with or without the compatible-hits-norm flag, give the same quantifications as the older version, although they're now NOTEST rather than OK. Whatever was going on with the length of this poor little SNO is still happening in the latest version of the software.
                        Last edited by jparsons; 08-01-2013, 09:18 AM.

                        Comment


                        • #13
                          Originally posted by jparsons View Post
                          Cuffdiff 2.11, with or without the compatible-hits-norm flag, give the same quantifications as the older version, although they're now NOTEST rather than OK. Whatever was going on with the length of this poor little SNO is still happening in the latest version of the software.
                          Thanks for the update jparsons. Interesting that it switched to NOTEST in the newer version.

                          Are you using the same Ensembl GTF with Cuffdiff as you do with RSEM?

                          Comment

                          Latest Articles

                          Collapse

                          • seqadmin
                            Strategies for Sequencing Challenging Samples
                            by seqadmin


                            Despite advancements in sequencing platforms and related sample preparation technologies, certain sample types continue to present significant challenges that can compromise sequencing results. Pedro Echave, Senior Manager of the Global Business Segment at Revvity, explained that the success of a sequencing experiment ultimately depends on the amount and integrity of the nucleic acid template (RNA or DNA) obtained from a sample. “The better the quality of the nucleic acid isolated...
                            03-22-2024, 06:39 AM
                          • seqadmin
                            Techniques and Challenges in Conservation Genomics
                            by seqadmin



                            The field of conservation genomics centers on applying genomics technologies in support of conservation efforts and the preservation of biodiversity. This article features interviews with two researchers who showcase their innovative work and highlight the current state and future of conservation genomics.

                            Avian Conservation
                            Matthew DeSaix, a recent doctoral graduate from Kristen Ruegg’s lab at The University of Colorado, shared that most of his research...
                            03-08-2024, 10:41 AM

                          ad_right_rmr

                          Collapse

                          News

                          Collapse

                          Topics Statistics Last Post
                          Started by seqadmin, 03-27-2024, 06:37 PM
                          0 responses
                          12 views
                          0 likes
                          Last Post seqadmin  
                          Started by seqadmin, 03-27-2024, 06:07 PM
                          0 responses
                          11 views
                          0 likes
                          Last Post seqadmin  
                          Started by seqadmin, 03-22-2024, 10:03 AM
                          0 responses
                          53 views
                          0 likes
                          Last Post seqadmin  
                          Started by seqadmin, 03-21-2024, 07:32 AM
                          0 responses
                          69 views
                          0 likes
                          Last Post seqadmin  
                          Working...
                          X