  • Advice for Statistics in Gene Expression??

    Hi Everyone,

    I am a beginner in RNA-seq and am working on a project dealing with differential gene expression. I am trying to compare the gene expression between two populations each having 2-5 samples.

    I am looking for genes that are differentially expressed between these two populations. I also thought it would be interesting to look at genes that have the same overall expression but different isoform expression levels. I ran TopHat and Cufflinks to get my genes.fpkm_tracking and isoforms.fpkm_tracking files.

    I am looking for advice on statistical tests that I can use to find genes that are expressed at statistically similar levels, using the FPKM values in the genes.fpkm_tracking file, and then to compare the FPKM values of their isoforms to see whether there is a statistically significant difference between the isoforms.

    I was looking online and found ideas for tests ranging from t-tests to Poisson models to the negative binomial distribution (which I have no idea about). I also found that a lot of existing programs like edgeR or DESeq use raw read count data, but Cufflinks only outputs FPKM values. How should I go about this?

    I wasn't sure if cuffdiff would be a good option for what I am trying to do. Also, in general, if I were to compare differential gene expression using the per-gene FPKM values in the genes.fpkm_tracking file from Cufflinks, what types of plots are ideal in this field? I have heard about density plots and heat maps, but I am not sure and wanted some advice from anyone who has done this before. I am familiar with R, if that helps.

    Thanks in advance!!!

  • #2
    Originally posted by thickrick99:
    Hi Everyone,

    I am a beginner in RNA-seq and am working on a project dealing with differential gene expression. I am trying to compare the gene expression between two populations each having 2-5 samples.

    I am looking for genes that are differentially expressed between these two populations. I also thought it would be interesting to look at genes that have the same overall expression but different isoform expression levels. I ran TopHat and Cufflinks to get my genes.fpkm_tracking and isoforms.fpkm_tracking files.

    You know that cuffdiff basically does this for you? I'm not sure why you're seeking a statistical test here (maybe I'm missing the point). It's not uncommon to find genes that are not significantly differentially expressed at the gene level, because there is no net change, but that have significantly differentially expressed isoforms, where one goes up by the same amount as the other goes down. This is just a matter of checking what is significant in each case in the cuffdiff outputs.

    The tracking files are not where you need to be looking; I'd advise you to look at the .diff files.

    Another source of these situations is the splicing.diff file, which will show you the genes that have alterations in the distribution of isoforms between conditions.
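
A minimal sketch of that check, in Python using only the standard library. It assumes the gene-level and isoform-level .diff files are tab-separated with `test_id`, `gene_id`, `status` and `significant` columns (verify against the header of your own cuffdiff output); the function names are made up for illustration.

```python
import csv
import io
from collections import defaultdict


def load_rows(handle):
    """Read a tab-separated .diff file into a list of dicts keyed by column name."""
    return list(csv.DictReader(handle, delimiter="\t"))


def isoform_only_genes(gene_rows, isoform_rows):
    """Map gene_id -> significant isoform ids, for genes tested but not significant."""
    # Genes that were actually tested (status OK) and not called significant.
    unchanged = {
        r["gene_id"]
        for r in gene_rows
        if r["status"] == "OK" and r["significant"] == "no"
    }
    # Isoforms that were tested and called significant, grouped by parent gene.
    changed = defaultdict(list)
    for r in isoform_rows:
        if r["status"] == "OK" and r["significant"] == "yes":
            changed[r["gene_id"]].append(r["test_id"])
    return {g: sorted(ids) for g, ids in changed.items() if g in unchanged}


# Toy data standing in for the two files (only the columns used here).
GENES = "test_id\tgene_id\tstatus\tsignificant\nG1\tG1\tOK\tno\nG2\tG2\tOK\tyes\n"
ISOFORMS = (
    "test_id\tgene_id\tstatus\tsignificant\n"
    "T1\tG1\tOK\tyes\nT2\tG1\tOK\tyes\nT3\tG2\tOK\tyes\n"
)

hits = isoform_only_genes(
    load_rows(io.StringIO(GENES)), load_rows(io.StringIO(ISOFORMS))
)
print(hits)
```

With real data, replace the `io.StringIO(...)` objects with open file handles. Requiring `status == "OK"` keeps genes that were never tested from being counted as "unchanged".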

    • #3
      Ok, thanks for the advice! I haven't run cuffdiff yet, but I will now. So what you're saying is that I can look for genes with similar expression in the gene-level .diff file and then look for isoforms that differ using the other cuffdiff output files, right?

      Do you (or anyone else) have a suggestion regarding what type of plots I can use? It seems like 5000+ genes have similar expression, so what would be the best way to plot such a large amount of data?


      Thanks!
      Last edited by thickrick99; 08-16-2014, 01:33 PM.

      • #4
        I understand that I don't actually need a test statistic since programs like cuffdiff can do it for me. However, as part of the project I am working on, I need to program a test statistic in R in order to get and compare the p-values myself rather than having the programs do it for me. So how should I choose the test statistic for my setup (2 unpaired populations, each with 2-5 samples)? I am trying to compare differential gene expression, and I am not sure how to choose a statistical test for this using the FPKM values from Cufflinks.

        • #5
          The most common test statistic in your circumstance would be the t-statistic. That ends up being similar to what cuffdiff uses internally anyway.
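
As a rough illustration of the unpaired two-sample case, here is a sketch of the t-statistic in its Welch (unequal-variance) form, written in Python with the standard library only. The choice of Welch's variant and the optional log2(FPKM + 1) transform are assumptions made for this example, not something prescribed above; `welch_t` and `log2p1` are illustrative names.

```python
import math
from statistics import mean, variance


def log2p1(values):
    """Optional transform: log2(x + 1). The pseudocount of 1 is an assumption."""
    return [math.log2(v + 1.0) for v in values]


def welch_t(a, b):
    """Return (t, df) for an unpaired two-sample Welch t-test.

    Each group needs at least 2 values so that a sample variance exists.
    """
    va = variance(a) / len(a)  # squared standard error of mean(a)
    vb = variance(b) / len(b)  # squared standard error of mean(b)
    se2 = va + vb
    if se2 == 0:
        raise ValueError("both groups have zero variance; t is undefined")
    t = (mean(a) - mean(b)) / math.sqrt(se2)
    # Welch-Satterthwaite approximation to the degrees of freedom.
    df = se2 ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df


# Hypothetical FPKM values for one gene in two populations.
pop1 = [12.0, 15.5, 9.8]
pop2 = [30.1, 25.4, 41.0, 33.3]
t, df = welch_t(log2p1(pop1), log2p1(pop2))
print(t, df)
```

The p-value is then the two-sided tail area of a t distribution with `df` degrees of freedom; in R, `t.test(x, y)` performs the Welch test by default and reports it directly.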

          • #6
            Yeah, that's what I was planning to do as well. But I read online about other tests based on Poisson or negative binomial models, and I wasn't sure if these are better than using the t-test?

            • #7
              Of those, only the t-test is compatible with FPKMs.

              • #8
                So if I wanted to use more complicated tests with Poisson or negative binomial models, I would have to use the raw read count data, right? Where do I access this information, assuming I used TopHat and then Cufflinks? Do I have to convert FPKM to read counts, or are the read counts in the accepted_hits.bam file?

                • #9
                  The general workflow is to map with TopHat (or STAR, or whatever else you want to use) and then quantify with htseq-count or featureCounts. The latter two will give you counts that you can use in a negative binomial model. If you used Cufflinks to find new features, then just run it first and use the merged GTF file with the aforementioned counting programs. I wouldn't bother with a Poisson model; it's not worth your time.

                  BTW, there's no great way to convert between FPKM and raw counts, since the latter doesn't use multimappers while the former does.

                  • #10
                    Ok, thanks Devon! So I have my count data, but how do I go about using the negative binomial model? I am not very familiar with these types of models (since I just started in this field very recently). For the project I don't want to use existing tools like edgeR, which use negative binomial models, but would prefer to program my own model in R. Is there a way to do this, and how would I get the p-values to compare the differential expression of the genes?

                    It would be great if you could give me some advice on how to program the negative binomial model in R, along with any resources that you think would help me compare differential gene expression between the two populations.

                    Thanks for all your help!

                    • #11
                      The simplest way would be to use glm.nb() from the MASS package.

                      • #12
                        Originally posted by thickrick99:
                        I am not very familiar with these types of models (since I just started in this field very recently). For the project I don't want to use existing tools like edgeR which use negative binomial models but would prefer programming my own model into R.
                        If you are not familiar with the relevant models, wouldn't it make sense to use existing tools?

                        Anyway, the edgeR methods, dispersion estimation especially, are sufficiently sophisticated that it is very unlikely you could reproduce them yourself in any reasonable amount of time.

                        For example, glm.nb() implements simple maximum likelihood for the dispersion parameter, which will markedly underestimate the true dispersions for RNA-seq data and hence give overly liberal DE results. You need software like edgeR to do better; there's no easy way around it.
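
To make "dispersion" concrete: in the usual negative binomial parameterization, a gene with mean count mu has variance mu + phi * mu^2, where phi is the dispersion. The Python sketch below is a crude per-gene method-of-moments estimate of phi. It is neither what glm.nb() does (maximum likelihood) nor what edgeR does; it also assumes the counts are already comparable across samples (equal library sizes), and the function name is illustrative.

```python
from statistics import mean, variance


def moment_dispersion(counts):
    """Method-of-moments estimate of NB dispersion phi from one gene's counts.

    Solves s^2 = m + phi * m^2 for phi, truncating at 0 when the sample
    variance does not exceed the mean (no evidence of extra-Poisson noise).
    """
    m = mean(counts)
    if m == 0:
        return 0.0
    s2 = variance(counts)  # needs at least 2 replicates
    return max(0.0, (s2 - m) / (m * m))


# Hypothetical counts for two genes, three replicates each.
print(moment_dispersion([80, 100, 120]))   # variance 400 > mean 100
print(moment_dispersion([98, 100, 102]))   # variance 4 < mean 100, truncated to 0
```

With only 2-5 replicates, a per-gene estimate like this rests on very few numbers, which is one way to see why dedicated tools put so much work into dispersion estimation.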

                        • #13
                          Ok thanks for your help Gordon. So I tried using the t-test on the FPKM values but I realized that I can't do the test because of the 0 FPKM values. Is there an accepted way to get around this so that I can do the t-test on the FPKM values?

                          • #14
                            It is impossible to do a high-performance statistical test on FPKM values alone, because they have varying precision, and that precision depends on the size of the original count rather than on the FPKM value itself.
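
A small worked example of that point, as a Python sketch. It uses the standard definition FPKM = fragments * 10^9 / (feature length in bp * total mapped fragments) and, as a simplifying assumption, treats Poisson counting noise as the only source of error, so the relative standard deviation of a count n is 1/sqrt(n). All numbers are invented.

```python
import math


def fpkm(fragments, length_bp, total_mapped):
    """Fragments per kilobase of feature per million mapped fragments."""
    return fragments * 1e9 / (length_bp * total_mapped)


def poisson_cv(fragments):
    """Relative standard deviation of a Poisson count (counting noise only)."""
    return 1.0 / math.sqrt(fragments)


TOTAL = 20_000_000  # hypothetical library size

# Two hypothetical genes with the same FPKM but very different raw counts.
short_gene = {"fragments": 10, "length_bp": 500}
long_gene = {"fragments": 1000, "length_bp": 50_000}

for name, g in (("short", short_gene), ("long", long_gene)):
    value = fpkm(g["fragments"], g["length_bp"], TOTAL)
    print(name, value, round(poisson_cv(g["fragments"]), 4))
```

Both genes come out at the same FPKM, but the value for the short gene rests on 10 fragments and the value for the long gene on 1000, so the first is roughly ten times noisier. That information is lost once only the FPKM is kept.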
