  • Advice for Statistics in Gene Expression??

    Hi Everyone,

    I am a beginner in RNA-seq and am working on a project dealing with differential gene expression. I am trying to compare the gene expression between two populations each having 2-5 samples.

    I am looking for genes that are differentially expressed between these two populations. I thought that it would be interesting to look at genes that have the same overall expression but different isoform expression levels. I ran TopHat and Cufflinks to get my genes.fpkm_tracking and isoforms.fpkm_tracking files.

    I am looking for some advice on statistical tests that I can use to find genes whose expression is statistically similar, using the FPKM values in the genes.fpkm_tracking file, and then to compare the FPKM values of the isoforms to see whether there is a statistically significant difference between them.

    I was looking online and found some ideas for tests ranging from t-tests to the Poisson distribution to the negative binomial distribution (which I have no idea about). Another thing I found is that a lot of existing programs like edgeR or DESeq use raw read counts, but Cufflinks only outputs FPKM values. How should I go about this?

    I wasn't sure if Cuffdiff would be a good option for what I am trying to do. Also, in general, if I were to compare differential gene expression using the per-gene FPKM values in the genes.fpkm_tracking file from Cufflinks, what types of plots are ideal in this field? I have heard about density plots and heat maps, but I am not sure and wanted some advice from anyone who has done this before. I am familiar with R, if that helps.

    Thanks in advance!!!

  • #2
    Originally posted by thickrick99
    Hi Everyone,

    I am a beginner in RNA-seq and am working on a project dealing with differential gene expression. I am trying to compare the gene expression between two populations each having 2-5 samples.

    I am looking for genes that are differentially expressed between these two populations. I thought that it would be interesting to look at genes that have the same overall expression but different isoform expression levels. I ran TopHat and Cufflinks to get my genes.fpkm_tracking and isoforms.fpkm_tracking files.

    You know that cuffdiff basically does this for you? I'm not sure why you're seeking a statistical test here (maybe I'm missing the point). It's not uncommon to find genes that are not significantly differentially expressed at the gene level, because there is no overall change, but that have significantly differentially expressed isoforms, where one isoform goes up by about the same amount as another goes down. Finding them is just a matter of checking what is significant at each level in the cuffdiff outputs.

    The tracking files are not where you need to be looking; I'd advise you to look at the .diff files.

    Another source of these situations is the splicing.diff file, which shows the genes whose distribution of isoforms changes between conditions.
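
To make the "check what is significant in each case" step concrete, here is a minimal Python sketch (the same logic is a few lines in R). It assumes the column names documented for Cuffdiff's differential expression tables (`gene_id`, `status`, `significant`), so check them against the header of your own gene_exp.diff and isoform_exp.diff. The two small tables and the function names are made up for illustration.

```python
import csv
import io


def read_diff(handle):
    """Parse a tab-delimited Cuffdiff .diff table into a list of dicts."""
    return list(csv.DictReader(handle, delimiter="\t"))


def stable_genes_with_changed_isoforms(gene_rows, isoform_rows):
    """Return gene_ids that are not significant at the gene level
    but have at least one significant isoform (tested rows only)."""
    stable = {r["gene_id"] for r in gene_rows
              if r["status"] == "OK" and r["significant"] == "no"}
    changed = {r["gene_id"] for r in isoform_rows
               if r["status"] == "OK" and r["significant"] == "yes"}
    return sorted(stable & changed)


# Made-up stand-ins for gene_exp.diff and isoform_exp.diff (assumed columns).
GENE_DIFF = (
    "test_id\tgene_id\tstatus\tq_value\tsignificant\n"
    "XLOC_1\tXLOC_1\tOK\t0.80\tno\n"
    "XLOC_2\tXLOC_2\tOK\t0.01\tyes\n"
)
ISOFORM_DIFF = (
    "test_id\tgene_id\tstatus\tq_value\tsignificant\n"
    "TCONS_1\tXLOC_1\tOK\t0.02\tyes\n"
    "TCONS_2\tXLOC_1\tOK\t0.03\tyes\n"
    "TCONS_3\tXLOC_2\tOK\t0.04\tyes\n"
)

genes = read_diff(io.StringIO(GENE_DIFF))
isoforms = read_diff(io.StringIO(ISOFORM_DIFF))
candidates = stable_genes_with_changed_isoforms(genes, isoforms)
print(candidates)  # ['XLOC_1']
```

With real files, replace the `io.StringIO(...)` objects with `open("gene_exp.diff")` and `open("isoform_exp.diff")`. Keep in mind that "not significant" at the gene level is not proof of equal expression; it only means no difference was detected.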


    • #3
      Ok thanks for the advice! I haven't run cuffdiff yet but I will now. So what you're saying is that I can look for genes with similar expression in the .diff file and then look for the isoforms that differ using the cuffdiff output files, right?

      Do you (or anyone else) have a suggestion regarding what type of plots I can use? It seems like 5000+ genes have similar expression, so what would be the best way to plot such a large amount of data?


      Thanks!
      Last edited by thickrick99; 08-16-2014, 01:33 PM.


      • #4
        I understand that I don't actually need a test statistic since programs like cuffdiff can do it for me. However, as part of the project I am working on, I need to program a test statistic in R myself in order to get and compare the p-values, rather than having the programs do it for me. So how should I choose the test statistic for what I am doing (2 unpaired populations, each with 2-5 samples)? I am trying to compare differential gene expression, and I am not sure how to choose a statistical test for this using the FPKM values from Cufflinks.


        • #5
          The most common test statistic in your circumstance would be the t-statistic. That ends up being similar to what cuffdiff is using internally anyway.
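
As a rough sketch of what programming that statistic yourself involves, the Python below computes Welch's unpaired t-statistic and its Welch-Satterthwaite degrees of freedom for one gene. The `log2(FPKM + 1)` transform, the pseudocount of 1, the function names, and the FPKM values are all assumptions made for illustration, not anything cuffdiff prescribes.

```python
import math
import statistics


def welch_t(a, b):
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom
    for two unpaired groups. Each group needs at least 2 values, and
    at least one group must have non-zero variance."""
    va = statistics.variance(a) / len(a)  # squared standard error, group a
    vb = statistics.variance(b) / len(b)  # squared standard error, group b
    t = (statistics.mean(a) - statistics.mean(b)) / math.sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df


def log2_fpkm(values, pseudocount=1.0):
    """log2(FPKM + pseudocount); the pseudocount is an arbitrary choice."""
    return [math.log2(v + pseudocount) for v in values]


# Made-up FPKM values for a single gene in two populations.
pop_a = [0.0, 1.0, 3.0]    # log2(x + 1) -> 0, 1, 2
pop_b = [7.0, 15.0, 31.0]  # log2(x + 1) -> 3, 4, 5

t_stat, df = welch_t(log2_fpkm(pop_a), log2_fpkm(pop_b))
print(round(t_stat, 3), round(df, 3))  # -3.674 4.0
```

The two-sided p-value then comes from a t distribution with `df` degrees of freedom (in R, `2 * pt(-abs(t), df)`). A gene with identical values in every sample gives a zero denominator and has to be skipped.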


          • #6
            Yeah, that's what I was planning to do as well. But I read online about other tests based on the Poisson or negative binomial distributions, and I wasn't sure whether these are better than the t-test?


            • #7
              Only the t-test is compatible with FPKMs.


              • #8
                So if I wanted to use more complicated tests based on the Poisson or negative binomial distributions, I would have to use the raw read counts, right? Where do I get these, assuming I used TopHat and then Cufflinks? Do I have to convert FPKM to read counts, or are the read counts in the accepted_hits.bam file?


                • #9
                  The general workflow is to map with TopHat (or STAR or whatever else you want to use) and then quantify with htseq-count or featureCounts. The latter two will give you counts that you can use in a negative binomial model. If you used Cufflinks to find new features, then just run it first and use the merged GTF file with the aforementioned counting programs. I wouldn't bother with a Poisson model; it's not worth your time.

                  BTW, there's no great way to convert between FPKM and raw counts, since the latter don't use multimappers while the former does.


                  • #10
                    Ok, thanks Devon! So I have my count data, but how do I go about using the negative binomial model? I am not very familiar with these types of models (since I just started in this field very recently). For the project I don't want to use existing tools like edgeR, which use negative binomial models, but would prefer programming my own model in R. Is there a way to do this, and how would I get the p-values to compare the differential expression of the genes?

                    It would be great if you could give me some advice on how to program the negative binomial model into R and any resources that you think would help me do this in order to compare differential gene expression between the two populations.

                    Thanks for all your help!


                    • #11
                      The simplest way would be to use glm.nb() from the MASS package.


                      • #12
                        Originally posted by thickrick99
                        I am not very familiar with these types of models (since I just started in this field very recently). For the project I don't want to use existing tools like edgeR which use negative binomial models but would prefer programming my own model into R.
                        If you are not familiar with the relevant models, wouldn't it make sense to use existing tools?

                        Anyway, the edgeR methods, dispersion estimation especially, are sufficiently sophisticated that it is very unlikely you could reproduce them yourself in any reasonable amount of time.

                        For example, glm.nb() implements simple maximum likelihood for the dispersion parameter, which will markedly underestimate the true dispersions for RNA-seq data and hence give overly liberal DE results. You need software like edgeR to do better; there's no easy way around it.
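
To make "dispersion" concrete: under the negative binomial model the variance of a count is `mu + phi * mu^2`, where `phi` is the dispersion and `phi = 0` reduces to Poisson. The Python sketch below is a naive per-gene method-of-moments estimate. It is neither what glm.nb() does (maximum likelihood) nor what edgeR does (which shares information across genes); it assumes equal library sizes across samples, and the function name and counts are made up.

```python
import statistics


def moment_dispersion(counts):
    """Naive method-of-moments NB dispersion for one gene:
    phi = (sample variance - mean) / mean^2, floored at 0.
    Assumes all samples have the same library size."""
    m = statistics.mean(counts)
    if m == 0:
        return 0.0  # no reads at all: nothing to estimate
    v = statistics.variance(counts)
    return max(0.0, (v - m) / (m * m))


# Made-up counts for two genes, three replicates each.
overdispersed = [80, 100, 120]  # mean 100, variance 400
poisson_like = [98, 100, 102]   # mean 100, variance 4 (below the mean)

phi_over = moment_dispersion(overdispersed)
phi_pois = moment_dispersion(poisson_like)
print(phi_over, phi_pois)  # 0.03 0.0
```

With only 2-5 replicates, a per-gene estimate like this swings wildly from gene to gene, which is part of why tools that borrow strength across genes exist.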


                        • #13
                          Ok thanks for your help Gordon. So I tried using the t-test on the FPKM values but I realized that I can't do the test because of the 0 FPKM values. Is there an accepted way to get around this so that I can do the t-test on the FPKM values?


                          • #14
                            It is impossible to do a high performance statistical test on FPKM values alone, because they have varying precisions, and the precision depends on the original count size rather than on the FPKM value itself.
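
A small illustration of that point, as a hedged Python sketch: two genes can share exactly the same FPKM while resting on very different numbers of reads. It uses the standard FPKM definition (fragments per kilobase per million mapped fragments) and takes Poisson counting noise as a lower bound on the relative error. The gene lengths, counts, and library size are invented.

```python
import math


def fpkm(fragments, length_bp, total_mapped_fragments):
    """Fragments per kilobase of transcript per million mapped fragments."""
    return fragments * 1e9 / (length_bp * total_mapped_fragments)


def poisson_cv(count):
    """Relative sampling error (sd / mean) of a Poisson count: 1 / sqrt(count)."""
    return 1.0 / math.sqrt(count)


LIBRARY_SIZE = 10_000_000  # made-up total mapped fragments

# Two made-up genes with the same FPKM but very different read support.
short_gene = {"fragments": 5, "length_bp": 500}
long_gene = {"fragments": 100, "length_bp": 10_000}

fpkm_short = fpkm(short_gene["fragments"], short_gene["length_bp"], LIBRARY_SIZE)
fpkm_long = fpkm(long_gene["fragments"], long_gene["length_bp"], LIBRARY_SIZE)
cv_short = poisson_cv(short_gene["fragments"])
cv_long = poisson_cv(long_gene["fragments"])

print(fpkm_short, fpkm_long)                 # 1.0 1.0
print(round(cv_short, 3), round(cv_long, 3)) # 0.447 0.1
```

Both genes report FPKM 1.0, but the short gene's value carries roughly 45% counting noise versus 10% for the long one, and a test that sees only the FPKM column cannot tell them apart.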
