  • Measuring gene expression from RNA-seq data

    Hi there,

    If I understand correctly, there are two ways to measure gene expression from RNA-seq data. The simple way is to use the tophat2 pipeline, which generates a file named "genes.fpkm_tracking". We can just extract the FPKM value for each transcript from that file.

    We can also use htseq-count to get read counts from an alignment file created by bwa, bowtie2, or some other aligner, and then convert the read counts into RPKM values with the formula: 10^9*counts/"total counts"/"transcript length".
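The conversion formula quoted above can be written out as a minimal sketch. The gene names, counts, lengths, and library size below are made-up values for illustration only, and "total counts" is assumed to mean the total number of counted reads in the library.

```python
def rpkm(count, total_counts, length_bp):
    """RPKM = 10^9 * count / (total_counts * length_bp), as in the post."""
    if total_counts <= 0 or length_bp <= 0:
        raise ValueError("total_counts and length_bp must be positive")
    return 1e9 * count / (total_counts * length_bp)


# Hypothetical inputs: per-gene read counts (e.g. from htseq-count),
# per-gene lengths in bp, and the library-wide total of counted reads.
counts = {"geneA": 500, "geneB": 1500}
lengths = {"geneA": 1000, "geneB": 3000}
total = 2_000_000

values = {gene: rpkm(counts[gene], total, lengths[gene]) for gene in counts}
for gene, value in sorted(values.items()):
    print(gene, value)
```

In this toy example geneB has three times the reads of geneA but is also three times as long, so both end up with the same RPKM. The result depends directly on whichever length is plugged in.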

    However, when comparing the results from the two methods above, I found that the FPKM and RPKM values were poorly correlated: about 0.5 for the Spearman correlation coefficient and 0.7 for the Pearson correlation coefficient.

    So, which method should we trust? I personally feel that the tophat one might be better since the author of htseq once mentioned that htseq is designed to test differential expression but not to quantify expression.

    Any ideas? Is there another or better way to measure gene expression from RNA-seq data?

    Thanks,

  • #2
    I don't think bwa or bowtie2 deal with spliced reads, so you should use the same tophat2 alignment file (bam file) for htseq and cufflinks. You might find the correlation is a little closer after that.


    • #3
      Two more things to consider:

      1. htseq-count is not a tool to estimate the strength of gene expression in a sample. It is a tool to provide input for methods that test for differential expression. See my post #4 in this thread.

      (Personally I do not see much value in quantifying expression in a sample. What does it help me to know that one gene's expression is stronger than another gene's? Unless the difference is really strong (and then we do not need to discuss subtleties of counting strategies), we cannot infer that the gene with more mRNA will also produce more protein.)

      2. What have you used as "transcript length" when converting counts to RPKM values? One of the core points that the cufflinks authors made in their 2010 paper is that finding the right value for the "transcript length" is actually the harder problem.


      • #4
        @Wallysb01
        I tried "bwa aln"/"bwa mem"/bowtie2/tophat2. The accepted_hits.bam from tophat2 shows a slightly higher correlation, but not substantially: about 0.55 compared to 0.5 from the other aligners.

        @Simon
        1. Yes, there is a need to quantify gene expression. Most people are interested in finding differentially expressed genes in case/control studies. However, we'd also like to know the relative gene expression level in a physiological condition. For example, we can use this information to define housekeeping genes, or to find gene clusters in which genes tend to be coexpressed. You are right, more mRNA doesn't necessarily mean more protein, but this is what we can do at this time. Quantifying gene expression is cheaper than quantifying protein expression, and people generally assume that the two are correlated.

        2. I don't have much of an idea how to define "transcript length", so I use the naive approach of extracting the maximum transcript span from a given gtf file. (I use the same file as the tophat2/cufflinks input.)

        here is my code:

        awk '{print $NF"\t"$4"\t"$5}' mouse_refgene.gtf|awk '{n[$1]++;if(n[$1]==1){max[$1]=$3;min[$1]=$2}if(max[$1]<$3){max[$1]=$3};if(min[$1]>$2){min[$1]=$2}}END{for(x in n)print x"\t"max[x]-min[x]+1}'|sed 's/"//g;s/;//g'|sort > transcript_length
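As a rough illustration of what the one-liner above computes, here is a small Python sketch of the same "max span" idea (largest end minus smallest start, plus 1) next to the summed exon length. The summed exon length is shown only for contrast and is not something the thread prescribes. The exon records and transcript IDs are invented, and the sketch assumes 1-based inclusive coordinates and non-overlapping exons within a transcript.

```python
from collections import defaultdict


def transcript_span(exons):
    """Return {transcript_id: max_end - min_start + 1}; introns are included."""
    lo, hi = {}, {}
    for tid, start, end in exons:
        lo[tid] = min(start, lo.get(tid, start))
        hi[tid] = max(end, hi.get(tid, end))
    return {tid: hi[tid] - lo[tid] + 1 for tid in lo}


def exonic_length(exons):
    """Return {transcript_id: sum of exon lengths}; introns are excluded."""
    total = defaultdict(int)
    for tid, start, end in exons:
        total[tid] += end - start + 1
    return dict(total)


# Hypothetical exon records: (transcript_id, start, end).
exons = [
    ("tx1", 100, 199),    # first exon of tx1, 100 bp
    ("tx1", 1000, 1099),  # second exon of tx1, 100 bp, after an 800 bp intron
    ("tx2", 500, 799),    # single-exon transcript, 300 bp
]

spans = transcript_span(exons)
exonic = exonic_length(exons)
for tid in sorted(spans):
    print(tid, spans[tid], exonic[tid])
```

For a multi-exon transcript the two numbers can differ a lot (here 1000 versus 200 for the made-up tx1), so RPKM values computed from the span are scaled down accordingly.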


        • #5
          @rainbird

          "For example, we can use this information to define housekeeping genes, or to find gene clusters in which genes tend to be coexpressed."
          But these tasks, too, can only be done in relative terms. You need many biological replicates to define housekeeping genes (low CV) and coexpressed clusters (highly correlated across these samples), assuming that coexpressed genes are correlated in expression across samples but NOT in expression height (transcription factors and the genes they regulate are correlated across samples but not equally highly expressed).
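The two criteria mentioned here (low coefficient of variation for housekeeping genes, high correlation across samples for coexpressed genes) can be sketched as follows. The expression values and gene labels are invented, and four samples are far too few in practice; the point is only that the transcription factor and its target correlate perfectly while sitting at very different expression levels.

```python
from statistics import mean, stdev


def cv(values):
    """Coefficient of variation: sample standard deviation divided by the mean."""
    return stdev(values) / mean(values)


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)


# Hypothetical expression values across four samples.
expr = {
    "housekeeping": [100, 102, 98, 101],  # nearly constant across samples
    "tf": [5, 10, 20, 40],                # lowly expressed regulator
    "target": [500, 1000, 2000, 4000],    # 100x higher, same pattern
}

print("CV housekeeping:", cv(expr["housekeeping"]))
print("CV tf:", cv(expr["tf"]))
print("r(tf, target):", pearson(expr["tf"], expr["target"]))
```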


          • #6
            [QUOTE=Rainbird;107207]However, we'd also like to know the relative gene expression level in a physiological condition.[/QUOTE]

            You wrote it yourself: "relative". If a method for expression estimation always undercounts gene A and overcounts gene B, but does so consistently by the same factor in all samples, it won't mess up your analysis. The important thing is to get the differences between samples right, not so much the absolute value in a sample. This is why I think that your method of comparing absolute expression values in a scatter plot is not very helpful. Try a scatter plot of log expression ratios (log fold changes) between two samples, comparing the ratios from one method to those from the other method.

            In these ratios, transcript length cancels out. And this is why I don't care about it. (The cufflinks people care about it because they say, rightly, that it may change because a gene may produce short transcripts in one sample and longer ones in another. This is another reason why using transcript lengths from annotation is not very helpful.)
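A small sketch of why the length cancels: if the same "transcript length" is used for a gene in both samples, the log fold change of its RPKM values is the same whatever that length is. The counts and library sizes below are made up, and the sketch assumes the length really is identical in the two samples, which is exactly the assumption the cufflinks authors question.

```python
import math


def rpkm(count, total_counts, length_bp):
    """RPKM = 10^9 * count / (total_counts * length_bp)."""
    return 1e9 * count / (total_counts * length_bp)


def log2_fold_change(value_b, value_a):
    """log2 of the ratio value_b / value_a."""
    return math.log2(value_b / value_a)


# Hypothetical counts for one gene in two samples, plus library sizes.
count_s1, total_s1 = 200, 1_000_000
count_s2, total_s2 = 800, 2_000_000

# Try several very different length choices for the same gene.
lfc_by_length = {
    length: log2_fold_change(
        rpkm(count_s2, total_s2, length),
        rpkm(count_s1, total_s1, length),
    )
    for length in (500, 2000, 10000)
}

for length, lfc in lfc_by_length.items():
    print(length, lfc)
```

Every length choice gives the same log2 fold change, so a scatter plot of fold changes from two methods compares them without depending on how the length was defined.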
