
**Wallysb01** · 06-09-2013, 01:04 PM

I don't think bwa or bowtie2 deal with spliced reads, so you should use the same tophat2 alignment file (bam file) for htseq and cufflinks. You might find the correlation is a little closer after that.

**Simon Anders** · 06-09-2013, 01:36 PM

Two more things to consider:

1. htseq-count is not a tool to estimate the strength of gene expression in a sample. It is a tool to provide input for methods to test for differential expression. See my post #4 in this thread.

(Personally I do not see much value in quantifying expression in a sample. What does it help me to know that one gene's expression is stronger than another gene's? Unless the difference is really strong (and then we do not need to discuss subtleties of counting strategies), we cannot infer that the gene with more mRNA will also produce more protein.)

2. What have you used as "transcript length" when converting counts to RPKM values? One of the core points that the cufflinks authors made in their 2010 paper is that finding the right value for the "transcript length" is actually the harder problem.
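To make the dependence on "transcript length" concrete, here is a minimal sketch of the usual RPKM conversion (reads per kilobase per million mapped reads). The function name and all numbers are made up for illustration; they are not from anyone's data in this thread.

```python
def rpkm(count, length_bp, total_mapped_reads):
    """Reads per kilobase of transcript per million mapped reads."""
    if length_bp <= 0 or total_mapped_reads <= 0:
        raise ValueError("length and library size must be positive")
    # count / (length in kb) / (library size in millions)
    return count * 1e9 / (length_bp * total_mapped_reads)


# Hypothetical numbers: the same 500 reads give very different RPKM values
# depending on which "transcript length" is plugged in.
reads, library_size = 500, 20_000_000
rpkm_exonic = rpkm(reads, 2_000, library_size)   # 2 kb of summed exons
rpkm_span = rpkm(reads, 50_000, library_size)    # 50 kb genomic span, introns included
print(rpkm_exonic, rpkm_span)
```

With these invented values the two choices of length differ by a factor of 25, which is why the choice matters when absolute values are compared.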

**Rainbird** · 06-09-2013, 07:22 PM

@Wallysb01
I tried "bwa aln"/"bwa mem"/bowtie2/tophat2. The accepted_hits.bam from tophat2 shows a slightly higher correlation, but not substantially so: about 0.55, compared to 0.5 from the other aligners.

@Simon
1, Yes, there is a need to quantify gene expression. Most people are interested in finding differentially expressed genes in case/control studies. However, we'd also like to know the relative gene expression level in a physiological condition. For example, we can use this information to define housekeeping genes, or to find gene clusters in which genes tend to be coexpressed. You are right, more mRNA doesn't necessarily mean more protein, but this is what we can do at this time. Quantifying gene expression is cheaper than quantifying protein expression, and people generally make the assumption that there exists a correlation.

2, I don't have much idea about how to define "transcript length", so I used the naive way of extracting the max transcript span from a given gtf file. (I use the same file as input for tophat2/cufflinks.)

here is my code:

awk '{print $NF"\t"$4"\t"$5}' mouse_refgene.gtf|awk '{n[$1]++;if(n[$1]==1){max[$1]=$3;min[$1]=$2}if(max[$1]<$3){max[$1]=$3};if(min[$1]>$2){min[$1]=$2}}END{for(x in n)print x"\t"max[x]-min[x]+1}'|sed 's/"//g;s/;//g'|sort > transcript_length
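Note that the one-liner above reports the genomic span (max end minus min start plus one), which includes introns. As a point of comparison, here is a rough sketch that reports both the summed exon length and the span per transcript. It assumes a tab-separated GTF with `exon` features, a `transcript_id "..."` attribute in column 9, and non-overlapping exons within each transcript; the function name and the demo lines are hypothetical.

```python
import re
from collections import defaultdict

TRANSCRIPT_ID = re.compile(r'transcript_id "([^"]+)"')


def transcript_lengths(gtf_lines):
    """Return {transcript_id: (exonic_length, genomic_span)} from GTF exon lines."""
    exons = defaultdict(list)
    for line in gtf_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comments
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9 or fields[2] != "exon":
            continue  # only exon features contribute to transcript length
        match = TRANSCRIPT_ID.search(fields[8])
        if match:
            exons[match.group(1)].append((int(fields[3]), int(fields[4])))

    result = {}
    for tid, intervals in exons.items():
        # GTF coordinates are 1-based and inclusive, hence the +1.
        exonic = sum(end - start + 1 for start, end in intervals)
        span = max(e for _, e in intervals) - min(s for s, _ in intervals) + 1
        result[tid] = (exonic, span)
    return result


# Hypothetical two-exon transcript with a long intron in between.
demo = [
    'chr1\trefGene\texon\t1001\t1200\t.\t+\t.\tgene_id "G1"; transcript_id "T1";',
    'chr1\trefGene\texon\t5001\t5300\t.\t+\t.\tgene_id "G1"; transcript_id "T1";',
]
lengths = transcript_lengths(demo)
print(lengths)
```

For the demo transcript the exons add up to 500 bp while the span is 4300 bp, so an RPKM based on the span would come out much lower for genes with long introns.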

**dietmar13** · 06-09-2013, 09:51 PM

@rainbird

> For example, we can use this information to define housekeeping genes, or to find gene clusters in which genes tend to be coexpressed.

But these tasks, too, can only be done in relative terms. You need many biological replicates to define housekeeping genes (low CV) and coexpressed clusters (highly correlated across those samples), assuming that coexpressed genes are correlated in expression across samples but NOT in expression height (transcription factors and their regulated genes are correlated across samples but not equally highly expressed).
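A small sketch of the two quantities mentioned here, coefficient of variation across replicates and correlation across samples. The expression values are invented for illustration, and the helper names are hypothetical.

```python
import math


def coefficient_of_variation(values):
    """Sample standard deviation divided by the mean."""
    mean = sum(values) / len(values)
    variance = sum((v - mean) ** 2 for v in values) / (len(values) - 1)
    return math.sqrt(variance) / mean


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


# Hypothetical normalized expression values across five samples.
housekeeping = [100.0, 102.0, 98.0, 101.0, 99.0]   # stable: low CV
tf = [2.0, 4.0, 1.0, 8.0, 3.0]                     # lowly expressed regulator
target = [200.0, 400.0, 100.0, 800.0, 300.0]       # 100x higher, same pattern

cv_housekeeping = coefficient_of_variation(housekeeping)
cv_tf = coefficient_of_variation(tf)
r_tf_target = pearson(tf, target)
print(cv_housekeeping, cv_tf, r_tf_target)
```

In this toy example the regulator and its target are perfectly correlated across samples although their levels differ by two orders of magnitude, and both statistics only exist because there are several samples to compute them over.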

**Simon Anders** · 06-09-2013, 10:21 PM

[QUOTE=Rainbird;107207]However, we'd also like to know the relative gene expression level in a physiological condition.[/QUOTE]

You wrote it yourself: "relative". If a method for expression estimation always undercounts gene A and overcounts gene B, but does so consistently by the same factor in all samples, it won't mess up your analysis. The important thing is to get the differences between samples right, not so much the absolute value in a sample. This is why I think that your method of comparing absolute expression values in a scatter plot is not very helpful. Try a scatter plot of log expression ratios (log fold changes) between two samples, comparing the ratios from one method to those from the other method.

In these ratios, transcript length cancels out. And this is why I don't care about it. (The cufflinks people care about it because they say, rightly, that it may change because a gene may produce short transcripts in one sample and longer ones in another. This is another reason why using transcript lengths from annotation is not very helpful.)
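The cancellation can be checked with a few lines. This is only a sketch with invented counts and library sizes; it assumes nonzero counts and, importantly, that the same length is used for the gene in both samples (the very assumption the cufflinks argument questions). The function name is hypothetical.

```python
import math


def log2_ratio_from_rpkm(count_a, count_b, lib_a, lib_b, length_bp):
    """log2 fold change between samples A and B, computed via RPKM."""
    rpkm_a = count_a * 1e9 / (length_bp * lib_a)
    rpkm_b = count_b * 1e9 / (length_bp * lib_b)
    return math.log2(rpkm_a / rpkm_b)


# Hypothetical gene: 300 reads in sample A (20M mapped reads),
# 1200 reads in sample B (40M mapped reads).
lfc_short = log2_ratio_from_rpkm(300, 1200, 20e6, 40e6, length_bp=1_500)
lfc_long = log2_ratio_from_rpkm(300, 1200, 20e6, 40e6, length_bp=90_000)

# The same value falls out of library-size-normalized counts alone.
lfc_counts = math.log2((300 / 20e6) / (1200 / 40e6))
print(lfc_short, lfc_long, lfc_counts)
```

Whatever length is plugged in, the log ratio stays the same, so a scatter plot of such ratios from two counting methods compares the methods without depending on how "transcript length" was defined.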

Measuring gene expression from RNA-seq data
