**gringer** · 02-03-2013, 05:43 PM

Originally posted by syintel87

I want to use HTSeq-count, by using new.combined.gtf.
However, in new.combined.gtf, among gene_id and oID and transcript_id and tss_id, which one do I have to choose? My thought is to use oID or tssID. But some reads that have different oID share the same tssID. In other words, some reads that have been previously annotated and other reads that have not been annotated have different oID but the same tssID.

TSS -> Transcript Start Site (according to convention, and cufflinks documentation). This is related to the gene_id. There will be multiple isoforms / transcripts that have the same transcript start site, and arise from different splicing events.

It looks like 'oId' is the name of the original transcript (or Cufflinks' own name if it has found a novel transcript):

http://cufflinks.cbcb.umd.edu/manual.html#cuffcomp_output

There also should be a .tmap file that links transcripts to genes. The first column of this file is "The gene_name attribute of the reference GTF record for this transcript, if present. Otherwise gene_id is used." If you're doing gene-level comparisons, that's probably what you want to use. For transcript-level comparisons, the second column names are probably appropriate.
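To make the gene/transcript link concrete, here is a minimal sketch that groups the second-column IDs of a .tmap-style table under the first-column gene label. It only assumes a tab-delimited file with one header row and works by column position; the header names and IDs in the example are made-up placeholders, not real cuffcompare output.

```python
import csv
import io
from collections import defaultdict


def transcripts_by_gene(handle):
    """Group second-column IDs under the first-column gene label."""
    reader = csv.reader(handle, delimiter="\t")
    next(reader, None)  # skip the (assumed) header row
    groups = defaultdict(list)
    for row in reader:
        if len(row) < 2:
            continue  # ignore blank or malformed lines
        groups[row[0]].append(row[1])
    return dict(groups)


# Hypothetical table: two transcripts of geneA, one of geneB.
example = io.StringIO(
    "gene\ttranscript\tclass_code\n"
    "geneA\ttxA1\t=\n"
    "geneA\ttxA2\tj\n"
    "geneB\ttxB1\t=\n"
)
gene_to_transcripts = transcripts_by_gene(example)
```

For gene-level work you would use the dictionary keys; for transcript-level work, the values.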

Q: My goal is to see differentially expressed genes across the five different time points. To achieve this goal, when using HTSeq-count, which id is reasonable to choose as option?

Any particular reason why you want to use HTSeq-count? What do you propose to use as your count data?

Cufflinks has its own recommended differential expression test method, cuffdiff, which identifies differences at both a gene level and an isoform level:

http://cufflinks.cbcb.umd.edu/manual.html#gene_exp_diff

**dpryan** · 02-04-2013, 01:56 AM

You're best off using gene_id. tss_id will definitely work poorly (you could maybe use that with DEXSeq) and oId will also work poorly (though not nearly as bad as tss_id).

gringer: I've done what syintel87 is trying to do, so I'll give what my reason was. In my case, I had a lot of reads mapping to UTR regions (and the occasional exon) that weren't in the annotation that I was using. cufflinks can pick a lot of these up and spit out a merged GTF file (with exons added or extended, as appropriate) that better represented the apparent structure of the genes. In my case, that really didn't end up making much of a difference in the expression analysis (it just shuffled around genes on the margin of significance), but perhaps for others it proves more useful.

**vkartha** · 02-04-2013, 12:47 PM

I have a question, and I would really appreciate a prompt reply!

I am performing an RNASeq analysis and I have paired-end 101 bp reads. I want to generate count data and perform a DE analysis using DESeq so that I can identify genes that are differentially expressed between 2 conditions (for which we have replicates).

I have run htseq-count using an alignment file (a SAM file sorted by read name, of course) and the latest version of Ensembl's annotation file as input to obtain this count data. I ran this using two different identifiers (-i option).

1) gene_id (default, and 'preferred' for RNASeq)
2) gene_name

I would prefer to avoid the headache of converting these gene IDs into gene symbols, which are easier for everyone to look at and identify. But I realized that these 2 runs (keeping the mode and everything else constant) yielded a different number of rows (entries) in the resulting count dataset.

Why is this? Isn't it a 1 to 1 (id to gene name) mapping/conversion? I'm not sure what differs in how it works between the two runs. Am I losing anything by using gene_name instead of gene_id?

Again, any help would be greatly appreciated.

Thanks

**Simon Anders** · 02-05-2013, 01:27 AM

Originally posted by vkartha

Isn't it a 1 to 1 (id to gene name) mapping/conversion?

Doesn't seem so, does it?

I quickly checked with Biomart, and the first thing I noticed is that snoRNAs that have multiple copies appear with different ENSG gene IDs but the same gene symbol. I'm sure there are many such things, probably especially (but maybe only) regarding non-coding genes.

Sometimes it seems to me that the biggest everyday task in bioinformatics is struggling with ID conversions.
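If you want to check this on your own annotation, a rough sketch along these lines lists every gene_name that is carried by more than one gene_id. It assumes a standard 9-column GTF with `key "value";` attributes; the records below are invented for illustration, and the function names are my own.

```python
import re
from collections import defaultdict

# Matches attribute pairs of the form: key "value"
ATTR_RE = re.compile(r'(\w+)\s+"([^"]*)"')


def parse_attributes(field):
    """Turn a GTF attribute column into a dict (last value wins per key)."""
    return dict(ATTR_RE.findall(field))


def names_with_multiple_ids(gtf_lines):
    """Return {gene_name: set of gene_ids} for names shared by several IDs."""
    ids_by_name = defaultdict(set)
    for line in gtf_lines:
        if line.startswith("#") or not line.strip():
            continue  # skip comments and blank lines
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 9:
            continue  # not a complete GTF record
        attrs = parse_attributes(cols[8])
        if "gene_id" in attrs and "gene_name" in attrs:
            ids_by_name[attrs["gene_name"]].add(attrs["gene_id"])
    return {name: ids for name, ids in ids_by_name.items() if len(ids) > 1}


# Invented records: two gene IDs share one symbol, a third is unique.
example = [
    '1\tx\texon\t100\t200\t.\t+\t.\tgene_id "ID_A"; gene_name "SNO_X";',
    '2\tx\texon\t500\t600\t.\t-\t.\tgene_id "ID_B"; gene_name "SNO_X";',
    '3\tx\texon\t900\t950\t.\t+\t.\tgene_id "ID_C"; gene_name "GENE_Y";',
]
shared = names_with_multiple_ids(example)
```

With `-i gene_name`, the counts for all IDs behind a shared name end up in a single row, which is one way the two runs can differ in row count.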

**syintel87** · 02-06-2013, 01:56 AM

Originally posted by dpryan

You're best off using gene_id. tss_id will definitely work poorly (you could maybe use that with DEXSeq) and oId will also work poorly (though not nearly as bad as tss_id).

gringer: I've done what syintel87 is trying to do, so I'll give what my reason was. In my case, I had a lot of reads mapping to UTR regions (and the occasional exon) that weren't in the annotation that I was using. cufflinks can pick a lot of these up and spit out a merged GTF file (with exons added or extended, as appropriate) that better represented the apparent structure of the genes. In my case, that really didn't end up making much of a difference in the expression analysis (it just shuffled around genes on the margin of significance), but perhaps for others it proves more useful.

1.
So, you mean tss_id and oId work poorly, so that gene_id has to be used?

2.
My concern is the possibility that, even within the same gene, some transcripts could be up-regulated and other transcripts could be down-regulated. This is why I want to analyze the data at the transcript level. Does my thinking make sense biologically?

3.
My data is RNA-seq. And my goal is to see the differentially expressed genes by using edgeR. I am concerned about some reads that are mapped to the reference genome but are not annotated, so I want to use cuffcompare to combine the original.gtf file and the transcripts.gtf files.
Though considering my situation, since tss_id is working poorly, the way to go is to use gene_id, ignoring transcript levels?

**dpryan** · 02-06-2013, 02:03 AM

Originally posted by syintel87

1.
So, you mean tss_id and oId work poorly, so that gene_id has to be used?

Yes

2.
My concern is the possibility that, even within the same gene, some transcripts could be up-regulated and other transcripts could be down-regulated. This is why I want to analyze the data at the transcript level. Does my thinking make sense biologically?

That's why you use something like DEXSeq in addition to something like edgeR (or DESeq, for similarity).

3.
My data is RNA-seq. And my goal is to see the differentially expressed genes by using edgeR. I am concerned about some reads that are mapped to the reference genome but are not annotated, so I want to use cuffcompare to combine the original.gtf file and the transcripts.gtf files.
Though considering my situation, since tss_id is working poorly, the way to go is to use gene_id, ignoring transcript levels?

Gene and transcript level analyses needn't be the same. I wouldn't try to shoehorn an analysis into a pipeline designed for something else when more appropriate pipelines exist.

**syintel87** · 02-06-2013, 02:23 AM

Originally posted by dpryan

Yes

That's why you use something like DEXSeq in addition to something like edgeR (or DESeq, for similarity).

Gene and transcript level analyses needn't be the same. I wouldn't try to shoehorn an analysis into a pipeline designed for something else when more appropriate pipelines exist.

Thank you so much!!!!

So, you mean to see differential expression at gene level, gene_id has to be used in htseq count, and edgeR analysis will be fine.

Also, to see differential expression at transcript level, tss_id(???) has to be used in htseq count, and DEXSeq or something will be fine.

Do I understand correctly?

**dpryan** · 02-06-2013, 02:37 AM

Originally posted by syintel87

Thank you so much!!!!

So, you mean to see differential expression at gene level, gene_id has to be used in htseq count, and edgeR analysis will be fine.

Exactly

Also, to see differential expression at transcript level, tss_id has to be used in htseq count, and DEXSeq or something will be fine.

Do I understand correctly?

As the Germans I work with would say, jein ("yes and no"). For DEXSeq, you would use something like dexseq_count.py (read the DEXSeq vignette). DEXSeq also technically looks at differential exon usage, which is related to transcript level expression changes but not exactly the same. Browse through bioconductor (and the literature/this forum, since a lot of them are stand-alone apps) for other ways of looking at transcript level changes.

**syintel87** · 02-06-2013, 05:46 AM

Originally posted by dpryan

Exactly

As the Germans I work with would say, jein ("yes and no"). For DEXSeq, you would use something like dexseq_count.py (read the DEXSeq vignette). DEXSeq also technically looks at differential exon usage, which is related to transcript level expression changes but not exactly the same. Browse through bioconductor (and the literature/this forum, since a lot of them are stand-alone apps) for other ways of looking at transcript level changes.

Now I am thinking of two ways.

1.
1) original.gtf (original annotation ) + transcripts.gtf ( output from cufflinks ) => cuffcompare => combined.gtf
2) HTSeq-count by using combined.gtf and its "gene_id".

2.
1) original.gtf (original annotation ) + transcripts.gtf ( output from cufflinks ) => cuffcompare => combined.gtf
2) python dexseq_prepare_annotation.py combined.gtf combined_DEXSeq.gtf
3) python dexseq_count.py -p no -s no -a 10 combined_DEXSeq.gtf accepted_hits.sam treatment1_counts.txt
4) read.HTSeqCounts( treatment1_counts.txt, design, flattenedfile = "combined_DEXSeq.gtf" )

Would you please give me a tip about whether these two ways seem to make sense?

Thank you very much!!!

**gringer** · 02-06-2013, 06:58 PM

Originally posted by syintel87

2) HTSeq-count by using combined.gtf and its "gene_id".

HTSeq-count on its own is not going to be all that useful due to normalisation issues. It needs to be paired up with another program that can filter out the noise and make the true positive results more obvious. At least use cuffdiff if you want some results directly from cufflinks' results files.

**dpryan** · 02-07-2013, 01:16 AM

Originally posted by gringer

HTSeq-count on its own is not going to be all that useful due to normalisation issues. It needs to be paired up with another program that can filter out the noise and make the true positive results more obvious. At least use cuffdiff if you want some results directly from cufflinks' results files.

From earlier in the thread, there's an implicit "use edgeR" third step
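Between htseq-count and edgeR, the per-sample count files have to be combined into one genes-by-samples table. A minimal sketch of that step, assuming each file is the usual two-column, tab-delimited htseq-count output; the sample names and counts are made up, and the function names are hypothetical. The summary rows at the end of each file (`no_feature`, `ambiguous`, and so on, with a `__` prefix in newer HTSeq versions) are dropped so that they are not treated as genes.

```python
import csv
import io

# Summary counters that htseq-count appends after the per-feature rows.
SPECIAL = {"no_feature", "ambiguous", "too_low_aQual",
           "not_aligned", "alignment_not_unique"}


def read_counts(handle):
    """Read one htseq-count output into {feature_id: count}."""
    counts = {}
    for row in csv.reader(handle, delimiter="\t"):
        if len(row) != 2:
            continue  # skip blank or malformed lines
        feature, value = row
        if feature.startswith("__") or feature in SPECIAL:
            continue  # skip summary counters
        counts[feature] = int(value)
    return counts


def merge_counts(per_sample):
    """Combine {sample: {feature: count}} into a header and sorted rows."""
    samples = list(per_sample)
    features = sorted(set().union(*[c.keys() for c in per_sample.values()]))
    header = ["gene_id"] + samples
    rows = [[f] + [per_sample[s].get(f, 0) for s in samples] for f in features]
    return header, rows


# Made-up outputs for two samples (old- and new-style summary rows).
t1 = read_counts(io.StringIO("geneA\t10\ngeneB\t0\n__no_feature\t5\n"))
t2 = read_counts(io.StringIO("geneA\t3\ngeneC\t7\nno_feature\t2\n"))
header, rows = merge_counts({"t1": t1, "t2": t2})
```

The resulting table can be written out as a tab-delimited file and read into R as the count matrix; features missing from a sample are filled with 0 here, which is a choice of this sketch, not something htseq-count does.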

Which ID should be used for HTSeq-count?