Unconfigured Ad

Collapse
X
 
  • Filter
  • Time
  • Show
Clear All
new posts
  • Jane M
    Senior Member
    • Aug 2011
    • 239

    Which feature file for htseq-count for non coding elements of ribodepleted samples?

    Dear all,

    Sorry if this question has been asked before, I have not found similar topics on the web, but there are probably...
    Until now, I used a gtf downloaded from UCSC table (genome: human, assembly: hg19, group:genes and gene predictions, track: RefSeq genes, table: refFlat) as a feature file, with gene name, in GFF format for counting reads in genes with htseq-count.
    I have one experiment with ribodepleted samples (TruSeq total RNA Stranded). I first counted the reads and performed differential expression analysis using such a gtf. In this gtf downloaded few weeks ago, there are 26688 genes, including 913 "LINC*". This number of lincRNA seems low, so I wonder if this table is comprehensive for ribodepleted experiment.
    Can you please tell me which reference you use when interesting in non coding elements of ribodepleted experiments?

    Thank you for your feedback,
    Jane
  • dpryan
    Devon Ryan
    • Jul 2011
    • 3478

    #2
    UCSC references...kind of suck. Use Gencode/Ensembl and you'll get more complete coding and non-coding transcripts. Having said that, for lincRNAs you might want to check out lincrnadb or RNAcentral.

    Comment

    • hyeonjipark
      Junior Member
      • Dec 2015
      • 2

      #3
      I used gencode. but no counts.
      perhaps what column used htseq-count ?

      Comment

      • dpryan
        Devon Ryan
        • Jul 2011
        • 3478

        #4
        Presumably the lack of counts is due to the difference in chromosome names.

        Comment

        • Jane M
          Senior Member
          • Aug 2011
          • 239

          #5
          Thank you dpryan for your answer.

          Originally posted by dpryan View Post
          UCSC references...kind of suck.
          Don't you thing this refFlat reference is sufficient for experiments with polyA RNA selection?


          Originally posted by dpryan View Post
          Use Gencode/Ensembl and you'll get more complete coding and non-coding transcripts. Having said that, for lincRNAs you might want to check out lincrnadb or RNAcentral.
          Since I am working with hg19 annotation, I downloaded the gtf from http://www.gencodegenes.org/releases/19.html for "ALL" regions, that is the gencode.v19.chr_patch_hapl_scaff.annotation file. I am not familiar with these annotation files. Is it the right one?

          I am currently looking at what this file contains: number of genes (name, ID, status), transcripts, ...
          It has not the same format like the refFlat file, so I expect some issues for read counting.

          Comment

          • dpryan
            Devon Ryan
            • Jul 2011
            • 3478

            #6
            The UCSC file annotations are never that good. The file you downloaded is fine, though again the chromosome names probably differ, which is what's causing problems. I think featureCounts can handle the chromosome naming difference internally, so try using that instead (it's much faster anyway).

            Comment

            • Jane M
              Senior Member
              • Aug 2011
              • 239

              #7
              Originally posted by dpryan View Post
              The UCSC file annotations are never that good. The file you downloaded is fine, though again the chromosome names probably differ, which is what's causing problems. I think featureCounts can handle the chromosome naming difference internally, so try using that instead (it's much faster anyway).
              Thank you again. No problem yet, since I did not try to count. Looking at the file: now, I have ~56600 different gene names, compared to ~26600 with refFlat. The file contains twice more lines. Looking forward to see how the number of LINC and AS will increase.
              chr1->22, chrX, chrY have the same names, but the additional chromosomes have indeed different names: chr17_ctg5_hap1, chr17_gl000205_random in refFlat and GL000191.1, GL000192.1. I start the tests right now!

              Comment

              • Jane M
                Senior Member
                • Aug 2011
                • 239

                #8
                Compared to how I ran htseq-count previously, I added -i option to get counts for each gene_name and not gene_ID. I could run htseq count on this new gtf without any problem
                Code:
                htseq-count --format=bam --order=name --mode=union -a 20 -i gene_name --stranded=reverse mybam.bam gencode.v19.chr_patch_hapl_scaff.annotation.gtf > mynewcounts.gencod.htseqCount
                But I still have a couple of questions:
                1. With the genecode annotation, we can find how many genes are attributed to each gene_type (30 gene types), like protein coding, lincRNA, pseudogene, rRNA,... Do you know where to find this information for the refFlat annotation?

                2. I am a bit confused by gene_name and gene_ID:
                When counting in gencode.v19.chr_patch_hapl_scaff.annotation.gtf the number of genes ("gene" in field 3), I get 63,568:
                Code:
                cat gencode.v19.chr_patch_hapl_scaff.annotation.gtf | cut -f3  | wc -l
                63568
                but the number of unique gene_name is 56,629:
                Code:
                cat gencode.v19.chr_patch_hapl_scaff.annotation.gtf | cut -d';' -f5 | sort | uniq | wc -l
                56629
                In the output file of htseq count, 56,629 are listed.

                Does someone know what are those genes with same gene_name but different gene_ID?
                Code:
                cat gencode.v19.chr_patch_hapl_scaff.annotation.gene.gtf | cut -f9  | sort | uniq | wc -l
                63568
                3. Finally, I noticed that the counting changed a lot between the 2 annotations, even for well known genes as TP53, the number of reads attributed to this gene - and others - double! I am surprised that the annotation of such very well known genes changes. Is it normal?

                Any clarification would be greatly appreciated.
                Last edited by Jane M; 05-11-2016, 06:49 AM.

                Comment

                • dpryan
                  Devon Ryan
                  • Jul 2011
                  • 3478

                  #9
                  1. Use things like "cut" and "uniq" to determine this. This isn't something you need to look up, just determine it yourself.
                  2. How does one define a gene? Is it a location, a sequence, something else? If you have essentially the same sequence on different chromosomes and both are expressed are they the same gene or different ones? In such cases, gencode/ensembl will give each instance a unique ID. UCSC will give each instance the same ID in such cases, which is a good way to completely break a LOT of programs.This is why one should normally quantify by gene ID. You can add gene names after everything is analysed.
                  3. UCSC annotations are rather minimalistic.

                  Comment

                  • Jane M
                    Senior Member
                    • Aug 2011
                    • 239

                    #10
                    Originally posted by dpryan View Post
                    1. Use things like "cut" and "uniq" to determine this. This isn't something you need to look up, just determine it yourself.
                    Well, there is no description in the tables I downloaded from UCSC. Otherwise, I could indeed check as I did with Genecode annotation. Here are some file header:

                    refFlat file (RefSeq), that I used until now:

                    Code:
                    chr1	hg19_refFlat	exon	11874	12227	0.000000	+	.	gene_id "DDX11L1"; transcript_id "DDX11L1"; 
                    chr1	hg19_refFlat	exon	12613	12721	0.000000	+	.	gene_id "DDX11L1"; transcript_id "DDX11L1"; 
                    chr1	hg19_refFlat	exon	13221	14409	0.000000	+	.	gene_id "DDX11L1"; transcript_id "DDX11L1"; 
                    chr1	hg19_refFlat	exon	14362	14829	0.000000	-	.	gene_id "WASH7P"; transcript_id "WASH7P"; 
                    chr1	hg19_refFlat	exon	14970	15038	0.000000	-	.	gene_id "WASH7P"; transcript_id "WASH7P"; 
                    chr1	hg19_refFlat	exon	15796	15947	0.000000	-	.	gene_id "WASH7P"; transcript_id "WASH7P";

                    refGene file (RefSeq):
                    Code:
                    chr1	hg19_refGene	start_codon	67000042	67000044	0.000000	+	.	gene_id "NM_032291"; transcript_id "NM_032291"; 
                    chr1	hg19_refGene	CDS	67000042	67000051	0.000000	+	0	gene_id "NM_032291"; transcript_id "NM_032291"; 
                    chr1	hg19_refGene	exon	66999639	67000051	0.000000	+	.	gene_id "NM_032291"; transcript_id "NM_032291"; 
                    chr1	hg19_refGene	CDS	67091530	67091593	0.000000	+	2	gene_id "NM_032291"; transcript_id "NM_032291"; 
                    chr1	hg19_refGene	exon	67091530	67091593	0.000000	+	.	gene_id "NM_032291"; transcript_id "NM_032291"; 
                    chr1	hg19_refGene	CDS	67098753	67098777	0.000000	+	1	gene_id "NM_032291"; transcript_id "NM_032291"; 
                    chr1	hg19_refGene	exon	67098753	67098777	0.000000	+	.	gene_id "NM_032291"; transcript_id "NM_032291"; 
                    chr1	hg19_refGene	CDS	67101627	67101698	0.000000	+	0	gene_id "NM_032291"; transcript_id "NM_032291";
                    knownGenes file (UCSC):
                    Code:
                    chr1	hg19_knownGene	exon	11874	12227	0.000000	+	.	gene_id "uc010nxr.1"; transcript_id "uc010nxr.1"; 
                    chr1	hg19_knownGene	exon	12646	12697	0.000000	+	.	gene_id "uc010nxr.1"; transcript_id "uc010nxr.1"; 
                    chr1	hg19_knownGene	exon	13221	14409	0.000000	+	.	gene_id "uc010nxr.1"; transcript_id "uc010nxr.1"; 
                    chr1	hg19_knownGene	start_codon	12190	12192	0.000000	+	.	gene_id "uc010nxq.1"; transcript_id "uc010nxq.1"; 
                    chr1	hg19_knownGene	CDS	12190	12227	0.000000	+	0	gene_id "uc010nxq.1"; transcript_id "uc010nxq.1"; 
                    chr1	hg19_knownGene	exon	11874	12227	0.000000	+	.	gene_id "uc010nxq.1"; transcript_id "uc010nxq.1"; 
                    chr1	hg19_knownGene	CDS	12595	12721	0.000000	+	1	gene_id "uc010nxq.1"; transcript_id "uc010nxq.1"; 
                    chr1	hg19_knownGene	exon	12595	12721	0.000000	+	.	gene_id "uc010nxq.1"; transcript_id "uc010nxq.1"; 
                    chr1	hg19_knownGene	CDS	13403	13636	0.000000	+	0	gene_id "uc010nxq.1"; transcript_id "uc010nxq.1"; 
                    chr1	hg19_knownGene	stop_codon	13637	13639	0.000000	+	.	gene_id "uc010nxq.1"; transcript_id "uc010nxq.1"; 
                    chr1	hg19_knownGene	exon	13403	14409	0.000000	+	.	gene_id "uc010nxq.1"; transcript_id "uc010nxq.1";
                    There might exist one file with comprehensive description.

                    Originally posted by dpryan View Post
                    2. How does one define a gene? Is it a location, a sequence, something else? If you have essentially the same sequence on different chromosomes and both are expressed are they the same gene or different ones? In such cases, gencode/ensembl will give each instance a unique ID. UCSC will give each instance the same ID in such cases, which is a good way to completely break a LOT of programs.This is why one should normally quantify by gene ID. You can add gene names after everything is analysed.
                    Thank you for the explanation.

                    Originally posted by dpryan View Post
                    3. UCSC annotations are rather minimalistic.
                    Ok, but I am very surprised for annotation of well known genes.

                    Comment

                    • GenoMax
                      Senior Member
                      • Feb 2008
                      • 7142

                      #11
                      Why are you not using the GTF file from UCSC instead of all those other files. You could create that from Table Browser. (iGenomes bundle does not have non-coding genes).

                      Edit: @Vikas Bansal has a solution to get the non-coding elements (you can choose GTF output in table browser) for hg19 here.
                      Last edited by GenoMax; 05-12-2016, 05:54 AM.

                      Comment

                      • Jane M
                        Senior Member
                        • Aug 2011
                        • 239

                        #12
                        Originally posted by GenoMax View Post
                        Why are you not using the GTF file from UCSC instead of all those other files. You could create that from Table Browser. (iGenomes bundle does not have non-coding genes).
                        I was actually using only the refFlat gtf from UCSC. It looks like what I showed.
                        I added the refGene and knownGenes gtf files to show that these files do not contain description neither.
                        There seem to be some non coding elements in refFlat gtf: ~900 LINC, ...

                        Originally posted by GenoMax View Post
                        Edit: @Vikas Bansal has a solution to get the non-coding elements (you can choose GTF output in table browser) for hg19 here.
                        Thank you, I will take a look. But it is probably easier to use genecode to get everything from a single gtf file.

                        Comment

                        Latest Articles

                        Collapse

                        • SEQadmin2
                          From Collection to Sequencing: Why Sample Preparation and Preservation Define Sequencing Data
                          by SEQadmin2


                          Data variability is still an issue in sequencing technologies despite the advances in reproducibility and accuracy of these platforms. But the problem does not originate in the sequencing itself, but in the previous steps, before the sample reaches the sequencer.


                          The first step is collection, followed by preservation and sample preparation for analysis. Most scientists overlook those steps, but not being careful might just be skewing the experiment’s results.
                          ...
                          06-02-2026, 10:05 AM
                        • SEQadmin2
                          Single-Cell Sequencing at an Inflection Point: Early Impacts of New Platforms and Emerging Trends
                          by SEQadmin2


                          With the launch of new single-cell sequencing platforms in 2026, the field stands at an exciting inflection point. This article surveys the most impactful advances in the field and discusses how they’re reshaping research in cancer, immunology, and beyond.


                          Introduction

                          Single-cell sequencing technologies have undergone remarkable advances over the past decade, transitioning from low-throughput experimental approaches to highly scalable platforms capable of...
                          05-22-2026, 06:42 AM
                        • SEQadmin2
                          Environmental Genomics in the Age of NGS: From Microbes to Conservation Strategies
                          by SEQadmin2

                          Studying ecosystems means dealing with complex, multi-species communities that are hard to observe at scale. This complexity, however, hides many important questions to be answered, from how biogeochemical cycles work and how climate change can affect species distribution to how conservation strategies can work best.


                          Genomics, particularly since the expansion of NGS, has transformed ecosystem ecology. By sequencing environmental DNA, we can now assess biodiversity without direct...
                          05-06-2026, 09:04 AM

                        ad_right_rmr

                        Collapse

                        News

                        Collapse

                        Topics Statistics Last Post
                        Started by SEQadmin2, 06-02-2026, 12:03 PM
                        0 responses
                        21 views
                        0 reactions
                        Last Post SEQadmin2  
                        Started by SEQadmin2, 06-02-2026, 11:40 AM
                        0 responses
                        14 views
                        0 reactions
                        Last Post SEQadmin2  
                        Started by SEQadmin2, 05-28-2026, 11:40 AM
                        0 responses
                        29 views
                        0 reactions
                        Last Post SEQadmin2  
                        Started by SEQadmin2, 05-26-2026, 10:12 AM
                        0 responses
                        31 views
                        0 reactions
                        Last Post SEQadmin2  
                        Working...