htseq-count for sam and gff3

cmbetts replied

06-19-2015, 09:40 AM
Originally posted by padmoo

Hi everyone,

I have a similar problem with my gff file. I get the error:

Error occured when processing GFF file (line 1 of file /gpfs/scratch/cbh12wsu/Thaps3_chromosomes_geneModels_FilteredModels2.gff):
Feature fgenesh1_pg.C_chr_1000001 does not contain a 'gene_id' attribute
[Exception type: ValueError, raised in count.py:53]

My GFF file looks like this:
chr_1 JGI exon 300 1153 . - . name "fgenesh1_pg.C_chr_1000001"; transcriptId 867
chr_1 JGI CDS 300 1153 . - 0 name "fgenesh1_pg.C_chr_1000001"; proteinId 867; exonNumber 10
chr_1 JGI exon 1199 2425 . - . name "fgenesh1_pg.C_chr_1000001"; transcriptId 867
chr_1 JGI CDS 1199 2425 . - 2 name "fgenesh1_pg.C_chr_1000001"; proteinId 867; exonNumber 9
chr_1 JGI exon 2512 2935 . - . name "fgenesh1_pg.C_chr_1000001"; transcriptId 867
chr_1 JGI CDS 2512 2935 . - 2 name "fgenesh1_pg.C_chr_1000001"; proteinId 867; exonNumber

I'm guessing there is something missing? Can anyone help?

Thanks!
Padmoo

In the GFF format, the final column holds a semicolon-separated list of attributes describing the feature. htseq-count looks for one called gene_id, which lets it assign counts at the gene level, but your GFF doesn't have that attribute. I'm not sure whether "fgenesh1_pg.C_chr_1000001" is supposed to be a gene or something else (it would make your life much easier if it is), but somehow you need to add something that looks like 'gene_id "myGene"' to the lines of the GFF that describe exons.
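
A minimal Python sketch of that idea, under the assumption that the `name "..."` attribute really is the gene identifier (this has not been confirmed for this JGI file). The function name `add_gene_id` is made up for illustration.

```python
import re

# Assumption: the value of the `name "..."` attribute identifies the gene.
NAME_RE = re.compile(r'name "([^"]+)"')


def add_gene_id(line: str) -> str:
    """Append gene_id "<name>" to the attribute column of one GFF line."""
    line = line.rstrip("\n")
    # Leave blank lines and comment/directive lines untouched.
    if not line or line.startswith("#"):
        return line
    match = NAME_RE.search(line)
    # Leave lines without a name, or lines that already carry a gene_id.
    if match is None or "gene_id" in line:
        return line
    return f'{line.rstrip("; ")}; gene_id "{match.group(1)}"'


example = (
    "chr_1\tJGI\texon\t300\t1153\t.\t-\t.\t"
    'name "fgenesh1_pg.C_chr_1000001"; transcriptId 867'
)
print(add_gene_id(example))
```

Running each line of the file through a function like this and writing the result to a new file would give htseq-count the attribute it is asking for, without modifying the original annotation.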
padmoo replied

06-19-2015, 09:14 AM
Hi everyone,

I have a similar problem with my gff file. I get the error:

Error occured when processing GFF file (line 1 of file /gpfs/scratch/cbh12wsu/Thaps3_chromosomes_geneModels_FilteredModels2.gff):
Feature fgenesh1_pg.C_chr_1000001 does not contain a 'gene_id' attribute
[Exception type: ValueError, raised in count.py:53]

My GFF file looks like this:
chr_1 JGI exon 300 1153 . - . name "fgenesh1_pg.C_chr_1000001"; transcriptId 867
chr_1 JGI CDS 300 1153 . - 0 name "fgenesh1_pg.C_chr_1000001"; proteinId 867; exonNumber 10
chr_1 JGI exon 1199 2425 . - . name "fgenesh1_pg.C_chr_1000001"; transcriptId 867
chr_1 JGI CDS 1199 2425 . - 2 name "fgenesh1_pg.C_chr_1000001"; proteinId 867; exonNumber 9
chr_1 JGI exon 2512 2935 . - . name "fgenesh1_pg.C_chr_1000001"; transcriptId 867
chr_1 JGI CDS 2512 2935 . - 2 name "fgenesh1_pg.C_chr_1000001"; proteinId 867; exonNumber

I'm guessing there is something missing? Can anyone help?

Thanks!
Padmoo
alvarani replied

03-27-2015, 02:57 AM
Ok, Thanks
dpryan replied

03-26-2015, 07:35 AM
You're probably better off with featureCounts, since it allows counts for alignments to multiple features (this will treat spliced alignments properly).

Anyway, the following bit of awk code should work with your BED file:

Code:

awk '{printf("%s\tawk\texon\t%i\t%i\t.\t%s\t.\texon_id \"%s%i\"\n", $1, $2+1, $3, $6, $4, NR)}' foo.bed > foo.gff

Then the exons will be labeled with the associated gene and then a number (the line number). I haven't tested this, but something like that should work.
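
For anyone who prefers Python to awk, here is a rough equivalent of the one-liner. It is only a sketch: it assumes a tab-separated BED6 file (name in column 4, strand in column 6), and the function names are invented for this example.

```python
def bed_to_gff_line(bed_line: str, line_number: int) -> str:
    """Convert one BED6 line to a GFF-style exon line, mirroring the awk one-liner."""
    fields = bed_line.rstrip("\n").split("\t")
    chrom, start, end, name, _score, strand = fields[:6]
    # BED start coordinates are 0-based, GFF start coordinates are 1-based.
    gff_start = int(start) + 1
    # Same labelling as the awk version: gene name followed by the line number.
    attributes = f'exon_id "{name}{line_number}"'
    return "\t".join(
        [chrom, "awk", "exon", str(gff_start), end, ".", strand, ".", attributes]
    )


def convert(bed_lines):
    """Convert an iterable of BED6 lines, numbering them from 1 like awk's NR."""
    return [bed_to_gff_line(line, n) for n, line in enumerate(bed_lines, start=1)]


# Hypothetical BED line, not taken from the poster's file.
for gff_line in convert(["chr9\t133589696\t133589852\tABL1\t0\t+\n"]):
    print(gff_line)
```

Because the attribute written here is exon_id rather than gene_id, htseq-count would presumably need to be told which attribute to use via its `-i` option.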
alvarani replied

03-26-2015, 07:23 AM
Dear Devon,
Thanks for the prompt reply. I would like to ask a follow-up question: I have the targeted regions in BED format and converted them to the format shown in my earlier post.

To extract the read counts for exons, which tool would you suggest, and what format should the BED file be in?
dpryan replied

03-26-2015, 07:14 AM
That's neither GFF nor GTF format. If bed2gff produced that then the program is broken.
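
To illustrate what is likely going wrong: the ninth column of a GTF line is expected to hold `key "value"` pairs (or `key=value` pairs in GFF3), whereas the posted lines have a bare `ABL1` there. The check below is a loose sketch of that rule, not HTSeq's actual parser, and the name `attribute_column_ok` is invented. It also ignores corner cases such as semicolons inside quoted values.

```python
import re

GTF_PAIR = re.compile(r'^\S+ ("[^"]*"|\S+)$')  # e.g. gene_id "ABL1"
GFF3_PAIR = re.compile(r"^[^=\s]+=.+$")        # e.g. ID=ABL1


def attribute_column_ok(line: str) -> bool:
    """Return True if the line has 9 tab-separated columns and key/value attributes."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 9:
        return False
    pairs = [p.strip() for p in fields[8].split(";") if p.strip()]
    return bool(pairs) and all(
        GTF_PAIR.match(p) or GFF3_PAIR.match(p) for p in pairs
    )


posted = "chr9\tbed2gff\texon\t133589697\t133589852\t0\t+\t.\tABL1"
repaired = 'chr9\tbed2gff\texon\t133589697\t133589852\t0\t+\t.\tgene_id "ABL1"'
print(attribute_column_ok(posted), attribute_column_ok(repaired))
```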
alvarani replied

03-26-2015, 07:11 AM
Htseq-count throws error

Dear all,

I am using htseq-count to get exonic read counts from targeted sequencing data.
I have an output BAM file with alignments, and I need the read counts for the respective exons. The exon GTF file looks like this:
chr9 bed2gff exon 133589697 133589852 0 + . ABL1
chr9 bed2gff exon 133709410 133709441 0 + . ABL1
chr9 bed2gff exon 133710251 133710279 0 + . ABL1

And the error I receive from htseq-count is:
Error occurred when processing GFF file (line 1 of file Regions_new.GTF):
Failure parsing GFF attribute line
[Exception type: ValueError, raised in __init__.py:164]
Please let me know why I am getting this error. The command line I used is:

htseq-count -a 10 -m intersection-strict -s yes mydata.sam Regions_new.GTF
dpryan replied

01-21-2014, 08:40 AM
Please post the command you used and the first few lines of the SAM/BAM file in question.
m232 replied

01-21-2014, 08:36 AM
HTSeq_ Error occured when reading first line of sam file.

Hello,

I am very new to bioinformatics and am having some problems that I can't solve. I have seen threads similar to this one, but none exactly the same, and I am not sure how to solve my issue. When running HTSeq, I received this error:

100000 GFF lines processed.
200000 GFF lines processed.
217821 GFF lines processed.
Error occured when reading first line of sam file.

[Exception type: StopIteration, raised in count.py:79]
100000 GFF lines processed.
200000 GFF lines processed.
217821 GFF lines processed.
Error occured when reading first line of sam file.

[Exception type: StopIteration, raised in count.py:79]
100000 GFF lines processed.
200000 GFF lines processed.
217821 GFF lines processed.
Error occured when reading first line of sam file.

It does not give any specific reason why it can't read the first line of the SAM file, unlike in other threads that I have seen.

I used Bowtie to align to a reference metagenome, and samtools to create the nameSorted.sam files.

Any thoughts?

Thank you!
Dedeusan replied

12-12-2011, 07:25 AM
OK, Simon...
I will try it...
Thanks for your help!
Sandra
Simon Anders replied

12-12-2011, 05:10 AM
I would say, no, it didn't work, if you see only zeroes.

I've just had another look at the excerpt from your GFF file that you posted above, and it does not seem right. htseq-count counts at the level of genes, not exons. Hence, for each "exon" line, the attribute "ID" (or whatever you have specified with "-i") has to be the ID of the gene. All exons from the same gene must have the same ID. In your file, it seems that some exon or transcript number is appended to the ID. You may need a Perl (or Python or sed) script to remove these.
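
A Python sketch of such a clean-up script might look like the following. It assumes GFF3-style `ID=...` attributes and that the appended number always has the form `-<digits>` at the end of the ID; both are guesses based on IDs like `apidb|exon_Tb09.160.0220-1`, and the example line is hypothetical rather than copied from the actual file.

```python
import re

# Assumption: exon-level IDs look like "<gene-level ID>-<number>".
EXON_SUFFIX = re.compile(r"-\d+$")


def gene_level_id(feature_id: str) -> str:
    """Strip a trailing "-<number>" so all exons of a gene share one ID."""
    return EXON_SUFFIX.sub("", feature_id)


def rewrite_id_attribute(line: str) -> str:
    """Rewrite only the ID= attribute in the ninth column of a GFF3 line."""
    line = line.rstrip("\n")
    fields = line.split("\t")
    if len(fields) != 9:
        return line  # comments, directives, or anything that is not a feature line
    attrs = []
    for attr in fields[8].split(";"):
        if attr.startswith("ID="):
            attr = "ID=" + gene_level_id(attr[3:])
        attrs.append(attr)
    fields[8] = ";".join(attrs)
    return "\t".join(fields)


# Hypothetical feature line built around an ID seen in this thread.
example = (
    "chrX\tsrc\texon\t100\t200\t.\t+\t.\t"
    "ID=apidb|exon_Tb09.160.0220-1;Parent=apidb|rna_Tb09.160.0220-1"
)
print(rewrite_id_attribute(example))
```

After rewriting, every exon of the same gene should carry the same ID value, which is what gene-level counting requires.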
Dedeusan replied

12-12-2011, 04:33 AM
Well, Simon, this is what I get:

apidb|exon_Tb09.160.0220-1 0

And over 12,000 similar lines, and:

no_feature 0
ambiguous 0
too_low_aQual 0
not_aligned 0
alignment_not_unique 11678480

But in the text file I couldn't see anything else about the 11678480 alignments. Did it work? Can I see the alignments somewhere?

And so sorry for disturbing you so much. Thanks for your attention!
Sandra
Simon Anders replied

12-12-2011, 03:46 AM
Well, it is only a warning that told you that a single read got skipped. I suppose you can ignore that. If you have contigs in your reference to which the GFF file does not assign any exons, it is completely expected to get a few of these warnings.
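
To see how many reference sequences are affected, one could compare the sequence names in the SAM file with those in the GFF file. The sketch below does this with plain text parsing (column 3 of SAM is the reference name, column 1 of GFF is the sequence name); the function names are invented and the example lines are made up around the chromosome name from the warning.

```python
def gff_seqnames(gff_lines):
    """Collect sequence names (column 1) from 9-column GFF feature lines."""
    names = set()
    for line in gff_lines:
        fields = line.rstrip("\n").split("\t")
        if line.startswith("#") or len(fields) != 9:
            continue
        names.add(fields[0])
    return names


def sam_refnames(sam_lines):
    """Collect reference names (column 3) from SAM alignment lines."""
    names = set()
    for line in sam_lines:
        if line.startswith("@"):
            continue  # header line
        fields = line.rstrip("\n").split("\t")
        if len(fields) > 2 and fields[2] != "*":  # "*" marks unaligned reads
            names.add(fields[2])
    return names


def missing_from_gff(sam_lines, gff_lines):
    """Reference names that reads align to but that the GFF never mentions."""
    return sam_refnames(sam_lines) - gff_seqnames(gff_lines)


# Made-up example lines.
sam = [
    "@HD\tVN:1.0\n",
    "r1\t0\tGeneDB|Tb927_01_v4\t100\t255\t4M\t*\t0\t0\tACGT\tIIII\n",
    "r2\t4\t*\t0\t0\t*\t*\t0\t0\tACGT\tIIII\n",
    "r3\t0\tchrA\t5\t255\t4M\t*\t0\t0\tACGT\tIIII\n",
]
gff = ["##gff-version 3\n", "chrA\tsrc\texon\t1\t50\t.\t+\t.\tID=g1\n"]
print(missing_from_gff(sam, gff))
```

A handful of missing names fits the explanation above; if nearly every reference name is missing, the naming schemes of the two files probably differ.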
Dedeusan replied

12-12-2011, 03:09 AM
Hi Simon,

I am really wishing now that I were working with Drosophila or something similar...
I applied what you told me:

samtools view s_1_sequence_clipped_tophat.bam | htseq-count -s no -i ID - TbruceiTreu927_TriTrypDB-3.3.4.gff > count.txt

Warning: Skipping read 'HWUSI-EAS582_211:1:106:1372:1342', because chromosome 'GeneDB|Tb927_01_v4', to which it has been aligned, did not appear in the GFF file

And after I stopped the execution of the program:

^CError: 'itertools.chain' object has no attribute 'get_line_number_string'
[Exception type: AttributeError, raised in count.py:198]

Maybe it was not a good idea to eliminate the fasta part...
Simon Anders replied

12-12-2011, 02:33 AM
In my understanding of the GFF standard, a GFF file is supposed to contain GFF lines and not FASTA lines. I've seen before that a full FASTA file gets concatenated to the end of the GFF file, but this is really confusing for any software trying to parse the file. Please remove everything that is not GFF from your GFF file.
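
One way to do that clean-up in Python is sketched below. It assumes the embedded sequence section starts either with a `##FASTA` directive line (as in GFF3) or directly with a `>` header line; everything from that point on is dropped. The function name and example lines are invented for illustration.

```python
def strip_fasta(gff_lines):
    """Yield GFF lines up to, but not including, the embedded FASTA section."""
    for line in gff_lines:
        # Assumption: the sequence section begins at "##FASTA" or at the
        # first FASTA header line starting with ">".
        if line.startswith("##FASTA") or line.startswith(">"):
            break
        yield line


# Made-up example lines.
lines = [
    "##gff-version 3\n",
    "chrA\tsrc\texon\t1\t50\t.\t+\t.\tID=g1\n",
    "##FASTA\n",
    ">chrA\n",
    "ACGTACGT\n",
]
for kept in strip_fasta(lines):
    print(kept, end="")
```

Writing the kept lines to a new file, rather than editing in place, leaves the original annotation untouched in case the sequence section is needed later.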