Unconfigured Ad

Collapse
X
 
  • Filter
  • Time
  • Show
Clear All
new posts
  • pecanton
    Member
    • Jun 2011
    • 13

    HTseq count parsing headache

    Hi,

    I've searched the forum for a problem similar to mine, and there is a thread dealing with, but it's form two years ago and sometimes it's hard to get help on resurrected threads.

    So, I have my RNAseq data back, I've aligned it with Bowtie2. I've sorted and coverted the sam files into sorted bam files with samtools. I've now trying to use HTseq for couting reads, but I get a parsing error.

    Specifically, I've running this command

    Code:
     samtools view 11A_align_sort.bam | htseq-count -s no  -i ID - ~/aedes_genomic/Aedes-aegypti-Liverpool_BASEFEATURES.gff3 > 11A_count
    and get the error

    Code:
    Error occured in line 15 of file /home/emiliano/aedes_genomic/Aedes-aegypti-Liverpool_BASEFEATURES.gff3.
    Error: Failure parsing GFF attribute line
    [Exception type: ValueError, raised in __init__.py:171]
    For reference, these are the first 25 lines from the gff3 file:

    Code:
    ##gff-version 3
    ##feature-ontology so.obo
    ##attribute-ontology gff3_attributes.obo
    #
    # Dumped from  database.
    # 
    # using SO:0000704 for VectorBase Gene
    # using SO:0000234 for VectorBase Transcript
    # using SO:0000147 for VectorBase Exon
    # using SO:0000316 for VectorBase CDS
    # using SO:0000204 for VectorBase 5'UTR 
    # using SO:0000205 for VectorBase 3'UTR
    # 
    ##sequence-region supercontig:AaegL1:supercont1.1:1:5856339:1
    supercont1.1	VectorBase	contig	1	5856339	.	.	.	ID=supercont1.1;molecule_type=dsDNA;GenBank:supercontig:AaegL1:supercont1.1:1:5856339:1;translation_table=1;topology=linear;localization=chromosomal;
    supercont1.1	VectorBase	gene	35414	53420	.	+	.	ID=AAEL000064;
    supercont1.1	VectorBase	mRNA	35414	53420	.	+	.	ID=AAEL000064-RA;Parent=AAEL000064;Dbxref=UniProtKB:Q8T4S1,UniProtKB:Q8T4S2,UniProtKB:Q8T4S3,UniProtKB:Q8T4S4,UniProtKB:Q8T4S5,UniProtKB:Q8T4S6,UniProtKB:Q9GSZ4,GenBank:AF288384,GenBank:AY064094,GenBank:AY064095,GenBank:AY064096,GenBank:AY064097,GenBank:AY064098,GenBank:AY064099,GenBank:CH477186,protein_id:AAG01014,protein_id:AAL85595,protein_id:AAL85596,protein_id:AAL85597,protein_id:AAL85598,protein_id:AAL85599,protein_id:AAL85600,protein_id:EAT48898,UniParc:UPI000007F997;description=hypothetical protein;
    supercont1.1	VectorBase	exon	35414	35644	.	+	.	ID=E036654A;Parent=AAEL000064-RA;
    supercont1.1	VectorBase	exon	35699	35901	.	+	.	ID=E036655A;Parent=AAEL000064-RA;
    supercont1.1	VectorBase	exon	35993	36098	.	+	.	ID=E036656A;Parent=AAEL000064-RA;
    supercont1.1	VectorBase	exon	52193	52851	.	+	.	ID=E036657A;Parent=AAEL000064-RA;
    supercont1.1	VectorBase	exon	52973	53420	.	+	.	ID=E036658A;Parent=AAEL000064-RA;
    supercont1.1	VectorBase	five_prime_utr	35414	35523	.	+	.	Parent=AAEL000064-RA
    supercont1.1	VectorBase	CDS	35524	35644	.	+	0	Parent=AAEL000064-RA;
    supercont1.1	VectorBase	CDS	35699	35901	.	+	2	Parent=AAEL000064-RA;
    I've not entirely sure what part of the line is causing the parsing to fail. It says it's on line 15, which is the first non-commented line, but it's a contig. Wouldn't HTseq skip this and go for the lines with exons?
    Last edited by pecanton; 07-08-2013, 09:00 AM.
  • pecanton
    Member
    • Jun 2011
    • 13

    #2
    Also, I saw an error in my code to run HTseq. the -i ID option I was using only works if I also put -t gene, since that's the type of feature that has that ID. However, I'm not sure if I should count by exon or by gene. I suppose it depends on the downstream analysis I want to do, but I'm not very clear on the benefits of using either couting procedure.

    Comment

    • chadn737
      Senior Member
      • Jan 2009
      • 392

      #3
      For some reason htseq-count does not like everything in line 15 after the second semicolon.

      Code:
      supercont1.1	VectorBase	contig	1	5856339	.	.	.	ID=supercont1.1;molecule_type=dsDNA;[COLOR="Red"]GenBank:supercontig:AaegL1:supercont1.1:1:5856339:1;translation_table=1;topology=linear;localization=chromosomal;[/COLOR]
      If I delete everything highlighted in red, it proceeds normally.

      Comment

      • chadn737
        Senior Member
        • Jan 2009
        • 392

        #4
        Originally posted by pecanton View Post
        Also, I saw an error in my code to run HTseq. the -i ID option I was using only works if I also put -t gene, since that's the type of feature that has that ID. However, I'm not sure if I should count by exon or by gene. I suppose it depends on the downstream analysis I want to do, but I'm not very clear on the benefits of using either couting procedure.
        Thats because your ID's for your exons are all different and would need to be modified to match the ID for the gene. That would be a pain in the ***.

        Comment

        • pecanton
          Member
          • Jun 2011
          • 13

          #5
          Well, since the Aedes aegypti genome is not fully assembled, and exists as supercotings (more than 2000), it would still be a pain to go and edit every one of the "contig" lines from the gff3. I'll report back if I find an easier way.

          Comment

          • dpryan
            Devon Ryan
            • Jul 2011
            • 3478

            #6
            If it's just choking on the "contig" entries, which will be later ignored anyway, then just write a small script to edit those and leave the others unchanged. Alternatively, you should just be able to use grep to remove them (something like "grep -v -w 'contig' ...").

            Comment

            Latest Articles

            Collapse

            • GATTACAT
              Reply to Nine Things a Sample Prep Scientist Thinks About Before Sequencing
              by GATTACAT
              Love this - good data definitely starts from good input, and poor input can only give relatively poor data. I particularly like the mention of Nanodrop/absorbance based methods for quantification. It's such a toss up if you'll get an accurate reading or what amounts to a randomly generated number, and a lot of library/sequencing related issues can be traced back to poor quant.
              Today, 11:43 AM
            • SEQadmin2
              Nine Things a Sample Prep Scientist Thinks About Before Sequencing
              by SEQadmin2


              I’m not a sequencing expert. I’m a purification scientist who uses NGS to evaluate workflows my group develops. With this perspective, we think about the sample first and the NGS workflow second. The sequencer is an exceptionally honest reporter, but it can only report on what you give it, so whether you get clean, interpretable data from an NGS workflow is largely determined before you begin.

              Here are nine questions we think about, in roughly the order they matter, before...
              06-18-2026, 07:11 AM
            • SEQadmin2
              From Collection to Sequencing: Why Sample Preparation and Preservation Define Sequencing Data
              by SEQadmin2


              Data variability is still an issue in sequencing technologies despite the advances in reproducibility and accuracy of these platforms. But the problem does not originate in the sequencing itself, but in the previous steps, before the sample reaches the sequencer.


              The first step is collection, followed by preservation and sample preparation for analysis. Most scientists overlook those steps, but not being careful might just be skewing the experiment’s results.
              ...
              06-02-2026, 10:05 AM

            ad_right_rmr

            Collapse

            News

            Collapse

            Topics Statistics Last Post
            Started by SEQadmin2, Yesterday, 05:37 AM
            0 responses
            9 views
            0 reactions
            Last Post SEQadmin2  
            Started by SEQadmin2, 06-26-2026, 11:10 AM
            0 responses
            18 views
            0 reactions
            Last Post SEQadmin2  
            Started by SEQadmin2, 06-17-2026, 06:09 AM
            0 responses
            52 views
            0 reactions
            Last Post SEQadmin2  
            Started by SEQadmin2, 06-09-2026, 11:58 AM
            0 responses
            110 views
            0 reactions
            Last Post SEQadmin2  
            Working...