Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • Dedeusan
    replied
    Almost...

    Hi Simon.
    First of all, thank you very much!!!
    Second:
    Apparently, it worked. It was an ; extra in a pair of lines so it continues reading everything just until it arrives to the #fasta part:

    htseq-count -s no -i ID s_1_sequence_clipped_tophat.bam TbruceiTreu927_TriTrypDB-3.3.3.gff
    Error occured in line 46748 of file TbruceiTreu927_TriTrypDB-3.3.3.gff.
    Error: need more than 1 value to unpack
    [Exception type: ValueError, raised in __init__.py:214]

    And when I look at this line this is what I can see:


    tryp_XI-1036e06.p1k.snoRNA.0004;Dbxref=ApiDB:tryp_XI-1036e06.p1k.snoRNA.0004,taxon:185431
    apidb|tryp_XI-1036e06.p1k ApiDB exon 22123 22200 . + . ID=apidb|exon_tryp_XI-1036e06.p1k.snoRNA.0004-1;Name=exon;description=exon;size=78;Parent=apidb|rna_tryp_XI-1036e06.p1k.snoRNA.0004-1
    (This is line 46748)##FASTA
    >apidb|TB927.5.300b
    ATGGCTCACGGCTCGATTCCAGTTATTGATGTCGGCCCTCTGTTCTGTGATGGAGAAAAG
    GGGATGATGGATGTTGCGAAACAGATTGATCATGCCTGTAGGACGTGGGGTGTTTTTCTT
    GTTGTGGGTCATCCCATTCCCCGTGAGCGAACGGAAAAGTTGATGGAAATGGCCAAGGCT
    TTTTTTTCGCTTCCATTGGAAGAGAAACTTAAGGTTGATATTCGAAAGAGCAAACATCAT
    CGCGGTTACGGATGCCTCGATGCGGAGAATGTTGACCCAACGAAACCATTTGATTGTAAA
    GAAACATTTAATATGGGCTGTCATCTCCCTGAGGATCACCCCGATGTTGCAGCTGGAAAG
    CCATTGCGTGGACCGAACAATCACCCCACGCAAGTGAAAGGTTGGGTAGAGTTGATGAAC
    AGACATTATCGCGAAATGCAGGAATTTGCCCTCGTTATTCTTCGTGCCCTCGCACTCGCT
    ATTGGTTTAAAGAAAGACTTTTTCGATACCAAATTTGATGAACCTTTGAGTGTGTTCCGT
    ATGCTACATTATCCTCCACAAAAGCAAGGGACCCGTTATCCCATCGTGTGTGGTGAGCAT
    ACGGATTATGGTATTATTACATTACTCTACCAAGATTCGGTGGGAGGACTGCAGGTGCGC
    AATCTGTCAGATGAGTGGGTGGATGTGGAACCCCTTGAAGGAAGTTTTGTTGTGAATATT
    GGGGACATGATGAATATGTGGAGTAATGGCCGTTACCGCTCAACACCGCATCGCGTTCGC
    TTAACCACAACTGATCGCTACTCCATGCCATTTTTCTGTCAGCCTAATCCTTATACTGTT
    ATTAAATGCCTTGATCATTGCCATTCGCCAAGCAATCCCCCCAAATATCCACCAGTCCGT
    GCTGTGGATTGGTTGCTGAAGCGTTTCGCGGAAACATATGCCCATCGCAAAACAAAGATG
    TGA
    >apidb|cds_TB927.5.300b-1
    MAHGSIPVIDVGPLFCDGEKGMMDVAKQIDHACRTWGVFLVVGHPIPRERTEKLMEMAKA
    FFSLPLEEKLKVDIRKSKHHRGYGCLDAENVDPTKPFDCKETFNMGCHLPEDHPDVAAGK
    PLRGPNNHPTQVKGWVELMNRHYREMQEFALVILRALALAIGLKKDFFDTKFDEPLSVFR
    MLHYPPQKQGTRYPIVCGEHTDYGIITLLYQDSVGGLQVRNLSDEWVDVEPLEGSFVVNI
    GDMMNMWSNGRYRSTPHRVRLTTTDRYSMPFFCQPNPYTVIKCLDHCHSPSNPPKYPPVR
    AVDWLLKRFAETYAHRKTKM

    Why does htseq-count gives now an error?
    Again, thanks for all!
    Sandra

    Leave a comment:


  • Simon Anders
    replied
    Using 'i ID' did solve your first problem, because afterward, the script no longer complained about line 263 of the GFF file but about line 29875. So, have a look there, too.

    Your next problem will be that htseq-count, at the moment, does not read bam file. You will need to do something like
    Code:
    samtools view myalignment.bam | htseq-count [options] - myannotation.gff
    The lone minus sign tells htseq-count to read the sam file from standard input, i.e., through the pipe from 'samtools view'.

    Leave a comment:


  • Dedeusan
    replied
    Problem with htseq-count

    Dear SImon,
    I am just starting using HTSeq and I found the same problem as Sofia, but anything of the solutions that you recommended works this time. CAn you help me? This is the command line I use:
    sandra@sandra-VirtualBox:~/HTSeq-0.5.3p3$ htseq-count -s no s_1_sequence_clipped_CDS_tophat.bam TbruceiTreu927_TriTrypDB-3.32.gff

    And what I get:
    Error occured in line 263 of file TbruceiTreu927_TriTrypDB-3.3.2.gff.
    Error: Feature gene|exon_TB927.5.300b-1 does not contain a 'gene_id' attribute
    [Exception type: SystemExit, raised in count.py:55]

    I'd tried to use -i ID but I only obtain this:

    sandra@sandra-VirtualBox:~/HTSeq-0.5.3p3$ htseq-count -s no -i ID s_1_sequence_clipped_CDS_tophat.bam TbruceiTreu927_TriTrypDB-3.32.gff
    Error occured in line 29875 of file TbruceiTreu927_TriTrypDB-3.32.gff.
    Error: Failure parsing GFF attribute line
    [Exception type: ValueError, raised in __init__.py:171]

    I am working wih trypanosoma brucei from what I only have the gff data (there is no gtf data available) and when I look for the line this is the information I have in the gff starting the line 263 adn some more lines as example:


    pidb|BAC26D11 ApiDB exon 53516 54478 . - . ID=apidb|exon_TB927.5.300b-1;Name=exon;description=exon;size=963;Parent=apidb|rna_TB927.5.300b-1
    apidb|Tb927_04_v4 ApiDB gene 1458621 1459109 . - . ID=apidb|Tb04.24M18.150;Name=Tb04.24M18.150;description=hypothetical+protein%2C+conserved;size=489;web_id=Tb04.24M18.150;locus_tag=Tb04.24M18.150;size=489;Alias=20660476,Tb04.24M18.150
    apidb|Tb927_04_v4 ApiDB mRNA 1458621 1459109 . - . ID=apidb|rna_Tb04.24M18.150-1;Name=Tb04.24M18.150-1;description=Tb04.24M18.150-1;size=489;Parent=apidb|Tb04.24M18.150;Ontology_term=GO:0044429;Dbxref=ApiDB:Tb04.24M18.150,taxon:185431
    apidb|Tb927_04_v4 ApiDB CDS 1458621 1459109 . - 0 ID=apidb|cds_Tb04.24M18.150-1;Name=cds;description=.;size=489;Parent=apidb|rna_Tb04.24M18.150-1
    apidb|Tb927_04_v4 ApiDB exon 1458621 1459109 . - . ID=apidb|exon_Tb04.24M18.150-1;Name=exon;description=exon;size=489;Parent=apidb|rna_Tb04.24M18.150-1
    apidb|Tb927_04_v4 ApiDB gene 1312951 1313406 . - . ID=apidb|Tb04.3I12.100;Name=Tb04.3I12.100;description=hypothetical+protein;size=456;web_id=Tb04.3I12.100;locus_tag=Tb04.3I12.100;size=456;Alias=Tb04.3I12.100
    apidb|Tb927_04_v4 ApiDB mRNA 1312951 1313406 . - . ID=apidb|rna_Tb04.3I12.100-1;Name=Tb04.3I12.100-1;description=Tb04.3I12.100-1;size=456;Parent=apidb|Tb04.3I12.100;Dbxref=ApiDB:Tb04.3I12.100,taxon:185431
    apidb|Tb927_04_v4 ApiDB CDS 1312951 1313406 . - 0 ID=apidb|cds_Tb04.3I12.100-1;Name=cds;description=.;size=456;Parent=apidb|rna_Tb04.3I12.100-1
    apidb|Tb927_04_v4 ApiDB exon 1312951 1313406 . - . ID=apidb|exon_Tb04.3I12.100-1;Name=exon;description=exon;size=456;Parent=apidb|rna_Tb04.3I12.100-1
    apidb|Tb927_05_v4 ApiDB gene 345728 346237 . + . ID=apidb|Tb05.28F8.200;Name=Tb05.28F8.200;description=hypothetical+protein;size=510;web_id=Tb05.28F8.200;locus_tag=Tb05.28F8.200;size=510;Alias=Tb05.28F8.200
    apidb|Tb927_05_v4 ApiDB mRNA 345728 346237 . + . ID=apidb|rna_Tb05.28F8.200-1;Name=Tb05.28F8.200-1;description=Tb05.28F8.200-1;size=510;Parent=apidb|Tb05.28F8.200;Dbxref=ApiDB:Tb05.28F8.200,taxon:185431
    apidb|Tb927_05_v4 ApiDB CDS 345728 346237 . + 0 ID=apidb|cds_Tb05.28F8.200-1;Name=cds;description=.;size=510;Parent=apidb|rna_Tb05.28F8.200-1
    apidb|Tb927_05_v4 ApiDB exon 345728 346237 . + . ID=apidb|exon_Tb05.28F8.200-1;Name=exon;description=exon;size=510;Parent=apidb|rna_Tb05.28F8.200-1
    apidb|Tb927_05_v4 ApiDB gene 235483 235992 . - .


    What may I change to make htseq-count work?

    Thank you very much in advance!
    Sandra

    Leave a comment:


  • sofia17
    replied
    Oooops. Is working now. Maybe the bioconductor site was down earlier....

    Leave a comment:


  • sofia17
    replied
    Dear Simon,

    I am back in business asking questions.

    I ran at the R console

    source("http://www.bioconductor.org/biocLite.R")
    biocLite("DESeq")

    to install DESeq but when I try to load the library with

    library(DESeq)

    it gives the following err:

    Error in loadNamespace(i[[1L]], c(lib.loc, .libPaths())) :
    there is no package called 'annotate'
    Error: package/namespace load failed for 'DESeq'

    I am running R 2.13.1 (latest version) on Mac OS 10.6.4. Any suggestions?

    Leave a comment:


  • Simon Anders
    replied
    Originally posted by sofia17 View Post
    From your experience, is it normal that out of about 15 million Illumina reads 1.5 million go in the no_feature pile?
    Yes, that does not sound unusual.

    Leave a comment:


  • sofia17
    replied
    Thank you Simon for “holding hands” . It worked with “–i ID”.

    From your experience, is it normal that out of about 15 million Illumina reads 1.5 million go in the no_feature pile?

    Leave a comment:


  • Simon Anders
    replied
    Try '-i ID', not '-i ID=gene'. Only the part before the '=' is the attribute name.

    Leave a comment:


  • sofia17
    replied
    I have done the replacement (please see below that all exons have the correct 'ID=gene:...' field). But it still gives same error:
    Error occured in line 6 of file /Users/vanderlab1/Desktop/try/processed.gff.
    Error: Feature exon:Solyc00g005000.2.1.1 does not contain a 'gene_id' attribute
    Exception type: SystemExit, raised in count.py:55]

    I tried also: htseq-count -i ID=gene <samFile> <gff3File>
    and got the same error

    Error occured in line 6 of file /Users/vanderlab1/Desktop/try/processed.gff.
    Error: Feature exon:Solyc00g005000.2.1.1 does not contain a 'ID=gene' attribute
    Exception type: SystemExit, raised in count.py:55]

    Maybe I use wrongly the -i option?

    ##gff-version 3
    ##feature-ontology http://song.cvs.sourceforge.net/*che...?revision=1.93
    ##sequence-region SL2.40ch00 1 21805821
    SL2.40ch00 ITAG_eugene gene 16437 18189 . + . Alias=Solyc00g005000;ID=gene:Solyc00g005000.2;Name=Solyc00g005000.2;from_BOGAS=1;length=1753
    SL2.40ch00 ITAG_eugene mRNA 16437 18189 . + . ID=mRNA:Solyc00g005000.2.1;Name=Solyc00g005000.2.1;Note=Aspartic proteinase nepenthesin I (AHRD V1 **-- A9ZMF9_NEPAL)%3B contains Interpro domain(s) IPR001461 Peptidase A1 ;Ontology_term=GO:0006508;Parent=gene:Solyc00g005000.2;from_BOGAS=1;interpro2go_term=GO:0006508;length=1753;nb_exon=2
    SL2.40ch00 ITAG_eugene exon 16437 17275 . + . ID=exon:Solyc00g005000.2.1.1;ID=gene:Solyc00g005000.2;from_BOGAS=1
    SL2.40ch00 ITAG_eugene five_prime_UTR 16437 16479 . + . ID=five_prime_UTR:Solyc00g005000.2.1.0;Parent=mRNA:Solyc00g005000.2.1;from_BOGAS=1
    SL2.40ch00 ITAG_eugene CDS 16480 17275 . + 0 ID=CDS:Solyc00g005000.2.1.1;Parent=mRNA:Solyc00g005000.2.1;from_BOGAS=1
    SL2.40ch00 ITAG_eugene intron 17276 17335 . + . ID=intron:Solyc00g005000.2.1.1;Parent=mRNA:Solyc00g005000.2.1;from_BOGAS=1
    SL2.40ch00 ITAG_eugene exon 17336 18189 . + 0 ID=exon:Solyc00g005000.2.1.2;ID=gene:Solyc00g005000.2;from_BOGAS=1
    SL2.40ch00 ITAG_eugene CDS 17336 17940 . + 2 ID=CDS:Solyc00g005000.2.1.2;Parent=mRNA:Solyc00g005000.2.1;from_BOGAS=1
    SL2.40ch00 ITAG_eugene three_prime_UTR 17941 18189 . + . ID=three_prime_UTR:Solyc00g005000.2.1.0;Parent=mRNA:Solyc00g005000.2.1;from_BOGAS=1
    Last edited by sofia17; 09-08-2011, 07:34 AM.

    Leave a comment:


  • Simon Anders
    replied
    Only for the "exon" lines, because htseq-count ignores the others.

    Leave a comment:


  • sofia17
    replied
    Within a gene should I do the above replacement for ALL components, i.e. mRNA, exon, intron, CDS, three_prime_UTR, etc...? Or for mRNA lines ONLY?
    Last edited by sofia17; 09-06-2011, 08:29 AM.

    Leave a comment:


  • Simon Anders
    replied
    Nearly. It should be "ID=gene:Solyc00g005000.2". Compare carefully the "mRNA" and the "gene" lines.

    Leave a comment:


  • sofia17
    replied
    so in the gff3 file I should replace entries such as

    "Parent=mRNA:Solyc00g005000.2.1"

    with

    "ID=gene:Solyc00g005000.2.1"???

    Leave a comment:


  • Simon Anders
    replied
    Sorry, that was wrong. The GFF file contains a transcript ID in the 'Parent' attribute of 'exon' lines, but we would need the gene ID. You will need to fix this manually.

    Leave a comment:


  • Simon Anders
    replied
    It might be because this is not a GTF file; it does not have 'gene_ID' attribute -- or rather, it has them, but they are named 'Parent'. Hence, you need the option '-i Parent'.

    Leave a comment:

Latest Articles

Collapse

  • seqadmin
    Essential Discoveries and Tools in Epitranscriptomics
    by seqadmin




    The field of epigenetics has traditionally concentrated more on DNA and how changes like methylation and phosphorylation of histones impact gene expression and regulation. However, our increased understanding of RNA modifications and their importance in cellular processes has led to a rise in epitranscriptomics research. “Epitranscriptomics brings together the concepts of epigenetics and gene expression,” explained Adrien Leger, PhD, Principal Research Scientist...
    04-22-2024, 07:01 AM
  • seqadmin
    Current Approaches to Protein Sequencing
    by seqadmin


    Proteins are often described as the workhorses of the cell, and identifying their sequences is key to understanding their role in biological processes and disease. Currently, the most common technique used to determine protein sequences is mass spectrometry. While still a valuable tool, mass spectrometry faces several limitations and requires a highly experienced scientist familiar with the equipment to operate it. Additionally, other proteomic methods, like affinity assays, are constrained...
    04-04-2024, 04:25 PM

ad_right_rmr

Collapse

News

Collapse

Topics Statistics Last Post
Started by seqadmin, Yesterday, 11:49 AM
0 responses
15 views
0 likes
Last Post seqadmin  
Started by seqadmin, 04-24-2024, 08:47 AM
0 responses
16 views
0 likes
Last Post seqadmin  
Started by seqadmin, 04-11-2024, 12:08 PM
0 responses
61 views
0 likes
Last Post seqadmin  
Started by seqadmin, 04-10-2024, 10:19 PM
0 responses
60 views
0 likes
Last Post seqadmin  
Working...
X