Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • @joro

    Here are the 3 lines from gff file.
    Middle one is 2169


    NC_003911.11 RefSeq gene 176231 176306 . + . locus_tag=SPO_tRNA-Gln-1;db_xref=GeneID:3196480
    NC_003911.11 RefSeq exon 176231 176306 . + . gbkey=tRNA;locus_tag=SPO_tRNA-Gln-1;product=tRNA-Gln;db_xref=GeneID:3196480;exon_number=1
    NC_003911.11 RefSeq gene 176312 176389 . + . locus_tag=SPO_tRNA-Pro-2;db_xref=GeneID:3196280

    Comment


    • ssharma:

      This line does not have a 'gene_id' attribute, but neither do the others. Please note that htseq-counts expects by default a GTF file. Yours is not one. (See this post for details.) See also the htseq-count documentation, especially then options '-t' and '-i'.

      Comment


      • Thanks Simon,
        So if i want to use gff file i need to mention the feature type to be used (3rd column) with -t option.

        So this is what i did: (i want to use 'gene')

        htseq-count -t gene CL1_forBWA_left_DSS3.sam NC_003911.gff
        but now i am getting following error:


        Error occured in line 6 of file NC_003911.gff.
        Error: Feature NC_003911.11:gidA does not contain a 'gene_id' attribute
        [Exception type: SystemExit, raised in count.py:55]

        These are few lines from my gff:

        ##gff-version 3
        #!gff-spec-version 1.14
        #!source-version NCBI C++ formatter 0.2
        ##Type DNA NC_003911.11
        NC_003911.11 RefSeq source 1 4109442 . + . organism=Ruegeria pomeroyi DSS-3;mol_type=genomic DNA;strain=DSS-3;db_xref=taxon:246200
        NC_003911.11 RefSeq gene 1 1869 . + . ID=NC_003911.11:gidA;locus_tag=SPO0001;db_xref=GeneID:3194636
        NC_003911.11 RefSeq CDS 1 1866 . + 0 ID=NC_003911.11:gidA:unknown_transcript_1;Parent=NC_003911.11:gidA;locus_tag=SPO0001;note=GidA%3B glucose-inhibited cell division protein A%3B involved in the 5-carboxymethylaminomethyl modification %28mnm%285%29s%282%29U%29 of the wobble uridine base in some tRNAs;transl_table=11;product=tRNA uridine 5-carboxymethylaminomethyl modification enzyme GidA;protein_id=YP_165274.1;db_xref=GI:56694929;db_xref=GeneID:3194636;exon_number=1

        Comment


        • Well, there is no attribute called "gene_id" in your GFF file, or is there? The attributes are the entries in the last column, and their name is the part before the "=" sign. Try '-i ID'.

          Comment


          • No, there is no gene_id.
            I even tried -i ID option and nothing is working.
            Is there any way i can convert gff file to gtf or can directly get GTF file for my Genome?

            Thanks
            Shalabh

            Comment


            • Prequisites
              To use HTSeq, you need at least version 2.5 of Python (Python 3 does not work yet),

              Python 3 does not work yet....ooh no!!

              Comment


              • For those who weren't already aware of it, HTSeq seems to be able to read BAM files now, using pysam:



                edit: on that note, is there any way I can get the Query name from a BAM file line using the BAM_Reader interface? I need this because it contains an experiment ID, and I would like to split my output based on that ID. My current way to get around it is to use pysam directly. Looking at the HTSeq source code showed me how to convert the output into something that could be used for coverage analysis:

                Code:
                bamFile = pysam.Samfile(bamFileName, "rb")
                coverage = dict()
                totalCount = 0
                hitCount = 0
                for read in bamFile:
                    totalCount += 1
                    if(not(read.is_unmapped)):
                        experiment = read.qname[:read.qname.find(".")]
                        hitCount += 1
                        ## Taken from HTSeq source code
                        chrom = bamFile.getrname(read.tid)
                        strand = "-" if read.is_reverse else "+"
                        iv = HTSeq.GenomicInterval( chrom, read.pos, read.aend, strand )
                        if(not(experiment in coverage)):
                            coverage[experiment] = (
                                HTSeq.GenomicArray("auto", stranded = True, typecode = "i"))
                        (coverage[experiment])[iv] += 1
                For what it's worth, my BAM file contains query names like '<experiment>.<id>'. The reads were mapped to the yeast genome using tophat, and I'm now trying to extract out coverages for each gene / exon (including n bases up or downstream) on a per-experiment basis.
                Last edited by gringer; 10-25-2011, 05:15 AM. Reason: added additional question

                Comment


                • Originally posted by gringer View Post
                  For those who weren't already aware of it, HTSeq seems to be able to read BAM files now, using pysam
                  After discovering that pysam doesn't work so well using python 2.4 (which is installed on the Illumina server I use for analysis), I had another look at Simon's code for reading BAM files using SAMtools. The resulting method I came up with uses yield to process the samtools view output, which makes it just a little bit faster (seems to be about an order of magnitude faster than the previous samtools view hack, and possibly a teensy bit faster than pysam). There's no reading of the advanced SAM features (like split reads), and I took out the region-based search, but it might be good enough for a few other people:

                  Code:
                  def BAM_Reader( bamfile, samtools_exec="samtools" ):
                      # derived from Simon Anders' code from
                      # http://seqanswers.com/forums/showpost.php?p=22114&postcount=46
                  
                     """Get Reader object for part of a BAM file
                  
                     bamfile -- a string, either a file name or an URL to a BAM file
                     region -- a GenomicInterval
                     samtools_exec -- file name (with path, if necessary)
                        of the 'samtools' executable
                  
                     Returns a SAM_Reader object.
                     """
                  
                     cmd_line = ( samtools_exec , "view", bamfile )
                  
                     samtools_view = subprocess.Popen( cmd_line,
                        stdout = subprocess.PIPE )
                  
                     for line in samtools_view.stdout:
                        fields = line.split('\t')
                        #columns from http://samtools.sourceforge.net/SAM1.pdf
                        read = HTSeq.SequenceWithQualities(name = fields[1-1], seq = fields[10-1],
                                                            qualstr = fields[11-1])
                        mapStrand = '+'
                        if(int(fields[2-1]) & 0x10 == 0x10):
                           mapStrand = '-'
                        if(int(fields[2-1]) & 0x4 == 0x4):
                           iv = None
                        else:
                           iv = HTSeq.GenomicInterval(
                              chrom = fields[3-1], start = int(fields[4-1]) - 1,
                              end = int(fields[4-1]) - 1 + len(fields[10-1]),
                              strand = mapStrand)
                        yield HTSeq.Alignment(read, iv)
                  I've also attached a full script that uses this code for generating 3-column count files and coverage graphs with experiment labels.
                  Attached Files

                  Comment


                  • For those who weren't already aware of it, HTSeq seems to be able to read BAM files now, using pysam:
                    Question: Is there an easy way to use a bam file as an input fot htseq-count now???

                    Comment


                    • Hi,

                      I usely use a command line like this, if this can help:
                      I use samtools to read the bam file and with the pipe, I use htseq-count. The "-" call the output of the samtools view and pass it to htseq-count

                      samtools view file.bam | python2.6 htseq-count - file.gtf -q > htseq_output &

                      Delphine

                      Comment


                      • Trouble with HTSeq install

                        Trying to install HTSeq onto a machine running Windows7 and I got the following error:

                        C:\Users\ArrayStar>cd C:\HTSeq-0.5.3p9

                        C:\HTSeq-0.5.3p9>PATH=%PATH%;C:\Python27;C:\MinGW\bin

                        C:\HTSeq-0.5.3p9>python setup.py build --compiler=mingw32
                        Could not import 'setuptools', falling back to 'distutils'.
                        running build
                        running build_py
                        running build_ext
                        building 'HTSeq._HTSeq' extension
                        C:\MinGW\bin\gcc.exe -mno-cygwin -mdll -O -Wall -IC:\Python27\lib\site-packages\
                        numpy\core\include -IC:\Python27\include -IC:\Python27\PC -c src/_HTSeq.c -o bui
                        ld\temp.win-amd64-2.7\Release\src\_htseq.o -w
                        cc1.exe: error: unrecognized command line option '-mno-cygwin'
                        error: command 'gcc' failed with exit status 1

                        C:\HTSeq-0.5.3p9>

                        Any help with what I'm doing wrong and what I can do to fix it? Thanks!

                        Comment


                        • KHubbard - not sure but it looks like you may not have cygwin installed?

                          Comment


                          • Clang errors when trying to build HTseq

                            Hi All
                            I have a mac (OSx10.8.2) and get the following error when trying to build HT seq (0.5.3p9). Looks like a gcc issue...Any suggestions please?
                            running build
                            running build_py
                            running build_ext
                            building 'HTSeq._StepVector' extension
                            clang -fno-strict-aliasing -fno-common -dynamic -g -Os -pipe -fno-common -fno-strict-aliasing -fwrapv -mno-fused-madd -DENABLE_DTRACE -DMACOSX -DNDEBUG -Wall -Wstrict-prototypes -Wshorten-64-to-32 -DNDEBUG -g -Os -Wall -Wstrict-prototypes -DENABLE_DTRACE -arch i386 -arch x86_64 -pipe -I/System/Library/Frameworks/Python.framework/Versions/2.7/include/python2.7 -c src/StepVector_wrap.cxx -o build/temp.macosx-10.8-intel-2.7/src/StepVector_wrap.o -w
                            clang++ -bundle -undefined dynamic_lookup -Wl,-F. -arch i386 -arch x86_64 build/temp.macosx-10.8-intel-2.7/src/StepVector_wrap.o src/step_vector.h -o build/lib.macosx-10.8-intel-2.7/HTSeq/_StepVector.so
                            clang: warning: treating 'c-header' input as 'c++-header' when in C++ mode, this behavior is deprecated
                            clang: error: cannot use 'precompiled-header' output with multiple -arch options
                            clang: error: cannot specify -o when generating multiple output files
                            error: command 'clang++' failed with exit status 1

                            Comment


                            • Ok I got an answer from another thread; I have now installed it successfully
                              Check this in case you are grappling with a similar problem
                              Discussion of next-gen sequencing related bioinformatics: resources, algorithms, open source efforts, etc

                              Comment


                              • Hello Dr. Simon,
                                I tried to use HTseq_count function to find no of reads.
                                Data of D. melanogaster; ref genome from iGenomes (Ensembl).
                                Command typed >>
                                htseq-count -m intersection-nonempty -s no -t exon -i gene_id -o samout -q /home/ajay/bin2/binp/dmel-all-r5.48.gff /home/ajay/bin2/binp/C2_R1_thout/outc2r1.sam

                                Error occured in line 1 of file /home/ajay/bin2/binp/C2_R1_thout/outc2r1.sam.
                                Error: need more than 3 values to unpack
                                [Exception type: ValueError, raised in __init__.py:214]

                                Please suggest me where I am going wrong.
                                Ajay

                                Comment

                                Latest Articles

                                Collapse

                                • seqadmin
                                  Essential Discoveries and Tools in Epitranscriptomics
                                  by seqadmin




                                  The field of epigenetics has traditionally concentrated more on DNA and how changes like methylation and phosphorylation of histones impact gene expression and regulation. However, our increased understanding of RNA modifications and their importance in cellular processes has led to a rise in epitranscriptomics research. “Epitranscriptomics brings together the concepts of epigenetics and gene expression,” explained Adrien Leger, PhD, Principal Research Scientist...
                                  04-22-2024, 07:01 AM
                                • seqadmin
                                  Current Approaches to Protein Sequencing
                                  by seqadmin


                                  Proteins are often described as the workhorses of the cell, and identifying their sequences is key to understanding their role in biological processes and disease. Currently, the most common technique used to determine protein sequences is mass spectrometry. While still a valuable tool, mass spectrometry faces several limitations and requires a highly experienced scientist familiar with the equipment to operate it. Additionally, other proteomic methods, like affinity assays, are constrained...
                                  04-04-2024, 04:25 PM

                                ad_right_rmr

                                Collapse

                                News

                                Collapse

                                Topics Statistics Last Post
                                Started by seqadmin, Today, 11:49 AM
                                0 responses
                                10 views
                                0 likes
                                Last Post seqadmin  
                                Started by seqadmin, Yesterday, 08:47 AM
                                0 responses
                                16 views
                                0 likes
                                Last Post seqadmin  
                                Started by seqadmin, 04-11-2024, 12:08 PM
                                0 responses
                                61 views
                                0 likes
                                Last Post seqadmin  
                                Started by seqadmin, 04-10-2024, 10:19 PM
                                0 responses
                                60 views
                                0 likes
                                Last Post seqadmin  
                                Working...
                                X