Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • #16
    Originally posted by Siva View Post
    Steven your awk one liner works great.
    Thanks, glad i could help.

    Awk is truly powerful!!
    Yes, but not the best if you need more than a quick and dirty one liner. My ones sometimes get pages long and i do not recommend this. One typo in the name of a variable and you are done.. plus the memory management is not as good as in perl i think (i don't know about python).

    Can you modify it further to add a column that gives the number of exons per transcript?
    Sure. I would simply try

    Code:
    awk '$3=="exon" && $9~/^Parent/{sub(/;.*/,"",$9);ID=substr($9,8);L[ID]+=$5-$4+1;N[ID]++}END{for(i in L){print i,L[i],N[i]}}' ZmB73_4a.53_WGS_chr1_21.gff
    cheers,
    s.

    Comment


    • #17
      [QUOTE=Simon Anders;18102]It seems that it's still there. Try the following. Start a Python session and type:

      Code:
      import HTSeq
      print HTSeq.__file__
      This will print out where Python found HTSeq and will indicate to you where the directory with the outdated version of HTSeq is hanging out on your hard disk. Just delete it and try again.

      Simon
      Thanks, it now works. Will post feedback after running it on the full GFF.

      Siva

      Comment


      • #18
        Originally posted by steven View Post
        Sure. I would simply try

        Code:
        awk '$3=="exon" && $9~/^Parent/{sub(/;.*/,"",$9);ID=substr($9,8);L[ID]+=$5-$4+1;N[ID]++}END{for(i in L){print i,L[i],N[i]}}' ZmB73_4a.53_WGS_chr1_21.gff
        cheers,
        s.
        Steven
        This works fine too. Thanks!! Will post feedback soon.

        Siva

        Comment


        • #19
          when running py script on gff file of zea mays v3.21 from ensembl, it popped up this error:
          Traceback (most recent call last):
          File "lenTransGff.py", line 14, in <module>
          for feature in gff_file:
          File "/usr/lib64/python2.6/site-packages/HTSeq-0.5.4p5-py2.6-linux-x86_64.egg/HTSeq/__init__.py", line 223, in __iter__
          iv = GenomicInterval( seqname, int(start)-1, int(end), strand )
          File "_HTSeq.pyx", line 62, in HTSeq._HTSeq.GenomicInterval.__init__ (src/_HTSeq.c:2789)
          File "_HTSeq.pyx", line 71, in HTSeq._HTSeq.GenomicInterval.strand.__set__ (src/_HTSeq.c:2910)
          ValueError: Strand must be'+', '-', or '.'.
          any idea?

          for latest ensembl gff file of zea mays AGPv3.21, the awk script should also need some changes for matching in column 9. here is the one I modified:
          zcat Zea_mays.AGPv3.21.gff3.gz | awk '$3=="exon" && $9~/Parent/{sub(/.*Parent=/,"",$9);sub(/;.*/,"",$9);ID=$9;L[ID]+=$5-$4+1;N[ID]++}END{for(i in L){print i,L[i],N[i]}}' > len_trans_Enum.txt
          hope some one need this.

          Comment


          • #20
            Need to find intronic regions from a gtf gile

            Hello Simon,

            I found the HTSeq amazing to work with, I am trying to find intronic regions from a gtf file for horse. Here in this thread, you have clearly explained to get transcripts and their respective lengths. But what if a single gene have multiple transcripts, i wanted to choose the transcripts with maximum length. I wanted a output which gives gene_id, transcript_id, intron_start, intron_end, transcript_length.

            Could you please give me a idea how to do this using HTseq.

            Thanks in advance.

            Comment


            • #21
              Originally posted by Simon Anders View Post
              Code:
                 for exon in transcripts[ transcript_id ]:
                    # Each GenomicFeature object has a slot called 'iv', which is a
                    # GenomicInterval object, which in turn contains slots such as
                    # 'chrom', 'start', 'end', 'strand', 'length', etc., to describe
                    # where the feature sits in the genome. We use 'length' here.
                    transcript_length += exon.iv.length + 1
                    # We add the 1 because this GFF file does no
                 # Now, 'transcript_length' is the sum of the lengths of all the
                 # transcripts exon.
                 # Print this result, and also add the number of exons as (which is just the 
                 # length of the list of exons for the transcript):
                 print transcript_id, transcript_length, len( transcripts[ transcript_id ] )
              Simon
              So the iv object already accounted for the 1-based counting and end inclusive (information from the GFF_reader documentation of HTSeq) when calculating the length? So why do you need a +1 again? The documentation seems to want to explain it but somehow truncated.

              Comment

              Latest Articles

              Collapse

              • seqadmin
                Latest Developments in Precision Medicine
                by seqadmin



                Technological advances have led to drastic improvements in the field of precision medicine, enabling more personalized approaches to treatment. This article explores four leading groups that are overcoming many of the challenges of genomic profiling and precision medicine through their innovative platforms and technologies.

                Somatic Genomics
                “We have such a tremendous amount of genetic diversity that exists within each of us, and not just between us as individuals,”...
                Yesterday, 01:16 PM
              • seqadmin
                Recent Advances in Sequencing Analysis Tools
                by seqadmin


                The sequencing world is rapidly changing due to declining costs, enhanced accuracies, and the advent of newer, cutting-edge instruments. Equally important to these developments are improvements in sequencing analysis, a process that converts vast amounts of raw data into a comprehensible and meaningful form. This complex task requires expertise and the right analysis tools. In this article, we highlight the progress and innovation in sequencing analysis by reviewing several of the...
                05-06-2024, 07:48 AM

              ad_right_rmr

              Collapse

              News

              Collapse

              Topics Statistics Last Post
              Started by seqadmin, Yesterday, 07:15 AM
              0 responses
              13 views
              0 likes
              Last Post seqadmin  
              Started by seqadmin, 05-23-2024, 10:28 AM
              0 responses
              17 views
              0 likes
              Last Post seqadmin  
              Started by seqadmin, 05-23-2024, 07:35 AM
              0 responses
              20 views
              0 likes
              Last Post seqadmin  
              Started by seqadmin, 05-22-2024, 02:06 PM
              0 responses
              10 views
              0 likes
              Last Post seqadmin  
              Working...
              X