Unconfigured Ad

Collapse
X
 
  • Filter
  • Time
  • Show
Clear All
new posts
  • syintel87
    Member
    • Dec 2012
    • 81

    [DEXSeq] prepare_annotation.py: exonic part starts too early!

    When I tried the command below,

    "python dexseq_prepare_annotation.py Mt.gtf Mt_dexseq.gtf"

    an error occurs, saying that "AssertionError: <GenomicFeature: exonic_part 'Medtr7g090810' at Mt:3458:Mt3.5.1Chr7: 28524762 -> 28523495 (strand '-') > starts too early"

    So, would you please give me a tip about how I will be able to fix this error?
    Thank you.
  • Simon Anders
    Senior Member
    • Feb 2010
    • 995

    #2
    Could you grep for 'Medtr7g090810' in you GTF file and post the lines containing the term?

    Comment

    • syintel87
      Member
      • Dec 2012
      • 81

      #3
      Originally posted by Simon Anders View Post
      Could you grep for 'Medtr7g090810' in you GTF file and post the lines containing the term?
      The attached file includes the part of 'Medtr7g090810'.
      Thank you for sparing your precious time.
      Attached Files

      Comment

      • Simon Anders
        Senior Member
        • Feb 2010
        • 995

        #4
        It seems HTSeq got confused because the same gene occurs on both the "+" and the "-_ strand.

        Comment

        • syintel87
          Member
          • Dec 2012
          • 81

          #5
          Originally posted by Simon Anders View Post
          It seems HTSeq got confused because the same gene occurs on both the "+" and the "-_ strand.
          1. So, annotation file has to be fixed??

          2. When I have just made small changes on the "dexseq_prepare_annotation.py", it worked.

          exons = HTSeq.GenomicArrayOfSets( "auto", stranded=True )
          for f in HTSeq.GFF_Reader( gtf_file ):
          if f.type != "exon":
          continue
          f.attr['transcript_id'] = f.attr['transcript_id'].replace( ":", "_" )
          exons[f.iv] += ( f.attr['transcript_id'], f.attr['transcript_id'] )

          But, seeing the original gtf file, since there are both exon and CDS, I am not sure whether this code is okay for my gtf file or not.

          3. I have another gtf file. For this, I also made a small change. And it worked.

          exons = HTSeq.GenomicArrayOfSets( "auto", stranded=True )
          for f in HTSeq.GFF_Reader( gtf_file ):
          if f.type != "CDS":
          continue
          f.attr['transcript_id'] = f.attr['transcript_id'].replace( ":", "_" )
          CDS[f.iv] += ( f.attr['transcript_id'], f.attr['transcript_id'] )

          But, I am not sure whether this code is okay or not.

          4. Do you think the codes that I have modified would be okay?
          The attached file is about original gtf and dexseq_prepare_annotation.py and output gtf.

          Thank you very much!

          Comment

          • syintel87
            Member
            • Dec 2012
            • 81

            #6
            Originally posted by Simon Anders View Post
            It seems HTSeq got confused because the same gene occurs on both the "+" and the "-_ strand.
            This is the attachment.
            Attached Files

            Comment

            • Simon Anders
              Senior Member
              • Feb 2010
              • 995

              #7
              No, yopu cannot change from gene ID to transcript ID, because there may be many genes with several overlapping transcripts, and they won't be handled correctly anymore.

              You really should fix your GTF file: Wherever the same gene ID is used for features on different strands, add something to the gene ID. If this is complicated, just add a "+" or "-" to all gene IDs.

              BTW, dexseq_prepare only looks at "exon" lines and ignored "CDS" lines

              Comment

              • syintel87
                Member
                • Dec 2012
                • 81

                #8
                Originally posted by Simon Anders View Post
                No, yopu cannot change from gene ID to transcript ID, because there may be many genes with several overlapping transcripts, and they won't be handled correctly anymore.

                You really should fix your GTF file: Wherever the same gene ID is used for features on different strands, add something to the gene ID. If this is complicated, just add a "+" or "-" to all gene IDs.

                BTW, dexseq_prepare only looks at "exon" lines and ignored "CDS" lines
                1.
                So, you mean I will have to replace each +/- into + ?
                (Alternatively, replace each +/- into -).

                2.
                In my Mhapla.gtf file, it has only exon. So do I need to fix my gtf file by replaceing CDS with exon?

                Comment

                • Simon Anders
                  Senior Member
                  • Feb 2010
                  • 995

                  #9
                  1. No, change the gene ID from, say, "Medtr7g090810" to "Medtr7g090810+" and "Medtr7g090810-", depending on strand. This is assuming that you know a scripting language. I wouldn't want to do that manually.

                  Where did you get this strange GTF file from, anyway? Having the same gene name on both strands is a bug.

                  2. No, why?

                  Comment

                  • syintel87
                    Member
                    • Dec 2012
                    • 81

                    #10
                    Originally posted by Simon Anders View Post
                    1. No, change the gene ID from, say, "Medtr7g090810" to "Medtr7g090810+" and "Medtr7g090810-", depending on strand. This is assuming that you know a scripting language. I wouldn't want to do that manually.

                    Where did you get this strange GTF file from, anyway? Having the same gene name on both strands is a bug.

                    2. No, why?
                    1. Thank you. I'll try that. I obtained these gtf files from a member of my project group.

                    2. If Mhapla.gtf only has CDS but dexseq_prepare.py only looks at "exon" lines and ignored "CDS" lines, the output might have no line at all, I guess.

                    Comment

                    • Simon Anders
                      Senior Member
                      • Feb 2010
                      • 995

                      #11
                      2. Sure, if it's this way round, you need to change the CDS lines to exon lines.

                      Comment

                      • arkanion
                        Member
                        • Jul 2016
                        • 10

                        #12
                        Originally posted by Simon Anders View Post
                        1. No, change the gene ID from, say, "Medtr7g090810" to "Medtr7g090810+" and "Medtr7g090810-", depending on strand. This is assuming that you know a scripting language. I wouldn't want to do that manually.

                        Where did you get this strange GTF file from, anyway? Having the same gene name on both strands is a bug.

                        2. No, why?

                        I have the same problem. If you search UCSC Genome Browser for the gene for ex; HIST2H3C, you will see 2 genes appearing, one on + and on one - strand. dexseq cannot deal with this, but this situation actually happens in reality for those who use gtf file downloaded from UCSC track.

                        Comment

                        Latest Articles

                        Collapse

                        • GATTACAT
                          Reply to Nine Things a Sample Prep Scientist Thinks About Before Sequencing
                          by GATTACAT
                          Love this - good data definitely starts from good input, and poor input can only give relatively poor data. I particularly like the mention of Nanodrop/absorbance based methods for quantification. It's such a toss up if you'll get an accurate reading or what amounts to a randomly generated number, and a lot of library/sequencing related issues can be traced back to poor quant.
                          07-01-2026, 11:43 AM
                        • SEQadmin2
                          Nine Things a Sample Prep Scientist Thinks About Before Sequencing
                          by SEQadmin2


                          I’m not a sequencing expert. I’m a purification scientist who uses NGS to evaluate workflows my group develops. With this perspective, we think about the sample first and the NGS workflow second. The sequencer is an exceptionally honest reporter, but it can only report on what you give it, so whether you get clean, interpretable data from an NGS workflow is largely determined before you begin.

                          Here are nine questions we think about, in roughly the order they matter, before...
                          06-18-2026, 07:11 AM

                        ad_right_rmr

                        Collapse

                        News

                        Collapse

                        Topics Statistics Last Post
                        Started by SEQadmin2, Yesterday, 11:08 AM
                        0 responses
                        7 views
                        0 reactions
                        Last Post SEQadmin2  
                        Started by SEQadmin2, 06-30-2026, 05:37 AM
                        0 responses
                        12 views
                        0 reactions
                        Last Post SEQadmin2  
                        Started by SEQadmin2, 06-26-2026, 11:10 AM
                        0 responses
                        20 views
                        0 reactions
                        Last Post SEQadmin2  
                        Started by SEQadmin2, 06-17-2026, 06:09 AM
                        0 responses
                        54 views
                        0 reactions
                        Last Post SEQadmin2  
                        Working...