Unconfigured Ad

Collapse
X
 
  • Filter
  • Time
  • Show
Clear All
new posts
  • joxcargator73
    Member
    • Dec 2012
    • 28

    Augustus gene prediction CDS problem

    I am using Augustus to predict genes. It works well most of the time. However I am dealing with a problem of translation of the extracted codingsequences. I use getAnnoFasta.pl to extract the codingseq and the aminoacids also. If I am not wrong, the codingSeqs should translate exactly the aminoacids sequences. This is not the case in several instances. When I checked my coding seq and translate them many results in other protein totally different form the file containing the aa sequences. Is this a common problem?. Is this best way to extract sequences?. I tried to use the GFF and loaded in artemis to edit error and I noticed that some features lost the frame. Maybe I am missing something.

    Thanks

  • WhatsOEver
    Senior Member
    • Apr 2012
    • 215

    #2
    Could you provide (a) the command you used, (b) the version number of your Augustus installation and (c) a reproducible example where the translation fails? I have used Augustus some times but have never experienced anything like that.

    Comment

    • joxcargator73
      Member
      • Dec 2012
      • 28

      #3
      I am using augustus 3.0.3.
      It happens with the default prediction ( partial: allow prediction of incomplete genes at the sequence boundaries (default))

      I checked the gff file and only the incomplete genes have this problem.

      I used this command
      ./augustus --species=magnaporthe_grisea --codingseq=1 --protein=1 --exonnames=1 G_citricarpa_assembly_Scf4.fasta > Gcitri.gff

      Then

      perl getAnnoFasta.pl --seqfile=G_citricarpa_assembly_Scf4.fasta Gcitri.gff

      to obtain CodingSeq, and aa



      For example in my file: Gcitri.gff
      I have an "incomplete call"

      # ----- prediction on sequence number 2169 (length = 1527, name = NODE_2169_length_1527_cov_14.2353_ID_4337) -----
      #
      # Constraints/Hints:
      # (none)
      # Predicted genes for sequence number 2169 on both strands
      # start gene g9811
      NODE_2169_length_1527_cov_14.2353_ID_4337 AUGUSTUS gene 1 870 0.3 + . g9811
      NODE_2169_length_1527_cov_14.2353_ID_4337 AUGUSTUS transcript 1 870 0.3 + . g9811.t1
      NODE_2169_length_1527_cov_14.2353_ID_4337 AUGUSTUS internal 217 389 0.55 + 2 transcript_id "g9811.t1"; gene_id "g9811";
      NODE_2169_length_1527_cov_14.2353_ID_4337 AUGUSTUS terminal 442 870 0.44 + 0 transcript_id "g9811.t1"; gene_id "g9811";
      NODE_2169_length_1527_cov_14.2353_ID_4337 AUGUSTUS intron 1 216 0.58 + . transcript_id "g9811.t1"; gene_id "g9811";
      NODE_2169_length_1527_cov_14.2353_ID_4337 AUGUSTUS intron 390 441 0.75 + . transcript_id "g9811.t1"; gene_id "g9811";
      NODE_2169_length_1527_cov_14.2353_ID_4337 AUGUSTUS CDS 217 389 0.55 + 2 transcript_id "g9811.t1"; gene_id "g9811";
      NODE_2169_length_1527_cov_14.2353_ID_4337 AUGUSTUS CDS 442 867 0.44 + 0 transcript_id "g9811.t1"; gene_id "g9811";
      NODE_2169_length_1527_cov_14.2353_ID_4337 AUGUSTUS stop_codon 868 870 . + 0 transcript_id "g9811.t1"; gene_id "g9811";
      # coding sequence = [ataatctgcagcgcaagctcgtcctgctgcaccgacgcgctcaccgcgaccccagcgctcgcggctcctcaatcaccgt
      # cttgggctgcacgctgtggacggcgattcccgacgatgcgcggcaggccgtggcgggcgccgtgagcgacttttccaagatccagccaggactgagcg
      # aagtcgagtccgtttccgaggaggacgattcggcgctgctgttggtcgtcacgcaccacgcgccctgcgtccagctcacgtcgagcccgcgccacgtc
      # ggcaacccgtggactgcggcattcgccacggacctgcttggatcggcggcaggagggggaggaggtgctgagtgctgggatgccgtcagggtgtgggt
      # tttcgggcacacgcactacccgaccaacttcaaggtgaatggcaccagggtcgtaagtaatcagcgcggctacgttctcccagggctcgtgagggtga
      # aggatggagaagatggctctgggaagaaccaccagtcgatgtctcgaagacgattcgtgtttgaagtgccagactgccatgccagactgccagtcgca
      # tcacgagcctcctcgcaggataggctggactga]
      # protein sequence = [NLQRKLVLLHRRAHRDPSARGSSITVLGCTLWTAIPDDARQAVAGAVSDFSKIQPGLSEVESVSEEDDSALLLVVTHH APCVQLTSSPRHVGNPWTAAFATDLLGSAAGGGGGAECWDAVRVWVFGHTHYPTNFKVNGTRVVSNQRGYVLPGLVRVKDGEDGSGKNHQSMSRRRFV
      FEVPDCHARLPVASRASSQDRLD]
      # end gene g9811


      The file Gtcitri.aa obtained using getAnnoFasta.pl gives the same aa sequence
      NLQRKLVLLHRRAHRDPSARGSSITVLGCTLWTAIPDDARQAVAGAVSDFSKIQPGLSEVESVSEEDDSALLLVVTHH
      APCVQLTSSPRHVGNPWTAAFATDLLGSAAGGGGGAECWDAVRVWVFGHTHYPTNFKVNGTRVVSNQRGYVLPGLVRVKDGEDGSGKNHQSMSRRRFVFEVPDCHARLPVASRASSQDRLD
      This is good

      the file Gcitri.CodingSeq obtained using getAnnoFasta.pl gives the whole "coding sequence"
      ataatctgcagcgcaagctcgtcctgctgcaccgacgcgctcaccgcgaccccagcgctcgcggctcctcaatcaccgt
      # cttgggctgcacgctgtggacggcgattcccgacgatgcgcggcaggccgtggcgggcgccgtgagcgacttttccaagatccagccaggactgagcg
      # aagtcgagtccgtttccgaggaggacgattcggcgctgctgttggtcgtcacgcaccacgcgccctgcgtccagctcacgtcgagcccgcgccacgtc
      # ggcaacccgtggactgcggcattcgccacggacctgcttggatcggcggcaggagggggaggaggtgctgagtgctgggatgccgtcagggtgtgggt
      # tttcgggcacacgcactacccgaccaacttcaaggtgaatggcaccagggtcgtaagtaatcagcgcggctacgttctcccagggctcgtgagggtga
      # aggatggagaagatggctctgggaagaaccaccagtcgatgtctcgaagacgattcgtgtttgaagtgccagactgccatgccagactgccagtcgca
      # tcacgagcctcctcgcaggataggctggactga
      !!!

      If you translate that in frame 1
      You obtain this

      IICSASSSCCTDALTATPALAAPQSPSWAARCGRRFPTMRGRPWRAP*ATFPRSSQD*AK
      SSPFPRRTIRRCCWSSRTTRPASSSRRARATSATRGLRHSPRTCLDRRQEGEEVLSAGMP
      SGCGFSGTRTTRPTSR*MAPGS*VISAATFSQGS*G*RMEKMALGRTTSRCLEDDSCLKC
      QTAMPDCQSHHEPPRRIGWT

      Agains The gff file says that the first CDS starts at 217 in frame 2 , I believe that this is the right frame for the Codseq. It seems that I am missing something when I use getAnnoFasta.pl.

      Of course the start codon feature is missed too. When I loaded the gff file in artemis I also have features of the incomplete genes out of frame.


      Thanks
      Last edited by joxcargator73; 03-23-2015, 03:55 PM.

      Comment

      • WhatsOEver
        Senior Member
        • Apr 2012
        • 215

        #4
        Originally posted by joxcargator73 View Post
        I am using augustus 3.0.3.
        It happens with the default prediction ( partial: allow prediction of incomplete genes at the sequence boundaries (default))

        I checked the gff file and only the incomplete genes have this problem.
        ok, the prediction is fine then. I'll explain why:

        Originally posted by joxcargator73 View Post
        I used this command
        ./augustus --species=magnaporthe_grisea --codingseq=1 --protein=1 --exonnames=1 G_citricarpa_assembly_Scf4.fasta > Gcitri.gff

        Then

        perl getAnnoFasta.pl --seqfile=G_citricarpa_assembly_Scf4.fasta Gcitri.gff

        to obtain CodingSeq, and aa
        Just a comment: I assume that there are no closer related species than Magnaporthe (eg something from the Dothideomycetes) available? In that case it is important to keep in mind that some of your gene predictions will probably be different when using alternative gene models. But for an initial working set everything is fine and you don't have to bother.

        Originally posted by joxcargator73 View Post
        For example in my file: Gcitri.gff
        I have an "incomplete call"

        # ----- prediction on sequence number 2169 (length = 1527, name = NODE_2169_length_1527_cov_14.2353_ID_4337) -----
        #
        # Constraints/Hints:
        # (none)
        # Predicted genes for sequence number 2169 on both strands
        # start gene g9811
        NODE_2169_length_1527_cov_14.2353_ID_4337 AUGUSTUS gene 1 870 0.3 + . g9811
        NODE_2169_length_1527_cov_14.2353_ID_4337 AUGUSTUS transcript 1 870 0.3 + . g9811.t1
        NODE_2169_length_1527_cov_14.2353_ID_4337 AUGUSTUS internal 217 389 0.55 + 2 transcript_id "g9811.t1"; gene_id "g9811";
        NODE_2169_length_1527_cov_14.2353_ID_4337 AUGUSTUS terminal 442 870 0.44 + 0 transcript_id "g9811.t1"; gene_id "g9811";
        NODE_2169_length_1527_cov_14.2353_ID_4337 AUGUSTUS intron 1 216 0.58 + . transcript_id "g9811.t1"; gene_id "g9811";
        NODE_2169_length_1527_cov_14.2353_ID_4337 AUGUSTUS intron 390 441 0.75 + . transcript_id "g9811.t1"; gene_id "g9811";
        NODE_2169_length_1527_cov_14.2353_ID_4337 AUGUSTUS CDS 217 389 0.55 + 2 transcript_id "g9811.t1"; gene_id "g9811";
        NODE_2169_length_1527_cov_14.2353_ID_4337 AUGUSTUS CDS 442 867 0.44 + 0 transcript_id "g9811.t1"; gene_id "g9811";
        NODE_2169_length_1527_cov_14.2353_ID_4337 AUGUSTUS stop_codon 868 870 . + 0 transcript_id "g9811.t1"; gene_id "g9811";
        # coding sequence = [ataatctgcagcgcaagctcgtcctgctgcaccgacgcgctcaccgcgaccccagcgctcgcggctcctcaatcaccgt
        # cttgggctgcacgctgtggacggcgattcccgacgatgcgcggcaggccgtggcgggcgccgtgagcgacttttccaagatccagccaggactgagcg
        # aagtcgagtccgtttccgaggaggacgattcggcgctgctgttggtcgtcacgcaccacgcgccctgcgtccagctcacgtcgagcccgcgccacgtc
        # ggcaacccgtggactgcggcattcgccacggacctgcttggatcggcggcaggagggggaggaggtgctgagtgctgggatgccgtcagggtgtgggt
        # tttcgggcacacgcactacccgaccaacttcaaggtgaatggcaccagggtcgtaagtaatcagcgcggctacgttctcccagggctcgtgagggtga
        # aggatggagaagatggctctgggaagaaccaccagtcgatgtctcgaagacgattcgtgtttgaagtgccagactgccatgccagactgccagtcgca
        # tcacgagcctcctcgcaggataggctggactga]
        # protein sequence = [NLQRKLVLLHRRAHRDPSARGSSITVLGCTLWTAIPDDARQAVAGAVSDFSKIQPGLSEVESVSEEDDSALLLVVTHH APCVQLTSSPRHVGNPWTAAFATDLLGSAAGGGGGAECWDAVRVWVFGHTHYPTNFKVNGTRVVSNQRGYVLPGLVRVKDGEDGSGKNHQSMSRRRFV
        FEVPDCHARLPVASRASSQDRLD]
        # end gene g9811


        The file Gtcitri.aa obtained using getAnnoFasta.pl gives the same aa sequence
        NLQRKLVLLHRRAHRDPSARGSSITVLGCTLWTAIPDDARQAVAGAVSDFSKIQPGLSEVESVSEEDDSALLLVVTHH
        APCVQLTSSPRHVGNPWTAAFATDLLGSAAGGGGGAECWDAVRVWVFGHTHYPTNFKVNGTRVVSNQRGYVLPGLVRVKDGEDGSGKNHQSMSRRRFVFEVPDCHARLPVASRASSQDRLD
        This is good

        the file Gcitri.CodingSeq obtained using getAnnoFasta.pl gives the whole "coding sequence"
        ataatctgcagcgcaagctcgtcctgctgcaccgacgcgctcaccgcgaccccagcgctcgcggctcctcaatcaccgt
        # cttgggctgcacgctgtggacggcgattcccgacgatgcgcggcaggccgtggcgggcgccgtgagcgacttttccaagatccagccaggactgagcg
        # aagtcgagtccgtttccgaggaggacgattcggcgctgctgttggtcgtcacgcaccacgcgccctgcgtccagctcacgtcgagcccgcgccacgtc
        # ggcaacccgtggactgcggcattcgccacggacctgcttggatcggcggcaggagggggaggaggtgctgagtgctgggatgccgtcagggtgtgggt
        # tttcgggcacacgcactacccgaccaacttcaaggtgaatggcaccagggtcgtaagtaatcagcgcggctacgttctcccagggctcgtgagggtga
        # aggatggagaagatggctctgggaagaaccaccagtcgatgtctcgaagacgattcgtgtttgaagtgccagactgccatgccagactgccagtcgca
        # tcacgagcctcctcgcaggataggctggactga
        !!!

        If you translate that in frame 1
        You obtain this

        IICSASSSCCTDALTATPALAAPQSPSWAARCGRRFPTMRGRPWRAP*ATFPRSSQD*AK
        SSPFPRRTIRRCCWSSRTTRPASSSRRARATSATRGLRHSPRTCLDRRQEGEEVLSAGMP
        SGCGFSGTRTTRPTSR*MAPGS*VISAATFSQGS*G*RMEKMALGRTTSRCLEDDSCLKC
        QTAMPDCQSHHEPPRRIGWT

        Agains The gff file says that the first CDS starts at 217 in frame 2 , I believe that this is the right frame for the Codseq. It seems that I am missing something when I use getAnnoFasta.pl.

        Of course the start codon feature is missed too. When I loaded the gff file in artemis I also have features of the incomplete genes out of frame.

        Thanks
        1) What you get is what you see. The perl script does nothing more than extracting everything between "# coding sequence = [" and its respective closing bracket (Same thing for protein sequence).

        2) Augustus uses 0 to indicate 1st frame, 1 for 2nd and 2 for 3rd and if you translate the coding sequence in all frames, you'll also see that it is the 3rd frame translation.

        3) I highlighted the most important part. You have a partial gene which was detected at the start of a contig. Obivously, for such a prediction no start "ATG" can be found (as you have noticed). But your prediction starts with an intron and to see a "frame-shift" is absolutely fine in that case. Keep in might that you cannot necessarily translate every exon independently in 1st frame!

        4) It doesn't make sense to translate in a frame that contains premature stop codons as this only indicates that the prediction is somehow wrong. For your example, there seems to be evidence for a transcript of that certain length and there is 1 frame which completely covers this region with only 1 stop at the end. This frame is shown.

        I hope this helps

        Comment

        • joxcargator73
          Member
          • Dec 2012
          • 28

          #5
          Thanks, for your help.
          The problem that I still have , it is how to deal with the codingSeqs. If some of them (mostly the incomplete calls) do not start in frame and I need the nucleotide sequence to run a codon usage analysis , I will get a something else but not what I expect. Moreover if I compare with the aminoacid tarlatan that Ausguts shows. Should I avoid using truncate calls. Are the complete gene calls start in frame, are these safe to just take their "coding seqs" and translate them?

          Thanks for your help

          Comment

          Latest Articles

          Collapse

          ad_right_rmr

          Collapse

          News

          Collapse

          Topics Statistics Last Post
          Started by SEQadmin2, 06-09-2026, 11:58 AM
          0 responses
          24 views
          0 reactions
          Last Post SEQadmin2  
          Started by SEQadmin2, 06-05-2026, 10:09 AM
          0 responses
          29 views
          0 reactions
          Last Post SEQadmin2  
          Started by SEQadmin2, 06-04-2026, 08:59 AM
          0 responses
          39 views
          0 reactions
          Last Post SEQadmin2  
          Started by SEQadmin2, 06-02-2026, 12:03 PM
          0 responses
          61 views
          0 reactions
          Last Post SEQadmin2  
          Working...