
  • gene annotation evidence combiners in fungi

    Hi, I'm testing out Maker and EVM with fungal genomes, combining a range of data including RNAseq (trinity alignment to genome and tophat junctions), protein alignment, cDNA alignment, and ab initio predictions. One problem area with applying this technique to fungi seems to be the gene density, with the RNAseq data showing that most UTRs of adjacent genes overlap or lie very close together (<20 bp apart). I don't have stranded RNAseq, which would minimise this problem slightly. As a side note: the jaccard clipping feature from trinity was implemented to get around this... however I've noticed that some fungal transcripts are clipped to buggery (i.e. at sites other than overlapping UTRs).

    The end result of running an evidence combiner is that many adjacent gene models are combined into a single long gene model.

    Has anyone had experience with applying these techniques to fungi? Are there any workarounds or is the only solution tedious manual curation after auto-reannotation?

    James Hane

  • #2
    Hi James -

    For filamentous fungi, we have simply used either GeneMark.hmm-ES or Augustus and have gotten pretty good results. I don't have any quantitative data, but we compared GeneMark models to those produced by the Broad Institute's pipeline and found them to be identical in most cases. Augustus can incorporate RNA-Seq and protein alignments. In the case of RNA-Seq, I think it can be used both to train the application and as hints during the annotation. I have not tried Maker myself but I've spoken to others who have tried it on fungi and have not been too satisfied with the results.


    • #3
      Thanks Mike,

      That's interesting, I haven't tried training Augustus on my current datasets, however I've used the pre-trained models from closely related species and found them to be pretty poor compared to genemark-ES. I strongly favour genemark-ES over Augustus predictions. The intron-exon boundaries between the two were mostly identical... however Augustus has an annoying tendency to also merge adjacent gene models (in the absence of RNAseq). Additionally, (in my experience) maker seems to incorporate longer gene models wherever possible, making the Augustus mistakes the most prominent.

      I can see how, if the above problem with Augustus was overcome and with an absence of RNAseq data, maker could have been successfully applied to fungal genomes in the past. I was reasonably happy with how maker performed (without Augustus) in spite of the RNAseq UTR overlaps... but would really like to minimise the amount of manual curation required.


      • #4
        Just posting this here to summarise how this issue was resolved - for those who might stumble across it in future.

        Long story short: use STRANDED RNAseq and you will avoid most of these problems... almost all UTR overlap will be gone because adjacent overlapping loci will mostly be on opposite strands... I think there might still however be some rare 3' to 5' UTR overlap (on the same strand) in some cases - depending on the gene density of the sequenced species. This would make PASA without manual annotation or even cufflinks viable options.

        If you are still using UNSTRANDED RNAseq data for fungal annotation then read on.

        Maker and cufflinks appear to be best suited to larger eukaryote genomes i.e. animals and plants, with lower gene density than fungi - and operate on the principle that a genome feature overlap implies that both features belong to the same locus. Other annotation tools using feature overlap only, without considering other factors, are fundamentally flawed and unsuitable for automated fungal gene annotation. Incidentally I also recently found some issues with the default BLASTP parameters in Maker2, which can hide some very small (but confirmed by RNAseq) introns (~20-50 bp) as gaps within HSPs... Since HSP alignments are treated as full exons the microintron is skipped... meaning a few proteins can be translated out of frame.
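To make the overlap problem concrete, here is a toy sketch in Python. It is not how Maker or cufflinks are actually implemented; it only applies the rule "overlapping features belong to the same locus" to made-up coordinates, with and without strand information. The function name and all coordinates are hypothetical.

```python
from typing import Dict, List, Tuple

# (start, end, strand), 1-based inclusive. All coordinates are invented.
Feature = Tuple[int, int, str]


def cluster_by_overlap(features: List[Feature], use_strand: bool) -> List[List[Feature]]:
    """Group features into loci using the rule 'any overlap = same locus'.

    With use_strand=False every feature is treated as unstranded, which is
    the situation with unstranded RNAseq alignments.
    """
    groups: Dict[str, List[Feature]] = {}
    for feat in features:
        key = feat[2] if use_strand else "."
        groups.setdefault(key, []).append(feat)

    loci: List[List[Feature]] = []
    for group in groups.values():
        group.sort()
        current = [group[0]]
        current_end = group[0][1]
        for feat in group[1:]:
            if feat[0] <= current_end:
                # Overlaps the locus being built, so it gets merged in.
                current.append(feat)
                current_end = max(current_end, feat[1])
            else:
                loci.append(current)
                current = [feat]
                current_end = feat[1]
        loci.append(current)
    return sorted(loci)


TRANSCRIPTS: List[Feature] = [
    (100, 1500, "+"),   # gene A
    (1480, 2900, "-"),  # gene B: UTR overlaps A by 21 bp, opposite strand
    (3500, 4800, "+"),  # gene C
    (4790, 6000, "+"),  # gene D: overlaps C on the same strand
]

print(len(cluster_by_overlap(TRANSCRIPTS, use_strand=False)))
print(len(cluster_by_overlap(TRANSCRIPTS, use_strand=True)))
```

Without strand information A and B collapse into one locus, as do C and D. With strand information A and B are separated, but the same-strand C/D overlap is still merged, which is the residual case that stranded data cannot resolve on its own.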

        In the end I found the best combination for my purposes was Trinity/Cufflinks, AAT, PASA, EvidenceModeller and Apollo.

        Trinity (with jaccard clipping to predict UTR overlaps and separate sequences) to assemble transcripts de novo.

        AAT to align transcripts AND proteins (i.e. related species/swissprot). It is generally better than BLASTP at avoiding intron-skipping problems (see above).

        PASA... one of the most important steps as it is capable of coping with merged transcript alignments and converting them into separate loci. Not perfectly but it does most of the work.

        EVM... combining inputs from PASA output, cufflinks gff, tophat junctions, AAT protein alignments, in silico gene predictions.
        Unlike MAKER, which seems to treat all supporting evidence as equally valid, EVM allows you to assign weights to different evidence types. I would personally give cufflinks gff data a relatively lower weight than other transcript or protein data... If you want gene predictions to be modified where there are supporting transcript/protein alignments, but retained in the final set of annotations where there is no supporting evidence, then you should define their data type as OTHER_PREDICTION rather than ABINITIO_PREDICTION when defining their weights. I personally would only do this with genemark-ES predictions and add any other predictions as the ABINITIO_PREDICTION type.
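As a rough sketch of the weighting scheme described above, the following Python snippet writes out rows in the three-column layout used by an EVM weights file (evidence class, source, weight, tab-separated). The source labels and the numeric weights are assumptions chosen for illustration only; in a real run the source labels have to match column 2 of your own GFF3 inputs, and the weights need tuning for your data.

```python
from typing import List, Tuple

# Evidence classes that EVM recognises in a weights file.
VALID_CLASSES = {"ABINITIO_PREDICTION", "OTHER_PREDICTION", "PROTEIN", "TRANSCRIPT"}

# (evidence class, source label, weight).
# Source labels and weights are hypothetical examples, not recommendations.
WEIGHTS: List[Tuple[str, str, int]] = [
    ("OTHER_PREDICTION", "GeneMark_ES", 2),       # kept even without support
    ("ABINITIO_PREDICTION", "Augustus", 1),       # other ab initio predictors
    ("PROTEIN", "AAT_protein", 5),
    ("TRANSCRIPT", "PASA_assemblies", 10),
    ("TRANSCRIPT", "Cufflinks", 1),               # deliberately low weight
]


def format_weights(rows: List[Tuple[str, str, int]]) -> str:
    """Return the rows as tab-separated weights-file text."""
    lines = []
    for evidence_class, source, weight in rows:
        if evidence_class not in VALID_CLASSES:
            raise ValueError(f"unknown evidence class: {evidence_class}")
        lines.append(f"{evidence_class}\t{source}\t{weight}")
    return "\n".join(lines) + "\n"


print(format_weights(WEIGHTS), end="")
```

The only design choice here that comes from the post itself is putting genemark-ES under OTHER_PREDICTION and giving cufflinks a lower weight than the other transcript and protein evidence.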

        Note: EVM can also benefit from some "negative evidence" defining genome regions in which protein-coding genes are not likely to reside e.g. antifam, repetitive DNA coordinates such as from repeatmasker/modeller, transposonPSI... or non-coding gene coordinates from tRNAscan-SE, infernal etc.
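One way to prepare this kind of negative evidence is to merge the coordinates from the different tools into a single non-overlapping set of regions, and then check how much of a candidate gene model they cover. The Python sketch below shows only that bookkeeping on invented coordinates; the function names are hypothetical, and how the merged regions are then supplied to EVM is not shown.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # (start, end), 1-based inclusive


def merge_intervals(intervals: List[Interval]) -> List[Interval]:
    """Merge overlapping intervals into a sorted, non-overlapping list."""
    merged: List[Interval] = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def masked_fraction(gene: Interval, masked: List[Interval]) -> float:
    """Fraction of the gene span covered by non-overlapping masked regions."""
    g_start, g_end = gene
    covered = sum(
        max(0, min(g_end, m_end) - max(g_start, m_start) + 1)
        for m_start, m_end in masked
    )
    return covered / (g_end - g_start + 1)


# Invented coordinates on a single contig.
REPEATS = [(100, 500), (450, 800)]   # e.g. repeat annotations
TRNAS = [(2000, 2072)]               # e.g. non-coding gene coordinates

NEGATIVE = merge_intervals(REPEATS + TRNAS)
print(NEGATIVE)
print(masked_fraction((700, 1699), NEGATIVE))
```

Gene models with a high masked fraction are candidates for down-weighting or for a closer look during manual curation; the threshold is a judgment call.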

        Since I used "unstranded" RNAseq, manual annotation was still necessary for many genes to resolve merge/split errors. I converted the EVM outputs to Apollo and curated from there. Manually annotating a whole fungal genome takes a LONG TIME (on the order of months... possibly years). An experienced annotator and I took only 3 months, but usually this takes much, much longer.

        As a final aside, in my experience very large fungal genes (usually several kb) such as PKS and NRPS genes can sometimes remain split... sometimes even the de novo transcripts don't assemble all the way through the length of the gene and you only have support in some sections. I would recommend identifying these regions by homology, inspecting raw RNAseq coverage and giving these genes extra attention when manually curating.

        Best of luck with fungal gene annotation... and if you come up with better solutions please post them here or contact me directly.

        Kind Regards,
        James Hane


        • #5
          Hi James - thanks for posting all the info. I looked at EVM but it seemed like it was going to be a lot of work to set up a pipeline, so we haven't implemented it yet. More recently we have been using MAKER but we're not so happy with the results, so maybe we'll go back to just using Augustus or GeneMark-ES until we can get back to something more sophisticated. I wonder how much the supporting evidence (protein, RNAseq alignments) actually improves the gene models, because in our experience with maker it seems to create as many errors as it corrects. Do you find that GeneMark-ES is making errors that are being corrected by the supporting evidence?

          As for Augustus, more recently we have been having better results by using the included models from closely related organisms. This is probably taxon-specific. We have trained some of our own but the results were not as good as Augustus+included model or GeneMark-ES. We probably did not spend enough time building up a good training data set.


          • #6

            How can we convert AAT protein alignment output to a gff file for use with EVM?

            Thanks Sandesh


            • #7
              Dear James & Mike,

              I appreciate your help to the community; this conversation has helped me clear up some confusion.
              I see that this post was made in 2014. In 2018, the problem still exists with fungal genomes. However, Augustus, GeneMark-ES and Maker have all come out with new versions.

              Could you shed some light on their current performance? I am using Augustus at the moment to annotate a filamentous fungal genome (I am just analysing the data, and have no experience with the sequencing techniques). Can you suggest a pipeline that is fast and efficient?




              • #8
                What I am currently working on is using EVM to generate a high-confidence list of gene predictions and then feeding that into Augustus for training, then using the trained Augustus + hints for the final gene predictions. The general steps are:

                1) RepeatModeler + RepeatMasker to get repeat regions

                2) PASA on transcriptome if you have one (I used GMAP for mapping transcripts to genome). You can alternatively try the BRAKER method (The Augustus manual I think has a section on how to do that). Or you can also try to perform a reference assembly with STAR/tophat + cufflinks.

                3) Exonerate to align uniprot metazoa proteins to your genome

                4) SNAP/GeneMark for ab initio predictions

                5) Throw results from steps 2-4 into EVM (weighting the evidence is really subjective).

                6) Get a list of "high-confidence" gene predictions from EVM that contains evidence from all three sources (PASA, exonerate, ab initio). This should be a relatively short list (I had ~800).

                7) Train Augustus with these "high-confidence" EVM gene models.

                8) Re-use PASA, Exonerate, SNAP, RepeatMasker as hints for Augustus and run gene predictions. The weighting of the hints is very subjective.
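
The selection in step 6 can be sketched as a simple set filter. The Python below assumes you have already worked out, for each EVM gene model, which evidence sources support it; the gene IDs, the source labels and the input dictionary are all hypothetical, and parsing that information out of the EVM output is not shown.

```python
from typing import Dict, FrozenSet, List, Set

# Evidence sources a model must have to count as "high-confidence".
# These labels are placeholders for whatever your own inputs are called.
REQUIRED: FrozenSet[str] = frozenset({"PASA", "exonerate", "ab_initio"})


def high_confidence(
    models: Dict[str, Set[str]], required: FrozenSet[str] = REQUIRED
) -> List[str]:
    """Return IDs of gene models supported by every required evidence source."""
    return sorted(gene_id for gene_id, sources in models.items() if required <= sources)


# Made-up example: gene ID -> evidence sources supporting that model.
EVM_MODELS: Dict[str, Set[str]] = {
    "gene_0001": {"PASA", "exonerate", "ab_initio"},
    "gene_0002": {"ab_initio"},
    "gene_0003": {"PASA", "ab_initio"},
    "gene_0004": {"PASA", "exonerate", "ab_initio"},
}

print(high_confidence(EVM_MODELS))
```

The resulting list is what would go forward as the training set for Augustus in step 7.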