  • dexseq_prepare script error with Ensembl human gtf

    Hi all

    I was following the instructions to analyze my RNA-Seq data with the DEXSeq package, but I ran into the following error while preparing the gff file:

    File "/home/aleadam/R/x86_64-pc-linux-gnu-library/3.3/DEXSeq/python_scripts/dexseq_prepare_annotation.py", line 127, in <module>
    assert l[i].iv.end <= l[i+1].iv.start, str(l[i+1]) + " starts too early"
    AssertionError: <GenomicFeature: exonic_part 'ENSG00000166260+ENSG00000141198' at 17: 54951904 -> 54951900 (strand '-')> starts too early


    I've seen a few posts with similar errors but never with the files downloaded from Ensembl itself, thus my post here.

    I got the files from: ftp://ftp.ensembl.org/pub/release-89/gtf/homo_sapiens/

    The command I ran is:

    python /home/aleadam/R/x86_64-pc-linux-gnu-library/3.3/DEXSeq/python_scripts/dexseq_prepare_annotation.py Homo_sapiens.GRCh38.89.gtf.gz Homo_sapiens.GRCh38.89.DEXSeq.gff

    I do not know what "ENSG00000166260+ENSG00000141198" is. Is there something I'm doing wrong?

    BTW, it happens with all the gtf files, and with version 88 as well. My apologies if this has been answered and I missed it. I'm struggling to understand what I'm doing here!

  • #2
    Hi aleadam,

    By the look of it, that assert statement makes sure that the 'exonic parts' of an aggregated gene set do not overlap. That is, the end of one exonic part, "l[i].iv.end", should not be located at a higher bp position than the start of the next exonic part, "l[i+1].iv.start". That's all the error is saying.
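
A minimal sketch of the check described above, in plain Python. The `Part` tuple and the `first_overlap` helper are hypothetical stand-ins for illustration only; the real script works on HTSeq feature objects, and the coordinates below are made up.

```python
from collections import namedtuple

# Hypothetical stand-in for an exonic part; not the HTSeq class the script uses.
Part = namedtuple("Part", ["name", "start", "end"])

def first_overlap(parts):
    """Return the first part that starts before its predecessor ends, else None.

    Mirrors the condition in the assertion: prev.end <= next.start must hold,
    so a part may begin exactly where the previous one ends.
    """
    for prev, nxt in zip(parts, parts[1:]):
        if prev.end > nxt.start:
            return nxt
    return None

# Made-up coordinates: the second list has a part that starts too early.
ok = [Part("E001", 100, 200), Part("E002", 200, 350)]
bad = [Part("E001", 100, 200), Part("E002", 150, 350)]

print(first_overlap(ok))
print(first_overlap(bad))
```

In the reported error the offending part runs from 54951904 to 54951900, so its start is larger than its end, which suggests the aggregated parts were already inconsistent before this check ran.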

    I noticed that those two genes are in a different orientation so I'm not sure why the script is complaining about this. You could open up that python file "dexseq_prepare_annotation.py" in a text editor and have a read to try and figure out what exactly it's doing. I had a quick look on my computer and it does contain comments about this.

    Also, you can run the script without doing the 'aggregate gene' part and it should work. Something like:

    python /home/aleadam/R/x86_64-pc-linux-gnu-library/3.3/DEXSeq/python_scripts/dexseq_prepare_annotation.py -r 'no' Homo_sapiens.GRCh38.89.gtf.gz Homo_sapiens.GRCh38.89.DEXSeq.gff

    Although you may not want to turn this off depending on your requirements.

    Good luck!

    Matt.



    • #3
      Thanks Matt for your reply.

      I don't know any Python, so it will take me a while to understand the script. My first attempt was to simply comment out that assert line, but who knows what other problems that would bring later on!

      I'm looking for advice on what would be the best approach to fix the issue. All I want is to get a list of differential exon usage on my data as part of an exploration for changes in endothelial behavior. I think I can deal with some lost entries, so I'll try your suggestion to use the -r 'no' option.
      Looking at posts similar to mine, it seems that this is usually a problem with the gtf annotation. My other option would be to delete the entries for that particular gene and try again, or to check whether there is an error in the annotation (maybe a misplaced strand sign confusing the orientation of a particular exon?).
      I'm very new to this and I'm learning by doing, so I apologize in advance if anything I am saying does not make any sense.
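
For the "delete the entries for that gene" option, a rough sketch in Python could look like the following. It assumes the Ensembl GTF layout in which every record carries a `gene_id "..."` attribute; the function names and the output file name are hypothetical, and any gene dropped this way will simply be absent from the downstream exon usage results.

```python
import gzip
import re

# Matches the gene_id attribute as written in Ensembl GTF files.
GENE_ID = re.compile(r'gene_id "([^"]+)"')

def drop_genes(lines, unwanted):
    """Yield GTF lines whose gene_id is not in `unwanted`.

    Header and comment lines have no gene_id and pass through unchanged.
    """
    for line in lines:
        match = GENE_ID.search(line)
        if match and match.group(1) in unwanted:
            continue
        yield line

def filter_gtf(src, dst, unwanted):
    """Read a gzipped GTF and write an uncompressed copy without `unwanted` genes."""
    with gzip.open(src, "rt") as fin, open(dst, "w") as fout:
        fout.writelines(drop_genes(fin, unwanted))

# Example call (the output file name is an assumption):
# filter_gtf("Homo_sapiens.GRCh38.89.gtf.gz",
#            "Homo_sapiens.GRCh38.89.filtered.gtf",
#            {"ENSG00000166260", "ENSG00000141198"})
```

The filtered file could then be passed to dexseq_prepare_annotation.py in place of the original download.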

      Thanks again,

      Alex.



      • #4
        As a quick update, both commands:

        python /home/aleadam/R/x86_64-pc-linux-gnu-library/3.3/DEXSeq/python_scripts/dexseq_prepare_annotation.py -r 'no' Homo_sapiens.GRCh38.89.gtf.gz Homo_sapiens.GRCh38.89.DEXSeq.gff

        and

        python /home/aleadam/R/x86_64-pc-linux-gnu-library/3.3/DEXSeq/python_scripts/dexseq_prepare_annotation.py Homo_sapiens.GRCh38.87.gtf.gz Homo_sapiens.GRCh38.87.DEXSeq.gff

        seem to work just fine. Thus it appears to be a problem specifically with the aggregate gene option in releases 88 and 89.

        Thanks again for your help

        Alex
        Last edited by aleadam; 07-24-2017, 06:25 AM.



        • #5
          Hi Alex,

          Yes, it seems that something changed in the annotation of the latest version. Perhaps you could compare the start and end positions for that particular gene in each of the gtf versions? This might give you a clue to what is going on.

          I suppose it might be that the script is not quite configured to deal with these edge cases of genes that are close together and sort of intertwined. Maybe if you email the authors of DEXSeq they could give you an explanation, or else you might have to dig through and learn some python!

          Good luck,

          Matt.



          • #6
            Hi Matt

            I indeed tried
            zcat Homo_sapiens.GRCh38.87.gtf.gz | grep ENSG00000166260 > 87.txt
            zcat Homo_sapiens.GRCh38.89.gtf.gz | grep ENSG00000166260 > 89.txt
            diff 87.txt 89.txt > 87vs89.diff

            But the diff was not very helpful. Lots of small changes in the annotations. I'm not a bioinformatician, but a cell biologist using bioinformatic tools, so my abilities to understand small differences in code or annotations are limited.
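
Since a raw diff also picks up every attribute change between releases, one alternative is to compare only the exon coordinates of the gene. The sketch below assumes the standard nine-column GTF layout (feature type in column 3, start and end in columns 4 and 5, strand in column 7); `exon_coords` and `compare` are hypothetical helper names.

```python
import gzip

def exon_coords(lines, gene_id):
    """Collect (chrom, start, end, strand) for the exon records of one gene."""
    tag = 'gene_id "%s"' % gene_id
    coords = set()
    for line in lines:
        if line.startswith("#") or tag not in line:
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) >= 9 and fields[2] == "exon":
            coords.add((fields[0], int(fields[3]), int(fields[4]), fields[6]))
    return coords

def compare(old_lines, new_lines, gene_id):
    """Return (exons only in old, exons only in new), each sorted."""
    old = exon_coords(old_lines, gene_id)
    new = exon_coords(new_lines, gene_id)
    return sorted(old - new), sorted(new - old)

# Example call on the two downloads:
# with gzip.open("Homo_sapiens.GRCh38.87.gtf.gz", "rt") as a, \
#      gzip.open("Homo_sapiens.GRCh38.89.gtf.gz", "rt") as b:
#     only_87, only_89 = compare(a, b, "ENSG00000166260")
```

Exons that appear in only one of the two lists are the coordinate changes between releases; everything else in the diff is attribute noise.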

            I saw the authors of DEXSeq answering a few questions here, so they might see it eventually. If not I will write them directly so they can update the script if needed.

            Anyhow, using release 87 I was able to follow the analysis through, so now it's time to dig into PubMed to figure out what the role (if any) of each hit might be.

            Cheers,

            Alex.



            • #7
              Looks to me like an error in the Ensembl annotation: transcript ENST00000639671 is misannotated as a transcript of the TOM1L1 gene (TOM1L1-224), when it is in fact derived from COX11. The error thrown by the DEXSeq script could be because the two genes are located on opposite strands.
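
One way to screen a GTF for this kind of inconsistency is to list genes whose exons are annotated on more than one strand. This is only a sketch: it assumes a misassigned transcript would sit on the opposite strand from the rest of its gene, which this thread does not confirm for ENST00000639671, and `mixed_strand_genes` is a hypothetical helper name.

```python
import re
from collections import defaultdict

GENE_ID = re.compile(r'gene_id "([^"]+)"')
TRANSCRIPT_ID = re.compile(r'transcript_id "([^"]+)"')

def mixed_strand_genes(lines):
    """Map gene_id -> {strand: [transcript_ids]} for genes with exons on both strands."""
    strands = defaultdict(lambda: defaultdict(set))
    for line in lines:
        if line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9 or fields[2] != "exon":
            continue
        gene = GENE_ID.search(fields[8])
        transcript = TRANSCRIPT_ID.search(fields[8])
        if gene and transcript:
            # Column 7 of a GTF record holds the strand.
            strands[gene.group(1)][fields[6]].add(transcript.group(1))
    return {
        gene: {strand: sorted(ids) for strand, ids in by_strand.items()}
        for gene, by_strand in strands.items()
        if len(by_strand) > 1
    }

# Example call (requires `import gzip`):
# with gzip.open("Homo_sapiens.GRCh38.89.gtf.gz", "rt") as handle:
#     print(mixed_strand_genes(handle))
```

Any gene reported here would be worth inspecting by hand before running the annotation preparation script with gene aggregation enabled.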



              • #8
                We're aware of the problem with TOM1L1 and COX11 in Ensembl, and it is fixed in the next release, coming out next week.
