Unconfigured Ad

Collapse
X
 
  • Filter
  • Time
  • Show
Clear All
new posts
  • ega2d
    Member
    • Oct 2010
    • 12

    Cufflinks GTF for DEXSeq

    I'm trying to use a Cufflinks GTF that includes novel isoforms as an input for DEXSeq. I'm having trouble with the first python script, which I successfully ran with the ensembl gtf...

    Here is a portion of my cufflinks GTF:

    1 protein_coding exon 5473 5485 . + . gene_id "XLOC_000001"; transcript_id "TCONS_00000001"; exon_number "1"; gene_name "Vom2r-ps1"; oId "ENSRNOT00000044270"; nearest_ref "ENSRNOT00000044270"; class_code "="; tss_id "TSS1"; p_id "P10";
    1 protein_coding exon 5524 5725 . + . gene_id "XLOC_000001"; transcript_id "TCONS_00000001"; exon_number "2"; gene_name "Vom2r-ps1"; oId "ENSRNOT00000044270"; nearest_ref "ENSRNOT00000044270"; class_code "="; tss_id "TSS1"; p_id "P10";

    When I run the python script, I get the following error:

    Traceback (most recent call last):
    File "/home/ega2d/bin/dexseq_prepare_annotation.py", line 33, in <module>
    exons[f.iv] += ( f.attr['gene_id'], f.attr['transcript_id'] )
    File "_HTSeq.pyx", line 509, in HTSeq._HTSeq.GenomicArray.__getitem__ (src/_HTSeq.c:8702)
    KeyError: 'Non-stranded index used for stranded GenomicArray.'


    My gtf clearly has strand information, so I'm at a loss for how to make my gtf play nicely with the python script. Any help from HTSeq/DEXSeq experts would be greatly appreciated!
  • CK_
    Junior Member
    • Jan 2010
    • 3

    #2
    Hi
    your cufflinks.gtf most probably contains "." in the strand column. (for 'don't know strand')
    cut -f7 cufflinks.gtf | sort | uniq -c
    to see whether you got any "."

    You can modify dexseq_prepare_annotation.py to work in un-stranded mode
    changing
    # Step 1: Store all exons with their gene and transcript ID
    # in a GenomicArrayOfSets

    exons = HTSeq.GenomicArrayOfSets( "auto", stranded=True )

    to

    exons = HTSeq.GenomicArrayOfSets( "auto", stranded=False )

    I guess you can analyze stranded and un-stranded transcripts separately
    or discard un-stranded transcripts if there are only few?
    I wonder what DEXSeq developers reckon.
    ck

    Comment

    • Simon Anders
      Senior Member
      • Feb 2010
      • 995

      #3
      The portion of your GFF file that you posted: Does it contain the offending line 509?

      Comment

      • CK_
        Junior Member
        • Jan 2010
        • 3

        #4
        Changing strand from "+" or "-" to "." causes the error:
        #-------------------------------------------------------------------------------
        my small_un_str.gtf:

        chr1 Cufflinks exon 11869 12227 . . . gene_id "XLOC_000001"; transcript_id "TCONS_00000001"; exon_number "1"; gene_name "DDX11L1"; oId "ENST00000456328.2"; nearest_ref "ENST00000456328.2"; class_code "="; tss_id "TSS1";

        python dexseq_prepare_annotation.py small_un_str.gtf small_un_str.gff

        Traceback (most recent call last):
        File "~/R/x86_64-unknown-linux-gnu-library/2.15/DEXSeq/python_scripts/dexseq_prepare_annotation.py", line 29, in <module>
        exons[f.iv] += ( f.attr['gene_id'], f.attr['transcript_id'] )
        File "_HTSeq.pyx", line 509, in HTSeq._HTSeq.GenomicArray.__getitem__ (src/_HTSeq.c:8702)
        KeyError: 'Non-stranded index used for stranded GenomicArray.'
        u
        #-------------------------------------------------------------------------------
        small.gtf
        chr1 Cufflinks exon 11869 12227 . + . gene_id "XLOC_000001"; transcript_id "TCONS_00000001"; exon_number "1"; gene_name "DDX11L1"; oId "ENST00000456328.2"; nearest_ref "ENST00000456328.2"; class_code "="; tss_id "TSS1";

        python dexseq_prepare_annotation.py small.gtf small.gff

        cat small.gff

        chr1 small.gtf aggregate_gene 11869 12227 . + . gene_id "XLOC_000001"
        chr1 small.gtf exonic_part 11869 12227 . + . transcripts "TCONS_00000001"; exonic_part_number "001"; gene_id "XLOC_000001"
        #-------------------------------------------------------------------------------
        Doesn't line 509 refer to python code itself and not gtf?
        thank you
        ck

        Comment

        • Simon Anders
          Senior Member
          • Feb 2010
          • 995

          #5
          Originally posted by CK_ View Post
          Changing strand from "+" or "-" to "." causes the error:
          Your suggestion to change "stranded=True" to "stranded=False" is correct. When programming it, I did not expect that people might have GFF files without strand information, so I did not expose this option.

          Doesn't line 509 refer to python code itself and not gtf?
          You are right, of course. Seems I wasn't fully awake yet when I wrote this earlier this today.

          Comment

          • CK_
            Junior Member
            • Jan 2010
            • 3

            #6
            Hi Simon,
            Originally posted by Simon Anders View Post
            When programming it, I did not expect that people might have GFF files without strand information, so I did not expose this option.
            I reckon (at least with cufflinks) all non-stranded entries are single exon transcripts and hence cufflinks don't know how to guess about strand from splice sites (when using not strand specific library for RNA-seq).
            Here's one example output from cufflinks:
            #--------------------------------------------
            # print all unstranded lines and exclude first exon:
            awk '{if($7=="."){print $0}}' merged.gtf | grep -v 'exon_number "1"' | wc -l
            0
            #--------------------------------------------
            awk '{if($7=="+"){print $0}}' merged.gtf | grep -v 'exon_number "1"' | wc -l
            859632
            #--------------------------------------------

            Given library is not strand specific,
            dexseq_count.py will be called with '-s no' flag anyways;
            Will changing stranded=True to stranded=False in dexseq_prepare_annotation.py affect results for transcripts with known strand in cufflinks.gtf?
            Thanks a lot!
            ck

            Comment

            • Simon Anders
              Senior Member
              • Feb 2010
              • 995

              #7
              Originally posted by CK_ View Post
              dexseq_count.py will be called with '-s no' flag anyways;
              Will changing stranded=True to stranded=False in dexseq_prepare_annotation.py affect results for transcripts with known strand in cufflinks.gtf?
              If you call the script with '-s no' anyway, this should not make a difference.

              Comment

              • delasa
                Member
                • Sep 2012
                • 12

                #8
                Hi Simon,
                I have gone through all the steps for DEXSeq. I want to know which isoforms/ splice variants had significant DE. I was wondering how I can find from DEXSeq results, which genes have been alternatively spliced? what are their isoforms?

                Thanks,

                Comment

                • Simon Anders
                  Senior Member
                  • Feb 2010
                  • 995

                  #9
                  Not sure I understand your question. You already have the list of exons that show significant differential usage, right? And each exon is annotated with a gene ID. So what exactly do you need now?

                  Comment

                  • delasa
                    Member
                    • Sep 2012
                    • 12

                    #10
                    Yes, I have genes with exons that have been significantly changed. I have attached "plotDEXSeq" result for one of the genes that have two isoforms:GLYMA01G01805.1, GLYMA01G01805.2.
                    I want to match up between the isoforms that are shown in the picture with the real one. Can we have access to isoforms in DEXSeq? I have genes and exons that are significantly changed, but I want to find the isoforms. All I can see in the "res1 = DEUresultTable(ecs_DE)" file, is the genes and eons, with no information about the isoforms. Is that clear now?
                    Attached Files

                    Comment

                    • Simon Anders
                      Senior Member
                      • Feb 2010
                      • 995

                      #11
                      In your plot, no exon is significant. I hope you have others.

                      Anyway, DEXSeq does not offer functionality to perform inference on the level of isoforms. It's DEXSeq's design philosophy to test of differential exon usgae not for differential isoform expression. Our paper gives a few examples why we feel that this is actually an advantage.

                      Comment

                      • delasa
                        Member
                        • Sep 2012
                        • 12

                        #12
                        Thanks for your answer. I know that DEXSeq is exons based not isoform based.
                        I have attached another plot, I guess the pink area implies that there is a significant DE there, right?

                        For drawing this plot, you are using some information in ExonCountSet object returned by "testForDEU" to find the transcripts, the number of transcripts, right? I want to pull this information out of testForDEU results. To be clear, How I can find information about transcripts of a gene from an ExonCountSet returned by testForDEU function?
                        Attached Files

                        Comment

                        • Simon Anders
                          Senior Member
                          • Feb 2010
                          • 995

                          #13
                          With "featureData(ecs)$transcripts" you get a character vectors with one entry for each exon. This entry is a list of transcript IDs, concatenated with semicolons, namely the IDs of all transcripts that contain the exon.

                          This information is retained by the dexseq_prepare_annotation.py script and put into the flattened GFF file. DEXSeq stores it in the "transcripts" slot of featureData but uses it only at a single point, namely to draw the transcripts under the plot.

                          You can try to do your own inference with this information, but in my experience, this does not work too well (though I haven't tried too hard either).

                          Comment

                          • delasa
                            Member
                            • Sep 2012
                            • 12

                            #14
                            Thanks a lot for your quick response,
                            can the information about the "expression", "norCounts", and "splicing" -which are used for "plotDEXSeq" function- be accessed through "featureData(ecs)" as well? or they are stored in another objects.

                            Comment

                            • Simon Anders
                              Senior Member
                              • Feb 2010
                              • 995

                              #15
                              norCounts is simply the counts, divided by the size factors. You get this with: counts( ecs, normalized=TRUE" )

                              The function "estimatelog2FoldChanges" gives you the values for the coefficients. See the paper for the formulae to assemble the plotted values from these. There are also two internal function,
                              "fitAndArrangeCoefs" and "getEffectsForPlotting", which do this, but they might be a bit hard to understand.

                              Comment

                              Latest Articles

                              Collapse

                              ad_right_rmr

                              Collapse

                              News

                              Collapse

                              Topics Statistics Last Post
                              Started by SEQadmin2, Today, 06:09 AM
                              0 responses
                              15 views
                              0 reactions
                              Last Post SEQadmin2  
                              Started by SEQadmin2, 06-09-2026, 11:58 AM
                              0 responses
                              34 views
                              0 reactions
                              Last Post SEQadmin2  
                              Started by SEQadmin2, 06-05-2026, 10:09 AM
                              0 responses
                              39 views
                              0 reactions
                              Last Post SEQadmin2  
                              Started by SEQadmin2, 06-04-2026, 08:59 AM
                              0 responses
                              45 views
                              0 reactions
                              Last Post SEQadmin2  
                              Working...