Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • Cufflinks GTF for DEXSeq

    I'm trying to use a Cufflinks GTF that includes novel isoforms as an input for DEXSeq. I'm having trouble with the first python script, which I successfully ran with the ensembl gtf...

    Here is a portion of my cufflinks GTF:

    1 protein_coding exon 5473 5485 . + . gene_id "XLOC_000001"; transcript_id "TCONS_00000001"; exon_number "1"; gene_name "Vom2r-ps1"; oId "ENSRNOT00000044270"; nearest_ref "ENSRNOT00000044270"; class_code "="; tss_id "TSS1"; p_id "P10";
    1 protein_coding exon 5524 5725 . + . gene_id "XLOC_000001"; transcript_id "TCONS_00000001"; exon_number "2"; gene_name "Vom2r-ps1"; oId "ENSRNOT00000044270"; nearest_ref "ENSRNOT00000044270"; class_code "="; tss_id "TSS1"; p_id "P10";

    When I run the python script, I get the following error:

    Traceback (most recent call last):
    File "/home/ega2d/bin/dexseq_prepare_annotation.py", line 33, in <module>
    exons[f.iv] += ( f.attr['gene_id'], f.attr['transcript_id'] )
    File "_HTSeq.pyx", line 509, in HTSeq._HTSeq.GenomicArray.__getitem__ (src/_HTSeq.c:8702)
    KeyError: 'Non-stranded index used for stranded GenomicArray.'


    My gtf clearly has strand information, so I'm at a loss for how to make my gtf play nicely with the python script. Any help from HTSeq/DEXSeq experts would be greatly appreciated!

  • #2
    Hi
    your cufflinks.gtf most probably contains "." in the strand column. (for 'don't know strand')
    cut -f7 cufflinks.gtf | sort | uniq -c
    to see whether you got any "."

    You can modify dexseq_prepare_annotation.py to work in un-stranded mode
    changing
    # Step 1: Store all exons with their gene and transcript ID
    # in a GenomicArrayOfSets

    exons = HTSeq.GenomicArrayOfSets( "auto", stranded=True )

    to

    exons = HTSeq.GenomicArrayOfSets( "auto", stranded=False )

    I guess you can analyze stranded and un-stranded transcripts separately
    or discard un-stranded transcripts if there are only few?
    I wonder what DEXSeq developers reckon.
    ck

    Comment


    • #3
      The portion of your GFF file that you posted: Does it contain the offending line 509?

      Comment


      • #4
        Changing strand from "+" or "-" to "." causes the error:
        #-------------------------------------------------------------------------------
        my small_un_str.gtf:

        chr1 Cufflinks exon 11869 12227 . . . gene_id "XLOC_000001"; transcript_id "TCONS_00000001"; exon_number "1"; gene_name "DDX11L1"; oId "ENST00000456328.2"; nearest_ref "ENST00000456328.2"; class_code "="; tss_id "TSS1";

        python dexseq_prepare_annotation.py small_un_str.gtf small_un_str.gff

        Traceback (most recent call last):
        File "~/R/x86_64-unknown-linux-gnu-library/2.15/DEXSeq/python_scripts/dexseq_prepare_annotation.py", line 29, in <module>
        exons[f.iv] += ( f.attr['gene_id'], f.attr['transcript_id'] )
        File "_HTSeq.pyx", line 509, in HTSeq._HTSeq.GenomicArray.__getitem__ (src/_HTSeq.c:8702)
        KeyError: 'Non-stranded index used for stranded GenomicArray.'
        u
        #-------------------------------------------------------------------------------
        small.gtf
        chr1 Cufflinks exon 11869 12227 . + . gene_id "XLOC_000001"; transcript_id "TCONS_00000001"; exon_number "1"; gene_name "DDX11L1"; oId "ENST00000456328.2"; nearest_ref "ENST00000456328.2"; class_code "="; tss_id "TSS1";

        python dexseq_prepare_annotation.py small.gtf small.gff

        cat small.gff

        chr1 small.gtf aggregate_gene 11869 12227 . + . gene_id "XLOC_000001"
        chr1 small.gtf exonic_part 11869 12227 . + . transcripts "TCONS_00000001"; exonic_part_number "001"; gene_id "XLOC_000001"
        #-------------------------------------------------------------------------------
        Doesn't line 509 refer to python code itself and not gtf?
        thank you
        ck

        Comment


        • #5
          Originally posted by CK_ View Post
          Changing strand from "+" or "-" to "." causes the error:
          Your suggestion to change "stranded=True" to "stranded=False" is correct. When programming it, I did not expect that people might have GFF files without strand information, so I did not expose this option.

          Doesn't line 509 refer to python code itself and not gtf?
          You are right, of course. Seems I wasn't fully awake yet when I wrote this earlier this today.

          Comment


          • #6
            Hi Simon,
            Originally posted by Simon Anders View Post
            When programming it, I did not expect that people might have GFF files without strand information, so I did not expose this option.
            I reckon (at least with cufflinks) all non-stranded entries are single exon transcripts and hence cufflinks don't know how to guess about strand from splice sites (when using not strand specific library for RNA-seq).
            Here's one example output from cufflinks:
            #--------------------------------------------
            # print all unstranded lines and exclude first exon:
            awk '{if($7=="."){print $0}}' merged.gtf | grep -v 'exon_number "1"' | wc -l
            0
            #--------------------------------------------
            awk '{if($7=="+"){print $0}}' merged.gtf | grep -v 'exon_number "1"' | wc -l
            859632
            #--------------------------------------------

            Given library is not strand specific,
            dexseq_count.py will be called with '-s no' flag anyways;
            Will changing stranded=True to stranded=False in dexseq_prepare_annotation.py affect results for transcripts with known strand in cufflinks.gtf?
            Thanks a lot!
            ck

            Comment


            • #7
              Originally posted by CK_ View Post
              dexseq_count.py will be called with '-s no' flag anyways;
              Will changing stranded=True to stranded=False in dexseq_prepare_annotation.py affect results for transcripts with known strand in cufflinks.gtf?
              If you call the script with '-s no' anyway, this should not make a difference.

              Comment


              • #8
                Hi Simon,
                I have gone through all the steps for DEXSeq. I want to know which isoforms/ splice variants had significant DE. I was wondering how I can find from DEXSeq results, which genes have been alternatively spliced? what are their isoforms?

                Thanks,

                Comment


                • #9
                  Not sure I understand your question. You already have the list of exons that show significant differential usage, right? And each exon is annotated with a gene ID. So what exactly do you need now?

                  Comment


                  • #10
                    Yes, I have genes with exons that have been significantly changed. I have attached "plotDEXSeq" result for one of the genes that have two isoforms:GLYMA01G01805.1, GLYMA01G01805.2.
                    I want to match up between the isoforms that are shown in the picture with the real one. Can we have access to isoforms in DEXSeq? I have genes and exons that are significantly changed, but I want to find the isoforms. All I can see in the "res1 = DEUresultTable(ecs_DE)" file, is the genes and eons, with no information about the isoforms. Is that clear now?
                    Attached Files

                    Comment


                    • #11
                      In your plot, no exon is significant. I hope you have others.

                      Anyway, DEXSeq does not offer functionality to perform inference on the level of isoforms. It's DEXSeq's design philosophy to test of differential exon usgae not for differential isoform expression. Our paper gives a few examples why we feel that this is actually an advantage.

                      Comment


                      • #12
                        Thanks for your answer. I know that DEXSeq is exons based not isoform based.
                        I have attached another plot, I guess the pink area implies that there is a significant DE there, right?

                        For drawing this plot, you are using some information in ExonCountSet object returned by "testForDEU" to find the transcripts, the number of transcripts, right? I want to pull this information out of testForDEU results. To be clear, How I can find information about transcripts of a gene from an ExonCountSet returned by testForDEU function?
                        Attached Files

                        Comment


                        • #13
                          With "featureData(ecs)$transcripts" you get a character vectors with one entry for each exon. This entry is a list of transcript IDs, concatenated with semicolons, namely the IDs of all transcripts that contain the exon.

                          This information is retained by the dexseq_prepare_annotation.py script and put into the flattened GFF file. DEXSeq stores it in the "transcripts" slot of featureData but uses it only at a single point, namely to draw the transcripts under the plot.

                          You can try to do your own inference with this information, but in my experience, this does not work too well (though I haven't tried too hard either).

                          Comment


                          • #14
                            Thanks a lot for your quick response,
                            can the information about the "expression", "norCounts", and "splicing" -which are used for "plotDEXSeq" function- be accessed through "featureData(ecs)" as well? or they are stored in another objects.

                            Comment


                            • #15
                              norCounts is simply the counts, divided by the size factors. You get this with: counts( ecs, normalized=TRUE" )

                              The function "estimatelog2FoldChanges" gives you the values for the coefficients. See the paper for the formulae to assemble the plotted values from these. There are also two internal function,
                              "fitAndArrangeCoefs" and "getEffectsForPlotting", which do this, but they might be a bit hard to understand.

                              Comment

                              Latest Articles

                              Collapse

                              • seqadmin
                                Current Approaches to Protein Sequencing
                                by seqadmin


                                Proteins are often described as the workhorses of the cell, and identifying their sequences is key to understanding their role in biological processes and disease. Currently, the most common technique used to determine protein sequences is mass spectrometry. While still a valuable tool, mass spectrometry faces several limitations and requires a highly experienced scientist familiar with the equipment to operate it. Additionally, other proteomic methods, like affinity assays, are constrained...
                                04-04-2024, 04:25 PM
                              • seqadmin
                                Strategies for Sequencing Challenging Samples
                                by seqadmin


                                Despite advancements in sequencing platforms and related sample preparation technologies, certain sample types continue to present significant challenges that can compromise sequencing results. Pedro Echave, Senior Manager of the Global Business Segment at Revvity, explained that the success of a sequencing experiment ultimately depends on the amount and integrity of the nucleic acid template (RNA or DNA) obtained from a sample. “The better the quality of the nucleic acid isolated...
                                03-22-2024, 06:39 AM

                              ad_right_rmr

                              Collapse

                              News

                              Collapse

                              Topics Statistics Last Post
                              Started by seqadmin, 04-11-2024, 12:08 PM
                              0 responses
                              25 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, 04-10-2024, 10:19 PM
                              0 responses
                              29 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, 04-10-2024, 09:21 AM
                              0 responses
                              25 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, 04-04-2024, 09:00 AM
                              0 responses
                              52 views
                              0 likes
                              Last Post seqadmin  
                              Working...
                              X