  • gokhulkrishnakilaru
    Member
    • Jul 2011
    • 39

    DEXSeq Prepare Annotation File and R output

    Hi Folks,

    I have downloaded the DEXSeq package from Bioconductor.

    When I try to make the annotation file using the dexseq_prepare_annotation.py script, a GFF file is generated, but it is zero KB in size.

    I tried the following files:

    Mus_musculus.NCBIM37.64.gtf
    Mouse_UCSC_Refgene.gtf (Refgene from UCSC)
    Mouse_UCSC_RefFlat.gtf (Refflat from UCSC)
    In every case, all I got was a zero KB file.

    The command I tried was

    Code:
    python dexseq_prepare_annotation.py Mus_musculus.NCBIM37.64.gtf Mus_musculus.NCBIM37.64.gff
    The above command works for Mus_musculus.GRCm38.68.gtf downloaded from Ensembl, and I don't understand why it wouldn't work for the NCBIM37 GTF, which is also from Ensembl.

    Also, running DEXSeq in R with Mus_musculus.GRCm38.68.gff and one wild-type and one knockout sample (no replicates) didn't give me anything: all I see is NA in every column. I am using the BAM files from TopHat, aligned against the RefSeq annotation.

    What am I doing wrong? Any pointers are highly appreciated.
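One way to narrow down why some GTF files give an empty output is to look at what the input actually contains. The sketch below is only a diagnostic and not part of DEXSeq: it tallies the feature types in column 3 of a GTF file and counts the `exon` lines that carry both a `gene_id` and a `transcript_id` attribute. That the preparation script depends on such lines is an assumption here, and the function name `summarize_gtf` and the example lines are made up for illustration.

```python
from collections import Counter

def summarize_gtf(lines):
    """Tally feature types and count exon lines with gene_id and transcript_id."""
    features = Counter()
    usable_exons = 0
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue  # skip comments and blank lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9:
            # GTF needs 9 tab-separated columns; anything else is counted as malformed
            features["<malformed>"] += 1
            continue
        features[fields[2]] += 1
        attrs = fields[8]
        if fields[2] == "exon" and "gene_id" in attrs and "transcript_id" in attrs:
            usable_exons += 1
    return features, usable_exons

# Made-up example lines in GTF layout (not taken from any real annotation).
example = [
    '1\tprotein_coding\texon\t100\t200\t.\t+\t.\tgene_id "G1"; transcript_id "T1";\n',
    '1\tprotein_coding\tCDS\t120\t180\t.\t+\t0\tgene_id "G1"; transcript_id "T1";\n',
    "1 protein_coding exon 300 400 . + . gene_id G2\n",  # space-separated, so malformed
]
features, usable_exons = summarize_gtf(example)
print(dict(features), usable_exons)
```

To run it on a real file, pass an open file handle, for example `with open("Mus_musculus.NCBIM37.64.gtf") as fh: print(summarize_gtf(fh))`. If the count of usable exon lines is zero, or most lines are reported as malformed, that would be a plausible place to start looking.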
  • areyes
    Senior Member
    • Aug 2010
    • 165

    #2
    Have you checked the size of your first file? Looks like you are replacing your input file with the output file.

    Could you please include reproducible R code, together with the output of sessionInfo()? Also, I would not have high hopes for the results without replicates.

    Alejandro Reyes


    • gokhulkrishnakilaru
      Member
      • Jul 2011
      • 39

      #3
      Originally posted by areyes View Post
      Have you checked the size of your first file? Looks like you are replacing your input file with the output file.

      Could you please include reproducible R code, together with the output of sessionInfo()? Also, I would not have high hopes for the results without replicates.

      Alejandro Reyes
      Hi Alejandro,

      Yes, I did check the size of the input file. I am changing the extension. The input file has GTF and the output has GFF as its extension.

      My R code is as follows

      Code:
      library(DEXSeq)
      options(digits=3)
      setwd("/test/dexseq/")
      rm(list=ls())
      # Flattened annotation produced by dexseq_prepare_annotation.py
      annotationfile = file.path("/test/dexseq/Mus_musculus.GRCm38.68.gff")
      annotationfile
      # Sample table: one wild type and one knock out, no replicates
      samples = data.frame(condition = c("WT", "KO"),replicate=c(1,1),row.names=c("WildType", "KnockOut"),stringsAsFactors=TRUE,check.names = FALSE)
      samples
      # Count files, one per sample
      fullFilenames<- list.files("/test/dexseq/",full.names=TRUE,pattern="DEXSEQ.txt")
      fullFilenames
      # Build the ExonCountSet and inspect counts and feature data
      ecs<- read.HTSeqCounts(countfiles = fullFilenames,design = samples,flattenedfile = annotationfile)
      head(counts(ecs))
      head(fData(ecs))
      All I see is NA, and the size factor estimation also gives NA.


      • areyes
        Senior Member
        • Aug 2010
        • 165

        #4
        Oops, my bad about the GTF extension thing...

        Do your files also contain NAs?
        Last edited by areyes; 10-10-2012, 05:54 AM.


        • gokhulkrishnakilaru
          Member
          • Jul 2011
          • 39

          #5
          Originally posted by areyes View Post
          Oops, my bad about the GTF extension thing...

          Do your files also contain NAs?
          No. My files have either a value or 0 for nothing. Also, I see another error saying
          Error in scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings,  : line 1 did not have 3 elements
          I looked at the tail of my counts file, and its last four lines are entries such as _ambiguous, _lowqual, etc.

          I deleted those lines, and now I get another error:
          Error in round(countData) : non-numeric argument to mathematical function

          Any pointers on these issues? This is the head of my counts file:

          Code:
          "ENSMUSG00000000001"    :001	1
          "ENSMUSG00000000001"    :002	0
          "ENSMUSG00000000001"    :003	0
          "ENSMUSG00000000001"    :004	1
          "ENSMUSG00000000001"    :005	0
          "ENSMUSG00000000001"    :006	0
          "ENSMUSG00000000001"    :007	0
          "ENSMUSG00000000001"    :008	0
          Last edited by gokhulkrishnakilaru; 10-10-2012, 05:54 AM.
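To locate the lines that trip up the reader, a small check along the following lines may help. It is a sketch, not part of DEXSeq or HTSeq: it assumes each line should look like `ID:EXON<TAB>COUNT`, with summary lines that start with an underscore also allowed, and it reports everything else. The pattern `LINE_RE`, the helper `find_bad_lines`, and the `_ambiguous` count of 12 are all made up for illustration.

```python
import re

# Assumed layout: ID:EXON<TAB>COUNT, e.g. "FBgn0000008:004\t1".
# Summary lines such as "_ambiguous\t12" are accepted as well.
LINE_RE = re.compile(r'^(?:[^\s":]+:\d+|_\w+)\t\d+$')

def find_bad_lines(lines):
    """Return (line_number, text) pairs for lines that do not match the layout."""
    bad = []
    for number, line in enumerate(lines, start=1):
        text = line.rstrip("\n")
        if text and not LINE_RE.match(text):
            bad.append((number, text))
    return bad

sample = [
    '"ENSMUSG00000000001"    :001\t1',  # quoted ID and spaces before the colon
    'ENSMUSG00000000001:002\t0',        # expected layout
    '_ambiguous\t12',                   # summary line, made-up count
]
bad_lines = find_bad_lines(sample)
print(bad_lines)
```

Only the first sample line is reported, because of the quotes around the gene ID and the spaces before the colon.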


          • areyes
            Senior Member
            • Aug 2010
            • 165

            #6
            I see. I think the files you are using as input are causing problems in the output of our HTSeq Python scripts. I will check what is going on. In the meantime, you can reformat your files to look more like this:

            Code:
            FBgn0000003:001	0
            FBgn0000008:001	0
            FBgn0000008:002	0
            FBgn0000008:003	0
            FBgn0000008:004	1
            FBgn0000008:005	4
            FBgn0000008:006	1
            FBgn0000008:007	18
            FBgn0000008:008	4
            FBgn0000008:009	16
            Then it should be fine!
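The reformatting suggested above could be scripted roughly as follows. This is a minimal sketch that assumes the only problems are the quotes around the gene ID and the whitespace before the colon, as in the counts file shown earlier in the thread; `clean_count_line` is a hypothetical helper, not a DEXSeq or HTSeq function.

```python
def clean_count_line(line):
    """Turn '"GENE"    :001<TAB>1' into 'GENE:001<TAB>1'."""
    # Drop the quotes, then split on any whitespace (spaces or tabs).
    tokens = line.replace('"', "").split()
    if len(tokens) < 2:
        raise ValueError(f"cannot parse count line: {line!r}")
    # The last token is the count; everything before it forms the ID.
    *key_parts, count = tokens
    return "".join(key_parts) + "\t" + str(int(count))

raw = [
    '"ENSMUSG00000000001"    :001\t1',
    '"ENSMUSG00000000001"    :002\t0',
]
cleaned = [clean_count_line(line) for line in raw]
for line in cleaned:
    print(line)
```

Lines that are already in the target layout pass through unchanged, so the helper can be applied to a whole file without first checking which lines need fixing.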


            • areyes
              Senior Member
              • Aug 2010
              • 165

              #7
              By the way, where can I download the annotation files you used?


              • gokhulkrishnakilaru
                Member
                • Jul 2011
                • 39

                #8
                Originally posted by areyes View Post
                I see. I think the files you are using as input are causing problems in the output of our HTSeq Python scripts. I will check what is going on. In the meantime, you can reformat your files to look more like this:

                Code:
                FBgn0000003:001	0
                FBgn0000008:001	0
                FBgn0000008:002	0
                FBgn0000008:003	0
                FBgn0000008:004	1
                FBgn0000008:005	4
                FBgn0000008:006	1
                FBgn0000008:007	18
                FBgn0000008:008	4
                FBgn0000008:009	16
                Then it should be fine!
                Perfect. That helps me. Also, what about those last four lines that start with an underscore and carry a numerical value? Can I delete them?


                • gokhulkrishnakilaru
                  Member
                  • Jul 2011
                  • 39

                  #9
                  Originally posted by areyes View Post
                  By the way, where can I download the annotation files you used?
                  ftp://ftp.ensembl.org/pub/release-68/gtf/mus_musculus

                  That is where I got the one that worked for me.

                  For the RefSeq ones, you can go to genome.ucsc.edu and open the Tables section. Choose mouse and RefSeq Genes, then refFlat or refGene, and select GTF as the output format. If you succeed in preparing the annotation file, please upload it somewhere, or I can invite you to my Dropbox. That way I would have a RefSeq annotation file.

                  Thanks for the support, my friend.


                  • areyes
                    Senior Member
                    • Aug 2010
                    • 165

                    #10
                    You could, but they are also deleted automatically in the function "read.HTSeqCounts"!


                    • gokhulkrishnakilaru
                      Member
                      • Jul 2011
                      • 39

                      #11
                      Originally posted by areyes View Post
                      You could, but they are also deleted automatically in the function "read.HTSeqCounts"!
                      Hi Alejandro,

                      I was successful in making the counts file as you suggested. I ran the script. The following are my errors. Any pointers that could be of help?

                      Code:
                      > ecs<- estimateSizeFactors(ecs)
                      > ecs<- estimateDispersions(ecs)
                      Dispersion estimation. (Progress report: one dot per 100 genes)
                      Error in FUN(c("ENSMUSG00000000078", "ENSMUSG00000000134", "ENSMUSG00000000182",  : 
                        Underdetermined model; cannot estimate dispersions. Maybe replicates have not been properly specified.
                      In addition: Warning messages:
                      1: In .local(object, ...) :
                        Exons with less than 11 counts will be discarded. For more details read the documentation, parameter minCount
                      2: In .local(object, ...) :
                        Genes with more than 70 testable exons will be kicked out of the analysis. For more details read the documentation, parameter maxExon
                      I was looking at this link - http://seqanswers.com/forums/archive...p/t-21212.html. Can I delete that line for my case?
                      Last edited by gokhulkrishnakilaru; 10-10-2012, 06:35 AM.


                      • gokhulkrishnakilaru
                        Member
                        • Jul 2011
                        • 39

                        #12
                        Any thoughts anybody?

                        Sorry mods, for bumping up posts.

                        Urgent task. So, had to.


                        • areyes
                          Senior Member
                          • Aug 2010
                          • 165

                          #13
                          Hi gokhulkrishnakilaru,

                          The error speaks for itself: "Underdetermined model; cannot estimate dispersions. Maybe replicates have not been properly specified." You do not have replicates. Sorry that I cannot help.

                          Alejandro
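A rough way to see the problem before running the analysis is to count samples per condition. The sketch below takes a deliberately simplified one-factor view (one mean per condition) and is not the check DEXSeq itself performs: with as many conditions as samples there are no residual degrees of freedom left to estimate variability. The helper name `replicate_report` and the replicated design in the second call are hypothetical.

```python
from collections import Counter

def replicate_report(conditions):
    """Return (residual degrees of freedom, conditions with fewer than 2 samples)."""
    counts = Counter(conditions)
    # One mean is fitted per condition, so what is left over is
    # (number of samples) - (number of conditions).
    residual_df = len(conditions) - len(counts)
    unreplicated = sorted(c for c, n in counts.items() if n < 2)
    return residual_df, unreplicated

# The design from this thread: one wild type, one knock out.
thread_design = replicate_report(["WT", "KO"])
# A hypothetical design with two replicates per condition.
replicated_design = replicate_report(["WT", "WT", "KO", "KO"])
print(thread_design)
print(replicated_design)
```

For the design in this thread the residual degrees of freedom come out as zero and both conditions are flagged as unreplicated, which is consistent with the "Underdetermined model" message.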


                          • gokhulkrishnakilaru
                            Member
                            • Jul 2011
                            • 39

                            #14
                            Originally posted by areyes View Post
                            Hi gokhulkrishnakilaru,

                            The error speaks for itself: "Underdetermined model; cannot estimate dispersions. Maybe replicates have not been properly specified." You do not have replicates. Sorry that I cannot help.

                            Alejandro
                            Thanks Alejandro,

                            So DEXSeq cannot work without replicates?

                            Is that the conclusion?

                            Is it possible to change how the replicates are declared in this section?

                            Code:
                            samples = data.frame(condition = c("WT", "KO"),replicate=c(1,1),row.names=c("WildType", "KnockOut"),stringsAsFactors=TRUE,check.names = FALSE)


                            • areyes
                              Senior Member
                              • Aug 2010
                              • 165

                              #15
                              The motivation for developing DESeq and DEXSeq is to be able to estimate the biological variability between replicates and take it into account when calling differentially expressed genes or exons. If you don't have replicates, you do not know whether the changes you observe are due to biological variation or to the differences in your genotypes. In any experiment it is crucial to have replicates; this is the only way to ensure that your differential expression calls are reproducible. For more details, you could check:



                              the discussion of our DEXSeq paper:
