Seqanswers Leaderboard Ad

Collapse
X
 
  • Filter
  • Time
  • Show
Clear All
new posts
  • FuzzyCoder
    Member
    • Aug 2011
    • 13

    DEXSeq Using Counts File From htseq-count

    I am attempting to run DEXSeq in R using counts files generated by the htseq-count python script rather than the dexseq-count script provided with the DEXSeq module for R 2.14.

    Here is the R script:

    Code:
    library(DEXSeq)
    annotationfile = file.path("/media/myLab/DEXSeq/Mus_musculus.NCBIM37.64.gff")
    samples = data.frame(
      condition = c("Condition1", "Condition2"),
      replicate = c(1:1,1:1),
      type = c("paired-end", "paired-end"),
      row.names = c("30D21712WKS", "B0541412WKS"),
      stringsAsFactors = TRUE,
      check.names = FALSE
    )
    
    ecs = read.HTSeqCounts(countfiles = file.path("/media/myLab/DEXSeq/Counts", paste(paste(rownames(samples),"Counts", sep="_"), "txt", sep=".")),
                           design = samples,
                           flattenedfile = annotationfile
    )
    Here is the error I receive:

    Code:
    Error in read.HTSeqCounts(countfiles = file.path("/media/myLab/DEXSeq/Counts",  : 
      Count files do not correspond to the flattened annotation file
    In addition: Warning message:
    In aggregates$gene_id == genesrle :
      longer object length is not a multiple of shorter object length
    Here is my question:

    How do I properly format/use the htseq-count counts file to create the ExonCountSet object?

    Here is what I know:

    1) GFF file is a flattened GTF using dexseq_prepare_annotation.py
    2) The GTF file is Mus_musculus.NCBIM37.64.gtf
    3) The same GTF file was used to generate the SAM file using GSNAP
    4) The SAM produced by GSNAP was formatted in the following ways to be compatible with DEXSeq/HTSeq:
    A) Replaced "." with "N" in sequences in SAM
    B) Replaced spaces with tabs in SAM
    C) Removed 'chr' from chromosome name in SAM
    D) Dropped alignments with '*' for QUAL string in SAM
    E) Sorted alignments for selecting best pairs in SAM
    F) Selected best alignment pairs to reduce ambiguous alignment count from SAM
    G) Sorted selected alignments for counting in SAM
    H) Created HTSeq Counts File
    NOTES:

    Re: 4C): 'chr' was not present in chromosome field of GFF, just the name (i.e., '1', not 'chr1')

    Re: 4E) sorted on name (field 1) and QUAL (field 5)

    Re: 4F) applied the following script to choose the best scoring pair among multiple alignments (based on a post from Simon):

    Code:
    import sys, re, optparse
    import HTSeq
    	
    def choose_best( samFile ):
    	insam = HTSeq.SAM_Reader( samFile )
    
    	# Go through all reads, with their alignments bundled up:
    	for bundle in HTSeq.bundle_multiple_alignments( insam ):
    	   fBestAlmt = None
    	   rBestAlmt = None
    	   # Go through all alignments of a given read, looking
    	   # for the one with the best alignment score
    	   for almt in bundle:
    		  if almt.pe_which == "first": 
    			 if fBestAlmt is None:
    				fBestAlmt = almt
    			 elif almt.aQual > fBestAlmt.aQual:
    				fBestAlmt = almt
    			 elif almt.aQual == fBestAlmt:
    				# If there are more than one best alignment, 
    				# better skip the read
    				fBestAlmt = None
    		  elif almt.pe_which == "second": 
    			 if rBestAlmt is None:
    				rBestAlmt = almt
    			 elif almt.aQual > rBestAlmt.aQual:
    				rBestAlmt = almt
    			 elif almt.aQual == rBestAlmt:
    				# If there are more than one best alignment, 
    				# better skip the read
    				rBestAlmt = None
    
    	   if fBestAlmt is not None and rBestAlmt is not None:
    		   if fBestAlmt.aQual > rBestAlmt.aQual:
    			  if fBestAlmt.mate_start is not None:
    				 for mAlmt in bundle:
    					if mAlmt.iv is not None:
    					   if mAlmt.iv.start is not None:
    						  if fBestAlmt.mate_start.pos == mAlmt.iv.start:
    							 rBestAlmt = mAlmt
    		   elif rBestAlmt.aQual > fBestAlmt.aQual:
    			  if rBestAlmt.mate_start is not None:
    				 for mAlmt in bundle:
    					if mAlmt.iv is not None:
    					   if mAlmt.iv.start is not None:
    						  if rBestAlmt.mate_start.pos == mAlmt.iv.start:
    							 fBestAlmt = mAlmt
    
    	   if fBestAlmt is not None:
    		  # Change the NH field to 1 and print the line
    		  print re.sub( "NH:i:\d+", "NH:i:1", fBestAlmt.original_sam_line ),
    		  
    	   if rBestAlmt is not None:
    		  # Change the NH field to 1 and print the line
    		  print re.sub( "NH:i:\d+", "NH:i:1", rBestAlmt.original_sam_line ),
    
    def main():
    	optParser = optparse.OptionParser()
    	(opts, args)  = optParser.parse_args()
    	
    	choose_best(args[0])
    Re: 4G): Sorted on fields 1 and 2 in SAM

    Re: 4H): used htseq-count instead of dexseq-count because GFF contained attribute "exonic_part" not "exon" and dexseq_count was expecting "exon" whereas htseq-count alloowed me to specify the attribute name. Here is the htseq-count command:


    Code:
    htseq-count -m intersection-nonempty -s no -t exonic_part -i gene_id -o mysam_XF.sam mysam_Final.sam Mus_musculus.NCBIM37.64.gff > mysam_Counts.txt
    Re: 4H): dexseq_count results in no counts for any features. The call to htseq-count results in many matches to features and acceptable rejects.

    Re 4H): The counts file from htseq is notably different from that dexseq_count:

    From dexseq_count:

    Code:
    ENSMUSG00000093219+ENSMUSG00000025261:001	0
    ENSMUSG00000093219+ENSMUSG00000025261:002	0
    ENSMUSG00000093219+ENSMUSG00000025261:003	0
    ENSMUSG00000093219+ENSMUSG00000025261:004	0
    ENSMUSG00000093219+ENSMUSG00000025261:005	0
    ENSMUSG00000093219+ENSMUSG00000025261:006	0
    ENSMUSG00000093219+ENSMUSG00000025261:007	0
    ENSMUSG00000093219+ENSMUSG00000025261:008	0
    ENSMUSG00000093219+ENSMUSG00000025261:009	0
    ENSMUSG00000093219+ENSMUSG00000025261:010	0
    .
    .
    .
    whereas from htseq-count:

    Code:
    ENSMUSG00000093219+ENSMUSG00000025261	4532
    Note 1 record in htseq-count for this feature but 136 records in dexseq_count.

    From what I can tell, dexseq_count includes the exonic_part_number in the records of the counts file whereas htseq-count does not.

    How can I modify htseq-count to produce the same file as dexseq_count?
    OR
    How can I modify DEXSeq to accept the counts file produced by htseq-count?
    OR
    How can I use the counts file from htseq-count combined with other data to properly create the ExonCountSet object in R?

    or, humbly, what do I not know that I should know to make it the last step?

    NOTE: I had no luck directly calling dexseq_count with sams produced by TopHat nor GSNAP. I had to do the format changes in step 4 to make the process work for htseq-count. I have not yet succeeded in running dexseq_count.
    Last edited by FuzzyCoder; 10-10-2011, 09:03 PM. Reason: Cleaned code snippets
    Best Regards,

    Paul Bergmann
  • Simon Anders
    Senior Member
    • Feb 2010
    • 995

    #2
    It should be much easier to get dexseq_count.py to work than to try to use htseq_count.py. The reason that I have written two scripts is, after all, that the tasks are not exactly the same.

    dexseq_count.py should not have any problems with a GFF file produced with dexseq_prepare.py.

    I used htseq-count instead of dexseq-count because GFF contained attribute "exonic_part" not "exon" and dexseq_count was expecting "exon" whereas htseq-count alloowed me to specify the attribute name.
    dexseq_prepare.py has split and renamed all your "exon" lines to "exonic_part", and this is what dexseq_count.py expects.

    We have tested it with TopHat SAM files but not yet with GSNAP SAM files. With TopHat, it works fine.

    Maybe you can send me by email an excerpt from your TopHat and GSNAP SAM files and then, we can investigate.

    Comment

    • FuzzyCoder
      Member
      • Aug 2011
      • 13

      #3
      Thank you Simon.

      After reading your reply, I decided to rerun dexseq_count.py and it indeed worked . It seems that I did not actually follow all the pre-processing steps necessary to make the SAM compatible when I last tried dexseq_count .

      I have been able to generate count files and create an ExonCountSet object per the 10/2/2011 pasilla vignette (using my data) through estimateDispersions.

      However, I now get an error when attempting to run fitDispersionFunction:

      Code:
      Error in glmgam.fit(mm, disps[good], start = coefs) : 
        More columns than rows in X
      In addition: Warning message:
      In is.na(rows) : is.na() applied to non-(list or vector) of type 'NULL'
      Error in fitDispersionFunction(ecs) : 
        Failed to fit the dispersion function
      Any thoughts?

      I will attempt to email the RData containing the ecs. However, it is 7MB, so please let me know if that does not make it to you. I can give you access to it via DropBox or FTP at your preference.
      Best Regards,

      Paul Bergmann

      Comment

      • areyes
        Senior Member
        • Aug 2010
        • 165

        #4
        Also... you don't have the latest version of DEXSeq (0.99.0). You could also try updating it.

        Comment

        • FuzzyCoder
          Member
          • Aug 2011
          • 13

          #5
          Alejandro-

          I updated to 0.1.29 with biocLite. Same error.

          I emailed the data file to you and Simon (~7MB). I also sent you an invitation to my DropBox folder.

          Thanks for your assistance.
          Best Regards,

          Paul Bergmann

          Comment

          • areyes
            Senior Member
            • Aug 2010
            • 165

            #6
            I found the reason why the function is breaking. DEXSeq follows the motivation of DESeq package to use biological replicates to estimate the variance between samples to distinguish real effects from your treatments from just technical and biological variation, in this case you don't have biological replicates and the individual exon dispersion estimations give values that are basically 0. (Try doing the Figure 1 of the vignette) Then, at the time of estimating the dispersion function it just breaks, because the data behaves differently of what we are expecting.

            Comment

            • FuzzyCoder
              Member
              • Aug 2011
              • 13

              #7
              Thank you!

              I will try it with additional replicates. I only used one per condition because I wanted to work my way through the GSNAP -> DEXSeq workflow successfully before I ran all the replicates through GSNAP (~12 hours per replicate). I will let you know how it goes tomorrow.
              Best Regards,

              Paul Bergmann

              Comment

              • oliviera
                Member
                • Apr 2010
                • 31

                #8
                Dear all,
                I try to use DExseq but got the following error when I call the function

                ecs<- fitDispersionFunction(ecs)

                Warning message:
                In glmgam.fit(mm, disps[good], coef.start = coefs) :
                Too much damping - convergence tolerance not achievable


                Here is the version in R I use
                R version 2.15.1 (2012-06-22)
                Platform: x86_64-apple-darwin9.8.0/x86_64 (64-bit)
                locale:
                [1] C/UTF-8/C/C/C/C
                attached base packages:
                [1] stats graphics grDevices utils datasets methods base
                other attached packages:
                [1] DEXSeq_1.4.0 Biobase_2.18.0 BiocGenerics_0.4.0
                loaded via a namespace (and not attached):
                [1] RCurl_1.95-1.1 XML_3.95-0.1 biomaRt_2.14.0 hwriter_1.3 plyr_1.7.1
                [6] statmod_1.4.16 stringr_0.6.1

                Any suggestion on what is going on?

                Cheers
                Olivier

                Comment

                • areyes
                  Senior Member
                  • Aug 2010
                  • 165

                  #9
                  Hi oliviera,

                  This warning sometimes happens when your data is sparsed. Have you done the "fit diagnostics" as indicated in the vignette? As it is just a warning, you can go on with your analysis, maybe you will loose some power if the fit is "above" most of the points...

                  Alejandro

                  Comment

                  • oliviera
                    Member
                    • Apr 2010
                    • 31

                    #10
                    Hi Alejandro,

                    Here is the graph to check dispersion estimate. What do you think?

                    meanvalues<- rowMeans(counts(ecs))
                    plot(meanvalues, fData(ecs)$dispBeforeSharing, log="xy", main="mean vs CR dispersion")
                    x<- 0.01:max(meanvalues)
                    y<- ecs@dispFitCoefs[1] + ecs@dispFitCoefs[2] / x
                    lines(x, y, col="red")
                    Attached Files

                    Comment

                    • areyes
                      Senior Member
                      • Aug 2010
                      • 165

                      #11
                      I think it should be OK, how many hits do you get?

                      Comment

                      • oliviera
                        Member
                        • Apr 2010
                        • 31

                        #12
                        Only 92 with adjp < 0.01.
                        With DEseq I get ~2000 genes at this cut off. I think I made something wrong

                        Comment

                        • metheuse
                          Member
                          • Jan 2013
                          • 84

                          #13
                          I got the same problem. It's all zeros in the second column of the output of dexseq_count.py
                          I saw you said you realized later you didn't follow the manual exactly. Could you tell me what's the problem? Thanks.

                          Comment

                          • Simon Anders
                            Senior Member
                            • Feb 2010
                            • 995

                            #14
                            Originally posted by oliviera View Post
                            Only 92 with adjp < 0.01.
                            With DEseq I get ~2000 genes at this cut off. I think I made something wrong
                            Why would you use such as stringent cut-off?

                            Do you really need to make sure that your list of differentially used exons do not contain more than 1% false positives? In most use cases, 10% are considered acceptable.

                            And: Of course, you get much more genes than exons. Detecting differential expression is needs to see much less information in the data than detecting differential exon usage.

                            Comment

                            • Simon Anders
                              Senior Member
                              • Feb 2010
                              • 995

                              #15
                              Originally posted by metheuse View Post
                              I got the same problem. It's all zeros in the second column of the output of dexseq_count.py
                              I saw you said you realized later you didn't follow the manual exactly. Could you tell me what's the problem? Thanks.
                              Hav you checked your alignments with a genome browser? Load the SAM file and the GFF file produced by dexseq_prepare in, e.g., IGV, and look at one of the loci with zero counts. If there really are no reads, you experiment has failed (or you are using a wrong annotation file).

                              Comment

                              Latest Articles

                              Collapse

                              • seqadmin
                                Pathogen Surveillance with Advanced Genomic Tools
                                by seqadmin




                                The COVID-19 pandemic highlighted the need for proactive pathogen surveillance systems. As ongoing threats like avian influenza and newly emerging infections continue to pose risks, researchers are working to improve how quickly and accurately pathogens can be identified and tracked. In a recent SEQanswers webinar, two experts discussed how next-generation sequencing (NGS) and machine learning are shaping efforts to monitor viral variation and trace the origins of infectious...
                                03-24-2025, 11:48 AM
                              • seqadmin
                                New Genomics Tools and Methods Shared at AGBT 2025
                                by seqadmin


                                This year’s Advances in Genome Biology and Technology (AGBT) General Meeting commemorated the 25th anniversary of the event at its original venue on Marco Island, Florida. While this year’s event didn’t include high-profile musical performances, the industry announcements and cutting-edge research still drew the attention of leading scientists.

                                The Headliner
                                The biggest announcement was Roche stepping back into the sequencing platform market. In the years since...
                                03-03-2025, 01:39 PM

                              ad_right_rmr

                              Collapse

                              News

                              Collapse

                              Topics Statistics Last Post
                              Started by seqadmin, 03-20-2025, 05:03 AM
                              0 responses
                              42 views
                              0 reactions
                              Last Post seqadmin  
                              Started by seqadmin, 03-19-2025, 07:27 AM
                              0 responses
                              53 views
                              0 reactions
                              Last Post seqadmin  
                              Started by seqadmin, 03-18-2025, 12:50 PM
                              0 responses
                              38 views
                              0 reactions
                              Last Post seqadmin  
                              Started by seqadmin, 03-03-2025, 01:15 PM
                              0 responses
                              194 views
                              0 reactions
                              Last Post seqadmin  
                              Working...