Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • DEXSeq Using Counts File From htseq-count

    I am attempting to run DEXSeq in R using counts files generated by the htseq-count python script rather than the dexseq-count script provided with the DEXSeq module for R 2.14.

    Here is the R script:

    Code:
    library(DEXSeq)
    annotationfile = file.path("/media/myLab/DEXSeq/Mus_musculus.NCBIM37.64.gff")
    samples = data.frame(
      condition = c("Condition1", "Condition2"),
      replicate = c(1:1,1:1),
      type = c("paired-end", "paired-end"),
      row.names = c("30D21712WKS", "B0541412WKS"),
      stringsAsFactors = TRUE,
      check.names = FALSE
    )
    
    ecs = read.HTSeqCounts(countfiles = file.path("/media/myLab/DEXSeq/Counts", paste(paste(rownames(samples),"Counts", sep="_"), "txt", sep=".")),
                           design = samples,
                           flattenedfile = annotationfile
    )
    Here is the error I receive:

    Code:
    Error in read.HTSeqCounts(countfiles = file.path("/media/myLab/DEXSeq/Counts",  : 
      Count files do not correspond to the flattened annotation file
    In addition: Warning message:
    In aggregates$gene_id == genesrle :
      longer object length is not a multiple of shorter object length
    Here is my question:

    How do I properly format/use the htseq-count counts file to create the ExonCountSet object?

    Here is what I know:

    1) GFF file is a flattened GTF using dexseq_prepare_annotation.py
    2) The GTF file is Mus_musculus.NCBIM37.64.gtf
    3) The same GTF file was used to generate the SAM file using GSNAP
    4) The SAM produced by GSNAP was formatted in the following ways to be compatible with DEXSeq/HTSeq:
    A) Replaced "." with "N" in sequences in SAM
    B) Replaced spaces with tabs in SAM
    C) Removed 'chr' from chromosome name in SAM
    D) Dropped alignments with '*' for QUAL string in SAM
    E) Sorted alignments for selecting best pairs in SAM
    F) Selected best alignment pairs to reduce ambiguous alignment count from SAM
    G) Sorted selected alignments for counting in SAM
    H) Created HTSeq Counts File
    NOTES:

    Re: 4C): 'chr' was not present in chromosome field of GFF, just the name (i.e., '1', not 'chr1')

    Re: 4E) sorted on name (field 1) and QUAL (field 5)

    Re: 4F) applied the following script to choose the best scoring pair among multiple alignments (based on a post from Simon):

    Code:
    import sys, re, optparse
    import HTSeq
    	
    def choose_best( samFile ):
    	insam = HTSeq.SAM_Reader( samFile )
    
    	# Go through all reads, with their alignments bundled up:
    	for bundle in HTSeq.bundle_multiple_alignments( insam ):
    	   fBestAlmt = None
    	   rBestAlmt = None
    	   # Go through all alignments of a given read, looking
    	   # for the one with the best alignment score
    	   for almt in bundle:
    		  if almt.pe_which == "first": 
    			 if fBestAlmt is None:
    				fBestAlmt = almt
    			 elif almt.aQual > fBestAlmt.aQual:
    				fBestAlmt = almt
    			 elif almt.aQual == fBestAlmt:
    				# If there are more than one best alignment, 
    				# better skip the read
    				fBestAlmt = None
    		  elif almt.pe_which == "second": 
    			 if rBestAlmt is None:
    				rBestAlmt = almt
    			 elif almt.aQual > rBestAlmt.aQual:
    				rBestAlmt = almt
    			 elif almt.aQual == rBestAlmt:
    				# If there are more than one best alignment, 
    				# better skip the read
    				rBestAlmt = None
    
    	   if fBestAlmt is not None and rBestAlmt is not None:
    		   if fBestAlmt.aQual > rBestAlmt.aQual:
    			  if fBestAlmt.mate_start is not None:
    				 for mAlmt in bundle:
    					if mAlmt.iv is not None:
    					   if mAlmt.iv.start is not None:
    						  if fBestAlmt.mate_start.pos == mAlmt.iv.start:
    							 rBestAlmt = mAlmt
    		   elif rBestAlmt.aQual > fBestAlmt.aQual:
    			  if rBestAlmt.mate_start is not None:
    				 for mAlmt in bundle:
    					if mAlmt.iv is not None:
    					   if mAlmt.iv.start is not None:
    						  if rBestAlmt.mate_start.pos == mAlmt.iv.start:
    							 fBestAlmt = mAlmt
    
    	   if fBestAlmt is not None:
    		  # Change the NH field to 1 and print the line
    		  print re.sub( "NH:i:\d+", "NH:i:1", fBestAlmt.original_sam_line ),
    		  
    	   if rBestAlmt is not None:
    		  # Change the NH field to 1 and print the line
    		  print re.sub( "NH:i:\d+", "NH:i:1", rBestAlmt.original_sam_line ),
    
    def main():
    	optParser = optparse.OptionParser()
    	(opts, args)  = optParser.parse_args()
    	
    	choose_best(args[0])
    Re: 4G): Sorted on fields 1 and 2 in SAM

    Re: 4H): used htseq-count instead of dexseq-count because GFF contained attribute "exonic_part" not "exon" and dexseq_count was expecting "exon" whereas htseq-count alloowed me to specify the attribute name. Here is the htseq-count command:


    Code:
    htseq-count -m intersection-nonempty -s no -t exonic_part -i gene_id -o mysam_XF.sam mysam_Final.sam Mus_musculus.NCBIM37.64.gff > mysam_Counts.txt
    Re: 4H): dexseq_count results in no counts for any features. The call to htseq-count results in many matches to features and acceptable rejects.

    Re 4H): The counts file from htseq is notably different from that dexseq_count:

    From dexseq_count:

    Code:
    ENSMUSG00000093219+ENSMUSG00000025261:001	0
    ENSMUSG00000093219+ENSMUSG00000025261:002	0
    ENSMUSG00000093219+ENSMUSG00000025261:003	0
    ENSMUSG00000093219+ENSMUSG00000025261:004	0
    ENSMUSG00000093219+ENSMUSG00000025261:005	0
    ENSMUSG00000093219+ENSMUSG00000025261:006	0
    ENSMUSG00000093219+ENSMUSG00000025261:007	0
    ENSMUSG00000093219+ENSMUSG00000025261:008	0
    ENSMUSG00000093219+ENSMUSG00000025261:009	0
    ENSMUSG00000093219+ENSMUSG00000025261:010	0
    .
    .
    .
    whereas from htseq-count:

    Code:
    ENSMUSG00000093219+ENSMUSG00000025261	4532
    Note 1 record in htseq-count for this feature but 136 records in dexseq_count.

    From what I can tell, dexseq_count includes the exonic_part_number in the records of the counts file whereas htseq-count does not.

    How can I modify htseq-count to produce the same file as dexseq_count?
    OR
    How can I modify DEXSeq to accept the counts file produced by htseq-count?
    OR
    How can I use the counts file from htseq-count combined with other data to properly create the ExonCountSet object in R?

    or, humbly, what do I not know that I should know to make it the last step?

    NOTE: I had no luck directly calling dexseq_count with sams produced by TopHat nor GSNAP. I had to do the format changes in step 4 to make the process work for htseq-count. I have not yet succeeded in running dexseq_count.
    Last edited by FuzzyCoder; 10-10-2011, 09:03 PM. Reason: Cleaned code snippets
    Best Regards,

    Paul Bergmann

  • #2
    It should be much easier to get dexseq_count.py to work than to try to use htseq_count.py. The reason that I have written two scripts is, after all, that the tasks are not exactly the same.

    dexseq_count.py should not have any problems with a GFF file produced with dexseq_prepare.py.

    I used htseq-count instead of dexseq-count because GFF contained attribute "exonic_part" not "exon" and dexseq_count was expecting "exon" whereas htseq-count alloowed me to specify the attribute name.
    dexseq_prepare.py has split and renamed all your "exon" lines to "exonic_part", and this is what dexseq_count.py expects.

    We have tested it with TopHat SAM files but not yet with GSNAP SAM files. With TopHat, it works fine.

    Maybe you can send me by email an excerpt from your TopHat and GSNAP SAM files and then, we can investigate.

    Comment


    • #3
      Thank you Simon.

      After reading your reply, I decided to rerun dexseq_count.py and it indeed worked . It seems that I did not actually follow all the pre-processing steps necessary to make the SAM compatible when I last tried dexseq_count .

      I have been able to generate count files and create an ExonCountSet object per the 10/2/2011 pasilla vignette (using my data) through estimateDispersions.

      However, I now get an error when attempting to run fitDispersionFunction:

      Code:
      Error in glmgam.fit(mm, disps[good], start = coefs) : 
        More columns than rows in X
      In addition: Warning message:
      In is.na(rows) : is.na() applied to non-(list or vector) of type 'NULL'
      Error in fitDispersionFunction(ecs) : 
        Failed to fit the dispersion function
      Any thoughts?

      I will attempt to email the RData containing the ecs. However, it is 7MB, so please let me know if that does not make it to you. I can give you access to it via DropBox or FTP at your preference.
      Best Regards,

      Paul Bergmann

      Comment


      • #4
        Also... you don't have the latest version of DEXSeq (0.99.0). You could also try updating it.

        Comment


        • #5
          Alejandro-

          I updated to 0.1.29 with biocLite. Same error.

          I emailed the data file to you and Simon (~7MB). I also sent you an invitation to my DropBox folder.

          Thanks for your assistance.
          Best Regards,

          Paul Bergmann

          Comment


          • #6
            I found the reason why the function is breaking. DEXSeq follows the motivation of DESeq package to use biological replicates to estimate the variance between samples to distinguish real effects from your treatments from just technical and biological variation, in this case you don't have biological replicates and the individual exon dispersion estimations give values that are basically 0. (Try doing the Figure 1 of the vignette) Then, at the time of estimating the dispersion function it just breaks, because the data behaves differently of what we are expecting.

            Comment


            • #7
              Thank you!

              I will try it with additional replicates. I only used one per condition because I wanted to work my way through the GSNAP -> DEXSeq workflow successfully before I ran all the replicates through GSNAP (~12 hours per replicate). I will let you know how it goes tomorrow.
              Best Regards,

              Paul Bergmann

              Comment


              • #8
                Dear all,
                I try to use DExseq but got the following error when I call the function

                ecs<- fitDispersionFunction(ecs)

                Warning message:
                In glmgam.fit(mm, disps[good], coef.start = coefs) :
                Too much damping - convergence tolerance not achievable


                Here is the version in R I use
                R version 2.15.1 (2012-06-22)
                Platform: x86_64-apple-darwin9.8.0/x86_64 (64-bit)
                locale:
                [1] C/UTF-8/C/C/C/C
                attached base packages:
                [1] stats graphics grDevices utils datasets methods base
                other attached packages:
                [1] DEXSeq_1.4.0 Biobase_2.18.0 BiocGenerics_0.4.0
                loaded via a namespace (and not attached):
                [1] RCurl_1.95-1.1 XML_3.95-0.1 biomaRt_2.14.0 hwriter_1.3 plyr_1.7.1
                [6] statmod_1.4.16 stringr_0.6.1

                Any suggestion on what is going on?

                Cheers
                Olivier

                Comment


                • #9
                  Hi oliviera,

                  This warning sometimes happens when your data is sparsed. Have you done the "fit diagnostics" as indicated in the vignette? As it is just a warning, you can go on with your analysis, maybe you will loose some power if the fit is "above" most of the points...

                  Alejandro

                  Comment


                  • #10
                    Hi Alejandro,

                    Here is the graph to check dispersion estimate. What do you think?

                    meanvalues<- rowMeans(counts(ecs))
                    plot(meanvalues, fData(ecs)$dispBeforeSharing, log="xy", main="mean vs CR dispersion")
                    x<- 0.01:max(meanvalues)
                    y<- ecs@dispFitCoefs[1] + ecs@dispFitCoefs[2] / x
                    lines(x, y, col="red")
                    Attached Files

                    Comment


                    • #11
                      I think it should be OK, how many hits do you get?

                      Comment


                      • #12
                        Only 92 with adjp < 0.01.
                        With DEseq I get ~2000 genes at this cut off. I think I made something wrong

                        Comment


                        • #13
                          I got the same problem. It's all zeros in the second column of the output of dexseq_count.py
                          I saw you said you realized later you didn't follow the manual exactly. Could you tell me what's the problem? Thanks.

                          Comment


                          • #14
                            Originally posted by oliviera View Post
                            Only 92 with adjp < 0.01.
                            With DEseq I get ~2000 genes at this cut off. I think I made something wrong
                            Why would you use such as stringent cut-off?

                            Do you really need to make sure that your list of differentially used exons do not contain more than 1% false positives? In most use cases, 10% are considered acceptable.

                            And: Of course, you get much more genes than exons. Detecting differential expression is needs to see much less information in the data than detecting differential exon usage.

                            Comment


                            • #15
                              Originally posted by metheuse View Post
                              I got the same problem. It's all zeros in the second column of the output of dexseq_count.py
                              I saw you said you realized later you didn't follow the manual exactly. Could you tell me what's the problem? Thanks.
                              Hav you checked your alignments with a genome browser? Load the SAM file and the GFF file produced by dexseq_prepare in, e.g., IGV, and look at one of the loci with zero counts. If there really are no reads, you experiment has failed (or you are using a wrong annotation file).

                              Comment

                              Latest Articles

                              Collapse

                              • seqadmin
                                Genetic Variation in Immunogenetics and Antibody Diversity
                                by seqadmin



                                The field of immunogenetics explores how genetic variations influence immune responses and susceptibility to disease. In a recent SEQanswers webinar, Oscar Rodriguez, Ph.D., Postdoctoral Researcher at the University of Louisville, and Ruben Martínez Barricarte, Ph.D., Assistant Professor of Medicine at Vanderbilt University, shared recent advancements in immunogenetics. This article discusses their research on genetic variation in antibody loci, antibody production processes,...
                                11-06-2024, 07:24 PM
                              • seqadmin
                                Choosing Between NGS and qPCR
                                by seqadmin



                                Next-generation sequencing (NGS) and quantitative polymerase chain reaction (qPCR) are essential techniques for investigating the genome, transcriptome, and epigenome. In many cases, choosing the appropriate technique is straightforward, but in others, it can be more challenging to determine the most effective option. A simple distinction is that smaller, more focused projects are typically better suited for qPCR, while larger, more complex datasets benefit from NGS. However,...
                                10-18-2024, 07:11 AM

                              ad_right_rmr

                              Collapse

                              News

                              Collapse

                              Topics Statistics Last Post
                              Started by seqadmin, Today, 11:09 AM
                              0 responses
                              24 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, Today, 06:13 AM
                              0 responses
                              20 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, 11-01-2024, 06:09 AM
                              0 responses
                              30 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, 10-30-2024, 05:31 AM
                              0 responses
                              21 views
                              0 likes
                              Last Post seqadmin  
                              Working...
                              X