Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • areyes
    replied
    I found the reason why the function is breaking. DEXSeq follows the motivation of DESeq package to use biological replicates to estimate the variance between samples to distinguish real effects from your treatments from just technical and biological variation, in this case you don't have biological replicates and the individual exon dispersion estimations give values that are basically 0. (Try doing the Figure 1 of the vignette) Then, at the time of estimating the dispersion function it just breaks, because the data behaves differently of what we are expecting.

    Leave a comment:


  • FuzzyCoder
    replied
    Alejandro-

    I updated to 0.1.29 with biocLite. Same error.

    I emailed the data file to you and Simon (~7MB). I also sent you an invitation to my DropBox folder.

    Thanks for your assistance.

    Leave a comment:


  • areyes
    replied
    Also... you don't have the latest version of DEXSeq (0.99.0). You could also try updating it.

    Leave a comment:


  • FuzzyCoder
    replied
    Thank you Simon.

    After reading your reply, I decided to rerun dexseq_count.py and it indeed worked . It seems that I did not actually follow all the pre-processing steps necessary to make the SAM compatible when I last tried dexseq_count .

    I have been able to generate count files and create an ExonCountSet object per the 10/2/2011 pasilla vignette (using my data) through estimateDispersions.

    However, I now get an error when attempting to run fitDispersionFunction:

    Code:
    Error in glmgam.fit(mm, disps[good], start = coefs) : 
      More columns than rows in X
    In addition: Warning message:
    In is.na(rows) : is.na() applied to non-(list or vector) of type 'NULL'
    Error in fitDispersionFunction(ecs) : 
      Failed to fit the dispersion function
    Any thoughts?

    I will attempt to email the RData containing the ecs. However, it is 7MB, so please let me know if that does not make it to you. I can give you access to it via DropBox or FTP at your preference.

    Leave a comment:


  • Simon Anders
    replied
    It should be much easier to get dexseq_count.py to work than to try to use htseq_count.py. The reason that I have written two scripts is, after all, that the tasks are not exactly the same.

    dexseq_count.py should not have any problems with a GFF file produced with dexseq_prepare.py.

    I used htseq-count instead of dexseq-count because GFF contained attribute "exonic_part" not "exon" and dexseq_count was expecting "exon" whereas htseq-count alloowed me to specify the attribute name.
    dexseq_prepare.py has split and renamed all your "exon" lines to "exonic_part", and this is what dexseq_count.py expects.

    We have tested it with TopHat SAM files but not yet with GSNAP SAM files. With TopHat, it works fine.

    Maybe you can send me by email an excerpt from your TopHat and GSNAP SAM files and then, we can investigate.

    Leave a comment:


  • FuzzyCoder
    started a topic DEXSeq Using Counts File From htseq-count

    DEXSeq Using Counts File From htseq-count

    I am attempting to run DEXSeq in R using counts files generated by the htseq-count python script rather than the dexseq-count script provided with the DEXSeq module for R 2.14.

    Here is the R script:

    Code:
    library(DEXSeq)
    annotationfile = file.path("/media/myLab/DEXSeq/Mus_musculus.NCBIM37.64.gff")
    samples = data.frame(
      condition = c("Condition1", "Condition2"),
      replicate = c(1:1,1:1),
      type = c("paired-end", "paired-end"),
      row.names = c("30D21712WKS", "B0541412WKS"),
      stringsAsFactors = TRUE,
      check.names = FALSE
    )
    
    ecs = read.HTSeqCounts(countfiles = file.path("/media/myLab/DEXSeq/Counts", paste(paste(rownames(samples),"Counts", sep="_"), "txt", sep=".")),
                           design = samples,
                           flattenedfile = annotationfile
    )
    Here is the error I receive:

    Code:
    Error in read.HTSeqCounts(countfiles = file.path("/media/myLab/DEXSeq/Counts",  : 
      Count files do not correspond to the flattened annotation file
    In addition: Warning message:
    In aggregates$gene_id == genesrle :
      longer object length is not a multiple of shorter object length
    Here is my question:

    How do I properly format/use the htseq-count counts file to create the ExonCountSet object?

    Here is what I know:

    1) GFF file is a flattened GTF using dexseq_prepare_annotation.py
    2) The GTF file is Mus_musculus.NCBIM37.64.gtf
    3) The same GTF file was used to generate the SAM file using GSNAP
    4) The SAM produced by GSNAP was formatted in the following ways to be compatible with DEXSeq/HTSeq:
    A) Replaced "." with "N" in sequences in SAM
    B) Replaced spaces with tabs in SAM
    C) Removed 'chr' from chromosome name in SAM
    D) Dropped alignments with '*' for QUAL string in SAM
    E) Sorted alignments for selecting best pairs in SAM
    F) Selected best alignment pairs to reduce ambiguous alignment count from SAM
    G) Sorted selected alignments for counting in SAM
    H) Created HTSeq Counts File
    NOTES:

    Re: 4C): 'chr' was not present in chromosome field of GFF, just the name (i.e., '1', not 'chr1')

    Re: 4E) sorted on name (field 1) and QUAL (field 5)

    Re: 4F) applied the following script to choose the best scoring pair among multiple alignments (based on a post from Simon):

    Code:
    import sys, re, optparse
    import HTSeq
    	
    def choose_best( samFile ):
    	insam = HTSeq.SAM_Reader( samFile )
    
    	# Go through all reads, with their alignments bundled up:
    	for bundle in HTSeq.bundle_multiple_alignments( insam ):
    	   fBestAlmt = None
    	   rBestAlmt = None
    	   # Go through all alignments of a given read, looking
    	   # for the one with the best alignment score
    	   for almt in bundle:
    		  if almt.pe_which == "first": 
    			 if fBestAlmt is None:
    				fBestAlmt = almt
    			 elif almt.aQual > fBestAlmt.aQual:
    				fBestAlmt = almt
    			 elif almt.aQual == fBestAlmt:
    				# If there are more than one best alignment, 
    				# better skip the read
    				fBestAlmt = None
    		  elif almt.pe_which == "second": 
    			 if rBestAlmt is None:
    				rBestAlmt = almt
    			 elif almt.aQual > rBestAlmt.aQual:
    				rBestAlmt = almt
    			 elif almt.aQual == rBestAlmt:
    				# If there are more than one best alignment, 
    				# better skip the read
    				rBestAlmt = None
    
    	   if fBestAlmt is not None and rBestAlmt is not None:
    		   if fBestAlmt.aQual > rBestAlmt.aQual:
    			  if fBestAlmt.mate_start is not None:
    				 for mAlmt in bundle:
    					if mAlmt.iv is not None:
    					   if mAlmt.iv.start is not None:
    						  if fBestAlmt.mate_start.pos == mAlmt.iv.start:
    							 rBestAlmt = mAlmt
    		   elif rBestAlmt.aQual > fBestAlmt.aQual:
    			  if rBestAlmt.mate_start is not None:
    				 for mAlmt in bundle:
    					if mAlmt.iv is not None:
    					   if mAlmt.iv.start is not None:
    						  if rBestAlmt.mate_start.pos == mAlmt.iv.start:
    							 fBestAlmt = mAlmt
    
    	   if fBestAlmt is not None:
    		  # Change the NH field to 1 and print the line
    		  print re.sub( "NH:i:\d+", "NH:i:1", fBestAlmt.original_sam_line ),
    		  
    	   if rBestAlmt is not None:
    		  # Change the NH field to 1 and print the line
    		  print re.sub( "NH:i:\d+", "NH:i:1", rBestAlmt.original_sam_line ),
    
    def main():
    	optParser = optparse.OptionParser()
    	(opts, args)  = optParser.parse_args()
    	
    	choose_best(args[0])
    Re: 4G): Sorted on fields 1 and 2 in SAM

    Re: 4H): used htseq-count instead of dexseq-count because GFF contained attribute "exonic_part" not "exon" and dexseq_count was expecting "exon" whereas htseq-count alloowed me to specify the attribute name. Here is the htseq-count command:


    Code:
    htseq-count -m intersection-nonempty -s no -t exonic_part -i gene_id -o mysam_XF.sam mysam_Final.sam Mus_musculus.NCBIM37.64.gff > mysam_Counts.txt
    Re: 4H): dexseq_count results in no counts for any features. The call to htseq-count results in many matches to features and acceptable rejects.

    Re 4H): The counts file from htseq is notably different from that dexseq_count:

    From dexseq_count:

    Code:
    ENSMUSG00000093219+ENSMUSG00000025261:001	0
    ENSMUSG00000093219+ENSMUSG00000025261:002	0
    ENSMUSG00000093219+ENSMUSG00000025261:003	0
    ENSMUSG00000093219+ENSMUSG00000025261:004	0
    ENSMUSG00000093219+ENSMUSG00000025261:005	0
    ENSMUSG00000093219+ENSMUSG00000025261:006	0
    ENSMUSG00000093219+ENSMUSG00000025261:007	0
    ENSMUSG00000093219+ENSMUSG00000025261:008	0
    ENSMUSG00000093219+ENSMUSG00000025261:009	0
    ENSMUSG00000093219+ENSMUSG00000025261:010	0
    .
    .
    .
    whereas from htseq-count:

    Code:
    ENSMUSG00000093219+ENSMUSG00000025261	4532
    Note 1 record in htseq-count for this feature but 136 records in dexseq_count.

    From what I can tell, dexseq_count includes the exonic_part_number in the records of the counts file whereas htseq-count does not.

    How can I modify htseq-count to produce the same file as dexseq_count?
    OR
    How can I modify DEXSeq to accept the counts file produced by htseq-count?
    OR
    How can I use the counts file from htseq-count combined with other data to properly create the ExonCountSet object in R?

    or, humbly, what do I not know that I should know to make it the last step?

    NOTE: I had no luck directly calling dexseq_count with sams produced by TopHat nor GSNAP. I had to do the format changes in step 4 to make the process work for htseq-count. I have not yet succeeded in running dexseq_count.
    Last edited by FuzzyCoder; 10-10-2011, 09:03 PM. Reason: Cleaned code snippets

Latest Articles

Collapse

  • seqadmin
    Exploring Human Diversity Through Large-Scale Omics
    by seqadmin


    In 2003, researchers from the Human Genome Project (HGP) announced the most comprehensive genome to date1. Although the genome wasn’t fully completed until nearly 20 years later2, numerous large-scale projects, such as the International HapMap Project and 1000 Genomes Project, continued the HGP's work, capturing extensive variation and genomic diversity within humans. Recently, newer initiatives have significantly increased in scale and expanded beyond genomics, offering a more detailed...
    Today, 06:43 AM
  • seqadmin
    Best Practices for Single-Cell Sequencing Analysis
    by seqadmin



    While isolating and preparing single cells for sequencing was historically the bottleneck, recent technological advancements have shifted the challenge to data analysis. This highlights the rapidly evolving nature of single-cell sequencing. The inherent complexity of single-cell analysis has intensified with the surge in data volume and the incorporation of diverse and more complex datasets. This article explores the challenges in analysis, examines common pitfalls, offers...
    06-06-2024, 07:15 AM

ad_right_rmr

Collapse

News

Collapse

Topics Statistics Last Post
Started by seqadmin, 06-21-2024, 07:49 AM
0 responses
224 views
0 likes
Last Post seqadmin  
Started by seqadmin, 06-20-2024, 07:23 AM
0 responses
230 views
0 likes
Last Post seqadmin  
Started by seqadmin, 06-17-2024, 06:54 AM
0 responses
231 views
0 likes
Last Post seqadmin  
Started by seqadmin, 06-14-2024, 07:24 AM
0 responses
28 views
0 likes
Last Post seqadmin  
Working...
X