I am attempting to run DEXSeq in R using counts files generated by the htseq-count python script rather than the dexseq-count script provided with the DEXSeq module for R 2.14.
Here is the R script:
Here is the error I receive:
Here is my question:
How do I properly format/use the htseq-count counts file to create the ExonCountSet object?
Here is what I know:
1) GFF file is a flattened GTF using dexseq_prepare_annotation.py
2) The GTF file is Mus_musculus.NCBIM37.64.gtf
3) The same GTF file was used to generate the SAM file using GSNAP
4) The SAM produced by GSNAP was formatted in the following ways to be compatible with DEXSeq/HTSeq:
Re: 4C): 'chr' was not present in chromosome field of GFF, just the name (i.e., '1', not 'chr1')
Re: 4E) sorted on name (field 1) and QUAL (field 5)
Re: 4F) applied the following script to choose the best scoring pair among multiple alignments (based on a post from Simon):
Re: 4G): Sorted on fields 1 and 2 in SAM
Re: 4H): used htseq-count instead of dexseq-count because GFF contained attribute "exonic_part" not "exon" and dexseq_count was expecting "exon" whereas htseq-count alloowed me to specify the attribute name. Here is the htseq-count command:
Re: 4H): dexseq_count results in no counts for any features. The call to htseq-count results in many matches to features and acceptable rejects.
Re 4H): The counts file from htseq is notably different from that dexseq_count:
From dexseq_count:
whereas from htseq-count:
Note 1 record in htseq-count for this feature but 136 records in dexseq_count.
From what I can tell, dexseq_count includes the exonic_part_number in the records of the counts file whereas htseq-count does not.
How can I modify htseq-count to produce the same file as dexseq_count?
OR
How can I modify DEXSeq to accept the counts file produced by htseq-count?
OR
How can I use the counts file from htseq-count combined with other data to properly create the ExonCountSet object in R?
or, humbly, what do I not know that I should know to make it the last step?
NOTE: I had no luck directly calling dexseq_count with sams produced by TopHat nor GSNAP. I had to do the format changes in step 4 to make the process work for htseq-count. I have not yet succeeded in running dexseq_count.
Here is the R script:
Code:
library(DEXSeq) annotationfile = file.path("/media/myLab/DEXSeq/Mus_musculus.NCBIM37.64.gff") samples = data.frame( condition = c("Condition1", "Condition2"), replicate = c(1:1,1:1), type = c("paired-end", "paired-end"), row.names = c("30D21712WKS", "B0541412WKS"), stringsAsFactors = TRUE, check.names = FALSE ) ecs = read.HTSeqCounts(countfiles = file.path("/media/myLab/DEXSeq/Counts", paste(paste(rownames(samples),"Counts", sep="_"), "txt", sep=".")), design = samples, flattenedfile = annotationfile )
Code:
Error in read.HTSeqCounts(countfiles = file.path("/media/myLab/DEXSeq/Counts", : Count files do not correspond to the flattened annotation file In addition: Warning message: In aggregates$gene_id == genesrle : longer object length is not a multiple of shorter object length
How do I properly format/use the htseq-count counts file to create the ExonCountSet object?
Here is what I know:
1) GFF file is a flattened GTF using dexseq_prepare_annotation.py
2) The GTF file is Mus_musculus.NCBIM37.64.gtf
3) The same GTF file was used to generate the SAM file using GSNAP
4) The SAM produced by GSNAP was formatted in the following ways to be compatible with DEXSeq/HTSeq:
A) Replaced "." with "N" in sequences in SAM
B) Replaced spaces with tabs in SAM
C) Removed 'chr' from chromosome name in SAM
D) Dropped alignments with '*' for QUAL string in SAM
E) Sorted alignments for selecting best pairs in SAM
F) Selected best alignment pairs to reduce ambiguous alignment count from SAM
G) Sorted selected alignments for counting in SAM
H) Created HTSeq Counts File
NOTES:B) Replaced spaces with tabs in SAM
C) Removed 'chr' from chromosome name in SAM
D) Dropped alignments with '*' for QUAL string in SAM
E) Sorted alignments for selecting best pairs in SAM
F) Selected best alignment pairs to reduce ambiguous alignment count from SAM
G) Sorted selected alignments for counting in SAM
H) Created HTSeq Counts File
Re: 4C): 'chr' was not present in chromosome field of GFF, just the name (i.e., '1', not 'chr1')
Re: 4E) sorted on name (field 1) and QUAL (field 5)
Re: 4F) applied the following script to choose the best scoring pair among multiple alignments (based on a post from Simon):
Code:
import sys, re, optparse import HTSeq def choose_best( samFile ): insam = HTSeq.SAM_Reader( samFile ) # Go through all reads, with their alignments bundled up: for bundle in HTSeq.bundle_multiple_alignments( insam ): fBestAlmt = None rBestAlmt = None # Go through all alignments of a given read, looking # for the one with the best alignment score for almt in bundle: if almt.pe_which == "first": if fBestAlmt is None: fBestAlmt = almt elif almt.aQual > fBestAlmt.aQual: fBestAlmt = almt elif almt.aQual == fBestAlmt: # If there are more than one best alignment, # better skip the read fBestAlmt = None elif almt.pe_which == "second": if rBestAlmt is None: rBestAlmt = almt elif almt.aQual > rBestAlmt.aQual: rBestAlmt = almt elif almt.aQual == rBestAlmt: # If there are more than one best alignment, # better skip the read rBestAlmt = None if fBestAlmt is not None and rBestAlmt is not None: if fBestAlmt.aQual > rBestAlmt.aQual: if fBestAlmt.mate_start is not None: for mAlmt in bundle: if mAlmt.iv is not None: if mAlmt.iv.start is not None: if fBestAlmt.mate_start.pos == mAlmt.iv.start: rBestAlmt = mAlmt elif rBestAlmt.aQual > fBestAlmt.aQual: if rBestAlmt.mate_start is not None: for mAlmt in bundle: if mAlmt.iv is not None: if mAlmt.iv.start is not None: if rBestAlmt.mate_start.pos == mAlmt.iv.start: fBestAlmt = mAlmt if fBestAlmt is not None: # Change the NH field to 1 and print the line print re.sub( "NH:i:\d+", "NH:i:1", fBestAlmt.original_sam_line ), if rBestAlmt is not None: # Change the NH field to 1 and print the line print re.sub( "NH:i:\d+", "NH:i:1", rBestAlmt.original_sam_line ), def main(): optParser = optparse.OptionParser() (opts, args) = optParser.parse_args() choose_best(args[0])
Re: 4H): used htseq-count instead of dexseq-count because GFF contained attribute "exonic_part" not "exon" and dexseq_count was expecting "exon" whereas htseq-count alloowed me to specify the attribute name. Here is the htseq-count command:
Code:
htseq-count -m intersection-nonempty -s no -t exonic_part -i gene_id -o mysam_XF.sam mysam_Final.sam Mus_musculus.NCBIM37.64.gff > mysam_Counts.txt
Re 4H): The counts file from htseq is notably different from that dexseq_count:
From dexseq_count:
Code:
ENSMUSG00000093219+ENSMUSG00000025261:001 0 ENSMUSG00000093219+ENSMUSG00000025261:002 0 ENSMUSG00000093219+ENSMUSG00000025261:003 0 ENSMUSG00000093219+ENSMUSG00000025261:004 0 ENSMUSG00000093219+ENSMUSG00000025261:005 0 ENSMUSG00000093219+ENSMUSG00000025261:006 0 ENSMUSG00000093219+ENSMUSG00000025261:007 0 ENSMUSG00000093219+ENSMUSG00000025261:008 0 ENSMUSG00000093219+ENSMUSG00000025261:009 0 ENSMUSG00000093219+ENSMUSG00000025261:010 0 . . .
Code:
ENSMUSG00000093219+ENSMUSG00000025261 4532
From what I can tell, dexseq_count includes the exonic_part_number in the records of the counts file whereas htseq-count does not.
How can I modify htseq-count to produce the same file as dexseq_count?
OR
How can I modify DEXSeq to accept the counts file produced by htseq-count?
OR
How can I use the counts file from htseq-count combined with other data to properly create the ExonCountSet object in R?
or, humbly, what do I not know that I should know to make it the last step?
NOTE: I had no luck directly calling dexseq_count with sams produced by TopHat nor GSNAP. I had to do the format changes in step 4 to make the process work for htseq-count. I have not yet succeeded in running dexseq_count.
Comment