I'm trying to run DEXseq but am having trouble with the newExonCountSet function because my gene_id list and exon_id list are different sizes. When I unlist the mm10 gene_id dataset from UCSC annotated with Ensembl Genes (ensGene) it yields 351,862 gene_ids even though the corresponding exon_id dataset contains only 350,630 IDs. I had a similar problem when trying to annotate the UCSC database with RefSeq Genes (tablename = 'refGene'). Any thoughts on why there is a difference and how I can get the Gene_ID and Exon_ID datasets to match up? Thanks!
Code:
library("GenomicRanges") library("Rsamtools") library("GenomicFeatures") library("rtracklayer") txdb <- makeTranscriptDbFromUCSC(genome="mm10", tablename = "ensGene") #Getting rid of strand information exonRangesList <- exons(txdb) strand(exonRangesList) <- '*' exonRangesNoStrand <- split(exonRangesList) ExonIDs <- exons(txdb, vals=NULL, columns=c('gene_id', 'exon_id')) GeneIDList <- elementMetadata(ExonIDs) GeneIDVector <- GeneIDList$gene_id ExonIDVector <- GeneIDList$exon_id GeneIDVector2 <-unlist(GeneIDVector)
Comment