I am trying to do differential expression analysis on 6 samples with no replicates.
The protocol I used, pieced together from several threads and http://www.nature.com/nprot/journal/....2012.016.html,
is as follows:
1) Alignment and the related upstream steps
2) Assembled transcripts for each sample:
cufflinks -p 8 -o <output directory> accepted_hits.bam
5) Ran Cuffmerge to create a single merged transcriptome annotation:
cuffmerge -g Homo_sapiens.GRCh37.75.gtf -s Homo_sapiens.GRCh37.75.dna.primary_assembly.fa -p 8 assemblies.txt
6) Ran cuffdiff with the following options:
--library-type fr-unstranded
--dispersion-method blind -L C1,C2,C3,C4,C5,C6
-b ./index/Homo_sapiens.GRCh37.75.dna.primary_assembly.fa
-u ./merged_asm/merged.gtf
7) Used cummeRbund for analysis
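To make step 6 easier to check, here is a minimal dry-run sketch that only prints the cuffdiff call. As far as I understand the Cufflinks manual, -u (--multi-read-correct) is a switch that takes no value, so merged.gtf is read as the first positional argument and must be followed by one BAM file per label in -L. The function name cuffdiff_cmd and the C1/ ... C6/ BAM paths are assumptions for illustration, not part of the original run.

```shell
#!/bin/sh
# Dry run: print the cuffdiff call instead of executing it, so the
# argument order can be reviewed first.
# cuffdiff_cmd is a hypothetical helper; the BAM paths below are placeholders.
cuffdiff_cmd() {
  gtf=$1; shift
  echo cuffdiff \
    --library-type fr-unstranded \
    --dispersion-method blind \
    -L C1,C2,C3,C4,C5,C6 \
    -b ./index/Homo_sapiens.GRCh37.75.dna.primary_assembly.fa \
    -u \
    "$gtf" "$@"
}

# One BAM per condition, in the same order as the labels given to -L.
cuffdiff_cmd ./merged_asm/merged.gtf \
  C1/accepted_hits.bam C2/accepted_hits.bam C3/accepted_hits.bam \
  C4/accepted_hits.bam C5/accepted_hits.bam C6/accepted_hits.bam
```

Removing the echo would run the command for real once the printed line looks right.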
However, I was unable to get Ensembl IDs and instead got XLOC IDs for the genes.
Following the thread http://seqanswers.com/forums/showthread.php?t=18357,
I tried the following two options:
1) featureNames(sigGenes)
tracking_id gene_short_name
1 XLOC_002638 HES4,RP11-54O7.17
2 XLOC_005270 <NA>
3 XLOC_005288 <NA>
4 XLOC_005368 <NA>
5 XLOC_007367 RP11-47A8.5
6 XLOC_007664 <NA>
7 XLOC_007703 <NA>
But the names for some of the XLOC IDs were missing.
2) Then I tried the other method and got an error:
> names<-featureNames(sigGenes)
> row.names(names)=names$tracking_id
> sigGenesNames <-as.matrix(names)
> sigGenesNames <- sigGenesNames [,-1]
> sigGenesData<-diffData(sigGenes)
> row.names(sigGenesData)= sigGenesData$gene_id
Error in `row.names<-.data.frame`(`*tmp*`, value = c("XLOC_002638", "XLOC_002638", :
duplicate 'row.names' are not allowed
In addition: Warning message:
non-unique values when setting 'row.names': ‘XLOC_002638’, ‘XLOC_005270’, [... truncated]
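A possible reason for the duplicates: with 6 conditions cuffdiff tests 6*5/2 = 15 pairwise comparisons, and diffData() appears to return one row per gene per comparison, so the same XLOC ID would occur several times. The sketch below is one way to check this on the cuffdiff output itself. It assumes a gene_exp.diff file in the cuffdiff output directory, tab-separated with a header line and gene_id in the second column; the function name rows_per_gene is hypothetical.

```shell
#!/bin/sh
# rows_per_gene: hypothetical helper that counts how many rows each gene_id
# has in a cuffdiff gene_exp.diff file (assumed: tab-separated, one header
# line, gene_id in column 2).
rows_per_gene() {
  awk -F '\t' '
    NR > 1 { n[$2]++ }                          # skip header, tally by gene_id
    END    { for (g in n) print g "\t" n[g] }   # gene_id <TAB> row count
  ' "$1" | sort
}
```

Calling it as `rows_per_gene gene_exp.diff | head` (path is an assumption) should show the counts. If they are above 1, a unique row key would need the comparison as well, for example gene_id combined with sample_1 and sample_2.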
Hence, I started looking for ways to get Ensembl IDs, but the threads led to a lot of confusion and queries.
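One route that may be worth checking is the merged.gtf itself. In the cuffmerge output I have seen, each line carries the XLOC ID as gene_id, and transcripts that match the reference also carry gene_name and nearest_ref attributes (the reference transcript ID, which could then be looked up in the Ensembl GTF to get the gene ID). The sketch below assumes those attribute names are present in your file, which is worth confirming with head first; the function name xloc_to_ref is hypothetical. Loci without a reference match are printed with NA, which would be consistent with the <NA> entries above.

```shell
#!/bin/sh
# xloc_to_ref: hypothetical helper that prints one line per distinct
# (gene_id, gene_name, nearest_ref) combination found in a GTF file.
# Assumes the attribute names used by cuffmerge; missing ones become NA.
xloc_to_ref() {
  awk -F '\t' '
    # Return the quoted value of attribute "key" from GTF column 9, or NA.
    function attr(s, key,    n, parts, i, kv) {
      n = split(s, parts, ";")
      for (i = 1; i <= n; i++) {
        kv = parts[i]
        sub(/^ +/, "", kv)                 # trim leading spaces
        if (index(kv, key " ") == 1) {     # attribute starts with the key
          sub(/^[^"]*"/, "", kv)           # drop everything up to first quote
          sub(/".*$/, "", kv)              # drop closing quote and the rest
          return kv
        }
      }
      return "NA"
    }
    $0 !~ /^#/ && NF >= 9 {
      line = attr($9, "gene_id") "\t" attr($9, "gene_name") "\t" attr($9, "nearest_ref")
      if (!(line in seen)) { seen[line] = 1; print line }   # de-duplicate
    }
  ' "$1"
}
```

For example, `xloc_to_ref ./merged_asm/merged.gtf > xloc_map.tsv` would write a three-column table (XLOC ID, gene name, nearest reference transcript) that can be joined to the cummeRbund tables on tracking_id.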
- Should I use the GTF file created by cuffmerge or by cuffcompare as input for cuffdiff?
- While creating the database with cummeRbund, should we supply the genome and the GTF file?
- If the answer to the second question is yes, should I use the GTF from Ensembl, cuffmerge, or cuffcompare?
- How do I get Ensembl IDs instead of XLOC IDs?
- Is there some error in my protocol?
Thanks in advance.