Hi All,
I have a question regarding differential expression analysis of RNA-seq data.
I have a number of different samples. After alignment with TopHat2, I either ran Cuffdiff or used summarizeOverlaps to generate a count matrix, and then performed the differential analysis on the counts with DESeq2, edgeR and NBPSeq. I have named the comparisons List A-J.
Using these 4 methods, I get different numbers of DE genes. See the table below:
List    DESeq2  NBPSeq  edgeR  Cuffdiff
ListA      388     491    497       585
ListB        2       5      3        16
ListC      386     487    494       576
ListD     4381    4412   4482      5405
ListE     3885    4388   4159      4753
ListF        2       5      3        16
ListG     1376    1254   1375      1455
ListH      148     188    190       213
ListI     3885    4400   4159      4754
ListJ     3626    2300   3566      5607
Overall, the trend is that Cuffdiff identifies more genes than the count-based methods.
I also looked at the overlap between the genes identified by each method (see the attached file).
Now my question is: which gene set do I use as the final list for each comparison?
Even though Cuffdiff reports more genes, a lot of these could be false positives. I am inclined to take the overlap between the 4 methods, but I have not come across a publication where this is done.
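To make concrete what I mean by "taking the overlap", here is a minimal Python sketch. The gene IDs are made-up placeholders, not my real results, and the `consensus` helper and its `min_methods` parameter are names I invented for illustration. It counts how many methods call each gene and keeps the genes that reach a chosen threshold, so the strict 4-way intersection is just the case where the threshold equals the number of methods.

```python
from collections import Counter


def consensus(calls, min_methods):
    """Return the genes called DE by at least `min_methods` methods.

    `calls` maps a method name to the collection of gene IDs it reported.
    """
    # Count each gene once per method, even if a list contains duplicates.
    votes = Counter(gene for genes in calls.values() for gene in set(genes))
    return {gene for gene, n in votes.items() if n >= min_methods}


# Placeholder gene IDs for a single comparison (hypothetical data).
calls = {
    "deseq2": {"geneA", "geneB", "geneC"},
    "nbpseq": {"geneA", "geneB", "geneD"},
    "edger": {"geneA", "geneB", "geneC", "geneD"},
    "cuffdiff": {"geneA", "geneC", "geneD", "geneE"},
}

# Strict overlap: a gene must be reported by every method.
strict = consensus(calls, min_methods=len(calls))

# Relaxed overlap: a gene must be reported by at least 3 of the 4 methods.
majority = consensus(calls, min_methods=3)

print(sorted(strict))
print(sorted(majority))
```

Whether a strict or a relaxed threshold is the more defensible choice is part of what I am asking.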
Are there any bioinformatic approaches I could use to further validate the gene lists I have obtained? I am aware of over-representation analysis using GO terms or KEGG pathways, but these also produce a large number of false positives, so IMHO they don't add any more confidence that my final list is the correct one.
Thanks in advance for taking the time to answer my question and provide me with some guidance.