After covering QC and alignment tools in the first segment and variant analysis and genome assembly in the second, we’re wrapping up with a discussion of tools for differential gene expression analysis and data visualization. In this article, we include recommendations from the following experts: Dr. Mark Ziemann, Senior Lecturer in Biotechnology and Bioinformatics, Deakin University; Dr. Medhat Mahmoud, Postdoctoral Research Fellow at Baylor College of Medicine; and Dr. Ming "Tommy" Tang, Director of Computational Biology at Immunitas and author of From Cell Line to Command Line.
Differential gene expression analysis tools
Differential gene expression is the variation in gene activity levels between different conditions or cell types. A thorough understanding of this process is important as it helps identify genes that are upregulated or downregulated in response to specific stimuli or in different disease states, providing researchers with insights into the underlying molecular mechanisms, cellular processes, and potential therapeutic targets associated with those conditions.
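To make "upregulated" and "downregulated" concrete, the short Python sketch below computes a log2 fold change between two conditions. It is a toy illustration only: the gene names, the normalized counts, and the pseudocount of 1 are all hypothetical, and dedicated tools such as DESeq2 estimate fold changes with a statistical model rather than a simple ratio of means.

```python
import math

def log2_fold_change(treated, control, pseudocount=1.0):
    """Return log2 of the ratio of mean treated to mean control expression.

    The pseudocount (an assumption for this sketch) avoids taking log of zero.
    """
    mean_treated = sum(treated) / len(treated)
    mean_control = sum(control) / len(control)
    return math.log2((mean_treated + pseudocount) / (mean_control + pseudocount))

# Hypothetical normalized counts: two genes, three replicates per condition.
normalized_counts = {
    "GENE_A": {"control": [99, 101, 100], "treated": [400, 410, 399]},
    "GENE_B": {"control": [63, 62, 64], "treated": [15, 14, 16]},
}

for gene, groups in normalized_counts.items():
    lfc = log2_fold_change(groups["treated"], groups["control"])
    if lfc > 0:
        direction = "upregulated"
    elif lfc < 0:
        direction = "downregulated"
    else:
        direction = "unchanged"
    print(f"{gene}: log2FC = {lfc:+.2f} ({direction})")
```

A positive value means higher expression in the treated group and a negative value means lower expression. Whether a difference is statistically meaningful is a separate question that the dedicated tools discussed in this section address.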
When asked about his procedure and preferred tools for differential expression analysis, Ziemann explains, “I use a PCA plot to visualize the sample variation and I omit samples from the downstream analysis if they appear like outliers and are supported by the QC. I load the Kallisto counts (detailed in the first article of the series) into R and collapse these to the gene level as I'm not that interested in alternative splicing. I then use DESeq2 for differential expression, as it is the most accurate according to my unpublished simulation work. DESeq2 also allows for complex experimental designs, which allow us to correct for potential confounders.”
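The step Ziemann describes as collapsing counts to the gene level amounts to summing the counts of all transcripts that belong to the same gene, sample by sample. He does this in R; the following is only a minimal Python sketch of the idea, and the transcript IDs, gene IDs, counts, and the `collapse_to_gene_level` helper are hypothetical.

```python
from collections import defaultdict

def collapse_to_gene_level(transcript_counts, tx2gene):
    """Sum transcript-level counts per gene, keeping one value per sample.

    transcript_counts: dict mapping transcript ID -> list of counts (one per sample)
    tx2gene: dict mapping transcript ID -> gene ID
    """
    if not transcript_counts:
        return {}
    n_samples = len(next(iter(transcript_counts.values())))
    gene_counts = defaultdict(lambda: [0.0] * n_samples)
    for transcript_id, counts in transcript_counts.items():
        gene_id = tx2gene.get(transcript_id)
        if gene_id is None:
            continue  # skip transcripts with no gene annotation
        totals = gene_counts[gene_id]
        for i, value in enumerate(counts):
            totals[i] += value
    return dict(gene_counts)

# Hypothetical estimated counts for three transcripts across three samples.
transcript_counts = {
    "tx1": [10.0, 12.5, 0.0],
    "tx2": [5.0, 7.5, 3.0],
    "tx3": [100.0, 90.0, 110.0],
}
tx2gene = {"tx1": "geneA", "tx2": "geneA", "tx3": "geneB"}

gene_counts = collapse_to_gene_level(transcript_counts, tx2gene)
for gene_id, counts in gene_counts.items():
    print(gene_id, counts)
```

Summing discards isoform-level detail, which matches the stated goal of ignoring alternative splicing. The resulting gene-by-sample table is the kind of input a differential expression tool expects.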
Ziemann also notes that, to interpret his data, he uses the Bioconductor package mitch for enrichment analysis. “Mitch is quite unique in that it accommodates multiple DESeq2 comparisons into an analysis, which gives a more integrated overview of the trends in a complex dataset with many contrasts.”
Tang supports the recommendation for DESeq2, describing it as the standard for differential gene expression analysis. Tens of thousands of journal articles cite DESeq2, which reflects how widely it has been adopted. Although not included in the experts' recommendations, common alternatives to this popular tool include edgeR, limma, NOISeq, and sleuth.
Data visualization tools
While each step of the analysis process is important, the final step, data visualization, is critical for an accurate understanding of the data. It allows researchers to interpret complex patterns and relationships, highlight significant results, and communicate their findings effectively.
Tang recommends ComplexHeatmap, ggplot2, and Bioconductor visualization packages. ComplexHeatmap, which is available on Bioconductor, is well suited to building heatmaps that reveal associations and patterns in data. ggplot2 is an R package offering versatile plot creation capabilities, while Bioconductor provides a wide range of visualization packages tailored to specific applications and analysis requirements.
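Expression heatmaps are commonly drawn from row-scaled values, so that each gene's color reflects how a sample compares with that gene's own average rather than its absolute expression level. The sketch below shows that scaling step in plain Python. It is an illustration under assumptions: the gene names and log-scale values are hypothetical, and the `row_zscores` helper is not part of ComplexHeatmap or any other package named here.

```python
import statistics

def row_zscores(values):
    """Scale one gene's expression values to mean 0 and standard deviation 1."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)  # sample standard deviation; needs >= 2 values
    if sd == 0:
        return [0.0] * len(values)  # a flat gene carries no pattern to display
    return [(v - mean) / sd for v in values]

# Hypothetical log-scale expression values: genes in rows, samples in columns.
expression = {
    "GENE_A": [2.0, 4.0, 6.0],
    "GENE_B": [10.0, 10.0, 10.0],
    "GENE_C": [12.0, 8.0, 4.0],
}

scaled = {gene: row_zscores(values) for gene, values in expression.items()}
for gene, values in scaled.items():
    print(gene, [round(v, 2) for v in values])
```

After scaling, a highly expressed gene and a lowly expressed gene with the same relative pattern across samples receive the same colors, which makes shared patterns easier to spot.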
“For [visualization of] differential expression analysis, I keep it fairly simple,” says Ziemann. “PCA plots to understand overall trends, base R for volcano or smear plots, heatmap.2 for heatmaps, and I like beeswarm charts to show gene expression differences between groups. For pathway enrichment, mitch provides a set of nice visualizations.” All of these visualization types can also be created in R or with existing Bioconductor packages.
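A volcano plot places each gene at its log2 fold change on the x-axis and the negative log10 of its adjusted p-value on the y-axis, usually with genes colored by whether they pass chosen cutoffs. The Python sketch below prepares those coordinates and labels without drawing anything. The cutoffs (an absolute log2 fold change of 1 and an adjusted p-value of 0.05), the gene names, and the result values are assumptions chosen for illustration, not values recommended by the experts.

```python
import math

def volcano_points(results, lfc_cutoff=1.0, padj_cutoff=0.05):
    """Turn (gene, log2FC, adjusted p-value) rows into volcano plot points.

    Returns a list of (gene, x, y, label), where x is the log2 fold change
    and y is -log10(adjusted p-value).
    """
    points = []
    for gene, log2fc, padj in results:
        # Guard against log10(0) for p-values reported as exactly zero.
        y = -math.log10(max(padj, 1e-300))
        if padj < padj_cutoff and log2fc >= lfc_cutoff:
            label = "up"
        elif padj < padj_cutoff and log2fc <= -lfc_cutoff:
            label = "down"
        else:
            label = "not significant"
        points.append((gene, log2fc, y, label))
    return points

# Hypothetical differential expression results.
results = [
    ("GENE_A", 2.0, 0.001),
    ("GENE_B", -2.0, 0.01),
    ("GENE_C", 0.3, 0.5),
    ("GENE_D", 3.0, 0.2),
]

for gene, x, y, label in volcano_points(results):
    print(f"{gene}: x = {x:+.1f}, y = {y:.2f}, {label}")
```

The resulting x and y values can be passed to any scatter plot function, with the label column used to set point colors.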
Mahmoud utilizes a combination of R and Python libraries for his visualization needs. He employs ggplot2 from R, which enables the creation of versatile plots. In Python, he utilizes Matplotlib for comprehensive figure generation, Seaborn for informative statistical graphics based on Matplotlib, and Plotly, an interactive, browser-based graphing library. He also uses the Integrative Genomics Viewer (IGV) browser for much of his work.
Mahmoud also recommends samplot for visualizing structural variants. Lastly, he suggests Circos, a Perl-based tool for circular layout visualizations that has been widely used to present scientific results, particularly in genomics.
Conclusion
There are many more influential tools and important sequencing analysis applications not mentioned in this article series. So, we’ll ask the community. What are some of your preferred tools for these processes? Make sure you are logged in so you can comment below!
Attached is a PDF containing additional details about some of the tools recommended above.