Useful tools for my PhD

A collection of tools I frequently use for my PhD studies in Molecular Virology. Updated 2025-10-14.

Author

Elgin Akin

Published

April 18, 2024

Literature - Keeping up and keeping track…

🧠 Personal knowledge databases

For literature and coursework:

RemNote: A markdown editor with many built-in features for spaced repetition studying 💭. Simple and effective. Nice app too.
Obsidian: Obsidian deserves its own post (coming soon…). It’s an extremely powerful markdown editor suited for integrating individual notes into a network for your * insert need here *. Obsidian is maintained by a core development team and extended by the community through plugins.
Omnivore: Think of this as a citation manager but for…everything. It is my favorite “web clipper” application and is flexible across many types of content. Also has integration with ‘database managers’ like Obsidian and Notion (discussed below).

✏️ Citation/Literature managers

Each of these integrates with common web browsers.

Zotero: free and open source
- My favorite add-ons: zotero action tags, Better BibTeX, ZotFile, Zotero PDF Preview
Mendeley: free, not open source
PaperPile: paid after 30 days

📚 Literature Review Resources

Research Rabbit: An AI-assisted application which builds networks around public manuscripts (accessible by DOI) by various attributes: author, citations, etc.
- Integrates with Zotero.
Semantic Scholar: A powerful search which assists with identifying ‘related’ articles.

🥼 Electronic Laboratory Notebooks

Notebooks I recommend:

Notebook	Paid/Free	Pros ✅	Cons ❌	Community Support	Templates
Obsidian	Free, not open source	Extremely flexible, superior note backlinking capabilities, `.md` (plain text) file type	Steep learning curve	Yes	Elgin’s Obsidian Lab Notebook Template, ELN Blog Post
Benchling	Free + Paid Plans	Developed for laboratory scientists	For now, all data is stored entirely online	Yes
Notion	Free to Students	Flexible, huge community, database ready (great for literature reviews)	Requires a stable internet connection, non-local	Yes	Notion Templates, Case Study
Microsoft Office OneNote	Free with Microsoft Subscription	Simple UI, integrates all Office products, cloud backup, decent search.	Clunky UI. Limited to 1 template per folder. No automation options.	Yes
AnyType	Free	Encrypted, similar to Notion UX/UI and is “offline” (with included web sync service). Integrated back-linking similar to Obsidian	Newer software, “Blackboxy” file types	Some

💭 Experimental Data Organization Philosophy

In the age of -omics, advanced microscopy, and other large data platforms, PhDs are producing more data than ever in the history of science. As a trainee, it can be difficult to navigate both the generation and stewardship of these data. I have prepared a slide deck which outlines my approach to handling a diverse portfolio of data (mixed genomics, microscopy and traditional wet lab experiments: ELISA, Western Blots, RT-qPCR, etc.)

Go to Slide Deck

🖥️ Scripting and Reproducibility

VS Code: A lightweight and accessible text editor.
RStudio IDE
Positron IDE: For joint R + Python development. Based on VS Code.
Quarto: Also developed by Posit. Similar to R Markdown and supports multiple languages.
Jupyter: A locally deployed web application for scripting. Multilanguage support but works especially well for Python and Julia users. ‘Chunk’-based approach.
GitHub Copilot: Nice AI assistant that can be integrated into VS Code. Free Education license available.
Warp (macOS only as of Feb 2024): A fancy shmancy command line terminal with integrated features such as autocomplete, pipelines, its own AI, etc.

🧬 General sequence data type utilities

Seqkit: Literal gold. Used on a daily basis. An extremely powerful FASTA/Q toolkit for file manipulation, searching, motif scanning, extraction, statistics, etc. Written in Go for an extremely lightweight experience.
BigSeqKit: A toolkit designed for processing and analyzing large-scale biological sequence data. It offers efficient algorithms and data structures tailored for handling big sequence datasets, enabling researchers to perform tasks such as sequence alignment, similarity search, and sequence manipulation at scale.
ClipKIT: A multiple sequence alignment trimming tool for phylogenomics. Rather than simply removing highly divergent sites, it focuses on retaining the phylogenetically informative sites in an alignment before tree inference.
BioKit:
vsearch: Essentially an open-source implementation of USEARCH offering high-performance sequence analysis tools for tasks such as sequence clustering, chimera detection, and sequence similarity search.
CD-Hit: A classic. It is a widely used software package for clustering and comparing protein or nucleotide sequences. It employs a fast and memory-efficient algorithm to identify clusters of similar sequences, enabling users to reduce redundancy in sequence datasets and identify representative sequences.
Awk: A powerful text processing tool used for pattern scanning and processing in text files. It provides a simple yet expressive scripting language for manipulating textual data, making it a popular choice for tasks such as data extraction, transformation, and reporting.
Sed: Short for Stream Editor. It is a command-line utility for parsing and transforming text streams. It operates by applying user-defined editing commands to input text, making it useful for tasks such as text substitution, filtering, and text manipulation in Unix-like environments.
MultiQC: Aggregates FastQC reports (and the output of many other tools) into a single, nice interactive HTML document.
MMseqs2: Fast and sensitive protein (and nucleotide) sequence searching and clustering. It uses k-mer based prefiltering followed by alignment, with optional sequence profiles, to identify homologs and group them.
miller: A flexible JSON (and other format) utility tool for cleaning and transforming structures.
Sublime Text Editor
BLAST and BLAST+: No introduction needed
- A nice introduction to using BLAST+ CLI.
NCBI C++ Toolkit: A public-domain collection of portable libraries for sequencing-class data (and others).
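To give a feel for the kind of per-file summary that FASTA toolkits like Seqkit report, here is a minimal Python sketch that parses FASTA text and computes a few length statistics. It is not Seqkit's implementation; the function names (`read_fasta`, `fasta_stats`) and the output keys are hypothetical and chosen only for illustration.

```python
from io import StringIO
from typing import Dict, Iterator, TextIO, Tuple


def read_fasta(handle: TextIO) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from FASTA-formatted text."""
    header = None
    chunks = []
    for line in handle:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new record starts: emit the previous one, if any.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            # Sequence lines may be wrapped, so collect and join them.
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def fasta_stats(handle: TextIO) -> Dict[str, float]:
    """Summarize sequence count and length distribution (hypothetical keys)."""
    lengths = [len(seq) for _, seq in read_fasta(handle)]
    if not lengths:
        return {"num_seqs": 0, "sum_len": 0, "min_len": 0, "max_len": 0, "avg_len": 0.0}
    return {
        "num_seqs": len(lengths),
        "sum_len": sum(lengths),
        "min_len": min(lengths),
        "max_len": max(lengths),
        "avg_len": sum(lengths) / len(lengths),
    }


if __name__ == "__main__":
    # Toy records, not real data.
    demo = StringIO(">seq1 toy record\nACGTACGT\nACGT\n>seq2\nGGCC\n")
    print(fasta_stats(demo))
```

For real files, a dedicated tool will be much faster and handles FASTQ and compressed input; the sketch only shows the logic behind the numbers.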

🧬 Sequence Motif and Similarity Searches

dbPTM: Integrated resource for post-translational modifications. Searchable by keyword or by .fasta sequence.
EMBL-EBI has an incredible wealth of freely available tools for molecular data analysis. Some I use often include:
- HMMER
- Kalign: sequence alignment
MAFFT: MSA tool. CLI deployable as well.

🧬 Protein Modeling and structure visualization

PyMol and ChimeraX: Powerful tools for visualizing, annotating and analyzing protein structures.
ProteinImager: A web-based protein visualization tool. Direct PDB connectivity.
Molecular Nodes: This is my personal go-to module for visualizing crystal structures (although it is so much more than that).
AlphaFold2
- Tutorials are abundant but here is a good protocol for easing into ColabFold https://protocolexchange.researchsquare.com/article/pex-2490/v1
Evo: A long-context biological foundation model for sequence prediction and generation at the DNA level.

🧬 Learning a Pipeline (and useful tools within them)

scRNAseq:

single-cell-tutorial: Scripts to complement the book: Current best-practices in single-cell RNA-seq: a tutorial
An open source lecture for Single Cell Datascience from JHU’s Dr. Stephanie Hicks.

RNAseq (bulk and sc):

Caltech BI/BE/CSS 183 - Course: Introduction to Computational Biology and Bioinformatics Course at Caltech, 2023
Introduction to GSEA
iDEP: Guided RNAseq analysis built and deployed in R Shiny.
TOmicsVIS: An end-to-end transcriptomics analysis resource. Implemented as an R Shiny app.
PCAtools: extended PCA toolkit for illuminating eigenvalue meaning.
ggkegg: A nice extension for visualizing KEGG pathways.
EnhancedVolcano: A tidy volcano plot package so that you do not have to make yours from scratch :)
ggvolcano: Another one
ggvolc: Another one.
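For anyone curious what the volcano plot packages above are doing under the hood, the core logic is small: plot log2 fold change against -log10 of the adjusted p-value and label points that pass both cutoffs. The Python sketch below shows only that classification step. The cutoffs (|log2FC| ≥ 1, adjusted p < 0.05) are common conventions rather than defaults taken from any of these packages, and the names (`VolcanoPoint`, `classify`, the gene labels) are hypothetical.

```python
import math
from typing import NamedTuple


class VolcanoPoint(NamedTuple):
    gene: str
    log2_fc: float       # x-axis of the volcano plot
    neg_log10_p: float   # y-axis of the volcano plot
    label: str           # "up", "down", or "ns" (not significant)


def classify(gene: str, log2_fc: float, p_adj: float,
             fc_cutoff: float = 1.0, p_cutoff: float = 0.05) -> VolcanoPoint:
    """Turn one differential-expression result into a labelled plot point."""
    # Guard against p-values reported as exactly zero.
    neg_log10_p = -math.log10(p_adj) if p_adj > 0 else math.inf
    if p_adj < p_cutoff and log2_fc >= fc_cutoff:
        label = "up"
    elif p_adj < p_cutoff and log2_fc <= -fc_cutoff:
        label = "down"
    else:
        label = "ns"
    return VolcanoPoint(gene, log2_fc, neg_log10_p, label)


if __name__ == "__main__":
    # Made-up values for illustration only.
    results = [("GENE_A", 2.3, 0.0004), ("GENE_B", -1.7, 0.01), ("GENE_C", 0.2, 0.6)]
    for gene, fc, p in results:
        print(classify(gene, fc, p))
```

The plotting packages add the parts that are tedious to write by hand: label repulsion, theming, and sensible axis scaling.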

Proteomics:

TidyProteomics

Genomics

Genomics really deserves its own post but a nice integration of all the tools I use can be summarized through GATK

IGV-Reports
ggcoverage: R package for flexible BAM file visualization.
Sandbox.bio: Learning Genomics in your browser
Dada2
SeekDeep
gget

Integrated and Pan-Dimensional Omics

SIMON: A machine-learning based approach to integrating diverse biological data with transcriptomics and microbiome datasets.
- Read the manuscript
- Read the FluPRINT study

High-dimensional flow cytometry

Omiq: (resource) Cloud-based flow cytometry software, a nice FlowJo alternative geared particularly toward high-dimensional flow panels, with seamless integration of common dimensionality-reduction approaches and EmbedSOM.

🧬 Phylogenetics

Augur: A phylogenetics toolkit with a wealth of integrated tools for filtering and aligning sequences, constructing phylogenetic trees, as well as options for deploying interactive dashboards for phylogenetic inference.
nextclade: Part of the greater Nextstrain project. Relatively niche to viral genome/gene evolution. However, nextclade v3 makes the list due to its powerful and fast annotation-guided alignment functionality described here.
ETE Toolkit: A Python library for phylogenetic and evolutionary hypothesis testing and visualization.
BEAST2: BEAST2 is a software package for Bayesian evolutionary analysis by sampling trees. It is widely used for estimating phylogenies, divergence times, and other evolutionary parameters from molecular sequence data.
clockor2: clockor2 is a software tool for estimating evolutionary rates and divergence times from molecular sequence data. It is particularly useful for studying the molecular clock hypothesis and inferring evolutionary timelines.
PhyKit: a software package designed for phylogenetic analysis and visualization. It provides a suite of tools for processing and analyzing phylogenetic data, including tree manipulation, visualization, and statistical analysis.
IQ-TREE: IQ-TREE is a software tool for inferring phylogenetic trees from molecular sequence data. It uses maximum likelihood methods to estimate tree topologies and model parameters. Especially useful for determining appropriate models for phylogenetic tree construction.
treetime: A tool for inferring time-resolved phylogenies from molecular sequence data. It incorporates temporal information into phylogenetic analyses to estimate evolutionary timelines and rates more accurately.
IcyTree: a web-based tool for visualizing and annotating phylogenetic trees. It offers interactive visualization features and supports the integration of additional data layers for enhanced analysis.
treeio: Implemented as an R package, treeio is a software package for reading, writing, and manipulating phylogenetic tree data in various formats. It provides a set of tools for importing and exporting tree data, as well as for performing basic tree manipulations and analyses.
ape: An R package for analyzing and manipulating phylogenetic trees. It provides a wide range of functions for reading, writing, and manipulating tree data, as well as for conducting comparative phylogenetic analyses.
HyPhy: A software package for conducting phylogenetic analysis and hypothesis testing. It offers a suite of tools for estimating evolutionary parameters, testing molecular evolution models, and detecting positive selection from sequence data.
treesort: Package for inferring reassortment rates between 2 phylogenetic trees.
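Several of the tools above estimate evolutionary rates from dated sequences. A common first check of temporal signal is a root-to-tip regression: regress each tip's distance from the root on its sampling date, read the slope as the substitution rate, and read the x-intercept as a rough estimate of the root date. The Python sketch below is a plain least-squares version of that idea under a strict clock assumption. It is not the algorithm used by any specific tool listed here, and the names (`ClockFit`, `root_to_tip_regression`) and the numbers are hypothetical.

```python
import math
from typing import NamedTuple, Sequence


class ClockFit(NamedTuple):
    rate: float       # slope: substitutions per site per year
    tmrca: float      # x-intercept: estimated date of the root
    r_squared: float  # how clock-like the data look


def root_to_tip_regression(dates: Sequence[float],
                           distances: Sequence[float]) -> ClockFit:
    """Ordinary least-squares fit of root-to-tip distance on sampling date."""
    if len(dates) != len(distances) or len(dates) < 2:
        raise ValueError("need at least two (date, distance) pairs")
    n = len(dates)
    mean_x = sum(dates) / n
    mean_y = sum(distances) / n
    sxx = sum((x - mean_x) ** 2 for x in dates)
    if sxx == 0:
        raise ValueError("all tips share one sampling date; no temporal signal")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(dates, distances))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # The root date is only meaningful when the rate is positive.
    tmrca = -intercept / slope if slope > 0 else math.nan
    ss_tot = sum((y - mean_y) ** 2 for y in distances)
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(dates, distances))
    r_squared = 1.0 - ss_res / ss_tot if ss_tot > 0 else math.nan
    return ClockFit(slope, tmrca, r_squared)


if __name__ == "__main__":
    # Made-up decimal dates and distances for illustration only.
    fit = root_to_tip_regression([2000.0, 2005.0, 2010.0, 2015.0],
                                 [0.010, 0.015, 0.020, 0.025])
    print(fit)
```

A low R² here is usually a sign to look for mislabeled dates or outlier sequences before moving on to a full Bayesian or maximum likelihood dating analysis.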

☁️ FREE Online and cloud-based computational resources

UseGalaxy: an online platform for conducting bioinformatics analyses. It provides a user-friendly interface for accessing a wide range of bioinformatics tools and workflows, including those for sequence analysis, genome assembly, and phylogenetics. Users can upload their data, select analysis tools, and visualize results within the platform.
BV-BRC: A bioinformatics resource center focused on providing tools and data for research in biodefense and public health. It offers a variety of resources, including databases of genomic sequences, analysis tools for studying pathogens, and educational materials for researchers in the field.
MAFFT: a software tool for multiple sequence alignment of biological sequences. It is widely used for aligning DNA, RNA, and protein sequences, and it offers several algorithms optimized for different types of sequences and alignment tasks. MAFFT can handle large datasets efficiently and is available both as a standalone program and through a web server interface.

💻 R and Python Libraries for statistical interrogation and visualization

For statistics and visualization, I am biased toward solutions in R. The following are some of my favorite visualization libraries in R and Python. This list is not even close to exhaustive but these cover my basics.

Eric Fletcher has an excellent R learning resource list, his awesome-r-learning-resources page, which helped me ease my way into hypothesis testing and visualization in R.
Yan Holtz’s data2viz guidelines for appropriate visualization of data. An excellent resource which explains the pros and caveats for numeric, categoric, Num+Cat, mapped, network and time series data.
Yan Holtz’s galleries for R, Python, D3, and React.js are an excellent resource for gathering inspiration for visualizing complex (or even simple) data.

Data and color

It is especially critical to consider color when generating figures for your audience. Here are a couple of great tools I have referenced which can assess common and custom palettes across the major color vision deficiencies, including Deuteranomaly, Protanomaly, Protanopia, and Deuteranopia:

Viz Palette by Elijah Meeks and Susie Lu.
The Color Palette Finder by Yan Holtz: Comes in Python and R-ready panels:
- Python Color Palette Finder
- R Color Palette Finder

R

🖼️ ggplot2 extension gallery
🖼️ The R graph Gallery - A reproducible collection of charts using R. This gallery is an incredible resource for not only expanding the flexibility of your plots, but also for choosing the most appropriate visualization for your data.
ggstatsplot
ggpubr
smplot2
ggforce
ggdist
tidytables
ggupset
ggvenndiagram - if you must…
ggmsa: MSA viewer appropriate for visualization of short aa or nt alignments. Preloaded with different tracks showing consensus, stacked composition, motif highlighting, etc.
seqvisR

Fun packages for ✨aesthetics✨ and publication-quality figures

themepark - a pop culture ggplot theme library
ggprism - iykyk
patchwork
cowplot
gridExtra

Python

I primarily use Python to build tools that do not exist in R or as a standalone release. Outside of pandas, numpy, matplotlib, and standard system modules, I use very few extras.

Honorable mentions

The awesome-datascience
An R Shiny app for calculating statistical power and sample size: https://pwrss.shinyapps.io/index/
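As a rough idea of what a sample size calculation involves, the Python sketch below uses the textbook normal approximation for a two-sided, two-sample comparison of means with equal group sizes: n per group ≈ 2 × ((z₁₋α/₂ + z_power) / d)², where d is Cohen's d. This is not the method used by the app linked above, and exact t-based calculations typically return a slightly larger n. The function name `n_per_group` is hypothetical.

```python
import math
from statistics import NormalDist


def n_per_group(effect_size: float, alpha: float = 0.05, power: float = 0.80) -> int:
    """Approximate per-group sample size for a two-sided, two-sample test.

    effect_size is Cohen's d (difference in means divided by the pooled SD).
    Uses the normal approximation, so it slightly underestimates the n
    that an exact t-test calculation would give.
    """
    if effect_size <= 0:
        raise ValueError("effect_size must be positive")
    if not (0 < alpha < 1 and 0 < power < 1):
        raise ValueError("alpha and power must lie strictly between 0 and 1")
    z = NormalDist()  # standard normal distribution
    z_alpha = z.inv_cdf(1 - alpha / 2)  # critical value for the two-sided test
    z_power = z.inv_cdf(power)          # quantile matching the desired power
    n = 2 * ((z_alpha + z_power) / effect_size) ** 2
    return math.ceil(n)  # round up to whole subjects per group


if __name__ == "__main__":
    for d in (0.2, 0.5, 0.8):  # conventional small, medium, large effects
        print(f"d = {d}: about {n_per_group(d)} per group")
```

The useful intuition is the inverse-square relationship: halving the effect size you want to detect roughly quadruples the number of samples you need.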

Common Databases I use that have been lifesavers for large-volume data projects:

DuckDB
- R API👀
SQLite: Lightweight RDBMS! Great for smaller sequencing projects.
MongoDB: Well-supported NoSQL Database.
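As a small illustration of why a lightweight database like SQLite is handy for sequencing projects, the Python sketch below keeps sample metadata in a table and answers a summary question with one query, using only the standard library `sqlite3` module. The schema, table name, and rows are entirely hypothetical.

```python
import sqlite3
from typing import Dict


def build_demo_db() -> sqlite3.Connection:
    """Create an in-memory database with a hypothetical sample metadata table."""
    conn = sqlite3.connect(":memory:")  # swap for a file path to persist data
    conn.execute(
        """
        CREATE TABLE samples (
            sample_id       TEXT PRIMARY KEY,
            subtype         TEXT NOT NULL,
            collection_year INTEGER NOT NULL
        )
        """
    )
    # Made-up rows for illustration only.
    rows = [
        ("S001", "H1N1", 2019),
        ("S002", "H3N2", 2020),
        ("S003", "H1N1", 2022),
    ]
    # Parameterized inserts avoid quoting problems in sample names.
    conn.executemany("INSERT INTO samples VALUES (?, ?, ?)", rows)
    conn.commit()
    return conn


def count_by_subtype(conn: sqlite3.Connection) -> Dict[str, int]:
    """Return the number of samples per subtype."""
    cursor = conn.execute(
        "SELECT subtype, COUNT(*) FROM samples GROUP BY subtype ORDER BY subtype"
    )
    return dict(cursor.fetchall())


if __name__ == "__main__":
    connection = build_demo_db()
    print(count_by_subtype(connection))
    connection.close()
```

The same queries work unchanged on a database file shared alongside the project, which beats passing around several versions of a spreadsheet.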

High-level languages which mesh incredibly well with sequencing-based bioinformatics:
- Rust: 🫠
- Go: 🏃🏻‍♂️💨
- C++: Of course.

📝 Blogs, Repositories and Additional Resources

TidyTuesday
Learning basic statistical Testing in R
Dr. Stephanie Hicks’s Blog
Repository - LearnAnything.xyz - An interactive collection from the learn-anything.xyz github repository