Useful tools for my PhD
A collection of tools I frequently use for my PhD studies in Molecular Virology. Updated 2024-04-18.
Literature - Keeping up and keeping track…
🧠 Personal knowledge databases
For literature and coursework:
- RemNote: A markdown editor with many built-in features for spaced-repetition studying 💭. Simple and effective. Nice app too.
- Obsidian: Obsidian deserves its own post (coming soon…). It’s an extremely powerful markdown editor suited for integrating individual notes into a network for your *insert need here*. Obsidian is maintained by its core developers and extended by the community through plugins.
- Omnivore: Think of this as a citation manager but for…everything. It is my favorite “web clipper” application and is flexible across many types of content. Also has integration with ‘database managers’ like Obsidian and Notion (discussed below).
✏️ Citation/Literature managers
Each of these integrates with common web browsers.
- Zotero: free and open source
- My favorite add-ons: Zotero Action Tags, Better BibTeX, ZotFile, Zotero PDF Preview
- Mendeley: free, not open source
- Paperpile: paid after 30 days
📚 Literature Review Resources
- Research Rabbit: An AI-assisted application that builds networks around public manuscripts (accessible by DOI) using attributes such as author and citations.
- Integrates with Zotero.
- Semantic Scholar: A powerful search which assists with identifying ‘related’ articles.
🥼 Electronic Laboratory Notebooks
Notebook | Paid/Free | Pros ✅ | Cons ❌ | Community Support | Templates |
---|---|---|---|---|---|
Obsidian | Free (core app is not open source) | Extremely flexible, superior note backlinking capabilities, .md file type | Steep learning curve | Yes | ELN Blog Post |
Benchling | Free + Paid Plans | Developed for laboratory scientists | Completely online | Yes | |
Notion | Free for students | Flexible, huge community, database ready (great for literature reviews) | Requires a stable internet connection, non-local | Yes | Notion Templates, Case Study |
Microsoft Office OneNote | Free with Microsoft subscription | Simple UI, integrates with all Office products, cloud backup, decent search | Clunky UI | Yes | |
AnyType | Free | Encrypted, similar to Notion in UX/UI, and “offline” (with an included web sync service). Integrated backlinking similar to Obsidian | Newer software, “black-boxy” file types | Some | |
🖥️ Scripting and Reproducibility
- RStudio IDE
- Quarto: Also developed by Posit. Similar to R Markdown and supports multiple languages.
- Jupyter: A locally deployed web application for scripting. Multi-language support, but works especially well for `python` and `julia` users. ‘Chunk’-based approach.
- VS Code: A lightweight and accessible text editor.
- GitHub Copilot: A nice AI assistant that can be integrated into VS Code.
- Warp (macOS only as of Feb 2024): A fancy-schmancy terminal with integrated features such as autocomplete, pipelines, and its own AI.
🧬 General sequence data type utilities
- SeqKit: Literal gold. Used on a daily basis. An extremely powerful FASTA/Q toolkit for file manipulation, searching, motif scanning, extraction, statistics, etc. Written in Go for an extremely lightweight experience.
- BigSeqKit: A parallel toolkit for processing FASTA/Q files at scale. It offers SeqKit-style operations for manipulating and analyzing very large sequence datasets.
- ClipKIT: A multiple sequence alignment trimming tool. It retains phylogenetically informative sites and removes poorly informative ones, with the aim of improving downstream phylogenomic inference.
- BioKit:
- vsearch: Essentially an open-source implementation of USEARCH offering high-performance sequence analysis tools for tasks such as sequence clustering, chimera detection, and sequence similarity search.
- CD-Hit: A classic. It is a widely used software package for clustering and comparing protein or nucleotide sequences. It employs a fast and memory-efficient algorithm to identify clusters of similar sequences, enabling users to reduce redundancy in sequence datasets and identify representative sequences.
- Awk: A powerful text processing tool used for pattern scanning and processing in text files. It provides a simple yet expressive scripting language for manipulating textual data, making it a popular choice for tasks such as data extraction, transformation, and reporting.
- Sed: Short for Stream Editor. It is a command-line utility for parsing and transforming text streams. It operates by applying user-defined editing commands to input text, making it useful for tasks such as text substitution, filtering, and text manipulation in Unix-like environments.
- MultiQC: Aggregates FastQC reports (and output from many other tools) into a single, nice interactive HTML document.
- MMseqs2: Sequence searching and clustering for very large protein and nucleotide sets. It uses fast k-mer prefiltering followed by alignment, and supports iterative profile searches, to identify homologs and group them.
- Miller: A flexible command-line utility for cleaning and transforming structured data such as CSV, TSV, and JSON.
- Sublime Text Editor
- BLAST and BLAST+: No introduction needed
- A nice introduction to using BLAST+ CLI.
- NCBI C++ Toolkit: A public-domain collection of portable libraries for sequence-class data (and others).
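To make the "statistics" part of a FASTA/Q toolkit concrete, here is a minimal Python sketch of the kind of per-file summary that a command like `seqkit stats` reports (record count, total, minimum and maximum length, GC percentage). This is not how SeqKit is implemented; the function names `read_fasta` and `fasta_stats` and the two example records are hypothetical, chosen only for illustration.

```python
from io import StringIO


def read_fasta(handle):
    """Yield (header, sequence) tuples from a FASTA-formatted text handle."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)  # sequences may wrap over several lines
    if header is not None:
        yield header, "".join(chunks)


def fasta_stats(records):
    """Return record count, total/min/max length, and GC percentage."""
    lengths = []
    gc = 0
    for _, seq in records:
        seq = seq.upper()
        lengths.append(len(seq))
        gc += seq.count("G") + seq.count("C")
    total = sum(lengths)
    return {
        "num_seqs": len(lengths),
        "sum_len": total,
        "min_len": min(lengths) if lengths else 0,
        "max_len": max(lengths) if lengths else 0,
        "gc_percent": round(100 * gc / total, 2) if total else 0.0,
    }


# Hypothetical two-record input, for illustration only
example = StringIO(">seq1 demo\nATGC\nGGCC\n>seq2\nATAT\n")
stats = fasta_stats(read_fasta(example))
print(stats)
```

For real files, swapping `StringIO(...)` for `open("my.fasta")` should work the same way, though a dedicated tool will be far faster on large inputs.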
🧬 Sequence Motif and Similarity Searches
🧬 Protein Modeling and structure visualization
- PyMOL and ChimeraX: Powerful tools for visualizing, annotating and analyzing protein structures.
- ProteinImager: A web-based protein visualization tool with direct PDB connectivity.
- Molecular Nodes: This is my personal module for visualizing crystal structures (although it is so much more than that)
- AlphaFold2
- Tutorials are abundant but here is a good protocol for easing into ColabFold https://protocolexchange.researchsquare.com/article/pex-2490/v1
- Evo: A long-context biological foundation model trained on genomic sequence, for prediction and generative design from the molecular to the genome scale.
🧬 Learning a Pipeline (and useful tools within them)
scRNAseq:
- single-cell-tutorial: Scripts to complement the paper: Current best-practices in single-cell RNA-seq: a tutorial
- An open-source lecture series on single-cell data science from JHU’s Dr. Stephanie Hicks.
RNAseq (bulk and sc):
- Caltech BI/BE/CSS 183 - Course: Introduction to Computational Biology and Bioinformatics Course at Caltech, 2023
- Introduction to GSEA
- iDEP: Guided RNAseq analysis built and deployed in R Shiny.
- TOmicsVis: An end-to-end transcriptomics analysis resource. Implemented as an R Shiny app.
- PCAtools: Extended PCA toolkit for illuminating eigenvalue meaning.
- ggkegg: A nice extension for visualizing KEGG pathways.
- EnhancedVolcano: A tidy volcano plot package so that you don’t have to make yours from scratch.
- ggvolcano: Another one.
- ggvolc: Another one.
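The volcano plot packages above all draw the same underlying quantities: log2 fold change on the x-axis and -log10 of the (adjusted) p-value on the y-axis, with genes labeled by whether they pass both cutoffs. A minimal Python sketch of that bookkeeping follows. The cutoffs (|log2FC| ≥ 1, adjusted p < 0.05) are common conventions rather than anything prescribed by these packages, and the function names and example rows are hypothetical.

```python
import math


def classify_gene(log2fc, padj, lfc_cutoff=1.0, padj_cutoff=0.05):
    """Label one gene as 'up', 'down', or 'ns' (not significant)."""
    if padj < padj_cutoff and log2fc >= lfc_cutoff:
        return "up"
    if padj < padj_cutoff and log2fc <= -lfc_cutoff:
        return "down"
    return "ns"


def volcano_coordinates(results):
    """Convert (gene, log2fc, padj) rows into volcano plot points."""
    points = []
    for gene, log2fc, padj in results:
        points.append({
            "gene": gene,
            "x": log2fc,               # effect size
            "y": -math.log10(padj),    # significance; padj must be > 0
            "label": classify_gene(log2fc, padj),
        })
    return points


# Hypothetical differential expression results, for illustration only
results = [
    ("geneA", 2.5, 0.001),
    ("geneB", -1.8, 0.01),
    ("geneC", 0.3, 0.20),
    ("geneD", 3.0, 0.5),
]
points = volcano_coordinates(results)
for p in points:
    print(p["gene"], p["label"])
```

Note that `geneD` has a large fold change but stays `ns` because it fails the p-value cutoff; both conditions must hold.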
Proteomics:
Genomics
Genomics really deserves its own post, but a nice integration of all the tools I use can be summarized through GATK.
- IGV-Reports
- ggcoverage: R package for flexible `BAM` file visualization.
- Sandbox.bio: Learning genomics in your browser
- Dada2
- SeekDeep
- gget
Integrated and Pan-Dimensional Omics
- SIMON: A machine-learning based approach to integrating diverse biological data with transcriptomics and microbiome datasets.
- Read the manuscript
- Read the FluPRINT study
High-dimensional flow cytometry
- Omiq: (resource) Cloud-based flow cytometry software. A nice FlowJo alternative geared particularly toward high-dimensional flow panels, with seamless integration with common dimension-reduction approaches and EmbedSOM.
🧬 Phylogenetics
- Augur: A phylogenetics toolkit with a wealth of integrated tools for filtering and aligning sequences and constructing phylogenetic trees, as well as options for deploying interactive dashboards for phylogenetic inference.
- nextclade: Part of the greater Nextstrain project. Relatively niche to viral genome/gene evolution. However, nextclade v3 makes the list due to its powerful and fast annotation-guided alignment functionality described here.
- eteToolkit: A Python library for phylogenetic and evolutionary hypothesis testing and visualization.
- BEAST2: BEAST2 is a software package for Bayesian evolutionary analysis by sampling trees. It is widely used for estimating phylogenies, divergence times, and other evolutionary parameters from molecular sequence data.
- clockor2: clockor2 is a software tool for estimating evolutionary rates and divergence times from molecular sequence data. It is particularly useful for studying the molecular clock hypothesis and inferring evolutionary timelines.
- PhyKit: a software package designed for phylogenetic analysis and visualization. It provides a suite of tools for processing and analyzing phylogenetic data, including tree manipulation, visualization, and statistical analysis.
- IQ-TREE: IQ-TREE is a software tool for inferring phylogenetic trees from molecular sequence data. It uses maximum likelihood methods to estimate tree topologies and model parameters. Especially useful for determining appropriate models for phylogenetic tree construction.
- treetime: A tool for inferring time-resolved phylogenies from molecular sequence data. It incorporates temporal information into phylogenetic analyses to estimate evolutionary timelines and rates more accurately.
- IcyTree: a web-based tool for visualizing and annotating phylogenetic trees. It offers interactive visualization features and supports the integration of additional data layers for enhanced analysis.
- treeio: Implemented as an R package, treeio is a software package for reading, writing, and manipulating phylogenetic tree data in various formats. It provides a set of tools for importing and exporting tree data, as well as for performing basic tree manipulations and analyses.
- ape: An R package for analyzing and manipulating phylogenetic trees. It provides a wide range of functions for reading, writing, and manipulating tree data, as well as for conducting comparative phylogenetic analyses.
- HyPhy: A software package for conducting phylogenetic analysis and hypothesis testing. It offers a suite of tools for estimating evolutionary parameters, testing molecular evolution models, and detecting positive selection from sequence data.
- treesort
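Several of the tools above estimate evolutionary rates from dated sequences. The simplest version of that idea is a root-to-tip regression under a strict clock: regress each tip's distance from the root on its sampling date, read the slope as the substitution rate, and read the x-intercept as the estimated date of the root. The sketch below shows only that arithmetic in plain Python. It is not the method used by BEAST2 (which is Bayesian), and the function name and example values are hypothetical.

```python
def root_to_tip_regression(dates, distances):
    """Ordinary least-squares fit of root-to-tip distance on sampling date.

    Returns (rate, intercept, root_date). The rate is in substitutions per
    site per unit of time; root_date is the x-intercept of the fitted line.
    """
    n = len(dates)
    if n < 2 or n != len(distances):
        raise ValueError("need at least two (date, distance) pairs")
    mean_t = sum(dates) / n
    mean_d = sum(distances) / n
    sxx = sum((t - mean_t) ** 2 for t in dates)
    if sxx == 0:
        raise ValueError("all sampling dates are identical")
    sxy = sum((t - mean_t) * (d - mean_d) for t, d in zip(dates, distances))
    rate = sxy / sxx
    intercept = mean_d - rate * mean_t
    # A non-positive slope means there is no usable temporal signal
    root_date = -intercept / rate if rate > 0 else None
    return rate, intercept, root_date


# Hypothetical tips: decimal sampling years and root-to-tip distances
dates = [2018.0, 2019.0, 2020.0, 2021.0]
distances = [0.006, 0.008, 0.010, 0.012]
rate, intercept, root_date = root_to_tip_regression(dates, distances)
print(rate, root_date)
```

Real data will scatter around the line, and a weak or negative slope is usually a sign that the dataset lacks temporal signal or that the tree is rooted poorly.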
☁️ FREE Online and cloud-based computational resources
- UseGalaxy: an online platform for conducting bioinformatics analyses. It provides a user-friendly interface for accessing a wide range of bioinformatics tools and workflows, including those for sequence analysis, genome assembly, and phylogenetics. Users can upload their data, select analysis tools, and visualize results within the platform.
- BV-BRC: A bioinformatics resource center focused on providing tools and data for research in biodefense and public health. It offers a variety of resources, including databases of genomic sequences, analysis tools for studying pathogens, and educational materials for researchers in the field.
- MAFFT: a software tool for multiple sequence alignment of biological sequences. It is widely used for aligning DNA, RNA, and protein sequences, and it offers several algorithms optimized for different types of sequences and alignment tasks. MAFFT can handle large datasets efficiently and is available both as a standalone program and through a web server interface.
💻 R and Python Libraries for statistical interrogation and visualization
For statistics and visualization, I am biased toward solutions in R. The following are some of my favorite visualization libraries in R and Python. This list is not even close to exhaustive, but these cover my basics.
Eric Fletcher has an excellent R learning resource list, his awesome-r-learning-resources page, which helped me ease my way into hypothesis testing and visualization in R.
Yan Holtz’s data2viz guidelines for appropriate visualization of data. An excellent resource which explains the pros and caveats for numeric, categoric, Num+Cat, mapped, network and time series data.
Yan Holtz’s galleries for R, Python, D3, and React.js are an excellent resource for gathering inspiration for visualizing complex (or even simple) data.
R
- 🖼️ ggplot2 extension gallery
- 🖼️ The `R` Graph Gallery - A reproducible collection of charts using `R`. This gallery is an incredible resource for not only expanding the flexibility of your plots, but also for choosing the most appropriate visualization for your data.
- ggstatsplot
- ggpubr
- smplot2
- ggforce
- ggdist
- tidytables
- ggupset
- ggvenndiagram - if you must…
- ggmsa: MSA viewer appropriate for visualization of short aa or nt alignments. Preloaded with different tracks showing consensus, stacked composition, motif highlighting, etc.
- seqvisR
Fun packages for ✨aesthetics✨ and publication-quality figures
Python
I primarily use Python to build tools that do not exist in `R` or as a standalone release. Outside of pandas, numpy, matplotlib, and standard system modules, I use very few extras:
Honorable mentions
- The awesome-datascience
- An Rshiny for calculating statistical power and sample size: https://pwrss.shinyapps.io/index/
Common Databases I use that have been lifesavers for large-volume data projects:
- DuckDB
- SQLite: Lightweight RDBMS! Great for smaller sequencing projects.
- MongoDB: Well-supported NoSQL Database.
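As a small illustration of why SQLite suits smaller sequencing projects, the sketch below uses Python's built-in `sqlite3` module to keep sample metadata in a single table and summarize it with one query. The table name, columns, and rows are hypothetical, and an in-memory database stands in for what would normally be a single file on disk.

```python
import sqlite3

# ":memory:" keeps the example self-contained; pass a file path
# (e.g. "project.db") to persist the database on disk instead.
conn = sqlite3.connect(":memory:")

# Hypothetical schema for per-sample sequencing metadata
conn.execute(
    """
    CREATE TABLE samples (
        sample_id       TEXT PRIMARY KEY,
        host            TEXT NOT NULL,
        collection_date TEXT,
        read_count      INTEGER
    )
    """
)

# Hypothetical rows, for illustration only
rows = [
    ("S001", "human", "2023-01-15", 120000),
    ("S002", "human", "2023-02-01", 95000),
    ("S003", "swine", "2023-02-10", 143000),
]
# Parameterized inserts avoid quoting problems in sample names
conn.executemany("INSERT INTO samples VALUES (?, ?, ?, ?)", rows)
conn.commit()

# Samples and total reads per host
summary = conn.execute(
    "SELECT host, COUNT(*), SUM(read_count) "
    "FROM samples GROUP BY host ORDER BY host"
).fetchall()
print(summary)

conn.close()
```

The whole database lives in one portable file with no server to run, which is most of the appeal for a project that fits on a laptop.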
High-level languages which mesh incredibly well with sequencing-based bioinformatics:
- Rust: 🫠
- Go: 🏃🏻♂️💨
- C++: Of course.
📝 Blogs, Repositories and Additional Resources
- TidyTuesday
- Learning basic statistical Testing in R
- Dr. Stephanie Hicks’ Blog
- Repository - LearnAnything.xyz - An interactive collection from the learn-anything.xyz github repository