Useful tools for my PhD

A collection of tools I frequently use for my PhD studies in Molecular Virology. Updated 2024-04-18.

Author

Elgin Akin

Published

February 27, 2024

Literature - Keeping up and keeping track…

🧠 Personal knowledge databases

For literature and coursework:

  • RemNote: A markdown editor with many built-in features for spaced repetition studying 💭. Simple and effective. Nice app too.
  • Obsidian: Obsidian deserves its own post (coming soon…). It’s an extremely powerful markdown editor suited for integrating individual notes into a network for your * insert need here *. Obsidian is maintained by a core team of developers and extended by the community through plugins.
  • Omnivore: Think of this as a citation manager but for…everything. It is my favorite “web clipper” application and is flexible across many types of content. Also has integration with ‘database managers’ like Obsidian and Notion (discussed below).

✏️ Citation/Literature managers

Each of these integrates with common web browsers.

📚 Literature Review Resources

  • Research Rabbit: An AI-assisted application which builds networks around public manuscripts (accessible by DOI) by various attributes: author, citations, etc.
  • Semantic Scholar: A powerful search which assists with identifying ‘related’ articles.

🥼 Electronic Laboratory Notebooks

| Notebook | Paid/Free | Pros ✅ | Cons ❌ | Community Support | Templates |
|---|---|---|---|---|---|
| Obsidian | Free (closed source) | Extremely flexible, superior note backlinking capabilities, .md file type | Steep learning curve | Yes | ELN Blog Post |
| Benchling | Free + paid plans | Developed for laboratory scientists | Completely online | Yes | |
| Notion | Free to students | Flexible, huge community, database ready (great for literature reviews) | Requires a stable internet connection, non-local | Yes | Notion Templates, Case Study |
| Microsoft Office OneNote | Free with Microsoft subscription | Simple UI, integrates all Office products, cloud backup, decent search | Clunky UI | Yes | |
| AnyType | Free | Encrypted, similar to Notion UX/UI and is “offline” (with included web sync service). Integrated backlinking similar to Obsidian | Newer software, “Blackboxy” file types | Some | |

🖥️ Scripting and Reproducibility

  • RStudio IDE
  • Quarto: Also developed by Posit. Similar to R Markdown and supports multiple languages.
  • Jupyter: A locally deployed web application for scripting. Multi-language support, but works especially well for Python and Julia users. ‘Chunk’-based approach.
  • VS Code: A lightweight and accessible text editor.
  • GitHub Copilot: Nice AI assistant that can be integrated into VS Code.
  • Warp (macOS only as of Feb 2024): A fancy shmancy command line terminal with integrated features such as autocomplete, pipelines, its own AI, etc.

🧬 General sequence data type utilities

  • Seqkit: Literal gold. Used on a daily basis. An extremely powerful FASTA/Q toolkit for file manipulation, searching, motif scanning, extraction, statistics, etc. Written in Go for an extremely lightweight experience.
  • BigSeqKit: A toolkit designed for processing and analyzing large-scale biological sequence data. It offers efficient algorithms and data structures tailored for handling big sequence datasets, enabling researchers to perform tasks such as sequence alignment, similarity search, and sequence manipulation at scale.
  • ClipKIT: A multiple sequence alignment trimming tool. It keeps phylogenetically informative sites and trims the rest, which helps clean up alignments before phylogenomic tree inference.
  • BioKit:
  • vsearch: Essentially an open-source implementation of USEARCH offering high-performance sequence analysis tools for tasks such as sequence clustering, chimera detection, and sequence similarity search.
  • CD-Hit: A classic. It is a widely used software package for clustering and comparing protein or nucleotide sequences. It employs a fast and memory-efficient algorithm to identify clusters of similar sequences, enabling users to reduce redundancy in sequence datasets and identify representative sequences.
  • Awk: A powerful text processing tool used for pattern scanning and processing in text files. It provides a simple yet expressive scripting language for manipulating textual data, making it a popular choice for tasks such as data extraction, transformation, and reporting.
  • Sed: Short for Stream Editor. It is a command-line utility for parsing and transforming text streams. It operates by applying user-defined editing commands to input text, making it useful for tasks such as text substitution, filtering, and text manipulation in Unix-like environments.
  • MultiQC: Packs FastQC reports into a nice interactive HTML document.
  • MMseqs2: Protein sequence searching and clustering. It uses fast k-mer prefiltering followed by alignment, with optional profile searches, to identify homologs and group them.
  • Miller: A flexible utility for cleaning and transforming structured data in JSON and other formats.
  • Sublime Text Editor
  • BLAST and BLAST+: No introduction needed
  • NCBI C++ Toolkit: A public-domain collection of portable libraries for sequencing-class data (and others).
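
To give a feel for what FASTA toolkits such as Seqkit automate, here is a minimal sketch in plain Python that parses FASTA text and reports length and GC content per record. The function names (`read_fasta`, `gc_percent`) and the two-record example are made up for illustration; in practice something like `seqkit stats` is far faster and handles many more edge cases.

```python
from io import StringIO


def read_fasta(handle):
    """Yield (header, sequence) pairs from a FASTA-formatted text handle."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def gc_percent(seq):
    """Percentage of G and C bases in a sequence (0.0 for an empty one)."""
    if not seq:
        return 0.0
    seq = seq.upper()
    return 100.0 * (seq.count("G") + seq.count("C")) / len(seq)


# Hypothetical two-record FASTA, used only for illustration.
example = StringIO(">seq1 demo\nATGCGC\nGGAT\n>seq2 demo\nATATAT\n")
records = list(read_fasta(example))
for name, seq in records:
    print(name, len(seq), round(gc_percent(seq), 1))
```

Sequences wrapped over several lines are joined before any statistic is computed, which is the main thing a naive line-by-line approach with general text tools tends to get wrong.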

🧬 Sequence Motif and Similarity Searches

  • dbPTM: Integrated resource for post-translational modifications. Can be searched by keyword or by .fasta sequence.
  • EMBL-EBI has an incredible wealth of freely available tools for molecular data analysis. Some I use often include:
  • MAFFT: MSA tool. CLI deployable as well.

🧬 Protein Modeling and structure visualization

  • PyMOL and ChimeraX: Powerful tools for visualizing, annotating and analyzing protein structures.
  • ProteinImager: A web-based protein visualization tool. Direct PDB connectivity.
  • Molecular Nodes: This is my personal module for visualizing crystal structures (although it is so much more than that).
  • AlphaFold2
    • Tutorials are abundant but here is a good protocol for easing into ColabFold https://protocolexchange.researchsquare.com/article/pex-2490/v1
  • Evo: A long-context biological foundation model for predictive structure generation.

🧬 Learning a Pipeline (and useful tools within them)

scRNAseq:

RNAseq (bulk and sc):

  • Caltech BI/BE/CSS 183 - Course: Introduction to Computational Biology and Bioinformatics Course at Caltech, 2023
  • Introduction to GSEA
  • iDEP: Guided RNAseq analysis built and deployed in R Shiny.
  • TOmicsVis: An end-to-end transcriptomics analysis resource. Implemented as an R Shiny app.
  • PCAtools: Extended PCA toolkit for illuminating eigenvalue meaning.
  • ggkegg: A nice extension for visualizing KEGG pathways.
  • EnhancedVolcano: A tidy volcano plot package so that you don’t have to make yours from scratch.
  • ggvolcano: Another one.
  • ggvolc: Another one.
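
For readers new to volcano plots, the sketch below shows the arithmetic the packages above wrap: each gene is placed at x = log2 fold change and y = -log10(p), and colored by whether it passes both cutoffs. This is a plain-Python illustration, not the API of any of the R packages; the function names, the cutoffs (|log2FC| ≥ 1, p < 0.05) and the gene table are assumptions chosen for the example.

```python
import math


def classify_gene(log2_fc, p_value, fc_cutoff=1.0, p_cutoff=0.05):
    """Label a gene the way a volcano plot usually colors it."""
    if p_value < p_cutoff and log2_fc >= fc_cutoff:
        return "up"
    if p_value < p_cutoff and log2_fc <= -fc_cutoff:
        return "down"
    return "ns"  # not significant, or effect too small


def volcano_coordinates(log2_fc, p_value):
    """Return the (x, y) position of a gene: x = log2 FC, y = -log10(p)."""
    return log2_fc, -math.log10(p_value)


# Hypothetical results table: gene -> (log2 fold change, adjusted p-value).
results = {
    "GENE_A": (2.3, 0.001),
    "GENE_B": (-1.8, 0.01),
    "GENE_C": (0.4, 0.0001),
    "GENE_D": (3.0, 0.2),
}

for gene, (lfc, p) in results.items():
    x, y = volcano_coordinates(lfc, p)
    print(gene, round(x, 2), round(y, 2), classify_gene(lfc, p))
```

GENE_C and GENE_D show why both thresholds matter: one is highly significant with a tiny effect, the other has a large effect that is not significant.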

Proteomics:

Genomics

Genomics really deserves its own post, but a nice integration of all the tools I use can be summarized through GATK.

Integrated and Pan-Dimensional Omics

High-dimensional flow cytometry

  • Omiq: (resource) Cloud-based flow cytometry software, a nice FlowJo alternative geared particularly toward high-dimensional flow panels, with seamless integration with common dimension-reduction approaches and EmbedSOM.

🧬 Phylogenetics

  • Augur: A phylogenetics toolkit with a wealth of integrated tools for filtering and aligning sequences, constructing phylogenetic trees, as well as options for deploying interactive dashboards for phylogenetic inference.
  • nextclade: Part of the Nextstrain project. Relatively niche to viral genome/gene evolution. However, nextclade v3 makes the list due to its powerful and fast annotation-guided alignment functionality, described here.
  • ETE Toolkit: A Python library for phylogenetic and evolutionary hypothesis testing and visualization.
  • BEAST2: BEAST2 is a software package for Bayesian evolutionary analysis by sampling trees. It is widely used for estimating phylogenies, divergence times, and other evolutionary parameters from molecular sequence data.
  • clockor2: clockor2 is a software tool for estimating evolutionary rates and divergence times from molecular sequence data. It is particularly useful for studying the molecular clock hypothesis and inferring evolutionary timelines.
  • PhyKit: a software package designed for phylogenetic analysis and visualization. It provides a suite of tools for processing and analyzing phylogenetic data, including tree manipulation, visualization, and statistical analysis.
  • IQ-TREE: IQ-TREE is a software tool for inferring phylogenetic trees from molecular sequence data. It uses maximum likelihood methods to estimate tree topologies and model parameters. Especially useful for determining appropriate models for phylogenetic tree construction.
  • treetime: A tool for inferring time-resolved phylogenies from molecular sequence data. It incorporates temporal information into phylogenetic analyses to estimate evolutionary timelines and rates more accurately.
  • IcyTree: a web-based tool for visualizing and annotating phylogenetic trees. It offers interactive visualization features and supports the integration of additional data layers for enhanced analysis.
  • treeio: Implemented as an R package, treeio is a software package for reading, writing, and manipulating phylogenetic tree data in various formats. It provides a set of tools for importing and exporting tree data, as well as for performing basic tree manipulations and analyses.
  • ape: An R package for analyzing and manipulating phylogenetic trees. It provides a wide range of functions for reading, writing, and manipulating tree data, as well as for conducting comparative phylogenetic analyses.
  • HyPhy: A software package for conducting phylogenetic analysis and hypothesis testing. It offers a suite of tools for estimating evolutionary parameters, testing molecular evolution models, and detecting positive selection from sequence data.
  • treesort
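
A common first check for clock-like signal, before reaching for the time-tree tools above, is a root-to-tip regression: regress each tip's distance from the root on its sampling date, read the slope as an evolutionary rate, and read the x-intercept as a rough date for the root. The sketch below is a bare ordinary least-squares version in plain Python. The function name and the four data points are hypothetical, and real tools add rooting, confidence estimates and much more.

```python
def root_to_tip_regression(dates, distances):
    """Ordinary least-squares fit of root-to-tip distance against sampling date.

    Returns (rate, intercept, tmrca): the slope in substitutions/site/year,
    the y-intercept, and the date at which the fitted distance is zero.
    """
    n = len(dates)
    if n < 2 or n != len(distances):
        raise ValueError("need at least two matched date/distance pairs")
    mean_x = sum(dates) / n
    mean_y = sum(distances) / n
    sxx = sum((x - mean_x) ** 2 for x in dates)
    if sxx == 0:
        raise ValueError("all sampling dates are identical")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(dates, distances))
    rate = sxy / sxx
    intercept = mean_y - rate * mean_x
    # The x-intercept is undefined when the fitted line is flat.
    tmrca = -intercept / rate if rate != 0 else float("nan")
    return rate, intercept, tmrca


# Hypothetical tips: decimal sampling dates and root-to-tip distances.
dates = [2010.0, 2012.0, 2014.0, 2016.0]
distances = [0.010, 0.014, 0.018, 0.022]
rate, intercept, tmrca = root_to_tip_regression(dates, distances)
print(f"rate={rate:.4f} subs/site/year, root date={tmrca:.1f}")
```

A negative or near-zero slope on real data is usually a sign that the tree is poorly rooted or that the dataset lacks temporal signal, in which case a dated phylogeny should be interpreted with caution.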

☁️ FREE Online and cloud-based computational resources

  • UseGalaxy: an online platform for conducting bioinformatics analyses. It provides a user-friendly interface for accessing a wide range of bioinformatics tools and workflows, including those for sequence analysis, genome assembly, and phylogenetics. Users can upload their data, select analysis tools, and visualize results within the platform.
  • BV-BRC: A bioinformatics resource center focused on providing tools and data for research in biodefense and public health. It offers a variety of resources, including databases of genomic sequences, analysis tools for studying pathogens, and educational materials for researchers in the field.
  • MAFFT: a software tool for multiple sequence alignment of biological sequences. It is widely used for aligning DNA, RNA, and protein sequences, and it offers several algorithms optimized for different types of sequences and alignment tasks. MAFFT can handle large datasets efficiently and is available both as a standalone program and through a web server interface.

💻 R and Python Libraries for statistical interrogation and visualization

For statistics and visualization, I am biased toward solutions in R. The following are some of my favorite visualization libraries in R and Python. This list is not even close to exhaustive but these cover my basics.

  • Eric Fletcher has an excellent R learning resource list, awesome-r-learning-resources, which helped me ease my way into hypothesis testing and visualization in R.

  • Yan Holtz’s data2viz guidelines for appropriate visualization of data. An excellent resource which explains the pros and caveats for numeric, categoric, Num+Cat, map, network and time series data.

  • Yan Holtz’s galleries for R, Python, D3, and React.js are an excellent resource for gathering inspiration for visualizing complex (or even simple) data.

R

Fun packages for ✨aesthetics✨ and publication-quality figures

Python

I primarily use Python to build tools that do not exist in R or as a standalone release. Outside of pandas, numpy, matplotlib, and standard system modules, I use very few extras:

Honorable mentions

  • The awesome-datascience
  • An Rshiny for calculating statistical power and sample size: https://pwrss.shinyapps.io/index/
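
As a rough idea of what a power and sample size calculator computes, here is a sketch of the textbook normal-approximation formula for comparing two group means, n = 2(z₁₋α/₂ + z_power)² / d² per group, where d is the standardized effect size. This is not the method used by the app linked above; the function name and defaults are assumptions, and an exact t-based calculation gives a slightly larger n.

```python
import math
from statistics import NormalDist


def n_per_group(effect_size, alpha=0.05, power=0.80):
    """Approximate per-group n for a two-sided, two-sample comparison of means.

    Uses the normal approximation n = 2 * (z_{1-alpha/2} + z_{power})^2 / d^2,
    where d is the standardized effect size (Cohen's d).
    """
    if effect_size == 0:
        raise ValueError("effect size must be non-zero")
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value of the test
    z_power = NormalDist().inv_cdf(power)          # quantile for the target power
    return math.ceil(2 * (z_alpha + z_power) ** 2 / effect_size ** 2)


# Medium effect (d = 0.5), 5% two-sided alpha, 80% power.
print(n_per_group(0.5))
```

Because n scales with 1/d², halving the expected effect size roughly quadruples the number of samples needed, which is worth knowing before planning an experiment.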

Common Databases I use that have been lifesavers for large-volume data projects:

  • DuckDB
  • SQLite: Lightweight RDBMS! Great for smaller sequencing projects.
  • MongoDB: Well-supported NoSQL Database.
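
As a small illustration of why a lightweight database helps on a sequencing project, the sketch below uses Python's built-in `sqlite3` module to store per-sequence metadata and summarize it with one query. The table layout, column names and accessions are all made up for the example.

```python
import sqlite3

# In-memory database so the sketch leaves no files behind.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE sequences (
        accession TEXT PRIMARY KEY,
        segment   TEXT NOT NULL,
        length    INTEGER NOT NULL,
        collected TEXT              -- ISO 8601 date string
    )
    """
)

# Hypothetical records; the accessions and values are invented.
rows = [
    ("DEMO0001", "HA", 1701, "2023-01-15"),
    ("DEMO0002", "NA", 1410, "2023-02-02"),
    ("DEMO0003", "HA", 1698, "2024-01-20"),
]
conn.executemany("INSERT INTO sequences VALUES (?, ?, ?, ?)", rows)
conn.commit()

# Count records and average sequence length per segment.
summary = conn.execute(
    """
    SELECT segment, COUNT(*), AVG(length)
    FROM sequences
    GROUP BY segment
    ORDER BY segment
    """
).fetchall()
print(summary)
conn.close()
```

Swapping `":memory:"` for a file path persists the database as a single file, which is easy to keep next to the FASTA files it describes.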

High-level languages which mesh incredibly well with sequencing-based bioinformatics:

  • Rust: 🫠
  • Go 🏃🏻‍♂️💨
  • C++: Of course.

📝 Blogs, Repositories and Additional Resources