Visual Physiology Opsin Database

A newly compiled database of opsin genes and machine-learning models to predict peak-sensitivity (λmax) phenotypes.

Summary of VPOD v1.3

  • Contains 1,714 unique opsin genotypes and corresponding λmax phenotypes collected across all animals from 120+ separate publications.
  • Curates all heterologously expressed opsin genes and a partial collection of physiologically inferred ones.
  • Uses VPOD data and deepBreaks to show that regression-based machine learning (ML) models reliably predict λmax, account for epistatic (non-additive) effects, and identify functional amino acid sites.
  • Lays the groundwork for future robust exploration of molecular-evolutionary patterns governing phenotype.

Database File Structure

VPOD provides both fully curated, machine-learning-ready datasets and the raw database files for custom querying with SQLite.

Formatted Data Subsets

Located in vpod_data/VPOD_1.3/formatted_data_subsets/. These subsets are suitable for direct model training without requiring SQLite or sequence alignment.

  • xxx.txt - Unaligned data subsets.
  • xxx_aligned.txt - Aligned data subsets.
  • VPOD_xxx_het_1.3.fasta - Fully aligned and formatted subsets.
  • xxx_meta.tsv - Metadata files (species, accession, λmax).
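
As a rough illustration of how an aligned FASTA subset and its metadata file might be paired before training, here is a minimal Python sketch using only the standard library. The column names `id` and `lambda_max` and the example records are assumptions made for illustration; check the header row of the actual xxx_meta.tsv files before reusing it.

```python
import csv
import io


def read_fasta(handle):
    """Parse FASTA text into {record id: sequence}, joining wrapped lines."""
    records = {}
    name = None
    chunks = []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                records[name] = "".join(chunks)
            header = line[1:].split()
            name = header[0] if header else ""
            chunks = []
        else:
            chunks.append(line)
    if name is not None:
        records[name] = "".join(chunks)
    return records


def read_meta(handle, id_col="id", lmax_col="lambda_max"):
    """Read a tab-separated metadata table into {record id: lambda max}.

    The column names are hypothetical defaults, not VPOD's documented schema.
    """
    reader = csv.DictReader(handle, delimiter="\t")
    return {row[id_col]: float(row[lmax_col]) for row in reader}


# Made-up records standing in for a real subset and its metadata file.
fasta_text = ">seq1 example\nMNGT-EGP\nNFYV\n>seq2\nMNGTAEGP\nNFYI\n"
meta_text = "id\tlambda_max\nseq1\t500.0\nseq2\t420.5\n"

sequences = read_fasta(io.StringIO(fasta_text))
lmax = read_meta(io.StringIO(meta_text))

# Keep only records that have both a sequence and a phenotype.
paired = [(sid, sequences[sid], lmax[sid]) for sid in sequences if sid in lmax]
```

To read real files, replace the `io.StringIO` objects with `open(path, encoding="utf-8")` handles.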

Raw Database Files

Located in vpod_data/VPOD_1.3/raw_database_files/. Load into SQLite to create custom datasets.

  • litsearch.csv - Literature search information.
  • references.csv - All publication references.
  • opsins.csv - Sequence data & taxonomic metadata.
  • heterologous.csv - Opsin phenotype data & experimental metadata.
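
One possible way to load the four raw CSV files into a local SQLite database is sketched below with Python's built-in `csv` and `sqlite3` modules. The helper names `load_csv` and `build_database` are hypothetical, and the sketch takes column names from each file's header row rather than assuming any particular schema. The notebook vpod_main_wf.ipynb may build its database differently.

```python
import csv
import os
import sqlite3

RAW_DIR = "vpod_data/VPOD_1.3/raw_database_files"
TABLES = ["litsearch", "references", "opsins", "heterologous"]


def quote(identifier):
    """Quote an SQL identifier (needed because "references" is a keyword)."""
    return '"{}"'.format(identifier.replace('"', '""'))


def load_csv(conn, csv_path, table):
    """Create `table` from a CSV file, using its header row as column names."""
    with open(csv_path, newline="", encoding="utf-8-sig") as fh:
        reader = csv.reader(fh)
        header = next(reader)
        columns = ", ".join(quote(c) for c in header)
        marks = ", ".join("?" for _ in header)
        conn.execute(
            "CREATE TABLE IF NOT EXISTS {} ({})".format(quote(table), columns)
        )
        # Skip blank lines, which csv.reader returns as empty lists.
        rows = (row for row in reader if row)
        conn.executemany(
            "INSERT INTO {} VALUES ({})".format(quote(table), marks), rows
        )
    conn.commit()
    return header


def build_database(raw_dir=RAW_DIR, db_path=":memory:"):
    """Load every raw CSV into one SQLite database and return the connection."""
    conn = sqlite3.connect(db_path)
    for name in TABLES:
        load_csv(conn, os.path.join(raw_dir, name + ".csv"), name)
    return conn


if __name__ == "__main__" and os.path.isdir(RAW_DIR):
    conn = build_database()
    for name in TABLES:
        count = conn.execute(
            "SELECT COUNT(*) FROM {}".format(quote(name))
        ).fetchone()[0]
        print(name, count)
```

All values are stored as text in this sketch, so cast numeric columns in the query (for example with `CAST(... AS REAL)`) when filtering by λmax.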

Clone VPOD

  git clone https://github.com/VisualPhysiologyDB/visual-physiology-opsin-db.git
  cd visual-physiology-opsin-db
  # Start exploring with vpod_main_wf.ipynb

Workflows & Analysis Notebooks

Primary ML Workflow

vpod_main_wf.ipynb is the primary notebook for users. It contains a full pipeline for creating a local instance of VPOD using SQLite, formatting datasets, and training ML models using deepBreaks.
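
To give a feel for what a regression model consumes, the sketch below one-hot encodes a small set of aligned sequences in plain Python, keeping only alignment columns that vary. This is an illustrative assumption about the general approach, not the deepBreaks API: deepBreaks performs its own preprocessing, and the function name `one_hot` and the toy sequences are hypothetical.

```python
def one_hot(aligned_seqs):
    """One-hot encode aligned sequences over their variable positions.

    Returns (feature_names, matrix), where each feature name is a 1-based
    alignment position followed by a residue, e.g. "2N".
    """
    length = len(aligned_seqs[0])
    if any(len(s) != length for s in aligned_seqs):
        raise ValueError("sequences must be aligned to the same length")

    # Invariant columns carry no information for regression, so drop them.
    features = []
    for pos in range(length):
        residues = sorted({s[pos] for s in aligned_seqs})
        if len(residues) > 1:
            features.extend((pos, r) for r in residues)

    matrix = [
        [1 if s[pos] == r else 0 for pos, r in features] for s in aligned_seqs
    ]
    names = ["{}{}".format(pos + 1, r) for pos, r in features]
    return names, matrix


# Toy alignment: positions 2 and 3 vary, positions 1 and 4 do not.
names, matrix = one_hot(["MNGT", "MNAT", "MQGT"])
```

Each row of `matrix` can then be paired with that sequence's λmax value as the regression target.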

Sequence Manipulation

Includes tools for generating chimeras, in-silico deep-mutational-scanning (DMS), and reciprocal mutagenesis to build theoretical opsin variants for model testing.
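
The three operations named above can be sketched in a few lines of Python. The function names and signatures here are hypothetical and are not taken from the VPOD tools; the sketch only illustrates the underlying string manipulations on aligned, equal-length sequences.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def deep_mutational_scan(seq):
    """Yield (label, mutant) for every single amino-acid substitution.

    Labels use the wild-type residue, 1-based position, and new residue,
    e.g. "M1A". Gap characters, if present, are substituted like any residue.
    """
    for i, wild_type in enumerate(seq):
        for aa in AMINO_ACIDS:
            if aa != wild_type:
                label = "{}{}{}".format(wild_type, i + 1, aa)
                yield label, seq[:i] + aa + seq[i + 1:]


def chimera(seq_a, seq_b, split_at):
    """Join the first `split_at` residues of seq_a to the rest of seq_b."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must have the same (aligned) length")
    return seq_a[:split_at] + seq_b[split_at:]


def reciprocal_mutants(seq_a, seq_b, position):
    """Swap the residue at a 1-based position between two aligned sequences."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must have the same (aligned) length")
    i = position - 1
    a_with_b = seq_a[:i] + seq_b[i] + seq_a[i + 1:]
    b_with_a = seq_b[:i] + seq_a[i] + seq_b[i + 1:]
    return a_with_b, b_with_a


# Toy usage on short made-up sequences.
mutants = list(deep_mutational_scan("MK"))
hybrid = chimera("AAAA", "CCCC", 2)
swapped = reciprocal_mutants("MNGT", "MQGT", 2)
```

A full scan of a sequence of length n produces 19 × n variants, so it is usually worth streaming the generator into a model rather than holding every mutant in memory.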

Phylogenetic Imputation

R-based tools (Phylogenetic_Imputation.Rmd) to load tree files, make λmax predictions via phylogenetic imputation, and compare them directly against ML outputs.
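
The comparison step can be illustrated independently of R. The Python sketch below scores two sets of λmax predictions against observed values using mean absolute error and R². The numbers are made up for illustration, and the choice of these two metrics is an assumption rather than a description of what Phylogenetic_Imputation.Rmd reports.

```python
def mean_absolute_error(observed, predicted):
    """Average absolute difference between observed and predicted values."""
    return sum(abs(o - p) for o, p in zip(observed, predicted)) / len(observed)


def r_squared(observed, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean) ** 2 for o in observed)
    return 1 - ss_res / ss_tot


# Made-up λmax values in nm; not VPOD results.
observed = [500.0, 420.0, 560.0]
ml_predictions = [498.0, 425.0, 555.0]
imputed_predictions = [510.0, 410.0, 560.0]

for label, preds in [("ML", ml_predictions), ("imputed", imputed_predictions)]:
    mae = mean_absolute_error(observed, preds)
    r2 = r_squared(observed, preds)
    print("{}: MAE = {:.2f} nm, R^2 = {:.4f}".format(label, mae, r2))
```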

Mine-n-Match (MNM)

Advanced workflow combining heterologous expression data with in-vivo correlations to augment datasets for more robust taxonomic subset modeling.