VPOD: Visual Physiology Opsin Database

Summary of VPOD v1.3

Contains 1,714 unique opsin genotypes and corresponding λ_max phenotypes collected across all animals from 120+ separate publications.
Curates all heterologously expressed opsin genes and a partial collection of physiologically inferred ones.
Uses VPOD data and deepBreaks to show that regression-based machine learning (ML) models reliably predict λ_max, account for epistatic (non-additive) effects, and identify functional amino acid sites.
Lays the groundwork for future robust exploration of molecular-evolutionary patterns governing phenotype.

Database File Structure

VPOD provides both fully curated machine-learning ready datasets and the raw database files for custom querying using SQLite.

Formatted Data Subsets

Located in vpod_data/VPOD_1.3/formatted_data_subsets/. These subsets are suitable for direct model training, with no SQLite setup or sequence alignment required.

xxx.txt - Unaligned data subsets.
xxx_aligned.txt - Aligned data subsets.
VPOD_xxx_het_1.3.fasta - Fully aligned and formatted subsets.
xxx_meta.tsv - Metadata files (species, accession, λ_max).
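
As a rough illustration of how an aligned FASTA subset and its metadata file might be paired for model training, here is a minimal sketch using only the Python standard library. The column names (id, species, lambda_max) and the toy records are assumptions made for this example; check the header row of the actual xxx_meta.tsv file before adapting it.

```python
import csv
import io


def read_fasta(handle):
    """Parse FASTA text into a dict of {record id: sequence}."""
    records = {}
    current_id = None
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Use the first whitespace-delimited token of the header as the id.
            current_id = line[1:].split()[0]
            records[current_id] = []
        elif current_id is not None:
            records[current_id].append(line)
    return {rid: "".join(parts) for rid, parts in records.items()}


def read_metadata(handle, id_column="id"):
    """Read a tab-separated metadata table keyed by `id_column` (assumed name)."""
    reader = csv.DictReader(handle, delimiter="\t")
    return {row[id_column]: row for row in reader}


# Toy stand-ins for a FASTA subset and its metadata file (not real VPOD data).
fasta_text = ">S1\nMNGTEG-PNF\nYVPFSN\n>S2\nMNGTEGQPNF\nYVPMSN\n"
meta_text = (
    "id\tspecies\tlambda_max\n"
    "S1\tExample species A\t500\n"
    "S2\tExample species B\t485\n"
)

sequences = read_fasta(io.StringIO(fasta_text))
metadata = read_metadata(io.StringIO(meta_text))

# Pair each sequence with its phenotype, skipping ids missing from the metadata.
dataset = [
    (rid, seq, float(metadata[rid]["lambda_max"]))
    for rid, seq in sequences.items()
    if rid in metadata
]
```

To use real files, replace the io.StringIO objects with open(path) handles pointing at the subset of interest.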

Raw Database Files

Located in vpod_data/VPOD_1.3/raw_database_files/. Load into SQLite to create custom datasets.

litsearch.csv - Literature search information.
references.csv - All publication references.
opsins.csv - Sequence data & taxonomic metadata.
heterologous.csv - Opsin phenotype data & experimental metadata.
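
The sketch below shows one way the raw CSV files could be loaded into SQLite and joined into a custom dataset, using Python's built-in sqlite3 and csv modules. The column names (opsin_id, species, aa_sequence, hetexp_id, lambda_max), the join key, and the toy rows are hypothetical; the real schema is defined by the headers of the CSV files themselves, and vpod_main_wf.ipynb remains the reference workflow.

```python
import csv
import io
import sqlite3


def load_csv(conn, table, handle):
    """Create `table` from a CSV with a header row and insert every row as text."""
    reader = csv.reader(handle)
    header = next(reader)
    cols = ", ".join(f'"{c}"' for c in header)
    placeholders = ", ".join("?" for _ in header)
    conn.execute(f'CREATE TABLE "{table}" ({cols})')
    conn.executemany(f'INSERT INTO "{table}" VALUES ({placeholders})', reader)


# Toy stand-ins for opsins.csv and heterologous.csv (hypothetical columns).
opsins_csv = (
    "opsin_id,species,aa_sequence\n"
    "1,Example species A,MNGTEG\n"
    "2,Example species B,MNGSEG\n"
)
het_csv = "hetexp_id,opsin_id,lambda_max\n10,1,500\n11,2,485\n"

conn = sqlite3.connect(":memory:")
load_csv(conn, "opsins", io.StringIO(opsins_csv))
load_csv(conn, "heterologous", io.StringIO(het_csv))

# Custom dataset: sequences whose measured lambda_max is below 490 nm.
rows = conn.execute(
    """
    SELECT o.species, o.aa_sequence, CAST(h.lambda_max AS REAL)
    FROM heterologous AS h
    JOIN opsins AS o ON o.opsin_id = h.opsin_id
    WHERE CAST(h.lambda_max AS REAL) < 490
    """
).fetchall()
```

Values are stored as text here for simplicity, which is why the query casts lambda_max to a number before comparing.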

Clone VPOD

git clone https://github.com/VisualPhysiologyDB/visual-physiology-opsin-db.git
cd visual-physiology-opsin-db
# Start exploring with vpod_main_wf.ipynb

Workflows & Analysis Notebooks

Primary ML Workflow

vpod_main_wf.ipynb is the primary notebook for users. It contains a full pipeline for creating a local instance of VPOD using SQLite, formatting datasets, and training ML models using deepBreaks.
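
To give a sense of how aligned sequences can become numeric features for a regression model, here is a minimal one-hot encoding sketch in plain Python. It is not the deepBreaks API and does not reproduce the notebook's preprocessing; the function name, feature naming scheme, and alphabet are assumptions chosen for illustration.

```python
# 20 standard amino acids plus the alignment gap character.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY-"


def one_hot_encode(aligned_seqs):
    """Return (feature names, 0/1 matrix) for equal-length aligned sequences.

    Each alignment column expands into one indicator per symbol in AMINO_ACIDS.
    Symbols outside that alphabet (for example X) encode as all zeros.
    """
    length = len(aligned_seqs[0])
    if any(len(s) != length for s in aligned_seqs):
        raise ValueError("all sequences must have the same aligned length")
    names = [f"p{pos + 1}_{aa}" for pos in range(length) for aa in AMINO_ACIDS]
    matrix = []
    for seq in aligned_seqs:
        row = []
        for residue in seq.upper():
            row.extend(1 if residue == aa else 0 for aa in AMINO_ACIDS)
        matrix.append(row)
    return names, matrix


# Toy aligned sequences (not real opsins).
names, features = one_hot_encode(["AC-", "AD-"])
```

Each row of the resulting matrix can then be paired with its λ_max value as the regression target.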

Sequence Manipulation

Includes tools for generating chimeras, in-silico deep-mutational-scanning (DMS), and reciprocal mutagenesis to build theoretical opsin variants for model testing.
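
The three operations named above can be sketched in a few lines of Python. The function names and signatures below are hypothetical and are not taken from the VPOD tools; they only illustrate the underlying string manipulations on toy sequences.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def deep_mutational_scan(seq):
    """Yield (label, variant) for every single amino acid substitution of seq."""
    for i, wild_type in enumerate(seq):
        for aa in AMINO_ACIDS:
            if aa != wild_type:
                # Label follows the common wild-type/position/mutant form, e.g. N2K.
                yield f"{wild_type}{i + 1}{aa}", seq[:i] + aa + seq[i + 1:]


def make_chimera(seq_a, seq_b, breakpoint):
    """Join the first `breakpoint` residues of seq_a to the remainder of seq_b."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    return seq_a[:breakpoint] + seq_b[breakpoint:]


def reciprocal_mutants(seq_a, seq_b, position):
    """Swap the residue at a 1-based aligned position between two sequences."""
    i = position - 1
    mutant_a = seq_a[:i] + seq_b[i] + seq_a[i + 1:]
    mutant_b = seq_b[:i] + seq_a[i] + seq_b[i + 1:]
    return mutant_a, mutant_b


# Toy sequences (not real opsins).
variants = dict(deep_mutational_scan("MNG"))
chimera = make_chimera("AAAA", "CCCC", 2)
swapped = reciprocal_mutants("MNG", "MKG", 2)
```

A sequence of length n yields 19n single-substitution variants, so a full scan of a typical opsin produces several thousand theoretical sequences to pass to a trained model.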

Phylogenetic Imputation

R-based tools (Phylogenetic_Imputation.Rmd) to load tree files, make λ_max predictions via phylogenetic imputation, and compare them directly against ML outputs.
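
Once imputed and ML predictions have been exported from their respective tools, comparing them against measured values can be as simple as the sketch below. It uses mean absolute error as one plausible metric; the metric choice and all numbers are placeholders for illustration, not VPOD results.

```python
def mean_absolute_error(predicted, observed):
    """Average absolute difference (in nm) between predictions and measurements."""
    if len(predicted) != len(observed) or not observed:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(p - o) for p, o in zip(predicted, observed)) / len(observed)


# Placeholder lambda_max values in nm (made up for this example).
observed = [500.0, 485.0, 560.0]
imputed = [498.0, 490.0, 555.0]       # e.g. exported from the R workflow
ml_predicted = [501.0, 483.0, 566.0]  # e.g. exported from a trained ML model

mae_imputed = mean_absolute_error(imputed, observed)
mae_ml = mean_absolute_error(ml_predicted, observed)
```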

Mine-n-Match (MNM)

Advanced workflow combining heterologous expression data with in-vivo correlations to augment datasets for more robust taxonomic subset modeling.

Looking to just run predictions? Check out the OPTICS Tool instead.