The Pharmome Map: a comprehensive public dataset for drug-target interaction modeling

Community Article Published November 18, 2025


Elaine McVey Houskeeper

eamcvey

hugging-science

Georgia Channing

cgeorgiaw

hugging-science

The missing pharmome map

Mapping is underway!

Target Classes

Data Structure

Ready to get started?

The missing pharmome map

Therapeutic drugs are ubiquitous, with billions of prescriptions written worldwide every year. In the last 30 days, half of Americans have taken at least one prescription medication, and a quarter have taken at least three (CDC). Despite this, we know relatively little about the scope of effects these drugs have on the human body. Pharmaceutical companies focus their efforts on making new drugs that are safe and effective, typically with a single “target” in mind. The drug development process does not include a comprehensive characterization of the effects a drug may have on the other proteins it encounters in the body.

Imagine a matrix of every approved drug against every potential protein a drug can act on (the “human druggable genome”). Until now, our knowledge of drug activities has been an incredibly sparse version of this matrix. We often – but not always! – know the primary target of a drug and what effect it has on that targeted protein. Sometimes we know about a handful of other “off-target” effects. But most of the drug-target matrix is a mystery.

Imagine instead that we have a measurement for every single one of these drug-target combinations that either confirms the drug is inactive at a particular target, or quantitatively characterizes its activity. This is the pharmome map.

What could we do if we had this map? Understanding all the effects of a drug means we can address questions such as:

What patterns of activity are associated with particular adverse events (AE modeling)?
Is this drug’s mechanism of action simply via the known “on-target” activity, or are its effects driven by interactions across multiple targets (polypharmacology)?
What do patterns of activity suggest about the effects of drug combinations (polypharmacy)?
What drugs act on targets that might make them suitable for treating new indications (drug repurposing)?
Can we predict activity patterns across targets for novel compounds (structure-activity relationship modeling)?

While the pharmome map has been sparsely known to this point, the breadth and depth of data available to join to the pharmome map is vast: clinical trials (ClinicalTrials.gov), adverse event surveillance (FAERS), individual health records (UK Biobank), and many -omics databases. A complete pharmome map unlocks new ways to understand these outcomes and relate them back to drug activity.

Mapping is underway!

EvE Bio is a non-profit (a Focused Research Organization under Convergent Research) that is generating the pharmome map and putting it in the public domain. EvE develops assays in a single format for the members of each target class, then carries out a quantitative high throughput screening and profiling process that provides the final measurements. By approaching dataset creation as the primary goal, EvE is able to provide the type of comprehensive and consistently generated dataset that is ideal for machine learning. This public dataset is already the largest of its kind, and is actively expanding with new data added every other month.

EvE is currently focused on the portion of the pharmome map representing a 1,397-member compound library, primarily composed of FDA-approved small molecule drugs, measured against key classes of drug targets. These target classes were selected because they are therapeutically relevant, druggable by small molecules, and addressable at scale by in vitro assays. The three target classes included are nuclear receptors (NRs), 7-transmembrane receptors (7TMs, also known as GPCRs), and protein kinases (PKs). Collectively these cover the intended targets for more than half of FDA-approved small molecule drugs. Small molecule drugs are those typically available in traditional pill form, such as statins, tamoxifen, and metformin. They are well suited to high-throughput screening and profiling approaches.

Target Classes

Each target class plays a critical role in physiology and pharmacology, and each is addressed with a different assay format that is considered best suited to high-throughput screening for physiologically relevant activity. While the same types of response measures are collected across classes, understanding the uniqueness of each target class will inform data usage. (Details beyond what is included here can be found on EvE Bio’s methods site.)

Nuclear receptors directly regulate gene expression, controlling which proteins get created and influencing the long term behavior of a cell. This is a small (<50 members) but highly impactful receptor class, representing the targets for more than 10% of approved small molecule drugs. NRs are activated by ligand binding, with ligand binding domains that have collectively evolved to bind a diverse set of small molecules. This is advantageous for drug targeting, because it enables selective design not only for specific NRs but also for the type and degree of activation desired. Drugs can be full or partial agonists (increasing the receptor’s activity over the basal level), antagonists (blocking agonism of the receptor), or inverse agonists (reducing the basal activity level). NR activity is measured with biochemical co-factor recruitment assays that reflect the conformational changes induced by ligand binding. These assays are separately configured for agonist and antagonist modes.

7TMs – also known as G-protein coupled receptors (GPCRs) – sense a wide variety of extracellular signals and translate them into intracellular responses, effectively telling the cell what’s happening around it. More than a third of FDA-approved drugs target 7TMs and these address a range of therapeutic areas. 7TMs are a large target class that has evolved to sense a diversity of molecules, making them exceptionally druggable. They have multiple binding sites with diverse ligand possibilities, with the potential for selective activation (via biased agonists that preferentially activate one pathway over another), and the potential for ligands to control agonism and antagonism. Since 7TMs are on the cell surface, drugs need not cross the cell membrane to access them. 7TM activity is measured with cell-based assays that are configured for agonist and antagonist modes.

Protein kinases (PKs) are enzymes that catalyze phosphorylation, effectively controlling many molecular “switches” within cells. Since these switches precisely regulate a wide variety of critical processes, kinases enable computational complexity inside cells via feedback loops, cascades, and signal integration. PKs are a newer and rapidly growing set of targets for FDA-approved drugs and are particularly relevant to cancer, where mutations can lead to dysregulated activation. These disease relevant mutations are included in the pharmome map wherever possible. PK activity is measured with biochemical competition-based ligand binding assays in a single mode (inhibition).

In 2026, the number of 7TM and PK targets in the EvE pharmome mapping dataset will increase ~3x. This will include the addition of G-protein as well as β-arrestin data for 7TMs, which will allow for modeling of biased signaling via these two pathways. (Modern pharmacology experts consider the quantification of biased signaling a key opportunity for improved drug design.)

In addition to NRs, 7TMs, and PKs, the data includes measurements of cell viability for each compound (labeled as target class “Viability”), based on an assay that measured ATP production. These results reflect cytotoxic effects, which are meaningful endpoints unto themselves with regard to compound activity. Additionally, it is critical to interpret 7TM antagonism data in the context of viability results, as cell death can masquerade as antagonism in cell-based assays.

Data Structure

The key response variables are compound activity and potency. Binary activity and maximum observed activity are captured for every compound-assay combination (outcome_is_active, outcome_max_activity). Activity is expressed as a % of maximum activity, in reference to known standard compounds for each assay. For active compounds that have sufficient potency to be measurable in the concentration range tested (is_quantified), four-parameter logistic curve fits result in quantified potency, measured as pXC50 (outcome_potency_pxc50). pXC50 is the negative base-10 log of the IC50/EC50 in molar units, where the IC50/EC50 is the concentration at which half of the maximum activity is reached. A higher pXC50 means higher potency, and 5 (corresponding to 10 μM) is the lowest quantifiable pXC50 in the concentration range used.
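As a rough illustration of how pXC50 relates to concentration and to a four-parameter logistic curve, here is a minimal sketch. The function names and the particular Hill-style parametrization (bottom, top, pXC50, Hill slope) are assumptions made for illustration; the exact curve-fitting form EvE uses is not described here.

```python
import math


def pxc50_from_molar(xc50_molar: float) -> float:
    """Negative base-10 log of an IC50/EC50 expressed in mol/L."""
    return -math.log10(xc50_molar)


def molar_from_pxc50(pxc50: float) -> float:
    """Inverse conversion: pXC50 back to a concentration in mol/L."""
    return 10.0 ** (-pxc50)


def four_param_logistic(conc_molar: float, bottom: float, top: float,
                        pxc50: float, hill: float) -> float:
    """Response at one concentration under an assumed 4PL (Hill) form."""
    xc50 = molar_from_pxc50(pxc50)
    return bottom + (top - bottom) / (1.0 + (xc50 / conc_molar) ** hill)


# 10 uM is 10e-6 mol/L, which maps to a pXC50 of about 5.
print(pxc50_from_molar(10e-6))

# At a concentration equal to the XC50, the response sits halfway
# between bottom and top (about 50 on a 0-100 % activity scale).
print(four_param_logistic(1e-7, bottom=0.0, top=100.0, pxc50=7.0, hill=1.0))
```

Because pXC50 is on a log scale, a one-unit difference corresponds to a tenfold difference in concentration.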

To collect these measurements, EvE uses a two-phase quantitative screening process. All combinations of compounds and assays are included in the screening phase, which includes two replicates of three concentrations. A rules-based progression algorithm determines which compounds advance to the profiling phase, where the concentration range is 10 μM to 10 pM. The full concentration response is effectively censored by the concentration range tested. For low potency compounds, this leads to results that are reported as active, but not quantified.
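One way to carry this censoring into a model is to represent each result as a potency interval rather than a single number. The sketch below assumes that "active but not quantified" can be read as a pXC50 somewhere below the quantifiable limit of 5; that interpretation, the helper name `potency_bounds`, and the treatment of the flags as booleans are assumptions rather than part of the dataset documentation.

```python
from typing import Optional, Tuple

# Lowest quantifiable potency stated in the article (10 uM top concentration).
PXC50_MIN = 5.0

Bounds = Optional[Tuple[Optional[float], Optional[float]]]


def potency_bounds(is_active: bool, is_quantified: bool,
                   pxc50: Optional[float]) -> Bounds:
    """Return (lower, upper) pXC50 bounds, or None for inactive results.

    - inactive: no potency to report
    - active and quantified: a point estimate, lower == upper
    - active but not quantified: assumed left-censored below PXC50_MIN
    """
    if not is_active:
        return None
    if is_quantified and pxc50 is not None:
        return (pxc50, pxc50)
    return (None, PXC50_MIN)


print(potency_bounds(False, False, None))  # inactive
print(potency_bounds(True, True, 7.2))     # quantified
print(potency_bounds(True, False, None))   # active, potency below the window
```

Keeping the three cases distinct avoids treating unquantified actives as either inactive or as having a made-up potency value.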

In addition to cytotoxicity, compounds can interfere with assays in various ways, leading to potentially spurious results. Compounds that appear with suspicious frequency for any given target class and mode are flagged as “high frequency”. They could be removed from the data before model development, but in some cases true activity will be lost in the process. Alternatively, this frequency flag could be treated as a response in itself, in order to develop models that link compound and concentration response characteristics with particular forms of interference. Columns that flag combinations where either cell viability or hit frequency merit consideration are included in the dataset (viability_flag, frequency_flag).
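The two options above (dropping flagged results, or keeping them aside for separate study) can be sketched as a simple split. The rows below are invented toy records, not real measurements, and the assumption that viability_flag and frequency_flag behave as booleans should be checked against the actual dataset schema.

```python
from typing import Dict, List, Tuple

Row = Dict[str, object]

# Toy rows for illustration only; values are made up.
toy_rows: List[Row] = [
    {"compound": "cmpd_A", "outcome_is_active": True,
     "viability_flag": False, "frequency_flag": False},
    {"compound": "cmpd_B", "outcome_is_active": True,
     "viability_flag": True, "frequency_flag": False},
    {"compound": "cmpd_C", "outcome_is_active": True,
     "viability_flag": False, "frequency_flag": True},
]


def split_by_flags(rows: List[Row]) -> Tuple[List[Row], List[Row]]:
    """Separate rows with no flags from rows with either flag set."""
    clean: List[Row] = []
    flagged: List[Row] = []
    for row in rows:
        if row.get("viability_flag") or row.get("frequency_flag"):
            flagged.append(row)
        else:
            clean.append(row)
    return clean, flagged


clean_rows, flagged_rows = split_by_flags(toy_rows)
print(len(clean_rows), len(flagged_rows))
```

Retaining the flagged subset instead of discarding it leaves room to model the flags as responses later.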

The dataset contains one row per combination of target, compound, mode, and mechanism (currently there is only one mechanism per target class, but this will change when data for both signaling pathways is added for 7TMs in 2026). NRs and 7TMs have two modes each, while PKs and cell viability have one. Multiple identifiers are included for both compounds and targets. For compounds: SMILES (a text-based chemical representation), InChIKey, CAS #, UNII, and DrugBank ID. For targets: gene, UniProt ID, and mutant/wildtype indicators.
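Since the data is stored in this long format, a compound-by-target matrix like the one described earlier has to be built by pivoting. A minimal sketch follows, using hypothetical key names (`compound_id`, `target_id`, `mode`) and invented placeholder values; only outcome_max_activity is a column name taken from the description above, so the keys should be replaced with the real column names.

```python
from collections import defaultdict
from typing import Dict, List

# Toy long-format rows; identifiers and values are placeholders.
toy_long_rows: List[dict] = [
    {"compound_id": "cmpd_A", "target_id": "TARGET_1", "mode": "agonist",
     "outcome_max_activity": 95.0},
    {"compound_id": "cmpd_A", "target_id": "TARGET_2", "mode": "agonist",
     "outcome_max_activity": 3.0},
    {"compound_id": "cmpd_B", "target_id": "TARGET_1", "mode": "antagonist",
     "outcome_max_activity": 60.0},
]


def to_matrix(rows: List[dict], mode: str,
              value_key: str = "outcome_max_activity") -> Dict[str, Dict[str, float]]:
    """Pivot long rows into {compound: {target: value}} for a single mode."""
    matrix: Dict[str, Dict[str, float]] = defaultdict(dict)
    for row in rows:
        if row["mode"] == mode:
            matrix[row["compound_id"]][row["target_id"]] = row[value_key]
    return dict(matrix)


agonist_matrix = to_matrix(toy_long_rows, mode="agonist")
print(agonist_matrix)
```

Pivoting one mode at a time keeps agonist and antagonist readouts from overwriting each other; once a second mechanism is added for 7TMs, the same reasoning would apply to mechanism.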

Ready to get started?

from datasets import load_dataset

# Login using e.g. `huggingface-cli login` to access this dataset
ds = load_dataset("eve-bio/drug-target-activity")

Or, view the dataset on Hugging Face here.
