An introduction to recommender systems in bioinformatics
📊 Prefer slides? Check out the interactive presentation: Bioinformatics Recommender Systems Slides
To bridge the gap between traditional recommendation systems and bioinformatics, we first map the foundational entities. In computer science, recommender systems were built for e-commerce and media. In bioinformatics, we apply similar mathematical frameworks to biological entities.
| E-Commerce / Movies | Bioinformatics Translation | Brief Biological Definition |
|---|---|---|
| User | Drug, disease, or patient | The primary entity we want to make a prediction for. |
| Item (Movie) | Protein target, gene, or microbe | The secondary entity that interacts with the primary one. |
| Rating (1-5 Stars) | Binding affinity, DTI, association score | A score showing whether, or how strongly, a biological interaction exists. |
| Movie Genre / Actor | SMILES, Morgan fingerprint, sequence | Features that describe the item/entity. |
| User History | Omics profile, multi-omics profile | Biological measurements for a patient, cell, tissue, or disease. |
flowchart LR
A["User"] --> B["Drug / Disease / Patient"]
B --> C["Item"]
C --> D["Protein / Gene / Microbe"]
D --> E["Rating / Interaction"]
E --> F["Binding / Association Score"]
F --> G["Features: Genre / Actor"]
G --> H["Features: SMILES / Fingerprints / Omics"]
flowchart LR
subgraph "Movie Example"
direction LR
CFA["User A likes Inception + The Matrix"] --> CFB["User B has similar taste"]
CFB --> CFC["User B dislikes Shrek"]
CFC --> CFD["Predict User A may dislike Shrek"]
end
subgraph "Bio Example"
direction LR
CFE["Drug A binds Targets 1, 2, 3"] --> CFF["Drug B has similar target pattern"]
CFF --> CFG["Drug B binds Target 4"]
CFG --> CFH["Predict Drug A may bind Target 4"]
end
| User | Inception | Toy Story | The Matrix | Shrek |
|---|---|---|---|---|
| User A | 5 | 1 | 5 | ? |
| User B | 5 | 2 | 4 | 1 |
| User C | 1 | 5 | 1 | 5 |
Here, $r_{ui}$ means the observed rating from user $u$ for item $i$. For example, $r_{\text{User A}, \text{Inception}} = 5$. The missing ? is what the recommender tries to predict. A matrix factorization model predicts it as:
where $p_u$ is the learned vector for user $u$, $q_i$ is the learned vector for item $i$, and the dot product measures how compatible they are.
Formula symbols:
| Symbol | Meaning |
|---|---|
| $R$ | The original user-item matrix. |
| $r_{ui}$ | Known rating/interactions for user $u$ and item $i$. |
| $\hat{r}_{ui}$ | Predicted rating/interactions for user $u$ and item $i$. |
| $P$ | Matrix of learned user vectors. |
| $Q$ | Matrix of learned item vectors. |
| $p_u$ | The learned vector for one user. |
| $q_i$ | The learned vector for one item. |
| $\lambda$ | Regularization strength; keeps vectors from becoming too large. |
flowchart LR
subgraph "Movie CF"
direction LR
A1["Sparse User-Movie Matrix"] --> B1["ALS Factorization"]
B1 --> C1["User Latent Space"]
B1 --> D1["Movie Latent Space"]
C1 --> E1["Dot Product"]
D1 --> E1
E1 --> F1["Predicted Movie Ratings"]
end
? means the interaction is untested.| Drug | Protein T1 | Protein T2 | Protein T3 | Protein T4 |
|---|---|---|---|---|
| Drug A | 1 | 1 | 1 | ? |
| Drug B | 1 | 1 | 1 | 1 |
| Drug C | 0 | 1 | 0 | 0 |
The model learns drug vectors and protein vectors, then predicts the missing value: “Does Drug A bind Protein T4?”
flowchart LR
subgraph "Bioinformatics CF"
direction LR
A2["Sparse DTI Matrix"] --> B2["PMF / LMF / NRLMF"]
B2 --> C2["Drug Latent Space"]
B2 --> D2["Protein Latent Space"]
C2 --> E2["Dot Product / Probability"]
D2 --> E2
E2 --> F2["Predicted Binding / Interaction"]
end
flowchart LR
subgraph "Movie Example"
direction LR
CBFA["Inception"] --> CBFB["Features: Sci-Fi, Nolan, dreams"]
CBFB --> CBFC["Compare movie profiles"]
CBFC --> CBFD["Recommend Interstellar"]
end
subgraph "Bio Example"
direction LR
CBFE["FDA-approved Drug A"] --> CBFF["Features: SMILES + Morgan fingerprint"]
CBFF --> CBFG["High structural similarity"]
CBFG --> CBFH["Suggest similar therapeutic targets"]
end
| Movie | Genre | Director | Keywords |
|---|---|---|---|
| Inception | Sci-Fi | C. Nolan | dreams, mind-bending |
| Interstellar | Sci-Fi | C. Nolan | space, time-travel |
| Toy Story | Animation | John Lasseter | toys, kids |
Formula symbols:
| Symbol | Meaning | ||||||||
|---|---|---|---|---|---|---|---|---|---|
| $A$ | Feature vector for the first item, such as the TF-IDF vector for Inception. | ||||||||
| $B$ | Feature vector for the second item, such as the TF-IDF vector for Interstellar. | ||||||||
| $A \cdot B$ | Dot product; large when both vectors have high values in the same feature positions. | ||||||||
| $ | A | $, $ | B | $ | Vector lengths. Dividing by them makes the score depend on direction/profile similarity, not just vector size. | ||||
| $\text{Cosine}(A,B)$ | Similarity score from -1 to 1 in general, and usually 0 to 1 for non-negative feature vectors like TF-IDF. |
flowchart LR
subgraph "Movie CBF"
direction LR
M1["Inception: Sci-Fi, Nolan"] --> V1["TF-IDF Vector 1"]
M2["Interstellar: Sci-Fi, Nolan"] --> V2["TF-IDF Vector 2"]
V1 -->|Cosine Similarity| S1["Similarity Score: 0.95"]
V2 --> S1
S1 --> R1["Recommend Interstellar"]
end
Formula symbols:
| Symbol | Meaning | ||||||||
|---|---|---|---|---|---|---|---|---|---|
| $A$ | Binary fingerprint vector for the first molecule, such as Drug A. | ||||||||
| $B$ | Binary fingerprint vector for the second molecule, such as Drug B. | ||||||||
| $A \cdot B$ | Number of fingerprint bits that are 1 in both molecules; shared chemical patterns. | ||||||||
| $ | A | ^2$ | Number of 1 bits in Drug A’s fingerprint. | ||||||
| $ | B | ^2$ | Number of 1 bits in Drug B’s fingerprint. | ||||||
| $ | A | ^2 + | B | ^2 - A \cdot B$ | Number of bit positions where at least one molecule has a 1. | ||||
| $T(A,B)$ | Tanimoto similarity, usually between 0 and 1; higher means more chemically similar. |
flowchart LR
subgraph "Bioinformatics CBF"
direction LR
D1["Drug A: Known Target X"] -->|Morgan Fingerprint| V3["Bit Vector 1: 10110"]
D2["Drug B: Unknown Target"] -->|Morgan Fingerprint| V4["Bit Vector 2: 10100"]
V3 -->|Tanimoto Coefficient| S2["Similarity Score: 0.75"]
V4 --> S2
S2 --> R2["Predict Drug B binds Target X"]
end
flowchart LR
subgraph "Movie Example"
direction LR
GMA["User A"] -->|Watched| GMB["Inception"]
GMB -->|Starring| GMC["Leonardo DiCaprio"]
GMC -->|Starred In| GMD["Titanic"]
GMD --> GME["Recommend Titanic"]
end
subgraph "Bio Example"
direction LR
GBA["Drug A"] -->|Treats| GBB["Disease X"]
GBB -->|Caused by| GBC["Gene Y"]
GBC -->|Encodes| GBD["Protein Y"]
GBD --> GBE["Check Drug A against Protein Y"]
end
flowchart LR
subgraph "Movie Graph"
direction LR
U1(("User A")) -- Watched --> M1(("Inception"))
M1 -- Starring --> A1(("Leonardo DiCaprio"))
A1 -- Starred In --> M2(("Titanic"))
U1 -. Short path .-> M2
end
Bio formula explanation:
| Symbol | Meaning |
|---|---|
| $p_0$ | Starting probability vector; for example, all probability starts at the disease node. |
| $p_t$ | Probability distribution over all nodes after $t$ steps. |
| $p_{t+1}$ | Updated probability distribution after one more step. |
| $W$ | Transition matrix; each entry says how likely the walker is to move from one node to another. |
| $c$ | Restart probability; higher $c$ keeps the walk closer to the starting node. |
| $(1-c)Wp_t$ | The part that walks through the graph. |
| $cp_0$ | The part that jumps back to the start. |
| Steady state | The final stable probability distribution when repeated updates barely change $p_t$. High-score nodes are close or strongly connected to the start. |
flowchart LR
subgraph "Bioinformatics Graph"
direction LR
Dise(("Disease Node")) -- Associated --> G1(("Known Gene"))
G1 -- PPI Edge --> P1(("Protein A"))
P1 -- PPI Edge --> P2(("Candidate Target"))
Dise -. RWR High Probability .-> P2
end
flowchart LR
subgraph "Movie Example"
direction LR
AEM1["Messy watch history"] --> AEM2["Autoencoder compresses taste"]
AEM2 --> AEM3["Latent vector"]
AEM3 --> AEM4["Reconstruct missing ratings"]
end
subgraph "Bio Example"
direction LR
AEB1["Noisy RNA-seq profile"] --> AEB2["Denoising autoencoder"]
AEB2 --> AEB3["Cell-state vector"]
AEB3 --> AEB4["Predict missing gene/drug associations"]
end
flowchart LR
subgraph "Movie Autoencoder"
direction LR
I1["Sparse User Ratings"] --> E1("Encoder NN")
E1 --> B1(("Latent Vector<br/>Taste Code"))
B1 --> D1("Decoder NN")
D1 --> O1["Dense Predicted Ratings"]
end
Meaning: $x$ is the noisy or incomplete profile, $\hat{x}$ is the reconstructed profile, $W$ and $b$ are learned neural-network parameters, and $\sigma$ is a nonlinear activation function. Training makes $\hat{x}$ close to the reliable parts of $x$, so the reconstructed output can estimate missing entries.
flowchart LR
subgraph "Bioinformatics Autoencoder"
direction LR
I2["Noisy Multi-Omics Vector"] --> E2("Encoder NN")
E2 --> B2(("Latent Vector<br/>Cell State"))
B2 --> D2("Decoder NN")
D2 --> O2["Cleaned, Completed DTI Profile"]
end
flowchart LR
subgraph "Movie Example"
direction LR
GNNM1["Movie node: Inception"] --> GNNM2["Receives actor + director messages"]
GNNM2 --> GNNM3["Updated movie embedding"]
GNNM3 --> GNNM4["Recommend similar graph-neighbor movies"]
end
subgraph "Bio Example"
direction LR
GNNB1["Molecular graph"] --> GNNB2["Atoms pass bond messages"]
GNNB2 --> GNNB3["Drug embedding"]
GNNB3 --> GNNB4["Predict toxicity or side effect"]
end
for node in graph:
neighbor_messages = SUM(features of neighbors)
node.features = NeuralNet(node.features + neighbor_messages)
flowchart LR
subgraph "Movie GNN Message Passing"
direction LR
A["Actor Node Vector"] -->|Pass Info| M["Movie Node"]
D["Director Node Vector"] -->|Pass Info| M
M -->|Aggregate & Update| M_new["Updated Movie Vector"]
end
Before the formula, the graph is represented as an adjacency matrix. If atom 1 is bonded to atom 2, then the matrix has an edge entry for that connection. A self-looped adjacency matrix means we also add a node-to-itself edge for every node. In matrix form, this is usually $\tilde{A} = A + I$, where $I$ is the identity matrix. This lets each node keep its own current features while also receiving messages from neighbors.
\[H^{(l+1)} = \sigma\left(\tilde{D}^{-\frac{1}{2}} \tilde{A} \tilde{D}^{-\frac{1}{2}} H^{(l)} W^{(l)}\right)\]Formula symbols:
| Symbol | Meaning |
|---|---|
| $A$ | Adjacency matrix; says which nodes are connected by edges. In a molecule, it says which atoms are bonded. |
| $I$ | Identity matrix; has 1s on the diagonal and 0s elsewhere. Adding it creates self-connections. |
| $\tilde{A}$ | Self-looped adjacency matrix, usually $A + I$. It contains normal edges plus each node’s edge to itself. |
| $D$ | Degree matrix; a diagonal matrix where each diagonal value is the number of neighbors for a node. |
| $\tilde{D}$ | Degree matrix computed from $\tilde{A}$. |
| $H^{(l)}$ | Feature matrix at layer $l$; each row is a node’s current embedding/features. |
| $W^{(l)}$ | Trainable weight matrix for layer $l$. |
| $\sigma$ | Nonlinear activation, such as ReLU. |
| $H^{(l+1)}$ | Updated node features after message passing. |
flowchart LR
subgraph "Bioinformatics GNN Message Passing"
direction LR
At1["Carbon Atom Vector"] -->|Bond Edge| At2["Oxygen Atom"]
At3["Nitrogen Atom Vector"] -->|Bond Edge| At2
At2 -->|Aggregate & Apply ReLU| At2_new["Updated Oxygen Vector"]
At2_new -->|Global Pooling| Pred["Toxicity Prediction"]
end
flowchart LR
subgraph "Movie Example"
direction LR
RLM1["Show Inception"] --> RLM2["User watches 40 minutes"]
RLM2 --> RLM3["Positive reward"]
RLM3 --> RLM4["Policy recommends more sci-fi thrillers"]
end
subgraph "Bio Example"
direction LR
RLB1["Start molecule"] --> RLB2["Add atom or ring"]
RLB2 --> RLB3["Score binding + toxicity + QED"]
RLB3 --> RLB4["Keep actions that improve drug-likeness"]
end
flowchart LR
subgraph "Movie RL"
direction LR
Agent1["Recommender Agent"] -->|Action: Show Movie| Env1["Environment: User"]
Env1 -->|Reward: Watch Time| Agent1
Env1 -->|Next State: User Bored| Agent1
end
Meaning: $Q(s,a)$ is the expected long-term reward for taking action $a$ in state $s$. In molecule generation, the state can be the current molecule, the action can be adding an atom or bond, and the reward can combine binding affinity, toxicity, synthetic accessibility, and QED.
Formula symbols:
| Symbol | Meaning |
|---|---|
| $s$ | Current state. In drug design, this can be the current partial molecule. |
| $a$ | Current action. For example, add an atom, add a bond, or stop generation. |
| $Q(s,a)$ | Expected long-term value of taking action $a$ in state $s$. |
| $\mathbb{E}$ | Expected value; average outcome over possible future paths. |
| $r_{t+1}$ | Immediate reward after taking the action, such as better binding or lower toxicity. |
| $\gamma$ | Discount factor between 0 and 1; controls how much future rewards matter. |
| $s_{t+1}$ | Next state after the action, such as the updated molecule. |
| $a’$ | A possible next action from the next state. |
| $\max_{a’} Q(s_{t+1}, a’)$ | Best predicted future value after reaching the next state. |
| $s_t = s, a_t = a$ | Condition saying we are evaluating the case where the current state is $s$ and the current action is $a$. |
flowchart LR
subgraph "Bioinformatics RL"
direction LR
Agent2["Generator Agent"] -->|Action: Add Benzene Ring| Env2["Environment: Chemical Simulator"]
Env2 -->|Reward: High Binding Affinity + Good QED| Agent2
Env2 -->|State: Updated Molecular Graph| Agent2
end
A DTI asks whether a drug interacts with a biological target, usually a protein. It can be binary (binds / does not bind) or continuous (how strongly it binds).
Example DTI row:
| Drug | EGFR | BRAF | ACE2 |
|---|---|---|---|
| Drug A | 1 | 0 | ? |
Here, ? means the interaction is unknown and may be predicted.
A protein that a drug tries to bind, block, activate, or modify. For example, a cancer drug may target EGFR, a protein involved in cell-growth signaling.
A DNA instruction that can be used to make RNA and often a protein. A disease gene is a gene whose mutation or abnormal activity is linked to a disease.
Binding affinity describes how strongly a drug binds to a target.
| Term | Simple Meaning |
|---|---|
| Kd | Dissociation constant. Lower Kd means tighter binding. |
| IC50 | Concentration needed to inhibit 50% of a biological activity. Lower IC50 usually means stronger effect. |
Example: If Drug A has IC50 = 10 nM and Drug B has IC50 = 1000 nM for the same target, Drug A is usually considered more potent for that assay.
SMILES is a text string that represents a molecule.
Examples:
| Molecule | Example SMILES |
|---|---|
| Ethanol | CCO |
| Benzene | c1ccccc1 |
| Aspirin | CC(=O)Oc1ccccc1C(=O)O |
The recommender cannot directly “see” a molecule, so SMILES is one way to convert chemistry into machine-readable input.
A Morgan fingerprint is a binary vector describing circular neighborhoods around atoms. It is often computed with RDKit.
Toy example:
| Molecule | Fingerprint |
|---|---|
| Drug A | 10110010 |
| Drug B | 10100010 |
Matching 1s mean the molecules share local chemical patterns. Real fingerprints are usually much longer, such as 1024 or 2048 bits.
The Tanimoto coefficient measures similarity between binary chemical fingerprints.
For binary vectors:
\[T(A, B) = \frac{\text{shared 1 bits}}{\text{1 bits in A or B}}\]Toy example:
101100 101000 TF-IDF means Term Frequency-Inverse Document Frequency. It gives high weight to words that are common in one document but uncommon across all documents.
Movie example:
| Movie | Keywords | TF-IDF-like Feature |
|---|---|---|
| Inception | dreams, mind, Nolan | high weight on dreams |
| Interstellar | space, time, Nolan | high weight on space |
In biology, a similar idea can be used for text mining biomedical abstracts or drug descriptions.
An omics profile is a vector of biological measurements.
Example RNA-seq profile:
| Sample | TP53 | EGFR | MYC | BRCA1 |
|---|---|---|---|---|
| Patient 1 | 8.2 | 2.1 | 10.5 | 4.4 |
Each number may represent gene expression level. A recommender can use this vector as a patient or disease feature.
RNA-seq measures which genes are active by counting RNA molecules. If a gene has many RNA reads, it is likely highly expressed in that sample.
Multi-omics combines multiple biological layers.
| Layer | What It Measures | Example Feature |
|---|---|---|
| Genomics | DNA variants | BRAF V600E mutation |
| Transcriptomics | RNA expression | EGFR expression = 2.1 |
| Proteomics | Protein abundance | Protein X high |
| Metabolomics | Small molecules | Lactate high |
| Clinical | Patient information | Age, stage, survival |
The problem is that these layers have different units, missing values, noise, and batch effects. Autoencoders and other representation-learning methods try to compress them into one shared latent profile.
A protein-protein interaction network is a graph where nodes are proteins and edges mean physical or functional interaction.
Example:
flowchart LR
P1["Protein A"] --> P2["Protein B"]
P2 --> P3["Protein C"]
P1 --> P3
RWR repeatedly mixes two behaviors:
Restart is useful because it keeps the ranking local. In disease-gene prediction, if the start node is a disease, high-scoring genes are those that remain close to the disease through many possible paths.
Logistic Matrix Factorization (LMF) predicts a probability instead of a star rating. It is useful for binary data such as “drug binds target” vs. “drug does not bind target.”
QED estimates how drug-like a molecule is. It combines properties such as molecular weight, polarity, aromatic rings, and other chemistry rules into one score.
Toxicity is the chance that a molecule causes harmful biological effects. In drug design, high binding affinity alone is not enough; a molecule also needs acceptable toxicity, solubility, stability, and synthesizability.
Drug repurposing means finding a new disease use for an existing drug. It is attractive because the safety profile of the drug may already be partially known.
De novo drug design means designing a new molecule from scratch instead of selecting from existing drugs.