RevBench - Revilico

Why Use This Engine?

In the documentation below, we will use Revilico’s RevBench engine to evaluate how well computational docking and co-folding predictions align with experimental bioactivity measurements. RevBench automates the full benchmarking pipeline: retrieving experimental assay data from public databases, merging it with prediction outputs, and computing a comprehensive set of statistical metrics that reveal where a given computational method succeeds or fails for a target of interest.

Background

Virtual screening methods such as docking and co-folding produce ranked lists of compounds with associated predicted scores. Evaluating whether these rankings meaningfully reflect experimental binding data is a non-trivial challenge. Experimental IC50, Ki, and Kd values are measured across heterogeneous assay conditions, stored in multiple public databases, and distributed at different scales. RevBench addresses this by standardizing the benchmarking workflow around a single protein target defined by its PDB ID, collecting bioactivity data from ChEMBL, PubChem, BindingDB, and IUPHAR, and computing statistically rigorous correlation, discrimination, and enrichment metrics against any prediction input the user supplies.

The platform supports two prediction modalities. Boltz2 co-folding predictions provide a predicted pIC50, a confidence score, an affinity probability, and an interface pTM (ipTM) score per compound. AutoDock Vina docking predictions provide binding free energy estimates (kcal/mol) from static, flexible, and ensemble docking runs. Both modalities are evaluated on the same experimental benchmark set, enabling direct and consistent comparison.

Target Setup and Structure Preparation

The benchmarking workflow begins with a PDB ID. The engine retrieves target metadata from RCSB PDB, UniProt, ChEMBL, and NCBI, downloads the co-crystal structure, and extracts the primary protein chain and the co-crystal inhibitor as separate PDB files. The docking box is computed automatically: it is centered on the ligand centroid and extended by 10 Angstroms in each dimension to form a cubic grid suitable for AutoDock Vina configuration.
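The box computation can be illustrated with a minimal sketch. It is not RevBench's implementation: the function names are hypothetical, every atom is weighted equally, and "extending by 10 Angstroms" is assumed to mean 10 Angstroms on each side of the centroid, giving a 20 Angstrom cube. The output keys match the `center_*` and `size_*` options of an AutoDock Vina configuration file.

```python
from statistics import mean


def ligand_centroid(pdb_text):
    """Return the unweighted mean (x, y, z) of all ATOM/HETATM records."""
    coords = []
    for line in pdb_text.splitlines():
        if line.startswith(("ATOM", "HETATM")):
            # Fixed-width PDB columns 31-38, 39-46, 47-54 hold x, y, z.
            coords.append(
                (float(line[30:38]), float(line[38:46]), float(line[46:54]))
            )
    if not coords:
        raise ValueError("no atom records found")
    return tuple(mean(axis) for axis in zip(*coords))


def vina_box(pdb_text, extension=10.0):
    """Build a cubic docking box centered on the ligand centroid."""
    cx, cy, cz = ligand_centroid(pdb_text)
    # ASSUMPTION: the box reaches `extension` Angstroms on each side of the
    # centroid, so every edge is 2 * extension long.
    edge = 2.0 * extension
    return {
        "center_x": cx, "center_y": cy, "center_z": cz,
        "size_x": edge, "size_y": edge, "size_z": edge,
    }


def pdb_atom_line(serial, name, x, y, z):
    """Format a synthetic HETATM record with correctly aligned columns."""
    return f"HETATM{serial:5d} {name:<4s} LIG A{1:4d}    {x:8.3f}{y:8.3f}{z:8.3f}"


# Synthetic three-atom "ligand" used only to exercise the functions.
ligand = "\n".join([
    pdb_atom_line(1, "C1", 0.0, 0.0, 0.0),
    pdb_atom_line(2, "C2", 2.0, 4.0, 6.0),
    pdb_atom_line(3, "C3", 4.0, 8.0, 12.0),
])
box = vina_box(ligand)
for key, value in box.items():
    print(f"{key} = {value:.3f}")
```

If the engine instead pads the ligand's bounding box by 10 Angstroms, only the `edge` calculation would change.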

Experimental Dataset Assembly

Bioactivity data is retrieved from four databases: ChEMBL (paginated activity records by target ChEMBL ID), PubChem (assay results by gene ID), BindingDB (ligand affinities by UniProt ID, preferring isomeric SMILES), and IUPHAR/GtoPdb (ligand interactions). IUPHAR/GtoPdb pKd, pKi, and pEC50 values are converted to nanomolar units via

\text{nM} = 10^{9 - p}

All records are deduplicated to one row per unique SMILES, prioritizing IC50 over Ki and Kd when multiple assay types exist for the same compound.

Activity labels are assigned by the following rules in priority order: if the database provides an explicit active or inactive outcome it is used directly; the compound is labeled inactive if comments indicate no activity, if IC50 exceeds 100 micromolar, if HTS percent inhibition is below 10%, or if biophysical comments indicate no binding; otherwise the compound is labeled active.

The final benchmark set uses a 9:1 inactive-to-active ratio for enrichment analysis to match industry-standard virtual screening conditions.
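The deduplication and labeling rules above can be sketched as follows. This is an illustration under stated assumptions, not the engine's code: the record keys (`smiles`, `assay_type`, `value_nm`, `outcome`, `comment`, `percent_inhibition`) and the comment phrases are hypothetical, Ki is assumed to outrank Kd, SMILES strings are assumed to be canonicalized already, and the first record wins when two records share the same assay type.

```python
# ASSUMPTION: IC50 > Ki > Kd. The text only states that IC50 comes first.
ASSAY_PRIORITY = {"IC50": 0, "Ki": 1, "Kd": 2}

# Hypothetical phrases standing in for "no activity" / "no binding" comments.
NO_ACTIVITY_PHRASES = ("not active", "no activity", "no binding", "inactive")

IC50_INACTIVE_THRESHOLD_NM = 100_000  # 100 micromolar expressed in nM


def deduplicate(records):
    """Keep one record per SMILES, preferring the highest-priority assay type."""
    best = {}
    for record in records:
        rank = ASSAY_PRIORITY.get(record.get("assay_type"), len(ASSAY_PRIORITY))
        key = record["smiles"]
        if key not in best or rank < best[key][0]:
            best[key] = (rank, record)
    return [record for _, record in best.values()]


def label_activity(record):
    """Apply the labeling rules in priority order."""
    # 1. An explicit database outcome is used directly.
    if record.get("outcome") in ("active", "inactive"):
        return record["outcome"]
    # 2. Comments that indicate no activity or no binding.
    comment = (record.get("comment") or "").lower()
    if any(phrase in comment for phrase in NO_ACTIVITY_PHRASES):
        return "inactive"
    # 3. IC50 above 100 micromolar.
    value = record.get("value_nm")
    if (record.get("assay_type") == "IC50" and value is not None
            and value > IC50_INACTIVE_THRESHOLD_NM):
        return "inactive"
    # 4. HTS percent inhibition below 10%.
    inhibition = record.get("percent_inhibition")
    if inhibition is not None and inhibition < 10:
        return "inactive"
    # 5. Everything else is treated as active.
    return "active"


# Made-up records used only to exercise the rules.
records = [
    {"smiles": "CCO", "assay_type": "Ki", "value_nm": 50.0},
    {"smiles": "CCO", "assay_type": "IC50", "value_nm": 200_000.0},
    {"smiles": "c1ccccc1", "assay_type": "Kd", "value_nm": 10.0},
    {"smiles": "CCN", "assay_type": "IC50", "value_nm": 5.0,
     "comment": "No binding detected by SPR"},
]
for record in deduplicate(records):
    print(record["smiles"], record["assay_type"], label_activity(record))
```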

Benchmarking Metrics

Parity Plot and Correlation Analysis

Experimental pK values are computed from quantitative measurements as:

\text{pK} = -\log_{10}\left(\text{value}_{\text{nM}} \times 10^{-9}\right)
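As a quick check of the unit conversions, the pK formula and the p-to-nanomolar conversion used for IUPHAR/GtoPdb records are inverses of each other. The function names below are illustrative only.

```python
import math


def nm_to_pk(value_nm):
    """pK = -log10(value in mol/L), with the input given in nanomolar."""
    if value_nm <= 0:
        raise ValueError("concentration must be positive")
    return -math.log10(value_nm * 1e-9)


def p_to_nm(p_value):
    """Inverse conversion: nM = 10 ** (9 - p)."""
    return 10 ** (9 - p_value)


for nm in (1.0, 10.0, 1000.0, 100_000.0):
    pk = nm_to_pk(nm)
    print(f"{nm:>10.1f} nM -> pK {pk:.2f} -> {p_to_nm(pk):.1f} nM")
```

On this scale the 100 micromolar inactivity threshold corresponds to a pK of 4, and each tenfold gain in potency adds one pK unit.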

Predicted scores are aligned to the same scale. For Boltz2, the reported pIC50 is used directly. For Vina static docking, binding free energy in kcal/mol is converted to an approximate pK unit using the linear relationship

\text{pK} = (\Delta G + 5.89) / 1.364

The following correlation and error metrics are reported with 95% bootstrap confidence intervals (1,000 resamples). Pearson r measures linear correlation between predicted and observed potency:

r = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum (x_i - \bar{x})^2 \sum (y_i - \bar{y})^2}}

Spearman rho measures rank-based correlation and is robust to outliers. Kendall tau measures rank concordance across all compound pairs. RMSE and MAE quantify prediction error magnitude:

\text{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}, \quad \text{MAE} = \frac{1}{n}\sum_{i=1}^{n}|y_i - \hat{y}_i|

Lin’s Concordance Correlation Coefficient (CCC) combines precision and accuracy, penalizing predictions that correlate well but are systematically shifted. With r the Pearson correlation defined above, and \mu and \sigma^2 the mean and variance of each series:

\text{CCC} = \frac{2 r \sigma_x \sigma_y}{\sigma_x^2 + \sigma_y^2 + (\mu_x - \mu_y)^2}
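The formulas above translate directly into code. The sketch below uses only the standard library and makes two assumptions that the text does not confirm: the confidence interval is a percentile bootstrap over resampled compound pairs, and CCC uses population (1/n) variances. The numerator of CCC is written as twice the covariance, which equals the correlation term times \sigma_x \sigma_y.

```python
import math
import random


def _mean(values):
    return sum(values) / len(values)


def pearson_r(x, y):
    mx, my = _mean(x), _mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    denom = math.sqrt(sxx * syy)
    # Undefined when either series is constant.
    return sxy / denom if denom > 0 else float("nan")


def rmse(y, y_hat):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(y, y_hat)) / len(y))


def mae(y, y_hat):
    return sum(abs(a - b) for a, b in zip(y, y_hat)) / len(y)


def ccc(x, y):
    n = len(x)
    mx, my = _mean(x), _mean(y)
    var_x = sum((a - mx) ** 2 for a in x) / n
    var_y = sum((b - my) ** 2 for b in y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / n
    denom = var_x + var_y + (mx - my) ** 2
    return 2 * cov / denom if denom > 0 else float("nan")


def bootstrap_ci(metric, x, y, n_resamples=1000, alpha=0.05, seed=0):
    """Percentile bootstrap CI, resampling (x, y) pairs with replacement."""
    rng = random.Random(seed)
    n = len(x)
    stats = []
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]
        value = metric([x[i] for i in idx], [y[i] for i in idx])
        if not math.isnan(value):  # skip degenerate resamples
            stats.append(value)
    stats.sort()
    lower = stats[int((alpha / 2) * len(stats))]
    upper = stats[min(len(stats) - 1, int((1 - alpha / 2) * len(stats)))]
    return lower, upper


# Made-up pK values: predictions are perfectly correlated but shifted by +1.
observed = [5.0, 6.0, 7.0, 8.0, 9.0]
predicted = [6.0, 7.0, 8.0, 9.0, 10.0]
print("Pearson r:", pearson_r(predicted, observed))
print("CCC:", ccc(predicted, observed))
print("RMSE:", rmse(observed, predicted), "MAE:", mae(observed, predicted))
print("CCC 95% CI:", bootstrap_ci(ccc, predicted, observed))
```

The toy data shows why CCC is reported alongside Pearson r: a constant one-unit offset leaves r at its maximum while CCC drops, because the (\mu_x - \mu_y)^2 term in the denominator penalizes the shift.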

Enrichment Factor

The enrichment factor (EF) at a given fraction χ of the ranked library measures how efficiently a virtual screening ranking recovers known actives compared to random selection:

\text{EF}(\chi) = \frac{\text{fraction of all actives recovered in the top } \chi \text{ of the ranking}}{\chi}
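A minimal sketch of the enrichment factor calculation follows. It assumes that higher scores mean stronger predicted binding (docking energies would need to be negated first) and that the size of the top slice is rounded to the nearest whole compound; neither detail is specified above.

```python
def enrichment_factor(scores, is_active, fraction):
    """EF = (share of all actives found in the top slice) / fraction."""
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    n_actives = sum(is_active)
    if n_actives == 0:
        raise ValueError("at least one active is required")
    # ASSUMPTION: round to the nearest compound, with a minimum of one.
    n_top = max(1, round(fraction * len(scores)))
    # ASSUMPTION: higher score = better rank.
    ranked = sorted(zip(scores, is_active), key=lambda pair: pair[0], reverse=True)
    hits = sum(active for _, active in ranked[:n_top])
    return (hits / n_actives) / fraction


# Synthetic library of 100 compounds at a 9:1 inactive-to-active ratio.
scores = [100.0 - i for i in range(100)]
perfect = [i < 10 for i in range(100)]          # all actives ranked first
spread_out = [i % 10 == 0 for i in range(100)]  # actives evenly spread

for fraction in (0.01, 0.05, 0.10):
    print(
        f"top {fraction:.0%}:",
        f"perfect EF = {enrichment_factor(scores, perfect, fraction):.1f},",
        f"evenly spread EF at 10% = {enrichment_factor(scores, spread_out, 0.10):.1f}",
    )
```

With a 9:1 inactive-to-active ratio, actives make up 10% of the library, so the EF cannot exceed 10 at any of the reported cutoffs.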

An EF of 1.0 corresponds to random retrieval. EF values at 1%, 5%, and 10% of the library are reported with bootstrap uncertainty bands.

ROC Curve and AUC

The ROC curve plots true positive rate (TPR) against false positive rate (FPR) as the score threshold is varied. For Boltz2, two ROC curves are generated separately using affinity probability and predicted pIC50 as the ranking signal. AUC of 0.5 indicates random discrimination; AUC of 1.0 indicates perfect separation of actives from inactives.

Confidence Calibration

For Boltz2 predictions, confidence score and ipTM are assessed against per-bin prediction accuracy. Compounds are binned by confidence signal value (0 to 1 in 0.2-width bins), and within each bin the fraction of predictions with absolute error below 0.5 pIC50 units (the success rate) and the mean absolute error are reported. Well-calibrated models show monotonically increasing success rates with increasing confidence.

Correlation Heatmap

A pairwise correlation heatmap is computed across all numeric columns from the merged prediction-experiment dataset. Pearson r, Spearman rho, Kendall tau, R-squared, and CCC are all available as the color metric. Cells with fewer than five paired observations are left blank.
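The ROC AUC and confidence calibration steps described above can be sketched in a few lines. This is an illustration, not the engine's code: the AUC is computed through its pairwise-ranking interpretation (the probability that a randomly chosen active outscores a randomly chosen inactive, with ties counted as one half), and a confidence of exactly 1.0 is assumed to fall into the last bin.

```python
def roc_auc(scores, is_active):
    """AUC as the fraction of active/inactive pairs ranked correctly."""
    pos = [s for s, a in zip(scores, is_active) if a]
    neg = [s for s, a in zip(scores, is_active) if not a]
    if not pos or not neg:
        raise ValueError("need at least one active and one inactive")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def calibration_table(confidence, predicted, observed, n_bins=5, tolerance=0.5):
    """Per-bin success rate (|error| < tolerance) and mean absolute error."""
    bins = [[] for _ in range(n_bins)]
    for c, p, o in zip(confidence, predicted, observed):
        # n_bins=5 gives the 0.2-width bins; c == 1.0 goes to the last bin.
        index = min(int(c * n_bins), n_bins - 1)
        bins[index].append(abs(p - o))
    rows = []
    for i, errors in enumerate(bins):
        if not errors:
            continue  # empty bins are skipped
        rows.append({
            "bin": (i / n_bins, (i + 1) / n_bins),
            "n": len(errors),
            "success_rate": sum(e < tolerance for e in errors) / len(errors),
            "mae": sum(errors) / len(errors),
        })
    return rows


# Made-up values used only to exercise the functions.
auc = roc_auc([0.9, 0.8, 0.1, 0.2, 0.85], [True, True, False, False, False])
print(f"AUC = {auc:.3f}")

table = calibration_table(
    confidence=[0.10, 0.15, 0.90, 0.95],
    predicted=[6.0, 7.0, 8.0, 5.0],
    observed=[7.0, 7.2, 8.1, 5.3],
)
for row in table:
    print(row)
```

The pairwise AUC is quadratic in the number of compounds, which is acceptable for a sketch; a rank-based formulation would scale better on large benchmark sets.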

ADMET and Structural Alert Profiling

In addition to benchmarking predictive accuracy, RevBench profiles the benchmark compound set for drug-likeness and structural liabilities using RDKit. Computed descriptors include molecular weight, LogP, hydrogen bond donors and acceptors, topological polar surface area, rotatable bonds, ring counts, QED (quantitative estimate of drug-likeness), and estimated water solubility (ESOL model). Drug-likeness rules assessed include Lipinski Rule of Five, Veber criteria (rotatable bonds and TPSA), and Ghose filter. Structural alerts are flagged against the PAINS, Brenk, and NIH MLSMR catalogs to identify compounds prone to assay interference or containing known problematic substructures.
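Two of the drug-likeness rules can be illustrated with plain threshold checks. The sketch assumes the descriptors have already been computed (for example with RDKit) and stored under the hypothetical keys shown; it uses the commonly cited cutoffs for the Lipinski Rule of Five (molecular weight at most 500, LogP at most 5, at most 5 donors and 10 acceptors) and the Veber criteria (at most 10 rotatable bonds, TPSA at most 140 square Angstroms). Allowing one Lipinski violation is a common convention and an assumption here, not a documented RevBench setting.

```python
def lipinski_violations(d):
    """Count Rule of Five violations for one compound's descriptors."""
    return sum([
        d["mol_wt"] > 500,
        d["logp"] > 5,
        d["hbd"] > 5,
        d["hba"] > 10,
    ])


def passes_veber(d):
    """Veber criteria: rotatable bonds and topological polar surface area."""
    return d["rotatable_bonds"] <= 10 and d["tpsa"] <= 140


def drug_likeness(d, max_lipinski_violations=1):
    violations = lipinski_violations(d)
    return {
        "lipinski_violations": violations,
        # ASSUMPTION: a single violation is tolerated.
        "lipinski_pass": violations <= max_lipinski_violations,
        "veber_pass": passes_veber(d),
    }


# Hypothetical descriptor values, not computed from real molecules.
small_molecule = {"mol_wt": 350.4, "logp": 2.1, "hbd": 2, "hba": 5,
                  "tpsa": 78.0, "rotatable_bonds": 4}
large_molecule = {"mol_wt": 720.9, "logp": 6.3, "hbd": 6, "hba": 12,
                  "tpsa": 190.0, "rotatable_bonds": 14}

print(drug_likeness(small_molecule))
print(drug_likeness(large_molecule))
```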

Running the Engine

Inputs

Input	Required	Description
PDB ID	Yes	4-character PDB identifier of the target
Boltz2 output	Optional	CSV with `ligand_smiles`, `predicted_pic50`, `confidence_score`, `affinity_probability`, `iptm`
Vina static output	Optional	CSV with `ligand_smiles`, `best_affinity`, `mean_affinity` (kcal/mol)
Vina flexible output	Optional	CSV with per-pose ΔG and CNN affinity/pose scores
Vina ensemble output	Optional	CSV with per-conformation ΔG values across receptor snapshots
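Before uploading a prediction file, it can help to confirm that the expected columns are present. The sketch below checks the Boltz2 and Vina static column names listed in the table; the `read_predictions` helper and the `kind` labels are hypothetical and are not part of RevBench.

```python
import csv
import io

# Column names taken from the Inputs table above.
REQUIRED_COLUMNS = {
    "boltz2": ["ligand_smiles", "predicted_pic50", "confidence_score",
               "affinity_probability", "iptm"],
    "vina_static": ["ligand_smiles", "best_affinity", "mean_affinity"],
}


def read_predictions(handle, kind):
    """Parse a prediction CSV, failing early if a required column is missing."""
    required = REQUIRED_COLUMNS[kind]
    reader = csv.DictReader(handle)
    missing = [c for c in required if c not in (reader.fieldnames or [])]
    if missing:
        raise ValueError(f"{kind} file is missing columns: {missing}")
    rows = []
    for row in reader:
        parsed = {"ligand_smiles": row["ligand_smiles"]}
        # Every required column other than the SMILES is numeric.
        for column in required[1:]:
            parsed[column] = float(row[column])
        rows.append(parsed)
    return rows


# In-memory example file with made-up values.
example = io.StringIO(
    "ligand_smiles,predicted_pic50,confidence_score,affinity_probability,iptm\n"
    "CCO,6.5,0.8,0.7,0.9\n"
)
for row in read_predictions(example, "boltz2"):
    print(row)
```

For a file on disk, the same function accepts a handle from `open(path, newline="")`.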

Outputs

Experimental benchmark set: Deduplicated active/inactive compound set with source database, SMILES, assay type, and measured value
Parity plots: Predicted vs. observed pK scatter plots with OLS regression line and 1-sigma confidence band
Correlation metrics table: Pearson r, Spearman rho, Kendall tau, R-squared, CCC, RMSE, MAE, and bias with 95% CI
Enrichment curves: EF at 1%, 5%, 10% with bootstrap uncertainty bands
ROC curves: With AUC and 95% CI
Calibration plots: Confidence bin vs. success rate and MAE (Boltz2 only)
Correlation heatmap: Pairwise correlations across all numeric prediction and experimental columns
ADMET table: Per-compound physicochemical descriptors, drug-likeness rule results, and structural alerts
Docking-specific plots: CNN vs. physics affinity scatter, per-compound pose distribution boxplots (ensemble), active/inactive score distribution histograms
