High Throughput Screening

Overview

A computational high-throughput screening (HTS) campaign on the Revilico platform follows five sequential phases. Each phase feeds the next, so work through them in order on your first campaign.

Phase	Name	What Happens
1	Target Identification	Identify your biological target and define the therapeutic hypothesis
2	Structure Acquisition & Preparation	Obtain and clean a high-quality protein structure
3	The Four Docking Engines	Understand the tools available and when to use each
4	Calibration	Benchmark each engine against known experimental data
5	Production Run	Screen your compound library and refine to a confirmed lead set

Phase 1 — Target Identification

Before running a single computation, you need a clearly defined biological target and a therapeutic hypothesis. This phase is about narrowing the scientific and commercial landscape down to one specific protein — and knowing exactly how you intend to modulate it.

How Targets Are Selected

Target selection typically draws from three sources:

Literature — peer-reviewed publications, structural genomics datasets, or recently solved crystal structures that reveal a druggable binding site.
Multi-omics data — genomics, transcriptomics, proteomics, or metabolomics evidence that a given protein is causally linked to the disease phenotype of interest.
Market intelligence — financial records, competitive pipelines, and projected indication markets that indicate an unmet therapeutic need and a viable commercial opportunity.

Define Your Therapeutic Hypothesis

Once you have a target, decide explicitly how you want to modulate it before moving forward.

Mechanism — Are you aiming for competitive inhibition, allosteric modulation, activation, or protein-protein interaction (PPI) disruption?
Binding site — Identify the pocket you want to engage. For a kinase this is typically the ATP-binding site (hinge residue and gatekeeper). For a PPI, identify the hot-spot residues that anchor the interface.
Inhibitor type — Type I (DFG-in), Type II (DFG-out), Type III (allosteric), or covalent. This determines which conformational state of the protein you dock into.

A poorly defined therapeutic hypothesis leads to poor pose interpretation and misguided hit selection. Invest time here before running any computational screen.

Phase 2 — Structure Acquisition & Preparation

Your docking engines are only as good as the structure you feed them. A contaminated or incomplete structure will produce misleading results regardless of which engine you run.

Step 1 — Obtain the Structure

Source your structure from one of three routes:

Protein Data Bank (PDB)

Search by protein name, UniProt accession, or indication. Prefer high-resolution crystal structures (below 2.5 Å resolution). A co-crystal structure with a known inhibitor defines the active binding conformation and gives you key contact residues.

UniProt

Confirm the canonical sequence and link out to deposited structures. Pay attention to isoforms and active/inactive state annotations.

Sequence-Based Generation

If no experimental structure exists, generate one from the amino acid sequence using AlphaFold, OpenFold, or Boltz-2 co-folding. A predicted structure contains no co-crystal ligands, waters, or ions to remove, but check the model's confidence around the intended binding pocket before docking into it.

Step 2 — Clean the Structure

PDB structures almost always contain components that must be removed before docking. Open the file in PyMOL or in RevBench and remove all of the following:

Co-crystal ligands — any bound inhibitor, agonist, substrate, or cofactor already present in the structure.
Water molecules — crystallographic waters should be removed for standard docking campaigns.
Ions and buffer components — sodium, chloride, DMSO, glycerol, and all other non-protein atoms.
Excess chains — if the structure is a dimer, trimer, or higher-order complex, keep only the chain you want to dock into (usually Chain A). Remove all others.

Example: You download a kinase structure containing a staurosporine co-crystal ligand, 180 water molecules, two sulfate ions, and four chains (A, B, C, D). Keep only Chain A and delete everything else. The resulting file should contain only the protein backbone and side-chain atoms of your target domain.
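The cleaning rules above can also be applied in a script. The following is a minimal sketch in plain Python, not the RevBench implementation: the function name keep_single_chain is hypothetical, and the code assumes a standard fixed-column PDB file. It keeps only ATOM records for one chain, which drops ligands, waters, ions, and buffer components (all stored as HETATM records) together with the other chains.

```python
def keep_single_chain(pdb_text: str, chain_id: str = "A") -> str:
    """Return PDB text containing only ATOM records for one chain.

    HETATM records (ligands, waters, ions, buffer components) and all
    other chains are dropped. Assumes fixed-column PDB formatting.
    """
    kept = []
    for line in pdb_text.splitlines():
        # Column 22 (string index 21) of an ATOM record is the chain ID.
        if line.startswith("ATOM") and len(line) > 21 and line[21] == chain_id:
            kept.append(line)
    kept.append("END")
    return "\n".join(kept) + "\n"
```

Modified residues such as selenomethionine are also stored as HETATM records and would be removed by this filter, so inspect the result in PyMOL or RevBench before docking.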

Step 3 — Retain Co-Crystal Contact Information

Before deleting a co-crystal ligand, record the key residues it contacts. This contact map is invaluable for downstream pocket definition and flexible-residue selection — and lets you validate your later docking poses against known biology.
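One way to record the contact map before deletion is a simple distance check. The sketch below is an illustration under stated assumptions, not a RevBench feature: the function names, the 4.0 angstrom cutoff, and the ligand residue name passed in are all choices made for this example. It lists every protein residue on the chosen chain that has at least one atom within the cutoff of any ligand atom.

```python
import math


def parse_atoms(pdb_text, record):
    """Yield (chain, residue_name, residue_number, xyz) for ATOM or HETATM lines."""
    for line in pdb_text.splitlines():
        if line.startswith(record):
            xyz = (float(line[30:38]), float(line[38:46]), float(line[46:54]))
            yield line[21], line[17:20].strip(), int(line[22:26]), xyz


def contact_residues(pdb_text, ligand_resname, chain_id="A", cutoff=4.0):
    """Return sorted (residue_name, residue_number) pairs near the ligand."""
    ligand_xyz = [
        xyz
        for _, resname, _, xyz in parse_atoms(pdb_text, "HETATM")
        if resname == ligand_resname
    ]
    contacts = set()
    for chain, resname, resnum, xyz in parse_atoms(pdb_text, "ATOM"):
        if chain != chain_id:
            continue
        # A residue counts as a contact if any of its atoms is within the cutoff.
        if any(math.dist(xyz, lig) <= cutoff for lig in ligand_xyz):
            contacts.add((resname, resnum))
    return sorted(contacts, key=lambda pair: pair[1])
```

Save the returned list alongside the cleaned structure; it is the input for pocket definition and flexible-residue selection later in the campaign.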

Step 4 — Define the Binding Pocket

Use RevPocket, or extract the pocket from the literature or from your co-crystal contact analysis, to define a docking box around the site of interest. Verify that the box encompasses the residues identified as mechanistically critical in Phase 1. The box position and dimensions are passed directly to the docking engine.
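As a sanity check on the box, the following sketch derives a center and dimensions from the coordinates of the contact residues and tests whether a given atom falls inside. How RevPocket represents a box is not described here, so the center-plus-size representation, the function names, and the 5.0 angstrom padding are assumptions for illustration only.

```python
def docking_box(coords, padding=5.0):
    """Return (center, size) of an axis-aligned box around the coordinates.

    coords is a list of (x, y, z) tuples; padding is added on every side.
    """
    xs, ys, zs = zip(*coords)
    center = tuple((max(axis) + min(axis)) / 2 for axis in (xs, ys, zs))
    size = tuple((max(axis) - min(axis)) + 2 * padding for axis in (xs, ys, zs))
    return center, size


def box_contains(center, size, point):
    """True if the point lies inside the box on all three axes."""
    return all(abs(p - c) <= s / 2 for p, c, s in zip(point, center, size))
```

Running box_contains on an atom from each mechanistically critical residue is a quick way to confirm that the box covers the site chosen in Phase 1.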

Phase 3 — The Four Docking Engines

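Revilico provides four complementary engines. The three docking engines (static, flexible, and ensemble) are used sequentially in a production run, faster and broader at the start, slower and more accurate at the end. Co-folding sits outside that funnel and is used when no suitable structure exists or when a co-folded pose is needed.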

Engine	Throughput	Key Outputs	Best Used For
Co-Folding (Boltz-2)	Per target	Predicted complex structure	Novel targets with no crystal structure; protein-ligand co-folding
Static Docking	2M+ compounds	Binding energy (kcal/mol)	High-throughput initial screen; fast GPU-accelerated elimination
Flexible Docking	Up to 50K	Binding energy, CNN affinity, CNN accuracy, intramolecular energy	Hit refinement; binding-site flexibility and steric clash detection
Ensemble Docking	~3K compounds	Binding affinity across MD snapshots	High-confidence lead confirmation; captures protein dynamics

Co-Folding (Boltz-2)

Co-folding takes two inputs, the ligand SMILES string and the protein amino acid sequence, and generates a predicted complex structure as output. Use this engine when no experimental structure exists, or when you want to evaluate a chemotype in a flexible, co-folded conformation rather than a rigid crystal structure. It is most commonly used for novel targets or targets where the binding site is poorly defined.

Static Docking

Static docking is a GPU-accelerated, semi-empirical scoring method. The protein structure is held completely rigid and the ligand is sampled across thousands of orientations within the docking box. Because the protein does not flex, this engine is very fast — it can handle libraries of 2 million compounds or more. Primary output: binding energy in kcal/mol. More negative values indicate stronger predicted binding. Use static docking for the initial wide-net screen. Its role is to eliminate clearly non-binding compounds quickly, not to generate publication-quality poses.
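To get a feel for what a kcal/mol score would mean if it were a true binding free energy, the standard relation ΔG = RT ln(Kd) can be applied. This is only an interpretive aid: docking scores are not rigorous free energies, and nothing here describes how the platform computes them. The function names and the 298.15 K default are assumptions for this sketch.

```python
import math

R_KCAL = 1.987204e-3  # gas constant in kcal/(mol*K)


def energy_to_kd(delta_g_kcal, temperature_k=298.15):
    """Dissociation constant (molar) implied by dG = RT ln(Kd)."""
    return math.exp(delta_g_kcal / (R_KCAL * temperature_k))


def kd_to_energy(kd_molar, temperature_k=298.15):
    """Binding free energy (kcal/mol) implied by a dissociation constant."""
    return R_KCAL * temperature_k * math.log(kd_molar)
```

Under this idealized reading, a score of -8 kcal/mol corresponds to a low-micromolar dissociation constant, which is one way to judge whether a fixed cutoff is plausible for your target.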

Flexible Docking

Flexible docking extends static docking by allowing a small set of selected protein residues (typically 3–5) to move during scoring. This is more computationally expensive, so throughput drops to roughly 50,000 compounds or fewer. Flexible docking reports four outputs:

Binding energy (kcal/mol) — same semi-empirical scoring as static docking.
CNN affinity — a pKi value estimated by a convolutional neural network trained on protein-ligand data. Provides an AI-based heuristic for binding strength that complements the semi-empirical score.
CNN accuracy — a confidence score on the predicted pose. Use this to flag unreliable poses quickly.
Intramolecular energy — measures internal strain within the ligand pose. Highly positive values indicate steric clashes within the compound itself, which typically invalidate the pose.

Select flexible residues that you know engage the ligand based on your co-crystal contact analysis or the literature. Poor residue selection undermines the accuracy advantage of this engine.

Ensemble Docking

Ensemble docking is the most rigorous and computationally intensive method. Before running it, complete a protein-in-water molecular dynamics (MD) simulation using RevMD Aqua. Set the simulation time to 100 ns; 50 ns is the acceptable minimum. The MD run captures how the protein naturally moves in solution. The workflow then has four steps:

Extract trajectory snapshots

Pull snapshots at regular time intervals from the completed MD run.

Pre-process and align

Align the protein snapshots to a common reference frame.

Specify the docking pocket

Define the binding site on each snapshot.

Dock across all snapshots

Each compound is scored across the full ensemble of protein conformations simultaneously.

Ensemble docking provides binding affinity estimates that account for protein flexibility and dynamics. It is the most reliable predictor of binding in cases where the protein is conformationally dynamic. Use it for your final ~3,000 compound set as the last confirmation step before nominating compounds for wet-lab synthesis.
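The "regular time intervals" in the snapshot-extraction step can be made concrete with a small helper that picks evenly spaced frame indices from a trajectory. This is a sketch, not the RevMD Aqua interface: the function name and parameters are hypothetical, and the optional skip_fraction reflects the common practice of discarding early equilibration frames, which the workflow above does not mandate.

```python
def snapshot_frames(n_frames, n_snapshots, skip_fraction=0.0):
    """Return evenly spaced frame indices from a trajectory.

    n_frames: total frames saved by the MD run.
    n_snapshots: how many protein conformations to dock into.
    skip_fraction: leading fraction of the run to ignore (0.0 keeps all).
    """
    start = int(n_frames * skip_fraction)
    usable = n_frames - start
    if n_snapshots < 1 or n_snapshots > usable:
        raise ValueError("n_snapshots must be between 1 and the usable frame count")
    step = usable / n_snapshots
    return [start + int(i * step) for i in range(n_snapshots)]
```

For example, a 100 ns run saved as 1,000 frames and sampled with n_snapshots=10 yields one snapshot every 10 ns.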

Phase 4 — Calibration

Calibration is the single most important quality-control step in any computational screening campaign. Without it, you have no basis for trusting the scores your engines produce. Calibration tells you how well each engine predicts experimental activity for compounds with known data — and therefore how much confidence to place in its predictions for unknowns.

Build a Calibration Set

Collect a set of compounds with known experimental activity against your target. Aim for several hundred to several thousand compounds with associated IC₅₀ or Kᵢ values. Biochemical assay data is preferred over cell-based data, since cell-based potency also depends on factors such as permeability that docking does not model.

ChEMBL

Highly curated with good metadata. Generally trustworthy as a primary source.

PubChem BioAssay

Broad coverage. Verify assay annotations carefully before including.

Literature

Acceptable, but audit assay conditions and metadata. Inconsistent protocols between sources introduce noise.

Use RevBench to scrape and compile these compounds. Export a CSV with at minimum two columns: SMILES and experimental activity value (with units and assay type annotated). The compounds should be chemically diverse and representative of the chemical space you intend to screen.
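Because activity values arrive in mixed units, it helps to put them on a single logarithmic scale before comparing them with engine scores. The sketch below converts a concentration to its negative log10 molar value (pIC₅₀ or pKᵢ). The column names smiles, value, unit, and assay_type are hypothetical and do not describe the RevBench export format.

```python
import csv
import io
import math

UNIT_TO_MOLAR = {"M": 1.0, "mM": 1e-3, "uM": 1e-6, "nM": 1e-9, "pM": 1e-12}


def p_activity(value, unit):
    """Convert a concentration (IC50 or Ki) to -log10 of its molar value."""
    if value <= 0:
        raise ValueError("activity value must be positive")
    return -math.log10(value * UNIT_TO_MOLAR[unit])


def load_calibration(csv_text):
    """Parse calibration CSV text into rows with a unit-free p_activity field."""
    rows = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        rows.append({
            "smiles": row["smiles"],
            "assay_type": row["assay_type"],
            "p_activity": p_activity(float(row["value"]), row["unit"]),
        })
    return rows
```

Keeping assay_type in each row makes it possible to analyze IC₅₀ and Kᵢ measurements separately if mixing them adds noise.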

Run All Four Engines on the Calibration Set

Take the calibration library and run it through each of the four engines exactly as you would run a production screen. Use the same pocket box and settings you defined in Phase 2. Collect all outputs into a master CSV: one row per compound, one column per engine output.
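The master CSV layout (one row per compound, one column per engine output) can be assembled as sketched below. The nested-dictionary input, the compound_id column, and the engine_metric column naming are assumptions made for this example; the actual export format of each engine is not specified here.

```python
import csv
import io


def build_master_csv(engine_outputs):
    """Merge per-engine results into CSV text, one row per compound.

    engine_outputs maps engine name -> {compound_id: {metric_name: value}}.
    Compounds missing from an engine get empty cells for its columns.
    """
    columns = []
    for engine, results in engine_outputs.items():
        for metrics in results.values():
            for metric in metrics:
                column = f"{engine}_{metric}"
                if column not in columns:
                    columns.append(column)
    compound_ids = sorted({cid for res in engine_outputs.values() for cid in res})

    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=["compound_id"] + columns, restval="", lineterminator="\n"
    )
    writer.writeheader()
    for cid in compound_ids:
        row = {"compound_id": cid}
        for engine, results in engine_outputs.items():
            for metric, value in results.get(cid, {}).items():
                row[f"{engine}_{metric}"] = value
        writer.writerow(row)
    return buffer.getvalue()
```

Empty cells make failed or skipped compounds visible, so they can be excluded per engine when the calibration metrics are computed.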

Generate Calibration Plots and Interpret Metrics

In RevBench, upload your combined CSV alongside the experimental activity data and generate calibration plots — predicted score vs. experimental value — for each engine. Evaluate the following four metrics:

Metric	Range	Interpretation
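Pearson r / R²	r: −1 to 1; R²: 0–1 (higher R² is better)	Measures linear correlation between predicted and experimental values. R² > 0.5 is a reasonable baseline for virtual screening. The sign of r depends on conventions: binding energies, where more negative means stronger binding, correlate negatively with pIC₅₀-style activity values.
Spearman Coefficient	−1 to 1 (larger magnitude is better)	Measures rank-order correlation. More important than R² in screening, where ranking compounds correctly matters more than absolute accuracy. The sign follows the same conventions as Pearson r.
RMSE	Lower is better	Root Mean Squared Error. Sensitive to outliers. Use alongside MAE to understand whether a few bad predictions are skewing the average. RMSE and MAE are only meaningful when predicted and experimental values are on the same scale (for example, CNN affinity as pKi against experimental pKi).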
MAE	Lower is better	Mean Absolute Error. A robust average error metric less sensitive to extreme outliers than RMSE.

No engine will be perfect across all assay types. Look for which engine performs best for your specific target class. Compare speed vs. accuracy trade-offs. The calibration data gives you a defensible, data-driven rationale for your engine choices in the production run.
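For readers who want to reproduce the four metrics outside RevBench, the sketch below computes them with the standard library only. It is a reference illustration, not RevBench's implementation, and the function names are chosen for this example. Spearman is computed as the Pearson correlation of average ranks, so tied scores are handled.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient r (raises ZeroDivisionError if constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def spearman(x, y):
    """Spearman rank correlation: Pearson r of the rank vectors."""
    return pearson(ranks(x), ranks(y))


def rmse(predicted, observed):
    """Root mean squared error; both inputs must share a scale."""
    n = len(predicted)
    return math.sqrt(sum((p - o) ** 2 for p, o in zip(predicted, observed)) / n)


def mae(predicted, observed):
    """Mean absolute error; both inputs must share a scale."""
    return sum(abs(p - o) for p, o in zip(predicted, observed)) / len(predicted)
```

R² for a simple linear fit is pearson(x, y) ** 2. When binding energies are compared with pIC₅₀ values, a good engine shows a strongly negative r, because more negative energies should pair with higher potency.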

Phase 5 — Production Run

With calibration complete and engine performance understood, you are ready to screen a real compound library. The production run is a staged funnel — each stage reduces the compound count while increasing scoring accuracy.

Stage	Engine	Compound Count	Goal
1 — Library Screen	Static Docking	~2,000,000	Eliminate non-binders rapidly
2 — Hit Refinement	Flexible Docking	30,000–50,000	Score and rank filtered hits with improved accuracy
3 — Lead Confirmation	Ensemble Docking	~3,000	Validate binding across protein conformations

Stage 1 — Static Docking Screen (2M+ Compounds)

Select your compound library. Common sources include the Enamine REAL Library, Enamine 2M liquid stock compounds, or any other commercially available or in-house library. Upload the SMILES to RevBench as a CSV. For libraries of 2 million or more compounds, split the library into 10–15 equal batches. Submit all batches simultaneously — they run in parallel and are automatically concatenated once complete. After the run completes, apply a binding energy cutoff:

Binding energy more negative than a defined kcal/mol threshold (e.g., more negative than the best-performing calibration compounds, or a fixed threshold such as −8 kcal/mol).

Then apply secondary filters:

Chemical space clustering — remove redundant chemotypes to ensure diversity in your hit set.
Binding affinity ranking — sort by score and apply a top-N cutoff.

Target output: 30,000–50,000 compounds carried forward to Stage 2.
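The batching and cutoff logic of Stage 1 can be sketched as follows. This is an illustration rather than the platform's job scheduler: the function names, the -8.0 kcal/mol default, and the top_n default are placeholders to be replaced with values from your calibration. Chemical space clustering is not covered here.

```python
def split_batches(items, n_batches):
    """Split a list into n_batches contiguous parts whose sizes differ by at most 1."""
    base, extra = divmod(len(items), n_batches)
    batches, start = [], 0
    for i in range(n_batches):
        size = base + (1 if i < extra else 0)
        batches.append(items[start:start + size])
        start += size
    return batches


def select_hits(scores, energy_cutoff=-8.0, top_n=50000):
    """Return compound IDs scoring more negative than the cutoff, best first.

    scores maps compound_id -> binding energy in kcal/mol.
    """
    passing = [(energy, cid) for cid, energy in scores.items() if energy < energy_cutoff]
    passing.sort()  # most negative (strongest predicted binding) first
    return [cid for _, cid in passing[:top_n]]
```

A library of 2,000,000 SMILES split with n_batches=10 gives ten batches of 200,000 compounds each; select_hits is then applied to the concatenated results.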

Stage 2 — Flexible Docking (30K–50K Compounds)

Import your filtered hit set into the flexible docking engine. Select 3–5 flexible residues at the binding site based on your calibration analysis and co-crystal contact data. Run the batch — again split into 10–15 parallel jobs if needed. Evaluate each pose using all four flexible docking outputs:

Binding energy for the primary ranking.
CNN affinity to cross-check with the AI-based prediction.
CNN accuracy to flag low-confidence poses for manual review.
Intramolecular energy to remove poses with internal steric clashes.

Target output: ~3,000 high-confidence hits carried forward to Stage 3.
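One possible way to combine the four outputs is sketched below. The Pose fields mirror the outputs listed above, but the thresholds are hypothetical placeholders: the sketch assumes CNN accuracy is reported on a 0 to 1 scale and that a strain limit has been chosen from your own calibration data. Poses with excessive internal strain are removed, low-confidence poses are set aside for manual review, and the remainder is ranked by binding energy.

```python
from dataclasses import dataclass


@dataclass
class Pose:
    compound_id: str
    binding_energy: float         # kcal/mol, more negative is stronger
    cnn_affinity: float           # predicted pKi, kept for cross-checking
    cnn_accuracy: float           # pose confidence (assumed 0 to 1)
    intramolecular_energy: float  # internal ligand strain


def triage(poses, min_cnn_accuracy=0.5, max_intramolecular=10.0, top_n=3000):
    """Return (kept, review, clashing) lists of poses.

    Thresholds are illustrative defaults, not platform recommendations.
    """
    clashing = [p for p in poses if p.intramolecular_energy > max_intramolecular]
    usable = [p for p in poses if p.intramolecular_energy <= max_intramolecular]
    review = [p for p in usable if p.cnn_accuracy < min_cnn_accuracy]
    confident = [p for p in usable if p.cnn_accuracy >= min_cnn_accuracy]
    kept = sorted(confident, key=lambda p: p.binding_energy)[:top_n]
    return kept, review, clashing
```

The cnn_affinity field is deliberately not used for filtering here; comparing it with the binding-energy rank for the kept poses is a useful manual cross-check.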

Stage 3 — Ensemble Docking (~3,000 Compounds)

Run the protein-in-water MD simulation in RevMD Aqua at 100 ns if you have not done so already. Extract trajectory snapshots at regular intervals. Load the snapshots into the ensemble docking pipeline, define the pocket on each snapshot, and submit your 3,000-compound set. Split it into 10–15 batches for parallelism. Ensemble docking produces per-compound binding affinity scores averaged across all protein conformations captured in the simulation. Compounds that score well consistently across many snapshots are the most likely to bind in a real, dynamic biological environment. Apply a final round of filtering and ranking. The resulting top compound set is your lead series, ready for molecular dynamics validation and prioritization for wet-lab synthesis.
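The idea of scoring well consistently across snapshots can be made explicit by summarizing each compound's per-snapshot scores. The sketch below is one possible summary, not the platform's aggregation: the function name, the -8.0 kcal/mol threshold, and the output fields are assumptions. It reports the mean, the spread, and the fraction of snapshots in which the compound meets the threshold.

```python
from statistics import mean, pstdev


def summarize_ensemble(scores_by_compound, hit_threshold=-8.0):
    """Summarize per-snapshot scores for each compound, best mean first.

    scores_by_compound maps compound_id -> list of scores (kcal/mol),
    one score per MD snapshot.
    """
    summary = []
    for compound_id, scores in scores_by_compound.items():
        summary.append({
            "compound_id": compound_id,
            "mean": mean(scores),
            "spread": pstdev(scores),
            "fraction_below": sum(s <= hit_threshold for s in scores) / len(scores),
        })
    summary.sort(key=lambda row: row["mean"])
    return summary
```

A compound with one excellent snapshot and several poor ones shows a large spread and a low fraction_below, which distinguishes it from a compound that binds consistently across the ensemble.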

After Stage 3, you will have a ranked lead set with high-confidence binding poses, affinity predictions from multiple orthogonal methods, and validated alignment to a biologically relevant binding site. This set is now ready for physical synthesis and biochemical assay confirmation.

Quick Reference Checklist

Use this checklist to track your campaign progress.

Phase 1 — Target Identification

Identify target from literature, multi-omics, or market analysis
Define the mechanism: inhibition type, binding site, key residues

Phase 2 — Structure Preparation

Download structure from PDB / UniProt, or generate with AlphaFold / Boltz-2
Remove all co-crystal ligands, waters, ions, and excess chains
Record co-crystal contact residues before deletion
Define and validate the binding pocket using RevPocket or literature

Phase 3 — Engine Selection

Review calibration data to understand engine performance for your target class
Confirm 100 ns MD simulation is queued for ensemble docking stage

Phase 4 — Calibration

Compile calibration set (ChEMBL / PubChem / literature, biochemical assay preferred)
Run all four engines on calibration set with production settings
Generate calibration plots; record R², Spearman, RMSE, MAE per engine
Select best engine for Stage 1 screen based on speed vs. accuracy

Phase 5 — Production Run

Stage 1: Static docking on 2M+ compounds, split into 10–15 batches
Apply binding energy cutoff and chemical diversity filter; reduce to 30K–50K
Stage 2: Flexible docking on 30K–50K; evaluate all four output metrics
Stage 3: Ensemble docking on ~3K; use MD snapshots from RevMD Aqua
Rank final lead set; nominate for MD validation and wet-lab synthesis