Overview
A computational high-throughput screening (HTS) campaign on the Revilico platform follows five sequential phases. Each phase feeds the next, so work through them in order on your first campaign.
| Phase | Name | What Happens |
|---|---|---|
| 1 | Target Identification | Identify your biological target and define the therapeutic hypothesis |
| 2 | Structure Acquisition & Preparation | Obtain and clean a high-quality protein structure |
| 3 | The Four Docking Engines | Understand the tools available and when to use each |
| 4 | Calibration | Benchmark each engine against known experimental data |
| 5 | Production Run | Screen your compound library and refine to a confirmed lead set |
Phase 1 — Target Identification
Before running a single computation, you need a clearly defined biological target and a therapeutic hypothesis. This phase is about narrowing the scientific and commercial landscape down to one specific protein and knowing exactly how you intend to modulate it.
How Targets Are Selected
Target selection typically draws from three sources:
- Literature: peer-reviewed publications, structural genomics datasets, or recently solved crystal structures that reveal a druggable binding site.
- Multi-omics data: genomics, transcriptomics, proteomics, or metabolomics evidence that a given protein is causally linked to the disease phenotype of interest.
- Market intelligence: financial records, competitive pipelines, and projected indication markets that indicate an unmet therapeutic need and a viable commercial opportunity.
Define Your Therapeutic Hypothesis
Once you have a target, decide explicitly how you want to modulate it before moving forward.
- Mechanism: Are you aiming for competitive inhibition, allosteric modulation, activation, or protein-protein interaction (PPI) disruption?
- Binding site: Identify the pocket you want to engage. For a kinase this is typically the ATP-binding site (hinge residue and gatekeeper). For a PPI, identify the hot-spot residues that anchor the interface.
- Inhibitor type: Type I (DFG-in), Type II (DFG-out), Type III (allosteric), or covalent. This determines which conformational state of the protein you dock into.
Phase 2 — Structure Acquisition & Preparation
Your docking engines are only as good as the structure you feed them. A contaminated or incomplete structure will produce misleading results regardless of which engine you run.
Step 1 — Obtain the Structure
Source your structure from one of three routes:
Protein Data Bank (PDB)
Search by protein name, UniProt accession, or indication. Prefer high-resolution crystal structures (below 2.5 Å resolution). A co-crystal structure with a known inhibitor defines the active binding conformation and gives you key contact residues.
UniProt
Confirm the canonical sequence and link out to deposited structures. Pay attention to isoforms and active/inactive state annotations.
Sequence-Based Generation
If no experimental structure exists, generate one from the amino acid sequence using AlphaFold, OpenFold, or Boltz-2 co-folding. This gives a clean structure with no co-crystal contamination to remove.
Step 2 — Clean the Structure
PDB structures almost always contain components that must be removed before docking. Open the file in PyMOL or in RevBench and remove all of the following:
- Co-crystal ligands: any bound inhibitor, agonist, substrate, or cofactor already present in the structure.
- Water molecules: crystallographic waters should be removed for standard docking campaigns.
- Ions and buffer components: sodium, chloride, DMSO, glycerol, and all other non-protein atoms.
- Excess chains: if the structure is a dimer, trimer, or higher-order complex, keep only the chain you want to dock into (usually Chain A). Remove all others.
Example: You download a kinase structure containing a staurosporine co-crystal ligand, 180 water molecules, two sulfate ions, and four chains (A, B, C, D). Keep only Chain A and delete everything else. The resulting file should contain only the protein backbone and side-chain atoms of your target domain.
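The cleanup in the example above can also be scripted. The following is a minimal sketch, assuming a standard fixed-column PDB file; `clean_pdb_lines` is a hypothetical helper and not a RevBench or PyMOL function. It keeps only `ATOM` records for one chain and drops every `HETATM` record (ligands, waters, ions, buffer components).

```python
def clean_pdb_lines(lines, keep_chain="A"):
    """Keep only ATOM records for one chain of a PDB-format file.

    Assumes the fixed-column PDB layout: the record name occupies
    columns 1-6 and the chain identifier sits in column 22.
    HETATM records and ATOM records from other chains are dropped.
    """
    kept = []
    for line in lines:
        is_atom = line.startswith("ATOM  ")
        if is_atom and len(line) > 21 and line[21] == keep_chain:
            kept.append(line)
    kept.append("END")  # terminate the cleaned file
    return kept
```

To use it, read the downloaded file with `open(path).read().splitlines()`, pass the lines in, and write the result back out joined by newlines. Two caveats: modified residues that the PDB stores as `HETATM` (for example selenomethionine) are dropped along with everything else, and the co-crystal contact residues described in Step 3 should be recorded before running any cleanup like this.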
Step 3 — Retain Co-Crystal Contact Information
Before deleting a co-crystal ligand, record the key residues it contacts. This contact map is invaluable for downstream pocket definition and flexible-residue selection, and lets you validate your later docking poses against known biology.
Step 4 — Define the Binding Pocket
Use RevPocket, or extract the pocket from the literature or from your co-crystal contact analysis, to define a docking box around the site of interest. Verify that the box encompasses the residues identified as mechanistically critical in Phase 1. The box position and dimensions are passed directly to the docking engine.
Phase 3 — The Four Docking Engines
Revilico provides four complementary engines. They are used sequentially in a production run: faster and broader at the start, slower and more accurate at the end.
| Engine | Throughput | Key Outputs | Best Used For |
|---|---|---|---|
| Co-Folding (Boltz-2) | Per target | Predicted complex structure | Novel targets with no crystal structure; protein-ligand co-folding |
| Static Docking | 2M+ compounds | Binding energy (kcal/mol) | High-throughput initial screen; fast GPU-accelerated elimination |
| Flexible Docking | Up to 50K | Binding energy, CNN affinity, CNN accuracy, intramolecular energy | Hit refinement; binding-site flexibility and steric clash detection |
| Ensemble Docking | ~3K compounds | Binding affinity across MD snapshots | High-confidence lead confirmation; captures protein dynamics |
Co-Folding (Boltz-2)
Co-folding takes your SMILES string and your protein amino acid sequence as two inputs and generates a predicted complex structure as output. Use this engine when no experimental structure exists, or when you want to evaluate a chemotype in a flexible, co-folded conformation rather than a rigid crystal structure. Most commonly used for novel targets or targets where the binding site is poorly defined.
Static Docking
Static docking is a GPU-accelerated, semi-empirical scoring method. The protein structure is held completely rigid and the ligand is sampled across thousands of orientations within the docking box. Because the protein does not flex, this engine is very fast and can handle libraries of 2 million compounds or more. Primary output: binding energy in kcal/mol. More negative values indicate stronger predicted binding. Use static docking for the initial wide-net screen. Its role is to eliminate clearly non-binding compounds quickly, not to generate publication-quality poses.
Flexible Docking
Flexible docking extends static docking by allowing 3–5 selected protein residues to move during scoring. This is more computationally expensive, so throughput drops to about 50,000 compounds or fewer. Outputs from flexible docking:
- Binding energy (kcal/mol): same semi-empirical scoring as static docking.
- CNN affinity: a pKi value estimated by a convolutional neural network trained on protein-ligand data. Provides an AI-based heuristic for binding strength that complements the semi-empirical score.
- CNN accuracy: a confidence score on the predicted pose. Use this to flag unreliable poses quickly.
- Intramolecular energy: measures internal strain within the ligand pose. Highly positive values indicate steric clashes within the compound itself, which typically invalidate the pose.
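Because the CNN affinity is a pKi while the binding energy is in kcal/mol, it can help to put both on one scale when cross-checking them. The sketch below uses the standard thermodynamic relation ΔG = RT ln(Ki), with Ki in mol/L, so that ΔG = −RT ln(10) · pKi. The 298.15 K default and the function names are assumptions for illustration, and a docking score is only an approximation of a true free energy, so treat the comparison as a rough sanity check.

```python
import math

R_KCAL = 0.0019872  # gas constant in kcal/(mol*K)


def pki_to_dg(pki, temperature_k=298.15):
    """Binding free energy (kcal/mol) implied by a pKi value."""
    return -R_KCAL * temperature_k * math.log(10) * pki


def dg_to_pki(dg_kcal, temperature_k=298.15):
    """pKi implied by a binding free energy in kcal/mol."""
    return -dg_kcal / (R_KCAL * temperature_k * math.log(10))
```

At 298.15 K each pKi unit corresponds to roughly 1.36 kcal/mol, so a pKi of 7 maps to about −9.5 kcal/mol.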
Ensemble Docking
Ensemble docking is the most rigorous and computationally intensive method. Before running it, you must first complete a protein-in-water molecular dynamics (MD) simulation using RevMD Aqua. Set the simulation time to 100 ns (minimum 50 ns acceptable). The MD run captures how the protein naturally moves in solution.
Ensemble docking provides binding affinity estimates that account for protein flexibility and dynamics. It is the most reliable predictor of binding in cases where the protein is conformationally dynamic. Use it for your final ~3,000 compound set as the last confirmation step before nominating compounds for wet-lab synthesis.
Phase 4 — Calibration
Calibration is the single most important quality-control step in any computational screening campaign. Without it, you have no basis for trusting the scores your engines produce. Calibration tells you how well each engine predicts experimental activity for compounds with known data, and therefore how much confidence to place in its predictions for unknowns.
Build a Calibration Set
Collect a set of compounds with known experimental activity against your target. Aim for several hundred to several thousand compounds with associated IC₅₀ or Kᵢ values. Biochemical assay data is preferred over cell-based.
ChEMBL
Highly curated with good metadata. Generally trustworthy as a primary source.
PubChem BioAssay
Broad coverage. Verify assay annotations carefully before including.
Literature
Acceptable, but audit assay conditions and metadata. Inconsistent protocols between sources introduce noise.
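Activity values gathered from these sources usually arrive in mixed units, so it is common to convert them to a negative log scale (pIC₅₀ or pKᵢ) before comparing them with engine scores. The following is a minimal sketch of that conversion; the function name and the unit labels are assumptions, so adapt them to whatever your exported data actually uses.

```python
import math

# Conversion factors from common concentration units to mol/L.
_UNIT_TO_MOLAR = {"M": 1.0, "mM": 1e-3, "uM": 1e-6, "nM": 1e-9, "pM": 1e-12}


def to_p_activity(value, unit="nM"):
    """Convert an IC50 or Ki to pIC50 / pKi = -log10(value in mol/L)."""
    if value <= 0:
        raise ValueError("activity must be a positive concentration")
    if unit not in _UNIT_TO_MOLAR:
        raise ValueError(f"unknown unit: {unit!r}")
    return -math.log10(value * _UNIT_TO_MOLAR[unit])
```

For example, an IC₅₀ of 100 nM corresponds to a pIC₅₀ of 7. Higher values on this scale mean more potent compounds, which is the opposite direction from binding energies.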
Run All Four Engines on the Calibration Set
Take the calibration library and run it through each of the four engines exactly as you would run a production screen. Use the same pocket box and settings you defined in Phase 2. Collect all outputs into a master CSV: one row per compound, one column per engine output.
Generate Calibration Plots and Interpret Metrics
In RevBench, upload your combined CSV alongside the experimental activity data and generate calibration plots (predicted score vs. experimental value) for each engine. Evaluate the following four metrics:
| Metric | Range | Interpretation |
|---|---|---|
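| Pearson r / R² | r: −1 to 1; R²: 0–1 (higher R² is better) | Pearson r measures linear correlation between predicted and experimental values; R² is its square for a simple linear fit. R² > 0.5 is a reasonable baseline for virtual screening. |
| Spearman Coefficient | −1 to 1 (magnitude closer to 1 is better) | Measures rank-order correlation. More important than R² in screening, where ranking compounds correctly matters more than absolute accuracy. The sign depends on the score and activity conventions: for a predictive engine, binding energy (more negative is stronger) against pIC₅₀ (higher is stronger) gives a negative coefficient. |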
| RMSE | Lower is better | Root Mean Squared Error. Sensitive to outliers. Use alongside MAE to understand whether a few bad predictions are skewing the average. |
| MAE | Lower is better | Mean Absolute Error. A robust average error metric less sensitive to extreme outliers than RMSE. |
Phase 5 — Production Run
With calibration complete and engine performance understood, you are ready to screen a real compound library. The production run is a staged funnel: each stage reduces the compound count while increasing scoring accuracy.
| Stage | Engine | Compound Count | Goal |
|---|---|---|---|
| 1 — Library Screen | Static Docking | ~2,000,000 | Eliminate non-binders rapidly |
| 2 — Hit Refinement | Flexible Docking | 30,000–50,000 | Score and rank filtered hits with improved accuracy |
| 3 — Lead Confirmation | Ensemble Docking | ~3,000 | Validate binding across protein conformations |
Stage 1 — Static Docking Screen (2M+ Compounds)
Select your compound library. Common sources include the Enamine REAL Library, Enamine 2M liquid stock compounds, or any other commercially available or in-house library. Upload the SMILES to RevBench as a CSV. For libraries of 2 million or more compounds, split the library into 10–15 equal batches. Submit all batches simultaneously; they run in parallel and are automatically concatenated once complete. After the run completes, apply the following filters:
- Binding energy cutoff: keep compounds with a binding energy more negative than a defined kcal/mol threshold (e.g., more negative than the best-performing calibration compounds, or a fixed threshold such as −8 kcal/mol).
- Chemical space clustering: remove redundant chemotypes to ensure diversity in your hit set.
- Binding affinity ranking: sort by score and apply a top-N cutoff.
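The batching and score-based filtering in this stage can be sketched as follows. This is an illustration only: `split_into_batches` and `filter_hits` are hypothetical helpers, the −8 kcal/mol default simply mirrors the example threshold above, and chemical space clustering is left out because it needs a cheminformatics toolkit.

```python
def split_into_batches(items, n_batches):
    """Split a list into n_batches contiguous parts whose sizes differ by at most one."""
    if n_batches < 1:
        raise ValueError("n_batches must be at least 1")
    size, extra = divmod(len(items), n_batches)
    batches, start = [], 0
    for i in range(n_batches):
        end = start + size + (1 if i < extra else 0)
        batches.append(items[start:end])
        start = end
    return batches


def filter_hits(scores, cutoff=-8.0, top_n=None):
    """Keep compounds more negative than cutoff (kcal/mol), best score first.

    scores maps a compound ID to its static-docking binding energy.
    """
    hits = sorted(
        ((cid, energy) for cid, energy in scores.items() if energy < cutoff),
        key=lambda pair: pair[1],
    )
    return hits if top_n is None else hits[:top_n]
```

Splitting the SMILES list with `split_into_batches(smiles, 12)` gives twelve near-equal batches to upload, and `filter_hits(scores, cutoff=-8.0, top_n=50000)` applies the cutoff and the top-N ranking in one pass once the concatenated results are back.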
Stage 2 — Flexible Docking (30K–50K Compounds)
Import your filtered hit set into the flexible docking engine. Select 3–5 flexible residues at the binding site based on your calibration analysis and co-crystal contact data. Run the batch, again split into 10–15 parallel jobs if needed. Evaluate each pose using all four flexible docking outputs:
- Binding energy for a primary rank.
- CNN affinity to cross-check with the AI-based prediction.
- CNN accuracy to flag low-confidence poses for manual review.
- Intramolecular energy to remove poses with internal steric clashes.
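One way to apply the four outputs together is a simple triage pass before ranking. The sketch below assumes each pose is a dictionary with hypothetical field names (`binding_energy`, `cnn_affinity`, `cnn_accuracy`, `intramolecular_energy`); the actual export columns may differ. The thresholds are deliberately required arguments, since suitable values should come from your calibration results rather than from this example.

```python
def triage_pose(pose, *, min_cnn_accuracy, max_intramolecular):
    """Classify one pose as 'reject', 'review', or 'keep'.

    Poses with high internal strain are rejected outright; poses with a
    low CNN accuracy are flagged for manual review.
    """
    if pose["intramolecular_energy"] > max_intramolecular:
        return "reject"
    if pose["cnn_accuracy"] < min_cnn_accuracy:
        return "review"
    return "keep"


def rank_poses(poses, *, min_cnn_accuracy, max_intramolecular):
    """Return the kept poses sorted by binding energy, most negative first."""
    kept = [
        p for p in poses
        if triage_pose(
            p,
            min_cnn_accuracy=min_cnn_accuracy,
            max_intramolecular=max_intramolecular,
        ) == "keep"
    ]
    return sorted(kept, key=lambda p: p["binding_energy"])
```

The CNN affinity is carried along in each pose dictionary rather than used as a filter here, so it remains available for cross-checking the ranked list against the binding energy.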
Stage 3 — Ensemble Docking (~3,000 Compounds)
Run the protein-in-water MD simulation in RevMD Aqua at 100 ns if you have not done so already. Extract trajectory snapshots at regular intervals. Load the snapshots into the ensemble docking pipeline, define the pocket on each snapshot, and submit your 3,000-compound set. Split the set into 10–15 batches for parallelism.
Ensemble docking produces per-compound binding affinity scores averaged across all protein conformations captured in the simulation. Compounds that score well consistently across many snapshots are the most likely to bind in a real, dynamic biological environment. Apply a final round of filtering and ranking. The resulting top compound set is your lead series, ready for molecular dynamics validation and prioritization for wet-lab synthesis.
After Stage 3, you will have a ranked lead set with high-confidence binding poses, affinity predictions from multiple orthogonal methods, and validated alignment to a biologically relevant binding site. This set is now ready for physical synthesis and biochemical assay confirmation.
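The averaging and consistency check described for Stage 3 can be sketched as follows, assuming you have exported one score per compound per snapshot and that scores follow the kcal/mol convention where more negative means stronger binding. Both the data layout and the function name are assumptions, not the ensemble pipeline's actual output format.

```python
from statistics import mean


def summarize_ensemble(snapshot_scores, threshold):
    """Summarize per-snapshot scores for each compound.

    snapshot_scores maps a compound ID to a list of scores, one per MD
    snapshot. Returns (compound ID, mean score, fraction of snapshots
    scoring more negative than threshold), sorted by mean score.
    """
    rows = []
    for cid, scores in snapshot_scores.items():
        fraction_good = sum(s < threshold for s in scores) / len(scores)
        rows.append((cid, mean(scores), fraction_good))
    return sorted(rows, key=lambda row: row[1])
```

Reporting the fraction alongside the mean separates compounds that score well in most conformations from those whose average is carried by a few favorable snapshots.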
Quick Reference Checklist
Use this checklist to track your campaign progress.
Phase 1 — Target Identification
- Identify target from literature, multi-omics, or market analysis
- Define the mechanism: inhibition type, binding site, key residues
Phase 2 — Structure Acquisition & Preparation
- Download structure from PDB / UniProt, or generate with AlphaFold / Boltz-2
- Remove all co-crystal ligands, waters, ions, and excess chains
- Record co-crystal contact residues before deletion
- Define and validate the binding pocket using RevPocket or literature
Phase 3 — The Four Docking Engines
- Review calibration data to understand engine performance for your target class
- Confirm 100 ns MD simulation is queued for ensemble docking stage
Phase 4 — Calibration
- Compile calibration set (ChEMBL / PubChem / literature, biochemical assay preferred)
- Run all four engines on calibration set with production settings
- Generate calibration plots; record R², Spearman, RMSE, MAE per engine
- Select best engine for Stage 1 screen based on speed vs. accuracy
Phase 5 — Production Run
- Stage 1: Static docking on 2M+ compounds, split into 10–15 batches
- Apply binding energy cutoff and chemical diversity filter; reduce to 30K–50K
- Stage 2: Flexible docking on 30K–50K; evaluate all four output metrics
- Stage 3: Ensemble docking on ~3K; use MD snapshots from RevMD Aqua
- Rank final lead set; nominate for MD validation and wet-lab synthesis

