Expand Chemical Libraries

The Problem You are Trying to Solve:
“I have an identified target and a set of initial hit compounds that bind well, and I want a new library for further testing with an expanded set of chemical hypotheses.”

Generate An Expanded Small Molecule Library From Hit Compounds

Expanding from hits to a broader library is difficult because the new compounds must be novel enough to teach you something, close enough to retain activity, and still reasonable to synthesize and test. In traditional hit-to-lead, this happens through repeated medicinal chemistry cycles (analog enumeration, scaffold hopping, fragment linking), supported by screening and structure-based design. With AI-driven generative chemistry, we can rapidly generate, prune, and prioritize thousands to millions of hypotheses, provided the outputs are produced in a controlled and interpretable way: scored, filtered, diversified, and interpreted with the right confidence and constraints. Further testing and scoring can then be run with a variety of Revilico’s Engines to help ensure that designed structures will perform well experimentally before they undergo lengthy synthesis cycles.

Solution
This workflow enables users to generate an expanded compound library using Revilico’s Generative Chemistry Suite, powered by reinforcement learning and a variety of scoring functions. Revilico supports three complementary ways to expand beyond your hit set:

De Novo Library Generation explores new chemical space (broad expansion, scaffold hopping, fragment linking, scaffold decoration).
Molecular Optimization generates improved “next-iteration” analogs of your hits under clear goals (property and activity objectives), supporting multi-parameter optimization of leads.
Custom Model Training fine-tunes a generative model on your chemistry so that libraries match your project’s chemical style and constraints, allowing optimization within predetermined chemical spaces.

Most teams use a mix of these approaches so they get conservative analogs, moderate exploration, and a small number of high-novelty ideas.

What Data Do I Need to Provide?
Required

Hit compounds as SMILES (CSV or text; a CSV must contain one column labeled ‘smileString’)
Your definition of an “expanded library” (close analogs, scaffold hops, or both)
How many new hypotheses you want (hundreds, thousands, millions)
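
Before uploading, it can help to check the hit file locally. The following is a minimal sketch using only the Python standard library, assuming the hits are in a CSV with the ‘smileString’ column described above. The function name `load_hit_smiles` is hypothetical and not part of Revilico’s platform, and the check only confirms that the column exists and removes blank or repeated rows; it does not validate the SMILES chemistry itself.

```python
import csv
import io

REQUIRED_COLUMN = "smileString"  # column name stated in the requirements above


def load_hit_smiles(csv_text: str) -> list[str]:
    """Return non-empty SMILES strings from CSV text, dropping exact repeats.

    Hypothetical helper for local pre-checks; not a Revilico API.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    if reader.fieldnames is None or REQUIRED_COLUMN not in reader.fieldnames:
        raise ValueError(f"CSV must contain a column labeled '{REQUIRED_COLUMN}'")

    seen: set[str] = set()
    smiles: list[str] = []
    for row in reader:
        value = (row[REQUIRED_COLUMN] or "").strip()
        if value and value not in seen:  # skip blanks and exact string repeats
            seen.add(value)
            smiles.append(value)
    return smiles


# Illustrative input only: ethanol, benzene, and a repeated ethanol row.
example_csv = "smileString\nCCO\nc1ccccc1\nCCO\n"
hits = load_hit_smiles(example_csv)
print(hits)
```

Exact string matching will not catch two different SMILES spellings of the same molecule; a cheminformatics toolkit would be needed for that.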

Optional (recommended for stronger control)

Known scaffolds, fragments, or warheads (if you want constrained generation)
Preferences or constraints (size, drug-likeness, avoid substructures, etc.)
Project-specific compound dataset (if you want custom model training so that generation stays within your own chemical space)

Workflow

Choose Your Expansion Strategy

Before you generate, decide what kind of expansion you want from your hits:

Close-in expansion: “Give me analogs close to my hit series”
Balanced expansion: “Keep the core ideas, but explore substitutions and related scaffolds”
Exploratory expansion: “Find new chemotypes / scaffold hops”
Constrained expansion: “Keep my scaffold or warheads fixed and only vary linkers / R-groups”

This choice determines which engine(s) you run first.

Generate a Broad Expansion Library

Use this step when you want to quickly explore chemical space and create a first-pass expanded library for downstream screening on different Engines. On Revilico, you select a De Novo Library Generation mode based on your intent:

De Novo Generation (no starting molecules needed): broad ideas from general drug-like space, biased toward selected property objectives
Scaffold Decoration (you provide a scaffold with * attachment points): explore R-group combinations while preserving the core
Fragment Linker Design (you provide two warheads separated by |, with * attachment points): explore linker chemistry between fragments
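
The input conventions above (`*` for attachment points, `|` between two warheads) can be sanity-checked before submission. This is a sketch under stated assumptions: the function names are hypothetical, the example strings are illustrative, and the exact syntax the platform accepts may be stricter than these simple string checks.

```python
ATTACHMENT = "*"  # attachment-point marker named in the text above
SEPARATOR = "|"   # warhead separator named in the text above


def check_scaffold(scaffold: str) -> int:
    """Return the number of attachment points in a scaffold; raise if there are none."""
    scaffold = scaffold.strip()
    if SEPARATOR in scaffold:
        raise ValueError("a scaffold should be a single fragment without '|'")
    n_points = scaffold.count(ATTACHMENT)
    if n_points == 0:
        raise ValueError("scaffold needs at least one '*' attachment point")
    return n_points


def split_warheads(linker_input: str) -> tuple[str, str]:
    """Split 'warheadA|warheadB' and confirm each side has an attachment point."""
    parts = [part.strip() for part in linker_input.split(SEPARATOR)]
    if len(parts) != 2 or not all(parts):
        raise ValueError("expected exactly two warheads separated by '|'")
    for part in parts:
        if ATTACHMENT not in part:
            raise ValueError(f"warhead '{part}' has no '*' attachment point")
    return parts[0], parts[1]


# Hypothetical inputs for illustration only.
n_sites = check_scaffold("*c1ccc(*)cc1")
warheads = split_warheads("*c1ccccc1|*C(=O)O")
print(n_sites, warheads)
```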

Revilico produces a large set of candidate molecules and automatically performs quality checks and diversity handling (e.g., removing duplicates and pruning overly similar structures).

How to interpret the outputs
Each generated molecule is accompanied by an NLL (negative log-likelihood) score, which helps you understand how “normal” vs “novel” the molecule is relative to the model’s learned chemistry:

Lower NLL → safer / more typical chemistry (often good for early hit expansion)
Higher NLL → more novel chemistry (often useful for scaffold hopping or for getting unstuck when a series stops improving)
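
To make the NLL intuition concrete, here is a minimal sketch of how a negative log-likelihood is commonly computed for a sequence model that emits a molecule token by token: the negative sum of the log-probabilities the model assigned to each token. How Revilico computes its score internally is not stated here, so treat this as the textbook definition and not the platform’s implementation. The per-token probabilities below are invented for illustration.

```python
import math


def sequence_nll(token_probs: list[float]) -> float:
    """Negative log-likelihood of a sequence given per-token probabilities."""
    if not token_probs:
        raise ValueError("need at least one token probability")
    if any(not 0.0 < p <= 1.0 for p in token_probs):
        raise ValueError("probabilities must be in (0, 1]")
    return -sum(math.log(p) for p in token_probs)


# Made-up probabilities: the model is confident about the first molecule
# and much less confident about the second.
candidates = {
    "typical_molecule": [0.9, 0.8, 0.9, 0.95],
    "novel_molecule": [0.3, 0.2, 0.5, 0.4],
}

scores = {name: sequence_nll(probs) for name, probs in candidates.items()}
ranked = sorted(scores, key=scores.get)  # lowest NLL (most typical) first

for name in ranked:
    print(f"{name}: NLL = {scores[name]:.2f}")
```

Because the sum runs over tokens, longer sequences tend to accumulate higher NLL values, which is worth keeping in mind when comparing molecules of very different sizes.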

Generate “Better Versions” of Your Hits

Use this step when you want the next library to be hit-like, but improved, based on the priorities you care about. Molecular Optimization takes your hit SMILES and produces new “siblings” optimized toward goals you define, such as:

staying within drug-like property ranges
increasing desirability under predicted ADMET or physicochemical constraints
encouraging novelty without breaking the series
optimizing for activity against specific known targets using binding affinity scoring functions
penalizing known liabilities (reactive groups, toxic motifs)
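
Goals like these are often folded into a single number per molecule so that candidates can be ranked. One common choice, shown here as an assumption and not as Revilico’s actual scoring scheme, is a weighted geometric mean of component scores that have each been scaled to the range 0 to 1. The component names and weights are hypothetical.

```python
import math


def aggregate_score(components: dict[str, float], weights: dict[str, float]) -> float:
    """Weighted geometric mean of component scores, each expected in [0, 1]."""
    total_weight = sum(weights[name] for name in components)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive number")

    log_sum = 0.0
    for name, score in components.items():
        if not 0.0 <= score <= 1.0:
            raise ValueError(f"score for '{name}' must be in [0, 1]")
        if score == 0.0:
            return 0.0  # a failed objective zeroes the whole score
        log_sum += weights[name] * math.log(score)
    return math.exp(log_sum / total_weight)


# Hypothetical objectives and weights, for illustration only.
weights = {"drug_likeness": 1.0, "predicted_affinity": 2.0, "liability_free": 1.0}
molecule_scores = {"drug_likeness": 0.8, "predicted_affinity": 0.6, "liability_free": 1.0}

combined = aggregate_score(molecule_scores, weights)
print(f"combined score: {combined:.3f}")
```

A geometric mean penalizes a molecule that does badly on any one objective more strongly than a weighted sum would, which suits multi-parameter optimization where a single liability can disqualify a compound.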

Revilico runs this as an iterative optimization campaign: generating molecules, scoring them, and refining the generator toward better solutions in the next rounds of optimization.

What you get

a ranked library of optimized hypotheses
clear scoring summaries so you can understand why a molecule was preferred
optional diversity controls so the library doesn’t collapse into near-duplicates

Train a Custom Generator on Your Chemistry (Optional)

Use this step when you want your generated libraries to reflect your organization’s chemical space rather than generic public chemistry. If you provide a curated SMILES dataset, Revilico can train or fine-tune a model through Custom Model Training so that future libraries:

match your chemistry patterns
better respect your synthesis constraints
produce compounds that “feel native” to the project, based on the structure-activity relationships your chemists have already established

Once trained, that model becomes a selectable option inside De Novo Generation or Molecular Optimization, so you can generate libraries that stay aligned with your internal chemical space. You can then reuse it in downstream lead optimization campaigns and iterative chemical series expansion.

Consolidate Your Libraries Into a Single “Next Test Set”

Most users will run two or three generation passes, then combine them:

Library A (Conservative): close analogs of hit series
Library B (Balanced): moderate exploration around scaffolds / substitutions
Library C (Exploratory): a smaller set of scaffold hops or fragment-link ideas
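
Combining the libraries can be sketched as a merge that removes repeated structures while recording which library each one came from. This is an illustrative sketch only: `merge_libraries` is a hypothetical helper, the SMILES are placeholders, and duplicates are detected by exact string match, so a real pipeline would first canonicalize SMILES with a cheminformatics toolkit.

```python
def merge_libraries(libraries: dict[str, list[str]]) -> dict[str, list[str]]:
    """Map each unique SMILES string to the list of libraries that produced it."""
    merged: dict[str, list[str]] = {}
    for label, smiles_list in libraries.items():
        for smiles in smiles_list:
            sources = merged.setdefault(smiles, [])
            if label not in sources:  # ignore repeats within one library
                sources.append(label)
    return merged


# Placeholder structures, for illustration only.
libraries = {
    "A_conservative": ["CCO", "CCN", "CCO"],
    "B_balanced": ["CCN", "c1ccccc1O"],
    "C_exploratory": ["c1ccncc1"],
}

next_test_set = merge_libraries(libraries)
shared = [s for s, sources in next_test_set.items() if len(sources) > 1]

print(f"{len(next_test_set)} unique structures")
print(f"found by more than one strategy: {shared}")
```

Keeping the source labels makes it easy to see which structures were proposed by more than one strategy when you review the combined set.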

From there, users typically apply a consistent triage strategy (e.g., dedupe, cluster, filter, then send to docking/MD/ADMET) and select a final testable subset.

Results

Versioned expanded compound library (new chemical hypotheses)
A mixture of conservative and exploratory ideas (depending on your chosen strategy)
Interpretable outputs that help you reason about novelty vs feasibility
Libraries ready for downstream screening and prioritization

When multiple strategies produce overlapping conclusions (e.g., similar motifs across de novo + optimization), you can move forward with greater confidence that you’re expanding in a meaningful direction.

Now what? I have a new library, but what’s next?
This workflow commonly feeds into hit-to-lead prioritization steps such as:

docking and structure-based triage
molecular dynamics on top candidates
ADMET filtering and multi-parameter ranking
synthesis planning and experiment selection

Before sending these results to the wet lab, you can use other Revilico engines to curate compound sets with better properties before spending money on synthesis or testing.

Why Revilico?
Revilico allows you to expand from hits to a next-generation test library using three complementary approaches (broad exploration, guided optimization, and project-specific model training) while keeping outputs organized, versioned, and interpretable. This enables rapid iteration without losing scientific control over what is being generated and why.