Why Use This Product?

Revilico’s De Novo Library Generation Engine is an AI-powered tool that combines reinforcement learning, pretrained generative models, and scoring functions to design drug-like molecules, perform scaffold hopping, connect molecular fragments with linkers, and generate peptides from natural and non-natural amino acids. The engine enables rapid exploration of chemical space for hit discovery, lead optimization, and library enumeration at scales from hundreds to millions of compounds. The generated library then serves as a concentrated screening set that other engines on the platform can use to narrow down synthesizable candidates.
De Novo Library Generation Workflow

Background

In drug discovery, we often already have a target protein and want to screen a compound library to find the molecules with the best binding affinity, which are the most promising to carry into downstream analysis such as Molecular Dynamics and Free Energy Perturbation calculations. This usually requires an existing molecular library to screen: typically a set of compounds that share some desired property, are already on hand, or are easy to synthesize or obtain.

But what if we have a target and no library? This is where Revilico’s De Novo Library Generation Engine comes in. De novo generation is a core capability in AI-assisted drug discovery that attempts to solve the inverse design problem: given a desired set of properties, create new compounds likely to satisfy them. Rather than enumerating combinations of fragments or scaffolds, De Novo Generation builds molecules from scratch using learned language or graph models that encode valid chemical structure syntax. This is an alternative to selecting from a predetermined library, and it allows fewer molecules to be synthesized and tested on the way to candidate hits, because we are ‘creating a new needle’ rather than searching for a needle in a haystack.

How does it work? The basis of the engine is a chemical language model: a neural network that treats SMILES strings like sentences and the atoms and bonds within them like tokens, or words. The model has been trained on a large database of valid, synthesized small molecules, from which it has learned valence rules, common ring systems, typical functional groups, and which molecules look like medicinal chemistry and which do not. In essence, the model samples from learned chemical patterns. Generation begins with a start token, where a token is a small piece of a SMILES string.
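To make the idea of a token concrete, here is a minimal Python sketch of splitting a SMILES string into tokens. The regular expression, the `tokenize` and `detokenize` helpers, and the `<s>` / `</s>` marker strings are assumptions made for illustration; the tokenizer the engine actually uses is not described here.

```python
from __future__ import annotations

import re

# Assumed token pattern: bracket atoms such as [NH3+], the two-letter
# halogens Br and Cl, two-digit ring closures such as %12, and then any
# other single character (atoms, bonds, branch parentheses, ring digits).
SMILES_TOKEN = re.compile(r"\[[^\]]+\]|Br|Cl|%\d{2}|.")

START, END = "<s>", "</s>"  # hypothetical start and end markers


def tokenize(smiles: str) -> list[str]:
    """Split a SMILES string into tokens, wrapped in start/end markers."""
    return [START, *SMILES_TOKEN.findall(smiles), END]


def detokenize(tokens: list[str]) -> str:
    """Join tokens back into a SMILES string, dropping the markers."""
    return "".join(t for t in tokens if t not in (START, END))


if __name__ == "__main__":
    print(tokenize("CC(=O)Nc1ccc(Cl)cc1"))
```

Treating `Cl` and `Br` as single tokens keeps the model from having to learn that `C` followed by `l` is one atom, which is a common design choice in SMILES language models.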
The neural network’s job is to predict the probability distribution over the next possible token. We do not always pick the highest-probability token, since that would generate the same molecule over and over again. Instead, we sample from the probability distribution (or from the latent-space vector representation of the molecules) and append the sampled token to the current SMILES string. We repeat this cycle of predicting the distribution, sampling from it, and appending the result until the end token is generated, which signals the end of one molecule. Repeating the whole process many times produces a library of candidate molecules.

We now have a set of candidate molecules, but some may be invalid, undesirable, or impractical. To decide which molecules to keep and which to remove, each candidate goes through a series of quality control checks. The first is a validity check: whether the SMILES string is syntactically valid, whether it can be parsed into a molecular graph, whether it contains only allowed atom types, and whether the molecule is within a maximum size limit. The next is a chemical sanity filter that constrains molecular weight, number of rings, and number of heavy atoms to set ranges. The last is a diversity handling step that removes duplicates and applies similarity-based pruning, to ensure broad coverage of chemical space.

To run the De Novo Library Generation Pipeline:

(1) Select your molecule generation type from five options: De Novo Generation (no input molecules), Scaffold Decoration (scaffold-based generation), Fragment Linker Design (link molecules with linkers), Molecular Optimization (lead-based generation), or Peptide Generation (peptide design with natural and non-natural amino acids).
(2) Configure the Pipeline Name and select your model. Defaults are included and do not need to be changed. Molecular Optimization offers seven model options (Low similarity, Medium similarity, High similarity, Real medicinal chemistry transformation, Scaffold-based transformations, Broad scaffold diversification, or custom model upload).
(3) Upload or submit input SMILES if required by your selected generation type, adjust molecular diversity settings between Exploratory and Focused (in Exploratory mode, a Temperature slider ranges from “Conservative” to “Wild”), and enter the number of output molecules desired per input.
(4) Press “Create Pipeline” to initiate generation, then retrieve the results in the Analysis section once the run has completed. Outputs include generated molecular structures with diversity scores and property predictions.
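The predict, sample, and append loop from the Background section, and the effect a temperature setting has on it, can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not the engine's implementation: `TOY_LOGITS` is a made-up lookup table standing in for the trained network, the function names are hypothetical, and how the “Conservative” to “Wild” slider maps to a numeric temperature is not specified in this document.

```python
from __future__ import annotations

import math
import random

START, END = "<s>", "</s>"  # hypothetical start and end markers

# Toy stand-in for the trained network: maps the previous token to
# unnormalized scores (logits) for the next token. All values are made up.
TOY_LOGITS = {
    START: {"C": 2.0, "N": 0.5, "O": 0.5},
    "C": {"C": 1.5, "N": 0.5, "O": 0.5, END: 1.0},
    "N": {"C": 1.5, END: 0.5},
    "O": {"C": 1.0, END: 1.0},
}


def softmax(logits: dict[str, float], temperature: float) -> dict[str, float]:
    """Turn logits into probabilities; a higher temperature flattens them."""
    scaled = {tok: score / temperature for tok, score in logits.items()}
    peak = max(scaled.values())  # subtract the max for numerical stability
    exps = {tok: math.exp(s - peak) for tok, s in scaled.items()}
    total = sum(exps.values())
    return {tok: e / total for tok, e in exps.items()}


def sample_molecule(rng: random.Random, temperature: float = 1.0,
                    max_len: int = 20) -> str:
    """Predict, sample, and append until the end token or max_len is reached."""
    tokens = [START]
    while len(tokens) <= max_len:
        probs = softmax(TOY_LOGITS[tokens[-1]], temperature)
        nxt = rng.choices(list(probs), weights=list(probs.values()), k=1)[0]
        if nxt == END:
            break
        tokens.append(nxt)
    return "".join(tokens[1:])  # drop the start marker


def generate_library(n: int, temperature: float, seed: int = 0) -> list[str]:
    """Sample n strings with a fixed seed so runs are repeatable."""
    rng = random.Random(seed)
    return [sample_molecule(rng, temperature) for _ in range(n)]


if __name__ == "__main__":
    for t in (0.3, 1.0, 2.0):
        lib = generate_library(200, t)
        print(f"T={t}: {len(set(lib))} unique strings out of {len(lib)}")
```

A low temperature concentrates probability on the most likely token, so the samples cluster around a few strings. A high temperature spreads probability across more tokens, which is the general trade-off between focused and exploratory generation.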