RevGRN - Revilico

Why Use This Engine?

In the documentation below, we will use Revilico’s RevGRN engine to infer gene regulatory networks from single-cell, bulk, or spatial transcriptomic data and simulate the stable cellular states that emerge from those regulatory interactions. RevGRN identifies which transcription factors are driving gene expression changes in the dataset, constructs a network of TF-target regulatory edges, and then uses Boolean dynamical simulation to identify the attractor states that the network converges to, enabling prediction of how genetic perturbations such as knockouts or overexpression events alter the cellular phenotype.

Background

Gene regulatory networks describe the control logic by which transcription factors bind to gene promoters and enhancers to activate or repress downstream target genes. These networks determine which genes are expressed in each cell type and how expression patterns change in response to developmental signals, disease mutations, or drug perturbations. Inferring GRNs from transcriptomic data is challenging because correlation in expression does not imply causation, and regulatory relationships must be distinguished from indirect co-expression driven by shared upstream regulators. RevGRN addresses this using machine learning-based importance scoring (GRNBoost2) to prioritize direct TF-target regulatory edges, filtering to the most informative subset of the network, and then converting the inferred network into a Boolean dynamical model to simulate cellular states. Boolean models represent each gene as either ON or OFF and apply the regulatory logic iteratively until the system converges to stable attractor states. These attractors correspond to distinct biological phenotypes (cell types, disease states, drug-response signatures), and perturbation simulations reveal how knocking out or overexpressing a gene redirects the network toward different attractors.

Input Data Loading

RevGRN accepts expression matrices in CSV, TSV, H5AD, and LOOM formats. The engine auto-detects whether rows represent cells and columns represent genes, or vice versa, by analyzing the index patterns: Ensembl IDs (ENSG prefix), gene symbols, or cell barcode patterns are detected and the matrix is transposed if necessary. For matrices indexed by Ensembl IDs with a gene_name column, symbols are resolved via the MyGeneInfo API.
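The orientation check described above can be illustrated with a small sketch. The regular expressions, the 0.5 threshold, and the function names `fraction_matching` and `guess_orientation` are assumptions made for this example, not RevGRN's actual rules. Gene-symbol detection is left out because it would need a reference symbol list.

```python
import re

# Assumed label patterns, for illustration only.
ENSEMBL_RE = re.compile(r"^ENS[A-Z]*G\d{6,}(\.\d+)?$")  # e.g. ENSG00000141510
BARCODE_RE = re.compile(r"^[ACGTN]{12,}(-\d+)?$")       # e.g. AAACCTGAGAAACCAT-1


def fraction_matching(labels, pattern):
    """Fraction of labels that match a compiled regular expression."""
    labels = list(labels)
    if not labels:
        return 0.0
    return sum(bool(pattern.match(str(label))) for label in labels) / len(labels)


def guess_orientation(row_labels, col_labels, threshold=0.5):
    """Return 'cells_x_genes', 'genes_x_cells', or 'unknown' from label patterns."""
    rows_look_like_genes = fraction_matching(row_labels, ENSEMBL_RE) >= threshold
    cols_look_like_genes = fraction_matching(col_labels, ENSEMBL_RE) >= threshold
    rows_look_like_cells = fraction_matching(row_labels, BARCODE_RE) >= threshold
    cols_look_like_cells = fraction_matching(col_labels, BARCODE_RE) >= threshold

    if rows_look_like_genes or (cols_look_like_cells and not cols_look_like_genes):
        return "genes_x_cells"  # would be transposed to cells x genes
    if cols_look_like_genes or rows_look_like_cells:
        return "cells_x_genes"
    return "unknown"


rows = ["ENSG00000141510", "ENSG00000012048"]
cols = ["AAACCTGAGAAACCAT-1", "AAACCTGAGAAACGAG-1"]
print(guess_orientation(rows, cols))  # genes_x_cells
```

Returning an explicit `unknown` rather than guessing lets a caller ask for the orientation when neither axis carries a recognizable pattern.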

Quality Control and Normalization

Cells with fewer than min_genes_per_cell expressed genes (default 200) are removed. Genes detected in fewer than min_cells_per_gene cells (default 3) are removed. The filtered matrix is then normalized by library size and log-transformed:

\tilde{x}_{ij} = \log\left(\frac{x_{ij}}{\sum_{k} x_{ik}} \cdot 10000 + 1\right)

Where x_{ij} is the raw count for gene j in cell i, and the sum in the denominator runs over all genes k in cell i (the cell's library size).

Highly variable genes are then selected by ranking genes on their dispersion ratio:

d_j = \frac{\text{Var}(x_j)}{\text{mean}(x_j) + \varepsilon}

The top 2,000 genes by dispersion (or adaptively capped at \min(2000, 20 \times n_{\text{cells}})) are retained for network inference.
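The QC, normalization, and gene-selection steps in this section can be sketched in plain Python. This is a minimal illustration that assumes a small list-of-lists matrix with cells as rows. The function names are hypothetical. The use of population variance and the choice of which matrix the dispersion is computed on are also assumptions, since the text does not specify them.

```python
import math
from statistics import mean, pvariance


def qc_filter(counts, min_genes_per_cell=200, min_cells_per_gene=3):
    """Drop sparse cells, then sparse genes. `counts` holds one row per cell."""
    cells = [row for row in counts if sum(v > 0 for v in row) >= min_genes_per_cell]
    if not cells:
        return [], []
    kept_genes = [j for j in range(len(cells[0]))
                  if sum(row[j] > 0 for row in cells) >= min_cells_per_gene]
    return [[row[j] for j in kept_genes] for row in cells], kept_genes


def normalize_log(counts, scale=10_000):
    """Scale each cell to `scale` total counts, then apply log(x + 1)."""
    normalized = []
    for row in counts:
        library_size = sum(row)
        if library_size == 0:  # guard; QC should already have removed such cells
            normalized.append([0.0] * len(row))
            continue
        normalized.append([math.log1p(v / library_size * scale) for v in row])
    return normalized


def top_dispersion_genes(matrix, n_top=2000, eps=1e-8):
    """Column indices of the highest-dispersion genes, capped at 20 * n_cells."""
    n_keep = min(n_top, 20 * len(matrix))
    dispersion = []
    for j in range(len(matrix[0])):
        column = [row[j] for row in matrix]
        dispersion.append(pvariance(column) / (mean(column) + eps))
    ranked = sorted(range(len(dispersion)), key=lambda j: dispersion[j], reverse=True)
    return ranked[:n_keep]


# Toy counts with relaxed thresholds so that something survives QC.
counts = [[5, 0, 5], [1, 0, 9], [0, 0, 10]]
filtered, kept = qc_filter(counts, min_genes_per_cell=1, min_cells_per_gene=2)
print(kept)                                     # [0, 2]
print(top_dispersion_genes(filtered, n_top=1))  # [0]
```

For real datasets a sparse-matrix library would be the practical choice; the loops here only make each formula explicit.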

GRN Inference

GRNBoost2 (Primary Method)

GRNBoost2 uses an ensemble of gradient-boosted regression trees to estimate the regulatory importance of each transcription factor for each target gene. For each target gene, a gradient-boosted regression model is trained to predict that gene’s expression from the expression of all TF genes. The feature importances from the trained model quantify how much each TF reduces prediction error, providing a directed importance score for each TF-target pair. GRNBoost2 is fast, parallelizable, and robust to non-linear regulatory relationships.

Mutual Information (Alternative)

Mutual information between TF and target gene expression is computed as:

I(X; Y) = H(Y) - H(Y | X)

Where H(Y) is the marginal entropy of target gene expression and H(Y|X) is the conditional entropy given TF expression. Mutual information captures non-linear associations that linear correlation misses.

Spearman Correlation (Fallback)

When neither GRNBoost2 nor mutual information libraries are available, Spearman rank correlation is used as a fallback. The absolute correlation value serves as the regulatory importance score.

After inference, the top max_network_edges edges (default 50) ranked by importance score are retained as the final network.
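To make the two simpler scoring options concrete, the sketch below computes mutual information from the entropy identity above and ranks TF-target pairs by absolute Spearman correlation. It assumes expression values have already been discretized for the mutual information step (the binning scheme is not specified in this document), and every function name is hypothetical. GRNBoost2 itself is not reproduced here.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a sequence of discrete labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def mutual_information(x, y):
    """I(X; Y) = H(Y) - H(Y | X) for two equal-length discrete sequences."""
    n = len(x)
    h_y_given_x = 0.0
    for x_value, count in Counter(x).items():
        y_subset = [yi for xi, yi in zip(x, y) if xi == x_value]
        h_y_given_x += (count / n) * entropy(y_subset)
    return entropy(y) - h_y_given_x


def ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        for k in range(start, end + 1):
            result[order[k]] = (start + end) / 2 + 1
        start = end + 1
    return result


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the two rank vectors."""
    rx, ry = ranks(x), ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = math.sqrt(sum((a - mx) ** 2 for a in rx))
    sy = math.sqrt(sum((b - my) ** 2 for b in ry))
    return cov / (sx * sy) if sx and sy else 0.0


def top_edges(expression, tfs, max_network_edges=50):
    """Score each TF-target pair by |Spearman rho| and keep the top edges."""
    edges = [(tf, target, abs(spearman(expression[tf], expression[target])))
             for tf in tfs for target in expression if target != tf]
    edges.sort(key=lambda edge: edge[2], reverse=True)
    return edges[:max_network_edges]


print(mutual_information([0, 0, 1, 1], [0, 0, 1, 1]))  # 1.0 bit: Y is fully determined by X
print(mutual_information([0, 1, 0, 1], [0, 0, 1, 1]))  # 0.0 bits: X and Y are independent
```

Both scores are unsigned, so an activation or repression sign for the Boolean model would have to come from another quantity, such as the signed correlation.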

Boolean Simulation

The inferred network is converted to a Boolean dynamical model. Each gene is represented as a binary variable (ON = 1, OFF = 0). At each simulation step, the state of each gene is updated synchronously according to the regulatory logic:

s_j^{(t+1)} = \begin{cases} 1 & \text{if } \sum_i w_{ij} \cdot s_i^{(t)} > 0 \\ 0 & \text{otherwise} \end{cases}

Where w_{ij} is the regulatory edge weight from TF i to gene j (positive for activation, negative for repression, using the sign of the inferred importance).

The simulation is initialized from random binary starting states and iterated until convergence (a state that maps to itself) or a limit of 50 steps. Each terminal state is recorded as an attractor. The simulation is repeated from 200 different random initial states, and the basin of attraction for each attractor is estimated as the fraction of initial states that converge to it.
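The update rule and basin estimation can be sketched as follows. The weight layout (a dict from target index to its regulators), the function names, and the two-gene toggle network are assumptions chosen for illustration. The sketch applies the rule literally, so a gene with no incoming edges switches OFF after the first step.

```python
import random
from collections import Counter


def step(state, weights):
    """One synchronous update: gene j is ON iff sum_i w_ij * s_i > 0.

    `weights[j]` maps each regulator index i to the edge weight w_ij.
    """
    return tuple(
        1 if sum(w * state[i] for i, w in weights.get(j, {}).items()) > 0 else 0
        for j in range(len(state))
    )


def simulate(weights, n_genes, n_starts=200, max_steps=50, seed=0):
    """Estimate attractors and basin fractions from random initial states."""
    rng = random.Random(seed)
    terminal_states = Counter()
    for _ in range(n_starts):
        state = tuple(rng.randint(0, 1) for _ in range(n_genes))
        for _ in range(max_steps):
            next_state = step(state, weights)
            if next_state == state:  # fixed point: the state maps to itself
                break
            state = next_state
        terminal_states[state] += 1
    return {state: count / n_starts for state, count in terminal_states.items()}


# Toy toggle switch: each gene activates itself and represses the other.
toggle = {0: {0: 1.0, 1: -1.0}, 1: {1: 1.0, 0: -1.0}}
for attractor, fraction in sorted(simulate(toggle, n_genes=2).items()):
    print(attractor, fraction)
```

In this toy network the states (0, 0), (1, 0), and (0, 1) are fixed points, while (1, 1) falls to (0, 0) in one step. The estimated basin of (0, 0) should therefore be about twice that of either single-gene state.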

Perturbation Analysis

In knockouts, the target gene is pinned to state 0 (OFF) throughout the simulation regardless of its regulatory inputs. In overexpression simulations, the gene is pinned to state 1 (ON). The resulting attractor distribution is compared to the baseline to predict how the perturbation redirects network dynamics. Genes whose knockout or overexpression produces the largest shift in attractor basin fractions are the regulatory hubs most relevant to phenotypic control.
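A perturbation run can be sketched by clamping the perturbed gene after every update and comparing the resulting basin fractions with the baseline. The total variation distance used here to quantify the shift is an assumption, since the text does not name a metric. The function names and the toy toggle network are likewise hypothetical.

```python
import random
from collections import Counter


def simulate(weights, n_genes, pinned=None, n_starts=200, max_steps=50, seed=0):
    """Basin fractions under synchronous updates, with optional pinned genes.

    `pinned` maps gene index -> fixed state (0 = knockout, 1 = overexpression).
    """
    pinned = pinned or {}
    rng = random.Random(seed)
    basins = Counter()

    def clamp(state):
        # Overwrite pinned genes regardless of their regulatory inputs.
        return tuple(pinned.get(j, s) for j, s in enumerate(state))

    for _ in range(n_starts):
        state = clamp(tuple(rng.randint(0, 1) for _ in range(n_genes)))
        for _ in range(max_steps):
            next_state = clamp(tuple(
                1 if sum(w * state[i] for i, w in weights.get(j, {}).items()) > 0 else 0
                for j in range(n_genes)
            ))
            if next_state == state:
                break
            state = next_state
        basins[state] += 1
    return {state: count / n_starts for state, count in basins.items()}


def basin_shift(baseline, perturbed):
    """Total variation distance between two attractor distributions."""
    states = set(baseline) | set(perturbed)
    return 0.5 * sum(abs(baseline.get(s, 0.0) - perturbed.get(s, 0.0)) for s in states)


# Toy toggle switch: each gene activates itself and represses the other.
toggle = {0: {0: 1.0, 1: -1.0}, 1: {1: 1.0, 0: -1.0}}
baseline = simulate(toggle, n_genes=2)
overexpressed = simulate(toggle, n_genes=2, pinned={0: 1})
print(overexpressed)  # {(1, 0): 1.0}
print(basin_shift(baseline, overexpressed))
```

Reusing the same seed for the baseline and perturbed runs means both start from the same random draws, so the comparison reflects the perturbation rather than sampling noise.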

Running the Engine

Inputs

Parameter	Default	Description
Expression matrix	Required	CSV, TSV, H5AD, or LOOM (cells x genes)
Data type	scrna	`scrna`, `bulk`, or `spatial`
Organism	human	`human`, `mouse`, or `rat` for TF annotation
Min genes per cell	200	QC threshold for sparse cell removal
Min cells per gene	3	QC threshold for sparse gene removal
Highly variable genes	2000	Number of HVGs for network inference
Max network edges	50	Top TF-target edges to retain
Inference method	grnboost2	`grnboost2`, `mutual_info`, or `correlation`
Simulation type	attractor_landscape	`attractor_landscape`, `gene_knockout`, or `gene_overexpression`
Knockout/overexpression gene	None	Gene symbol to perturb
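For readers who drive the engine from a script, the parameters above could be collected in a small configuration object. The sketch below is hypothetical: the class name `RevGRNConfig` and several field names are assumptions (only `min_genes_per_cell`, `min_cells_per_gene`, and `max_network_edges` appear in this document), and the rule that perturbation modes require a gene is inferred from the table rather than stated.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class RevGRNConfig:
    """Hypothetical container mirroring the documented parameters and defaults."""
    expression_matrix: str                    # path to a CSV, TSV, H5AD, or LOOM file
    data_type: str = "scrna"
    organism: str = "human"
    min_genes_per_cell: int = 200
    min_cells_per_gene: int = 3
    n_highly_variable_genes: int = 2000       # field name is an assumption
    max_network_edges: int = 50
    inference_method: str = "grnboost2"
    simulation_type: str = "attractor_landscape"
    perturbation_gene: Optional[str] = None   # field name is an assumption

    def validate(self):
        """Check categorical options against the values listed in the table."""
        allowed = {
            "data_type": {"scrna", "bulk", "spatial"},
            "organism": {"human", "mouse", "rat"},
            "inference_method": {"grnboost2", "mutual_info", "correlation"},
            "simulation_type": {"attractor_landscape", "gene_knockout",
                                "gene_overexpression"},
        }
        for name, options in allowed.items():
            value = getattr(self, name)
            if value not in options:
                raise ValueError(f"{name}={value!r} is not one of {sorted(options)}")
        # Assumed rule: perturbation modes need a gene symbol to perturb.
        if self.simulation_type != "attractor_landscape" and not self.perturbation_gene:
            raise ValueError("perturbation simulations need a gene symbol")
        return self


config = RevGRNConfig(
    "expression.csv",
    simulation_type="gene_knockout",
    perturbation_gene="GATA1",
).validate()
print(config.max_network_edges)  # 50
```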

Outputs

Network: Ranked list of TF-target regulatory edges with importance scores
TF hub ranking: Top transcription factors by total regulatory influence (sum of outgoing edge importance)
Attractors: Stable cellular states with active gene sets and basin of attraction fractions
Perturbation comparison: Baseline vs. perturbed attractor distribution (if perturbation mode selected)
Network visualization: Interactive force-directed graph with TFs as blue diamonds and target genes as gray nodes, edge width proportional to importance
QC summary: Cell counts, gene counts, HVG count, network edge count, TF count, and attractor count
Downloads: Network CSV (TF, target, importance), attractors CSV (active genes and basin fraction), full results JSON
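As an illustration of working with the downloads, the TF hub ranking (sum of outgoing edge importance) can be recomputed from the network CSV. The exact header spelling `TF,target,importance` is assumed from the column list above, and the rows below are invented toy values, not real RevGRN output.

```python
import csv
import io
from collections import defaultdict


def tf_hub_ranking(network_csv_text):
    """Rank TFs by the sum of their outgoing edge importances."""
    totals = defaultdict(float)
    for row in csv.DictReader(io.StringIO(network_csv_text)):
        totals[row["TF"]] += float(row["importance"])
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


# Toy network file contents, for illustration only.
example = """TF,target,importance
GATA1,KLF1,4.0
GATA1,TAL1,2.5
SPI1,CEBPA,3.0
"""

print(tf_hub_ranking(example))  # [('GATA1', 6.5), ('SPI1', 3.0)]
```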
