RevRisk - Revilico

Why Use This Engine?

In the documentation below, we will use Revilico’s RevRisk engine to compute polygenic risk scores (PRS) from a whole-genome VCF file across 36 built-in disease panels and any score from the PGS Catalog. This engine enables researchers and clinicians to stratify individual genetic risk across a broad range of complex diseases, compare scores against population-level reference distributions, and report standardized percentile rankings and relative risk estimates.

Background

Polygenic risk scores aggregate the small effects of many common genetic variants identified through genome-wide association studies (GWAS) into a single summary statistic that estimates an individual’s inherited predisposition to a given trait or disease. Each variant in a GWAS weight set contributes a dosage-weighted effect size to the total score. Because complex diseases are influenced by thousands of loci, each with a small individual effect, PRS captures the cumulative genetic burden that no single-variant analysis can reveal. PRS has emerged as a clinically relevant tool for identifying individuals at elevated lifetime risk of conditions such as coronary artery disease, type 2 diabetes, and several cancers, often providing predictive value that is independent of and complementary to conventional clinical risk factors.

RevRisk implements a complete PRS pipeline: VCF parsing and quality control, dosage extraction, raw score computation, population-level standardization using Hardy-Weinberg equilibrium-derived statistics, conversion to percentile rank against a UK Biobank reference population, and risk category classification. The engine covers 36 curated built-in disease panels and integrates with the PGS Catalog to support scoring against any of over 4,000 published scores.

VCF Parsing and Quality Control

The engine accepts per-sample genotype VCF files in .vcf, .vcf.gz, or .bcf format. After loading, each variant record passes through the following sequential quality control filters:

PASS filter: Variants with a FILTER field value other than PASS, ".", or empty are excluded. This removes variants flagged as low quality by the variant calling pipeline.
Genotype validity: Variants where the genotype (GT) field contains only missing alleles are excluded.
Read depth: If the DP FORMAT field is present, variants with depth below 10 are excluded to remove poorly supported calls.
Genotype quality: If the GQ FORMAT field is present, variants with genotype quality below 20 are excluded to remove uncertain genotype assignments.

Dosage for each passing variant is computed as the count of non-reference alleles: 0 (homozygous reference), 1 (heterozygous), or 2 (homozygous alternate). Both unphased (/) and phased (|) genotype encodings are handled. Chromosome identifiers are normalized by stripping the “chr” prefix if present and mapping “M” to “MT”.
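
The filters and the dosage rule above can be sketched in a few lines of Python. This is a minimal illustration rather than RevRisk's actual code: the function names are hypothetical, the fields are assumed to have already been pulled out of a VCF record, and the handling of multi-allelic or partially missing genotypes (every called non-"0" allele counts as one) is an assumption not stated in the documentation.

```python
import re

MIN_DP = 10  # read-depth threshold stated above
MIN_GQ = 20  # genotype-quality threshold stated above


def normalize_chrom(chrom):
    """Strip a leading 'chr' prefix and map 'M' to 'MT'."""
    if chrom.startswith("chr"):
        chrom = chrom[3:]
    return "MT" if chrom == "M" else chrom


def dosage_from_gt(gt):
    """Count non-reference alleles in a GT string such as '0/1' or '1|1'.

    Returns None when every allele is missing (for example './.').
    Assumption: any called allele other than '0' counts as one copy.
    """
    alleles = re.split(r"[/|]", gt)  # handles unphased and phased calls
    called = [a for a in alleles if a != "."]
    if not called:
        return None
    return sum(1 for a in called if a != "0")


def passes_qc(filter_field, gt, dp=None, gq=None):
    """Apply the four filters in order; dp and gq are None when absent."""
    if filter_field not in ("PASS", ".", ""):
        return False
    if dosage_from_gt(gt) is None:
        return False
    if dp is not None and dp < MIN_DP:
        return False
    if gq is not None and gq < MIN_GQ:
        return False
    return True
```

Because DP and GQ are only checked when present, a record that lacks both fields is kept as long as its FILTER and GT values are acceptable.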

PRS Scoring

Raw Score

For each disease, the raw polygenic risk score is the weighted sum of per-variant dosage values:

\text{PRS}_{\text{raw}} = \sum_{i=1}^{n} w_i \cdot d_i

Where w_i is the GWAS effect size (log odds ratio or beta coefficient) for variant i and d_i is the observed dosage (0, 1, or 2) from the VCF. When a variant in the disease weight set is not present in the input VCF, it is imputed at the expected dosage under Hardy-Weinberg equilibrium:

d_i^{\text{imputed}} = 2 \cdot \text{EAF}_i

Where \text{EAF}_i is the effect allele frequency in the reference population. This imputation strategy is equivalent to assigning the population mean genotype for that variant and is standard practice in PRS computation with partially matched SNP sets. Variant matching is attempted first by rsID, then by chromosomal position (GRCh38 chromosome:position) as a fallback. Per-disease statistics track how many variants were matched by rsID, matched by position, or imputed.
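
A minimal sketch of the raw score with the rsID-then-position matching order and the HWE imputation fallback might look like the following. The dictionary keys (`rsid`, `chrom`, `pos`, `beta`, `eaf`) and the function name are hypothetical, and the sketch assumes the observed dosage already counts the effect allele; how the engine aligns effect and alternate alleles is not described here.

```python
def raw_prs(weights, dosages_by_rsid, dosages_by_pos):
    """Compute a raw PRS and count how each variant was resolved.

    weights: list of dicts with keys rsid, chrom, pos, beta, eaf (hypothetical layout).
    dosages_by_rsid: {rsid: dosage} built from the QC-passing VCF records.
    dosages_by_pos: {(chrom, pos): dosage} used as the fallback lookup.
    Assumption: dosages already count the effect allele.
    """
    score = 0.0
    stats = {"rsid": 0, "position": 0, "imputed": 0}
    for snp in weights:
        key = (snp["chrom"], snp["pos"])
        if snp["rsid"] in dosages_by_rsid:
            dosage = dosages_by_rsid[snp["rsid"]]
            stats["rsid"] += 1
        elif key in dosages_by_pos:
            dosage = dosages_by_pos[key]
            stats["position"] += 1
        else:
            # Expected dosage under Hardy-Weinberg equilibrium: 2 * EAF
            dosage = 2.0 * snp["eaf"]
            stats["imputed"] += 1
        score += snp["beta"] * dosage
    return score, stats
```

For example, with three weighted variants of which one matches by rsID, one by position, and one is absent from the VCF, `stats` comes back as one count in each bucket.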

Population Standardization

The raw PRS is standardized to a Z-score using theoretical population statistics derived from Hardy-Weinberg equilibrium. The population mean and standard deviation are computed analytically from the effect size and allele frequency information in the GWAS weight set:

\mu = \sum_{i=1}^{n} 2 \cdot \text{EAF}_i \cdot w_i

\sigma = \sqrt{\sum_{i=1}^{n} 2 \cdot \text{EAF}_i \cdot (1 - \text{EAF}_i) \cdot w_i^2}

The Z-score is then:

Z = \frac{\text{PRS}_{\text{raw}} - \mu}{\sigma}

This approach derives the reference distribution from the expected genetic variance under HWE, making the standardization independent of any empirical reference cohort for the raw computation step.
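
The standardization formulas translate directly into code. The sketch below is illustrative only: it assumes the weight set is available as `(w, eaf)` pairs, and the function names are hypothetical.

```python
import math


def hwe_mean_sd(weights):
    """Return (mu, sigma) for a list of (effect_size, eaf) pairs under HWE."""
    mu = sum(2.0 * eaf * w for w, eaf in weights)
    variance = sum(2.0 * eaf * (1.0 - eaf) * w * w for w, eaf in weights)
    return mu, math.sqrt(variance)


def z_score(raw, weights):
    """Standardize a raw PRS against the HWE-derived mean and SD."""
    mu, sigma = hwe_mean_sd(weights)
    if sigma == 0.0:
        raise ValueError("sigma is zero; the weight set carries no variance")
    return (raw - mu) / sigma
```

One consequence worth noting: a variant imputed at 2 · EAF contributes exactly its share of the mean, so a sample in which every variant is imputed has a raw score equal to the mean and a Z-score of 0.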

Percentile and Risk Category

The Z-score is converted to a population percentile using the standard normal cumulative distribution function, calibrated against the UK Biobank European reference population (n = 488,000):

\text{Percentile} = \Phi(Z) \times 100

Percentiles are clamped to the range [0.1, 99.9] to avoid boundary values. Individuals are then classified into one of four risk categories based on their percentile rank:

Category	Percentile Range
Low	Below 20th
Average	20th to 60th
High	60th to 80th
Very High	80th and above
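
The percentile conversion, clamping, and category thresholds can be sketched with the standard library alone. The function names are hypothetical, and the treatment of exact boundary values (20 counts as Average, 60 as High, 80 as Very High) is an assumption where the table does not say which side a boundary falls on.

```python
import math


def percentile_from_z(z):
    """Convert a Z-score to a percentile via the standard normal CDF,
    clamped to [0.1, 99.9]."""
    phi = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
    return min(99.9, max(0.1, phi * 100.0))


def risk_category(percentile):
    """Map a percentile to a risk category.

    Assumption: each boundary value belongs to the higher category.
    """
    if percentile < 20.0:
        return "Low"
    if percentile < 60.0:
        return "Average"
    if percentile < 80.0:
        return "High"
    return "Very High"
```

A Z-score of 0 maps to the 50th percentile and therefore to the Average category, while extreme Z-scores in either direction are clamped to 0.1 or 99.9.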

Relative risk estimates are retrieved from disease-specific quartile lookup tables derived from published GWAS findings. Each disease provides four relative risk values, one for each quartile of the score distribution: the bottom 25%, the 25th to 50th percentile, the 50th to 75th percentile, and the top 25%.
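
A quartile lookup of this kind can be sketched as follows. The function name is hypothetical, the example values are the generic table described under PGS Catalog Integration, and assigning exact boundary percentiles (25, 50, 75) to the higher quartile is an assumption.

```python
# Generic relative risk values (Q1..Q4) from the PGS Catalog Integration section
GENERIC_RR = (0.60, 0.85, 1.20, 2.00)


def relative_risk(percentile, rr_table=GENERIC_RR):
    """Return the relative risk for the quartile containing `percentile`.

    Assumption: boundary percentiles fall into the higher quartile.
    """
    if len(rr_table) != 4:
        raise ValueError("rr_table must hold exactly four quartile values")
    index = min(3, int(percentile // 25))  # 0..3 for Q1..Q4
    return rr_table[index]
```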

Disease Panels

RevRisk includes 36 built-in disease panels with curated GWAS lead SNPs across seven clinical categories. Each panel is linked to a PGS Catalog identifier for reference and includes disease-specific population prevalence and quartile-based relative risk values.

Category	Diseases
Cardiovascular	Coronary Artery Disease, Atrial Fibrillation, Heart Failure, Ischemic Stroke, Hypertension, Venous Thromboembolism
Neurological	Alzheimer’s Disease, Age-Related Macular Degeneration, Parkinson’s Disease, Multiple Sclerosis, Epilepsy, Migraine
Metabolic and Endocrine	Type 2 Diabetes, Type 1 Diabetes, Obesity, Hypothyroidism, PCOS
Oncology	Prostate Cancer, Breast Cancer, Colorectal Cancer, Lung Cancer, Pancreatic Cancer, Melanoma
Inflammatory and Autoimmune	Inflammatory Bowel Disease, Rheumatoid Arthritis, Systemic Lupus Erythematosus, Psoriasis, Celiac Disease
Respiratory and Renal	Asthma, COPD, Chronic Kidney Disease
Psychiatric	Schizophrenia, Bipolar Disorder, Major Depressive Disorder, ADHD

Each SNP record in the built-in panels stores the rsID, GRCh38 chromosome and position, effect allele, other allele, effect size (log OR or beta), and effect allele frequency in the European ancestry reference population.

PGS Catalog Integration

In addition to the 36 built-in panels, RevRisk integrates with the PGS Catalog REST and FTP APIs to support scoring against any of over 4,000 published polygenic scores. Users can search the catalog by trait keyword and select one or more external scores to include alongside the built-in panels. When a custom PGS score is requested, the engine downloads the harmonized GRCh38 scoring file from the PGS Catalog FTP (falling back to the primary scoring file if the harmonized version is unavailable). Scoring files are cached locally for seven days so that repeated analyses of the same score do not require repeated downloads. The downloaded file is parsed to extract the same SNP-level fields used by the built-in panels, and the full PRS pipeline is applied identically. For custom scores where disease-specific relative risk quartile tables are unavailable, a conservative generic relative risk table is applied (Q1: 0.60, Q2: 0.85, Q3: 1.20, Q4: 2.00).
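
The seven-day cache rule can be illustrated with a simple file-age check. This is a sketch of one plausible approach, not the engine's actual caching code: the function name is hypothetical, and using the file's modification time as the download timestamp is an assumption.

```python
import os
import time

CACHE_TTL_SECONDS = 7 * 24 * 60 * 60  # seven days, as stated above


def is_cache_fresh(path, now=None):
    """Return True if `path` exists and was modified within the last seven days.

    Assumption: the file's mtime records when the scoring file was downloaded.
    """
    if not os.path.exists(path):
        return False
    now = time.time() if now is None else now
    return (now - os.path.getmtime(path)) < CACHE_TTL_SECONDS
```

A caller would download the scoring file only when this check returns False, then reuse the local copy on subsequent runs.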

Running the Engine

Inputs

Parameter	Required	Description
`vcf_file`	Yes	Per-sample genotype VCF (`.vcf`, `.vcf.gz`, `.bcf`)
`pgs_ids`	No	JSON array of custom PGS Catalog scores to include (e.g. `[{"pgs_id":"PGS000036","label":"T2D"}]`)

When `pgs_ids` is omitted, all 36 built-in panels are scored using local weight tables with no network calls. When a PGS ID is provided that is not in the built-in set, the scoring file is downloaded from the PGS Catalog FTP and cached for seven days.
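
As an illustration of the `pgs_ids` format, the following sketch parses and checks such a JSON array before submission. The helper name is hypothetical, and two details are assumptions rather than documented behavior: that identifiers follow the PGS Catalog pattern of "PGS" plus six digits, and that `label` may be omitted (in which case the ID is reused as the label).

```python
import json
import re

# Assumption: PGS Catalog score IDs look like "PGS" followed by six digits
PGS_ID_PATTERN = re.compile(r"^PGS\d{6}$")


def parse_pgs_ids(raw):
    """Parse a pgs_ids JSON string into a list of (pgs_id, label) tuples."""
    entries = json.loads(raw)
    if not isinstance(entries, list):
        raise ValueError("pgs_ids must be a JSON array")
    parsed = []
    for entry in entries:
        pgs_id = entry.get("pgs_id", "")
        if not PGS_ID_PATTERN.match(pgs_id):
            raise ValueError(f"unexpected PGS ID: {pgs_id!r}")
        # Assumption: fall back to the ID when no label is supplied
        parsed.append((pgs_id, entry.get("label", pgs_id)))
    return parsed


example = json.dumps([{"pgs_id": "PGS000036", "label": "T2D"}])
print(parse_pgs_ids(example))
```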

Outputs

Upon completion, the engine returns results for each disease scored:

Raw score: Linear combination of effect sizes and genotype dosages
Z-score: Standardized score relative to the HWE-derived population distribution
Percentile: Population rank relative to UK Biobank reference, clamped to [0.1, 99.9]
Risk category: Low, Average, High, or Very High based on percentile thresholds
Relative risk: Quartile-based relative risk estimate from disease-specific lookup tables
SNP match statistics: Counts of variants matched by rsID, matched by position, and imputed per disease

VCF QC statistics are also returned: total variants, variants passing the FILTER field, variants excluded by insufficient depth, variants excluded by insufficient genotype quality, and variants excluded by invalid genotype calls. Alignment statistics summarize matching across all diseases: total PRS SNPs, matched by rsID, matched by position, imputed, and overall match rate. A summary object reports the number of diseases with elevated scores (above the 75th percentile), the number of diseases scoring in the Very High category (80th percentile and above), the disease with the highest percentile, and the maximum percentile observed.
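
The summary fields described above could be derived from per-disease percentiles roughly as follows. The function name and output keys are hypothetical and do not reflect the engine's actual response schema; the Very High count uses a threshold of 80 and above, following the risk category table.

```python
def summarize(percentiles):
    """Build a summary from a {disease: percentile} mapping.

    Key names are illustrative only, not the engine's real output fields.
    """
    if not percentiles:
        raise ValueError("at least one disease result is required")
    top_disease = max(percentiles, key=percentiles.get)
    return {
        "n_elevated": sum(1 for p in percentiles.values() if p > 75.0),
        "n_very_high": sum(1 for p in percentiles.values() if p >= 80.0),
        "top_disease": top_disease,
        "max_percentile": percentiles[top_disease],
    }
```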
