In the documentation below, we will use Revilico’s RevRisk engine to compute polygenic risk scores (PRS) from a whole-genome VCF file across 36 built-in disease panels and any score from the PGS Catalog. This engine enables researchers and clinicians to stratify individual genetic risk across a broad range of complex diseases, compare scores against population-level reference distributions, and report standardized percentile rankings and relative risk estimates.
Polygenic risk scores aggregate the small effects of many common genetic variants identified through genome-wide association studies (GWAS) into a single summary statistic that estimates an individual’s inherited predisposition to a given trait or disease. Each variant in a GWAS weight set contributes a dosage-weighted effect size to the total score. Because complex diseases are influenced by thousands of loci, each with a small individual effect, a PRS captures the cumulative genetic burden that no single-variant analysis can reveal.
PRS has emerged as a clinically relevant tool for identifying individuals at elevated lifetime risk of conditions such as coronary artery disease, type 2 diabetes, and several cancers, often providing predictive value that is independent of and complementary to conventional clinical risk factors.
RevRisk implements a complete PRS pipeline: VCF parsing and quality control, dosage extraction, raw score computation, population-level standardization using Hardy-Weinberg equilibrium-derived statistics, conversion to percentile rank against a UK Biobank reference population, and risk category classification. The engine covers 36 curated built-in disease panels and integrates with the PGS Catalog to support scoring against any of over 4,000 published scores.
The engine accepts per-sample genotype VCF files in .vcf, .vcf.gz, or .bcf format. After loading, each variant record passes through the following sequential quality control filters:
PASS filter: Variants with a FILTER field value other than PASS, ".", or empty are excluded. This removes variants flagged as low quality by the variant calling pipeline.
Genotype validity: Variants where the genotype (GT) field contains only missing alleles are excluded.
Read depth: If the DP FORMAT field is present, variants with depth below 10 are excluded to remove poorly supported calls.
Genotype quality: If the GQ FORMAT field is present, variants with genotype quality below 20 are excluded to remove uncertain genotype assignments.
Dosage for each passing variant is computed as the count of non-reference alleles: 0 (homozygous reference), 1 (heterozygous), or 2 (homozygous alternate). Both unphased (/) and phased (|) genotype encodings are handled. Chromosome identifiers are normalized by stripping the "chr" prefix if present and mapping "M" to "MT".
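The filters and dosage rule above can be sketched for a single uncompressed, single-sample VCF data line. This is a minimal illustration, not RevRisk's implementation: the function names `normalize_chrom` and `dosage_from_record` are hypothetical, the thresholds are the ones quoted in the text, and compressed or BCF input would need a proper VCF library.

```python
from typing import Optional

MIN_DP = 10  # minimum read depth quoted in the text
MIN_GQ = 20  # minimum genotype quality quoted in the text


def normalize_chrom(chrom: str) -> str:
    """Strip a leading 'chr' prefix and map 'M' to 'MT'."""
    if chrom.startswith("chr"):
        chrom = chrom[3:]
    return "MT" if chrom == "M" else chrom


def dosage_from_record(line: str) -> Optional[int]:
    """Return the non-reference allele count for one single-sample VCF line,
    or None if the record fails any of the QC filters."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 10:
        return None  # no FORMAT and sample columns to read a genotype from

    # PASS filter: keep only PASS, ".", or an empty FILTER value.
    if fields[6] not in ("PASS", ".", ""):
        return None

    # Pair FORMAT keys with the sample's values, e.g. {"GT": "0|1", "DP": "30"}.
    sample = dict(zip(fields[8].split(":"), fields[9].split(":")))

    # Genotype validity: treat phased and unphased separators alike.
    alleles = sample.get("GT", ".").replace("|", "/").split("/")
    if all(a == "." for a in alleles):
        return None

    # Read depth and genotype quality are only checked when present.
    dp = sample.get("DP")
    if dp not in (None, ".") and float(dp) < MIN_DP:
        return None
    gq = sample.get("GQ")
    if gq not in (None, ".") and float(gq) < MIN_GQ:
        return None

    # Dosage: number of called alleles that are not the reference ("0").
    return sum(1 for a in alleles if a not in ("0", "."))
```

Returning `None` for a filtered record lets the caller treat it the same way as a variant that is absent from the VCF.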
For each disease, the raw polygenic risk score is the weighted sum of per-variant dosage values:
$$\mathrm{PRS}_{\mathrm{raw}} = \sum_{i=1}^{n} w_i \cdot d_i$$
where $w_i$ is the GWAS effect size (log odds ratio or beta coefficient) for variant $i$ and $d_i$ is the observed dosage (0, 1, or 2) from the VCF. When a variant in the disease weight set is not present in the input VCF, it is imputed at the expected dosage under Hardy-Weinberg equilibrium:
$$d_i^{\mathrm{imputed}} = 2 \cdot \mathrm{EAF}_i$$
where $\mathrm{EAF}_i$ is the effect allele frequency in the reference population. This imputation strategy is equivalent to assigning the population mean genotype for that variant and is standard practice in PRS computation with partially matched SNP sets.
Variant matching is attempted first by rsID, then by chromosomal position (GRCh38 chromosome:position) as a fallback. Per-disease statistics track how many variants were matched by rsID, matched by position, or imputed.
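As a rough sketch of the scoring and matching order described above, the following function sums weighted dosages, falls back from rsID to position, and imputes at twice the effect allele frequency. The `WeightedSnp` record and the two lookup dictionaries are hypothetical structures chosen for illustration, and the sketch assumes the dosages already count the effect allele (allele alignment is not shown).

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Tuple


@dataclass(frozen=True)
class WeightedSnp:
    """One entry of a weight set (hypothetical layout)."""
    rsid: str
    chrom: str
    pos: int
    weight: float  # log odds ratio or beta
    eaf: float     # effect allele frequency


def raw_prs(
    weights: Iterable[WeightedSnp],
    dosage_by_rsid: Dict[str, float],
    dosage_by_pos: Dict[Tuple[str, int], float],
) -> Tuple[float, Dict[str, int]]:
    """Return the raw score and counts of rsID matches, position matches, and imputations."""
    score = 0.0
    counts = {"rsid": 0, "position": 0, "imputed": 0}
    for snp in weights:
        if snp.rsid in dosage_by_rsid:                  # first choice: rsID
            dosage = dosage_by_rsid[snp.rsid]
            counts["rsid"] += 1
        elif (snp.chrom, snp.pos) in dosage_by_pos:     # fallback: chromosome and position
            dosage = dosage_by_pos[(snp.chrom, snp.pos)]
            counts["position"] += 1
        else:                                           # absent: expected dosage under HWE
            dosage = 2.0 * snp.eaf
            counts["imputed"] += 1
        score += snp.weight * dosage
    return score, counts
```

Because an imputed variant contributes exactly its population-mean term, a sample with no matched variants scores at the population mean.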
The raw PRS is standardized to a Z-score using theoretical population statistics derived from Hardy-Weinberg equilibrium. The population mean and standard deviation are computed analytically from the effect size and allele frequency information in the GWAS weight set:
$$\mu = \sum_{i=1}^{n} 2 \cdot \mathrm{EAF}_i \cdot w_i$$
$$\sigma = \sqrt{\sum_{i=1}^{n} 2 \cdot \mathrm{EAF}_i \cdot (1 - \mathrm{EAF}_i) \cdot w_i^2}$$
The Z-score is then:
$$Z = \frac{\mathrm{PRS}_{\mathrm{raw}} - \mu}{\sigma}$$
This approach derives the reference distribution from the expected genetic variance under HWE, making the standardization independent of any empirical reference cohort for the raw computation step.
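The standardization can be sketched directly from the weight set. The helper names `hwe_mean_sd` and `z_score` are hypothetical; the sketch assumes the standard deviation is the square root of the summed per-variant HWE variances, which treats variants as independent.

```python
import math
from typing import Iterable, Tuple


def hwe_mean_sd(weights_and_eafs: Iterable[Tuple[float, float]]) -> Tuple[float, float]:
    """Return (mu, sigma) of the score under HWE from (weight, EAF) pairs."""
    mu = 0.0
    variance = 0.0
    for w, p in weights_and_eafs:
        mu += 2.0 * p * w                      # expected dosage 2p times the weight
        variance += 2.0 * p * (1.0 - p) * w * w  # binomial(2, p) variance times weight squared
    return mu, math.sqrt(variance)


def z_score(raw: float, mu: float, sigma: float) -> float:
    """Standardize a raw score; sigma is zero only for a degenerate weight set."""
    if sigma == 0.0:
        raise ValueError("sigma is zero: weight set has no variance under HWE")
    return (raw - mu) / sigma
```

A raw score equal to `mu` (for example, a sample in which every variant was imputed) maps to a Z-score of 0.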
The Z-score is converted to a population percentile using the standard normal cumulative distribution function, calibrated against the UK Biobank European reference population (n = 488,000):
$$\mathrm{Percentile} = \Phi(Z) \times 100$$
Percentiles are clamped to the range [0.1, 99.9] to avoid boundary values. Individuals are then classified into one of four risk categories based on their percentile rank:
| Category | Percentile Range |
| --- | --- |
| Low | Below 20th |
| Average | 20th to 60th |
| High | 60th to 80th |
| Very High | 80th and above |
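The percentile conversion and the category thresholds can be sketched with the standard library alone. The function names are hypothetical, and the handling of exact boundaries (a percentile of exactly 20, 60, or 80 goes to the higher category) is an assumption where the table does not say.

```python
import math


def percentile_from_z(z: float) -> float:
    """Standard normal CDF times 100, clamped to [0.1, 99.9]."""
    phi = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
    return min(99.9, max(0.1, phi * 100.0))


def risk_category(percentile: float) -> str:
    """Map a percentile to one of the four categories in the table above."""
    if percentile < 20.0:
        return "Low"
    if percentile < 60.0:
        return "Average"
    if percentile < 80.0:
        return "High"
    return "Very High"
```

For example, a Z-score of 0 gives a percentile of 50, which falls in the Average category.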
Relative risk estimates are retrieved from disease-specific quartile lookup tables derived from published GWAS findings. Each disease provides four relative risk values, one for each quartile of the score distribution: below the 25th percentile, 25th to 50th, 50th to 75th, and above the 75th percentile.
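A quartile lookup of this kind can be sketched as an index into a four-element table. The function name is hypothetical, boundary percentiles (exactly 25, 50, or 75) are assumed to fall in the higher quartile, and the default table is the generic one this documentation quotes for custom scores, since per-disease values are not listed here.

```python
from typing import Sequence

# Generic quartile table quoted in this documentation for custom scores (Q1..Q4).
GENERIC_RR = (0.60, 0.85, 1.20, 2.00)


def relative_risk(percentile: float, quartile_rr: Sequence[float] = GENERIC_RR) -> float:
    """Return the relative risk for the quartile that contains `percentile`."""
    if len(quartile_rr) != 4:
        raise ValueError("expected exactly four quartile values")
    # 0-25 -> index 0, 25-50 -> 1, 50-75 -> 2, 75 and above -> 3.
    index = min(int(percentile // 25), 3)
    return quartile_rr[index]
```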
RevRisk includes 36 built-in disease panels with curated GWAS lead SNPs across seven clinical categories. Each panel is linked to a PGS Catalog identifier for reference and includes disease-specific population prevalence and quartile-based relative risk values.
Schizophrenia, Bipolar Disorder, Major Depressive Disorder, ADHD
Each SNP record in the built-in panels stores the rsID, GRCh38 chromosome and position, effect allele, other allele, effect size (log OR or beta), and effect allele frequency in the European ancestry reference population.
In addition to the 36 built-in panels, RevRisk integrates with the PGS Catalog REST and FTP APIs to support scoring against any of over 4,000 published polygenic scores. Users can search the catalog by trait keyword and select one or more external scores to include alongside the built-in panels.
When a custom PGS score is requested, the engine downloads the harmonized GRCh38 scoring file from the PGS Catalog FTP (falling back to the primary scoring file if the harmonized version is unavailable). Scoring files are cached locally for seven days so that repeated analyses of the same score do not require repeated downloads. The downloaded file is parsed to extract the same SNP-level fields used by the built-in panels, and the full PRS pipeline is applied identically. For custom scores where disease-specific relative risk quartile tables are unavailable, a conservative generic relative risk table is applied (Q1: 0.60, Q2: 0.85, Q3: 1.20, Q4: 2.00).
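The seven-day cache can be illustrated with a freshness check on the file's modification time. This is only a sketch of the idea: the cache layout (`<cache_dir>/<PGS ID>.txt.gz`) and the function name are assumptions, and the download step itself is not shown.

```python
import time
from pathlib import Path
from typing import Optional

CACHE_TTL_SECONDS = 7 * 24 * 60 * 60  # seven days, as stated in the text


def cached_scoring_file(cache_dir: Path, pgs_id: str, now: Optional[float] = None) -> Optional[Path]:
    """Return the cached scoring file if it exists and is under seven days old, else None."""
    path = cache_dir / f"{pgs_id}.txt.gz"  # hypothetical file naming
    if not path.exists():
        return None
    current = time.time() if now is None else now
    age = current - path.stat().st_mtime
    return path if age < CACHE_TTL_SECONDS else None
```

A caller would download and store the scoring file only when this returns `None`, which also refreshes the modification time.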
pgs_ids: JSON array of custom PGS Catalog scores to include (e.g. [{"pgs_id":"PGS000036","label":"T2D"}])
When pgs_ids is omitted, all 36 built-in panels are scored using local weight tables with no network calls. When a PGS ID is provided that is not in the built-in set, the scoring file is downloaded from the PGS Catalog FTP and cached for 7 days.
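One way to picture this behavior is a helper that parses the JSON array and separates scores already covered by local weight tables from those that would need a download. The function name and the `builtin_ids` argument are hypothetical; the identifiers of the built-in panels are not listed in this documentation, so the caller must supply them.

```python
import json
from typing import Dict, List, Optional, Set, Tuple

Entry = Dict[str, str]


def split_requested_scores(
    pgs_ids_json: Optional[str], builtin_ids: Set[str]
) -> Tuple[List[Entry], List[Entry]]:
    """Split requested scores into (local, remote) lists.

    An omitted or empty parameter yields two empty lists, meaning only the
    built-in panels are scored and no network call is needed.
    """
    if not pgs_ids_json:
        return [], []
    entries = json.loads(pgs_ids_json)
    local: List[Entry] = []
    remote: List[Entry] = []
    for entry in entries:
        # Each entry is expected to carry a "pgs_id" key, as in the example above.
        target = local if entry["pgs_id"] in builtin_ids else remote
        target.append(entry)
    return local, remote
```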
Upon completion, the engine returns results for each disease scored:
Raw score: Linear combination of effect sizes and genotype dosages
Z-score: Standardized score relative to the HWE-derived population distribution
Percentile: Population rank relative to UK Biobank reference, clamped to [0.1, 99.9]
Risk category: Low, Average, High, or Very High based on percentile thresholds
Relative risk: Quartile-based relative risk estimate from disease-specific lookup tables
SNP match statistics: Counts of variants matched by rsID, matched by position, and imputed per disease
VCF QC statistics are also returned: total variants, variants passing the FILTER field, variants excluded by insufficient depth, variants excluded by insufficient genotype quality, and variants excluded by invalid genotype calls.
Alignment statistics summarize matching across all diseases: total PRS SNPs, matched by rsID, matched by position, imputed, and overall match rate.
A summary object reports the number of diseases with elevated scores (above the 75th percentile), the number of diseases scoring in the Very High category (above the 80th percentile), the disease with the highest percentile, and the maximum percentile observed.
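The summary object could be derived from the per-disease results along these lines. The field names (`n_elevated`, `n_very_high`, `top_disease`, `max_percentile`) and the input layout are hypothetical, and the Very High count uses "80 and above" to match the risk category table.

```python
from typing import Any, Dict, List


def summarize(results: List[Dict[str, Any]]) -> Dict[str, Any]:
    """Summarize per-disease results, each a dict with 'disease' and 'percentile'."""
    if not results:
        return {"n_elevated": 0, "n_very_high": 0, "top_disease": None, "max_percentile": None}
    top = max(results, key=lambda r: r["percentile"])
    return {
        # Elevated: strictly above the 75th percentile.
        "n_elevated": sum(1 for r in results if r["percentile"] > 75.0),
        # Very High: 80th percentile and above, as in the category table.
        "n_very_high": sum(1 for r in results if r["percentile"] >= 80.0),
        "top_disease": top["disease"],
        "max_percentile": top["percentile"],
    }
```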