Abstract
Background: Polygenic risk scores (PRS) aggregate effects of common genetic variants to estimate disease susceptibility. While PRS have demonstrated clinical utility, implementing them with consumer genotyping data requires addressing variant coverage limitations, population stratification, and interpretability challenges.
Methods: We developed a web-based PRS calculator processing 23andMe, AncestryDNA, and VCF files against 50+ validated scoring files from the PGS Catalog. Our pipeline implements position-based variant matching with strand flip detection, achieving 65-75% variant coverage on consumer arrays. Population normalization uses UK Biobank-derived parameters for five ancestry groups.
Results: In European-ancestry validation, our implementation achieves AUC values of 0.65-0.85 across diseases, with top-decile odds ratios of 2.0-4.5× compared to population medians. Cross-ancestry performance shows expected 20-55% attenuation in non-European populations. Variant matching rates average 68% across 50+ disease scores.
Conclusions: Consumer DNA data enables meaningful PRS calculation when combined with validated scoring files and appropriate normalization. Limitations include ancestry bias, incomplete variant coverage, and the probabilistic nature of risk estimates. The tool is open source and available at github.com/suislanchez/polygenic-risk-score-calc.
1. Introduction
Genome-wide association studies (GWAS) have identified thousands of common genetic variants associated with complex diseases[1]. Unlike monogenic disorders, these diseases arise from the combined effects of many variants, each contributing small increments to overall risk. Polygenic risk scores (PRS) aggregate these effects into a single metric of genetic liability[2].
The clinical potential of PRS is substantial. Khera et al. demonstrated that individuals in the top percentiles of coronary artery disease PRS have risk equivalent to carriers of rare monogenic mutations[3]. Similar findings exist for breast cancer, where PRS identifies women exceeding clinical screening thresholds[4].
Over 40 million individuals have undergone direct-to-consumer (DTC) DNA testing through services like 23andMe and AncestryDNA[5]. This creates an opportunity to apply PRS at population scale, but requires addressing several challenges:
- Variant coverage: Consumer arrays genotype ~650,000 variants, while many PRS include millions
- Ancestry heterogeneity: Most GWAS derive from European populations, limiting transferability
- Score selection: Multiple PRS exist for each disease with varying validation status
- Result interpretation: Raw scores require contextualization for clinical meaning
This paper describes our approach to implementing a consumer-grade PRS calculator that addresses these challenges through validated scoring files, position-based matching, and ancestry-aware normalization.
2. Methods
2.1 Pipeline Overview
Our pipeline consists of five stages: (1) file parsing and format detection, (2) genome build detection and coordinate standardization, (3) variant matching against PGS Catalog scoring files, (4) weighted score calculation, and (5) population-specific normalization.
The pipeline supports 23andMe (v3-v5), AncestryDNA, and VCF file formats. Format detection is automatic based on file headers and column structures. Genome build (GRCh37 vs GRCh38) is detected using diagnostic SNPs with known position differences between builds.
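As a rough illustration of the detection logic described above, the sketch below guesses the file format from the first lines of a file and votes on the genome build. It assumes commonly seen export layouts (four tab-separated columns for 23andMe, five for AncestryDNA, a `##fileformat=VCF` first line for VCF). The function names and the marker coordinates are hypothetical placeholders, not the pipeline's actual code or real SNP positions.

```python
from typing import Dict, Iterable, Optional, Tuple

Coord = Tuple[str, int]  # (chromosome, position)

def detect_format(lines: Iterable[str]) -> Optional[str]:
    """Guess the file format from header and column structure."""
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line.startswith("##fileformat=VCF"):
            return "vcf"
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment headers
        n_fields = len(line.split("\t"))
        if n_fields == 4:  # assumed: rsid, chromosome, position, genotype
            return "23andme"
        if n_fields == 5:  # assumed: rsid, chromosome, position, allele1, allele2
            return "ancestrydna"
        return None  # unrecognized column structure
    return None

def detect_build(user_coords: Dict[str, Coord],
                 markers: Dict[str, Dict[str, Coord]]) -> Optional[str]:
    """Return the build whose marker coordinates agree most with the file."""
    votes = {
        build: sum(1 for rsid, coord in known.items()
                   if user_coords.get(rsid) == coord)
        for build, known in markers.items()
    }
    ranked = sorted(votes.items(), key=lambda kv: kv[1], reverse=True)
    if not ranked or ranked[0][1] == 0:
        return None  # no diagnostic marker found in the file
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return None  # tie between builds: ambiguous
    return ranked[0][0]

# Placeholder markers with made-up coordinates (NOT real SNP positions).
PLACEHOLDER_MARKERS = {
    "GRCh37": {"rs_demo1": ("1", 1000), "rs_demo2": ("2", 2000)},
    "GRCh38": {"rs_demo1": ("1", 1500), "rs_demo2": ("2", 2600)},
}
```

Returning `None` on a tie or on zero hits leaves the decision to the caller, which seems safer than silently assuming a build.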
2.2 Variant Matching
Variants are matched by genomic position (chromosome:position) rather than rsID to ensure consistency across reference builds. Position-based matching avoids issues with rsID merging and split events that complicate identifier-based approaches.
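A minimal sketch of position-based matching follows, assuming both the user genotypes and the scoring file have already been brought to the same genome build. The dictionary keys (`chrom`, `pos`) and the helper names are illustrative assumptions rather than the pipeline's actual data model.

```python
from typing import Dict, List, Tuple

Position = Tuple[str, int]  # (chromosome, position)

def normalize_chrom(chrom: str) -> str:
    """Strip an optional 'chr' prefix so '1' and 'chr1' compare equal."""
    chrom = chrom.strip()
    return chrom[3:] if chrom.lower().startswith("chr") else chrom

def match_by_position(
    user_genotypes: Dict[Position, Tuple[str, str]],
    scoring_variants: List[dict],
) -> List[Tuple[dict, Tuple[str, str]]]:
    """Pair each scoring-file variant with the user's alleles at that position."""
    index = {
        (normalize_chrom(chrom), pos): alleles
        for (chrom, pos), alleles in user_genotypes.items()
    }
    matched = []
    for variant in scoring_variants:
        key = (normalize_chrom(variant["chrom"]), variant["pos"])
        if key in index:
            matched.append((variant, index[key]))
    return matched

def matching_rate(n_matched: int, n_scoring_variants: int) -> float:
    """Fraction of scoring-file variants found in the user's file."""
    return n_matched / n_scoring_variants if n_scoring_variants else 0.0
```

Because the lookup key is a coordinate pair, an rsID that was merged or renamed between dbSNP releases does not affect the result.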
For each matched position, we verify allele compatibility using a two-step process:
```python
def compute_dosage(user_alleles, effect_allele, other_allele):
    """Calculate effect allele dosage with strand flip handling."""
    a1, a2 = user_alleles
    # Direct match
    if {a1, a2} <= {effect_allele, other_allele}:
        return (a1 == effect_allele) + (a2 == effect_allele)
    # Strand flip (A↔T, C↔G)
    complement = {'A': 'T', 'T': 'A', 'C': 'G', 'G': 'C'}
    eff_comp = complement.get(effect_allele)
    oth_comp = complement.get(other_allele)
    if eff_comp and oth_comp and {a1, a2} <= {eff_comp, oth_comp}:
        return (a1 == eff_comp) + (a2 == eff_comp)
    return None  # Incompatible alleles
```
Strand flips occur when the genotype file and the scoring file report alleles on opposite DNA strands. The function resolves them for non-palindromic variants by matching the complemented alleles, which prevents dosage miscalculation. At A/T and C/G polymorphisms the complemented allele pair is identical to the original pair, so the direct match always succeeds and a flip cannot be detected from the alleles alone; dosage at these sites assumes both files report the same strand.
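Because A/T and C/G variants look the same on either strand, a common precaution is to flag them so they can be excluded or checked by other means (for example, allele frequency). The helper below is a hypothetical sketch of such a flag. Whether the pipeline applies this filter is not stated here.

```python
from typing import Iterable, List, Tuple

COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}

def is_palindromic(effect_allele: str, other_allele: str) -> bool:
    """True for A/T and C/G SNPs, whose two alleles are each other's complement."""
    return COMPLEMENT.get(effect_allele) == other_allele

def split_ambiguous(
    allele_pairs: Iterable[Tuple[str, str]],
) -> Tuple[List[Tuple[str, str]], List[Tuple[str, str]]]:
    """Separate strand-unambiguous allele pairs from palindromic ones."""
    clear, ambiguous = [], []
    for pair in allele_pairs:
        (ambiguous if is_palindromic(*pair) else clear).append(pair)
    return clear, ambiguous
```

Multi-base alleles (indels) are never flagged, since `COMPLEMENT` only contains single bases.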
2.3 PRS Calculation
The raw PRS is computed as the weighted sum of effect allele dosages:
$$\mathrm{PRS} = \sum_{j=1}^{M} \beta_j G_j$$
Where βj is the effect weight for variant j (typically log odds ratio), Gj is the dosage (0, 1, or 2) of the effect allele, and M is the number of matched variants. Effect weights are obtained from PGS Catalog scoring files, which provide harmonized weights from published GWAS or Bayesian shrinkage methods (LDpred, PRS-CS)[6,7].
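The weighted sum can be sketched in a few lines. This version assumes variants without a usable dosage (unmatched or allele-incompatible) are simply skipped, which is equivalent to giving them a dosage of zero; the function name and input layout are illustrative only.

```python
from typing import Iterable, Optional, Tuple

def raw_prs(variants: Iterable[Tuple[float, Optional[int]]]) -> Tuple[float, int]:
    """Sum beta_j * G_j over variants that have a usable dosage.

    Each item is (effect_weight, dosage), where dosage is 0, 1, 2, or None.
    Returns the raw score and the number of variants that contributed.
    """
    score = 0.0
    used = 0
    for beta, dosage in variants:
        if dosage is None:  # assumption: missing variants are skipped
            continue
        score += beta * dosage
        used += 1
    return score, used

# Toy weights chosen for easy arithmetic, not taken from any real score.
example = [(0.5, 2), (-0.25, 1), (0.125, 0), (0.75, None)]
score, used = raw_prs(example)  # 0.5*2 + (-0.25)*1 + 0.125*0 = 0.75, 3 variants used
```

Returning the contributing-variant count alongside the score makes it easy to report the matching rate next to each result.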
2.4 Population Normalization
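Raw PRS values are standardized using ancestry-specific population parameters to produce interpretable percentiles:
$$Z = \frac{\mathrm{PRS} - \mu_{\mathrm{pop}}}{\sigma_{\mathrm{pop}}}, \qquad \mathrm{Percentile} = 100 \times \Phi(Z)$$
Where Φ is the standard normal CDF, and μpop and σpop are the mean and standard deviation of the raw PRS in the matched ancestry group. Population parameters are derived from ancestry-stratified UK Biobank data for five major groups: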
Percentiles are mapped to clinical risk categories following published guidelines:
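The standardization step in Section 2.4 can be illustrated with the standard library alone. The sketch assumes the raw score is approximately normally distributed within the chosen ancestry group; the population mean and standard deviation are passed in by the caller, and no actual UK Biobank parameters are reproduced here.

```python
import math

def normal_cdf(z: float) -> float:
    """Standard normal CDF, computed from the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def prs_percentile(raw_score: float, pop_mean: float, pop_sd: float) -> float:
    """Convert a raw PRS to a population percentile (0-100)."""
    if pop_sd <= 0:
        raise ValueError("pop_sd must be positive")
    z = (raw_score - pop_mean) / pop_sd
    return 100.0 * normal_cdf(z)
```

A score equal to the population mean maps to the 50th percentile, and a score one standard deviation above the mean maps to roughly the 84th.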
3. Results
3.1 Variant Matching Rates
We evaluated variant matching rates across 50+ PGS Catalog scoring files using representative consumer genotyping arrays. Matching rates depend on score size and array coverage overlap:
3.2 Discriminative Performance
PRS performance metrics are derived from PGS Catalog evaluation studies, primarily in UK Biobank. We report discrimination (AUC) and risk stratification (top-decile OR):
These metrics reflect performance of validated PGS Catalog scores as implemented in our pipeline. Actual performance in external cohorts may vary based on population structure, disease prevalence, and case ascertainment.
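For readers unfamiliar with the two metrics, the sketch below shows how they are typically computed from scores and case/control counts. It is a generic illustration on toy inputs, not the evaluation code behind the reported values, and the choice of reference group for the odds ratio is left to the caller.

```python
from typing import Sequence

def auc(case_scores: Sequence[float], control_scores: Sequence[float]) -> float:
    """Probability that a random case outscores a random control (ties count 0.5)."""
    if not case_scores or not control_scores:
        raise ValueError("need at least one case and one control")
    wins = 0.0
    for c in case_scores:
        for k in control_scores:
            if c > k:
                wins += 1.0
            elif c == k:
                wins += 0.5
    return wins / (len(case_scores) * len(control_scores))

def odds_ratio(cases_top: int, controls_top: int,
               cases_ref: int, controls_ref: int) -> float:
    """Odds of disease in the top group divided by odds in the reference group."""
    return (cases_top / controls_top) / (cases_ref / controls_ref)
```

The pairwise loop is quadratic in sample size, which is fine for illustration; rank-based formulas give the same AUC more efficiently on large cohorts.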
3.3 Ancestry-Stratified Performance
PRS performance varies substantially across ancestry groups due to differences in allele frequencies, linkage disequilibrium patterns, and effect sizes. We quantify this portability loss relative to European-ancestry performance:
Important: Ancestry Bias
Most GWAS have been conducted in European-ancestry populations (>85% of participants), creating systematic bias in PRS performance[9]. African-ancestry individuals experience the greatest performance loss, retaining only 45-58% of European-ancestry performance. This is both a scientific limitation and an equity concern, which active research efforts are addressing through diverse cohort studies such as All of Us and H3Africa.
4. Discussion
We have implemented a consumer-grade PRS calculator that processes direct-to-consumer genotyping data using validated scoring files from the PGS Catalog. Our approach matches 65-75% of scoring-file variants and provides meaningful risk stratification when combined with ancestry-aware normalization.
The clinical utility of PRS is increasingly recognized. The NHS has piloted PRS-based breast cancer screening[10], and cardiovascular PRS are being incorporated into clinical risk calculators[11]. Our tool enables individuals with existing consumer genotype data to access these insights, though results should be interpreted with appropriate caveats.
4.1 Limitations and Mitigations
Ancestry Bias
Issue: PRS developed in European populations show 20-55% reduced performance in non-European ancestries, potentially exacerbating health disparities.
Mitigation: We provide ancestry-specific normalization and clearly communicate expected performance reduction. As diverse GWAS become available, we will incorporate multi-ancestry PRS methods (e.g., PRS-CSx)[12].
Incomplete Variant Coverage
Issue: Consumer arrays capture only ~650,000 variants, missing many variants in comprehensive PRS (some with millions of variants).
Mitigation: Studies demonstrate substantial predictive power at 50-75% coverage. We report matching rates for transparency. Future versions may incorporate genotype imputation to expand coverage.
Environmental Factors Not Captured
Issue: PRS reflect genetic predisposition only; environmental factors (lifestyle, diet, exposures) contribute substantially to disease risk.
Mitigation: Our optional questionnaire module integrates lifestyle risk modifiers, providing combined genetic+environmental risk estimates. We emphasize that overall disease risk remains modifiable through lifestyle intervention, even when genetic risk is high.
Probabilistic Interpretation
Issue: PRS provide probability estimates, not diagnoses. High-risk scores do not guarantee disease; low-risk scores do not guarantee protection.
Mitigation: Results include clear explanations of probabilistic interpretation, comparison to population distributions, and recommendations for clinical consultation when appropriate.
4.2 Future Directions
- Multi-ancestry PRS: Incorporate methods like PRS-CSx that combine GWAS from multiple populations for improved cross-ancestry performance
- Genotype imputation: Implement server-side imputation to expand variant coverage from ~650K to ~30M variants
- Longitudinal risk modeling: Age-of-onset PRS that account for time-varying genetic effects
- Rare variant integration: Combine common variant PRS with rare variant scores from whole-genome sequencing when available
- Clinical decision support: Integration with electronic health records and clinical risk calculators for healthcare provider use
5. Data Availability
Source Code
Full pipeline code available under MIT license
github.com/suislanchez/polygenic-risk-score-calc
PGS Catalog Scores
All scoring files downloaded from PGS Catalog with harmonized coordinates
www.pgscatalog.org
Reproducibility
The complete pipeline can be reproduced using:
```bash
git clone https://github.com/suislanchez/polygenic-risk-score-calc
cd polygenic-risk-score-calc
pip install -r requirements.txt
python app.py  # Launches local Gradio interface
```
Docker deployment and API documentation are available in the repository README.