# Migration Scenario Engine — Full LLM Reference Document

This is the long, self-contained reference. Every fact an AI assistant
needs to summarize, evaluate or cite the project is in this single file.
No further URL fetches required.

Site: https://migrationengine.org/
Author: Christian Rogalski, M.Sc.
Contact: rogalski.academic@pm.me
License: CC BY-NC 4.0
Cite as: Rogalski, C. (2026). Migration Scenario Engine: Global
Bilateral Migration Projections under Climate Scenarios, 2020–2100.
CERIFR Research. https://migrationengine.org

---

## 1. Project summary

The Migration Scenario Engine (MSE) is an open scientific platform that
projects global bilateral migration flows under climate scenarios for
the period 2020–2100. It combines econometric panel models with a
machine-learning ensemble to identify nonlinear climate thresholds and
structural drivers of cross-border mobility, and then projects flows
under alternative climate and governance pathways through 2100.

Headline numbers:
- 195 countries × ~52,670 directed bilateral corridors × 5-year periods
- 316,020 dyad-period observations in the historical training panel
  (1990–2015), 99.57% completeness on core variables
- 109 engineered predictors in 12 groups
- 4 IPCC Shared Socioeconomic Pathways (SSP1, SSP2, SSP3, SSP5)
  × 5 narrative scenarios per SSP
- Stacking ensemble OOF R² = 0.826 (within 0.1 pp of the temporal
  autocorrelation ceiling r² = 0.827)

The platform is operated entirely at the author's own expense, without
external funding, sponsorship, or institutional support; it represents
no lobby, no commercial interest, and no institutional agenda. All
analyses reflect independent research.

## 2. Author and institutional context

**Christian Rogalski, M.Sc.** — External Doctoral Researcher and
economist. Affiliation: Alexandru Ioan Cuza University of Iași
(Romania), Faculty of Economics and Business Administration, Department
of Statistics and Cybernetics. Topics: climate migration, machine
learning, deep learning and AI, automation of data pipelines,
econometrics, scenario modelling.

The doctoral research that this dashboard accompanies asks: *How will
climate change reshape international migration over the coming decades?*
The platform is the author's contribution toward answering that
question — transparently, independently, and with the strongest
quantitative methods available.

The author plans to expand the platform with crisis-mode monitoring,
interactive conflict detection from geospatial and social-media data,
and integration of satellite imagery and nightlight proxies. A second
interactive dashboard on a separate research topic is in preparation.

## 3. Methodology — full pipeline

### 3.1 Target variable

For each origin–destination dyad (od) and 5-year period t:

  r_{od,t}  = 1000 · M_{od,t} / (Pop_{o,t} + 1e-6)
  y_{od,t}  = log(1 + max(0, r_{od,t}))

Where M_{od,t} is bilateral migration (Abel & Cohen 2019, calibrated
against UN DESA IMS 2024 stocks), r_{od,t} is the flow rate per 1,000
origin population, and y_{od,t} is the log-transformed rate that all
component models predict.
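
As a minimal sketch of the two formulas above (plain Python; the function names and the example corridor are illustrative and not taken from the project's code):

```python
import math

EPS = 1e-6  # small constant from the rate definition; guards against Pop = 0


def flow_rate(migrants: float, origin_pop: float) -> float:
    """Migrants per 1,000 origin population: r = 1000 * M / (Pop + 1e-6)."""
    return 1000.0 * migrants / (origin_pop + EPS)


def log_rate(rate: float) -> float:
    """Model target: y = log(1 + max(0, r))."""
    return math.log1p(max(0.0, rate))


# Hypothetical corridor: 50,000 migrants from an origin of 10 million people.
r = flow_rate(50_000, 10_000_000)  # roughly 5 per 1,000
y = log_rate(r)                    # roughly log(6)
```

The `max(0, ·)` guard means any negative rate maps to a target of exactly 0.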

### 3.2 Stacking ensemble

Three base learners are stacked via a Ridge meta-learner on the
out-of-fold (OOF) prediction matrix [N × 3]:

  Model         | Type                              | OOF R² | Weight
  --------------|-----------------------------------|--------|-------
  GAM           | mgcv::gam, REML, thin-plate sp.   | 0.795  | 0.317
  Random Forest | ranger, 500 trees                 | 0.804  | 0.333
  XGBoost       | 500 rounds, η=0.01, depth=6       | 0.813  | 0.350
  Stacking      | Ridge meta-learner (glmnet, α=0)  | 0.826  | —

  ŷ_ensemble(x) = Σ_m w_m · ŷ_m(x) / Σ_m w_m
  m ∈ {GAM, RF, XGB}

Weight rationale: XGBoost receives the highest weight (35.0%) because
it has the highest standalone R² (0.813). RF (33.3%) contributes bagged
tree diversity (R² = 0.804). GAM (31.7%) adds smooth spline trends
(R² = 0.795) that tree models miss. The three weights sit in a tight
band (σ ≈ 1.5 pp), so no single learner dominates; the gain of the
stack over the best base learner (0.826 vs. 0.813) is the evidence of
algorithmic complementarity. The ensemble R² of 0.826 is 99.9% of the
temporal autocorrelation ceiling (r² = 0.827).
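
The weighted average can be sketched as follows. The weights are the ones from the table; the function name, the example predictions, and the choice to renormalise over whichever models are present (as would be needed when only GAM and RF are available) are assumptions made for illustration.

```python
# Ensemble weights from the table above.
WEIGHTS = {"GAM": 0.317, "RF": 0.333, "XGB": 0.350}


def ensemble_predict(preds: dict, weights: dict = WEIGHTS) -> float:
    """Weighted mean of base-learner predictions, normalised by the weight sum.

    Only the models present in `preds` enter the sum (an assumption of this
    sketch), so the denominator adapts when a learner is missing.
    """
    total = sum(weights[m] for m in preds)
    return sum(weights[m] * preds[m] for m in preds) / total


# Hypothetical base-learner outputs on the log(1 + rate) scale.
example = {"GAM": 0.40, "RF": 0.50, "XGB": 0.60}
y_hat = ensemble_predict(example)  # about 0.5033
```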

### 3.3 GAM specification

  log(1 + r_{od,t}) =
       Σ_{j=1..7} s_j(x_{j,od,t})           [smooth terms]
     + Σ_{k=1..12} β_k · z_{k,od,t}         [linear terms]
     + ε_{od,t}

Smooth terms (k = basis dimension of each smooth, not the linear-term
index k used above):

  s_1(tmp_anom, k=5),   s_2(pre_anom, k=5),
  s_3(GDP_o,   k=4),    s_4(GDP_d,    k=4),
  s_5(agr_o,   k=4),    s_6(agr_d,    k=4),
  s_7(log dist_od, k=4)

Linear terms:

  β_1·wgi_ge_o + β_2·wgi_rl_o + β_3·wgi_ge_d + β_4·wgi_rl_d
  + β_5·war_onset + β_6·major_conflict
  + β_7·drought + β_8·flood
  + β_9·diaspora_{od,t-1}
  + β_10·contig + β_11·comlang + β_12·colony

7 nonlinear + 12 linear terms; estimated via mgcv::gam with penalised
thin-plate regression splines, REML.

### 3.4 Feature engineering

109 predictors in 12 groups, built in two layers: v1 (interaction
terms) and v2 (temporal and network features).

v1 interactions (selected):
  climate_income     = tmp_anomaly × log(gdp + 1)
  climate_gov        = tmp_anomaly × wgi_ge
  conflict_crisis    = war_onset × gdp_crisis
  gravity            = log(Pop_o) × log(Pop_d) / exp(log_dist)

v2 (~30 additional features, all use only prior-period (t–1) data; no
data leakage):

  Category                | Features                                                                        | Leakage guard
  ------------------------|---------------------------------------------------------------------------------|---------------
  Lag features            | flow_rate_lag1, log_flow_rate_lag1, momentum_lag, total_migration_lag1          | t–1
  Corridor shares         | corridor_share_of_emigration_lag, corridor_share_of_immigration_lag             | t–1
  Network centrality      | pagerank, degree, strength, betweenness, hub/authority (orig + dest)            | t–1 network
  Diaspora dynamics       | diaspora_growth                                                                 | t–1
  Temporal interactions   | climate_x_past_flow, conflict_x_past_flow, hub_authority_interaction            | t–1

R² lift from v2: without v2, individual models reach ~0.20–0.55; with
v2, ~0.80–0.81 (0.795–0.813 in the Section 3.2 table). Current-period
aggregates are removed after computation.
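
The t–1 leakage guard for lag features can be illustrated with a small sketch. The data layout (one dict of period → rate per corridor) and the helper name are hypothetical; only the idea of reading the previous 5-year period comes from the table above.

```python
from typing import Optional

PERIOD_STEP = 5  # years between panel periods


def lag1(series: dict) -> dict:
    """Map each period t to the value observed at t - 5.

    Periods without a predecessor get None, so no current-period
    information can enter the feature.
    """
    lagged: dict = {}
    for t in series:
        value: Optional[float] = series.get(t - PERIOD_STEP)
        lagged[t] = value
    return lagged


# Hypothetical flow rates for one corridor.
rates = {1990: 2.0, 1995: 2.5, 2000: 3.1}
flow_rate_lag1 = lag1(rates)  # {1990: None, 1995: 2.0, 2000: 2.5}
```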

### 3.5 Cross-validation

Expanding-window (rolling-origin), strictly time-based. 6 historical
periods (1990, 1995, 2000, 2005, 2010, 2015) yield 5 expanding folds.
Models are fully retrained per fold (no data leakage):

  Fold | Training       | Test         | R²    | RMSE  | N (test) | N Models
  -----|----------------|--------------|-------|-------|----------|---------
  1    | 1990           | 1995–2015    | 0.373 | 0.229 | 253,770  | 2 (GAM+RF)
  2    | 1990–1995      | 2000–2015    | 0.810 | 0.126 | 203,016  | 3
  3    | 1990–2000      | 2005–2015    | 0.834 | 0.118 | 152,262  | 3
  4    | 1990–2005      | 2010–2015    | 0.831 | 0.119 | 101,508  | 3
  5    | 1990–2010      | 2015         | 0.855 | 0.111 | 50,754   | 3
  Pool | All prior      | All held-out | 0.826 | 0.121 | 507,540  | —

The learning curve rises from Fold 2 onward (0.810 → 0.855), with a
marginal dip at Fold 4 (0.834 → 0.831); this pattern is consistent with
genuine learning rather than overfitting. Fold 1 is intentionally weak
due to its single-period training set. Reported metrics: R², RMSE, MAE,
MedAE, MAPE, Pearson r.
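
The fold layout can be reproduced with a short sketch. It assumes that every test period contributes the same number of rows (50,754, the Fold 5 test size), which is consistent with the N column of the table; the function name is illustrative.

```python
PERIODS = [1990, 1995, 2000, 2005, 2010, 2015]
ROWS_PER_TEST_PERIOD = 50_754  # Fold 5 test size (single period, 2015)


def expanding_folds(periods):
    """Yield (train_periods, test_periods); the training window grows by one period."""
    for i in range(1, len(periods)):
        yield periods[:i], periods[i:]


folds = list(expanding_folds(PERIODS))

# Test-set sizes implied by the equal-rows-per-period assumption.
test_sizes = [ROWS_PER_TEST_PERIOD * len(test) for _, test in folds]
# [253770, 203016, 152262, 101508, 50754]
```

Under that assumption the Fold 2–5 test sizes add up to 507,540, which matches the pooled N in the table, so Fold 1 appears not to enter the pooled row.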

### 3.6 Shock models

  Shock                              | Threshold
  -----------------------------------|--------------------------------------------------
  war_onset                          | Battle deaths ≥ 1,000 + Onset
  major_conflict                     | Battle deaths ≥ P90
  catastrophic_drought/flood/storm   | Event count ≥ P95
  gdp_crisis                         | 5-year GDP growth < –5%
  governance_breakdown               | min(Δwgi_ge, Δwgi_rl) < –0.5

Each shock has its own probabilistic ensemble:

  p̂(x) = σ( (1/3)·[ logit p_GLM(x) + logit p_GAM(x) + logit p_RF(x) ] )
  σ(u)  = 1 / (1 + e^{-u})
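
A minimal sketch of the logit-averaging formula above. The clipping constant `eps` is an assumption added here to keep the log-odds finite when a model returns exactly 0 or 1; it is not stated in the source.

```python
import math


def logit(p: float, eps: float = 1e-9) -> float:
    """Log-odds of p, clipped away from 0 and 1 (eps is an assumed safeguard)."""
    p = min(max(p, eps), 1.0 - eps)
    return math.log(p / (1.0 - p))


def sigmoid(u: float) -> float:
    """Inverse of the logit: 1 / (1 + e^-u)."""
    return 1.0 / (1.0 + math.exp(-u))


def shock_probability(p_glm: float, p_gam: float, p_rf: float) -> float:
    """Average the three models on the logit scale, then map back to a probability."""
    return sigmoid((logit(p_glm) + logit(p_gam) + logit(p_rf)) / 3.0)
```

Averaging on the logit scale means that symmetric disagreement cancels: inputs of 0.1, 0.5 and 0.9 combine to 0.5.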

### 3.7 Migration Pressure Index (MPI)

  MPI_{o,t} = 0.25·C + 0.25·F + 0.20·D + 0.15·G + 0.15·E

  Component        | Calculation                                       | Weight
  -----------------|---------------------------------------------------|-------
  C: Climate       | √(tmp_anom² + pre_anom²)                          | 25%
  F: Conflict      | (war + conflict) / 2                              | 25%
  D: Disasters     | (drought + flood + storm) / 3                     | 20%
  G: Governance    | p_gov_breakdown                                   | 15%
  E: Economy       | p_gdp_crisis                                      | 15%

Normalisation: percentile rank within each SSP group → [0, 1].
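
The composite can be sketched as below. Two points are assumptions of this sketch rather than statements from the source: the percentile rank is applied to the weighted composite (not to each component), and the rank is defined as the share of other values strictly below, so that the minimum maps to 0 and the maximum to 1.

```python
import math

MPI_WEIGHTS = {"C": 0.25, "F": 0.25, "D": 0.20, "G": 0.15, "E": 0.15}


def components(tmp_anom, pre_anom, war, conflict, drought, flood, storm,
               p_gov_breakdown, p_gdp_crisis):
    """Build the five stress components from the table above."""
    return {
        "C": math.hypot(tmp_anom, pre_anom),   # sqrt(tmp_anom^2 + pre_anom^2)
        "F": (war + conflict) / 2,
        "D": (drought + flood + storm) / 3,
        "G": p_gov_breakdown,
        "E": p_gdp_crisis,
    }


def raw_mpi(comp: dict) -> float:
    """Weighted sum 0.25*C + 0.25*F + 0.20*D + 0.15*G + 0.15*E."""
    return sum(MPI_WEIGHTS[k] * comp[k] for k in MPI_WEIGHTS)


def percentile_rank(values: list) -> list:
    """Share of other values strictly below each value (assumed definition)."""
    n = len(values)
    if n < 2:
        return [0.0] * n
    return [sum(1 for other in values if other < v) / (n - 1) for v in values]


# Hypothetical origin: composite before normalisation.
comp = components(3.0, 4.0, 1, 0, 0.3, 0.3, 0.3, 0.2, 0.4)
score = raw_mpi(comp)                            # about 1.525
ranks = percentile_rank([score, 0.2, 0.9])       # [1.0, 0.0, 0.5]
```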

### 3.8 Trapped Population Index (TPI)

  mobility_{o,t} = (1000 / Pop_{o,t}) · Σ_{d ≠ o} M_{od,t}
  TPI_{o,t}      = MPI_{o,t} · (1 – mobilitỹ_{o,t})

Tilde = min–max normalisation within SSP group → [0, 1]. Flagging:
TPI ≥ P95 within SSP. High TPI = high stress AND low mobility.
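
A small sketch of the TPI calculation, assuming the MPI values are already normalised within one SSP group. Function names and example numbers are illustrative, and the P95 flagging step is left out.

```python
def mobility(outflows: list, origin_pop: float) -> float:
    """Total out-migration per 1,000 origin population."""
    return 1000.0 / origin_pop * sum(outflows)


def min_max(values: list) -> list:
    """Rescale to [0, 1]; a constant input maps to all zeros (a choice of this sketch)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def tpi(mpi: list, mobility_rates: list) -> list:
    """TPI = MPI * (1 - normalised mobility), element-wise over origins."""
    return [m * (1.0 - n) for m, n in zip(mpi, min_max(mobility_rates))]


# Three hypothetical origins in one SSP group.
scores = tpi([0.9, 0.9, 0.2], [1.0, 11.0, 6.0])  # about [0.9, 0.0, 0.1]
```

The first two origins share the same pressure (0.9), but only the low-mobility one ends up with a high TPI.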

### 3.9 Scenarios (V7 hybrid: feature multiplication + displacement overlay)

Total Migration = ML Ensemble Prediction + Climate Displacement Overlay

(1) Feature-level multiplication: input features are scaled directly
(e.g. tmp_anomaly × 1.5), and the full ensemble (GAM + RF + XGBoost)
is re-run. The model's learned nonlinear responses determine the
scenario effect.

(2) Structural displacement overlay: physics-based forced displacement
from 3 climate channels is added on top.

Scenario multipliers:

  Scenario             | Climate | Conflict | Income | Drought | Flood | Storm | Gov. | Displacement
  ---------------------|---------|----------|--------|---------|-------|-------|------|-------------
  Baseline (ML only)   |   1.0   |   1.0    |  1.0   |   1.0   |  1.0  |  1.0  | 1.0  | —
  Baseline+            |   1.0   |   1.0    |  1.0   |   1.0   |  1.0  |  1.0  | 1.0  | × 1.0
  Adaptation Success   |   0.8   |   0.7    |  1.1   |   0.9   |  0.9  |  1.0  | 1.0  | × 0.8
  Fragmentation        |   1.0   |   1.5    |  0.8   |   1.0   |  1.0  |  1.0  | 1.3  | × 1.2
  Climate Extreme      |   1.5   |   1.0    |  0.9   |   1.5   |  1.4  |  1.5  | 1.0  | × 1.5

Displacement Overlay — 3 channels:

  Channel                 | Source                  | Threshold                                    | Max share
  ------------------------|-------------------------|----------------------------------------------|----------
  Sea Level Rise          | IPCC AR6, NASA SEDAC    | <5 m: (SLR/2.5)²; 5–10 m: ((SLR–2)/8)²      | 100% / 100%
  Extreme Heat            | CRU TS 4.09, CMIP6      | Sigmoid onset at 33 °C (Mora et al.)         | 25%
  Drought/Desertification | CRU/CMIP6, WB WDI       | From −150 mm precipitation anomaly           | 15%

Stock-to-flow conversion: 10% displacement rate per 5-year period.
Corridor distribution proportional to baseline corridor shares
(network-informed). Global exposure: 687M pop below 5 m, 1.056B below
10 m (SEDAC LECZ).
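
The stock-to-flow step and the corridor distribution can be sketched as follows. This reads the 10% rate as the share of an origin's displaced stock that becomes a flow in each 5-year period, which is an interpretation of the text above; the function names, the scenario multiplier argument and the example numbers are hypothetical.

```python
DISPLACEMENT_RATE = 0.10  # share of the displaced stock converted per 5-year period


def displaced_flow(displaced_stock: float, scenario_multiplier: float = 1.0) -> float:
    """Flow for one period, scaled by the scenario's displacement multiplier."""
    return DISPLACEMENT_RATE * displaced_stock * scenario_multiplier


def distribute(total_flow: float, baseline_flows: dict) -> dict:
    """Split an origin's displaced flow across destinations by baseline corridor share."""
    base_total = sum(baseline_flows.values())
    if base_total == 0:
        return {dest: 0.0 for dest in baseline_flows}
    return {dest: total_flow * f / base_total for dest, f in baseline_flows.items()}


# Hypothetical origin: 200,000 displaced people, two baseline corridors.
flow = displaced_flow(200_000)                     # about 20,000
overlay = distribute(flow, {"A": 300, "B": 100})   # A gets 75%, B gets 25%
```

With the Climate Extreme multiplier (× 1.5) the same stock would yield about 30,000 displaced migrants per period.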

### 3.10 Diaspora recursion model

  S_{ij,t+1} = (1 – δ_m – ρ) · S_{ij,t} + F_{ij,t}
  with survival = 0.87 (δ_m = 0.10 mortality 5-yr, ρ = 0.03 return)

The recursion dynamically reconstructs bilateral diaspora stocks across
the 6 historical periods. The diaspora stock is the strongest single
predictor (r = 0.33). Corridors with no inflow decline by 13% per
period; a corridor grows whenever its inflow exceeds 13% of the current
stock. During projection, predicted flows feed back endogenously into
the next period's diaspora stock (self-reinforcing corridor dynamics).
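
A minimal sketch of the recursion with the stated parameters; the function names and the example stocks and inflows are illustrative.

```python
DELTA_M = 0.10  # 5-year mortality share
RHO = 0.03      # 5-year return share


def next_stock(stock: float, inflow: float) -> float:
    """S_{t+1} = (1 - delta_m - rho) * S_t + F_t."""
    return (1.0 - DELTA_M - RHO) * stock + inflow


def project_stock(stock: float, inflows: list) -> list:
    """Apply the recursion once per period and return the full stock path."""
    path = [stock]
    for inflow in inflows:
        stock = next_stock(stock, inflow)
        path.append(stock)
    return path


inactive = project_stock(1000.0, [0.0, 0.0])     # about [1000, 870, 756.9]
active = project_stock(1000.0, [200.0, 200.0])   # about [1000, 1070, 1130.9]
```

With a constant inflow F the stock converges to F / 0.13, so a corridor receiving 200 migrants per period levels off near 1,538.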

### 3.11 Back-transformation & IPF calibration

  flow_{od,t} = clip( exp(ŷ_{od,t}) − 1, 0, 1000 )
  M̂_{od,t}    = flow_{od,t} · Pop_{o,t} / 1000
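
The back-transformation can be sketched in a few lines; the function name and the example values are illustrative.

```python
import math


def back_transform(y_hat: float, origin_pop: float, cap: float = 1000.0) -> float:
    """Convert a predicted log-rate into a migrant count.

    rate = clip(exp(y_hat) - 1, 0, cap) is the flow per 1,000 origin
    population; multiplying by Pop / 1000 gives the number of migrants.
    """
    rate = min(max(math.expm1(y_hat), 0.0), cap)
    return rate * origin_pop / 1000.0


# A prediction of log(6) for an origin of 10 million maps back to about 50,000.
migrants = back_transform(math.log(6.0), 10_000_000)
```

The upper clip at 1,000 per 1,000 means a predicted flow can never exceed the origin population.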

IPF calibration (V4) — three layers correct systematic model bias,
following Willekens (1999) and Abel & Cohen (2019):

1. Country-specific growth trajectories — model predictions are scaled
   to country-specific growth rates (not a global factor).
2. Marginal sum adjustment — iterative row/column sum alignment to
   target totals.
3. Population caps — origin and destination caps prevent physically
   impossible volumes.

  M^{(k+1)}_{od} = M^{(k)}_{od}   · T_o / Σ_{d'} M^{(k)}_{od'}   (row scaling: origin totals)
  M^{(k+2)}_{od} = M^{(k+1)}_{od} · T_d / Σ_{o'} M^{(k+1)}_{o'd} (column scaling: destination totals)

Targets: UN WPP origin totals, destination totals.
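
The row/column scaling above can be sketched as a plain-Python loop. This covers only the marginal sum adjustment (layer 2); the stopping rule, iteration limit and the 3-country example are assumptions of this sketch.

```python
def ipf(matrix, row_targets, col_targets, iters=1000, tol=1e-9):
    """Alternate row and column scaling until row sums match their targets."""
    m = [row[:] for row in matrix]  # work on a copy of the seed matrix
    for _ in range(iters):
        # Row scaling: match origin totals.
        for i, row in enumerate(m):
            s = sum(row)
            if s > 0:
                m[i] = [v * row_targets[i] / s for v in row]
        # Column scaling: match destination totals.
        for j, target in enumerate(col_targets):
            s = sum(row[j] for row in m)
            if s > 0:
                for row in m:
                    row[j] *= target / s
        # Columns are exact after the step above; stop once rows agree too.
        if max(abs(sum(row) - t) for row, t in zip(m, row_targets)) < tol:
            break
    return m


# Hypothetical 3-country seed with an empty diagonal (no self-migration).
seed = [[0.0, 10.0, 20.0],
        [30.0, 0.0, 10.0],
        [5.0, 15.0, 0.0]]
fitted = ipf(seed, row_targets=[60.0, 50.0, 40.0], col_targets=[70.0, 50.0, 30.0])
```

Zero cells stay zero under multiplicative scaling, so the empty diagonal is preserved; row and column targets must share the same grand total for the procedure to converge.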

### 3.12 Conformal Prediction Intervals (CPI)

Distribution-free conformal intervals following Vovk et al. (2005),
Romano et al. (2019), Barber et al. (2023). They guarantee
finite-sample coverage without distributional assumptions, provided
the calibration residuals are exchangeable.

- Mondrian binning: residuals stratified by flow magnitude (4 bins:
  zero, low, medium, high)
- Centered residuals: bias correction within each Mondrian bin
  (median-centered)
- Multiplicative bootstrap: corridor aggregation with pivot correction
  (N = 500 replicates)
- Intervals: 50% (25th–75th percentile) and 90% (5th–95th percentile)

  CI_α(x) = [ ŷ(x) + q_{α/2}(R_{B(x)}),  ŷ(x) + q_{1−α/2}(R_{B(x)}) ]

R_{B(x)} = centered residuals within the Mondrian bin
B(x) ∈ {zero, low, medium, high}; q_p(·) denotes the p-th quantile.
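
The interval formula can be illustrated for a single Mondrian bin. This sketch uses plain empirical quantiles with linear interpolation and omits both the finite-sample quantile correction and the bootstrap aggregation step; the residual values and function names are hypothetical.

```python
def quantile(sorted_vals: list, p: float) -> float:
    """Linear-interpolation quantile of an already sorted, non-empty list."""
    if not sorted_vals:
        raise ValueError("need at least one residual")
    pos = p * (len(sorted_vals) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(sorted_vals) - 1)
    frac = pos - lo
    return sorted_vals[lo] * (1.0 - frac) + sorted_vals[hi] * frac


def median_center(residuals: list) -> list:
    """Subtract the bin median and return the centered residuals, sorted."""
    med = quantile(sorted(residuals), 0.5)
    return sorted(r - med for r in residuals)


def interval(y_hat: float, centered: list, alpha: float) -> tuple:
    """[y_hat + q_{alpha/2}(R), y_hat + q_{1 - alpha/2}(R)] for one bin."""
    return (y_hat + quantile(centered, alpha / 2.0),
            y_hat + quantile(centered, 1.0 - alpha / 2.0))


# Hypothetical out-of-fold residuals for the "low" bin (median 0.1).
centered_low = median_center([-0.1, 0.0, 0.1, 0.2, 0.5])
ci50 = interval(1.0, centered_low, alpha=0.5)  # about (0.90, 1.10)
ci90 = interval(1.0, centered_low, alpha=0.1)  # about (0.82, 1.34)
```

Because signed residuals are used rather than absolute ones, the interval can be asymmetric around the prediction, as the 90% example shows.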

## 4. Data sources (17 primary)

  #  | Pillar            | Source                              | Variables                                                        | Period         | Coverage             | License
  ---|-------------------|-------------------------------------|------------------------------------------------------------------|----------------|----------------------|--------------
  1  | Migration Flows   | Abel & Cohen (2019)                 | Bilateral flow estimates                                         | 1990–2015      | 230 countries        | CC BY 4.0
  2  | Migration Flows   | UN DESA IMS 2024                    | Bilateral stock matrices, diaspora calibration                   | 1990–2020      | 232 countries        | UN ToU
  3  | Climate           | CRU TS 4.09 (Harris et al. 2020)    | tmp, pre, tmx, tmn, dtr, vap, wet, SPEI                          | 1901–2022      | 0.5° grid, land      | OGL v3
  4  | Climate           | CMIP6 ScenarioMIP (5 GCMs)          | tas, pr → anomalies (4 RCPs)                                     | 2015–2100      | Global               | CMIP6 ToU
  5  | Climate           | WMO (2017)                          | Baseline normals 1961–1990                                       | Reference      | Global               | WMO
  6  | Economic          | World Bank WDI (2024)               | gdp_pc_ppp, urban_pct, agr_va_pct, sec_enroll, poverty_rate      | 1960–2024      | 217 countries        | CC BY 4.0
  7  | Governance        | WGI (Kaufmann et al. 2011)          | wgi_ge, wgi_rl                                                   | 1996–2023      | 215 countries        | CC BY 4.0
  8  | Conflict          | UCDP v25.1 (Davies et al. 2023)     | Battle-related deaths                                            | 1989–2023      | Global               | CC BY 4.0
  9  | Vulnerability     | ND-GAIN                             | Climate adaptation score (0–100)                                 | 1995–2023      | 192 countries        | CC BY-SA 4.0
  10 | Disasters         | EM-DAT (Guha-Sapir et al.)          | Natural hazard events                                            | 1900–present   | Global               | Academic
  11 | Education         | Barro-Lee v3 (2013)                 | yrs_school_mean, sec_complete, tert_complete                     | 1950–2010      | 146 countries        | Free
  12 | Demography        | UN WPP 2024                         | Population by age-sex                                            | 1950–2100      | 237 countries        | CC BY 3.0 IGO
  13 | Gravity           | CEPII GeoDist (Mayer & Zignago)     | log_distw, contig, comlang_off, colony                           | Static         | ~40,000 dyads        | Free
  14 | Policy            | DEMIG VISA (de Haas et al.)         | visa_required (0/1)                                              | 2000–2015      | 214 countries bilat. | Academic
  15 | Socioeconomic     | IIASA SSP Database v3.1             | Future GDP, population, education                                | 2020–2100      | 197 countries × 4 SSPs | CC
  16 | Displacement      | NASA SEDAC LECZ v3                  | Coastal population <5 m, <10 m                                   | 1990, 2000, 2015 | 234 countries      | CC BY 4.0
  17 | Displacement      | IPCC AR6 WG1 Table 9.9              | Sea level rise projections                                       | 2020–2150      | Global, 4 SSPs       | IPCC

Panel: 316,020 directed dyad-period observations (230 countries →
230 × 229 = 52,670 directed dyads, × 6 periods, 1990–2015). 99.57%
completeness.

GCM ensemble (CMIP6): ACCESS-CM2, GFDL-ESM4, IPSL-CM6A-LR, MIROC6,
MPI-ESM1-2-LR. Selection rationale: documented performance,
independence across 5 research centers, availability across all 4 RCPs.

Anomaly baseline: WMO 1961–1990; projections calibrated via
model-specific 1995–2014 reference.

SSP→RCP mapping: SSP1=ssp126, SSP2=ssp245, SSP3=ssp370, SSP5=ssp585.

Missing data: core predictors <0.1% missing. Time-invariant bilateral
variables (distance, colonial ties) need no imputation. Slowly
evolving variables (GDP, governance) use linear interpolation.

## 5. Glossary (19 terms)

- MPI — Migration Pressure Index. Composite index (0–1) from 5 stress
  components. Higher = more migration pressure.
- TPI — Trapped Population Index. Combines migration pressure (MPI)
  with limited mobility. Flagged at P95 within SSP. High TPI = high
  stress AND low mobility.
- SSP — Shared Socioeconomic Pathways. IPCC scenarios: SSP1
  (Sustainability), SSP2 (Middle of the Road), SSP3 (Regional Rivalry),
  SSP5 (Fossil-fueled Development).
- ND-GAIN — Notre Dame Global Adaptation Initiative. Climate
  adaptation score (0–100).
- Net Migration — Immigration minus emigration. 5-year cumulative.
  Positive = net inflow.
- Scenario — Narrative future scenario: Baseline, Climate Extreme,
  Fragmentation, Adaptation Success, Baseline+ (Baseline plus the
  displacement overlay).
- Flow Rate — Migration rate per 1,000 origin population per 5-year
  period.
- Corridors — Bilateral origin–destination pairs (e.g. MEX→USA).
- WGI — Worldwide Governance Indicators. World Bank index:
  Government Effectiveness (GE), Rule of Law (RL). Scale: −2.5 to +2.5.
- Temperature Anomaly — Deviation of mean temperature from reference
  period 1986–2005, in °C.
- Precipitation Anomaly — Deviation of precipitation from reference
  period 1986–2005, in mm/year.
- Diaspora Stock — Cumulative stock of migrants from origin in
  destination country. Network effect.
- Gravity Model — Migration flows as a function of mass (population)
  and distance — analogous to physical gravity.
- Stacking Ensemble — Combination of three models (GAM 31.7%, RF 33.3%,
  XGBoost 35.0%) via Ridge meta-learner on out-of-fold predictions.
  OOF R² = 0.826.
- CMIP6 — Coupled Model Intercomparison Project Phase 6. Climate model
  ensemble for projections through 2100.
- IPF — Iterative Proportional Fitting. Calibration method that aligns
  model predictions to known marginal totals (UN WPP).
- CPI — Conformal Prediction Intervals. Distribution-free uncertainty
  intervals (50% and 90% coverage) based on out-of-fold residuals.
- OOF — Out-of-Fold. Predictions on data not used during training
  (cross-validation). Avoids overfitting bias.
- SHAP — SHapley Additive exPlanations. Explainability method that
  quantifies each feature's contribution to the prediction.

## 6. Privacy & policy

- No tracking, no third-party cookies, no Google Analytics, no Meta
  Pixel, no advertising.
- Single localStorage flag stored client-side: `mse_welcome_dismissed`
  (remembers dismissal of the welcome dialog). GDPR / § 25 TTDSG
  compliant; no consent banner required because no tracking takes place.
- Server logs: standard HTTP access logs, retained 14 days, no IP
  hashing required for analytics because no analytics are run.
- All projections released under CC BY-NC 4.0; raw third-party inputs
  remain under their respective upstream licenses (see data catalogue).