# Migration Scenario Engine: Full LLM Reference Document

This is the long, self-contained reference. Every fact an AI assistant needs to summarize, evaluate or cite the project is in this single file. No further URL fetches required.

Site: https://migrationengine.org/
Author: Christian Rogalski, M.Sc.
Contact: rogalski.academic@pm.me
License: CC BY-NC 4.0
Cite as: Rogalski, C. (2026). Migration Scenario Engine: Global Bilateral Migration Projections under Climate Scenarios, 2020–2100. CERIFR Research. https://migrationengine.org

---

## 1. Project summary

The Migration Scenario Engine (MSE) is an open scientific platform that projects global bilateral migration flows under climate scenarios for the period 2020–2100. It combines econometric panel models with a machine-learning ensemble to identify nonlinear climate thresholds and structural drivers of cross-border mobility, and then projects flows under alternative climate and governance pathways through 2100.

Headline numbers:

- 195 countries × ~52,670 directed bilateral corridors × 5-year periods
- 316,020 dyad-period observations in the historical training panel (1990–2015), 99.57% completeness on core variables
- 109 engineered predictors in 12 groups
- 4 IPCC Shared Socioeconomic Pathways (SSP1, SSP2, SSP3, SSP5) × 5 narrative scenarios per SSP
- Stacking ensemble OOF R² = 0.826 (within 0.1 pp of the temporal autocorrelation ceiling r² = 0.827)

The platform is operated entirely at the author's own expense, without external funding, sponsorship, or institutional support; it represents no lobby, no commercial interest, and no institutional agenda. All analyses reflect independent research.

## 2. Author and institutional context

**Christian Rogalski, M.Sc.**, External Doctoral Researcher and economist.

Affiliation: Alexandru Ioan Cuza University of Iași (Romania), Faculty of Economics and Business Administration, Department of Statistics and Cybernetics.

Topics: climate migration, machine learning, deep learning and AI, automation of data pipelines, econometrics, scenario modelling.

The doctoral research that this dashboard accompanies asks: *How will climate change reshape international migration over the coming decades?* The platform is the author's contribution toward answering that question transparently, independently, and with the strongest quantitative methods available.

The author plans to expand the platform with crisis-mode monitoring, interactive conflict detection from geospatial and social-media data, and integration of satellite imagery and nightlight proxies. A second interactive dashboard on a separate research topic is in preparation.

## 3. Methodology: full pipeline

### 3.1 Target variable

For each origin–destination dyad (od) and 5-year period t:

    r_{od,t} = 1000 · M_{od,t} / (Pop_{o,t} + 1e-6)
    y_{od,t} = log(1 + max(0, r_{od,t}))

where M_{od,t} is bilateral migration (Abel & Cohen 2019, calibrated against UN DESA IMS 2024 stocks), r_{od,t} is the flow rate per 1,000 origin population, and y_{od,t} is the log-transformed rate that all component models predict.

### 3.2 Stacking ensemble

Three base learners are stacked via a Ridge meta-learner on the out-of-fold (OOF) prediction matrix [N × 3]:

Model         | Type                              | OOF R² | Weight
--------------|-----------------------------------|--------|-------
GAM           | mgcv::gam, REML, thin-plate sp.   | 0.795  | 0.317
Random Forest | ranger, 500 trees                 | 0.804  | 0.333
XGBoost       | 500 rounds, η=0.01, depth=6       | 0.813  | 0.350
Stacking      | Ridge meta-learner (glmnet, α=0)  | 0.826  | n/a

    ŷ_ensemble(x) = Σ_m w_m · ŷ_m(x) / Σ_m w_m,    m ∈ {GAM, RF, XGB}

Weight rationale: XGBoost earns the highest weight (35.0%) because of its highest standalone R² (0.813). RF (33.3%) contributes via bagged tree diversity (R² = 0.804). GAM (31.7%) adds smooth spline trends (R² = 0.795) that tree models miss.
The three weights sit in a tight band (σ ≈ 1.5 pp), which is consistent with genuine algorithmic complementarity. The ensemble achieves R² = 0.826, 99.9% of the temporal autocorrelation ceiling (r² = 0.827).

### 3.3 GAM specification

    log(1 + r_{od,t}) = Σ_{j=1..7} s_j(x_{j,od,t})       [smooth terms]
                      + Σ_{i=1..12} β_i · z_{i,od,t}     [linear terms]
                      + ε_{od,t}

Smooth terms (k = basis dimension):

    s_1(tmp_anom, k=5), s_2(pre_anom, k=5), s_3(GDP_o, k=4), s_4(GDP_d, k=4),
    s_5(agr_o, k=4), s_6(agr_d, k=4), s_7(log dist_od, k=4)

Linear terms:

    β_1·wgi_ge_o + β_2·wgi_rl_o + β_3·wgi_ge_d + β_4·wgi_rl_d
    + β_5·war_onset + β_6·major_conflict + β_7·drought + β_8·flood
    + β_9·diaspora_{od,t-1} + β_10·contig + β_11·comlang + β_12·colony

7 nonlinear + 12 linear terms; estimated via mgcv::gam with penalised thin-plate regression splines, REML.

### 3.4 Feature engineering

109 predictors in 12 groups, built in two layers: v1 (interaction terms) and v2 (temporal & network features).

v1 interactions (selected):

    climate_income  = tmp_anomaly × log(gdp + 1)
    climate_gov     = tmp_anomaly × wgi_ge
    conflict_crisis = war_onset × gdp_crisis
    gravity         = log(Pop_o) × log(Pop_d) / exp(log_dist)

v2 (~30 additional features; all use only prior-period (t–1) data, so there is no data leakage):

Category              | Features                                                               | Leakage guard
----------------------|------------------------------------------------------------------------|--------------
Lag features          | flow_rate_lag1, log_flow_rate_lag1, momentum_lag, total_migration_lag1 | t–1
Corridor shares       | corridor_share_of_emigration_lag, corridor_share_of_immigration_lag    | t–1
Network centrality    | pagerank, degree, strength, betweenness, hub/authority (orig + dest)   | t–1 network
Diaspora dynamics     | diaspora_growth                                                        | t–1
Temporal interactions | climate_x_past_flow, conflict_x_past_flow, hub_authority_interaction   | t–1

R² lift from v2: without v2 ~0.20–0.55 per individual model; with v2 ~0.79–0.81 per individual model. Current-period aggregates are removed after computation.

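
The v1 interaction formulas and the t–1 leakage guard described above can be sketched in a few lines of Python. This is a minimal illustration, not the project's pipeline: the dictionary keys, the helper names `v1_interactions` and `lag1`, and all example values are hypothetical.

```python
import math


def v1_interactions(row):
    """Compute the four selected v1 interaction terms for one dyad-period.

    `row` is a hypothetical dict; the key names are illustrative only.
    """
    return {
        "climate_income": row["tmp_anomaly"] * math.log(row["gdp"] + 1.0),
        "climate_gov": row["tmp_anomaly"] * row["wgi_ge"],
        "conflict_crisis": row["war_onset"] * row["gdp_crisis"],
        # exp(log_dist) recovers the distance itself from its logarithm.
        "gravity": math.log(row["pop_o"]) * math.log(row["pop_d"])
        / math.exp(row["log_dist"]),
    }


def lag1(series):
    """Shift a per-period series by one period so period t only sees t-1.

    The first period has no predecessor and is returned as None.
    """
    if not series:
        return []
    return [None] + list(series[:-1])


# Made-up inputs for a single dyad-period.
example = {
    "tmp_anomaly": 1.2, "gdp": 8000.0, "wgi_ge": -0.5,
    "war_onset": 1, "gdp_crisis": 0,
    "pop_o": 5.0e7, "pop_d": 3.0e8, "log_dist": math.log(2000.0),
}
features = v1_interactions(example)

# Made-up flow rates for 1990..2015 and their one-period lag.
flow_rate = [0.8, 1.1, 0.9, 1.4, 1.3, 1.6]
flow_rate_lag1 = lag1(flow_rate)
```

Because `lag1` only ever exposes the previous period's value, a model trained on `flow_rate_lag1` never sees the flow it is asked to predict, which is the point of the leakage guard column in the v2 table.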
### 3.5 Cross-validation

Expanding-window (rolling-origin), strictly time-based. 6 historical periods (1990, 1995, 2000, 2005, 2010, 2015) yield 5 expanding folds. Models are fully retrained per fold (no data leakage):

Fold | Training       | Test         | R²    | RMSE  | N (test) | N Models
-----|----------------|--------------|-------|-------|----------|-----------
1    | 1990           | 1995–2015    | 0.373 | 0.229 | 253,770  | 2 (GAM+RF)
2    | 1990–1995      | 2000–2015    | 0.810 | 0.126 | 203,016  | 3
3    | 1990–2000      | 2005–2015    | 0.834 | 0.118 | 152,262  | 3
4    | 1990–2005      | 2010–2015    | 0.831 | 0.119 | 101,508  | 3
5    | 1990–2010      | 2015         | 0.855 | 0.111 | 50,754   | 3
Pool | All prior      | All held-out | 0.826 | 0.121 | 507,540  | n/a

R² rises from 0.810 in Fold 2 to 0.855 in Fold 5, with a marginal dip at Fold 4 (0.831); this pattern is consistent with genuine learning rather than overfitting. Fold 1 is intentionally weak due to its single-period training set. The pooled N (507,540) equals the combined test sets of Folds 2–5. Reported metrics: R², RMSE, MAE, MedAE, MAPE, Pearson r.

### 3.6 Shock models

Shock                              | Threshold
-----------------------------------|-------------------------------
war_onset                          | Battle deaths ≥ 1,000 + Onset
major_conflict                     | Battle deaths ≥ P90
catastrophic_drought/flood/storm   | Event count ≥ P95
gdp_crisis                         | 5-year GDP growth < –5%
governance_breakdown               | min(Δwgi_ge, Δwgi_rl) < –0.5

Each shock has its own probabilistic ensemble, which averages the three component probabilities on the logit scale:

    p̂(x) = σ( (1/3)·[ logit p_GLM(x) + logit p_GAM(x) + logit p_RF(x) ] )
    σ(u) = 1 / (1 + e^{-u})

### 3.7 Migration Pressure Index (MPI)

    MPI_{o,t} = 0.25·C + 0.25·F + 0.20·D + 0.15·G + 0.15·E

Component        | Calculation                       | Weight
-----------------|-----------------------------------|-------
C: Climate       | √(tmp_anom² + pre_anom²)          | 25%
F: Conflict      | (war + conflict) / 2              | 25%
D: Disasters     | (drought + flood + storm) / 3     | 20%
G: Governance    | p_gov_breakdown                   | 15%
E: Economy       | p_gdp_crisis                      | 15%

Normalisation: percentile rank within each SSP group → [0, 1].

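
The logit-averaged shock probability and the MPI weighting above can be sketched as follows. This is a hedged illustration under stated assumptions: the inputs are assumed to be already on comparable scales, the percentile rank is applied to the weighted composite (the text does not say whether components or the composite are ranked), and the rank definition, the function names, and the three example origins are all hypothetical.

```python
import math

# Component weights from the MPI formula above.
WEIGHTS = {"C": 0.25, "F": 0.25, "D": 0.20, "G": 0.15, "E": 0.15}


def logit_mean(probs):
    """Average probabilities on the logit scale, then map back with the sigmoid."""
    mean_logit = sum(math.log(p / (1.0 - p)) for p in probs) / len(probs)
    return 1.0 / (1.0 + math.exp(-mean_logit))


def raw_mpi(c):
    """Weighted sum of the five stress components before normalisation."""
    comp = {
        "C": math.sqrt(c["tmp_anom"] ** 2 + c["pre_anom"] ** 2),
        "F": (c["war"] + c["conflict"]) / 2.0,
        "D": (c["drought"] + c["flood"] + c["storm"]) / 3.0,
        "G": c["p_gov_breakdown"],
        "E": c["p_gdp_crisis"],
    }
    return sum(WEIGHTS[k] * comp[k] for k in WEIGHTS)


def percentile_rank(values):
    """Share of the other values strictly below each value, in [0, 1].

    One of several possible rank definitions; chosen here as an assumption.
    """
    n = len(values)
    if n < 2:
        return [0.0] * n
    return [sum(v < x for v in values) / (n - 1) for x in values]


# Three made-up origins within one SSP group.
origins = {
    "A": dict(tmp_anom=0.0, pre_anom=0.0, war=0, conflict=0, drought=0,
              flood=0, storm=0, p_gov_breakdown=0.0, p_gdp_crisis=0.0),
    "B": dict(tmp_anom=3.0, pre_anom=4.0, war=1, conflict=1, drought=1,
              flood=0, storm=0.5, p_gov_breakdown=0.2, p_gdp_crisis=0.4),
    "C": dict(tmp_anom=0.6, pre_anom=0.8, war=0, conflict=1, drought=0,
              flood=0, storm=0,
              p_gov_breakdown=logit_mean([0.2, 0.5, 0.8]), p_gdp_crisis=0.0),
}
raw = {name: raw_mpi(c) for name, c in origins.items()}
names = list(raw)
mpi = dict(zip(names, percentile_rank([raw[n] for n in names])))
```

Averaging on the logit scale rather than the probability scale means that a confident component model (a probability near 0 or 1) pulls the ensemble harder than a simple arithmetic mean would.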
### 3.8 Trapped Population Index (TPI)

    mobility_{o,t} = (1000 / Pop_{o,t}) · Σ_{d ≠ o} M_{od,t}
    TPI_{o,t} = MPI_{o,t} · (1 – mobilitỹ_{o,t})

Tilde = min–max normalisation within SSP group → [0, 1]. Flagging: TPI ≥ P95 within SSP. High TPI = high stress AND low mobility.

### 3.9 Scenarios (V7 hybrid: feature multiplication + displacement overlay)

    Total Migration = ML Ensemble Prediction + Climate Displacement Overlay

(1) Feature-level multiplication: input features are scaled directly (e.g. tmp_anomaly × 1.5), and the full ensemble (GAM + RF + XGBoost) is re-run. The model's learned nonlinear responses determine the scenario effect.

(2) Structural displacement overlay: physics-based forced displacement from 3 climate channels is added on top.

Scenario multipliers:

Scenario             | Climate | Conflict | Income | Drought | Flood | Storm | Gov. | Displacement
---------------------|---------|----------|--------|---------|-------|-------|------|-------------
Baseline (ML only)   | 1.0     | 1.0      | 1.0    | 1.0     | 1.0   | 1.0   | 1.0  | n/a
Baseline+            | 1.0     | 1.0      | 1.0    | 1.0     | 1.0   | 1.0   | 1.0  | × 1.0
Adaptation Success   | 0.8     | 0.7      | 1.1    | 0.9     | 0.9   | 1.0   | 1.0  | × 0.8
Fragmentation        | 1.0     | 1.5      | 0.8    | 1.0     | 1.0   | 1.0   | 1.3  | × 1.2
Climate Extreme      | 1.5     | 1.0      | 0.9    | 1.5     | 1.4   | 1.5   | 1.0  | × 1.5

Displacement overlay, 3 channels:

Channel                 | Source                  | Threshold                                    | Max share
------------------------|-------------------------|----------------------------------------------|------------
Sea Level Rise          | IPCC AR6, NASA SEDAC    | <5 m: (SLR/2.5)²; 5–10 m: ((SLR–2)/8)²       | 100% / 100%
Extreme Heat            | CRU TS 4.09, CMIP6      | Sigmoid onset at 33 °C (Mora et al.)         | 25%
Drought/Desertification | CRU/CMIP6, WB WDI       | From −150 mm precipitation anomaly           | 15%

Stock-to-flow conversion: 10% displacement rate per 5-year period. Corridor distribution proportional to baseline corridor shares (network-informed). Global exposure: 687M pop below 5 m, 1.056B below 10 m (SEDAC LECZ).

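
The overlay arithmetic (stock-to-flow conversion, scenario displacement multiplier, distribution by baseline corridor share, then addition to the ML prediction) might look like the sketch below. It is an illustration under assumptions, not the project's code: it takes the displaced stock of one origin as a given input rather than deriving it from the three channels, it assumes the displacement multiplier scales the converted outflow, and the function names and numbers are hypothetical.

```python
# Displacement multipliers from the scenario table above.
# "Baseline (ML only)" is omitted because it carries no overlay.
DISPLACEMENT_MULTIPLIER = {
    "Baseline+": 1.0,
    "Adaptation Success": 0.8,
    "Fragmentation": 1.2,
    "Climate Extreme": 1.5,
}
STOCK_TO_FLOW_RATE = 0.10  # share of the displaced stock moving per 5-year period


def overlay_flows(displaced_stock, baseline_flows, scenario):
    """Spread one origin's displacement over its corridors by baseline share."""
    outflow = displaced_stock * STOCK_TO_FLOW_RATE * DISPLACEMENT_MULTIPLIER[scenario]
    total = sum(baseline_flows.values())
    if total <= 0:
        # No baseline corridors to inherit shares from: nothing is distributed.
        return {dest: 0.0 for dest in baseline_flows}
    return {dest: outflow * flow / total for dest, flow in baseline_flows.items()}


def total_migration(ml_flows, overlay):
    """Total Migration = ML Ensemble Prediction + Climate Displacement Overlay."""
    return {dest: ml_flows[dest] + overlay.get(dest, 0.0) for dest in ml_flows}


# Made-up baseline flows from one origin to three destinations.
baseline = {"USA": 600.0, "CAN": 300.0, "MEX": 100.0}
overlay = overlay_flows(1_000_000.0, baseline, "Climate Extreme")
totals = total_migration(baseline, overlay)
```

With these invented inputs, a displaced stock of 1,000,000 converts to an outflow of 150,000 under Climate Extreme (10% × 1.5), and the corridor holding 60% of baseline flows receives 60% of it.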
### 3.10 Diaspora recursion model

    S_{ij,t+1} = (1 – δ_m – ρ) · S_{ij,t} + F_{ij,t}

with a survival rate of 1 – δ_m – ρ = 0.87 (δ_m = 0.10 five-year mortality, ρ = 0.03 return migration).

The recursion dynamically reconstructs bilateral diaspora stocks across the 6 historical periods. The diaspora stock is the strongest single predictor (r = 0.33). Corridors without new inflows decline by 13% per period (1 – 0.87), while active corridors are replenished by new flows and can keep growing. During projection, predicted flows feed back endogenously into the next period's diaspora stock (self-reinforcing corridor dynamics).

### 3.11 Back-transformation & IPF calibration

    flow_{od,t} = clip( exp(ŷ_{od,t}) − 1, 0, 1000 )
    M̂_{od,t} = flow_{od,t} · Pop_{o,t} / 1000

IPF calibration (V4): three layers correct systematic model bias, following Willekens (1999) and Abel & Cohen (2019):

1. Country-specific growth trajectories: model predictions are scaled to country-specific growth rates (not a global factor).
2. Marginal sum adjustment: iterative row/column sum alignment to target totals.
3. Population caps: origin and destination caps prevent physically impossible volumes.

    M^{(k+1)}_{od} = M^{(k)}_{od} · T_o / Σ_{d'} M^{(k)}_{od'}         (row scaling: origin totals)
    M^{(k+2)}_{od} = M^{(k+1)}_{od} · T_d / Σ_{o'} M^{(k+1)}_{o'd}     (column scaling: destination totals)

Targets: UN WPP origin totals, destination totals.

### 3.12 Conformal Prediction Intervals (CPI)

Distribution-free conformal intervals following Vovk et al. (2005), Romano et al. (2019), Barber et al. (2023). They guarantee finite-sample coverage without distributional assumptions.

- Mondrian binning: residuals stratified by flow magnitude (4 bins: zero, low, medium, high)
- Centered residuals: bias correction within each Mondrian bin (median-centered)
- Multiplicative bootstrap: corridor aggregation with pivot correction (N = 500 replicates)
- Intervals: 50% (25th–75th percentile) and 90% (5th–95th percentile)

    CI_α(x) = [ ŷ(x) + q_{α/2}(R_{B(x)}), ŷ(x) + q_{1−α/2}(R_{B(x)}) ]

R_{B(x)} = centered residuals within the Mondrian bin B(x) ∈ {zero, low, medium, high}; q_p(·) denotes the p-th quantile.

## 4. Data sources (17 primary)

# | Pillar | Source | Variables | Period | Coverage | License
---|---|---|---|---|---|---
1 | Migration Flows | Abel & Cohen (2019) | Bilateral flow estimates | 1990–2015 | 230 countries | CC BY 4.0
2 | Migration Flows | UN DESA IMS 2024 | Bilateral stock matrices, diaspora calibration | 1990–2020 | 232 countries | UN ToU
3 | Climate | CRU TS 4.09 (Harris et al. 2020) | tmp, pre, tmx, tmn, dtr, vap, wet, SPEI | 1901–2022 | 0.5° grid, land | OGL v3
4 | Climate | CMIP6 ScenarioMIP (5 GCMs) | tas, pr → anomalies (4 RCPs) | 2015–2100 | Global | CMIP6 ToU
5 | Climate | WMO (2017) | Baseline normals 1961–1990 | Reference | Global | WMO
6 | Economic | World Bank WDI (2024) | gdp_pc_ppp, urban_pct, agr_va_pct, sec_enroll, poverty_rate | 1960–2024 | 217 countries | CC BY 4.0
7 | Governance | WGI (Kaufmann et al. 2011) | wgi_ge, wgi_rl | 1996–2023 | 215 countries | CC BY 4.0
8 | Conflict | UCDP v25.1 (Davies et al. 2023) | Battle-related deaths | 1989–2023 | Global | CC BY 4.0
9 | Vulnerability | ND-GAIN | Climate adaptation score (0–100) | 1995–2023 | 192 countries | CC BY-SA 4.0
10 | Disasters | EM-DAT (Guha-Sapir et al.) | Natural hazard events | 1900–present | Global | Academic
11 | Education | Barro-Lee v3 (2013) | yrs_school_mean, sec_complete, tert_complete | 1950–2010 | 146 countries | Free
12 | Demography | UN WPP 2024 | Population by age-sex | 1950–2100 | 237 countries | CC BY 3.0 IGO
13 | Gravity | CEPII GeoDist (Mayer & Zignago) | log_distw, contig, comlang_off, colony | Static | ~40,000 dyads | Free
14 | Policy | DEMIG VISA (de Haas et al.) | visa_required (0/1) | 2000–2015 | 214 countries bilat. | Academic
15 | Socioeconomic | IIASA SSP Database v3.1 | Future GDP, population, education | 2020–2100 | 197 countries × 4 SSPs | CC
16 | Displacement | NASA SEDAC LECZ v3 | Coastal population <5 m, <10 m | 1990, 2000, 2015 | 234 countries | CC BY 4.0
17 | Displacement | IPCC AR6 WG1 Table 9.9 | Sea level rise projections | 2020–2150 | Global, 4 SSPs | IPCC

Panel: 316,020 directed dyad-period observations (230 countries, 230 × 229 = 52,670 directed dyads, 6 periods from 1990 to 2015; 52,670 × 6 = 316,020). 99.57% completeness.

GCM ensemble (CMIP6): ACCESS-CM2, GFDL-ESM4, IPSL-CM6A-LR, MIROC6, MPI-ESM1-2-LR. Selection rationale: documented performance, independence across 5 research centers, availability across all 4 RCPs.

Anomaly baseline: WMO 1961–1990; projections calibrated via model-specific 1995–2014 reference.

SSP→RCP mapping: SSP1=ssp126, SSP2=ssp245, SSP3=ssp370, SSP5=ssp585.

Missing data: core predictors <0.1% missing. Time-invariant bilateral variables (distance, colonial ties) need no imputation. Slowly evolving variables (GDP, governance) use linear interpolation.

## 5. Glossary (19 terms)

- MPI: Migration Pressure Index. Composite index (0–1) from 5 stress components. Higher = more migration pressure.
- TPI: Trapped Population Index. Combines migration pressure (MPI) with limited mobility. Flagged at P95 within SSP. High TPI = high stress AND low mobility.
- SSP: Shared Socioeconomic Pathways. IPCC scenarios: SSP1 (Sustainability), SSP2 (Middle of the Road), SSP3 (Rivalry), SSP5 (Fossil-fueled).
- ND-GAIN: Notre Dame Global Adaptation Initiative. Climate adaptation score (0–100).
- Net Migration: Immigration minus emigration. 5-year cumulative. Positive = net inflow.
- Scenario: Narrative future scenario (Baseline, Baseline+, Climate Extreme, Fragmentation, Adaptation Success).
- Flow Rate: Migration rate per 1,000 origin population per 5-year period.
- Corridors: Bilateral origin–destination pairs (e.g. MEX→USA).
- WGI: Worldwide Governance Indicators. World Bank index: Government Effectiveness (GE), Rule of Law (RL). Scale: −2.5 to +2.5.
- Temperature Anomaly: Deviation of mean temperature from the reference period 1961–1990 (WMO baseline, see Section 4), in °C.
- Precipitation Anomaly: Deviation of precipitation from the reference period 1961–1990 (WMO baseline, see Section 4), in mm/year.
- Diaspora Stock: Cumulative stock of migrants from origin in destination country. Network effect.
- Gravity Model: Migration flows as a function of mass (population) and distance, analogous to physical gravity.
- Stacking Ensemble: Combination of three models (GAM 31.7%, RF 33.3%, XGBoost 35.0%) via Ridge meta-learner on out-of-fold predictions. OOF R² = 0.826.
- CMIP6: Coupled Model Intercomparison Project Phase 6. Climate model ensemble for projections through 2100.
- IPF: Iterative Proportional Fitting. Calibration method that aligns model predictions to known marginal totals (UN WPP).
- CPI: Conformal Prediction Intervals. Distribution-free uncertainty intervals (50% and 90% coverage) based on out-of-fold residuals.
- OOF: Out-of-Fold. Predictions on data not used during training (cross-validation). Avoids overfitting bias.
- SHAP: SHapley Additive exPlanations. Explainability method that quantifies each feature's contribution to the prediction.

## 6. Privacy & policy

- No tracking, no third-party cookies, no Google Analytics, no Meta Pixel, no advertising.
- Single localStorage flag stored client-side: `mse_welcome_dismissed` (remembers dismissal of the welcome dialog). GDPR / § 25 TTDSG compliant; no consent banner required because no tracking takes place.
- Server logs: standard HTTP access logs, retained 14 days, no IP hashing required for analytics because no analytics are run.
- All projections released under CC BY-NC 4.0; raw third-party inputs remain under their respective upstream licenses (see data catalogue).