Methodology
From raw data through feature engineering and ensemble modelling to calibrated projections — the full pipeline behind the Migration Scenario Engine. Total Projected Migration = ML Ensemble + Displacement Overlay.
- 316,020 dyad-period observations (230 countries × 52,670 dyads × 6 periods, 1990–2015)
- OOF R² = 0.826 (stacking ensemble; within 0.7 pp of the temporal autocorrelation ceiling 0.827)
- 109 engineered predictors in 12 groups
- 6 pipeline stages: raw data → feature engineering → ensemble → autoregressive projection → IPF calibration → uncertainty quantification
Pipeline overview
- Raw data — 17 sources: UN, World Bank, CMIP6, UCDP, EM-DAT, ND-GAIN, IIASA SSP, NASA SEDAC, IPCC AR6 et al.
- Feature engineering — 109 clean predictors in 12 groups; climate (54%), economic, governance, network, interactions et al.
- Ensemble model — GAM + Random Forest + XGBoost, combined via Ridge meta-learner (R² = 0.826).
- Autoregressive projection — step-wise roll-forward 2020 → 2025 → … → 2095 in 5-year steps.
- IPF calibration — Iterative Proportional Fitting against UN WPP population totals.
- Uncertainty quantification — distribution-free Conformal Prediction Intervals at 50% and 90% coverage.
1. Target variable
For each origin–destination dyad (o, d) and 5-year period t:
\[ r_{od,t} \;=\; 1000 \cdot \frac{M_{od,t}}{\mathrm{Pop}_{o,t} + 10^{-6}} \qquad y_{od,t} \;=\; \log\!\bigl(1 + \max(0,\, r_{od,t})\bigr) \]- \(M_{od,t}\): bilateral migration (Abel & Cohen 2019)
- \(r_{od,t}\): flow rate per 1,000 origin population
- \(y_{od,t}\): log-transformed rate
2. Stacking ensemble
| Model | Type | R² (OOF) | Weight |
|---|---|---|---|
| GAM | mgcv::gam, REML | 0.795 | 0.317 |
| Random Forest | ranger, 500 trees | 0.804 | 0.333 |
| XGBoost | 500 rounds, η = 0.01, depth = 6 | 0.813 | 0.350 |
| Stacking | Ridge meta-learner (glmnet, α = 0) | 0.826 | — |
Weight rationale: XGBoost receives the highest weight (35.0%), consistent with its highest standalone R² of 0.813. RF (33.3%) contributes via bagged tree diversity (R² = 0.804). GAM (31.7%) adds smooth spline trends (R² = 0.795) that tree models miss. The three weights sit in a tight band (σ ≈ 1.5 pp) — evidence of genuine algorithmic complementarity. The ensemble achieves R² = 0.826, 99.9% of the temporal autocorrelation ceiling (r² = 0.827).
3. GAM specification
\[ \log(1+r_{od,t}) \;=\; \underbrace{\sum_{j=1}^{7} s_j\!\bigl(x_{j,\,od,t}\bigr)}_{\text{smooth terms}} \;+\; \underbrace{\sum_{k=1}^{12} \beta_k \, z_{k,\,od,t}}_{\text{linear terms}} \;+\; \varepsilon_{od,t} \]Smooths (k = basis dimension):
\[ \begin{aligned} s_1(\mathrm{tmp}_{\mathrm{anom}},\,k{=}5),\; s_2(\mathrm{pre}_{\mathrm{anom}},\,k{=}5),\; s_3(\mathrm{GDP}_o,\,k{=}4),\; s_4(\mathrm{GDP}_d,\,k{=}4), \\ s_5(\mathrm{agr}_o,\,k{=}4),\; s_6(\mathrm{agr}_d,\,k{=}4),\; s_7(\log\mathrm{dist}_{od},\,k{=}4) \end{aligned} \]Linear terms:
\[ \begin{aligned} &\beta_1\,\mathrm{wgi}_{\mathrm{ge},o} + \beta_2\,\mathrm{wgi}_{\mathrm{rl},o} + \beta_3\,\mathrm{wgi}_{\mathrm{ge},d} + \beta_4\,\mathrm{wgi}_{\mathrm{rl},d} \\ &+\, \beta_5\,\mathrm{war\_onset} + \beta_6\,\mathrm{major\_conflict} + \beta_7\,\mathrm{drought} + \beta_8\,\mathrm{flood} \\ &+\, \beta_9\,\mathrm{diaspora}_{od,t-1} + \beta_{10}\,\mathrm{contig}_{od} + \beta_{11}\,\mathrm{comlang}_{od} + \beta_{12}\,\mathrm{colony}_{od} \end{aligned} \]7 nonlinear + 12 linear terms. Estimated via mgcv::gam with penalised thin-plate regression splines (REML).
4. Feature engineering
v1 — Interaction terms
| Feature | Formula |
|---|---|
climate_income | tmp_anomaly × log(gdp + 1) |
climate_gov | tmp_anomaly × wgi_ge |
conflict_crisis | war_onset × gdp_crisis |
gravity | log(Pop_o) × log(Pop_d) / exp(log_dist) |
v2 — Temporal & network features (~30 additional)
| Category | Features | Leakage guard |
|---|---|---|
| Lag features | flow_rate_lag1, log_flow_rate_lag1, momentum_lag, total_migration_lag1 | t−1 |
| Corridor shares | corridor_share_of_emigration_lag, corridor_share_of_immigration_lag | t−1 |
| Network centrality | pagerank, degree, strength, betweenness, hub/authority (orig+dest) | t−1 network |
| Diaspora dynamics | diaspora_growth | t−1 |
| Temporal interactions | climate_x_past_flow, conflict_x_past_flow, hub_authority_interaction | t−1 |
5. Cross-validation
Expanding-window (rolling-origin), strictly time-based. 6 historical periods (1990–2015, 5-year steps) yield 5 expanding folds: Fold 1 trains on 1990 and tests on all subsequent periods (1995–2015); Fold 5 trains on 1990–2010 and tests only on 2015. Models are fully retrained per fold (no data leakage). Fold 1 uses a GAM + RF fallback ensemble because XGBoost is undertrained on a single training period; Folds 2–5 use the full GAM + RF + XGBoost stack.
| Fold | Training | Test | R² | RMSE | N (Test) | N Models |
|---|---|---|---|---|---|---|
| 1 | 1990 | 1995–2015 | 0.373 | 0.229 | 253,770 | 2 (GAM + RF) |
| 2 | 1990–1995 | 2000–2015 | 0.810 | 0.126 | 203,016 | 3 |
| 3 | 1990–2000 | 2005–2015 | 0.834 | 0.118 | 152,262 | 3 |
| 4 | 1990–2005 | 2010–2015 | 0.831 | 0.119 | 101,508 | 3 |
| 5 | 1990–2010 | 2015 | 0.855 | 0.111 | 50,754 | 3 |
| Pooled | All prior | All held-out | 0.826 | 0.121 | 507,540 | — |
6. Shock models
| Shock | Threshold |
|---|---|
war_onset | Battle deaths ≥ 1,000 + Onset |
major_conflict | Battle deaths ≥ P90 |
catastrophic_drought/flood/storm | Event count ≥ P95 |
gdp_crisis | 5-year GDP growth < −5% |
governance_breakdown | min(Δwgi_ge, Δwgi_rl) < −0.5 |
7. Migration Pressure Index (MPI)
\[ \mathrm{MPI}_{o,t} \;=\; 0.25\,C_{o,t} + 0.25\,F_{o,t} + 0.20\,D_{o,t} + 0.15\,G_{o,t} + 0.15\,E_{o,t} \]| Component | Calculation | Weight |
|---|---|---|
| C: Climate | \(\sqrt{\mathrm{tmp}_{\mathrm{anom}}^{2} + \mathrm{pre}_{\mathrm{anom}}^{2}}\) | 25% |
| F: Conflict | \(\tfrac{1}{2}(\mathrm{war} + \mathrm{conflict})\) | 25% |
| D: Disasters | \(\tfrac{1}{3}(\mathrm{drought} + \mathrm{flood} + \mathrm{storm})\) | 20% |
| G: Governance | \(p_{\mathrm{gov\_breakdown}}\) | 15% |
| E: Economy | \(p_{\mathrm{gdp\_crisis}}\) | 15% |
8. Trapped Population Index (TPI)
\[ \mathrm{mobility}_{o,t} \;=\; \frac{1000}{\mathrm{Pop}_{o,t}} \sum_{d \,\neq\, o} M_{od,t} \qquad \mathrm{TPI}_{o,t} \;=\; \mathrm{MPI}_{o,t}\,\bigl(1 - \widetilde{\mathrm{mobility}}_{o,t}\bigr) \]9. Scenarios (V7 hybrid)
V7 combines two mechanisms: (1) feature-level multiplication — input features are scaled directly (e.g. tmp_anomaly × 1.5) and the full ensemble (GAM + RF + XGBoost) is re-run; the model's learned nonlinear responses determine the scenario effect. (2) Structural displacement overlay — physics-based forced displacement from 3 climate channels is added on top.
| Scenario | Climate | Conflict | Income | Drought | Flood | Storm | Gov. | Displacement |
|---|---|---|---|---|---|---|---|---|
| Baseline (ML only) | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | — |
| Baseline+ | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | ×1.0 |
| Adaptation Success | 0.8 | 0.7 | 1.1 | 0.9 | 0.9 | 1.0 | 1.0 | ×0.8 |
| Fragmentation | 1.0 | 1.5 | 0.8 | 1.0 | 1.0 | 1.0 | 1.3 | ×1.2 |
| Climate Extreme | 1.5 | 1.0 | 0.9 | 1.5 | 1.4 | 1.5 | 1.0 | ×1.5 |
Displacement overlay — 3 channels
| Channel | Source | Threshold | Max share |
|---|---|---|---|
| Sea Level Rise | IPCC AR6, NASA SEDAC LECZ | <5m: (SLR/2.5)²; 5–10m: ((SLR−2)/8)² | 100% / 100% |
| Extreme Heat | CRU TS 4.09, CMIP6 | Sigmoid onset at 33 °C (Mora et al.) | 25% |
| Drought / Desertification | CRU/CMIP6, WB WDI | From −150 mm precipitation anomaly | 15% |
10. Diaspora recursion model
\[ S_{ij,\,t+1} \;=\; \underbrace{(1 - \delta_m - \rho)}_{\text{survival} \,=\, 0.87} \cdot S_{ij,\,t} \;+\; F_{ij,\,t} \]Dynamic reconstruction of bilateral diaspora stocks across 6 historical periods. Strongest single predictor (r = 0.33). Inactive corridors decline 13% per period; active ones keep growing. During projection, predicted flows feed back endogenously into the next period's diaspora stock (self-reinforcing corridor dynamics).
11. Back-transformation, IPF calibration & volume estimation
\[ \mathrm{flow}_{od,t} \;=\; \operatorname{clip}\!\bigl(\exp(\hat{y}_{od,t}) - 1,\; 0,\; 1000\bigr) \qquad \hat{M}_{od,t} \;=\; \mathrm{flow}_{od,t} \cdot \frac{\mathrm{Pop}_{o,t}}{1000} \]IPF calibration (V4)
Iterative Proportional Fitting (IPF) following Willekens (1999) and Abel & Cohen (2019). Three layers correct systematic model bias:
- Country-specific growth trajectories: scaling model predictions to country-specific growth rates (not a global factor).
- Marginal sum adjustment: iterative row/column sum alignment to target totals.
- Population caps: origin and destination caps prevent physically impossible volumes.
12. Conformal Prediction Intervals (CPI)
Distribution-free conformal intervals following Vovk et al. (2005), Romano et al. (2019), and Barber et al. (2023). Guarantee finite-sample coverage without distributional assumptions.
- Mondrian binning: residuals stratified by flow magnitude (4 bins: zero, low, medium, high).
- Centered residuals: bias correction within each Mondrian bin (median-centered).
- Multiplicative bootstrap: corridor aggregation with pivot correction (N = 500 replicates).
- Intervals: 50% (25th–75th percentile) and 90% (5th–95th percentile).