Methodology

From raw data through feature engineering and ensemble modelling to calibrated projections — the full pipeline behind the Migration Scenario Engine. Total Projected Migration = ML Ensemble + Displacement Overlay.

Headline numbers

316,020 dyad-period observations (230 countries × 52,670 dyads × 6 periods, 1990–2015)
OOF R² = 0.826 (stacking ensemble; within 0.7 pp of the temporal autocorrelation ceiling 0.827)
109 engineered predictors in 12 groups
6 pipeline stages: raw data → feature engineering → ensemble → autoregressive projection → IPF calibration → uncertainty quantification

Pipeline overview

Raw data — 17 sources: UN, World Bank, CMIP6, UCDP, EM-DAT, ND-GAIN, IIASA SSP, NASA SEDAC, IPCC AR6 et al.
Feature engineering — 109 clean predictors in 12 groups; climate (54%), economic, governance, network, interactions et al.
Ensemble model — GAM + Random Forest + XGBoost, combined via Ridge meta-learner (R² = 0.826).
Autoregressive projection — step-wise roll-forward 2020 → 2025 → … → 2095 in 5-year steps.
IPF calibration — Iterative Proportional Fitting against UN WPP population totals.
Uncertainty quantification — distribution-free Conformal Prediction Intervals at 50% and 90% coverage.

1. Target variable

For each origin–destination dyad (o, d) and 5-year period t:

\[ r_{od,t} \;=\; 1000 \cdot \frac{M_{od,t}}{\mathrm{Pop}_{o,t} + 10^{-6}} \qquad y_{od,t} \;=\; \log\!\bigl(1 + \max(0,\, r_{od,t})\bigr) \]

\(M_{od,t}\): bilateral migration (Abel & Cohen 2019)
\(r_{od,t}\): flow rate per 1,000 origin population
\(y_{od,t}\): log-transformed rate

2. Stacking ensemble

Model	Type	R² (OOF)	Weight
GAM	mgcv::gam, REML	0.795	0.317
Random Forest	ranger, 500 trees	0.804	0.333
XGBoost	500 rounds, η = 0.01, depth = 6	0.813	0.350
Stacking	Ridge meta-learner (glmnet, α = 0)	0.826	—

R² (OOF) = pooled out-of-fold R² from 5-fold expanding-window CV with Feature Engineering v2 (temporal lags, network centrality, diaspora dynamics). Meta-learner: Ridge regression on the OOF prediction matrix [N × 3].

\[ \hat{y}_{\text{ensemble}}(x) \;=\; \frac{\sum_{m \in \{\mathrm{GAM},\,\mathrm{RF},\,\mathrm{XGB}\}} w_m \, \hat{y}_m(x)}{\sum_{m} w_m} \]

Weight rationale: XGBoost receives the highest weight (35.0%), consistent with its highest standalone R² of 0.813. RF (33.3%) contributes via bagged tree diversity (R² = 0.804). GAM (31.7%) adds smooth spline trends (R² = 0.795) that tree models miss. The three weights sit in a tight band (σ ≈ 1.5 pp) — evidence of genuine algorithmic complementarity. The ensemble achieves R² = 0.826, 99.9% of the temporal autocorrelation ceiling (r² = 0.827).

3. GAM specification

\[ \log(1+r_{od,t}) \;=\; \underbrace{\sum_{j=1}^{7} s_j\!\bigl(x_{j,\,od,t}\bigr)}_{\text{smooth terms}} \;+\; \underbrace{\sum_{k=1}^{12} \beta_k \, z_{k,\,od,t}}_{\text{linear terms}} \;+\; \varepsilon_{od,t} \]

Smooths (k = basis dimension):

\[ \begin{aligned} s_1(\mathrm{tmp}_{\mathrm{anom}},\,k{=}5),\; s_2(\mathrm{pre}_{\mathrm{anom}},\,k{=}5),\; s_3(\mathrm{GDP}_o,\,k{=}4),\; s_4(\mathrm{GDP}_d,\,k{=}4), \\ s_5(\mathrm{agr}_o,\,k{=}4),\; s_6(\mathrm{agr}_d,\,k{=}4),\; s_7(\log\mathrm{dist}_{od},\,k{=}4) \end{aligned} \]

Linear terms:

\[ \begin{aligned} &\beta_1\,\mathrm{wgi}_{\mathrm{ge},o} + \beta_2\,\mathrm{wgi}_{\mathrm{rl},o} + \beta_3\,\mathrm{wgi}_{\mathrm{ge},d} + \beta_4\,\mathrm{wgi}_{\mathrm{rl},d} \\ &+\, \beta_5\,\mathrm{war\_onset} + \beta_6\,\mathrm{major\_conflict} + \beta_7\,\mathrm{drought} + \beta_8\,\mathrm{flood} \\ &+\, \beta_9\,\mathrm{diaspora}_{od,t-1} + \beta_{10}\,\mathrm{contig}_{od} + \beta_{11}\,\mathrm{comlang}_{od} + \beta_{12}\,\mathrm{colony}_{od} \end{aligned} \]

7 nonlinear + 12 linear terms. Estimated via mgcv::gam with penalised thin-plate regression splines (REML).

4. Feature engineering

v1 — Interaction terms

Feature	Formula
`climate_income`	`tmp_anomaly × log(gdp + 1)`
`climate_gov`	`tmp_anomaly × wgi_ge`
`conflict_crisis`	`war_onset × gdp_crisis`
`gravity`	`log(Pop_o) × log(Pop_d) / exp(log_dist)`

v2 — Temporal & network features (~30 additional)

Category	Features	Leakage guard
Lag features	`flow_rate_lag1, log_flow_rate_lag1, momentum_lag, total_migration_lag1`	t−1
Corridor shares	`corridor_share_of_emigration_lag, corridor_share_of_immigration_lag`	t−1
Network centrality	`pagerank, degree, strength, betweenness, hub/authority (orig+dest)`	t−1 network
Diaspora dynamics	`diaspora_growth`	t−1
Temporal interactions	`climate_x_past_flow, conflict_x_past_flow, hub_authority_interaction`	t−1

All v2 features use only prior-period data (t−1). Current-period aggregates are removed after computation (no data leakage). R² lift: without v2 ~0.20–0.55, with v2 ~0.79–0.80 per individual model.

5. Cross-validation

Expanding-window (rolling-origin), strictly time-based. 6 historical periods (1990–2015, 5-year steps) yield 5 expanding folds: Fold 1 trains on 1990 and tests on all subsequent periods (1995–2015); Fold 5 trains on 1990–2010 and tests only on 2015. Models are fully retrained per fold (no data leakage). Fold 1 uses a GAM + RF fallback ensemble because XGBoost is undertrained on a single training period; Folds 2–5 use the full GAM + RF + XGBoost stack.

Fold	Training	Test	R²	RMSE	N (Test)	N Models
1	1990	1995–2015	0.373	0.229	253,770	2 (GAM + RF)
2	1990–1995	2000–2015	0.810	0.126	203,016	3
3	1990–2000	2005–2015	0.834	0.118	152,262	3
4	1990–2005	2010–2015	0.831	0.119	101,508	3
5	1990–2010	2015	0.855	0.111	50,754	3
Pooled	All prior	All held-out	0.826	0.121	507,540	—

Monotonic learning curve from Fold 2 onward (0.810 → 0.855) validates genuine learning over overfitting; Fold 1 is intentionally weak due to its minimal 1-period training set. Reported metrics: R², RMSE, MAE, MedAE, MAPE, Pearson correlation.

6. Shock models

Shock	Threshold
`war_onset`	Battle deaths ≥ 1,000 + Onset
`major_conflict`	Battle deaths ≥ P90
`catastrophic_drought/flood/storm`	Event count ≥ P95
`gdp_crisis`	5-year GDP growth < −5%
`governance_breakdown`	min(Δwgi_ge, Δwgi_rl) < −0.5

\[ \hat{p}(x) \;=\; \sigma\!\left(\tfrac{1}{3}\bigl[\mathrm{logit}\,p_{\mathrm{GLM}}(x) + \mathrm{logit}\,p_{\mathrm{GAM}}(x) + \mathrm{logit}\,p_{\mathrm{RF}}(x)\bigr]\right), \quad \sigma(u) = \frac{1}{1+e^{-u}} \]

7. Migration Pressure Index (MPI)

\[ \mathrm{MPI}_{o,t} \;=\; 0.25\,C_{o,t} + 0.25\,F_{o,t} + 0.20\,D_{o,t} + 0.15\,G_{o,t} + 0.15\,E_{o,t} \]

Component	Calculation	Weight
C: Climate	\(\sqrt{\mathrm{tmp}_{\mathrm{anom}}^{2} + \mathrm{pre}_{\mathrm{anom}}^{2}}\)	25%
F: Conflict	\(\tfrac{1}{2}(\mathrm{war} + \mathrm{conflict})\)	25%
D: Disasters	\(\tfrac{1}{3}(\mathrm{drought} + \mathrm{flood} + \mathrm{storm})\)	20%
G: Governance	\(p_{\mathrm{gov\_breakdown}}\)	15%
E: Economy	\(p_{\mathrm{gdp\_crisis}}\)	15%

Normalisation: percentile rank within each SSP group → [0, 1].

8. Trapped Population Index (TPI)

\[ \mathrm{mobility}_{o,t} \;=\; \frac{1000}{\mathrm{Pop}_{o,t}} \sum_{d \,\neq\, o} M_{od,t} \qquad \mathrm{TPI}_{o,t} \;=\; \mathrm{MPI}_{o,t}\,\bigl(1 - \widetilde{\mathrm{mobility}}_{o,t}\bigr) \]

Tilde = min–max normalisation within SSP group → [0, 1]. Flagging: TPI ≥ P95 within SSP. High TPI = high stress AND low mobility.

9. Scenarios (V7 hybrid)

V7 combines two mechanisms: (1) feature-level multiplication — input features are scaled directly (e.g. tmp_anomaly × 1.5) and the full ensemble (GAM + RF + XGBoost) is re-run; the model's learned nonlinear responses determine the scenario effect. (2) Structural displacement overlay — physics-based forced displacement from 3 climate channels is added on top.

Scenario	Climate	Conflict	Income	Drought	Flood	Storm	Gov.	Displacement
Baseline (ML only)	1.0	1.0	1.0	1.0	1.0	1.0	1.0	—
Baseline+	1.0	1.0	1.0	1.0	1.0	1.0	1.0	×1.0
Adaptation Success	0.8	0.7	1.1	0.9	0.9	1.0	1.0	×0.8
Fragmentation	1.0	1.5	0.8	1.0	1.0	1.0	1.3	×1.2
Climate Extreme	1.5	1.0	0.9	1.5	1.4	1.5	1.0	×1.5

Total Migration = ML Ensemble Prediction + Climate Displacement Overlay.

Displacement overlay — 3 channels

Channel	Source	Threshold	Max share
Sea Level Rise	IPCC AR6, NASA SEDAC LECZ	<5m: (SLR/2.5)²; 5–10m: ((SLR−2)/8)²	100% / 100%
Extreme Heat	CRU TS 4.09, CMIP6	Sigmoid onset at 33 °C (Mora et al.)	25%
Drought / Desertification	CRU/CMIP6, WB WDI	From −150 mm precipitation anomaly	15%

Stock-to-flow conversion: 10% displacement rate per 5-year period. Corridor distribution proportional to baseline corridor shares (network-informed). Global exposure: 687M below 5 m, 1.056 B below 10 m (NASA SEDAC LECZ).

10. Diaspora recursion model

\[ S_{ij,\,t+1} \;=\; \underbrace{(1 - \delta_m - \rho)}_{\text{survival} \,=\, 0.87} \cdot S_{ij,\,t} \;+\; F_{ij,\,t} \]

\(\delta_m = 0.10\) (mortality, 5 yr), \(\rho = 0.03\) (return), \(S_{ij,t}\) = diaspora stock, \(F_{ij,t}\) = new arrivals.

Dynamic reconstruction of bilateral diaspora stocks across 6 historical periods. Strongest single predictor (r = 0.33). Inactive corridors decline 13% per period; active ones keep growing. During projection, predicted flows feed back endogenously into the next period's diaspora stock (self-reinforcing corridor dynamics).

11. Back-transformation, IPF calibration & volume estimation

\[ \mathrm{flow}_{od,t} \;=\; \operatorname{clip}\!\bigl(\exp(\hat{y}_{od,t}) - 1,\; 0,\; 1000\bigr) \qquad \hat{M}_{od,t} \;=\; \mathrm{flow}_{od,t} \cdot \frac{\mathrm{Pop}_{o,t}}{1000} \]

IPF calibration (V4)

Iterative Proportional Fitting (IPF) following Willekens (1999) and Abel & Cohen (2019). Three layers correct systematic model bias:

Country-specific growth trajectories: scaling model predictions to country-specific growth rates (not a global factor).
Marginal sum adjustment: iterative row/column sum alignment to target totals.
Population caps: origin and destination caps prevent physically impossible volumes.

\[ \begin{aligned} M^{(k+1)}_{od} &\;=\; M^{(k)}_{od}\cdot\dfrac{T_o}{\sum_{d'} M^{(k)}_{od'}} & \text{(row scaling: origin totals)} \\[4pt] M^{(k+2)}_{od} &\;=\; M^{(k+1)}_{od}\cdot\dfrac{T_d}{\sum_{o'} M^{(k+1)}_{o'd}} & \text{(column scaling: destination totals)} \end{aligned} \]

12. Conformal Prediction Intervals (CPI)

Distribution-free conformal intervals following Vovk et al. (2005), Romano et al. (2019), and Barber et al. (2023). Guarantee finite-sample coverage without distributional assumptions.

Mondrian binning: residuals stratified by flow magnitude (4 bins: zero, low, medium, high).
Centered residuals: bias correction within each Mondrian bin (median-centered).
Multiplicative bootstrap: corridor aggregation with pivot correction (N = 500 replicates).
Intervals: 50% (25th–75th percentile) and 90% (5th–95th percentile).

\[ \mathrm{CI}_{\alpha}(x) \;=\; \Bigl[\hat{y}(x) + q_{\alpha/2}\!\bigl(R_{\mathcal{B}(x)}\bigr),\;\; \hat{y}(x) + q_{1-\alpha/2}\!\bigl(R_{\mathcal{B}(x)}\bigr)\Bigr] \]

\(R_{\mathcal{B}(x)}\) = centered residuals within the Mondrian bin \(\mathcal{B}(x) \in \{\mathrm{zero},\,\mathrm{low},\,\mathrm{medium},\,\mathrm{high}\}\); \(q_p(\cdot)\) denotes the \(p\)-th quantile.