CERIFR Research — Migration Scenario Engine

Methodology

From raw data through feature engineering and ensemble modelling to calibrated projections — the full pipeline behind the Migration Scenario Engine. Total Projected Migration = ML Ensemble + Displacement Overlay.

Headline numbers
  • 316,020 dyad-period observations (230 countries × 52,670 dyads × 6 periods, 1990–2015)
  • OOF R² = 0.826 (stacking ensemble; within 0.7 pp of the temporal autocorrelation ceiling 0.827)
  • 109 engineered predictors in 12 groups
  • 6 pipeline stages: raw data → feature engineering → ensemble → autoregressive projection → IPF calibration → uncertainty quantification

Pipeline overview

  1. Raw data — 17 sources: UN, World Bank, CMIP6, UCDP, EM-DAT, ND-GAIN, IIASA SSP, NASA SEDAC, IPCC AR6 et al.
  2. Feature engineering — 109 clean predictors in 12 groups; climate (54%), economic, governance, network, interactions et al.
  3. Ensemble model — GAM + Random Forest + XGBoost, combined via Ridge meta-learner (R² = 0.826).
  4. Autoregressive projection — step-wise roll-forward 2020 → 2025 → … → 2095 in 5-year steps.
  5. IPF calibration — Iterative Proportional Fitting against UN WPP population totals.
  6. Uncertainty quantification — distribution-free Conformal Prediction Intervals at 50% and 90% coverage.

1. Target variable

For each origin–destination dyad (o, d) and 5-year period t:

\[ r_{od,t} \;=\; 1000 \cdot \frac{M_{od,t}}{\mathrm{Pop}_{o,t} + 10^{-6}} \qquad y_{od,t} \;=\; \log\!\bigl(1 + \max(0,\, r_{od,t})\bigr) \]
  • \(M_{od,t}\): bilateral migration (Abel & Cohen 2019)
  • \(r_{od,t}\): flow rate per 1,000 origin population
  • \(y_{od,t}\): log-transformed rate

2. Stacking ensemble

ModelTypeR² (OOF)Weight
GAMmgcv::gam, REML0.7950.317
Random Forestranger, 500 trees0.8040.333
XGBoost500 rounds, η = 0.01, depth = 60.8130.350
StackingRidge meta-learner (glmnet, α = 0)0.826

R² (OOF) = pooled out-of-fold R² from 5-fold expanding-window CV with Feature Engineering v2 (temporal lags, network centrality, diaspora dynamics). Meta-learner: Ridge regression on the OOF prediction matrix [N × 3].

\[ \hat{y}_{\text{ensemble}}(x) \;=\; \frac{\sum_{m \in \{\mathrm{GAM},\,\mathrm{RF},\,\mathrm{XGB}\}} w_m \, \hat{y}_m(x)}{\sum_{m} w_m} \]

Weight rationale: XGBoost receives the highest weight (35.0%), consistent with its highest standalone R² of 0.813. RF (33.3%) contributes via bagged tree diversity (R² = 0.804). GAM (31.7%) adds smooth spline trends (R² = 0.795) that tree models miss. The three weights sit in a tight band (σ ≈ 1.5 pp) — evidence of genuine algorithmic complementarity. The ensemble achieves R² = 0.826, 99.9% of the temporal autocorrelation ceiling (r² = 0.827).

3. GAM specification

\[ \log(1+r_{od,t}) \;=\; \underbrace{\sum_{j=1}^{7} s_j\!\bigl(x_{j,\,od,t}\bigr)}_{\text{smooth terms}} \;+\; \underbrace{\sum_{k=1}^{12} \beta_k \, z_{k,\,od,t}}_{\text{linear terms}} \;+\; \varepsilon_{od,t} \]

Smooths (k = basis dimension):

\[ \begin{aligned} s_1(\mathrm{tmp}_{\mathrm{anom}},\,k{=}5),\; s_2(\mathrm{pre}_{\mathrm{anom}},\,k{=}5),\; s_3(\mathrm{GDP}_o,\,k{=}4),\; s_4(\mathrm{GDP}_d,\,k{=}4), \\ s_5(\mathrm{agr}_o,\,k{=}4),\; s_6(\mathrm{agr}_d,\,k{=}4),\; s_7(\log\mathrm{dist}_{od},\,k{=}4) \end{aligned} \]

Linear terms:

\[ \begin{aligned} &\beta_1\,\mathrm{wgi}_{\mathrm{ge},o} + \beta_2\,\mathrm{wgi}_{\mathrm{rl},o} + \beta_3\,\mathrm{wgi}_{\mathrm{ge},d} + \beta_4\,\mathrm{wgi}_{\mathrm{rl},d} \\ &+\, \beta_5\,\mathrm{war\_onset} + \beta_6\,\mathrm{major\_conflict} + \beta_7\,\mathrm{drought} + \beta_8\,\mathrm{flood} \\ &+\, \beta_9\,\mathrm{diaspora}_{od,t-1} + \beta_{10}\,\mathrm{contig}_{od} + \beta_{11}\,\mathrm{comlang}_{od} + \beta_{12}\,\mathrm{colony}_{od} \end{aligned} \]

7 nonlinear + 12 linear terms. Estimated via mgcv::gam with penalised thin-plate regression splines (REML).

4. Feature engineering

v1 — Interaction terms

FeatureFormula
climate_incometmp_anomaly × log(gdp + 1)
climate_govtmp_anomaly × wgi_ge
conflict_crisiswar_onset × gdp_crisis
gravitylog(Pop_o) × log(Pop_d) / exp(log_dist)

v2 — Temporal & network features (~30 additional)

CategoryFeaturesLeakage guard
Lag featuresflow_rate_lag1, log_flow_rate_lag1, momentum_lag, total_migration_lag1t−1
Corridor sharescorridor_share_of_emigration_lag, corridor_share_of_immigration_lagt−1
Network centralitypagerank, degree, strength, betweenness, hub/authority (orig+dest)t−1 network
Diaspora dynamicsdiaspora_growtht−1
Temporal interactionsclimate_x_past_flow, conflict_x_past_flow, hub_authority_interactiont−1

All v2 features use only prior-period data (t−1). Current-period aggregates are removed after computation (no data leakage). R² lift: without v2 ~0.20–0.55, with v2 ~0.79–0.80 per individual model.

5. Cross-validation

Expanding-window (rolling-origin), strictly time-based. 6 historical periods (1990–2015, 5-year steps) yield 5 expanding folds: Fold 1 trains on 1990 and tests on all subsequent periods (1995–2015); Fold 5 trains on 1990–2010 and tests only on 2015. Models are fully retrained per fold (no data leakage). Fold 1 uses a GAM + RF fallback ensemble because XGBoost is undertrained on a single training period; Folds 2–5 use the full GAM + RF + XGBoost stack.

FoldTrainingTestRMSEN (Test)N Models
119901995–20150.3730.229253,7702 (GAM + RF)
21990–19952000–20150.8100.126203,0163
31990–20002005–20150.8340.118152,2623
41990–20052010–20150.8310.119101,5083
51990–201020150.8550.11150,7543
PooledAll priorAll held-out0.8260.121507,540

Monotonic learning curve from Fold 2 onward (0.810 → 0.855) validates genuine learning over overfitting; Fold 1 is intentionally weak due to its minimal 1-period training set. Reported metrics: R², RMSE, MAE, MedAE, MAPE, Pearson correlation.

6. Shock models

ShockThreshold
war_onsetBattle deaths ≥ 1,000 + Onset
major_conflictBattle deaths ≥ P90
catastrophic_drought/flood/stormEvent count ≥ P95
gdp_crisis5-year GDP growth < −5%
governance_breakdownmin(Δwgi_ge, Δwgi_rl) < −0.5
\[ \hat{p}(x) \;=\; \sigma\!\left(\tfrac{1}{3}\bigl[\mathrm{logit}\,p_{\mathrm{GLM}}(x) + \mathrm{logit}\,p_{\mathrm{GAM}}(x) + \mathrm{logit}\,p_{\mathrm{RF}}(x)\bigr]\right), \quad \sigma(u) = \frac{1}{1+e^{-u}} \]

7. Migration Pressure Index (MPI)

\[ \mathrm{MPI}_{o,t} \;=\; 0.25\,C_{o,t} + 0.25\,F_{o,t} + 0.20\,D_{o,t} + 0.15\,G_{o,t} + 0.15\,E_{o,t} \]
ComponentCalculationWeight
C: Climate\(\sqrt{\mathrm{tmp}_{\mathrm{anom}}^{2} + \mathrm{pre}_{\mathrm{anom}}^{2}}\)25%
F: Conflict\(\tfrac{1}{2}(\mathrm{war} + \mathrm{conflict})\)25%
D: Disasters\(\tfrac{1}{3}(\mathrm{drought} + \mathrm{flood} + \mathrm{storm})\)20%
G: Governance\(p_{\mathrm{gov\_breakdown}}\)15%
E: Economy\(p_{\mathrm{gdp\_crisis}}\)15%

Normalisation: percentile rank within each SSP group → [0, 1].

8. Trapped Population Index (TPI)

\[ \mathrm{mobility}_{o,t} \;=\; \frac{1000}{\mathrm{Pop}_{o,t}} \sum_{d \,\neq\, o} M_{od,t} \qquad \mathrm{TPI}_{o,t} \;=\; \mathrm{MPI}_{o,t}\,\bigl(1 - \widetilde{\mathrm{mobility}}_{o,t}\bigr) \]

Tilde = min–max normalisation within SSP group → [0, 1]. Flagging: TPI ≥ P95 within SSP. High TPI = high stress AND low mobility.

9. Scenarios (V7 hybrid)

V7 combines two mechanisms: (1) feature-level multiplication — input features are scaled directly (e.g. tmp_anomaly × 1.5) and the full ensemble (GAM + RF + XGBoost) is re-run; the model's learned nonlinear responses determine the scenario effect. (2) Structural displacement overlay — physics-based forced displacement from 3 climate channels is added on top.

ScenarioClimateConflictIncomeDroughtFloodStormGov.Displacement
Baseline (ML only)1.01.01.01.01.01.01.0
Baseline+1.01.01.01.01.01.01.0×1.0
Adaptation Success0.80.71.10.90.91.01.0×0.8
Fragmentation1.01.50.81.01.01.01.3×1.2
Climate Extreme1.51.00.91.51.41.51.0×1.5

Total Migration = ML Ensemble Prediction + Climate Displacement Overlay.

Displacement overlay — 3 channels

ChannelSourceThresholdMax share
Sea Level RiseIPCC AR6, NASA SEDAC LECZ<5m: (SLR/2.5)²; 5–10m: ((SLR−2)/8)²100% / 100%
Extreme HeatCRU TS 4.09, CMIP6Sigmoid onset at 33 °C (Mora et al.)25%
Drought / DesertificationCRU/CMIP6, WB WDIFrom −150 mm precipitation anomaly15%

Stock-to-flow conversion: 10% displacement rate per 5-year period. Corridor distribution proportional to baseline corridor shares (network-informed). Global exposure: 687M below 5 m, 1.056 B below 10 m (NASA SEDAC LECZ).

10. Diaspora recursion model

\[ S_{ij,\,t+1} \;=\; \underbrace{(1 - \delta_m - \rho)}_{\text{survival} \,=\, 0.87} \cdot S_{ij,\,t} \;+\; F_{ij,\,t} \]

\(\delta_m = 0.10\) (mortality, 5 yr), \(\rho = 0.03\) (return), \(S_{ij,t}\) = diaspora stock, \(F_{ij,t}\) = new arrivals.

Dynamic reconstruction of bilateral diaspora stocks across 6 historical periods. Strongest single predictor (r = 0.33). Inactive corridors decline 13% per period; active ones keep growing. During projection, predicted flows feed back endogenously into the next period's diaspora stock (self-reinforcing corridor dynamics).

11. Back-transformation, IPF calibration & volume estimation

\[ \mathrm{flow}_{od,t} \;=\; \operatorname{clip}\!\bigl(\exp(\hat{y}_{od,t}) - 1,\; 0,\; 1000\bigr) \qquad \hat{M}_{od,t} \;=\; \mathrm{flow}_{od,t} \cdot \frac{\mathrm{Pop}_{o,t}}{1000} \]

IPF calibration (V4)

Iterative Proportional Fitting (IPF) following Willekens (1999) and Abel & Cohen (2019). Three layers correct systematic model bias:

  1. Country-specific growth trajectories: scaling model predictions to country-specific growth rates (not a global factor).
  2. Marginal sum adjustment: iterative row/column sum alignment to target totals.
  3. Population caps: origin and destination caps prevent physically impossible volumes.
\[ \begin{aligned} M^{(k+1)}_{od} &\;=\; M^{(k)}_{od}\cdot\dfrac{T_o}{\sum_{d'} M^{(k)}_{od'}} & \text{(row scaling: origin totals)} \\[4pt] M^{(k+2)}_{od} &\;=\; M^{(k+1)}_{od}\cdot\dfrac{T_d}{\sum_{o'} M^{(k+1)}_{o'd}} & \text{(column scaling: destination totals)} \end{aligned} \]

12. Conformal Prediction Intervals (CPI)

Distribution-free conformal intervals following Vovk et al. (2005), Romano et al. (2019), and Barber et al. (2023). Guarantee finite-sample coverage without distributional assumptions.

  • Mondrian binning: residuals stratified by flow magnitude (4 bins: zero, low, medium, high).
  • Centered residuals: bias correction within each Mondrian bin (median-centered).
  • Multiplicative bootstrap: corridor aggregation with pivot correction (N = 500 replicates).
  • Intervals: 50% (25th–75th percentile) and 90% (5th–95th percentile).
\[ \mathrm{CI}_{\alpha}(x) \;=\; \Bigl[\hat{y}(x) + q_{\alpha/2}\!\bigl(R_{\mathcal{B}(x)}\bigr),\;\; \hat{y}(x) + q_{1-\alpha/2}\!\bigl(R_{\mathcal{B}(x)}\bigr)\Bigr] \]

\(R_{\mathcal{B}(x)}\) = centered residuals within the Mondrian bin \(\mathcal{B}(x) \in \{\mathrm{zero},\,\mathrm{low},\,\mathrm{medium},\,\mathrm{high}\}\); \(q_p(\cdot)\) denotes the \(p\)-th quantile.