FourCastNet
Pioneering transformer-based global forecaster. Trained on ERA5 (1979–2015), validated 2016–17, tested 2018+.
AI weather models are routinely ranked against ECMWF's ERA5 reanalysis, but reanalysis inherits the same sparse inputs the models do. Evaluating against the in-situ observation record tells a different story: systematic biases that reanalysis-centric evaluation hides surface immediately.
All models were selected for 0.25° × 0.25° spatial resolution (≈25 km in the tropics) and 6-hourly forecast output. Aurora is included at its 0.25° checkpoint for parity. GenCast is treated separately for cyclone tracks (probabilistic, 12-hourly, 32-member).
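As a rough check on what a 0.25° grid means on the ground, the sketch below converts grid spacing to kilometres under a spherical-Earth assumption. The function name and the 6371 km mean radius are illustrative choices, not part of the benchmark's own code.

```python
import math

EARTH_RADIUS_KM = 6371.0  # assumed mean Earth radius (spherical approximation)


def grid_spacing_km(resolution_deg, latitude_deg):
    """Return (north_south_km, east_west_km) spanned by one grid step."""
    km_per_degree = math.pi * EARTH_RADIUS_KM / 180.0
    north_south = resolution_deg * km_per_degree
    # Meridians converge toward the poles, so east-west spacing
    # shrinks with cos(latitude).
    east_west = north_south * math.cos(math.radians(latitude_deg))
    return north_south, east_west


equator = grid_spacing_km(0.25, 0.0)
tropic = grid_spacing_km(0.25, 23.44)  # approximate latitude of the Tropic of Cancer
print(f"equator: {equator[0]:.1f} km x {equator[1]:.1f} km")
print(f"23.44N:  {tropic[0]:.1f} km x {tropic[1]:.1f} km")
```

Under these assumptions the east-west spacing is about 28 km at the equator and close to 25 km near the Tropic of Cancer, which is the sense in which a 0.25° grid is usually described as a 25 km grid.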
Update to FourCastNet using spherical operators. 13 pressure levels; same training regime on ERA5.
Trained on ERA5 1979–2017, validated 2019, tested 2018. The 6-hourly checkpoint is used here for consistency.
Graph-based attention. Trained on ERA5 (1979–2018), fine-tuned on ECMWF HRES (2016–2021). 37 pressure levels.
Foundation model pre-trained on 16 datasets (ERA5, HRES, IFS/GEFS ensembles, CMIP6, MERRA-2, CAMS). Used at 0.25° checkpoint.
ECMWF's data-driven system. Trained on ERA5 1979–2018, fine-tuned on operational analyses. Outputs regridded from N320 to 0.25°.
The only probabilistic AIWP in the assessment. Trained on ERA5 1979–2018. 12-hourly cadence, 32-member ensemble used for cyclone trajectory analysis.
ECMWF HRES: 9 km deterministic, 12-hourly, 10-day lead.
IFS Ensemble: 50-member, 18 km, 15-day lead.
IFS Ensemble Mean: the mean of the 50-member ensemble.
Mean absolute error (MAE) for six AIWPs, scored against three references: India Meteorological Department (IMD) station observations (rows labelled a, d, g, j, m, p), ERA5 reanalysis (b, e, h, k, n, q), and ECMWF HRES operational forecasts (c, f, i, l, o, r), across lead times of 1–10 days for 2022. Color encodes the percentage difference in MAE relative to ECMWF's IFS HRES: blue is better, red is worse.
Every AI weather model in the study shows substantially larger errors against station observations than against ERA5 — and the gap widens with lead time. Reanalysis is not ground truth; it is a model output conditioned on the same sparse inputs. For South Asia, where station density is already low, the evaluation layer matters as much as the forecast layer.
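The scoring described above can be sketched as follows. This is a minimal illustration, assuming the percentage difference is computed as (model MAE − HRES MAE) / HRES MAE × 100; the function names and the sample values are hypothetical and are not taken from the benchmark.

```python
import math


def mean_absolute_error(forecast, reference):
    """MAE over pairs where both values are present (NaN marks a gap)."""
    errors = [
        abs(f - r)
        for f, r in zip(forecast, reference)
        if not (math.isnan(f) or math.isnan(r))
    ]
    if not errors:
        raise ValueError("no valid forecast/reference pairs")
    return sum(errors) / len(errors)


def percent_difference(model_mae, baseline_mae):
    """Percentage change in MAE relative to the baseline.

    Negative values mean the model has lower error than the baseline.
    """
    return 100.0 * (model_mae - baseline_mae) / baseline_mae


# Made-up 2 m temperatures in deg C, for illustration only (not benchmark data).
station_obs = [30.0, 31.5, 29.0, float("nan")]
ai_forecast = [31.0, 30.5, 30.0, 28.0]
hres_forecast = [30.5, 31.0, 29.5, 29.0]

ai_mae = mean_absolute_error(ai_forecast, station_obs)
hres_mae = mean_absolute_error(hres_forecast, station_obs)
print(f"AI MAE={ai_mae:.2f}, HRES MAE={hres_mae:.2f}, "
      f"difference={percent_difference(ai_mae, hres_mae):+.0f}%")
```

Swapping `station_obs` for reanalysis values at the same locations, while keeping the forecasts fixed, is what changes the score between reference types.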
The benchmark is only as honest as its reference data. MAUSAM sets conventional reanalysis against a layered stack of in-situ and satellite observations, the measurements models rarely see.
6-hourly, 0.25° × 0.25° reanalysis, 2021–2024. The default baseline in the AIWP literature. Source: Copernicus CDS.
Hourly point-based surface observations from IMD's weather station network, accessed through the Meteostat Python API. Used to validate 2 m temperature (T2), 10 m wind components (U10, V10), and precipitation during extreme events.
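Surface stations typically report wind as a speed and a direction, whereas the models output U10 and V10 components, so a conversion is needed before the two can be compared. The sketch below assumes speed in km/h and the meteorological convention that direction is where the wind blows from; both should be checked against the actual station data. The function name is hypothetical.

```python
import math


def wind_to_uv(speed_kmh, direction_deg):
    """Convert wind speed and direction to (u, v) components in m/s.

    speed_kmh: wind speed in km/h (assumed unit).
    direction_deg: direction the wind blows FROM, in degrees clockwise
    from north (meteorological convention).
    """
    speed_ms = speed_kmh / 3.6
    direction_rad = math.radians(direction_deg)
    u = -speed_ms * math.sin(direction_rad)  # eastward component
    v = -speed_ms * math.cos(direction_rad)  # northward component
    return u, v


u10, v10 = wind_to_uv(36.0, 270.0)  # a westerly wind of 36 km/h
print(f"u10={u10:.1f} m/s, v10={v10:.1f} m/s")
```

A westerly wind blows from the west toward the east, so it yields a positive eastward component and a near-zero northward component.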
Daily-averaged gridded rainfall from up to 6,995 rain gauges across India. Used for 2022 and 2024 monsoon-season verification.
Geostationary cloud-top products (processed clear-sky cloud fraction at 0.5°), used to validate total cloud cover diagnostics from AIFS. Source: ISRO MOSDAC.
Best-track cyclone trajectory data for Tauktae (2021) and Yaas (2021), used to benchmark deterministic tracks, the 50-member IFS ensemble, and the 32-member GenCast ensemble.
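One common way to benchmark tracks against best-track data is the great-circle distance between forecast and observed storm centres at matched times. The sketch below assumes that metric, a spherical Earth, and tracks already aligned in time; the source does not state which track-error definition is used, and the positions are synthetic.

```python
import math

EARTH_RADIUS_KM = 6371.0  # assumed mean Earth radius


def great_circle_km(lat1, lon1, lat2, lon2):
    """Haversine distance in km between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def track_errors(forecast_track, best_track):
    """Position error (km) at each matched time; tracks are lists of (lat, lon)."""
    return [
        great_circle_km(f_lat, f_lon, b_lat, b_lon)
        for (f_lat, f_lon), (b_lat, b_lon) in zip(forecast_track, best_track)
    ]


def ensemble_mean_errors(member_tracks, best_track):
    """Average the per-member position errors at each time step."""
    per_member = [track_errors(track, best_track) for track in member_tracks]
    return [sum(step) / len(step) for step in zip(*per_member)]


# Synthetic positions for illustration only (not real best-track data).
best = [(15.0, 72.0), (17.0, 71.5)]
members = [
    [(15.0, 72.0), (17.5, 71.5)],
    [(15.0, 72.0), (16.5, 71.5)],
]
print([round(e, 1) for e in ensemble_mean_errors(members, best)])
```

The mean of the member errors is not the same as the error of the ensemble-mean track. In this example the two members straddle the best track at the second time step, so their mean position coincides with it even though each member is about 56 km off.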