Data Science Working Group Blog

The quest for reliably stable ML emulators of sub-grid convection for climate models

Warning: This is a disorganized snapshot of follow-on thoughts about a topic that is set up somewhat more carefully in a formal blog post I wrote for E3SM back in August 2020, which might be helpful background reading. I readily admit it is shamelessly self-referential, leaning on the work I am most familiar with.

It’s irresistible to dream that we might be able to sidestep Moore’s law for climate modeling by outsourcing the truly expensive calculations of prognostic multiphase turbulence models to neural network emulators trained to approximate their coarse-grained inputs and outputs. This could usher in an era of explicit (large-eddy-simulation quality) turbulence feedbacks in climate change prediction, and stress test the limits of today’s approximations of boundary layer processes, which should otherwise keep us predictors of future climate variability up at night.


But that dream only works if these ML emulators can generally behave well when they’re implanted in a host planetary model of atmospheric dynamics. The problem, based on my limited experience in crudely approximating the inputs and outputs of superparameterized (SP) cloud resolving models (CRMs), is that this may be the exception rather than the norm. With hindsight I think it’s fair to say we got pretty lucky in Rasp et al. (PNAS, 2018) with a stable integration of an SP-emulator; even then it was hard-won, and I can attest that our attempts since then to replicate that prognostic stability have failed more often than not.
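To make the setup concrete, here is a minimal sketch of what “implanting” an emulator means in practice: at every timestep and in every column, an NN stands in for the embedded CRM’s tendency calculation inside the host model’s loop. Everything here is hypothetical and schematic, not any actual model’s code.

```python
import numpy as np

# Minimal sketch of a hybrid GCM timestep with an NN emulator standing in
# for the embedded CRM. All names are hypothetical; `emulator` is any
# fitted model with a scikit-learn/Keras-style predict() method.

def hybrid_timestep(state, emulator, dt):
    """Advance one GCM step, replacing the SP-CRM call with an NN emulator.

    state: dict with 'T' (K) and 'q' (kg/kg) profiles, shape (ncol, nlev)
    """
    ncol, nlev = state["T"].shape

    # 1. Resolved dynamics (advection, pressure gradients, ...): placeholder.
    dyn_T, dyn_q = np.zeros((ncol, nlev)), np.zeros((ncol, nlev))

    # 2. NN emulator of sub-grid convection: maps each column's coarse
    #    state to heating and moistening tendencies, instead of running
    #    the expensive CRM.
    x = np.concatenate([state["T"], state["q"]], axis=1)  # (ncol, 2*nlev)
    y = emulator.predict(x)                               # (ncol, 2*nlev)
    conv_T, conv_q = y[:, :nlev], y[:, nlev:]

    # 3. Update the prognostic state. The stability problem lives here:
    #    small emulator errors feed back through the dynamics next step.
    state["T"] += dt * (dyn_T + conv_T)
    state["q"] += dt * (dyn_q + conv_q)
    return state
```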

So what will it take to do this reliably, reproducibly and thus operationally?

The truth is nobody knows. But here are some of the growing pool of ideas circulating these days along with my current views.

Maybe it takes really meticulous expert feature-engineering.

Spectacular work by Yuval and O’Gorman (Nat. Comm., 2020) and Yuval et al. (GRL, 2021) shows what you can do if highly-trained atmospheric convection specialists hyper-focus on ripping out every sub-kernel of a CRM so as to emulate only separable sub-processes (e.g. condensation mass rates), building a very crafty training dataset that, by virtue of its setup, allows conservation laws and consistency between moistening and heating budgets to be perfectly satisfied. Their prognostic results for a hypohydrostatic aquaplanet are super impressive on many levels.
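To illustrate the kind of consistency that setup buys (my cartoon paraphrase, not their code): if the NN predicts a condensation mass rate, then both the moistening and heating tendencies can be derived from that single field, so the two budgets cannot disagree.

```python
import numpy as np

# Cartoon of emulating a separable sub-process (condensation) rather than
# end-to-end tendencies, in the spirit of Yuval & O'Gorman (2020).
# My paraphrase, not their code; names are hypothetical.

LV = 2.5e6   # latent heat of vaporization, J/kg
CP = 1004.0  # specific heat of dry air at constant pressure, J/(kg K)

def tendencies_from_condensation(cond_rate):
    """Derive moistening and heating from one predicted field.

    cond_rate: NN-predicted condensation mass rate, kg/kg/s, >= 0.
    Because both tendencies come from the same field, the moisture and
    energy budgets are consistent by construction.
    """
    dq_dt = -cond_rate              # condensation removes vapor
    dT_dt = (LV / CP) * cond_rate   # and releases latent heat
    return dq_dt, dT_dt

# e.g. 1 g/kg of vapor condensed over an hour:
dq, dT = tendencies_from_condensation(np.array([1e-3 / 3600.0]))
```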

Even if we don’t fully know yet whether the approach can be extended to real-geography operational settings, my money is on this being one of the most promising paths currently available.

But one computational bummer of this approach is that it presumes no coarse-graining of time is possible, limiting potential speedup from NN emulators. This isn’t a super big deal for the overall dream, since plenty of Moore’s law can be bypassed by coarse-graining in space alone.

The bigger bummer is that it takes such specialized expertise and artery-rewiring in the first place to achieve the goal of a sufficient training dataset and prognostic testing platform. That’s hardly the promise of the deep learning revolution Chollet’s textbook entices us with, where avoiding feature engineering with deep NNs is half the point. But then again the needs of cat vs. dog recognition are more superficial than a climate model’s interaction with its convection scheme so perhaps that promise was just an illusion.

I admit to still being a bit enamored with the idea of avoiding feature-engineering, which is part of what has me still searching for the limits of what’s possible with our far cruder superparameterization-emulation training and testing harness. To wit:

Maybe it just takes industrial-scale hyperparameter tuning.

At least some recent work in Ott et al. (2020) convinced me that letting a GPU loose on a semi-automated hyperparameter search can put a little dent in some of our own more pesky prognostic instability problems. But it is only a small dent: of hundreds of NN fit trials, only one or two had the dual promise of both competitive stability and error rates in prognostic mode. If that 1 percent return on auto-tuning investment, even for a simple homogeneous aquaplanet with small 2D embedded CRMs, is representative, then I’ll worry that the many more DOF of fitting the actual dream (real 3D LES embedded in real-geography, interannually forced operational climate models) might be overwhelming to hit with hyperparameter horsepower alone.
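For concreteness, here is the flavor of harness I have in mind: a generic random search that tracks both offline error and whether a fit survives a prognostic test. This is a hypothetical sketch, not the actual setup of Ott et al. (2020), and `train_and_score` is an assumed stand-in for the expensive part.

```python
import random

# Generic random hyperparameter search. `train_and_score` is a
# hypothetical callable that fits an NN for one configuration and
# returns (offline_error, survived_prognostic_test).

SPACE = {
    "depth":      [4, 6, 8, 10],
    "width":      [128, 256, 512],
    "lr":         [1e-4, 3e-4, 1e-3],
    "batch_size": [512, 1024, 4096],
    "dropout":    [0.0, 0.1, 0.25],
}

def random_search(train_and_score, n_trials=200, seed=0):
    rng = random.Random(seed)
    survivors = []
    for trial in range(n_trials):
        cfg = {k: rng.choice(v) for k, v in SPACE.items()}
        offline_error, stable = train_and_score(cfg)
        # The catch: offline skill alone doesn't predict coupled
        # stability, so both criteria have to be tracked per trial.
        if stable:
            survivors.append((offline_error, cfg))
    return sorted(survivors, key=lambda t: t[0])  # best stable fits first
```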

Or maybe not. What does overwhelming mean these days? I’ve never compared the Watts of an SP or GCRM simulation to the Watts of a hyperparameter tuning survey. Superficially, it seems like we academic dabblers in ML are so far using only a tiny fraction of the horsepower available to us on community GPU clusters to train really big NNs on really big datasets. But isn’t that how Facebook makes money? Maybe it’s like MPI was just invented but nobody knows how to write the multi-GPU directives in their code yet. Maybe they already do and I’m out of touch. Anyways, most everyone I talk to seems to share my own lazy inclination to limit imagination to what fits on just one GPU (or even their laptop) for now. But Ben-Nun et al. (GMD, 2020) is inspiring on another level.

The allure of coupled online learning.

One thing that also seems under-explored but very promising is baking the coupled error into the ML learning target in the first place. Rasp (GMD, 2020) has commented on that retroactively, and pioneering work by Brenowitz and Bretherton (GRL, 2018) did this from the outset in ways that should be mind-expanding (see their section 3.2). Their work was possible thanks to craftily focusing on a single-column learning target, where the long-term behavior of an NN fit under advective forcing could be sampled because of the simplicity of the large-scale forcing and the limited interactivity of the convection-permitting element. The detectable improvements to prognostic stability from emulators trained on long-time-horizon behavior (rather than a local-in-time skill metric) in that work are provocative.
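In the spirit of their section 3.2 (again my cartoon, not their code): unroll the emulator inside a toy single-column stepper under prescribed forcing and penalize error accumulated over the whole trajectory rather than over one step. In practice this would live in an autodiff framework so the multi-step loss can actually be optimized; the names below are hypothetical.

```python
import numpy as np

# Cartoon of an "online" learning target: score the emulator on its
# long-horizon behavior inside a toy single-column stepper, not on
# one-step error.

def trajectory_loss(emulator, x0, forcing, truth, dt, n_steps):
    """Unroll the NN under prescribed advective forcing, accumulate error.

    x0:      initial column state, shape (nstate,)
    forcing: prescribed large-scale forcing, shape (n_steps, nstate)
    truth:   reference trajectory from the CRM, shape (n_steps, nstate)
    """
    x, loss = x0.copy(), 0.0
    for t in range(n_steps):
        x = x + dt * (forcing[t] + emulator.predict(x[None, :])[0])
        loss += np.mean((x - truth[t]) ** 2)
    # Errors that compound into drift or blow-up dominate this loss,
    # which is exactly what a local-in-time skill metric fails to see.
    return loss / n_steps
```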

Depressingly (to me), generalizing that would seem to require wrapping the GCM in Python; is that the end game of the valiant attempts by the team at Vulcan Inc. (e.g. Watt-Meyer et al., GRL, 2021)? Or re-writing one from the bottom up (see CliMA)? Or otherwise finding ways to bake information about the coupled problem into the optimization target? I personally hope a gentler way can be found that is more immediately accessible.

Meanwhile I definitely don’t think we as humans can anticipate all the potential prognostic stability problems. See the valiant attempt by Brenowitz et al. (JAS, 2020). Credibility metrics like “does the NN convection intensify monotonically with increased moistening or reduced lapse rates?” or “does the NN convection coupled to a minimal dynamical harness make obviously unsatisfying coupled waves?” are great ways to rule out crappy fits. But I doubt they sample enough of the possible DOF along which things can go haywire once coupled to reliably identify great ones.
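Here is one of those credibility checks in code form, to fix ideas. The perturbation sizes and the pass/fail rule are my own illustrative choices, not anything from Brenowitz et al. (2020).

```python
import numpy as np

# Sketch of a credibility check: does column-integrated NN heating
# increase monotonically as we moisten the input profile? Names,
# perturbation factors, and the pass/fail rule are illustrative only.

def passes_moistening_monotonicity(emulator, column, q_slice, heat_slice,
                                   factors=(1.0, 1.05, 1.10, 1.15)):
    """column: 1D input vector; q_slice/heat_slice index its humidity
    inputs and heating outputs, respectively."""
    heats = []
    for f in factors:
        x = column.copy()
        x[q_slice] *= f                       # progressively moisten
        y = emulator.predict(x[None, :])[0]
        heats.append(y[heat_slice].sum())     # column-integrated heating
    return bool(np.all(np.diff(heats) >= 0))  # heating should not decrease
```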

Just think of how nonlinear a GCM is. A tiny error in moistening rate induced by an NN’s imperfect fit, once used online, can for instance produce an ice crystal in the stratosphere that creates longwave radiative heating anomalies with outsize influences on the underlying troposphere. This can in turn rapidly drive the input vector the NN was accustomed to in its training habitat way out of sample when it migrates to a prognostic setting. How many more nonlinear pathways like this exist? Surely more than the human mind can anticipate. In my opinion we need to play to the strengths of ML in monotonously sampling all the potential DOF, again by somehow making coupled error, and dense sampling of it, the new machine learning target.

There are so many uncharted frontiers.

For instance: physically renormalizing inputs so that they remain bounded in all climates (like using relative humidity instead of specific humidity as the moisture variable) definitely has tangible benefits. I have experienced that viscerally in some recent tests. See Paul O’Gorman’s first talk about this and Tom Beucler’s latest work on trying to find “climate invariant” NN emulators of convection (CITE). How much have we really thought about the normalization strategies coming into and out of our NNs? A rough sketch of this kind of rescaling follows below.

Work on stochastic fits is in its infancy, but a dribble of stochasticity could have some desirable effects on a coupled integration’s stability. Maybe Elizabeth Barnes’ latest “IDK” network architectures could allow an NN emulator to infrequently opt out when things are too uncertain to make a claim, sidestepping the situations most prone to coupled error and accepting the raw expense of explicit convection as permissible in those instances, if they are rare enough.

Can we use conditionality clauses (classifiers upstream of dedicated sub-networks), as championed by Gettelman et al. (JAMES, 2021) in producing a prognostically stable microphysical emulator that plays well with a modern version of CESM? The list goes on; see a recent book chapter by Beucler et al. (2021).
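As promised above, here is the flavor of climate-invariant rescaling I mean: a simple Magnus-formula conversion from specific to relative humidity, for illustration only, not Beucler et al.’s exact transformation.

```python
import numpy as np

# Sketch of a "climate invariant" input rescaling: feed the NN relative
# humidity, which stays within a familiar range as the climate warms,
# instead of specific humidity, which does not. Uses a Magnus-type
# saturation vapor pressure formula for illustration.

def relative_humidity(q, T, p):
    """q: specific humidity (kg/kg), T: temperature (K), p: pressure (Pa)."""
    T_c = T - 273.15
    e_sat = 610.94 * np.exp(17.625 * T_c / (T_c + 243.04))  # sat. vapor pressure, Pa
    q_sat = 0.622 * e_sat / (p - 0.378 * e_sat)             # sat. specific humidity
    return q / q_sat

# In a +4 K climate, q can drift far outside the training range while
# RH barely moves, helping keep the NN's inputs in-sample.
```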

I don’t claim to see one clear path to success, but I really hope we all collectively settle on some more reliable ones than are presently available, sooner rather than later. The dream of sidestepping Moore’s law in the era of ML, to usher in an age of next-gen climate simulation with explicit turbulence feedbacks ahead of schedule, is worth it!

How about a little CLIVAR DSWG blog discussion?

p.s. I readily admit there are likely multiple mistakes made and important things missed in this informal core-dump, which is overly self-referential, and that I am generally prone to hyperbole and imprecise language. Sorry if any of that rubs you the wrong way; I’ve tried not to let it stop me writing some stuff anyways. Looking forward to a constructive conversation! -Pritch.

