Asela Rajapakse edited this page Apr 23, 2018 · 10 revisions

Processing of NetCDF Data with the Climate Data Operators

Introduction

Numerical climate scientists seek to represent ever finer physical processes and parameterizations in their Earth System Models, which requires higher grid resolutions over the area covered by their climate experiments. Higher global grid resolutions allow them to study the effects of smaller-scale physical processes on the Earth system and its climate. They therefore need an infrastructure that can cope with the anticipated massive data volumes and perform more computations per unit of time, and state-of-the-art supercomputing facilities are harnessed to meet this ever-increasing demand for computational power. A larger number of grid vertices also means more model output data that must be post-processed to extract information from it. A widely used format for array-oriented scientific data is the Network Common Data Form (NetCDF) family of data formats. The following scientific use case is an example of an analysis performed on data extracted from raw NetCDF model output using the Climate Data Operators (CDOs) post-processing software.

Scientific Context

In the paper chosen as a use case (see reference below), the authors describe the need to formulate a parameterization of cumulus clouds that considers the cloud population distribution. This distribution is determined by the mass flux distribution p(m) and the large-scale physical processes that control p(m). To identify the physical mechanisms that might lead to a certain distribution function and scale, they design a set of experiments using large-eddy simulations (LESs) of shallow cumulus convection based on two reference cases and several variations of them. They calculate the distribution of cloud-base mass fluxes for the two reference cases, and the results show different shapes and different ranges of mass flux values between the two cases. To explain the difference in p(m), the authors analyze the two reference cases along with their variations and find that p(m) is determined by the ratio of the sensible to latent heat fluxes at the surface, the Bowen ratio. They also show that neither changes in the forcing over the convective diurnal cycle nor the self-organization of the clouds influence p(m). The authors find that the Bowen ratio sets the thermodynamic efficiency of the moist heat cycle, which strongly influences the portion of the heat input that can be transformed into mechanical work to sustain convective circulations. Since the moist heat cycle controls the average mass flux per cloud, the Bowen ratio controls the mass flux per cloud and the shape of p(m). As a final step, a continuous mixed Weibull probability distribution is adopted to capture the different shapes of p(m), so that a functional form for p(m) can be used in cumulus parameterizations.
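The two quantities named above can be made concrete with a small sketch. The Bowen ratio is the surface sensible heat flux divided by the latent heat flux. For the distribution, the sketch assumes a mixture of two two-parameter Weibull components; the number of components, the function names, and all parameter values are illustrative assumptions and are not taken from the paper.

```python
import math


def bowen_ratio(sensible_heat_flux, latent_heat_flux):
    """Ratio of surface sensible to latent heat flux (same units for both)."""
    return sensible_heat_flux / latent_heat_flux


def weibull_pdf(m, shape, scale):
    """Two-parameter Weibull density for m > 0.

    Returns 0.0 for m <= 0; the single point m = 0 is ignored for simplicity.
    """
    if m <= 0:
        return 0.0
    z = m / scale
    return (shape / scale) * z ** (shape - 1) * math.exp(-(z ** shape))


def mixed_weibull_pdf(m, weight, shape1, scale1, shape2, scale2):
    """Two-component Weibull mixture (an assumed form, see the lead-in).

    `weight` is the share of component 1 and must lie between 0 and 1.
    """
    return (weight * weibull_pdf(m, shape1, scale1)
            + (1.0 - weight) * weibull_pdf(m, shape2, scale2))
```

Because the mixture weights sum to one, the mixed density still integrates to one, so it can stand in for p(m) wherever a normalized functional form is needed.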

Rationale

The post-processing of the raw model data for this piece of scientific work was performed with the CDOs, which are widely adopted for this purpose in the climate community. Since the original set of raw data has a size of several terabytes, transferring it to a processing location is costly and should be avoided. In this specific case, the model output data were post-processed on the same cluster they were generated on. Any further analysis, however, whether after the data have been put into long-term storage or by someone without access to the local cluster, may require moving the data to a processing location where the CDOs can be called. The example service for the GEF containerizes the CDOs and invokes them at the location of the GEF deployment, on a node that accepts Docker containers. It demonstrates this with a call of the CDO ‘gather’ operator, which merges the coordinate subdomains split across several thousand NetCDF files into a single large field in one file. This is a common final operation in climate post-processing scripts. In the finished use case, the distance between the service invocation location and the storage location will be minimized to leverage the flexibility of the GEF in choosing a computation location. Here we use a subset of the original data for the example computation to save processing time and avoid unnecessary waiting periods for those who wish to test the service.
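To illustrate what merging subdomains into a single field means, here is a minimal conceptual sketch in plain Python. It is not the CDO implementation: the function name and the convention of keying each tile by a (row, column) offset are assumptions made for this example, and how CDO itself determines the placement of each subdomain is not covered here.

```python
def gather_tiles(tiles):
    """Stitch rectangular subdomains into one 2D field.

    `tiles` maps a hypothetical (row_offset, col_offset) pair to a 2D list
    holding the values of one subdomain. Cells not covered by any tile
    stay None.
    """
    n_rows = max(r0 + len(tile) for (r0, _), tile in tiles.items())
    n_cols = max(c0 + len(tile[0]) for (_, c0), tile in tiles.items())
    field = [[None] * n_cols for _ in range(n_rows)]
    for (r0, c0), tile in tiles.items():
        for i, row in enumerate(tile):
            for j, value in enumerate(row):
                field[r0 + i][c0 + j] = value
    return field
```

With two 2x2 tiles placed side by side, `gather_tiles({(0, 0): [[1, 2], [5, 6]], (0, 2): [[3, 4], [7, 8]]})` yields a single 2x4 field.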

Input Data

Since this post-processing job serves as an example of employing the CDOs as a GEF service, we have minimized the data transfer volume, and consequently the post-processing time, by cropping the original climate model output files to a smaller size. This was achieved by selecting only one of the 61 climate variables that make up the model output, over only 15 time steps of the entire modeled time span. The remaining climate variable is the liquid water path (lwp), a measure of the mass of liquid water droplets in the atmosphere above a surface area of unit size. It is expressed in kilograms per square meter. Although this variable is important for the objective of this particular experiment, it was chosen randomly to showcase the functionality of the GEF in conjunction with the CDOs. The reduction leaves an overall data size of ca. 280 MB.
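The page does not state how the cropping was carried out. One plausible way, sketched below, is to chain the CDO operators `selname` and `seltimestep` for each input file. The helper name, the file names, and the choice of the first 15 time steps are assumptions made for illustration.

```python
def build_crop_command(infile, outfile, variable="lwp", n_steps=15):
    """Build a cdo call that keeps one variable and the first n_steps steps.

    Assumption: the retained time steps are the first ones in the file.
    The source only says that 15 time steps were kept, not which ones.
    """
    return [
        "cdo",
        f"-seltimestep,1/{n_steps}",  # time steps 1 to n_steps
        f"-selname,{variable}",       # keep only the named variable
        infile,
        outfile,
    ]
```

The returned list can be passed to `subprocess.run(..., check=True)` on a machine where the `cdo` executable is installed. Building the argument list separately keeps the command easy to inspect before it is run.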

The data set consists of an archive of NetCDF files and has been uploaded to the B2SHARE training instance. The corresponding B2SHARE record is available at https://trng-b2share.eudat.eu/records/53a8517a55e4449ca5c0dbc6acc0b37e, and the direct link to the archived data set is https://trng-b2share.eudat.eu/api/files/9bd0a681-d93f-46f9-8b37-c67e6edee571/rico_gcss_out_xy_lwp_15ts.tar.
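For readers who want to inspect the input data outside the GEF, the following sketch fetches and unpacks the archive with the Python standard library only. The function names are hypothetical, the layout of files inside the archive is not assumed, and the download step presumes that the B2SHARE training instance still serves the file.

```python
import tarfile
import urllib.request
from pathlib import Path

ARCHIVE_URL = ("https://trng-b2share.eudat.eu/api/files/"
               "9bd0a681-d93f-46f9-8b37-c67e6edee571/"
               "rico_gcss_out_xy_lwp_15ts.tar")


def download_archive(url, target):
    """Fetch the archive at `url` to the local path `target`."""
    target = Path(target)
    urllib.request.urlretrieve(url, target)
    return target


def extract_archive(archive, dest):
    """Unpack a tar archive into `dest` and return the extracted files.

    Entries that would end up outside `dest` are rejected.
    """
    dest = Path(dest).resolve()
    dest.mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive) as tar:
        for member in tar.getmembers():
            member_path = (dest / member.name).resolve()
            if member_path != dest and dest not in member_path.parents:
                raise ValueError(f"unsafe path in archive: {member.name}")
        tar.extractall(dest)
    return sorted(p for p in dest.rglob("*") if p.is_file())
```

A typical call sequence would be `extract_archive(download_archive(ARCHIVE_URL, "lwp.tar"), "lwp_data")`, which needs roughly 280 MB of free disk space for the download plus the extracted files.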

Dockerfile and Execution Script

This example service is implemented as a Dockerfile, available at https://github.com/EUDAT-GEF/GEF/blob/master/services/cdo_demo/Dockerfile. It is accompanied by the execution script at https://github.com/EUDAT-GEF/GEF/blob/master/services/cdo_demo/cdo_gather_lwp.sh.

References

Sakradzija, M. and C. Hohenegger, 2017: What determines the distribution of shallow convective mass flux through cloud base? J. Atmos. Sci., https://doi.org/10.1175/JAS-D-16-0326.1. (http://journals.ametsoc.org/doi/10.1175/JAS-D-16-0326.1)