In climate computing, netCDF is often used as a container for two types of data: the grid and the measurements. Typically, a grid is represented as a set of points, each of which represents a position on the Earth's surface. Many models use pairs of longitude and latitude values for this purpose, but there are also more complex grids that include height, vertices, topography, and so on. To each grid point we can assign a set of different values, such as temperature, air pressure, or insolation. We also need to take into account that these values can change over time.
In netCDF, a simple grid can be constructed from two dimensions, e.g. longitude and latitude. Changes over time can be represented by another dimension, e.g. time. These dimensions can be assigned to a variable that contains our measurements. As a result, we get a variable in which each value is assigned to a longitude, a latitude, and a time step. A netCDF file can hold several such variables connected to the same dimensions. But as soon as we start using several files, the grid is duplicated, because each file must contain its own copy. Depending on the size and complexity of the grid, this can consume a significant amount of disk space. Therefore, we need a way to reuse the grid.
An obvious solution to this problem is to store the grid in one file and create links to the grid in the other files. However, the current netCDF version lacks such a feature, so we decided to extend netCDF and wrote a patch.
Behind the netCDF-4 API sits the HDF5 library, which has a wide range of useful features that we can use to implement link functionality in netCDF. We decided to use HDF5 Virtual Datasets (VDS), a feature introduced in HDF5 1.10.0. VDS is a powerful feature of HDF5 and provides more functionality than is needed for our purpose; our patch uses it in a simple way. It looks at the dimensionality of the source dataset and creates a virtual dataset with the same dimensionality in the target file. Unlimited dimensions are not supported yet.
netCDF dimensions are emulated in HDF5 by HDF5 datasets, with heavy use of HDF5 attributes. More precisely, for each dimension netCDF creates a dataset and attaches a set of attributes. The dataset can store dimension labels, and the attributes contain meta information about the dimension, e.g. index, name, and attached variables. When virtual datasets are used to create links to datasets, these attributes are not created automatically; this work must be done manually. The attributes “CLASS”, “NAME”, and “REFERENCE_LIST” can easily be created with the HDF5 dimension scale interface. The attribute “_Netcdf4Dimid” is a pure netCDF component and is created with the HDF5 attribute interface.
Although HDF5 allows virtual datasets to be created even if the datasets they link to do not exist, we could not make our patch work this way. To work properly, our patch requires information from the source file, such as dimensionality and datatype. This implies that the file being linked to, and valid datasets in it, must exist at runtime.
If, after the links are created, the target file becomes inaccessible for some reason (e.g. deleted, renamed, unreadable, …) and the target dataset is not available, the links will be filled with default values. In our case this is the value 0.
Our patch introduces a new function:
int nc_def_dim_external(int ncid, const int dimncid, const char *name, int *idp)
| Name | Direction | Description |
|---|---|---|
| ncid | in | ID of the file in which the links will be created. |
| dimncid | in | ID of the file in which the dimensions are located. |
| name | in | Dimension name. |
| idp | out | Dimension ID. |
The patches are applied with git apply:

    git apply hdf5-1.10.0-patch1.patch
    git apply netcdf-c-4.4.1-rc2.patch
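Example usage:

    int nlat, dimid;
    int grid_ncid, data_ncid;
    const char* gridfile = "grid.nc";
    const char* datafile = "data.nc";

    /* Open the file that holds the grid, read-only. */
    nc_open(gridfile, NC_NOWRITE, &grid_ncid);
    /* Create the file that will contain the links. */
    nc_create(datafile, NC_NETCDF4, &data_ncid);
    /* Link the dimension "lat" of grid.nc into data.nc. */
    nc_def_dim_external(data_ncid, grid_ncid, "lat", &dimid);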
In the first step, a netCDF-4 file “grid.nc” is created. It contains the labeled dimensions “lat”, “lon”, and “time” and a variable “var1”. The output of ncdump shows the header of the file.
Download: mkncfile.c
In the second step another netCDF file “data.nc” is created. It has the same structure as the file in the previous step, but the dimensions are connected to the “grid.nc” file. Unlimited dimensions are not supported at the moment. They are converted to limited ones, as you can see in the output of ncdump.
Download: mklink.c
As a next step, we plan to integrate external dimensions into our workflows.