Disk-based caching of regridding precomputed weights caching

New in version 1.0.0.

Note

At the moment, this configuration only applies to the precomputed backends in regrid().

Purpose

earthkit-geo uses a dedicated directory to store interpolation matrices and the related index file downloaded from the remote inventory. By default, this directory serves as a cache and is managed, i.e. its size is checked and limited. This means that if we run regrid() again with the same input and output grids, the matrix is loaded from the cache instead of being downloaded again. Caching also provides monitoring and disk space management: when the cache is full, cached data is deleted according to the configuration (i.e. the oldest data is deleted first). The cache is implemented using an SQLite database running in a separate thread.
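To make the "oldest data is deleted first" behaviour concrete, here is a minimal sketch of an SQLite-backed index that evicts the oldest entries once a size limit is exceeded. It is not earthkit-geo's implementation: the table schema and the names make_index, add_entry, total_size and evict_oldest are hypothetical, and a real cache would also delete the files on disk and run the database in a separate thread.

```python
import sqlite3


def make_index():
    """Create an in-memory index (hypothetical schema, for illustration only)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE entries (path TEXT PRIMARY KEY, size INTEGER, creation_date REAL)"
    )
    return conn


def add_entry(conn, path, size, creation_date):
    """Register a cached file with its size in bytes and creation time."""
    conn.execute(
        "INSERT OR REPLACE INTO entries VALUES (?, ?, ?)",
        (path, size, creation_date),
    )


def total_size(conn):
    """Return the total number of bytes recorded in the index."""
    return conn.execute("SELECT COALESCE(SUM(size), 0) FROM entries").fetchone()[0]


def evict_oldest(conn, max_size):
    """Remove the oldest entries until the total size fits within max_size."""
    evicted = []
    while total_size(conn) > max_size:
        row = conn.execute(
            "SELECT path FROM entries ORDER BY creation_date ASC LIMIT 1"
        ).fetchone()
        if row is None:  # nothing left to evict
            break
        conn.execute("DELETE FROM entries WHERE path = ?", (row[0],))
        evicted.append(row[0])  # a real cache would also delete the file here
    return evicted


conn = make_index()
add_entry(conn, "a.npz", 400, creation_date=1.0)
add_entry(conn, "b.npz", 400, creation_date=2.0)
add_entry(conn, "c.npz", 400, creation_date=3.0)
evicted = evict_oldest(conn, max_size=900)
```

In this sketch only the oldest entry is evicted, because removing it is enough to bring the total back under the limit.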

Please note that the earthkit-geo cache configuration is managed through the Configuration.

Warning

The earthkit-geo cache is intended to be used by a single user. Sharing the cache between multiple users is not recommended. Keeping a local copy of the data on a shared disk so that multiple users can work with it is a different use case and should be supported through mirrors.

Cache policies

The primary config option to control the cache is cache-policy, which can take the following values: user (the default), temporary and off.

The cache location can be read and modified with Python (see the details below).

Tip

See the Matrix disk cache notebook for examples.

Note

It is recommended to restart your Jupyter kernels after changing the cache policy or location.

User cache policy

When the cache-policy is “user” the cache will be active and created in a managed directory defined by the user-cache-directory config option. This is the default value.

Note

The default location of the user cache directory is "~/.cache/earthkit-geo" and its maximum size is 5 GB.

The user cache directory is not cleaned up on exit, so the next time you start earthkit-geo it will still be there unless it has been deleted manually or the configuration assigns a different path on each startup. Also, when you run multiple sessions of earthkit-geo under the same user, they will share the same cache.

We can query the directory path via the Configuration and also by calling the directory() cache method.

>>> from earthkit.geo import cache, config
>>> config.set("cache-policy", "user")
>>> config.get("user-cache-directory")
'/Users/username/.cache/earthkit-geo'
>>> cache.directory()
'/Users/username/.cache/earthkit-geo'

The following code shows how to change the user-cache-directory config option:

>>> from earthkit.geo import config
>>> config.get("user-cache-directory")  # Find the current cache directory
'/Users/username/.cache/earthkit-geo'
>>> # Change the value of the setting
>>> config.set("user-cache-directory", "/big-disk/earthkit-geo-cache")

# Python kernel restarted

>>> from earthkit.geo import config
>>> config.get("user-cache-directory")  # Cache directory has been modified
'/big-disk/earthkit-geo-cache'

More generally, the earthkit-geo config options can be read, modified and reset to their default values from Python; see the Configs documentation.

Temporary cache policy

When the cache-policy is “temporary” the cache will be active and located in a managed temporary directory created by tempfile.TemporaryDirectory. This directory will be unique for each earthkit-geo session. When the directory object goes out of scope (at the latest on exit) the cache is cleaned up.

Because this directory path is temporary, it cannot be queried via the Configuration; instead, we need to call the directory() cache method.

>>> from earthkit.geo import cache, config
>>> config.set("cache-policy", "temporary")
>>> cache.directory()
'/var/folders/ng/g0zkhc2s42xbslpsywwp_26m0000gn/T/tmp_5bf5kq8'

We can specify the parent directory for the temporary cache by using the temporary-cache-directory-root config option. By default it is set to None (no parent directory specified).

>>> from earthkit.geo import cache, config
>>> s = {
...     "cache-policy": "temporary",
...     "temporary-cache-directory-root": "~/my_demo_cache",
... }
>>> config.set(s)
>>> cache.directory()
'~/my_demo_cache/tmp0iiuvsz5'
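The behaviour described above follows from how tempfile.TemporaryDirectory works in the standard library. The sketch below uses only tempfile and os, not earthkit-geo; the variable root is a stand-in for a parent directory such as the one given by temporary-cache-directory-root, and the two "sessions" are simulated within one process.

```python
import os
import tempfile

# A throwaway parent directory, standing in for a configured root (assumption).
with tempfile.TemporaryDirectory() as root:
    # Two simulated sessions, each with its own directory under the same parent.
    session_a = tempfile.TemporaryDirectory(dir=root)
    session_b = tempfile.TemporaryDirectory(dir=root)

    # Every TemporaryDirectory gets a unique, randomly generated name.
    distinct = session_a.name != session_b.name
    under_root = os.path.samefile(os.path.dirname(session_a.name), root)

    # cleanup() removes the directory and its contents; the same happens
    # when the object is garbage collected or the interpreter exits.
    path_a = session_a.name
    session_a.cleanup()
    removed = not os.path.exists(path_a)

    session_b.cleanup()
```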

Off cache policy

When the cache-policy is “off” no disk-based caching is available. In this case all files are downloaded into an unmanaged temporary directory created by tempfile.TemporaryDirectory. Since caching is disabled, all repeated calls to regrid() will download the interpolation matrix again! This temporary directory will be unique for each earthkit-geo session. When the directory object goes out of scope (at the latest on exit) the directory will be cleaned up.

Because this directory path is temporary, it cannot be queried via the Configuration; instead, we need to call the directory() cache method.

>>> from earthkit.geo import cache, config
>>> config.set("cache-policy", "off")
>>> cache.directory()
'/var/folders/ng/g0zkhc2s42xbslpsywwp_26m0000gn/T/tmp_5bf5kq8'

We can specify the parent directory for the temporary directory by using the temporary-directory-root config option. By default it is set to None (no parent directory specified).

>>> from earthkit.geo import cache, config
>>> s = {
...     "cache-policy": "off",
...     "temporary-directory-root": "~/my_demo_tmp",
... }
>>> config.set(s)
>>> cache.directory()
'~/my_demo_tmp/tmp0iiuvsz5'

Cache methods

The cache is controlled by a global object, which we can access as earthkit.geo.cache.

>>> from earthkit.geo import cache
>>> cache
<earthkit.geo.utils.caching.Cache object at 0x117be7040>

When cache-policy is user or temporary, a set of methods is available on this object to manage and interact with the cache.

Methods/properties of the cache object

Method/property          Description
policy                   Get the current cache policy object.
directory()              Return the path to the current cache directory.
size()                   Return the total number of bytes stored in the cache.
check_size()             Check the cache size and trim it down when needed.
entries()                Dump the entries stored in the cache.
summary_dump_database()  Return the number of items and total size of the cache.
purge()                  Delete entries from the cache.

Warning

check_size() runs automatically when a new entry is added to the cache or when any of the cache config parameters change.

Examples:

>>> from earthkit.geo import cache
>>> cache.policy.name
'user'
>>> cache.directory()
'/Users/username/.cache/earthkit-geo'
>>> cache.size()
846785699
>>> cache.summary_dump_database()
(40, 846785699)
>>> d = cache.entries()
>>> len(d)
40
>>> d[0].get("creation_date")
'2023-10-30 14:48:31.320322'

Cache limits

Warning

These config options do not work when cache-policy is off.

Maximum-cache-size

The maximum-cache-size setting ensures that earthkit-geo does not use too much disk space. Its value sets the maximum disk space used by the earthkit-geo cache. When the cache disk usage goes above this limit, earthkit-geo triggers its cache cleaning mechanism before downloading additional data. The value of maximum-cache-size is absolute (such as “10G”, “10M”, “1K”). To disable it use None.
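As an illustration of how absolute size strings of this form could be turned into a byte count, here is a small sketch. The function parse_size is hypothetical and not part of earthkit-geo, and the use of binary multiples (1K = 1024 bytes) is an assumption; the library may interpret the units differently.

```python
import re

# Assumption: binary multiples. The library may use a different convention.
_UNITS = {"": 1, "K": 1024, "M": 1024**2, "G": 1024**3, "T": 1024**4}


def parse_size(value):
    """Convert a size such as "10G", "10M", "1K" or "5GB" to bytes.

    None is passed through unchanged, meaning "no limit".
    """
    if value is None:
        return None
    match = re.fullmatch(
        r"\s*(\d+(?:\.\d+)?)\s*([KMGT]?)B?\s*", str(value), re.IGNORECASE
    )
    if match is None:
        raise ValueError(f"Cannot parse size: {value!r}")
    number, unit = match.groups()
    return int(float(number) * _UNITS[unit.upper()])


limit_bytes = parse_size("5GB")
```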

Maximum-cache-disk-usage

The maximum-cache-disk-usage setting ensures that earthkit-geo does not fill your disk. It specifies the maximum disk usage (as a percentage) on the filesystem containing the cache directory. When the total disk usage (so this is not the cache usage alone) goes above this limit, earthkit-geo triggers its cache cleaning mechanism to free up space before downloading additional data. The value of maximum-cache-disk-usage is relative (such as “90%” or “100%”). To disable it use None.

Warning

If your disk is filled by another application, earthkit-geo will happily delete its cached data to make room for the other application as soon as it has a chance.
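The distinction between total disk usage and cache usage can be checked with the standard library. The sketch below uses shutil.disk_usage to compute the usage of the whole filesystem holding a directory and compares it with a percentage limit. The helper names disk_usage_percent and exceeds_limit are hypothetical, and the system temporary directory stands in for the cache directory; this is not how earthkit-geo necessarily performs the check.

```python
import shutil
import tempfile


def disk_usage_percent(path):
    """Return the usage of the filesystem containing path, as a percentage."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.used / usage.total


def exceeds_limit(path, limit):
    """Return True if the filesystem usage is above limit (e.g. "90%").

    A limit of None disables the check.
    """
    if limit is None:
        return False
    return disk_usage_percent(path) > float(limit.rstrip("%"))


# Stand-in for the cache directory (assumption for this example).
cache_dir = tempfile.gettempdir()
percent = disk_usage_percent(cache_dir)
needs_cleaning = exceeds_limit(cache_dir, "90%")
```

Because the figure covers the whole filesystem, files written by other applications count towards the limit just as much as the cached matrices do.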

Cache config parameters

cache-policy
    Default: 'user'
    Caching policy. Valid values: off, temporary and user. See Disk-based caching of regridding precomputed weights caching for more information.

maximum-cache-disk-usage
    Default: None
    Specify the maximum disk usage as a percentage of the full disk capacity of the filesystem where the cache is located (e.g. 90%). When the total disk usage exceeds this limit (it is not limited to the cache usage alone), earthkit-geo evicts older cached entries until the usage is below the specified limit. Can be set to None. Ignored when cache-policy is off. See Disk-based caching of regridding precomputed weights caching for more information.

maximum-cache-size
    Default: '5GB'
    Maximum disk space used by the earthkit-geo cache (e.g. 100G or 2T). When exceeded, earthkit-geo evicts older cached entries until the usage is below the specified limit. Can be set to None. Ignored when cache-policy is off. See Disk-based caching of regridding precomputed weights caching for more information.

temporary-cache-directory-root
    Default: None
    Parent of the cache directory when cache-policy is temporary. See Disk-based caching of regridding precomputed weights caching for more information.

user-cache-directory
    Default: '~/.cache/earthkit-geo'
    Cache directory used when cache-policy is user. See Disk-based caching of regridding precomputed weights caching for more information.

Other earthkit-geo config options can be found here.