Disk-based caching of regridding precomputed weights caching¶
New in version 1.0.0.
Note
At the moment this configuration only applies to the precomputed
backends in regrid().
Purpose¶
earthkit-geo uses a dedicated directory to store the interpolation matrices and the related index file downloaded from the remote inventory. By default this directory serves as a cache and is managed (its size is checked and limited). This means that if we run regrid() again with the same input and output grids, the matrix is loaded from the cache instead of being downloaded again. Caching also offers monitoring and disk space management: when the cache is full, cached data is deleted according to the configuration (i.e. the oldest data is deleted first). The cache is implemented with an SQLite database running in a separate thread.
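The oldest-first eviction described above can be illustrated with a small standalone sketch. It is not the earthkit-geo implementation: the table layout, the column names and the evict_oldest() helper are assumptions made up for this example, and only the Python standard library is used.

```python
import sqlite3


def evict_oldest(conn, max_bytes):
    """Delete the oldest entries until the total size fits within max_bytes.

    Hypothetical helper: the real cache database schema may differ.
    """
    removed = []
    total = conn.execute("SELECT COALESCE(SUM(size), 0) FROM cache").fetchone()[0]
    rows = conn.execute(
        "SELECT path, size FROM cache ORDER BY creation_date ASC"
    ).fetchall()
    for path, size in rows:
        if total <= max_bytes:
            break
        # Drop the index entry; a real cache would also delete the file on disk.
        conn.execute("DELETE FROM cache WHERE path = ?", (path,))
        total -= size
        removed.append(path)
    conn.commit()
    return removed


# In-memory database with three made-up entries (sizes in bytes).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE cache (path TEXT PRIMARY KEY, size INTEGER, creation_date TEXT)"
)
conn.executemany(
    "INSERT INTO cache VALUES (?, ?, ?)",
    [
        ("a.npz", 400, "2023-10-28 09:00:00"),
        ("b.npz", 300, "2023-10-29 09:00:00"),
        ("c.npz", 500, "2023-10-30 09:00:00"),
    ],
)

removed = evict_oldest(conn, max_bytes=900)
remaining = [row[0] for row in conn.execute("SELECT path FROM cache ORDER BY creation_date")]
print(removed, remaining)
```

With a 900-byte limit and 1200 bytes stored, only the oldest entry has to go before the cache fits again.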
Please note that the earthkit-geo cache configuration is managed through the Configuration.
Warning
The earthkit-geo cache is intended to be used by a single user. Sharing the cache between multiple users is not recommended. Downloading a local copy of the data to a shared disk so that multiple users can work with it is a different use case and should be supported by using mirrors.
Cache policies¶
The primary config option to control the cache is cache-policy, which can take the following values: off, temporary and user (the default).
The cache location can be read and modified with Python (see the details below).
Tip
See the Matrix disk cache notebook for examples.
Note
It is recommended to restart your Jupyter kernels after changing the cache policy or location.
User cache policy¶
When the cache-policy is “user” the cache will be active and created in a managed directory defined by the user-cache-directory config option. This is the default value.
Note
The default location of the user cache directory is "~/.cache/earthkit-geo" and its maximum size is 5 GB.
The user cache directory is not cleaned up on exit, so the next time you start earthkit-geo it will still be there unless it is deleted manually or it is configured in a way that a different path is assigned to it on each startup. Also, when you run multiple sessions of earthkit-geo under the same user, they will share the same cache.
We can query the directory path via the Configuration and also by calling the directory() cache method.
>>> from earthkit.geo import cache, config
>>> config.set("cache-policy", "user")
>>> config.get("user-cache-directory")
'/Users/username/.cache/earthkit-geo'
>>> cache.directory()
'/Users/username/.cache/earthkit-geo'
The following code shows how to change the user-cache-directory config option:
>>> from earthkit.geo import config
>>> config.get("user-cache-directory") # Find the current cache directory
'/Users/username/.cache/earthkit-geo'
>>> # Change the value of the setting
>>> config.set("user-cache-directory", "/big-disk/earthkit-geo-cache")
# Python kernel restarted
>>> from earthkit.geo import config
>>> config.get("user-cache-directory") # Cache directory has been modified
'/big-disk/earthkit-geo-cache'
More generally, the earthkit-geo config options can be read, modified and reset to their default values from Python; see the Configs documentation.
Temporary cache policy¶
When the cache-policy is “temporary” the cache will be active and located in a managed temporary directory created by tempfile.TemporaryDirectory. This directory will be unique for each earthkit-geo session. When the directory object goes out of scope (at the latest on exit) the cache is cleaned up.
Due to the temporary nature of this directory, its path cannot be queried via the Configuration; instead we need to call the directory() cache method.
>>> from earthkit.geo import cache, config
>>> config.set("cache-policy", "temporary")
>>> cache.directory()
'/var/folders/ng/g0zkhc2s42xbslpsywwp_26m0000gn/T/tmp_5bf5kq8'
We can specify the parent directory for the temporary cache by using the temporary-cache-directory-root config option. By default it is set to None (no parent directory specified).
>>> from earthkit.geo import cache, config
>>> s = {
... "cache-policy": "temporary",
... "temporary-cache-directory-root": "~/my_demo_cache",
... }
>>> config.set(s)
>>> cache.directory()
'~/my_demo_cache/tmp0iiuvsz5'
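The behaviour of a per-session temporary directory under a chosen parent can be sketched with tempfile.TemporaryDirectory alone. This is a minimal illustration and not the earthkit-geo code: make_session_dir() is a hypothetical helper, and expanding "~" with os.path.expanduser is an assumption about how a root such as "~/my_demo_cache" would need to be handled.

```python
import os
import tempfile


def make_session_dir(root=None):
    """Create a unique temporary directory, optionally under a parent root.

    Hypothetical helper for illustration only.
    """
    if root is not None:
        root = os.path.expanduser(root)  # tempfile does not expand "~" itself
        os.makedirs(root, exist_ok=True)
    return tempfile.TemporaryDirectory(dir=root)


# Use a throwaway parent so that the example leaves nothing behind.
with tempfile.TemporaryDirectory() as demo_root:
    first = make_session_dir(demo_root)
    second = make_session_dir(demo_root)
    first_path, second_path = first.name, second.name

    # Each "session" gets its own directory inside the chosen parent.
    distinct = first_path != second_path
    inside_root = os.path.samefile(os.path.dirname(first_path), demo_root)

    # Cleaning up the object removes the directory and its contents.
    first.cleanup()
    second.cleanup()
    cleaned = not os.path.exists(first_path)

print(distinct, inside_root, cleaned)
```

The same clean-up happens implicitly when the TemporaryDirectory object is garbage collected or the interpreter exits, which is why the path cannot be relied on across sessions.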
Off cache policy¶
When the cache-policy is “off” no disk-based caching is available. In this case all files are downloaded into an unmanaged temporary directory created by tempfile.TemporaryDirectory. Since caching is disabled, all repeated calls to regrid() will download the interpolation matrix again! This temporary directory will be unique for each earthkit-geo session. When the directory object goes out of scope (at the latest on exit) the directory will be cleaned up.
Due to the temporary nature of this directory, its path cannot be queried via the Configuration; instead we need to call the directory() cache method.
>>> from earthkit.geo import cache, config
>>> config.set("cache-policy", "off")
>>> cache.directory()
'/var/folders/ng/g0zkhc2s42xbslpsywwp_26m0000gn/T/tmp_5bf5kq8'
We can specify the parent directory for the temporary directory by using the temporary-directory-root config option. By default it is set to None (no parent directory specified).
>>> from earthkit.geo import cache, config
>>> s = {
... "cache-policy": "off",
... "temporary-directory-root": "~/my_demo_tmp",
... }
>>> config.set(s)
>>> cache.directory()
'~/my_demo_tmp/tmp0iiuvsz5'
Cache methods¶
The cache is controlled by a global object, which we can access as earthkit.geo.cache.
>>> from earthkit.geo import cache
>>> cache
<earthkit.geo.utils.caching.Cache object at 0x117be7040>
When cache-policy is user or temporary, a set of methods is available on this object to manage and interact with the cache.
| Methods | Description |
|---|---|
| policy | Get the current cache policy object. |
| directory() | Return the path to the current cache directory |
| size() | Return the total number of bytes stored in the cache |
| check_size() | Check the cache size and trim it down when needed. |
| entries() | Dump the entries stored in the cache |
| summary_dump_database() | Return the number of items and total size of the cache |
| | Delete entries from the cache |
Warning
check_size() runs automatically when a new entry is added to the cache or when any of the cache config parameters change.
Examples:
>>> from earthkit.geo import cache
>>> cache.policy.name
'user'
>>> cache.directory()
'/Users/username/.cache/earthkit-geo'
>>> cache.size()
846785699
>>> cache.summary_dump_database()
(40, 846785699)
>>> d = cache.entries()
>>> len(d)
40
>>> d[0].get("creation_date")
'2023-10-30 14:48:31.320322'
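Since each entry exposes a creation_date string, the entries can be ordered to see which ones would be the first candidates for deletion. The sketch below uses made-up sample dictionaries standing in for the result of cache.entries(); only the creation_date key is taken from the example above, and oldest_first() is a hypothetical helper.

```python
from datetime import datetime

# Made-up sample standing in for the list returned by cache.entries().
sample_entries = [
    {"creation_date": "2023-10-30 14:48:31.320322"},
    {"creation_date": "2023-10-28 09:12:05.000001"},
    {"creation_date": "2023-10-29 18:30:00.500000"},
]


def oldest_first(entries):
    """Return the entries sorted by creation date, oldest first."""
    return sorted(
        entries, key=lambda e: datetime.fromisoformat(e["creation_date"])
    )


ordered = oldest_first(sample_entries)
oldest = ordered[0]["creation_date"]
print(oldest)
```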
Cache limits¶
Warning
These config options do not work when cache-policy is off.
- Maximum-cache-size
  The maximum-cache-size setting ensures that earthkit-geo does not use too much disk space. Its value sets the maximum disk space used by the earthkit-geo cache. When the cache disk usage goes above this limit, earthkit-geo triggers its cache cleaning mechanism before downloading additional data. The value of maximum-cache-size is absolute (such as “10G”, “10M”, “1K”). To disable it use None.
- Maximum-cache-disk-usage
  The maximum-cache-disk-usage setting ensures that earthkit-geo does not fill your disk. It specifies the maximum disk usage (as a percentage) on the filesystem containing the cache directory. When the total disk usage (not the cache usage alone) goes above this limit, earthkit-geo triggers its cache cleaning mechanism to free up space before downloading additional data. The value of maximum-cache-disk-usage is relative (such as “90%” or “100%”). To disable it use None.
Warning
If your disk is filled by another application, earthkit-geo will happily delete its cached data to make room for the other application as soon as it has a chance.
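How the two limits differ can be sketched as follows. This is only an illustration and not the earthkit-geo implementation: parse_size(), parse_percent() and needs_cleanup() are hypothetical helpers, and treating "K", "M", "G" and "T" as powers of 1024 is an assumption.

```python
import shutil

# Assumption: units are binary multiples; the library may interpret them differently.
_UNITS = {"K": 1024, "M": 1024**2, "G": 1024**3, "T": 1024**4}


def parse_size(value):
    """Convert strings such as '10G', '5GB' or '1K' to bytes; None means no limit."""
    if value is None:
        return None
    text = value.strip().upper()
    if text.endswith("B"):
        text = text[:-1]
    if text and text[-1] in _UNITS:
        return int(float(text[:-1]) * _UNITS[text[-1]])
    return int(text)


def parse_percent(value):
    """Convert strings such as '90%' to a float; None means no limit."""
    if value is None:
        return None
    return float(value.strip().rstrip("%"))


def needs_cleanup(cache_bytes, disk_used, disk_total, max_size=None, max_disk_usage=None):
    """Return True when either the absolute or the relative limit is exceeded."""
    size_limit = parse_size(max_size)
    if size_limit is not None and cache_bytes > size_limit:
        return True
    percent_limit = parse_percent(max_disk_usage)
    if percent_limit is not None and 100.0 * disk_used / disk_total > percent_limit:
        return True
    return False


# The relative limit looks at the whole filesystem, not only at the cache.
usage = shutil.disk_usage(".")
over_size = needs_cleanup(6 * 1024**3, usage.used, usage.total, max_size="5GB")
print(over_size)
```

The absolute limit depends only on the bytes held by the cache, whereas the relative limit can trigger a clean-up even for a nearly empty cache if other data fills the disk.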
Cache config parameters¶
| Name | Default | Description |
|---|---|---|
| cache-policy | ‘user’ | Caching policy. Valid values: off, temporary and user. See Disk-based caching of regridding precomputed weights caching for more information. |
| maximum-cache-disk-usage | None | Specify the maximum disk usage as a percentage of the full disk capacity of the filesystem where the cache is located (e.g. 90%). When the total disk usage exceeds this limit (it’s not limited to the cache usage alone), earthkit-geo evicts older cached entries until the usage is below the specified limit. Can be set to None. Ignored when cache-policy is off. |
| maximum-cache-size | ‘5GB’ | Maximum disk space used by the earthkit-geo cache (e.g. 100G or 2T). When exceeded, earthkit-geo evicts older cached entries until the usage is below the specified limit. Can be set to None. Ignored when cache-policy is off. |
| temporary-cache-directory-root | None | Parent of the cache directory when cache-policy is temporary. |
| user-cache-directory | ‘~/.cache/earthkit-geo’ | Cache directory used when cache-policy is user. |
Other earthkit-geo config options can be found here.