returnn.util.file_cache
File cache.
Copies files from a remote filesystem (e.g. NFS) to a local filesystem (e.g. /var/tmp) to speed up access.
See https://github.com/rwth-i6/returnn/issues/1519 for initial discussion.
Main class is FileCache.
- class returnn.util.file_cache.FileCache(*, cache_directory: str = '$TMPDIR/$USER/returnn/file_cache', cleanup_files_always_older_than_days: float = 31.0, cleanup_files_wanted_older_than_days: float = 1.0, cleanup_disk_usage_wanted_free_ratio: float = 0.2, cleanup_disk_usage_wanted_multiplier: float = 2.0, num_tries: int = 3)
File cache.
Copies files from a remote filesystem (e.g. NFS) to a local filesystem (e.g. /var/tmp) to speed up access.
Some assumptions we depend on:
- When a cached file is available, its size matches the original file, and its mtime is not older than that of the original file, we can use it.
- We update the mtime of cached files that are in use frequently (every second) via a background thread, to mark them as used. (We would maybe want to use atime, but we don’t expect that atime can be relied on.) Note that updating mtime might influence the behavior of some external tools.
- os.utime() will update mtime, mtime is somewhat accurate (up to 10 secs maybe), and mtime is comparable to time.time().
- shutil.disk_usage() can be relied on.
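The first and third assumptions can be illustrated with a minimal standard-library sketch. The helper names `cached_copy_is_usable` and `touch` are hypothetical and not part of RETURNN; this only shows the size/mtime comparison and the mtime refresh described above, not how FileCache implements them.

```python
import os


def cached_copy_is_usable(src_filename: str, cached_filename: str) -> bool:
    """Return True if the cached copy may stand in for the source file.

    Hypothetical helper: mirrors the stated rule (same size, and the cached
    mtime is not older than the source mtime).
    """
    try:
        src_stat = os.stat(src_filename)
        cached_stat = os.stat(cached_filename)
    except FileNotFoundError:
        # Either the source or the cached copy does not exist.
        return False
    return (
        cached_stat.st_size == src_stat.st_size
        and cached_stat.st_mtime >= src_stat.st_mtime
    )


def touch(cached_filename: str) -> None:
    """Set atime and mtime to the current time, marking the file as in use."""
    os.utime(cached_filename, None)
```

Because a file in use is touched continually, its mtime stays close to time.time(), which is what lets a cleanup pass treat "old mtime" as "not used recently".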
See https://github.com/rwth-i6/returnn/issues/1519 for initial discussion.
- Parameters:
cache_directory – directory where to cache files. Uses expand_env_vars() to expand environment variables.
cleanup_files_always_older_than_days – always clean up files older than this.
cleanup_files_wanted_older_than_days – if cleanup_disk_usage_wanted_free_ratio is not reached, clean up files older than this.
cleanup_disk_usage_wanted_free_ratio – try to free at least this ratio of disk space.
cleanup_disk_usage_wanted_multiplier – when making space for a new file, try to free at least this many times as much space.
num_tries – how many times to try caching a file before giving up.
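How the two age thresholds and the free-ratio target might interact can be sketched as a pure function. This is an illustration under stated assumptions, not RETURNN's cleanup code: `CacheEntry` and `select_files_to_remove` are hypothetical names, the oldest-first order is an assumption, and cleanup_disk_usage_wanted_multiplier is not modelled.

```python
from typing import List, NamedTuple


class CacheEntry(NamedTuple):
    """Hypothetical record for one cached file."""

    path: str
    age_days: float  # days since the file's mtime was last updated
    size: int  # size in bytes


def select_files_to_remove(
    entries: List[CacheEntry],
    *,
    disk_total: int,
    disk_free: int,
    always_older_than_days: float = 31.0,
    wanted_older_than_days: float = 1.0,
    wanted_free_ratio: float = 0.2,
) -> List[str]:
    """Pick files to delete, oldest first (the order is an assumption)."""
    to_remove: List[str] = []
    free = disk_free
    for entry in sorted(entries, key=lambda e: e.age_days, reverse=True):
        if entry.age_days > always_older_than_days:
            # Always removed, regardless of disk usage.
            to_remove.append(entry.path)
            free += entry.size
        elif (
            entry.age_days > wanted_older_than_days
            and free / disk_total < wanted_free_ratio
        ):
            # Removed only while the wanted free ratio is not yet reached.
            to_remove.append(entry.path)
            free += entry.size
    return to_remove
```

With the default values, a file untouched for 40 days is always selected, a file untouched for 5 days is selected only while less than 20% of the disk is free, and a file touched within the last day is never selected.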
- get_file(src_filename: str) → str
Get cached file. This will copy the file to the cache directory if it is not already there. This will also make sure, via the _touch_files_thread, that the file is not removed from the cache directory until you call release_files().
- Parameters:
src_filename – source file to copy (if it is not already in the cache).
- Returns:
cached file path (in the cache directory)
- release_files(filenames: str | Iterable[str])
Release cached files. This just says that we are not using the files anymore for now. They will be kept in the cache directory for now, and might be removed when the cache directory is cleaned up.
- Parameters:
filenames – files to release (paths in the cache directory)
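A typical way to pair get_file() with release_files() is a try/finally block, so the file is released even if reading fails. The helper `read_bytes_via_cache` below is hypothetical; it takes the cache object as an argument and relies only on the two methods documented above.

```python
from typing import Any


def read_bytes_via_cache(file_cache: Any, src_filename: str) -> bytes:
    """Read a file through a cache object that offers get_file/release_files.

    Hypothetical helper: ``file_cache`` is expected to be a FileCache
    instance, or any object with the same two methods.
    """
    # Copies the file into the cache if needed and marks it as in use.
    cached_filename = file_cache.get_file(src_filename)
    try:
        with open(cached_filename, "rb") as f:
            return f.read()
    finally:
        # Always release, so the file becomes eligible for cleanup again.
        file_cache.release_files(cached_filename)
```

Passing the cache in as an argument keeps the helper independent of how the FileCache was configured.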
- handle_cached_files_in_config(config: Any) → Tuple[Any, List[str]]
- Parameters:
config – some config, e.g. dict, or any nested structure
- Returns:
modified config, where all CachedFile instances are replaced by the cached file path, and the list of cached files which are used.
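The replacement over a nested structure can be pictured as a recursive walk. The sketch below is self-contained and is not RETURNN's implementation: `FileMarker` is a hypothetical stand-in for CachedFile, and `resolve_markers` takes any callable that maps a source path to a cached path. It assumes only dicts, lists and plain tuples need to be traversed.

```python
from typing import Any, Callable, List, Tuple


class FileMarker:
    """Hypothetical stand-in for CachedFile: wraps a source filename."""

    def __init__(self, filename: str):
        self.filename = filename


def resolve_markers(
    config: Any, get_file: Callable[[str], str]
) -> Tuple[Any, List[str]]:
    """Return a copy of config with markers replaced, plus the cached paths."""
    used: List[str] = []

    def _resolve(obj: Any) -> Any:
        if isinstance(obj, FileMarker):
            cached = get_file(obj.filename)
            used.append(cached)
            return cached
        if isinstance(obj, dict):
            return {key: _resolve(value) for key, value in obj.items()}
        if isinstance(obj, (list, tuple)):
            # Rebuild the same container type (plain list or tuple only).
            return type(obj)(_resolve(value) for value in obj)
        # Anything else (str, int, ...) is kept unchanged.
        return obj

    return _resolve(config), used
```

The returned list of cached paths is what a caller would later hand to release_files() once the files are no longer needed.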
- class returnn.util.file_cache.CachedFile(filename: str)
Represents some file to be cached in a user config. See FileCache.handle_cached_files_in_config().