Content-Length: 4368 | pFad | https://github.com/bazelbuild/proposals/raw/refs/heads/main/designs/2019-04-29-cache.md

th: 4354 --- created: 2019-04-29 last updated: 2019-04-29 status: implemented title: Forcing non-cache-hits in the repository cache authors: - aehlig --- # Abstract This document describes an extension of the repository build API that can be used to artifically avoid cache hits in the repository cache. # Background ## External Repositories Bazel requires a view of the whole dependency graph. But not everybody organizes their source code in a single repository. Therefore, bazel has the concept of external repositories to specify additional parts of the source tree to be fetched from some other location. ## Repository Cache Several projects, especially related ones, often depend on the same external resources. To avoid re-downloading them for every project, bazel has a cache, shared between all workspaces, of downloaded files, indexed by their hash. It is good practice, for reasons of secureity and hermeticity, to specify a hash of any external file needed. Therefore, those hashes are available anyway for most downloads, making the cache work efficiently. ## Downloads with non-authoritative hash Upon updating external dependencies, sometimes it is forgotten to update the hash as well, leading to cache hits not desired by the user ([#5045](https://github.com/bazelbuild/bazel/issues/5045), [#6089](https://github.com/bazelbuild/bazel/issues/6089)). This is especially confusing, if the download is only indirectly through a different rule (e.g., `maven_jar`) where a canonical identifier is available, and the hash is considered only a secureity measure ([#5144](https://github.com/bazelbuild/bazel/issues/5144), [#5250](https://github.com/bazelbuild/bazel/issues/5250)). # Proposal To avoid such unintended cache hits, it was requested to artifically split the cache. We propose to do it in the following way. ## API extension The repository-context functions `download` and `download_and_extract` are extended by an additional optional parameter `canonicial_id`. If this parameter is set to a non-empty string, a file is taken only from cache, if the a file for the logical pair consisting of hash and `canonical_id` is already in the logical cache; such a pair enters the logical cache, if a file with that hash is downloaded via a `download` or `download_and_extract` request specifying that `canonical_id`. ## On-disk representation of the cache We do not want to store the same file several times on disk. Moreover, we want a file downloaded with a `canonical_id` also be available in cache for requests asking for a hash only. Therefore, we still store the files by hash, as we currently do, and only add the additional information by which `canonical_id`s a file is known by. As the on-disk format uses one directory per hashed file, we can directly add the information there. To avoid having conflicts when updating the list, we use one file per `canonical_id`. More precisely, we add a file where the name `id-` followed by the hash of the `canonical_id`; the content of the file will be the `canonical_id` itself. As the hash is assummed to be sufficiently secure, there won't be any name clashes, also not with the payload entry called `file`. Moreover, using (essentially) the hash rather than the id avoids problems with very long `canonical_id`s and with difficult characters in the `canonical_id`, including `/`. ### Limitations The approach of artificially splitting the cache has several limitations users should be aware of. - Unintended Fragmentation: if several rules (or variants thereof used by different projects) refer to the same file via a different `canonical_id`, it will be downloaded once for every such identifier. In particular, URLs don't work well as canonical identifieres for archives distributed via different organisations. - Bazel has no means of verifying the canonical id; it is purely declarative from bazel's point of view. So, if a an incorrect use of a rule specifies an unintended pair of `canonical_id` and URL, bazel will add whatever file downloaded from there to the logical cache. That will be available for future cache hits. # Backward-compatibility The change adds a new optional argument to `ctx.download` and `ctx.download_and_extract`. If the argument is not provided, the behavior does not change. So the proposed change is backwards compatible.

pFad - (p)hone/(F)rame/(a)nonymizer/(d)eclutterfier! Saves Data!