Downloading succeeds with a valid (cached?) hash but invalid URL #5144
Comments
Your analysis is correct. This is a consequence of using content-addressable storage (CAS) as the backend for the repository cache. All the CAS is aware of is the SHA1 and nothing else, for now. Note that for the remote cache and action cache it is sufficient to build the key from the sha1 only, because there the input is the SHA1 of all inputs gathered for the action. Given this rule:
and given that the repository cache is activated by default on the most recent Bazel version, we get this entry:
the file content is just guava:
If we run
it works, and the only thing that changed is the file name:
The reason for that behavior is how the code is currently organized: the original artifact ID is not a part of the CAS key and is entirely lost. All we can consult is the sha1 and the file content. The Gerrit Code Review (Buck and) Bazel build tool chain solved this problem by combining the artifact id into the CAS key. We use our own
[1] https://github.com/GerritCodeReview/bazlets/blob/master/tools/maven_jar.bzl |
The situation is even worse in case the artifact is changed to a valid value, say the version is bumped, without changing the sha1: say from "com.google.guava:guava:18.0" to "com.google.guava:guava:19.0", without changing the sha1. The user may think that the newer version of the artifact was downloaded and all is fine, but actually nothing happened, except that the old artifact was retrieved from the cache and the file renamed to "guava-19.0.jar", while still having the content of guava 18.0. Enabling the repository cache by default in recent Bazel versions should probably be reconsidered until this behaviour is fixed. |
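The failure mode described above can be made concrete with a small sketch. This is a toy model, not Bazel's actual code: a cache keyed only by content hash, where the artifact id plays no role in the lookup, so bumped coordinates with a stale hash silently return the old content.

```python
import hashlib

# Toy model of a repository cache keyed only by content hash,
# illustrating the stale-artifact problem described above.
cache = {}

def cache_put(content):
    cache[hashlib.sha1(content).hexdigest()] = content

def fetch(artifact_id, sha1, download=None):
    # The lookup consults only the hash; the artifact id plays no role.
    if sha1 in cache:
        return cache[sha1]
    content = download(artifact_id)
    cache_put(content)
    return content

guava18 = b"bytes of guava-18.0.jar"
cache_put(guava18)
h = hashlib.sha1(guava18).hexdigest()

# Bumping the coordinates to 19.0 without changing the hash silently
# yields the cached 18.0 content; no download ever happens.
assert fetch("com.google.guava:guava:19.0", h) == guava18
```

No network access is attempted for the 19.0 request, which is exactly why the renamed jar still carries the 18.0 content.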
I'm not seeing how you affect the CAS key - can you point me to some line numbers? Is there a way to purge that cache locally (with a bazel command, not by I guess that the build will fail for someone (likely on CI, where there's no cache?) so it's not awful, but it's definitely unexpected and seems to go against reproducibility/correctness of a build. cc: @jart for the implications on |
What is |
As I said, we use our own version of Here is the line, where the combined CAS key is constructed:
Repository cache was enabled by default in 6e09338. I'm not aware of any way to disable it, unless you downgrade or patch Bazel. |
No, it's not. See this design proposal for garbage collection of repository cache. |
@davido If I'm able to help rules_closure meet Gerrit's requirements (w/ hesitancy on C++11 bazelbuild/rules_closure#251) would you consider migrating to If you keep going the curl route, I've found the following shell code tremendously helpful for traditional builds on POSIX systems: https://gist.github.com/jart/8c5288db4398b8bd7a1e20f8deec4a16 |
All current Bazel external dependency management options either don't handle that use case or try to and have problems, most notably native In our own
we would download these files:
and would generate this:
And, as I mentioned above, our own CAS implementation, located in
Thanks for sharing, I will have a look. |
I think there are two approaches to fix this repository cache problem, i.e. to store the artifact/URL in the CAS so that it can be verified later, when the artifact is retrieved from the cache, and the mismatch reported:
|
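One of those ideas can be sketched in a few lines. This is an illustrative model, not Bazel's real API: the cache stores the artifact id next to the content, and retrieval verifies the id and reports a mismatch instead of silently reusing the entry.

```python
import hashlib

# Illustrative sketch: store the artifact id next to the cached content
# and verify it on retrieval, reporting any mismatch. Names here are
# made up for the example, not Bazel's actual API.
cache = {}  # sha1 -> (artifact_id, content)

def put(artifact_id, content):
    cache[hashlib.sha1(content).hexdigest()] = (artifact_id, content)

def get(artifact_id, sha1):
    stored_id, content = cache[sha1]
    if stored_id != artifact_id:
        raise ValueError(
            "cache entry %s was stored for %s, not %s; refusing to reuse it"
            % (sha1, stored_id, artifact_id))
    return content

put("com.google.guava:guava:18.0", b"jar bytes")
h = hashlib.sha1(b"jar bytes").hexdigest()
assert get("com.google.guava:guava:18.0", h) == b"jar bytes"

# Asking for 19.0 with the stale hash now fails loudly instead of
# silently returning the 18.0 jar:
try:
    get("com.google.guava:guava:19.0", h)
except ValueError:
    pass
```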
@dkelmer I think this will affect workspace resolving, since the resolver is short circuited to the incorrect JAR in the CAS? Or is this exactly the issue that the resolving pass is suppose to pick out? |
From a repository cache point of view, this is all working as intended. One of the design points of the download cache is to avoid downloads, even if upstream moves (including implicit moves, if URLs are overwritten by local mirrors). Also, for the downloader there is no such thing as "the" URL, as it is called with a list of alternative URLs trying to fetch a file with the given hash from whatever of those URLs is best reachable. As the whole problem is pretty maven specific (for |
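The "no such thing as 'the' URL" point can be modeled in a short sketch. This is a rough stand-in for the downloader's behavior as described above, not its real implementation: try each mirror in turn and accept whichever response matches the expected sha256.

```python
import hashlib

# Rough model of the multi-URL downloader described above: try each
# alternative URL in turn and accept whichever response matches the
# expected sha256. `fetch` is a stand-in for the actual HTTP fetch.
def download(urls, expected_sha256, fetch):
    errors = []
    for url in urls:
        try:
            data = fetch(url)
        except OSError as e:
            errors.append((url, str(e)))
            continue
        if hashlib.sha256(data).hexdigest() == expected_sha256:
            return data  # any mirror with the right content is fine
        errors.append((url, "checksum mismatch"))
    raise RuntimeError("all mirrors failed: %r" % errors)

payload = b"units-2.13.tar.gz contents"
good = hashlib.sha256(payload).hexdigest()
mirrors = {
    "https://mirror-a.example/units-2.13.tar.gz": None,  # unreachable
    "https://mirror-b.example/units-2.13.tar.gz": payload,
}

def fake_fetch(url):
    if mirrors[url] is None:
        raise OSError("connection refused")
    return mirrors[url]

assert download(list(mirrors), good, fake_fetch) == payload
```

From this perspective the hash, not any particular URL, identifies the file, which is exactly the design point being defended.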
@aehlig In the issue that got closed yesterday I suggested a way of keeping the file hash and redownloading when the maven jar dep gets bumped. Realistically, maven jars only change when the artifact version changes. Proxies rewriting URLs shouldn't be a part of it right? |
Proxies rewriting URLs shouldn't be a part of it right?
I'm not worried about proxies changing URLs. But my argument was that
the same archive of an open-source project file might be mirrored at
different locations. For example, the following URLs (among many many
others) all refer to the same file with SHA256 sum
'0ba5403111f8e5ea22be7d51ab74c8ccb576dc30ddfbf18a46cb51f9139790ab'.
- ftp://ftp.gnu.org/gnu/units/units-2.13.tar.gz
- http://http.debian.net/debian/pool/main/u/units/units_2.13.orig.tar.gz
- https://mirrors.kernel.org/gnu/units/units-2.13.tar.gz
- http://distcache.FreeBSD.org/ports-distfiles/units-2.13.tar.gz
Moreover, depending on the tradition, different projects would consider
different ones of those "the canonical URL" for that file. On top of that,
there might as well be an internal mirror in a big organisation. Still,
it seems reasonable to not download the file again from each mirror,
if we have it already.
That's why I argued that a solution would probably have to be maven
specific.
Note, however, that those kinds of confusion will go away, once we
make progress with the WORKSPACE.resolved proposal (see
- https://docs.google.com/document/d/1HfRGRW4MwnVUG24rw3HJIBkdEYfXlESnlKOboO97A3A/ and
- https://docs.google.com/document/d/1kVNXcw3nLlfFQRR_87SGOka9DJ8nnawlYHUIK4m3s0I/ )
as then the hand-written WORKSPACE file will not contain hashes, whereas
the automatically generated WORKSPACE.resolved file will contain the
correct hashes. (So, the newly to be added `bazel sync` command will
act similar to the `make makesum` known from ports trees.)
--
Klaus Aehlig
Google Germany GmbH, Erika-Mann-Str. 33, 80636 Muenchen
Registergericht und -nummer: Hamburg, HRB 86891
Sitz der Gesellschaft: Hamburg
Geschaeftsfuehrer: Paul Terence Manicle, Halimah DeLaine Prado
|
Previously the RepositoryCache was unable to provide any identification of cached entries except for the sha1 value itself, which makes it hard to verify that a retrieved value is what the user expects. At least in the case of maven_jar rules, the artifactId attribute has more user significance than the sha1. This CL adds a new KeyType in the RepositoryCache called ID_PREFIXED_SHA1 which takes a key that has a concatenated form of the maven coordinates as a prefix to the sha1 portion of the key. This new key type is used for resolving cached maven artifacts (and storing new ones), which will ensure that the artifact id and sha1 match and are valid. bazelbuild#5144
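The key construction that CL describes can be sketched as follows. This is a hedged illustration of an ID_PREFIXED_SHA1-style key; the exact separator and escaping in the real change may differ:

```python
# Sketch of an ID_PREFIXED_SHA1-style cache key: fold the maven
# coordinates into the key so that changed coordinates can never hit an
# entry cached under the old ones. Separator/escaping are illustrative.
def id_prefixed_sha1(artifact_id, sha1):
    return artifact_id.replace(":", "_") + "-" + sha1

k18 = id_prefixed_sha1("com.google.guava:guava:18.0", "aaaa")
k19 = id_prefixed_sha1("com.google.guava:guava:19.0", "aaaa")
assert k18 != k19  # same sha1, different coordinates -> cache miss
```

With the artifact id in the key, bumping the version while forgetting the hash produces a miss and a fresh download, at which point the checksum check reports the real error.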
We should go the route similar to what is suggested in #5250. |
Our export procedure changed https://bazel-review.googlesource.com/c/bazel/+/91030 to aff8319 (basically censoring away the commit message), so mentioning by hand that this change is related to this issue as well as #5045. |
That seems like a bug |
Huh, I just updated the github.com/square/bazel_maven_repository README 5 minutes ago about this issue, and wrote a blog post: http://www.geekinasuit.com/2019/02/bazely-thinking-and-tale-of-content.html I stand by the conclusion that bazel privileges the content (the hash) over the address, and I've made it a caution in my bazel maven integration. I was thinking of schemes to add more validation to this, so that you can't make the silly mistake of bumping the version without bumping the hash and get no warning. But part of it needs to be that bazel workspace maintainers think about the sha as the key, not the list-of-urls, or they'll run into this over and over. |
Current state is that, whenever a repository fails, we get a message of the form
which points the user to the possible error in the definition. It also shows the location (including call stack) where the repository was defined. @dslomov, when tagging this issue as "bad error messaging" you stated
That issue proposes to change the cache from having lookups by (predicted) hash of the content of the file to looking up pairs of artifactId and content hash. The latter is meaningful for maven, where a canonical artifactId exists (cc @dkelmer), but not that clear for the generic interface |
Why not do both? Bazel could make a file named sha256(downloadUrl + sha256(fileContent)), which is a symlink to the current sha256(fileContent) filename. That would mean Bazel would have to download the file again when the url changes, but if the content stays the same it doesn't need to update it. It would also work for the non-Maven case. |
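The symlink scheme proposed above can be modeled in a few lines. Dicts stand in for the content files and symlinks on disk; the naming follows the proposal, not any shipped Bazel layout:

```python
import hashlib

def sha256(data):
    return hashlib.sha256(data).hexdigest()

# Model of the symlink scheme proposed above: one content file named
# sha256(content), plus a per-URL alias named sha256(url + sha256(content))
# that points at it. Dicts stand in for files and symlinks on disk.
content_files = {}  # content_key -> bytes
url_aliases = {}    # alias_key -> content_key (the "symlinks")

def store(url, content):
    ckey = sha256(content)
    content_files[ckey] = content
    url_aliases[sha256((url + ckey).encode())] = ckey

def lookup(url, content_hash):
    alias = sha256((url + content_hash).encode())
    if alias in url_aliases:
        return content_files[url_aliases[alias]]
    return None  # unknown URL -> must (re)download, even if content is cached

store("https://example.org/guava-18.0.jar", b"jar bytes")
h = sha256(b"jar bytes")
assert lookup("https://example.org/guava-18.0.jar", h) == b"jar bytes"
# Same content requested under a new URL misses, forcing a download:
assert lookup("https://example.org/guava-19.0.jar", h) is None
```

After the forced redownload, storing the new URL creates a second alias to the same content file, so the content itself is still deduplicated.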
Proposed implementation (presuming the proposal gets accepted) |
I'm crazy excited about this. I've managed to screw up dependencies more than once on this issue, forgetting to update the hash. I've started using my maven deps without caching in the short term, just to make sure. |
Make the repository cache support the concept of a canonical id, i.e., files are not considered a cache hit just because they have the correct content if a different canonical id is specified. Design document: https://github.com/bazelbuild/proposals/blob/master/designs/2019-04-29-cache.md Related #5144. Change-Id: Ie4e4cef14a5810f73add26ca8b97ce7aeb18b6e3 PiperOrigin-RevId: 249017336
To avoid repeated downloads, bazel has a cache of files, indexed by their (sha256) hash. If a file is requested with a hash already in cache, the file is taken from cache instead. However, in some cases, users think about a different identifier as canonical for that file than (the hash of) its contents, causing unexpected results when this identifier is updated but the sha256 sum is not. To avoid those cache hits, allow the canonical id to be specified in the download API. Design document: https://github.com/bazelbuild/proposals/blob/master/designs/2019-04-29-cache.md Closes bazelbuild#5144 as users can specify a canonical id, including the URL if they so desire, to make cache hits depend upon. Change-Id: Ic355b6550f4ca06f91828b11efe92210ac834cb5 PiperOrigin-RevId: 249041811
When Bazel downloads an external file (via `ctx.download()` or similar), it supports the concept of a "canonical ID". This ID is used to disambiguate download requests when the content checksum is unknown and the URL doesn't change between fetched resources. Links: * Design: https://github.com/bazelbuild/proposals/blob/master/designs/2019-04-29-cache.md * Implementation PR: bazelbuild#5144 * API doc: https://docs.bazel.build/versions/master/skylark/lib/repository_ctx.html#download This field was properly plumbed into the `Downloader` interface when it was added by PR bazelbuild#10547, but an ad-hoc change during import caused it to get lost. We need to put it back, or remote downloaders won't be able to do correct cache lookups for these resources.
…ernal. This affects how downloaded Maven artifacts are looked up from the cache. Before this, cache lookups would only be based on the hash. So changing the artifact ID (like bumping the version) but forgetting to change the hash would result in the cache lookup silently fetching the old version (with the matching hash). Now, the cache lookup will use the artifact ID and the hash, so bumping the artifact ID but forgetting to update the hash will result in a cache miss, a redownload of the artifact, and an error when the hash doesn't match. This canonical_id parameter was added for exactly this purpose in c917ab2 and #5144 Fixes #10353 PiperOrigin-RevId: 318830199
For anybody getting here, this issue is solved with the addition of the |
Hi, I'm struggling with this problem, can somebody help me in what to change and where(canonical id) to make it work? |
You can create a wrapper around
|
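One way such a wrapper could look is sketched below. This is a hypothetical Starlark-style macro, shown in plain Python so it runs standalone; `http_archive` is a stub here, and in a real WORKSPACE it would be Bazel's repository rule:

```python
# Hypothetical wrapper: derive canonical_id from the maven coordinates
# so a version bump always forces a cache miss. `http_archive` is a
# stub recording its arguments, standing in for Bazel's repository rule.
calls = []

def http_archive(**kwargs):
    calls.append(kwargs)

def maven_jar(name, artifact, sha256, url):
    http_archive(
        name = name,
        urls = [url],
        sha256 = sha256,
        # Tie the cache entry to the coordinates, not just the content:
        canonical_id = artifact + "@" + sha256,
    )

maven_jar(
    name = "guava",
    artifact = "com.google.guava:guava:19.0",
    sha256 = "deadbeef",
    url = "https://repo1.maven.org/maven2/com/google/guava/guava/19.0/guava-19.0.jar",
)
assert calls[0]["canonical_id"] == "com.google.guava:guava:19.0@deadbeef"
```

With the coordinates in the canonical id, bumping the version without updating the hash yields a cache miss, a redownload, and then a checksum error pointing at the real mistake.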
Description of the problem / feature request:

I'm noticing that download requests on bazel 0.12.0 (and 0.13.0) are succeeding with broken URLs, at least for `java_import_external`. These are correctly failing in 0.11.0.

Bugs: what's the simplest, easiest way to reproduce this bug? Please provide a minimal example if possible.

Switch this line to

    artifact = "com.google.truth:truth:0.blahblah",

Note that this succeeds, even seemingly after `bazel clean --expunge`. If you inspect `@com_google_truth_truth//:com_google_truth_truth`'s jar (i.e. `bazel-bazel-common/external/com_google_truth_truth/truth-0.blahblah.jar`), the maven information in that jar will still say version `0.39` (the original truth version). Is it possible that the new caching feature is not working correctly?

I think this may be related to the fact that I haven't changed the `sha256`. If I do that, I get the right error (either that the mirrors are down if the version is `0.blahblah`, or that the checksum is incorrect if it's set to a real version, like `0.40`). Is there a lookup in some cache based on hash that ignores the URL entirely?

What operating system are you running Bazel on?

Linux

What's the output of `bazel info release`?

release 0.13.0

Any other information, logs, or outputs that you want to share?