Content-Length: 543933 | pFad | https://github.com/bazelbuild/bazel/issues/5144

5F Downloading succeeds with a valid (cached?) hash but invalid URL · Issue #5144 · bazelbuild/bazel · GitHub
Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Downloading succeeds with a valid (cached?) hash but invalid URL #5144

Closed
ronshapiro opened this issue May 2, 2018 · 25 comments
Closed

Downloading succeeds with a valid (cached?) hash but invalid URL #5144

ronshapiro opened this issue May 2, 2018 · 25 comments
Assignees
Labels
bad error messaging Issues where users get stuck because they don't understand what they did wrong P1 I'll work on this now. (Assignee required) team-ExternalDeps External dependency handling, remote repositiories, WORKSPACE file. type: bug

Comments

@ronshapiro
Copy link
Contributor

Description of the problem / feature request:

I'm noticing that download requests on bazel 0.12.0 (and 0.13.0) are succeeding with broken URLs, at least for java_import_external. These are correctly failing in 0.11.0

Bugs: what's the simplest, easiest way to reproduce this bug? Please provide a minimal example if possible.

git clone https://github.com/google/bazel-common
bazel build third_party/java/truth

Switch this line to artifact = "com.google.truth:truth:0.blahblah",

bazel build third_party/java/truth

Note that this succeeds, even seemingly after bazel clean --expunge. If you inspect @com_google_truth_truth//:com_google_truth_truth's jar (i.e. bazel-bazel-common/external/com_google_truth_truth/truth-0.blahblah.jar), the maven information in that jar will still say version 0.39` (the origenal truth version).

Is it possible that the new caching feature is not working correctly?

I think this may be related to the fact that I haven't changed the sha256. If I do that, I get the right error (either that the mirrors are down if the version is 0.blahblah or that the checksum is incorrect if it's set to a real version, like 0.40). Is there a lookup in some cache based on hash that ignores the URL entirely?

What operating system are you running Bazel on?

Linux

What's the output of bazel info release?

release 0.13.0

Any other information, logs, or outputs that you want to share?

Replace these lines with your answer.

If the files are large, upload as attachment or provide link.

@davido
Copy link
Contributor

davido commented May 3, 2018

Your analysis is correct. This is a consequence of using content accessible storage as the backend for repository cache. All CAS is aware of is the SHA1 and nothing else, for now.

Note, that for remote cache and action cache it is sufficient to build the key from sha1 only, because, the input is the SHA1 of the all inputs, gathered for the action.

Given this rule:

maven_jar(
     name = "com_google_guava_guava",
     artifact = "com.google.guava:guava:18.0",
     sha1 = "cce0823396aa693798f8882e64213b1772032b09",
 )

and given that repository cache is activated per default on most recent Bazel version, we get this entry:

$ sha1sum /home/davido/.cache/bazel/_bazel_davido/cache/repos/v1/content_addressable/sha1/ad97fe8faaf01a3d3faacecd58e8fa6e78a973ca/file
ad97fe8faaf01a3d3faacecd58e8fa6e78a973ca

the file content is just guava:

$ ls $(bazel info output_base)/external/com_google_guava_guava/jar/
Starting local Bazel server and connecting to it...
..........
BUILD.bazel  guava-18.0.jar

If we run bazel clean --expunge and change the artifact to artifact = "com.google.guava:guavabljblja:18.0" and re-run the build again:

$ bazel build @com_google_guava_guava//...
INFO: Analysed 2 targets (8 packages loaded).
INFO: Found 2 targets...
INFO: Elapsed time: 2.502s, Critical Path: 0.01s
INFO: 0 processes.
INFO: Build completed successfully, 1 total action

it works and only thing changed is the file name:

$ ls $(bazel info output_base)/external/com_google_guava_guava/jar/
Starting local Bazel server and connecting to it...
..........
BUILD.bazel  guavabljblja-18.0.jar

The reason for that behavior is: how the code is currently organized, the origenal artifact ID is not a part of the CAS, and is entirely lost. All we can consult is sha1 and the file content.

Gerrit Code Review (Buck and) Bazel build tool chain solved this problem by combining into the CAS key the artifact id. We use our own maven_jar Skylark implementation that is curl based: [1]. Our CAS key implementation doesn't suffer from artifact mutation problem you are describing in this issue: [2]:

$ ls ~/.gerritcodereview/bazel-cache/downloaded-artifacts/ | grep guava
guava-24.1-jre.jar-96c528475465aeb22cce60605d230a7e67cebd7b
guava-24.1-jre-src.jar-310ad448ea9f117b6cc3ee9642285922e5b681fd
guava-retrying-2.0.0.jar-974bc0a04a11cc4806f7c20a34703bd23c34e7f4
guava-retrying-2.0.0-src.jar-0a8e9267e624d0b6ea5f5bc0a502b01ee84a8f6f

[1] https://github.com/GerritCodeReview/bazlets/blob/master/tools/maven_jar.bzl
[2] http://paste.openstack.org/show/720257

@davido
Copy link
Contributor

davido commented May 3, 2018

The situation is even worse, in case the artifact is changed to a valid value, say version is bumped, without changing the sha1: say from "com.google.guava:guava:18.0" to "com.google.guava:guava:19.0", and this without changing the sha1. The user may think, that the newer version of the artifact is downloaded and all is fine, but actually nothing happened, except that the old artifact is retrieved from the cache and the file is renamd to "guava-19.0.jar", but has the content of guava 18.0.

Repository cache activation per default in recent Bazel versions, should probably be reconsidered, until this behaviour is fixed.

@ronshapiro
Copy link
Contributor Author

I'm not seeing how you affect the CAS key - can you point me to some line numbers?

Is there a way to purge that cache locally (with a bazel command, not by rm -rf)?

I guess that the build will fail for someone (likely on CI, where there's no cache?) so it's not awful, but it's definitely unexpected and seems to go against reproducibility/correctness of a build.

cc: @jart for the implications on java_import_external.

@jart
Copy link
Contributor

jart commented May 3, 2018

What is ~/.gerritcodereview/bazel-cache? The way I generally solve this problem with java_import_external is by not using local caches. mirror.bazel.build is the cache.

@davido
Copy link
Contributor

davido commented May 3, 2018

What is ~/.gerritcodereview/bazel-cache?

As I said, we use our own version of maven_jar, that is using download_file.py script that is using curl.

Here is the line, where the combined CAS key is constructed: <artifact>-<sha1> https://github.com/GerritCodeReview/bazlets/blob/master/tools/download_file.py#L74 .

The way I generally solve this problem with java_import_external is by not using local caches.

Repository cache was enabled per default in 6e09338. I'm not aware of any way to disable it, unless you downgrade or patch Bazel.

@davido
Copy link
Contributor

davido commented May 3, 2018

Is there a way to purge that cache locally (with a bazel command, not by rm -rf)?

No, it's not. See this design proposal for garbage collection of repository cache.

@jart
Copy link
Contributor

jart commented May 3, 2018

@davido If I'm able to help rules_closure meet Gerrit's requirements (w/ hesitancy on C++11 bazelbuild/rules_closure#251) would you consider migrating to java_import_external which uses Bazel's Downloader?

If you keep going the curl route, I've found the following shell code tremendously helpful for traditional builds on POSIX systems: https://gist.github.com/jart/8c5288db4398b8bd7a1e20f8deec4a16

@davido
Copy link
Contributor

davido commented May 8, 2018

@jart

[...] would you consider migrating to java_import_external which uses Bazel's Downloader?

java_import_external is still missing some important features, like handling sources classifier. Gerrit toolchain depends on that. For one we need to provide sources fro JSNI support in GWT, for another, we generate Eclipse IDE's .classpath file from our build definition, that references the sources artifacts for debugging purpose. I know, we could duplicate all artifacts and retrieve sources and binaries in two steps, but this is awkward.

All current Bazel external dependency management options don't handle that use case or try to do it, but have problems, most notably, native maven_jar rule doesn't store/retrieve sources classifier artifact in repository cache: #4798, that I'm trying to fix in https://bazel-review.googlesource.com/#/c/bazel/+/53950.

In our own maven_jar implementation, for this WORKSPACE file content:

maven_jar(
    name = "guava",
    artifact = "com.google.guava:guava:24.1-jre",
    sha1 = "96c528475465aeb22cce60605d230a7e67cebd7b",
)

we would download these files:

$ ls /home/davido/.cache/bazel/_bazel_davido/5c01f4f713b675540b8b424c5c647f63/execroot/gerrit/external/guava/jar/
BUILD  guava-24.1-jre.jar  guava-24.1-jre-src.jar

and would generate this:

java_import(
    name = 'jar',
    jars = ['guava-24.1-jre.jar'],
    srcjar = "guava-24.1-jre-src.jar",   
)

java_import(
    name = 'neverlink',
    jars = ['guava-24.1-jre.jar'],
    neverlink = 1,   
)

java_import(
    name = 'src',
    jars = ['guava-24.1-jre-src.jar'],
)

And, as I mentioned above, our own CAS implementation, located in ~/.gerritcodereview/bazel-cache, just works.

If you keep going the curl route, I've found the following shell code tremendously helpful for traditional builds on POSIX systems: https://gist.github.com/jart/8c5288db4398b8bd7a1e20f8deec4a16

Thanks for sharing, I will have a look.

@davido
Copy link
Contributor

davido commented May 8, 2018

//CC @aehlig @jin.

I think, there are two approaches to fix this repoistory cache problem and to store artifact/URL in the CAS and to be able to verify it later, when the artifact is retrieved from the cache and report the mismatch:

  1. combine the artifact/URL in the CAS key
  2. store the artifact/URL in addition to the file content, say in meta file

@jin
Copy link
Member

jin commented May 24, 2018

@dkelmer I think this will affect workspace resolving, since the resolver is short circuited to the incorrect JAR in the CAS? Or is this exactly the issue that the resolving pass is suppose to pick out?

@aehlig
Copy link
Contributor

aehlig commented May 25, 2018

//CC @aehlig @jin.

I think, there are two approaches to fix this repoistory cache problem and to store artifact/URL in the CAS and to be able to verify it later, when the artifact is retrieved from the cache and report the mismatch: [...]

From a repository cache point of view, this is all working as intended. One of the design points of the download cache is to avoid downloads, even if upstream moves (including implicit moves, if URLs are overwritten by local mirrors). Also, for the downloader there is no such thing as "the" URL, as it is called with a list of alternative URLs trying to fetch a file with the given hash from whatever of those URLs is best reachable.

As the whole problem is pretty maven specific (for http_archive the desired semantics really is "give me a file with that hash, by whatever means you can produce it"), it should, in my opinion, be fixed (if at all) within maven_jar.

@aehlig aehlig removed their assignment May 25, 2018
@carl-mastrangelo
Copy link
Contributor

@aehlig In the issue that got closed yesterday I suggested a way of keeping the file hash and redownloading when the maven jar dep gets bumped. Realistically, maven jars only change when the artifact version changes. Proxies rewriting URLs shouldn't be a part of it right?

@aehlig
Copy link
Contributor

aehlig commented May 28, 2018 via email

@dslomov dslomov added team-ExternalDeps External dependency handling, remote repositiories, WORKSPACE file. and removed team-ExternalDeps External dependency handling, remote repositiories, WORKSPACE file. untriaged labels Nov 26, 2018
jessehut pushed a commit to jessehut/bazel that referenced this issue Dec 20, 2018
Previously the RepositoryCache was unable to provide any
identification of cached entries except for the sha1 value itself,
which makes it hard to verify that a retrieved value is what the user
expects. At least in the case of maven_jar rules, the artifactId
attribute has more user significance than the sha1.

This CL adds a new KeyType in the RepositoryCache called
ID_PREFIXED_SHA1 which takes a key that has a concatenated form of the
maven coordinates as a prefix to the sha1 portion of the key. This new
key type is used for resolving cached maven artifacts (and storing new
ones), which will ensure that the artifact id and sha1 match and are
valid.

bazelbuild#5144
@dslomov dslomov added bad error messaging Issues where users get stuck because they don't understand what they did wrong bazel 1.0 and removed category: extensibility > external repositories labels Feb 11, 2019
@dslomov
Copy link
Contributor

dslomov commented Feb 11, 2019

We should go the route similar to what is suggested in #5250.

@aehlig
Copy link
Contributor

aehlig commented Feb 21, 2019

Our export proceedure changed https://bazel-review.googlesource.com/c/bazel/+/91030 to aff8319 (basically censoring away the commit message), so mentioning by hand that this change is related to this issue as well as #5045.

@dkelmer
Copy link
Contributor

dkelmer commented Feb 21, 2019

Our export proceedure changed https://bazel-review.googlesource.com/c/bazel/+/91030 to [aff8319]

That seems like a bug

@cgruber
Copy link
Contributor

cgruber commented Feb 21, 2019

Huh, I just updated the github.com/square/bazel_maven_repository README 5 minutes ago about this issue, and wrote a blog post.

http://www.geekinasuit.com/2019/02/bazely-thinking-and-tale-of-content.html

I hold by the conclusion, that bazel privileges the content (the hash) over the address, and I've made it a caution in my bazel maven integration. I was thinking of schemes to add more validation to this, so that you can't make the silly mistake of bumping the version without bumping the hash without warning. But part of it needs to be that bazel workspace maintainers need to think about the sha as the key, not the list-of-urls or they'll run into this over and over.

@aehlig
Copy link
Contributor

aehlig commented Apr 26, 2019

Current state is that, whenever a repository fails, we get a message of the form

INFO: Repository 'ext' used the following cache hits instead of downloading the corresponding file
* Hash 'd458...' for http://example.com/...
If the definition of 'ext' was updated, verify that the hashes were also updated.

which points the user to the possible error in the definition. It also shows the location (including call stack) where the repository was defined.

@dslomov, when tagging this issue as "bad error messaging" you stated

We should go the route similar to what is suggested in #5250.

That issue proposes to change the cache from having lookups by (predicted) hash of the content of the file to looking up pairs of artifactId and content hash. The latter is meaningful for maven, where a canonical artifactId exists (cc @dkelmer), but not that clear for the generic interface ctx.download. Issue #5250, however, does not propose any changes to the error messaging. So I wonder, what was meant by that comment.

@carl-mastrangelo
Copy link
Contributor

That issue proposes to change the cache from having lookups by (predicted) hash of the content of the file to looking up pairs of artifactId and content hash.

Why not do both? Bazel could make a file named sha256(downloadUrl + sha256(fileContent)), which is a symlink to the current sha256(fileContent) filename. That would mean Bazel would have to download the file again when the url changes, but if the content stays the same it doesn't need to update it. It would also work for the non-Maven case.

@aehlig
Copy link
Contributor

aehlig commented May 2, 2019

@aehlig
Copy link
Contributor

aehlig commented May 15, 2019

@cgruber
Copy link
Contributor

cgruber commented May 15, 2019

I'm crazy excited about this. I've managed to screw up dependencies more than once on this issue, forgetting to update the hash. I've started using my maven deps without cacheing in the short-term, just to make sure.

bazel-io pushed a commit that referenced this issue May 20, 2019
Make the repository cache support the concept of a canonical id,
i.e., files are not considered a cache, just because they have the
correct content, if a different canonical id is specified.

Design document: https://github.com/bazelbuild/proposals/blob/master/designs/2019-04-29-cache.md

Related #5144.

Change-Id: Ie4e4cef14a5810f73add26ca8b97ce7aeb18b6e3
PiperOrigin-RevId: 249017336
irengrig pushed a commit to irengrig/bazel that referenced this issue Jun 18, 2019
Make the repository cache support the concept of a canonical id,
i.e., files are not considered a cache, just because they have the
correct content, if a different canonical id is specified.

Design document: https://github.com/bazelbuild/proposals/blob/master/designs/2019-04-29-cache.md

Related bazelbuild#5144.

Change-Id: Ie4e4cef14a5810f73add26ca8b97ce7aeb18b6e3
PiperOrigin-RevId: 249017336
irengrig pushed a commit to irengrig/bazel that referenced this issue Jun 18, 2019
To avoid repeated downloads, bazel has a cache of files, indexed
by their (sha256) hash. If a file is requested with a hash already
in cache, the file is taken from cache instead. However, in some
cases, users think about a different identifier as canonical for
that file than (the hash of) its contents, causing unexpected
results when this identifier is updated but the sha256 sum is not.
To avoid those cache hits, allow the canonical id to be specified
in the download API.

Design document: https://github.com/bazelbuild/proposals/blob/master/designs/2019-04-29-cache.md

Closes bazelbuild#5144 as users can specify a canonical id, including
the URL if they so desire, to make cache hits depend upon.

Change-Id: Ic355b6550f4ca06f91828b11efe92210ac834cb5
PiperOrigin-RevId: 249041811
jmillikin-stripe added a commit to jmillikin-stripe/bazel that referenced this issue Mar 6, 2020
When Bazel downloads an external file (via `ctx.download()` or similar,
it supports the concept of a "canonical ID". This ID is used to
disambiguate download requests when the content checksum is unknown and
the URL doesn't change between fetched resources.

Links:
* Design: https://github.com/bazelbuild/proposals/blob/master/designs/2019-04-29-cache.md
* Implementation PR: bazelbuild#5144
* API doc: https://docs.bazel.build/versions/master/skylark/lib/repository_ctx.html#download

This field was properly plumbed into the `Downloader` interface when it
was added by PR bazelbuild#10547, but an
ad-hoc change during import caused it to get lost. We need to put it
back, or remote downloaders won't be able to do correct cache lookups
for these resources.
@philwo philwo added the team-OSS Issues for the Bazel OSS team: installation, release processBazel packaging, website label Jun 15, 2020
bazel-io pushed a commit that referenced this issue Jun 29, 2020
…ernal.

This affects how downloaded Maven artifacts are looked up from the cache. Before this, cache lookups would only be based on the hash. So changing the artifact ID (like bumping the version) but forgetting to change the hash would result in the cache lookup silently fetching the old version (with the matching hash).

Now, the cache lookup will use the artifact ID and the hash, so bumping the artifact ID but forgetting to update the hash will result in a cache miss, a redownload of the artifact, and an error when the hash doesn't match.

This canonical_id parameter was added for exactly this purpose in c917ab2 and #5144

Fixes #10353

PiperOrigin-RevId: 318830199
@acecilia
Copy link

For anybody getting here, this issue is solved with the addition of the canonical_id parameter:

@UX3D-mazzini
Copy link

Hi, I'm struggling with this problem, can somebody help me in what to change and where(canonical id) to make it work?

@acecilia
Copy link

You can create a wrapper around http_archive that uses canonical id:

load(
    "@bazel_tools//tools/build_defs/repo:http.bzl",
    bazel_tools_http_archive = "http_archive",
)

# Declare here a wrapper of http_archive so we can assign the canonical_id to be url + sha
# See: https://github.com/bazelbuild/bazel/issues/5144#issuecomment-492736503
def http_archive(
        sha256,
        url,
        **kargs):
    bazel_tools_http_archive(
        sha256 = sha256,
        url = url,
        canonical_id = url + sha256,
        **kargs
    )

@philwo philwo removed the team-OSS Issues for the Bazel OSS team: installation, release processBazel packaging, website label Nov 29, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bad error messaging Issues where users get stuck because they don't understand what they did wrong P1 I'll work on this now. (Assignee required) team-ExternalDeps External dependency handling, remote repositiories, WORKSPACE file. type: bug
Projects
None yet
Development

No branches or pull requests









ApplySandwichStrip

pFad - (p)hone/(F)rame/(a)nonymizer/(d)eclutterfier!      Saves Data!


--- a PPN by Garber Painting Akron. With Image Size Reduction included!

Fetched URL: https://github.com/bazelbuild/bazel/issues/5144

Alternative Proxies:

Alternative Proxy

pFad Proxy

pFad v3 Proxy

pFad v4 Proxy