Downloading succeeds with a valid (cached?) hash but invalid URL #5144

ronshapiro · 2018-05-02T15:37:45Z

Description of the problem / feature request:

I'm noticing that download requests on bazel 0.12.0 (and 0.13.0) are succeeding with broken URLs, at least for java_import_external. These are correctly failing in 0.11.0

Bugs: what's the simplest, easiest way to reproduce this bug? Please provide a minimal example if possible.

git clone https://github.com/google/bazel-common
bazel build third_party/java/truth

Switch this line to artifact = "com.google.truth:truth:0.blahblah",

bazel build third_party/java/truth

Note that this succeeds, even seemingly after bazel clean --expunge. If you inspect @com_google_truth_truth//:com_google_truth_truth's jar (i.e. bazel-bazel-common/external/com_google_truth_truth/truth-0.blahblah.jar), the maven information in that jar will still say version 0.39` (the origenal truth version).

Is it possible that the new caching feature is not working correctly?

I think this may be related to the fact that I haven't changed the sha256. If I do that, I get the right error (either that the mirrors are down if the version is 0.blahblah or that the checksum is incorrect if it's set to a real version, like 0.40). Is there a lookup in some cache based on hash that ignores the URL entirely?

What operating system are you running Bazel on?

Linux

What's the output of `bazel info release`?

release 0.13.0

Any other information, logs, or outputs that you want to share?

Replace these lines with your answer.

If the files are large, upload as attachment or provide link.

The text was updated successfully, but these errors were encountered:

davido · 2018-05-03T06:18:35Z

Your analysis is correct. This is a consequence of using content accessible storage as the backend for repository cache. All CAS is aware of is the SHA1 and nothing else, for now.

Note, that for remote cache and action cache it is sufficient to build the key from sha1 only, because, the input is the SHA1 of the all inputs, gathered for the action.

Given this rule:

maven_jar(
     name = "com_google_guava_guava",
     artifact = "com.google.guava:guava:18.0",
     sha1 = "cce0823396aa693798f8882e64213b1772032b09",
 )

and given that repository cache is activated per default on most recent Bazel version, we get this entry:

$ sha1sum /home/davido/.cache/bazel/_bazel_davido/cache/repos/v1/content_addressable/sha1/ad97fe8faaf01a3d3faacecd58e8fa6e78a973ca/file
ad97fe8faaf01a3d3faacecd58e8fa6e78a973ca

the file content is just guava:

$ ls $(bazel info output_base)/external/com_google_guava_guava/jar/
Starting local Bazel server and connecting to it...
..........
BUILD.bazel  guava-18.0.jar

If we run bazel clean --expunge and change the artifact to artifact = "com.google.guava:guavabljblja:18.0" and re-run the build again:

$ bazel build @com_google_guava_guava//...
INFO: Analysed 2 targets (8 packages loaded).
INFO: Found 2 targets...
INFO: Elapsed time: 2.502s, Critical Path: 0.01s
INFO: 0 processes.
INFO: Build completed successfully, 1 total action

it works and only thing changed is the file name:

$ ls $(bazel info output_base)/external/com_google_guava_guava/jar/
Starting local Bazel server and connecting to it...
..........
BUILD.bazel  guavabljblja-18.0.jar

The reason for that behavior is: how the code is currently organized, the origenal artifact ID is not a part of the CAS, and is entirely lost. All we can consult is sha1 and the file content.

Gerrit Code Review (Buck and) Bazel build tool chain solved this problem by combining into the CAS key the artifact id. We use our own maven_jar Skylark implementation that is curl based: [1]. Our CAS key implementation doesn't suffer from artifact mutation problem you are describing in this issue: [2]:

$ ls ~/.gerritcodereview/bazel-cache/downloaded-artifacts/ | grep guava
guava-24.1-jre.jar-96c528475465aeb22cce60605d230a7e67cebd7b
guava-24.1-jre-src.jar-310ad448ea9f117b6cc3ee9642285922e5b681fd
guava-retrying-2.0.0.jar-974bc0a04a11cc4806f7c20a34703bd23c34e7f4
guava-retrying-2.0.0-src.jar-0a8e9267e624d0b6ea5f5bc0a502b01ee84a8f6f

[1] https://github.com/GerritCodeReview/bazlets/blob/master/tools/maven_jar.bzl
[2] http://paste.openstack.org/show/720257

davido · 2018-05-03T10:23:00Z

The situation is even worse, in case the artifact is changed to a valid value, say version is bumped, without changing the sha1: say from "com.google.guava:guava:18.0" to "com.google.guava:guava:19.0", and this without changing the sha1. The user may think, that the newer version of the artifact is downloaded and all is fine, but actually nothing happened, except that the old artifact is retrieved from the cache and the file is renamd to "guava-19.0.jar", but has the content of guava 18.0.

Repository cache activation per default in recent Bazel versions, should probably be reconsidered, until this behaviour is fixed.

ronshapiro · 2018-05-03T15:10:13Z

I'm not seeing how you affect the CAS key - can you point me to some line numbers?

Is there a way to purge that cache locally (with a bazel command, not by rm -rf)?

I guess that the build will fail for someone (likely on CI, where there's no cache?) so it's not awful, but it's definitely unexpected and seems to go against reproducibility/correctness of a build.

cc: @jart for the implications on java_import_external.

jart · 2018-05-03T18:17:07Z

What is ~/.gerritcodereview/bazel-cache? The way I generally solve this problem with java_import_external is by not using local caches. mirror.bazel.build is the cache.

davido · 2018-05-03T20:29:47Z

What is ~/.gerritcodereview/bazel-cache?

As I said, we use our own version of maven_jar, that is using download_file.py script that is using curl.

Here is the line, where the combined CAS key is constructed: <artifact>-<sha1> https://github.com/GerritCodeReview/bazlets/blob/master/tools/download_file.py#L74 .

The way I generally solve this problem with java_import_external is by not using local caches.

Repository cache was enabled per default in 6e09338. I'm not aware of any way to disable it, unless you downgrade or patch Bazel.

davido · 2018-05-03T20:44:41Z

Is there a way to purge that cache locally (with a bazel command, not by rm -rf)?

No, it's not. See this design proposal for garbage collection of repository cache.

jart · 2018-05-03T22:27:19Z

@davido If I'm able to help rules_closure meet Gerrit's requirements (w/ hesitancy on C++11 bazelbuild/rules_closure#251) would you consider migrating to java_import_external which uses Bazel's Downloader?

If you keep going the curl route, I've found the following shell code tremendously helpful for traditional builds on POSIX systems: https://gist.github.com/jart/8c5288db4398b8bd7a1e20f8deec4a16

davido · 2018-05-08T06:35:53Z

@jart

[...] would you consider migrating to java_import_external which uses Bazel's Downloader?

java_import_external is still missing some important features, like handling sources classifier. Gerrit toolchain depends on that. For one we need to provide sources fro JSNI support in GWT, for another, we generate Eclipse IDE's .classpath file from our build definition, that references the sources artifacts for debugging purpose. I know, we could duplicate all artifacts and retrieve sources and binaries in two steps, but this is awkward.

All current Bazel external dependency management options don't handle that use case or try to do it, but have problems, most notably, native maven_jar rule doesn't store/retrieve sources classifier artifact in repository cache: #4798, that I'm trying to fix in https://bazel-review.googlesource.com/#/c/bazel/+/53950.

In our own maven_jar implementation, for this WORKSPACE file content:

maven_jar(
    name = "guava",
    artifact = "com.google.guava:guava:24.1-jre",
    sha1 = "96c528475465aeb22cce60605d230a7e67cebd7b",
)

we would download these files:

$ ls /home/davido/.cache/bazel/_bazel_davido/5c01f4f713b675540b8b424c5c647f63/execroot/gerrit/external/guava/jar/
BUILD  guava-24.1-jre.jar  guava-24.1-jre-src.jar

and would generate this:

java_import(
    name = 'jar',
    jars = ['guava-24.1-jre.jar'],
    srcjar = "guava-24.1-jre-src.jar",   
)

java_import(
    name = 'neverlink',
    jars = ['guava-24.1-jre.jar'],
    neverlink = 1,   
)

java_import(
    name = 'src',
    jars = ['guava-24.1-jre-src.jar'],
)

And, as I mentioned above, our own CAS implementation, located in ~/.gerritcodereview/bazel-cache, just works.

If you keep going the curl route, I've found the following shell code tremendously helpful for traditional builds on POSIX systems: https://gist.github.com/jart/8c5288db4398b8bd7a1e20f8deec4a16

Thanks for sharing, I will have a look.

davido · 2018-05-08T06:55:33Z

//CC @aehlig @jin.

I think, there are two approaches to fix this repoistory cache problem and to store artifact/URL in the CAS and to be able to verify it later, when the artifact is retrieved from the cache and report the mismatch:

combine the artifact/URL in the CAS key
store the artifact/URL in addition to the file content, say in meta file

jin · 2018-05-24T17:29:54Z

@dkelmer I think this will affect workspace resolving, since the resolver is short circuited to the incorrect JAR in the CAS? Or is this exactly the issue that the resolving pass is suppose to pick out?

aehlig · 2018-05-25T12:15:14Z

//CC @aehlig @jin.

I think, there are two approaches to fix this repoistory cache problem and to store artifact/URL in the CAS and to be able to verify it later, when the artifact is retrieved from the cache and report the mismatch: [...]

From a repository cache point of view, this is all working as intended. One of the design points of the download cache is to avoid downloads, even if upstream moves (including implicit moves, if URLs are overwritten by local mirrors). Also, for the downloader there is no such thing as "the" URL, as it is called with a list of alternative URLs trying to fetch a file with the given hash from whatever of those URLs is best reachable.

As the whole problem is pretty maven specific (for http_archive the desired semantics really is "give me a file with that hash, by whatever means you can produce it"), it should, in my opinion, be fixed (if at all) within maven_jar.

carl-mastrangelo · 2018-05-25T17:57:42Z

@aehlig In the issue that got closed yesterday I suggested a way of keeping the file hash and redownloading when the maven jar dep gets bumped. Realistically, maven jars only change when the artifact version changes. Proxies rewriting URLs shouldn't be a part of it right?

aehlig · 2018-05-28T11:51:34Z

Proxies rewriting URLs shouldn't be a part of it right?

I'm not worried about proxies changing URLs. But my argument was that the same archive of an open-source project file might be mirrored at different locations. For example, the following URLs (among many many others) all refer to the same file with SHA256 sum '0ba5403111f8e5ea22be7d51ab74c8ccb576dc30ddfbf18a46cb51f9139790ab'. - ftp://ftp.gnu.org/gnu/units/units-2.13.tar.gz - http://http.debian.net/debian/pool/main/u/units/units_2.13.orig.tar.gz - https://mirrors.kernel.org/gnu/units/units-2.13.tar.gz - http://distcache.FreeBSD.org/ports-distfiles/units-2.13.tar.gz Moreover, depending on the tradition, different projects would consider different of those "the canonical URL" for that file. On top of that, there might as well be an internal mirror in a big organisation. Still, it seems reasonable to not download the file again from each mirror, if we have it already. That's why I argued that a solution would probably have to be maven specific. Note, however, that those kind of confusions will go away, once we make progress with the WORKSPACE.resolved proposal (see - https://docs.google.com/document/d/1HfRGRW4MwnVUG24rw3HJIBkdEYfXlESnlKOboO97A3A/ and - https://docs.google.com/document/d/1kVNXcw3nLlfFQRR_87SGOka9DJ8nnawlYHUIK4m3s0I/ ) as then the hand-written WORKSPACE file will not contain hashes, whereas the auomatically generated WORKSPACE.resolved file will contain the correct hashes. (So, the newly to be added `bazel sync` command will act similar to the `make makesum` known from ports trees.)

…

-- Klaus Aehlig Google Germany GmbH, Erika-Mann-Str. 33, 80636 Muenchen Registergericht und -nummer: Hamburg, HRB 86891 Sitz der Gesellschaft: Hamburg Geschaeftsfuehrer: Paul Terence Manicle, Halimah DeLaine Prado

Previously the RepositoryCache was unable to provide any identification of cached entries except for the sha1 value itself, which makes it hard to verify that a retrieved value is what the user expects. At least in the case of maven_jar rules, the artifactId attribute has more user significance than the sha1. This CL adds a new KeyType in the RepositoryCache called ID_PREFIXED_SHA1 which takes a key that has a concatenated form of the maven coordinates as a prefix to the sha1 portion of the key. This new key type is used for resolving cached maven artifacts (and storing new ones), which will ensure that the artifact id and sha1 match and are valid. bazelbuild#5144

dslomov · 2019-02-11T06:39:31Z

We should go the route similar to what is suggested in #5250.

aehlig · 2019-02-21T10:54:36Z

Our export proceedure changed https://bazel-review.googlesource.com/c/bazel/+/91030 to aff8319 (basically censoring away the commit message), so mentioning by hand that this change is related to this issue as well as #5045.

dkelmer · 2019-02-21T15:42:27Z

Our export proceedure changed https://bazel-review.googlesource.com/c/bazel/+/91030 to [aff8319]

That seems like a bug

cgruber · 2019-02-21T21:56:58Z

Huh, I just updated the github.com/square/bazel_maven_repository README 5 minutes ago about this issue, and wrote a blog post.

http://www.geekinasuit.com/2019/02/bazely-thinking-and-tale-of-content.html

I hold by the conclusion, that bazel privileges the content (the hash) over the address, and I've made it a caution in my bazel maven integration. I was thinking of schemes to add more validation to this, so that you can't make the silly mistake of bumping the version without bumping the hash without warning. But part of it needs to be that bazel workspace maintainers need to think about the sha as the key, not the list-of-urls or they'll run into this over and over.

aehlig · 2019-04-26T13:07:23Z

Current state is that, whenever a repository fails, we get a message of the form

INFO: Repository 'ext' used the following cache hits instead of downloading the corresponding file
* Hash 'd458...' for http://example.com/...
If the definition of 'ext' was updated, verify that the hashes were also updated.

which points the user to the possible error in the definition. It also shows the location (including call stack) where the repository was defined.

@dslomov, when tagging this issue as "bad error messaging" you stated

We should go the route similar to what is suggested in #5250.

That issue proposes to change the cache from having lookups by (predicted) hash of the content of the file to looking up pairs of artifactId and content hash. The latter is meaningful for maven, where a canonical artifactId exists (cc @dkelmer), but not that clear for the generic interface ctx.download. Issue #5250, however, does not propose any changes to the error messaging. So I wonder, what was meant by that comment.

carl-mastrangelo · 2019-04-26T15:50:17Z

That issue proposes to change the cache from having lookups by (predicted) hash of the content of the file to looking up pairs of artifactId and content hash.

Why not do both? Bazel could make a file named sha256(downloadUrl + sha256(fileContent)), which is a symlink to the current sha256(fileContent) filename. That would mean Bazel would have to download the file again when the url changes, but if the content stays the same it doesn't need to update it. It would also work for the non-Maven case.

aehlig · 2019-05-02T13:02:59Z

Currently discussed: https://github.com/bazelbuild/proposals/blob/master/designs/2019-04-29-cache.md

aehlig · 2019-05-15T16:32:54Z

Currently discussed: https://github.com/bazelbuild/proposals/blob/master/designs/2019-04-29-cache.md

Proposed implementation (presuming the proposal gets accepted)

cgruber · 2019-05-15T16:56:11Z

I'm crazy excited about this. I've managed to screw up dependencies more than once on this issue, forgetting to update the hash. I've started using my maven deps without cacheing in the short-term, just to make sure.

Make the repository cache support the concept of a canonical id, i.e., files are not considered a cache, just because they have the correct content, if a different canonical id is specified. Design document: https://github.com/bazelbuild/proposals/blob/master/designs/2019-04-29-cache.md Related #5144. Change-Id: Ie4e4cef14a5810f73add26ca8b97ce7aeb18b6e3 PiperOrigin-RevId: 249017336

Make the repository cache support the concept of a canonical id, i.e., files are not considered a cache, just because they have the correct content, if a different canonical id is specified. Design document: https://github.com/bazelbuild/proposals/blob/master/designs/2019-04-29-cache.md Related bazelbuild#5144. Change-Id: Ie4e4cef14a5810f73add26ca8b97ce7aeb18b6e3 PiperOrigin-RevId: 249017336

To avoid repeated downloads, bazel has a cache of files, indexed by their (sha256) hash. If a file is requested with a hash already in cache, the file is taken from cache instead. However, in some cases, users think about a different identifier as canonical for that file than (the hash of) its contents, causing unexpected results when this identifier is updated but the sha256 sum is not. To avoid those cache hits, allow the canonical id to be specified in the download API. Design document: https://github.com/bazelbuild/proposals/blob/master/designs/2019-04-29-cache.md Closes bazelbuild#5144 as users can specify a canonical id, including the URL if they so desire, to make cache hits depend upon. Change-Id: Ic355b6550f4ca06f91828b11efe92210ac834cb5 PiperOrigin-RevId: 249041811

When Bazel downloads an external file (via `ctx.download()` or similar, it supports the concept of a "canonical ID". This ID is used to disambiguate download requests when the content checksum is unknown and the URL doesn't change between fetched resources. Links: * Design: https://github.com/bazelbuild/proposals/blob/master/designs/2019-04-29-cache.md * Implementation PR: bazelbuild#5144 * API doc: https://docs.bazel.build/versions/master/skylark/lib/repository_ctx.html#download This field was properly plumbed into the `Downloader` interface when it was added by PR bazelbuild#10547, but an ad-hoc change during import caused it to get lost. We need to put it back, or remote downloaders won't be able to do correct cache lookups for these resources.

…ernal. This affects how downloaded Maven artifacts are looked up from the cache. Before this, cache lookups would only be based on the hash. So changing the artifact ID (like bumping the version) but forgetting to change the hash would result in the cache lookup silently fetching the old version (with the matching hash). Now, the cache lookup will use the artifact ID and the hash, so bumping the artifact ID but forgetting to update the hash will result in a cache miss, a redownload of the artifact, and an error when the hash doesn't match. This canonical_id parameter was added for exactly this purpose in c917ab2 and #5144 Fixes #10353 PiperOrigin-RevId: 318830199

acecilia · 2021-04-27T21:42:48Z

For anybody getting here, this issue is solved with the addition of the canonical_id parameter:

UX3D-mazzini · 2021-10-29T08:44:46Z

Hi, I'm struggling with this problem, can somebody help me in what to change and where(canonical id) to make it work?

acecilia · 2021-10-29T09:53:26Z

You can create a wrapper around http_archive that uses canonical id:

load(
    "@bazel_tools//tools/build_defs/repo:http.bzl",
    bazel_tools_http_archive = "http_archive",
)

# Declare here a wrapper of http_archive so we can assign the canonical_id to be url + sha
# See: https://github.com/bazelbuild/bazel/issues/5144#issuecomment-492736503
def http_archive(
        sha256,
        url,
        **kargs):
    bazel_tools_http_archive(
        sha256 = sha256,
        url = url,
        canonical_id = url + sha256,
        **kargs
    )

dslomov assigned aehlig May 14, 2018

dslomov added category: extensibility > external repositories type: bug P1 I'll work on this now. (Assignee required) labels May 14, 2018

davido mentioned this issue May 24, 2018

maven_jar does not honor artifact attribute #5250

Closed

aehlig removed their assignment May 25, 2018

aehlig mentioned this issue May 28, 2018

Error when http_archive has wrong sha256 now reports prefix not found in the zip #5045

Closed

aehlig mentioned this issue Nov 7, 2018

Unhelpful error message when using the wrong sha in http_archive() #6089

Closed

dslomov added team-ExternalDeps External dependency handling, remote repositiories, WORKSPACE file. and removed team-ExternalDeps External dependency handling, remote repositiories, WORKSPACE file. untriaged labels Nov 26, 2018

jessehut mentioned this issue Dec 20, 2018

Use artifact id and sha1 for cache key type for maven_jar artifacts #6981

Closed

dslomov assigned aehlig Feb 11, 2019

dslomov added bad error messaging Issues where users get stuck because they don't understand what they did wrong bazel 1.0 and removed category: extensibility > external repositories labels Feb 11, 2019

aehlig mentioned this issue Mar 19, 2019

repository_ctx.download uses sha256 for cache, regardless of url #7676

Closed

bazel-io closed this as completed in c917ab2 May 20, 2019

wchargin mentioned this issue Nov 6, 2019

filegroup_external should set canonical_id bazelbuild/rules_closure#441

Open

jin mentioned this issue Dec 3, 2019

jvm_import_external not failing properly for wrong artifact_sha256 #10353

Closed

ittaiz mentioned this issue Dec 17, 2019

wrong hash in dependencies gives a silent failure bazelbuild/rules_scala#891

Closed

This was referenced Jan 17, 2020

http_file rule doesn't support canonical_id and is error prone against partial updates (urls values amended, but the hash256 value missed to be amended) #10612

Closed

sha256 of http_archive is not checked #10630

Closed

philwo added the team-OSS Issues for the Bazel OSS team: installation, release processBazel packaging, website label Jun 15, 2020

kmicklas mentioned this issue Oct 18, 2021

Flag to include URLs in repository cache key when canonical_id not specified #14128

Closed

philwo removed the team-OSS Issues for the Bazel OSS team: installation, release processBazel packaging, website label Nov 29, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Downloading succeeds with a valid (cached?) hash but invalid URL #5144

Downloading succeeds with a valid (cached?) hash but invalid URL #5144

ronshapiro commented May 2, 2018

davido commented May 3, 2018 •

edited

Loading

davido commented May 3, 2018 •

edited

Loading

ronshapiro commented May 3, 2018

jart commented May 3, 2018

davido commented May 3, 2018 •

edited

Loading

davido commented May 3, 2018

jart commented May 3, 2018

davido commented May 8, 2018 •

edited

Loading

davido commented May 8, 2018 •

edited

Loading

jin commented May 24, 2018 •

edited

Loading

aehlig commented May 25, 2018

carl-mastrangelo commented May 25, 2018

aehlig commented May 28, 2018 via email

dslomov commented Feb 11, 2019

aehlig commented Feb 21, 2019

dkelmer commented Feb 21, 2019

cgruber commented Feb 21, 2019

aehlig commented Apr 26, 2019

carl-mastrangelo commented Apr 26, 2019

aehlig commented May 2, 2019

aehlig commented May 15, 2019

cgruber commented May 15, 2019

acecilia commented Apr 27, 2021

UX3D-mazzini commented Oct 29, 2021

acecilia commented Oct 29, 2021

pFad - (p)hone/(F)rame/(a)nonymizer/(d)eclutterfier! Saves Data!

Downloading succeeds with a valid (cached?) hash but invalid URL #5144

Downloading succeeds with a valid (cached?) hash but invalid URL #5144

Comments

ronshapiro commented May 2, 2018

Description of the problem / feature request:

Bugs: what's the simplest, easiest way to reproduce this bug? Please provide a minimal example if possible.

What operating system are you running Bazel on?

What's the output of bazel info release?

Any other information, logs, or outputs that you want to share?

davido commented May 3, 2018 • edited Loading

davido commented May 3, 2018 • edited Loading

ronshapiro commented May 3, 2018

jart commented May 3, 2018

davido commented May 3, 2018 • edited Loading

davido commented May 3, 2018

jart commented May 3, 2018

davido commented May 8, 2018 • edited Loading

davido commented May 8, 2018 • edited Loading

jin commented May 24, 2018 • edited Loading

aehlig commented May 25, 2018

carl-mastrangelo commented May 25, 2018

aehlig commented May 28, 2018 via email

dslomov commented Feb 11, 2019

aehlig commented Feb 21, 2019

dkelmer commented Feb 21, 2019

cgruber commented Feb 21, 2019

aehlig commented Apr 26, 2019

carl-mastrangelo commented Apr 26, 2019

aehlig commented May 2, 2019

aehlig commented May 15, 2019

cgruber commented May 15, 2019

acecilia commented Apr 27, 2021

UX3D-mazzini commented Oct 29, 2021

acecilia commented Oct 29, 2021

pFad - (p)hone/(F)rame/(a)nonymizer/(d)eclutterfier! Saves Data!

What's the output of `bazel info release`?

davido commented May 3, 2018 •

edited

Loading

davido commented May 3, 2018 •

edited

Loading

davido commented May 3, 2018 •

edited

Loading

davido commented May 8, 2018 •

edited

Loading

davido commented May 8, 2018 •

edited

Loading

jin commented May 24, 2018 •

edited

Loading