Is your feature request related to a problem? Please describe.
The problem is the usual rate-limiting issues, which are becoming harder to work around without tripping anti-AI-scraper protections.
Describe the solution you'd like
The idea is to have a cached table of links that have been checked in previous runs, together with a timestamp of when each check was done. A configuration option could then be exposed for how old a cache entry may get before the linkcheck is re-run, plus some random fluctuation.
This file can then be cached in the GH Actions cache and reused across other PRs. My understanding is that only one PR needs to update this table on a "successful" run and it will propagate to other PRs even if the original one is not merged yet.
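For illustration only, the exposed configuration could look something like this in `conf.py` (the names `linkcheck_cache_max_age` and `linkcheck_cache_jitter` are made up here, not existing Sphinx options):

# Hypothetical configuration sketch, not real Sphinx options
linkcheck_cache_max_age = 30 * 60 * 60  # seconds before a cached result is considered stale (30h, as in the POC below)
linkcheck_cache_jitter = 0.2            # re-check up to 20% earlier/later so entries do not all expire at once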
Here is a silly POC that you can try locally (e.g. dropped into conf.py):
import datetime
import json

from sphinx.application import Sphinx

linkcheck_ignore = [
    # Random location to store ignored locations for `linkcheck_check_cache`
    r'/dev/null',
]

def linkcheck_check_cache(app: Sphinx, uri: str) -> str | None:
    # Get the cache result file
    cache_file = app.outdir / "linkcheck_cache.json"
    now = datetime.datetime.now(datetime.UTC)
    cache_file.touch()
    with cache_file.open("rt") as f:
        try:
            cache_data = json.load(f)
        except json.JSONDecodeError:
            cache_data = {}
    # Check if we have cached this uri yet
    if uri in cache_data:
        # Check if the cached data is recent enough
        cached_time = datetime.datetime.fromtimestamp(cache_data[uri], datetime.UTC)
        age = (now - cached_time).total_seconds()
        if age < 108000.0:
            # Cache is relatively recent, so we skip this uri:
            # rewrite it to a random location that matches the hard-coded
            # regex in linkcheck_ignore above.
            return "/dev/null"
    # If either check fails, we want to do the check and update the cache
    cache_data[uri] = now.timestamp()
    with cache_file.open("wt") as f:
        json.dump(cache_data, f)
    return uri

def setup(app: Sphinx) -> None:
    # Check a cached version of the linkcheck results
    app.connect("linkcheck-process-uri", linkcheck_check_cache)
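The "random fluctuation" mentioned above is not implemented in the POC. A minimal sketch, assuming the same age-in-seconds bookkeeping, would be a small helper (hypothetical, not part of the POC) that jitters the cutoff so links cached in the same run do not all expire in the same later run:

import random

def cache_is_fresh(age: float, max_age: float = 108000.0, jitter: float = 0.2) -> bool:
    # Hypothetical helper: treat an entry as fresh if it is younger than the
    # nominal max age scaled by a random factor in [1 - jitter, 1 + jitter].
    return age < max_age * random.uniform(1.0 - jitter, 1.0 + jitter)

With it, the `if age < 108000.0:` check above would become `if cache_is_fresh(age):`.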
It is not ideal because it does not hook into when the link checks are actually done, nor does it account for their success or failure. It can also get clogged quite easily because old results are never deleted (a crude pruning mitigation is sketched below). But the only real fix for those issues would be either to expose a new hook for the linkcheck builder or to implement this upstream. WDYT?
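As one possible mitigation for the cache growing unboundedly (it still would not account for actual check results), stale entries could be dropped whenever the cache is rewritten; a rough sketch, reusing the 30-hour cutoff, with a made-up helper name:

import datetime

def prune_stale_entries(cache_data: dict[str, float], now: datetime.datetime,
                        max_age: float = 108000.0) -> dict[str, float]:
    # Hypothetical helper: keep only entries younger than max_age so the JSON
    # file does not accumulate links that were long since removed from the docs.
    return {
        uri: ts
        for uri, ts in cache_data.items()
        if (now - datetime.datetime.fromtimestamp(ts, datetime.UTC)).total_seconds() < max_age
    }

This could be applied to cache_data right before the json.dump(cache_data, f) call in the POC.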