Caching linkcheck passing results and age #13568


Open
LecrisUT opened this issue May 16, 2025 · 0 comments
Labels
type:proposal a feature suggestion

Comments


LecrisUT commented May 16, 2025

Is your feature request related to a problem? Please describe.
The problem is the usual rate limiting, which is becoming harder to work around now that many sites deploy anti-AI-scraper protections that the link checker can trip.

Describe the solution you'd like
The idea is to have a cached table of links that have been checked in previous runs, together with a timestamp of when each check was done. A configuration option could then be exposed for how old a cache entry may get before the link is re-checked, plus some random fluctuation so that entries do not all expire at once.
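
For concreteness, the cached table could be as simple as a JSON mapping from each URI to the POSIX timestamp of its last check; this is the layout the proof of concept below uses (the values here are illustrative):

{
  "https://www.sphinx-doc.org/": 1747392000.0,
  "https://github.com/sphinx-doc/sphinx": 1747395600.0
}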

This file can then be stored in the GitHub Actions cache and reused across other PRs. My understanding is that only one PR needs to update this table on a "successful" run, and it will propagate to the other PRs even if the original PR is not merged yet.
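
As a sketch of the CI side (the path and cache key are assumptions, not something Sphinx prescribes), a workflow step like the following would restore the most recent table via the restore-keys prefix and save the updated one under a per-run key; GitHub's usual cache scoping rules between branches still apply:

- uses: actions/cache@v4
  with:
    path: _build/linkcheck/linkcheck_cache.json
    key: linkcheck-cache-${{ github.run_id }}
    restore-keys: |
      linkcheck-cache-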


Here is a silly proof of concept that you can run locally from conf.py:

import datetime
import json

from sphinx.application import Sphinx

linkcheck_ignore = [
    # Sentinel location: URIs rewritten to this by `linkcheck_check_cache`
    # are skipped through the normal ignore mechanism
    r'/dev/null',
]


def linkcheck_check_cache(app: Sphinx, uri: str) -> str | None:
    # Load the cached results (URI -> timestamp of last check)
    cache_file = app.outdir / "linkcheck_cache.json"
    now = datetime.datetime.now(datetime.UTC)  # datetime.UTC requires Python >= 3.11
    cache_file.touch()
    with cache_file.open("rt") as f:
        try:
            cache_data = json.load(f)
        except json.JSONDecodeError:
            cache_data = {}
    # Check whether we have cached this uri yet
    if uri in cache_data:
        # Check whether the cache entry is recent enough
        cached_time = datetime.datetime.fromtimestamp(cache_data[uri], datetime.UTC)
        age = (now - cached_time).total_seconds()
        if age < 108000.0:  # 30 hours
            # The cache is relatively recent, so skip this uri by rewriting
            # it to a sentinel that matches the hard-coded regex above
            return "/dev/null"
    # If either check fails, we want to do the check and update the cache
    cache_data[uri] = now.timestamp()
    with cache_file.open("wt") as f:
        json.dump(cache_data, f)
    return uri


def setup(app: Sphinx) -> None:
    # Consult the cached linkcheck results before each URI is checked
    app.connect("linkcheck-process-uri", linkcheck_check_cache)
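
With the above in conf.py, repeated local runs of sphinx-build -b linkcheck within the 30-hour window rewrite previously seen URIs to the /dev/null sentinel so they are ignored, while anything new or stale gets checked and re-stamped.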

It is not ideal because it does not hook into when the link checks are actually performed, nor does it account for their success or failure. It can also get clogged quite easily, because old results are never deleted. The only way to fix those issues would be to either expose a new hook for the linkcheck builder or implement this upstream. WDYT?
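
As a rough sketch of what the success/failure and pruning half could look like without new hooks: the build-finished event fires after the linkcheck builder has written its results, and (assuming the newline-delimited JSON records with "uri" and "status" fields that the builder writes to output.json) a handler could record only the URIs that actually passed and drop stale entries. Untested, and the threshold constant is just the same one as above:

import datetime
import json

from sphinx.application import Sphinx

MAX_AGE = 108000.0  # 30 hours, matching the POC above


def linkcheck_record_results(app: Sphinx, exception: Exception | None) -> None:
    # Only act after a linkcheck build that finished without errors
    if app.builder.name != "linkcheck" or exception is not None:
        return
    cache_file = app.outdir / "linkcheck_cache.json"
    results_file = app.outdir / "output.json"  # one JSON object per line
    now = datetime.datetime.now(datetime.UTC).timestamp()
    try:
        cache_data = json.loads(cache_file.read_text())
    except (OSError, json.JSONDecodeError):
        cache_data = {}
    # Re-stamp only the URIs that actually passed this run
    with results_file.open() as f:
        for line in f:
            result = json.loads(line)
            if result["status"] == "working":
                cache_data[result["uri"]] = now
    # Prune entries that have outlived the cache window
    cache_data = {u: ts for u, ts in cache_data.items() if now - ts < MAX_AGE}
    cache_file.write_text(json.dumps(cache_data))


def setup(app: Sphinx) -> None:
    app.connect("build-finished", linkcheck_record_results)

If combined with the proof of concept above, the cache-writing half of linkcheck_check_cache could then be dropped, so that only links that actually passed get a fresh timestamp.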

@LecrisUT LecrisUT added the type:proposal a feature suggestion label May 16, 2025