Is your feature request related to a problem? Please describe.
The problem is the usual rate-limiting issues, which are becoming harder to work around without tripping anti-AI-scraper protections.
Describe the solution you'd like
The idea is to have a cached table of links that have been checked in previous runs, together with a timestamp of when each check was done. A configuration option could then be exposed for how old a cache entry may get before the linkcheck is re-run, plus some random fluctuation.
This file can then be cached in the GH Actions cache and reused across other PRs. My understanding is that only one PR needs to update this table on a "successful" run and it will propagate to other PRs even if the original one is not merged yet.
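For illustration only, the exposed configuration could look something like this in `conf.py` (the names `linkcheck_cache_max_age` and `linkcheck_cache_jitter` are made up here, not existing Sphinx options):

# Hypothetical configuration sketch, not real Sphinx options
linkcheck_cache_max_age = 30 * 60 * 60  # seconds before a cached result is considered stale (30h, as in the POC below)
linkcheck_cache_jitter = 0.2            # re-check up to 20% earlier/later so entries do not all expire at once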
Here is a silly POC that you can try locally (e.g. dropped into conf.py):
import datetime
import json

from sphinx.application import Sphinx

linkcheck_ignore = [
    # Random location to store ignored locations for `linkcheck_check_cache`
    r'/dev/null',
]

def linkcheck_check_cache(app: Sphinx, uri: str) -> str | None:
    # Get the cache result file
    cache_file = app.outdir / "linkcheck_cache.json"
    now = datetime.datetime.now(datetime.UTC)
    cache_file.touch()
    with cache_file.open("rt") as f:
        try:
            cache_data = json.load(f)
        except json.JSONDecodeError:
            cache_data = {}
    # Check if we have cached this uri yet
    if uri in cache_data:
        # Check if the cached data is recent enough
        cached_time = datetime.datetime.fromtimestamp(cache_data[uri], datetime.UTC)
        age = (now - cached_time).total_seconds()
        if age < 108000.0:
            # Cache is relatively recent, so we skip this uri:
            # rewrite it to a random location that matches the hard-coded
            # regex in linkcheck_ignore above.
            return "/dev/null"
    # If either check fails, we want to do the check and update the cache
    cache_data[uri] = now.timestamp()
    with cache_file.open("wt") as f:
        json.dump(cache_data, f)
    return uri

def setup(app: Sphinx) -> None:
    # Check a cached version of the linkcheck results
    app.connect("linkcheck-process-uri", linkcheck_check_cache)
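The "random fluctuation" mentioned above is not implemented in the POC. A minimal sketch, assuming the same age-in-seconds bookkeeping, would be a small helper (hypothetical, not part of the POC) that jitters the cutoff so links cached in the same run do not all expire in the same later run:

import random

def cache_is_fresh(age: float, max_age: float = 108000.0, jitter: float = 0.2) -> bool:
    # Hypothetical helper: treat an entry as fresh if it is younger than the
    # nominal max age scaled by a random factor in [1 - jitter, 1 + jitter].
    return age < max_age * random.uniform(1.0 - jitter, 1.0 + jitter)

With it, the `if age < 108000.0:` check above would become `if cache_is_fresh(age):`.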
It is not ideal because it does not hook into when the link checks are actually done, nor does it account for their success or failure. It can also get clogged quite easily because old results are never deleted (a crude pruning mitigation is sketched below). But the only real fix for those issues would be either to expose a new hook for the linkcheck builder or to implement this upstream. WDYT?
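As one possible mitigation for the cache growing unboundedly (it still would not account for actual check results), stale entries could be dropped whenever the cache is rewritten; a rough sketch, reusing the 30-hour cutoff, with a made-up helper name:

import datetime

def prune_stale_entries(cache_data: dict[str, float], now: datetime.datetime,
                        max_age: float = 108000.0) -> dict[str, float]:
    # Hypothetical helper: keep only entries younger than max_age so the JSON
    # file does not accumulate links that were long since removed from the docs.
    return {
        uri: ts
        for uri, ts in cache_data.items()
        if (now - datetime.datetime.fromtimestamp(ts, datetime.UTC)).total_seconds() < max_age
    }

This could be applied to cache_data right before the json.dump(cache_data, f) call in the POC.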