ESP32-S3: Watchdog timer expired running polling loop #9460
Comments
I think I can help with logging inside the library, I authored most of the code. Although it seems to me like the issue might be in CircuitPython itself rather than the library. |
This is almost certainly a core CircuitPython issue. I am transferring this issue to the circuitpython repo. |
Can I help track it down? I don't know what it will take to reliably reproduce the failure. I'm a programmer, but not really set up to do debug builds. My main environment is Fedora. I can spin up a VM if another environment (not Windows or Mac) is better for building. Any idea how long the watchdog timer is? Knowing that, I would have an idea of how long to wait before checking whether it had triggered, assuming that it needs 'idle' time to trigger. |
The "internal watchdog timer" that was triggered happens because something was taking far too long, blocking other operations (for example, a tight infinite loop). It reflects a bug, not just a timeout that's too short. If you can reduce your program to the smallest example that still shows the problem, that would be great. We'll have to let it sit for a while to trigger the error, it seems like. |
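To illustrate the point above, here is a toy simulation in plain Python. It is only a conceptual sketch, not how the ESP32-S3 watchdog is actually implemented; `SimulatedWatchdog` and `run` are hypothetical names invented for this example. It shows that many short tasks never trip the watchdog no matter how long the program runs, while a single task that blocks past the timeout does.

```python
class SimulatedWatchdog:
    """Toy model of a watchdog: it 'expires' if not fed within timeout_s."""

    def __init__(self, timeout_s):
        self.timeout_s = timeout_s
        self.last_fed = 0.0
        self.expired = False

    def feed(self, now):
        # Feeding resets the deadline.
        self.last_fed = now

    def check(self, now):
        # Stands in for the hardware timer comparing against the deadline.
        if now - self.last_fed > self.timeout_s:
            self.expired = True
        return self.expired


def run(task_durations, timeout_s):
    """Run tasks back to back, feeding the watchdog between them.

    Returns True if any single task blocked for longer than timeout_s.
    """
    wdt = SimulatedWatchdog(timeout_s)
    now = 0.0
    for duration in task_durations:
        now += duration        # the task blocks for this long
        if wdt.check(now):     # the timer fires before we get to feed
            return True
        wdt.feed(now)
    return False
```

With a 1 second timeout, `run([0.1] * 100, 1.0)` never expires even though the total run time is about 10 seconds, whereas `run([0.1, 5.0, 0.1], 1.0)` expires during the single 5 second task.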
I did some more testing. I have sample code that reliably triggers a non-responsive httpserver and, most of the time, safe mode on restart. My initial pointer to .poll() is back in the mix. I tried a couple of variations. When I use server.serve_forever(…), I do not see the failures. Changing that to start plus while True and server.poll() triggers the problem. Idle time is also part of the mix. If I do a fresh request every 5 minutes, the error does not trigger … until I leave it idle for about 15 minutes.
# SPDX-FileCopyrightText: 2024 H Phil Duby
# SPDX-License-Identifier: MIT
import sys
from time import monotonic_ns
import socketpool
import wifi
from adafruit_httpserver import Server, Request, Response
pool = socketpool.SocketPool(wifi.radio)
server = Server(pool, "/static", debug=True)
@server.route("/")
def base(request: Request):
    request_ref = monotonic_ns()
    print(f'{request_ref}: request for page {TAG}')  # LOG
    content = f"""<!DOCTYPE html>
<html><head><title>Watchdog {TAG}</title></head>
<body><p>{TAG} {request_ref}: <span id="datetime"></span></p>
<script>var date = new Date();document.getElementById("datetime").innerHTML = date;
</script></body></html>
"""
    return Response(request, content, content_type="text/html")
TAG = 'simple debug 002'
print(f'{monotonic_ns()}: Starting page {TAG}') # LOG
print(f'{sys.implementation.name} {sys.implementation.version}\n{sys.implementation._machine}')
server.start(str(wifi.radio.ipv4_address))
while True:
    server.poll()
Smaller code fails, but this keeps enough instrumentation/logging to match events (when it is still working).
log and browser capture using poll, 20 minute request interval
log and browser capture using poll, «initial» 5 minute request interval
|
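The idle-interval experiments described above could be automated from a desktop machine. The following is a rough sketch under stated assumptions: `fetch` and `probe` are hypothetical helper names, and the URL is whatever address and port the board reports when the server starts. It only records after which idle period the server stopped answering; it does not detect the watchdog reset itself.

```python
import http.client
import time
import urllib.request


def fetch(url, timeout_s=10):
    """Return True if the server answers with HTTP 200 within timeout_s."""
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as resp:
            return resp.status == 200
    except (OSError, http.client.HTTPException):
        # URLError, timeouts, and refused connections are OSError subclasses.
        return False


def probe(url, idle_minutes, fetch=fetch, sleep=time.sleep, log=print):
    """Request url after each idle period; stop at the first failure.

    Returns the idle period (in minutes) after which the server stopped
    responding, or None if it survived every period.
    """
    for minutes in idle_minutes:
        sleep(minutes * 60)
        ok = fetch(url)
        log(f"idle {minutes} min -> {'ok' if ok else 'NO RESPONSE'}")
        if not ok:
            return minutes
    return None
```

Calling `probe(url, [5, 10, 15, 20, 30])` with the board's URL would step through increasing idle periods. The `fetch` and `sleep` parameters are injectable so the logic can be checked without waiting for real time to pass.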
It appears that this is not going to affect the project in which it was initially encountered. The problem was seen when doing development and testing of separate sections. With everything (so far) merged together, there is never the 'idle time' that is needed to trigger the failure. The project shows the sensor data both on a local display and on web pages. It seems that the processing needed for the display portion is enough to prevent the watchdog timeout seen in server.poll(), even with the web server totally idle (no active connections). That means some sort of heartbeat may be a workaround. It could also mean that figuring out just what that heartbeat needs could point to where the watchdog is being triggered. |
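A minimal sketch of what such a heartbeat loop might look like follows. This is speculative: nothing in this thread establishes which kind of work, if any, actually prevents the watchdog from firing. `poll_with_heartbeat` is a hypothetical helper; on the board, `poll` would be `server.poll` and `heartbeat` would be whatever short task is being tried (a display refresh, for example).

```python
import time


def poll_with_heartbeat(poll, heartbeat, interval_s,
                        clock=time.monotonic, iterations=None):
    """Call poll() continuously and heartbeat() at most once per interval_s.

    iterations limits the loop for testing; None means run forever.
    Returns the number of heartbeats issued.
    """
    beats = 0
    count = 0
    next_beat = clock() + interval_s
    while iterations is None or count < iterations:
        poll()
        if clock() >= next_beat:
            heartbeat()
            beats += 1
            next_beat = clock() + interval_s
        count += 1
    return beats
```

Swapping different candidates in as `heartbeat`, one at a time, is one way to narrow down which activity makes the difference.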
Possibly related #9428 Code from this issue also uses |
Reading over the code provided in that, this does look like exactly the same case: CP 9.1.1, ESP32-S3, server.poll() that is idle 'long enough', watchdog timer. The code sample here is smaller, and could be shrunk further if the logging is not useful. |
If the problem can be prevented by adding some code to |
Here is a bit smaller sketch that can produce the same symptoms. Note that the first cycle had the server stop responding, but on control C abort and control D reload, it restarted successfully. That restart also did not stop responding after a 16 minute idle period, then a 22 minute idle period. It did not respond after a following 27 minute idle period, and the following forced abort and restart showed the watchdog timer expired problem. For most of my previous testing 15 minutes idle time was sufficient to trigger the watchdog timer.
import socketpool
import wifi
from adafruit_httpserver import Server, Request, Response
pool = socketpool.SocketPool(wifi.radio)
server = Server(pool, "/static", debug=True)
@server.route("/")
def base(request: Request):
    content = """<!DOCTYPE html>
<html><head><title>Watchdog check</title></head>
<body><p><span id="datetime"></span></p>
<script>document.getElementById("datetime").innerHTML = (new Date()).toISOString();
</script></body></html>
"""
    return Response(request, content, content_type="text/html")
server.start(str(wifi.radio.ipv4_address))
while True:
    server.poll()
|
@mMerlin I'm experiencing the same issue. ESP32-S3 Feather TFT REV, CP 9.1.1. Basically just connecting to wifi at the beginning of the script and reading the voltage on A4 inside of a while loop. I go into safe mode from the same watchdog timer expiration error after about 20 min. |
I reproduced this once on an Adafruit Feather ESP32-S3 TFT and 9.2.0-alpha.2351-7-g19e5cf3d8f using a very simple test program that eliminates
import errno
import wifi
import socketpool
def wrap_accept(s):
    try:
        return s.accept()
    except OSError as e:
        if e.errno == errno.EAGAIN:
            return None, ("", 0)
        raise
socket = socketpool.SocketPool(wifi.radio)
s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.listen(10)
s.setblocking(False)
while True:
    s1, peerinfo = wrap_accept(s)
    if s1 is not None:
        print(peerinfo)
        s1.close()
I ran this test without even being connected to wifi, and it's also a slightly dubious program because it did not
I'm a little conflicted about posting this info since I only saw a WDT reset once -- there may be some other hidden variable. But I think it's good info that MAYBE this simple script is actually a reproducer. |
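For checking the accept-wrapper logic away from the board, here is a desktop counterpart written against CPython's standard `socket` module rather than `socketpool`. It is a sketch, not the test program above: `make_listener` is a hypothetical helper, and it adds a `bind` call to an ephemeral localhost port, which the CircuitPython program does not have.

```python
import errno
import socket


def wrap_accept(listener):
    """Non-blocking accept: return (None, ("", 0)) when nothing is pending."""
    try:
        return listener.accept()
    except OSError as e:
        # EAGAIN and EWOULDBLOCK share a value on most platforms;
        # checking both is harmless.
        if e.errno in (errno.EAGAIN, errno.EWOULDBLOCK):
            return None, ("", 0)
        raise


def make_listener():
    """Create a non-blocking TCP listener on an ephemeral localhost port."""
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    s.bind(("127.0.0.1", 0))  # port 0: let the OS choose a free port
    s.listen(10)
    s.setblocking(False)
    return s
```

On a desktop this loop can spin indefinitely without consequence, which is consistent with the suspicion that the fault lies in the core's socket handling on the ESP32-S3 rather than in the Python-level pattern, though it does not prove it.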
I did eventually get a 2nd crash with just about the same code. However, this time it was a Hard fault. That run took almost 1 hour to crash.
|
I didn't find any further useful information during today's debugging session, and I'm not planning to work further on this issue right now. |
Moved to 9.x.x because we need a clearer reproducer to debug. |
Adafruit CircuitPython 9.1.1 on 2024-07-22; Adafruit-Qualia-S3-RGB666 with ESP32S3
Board ID:adafruit_qualia_s3_rgb666
I have a script to explore working with SSE. It «seems to» work. I can connect and disconnect browser sessions. However, if I then just leave it running with no connections, and come back later (maybe an hour), the app is frozen. ctrl+c in the serial window aborts and gives:
The first part of that is the final log messages while it was running. This has occurred several times with different versions of the code, including (I think) the httpserver_sse.py example in the repo that I started from. I don't know if it is total run time or total idle time that triggers this.
Adafruit CircuitPython 9.1.1 on 2024-07-22; Adafruit-Qualia-S3-RGB666 with ESP32S3
Board ID:adafruit_qualia_s3_rgb666
Here is code that 'almost' matches what produced the above log. Almost, because the sequence has been: run the script, do some testing, then edit the offline version. In this case, the edit was cleanup to remove some debug messages, tweak the remaining ones, then start working on a fuller html file. I got called away for a while and, when I returned, found it locked up again. Some of those lockups resulted in CIRCUITPY becoming read only. I had to do the file system erase to get it back.
When it locks up, the only thing it is doing is polling, because self.sse_response is empty.
The index.html file referenced is:
I'm not set up to create a debug version of circuitpython, but with a little guidance, I probably could add some logging (with timestamps) to a local copy of HTTPServer.Server
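One low-effort way to get timestamped logging without editing the library might be to wrap the call being polled. The sketch below assumes `time.monotonic_ns` is available (the sample code in this thread already uses it); `timed` is a hypothetical helper name, and the threshold is an arbitrary choice.

```python
import time


def timed(fn, label, threshold_ns, log=print, clock=time.monotonic_ns):
    """Wrap fn so that calls slower than threshold_ns are logged with a timestamp."""
    def wrapper(*args, **kwargs):
        start = clock()
        result = fn(*args, **kwargs)
        elapsed = clock() - start
        if elapsed > threshold_ns:
            log(f"{start}: {label} took {elapsed} ns")
        return result
    return wrapper
```

In the polling loop this could be used as `poll = timed(server.poll, "poll", 50_000_000)` followed by `while True: poll()`. Note the limitation: a call that never returns is never logged, so catching a true hang would also need a log line written before the call.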