Mysterious freeze/stop in the field
I have an ESP32 project built with ESP-IDF, running on an M5Stack Atom board.
I have shipped hundreds of units to production.
Basically, my application connects to a WiFi AP, polls my server for a few minutes, then connects to a Bluetooth peripheral, and loops back.
Some users see the device stop at some point. The shortest time before a stop was a few hours. Some devices have never stopped working.
I have reproduced it once, after 3 days, with monitoring enabled.
They always stop while polling the server, in what is basically a for loop where each iteration retrieves some info from the server. I know that they didn't stop when receiving a particular server response, only the normal empty response ("nothing to do").
In all cases, when restarted, it worked correctly again.
Hypothesis 1 : device freezes
I have checked that both the interrupt watchdog and the task watchdog are enabled and working. The watchdog successfully restarts the device when it is not fed.
Hypothesis 2 : device overheats
I have tested adding a lot of load to make both CPUs run at 100%, and the device did not overheat or stop.
When I reproduced the issue, the measured temperature was OK.
Hypothesis 3 : no more RAM
When no RAM is available anymore, the device reboots correctly.
When I reproduced the issue, there was a lot of free RAM (I logged it).
Tried solution 1 : restart every 24h
I added code to restart it automatically every 24 hours. The issue still occurs.
Tried solution 2 : restart when RAM is low
I added code to restart it automatically when RAM is low. The issue still occurs.
Does someone have an idea what could be happening ? I'm not even sure what state the device is in.
Would adding a custom watchdog with FreeRTOS help ? (restarting if the main loop has not run for a few minutes)
Any other solution ?
Thanks !
-
- Posts: 9766
- Joined: Thu Nov 26, 2015 4:08 am
Re: Mysterious freeze/stop in the field
One of the options you haven't mentioned is that something goes wrong in the middle of polling your server, and somehow your code keeps waiting for some answer while none comes: it's working exactly as coded and the hardware is fully functional (so the wdt doesn't catch you) but your device still isn't doing what you expected it to do.
A custom watchdog, obviously, can help you there. I'd suggest triggering a panic and restart when that happens, if you have some mechanism to store and upload the core dump, so you might be able to see what the code was doing when it timed out.
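A minimal sketch of the bookkeeping behind such an application-level watchdog, assuming the main loop calls a kick function once per completed polling pass and a separate supervisor task checks for expiry. The names `app_wdt_t`, `app_wdt_init`, `app_wdt_kick` and `app_wdt_expired` are hypothetical, and the ESP-IDF wiring is only described in comments, not implemented:

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical application-level watchdog state (illustrative names). */
typedef struct {
    uint32_t last_kick_ms; /* time of the last completed main-loop pass */
    uint32_t timeout_ms;   /* longest tolerated gap between passes */
} app_wdt_t;

/* Arm the watchdog, treating "now" as the first kick. */
static void app_wdt_init(app_wdt_t *w, uint32_t now_ms, uint32_t timeout_ms)
{
    w->last_kick_ms = now_ms;
    w->timeout_ms = timeout_ms;
}

/* Call from the main loop after each completed polling pass. */
static void app_wdt_kick(app_wdt_t *w, uint32_t now_ms)
{
    w->last_kick_ms = now_ms;
}

/* True when the main loop has been silent for longer than timeout_ms.
 * Unsigned subtraction keeps the result correct when the 32-bit
 * millisecond counter wraps (roughly every 49.7 days). */
static bool app_wdt_expired(const app_wdt_t *w, uint32_t now_ms)
{
    return (uint32_t)(now_ms - w->last_kick_ms) > w->timeout_ms;
}

/* Assumed wiring on ESP-IDF (not shown here):
 *  - now_ms could come from (uint32_t)(esp_timer_get_time() / 1000),
 *    since esp_timer_get_time() returns microseconds since boot;
 *  - a separate FreeRTOS task wakes up periodically, calls
 *    app_wdt_expired(), and calls abort() when it returns true;
 *  - if the main loop and the supervisor run on different tasks, the
 *    shared timestamp needs suitable synchronization. */
```

Calling `abort()` on expiry, rather than `esp_restart()`, goes through the panic handler, so a core dump can be captured if core dumps are enabled in the project configuration.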
Re: Mysterious freeze/stop in the field
Hello, thanks for your help,
I have tested your hypothesis, but when simulating a stuck, slow or non-responding server, esp_http_client times out and the program handles it correctly.
I have also implemented a custom watchdog, which does get triggered sometimes in production, but there are still boards that are "stuck" (not rebooted by either the normal watchdog or the custom watchdog).
-
- Posts: 9766
- Joined: Thu Nov 26, 2015 4:08 am
Re: Mysterious freeze/stop in the field
Another thing: what does the power supply look like? Talking to a server may mean the WiFi stuff is in active use, leading to higher power usage. I'm wondering if a flaky power supply could be the cause of this; technically the WDT should stop this from happening, but on the ESP32 we are aware of some circumstances where power fluctuations can lead to an unresponsive chip. (Later chips, like the S2/S3/C3, have a super watchdog and a better brownout detector to handle these cases better.)
Re: Mysterious freeze/stop in the field
The power supply is a USB-A - AC adapter, 5V, 1A.
-
- Posts: 9766
- Joined: Thu Nov 26, 2015 4:08 am
Re: Mysterious freeze/stop in the field
Gotcha. That should be sufficient, but do note that we have experience with USB cables being too thin, leading to a high voltage drop when a fair amount of current is pulled; that would be something you could take a look at. Are there other components being fed from the 3.3V power supply the ESP32 is running off?
Re: Mysterious freeze/stop in the field
Hello,
Thanks for your help !
There is no USB cable: the ESP32 is on a board with a male USB-A plug, which goes directly into the female USB-A port of the adapter.
The only other component on the board is a NeoPixel LED (SK6812 3535), which is always blinking.
After adding our daily reboot + custom high-level watchdog, we now have only a couple of customers that are affected. For them, the device stops working after a few hours, every time they plug it back in.
What I'm getting from our exchange is that you suspect the chip detects a brown-out but does not reset ? That would explain why the hardware watchdog does not trigger ?
-
- Posts: 9766
- Joined: Thu Nov 26, 2015 4:08 am
Re: Mysterious freeze/stop in the field
That is an explanation indeed; I can't say that that is actually what is happening as it's a bit hard to prove, though. Especially given the fact that there's a direct connection to a (supposedly good) power supply, I'm not 100% sure; most of the cases where the watchdog fails are when the device is powered by batteries and people don't take the care to install a power supervisor.
If you want to track it down further, perhaps you can get some of the affected devices back and look at the serial console, to see if it indicates anything when the device crashes? A custom firmware with all the debug failure mode detections (stack overflow, heap corruption checking) may also help if it's a software issue after all.
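If getting to the serial console is not practical, one option is to read the reset reason at boot and report it on the next server poll. A minimal sketch, assuming ESP-IDF's `esp_reset_reason()` from `esp_system.h`; the helper names `reset_reason_name` and `reset_was_abnormal` are hypothetical, and the `#else` branch is only a host-side stand-in so the classification logic can be compiled off-target:

```c
#include <stdbool.h>

#ifdef ESP_PLATFORM
#include "esp_system.h" /* esp_reset_reason(), esp_reset_reason_t */
#else
/* Host-side stand-in (assumption): mirrors only some of the ESP-IDF
 * enumerator names; the numeric values here are not meaningful. */
typedef enum {
    ESP_RST_UNKNOWN,
    ESP_RST_POWERON,
    ESP_RST_SW,
    ESP_RST_PANIC,
    ESP_RST_INT_WDT,
    ESP_RST_TASK_WDT,
    ESP_RST_WDT,
    ESP_RST_BROWNOUT
} esp_reset_reason_t;
#endif

/* Short label suitable for a log line or a field in the next server poll. */
static const char *reset_reason_name(esp_reset_reason_t reason)
{
    switch (reason) {
    case ESP_RST_POWERON:  return "power-on";
    case ESP_RST_SW:       return "software restart";
    case ESP_RST_PANIC:    return "panic";
    case ESP_RST_INT_WDT:  return "interrupt watchdog";
    case ESP_RST_TASK_WDT: return "task watchdog";
    case ESP_RST_WDT:      return "other watchdog";
    case ESP_RST_BROWNOUT: return "brownout";
    default:               return "unknown/other";
    }
}

/* True when the previous run ended in a way worth reporting:
 * a crash, any watchdog reset, or a brownout reset. */
static bool reset_was_abnormal(esp_reset_reason_t reason)
{
    switch (reason) {
    case ESP_RST_PANIC:
    case ESP_RST_INT_WDT:
    case ESP_RST_TASK_WDT:
    case ESP_RST_WDT:
    case ESP_RST_BROWNOUT:
        return true;
    default:
        return false;
    }
}

/* Assumed usage on target, early in app_main():
 *   esp_reset_reason_t r = esp_reset_reason();
 *   if (reset_was_abnormal(r)) { log or queue reset_reason_name(r); }
 */
```

A unit that was unplugged and plugged back in by the user will simply report power-on, so the useful data is the history of reasons from units that restarted on their own before they got stuck.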
Re: Mysterious freeze/stop in the field
Hello,
We're having the same issue: in the test environment everything is fine, but the ESP32s in the field freeze after 12, 24 or 48 hours, sometimes 7 days, and work again after a hard reset.
* The watchdog fails because the CPU freezes.
* I noticed a RAM leak and fixed it; there is no leak now, but I'm not sure yet whether the problem persists.
* Could there be a problem with the voltage? I have a Quectel EC25 connected in parallel to the processor. Could that make the ESP32 crash?
Did you solve your problem? Does mine look like the same problem as yours? Can you share your experience?
-
- Posts: 11
- Joined: Fri Apr 08, 2022 6:02 pm
Re: Mysterious freeze/stop in the field
I have been experiencing this exact same issue. My power source is a dual battery and a solar panel. Some units never fail and others fail unpredictably. I have both the task WDT and the RTC WDT running. Under test conditions I can trigger both, but when the ESP32 fails in the field it requires a hard reset. Since it is battery powered, I can open the unit and probe the chip. It is receiving full power, but the pins are in a state that the program never sets: if the ESP32 had simply become stuck somewhere in its normal operation, the pins would not be in the state I observe when it is locked up. For us it is getting expensive, as customers are shipping units back. I have been asked to find a new MCU to replace the ESP32, but I am hoping that I can find a solution instead. Thanks!