Is it possible to have data corruption just after a soft restart?
I'm having an odd problem I can't explain. I have a board based on an ESP32-WROOM-32U module with 16MB flash and my FW uses Mongoose OS. My firmware seems to run fine for hours, even days, collecting data from the CAN bus, until I restart the ESP32 without disconnecting power. This happens after an OTA or when I update a user configuration parameter, causing a restart. After the restart, the FW does not run correctly. It does not crash, but it seems as if some of the variables become corrupted shortly after the FW starts. I have limited debugging tools, and have not had any luck. The data is all coming in from a CAN bus and I'm using the CAN controller in the ESP32.
Is there any way data or memory state somehow survives across a soft reboot? If I pull the power and restart it, then all is fine until the next soft restart. Perhaps the CAN controller becomes corrupted?
Any thoughts are appreciated, thanks.
Re: Is it possible to have data corruption just after a soft restart?
Double stacks.
Reboot after upload. You have done a firmware upgrade, so a reboot is legit, even desirable, e.g. to prove your software works after boot and after upload (which is why most just reboot). There may be some underlying issue with your firmware that is further 'tempted' by the upload. Double stacks and see.
I would be tempted to just reboot & instrument free task stack size. Anything stack/pointer related is going to be very application specific, so it would be hard to help beyond suggesting you instrument the stack high-water mark.
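For anyone unsure what "instrument the stack high water" means in practice, here is a minimal host-side sketch of the idea. It is a toy model, not FreeRTOS code: the stack is just a byte array, and `stack_paint`, `stack_simulate_use` and `stack_high_water_bytes` are hypothetical names made up for illustration. The only detail borrowed from FreeRTOS is the 0xA5 fill byte. On the target itself the equivalent measurement should come from `uxTaskGetStackHighWaterMark()`, whose units depend on the port.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* 0xA5 is the fill byte FreeRTOS paints task stacks with.
 * Everything else here is a simplified model for illustration. */
#define STACK_FILL_BYTE 0xA5u

/* Paint a (simulated) task stack before the task starts running. */
static void stack_paint(uint8_t *stack, size_t size)
{
    memset(stack, STACK_FILL_BYTE, size);
}

/* Hypothetical helper: pretend the task has used `used` bytes.
 * The stack grows downward, so usage starts at the high addresses. */
static void stack_simulate_use(uint8_t *stack, size_t size, size_t used)
{
    if (used > size) {
        used = size;
    }
    memset(stack + (size - used), 0x00, used);
}

/* Count bytes at the low end that still hold the fill pattern.
 * That is the smallest amount of free stack seen so far. */
static size_t stack_high_water_bytes(const uint8_t *stack, size_t size)
{
    size_t free_bytes = 0;
    while (free_bytes < size && stack[free_bytes] == STACK_FILL_BYTE) {
        free_bytes++;
    }
    return free_bytes;
}
```

Logging this figure per task shortly after boot, and again after a soft restart, would show whether any task is running close to its limit. A value near zero is the cue to double that task's stack.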
EDIT:
Ooops read your post again & missed soft reboot.
Soft reboot will initialise RAM as hard reboot (AFAIK). Soft reboot will not reset devices though. Maybe a device is not happy & that takes you down untested paths.
Sounds weird though.
& I also believe that IDF CAN should be fixed.
Re: Is it possible to have data corruption just after a soft restart?
Thanks very much... I have been working on this quite a bit and it seems that what's happening is that the CAN controller is not working correctly after a soft reset. I did see something about this in the errata but didn't fully understand it. My code uses a CAN driver by Thomas Barth. His notes say that at the time he wrote it, 3 years ago, there was no other driver. Has Espressif released their own driver since then?
Re: Is it possible to have data corruption just after a soft restart?
It's not a full reset, eg. RTC domain is not reset. I don't know enough about CAN to know what effect that might have. Maybe there's some register or variable in RTC memory that is in an unexpected state.
Try a RTC reset instead of esp_restart?
eg.
https://github.com/espressif/esp-idf/bl ... #L330-L335
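On the point about RTC memory holding state in an unexpected condition: one common defensive pattern is to validate anything that is meant to survive a restart before trusting it. The sketch below is an assumption-laden illustration, not code from Mongoose OS or ESP-IDF. The struct layout, the magic value, the checksum and the function names are all invented here. On the ESP32 such a struct would typically live in RTC memory (for example via ESP-IDF's `RTC_NOINIT_ATTR`), but this version is plain C so it can be tried on a PC.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Arbitrary marker chosen for this sketch. */
#define RETAINED_MAGIC 0x52544331u

/* Hypothetical state that is meant to survive a soft restart. */
typedef struct {
    uint32_t magic;
    uint32_t boot_count;
    uint32_t checksum;   /* covers magic and boot_count */
} retained_state_t;

/* Simple integrity check; not cryptographic, just a tripwire. */
static uint32_t retained_checksum(const retained_state_t *s)
{
    return (s->magic ^ 0xA5A5A5A5u) + s->boot_count * 2654435761u;
}

/* Call early at boot. Returns true if the previous contents were
 * trusted, false if they were rejected and reset to defaults. */
static bool retained_state_boot(retained_state_t *s)
{
    bool valid = (s->magic == RETAINED_MAGIC) &&
                 (s->checksum == retained_checksum(s));
    if (!valid) {
        s->magic = RETAINED_MAGIC;
        s->boot_count = 0;
    }
    s->boot_count++;
    s->checksum = retained_checksum(s);
    return valid;
}
```

If a check like this fails only after soft restarts, that points at something scribbling on retained memory rather than at the CAN controller.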
Re: Is it possible to have data corruption just after a soft restart?
Also, fwiw, we have a driver in ESP-IDF that is fully supported and maintained by now. You may want to switch to that instead of using 3-year-old unmaintained code.
Re: Is it possible to have data corruption just after a soft restart?
ESP_Sprite wrote: Fri Aug 07, 2020 7:31 am
Also, fwiw, we have a driver in ESP-IDF that is fully supported and maintained by now. You may want to switch to that instead of using 3-year-old unmaintained code.
I'll definitely look at that. The driver is part of a bigger CAN library I'm using, but it's OS so I can look at changing it.
But may I ask, is the CAN controller not working after a soft restart a known issue and if so, is it addressed by your driver? If I can't somehow keep the CAN bus working after a soft reset I'll be screwed...
Re: Is it possible to have data corruption just after a soft restart?
CAN not working after a soft reset is a driver issue rather than the controller hardware.
It is the driver's responsibility to initialise cleanly, so the latest driver is a good idea.
But yes, there are mechanisms in the original ESP CAN driver which could explain this. The most obvious mechanism would be overflow. A CAN overflow used to freeze CAN frame capture: the condition was detected and the frame discarded, but the condition was not cleared. Once overflow, then always overflow. This was fixed in IDF perhaps a year ago.
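To make the "once overflow, then always overflow" mechanism concrete, here is a toy model that runs on a PC. It is only a sketch of the failure pattern described above, not the real controller or either driver: the FIFO depth, the `can_model_t` struct and all function names are assumptions made for illustration. The `clear_overrun` flag switches between an ISR that never acknowledges the overrun condition and one that does.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define RX_FIFO_SLOTS 4u   /* assumed depth, for illustration only */

/* Toy model of a receive FIFO with a latched overrun flag. */
typedef struct {
    uint32_t fifo[RX_FIFO_SLOTS];
    size_t   count;
    bool     overrun;
} can_model_t;

/* "Hardware" side: a frame arrives from the bus. */
static void hw_frame_arrives(can_model_t *c, uint32_t id)
{
    if (c->count == RX_FIFO_SLOTS) {
        c->overrun = true;          /* no room: frame lost, flag latched */
        return;
    }
    c->fifo[c->count++] = id;
}

/* "Driver" side: one RX interrupt. Returns true if a frame reached
 * the application. With clear_overrun == false the flag is never
 * acknowledged, so every later frame is discarded as well. */
static bool driver_rx_isr(can_model_t *c, bool clear_overrun, uint32_t *out)
{
    if (c->count == 0) {
        return false;
    }
    uint32_t id = c->fifo[0];
    for (size_t i = 1; i < c->count; i++) {
        c->fifo[i - 1] = c->fifo[i];   /* release the oldest slot */
    }
    c->count--;

    if (c->overrun) {
        if (clear_overrun) {
            c->overrun = false;        /* acknowledge the condition */
        }
        return false;                  /* this frame is discarded */
    }
    *out = id;
    return true;
}

/* Overflow the FIFO while nobody services it (as during a restart),
 * then run the ISR normally for n more frames. Returns how many
 * frames the application actually received. */
static unsigned run_after_overflow(bool clear_overrun, unsigned n)
{
    can_model_t c = {0};
    uint32_t out = 0;
    unsigned delivered = 0;

    for (uint32_t id = 0; id < RX_FIFO_SLOTS + 1u; id++) {
        hw_frame_arrives(&c, id);
    }
    while (c.count > 0) {
        if (driver_rx_isr(&c, clear_overrun, &out)) {
            delivered++;
        }
    }
    for (unsigned i = 0; i < n; i++) {
        hw_frame_arrives(&c, 100u + i);
        if (driver_rx_isr(&c, clear_overrun, &out)) {
            delivered++;
        }
    }
    return delivered;
}
```

In this model `run_after_overflow(false, 10)` delivers nothing at all, while `run_after_overflow(true, 10)` loses one frame and then recovers. That matches the symptom of a bus that looks dead until power is cycled.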
There are some other driver issues which you also need to be aware of. The driver might supply corrupt data e.g. on bit frame error and overflow recovery. EDIT: Most of these are silicon errata which were published after the original CAN driver. So you will have to upgrade.
There are (EDIT: IDF) patches (see my posts), not sure if the patches have entered IDF yet & I had some reservations, hence the signature.
I would suspect that if the controller has been initialised and then you reboot without device reset then you are seeing an overflow. EDIT: Controller is running but no one is processing the RX interrupt requests.
Increase your CAN bus frame rate and you should see the same effect. I could generate about 80% bus load at 250 kbps, but then I have Ethernet as well, which seems to hog the CPU from an interrupt for 2 ms every now and then, and that helps generate the overflow.
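A rough back-of-envelope check on those numbers, as a sketch only. The frame size and FIFO figures below are assumptions that are not stated in this thread: a standard 11-bit-ID data frame with 8 data bytes and no stuff bits (108 frame bits plus 3 intermission bits), and a 64-byte receive FIFO in which each such frame takes 11 bytes. The function names are made up for illustration.

```c
#include <stdint.h>

/* Assumptions for this sketch (not from the thread):
 * - 11-bit ID data frame, 8 data bytes, no stuff bits:
 *   108 frame bits + 3 intermission bits = 111 bits on the wire
 * - each such frame occupies 11 bytes of a 64-byte RX FIFO */
#define BITS_PER_FRAME  111u
#define FIFO_BYTES      64u
#define BYTES_PER_FRAME 11u

/* Whole frames that arrive while the CPU is busy elsewhere. */
static uint32_t frames_during_stall(uint32_t bitrate_bps,
                                    uint32_t load_percent,
                                    uint32_t stall_us)
{
    uint64_t busy_bits = (uint64_t)bitrate_bps * load_percent * stall_us
                         / (100u * 1000000u);
    return (uint32_t)(busy_bits / BITS_PER_FRAME);
}

/* Whole frames the FIFO can hold before it overruns. */
static uint32_t fifo_capacity_frames(void)
{
    return FIFO_BYTES / BYTES_PER_FRAME;
}
```

Under these assumptions, `frames_during_stall(250000, 80, 2000)` gives 3 frames against a capacity of 5, so a single 2 ms stall is survivable but two close together, or one 4 ms stall (7 frames), would overrun. During a restart nobody services the FIFO for far longer than that.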
EDIT: By which I mean you cannot escape this issue without a driver change, e.g. if 200 hours in you get a CAN source 'beat' which leads to an overflow (in part because some lazy arse ISR has blocked your CAN driver) then you will never receive a CAN frame again! Amazed me that there were no posts on the subject and that the driver was open to this issue. Guess stress testing is not popular or all the cool kids go direct to commercial support.
You can test frame corruptions by making/breaking your termination resistor. I had my termination through a button switch and hitting the switch like 80's 'Track and Field' allowed corruptions through quick enough.
IMHO the CAN driver should get a little more support.
& I also believe that IDF CAN should be fixed.
Re: Is it possible to have data corruption just after a soft restart?
Thank you Peter this is great information! I'll study the Espressif driver and see if I can use it in my code. I'm very relieved to hear that this issue can be handled. Thanks again.
Re: Is it possible to have data corruption just after a soft restart?
You must upgrade. No question, none.
Even if using latest IDF then look at my posts & check that you have the ESP CAN silicon driver patches included.
Your original driver is very much broken. That's not a criticism: the hardware was not documented & more power to the coder for getting CAN off the ground. Add in the recent silicon issue discoveries & you will see the need to upgrade. (Note that IDF master may not have all the latest patches and, as per my post, I have reservations that we are done.)
But I would bet a beer that this is overflow; just don't ignore the occasional (unless you have a 'trac n field' button) frame corruptions which will follow in your more marginal cable quality installs.
& I also believe that IDF CAN should be fixed.
Re: Is it possible to have data corruption just after a soft restart?
Oh I'm taking this very seriously. My product will be used, in part, for remote monitoring. Being able to reconfigure and soft reset reliably and remotely is critical. Drivers are outside most of my experience so it's taking me some time. I'm currently looking at the TWAI source modules on GitHub. I'll be sure to look at the non-master branches as well. But my work is using Mongoose OS, so I'll need to take the driver code from ESP-IDF and move it to a library my app can use.
Quote: Even if using latest IDF then look at my posts & check that you have the ESP CAN silicon driver patches included.
"silicon driver patches" ?? Do you mean there are patches for the ESP32 itself?
Quote: But I would bet a beer that this is overflow; just don't ignore the occasional (unless you have a 'trac n field' button) frame corruptions which will follow in your more marginal cable quality installs.
Thanks again!