Unexplained CAN Errors

Post by **plajjd** » Thu May 07, 2020 7:32 pm

We have an after-market product that interfaces to an existing 500 kbit/s CAN network. We are using the ESP32-WROOM-32 with an external CAN transceiver on the board. We do not transmit frames on the bus; we only listen to the existing traffic.

When we attach to the CAN bus, our ESP32 spews "CAN_ALERT_BUS_ERROR" errors continually. However, no other device on the bus shows any errors. The ONLY device showing the errors is our ESP32. (We also tried adding a CAN sniffer (PCAN) to the bus, and it does not report any errors either.)

If we then purposely reboot our ESP32 (by initiating an intentional watchdog timeout), the bus errors STOP, and the ESP32 seems to work correctly. But if we disconnect, and re-connect (or even just power cycle the entire bus) - we show continual CAN_ALERT_BUS_ERROR errors.

We have tried changing the various CAN timing configurations - changing jump-width and the size of the various segments - to no avail. The initial access to the bus always results in continual bus errors.

We are using IDF 3.3.

Thanks for your help!

Post by **dmaxben** » Thu May 07, 2020 10:38 pm

What does your code look like for initializing the CAN controller, and reading messages?

You're sure the CAN transceiver is good and wired correctly? Which CAN transceiver are you using?

Post by **plajjd** » Fri May 08, 2020 2:41 pm

We are using transceiver SN65HVD232DR. The initialization is basically just taken from the sample code.


#include "driver/can.h"   // ESP-IDF v3.3 CAN driver
#include "esp_log.h"

// PKT_1_ID, PKT_2_ID and TAG are defined elsewhere in the project.
void can_configure( void )
{
    // Initialize config structures
    can_general_config_t g_config = CAN_GENERAL_CONFIG_DEFAULT(GPIO_NUM_5, GPIO_NUM_4, CAN_MODE_NORMAL);
    can_timing_config_t t_config = CAN_TIMING_CONFIG_500KBITS();

    // A 1 bit in the acceptance mask means "don't care", so the XOR marks every bit where the two IDs differ.
    static uint32_t can_acceptance_mask = (PKT_1_ID ^ PKT_2_ID);

    // To get the acceptance code, we OR all the IDs together.
    static uint32_t can_acceptance_code = PKT_1_ID | PKT_2_ID;

    can_filter_config_t f_config = {.acceptance_code = can_acceptance_code << 3,    // Bit shift by 3 since EXT IDs are 29 bits but this is a 32 bit register
                                    .acceptance_mask = can_acceptance_mask << 3,
                                    .single_filter = true};

    // Install CAN driver
    if (can_driver_install(&g_config, &t_config, &f_config) == ESP_OK) {
        ESP_LOGV(TAG, "Driver Installed");
    } else {
        ESP_LOGE(TAG, "Failed to install driver");
        return;
    }
}

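For readers following the filter setup above, here is a minimal host-side sketch of how a single-filter acceptance code and mask for two extended IDs can be derived and checked. It assumes the register layout described in the ESP-IDF documentation (29-bit ID in the upper bits, followed by RTR and two unused bits). The actual packet IDs were not given in the post, so any IDs passed in are hypothetical, as are the names `ext_filter_t`, `make_filter` and `filter_accepts`. Unlike the posted code, the sketch also marks the three low bits as "don't care", which is an assumption about what is wanted.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical container for a single-filter configuration (extended frames). */
typedef struct {
    uint32_t code;  /* acceptance code, ID already shifted left by 3 */
    uint32_t mask;  /* acceptance mask, 1 = don't care */
} ext_filter_t;

/* Build a filter that lets both 29-bit IDs through. */
ext_filter_t make_filter(uint32_t id_a, uint32_t id_b)
{
    ext_filter_t f;
    uint32_t differ = id_a ^ id_b;      /* bits where the two IDs disagree */

    /* Masked bits are ignored, so clearing them in the code is equivalent
       to OR-ing the IDs together as in the original post. */
    f.code = (id_a & ~differ) << 3;

    /* Assumption: also ignore RTR and the two unused low bits. */
    f.mask = (differ << 3) | 0x7u;
    return f;
}

/* Model of the hardware comparison: every non-masked bit must match. */
bool filter_accepts(ext_filter_t f, uint32_t id)
{
    uint32_t rx = id << 3;
    return ((rx ^ f.code) & ~f.mask) == 0;
}
```

A filter built this way accepts every ID that differs from the two targets only in the masked bits. For example, with 0x18FEF100 and 0x18FEF200 as the two IDs, 0x18FEF000 and 0x18FEF300 pass as well, so the application still has to check the ID of each received frame.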

Post by **plajjd** » Fri May 08, 2020 9:42 pm

More information: We found that if we start the CAN driver, stop it, wait 100 ms, and then start it a second time, the bus errors go away.

For some reason we have to START, then STOP, then START the CAN driver a 2nd time.

We are using `can_start()` and `can_stop()`.

Has anyone else experienced something like this?

Thanks!

Post by **ESP_Dazz** » Sat May 09, 2020 7:25 am

plajjd wrote: For some reason we have to START, then STOP, then START the CAN driver a 2nd time.

We are using `can_start()` and `can_stop()`.

Has anyone else experienced something like this?

This is definitely not normal behavior. The CAN driver should only require one call to START the CAN driver.

plajjd wrote: We have an after-market product that interfaces to an existing 500 kbit/s CAN network. We are using the ESP32-WROOM-32 with an external CAN transceiver on the board. We do not transmit frames on the bus; we only listen to the existing traffic.

When we attach to the CAN bus, our ESP32 spews "CAN_ALERT_BUS_ERROR" errors continually. However, no other device on the bus shows any errors. The ONLY device showing the errors is our ESP32. (We also tried adding a CAN sniffer (PCAN) to the bus, and it does not report any errors either.)

If we then purposely reboot our ESP32 (by initiating an intentional watchdog timeout), the bus errors STOP, and the ESP32 seems to work correctly. But if we disconnect, and re-connect (or even just power cycle the entire bus) - we show continual CAN_ALERT_BUS_ERROR errors.

"no other device on the bus shows any errors". This is definitely not normal either. If a CAN node detects a bus error, it transmits an error frame, which causes all other CAN nodes on the bus to detect an error as well. Can you attach a logic analyzer to the H and L lines of the CAN bus and verify whether or not error frames are being sent on the bus?

A few things you could try:

Try disconnecting the ESP32 + transceiver from the bus, and run the Self Test Example to verify that the ESP32 is correctly connected to the SN65HVD232DR.

Try using CAN_MODE_LISTEN_ONLY so that the ESP32 will not attempt to send anything from its TX pin.

Post by **plajjd** » Thu May 14, 2020 10:28 pm

Thank you for the response. We verified that the self test works, and also that the bus has some error frames present even without the ESP32 attached. At this point, it appears the issue is NOT related to the ESP32.

Post by **PeterR** » Thu May 14, 2020 11:31 pm

Well...
I am also tracing ESP CAN weirdness.
I regularly see:
(1) Other devices receiving and functioning whilst the ESP is BUS OFF. I note that the bus does have a high rate of errors but the ESP is solid fail (driver does not pass on RX frames, TX may happen but I need to confirm), whilst other devices muck on.
(2) That error conditions are related to where the device is on the bus - ok, so electrical can give weird effects.
(3) Overrun errors (which the vanilla IDF does not process) even at 30% bus load (250 kbps, so say a 5-frame buffer and therefore a failure to process within roughly 2.5 ms?). CPU load is fine at >80%. Trying to get to the bottom of the >2.5 ms lag in the RX service.

I also have other hardware-related issues (I2C) following a switch to IDF 4.1. It's hard to diagnose as the IDF changed radically, but something ain't right. If I switch CAN off, the I2C errors disappear...

Post by **ESP_Dazz** » Fri May 15, 2020 3:57 am

PeterR wrote: (1) Other devices receiving and functioning whilst the ESP is BUS OFF. I note that the bus does have a high rate of errors but the ESP is solid fail (driver does not pass on RX frames, TX may happen but I need to confirm), whilst other devices muck on.

Is the ESP CAN actually attempting to transmit anything? Bus-off only occurs when the TEC >= 256, so it can only be reached if the ESP CAN was actually transmitting and errors occurred whilst transmitting. What happens when you program the ESP CAN to only receive messages in Listen Only Mode? Does it still fail to pass on RX frames?
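The bus-off rule mentioned above can be illustrated with a small host-side model of the CAN fault-confinement counters. It is a simplified sketch, not the ESP32 driver: it assumes the standard rules (TEC +8 per transmit error, -1 per successful transmission, bus-off at TEC >= 256, error-passive when either counter reaches 128) and ignores the special cases in the specification. All type and function names are made up for illustration.

```c
#include <stdint.h>

typedef enum { ERROR_ACTIVE, ERROR_PASSIVE, BUS_OFF } can_state_t;

typedef struct {
    uint16_t tec;  /* transmit error counter */
    uint16_t rec;  /* receive error counter */
} can_counters_t;

/* Derive the fault-confinement state from the two counters. */
can_state_t can_state(const can_counters_t *c)
{
    if (c->tec >= 256) return BUS_OFF;
    if (c->tec >= 128 || c->rec >= 128) return ERROR_PASSIVE;
    return ERROR_ACTIVE;
}

/* A transmit error adds 8 to the TEC (simplified rule). */
void on_tx_error(can_counters_t *c)   { if (c->tec < 256) c->tec += 8; }

/* A successful transmission subtracts 1 from the TEC. */
void on_tx_success(can_counters_t *c) { if (c->tec > 0) c->tec -= 1; }

/* A receive error only touches the REC (simplified: +1, saturating at 255). */
void on_rx_error(can_counters_t *c)   { if (c->rec < 255) c->rec += 1; }

/* Count consecutive transmit errors needed to go from a fresh node to bus-off. */
int tx_errors_until_bus_off(void)
{
    can_counters_t c = {0, 0};
    int n = 0;
    while (can_state(&c) != BUS_OFF) {
        on_tx_error(&c);
        n++;
    }
    return n;
}
```

In this model a node that never transmits can become error-passive through receive errors, but it cannot reach bus-off, which is the point of the question about whether the ESP32 is actually transmitting.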

PeterR wrote: (2) That error conditions are related to where the device is on the bus - ok, so electrical can give weird effects.

Could it be a bit timing issue due to propagation delay? How long are the wires in your CAN bus? Would it be possible to estimate the propagation delay between nodes and between the TX and RX pins of the ESP CAN?
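As a rough aid for that estimate, the sketch below compares a round-trip delay against the bit time. It assumes a typical cable delay of about 5 ns per metre and takes the transceiver loop delay (TX plus RX) as a parameter that has to come from the transceiver datasheet; the 150 ns used in the usage note is only a placeholder. The function names are hypothetical.

```c
#include <stdint.h>

/* Assumption: typical twisted-pair propagation delay of about 5 ns per metre. */
#define CABLE_DELAY_NS_PER_M 5u

/* Nominal bit time in nanoseconds for a given bit rate in bit/s. */
uint32_t bit_time_ns(uint32_t bitrate)
{
    return 1000000000u / bitrate;
}

/* Round trip between two nodes: out and back over the cable, passing
   through one transceiver's TX+RX loop delay in each direction. */
uint32_t round_trip_ns(uint32_t cable_m, uint32_t xcvr_loop_ns)
{
    return 2u * (cable_m * CABLE_DELAY_NS_PER_M + xcvr_loop_ns);
}

/* Round trip as a share of one bit time, in parts per thousand. */
uint32_t round_trip_permille(uint32_t cable_m, uint32_t xcvr_loop_ns,
                             uint32_t bitrate)
{
    return round_trip_ns(cable_m, xcvr_loop_ns) * 1000u / bit_time_ns(bitrate);
}
```

With a 1 m cable and the placeholder 150 ns loop delay, `round_trip_ns(1, 150)` gives 310 ns, under 8% of the 4000 ns bit time at 250 kbps. The round trip has to fit within the propagation segment of the bit timing, so on a bus this short the cable itself contributes very little.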

PeterR wrote: (3) Overrun errors (which the vanilla IDF does not process) even at 30% bus load (250 kbps, so say a 5-frame buffer and therefore a failure to process within roughly 2.5 ms?). CPU load is fine at >80%. Trying to get to the bottom of the >2.5 ms lag in the RX service.

At 30% bus load, that definitely doesn't sound right. Any chance there's something in the software blocking the CAN driver ISR from running (e.g., a long critical section from some other software component)?
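The 2.5 ms figure quoted above can be sanity-checked with a little arithmetic. The sketch below computes how long it takes for a burst of back-to-back frames to fill a receive buffer. It assumes 8-byte payloads and ignores stuff bits, so the per-frame bit counts (111 for a standard frame, 131 for an extended frame, each including the 3-bit interframe space) are lower bounds for that payload size. The 5-frame buffer depth is PeterR's estimate, not a verified hardware figure, and the function name is hypothetical.

```c
#include <stdint.h>

/* Bits on the wire for an 8-byte data frame, including the 3-bit
   interframe space and excluding stuff bits. */
#define STD_FRAME_BITS_8B 111u  /* 11-bit identifier */
#define EXT_FRAME_BITS_8B 131u  /* 29-bit identifier */

/* Time in microseconds for `frames` back-to-back frames at `bitrate` bit/s. */
uint32_t fill_time_us(uint32_t frames, uint32_t bits_per_frame, uint32_t bitrate)
{
    return (uint32_t)(((uint64_t)frames * bits_per_frame * 1000000u) / bitrate);
}
```

At 250 kbps, `fill_time_us(5, STD_FRAME_BITS_8B, 250000)` gives 2220 us and `fill_time_us(5, EXT_FRAME_BITS_8B, 250000)` gives 2620 us, which brackets the roughly 2.5 ms estimate. An average bus load of 30% does not rule out short bursts of back-to-back frames, so this is the window within which the RX service would have to run.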

Post by **PeterR** » Sat May 16, 2020 11:46 pm

Hi,
See my other posts. There do seem to be significant & long-standing issues with CAN & related issues with other drivers in 4.1.
To summarise:
(1) IDF does not process overrun errors despite overrun causing a 0x88 byte corruption in the frame. There is a cryptic comment about overrun recovery & hardware limitations.
(2) Bus conditions, for example fretting the termination resistor, will allow or induce frame corruptions to get past IDF & especially at higher frame rate. This is since early beta 4.0.
(3) In 4.1, overrun errors (even at 250 kbps) are related to Ethernet and I2C use. Now maybe these other ISRs are taking >2 ms (that certainly is not my expectation of reasonable) & I should be able to bump priority, but it's not great out-of-the-box behaviour, especially as IDF will return an overrun frame without an error code & containing corruptions (usually 0x88).

To answer your questions:
(1) Yes, I am trying to transmit but in 'fault mode' (whatever that is) RX stalls yet TX continues. Ok TX just goes for it until success or OFF. When I try to induce with bus faults I see maybe 50% RX on other devices (& after swapping positions) yet <<10% ESP.
(2) Yes, bit propagation is my 'sleep at night' thought. I have not compared ESP bit sample pattern yet & am well aware of the difference that may make. Yet the ESP is consistently worse than other devices. The ESP will RX 'blank' and will allow corrupt frames through to the application (not just overrun 0x88 but other patterns; now I did add a ll clear overrun, which the comment suggests is not a good idea, but hey!). Same results on EVB.
EDIT: Really short 1m cables ATM. Darn..
(3) No, it is not right. I reran and maybe 60% load is a better proof. Still, at 250 kbps that's a 2 ms gap in my software's ISR's life, which is just plain crazy. Disabling Ethernet resolves it. I2C issues may also be related. I am seeing 2 ms+ clock stretch on I2C, which kind of fits the CAN overrun. So what the hell is going on in Ethernet? Most micros' Ethernet is swap DMA buffers, raise semaphore, jog on. Maybe take a timestamp if PTP.
As posted I have a plump application so perhaps cache hits are related.
EDIT: To clarify - overruns can be resolved/improved by disabling Ethernet, but frame corruptions when toggling termination still get through.
I can try listen mode, but I already know that IDF will pass on corrupt frames: either 0x88 on overrun, or other patterns when inducing bus faults (e.g. vibrating the termination resistor connection), & my application has to TX.
I am looking for a discriminator such that I may reject trash frames. Jury is out on better RX immunity; there may be an issue, but my sole test ATM is to wiggle the termination resistor, which is actually a system fault. Getting long cables soon, so let's hope no further issues.
PS
Worry that GPIO mux is involved (I do not believe that the posted calculations hold in the real world). Blind check, what pins are we supposed to use? Also what CPU clock were CAN tests run on?

Post by **PeterR** » Mon May 18, 2020 12:06 am

Hi, Any further thoughts on latency, CAN overruns, Ethernet & especially I2C corruption seemingly arising from 4.0 to 4.1 IDF? Or frame corruption arising from termination issues, which precedes IDF 4.1?
plajjd: Please note that you will receive corrupt CAN frames when using stock IDF. Overrun effects are reasonably well documented in forums outside of ESP; less clear is why the overrun arises in the first place, and especially which IDF factors determine the rate of this issue (aside from obvious application faults). I believe that you may also receive corrupt frames when bus termination is 'wiggled' & that this issue is not documented & AFAIK has no workaround.
One conclusion may be that ESP CAN is generally 'bust' & especially if there is any criticality to RX data. Certainly ESP stock IDF is 'bust' with respect to overrun errors - they are not processed and the IDF will, on overrun, pass you a corrupt frame. Actually several corrupt frames apparently - I only see one.
Tread carefully.
