I2C corruption when using ESP CAN (IDFGH-3307)

Post by **PeterR** » Mon May 11, 2020 4:46 pm

release/v4.1 SHA-1: 5dbabae9dc32da3e639a0531b5337848a9e3317a

The I2C bus is used to read GPIO inputs and is properly pulled up. Every now and then an incorrect GPIO port value is returned. Reducing the I2C clock does not help.
The error happens every hour or so with a properly terminated CAN bus. Removing the CAN termination resistors increases the error rate significantly, to maybe one event per minute. Switching the ESP CAN off (can_driver_uninstall) possibly removes the error entirely.
I get the impression that the I2C module freezes during the event: I tried to send a logic analyser trigger command down the I2C bus and that took several attempts. TBC.

Nothing glitchy on the I2C bus.
This issue does not happen on v4.0-dev-562-g2b301f53e. Our application differences between the IDF builds are minor, mainly network stack related (which was the point of the IDF upgrade).

There are a lot of HAL/lld restructuring changes between the IDF versions and so it's hard to find the material changes.

Post by **PeterR** » Thu May 14, 2020 4:39 pm

Enclosed is an I2C trace. The trace is I2C legal & shows a GPIO write.
The trace is odd though, as there is a long (> 2 ms) pause between address and register (1st and 2nd byte).

Why would that be so?
I have not dug deep into the driver but would imagine that I2C has a FIFO and that the FIFO is loaded within a critical section.
If there is no FIFO then perhaps it is loaded from an ISR. So why such a long gap?
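
For scale, the gap can be compared with normal byte timing. The following is a minimal sketch, assuming the 100 kHz bus clock from the I2C configuration posted further down the thread and the usual 9 SCL clocks per I2C byte (8 data bits plus the ACK bit). The function names are illustrative only and are not part of ESP-IDF.

```c
#include <assert.h>
#include <stdint.h>

/* Time on the wire for one I2C byte in microseconds: 8 data bits + 1 ACK bit. */
static uint32_t i2c_byte_time_us(uint32_t scl_hz)
{
    assert(scl_hz > 0);
    return 9u * 1000000u / scl_hz;
}

/* Number of whole byte slots that would fit into an idle gap. */
static uint32_t i2c_whole_bytes_in_gap(uint32_t gap_us, uint32_t scl_hz)
{
    return gap_us / i2c_byte_time_us(scl_hz);
}
```

With these assumptions `i2c_byte_time_us(100000u)` is 90 us, so a 2000 us pause corresponds to about 22 byte times of idle bus between two bytes that would normally be back to back.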

Post by **ESP_Angus** » Fri May 15, 2020 3:53 am

Hi Peter,

I've added an ID to the topic as we're tracking this internally. I have a few questions to help us debug:

- Is the ESP32 acting as the I2C master or the slave in these examples? I wasn't certain from the description so far.
- Can you please give an example of the on-the-wire I2C protocol, and an example of the kind of corrupt result which you get instead of the correct result?
- Is there any correlation between the 2ms gap and the corrupt results, or is this gap seen on all I2C operations (correct and corrupted)?
- Is it possible to post any code snippets to show how the I2C driver is being configured and used, please?
- Is anything happening in the app firmware that might delay interrupt processing, for example writes to internal SPI flash?
- If you add the ESP_INTR_FLAG_IRAM flag to the intr_alloc_flags parameter passed when registering the I2C driver, does the 2ms delay improve and/or the problem resolve?

Post by **PeterR** » Fri May 15, 2020 11:06 am

Thanks.
1) I2C master
2) The corruption is on a 1 byte I2C GPIO read cycle (W0x40 0x01 R0x41 0x??), rxBuffer[0] = corrupt
GPIO read is every 50 ms. There may also be a GPIO write every 50 ms from another thread, although the corruption does not seem related to it.

Code:

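    // Build the command link: write the register address, then read back.
    i2c_cmd_handle_t cmd = i2c_cmd_link_create();

    // Write phase: address the device and select the register.
    i2c_master_start(cmd);
    i2c_master_write_byte(cmd, slaveAddress | WRITE_BIT, ACK_CHECK_EN);
    i2c_master_write_byte(cmd, memAddress, ACK_CHECK_EN);

    // Repeated start, then the read phase.
    i2c_master_start(cmd);
    i2c_master_write_byte(cmd, slaveAddress | READ_BIT, ACK_CHECK_EN);

    // ACK every byte except the last, which is NACKed to end the read.
    if (theSize > 1) {
        i2c_master_read(cmd, rxBuffer, theSize - 1, ACK_VAL);
    }
    i2c_master_read_byte(cmd, rxBuffer + theSize - 1, NACK_VAL);
    i2c_master_stop(cmd);

    // Execute the queued transaction under the bus mutex (1 s timeout).
    esp_err_t ret;
    {
        std::unique_lock<std::mutex> lock(_i2cMutex);
        ret = i2c_master_cmd_begin(*(i2c_port_t *)i2cHandle, cmd, 1000 / portTICK_RATE_MS);
    }
    i2c_cmd_link_delete(cmd);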

3) The clock stretching is occasional as are the corruptions. It is not clear if there is a connection between the two.
The device is configured with the low nibble as outputs and the high nibble as inputs. The corruption generally shows as 0xD4 when 0x8F is expected. I have also seen 0x00 when 0x?F is expected.

4) I2C configuration:

Code:

#define I2C_SDA_GPIO            GPIO_NUM_14
#define I2C_SCL_GPIO            GPIO_NUM_32
#define I2C_PORT_NUM            I2C_NUM_0

#define I2C_FREQUENCY_HZ        (100*1000)           // 100 kHz
#define I2C_GPIO_PULLUP         GPIO_PULLUP_DISABLE  // internal pull-ups off

    i2c_port_t i2c_master_port = I2C_PORT_NUM;

    i2c_config_t conf;

    conf.mode = I2C_MODE_MASTER;
    conf.sda_io_num = I2C_SDA_GPIO;
    conf.sda_pullup_en = I2C_GPIO_PULLUP;
    conf.scl_io_num = I2C_SCL_GPIO;
    conf.scl_pullup_en = I2C_GPIO_PULLUP;
    conf.master.clk_speed = I2C_FREQUENCY_HZ;

    i2c_param_config(i2c_master_port, &conf);

    // Master mode needs no slave RX/TX buffers; the last argument is
    // intr_alloc_flags, here 0 (ESP_INTR_FLAG_IRAM not set).
    return i2c_driver_install(i2c_master_port, conf.mode, 0, 0, 0);

5) The application can write to flash but does not. FatFS has been created but is not being used at the point of fault (fwrite() is only on user command).
The application is large and cache misses are likely frequent.
I am using C++11 constructs for the most part rather than FreeRTOS directly.

6) Will let you know about the flags. I had added IRAM_ATTR manually to the driver without improvement.

Note that I believe I am also getting > 2 ms blocks in the CAN driver, as I get overflows while CPU load is fine. The I2C fault is far more likely when the CAN bus is improperly terminated (so there are CAN frame errors). Disabling the CAN driver entirely seems to avoid the I2C issue, although that is a matter of statistics and I don't have much kit to run with.
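
One way to catch the 0xD4-style reads from point 3 in software is a plausibility check on the output bits. This is only a sketch: it assumes the expander's input register also reflects the levels of pins configured as outputs (worth confirming against the TCA6408A datasheet), so the low nibble of a read should match what was last written. `gpio_read_is_plausible` and its parameters are hypothetical names, not part of the application or ESP-IDF.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Returns true if the bits selected by output_mask read back the same
 * as the value last written to the outputs.
 * Assumed layout from the thread: low nibble outputs, so output_mask = 0x0F.
 */
static bool gpio_read_is_plausible(uint8_t read_value,
                                   uint8_t last_written,
                                   uint8_t output_mask)
{
    assert(output_mask != 0); /* a zero mask would accept every value */
    return (read_value & output_mask) == (last_written & output_mask);
}
```

With the outputs last written as 0x0F and a mask of 0x0F, a read of 0x8F passes while 0xD4 and 0x00 are rejected, so the caller could retry the read. A corruption that happens to leave the low nibble intact would still slip through, and the check detects the fault rather than explaining it.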

Post by **PeterR** » Fri May 15, 2020 3:53 pm

Note: By 'error' I mean the 0xD4 GPIO reading rather than clock stretching.

Binary chopping:
Not initialising ethernet seems to significantly reduce the error rate. Also I no longer get CAN overflow errors even at significant frame rates (>1600 fps).

I use the IDF DHCP client, mDNS, HTTP server, the odd UDP socket one of which is multicast. I register for Ethernet events so that I can restart the web application on disconnect/connect.
I would have thought that Ethernet would be DMA and very quick.

Post by **PeterR** » Fri May 15, 2020 11:35 pm

Just to note that I am not checking the clock stretch frequency with the IRAM attributes. I don't care too much about the stretch, as long as it is legal and not a cause of GPIO faults or significant CAN overruns.
There is evidence that Ethernet causes CAN overruns (EDIT: and 0xD4 faults). I am not clear as to how that fits in with I2C clock stretching. The HAL routine loads in one shot after all, so how does that block anyone? Root cause and detail aside, I think I can live with Ethernet-induced overruns, maybe.
The current build has I2C with IRAM attributes but I still see the 'error' (EDIT: when Ethernet is enabled). It is hard to probe I2C, and checking the stretch frequency is even harder (it is manual and so error prone).
Instead I am looking at how setup changes affect the error (0xD4) rate, e.g. enabling/disabling ESP driver functions.
Ethernet and CAN (the latter especially without termination) increase the error rate.

Post by **PeterR** » Mon May 18, 2020 12:51 am

Please let me know how investigations are progressing and how I can support or check progress. We were moments away from launch, and whilst we could possibly recover by dropping to 4.0 we would lose some significant functionality. As it is, it is bust.
I2C corruption is a project killer. It just won't fly now without a root cause.
CAN corruption is also significant, but maybe we can work around it depending on operational frequency.

Post by **ESP_Dazz** » Mon May 18, 2020 3:54 pm

@PeterR

It doesn't make sense that the CAN driver would cause a 2ms block, as the CAN driver ISR doesn't loop infinitely anywhere. The only loop that occurs is the clearing of the CAN RX FIFO. In the worst case scenario, where the CAN RX FIFO has completely overrun, the driver would simply clear the CAN RX buffer 64 times (which is the max message count the CAN peripheral can record), so at most there would be 64 x 13 = 832 bytes to read. At an APB bus of 80MHz, that should theoretically take about 10us. Given the other parts of the CAN ISR, my guess would be that the CAN ISR should not exceed a few hundred microseconds.
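
The estimate above can be reproduced with a few lines of arithmetic. This is a minimal sketch assuming one register read per APB clock cycle, which is an idealisation (real peripheral reads will take longer). The names are illustrative and are not taken from the CAN driver.

```c
#include <assert.h>
#include <stdint.h>

#define CAN_RX_FIFO_MAX_MSGS 64u /* max message count, per the post above */
#define CAN_RX_BYTES_PER_MSG 13u /* bytes read per message, per the post above */

/* Worst-case number of byte reads needed to drain a fully overrun RX FIFO. */
static uint32_t can_fifo_worst_case_reads(void)
{
    return CAN_RX_FIFO_MAX_MSGS * CAN_RX_BYTES_PER_MSG;
}

/* Time in nanoseconds for the given number of reads at one read per APB cycle. */
static uint64_t reads_to_ns(uint32_t reads, uint32_t apb_hz)
{
    assert(apb_hz > 0);
    return (uint64_t)reads * 1000000000ull / apb_hz;
}
```

`reads_to_ns(can_fifo_worst_case_reads(), 80000000u)` gives 10400 ns, i.e. the roughly 10us quoted, which is about two orders of magnitude below the 2ms gap under discussion.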

I'm wondering if this could be related to the I2C stuck bus issue (see 680, 922, 2494). It could be that the I2C is the one that's actually preventing the CAN driver ISR from running in time.

The other scenario could be some hardware issue where the pins used for CAN, ethernet and I2C are interfering with each other. The following information would be useful.

To eliminate the ISR block theory, you could try installing the CAN driver ISR on the core opposite to the I2C ISR. Simply run the CAN driver install function in a task pinned to the other core, and the CAN driver's ISR will run on that core. This way the I2C ISR and CAN driver ISR would not affect each other.

Maybe this is also some form of hardware issue? Which GPIOs are you using for CAN, Ethernet, and I2C? Could there be some power issue (i.e., maybe the pins share a common ground of some sort)?

What I2C Slave is the I2C Master reading from? Does the Slave datasheet specify a maximum clock stretching period? Is it the slave that's actually keeping the SCL line low?

As for the CAN overrun corruption, this is a known hardware bug. There is an edge case where a fully overrun RX FIFO will cause subsequent messages in the RX FIFO to be byte shifted (see this GH issue explanation). The only solution is to reset the CAN peripheral. I'll push a bugfix for this shortly.

Post by **PeterR** » Mon May 18, 2020 9:32 pm

EDIT: CAN maybe not; I have no causal link for CAN (it is implicated only because it is being naughty with its own corruptions). Ethernet is fairly clearly causal. I have not checked why yet.

Cool, thanks, will do tomorrow.
Some recent facts:
1) I2C corruptions are absent/significantly reduced when I do not start Ethernet (4.1). EDIT: and the overrun is removed.
2) I2C corruptions are absent/significantly reduced in IDF 4.0-dev. EDIT: even with Ethernet.
3) Overrun issues do cause corruptions, but in most instances the ll reset recovers (AFAIK).
4) Termination resistor issues at high frame rates cause frame ID corruptions: essentially the device reports an incorrect frame ID. I have not cross-checked whether the data is valid and/or whether we are shifted in the buffer. This issue seems separate from the overflow, i.e. I can generate the condition without an overflow. I am looking for tells so that I may reset and recover.

Your questions:
1) Sounds like a good idea.
I had already bumped the I2C priority and that possibly reduces the frequency. The core switch is smarter but I have other interrupts on that core.
Will answer tomorrow.

2) Pin allocation
#define I2C_SDA_GPIO GPIO_NUM_14
#define I2C_SCL_GPIO GPIO_NUM_32

#define GPIO_ESP_CAN_TX GPIO_NUM_33
#define GPIO_ESP_CAN_RX GPIO_NUM_34

// ======================
// ETHERNET (RMII)
// RMII data pins are fixed
// TXD0 = GPIO19
// TXD1 = GPIO22
// TX_EN = GPIO21
// RXD0 = GPIO25
// RXD1 = GPIO26
// CLK = GPIO0
// MDC = GPIO23
// MDIO = GPIO18
// ======================
#define DEFAULT_ETHERNET_PHY_CONFIG phy_lan8720_default_ethernet_config

#define PIN_SMI_MDC (23)
#define PIN_SMI_MDIO (18)
#define PHY_ADDRESS (0)
#define PHY_CLOCK_MODE (0)
static void eth_gpio_config_rmii(void)
{
    phy_rmii_configure_data_interface_pins();

    phy_rmii_smi_configure_pins(PIN_SMI_MDC, PIN_SMI_MDIO);
}

3) TCA6408A; no declared upper limit on clock stretching. The corruption is regular and only expected data (other than the stretch) is seen on the I2C bus (according to the Saleae, and who would dare dis Saleae!)
I would concede that it is hard to check all samples though.

Wildcard(s): 4.0-dev does not generate I2C corruptions but does suffer from CAN issues (corruptions, but without overflow) at high frame rates when termination is added/removed.

Post by **PeterR** » Mon May 18, 2020 9:52 pm

PS
CAN TX/RX is from the MCP2515. I2C is pulled up to 3.3V. The PHY design is the reference EVB LAN8710A.
The corruption pattern is regular though, and not shown on the bus. This is the ESP.

If it is related to the I2C 'stuck issue', what am I to do? I am using the 4.1 IDF (I need DHCP on Ethernet), which is actually worse than 4.0.
IMHO 4.1 Ethernet is a factor. The CAN termination corruption issues are also a problem on all IDFs, but I hope I can filter them with range checks.
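
The range-check idea for corrupted CAN frame IDs might look like the following. This is a sketch only: the type and function names are hypothetical, and the thread does not say which IDs the application actually expects, so any window passed in (for example 0x100 to 0x1FF) is a placeholder.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical inclusive window of frame IDs the application expects. */
typedef struct {
    uint32_t lo;
    uint32_t hi;
} can_id_range_t;

/* Returns true if the received frame ID lies inside the expected window. */
static bool can_id_in_range(uint32_t id, can_id_range_t range)
{
    assert(range.lo <= range.hi);
    return id >= range.lo && id <= range.hi;
}
```

Frames whose ID falls outside the window would be dropped or counted as a tell for a reset. A corrupted ID that still lands inside the window passes the check, so this reduces the exposure rather than removing it.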
