Page 1 of 2

esp_netif_receive eating all my ram (SLIP protocol) (IDFGH-4141)

Posted: Tue Oct 20, 2020 10:00 am
by fpena_inusual
Hi there,

We're using SLIP protocol to communicate an esp32 device with a mcu and all seems to be ok, but we're having reboots from time to time depending on the data received by the esp32, caused by an abort on an internall memory allocation, so there's a memory leak somewhere...
We've also included a heap memory tracking task to determine from where the memory leak comes from, and we found our slip reception task is consuming all the memory for some reason...
As the task is pretty simple (no dynamic memory allocation) we determined the problem was wit esp_netif_receive function.
We started from the slip_udp example (examples/protocols/slip/slip_udp/) without much changes, we've only changed the receiving task to make it more robust (managing fifo overtflows, etc...).
¿Any clue about how to solve/trace the issue?

P.S.: Here is our receiving task, in case you wonder what we've changed

Code: Select all

static void esp_slip_modem_uart_rx_task(void *arg) {

        uart_event_t event;

        esp_slip_modem_t *slip_modem = (esp_slip_modem_t *) arg;

        int len;

        while (slip_modem->running == true) {
                if(xQueueReceive(slip_modem->uart.uart_queue, (void * )&event, 1)) {
                        switch(event.type) {
                                case UART_DATA:
                                        // Read data from the UART
                                        len = uart_read_bytes(slip_modem->uart.uart_dev, slip_modem->buffer, slip_modem->buffer_len, 1);
                                        if (len > 0) {
                                                // Ensure null termination
                                                slip_modem->buffer[len] = '\0';
                                                esp_netif_receive(slip_modem->base.netif, slip_modem->buffer, len, NULL);
                                        }
                                        if (len < 0) {
                                                // Asume error and clear everything
                                                uart_flush_input(slip_modem->uart.uart_dev);
                                                xQueueReset(slip_modem->uart.uart_queue);
                                        }
                                        break;
                                case UART_FIFO_OVF:
                                case UART_BUFFER_FULL:
                                        uart_flush_input(slip_modem->uart.uart_dev);
                                        xQueueReset(slip_modem->uart.uart_queue);
                                        break;
                                case UART_FRAME_ERR:
                                default:
                                        break;
                         }
                } else {
                        vTaskDelay(pdMS_TO_TICKS(1));
                }
        }
        vTaskDelete(NULL);
}

P.S.2: Here is the error message it give us

Code: Select all

Memory allocation failed

Backtrace:0x4008e9e6:0x3ffd0d60 0x4008f049:0x3ffd0d80 0x400f60ab:0x3ffd0da0 0x4008565b:0x3ffd0dc0 0x40085738:0x3ffd0df0 0x40085b5a:0x3ffd0e10 0x40085b99:0x3ffd0e50 0x40151c24:0x3ffd0e70 0x40151cc2:0x3ffd0e90 0x40151d30:0x3ffd0eb0 0x400941f4:0x3ffd0ed0 0x4015c3fb:0x3ffd0ef0 0x4015c4c7:0x3ffd0f10 0x4015c623:0x3ffd0f30 0x4014ec65:0x3ffd0f50 0x4017d571:0x3ffd0f70 0x400f7065:0x3ffd0f90
0x4008e9e6: panic_abort at /home/desarrollo/devel/esp32/esp-idf/components/esp_system/panic.c:360

0x4008f049: esp_system_abort at /home/desarrollo/devel/esp32/esp-idf/components/esp_system/system_api.c:104

0x400f60ab: heap_caps_alloc_failed at /home/desarrollo/devel/esp32/esp-idf/components/heap/heap_caps.c:67

0x4008565b: heap_caps_malloc at /home/desarrollo/devel/esp32/esp-idf/components/heap/heap_caps.c:155

0x40085738: heap_caps_malloc_default at /home/desarrollo/devel/esp32/esp-idf/components/heap/heap_caps.c:187

0x40085b5a: trace_malloc at /home/desarrollo/devel/esp32/esp-idf/components/heap/include/heap_trace.inc:95

0x40085b99: __wrap_malloc at /home/desarrollo/devel/esp32/esp-idf/components/heap/include/heap_trace.inc:157

0x40151c24: mem_malloc at /home/desarrollo/devel/esp32/esp-idf/components/lwip/lwip/src/core/mem.c:237

0x40151cc2: do_memp_malloc_pool at /home/desarrollo/devel/esp32/esp-idf/components/lwip/lwip/src/core/memp.c:254

0x40151d30: memp_malloc at /home/desarrollo/devel/esp32/esp-idf/components/lwip/lwip/src/core/memp.c:350 (discriminator 2)

0x400941f4: pbuf_alloc at /home/desarrollo/devel/esp32/esp-idf/components/lwip/lwip/src/core/pbuf.c:247

0x4015c3fb: slipif_rxbyte at /home/desarrollo/devel/esp32/esp-idf/components/lwip/lwip/src/netif/slipif.c:265 (discriminator 4)

0x4015c4c7: slipif_rxbyte_enqueue at /home/desarrollo/devel/esp32/esp-idf/components/lwip/lwip/src/netif/slipif.c:494

0x4015c623: slipif_received_bytes at /home/desarrollo/devel/esp32/esp-idf/components/lwip/lwip/src/netif/slipif.c:555 (discriminator 3)

0x4014ec65: esp_netif_lwip_slip_input at /home/desarrollo/devel/esp32/esp-idf/components/esp_netif/lwip/esp_netif_lwip_slip.c:165 (discriminator 3)

0x4017d571: esp_netif_receive at /home/desarrollo/devel/esp32/esp-idf/components/esp_netif/lwip/esp_netif_lwip.c:803

0x400f7065: esp_slip_modem_uart_rx_task at /home/desarrollo/devel/esp32/eth2slip/main/src/driver/slip_modem.c:221

Re: esp_netif_receive eating all my ram (SLIP protocol) (IDFGH-4141)

Posted: Tue Oct 20, 2020 11:30 am
by ESP_Alvin
Moderator's note: edit the topic title for issue tracking, thanks for reporting.

Re: esp_netif_receive eating all my ram (SLIP protocol) (IDFGH-4141)

Posted: Wed Oct 21, 2020 10:53 am
by fpena_inusual
Hello again,

I have more information on this.
It seems to occur only when the received datagram is fragmented, that is when the actual datagram is bigger than the interface MTU and it's chopped by the ip layer. The memory leak is 1520 bytes per datagram.
I assumed the slip MTU is 1500 bytes, so I tried to ping to the device with up to 1472 bytes (1500 bytes (MTU) - 20 byte (IP header) - 8 byte (ICMP header)) and it works fine, no memory leak so far.
If I try to ping with more than 1472 bytes I have no response from the device and it starts to consume the memory, 1520 bytes per lost packet as I mentioned.

Any idea how I can overcome this?

Re: esp_netif_receive eating all my ram (SLIP protocol) (IDFGH-4141)

Posted: Thu Oct 22, 2020 4:34 pm
by ESP_cermak
Hi fpena_inusual,

One possible reason could be that the slip netif doesn't support packets longer than 256 bytes, as per limitation in lwIP (len parameter is uint8_t). There a fix ready for internal review: https://github.com/espressif/esp-idf/pull/5928, Could you please check if that helps?

Re: esp_netif_receive eating all my ram (SLIP protocol) (IDFGH-4141)

Posted: Fri Oct 23, 2020 9:15 am
by fpena_inusual
Ok, patch applied, but sadly it doesn't fix this issue.

Thanks anyway

Re: esp_netif_receive eating all my ram (SLIP protocol) (IDFGH-4141)

Posted: Mon Oct 26, 2020 5:17 pm
by ESP_cermak
Would be it acceptable to workaround this by increasing the MTU size here? https://github.com/espressif/esp-lwip/b ... ipif.c#L79
I might be wrong, but it looks like the extra bytes are simply dropped on netif level (here in slipif.c) before getting fragmented.

Please note that it is possible to add target compile flags to other components in your project makefile:

Code: Select all

idf_component_get_property(lwip lwip COMPONENT_LIB)
target_compile_definitions(${lwip} PRIVATE "-DSLIP_MAX_SIZE=1600")

Re: esp_netif_receive eating all my ram (SLIP protocol) (IDFGH-4141)

Posted: Tue Oct 27, 2020 9:24 am
by fpena_inusual
@ESP_cermak
It would be acceptable, but I think that fragmentation is only part of the problem.
The memory leak occurs in many other situations, it's just when you try to ping with a data packet bigger than the MTU it goes wild.
Even connecting to a http server increases the memory consumption.

Re: esp_netif_receive eating all my ram (SLIP protocol) (IDFGH-4141)

Posted: Tue Oct 27, 2020 1:07 pm
by fpena_inusual
Hello again...

I think I have something, but I don't know how interpret it.
Here it is the capture of a HTTP transaction: https://pastebin.com/raw/qz8GpB3d
As you can see, there's a retransmission packet, and all packets are below 1500 bytes long
From my tests, every time there's a retransmission (or any other error, I guess) the memory leak is incremented in 1520 bytes...

Re: esp_netif_receive eating all my ram (SLIP protocol) (IDFGH-4141)

Posted: Tue Oct 27, 2020 5:29 pm
by fpena_inusual
Ok, the problem seems to be with lwip itself...
I wrote a new driver only using the lwip basic functions bypassing the ESP-NETIF layer (I had to comment the sio functions from the idf code) and it fails in the same way...
I think the lwip code for slip is broken in some way.

Re: esp_netif_receive eating all my ram (SLIP protocol) (IDFGH-4141)

Posted: Wed Oct 28, 2020 12:18 pm
by fpena_inusual
Ok, another little update, with a very small improvement...

I changed my reception task to this:

Code: Select all

static void esp_slip_modem_uart_rx_task(void *arg) {

        uart_event_t event;

        esp_slip_modem_t *slip_modem = (esp_slip_modem_t *) arg;

        int len;

        while (slip_modem->running == true) {
                if(xQueueReceive(slip_modem->uart.uart_queue, (void * )&event, pdMS_TO_TICKS(65))) {
                        switch(event.type) {
                                case UART_DATA:
                                        // Read data from the UART
                                        memset((void *)slip_modem->buffer, 0, slip_modem->buffer_len);
                                        len = uart_read_bytes(slip_modem->uart.uart_dev, slip_modem->buffer, slip_modem->buffer_len, 0);
                                        if (len > 0) {
                                                const int max_batch = 255;
                                                int sent = 0;
                                                while(sent < len) {
                                                        int batch = (len - sent) > max_batch ? max_batch : (len - sent);
                                                        esp_netif_receive(slip_modem->base.netif, slip_modem->buffer+sent, batch, NULL);
                                                        sent += batch;
                                                }
                                        } else if (len < 0) {
                                                // Asume error and clear everything
                                                ESP_LOGE(TAG, "Receiving error (len < 0)!!!");
                                                uart_flush_input(slip_modem->uart.uart_dev);
                                                xQueueReset(slip_modem->uart.uart_queue);
                                        }
                                        break;
                                case UART_FRAME_ERR:
                                case UART_BUFFER_FULL:
                                case UART_FIFO_OVF:
                                        ESP_LOGE(TAG, "Receiving error!!!");
                                        uart_flush_input(slip_modem->uart.uart_dev);
                                        xQueueReset(slip_modem->uart.uart_queue);
                                        break;
                                default:
                                        break;
                         }
                } else {
                        vTaskDelay(pdMS_TO_TICKS(65));
                }
        }
        vTaskDelete(NULL);
}
As you can see I decided not to patch the idf but pass the received data in small chuncks to esp_netif_receive (just what the patch is supposed to do).
Now I can ping with data bigger than the MTU and it will reply without incrementing the memory usage (in fact it does, but it's freed afterwards)...
But it still fails with regular http transactions...

Any hint on how I can track the problem (I'm not familiar with lwip inners)?

P.S.: I'm sorry for being so boring, but our product depends on this to be working and if doesn't work we will lose a 500 units contract (two esp32 devices on each unit) just on the first batch (a lot more planned)... so it's quite important to us :P