Random ERRNO 9 (EBADF) on recv() or send()

sharkx
Posts: 5
Joined: Fri Feb 28, 2020 3:27 pm

Random ERRNO 9 (EBADF) on recv() or send()

Postby sharkx » Tue Jul 14, 2020 9:03 am

The context
Several nodes running as APs connecting to each other in a tree-like network; all nodes run the same code.

There's a pretty basic TCP server that listens for small packets of data; the server is running a 'classic' soccket() -> bind() -> listen() -> select() - accept() -> recv() scenario, using the recv() in a separate task; the socket file descriptor is released immediately after the recv completes.

The server processes an average of 2 packets per second. When a packet is received, it is relayed to some of the stations that are connected to the module, or to the 'parent' node depending on data.

The problem
Every now and then, the recv() or send() calls return error code 9, EBADF, and I can't seem to figure out why ...
Sometimes everything goes well for minutes, then there's a 'burst' of EBADFs; sometimes there's one EBADF every minute ort so... there doesn't seem to be a particular pattern.

Any help about debugging this issue is appeciated.
Thanks.

ESP_Sprite
Posts: 9772
Joined: Thu Nov 26, 2015 4:08 am

Re: Random ERRNO 9 (EBADF) on recv() or send()

Postby ESP_Sprite » Wed Jul 15, 2020 2:43 pm

Are you sure you're handling your file descriptors correctly? No race conditions there?

sharkx
Posts: 5
Joined: Fri Feb 28, 2020 3:27 pm

Re: Random ERRNO 9 (EBADF) on recv() or send()

Postby sharkx » Thu Jul 16, 2020 5:01 am

Well... That's part of the problem; there none that I could find. Stripping all error handling and other minor stuff, here's how it's done (in a simplified, c-like language)

Code: Select all

function serverTask() {
    server_socket = socket();
    bind(server_socket);
    listen(server_socket);
    while (1) {
        if (select(server_socket) >= 0) {
            server_accept = accept(server_socket);
            createTask(recvTask, server_accept);
        }
    }
}

function recvTask(server_accept) {
    if (recv(server_accept) > 0) {
        process(received_data);
    }
    closeSocket(server_accept)
}
It's pretty basic. This is how it's supposed to be done, right ?

There is some synchronization done in the processing section, but it does not involve the socket; besides, the socket fd is sent to the task and should become available for another operation after the close occurrs in the recvTask.

FrancoisB-HEX
Posts: 3
Joined: Thu Oct 24, 2024 12:37 pm

Re: Random ERRNO 9 (EBADF) on recv() or send()

Postby FrancoisB-HEX » Thu Oct 24, 2024 1:03 pm

Did you ever get a grip on this issue?

I am seeing something similar with an AP setup on ESP32-S3 using ESP-IDF v5.3.1. I use the BSD sockets api to setup a TCP server:
- assign server socket file descriptor (serverFD) using bind() and then listen() with O_NONBLOCK flag set
- periodically poll file descriptors using select() with non-blocking timeout (0)
- if serverFD is set for reading - attempt to accept() a client connection as client socket file descriptor clientFD. If clientFD == -1 - close serverFD, otherwise assign clientFD
- if clientFD is set for reading - attempt to read data from clientFD using recv(). If zero bytes read - client closed the connection so close clientFD, if less than zero bytes read (and errno != EAGAIN/EINPROGRESS/EWOULDBLOCK) - connection error so close clientFD, if bytesRead - TCP data received from client - handle and queue responseBuffer
- if clientFD is set for writing and responseBuffer has data - attempt to transmit data to clientFD using send(). This is where I intermittently get errno == 9 (EBADF), which does not make sense because I have just received data on that FD and it has not closed yet

With the exact same sockets code using a STA Wifi connection, I DO NOT have the same issue.

FrancoisB-HEX
Posts: 3
Joined: Thu Oct 24, 2024 12:37 pm

Re: Random ERRNO 9 (EBADF) on recv() or send()

Postby FrancoisB-HEX » Fri Oct 25, 2024 9:54 am

FrancoisB-HEX wrote:
Thu Oct 24, 2024 1:03 pm
Did you ever get a grip on this issue?

I am seeing something similar with an AP setup on ESP32-S3 using ESP-IDF v5.3.1. I use the berkley sockets api to setup a TCP server:
- assign server socket file descriptor (serverFD) using bind() and then listen() with O_NONBLOCK flag set
- periodically poll file descriptors using select() with non-blocking timeout (0)
- if serverFD is set for reading - attempt to accept() a client connection as client socket file descriptor clientFD. If clientFD == -1 - close serverFD, otherwise assign clientFD
- if clientFD is set for reading - attempt to read data from clientFD using recv(). If zero bytes read - client closed the connection so close clientFD, if less than zero bytes read (and errno != EAGAIN/EINPROGRESS/EWOULDBLOCK) - connection error so close clientFD, if bytesRead - TCP data received from client - handle and queue responseBuffer
- if clientFD is set for writing and responseBuffer has data - attempt to transmit data to clientFD using send(). This is where I intermittently get errno == 9 (EBADF), which does not make sense because I have just received data on that FD and it has not closed yet

With the exact same sockets code using a STA Wifi connection, I DO NOT have the same issue.
Here is the LWIP Debug logs when this happens:

Code: Select all

D (174239) lwip: lwip_recvfrom(56):  addr=
D (174239) lwip: 192.168.4.2
D (174240) lwip:  port=51689 len=2
D (174284) lwip: lwip_select(57, 0x3fca8310, 0x3fca8318, 0x0, tvsec=0 tvusec=0)
D (174285) lwip: lwip_selscan: fd=56 ready for reading
D (174285) lwip: ip_input: iphdr->dest 0xa493214 netif->ip_addr 0x104a8c0 (0x493214, 0x4a8c0, 0xa000000)
D (174286) lwip: ip_input: iphdr->dest 0xa493214 netif->ip_addr 0x100007f (0x14, 0x7f, 0xa493200)
D (174286) lwip: ip4_input: packet not for us.
D (174287) lwip: lwip_selscan: fd=56 ready for writing
D (174287) lwip: lwip_select: nready=2
D (174287) lwip: lwip_recvfrom(56, 0x3fcb7084, 1440, 0x1, ..)
D (174288) lwip: lwip_recv_tcp: top while sock->lastdata=0x0
D (174288) lwip: lwip_recv_tcp: netconn_recv err=-15, pbuf=0x0
D (174289) lwip: lwip_recv_tcp: p == NULL, error is "Connection closed."!
D (174289) lwip: lwip_recvfrom(56):  addr=
D (174289) lwip: 192.168.4.2
D (174289) lwip:  port=51689 len=0
D (174290) lwip: lwip_close(56)
D (174290) lwip: tcp_close: closing in 
D (174290) lwip: State: CLOSE_WAIT
D (174290) lwip: ip4_output_if: ap2
D (174291) lwip: IP header:
D (174291) lwip: +-------------------------------+
D (174291) lwip: | 4 | 5 |  0x00 |        68     | (v, hl, tos, len)
D (174291) lwip: +-------------------------------+
D (174292) lwip: |      179      |000|       0   | (id, flags, offset)
D (174292) lwip: +-------------------------------+
D (174292) lwip: |   64  |    6  |    0xf0ad     | (ttl, proto, chksum)
D (174292) lwip: +-------------------------------+
D (174292) lwip: |  192  |  168  |    4  |    1  | (src)
D (174293) lwip: +-------------------------------+
D (174293) lwip: |  192  |  168  |    4  |    2  | (dest)
D (174293) lwip: +-------------------------------+
D (174293) lwip: ip4_output_if: call netif->output()
D (174295) lwip: lwip_send(56, data=0x3fcf21e4, size=4, flags=0x0)
W (174295) ./main/hsm/wifi/tcp-server-hsm.c: tcpServerCb_ClientTransmit: Error sending to client 56 on instance 1: 9

FrancoisB-HEX
Posts: 3
Joined: Thu Oct 24, 2024 12:37 pm

Re: Random ERRNO 9 (EBADF) on recv() or send()

Postby FrancoisB-HEX » Mon Oct 28, 2024 7:44 am

FrancoisB-HEX wrote:
Fri Oct 25, 2024 9:54 am
FrancoisB-HEX wrote:
Thu Oct 24, 2024 1:03 pm
Did you ever get a grip on this issue?

I am seeing something similar with an AP setup on ESP32-S3 using ESP-IDF v5.3.1. I use the berkley sockets api to setup a TCP server:
- assign server socket file descriptor (serverFD) using bind() and then listen() with O_NONBLOCK flag set
- periodically poll file descriptors using select() with non-blocking timeout (0)
- if serverFD is set for reading - attempt to accept() a client connection as client socket file descriptor clientFD. If clientFD == -1 - close serverFD, otherwise assign clientFD
- if clientFD is set for reading - attempt to read data from clientFD using recv(). If zero bytes read - client closed the connection so close clientFD, if less than zero bytes read (and errno != EAGAIN/EINPROGRESS/EWOULDBLOCK) - connection error so close clientFD, if bytesRead - TCP data received from client - handle and queue responseBuffer
- if clientFD is set for writing and responseBuffer has data - attempt to transmit data to clientFD using send(). This is where I intermittently get errno == 9 (EBADF), which does not make sense because I have just received data on that FD and it has not closed yet

With the exact same sockets code using a STA Wifi connection, I DO NOT have the same issue.
Here is the LWIP Debug logs when this happens:

Code: Select all

D (174239) lwip: lwip_recvfrom(56):  addr=
D (174239) lwip: 192.168.4.2
D (174240) lwip:  port=51689 len=2
D (174284) lwip: lwip_select(57, 0x3fca8310, 0x3fca8318, 0x0, tvsec=0 tvusec=0)
D (174285) lwip: lwip_selscan: fd=56 ready for reading
D (174285) lwip: ip_input: iphdr->dest 0xa493214 netif->ip_addr 0x104a8c0 (0x493214, 0x4a8c0, 0xa000000)
D (174286) lwip: ip_input: iphdr->dest 0xa493214 netif->ip_addr 0x100007f (0x14, 0x7f, 0xa493200)
D (174286) lwip: ip4_input: packet not for us.
D (174287) lwip: lwip_selscan: fd=56 ready for writing
D (174287) lwip: lwip_select: nready=2
D (174287) lwip: lwip_recvfrom(56, 0x3fcb7084, 1440, 0x1, ..)
D (174288) lwip: lwip_recv_tcp: top while sock->lastdata=0x0
D (174288) lwip: lwip_recv_tcp: netconn_recv err=-15, pbuf=0x0
D (174289) lwip: lwip_recv_tcp: p == NULL, error is "Connection closed."!
D (174289) lwip: lwip_recvfrom(56):  addr=
D (174289) lwip: 192.168.4.2
D (174289) lwip:  port=51689 len=0
D (174290) lwip: lwip_close(56)
D (174290) lwip: tcp_close: closing in 
D (174290) lwip: State: CLOSE_WAIT
D (174290) lwip: ip4_output_if: ap2
D (174291) lwip: IP header:
D (174291) lwip: +-------------------------------+
D (174291) lwip: | 4 | 5 |  0x00 |        68     | (v, hl, tos, len)
D (174291) lwip: +-------------------------------+
D (174292) lwip: |      179      |000|       0   | (id, flags, offset)
D (174292) lwip: +-------------------------------+
D (174292) lwip: |   64  |    6  |    0xf0ad     | (ttl, proto, chksum)
D (174292) lwip: +-------------------------------+
D (174292) lwip: |  192  |  168  |    4  |    1  | (src)
D (174293) lwip: +-------------------------------+
D (174293) lwip: |  192  |  168  |    4  |    2  | (dest)
D (174293) lwip: +-------------------------------+
D (174293) lwip: ip4_output_if: call netif->output()
D (174295) lwip: lwip_send(56, data=0x3fcf21e4, size=4, flags=0x0)
W (174295) ./main/hsm/wifi/tcp-server-hsm.c: tcpServerCb_ClientTransmit: Error sending to client 56 on instance 1: 9
OK so my issue was that the client socket was in fact just closed before calling send() on its file descriptor. The sequence was:
- select(readFDs,writeFDs) -> readFD set for clientSocketFD, and writeFD set for clientSocketFD (and writeBuffer has data to be sent)
- read(clientSocketFD) with 0 bytes -> clientSocket closed and clientSocketFD set for dereferencing
- write(clientSocketFD) -> EBADF since socket closed

My underlying issue is still that TCP data is getting adversely delayed when using Soft AP which causes the TCP client to close the connection. With STA WiFi this does not occur. I have also disabled Nagle's algorithm (with TCP_NODELAY) which improves the situation, but does not solve it for this scenario.

Who is online

Users browsing this forum: No registered users and 79 guests