esp32 P4 performance

chegewara
Posts: 2378
Joined: Wed Jun 14, 2017 9:00 pm

esp32 P4 performance

Postby chegewara » Fri Aug 30, 2024 11:46 pm

Yesterday I was playing a bit with a P4-2121 matrix panel, and just as an exercise I decided to write a simple bit-banging driver for it. Nothing fancy, just the simplest thing I could do.

I have a few of those, so I connected 4 in a chain, and the chain runs at 46-48 fps. A single panel runs at around 250 fps.
Then I decided to compare with the "old" ESP32-S3: with 4 panels it runs at 42 fps.
You can imagine my surprise when I saw 334 fps with a single panel. It's not only 2 times faster than expected, it is also significantly faster than the P4.
- S3 running with 240MHz clock
- P4 running with 360MHz clock
- both are using internal memory

After reviewing the results I decided to check one more thing, and here is what I got:
- P4 with internal memory, with and w/o DMA - 250fps
- S3 with internal memory, with DMA - 334fps
- S3 with internal memory, w/o DMA - 250fps

Code: Select all

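    // Variant 1: internal RAM that is also DMA-capable
    panel = (hub75 *)heap_caps_calloc(ROW_SIZE * COL_SIZE * PANELS_H * 4, 4, MALLOC_CAP_DMA | MALLOC_CAP_INTERNAL);
    // Variant 2 (used instead of variant 1): internal RAM, DMA capability not requested
    panel = (hub75 *)heap_caps_calloc(ROW_SIZE * COL_SIZE * PANELS_H * 4, 4, MALLOC_CAP_INTERNAL);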
A couple more tests later:
- is the native CPU clock on the S3 160MHz, so that running it at 240MHz can be considered overclocking?
- because at 160MHz the S3 gives the expected 250fps both with and w/o the DMA memory buffer; it is weird that this is the same as on the 360MHz P4, but that's a different story

Another couple of tests later:
- it looks like the limitation in my previous tests was the speed at which the ESP32 can set GPIO pin levels; after adding a loop that copies memory 100x I get better test results
- after all, the P4 is about 2.5 times faster than the S3, which is expected given the CPU clock speeds (160 vs 360) and other parameters
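
A CPU/memory-bound step like the "copy memory 100x" loop mentioned above could look roughly like the sketch below. This is only an illustration under assumptions: `time_copy_loop` is a hypothetical helper, not the code used in the test, and the buffer size and repeat count are left to the caller. The compiler barrier uses GCC/Clang syntax.

```cpp
#include <chrono>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical helper: copies `src` into `dst` `repeats` times and
// returns the elapsed time in microseconds.
inline int64_t time_copy_loop(const std::vector<uint8_t>& src,
                              std::vector<uint8_t>& dst,
                              int repeats)
{
    dst.assign(src.size(), 0);
    if (src.empty()) {
        return 0; // nothing to copy
    }
    const auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < repeats; ++i) {
        std::memcpy(dst.data(), src.data(), src.size());
        // Compiler barrier (GCC/Clang syntax) so the repeated copies
        // are not optimized away.
        asm volatile("" : : "r"(dst.data()) : "memory");
    }
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration_cast<std::chrono::microseconds>(stop - start).count();
}
```

Running the same call on both chips with the same buffer size gives a number that depends on CPU and memory speed rather than on how fast GPIO levels can be set.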

Conclusion:
- it is weird that an S3 overclocked to 240MHz is almost as fast as a P4 clocked at 360MHz
- overclocking the S3 has a significant impact on GPIO pins when using DMA memory buffers
- hopefully we will see a similar impact on the P4 when we get a chance to run it at 400MHz, which Espressif "promised" us
Last edited by chegewara on Wed Sep 04, 2024 10:24 am, edited 1 time in total.

username
Posts: 538
Joined: Thu May 03, 2018 1:18 pm

Re: esp32 P4 poor performance?

Postby username » Sat Aug 31, 2024 2:27 pm

This is very interesting. Thanks for sharing.

ESP_igrr
Posts: 2072
Joined: Tue Dec 01, 2015 8:37 am

Re: esp32 P4 poor performance?

Postby ESP_igrr » Sun Sep 01, 2024 6:58 am

chegewara wrote: decided to write a simple bit-banging driver for it
chegewara wrote:MALLOC_CAP_DMA
If you are accessing a memory region by your bitbanging software driver only, there is no difference whether MALLOC_CAP_DMA is used or not. This flag requests memory which _can_ be used by DMA engines. DMA engines (like GDMA or peripheral-specific DMAs) can move data between memory and peripherals. If you don't explicitly use one of the DMA-capable peripherals or GDMA APIs and only access the memory by software, then DMA is not used.

The differences between S3 and P4 you measure may be due to peripheral access latencies. Which method are you using for bitbanging? (direct writes to GPIO registers, calls to GPIO driver API, dedicated GPIO instructions?)

chegewara
Posts: 2378
Joined: Wed Jun 14, 2017 9:00 pm

Re: esp32 P4 poor performance?

Postby chegewara » Sun Sep 01, 2024 7:45 am

ESP_igrr wrote:
Sun Sep 01, 2024 6:58 am
If you are accessing a memory region by your bitbanging software driver only, there is no difference whether MALLOC_CAP_DMA is used or not. This flag requests memory which _can_ be used by DMA engines. DMA engines (like GDMA or peripheral-specific DMAs) can move data between memory and peripherals. If you don't explicitly use one of the DMA-capable peripherals or GDMA APIs and only access the memory by software, then DMA is not used.

The differences between S3 and P4 you measure may be due to peripheral access latencies. Which method are you using for bitbanging? (direct writes to GPIO registers, calls to GPIO driver API, dedicated GPIO instructions?)
Thanks for the explanation.
It's my first try at playing with DMA, so I am still a bit confused, especially since it has an impact on fps on the S3.

I am using the GPIO API, since it was meant to be the simplest code I could imagine, not the fastest.

Code: Select all

static void hub75_set_pixel1(pixel_color &pixel, uint8_t bit)
{
    ESP_ERROR_CHECK_WITHOUT_ABORT(gpio_set_level((gpio_num_t)R0, (pixel.r >> bit) & 0x1));
    ESP_ERROR_CHECK_WITHOUT_ABORT(gpio_set_level((gpio_num_t)G0, (pixel.g >> bit) & 0x1));
    ESP_ERROR_CHECK_WITHOUT_ABORT(gpio_set_level((gpio_num_t)B0, (pixel.b >> bit) & 0x1));
}

static void hub75_set_pixel2(pixel_color &pixel2, uint8_t bit)
{
    ESP_ERROR_CHECK_WITHOUT_ABORT(gpio_set_level((gpio_num_t)R1, (pixel2.r >> bit) & 0x1));
    ESP_ERROR_CHECK_WITHOUT_ABORT(gpio_set_level((gpio_num_t)G1, (pixel2.g >> bit) & 0x1));
    ESP_ERROR_CHECK_WITHOUT_ABORT(gpio_set_level((gpio_num_t)B1, (pixel2.b >> bit) & 0x1));
}

static uint8_t hub75_next_row(uint8_t row)
{
    gpio_set_level((gpio_num_t)A, (row >> 0) & 0x1);
    gpio_set_level((gpio_num_t)B, (row >> 1) & 0x1);
    gpio_set_level((gpio_num_t)C, (row >> 2) & 0x1);
    gpio_set_level((gpio_num_t)D, (row >> 3) & 0x1);

    // if(row == COL_SIZE/2) return 0;
    return row + 1; // `row++` would return the old value; return the next row index instead
}

static void hub75_gate_clk()
{
    gpio_set_level((gpio_num_t)CLK, 1);
    gpio_set_level((gpio_num_t)CLK, 0);
}

static void hub75_latch_row(uint8_t lvl) // LAT pin
{
    gpio_set_level((gpio_num_t)LAT, lvl);
}

static void hub75_swap_pixels(uint8_t lvl) // OE pin
{
    gpio_set_level((gpio_num_t)OE, lvl);
}


PS: the results from my previous post are with only the most significant bit of each color byte used. My topic is not meant to say that the P4 has bad performance or sucks, just to compare it with the S3 using some random code, in this case mostly the GPIO API.

ESP_igrr
Posts: 2072
Joined: Tue Dec 01, 2015 8:37 am

Re: esp32 P4 poor performance?

Postby ESP_igrr » Mon Sep 02, 2024 7:15 am

chegewara wrote: so i am still a bit confused, especially it has impact on fps on S3.
I am not sure why adding this flag makes any difference on S3, either. Maybe you can try to log the pointer values returned by heap_caps_calloc with and without MALLOC_CAP_DMA and see if they are in the same memory region or not?
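
A minimal sketch of that pointer check, assuming the same two capability combinations as in the first post. `probe_alloc_regions` and `RegionProbe` are hypothetical names, and the `#else` branch is only a host-side stand-in (with arbitrary flag values) so the snippet also compiles off-target; on the chip the real `esp_heap_caps.h` is used.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <cstdlib>

#ifdef ESP_PLATFORM
#include "esp_heap_caps.h"
#else
// Host-side stand-ins (assumption): arbitrary flag values, plain calloc/free.
constexpr uint32_t MALLOC_CAP_DMA = 1u << 0;
constexpr uint32_t MALLOC_CAP_INTERNAL = 1u << 1;
inline void* heap_caps_calloc(size_t n, size_t size, uint32_t /*caps*/) { return std::calloc(n, size); }
inline void heap_caps_free(void* p) { std::free(p); }
#endif

// Addresses returned for the two allocation variants.
struct RegionProbe {
    uintptr_t with_dma;
    uintptr_t without_dma;
};

// Allocates one buffer per variant, logs both addresses, then frees them.
inline RegionProbe probe_alloc_regions(size_t count)
{
    void* a = heap_caps_calloc(count, 4, MALLOC_CAP_DMA | MALLOC_CAP_INTERNAL);
    void* b = heap_caps_calloc(count, 4, MALLOC_CAP_INTERNAL);
    std::printf("with MALLOC_CAP_DMA:    %p\n", a);
    std::printf("without MALLOC_CAP_DMA: %p\n", b);
    RegionProbe result{reinterpret_cast<uintptr_t>(a), reinterpret_cast<uintptr_t>(b)};
    if (a) heap_caps_free(a);
    if (b) heap_caps_free(b);
    return result;
}
```

Comparing the two printed addresses against the chip's memory map should show whether both buffers come from the same internal RAM region.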

By the way, on P4 we can also use a hardware interface to drive RGB panels, you can check this example: https://github.com/espressif/esp-idf/tr ... led_matrix

chegewara
Posts: 2378
Joined: Wed Jun 14, 2017 9:00 pm

Re: esp32 P4 poor performance?

Postby chegewara » Mon Sep 02, 2024 11:53 am

ESP_igrr wrote:
Mon Sep 02, 2024 7:15 am
chegewara wrote: so i am still a bit confused, especially it has impact on fps on S3.
I am not sure why adding this flag makes any difference on S3, either. Maybe you can try to log the pointer values returned by heap_caps_calloc with and without MALLOC_CAP_DMA and see if they are in the same memory region or not?

By the way, on P4 we can also use a hardware interface to drive RGB panels, you can check this example: https://github.com/espressif/esp-idf/tr ... led_matrix
Thanks, I will check that example for sure. Earlier I was using this library with esp-idf too:
https://github.com/mrcodetastic/ESP32-H ... xPanel-DMA

About the DMA flag: it was just an assumption after the initial tests. In the end I figured out that it's the CPU clock that is the game changer.
With the standard 160MHz on the S3 we get results similar to the P4, but increasing the S3 clock to 240MHz gives an over 33% speed boost with the GPIO control API. So an S3 at 160MHz and a P4 at 360MHz give the same results with the GPIO API.
Also, the boost on the S3 shows up only with a single matrix panel. When I chained 4, the results were similar on both CPUs; the P4 is maybe 10% faster.
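
The arithmetic behind these comparisons can be written down as a small sketch. The helper names are hypothetical; the inputs are the fps and clock figures reported in this thread.

```cpp
// Relative gain of `improved` over `baseline`, in percent.
inline double percent_gain(double baseline, double improved)
{
    return (improved - baseline) / baseline * 100.0;
}

// Frames per second obtained per MHz of CPU clock.
inline double fps_per_mhz(double fps, double clock_mhz)
{
    return fps / clock_mhz;
}
```

With the single-panel numbers, `percent_gain(250.0, 334.0)` is about 33.6, while the clock step from 160MHz to 240MHz is `percent_gain(160.0, 240.0)`, i.e. 50. `fps_per_mhz(250.0, 360.0)` for the P4 comes out well below `fps_per_mhz(250.0, 160.0)` for the S3, which is the surprising part of the GPIO-bound test.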

On all ESP32 chips but the P4 we can set the CPU clock over a very wide range, and it will be interesting to see how that eventually impacts the P4 when we have that option too.
chegewara wrote: Conclusion:
- it is weird that an S3 overclocked to 240MHz is almost as fast as a P4 clocked at 360MHz
- overclocking the S3 has a significant impact on GPIO pins when using DMA memory buffers
- hopefully we will see a similar impact on the P4 when we get a chance to run it at 400MHz
Additionally, adding some memory-copying code shows that the P4 is actually faster, as it is supposed to be, so in the end the problem was the poor quality of my test.
chegewara wrote: Another couple tests later:
- it looks like the limitation in my previous tests was the speed at which the ESP32 can set GPIO pin levels; after adding a loop that copies memory 100x I get better test results
- after all, the P4 is about 2.5 times faster than the S3, which is expected given the CPU clock speeds (160 vs 360) and other parameters
Nevertheless, I am very happy to have 2 P4 boards in my hands, and I am having a lot of fun playing with them. That's why I am trying to share my findings.

BTW, there is no TRM or datasheet for the P4 available yet, but I found an old datasheet on the internet which suggests that the P4 has 2 USB ports: one USB 2.0 full/low speed and one USB 2.0 high speed.

Demirug
Posts: 11
Joined: Fri May 28, 2021 12:54 pm

Re: esp32 P4 poor performance?

Postby Demirug » Mon Sep 02, 2024 1:38 pm

Bit-banging was always a weak point of the ESP32 chips. Therefore, as ESP_igrr said, it is always the best solution if you can use one of the peripheral subsystems to do the work for you. We used the LCD controller of the S3 with great success to drive old 15kHz arcade CRTs (even if it is called LCD). But if you really need to bit-bang things, there are ways to make things faster.

1. Read/write multiple pins at once rather than each one separately. This way you pay the overhead only once.
2. Some ESP32 chips (like S2, S3, P4) have dedicated GPIO support. This allows a subset of the GPIO pins to be read/written much faster than with the regular GPIO functions. Unfortunately, you can do this only with a limited number of pins, so if you need many signals it's not an option.
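
Point 1 can be sketched as follows: instead of six separate `gpio_set_level` calls per pixel pair, compute one "set" mask and one "clear" mask and hand each to the hardware in a single write. The pin numbers and the names `Rgb`, `PinMasks` and `pack_rgb_bit` are hypothetical, and the sketch assumes all six pins sit below GPIO 32 so they fit one 32-bit word. Only the mask computation is shown; the actual register write (for example to write-1-to-set / write-1-to-clear style GPIO output registers, where the chip provides them) is left out.

```cpp
#include <cstdint>

// Hypothetical pin assignment; all pins below 32 so one 32-bit mask covers them.
constexpr unsigned PIN_R0 = 4, PIN_G0 = 5, PIN_B0 = 6;
constexpr unsigned PIN_R1 = 7, PIN_G1 = 15, PIN_B1 = 16;

constexpr uint32_t ALL_RGB_PINS =
    (1u << PIN_R0) | (1u << PIN_G0) | (1u << PIN_B0) |
    (1u << PIN_R1) | (1u << PIN_G1) | (1u << PIN_B1);

struct Rgb {
    uint8_t r, g, b;
};

struct PinMasks {
    uint32_t set;   // pins to drive high
    uint32_t clear; // pins to drive low
};

// Packs bit `bit` of both pixels' color channels into set/clear masks.
inline PinMasks pack_rgb_bit(const Rgb& top, const Rgb& bottom, uint8_t bit)
{
    uint32_t set = 0;
    set |= static_cast<uint32_t>((top.r >> bit) & 1u) << PIN_R0;
    set |= static_cast<uint32_t>((top.g >> bit) & 1u) << PIN_G0;
    set |= static_cast<uint32_t>((top.b >> bit) & 1u) << PIN_B0;
    set |= static_cast<uint32_t>((bottom.r >> bit) & 1u) << PIN_R1;
    set |= static_cast<uint32_t>((bottom.g >> bit) & 1u) << PIN_G1;
    set |= static_cast<uint32_t>((bottom.b >> bit) & 1u) << PIN_B1;
    return {set, ALL_RGB_PINS & ~set};
}
```

The per-call overhead of the GPIO driver is then paid twice per pixel pair instead of six times.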

chegewara
Posts: 2378
Joined: Wed Jun 14, 2017 9:00 pm

Re: esp32 P4 poor performance?

Postby chegewara » Mon Sep 02, 2024 1:44 pm

Demirug wrote:
Mon Sep 02, 2024 1:38 pm
Bit-banging was always a weak point of the ESP32 chips. Therefore, as ESP_igrr said, it is always the best solution if you can use one of the peripheral subsystems to do the work for you. We used the LCD controller of the S3 with great success to drive old 15kHz arcade CRTs (even if it is called LCD). But if you really need to bit-bang things, there are ways to make things faster.

1. Read/write multiple pins at once rather than each one separately. This way you pay the overhead only once.
2. Some ESP32 chips (like S2, S3, P4) have dedicated GPIO support. This allows a subset of the GPIO pins to be read/written much faster than with the regular GPIO functions. Unfortunately, you can do this only with a limited number of pins, so if you need many signals it's not an option.
Yet again: I'm not saying that the ESP32 is bad because I have poor results. I am not expecting any particular results from bit-banging code.

The point is that the ESP32-S3 at 160MHz has similar results to the P4 at 360MHz, and the S3 at 240MHz performs much better.
It's just an observation, nothing more.

username
Posts: 538
Joined: Thu May 03, 2018 1:18 pm

Re: esp32 P4 poor performance?

Postby username » Mon Sep 02, 2024 11:20 pm

Yet again: I'm not saying that the ESP32 is bad because I have poor results. I am not expecting any particular results from bit-banging code.

The point is that the ESP32-S3 at 160MHz has similar results to the P4 at 360MHz, and the S3 at 240MHz performs much better.
It's just an observation, nothing more.
FWIW, I get what you're doing. I would be doing the same tests, and would be curious as to the why as well.

chegewara
Posts: 2378
Joined: Wed Jun 14, 2017 9:00 pm

Re: esp32 P4 performance

Postby chegewara » Wed Sep 04, 2024 3:33 am

Ethernet PHY test with iperf and PSRAM 200MHz enabled:
Test downlink bandwidth with this command on my PC (just in case)

Code: Select all

iperf -u -c 192.168.0.209 -b 1000M -t 30 -i 3 # <---- 1000M

Code: Select all

iperf> iperf -u -s -i 3
I (15986) IPERF: mode=udp-server sip=localhost:5001, dip=0.0.0.0:5001, interval=3, time=30
I (15986) iperf: Socket created
I (15996) iperf: Socket bound, port 35091
iperf> 
Interval       Bandwidth
 0.0- 3.0 sec  91.17 Mbits/sec
 3.0- 6.0 sec  91.25 Mbits/sec
 6.0- 9.0 sec  91.27 Mbits/sec
 9.0-12.0 sec  91.25 Mbits/sec
12.0-15.0 sec  91.27 Mbits/sec
15.0-18.0 sec  91.25 Mbits/sec
18.0-21.0 sec  91.26 Mbits/sec
21.0-24.0 sec  91.25 Mbits/sec
24.0-27.0 sec  91.27 Mbits/sec
27.0-30.0 sec  91.25 Mbits/sec
 0.0-30.0 sec  91.25 Mbits/sec
I (49256) iperf: Udp socket server is closed.
I (49256) iperf: iperf exit
Test uplink bandwidth

Code: Select all

Interval       Bandwidth
 0.0- 3.0 sec  90.76 Mbits/sec
 3.0- 6.0 sec  91.26 Mbits/sec
 6.0- 9.0 sec  91.26 Mbits/sec
 9.0-12.0 sec  91.27 Mbits/sec
12.0-15.0 sec  83.05 Mbits/sec
15.0-18.0 sec  91.26 Mbits/sec
18.0-21.0 sec  91.26 Mbits/sec
21.0-24.0 sec  87.31 Mbits/sec
24.0-27.0 sec  91.26 Mbits/sec
27.0-30.0 sec  91.27 Mbits/sec
 0.0-30.0 sec  90.00 Mbits/sec
I (55317) iperf: UDP Socket client is closed
I (55317) iperf: iperf exit

For comparison, the same tests without PSRAM enabled (downlink, then uplink):

Code: Select all

iperf> iperf -u -s -i 3
I (13827) IPERF: mode=udp-server sip=localhost:5001, dip=0.0.0.0:5001, interval=3, time=30
I (13827) iperf: Socket created
I (13837) iperf: Socket bound, port 35091
iperf> 
Interval       Bandwidth
 0.0- 3.0 sec  82.03 Mbits/sec
 3.0- 6.0 sec  83.89 Mbits/sec
 6.0- 9.0 sec  84.33 Mbits/sec
 9.0-12.0 sec  83.84 Mbits/sec
12.0-15.0 sec  83.87 Mbits/sec
15.0-18.0 sec  83.86 Mbits/sec
18.0-21.0 sec  83.71 Mbits/sec
21.0-24.0 sec  83.76 Mbits/sec
24.0-27.0 sec  84.16 Mbits/sec
27.0-30.0 sec  83.91 Mbits/sec
 0.0-30.0 sec  83.74 Mbits/sec
I (45137) iperf: Udp socket server is closed.

Code: Select all

Interval       Bandwidth
 0.0- 3.0 sec  91.15 Mbits/sec
 3.0- 6.0 sec  91.27 Mbits/sec
 6.0- 9.0 sec  91.26 Mbits/sec
 9.0-12.0 sec  91.26 Mbits/sec
12.0-15.0 sec  91.26 Mbits/sec
15.0-18.0 sec  91.26 Mbits/sec
18.0-21.0 sec  87.74 Mbits/sec
21.0-24.0 sec  91.26 Mbits/sec
24.0-27.0 sec  91.26 Mbits/sec
27.0-30.0 sec  91.27 Mbits/sec
 0.0-30.0 sec  90.90 Mbits/sec
I (171887) iperf: UDP Socket client is closed
I (171887) iperf: iperf exit
I do not have an S3 board with RMII to compare with, sorry.

EDIT: for test completeness, because I don't know how important packet loss is in this test:
- with PSRAM

Code: Select all

[ ID] Interval       Transfer     Bandwidth        Jitter   Lost/Total Datagrams
[  1] 0.0000-3.0000 sec  34.2 MBytes  95.7 Mbits/sec   0.070 ms 25350/49755 (51%)
[  1] 3.0000-6.0000 sec  34.2 MBytes  95.7 Mbits/sec   0.076 ms 24270/48685 (50%)
[  1] 6.0000-9.0000 sec  34.2 MBytes  95.7 Mbits/sec   0.073 ms 24143/48559 (50%)
[  1] 9.0000-12.0000 sec  34.2 MBytes  95.6 Mbits/sec   0.075 ms 24089/48482 (50%)
[  1] 12.0000-15.0000 sec  34.1 MBytes  95.3 Mbits/sec   0.088 ms 24328/48641 (50%)
[  1] 15.0000-18.0000 sec  34.2 MBytes  95.7 Mbits/sec   0.068 ms 24234/48656 (50%)
[  1] 18.0000-21.0000 sec  34.2 MBytes  95.6 Mbits/sec   0.067 ms 24305/48701 (50%)
[  1] 21.0000-24.0000 sec  34.2 MBytes  95.7 Mbits/sec   0.120 ms 24105/48512 (50%)
[  1] 24.0000-27.0000 sec  34.2 MBytes  95.7 Mbits/sec   0.070 ms 24301/48726 (50%)
[  1] 27.0000-30.0000 sec  34.2 MBytes  95.7 Mbits/sec   0.065 ms 24179/48586 (50%)

- no PSRAM

Code: Select all

[ ID] Interval       Transfer     Bandwidth        Jitter   Lost/Total Datagrams
[  1] 0.0000-3.0000 sec  34.5 MBytes  96.5 Mbits/sec   0.066 ms 12547/37163 (34%)
[  1] 3.0000-6.0000 sec  34.2 MBytes  95.7 Mbits/sec   0.087 ms 12624/37035 (34%)
[  1] 6.0000-9.0000 sec  34.2 MBytes  95.7 Mbits/sec   0.066 ms 12531/36950 (34%)
[  1] 9.0000-12.0000 sec  34.2 MBytes  95.6 Mbits/sec   0.081 ms 12439/36838 (34%)
[  1] 12.0000-15.0000 sec  34.2 MBytes  95.7 Mbits/sec   0.065 ms 12482/36898 (34%)
[  1] 15.0000-18.0000 sec  34.2 MBytes  95.7 Mbits/sec   0.069 ms 12450/36875 (34%)
[  1] 18.0000-21.0000 sec  32.9 MBytes  92.0 Mbits/sec   0.120 ms 13710/37173 (37%)
[  1] 21.0000-24.0000 sec  34.2 MBytes  95.7 Mbits/sec   0.066 ms 12481/36890 (34%)
[  1] 24.0000-27.0000 sec  34.2 MBytes  95.7 Mbits/sec   0.077 ms 12479/36886 (34%)
[  1] 27.0000-30.0000 sec  33.9 MBytes  94.8 Mbits/sec   0.169 ms 12362/36540 (34%)
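
To compare the two runs, the Lost/Total column can be turned into a received-datagram count and a loss percentage. This is a small sketch with hypothetical names (`DatagramStats`, `summarize_interval`); the inputs are the lost and total counts as printed by iperf.

```cpp
#include <cstdint>

struct DatagramStats {
    uint32_t received;   // datagrams that arrived in the interval
    double loss_percent; // lost / total, in percent
};

// Derives received count and loss percentage from one iperf report row.
inline DatagramStats summarize_interval(uint32_t lost, uint32_t total)
{
    if (total == 0 || lost > total) {
        return {0, 0.0}; // malformed or empty row
    }
    return {total - lost, 100.0 * lost / total};
}
```

For the first row of each table this gives 24405 received datagrams (about 51% loss) with PSRAM and 24616 received datagrams (about 34% loss) without PSRAM. At least for those rows, the number of datagrams that actually arrive is nearly the same, and the difference in loss percentage comes mostly from the different totals.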
