RAM & Flash cache efficiency
Hi. I've looked around a bit, but I couldn't find any indication of how much actual RAM/flash the 32K of cache represents. I assume the cache needs to reserve space for metadata, such as the page source address and some kind of access tree, so I'd like a better idea of how many real KB can be cached per KB of cache, and whether contiguous blocks are cached better than randomly accessed ones. Or perhaps that's all covered by "0x3FF1_0000 ~ 0x3FF1_3FFF 16 KB Cache MMU Table"? I'd also like to confirm whether the cached block size is 16 bytes, as I've read somewhere. Thank you!
Re: RAM & Flash cache efficiency
From what I remember, the cache tag memory (the 'metadata' as you put it) is a separate bit of memory, the 32K is used in full for the actual data. Cache lines are 32 bytes.
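As a quick sanity check of those figures, here is a minimal sketch of the arithmetic. It assumes the 32 KB data area and 32-byte lines mentioned above; the macro and function names are made up for illustration and are not part of any ESP-IDF API.

```c
#include <stddef.h>
#include <stdint.h>

/* Figures quoted in this thread for the original ESP32; treat them as assumptions. */
#define CACHE_SIZE_BYTES ((size_t)32 * 1024)
#define CACHE_LINE_BYTES ((size_t)32)

/* Number of lines the cache can hold if all of it is used for data
   (tag memory being separate, as described above). */
static size_t cache_line_count(size_t cache_bytes, size_t line_bytes) {
    return cache_bytes / line_bytes;
}

/* Round an address down to the start of the cache line that contains it. */
static uintptr_t cache_line_base(uintptr_t addr, size_t line_bytes) {
    return addr - (addr % line_bytes);
}
```

With these numbers the cache holds 1024 lines, and any access falls into the line starting at `cache_line_base(addr, CACHE_LINE_BYTES)`.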
Re: RAM & Flash cache efficiency
Thank you. Does it make any difference if the lines are contiguous? I mean, if I access 2KB of contiguous memory vs. 2KB comprised of randomly located 32 byte chunks (as long as the chunks are properly aligned, of course)?
Re: RAM & Flash cache efficiency
No, a chunk is a chunk, and the ESP32 cache does not do prefetching, so randomly accessing aligned 32-byte chunks is as fast as accessing them linearly. (Note that our cache designs have been improved in the meantime; this may no longer be true for chips we release in the future.)
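To illustrate why alignment is the part that matters, here is a small sketch that counts how many cache lines a byte range covers. It assumes 32-byte lines as stated above; `cache_lines_touched` is a hypothetical helper, not an ESP-IDF function.

```c
#include <stddef.h>
#include <stdint.h>

#define CACHE_LINE_BYTES ((size_t)32) /* line size quoted in this thread */

/* Count the distinct cache lines covered by the byte range [addr, addr + len). */
static size_t cache_lines_touched(uintptr_t addr, size_t len) {
    if (len == 0) {
        return 0;
    }
    uintptr_t first = addr / CACHE_LINE_BYTES;
    uintptr_t last = (addr + len - 1) / CACHE_LINE_BYTES;
    return (size_t)(last - first + 1);
}
```

An aligned 2 KB region covers 64 lines whether it is one contiguous block or 64 scattered aligned chunks. A 32-byte chunk that starts in the middle of a line straddles two lines and therefore costs two fills.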
Re: RAM & Flash cache efficiency
Is it somehow possible to hide the cache access latency? For instance, I'd like to walk over a larger piece of data that is stored in ext. QSPI ROM/RAM and mapped via MMU/cache, and process that data. It could walk the data in cache line sizes (32 bytes). But every time there is a cache miss, things have to wait for the QSPI transaction which is slow. Is it possible to issue a cache prefetch somehow so that the next cache line gets prefetched while the current one is processed? On other systems it's usually accessible via GCC's __builtin_prefetch ... is there anything on ESP32 for that?
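For reference, the kind of walk I have in mind looks roughly like the sketch below. It is plain C with made-up names (`walk_by_cache_line`, `block_fn`, `sum_block`), and it assumes the buffer starts on a 32-byte boundary so that each block maps to exactly one cache line.

```c
#include <stddef.h>
#include <stdint.h>

#define CACHE_LINE_BYTES ((size_t)32) /* line size quoted in this thread */

/* Callback invoked once per cache-line-sized block. */
typedef void (*block_fn)(const uint8_t *block, size_t len, void *ctx);

/* Walk a buffer in cache-line-sized steps and return the number of blocks visited.
   The final block may be shorter than a full line. */
static size_t walk_by_cache_line(const uint8_t *data, size_t size, block_fn fn, void *ctx) {
    size_t blocks = 0;
    for (size_t off = 0; off < size; off += CACHE_LINE_BYTES) {
        size_t len = size - off;
        if (len > CACHE_LINE_BYTES) {
            len = CACHE_LINE_BYTES;
        }
        fn(data + off, len, ctx);
        ++blocks;
    }
    return blocks;
}

/* Example callback: add up the bytes of a block into a uint32_t accumulator. */
static void sum_block(const uint8_t *block, size_t len, void *ctx) {
    uint32_t *sum = ctx;
    for (size_t i = 0; i < len; ++i) {
        *sum += block[i];
    }
}
```

Each call to the callback is where a cache miss would stall on the QSPI transfer, which is the latency I would like to overlap with the processing of the previous block.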
Re: RAM & Flash cache efficiency
No, sorry, the ESP32 does not have a prefetch mechanism to my knowledge.
Re: RAM & Flash cache efficiency
Would it work to access the first byte of each 32-byte segment of the chunk that needs to be prefetched?
Something along these lines (untested):
[Codebox]
#include <stdlib.h>
#include <stdint.h>
#include <stddef.h>
#include "esp_heap_caps.h"

#define SPIRAM_DATA_CACHE_SZ ((size_t)32768)
#define SPIRAM_DATA_CACHE_LINE_SZ ((size_t)32)
#define SPIRAM_DATA_CACHE_PREFETCH_LIMIT ((size_t)16384) // Let's not clog the whole cache at once
#define MIN(a, b) ((a) < (b) ? (a) : (b))

// Touch one byte per cache line so that each full line of the range gets loaded.
void spiram_cache_prefetch(const char *addr, size_t sz) {
    static volatile char byte = 0; // Volatile sink so the reads are not optimized away
    const size_t lines = MIN(sz, SPIRAM_DATA_CACHE_PREFETCH_LIMIT) / SPIRAM_DATA_CACHE_LINE_SZ;
    for (size_t ln = 0; ln < lines; ++ln) {
        // Intentionally not using pointer arithmetic as gcc optimizes array access better
        byte = addr[ln * SPIRAM_DATA_CACHE_LINE_SZ];
    }
}

void spiram_cache_prefetch_test(void) {
    char *buf = heap_caps_malloc(20000, MALLOC_CAP_8BIT | MALLOC_CAP_SPIRAM);
    if (buf == NULL) {
        return; // Allocation from SPIRAM failed
    }
    spiram_cache_prefetch(buf, 20000);
    heap_caps_free(buf);
}
[/Codebox]
I'm wondering because this might be a really big boost for linear access across bigger chunks (like a framebuffer).
Re: RAM & Flash cache efficiency
Well, you could do that, but it wouldn't be prefetching, it would just be fetching: your core would still be hung up until the entire 32-byte cache line is retrieved from PSRAM.
Re: RAM & Flash cache efficiency
ESP_Sprite wrote: ↑Thu Aug 12, 2021 10:01 am
Well, you could do that, but it wouldn't be prefetching, it would just be fetching: your core would still be hung up until the entire 32-byte cache line is retrieved from PSRAM.
You're right. I've checked, and the ESP32 silicon does not implement any of the Xtensa prefetch instructions. I wonder if this could somehow be simulated in assembly by delaying the cache sync/memw instructions?
By the way, I've found some cache invalidation functions in the Xtensa code. I wonder if they are usable, so that I could at least invalidate data I've already processed and avoid pushing everything else out of the cache.
Too bad the async memcpy on the S2 doesn't work with PSRAM, as that would at least allow explicitly prefetching data into the heap.
Re: RAM & Flash cache efficiency
I think one way this "prefetching" might be possible to implement on an ESP32 would be to use an interrupt to cancel the fetch instruction.
Something along these lines:
(initialization):
1. create an interrupt handler for CCOMPARE2 interrupt, that's a CPU-internal interrupt with level 5. There is some description on how to set up high-level interrupts in ESP-IDF here: https://docs.espressif.com/projects/esp ... rupts.html.
(when you want to prefetch)
1. enter a critical section (disable low- and medium-level interrupts)
2. set up CCOMPARE2 to trigger an interrupt a few CPU cycles in the future
3. read from the address you want to fetch
4. the CPU sends the request to the cache
5. a few cycles later, the CCOMPARE2 interrupt triggers
6. in the level 5 interrupt handler, you need to change the return address to point to the next instruction (EPC5 += 3) and return
7. exit the critical section
The theory behind this is that the cache will still perform the fetch but the CPU will not have to wait for it to complete. It's entirely untested, though, and I apologize in advance in case this won't work for some reason I'm not seeing now.