RAM & Flash cache efficiency

guillep2k
Posts: 13
Joined: Tue Feb 12, 2019 8:39 pm

RAM & Flash cache efficiency

Postby guillep2k » Mon Feb 18, 2019 1:41 pm

Hi. I've looked around a bit, but I couldn't find any indication of how much actual RAM/Flash the 32K cache can hold. I assume the cache needs to reserve space for metadata, such as the page source address or some kind of access tree, so I'd like a better idea of how many real KB can be cached per KB of cache, and whether contiguous blocks are cached better than randomly accessed ones. Or perhaps that's all covered by "0x3FF1_0000 ~ 0x3FF1_3FFF 16 KB Cache MMU Table"? I'd also like to confirm whether the cached block size is 16 bytes, as I've read somewhere. Thank you!

ESP_Sprite
Posts: 9766
Joined: Thu Nov 26, 2015 4:08 am

Re: RAM & Flash cache efficiency

Postby ESP_Sprite » Tue Feb 19, 2019 2:59 am

From what I remember, the cache tag memory (the 'metadata', as you put it) is a separate bit of memory; the 32K is used in full for the actual data. Cache lines are 32 bytes.
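To put numbers on that, here is a minimal sketch of the arithmetic, assuming only the figures given above (32K of data, 32-byte lines). The helper names cache_line_count and lines_touched are made up for this illustration and are not part of ESP-IDF.

```c
#include <stddef.h>

/* Figures from the post above: 32K of cache data, 32-byte lines. */
#define CACHE_DATA_BYTES ((size_t)32 * 1024)
#define CACHE_LINE_BYTES ((size_t)32)

/* Number of lines the cache can hold at once (hypothetical helper). */
static size_t cache_line_count(void)
{
    return CACHE_DATA_BYTES / CACHE_LINE_BYTES;
}

/* Number of cache lines covered by an access of len bytes starting at
 * byte address addr (hypothetical helper). An unaligned access can
 * straddle one more line than an aligned access of the same length. */
static size_t lines_touched(size_t addr, size_t len)
{
    if (len == 0) {
        return 0;
    }
    size_t first = addr / CACHE_LINE_BYTES;
    size_t last = (addr + len - 1) / CACHE_LINE_BYTES;
    return last - first + 1;
}
```

With these figures the cache holds 1024 lines, an aligned 2 KB buffer covers 64 of them, and a 32-byte read that starts at offset 16 covers two lines instead of one.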

guillep2k
Posts: 13
Joined: Tue Feb 12, 2019 8:39 pm

Re: RAM & Flash cache efficiency

Postby guillep2k » Tue Feb 19, 2019 4:58 am

Thank you. Does it make any difference if the lines are contiguous? I mean, if I access 2KB of contiguous memory vs. 2KB made up of randomly located 32-byte chunks (as long as the chunks are properly aligned, of course)?

ESP_Sprite
Posts: 9766
Joined: Thu Nov 26, 2015 4:08 am

Re: RAM & Flash cache efficiency

Postby ESP_Sprite » Wed Feb 20, 2019 3:00 am

No, a chunk is a chunk and the ESP32 cache does not do pre-fetching, so randomly accessing aligned 32-byte chunks is as fast as accessing them linearly. (Note that our cache designs have been improved in the meantime; this may no longer be true for chips we release in the future.)
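A toy model of that statement, as a sketch only: with no prefetching, a cold cache takes exactly one miss per distinct 32-byte line, whatever the order of the accesses. The 64 KB address window, the stride of 37 lines, and all function names are assumptions made for this illustration. The model ignores evictions and associativity, which is reasonable for 2 KB of data in a 32K cache but is not a description of the real hardware.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define LINE_BYTES 32u           /* cache line size stated in the thread */
#define MODEL_SPAN_BYTES 65536u  /* hypothetical address window for the model */
#define MODEL_LINES (MODEL_SPAN_BYTES / LINE_BYTES)

/* Count cold-cache misses for a sequence of byte addresses: the first
 * access to each 32-byte line is a miss, later accesses to it are hits. */
static size_t count_cold_misses(const uint32_t *addrs, size_t n)
{
    uint8_t seen[MODEL_LINES];
    size_t misses = 0;
    memset(seen, 0, sizeof seen);
    for (size_t i = 0; i < n; ++i) {
        assert(addrs[i] < MODEL_SPAN_BYTES);
        uint32_t line = addrs[i] / LINE_BYTES;
        if (!seen[line]) {
            seen[line] = 1;
            ++misses;
        }
    }
    return misses;
}

/* Start addresses of n back-to-back lines. */
static void fill_contiguous(uint32_t *out, size_t n)
{
    for (size_t i = 0; i < n; ++i) {
        out[i] = (uint32_t)i * LINE_BYTES;
    }
}

/* Start addresses of n scattered, line-aligned chunks. A stride of 37
 * lines is coprime with MODEL_LINES (2048), so no line repeats for
 * n <= MODEL_LINES. */
static void fill_scattered(uint32_t *out, size_t n)
{
    for (size_t i = 0; i < n; ++i) {
        out[i] = (uint32_t)((i * 37u) % MODEL_LINES) * LINE_BYTES;
    }
}
```

Feeding 64 contiguous line addresses (2 KB) and 64 scattered ones through count_cold_misses gives the same count in both cases, which is the point made above: the cost depends on how many lines are touched, not on where they are.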

Oleg Endo
Posts: 18
Joined: Fri Sep 28, 2018 1:48 pm

Re: RAM & Flash cache efficiency

Postby Oleg Endo » Thu Jul 18, 2019 2:36 am

Is it somehow possible to hide the cache access latency? For instance, I'd like to walk over a larger piece of data that is stored in external QSPI ROM/RAM and mapped via MMU/cache, and process that data. It could walk the data in cache-line-sized steps (32 bytes). But every time there is a cache miss, things have to wait for the QSPI transaction, which is slow. Is it possible to issue a cache prefetch somehow, so that the next cache line gets prefetched while the current one is processed? On other systems this is usually accessible via GCC's __builtin_prefetch ... is there anything on ESP32 for that?

ESP_Sprite
Posts: 9766
Joined: Thu Nov 26, 2015 4:08 am

Re: RAM & Flash cache efficiency

Postby ESP_Sprite » Thu Jul 18, 2019 6:26 am

No, sorry, the ESP32 does not have a prefetch mechanism to my knowledge.

pinkeen
Posts: 6
Joined: Mon Jul 12, 2021 2:55 pm

Re: RAM & Flash cache efficiency

Postby pinkeen » Thu Aug 12, 2021 4:55 am

Would it work to access the first byte of each 32-byte segment of the chunk that needs to be prefetched?

Something along these lines (untested):

[Codebox]
#include <stdlib.h>
#include <stdint.h>
#include <stddef.h>
#include "esp_heap_caps.h"

#define SPIRAM_DATA_CACHE_SZ ((size_t)32768)
#define SPIRAM_DATA_CACHE_LINE_SZ ((size_t)32)
#define SPIRAM_DATA_CACHE_PREFETCH_LIMIT ((size_t)16384) // Let's not clog the whole cache at once
#define MIN(a, b) ((a) < (b) ? (a) : (b))

// Touch one byte in each cache line of [addr, addr + sz) to pull it into the cache
void spiram_cache_prefetch(const char *addr, size_t sz) {
    static volatile char byte = 0; // Volatile sink so the reads are not optimized away
    // Round up so that a trailing partial line is touched as well
    const size_t lines = (MIN(sz, SPIRAM_DATA_CACHE_PREFETCH_LIMIT) + SPIRAM_DATA_CACHE_LINE_SZ - 1) / SPIRAM_DATA_CACHE_LINE_SZ;
    for (size_t ln = 0; ln < lines; ++ln) {
        // Intentionally not using pointer arithmetic as gcc optimizes array access better
        byte = addr[ln * SPIRAM_DATA_CACHE_LINE_SZ];
    }
}

void spiram_cache_prefetch_test(void) {
    char *buf = heap_caps_malloc(20000, MALLOC_CAP_8BIT | MALLOC_CAP_SPIRAM);
    if (buf == NULL) {
        return; // Allocation from PSRAM failed
    }
    spiram_cache_prefetch(buf, 20000);
    heap_caps_free(buf);
}
[/Codebox]

I'm wondering because this might be a really big boost for linear access performed across bigger chunks (like a framebuffer ;)).

ESP_Sprite
Posts: 9766
Joined: Thu Nov 26, 2015 4:08 am

Re: RAM & Flash cache efficiency

Postby ESP_Sprite » Thu Aug 12, 2021 10:01 am

Well, you could do that, but it wouldn't be prefetching, it would just be fetching: your core would still be hung up until the entire 32-byte cache line is retrieved from PSRAM.
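A back-of-the-envelope model of why that is, as a sketch under stated assumptions: the miss and work cycle counts are made-up parameters, not measured ESP32 figures, and the function names are invented for this illustration. It compares processing with blocking misses, a touch pass followed by processing (assuming every touched line is still cached afterwards), and hypothetical hardware where the fetch of the next line overlaps processing of the current one.

```c
#include <stdint.h>

/* Every miss stalls the core: fetch line, process it, repeat. */
static uint64_t cycles_blocking(uint32_t n, uint32_t miss, uint32_t work)
{
    return (uint64_t)n * ((uint64_t)miss + work);
}

/* Touch pass first, then process. The stalls are moved into the touch
 * pass, not hidden, so the total is unchanged. */
static uint64_t cycles_touch_then_process(uint32_t n, uint32_t miss, uint32_t work)
{
    return (uint64_t)n * miss + (uint64_t)n * work;
}

/* Hypothetical true prefetch: while line i is processed, line i + 1 is
 * fetched, so each middle step costs the larger of the two. */
static uint64_t cycles_overlapped(uint32_t n, uint32_t miss, uint32_t work)
{
    if (n == 0) {
        return 0;
    }
    uint64_t step = miss > work ? miss : work;
    return (uint64_t)miss + (uint64_t)(n - 1) * step + work;
}
```

For example, with 64 lines and the hypothetical values miss = 100 and work = 60 cycles, the first two functions both return 10240, while the overlapped case returns 6460. Only a fetch that proceeds while the core keeps working would bring the total down.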

pinkeen
Posts: 6
Joined: Mon Jul 12, 2021 2:55 pm

Re: RAM & Flash cache efficiency

Postby pinkeen » Thu Aug 12, 2021 10:13 am

ESP_Sprite wrote:
Thu Aug 12, 2021 10:01 am
Well, you could do that, but it wouldn't be prefetching, it would just be fetching: your core would still be hung up until the entire 32-byte cache line is retrieved from PSRAM.
You're right. I've checked, and the ESP32 silicon does not implement any of the Xtensa prefetch instructions. I wonder if this could somehow be simulated in ASM by delaying the cache sync/memw instructions?

By the way, I've found some cache invalidation functions in the Xtensa code. I wonder whether they are usable, so that I could at least invalidate already-processed data and keep it from pushing everything else out of the cache.

Too bad the async memcpy on the S2 doesn't work with PSRAM, as this would at least allow explicitly prefetching data into the heap.

ESP_igrr
Posts: 2072
Joined: Tue Dec 01, 2015 8:37 am

Re: RAM & Flash cache efficiency

Postby ESP_igrr » Thu Aug 12, 2021 10:30 am

I think one way this "prefetching" might be possible to implement on an ESP32 would be to use an interrupt to cancel the fetch instruction.

Something along these lines:

(initialization)
1. Create an interrupt handler for the CCOMPARE2 interrupt, which is a CPU-internal level 5 interrupt. There is some description of how to set up high-level interrupts in ESP-IDF here: https://docs.espressif.com/projects/esp ... rupts.html.

(when you want to prefetch)
1. Enter a critical section (disable low- and medium-level interrupts).
2. Set up CCOMPARE2 to trigger an interrupt a few CPU cycles in the future.
3. Read from the address you want to fetch.
4. The CPU sends the request to the cache.
5. A few cycles later, the CCOMPARE2 interrupt triggers.
6. In the level 5 interrupt handler, change the return address to point to the next instruction (EPC5 += 3) and return.
7. Exit the critical section.

The theory behind this is that the cache will still perform the fetch but the CPU will not have to wait for it to complete. It's entirely untested, though, and I apologize in advance in case this won't work for some reason I'm not seeing now.
