10x faster flash programming. Source code ESP32ROM.STUB_CODE?

RMandR
Posts: 75
Joined: Mon Oct 29, 2018 3:13 pm

Postby RMandR » Wed Feb 08, 2023 3:29 pm

Is it possible to get access to the stub code for flash programming?

Working on improving the flash programming speed by a few multiples.

The FT2232HL chip on ESP_PROG is capable of 12Mbps, but there may be other limitations in the flash programming process that result in errors such as this one at baud rates over 2.5Mbps:

A fatal error occurred: Serial data stream stopped: Possible serial noise or corruption.

At 2.5Mbps, it takes a very long time to flash an encrypted image to modules such as the ESP32-WROOM-32E-N16. Considering that the external flash chip itself can handle close to 100MHz, and compared to other chips on the market, we should be able to get a few-fold improvement for the pure flash programming part.

It would be nice to get a head start with the existing stub as opposed to starting from scratch.

ESP_Sprite
Posts: 9773
Joined: Thu Nov 26, 2015 4:08 am

Postby ESP_Sprite » Thu Feb 09, 2023 12:24 am

Yes, it's part of esptool. Be aware that running 2.5MBit over a simple serial link might be too much for the serial link hardware; in addition to that, in my experience the flash erase and write times take up a fair amount of the program time, so I'm not sure how much more there is to gain. (Source: the -S2, S3 and C3 can be programmed over USB and should in theory have the full 12MBit of the USB pipe to themselves, but they don't program that much faster.) If you do manage to speed up the process, please share; we're always interested in making the developer experience better.

RandomInternetGuy
Posts: 52
Joined: Fri Aug 11, 2023 4:56 am

Postby RandomInternetGuy » Sat Jun 29, 2024 10:24 pm

I suspect many of us are annoyed at the long flash times. The few times I poked at it (before finding this) I, too, concluded that the flash write itself was probably the bottleneck. Well, I at least concluded that the serial link wasn't the bottleneck. You can pretty trivially convince yourself that 2Mbps is nowhere near twice as fast as 1Mbps, for example. Maybe the programming protocol is kind of dumb and doesn't allow multiple packets in flight. (I *think* this is true, but if it's dominated by flash write speed, that tiny empty space of waiting for a new packet after the last one is acked is probably not a huge deal.) Even though the post-C3 line can program over JTAG at 1995 speeds (sigh), I share Sprite's observation that it's no faster in practice. We also see that other JTAG implementations - usually sharing that blazing 1.1 speed (which I won't continue to dog because it's still way faster than the fastest observed serial speeds here) - attached to common ESP32-Nothing boards (ESP32-Legacy? ESP32-WithoutAnythingAfterIt?) are still about the same.
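The point about packets in flight can be put into rough numbers. Below is a minimal sketch of a stop-and-wait model in which every block has to be written and acked before the next one is sent. The block size, per-block flash write time, and ack latency are illustrative guesses, not measurements from esptool or any particular flash chip, and the 10 bits per byte assumes plain 8N1 UART framing.

```python
import math

BITS_PER_BYTE = 10  # assumes 8N1 UART framing: start + 8 data + stop bits


def stop_and_wait_seconds(payload_bytes, baud, block_bytes=16384,
                          flash_write_s=0.05, ack_latency_s=0.001):
    """Estimate upload time when every block is acked before the next is sent.

    block_bytes, flash_write_s and ack_latency_s are illustrative guesses,
    not measured values.
    """
    blocks = math.ceil(payload_bytes / block_bytes)
    wire_s = payload_bytes * BITS_PER_BYTE / baud
    # The fixed per-block cost does not shrink when the baud rate goes up.
    return wire_s + blocks * (flash_write_s + ack_latency_s)


if __name__ == "__main__":
    for baud in (1_000_000, 2_000_000, 4_000_000):
        print(baud, round(stop_and_wait_seconds(1_000_000, baud), 2))
```

Under these assumptions, doubling the baud rate gives noticeably less than a 2x speedup, because the per-block term stays constant. If the real per-block flash time is larger than guessed here, the effect is stronger.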

The set of tires I haven't kicked, because I know it'll involve changes to the linker scripts and won't work for large programs, is changing the upload to just terminate in RAM instead of flash and execute there. Maybe not even the whole program, but at least the functions that I'm working on actively (they're kind of leaf nodes in the call tree; I'm not worried about, say, changing a structure that's a global and NEEDS the world recompiled...) could be interesting targets. For development, I'd be willing to trade off executing cached from IRAM and running from SRAM (I'm guessing that PSRAM is right out of the question unless there's some clever way to handle the "paging" from code that's nailed into IRAM). Are the parts really limited to running ONLY in IRAM? If that's the case, limiting to 128K will be a challenge - if your program is THAT small, you're probably not raging at flash speeds.

My trivial experiment in doing this resulted in a memory protection fault when trying to run code from normal RAM, so either there's no juice in this squeeze or there's some safety check I need to unlock.

I'm willing to forget about things like power states. As a developer turning code many times an hour, it's plugged in anyway.

Before I go too far into crazy-town: is this, or anything else, a viable way to reduce the amount of time spent reflashing during development? Is there some existing tool or technique out there in common use that I've not found?

Postscript of crazy thoughts while typing this:
1) Especially for the post-2020 parts: Maybe do something crazy like put an RV32 emulator in flash (hello, https://github.com/cnlohr/mini-rv32ima) and run the code you care about from RAM during development and move it to flash for final testing and shipping. Then we could "execute" (interpret) data even in PSRAM.
2) Put that "leaf node" code (in my case, the twiddly bits that the rest of the code exists to run) into a special section of flash at fixed sectors/addresses so that only THAT code has to be reflashed. (Hopefully the flash loader either knows not to reflash sectors that haven't changed, or you just run it explicitly to load the last, say, 1MB.) So you let ESP-IDF have, say, 0x000000-0x2fffff and your .section have 0x300000-0x3fffff.

If it doesn't reflash unchanged bytes, it would know only to reprogram the final 1MB that's changing. You could put your code in that specific .section and it's all good.
If it's not that smart, or there are timestamps or other annoying randomizations introduced during the rebuild, you just leave your core code unchanged (well, you'd have to decouple it like a shared library in order to get the executable to link...) and just 'esptool write_flash 0x300000 leafcode.bin'.

Making that "not that smart" bit work will be a bit of a bad dream because leafcode.bin will need to know addresses of anything in the core that it may need to call or read.

But think of my leaf code here as something like a device driver. Currently we're rebuilding (hopefully not, but we're at least relinking and re-uploading) the entire OS every time we recompile a specific driver that we're working on, and that's not great.
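The "only reflash what changed" idea can be prototyped on the host before touching any linker scripts. The sketch below compares the previous and the new image and reports which sectors differ. The 4096-byte sector size is an assumption (check the flash chip's datasheet), and the function names are made up for illustration; nothing here is part of esptool.

```python
SECTOR_SIZE = 4096  # assumed erase granularity; verify against your flash chip


def changed_sectors(old: bytes, new: bytes, sector_size: int = SECTOR_SIZE):
    """Return the indices of sectors whose contents differ between two images."""
    length = max(len(old), len(new))
    changed = []
    for index, start in enumerate(range(0, length, sector_size)):
        if old[start:start + sector_size] != new[start:start + sector_size]:
            changed.append(index)
    return changed


def merge_runs(sectors):
    """Collapse sorted sector indices into inclusive (first, last) runs."""
    runs = []
    for s in sectors:
        if runs and s == runs[-1][1] + 1:
            runs[-1] = (runs[-1][0], s)
        else:
            runs.append((s, s))
    return runs
```

Each run maps to a flash offset of (image base address + first * SECTOR_SIZE), so in principle each could be cut out of the new image and written with its own 'esptool write_flash <offset> <file>' call. Whether that beats one full write depends on how scattered the changes are after a relink, which this sketch does not try to predict.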

username
Posts: 542
Joined: Thu May 03, 2018 1:18 pm

Postby username » Sun Jun 30, 2024 6:10 am

RandomInternetGuy wrote: I suspect many of us are annoyed at the long flash times.
Not sure why you are having long flash times. I primarily use the ESP32-S3.
Having code that is a bit over 4MB takes only 4 seconds. Personally I think that is very fast.
Though I don't know why, when I use the same 4MB code on the ESP32-S3-DevKitC-1, it takes like 8-9 seconds. But when I use the M5Stamp ESP32S3 Module https://shop.m5stack.com/products/m5sta ... 2s3-module it only takes 4 seconds.

RandomInternetGuy
Posts: 52
Joined: Fri Aug 11, 2023 4:56 am

Postby RandomInternetGuy » Tue Jul 02, 2024 5:52 am

Gee, I composed a long answer, got distracted, and came back to press the preview button. The forum then took me to a login page, but ate my post. Classy!

The TLDR is that this is, in fact, not really Espressif's problem. I chased the problem all the way down to the wire and the upload of ~1MB of *.bins takes about 7.8 seconds at 1000000bps and 7.1 at 4000000bps, though it was unreliable on the WCH at that speed. Eking out that half second at the cost of reliability is stress I don't need.

The overall problem is that 'pio run --target upload -e ...' is doing about 23 seconds of something, even on a 'do nothing' build. Only 7-8 seconds in my case are eaten by the actual serial transfer done by esptool. Visually, the way it's presented throws esptool under the bus, but that's not where the REAL time goes.
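One way to separate the two costs is to time the wrapper command and the bare esptool command independently. This is a small sketch using only the Python standard library; time_command is a made-up helper name, and the command lines in the comment are placeholders to be replaced with your own environment name and arguments.

```python
import subprocess
import sys
import time


def time_command(argv):
    """Run a command to completion and return (exit_code, elapsed_seconds)."""
    start = time.perf_counter()
    result = subprocess.run(argv, check=False)
    return result.returncode, time.perf_counter() - start


# Example usage (placeholders, adjust to your project):
#   time_command(["pio", "run", "--target", "upload", "-e", "<your-env>"])
#   time_command(["esptool.py", "--chip", "esp32s3", "write_flash", "0x10000", "firmware.bin"])
if __name__ == "__main__":
    # Harmless self-check: time a Python process that does nothing.
    print(time_command([sys.executable, "-c", "pass"]))
```

The difference between the two timings is the part that is not serial transfer or flash writing.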

username, to what do you attribute the difference? You're seeing 1MB/sec with 4MB/4sec, but I'm seeing .5MB/~8sec. I don't think there's a life-changing amount of difference I can make without replacing PlatformIO/VSCode (Wicked Witch voice: "and your little Arduino, too!"), but now that I've dug deeply, I'm interested in understanding the engineering explanation for what's left.

Thank you for encouraging me to dig deeper.




The rest is just details for anyone curious about the research process and results found.

Another funny thing is that if you use cached builds and do a clean build, it takes ...

========================= [SUCCESS] Took 31.21 seconds =========================
so apparently the actual copying (because it copies files from the cache to the build directories !?!?!) and linking takes about one second in my case. If I then do a clean AND remove the build cache, we see that optimizing all 500KB of object from scratch takes 3x as long as platformio just doing absolutely nothing.

========================= 1 succeeded in 00:01:58.463 =========================
If you're moving 4MB in 4sec, your data must be WAY more compressed than mine is or your flash is faster. Mine moves about 550KB (compressed) in four sections at about half wire speed at 500kbps. My testing just won't go faster than that at any bit rate, so I suspect I'm flash-bound. Is your 4MB image perhaps largely an empty filesystem or data that's more compressible than my executables? Are you using JTAG or faster flash or flashing straight to RAM? I can't seem to replicate your success on any combination of "real" UART or CDC-ACM implementation on the chips from the last 4 years.


For future generations reading this, the actual upload command is:

 esptool.py \
--chip esp32s3 --port /dev/cu.wchusbserial* --baud 3000000 \
 write_flash  --flash_mode keep \
--flash_freq 80m --flash_size 8MB \
0x0000 .pio/build/yd-esp32-s3-demo/bootloader.bin \
0x8000 .pio/build/yd-esp32-s3-demo/partitions.bin \
0xe000 ~/.platformio/packages/framework-arduinoespressif32/tools/partitions/boot_app0.bin \
0x10000 .pio/build/yd-esp32-s3-demo/firmware.bin
Flash will be erased from 0x00000000 to 0x00003fff...
Flash will be erased from 0x00008000 to 0x00008fff...
Flash will be erased from 0x0000e000 to 0x0000ffff...
Flash will be erased from 0x00010000 to 0x000fcfff...
Compressed 15104 bytes to 10430...
Wrote 15104 bytes (10430 compressed) at 0x00000000 in 0.2 seconds (effective 499.2 kbit/s)...
Hash of data verified.
Compressed 3072 bytes to 146...
Wrote 3072 bytes (146 compressed) at 0x00008000 in 0.0 seconds (effective 511.9 kbit/s)...
Hash of data verified.
Compressed 8192 bytes to 47...
Wrote 8192 bytes (47 compressed) at 0x0000e000 in 0.1 seconds (effective 675.6 kbit/s)...
Hash of data verified.
Compressed 969952 bytes to 532866...
Wrote 969952 bytes (532866 compressed) at 0x00010000 in 7.0 seconds (effective 1106.0 kbit/s)...
Hash of data verified.
In case you suspect that esptool is smart and doesn't erase/write memory that hasn't changed, it doesn't appear to be that clever. In my testing, it always erases and it always rewrites. So even though a read-before-write could save the erase and write cycles for unchanged data, and writes are slower than reads, it just doesn't seem to do so.
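To compare a log like the one above against the serial link itself, the "Wrote ..." lines can be parsed and checked against a simple lower bound for time on the wire. This is a rough sketch: it assumes 10 bits per byte (8N1 UART framing) and ignores protocol framing and per-block overhead, so the real floor is somewhat higher, and it does not apply as-is to a native USB CDC port. The function name is made up for illustration.

```python
import re

WROTE_RE = re.compile(
    r"Wrote (\d+) bytes \((\d+) compressed\) at (0x[0-9a-fA-F]+) in ([\d.]+) seconds"
)


def summarize(log_text, baud, bits_per_byte=10):
    """Return one dict per 'Wrote ...' line with ratio and wire-time floor."""
    rows = []
    for raw, comp, addr, secs in WROTE_RE.findall(log_text):
        raw, comp, secs = int(raw), int(comp), float(secs)
        rows.append({
            "address": addr,
            "ratio": comp / raw,                          # compressed / original size
            "wire_floor_s": comp * bits_per_byte / baud,  # best case on the wire
            "reported_s": secs,                           # what esptool printed
        })
    return rows


LOG = (
    "Wrote 969952 bytes (532866 compressed) at 0x00010000 "
    "in 7.0 seconds (effective 1106.0 kbit/s)..."
)

if __name__ == "__main__":
    for row in summarize(LOG, baud=3_000_000):
        print(row)
```

For the firmware.bin line at 3000000 baud, the wire floor comes out under 2 seconds against the 7.0 seconds reported, which is consistent with the transfer not being limited by the serial link.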

Thank you for making me question my tools and place blame and anger where it belongs. :-)

P.S. I dig that this is a conversation between 'username' and "randomguy".

username
Posts: 542
Joined: Thu May 03, 2018 1:18 pm

Postby username » Tue Jul 02, 2024 11:40 am

The largest program I have at home is about 1MB.
Looks as though I must be living in a time warp or smoking too much crack. Seems I am getting the same speeds as you.
esptool.py v4.7.0
Serial port COM3
Connecting...
Chip is ESP32-S3 (QFN56) (revision v0.1)
Features: WiFi, BLE, Embedded Flash 8MB (GD)
Crystal is 40MHz
MAC: dc:54:75:d1:64:14
Uploading stub...
Running stub...
Stub running...
Changing baud rate to 460800
Changed.
Configuring flash size...
Auto-detected Flash size: 8MB
Flash will be erased from 0x00000000 to 0x00005fff...
Flash will be erased from 0x00010000 to 0x00123fff...
Flash will be erased from 0x00008000 to 0x00008fff...
Flash will be erased from 0x0000d000 to 0x0000efff...
Compressed 23568 bytes to 14629...
Wrote 23568 bytes (14629 compressed) at 0x00000000 in 0.4 seconds (effective 509.0 kbit/s)...
Hash of data verified.
Compressed 1126656 bytes to 825496...
Wrote 1126656 bytes (825496 compressed) at 0x00010000 in 8.0 seconds (effective 1123.1 kbit/s)...
Hash of data verified.
Compressed 3072 bytes to 128...
Wrote 3072 bytes (128 compressed) at 0x00008000 in 0.1 seconds (effective 399.1 kbit/s)...
Hash of data verified.
Compressed 8192 bytes to 31...
Wrote 8192 bytes (31 compressed) at 0x0000d000 in 0.1 seconds (effective 690.0 kbit/s)...
Hash of data verified.

RandomInternetGuy
Posts: 52
Joined: Fri Aug 11, 2023 4:56 am

Postby RandomInternetGuy » Tue Jul 02, 2024 5:09 pm

Thank you, that resolves that mystery. We're both likely bound by the actual flash speed. Maybe life is better on a device with QIO flash.

If you're at 460k on a "real" UART, you probably can get a doubling of speed just going up to 1Mbps or so. If you're on a CDC/ACM device, that speed doesn't matter. (My S3 boards have two ports. I use the legacy UART side just because it doesn't reset when the ESP resets, so my openocd or monitor or whatever can stay up.)

I assume you have the good taste to not be using Platformio's pio upload, right?

That does leave the "10x faster" option interesting if there was a way we could write to RAM or have esptool NOT erase the device and reprogram everything, even if unchanged. Our ~ 1MB bins are each clocking in around 8 seconds, so a full 4MB would track at around 32 secs. The time that PlatformIO wastes is likely constant so that'd probably be ~1minute for me.
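The back-of-the-envelope estimate above can be written down as a one-line model. The defaults are this thread's rough figures (about 8 seconds per MB of image and about 23 seconds of fixed PlatformIO overhead); they are assumptions taken from one setup, not general constants, and the function name is made up.

```python
def estimate_upload_seconds(image_mb, seconds_per_mb=8.0, fixed_overhead_s=23.0):
    """Linear estimate: fixed tool overhead plus time proportional to image size."""
    return fixed_overhead_s + image_mb * seconds_per_mb


if __name__ == "__main__":
    for size_mb in (1, 2, 4):
        print(size_mb, "MB ->", estimate_upload_seconds(size_mb), "s")
```

With these defaults a 4MB image lands just under a minute, and setting fixed_overhead_s to 0 gives the esptool-only figure of about 32 seconds.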

So there's still some merit for some programmers in chasing a faster esptool write_flash operation, but it's a small slice of what PlatformIO users are spending in each upload. I can see my biggest performance boost would be eliminating pio.

Thank you for talking it through.
