Device bricked ("csum err") after two months of service
Re: Device bricked ("csum err") after two months of service
One last thought: I guess a signal integrity problem with the design could explain all of the symptoms mentioned, i.e. the encrypted flash contents are still valid, but some other physical wearing-in process on the board is degrading signal quality to the point where reads fail reliably with the same misread data.
You could try adding more decoupling to the power rail (piggy-backing SMD MLCC caps is usually pretty easy), injecting 3.3V rather than using the internal LDO, or removing/disabling other peripherals on the board, and see if anything changes.
-
- Posts: 45
- Joined: Sun Jan 06, 2019 12:42 pm
Re: Device bricked ("csum err") after two months of service
Many good points, @ESP_Angus! I will definitely try the multiple-dump procedure the next time I have such a device. The one I had I reflashed and sent back, and it is now working well in the field. Reflashing somehow cures the problem. We haven't burnt the write-protect fuse on these particular devices, so we're capable of resurrecting them a few times.
The quality of the flash chips is not a likely culprit; these are production Winbond chips sourced through a reputable supplier.
The next time I get a device with corrupted bootloader, I'm thinking about doing this:
1. Dump the flash several times
2. Add more MLCCs, see if the behaviour changes
3. Power the 3.3V rail through a good lab PSU, see if the behaviour changes
4. Reflash the bootloader only, compare which block changes after it reencrypts itself.
I will post an update here.
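Step 1 of the plan above could be scripted roughly as follows. This is only a sketch under assumptions: the serial port name, the `dump` file prefix and the 1 MB read size are placeholders, and it assumes `esptool.py` is on the PATH with the `read_flash` command spelled this way (the spelling may differ between esptool releases). Grouping the dumps by SHA-256 gives a quick answer to "did every read return the same data?".

```python
import hashlib
import subprocess
from pathlib import Path


def read_flash_cmd(port, out_file, size=0x100000):
    """Build an esptool command line that reads `size` bytes from offset 0."""
    return ["esptool.py", "--port", port, "read_flash", "0", hex(size), str(out_file)]


def dump_n_times(port, prefix="dump", count=6):
    """Read the flash `count` times into prefix-1.bin, prefix-2.bin, ..."""
    paths = []
    for i in range(1, count + 1):
        out = Path(f"{prefix}-{i}.bin")
        # check=True raises CalledProcessError if esptool reports a failure
        subprocess.run(read_flash_cmd(port, out), check=True)
        paths.append(out)
    return paths


def group_by_digest(paths):
    """Map SHA-256 digest -> names of the dump files with that exact content."""
    groups = {}
    for p in paths:
        digest = hashlib.sha256(Path(p).read_bytes()).hexdigest()
        groups.setdefault(digest, []).append(Path(p).name)
    return groups
```

If `group_by_digest(dump_n_times("/dev/ttyUSB0"))` returns more than one group, the reads are not repeatable and a byte-level comparison of the dumps is the next step.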
Re: Device bricked ("csum err") after two months of service
Sounds like a good plan.
I don't know what volume of devices you're manufacturing, but the other thing you could try is to dump the ciphertext flash for each device as it leaves the factory (after the initial encrypt pass) and then store it. This gives you two advantages:
1. If a device comes back, you can compare the flash contents and see exactly what has changed.
2. You can flash the "good" ciphertext straight back on it, no need to burn FLASH_CRYPT_CNT.
(Although I guess your reflash-then-compare method gives a similar result and is a lot less labour-intensive!)
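For the first advantage, comparing a stored factory dump against the dump of a returned device can be done sector by sector. A minimal sketch, assuming both dumps are already loaded as `bytes` and that the flash uses 4 KB erase sectors (the function name and the `factory`/`returned` names are illustrative, not from any tool):

```python
SECTOR_SIZE = 0x1000  # assumed 4 KB erase sector


def changed_sectors(factory, returned, sector_size=SECTOR_SIZE):
    """Return the start offsets of sectors whose contents differ between two dumps."""
    if len(factory) != len(returned):
        raise ValueError("dumps must have the same length")
    changed = []
    for start in range(0, len(factory), sector_size):
        end = start + sector_size
        if bytes(factory[start:end]) != bytes(returned[start:end]):
            changed.append(start)
    return changed
```

Printing the result with `[hex(s) for s in changed_sectors(factory, returned)]` shows at a glance whether the damage is confined to one sector or spread across the chip.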
Re: Device bricked ("csum err") after two months of service
Whoa, this is a very good idea, definitely worth implementing! Many thanks!
Re: Device bricked ("csum err") after two months of service
Lol, what an interesting development...
I got two devices back, both "bricked" in the same way.
"62" is encrypted and shows damage in its first bootloader sector; its entropy is only ~0.62 and the histogram shows a high prevalence of bytes towards 0xff.
More interestingly, as discussed earlier in the thread, the flash reads differently in that sector on every run.
I tried 6 times; all the dumps are different, and the differences are only in offsets 0x1000..0x1fff.
Only 479 bytes in the sector differ between dumps; the others are the same. A typical excerpt of the differences (I made a program for comparing the dumps):
...
At offset 4780 (0x12ac):
0xfb: 62-2.bin, 62-4.bin, 62-5.bin, 62-6.bin
0xff: 62-1.bin, 62-3.bin
At offset 4795 (0x12bb):
0x5f: 62-1.bin, 62-5.bin
0x7f: 62-2.bin, 62-4.bin, 62-6.bin
0xdf: 62-3.bin
At offset 4804 (0x12c4):
0xd6: 62-1.bin, 62-2.bin, 62-3.bin, 62-4.bin
0xf6: 62-5.bin, 62-6.bin
At offset 4815 (0x12cf):
0xbf: 62-1.bin, 62-2.bin, 62-4.bin, 62-6.bin
0xff: 62-3.bin, 62-5.bin
At offset 4828 (0x12dc):
0x3e: 62-3.bin
0x3f: 62-5.bin
0x7e: 62-2.bin
0x7f: 62-4.bin, 62-6.bin
0xff: 62-1.bin
At offset 4832 (0x12e0):
0x78: 62-2.bin, 62-3.bin, 62-4.bin, 62-5.bin, 62-6.bin
0x7c: 62-1.bin
At offset 4849 (0x12f1):
0xdb: 62-1.bin, 62-2.bin, 62-3.bin, 62-5.bin
0xdf: 62-4.bin, 62-6.bin
...
As you can see, typically a single-bit difference.
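The poster's comparison program is not shown in the thread. A report in the same shape as the excerpt above could be produced along these lines; this is a hypothetical reconstruction, and the function names are made up for illustration. `flipped_bits` additionally returns a mask of the bit positions that are not stable across reads, which makes the "single-bit difference" observation easy to check.

```python
from collections import defaultdict


def compare_dumps(dumps):
    """dumps: dict of file name -> bytes, all the same length.

    Returns {offset: {byte_value: [file names]}} for every offset
    where the dumps do not all agree.
    """
    names = sorted(dumps)
    length = len(dumps[names[0]])
    if any(len(dumps[n]) != length for n in names):
        raise ValueError("dumps must have the same length")
    diffs = {}
    for offset in range(length):
        by_value = defaultdict(list)
        for name in names:
            by_value[dumps[name][offset]].append(name)
        if len(by_value) > 1:
            diffs[offset] = dict(by_value)
    return diffs


def flipped_bits(values):
    """Mask of bit positions that differ between at least two of the values."""
    values = list(values)
    all_and = all_or = values[0]
    for v in values[1:]:
        all_and &= v
        all_or |= v
    return all_and ^ all_or


def format_report(diffs):
    """Render the differences in the 'At offset N (0xN):' layout used above."""
    lines = []
    for offset in sorted(diffs):
        lines.append(f"At offset {offset} (0x{offset:x}):")
        for value in sorted(diffs[offset]):
            lines.append(f"0x{value:02x}: {', '.join(diffs[offset][value])}")
    return "\n".join(lines)
```

For the values listed at offset 0x12ac, `flipped_bits([0xfb, 0xff])` is `0x04`, i.e. a single unstable bit; at 0x12bb the three values 0x5f, 0x7f and 0xdf give `0xa0`, i.e. two unstable bits in that byte.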
I also got "26", a device that wasn't encrypted. Again, just 0x1000..0x1fff is damaged, and it's almost all 0xFFs; only a few bits are zero. Again, this sector reads differently each time, but there are far fewer differences, just 3 bytes.
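The entropy figure and the "almost all 0xFFs" observation can both be computed from a single sector of a dump. The sketch below assumes the ~0.62 value is Shannon entropy of the byte distribution scaled to the range 0 to 1, which the post does not actually state; the function names are illustrative.

```python
import math
from collections import Counter


def normalized_entropy(data):
    """Shannon entropy of the byte values, scaled so 8 bits per byte equals 1.0."""
    if not data:
        return 0.0
    total = len(data)
    bits = 0.0
    for count in Counter(data).values():
        p = count / total
        bits += p * math.log2(1 / p)
    return bits / 8.0


def zero_bit_count(data):
    """Number of bits that read as 0 (an erased NOR flash sector reads all 1s)."""
    return sum(8 - bin(b).count("1") for b in data)
```

With a dump loaded as `dump`, the damaged sector would be `dump[0x1000:0x2000]`. Well-encrypted data should score close to 1.0, and a fully erased sector scores 0.0 with a zero-bit count of 0.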
So my flash memory somehow developed "weak bits".
But I have no idea how this could happen. I thought it was impossible.
Any ideas? ESP_Angus?
Re: Device bricked ("csum err") after two months of service
PanicanWhyasker wrote (Thu May 09, 2019 4:59 pm):
> So my flash memory somehow developed "weak bits".
> But I have no idea how this could happen. I thought it was impossible.
Wow, that's pretty nasty! Probably time to have a talk with your flash memory supplier.
To rule out electrical issues at reading time, you could try taking the flash chip off your device PCB and putting it on a known-good ESP32 dev board (may need to remove the RF can first), or on a generic SPI flash reader device.
Otherwise, the most probable explanations I can think of are that there were power stability issues at the time the flash was written (causing improperly written sectors), or that it's a defective flash chip.
Re: Device bricked ("csum err") after two months of service
Yeah, I haven't got to the hardware interventions yet, because:
- only this one sector shows up as defective, although I'm reading 1 MB; if it were an electrical issue during reading, I'd expect it to affect the whole dump
- also (based on experience from other devices), when you overwrite this sector, the weak bits are cured
Thus I'm inclined to think that adding caps, moving the flash to a known-good board, etc. will not make the weak bits disappear.
And the more burning question is what process in the ESP would cause it to try writing to 0x1000 during normal operation?
Re: Device bricked ("csum err") after two months of service
It's probably the first sector written during programming, so is there enough time for the power supply to stabilize before it is erased and written?
Re: Device bricked ("csum err") after two months of service
PanicanWhyasker wrote (Fri May 10, 2019 5:27 am):
> And the more burning question is what process in the ESP would cause it to try writing to 0x1000 during normal operation?
If you're following the recommended flash encryption flow then the plaintext bootloader will boot and then replace itself with the encrypted version on first boot. This is the only time I'd expect 0x1000 to be erased and written, after the initial flash.
(If you're flashing pre-encrypted then it's only the initial flash.)
Re: Device bricked ("csum err") after two months of service
@WiFive, my programming sequence is that the ESP with the blank flash chip is first powered through the normal power source, likely for more than 5 seconds. Only then do I connect the serial harness and run esptool.
Also, the devices flashed like this work. They work continuously for months; they develop those weak bits only during actual usage in the field.
@ESP_Angus, interesting. So in this case, is it possible that (due to signal integrity issues) when the ESP reboots in the field, it tries to re-encrypt itself (or, for the device which isn't encrypted, tries to initiate self-encryption)?