Device bricked ("csum err") after two months of service

ESP_Angus
Posts: 2344
Joined: Sun May 08, 2016 4:11 am

Re: Device bricked ("csum err") after two months of service

Postby ESP_Angus » Fri May 03, 2019 1:43 am

One last thought: I guess a signal integrity problem with the design could explain all of the symptoms mentioned, i.e. the encrypted flash contents are still valid, but some physical wearing-in process on the board is degrading signal quality to the point where reads fail reliably, returning the same misread data each time.

You could try adding more decoupling to the power rail (piggy-backing SMD MLCC caps is usually pretty easy), injecting 3.3V rather than using the internal LDO, or removing/disabling other peripherals on the board, and see if anything changes.

PanicanWhyasker
Posts: 45
Joined: Sun Jan 06, 2019 12:42 pm

Postby PanicanWhyasker » Fri May 03, 2019 5:42 am

Many good points, @ESP_Angus! I will definitely try the multiple-dump procedure the next time I have such a device. (I reflashed the one I had and sent it back, and it is now working well in the field. Reflashing somehow cures the problem. We haven't burnt the write-protect fuse on these particular devices, so we can resurrect them a few times.)

The quality of the flash chips is not a likely culprit; these are production Winbond chips sourced through a reputable supplier.

The next time I get a device with a corrupted bootloader, I'm thinking about doing this:

1. Dump the flash several times
2. Add more MLCCs, see if the behaviour changes
3. Power the 3.3V rail through a good lab PSU, see if the behaviour changes
4. Reflash the bootloader only, compare which block changes after it reencrypts itself.

I will post an update here.

Postby ESP_Angus » Fri May 03, 2019 6:36 am

Sounds like a good plan.

I don't know what volume of devices you're manufacturing, but the other thing you could try is to dump the ciphertext flash for each device as it leaves the factory (after the initial encrypt pass) and then store it. This gives you two advantages:

1. If a device comes back, you can compare the flash contents and see exactly what has changed.
2. You can flash the "good" ciphertext straight back on it, no need to burn FLASH_CRYPT_CNT.

(Although I guess your reflash-then-compare method gives a similar result and is a lot less labour-intensive!)
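
The factory-dump comparison described above could look roughly like the following. This is a minimal sketch and not an Espressif tool: it assumes both dumps are raw images of the same flash range (for example, read out with esptool's read_flash command), and the function names and the 4 KB sector size are illustrative choices.

```python
import hashlib

SECTOR_SIZE = 0x1000  # assumed 4 KB erase sector, the granularity discussed in this thread


def sector_digests(image: bytes, sector_size: int = SECTOR_SIZE) -> list:
    """Return one SHA-256 hex digest per sector of a raw flash image."""
    return [
        hashlib.sha256(image[off:off + sector_size]).hexdigest()
        for off in range(0, len(image), sector_size)
    ]


def changed_sectors(factory: bytes, returned: bytes,
                    sector_size: int = SECTOR_SIZE) -> list:
    """Return the start offsets of sectors that differ between the two dumps."""
    if len(factory) != len(returned):
        raise ValueError("dumps must cover the same flash range")
    before = sector_digests(factory, sector_size)
    after = sector_digests(returned, sector_size)
    return [i * sector_size
            for i, (a, b) in enumerate(zip(before, after)) if a != b]


def compare_files(factory_path: str, returned_path: str) -> list:
    """Load two dump files (paths are up to the caller) and compare them."""
    with open(factory_path, "rb") as f:
        factory = f.read()
    with open(returned_path, "rb") as f:
        returned = f.read()
    return changed_sectors(factory, returned)
```

Calling compare_files() with the archived factory dump and a fresh dump of a returned device gives the list of sectors worth a closer look. Storing only the digests would save space, but keeping the full ciphertext is what allows flashing the known-good image straight back.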

Postby PanicanWhyasker » Fri May 03, 2019 8:55 am

Whoa, this is a very good idea, definitely worth implementing! Many thanks!

Postby PanicanWhyasker » Thu May 09, 2019 4:59 pm

Lol, what an interesting development...

I got two devices back, both "bricked" in the same way.
"62" is encrypted and shows damage in its first bootloader sector; its entropy is only ~0.62 and the histogram shows a high prevalence of byte values close to 0xff.
More interestingly, as discussed earlier in the thread, the flash reads differently in that sector on every run.
I tried 6 times; all the dumps are different, and the differences are only in offsets 0x1000..0x1fff.
Only 479 bytes in the sector differ between dumps, the others are the same. A typical excerpt of the differences:

...
At offset 4780 (0x12ac):
   0xfb: 62-2.bin, 62-4.bin, 62-5.bin, 62-6.bin
   0xff: 62-1.bin, 62-3.bin
At offset 4795 (0x12bb):
   0x5f: 62-1.bin, 62-5.bin
   0x7f: 62-2.bin, 62-4.bin, 62-6.bin
   0xdf: 62-3.bin
At offset 4804 (0x12c4):
   0xd6: 62-1.bin, 62-2.bin, 62-3.bin, 62-4.bin
   0xf6: 62-5.bin, 62-6.bin
At offset 4815 (0x12cf):
   0xbf: 62-1.bin, 62-2.bin, 62-4.bin, 62-6.bin
   0xff: 62-3.bin, 62-5.bin
At offset 4828 (0x12dc):
   0x3e: 62-3.bin
   0x3f: 62-5.bin
   0x7e: 62-2.bin
   0x7f: 62-4.bin, 62-6.bin
   0xff: 62-1.bin
At offset 4832 (0x12e0):
   0x78: 62-2.bin, 62-3.bin, 62-4.bin, 62-5.bin, 62-6.bin
   0x7c: 62-1.bin
At offset 4849 (0x12f1):
   0xdb: 62-1.bin, 62-2.bin, 62-3.bin, 62-5.bin
   0xdf: 62-4.bin, 62-6.bin
...
(I made a program for comparing those).
As you can see, typically a single-bit difference.
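
For anyone wanting to repeat this on their own dumps, a comparison of this kind can be sketched in a few lines of Python. This is not the program used for the excerpt above, just a minimal illustration; the function names are made up, and it assumes all dumps cover the same flash range.

```python
from collections import defaultdict


def diff_dumps(dumps: dict) -> dict:
    """Map each differing offset to {byte value: [names of dumps reading it]}.

    `dumps` maps a dump name (e.g. a file name) to its raw bytes.
    """
    lengths = {len(data) for data in dumps.values()}
    if len(lengths) != 1:
        raise ValueError("all dumps must have the same length")
    names = list(dumps)
    result = {}
    # Walk all dumps in lockstep, one byte column at a time.
    for offset, column in enumerate(zip(*(dumps[n] for n in names))):
        if len(set(column)) > 1:
            by_value = defaultdict(list)
            for name, value in zip(names, column):
                by_value[value].append(name)
            result[offset] = dict(by_value)
    return result


def unstable_bits(values) -> int:
    """Return a mask of the bit positions that are not identical across values."""
    values = list(values)
    mask = 0
    for v in values[1:]:
        mask |= values[0] ^ v
    return mask


def report(dumps: dict) -> None:
    """Print the differences in a layout similar to the excerpt above."""
    for offset, by_value in sorted(diff_dumps(dumps).items()):
        print(f"At offset {offset} (0x{offset:x}):")
        for value in sorted(by_value):
            print(f"   0x{value:02x}: {', '.join(by_value[value])}")
        print(f"   unstable bit mask: 0x{unstable_bits(by_value):02x}")
```

The mask makes the "weak bit" pattern explicit: for the values seen at offset 0x12ac (0xfb and 0xff) it is 0x04, a single bit, while the three values at 0x12bb (0x5f, 0x7f, 0xdf) give 0xa0, i.e. two unstable bits in that byte.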

I also got "26", a device that wasn't encrypted. Again, just 0x1000..0x1fff is damaged, and it's almost all 0xFF; only a few bits are zero. Again, this sector reads differently each time, but there are far fewer differences, just 3 bytes.
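
The kind of per-sector statistics mentioned for both devices (entropy, prevalence of 0xFF, number of zero bits) can be computed along these lines. This is a rough sketch with made-up function names; it reports Shannon entropy in bits per byte, which is an assumption, since the scale behind the ~0.62 figure is not stated in the thread.

```python
import math
from collections import Counter


def shannon_entropy(data: bytes) -> float:
    """Shannon entropy of the byte distribution, in bits per byte (0.0 to 8.0)."""
    if not data:
        return 0.0
    total = len(data)
    counts = Counter(data)
    return sum((n / total) * math.log2(total / n) for n in counts.values())


def sector_stats(image: bytes, offset: int = 0x1000, size: int = 0x1000) -> dict:
    """Summarise one sector of a dump; defaults target 0x1000..0x1fff."""
    sector = image[offset:offset + size]
    return {
        "entropy_bits_per_byte": shannon_entropy(sector),
        # Fraction of bytes that read as fully erased.
        "ff_fraction": sector.count(0xFF) / len(sector) if sector else 0.0,
        # Number of bits that read as 0 (an erased sector has none).
        "zero_bits": sum(bin(b ^ 0xFF).count("1") for b in sector),
    }
```

A healthy encrypted sector should be close to 8 bits per byte, and a fully erased one is exactly 0, so a low value on an encrypted device suggests the sector has drifted towards the erased state.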

So my flash memory somehow developed "weak bits".
But I have no idea how this could happen. I thought it was impossible.

Any ideas? ESP_Angus?

Postby ESP_Angus » Fri May 10, 2019 4:49 am

PanicanWhyasker wrote:
Thu May 09, 2019 4:59 pm
So my flash memory somehow developed "weak bits".
But I have no idea how this could happen. I thought it was impossible.

Wow, that's pretty nasty! Probably time to have a talk with your flash memory supplier.

To rule out electrical issues at reading time, you could try taking the flash chip off your device PCB and putting it on a known-good ESP32 dev board (may need to remove the RF can first), or on a generic SPI flash reader device.

Beyond that, all I can think of is that there were power stability issues at the time the flash was written (causing improperly written sectors), or that it's a defective flash chip.

Postby PanicanWhyasker » Fri May 10, 2019 5:27 am

Yeah, I haven't got to the hardware interventions part, because

- only this sector shows defects, even though I'm reading 1 MB; if it were an electrical issue during reading, I'd expect it to show up throughout the dumps, not in a single sector
- also (based on experience from other devices), when you overwrite this sector, the weak bits are cured

thus I'm inclined to think adding caps, moving the flash to a known good board, etc. will not make the weak bits disappear.

And the more burning question is what process in the ESP would cause it to try writing to 0x1000 during normal operation?

WiFive
Posts: 3529
Joined: Tue Dec 01, 2015 7:35 am

Postby WiFive » Fri May 10, 2019 5:59 am

It's probably the first sector written during programming, so is there enough time for the power supply to stabilize before it is erased and written?

Postby ESP_Angus » Fri May 10, 2019 6:10 am

PanicanWhyasker wrote:
Fri May 10, 2019 5:27 am
And the more burning question is what process in the ESP would cause it to try writing to 0x1000 during normal operation?

If you're following the recommended flash encryption flow then the plaintext bootloader will boot and then replace itself with the encrypted version on first boot. This is the only time I'd expect 0x1000 to be erased and written, after the initial flash.

(If you're flashing pre-encrypted then it's only the initial flash.)

Postby PanicanWhyasker » Fri May 10, 2019 6:32 am

@WiFive, my programming sequence is that the ESP with the blank flash chip is first powered through the normal power source, likely for more than 5 seconds. Only then do I connect the serial harness and run esptool.
Also, the devices flashed like this work. They run continuously for months and develop those weak bits only during actual usage in the field.

@ESP_Angus, interesting. So in this case, is it possible that (due to signal integrity issues) the ESP tries to reencrypt itself when it reboots in the field, or, for the device which isn't encrypted, tries to initiate self-encryption?
