I think there are three possible approaches:
1. The .NET MicroFramework.
With this approach the .NET code is compiled into an optimized IL (intermediate language), and an IL interpreter on the ESP32 executes that IL at runtime. This is not a JIT (Just-In-Time) compiler, so every single instruction must be interpreted again each time it runs. I've worked with a Netduino Plus 2 and found code execution to be rather slow.
The hurdle for this approach would be porting the .NET MicroFramework interpreter to the ESP32. Currently the code supports only the ARM architecture, and I think porting it to another processor architecture would be very difficult.
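To illustrate why pure interpretation costs time, here is a minimal sketch of an interpreter dispatch loop in C. The opcodes (OP_PUSH, OP_ADD, OP_MUL, OP_HALT) and the run() function are invented for this example and are not the real .NET IL encoding or the MicroFramework interpreter. The point is only that every instruction passes through the fetch-and-switch step each time it executes.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical opcodes for illustration only; not the real .NET IL encoding. */
enum { OP_PUSH, OP_ADD, OP_MUL, OP_HALT };

#define STACK_SIZE 16

/* Interprets a small stack-machine program and returns the top of the stack.
 * Each instruction is fetched and decoded by the switch every time it runs,
 * which is the overhead a JIT or AOT compiler would remove. */
int32_t run(const int32_t *code)
{
    int32_t stack[STACK_SIZE];
    size_t sp = 0; /* number of values currently on the stack */
    size_t pc = 0; /* index of the next instruction word */

    for (;;) {
        switch (code[pc++]) {
        case OP_PUSH: /* operand follows the opcode */
            assert(sp < STACK_SIZE);
            stack[sp++] = code[pc++];
            break;
        case OP_ADD:
            assert(sp >= 2);
            sp--;
            stack[sp - 1] += stack[sp];
            break;
        case OP_MUL:
            assert(sp >= 2);
            sp--;
            stack[sp - 1] *= stack[sp];
            break;
        case OP_HALT:
            assert(sp >= 1);
            return stack[sp - 1];
        default:
            assert(0 && "unknown opcode");
            return 0;
        }
    }
}
```

Calling run() on the program { OP_PUSH, 2, OP_PUSH, 3, OP_ADD, OP_PUSH, 4, OP_MUL, OP_HALT } evaluates (2 + 3) * 4. A native compiler would turn the same expression into a couple of machine instructions, while the interpreter goes through the loop once per opcode.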
2. The LLILUM AOT (Ahead-Of-Time) compiling approach
With LLILUM you compile .NET IL code into the LLVM intermediate representation (LLVM IR) and then into machine code for a given target processor. LLVM is a compiler infrastructure that can compile several input languages into LLVM IR and then into machine code.
But the problem here is that the ESP32 is not supported by LLVM. To get the ESP32 supported, an LLVM backend must be written. I've looked into the description of how to write such a backend at
http://llvm.org/docs/WritingAnLLVMBackend.html and found that this would also be too complicated.
3. Transforming the .NET code into C/C++ and compiling it with GCC - an AOT approach
The one and only tool that can create ESP32 machine code is the Xtensa GCC compiler. So maybe what is needed is a tool that transforms the .NET IL code or the input C# code into C/C++ code, which is then compiled with the Xtensa GCC compiler.
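As a rough idea of what such a transformation could produce, here is a hand-written sketch of the C output for a small C# method. The C# method, the name Program_SumTo and the mapping of int to int32_t are assumptions made for this example. No existing translator is implied to generate exactly this.

```c
#include <stdint.h>

/* Hypothetical C# input:
 *
 *   static class Program {
 *       static int SumTo(int n) {
 *           int sum = 0;
 *           for (int i = 1; i <= n; i++) sum += i;
 *           return sum;
 *       }
 *   }
 *
 * Assumed mapping: C# int becomes int32_t, and the method name is
 * prefixed with its class name to keep C identifiers unique. */
int32_t Program_SumTo(int32_t n)
{
    int32_t sum = 0;
    for (int32_t i = 1; i <= n; i++) {
        sum += i;
    }
    return sum;
}
```

Plain arithmetic like this maps almost one to one. The hard part of such a tool would be the .NET features that C does not have, such as garbage collection and exceptions, which would need a runtime library written in C/C++.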