Bug in FPU Coprocessor float division?

michal1 · Postby **michal1** » Tue May 14, 2019 8:03 pm

Hello,

I have noticed that using the native FPU division instructions leads to unexpected results when dividing two floating numbers of significantly different exponents.

It seems that if a division operator is used in C code, gcc inserts a call to __divsf3() which does not use the native FPU division instructions but instead relies on software implementation and gives the correct result. As I am working on computationally intensive DSP code, performance is really important to me and I would like to use the native instructions instead. Using the FPU instruction sequence for division as documented in the ISA Reference manual and implemented in divsf() : https://github.com/espressif/esp-idf/bl ... /test_fp.c however leads to wrong results (typically 0.0) when the two exponents differ a lot. The FPU division should however be IEEE compliant according to documentation.

An example code:

Code: Select all

float x = 24e9;
float y = 4;

printf("C: %.2f/%.2f = %.2f\n", x,y, x/y);
printf("ASM: %.2f/%.2f = %.2f\n", x,y, divsf(x,y));

x = 17;
y = 4;

printf("C: %.2f/%.2f = %.2f\n", x,y, x/y);
printf("ASM: %.2f/%.2f = %.2f\n", x,y, divsf(x,y));

Output:

Code: Select all

C: 24000000000.00/4.00 = 6000000000.00
ASM: 24000000000.00/4.00 = 0.00
C: 17.00/4.00 = 4.25
ASM: 17.00/4.00 = 4.25

I was wondering if anybody has come across those issues and if this is indeed a hardware bug in the FPU or if there are perhaps limitations to the native FPU?

Thank you very much!

michal1 · Postby **michal1** » Wed May 15, 2019 4:33 pm

It turns out there is a bug in divsf() implementation in https://github.com/espressif/esp-idf/bl ... /test_fp.c

I believe the correct implemenation should be:

Code: Select all

float divsf(float a, float b)
{
    float result;
    asm volatile (
        "wfr f0, %1\n"
        "wfr f1, %2\n"
        "div0.s f3, f1 \n"
        "nexp01.s f4, f1 \n"
        "const.s f5, 1 \n"
        "maddn.s f5, f4, f3 \n"
        "mov.s f6, f3 \n"
        "mov.s f7, f1 \n"
        "nexp01.s f8, f0 \n"
        "maddn.s f6, f5, f3 \n"
        "const.s f5, 1 \n"
        "const.s f2, 0 \n"
        "neg.s f9, f8 \n"
        "maddn.s f5,f4,f6 \n"
        "maddn.s f2, f9, f3 \n" /* Original was "maddn.s f2, f0, f3 \n" */
        "mkdadj.s f7, f0 \n"
        "maddn.s f6,f5,f6 \n"
        "maddn.s f9,f4,f2 \n"
        "const.s f5, 1 \n"
        "maddn.s f5,f4,f6 \n"
        "maddn.s f2,f9,f6 \n"
        "neg.s f9, f8 \n"
        "maddn.s f6,f5,f6 \n"
        "maddn.s f9,f4,f2 \n"
        "addexpm.s f2, f7 \n"
        "addexp.s f6, f7 \n"
        "divn.s f2,f9,f6\n"
        "rfr %0, f2\n"
        :"=r"(result):"r"(a), "r"(b)
    );
    return result;
}

Another question is, why does gcc by default use __divsf3() when the performance of the native FPU is significantly better?

ESP_Angus · Postby **ESP_Angus** » Thu May 16, 2019 2:15 am

Thanks for pointing this out, michal.

EDIT: The libgcc __divsf3() implementation in the toolchain uses FPU registers. IDF is accidentally linking the version in ROM which does not use FPU registers. We'll fix this so that you get the FPU version when building a project that uses floating point division.

EDIT 2: Fix is in internal review now.

simap2000 · Postby **simap2000** » Sat May 09, 2020 4:13 pm

Hello!
This post has been most helpful! I haven't tested it thoroughly, but this division implementation is much faster than what I'm getting natively.

I poked around on the esp-idf github, and it looks like it's still using the rom version that is slow. What happened to the fix? Should I file a github issue?

ESP_Angus · Postby **ESP_Angus** » Sun May 10, 2020 11:22 pm

Hi simap2000,

Sorry this post should have been updated when the fix was merged. The fix was merged in this commit and should be included in ESP-IDF v4.0 and newer.

Can you please give some more details about what you're seeing? Specifically: Which ESP-IDF version are you using, and what is the reason for saying "it looks like it's still using the rom version that is slow"?

Thanks,

Angus

Bryght-Richard · Postby **Bryght-Richard** » Wed Feb 28, 2024 2:59 pm

@ESP_Angus, this still seems to be a problem on ESP32S3, IDF5.1.2, G++12.2.0, compiler optimization -O2, same result on godbolt.org

Code: Select all

float fdiv(float a, float b)
{
    return a / b;
}

float fmul(float a, float b)
{
    return a * b;
}

Compiles to:

Code: Select all

fdiv(float, float):
        entry   sp, 32
        mov.n   a11, a3
        mov.n   a10, a2
        call8   __divsf3
        mov.n   a2, a10
        retw.n
fmul(float, float):
        entry   sp, 32
        wfr     f0, a2
        wfr     f1, a3
        mul.s   f0, f0, f1
        rfr     a2, f0
        retw.n

Postby **ESP_Sprite** » Fri Mar 01, 2024 6:56 am

I don't see the issue? Division is always done in the __divsf3 function, as division on Xtensa is not one single opcode, but a bunch of them. Do you have a reason to think the __divsf3 linked does not use the FPU?

Bryght-Richard · Postby **Bryght-Richard** » Fri Mar 01, 2024 4:19 pm

@ESP_sprint, thank you, I had misunderstood how the divide instructions worked, and though they were short enough to be inlined effectively without branching to __divsf3.

Bug in FPU Coprocessor float division?

Bug in FPU Coprocessor float division?

Re: Bug in FPU Coprocessor float division?

Re: Bug in FPU Coprocessor float division?

Re: Bug in FPU Coprocessor float division?

Re: Bug in FPU Coprocessor float division?

Re: Bug in FPU Coprocessor float division?

Re: Bug in FPU Coprocessor float division?

Re: Bug in FPU Coprocessor float division?

Who is online

About Us

Extra

Information