ARM Cortex-M Run-Time Library Analysis

This page forms part of an ARM Cortex-M Run-Time Library Analysis by Tom Vajzovic.

Copyright

The text and presentation of this analysis are copyright 2018 Tom Vajzovic. You may not copy them except as permitted by law.

The ARM and GCC routines presented here are subject to separate copyright. Displaying them in this way is academic fair use and so I have not sought a licence from the copyright holders. You must not take them from here to use them for any other purpose. You shouldn't want to anyway, because they are suboptimal.

You may use my versions (which are better) according to the terms of The Truly Free Licence (public domain).

Cortex-M3 and Cortex-M4 (ARMv7-M / ARMv7E-M)

64-bit Multiply Function

Signed or Unsigned Multiply 64 x 64 = 64

Tools compared, in the order of the code listings that follow:

1. ARM standardlib
2. ARM Microlib
3. GCC 4 for Cortex-M3
4. GCC 5 & 6 for Cortex-M3
5. GCC 7 for Cortex-M3
6. GCC 4, 5 & 6 for Cortex-M4
7. GCC 7 for Cortex-M4

Code

Tool: ARM standardlib

__aeabi_lmul:
    push    {lr}
    mov     lr, r0
    umull   r0, r12, r2, lr
    mla     r1, r2, r1, r12
    mla     r1, r3, lr, r1
    pop     {pc}

Tool: ARM Microlib

__aeabi_lmul:
    push    {r0-r12, lr}
    movs    r1, 0
    sub     sp, 8
    mov     r11, r1
    mov     r10, r1
    mov     r8, r1
1:  ldr     r0, [sp, 8]
    uxth    r0, r0
    str     r0, [sp, 0]
    ldrd    r1, r0, [sp, 8]
    lsrs    r1, r1, 16
    orr.w   r1, r1, r0, lsl 16
    asrs    r0, r0, 16
    strd    r1, r0, [sp, 8]
    ldrd    r5, r0, [sp, 16]
    movs    r7, 0
    mov     r9, r0
    mov     r6, r7
    mov     r4, r7
2:  uxth    r1, r5
    ldr     r0, [sp, 0]
    lsrs    r5, r5, 16
    orr.w   r5, r5, r9, lsl 16
    muls    r0, r1
    mov.w   r9, r9, lsr 16
    movs    r1, 0
    mov     r2, r4
    bl      __aeabi_llsl
    adds    r7, r0, r7
    adcs    r1, r6
    adds    r4, 16
    mov     r6, r1
    cmp     r4, 64
    blt.n   2b
    mov     r2, r8
    mov     r0, r7
    bl      __aeabi_llsl
    adds.w  r11, r0, r11
    adc.w   r1, r1, r10
    add.w   r8, r8, 16
    mov     r10, r1
    cmp.w   r8, 64
    blt.n   1b
    add     sp, 24
    mov     r0, r11
    pop     {r4-r12, pc}

Tool: GCC 4 for Cortex-M3

__aeabi_lmul:
    push    {r4}
    mov     r4, r3
    mul.w   r1, r2, r1
    umull   r2, r3, r2, r0
    mla     r1, r4, r0, r1
    mov     r0, r2
    add     r1, r3
    ldr.w   r4, [sp], 4
    bx      lr

Tool: GCC 5 & 6 for Cortex-M3

__aeabi_lmul:
    push    {r4}
    mov     r4, r3
    mul.w   r1, r2, r1
    umull   r2, r3, r0, r2
    mla     r1, r4, r0, r1
    mov     r0, r2
    add     r1, r3
    pop     {r4}
    bx      lr

Tool: GCC 7 for Cortex-M3

__aeabi_lmul:
    mov.w   r12, 0
    push    {r4-r7, r11}
    mov     r5, r1
    movs    r1, 0
    mov     r4, r0
    mov     r7, r3
    mul.w   r3, r0, r12
    mla     r3, r2, r1, r3
    umull   r0, r1, r0, r2
    add     r3, r1
    mla     r3, r7, r4, r3
    mla     r1, r5, r2, r3
    pop     {r4-r7, r11}
    bx      lr

Tool: GCC 4, 5 & 6 for Cortex-M4

__aeabi_lmul:
    mul.w   r1, r2, r1
    mla     r3, r3, r0, r1
    umull   r0, r1, r0, r2
    add     r1, r3
    bx      lr

Tool: GCC 7 for Cortex-M4

__aeabi_lmul:
    push    {r4-r7, r11}
    mov.w   r12, 0
    mov     r5, r3
    movs    r7, 0
    mul.w   r3, r0, r12
    mla     r3, r2, r7, r3
    umull   r6, r7, r0, r2
    add     r3, r7
    mla     r3, r5, r0, r3
    mla     r1, r1, r2, r3
    mov     r0, r6
    pop     {r4-r7, r11}
    bx      lr
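
Tool                        Code (bytes)  Stack (bytes)  Cycles Cortex-M4 (0ws)  Cycles Cortex-M3 (0ws)
ARM standardlib             18            4              9 to 11                 11 to 17
ARM Microlib                120 (+30)     64             589 to 721              589 to 721
GCC 4 for Cortex-M3         26            4              12 to 14                15 to 19
GCC 5 & 6 for Cortex-M3     24            4              12 to 14                15 to 19
GCC 7 for Cortex-M3         44            20             25 to 27                31 to 35
GCC 4, 5 & 6 for Cortex-M4  16            0              6 to 8                  9 to 13
GCC 7 for Cortex-M4         42            20             24 to 26                30 to 34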
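
All of the listings above except the Microlib one reduce to the same arithmetic: one 32 x 32 = 64 multiply (umull) of the two low words, plus the two cross products added into the high word. The high x high product is never computed because it only affects bits 64 and above. This is also why one routine serves both signed and unsigned operands: the low 64 bits of the product are the same either way. The following C sketch models that arithmetic. The names lmul_model and lmul_model_check are hypothetical, chosen for illustration, and do not come from any of the libraries.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical C model of the three-multiply routine; names are illustrative. */
uint64_t lmul_model(uint64_t a, uint64_t b)
{
    uint32_t a_lo = (uint32_t)a, a_hi = (uint32_t)(a >> 32);
    uint32_t b_lo = (uint32_t)b, b_hi = (uint32_t)(b >> 32);

    /* umull: full 64-bit product of the two low words. */
    uint64_t low = (uint64_t)a_lo * b_lo;

    /* mul + mla: the cross products, kept to 32 bits because they
       only contribute to the high word of the result. */
    uint32_t cross = a_lo * b_hi + a_hi * b_lo;

    uint32_t hi = (uint32_t)(low >> 32) + cross;
    return ((uint64_t)hi << 32) | (uint32_t)low;
}

/* Checks the model on a few values, including a negative operand. */
void lmul_model_check(void)
{
    assert(lmul_model(3, 5) == 15);
    /* 2^32 * 2^32 = 2^64, which wraps to 0 in 64 bits. */
    assert(lmul_model(0x100000000ULL, 0x100000000ULL) == 0);
    /* All-ones times 2: the bit pattern is also the signed result -2. */
    assert(lmul_model(0xFFFFFFFFFFFFFFFFULL, 2) == 0xFFFFFFFFFFFFFFFEULL);
    assert((int64_t)lmul_model((uint64_t)-1LL, 2) == -2);
}
```

Calling lmul_model_check() from a test program exercises the model. The sketch assumes a platform where unsigned int is 32 bits wide, as it is on Cortex-M.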

Details

Additional size in parentheses is for other functions that the code calls.

Cycle counts will vary within the ranges shown because:

Cycle counts may also be significantly higher than shown because there may be contention for the data bus (e.g. when DMA is using the same RAM bank). This will slow down each version in proportion to how much it uses the stack.

Details of exact versions tested.

Conclusions

The version provided by GCC 4, 5 and 6 for Cortex-M4 is the fastest and smallest. The version provided by ARM standardlib is only slightly worse.

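GCC incorrectly provides bigger and slower routines for Cortex-M3 when the ones it provides for Cortex-M4 would be perfectly suitable: the Cortex-M4 routine uses only mul.w, mla and umull, all of which the Cortex-M3 routines use as well.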

GCC 7 provides significantly bigger and slower routines than its older versions.
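
One way to see where the extra size and time go is to read the GCC 7 for Cortex-M3 listing as arithmetic. Two of its five multiply instructions take an operand from a register that has just been set to zero (r12 in the mul.w, r1 in the first mla), so they contribute nothing to the result. The C sketch below mirrors that reading instruction by instruction. It is an interpretation of the listing, not GCC source, and the name lmul_gcc7_model is hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model of the GCC 7 listing; register names refer to
   the Cortex-M3 version shown above. */
uint64_t lmul_gcc7_model(uint64_t a, uint64_t b)
{
    uint32_t a_lo = (uint32_t)a, a_hi = (uint32_t)(a >> 32);
    uint32_t b_lo = (uint32_t)b, b_hi = (uint32_t)(b >> 32);
    const uint32_t zero = 0;               /* r12 and r1 are cleared to 0 */

    uint32_t acc = a_lo * zero;            /* mul.w r3, r0, r12   (always 0) */
    acc += b_lo * zero;                    /* mla   r3, r2, r1, r3 (still 0) */

    uint64_t low = (uint64_t)a_lo * b_lo;  /* umull r0, r1, r0, r2 */
    acc += (uint32_t)(low >> 32);          /* add   r3, r1         */
    acc += b_hi * a_lo;                    /* mla   r3, r7, r4, r3 */
    acc += a_hi * b_lo;                    /* mla   r1, r5, r2, r3 */

    return ((uint64_t)acc << 32) | (uint32_t)low;
}
```

Removing the two multiplies by zero leaves the same three multiplies that the shorter listings use.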

The ARM Microlib version is not optimized at all. Microlib is supposed to be as small as possible at the expense of sometimes being slightly slower. In this case though it is eight times bigger than it could be (nine times if you include the shift function it calls). On Cortex-M3 it runs sixty times slower than it should, and on Cortex-M4 one hundred times slower.
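
The cost comes from the algorithm rather than from the instruction selection. Read as arithmetic, the Microlib listing is a schoolbook multiply on 16-bit digits: two nested loops of four iterations each, with every partial product and every inner sum shifted into place by a call to __aeabi_llsl. That makes 16 multiplies and 20 shift calls per 64-bit multiply. The C sketch below is an interpretation of the listing with hypothetical names, not ARM's source.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model of the Microlib listing; comments name the
   registers that hold each value in the assembly above. */
uint64_t lmul_microlib_model(uint64_t a, uint64_t b)
{
    uint64_t result = 0;                                  /* r11 (low), r10 (high) */

    for (unsigned outer = 0; outer < 64; outer += 16) {   /* r8 */
        uint32_t digit_a = (uint32_t)(a & 0xFFFF);        /* uxth, saved at [sp, 0] */
        a >>= 16;                                         /* next digit of a */
        uint64_t rest_b = b;                              /* r5 (low), r9 (high) */
        uint64_t inner = 0;                               /* r7 (low), r6 (high) */

        for (unsigned shift = 0; shift < 64; shift += 16) {   /* r4 */
            uint32_t digit_b = (uint32_t)(rest_b & 0xFFFF);
            rest_b >>= 16;
            /* muls, then __aeabi_llsl moves the partial product into place. */
            inner += (uint64_t)(digit_a * digit_b) << shift;
        }

        /* Second __aeabi_llsl call: align the inner sum with this digit of a. */
        result += inner << outer;
    }
    return result;
}
```

The model returns the same value as the three-multiply form for every input; only the amount of work differs.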

The best libgcc version could be made two bytes smaller still by using muls instead of mul.w as the first instruction. This would execute in the same number of cycles but the Architecture Reference Manual recommends against using muls. I presume this is because it sets the flags when they aren't required, which may decrease the performance of the branch predictor (possibly making later branch instructions slower).