ARM Cortex-M Run-Time Library Analysis

This page forms part of an ARM Cortex-M Run-Time Library Analysis by Tom Vajzovic.

Copyright

The text and presentation of this analysis are copyright 2018 Tom Vajzovic. You may not copy them except as permitted by law.

The ARM and GCC routines presented here are subject to separate copyright. Displaying them in this way is academic fair use and so I have not sought a licence from the copyright holders. You must not take them from here to use them for any other purpose. You shouldn't want to anyway, because they are suboptimal.

You may use my versions (which are better) according to the terms of The Truly Free Licence (public domain).

Cortex-M3 and Cortex-M4 (ARMv7-M / ARMv7E-M)

64-bit Shift Functions

Signed or Unsigned Shift Left

Code, ARM standardlib:
__aeabi_llsl:
    subs.w  r3, r2, 32
    bpl.n   1f
    rsb     r3, r2, 32
    lsl.w   r1, r1, r2
    lsr.w   r3, r0, r3
    lsl.w   r0, r0, r2
    orr.w   r1, r1, r3
    bx      lr
1:  lsl.w   r1, r0, r3
    mov.w   r0, 0
    bx      lr

Code, ARM Microlib:
__aeabi_llsl:
    cmp     r2, 32
    blt.n   1f
    subs    r2, 32
    lsl.w   r1, r0, r2
    movs    r0, 0
    bx      lr
1:  lsls    r1, r2
    rsb     r3, r2, 32
    lsr.w   r3, r0, r3
    orrs    r1, r3
    lsls    r0, r2
    bx      lr

Code, GCC (libgcc):
__aeabi_llsl:
    lsls    r1, r2
    adds    r3, r0, 0
    lsls    r0, r2
    mov     r12, r3
    subs    r2, 32
    lsls    r3, r2
    orrs    r1, r3
    negs    r2, r2
    mov     r3, r12
    lsrs    r3, r2
    orrs    r1, r3
    bx      lr

Code, Mine:
__aeabi_llsl:
    subs    r3, r2, 32
    bhs     1f
    rsbs    r3, 0
    lsr     r3, r0, r3
    lsls    r1, r2
    lsls    r0, r2
    orrs    r1, r3
    bx      lr
1:  lsls    r1, r0, r3
    movs    r0, 0
    bx      lr

Tool                      ARM standardlib   ARM Microlib   GCC        Mine
Code (bytes)              38                30             24         28
Cycles (0 wait states)    7 to 11           7 to 14        13 to 15   7 to 11

Details

Cycle counts will vary within the ranges shown because:

Details of exact versions tested.

Conclusions

The libgcc version is smallest but slowest. The ARM standardlib version is the fastest in the worst case and probably fastest on average (depending on input distribution). The ARM Microlib version is not small, but it matches the fastest best case (7 cycles) and beats the libgcc version even in the worst case (14 cycles against 15).
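The libgcc routine has no branch at all. It relies on the ARM rule that a register-specified shift uses only the bottom byte of the count and produces zero for counts of 32 or more, so the terms that do not apply to a given shift count simply vanish. The following is a rough C model of that idea, not the libgcc source. The names arm_lsl, arm_lsr and llsl_branchless are hypothetical, and the model assumes a shift count from 0 to 63.

```c
#include <stdint.h>

/* Model of ARM "lsl Rd, Rm" with a register count: only the bottom
   byte of the count is used, and counts of 32 or more give zero.
   (A plain C shift by 32 or more would be undefined behaviour.) */
static uint32_t arm_lsl(uint32_t x, uint32_t count)
{
    count &= 0xFFu;
    return count < 32u ? x << count : 0u;
}

/* Same model for a logical shift right. */
static uint32_t arm_lsr(uint32_t x, uint32_t count)
{
    count &= 0xFFu;
    return count < 32u ? x >> count : 0u;
}

/* Hypothetical name. Branch-free 64-bit shift left on two 32-bit
   halves, following the order of the libgcc instructions.
   Assumes 0 <= n <= 63. */
uint64_t llsl_branchless(uint64_t value, uint32_t n)
{
    uint32_t lo = (uint32_t)value;
    uint32_t hi = (uint32_t)(value >> 32);

    uint32_t new_hi = arm_lsl(hi, n);   /* lsls r1, r2 */
    uint32_t new_lo = arm_lsl(lo, n);   /* lsls r0, r2 */

    /* Contributes only when n >= 32; otherwise the count's bottom
       byte is 224 or more and the term is zero. */
    new_hi |= arm_lsl(lo, n - 32u);     /* subs r2, 32 ; lsls r3, r2 */

    /* Contributes only when n <= 32; for larger n the term is zero. */
    new_hi |= arm_lsr(lo, 32u - n);     /* negs r2 ; lsrs r3, r2 */

    return ((uint64_t)new_hi << 32) | new_lo;
}
```

At n equal to 32 both extra terms yield the unshifted low word, which is harmless because they are combined with OR.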

My version is as fast as the ARM standardlib version but smaller (28 bytes against 38). It is not quite as small as the libgcc version (24 bytes).

Both ARM versions are bigger than they could be because they use wide instructions (e.g. lsl.w) where a narrow, flag-setting equivalent (e.g. lsls) would do. This could be deliberate, to avoid setting the flags when they are not required (which might help the branch predictor), or it could be a mistake. My version (which I wrote before seeing the ARM versions) is equivalent to the ARM standardlib version but uses narrow instructions where possible.
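The ARM standardlib routine and my version share the same two-path structure: one path for counts below 32, where bits cross from the low word into the high word, and one for counts of 32 or more, where the low word moves wholly into the high word. A minimal C sketch of that logic follows, assuming a shift count from 0 to 63. The name llsl_two_path is hypothetical, and the explicit guards stand in for ARM shift behaviour that plain C does not guarantee.

```c
#include <stdint.h>

/* Hypothetical name. Two-path 64-bit shift left on 32-bit halves.
   Assumes 0 <= n <= 63. */
uint64_t llsl_two_path(uint64_t value, uint32_t n)
{
    uint32_t lo = (uint32_t)value;
    uint32_t hi = (uint32_t)(value >> 32);

    if (n >= 32u) {
        /* Long shift: the high word is built from the low word only. */
        uint32_t count = n - 32u;
        hi = count < 32u ? lo << count : 0u;
        lo = 0u;
    } else {
        /* Short shift: the top n bits of the low word enter the
           high word. ARM gives zero for a shift right by 32 (n == 0);
           C needs the guard because that shift is undefined. */
        uint32_t carry = n != 0u ? lo >> (32u - n) : 0u;
        hi = (hi << n) | carry;
        lo <<= n;
    }
    return ((uint64_t)hi << 32) | lo;
}
```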

Unsigned Shift Right

Code, ARM standardlib:
__aeabi_llsr:
    subs.w  r3, r2, 32
    bpl.n   1f
    rsb     r3, r2, 32
    lsr.w   r0, r0, r2
    lsl.w   r3, r1, r3
    lsr.w   r1, r1, r2
    orr.w   r0, r0, r3
    bx      lr
1:  lsr.w   r0, r1, r3
    mov.w   r1, 0
    bx      lr

Code, ARM Microlib:
__aeabi_llsr:
    cmp     r2, 32
    blt.n   1f
    subs    r2, 32
    lsr.w   r0, r1, r2
    movs    r1, 0
    bx      lr
1:  lsr.w   r3, r1, r2
    lsrs    r0, r2
    rsb     r2, r2, 32
    lsls    r1, r2
    orrs    r0, r1
    mov     r1, r3
    bx      lr

Code, GCC (libgcc):
__aeabi_llsr:
    lsrs    r0, r2
    adds    r3, r1, 0
    lsrs    r1, r2
    mov     r12, r3
    subs    r2, 32
    lsrs    r3, r2
    orrs    r0, r3
    negs    r2, r2
    mov     r3, r12
    lsls    r3, r2
    orrs    r0, r3
    bx      lr

Code, Mine:
__aeabi_llsr:
    subs    r3, r2, 32
    bhs     1f
    rsbs    r3, 0
    lsl     r3, r1, r3
    lsrs    r0, r2
    lsrs    r1, r2
    orrs    r0, r3
    bx      lr
1:  lsrs    r0, r1, r3
    movs    r1, 0
    bx      lr

Tool                      ARM standardlib   ARM Microlib   GCC        Mine
Code (bytes)              38                32             24         28
Cycles (0 wait states)    7 to 11           7 to 15        13 to 15   7 to 11

Details

Cycle counts will vary within the ranges shown because:

Details of exact versions tested.

Conclusions

The libgcc version is smallest but slowest. The ARM standardlib version is the fastest in the worst case and probably fastest on average (depending on input distribution). The ARM Microlib version is not small. It matches the fastest best case (7 cycles), but its worst case (15 cycles) is no better than that of the libgcc version.

My version is as fast as the ARM standardlib version but smaller (28 bytes against 38). It is not quite as small as the libgcc version (24 bytes).

Both ARM versions are bigger than they could be because they use wide instructions (e.g. lsl.w) where a narrow, flag-setting equivalent (e.g. lsls) would do. This could be deliberate, to avoid setting the flags when they are not required (which might help the branch predictor), or it could be a mistake. My version (which I wrote before seeing the ARM versions) is equivalent to the ARM standardlib version but uses narrow instructions where possible.
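For the unsigned shift right the data flows the other way: bits leaving the high word enter the top of the low word, and for counts of 32 or more the high word becomes zero. A minimal C sketch of the two-path logic used by the ARM standardlib routine and my version follows, assuming a shift count from 0 to 63. The name llsr_two_path is hypothetical.

```c
#include <stdint.h>

/* Hypothetical name. Two-path 64-bit logical shift right on 32-bit
   halves. Assumes 0 <= n <= 63. */
uint64_t llsr_two_path(uint64_t value, uint32_t n)
{
    uint32_t lo = (uint32_t)value;
    uint32_t hi = (uint32_t)(value >> 32);

    if (n >= 32u) {
        /* Long shift: the low word is built from the high word only. */
        uint32_t count = n - 32u;
        lo = count < 32u ? hi >> count : 0u;
        hi = 0u;
    } else {
        /* Short shift: the bottom n bits of the high word enter the
           top of the low word. ARM gives zero for a shift left by 32
           (n == 0); C needs the guard because that shift is undefined. */
        uint32_t carry = n != 0u ? hi << (32u - n) : 0u;
        lo = (lo >> n) | carry;
        hi >>= n;
    }
    return ((uint64_t)hi << 32) | lo;
}
```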

Signed Shift Right

Code, ARM standardlib:
__aeabi_lasr:
    subs.w  r3, r2, 32
    bpl.n   1f
    rsb     r3, r2, 32
    lsr.w   r0, r0, r2
    lsl.w   r3, r1, r3
    asr.w   r1, r1, r2
    orr.w   r0, r0, r3
    bx      lr
1:  asr.w   r0, r1, r3
    mov.w   r1, r1, asr 31
    bx      lr

Code, ARM Microlib:
__aeabi_lasr:
    cmp     r2, 32
    blt.n   1f
    asrs    r3, r1, 31
    subs    r2, 32
    asr.w   r0, r1, r2
    orr.w   r3, r3, r0, asr 31
    b.n     2f
1:  asr.w   r3, r1, r2
    lsrs    r0, r2
    rsb     r2, r2, 32
    lsls    r1, r2
    orrs    r0, r1
2:  mov     r1, r3
    bx      lr

Code, GCC (libgcc):
__aeabi_lasr:
    lsrs    r0, r2
    adds    r3, r1, 0
    asrs    r1, r2
    subs    r2, 32
    bmi.n   1f
    mov     r12, r3
    asrs    r3, r2
    orrs    r0, r3
    mov     r3, r12
1:  negs    r2, r2
    lsls    r3, r2
    orrs    r0, r3
    bx      lr

Code, Mine:
__aeabi_lasr:
    subs    r3, r2, 32
    bhs     1f
    rsbs    r3, 0
    lsl     r3, r1, r3
    lsrs    r0, r2
    asrs    r1, r2
    orrs    r0, r3
    bx      lr
1:  asrs    r0, r1, r3
    asrs    r1, 31
    bx      lr

Tool                      ARM standardlib   ARM Microlib   GCC        Mine
Code (bytes)              38                36             26         28
Cycles (0 wait states)    7 to 11           11 to 15       11 to 16   7 to 11

Details

Cycle counts will vary within the ranges shown because:

Details of exact versions tested.

Conclusions

The libgcc version is smallest but slowest. The ARM standardlib version is the fastest.

The ARM Microlib version is neither small nor fast.

My version is as fast as the ARM standardlib version but smaller (28 bytes against 38). It is not quite as small as the libgcc version (26 bytes).

Both ARM versions are bigger than they could be because they use wide instructions (e.g. lsl.w) where a narrow, flag-setting equivalent (e.g. lsls) would do. This could be deliberate, to avoid setting the flags when they are not required (which might help the branch predictor), or it could be a mistake. My version (which I wrote before seeing the ARM versions) is equivalent to the ARM standardlib version but uses narrow instructions where possible.
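The signed shift differs from the unsigned one only in how the high word is treated: it is shifted arithmetically, and for counts of 32 or more it is replaced by copies of the sign bit (the asr by 31 in the long-shift path). A minimal C sketch of that two-path logic follows, assuming a shift count from 0 to 63. The names asr32 and lasr_two_path are hypothetical, and the value is handled as an unsigned bit pattern because right-shifting a negative signed value in C is implementation-defined.

```c
#include <stdint.h>

/* Hypothetical helper. Portable 32-bit arithmetic shift right for
   counts 0..31: a logical shift plus an explicit sign fill. */
static uint32_t asr32(uint32_t x, uint32_t count)
{
    uint32_t sign_fill = (x & 0x80000000u) ? ~(0xFFFFFFFFu >> count) : 0u;
    return (x >> count) | sign_fill;
}

/* Hypothetical name. Two-path 64-bit arithmetic shift right on
   32-bit halves, returning the result as a bit pattern.
   Assumes 0 <= n <= 63. */
uint64_t lasr_two_path(uint64_t value, uint32_t n)
{
    uint32_t lo = (uint32_t)value;
    uint32_t hi = (uint32_t)(value >> 32);

    if (n >= 32u) {
        /* Long shift: low word comes from the high word, and the
           high word becomes 32 copies of the sign bit. */
        lo = asr32(hi, n - 32u);    /* asrs r0, r1, r3 */
        hi = asr32(hi, 31u);        /* asrs r1, 31     */
    } else {
        /* Short shift: the bottom n bits of the high word enter the
           top of the low word; only the high word is sign-filled. */
        uint32_t carry = n != 0u ? hi << (32u - n) : 0u;
        lo = (lo >> n) | carry;
        hi = asr32(hi, n);
    }
    return ((uint64_t)hi << 32) | lo;
}
```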