ARM Cortex-M Run-Time Library Analysis

This page forms part of an ARM Cortex-M Run-Time Library Analysis by Tom Vajzovic.

Copyright

The text and presentation of this analysis are copyright 2018 Tom Vajzovic. You may not copy them except as permitted by law.

The ARM and GCC routines presented here are subject to separate copyright. Displaying them in this way is academic fair use and so I have not sought a licence from the copyright holders. You must not take them from here to use them for any other purpose. You shouldn't want to anyway, because they are suboptimal.

You may use my versions (which are better) according to the terms of The Truly Free Licence (public domain).

Cortex-M0 (ARMv6-M)

64-bit Shift Functions

Signed or Unsigned Shift Left

Code (ARM standardlib & Microlib)
__aeabi_llsl:
    push    {r4, lr}
    cmp     r2, 32
    blt.n   1f
    mov     r1, r0
    subs    r2, 32
    lsls    r1, r2
    movs    r0, 0
    pop     {r4, pc}
1:  lsls    r1, r2
    movs    r3, 32
    subs    r4, r3, r2
    mov     r3, r0
    lsrs    r3, r4
    orrs    r1, r3
    lsls    r0, r2
    pop     {r4, pc}

Code (GCC 4, 5, 6 & 7)
__aeabi_llsl:
    lsls    r1, r2
    adds    r3, r0, 0
    lsls    r0, r2
    mov     r12, r3
    subs    r2, 32
    lsls    r3, r2
    orrs    r1, r3
    negs    r2, r2
    mov     r3, r12
    lsrs    r3, r2
    orrs    r1, r3
    bx      lr

Code (Mine)
__aeabi_llsl:
    movs    r3, r0
    lsls    r1, r2
    lsls    r0, r2
    subs    r2, 32
    bhs     1f
    rsbs    r2, 0
    lsrs    r3, r2
    orrs    r1, r3
    bx      lr
1:  lsls    r3, r2
    movs    r1, r3
    bx      lr

Tool            ARM standardlib & Microlib    GCC 4, 5, 6 & 7    Mine
Code (bytes)    32                            24                 24
Stack (bytes)   8                             nil                nil
Cycles (0ws)    15 or 20                      14                 11 or 12

Details

Where two cycle counts are shown, they depend on whether the shift amount is below 32 or at least 32.
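The two cases arise because a 64-bit value is held as two 32-bit words (low word in r0, high word in r1). The following is a minimal C sketch of what a 64-bit left shift has to compute in each case, assuming shift amounts of 0 to 63. The names `u64_halves` and `shl64_model` are illustrative only and do not come from any of the libraries compared here.

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative type: a 64-bit value split the way the routines receive it. */
typedef struct {
    uint32_t lo;  /* r0 on entry and exit */
    uint32_t hi;  /* r1 on entry and exit */
} u64_halves;

/* Reference model of a 64-bit left shift using only 32-bit operations. */
static u64_halves shl64_model(u64_halves v, unsigned n)
{
    u64_halves r;
    assert(n < 64);  /* assumption: only amounts 0..63 are modelled */
    if (n >= 32) {
        /* The low word moves wholly into the high word. */
        r.hi = v.lo << (n - 32);
        r.lo = 0;
    } else if (n == 0) {
        /* Handled separately: shifting a 32-bit value by 32 is undefined in C. */
        r = v;
    } else {
        /* Bits shifted out of the low word carry into the high word. */
        r.hi = (v.hi << n) | (v.lo >> (32 - n));
        r.lo = v.lo << n;
    }
    return r;
}
```

For example, shifting 0x0123456789ABCDEF left by 4 gives a high word of 0x12345678 and a low word of 0x9ABCDEF0.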

In practice the cycle count for the ARM library version will be higher than shown, because it uses the stack unnecessarily and the Cortex-M0 has a von Neumann architecture, so each data access delays the fetch of the next instruction.
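The quoted figures can be reproduced by adding up per-instruction costs along each path. The sketch below does this in C for the three shift-left routines. The costs are assumptions, not taken from the text (1 cycle for an ordinary instruction, 3 for a taken branch or `bx lr`, 1+N for a push of N registers, 4+N for a pop of N registers that includes pc), and the function names are hypothetical. The sums do not include the extra stack penalty mentioned above.

```c
#include <assert.h>

/* Assumed Cortex-M0 costs in cycles at zero wait states. These are not
   stated in the text; they are used because they reproduce its figures. */
enum {
    ALU          = 1,  /* mov, cmp, add/sub, shift, orr */
    BR_NOT_TAKEN = 1,  /* conditional branch that falls through */
    BR_TAKEN     = 3,  /* conditional branch that is taken */
    BX_LR        = 3   /* return via bx lr */
};

static int push_cost(int nregs)   { return 1 + nregs; }  /* push {r4, lr}: 2 */
static int pop_pc_cost(int nregs) { return 4 + nregs; }  /* pop {r4, pc}: 2 */

/* ARM library routine, shift of 32 or more:
   push, cmp, blt (falls through), mov, subs, lsls, movs, pop. */
static int arm_llsl_ge32(void)
{
    return push_cost(2) + ALU + BR_NOT_TAKEN + 4 * ALU + pop_pc_cost(2);
}

/* ARM library routine, shift below 32:
   push, cmp, blt (taken), seven single-cycle instructions, pop. */
static int arm_llsl_lt32(void)
{
    return push_cost(2) + ALU + BR_TAKEN + 7 * ALU + pop_pc_cost(2);
}

/* libgcc routine: eleven single-cycle instructions, then bx lr. */
static int gcc_llsl(void)
{
    return 11 * ALU + BX_LR;
}

/* Author's routine, shift below 32:
   four instructions, bhs (falls through), three instructions, bx lr. */
static int mine_llsl_lt32(void)
{
    return 4 * ALU + BR_NOT_TAKEN + 3 * ALU + BX_LR;
}

/* Author's routine, shift of 32 or more:
   four instructions, bhs (taken), two instructions, bx lr. */
static int mine_llsl_ge32(void)
{
    return 4 * ALU + BR_TAKEN + 2 * ALU + BX_LR;
}

/* Checks the sums against the "Cycles (0ws)" figures quoted in the table. */
static void check_against_table(void)
{
    assert(arm_llsl_ge32() == 15 && arm_llsl_lt32() == 20);
    assert(gcc_llsl() == 14);
    assert(mine_llsl_lt32() == 11 && mine_llsl_ge32() == 12);
}
```

Under these assumptions the push and pop alone account for 9 of the ARM routine's cycles.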

Details of exact versions tested.

Conclusions

The ARM standardlib and Microlib routines are identical.

All libgcc versions tested are identical.

The libgcc routine is smaller and faster than the ARM one.

My version is the same size as the libgcc one, but faster still.
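The difference between the libgcc routine and mine can be modelled in C. The sketch below is an approximation of the two instruction sequences, not a translation of either library's source. It relies on the ARMv6-M rule that a register-specified `lsls` or `lsrs` uses the bottom byte of the register as the amount and yields zero for amounts of 32 or more. The helper names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Register-specified LSLS and LSRS take the shift amount from the bottom
   byte of the register, and an amount of 32 or more gives zero. C leaves
   such shifts undefined, so these hypothetical helpers spell the rule out. */
static uint32_t lsl_reg(uint32_t x, uint32_t rs)
{
    uint32_t n = rs & 0xFFu;
    return n < 32u ? x << n : 0u;
}

static uint32_t lsr_reg(uint32_t x, uint32_t rs)
{
    uint32_t n = rs & 0xFFu;
    return n < 32u ? x >> n : 0u;
}

/* libgcc style: no branch. Both cross-word terms are always computed and
   the one that does not apply shifts out to zero. */
static uint64_t llsl_branchless(uint64_t v, uint32_t n)
{
    uint32_t lo = (uint32_t)v, hi = (uint32_t)(v >> 32);
    uint32_t saved = lo;
    hi = lsl_reg(hi, n);
    lo = lsl_reg(lo, n);
    hi |= lsl_reg(saved, n - 32u);  /* zero when n < 32 */
    hi |= lsr_reg(saved, 32u - n);  /* zero when n > 32 or n == 0 */
    return ((uint64_t)hi << 32) | lo;
}

/* Author's style: one branch selects the single cross-word term needed. */
static uint64_t llsl_one_branch(uint64_t v, uint32_t n)
{
    uint32_t lo = (uint32_t)v, hi = (uint32_t)(v >> 32);
    uint32_t saved = lo;
    hi = lsl_reg(hi, n);
    lo = lsl_reg(lo, n);
    if (n >= 32u)
        hi = lsl_reg(saved, n - 32u);   /* models the path taken by bhs */
    else
        hi |= lsr_reg(saved, 32u - n);  /* models the fall-through path */
    return ((uint64_t)hi << 32) | lo;
}

/* Compares both models with the native 64-bit shift for amounts 0..63. */
static void check_llsl_models(uint64_t v)
{
    for (uint32_t n = 0; n < 64u; n++) {
        assert(llsl_branchless(v, n) == v << n);
        assert(llsl_one_branch(v, n) == v << n);
    }
}
```

The branchless form always pays for both cross-word shifts and for saving the low word in r12, while the branching form computes only the term that applies.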

Unsigned Shift Right

Code (ARM standardlib)
__aeabi_llsr:
    push    {r4, lr}
    cmp     r2, 32
    blt.n   1f
    mov     r0, r1
    subs    r2, 32
    lsrs    r0, r2
    movs    r1, 0
    pop     {r4, pc}
1:  mov     r3, r1
    movs    r4, 32
    lsrs    r3, r2
    lsrs    r0, r2
    subs    r2, r4, r2
    lsls    r1, r2
    orrs    r0, r1
    mov     r1, r3
    pop     {r4, pc}

Code (ARM Microlib)
__aeabi_llsr:
    push    {r4, lr}
    cmp     r2, 32
    blt.n   1f
    mov     r0, r1
    subs    r2, 32
    lsrs    r0, r2
    movs    r1, 0
    pop     {r4, pc}
1:  mov     r3, r1
    lsrs    r3, r2
    lsrs    r0, r2
    movs    r4, 32
    subs    r2, r4, r2
    lsls    r1, r2
    orrs    r0, r1
    mov     r1, r3
    pop     {r4, pc}

Code (GCC 4, 5, 6 & 7)
__aeabi_llsr:
    lsrs    r0, r2
    adds    r3, r1, 0
    lsrs    r1, r2
    mov     r12, r3
    subs    r2, 32
    lsrs    r3, r2
    orrs    r0, r3
    negs    r2, r2
    mov     r3, r12
    lsls    r3, r2
    orrs    r0, r3
    bx      lr

Code (Mine)
__aeabi_llsr:
    movs    r3, r1
    lsrs    r1, r2
    lsrs    r0, r2
    subs    r2, 32
    bhs     1f
    rsbs    r2, 0
    lsls    r3, r2
    orrs    r0, r3
    bx      lr
1:  lsrs    r3, r2
    movs    r0, r3
    bx      lr

Tool            ARM standardlib    ARM Microlib    GCC 4, 5, 6 & 7    Mine
Code (bytes)    34                 34              24                 24
Stack (bytes)   8                  8               nil                nil
Cycles (0ws)    15 or 21           15 or 21        14                 11 or 12

Details

Where two cycle counts are shown, they depend on whether the shift amount is below 32 or at least 32.
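The unsigned right shift mirrors the left shift: bits leaving the high word carry into the top of the low word, and for amounts of 32 or more the high word becomes zero. The following is a minimal C sketch of the required result, assuming amounts of 0 to 63. The names `u64_halves` and `shr64_model` are illustrative only.

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative type: low word in r0, high word in r1. */
typedef struct {
    uint32_t lo;
    uint32_t hi;
} u64_halves;

/* Reference model of a 64-bit logical right shift using 32-bit operations. */
static u64_halves shr64_model(u64_halves v, unsigned n)
{
    u64_halves r = v;  /* n == 0 leaves the value unchanged */
    assert(n < 64);    /* assumption: only amounts 0..63 are modelled */
    if (n >= 32) {
        /* The high word moves wholly into the low word; zeros fill the top. */
        r.lo = v.hi >> (n - 32);
        r.hi = 0;
    } else if (n != 0) {
        /* Bits shifted out of the high word enter the top of the low word. */
        r.lo = (v.lo >> n) | (v.hi << (32 - n));
        r.hi = v.hi >> n;
    }
    return r;
}
```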

In practice the cycle counts for both ARM library variants will be higher than shown, because they use the stack unnecessarily and the Cortex-M0 has a von Neumann architecture, so each data access delays the fetch of the next instruction.

Details of exact versions tested.

Conclusions

The ARM standardlib and Microlib routines contain the same instructions in a slightly different order, resulting in the same code size and performance.

The libgcc routine is smaller and faster than either of the ARM ones.

My version is the same size as the libgcc one, but faster.

Signed Shift Right

Code (ARM standardlib)
__aeabi_lasr:
    push    {r4, lr}
    cmp     r2, 32
    blt.n   1f
    mov     r0, r1
    asrs    r3, r1, 31
    subs    r2, 32
    asrs    r0, r2
    asrs    r1, r0, 31
    orrs    r3, r1
    b.n     2f
1:  mov     r3, r1
    movs    r4, 32
    asrs    r3, r2
    lsrs    r0, r2
    subs    r2, r4, r2
    lsls    r1, r2
    orrs    r0, r1
2:  mov     r1, r3
    pop     {r4, pc}

Code (ARM Microlib)
__aeabi_lasr:
    push    {r4, lr}
    cmp     r2, 32
    blt.n   1f
    asrs    r3, r1, 31
    mov     r0, r1
    subs    r2, 32
    asrs    r0, r2
    asrs    r1, r0, 31
    orrs    r3, r1
    b.n     2f
1:  mov     r3, r1
    asrs    r3, r2
    lsrs    r0, r2
    movs    r4, 32
    subs    r2, r4, r2
    lsls    r1, r2
    orrs    r0, r1
2:  mov     r1, r3
    pop     {r4, pc}

Code (GCC 4, 5, 6 & 7)
__aeabi_lasr:
    lsrs    r0, r2
    adds    r3, r1, 0
    asrs    r1, r2
    subs    r2, 32
    bmi.n   1f
    mov     r12, r3
    asrs    r3, r2
    orrs    r0, r3
    mov     r3, r12
1:  negs    r2, r2
    lsls    r3, r2
    orrs    r0, r3
    bx      lr

Code (Mine)
__aeabi_lasr:
    movs    r3, r1
    asrs    r1, r2
    lsrs    r0, r2
    subs    r2, 32
    bhs     1f
    rsbs    r2, 0
    lsls    r3, r2
    orrs    r0, r3
    bx      lr
1:  asrs    r3, r2
    movs    r0, r3
    bx      lr

Tool            ARM standardlib    ARM Microlib    GCC 4, 5, 6 & 7    Mine
Code (bytes)    38                 38              26                 24
Stack (bytes)   8                  8               nil                nil
Cycles (0ws)    21                 21              15 or 13           11 or 12

Details

Where two cycle counts are shown, they depend on whether the shift amount is below 32 or at least 32.

In practice the cycle counts for both ARM library variants will be higher than shown, because they use the stack unnecessarily and the Cortex-M0 has a von Neumann architecture, so each data access delays the fetch of the next instruction.

Details of exact versions tested.

Conclusions

The ARM standardlib and Microlib routines contain the same instructions in a slightly different order, resulting in the same code size and performance.

The libgcc routine is smaller and faster than either of the ARM ones.

My version is smaller and faster than the libgcc one.
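Unlike its logical shift routines, the libgcc signed routine contains a branch. One plausible reading, sketched in C below, is that a register-specified `asrs` by 32 or more yields a word full of sign bits rather than zero, so the unwanted cross-word term cannot simply be ORed in and must be skipped for amounts below 32. The helpers are hypothetical models of the ARMv6-M shift rules, and `lasr_model` approximates the libgcc sequence rather than reproducing its source.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical models of register-specified shifts: the amount is the
   bottom byte of the register. */
static uint32_t lsl_reg(uint32_t x, uint32_t rs)
{
    uint32_t n = rs & 0xFFu;
    return n < 32u ? x << n : 0u;
}

static uint32_t lsr_reg(uint32_t x, uint32_t rs)
{
    uint32_t n = rs & 0xFFu;
    return n < 32u ? x >> n : 0u;
}

/* ASRS by 32 or more does not give zero: every bit copies the sign bit. */
static uint32_t asr_reg(uint32_t x, uint32_t rs)
{
    uint32_t n = rs & 0xFFu;
    uint32_t fill = (x & 0x80000000u) ? 0xFFFFFFFFu : 0u;
    return n < 32u ? (x >> n) | (fill & ~(0xFFFFFFFFu >> n)) : fill;
}

/* Approximate model of the libgcc signed shift right sequence. */
static uint64_t lasr_model(uint64_t v, uint32_t n)
{
    uint32_t lo = (uint32_t)v, hi = (uint32_t)(v >> 32);
    uint32_t saved = hi;
    lo = lsr_reg(lo, n);
    hi = asr_reg(hi, n);
    if (n >= 32u)                      /* bmi skips this term when n < 32 */
        lo |= asr_reg(saved, n - 32u);
    lo |= lsl_reg(saved, 32u - n);     /* zero when n > 32 or n == 0 */
    return ((uint64_t)hi << 32) | lo;
}

/* Portable reference: logical shift plus explicit sign fill. */
static uint64_t asr64_ref(uint64_t v, uint32_t n)
{
    uint64_t fill = (v >> 63) ? ~(UINT64_MAX >> n) : 0u;
    return (v >> n) | fill;
}

/* Compares the model with the reference for every amount 0..63. */
static void check_lasr_model(uint64_t v)
{
    for (uint32_t n = 0; n < 64u; n++)
        assert(lasr_model(v, n) == asr64_ref(v, n));
}
```

If the guarded term were applied for a negative value with an amount below 32, `asr_reg` would return 0xFFFFFFFF and the OR would overwrite the whole low word.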