This page forms part of an ARM Cortex-M Run-Time Library Analysis by Tom Vajzovic.
The text and presentation of this analysis is copyright 2018 Tom Vajzovic. You may not copy it except as permitted by law.
The ARM and GCC routines presented here are subject to separate copyright. Displaying them in this way is academic fair use and so I have not sought a licence from the copyright holders. You must not take them from here to use them for any other purpose. You shouldn't want to anyway, because they are suboptimal.
You may use my versions (which are better) according to the terms of The Truly Free Licence (public domain).
Tool | ARM standardlib & Microlib | GCC 4, 5, 6 & 7 | Mine |
---|---|---|---|
Code | __aeabi_llsl: push {r4, lr} cmp r2, 32 blt.n 1f mov r1, r0 subs r2, 32 lsls r1, r2 movs r0, 0 pop {r4, pc} 1: lsls r1, r2 movs r3, 32 subs r4, r3, r2 mov r3, r0 lsrs r3, r4 orrs r1, r3 lsls r0, r2 pop {r4, pc} | __aeabi_llsl: lsls r1, r2 adds r3, r0, 0 lsls r0, r2 mov r12, r3 subs r2, 32 lsls r3, r2 orrs r1, r3 negs r2, r2 mov r3, r12 lsrs r3, r2 orrs r1, r3 bx lr | __aeabi_llsl: movs r3, r0 lsls r1, r2 lsls r0, r2 subs r2, 32 bhs 1f rsbs r2, 0 lsrs r3, r2 orrs r1, r3 bx lr 1: lsls r3, r2 movs r1, r3 bx lr |
Code (bytes) | 32 | 24 | 24 |
Stack (bytes) | 8 | nil | nil |
Cycles (0ws) | 15 or 20 | 14 | 11 or 12 |
Where two cycle counts are shown, which one applies depends on whether the shift count is less than 32 bits or not.
The real cycle count for the ARM library version will be higher than shown, because it unnecessarily uses the stack, and the Cortex-M0 has a von Neumann architecture, meaning that each data access delays fetching the next instruction.
Details of exact versions tested.
The ARM standardlib and Microlib routines are identical.
All libgcc versions tested are identical.
The libgcc routine is smaller and faster than the ARM one.
My version is the same size as the libgcc one, but faster still.
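The logic of the routine in the "Mine" column can be followed more easily in a C model that mirrors it instruction by instruction. The sketch below is only an illustration: the names `thumb_lsl`, `thumb_lsr`, `model_llsl` and `model_llsl_matches` are hypothetical, and the helpers assume the Thumb behaviour that a register-specified shift uses only the bottom byte of the count and yields zero for a count of 32 or more.

```c
#include <stdint.h>

/* Assumed Thumb semantics: LSLS/LSRS by register use only the bottom
   byte of the count, and a count of 32 or more gives zero. */
static uint32_t thumb_lsl(uint32_t x, uint32_t count)
{
    count &= 0xFFu;
    return count < 32u ? x << count : 0u;
}

static uint32_t thumb_lsr(uint32_t x, uint32_t count)
{
    count &= 0xFFu;
    return count < 32u ? x >> count : 0u;
}

/* Hypothetical C model of the "Mine" __aeabi_llsl.
   r0 = low word, r1 = high word, r2 = shift count. */
uint64_t model_llsl(uint64_t value, uint32_t n)
{
    uint32_t r0 = (uint32_t)value;
    uint32_t r1 = (uint32_t)(value >> 32);
    uint32_t r2 = n;
    uint32_t r3 = r0;                 /* movs r3, r0 */
    r1 = thumb_lsl(r1, r2);           /* lsls r1, r2 */
    r0 = thumb_lsl(r0, r2);           /* lsls r0, r2 */
    int carry = r2 >= 32u;            /* subs r2, 32 (carry set if no borrow) */
    r2 -= 32u;
    if (carry) {                      /* bhs 1f */
        r3 = thumb_lsl(r3, r2);       /* 1: lsls r3, r2 */
        r1 = r3;                      /* movs r1, r3 */
    } else {
        r2 = 0u - r2;                 /* rsbs r2, 0 */
        r3 = thumb_lsr(r3, r2);       /* lsrs r3, r2 */
        r1 |= r3;                     /* orrs r1, r3 */
    }
    return ((uint64_t)r1 << 32) | r0; /* bx lr */
}

/* Returns 1 if the model agrees with a plain 64-bit shift for counts 0..63. */
int model_llsl_matches(uint64_t value)
{
    for (uint32_t n = 0; n < 64u; n++) {
        if (model_llsl(value, n) != (value << n)) {
            return 0;
        }
    }
    return 1;
}
```

Note the count of zero: the model computes `32 - 0 = 32` for the right shift, which by the assumed semantics contributes nothing to the high word, so no special case is needed.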
Tool | ARM standardlib | ARM Microlib | GCC 4, 5, 6 & 7 | Mine |
---|---|---|---|---|
Code | __aeabi_llsr: push {r4, lr} cmp r2, 32 blt.n 1f mov r0, r1 subs r2, 32 lsrs r0, r2 movs r1, 0 pop {r4, pc} 1: mov r3, r1 movs r4, 32 lsrs r3, r2 lsrs r0, r2 subs r2, r4, r2 lsls r1, r2 orrs r0, r1 mov r1, r3 pop {r4, pc} | __aeabi_llsr: push {r4, lr} cmp r2, 32 blt.n 1f mov r0, r1 subs r2, 32 lsrs r0, r2 movs r1, 0 pop {r4, pc} 1: mov r3, r1 lsrs r3, r2 lsrs r0, r2 movs r4, 32 subs r2, r4, r2 lsls r1, r2 orrs r0, r1 mov r1, r3 pop {r4, pc} | __aeabi_llsr: lsrs r0, r2 adds r3, r1, 0 lsrs r1, r2 mov r12, r3 subs r2, 32 lsrs r3, r2 orrs r0, r3 negs r2, r2 mov r3, r12 lsls r3, r2 orrs r0, r3 bx lr | __aeabi_llsr: movs r3, r1 lsrs r1, r2 lsrs r0, r2 subs r2, 32 bhs 1f rsbs r2, 0 lsls r3, r2 orrs r0, r3 bx lr 1: lsrs r3, r2 movs r0, r3 bx lr |
Code (bytes) | 34 | 34 | 24 | 24 |
Stack (bytes) | 8 | 8 | nil | nil |
Cycles (0ws) | 15 or 21 | 15 or 21 | 14 | 11 or 12 |
Where two cycle counts are shown, which one applies depends on whether the shift count is less than 32 bits or not.
The real cycle counts for both ARM library variants will be higher than shown, because they unnecessarily use the stack, and the Cortex-M0 has a von Neumann architecture, meaning that each data access delays fetching the next instruction.
Details of exact versions tested.
The ARM standardlib and Microlib routines contain the same instructions in a slightly different order, resulting in the same code size and performance.
The libgcc routine is smaller and faster than either of the ARM ones.
My version is the same size as the libgcc one, but faster.
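As a rough check of the logical right shift in the "Mine" column, it can be mirrored in C one instruction at a time. This is a sketch under stated assumptions, not a replacement for the assembly: `thumb_lsl`, `thumb_lsr`, `model_llsr` and `model_llsr_matches` are hypothetical names, and the helpers assume that a Thumb register-specified shift uses only the bottom byte of the count and yields zero for a count of 32 or more.

```c
#include <stdint.h>

/* Assumed Thumb semantics: LSLS/LSRS by register use only the bottom
   byte of the count, and a count of 32 or more gives zero. */
static uint32_t thumb_lsl(uint32_t x, uint32_t count)
{
    count &= 0xFFu;
    return count < 32u ? x << count : 0u;
}

static uint32_t thumb_lsr(uint32_t x, uint32_t count)
{
    count &= 0xFFu;
    return count < 32u ? x >> count : 0u;
}

/* Hypothetical C model of the "Mine" __aeabi_llsr.
   r0 = low word, r1 = high word, r2 = shift count. */
uint64_t model_llsr(uint64_t value, uint32_t n)
{
    uint32_t r0 = (uint32_t)value;
    uint32_t r1 = (uint32_t)(value >> 32);
    uint32_t r2 = n;
    uint32_t r3 = r1;                 /* movs r3, r1 */
    r1 = thumb_lsr(r1, r2);           /* lsrs r1, r2 */
    r0 = thumb_lsr(r0, r2);           /* lsrs r0, r2 */
    int carry = r2 >= 32u;            /* subs r2, 32 (carry set if no borrow) */
    r2 -= 32u;
    if (carry) {                      /* bhs 1f */
        r3 = thumb_lsr(r3, r2);       /* 1: lsrs r3, r2 */
        r0 = r3;                      /* movs r0, r3 */
    } else {
        r2 = 0u - r2;                 /* rsbs r2, 0 */
        r3 = thumb_lsl(r3, r2);       /* lsls r3, r2 */
        r0 |= r3;                     /* orrs r0, r3 */
    }
    return ((uint64_t)r1 << 32) | r0; /* bx lr */
}

/* Returns 1 if the model agrees with a plain 64-bit shift for counts 0..63. */
int model_llsr_matches(uint64_t value)
{
    for (uint32_t n = 0; n < 64u; n++) {
        if (model_llsr(value, n) != (value >> n)) {
            return 0;
        }
    }
    return 1;
}
```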
Tool | ARM standardlib | ARM Microlib | GCC 4, 5, 6 & 7 | Mine |
---|---|---|---|---|
Code | __aeabi_lasr: push {r4, lr} cmp r2, 32 blt.n 1f mov r0, r1 asrs r3, r1, 31 subs r2, 32 asrs r0, r2 asrs r1, r0, 31 orrs r3, r1 b.n 2f 1: mov r3, r1 movs r4, 32 asrs r3, r2 lsrs r0, r2 subs r2, r4, r2 lsls r1, r2 orrs r0, r1 2: mov r1, r3 pop {r4, pc} | __aeabi_lasr: push {r4, lr} cmp r2, 32 blt.n 1f asrs r3, r1, 31 mov r0, r1 subs r2, 32 asrs r0, r2 asrs r1, r0, 31 orrs r3, r1 b.n 2f 1: mov r3, r1 asrs r3, r2 lsrs r0, r2 movs r4, 32 subs r2, r4, r2 lsls r1, r2 orrs r0, r1 2: mov r1, r3 pop {r4, pc} | __aeabi_lasr: lsrs r0, r2 adds r3, r1, 0 asrs r1, r2 subs r2, 32 bmi.n 1f mov r12, r3 asrs r3, r2 orrs r0, r3 mov r3, r12 1: negs r2, r2 lsls r3, r2 orrs r0, r3 bx lr | __aeabi_lasr: movs r3, r1 asrs r1, r2 lsrs r0, r2 subs r2, 32 bhs 1f rsbs r2, 0 lsls r3, r2 orrs r0, r3 bx lr 1: asrs r3, r2 movs r0, r3 bx lr |
Code (bytes) | 38 | 38 | 26 | 24 |
Stack (bytes) | 8 | 8 | nil | nil |
Cycles (0ws) | 21 | 21 | 15 or 13 | 11 or 12 |
Where two cycle counts are shown, which one applies depends on whether the shift count is less than 32 bits or not.
The real cycle counts for both ARM library variants will be higher than shown, because they unnecessarily use the stack, and the Cortex-M0 has a von Neumann architecture, meaning that each data access delays fetching the next instruction.
Details of exact versions tested.
The ARM standardlib and Microlib routines contain the same instructions in a slightly different order, resulting in the same code size and performance.
The libgcc routine is smaller and faster than either of the ARM ones.
My version is smaller and faster than the libgcc one.
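The arithmetic shift differs from the logical one only in how the high word is treated, which a C model of the "Mine" routine makes visible. The sketch below is illustrative only: `thumb_lsl`, `thumb_lsr`, `thumb_asr`, `model_lasr` and `model_lasr_matches` are hypothetical names, and the helpers assume that a Thumb register-specified shift uses only the bottom byte of the count, that LSLS/LSRS give zero for a count of 32 or more, and that ASRS gives 32 copies of the sign bit in that case.

```c
#include <stdint.h>

/* Assumed Thumb semantics: shifts by register use only the bottom byte
   of the count; LSL/LSR by 32 or more give zero. */
static uint32_t thumb_lsl(uint32_t x, uint32_t count)
{
    count &= 0xFFu;
    return count < 32u ? x << count : 0u;
}

static uint32_t thumb_lsr(uint32_t x, uint32_t count)
{
    count &= 0xFFu;
    return count < 32u ? x >> count : 0u;
}

/* Assumed: ASR by 32 or more gives all sign bits. Written without
   relying on implementation-defined signed right shift in C. */
static uint32_t thumb_asr(uint32_t x, uint32_t count)
{
    uint32_t sign_fill = (x & 0x80000000u) ? 0xFFFFFFFFu : 0u;
    count &= 0xFFu;
    if (count >= 32u) return sign_fill;
    if (count == 0u) return x;
    return (x >> count) | (sign_fill << (32u - count));
}

/* Hypothetical C model of the "Mine" __aeabi_lasr.
   r0 = low word, r1 = high word, r2 = shift count. */
uint64_t model_lasr(uint64_t value, uint32_t n)
{
    uint32_t r0 = (uint32_t)value;
    uint32_t r1 = (uint32_t)(value >> 32);
    uint32_t r2 = n;
    uint32_t r3 = r1;                 /* movs r3, r1 */
    r1 = thumb_asr(r1, r2);           /* asrs r1, r2 */
    r0 = thumb_lsr(r0, r2);           /* lsrs r0, r2 */
    int carry = r2 >= 32u;            /* subs r2, 32 (carry set if no borrow) */
    r2 -= 32u;
    if (carry) {                      /* bhs 1f */
        r3 = thumb_asr(r3, r2);       /* 1: asrs r3, r2 */
        r0 = r3;                      /* movs r0, r3 */
    } else {
        r2 = 0u - r2;                 /* rsbs r2, 0 */
        r3 = thumb_lsl(r3, r2);       /* lsls r3, r2 */
        r0 |= r3;                     /* orrs r0, r3 */
    }
    return ((uint64_t)r1 << 32) | r0; /* bx lr */
}

/* Returns 1 if the model agrees with a portable arithmetic shift
   reference for counts 0..63. */
int model_lasr_matches(uint64_t value)
{
    int negative = (value >> 63) != 0u;
    for (uint32_t n = 0; n < 64u; n++) {
        /* For negative inputs, asr(x, n) equals ~(~x >> n). */
        uint64_t expected = negative ? ~(~value >> n) : (value >> n);
        if (model_lasr(value, n) != expected) {
            return 0;
        }
    }
    return 1;
}
```

For counts of 32 or more, the first `asrs` already leaves the high word filled with the sign bit, so the branch target only has to produce the low word.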