This page forms part of an ARM Cortex-M Run-Time Library Analysis by Tom Vajzovic.

The text and presentation of this analysis are copyright 2018 Tom Vajzovic. You may not copy it except as permitted by law.

The ARM and GCC routines presented here are subject to separate copyright. Displaying them in this way is academic fair use and so I have not sought a licence from the copyright holders. You must not take them from here to use them for any other purpose. You shouldn't want to anyway, because they are suboptimal.

You may use my versions (which are better) according to the terms of The Truly Free Licence (public domain).

Tool | ARM standardlib | ARM Microlib | GCC | Mine |
---|---|---|---|---|
Code (bytes) | 38 | 30 | 24 | 28 |
Cycles (0ws) | 7 to 11 | 7 to 14 | 13 to 15 | 7 to 11 |

Code, ARM standardlib:

```
__aeabi_llsl:
        subs.w  r3, r2, 32
        bpl.n   1f
        rsb     r3, r2, 32
        lsl.w   r1, r1, r2
        lsr.w   r3, r0, r3
        lsl.w   r0, r0, r2
        orr.w   r1, r1, r3
        bx      lr
1:      lsl.w   r1, r0, r3
        mov.w   r0, 0
        bx      lr
```

Code, ARM Microlib:

```
__aeabi_llsl:
        cmp     r2, 32
        blt.n   1f
        subs    r2, 32
        lsl.w   r1, r0, r2
        movs    r0, 0
        bx      lr
1:      lsls    r1, r2
        rsb     r3, r2, 32
        lsr.w   r3, r0, r3
        orrs    r1, r3
        lsls    r0, r2
        bx      lr
```

Code, GCC:

```
__aeabi_llsl:
        lsls    r1, r2
        adds    r3, r0, 0
        lsls    r0, r2
        mov     r12, r3
        subs    r2, 32
        lsls    r3, r2
        orrs    r1, r3
        negs    r2, r2
        mov     r3, r12
        lsrs    r3, r2
        orrs    r1, r3
        bx      lr
```

Code, mine:

```
__aeabi_llsl:
        subs    r3, r2, 32
        bhs     1f
        rsbs    r3, 0
        lsr     r3, r0, r3
        lsls    r1, r2
        lsls    r0, r2
        orrs    r1, r3
        bx      lr
1:      lsls    r1, r0, r3
        movs    r0, 0
        bx      lr
```

Cycle counts will vary within the ranges shown because:

- there are different code paths depending on whether the shift count is less than 32 bits or not.
- the branch predictor may or may not be able to successfully fetch and decode instructions after a branch before it is taken.
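The two code paths can be illustrated with a minimal C sketch of the 64-bit logical left shift these routines implement. The `llsl` name and the lo/hi/shift argument convention here are illustrative only (not the AEABI calling convention), and the shift count is assumed to be in the range 0 to 63:

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative sketch of __aeabi_llsl's job: a 64-bit logical left
 * shift, with the value held as a low word and a high word mirroring
 * the r1:r0 register pair.  Assumes 0 <= n < 64. */
static uint64_t llsl(uint32_t lo, uint32_t hi, uint32_t n)
{
    if (n < 32) {
        /* Short-shift path: the high word picks up the bits shifted
         * out of the low word.  The n == 0 guard avoids a 32-bit
         * shift, which is undefined in C although the hardware
         * register-controlled shifters handle it natively. */
        uint32_t carry = n ? lo >> (32 - n) : 0;
        hi = (hi << n) | carry;
        lo <<= n;
    } else {
        /* Long-shift path: the low word moves entirely into the high
         * word and is then cleared. */
        hi = lo << (n - 32);
        lo = 0;
    }
    return (uint64_t)hi << 32 | lo;
}
```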

Details of exact versions tested.

The libgcc version is smallest but slowest. The ARM standardlib version is the fastest in the worst case and probably fastest on average (depending on input distribution). The ARM Microlib version is not small but has the fastest best case and is faster than the libgcc version even in the worst case.

My version is as fast as the ARM standardlib version but smaller. It is not quite as small as the libgcc version.

Both the ARM versions are bigger than they could be because they use wide instructions (eg: `lsl.w`) where they could use a narrow equivalent that sets the flags (eg: `lsls`). This could be to avoid setting the flags when they are not required (which might help the branch predictor), or it could be a mistake. My version (which I wrote before seeing the ARM versions) is equivalent to the ARM standardlib version but with the narrow instructions where possible.

Tool | ARM standardlib | ARM Microlib | GCC | Mine |
---|---|---|---|---|
Code (bytes) | 38 | 32 | 24 | 28 |
Cycles (0ws) | 7 to 11 | 7 to 15 | 13 to 15 | 7 to 11 |

Code, ARM standardlib:

```
__aeabi_llsr:
        subs.w  r3, r2, 32
        bpl.n   1f
        rsb     r3, r2, 32
        lsr.w   r0, r0, r2
        lsl.w   r3, r1, r3
        lsr.w   r1, r1, r2
        orr.w   r0, r0, r3
        bx      lr
1:      lsr.w   r0, r1, r3
        mov.w   r1, 0
        bx      lr
```

Code, ARM Microlib:

```
__aeabi_llsr:
        cmp     r2, 32
        blt.n   1f
        subs    r2, 32
        lsr.w   r0, r1, r2
        movs    r1, 0
        bx      lr
1:      lsr.w   r3, r1, r2
        lsrs    r0, r2
        rsb     r2, r2, 32
        lsls    r1, r2
        orrs    r0, r1
        mov     r1, r3
        bx      lr
```

Code, GCC:

```
__aeabi_llsr:
        lsrs    r0, r2
        adds    r3, r1, 0
        lsrs    r1, r2
        mov     r12, r3
        subs    r2, 32
        lsrs    r3, r2
        orrs    r0, r3
        negs    r2, r2
        mov     r3, r12
        lsls    r3, r2
        orrs    r0, r3
        bx      lr
```

Code, mine:

```
__aeabi_llsr:
        subs    r3, r2, 32
        bhs     1f
        rsbs    r3, 0
        lsl     r3, r1, r3
        lsrs    r0, r2
        lsrs    r1, r2
        orrs    r0, r3
        bx      lr
1:      lsrs    r0, r1, r3
        movs    r1, 0
        bx      lr
```

Cycle counts will vary within the ranges shown because:

- there are different code paths depending on whether the shift count is less than 32 bits or not.
- the branch predictor may or may not be able to successfully fetch and decode instructions after a branch before it is taken.
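As with the left shift, the two code paths can be illustrated with a minimal C sketch of the 64-bit logical right shift. The `llsr` name and the lo/hi/shift argument convention here are illustrative only (not the AEABI calling convention), and the shift count is assumed to be in the range 0 to 63:

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative sketch of __aeabi_llsr's job: a 64-bit logical right
 * shift, with the value held as a low word and a high word mirroring
 * the r1:r0 register pair.  Assumes 0 <= n < 64. */
static uint64_t llsr(uint32_t lo, uint32_t hi, uint32_t n)
{
    if (n < 32) {
        /* Short-shift path: the low word picks up the bits shifted
         * out of the high word.  The n == 0 guard avoids a 32-bit
         * shift, which is undefined in C although the hardware
         * register-controlled shifters handle it natively. */
        uint32_t carry = n ? hi << (32 - n) : 0;
        lo = (lo >> n) | carry;
        hi >>= n;
    } else {
        /* Long-shift path: the high word moves entirely into the low
         * word and is then cleared. */
        lo = hi >> (n - 32);
        hi = 0;
    }
    return (uint64_t)hi << 32 | lo;
}
```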

Details of exact versions tested.

The libgcc version is smallest but slowest. The ARM standardlib version is the fastest in the worst case and probably fastest on average (depending on input distribution). The ARM Microlib version is not small but has the fastest best case and is faster than the libgcc version even in the worst case.

My version is as fast as the ARM standardlib version but smaller. It is not quite as small as the libgcc version.

Both the ARM versions are bigger than they could be because they use wide instructions (eg: `lsl.w`) where they could use a narrow equivalent that sets the flags (eg: `lsls`). This could be to avoid setting the flags when they are not required (which might help the branch predictor), or it could be a mistake. My version (which I wrote before seeing the ARM versions) is equivalent to the ARM standardlib version but with the narrow instructions where possible.

Tool | ARM standardlib | ARM Microlib | GCC | Mine |
---|---|---|---|---|
Code (bytes) | 38 | 36 | 26 | 28 |
Cycles (0ws) | 7 to 11 | 11 to 15 | 11 to 16 | 7 to 11 |

Code, ARM standardlib:

```
__aeabi_lasr:
        subs.w  r3, r2, 32
        bpl.n   1f
        rsb     r3, r2, 32
        lsr.w   r0, r0, r2
        lsl.w   r3, r1, r3
        asr.w   r1, r1, r2
        orr.w   r0, r0, r3
        bx      lr
1:      asr.w   r0, r1, r3
        mov.w   r1, r1, asr 31
        bx      lr
```

Code, ARM Microlib:

```
__aeabi_lasr:
        cmp     r2, 32
        blt.n   1f
        asrs    r3, r1, 31
        subs    r2, 32
        asr.w   r0, r1, r2
        orr.w   r3, r3, r0, asr 31
        b.n     2f
1:      asr.w   r3, r1, r2
        lsrs    r0, r2
        rsb     r2, r2, 32
        lsls    r1, r2
        orrs    r0, r1
2:      mov     r1, r3
        bx      lr
```

Code, GCC:

```
__aeabi_lasr:
        lsrs    r0, r2
        adds    r3, r1, 0
        asrs    r1, r2
        subs    r2, 32
        bmi.n   1f
        mov     r12, r3
        asrs    r3, r2
        orrs    r0, r3
        mov     r3, r12
1:      negs    r2, r2
        lsls    r3, r2
        orrs    r0, r3
        bx      lr
```

Code, mine:

```
__aeabi_lasr:
        subs    r3, r2, 32
        bhs     1f
        rsbs    r3, 0
        lsl     r3, r1, r3
        lsrs    r0, r2
        asrs    r1, r2
        orrs    r0, r3
        bx      lr
1:      asrs    r0, r1, r3
        asrs    r1, 31
        bx      lr
```

Cycle counts will vary within the ranges shown because:

- there are different code paths depending on whether the shift count is less than 32 bits or not.
- the branch predictor may or may not be able to successfully fetch and decode instructions after a branch before it is taken.
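The two code paths can again be illustrated with a minimal C sketch, this time of the 64-bit arithmetic right shift. The `lasr` name and the lo/hi/shift argument convention here are illustrative only (not the AEABI calling convention); the shift count is assumed to be in the range 0 to 63, and `>>` on a negative `int32_t` is assumed to be an arithmetic shift (C leaves this implementation-defined, but it holds for the compilers discussed here):

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative sketch of __aeabi_lasr's job: a 64-bit arithmetic right
 * shift, with the value held as a low word and a high word mirroring
 * the r1:r0 register pair.  Assumes 0 <= n < 64 and arithmetic >> on
 * negative int32_t values. */
static uint64_t lasr(uint32_t lo, uint32_t hi, uint32_t n)
{
    if (n < 32) {
        /* Short-shift path: the low word picks up bits from the high
         * word; the high word is shifted arithmetically to preserve
         * the sign.  The n == 0 guard avoids a 32-bit shift, which is
         * undefined in C. */
        uint32_t carry = n ? hi << (32 - n) : 0;
        lo = (lo >> n) | carry;
        hi = (uint32_t)((int32_t)hi >> n);
    } else {
        /* Long-shift path: the arithmetically shifted high word
         * becomes the low word; the high word fills with sign bits. */
        lo = (uint32_t)((int32_t)hi >> (n - 32));
        hi = (uint32_t)((int32_t)hi >> 31);
    }
    return (uint64_t)hi << 32 | lo;
}
```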

Details of exact versions tested.

The libgcc version is smallest but slowest. The ARM standardlib version is the fastest.

The ARM Microlib version is neither small nor fast.

My version is as fast as the ARM standardlib version but smaller. It is not quite as small as the libgcc version.

Both the ARM versions are bigger than they could be because they use wide instructions (eg: `lsl.w`) where they could use a narrow equivalent that sets the flags (eg: `lsls`). This could be to avoid setting the flags when they are not required (which might help the branch predictor), or it could be a mistake. My version (which I wrote before seeing the ARM versions) is equivalent to the ARM standardlib version but with the narrow instructions where possible.