AVR 8-bit: Reusability vs. Efficiency?

Question

My implementation of an API function doing a simple SPI transfer, offering a void *intfPtr parameter to pass a "device descriptor" which I am using to pass the I/O port and pin for SPI chip select, looks like this:

#include <stdint.h>

typedef struct {
    volatile uint8_t *port;
    uint8_t pin;
} Intf;

static volatile uint8_t PORTD_OUT = 4;

static uint8_t transmit(uint8_t data) {
    return 0x00;
}

static uint8_t bme68xWrite(uint8_t reg,
                           const uint8_t *data,
                           uint32_t len,
                           void *intfPtr) {
    const Intf intf = *((Intf *)intfPtr);
    *intf.port &= ~(1u << intf.pin);
    transmit(reg);
    for (uint32_t i = 0; i < len; i++) {
        transmit(data[i]);
    }
    *intf.port |= (1u << intf.pin);

    return 0;
}

I was wondering how efficient (in terms of number of instructions) this implementation is. If I picked the correct part, these are the instructions generated for the two lines before transmit(reg):

00002cdc <.Loc.97>:
    const Intf intf = *((Intf *)intfPtr);
    2cdc:       00 81           ld      r16, Z
    2cde:       11 81           ldd     r17, Z+1        ; 0x01

00002ce0 <.LVL44>:
    *intf.port &= ~(1u << intf.pin);
    2ce0:       d8 01           movw    r26, r16
    2ce2:       2c 91           ld      r18, X

00002ce4 <.Loc.100>:
    2ce4:       92 81           ldd     r25, Z+2        ; 0x02

00002ce6 <.Loc.101>:
    2ce6:       41 e0           ldi     r20, 0x01       ; 1
    2ce8:       50 e0           ldi     r21, 0x00       ; 0
    2cea:       5a 01           movw    r10, r20
    2cec:       01 c0           rjmp    .+2             ; 0x2cf0 <.L2^B2>

00002cee <.L1^B6>:
    2cee:       aa 0c           add     r10, r10

00002cf0 <.L2^B2>:
    2cf0:       9a 95           dec     r25
    2cf2:       ea f7           brpl    .-6             ; 0x2cee <.L1^B6>

00002cf4 <.Loc.102>:
    2cf4:       9a 2d           mov     r25, r10
    2cf6:       90 95           com     r25
    2cf8:       92 23           and     r25, r18
    2cfa:       9c 93           st      X, r25

Not surprisingly, simply hardcoding port and pin, as in PORTD_OUT &= ~(1u << BME_CS_PD4);, yields far fewer instructions:

00002cd0 <.Loc.97>:
    PORTD_OUT &= ~(1u << BME_CS_PD4);
    2cd0:       90 91 64 04     lds     r25, 0x0464     ; 0x800464 <__TEXT_REGION_LENGTH__+0x7f0464>

00002cd4 <.Loc.98>:
    2cd4:       9f 7e           andi    r25, 0xEF       ; 239
    2cd6:       90 93 64 04     sts     0x0464, r25     ; 0x800464 <__TEXT_REGION_LENGTH__+0x7f0464>

Counting all instructions of both implementations, it is 76 vs. 53. This is with avr-gcc (GCC) 14.2.0 and -O2, by the way.

So, even if passing the port and pin as a parameter is perhaps more elegant than hardcoding them, it seems to be an expensive deal, especially considering that the function is called very often?

Do you really need uint32_t len? Using uint16_t will shave off some more bytes.
– emacs drives me nuts, Commented Oct 24 at 15:16
@emacsdrivesmenuts Certainly not, but the API defines it: github.com/boschsensortec/BME68x_SensorAPI/blob/…
– Torsten Römer, Commented Oct 24 at 17:34
So when you are bound to that exact code, it's unclear to me what your question is about, i.e. what degrees of freedom do you have to change things?
– emacs drives me nuts, Commented Oct 24 at 17:50
Well, my question was more whether it would be preferable to hardcode port and pin, saving quite a few instructions, or whether that would be too ugly. Or whether I maybe missed something.
– Torsten Römer, Commented Oct 24 at 18:02
Depends on what your coding rules are. You could clone the project and make adjustments. You are already on GitHub, so cloning is just one click away. Whether the performance gain is needed is up to you to find out. If you want to use the code across different projects, then you don't really know at the bme68x driver level whether the performance is OK or not. The bme68x project has very low traffic, so keeping your clone up to date isn't a big headache (even without cloning you may want to keep an eye on the bme project for bug fixes).
– emacs drives me nuts, Commented Oct 25 at 13:42

emacs drives me nuts · Accepted Answer · 2025-10-24 14:15:18Z

There are too many unknowns to say anything specific. In particular, the code cannot be compiled. What can be said is that the code with the pointer is not only larger but also slower; for example, it contains a shift by a variable offset, which AVR (having only single-bit shift instructions) executes as a loop.
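As an illustration of the variable-shift cost, here is a minimal sketch of one possible way to avoid it while still passing a descriptor: store the precomputed bit mask instead of the pin number. This assumes you are free to choose the descriptor layout. The names IntfMask, csLow and csHigh are hypothetical and not part of the question's code or of the Bosch API.

```c
#include <stdint.h>

/* Hypothetical descriptor: holds the ready-made bit mask, not the pin number. */
typedef struct {
    volatile uint8_t *port;
    uint8_t mask;            /* (1u << pin), computed once when the descriptor is set up */
} IntfMask;

/* Pull chip select low: load, AND with the inverted mask, store. No shift needed. */
static inline void csLow(const IntfMask *intf) {
    *intf->port &= (uint8_t)~intf->mask;
}

/* Release chip select: load, OR with the mask, store. */
static inline void csHigh(const IntfMask *intf) {
    *intf->port |= intf->mask;
}
```

The shift then happens once, when the descriptor is initialized (typically with a constant pin, so the compiler can fold it), rather than on every transfer. The indirect access through the port pointer remains.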

If you only need one such interface, you can increase performance and still keep the code portable and modular by avoiding callbacks or indirection like in your example. To that end, you can use LTO (-flto etc.) and external functions like

// in usebme68.h
extern volatile uint8_t* usebme68_get_port_addr (void);

And then implement that in the application as

// in main.c
#include <avr/io.h>
#include "usebme68.h"
extern inline volatile uint8_t* usebme68_get_port_addr (void)
{
    return &PORTB;
}
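The driver side of this approach might look roughly like the following sketch. It is collapsed into a single file so that it builds on any host: fakePortB stands in for a real register such as PORTB, BME_CS_PIN is a hypothetical pin number, and transmit() is the same placeholder as in the question. In a real project the accessor would live in the application, the write function in the driver, and both would be built with -flto so the accessor call can be inlined away.

```c
#include <stdint.h>

/* Stand-in for an I/O register such as PORTB (assumption, for host builds). */
static volatile uint8_t fakePortB = 0x00;

#define BME_CS_PIN 4   /* hypothetical chip-select pin number */

/* Declaration as it would appear in usebme68.h. */
extern volatile uint8_t *usebme68_get_port_addr(void);

/* Application side: would return &PORTB on the real target. */
volatile uint8_t *usebme68_get_port_addr(void) {
    return &fakePortB;
}

/* Placeholder SPI transfer, as in the question. */
static uint8_t transmit(uint8_t data) {
    (void)data;
    return 0x00;
}

/* Driver side: keeps the API signature but ignores intfPtr. */
static uint8_t bme68xWrite(uint8_t reg, const uint8_t *data,
                           uint32_t len, void *intfPtr) {
    (void)intfPtr;                       /* port and pin are fixed at link time */
    volatile uint8_t *port = usebme68_get_port_addr();

    *port &= (uint8_t)~(1u << BME_CS_PIN);   /* chip select low */
    transmit(reg);
    for (uint32_t i = 0; i < len; i++) {
        transmit(data[i]);
    }
    *port |= (uint8_t)(1u << BME_CS_PIN);    /* chip select high */

    return 0;
}
```

Because the pin is a compile-time constant and the port address becomes known once the accessor is inlined, the compiler has the same information as in the hardcoded variant.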

Notice that GCC knows two flavours of extern inline:

1. The extern inline as of C99. When the function cannot be inlined for some reason (like taking the address of the inline function), the compiler will add an implementation of the function.
2. GCC's original meaning of extern inline: when the function cannot be inlined, the compiler does NOT emit an implementation. The implementation can be provided by, say, a library fallback. For this variant you'll have to add attribute gnu_inline, which is available when the built-in macro __GNUC_STDC_INLINE__ is defined. When __GNUC_STDC_INLINE__ is not defined, the semantics of inline is according to 2. (this is only the case for quite old GCC versions).

For shifts by a variable offset, there is __builtin_avr_mask1, which is available when __GNUC__ >= 15.
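A hedged sketch of how the built-in could be wrapped so the same source still builds with older compilers or on a host: the wrapper names bitMask and bitMaskInv are hypothetical, and the fallback branch is an ordinary shift. As far as I know, the built-in expects a constant first argument (such as 0x01 or 0xfe) and an offset in the range 0 to 7; check the avr-gcc documentation of your version before relying on it.

```c
#include <stdint.h>

/* Returns (1 << pin) for pin in 0..7. */
static inline uint8_t bitMask(uint8_t pin) {
#if defined(__AVR__) && defined(__GNUC__) && __GNUC__ >= 15
    /* avr-gcc 15+: avoids the generic variable-shift loop. */
    return __builtin_avr_mask1(0x01, pin);
#else
    /* Fallback: plain shift, with pin limited to 0..7. */
    return (uint8_t)(1u << (pin & 7u));
#endif
}

/* Returns ~(1 << pin) for pin in 0..7. */
static inline uint8_t bitMaskInv(uint8_t pin) {
#if defined(__AVR__) && defined(__GNUC__) && __GNUC__ >= 15
    return __builtin_avr_mask1(0xfe, pin);
#else
    return (uint8_t)~(1u << (pin & 7u));
#endif
}
```

With such wrappers, the line from the question could be written as *intf.port &= bitMaskInv(intf.pin); without changing the descriptor.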
