My single-threaded program allocates and initializes a volatile block of memory on an unspecified hardware architecture. It then writes into this block in a loop, using a stride equal to the cache line size (usually 64 bytes). Each write transfers either a single byte or an entire long (8 bytes).
To be clear, the total number of writes is fixed. Only the number of bytes per write can vary.
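To make the two cases concrete, here is a minimal sketch of the access pattern in C. The names `LINE_SIZE`, `make_block`, `write_bytes`, `write_longs` and `time_pass` are made up for illustration, and the 64-byte line size is an assumption, not something queried from the hardware.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

#define LINE_SIZE 64 /* assumed cache line size in bytes */

/* Allocate a block of `lines` cache lines and zero it so the pages are
   actually mapped before any timing starts. Returns NULL on failure. */
static unsigned char *make_block(size_t lines)
{
    assert(lines <= SIZE_MAX / LINE_SIZE); /* guard against size overflow */
    unsigned char *p = malloc(lines * LINE_SIZE);
    if (p != NULL)
        memset(p, 0, lines * LINE_SIZE);
    return p;
}

/* Case 1: one 1-byte store per cache line. */
static void write_bytes(volatile unsigned char *buf, size_t lines)
{
    for (size_t i = 0; i < lines; i++)
        buf[i * LINE_SIZE] = (unsigned char)i;
}

/* Case 2: one 8-byte store per cache line. malloc's alignment is enough
   for a uint64_t at every multiple of LINE_SIZE. */
static void write_longs(volatile unsigned char *buf, size_t lines)
{
    for (size_t i = 0; i < lines; i++)
        *(volatile uint64_t *)(buf + i * LINE_SIZE) = (uint64_t)i;
}

/* Time a single pass of either writer, in seconds of CPU time. */
static double time_pass(void (*fn)(volatile unsigned char *, size_t),
                        volatile unsigned char *buf, size_t lines)
{
    clock_t start = clock();
    fn(buf, lines);
    return (double)(clock() - start) / CLOCKS_PER_SEC;
}
```

Both loops perform exactly `lines` stores; only the store width differs. For a comparison, I would call `time_pass` with each writer on a block much larger than the last-level cache and repeat the measurement several times.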
There are no reads, no other threads, and nothing else running. Should I expect a performance difference between these two cases?
My expectation is that there will be none. I believe this depends on the transfer granularity of the memory bus. If the bus has a minimum transfer size of at least 64 bits, then both cases map to the same physical transfer. Otherwise, there could be a small difference, since the program is clearly bound by memory throughput. I believe virtually all common computing hardware has a bus width of at least 64 bits.