
I was trying to test the performance of numpy using this very simple script:

import numpy as np
import argparse
from timeit import default_timer as timer

p = argparse.ArgumentParser()
p.add_argument("--N", default=1000, type=int, help="size of matrix A")
p.add_argument(
    "--repeat", default=1000, type=int, help="perform computation x = A*b repeat times"
)
args = p.parse_args()

np.random.seed(0)
A = np.random.rand(args.N, args.N)
b = np.random.rand(args.N)
x = np.zeros(args.N)

ts = timer()
for i in range(args.repeat):
    x[:] = A.dot(b)
te = timer()

gbytes = 8.0 * (args.N**2 + args.N * 2) * 1e-9  # GB per matvec: read A (N^2) and b (N), write x (N), 8 bytes each
print("bandwidth: %.2f GB/s" % (gbytes * args.repeat / (te - ts)))
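As a sanity check on the traffic model used above (a quick sketch; the numbers follow directly from the formula in the script), the bytes moved per iteration for N = 8000 work out to roughly half a gigabyte:

```python
# Same model as the script: read A (N^2 doubles) and b (N doubles),
# write x (N doubles), 8 bytes per float64.
N = 8000
bytes_per_matvec = 8 * (N**2 + 2 * N)
print(f"{bytes_per_matvec * 1e-9:.3f} GB per iteration")  # -> 0.512 GB
```

So at 100 repeats the loop moves about 51 GB, and the printed figure is that volume divided by the elapsed time.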

The script creates a random dense matrix, performs the matrix-vector multiplication repeat times, and computes the average bandwidth of the operation, which I believe accounts for the memory reads, the computation, and the memory write. However, when I run this script on my laptop, the results vary significantly between runs:

~/toys/python ❯ python numpy_performance.py --N 8000 --repeat 100
bandwidth: 93.64 GB/s
~/toys/python ❯ python numpy_performance.py --N 8000 --repeat 100
bandwidth: 99.15 GB/s
~/toys/python ❯ python numpy_performance.py --N 8000 --repeat 100
bandwidth: 95.08 GB/s
~/toys/python ❯ python numpy_performance.py --N 8000 --repeat 100
bandwidth: 77.28 GB/s
~/toys/python ❯ python numpy_performance.py --N 8000 --repeat 100
bandwidth: 56.90 GB/s
~/toys/python ❯ python numpy_performance.py --N 8000 --repeat 100
bandwidth: 63.87 GB/s
~/toys/python ❯ python numpy_performance.py --N 8000 --repeat 100
bandwidth: 85.43 GB/s
~/toys/python ❯ python numpy_performance.py --N 8000 --repeat 100
bandwidth: 95.69 GB/s
~/toys/python ❯ python numpy_performance.py --N 8000 --repeat 100
bandwidth: 93.91 GB/s
~/toys/python ❯ python numpy_performance.py --N 8000 --repeat 100
bandwidth: 101.99 GB/s

Is this behavior expected? If so, how can it be explained? Thanks!

  • I cannot reproduce your problem on my Linux PC. This basically means it is due to your system/hardware, but we cannot really help you further without more information about your platform: the exact CPU reference, the OS governor, the DRAM reference and configuration, and possibly information about any running processes. A critical piece of information is also the time taken by the benchmark, since a throughput does not make much sense for very short operations. The amount of available memory also matters. Commented Mar 19, 2023 at 11:45
  • Note that a 93.64 GB/s throughput is suspiciously high for a laptop. Nearly all laptops (except maybe the Apple M1/M2-based ones) cannot reach more than ~70 GiB/s nowadays. This likely indicates that the benchmark is somehow biased (though it looks fine to me at first glance, unless you have a CPU with a huge cache). Commented Mar 19, 2023 at 11:46
  • Thanks for your reply @JérômeRichard. I'm actually running this on an Apple M1 Pro chip with 32 GB DDR5 RAM. I just tried running the script with a larger repeat count (100 -> 1000) so it takes ~O(10) seconds to finish, and the discrepancy seems to be reduced. Within 10 runs, the bandwidths fall between 62.35 GB/s and 74.20 GB/s; does this make more sense to you? Commented Mar 19, 2023 at 19:59
  • Ok, the results make sense on the M1. The M1 is a big-little processor, so you should check that the process is scheduled on the big cores; the little cores should be much slower for such a task. Besides, it takes some time to create threads: usually dozens of µs on a mainstream machine (possibly more on servers or Windows; IDK on a Mac with an M1). Also note that NumPy often uses OpenBLAS by default, but I am not sure it supports the M1 well (load imbalance can cause sub-optimal and unstable performance). AFAIK, Apple Accelerate should be a good alternative on the M1. Commented Mar 19, 2023 at 21:45
  • AFAIK, the M1 Pro should theoretically be able to reach 200 GB/s (at least 120 GB/s), so there is room for improvement, especially since the computation should be clearly memory-bound. In practice, besides using Apple Accelerate, it is pretty hard to do better (many libraries do not support big-little very well, nor ARM, and GPU-oriented computations often focus on CUDA, which is only for Nvidia GPUs). Commented Mar 19, 2023 at 21:59

1 Answer


There can be multiple reasons for the unstable results: the CPU frequency may not be stable because it is not pinned for your process, other processes may interfere with your runs, and thermal throttling can perturb the runs between cooling phases.

One thing you could do is make multiple runs and then average the results.
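A minimal sketch of that approach (repeating the whole timing loop several times and reporting min/median/max instead of a single number; the function and parameter names are my own, not from the question):

```python
import numpy as np
from timeit import default_timer as timer

def measure_bandwidth(N=2000, repeat=100, runs=5):
    """Time the matvec loop `runs` times and return (min, median, max)
    bandwidth in GB/s, which is more robust than a single measurement
    on a noisy system."""
    rng = np.random.default_rng(0)
    A = rng.random((N, N))
    b = rng.random(N)
    x = np.empty(N)
    gbytes = 8.0 * (N**2 + 2 * N) * 1e-9  # GB moved per matvec
    results = []
    for _ in range(runs):
        ts = timer()
        for _ in range(repeat):
            x[:] = A.dot(b)
        te = timer()
        results.append(gbytes * repeat / (te - ts))
    results.sort()
    return results[0], results[len(results) // 2], results[-1]

lo, med, hi = measure_bandwidth()
print(f"bandwidth GB/s: min={lo:.2f} median={med:.2f} max={hi:.2f}")
```

The median is usually a better summary than the mean here, since interference from other processes skews individual runs downward rather than symmetrically.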
