mamba create -n bench -c anaconda -c conda-forge ipython numpy "libblas=*=*mkl" "liblapack=*=*mkl" intel-openmp
=*=*openblas : uses OpenBLAS (the default)
=*=*mkl : uses Intel's Math Kernel Library (MKL)
=*=*accelerate : uses Apple's Accelerate framework with vecLib
=*=*blis : uses AMD's AOCL-BLIS
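After creating an environment with one of the pins above, it is worth confirming which BLAS backend NumPy actually linked against. A minimal check using NumPy's own build-info helper:

```python
import numpy as np

# Prints the BLAS/LAPACK libraries NumPy was built against
# (e.g. mkl, openblas, accelerate, blis) plus include/library dirs.
np.show_config()
```

If the output names a different backend than the one you pinned, the solver likely pulled in a conflicting build of libblas/liblapack.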
import numpy as np

def test(size):
    left = np.random.random((size, size))
    right = np.random.random((size, size))
    return np.sum(left @ right)

%timeit test(2048)
# AMD 5850U (single-channel memory)
# BLIS (prioritizes one core at max frequency)
## 365 ms ± 2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
# MKL (prioritizes multiple cores, physical only)
## 202 ms ± 6.42 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
# MKL, dual-channel memory
## 190 ms ± 5.66 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
# AMD 2950X (quad-channel memory)
# BLIS
## 649 ms ± 2.45 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
# MKL (again multi-core, physical only)
## 151 ms ± 2.14 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
# Mac Intel i7-8750H
# MKL
## 132 ms ± 2.57 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
mamba create -n py "python>=3.13" ipython numpy numba powerlaw pyspark matplotlib seaborn pandas networkx python-build "libblas=*=*mkl" "liblapack=*=*mkl" intel-openmp ruff pydantic jupyterlab scikit-image scikit-learn simdjson cython
[Figure: Matrix Multiplication (2048x2048) on Apple M1 (4+4)]
[Figure: Matrix Multiplication (2048x2048) on Mac Intel i7 (6)]
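The timing comments above note that BLIS favours one core while MKL spreads work across physical cores only. BLAS thread counts can also be pinned explicitly through environment variables, which must be set before NumPy is first imported because the BLAS library reads them once at load time. A sketch (the value 4 is an arbitrary example; `MKL_NUM_THREADS`, `OPENBLAS_NUM_THREADS`, and `BLIS_NUM_THREADS` are the backend-specific names):

```python
import os

# Set before the first `import numpy` — the linked BLAS reads these at load.
os.environ["MKL_NUM_THREADS"] = "4"       # Intel MKL
os.environ["OPENBLAS_NUM_THREADS"] = "4"  # OpenBLAS
os.environ["BLIS_NUM_THREADS"] = "4"      # AMD AOCL-BLIS

import numpy as np  # the BLAS behind NumPy now uses at most 4 threads
```

Varying this value while re-running the benchmark is a quick way to separate core-count effects from the memory-channel effects noted above.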