mamba create -n bench -c anaconda -c conda-forge ipython numpy "libblas=*=*mkl" "liblapack=*=*mkl" intel-openmp
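
Quick sanity check that the env really resolved to MKL (and not OpenBLAS or BLIS) — a minimal sketch using only what ships with NumPy; the BLAS/LAPACK entries printed should mention mkl:

import numpy as np

print(np.__version__)
np.show_config()   # BLAS/LAPACK sections should point at mkl, not openblas/blis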

TODO:

import numpy as np

def test(size):
    # Multiply two random (size x size) matrices and reduce to a scalar
    # so the work can't be optimized away.
    left = np.random.random((size, size))
    right = np.random.random((size, size))
    return np.sum(left @ right)

%timeit test(2048)

# AMD 5850U (note: single-channel memory)
# BLIS (prioritizes one core at max frequency)
## 365 ms ± 2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
# MKL (prioritizes multiple cores, but physical cores only)
## 202 ms ± 6.42 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
# MKL, but with dual-channel memory
## 190 ms ± 5.66 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

# AMD 2950X (quad-channel memory)
# BLIS
## 649 ms ± 2.45 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
# MKL (also multi-core, physical cores only)
## 151 ms ± 2.14 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

# Mac Intel i7-8750H
# MKL
## 132 ms ± 2.57 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
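
The BLIS vs. MKL gap above is mostly a threading difference. To separate thread count from memory bandwidth on one machine, the BLAS pool can be capped explicitly — a sketch assuming threadpoolctl is installed on top of the env (it is not in the specs here), that test() from the block above is already defined, and with 8 as a placeholder for the physical core count:

import timeit

from threadpoolctl import threadpool_limits

# Cap the BLAS thread pool to separate "how many cores" from "which library".
for n_threads in (1, 8):
    with threadpool_limits(limits=n_threads, user_api="blas"):
        best = min(timeit.repeat("test(2048)", globals=globals(), number=1, repeat=7))
        print(f"{n_threads} BLAS thread(s): {best * 1000:.0f} ms")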

mamba create -n py "python>=3.13" ipython numpy numba powerlaw pyspark matplotlib seaborn pandas networkx python-build "libblas=*=*mkl" "liblapack=*=*mkl" intel-openmp ruff pydantic jupyterlab scikit-image scikit-learn simdjson cython
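
To reproduce the numbers above outside IPython (e.g. as a plain script in the py env), %timeit can be approximated with the standard library — a sketch that prints the same "mean ± std. dev. of 7 runs" format used in the results:

import statistics
import timeit

import numpy as np


def test(size):
    left = np.random.random((size, size))
    right = np.random.random((size, size))
    return np.sum(left @ right)


if __name__ == "__main__":
    runs = timeit.repeat("test(2048)", globals=globals(), number=1, repeat=7)
    mean_ms = statistics.mean(runs) * 1000
    std_ms = statistics.stdev(runs) * 1000
    print(f"{mean_ms:.0f} ms ± {std_ms:.2f} ms per loop (mean ± std. dev. of 7 runs)")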

Matrix Multiplication (2048x2048) on Apple M1 (4+4)

Matrix Multiplication (2048x2048) on Mac Intel i7 (6)
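
Assuming the two titles above are for bar charts of these timings, a minimal matplotlib sketch (matplotlib/seaborn are in the py env); the values are just the means recorded above, and the Apple M1 / Mac Intel i7 bars would be added once those runs are recorded:

import matplotlib.pyplot as plt

# Mean timings (ms) copied from the results above; extend with the
# Apple M1 and remaining Mac Intel i7 runs once measured.
results = {
    "5850U / BLIS": 365,
    "5850U / MKL": 202,
    "2950X / BLIS": 649,
    "2950X / MKL": 151,
    "i7-8750H / MKL": 132,
}

fig, ax = plt.subplots(figsize=(7, 3))
ax.bar(list(results), list(results.values()))
ax.set_ylabel("time per call (ms)")
ax.set_title("Matrix multiplication, 2048x2048 (np.sum(left @ right))")
fig.tight_layout()
plt.show()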