Closed
Hello,
I have a piece of code that calls dtrmv with very small dimensions (N < 10) but very many times (~1e6) to compute the normal cumulative distribution function from the inverse Cholesky factor of its correlation matrix (OpenTURNS).
OpenBLAS seems to incur a penalty there, as it tries to start as many threads as possible (40 hyper-threads on my machine).
Consider the following reproducer:
#include <cblas.h>
#include <stdio.h>

int main()
{
    double A[25] = { 1.1, 2.0,   1.0, -3.0, 4.0,
                    -1.4, 2.0,   2.0,  3.0, 0.0,
                     5.4, 3.45, -5.9,  0.0, 0.0,
                     7.1, 4.3,   0.0,  0.0, 0.0,
                    -8.2, 0.0,   0.0,  0.0, 0.0};
    const int N = 5;
    double X[5] = {1.0, 2.0, 1.0, -3.0, 4.0};

    for (unsigned int i = 0; i < 1000000; ++i)
    {
        cblas_dtrmv(CblasRowMajor, CblasLower, CblasNoTrans, CblasUnit, N, A, N, X, 1);
    }

    for (int i = 0; i < N; i++)
        printf("%g ", X[i]);
    printf("\n");
    return 0;
}
With the default thread count (40) it takes ~15s but only ~0.1s with OMP_NUM_THREADS=1.
It looks like this would need something similar to what was done in #4585.
This is OpenBLAS 0.3.29 from Fedora Rawhide (with FlexiBLAS).
I also tried 0.3.30 from Arch Linux.
/cc @martin-frbg