dtrmv performance penalty with small N #5402

@jschueller

hello,

I have a piece of code that calls dtrmv in very small dimensions (N < 10) but many times (~1e6) to compute the Normal cumulative distribution function from the inverse Cholesky factor of its correlation matrix (in openturns).
OpenBLAS seems to incur a penalty there, as it tries to start as many threads as possible (40 hyperthreads on my machine).

Consider the following reproducer:

#include <cblas.h>
#include <stdio.h>

int main()
{
  double A[25] = {1.1,2.0,1.0,-3.0,4.0,
                  -1.4,2.0,2.0,3.0, 0.0,
                  5.4, 3.45, -5.9, 0.0, 0.0,
                  7.1, 4.3, 0.0, 0.0, 0.0,
                  -8.2, 0.0, 0.0, 0.0, 0.0};
  const int N = 5;
  double X[5] = {1.0,2.0,1.0,-3.0,4.0};
  for (unsigned int i = 0; i < 1000000; ++ i)
  {
    cblas_dtrmv(CblasRowMajor, CblasLower, CblasNoTrans, CblasUnit, N, A, N, X, 1);
  }
  for(int i=0; i<N; i++)
    printf("%g ", X[i]);
  
  printf("\n");
  return 0;
}

With the default thread count (40) it takes ~15s but only ~0.1s with OMP_NUM_THREADS=1.

It looks like this would need something similar to what was done in #4585.

This is OpenBLAS 0.3.29 from Fedora Rawhide (with FlexiBLAS).

I also tried 0.3.30 from Arch Linux.
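
As a caller-side stopgap for the small-N case described above, one option is to bypass BLAS entirely for the exact variant used in the reproducer (row-major, lower triangular, no transpose, unit diagonal). The sketch below is only an illustration: `small_dtrmv_lower_unit` is a hypothetical helper, not an OpenBLAS function, and it is not what the library does internally.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not part of OpenBLAS): computes x := A*x in place
   for a row-major, lower-triangular, unit-diagonal, non-transposed A.
   Only the entries strictly below the diagonal of A are read. */
static void small_dtrmv_lower_unit(int n, const double *a, int lda, double *x)
{
  assert(n >= 0 && lda >= n);
  /* Walk rows bottom-up so that x[j] for j < i still holds its input value. */
  for (int i = n - 1; i > 0; --i)
  {
    const double *row = a + (size_t)i * (size_t)lda;
    double sum = x[i]; /* unit diagonal: A[i][i] is taken as 1 */
    for (int j = 0; j < i; ++j)
      sum += row[j] * x[j];
    x[i] = sum;
  }
  /* x[0] is unchanged: row 0 contributes only its unit diagonal. */
}
```

In the reproducer, the call `cblas_dtrmv(CblasRowMajor, CblasLower, CblasNoTrans, CblasUnit, N, A, N, X, 1)` could be swapped for `small_dtrmv_lower_unit(N, A, N, X)` when N is below some cutoff. The cutoff is an assumption that would have to be tuned per machine, and this only sidesteps the threading overhead rather than fixing it in the library.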

/cc @martin-frbg
