The core library of the Tenzing project. tenzing-core provides facilities for interacting with CUDA + MPI programs as sequential decision problems. This facilitates optimizing CUDA + MPI programs using sequential decision strategies.
Two solvers are available:
- tenzing-mcts: Uses Monte-Carlo tree search
- tenzing-dfs: Uses depth-first search
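To make the "sequential decision problem" framing concrete, here is a minimal sketch of a depth-first enumeration over operation orders. It is a toy model, not the tenzing-core or tenzing-dfs API: `Op`, `enumerate`, and `all_orders` are hypothetical names, and the real solvers may also decide things such as resource assignment. At each step the decision is which ready operation to issue next.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical toy model: each op lists the indices of the ops it depends on.
struct Op {
  std::string name;
  std::vector<std::size_t> deps;
};

// Depth-first enumeration of every complete order that respects dependencies.
// State: the ops issued so far. Decision: which ready op to issue next.
inline void enumerate(const std::vector<Op> &ops, std::vector<bool> &done,
                      std::vector<std::size_t> &order,
                      std::vector<std::vector<std::size_t>> &out) {
  assert(done.size() == ops.size());
  if (order.size() == ops.size()) {
    out.push_back(order); // a complete order is a leaf of the search tree
    return;
  }
  for (std::size_t i = 0; i < ops.size(); ++i) {
    if (done[i]) continue;
    bool ready = true;
    for (std::size_t d : ops[i].deps) ready = ready && done[d];
    if (!ready) continue;
    done[i] = true; // take the decision
    order.push_back(i);
    enumerate(ops, done, order, out);
    order.pop_back(); // undo it and try the next candidate
    done[i] = false;
  }
}

inline std::vector<std::vector<std::size_t>> all_orders(const std::vector<Op> &ops) {
  std::vector<bool> done(ops.size(), false);
  std::vector<std::size_t> order;
  std::vector<std::vector<std::size_t>> out;
  enumerate(ops, done, order, out);
  return out;
}
```

A Monte-Carlo tree search would explore the same decision tree, but sample promising branches instead of visiting all of them.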
On a supported platform:
source load-env.sh
In any case:
mkdir build && cd build
cmake .. -DCMAKE_CUDA_ARCHITECTURES=70
make
Tests are split into two locations:
- unit tests may be defined in source files
- tests with a more "integration" flavor are in `test/`
To run tests, you can do
- `make test`
- `ctest`
- `tenzing-all`
  - `-ltc`: list test cases
  - `-tc="a,b"`: only run test cases named `a` and `b`
Defining unit tests in source files creates some CMake complexity, as test functions present in static libraries will not be linked into the resulting test binary. Therefore, we use a CMake object library to generate the test binary, and then generate a static library from the object library. Object library properties do not get propagated properly (or at all), so we have to redefine what needs to be linked, included, etc.
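The linking issue comes from how self-registering tests typically work. The sketch below is a hypothetical mini registry, not the test framework used by tenzing-core: `TestCase`, `registry`, `Registrar`, and `run_all` are illustrative names. A test registers itself through a static object that nothing else refers to. A linker usually pulls an object file out of a static library only when it resolves an undefined symbol, so such an object file can be skipped and its tests silently disappear.

```cpp
#include <cassert>
#include <functional>
#include <string>
#include <utility>
#include <vector>

// Hypothetical mini registry, standing in for what a unit-test framework does.
struct TestCase {
  std::string name;
  std::function<bool()> fn;
};

inline std::vector<TestCase> &registry() {
  static std::vector<TestCase> tests; // constructed on first use
  return tests;
}

// Constructing a Registrar adds one test to the registry.
struct Registrar {
  Registrar(std::string name, std::function<bool()> fn) {
    assert(static_cast<bool>(fn));
    registry().push_back({std::move(name), std::move(fn)});
  }
};

// In a real project this line would sit next to the code under test.
// Nothing else refers to `reg_add`, which is why a linker may skip the
// enclosing object file when it comes from a static library.
static Registrar reg_add("add", [] { return 1 + 1 == 2; });

// Runs every registered test and returns the number of failures.
inline int run_all() {
  int failures = 0;
  for (const auto &t : registry()) {
    if (!t.fn()) ++failures;
  }
  return failures;
}
```

Linking the object files directly, as an object library does, sidesteps this because every object file ends up in the binary whether or not anything refers to it.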
tenzing-core has been tested on the following platforms:
- NERSC perlmutter: g++ 10.3 / nvcc 11.4 / Cray MPICH 8.1.13
- Sandia vortex (similar to ORNL Lassen and OLCF Summit): g++ 7.5.0 / nvcc 10.1 / IBM Spectrum MPI
- Sandia ascicgpu
- Visit the API documentation in docs/api.md
- ascicgpu system documentation in docs/ascicgpu.md
- vortex system documentation in docs/vortex.md
- perlmutter system documentation in docs/perlmutter.md
- python bindings (with pybind11)
See CONTRIBUTING.md for contribution guidelines.
- enable / disable CUDA / MPI
- isolate Ser/Des
- isolate platform assignments
- a `BoundOp` cannot produce the `std::shared_ptr<OpBase>` of its unbound self, only `OpBase`
  - can't ask an `std::shared_ptr<BoundOp>` for `std::shared_ptr<OpBase>`
  - maybe `std::enable_shared_from_this`?
- special status of `Start` and `End` is a bit clumsy.
  - maybe there should be a `StartEnd : BoundOp` that they both are instead of separate classes
  - in the algs they're probably treated the same (always synced, etc)
- `Platform` is a clumsy abstraction, since it also tracks resources that are only valid for a single order
  - e.g., each order requires a certain number of events, which can be reused for the next order
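As a rough illustration of the shared-pointer idea in the first item above, the sketch below shows how `std::enable_shared_from_this` lets code recover an owning pointer from a plain reference. The classes are hypothetical stand-ins, not the real tenzing-core hierarchy, and `KernelOp` and `unbound_ptr` are invented for the example. The stored `std::shared_ptr` member only keeps the toy self-contained; the point is the conversion in `unbound_ptr`.

```cpp
#include <cassert>
#include <memory>
#include <utility>

// Hypothetical stand-ins for illustration; the real classes may differ.
struct OpBase : std::enable_shared_from_this<OpBase> {
  virtual ~OpBase() = default;
};

// Hypothetical concrete unbound op.
struct KernelOp : OpBase {};

struct BoundOp : OpBase {
  explicit BoundOp(std::shared_ptr<OpBase> op) : op_(std::move(op)) {
    assert(op_ != nullptr);
  }

  // The situation described in the list: callers only get a reference.
  OpBase &unbound() { return *op_; }

  // Sketch of the workaround: recover the owning pointer from that reference.
  // Only valid if the unbound op is already owned by a std::shared_ptr;
  // otherwise shared_from_this() throws std::bad_weak_ptr (since C++17).
  std::shared_ptr<OpBase> unbound_ptr() { return unbound().shared_from_this(); }

private:
  std::shared_ptr<OpBase> op_;
};
```

The returned pointer shares ownership with the original one, so no second control block is created. The cost is that every op must then be created through `std::make_shared` or an equivalent.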
Please see NOTICE.md for copyright and license information.