| 1 | +# sse2neon |
| 2 | + |
| 3 | + |
| 4 | +A C/C++ header file that converts Intel SSE intrinsics to Arm/Aarch64 NEON intrinsics. |
| 5 | + |
| 6 | +## Introduction |
| 7 | + |
| 8 | +`sse2neon` is a translator of Intel SSE (Streaming SIMD Extensions) intrinsics |
| 9 | +to [Arm NEON](https://developer.arm.com/architectures/instruction-sets/simd-isas/neon), |
| 10 | +shortening the time needed to get a program working on Arm so that it can then be |
| 11 | +profiled to identify hot paths in the code. |
| 12 | +The header file `sse2neon.h` contains several of the functions provided by Intel |
| 13 | +intrinsic headers such as `<xmmintrin.h>`, implemented with NEON-based counterparts |
| 14 | +that reproduce the exact semantics of the intrinsics. |
| 15 | + |
| 16 | +## Mapping and Coverage |
| 17 | + |
| 18 | +Header file | Extension | |
| 19 | +---|---| |
| 20 | +`<mmintrin.h>` | MMX | |
| 21 | +`<xmmintrin.h>` | SSE | |
| 22 | +`<emmintrin.h>` | SSE2 | |
| 23 | +`<pmmintrin.h>` | SSE3 | |
| 24 | +`<tmmintrin.h>` | SSSE3 | |
| 25 | +`<smmintrin.h>` | SSE4.1 | |
| 26 | +`<nmmintrin.h>` | SSE4.2 | |
| 27 | +`<wmmintrin.h>` | AES | |
| 28 | + |
| 29 | +`sse2neon` aims to support the SSE, SSE2, SSE3, SSSE3, SSE4.1, SSE4.2 and AES extensions. |
| 30 | + |
| 31 | +The goal is to deliver NEON-equivalent implementations of all widely used SSE intrinsics. |
| 32 | +Please be aware that some SSE intrinsics map directly to a single |
| 33 | +NEON intrinsic, whereas others lack a 1-to-1 mapping, which means their |
| 34 | +equivalents are implemented using several NEON intrinsics. |
| 35 | + |
| 36 | +For example, SSE intrinsic `_mm_loadu_si128` has a direct NEON mapping (`vld1q_s32`), |
| 37 | +but SSE intrinsic `_mm_maddubs_epi16` has to be implemented with 13+ NEON instructions. |
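To illustrate why `_mm_maddubs_epi16` has no single NEON counterpart, here is a minimal scalar sketch of its documented behavior: unsigned bytes from the first operand are multiplied by signed bytes from the second, and adjacent products are added with signed 16-bit saturation. The names `saturate_i16` and `maddubs_epi16_ref` are hypothetical helpers for illustration only; this is not the code used in `sse2neon.h`.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Clamp a 32-bit intermediate value to the signed 16-bit range. */
static int16_t saturate_i16(int32_t x)
{
    if (x > INT16_MAX) return INT16_MAX;
    if (x < INT16_MIN) return INT16_MIN;
    return (int16_t) x;
}

/* Hypothetical scalar reference for _mm_maddubs_epi16 (illustration only):
 * a holds 16 unsigned bytes, b holds 16 signed bytes, dst receives 8 lanes. */
void maddubs_epi16_ref(const uint8_t a[16], const int8_t b[16], int16_t dst[8])
{
    assert(a != NULL && b != NULL && dst != NULL);
    for (int i = 0; i < 8; i++) {
        /* Widen before multiplying so the products cannot overflow. */
        int32_t lo = (int32_t) a[2 * i] * (int32_t) b[2 * i];
        int32_t hi = (int32_t) a[2 * i + 1] * (int32_t) b[2 * i + 1];
        /* Horizontal add of the adjacent pair, then saturate. */
        dst[i] = saturate_i16(lo + hi);
    }
}
```

The mixed signedness, the widening multiply, the pairwise add, and the saturation are separate steps, which is roughly why a vectorized NEON version needs a sequence of instructions rather than one.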
| 38 | + |
| 39 | +## Usage |
| 40 | + |
| 41 | +- Put the file `sse2neon.h` into your source code directory. |
| 42 | + |
| 43 | +- Locate the following SSE header files included in the code: |
| 44 | +```C |
| 45 | +#include <xmmintrin.h> |
| 46 | +#include <emmintrin.h> |
| 47 | +``` |
| 48 | + `{p,t,s,n,w}mmintrin.h` should be replaceable as well, although coverage of these extensions might be limited. |
| 49 | + |
| 50 | +- Replace them with: |
| 51 | +```C |
| 52 | +#include "sse2neon.h" |
| 53 | +``` |
| 54 | + |
| 55 | +- Explicitly specify platform-specific options to gcc/clang compilers. |
| 56 | + * On ARMv8-A targets, you should specify the following compiler option: (Remove `crypto` and/or `crc` if your architecture does not support cryptographic and/or CRC32 extensions) |
| 57 | + ```shell |
| 58 | + -march=armv8-a+fp+simd+crypto+crc |
| 59 | + ``` |
| 60 | + * On ARMv7-A targets, you need to append the following compiler option: |
| 61 | + ```shell |
| 62 | + -mfpu=neon |
| 63 | + ``` |
| 64 | + |
| 65 | +## Compile-time Configurations |
| 66 | + |
| 67 | +Considering the balance between correctness and performance, `sse2neon` recognizes the following compile-time configurations: |
| 68 | +* `SSE2NEON_PRECISE_MINMAX`: Enable precise implementation of `_mm_min_ps` and `_mm_max_ps`. If you need consistent results such as NaN special cases, enable it. |
| 69 | +* `SSE2NEON_PRECISE_DIV`: Enable precise implementation of `_mm_rcp_ps` and `_mm_div_ps` by additional Newton-Raphson iteration for accuracy. |
| 70 | +* `SSE2NEON_PRECISE_SQRT`: Enable precise implementation of `_mm_sqrt_ps` and `_mm_rsqrt_ps` by additional Newton-Raphson iteration for accuracy. |
| 71 | + |
| 72 | +The above are turned off by default, and you should define the corresponding macro(s) as `1` before including `sse2neon.h` if you need the precise implementations. |
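As a minimal sketch of opting in, the macros are defined as `1` ahead of the include. The architecture guard and the helper `min4` are assumptions added so that the snippet also compiles on x86, where the macros are simply unused.

```c
#include <assert.h>
#include <stddef.h>

/* Opt in to the precise variants; these must appear before the include. */
#define SSE2NEON_PRECISE_MINMAX 1
#define SSE2NEON_PRECISE_DIV 1

#if defined(__ARM_NEON) || defined(__aarch64__)
#include "sse2neon.h"   /* the macros above take effect here */
#else
#include <xmmintrin.h>  /* x86: native intrinsics, macros have no effect */
#endif

/* Hypothetical helper: lane-wise minimum of two groups of four floats. */
void min4(const float *a, const float *b, float *out)
{
    assert(a != NULL && b != NULL && out != NULL);
    _mm_storeu_ps(out, _mm_min_ps(_mm_loadu_ps(a), _mm_loadu_ps(b)));
}
```

Passing `-DSSE2NEON_PRECISE_MINMAX=1` on the compiler command line should have the same effect as the `#define`, since the macro is then defined before any header is processed.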
| 73 | + |
| 74 | +## Run Built-in Test Suite |
| 75 | + |
| 76 | +`sse2neon` provides a unified interface for developing test cases. These test |
| 77 | +cases are located in the `tests` directory, and the input data is specified at |
| 78 | +runtime. Use the following command to run the test cases: |
| 79 | +```shell |
| 80 | +$ make check |
| 81 | +``` |
| 82 | + |
| 83 | +You can specify a GNU toolchain for cross compilation as well. |
| 84 | +[QEMU](https://www.qemu.org/) should be installed in advance. |
| 85 | +```shell |
| 86 | +$ make CROSS_COMPILE=aarch64-linux-gnu- check # ARMv8-A |
| 87 | +``` |
| 88 | +or |
| 89 | +```shell |
| 90 | +$ make CROSS_COMPILE=arm-linux-gnueabihf- check # ARMv7-A |
| 91 | +``` |
| 92 | + |
| 93 | +Check the details via [Test Suite for SSE2NEON](tests/README.md). |
| 94 | + |
| 95 | +## Adoptions |
| 96 | +Here is a partial list of open source projects that have adopted `sse2neon` for Arm/Aarch64 support. |
| 97 | +* [aether-game-utils](https://github.com/johnhues/aether-game-utils) is a collection of cross platform utilities for quickly creating small game prototypes in C++. |
| 98 | +* [Apache Impala](https://impala.apache.org/) provides lightning-fast, distributed SQL queries over petabytes of data stored in Apache Hadoop clusters. |
| 99 | +* [Apache Kudu](https://kudu.apache.org/) completes Hadoop's storage layer to enable fast analytics on fast data. |
| 100 | +* [ART](https://github.com/dinosaure/art) is an implementation in OCaml of [Adaptive Radix Tree](https://db.in.tum.de/~leis/papers/ART.pdf) (ART). |
| 101 | +* [Async](https://github.com/romange/async) is a set of C++ primitives that allows efficient and rapid development in C++17 on GNU/Linux systems. |
| 102 | +* [Blender](https://www.blender.org/) is the free and open source 3D creation suite, supporting the entirety of the 3D pipeline. |
| 103 | +* [Boo](https://github.com/AxioDL/boo) is a cross-platform windowing and event manager similar to SDL or SFML, with additional 3D rendering functionality. |
| 104 | +* [CARTA](https://github.com/CARTAvis/carta-backend) is a new visualization tool designed for viewing radio astronomy images in CASA, FITS, MIRIAD, and HDF5 formats (using the IDIA custom schema for HDF5). |
| 105 | +* [Catcoon](https://github.com/i-evi/catcoon) is a [feedforward neural network](https://en.wikipedia.org/wiki/Feedforward_neural_network) implementation in C. |
| 106 | +* [dab-cmdline](https://github.com/JvanKatwijk/dab-cmdline) provides entries for the functionality to handle Digital audio broadcasting (DAB)/DAB+ through some simple calls. |
| 107 | +* [EDGE](https://github.com/3dfxdev/EDGE) is an advanced OpenGL source port spawned from the DOOM engine, with focus on easy development and expansion for modders and end-users. |
| 108 | +* [Embree](https://github.com/embree/embree) is a collection of high-performance ray tracing kernels. Its target users are graphics application engineers who want to improve the performance of their photo-realistic rendering application by leveraging Embree's performance-optimized ray tracing kernels. |
| 109 | +* [emp-tool](https://github.com/emp-toolkit/emp-tool) aims to provide a benchmark for secure computation and to allow other researchers to experiment with and extend it. |
| 110 | +* [FoundationDB](https://www.foundationdb.org) is a distributed database designed to handle large volumes of structured data across clusters of commodity servers. |
| 111 | +* [iqtree_arm_neon](https://github.com/joshlvmh/iqtree_arm_neon) is the Arm NEON port of [IQ-TREE](http://www.iqtree.org/), a fast and effective stochastic algorithm to infer phylogenetic trees by maximum likelihood. |
| 112 | +* [kram](https://github.com/alecazam/kram) is a wrapper to several popular encoders to and from PNG/[KTX](https://www.khronos.org/opengles/sdk/tools/KTX/file_format_spec/) files with [LDR/HDR and BC/ASTC/ETC2](https://developer.arm.com/solutions/graphics-and-gaming/developer-guides/learn-the-basics/adaptive-scalable-texture-compression/single-page). |
| 113 | +* [libscapi](https://github.com/cryptobiu/libscapi) stands for the "Secure Computation API", providing reliable, efficient, and highly flexible cryptographic infrastructure. |
| 114 | +* [libmatoya](https://github.com/matoya/libmatoya) is a cross-platform application development library, providing various features such as common cryptography tasks. |
| 115 | +* [Madronalib](https://github.com/madronalabs/madronalib) enables efficient audio DSP on SIMD processors with readable and brief C++ code. |
| 116 | +* [minimap2](https://github.com/lh3/minimap2) is a versatile sequence alignment program that aligns DNA or mRNA sequences against a large reference database. |
| 117 | +* [MMseqs2](https://github.com/soedinglab/MMseqs2) (Many-against-Many sequence searching) is a software suite to search and cluster huge protein and nucleotide sequence sets. |
| 118 | +* [MRIcroGL](https://github.com/rordenlab/MRIcroGL) is a cross-platform tool for viewing NIfTI, DICOM, MGH, MHD, NRRD, AFNI format medical images. |
| 119 | +* [N2](https://github.com/oddconcepts/n2o) is an approximate nearest neighborhoods algorithm library written in C++, providing a much faster search speed than other implementations when modeling large datasets. |
| 120 | +* [niimath](https://github.com/rordenlab/niimath) is a general image calculator with superior performance. |
| 121 | +* [OBS Studio](https://github.com/obsproject/obs-studio) is software designed for capturing, compositing, encoding, recording, and streaming video content, efficiently. |
| 122 | +* [OGRE](https://github.com/OGRECave/ogre) is a scene-oriented, flexible 3D engine written in C++ designed to make it easier and more intuitive for developers to produce games and demos utilising 3D hardware. |
| 123 | +* [OpenXRay](https://github.com/OpenXRay/xray-16) is an improved version of the X-Ray engine, used in world famous S.T.A.L.K.E.R. game series by GSC Game World. |
| 124 | +* [parallel-n64](https://github.com/libretro/parallel-n64) is an optimized/rewritten Nintendo 64 emulator made specifically for [Libretro](https://www.libretro.com/). |
| 125 | +* [PFFFT](https://github.com/marton78/pffft) does 1D Fast Fourier Transforms, of single precision real and complex vectors. |
| 126 | +* [PlutoSDR Firmware](https://github.com/seanstone/plutosdr-fw) is the customized firmware for the [PlutoSDR](https://wiki.analog.com/university/tools/pluto) that can be used to introduce fundamentals of Software Defined Radio (SDR) or Radio Frequency (RF) or Communications as advanced topics in electrical engineering in a self or instructor-led setting. |
| 127 | +* [Pygame](https://www.pygame.org) is cross-platform and designed to make it easy to write multimedia software, such as games, in Python. |
| 128 | +* [simd_utils](https://github.com/JishinMaster/simd_utils) is a header-only library implementing common mathematical functions using SIMD intrinsics. |
| 129 | +* [SMhasher](https://github.com/rurban/smhasher) provides comprehensive Hash function quality and speed tests. |
| 130 | +* [Spack](https://github.com/spack/spack) is a multi-platform package manager that builds and installs multiple versions and configurations of software. |
| 131 | +* [srsLTE](https://github.com/srsLTE/srsLTE) is an open source SDR LTE software suite. |
| 132 | +* [Surge](https://github.com/surge-synthesizer/surge) is an open source digital synthesizer. |
| 133 | +* [XMRig](https://github.com/xmrig/xmrig) is an open source CPU miner for [Monero](https://web.getmonero.org/) cryptocurrency. |
| 134 | + |
| 135 | +## Related Projects |
| 136 | +* [SIMDe](https://github.com/simd-everywhere/simde): fast and portable implementations of SIMD |
| 137 | + intrinsics on hardware which doesn't natively support them, such as calling SSE functions on ARM. |
| 138 | +* [CatBoost's sse2neon](https://github.com/catboost/catboost/blob/master/library/cpp/sse/sse2neon.h) |
| 139 | +* [ARM\_NEON\_2\_x86\_SSE](https://github.com/intel/ARM_NEON_2_x86_SSE) |
| 140 | +* [AvxToNeon](https://github.com/kunpengcompute/AvxToNeon) |
| 141 | +* [POWER/PowerPC support for GCC](https://github.com/gcc-mirror/gcc/blob/master/gcc/config/rs6000) contains a series of headers simplifying porting x86_64 code that |
| 142 | + makes explicit use of Intel intrinsics to powerpc64le (pure little-endian mode that has been introduced with the [POWER8](https://en.wikipedia.org/wiki/POWER8)). |
| 143 | + - implementation: [xmmintrin.h](https://github.com/gcc-mirror/gcc/blob/master/gcc/config/rs6000/xmmintrin.h), [emmintrin.h](https://github.com/gcc-mirror/gcc/blob/master/gcc/config/rs6000/emmintrin.h), [pmmintrin.h](https://github.com/gcc-mirror/gcc/blob/master/gcc/config/rs6000/pmmintrin.h), [tmmintrin.h](https://github.com/gcc-mirror/gcc/blob/master/gcc/config/rs6000/tmmintrin.h), [smmintrin.h](https://github.com/gcc-mirror/gcc/blob/master/gcc/config/rs6000/smmintrin.h) |
| 144 | + |
| 145 | +## Reference |
| 146 | +* [Intel Intrinsics Guide](https://software.intel.com/sites/landingpage/IntrinsicsGuide/) |
| 147 | +* [Arm Neon Intrinsics Reference](https://developer.arm.com/architectures/instruction-sets/simd-isas/neon/intrinsics) |
| 148 | +* [Neon Programmer's Guide for Armv8-A](https://developer.arm.com/architectures/instruction-sets/simd-isas/neon/neon-programmers-guide-for-armv8-a) |
| 149 | +* [NEON Programmer's Guide](https://static.docs.arm.com/den0018/a/DEN0018A_neon_programmers_guide_en.pdf) |
| 150 | +* [qemu/target/i386/ops_sse.h](https://github.com/qemu/qemu/blob/master/target/i386/ops_sse.h): Comprehensive SSE instruction emulation in C. Ideal for semantic checks. |
| 151 | + |
| 152 | +## Licensing |
| 153 | + |
| 154 | +`sse2neon` is freely redistributable under the MIT License. |