Simultaneous multithreading – Architecture and technical overview
2.1.3 Simultaneous multithreading
Each core of the Power10 processor supports multiple hardware threads that represent independent execution contexts. If only one hardware thread is used, the processor core runs in single-threaded (ST) mode.
If more than one hardware thread is active, the processor runs in simultaneous multithreading (SMT) mode. In addition to ST mode, the Power10 processor supports the following SMT modes:

- SMT2: Two hardware threads active
- SMT4: Four hardware threads active
- SMT8: Eight hardware threads active
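The active SMT mode is set at the operating-system level. As an illustration, the following commands query and change the mode (assuming a Linux on Power partition with the powerpc-utils package installed, or an AIX partition for smtctl):

```shell
# Linux on Power (powerpc-utils): show the current SMT mode
ppc64_cpu --smt

# Switch the partition to SMT4 (requires root)
sudo ppc64_cpu --smt=4

# AIX equivalent: query the current setting, then set SMT8
smtctl
smtctl -t 8
```

Lowering the SMT mode dedicates more core resources to each remaining hardware thread, which can benefit latency-sensitive, single-threaded workloads.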
SMT enables a single physical processor core to simultaneously dispatch instructions from more than one hardware thread context. Computational workloads can use the processor core’s execution units with a higher degree of parallelism. This ability significantly enhances the throughput and scalability of multi-threaded applications and optimizes the compute density for single-threaded workloads.
SMT is primarily beneficial in commercial environments where the speed of an individual transaction is less critical than the total number of transactions that are performed. SMT typically increases the throughput of most workloads, especially workloads with large or frequently changing working sets, such as database servers and web servers.
Table 2-2 lists a historic account of the SMT capabilities that are supported by each implementation of the IBM Power Architecture® since POWER4.
Table 2-2 SMT levels that are supported by Power processors
a. The PowerVM hypervisor (PHYP) supports a maximum of 240 × SMT8 = 1920 threads. AIX supports up to 1920 (240 × SMT8) threads in a single partition, starting with AIX 7.3 on Power10.
2.1.4 Matrix-multiply assist AI workload acceleration
The matrix-multiply assist (MMA) facility was introduced by the Power Instruction Set Architecture (ISA) v3.1. The related instructions implement numerical linear algebra operations on small matrices and are meant to accelerate computation-intensive kernels, such as matrix multiplication, convolution, and discrete Fourier transform.
To efficiently accelerate MMA operations, the Power10 processor core implements a dense math engine (DME) microarchitecture that effectively provides an accelerator for cognitive computing, machine learning, and AI inferencing workloads.
The DME encapsulates compute-efficient pipelines, a physical register file, and the associated data flow that keeps resulting accumulator data local to the compute units. Each MMA pipeline performs outer-product matrix operations, reading from and writing back a 512-bit accumulator register.
Power10 implements the MMA accumulator architecture without adding architected state. Each architected 512-bit accumulator register is backed by four 128-bit Vector Scalar eXtension (VSX) registers.
Code that uses the MMA instructions is included in the OpenBLAS and Eigen libraries. These libraries can be built by using recent versions of the GNU Compiler Collection (GCC). The latest version of OpenBLAS is available from the OpenBLAS project.
OpenBLAS is used by the NumPy library, PyTorch, and other frameworks, which makes it easy to realize the performance benefit of the Power10 MMA accelerator for AI workloads.
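Because the acceleration is delivered through the BLAS layer, application code does not change. A short sketch (assuming a NumPy build linked against OpenBLAS, which is the common case on Linux distributions for Power):

```python
import numpy as np

# numpy.show_config() reports which BLAS/LAPACK NumPy was built against;
# on a Power10 system with an MMA-enabled OpenBLAS, the call below runs
# on the dense math engine with no source change.

a = np.arange(6, dtype=np.float32).reshape(2, 3)
b = np.ones((3, 2), dtype=np.float32)

# The @ operator dispatches to the BLAS sgemm routine for float32 inputs.
c = a @ b
print(c.tolist())  # [[3.0, 3.0], [12.0, 12.0]]
```

The same transparency applies to PyTorch and other frameworks that link against OpenBLAS: existing models pick up the MMA path when run on Power10.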
For more information about the implementation of the Power10 processor’s high throughput math engine, see the white paper A matrix math facility for Power ISA processors.
For more information about fundamental MMA architecture principles with detailed instruction set usage, register file management concepts, and various supporting facilities, see
Matrix-Multiply Assist Best Practices Guide, REDP-5612.
58 IBM Power E1080: Technical Overview and Introduction