2.1.2 Power10 processor core

The Power10 processor core inherits the modular architecture of the POWER9 processor core, but its redesigned and enhanced micro-architecture significantly increases processor core performance and processing efficiency. Peak computational throughput is markedly improved by new execution capabilities and optimized cache bandwidth characteristics. Added matrix-math acceleration engines deliver significant performance gains for machine learning, particularly for AI inferencing workloads.

The Power E1080 server uses the Power10 enterprise-class processor variant, in which each core can run up to eight essentially independent hardware threads. If all threads are active, the mode of operation is referred to as 8-way simultaneous multithreading (SMT8) mode. A Power10 core with SMT8 capability is named a Power10 SMT8 core, or SMT8 core for short. The Power10 core also supports modes with four active threads (SMT4), two active threads (SMT2), and a single active thread (ST).
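As a minimal, hedged illustration (not part of the original publication), the following C program shows how the SMT thread count becomes visible to software on a Linux partition: each active SMT hardware thread is presented to the operating system as one logical CPU, so an SMT8 core contributes eight logical CPUs.

```c
/* Minimal sketch (assumes a Linux partition): each active SMT hardware
 * thread appears to the operating system as one logical CPU, so an
 * SMT8 core contributes eight logical CPUs. */
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    long online     = sysconf(_SC_NPROCESSORS_ONLN);  /* active logical CPUs */
    long configured = sysconf(_SC_NPROCESSORS_CONF);  /* configured logical CPUs */

    printf("logical CPUs online: %ld (of %ld configured)\n",
           online, configured);
    return 0;
}
```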

The SMT8 core includes two execution resource domains. Each domain provides the functional units to service up to four hardware threads.

Figure 2-5 shows the functional units of an SMT8 core in which all eight threads are active. The two execution resource domains are highlighted with colored backgrounds in two different shades of blue.

Figure 2-5 Power10 SMT8 core

Each of the two execution resource domains supports between one and four threads and includes four 128-bit vector scalar units (VSUs), two matrix-multiply assist (MMA) accelerators, and one quad-precision floating-point (QP) and decimal floating-point (DF) unit.
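To make the 128-bit width of a VSU slice concrete, the following minimal C sketch (illustrative, not from this publication) issues one fused multiply-add across four single-precision lanes by using the AltiVec/VSX intrinsics from <altivec.h>, which is the kind of 128-bit SIMD operation that a single execution slice services.

```c
/* Minimal sketch: one 128-bit VSX SIMD operation of the kind that a
 * single VSU execution slice services. Build with a VSX-capable
 * compiler, for example: gcc -mcpu=power10 -O2 vsu.c */
#include <stdio.h>
#include <altivec.h>

int main(void)
{
    vector float a = {1.0f, 2.0f, 3.0f, 4.0f};
    vector float b = {0.5f, 0.5f, 0.5f, 0.5f};
    vector float c = {1.0f, 1.0f, 1.0f, 1.0f};

    vector float r = vec_madd(a, b, c);   /* a*b + c in all four lanes */

    for (int i = 0; i < 4; i++)
        printf("r[%d] = %.2f\n", i, r[i]);
    return 0;
}
```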

One VSU and its directly associated logic are called an execution slice. Two neighboring slices can also be used as a combined execution resource, which is then named a super-slice. When the core operates in SMT8 mode, the eight SMT threads are subdivided into pairs that collectively run on two adjacent slices, as indicated by the colored backgrounds in different shades of green.

In SMT4 or lower thread modes, one or two threads each share a four-slice resource domain. Figure 2-5 also indicates other essential resources that are shared among the SMT threads, such as the instruction cache, the instruction buffer, and the L1 data cache.


The SMT8 core supports automatic workload balancing to change the operational SMT thread level. Depending on the workload characteristics, the number of threads running on one chiplet can be reduced from four to two, and even further to only one active thread. An individual thread can benefit in terms of performance if fewer threads run against the core's execution resources.

Micro-architecture performance and efficiency optimizations lead to a significant improvement in performance per watt compared with the previous POWER9 core implementation. The overall energy efficiency is better by a factor of approximately 2.6, which demonstrates the advancement in processor design that is manifested by Power10.

The Power10 processor core includes the following key features and improvements that affect performance:

- Enhanced load and store bandwidth

- Deeper and wider instruction windows

- Enhanced data prefetch

- Branch execution and prediction enhancements

- Instruction fusion

Enhancements in the area of computation resources, working set size, and data access latency are described next. The change in relation to the POWER9 processor core implementation is provided in parentheses.

Enhanced computation resources

The following are major computational resource enhancements:

- Eight vector scalar unit (VSU) execution slices, each supporting 64-bit scalar or 128-bit single instruction, multiple data (SIMD) operations (+100% for permute, fixed-point, and floating-point operations; +400% for crypto (Advanced Encryption Standard (AES)/SHA) operations)

- Four units for matrix-math assist (MMA) acceleration, each capable of producing a 512-bit result per cycle (new; +400% single-precision and double-precision FLOPS, plus support for reduced-precision AI acceleration); see the MMA sketch after this list

- Two units for quad-precision floating-point and decimal floating-point operations (additional instruction types)
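As a hedged sketch of the kind of operation the MMA units execute, the following C example uses the MMA built-in functions that GCC provides for Power10 (compile with gcc -mcpu=power10): it zeroes a 4 x 4 single-precision accumulator tile, accumulates one outer product into it, and copies the tile back to memory.

```c
/* Sketch of the GCC MMA built-ins for Power10 (gcc -mcpu=power10):
 * accumulate one 4x4 single-precision outer product of a and b into an
 * accumulator register of type __vector_quad. */
#include <stdio.h>
#include <altivec.h>

int main(void)
{
    vector float a = { 1.0f,  2.0f,  3.0f,  4.0f};
    vector float b = {10.0f, 20.0f, 30.0f, 40.0f};

    __vector_quad acc;
    __builtin_mma_xxsetaccz(&acc);                /* zero the 512-bit tile */
    __builtin_mma_xvf32gerpp(&acc,                /* acc += outer(a, b)    */
                             (vector unsigned char)a,
                             (vector unsigned char)b);

    vector float rows[4];
    __builtin_mma_disassemble_acc(rows, &acc);    /* copy tile to memory   */

    for (int i = 0; i < 4; i++)
        printf("%6.1f %6.1f %6.1f %6.1f\n",
               rows[i][0], rows[i][1], rows[i][2], rows[i][3]);
    return 0;
}
```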

Larger working sets

The following major changes were implemented in working set sizes (a verification sketch for a running system follows the list):

- L1 instruction cache: 2 x 48 KB, 6-way (96 KB total) (+50%)

- L2 cache: 2 MB, 8-way (+400%)

- L2 translation lookaside buffer (TLB): 2 x 4 K entries (8 K total) (+400%)
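These geometries can be verified on a running Linux partition through the standard sysfs cache interface. The following minimal C sketch (illustrative; it assumes the usual /sys/devices/system/cpu layout) prints the level, size, and associativity of each cache that CPU 0 sees.

```c
/* Minimal sketch (assumes the standard Linux sysfs cache interface):
 * print level, size, and associativity for each cache seen by CPU 0. */
#include <stdio.h>

static void print_attr(const char *path)
{
    char buf[64];
    FILE *f = fopen(path, "r");
    if (f && fgets(buf, sizeof buf, f))
        printf("%s", buf);              /* sysfs values end with '\n' */
    else
        printf("n/a\n");
    if (f)
        fclose(f);
}

int main(void)
{
    const char *base = "/sys/devices/system/cpu/cpu0/cache";
    const char *attrs[] = {"level", "size", "ways_of_associativity"};
    char path[160];

    for (int idx = 0; ; idx++) {
        snprintf(path, sizeof path, "%s/index%d/level", base, idx);
        FILE *probe = fopen(path, "r");
        if (!probe)
            break;                      /* no more cache indexes */
        fclose(probe);

        for (int a = 0; a < 3; a++) {
            snprintf(path, sizeof path, "%s/index%d/%s", base, idx, attrs[a]);
            printf("index%d %-22s ", idx, attrs[a]);
            print_attr(path);
        }
    }
    return 0;
}
```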

Data access with reduced latencies

The following major changes reduce latency for load data (a measurement sketch follows the list):

- L1 data cache access at four cycles nominal, with zero penalty for store forwarding (-2 cycles for store forwarding)

- L2 data access at 13.5 cycles nominal (-2 cycles)

- L3 data access at 27.5 cycles nominal (-8 cycles)

- Translation lookaside buffer (TLB) access at 8.5 cycles nominal for an effective-to-real address translation (ERAT) miss, including for nested translation (-7 cycles)
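Load-to-use latencies of this kind are typically measured with a pointer-chasing loop in which every load address depends on the result of the previous load. The following C sketch (illustrative only; the measured value depends on where the working set fits relative to the L1, L2, and L3 capacities listed earlier) reports the average dependent-load latency for a 128 KB working set.

```c
/* Minimal pointer-chase sketch: each load depends on the previous one,
 * so the average time per iteration approximates the load-to-use
 * latency of whichever cache level holds the working set. */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N     (1 << 14)        /* 16 K pointers = 128 KB working set */
#define ITERS (1L << 26)

int main(void)
{
    size_t *next = malloc(N * sizeof *next);
    size_t *perm = malloc(N * sizeof *perm);

    /* Shuffle an index array (Fisher-Yates), then link consecutive
     * entries so that the chase forms one random cycle over all slots,
     * which defeats simple next-line prefetching. */
    for (size_t i = 0; i < N; i++)
        perm[i] = i;
    for (size_t i = N - 1; i > 0; i--) {
        size_t j = (size_t)rand() % (i + 1);
        size_t t = perm[i]; perm[i] = perm[j]; perm[j] = t;
    }
    for (size_t i = 0; i < N; i++)
        next[perm[i]] = perm[(i + 1) % N];

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    size_t p = 0;
    for (long i = 0; i < ITERS; i++)
        p = next[p];               /* dependent load chain */
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
    printf("avg dependent-load latency: %.2f ns (p=%zu)\n", ns / ITERS, p);
    free(next);
    free(perm);
    return 0;
}
```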

Micro-architectural innovations that complement physical and logic design techniques and specifically address energy efficiency include the following examples:


- Improved clock-gating

- Reduced flush rates with improved branch prediction accuracy

- Fusion and gather operation merging

- Reduced number of ports and reduced access to selected structures

- Effective address (EA)-tagged L1 data and instruction caches, which yield ERAT access only on a cache miss

In addition to significant improvements in performance and energy efficiency, security represents a major architectural focus area. The Power10 processor core supports the following security features:

- Enhanced hardware support that provides improved performance while mitigating speculation-based attacks

- Dynamic Execution Control Register (DEXCR) support

- Return-oriented programming (ROP) protection (see the compile-time sketch that follows)
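As a hedged note on the last item: recent GCC releases expose Power10 ROP protection through the -mrop-protect option (treat the exact option name and behavior as an assumption to verify against your toolchain documentation). No source changes are needed; the compiler inserts a hash-store/hash-check sequence around the prologue and epilogue of functions that save their return address, as in the sketch below.

```c
/* Hedged sketch: build with, for example,
 *     gcc -mcpu=power10 -mrop-protect rop.c
 * (option name per GCC's PowerPC port; verify for your toolchain).
 * Non-leaf functions such as call_through() save the link register,
 * so the compiler protects the saved return address with a
 * hash-store/hash-check instruction pair; no source changes needed. */
#include <stdio.h>

static void callee(void)
{
    puts("returning through a checked saved link register");
}

void call_through(void)
{
    callee();   /* link register is saved here and checked on return */
}

int main(void)
{
    call_through();
    return 0;
}
```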
