2.1.7 On-chip L3 cache and intelligent caching

The Power10 processor includes a large on-chip L3 cache of up to 120 MB with a non-uniform cache access (NUCA) architecture that provides mechanisms to distribute and share cache footprints across a set of L3 cache regions. Each processor core can access an associated local 8 MB of L3 cache. It also can access the data in the other L3 cache regions on the chip and throughout the system.

Each L3 region serves as a victim cache for its associated L2 cache and can also provide aggregate storage for the on-chip cache footprint.

Intelligent L3 cache management enables the Power10 processor to optimize the access to L3 cache lines and minimize cache latencies. The L3 includes a replacement algorithm with data type and reuse awareness. It also supports an array of prefetch requests from the core, including instruction and data, and works cooperatively with the core, memory controller, and SMP interconnection fabric to manage prefetch traffic, which optimizes system throughput and data latency.

The L3 cache supports the following key features:

► Enhanced bandwidth that supports up to 64 bytes per core processor cycle to each SMT8 core.

► Enhanced data prefetch that is enabled by 96 L3 prefetch request machines that service prefetch requests to memory for each SMT8 core.

► Plus-one prefetching at the memory controller for enhanced effective prefetch depth and rate.

► Power10 software prefetch modes that support fetching blocks of data into the L3 cache.

► Data access with reduced latencies.

2.1.8 Open memory interface

The Power10 processor introduces a new open memory interface (OMI). The OMI is driven by eight on-chip memory controller units (MCUs) and is implemented in two separate physical building blocks that lie in opposite areas at the outer edge of the Power10 die. Each area supports 64 OMI lanes that are grouped into four ports. Each port in turn consists of two links with eight lanes each, which operate in a latency-optimized manner at a 32 Gbps signaling rate.

The aggregated maximum theoretical full-duplex bandwidth of the OMI interface culminates at 2 x 512 GBps = 1 TBps per Power10 single chip module (SCM).
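The 1 TBps figure follows directly from the lane count and signaling rate quoted above. A minimal sketch of that arithmetic (all values are taken from the text):

```python
# OMI aggregate bandwidth arithmetic for one Power10 SCM,
# using the lane count and signaling rate quoted in the text.
OMI_LANES = 128          # 2 areas x 64 lanes per SCM
LANE_SPEED_GBPS = 32     # 32 Gbps signaling rate per lane

# Per-direction bandwidth in gigabytes per second (8 bits per byte)
per_direction_gbps = OMI_LANES * LANE_SPEED_GBPS / 8   # 512 GBps
# Full duplex counts both directions
full_duplex_gbps = 2 * per_direction_gbps              # 1024 GBps = 1 TBps

print(per_direction_gbps, full_duplex_gbps)
```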

The OMI physical interface enables low-latency, high-bandwidth, technology-agnostic host memory semantics to the processor and allows attaching established and emerging memory elements. With the Power E1080 server, OMI initially supports one main-tier, low-latency, enterprise-grade Double Data Rate 4 (DDR4) differential DIMM (DDIMM) per OMI link. This configuration yields a total memory capacity of 16 DDIMMs per SCM and 64 DDIMMs per Power E1080 server node. The memory bandwidth depends on the DDIMM density configured for a specific Power E1080 server.

The maximum theoretical duplex memory bandwidth is 409 GBps per SCM if 32 GB or 64 GB DDIMMs running at 3200 MHz are used. The maximum memory bandwidth is slightly reduced to 375 GBps per SCM if 128 GB or 256 GB DDIMMs running at 2933 MHz are used.
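These per-SCM figures can be reproduced from the DDIMM data rate, assuming 16 DDIMMs per SCM and 8 bytes transferred per DDR4 beat (an assumption consistent with standard DDR4 width, not stated explicitly in the text):

```python
# Back-of-envelope check of the per-SCM memory bandwidth figures,
# assuming 16 DDIMMs per SCM and 8 bytes per DDR4 transfer.
DDIMMS_PER_SCM = 16
BYTES_PER_TRANSFER = 8

def scm_bandwidth_gbps(mt_per_s):
    """Peak memory bandwidth in GBps for a given DDIMM data rate (MT/s)."""
    return DDIMMS_PER_SCM * mt_per_s * BYTES_PER_TRANSFER / 1000

print(scm_bandwidth_gbps(3200))  # ~409.6 GBps (32 GB or 64 GB DDIMMs)
print(scm_bandwidth_gbps(2933))  # ~375.4 GBps (128 GB or 256 GB DDIMMs)
```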

60   IBM Power E1080: Technical Overview and Introduction

In summary, the Power10 SCM supports 128 OMI lanes with the following characteristics:

► 32 Gbps signaling rate

► Eight lanes per OMI link

► Two OMI links per OMI port (2 x 8 lanes)

► Eight OMI ports per single chip module (16 x 8 lanes)
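The lane hierarchy in the list above multiplies out to the 128-lane total; a trivial sketch of that check:

```python
# Lane hierarchy per Power10 SCM, as listed in the text.
PORTS_PER_SCM = 8    # eight OMI ports per SCM
LINKS_PER_PORT = 2   # two OMI links per port
LANES_PER_LINK = 8   # eight lanes per link

total_lanes = PORTS_PER_SCM * LINKS_PER_PORT * LANES_PER_LINK
print(total_lanes)  # 128 OMI lanes per SCM
```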
