
Niagara2: A Highly Threaded

Server-on-a-Chip

Robert Golla
Principal Architect
Sun Microsystems
October 10, 2006

Contributors

Jama Barreh
Jeff Brooks
William Bryg
Bruce Chang
Robert Golla
Greg Grohoski
Rick Hetherington
Paul Jordan

Mark Luttrell
Mark Mcpherson
Shimon Muller
Chris Olson
Bikram Saha
Manish Shah
Michael Wong

Page 2

Agenda

Chip Overview
Throughput Computing
Sparc core
Crossbar
L2 cache
Networking
PCI-Express
Power
Status
Summary

Page 3

Niagara2 Chip Overview


[Annotated die plot: SPARC cores 0-7, L2 data banks 0-7 with L2 tag arrays, L2B0-7, MCU0-3, CCX, CCU, SII, SIO, NCU, EFU, DMU, PEU, PSR, FSR, ESR, MAC, RTX, TDS, RDP]

8 Sparc cores, 8 threads each
Shared 4MB L2, 8 banks, 16-way associative
Four dual-channel FBDIMM memory controllers
Two 10/1 Gb Ethernet ports
One PCI-Express x8 1.0A port
342 mm^2 die size in 65 nm
711 signal I/O, 1831 total
Page 4

Niagara2 Chip Overview

[Block diagram: 8 Sparc cores connect through an 8x9 cache crossbar to 8 L2 banks; the L2 banks feed Memory Control 0-3, each driving FBDIMM channels; the NIU (2x10/1 GE Ethernet) and PCI-EX (x8 @ 2.5 Gb/s) attach through the System Interface Unit]

Full 8x9 crossbar switch
> Connects every core to every L2 bank and vice versa
> Supports 8-byte writes from a core to a bank
> Supports 16-byte reads from a bank to a core (the sketch below backs the implied core clock out of these port widths)
> One port for core to read/write I/O
System interface unit connects networking and I/O to memory
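The 8-byte write and 16-byte read port widths, together with the ~90 GB/s write and ~180 GB/s read aggregates quoted on the crossbar slide later in the deck, let you back out the clock rate the crossbar must be running at. A minimal sketch of that arithmetic, using only numbers from the deck:

```python
# Back out the clock implied by the crossbar bandwidth figures.
# Port widths (8B write, 16B read) are from this slide; the ~90/~180 GB/s
# aggregates are from the crossbar slide later in the deck.
PORTS = 8                    # one crossbar port per core
WRITE_BYTES_PER_CYCLE = 8    # core -> L2 bank
READ_BYTES_PER_CYCLE = 16    # L2 bank -> core

write_bw_bytes_per_s = 90e9   # ~90 GB/s aggregate write bandwidth
read_bw_bytes_per_s = 180e9   # ~180 GB/s aggregate read bandwidth

# bandwidth = ports * bytes_per_cycle * clock  =>  clock = bandwidth / (ports * width)
clock_from_writes = write_bw_bytes_per_s / (PORTS * WRITE_BYTES_PER_CYCLE)
clock_from_reads = read_bw_bytes_per_s / (PORTS * READ_BYTES_PER_CYCLE)

print(f"implied clock from writes: {clock_from_writes / 1e9:.2f} GHz")  # ~1.4 GHz
print(f"implied clock from reads:  {clock_from_reads / 1e9:.2f} GHz")   # ~1.4 GHz
```

Both figures point at the same core/crossbar clock, which is why the read path quotes exactly twice the write bandwidth.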
Page 5

Throughput Computing
[Timeline figure: a single thread alternates compute bursts (C) with memory-latency stalls (M); with many threads, compute from other threads overlaps each thread's memory stalls]

For a single thread
> Memory is THE bottleneck to improving performance
> Commercial server workloads exhibit poor memory locality
> Only a modest throughput speedup is possible by reducing compute time
> Conventional single-thread processors optimized for ILP have low utilizations
With many threads
> It's possible to find something to execute every cycle
> Processor utilization is much higher
> Significant throughput speedups are possible (the toy model below illustrates the effect)
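A toy model makes the point concrete. Assuming an illustrative compute burst of C cycles followed by a memory stall of M cycles per thread (made-up numbers, not Niagara2 parameters), single-thread utilization is C/(C+M), while N ideally interleaved threads can keep a pipeline busy up to min(1, N*C/(C+M)):

```python
# Toy throughput-computing model: utilization of one pipeline when N threads
# each alternate a compute burst of C cycles with a memory stall of M cycles.
# C and M below are illustrative values, not Niagara2 parameters.
def utilization(n_threads: int, compute: float, mem_latency: float) -> float:
    """Fraction of cycles the pipeline does useful work, assuming ideal interleaving."""
    return min(1.0, n_threads * compute / (compute + mem_latency))

C, M = 5.0, 100.0   # e.g. 5 compute cycles, then a 100-cycle miss to memory
for n in (1, 4, 8, 16, 32):
    print(f"{n:2d} threads -> utilization {utilization(n, C, M):.2f}")
# 1 thread   -> 0.05 (memory latency dominates)
# 32 threads -> 1.00 (something to execute every cycle)
```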


Page 6


Engineering Solutions
Design Problem
> Double UltraSparc T1's throughput and throughput/watt
> Improve UltraSparc T1's FP single-thread and throughput performance
> Minimize required area for these improvements

Considered doubling number of UltraSparc T1 cores
> 16 cores of 4 threads each
> Takes too much die area
> No area left for improving FP performance

Page 7

Engineering Solutions
[Chart: multithread performance vs. threads (12/2002) — relative throughput performance (total IPC); Niagara2 and UltraSparc T1 curves, y-axis marked at 1.0 and 2.0]

Probabilistic Modelling (sketched below)
> Generate synthetic traces for each thread with an instruction/miss profile that matches TPC-C
> Schedule ready threads to run on some number of execution units
> End simulation once simulated distributions are close to actual distributions
Works very well for simple scalar cores running lots of threads on transactional workloads
> Within 10 percent of a detailed cycle accurate simulator
> Detailed cycle accurate simulator not available at beginning of the project
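A minimal sketch of the kind of probabilistic model described here: each thread is a synthetic stream of run lengths between misses, and ready threads are greedily issued onto a fixed number of execution units. The profile numbers and the greedy scheduling policy below are illustrative assumptions, not the actual Sun model or the TPC-C profile.

```python
import random

# Sketch of a probabilistic throughput model: synthetic per-thread traces
# (instructions between misses, miss latency) scheduled onto a few execution
# units. Profile numbers below are illustrative only.
def simulate(n_threads=8, n_exus=2, cycles=100_000,
             mean_run=20, miss_latency=120, seed=0):
    rng = random.Random(seed)
    ready_at = [0] * n_threads          # cycle at which each thread becomes ready
    remaining = [rng.expovariate(1 / mean_run) for _ in range(n_threads)]
    issued = 0
    for cycle in range(cycles):
        slots = n_exus
        for t in range(n_threads):
            if slots == 0:
                break
            if ready_at[t] <= cycle:
                issued += 1             # issue one instruction from thread t
                slots -= 1
                remaining[t] -= 1
                if remaining[t] <= 0:   # thread takes a miss and stalls
                    ready_at[t] = cycle + miss_latency
                    remaining[t] = rng.expovariate(1 / mean_run)
    return issued / cycles              # total IPC

print(f"simulated total IPC: {simulate():.2f}")
```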
Page 8

Engineering Solutions
Decided to increase the number of threads per core and increase execution bandwidth
> 8 threads per core x 8 cores = 64 threads total
> 2 EXUs per core
> More than doubles UltraSparc T1's throughput
> Doubling threads is more area efficient than doubling cores
> Integrate FGU into core pipeline
  > 6-cycle FP latency
  > Threads running FP are non-blocking
> Enhance Niagara2's cryptography
  > Added more ciphers
  > Enhanced existing public key support

Page 9

Throughput Changes
Niagara2 throughput changes vs. UltraSparc T1
> Add instruction buffers after L1 instruction cache for each thread
> Add new pipe stage: pick
  > Choose 2 threads out of 8 to execute each cycle
> Increase execution units from 1 to 2
> Increase set associativity of L1 instruction cache to 8
> Increase size of fully associative DTLB from 64 to 128 entries
> Increase L2 banks from 4 to 8
  > 15 percent performance loss with only 4 banks and 64 threads
> Increase threads from 4 to 8

Page 10

Sparc Core Block Diagram


[Core block diagram: TLU, IFU, EXU0, EXU1, SPU, FGU, LSU, MMU/HWTW, with a gasket to the crossbar/L2]

IFU - Instruction Fetch Unit
> 16 KB I$, 32B lines, 8-way SA
> 64-entry fully-associative ITLB
EXU0/1 - Integer Execution Units
> 4 threads share each unit
> 8 register windows/thread
> 160 IRF entries/thread (see the arithmetic check below)
LSU - Load/Store Unit
> 8 threads share LSU
> 8KB D$, 16B lines, 4-way SA
> 128-entry fully-associative DTLB
FGU - Floating-Point/Graphics Unit
> 8 threads share FGU
> 32 FRF entries/thread
SPU - Stream Processing Unit
> Cryptographic coprocessor
TLU - Trap Logic Unit
> Updates machine state, handles exceptions and interrupts
MMU - Memory Management Unit
> Hardware tablewalk (HWTW)
> 8KB, 64KB, 4MB, 256MB pages
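The 160 IRF entries per thread are consistent with SPARC's windowed register file: 8 windows of 16 window registers (ins plus locals; outs overlap the next window's ins) plus global registers. The breakdown into 4 global sets of 8 is an assumption based on SPARC V9 conventions, not something stated on the slide; the quick check below only verifies the arithmetic.

```python
# Sanity-check the 160 IRF entries/thread figure from the slide.
# Assumption: 4 global register sets of 8 registers each (SPARC V9 convention);
# the slide itself only states the 160-entry total and the 8 windows.
WINDOWS = 8            # register windows per thread (from the slide)
REGS_PER_WINDOW = 16   # 8 ins + 8 locals (outs overlap the next window's ins)
GLOBAL_SETS = 4        # assumed
GLOBALS_PER_SET = 8

irf_entries = WINDOWS * REGS_PER_WINDOW + GLOBAL_SETS * GLOBALS_PER_SET
print(irf_entries)     # 160, matching the slide
```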
Page 11

Core Pipeline
8-stage integer pipeline
> Fetch - Cache - Pick - Decode - Execute - Mem - Bypass - Writeback
> 3-cycle load-use penalty
  > Memory (data translation, access tag/data array)
  > Bypass (late way select, data formatting, data forwarding)
12-stage floating-point pipeline
> Fetch - Cache - Pick - Decode - Execute - Fx1 - Fx2 - Fx3 - Fx4 - Fx5 - FB - FW
> 6-cycle latency for dependent FP ops
> Longer pipeline for divide/sqrt

Page 12

Integer/LSU Pipeline
[Diagram: the IFU fetches into per-thread instruction buffers IB0-3 (thread group 0) and IB4-7 (thread group 1); both groups share the LSU]

Instruction cache is shared by all 8 threads
> Least-recently-fetched algorithm used to select next thread to fetch
Each thread is written into a thread-specific instruction buffer
> Decouples fetch from pick
Each thread statically assigned to one of 2 thread groups
Pick chooses 1 ready thread each cycle within each thread group
> Picking within each thread group is independent of the other
> Least-recently-picked algorithm used to select next thread to execute (modelled in the sketch below)
Decode resolves resource hazards not handled during pick
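A minimal sketch of least-recently-picked selection within one thread group; the same structure applies to the least-recently-fetched choice at the fetch stage. This is an illustrative model of the policy as described on the slide, not the actual hardware logic.

```python
from collections import deque

# Least-recently-picked selection within one 4-thread group: among the
# threads that are ready this cycle, pick the one picked longest ago.
class ThreadGroupPicker:
    def __init__(self, thread_ids):
        self.lru = deque(thread_ids)          # front = least recently picked

    def pick(self, ready):
        """Return the least-recently-picked ready thread, or None if none are ready."""
        for tid in self.lru:
            if tid in ready:
                self.lru.remove(tid)
                self.lru.append(tid)          # now most recently picked
                return tid
        return None

group0 = ThreadGroupPicker([0, 1, 2, 3])
print(group0.pick({1, 3}))   # 1 (least recently picked of the ready threads)
print(group0.pick({1, 3}))   # 3
print(group0.pick({1, 3}))   # 1
```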
Page 13

Integer/LSU Pipeline
[Pipeline snapshot: stages F, C, P, D, E, M, B, W occupied by instructions from different threads (e.g. F2, C6, P0/P5, D2/D7, E0/E6, M3/M4, B1/B7, W2/W6), showing both thread groups interleaved through the shared stages]

Threads are interleaved between pipeline stages with very few restrictions
> Any thread can be at fetch or cache stage
> Threads are split into 2 thread groups before pick stage
Load/store and floating-point units are shared between all 8 threads
> Up to 1 thread from either thread group can be scheduled on a shared unit
Page 14

Stream Processing Unit


[Block diagram: MA scratchpad (160x64b, 2R/1W) and MA execution unit exchange operands (rs1, rs2) and multiply results with the FGU; a DMA engine moves addresses and data to/from the L2; hash and cipher engines sit on the data path]

Cryptographic coprocessor
> One per core
> Runs in parallel with the core at the same frequency
Two independent sub-units
> Modular Arithmetic Unit
  > RSA, binary and integer polynomial elliptic curve (ECC)
  > Shares FGU multiplier
> Cipher/Hash Unit
  > RC4, DES/3DES, AES-128/192/256
  > MD5, SHA-1, SHA-256
  > Designed to achieve wire speed on both 10Gb Ethernet ports
Facilitates wire-speed encryption and decryption
DMA engine shares the core's crossbar port
Page 15

Crossbar

[Diagram: Sparc cores 0-7 connect through per-bank PCX muxes to L2 banks 0-7 and back through the CPX; ~90 GB/s aggregate write, ~180 GB/s aggregate read]

Non-blocking, pipelined switch
> 8 load/store requests and 8 data returns can be done at the same time
Divided into 2 parts
> PCX - processor to cache
> CPX - cache to processor
Connects 8 cores to 8 L2 banks and I/O
Arbitration for a target is required
> Priority given to oldest requestor to maintain fairness and order (see the sketch below)
Three-cycle arbitration protocol
> Request, arbitrate and then grant
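A small sketch of per-target arbitration with oldest-requestor priority: requests for a given L2 bank queue in arrival order and the oldest one wins each grant. The data structures are illustrative, and the three-cycle request/arbitrate/grant protocol is modelled only as a fixed delay, not as the actual crossbar logic.

```python
from collections import deque

# Per-target (per-L2-bank) arbitration: requests queue in arrival order and
# the oldest requestor wins each grant. Illustrative model only.
ARB_LATENCY = 3              # request -> arbitrate -> grant

class TargetArbiter:
    def __init__(self, target):
        self.target = target
        self.queue = deque()                  # (arrival_cycle, source_core)

    def request(self, cycle, core):
        self.queue.append((cycle, core))

    def grant(self, cycle):
        """Grant the oldest request whose arbitration delay has elapsed."""
        if self.queue and self.queue[0][0] + ARB_LATENCY <= cycle:
            _, core = self.queue.popleft()
            return core, cycle
        return None

bank0 = TargetArbiter("L2 bank 0")
bank0.request(cycle=0, core=5)
bank0.request(cycle=1, core=2)
for c in range(6):
    g = bank0.grant(c)
    if g:
        print(f"cycle {c}: grant to core {g[0]}")   # core 5 first, then core 2
```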
Page 16

L2 Cache

[Per-bank L2 pipeline diagram: an input queue and arbiter handle PCX requests, replayed misses, I/O requests and fill requests; lookups access the L2 tag, valid, directory and data arrays (16B accesses); a miss buffer, write-back buffer (64B evictions), fill buffer (64B line fills) and I/O write buffer feed a memory arbiter (64B memory reads and writes); hits, invalidation packets and CPX returns leave through an output queue in 16B packets]

4 MB L2 cache
> 16-way set associative
> 8 L2 banks
> 64-byte line size (see the address breakdown below)
L2 cache is write-back, write-allocate
> Support for partial stores
Coherency is managed by the L2 cache
> L1 data cache is write-through
> Directories maintained for all 16 L1 caches
Data transfers between the L2 and a core are done in 16-byte packets
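The slide's geometry (4 MB, 8 banks, 16-way, 64-byte lines) implies 512 sets per bank. The sketch below shows how a physical address could decompose under that geometry; the exact bit placement (bank select just above the line offset, index above that) is an assumption for illustration, not Niagara2's documented mapping.

```python
# Decompose a physical address under the slide's L2 geometry:
# 4 MB total, 8 banks, 16 ways, 64-byte lines  =>  512 sets per bank.
# Bit placement (bank bits just above the line offset) is an assumption.
CACHE_BYTES = 4 * 1024 * 1024
BANKS = 8
WAYS = 16
LINE = 64

sets_per_bank = CACHE_BYTES // (BANKS * WAYS * LINE)
print(sets_per_bank)                               # 512

def decompose(addr: int):
    offset = addr & (LINE - 1)                     # bits [5:0], byte within line
    bank = (addr >> 6) & (BANKS - 1)               # bits [8:6] (assumed)
    index = (addr >> 9) & (sets_per_bank - 1)      # bits [17:9] (assumed)
    tag = addr >> 18
    return {"tag": tag, "index": index, "bank": bank, "offset": offset}

print(decompose(0x12345678))
```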
Page 17

Integrated Networking

[Diagram: FBDIMM memory (42 GB/s read, 21 GB/s write) feeds the 8 L2 banks, which connect through the crossbar to cores C0-C7; the NIU (10 GE Ethernet) and PCI-Ex (x8 @ 2.5 Gb/s) attach through the System Interface Unit]

Integrate networking for better overall performance
> All network data is sourced from and destined to main memory
> Integration minimizes impact on memory
  > Pipelined memory accesses tolerate relaxed ordering
> Get networking closer to memory to reduce latency
> Able to take full advantage of higher memory bandwidth
> Eliminates inherent inefficiencies of I/O protocol translation
Page 18

Networking Features
Line Rate Packet Classification (~30M pkt/s, checked in the sketch below)
> Based on Layer 1/2/3/4 of the protocol stack
Multiple DMA Engines
> Matches DMAs to threads
> Binding flexibility between DMAs and ports
> 16 transmit + 16 receive DMA channels
Virtualization Support
> Supports up to 8 partitions
> Interrupts may be bound to different hardware threads
Dual Ethernet ports
> 2 dual-speed MACs (10G/1G) with integrated serdes
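The ~30M pkt/s classification figure lines up with minimum-size Ethernet frames at line rate on both 10G ports. A quick check, assuming the standard 64-byte minimum frame plus 20 bytes of preamble and inter-frame gap (standard Ethernet framing overhead, not numbers from the slide):

```python
# Check the ~30M pkt/s line-rate classification figure against
# minimum-size frames on two 10 Gb/s Ethernet ports.
LINE_RATE_BPS = 10e9
MIN_FRAME = 64            # bytes
OVERHEAD = 8 + 12         # preamble + inter-frame gap, bytes
PORTS = 2

pps_per_port = LINE_RATE_BPS / ((MIN_FRAME + OVERHEAD) * 8)
print(f"per port:  {pps_per_port / 1e6:.2f} Mpps")            # ~14.88 Mpps
print(f"two ports: {PORTS * pps_per_port / 1e6:.2f} Mpps")    # ~29.76 Mpps (~30M pkt/s)
```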
Page 19

PCI-Express

[Block diagram: a data management unit (DMA/PIO cache lines, IOMMU, interrupts) at 350 MHz exchanges TLP packets with the PCI Express core — a transaction layer with 128b transmit and receive datapaths plus 32b CSR paths, data link and physical layers at 250 MHz, and an x8 serdes running 2.5 Gb/s per lane over 16b interfaces]

Transfers are in packets with headers and max data payloads from 128B to 512B
IOMMU supports I/O virtualization and process device isolation by using PCIE's BDF#
MSI support
> Event queue accumulates MSIs
> Allows many MSIs to be serviced upon an interrupt
Point-to-point, dual-simplex chip interconnect
PCI-Express operates at 2.5 Gb/s per lane per direction
Total I/O bandwidth is 3-4 GB/s with max payload sizes of 128B to 512B (reproduced in the sketch below)
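The 3-4 GB/s total I/O bandwidth figure can be reproduced from the link parameters on this slide. The 8b/10b line-coding factor and the per-packet header overhead (roughly 20-24 bytes) are standard PCI Express 1.x assumptions, not numbers from the slide; lane count, lane rate and payload sizes are from the slide.

```python
# Reproduce the 3-4 GB/s total PCI-Express bandwidth figure.
# 8b/10b coding and ~24B of TLP/framing overhead per packet are standard
# PCIe 1.x assumptions; lane count, lane rate and payload sizes are from the slide.
LANES = 8
LANE_RATE = 2.5e9          # bits/s per lane per direction
ENCODING = 8 / 10          # 8b/10b line coding
TLP_OVERHEAD = 24          # header + sequence/CRC framing, bytes (assumed)

raw_bytes_per_dir = LANES * LANE_RATE * ENCODING / 8    # 2.0 GB/s per direction

for payload in (128, 256, 512):
    eff = raw_bytes_per_dir * payload / (payload + TLP_OVERHEAD)
    total = 2 * eff        # dual simplex: both directions active
    print(f"{payload:3d}B payload: {total / 1e9:.1f} GB/s total")   # ~3.4-3.8 GB/s
```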
Page 20

Power Management
Limit speculation
> Sequential prefetch of instruction cache lines
> Predict conditional branches as not-taken
> Predict loads hit in the data cache
> Hardware tablewalk search control
Extensive clock gating
> Datapath
> Control blocks
> Arrays
Power throttling (see the sketch below)
> 3 external power throttle pins
> Inject stall cycles into the decode stage based on the state of these pins
  > If power_throttle_pins[2:0] == n, then n stall cycles are injected in each window of 8 cycles (n = 0-7)
> Affects all threads
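A minimal sketch of the stall-injection scheme described above: for a throttle setting n, n stall cycles are inserted into every 8-cycle window at the decode stage. How the stalls are placed within the window is an assumption here (spread evenly); the slide only specifies the count per window.

```python
# Sketch of decode-stage power throttling: with power_throttle_pins[2:0] == n,
# inject n stall cycles into every window of 8 cycles. The placement of the
# stalls inside the window (spread evenly here) is an assumption.
WINDOW = 8

def stall_cycles(throttle: int) -> set[int]:
    """Cycle offsets within an 8-cycle window at which decode is stalled."""
    assert 0 <= throttle <= 7
    # Spread the n stalls evenly across the window (illustrative choice).
    return {round(i * WINDOW / throttle) % WINDOW for i in range(throttle)} if throttle else set()

def decode_enabled(cycle: int, throttle: int) -> bool:
    return (cycle % WINDOW) not in stall_cycles(throttle)

for n in (0, 2, 7):
    pattern = "".join("S" if not decode_enabled(c, n) else "." for c in range(WINDOW))
    print(f"n={n}: {pattern}")   # '.' = decode issues, 'S' = injected stall
```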
Page 21

Niagara2 System Status


First silicon arrived at the end of May
Booted Solaris in 5 days
Current systems are fully operational
Expect systems to ship in 2H2007

Page 22

Summary
Niagara2 combines all major server functions on one chip
> Integrated networking
> Integrated PCI-Express
> Embedded wire-speed cryptography
Niagara2 has improved performance vs. UltraSparc T1
> Better integer throughput and throughput/watt (>2x)
> Improved integer single-thread performance (>1.4x)
> Better floating-point throughput (>10x)
> Better floating-point single-thread performance (>5x)
Enables a new generation of power-efficient, fully secure datacenters


Page 23

Thank you ...


robert.golla@sun.com

Page 24
