Server-on-a-Chip
Robert Golla
Principal Architect
Sun Microsystems
October 10, 2006
Contributors
Jama Barreh
Jeff Brooks
William Bryg
Bruce Chang
Robert Golla
Greg Grohoski
Rick Hetherington
Paul Jordan
Mark Luttrell
Mark McPherson
Shimon Muller
Chris Olson
Bikram Saha
Manish Shah
Michael Wong
Agenda
Chip Overview
Throughput Computing
Sparc core
Crossbar
L2 cache
Networking
PCI-Express
Power
Status
Summary
Chip Overview
[Die floorplan: 8 SPARC cores and 8 L2 data banks with their tag arrays surround a central crossbar (CCX); MCU0-MCU3 memory controllers, NIU blocks (TDS, RDP, MAC, RTX), PCI-Express blocks (PEU, DMU, PSR), SII/SIO system interface, NCU, CCU, EFU, and FSR/ESR serdes sit at the periphery]
> 8 SPARC cores, 8 threads each
> Shared 4 MB L2, 8 banks, 16-way associative
> Four dual-channel FBDIMM memory controllers
> Two 10/1 Gb Ethernet ports
> One PCI-Express x8 1.0A port
> 342 mm^2 die size in 65 nm
> 711 signal I/O, 1831 total
[Block diagram: 8 SPARC cores connect through an 8x9 cache crossbar to 8 L2 banks; each pair of L2 banks feeds one of four FBDIMM memory controllers (Memory Control 0-3); the NIU (2x 10/1 GE Ethernet) and the x8 PCI-Express port at 2.5 Gb/s per lane attach through the System Interface Unit]
System interface unit connects networking and I/O to memory
Throughput Computing
[Timeline diagram: for a single thread, short compute periods (C) alternate with long memory-latency stalls (M), so the pipeline sits idle most of the time; with multiple threads, one thread's compute overlaps the others' memory latency and the pipeline stays busy]
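The hiding of memory latency shown in the diagram can be captured with a toy utilization model. The compute/stall split below (1 cycle of compute per 3 cycles of memory stall) is an illustrative assumption, not measured data:

```python
# Toy model of throughput computing: a thread alternates `compute` busy
# cycles with `memory` stall cycles. One thread keeps the pipeline busy
# only compute/(compute+memory) of the time; with N threads sharing the
# pipeline, other threads' compute hides the stalls, up to full utilization.

def utilization(n_threads, compute=1, memory=3):
    single = compute / (compute + memory)
    return min(1.0, n_threads * single)

print(utilization(1))   # 0.25 -> pipeline idle 75% of the time
print(utilization(4))   # 1.0  -> stalls fully hidden by other threads
```

With these assumed ratios, four threads are already enough to saturate one pipeline, which is the intuition behind putting many threads on each core.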
Engineering Solutions
Design Problem
> Double UltraSparc T1's throughput and throughput/watt
> Improve UltraSparc T1's FP single-thread and throughput performance
> Minimize required area for these improvements
Engineering Solutions
[Chart (12/2002): multithreaded performance vs. number of threads; total IPC axis with ticks at 1.0 and 2.0 and a Niagara2 design point marked]
Probabilistic modelling
> Generate synthetic traces
Engineering Solutions
Decided to increase the number of threads per core and increase execution bandwidth
> 8 threads per core x 8 cores = 64 threads total
> 2 EXUs per core
> More than doubles UltraSparc T1's throughput
> Doubling threads is more area efficient than doubling cores
Throughput Changes
Niagara2 throughput changes vs. UltraSparc T1
> Add instruction buffers after the L1 instruction cache for each thread
Sparc Core
[Core block diagram: the IFU feeds two integer execution units (EXU0, EXU1); the SPU, FGU, LSU, and MMU/HWTW are shared; a gasket interfaces the core to the crossbar/L2]
32 FRF entries/thread
SPU Stream Processing Unit
> Cryptographic coprocessor
TLU Trap Logic Unit
> Updates machine state, handles exceptions and interrupts
MMU Memory Management Unit
> Hardware tablewalk (HWTW)
> 8KB, 64KB, 4MB, 256MB pages
Core Pipeline
8-stage integer pipeline: Fetch, Cache, Pick, Decode, Execute, Memory, Bypass, Writeback
Floating-point operations continue through FGU stages Fx1-Fx5, FB, FW
Integer/LSU Pipeline
[Diagram: the IFU fetches into per-thread instruction buffers, IB0-3 for thread group 0 and IB4-7 for thread group 1, which feed the execution pipelines and LSU]
> Least-recently-fetched algorithm used to select next thread to fetch
> Least-recently-picked algorithm used to select next thread to execute
> Decode resolves resource hazards not handled during pick
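The least-recently-fetched policy can be sketched as a small arbiter: among the threads that are ready, pick the one whose last fetch is oldest. The tie-break by thread id is an assumption for illustration; the slide does not specify it:

```python
# Sketch of least-recently-fetched thread selection (assumed behavior):
# among ready threads, choose the one fetched longest ago.

def pick_lrf(ready, last_fetch_cycle):
    # ready: iterable of thread ids
    # last_fetch_cycle: dict mapping thread id -> cycle of its last fetch
    # Ties broken by lowest thread id (an illustrative assumption).
    return min(ready, key=lambda t: (last_fetch_cycle[t], t))

last = {0: 5, 1: 2, 2: 7, 3: 2}
print(pick_lrf([0, 1, 2, 3], last))  # thread 1: fetched at cycle 2, lowest id
```

The least-recently-picked policy for execute would follow the same shape, keyed on the cycle each thread was last picked.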
Integer/LSU Pipeline
[Pipeline snapshot: thread numbers shown at each stage, e.g. thread 2 at fetch (F2), thread 6 at cache (C6), threads 5 and 0 at pick (P5, P0), and so on through decode, execute, memory, bypass, and writeback]
> Threads are interleaved between pipeline stages with very few restrictions
> Any thread can be at fetch or cache stage
> Threads are split into 2 thread groups before pick stage
> Load/store and floating-point units are shared between all 8 threads
> Up to 1 thread from either thread group can be scheduled on a shared unit
Stream Processing Unit
[Diagram: the MA execution unit takes rs1/rs2 operands and store data/address; a DMA engine moves addresses and data to/from the L2, feeding the hash and cipher engines]
Cryptographic coprocessor
> One per core
> Runs in parallel w/core at same frequency
Two independent sub-units
> Modular Arithmetic Unit
  > RSA, binary and integer polynomial elliptic curve (ECC)
  > Shares FGU multiplier
> Cipher/Hash Unit
  > RC4, DES/3DES, AES128/192/256
  > MD5, SHA-1, SHA-256
Designed to achieve wire-speed on both 10Gb Ethernet ports
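The wire-speed target translates into a concrete aggregate payload rate for the eight SPUs. A back-of-the-envelope check, ignoring Ethernet framing overhead:

```python
# Payload bandwidth the SPUs must sustain to keep up with both
# 10 Gb Ethernet ports (framing overhead ignored for simplicity).
ports = 2
line_rate_bps = 10e9
total_bytes_per_s = ports * line_rate_bps / 8

print(total_bytes_per_s / 1e9, "GB/s aggregate")       # 2.5 GB/s
print(total_bytes_per_s / 8 / 1e9, "GB/s per SPU")     # ~0.31 GB/s each
```

Spread across the eight per-core SPUs, each unit needs to cipher or hash on the order of a third of a gigabyte per second.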
Crossbar
[Diagram: PCX muxes (L2 B0 Mux ... L2 B7 Mux) connect the 8 cores to L2 Bank0-Bank7 and I/O]
Non-blocking, pipelined switch
> 8 load/store requests and 8 data returns can be done at the same time
> ~180 GB/s read bandwidth
> Connects 8 cores to 8 L2 banks and I/O
Divided into 2 parts
> PCX: processor to cache
> CPX: cache to processor
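The ~180 GB/s read figure is consistent with 8 simultaneous data returns of 16 B per cycle. The ~1.4 GHz clock used below is an assumption (the clock rate is not stated on this slide):

```python
# Sanity check on the ~180 GB/s crossbar read bandwidth.
# Assumes 8 CPX return ports, 16 B per port per cycle, and a ~1.4 GHz
# clock (the clock is an assumption, not given here).
ports, bytes_per_cycle, ghz = 8, 16, 1.4
read_bw = ports * bytes_per_cycle * ghz
print(read_bw, "GB/s")  # 179.2 GB/s, i.e. ~180 GB/s
```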
L2 Cache
[Diagram: PCX requests and replayed misses arbitrate into the L2 tag, valid, and directory arrays; hits read the L2 data array and return 16B CPX packets through the output queue; misses allocate in the miss buffer, issue miss requests to memory, fill 64B lines through the fill buffer, and evict 64B lines through the write-back buffer; an I/O write buffer and a second arbiter handle 64B memory writes; directory lookups generate invalidation packets and fill requests]
> 4 MB L2 cache, 8 L2 banks
> L2 cache is write-back, write-allocate
> Directories maintained for all 16 L1 caches
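The stated geometry (4 MB, 8 banks, 16-way, 64 B lines from the fill/eviction sizes) pins down the number of sets per bank. The exact address bit assignment below is an illustrative assumption, not taken from the slide:

```python
# L2 geometry implied by the slides: 4 MB total, 8 banks,
# 16-way set-associative, 64 B lines.
total, banks, ways, line = 4 * 1024 * 1024, 8, 16, 64
sets_per_bank = total // (banks * ways * line)
print(sets_per_bank)  # 512 sets per bank

def l2_slice(addr):
    # Illustrative decomposition (bit layout assumed): low 6 bits line
    # offset, next 3 bits bank select, next 9 bits set index.
    offset = addr & (line - 1)
    bank = (addr >> 6) & (banks - 1)
    index = (addr >> 9) & (sets_per_bank - 1)
    return bank, index, offset

print(l2_slice(0x12345))
```

Selecting the bank from low-order line-address bits spreads consecutive lines across banks, which matches the banked, pipelined design.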
Integrated Networking
[Diagram: 8 cores and 8 L2 banks over the crossbar; FBDIMM memory provides 42 GB/s read and 21 GB/s write bandwidth; the NIU (10 GE Ethernet) and x8 PCI-Express at 2.5 Gb/s per lane attach through the System Interface Unit]
> Pipelined memory accesses tolerate relaxed ordering
> Integration minimizes impact of memory latency
> Get networking closer to memory to reduce latency
> Eliminates inherent inefficiencies of I/O protocol translation
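The aggregate FBDIMM numbers can be broken down per channel. Dividing the totals evenly across the four dual-channel controllers is an assumption for illustration:

```python
# Per-channel view of the FBDIMM bandwidth figures:
# 4 dual-channel controllers = 8 channels in total.
channels = 4 * 2
print(42 / channels, "GB/s read per channel")   # 5.25 GB/s
print(21 / channels, "GB/s write per channel")  # 2.625 GB/s
```

The 2:1 read:write ratio reflects FBDIMM's asymmetric northbound/southbound link widths.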
Networking Features
Line Rate Packet Classification (~30M pkt/s)
> Based on Layer 1/2/3/4 of the protocol stack
Virtualization Support
> Supports up to 8 partitions
> Interrupts may be bound to different hardware threads
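The ~30M pkt/s figure matches minimum-size Ethernet frames arriving on both 10 Gb ports, once per-frame wire overhead is counted:

```python
# Where ~30M pkt/s comes from: minimum-size frames on both 10 Gb ports.
# On the wire, a 64 B frame also costs 8 B preamble + 12 B inter-frame gap.
line_rate = 10e9                      # bits/s per port
wire_bytes = 64 + 8 + 12              # 84 B per minimum-size frame
pps_per_port = line_rate / (wire_bytes * 8)

print(round(pps_per_port / 1e6, 2), "Mpkt/s per port")   # 14.88
print(round(2 * pps_per_port / 1e6, 2), "Mpkt/s total")  # 29.76, i.e. ~30M
```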
PCI-Express
[Diagram: the data management unit moves DMA/PIO cache lines through an IOMMU at 350 MHz; the transaction layer at 250 MHz exchanges TLP packets and interrupts over 128b transmit/receive datapaths with 32b CSR access; x8 serdes lanes run at 2.5 Gb/s over 16b interfaces]
Point-to-point, dual-simplex chip interconnect
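The x8 link's usable bandwidth follows from the per-lane signaling rate and PCI-Express 1.x's 8b/10b encoding:

```python
# PCI-Express 1.0a x8: raw vs. usable bandwidth.
# 8b/10b encoding leaves 80% of the 2.5 Gb/s per-lane rate for data.
lanes, gbps_per_lane, encoding = 8, 2.5, 8 / 10
data_gbps = lanes * gbps_per_lane * encoding
print(data_gbps / 8, "GB/s per direction")  # 2.0 GB/s each way
```

Because the link is dual-simplex, the 2 GB/s is available simultaneously in each direction.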
Power Management
Limit speculation
> Sequential prefetch of instruction cache lines
> Predict conditional branches as not-taken
> Predict loads hit in the data cache
> Hardware tablewalk search control
Power throttling
> 3 external power throttle pins
> Inject stall cycles into the decode stage based on state of these pins
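The throttling mechanism amounts to a duty cycle on the decode stage. The mapping from pin values to stall ratios below is an assumption for illustration; the slide only says stalls are injected based on the pin state:

```python
# Sketch of power throttling via decode-stage stall injection: the 3
# external pins encode a throttle level; higher levels insert more
# stall cycles. The level-to-ratio mapping is assumed, not specified.

def issue_cycles(total_cycles, throttle_level, max_level=7):
    # throttle_level 0..7 from the 3 pins; level 0 means no stalls.
    stall_fraction = throttle_level / (max_level + 1)
    return int(total_cycles * (1 - stall_fraction))

print(issue_cycles(1000, 0))  # 1000 -> no throttling
print(issue_cycles(1000, 4))  # 500  -> half the decode slots stalled
```

Stalling decode throttles the whole core downstream of fetch, trading throughput for power without changing voltage or frequency.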
Summary
Niagara2 combines all major server functions on one chip
> Integrated networking
> Integrated PCI-Express
> Embedded wire-speed cryptography
Page 24