c66x Corepac PDF

C66x CorePac: Achieving High Performance
Agenda
1. CorePac Architecture
2. Single Instruction Multiple Data (SIMD)
3. Memory Access
4. Pipeline Concept
CorePac Architecture
C66x CorePac
CorePac includes:
o DSP core
  - Two register files
  - Four functional units per register side
o L1P memory (Cache/RAM)
o L1D memory (Cache/RAM)
o L2 memory (Cache/RAM)
[Block diagram: Level 1 Program Memory (L1P), single-cycle cache/RAM; Level 1 Data Memory (L1D), single-cycle cache/RAM; Level 2 Memory (L2), program/data cache/RAM; memory controllers; DSP core with 256-bit instruction fetch, functional units .M, .L, .S, and .D on each side, 64-bit data paths, and register files RegA[32] and RegB[32].]
C66x DSP Core
Four functional units per side:
o Multiplier (.M)
o ALU (.L)
o Data (.D)
o Control (.S)
These independent functional units enable efficient execution of parallel specialized instructions:
o Multiplier (.M1 and .M2) and ALU (.L1 and .L2) provide MAC (multiply-accumulate) operations.
o Data (.D) provides data input/output.
o Control (.S) provides control functions (loop, branch, call).
Each DSP core dispatches up to eight parallel instructions each cycle.
All instructions are conditional, which enables efficient pipelining.
The optimized C compiler generates efficient target code.
[Diagram: register file A (A0 to A31) feeding .D1, .S1, .M1, and .L1; register file B (B0 to B31) feeding .D2, .S2, .M2, and .L2; controller/decoder.]
C66x DSP Core Cross-Path
Any 64-bit pair of registers from A can be one of the inputs to a B functional unit, and vice versa.
[Diagram: register file A (A0 to A31) and register file B (B0 to B31), each connected to its own .D, .S, .M, and .L units, with cross paths between the two sides.]
Partial List of .D Instructions
Partial List of .L Instructions
Partial List of .M Instructions
Partial List of .S Instructions
Single Instruction Multiple Data (SIMD)
C66x SIMD Instructions: Examples
ADDDP: Add Two Double-Precision Floating-Point Values
DADD2: 4-Way SIMD Addition, Packed Signed 16-bit
o Performs 4 additions of two sets of four 16-bit numbers packed into 64-bit registers.
o The 4 results are rounded to 4 packed 16-bit values.
o unit = .L1, .L2, .S1, .S2
FMPYDP: Fast Double-Precision Floating-Point Multiply
QMPY32: 4-Way SIMD Multiply, Packed Signed 32-bit
o Performs 4 multiplications of two sets of four 32-bit numbers packed into 128-bit registers.
o The 4 results are packed 32-bit values.
o unit = .M1 or .M2
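
To make the packed-lane idea concrete, here is a minimal Python sketch of a 4-way packed 16-bit addition in the spirit of DADD2. It is a behavioral model only, not the instruction's definition: the helper names (`pack16`, `unpack16`, `dadd2_model`) are hypothetical, the lane order (lane 0 in the lowest bits) is an assumption, and the overflow behavior (wraparound within each lane) is an assumption that should be checked against the instruction set reference.

```python
MASK16 = 0xFFFF

def pack16(lanes):
    """Pack four signed 16-bit integers into one 64-bit word (lane 0 lowest)."""
    word = 0
    for i, value in enumerate(lanes):
        word |= (value & MASK16) << (16 * i)
    return word

def unpack16(word):
    """Unpack a 64-bit word into four signed 16-bit integers."""
    lanes = []
    for i in range(4):
        raw = (word >> (16 * i)) & MASK16
        lanes.append(raw - 0x10000 if raw & 0x8000 else raw)
    return lanes

def dadd2_model(a, b):
    """Add two packed words lane by lane; carries never cross a lane boundary."""
    result = 0
    for i in range(4):
        shift = 16 * i
        lane_sum = ((a >> shift) & MASK16) + ((b >> shift) & MASK16)
        result |= (lane_sum & MASK16) << shift  # assumption: wrap on overflow
    return result

x = pack16([1, -2, 300, -400])
y = pack16([10, 20, -30, 40])
print(unpack16(dadd2_model(x, y)))  # [11, 18, 270, -360]
```

The point of the sketch is that one 64-bit operation does the work of four independent 16-bit additions, which is where the SIMD speedup comes from.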
C66x SIMD Instruction: CMATMPY
Many applications use complex matrix arithmetic.
CMATMPY: 1x2 Complex Vector Multiplied by a 2x2 Complex Matrix
o Results in a 1x2 signed complex vector.
o All values are 16-bit (16-bit real / 16-bit imaginary).
o unit = .M1 or .M2
How many multiplications does this amount to?
o 4 complex multiplications (4 real multiplications each) = 16 real multiplications per .M unit
o Two .M units (16 multiplications each) = 32 multiplications per cycle
o Core cycles per second: 1.25 G
o Total multiplications per second = 40 G multiplications per core
o 8 cores = 320 G multiplications per second
The issue here is: can we feed the functional units data fast enough?
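
The multiplication count above can be reproduced with a few lines of arithmetic. This is a sketch of the slide's peak-rate calculation, assuming one CMATMPY issues on each .M unit every cycle with no stalls; the function and constant names are hypothetical.

```python
CORE_CLOCK_HZ = 1_250_000_000   # 1.25 GHz, from the slide
M_UNITS_PER_CORE = 2            # .M1 and .M2
CORES = 8

def real_multiplies_per_cmatmpy(vector_len=2, matrix_cols=2, real_per_complex=4):
    """Real multiplications in one complex vector-by-matrix product."""
    complex_mults = vector_len * matrix_cols  # each output sums vector_len products
    return complex_mults * real_per_complex

def multiplies_per_second(cores=1):
    """Peak rate, assuming every .M unit issues one CMATMPY per cycle."""
    per_cycle = real_multiplies_per_cmatmpy() * M_UNITS_PER_CORE
    return per_cycle * CORE_CLOCK_HZ * cores

print(real_multiplies_per_cmatmpy())            # 16
print(multiplies_per_second() // 10**9)         # 40
print(multiplies_per_second(CORES) // 10**9)    # 320
```

These are peak figures; sustained throughput depends on whether memory and scheduling can keep both .M units busy.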
Feeding the Functional Units
There are two challenges:
o How to provide enough data from memory to the core
  - Access to L1 memory is wide (2 x 64-bit) and fast (0 wait state).
  - Multiple mechanisms are used to efficiently transfer new data to L1 from L2 and external memory.
o How to get values in and out of the functional units
  - Hardware pipeline enables execution of instructions every cycle.
  - Efficient instruction scheduling maximizes functional unit throughput.
Memory Access
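Internal Buses
Bus                          Width
Program Address (PC)         x32
Program Data                 x256
Data Address T1 (A Regs)     x32
Data Data T1 (A Regs)        x32/64
Data Address T2 (B Regs)     x32
Data Data T2 (B Regs)        x32/64
[Diagram: the fetch unit and register files A and B connect over these buses to the L1 memories, which connect to L2 and external memory and to the peripherals.]
Load/store width by device family: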
C62x: Dual 32-Bit Load/Store
C67x: Dual 64-Bit Load / 32-Bit Store
C64x, C674x, C66x: Dual 64-Bit Load/Store
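
The load widths listed above translate directly into a peak data rate. The sketch below assumes both data paths (T1 and T2) issue a load every cycle with no stalls, and borrows the 1.25 GHz clock from the CMATMPY slide for the C66x figure; the table and function names are hypothetical.

```python
# (load_bits, store_bits) per data path; every family listed has two paths.
DATA_PATH_BITS = {
    "C62x": (32, 32),
    "C67x": (64, 32),
    "C64x": (64, 64),
    "C674x": (64, 64),
    "C66x": (64, 64),
}
PATHS = 2  # T1 and T2

def peak_load_bytes_per_cycle(family):
    """Bytes loaded per cycle if both paths issue a full-width load."""
    load_bits, _store_bits = DATA_PATH_BITS[family]
    return PATHS * load_bits // 8

def peak_load_bytes_per_second(family, clock_hz):
    return peak_load_bytes_per_cycle(family) * clock_hz

print(peak_load_bytes_per_cycle("C62x"))  # 8
print(peak_load_bytes_per_cycle("C66x"))  # 16
print(peak_load_bytes_per_second("C66x", 1_250_000_000) // 10**9)  # 20
```

Under these assumptions a C66x core can pull at most 16 bytes per cycle from L1D, which is the budget the functional units have to live within.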
Pipeline Concept
Non-Pipelined vs. Pipelined CPU
CPU Type        Clock Cycles
                1   2   3   4   5   6   7   8   9
Non-Pipelined   F1  D1  E1  F2  D2  E2  F3  D3  E3
Pipelined       F1  D1  E1
                    F2  D2  E2
                        F3  D3  E3
                        (pipeline full)
Stage         Pipeline Function
F (Fetch)     Generate program fetch address; read opcode
D (Decode)    Route opcode to functional units; decode instructions
E (Execute)   Execute instructions
Now look at the C66x pipeline.
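
The cycle counts in the comparison above follow from simple arithmetic. This is an idealized sketch that assumes every stage takes exactly one cycle and there are no stalls; the function names are hypothetical.

```python
def non_pipelined_cycles(instructions, stages=3):
    """Each instruction finishes all stages before the next one starts."""
    return instructions * stages

def pipelined_cycles(instructions, stages=3):
    """A new instruction enters every cycle once the first has started."""
    if instructions == 0:
        return 0
    return stages + instructions - 1

def pipeline_rows(instructions, stage_names=("F", "D", "E")):
    """One row per instruction; row[c] is the stage label in cycle c + 1, or ''."""
    total = pipelined_cycles(instructions, len(stage_names))
    rows = []
    for i in range(instructions):
        row = [""] * total
        for s, name in enumerate(stage_names):
            row[i + s] = f"{name}{i + 1}"
        rows.append(row)
    return rows

print(non_pipelined_cycles(3))  # 9
print(pipelined_cycles(3))      # 5
for row in pipeline_rows(3):
    print(" ".join(cell.ljust(2) for cell in row))
```

For three instructions the non-pipelined CPU needs 9 cycles and the pipelined one 5; as the instruction count grows, the pipelined cost approaches one cycle per instruction.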
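Program Fetch Phases
Phase   Description
PG      Generate fetch address
PS      Send address to memory
PW      Wait for data ready
PR      Read opcode
[Diagram: the C66x core generates the address (PG) and sends it to memory (PS); memory responds after the wait phase (PW); the opcode is read (PR) and passed to the functional units.]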
Pipeline Phases Review
Program Fetch    Decode  Execute
PG  PS  PW  PR   D       E
    PG  PS  PW   PR  D   E
        PG  PS   PW  PR  D   E
            PG   PS  PW  PR  D   E
                 PG  PS  PW  PR  D   E
Single-cycle performance is not affected by adding three program fetch phases.
That is, there is still an execute every cycle.
How about decode? Is it only one cycle?
Decode Phases
Decode Phase   Description
DP             Intelligently routes instruction to functional unit (dispatch)
DC             Instruction decoded at functional unit (decode)
[Diagram: the same fetch path as before (PG, PS, PW, PR between the C66x core and memory), followed by DP and DC at the functional units.]
Pipeline Phases
Program Fetch    Decode   Execute
PG  PS  PW  PR   DP  DC   E1
(Pipeline full)
How many cycles does it take to execute an instruction?
Instruction Delays
All C66x instructions require only one cycle to execute, but some results are delayed.
Description                                        Instruction Example                        Delay
Single cycle                                       All instructions except those listed below 0
Integer multiplication and new floating point      MPY, FMPYSP                                1
Legacy floating-point multiplication               MPYSP                                      2
Load                                               LDW                                        4
Branch                                             B                                          5
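
One way to read the delay column: a result with a delay of d cannot be consumed until d + 1 cycles after its instruction issues. The sketch below applies that reading to a chain of dependent instructions. It uses a subset of the delays from the table above plus ADD as a representative single-cycle instruction; the names `DELAY_SLOTS`, `first_use_cycle`, and `dependent_chain` are hypothetical.

```python
# Delay values copied from the table above; ADD stands in for "single cycle".
DELAY_SLOTS = {"ADD": 0, "MPY": 1, "LDW": 4, "B": 5}

def first_use_cycle(issue_cycle, mnemonic):
    """First cycle in which another instruction can consume the result."""
    return issue_cycle + DELAY_SLOTS[mnemonic] + 1

def dependent_chain(mnemonics, start_cycle=1):
    """Earliest issue cycle for each instruction when each consumes the previous result."""
    cycles = []
    cycle = start_cycle
    for mnemonic in mnemonics:
        cycles.append(cycle)
        cycle = first_use_cycle(cycle, mnemonic)
    return cycles

print(first_use_cycle(1, "LDW"))               # 6
print(dependent_chain(["LDW", "MPY", "ADD"]))  # [1, 6, 8]
```

A load issued in cycle 1 therefore cannot feed a multiply before cycle 6, and the multiply cannot feed an add before cycle 8. Those idle cycles are what instruction scheduling tries to fill with independent work.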
Software Pipeline Example
Dot product: a typical DSP MAC operation.
      LDH
  ||  LDH
      MPY
      ADD
How many cycles would it take to perform this loop five times? (Disregard delay slots.)
Answer: 5 x 3 = 15 cycles
Non-Pipelined Code
Cycle  .M1  .M2  .L1  .L2  .S1  .S2  .D1  .D2
1                                    ldh  ldh
2      mpy
3                add
4                                    ldh  ldh
5      mpy
6                add
7                                    ldh  ldh
8      mpy
9                add
Pipelining Code
Cycle  .M1  .M2  .L1  .L2  .S1  .S2  .D1  .D2
1                                    ldh  ldh
2      mpy                           ldh  ldh
3      mpy       add                 ldh  ldh
4      mpy       add                 ldh  ldh
5      mpy       add                 ldh  ldh
6      mpy       add
7                add
Pipelining these instructions took about half the cycles (7 instead of 15).
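
The pipelined schedule above can be generated mechanically: stage s of iteration i runs in cycle i + s + 1. The sketch below builds that schedule for five iterations and labels each cycle. Like the slide, it disregards delay slots. The labels prolog, kernel, and epilog are common software-pipelining terms rather than wording from this slide, and the function names are hypothetical.

```python
def software_pipeline(iterations, stages=("ldh ldh", "mpy", "add")):
    """Return {cycle: [ops]}; stage s of iteration i runs in cycle i + s + 1."""
    schedule = {}
    for i in range(iterations):
        for s, op in enumerate(stages):
            schedule.setdefault(i + s + 1, []).append(op)
    return schedule

def phase(cycle, iterations, depth=3):
    """Classify a cycle: pipeline filling, full, or draining."""
    if cycle < depth:
        return "prolog"
    if cycle <= iterations:
        return "kernel"
    return "epilog"

schedule = software_pipeline(5)
for cycle in sorted(schedule):
    print(cycle, phase(cycle, 5), schedule[cycle])
print(len(schedule))  # 7
```

Cycles 3 through 5 are the steady state, where a load pair, a multiply, and an add from three different iterations all execute in the same cycle.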
Software Pipeline Support
o The compiler is smart enough to schedule instructions efficiently.
o DSP algorithms are typically loop intensive.
o Generally speaking, servicing of interrupts is not allowed in the middle of the loop because fixed timing is essential.
o The C66x hardware SPLOOP enables servicing of interrupts in the middle of loops.
NOTE: For more information on SPLOOP, refer to Chapter 8 of the C66x CPU and Instruction Set Reference Guide.
For More Information
o For more information, refer to the C66x CPU and Instruction Set Reference Guide.
o For questions regarding topics covered in this training, visit the support forums at the TI E2E Community website.
