
Introduction

4. Instruction tables
Lists of instruction latencies, throughputs and micro-operation breakdowns for Intel, AMD
and VIA CPUs

By Agner Fog. Technical University of Denmark.


Copyright 1996 - 2014. Last updated 2014-02-19.

Introduction
This is the fourth in a series of five manuals:
1. Optimizing software in C++: An optimization guide for Windows, Linux and Mac
platforms.
2. Optimizing subroutines in assembly language: An optimization guide for x86 platforms.
3. The microarchitecture of Intel, AMD and VIA CPUs: An optimization guide for assembly
programmers and compiler makers.
4. Instruction tables: Lists of instruction latencies, throughputs and micro-operation
breakdowns for Intel, AMD and VIA CPUs.
5. Calling conventions for different C++ compilers and operating systems.
The latest versions of these manuals are always available from www.agner.org/optimize.
Copyright conditions are listed below.
The present manual contains tables of instruction latencies, throughputs and micro-operation
breakdown and other tables for x86 family microprocessors from Intel, AMD and VIA.

The figures in the instruction tables represent the results of my measurements rather than the official values published by microprocessor vendors. Some values in my tables are higher or lower
than the values published elsewhere. The discrepancies can be explained by the following factors:
My figures are experimental values while figures published by microprocessor vendors may be
based on theory or simulations.
My figures are obtained with a particular test method under particular conditions. It is possible that
different values can be obtained under other conditions.
Some latencies are difficult or impossible to measure accurately, especially for memory access
and type conversions that cannot be chained.
Latencies for moving data from one execution unit to another are listed explicitly in some of my
tables while they are included in the general latencies in some tables published by Intel.
Most values are the same in all microprocessor modes (real, virtual, protected, 16-bit, 32-bit, 64-bit).
Values for far calls and interrupts may be different in different modes. Call gates have not been
tested.
Instructions with a LOCK prefix have a long latency that depends on cache organization and possibly RAM speed. If there are multiple processors or cores or direct memory access (DMA) devices
then all locked instructions will lock a cache line for exclusive access, which may involve RAM access. A LOCK prefix typically costs more than a hundred clock cycles, even on single-processor
systems. This also applies to the XCHG instruction with a memory operand.
If any text in the pdf version of this manual is unreadable, then please refer to the spreadsheet version.

Copyright notice

This series of five manuals is copyrighted by Agner Fog. Public distribution and
mirroring is not allowed. Non-public distribution to a limited audience for
educational purposes is allowed. The code examples in these manuals can be
used without restrictions. A GNU Free Documentation License shall automatically
come into force when I die. See www.gnu.org/copyleft/fdl.html


Definition of terms

Operands

Operands can be different types of registers, memory, or immediate constants. Abbreviations used in the tables are: i = immediate constant, r = any general purpose
register, r32 = 32-bit register, etc., mm = 64-bit MMX register, x or xmm = 128-bit XMM
register, y = 256-bit YMM register, v = any vector register, sr = segment register, m =
any memory operand including indirect operands, m64 means 64-bit memory operand, etc.

Latency

The latency of an instruction is the delay that the instruction generates in a dependency chain. The measurement unit is clock cycles. Where the clock frequency is varied dynamically, the figures refer to the core clock frequency. The numbers listed are
minimum values. Cache misses, misalignment, and exceptions may increase the
clock counts considerably. Floating point operands are presumed to be normal numbers. Denormal numbers, NaNs and infinity may increase the latencies by possibly
more than 100 clock cycles on many processors, except in move, shuffle and Boolean
instructions. Floating point overflow, underflow, denormal or NaN results may give a
similar delay. A missing value in the table means that the value has not been measured or that it cannot be measured in a meaningful way.

Some processors have a pipelined execution unit that is smaller than the largest register size so that different parts of the operand are calculated at different times. Assume, for example, that we have a long dependency chain of 128-bit vector instructions running in a fully pipelined 64-bit execution unit with a latency of 4. The lower 64
bits of each operation will be calculated at times 0, 4, 8, 12, 16, etc. And the upper 64
bits of each operation will be calculated at times 1, 5, 9, 13, 17, etc. as shown in the
figure below. If we look at one 128-bit instruction in isolation, the latency will be 5. But
if we look at a long chain of 128-bit instructions, the total latency will be 4 clock cycles
per instruction plus one extra clock cycle in the end. The latency in this case is listed
as 4 in the tables because this is the value it adds to a dependency chain.
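The arithmetic of the split execution unit can be checked with a small model. The following is only a sketch under the assumptions stated in the example above (a 64-bit unit, latency 4, the upper half starting one clock after the lower half); the function name and parameters are illustrative and not part of the test programs.

```python
def chain_completion_time(n_instructions, latency=4, halves=2):
    """Clock cycle at which the last half of the last instruction is finished.

    Assumed model: half k of instruction i starts at cycle i * latency + k
    and needs `latency` cycles to produce its result.
    """
    last_start = (n_instructions - 1) * latency + (halves - 1)
    return last_start + latency


isolated = chain_completion_time(1)        # one instruction seen in isolation
long_chain = chain_completion_time(100)    # a chain of 100 dependent instructions
per_instruction = (long_chain - 1) / 100   # cost added per instruction in the chain

print(isolated, long_chain, per_instruction)
```

In this model a single instruction takes 5 clocks, while each additional instruction in the chain adds only 4, which is why 4 is the value listed.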

Reciprocal throughput

The throughput is the maximum number of instructions of the same kind that can be
executed per clock cycle when the operands of each instruction are independent of
the preceding instructions. The values listed are the reciprocals of the throughputs,
i.e. the average number of clock cycles per instruction when the instructions are not
part of a limiting dependency chain. For example, a reciprocal throughput of 2 for
FMUL means that a new FMUL instruction can start executing 2 clock cycles after a
previous FMUL. A reciprocal throughput of 0.33 for ADD means that the execution
units can handle 3 integer additions per clock cycle.
The reason for listing the reciprocal values is that this makes comparisons between latency and throughput easier. The reciprocal throughput is also called issue latency.
The values listed are for a single thread or a single core. A missing value in the table
means that the value has not been measured.
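The relation between throughput, reciprocal throughput and latency can be sketched as follows. This is a rough estimate only: it assumes that nothing other than the listed latency or reciprocal throughput limits execution, and the function names are hypothetical.

```python
from fractions import Fraction


def instructions_per_clock(reciprocal_throughput):
    """Throughput is the reciprocal of the value listed in the tables."""
    return 1 / Fraction(reciprocal_throughput)


def estimated_cycles(count, latency, reciprocal_throughput, dependent):
    """Rough lower bound on clock cycles for `count` identical instructions.

    A dependency chain is limited by latency; independent instructions
    are limited by reciprocal throughput.
    """
    per_instruction = Fraction(latency) if dependent else Fraction(reciprocal_throughput)
    return count * per_instruction


fmul_rate = instructions_per_clock(2)                # one FMUL every 2 clocks
add_rate = instructions_per_clock(Fraction(1, 3))    # 3 additions per clock

# 300 additions with latency 1 and reciprocal throughput 1/3 (illustrative values)
independent = estimated_cycles(300, 1, Fraction(1, 3), dependent=False)
chained = estimated_cycles(300, 1, Fraction(1, 3), dependent=True)
print(fmul_rate, add_rate, independent, chained)
```

Listing both values in the same unit (clock cycles per instruction) makes it easy to see which of the two is the limiting factor for a given piece of code.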

μops

Uop or μop is an abbreviation for micro-operation. Processors with out-of-order cores
are capable of splitting complex instructions into μops. For example, a read-modify instruction may be split into a read-μop and a modify-μop. The number of μops that an
instruction generates is important when certain bottlenecks in the pipeline limit the
number of μops per clock cycle.

Execution unit

The execution core of a microprocessor has several execution units. Each execution
unit can handle a particular category of μops, for example floating point additions. The
information about which execution unit a particular μop goes to can be useful for two
purposes. Firstly, two μops cannot execute simultaneously if they need the same execution unit. And secondly, some processors have a latency of an extra clock cycle
when the result of a μop executing in one execution unit is needed as input for a μop
in another execution unit.

Execution port

The execution units are clustered around a few execution ports on most Intel processors. Each μop passes through an execution port to get to the right execution unit. An
execution port can be a bottleneck because it can handle only one μop at a time. Two
μops cannot execute simultaneously if they need the same execution port, even if
they are going to different execution units.
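The port bottleneck can be illustrated with a simple count. This sketch assumes that every micro-operation is bound to exactly one port and that each port accepts one micro-operation per clock cycle; real processors often allow a choice between several ports, which this model ignores. The port names are hypothetical.

```python
from collections import Counter


def port_bound_cycles(uop_ports):
    """Minimum clock cycles needed when each port handles one micro-operation per cycle."""
    counts = Counter(uop_ports)
    return max(counts.values(), default=0)


# Four micro-operations, two of which need the same (hypothetical) port p0
cycles = port_bound_cycles(["p0", "p0", "p1", "p5"])
print(cycles)
```

The busiest port determines the bound, regardless of which execution units behind that port the micro-operations are going to.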

Instruction set

This indicates which instruction set an instruction belongs to. The instruction is only
available in processors that support this instruction set. The different instruction sets
are listed at the end of this manual. Availability in processors prior to 80386 does not
apply for 32-bit and 64-bit operands. Availability in the MMX instruction set does not
apply to 128-bit packed integer instructions, which require SSE2. Availability in the
SSE instruction set does not apply to double precision floating point instructions,
which require SSE2.
32-bit instructions are available in 80386 and later. 64-bit instructions in general purpose registers are available only under 64-bit operating systems. Instructions that use
XMM registers (SSE and later) are only available under operating systems that support this register set. Instructions that use YMM registers (AVX and later) are only
available under operating systems that support this register set.

How the values were measured


The values in the tables are measured with the use of my own test programs, which are available
from www.agner.org/optimize/testp.zip
The time unit for all measurements is CPU clock cycles. Where the clock frequency varies with the workload, the tests attempt to run at the highest clock
frequency. Many Intel processors have a performance counter named "core clock cycles". This counter gives measurements that are independent
of the varying clock frequency. Where no "core clock cycles" counter is available, the "time stamp
counter" is used (RDTSC instruction). In cases where this gives inconsistent results (e.g. in AMD
Bobcat) it is necessary to make the processor boost the clock frequency by executing a large number of instructions (> 1 million) or to turn off the power-saving feature in the BIOS setup.
Instruction throughputs are measured with a long sequence of instructions of the same kind, where
subsequent instructions use different registers in order to avoid dependence of each instruction on
the previous one. The input registers are cleared in the cases where it is impossible to use different
registers. The test code is carefully constructed in each case to make sure that the only bottleneck
limiting the throughput is the one that is being measured.
Instruction latencies are measured in a long dependency chain of identical instructions where the
output of each instruction is needed as input for the next instruction.
The sequence of instructions should be long, but not so long that it doesn't fit into the level-1 code
cache. A typical length is 100 instructions of the same type. This sequence is repeated in a loop if a
larger number of instructions is desired.
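The difference between the two kinds of test sequences can be sketched by generating the assembly text for each. This is an illustration only and not the actual test programs: the function names, the Intel-style syntax and the choice of registers are assumptions, and the number of rotated registers is assumed to be large enough to hide the latency of the instruction under test.

```python
def latency_sequence(mnemonic, register="eax", length=100):
    """Dependency chain: every instruction reads and writes the same register."""
    return [f"{mnemonic} {register}, {register}" for _ in range(length)]


def throughput_sequence(mnemonic,
                        registers=("eax", "ebx", "ecx", "edx", "esi", "edi"),
                        length=100):
    """Independent instructions: consecutive instructions use different registers."""
    return [
        f"{mnemonic} {registers[i % len(registers)]}, {registers[i % len(registers)]}"
        for i in range(length)
    ]


chain = latency_sequence("add")
independent = throughput_sequence("add")
print(chain[:2])
print(independent[:3])
```

Timing the first sequence and dividing by its length would approximate the latency; doing the same with the second would approximate the reciprocal throughput.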

It is not possible to measure the latency of a memory read or write instruction with software methods.
It is only possible to measure the combined latency of a memory write followed by a memory read
from the same address. What is measured here is not actually the cache access time, because in
most cases the microprocessor is smart enough to make a "store forwarding" directly from the write
unit to the read unit rather than waiting for the data to go to the cache and back again. The latency of
this store forwarding process is arbitrarily divided into a write latency and a read latency in the tables.
But in fact, the only value that matters for performance optimization is the sum of the write time
and the read time.
A similar problem occurs where the input and the output of an instruction use different types of registers. For example, the MOVD instruction can transfer data between general purpose registers and
XMM vector registers. The value that can be measured is the combined latency of data transfer from
one type of registers to another type and back again (A → B → A). The division of this latency between the A → B latency and the B → A latency is sometimes obvious, sometimes based on guesswork, μop counts, indirect evidence, or triangular sequences such as A → B → Memory → A. In
many cases, however, the division of the total latency between A → B latency and B → A latency is
arbitrary. However, what cannot be measured cannot matter for performance optimization. What
counts is the sum of the A → B latency and the B → A latency, not the individual terms.
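That the division of a round-trip latency does not affect a dependency chain can be shown with a few lines. The measured round-trip value of 6 clock cycles below is hypothetical and chosen only for illustration.

```python
def chain_cycles(round_trips, a_to_b, b_to_a):
    """Total latency of a chain that goes from A to B and back `round_trips` times."""
    return round_trips * (a_to_b + b_to_a)


measured_round_trip = 6  # hypothetical combined latency in clock cycles

# Every possible integer division of the round trip between the two directions
splits = [(a_to_b, measured_round_trip - a_to_b)
          for a_to_b in range(measured_round_trip + 1)]
totals = {chain_cycles(10, a_to_b, b_to_a) for a_to_b, b_to_a in splits}
print(totals)
```

All divisions give the same chain total, so an arbitrary division in the tables does not change any prediction that can be verified by measurement.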
The μop counts are usually measured with the use of the performance monitor counters (PMCs) that
are built into modern microprocessors. The PMCs for VIA processors are undocumented, and the interpretation of these PMCs is based on experimentation.
The execution ports and execution units that are used by each instruction or μop are detected in different ways depending on the particular microprocessor. Some microprocessors have PMCs that
can give this information directly. In other cases it is necessary to obtain this information indirectly by
testing whether a particular instruction or μop can execute simultaneously with another
instruction/μop that is known to go to a particular execution port or execution unit. On some processors, there is a delay for transmitting data from one execution unit (or cluster of execution units) to
another. This delay can be used for detecting whether two different instructions/μops are using the
same or different execution units.


Microprocessors tested

Microprocessor versions tested


The tables in this manual are based on testing of the following microprocessors

Processor name
AMD K7 Athlon
AMD K8 Opteron
AMD K10 Opteron
AMD Bulldozer
AMD Piledriver
AMD Steamroller
AMD Bobcat
AMD Kabini
Intel Pentium
Intel Pentium MMX
Intel Pentium II
Intel Pentium III
Intel Pentium 4
Intel Pentium 4 EM64T
Intel Pentium M
Intel Core Duo
Intel Core 2 (65 nm)
Intel Core 2 (45 nm)
Intel Core i7
Intel 2nd gen. Core
Intel 3rd gen. Core
Intel 4th gen. Core
Intel Atom 330
VIA Nano L2200
VIA Nano L3050

Microarchitecture code name
Family number (hex)
Model number (hex)
Comment

Bulldozer, Zambezi
Piledriver
Steamroller, Kaveri
Bobcat
Jaguar
P5
P5
P6
P6
Netburst
Netburst, Prescott
Dothan
Yonah
Merom
Wolfdale
Nehalem
Sandy Bridge
Ivy Bridge
Haswell
Diamondville
Isaiah

6
F
10
15
15
15
14
16
5
5
6
6
F
F
6
6
6
6
6
6
6
6
6
6
6


6
5
2
1
2
30
1
0
2
4
6
7
2
4
D
E
F
17
1A
2A
3A
3C
1C
F
F

Step. 2, rev. A5
Stepping A
2350, step. 1
FX-6100, step 2
FX-8350, step 0. And others
A10-7850K, step 1
E350, step. 0
A4-5000, step 1
Stepping 4

Stepping 4, rev. B0
Xeon. Stepping 1
Stepping 6, rev. B1
Not fully tested
T5500, Step. 6, rev. B2
E8400, Step. 6
i7-920, Step. 5, rev. D0
i5-2500, Step 7
i7-3770K, Step 9
i7-4770K, step. 3
Step. 2
Step. 2
Step. 8 (prerel. sample)

AMD K7

List of instruction timings and macro-operation breakdown
Explanation of column headings:

Instruction: Instruction name. cc means any condition code. For example, Jcc can be JB,
JNE, etc.

Operands: i = immediate constant, r = any register, r32 = 32-bit register, etc., mm = 64-bit
MMX register, xmm = 128-bit XMM register, sr = segment register, m = any
memory operand including indirect operands, m64 means 64-bit memory operand, etc.

Ops: Number of macro-operations issued from instruction decoder to schedulers. Instructions with more than 2 macro-operations use microcode.

Latency: This is the delay that the instruction generates in a dependency chain. The
numbers are minimum values. Cache misses, misalignment, and exceptions
may increase the clock counts considerably. Floating point operands are presumed to be normal numbers. Denormal numbers, NaNs, infinity and exceptions increase the delays. The latency listed does not include the memory operand where the operand is listed as register or memory (r/m).

Reciprocal throughput: This is also called issue latency. This value indicates the average number of
clock cycles from when the execution of an instruction begins until a subsequent independent instruction of the same kind can begin to execute. A value of 1/3 indicates that the execution units can handle 3 instructions per clock cycle in one
thread. However, the throughput may be limited by other bottlenecks in the
pipeline.

Execution unit: Indicates which execution unit is used for the macro-operations. ALU means
any of the three integer ALUs. ALU0_1 means that ALU0 and ALU1 are both
used. AGU means any of the three integer address generation units. FADD
means floating point adder unit. FMUL means floating point multiplier unit.
FMISC means floating point store and miscellaneous unit. FA/M means FADD
or FMUL is used. FANY means any of the three floating point units can be
used. Two macro-operations can execute simultaneously if they go to different
execution units.
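The timing cells in the tables mix several notations: fractions such as 1/3, ranges such as 9-13, and decimal values such as 2,5. The following sketch converts such a cell to a pair of exact numbers. It assumes that the comma is a decimal separator and that a hyphen separates the low and high ends of a range; the function name is hypothetical and empty cells are not handled.

```python
from fractions import Fraction


def parse_cell(text):
    """Return (low, high) as Fractions for a latency or reciprocal throughput cell."""
    # Assumption: "2,5" means 2.5, and "9-13" or "1/3 - 2" is a range
    cleaned = text.replace(",", ".").replace(" ", "")
    parts = cleaned.split("-")
    return Fraction(parts[0]), Fraction(parts[-1])


print(parse_cell("1/3"))
print(parse_cell("2,5"))
print(parse_cell("9-13"))
print(parse_cell("1/3 - 2"))
```

Exact fractions are used rather than floating point so that a value such as 1/3 converts back to exactly 3 instructions per clock cycle.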

Integer instructions
Instruction
Move instructions
MOV
MOV

Operands

r,r
r,i

Ops

1
1

Latency
Reciprocal throughput
1
1

1/3
1/3

MOV
MOV
MOV
MOV

r8,m8
r16,m16
r32,m32
m8,r8H

1
1
1
1

4
4
3
8

1/2
1/2
1/2
1/2

MOV

m8,r8L

1/2

MOV
MOV
MOV

m16/32,r
m,i
r,sr

1
1
1

2
2
2

1/2
1/2
1


Execution
unit

Notes

ALU
ALU
Any addr. mode. Add 1 clk if code segment base ≠ 0
ALU, AGU
do.
ALU, AGU
do.
AGU
AH, BH, CH, DH
AGU
Any other 8-bit register
AGU
Any addressing mode
AGU
AGU

MOV
MOVZX, MOVSX
MOVZX, MOVSX
CMOVcc
CMOVcc
XCHG
XCHG
XLAT
PUSH
PUSH
PUSH
PUSH
PUSHF(D)
PUSHA(D)
POP
POP
POP
POP
POPF(D)
POPA(D)
LEA
LEA
LAHF
SAHF
SALC
LDS, LES, ...
BSWAP
Arithmetic instructions
ADD, SUB
ADD, SUB
ADD, SUB
ADC, SBB
ADC, SBB
ADC, SBB
CMP
CMP
INC, DEC, NEG
INC, DEC, NEG
AAA, AAS
DAA
DAS
AAD
AAM
MUL, IMUL
MUL, IMUL
MUL, IMUL
IMUL

sr,r/m
r,r
r,m
r,r
r,m
r,r

6
1
1
1
1
3

9-13
1
4
1

r,m

3
2
1
1
2
2
1
9
2
3
6
9
2
9
2
1
4
2
1
10
1

16
5

1
1
7
1
1
7
1

r8/m8

1
1
1
1
1
1
1
1
1
1
9
12
16
4
31
3

1
7
5
6
7
5
13
3

r16/m16
r32/m32
r16,r16/m16

3
3
2

3
4
3

r
i
m
sr

r
m
DS/ES/FS/GS
SS

r16,[m]
r32,[m]

r,m
r

r,r/i
r,m
m,r
r,r/i
r,m
m,r/i
r,r/i
r,m
r
m

3
2
3
2
1
1


8
1/3
1/2
1/3
1/2
1
16
1
1
1
1
1
4
1
1
10
18
1
4
1
1/3
2
2
1
9
1/3

1/3
1/2
2,5
1/3
1/2
2,5
1/3
1/2
1/3
3
5
6
7

ALU
ALU, AGU
ALU
ALU, AGU
ALU
ALU, AGU
Timing depends on hw
ALU, AGU
ALU, AGU
ALU, AGU
ALU, AGU
ALU, AGU
ALU, AGU
ALU, AGU
ALU, AGU
ALU, AGU
ALU, AGU
ALU, AGU
ALU, AGU
ALU, AGU
AGU
Any addr. size
AGU
Any addr. size
ALU
ALU
ALU
ALU

ALU
ALU, AGU
ALU, AGU
ALU
ALU, AGU
ALU, AGU
ALU
ALU, AGU
ALU
ALU, AGU
ALU
ALU
ALU
ALU0
ALU
ALU0

2
3
2

ALU0_1
ALU0_1
ALU0

latency ax=3, dx=4

IMUL
IMUL
IMUL
IMUL
IMUL
DIV
DIV
DIV
IDIV
IDIV
IDIV
IDIV
IDIV
IDIV
CBW, CWDE
CWD, CDQ
Logic instructions
AND, OR, XOR
AND, OR, XOR
AND, OR, XOR
TEST
TEST
NOT
NOT
SHL, SHR, SAR
ROL, ROR
RCL, RCR
RCL
RCR
RCL
RCR
SHL,SHR,SAR,ROL,ROR
RCL, RCR
RCL
RCR
RCL
RCR
SHLD, SHRD
SHLD, SHRD
SHLD, SHRD
BT
BT
BT
BTC, BTR, BTS
BTC
BTR, BTS
BTC, BTR, BTS
BSF
BSR

r32,r32/m32
r16,(r16),i
r32,(r32),i
r16,m16,i
r32,m32,i
r8/m8
r16/m16
r32/m32
r8
r16
r32
m8
m16
m32

2
2
2
3
3
32
47
79
41
56
88
42
57
89
1
1

r,r
r,m
m,r
r,r
r,m
r
m
r,i/CL
r,i/CL
r,1
r,i
r,i
r,CL
r,CL
m,i /CL
m,1
m,i
m,i
m,CL
m,CL
r,r,i
r,r,cl
m,r,i/CL
r,r/i
m,i
m,r
r,r/i
m,i
m,i
m,r
r,r
r,r

1
1
1
1
1
1
1
1
1
1
9
7
9
7
1
1
10
9
9
8
6
7
8
1
1
5
2
5
4
8
19
23

4
4
5

24
24
40
17
25
41
17
25
41
1
1

1
1
7
1
1
1
7
1
1
1
4
3
3
3
7
7
5
8
6
7
4
4
7
1

2
7
7
6
7
9

2,5
1
2
2
2
23
23
40
17
25
41
17
25
41
1/3
1/3

ALU0
ALU0
ALU0
ALU0
ALU0
ALU
ALU
ALU
ALU
ALU
ALU
ALU
ALU
ALU
ALU
ALU

1/3
1/2
2,5
1/3
1/2
1/3
2,5
1/3
1/3
1/3
4
3
3
3
3
4
4
4
4
3
2
3
3
1/3
1/2
2
1
2
2
3
7
9

ALU
ALU, AGU
ALU, AGU
ALU
ALU, AGU
ALU
ALU, AGU
ALU
ALU
ALU
ALU
ALU
ALU
ALU
ALU, AGU
ALU, AGU
ALU, AGU
ALU, AGU
ALU, AGU
ALU, AGU
ALU
ALU
ALU, AGU
ALU
ALU, AGU
ALU, AGU
ALU
ALU, AGU
ALU, AGU
ALU, AGU
ALU
ALU

BSF
BSR
SETcc
SETcc
CLC, STC
CMC
CLD
STD

r,m
r,m
r
m

Control transfer instructions


JMP
short/near
JMP
JMP
JMP

20
23
1
1
1
1
2
3

8
10
1

8
10
1/3
1/2
1/3
1/3
1
2

ALU, AGU
ALU, AGU
ALU
ALU, AGU
ALU
ALU
ALU
ALU

ALU
low values = real
mode

far
r
m(near)

16-20
1
1

23-32

m(far)
short/near
short
short
near

17-21
1
2
7
3

25-33

CALL
CALL
CALL

far
r
m(near)

16-22
4
5

23-32
3
3

CALL
RETN
RETN

m(far)

16-22
2
2

24-33
3
3

15-23

24-35

15-24
32
33

24-35
81
42

INTO

String instructions
LODS
REP LODS
STOS
REP STOS
MOVS
REP MOVS
SCAS
REP SCAS
CMPS
REP CMPS

4
5
4
3
7
4
5
5
7
6

JMP
Jcc
J(E)CXZ
LOOP
CALL

RETF
RETF
IRET
INT
BOUND

2
2

3-4
2

2
2
2
1
3
1-4
2
2
6
3-4

1/3 - 2
1/3 - 2
3-4
2

ALU
ALU, AGU

ALU
ALU
ALU
ALU

low values = real


mode
rcp. t.= 2 if jump
rcp. t.= 2 if jump

low values = real


mode
3
3

ALU
ALU, AGU
low values = real
mode

3
3

2
2
2
1
3
1-4
2
2
6
3-4

ALU
ALU
low values = real
mode
low values = real
mode
real mode
real mode
values are for no
jump
values are for no
jump

values per count


values per count
values per count
values per count
values per count

Other
NOP (90)
Long NOP (0F 1F)
ENTER

1
1
i,0

LEAVE
CLI
STI
CPUID
RDTSC
RDPMC

3
8-9
16-17
19-28
5
9

0
0
12

1/3
1/3
12

ALU
ALU
12
3 ops, 5 clk if 16
bit

3
5
27
44-74
11
11

Floating point x87 instructions


Instruction
Move instructions
FLD
FLD
FLD
FBLD
FST(P)
FST(P)
FSTP
FBSTP
FXCH
FILD
FIST(P)
FLDZ, FLD1

Operands

r
m32/64
m80
m80
r
m32/64
m80
m80
r
m
m

FCMOVcc
FFREE
FINCSTP, FDECSTP

st0,r
r

FNSTSW
FSTSW
FNSTSW
FNSTCW

Ops

1
1
7
30
1
1
10
260
1
1
1
1

Latency
Reciprocal throughput
2
4
16
41
2
3
7
0
9
7

1/2
1/2
4
39
1/2
1
5
188
0,4
1
1
1

Execution
unit

Notes

FA/M
FANY

FA/M
FMISC

FMISC
FMISC, FA/M

FMISC

42

FMISC, FA/M
Low latency immediately after FCOMI
FANY
FANY
FMISC, ALU
Low latency immediately after FCOM FTST
FMISC, ALU
do.
FMISC, ALU
do.
FMISC, ALU
FMISC, ALU
faster if unchanged

4
4
4
4

1
1-2
1
2

FADD
FADD,FMISC
FMUL
FMUL,FMISC

11-25

8-22

FMUL

9
1
1

AX
AX
m16
m16

2
3
2
3

6-12
6-12

FLDCW

m16

14

Arithmetic instructions
FADD(P),FSUB(R)(P)
FIADD,FISUB(R)
FMUL(P)
FIMUL

r/m
m
r/m
m

1
2
1
2

FDIV(R)(P)

r/m


5
1/3
1/3

12
12
8
1

Low values are for round divisors

FIDIV(R)
FABS, FCHS
FCOM(P), FUCOM(P)
FCOMPP, FUCOMPP
FCOMI(P)
FICOM(P)
FTST
FXAM
FRNDINT
FPREM
FPREM1

2
1
1
1
1
2
1
2
5
1
1

12-26
2
2
2
3

Math
FSQRT
FSIN
FCOS
FSINCOS
FPTAN
FPATAN
FSCALE
FXTRACT
F2XM1
FYL2X
FYL2XP1

1
44
51
76
46
72
5
7
8
49
63

Other
FNOP
(F)WAIT
FNCLEX
FNINIT
FNSAVE
FRSTOR
FXSAVE
FXRSTOR

1
1
7
25
76
65
44
85

r/m
r
m

9-23
1
1
1
1
1
1
2
3
8
8

FMUL,FMISC
FMUL
FADD
FADD
FADD

35
90-100
90-100
100-150
100-200
160-170
8
11
27
126
147

12

FMUL

0
0

1/3
1/3
24
92
147
120
59
87

FANY
ALU
FMISC
FMISC

2
10
7-10
8-11

do.

FADD, FMISC

FADD
FMISC, ALU
FMUL
FMUL

Integer MMX instructions


Instruction
Move instructions
MOVD
MOVD
MOVD
MOVD
MOVQ
MOVQ
MOVQ
MOVNTQ
PACKSSWB/DW
PACKUSWB

Operands

Ops

Latency
Reciprocal throughput

r32, mm
mm, r32
mm,m32
m32, r
mm,mm
mm,m64
m64,mm
m,mm

2
2
1
1
1
1
1
1

7
9

mm,r/m


Execution
unit

2
2
1/2
1
1/2
1/2
1
2

FMISC, ALU
FANY, ALU
FANY
FMISC
FA/M
FANY
FMISC
FMISC

FA/M

Notes

PUNPCKH/LBW/WD
PSHUFW
MASKMOVQ
PMOVMSKB
PEXTRW
PINSRW

mm,r/m
mm,mm,i
mm,mm
r32,mm
r32,mm,i
mm,r32,i

1
1
32
3
2
2

mm,r/m
mm,r/m

2
2

5
12

2
1/2
24
3
2
2

FA/M
FA/M
FADD
FMISC, ALU
FA/M

1
1

2
2

1/2
1/2

FA/M
FA/M

mm,r/m
mm,r/m
mm,r/m
mm,r/m
mm,r/m

1
1
1
1
1

3
3
2
2
3

1
1
1/2
1/2
1

FMUL
FMUL
FA/M
FA/M
FADD

mm,r/m

1/2

FA/M

mm,i/mm/m

1/2

FA/M

1/3

FANY

Arithmetic instructions
PADDB/W/D PADDSB/W
PADDUSB/W
PSUBB/W/D PSUBSB/W
PSUBUSB/W

PCMPEQ/GT B/W/D
PMULLW PMULHW
PMULHUW
PMADDWD
PAVGB/W
PMIN/MAX SW/UB
PSADBW
Logic
PAND PANDN POR
PXOR
PSLL/RLW/D/Q
PSRAW/D
Other
EMMS

Floating point XMM instructions


Instruction
Move instructions
MOVAPS
MOVAPS
MOVAPS
MOVUPS
MOVUPS
MOVUPS
MOVSS
MOVSS
MOVSS
MOVHLPS, MOVLHPS
MOVHPS, MOVLPS
MOVHPS, MOVLPS
MOVNTPS
MOVMSKPS
SHUFPS

Operands

Ops

r,r
r,m
m,r
r,r
r,m
m,r
r,r
r,m
m,r
r,r
r,m
m,r
m,r
r32,r
r,r/m,i

2
2
2
2
5
5
1
2
1
1
1
1
2
3
3

Latency
Reciprocal throughput
2

2
4
3
2

3

1
2
2
1
2
2
1
1
1
1/2
1/2
1
4
2
3

Execution
unit
FA/M
FMISC
FMISC
FA/M

FA/M
FANY FMISC
FMISC
FA/M
FMISC
FMISC
FMISC
FADD
FMUL

Notes

UNPCK H/L PS
Conversion
CVTPI2PS
CVT(T)PS2PI
CVTSI2SS
CVT(T)SS2SI
Arithmetic
ADDSS SUBSS
ADDPS SUBPS
MULSS
MULPS

r,r/m

xmm,mm
mm,xmm
xmm,r32
r32,xmm

1
1
4
2

4
6

r,r/m
r,r/m
r,r/m
r,r/m

1
2
1
2

4
4
4
4

FMUL

10
3

FMISC
FMISC
FMISC
FMISC

1
2
1
2

FADD
FADD
FMUL
FMUL

DIVSS
DIVPS
RCPSS
RCPPS
MAXSS MINSS
MAXPS MINPS
CMPccSS
CMPccPS
COMISS UCOMISS

r,r/m
r,r/m
r,r/m
r,r/m
r,r/m
r,r/m
r,r/m
r,r/m
r,r/m

1
2
1
2
1
2
1
2
1

11-16
18-30
3
3
2
2
2
2
2

8-13
18-30
1
2
1
2
1
2
1

FMUL
FMUL
FMUL
FMUL
FADD
FADD
FADD
FADD
FADD

Logic
ANDPS/D ANDNPS/D
ORPS/D XORPS/D

r,r/m

FMUL

Math
SQRTSS
SQRTPS
RSQRTSS
RSQRTPS

r,r/m
r,r/m
r,r/m
r,r/m

1
2
1
2

19
36
3
3

16
36
1
2

FMUL
FMUL
FMUL
FMUL

Other
LDMXCSR
STMXCSR

m
m

8
3

Low values are for round divisors, e.g. powers of 2.
do.

9
10

3DNow instructions (obsolete)


Instruction

Operands

Move and convert instructions


PREFETCH(W)
m
PF2ID
mm,mm
PI2FD
mm,mm
PF2IW
mm,mm
PI2FW
mm,mm
PSWAPD
mm,mm

Ops

1
1
1
1
1
1

Latency
Reciprocal throughput

5
5
5
5
2

1/2
1
1
1
1
1/2

Execution
unit
AGU
FMISC
FMISC
FMISC
FMISC
FA/M

Notes

3DNow E
3DNow E
3DNow E

Integer instructions
PAVGUSB
PMULHRW

mm,mm
mm,mm

1
1

2
3

1/2
1

FA/M
FMUL

Floating point instructions


PFADD/SUB/SUBR
PFCMPEQ/GE/GT
PFMAX/MIN
PFMUL
PFACC
PFNACC, PFPNACC
PFRCP
PFRCPIT1/2
PFRSQRT
PFRSQIT1

mm,mm
mm,mm
mm,mm
mm,mm
mm,mm
mm,mm
mm,mm
mm,mm
mm,mm
mm,mm

1
1
1
1
1
1
1
1
1
1

4
2
2
4
4
4
3
4
3
4

1
1
1
1
1
1
1
1
1
1

FADD
FADD
FADD
FMUL
FADD
FADD
FMUL
FMUL
FMUL
FMUL

Other
FEMMS

mm,mm

1/3

FANY


3DNow E

K8

AMD K8
List of instruction timings and macro-operation breakdown
Explanation of column headings:

Instruction: Instruction name. cc means any condition code. For example, Jcc can be JB, JNE,
etc.

Operands: i = immediate constant, r = any register, r32 = 32-bit register, etc., mm = 64-bit
MMX register, xmm = 128-bit XMM register, sr = segment register, m = any memory operand including indirect operands, m64 means 64-bit memory operand, etc.

Ops: Number of macro-operations issued from instruction decoder to schedulers. Instructions with more than 2 macro-operations use microcode.

Latency: This is the delay that the instruction generates in a dependency chain. The numbers are minimum values. Cache misses, misalignment, and exceptions may increase the clock counts considerably. Floating point operands are presumed to be
normal numbers. Denormal numbers, NaNs, infinity and exceptions increase the
delays. The latency listed does not include the memory operand where the operand is listed as register or memory (r/m).

Reciprocal throughput: This is also called issue latency. This value indicates the average number of clock
cycles from when the execution of an instruction begins until a subsequent independent
instruction of the same kind can begin to execute. A value of 1/3 indicates that the
execution units can handle 3 instructions per clock cycle in one thread. However,
the throughput may be limited by other bottlenecks in the pipeline.

Execution unit: Indicates which execution unit is used for the macro-operations. ALU means any
of the three integer ALUs. ALU0_1 means that ALU0 and ALU1 are both used.
AGU means any of the three integer address generation units. FADD means floating point adder unit. FMUL means floating point multiplier unit. FMISC means
floating point store and miscellaneous unit. FA/M means FADD or FMUL is used.
FANY means any of the three floating point units can be used. Two macro-operations can execute simultaneously if they go to different execution units.

Integer instructions
Instruction
Move instructions
MOV
MOV
MOV
MOV
MOV
MOV
MOV
MOV
MOV
MOV
MOV
MOV
MOV
MOVNTI

Operands

Ops

Latency
Reciprocal throughput
Execution unit

r,r
r,i
r8,m8
r16,m16
r32,m32
r64,m64
m8,r8H

1
1
1
1
1
1
1

1
1
4
4
3
3
8

1/3
1/3
1/2
1/2
1/2
1/2
1/2

ALU
ALU
ALU, AGU
ALU, AGU
AGU
AGU
AGU

m8,r8L
m16/32/64,r
m,i
m64,i32
r,sr
sr,r/m
m,r

1
1
1
1
1
6
1

3
3
3
3
2
9-13

1/2
1/2
1/2
1/2
1/2-1
8
2-3

AGU
AGU
AGU
AGU


AGU

Notes

Any addressing mode. Add 1 clock if code segment base ≠ 0
AH, BH, CH, DH
Any other 8-bit register
Any addressing mode

MOVZX, MOVSX
MOVZX, MOVSX
MOVSXD
MOVSXD
CMOVcc
CMOVcc
XCHG

r,r
r,m
r64,r32
r64,m32
r,r
r,m
r,r

1
1
1
1
1
1
3

1
4
1

XCHG
XLAT
PUSH
PUSH
PUSH
PUSH
PUSHF(D/Q)
PUSHA(D)
POP
POP
POP
POP
POPF(D/Q)
POPA(D)
LEA
LEA
LEA
LAHF
SAHF
SALC
LDS, LES, ...
BSWAP
PREFETCHNTA
PREFETCHT0/1/2
SFENCE
LFENCE
MFENCE
IN
OUT

r,m

3
2
1
1
2
2
5
9
2
3
4-6
7-9
25
9
2
1
1
4
1
1
10
1
1
1
6
1
7
270
300

16
5
1
1
1
1
2
4
1
1
8
28
10
4
3
2
2
3
1
1

1
1
1
1
1
1
1
1
1
1
9
12
16
4

1
1
7
1
1
7
1

r
i
m
sr

r
m
DS/ES/FS/GS

Arithmetic instructions
ADD, SUB
ADD, SUB
ADD, SUB
ADC, SBB
ADC, SBB
ADC, SBB
CMP
CMP
INC, DEC, NEG
INC, DEC, NEG
AAA, AAS
DAA
DAS
AAD

SS

r16,[m]
r32,[m]
r64,[m]

r,m
r
m
m

r,i/DX
i/DX,r

r,r/i
r,m
m,r
r,r/i
r,m
m,r/i
r,r/i
r,m
r
m

1
2

1
7
5
6
7
5

1/3
1/2
1/3
1/2
1/3
1/2
1

ALU
ALU, AGU
ALU
ALU, AGU
ALU
ALU, AGU
ALU

16

ALU, AGU
ALU, AGU
ALU, AGU
ALU, AGU
ALU, AGU
ALU, AGU
ALU, AGU
ALU, AGU
ALU, AGU
ALU, AGU
ALU, AGU
ALU, AGU
ALU, AGU
ALU, AGU
AGU
AGU
AGU
ALU
ALU
ALU

1
1
1
1
2
4
1
1
8
28
10
4
1
1/3
1/3
2
1/3
1/3
9
1/3
1/2
1/2
8
5
16

1/3
1/2
2,5
1/3
1/2
2,5
1/3
1/2
1/3
3
5
6
7

ALU
AGU
AGU

ALU
ALU, AGU
ALU, AGU
ALU
ALU, AGU
ALU, AGU
ALU
ALU, AGU
ALU
ALU, AGU
ALU
ALU
ALU
ALU0

Timing depends on
hw

Any address size


Any address size
Any address size

AAM
MUL, IMUL
MUL, IMUL
MUL, IMUL
MUL, IMUL
IMUL
IMUL
IMUL
IMUL
IMUL
IMUL
IMUL
IMUL
IMUL
DIV
DIV
DIV
DIV
IDIV
IDIV
IDIV
IDIV
IDIV
IDIV
IDIV
IDIV
CBW, CWDE, CDQE
CWD, CDQ, CQO
Logic instructions
AND, OR, XOR
AND, OR, XOR
AND, OR, XOR
TEST
TEST
NOT
NOT
SHL, SHR, SAR
ROL, ROR
RCL, RCR
RCL
RCR
RCL
RCR
SHL,SHR,SAR,ROL,ROR
RCL, RCR
RCL
RCR
RCL
RCR
SHLD, SHRD
SHLD, SHRD

31
1
3
2
2
1
1
1
2
1
1
3
3
3
31
46
78
143
40
55
87
152
41
56
88
153
1
1

13
3
3-4
3
4-5
3
3
4
4
3
4

r,r
r,m
m,r
r,r
r,m
r
m
r,i/CL
r,i/CL
r,1
r,i
r,i
r,CL
r,CL
m,i /CL
m,1
m,i
m,i
m,CL
m,CL
r,r,i
r,r,cl

r8/m8
r16/m16
r32/m32
r64/m64
r16,r16/m16
r32,r32/m32
r64,r64/m64
r16,(r16),i
r32,(r32),i
r64,(r64),i
r16,m16,i
r32,m32,i
r64,m64,i
r8/m8
r16/m16
r32/m32
r64/m64
r8
r16
r32
r64
m8
m16
m32
m64

15
23
39
71
17
25
41
73
17
25
41
73
1
1

1
2
1
2
1
1
2
1
1
2
2
2
2
15
23
39
71
17
25
41
73
17
25
41
73
1/3
1/3

ALU
ALU0
ALU0_1
ALU0_1
ALU0_1
ALU0
ALU0
ALU0_1
ALU0
ALU0
ALU0
ALU0
ALU0
ALU0_1
ALU
ALU
ALU
ALU
ALU
ALU
ALU
ALU
ALU
ALU
ALU
ALU
ALU
ALU

1
1
1
1
1
1
1
1
1
1
9
7
9
7

1
1
7
1
1
1
7
1
1
1
3
3
4
3

1/3
1/2
2,5
1/3
1/2
1/3
2,5
1/3
1/3
1/3
3
3
4
3

ALU
ALU, AGU
ALU, AGU
ALU
ALU, AGU
ALU
ALU, AGU
ALU
ALU
ALU
ALU
ALU
ALU
ALU

1
1
10
9
9
8
6
7

7
7
9
8
7
8
3
3

3
4
4
4
4
3
3
3

ALU, AGU
ALU, AGU
ALU, AGU
ALU, AGU
ALU, AGU
ALU, AGU
ALU
ALU


latency ax=3, dx=4


latency rax=4, rdx=5

SHLD, SHRD
BT
BT
BT
BTC, BTR, BTS
BTC
BTR, BTS
BTC
BTR, BTS
BSF
BSF
BSR
BSF
BSF
BSF
BSR
SETcc
SETcc
CLC, STC
CMC
CLD
STD

m,r,i/CL
r,r/i
m,i
m,r
r,r/i
m,i
m,i
m,r
m,r
r16/32,r
r64,r
r,r
r16,m
r32,m
r64,m
r,m
r
m

Control transfer instructions


JMP
short/near
JMP
JMP
JMP

8
1
1
5
2
5
4
8
8
21
22
28
20
22
25
28
1
1
1
1
1
2

6
1

2
7
7
5
8
8
9
10
8
9
10
10
1

far
r
m(near)

16-20
1
1

23-32

m(far)
short/near
short
short
near

17-21
1
2
7
3

25-33

CALL
CALL
CALL

far
r
m(near)

16-22
4
5

23-32
3
3

CALL
RETN
RETN

m(far)

16-22
2
2

24-33
3
3

15-23

24-35

15-24
32
33
6
2

24-35
81
42

JMP
Jcc
J(E/R)CXZ
LOOP
CALL

RETF
RETF
IRET
INT
BOUND
INTO

i
i
m

3
1/3
1/2
2
1
2
2
5
3
8
9
10
8
9
10
10
1/3
1/2
1/3
1/3
1/3
1/3

ALU, AGU
ALU
ALU, AGU
ALU, AGU
ALU
ALU, AGU
ALU, AGU
ALU, AGU
ALU, AGU
ALU
ALU
ALU
ALU, AGU
ALU, AGU
ALU, AGU
ALU, AGU
ALU
ALU, AGU
ALU
ALU
ALU
ALU

ALU
low values = real
mode

2
2

3-4
2

1/3 - 2
1/3 - 2
3-4
2


ALU
ALU
ALU
ALU

low values = real


mode
recip. thrp.= 2 if jump
recip. thrp.= 2 if jump

low values = real


mode
3
3

ALU
ALU, AGU
low values = real
mode

3
3

2
2

String instructions

ALU
ALU, AGU

ALU
ALU
low values = real
mode
low values = real
mode
real mode
real mode
values are for no jump
values are for no jump

LODS
REP LODS
STOS
REP STOS
MOVS
REP MOVS
SCAS
REP SCAS
CMPS
REP CMPS

4
2
5
2
4
2
1.5 - 2 0.5 - 1
7
3
3
1-2
5
2
5
2
2
3
6
2

2
2
2
0.5 - 1
3
1-2
2
2
3
2

Other
NOP (90)
Long NOP (0F 1F)
ENTER
LEAVE
CLI
STI
CPUID
RDTSC
RDPMC

1
0
1
0
i,0
12
2
8-9
16-17
22-50 47-164
6
10
9
12

1/3
1/3
12
3
5
27

values are per count


values are per count
values are per count
values are per count
values are per count

ALU
ALU
12
3 ops, 5 clk if 16 bit

7
7

Floating point x87 instructions


Instruction

Operands

Ops

r
m32/64
m80
m80
r
m32/64
m80
m80
r
m
m

1
1
7
30
1
1
10
260
1
1
1
1

2
4
16
41
2
3
7
173
0
9
7

1/2
1/2
4
39
1/2
1
5
160
0.4
1
1
1

FCMOVcc
FFREE
FINCSTP, FDECSTP

st0,r
r

9
1
1

4-15

4
2
1/3

FNSTSW
FSTSW
FNSTSW
FNSTCW
FLDCW

AX
AX
m16
m16
m16

2
3
2
3
18

6-12
6-12

Move instructions
FLD
FLD
FLD
FBLD
FST(P)
FST(P)
FSTP
FBSTP
FXCH
FILD
FIST(P)
FLDZ, FLD1

Latency
Reciprocal throughput
Execution unit

Arithmetic instructions

12
12
8
1
50

Notes

FA/M
FANY

FA/M
FMISC

FMISC
FMISC, FA/M
FMISC
FMISC, FA/M
Low latency immediately after FCOMI
FANY
FANY
FMISC, ALU
Low latency immediately after FCOM FTST
FMISC, ALU
do.
FMISC, ALU
do.
FMISC, ALU
FMISC, ALU
faster if unchanged

FADD(P),FSUB(R)(P)
FIADD,FISUB(R)
FMUL(P)
FIMUL

r/m
m
r/m
m

1
2
1
2

4
4
4
4

1
1-2
1
2

FDIV(R)(P)
FIDIV(R)
FABS, FCHS
FCOM(P), FUCOM(P)
FCOMPP, FUCOMPP
FCOMI(P)
FICOM(P)
FTST
FXAM
FRNDINT
FPREM
FPREM1

r/m
m

1
2
1
1
1
1
2
1
2
5
1
1

11-25
12-26
2
2
2
3

8-22
9-23
1
1
1
1
1
1
1
3
8
8

Math
FSQRT
FLDPI, etc.
FSIN
FCOS
FSINCOS
FPTAN
FPATAN
FSCALE
FXTRACT
F2XM1
FYL2X
FYL2XP1

1
1
66
73
98
67
97
5
7
53
72
75

27

Other
FNOP
(F)WAIT
FNCLEX
FNINIT
FNSAVE
FRSTOR
FXSAVE
FXRSTOR

1
1
8
26
77
70
61
101

r/m
r
m

2
10
7-10
8-11

140-190
150-190
170-200
150-180
217
8
12
126
179
175

0
0

12
1

FADD
FADD,FMISC
FMUL
FMUL,FMISC
Low values are for round divisors
FMUL
FMUL,FMISC
do.
FMUL
FADD
FADD
FADD
FADD, FMISC
FADD
FMISC, ALU
FMUL
FMUL

FMUL
FMISC

1/3
1/3
27
100
171
136
56
95

FANY
ALU
FMISC
FMISC

Integer MMX and XMM instructions


Instruction
Move instructions
MOVD
MOVD
MOVD
MOVD
MOVD

Operands

r32, mm
mm, r32
mm,m32
r32, xmm
xmm, r32

Ops

2
2
1
3
3

Latency
Reciprocal throughput
Execution unit
4
9
2
3

2
2
1/2
2
2

FMISC, ALU
FANY, ALU
FANY
FMISC, ALU

Notes

MOVD
MOVD
MOVD (MOVQ)
MOVD (MOVQ)
MOVD (MOVQ)
MOVQ
MOVQ
MOVQ
MOVQ
MOVQ
MOVDQA
MOVDQA
MOVDQA
MOVDQU
MOVDQU
MOVDQ2Q
MOVQ2DQ
MOVNTQ
MOVNTDQ
PACKSSWB/DW
PACKUSWB
PACKSSWB/DW
PACKUSWB
PUNPCKH/LBW/WD/DQ
PUNPCKH/LBW/WD/DQ
PUNPCKHQDQ
PUNPCKLQDQ
PSHUFD
PSHUFW
PSHUFL/HW
MASKMOVQ
MASKMOVDQU
PMOVMSKB
PEXTRW
PINSRW
PINSRW

xmm,m32
m32, r

2
1

1
1

r64,mm/xmm
mm,r64
xmm,r64
mm,mm
xmm,xmm
mm,m64
xmm,m64
m64,mm/x
xmm,xmm
xmm,m
m,xmm
xmm,m
m,xmm
mm,xmm
xmm,mm
m,mm
m,xmm

2
2
3
1
2
1
2
1
2
2
2
4
5
1
2
1
2

4
9
9
2
2

mm,r/m

FA/M

xmm,r/m

FA/M

mm,r/m

FA/M

xmm,r/m
xmm,r/m
xmm,r/m
xmm,xmm,i
mm,mm,i
xmm,xmm,i
mm,mm
xmm,xmm
r32,mm/xmm
r32,mm/x,i
mm,r32,i
xmm,r32,i

2
2
1
3
1
2
32
64
1
2
2
3

2
2
2
3
2
2

FA/M
FA/M
FA/M
FA/M
FA/M
FA/M

2
5
12
12

2
1
1/2
1.5
1/2
1
13
26
1
2
2
3

FADD
FMISC, ALU
FA/M
FA/M

2
2

2
2
2
1/2
1
1/2
1
1
1
2
2
2
2
1/2
1
2
3

FANY
FMISC
FMISC, ALU
Moves 64 bits. Name of instruction differs
FANY, ALU
do.
FANY, ALU
do.
FA/M
FA/M, FMISC
FANY
FANY, FMISC
FMISC
FA/M
FMISC
FMISC

FA/M
FA/M, FMISC
FMISC
FMISC

Arithmetic instructions
PADDB/W/D/Q
PADDSB/W
PADDUSB/W
PSUBB/W/D/Q
PSUBSB/W
PSUBUSB/W

mm,r/m

1/2

FA/M

PADDB/W/D/Q
PADDSB/W
PADDUSB/W
PSUBB/W/D/Q
PSUBSB/W
PSUBUSB/W

xmm,r/m

FA/M

PCMPEQ/GT B/W/D
PCMPEQ/GT B/W/D
PMULLW PMULHW
PMULHUW
PMULUDQ
PMULLW PMULHW
PMULHUW
PMULUDQ
PMADDWD
PMADDWD
PAVGB/W
PAVGB/W
PMIN/MAX SW/UB
PMIN/MAX SW/UB
PSADBW
PSADBW
Logic
PAND PANDN POR
PXOR
PAND PANDN POR
PXOR
PSLL/RL W/D/Q
PSRAW/D
PSLL/RL W/D/Q
PSRAW/D
PSLLDQ, PSRLDQ

mm,r/m
xmm,r/m

1
2

2
2

1/2
1

FA/M
FA/M

mm,r/m

FMUL

xmm,r/m
mm,r/m
xmm,r/m
mm,r/m
xmm,r/m
mm,r/m
xmm,r/m
mm,r/m
xmm,r/m

2
1
2
1
2
1
2
1
2

3
3
3
2
2
2
2
3
3

2
1
2
1/2
1
1/2
1
1
2

FMUL
FMUL
FMUL
FA/M
FA/M
FA/M
FA/M
FADD
FADD

mm,r/m

1/2

FA/M

xmm,r/m

FA/M

mm,i/mm/m

1/2

FA/M

x,i/x/m
xmm,i

2
2

2
2

1
1

FA/M
FA/M

1/3

FANY

Other
EMMS

Floating point XMM instructions


Instruction
Move instructions
MOVAPS/D
MOVAPS/D
MOVAPS/D
MOVUPS/D
MOVUPS/D
MOVUPS/D
MOVSS/D
MOVSS/D
MOVSS/D
MOVHLPS, MOVLHPS
MOVHPS/D, MOVLPS/D
MOVHPS/D, MOVLPS/D
MOVDDUP

Operands

Ops

Latency
Reciprocal throughput
Execution unit

r,r
r,m
m,r
r,r
r,m
m,r
r,r
r,m
m,r

2
2
2
2
4
5
1
2
1

2
4
3

1
2
2
1
2
2
1
1
1

FA/M
FANY FMISC
FMISC

r,r

1/2

FA/M

r,m

FMISC

m,r
r,r

1
2

1
1

FMISC

2

Notes

FA/M
FMISC
FMISC
FA/M

SSE3

MOVSH/LDUP
MOVNTPS/D
MOVMSKPS/D
SHUFPS/D
UNPCK H/L PS/D
Conversion
CVTPS2PD
CVTPD2PS
CVTSD2SS
CVTSS2SD
CVTDQ2PS
CVTDQ2PD
CVT(T)PS2DQ
CVT(T)PD2DQ
CVTPI2PS
CVTPI2PD
CVT(T)PS2PI
CVT(T)PD2PI
CVTSI2SS
CVTSI2SD
CVT(T)SD2SI
CVT(T)SS2SI
Arithmetic
ADDSS/D SUBSS/D
ADDPS/D SUBPS/D
HADDPS/D
HSUBPS/D
MULSS/D
MULPS/D

DIVSS
DIVPS
DIVSD
DIVPD
RCPSS
RCPPS
MAXSS/D MINSS/D
MAXPS/D MINPS/D
CMPccSS/D
CMPccPS/D
COMISS/D
UCOMISS/D

r,r
m,r
r32,r
r,r/m,i
r,r/m

2
2
1
3
2

2
8
3
3

2
3
1
2
3

FMISC
FADD
FMUL
FMUL

r,r/m
r,r/m
r,r/m
r,r/m
r,r/m
r,r/m
r,r/m
r,r/m
xmm,mm
xmm,mm
mm,xmm
mm,xmm
xmm,r32
xmm,r32
r32,xmm
r32,xmm

SSE3

2
4
3
1
2
2
2
4
1
2
1
3
3
2
2
2

4
8
8
2
5
5
5
8
4
5
6
8
14
12
10
9

2
3
8
1
2
2
2
3
1
2
1
2
2
2
2
2

FMISC
FMISC
FMISC
FMISC
FMISC
FMISC
FMISC
FMISC
FMISC
FMISC
FMISC
FMISC
FMISC
FMISC
FMISC
FMISC

r,r/m
r,r/m

1
2

4
4

1
2

FADD
FADD

r,r/m
r,r/m
r,r/m

2
1
2

4
4
4

2
1
2

FADD
FMUL
FMUL

r,r/m
r,r/m
r,r/m
r,r/m
r,r/m
r,r/m
r,r/m
r,r/m
r,r/m
r,r/m

1
2
1
2
1
2
1
2
1
2

11-16
18-30
11-20
16-34
3
3
2
2
2
2

8-13
18-30
8-17
16-34
1
2
1
2
1
2

FMUL
FMUL
FMUL
FMUL
FMUL
FMUL
FADD
FADD
FADD
FADD

r,r/m

FADD

Logic
ANDPS/D ANDNPS/D
ORPS/D XORPS/D

r,r/m

FMUL

Math
SQRTSS
SQRTPS

r,r/m
r,r/m

1
2

19
36

16
36

FMUL
FMUL


SSE3

Low values are for round divisors, e.g. powers of 2.
do.
do.
do.

SQRTSD
SQRTPD
RSQRTSS
RSQRTPS

r,r/m
r,r/m
r,r/m
r,r/m

1
2
1
2

Other
LDMXCSR
STMXCSR

m
m

8
3

27
48
3
3

24
48
1
2

9
10


FMUL
FMUL
FMUL
FMUL

K10

AMD K10
List of instruction timings and macro-operation breakdown
Explanation of column headings:

Instruction: Instruction name. cc means any condition code. For example, Jcc can be JB, JNE, etc.

Operands: i = immediate constant, r = any register, r32 = 32-bit register, etc., mm = 64-bit mmx register, xmm = 128-bit xmm register, sr = segment register, m = any memory operand including indirect operands, m64 means 64-bit memory operand, etc.

Ops: Number of macro-operations issued from instruction decoder to schedulers. Instructions with more than 2 macro-operations use microcode.

Latency: This is the delay that the instruction generates in a dependency chain. The numbers are minimum values. Cache misses, misalignment, and exceptions may increase the clock counts considerably. Floating point operands are presumed to be normal numbers. Denormal numbers, NaNs, infinity and exceptions increase the delays. The latency listed does not include the memory operand where the operand is listed as register or memory (r/m).

Reciprocal throughput: This is also called issue latency. This value indicates the average number of clock cycles from when the execution of an instruction begins until a subsequent independent instruction of the same kind can begin to execute. A value of 1/3 indicates that the execution units can handle 3 instructions per clock cycle in one thread. However, the throughput may be limited by other bottlenecks in the pipeline.
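
As a rough illustration of how latency and reciprocal throughput bound execution time, the following Python sketch estimates clock counts for a dependency chain and for a run of independent instructions. The function names and the numbers used (latency 4, reciprocal throughput 1/3) are illustrative assumptions, not entries taken from the tables, and the model ignores every other bottleneck in the pipeline.

```python
from fractions import Fraction


def chain_cycles(n, latency):
    """Estimate cycles for n instructions where each needs the previous result.

    Each instruction must wait for the full latency of its predecessor.
    """
    return n * latency


def independent_cycles(n, recip_throughput, latency):
    """Estimate cycles for n independent instructions of the same kind.

    A new instruction can start every recip_throughput cycles, and the
    last one still needs `latency` cycles to deliver its result.
    """
    return (n - 1) * recip_throughput + latency


# Illustrative values only (assumed, not read from the tables).
LATENCY = 4
RECIP_THROUGHPUT = Fraction(1, 3)

if __name__ == "__main__":
    print("dependent chain:", chain_cycles(301, LATENCY))
    print("independent    :", independent_cycles(301, RECIP_THROUGHPUT, LATENCY))
```

Under these assumptions the dependent chain is limited by latency while the independent run is limited by reciprocal throughput, which is why both columns are listed separately.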

Execution unit: Indicates which execution unit is used for the macro-operations. ALU means any of the three integer ALUs. ALU0_1 means that ALU0 and ALU1 are both used. AGU means any of the three integer address generation units. FADD means floating point adder unit. FMUL means floating point multiplier unit. FMISC means floating point store and miscellaneous unit. FA/M means FADD or FMUL is used. FANY means any of the three floating point units can be used. Two macro-operations can execute simultaneously if they go to different execution units.
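
The unit labels above can be used to estimate how many macro-operations fit into a clock cycle. The Python sketch below is a simplified model, not a description of the actual scheduler: it assumes each macro-operation occupies one floating point unit for exactly one cycle, and it ignores dependencies, decoding and retirement limits. The names `ALLOWED` and `min_cycles` are hypothetical.

```python
from itertools import combinations
from math import ceil

UNITS = ("FADD", "FMUL", "FMISC")

# Units permitted by each table label, following the description above.
ALLOWED = {
    "FADD": {"FADD"},
    "FMUL": {"FMUL"},
    "FMISC": {"FMISC"},
    "FA/M": {"FADD", "FMUL"},
    "FANY": {"FADD", "FMUL", "FMISC"},
}


def min_cycles(labels):
    """Fewest cycles needed to issue the given macro-operations.

    Assumes one macro-operation per unit per cycle. For every subset of
    units, the operations that can only run inside that subset must share
    its units, which gives the bound ceil(count / subset size).
    """
    best = 0
    for k in range(1, len(UNITS) + 1):
        for subset in combinations(UNITS, k):
            allowed_here = set(subset)
            confined = sum(1 for lab in labels if ALLOWED[lab] <= allowed_here)
            best = max(best, ceil(confined / k))
    return best


if __name__ == "__main__":
    # One FADD and one FMUL operation go to different units: 1 cycle.
    print(min_cycles(["FADD", "FMUL"]))
    # Two FADD operations compete for the same unit: 2 cycles.
    print(min_cycles(["FADD", "FADD"]))
```

In this model, mixing operations that go to different units is what allows more than one macro-operation per clock, in line with the last sentence of the column description.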

Integer instructions
Instruction
Move instructions
MOV
MOV
MOV
MOV
MOV
MOV
MOV
MOV
MOV
MOV
MOV
MOV
MOV
MOVNTI
MOVZX, MOVSX

Operands

r,r
r,i
r8,m8
r16,m16
r32,m32
r64,m64
m8,r8H
m8,r8L
m16/32/64,r
m,i
m64,i32
r,sr
sr,r/m
m,r
r,r

Ops

1
1
1
1
1
1
1
1
1
1
1
1
6
1
1

Latency
Reciprocal throughput
Execution unit
1
1
4
4
3
3
8
3
3
3
3
3-4
8-26
1

1/3
1/3
1/2
1/2
1/2
1/2
1/2
1/2
1/2
1/2
1/2
1/2
8
1
1/3

ALU
ALU
ALU, AGU
ALU, AGU
AGU
AGU
AGU
AGU
AGU
AGU
AGU

Notes

Any addr. mode. Add 1 clock if code segment base ≠ 0
AH, BH, CH, DH
Any other 8-bit reg.
Any addressing mode

from AMD manual


AGU
ALU

MOVZX, MOVSX
MOVSXD
MOVSXD
CMOVcc
CMOVcc
XCHG
XCHG
XLAT
PUSH
PUSH
PUSH
PUSH
PUSHF(D/Q)
PUSHA(D)
POP
POP
POP
POP
POPF(D/Q)
POPA(D)
LEA
LEA
LEA
LAHF
SAHF
SALC
LDS, LES, ...
BSWAP
PREFETCHNTA
PREFETCHT0/1/2
PREFETCH(W)
SFENCE
LFENCE
MFENCE
IN
OUT

1
1
1
1
1
2
2
2
r
1
i
1
m
2
sr
2
9
9
r
1
m
3
6
DS/ES/FS/GS
SS
10
28
9
r16,[m]
2
r32/64,[m]
1
r32/64,[m]
1
4
1
1
r,m
10
r
1
m
1
m
1
m
1
6
1
4
r,i/DX
~270
i/DX,r
~300

Arithmetic instructions
ADD, SUB
ADD, SUB
ADD, SUB
ADC, SBB
ADC, SBB
ADC, SBB
CMP
CMP
INC, DEC, NEG
INC, DEC, NEG
AAA, AAS
DAA
DAS
AAD
AAM

r,m
r64,r32
r64,m32
r,r
r,m
r,r
r,m

r,r/i
r,m
m,r
r,r/i
r,m
m,r/i
r,r/i
r,m
r
m

1
1
1
1
1
1
1
1
1
1
9
12
16
4
30

4
1
4
1
4
1
21
5

6
3
10
26
16
6
3
1
2
3
1
1
1

1
4
1
4
1
1
7
5
6
7
5
13

1/2
1/3
1/2
1/3
1/2
1
19
5
1/2
1/2
1
1
3
6
1/2
1
8
16
11
6
1
1/3
1/3
2
1/3
1
10
1/3
1/2
1/2
1/2
8
1
33

ALU, AGU
ALU
ALU, AGU
ALU
ALU, AGU
ALU
ALU, AGU
ALU, AGU
ALU, AGU
ALU, AGU
ALU, AGU
ALU, AGU
ALU, AGU
ALU, AGU
ALU, AGU
ALU, AGU
ALU, AGU
ALU, AGU
ALU, AGU
ALU, AGU
ALU, AGU
ALU
AGU
ALU
ALU
ALU

1/3
1/2
1
1/3
1/2
1
1/3
1/2
1/3
2
5
6
7
5
13

ALU
ALU, AGU
ALU, AGU
ALU
ALU, AGU
ALU, AGU
ALU
ALU, AGU
ALU
ALU, AGU
ALU
ALU
ALU
ALU0
ALU

ALU
AGU
AGU
AGU

Timing depends on hw

Any address size


2 source operands
W. scale or 3 opr.

3DNow

MUL, IMUL
MUL, IMUL
MUL, IMUL
MUL, IMUL
IMUL
IMUL
IMUL
IMUL
IMUL
IMUL
IMUL
IMUL
IMUL
DIV
IDIV
IDIV
DIV
DIV
DIV
IDIV
IDIV
IDIV
CBW, CWDE, CDQE
CWD, CDQ, CQO

r8/m8
r16/m16
r32/m32
r64/m64
r16,r16/m16
r32,r32/m32
r64,r64/m64
r16,(r16),i
r32,(r32),i
r64,(r64),i
r16,m16,i
r32,m32,i
r64,m64,i
r8/m8
r8
m8
r16/m16
r32/m32
r64/m64
r16/m16
r32/m32
r64/m64

Logic instructions
AND, OR, XOR
AND, OR, XOR
AND, OR, XOR
TEST
TEST
NOT
NOT
SHL, SHR, SAR
ROL, ROR
RCL, RCR
RCL
RCR
RCL
RCR
SHL,SHR,SAR,ROL,ROR
RCL, RCR
RCL
RCR
RCL
RCR
SHLD, SHRD
SHLD, SHRD
SHLD, SHRD
BT
BT
BT
BTC, BTR, BTS

1
3
2
2
1
1
1
2
1
1
3
3
3

1
1

r,r
r,m
m,r
r,r
r,m
r
m
r,i/CL
r,i/CL
r,1
r,i
r,i
r,CL
r,CL
m,i /CL
m,1
m,i
m,i
m,CL
m,CL
r,r,i
r,r,cl
m,r,i/CL
r,r/i
m,i
m,r
r,r/i

1
1
1
1
1
1
1
1
1
1
9
7
9
7
1
1
10
9
9
8
6
7
8
1
1
5
2

3
3
3
4
3
3
4
4
3
4

17
19
22
15-30
15-46
15-78
24-39
24-55
24-87
1
1

1
4
1
1
7
1
1
1
3
3
4
3
7
7
7
7
8
7
3
3
7.5
1
7
2

1
2
1
2
1
1
2
1
1
2
2
2
2
17
19
22
15-30
15-46
15-78
24-39
24-55
24-87
1/3
1/3

ALU0
ALU0_1
ALU0_1
ALU0_1
ALU0
ALU0
ALU0_1
ALU0
ALU0
ALU0
ALU0
ALU0
ALU0_1
ALU
ALU
ALU
ALU
ALU
ALU
ALU
ALU
ALU
ALU
ALU

1/3
1/2
1
1/3
1/2
1/3
1
1/3
1/3
1
3
3
4
3
1
1
5
6
6
5
2
3
6
1/3
1/2
2
1/3

ALU
ALU, AGU
ALU, AGU
ALU
ALU, AGU
ALU
ALU, AGU
ALU
ALU
ALU
ALU
ALU
ALU
ALU
ALU, AGU
ALU, AGU
ALU, AGU
ALU, AGU
ALU, AGU
ALU, AGU
ALU
ALU
ALU, AGU
ALU
ALU, AGU
ALU, AGU
ALU

latency ax=3, dx=4


latency rax=4, rdx=5

Depends on number of significant bits in absolute value of dividend. See AMD software optimization guide.

BTC
BTR, BTS
BTC
BTR, BTS
BSF
BSR
BSF
BSR
POPCNT
LZCNT
SETcc
SETcc
CLC, STC
CMC
CLD
STD

m,i
m,i
m,r
m,r
r,r
r,r
r,m
r,m
r,r/m
r,r/m
r
m

Control transfer instructions


JMP
short/near
JMP
far
JMP
r
JMP
m(near)
JMP
m(far)
Jcc
short/near
J(E/R)CXZ
short
LOOP
short
CALL
near
CALL
far
CALL
r
CALL
m(near)
CALL
m(far)
RETN
RETN
i
RETF
RETF
i
IRET
INT
i
BOUND
m
INTO
String instructions
LODS
REP LODS
STOS
REP STOS
MOVS
REP MOVS
SCAS
REP SCAS
CMPS
REP CMPS

5
4
8
8
6
7
7
8
1
1
1
1
1
1
1
2

1
16-20
1
1
17-21
1
2
7
3
16-22
4
5
16-22
2
2
15-23
15-24
32
33
6
2

4
5
4
2
7
3
5
5
7
3

9
9
8
8
4
4
7
7
2
2
1

1.5
1.5
10
7
3
3
3
3
1
1
1/3
1/2
1/3
1/3
1/3
2/3

ALU, AGU
ALU, AGU
ALU, AGU
ALU, AGU
ALU
ALU
ALU, AGU
ALU, AGU
ALU
ALU
ALU
ALU, AGU
ALU
ALU
ALU
ALU

ALU

2
2

ALU
ALU, AGU

1/3 - 2
2/3 - 2
3
2

ALU
ALU
ALU
ALU

3
3

ALU
ALU, AGU

23-32

low values = real mode

25-33

2
23-32
3
3
24-33
3
3
24-35
24-35
81
42

SSE4.A / SSE4.2
SSE4.A, AMD only

low values = real mode

low values = real mode

low values = real mode

3
3


ALU
ALU
low values = real mode
low values = real mode

2
2

2
2
2
1
3
1
2
2
3
1

recip. thrp.= 2 if jump


recip. thrp.= 2 if jump

2
2
2
1
3
1
2
2
3
1

real mode
real mode
values are for no jump
values are for no jump

values are per count


values are per count
values are per count
values are per count
values are per count

Other
NOP (90)
Long NOP (0F 1F)
ENTER
LEAVE
CLI
STI
CPUID
RDTSC
RDPMC

1
0
1
0
i,0
12
2
8-9
16-17
22-50 47-164
30
13

1/3
1/3

ALU
ALU
12

3
5
27

3 ops, 5 clk if 16 bit

67
5

Floating point x87 instructions


Instruction

Operands

Ops

r
m32/64
m80
m80
r
m32/64
m80
m80
r
m
m

1
1
7
20
1
1
10
218
1
1
1
1

FCMOVcc
FFREE
FINCSTP, FDECSTP

st0,r
r

9
1
1

FNSTSW
FSTSW
FNSTSW
FNSTCW
FLDCW

AX
AX
m16
m16
m16

2
3
2
3
12

r/m
m
r/m
m
r/m
m

1
2
1
2
1
2
1
1
1
1
2
1

Move instructions
FLD
FLD
FLD
FBLD
FST(P)
FST(P)
FSTP
FBSTP
FXCH
FILD
FIST(P)
FLDZ, FLD1

Latency
Reciprocal throughput
Execution unit
2
4
13
94
2
2
8
167
0
6
4

1/2
1/2
4
30
1/2
1
7
163
1/3
1
1
1

Notes

FA/M
FANY

FA/M
FMISC

FMISC
FMISC
FMISC

1/3
1/3

FMISC, FA/M
Low latency immediately after FCOMI


FANY
FANY

16
14
9
2
14

FMISC, ALU
Low latency immediately after FCOM FTST
FMISC, ALU
do.
FMISC, ALU
do.
FMISC, ALU
FMISC, ALU
faster if unchanged

Arithmetic instructions
FADD(P),FSUB(R)(P)
FIADD,FISUB(R)
FMUL(P)
FIMUL
FDIV(R)(P)
FIDIV(R)
FABS, FCHS
FCOM(P), FUCOM(P)
FCOMPP, FUCOMPP
FCOMI(P)
FICOM(P)
FTST

r/m
r
m

4
4
?
31
2


1
4
1
4
24
24
2
1
1
1
1
1

FADD
FADD,FMISC
FMUL
FMUL,FMISC
FMUL
FMUL,FMISC
FMUL
FADD
FADD
FADD
FADD, FMISC
FADD

FXAM
FRNDINT
FPREM
FPREM1

2
6
1
1

Math
FSQRT
FLDPI, etc.
FSIN
FCOS
FSINCOS
FPTAN
FPATAN
FSCALE
FXTRACT
F2XM1
FYL2X
FYL2XP1

1
1
45
51
76
45
9
5
11
8
8
12

Other
FNOP
(F)WAIT
FNCLEX
FNINIT
FNSAVE
FRSTOR
FXSAVE
FXRSTOR

1
1
8
26
77
70
61
85

m
m
m
m

35
~51?
~90?
~125?
~119
151?
9
9
65
13
114

0
0

162
133
63
89

1
37
7
7

FMISC, ALU

35
1

FMUL
FMISC

FMUL
FMUL

45?
29
41
30?
30?
44?

1/3
1/3
28
103
149
149
58
79

FANY
ALU
FMISC
FMISC

Integer MMX and XMM instructions


Instruction

Operands

Ops

Latency
Reciprocal throughput
Execution unit

Move instructions
MOVD
MOVD
MOVD
MOVD
MOVD
MOVD
MOVD

r32, mm
mm, r32
mm,m32
r32, xmm
xmm, r32
xmm,m32
m32,mm/x

1
2
1
1
2
1
1

3
6
4
3
6
2
2

1
3
1/2
1
3
1/2
1

MOVD (MOVQ)
MOVD (MOVQ)
MOVD (MOVQ)
MOVQ
MOVQ
MOVQ
MOVQ
MOVQ
MOVDQA
MOVDQA

r64,(x)mm
mm,r64
xmm,r64
mm,mm
xmm,xmm
mm,m64
xmm,m64
m64,(x)mm
xmm,xmm
xmm,m

1
2
2
1
1
1
1
1
1
1

3
6
6
2
2.5
4
2
2
2.5
2

1
3
3
1/2
1/3
1/2
1/2
1
1/3
1/2


Notes

FADD
FANY
FADD

FMISC
Moves 64 bits. Name of instruction differs
do.
FMUL, ALU
do.
FA/M
FANY
FANY
?
FMISC
FANY
?
FADD

MOVDQA
MOVDQU
MOVDQU
MOVDQ2Q
MOVQ2DQ
MOVNTQ
MOVNTDQ
PACKSSWB/DW
PACKUSWB
PACKSSWB/DW
PACKUSWB
PUNPCKH/LBW/WD/DQ
PUNPCKH/LBW/WD/DQ
PUNPCKHQDQ
PUNPCKLQDQ
PSHUFD
PSHUFW
PSHUFL/HW
MASKMOVQ
MASKMOVDQU
PMOVMSKB
PEXTRW
PINSRW
INSERTQ
INSERTQ
EXTRQ
EXTRQ

m,xmm
xmm,m
m,xmm
mm,xmm
xmm,mm
m,mm
m,xmm

2
1
3
1
1
1
2

2
2
3
2
2

1
1/2
2
1/3
1/3
1
1

FMUL,FMISC

mm,r/m

1/2

FA/M

xmm,r/m

1/2

FA/M

mm,r/m

1/2

FA/M

xmm,r/m
xmm,r/m
xmm,r/m
xmm,xmm,i
mm,mm,i
xmm,xmm,i
mm,mm
xmm,xmm
r32,mm/xmm
r32,(x)mm,i
(x)mm,r32,i
xmm,xmm
xmm,xmm,i,i
xmm,xmm
xmm,xmm,i,i

1
1
1
1
1
1
32
64
1
2
2
3
3
1
1

3
3
3
3
2
2

FA/M
FA/M
FA/M
FA/M
FA/M
FA/M

3
6
9
6
6
2
2

1/2
1/2
1/2
1/2
1/2
1/2
13
24
1
1
3
2
2
1/2
1/2

1
1

2
2

1/2
1/2

FA/M
FA/M

1
1
1
1
1

3
3
2
2
3

1
1
1/2
1/2
1

FMUL
FMUL
FA/M
FA/M
FADD

mm/xmm,r/m

1/2

FA/M

mm,i/mm/m

1/2

FA/M

x,i/(x)mm

1/2

FA/M

Arithmetic instructions
PADDB/W/D/Q
PADDSB/W
PADDUSB/W
PSUBB/W/D/Q
PSUBSB/W
PSUBUSB/W
mm/xmm,r/m
PCMPEQ/GT B/W/D mm/xmm,r/m
PMULLW PMULHW
PMULHUW
PMULUDQ
mm/xmm,r/m
PMADDWD
mm/xmm,r/m
PAVGB/W
mm/xmm,r/m
PMIN/MAX SW/UB
mm/xmm,r/m
PSADBW
mm/xmm,r/m
Logic
PAND PANDN POR
PXOR
PSLL/RL W/D/Q
PSRAW/D
PSLL/RL W/D/Q
PSRAW/D


FANY
FANY
FMISC
FMUL,FMISC

FADD
FA/M
FA/M
FA/M
FA/M
FA/M

SSE4.A, AMD only


SSE4.A, AMD only
SSE4.A, AMD only
SSE4.A, AMD only

PSLLDQ, PSRLDQ

xmm,i

Other
EMMS

1/2

FA/M

1/3

FANY

Floating point XMM instructions


Instruction
Move instructions
MOVAPS/D
MOVAPS/D
MOVAPS/D
MOVUPS/D
MOVUPS/D
MOVUPS/D
MOVSS/D
MOVSS/D
MOVSS/D
MOVHLPS, MOVLHPS
MOVHPS/D, MOVLPS/D
MOVHPS/D, MOVLPS/D
MOVNTPS/D
MOVNTSS/D
MOVMSKPS/D
SHUFPS/D
UNPCK H/L PS/D
Conversion
CVTPS2PD
CVTPD2PS
CVTSD2SS
CVTSS2SD
CVTDQ2PS
CVTDQ2PD
CVT(T)PS2DQ
CVT(T)PD2DQ
CVTPI2PS
CVTPI2PD
CVT(T)PS2PI
CVT(T)PD2PI
CVTSI2SS
CVTSI2SD
CVT(T)SD2SI
CVT(T)SS2SI
Arithmetic
ADDSS/D SUBSS/D

Operands

Ops

Latency
Reciprocal throughput
Execution unit

r,r
r,m
m,r
r,r
r,m
m,r
r,r
r,m
m,r

1
1
2
1
1
3
1
1
1

2.5
2
2
2.5
2
3
2
2
2

1/2
1/2
1
1/2
1/2
2
1/2
1/2
1

FANY
?
FMUL,FMISC
FANY
?
FMISC
FA/M
?
FMISC

r,r

1/2

FA/M

r,m

1/2

FA/M

m,r
m,r
m,r
r32,r
r,r/m,i
r,r/m

1
2
1
1
1
1

3
3
3

1
3
1
1
1/2
1/2

FMISC
FMUL,FMISC
FMISC
FADD
FA/M
FA/M

r,r/m
r,r/m
r,r/m
r,r/m
r,r/m
r,r/m
r,r/m
r,r/m
xmm,mm
xmm,mm
mm,xmm
mm,xmm
xmm,r32
xmm,r32
r32,xmm
r32,xmm

1
2
3
3
1
1
1
2
2
1
1
2
3
3
2
2

2
7
8
7
4
4
4
7
7
4
4
7
14
14
8
8

1
1
2
2
1
1
1
1
1
1
1
1
3
3
1
1

FMISC

FADD,FMISC
FADD,FMISC

r,r/m

FADD


FMISC
FMISC
FMISC

FMISC
FMISC

Notes

SSE4.A, AMD only

ADDPS/D SUBPS/D
MULSS/D
MULPS/D
DIVSS
DIVPS
DIVSD
DIVPD
RCPSS RCPPS
MAXSS/D MINSS/D
MAXPS/D MINPS/D
CMPccSS/D
CMPccPS/D
COMISS/D
UCOMISS/D

r,r/m
r,r/m
r,r/m
r,r/m
r,r/m
r,r/m
r,r/m
r,r/m
r,r/m
r,r/m
r,r/m
r,r/m

1
1
1
1
1
1
1
1
1
1
1
1

r,r/m

Logic
ANDPS/D ANDNPS/D
ORPS/D XORPS/D

r,r/m

Math
SQRTSS
SQRTPS
SQRTSD
SQRTPD
RSQRTSS
RSQRTPS

r,r/m
r,r/m
r,r/m
r,r/m
r,r/m
r,r/m

Other
LDMXCSR
STMXCSR

m
m

4
4
4
16
18
20
20
3
2
2
2
2

1
1
1
13
15
17
17
1
1
1
1
1

FADD
FMUL
FMUL
FMUL
FMUL
FMUL
FMUL
FMUL
FADD
FADD
FADD
FADD

FADD

1/2

FA/M

1
1
1
1
1
1

19
21
27
27
3
3

16
18
24
24
1
1

FMUL
FMUL
FMUL
FMUL
FMUL
FMUL

12
3

12
12

10
11

Obsolete 3DNow instructions


Instruction

Operands

Ops

Latency
Reciprocal throughput
Execution unit

Move and convert instructions


PF2ID
mm,mm
PI2FD
mm,mm
PF2IW
mm,mm
PI2FW
mm,mm
PSWAPD
mm,mm

1
1
1
1
1

5
5
5
5
2

1
1
1
1
1/2

FMISC
FMISC
FMISC
FMISC
FA/M

Integer instructions
PAVGUSB
PMULHRW

mm,mm
mm,mm

1
1

2
3

1/2
1

FA/M
FMUL

Floating point instructions


PFADD/SUB/SUBR
mm,mm
PFCMPEQ/GE/GT
mm,mm
PFMAX/MIN
mm,mm
PFMUL
mm,mm

1
1
1
1

4
2
2
4

1
1
1
1

FADD
FADD
FADD
FMUL


Notes

3DNow extension
3DNow extension
3DNow extension

PFACC
PFNACC, PFPNACC
PFRCP
PFRCPIT1/2
PFRSQRT
PFRSQIT1

mm,mm
mm,mm
mm,mm
mm,mm
mm,mm
mm,mm

1
1
1
1
1
1

Other
FEMMS

mm,mm

4
4
3
4
3
4

1
1
1
1
1
1

FADD
FADD
FMUL
FMUL
FMUL
FMUL

1/3

FANY

Thank you to Xucheng Tang for doing the measurements on the K10.


3DNow extension

Bulldozer

AMD Bulldozer
List of instruction timings and macro-operation breakdown
Explanation of column headings:

Instruction: Instruction name. cc means any condition code. For example, Jcc can be JB, JNE, etc.

Operands: i = immediate constant, r = any register, r32 = 32-bit register, etc., mm = 64-bit mmx register, x = 128-bit xmm register, y = 256-bit ymm register, m = any memory operand including indirect operands, m64 means 64-bit memory operand, etc.

Ops: Number of macro-operations issued from instruction decoder to schedulers. Instructions with more than 2 macro-operations use microcode.

Latency: This is the delay that the instruction generates in a dependency chain. The numbers are minimum values. Cache misses, misalignment, and exceptions may increase the clock counts considerably. Floating point operands are presumed to be normal numbers. Denormal numbers, NaNs, infinity and exceptions increase the delays. The latency listed does not include the memory operand where the listings for register and memory operands are joined (r/m).

Reciprocal throughput: This is also called issue latency. This value indicates the average number of clock cycles from when the execution of an instruction begins until a subsequent independent instruction of the same kind can begin to execute. A value of 1/3 indicates that the execution units can handle 3 instructions per clock cycle in one thread. However, the throughput may be limited by other bottlenecks in the pipeline.

Execution pipe: Indicates which execution pipe or unit is used for the macro-operations:

Integer pipes:
EX0: integer ALU, division
EX1: integer ALU, multiplication, jump
EX01: can use either EX0 or EX1
AG01: address generation unit 0 or 1

Floating point and vector pipes:
P0: floating point add, mul, div, convert, shuffle, shift
P1: floating point add, mul, div, shuffle, shift
P2: move, integer add, boolean
P3: move, integer add, boolean, store
P01: can use either P0 or P1
P23: can use either P2 or P3

Two macro-operations can execute simultaneously if they go to different execution pipes.

Domain: Tells which execution unit domain is used:

ivec: integer vector execution unit.
fp: floating point execution unit.
fma: floating point multiply/add subunit.
inherit: the output operand inherits the domain of the input operand.
ivec/fma means the input goes to the ivec domain and the output comes from the fma domain.

There is an additional latency of 1 clock cycle if the output of an ivec instruction goes to the input of an fp or fma instruction, and when the output of an fp or fma instruction goes to the input of an ivec or store instruction. There is no latency between the fp and fma units. All other latencies after memory load and before memory store instructions are included in the latency counts.

An fma instruction has a latency of 5 if the output goes to another fma instruction, 6 if the output goes to an fp instruction, and 6+1 if the output goes to an ivec or store instruction.
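
The domain rules above can be written out as a small calculation. The Python sketch below is only an illustration of those rules as stated: the function names are hypothetical, the listed latencies in the example are made-up values rather than table entries, and the "inherit" domain is not modelled.

```python
FP_LIKE = {"fp", "fma"}


def crossing_delay(producer, consumer):
    """Extra clock cycles when a result moves from one domain to another."""
    if producer == "ivec" and consumer in FP_LIKE:
        return 1
    if producer in FP_LIKE and consumer in {"ivec", "store"}:
        return 1
    return 0


def fma_latency(consumer):
    """Latency of an fma instruction, depending on where its output goes."""
    if consumer == "fma":
        return 5
    if consumer == "fp":
        return 6
    if consumer in {"ivec", "store"}:
        return 6 + 1
    raise ValueError(f"unknown consumer domain: {consumer}")


def chain_latency(steps, final_consumer):
    """Total latency of a dependency chain.

    steps is a list of (domain, listed_latency) pairs in program order.
    For fma steps the listed latency is ignored and the rule above is used.
    """
    total = 0
    for i, (domain, listed) in enumerate(steps):
        consumer = steps[i + 1][0] if i + 1 < len(steps) else final_consumer
        if domain == "fma":
            total += fma_latency(consumer)
        else:
            total += listed + crossing_delay(domain, consumer)
    return total


if __name__ == "__main__":
    # Assumed example: ivec op (2 clocks) -> fma op -> fp op (2 clocks) -> store.
    print(chain_latency([("ivec", 2), ("fma", None), ("fp", 2)], "store"))
```

With these assumed latencies the chain costs (2+1) + 6 + (2+1) clock cycles, so two of the twelve cycles come purely from domain crossings.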


Integer instructions
Instruction
Move instructions
MOV
MOV
MOV
MOV
MOV
MOVNTI
MOVZX, MOVSX
MOVSX
MOVZX
MOVSXD
MOVSXD
CMOVcc
CMOVcc
XCHG
XCHG
XLAT
PUSH
PUSH
PUSH
PUSHF(D/Q)
PUSHA(D)
POP
POP
POPF(D/Q)
POPA(D)
LEA
LEA
LEA
LEA
LAHF
SAHF
SALC
BSWAP
PREFETCHNTA
PREFETCHT0/1/2
PREFETCH/W
SFENCE
LFENCE
MFENCE
Arithmetic instructions
ADD, SUB
ADD, SUB

Operands

Ops

r,r
r,i
r,m
m,r
m,i
m,r
r,r
r,m
r,m
r64,r32
r64,m32
r,r
r,m
r,r

1
1
1
1
1
1
1
1
1
1
1
1
1
2

1
1
4
4

r,m

2
2
1
1
2
8
9
1
2
34
14
2
2

~50
6

1
1
4
2
1
1
1
1
1
6
1
6

2
1
3
2
1
1

1
1

1
1

r
i
m

r
m

r16,[m]
r32,[m]
r32/64,[m]
r32/64,[m]

r
m
m
m

r,r
r,i

Latency
Reciprocal throughput

5
1
5
4
1
5
1
1

0.5
0.5
0.5
1
1
2
0.5
0.5
0.5
0.5
0.5
0.5
0.5
1

EX01
EX01
AG01
EX01 AG01

~50
2
1
1
1.5
4
9
1
1
19
8

EX01

2-3
2-3


Execution pipes

Notes

all addr. modes


all addr. modes

EX01
EX01
EX01
EX01
EX01
EX01
EX01
EX01

EX01
EX01
0.5
0.5
2
1
1
0.5
0.5
0.5
0.5
89
0.25
89

EX01
EX01

0.5
0.5

EX01
EX01

Timing depends on hw

any addr. size


16 bit addr. size
scale factor > 1 or 3 operands
all other cases

EX01

AMD 3DNow

ADD, SUB
ADD, SUB
ADD, SUB
ADC, SBB
ADC, SBB
ADC, SBB
ADC, SBB
ADC, SBB
CMP
CMP
CMP
INC, DEC, NEG
INC, DEC, NEG
AAA, AAS
DAA
DAS
AAD
AAM
MUL, IMUL
MUL, IMUL
MUL, IMUL
MUL, IMUL
IMUL
IMUL
IMUL
IMUL
IMUL
IMUL
IMUL
IMUL
IMUL
DIV
DIV
DIV
DIV
IDIV
IDIV
IDIV
IDIV
CBW, CWDE, CDQE
CDQ, CQO
CWD
Logic instructions
AND, OR, XOR
AND, OR, XOR
AND, OR, XOR
AND, OR, XOR
AND, OR, XOR
TEST

r,m
m,r
m,i
r,r
r,i
r,m
m,r
m,i
r,r
r,i
r,m
r
m

r8/m8
r16/m16
r32/m32
r64/m64
r16,r16/m16
r32,r32/m32
r64,r64/m64
r16,(r16),i
r32,(r32),i
r64,(r64),i
r16,m16,i
r32,m32,i
r64,m64,i
r8/m8
r16/m16
r32/m32
r64/m64
r8/m8
r16/m16
r32/m32
r64/m64

r,r
r,i
r,m
m,r
m,i
r,r

1
1
1
1
1
1
1
1
1
1
1
1
1
10
16
20
4
9
1
2
1
1
1
1
1
2
1
1
2
2
2
14
18
16
16
33
36
36
36
1
1
2

1
1
1
1
1
1

7-8
7-8
1
1
1
9
9
1
1
1
7-8
6
9
10
6
20
4
4
4
6
4
4
6
5
4
6

20
15-27
16-43
16-75
23
23-33
22-48
22-79
1
1
1

1
1
7-8
7-8
1

0.5
1
1

1
1
1
0.5
0.5
0.5
0.5
1

20
2
2
2
4
2
2
4
2
2
4
2
2
4
20
15-28
16-43
16-75
20
20-27
20-43
20-75

EX01
EX01
EX01
EX01
EX01
EX01
EX01
EX01
EX01
EX01
EX01
EX01
EX01

0.5
1

EX1
EX1
EX1
EX1
EX1
EX1
EX1
EX1
EX1
EX1
EX1
EX1
EX1
EX0
EX0
EX0
EX0
EX0
EX0
EX0
EX0
EX01
EX01
EX01

0.5
0.5
0.5
1
1
0.5

EX01
EX01
EX01
EX01
EX01
EX01

TEST
TEST
TEST
NOT
NOT
SHL, SHR, SAR
ROL, ROR
RCL
RCL
RCL
RCR
RCR
RCR
SHLD, SHRD
SHLD, SHRD
SHLD, SHRD
BT
BT
BT
BTC, BTR, BTS
BTC, BTR, BTS
BTC, BTR, BTS
BSF
BSF
BSR
BSR
LZCNT
POPCNT
SETcc
SETcc
CLC, STC
CMC
CLD
STD
POPCNT
POPCNT
LZCNT
EXTRQ
EXTRQ
INSERTQ
INSERTQ

r,i
m,r
m,i
r
m
r,i/CL
r,i/CL
r,1
r,i
r,cl
r,1
r,i
r,cl
r,r,i
r,r,cl
m,r,i/CL
r,r/i
m,i
m,r
r,r/i
m,i
m,r
r,r
r,m
r,r
r,m
r,r
r,r/m
r
m

r16/32,r16/32
r64,r64
r,r
x,i,i
x,x
x,x,i,i
x,x

Control transfer instructions


JMP
short/near
JMP
r
JMP
m
Jcc
short/near
fused CMP+Jcc
short/near
J(E/R)CXZ
short
LOOP
short
LOOPE LOOPNE
short

1
1
1
1
1
1
1
1
16
17
1
15
16
6
7
8
1
1
7
2
4
10
6
8
7
9
1
1
1
1
1
1
2
2
1
1
2
1
1
1
1

1
7
1
1
1
8
9
1
8
8
3
4
1

3
4
4
2
4
1

0.5
0.5
0.5
0.5
1
0.5
0.5

3
3.5
3.5
0.5
0.5
3.5
1
2
5
3
4
4
5
2
2
0.5
1
0.5

4
4
2
3
3
3
3

1
1
1
1
1
1
1
1

EX01
EX01
EX01
EX01
EX01
EX01
EX01
EX01
EX01
EX01
EX01
EX01
EX01
EX01
EX01
EX01
EX01
EX01
EX01
EX01
EX01
EX01
EX01
EX01
EX01
EX01
EX0
EX1
EX01
EX01
EX01
EX01

SSE4.A
SSE4.2

3
4
2
4
2
1
1
1
1

P1
P1
P1
P1

SSE4A
SSE4A
SSE4A
SSE4A
SSE4A
SSE4A
SSE4A

2
2
2
1-2
1-2
1-2
1-2
1-2

EX1
EX1
EX1
EX1
EX1
EX1
EX1
EX1

2 if jumping
2 if jumping
2 if jumping
2 if jumping
2 if jumping

CALL
CALL
CALL
RET
RET
BOUND
INTO

near
r
m
i
m

String instructions
LODS
REP LODS
STOS
REP STOS
REP STOS
MOVS
REP MOVS
REP MOVS
SCAS
REP SCAS
CMPS
REP CMPS
Synchronization
LOCK ADD
XADD
LOCK XADD
CMPXCHG
LOCK CMPXCHG
CMPXCHG
LOCK CMPXCHG
CMPXCHG8B
LOCK CMPXCHG8B
CMPXCHG16B
LOCK CMPXCHG16B
Other
NOP (90)
Long NOP (0F 1F)
PAUSE
ENTER
ENTER
LEAVE
CPUID
RDTSC
RDPMC
CRC32
CRC32
CRC32
XGETBV

m,r
m,r
m,r
m8,r8
m8,r8
m,r16/32/64
m,r16/32/64
m64
m64
m128
m128

a,0
a,b

r32,r8
r32,r16
r32,r32

2
2
3
1
4
11
4

2
2
2
2
2-3
5
24

3
6n
3
2n
3 per 16B
5
2n
4 per 16B
3
7n
6
9n

3
3n
3
2n
3 per 16B
3
2n
3 per 16B
3
4n
3
4n

1
4
4
5
5
6
6
18
18
22
22

1
1
40
13
11+5b
2
37-63
36
22
3
5
5
4

EX1
EX1
EX1
EX1
EX1
for no jump
for no jump

small n
best case
small n
best case

~55
10
~51
15
~51
14
~52
15
~53
52
~94

3
5
6


0.25
0.25
43
22
16+4b
4
112-280
42
300
2
5
6
31

none
none


Floating point x87 instructions


Instruction
Move instructions
FLD
FLD
FLD
FBLD
FST(P)
FST(P)
FSTP
FBSTP
FXCH
FILD
FIST(P)
FLDZ, FLD1
FCMOVcc
FFREE
FINCSTP, FDECSTP
FNSTSW
FNSTSW
FLDCW
FNSTCW
Arithmetic instructions
FADD(P),FSUB(R)(P)
FIADD,FISUB(R)
FMUL(P)
FIMUL
FDIV(R)(P)
FDIV(R)
FIDIV(R)
FABS, FCHS
FCOM(P), FUCOM(P)
FCOMPP, FUCOMPP
FCOMI(P)
FICOM(P)
FTST
FXAM
FRNDINT
FPREM
FPREM1
Math
FSQRT
FLDPI, etc.
FSIN
FCOS

Operands

Ops

r
m32/64
m80
m80
r
m32/64
m80
m80
r
m
m

1
1
8
60
1
2
13
239
1
1
2
1
8
1
1
4
3
1
3

st0,r
r
AX
m16
m16
m16

r/m
m
r/m
m
r
m
m
r/m
r
m

1
2
1
2
1
2
2
1
1
1
2
2
1
1
1
1
1

1
1
10-162
160-170

Latency
Reciprocal throughput
2
8
14
61
2
8
9
240
0
12
8
3
0
~13
~13

5-6
5-6
10-42

~20
4
19-62
19-65

0.5
1
4
40
0.5
1
20
244
0.5
1
1
0.5
3
0.25
0.25
22
19
3
2

1
2
1
2
5-18

0.5
0.5
0.5
1
1
0.5
0.5
1

10-53
65-210
~160

0.5
65-210
~160

Execution pipes

Domain, notes

P01

fp
fp
fp
fp
fp
fp
fp
fp
inherit
fp
fp
fp
fp

P0 P1 P2 P3
P01

P0 P1 F3
P01
F3
P0 F3
P01
P0 P1 F3
none
none
P0 P2 P3
P0 P2 P3

P01
P01
P01
P01
P01
P01
P01
P01
P01
P01
P0 P1 F3
P01
P01
P01
P0
P0
P0

P01
P01
P0 P1 P3
P0 P1 P3

inherit

fma
fma
fma
fma
fp
fp
fp
fp
fp
fp
fp
fp
fp
fp
fp
fp
fp

FSINCOS
FPTAN
FPATAN
FSCALE
FXTRACT
F2XM1
FYL2X
FYL2XP1

12-166
11-190
10-355
8
12
10
10-175
10-175

Other
FNOP
(F)WAIT
FNCLEX
FNINIT
FNSAVE
FRSTOR

1
1
18
31
103
76

m864
m864

95-160
95-245
60-440
52
10
64-71

300
312

95-160
95-245
60-440
5

0.25
0.25
57
170
300
312

P0 P1 P3
P0 P1 P3
P0 P1 P3
P0 P1 P3
P0 P1 P3
P0 P1 P3
P0 P1 P3
P0 P1 P3

none
none
P0
P0
P0 P1 P2 P3
P0 P3

Integer MMX and XMM instructions


Instruction

Operands

Move instructions
MOVD
r32/64, mm/x
MOVD
mm/x, r32/64
MOVD
mm/x,m32
MOVD
m32,mm/x
MOVQ
mm/x,mm/x
MOVQ
mm/x,m64
MOVQ
m64,mm/x
MOVDQA
xmm,xmm
MOVDQA
xmm,m
MOVDQA
m,xmm
VMOVDQA
ymm,ymm
VMOVDQA
ymm,m256
VMOVDQA
m256,ymm
MOVDQU
xmm,xmm
MOVDQU
xmm,m
MOVDQU
m,xmm
LDDQU
xmm,m
VMOVDQU
ymm,m256
VMOVDQU
m256,ymm
MOVDQ2Q
mm,xmm
MOVQ2DQ
xmm,mm
MOVNTQ
m,mm
MOVNTDQ
m,xmm
MOVNTDQA
xmm,m
PACKSSWB/DW
(x)mm,r/m
PACKUSWB
(x)mm,r/m
PUNPCKH/LBW/WD/DQ
(x)mm,r/m

Ops

Latency
Reciprocal throughput

1
2
1
1
1
1
1
1
1
1
2
2
4
1
1
1
1
2
8
1
1
1
1
1
1
1

8
10
6
5
2
6
5
0
6
5
2
6
5
0
6
5
6
6
6
2
2
6
6
6
2
2

1
1
0.5
1
0.5
0.5
1
0.25
0.5
1
0.5
1
3
0.25
0.5
1
0.5
1-2
10
0.5
0.5
2
2
0.5
1
1


Execution pipes

Notes

P23
P3
none

inherit domain

P3
P23
P3
none
P3

P2 P3
P23
P23
P3
P3
P1
P1
P1

inherit domain

PUNPCKHQDQ
xmm,r/m
PUNPCKLQDQ
xmm,r/m
PSHUFB
(x)mm,r/m
PSHUFD
xmm,xmm,i
PSHUFW
mm,mm,i
PSHUFL/HW
xmm,xmm,i
PALIGNR
(x)mm,r/m,i
PBLENDW
xmm,r/m
MASKMOVQ
mm,mm
MASKMOVDQU
xmm,xmm
PMOVMSKB
r32,mm/x
PEXTRB/W/D/Q
r,x/mm,i
PINSRB/W/D/Q
x/mm,r,i
PMOVSXBW/BD/BQ/WD/WQ/DQ
xmm,xmm
PMOVZXBW/BD/BQ/WD/WQ/DQ
xmm,xmm
VPCMOV
x,x,x,x/m
VPCMOV
y,y,y,y/m
VPPERM
x,x,x,x/m

1
1
1
1
1
1
1
1
31
64
2
2
2

2
2
3
2
2
2
2
2
38
48
10
10
12

1
1
1
1
1
1
1
0.5
37
61
1
1
2

P1
P1
P1
P1
P1
P1
P1
P23
P3
P1 P3
P1 P3
P1 P3
P1

P1

SSE4.1

1
1
2
1

2
2
2
2

1
1
2
1

P1
P1
P1
P1

SSE4.1
AMD XOP
AMD XOP
AMD XOP

(x)mm,r/m

0.5

P23

(x)mm,r/m
x,x
x,m
(x)mm,r/m
(x)mm,r/m
(x)mm,r/m

1
3
4
1
1
1

2
5
5
2
2
2

0.5
2
2
0.5
0.5
0.5

P23
P1 P23
P1 P23
P23
P23
P23

(x)mm,r/m
xmm,r/m
xmm,r/m
(x)mm,r/m
(x)mm,r/m
(x)mm,r/m
(x)mm,r/m

1
1
1
1
1
1
1

4
5
4
4
4
4
2

1
2
1
1
1
1
0.5

P0
P0
P0
P0
P0
P0
P23

(x)mm,r/m
xmm,r/m
(x)mm,r/m
(x)mm,r/m
(x)mm,r/m
x,x,i

1
2
1
1
2
8

2
4
2
2
4
8

0.5
1
0.5
0.5
1
4

P23
P1 P23
P23
P23
P23
P1 P23

VPCOMB/W/D/Q

x,x,x/m,i

0.5

P23

VPCOMUB/W/D/Q

x,x,x/m,i

0.5

P23

Arithmetic instructions
PADDB/W/D/Q/SB/SW/USB/USW
PSUBB/W/D/Q/SB/SW/USB/USW
PHADD/SUB(S)W/D
PHADD/SUB(S)W/D
PCMPEQ/GT B/W/D
PCMPEQQ
PCMPGTQ
PMULLW PMULHW
PMULHUW PMULUDQ
PMULLD
PMULDQ
PMULHRSW
PMADDWD
PMADDUBSW
PAVGB/W
PMIN/MAX SB/SW/SD UB/UW/UD
PHMINPOSUW
PABSB/W/D
PSIGNB/W/D
PSADBW
MPSADBW


SSE4.1

AVX

SSSE3
SSSE3
SSE4.1
SSE4.2

SSE4.1
SSE4.1
SSSE3

SSE4.1
SSSE3
SSSE3
SSE4.1
AMD XOP
latency 0 if i=6,7
AMD XOP
latency 0 if i=6,7

VPHADDBW/BD/BQ/WD/WQ/DQ
VPHADDUBW/BD/BQ/WD/WQ/DQ
VPHSUBBW/WD/DQ
VPMACSWW/WD
VPMACSDD
VPMACSDQH/L
VPMACSSWW/WD
VPMACSSDD
VPMACSSDQH/L
VPMADCSWD
VPMADCSSWD
Logic
PAND PANDN POR
PXOR
PSLL/RL W/D/Q
PSRAW/D
PSLL/RL W/D/Q
PSRAW/D
PSLLDQ, PSRLDQ
PTEST
VPROTB/W/D/Q
VPROTB/W/D/Q
VPSHAB/W/D/Q
VPSHLB/W/D/Q
String instructions
PCMPESTRI
PCMPESTRM
PCMPISTRI
PCMPISTRM
Encryption
PCLMULQDQ
AESDEC
AESDECLAST
AESENC
AESENCLAST
AESIMC
AESKEYGENASSIST

x,x/m

0.5

P23

AMD XOP

x,x/m
x,x/m
x,x,x/m,x
x,x,x/m,x
x,x,x/m,x
x,x,x/m,x
x,x,x/m,x
x,x,x/m,x
x,x,x/m,x
x,x,x/m,x

1
1
1
1
1
1
1
1
1
1

2
2
4
5
4
4
5
4
4
4

0.5
0.5
1
2
1
1
2
1
1
1

P23
P23
P0
P0
P0
P0
P0
P0
P0
P0

AMD XOP
AMD XOP
AMD XOP
AMD XOP
AMD XOP
AMD XOP
AMD XOP
AMD XOP
AMD XOP
AMD XOP

(x)mm,r/m

0.5

P23

(x)mm,r/m

P1

(x)mm,i
xmm,i
xmm,r/m
x,x,x/m
x,x,i
x,x,x/m
x,x,x/m

1
1
2
1
1
1
1

2
2
3
2
3
3

1
1
1
1
1
1
1

P1
P1
P1 P3
P1
P1
P1
P1

SSE4.1
AMD XOP
AMD XOP
AMD XOP
AMD XOP

x,x,i
x,x,i
x,x,i
x,x,i

27
27
7
7

17
10
14
7

10
10
3
4

P1 P2 P3
P1 P2 P3
P1 P2 P3
P1 P2 P3

SSE4.2
SSE4.2
SSE4.2
SSE4.2

x,x/m,i
x,x
x,x
x,x
x,x
x,x
x,x,i

5
2
2
2
2
1
1

12
5
5
5
5
5
5

7
2
2
2
2
1
1

P1
P01
P01
P01
P01
P0
P0

pclmul
aes
aes
aes
aes
aes
aes

Execution pipes

Domain, notes

Other
EMMS

0.25

Floating point XMM and YMM instructions


Instruction

Operands

Ops

Latency
Reciprocal throughput

Move instructions
MOVAPS/D
MOVUPS/D
VMOVAPS/D
MOVAPS/D
MOVUPS/D
VMOVAPS/D
VMOVUPS/D
MOVAPS/D
MOVUPS/D
VMOVAPS/D
VMOVUPS/D
MOVSS/D
MOVSS/D
MOVSS/D
MOVHPS/D
MOVLPS/D
MOVHPS/D
MOVLPS/D
MOVLHPS MOVHLPS
MOVMSKPS/D
VMOVMSKPS/D
MOVNTPS/D
VMOVNTPS/D
MOVNTSS/SD
SHUFPS/D
VSHUFPS/D
VPERMILPS/PD
VPERMILPS/PD
VPERMILPS/PD
VPERMILPS/PD
VPERM2F128
VPERM2F128
BLENDPS/PD
VBLENDPS/PD
BLENDVPS/PD
VBLENDVPS/PD
MOVDDUP
MOVDDUP
VMOVDDUP
VMOVDDUP
VBROADCASTSS
VBROADCASTSS
VBROADCASTSD
VBROADCASTF128
MOVSH/LDUP
MOVSH/LDUP
VMOVSH/LDUP
VMOVSH/LDUP
UNPCKH/LPS/D
VUNPCKH/LPS/D
EXTRACTPS

x,x
y,y

1
2

0
2

0.25
0.5

x,m128

0.5

y,m256

1-2

m128,x
m256,y
m256,y
x,x
x,m32/64
m32/64,x

1
4
8
1
1
1

5
5
6
2
6
5

1
3
10
0.5
0.5
1

x,m64
m64,x
m64,x
x,x
r32,x
r32,y
m128,x
m256,y
m,x
x,x/m,i
y,y,y/m,i
x,x,x/m
y,y,y/m
x,x/m,i
y,y/m,i
y,y,y,i
y,y,m,i
x,x/m,i
y,y,y/m,i
x,x/m,xmm0
y,y,y/m,y
x,x
x,m64
y,y
y,m256
x,m32
y,m32
y,m64
y,m128
x,x
x,m128
y,y
y,m256
x,x/m
y,y,y/m
r32,x,i

1
2
1
1
2

7
8
7
2
10

1
1
1
1
1

P1 P3
P3
P1
P1 P3

P3

4
1
2
1
2
1
2
3
4
0.5
1
1
2
1
0.5
2
1
0.5
0.5
0.5
0.5
1
0.5
2
1
1
2
1

P3
P1
P1
P1
P1
P1
P1
P23
P23
P23
P23
P1
P1
P1

SSE4A
ivec
ivec
ivec
ivec
ivec
ivec
ivec
ivec
ivec
ivec
ivec
ivec
ivec

P1

ivec

P23
P23
P23
P1

ivec

P1

ivec

P1
P1
P1 P3

ivec
ivec

1
1
2
1
2
1
2
8
10
1
2
1
2
1
1
2
2
1
2
2
2
1
1
2
2
1
2
2

2
2
3
3
2
2
4
2
2
2
2
2
2
6
6
6
6
2
2
2
2
10

none
P23

inherit domain
ivec

P3
P3
P2 P3
P01

fp

ivec

EXTRACTPS
VEXTRACTF128
VEXTRACTF128
INSERTPS
INSERTPS
VINSERTF128
VINSERTF128
VMASKMOVPS/D
VMASKMOVPS/D
VMASKMOVPS/D
VMASKMOVPS/D

m32,x,i
x,y,i
m128,y,i
x,x,i
x,m32,i
y,y,x,i
y,y,m128,i
x,x,m128
y,y,m256
m128,x,x
m256,y,y

2
1
2
1
1
2
2
1
2
18
34

14
2
7
2
2
9
9
9
22
25

1
1
1
1
1
1
1
0.5
1
7
13

P1 P3
P23
P23
P1
P1
P23
P23
P01
P01
P0 P1 P2 P3
P0 P1 P2 P3

Conversion
CVTPD2PS
VCVTPD2PS
CVTPS2PD
VCVTPS2PD
CVTSD2SS
CVTSS2SD
CVTDQ2PS
VCVTDQ2PS
CVT(T) PS2DQ
VCVT(T) PS2DQ
CVTDQ2PD
VCVTDQ2PD
CVT(T)PD2DQ
VCVT(T)PD2DQ
CVTPI2PS
CVT(T)PS2PI
CVTPI2PD
CVT(T) PD2PI
CVTSI2SS
CVT(T)SS2SI
CVTSI2SD
CVT(T)SD2SI

x,x
x,y
x,x
y,x
x,x
x,x
x,x
y,y
x,x
y,y
x,x
y,x
x,x
x,y
x,mm
mm,x
x,mm
mm,x
x,r32
r32,x
x,r32/64
r32/64,x

2
4
2
4
1
1
1
2
1
2
2
4
2
4
1
1
2
2
2
2
2
2

7
7
7
7
4
4
4
4
4
4
7
8
7
7
4
4
7
7
14
13
14
13

1
2
1
2
1
1
1
2
1
2
1
2
1
2
1
1
1
1
1
1
1
1

P01
P01
P01
P01
P0
P0
P0
P0
P0
P0
P01
P01
P01
P01
P0
P0
P0 P1
P0 P1
P0
P0
P0
P0

fp
fp
fp
fp
fp
fp
fp
fp
fp
fp
fp
fp
fp
fp
fp
fp
fp
fp
fp
fp
fp
fp

x,x/m
x,x/m

1
1

5-6
5-6

0.5
0.5

P01
P01

fma
fma

VADDPS/D VSUBPS/D
ADDSUBPS/D
VADDSUBPS/D

y,y,y/m
x,x/m
y,y,y/m

2
1
2

5-6
5-6
5-6

1
0.5
1

P01
P01
P01

fma
fma
fma

HADDPS/D HSUBPS/D

x,x

10

P01 P1

ivec/fma

HADDPS/D HSUBPS/D
VHADDPS/D
VHSUBPS/D
VHADDPS/D
VHSUBPS/D

x,m128

P01 P1

ivec/fma

y,y,y

P01 P1

ivec/fma

y,y,m

10

P01 P1

ivec/fma

Arithmetic
ADDSS/D SUBSS/D
ADDPS/D SUBPS/D

10


ivec

ivec

MULSS MULSD
MULPS MULPD
VMULPS VMULPD
DIVSS DIVPS
VDIVPS
DIVSD DIVPD
VDIVPD
RCPSS/PS
VRCPPS
CMPSS/D
CMPPS/D
VCMPPS/D
COMISS/D
UCOMISS/D
MAXSS/SD/PS/PD
MINSS/SD/PS/PD

x,x/m
x,x/m
y,y,y/m
x,x/m
y,y,y/m
x,x/m
y,y,y/m
x,x/m
y,y/m

1
1
2
1
2
1
2
1
2

5-6
5-6
5-6
9-24
9-24
9-27
9-27
5
5

0.5
0.5
1
4.5-9.5
9-19
4.5-11
9-22
1
2

P01
P01
P01
P01
P01
P01
P01
P01
P01

fma
fma
fma
fp
fp
fp
fp
fp
fp

x,x/m
y,y,y/m

1
2

2
2

0.5
1

P01
P01

fp
fp

x,x/m

P01 P3

fp

x,x/m

0.5

P01

fp

VMAXPS/D VMINPS/D
y,y,y/m
2
ROUNDSS/SD/PS/PD
x,x/m,i
1
VROUNDSS/SD/PS/PD
y,y/m,i
2
DPPS
x,x,i
16
DPPS
x,m128,i
18
VDPPS
y,y,y,i
25
VDPPS
y,m256,i
29
DPPD
x,x,i
15
DPPD
x,m128,i
17
VFMADDSS/SD
x,x,x,x/m
1
VFMADDPS/PD
x,x,x,x/m
1
VFMADDPS/PD
y,y,y,y/m
2
All other FMA4 instructions: same as above

2
4

1
1

P01
P0

fp
fp

4
25

5-6
5-6
5-6

2
6
7
13
13
5
6
0.5
0.5
1

P0
P01 P23
P01 P23
P01 P3
P01 P3
P01 P23
P01 P23
P01
P01
P01

fp
fma
fma
fma
fma
fma
fma
AMD FMA4
AMD FMA4
AMD FMA4
AMD FMA4

Math
SQRTSS/PS
VSQRTPS
SQRTSD/PD
VSQRTPD
RSQRTSS/PS
VRSQRTPS
VFRCZSS/SD/PS/PD
VFRCZSS/SD/PS/PD

27
15

x,x/m
y,y/m
x,x/m
y,y/m
x,x/m
y,y/m
x,x
x,m

1
2
1
2
1
2
2
3

14-15
14-15
24-26
24-26
5
5
10
10

4.5-12
9-24
4.5-16.5
9-33
1
2
2
2

P01
P01
P01
P01
P01
P01
P01
P01

fp
fp
fp
fp
fp
fp
AMD XOP
AMD XOP

AND/ANDN/OR/XORPS/PD

x,x/m

0.5

P23

ivec

VAND/ANDN/OR/XORPS/PD

y,y,y/m

P23

ivec

Logic

Other
VZEROUPPER
VZEROUPPER

9
16

4
5

32 bit mode
64 bit mode

VZEROALL
VZEROALL
LDMXCSR
STMXCSR
FXSAVE
FXRSTOR
XSAVE
XRSTOR

m32
m32
m4096
m4096
m
m

17
32
1
2
67
116
122
177

10
19
136
176
196
250


6
10
4
19
136
176
196
250

P2 P3
P2 P3
P0 P3
P0 P3
P0 P1 P2 P3
P0 P1 P2 P3
P0 P1 P2 P3
P0 P1 P2 P3

32 bit mode
64 bit mode

Piledriver

AMD Piledriver
List of instruction timings and macro-operation breakdown
Explanation of column headings:
Instruction: Instruction name. cc means any condition code. For example, Jcc can be JB, JNE, etc.

Operands: i = immediate constant, r = any register, r32 = 32-bit register, etc., mm = 64-bit mmx register, x = 128-bit xmm register, y = 256-bit ymm register, m = any memory operand including indirect operands, m64 means 64-bit memory operand, etc.

Ops: Number of macro-operations issued from the instruction decoder to the schedulers. Instructions with more than 2 macro-operations use microcode.

Latency: This is the delay that the instruction generates in a dependency chain. The numbers are minimum values. Cache misses, misalignment, and exceptions may increase the clock counts considerably. Floating point operands are presumed to be normal numbers. Denormal numbers, NaNs, infinity and exceptions increase the delays. The latency listed does not include the memory operand where the listings for register and memory operands are joined (r/m).

Reciprocal throughput: This is also called issue latency. This value indicates the average number of clock cycles from when the execution of an instruction begins until a subsequent independent instruction of the same kind can begin to execute. A value of 1/3 indicates that the execution units can handle 3 instructions per clock cycle in one thread. However, the throughput may be limited by other bottlenecks in the pipeline.
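The difference between the Latency and Reciprocal throughput columns can be illustrated with a small Python sketch. The function names are illustrative only and do not come from this manual, and the sample values in the usage lines (latency 5, reciprocal throughput 0.5) are chosen as an example. The results are lower bounds that ignore the other pipeline bottlenecks mentioned above.

```python
def chain_cycles(count, latency):
    """Lower bound for `count` instructions that each depend on the previous one."""
    return count * latency


def independent_cycles(count, reciprocal_throughput):
    """Lower bound for `count` mutually independent instructions of the same kind."""
    return count * reciprocal_throughput


def instructions_per_clock(reciprocal_throughput):
    """Convert a reciprocal throughput value into instructions per clock cycle."""
    return 1.0 / reciprocal_throughput


if __name__ == "__main__":
    # 100 instructions with latency 5 and reciprocal throughput 0.5 (sample values)
    print(chain_cycles(100, 5))           # dependency chain: latency bound
    print(independent_cycles(100, 0.5))   # independent instructions: throughput bound
    print(instructions_per_clock(1 / 3))  # a value of 1/3 means 3 per clock cycle
```

A long dependency chain is limited by the latency figure, while a stream of independent instructions is limited by the reciprocal throughput figure, so the two columns answer different questions about the same instruction.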

Execution pipe: Indicates which execution pipe or unit is used for the macro-operations.

Integer pipes:
EX0: integer ALU, division
EX1: integer ALU, multiplication, jump
EX01: can use either EX0 or EX1
AG01: address generation unit 0 or 1

Floating point and vector pipes:
P0: floating point add, mul, div, convert, shuffle, shift
P1: floating point add, mul, div, shuffle, shift
P2: move, integer add, boolean
P3: move, integer add, boolean, store
P01: can use either P0 or P1
P23: can use either P2 or P3

Two macro-operations can execute simultaneously if they go to different execution pipes.

Domain: Tells which execution unit domain is used.

ivec: integer vector execution unit.
fp: floating point execution unit.
fma: floating point multiply/add subunit.
inherit: the output operand inherits the domain of the input operand.
ivec/fma: the input goes to the ivec domain and the output comes from the fma domain.

There is an additional latency of 1 clock cycle if the output of an ivec instruction goes to the input of an fp or fma instruction, and when the output of an fp or fma instruction goes to the input of an ivec or store instruction. There is no latency between the fp and fma units. All other latencies after memory load and before memory store instructions are included in the latency counts.

An fma instruction has a latency of 5 if the output goes to another fma instruction, 6 if the output goes to an fp instruction, and 6+1 if the output goes to an ivec or store instruction.
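The domain rules above can be written down as a small Python sketch. The function names and the use of the string "store" as a pseudo-domain for a store instruction are assumptions made for illustration. The cycle counts are the ones stated in the text above and nothing else is modelled.

```python
FP_LIKE = {"fp", "fma"}


def bypass_delay(producer_domain, consumer_domain):
    """Extra clock cycles when a result crosses between domains."""
    if producer_domain == "ivec" and consumer_domain in FP_LIKE:
        return 1
    if producer_domain in FP_LIKE and consumer_domain in {"ivec", "store"}:
        return 1
    return 0  # same domain, or between fp and fma


def fma_latency(consumer_domain):
    """Latency of an fma-domain instruction as seen by the consuming instruction."""
    if consumer_domain == "fma":
        return 5
    # 6 towards fp, and 6+1 towards ivec or store
    return 6 + bypass_delay("fma", consumer_domain)


if __name__ == "__main__":
    for consumer in ("fma", "fp", "ivec", "store"):
        print(consumer, fma_latency(consumer))
```

This is why the tables list fma-domain latencies as a range such as 5-6: the value that applies depends on which kind of instruction consumes the result.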


Integer instructions
Instruction
Move instructions
MOV
MOV
MOV
MOV
MOV
MOV
MOV
MOV
MOVNTI
MOVZX, MOVSX
MOVZX, MOVSX
MOVZX, MOVSX
MOVSX
MOVZX
MOVSXD
MOVSXD
CMOVcc
CMOVcc
XCHG
XCHG
XCHG
XCHG
XCHG
XLAT
PUSH
PUSH
PUSH
PUSHF(D/Q)
PUSHA(D)
POP
POP
POPF(D/Q)
POPA(D)
LEA
LEA
LEA
LEA
LAHF
SAHF
SALC
BSWAP
PREFETCHNTA
PREFETCHT0/1/2
PREFETCH/W
SFENCE

Operands

Ops

r8,r8
r16,r16
r32,r32
r64,r64
r,i
r,m
m,r
m,i
m,r
r16,r8
r32,r
r64,r
r,m
r,m
r64,r32
r64,m32
r,r
r,m
r8,r8
r16,r16
r32,r32
r64,r64

1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
2
2
2
2

1
1
1
1
1
4
4

r,m

2
2
1
1
2
8
9
1
2
34
14
2
2

~40
6

1
1
4
2
1
1
1
1
1
7

2
1
3
2
1
1

r
i
m

r
m

r16,[m]
r32,[m]
r32/64,[m]
r32/64,[m]

r
m
m
m

Latency
Reciprocal throughput

4
1
1
1
5
4
1
5
1
1
1
1
1

0.5
0.5
0.3
0.3
0.5
0.5
1
1
2
1
0.5
0.5
0.5
0.5
0.5
0.5
0.5
0.5
1
1
0.5
0.5

EX01
EX01
EX01 or AG01
EX01 or AG01
EX01
AG01
EX01 AG01

~40
2
1
1
1
4
9
1
1
18
8

EX01

2-3
2-3


Execution pipes

all addr. modes


all addr. modes

EX01
EX01
EX01
EX01
EX01
EX01
EX01
EX01
EX01
EX01
EX01
EX01
EX01

EX01
EX01
0.5
0.5
2
1
1
0.5
0.5
0.5
0.5
81

Notes

EX01
EX01

Timing depends on hw

any addr. size


16 bit addr. size
scale factor > 1 or 3 operands
all other cases

EX01

PREFETCHW

LFENCE
MFENCE

1
7

Arithmetic instructions
ADD, SUB
r,r
ADD, SUB
r,i
ADD, SUB
r,m
ADD, SUB
m,r
ADD, SUB
m,i
ADC, SBB
r,r
ADC, SBB
r,i
ADC, SBB
r,m
ADC, SBB
m,r
ADC, SBB
m,i
CMP
r,r
CMP
r,i
CMP
r,m
CMP
m,i
INC, DEC, NEG
r
INC, DEC, NEG
m
AAA, AAS
DAA
DAS
AAD
AAM
MUL, IMUL
r8/m8
MUL, IMUL
r16/m16
MUL, IMUL
r32/m32
MUL, IMUL
r64/m64
IMUL
r16,r16/m16
IMUL
r32,r32/m32
IMUL
r64,r64/m64
IMUL
r16,(r16),i
IMUL
r32,(r32),i
IMUL
r64,(r64),i
IMUL
r16,m16,i
IMUL
r32,m32,i
IMUL
r64,m64,i
DIV
r8/m8
DIV
r16/m16
DIV
r32/m32
DIV
r64/m64
IDIV
r8/m8
IDIV
r16/m16
IDIV
r32/m32
IDIV
r64/m64
CBW, CWDE, CDQE
CDQ, CQO
CWD

1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
10
16
20
4
10
1
2
1
1
1
1
1
2
1
1
2
2
2
9
7
2
2
9
7
2
2
1
1
2

17-22
13-26
12-40
13-71
17-21
13-26
13-40
13-71
1
1
1

Logic instructions
AND, OR, XOR
AND, OR, XOR

1
1

1
1

r,r
r,i

0.25
81

1
1
7-8
7-8
1
1
1
9
9
1
1

1
7-8
6
9
10
6
15
4
4
4
6
4
4
6
5
4
6


0.5
0.5
0.5
1
1

1
1
1
0.5
0.5
0.5
0.5
0.5
1

15
2
2
2
4
2
2
4
2
2
4
2
2
4
13-22
13-25
12-40
13-71
13-18
13-25
13-40
13-71

EX01
EX01
EX01
EX01
EX01
EX01
EX01
EX01
EX01
EX01
EX01
EX01
EX01
EX01
EX01
EX01

0.5
1

EX1
EX1
EX1
EX1
EX1
EX1
EX1
EX1
EX1
EX1
EX1
EX1
EX1
EX0
EX0
EX0
EX0
EX0
EX0
EX0
EX0
EX01
EX01
EX01

0.5
0.5

EX01
EX01

AND, OR, XOR
AND, OR, XOR
AND, OR, XOR
TEST
TEST
TEST
TEST
NOT
NOT
ANDN
SHL, SHR, SAR
ROL, ROR
RCL
RCL
RCL
RCR
RCR
RCR
SHLD, SHRD
SHLD, SHRD
SHLD, SHRD
BT
BT
BT
BTC, BTR, BTS
BTC, BTR, BTS
BTC, BTR, BTS
BSF
BSF
BSR
BSR
SETcc
SETcc
CLC, STC
CMC
CLD
STD
POPCNT
POPCNT
LZCNT
TZCNT
BEXTR
BEXTR
BLSI
BLSMSK
BLSR
BLCFILL
BLCI
BLCIC
BLCMSK
BLCS
BLSFILL

r,m
m,r
m,i
r,r
r,i
m,r
m,i
r
m
r,r,r
r,i/CL
r,i/CL
r,1
r,i
r,cl
r,1
r,i
r,cl
r,r,i
r,r,cl
m,r,i/CL
r,r/i
m,i
m,r
r,r/i
m,i
m,r
r,r
r,m
r,r
r,m
r
m

r16/32,r16/32
r64,r64
r,r
r,r
r,r,r
r,r,i
r,r
r,r
r,r
r,r
r,r
r,r
r,r
r,r
r,r

1
1
1
1
1
1
1
1
1
1
1
1
1
16
17
1
15
16
6
7
8
1
1
7
2
4
10
6
8
7
9
1
1
1
1
2
2
1
1
1
2
2
2
2
2
2
2
2
2
2
2
2

7-8
7-8
1
1

1
7-8
1
1
1
1
7
7
1
7
6
3
3
1

2
20
21
3
4
4
1

0.5
1
1
0.5
0.5
0.5
0.5
0.5
1
0.5
0.5
0.5

3
3
3.5
0.5
0.5
3.5
1

3
4
4
5
0.5
1
0.5

4
4
2
2
2
2
2
2
2
2
2
2
2
2
2

3
4
2
4
2
2
0.67
0.67
1
1
1
1
1
1
1
1
1

EX01
EX01
EX01
EX01
EX01
EX01
EX01
EX01
EX01
EX01
EX01
EX01
EX01
EX01
EX01
EX01
EX01
EX01
EX01
EX01
EX01
EX01
EX01
EX01
EX01
EX01
EX01
EX01
EX01
EX01
EX01
EX01
EX01
EX01
EX01

EX0

BMI1

SSE4.2
SSE4.2
LZCNT
BMI1
BMI1
AMD TBM
BMI1
BMI1
BMI1
AMD TBM
AMD TBM
AMD TBM
AMD TBM
AMD TBM
AMD TBM

BLSI
BLSIC
T1MSKC
TZMSK

r,r
r,r
r,r
r,r

Control transfer instructions


JMP
short/near
JMP
r
JMP
m
Jcc
short/near
fused CMP+Jcc
short/near
J(E/R)CXZ
short
LOOP
short
LOOPE LOOPNE
short
CALL
near
CALL
r
CALL
m
RET
RET
i
BOUND
m
INTO
String instructions
LODS
REP LODS
REP LODS
STOS
REP STOS
REP STOS
MOVS
REP MOVS
REP MOVS
SCAS
REP SCAS
CMPS
REP CMPS
Synchronization
LOCK ADD
XADD
LOCK XADD
CMPXCHG
LOCK CMPXCHG
CMPXCHG
LOCK CMPXCHG
CMPXCHG8B
LOCK CMPXCHG8B
CMPXCHG16B
LOCK CMPXCHG16B
Other
NOP (90)

m8/m16
m32/m64

m,r
m,r
m,r
m,r8/16
m,r8/16
m,r32/64
m,r32/64
m64
m64
m128
m128

2
2
2
2

2
2
2
2

1
1
1
1

1
1
1
1
1
1
1
1
2
2
3
1
4
11
4

2
2
2
1-2
1-2
1-2
1-2
1-2
2
2
2
2
2
5
2

3
6n
6n
3
1n
3 per 16B
5
1-3n
4.5 per 16B
3
7n
6
9n

3
3n
2.5n
3
1n
3 per 16B
3
1n
3 per 16B
3
3-4n
3
4n

1
4
4
5
5
6
6
18
18
22
22

AMD TBM
AMD TBM
AMD TBM
AMD TBM

EX1
EX1
EX1
EX1
EX1
EX1
EX1
EX1
EX1
EX1
EX1
EX1
EX1

for no jump
for no jump

small n
best case
small n
best case

~40
20
~39
23
~40
20
~40
25
~42
66
~80

0.25

2 if jumping
2 if jumping
2 if jumping
2 if jumping
2 if jumping

none

Long NOP (0F 1F)
PAUSE
ENTER
ENTER
LEAVE
CPUID
XGETBV
RDTSC
RDPMC
CRC32
CRC32
CRC32

a,0
a,b

r32,r8
r32,r16
r32,r32

1
40
13
20+3b
2
38-64
4
36
21
3
5
5

3
5
6

0.25
40
21
16+4b
4
105-271
30
42
310
2
5
6

none

Floating point x87 instructions


Instruction
Move instructions
FLD
FLD
FLD
FBLD
FST(P)
FST(P)
FSTP
FBSTP
FXCH
FILD
FIST(T)(P)
FLDZ, FLD1
FCMOVcc
FFREE
FINCSTP, FDECSTP
FNSTSW
FNSTSW
FLDCW
FNSTCW
Arithmetic instructions
FADD(P),FSUB(R)(P)
FIADD,FISUB(R)
FMUL(P)
FIMUL
FDIV(R)(P)
FDIV(R)
FIDIV(R)
FABS, FCHS
FCOM(P), FUCOM(P)
FCOMPP, FUCOMPP
FCOMI(P)
FICOM(P)

Operands

Ops

r
m32/64
m80
m80
r
m32/64
m80
m80
r
m
m

1
1
8
60
1
2
13
239
1
1
2
1
8
1
1
3
2
1
2

2
7
20
64
2
7
22
220
0
11
7

1
2
1
2
1
1
2
1
1
1
2
2

5-6

st0,r
r
AX
m16
m16
m16

r/m
m
r/m
m
r
m
m
r/m
r
m

Latency
Reciprocal throughput

3
0

5-6
9-40


0.5
1
4
35
0.5
1
20
0.5
1
1
0.5
3
0.25
0.25
19
17
3
2

1
2
1
2
4-16

0.5
0.5
0.5
1
1

Execution pipes

Domain, notes

P01

fp
fp
fp
fp
fp
fp
fp
fp
inherit
fp
fp
fp
fp

P0 P1 P2 P3
P01

P0 P1 F3
P01
F3
P0 F3
P01
P0 P1 F3
none
none
P0 P2 P3
P0 P2 P3

P01
P01
P01
P01
P01
P01
P01
P01
P01
P01
P0 P1 F3
P01

inherit

fma
fma
fma
fma
fp
fp
fp
fp
fp
fp
fp
fp

FTST
FXAM
FRNDINT
FPREM
FPREM1

1
1
1
1
1

Math
FSQRT
FLDPI, etc.
FSIN
FCOS
FSINCOS
FPTAN
FPATAN
FSCALE
FXTRACT
F2XM1
FYL2X
FYL2XP1
Other
FNOP
(F)WAIT
FNCLEX
FNINIT
FNSAVE
FRSTOR

~20
4
17-60
17-60

1
14-50
1
10-162 60-210
160-170
~154
12-166 86-141
11-190 166-231
10-355 60-352
8
44
12
7
10
60-73
10-176
10-176

m864
m864

1
1
18
31
103
76

300
236

0.5
0.5
1

P01
P01
P0
P0
P0

5-20
0.5
60-146
~154
86-141
86-204
60-352
5
5

P01
P01
P0 P1 P3
P0 P1 P3
P0 P1 P3
P0 P1 P3
P0 P1 P3
P0 P1 P3
P0 P1 P3
P0 P1 P3
P0 P1 P3
P0 P1 P3

0.25
0.25
54
134
300
236

none
none
P0
P0
P0 P1 P2 P3
P0 P3

fp
fp
fp
fp
fp

Integer MMX and XMM instructions


Instruction
Move instructions
MOVD
MOVD
MOVD
MOVD
MOVQ
MOVQ
MOVQ
MOVDQA
MOVDQA
MOVDQA
VMOVDQA
VMOVDQA
VMOVDQA
MOVDQU
MOVDQU
MOVDQU
LDDQU
VMOVDQU

Operands

Ops

r32/64, mm/x
mm/x, r32/64
mm/x,m32
m32,mm/x
mm/x,mm/x
mm/x,m64
m64,mm/x
xmm,xmm
xmm,m
m,xmm
ymm,ymm
ymm,m256
m256,ymm
xmm,xmm
xmm,m
m,xmm
xmm,m
ymm,m256

1
2
1
1
1
1
1
1
1
1
2
2
4
1
1
1
1
2

Latency
Reciprocal throughput
8
10
6
5
2
6
5
0
6
5
2
6
11
0
6
5
6
6

1
1
0.5
1
0.5
0.5
1
0.25
0.5
1
0.5
1
17
0.25
0.5
1
0.5
1

Execution pipes

Notes

P3

P3
P23
P3
none

inherit domain

P3
P23
P3
none
P3

inherit domain

VMOVDQU
MOVDQ2Q
MOVQ2DQ
MOVNTQ
MOVNTDQ
MOVNTDQA
PACKSSWB/DW
PACKUSWB
PUNPCKH/LBW/WD/DQ
PUNPCKHQDQ
PUNPCKLQDQ
PSHUFB
PSHUFD
PSHUFW
PSHUFL/HW
PALIGNR
PBLENDW
MASKMOVQ
MASKMOVDQU
PMOVMSKB
PEXTRB/W/D/Q
PINSRB/W/D/Q
EXTRQ
EXTRQ
INSERTQ
INSERTQ
PMOVSXBW/BD/BQ/WD/WQ/DQ
PMOVZXBW/BD/BQ/WD/WQ/DQ
VPCMOV
VPCMOV
VPPERM
Arithmetic instructions
PADDB/W/D/Q/SB/SW/USB/USW
PSUBB/W/D/Q/SB/SW/USB/USW
PHADD/SUB(S)W/D
PHADD/SUB(S)W/D
PCMPEQ/GT B/W/D
PCMPEQQ
PCMPGTQ
PMULLW PMULHW
PMULHUW PMULUDQ
PMULLD
PMULDQ
PMULHRSW
PMADDWD

m256,ymm
mm,xmm
xmm,mm
m,mm
m,xmm
xmm,m
(x)mm,r/m
(x)mm,r/m

8
1
1
1
1
1
1
1

14
2
2
5
5
6
2
2

20
0.5
0.5
2
2
0.5
1
1

P2 P3
P23
P23
P3
P3

(x)mm,r/m
xmm,r/m
xmm,r/m
(x)mm,r/m
xmm,xmm,i
mm,mm,i
xmm,xmm,i
(x)mm,r/m,i
xmm,r/m
mm,mm
xmm,xmm
r32,mm/x
r,x/mm,i
x/mm,r,i
x,i,i
x,x
x,x,i,i
x,x

1
1
1
1
1
1
1
1
1
31
64
2
2
2
1
1
1
1

2
2
2
3
2
2
2
2
2
36
59
10
10
12
3
1
1
1

1
1
1
1
1
1
1
1
0.5
59
92
1
1
2
1
1
1
1

P1
P1
P1
P1
P1
P1
P1
P1
P23
P3
P1 P3
P1 P3
P1 P3
P1
P1
P1
P1
P1

AMD SSE4A
AMD SSE4A
AMD SSE4A
AMD SSE4A

x,x

P1

SSE4.1

x,x
x,x,x,x/m
y,y,y,y/m
x,x,x,x/m

1
1
2
1

2
2
2
2

1
1
2
1

P1
P1
P1
P1

SSE4.1
AMD XOP
AMD XOP
AMD XOP

(x)mm,r/m

0.5

P23

(x)mm,r/m
x,x
x,m
(x)mm,r/m
(x)mm,r/m
(x)mm,r/m

1
3
4
1
1
1

2
5
5
2
2
2

0.5
2
2
0.5
0.5
0.5

P23
P1 P23
P1 P23
P23
P23
P23

(x)mm,r/m
x,r/m
x,r/m
(x)mm,r/m
(x)mm,r/m

1
1
1
1
1

4
5
4
4
4

1
2
1
1
1

P0
P0
P0
P0
P0


P1
P1

SSE4.1

SSE4.1

SSSE3
SSSE3
SSE4.1
SSE4.2

SSE4.1
SSE4.1
SSSE3

PMADDUBSW
PAVGB/W
PMIN/MAX SB/SW/SD UB/UW/UD
PHMINPOSUW
PABSB/W/D
PSIGNB/W/D
PSADBW
MPSADBW

(x)mm,r/m
(x)mm,r/m

1
1

4
2

1
0.5

P0
P23

(x)mm,r/m
x,r/m
(x)mm,r/m
(x)mm,r/m
(x)mm,r/m
x,x,i

1
2
1
1
2
8

2
4
2
2
4
8

0.5
1
0.5
0.5
1
4

P23
P1 P23
P23
P23
P23
P1 P23

VPCOMB/W/D/Q

x,x,x/m,i

0.5

P23

VPCOMUB/W/D/Q
VPHADDBW/BD/BQ/WD/WQ/DQ
VPHADDUBW/BD/BQ/WD/WQ/DQ
VPHSUBBW/WD/DQ
VPMACSWW/WD
VPMACSDD
VPMACSDQH/L
VPMACSSWW/WD
VPMACSSDD
VPMACSSDQH/L
VPMADCSWD
VPMADCSSWD

x,x,x/m,i

0.5

P23

SSE4.1
AMD XOP
latency 0 if i=6,7
AMD XOP
latency 0 if i=6,7

x,x/m

0.5

P23

AMD XOP

x,x/m
x,x/m
x,x,x/m,x
x,x,x/m,x
x,x,x/m,x
x,x,x/m,x
x,x,x/m,x
x,x,x/m,x
x,x,x/m,x
x,x,x/m,x

1
1
1
1
1
1
1
1
1
1

2
2
4
5
4
4
5
4
4
4

0.5
0.5
1
2
1
1
2
1
1
1

P23
P23
P0
P0
P0
P0
P0
P0
P0
P0

AMD XOP
AMD XOP
AMD XOP
AMD XOP
AMD XOP
AMD XOP
AMD XOP
AMD XOP
AMD XOP
AMD XOP

(x)mm,r/m

0.5

P23

(x)mm,r/m

P1

(x)mm,i
x,i
x,r/m
x,x,x/m
x,x,i
x,x,x/m
x,x,x/m

1
1
2
1
1
1
1

2
2
3
2
3
3

1
1
1
1
1
1
1

P1
P1
P1 P3
P1
P1
P1
P1

SSE4.1
AMD XOP
AMD XOP
AMD XOP
AMD XOP

x,x,i
x,x,i
x,x,i
x,x,i

27
27
7
7

16
10
13
7

10
10
3
4

P1 P2 P3
P1 P2 P3
P1 P2 P3
P1 P2 P3

SSE4.2
SSE4.2
SSE4.2
SSE4.2

x,x/m,i
x,x,x,i
x,x,m,i
x,x

5
6
7
2

12
12
12
5

7
7
7
2

P1
P1
P1
P01

pclmul
pclmul
pclmul
aes

Logic
PAND PANDN POR
PXOR
PSLL/RL W/D/Q
PSRAW/D
PSLL/RL W/D/Q
PSRAW/D
PSLLDQ, PSRLDQ
PTEST
VPROTB/W/D/Q
VPROTB/W/D/Q
VPSHAB/W/D/Q
VPSHLB/W/D/Q
String instructions
PCMPESTRI
PCMPESTRM
PCMPISTRI
PCMPISTRM
Encryption
PCLMULQDQ
VPCLMULQDQ
PCLMULQDQ
AESDEC


SSE4.1
SSSE3
SSSE3

AESDECLAST
AESENC
AESENCLAST
AESIMC
AESKEYGENASSIST

x,x
x,x
x,x
x,x
x,x,i

Other
EMMS

2
2
2
1
1

5
5
5
5
5

2
2
2
1
1

P01
P01
P01
P0
P0

aes
aes
aes
aes
aes

Execution pipes

Domain, notes

none
P23

inherit domain
ivec

P3
P3
P2 P3
P01

fp

0.25

Floating point XMM and YMM instructions


Instruction
Move instructions
MOVAPS/D
MOVUPS/D
VMOVAPS/D
MOVAPS/D
MOVUPS/D
VMOVAPS/D
VMOVUPS/D
MOVAPS/D
MOVUPS/D
VMOVAPS/D
VMOVUPS/D
MOVSS/D
MOVSS/D
MOVSS/D
MOVHPS/D
MOVLPS/D
MOVHPS/D
MOVLPS/D
MOVLHPS MOVHLPS
MOVMSKPS/D
VMOVMSKPS/D
MOVNTPS/D
VMOVNTPS/D
MOVNTSS/SD
SHUFPS/D
VSHUFPS/D
VPERMILPS/PD
VPERMILPS/PD
VPERMILPS/PD
VPERMILPS/PD
VPERM2F128
VPERM2F128
BLENDPS/PD
VBLENDPS/PD
BLENDVPS/PD
VBLENDVPS/PD
MOVDDUP

Operands

Ops

Latency
Reciprocal throughput

x,x
y,y

1
2

0
2

0.25
0.5

x,m128

0.5

y,m256

m128,x
m256,y
m256,y
x,x
x,m32/64
m32/64,x
x,m64
x,m64
m64,x
m64,x
x,x
r32,x
r32,y
m128,x
m256,y
m,x
x,x/m,i
y,y,y/m,i
x,x,x/m
y,y,y/m
x,x/m,i
y,y/m,i
y,y,y,i
y,y,m,i
x,x/m,i
y,y,y/m,i
x,x/m,xmm0
y,y,y/m,y
x,x

1
4
8
1
1
1
1
1
2
1
1
2
2
1
4
1
1
2
1
2
1
2
8
10
1
2
1
2
1

5
11
15
2
6
5
8
7
7
6
2
10

1
17
20
0.5
0.5
1
1
0.5
1
1
1
1
1
2
18
4
1
2
1
2
1
2
3
4
0.5
1
1
2
1

2
2
3
3
2
2
4
2
2
2
2
2

P1
P01
P1 P3
P3
P1
P1 P3

ivec

P3
P3
P1
P1
P1
P1
P1
P1
P23
P23
P23
P23
P1
P1
P1

AMD SSE4A
ivec
ivec
ivec
ivec
ivec
ivec
ivec
ivec
ivec
ivec
ivec
ivec
ivec

MOVDDUP
VMOVDDUP
VMOVDDUP
VBROADCASTSS
VBROADCASTSS
VBROADCASTSD
VBROADCASTF128
MOVSH/LDUP
MOVSH/LDUP
VMOVSH/LDUP
VMOVSH/LDUP
UNPCKH/LPS/D
VUNPCKH/LPS/D
EXTRACTPS
EXTRACTPS
VEXTRACTF128
VEXTRACTF128
INSERTPS
INSERTPS
VINSERTF128
VINSERTF128
VMASKMOVPS/D
VMASKMOVPS/D
VMASKMOVPS/D
VMASKMOVPS/D
Conversion
CVTPD2PS
VCVTPD2PS
CVTPS2PD
VCVTPS2PD
CVTSD2SS
CVTSS2SD
CVTDQ2PS
VCVTDQ2PS
CVT(T) PS2DQ
VCVT(T) PS2DQ
CVTDQ2PD
VCVTDQ2PD
CVT(T)PD2DQ
VCVT(T)PD2DQ
CVTPI2PS
CVT(T)PS2PI
CVTPI2PD
CVT(T) PD2PI
CVTSI2SS
CVT(T)SS2SI
CVTSI2SD
CVT(T)SD2SI
VCVTPS2PH
VCVTPS2PH

x,m64
y,y
y,m256
x,m32
y,m32
y,m64
y,m128
x,x
x,m128
y,y
y,m256
x,x/m
y,y,y/m
r32,x,i
m32,x,i
x,y,i
m128,y,i
x,x,i
x,m32,i
y,y,x,i
y,y,m128,i
x,x,m128
y,y,m256
m128,x,x
m256,y,y

1
2
2
1
2
2
2
1
1
2
2
1
2
2
2
1
2
1
1
2
2
1
2
18
34

x,x
x,y
x,x
y,x
x,x
x,x
x,x
y,y
x,x
y,y
x,x
y,x
x,x
x,y
x,mm
mm,x
x,mm
mm,x
x,r32
r32,x
x,r32/64
r32/64,x
x/m,x,i
x/m,y,i

2
4
2
4
1
1
1
2
1
2
2
4
2
4
2
1
2
2
2
2
2
2
2
4

6
2
6
2
7
2
13
7
13
~100
~190

0.5
2
1
0.5
0.5
0.5
0.5
1
0.5
2
1
1
2
1
1
0.5
1
1
2
1
1
0.5
1
~90
~180

8
7
8
8
4
4
4
4
4
4
8
8
8
7
8
4
7
7
13
12
13
12
8
8

1
2
1
2
1
1
1
2
1
2
1
2
1
2
1
1
1
1
1
1
1
1
2
2

2
6
6
6
6
2
2
2
2


P1

ivec

P23
P23
P23
P1

ivec

P1

ivec

P1
P1
P1 P3
P1 P3
P23
P23
P1
P1
P23
P23
P01
P01
P0 P1 P2 P3
P0 P1 P2 P3

ivec
ivec

P01
P01
P01
P01
P0
P0
P0
P0
P0
P0
P01
P01
P01
P01
P0 P23
P0
P0 P1
P0 P1
P0
P0 P3
P0
P0 P3
P0 P1
P0 P1

ivec/fp
ivec/fp
ivec/fp
ivec/fp
fp
fp
fp
fp
fp
fp
ivec/fp
ivec/fp
fp/ivec
fp/ivec
ivec/fp
fp
ivec/fp
fp/ivec
fp
fp
fp
fp
F16C
F16C

ivec

ivec

VCVTPH2PS
VCVTPH2PS

x,x/m
y,x/m

2
4

8
8

2
2

P0 P1
P0 P1

F16C
F16C

Arithmetic
ADDSS/D SUBSS/D
ADDPS/D SUBPS/D

x,x/m
x,x/m

1
1

5-6
5-6

0.5
0.5

P01
P01

fma
fma

VADDPS/D VSUBPS/D
ADDSUBPS/D
VADDSUBPS/D

y,y,y/m
x,x/m
y,y,y/m

2
1
2

5-6
5-6
5-6

1
0.5
1

P01
P01
P01

fma
fma
fma

HADDPS/D HSUBPS/D

x,x

10

P01 P1

ivec/fma

HADDPS/D HSUBPS/D
VHADDPS/D
VHSUBPS/D
MULSS MULSD
MULPS MULPD
VMULPS VMULPD
DIVSS DIVPS
VDIVPS
DIVSD DIVPD
VDIVPD
RCPSS/PS
VRCPPS
CMPSS/D
CMPPS/D
VCMPPS/D
COMISS/D
UCOMISS/D
MAXSS/SD/PS/PD
MINSS/SD/PS/PD

x,m

P01 P1

ivec/fma

y,y,y/m
x,x/m
x,x/m
y,y,y/m
x,x/m
y,y,y/m
x,x/m
y,y,y/m
x,x/m
y,y/m

8
1
1
2
1
2
1
2
1
2

10
5-6
5-6
5-6
9-24
9-24
9-27
9-27
5
5

4
0.5
0.5
1
5-10
9-20
5-10
9-18
1
2

P01 P1
P01
P01
P01
P01
P01
P01
P01
P01
P01

ivec/fma
fma
fma
fma
fp
fp
fp
fp
fp
fp

x,x/m
y,y,y/m

1
2

2
2

0.5
1

P01
P01

fp
fp

x,x/m

P01 P3

fp

x,x/m

0.5

P01

fp

VMAXPS/D VMINPS/D
y,y,y/m
2
ROUNDSS/SD/PS/PD
x,x/m,i
1
VROUNDSS/SD/PS/PD
y,y/m,i
2
DPPS
x,x,i
16
DPPS
x,m,i
18
VDPPS
y,y,y,i
25
VDPPS
y,m,i
29
DPPD
x,x,i
15
DPPD
x,m,i
17
VFMADD132SS/SD
x,x,x/m
1
VFMADD132PS/PD
x,x,x/m
1
VFMADD132PS/PD
y,y,y/m
2
All other FMA3 instructions: same as above
VFMADDSS/SD
x,x,x,x/m
1
VFMADDPS/PD
x,x,x,x/m
1
VFMADDPS/PD
y,y,y,y/m
2
All other FMA4 instructions: same as above

2
4

1
1

P01
P0

fp
fp

4
25

5-6
5-6
5-6

2
6
7
13
13
5
6
1
1
1

P0
P01 P23
P01 P23
P01 P3
P01 P3
P01 P23
P01 P23
P01
P01
P01

5-6
5-6
5-6

0.5
0.5
1

P01
P01
P01

fp
SSE4.1
SSE4.1
SSE4.1
SSE4.1
SSE4.1
SSE4.1
FMA3
FMA3
FMA3
FMA3
AMD FMA4
AMD FMA4
AMD FMA4
AMD FMA4

27
15

Math
SQRTSS/PS
VSQRTPS
SQRTSD/PD
VSQRTPD
RSQRTSS/PS
VRSQRTPS
VFRCZSS/SD/PS/PD
VFRCZSS/SD/PS/PD

x,x/m
y,y/m
x,x/m
y,y/m
x,x/m
y,y/m
x,x
x,m

1
2
1
2
1
2
2
3

13-15
14-15
24-26
24-26
5
5
10
10

5-12
9-24
5-15
9-29
1
2
2
2

P01
P01
P01
P01
P01
P01
P01
P01

fp
fp
fp
fp
fp
fp
AMD XOP
AMD XOP

AND/ANDN/OR/XORPS/PD

x,x/m

0.5

P23

ivec

VAND/ANDN/OR/XORPS/PD

y,y,y/m

P23

ivec

136
176
196
250

4
5
6
10
34
17
136
176
196
250

P2 P3
P2 P3
P2 P3
P2 P3
P0 P3
P0 P3
P0 P1 P2 P3
P0 P1 P2 P3
P0 P1 P2 P3
P0 P1 P2 P3

32 bit mode
64 bit mode
32 bit mode
64 bit mode

m32
m32
m4096
m4096
m
m

9
16
17
32
7
2
67
116
122
177

Logic

Other
VZEROUPPER
VZEROUPPER
VZEROALL
VZEROALL
LDMXCSR
STMXCSR
FXSAVE
FXRSTOR
XSAVE
XRSTOR


Steamroller

AMD Steamroller
List of instruction timings and macro-operation breakdown
Explanation of column headings:
Instruction: Instruction name. cc means any condition code. For example, Jcc can be JB, JNE, etc.

Operands: i = immediate constant, r = any register, r32 = 32-bit register, etc., mm = 64-bit mmx register, x = 128-bit xmm register, y = 256-bit ymm register, m = any memory operand including indirect operands, m64 means 64-bit memory operand, etc.

Ops: Number of macro-operations issued from the instruction decoder to the schedulers. Instructions with more than 2 macro-operations use microcode.

Latency: This is the delay that the instruction generates in a dependency chain. The numbers are minimum values. Cache misses, misalignment, and exceptions may increase the clock counts considerably. The latency listed does not include the memory operand where the listings for register and memory operands are joined (r/m).

Reciprocal throughput: This is also called issue latency. This value indicates the average number of clock cycles from when the execution of an instruction begins until a subsequent independent instruction of the same kind can begin to execute. A value of 1/3 indicates that the execution units can handle 3 instructions per clock cycle in one thread. However, the throughput may be limited by other bottlenecks in the pipeline.

Execution pipe: Indicates which execution pipe or unit is used for the macro-operations.

Integer pipes:
EX0: integer ALU, division
EX1: integer ALU, multiplication, jump
EX01: can use either EX0 or EX1
AG01: address generation unit 0 or 1

Floating point and vector pipes:
P0: floating point add, mul, div. Integer add, mul, bool
P1: floating point add, mul, div. Shuffle, shift, pack
P2: Integer add. Bool, store
P01: can use either P0 or P1
P02: can use either P0 or P2

Two macro-operations can execute simultaneously if they go to different execution pipes.
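The rule that two macro-operations can execute simultaneously only if they go to different pipes can be checked mechanically. The following Python sketch uses the Steamroller floating point and vector pipe labels listed above. The function name and the backtracking search are illustrative assumptions, not something taken from the measurements, and the sketch ignores decoder, scheduler and retirement limits.

```python
# Pipe labels from the list above, mapped to the physical pipes they may use.
PIPES = {
    "P0": {"P0"},
    "P1": {"P1"},
    "P2": {"P2"},
    "P01": {"P0", "P1"},
    "P02": {"P0", "P2"},
}


def can_issue_together(macro_ops):
    """Return True if every macro-op can be given its own pipe in the same cycle."""
    choices = [sorted(PIPES[label]) for label in macro_ops]

    def assign(index, used):
        # Try each pipe still free for macro-op `index`, then recurse on the rest.
        if index == len(choices):
            return True
        return any(
            pipe not in used and assign(index + 1, used | {pipe})
            for pipe in choices[index]
        )

    return assign(0, frozenset())


if __name__ == "__main__":
    print(can_issue_together(["P01", "P02"]))        # distinct pipes are available
    print(can_issue_together(["P1", "P1"]))          # both need the same pipe
    print(can_issue_together(["P01", "P02", "P2"]))  # P1, P0 and P2 respectively
```

A flexible label such as P01 or P02 leaves the choice of pipe open, which is why a mix of such macro-operations can often run in parallel even when two of them could have used the same pipe.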

Domain: Tells which execution unit domain is used.

ivec: integer vector execution unit.
fp: floating point execution unit.
fma: floating point multiply/add subunit.
inherit: the output operand inherits the domain of the input operand.
ivec/fma: the input goes to the ivec domain and the output comes from the fma domain.

There is an additional latency of 1 clock cycle if the output of an ivec instruction goes to the input of an fp or fma instruction, and when the output of an fp or fma instruction goes to the input of an ivec or store instruction. There is no latency between the fp and fma units. All other latencies after memory load and before memory store instructions are included in the latency counts.

An fma instruction has a latency of 5 if the output goes to another fma instruction, 6 if the output goes to an fp instruction, and 6+1 if the output goes to an ivec or store instruction.

Integer instructions
Instruction
Move instructions
MOV
MOV
MOV
MOV
MOV
MOV
MOV
MOV
MOVNTI
MOVZX, MOVSX
MOVSX
MOVZX
MOVSXD
MOVSXD
CMOVcc
CMOVcc
XCHG
XCHG
XCHG
XCHG
XCHG
XLAT
PUSH
PUSH
PUSH
PUSHF(D/Q)
PUSHA(D)
POP
POP
POPF(D/Q)
POPA(D)
POP
LEA
LEA
LEA
LEA
LAHF
SAHF
SALC
BSWAP
PREFETCHNTA
PREFETCHT0/1/2
PREFETCH/W
SFENCE
LFENCE
MFENCE

Operands

Ops

r8,r8
r16,r16
r32,r32
r64,r64
r,i
r,m
m,r
m,i
m,r
r,r
r,m
r,m
r64,r32
r64,m32
r,r
r,m
r8,r8
r16,r16
r32,r32
r64,r64

1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
2
2
2
2

1
1
1
1
1
3
4

r,m

2
2
1
1
2
8
9
1
2
34
14
1
2
1

~38
6

1
1
4
2
1
1
1
1
1
7
1
7

2
1
3
2
1
1

r
i
m

r
m

sp
r16,[m]
r32,[m]
r32/64,[m]
r32/64,[m]

r
m
m
m

Latency
Reciprocal throughput

4
1
5
4
1
5
1
1
1
1
1

0.5
0.5
0.25
0.25
0.5
0.5
1
1
1
0.5
0.5
0.5
0.5
0.5
0.5
0.5
1
1
0.5
0.5

EX01
EX01
EX01 or AG01
EX01 or AG01
EX01
AG01
EX01 AG01

~38
2
1
1
1
4
9
1
1
19
8

EX01

2
2-3
2

Arithmetic instructions

Execution pipes

all addr. modes


all addr. modes

EX01
EX01
EX01
EX01
EX01
EX01
EX01
EX01
EX01
EX01
EX01

EX01
EX01
0.5
0.5
2
1
1
0.5
0.5
0.5
0.5
~80
0.25
~80

Notes

EX01
EX01

Timing depends on hw

any addr. size


16 bit addr. size
scale factor > 1 or 3 operands
all other cases

EX01

PREFETCHW

ADD, SUB
ADD, SUB
ADD, SUB
ADD, SUB
ADD, SUB
ADC, SBB
ADC, SBB
ADC, SBB
ADC, SBB
ADC, SBB
CMP
CMP
CMP
CMP
INC, DEC, NEG
INC, DEC, NEG
AAA, AAS
DAA
DAS
AAD
AAM
MUL, IMUL
MUL, IMUL
MUL, IMUL
MUL, IMUL
IMUL
IMUL
IMUL
IMUL
IMUL
IMUL
IMUL
IMUL
IMUL
DIV
DIV
DIV
DIV
IDIV
IDIV
IDIV
IDIV
CBW, CWDE, CDQE
CDQ, CQO
CWD
Logic instructions
AND, OR, XOR
AND, OR, XOR
AND, OR, XOR
AND, OR, XOR
AND, OR, XOR
TEST

r,r
r,i
r,m
m,r
m,i
r,r
r,i
r,m
m,r
m,i
r,r
r,i
r,m
m,i
r
m

r8/m8
r16/m16
r32/m32
r64/m64
r16,r16/m16
r32,r32/m32
r64,r64/m64
r16,(r16),i
r32,(r32),i
r64,(r64),i
r16,m16,i
r32,m32,i
r64,m64,i
r8/m8
r16/m16
r32/m32
r64/m64
r8/m8
r16/m16
r32/m32
r64/m64

r,r
r,i
r,m
m,r
m,i
r,r

1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
10
16
20
4
10
1
2
1
1
1
1
1
2
1
1
2
2
2
9
7
2
2
9
7
2
2
1
1
2

1
1
1
1
1
1

1
1
7
7
1
1
1
9
9
1
1

1
7
6
8
10
6
15
4
4
4
6
4
4
6
5
4
6

17-22
15-25
13-39
13-70
17-22
14-25
13-39
13-70
1
1
1

1
1
7
7
1

0.5
0.5
0.5
1
1

1
1
1
0.5
0.5
0.5
0.5
0.5
1

15
2
2
2
4
2
2
4
2
2
4
2
2
4
13-17
15-25
13-39
13-70
13-17
14-24
13-39
13-70

EX01
EX01
EX01
EX01
EX01
EX01
EX01
EX01
EX01
EX01
EX01
EX01
EX01
EX01
EX01
EX01

0.5
1

EX1
EX1
EX1
EX1
EX1
EX1
EX1
EX1
EX1
EX1
EX1
EX1
EX1
EX0
EX0
EX0
EX0
EX0
EX0
EX0
EX0
EX01
EX01
EX01

0.5
0.5
0.5
1
1
0.5

EX01
EX01
EX01
EX01
EX01
EX01

TEST
TEST
TEST
NOT
NOT
ANDN
SHL, SHR, SAR
ROL, ROR
RCL
RCL
RCL
RCR
RCR
RCR
SHLD, SHRD
SHLD, SHRD
SHLD, SHRD
BT
BT
BT
BTC, BTR, BTS
BTC, BTR, BTS
BTC, BTR, BTS
BSF
BSF
BSR
BSR
SETcc
SETcc
CLC, STC
CMC
CLD
STD
POPCNT
POPCNT
LZCNT
TZCNT
BEXTR
BEXTR
BLSI
BLSMSK
BLSR
BLCFILL
BLCI
BLCIC
BLCMSK
BLCS
BLSFILL
BLSI
BLSIC
T1MSKC
TZMSK

r,i
m,r
m,i
r
m
r,r,r
r,i/CL
r,i/CL
r,1
r,i
r,cl
r,1
r,i
r,cl
r,r,i
r,r,cl
m,r,i/CL
r,r/i
m,i
m,r
r,r/i
m,i
m,r
r,r
r,m
r,r
r,m
r
m

r16/32,r16/32
r64,r64
r,r
r,r
r,r,r
r,r,i
r,r
r,r
r,r
r,r
r,r
r,r
r,r
r,r
r,r
r,r
r,r
r,r
r,r

1
1
1
1
1
1
1
1
1
16
17
1
15
16
6
7-8
8
1
1
7
2
4
10
6
8
7
9
1
1
1
1
2
2
1
1
1
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2

1
7
1
1
1
1
7
7
1
7
7
3
4
1

3
4
4
1

0.5
0.5
0.5
0.5
1
0.5
0.5
0.5

3
4
4
0.5
0.5
3.5
1
2
5
3
4
4
5
0.5
1
0.5

4
4
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2

3
4
2
4
2
2
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1

EX01
EX01
EX01
EX01
EX01
EX01
EX01
EX01
EX01
EX01
EX01
EX01
EX01
EX01
EX01
EX01
EX01
EX01
EX01
EX01
EX01
EX01
EX01
EX01
EX01
EX01
EX01
EX01
EX01
EX01
EX01

EX0

BMI1

SSE4.2
SSE4.2
LZCNT
BMI1
BMI1
AMD TBM
BMI1
BMI1
BMI1
AMD TBM
AMD TBM
AMD TBM
AMD TBM
AMD TBM
AMD TBM
AMD TBM
AMD TBM
AMD TBM
AMD TBM

Control transfer instructions
JMP
short/near
JMP
r
JMP
m
Jcc
short/near
fused CMP+Jcc
short/near
J(E/R)CXZ
short
LOOP
short
LOOPE LOOPNE
short
CALL
near
CALL
r
CALL
m
RET
RET
i
BOUND
m
INTO
String instructions
LODS
REP LODS
REP LODS
STOS
REP STOS
REP STOS
MOVS
REP MOVS
REP MOVS
SCAS
REP SCAS
CMPS
REP CMPS
Synchronization
LOCK ADD
XADD
LOCK XADD
CMPXCHG
CMPXCHG
CMPXCHG
LOCK CMPXCHG
LOCK CMPXCHG
LOCK CMPXCHG
CMPXCHG8B
LOCK CMPXCHG8B
CMPXCHG16B
LOCK CMPXCHG16B
Other
NOP (90)
Long NOP (0F 1F)
PAUSE

m8/m16
m32/m64

m,r
m,r
m,r
m,r8
m,r16
m,r32/64
m8,r8
m16,r16
m,r32/64
m64
m64
m128
m128

1
1
1
1
1
1
1
1
2
2
3
1
4
11
4

2
2
2
1-2
1-2
1-2
1-2
1-2
2
2
2
2
2
5
2

3
6n
6n
3
1n
3 per 16B
5
~1n
4-5 per 16B
3
7n
6
9n

3
3n
2.5n
3
~1n
2 per 16B
3
~1n
~2 per 16B
3
3-4n
3
4n

1
4
4
5
6
6
5
6
6
18
18
24
24

EX1
EX1
EX1
EX1
EX1
EX1
EX1
EX1
EX1
EX1
EX1
EX1
EX1

for no jump
for no jump

small n
best case
small n
best case

~39
9-12
~39
15
15
13
~40
~40
~40
~14
~42
~47
~80

1
1
8

0.25
0.25
4

2 if jumping
2 if jumping
2 if jumping
2 if jumping
2 if jumping

none
none

ENTER
ENTER
LEAVE
CPUID
XGETBV
RDTSC
RDTSCP
RDPMC
CRC32
CRC32
CRC32

a,0
a,b

r32,r8
r32,r16
r32,r32

13
11+5b
2
38-64
4
44
44
22
3
5
7

3
5
6

21
20-30
3
100-300
30
78
105
360
2
5
6

rdtscp

Floating point x87 instructions


Instruction
Move instructions
FLD
FLD
FLD
FBLD
FST(P)
FST(P)
FSTP
FBSTP
FXCH
FILD
FIST(T)(P)
FLDZ, FLD1
FCMOVcc
FFREE
FINCSTP, FDECSTP
FNSTSW
FNSTSW
FLDCW
FNSTCW
Arithmetic instructions
FADD(P),FSUB(R)(P)
FIADD,FISUB(R)
FMUL(P)
FIMUL
FDIV(R)(P)
FDIV(R)
FIDIV(R)
FABS, FCHS
FCOM(P), FUCOM(P)
FCOMPP, FUCOMPP
FCOMI(P)
FICOM(P)
FTST

Operands

Ops

r
m32/64
m80
m80
r
m32/64
m80
m80
r
m
m

1
1
8
60
1
2
13
239
1
1
2
1
8
1
1
3
2
1
2

2
7
11
52
2
7
14
222
0
11
7

1
2
1
2
1
1
2
1
1
1
2
2
1

st0,r
r
AX
m16
m16
m16

r/m
m
r/m
m
r
m
m
r/m
r
m

Latency
Reciprocal throughput

3
0
11

5
9-37


0.5
1
4
34
0.5
1
19
222
0.5
1
1
0.5
3
0.25
0.25
19
17
3
2

1
2
1
2
4-16
4
0.5
0.5
0.5
1
1
0.5

Execution pipes

Domain, notes

P01

fp
fp
fp
fp
fp
fp
fp
fp
inherit
fp
fp
fp
fp

P0 P1 P2
P01

P0 P1 P2
P01
P01
P0 P2
P01
P0 P1 P2
none
none
P0 P2
P0 P2

P01
P01
P01
P01
P01
P01
P01
P01
P01
P01
P01 P2
P01
P01

inherit

fma
fma
fma
fma
fp
fp
fp
fp
fp
fp
fp
fp
fp

FXAM
FRNDINT
FPREM FPREM1
Math
FSQRT
FLDPI, etc.
FSIN
FCOS
FSINCOS
FPTAN
FPATAN
FSCALE
FXTRACT
F2XM1
FYL2X
FYL2XP1
Other
FNOP
(F)WAIT
FNCLEX
FNINIT
FNSAVE
FRSTOR

m864
m864

1
1
1

26
4
17-60

0.5
1
12-53

P01
P0
P0

1
1
10-164
18-166
12-168
11-192
10-365
10
12
10-18
9-183
206

10-50

5-20
0.5
60-165

P01
P01
P0 P1 P2
P0 P1 P2
P0 P1 P2
P0 P1 P2
P0 P1 P2
P0 P1 P2
P0 P1 P2
P0 P1 P2
P0 P1 P2
P0 P1 P2

1
1
18
31
98
73

60-210
76-158
90-245
60-440
49
8
60-74
60-280
~390

256
166

90-165
90-210
60-365
5
5

0.25
0.25
63
131
256
166

fp
fp
fp

none
none
P0
P0
P0 P1 P2
P0 P2

Integer MMX and XMM instructions


Instruction
Move instructions
MOVD
MOVD
MOVD
MOVD
MOVQ
MOVQ
MOVQ
MOVDQA
MOVDQA
MOVDQA
VMOVDQA
VMOVDQA
VMOVDQA
MOVDQU
MOVDQU
MOVDQU
LDDQU
VMOVDQU
VMOVDQU
MOVDQ2Q
MOVQ2DQ

Operands

Ops

r32/64, mm/x
mm/x, r32/64
mm/x,m32
m32,mm/x
mm/x,mm/x
mm/x,m64
m64,mm/x
xmm,xmm
xmm,m
m,xmm
ymm,ymm
ymm,m256
m256,ymm
xmm,xmm
xmm,m
m,xmm
xmm,m
ymm,m256
m256,ymm
mm,xmm
xmm,mm

1
2
1
1
1
1
1
1
1
1
2
2
2
1
1
1
1
2
2
1
1

Latency
Reciprocal throughput
4
5
2
3
2
2
3
0
2
3
2
3
4
0
2
3
2
3
4
1
1

1
1
0.5
1
0.5
0.5
1
0.25
0.5
1
0.5
1
1
0.25
0.5
1
0.5
1
1
0.5
0.5

Execution pipes

Notes

P2

P02

none

inherit domain

P2
P02
P2
none
P2

P02
P02

inherit domain

MOVNTQ
m,mm
MOVNTDQ
m,xmm
MOVNTDQA
xmm,m
PACKSSWB/DW
(x)mm,r/m
PACKUSWB
(x)mm,r/m
PUNPCKH/LBW/WD/DQ
(x)mm,r/m
PUNPCKHQDQ
xmm,r/m
PUNPCKLQDQ
xmm,r/m
PSHUFB
(x)mm,r/m
PSHUFD
xmm,xmm,i
PSHUFW
mm,mm,i
PSHUFL/HW
xmm,xmm,i
PALIGNR
(x)mm,r/m,i
PBLENDW
xmm,r/m
MASKMOVQ
mm,mm
MASKMOVDQU
xmm,xmm
PMOVMSKB
r32,mm/x
PEXTRB/W/D/Q
r,x/mm,i
PINSRB/W/D/Q
x/mm,r,i
EXTRQ
x,i,i
EXTRQ
x,x
INSERTQ
x,x,i,i
INSERTQ
x,x
PMOVSXBW/BD/BQ/WD/WQ/DQ
x,x
PMOVZXBW/BD/BQ/WD/WQ/DQ
x,x
VPCMOV
x,x,x,x/m
VPCMOV
y,y,y,y/m
VPPERM
x,x,x,x/m
Arithmetic instructions
PADDB/W/D/Q/SB/SW/USB/USW
PSUBB/W/D/Q/SB/SW/USB/USW
PHADD/SUB(S)W/D
PCMPEQ/GT B/W/D
PCMPEQQ
PCMPGTQ
PMULLW PMULHW
PMULHUW PMULUDQ
PMULLD
PMULDQ
PMULHRSW
PMADDWD
PMADDUBSW
PAVGB/W
PMIN/MAX SB/SW/SD/UB/UW/UD
PHMINPOSUW

1
1
1
1
1

3
3
2
2
2

1
1
0.5
1
1

P2
P2

1
1
1
1
1
1
1
1
1
31
65
2
2
2
1
1
1
1

2
2
2
3
2
2
2
2
2
32
45
5
5
6
3
1
1
1

1
1
1
1
1
1
1
1
0.5
16
31
1
1
1
1
1
1
1

P1
P1
P1
P1
P1
P1
P1
P1
P02
P2
P0 P1 P2
P1 P2
P1 P2
P1
P1
P1
P1
P1

AMD SSE4A
AMD SSE4A
AMD SSE4A
AMD SSE4A

P1

SSE4.1

1
1
2
1

2
2
2
2

1
1
2
1

P1
P1
P1
P1

SSE4.1
AMD XOP
AMD XOP
AMD XOP

(x)mm,r/m

0.5

P02

(x)mm,r/m
x,x
(x)mm,r/m
(x)mm,r/m
(x)mm,r/m

1
3
1
1
1

2
5
2
2
2

0.5
2
0.5
0.5
0.5

P02
P02 2P1
P02
P02
P02

(x)mm,r/m
x,r/m
x,r/m
(x)mm,r/m
(x)mm,r/m
(x)mm,r/m
(x)mm,r/m

1
1
1
1
1
1
1

4
5
4
4
4
4
2

1
2
1
1
1
1
0.5

P0
P0
P0
P0
P0
P0
P02

(x)mm,r/m
x,r/m

1
2

2
4

0.5
1

P02
P1 P02


P1
P1

SSE4.1

SSE4.1

SSSE3
SSE4.1
SSE4.2

SSE4.1
SSE4.1
SSSE3

SSE4.1

PABSB/W/D
PSIGNB/W/D
PSADBW
MPSADBW

(x)mm,r/m
(x)mm,r/m
(x)mm,r/m
x,x,i

1
1
2
8

2
2
4
8

0.5
0.5
1
4

P02
P02
P02
P1 P02

VPCOMB/W/D/Q

x,x,x/m,i

0.5

P02

VPCOMUB/W/D/Q
VPHADDBW/BD/BQ/WD/WQ/DQ
VPHADDUBW/BD/BQ/WD/WQ/DQ
VPHSUBBW/WD/DQ
VPMACSWW/WD
VPMACSDD
VPMACSDQH/L
VPMACSSWW/WD
VPMACSSDD
VPMACSSDQH/L
VPMADCSWD
VPMADCSSWD

x,x,x/m,i

0.5

P02

SSE4.1
AMD XOP
latency 0 if i=6,7
AMD XOP
latency 0 if i=6,7

x,x/m

0.5

P02

AMD XOP

x,x/m
x,x/m
x,x,x/m,x
x,x,x/m,x
x,x,x/m,x
x,x,x/m,x
x,x,x/m,x
x,x,x/m,x
x,x,x/m,x
x,x,x/m,x

1
1
1
1
1
1
1
1
1
1

2
2
4
5
4
4
5
4
4
4

0.5
0.5
1
2
1
1
2
1
1
1

P02
P02
P0
P0
P0
P0
P0
P0
P0
P0

AMD XOP
AMD XOP
AMD XOP
AMD XOP
AMD XOP
AMD XOP
AMD XOP
AMD XOP
AMD XOP
AMD XOP

(x)mm,r/m

0.5

P02

(x)mm,r/m

P1

(x)mm,i
x,i
x,r/m
x,x,x/m
x,x,i
x,x,x/m
x,x,x/m

1
1
2
1
1
1
1

2
2
14
3
2
3
3

1
1
1
1
1
1
1

P1
P1
P1 P2
P1
P1
P1
P1

SSE4.1
AMD XOP
AMD XOP
AMD XOP
AMD XOP

x,x,i
x,x,i
x,x,i
x,x,i

30
30
9
8

11
10
5
6

11
10
5
6

P0 P1 P2
P0 P1 P2
P0 P1 P2
P0 P1 P2

SSE4.2
SSE4.2
SSE4.2
SSE4.2

x,x/m,i
x,x,x,i
x,x,m,i
x,x
x,x
x,x
x,x
x,x
x,x,i

7
7
8
2
2
2
2
1
1

11
11

7
7
7
1
1
1
1
1
1

P1
P1
P1
P01
P01
P01
P01
P0
P0

pclmul
pclmul
pclmul
aes
aes
aes
aes
aes
aes

Logic
PAND PANDN POR
PXOR
PSLL/RL W/D/Q
PSRAW/D
PSLL/RL W/D/Q
PSRAW/D
PSLLDQ, PSRLDQ
PTEST
VPROTB/W/D/Q
VPROTB/W/D/Q
VPSHAB/W/D/Q
VPSHLB/W/D/Q
String instructions
PCMPESTRI
PCMPESTRM
PCMPISTRI
PCMPISTRM
Encryption
PCLMULQDQ
VPCLMULQDQ
PCLMULQDQ
AESDEC
AESDECLAST
AESENC
AESENCLAST
AESIMC
AESKEYGENASSIST

5
5
5
5
5
5

SSSE3
SSSE3

Other
EMMS

0.25

Floating point XMM and YMM instructions


Instruction
Move instructions
MOVAPS/D
MOVUPS/D
VMOVAPS/D
MOVAPS/D
MOVUPS/D
VMOVAPS/D
VMOVUPS/D
MOVAPS/D
MOVUPS/D
VMOVAPS/D
VMOVUPS/D
MOVSS/D
MOVSS/D
MOVSS/D
MOVHPS/D
MOVLPS/D
MOVHPS/D
MOVLPS/D
MOVLHPS MOVHLPS
MOVMSKPS/D
VMOVMSKPS/D
MOVNTPS/D
VMOVNTPS/D
MOVNTSS/SD
SHUFPS/D
VSHUFPS/D
VPERMILPS/PD
VPERMILPS/PD
VPERMILPS/PD
VPERMILPS/PD
VPERM2F128
VPERM2F128
BLENDPS/PD
VBLENDPS/PD
BLENDVPS/PD
VBLENDVPS/PD
MOVDDUP
MOVDDUP
VMOVDDUP
VMOVDDUP
VBROADCASTSS
VBROADCASTSS

Operands

Ops

Latency
Reciprocal throughput

x,x
y,y

1
2

0
2

0.25
0.5

x,m128

0.5

y,m256

m128,x
m256,y
m256,y
x,x
x,m32/64
m32/64,x
x,m64
x,m64
m64,x
m64,x
x,x
r32,x
r32,y
m128,x
m256,y
m,x
x,x/m,i
y,y,y/m,i
x,x,x/m
y,y,y/m
x,x/m,i
y,y/m,i
y,y,y,i
y,y,m,i
x,x/m,i
y,y,y/m,i
x,x/m,xmm0
y,y,y/m,y
x,x
x,m64
y,y
y,m256
x,m32
y,m32

1
2
2
1
1
1
1
1
2
1
1
2
2
1
2
1
1
2
1
2
1
2
8
12
1
2
1
2
1
1
2
2
1
2

3
3
3
2
2
3
3
3
4
3
2
5
15
3
3

1
2
2
0.5
0.5
1
1
0.5
1
1
1
1
1
1
2-3
3
1
2
1
2
1
2
3.5
4
0.5
1
0.5
1
1
0.5
2
1
0.5
0.5

2
2
3
3
2
2
4
2
2
2
2
2
2
8
8

Execution pipes

Domain, notes

none
P02

inherit domain
ivec

P2
P2
P2
P01

fp

P2
P1
P01
P1 P2
P2
P1
P1 P2
P1 P2
P2
P2
P2
P2
P2
P1
P1
P1
P1
P0 P2
P0 P2
P01
P01
P01
P01
P1
P1

P02

ivec

AMD SSE4A
ivec
ivec
ivec
ivec
ivec
ivec
ivec
ivec
fp
fp

ivec
ivec

VBROADCASTSD
VBROADCASTF128
MOVSH/LDUP
MOVSH/LDUP
VMOVSH/LDUP
VMOVSH/LDUP
UNPCKH/LPS/D
VUNPCKH/LPS/D
EXTRACTPS
EXTRACTPS
VEXTRACTF128
VEXTRACTF128
INSERTPS
INSERTPS
VINSERTF128
VINSERTF128
VMASKMOVPS/D
VMASKMOVPS/D
VMASKMOVPS/D
VMASKMOVPS/D
Conversion
CVTPD2PS
VCVTPD2PS
CVTPS2PD
VCVTPS2PD
CVTSD2SS
CVTSS2SD
CVTDQ2PS
VCVTDQ2PS
CVT(T) PS2DQ
VCVT(T) PS2DQ
CVTDQ2PD
VCVTDQ2PD
CVT(T)PD2DQ
VCVT(T)PD2DQ
CVTPI2PS
CVT(T)PS2PI
CVTPI2PD
CVT(T) PD2PI
CVTSI2SS
CVT(T)SS2SI
CVTSI2SD
CVT(T)SD2SI
VCVTPS2PH
VCVTPS2PH
VCVTPH2PS
VCVTPH2PS
Arithmetic
ADDSS/D SUBSS/D

y,m64
y,m128
x,x
x,m128
y,y
y,m256
x,x/m
y,y,y/m
r32,x,i
m32,x,i
x,y,i
m128,y,i
x,x,i
x,m32,i
y,y,x,i
y,y,m128,i
x,x,m128
y,y,m256
m128,x,x
m256,y,y

2
2
1
1
2
2
1
2
2
2
1
2
1
1
2
2
1
2
20
41

x,x
x,y
x,x
y,x
x,x
x,x
x,x
y,y
x,x
y,y
x,x
y,x
x,x
x,y
x,mm
mm,x
x,mm
mm,x
x,r32
r32,x
x,r32/64
r32/64,x
x/m,x,i
x/m,y,i
x,x/m
y,x/m

x,x/m

8
8
2

10
2
10
2
9
2
10
9
9
~35
~35

0.5
0.5
1
0.5
2
1
1
2
1
1
0.5
1
1
2
1
1
0.5
1
8
16

2
4
2
4
1
1
1
2
1
2
2
4
2
4
2
1
2
2
2
2
2
2
2
4
2
4

6
6
6
6
4
4
4
4
4
4
7
7
7
7
6
5
7
7
13
12
12
12
7
7
7
7

5-6

2
2
2


P02
P02
P1

ivec

P1

ivec

P1
P1
P1 P2
P1 P2
P02
P0 P2
P1
P1
P02
P02
P01
P01
P0 P1 P2
P0 P1 P2

ivec
ivec

1
2
1
2
1
1
1
2
1
2
1
2
1
2
1
1
1
1
1
1
1
1
2
2
2
2

P01
P01
P01
P01
P0
P0
P0
P0
P0
P0
P01
P01
P01
P01
P0 P2
P0
P0 P1
P0 P1
P0
P0 P2
P0
P0 P2
P0 P1
P0 P1
P0 P1
P0 P1

ivec/fp
ivec/fp
ivec/fp
ivec/fp
fp
fp
fp
fp
fp
fp
ivec/fp
ivec/fp
fp/ivec
fp/ivec
ivec/fp
fp
ivec/fp
fp/ivec
fp
fp
fp
fp
F16C
F16C
F16C
F16C

P01

fma

ivec

ivec

ADDPS/D SUBPS/D

x,x/m

5-6

P01

fma

VADDPS/D VSUBPS/D
ADDSUBPS/D
VADDSUBPS/D

y,y,y/m
x,x/m
y,y,y/m

2
1
2

5-6
5-6
5-6

2
1
1

P01
P01
P01

fma
fma
fma

HADDPS/D HSUBPS/D
VHADDPS/D
VHSUBPS/D
MULSS MULSD
MULPS MULPD
VMULPS VMULPD
DIVSS DIVPS
VDIVPS
DIVSD DIVPD
VDIVPD
RCPSS/PS
VRCPPS
CMPSS/D
CMPPS/D
VCMPPS/D
COMISS/D
UCOMISS/D
MAXSS/SD/PS/PD
MINSS/SD/PS/PD

x,x

10

P0 P1

ivec/fma

y,y,y/m
x,x/m
x,x/m
y,y,y/m
x,x/m
y,y,y/m
x,x/m
y,y,y/m
x,x/m
y,y/m

8
1
1
2
1
2
1
2
1
2

10
5-6
5-6
5-6
9-17
9-17
9-32
9-32
5
5

4
0.5
0.5
1
4-6
9-12
4-13
9-27
1
2

P01 P1
P01
P01
P01
P01
P01
P01
P01
P01
P01

ivec/fma
fma
fma
fma
fp
fp
fp
fp
fp
fp

x,x/m
y,y,y/m

1
2

2
2

0.5
1

P01
P01

fp
fp

x,x/m

P01 P2

fp

x,x/m

0.5

P01

fp

VMAXPS/D VMINPS/D
y,y,y/m
2
ROUNDSS/SD/PS/PD
x,x/m,i
1
VROUNDSS/SD/PS/PD
y,y/m,i
2
DPPS
x,x,i
9
DPPS
x,m,i
10
VDPPS
y,y,y,i
13
VDPPS
y,m,i
15
DPPD
x,x,i
7
DPPD
x,m,i
8
VFMADD132SS/SD
x,x,x/m
1
VFMADD132PS/PD
x,x,x/m
1
VFMADD132PS/PD
y,y,y/m
2
All other FMA3 instructions: same as above
VFMADDSS/SD
x,x,x,x/m
1
VFMADDPS/PD
x,x,x,x/m
1
VFMADDPS/PD
y,y,y,y/m
2
All other FMA4 instructions: same as above

2
4

1
1

P01
P0

fp
fp

4
25

5-6
5-6
5-6

2
4
5
8
8
3
4
0.5
0.5
1

P0
P0 P1
P0 P1
P0 P1
P0 P1
P0 P1
P0 P1
P01
P01
P01

5-6
5-6
5-6

0.5
0.5
1

P01
P01
P01

fp
SSE4.1
SSE4.1
SSE4.1
SSE4.1
SSE4.1
SSE4.1
FMA3
FMA3
FMA3
FMA3
AMD FMA4
AMD FMA4
AMD FMA4
AMD FMA4

12-13
12-13
26-29
27-28
5
5

4-9
9-18
4-18
9-37
1
2

P01
P01
P01
P01
P01
P01

Math
SQRTSS/PS
VSQRTPS
SQRTSD/PD
VSQRTPD
RSQRTSS/PS
VRSQRTPS

x,x/m
y,y/m
x,x/m
y,y/m
x,x/m
y,y/m

1
2
1
2
1
2

25
14


fp
fp
fp
fp
fp
fp

VFRCZSS/SD/PS/PD
VFRCZSS/SD/PS/PD

x,x
x,m

2
4

10

2
2

P01
P01

AMD XOP
AMD XOP

AND/ANDN/OR/XORPS/PD

x,x/m

0.5

P02

ivec

VAND/ANDN/OR/XORPS/PD

y,y,y/m

P02

ivec

m32
m32
m4096
m4096
m
m

9
16
17
32
9
2
59-67
104-112
121-137
191-209

Logic

Other
VZEROUPPER
VZEROUPPER
VZEROALL
VZEROALL
LDMXCSR
STMXCSR
FXSAVE
FXRSTOR
XSAVE
XRSTOR

4
5
6
10
36
17
78
160
147-166
291-297


P02
P02
P0 P2
P0 P2
P0 P1 P2
P0 P1 P2
P0 P1 P2
P0 P1 P2

32 bit mode
64 bit mode
32 bit mode
64 bit mode

Bobcat

AMD Bobcat
List of instruction timings and macro-operation breakdown
Explanation of column headings:

Instruction: Instruction name. cc means any condition code. For example, Jcc can be JB, JNE, etc.

Operands: i = immediate constant, r = any register, r32 = 32-bit register, etc., mm = 64-bit mmx register, xmm = 128-bit xmm register, m = any memory operand including indirect operands, m64 means 64-bit memory operand, etc.

Ops: Number of micro-operations issued from instruction decoder to schedulers. Instructions with more than 2 micro-operations are micro-coded.

Latency: This is the delay that the instruction generates in a dependency chain. The numbers are minimum values. Cache misses, misalignment, and exceptions may increase the clock counts considerably. Floating point operands are presumed to be normal numbers. Denormal numbers, NaNs, infinity and exceptions increase the delays. The latencies listed do not include memory operands where the operand is listed as register or memory (r/m). The clock frequency varies dynamically, which makes it difficult to measure latencies. The values listed are measured after the execution of millions of similar instructions, assuming that this will make the processor boost the clock frequency to the highest possible value.

Reciprocal throughput: This is also called issue latency. This value indicates the average number of clock cycles from when the execution of an instruction begins until a subsequent independent instruction of the same kind can begin to execute. A value of 1/2 indicates that the execution units can handle 2 instructions per clock cycle in one thread. However, the throughput may be limited by other bottlenecks in the pipeline.

Execution pipe: Indicates which execution pipe is used for the micro-operations. I0 means integer pipe 0. I0/1 means integer pipe 0 or 1. FP0 means floating point pipe 0 (ADD). FP1 means floating point pipe 1 (MUL). FP0/1 means either one of the two floating point pipes. Two micro-operations can execute simultaneously if they go to different execution pipes.
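
The difference between latency and reciprocal throughput can be illustrated with a first-order estimate. The following Python sketch is not the measurement method used for these tables. It is a simplified model that assumes the only limits are the dependency chains and the issue rate of one kind of instruction. The function names and the example values (latency 3, reciprocal throughput 1) are hypothetical placeholders, not figures for any particular table row.

```python
def chain_cycles(n, latency):
    """Clock cycles for n instructions where each one needs the result of the previous one."""
    return n * latency


def independent_cycles(n, recip_throughput):
    """Clock cycles for n mutually independent instructions of the same kind."""
    return n * recip_throughput


def mixed_cycles(n, chains, latency, recip_throughput):
    """Estimate for n instructions split evenly over `chains` independent dependency chains.

    The result is the larger of the latency bound (length of one chain)
    and the throughput bound (issue rate for all n instructions).
    Assumption: no other pipeline bottleneck is modelled.
    """
    per_chain = n / chains
    return max(per_chain * latency, n * recip_throughput)


if __name__ == "__main__":
    # Hypothetical instruction: latency 3 clocks, reciprocal throughput 1 clock.
    for chains in (1, 2, 3, 4):
        print(chains, mixed_cycles(1200, chains, 3, 1))
```

In this model a single chain of 1200 instructions is bound by latency (3600 clocks), while three or more independent chains reach the throughput bound (1200 clocks). Real code may be slower because of other bottlenecks in the pipeline.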

Integer instructions
Instruction
Move instructions
MOV
MOV
MOV
MOV
MOV
MOV
MOVNTI
MOVZX, MOVSX
MOVZX, MOVSX
MOVSXD
MOVSXD
CMOVcc
CMOVcc
XCHG
XCHG

Operands

r,r
r,i
r,m
m,r
m8,r8H
m,i
m,r
r,r
r,m
r64,r32
r64,m32
r,r
r,m
r,r
r,m

Ops
Latency
Reciprocal throughput
1
1
1
1
1
1
1
1
1
1
1
1
1
2
3

1
4
4
7
6
1
5
1
5
1
1
20

0.5
0.5
1
1
1
1
1
0.5
1
0.5
1
0.5
1
1

Execution pipe
I0/1
I0/1
AGU
AGU
AGU
AGU
AGU
I0/1

Notes

Any addr. mode


Any addr. mode
AH, BH, CH, DH

I0/1
I0/1
Timing dep. on hw

XLAT
PUSH
PUSH
PUSH
PUSHF(D/Q)
PUSHA(D)
POP
POP
POPF(D/Q)
POPA(D)
LEA
LEA
LEA
LEA
LAHF
SAHF
SALC
BSWAP
PREFETCHNTA
PREFETCHT0/1/2
PREFETCH
SFENCE
LFENCE
MFENCE

r
i
m

r
m

r16,[m]
r32/64,[m]
r32/64,[m]
r64,[m]

r
m
m
m

Arithmetic instructions
ADD, SUB
r,r/i
ADD, SUB
r,m
ADD, SUB
m,r
ADC, SBB
r,r/i
ADC, SBB
r,m
ADC, SBB
m,r/i
CMP
r,r/i
CMP
r,m
INC, DEC, NEG
r
INC, DEC, NEG
m
AAA
AAS
DAA
DAS
AAD
AAM
MUL, IMUL
r8/m8
MUL, IMUL
r16/m16
MUL, IMUL
r32/m32
MUL, IMUL
r64/m64
IMUL
r16,r16/m16
IMUL
r32,r32/m32
IMUL
r64,r64/m64
IMUL
r16,(r16),i
IMUL
r32,(r32),i
IMUL
r64,(r64),i
DIV
r8/m8

2
1
1
3
9
9
1
4
29
9
2
1
1
1
4
1
1
1
1
1
1
4
1
4

1
1
1
1
1
1
1
1
1
1
9
9
12
16
4
33
1
3
2
2
1
1
1
2
1
1
1

3
1
2-4
4
1
1
1

1
6-7
1
1
6
5
10
7
8
5
23
3
3-5
3-4
6-7
3
3
6
4
3
7
27

1
1
2
6
9
1
4
22
8
2
0.5
1
0.5
2
0.5

I0
I0/1
I0
I0/1
I0/1

0.5
1
1
1
~45
1
~45

I0/1
AGU
AGU
AGU
AGU
AGU
AGU

0.5
1
1
1
1

I0/1

0.5
1
0.5

I0/1

23
1
2
1
1
4
3
1
4
27

Any address size


no scale, no offset
w. scale or offset
RIP relative

AMD only

I0/1

I0/1

I0
I0
I0
I0
I0
I0
I0
I0
I0
I0
I0

latency ax=3, dx=5


latency eax=3, edx=4
latency rax=6, rdx=7

DIV
DIV
DIV
IDIV
IDIV
IDIV
IDIV
CBW, CWDE, CDQE
CWD, CDQ, CQO
Logic instructions
AND, OR, XOR
AND, OR, XOR
AND, OR, XOR
TEST
TEST
NOT
NOT
SHL, SHR, SAR
ROL, ROR
RCL, RCR
RCL
RCR
RCL
RCR
SHL,SHR,SAR,ROL,ROR
RCL, RCR
RCL
RCR
RCL
RCR
SHLD, SHRD
SHLD, SHRD
SHLD, SHRD
BT
BT
BT
BTC, BTR, BTS
BTC
BTR, BTS
BTC
BTR, BTS
BSF, BSR
BSF, BSR
POPCNT
LZCNT
SETcc
SETcc
CLC, STC
CMC
CLD
STD

r16/m16
r32/m32
r64/m64
r8/m8
r16/m16
r32/m32
r64/m64

1
1
1
1
1
1
1
1
1

33
49
81
29
37
55
81
1
1

33
49
81
29
37
55
81

I0
I0
I0
I0
I0
I0
I0
I0/1
I0/1

r,r
r,m
m,r
r,r
r,m
r
m
r,i/CL
r,i/CL
r,1
r,i
r,i
r,CL
r,CL

1
1
1
1
1
1
1
1
1
1
9
7
9
9

0.5
1
1
0.5
1
0.5
1
0.5
0.5
1
5
4
5
4

I0/1

m,i /CL
m,1
m,i
m,i
m,CL
m,CL
r,r,i
r,r,cl
m,r,i/CL
r,r/i
m,i
m,r
r,r/i
m,i
m,i
m,r
m,r
r,r
r,m
r,r/m
r,r/m
r
m

1
1
10
9
9
8
6
7
8
1
1
5
2
5
4-5
8
8
11
11
9
8
1
1
1
1
1
2

7
7

1
1
1
1
1
5
4
6
5

18

3
4
18

16
15
6
12
5
1


I0/1
I0/1
I0/1
I0/1
I0/1

1
1
~15
~14
15
15
3
4
15
0.5
1
3
1
15
15
13
15
6
6
5
0.5
1
0.5
0.5
1
2

SSE4.A/SSE4.2
SSE4.A, AMD only

I0/1
I0/1
I0
I0,I1

Control transfer instructions
JMP
short/near
JMP
r
JMP
m(near)
Jcc
short/near
J(E/R)CXZ
short
LOOP
short
CALL
near
CALL
r
CALL
m(near)
RET
RET
i
BOUND
m
INTO

1
1
1
1
2
8
2
2
5
1
4
8
4

2
2
2
1/2 - 2
1-2
4
2
2
2
~3
~4
4
2

String instructions
LODS
REP LODS
STOS
REP STOS
MOVS
REP MOVS
SCAS
REP SCAS
CMPS
REP CMPS

4
5
4
2
7
2
5
6
7
6

~3
~3
2

Other
NOP (90)
Long NOP (0F 1F)
PAUSE
ENTER
ENTER
LEAVE
CPUID
RDTSC
RDPMC

recip. t. = 2 if jump
recip. t. = 2 if jump

values for no jump


values for no jump

values are per count


best case 6-7 B/clk

5
best case 5 B/clk
3
3
4
3

1
0
1
0
6
i,0
12
a,b
10+6b
2
30-52 70-830
26
14

0.5
0.5
6

values are per count


values are per count

I0/1
I0/1
36
34+6b

32 bit mode

87
8

Floating point x87 instructions


Instruction
Move instructions
FLD
FLD
FLD
FBLD
FST(P)
FST(P)
FSTP
FBSTP

Operands

r
m32/64
m80
m80
r
m32/64
m80
m80

Ops
Latency
Reciprocal throughput
1
1
7
21
1
1
16
217

2
6
14
30
2
6
19
177

0.5
1
5
35
0.5
1
9
180

Execution pipe
FP0/1
FP0/1

FP0/1
FP1

Notes

FXCH
FILD
FIST(T)(P)
FLDZ, FLD1
FCMOVcc
FFREE
FINCSTP, FDECSTP
FNSTSW
FNSTSW
FNSTCW
FLDCW
Arithmetic instructions
FADD(P),FSUB(R)(P)
FADD(P),FSUB(R)(P)
FIADD,FISUB(R)
FMUL(P)
FMUL(P)
FIMUL
FDIV(R)(P)
FDIV(R)(P)
FIDIV(R)
FABS, FCHS
FCOM(P), FUCOM(P)
FCOM(P), FUCOM(P)
FCOMPP, FUCOMPP
FCOMI(P)
FICOM(P)
FTST
FXAM
FRNDINT
FPREM
FPREM1
Math
FSQRT
FLDPI, etc.
FSIN
FCOS
FSINCOS
FPTAN
FPATAN
FSCALE
FXTRACT
F2XM1
FYL2X
FYL2XP1
Other
FNOP
(F)WAIT
FNCLEX
FNINIT

r
m
m
st0,r
r
AX
m16
m16
m16

r
m
m
r
m
m
r
m
m
r
m
r
m

1
1
1
1
12
1
1
2
2
3
12

1
1
2
1
1
2
1
1
2
1
1
1
1
1
2
1
2
5
1
1

0
9
6
7
1
~20
~20

3
3
5
5
19

1
1
3
3
3
19
19
19
2
1
1
1
2
1
1
2

11
11-16
11-19

1
31
1
4-44 27-105
11-51 51-94
11-75 48-110
~45
~113
9-75 49-163
5
8
7
9
30-56
~60
8
29
12
44

1
1
9
26

1
1
1
1
7
1
1
10
10
2
10

0
0


1
27-105
51-94
48-110
~113
49-163

0.5
0.5
30
78

FP1
FP1
FP1
FP0/1
FP1
FP1
FP1
FP1
FP0
FP1

FP0
FP0
FP0,FP1
FP1
FP1
FP1
FP1
FP1
FP1
FP1
FP0
FP0
FP0
FP0
FP0, FP1
FP0
FP1
FP0, FP1
FP1
FP1

FP1
FP0
FP0, FP1
FP0, FP1
FP0, FP1
FP0, FP1
FP0, FP1
FP0, FP1
FP0, FP1
FP0, FP1
FP0, FP1
FP0, FP1

FP0, FP1
ALU
FP0, FP1
FP0, FP1

FNSAVE
FRSTOR
FXSAVE
FXRSTOR

m
m
m
m

85
80
71
111

163
123
105
118

FP0, FP1
FP0, FP1
FP0, FP1
FP0, FP1

Integer MMX and XMM instructions


Instruction

Operands

Move instructions
MOVD
MOVD
MOVD
MOVD
MOVD
MOVD
MOVD

r32, mm
mm, r32
mm,m32
r32, xmm
xmm, r32
xmm,m32
m32,(x)mm

1
1
1
1
3
2
1

7
7
5
6
6
5
6

1
3
1
1
3
1
2

FP0
FP0/1
FP0/1
FP0
FP1
FP1
FP1

r64,(x)mm
mm,r64
xmm,r64
mm,mm
xmm,xmm
mm,m64
xmm,m64
m64,(x)mm
xmm,xmm
xmm,m
m,xmm
xmm,m
m,xmm
mm,xmm
xmm,mm
m,mm
m,xmm

1
2
3
1
2
1
2
1
2
2
2
2
2
1
2
1
2

7
7
7
1
1
5
5
6
1
6
6
6-9
6-9
1
1
13
13

1
3
3
0.5
1
1
1
2
1
2
3
2-5.5
3-6
0.5
1
1.5
3

FP0
FP0/1
FP0/1
FP0/1
FP0/1
FP0/1
FP1
FP1
FP0/1
AGU
FP1
AGU
FP1
FP0/1
FP0/1
FP1
FP1

mm,r/m

0.5

FP0/1

xmm,r/m

FP0/1

mm,r/m

0.5

xmm,r/m
xmm,r/m
xmm,r/m
mm,mm
xmm,xmm
xmm,xmm,i
mm,mm,i
xmm,xmm,i
xmm,xmm,i
mm,mm
xmm,xmm

2
2
1
1
6
3
1
2
20
32
64

1
1
1
2
3
2
1
2
19
146-1400
279-3000

1
1
0.5
1
3
2
0.5
2
12
130-1170
260-2300

MOVD (MOVQ)
MOVD (MOVQ)
MOVD (MOVQ)
MOVQ
MOVQ
MOVQ
MOVQ
MOVQ
MOVDQA
MOVDQA
MOVDQA
MOVDQU, LDDQU
MOVDQU
MOVDQ2Q
MOVQ2DQ
MOVNTQ
MOVNTDQ
PACKSSWB/DW
PACKUSWB
PACKSSWB/DW
PACKUSWB
PUNPCKH/LBW/WD/DQ
PUNPCKH/LBW/WD/DQ
PUNPCKHQDQ
PUNPCKLQDQ
PSHUFB
PSHUFB
PSHUFD
PSHUFW
PSHUFL/HW
PALIGNR
MASKMOVQ
MASKMOVDQU

Ops
Latency
Reciprocal throughput


Execution pipe

FP0, FP1
FP0/1
FP0/1
FP0/1
FP0/1
FP0/1
FP0/1
FP0/1
FP0, FP1
FP0, FP1

Notes

Moves 64 bits.
Name differs
do.
do.

Suppl. SSE3
Suppl. SSE3

Suppl. SSE3

PMOVMSKB
PEXTRW
PINSRW
PINSRW
INSERTQ
INSERTQ
EXTRQ
EXTRQ
Arithmetic instructions
PADDB/W/D/Q
PADDSB/W
PADDUSB/W
PSUBB/W/D/Q
PSUBSB/W
PSUBUSB/W
PADDB/W/D/Q
PADDSB/W
PADDUSB/W
PSUBB/W/D/Q
PSUBSB/W
PSUBUSB/W
PHADD/SUBW/SW/D
PHADD/SUBW/SW/D
PCMPEQ/GT B/W/D
PCMPEQ/GT B/W/D
PMULLW PMULHW
PMULHUW
PMULUDQ
PMULLW PMULHW
PMULHUW
PMULUDQ
PMULHRSW
PMULHRSW
PMADDWD
PMADDWD
PMADDUBSW
PMADDUBSW
PAVGB/W
PAVGB/W
PMIN/MAX SW/UB
PMIN/MAX SW/UB
PABSB/W/D
PABSB/W/D
PSIGNB/W/D
PSIGNB/W/D
PSADBW
PSADBW
Logic
PAND PANDN POR
PXOR

r32,(x)mm
r32,(x)mm,i
mm,r32,i
xmm,r32,i
xmm,xmm
xmm,xmm,i,i
xmm,xmm
xmm,xmm,i,i

1
2
2
3
3
3
1
1

8
12
10
10
3-4
3-4
1
2

2
2
6
3
3
1
2

FP0
FP0, FP1
FP0/1
FP0/1
FP0, FP1
FP0, FP1
FP0/1
FP0/1

mm,r/m

0.5

FP0/1

xmm,r/m
mm,r/m
xmm,r/m
mm,r/m
xmm,r/m

2
1
2
1
2

1
1
4
1
1

1
0.5
1
0.5
1

FP0/1
FP0/1
FP0/1
FP0/1
FP0/1

mm,r/m

FP0

xmm,r/m
mm,r/m
xmm,r/m
mm,r/m
xmm,r/m
mm,r/m
xmm,r/m
mm,r/m
xmm,r/m
mm,r/m
xmm,r/m
mm,r/m
xmm,r/m
mm,r/m
xmm,r/m
mm,r/m
xmm,r/m

2
1
2
1
2
1
2
1
2
1
2
1
2
1
2
1
2

2
2
2
2
2
2
2
1
1
1
1
1
1
1
1
2
2

2
1
2
1
2
1
2
0.5
1
0.5
1
0.5
1
0.5
1
2
2

FP0
FP0
FP0
FP0
FP0
FP0
FP0
FP0/1
FP0/1
FP0/1
FP0/1
FP0/1
FP0/1
FP0/1
FP0/1
FP0
FP0, FP1

mm,r/m

0.5

FP0/1


SSE4.A, AMD only


SSE4.A, AMD only
SSE4.A, AMD only
SSE4.A, AMD only

Suppl. SSE3
Suppl. SSE3

Suppl. SSE3
Suppl. SSE3

Suppl. SSE3
Suppl. SSE3

Suppl. SSE3
Suppl. SSE3
Suppl. SSE3
Suppl. SSE3

PAND PANDN POR
PXOR
PSLL/RL W/D/Q
PSRAW/D
PSLL/RL W/D/Q
PSRAW/D
PSLLDQ, PSRLDQ

xmm,r/m

FP0/1

mm,i/mm/m

FP0/1

xmm,i/xmm/m

2
2

1
1

1
1

FP0/1
FP0/1

0.5

FP0/1

xmm,i

Other
EMMS

Floating point XMM instructions


Instruction
Move instructions
MOVAPS/D
MOVAPS/D
MOVAPS/D
MOVUPS/D
MOVUPS/D
MOVUPS/D
MOVSS/D
MOVSS/D
MOVSS/D
MOVHLPS, MOVLHPS
MOVHPS/D,
MOVLPS/D
MOVHPS/D,
MOVLPS/D
MOVNTPS/D
MOVNTSS/D
MOVDDUP
MOVDDUP
MOVSHDUP,
MOVSLDUP
MOVSHDUP,
MOVSLDUP
MOVMSKPS/D
SHUFPS/D
UNPCK H/L PS/D
Conversion
CVTPS2PD
CVTPD2PS
CVTSD2SS
CVTSS2SD
CVTDQ2PS
CVTDQ2PD
CVT(T)PS2DQ
CVT(T)PD2DQ
CVTPI2PS

Operands

Ops
Latency
Reciprocal throughput

Execution pipe

r,r
r,m
m,r
r,r
r,m
m,r
r,r
r,m
m,r

2
2
2
2
2
2
1
2
1

1
6
6
1
6-9
6-9
1
6
5

1
2
3
1
2-6
3-6
0.5
2
2

FP0/1
AGU
FP1
FP0/1
AGU
FP1
FP0/1
FP1
FP1

r,r

0.5

FP0/1

r,m

AGU

m,r
m,r
m,r
r,r
r,m64

1
2
1
2
2

5
12
12
2
7

3
3
2
1
2

FP1
FP1
FP1
FP0/1
FP0/1

r,r

FP0/1

r,m
r32,r
r,r/m,i
r,r/m

2
1
3
2

12
~6
2
1

3
2
2
1

AGU
FP0
FP0/1
FP0/1

r,r/m
r,r/m
r,r/m
r,r/m
r,r/m
r,r/m
r,r/m
r,r/m
xmm,mm

2
4
3
1
2
2
2
4
1

5
5
5
4
4
5
4
6
4

2
3
3
1
4
2
4
3
2

FP1
FP0, FP1
FP0, FP1
FP1
FP1
FP1
FP1
FP0, FP1
FP1


Notes

SSE4.A, AMD only


SSE3
SSE3

CVTPI2PD
CVT(T)PS2PI
CVT(T)PD2PI
CVTSI2SS
CVTSI2SD
CVT(T)SS2SI
CVT(T)SD2SI

xmm,mm
mm,xmm
mm,xmm
xmm,r32
xmm,r32
r32,xmm
r32,xmm

2
1
3
3
2
2
2

5
4
6
12
11
12
11

2
1
2
3
3
1
1

FP1
FP1
FP0, FP1
FP0, FP1
FP1
FP0, FP1
FP0, FP1

r,r/m
r,r/m
r,r/m

1
2
2

3
3
3

1
2
2

FP0
FP0
FP0

r,r/m
r,r/m
r,r/m
r,r/m
r,r/m
r,r/m
r,r/m
r,r/m
r,r/m
r,r/m
r,r/m
r,r/m
r,r/m
r,r/m
r,r/m

2
1
1
2
2
1
2
1
2
1
2
1
2
1
2

3
2
4
2
4
13
38
17
34
3
3
2
2
2
2

2
1
2
2
4
13
38
17
34
1
2
1
2
1
2

FP0
FP1
FP1
FP1
FP1
FP1
FP1
FP1
FP1
FP1
FP1
FP0
FP0
FP0
FP0

r,r/m

FP0

Logic
ANDPS/D ANDNPS/D
ORPS/D XORPS/D

r,r/m

FP0/1

Math
SQRTSS
SQRTPS
SQRTSD
SQRTPD
RSQRTSS
RSQRTPS

r,r/m
r,r/m
r,r/m
r,r/m
r,r/m
r,r/m

1
2
1
2
1
2

14
48
24
48
3
3

14
48
24
48
1
2

FP1
FP1
FP1
FP1
FP1
FP1

Other
LDMXCSR
STMXCSR

m
m

12
3

10
11

FP0, FP1
FP0, FP1

Arithmetic
ADDSS/D SUBSS/D
ADDPS/D SUBPS/D
ADDSUBPS/D
HADDPS/D
HSUBPS/D
MULSS
MULSD
MULPS
MULPD
DIVSS
DIVPS
DIVSD
DIVPD
RCPSS
RCPPS
MAXSS/D MINSS/D
MAXPS/D MINPS/D
CMPccSS/D
CMPccPS/D
COMISS/D
UCOMISS/D


SSE3
SSE3

Jaguar

AMD Jaguar
List of instruction timings and macro-operation breakdown
Explanation of column headings:

Instruction: Instruction name. cc means any condition code. For example, Jcc can be JB, JNE, etc.

Operands: i = immediate constant, r = any register, r32 = 32-bit register, etc., mm = 64-bit mmx register, xmm = 128-bit xmm register, m = any memory operand including indirect operands, m64 means 64-bit memory operand, etc.

Ops: Number of micro-operations issued from instruction decoder to schedulers. Instructions with more than 2 micro-operations are micro-coded.

Latency: This is the delay that the instruction generates in a dependency chain. The numbers are minimum values. Cache misses, misalignment, and exceptions may increase the clock counts considerably. Floating point operands are presumed to be normal numbers. Denormal numbers, NaNs, infinity and exceptions increase the delays. The latencies listed do not include memory operands where the operand is listed as register or memory (r/m). The clock frequency varies dynamically, which makes it difficult to measure latencies. The values listed are measured after the execution of millions of similar instructions, assuming that this will make the processor boost the clock frequency to the highest possible value.

Reciprocal throughput: This is also called issue latency. This value indicates the average number of clock cycles from when the execution of an instruction begins until a subsequent independent instruction of the same kind can begin to execute. A value of 1/2 indicates that the execution units can handle 2 instructions per clock cycle in one thread. However, the throughput may be limited by other bottlenecks in the pipeline.

Execution pipe: Indicates which execution pipe is used for the micro-operations. I0 means integer pipe 0. I0/1 means integer pipe 0 or 1. FP0 means floating point pipe 0 (ADD). FP1 means floating point pipe 1 (MUL). FP0/1 means either one of the two floating point pipes. Two micro-operations can execute simultaneously if they go to different execution pipes.
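
The execution pipe column can be used for a rough estimate of how many clock cycles a group of independent floating point micro-operations needs. The Python sketch below is an illustration only, under the stated assumptions that every micro-operation occupies its pipe for exactly one clock cycle, that the micro-operations do not depend on each other, and that nothing else in the pipeline is a bottleneck. The function name is hypothetical; the pipe labels "FP0", "FP1" and "FP0/1" follow the notation explained above.

```python
import math


def min_cycles_two_pipes(uops):
    """Lower bound on clock cycles for independent micro-operations on two FP pipes.

    uops is a list of pipe labels: "FP0", "FP1" or "FP0/1".
    Micro-operations tied to one pipe must run one after another on that pipe;
    "FP0/1" micro-operations can fill whichever pipe is free.
    """
    for label in uops:
        if label not in ("FP0", "FP1", "FP0/1"):
            raise ValueError("unknown pipe label: " + label)
    only0 = uops.count("FP0")
    only1 = uops.count("FP1")
    # With two pipes, at most two micro-operations can start per clock cycle.
    return max(only0, only1, math.ceil(len(uops) / 2))


if __name__ == "__main__":
    print(min_cycles_two_pipes(["FP0", "FP1"]))          # different pipes
    print(min_cycles_two_pipes(["FP1", "FP1"]))          # same pipe
    print(min_cycles_two_pipes(["FP0/1"] * 4))           # either pipe
```

Two micro-operations that go to different pipes fit in one clock cycle in this model, while two micro-operations that both need FP1 take two. Measured throughputs in the tables may differ because real instructions can occupy a pipe for more than one clock cycle.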

Integer instructions
Instruction
Move instructions
MOV
MOV

Operands

Ops
Latency
Reciprocal throughput

Execution pipe

r,r
r,i

1
1

0.5
0.5

I0/1
I0/1

MOV

r8/16,m

AGU

MOV

m,r8/16

AGU

MOV

r32/64,m

AGU

MOV
MOV
MOVNTI
MOVZX, MOVSX
MOVZX, MOVSX
MOVSXD

m,r32/64
m,i
m,r
r,r
r,m
r64,r32

1
1
1
1
1
1

1
1
1
0.5
1
0.5

AGU
AGU
AGU
I0/1

6
1
4
1

Notes

Any addressing mode
Any addressing mode
Any addressing mode
Any addressing mode

MOVSXD
CMOVcc
CMOVcc
XCHG
XCHG
XCHG
XLAT
PUSH
PUSH
PUSH
PUSH
PUSHF(D/Q)
PUSHA(D)
POP
POP
POP
POPF(D/Q)
POPA(D)
LEA
LEA
LEA
LEA
LAHF
SAHF
SALC
BSWAP
MOVBE
MOVBE
PREFETCHNTA
PREFETCHT0/1/2
PREFETCHW
LFENCE
MFENCE
SFENCE
Arithmetic instructions
ADD, SUB
ADD, SUB
ADD, SUB
ADC, SBB
ADC, SBB
ADC, SBB
CMP
CMP
INC, DEC, NEG
INC, DEC, NEG
AAA
AAS
DAA
DAS
AAD
AAM

r64,m32
r,r
r,m
r8,r8
r,r

1
1
1
3
2

3
1

r,m

3
2
1
1
2
2
9
9
1
3
1
29
9
2
1
1
1
4
1
1
1
1
1
1
1
1
1
4
4

16
5

1
1
1
1
1
1
1
1
1
1
9
9
12
16
4
8

r
i
m
SP

r
m
SP

r16,[m]
r32/64,[m]
r32/64,[m]
r64,[m]

r
r,m
m,r
m
m
m

r,r/i
r,m
m,r
r,r/i
r,m
m,r/i
r,r/i
r,m
r
m

2
1

3
1
2
3
1
1
1

6
1
8
1
1
6
5
8
6
8
5
14

1
0.5
1
2
1

I0/1
I0/1
I0/1
Timing depends on hw

3
1
1
1
1
6
8
1
2
2
18
8
2
0.5
1
0.5
2
0.5
1
0.5
1
1
~100
~100
~100
0.5
~45
~45

I0
I0/1
I0
I0/1
I0/1
I0/1

MOVBE
MOVBE
AGU
AGU
AGU
AGU
AGU
AGU

0.5
1
1
1
1

I0/1

0.5
1
0.5
1

I0/1

13

Any address size


1-2 comp., no scale
3 comp. or scale
RIP relative

I0/1

I0/1

MUL, IMUL
MUL, IMUL
MUL, IMUL
MUL, IMUL
IMUL
IMUL
IMUL
IMUL
IMUL
IMUL
DIV
DIV
DIV
DIV
IDIV
IDIV
IDIV
IDIV
CBW, CWDE, CDQE
CWD, CDQ, CQO
Logic instructions
AND, OR, XOR
AND, OR, XOR
AND, OR, XOR
AND, OR, XOR
ANDN
ANDN
TEST
TEST
TEST
NOT
NOT
SHL, SHR, SAR
ROL, ROR
RCL, RCR
RCL
RCR
RCL
RCR
SHL,SHR,SAR,ROL,ROR
RCL, RCR
RCL
RCR
RCL
RCR
SHLD, SHRD
SHLD, SHRD
SHLD, SHRD
BT
BT
BT

r8/m8
r16/m16
r32/m32
r64/m64
r16,r16/m16
r32,r32/m32
r64,r64/m64
r16,(r16),i
r32,(r32),i
r64,(r64),i
r8/m8
r16/m16
r32/m32
r64/m64
r8/m8
r16/m16
r32/m32
r64/m64

1
3
2
2
1
1
1
2
1
1
1
2
2
2
1
2
2
2
1
1

3
3
3
6
3
3
6
4
3
6
11-14
12-19
12-27
12-43
11-14
12-19
12-27
12-43
1
1

1
3
2
5
1
1
4
1
1
4
11-14
12-19
12-27
12-43
11-14
12-19
12-27
12-43

I0
I0
I0
I0
I0
I0
I0
I0
I0
I0
I0
I0
I0
I0
I0
I0
I0
I0
I0/1
I0/1

r,i
r,r
r,m
m,r
r,r,r
r,r,m
r,i
r,r
r,m
r
m
r,i/CL
r,i/CL
r,1
r,i
r,i
r,CL
r,CL

1
1
1
1
1
2
1
1
1
1
1
1
1
1
9
7
9
7

1
1

0.5
0.5
1
1
0.5
1
0.5
0.5
1
0.5
1
0.5
0.5
1
5
4
5
4

I0/1
I0/1

m,i /CL
m,1
m,i
m,i
m,CL
m,CL
r,r,i
r,r,cl
m,r,i/CL
r,r/i
m,i
m,r

1
1
10
9
9
8
6
7
8
1
1
5

6
1
1
1
1
6
1
1
1
5
4
5
4

3
4


1
1
11
11
11
11
3
4
11
0.5
1
3

BMI1
BMI1
I0/1
I0/1
I0/1
I0/1
I0/1
I0/1

BTC, BTR, BTS
BTC
BTR, BTS
BTC, BTR, BTS
BSF
BSR
BSF, BSR
POPCNT
LZCNT
TZCNT
BLSI BLSR
BLSI BLSR
BLSMSK
BLSMSK
BEXTR
BEXTR
SETcc
SETcc
CLC, STC
CMC
CLD
STD

r,r/i
m,i
m,i
m,r
r,r
r,r
r,m
r,r/m
r,r
r,r
r,r
r,m
r,r
r,m
r,r,r
r,m,r
r
m

2
5
4
8
7
8
8
1
1
2
2
3
2
3
1
2
1
1
1
1
1
2

4
4
1
1
2
2
2
1
1

1
11
11
11
4
4
4
0.5
0.5
1
1
2
1
2
0.5
1
0.5
1
0.5

1
1
2

Control transfer instructions


JMP
short/near
JMP
r
JMP
m(near)
Jcc
short/near
J(E/R)CXZ
short
LOOP
short
LOOPE LOOPNE
short
CALL
near
CALL
r
CALL
m(near)
RET
RET
i

1
1
1
1
2
8
10
2
2
5
1
4

2
2
2
0.5 - 2
1-2
5
6
2
2
2
3
3

BOUND

4
~5n
4
~2n
2/16B
7
~2n
2/16B
5
~6n

2
~3n
2
~n
1/16B
4
~1.5n
1/16B
3
~3n

INTO
String instructions
LODS
REP LODS
STOS
REP STOS
REP STOS
MOVS
REP MOVS
REP MOVS
SCAS
REP SCAS


SSE4A/SSE4.2
SSE4A/LZCNT
BMI1
BMI1
BMI1
BMI1
BMI1
BMI1
BMI1

I0/1
I0/1
I0
I0,I1

2 if jumping
2 if jumping

values are for no jump
values are for no jump

for small n
best case
for small n
best case

CMPS
REP CMPS
Synchronization
LOCK ADD
XADD
LOCK XADD
CMPXCHG
LOCK CMPXCHG
CMPXCHG
LOCK CMPXCHG
CMPXCHG8B
LOCK CMPXCHG8B
CMPXCHG16B
LOCK CMPXCHG16B
Other
NOP (90)
Long NOP (0F 1F)
PAUSE
ENTER
ENTER
LEAVE
CPUID
XGETBV
RDTSC
RDTSCP
RDPMC
CRC32
CRC32

7
~6n

m,r
m,r
m,r
m,r8
m,r8
m,r16/32/64
m,r16/32/64
m64
m64
m128
m128

r,r
r,m

1
4
4
5
5
6
6
18
18
28
28

4
~3n

19
11
16
11
16
11
17
11
19
32
38

1
1
37
i,0
12
a,b
10+6b
2
30-59 70-230
5
34
34
30
3
3
4

0.5
0.5
46

I0/1
I0/1
18

17+3b
3

32 bit mode

5
41
42
27
2
2

rdtscp

Floating point x87 instructions


Instruction
Move instructions
FLD
FLD
FLD
FBLD
FST(P)
FST(P)
FSTP
FBSTP
FXCH
FILD
FIST(T)(P)
FLDZ, FLD1
FCMOVcc
FFREE
FINCSTP, FDECSTP
FNSTSW
FNSTSW

Operands

r
m32/64
m80
m80
r
m32/64
m80
m80
r
m
m
st0,r
r
AX
m16

Ops
Latency
Reciprocal throughput
1
1
7
21
1
1
10
217
1
1
1
1
12
1
1
2
2

2
4
9
24
2
3
9
167
0
8
4
7
1


0.5
1
5
29
0.5
1
7
168
1
1
1
1
7
1
1
11
11

Execution pipe
FP0/1
FP0/1

FP0/1
FP1

FP1
FP1
FP1
FP1
FP0/1
FP1
FP1
FP1
FP1

Notes

FNSTCW
FLDCW
Arithmetic instructions
FADD(P),FSUB(R)(P)
FADD(P),FSUB(R)(P)
FIADD,FISUB(R)
FMUL(P)
FMUL(P)
FIMUL
FDIV(R)(P)
FDIV(R)(P)
FIDIV(R)
FABS, FCHS
FCOM(P), FUCOM(P)
FCOM(P), FUCOM(P)
FCOMPP, FUCOMPP
FCOMI(P)
FICOM(P)
FTST
FXAM
FRNDINT
FPREM
FPREM1

m16
m16

3
12

r
m
m
r
m
m
r
m
m

1
1
2
1
1
1
1
1
2
1
1
1
1
1
2
1
2
5
1
1

r
m
r
m

Math
FSQRT
FLDPI, etc.
FSIN
FCOS
FSINCOS
FPTAN
FPATAN
FSCALE
FXTRACT
F2XM1
FYL2X
FYL2XP1
Other
FNOP
(F)WAIT
FNCLEX
FNINIT
FNSAVE
FRSTOR

1
1
4-44
11-51
11-76
11-45
9-75
5
7
8
8-51
61

m
m

1
1
9
27
88
80

22

8
11-54
11-56

35
30-139
38-93
55-122
55-177
44-167
27
9
32-37
30-120
~160

138-150
136

2
9

FP0
FP1

1
1
2
3
3

FP0
FP0
FP0,FP1
FP1
FP1
FP1
FP1
FP1
FP1
FP1
FP0
FP0
FP0
FP0
FP0, FP1
FP0
FP1
FP0, FP1
FP1
FP1

22
22
22
2
1
1
1
2
1
1
2
4

35
1
30-151

30-120
~160

FP1
FP0
FP0, FP1
FP0, FP1
FP0, FP1
FP0, FP1
FP0, FP1
FP0, FP1
FP0, FP1
FP0, FP1
FP0, FP1
FP0, FP1

0.5
0.5
32
78
138-150
136

FP0/1
ALU
FP0, FP1
FP0, FP1
FP0, FP1
FP0, FP1

55-180
55-177
44-167
6

Integer MMX and XMM instructions


Instruction
Move instructions
MOVD

Operands

r32, mm

Ops
Latency
Reciprocal throughput
1

4

Execution pipe
FP0

Notes

MOVD
MOVD
MOVD
MOVD
MOVD
MOVD

mm, r32
mm,m32
r32, x
x, r32
x,m32
m32,(x)mm

2
1
1
2
1
1

6
4
4
6
4
3

1
1
1
1
1
1

FP0/1
AGU
FP0
FP1
AGU
FP1

MOVD / MOVQ
r64,(x)mm
MOVQ
mm,r64
MOVQ
x,r64
MOVQ
mm,mm
MOVQ
x,x
MOVQ
(x)mm,m64
MOVQ
m64,(x)mm
MOVDQA
x,x
VMOVDQA
y,y
MOVDQA
x,m
VMOVDQA
y,m
MOVDQA
m,x
VMOVDQA
m,y
MOVDQU, LDDQU
x,m
MOVDQU
m,x
MOVDQ2Q
mm,x
MOVQ2DQ
x,mm
MOVNTQ
m,mm
MOVNTDQ
m,x
PACKSSWB/DW
PACKUSWB
mm,r/m
PACKSSWB/DW
PACKUSWB
x,r/m
PUNPCKH/LBW/WD/DQ
mm,r/m
PUNPCKH/LBW/WD/DQ
x,r/m
PUNPCKH/LQDQ
x,r/m
PSHUFB
mm,mm
PSHUFB
x,x
PSHUFD
x,x,i
PSHUFW
mm,mm,i
PSHUFL/HW
x,x,i
PALIGNR
x,x,i
PBLENDW
x,r/m
MASKMOVQ
mm,mm
MASKMOVDQU
x,x
PMOVMSKB
r32,(x)mm
PEXTRW
r32,(x)mm,i
PINSRW
mm,r32,i
PINSRB/W/D/Q
x,r,i
PINSRB/W/D/Q
x,m,i
PEXTRB/W/D/Q
r,x,i
PEXTRB/W/D/Q
m,x,i
INSERTQ
x,x

1
2
2
1
1
1
1
1
2
1
2
1
2
1
1
1
1
1
1

4
6
6
1
1
4
3
1
1
4
4
3
3
4
3
1
1
429
429

1
1
1
0.5
0.5
1
1
0.5
1
1
2
1
2
1
1
0.5
0.5
2
2

FP0
FP0/1
FP0/1
FP0/1
FP0/1
AGU
FP1
FP0/1
FP0/1
AGU
AGU
FP1
FP1
AGU
FP1
FP0/1
FP0/1
FP1
FP1

0.5

FP0/1

0.5

FP0/1

0.5

FP0/1

1
1
1
3
1
1
1
1
1
32
64
1
1
2
2
1
1
1
3

2
2
1
4
2
1
1
2
1
432
43-2210
3
4
8
7

0.5
0.5
0.5
2
0.5
0.5
0.5
0.5
0.5
17
34
1
1
1
1
1
1
1
2

FP0/1
FP0/1
FP0/1
FP0/1
FP0/1
FP0/1
FP0/1
FP0/1
FP0/1
FP0, FP1
FP0, FP1
FP0
FP0
FP0/1
FP0/1
FP0/1
FP0
FP1
FP0, FP1

3
2

Moves 64 bits. Name of instruction differs
do.
do.

AVX
AVX
AVX

Suppl. SSE3
Suppl. SSE3

Suppl. SSE3
SSE4.1

SSE4.1
SSE4.1
SSE4A, AMD only

INSERTQ
EXTRQ
EXTRQ
PMOVSXBW/BD/BQ/WD/WQ/DQ
PMOVZXBW/BD/BQ/WD/WQ/DQ

x,x,i,i
x,x
x,x,i,i

3
1
1

2
1
1

2
0.5
0.5

FP0, FP1
FP0/1
FP0/1

SSE4A, AMD only


SSE4A, AMD only
SSE4A, AMD only

x,x

0.5

FP0/1

SSE4.1

x,x

0.5

FP0/1

SSE4.1

1
1
1
1
1
1
1

1
1
2
1
1
1
1

0.5
0.5
0.5
0.5
0.5
0.5
0.5

FP0/1
FP0/1
FP0/1
FP0/1
FP0/1
FP0/1
FP0/1

1
3
1
1
1
1
1
1
1
1
1
3

2
4
2
2
2
2
1
1
1
1
2
4

1
2
1
1
1
1
0.5
0.5
0.5
0.5
0.5
1

FP0
FP0 FP1
FP0
FP0
FP0
FP0
FP0/1
FP0/1
FP0/1
FP0/1
FP0/1
FP0/1

(x)mm,r/m

0.5

FP0/1

mm,i/mm/m

0.5

FP0/1

x,x

0.5

FP0/1

x,i

x,i
x,x/m

1
1
1

1
2
3

0.5
0.5
1

FP0/1
FP0/1
FP0

SSE4.1

x,x,i
x,m,i
x,x,i
x,m,i

9
10
9
10

5
5
9
9

FP0/1
FP0/1
FP0/1
FP0/1

SSE4.2
SSE4.2
SSE4.2
SSE4.2

Arithmetic instructions
PADDB/W/D/Q
PADDSB/W
PADDUSB/W
PSUBB/W/D/Q
PSUBSB/W
PSUBUSB/W
(x)mm,r/m
PHADD/SUBW/SW/D
mm,r/m
PHADD/SUBW/SW/D
x,r/m
PCMPEQ/GT B/W/D
mm,r/m
PCMPEQ/GT B/W/D
x,r/m
PCMPEQQ
(x)mm,r/m
PCMPGTQ
(x)mm,r/m
PMULLW PMULHW
PMULHUW
PMULUDQ
(x)mm,r/m
PMULLD
x,r/m
PMULDQ
x,r/m
PMULHRSW
(x)mm,r/m
PMADDWD
(x)mm,r/m
PMADDUBSW
(x)mm,r/m
PAVGB/W
(x)mm,r/m
PMIN/MAX SW/UB
(x)mm,r/m
PABSB/W/D
(x)mm,r/m
PSIGNB/W/D
(x)mm,r/m
PSADBW
(x)mm,r/m
MPSADBW
x,x,i
Logic
PAND PANDN POR
PXOR
PSLL/RL W/D/Q
PSRAW/D
PSLL/RL W/D/Q
PSRAW/D
PSLL/RL W/D/Q
PSRAW/D
PSLLDQ, PSRLDQ
PTEST
String instructions
PCMPESTRI
PCMPESTRI
PCMPESTRM
PCMPESTRM

9

Suppl. SSE3
Suppl. SSE3

SSE4.1
SSE4.2

SSE4.1
SSE4.1
Suppl. SSE3
Suppl. SSE3

Suppl. SSE3
Suppl. SSE3
SSE4.1

PCMPISTRI
PCMPISTRI
PCMPISTRM
PCMPISTRM
Encryption
PCLMULQDQ
AESDEC
AESDECLAST
AESENC
AESENCLAST
AESIMC
AESKEYGENASSIST

x,x,i
x,m,i
x,x,i
x,m,i

3
4
3
4

x,x/m,i
x,x
x,x
x,x
x,x
x,x
x,x,i

1
2
2
2
2
1
1

3
5
5
5
5
2
2

Other
EMMS

2
2
8
2

FP0/1
FP0/1
FP0/1
FP0/1

SSE4.2
SSE4.2
SSE4.2
SSE4.2

1
1
1
1
1
1
1

FP0
FP0/1
FP0/1
FP0/1
FP0/1
FP0/1
FP0/1

PCLMUL
AES
AES
AES
AES
AES
AES

0.5

FP0/1

Floating point XMM instructions


Instruction   Operands  Ops  Latency  Reciprocal throughput  Execution pipe
Move instructions
MOVAPS/D      x,x       1    1        0.5                    FP0/1
VMOVAPS/D     y,y       2    1        1                      FP0/1
MOVAPS/D      x,m       1    4        1                      AGU
VMOVAPS/D     y,m       2    4        2                      AGU
MOVAPS/D      m,x       1    3        1                      FP1
VMOVAPS/D     m,y       2    3        2                      FP1
MOVUPS/D      x,x       1    1        0.5                    FP0/1
VMOVUPS/D     y,y       2    1        1                      FP0/1
MOVUPS/D      x,m       1    4        1                      AGU
VMOVUPS/D     y,m       2    4        2                      AGU
MOVUPS/D      m,x       1    3        1                      FP1
VMOVUPS/D     m,y       2    3        2                      FP1
MOVSS/D       x,x       1    1        0.5                    FP0/1
MOVSS/D       x,m       1    4        1                      AGU
MOVSS/D       m,x       1    3        1                      FP1

MOVHLPS, MOVLHPS
MOVHPS/D,
MOVLPS/D
MOVHPS/D,
MOVLPS/D
MOVNTPS/D
MOVNTSS/D
MOVDDUP
MOVDDUP
VMOVDDUP
VMOVDDUP
MOVSH/LDUP
MOVSH/LDUP
VMOVSH/LDUP

x,x

FP0/1

x,m

FP0/1

m,x
m,x
m,x
x,x
x,m64
y,y
y,m
x,x
x,m
y,y

1
1
1
1
1
2
2
1
1
2

4
429

1
1
1
0.5
1
1
2
0.5
1
1

FP1
FP1
FP1
FP0/1
AGU
FP0/1
AGU
FP0/1
AGU
FP0/1

2
2
1
1

Notes

SSE4A, AMD only


SSE3
SSE3
AVX
AVX

AVX

VMOVSH/LDUP
MOVMSKPS/D
VMOVMSKPS/D
SHUFPS/D
VSHUFPS/D
UNPCK H/L PS/D
VUNPCK H/L PS/D
EXTRACTPS
EXTRACTPS
VEXTRACTF128
VEXTRACTF128
INSERTPS
INSERTPS
VINSERTF128
VINSERTF128
VMASKMOVPS/D
VMASKMOVPS/D
VMASKMOVPS/D
VMASKMOVPS/D

y,m
r32,x
r32,y
x,x/m,i
y,y,y,i
x,x/m
y,y,y
r32,x,i
m32,x,i
x,y,i
m128,y,i
x,x,i
x,m32,i
y,y,x,i
y,y,m128,i
x,x,m128
y,y,m256
m128,x,x
m256,y,y

2
1
1
1
2
1
2
1
1
1
1
1
1
2
2
1
2
19
36

Conversion
CVTPS2PD
VCVTPS2PD
CVTPD2PS
VCVTPD2PS
CVTSD2SS
CVTSS2SD
CVTDQ2PS/PD
VCVTDQ2PS/PD
CVT(T)PS2DQ
VCVT(T)PS2DQ
CVT(T)PD2DQ
VCVT(T)PD2DQ
CVTPI2PS
CVTPI2PD
CVT(T)PS2PI
CVT(T)PD2PI
CVTSI2SS
CVTSI2SD
CVT(T)SS2SI
CVT(T)SD2SI
VCVTPS2PH
VCVTPS2PH
VCVTPH2PS
VCVTPH2PS

x,x/m
y,x/m
x,x/m
x,y
x,x/m
x,x/m
x,x/m
y,y
x,x/m
y,y
x,x/m
y,y
xmm,mm
xmm,mm
mm,xmm
mm,xmm
xmm,r32
xmm,r32
r32,xmm
r32,xmm
x/m,x,i
x/m,y,i
x,x/m
y,x/m

x,x/m
x,x/m
y,y/m
x,x/m
y,y/m

Arithmetic
ADDSS/D SUBSS/D
ADDPS/D SUBPS/D
VADDPS/D VSUBPS/D

ADDSUBPS/D
VADDSUBPS/D

6
1
13
15
15
21
32

2
1
1
0.5
1
0.5
1
1
1
0.5
1
1
1
1
2
1
2
16
22

AGU
FP0
FP0
FP0/1
FP0/1
FP0/1
FP0/1
FP0
FP1
FP0/1
FP1
FP0/1
FP0/1
FP0/1
FP0/1
FP0/1
FP0/1
FP1
FP1

AVX
AVX
>300 clk if mask=0
>300 clk if mask=0
AVX
AVX

1
2
1
3
2
2
1
2
1
2
1
3
1
1
1
1
2
2
2
2
1
3
1
2

3
4
4
6
5
4
4
4
4
4
4
7
4
4
4
4
9
9
8
8
4
6
4
5

1
2
1
2
8
7
1
2
1
2
1
2
1
1
1
1
1
1
1
1
1
2
1
2

FP1
FP1
FP1
FP0, FP1
FP1
FP1
FP1
FP1
FP1
FP1
FP1
FP1
FP1
FP1
FP1
FP1
FP1
FP1
FP1
FP1
FP1
FP0, FP1
FP1
FP1

F16C
F16C
F16C
F16C

1
1
2
1
2

3
3
3
3
3

1
1
2
1
2

FP0
FP0
FP0
FP0
FP0

3
3
2
2
2
2
3
3
1
12


AVX
AVX
AVX
AVX

AVX
AVX

SSE3

HADD/SUBPS/D
VHADD/SUBPS/D
MULSS/PS
VMULPS
MULSD/PD
VMULPD
DIVSS
DIVPS
VDIVPS
DIVSD
DIVPD
VDIVPD
RCPSS
RCPPS
VRCPPS
MAXSS/D MINSS/D
MAXPS/D MINPS/D
VMAXPS/D VMINPS/D
CMPccSS/D
CMPccPS/D
VCMPccPS/D
(U)COMISS/D
ROUNDSS/SD/PS/PD
VROUNDSS/D/PS/D
DPPS
DPPS
VDPPS
VDPPS
DPPD
DPPD

x,x/m
y,y/m
x,x/m
y,y/m
x,x/m
y,y/m
x,x/m
x,x/m
y,y/m
x,x/m
x,x/m
y,y/m
x,x/m
x,x/m
y,y/m
x,x/m
x,x/m
y,y/m
x,x/m
x,x/m
y,y/m
x,x/m
x,x/m,i
y,y/m,i
x,x,i
x,m,i
y,y,y,i
y,m,i
x,x,i
x,m,i

1
2
1
2
1
2
1
1
2
1
1
2
1
1
2
1
1
2
1
1
2
1
1
2
5
6
10
12
3
4

4
4
2
2
4
4
14
19
38
19
19
38
2
2
2
2
2
2
2
2
2

Logic
ANDPS/D ANDNPS/D
ORPS/D XORPS/D
VANDPS/D, etc.

x,x/m
y,y/m

1
2

Math
SQRTSS
SQRTPS
VSQRTPS
SQRTSD
SQRTPD
VSQRTPD
RSQRTSS/PS
VRSQRTPS

x,x/m
x,x/m
y,y/m
x,x/m
x,x/m
y,y/m
x,x/m
y,y/m

m
m

Other
LDMXCSR
STMXCSR
VZEROUPPER
VZEROUPPER
VZEROALL

1
2
1
2
2
2
14
19
38
19
19
38
1
1
2
1
1
2
1
1
2
1
1
2
4
4
7
7
3
3

FP0
FP0
FP1
FP1
FP1
FP1
FP1
FP1
FP1
FP1
FP1
FP1
FP1
FP1
FP1
FP0
FP0
FP0
FP0
FP0
FP0
FP0
FP1
FP1
FP0, FP1
FP0, FP1
FP0, FP1
FP0, FP1
FP0, FP1
FP0, FP1

1
1

0.5
1

FP0/1
FP0/1

1
2
2
1
2
2
1
2

16
21
42
27
27
54
2
2

16
21
42
27
27
54
1
2

FP1
FP1
FP1
FP1
FP1
FP1
FP1
FP1

12
3
21
37
41

9
13

8
12
30
46
58

FP0, FP1
FP0, FP1

4
4
11
12
9


SSE3

SSE4.1
SSE4.1
SSE4.1
SSE4.1
SSE4.1
SSE4.1

32 bit mode
64 bit mode
32 bit mode

Instruction  Ops  Latency  Reciprocal throughput  Notes
VZEROALL     73            90                     64 bit mode
FXSAVE       66   66       66                     32 bit mode
FXSAVE       58   58       58                     64 bit mode
FXRSTOR      115  189      189                    32 bit mode
FXRSTOR      123  198      197                    64 bit mode
XSAVE        130  145      145                    32 bit mode
XSAVE        114  129      129                    64 bit mode
XRSTOR       219  342      342                    32 bit mode
XRSTOR       251  375      375                    64 bit mode

Intel Pentium

Intel Pentium and Pentium MMX


List of instruction timings
Explanation of column headings:
Operands: r = register, accum = al, ax or eax, m = memory, i = immediate data, sr = segment register, m32 = 32 bit memory operand, etc.
Clock cycles: The numbers are minimum values. Cache misses, misalignment, and exceptions may increase the clock counts considerably.
Pairability: u = pairable in u-pipe, v = pairable in v-pipe, uv = pairable in either pipe, np = not pairable.
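
As a rough illustration of how the Clock cycles and Pairability columns combine, the following Python sketch estimates the clock count of a straight-line instruction sequence. It is a deliberately simplified model, not the full set of pairing rules: it assumes no register dependencies, no cache misses, and that two adjacent instructions pair whenever the first may execute in the u-pipe and the second in the v-pipe. The function name estimate_clocks and the tuple format are invented for this example, and the sample values are only illustrative. The complete pairing rules are described in manual 3.

```python
def estimate_clocks(instructions):
    """Estimate clocks for a list of (clock_cycles, pairability) tuples.

    pairability is one of "u", "v", "uv", "np", as in the tables.
    Simplification (assumption): dependencies and all other pairing
    rules are ignored.
    """
    total = 0
    i = 0
    while i < len(instructions):
        clocks, pair = instructions[i]
        # Try to pair: this instruction in the u-pipe, the next in the v-pipe.
        if pair in ("u", "uv") and i + 1 < len(instructions):
            next_clocks, next_pair = instructions[i + 1]
            if next_pair in ("v", "uv"):
                # Assumed: a pair takes as long as its slower member.
                total += max(clocks, next_clocks)
                i += 2
                continue
        total += clocks
        i += 1
    return total


# Illustrative sequence: two uv instructions, one u-only instruction
# followed by a uv instruction, then a long non-pairable instruction.
sequence = [(1, "uv"), (1, "uv"), (1, "u"), (1, "uv"), (41, "np")]
print(estimate_clocks(sequence))
```

The model is greedy: it pairs the earliest possible instruction and moves on, which mirrors in-order issue but says nothing about the stalls that real dependencies would add.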

Integer instructions (Pentium and Pentium MMX)


Instruction
NOP
MOV
MOV
MOV
MOV
XCHG
XCHG
XCHG
XLAT
PUSH
POP
PUSH
POP
PUSH
POP
PUSHF
POPF
PUSHA POPA
PUSHAD POPAD
LAHF SAHF
MOVSX MOVZX
LEA
LDS LES LFS LGS LSS
ADD SUB AND OR XOR
ADD SUB AND OR XOR
ADD SUB AND OR XOR
ADC SBB
ADC SBB
ADC SBB
CMP
CMP
TEST
TEST
TEST
TEST
INC DEC

Operands
r/m, r/m/i
r/m, sr
sr , r/m
m , accum
(E)AX, r
r,r
r,m
r/i
r
m
m
sr
sr

r , r/m
r,m
m
r , r/i
r,m
m , r/i
r , r/i
r,m
m , r/i
r , r/i
m , r/i
r,r
m,r
r,i
m,i
r

Clock cycles Pairability


1
uv
1
uv
1
np
>= 2 b)
np
1
uv h)
2
np
3
np
>15
np
4
np
1
uv
1
uv
2
np
3
np
1 b)
np
>= 3 b)
np
3-5
np
4-6
np
5-9 i)
np
5
np
2
np
3 a)
np
1
uv
4 c)
np
1
uv
2
uv
3
uv
1
u
2
u
3
u
1
uv
2
uv
1
uv
2
uv
1
f)
2
np
1
uv
INC DEC
NEG NOT
MUL IMUL
MUL IMUL
DIV
DIV
DIV
IDIV
IDIV
IDIV
CBW CWDE
CWD CDQ
SHR SHL SAR SAL
SHR SHL SAR SAL
SHR SHL SAR SAL
ROR ROL RCR RCL
ROR ROL
ROR ROL
RCR RCL
RCR RCL
SHLD SHRD
SHLD SHRD
BT
BT
BT
BTR BTS BTC
BTR BTS BTC
BTR BTS BTC
BSF BSR
SETcc
JMP CALL
JMP CALL
conditional jump
CALL JMP
RETN
RETN
RETF
RETF
J(E)CXZ
LOOP
BOUND
CLC STC CMC CLD STD
CLI STI
LODS
REP LODS
STOS
REP STOS
MOVS
REP MOVS
SCAS

m
r/m
r8/r16/m8/m16
all other versions
r8/m8
r16/m16
r32/m32
r8/m8
r16/m16
r32/m32

r,i
m,i
r/m, CL
r/m, 1
r/m, i(><1)
r/m, CL
r/m, i(><1)
r/m, CL
r, i/CL
m, i/CL
r, r/i
m, i
m, i
r, r/i
m, i
m, r
r , r/m
r/m
short/near
far
short/near
r/m
i
i
short
short
r,m

3
1/3
11
9 d)
17
25
41
22
30
46
3
2
1
3
4/5
1/3
1/3
4/5
8/10
7/9
4 a)
5 a)
4 a)
4 a)
9 a)
7 a)
8 a)
14 a)
7-73 a)
1/2 a)
1 e)
>= 3 e)
1/4/5/6 e)
2/5 e)
2/5 e)
3/6 e)
4/7 e)
5/8 e)
4-11 e)
5-10 e)
8
2
6-9
2
7+3*n g)
3
10+n g)
4
12+n g)
4

uv
np
np
np
np
np
np
np
np
np
np
np
u
u
np
u
np
np
np
np
np
np
np
np
np
np
np
np
np
np
v
np
v
np
np
np
np
np
np
np
np
np
np
np
np
np
np
np
np
np

Instruction      Clock cycles     Pairability
REP(N)E SCAS     9+4*n g)         np
CMPS             5                np
REP(N)E CMPS     8+4*n g)         np
BSWAP            1 a)             np
CPUID            13-16 a)         np
RDTSC            6-13 a) j)       np

Notes:
a) This instruction has a 0FH prefix which takes one clock cycle extra to decode on a P1 unless preceded by a multi-cycle instruction.
b) Versions with FS and GS have a 0FH prefix, see note a.
c) Versions with SS, FS, and GS have a 0FH prefix, see note a.
d) Versions with two operands and no immediate have a 0FH prefix, see note a.
e) High values are for mispredicted jumps/branches.
f) Only pairable if register is AL, AX or EAX.
g) Add one clock cycle for decoding the repeat prefix unless preceded by a multi-cycle instruction (such as CLD).
h) Pairs as if it were writing to the accumulator.
i) 9 if SP divisible by 4 (imperfect pairing).
j) On P1: 6 in privileged or real mode; 11 in non-privileged; error in virtual mode. On PMMX: 8 and 13 clocks respectively.
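
The clock counts of the repeated string instructions are given as formulas in the repeat count n. The following Python sketch evaluates those formulas and applies the extra decoding clock from the note on repeat prefixes. The dictionary and function names are invented for this example, and the results are minimum values under the same assumptions as the table (no cache misses or misalignment).

```python
# Minimum clock-count formulas for repeated string instructions on the
# Pentium, as listed in the table (n = repeat count).
REP_FORMULAS = {
    "REP LODS": lambda n: 7 + 3 * n,
    "REP STOS": lambda n: 10 + n,
    "REP MOVS": lambda n: 12 + n,
    "REP(N)E SCAS": lambda n: 9 + 4 * n,
    "REP(N)E CMPS": lambda n: 8 + 4 * n,
}


def rep_clocks(instruction, n, after_multicycle=False):
    """Return the minimum clocks for a repeated string instruction.

    One clock is added for decoding the repeat prefix unless the
    preceding instruction is a multi-cycle instruction (such as CLD).
    """
    base = REP_FORMULAS[instruction](n)
    return base if after_multicycle else base + 1


print(rep_clocks("REP MOVS", 100))                         # prefix decoding included
print(rep_clocks("REP MOVS", 100, after_multicycle=True))  # e.g. preceded by CLD
```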

Floating point instructions (Pentium and Pentium MMX)


Explanation of column headings
Operands: r = register, m = memory, m32 = 32-bit memory operand, etc.
Clock cycles: The numbers are minimum values. Cache misses, misalignment, denormal operands, and exceptions may increase the clock counts considerably.
Pairability: + = pairable with FXCH, np = not pairable with FXCH.
i-ov: Overlap with integer instructions. i-ov = 4 means that the last four clock cycles can overlap with subsequent integer instructions.
fp-ov: Overlap with floating point instructions. fp-ov = 2 means that the last two clock cycles can overlap with subsequent floating point instructions. (WAIT is considered a floating point instruction here.)

Instruction
FLD
FLD
FBLD
FST(P)
FST(P)
FST(P)
FBSTP
FILD
FIST(P)
FLDZ FLD1
FLDPI FLDL2E etc.
FNSTSW
FLDCW
FNSTCW

Operand
r/m32/m64
m80
m80
r
m32/m64
m80
m80
m
m

AX/m16
m16
m16

Clock cycles Pairability


1
0
3
np
48-58
np
1
np
2 m)
np
3 m)
np
148-154
np
3
np
6
np
2
np
5 s)
np
6 q)
np
8
np
2
np

i-ov
0
0
0
0
0
0
0
2
0
0
2
0
0
0

fp-ov
0
0
0
0
0
0
0
2
0
0
2
0
0
0

FADD(P)
FSUB(R)(P)
FMUL(P)
FDIV(R)(P)
FCHS FABS
FCOM(P)(P) FUCOM
FIADD FISUB(R)
FIMUL
FIDIV(R)
FICOM
FTST
FXAM
FPREM
FPREM1
FRNDINT
FSCALE
FXTRACT
FSQRT
FSIN FCOS
FSINCOS
F2XM1
FYL2X
FYL2XP1
FPTAN
FPATAN
FNOP
FXCH
FINCSTP FDECSTP
FFREE
FNCLEX
FNINIT
FNSAVE
FRSTOR
WAIT
Notes:
m
n
o
p

q
r
s

r/m
r/m
r/m
r/m
r/m
m
m
m
m

r
r

m
m

3
3
3
19/33/39 p)
1
1
6
6
22/36/42 p)
4
1
17-21
16-64
20-70
9-20
20-32
12-66
70
65-100 r)
89-112 r)
53-59 r)
103 r)
105 r)
120-147 r)
112-134 r)
1
1
2
2
6-9
12-22
124-300
70-95
1

0
0
0
0
0
0
np
np
np
np
np
np
np
np
np
np
np
np
np
np
np
np
np
np
np
np
np
np
np
np
np
np
np
np

2
2
2
38 o)
0
0
2
2
38 o)
0
0
4
2
2
0
5
0
69 o)
2
2
2
2
2
36 o)
2
0
0
0
0
0
0
0
0
0

2
2
2 n)
2
0
0
2
2
2
0
0
0
2
2
0
0
0
2
2
2
2
2
2
0
2
0
0
0
0
0
0
0
0
0

m) The value to store is needed one clock cycle in advance.
n) 1 if the overlapping instruction is also an FMUL.
o) Cannot overlap integer multiplication instructions.
p) FDIV takes 19, 33, or 39 clock cycles for 24, 53, and 64 bit precision respectively. FIDIV takes 3 clocks more. The precision is defined by bits 8-9 of the floating point control word.
q) The first 4 clock cycles can overlap with preceding integer instructions.
r) Clock counts are typical. Trivial cases may be faster, extreme cases may be slower.
s) May be up to 3 clocks more when output needed for FST, FCHS, or FABS.
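
The dependence of the FDIV clock count on the precision control bits can be sketched as a small lookup. The Python example below decodes bits 8-9 of a control word value and returns the Pentium clock counts quoted in the FDIV note. The encoding of the precision field (00 = 24 bits, 10 = 53 bits, 11 = 64 bits, 01 reserved) is taken from the x87 architecture definition rather than from these tables, and the function and table names are invented for this example.

```python
# FDIV clock counts on the Pentium by significand precision in bits.
FDIV_CLOCKS = {24: 19, 53: 33, 64: 39}

# Precision control field (bits 8-9 of the x87 control word).
# Assumption from the x87 architecture definition: 0b01 is reserved.
PRECISION_BITS = {0b00: 24, 0b10: 53, 0b11: 64}


def fdiv_clocks(control_word, integer_operand=False):
    """Return the FDIV clock count (FIDIV if integer_operand is True)."""
    pc = (control_word >> 8) & 0b11
    if pc not in PRECISION_BITS:
        raise ValueError("reserved precision control value")
    clocks = FDIV_CLOCKS[PRECISION_BITS[pc]]
    # FIDIV takes 3 clocks more than FDIV.
    return clocks + 3 if integer_operand else clocks


print(fdiv_clocks(0x037F))                        # 64-bit precision
print(fdiv_clocks(0x027F))                        # 53-bit precision
print(fdiv_clocks(0x007F, integer_operand=True))  # FIDIV, 24-bit precision
```

Lowering the precision in the control word is therefore one way to shorten divisions on this processor when full precision is not needed.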

MMX instructions (Pentium MMX)

A list of MMX instruction timings is not needed because they all take one clock cycle, except the MMX
multiply instructions which take 3. MMX multiply instructions can be pipelined to yield a throughput of one
multiplication per clock cycle.
The EMMS instruction takes only one clock cycle, but the first floating point instruction after an EMMS
takes approximately 58 clocks extra, and the first MMX instruction after a floating point instruction takes
approximately 38 clocks extra. There is no penalty for an MMX instruction after EMMS on the PMMX.
There is no penalty for using a memory operand in an MMX instruction because the MMX arithmetic unit
is one step later in the pipeline than the load unit. But the penalty comes when you store data from an
MMX register to memory or to a 32-bit register: The data have to be ready one clock cycle in advance.
This is analogous to the floating point store instructions.
All MMX instructions except EMMS are pairable in either pipe. Pairing rules for MMX instructions are described in manual 3: "The microarchitecture of Intel, AMD and VIA CPUs".
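
The cost of switching between MMX code and floating point code can be estimated by counting transitions. The Python sketch below adds up the approximate penalties quoted above for a program described as a sequence of "mmx" and "fp" blocks. It assumes that every MMX block ends with EMMS, and the names used are invented for this example. The figures are approximate and the sketch ignores everything except the transition penalties.

```python
# Approximate extra clocks quoted in the text above.
FP_AFTER_EMMS_PENALTY = 58  # first floating point instruction after EMMS
MMX_AFTER_FP_PENALTY = 38   # first MMX instruction after a floating point instruction


def transition_penalty(blocks):
    """Sum the approximate transition penalties for a list of code blocks.

    blocks is a list of "mmx" or "fp" strings in program order.
    Assumption: each "mmx" block ends with EMMS.
    """
    extra = 0
    for prev, cur in zip(blocks, blocks[1:]):
        if prev == "mmx" and cur == "fp":
            extra += FP_AFTER_EMMS_PENALTY
        elif prev == "fp" and cur == "mmx":
            extra += MMX_AFTER_FP_PENALTY
    return extra


print(transition_penalty(["mmx", "fp", "mmx"]))
```

Grouping MMX work and floating point work into fewer, longer blocks reduces the number of transitions and hence the total penalty.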


Pentium II and III

Intel Pentium II and Pentium III


List of instruction timings and μop breakdown
Explanation of column headings:
Operands: i = immediate data, r = register, mm = 64 bit mmx register, xmm = 128 bit xmm register, sr = segment register, m = memory, m32 = 32-bit memory operand, etc.
ops: The number of micro-operations (μops) that the instruction generates for each execution port.
p0: Port 0: ALU, etc.
p1: Port 1: ALU, jumps
p01: Instructions that can go to either port 0 or 1, whichever is vacant first.
p2: Port 2: load data, etc.
p3: Port 3: address generation for store
p4: Port 4: store data
Latency: This is the delay that the instruction generates in a dependency chain. (This is not the same as the time spent in the execution unit. Values may be inaccurate in situations where they cannot be measured exactly, especially with memory operands). The numbers are minimum values. Cache misses, misalignment, and exceptions may increase the clock counts considerably. Floating point operands are presumed to be normal numbers. Denormal numbers, NANs and infinity increase the delays by 50-150 clocks, except in XMM move, shuffle and Boolean instructions. Floating point overflow, underflow, denormal or NAN results give a similar delay.
Reciprocal throughput: The average number of clock cycles per instruction for a series of independent instructions of the same kind.
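
The difference between latency and reciprocal throughput is easiest to see by comparing a dependency chain with a series of independent instructions. The Python sketch below is a minimal model under the assumption that nothing else limits execution (no decoding, port or memory bottlenecks). The function names and the sample values are invented for illustration and are not taken from the tables.

```python
def chain_clocks(n, latency):
    """Clocks for n instructions where each depends on the previous result."""
    return n * latency


def independent_clocks(n, latency, reciprocal_throughput):
    """Clocks for n independent instructions of the same kind.

    A new instruction starts every reciprocal_throughput clocks, and the
    last result is ready one latency after the last instruction starts.
    """
    if n == 0:
        return 0
    return (n - 1) * reciprocal_throughput + latency


# Hypothetical instruction with latency 3 and reciprocal throughput 1.
print(chain_clocks(10, 3))
print(independent_clocks(10, 3, 1))
```

For long series the first function is governed by the latency and the second by the reciprocal throughput, which is why both figures are listed for each instruction.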

Integer instructions (Pentium Pro, Pentium II and Pentium III)


Instruction

Operands
p0

MOV
MOV
MOV
MOV
MOV
MOV
MOV
MOVSX MOVZX
MOVSX MOVZX
CMOVcc
CMOVcc
XCHG
XCHG
XLAT
PUSH
POP
POP
PUSH
POP
PUSH
POP
PUSHF(D)

r,r/i
r,m
m,r/i
r,sr
m,sr
sr,r
sr,m
r,r
r,m
r,r
r,m
r,r
r,m

p1

ops
p01 p2 p3
1
1
1
1
1
1

8
7

Latency
p4

1
1
5
8

1
1
1

1
1

r/i
r
(E)SP
m
m
sr
sr
3

1
1
3
4
1
1
1
2
1
5
2
8
11

1
1
1
1
1
1
1

1
1
1

1
1
1

high b)

Reciprocal
throughput



POPF(D)
PUSHA(D)
POPA(D)
LAHF SAHF
LEA
LDS LES LFS LGS
LSS
ADD SUB AND OR XOR
ADD SUB AND OR XOR
ADD SUB AND OR XOR
ADC SBB
ADC SBB
ADC SBB
CMP TEST
CMP TEST
INC DEC NEG NOT
INC DEC NEG NOT
AAA AAS DAA DAS
AAD
AAM
IMUL
IMUL
DIV IDIV
DIV IDIV
DIV IDIV
DIV IDIV
DIV IDIV
DIV IDIV
CBW CWDE
CWD CDQ
SHR SHL SAR ROR
ROL
SHR SHL SAR ROR
ROL
RCR RCL
RCR RCL
RCR RCL
RCR RCL
RCR RCL
RCR RCL
SHLD SHRD
SHLD SHRD
BT
BT
BTR BTS BTC
BTR BTS BTC
BSF BSR
BSF BSR
SETcc
SETcc

10

r,m

6
2
2
1

1
8

1 c)

m
r,r/i
r,m
m,r/i
r,r/i
r,m
m,r/i
r,r/i
m,r/i
r
m

8
1
1
1
2
2
3
1
1
1
1

3
1
1

1
1

1
1

r,(r),(i)
(r),m
r8
r16
r32
m8
m16
m32

1
1
1
1
2
3
3
2
2
2

2
2

4
15
4
4
19
23
39
19
23
39

1
1
1
1
1
1
1
1

1
1
1

1
r,i/CL

m,i/CL
r,1
r8,i/CL
r16/32,i/CL
m,1
m8,i/CL
m16/32,i/CL
r,r,i/CL
m,r,i/CL
r,r/i
m,r/i
r,r/i
m,r/i
r,r
r,m
r
m

1
1
4
3
1
4
4
2
2

1
4
3
2
3
2

1
1
1
1

1
1
6
1
6
1
1
1
1


1
1
1

1
1
1

1
1
1

1
1
1

1
1
12
21
37
12
21
37



JMP
JMP
JMP
JMP
JMP
conditional jump
CALL
CALL
CALL
CALL
CALL
RETN
RETN
RETF
RETF
J(E)CXZ
LOOP
LOOP(N)E
ENTER
ENTER
LEAVE
BOUND
CLC STC CMC
CLD STD
CLI
STI
INTO
LODS
REP LODS
STOS
REP STOS
MOVS
REP MOVS
SCAS
REP(N)E SCAS
CMPS
REP(N)E CMPS
BSWAP
NOP (90)
Long NOP (0F 1F)
CPUID
RDTSC
IN
OUT
PREFETCHNTA d)
PREFETCHT0/1/2 d)
SFENCE d)
Notes
a)

short/near
far
r
m(near)
m(far)
short/near
near
far
r
m(near)
m(far)

1
1
1
1
1
1

r,m

2
2

1
2

21
1

28

1
1
1

2
4

1
1

2
3

28

i
i
short
short
short
i,0
a,b

21

23
23
1
8
8
12
18 +4b
2
6
1
4

1
2
1
1
3
3

1
2
1
1
2

2
2

1
2
1
1
2

2
2
2
2

1
1
1

2
2
ca.
7

1
b-1

1
2b

1
2

9
17
5
2
10+6n

1
a)
3
a)
2

ca. 5n
1
ca. 6n
12+7n
12+9n
r

1
1
1

0.5
1

23-48
31
18
18
m
m

>300
>300
1
1
1

a) Faster under certain conditions: see manual 3: "The microarchitecture of Intel, AMD and VIA CPUs".
b) Has an implicit LOCK prefix.
c) 3 if constant without base or index register.
d) P3 only.

Floating point x87 instructions (Pentium Pro, II and III)


Instruction

Operands
p0

FLD
FLD
FLD
FBLD
FST(P)
FST(P)
FSTP
FBSTP
FXCH
FILD
FIST(P)
FLDZ
FLD1 FLDPI FLDL2E etc.
FCMOVcc
FNSTSW
FNSTSW
FLDCW
FNSTCW
FADD(P) FSUB(R)(P)
FADD(P) FSUB(R)(P)
FMUL(P)
FMUL(P)
FDIV(R)(P)
FDIV(R)(P)
FABS
FCHS
FCOM(P) FUCOM
FCOM(P) FUCOM
FCOMPP FUCOMPP
FCOMI(P) FUCOMI(P)
FCOMI(P) FUCOMI(P)
FIADD FISUB(R)
FIMUL
FIDIV(R)
FICOM(P)
FTST
FXAM
FPREM
FPREM1
FRNDINT
FSCALE
FXTRACT

r
m32/64
m80
m80
r
m32/m64
m80
m80
r
m
m

r
AX
m16
m16
m16
r
m
r
m
r
m

r
m
r
m
m
m
m
m

p1

ops
p01 p2 p3

Latency
p4

Reciprocal
throughput

1
1
2
2

2
38
1

1
2
2

2
165
3
2
1
2
2
3
1
1
1
1
1
1
1
1
1
1
3
1
1
1
1
1
6
6
6
6
1
1
23
33
30
56
15

1
2
2

1
1

0
5
5

f)

2
7
1
1

10
1

1
1
1

1
1
1
1
1
1
1

1
3
3-4
5
5-6
38 h)
38 h)
2
1
1
1
1
1

1
2


1
1
2 g)
2 g)
37
37



FSQRT
FSIN FCOS
FSINCOS
F2XM1
FYL2X
FYL2XP1
FPTAN
FPATAN
FNOP
FINCSTP FDECSTP
FFREE
FFREEP
FNCLEX
FNINIT
FNSAVE
FRSTOR
WAIT
Notes:
e)
f)
g)

h)

i)

r
r

1
17-97
18-110
17-48
36-54
31-53
21-102
25-86
1
1
1
2

27-103
29-130
66
103
98-107
13-143
44-143

69
e)
e)
e)
e)
e)
e)
e)

e,i)

3
13
141
72
2
e) Not pipelined.
f) FXCH generates 1 μop that is resolved by register renaming without going to any port.
g) FMUL uses the same circuitry as integer multiplication. Therefore, the combined throughput of mixed floating point and integer multiplications is 1 FMUL + 1 IMUL per 3 clock cycles.
h) FDIV latency depends on precision specified in control word: 64 bits precision gives latency 38, 53 bits precision gives latency 32, 24 bits precision gives latency 18. Division by a power of 2 takes 9 clocks. Throughput is 1/(latency-1), i.e. the reciprocal throughput is latency-1 clocks.
i) Faster for lower precision.

Integer MMX instructions (Pentium II and Pentium III)


Instruction

Operands
p0

MOVD MOVQ
MOVD MOVQ
MOVD MOVQ
PADD PSUB PCMP
PADD PSUB PCMP
PMUL PMADD
PMUL PMADD
PAND(N) POR PXOR
PAND(N) POR PXOR
PSRA PSRL PSLL
PSRA PSRL PSLL
PACK PUNPCK
PACK PUNPCK
EMMS
MASKMOVQ d)
PMOVMSKB d)
MOVNTQ d)

r,r
mm,m32/64
m32/64,mm
mm,mm
mm,m64
mm,mm
mm,m64
mm,mm
mm,m64
mm,mm/i
mm,m64
mm,mm
mm,m64

p1

ops
p01 p2 p3
1
1
1
1
1
1

1
1

Latency
p4
1
1
1
3
3
1

1
1
1
1
1
1
1

1
1
1
1
1

11
mm,mm
r32,mm
m64,mm


6 k)
2-8
1

Reciprocal
throughput
0.5
1
1
0.5
1
1
1
0.5
1
1
1
1
1
2 - 30
1
1 - 30



PSHUFW d)
PSHUFW d)
PEXTRW d)
PINSRW d)
PINSRW d)
PAVGB PAVGW d)
PAVGB PAVGW d)
PMIN/MAXUB/SW d)
PMIN/MAXUB/SW d)
PMULHUW d)
PMULHUW d)
PSADBW d)
PSADBW d)
Notes:
d)
k)

mm,mm,i
mm,m64,i
r32,mm,i
mm,r32,i
mm,m16,i
mm,mm
mm,m64
mm,mm
mm,m64
mm,mm
mm,m64
mm,mm
mm,m64

1
1
1
1
1

1
2
2
1
2
1
2
1
2
3
4
5
6

1
1
1
1
1
1
1

1
1
2
2

1
1
1

1
1

1
1
1
1
1
0.5
1
0.5
1
1
1
2
2

d) P3 only.
k) The delay can be hidden by inserting other instructions between EMMS and any subsequent floating point instruction.

Floating point XMM instructions (Pentium III)


Instruction

Operands
p0

MOVAPS
MOVAPS
MOVAPS
MOVUPS
MOVUPS
MOVSS
MOVSS
MOVSS
MOVHPS MOVLPS
MOVHPS MOVLPS
MOVLHPS MOVHLPS
MOVMSKPS
MOVNTPS
CVTPI2PS
CVTPI2PS
CVT(T)PS2PI
CVTPS2PI
CVTSI2SS
CVTSI2SS
CVT(T)SS2SI
CVTSS2SI
ADDPS SUBPS
ADDPS SUBPS
ADDSS SUBSS
ADDSS SUBSS
MULPS
MULPS

xmm,xmm
xmm,m128
m128,xmm
xmm,m128
m128,xmm
xmm,xmm
xmm,m32
m32,xmm
xmm,m64
m64,xmm
xmm,xmm
r32,xmm
m128,xmm
xmm,mm
xmm,m64
mm,xmm
mm,m128
xmm,r32
xmm,m32
r32,xmm
r32,m128
xmm,xmm
xmm,m128
xmm,xmm
xmm,m32
xmm,xmm
xmm,m128

p1

ops
p01 p2 p3
2
2
2
4
4
1
1
1
1
1
1
1

Latency
p4

2
4

1
1

1
2
2
2
2
1
2
2
1
1
2
2
1
1
2
2

1
2
1
2
1
2
2
1
2


1
2
3
2
3
1
1
1
1
1
1
1

2
3
4
3
4
4
5
3
4
3
3
3
3
4
4

Reciprocal
throughput
1
2
2
4
4
1
1
1
1
1
1
1
2 - 15
1
2
1
1
2
2
1
2
2
2
1
1
2
2



MULSS
MULSS
DIVPS
DIVPS
DIVSS
DIVSS
AND(N)PS ORPS XORPS
AND(N)PS ORPS XORPS
MAXPS MINPS
MAXPS MINPS
MAXSS MINSS
MAXSS MINSS
CMPccPS
CMPccPS
CMPccSS
CMPccSS
COMISS UCOMISS
COMISS UCOMISS
SQRTPS
SQRTPS
SQRTSS
SQRTSS
RSQRTPS
RSQRTPS
RSQRTSS
RSQRTSS
RCPPS
RCPPS
RCPSS
RCPSS
SHUFPS
SHUFPS
UNPCKHPS UNPCKLPS
UNPCKHPS UNPCKLPS
LDMXCSR
STMXCSR
FXSAVE
FXRSTOR

xmm,xmm
xmm,m32
xmm,xmm
xmm,m128
xmm,xmm
xmm,m32
xmm,xmm
xmm,m128
xmm,xmm
xmm,m128
xmm,xmm
xmm,m32
xmm,xmm
xmm,m128
xmm,xmm
xmm,m32
xmm,xmm
xmm,m32
xmm,xmm
xmm,m128
xmm,xmm
xmm,m32
xmm,xmm
xmm,m128
xmm,xmm
xmm,m32
xmm,xmm
xmm,m128
xmm,xmm
xmm,m32
xmm,xmm,i
xmm,m128,i
xmm,xmm
xmm,m128
m32
m32
m4096
m4096

1
1
2
2
1
1

1
2
1
2
2
2
2
1
1
2
2
1
1
1
1

2
2
1
2
1
1

2
2
2
2
2
2
1
1
2
2
1
1

2
1
2
1
2
1
2
2
2
2

1
2
2

11
6
116
89


4
4
48
48
18
18
2
2
3
3
3
3
3
3
3
3
1
1
56
57
30
31
2
3
1
2
2
3
1
2
2
2
3
3
15
7
62
68

1
1
34
34
17
17
2
2
2
2
1
1
2
2
1
1
1
1
56
56
28
28
2
2
1
1
2
2
1
1
2
2
2
2
15
9

Pentium M

Intel Pentium M, Core Solo and Core Duo


List of instruction timings and μop breakdown
Explanation of column headings:
Operands: i = immediate data, r = register, mm = 64 bit mmx register, xmm = 128 bit xmm register, sr = segment register, m = memory, m32 = 32-bit memory operand, etc.
ops fused domain: The number of micro-operations (μops) at the decode, rename, allocate and retirement stages in the pipeline. Fused μops count as one.
ops unfused domain: The number of μops for each execution port. Fused μops count as two.
p0: Port 0: ALU, etc.
p1: Port 1: ALU, jumps
p01: Instructions that can go to either port 0 or 1, whichever is vacant first.
p2: Port 2: load data, etc.
p3: Port 3: address generation for store
p4: Port 4: store data
Latency: This is the delay that the instruction generates in a dependency chain. (This is not the same as the time spent in the execution unit. Values may be inaccurate in situations where they cannot be measured exactly, especially with memory operands). The numbers are minimum values. Cache misses, misalignment, and exceptions may increase the clock counts considerably. Floating point operands are presumed to be normal numbers. Denormal numbers, NANs and infinity increase the delays by 50-150 clocks, except in XMM move, shuffle and Boolean instructions. Floating point overflow, underflow, denormal or NAN results give a similar delay.
Reciprocal throughput: The average number of clock cycles per instruction for a series of independent instructions of the same kind.
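
Because a fused op counts as one in the fused domain and as two in the unfused domain, an instruction uses fusion whenever its per-port counts add up to more than its fused-domain count. The Python sketch below checks this and also derives a lower bound on clock cycles from port pressure alone. The function names, the dictionary layout and the sample counts are invented for illustration. The bound ignores decoding, retirement and dependency limits, so real code may be slower.

```python
def has_op_fusion(fused_ops, unfused):
    """Return True if the per-port op counts exceed the fused-domain count.

    unfused maps port column names ("p0", "p1", "p01", "p2", "p3", "p4")
    to op counts; missing columns count as zero.
    """
    return sum(unfused.values()) > fused_ops


def port_bound(unfused):
    """Lower bound on clocks from execution-port pressure alone."""
    p0 = unfused.get("p0", 0)
    p1 = unfused.get("p1", 0)
    p01 = unfused.get("p01", 0)
    return max(
        p0,
        p1,
        (p0 + p1 + p01) / 2,  # p01 ops are shared between ports 0 and 1
        unfused.get("p2", 0),
        unfused.get("p3", 0),
        unfused.get("p4", 0),
    )


# Hypothetical store: 1 op in the fused domain, 1 op each on p3 and p4.
print(has_op_fusion(1, {"p3": 1, "p4": 1}))
# Hypothetical loop body: three p01 ops and one load.
print(port_bound({"p01": 3, "p2": 1}))
```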

Integer instructions
Instruction

Operands

ops
fused
domain

ops unfused domain


p0

p1 p01 p2

p3

Latency Reciprocal
p4
throughput

Move instructions
MOV
MOV
MOV
MOV
MOV
MOV
MOV
MOV
MOVNTI
MOVSX MOVZX
MOVSX MOVZX
CMOVcc
CMOVcc
XCHG

r,r/i
r,m
m,r
m,i
r,sr
m,sr
sr,r
sr,m
m,r32
r,r
r,m
r,r
r,m
r,r

1
1
1
2
1
2
8
8
2
1
1
2
2
3

0.5
1
1
1

1
1
8
7

1
1

1
1

1
5
8

1
1
1

2
0.5
1
1.5

1.5

1
1

1
1


1
1
3

XCHG
XLAT
PUSH
PUSH
PUSH
PUSH
PUSHF(D)
PUSHA(D)
POP
POP
POP
POP
POPF(D)
POPA(D)
LAHF SAHF
SALC
LEA
BSWAP
LDS LES LFS LGS LSS
PREFETCHNTA
PREFETCHT0/1/2
SFENCE/LFENCE/MFENCE
IN
OUT
Arithmetic instructions
ADD SUB
ADD SUB
ADD SUB
ADC SBB
ADC SBB
ADC SBB
CMP
CMP
CMP
INC DEC NEG NOT
INC DEC NEG NOT
AAA AAS DAA DAS
AAD
AAM
MUL IMUL
MUL IMUL
IMUL
IMUL
MUL IMUL
MUL IMUL
IMUL
IMUL
DIV IDIV
DIV IDIV
DIV IDIV

r,m
r
i
m
sr

r
(E)SP
m
sr

r,m
r
m
m
m

7
2
1
2
2
2
16
18
1
3
2
10
17
10
1
2
1
2
11
1
1
2

4
1

1
1

1
1
11
2

2
9
6
2
1

10

1
1
1

1
1
1
1
1
8

high b)

1
1
1
1
1
8

1
1
1
1
1
8

1
1
2

1
1
1
1

6
8

7
1

1
1
1
8

3
1
1
1

r,r/i
r,m
m,r/i
r,r/i
r,m
m,r/i
r,r/i
m,r
m,i
r
m

r8
r16/r32
r,r
r,r,i
m8
m16/m32
r,m
r,m,i
r8
r16
r32

1
1
3
2
2
7
1
1
2
1
3
1
3
4
1
3
1
1
1
3
1
2
5
4
4

1
1
6

18
18

16
7
1
1
1

>300
>300

1
1

1
1
1
1
1
4
1
1
1
1
1

1
1

1
2
1

1
2

1
1

1
1
1

1
1
1

0.5
1
1
2

0.5
1
1
0.5

2
15
4
5
4
4
4
5
4
4
15-16 c)
15-24 c)
15-39 c)

1
1
1
1
1
1
1
1
12
12-20 c)
12-20 c)

1
1
1
1
3
1
1
1
3
1
1
4
3
3


2
2

1
1
1
1
1
1
1

DIV IDIV
DIV IDIV
DIV IDIV
CBW CWDE
CWD CDQ
Logic instructions
AND OR XOR
AND OR XOR
AND OR XOR
TEST
TEST
TEST
SHR SHL SAR ROR ROL
SHR SHL SAR ROR ROL
RCR RCL
RCR
RCL
RCR RCL
RCR RCL
RCR
RCL
RCR RCL
SHLD SHRD
SHLD SHRD
BT
BT
BT
BTR BTS BTC
BTR BTS BTC
BTR BTS BTC
BSF BSR
BSF BSR
SETcc
SETcc
CLC STC CMC
CLD STD

m8
m16
m32

r,r/i
r,m
m,r/i
r,r/i
m,r
m,i
r,i/CL
m,i/CL
r,1
r8,i/CL
r8,i/CL
r16/32,i/CL
m,1
m8,i/CL
m8,i/CL
m16/32,i/CL
r,r,i/CL
m,r,i/CL
r,r/i
m,r
m,i
r,r/i
m,r
m,i
r,r
r,m
r
m

Control transfer instructions


JMP
short/near
JMP
far
JMP
r
JMP
m(near)
JMP
m(far)
conditional jump
short/near
J(E)CXZ
short
LOOP
short
LOOP(N)E
short
CALL
near
CALL
far
CALL
r

6
5
5
1
1

1
1
3
1
1
2
1
3
2
9
8
6
7
12
11
10
2
4
1
8
2
1
10
3
2
2
1
2
1
4

1
22
1
2
25
1
2
11
11
4
32
4

4
3
3

1
1
1

1
1
1

1
1

1
1
1
1
1
1
1
1
1
5
4
3
2
6
5
5
2
1

1
1

1
2

0.5
1
1
0.5
1
1
1

1
1

1
1

1
1
1
4
4
3
2
3
3
2

1
1
1
1

1
1

1
1

1
1
1
1
1
1
1
1

1
1
1
1
1

1
1

1
1

1
1

1
1
1
1
1

1
8
8
1

27

1
1

1
7

1
2

23

1
1
1

1
21

2
11
10
9
1
1
1
1


12
12-20 c)
12-20 c)
1
1

2
2

15-16 c)
15-24 c)
15-39 c)
1
1

1
2
1

1
2
1

1
28
1
2
31
1
1
6
6
2
27
9

CALL
CALL
RETN
RETN
RETF
RETF
BOUND
INTO

m(near)
m(far)
i
i
r,m

String instructions
LODS
REP LODS
STOS
REP STOS
MOVS
REP MOVS
SCAS
REP(N)E SCAS
CMPS
REP(N)E CMPS
Other
NOP (90)
Long NOP (0F 1F)
PAUSE
CLI
STI
ENTER
ENTER
LEAVE
CPUID
RDTSC
Notes:
a)
b)
c)

4
35
2
3
27
27
15
5

1
29
1
1
24
24
7

2
6n
3
5n
6
6n
3
7n
6
9n

2
1

6
5

1
2
1
1
3
3
2

1
2

1
2

2
30
2
2
30
30
8
4

4
0.5
1
0.7
0.7
0.5
1.3
0.6
0.7
0.5

10+6n
ca. 5n
1
ca. 6n
1
12+7n
4
12+9n

1
1
2

1
a)
3
a)
2

1
1
2

0.5
1

9
17
i,0
a,b

12
ca.
3
38-59
13

10
18 +4b
2

1
1
b-1 2b
1

38-59
13

ca. 130
42

a) Faster under certain conditions: see manual 3: "The microarchitecture of Intel, AMD and VIA CPUs".
b) Has an implicit LOCK prefix.
c) High values are typical, low values are for round divisors. Core Solo/Duo is more efficient than Pentium M in cases with round values that allow an early-out algorithm.

Floating point x87 instructions


Instruction

Move instructions
FLD
FLD
FLD
FBLD
FST(P)

Operands

r
m32/64
m80
m80
r

ops
fused
domain

1
1
4
40
1

ops unfused domain


p0

p1 p01 p2

1
2
38
1


1
2
2

p3

Latency Reciprocal
p4
throughput
1
1

FST(P)
FSTP
FBSTP
FXCH
FILD
FIST(P)
FISTTP g)
FLDZ
FLD1 FLDPI FLDL2E etc.
FCMOVcc
FNSTSW
FNSTSW
FLDCW
FNSTCW
FINCSTP FDECSTP
FFREE
FFREEP
FNSAVE
FRSTOR
Arithmetic instructions
FADD(P) FSUB(R)(P)
FADD(P) FSUB(R)(P)
FMUL(P)
FMUL(P)
FDIV(R)(P)
FDIV(R)(P)
FABS
FCHS
FCOM(P) FUCOM
FCOM(P) FUCOM
FCOMPP FUCOMPP
FCOMI(P) FUCOMI(P)
FIADD FISUB(R)
FIMUL
FIDIV(R)
FICOM(P)
FTST
FXAM
FPREM FPREM1
FRNDINT
Math
FSCALE
FXTRACT
FSQRT
FSIN FCOS
FSINCOS
F2XM1
FYL2X
FYL2XP1

m32/m64
m80
m80
r
m
m
m

r
AX
m16
m16
m16
r
r

r
m
r
m
r
m

r
m
r
m
m
m
m

1
6
169
1
4
4
4
1
2
2
3
2
3
3
1
1
2
142
72

1
1
1
1
1
1
1
1
1
1
2
1
6
6
6
6
1
1
26
15

28
15
1
80-100
90-110
~ 20
~ 40
~ 55

1
2
2

2
165
3
2
2
1
2
2
3
1
1
1
1
1
2
142
72

1
1
1

1
1

2
7

19
3
1
2
131
91

1
1
1
1
1
1

1
1
1

1
1
1
1
1

1
1
1

2
1
1

1
1
1
1

3
3
5
5
9-38 c)
9-38 c)
1
1
1
1
1
1
3
5
9-38 c)

26
15

15
1
80-100
90-110
~20
~40
~55

1
1
2
2
8-37 c)
8-37 c)
1
1
1
1
1
1
3
3
8-37 c)
4
1
1

37
19

28


0
5
5
5

3
167
0.33 f)
2
2
2

1
1

3
5
5
3

1
2
2

43
9
9 h)
80-110
100-130
~45
~60
~65

FPTAN
FPATAN
Other
FNOP
WAIT
FNCLEX
FNINIT
Notes:
c)
f)
g)

~ 100
~ 85

~100
~85

1
2
3
14

~140
~140

1
1
13
27

3
14

c) High values are typical, low values are for low precision or round divisors.
f) FXCH generates 1 μop that is resolved by register renaming without going to any port.
g) SSE3 instruction only available on Core Solo and Core Duo.

Integer MMX and XMM instructions


Instruction

Operands

ops
fused
domain

ops unfused domain


p0

p1 p01 p2

p3

Latency Reciprocal
p4
throughput

Move instructions
MOVD
MOVD
MOVD
MOVD
MOVD
MOVD
MOVD
MOVD
MOVQ
MOVQ
MOVQ
MOVQ
MOVQ
MOVQ
MOVDQA
MOVDQA
MOVDQA
MOVDQU
MOVDQU
LDDQU g)
MOVDQ2Q
MOVQ2DQ
MOVNTQ
MOVNTDQ
PACKSSWB/DW
PACKUSWB
PACKSSWB/DW
PACKUSWB
PACKSSWB/DW
PACKUSWB

r32,mm
mm,r32
mm,m32
m32,mm
r32,xmm
xmm,r32
xmm,m32
m32, xmm
mm,mm
mm,m64
m64,mm
xmm,xmm
xmm,m64
m64, xmm
xmm, xmm
xmm, m128
m128, xmm
xmm, m128
m128, xmm
xmm, m128
mm, xmm
xmm,mm
m64,mm
m128,xmm

1
1
1
1
1
2
2
1
1
1
1
2
2
1
2
2
2
4
8
4
1
2
1
4

1
1

mm,mm

mm,m64

xmm,xmm

1
1
1
1

1
2
1

1
1

1
1
2
1

1
1
1

1
2
2

2
5-6
1
1

2
2-3 2-3
1
1

1
1
2


1
1

1
2

0.5
0.5
1
1
1
1
1
1
0.5
1
1
1
1
1
1
2
2
2-10
4-20
2
1
1
2
3

PACKSSWB/DW
PACKUSWB
PUNPCKH/LBW/WD/DQ
PUNPCKH/LBW/WD/DQ
PUNPCKH/LBW/WD/DQ
PUNPCKH/LBW/WD/DQ
PUNPCKHQDQ
PUNPCKHQDQ
PUNPCKLQDQ
PUNPCKLQDQ
PSHUFW
PSHUFW
PSHUFD
PSHUFD
PSHUFL/HW
PSHUFL/HW
MASKMOVQ
MASKMOVDQU
PMOVMSKB
PMOVMSKB
PEXTRW
PEXTRW
PINSRW
PINSRW

xmm,m128
mm,mm
mm,m64
xmm,xmm
xmm,m128
xmm,xmm
xmm, m128
xmm,xmm
xmm, m128
mm,mm,i
mm,m64,i
xmm,xmm,i
xmm,m128,i
xmm,xmm,i
xmm, m128,i
mm,mm
xmm,xmm
r32,mm
r32,xmm
r32,mm,i
r32,xmm,i
mm,r32,i
xmm,r32,i

4
1
1
2
3
2
3
1
1
1
2
3
4
2
3
3
8
1
1
2
4
1
2

Arithmetic instructions
PADD/SUB(U)(S)B/W/D
PADD/SUB(U)(S)B/W/D
PADD/SUB(U)(S)B/W/D
PADD/SUB(U)(S)B/W/D
PADDQ PSUBQ
PADDQ PSUBQ
PADDQ PSUBQ
PADDQ PSUBQ
PCMPEQ/GTB/W/D
PCMPEQ/GTB/W/D
PCMPEQ/GTB/W/D
PCMPEQ/GTB/W/D
PMULL/HW PMULHUW
PMULL/HW PMULHUW
PMULL/HW PMULHUW
PMULL/HW PMULHUW
PMULUDQ
PMULUDQ
PMULUDQ
PMULUDQ
PMADDWD
PMADDWD
PMADDWD
PMADDWD
PAVGB/W

mm,mm
mm,m64
xmm,xmm
xmm,m128
mm,mm
mm,m64
xmm,xmm
xmm,m128
mm,mm
mm,m64
xmm,xmm
xmm,m128
mm,mm
mm,m64
xmm,xmm
xmm,m128
mm,mm
mm,m64
xmm,xmm
xmm,m128
mm,mm
mm,m64
xmm,xmm
xmm,m128
mm,mm

1
1
2
4
2
2
4
6
1
1
2
2
1
1
2
4
1
1
2
4
1
1
2
4
1

1
1
1
2
1

2
1

1
2
2
1
1
1

1
2
1
1

1
1
2
1
1

1
1
1
1
1
1

2
2
2
1

1
2

1
1
1
1
2
1
2

j)
1
2

1
1
2
2
2
2
4
4
1
1
2
2
1
1
2
2
1
1
2
2


1
1
2
3
1
1

1
1
1
2
1
2

0.5
1
1
2
1
1
2
2
0.5
1
1
2
1
1
2
2
1
1
2
2
1
1
2
2
0.5

1
2
2
1
2
2
1
1
1
2

2
1
2

1
1
2
2
1

1
2

1
2

2
1
1
2
2
1
1
1
1
1
1
2
2
1
1

3
3
3
3
4
4
4
4
3
3
3
3
1

PAVGB/W
PAVGB/W
PAVGB/W
PMIN/MAXUB/SW
PMIN/MAXUB/SW
PMIN/MAXUB/SW
PMIN/MAXUB/SW
PSADBW
PSADBW
PSADBW
PSADBW

mm,m64
xmm,xmm
xmm,m128
mm,mm
mm,m64
xmm,xmm
xmm,m128
mm,mm
mm,m64
xmm,xmm
xmm,m128

1
2
4
1
1
2
4
2
2
4
6

1
2
2
1
1
2
2
2
2
4
4

Logic instructions
PAND(N) POR PXOR
PAND(N) POR PXOR
PAND(N) POR PXOR
PAND(N) POR PXOR
PSLL/RL/RAW/D/Q
PSLL/RL/RAW/D/Q
PSLL/RL/RAW/D/Q
PSLL/RL/RAW/D/Q
PSLL/RL/RAW/D/Q
PSLL/RLDQ

mm,mm
mm,m64
xmm,xmm
xmm,m128
mm,mm/i
mm,m64
xmm,i
xmm,xmm
xmm,m128
xmm,i

1
1
2
4
1
1
2
3
3
4

1
1
2
2

Other
EMMS
Notes:
g)
j)
k)

1
1
2
2

1
1
2
1
1
1
2
4
4
4
4

1
2

0.5
1
1
2
1
1
2
2
2
3

6 k)

1
1
2
1
1
2
2

1
1
1

11

11

1
1
2
0.5
1
1
2
1
1
2
2

g) SSE3 instruction only available on Core Solo and Core Duo.
j) Also uses some execution units under port 1.
k) You may hide the delay by inserting other instructions between EMMS and any subsequent floating point instruction.

Floating point XMM instructions


Instruction

Operands

ops
fused
domain

ops unfused domain


p0

p1 p01 p2

p3

Latency Reciprocal
p4
throughput

Move instructions
MOVAPS/D
MOVAPS/D
MOVAPS/D
MOVUPS/D
MOVUPS/D
MOVSS/D
MOVSS/D
MOVSS/D
MOVHPS/D MOVLPS/D
MOVHPS/D MOVLPS/D
MOVLHPS MOVHLPS
MOVMSKPS/D

xmm,xmm
xmm,m128
m128,xmm
xmm,m128
m128,xmm
xmm,xmm
xmm,m32/64
m32/64,xmm
xmm,m64
m64,xmm
xmm,xmm
r32,xmm

2
2
2
4
8
1
2
1
1
1
1
1

2
2
2

4
4


1
1

1
j)

1
2
3
2
3
1
1
1
1
1
1
2

1
2
2
2
4
1
1
1
1
1
1
1

MOVNTPS/D
SHUFPS/D
SHUFPS/D
MOVDDUP g)
MOVSH/LDUP g)
MOVSH/LDUP g)
UNPCKH/LPS
UNPCKH/LPS
UNPCKH/LPD
UNPCKH/LPD

m128,xmm
xmm,xmm,i
xmm,m128,i
xmm,xmm
xmm,xmm
xmm,m128
xmm,xmm
xmm,m128
xmm,xmm
xmm,m128

2
3
4
2
2
4
4
4
2
3

Conversion
CVTPS2PD
CVTPS2PD
CVTPD2PS
CVTPD2PS
CVTSD2SS
CVTSD2SS
CVTSS2SD
CVTSS2SD
CVTDQ2PS
CVTDQ2PS
CVT(T) PS2DQ
CVT(T) PS2DQ
CVTDQ2PD
CVTDQ2PD
CVT(T)PD2DQ
CVT(T)PD2DQ
CVTPI2PS
CVTPI2PS
CVT(T)PS2PI
CVT(T)PS2PI
CVTPI2PD
CVTPI2PD
CVT(T) PD2PI
CVT(T) PD2PI
CVTSI2SS
CVT(T)SS2SI
CVT(T)SS2SI
CVTSI2SD
CVTSI2SD
CVT(T)SD2SI
CVT(T)SD2SI

xmm,xmm
xmm,m64
xmm,xmm
xmm,m128
xmm,xmm
xmm,m64
xmm,xmm
xmm,m64
xmm,xmm
xmm,m128
xmm,xmm
xmm,m128
xmm,xmm
xmm,m64
xmm,xmm
xmm,m128
xmm,mm
xmm,m64
mm,xmm
mm,m128
xmm,mm
xmm,m64
mm,xmm
mm,m128
xmm,r32
r32,xmm
r32,m32
xmm,r32
xmm,m32
r32,xmm
r32,m64

4
4
4
6
2
3
2
3
2
4
2
4
4
5
4
6
1
2
1
2
4
5
3
5
2
2
3
2
3
2
3

xmm,xmm
xmm,m32/64
xmm,xmm
xmm,m128
xmm,xmm
xmm,xmm

1
2
2
4
2
6?

Arithmetic
ADDSS/D SUBSS/D
ADDSS/D SUBSS/D
ADDPS/D SUBPS/D
ADDPS/D SUBPS/D
ADDSUBPS/D g)
HADDPS HSUBPS g)

2
2
1

1
1

2
2

2
1
2

2
1
3
3

2
2
1
1

3-4
2
1
1

2
2
1
1

1
4
2

2
2

1
1
1
1
2
2


1
1
1
1
1
1
1

3
2
3
2
4
1
4
2
3
1
3
1
5
1

3
3

1
1

4
1
2

2
2
2
2
4
4
4
4

2
2

2
2

1
1

4
2
4
4
1
4
1

1
1

1
1
2
2
2
?

4
1

1
2

3
3
3
3
3
7

3
2
2
1
2
5
5
1
1

3
3
3
3
2
2
2
2
2
2
2
2
2
2
3
3
1
1
1
1
2
2
2
2
1
1
1
1
1
1
1

1
1
2
2
2
4

HADDPD HSUBPD g)
MULSS
MULSD
MULSS
MULSD
MULPS
MULPD
MULPS
MULPD
DIVSS
DIVSD
DIVSS
DIVSD
DIVPS
DIVPD
DIVPS
DIVPD
CMPccSS/D
CMPccSS/D
CMPccPS/D
CMPccPS/D
COMISS/D UCOMISS/D
COMISS/D UCOMISS/D
MAXSS/D MINSS/D
MAXSS/D MINSS/D
MAXPS/D MINPS/D
MAXPS/D MINPS/D
RCPSS
RCPSS
RCPPS
RCPPS

xmm,xmm
xmm,xmm
xmm,xmm
xmm,m32
xmm,m64
xmm,xmm
xmm,xmm
xmm,m128
xmm,m128
xmm,xmm
xmm,xmm
xmm,m32
xmm,m64
xmm,xmm
xmm,xmm
xmm,m128
xmm,m128
xmm,xmm
xmm,m32/64
xmm,xmm
xmm,m128
xmm,xmm
xmm,m32/64
xmm,xmm
xmm,m32/64
xmm,xmm
xmm,m128
xmm,xmm
xmm,m32
xmm,xmm
xmm,m128

3
1
1
2
2
2
2
4
4
1
1
2
2
2
2
4
4
1
2
2
4
1
2
1
2
2
4
1
2
2
4

Math
SQRTSS
SQRTSS
SQRTSD
SQRTSD
SQRTPS
SQRTPD
SQRTPS
SQRTPD
RSQRTSS
RSQRTSS
RSQRTPS
RSQRTPS

xmm,xmm
xmm,m32
xmm,xmm
xmm,m64
xmm,xmm
xmm,xmm
xmm,m128
xmm,m128
xmm,xmm
xmm,m32
xmm,xmm
xmm,m128

2
3
1
2
2
2
4
4
1
2
2
4

Logic
AND/ANDN/OR/XORPS/D
AND/ANDN/OR/XORPS/D

xmm,xmm
xmm,m128

2
4

3
1
1
1
1
2
2
2
2
1
1
1
1
2
2
2
2

1
1

2
2

1
1

2
2
1
1
2
2
1
1

1
1
2
2

3
2

1
2

3
3
3
3
3

1
3
2

2
2
1
1
2
2
2
2


1
1
1
2
2

Other

4
4
5
4
5
4
5
4
5
9-18 c)
9-32 c)
9-18 c)
9-32 c)
16-34 c)
16-62 c)
16-34 c)
16-62 c)
3

6-30
1
5-58
1
8-56
16-114
2
2
1
1
3
2

3
1
3
2

2
2

1
2

2
1
2
1
2
2
4
2
4
8-17 c)
8-31 c)
8-17 c)
8-31 c)
16-34 c)
16-62 c)
16-34 c)
16-62 c)
1
1
2
2
1
1
1
1
2
2
1
1
2
2

4-28
4-28
4-57
4-57
16-55
16-114
16-55
16-114
1
1
2
2

1
1

LDMXCSR
STMXCSR
FXSAVE
FXRSTOR
Notes:
c)
g)
j)

m32
m32
m4096
m4096

9
6
118
87

9
6
32
43

43

43

44

c) High values are typical; low values are for round divisors.
g) SSE3 instruction, only available on Core Solo and Core Duo.
j) Also uses some execution units under port 1.


20
12
63
72

Merom

Intel Core 2 (Merom, 65nm)


List of instruction timings and μop breakdown

Explanation of column headings:

Operands: i = immediate data, r = register, mm = 64 bit mmx register, xmm = 128 bit xmm register, (x)mm = mmx or xmm register, sr = segment register, m = memory, m32 = 32-bit memory operand, etc.

μops fused domain: The number of μops at the decode, rename, allocate and retirement stages in the pipeline. Fused μops count as one.

μops unfused domain: The number of μops for each execution port. Fused μops count as two. Fused macro-ops count as one. The instruction has μop fusion if the sum of the numbers listed under p015 + p2 + p3 + p4 exceeds the number listed under μops fused domain. An x under p0, p1 or p5 means that at least one of the μops listed under p015 can optionally go to this port. For example, a 1 under p015 and an x under p0 and p5 means one μop which can go to either port 0 or port 5, whichever is vacant first. A value listed under p015 but nothing under p0, p1 and p5 means that it is not known which of the three ports these μops go to.

p015: The total number of μops going to ports 0, 1 and 5.
p0: The number of μops going to port 0 (execution units).
p1: The number of μops going to port 1 (execution units).
p5: The number of μops going to port 5 (execution units).
p2: The number of μops going to port 2 (memory read).
p3: The number of μops going to port 3 (memory write address).
p4: The number of μops going to port 4 (memory write data).

Unit: Tells which execution unit cluster is used. An additional delay of 1 clock cycle is generated if a register written by a μop in the integer unit (int) is read by a μop in the floating point unit (float) or vice versa. fltint means that an instruction with multiple μops receives the input in the float unit and delivers the output in the int unit. Delays for moving data between different units are included under latency when they are unavoidable. For example, movd eax,xmm0 has an extra 1 clock delay for moving from the XMM-integer unit to the general purpose integer unit. This is included under latency because it occurs regardless of which instruction comes next. Nothing listed under unit means that additional delays are either unlikely to occur or unavoidable and therefore included in the latency figure.

Latency: This is the delay that the instruction generates in a dependency chain. The numbers are minimum values. Cache misses, misalignment, and exceptions may increase the clock counts considerably. Floating point operands are presumed to be normal numbers. Denormal numbers, NaNs and infinity greatly increase the delays, except in XMM move, shuffle and Boolean instructions. Floating point overflow, underflow, denormal or NaN results give a similar delay. The time unit used is core clock cycles, not the reference clock cycles given by the time stamp counter.

Reciprocal throughput: The average number of core clock cycles per instruction for a series of independent instructions of the same kind in the same thread.
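The difference between latency and reciprocal throughput, and the 1 clock penalty for crossing between the int and float units, can be illustrated with a small calculation. The following is only a rough sketch: the function names and the example numbers are hypothetical and are not taken from the tables, and instructions marked fltint are not modelled.

```python
def chain_cycles(steps):
    """Estimate the minimum clock cycles for a dependency chain.

    steps: list of (latency, unit) tuples in program order, where unit is
    "int", "float" or None (nothing listed under Unit).
    One extra cycle is added whenever a result written in one unit is read
    in the other unit. A step with unit None never adds a penalty.
    """
    total = 0
    prev_unit = None
    for latency, unit in steps:
        if prev_unit is not None and unit is not None and prev_unit != unit:
            total += 1  # bypass delay between the int and float clusters
        total += latency
        prev_unit = unit
    return total


def independent_cycles(count, reciprocal_throughput):
    """Estimate the clock cycles for `count` independent instructions
    of the same kind, using the reciprocal throughput column."""
    return count * reciprocal_throughput


# Hypothetical chain: int op (latency 1) -> float op (latency 3) -> int op (latency 1)
example_chain = [(1, "int"), (3, "float"), (1, "int")]
print(chain_cycles(example_chain))   # 1 + (1 + 3) + (1 + 1) = 7
print(independent_cycles(4, 0.5))    # 2.0
```

A dependency chain is limited by the sum of the latencies plus any unit-crossing delays, while a series of independent instructions is limited by the reciprocal throughput. Both results are lower bounds, since the table values are minimum values.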

Integer instructions
Instruction

Operands

μops fused domain | μops unfused domain: p015 p0 p1 p5 p2 p3 p4 | Unit

Move instructions
Latency | Reciprocal throughput
MOV
MOV a)
MOV a)
MOV
MOV
MOV
MOV
MOV
MOVNTI
MOVSX MOVZX
MOVSXD
MOVSX MOVZX
CMOVcc
CMOVcc
XCHG
XCHG
XLAT
PUSH
PUSH
PUSH
PUSH
PUSHF(D/Q)
PUSHA(D) i)
POP
POP
POP
POP
POPF(D/Q)
POPA(D) i)
LAHF SAHF
SALC i)
LEA a)
BSWAP
LDS LES LFS LGS LSS
PREFETCHNTA
PREFETCHT0/1/2
LFENCE
MFENCE
SFENCE
CLFLUSH
IN
OUT
Arithmetic instructions
ADD SUB
ADD SUB
ADD SUB
ADC SBB
ADC SBB
ADC SBB
CMP
CMP
INC DEC NEG NOT

r,r/i
r,m
m,r
m,i
r,sr
m,sr
sr,r
sr,m
m,r

1
1
1
1
1
2
8
8
2

r,r
r,m
r,r
r,m
r,r
r,m

2
2
3
x
1

x
x
x

x
x
x

x
x
x

m8

1
1
2
2
3
7
2
1
1
2
2
17
18
1
4
2
10
24
10
1
2
1
2
11
1
1
2
2
2
4

r,r/i
r,m
m,r/i
r,r/i
r,m
m,r/i
r,r/i
m,r/i
r

1
1
2
2
2
4
1
1
1

r
i
m
sr

r
(E/R)SP
m
sr

r,m
r
m
m
m

x
1

4
3

x
x

x
x

1
1
4
5

1
1

1
1

1
1
1
1

1
1
15
9

3
9
23
2
1
2
1
2
11

x
x
1
1

x
x

x
x

1
1
1
1
1
8

1
1
1
1
1
1

1
1
1
1
1
8

1
1
1
1

1
1
1
2
2
3
1
1
1

x
x
x
x
x
x
x
x
x

x
x
x
x
x
x
x
x
x

x
x
x
x
x
x
x
x
x


1
1
1
1

1
1
1
1

1
1

1
1

int
int
int
int
int
int
int
int
int

1
2
3
3

0,33
1
1
1
1
1
16
16
2

int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int

0,33
1
1

int
int
int
int
int
int
int
int
int

2
2
high b)
4
3

2
1
1
1
1
1
7
8
1
1,5
17

20
1
4
1
4

240

1
6
2
2
7
1
1
1

7
0,33
1
1
1
17
1
1
8
9
9
117

0,33
1
1
2
2
0,33
1
0,33

INC DEC NEG NOT
AAA AAS DAA DAS i)
AAD i)
AAM i)
MUL IMUL
MUL IMUL
MUL IMUL
MUL IMUL
IMUL
IMUL
IMUL
IMUL
IMUL
IMUL
MUL IMUL
MUL IMUL
MUL IMUL
MUL IMUL
IMUL
IMUL
IMUL
IMUL
IMUL
IMUL
DIV IDIV
DIV IDIV
DIV IDIV
DIV
IDIV
DIV IDIV
DIV IDIV
DIV IDIV
DIV
IDIV
CBW CWDE CDQE
CWD CDQ CQO
Logic instructions
AND OR XOR
AND OR XOR
AND OR XOR
TEST
TEST
SHR SHL SAR
SHR SHL SAR
ROR ROL
ROR ROL
RCR RCL
RCR
RCL
RCR RCL
RCR RCL
RCR

r8
r16
r32
r64
r16,r16
r32,r32
r64,r64
r16,r16,i
r32,r32,i
r64,r64,i
m8
m16
m32
m64
r16,m16
r32,m32
r64,m64
r16,m16,i
r32,m32,i
r64,m64,i
r8
r16
r32
r64
r64
m8
m16
m32
m64
m64

r,r/i
r,m
m,r/i
r,r/i
m,r/i
r,i/cl
m,i/cl
r,i/cl
m,i/cl
r,1
r8,i/cl
r8,i/cl
r16/32/64,i/cl
m,1
m8,i/cl

3
1
3
4
1
3
3
3
1
1
1
1
1
1
1
3
3
3
1
1
1
1
1
1
3
5
4
32
56
4
6
5
32
56
1
1

1
1
3
4
1
3
3
3
1
1
1
1
1
1
1
3
3
2
1
1
1
1
1
1
3
5
4
32
56
3
5
4
31
55
1
1

1
1
2
1
1
1
3
1
3
2
9
8
6
4
12

1
1
1
1
1
1
2
1
2
2
9
8
6
3
9

x
x

x
x
x

x
1
x
1
x
x
x
1
1

x
x
x

1
1
1
1
x
x
2

1
x
x

x
x

1
1
1
1
1
1

1
1
1
1
1
1
1
1
1
1

1
1
1
1
1
x
x

x
x

x
x
x
x
x
x
x
x
x
x
x
x
x
x
x

x
x
x
x
x

x
x
x
x
x
x
x
x
x
x
x
x
x
x
x


x
x
x
x
x
x

1
1

1
1

1
1

1
1

int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int

int
int
int
int
int
int
int
int
int
int
int
int
int
int
int

17
3
5
5
7
3
3
5
3
3
5
3
5
5
7
3
3
5

18
18-26
18-42
29-61
39-72
18
18-26
18-42
29-61
39-72
1
1

1
6
1
1
6
1
6
2
12
11
11
7
14

1
1
1
1
1,5
1,5
4
1
1
2
1
1
2
1
1,5
1,5
4
1
1
2
2
1
2
12
12-20 c)
12-36 c)
18-37 c)
28-40 c)
12
12-20 c)
12-36 c)
18-37 c)
28-40 c)

0,33
1
1
0,33
1
0,5
1
1
1
2

RCL
RCR RCL
SHLD SHRD
SHLD SHRD
BT
BT
BT
BTR BTS BTC
BTR BTS BTC
BTR BTS BTC
BSF BSR
BSF BSR
SETcc
SETcc
CLC STC CMC
CLD
STD

m8,i/cl
m16/32/64,i/cl
r,r,i/cl
m,r,i/cl
r,r/i
m,r
m,i
r,r/i
m,r
m,i
r,r
r,m
r
m

Control transfer instructions


JMP
short/near
JMP i)
far
JMP
r
JMP
m(near)
JMP
m(far)
Conditional jump
short/near
Fused compare/test and branch e,i)
J(E/R)CXZ
short
LOOP
short
LOOP(N)E
short
CALL
near
CALL i)
far
CALL
r
CALL
m(near)
CALL
m(far)
RETN
RETN
i
RETF
RETF
i
BOUND i)
r,m
INTO i)
String instructions
LODS
REP LODS
STOS
REP STOS
MOVS
REP MOVS
SCAS
REP(N)E SCAS
CMPS
REP(N)E CMPS

11
10
2
3
1
10
2
1
11
3
2
2
1
2
1
7
6

8
7
2
2
1
9
1
1
8
1
2
2
1
1
1
7
6

1
30
1
1
31
1
1
2
11
11
3
43
3
4
44
1
3
32
32
15
5

1
30
1
1
29
1
1
2
11
11
2
43
2
3
42
1
x
30
30
13
5

x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x

x
x
x
x
x
x
x
x
x
x
1
1
x
x
x
x
x

x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x

1
1

1
1

1
1

1
1
1
1

1
1

1
1

1
1

1
1
1

x
x
x
x

3
2
4+7n - 14+6n
4
2
8+5n - 20+1.2n
8
5
1
1
1
7+7n - 13+n
4
3
7+8n - 17+7n
7
5
7+10n - 7+9n

x
x
x
x

1
2

1
1
1
x
x
x

1
1

1
2
1
1
2
2
2

1
1

1
1

1
1

5
1
2

int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int

13
13
2
7
1

int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int

int
int
int
int
int
int
int
int
int
int
int

1
5
6
2
1
1

0
0
0
0

1
1
5
1

1
2
1
1
0,33
4
14

1-2
76
1-2
1-2
68
1
1
1-2
5
5
2
75
2
2
75
2
2
78
78
8
3

1
1+5n - 21+3n
1
7+2n - 0.55n

1+3n - 0.63n
1
3+8n - 23+6n
3
2+7n - 22+5n

Other
NOP (90)
Long NOP (0F 1F)
PAUSE
ENTER
ENTER
LEAVE
CPUID
RDTSC
RDPMC
Notes:
a)
b)
c)
e)
i)

i,0
a,b

1
1
3
12

1
1
3
10

3
46-100
29
23

x
x
x

x
x
x

x
x
x
1

int
int
int
int
int
int
int
int
int

0,33
1
8
8

180-215
64
54

a) Applies to all addressing modes.
b) Has an implicit LOCK prefix.
c) Low values are for small results, high values for high results.
e) See manual 3: "The microarchitecture of Intel, AMD and VIA CPUs" for restrictions on macro-op fusion.
i) Not available in 64 bit mode.

Floating point x87 instructions


Instruction

Move instructions
FLD
FLD
FLD
FBLD
FST(P)
FST(P)
FSTP
FBSTP
FXCH
FILD
FIST
FISTP
FISTTP g)
FLDZ
FLD1
FLDPI FLDL2E etc.
FCMOVcc
FNSTSW
FNSTSW
FLDCW
FNSTCW
FINCSTP FDECSTP
FFREE(P)
FNSAVE
FRSTOR

Operands

r
m32/64
m80
m80
r
m32/m64
m80
m80
r
m
m
m
m

r
AX
m16
m16
m16
r
m
m

μops fused domain | μops unfused domain: p015 p0 p1 p5 p2 p3 p4 | Unit
1
1
4
40
1
1
7
170
1
1
2
3
3
1
2
2
2
1
2
2
3
1
2
142
78

1
1
2
38
1

3
166
0 f)
1
1
1
1
1
2
2
2
1
1
1
1
1
2

x
x

1
2
2

2
1
x
x

x
x

1
2
2

1
1
1

1
1
1

1
1
1
1

1
1

1
2
2

1
2

2
1
1

1
1
2


float
float
float
float
float
float
float
float
float
float
float
float
float
float
float
float
float
float
float
float
float
float
float
float
float

Latency | Reciprocal throughput
1
3
4
45
1
3
4
164
0
6
6
6
6

1
184
169

1
1
3
20
1
1
5
166
1
1
1
1
1
1
2
2
2
1
2
10
8
1
2
192
177

Arithmetic instructions
FADD(P) FSUB(R)(P)
FADD(P) FSUB(R)(P)
FMUL(P)
FMUL(P)
FDIV(R)(P)
FDIV(R)(P)
FABS
FCHS
FCOM(P) FUCOM
FCOM(P) FUCOM
FCOMPP FUCOMPP
FCOMI(P) FUCOMI(P)
FIADD FISUB(R)
FIMUL
FIDIV(R)
FICOM(P)
FTST
FXAM
FPREM FPREM1
FRNDINT

r
m
r
m
r
m

r
m
r
m
m
m
m

Math
FSCALE
FXTRACT
FSQRT
FSIN FCOS
FSINCOS
F2XM1
FYL2X FYL2XP1
FPTAN
FPATAN
Other
FNOP
WAIT
FNCLEX
FNINIT
Notes:
d)
f)
g)

1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
2
2
1
1
2
2
2
2
2
2
2
2
1
1
1
1
21-27 21-27
7-15 7-15

27
82
1
~96
~100
~19
~53
~98
~70

27
82
1
~96
~100
~19
~53
~98
~70

1
2
4
15

1
2
4
15

1
1
1
1
1
1
1
1

1
1
2
2
1

float
float
float
float
float
float
float
float
float
float
float
float
float
float
float
float
float
float
float
float

1
1
1

1
1
1
1
1

1
1
1
1

1
1
1

float
float
float
float
float
float
float
float
float

1
1
5
2
2
6-38 d) 5-37 d)
5-37 d)
1
1
1
1
1
1
1
2
2
5-37 d)
2
1
1
16-56
22-29

41
170
6-69
~96
~115
~45
~96
~136
~119

float
float
float
float

1
1
15
63

d) Round divisors or low precision give low values.
f) Resolved by register renaming. Generates no μops in the unfused domain.
g) SSE3 instruction set.

Integer MMX and XMM instructions


Instruction

Operands

Move instructions
MOVD k)
MOVD k)
MOVD k)

r32/64,(x)mm
m32/64,(x)mm
(x)mm,r32/64

μops fused domain | μops unfused domain: p015 p0 p1 p5 p2 p3 p4 | Unit
1
1
1

int
1


1
int

Latency | Reciprocal throughput
2
3
2

0,33
1
0,5

MOVD k)
(x)mm,m32/64
MOVQ
(x)mm, (x)mm
MOVQ
(x)mm,m64
MOVQ
m64, (x)mm
MOVDQA
xmm, xmm
MOVDQA
xmm, m128
MOVDQA
m128, xmm
MOVDQU
m128, xmm
MOVDQU
xmm, m128
LDDQU g)
xmm, m128
MOVDQ2Q
mm, xmm
MOVQ2DQ
xmm,mm
MOVNTQ
m64,mm
MOVNTDQ
m128,xmm
mm,mm
PACKSSWB/DW
PACKUSWB
mm,m64
xmm,xmm
PACKSSWB/DW
PACKUSWB
xmm,m128
PUNPCKH/LBW/WD/DQ
mm,mm
PUNPCKH/LBW/WD/DQ
mm,m64
PUNPCKH/LBW/WD/DQ
xmm,xmm
PUNPCKH/LBW/WD/DQ xmm,m128
PUNPCKH/LQDQ
xmm,xmm
PUNPCKH/LQDQ
xmm, m128
PSHUFB h)
mm,mm
PSHUFB h)
mm,m64
PSHUFB h)
xmm,xmm
PSHUFB h)
xmm,m128
PSHUFW
mm,mm,i
PSHUFW
mm,m64,i
PSHUFD
xmm,xmm,i
PSHUFD
xmm,m128,i
PSHUFL/HW
xmm,xmm,i
PSHUFL/HW
xmm, m128,i
PALIGNR h)
mm,mm,i
PALIGNR h)
mm,m64,i
PALIGNR h)
xmm,xmm,i
PALIGNR h)
xmm,m128,i
MASKMOVQ
mm,mm
MASKMOVDQU
xmm,xmm
PMOVMSKB
r32,(x)mm
PEXTRW
r32,mm,i
PEXTRW
r32,xmm,i
PINSRW
mm,r32,i
PINSRW
mm,m16,i
PINSRW
xmm,r32,i
PINSRW
xmm,m16,i

1
1
1
1
1
1
1
9
4
4
1
1
1
1
1
1
3
4
1
1
3
4
1
2
1
2
4
5
1
2
2
3
1
2
2
2
2
2
4
10
1
2
3
1
2
3
4

Arithmetic instructions
PADD/SUB(U)(S)B/W/D (x)mm, (x)mm
PADD/SUB(U)(S)B/W/D
(x)mm,m
PADDQ PSUBQ
(x)mm, (x)mm
PADDQ PSUBQ
(x)mm,m

1
1
2
2

1
1

int
int
int

x
1
1

int
int

1
4
2
2
1
1

x
x
x
x
x

x
x

x
x
x
x
x

1
2
2

1
2

int
int
int
int
1
1

1
1
3
3
1
1
3
3
1
1
1
1
4
4
1
1
2
2
1
1
2
2
2
2

1
1

1
2
3
1
1
3
3

1
1
2
2

x
x
x
x

1
1

1
1

1
1
1
1
1

1
1

x
x

x
x
x
x

x
x


x
x

x
x
x
x

x
x

1
1
1
1
1
1
x
x
x
x

1
1
x
x

x
x
x
x

1
1
1
1
1

1
1

1
1

1
2

2
1
2
3
1
2
3
3-8
2-8
2-8
1
1

1
1
int
int
fltint
int
int
int
fltint
int
int
int
int
int
int
int
int
int
fltint
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int

int
int
int
int

3
1
3
1
1
3
1
3
1
2
2

2
3
5
2
6

1
0,33
1
1
0,33
1
1
4
2
2
0,33
0,33
2
2
1
1
2
2
1
1
2
2
1
1
1
1
2
2
1
1
1
1
1
1
1
1
1
1
2-5
6-10
1
1
1
1
1
1,5
1,5

0,5
1
1
1

PHADD(S)W
PHSUB(S)W h)
PHADD(S)W
PHSUB(S)W h)
PHADD(S)W
PHSUB(S)W h)
PHADD(S)W
PHSUB(S)W h)
PHADDD PHSUBD h)
PHADDD PHSUBD h)
PHADDD PHSUBD h)
PHADDD PHSUBD h)
PCMPEQ/GTB/W/D
PCMPEQ/GTB/W/D
PMULL/HW PMULHUW
PMULL/HW PMULHUW
PMULHRSW h)
PMULHRSW h)
PMULUDQ
PMULUDQ
PMADDWD
PMADDWD
PMADDUBSW h)
PMADDUBSW h)
PAVGB/W
PAVGB/W
PMIN/MAXUB/SW
PMIN/MAXUB/SW
PABSB PABSW PABSD
h)
PSIGNB PSIGNW
PSIGND h)
PSADBW
PSADBW
Logic instructions
PAND(N) POR PXOR
PAND(N) POR PXOR
PSLL/RL/RAW/D/Q
PSLL/RL/RAW/D/Q
PSLL/RL/RAW/D/Q
PSLL/RL/RAW/D/Q
PSLL/RL/RAW/D/Q
PSLL/RLDQ
Other
EMMS
Notes:
g)
h)
k)

mm,mm

int

mm,m64

xmm,xmm

xmm,m128
mm,mm
mm,m64
xmm,xmm
xmm,m128
(x)mm,(x)mm
(x)mm,m
(x)mm,(x)mm
(x)mm,m
(x)mm,(x)mm
(x)mm,m
(x)mm,(x)mm
(x)mm,m
(x)mm,(x)mm
(x)mm,m
(x)mm,(x)mm
(x)mm,m
(x)mm,(x)mm
(x)mm,m
(x)mm,(x)mm
(x)mm,m
(x)mm,(x)mm
(x)mm,m
(x)mm,(x)mm
(x)mm,m
(x)mm,(x)mm
(x)mm,m

8
3
4
5
6
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1

7
3
3
5
5
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1

(x)mm,(x)mm
(x)mm,m
mm,mm/i
mm,m64
xmm,i
xmm,xmm
xmm,m128
xmm,i

1
1
1
1
1
2
3
2

1
1
1
1
1
2
2
2

x
x
1
1
1
x
x
x

x
x

11

11

int
int

1
1
1
x
x

x
x
1
1
1
1
1
1
1
1
1
1

x
x
x
x
x
x
x
x

1
1
1
1
1
1

x
x
x
x
x
x
x
x
1
1

1
1
1
1
1

x
x

1
1

x
x
x

int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int

int
int
int
int
int
int
int
int

float

4
4

3
5
1
3
3
3
3
3
1
1
1
1
3

1
1
1
2
2

4
4
2
2
3
3
0,5
1
1
1
1
1
1
1
1
1
1
1
0,5
1
0,5
1
0,5
1
0,5
1
1
1

0,33
1
1
1
1
1
1
1

g) SSE3 instruction set.
h) Supplementary SSE3 instruction set.
k) MASM uses the name MOVD rather than MOVQ for this instruction even when moving 64 bits.


Floating point XMM instructions


Instruction

Operands

Move instructions
MOVAPS/D
MOVAPS/D
MOVAPS/D
MOVUPS/D
MOVUPS/D
MOVSS/D
MOVSS/D
MOVSS/D
MOVHPS/D MOVLPS/D
MOVHPS/D
MOVLPS/D
MOVLHPS MOVHLPS
MOVMSKPS/D
MOVNTPS/D
SHUFPS
SHUFPS
SHUFPD
SHUFPD
MOVDDUP g)
MOVDDUP g)
MOVSH/LDUP g)
MOVSH/LDUP g)
UNPCKH/LPS
UNPCKH/LPS
UNPCKH/LPD
UNPCKH/LPD

xmm,xmm
xmm,m128
m128,xmm
xmm,m128
m128,xmm
xmm,xmm
xmm,m32/64
m32/64,xmm
xmm,m64
m64,xmm
m64,xmm
xmm,xmm
r32,xmm
m128,xmm
xmm,xmm,i
xmm,m128,i
xmm,xmm,i
xmm,m128,i
xmm,xmm
xmm,m64
xmm,xmm
xmm,m128
xmm,xmm
xmm,m128
xmm,xmm
xmm,m128

1
1
1
4
9
1
1
1
2
2
1
1
1
1
3
4
1
2
1
2
1
2
3
4
1
2

xmm,xmm
xmm,m128
xmm,xmm
xmm,m64
xmm,xmm
xmm,m64
xmm,xmm
xmm,m32
xmm,xmm
xmm,m128
xmm,xmm
xmm,m128
xmm,xmm
xmm,m64
xmm,xmm
xmm,m128

2
2
2
2
2
2
2
2
1
1
1
1
2
3
2
2

2
2
2
2
2
2
2
2
1
1
1
1
2
2
2
2

Conversion
CVTPD2PS
CVTPD2PS
CVTSD2SS
CVTSD2SS
CVTPS2PD
CVTPS2PD
CVTSS2SD
CVTSS2SD
CVTDQ2PS
CVTDQ2PS
CVT(T) PS2DQ
CVT(T) PS2DQ
CVTDQ2PD
CVTDQ2PD
CVT(T)PD2DQ
CVT(T)PD2DQ

μops fused domain | μops unfused domain: p015 p0 p1 p5 p2 p3 p4 | Unit
x

int
int

1
1
2
4
1

1
x
x

x
x

1
x
x

2
1

1
int

2
int
int

1
1
1
1

1
1

1
1

1
1

int
1
1

3
3

1
1
1
1

1
1
1
1
3
3

1
1

1
1
1

1
1
2
2


1
1
float
float

1
3
3
1
1
1
1
1
1
3
3
1
1

1
1
1
1
1

1
1
1
1

Latency | Reciprocal throughput
1
2
3
2-4
3-4
1
2
3
3
5
3
1
1

1
fltint
fltint
float
float
int
int
int
int
fltint
int
float
float

float
float
float
float
float
float
float
float
float
float
float
float
float
float
float
float

1
1
1
3
1

4
2
2
3
3
4
4

0,33
1
1
2
4
0,33
1
1
1
1
1
1
1
2-3
2
2
1
1
1
1
1
1
2
2
1
1

1
1
1
1
2
2
2
2
1
1
1
1
1
1
1
1

CVTPI2PS
CVTPI2PS
CVT(T)PS2PI
CVT(T)PS2PI
CVTPI2PD
CVTPI2PD
CVT(T) PD2PI
CVT(T) PD2PI
CVTSI2SS
CVTSI2SS
CVT(T)SS2SI
CVT(T)SS2SI
CVTSI2SD
CVTSI2SD
CVT(T)SD2SI
CVT(T)SD2SI
Arithmetic
ADDSS/D SUBSS/D
ADDSS/D SUBSS/D
ADDPS/D SUBPS/D
ADDPS/D SUBPS/D
ADDSUBPS/D g)
ADDSUBPS/D g)
HADDPS HSUBPS g)
HADDPS HSUBPS g)
HADDPD HSUBPD g)
HADDPD HSUBPD g)
MULSS
MULSS
MULSD
MULSD
MULPS
MULPS
MULPD
MULPD
DIVSS
DIVSS
DIVSD
DIVSD
DIVPS
DIVPS
DIVPD
DIVPD
RCPSS/PS
RCPSS/PS
CMPccSS/D
CMPccSS/D
CMPccPS/D
CMPccPS/D
COMISS/D UCOMISS/D
COMISS/D UCOMISS/D
MAXSS/D MINSS/D

xmm,mm
xmm,m64
mm,xmm
mm,m128
xmm,mm
xmm,m64
mm,xmm
mm,m128
xmm,r32
xmm,m32
r32,xmm
r32,m32
xmm,r32
xmm,m32
r32,xmm
r32,m64

1
1
1
1
2
2
2
2
1
1
1
1
2
2
1
1

1
1
1
1
2
2
2
2
1
1
1
1
2
1
1
1

xmm,xmm
xmm,m32/64
xmm,xmm
xmm,m128
xmm,xmm
xmm,m128
xmm,xmm
xmm,m128
xmm,xmm
xmm,m128
xmm,xmm
xmm,m32
xmm,xmm
xmm,m64
xmm,xmm
xmm,m128
xmm,xmm
xmm,m128
xmm,xmm
xmm,m32
xmm,xmm
xmm,m64
xmm,xmm
xmm,m128
xmm,xmm
xmm,m128
xmm,xmm
xmm,m
xmm,xmm
xmm,m32/64
xmm,xmm
xmm,m128
xmm,xmm
xmm,m32/64
xmm,xmm

1
1
1
1
1
1
6
7
3
4
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1

1
1
1
1
1
1
6
6
3
3
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1

1
1
1
1

1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1

1
1
1
1
1
1

1
1
1
1
1
1
1
1

1
1
1
1
1

1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1


1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1

1
1
1
1

float
float
float
float
float
float
float
float
float
float
float
float
float
float
float
float

float
float
float
float
float
float
float
float
float
float
float
float
float
float
float
float
float
float
float
float
float
float
float
float
float
float
float
float
float
float
float
float
float
float
float

3
4
4
4
3
4
3

3
3
9
5
4
5
4
5
6-18 d)
6-32 d)
6-18 d)
6-32 d)
3
3
3
3
3

3
3
1
1
1
1
1
1
3
3
1
1
3
3
1
1

1
1
1
1
1
1
3
3
2
2
1
1
1
1
1
1
1
1
5-17 d)
5-17 d)
5-31 d)
5-31 d)
5-17 d)
5-17 d)
5-31 d)
5-31 d)
2
2
1
1
1
1
1
1
1

MAXSS/D MINSS/D
MAXPS/D MINPS/D
MAXPS/D MINPS/D

xmm,m32/64
xmm,xmm
xmm,m128

1
1
1

1
1
1

xmm,xmm
xmm,m
xmm,xmm
xmm,m
xmm,xmm
xmm,m

1
2
1
2
1
1

1
1
1
1
1
1

1
1
1
1

Logic
AND/ANDN/OR/XORPS/D xmm,xmm
AND/ANDN/OR/XORPS/D xmm,m128

1
1

1
1

x
x

14
6
141
119

13
4

Math
SQRTSS/PS
SQRTSS/PS
SQRTSD/PD
SQRTSD/PD
RSQRTSS/PS
RSQRTSS/PS

Other
LDMXCSR
STMXCSR
FXSAVE
FXRSTOR
Notes:
d)
g)

m32
m32
m4096
m4096

1
1
1

float
float
float

6-29

float
float
float
float
float
float

int
int

1
1
1
1

x
x

x
x

6-58
3

1
1

1
145
164

d) Round divisors give low values.
g) SSE3 instruction set.


1
1
1

6-29
6-29
6-58
6-58
2
2

0,33
1

42
19
145
164

Wolfdale

Intel Core 2 (Wolfdale, 45nm)


List of instruction timings and μop breakdown

Explanation of column headings:

Operands: i = immediate data, r = register, mm = 64 bit mmx register, xmm = 128 bit xmm register, (x)mm = mmx or xmm register, sr = segment register, m = memory, m32 = 32-bit memory operand, etc.

μops fused domain: The number of μops at the decode, rename, allocate and retirement stages in the pipeline. Fused μops count as one.

μops unfused domain: The number of μops for each execution port. Fused μops count as two. Fused macro-ops count as one. The instruction has μop fusion if the sum of the numbers listed under p015 + p2 + p3 + p4 exceeds the number listed under μops fused domain. An x under p0, p1 or p5 means that at least one of the μops listed under p015 can optionally go to this port. For example, a 1 under p015 and an x under p0 and p5 means one μop which can go to either port 0 or port 5, whichever is vacant first. A value listed under p015 but nothing under p0, p1 and p5 means that it is not known which of the three ports these μops go to.

p015: The total number of μops going to ports 0, 1 and 5.
p0: The number of μops going to port 0 (execution units).
p1: The number of μops going to port 1 (execution units).
p5: The number of μops going to port 5 (execution units).
p2: The number of μops going to port 2 (memory read).
p3: The number of μops going to port 3 (memory write address).
p4: The number of μops going to port 4 (memory write data).

Unit: Tells which execution unit cluster is used. An additional delay of 1 clock cycle is generated if a register written by a μop in the integer unit (int) is read by a μop in the floating point unit (float) or vice versa. fltint means that an instruction with multiple μops receives the input in the float unit and delivers the output in the int unit. Delays for moving data between different units are included under latency when they are unavoidable. For example, movd eax,xmm0 has an extra 1 clock delay for moving from the XMM-integer unit to the general purpose integer unit. This is included under latency because it occurs regardless of which instruction comes next. Nothing listed under unit means that additional delays are either unlikely to occur or unavoidable and therefore included in the latency figure.

Latency: This is the delay that the instruction generates in a dependency chain. The numbers are minimum values. Cache misses, misalignment, and exceptions may increase the clock counts considerably. Floating point operands are presumed to be normal numbers. Denormal numbers, NaNs and infinity greatly increase the delays, except in XMM move, shuffle and Boolean instructions. Floating point overflow, underflow, denormal or NaN results give a similar delay. The time unit used is core clock cycles, not the reference clock cycles given by the time stamp counter.

Reciprocal throughput: The average number of core clock cycles per instruction for a series of independent instructions of the same kind in the same thread.
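The rule for recognizing micro-operation (μop) fusion from a table row, and the meaning of the x marks under p0, p1 and p5, can be expressed as a short check. This is a minimal sketch under stated assumptions: the function names, the dictionary layout and the example rows are hypothetical and only illustrate the rule described in the column explanation.

```python
def has_uop_fusion(fused, p015=0, p2=0, p3=0, p4=0):
    """Return True if a table row indicates micro-op fusion.

    fused: value in the "fused domain" column.
    p015, p2, p3, p4: values in the unfused domain columns (0 if empty).
    Fusion is indicated when the unfused sum exceeds the fused count.
    """
    return p015 + p2 + p3 + p4 > fused


def candidate_ports(marks):
    """Return the execution ports that may receive the p015 micro-ops.

    marks: dict with optional keys "p0", "p1", "p5". A value of "x" means
    the port is optional; a number means that many micro-ops go there.
    An empty result means the table does not say which port is used.
    """
    return [port for port in ("p0", "p1", "p5") if marks.get(port)]


# Hypothetical row: 1 fused-domain micro-op, 1 under p015 and 1 under p2
print(has_uop_fusion(fused=1, p015=1, p2=1))     # True
# Hypothetical row: 1 under p015 with an x under p0 and p5
print(candidate_ports({"p0": "x", "p5": "x"}))   # ['p0', 'p5']
```

The check only compares counts from one row; it says nothing about macro-op fusion, which the tables count as one op in both domains.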

Integer instructions
Instruction

Operands

μops fused domain | μops unfused domain: p015 p0 p1 p5 p2 p3 p4 | Unit

Latency | Reciprocal throughput
Move instructions
MOV
MOV a)
MOV a)
MOV
MOV
MOV
MOV
MOV
MOVNTI
MOVSX MOVZX
MOVSXD
MOVSX MOVZX
MOVSX MOVSXD
CMOVcc
CMOVcc
XCHG
XCHG
XLAT
PUSH
PUSH
PUSH
PUSH
PUSHF(D/Q)
PUSHA(D) i)
POP
POP
POP
POP
POPF(D/Q)
POPA(D) i)
LAHF SAHF
SALC i)
LEA a)
BSWAP
LDS LES LFS LGS LSS
PREFETCHNTA
PREFETCHT0/1/2
LFENCE
MFENCE
SFENCE
CLFLUSH
IN
OUT
Arithmetic instructions
ADD SUB
ADD SUB
ADD SUB
ADC SBB
ADC SBB
ADC SBB
CMP

r,r/i
r,m
m,r
m,i
r,sr
m,sr
sr,r
sr,m
m,r

1
1
1
1
1
2
8
8
2

r,r
r16/32,m
r64,m
r,r
r,m
r,r
r,m

m8

1
1
2
2
2
3
7
2
1
1
2
2
17
18
1
4
2
10
24
10
1
2
1
2
11
1
1
2
2
2
4

r,r/i
r,m
m,r/i
r,r/i
r,m
m,r/i
r,r/i

1
1
2
2
2
4
1

r
i
m
sr

r
(E/R)SP
m
sr

r,m
r
m
m
m

x
1

4
3

1
2
2
3
x
1

x
x

x
x
x
x
x

x
x
x
x
x

x
x

1
1
4
5

x
x
x
x

x
x
1
1

x
x

x
x

0,33
1
1
1

2
1

3
9
23
2
1
2
1
2
11

0,33
1
1
1
1
1
16
16
2

1
1

1
1

1
1

1
1
15
9

1
1

1
2
3
3

1
1
1
1
1
8

1
1
1
1
1
1

1
1
1
1
1
8

2
high b)
4
3

2
1

20
1
4
1
4

1
1
1
2
2
3
1

x
x
x
x
x
x
x

1
1
1
1

x
x
x
x
x
x
x


x
x
x
x
x
x
x

1
1
1
1
1
7
8
1
1,5
17

1
1
1

1
1
1
1

120

1
1
1

1
1

6
2
2
7
1

7
0,33
1
1
1
17
1
1
8
6
9
90

0,33
1
1
2
2
0,33

CMP
INC DEC NEG NOT
INC DEC NEG NOT
AAA AAS DAA DAS i)
AAD i)
AAM i)
MUL IMUL
MUL IMUL
MUL IMUL
MUL IMUL
IMUL
IMUL
IMUL
IMUL
IMUL
IMUL
MUL IMUL
MUL IMUL
MUL IMUL
MUL IMUL
IMUL
IMUL
IMUL
IMUL
IMUL
IMUL
DIV IDIV
DIV IDIV
DIV IDIV
DIV
IDIV
DIV IDIV
DIV IDIV
DIV IDIV
DIV
IDIV
CBW CWDE CDQE
CWD CDQ CQO
Logic instructions
AND OR XOR
AND OR XOR
AND OR XOR
TEST
TEST
SHR SHL SAR
SHR SHL SAR
ROR ROL
ROR ROL
RCR RCL
RCR
RCL
RCR RCL

m,r/i
r
m

r8
r16
r32
r64
r16,r16
r32,r32
r64,r64
r16,r16,i
r32,r32,i
r64,r64,i
m8
m16
m32
m64
r16,m16
r32,m32
r64,m64
r16,m16,i
r32,m32,i
r64,m64,i
r8
r16
r32
r64
r64
m8
m16
m32
m64
m64

r,r/i
r,m
m,r/i
r,r/i
m,r/i
r,i/cl
m,i/cl
r,i/cl
m,i/cl
r,1
r8,i/cl
r8,i/cl
r,i/cl

1
1
3
1
3
5
1
3
3
3
1
1
1
1
1
1
1
3
3
3
1
1
1
1
1
1
4
7
7
32-38
56-62
4
7
7
32
56
1
1

1
1
2
1
1
1
3
1
3
2
9
8
6

1
1
1
1
3
5
1
3
3
3
1
1
1
1
1
1
1
3
3
2
1
1
1
1
1
1
4
7
7

x
x
x
x
x
x
x
x

x
x
x
1
x
x
1
x
x
x
1
1

x
x
x

1
1
1
1
x
x
2

1
x
x

x
x

1
1
1
1
1

1
1
1
1
1
1
2
1
2
2
9
8
6

x
x
x
x
x
x
x
x
x
x
x
x
x

x
x
x
x
x

x
x
x
x


x
x
x
x
x
x
x
x
x
x
x
x
x

1
1
6

17
3
5
5
7
3
3
5
3
3
5
3
5
5
7
3
3
5

x
x
x

3
7
6
31
55
1
1

56-62

x
x

1
1 2 1
x x x
2 3 2
9 10 13
x x x
1 2
2 3 2
x x x
x x x
x x x
x x x
x
x

32-38

1
1
1
1
1
1
1
1
1
1

1
0,33
1
1
1
1
1,5
1,5
4
1
1
2
1
1
2
1
1,5
1,5
4
1
1
2
2
1
2

9-18 c)
14-22 c)
14-23 c)
18-57 c)
34-88 c)
9-18
14-22 c)
14-23 c)
34-88 c)
39-72 c)

1
1
1
1
1

1
1

1
1
1

6
1

1
1
6
1
6
2
12
11
11

0,33
1
1
0,33
1
0,5
1
1
1
2

RCR RCL
RCR
RCL
RCR RCL
SHLD SHRD
SHLD SHRD
BT
BT
BT
BTR BTS BTC
BTR BTS BTC
BTR BTS BTC
BSF BSR
BSF BSR
SETcc
SETcc
CLC STC CMC
CLD
STD

m,1
m8,i/cl
m8,i/cl
m,i/cl
r,r,i/cl
m,r,i/cl
r,r/i
m,r
m,i
r,r/i
m,r
m,i
r,r
r,m
r
m

Control transfer instructions


JMP
short/near
JMP i)
far
JMP
r
JMP
m(near)
JMP
m(far)
Conditional jump
short/near
Fused compare/test and branch e,i)
J(E/R)CXZ
short
LOOP
short
LOOP(N)E
short
CALL
near
CALL i)
far
CALL
r
CALL
m(near)
CALL
m(far)
RETN
RETN
i
RETF
RETF
i
BOUND i)
r,m
INTO i)
String instructions
LODS
REP LODS
STOS
REP STOS
MOVS
REP MOVS
SCAS
REP(N)E SCAS

4
12
11
10
2
3
1
9
3
1
10
3
2
2
1
2
1
6
6

3
9
8
7
2
2
1
8
2
1
7
1
2
2
1
1
1
6
6

1
30
1
1
31
1
1
2
11
11
3
43
3
4
44
1
3
32
32
15
5

1
30
1
1
29
1
1
2
11
11
2
43
2
3
42
1
1
30
30
13
5

x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x

x
x
x
x

x
x
x
x
x
x
x
x
x
x
x
x
1
1
x
x
x
x
x

x
x
x
x

3
2
4+7n-14+6n
4
2
8+5n-20+1.2n
8
5
1
1
1
7+7n-13+n
4
3
7+8n-17+7n

x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x

1
1
1
1

1
1
1
1

1
1
1
1

7
14
13
13
2
7
1

1
4
1

1
1
1
1

1
1

1
1

1
5
6
2

1
1
1

1
1

1
1

0
0

1
2

1
1
1
x
x
x

1
1

0
0

1
2
1
1
2
2
2

1
1

1
1

1
1
1
1
0,33
3
14

1-2
76
1-2
1-2
68
1
1
1-2
5
5
2
75
2
2
75
2
2
78
78
8
3

1
1+5n-21+3n
1

1
7+2n-0.55n

5
1+3n-0.63n
1

1
3+8n-23+6n

CMPS
REP(N)E CMPS
Other
NOP (90)
Long NOP (0F 1F)
PAUSE
ENTER
ENTER
LEAVE
CPUID
RDTSC
RDPMC
Notes:
a)
b)
c)
e)
i)

7
5
7+10n-7+9n

i,0
a,b

1
1
3
12

1
1
3
10

3
53-117
13
23

3
2+7n-22+5n

x
x
x

x
x
x

x
x
x
1

0,33
1
8
8

1
53-211
32
54

a) Applies to all addressing modes.
b) Has an implicit LOCK prefix.
c) Low values are for small results, high values for high results. The reciprocal throughput is only slightly less than the latency.
e) See manual 3: "The microarchitecture of Intel, AMD and VIA CPUs" for restrictions on macro-op fusion.
i) Not available in 64 bit mode.

Floating point x87 instructions


Instruction

Move instructions
FLD
FLD
FLD
FBLD
FST(P)
FST(P)
FSTP
FBSTP
FXCH
FILD
FIST
FISTP
FISTTP g)
FLDZ
FLD1
FLDPI FLDL2E etc.
FCMOVcc
FNSTSW
FNSTSW
FLDCW
FNSTCW
FINCSTP FDECSTP

Operands

r
m32/64
m80
m80
r
m32/m64
m80
m80
r
m
m
m
m

r
AX
m16
m16
m16

μops fused domain | μops unfused domain: p015 p0 p1 p5 p2 p3 p4 | Unit
1
1
4
40
1
1
7
171
1
1
2
3
3
1
2
2
2
1
2
2
3
1

1
1
2
38
1
3
167
0 f)
1
1
1
1
1
2
2
2
1
1
1
1
1

1
2
x
1
x
x

x
x

x
x

1
1
1
1
1
1

1
2
2
1
2
2

1
2
2

1
1
1

1
1
1

1
2

2
1
1

1
1
1


float
float
float
float
float
float
float
float
float
float
float
float
float
float
float
float
float
float
float
float
float
float

Latency | Reciprocal throughput
1
3
4
45
1
3
4
164
0
6
6
6
6

1
1
3
20
1
1
5
166
1
1
1
1
1
1
2
2
2
1
2
10
8
1

FFREE(P)
FNSAVE
FRSTOR
Arithmetic instructions
FADD(P) FSUB(R)(P)
FADD(P) FSUB(R)(P)
FMUL(P)
FMUL(P)
FDIV(R)(P)
FDIV(R)(P)
FABS
FCHS
FCOM(P) FUCOM
FCOM(P) FUCOM
FCOMPP FUCOMPP
FCOMI(P) FUCOMI(P)
FIADD FISUB(R)
FIMUL
FIDIV(R)
FICOM(P)
FTST
FXAM
FPREM
FPREM1
FRNDINT
Math
FSCALE
FXTRACT
FSQRT
FSIN
FCOS
FSINCOS
F2XM1
FYL2X FYL2XP1
FPTAN
FPATAN
Other
FNOP
WAIT
FNCLEX
FNINIT
Notes:
d)
f)
g)

r
m
m

2
141
78

2
95
51

r
m
r
m
r
m

1
1
1
1
1
1
1
1
1
1
2
1
2
2
2
2
1
1
26-29
28-35
17-19

1
1
1
1
1
1
1
1
1
1
2
1
2
2
2
2
1
1

r
m
r
m
m
m
m

1
2
4
15

x
x
x

x
x 7 23 23
x 27

1
1

1
1
1
1
1
1

1
1

float
float
float

1
1
5
2
2
6-21 d) 5-20 d)
6-21 d) 5-20 d)
1
1
1
1
1
1

43
~170
6-20
32-85
70-100
38-107
45
50-100
40-130
55-130

x
x
x

x
x
1
x
x

x
x

x
x

x
x

x
x

float
float
float
float
float

x
x
x
x
x

x
x
x
x
x

x
x
x
x
x

float
float
float
float
float

x
x
x

float
float
float
float

1
1

1
x
x

x
x
x

1
1
1
1

2
142
177

float
float
float
float
float
float
float
float
float
float
float
float
float
float
float
float
float
float
float
float
float

1
1
1
1
2
1
1
2
1
1
x
x
x

x
x
x

28
28
53-84
1
1
18-85
76-100
18105
19
19
57-65
19-100
23-87

1
2
4
15

x
x
x

3
5
6-21

1
2
2
5-20 d)
2
1
1

13-40
18-41
10-22

1
1
15
63

d) Round divisors or low precision give low values.
f) Resolved by register renaming. Generates no μops in the unfused domain.
g) SSE3 instruction set.


Integer MMX and XMM instructions


Instruction

Move instructions
MOVD k)
MOVD k)
MOVD k)
MOVD k)
MOVQ
MOVQ
MOVQ
MOVDQA
MOVDQA
MOVDQA
MOVDQU
MOVDQU
LDDQU g)
MOVDQ2Q
MOVQ2DQ
MOVNTQ
MOVNTDQ
MOVNTDQA j)
PACKSSWB/DW
PACKUSWB
PACKSSWB/DW
PACKUSWB
PACKSSWB/DW
PACKUSWB
PACKSSWB/DW
PACKUSWB
PACKUSDW j)
PACKUSDW j)
PUNPCKH/LBW/WD/DQ
PUNPCKH/LBW/WD/DQ
PUNPCKH/LBW/WD/DQ
PUNPCKH/LBW/WD/DQ
PUNPCKH/LQDQ
PUNPCKH/LQDQ
PMOVSX/ZXBW j)
PMOVSX/ZXBW j)
PMOVSX/ZXBD j)
PMOVSX/ZXBD j)
PMOVSX/ZXBQ j)
PMOVSX/ZXBQ j)
PMOVSX/ZXWD j)
PMOVSX/ZXWD j)
PMOVSX/ZXWQ j)
PMOVSX/ZXWQ j)
PMOVSX/ZXDQ j)
PMOVSX/ZXDQ j)

Operands

μops fused domain | μops unfused domain: p015 p0 p1 p5 p2 p3 p4 | Unit

r,(x)mm
m,(x)mm
(x)mm,r
(x)mm,m
v,v
(x)mm,m64
m64, (x)mm
xmm, xmm
xmm, m128
m128, xmm
m128, xmm
xmm, m128
xmm, m128
mm, xmm
xmm,mm
m64,mm
m128,xmm
xmm, m128

1
1
1
1
1
1
1
1
1
1
9
4
4
1
1
1
1
1

mm,mm

mm,m64

xmm,xmm

xmm,m128
xmm,xmm
xmm,m
mm,mm
mm,m64
xmm,xmm
xmm,m128
xmm,xmm
xmm, m128
xmm,xmm
xmm,m64
xmm,xmm
xmm,m32
xmm,xmm
xmm,m16
xmm,xmm
xmm,m64
xmm,xmm
xmm,m32
xmm,xmm
xmm,m64

1
1
1
1
1
1
1
1
2
1
1
1
1
1
1
1
1
1
1
1
1

1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1

1
1
1

int
1

int
int
int
int

1
1

x
1
1

int
int

1
4
2
2
1
1

x
x
x
x
x

x
x

x
x
x
x
x

1
2
2

1
2

1
2
int
int
int
int

1
1

int

1
1


1
1

1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1

1
1
1
1
1
1
1
1

0,33
1
0,5
1
0,33
1
1
0,33
1
1
4
2
2
0,33
0,33
2
2
1

int
int

2
3
2
2
1
2
3
1
2
3
3-8
2-8
2-8
1
1

1
1

Latency | Reciprocal throughput

int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int

1
1

1
1
1
1
1
1
1
1
1
1

1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1

PSHUFB h)
PSHUFB h)
PSHUFB h)
PSHUFB h)
PSHUFW
PSHUFW
PSHUFD
PSHUFD
PSHUFL/HW
PSHUFL/HW
PALIGNR h)
PALIGNR h)
PALIGNR h)
PALIGNR h)
PBLENDVB j)
PBLENDVB j)
PBLENDW j)
PBLENDW j)
MASKMOVQ
MASKMOVDQU
PMOVMSKB
PEXTRB j)
PEXTRB j)
PEXTRW
PEXTRW j)
PEXTRD j)
PEXTRD j)
PEXTRQ j,m)
PEXTRQ j,m)
PINSRB j)
PINSRB j)
PINSRW
PINSRW
PINSRD j)
PINSRD j)
PINSRQ j,m)
PINSRQ j,m)
Arithmetic instructions
PADD/SUB(U)(S)B/W/D
PADD/SUB(U)(S)B/W/D
PADDQ PSUBQ
PADDQ PSUBQ
PHADD(S)W
PHSUB(S)W h)
PHADD(S)W
PHSUB(S)W h)
PHADDD PHSUBD h)
PHADDD PHSUBD h)
PCMPEQ/GTB/W/D
PCMPEQ/GTB/W/D
PCMPEQQ j)
PCMPEQQ j)

mm,mm
mm,m64
xmm,xmm
xmm,m128
mm,mm,i
mm,m64,i
xmm,xmm,i
xmm,m128,i
xmm,xmm,i
x, m128,i
mm,mm,i
mm,m64,i
xmm,xmm,i
xmm,m128,i
x,x,xmm0
x,m,xmm0
xmm,xmm,i
xmm,m,i
mm,mm
xmm,xmm
r32,(x)mm
r32,xmm,i
m8,xmm,i
r32,(x)mm,i
m16,(x)mm,i
r32,xmm,i
m32,xmm,i
r64,xmm,i
m64,xmm,i
xmm,r32,i
xmm,m8,i
(x)mm,r32,i
(x)mm,m16,i
xmm,r32,i
xmm,m32,i
xmm,r64,i
xmm,m64,i

1
2
1
1
1
2
1
2
1
2
2
3
1
1
2
2
1
1
4
10
1
2
2
2
2
2
2
2
2
1
2
1
2
1
2
1
2

1
1
1
1
1
1
1
1
1
1
2
3
1
1
2
2
1
1
1
4
1
2
2
2
2
2
1
2
1
1
1
1
1
1
1
1
1

1
1
1
1
1
1
1
1
1
1
2
3
1
1
2
2
1
1

v,v
(x)mm,m
v,v
(x)mm,m

1
1
2
2

1
1
2
2

x
x
x
x

x
x
x
x

v,v

(x)mm,m64
v,v
(x)mm,m64
v,v
(x)mm,m
xmm,xmm
xmm,m128

4
3
4
1
1
1
1

3
3
3
1
1
1
1

1
1
1
x
x

2
2
2
x
x
1
1

1
1
1
x
x
x
?
x
x

3
x
x
x
?
x
x


x
x
x
1
x
1
x
1
1
1
1
1
1
1
1
1

1
1
1
1
1
1
1
1
1
1
2

1
2

1
3

1
1
1
1

1
1

1
1
1
1

int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int

int
int
int
int

int

int
int
int
int
int
int
int

1
1
1
1
2
1
2
1

2
3
3
3
3
3
1
2
1
1

3
1
1

1
1
1
1
1
1
1
1
1
1
1
1
1
1
2
2
1
1
2-5
6-10
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1

0,5
1
1
1
2
2
2
2
0,5
1
1
1

PMULL/HW PMULHUW
v,v
PMULL/HW PMULHUW
(x)mm,m
PMULHRSW h)
v,v
PMULHRSW h)
(x)mm,m
PMULLD j)
xmm,xmm
PMULLD j)
xmm,m128
PMULDQ j)
xmm,xmm
PMULDQ j)
xmm,m128
PMULUDQ
v,v
PMULUDQ
(x)mm,m
PMADDWD
v,v
PMADDWD
(x)mm,m
PMADDUBSW h)
v,v
PMADDUBSW h)
(x)mm,m
PAVGB/W
v,v
PAVGB/W
(x)mm,m
PMIN/MAXSB j)
xmm,xmm
PMIN/MAXSB j)
xmm,m128
PMIN/MAXUB
v,v
PMIN/MAXUB
(x)mm,m
PMIN/MAXSW
v,v
PMIN/MAXSW
(x)mm,m
PMIN/MAXUW j)
xmm,xmm
PMIN/MAXUW j)
xmm,m
PMIN/MAXSD j)
xmm,xmm
PMIN/MAXSD j)
xmm,m128
PMIN/MAXUD j)
xmm,xmm
PMIN/MAXUD j)
xmm,m128
PHMINPOSUW j)
xmm,xmm
PHMINPOSUW j)
xmm,m128
PABSB PABSW PABSD h)
v,v
PABSB PABSW PABSD h)
(x)mm,m
PSIGNB PSIGNW PSIGND h)
v,v
PSIGNB PSIGNW PSIGND h)
(x)mm,m
PSADBW
v,v
PSADBW
(x)mm,m
MPSADBW j)
xmm,xmm,i
MPSADBW j)
xmm,m,i
Logic instructions
PAND(N) POR PXOR
PAND(N) POR PXOR
PTEST j)
PTEST j)
PSLL/RL/RAW/D/Q
PSLL/RL/RAW/D/Q
PSLL/RL/RAW/D/Q
PSLL/RL/RAW/D/Q
PSLL/RL/RAW/D/Q
PSLL/RLDQ

v,v
(x)mm,m
xmm,xmm
xmm,m128
mm,mm/i
mm,m64
xmm,i
xmm,xmm
xmm,m128
xmm,i

1
1
1
1
4
6
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
4
4
1

1
1
1
1
4
5
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
4
4
1

1
1
1
1
2
2
1
1
1
1
1
1
1
1

x
x
1
1
x
x
x
x
1

1
1
2
2

1
1
1
1
1

x
x

1
1

x
x
x
x

1
1
1

1
1
1
1

1
1
1

4
4
x

1
1
1
3
4

1
1
1
3
3

1
1
2
2
1
1
1
2
3
1

1
1
2
2
1
1
1
2
2
1

x
x
1
1
1
1
1
x
x
x

x
1
1
1
1

2
2

x
x
x
x

x
x
x
x


1
1
1

3
3
5
5
3
3
3
3
1
1
1
1
1
1
1
4
1

int
int

x
x
x

int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int

int
int
int
int
int

int
int
int
int
int
int
int
int
int
int

1
1
1
1
2
4
1
1
1
1
1
1
1
1
0,5
1
1
1
0,5
1
0,5
1
1
1
1
1
1
1
4
4
0,5
1

3
5

1
1
1
1
2
1

0,5
1
1
1
2
2

0,33
1
1
1
1
1
1
1
1
1

Other
EMMS
Notes:
g)
h)
j)
k)
m)

11

11

float

g) SSE3 instruction set.
h) Supplementary SSE3 instruction set.
j) SSE4.1 instruction set.
k) MASM uses the name MOVD rather than MOVQ for this instruction even when moving 64 bits.
m) Only available in 64 bit mode.

Floating point XMM instructions


Instruction

Move instructions
MOVAPS/D
MOVAPS/D
MOVAPS/D
MOVUPS/D
MOVUPS/D
MOVSS/D
MOVSS/D
MOVSS/D
MOVHPS/D MOVLPS/D
MOVHPS/D
MOVLPS/D
MOVLHPS MOVHLPS
MOVMSKPS/D
MOVNTPS/D
SHUFPS
SHUFPS
SHUFPD
SHUFPD
BLENDPS/PD j)
BLENDPS/PD j)
BLENDVPS/PD j)
BLENDVPS/PD j)
MOVDDUP g)
MOVDDUP g)
MOVSH/LDUP g)
MOVSH/LDUP g)
UNPCKH/LPS
UNPCKH/LPS
UNPCKH/LPD
UNPCKH/LPD
EXTRACTPS j)
EXTRACTPS j)
INSERTPS j)
INSERTPS j)

Operands

xmm,xmm
xmm,m128
m128,xmm
xmm,m128
m128,xmm
xmm,xmm
x,m32/64
m32/64,x
xmm,m64
m64,xmm
m64,xmm
xmm,xmm
r32,xmm
m128,xmm
xmm,xmm,i
xmm,m128,i
xmm,xmm,i
xmm,m128,i
xmm,xmm,i
xmm,m128,i
x,x,xmm0

x,m,xmm0
xmm,xmm
xmm,m64
xmm,xmm
xmm,m128
xmm,xmm
xmm,m128
xmm,xmm
xmm,m128
r32,xmm,i
m32,xmm,i
xmm,xmm,i
xmm,m32,i

μops fused domain
μops unfused domain: p015 p0 p1 p5 p2 p3 p4
Unit
1
1
1
4
9
1
1
1
2
2
1
1
1
1
1
2
1
2
1
1
2
2
1
2
1
2
1
1
1
2
2
2
1
2

int
int

1
1
2
4
1

1
x
x

x
x

1
x
x

2
1

1
int

2
int
int

1
1
1
1

1
1

1
1

1
1

int
1
1

1
1
1
1

1
1

1
1
1

1
1
1
1

1
1
1

Conversion

x
1
1
1

1
1

1
2
3
2-4
3-4
1
2
3
3
5
3
1
1

1
1
1
2
2

1
1
x

1
1
float
float

1
1
1
1
1
1
1
2
2
1
1
1
1
1
1
1
1
2
1
1
1

Latency
Reciprocal throughput

int
int
float
float
int
int
int
int
int
int
int
int
int
int
float
float
int
int
int
int

1
1
1
2
1
1
1
1
4
1

0,33
1
1
2
4
0,33
1
1
1
1
1
1
1
2-3
1
1
1
1
1
1
2
2
1
1
1
1
1
1
1
1
1
1
1
1

CVTPD2PS
CVTPD2PS
CVTSD2SS
CVTSD2SS
CVTPS2PD
CVTPS2PD
CVTSS2SD
CVTSS2SD
CVTDQ2PS
CVTDQ2PS
CVT(T)PS2DQ
CVT(T)PS2DQ
CVTDQ2PD
CVTDQ2PD
CVT(T)PD2DQ
CVT(T)PD2DQ
CVTPI2PS
CVTPI2PS
CVT(T)PS2PI
CVT(T)PS2PI
CVTPI2PD
CVTPI2PD
CVT(T)PD2PI
CVT(T)PD2PI
CVTSI2SS
CVTSI2SS
CVT(T)SS2SI
CVT(T)SS2SI
CVTSI2SD
CVTSI2SD
CVT(T)SD2SI
CVT(T)SD2SI

xmm,xmm
xmm,m128
xmm,xmm
xmm,m64
xmm,xmm
xmm,m64
xmm,xmm
xmm,m32
xmm,xmm
xmm,m128
xmm,xmm
xmm,m128
xmm,xmm
xmm,m64
xmm,xmm
xmm,m128
xmm,mm
xmm,m64
mm,xmm
mm,m128
xmm,mm
xmm,m64
mm,xmm
mm,m128
xmm,r32
xmm,m32
r32,xmm
r32,m32
xmm,r32
xmm,m32
r32,xmm
r32,m64

2
2
2
2
2
2
2
2
1
1
1
1
2
2
2
2
1
1
1
1
2
2
2
2
1
1
1
1
2
2
1
1

2
2
2
2
2
2
2
2
1
1
1
1
2
2
2
2
1
1
1
1
2
2
2
2
1
1
1
1
2
1
1
1

Arithmetic
ADDSS/D SUBSS/D
ADDSS/D SUBSS/D
ADDPS/D SUBPS/D
ADDPS/D SUBPS/D
ADDSUBPS/D g)
ADDSUBPS/D g)
HADDPS HSUBPS g)
HADDPS HSUBPS g)
HADDPD HSUBPD g)
HADDPD HSUBPD g)
MULSS
MULSS
MULSD
MULSD
MULPS
MULPS
MULPD
MULPD
DIVSS

xmm,xmm
x,m32/64
xmm,xmm
xmm,m128
xmm,xmm
xmm,m128
xmm,xmm
xmm,m128
xmm,xmm
xmm,m128
xmm,xmm
xmm,m32
xmm,xmm
xmm,m64
xmm,xmm
xmm,m128
xmm,xmm
xmm,m128
xmm,xmm

1
1
1
1
1
1
3
4
3
4
1
1
1
1
1
1
1
1
1

1
1
1
1
1
1
3
3
3
3
1
1
1
1
1
1
1
1
1

1
1
1
1
2
2
2
2

1
1
1
1

1
1
1
1

x
x
1
1
1
1
1
1
1
1
1

1
1
1
1

1
1
1
1

1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1

1
1
1
1
1
1
1
1
x
x


1
1
1
1
1
1
1
1
1
1
1
1

1
1
1
2
2
x
x

1
1
1
1
1
1

float
float
float
float
float
float
float
float
float
float
float
float
float
float
float
float
float
float
float
float
float
float
float
float
float
float
float
float
float
float
float
float

float
float
float
float
float
float
float
float
float
float
float
float
float
float
float
float
float
float
float

4
2
2
3
3
4
4
3
3
4
4
4
3
4
3

1
1
1
1
2
2
2
2
1
1
1
1
1
1
1
1
3
3
1
1
1
1
1
1
3
3
1
1
3
3
1
1

1
1
3
1
1
3
1
1
7
3
3
6
1,5
1,5
4
1
1
5
1
1
4
1
1
5
1
1
6-13 d) 5-12 d)

DIVSS
DIVSD
DIVSD
DIVPS
DIVPS
DIVPD
DIVPD
RCPSS/PS
RCPSS/PS
CMPccSS/D
CMPccSS/D
CMPccPS/D
CMPccPS/D
COMISS/D UCOMISS/D
COMISS/D UCOMISS/D
MAXSS/D MINSS/D
MAXSS/D MINSS/D
MAXPS/D MINPS/D
MAXPS/D MINPS/D
ROUNDSS/D j)
ROUNDSS/D j)
ROUNDPS/D j)
ROUNDPS/D j)
DPPS j)
DPPS j)
DPPD j)
DPPD j)
Math
SQRTSS/PS
SQRTSS/PS
SQRTSD/PD
SQRTSD/PD
RSQRTSS/PS
RSQRTSS/PS

xmm,m32
xmm,xmm
xmm,m64
xmm,xmm
xmm,m128
xmm,xmm
xmm,m128
xmm,xmm
xmm,m
xmm,xmm
x,m32/64
xmm,xmm
xmm,m128
xmm,xmm
x,m32/64
xmm,xmm
x,m32/64
xmm,xmm
xmm,m128
xmm,xmm,i
xmm,m128,i
xmm,xmm,i
xmm,m128,i
xmm,xmm,i
xmm,m128,i
xmm,xmm,i
xmm,m128,i

1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
4
4
4
4

1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
4
4
4
4

1
1
1
1
1
1
1

xmm,xmm
xmm,m
xmm,xmm
xmm,m
xmm,xmm
xmm,m

1
2
1
2
1
1

1
1
1
1
1
1

1
1
1
1

xmm,xmm
xmm,m128

1
1

1
1

x
x

x
x

x
x

m32
m32
m4096
m4096

13
10
151
121

12
8
67
74

x
x
x
x

x
x
x
x

x 1
x
1 1
x 8 38 38
x 47

2
2
x
x

1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
2
2
x
x

1
1
1
1
1
1
1
1
1
x
x

5-12 d)
6-21 d) 5-20 d)
5-20 d)
6-13 d) 5-12 d)
5-12 d)
6-21 d) 5-20 d)
5-20 d)
3
2
2
3
1
1
3
1
1
3
1
1
3
1
1
3
1
1
3
1
1
3
1
1
11
3
3
9
3
3

6-13

float
float
float
float
float
float

int
int

1
1
1
1

float
float
float
float
float
float
float
float
float
float
float
float
float
float
float
float
float
float
float
float
float
float
float
float
float
float
float

6-20
3

5-12
5-12
5-19
5-19
2
2

Logic
AND/ANDN/OR/XORPS/D
AND/ANDN/OR/XORPS/D

Other
LDMXCSR
STMXCSR
FXSAVE
FXRSTOR
Notes:
d) Round divisors give low values.
g) SSE3 instruction set.


0,33
1

38
20
145
150

Nehalem

Intel Nehalem
List of instruction timings and μop breakdown
Explanation of column headings:
Operands: i = immediate data, r = register, mm = 64 bit mmx register, xmm = 128 bit xmm register, (x)mm = mmx or xmm register, sr = segment register, m = memory, m32 = 32-bit memory operand, etc.

μops fused domain: The number of μops at the decode, rename, allocate and retirement stages in the pipeline. Fused μops count as one.

μops unfused domain: The number of μops for each execution port. Fused μops count as two. Fused macro-ops count as one. The instruction has μop fusion if the sum of the numbers listed under p015 + p2 + p3 + p4 exceeds the number listed under μops fused domain. An x under p0, p1 or p5 means that at least one of the μops listed under p015 can optionally go to this port. For example, a 1 under p015 and an x under p0 and p5 means one μop which can go to either port 0 or port 5, whichever is vacant first. A value listed under p015 but nothing under p0, p1 and p5 means that it is not known which of the three ports these μops go to.

p015: The total number of μops going to port 0, 1 and 5.
p0: The number of μops going to port 0 (execution units).
p1: The number of μops going to port 1 (execution units).
p5: The number of μops going to port 5 (execution units).
p2: The number of μops going to port 2 (memory read).
p3: The number of μops going to port 3 (memory write address).
p4: The number of μops going to port 4 (memory write data).

Domain: Tells which execution unit domain is used: "int" = integer unit (general purpose registers), "ivec" = integer vector unit (SIMD), "fp" = floating point unit (XMM and x87 floating point). An additional "bypass delay" is generated if a register written by a μop in one domain is read by a μop in another domain. The bypass delay is 1 clock cycle between the "int" and "ivec" units, and 2 clock cycles between the "int" and "fp", and between the "ivec" and "fp" units.
The bypass delay is indicated under latency only where it is unavoidable because either the source operand or the destination operand is in an unnatural domain such as a general purpose register (e.g. eax) in the "ivec" domain. For example, the PEXTRW instruction executes in the "int" domain. The source operand is an xmm register and the destination operand is a general purpose register. The latency for this instruction is indicated as 2+1, where 2 is the latency of the instruction itself and 1 is the bypass delay, assuming that the xmm operand is most likely to come from the "ivec" domain. If the xmm operand comes from the "fp" domain then the bypass delay will be 2 rather than one.
The flags register can also have a bypass delay. For example, the COMISS instruction (floating point compare) executes in the "fp" domain and returns the result in the integer flags. Almost all instructions that read these flags execute in the "int" domain. Here the latency is indicated as 1+2, where 1 is the latency of the instruction itself and 2 is the bypass delay from the "fp" domain to the "int" domain.
The bypass delay from the memory read unit to any other unit and from any unit to the memory write unit are included in the latency figures in the table.
Where the domain is not listed, the bypass delays are either unlikely to occur or unavoidable and therefore included in the latency figure.
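
The bypass delay rules described under Domain can be sketched as a small lookup. This is only an illustration: the function names and the dictionary layout are assumptions made for this example, while the delay values (1 cycle between "int" and "ivec", 2 cycles for any crossing that involves "fp") are the ones stated in the explanation.

```python
# Bypass delays in core clock cycles, as stated in the explanation of the
# Domain column. The dictionary layout is an assumption of this sketch.
BYPASS_DELAY = {
    frozenset(("int", "ivec")): 1,
    frozenset(("int", "fp")): 2,
    frozenset(("ivec", "fp")): 2,
}


def bypass_delay(producer_domain, consumer_domain):
    """Extra cycles when a register written in one domain is read in another."""
    if producer_domain == consumer_domain:
        return 0
    return BYPASS_DELAY[frozenset((producer_domain, consumer_domain))]


def effective_latency(instruction_latency, producer_domain, consumer_domain):
    """Latency of the instruction itself plus the domain-crossing delay."""
    return instruction_latency + bypass_delay(producer_domain, consumer_domain)


# PEXTRW executes in the "int" domain with a latency of 2.
pextrw_from_ivec = effective_latency(2, "ivec", "int")  # listed as 2+1
pextrw_from_fp = effective_latency(2, "fp", "int")      # bypass delay is 2 here

# COMISS executes in the "fp" domain; its flags are read in the "int" domain.
comiss_to_int = effective_latency(1, "fp", "int")       # listed as 1+2
```

The sketch only models register-to-register domain crossings. Delays to and from the memory units are already included in the table latencies and are therefore not added here.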

Latency: This is the delay that the instruction generates in a dependency chain. The numbers are minimum values. Cache misses, misalignment, and exceptions may increase the clock counts considerably. Floating point operands are presumed to be normal numbers. Denormal numbers, NaNs and infinity increase the delays very much, except in XMM move, shuffle and Boolean instructions. Floating point overflow, underflow, denormal or NaN results give a similar delay. The time unit used is core clock cycles, not the reference clock cycles given by the time stamp counter.

Reciprocal throughput: The average number of core clock cycles per instruction for a series of independent instructions of the same kind in the same thread.
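
The difference between the two figures can be illustrated with a rough estimate. The sketch below is an assumption-laden simplification: the function names are hypothetical, the example numbers (latency 3, reciprocal throughput 1) are chosen for illustration and are not taken from a particular row, and cache misses, misalignment and port contention with other instructions are ignored.

```python
def chain_cycles(count, latency):
    """Minimum cycles for `count` instructions where each one needs the
    result of the previous one (a dependency chain)."""
    return count * latency


def independent_cycles(count, reciprocal_throughput):
    """Approximate cycles for `count` independent instructions of the same
    kind in the same thread."""
    return count * reciprocal_throughput


# Illustrative figures only: latency 3, reciprocal throughput 1.
dependent = chain_cycles(100, 3)           # limited by latency
independent = independent_cycles(100, 1)   # limited by throughput
```

With these example figures the dependency chain takes about three times as long as the same number of independent instructions, which is why breaking long dependency chains can pay off when the latency exceeds the reciprocal throughput.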

Integer instructions
Instruction

Move instructions
MOV
MOV a)
MOV a)
MOV
MOV
MOV
MOV
MOV
MOVNTI
MOVSX MOVZX
MOVSXD
MOVSX MOVZX
MOVSXD
CMOVcc
CMOVcc
XCHG
XCHG
XLAT
PUSH
PUSH
PUSH
PUSH
PUSHF(D/Q)
PUSHA(D) i)
POP
POP
POP
POP
POPF(D/Q)
POPA(D) i)
LAHF SAHF
SALC i)
LEA a)
BSWAP
BSWAP
LDS LES LFS LGS LSS
PREFETCHNTA

Operands

μops fused domain
μops unfused domain: p015 p0 p1 p5 p2 p3 p4
Domain

r,r/i
r,m
m,r
m,i
r,sr
m,sr
sr,r
sr,m
m,r

1
1
1
1
1
2
6
6
2

r,r

r,m
r,r
r,m
r,r
r,m

1
2
2
3
7
2
1
1
2
2
3
18
1
3
2
7
8
10
1
2
1
1
1
9
1

r
i
m
sr

r
(E/R)SP
m
sr

r,m
r32
r64
m
m

x
1

3
2

x
x

x
x

1
1
3
4

1
1

1
1

x
1

2
2
3
x
1

x
x
x

x
x
x

x
x
x

1
1
1

1
1
2
2

x
x

x
1

x
x

2
7
2
1
2
1
1
1
3

x
x

x
x
1
1
1
x

x
x


1
1
1
5
1
8

6
1

1
1
1
1
1
1
8

1
1
1
1
1
1
8

Latency
Reciprocal throughput

int
int
int
int
int
int
int
int
int

~270

0.33
1
1
1
1
1
13
14
1

int

0.33

1
1

int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int

1
2
3
3

2
20 b)
5
3

1
4
1
1
3

2
1
1
1
1
1
1
8
1
5
1
15
14
8
0.33
1
1
1
1
15
1

PREFETCHT0/1/2
LFENCE
MFENCE
SFENCE
Arithmetic instructions
ADD SUB
ADD SUB
ADD SUB
ADC SBB
ADC SBB
ADC SBB
CMP
CMP
INC DEC NEG NOT
INC DEC NEG NOT
AAA AAS DAA DAS i)
AAD i)
AAM i)
MUL IMUL
MUL IMUL
MUL IMUL
MUL IMUL
IMUL
IMUL
IMUL
IMUL
IMUL
IMUL
MUL IMUL
MUL IMUL
MUL IMUL
MUL IMUL
IMUL
IMUL
IMUL
IMUL
IMUL
IMUL
DIV c)
DIV c)
DIV c)
DIV c)
IDIV c)
IDIV c)
IDIV c)
IDIV c)
CBW CWDE CDQE
CWD CDQ CQO
POPCNT l)
POPCNT l)
CRC32 l)
CRC32 l)

r,r/i
r,m
m,r/i
r,r/i
r,m
m,r/i
r,r/i
m,r/i
r
m

r8
r16
r32
r64
r16,r16
r32,r32
r64,r64
r16,r16,i
r32,r32,i
r64,r64,i
m8
m16
m32
m64
r16,m16
r32,m32
r64,m64
r16,m16,i
r32,m32,i
r64,m64,i
r8
r16
r32
r64
r8
r16
r32
r64

r,r
r,m
r,r
r,m

1
2
3
2

1
1
2
2
2
4
1
1
1
3
1
3
5
1
3
3
3
1
1
1
1
1
1
1
3
3
3
1
1
1
1
1
1
4
6
6
~40
4
8
7
~60
1
1
1
1
1
1

1
1
1
2
2
3
1
1
1
1
1
3
5
1
3
3
3
1
1
1
1
1
1
1
3
3
2
1
1
1
1
1
1
4
6
6
x
4
8
7
x
1
1
1
1
1
1

x
x
x
x
x
x
x
x
x
x

x
x
x
x
x
x
x
x
x
x
1
x
x
1
x
x
x
1
1

x
x
x
x
x
x
x
x
x
x

x
x
x
x
x

1
1
1

1
1

1
1

1
1

x
x
x
x
x

1
1
1
1
x
x
2

1
x
x

x
x

1
1
1
1
1
1
1
x
x
x
1
x
x
x
x
x


2
4
3
x
2
5
3
x
x
1
1
1
1

1
1
1

1
1
1
1
1
1
1
1
1
1

1
x
x
x
1
x
x
x
x
x
1
1

int
int
int
int

int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int

1
9
23
5

1
6
2
2
7
1
1
1
6
3
15
20
3
5
5
3
3
3
3
3
3
3
3
5
5
3
3
3
3

11-21
17-22
17-28
28-90
10-22
18-23
17-28
37-100
1
1
3
3

0.33
1
1
2
2
0.33
1
0.33
1
1
2
7
1
2
2
2
1
1
1
1
1
2
1
2
2
2
1
1
1
1
1
1
7-11
7-12
7-17
19-69
7-11
7-12
7-17
26-86
1
1
1
1
1
1

Logic instructions
AND OR XOR
AND OR XOR
AND OR XOR
TEST
TEST
SHR SHL SAR
SHR SHL SAR
ROR ROL
ROR ROL
RCR RCL
RCR
RCL
RCR RCL
RCR RCL
RCR
RCL
RCR RCL
SHLD
SHLD
SHRD
SHRD
BT
BT
BT
BTR BTS BTC
BTR BTS BTC
BTR BTS BTC
BSF BSR
BSF BSR
SETcc
SETcc
CLC STC CMC
CLD
STD

r,r/i
r,m
m,r/i
r,r/i
m,r/i
r,i/cl
m,i/cl
r,i/cl
m,i/cl
r,1
r8,i/cl
r8,i/cl
r16/32/64,i/cl
m,1
m8,i/cl
m8,i/cl
m16/32/64,i/cl
r,r,i/cl
m,r,i/cl
r,r,i/cl
m,r,i/cl
r,r/i
m,r
m,i
r,r/i
m,r
m,i
r,r
r,m
r
m

Control transfer instructions


JMP
short/near
JMP i)
far
JMP
r
JMP
m(near)
JMP
m(far)
Conditional jump
short/near
Fused compare/test and branch e)
J(E/R)CXZ
short
LOOP
short
LOOP(N)E
short
CALL
near
CALL i)
far
CALL
r
CALL
m(near)
CALL
m(far)

1
1
2
1
1
1
3
1
3
2
9
8
6
4
12
11
10
2
3
2
3
1
9
2
1
10
3
1
2
1
2
1
2
2

1
1
1
1
1
1
2
1
2
2
9
8
6
3
9
8
7
2
2
2
2
1
8
2
1
7
3
1
1
1
1
1
2
2

1
31
1
1
31
1
1
2
6
11
2
46
3
4
47

1
31
1
1
31
1
1
2
6
11
2
46
2
3
47

x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x

x
x
x
x
x

x
x
x
x
x
x
x
x
x
x
x
x

x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x

1
1
x
x
x
x
x

x
x
x
x

1
1

1
1
1
1

1
1
1
1

1
1
1
1

1
1
1
1

1
1

1
1

1
x
x
x
x
x

1
1
1

x
x
x
?

x
x
x
?

1
11

1
1
1
x
x
1

1
1

1
1

9
?
?


?
?

1
1

1
1

int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int

int
int
int
int
int
int
int
int
int
int
int
int
int
int
int

1
6
1
1
6
1
6
2
13
11
12-13
7
16
14
15
3
8
4
9
1

1
6
6
3
3
1
1

0
0
0
0
0

0.33
1
1
0.33
1
0.5
1
1
1
2

12-13

1
1
1
5
1
1

1
1
1
1
0.33
4
5

2
67
2
2
73
2
2
2
4
7
2
74
2
2
79

RETN
RETN
RETF
RETF
BOUND i)
INTO i)
String instructions
LODS
REP LODS
STOS
REP STOS
REP STOS
MOVS
REP MOVS
REP MOVS
SCAS
REP SCAS
CMPS
REP CMPS
Other
NOP (90)
Long NOP (0F 1F)
PAUSE
ENTER
ENTER
LEAVE
CPUID
RDTSC
RDPMC
Notes:
a)
b)
c)
e)
i)
l)

i
i
r,m

small n
large n
small n
large n

a,0
a,b

1
3
39
40
15
4

1
2
39
40
13
4

1
1

int
int
int
int
int
int

2
1
11+4n
3
1
60+n
2.5/16 bytes
5
2
13+6n
2/16 bytes
3
2
37+6n
5
3
65+8n

1
1
5
11
34+7b
3
25-100
22
28

1
1
5
9

1
1

x
x
x
x

x
x
x
x

x
x
x
x

int
int
int
int
int
int
int
int
int
int
int
int

int
int
int
int
int
int
int
int
int

2
2
120
124
7
5

1
40+12n
1
12+n
1 clk / 16 bytes
4
12+n
1 clk / 16 bytes
1
40+2n
4
42+2n

0.33
1
9
8
79+5b
~200

5
~200
24
40-60

a) Applies to all addressing modes.
b) Has an implicit LOCK prefix.
c) Low values are for small results, high values for high results.
e) See manual 3: "The microarchitecture of Intel, AMD and VIA CPUs" for restrictions on macro-op fusion.
i) Not available in 64 bit mode.
l) SSE4.2 instruction set.

Floating point x87 instructions


Instruction

Move instructions
FLD
FLD
FLD
FBLD
FST(P)
FST(P)

Operands

r
m32/64
m80
m80
r
m32/m64

μops fused domain
μops unfused domain: p015 p0 p1 p5 p2 p3 p4
Domain
1
1
4
41
1
1

1
1
2
38
1

1
1
x
1

1
x

1
2
3
1


float
float
float
float
float
float

Latency
Reciprocal throughput
1
3
4
45
1
4

1
1
2
20
1
1

FSTP
FBSTP
FXCH
FILD
FIST(P)
FISTTP g)
FLDZ
FLD1
FLDPI FLDL2E etc.
FCMOVcc
FNSTSW
FNSTSW
FLDCW
FNSTCW
FINCSTP FDECSTP
FFREE(P)
FNSAVE
FRSTOR
Arithmetic instructions
FADD(P) FSUB(R)(P)
FADD(P) FSUB(R)(P)
FMUL(P)
FMUL(P)
FDIV(R)(P)
FDIV(R)(P)
FABS
FCHS
FCOM(P) FUCOM
FCOM(P) FUCOM
FCOMPP FUCOMPP
FCOMI(P) FUCOMI(P)
FIADD FISUB(R)
FIMUL
FIDIV(R)
FICOM(P)
FTST
FXAM
FPREM
FPREM1
FRNDINT
Math
FSCALE
FXTRACT
FSQRT
FSIN
FCOS
FSINCOS
F2XM1
FYL2X FYL2XP1
FPTAN
FPATAN

m80
m80
r
m
m
m

r
AX
m16
m16
m16
r
m
m

r
m
r
m
r
m

r
m
r
m
m
m
m

7
208
1
1
3
3
1
2
2
2
2
3
2
2
1
2
143
79

3
204
0 f)
1
1
1
1
2
2
2
2
2
1
1
1
2
89
52

1
1
1
1
1
1
1
1
1
1
2
1
2
2
2
2
1
1
25
35
17

1
1
1
1
1
1
1
1
1
1
2
1
2
2
2
2
1
1
25
35
17

24
17
1
~100
~100
~100
19
~55
~100
~82

24
17
1
~100
~100
~100

19
~55
~100

~82

x
x

x
x

x
x

1
1
1
1
1

2
2

2
2

1
1

1
1

1
2

2
1

1
1
1
x
x
x

x
x
x

x
x 8 23 23
x 27

1
1

1
1
1
1
1
1

1
1

x
x
x

x
x
1
x
x
x
x
x
x
x


1
1

1
1
1
1
2
1
1
2
1
1
x
x
x

x
x
x

x
x

x
x

x
x
x
x
x
x
x

x
x
x
x
x
x
x

1
1
1
1

float
float
float
float
float
float
float
float
float
float
float
float
float
float
float
float
float
float

5
242
0
6
7
7

2+2

7
5
1
178
156

5
245
1
1
1
1
1
2
2
2
1
2
31
1
1
4
178
156

float
3
1
float
1
float
5
1
float
1
float 7-27 d) 7-27 d)
float 7-27 d) 7-27 d)
float
1
1
float
1
1
float
1
float
1
float
1
float
1
float
3
2
float
5
2
float 7-27 d) 7-27 d)
float
1
float
1
float
1
float
14
float
19
float
22

float
12
float
13
float
~27
float 40-100
float 40-100
float ~110
float
58
float
~80
float ~115
float ~120

Other
FNOP
WAIT
FNCLEX
FNINIT
Notes:
d)
f)
g)

1
1
1
2
2
x
3
3
~190 ~190 x

x
x
x

float
float
float
float

x
x
x

1
1
17
77

d) Round divisors or low precision give low values.
f) Resolved by register renaming. Generates no μops in the unfused domain.
g) SSE3 instruction set.

Integer MMX and XMM instructions


Instruction

Move instructions
MOVD k)
MOVD k)
MOVD k)
MOVD k)
MOVQ
MOVQ
MOVQ
MOVDQA
MOVDQA
MOVDQA
MOVDQU
MOVDQU
LDDQU g)
MOVDQ2Q
MOVQ2DQ
MOVNTQ
MOVNTDQ
MOVNTDQA j)
PACKSSWB/DW
PACKUSWB
PACKSSWB/DW
PACKUSWB
PACKSSWB/DW
PACKUSWB
PACKSSWB/DW
PACKUSWB
PACKUSDW j)
PACKUSDW j)
PUNPCKH/LBW/WD/DQ
PUNPCKH/LBW/WD/DQ
PUNPCKH/LQDQ
PUNPCKH/LQDQ
PMOVSX/ZXBW j)
PMOVSX/ZXBW j)
PMOVSX/ZXBD j)
PMOVSX/ZXBD j)

Operands

μops fused domain
μops unfused domain: p015 p0 p1 p5 p2 p3 p4
Domain

r32/64,(x)mm
m32/64,(x)mm
(x)mm,r32/64
(x)mm,m32/64
(x)mm, (x)mm
(x)mm,m64
m64, (x)mm
xmm, xmm
xmm, m128
m128, xmm
xmm, m128
m128, xmm
xmm, m128
mm, xmm
xmm,mm
m64,mm
m128,xmm
xmm, m128

1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1

mm,mm

mm,m64

xmm,xmm

xmm,m128
xmm,xmm
xmm,m
(x)mm, (x)mm
(x)mm,m
xmm,xmm
xmm, m128
xmm,xmm
xmm,m64
xmm,xmm
xmm,m32

1
1
1
1
1
1
2
1
1
1
1

1
1
1
1
1
1
1
1
1
1
1

x
x
x
x
x
x
x
x
x
x
x

x
x
x
x
x
x
x
x
x
x
x

int
1

1
ivec

1
1

ivec
1
1

ivec
1

1
1
1
1
1

1
1
x
x

x
x

x
x

ivec
ivec
1
1

1
1


ivec

Latency
Reciprocal throughput
1+1
3
1+1
2
1
2
3
1
2
3
2
3
2
1
1
~270
~270
2

0.33
1
0.33
1
0.33
1
1
0.33
1
1
1
1
1
0.33
0.33
2
2
1

2
ivec

1
ivec

ivec

ivec

ivec

ivec

1
1
1
1
1

0.5
2
2
2
0.5
2
0.5
1
1
2
1
2

PMOVSX/ZXBQ j)
PMOVSX/ZXBQ j)
PMOVSX/ZXWD j)
PMOVSX/ZXWD j)
PMOVSX/ZXWQ j)
PMOVSX/ZXWQ j)
PMOVSX/ZXDQ j)
PMOVSX/ZXDQ j)
PSHUFB h)
PSHUFB h)
PSHUFW
PSHUFW
PSHUFD
PSHUFD
PSHUFL/HW
PSHUFL/HW
PALIGNR h)
PALIGNR h)
PBLENDVB j)
PBLENDVB j)
PBLENDW j)
PBLENDW j)
MASKMOVQ
MASKMOVDQU
PMOVMSKB
PEXTRB j)
PEXTRB j)
PEXTRW
PEXTRW j)
PEXTRD j)
PEXTRD j)
PEXTRQ j,m)
PEXTRQ j,m)
PINSRB j)
PINSRB j)
PINSRW
PINSRW
PINSRD j)
PINSRD j)
PINSRQ j,m)
PINSRQ j,m)
Arithmetic instructions
PADD/SUB(U)(S)B/W/D/Q
PADD/SUB(U)(S)B/W/D/Q
PHADD/SUB(S)W/D h)
PHADD/SUB(S)W/D h)
PCMPEQ/GTB/W/D
PCMPEQ/GTB/W/D
PCMPEQQ j)
PCMPEQQ j)

xmm,xmm
xmm,m16
xmm,xmm
xmm,m64
xmm,xmm
xmm,m32
xmm,xmm
xmm,m64
(x)mm, (x)mm
(x)mm,m
mm,mm,i
mm,m64,i
xmm,xmm,i
xmm,m128,i
xmm,xmm,i
xmm, m128,i
(x)mm,(x)mm,i
(x)mm,m,i
x,x,xmm0
xmm,m,xmm0
xmm,xmm,i
xmm,m,i
mm,mm
xmm,xmm
r32,(x)mm
r32,xmm,i
m8,xmm,i
r32,(x)mm,i
m16,(x)mm,i
r32,xmm,i
m32,xmm,i
r64,xmm,i
m64,xmm,i
xmm,r32,i
xmm,m8,i
(x)mm,r32,i
(x)mm,m16,i
xmm,r32,i
xmm,m32,i
xmm,r64,i
xmm,m64,i

1
1
1
1
1
1
1
1
1
2
1
2
1
2
1
2
1
2
2
3
1
2
4
10
1
2
2
2
2
2
2
2
2
1
2
1
2
1
2
1
2

1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
2
2
1
1
1
4
1
2
2
2
2
2
1
2
1
1
1
1
1
1
1
1
1

x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
1
1
x
x
1
x
1
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x

(x)mm, (x)mm

(x)mm,m
(x)mm, (x)mm
(x)mm,m64
(x)mm,(x)mm
(x)mm,m
xmm,xmm
xmm,m128

1
3
4
1
1
1
1

1
3
3
1
1
1
1

x
x
x
x
x
x
x

x
x
x
x
x
x
x


x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
1
1
x
x
x
x
x
x
x

x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x

ivec

ivec

ivec

ivec

ivec

ivec

ivec

ivec

ivec

ivec

ivec

ivec
ivec
float
ivec

2+2
2+1

ivec

2+1

ivec

2+1

ivec

2+1

ivec

1+1

ivec

1+1

ivec

1+1

ivec

1+1

ivec

ivec

ivec

ivec

1
1
1
1
1
1
1
1
1
1
1
1
2

1
2

1
x

1
1
1
1

1
1
1
1

1
2
1
2
1
2
1
2
0.5
1
0.5
1
0.5
1
0.5
1
1
1
1
1
0.5
1
2
7
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1

0.5
2
1.5
3
0.5
2
0.5
2

PCMPGTQ l)
PCMPGTQ l)
PMULL/HW PMULHUW
PMULL/HW PMULHUW
PMULHRSW h)
PMULHRSW h)
PMULLD j)
PMULLD j)
PMULDQ j)
PMULDQ j)
PMULUDQ
PMULUDQ
PMADDWD
PMADDWD
PMADDUBSW h)
PMADDUBSW h)
PAVGB/W
PAVGB/W
PMIN/MAXSB j)
PMIN/MAXSB j)
PMIN/MAXUB
PMIN/MAXUB
PMIN/MAXSW
PMIN/MAXSW
PMIN/MAXUW j)
PMIN/MAXUW j)
PMIN/MAXU/SD j)
PMIN/MAXU/SD j)
PHMINPOSUW j)
PHMINPOSUW j)
PABSB PABSW PABSD h)
PABSB PABSW PABSD h)
PSIGNB PSIGNW PSIGND h)
PSIGNB PSIGNW PSIGND h)
PSADBW
PSADBW
MPSADBW j)
MPSADBW j)
PCLMULQDQ n)
AESDEC, AESDECLAST, AESENC, AESENCLAST n)
AESIMC n)
AESKEYGENASSIST n)
Logic instructions
PAND(N) POR PXOR
PAND(N) POR PXOR

xmm,xmm
xmm,m128
(x)mm,(x)mm
(x)mm,m
(x)mm,(x)mm
(x)mm,m
xmm,xmm
xmm,m128
xmm,xmm
xmm,m128
(x)mm,(x)mm
(x)mm,m
(x)mm,(x)mm
(x)mm,m
(x)mm,(x)mm
(x)mm,m
(x)mm,(x)mm
(x)mm,m
xmm,xmm
xmm,m128
(x)mm,(x)mm
(x)mm,m
(x)mm,(x)mm
(x)mm,m
xmm,xmm
xmm,m
xmm,xmm
xmm,m128
xmm,xmm
xmm,m128

1
1
1
1
1
1
2
3
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1

1
1
1
1
1
1
2
2
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1

1
1
1
1
1
1
2
2
1
1
1
1
1
1
1
1

(x)mm,(x)mm

(x)mm,m

(x)mm,(x)mm

(x)mm,m
(x)mm,(x)mm
(x)mm,m
xmm,xmm,i
xmm,m,i
xmm,xmm,i

1
1
1
3
4

1
1
1
3
3

x
x
x
x
x
x
x
x
x
x
x
x

ivec

ivec

ivec

ivec

ivec

ivec

ivec

ivec

ivec

ivec

ivec

ivec

ivec

ivec

ivec

1
1

1
1
1
1
1
1
1
1
1
1
1

1
ivec

ivec

ivec

1
1

1
1

x
x


x
x

x
x

ivec

0.5

12

~5
~5
~5

~2
~2
~2

0.33
1

0.5

2
1
3
1
2
8

1
x
x

1
1
1
1
1
1
1
1
0.5
1
1
2
0.5
2
0.5
2
1
2
1
2
1
3

xmm,xmm
xmm,xmm
xmm,xmm,i

(x)mm,(x)mm
(x)mm,m

1
1
1
1
1
1
2

1
1

x
x

x
x
x
x
x
x
x
x
x
x
x
x

1
1
x
x

ivec

PTEST j)
PTEST j)
PSLL/RL/RAW/D/Q
PSLL/RL/RAW/D/Q
PSLL/RL/RAW/D/Q
PSLL/RL/RAW/D/Q
PSLL/RL/RAW/D/Q
PSLL/RLDQ

xmm,xmm
xmm,m128
mm,mm/i
mm,m64
xmm,i
xmm,xmm
xmm,m128
xmm,i

2
2
1
1
1
2
3
1

2
2
1
1
1
2
2
1

x
x

String instructions
PCMPESTRI l)
PCMPESTRI l)
PCMPESTRM l)
PCMPESTRM l)
PCMPISTRI l)
PCMPISTRI l)
PCMPISTRM l)
PCMPISTRM l)

xmm,xmm,i
xmm,m128,i
xmm,xmm,i
xmm,m128,i
xmm,xmm,i
xmm,m128,i
xmm,xmm,i
xmm,m128,i

8
9
9
10
3
4
4
6

8
8
9
10
3
4
4
5

x
x
x
x
x
x
x
x

x
x
x
x
x
x
x
x

x
x
x
x
x
x
x
x

11

11

Other
EMMS
Notes:
g)
h)
j)
k)
l)
m)
n)

x
x
x

x
x
1
1
1
1
1

x
x

ivec

ivec

ivec
ivec

1
2

ivec

1
1
1
2
1
2
1
1

ivec
ivec
ivec
ivec
ivec
ivec
ivec
ivec

14
14
7
7
8
8
7
7

5
6
6
6
2
2
2
5

1
1

x
x
x

1
1
1
1

float

g) SSE3 instruction set.
h) Supplementary SSE3 instruction set.
j) SSE4.1 instruction set.
k) MASM uses the name MOVD rather than MOVQ for this instruction even when moving 64 bits.
l) SSE4.2 instruction set.
m) Only available in 64 bit mode.
n) Only available on newer models.

Floating point XMM instructions


Instruction

Move instructions
MOVAPS/D
MOVAPS/D
MOVAPS/D
MOVUPS/D
MOVUPS/D
MOVSS/D
MOVSS/D
MOVSS/D
MOVHPS/D MOVLPS/D
MOVH/LPS/D
MOVLHPS MOVHLPS
MOVMSKPS/D
MOVNTPS/D
SHUFPS/D
SHUFPS/D

Operands

xmm,xmm
xmm,m128
m128,xmm
xmm,m128
m128,xmm
xmm,xmm
xmm,m32/64
m32/64,xmm
xmm,m64
m64,xmm
xmm,xmm
r32,xmm
m128,xmm
xmm,xmm,i
xmm,m128,i

μops fused domain
μops unfused domain: p015 p0 p1 p5 p2 p3 p4
Domain
1
1
1
1
1
1
1
1
2
2
1
1
1
1
2

float
1
1

1
1

1
1

1
1
1
1

1
1
1

1
float
float

1
1

1
1

1
1

1
float
float

Latency
Reciprocal throughput
1
2
3
2
3
1
2
3
3
5
1
1+2
~270
1

1
1
1
1-4
1-3
1
1
1
2
1
1
1
2
1
1

BLENDPS/PD j)
BLENDPS/PD j)
BLENDVPS/PD j)
BLENDVPS/PD j)
MOVDDUP g)
MOVDDUP g)
MOVSH/LDUP g)
MOVSH/LDUP g)
UNPCKH/LPS/D
UNPCKH/LPS/D
EXTRACTPS j)
EXTRACTPS j)
INSERTPS j)
INSERTPS j)

xmm,xmm,i
xmm,m128,i
x,x,xmm0
xmm,m,xmm0
xmm,xmm
xmm,m64
xmm,xmm
xmm,m128
xmm,xmm
xmm,m128
r32,xmm,i
m32,xmm,i
xmm,xmm,i
xmm,m32,i

1
2
2
3
1
1
1
1
1
1
1
2
1
3

1
1
2
2
1

Conversion
CVTPD2PS
CVTPD2PS
CVTSD2SS
CVTSD2SS
CVTPS2PD
CVTPS2PD
CVTSS2SD
CVTSS2SD
CVTDQ2PS
CVTDQ2PS
CVT(T)PS2DQ
CVT(T)PS2DQ
CVTDQ2PD
CVTDQ2PD
CVT(T)PD2DQ
CVT(T)PD2DQ
CVTPI2PS
CVTPI2PS
CVT(T)PS2PI
CVT(T)PS2PI
CVTPI2PD
CVTPI2PD
CVT(T)PD2PI
CVT(T)PD2PI
CVTSI2SS
CVTSI2SS
CVT(T)SS2SI
CVT(T)SS2SI
CVTSI2SD
CVTSI2SD
CVT(T)SD2SI
CVT(T)SD2SI

xmm,xmm
xmm,m128
xmm,xmm
xmm,m64
xmm,xmm
xmm,m64
xmm,xmm
xmm,m32
xmm,xmm
xmm,m128
xmm,xmm
xmm,m128
xmm,xmm
xmm,m64
xmm,xmm
xmm,m128
xmm,mm
xmm,m64
mm,xmm
mm,m128
xmm,mm
xmm,m64
mm,xmm
mm,m128
xmm,r32
xmm,m32
r32,xmm
r32,m32
xmm,r32
xmm,m32
r32,xmm
r32,m64

2
2
2
2
2
2
1
1
1
1
1
1
2
2
2
2
1
1
1
1
2
2
2
2
1
1
1
1
2
2
1
1

2
2
2
2
2
2
1
1
1
1
1
1
2
2
2
2
1
1
1
1
2
2
2
2
1
1
1
1
2
1
1
1

Arithmetic
ADDSS/D SUBSS/D

xmm,xmm

1
1
2
2
1

float
float
float
float
float

1
1
1

float

1
2
1
2
1

1
1
1
1
1
1
2

1
1
1
1
1
2

?
1
1
1
1

x
x


1
1
1
?

1
1
1

1
1
1
?
1
1

1
1
1

1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1

1
1
1
1
1
1

1
1
1
1

1
1
x
x

float
float
float

1
1+2

1
float
float

float
float
float
float
float
float
float
float
float
float
float
float
float
float
float
float
float
float
float
float

4
4
2
1
3+2
3+2
4+2
4+2
3+2
3+2

ivec/float

float/ivec

float
float
float
float
float
float
float
float

3+2

float

1
1
1
1
1
1

3+2
4+2
3+2

1
1
2
2
1
1
1
1
1
1
1
1
1
2

1
1
1
1
1
1
1
2
1
1
1
1
1
1
1
1
3
3
1
1
1
1
1
1
3
3
1
1
3
3
1
1

ADDSS/D SUBSS/D
ADDPS/D SUBPS/D
ADDPS/D SUBPS/D
ADDSUBPS/D g)
ADDSUBPS/D g)
HADDPS HSUBPS g)
HADDPS HSUBPS g)
HADDPD HSUBPD g)
HADDPD HSUBPD g)
MULSS MULPS
MULSS MULPS
MULSD MULPD
MULSD MULPD
DIVSS DIVPS
DIVSS DIVPS
DIVSD DIVPD
DIVSD DIVPD
RCPSS/PS
RCPSS/PS
CMPccSS/D CMPccPS/D

xmm,m32/64
xmm,xmm
xmm,m128
xmm,xmm
xmm,m128
xmm,xmm
xmm,m128
xmm,xmm
xmm,m128
xmm,xmm
xmm,m
xmm,xmm
xmm,m
xmm,xmm
xmm,m
xmm,xmm
xmm,m
xmm,xmm
xmm,m

1
1
1
1
1
3
4
3
4
1
1
1
1
1
1
1
1
1
1

1
1
1
1
1
3
3
3
3
1
1
1
1
1
1
1
1
1
1

1
1
1
1
1
1
1
1
1

xmm,xmm

xmm,m
xmm,xmm
xmm,m32/64
xmm,xmm
xmm,m32/64
xmm,xmm
xmm,m128

2
1
1
1
1
1
1

1
1
1
1
1
1
1

1
1
1
1
1
1
1

xmm,xmm,i

xmm,m128,i
xmm,xmm,i
xmm,m128,i
xmm,xmm,i
xmm,m128,i

2
4
6
3
4

1
4
5
3
3

1
2
x
x
x

xmm,xmm
xmm,m
xmm,xmm
xmm,m
xmm,xmm
xmm,m

1
2
1
2
1
1

1
1
1
1
1
1

xmm,xmm
xmm,m128

1
1

1
1

1
1
1
2
2
2
2

1
1
1
1
1
1
1
1

1
1
1
1
1
1

1
1

float
float
float
float
float
float
float
float
float
float
float
float
float
float
float
float
float
float
float
float

3
3
5
3
4
5
7-14
7-22
3

1
1
1
1
1
2
2
2
2
1
1
1
1
7-14
7-14
7-22
7-22
2
2
1

CMPccSS/D CMPccPS/D
COMISS/D UCOMISS/D
COMISS/D UCOMISS/D
MAXSS/D MINSS/D
MAXSS/D MINSS/D
MAXPS/D MINPS/D
MAXPS/D MINPS/D
ROUNDSS/D ROUNDPS/D j)
ROUNDSS/D ROUNDPS/D j)
DPPS j)
DPPS j)
DPPD j)
DPPD j)
Math
SQRTSS/PS
SQRTSS/PS
SQRTSD/PD
SQRTSD/PD
RSQRTSS/PS
RSQRTSS/PS

1
x
x
x

1
1
1
1

float
float
float
float
float
float
float
float

1
1
x
x
x

1
1
1
1

3
3

11

1
2

1
3

7-18

float
float
float
float
float
float

7-18
7-18
7-32
7-32
2
2

float
float

1
1
1
1

float
float
float
float
float

1+2

1
1
1
1
1
1
1

7-32
3

Logic
AND/ANDN/OR/XORPS/D
AND/ANDN/OR/XORPS/D

Other

1
1

1
1

LDMXCSR
STMXCSR
FXSAVE
FXRSTOR
Notes:
g)

m32
m32
m4096
m4096

6
2
141
112

6
1
141
90

x
x

x
x

g) SSE3 instruction set.


x 1
1
1 1
x 5 38 38
x 42

90

5
1
90
100

Sandy Bridge

Intel Sandy Bridge


List of instruction timings and μop breakdown
Explanation of column headings:
Operands: i = immediate data, r = register, mm = 64 bit mmx register, x = 128 bit xmm register, (x)mm = mmx or xmm register, y = 256 bit ymm register, same = same register for both operands. m = memory operand, m32 = 32-bit memory operand, etc.

μops fused domain: The number of μops at the decode, rename, allocate and retirement stages in the pipeline. Fused μops count as one.

μops unfused domain: The number of μops for each execution port. Fused μops count as two. Fused macro-ops count as one. The instruction has μop fusion if the sum of the numbers listed under p015 + p23 + p4 exceeds the number listed under μops fused domain. A number indicated as 1+ under a read or write port means a 256-bit read or write operation using two clock cycles for handling 128 bits each cycle. The port cannot receive another read or write μop in the second clock cycle, but a read port can receive an address-calculation μop in the second clock cycle. An x under p0, p1 or p5 means that at least one of the μops listed under p015 can optionally go to this port. For example, a 1 under p015 and an x under p0 and p5 means one μop which can go to either port 0 or port 5, whichever is vacant first. A value listed under p015 but nothing under p0, p1 and p5 means that it is not known which of the three ports these μops go to.

p015: The total number of μops going to port 0, 1 and 5.
p0: The number of μops going to port 0 (execution units).
p1: The number of μops going to port 1 (execution units).
p5: The number of μops going to port 5 (execution units).
p23: The number of μops going to port 2 or 3 (memory read or address calculation).
p4: The number of μops going to port 4 (memory write data).

Latency: This is the delay that the instruction generates in a dependency chain. The numbers are minimum values. Cache misses, misalignment, and exceptions may increase the clock counts considerably. Where hyperthreading is enabled, the use of the same execution units in the other thread leads to inferior performance. Denormal numbers, NaNs and infinity do not increase the latency. The time unit used is core clock cycles, not the reference clock cycles given by the time stamp counter.

Reciprocal throughput: The average number of core clock cycles per instruction for a series of independent instructions of the same kind in the same thread.
The latencies and throughputs listed below for addition and multiplication using
full size YMM registers are obtained only after a warm-up period of a thousand
instructions or more. The latencies may be one or two clock cycles longer and
the reciprocal throughputs double the values for shorter sequences of code.
There is no warm-up effect when vectors are 128 bits wide or less.
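
The μop fusion rule and the 1+ notation from the column explanations can be checked mechanically. The following is a minimal sketch, assuming table cells are available as strings; the function names and the example values are hypothetical and are not taken from specific rows of the tables.

```python
def parse_port_count(cell):
    """Return (count, is_256_bit) for a port cell such as "1", "1+" or "".

    A trailing "+" marks a 256-bit read or write that occupies the port
    for two clock cycles. An empty cell counts as zero.
    """
    text = cell.strip()
    if not text:
        return 0, False
    is_256_bit = text.endswith("+")
    return int(text.rstrip("+")), is_256_bit


def has_uop_fusion(fused_uops, p015, p23, p4):
    """True if p015 + p23 + p4 exceeds the fused-domain count."""
    return p015 + p23 + p4 > fused_uops


# Hypothetical row: 1 fused-domain op, 1 op on p015, 1 op on p23, none on p4.
read_modify = has_uop_fusion(1, 1, 1, 0)     # True: two unfused ops fused into one
register_only = has_uop_fusion(1, 1, 0, 0)   # False: nothing to fuse

# Hypothetical 256-bit read cell.
count, wide = parse_port_count("1+")
```

Cells containing "x" or ranges are deliberately not handled, since the explanation only defines them for the p0, p1 and p5 columns, which this check does not use.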

Integer instructions
Instruction

Move instructions
MOV

Operands

r,r/i

μops fused domain   μops unfused domain (p015 p0 p1 p5 p23 p4)   Latency   Reciprocal throughput   Comments
1

MOV

r,m

MOV
MOV
MOVNTI
MOVSX MOVZX
MOVSXD
MOVSX MOVZX
MOVSXD
CMOVcc
CMOVcc
XCHG
XCHG

m,r
m,i
m,r
r,r

1
1
2
1

1
1
1

r,m

r,r
r,m
r,r
r,m

2
2
3
8

2
2
3
x

3
1
1
2
3
16
1
1
2
9
18
1
3
1
1

XLAT
PUSH
PUSH
PUSH
PUSHF(D/Q)
PUSHA(D)
POP
POP
POP
POPF(D/Q)
POPA(D)
LAHF SAHF
SALC
LEA
LEA

BSWAP
BSWAP
PREFETCHNTA
PREFETCHT0/1/2
LFENCE
MFENCE
SFENCE
Arithmetic instructions
ADD SUB
ADD SUB
ADD SUB
SUB
ADC SBB
ADC SBB
ADC SBB
CMP
CMP
INC DEC NEG NOT

r
i
m

r
(E/R)SP
m

r,m
r,m

r32
r64
m
m

r,r/i
r,m
m,r/i
r,same
r,r/i
r,m
m,r/i
r,r/i
m,r/i
r

1
1
1

0.5

1
1
1

~350
1

2
0

x
x
x

x
x
x

x
x
x

0
8
10
1
3
1
1

1
2
1
1
2
3
2

1
2

1
1
2
1
2
2
4
1
1
1

1
1
1
0
2
2
3
1
1
1

1
1
1
2
1
8
1
1
2
1
8

1
1
1
1
8

2
25
7
3

2
1

1
1
1
3

1
2

1
2

1
1
1
1
1

x
x
x

x
x
x

x
x
x

x
x
x
x
x
x

x
x
x
x
x
x

x
x
x
x
x
x


0.5

1
2

1
1
1

1
1
1
implicit
lock
1
1
1
1
1
8
0.5
0.5
1
18
9
1
1
0.5
1

1
1
0.5
0.5
4
33
6

1
1
2

1
2

all addressing
modes

6
0
2
2
7
1
1
1

0.5
1
0.25
1
1
1.5
0.5

not 64 bit

not 64 bit
not 64 bit
simple
complex
or rip relative

INC DEC NEG NOT
AAA AAS
DAA DAS
AAD
AAM
MUL IMUL
MUL IMUL
MUL IMUL
MUL IMUL
IMUL
IMUL
IMUL
IMUL
MUL IMUL
MUL IMUL
MUL IMUL
MUL IMUL
IMUL
IMUL
IMUL
IMUL
DIV
DIV
DIV
DIV
IDIV
IDIV
IDIV
IDIV
CBW
CWDE
CDQE
CWD
CDQ
CQO
POPCNT
POPCNT
CRC32
CRC32
Logic instructions
AND OR XOR
AND OR XOR
AND OR XOR
XOR
TEST
TEST
SHR SHL SAR
SHR SHL SAR

3
2

1
2
3
3
8
1
4
3
2
1
2
1
1
1
3
2
1
1
2
1
1
10
11
10
x
10
10
9
x

r,r
r,m
r,r
r,m

3
3
8
1
4
3
2
1
2
1
1
1
4
3
2
1
2
1
1
10
11
10
34-56
10
10
9
59-138
1
1
1
2
1
1
1
1
1
1

r,r/i
r,m
m,r/i
r,same
r,r/i
m,r/i
r,i
m,i

1
1
2
1
1
1
1
3

1
1
1
0
1
1
1
1

r8
r16
r32
r64
r,r
r16,r16,i
r32,r32,i
r64,r64,i
m8
m16
m32
m64
r,m
r16,m16,i
r32,m32,i
r64,m64,i
r8
r16
r32
r64
r8
r16
r32
r64

4
2
20
3
4
4
3
3
4
3
3
3

1
1
1
1

1
1
1
1
1
1
1
1

1
1
1

6
4

20-24
21-25
20-28
30-94
21-24
21-25
20-27
40-103

1
1
1
2
1
1
1
1
1
1

1
1
1
1
1
1
3

1
1
1
1

1
3
1

x
x
x

x
x
x

x
x
x

x
x
x

x
x

x
x
x


2
not 64 bit

11
1
2
2
1
1
1
1
1
1
2
2
2
1
1
1
1
11-14
11-14
11-18
22-76
11-14
11-14
11-18
25-84
0.5
1
0.5
1
1
0.5
1
1
1
1

1
1
2

6
0
1

1
2

1
1

0.5
1
0.25
0.5
0.5
2

not 64 bit
not 64 bit
not 64 bit

SSE4.2
SSE4.2
SSE4.2
SSE4.2

SHR SHL SAR
SHR SHL SAR
ROR ROL
ROR ROL
ROR ROL
ROR ROL
RCR
RCR
RCR
RCR
RCR
RCR
RCL
RCL
RCL
RCL
RCL
SHRD SHLD
SHRD SHLD
SHRD SHLD
SHRD SHLD
BT
BT
BT
BTR BTS BTC
BTR BTS BTC
BTR BTS BTC
BSF BSR
BSF BSR
SETcc
SETcc
CLC
STC CMC
CLD STD

r,cl
m,cl
r,i
m,i
r,cl
m,cl
r8,1
r16/32/64,1
r,i
m,i
r,cl
m,cl
r,1
r,i
m,i
r,cl
m,cl
r,r,i
m,r,i
r,r,cl
m,r,cl
r,r/i
m,r
m,i
r,r/i
m,r
m,i
r,r
r,m
r
m

Control transfer instructions


JMP
short/near
JMP
r
JMP
m
Conditional jump
short/near
Fused arithmetic and
branch
J(E/R)CXZ
LOOP
LOOP(N)E
CALL
CALL
CALL
RET
RET

short
short
short
near
r
m
i

3
5
1
4
3
5
high
3
8
11
8
11
3
8
11
8
11
1
3
4
5
1
10
2
1
11
3
1
1
1
2
1
1
3

3
3
1
3
3
3

1
1
1
1

1
1
1
1

1
1
1
1

2
7
11
3
2
3
2
3

2
7
11
2
1
2
2
2

1
2

2
high
2
5

3
8
7
8
7
3
8
7
8
7
1
4
3
1
8
1
1
7
1
1
1
1
1
0
1
3

5
2
6
x

2
1
x
1
1
x
2

x
1
3

1
x
x
x

1
x
x

1
1

2
4
1
2
2
4
high
2
5
6
5
6
2
6
6
6
6
0.5
2
2
4
0.5
5
0.5
0.5
5
2
1
1
0.5
1
0.25

1
4


1
1
1
1
1

1
1
2
1
1

1
1
1

0
0
0
0

2
2
2
1-2

1-2
2-4
5
5
2
2
2
2
2

fast if not
jumping

BOUND
INTO

r,m

15
4

13
4

String instructions
LODS
REP LODS
STOS
REP STOS

3
5n+12
3
2n

REP STOS

1.5/16B

MOVS
REP MOVS

5
2n

1.5 n

REP MOVS

3/16B

1/16B

SCAS
REP SCAS
CMPS
REP CMPS

3
6n+47
5
8n+80

Other
NOP (90)
Long NOP (0F 1F)

PAUSE
ENTER
ENTER
LEAVE
CPUID
RDTSC
RDPMC

not 64 bit
not 64 bit

1
~2n

1
n

worst
case
best case

1/16B
4

1
1

a,0
a,b

7
6

worst
case
best case
1

2n+45
4
2n+80

0
0

0.25
0.25

7
7
12
10
49+6b
3
3
31-75
21
35

decode
only 1
per clk

11
8

1
84+3b

7
100-250
28
42

Floating point x87 instructions


Instruction

Move instructions
FLD
FLD
FLD
FBLD
FST(P)
FST(P)
FSTP
FBSTP
FXCH
FILD
FIST(P)

Operands

r
m32/64
m80
m80
r
m32/m64
m80
m80
r
m
m

μops fused domain   μops unfused domain (p015 p0 p1 p5 p23 p4)   Latency   Reciprocal throughput   Comments
1
1
4
43
1
1
7
246
1
1
3

1
1
2
40
1

1
1

1
2

1
3
4
45
1
4
5

0
6
7

1
2
3

1
1
2

3
0
1
1


1
1

1
1

1
1
2
21
1
1
5
252
0.5
1
2

FISTTP
FLDZ
FLD1
FLDPI FLDL2E etc.
FCMOVcc
FNSTSW
FNSTSW
FLDCW
FNSTCW
FINCSTP FDECSTP
FFREE(P)
FNSAVE
FRSTOR
Arithmetic instructions
FADD(P) FSUB(R)(P)
FADD(P) FSUB(R)(P)
FMUL(P)
FMUL(P)
FDIV(R)(P)
FDIV(R)(P)
FABS
FCHS
FCOM(P) FUCOM
FCOM(P) FUCOM
FCOMPP FUCOMPP
FCOMI(P) FUCOMI(P)
FIADD FISUB(R)
FIMUL
FIDIV(R)
FICOM(P)
FTST
FXAM
FPREM
FPREM1
FRNDINT
Math
FSCALE
FXTRACT
FSQRT
FSIN
FCOS
FSINCOS
F2XM1
FYL2X
FYL2XP1
FPTAN
FPATAN
Other
FNOP
WAIT

r
AX
m16
m16
m16
r
m
m

r
m
r
m
r
m

r
m
r
m
m
m
m

3
1
2
2
3
2
2
3
2
1
1
143
90

1
1
2
2
3
2
1
2
1
1
1

1
2
1
1
1
1
1
1
1
1
2
3
2
2
2
2
1
2
28
41-87
17

1
2
1
1
1
1
1
1
1
1
2
3
2
2
2
2
1
2
28

1
1
1

1
2
3
2
1
1
1

1
1

1
1

1
1

8
5
1

3
1

1
1
1
1
1
1

1
?

5
1
10-24
1

1
1
1
?
1
1
2
1
1

1
1
3
1
?

4
1
1
1
1

17

21
26-50
22

27
27
17
17
1
1
64-100 x
20-110 x
20-110 x
53-118 x

12
10
10-24
47-100
47-115
43-123
61-69

102 102
28-91 x

1
2

1
2

2
2
2
2
2
1
1
1
1
1
166
165

1
1
1
1
10-24
10-24
1
1
1
1
1
1
1
1
2
1
2
21
26-50

130
93-146


1
1

SSE3

FNCLEX
FNINIT

5
26

5
26

22
81

Integer MMX and XMM instructions


Instruction

Move instructions
MOVD
MOVD
MOVD
MOVD
MOVQ
MOVQ
MOVQ
MOVDQA
MOVDQA
MOVDQA
MOVDQU
MOVDQU
LDDQU
MOVDQ2Q
MOVQ2DQ
MOVNTQ
MOVNTDQ
MOVNTDQA
PACKSSWB/DW
PACKUSWB
PACKSSWB/DW
PACKUSWB
PACKSSWB/DW
PACKUSWB
PACKSSWB/DW
PACKUSWB
PACKUSDW
PACKUSDW
PUNPCKH/LBW/WD/DQ
PUNPCKH/LBW/WD/DQ
PUNPCKH/LQDQ
PUNPCKH/LQDQ
PMOVSX/ZXBW
PMOVSX/ZXBW
PMOVSX/ZXBD
PMOVSX/ZXBD
PMOVSX/ZXBQ
PMOVSX/ZXBQ
PMOVSX/ZXWD
PMOVSX/ZXWD
PMOVSX/ZXWQ
PMOVSX/ZXWQ
PMOVSX/ZXDQ

Operands

μops fused domain   μops unfused domain (p015 p0 p1 p5 p23 p4)   Latency   Reciprocal throughput   Comments

r32/64,(x)mm
m32/64,(x)mm
(x)mm,r32/64
(x)mm,m32/64
(x)mm,(x)mm
(x)mm,m64
m64, (x)mm
x,x
x, m128
m128, x
x, m128
m128, x
x, m128
mm, x
x,mm
m64,mm
m128,x
x, m128

1
1
1
1
1
1
1
1
1
1
1
1
1
2
1
1
1
1

mm,mm

mm,m64

x,x

x,m128
x,x
x,m
(x)mm,(x)mm
(x)mm,m
x,x
x, m128
x,x
x,m64
x,x
x,m32
x,x
x,m16
x,x
x,m64
x,x
x,m32
x,x

1
1
1
1
1
1
2
1
1
1
1
1
1
1
1
1
1
1

1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1

x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x

x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x

1
1

1
1
1
1

x
1
1
1
1
1

1
1
1
2
1

1
1
1


1
1

1
1

1
3
1
3
1
3
3
1
3
3
3
3
3
1
1
~300
~300

1
0.5
0.5
1
0.5
1
0.5
1
0.5
1

SSE3

1
0.5

0.5

SSE4.1

1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1

0.5
0.5
0.5
0.5
0.5
0.5
0.5
0.5
0.5
0.5
0.5
0.5
0.5
0.5
0.5
0.5
0.5
0.5

SSE4.1
SSE4.1

SSE4.1
SSE4.1
SSE4.1
SSE4.1
SSE4.1
SSE4.1
SSE4.1
SSE4.1
SSE4.1
SSE4.1
SSE4.1

PMOVSX/ZXDQ
PSHUFB
PSHUFB
PSHUFW
PSHUFW
PSHUFD
PSHUFD
PSHUFL/HW
PSHUFL/HW
PALIGNR
PALIGNR
PBLENDVB
PBLENDVB
PBLENDW
PBLENDW
MASKMOVQ
MASKMOVDQU
PMOVMSKB
PEXTRB
PEXTRB
PEXTRW
PEXTRW
PEXTRD
PEXTRD
PEXTRQ
PEXTRQ
PINSRB
PINSRB
PINSRW
PINSRW
PINSRD
PINSRD
PINSRQ
PINSRQ

x,m64
(x)mm,(x)mm
(x)mm,m
mm,mm,i
mm,m64,i
xmm,x,i
x,m128,i
x,x,i
x, m128,i
(x)mm,(x)mm,i
(x)mm,m,i
x,x,xmm0
x,m,xmm0
x,x,i
x,m,i
mm,mm
x,x
r32,(x)mm
r32,x,i
m8,x,i
r32,(x)mm,i
m16,(x)mm,i
r32,x,i
m32,x,i
r64,x,i
m64,x,i
x,r32,i
x,m8,i
(x)mm,r32,i
(x)mm,m16,i
x,r32,i
x,m32,i
x,r64,i
x,m64,i

1
1
2
1
2
1
2
1
2
1
2
2
3
1
2
4
10
1
2
2
2
2
2
3
2
3
2
2
2
2
2
2
2
2

1
1
1
1
1
1
1
1
1
1
1
2
2
1
1
1
4
1
2
1
2
1
2
2
2
2
2
1
2
1
2
1
2
1

x
x
x
x
x
x
x
x
x
x
x
1
1
x
x
1

(x)mm, (x)mm
(x)mm,m
(x)mm, (x)mm
(x)mm,m64
(x)mm,(x)mm
(x)mm,m
x,x
x,m128
x,x
x,m128
x,same
x,same
(x)mm,(x)mm
(x)mm,m
(x)mm,(x)mm
(x)mm,m
x,x

1
1
3
4
1
1
1
1
1
1
1
1
1
1
1
1
1

1
1
3
3
1
1
1
1
1
1
0
1
1
1
1
1
1

x
x
x
x
x
x
x
x
1
1

1
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x

x
x
x
x
x
x
x
x
x
x
x
1
1
x
x

x
x

x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x

1
1
1
1
1
1
1
1
1
1
1
2
1
1
1
2
4

1
x
2
2

2
2
1

1
2

1
2

1
2
1
2
1
2
1

0.5
0.5
0.5
0.5
0.5
0.5
0.5
0.5
0.5
0.5
0.5
1
1
0.5
0.5
1
6
1
1
1
1
2
1
1
1
1
1
0.5
1
0.5
1
0.5
1
0.5

SSE4.1
SSSE3
SSSE3

SSSE3
SSSE3
SSE4.1
SSE4.1
SSE4.1
SSE4.1

SSE4.1
SSE4.1
SSE4.1
SSE4.1
SSE4.1
SSE4.1,
64b
SSE4.1
SSE4.1

SSE4.1
SSE4.1
SSE4.1,
64 b

Arithmetic instructions
PADD/SUB(U,S)B/W/D/Q
PADD/SUB(U,S)B/W/D/Q

PHADD/SUB(S)W/D
PHADD/SUB(S)W/D
PCMPEQ/GTB/W/D
PCMPEQ/GTB/W/D
PCMPEQQ
PCMPEQQ
PCMPGTQ
PCMPGTQ
PSUBxx, PCMPGTx
PCMPEQx
PMULL/HW PMULHUW
PMULL/HW PMULHUW
PMULHRSW
PMULHRSW
PMULLD

x
x
x
x
x
x
x
x


2
1
1
1
1
1
5
1

1
1
1
1
1

1
1

0
0
5
1
5
1
5

0.5
0.5
1.5
1.5
0.5
0.5
0.5
0.5
1
1
0.25
0.5
1
1
1
1
1

SSSE3
SSSE3

SSE4.1
SSE4.1
SSE4.2
SSE4.2

SSSE3
SSSE3
SSE4.1

PMULLD
PMULDQ
PMULDQ
PMULUDQ
PMULUDQ
PMADDWD
PMADDWD
PMADDUBSW
PMADDUBSW
PAVGB/W
PAVGB/W
PMIN/MAXSB
PMIN/MAXSB
PMIN/MAXUB
PMIN/MAXUB
PMIN/MAXSW
PMIN/MAXSW
PMIN/MAXUW
PMIN/MAXUW
PMIN/MAXU/SD
PMIN/MAXU/SD
PHMINPOSUW
PHMINPOSUW
PABSB/W/D
PABSB/W/D
PSIGNB/W/D
PSIGNB/W/D
PSADBW
PSADBW
MPSADBW
MPSADBW

x,m128
x,x
x,m128
(x)mm,(x)mm
(x)mm,m
(x)mm,(x)mm
(x)mm,m
(x)mm,(x)mm
(x)mm,m
(x)mm,(x)mm
(x)mm,m
x,x
x,m128
(x)mm,(x)mm
(x)mm,m
(x)mm,(x)mm
(x)mm,m
x,x
x,m
x,x
x,m128
x,x
x,m128
(x)mm,(x)mm
(x)mm,m
(x)mm,(x)mm
(x)mm,m
(x)mm,(x)mm
(x)mm,m
x,x,i
x,m,i

2
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
3
4

1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
3
3

1
1
1
1
1
1
1
1
1
x
x
x
x
x
x
x
x
x
x
x
x
1
1
x
x
x
x
1
1
1
1

1
1

1
1

Logic instructions
PAND(N) POR PXOR
PAND(N) POR PXOR
PXOR
PTEST
PTEST
PSLL/RL/RAW/D/Q
PSLL/RL/RAW/D/Q
PSLL/RL/RAW/D/Q
PSLL/RL/RAW/D/Q
PSLL/RL/RAW/D/Q
PSLL/RLDQ

(x)mm,(x)mm
(x)mm,m
x,same
x,x
x,m128
mm,mm/i
mm,m64
xmm,i
x,x
x,m128
x,i

1
1
1
1
1
1
1
1
2
3
1

1
1
0
1
1
1
1
1
2
2
1

x
x

x
x

x
x

x,x,i
x,m128,i
x,x,i
x,m128,i
x,x,i
x,m128,i
x,x,i

8
8
8
8
3
4
3

8
7
8
7
3
3
3

String instructions
PCMPESTRI
PCMPESTRI
PCMPESTRM
PCMPESTRM
PCMPISTRI
PCMPISTRI
PCMPISTRM


1
5
1
5
1
5
1
5
1
x
x
x
x
x
x
x
x
x
x
x
x

1
1
1
1
1
1
1
1
1
1
1
1
5
1

x
x
x
x

1
1
1
1
5
1
6
1

SSE4.1
SSE4.1
SSE4.1

SSSE3
SSSE3

SSE4.1
SSE4.1

SSE4.1
SSE4.1
SSE4.1
SSE4.1
SSE4.1
SSE4.1
SSSE3
SSSE3
SSSE3
SSSE3

SSE4.1
SSE4.1

1
1
0
1
1

1
1
1

1
1
1
1
1
1
1
1
1
0.5
0.5
0.5
0.5
0.5
0.5
0.5
0.5
0.5
0.5
0.5
0.5
1
1
0.5
0.5
0.5
0.5
1
1
1
1

1
1
1
2
1
1

4
1
11-12
1
3
1
11

0.5
0.25
1
1
1
2
1
1
1
1

4
4
4
4
3
3
3

SSE4.1
SSE4.1

SSE4.2
SSE4.2
SSE4.2
SSE4.2
SSE4.2
SSE4.2
SSE4.2

PCMPISTRM
Encryption instructions
PCLMULQDQ
AESDEC, AESDECLAST,
AESENC, AESENCLAST
AESIMC
AESKEYGENASSIST

x,m128,i

x,x,i

18

18

x,x
x,x
x,x,i

2
2
11

2
2
11

31

31

Other
EMMS

SSE4.2

14

CLMUL

4
2
8

AES
AES
AES

18

Floating point XMM and YMM instructions


Instruction

Move instructions
MOVAPS/D
VMOVAPS/D
MOVAPS/D MOVUPS/D
VMOVAPS/D
VMOVUPS/D
MOVAPS/D MOVUPS/D
VMOVAPS/D
VMOVUPS/D
MOVSS/D
MOVSS/D
MOVSS/D
MOVHPS/D MOVLPS/D
MOVH/LPS/D
MOVLHPS MOVHLPS
MOVMSKPS/D
VMOVMSKPS/D
MOVNTPS/D
VMOVNTPS/D
SHUFPS/D
SHUFPS/D
VSHUFPS/D
VSHUFPS/D
VPERMILPS/PD
VPERMILPS/PD
VPERMILPS/PD
VPERMILPS/PD
VPERMILPS/PD
VPERMILPS/PD
VPERM2F128
VPERM2F128
BLENDPS/PD
BLENDPS/PD
VBLENDPS/PD

Operands

μops fused domain   μops unfused domain (p015 p0 p1 p5 p23 p4)   Latency   Reciprocal throughput   Comments

x,x
y,y
x,m128

1
1
1

y,m256
m128,x

1
1

m256,y
x,x
x,m32/64
m32/64,x
x,m64
m64,x
x,x
r32,x
r32,y
m128,x
m256,y
x,x,i
x,m128,i
y,y,y,i
y, y,m256,i
x,x,x/i
y,y,y/i
x,x,m
y,y,m
x,m,i
y,m,i
y,y,y,i
y,y,m,i
x,x,i
x,m128,i
y,y,i

1
1
1
1
1
1
1
1
1
1
1
1
2
1
2
1
1
2
2
2
2
1
2
1
2
1

1
1

1
1

1
1
1
1
1

1
1
3

1
1
0.5

1+
1

1
1

AVX

4
3

1+

3
1
3
3
3
3
1
2
2
~300
~300
1

1
1
0.5
1
1
1
1
1
1
1
25
1
1
1
1
1
1
1
1
1
1
1
1
0.5
0.5
1

AVX

1
1
1

1
1
1
1

1
1
1
1

1
1
1
1
1
1
1
1
1
1
1
1
1
1
1

x
x
x


1
1
1
1
1
1
1
1
1
1
1
1
x
x
x

1
4

1
1
1+
1
1
1
1+
1
1+
2
1+
1
1
1

AVX

AVX

AVX
AVX
AVX
AVX
AVX
AVX
AVX
AVX
AVX
AVX
SSE4.1
SSE4.1
AVX

VBLENDPS/PD
BLENDVPS/PD
BLENDVPS/PD
VBLENDVPS/PD
VBLENDVPS/PD
MOVDDUP
MOVDDUP
VMOVDDUP
VMOVDDUP
VBROADCASTSS
VBROADCASTSS
VBROADCASTSD
VBROADCASTF128
MOVSH/LDUP
MOVSH/LDUP
VMOVSH/LDUP
VMOVSH/LDUP
UNPCKH/LPS/D
UNPCKH/LPS/D
VUNPCKH/LPS/D
VUNPCKH/LPS/D
EXTRACTPS
EXTRACTPS
VEXTRACTF128
VEXTRACTF128
INSERTPS
INSERTPS
VINSERTF128
VINSERTF128
VMASKMOVPS/D
VMASKMOVPS/D
VMASKMOVPS/D
VMASKMOVPS/D
Conversion
CVTPD2PS
CVTPD2PS
VCVTPD2PS
VCVTPD2PS
CVTSD2SS
CVTSD2SS
CVTPS2PD
CVTPS2PD
VCVTPS2PD
VCVTPS2PD
CVTSS2SD
CVTSS2SD
CVTDQ2PS
CVTDQ2PS
VCVTDQ2PS
VCVTDQ2PS
CVT(T) PS2DQ
CVT(T) PS2DQ

y,m256,i
x,x,xmm0
x,m,xmm0
y,y,y,y
y,y,m,y
x,x
x,m64
y,y
y,m256
x,m32
y,m32
y,m64
y,m128
x,x
x,m128
y,y
y,m256
x,x
x,m128
y,y,y
y,y,m256
r32,x,i
m32,x,i
x,y,i
m128,y,i
x,x,i
x,m32,i
y,y,x,i
y,y,m128,i
x,x,m128
y,y,m256
m128,x,x
m256,y,y

2
2
3
2
3
1
1
1
1
1
2
2
2
1
1
1
1
1
1
1
1
2
3
1
2
1
2
1
2
3
3
4
4

1
2
2
2
2
1

x,x
x,m128
x,y
x,m256
x,x
x,m64
x,x
x,m64
y,x
y,m128
x,x
x,m32
x,x
x,m128
y,y
y,m256
x,x
x,m128

2
2
2
2
2
2
2
2
2
3
2
2
1
1
1
1
1
1

2
2
2
2
2
2
2
2
2
3
2
1
1
1
1
1
1
1

x
x
x
x
x

x
x
x
x
x
1

1+
2
1
2
1+
1
3
1
3

1
1

1
1
1
1

1
1
1
1

1+
1
1
1
1

1
3
1
4
1

1
1

1
1
1
1
2
2
1
1
1
1
1
1
2
2
2
2

1
1
1
1
1
1
1

1+

?
?

?
?

1
1
1
1

?
?

1
1
1
?
1
1


1
1
1
1
1
1

1
?
1
?
1
1
1
1
1
1
1

1
1
1+
2
1

2
1
1
2
1
1
1+
1 1
1 1+

4
1
4
1+
3
1
3
1
4
1
3
1

1
1
1
1
1
1

3
1
3
1+
3
1

1
1
1
1
1
1
0.5
1
1
1
1
1
1
1
0.5
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
2

1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1

AVX
SSE4.1
SSE4.1
AVX
AVX
SSE3
SSE3
AVX
AVX
AVX
AVX
AVX
AVX
SSE3
SSE3
AVX
AVX
SSE3
SSE3
AVX
AVX
SSE4.1
SSE4.1
AVX
AVX
SSE4.1
SSE4.1
AVX
AVX
AVX
AVX
AVX
AVX

AVX
AVX

AVX
AVX

AVX
AVX

VCVT(T) PS2DQ
VCVT(T) PS2DQ
CVTDQ2PD
CVTDQ2PD
VCVTDQ2PD
VCVTDQ2PD
CVT(T)PD2DQ
CVT(T)PD2DQ
VCVT(T)PD2DQ
VCVT(T)PD2DQ
CVTPI2PS
CVTPI2PS
CVT(T)PS2PI
CVT(T)PS2PI
CVTPI2PD
CVTPI2PD
CVT(T) PD2PI
CVT(T) PD2PI
CVTSI2SS
CVTSI2SS
CVT(T)SS2SI
CVT(T)SS2SI
CVTSI2SD
CVTSI2SD
CVT(T)SD2SI
CVT(T)SD2SI
Arithmetic
ADDSS/D SUBSS/D
ADDSS/D SUBSS/D
ADDPS/D SUBPS/D
ADDPS/D SUBPS/D
VADDPS/D VSUBPS/D
VADDPS/D VSUBPS/D
ADDSUBPS/D
ADDSUBPS/D
VADDSUBPS/D
VADDSUBPS/D
HADDPS/D HSUBPS/D
HADDPS/D HSUBPS/D
VHADDPS/D
VHSUBPS/D
VHADDPS/D
VHSUBPS/D
MULSS MULPS
MULSS MULPS
VMULPS
VMULPS
MULSD MULPD
MULSD MULPD
VMULPD
VMULPD
DIVSS DIVPS

y,y
y,m256
x,x
x,m64
y,x
y,m128
x,x
x,m128
x,y
x,m256
x,mm
x,m64
mm,x
mm,m128
x,mm
x,m64
mm,x
mm,m128
x,r32
x,m32
r32,x
r32,m32
x,r32
x,m32
r32,x
r32,m64

1
1
2
2
2
3
2
2
2
2
1
1
2
2
2
2
2
2
2
1
2
2
2
1
2
2

1
1
2
2
2
2
2
2
2
2
1
1
2
1
2
2
2
2
2
1
2
2
2
1
2
2

x,x
x,m32/64
x,x
x,m128
y,y,y
y,y,m256
x,x
x,m128
y,y,y
y,y,m256
x,x
x,m128

1
1
1
1
1
1
1
1
1
1
3
4

1
1
1
1
1
1
1
1
1
1
3
3

1
1
1
1
1
1
1
1
1
1
1
1

2
2

y,y,y

y,y,m256
x,x
x,m
y,y,y
y,y,m256
x,x
x,m
y,y,y
y,y,m256
x,x

4
1
1
1
1
1
1
1
1
1

3
1
1
1
1
1
1
1
1
1

1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1

3
1+
1
1
1
1
1
1
1
1

4
1
5
1
4
1
5
1+
4
1

4
1

1
1

4
1
4
1

?
?
?
1
?
?

1
1
1
1
1
1
1
1
1


1
1
1
1
1
1
1
1

4
1

?
?

4
1
4
1

?
?

4
1

3
1
3
1
3
1+
3
1
3
1+
5
1
5
1+
5
1
5
1+
5
1
5
1+
10-14

1
1
1
1
1
1
1
1
1
1
2
2
1
1
1
1
1
1
1.5
1.5
1
1
1.5
1.5
1
1

AVX
AVX

AVX
AVX

AVX
AVX

1
1
1
1
1
1
1
1
1
1
2
2

AVX
AVX
SSE3
SSE3
AVX
AVX
SSE3
SSE3

AVX

2
1
1
1
1
1
1
1
1
10-14

AVX

AVX
AVX

AVX
AVX

DIVSS DIVPS
VDIVPS
VDIVPS
DIVSD DIVPD
DIVSD DIVPD
VDIVPD
VDIVPD
RCPSS/PS
RCPSS/PS
VRCPPS
VRCPPS
CMPccSS/D CMPccPS/D

x,m
y,y,y
y,y,m256
x,x
x,m
y,y,y
y,y,m256
x,x
x,m128
y,y
y,m256

1
3
4
1
1
3
4
1
1
3
4

1
3
3
1
1
3
3
1
1
3
3

1
2
2
1
1
2
?
1
1
2

x,x

x,m128
y,y,y
y,y,m256
x,x
x,m32/64
x,x
x,m32/64
x,x
x,m128
y,y,y
y,y,m256
x,x,i
x,m128,i
y,y,i
y,m256,i
x,x,i
x,m128,i
y,y,y,i
y,m256,i
x,x,i
x,m128,i

2
1
2
2
2
1
1
1
1
1
1
1
2
1
2
4
6
4
6
3
4

1
1
1
2
2
1
1
1
1
1
1
1
1
1
1
4
5
4
5
3
3

1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
2

x,x
x,m128
y,y
y,m256
x,x
x,m128
y,y
y,m256
x,x
x,m128
y,y
y,m256

1
1
3
4
1
2
3
4
1
1
3
4

1
1
3
3
1
1
3
3
1
1
3
3

x,x
x,m128

1
1

1
1

1
1

21-29
1+
10-22
1

1
?

21-45
1+
5
1

7
1+
3

10-14
20-28
20-28
10-22
10-22
20-44
20-44
1
1
2
2

AVX
AVX

AVX
AVX

AVX
AVX

CMPccSS/D CMPccPS/D
VCMPccPS/D
VCMPccPS/D
COMISS/D UCOMISS/D
COMISS/D UCOMISS/D
MAXSS/D MINSS/D
MAXSS/D MINSS/D
MAXPS/D MINPS/D
MAXPS/D MINPS/D
VMAXPS/D VMINPS/D
VMAXPS/D VMINPS/D
ROUNDSS/SD/PS/PD
ROUNDSS/SD/PS/PD
VROUNDSS/SD/PS/PD
VROUNDSS/SD/PS/PD
DPPS
DPPS
VDPPS
VDPPS
DPPD
DPPD
Math
SQRTSS/PS
SQRTSS/PS
VSQRTPS
VSQRTPS
SQRTSD/PD
SQRTSD/PD
VSQRTPD
VSQRTPD
RSQRTSS/PS
RSQRTSS/PS
VRSQRTPS
VRSQRTPS

1
1

1
3
1+
2
1
3
1
3
1
3
1+
3
1
3
1+
12
1
12
1+

9
1

1
1

10-14
1
1+

1
1

10-21
1
21-43
1+

1
1

5
1
7
1+

1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
2
4
2
4
2
2

10-14
10-14
21-28
21-28
10-21
10-21
21-43
21-43
1
1
2
2

Logic
AND/ANDN/OR/XORPS/PD
AND/ANDN/OR/XORPS/PD


1
1

1
1

1
1

AVX
AVX

AVX
AVX
SSE4.1
SSE4.1
AVX
AVX
SSE4.1
SSE4.1
AVX
AVX
SSE4.1
SSE4.1

AVX
AVX

AVX
AVX

AVX
AVX

VAND/ANDN/OR/XORPS/
PD
VAND/ANDN/OR/XORPS/
PD
(V)XORPS/PD

y,y,y

y,y,m256
x/y,x/y,same

1
1

1
0

Other
VZEROUPPER

VZEROALL

12

VZEROALL
LDMXCSR
STMXCSR
VSTMXCSR
FXSAVE
FXRSTOR
XSAVEOPT

m32
m32
m32
m4096
m4096
m

AVX

1
0.25

AVX

AVX
AVX,
32 bit
AVX,
64 bit

1+

11

20
3
3
3
3
3
3
130
116
100-161

?
?


?
?

1
1

1
1
1

9
3
1
1
68
72

1
1

60-500

AVX

Ivy Bridge

Intel Ivy Bridge


List of instruction timings and μop breakdown
Explanation of column headings:

Operands: i = immediate data, r = register, mm = 64-bit mmx register, x = 128-bit xmm
register, (x)mm = mmx or xmm register, y = 256-bit ymm register, same =
same register for both operands. m = memory operand, m32 = 32-bit memory
operand, etc.

μops fused domain: The number of μops at the decode, rename, allocate and retirement
stages in the pipeline. Fused μops count as one.

μops unfused domain: The number of μops for each execution port. Fused μops count as
two. Fused macro-ops count as one. The instruction has μop fusion if the sum of the
numbers listed under p015 + p23 + p4 exceeds the number listed under μops
fused domain. A number indicated as 1+ under a read or write port means a
256-bit read or write operation using two clock cycles for handling 128 bits
each cycle. The port cannot receive another read or write μop in the second
clock cycle, but a read port can receive an address-calculation μop in the
second clock cycle. An x under p0, p1 or p5 means that at least one of the μops
listed under p015 can optionally go to this port. For example, a 1 under p015
and an x under p0 and p5 means one μop which can go to either port 0 or port
5, whichever is vacant first. A value listed under p015 but nothing under p0, p1
and p5 means that it is not known which of the three ports these μops go to.

μops unfused domain columns:
p015: The total number of μops going to port 0, 1 and 5.
p0: The number of μops going to port 0 (execution units).
p1: The number of μops going to port 1 (execution units).
p5: The number of μops going to port 5 (execution units).
p23: The number of μops going to port 2 or 3 (memory read or address calculation).
p4: The number of μops going to port 4 (memory write data).

Latency: This is the delay that the instruction generates in a dependency chain. The
numbers are minimum values. Cache misses, misalignment, and exceptions
may increase the clock counts considerably. Where hyperthreading is enabled,
the use of the same execution units in the other thread reduces performance.
Denormal numbers, NaNs and infinity do not increase the latency.
The time unit used is core clock cycles, not the reference clock cycles given by
the time stamp counter.

Reciprocal throughput: The average number of core clock cycles per instruction for a
series of independent instructions of the same kind in the same thread.

The latencies and throughputs listed below for addition and multiplication using
full size YMM registers are obtained only after a warm-up period of a thousand
instructions or more. For shorter sequences of code, the latencies may be one or
two clock cycles longer and the reciprocal throughputs may be double the listed
values. There is no warm-up effect when vectors are 128 bits wide or less.
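
The fusion rule in the column explanation above (an instruction has fusion when p015 + p23 + p4 exceeds the fused-domain count) can be sketched as a small check on one table row. This is only an illustration: the helper names and the sample rows are hypothetical, and treating a "1+" entry as a single op that occupies the port for two cycles is my reading of the explanation. Cells holding a range, an "x" or a "?" are not handled.

```python
def count(cell):
    """Parse a table cell such as '1', '2' or '1+' into an op count.

    An empty cell means no ops on that port. The '+' marker flags a
    256-bit access split over two cycles; it is counted as one op here.
    """
    cell = cell.strip()
    if not cell:
        return 0
    return int(cell.rstrip("+"))


def has_uop_fusion(fused, p015="", p23="", p4=""):
    """True if the unfused-domain total exceeds the fused-domain count."""
    unfused = count(p015) + count(p23) + count(p4)
    return unfused > count(fused)


# Hypothetical rows written in the notation of the tables.
print(has_uop_fusion("1", p015="1", p23="1"))  # True: 2 unfused, 1 fused
print(has_uop_fusion("1", p015="1"))           # False: nothing is fused
print(has_uop_fusion("1", p23="1+"))           # False: one 256-bit read
```

A row that passes this check typically combines a memory access with an execution op, which is why the fused-domain count is the more relevant figure for the decode, rename and retirement stages.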

Integer instructions
Instruction

Operands

μops fused domain   μops unfused domain (p015 p0 p1 p5 p23 p4)   Latency   Reciprocal throughput   Comments

Move instructions
MOV
MOV
MOV

r,i
r8/16,r8/16
r32/64,r32/64

1
1
1

1
1
1

x
x
x

x
x
x

x
x
x

MOV
MOV
MOV

r8/16,m8/16
r32/64,m32/64
r,m

1
1
1

MOV
MOV
MOVNTI
MOVSX MOVSXD
MOVZX
MOVZX

m,r
m,i
m,r
r,r
r16,r8
r32/64,r8

1
1
2
1
1
1

MOVZX
MOVSX MOVZX
MOVSX MOVZX
MOVSXD
CMOVcc
CMOVcc
XCHG
XCHG

r32/64,r16
r16,m8
r32/64,m
r,r
r,m
r,r
r,m

1
1
1
1
1
1

x
x
x

x
x
x

x
x
x

1
2
1

1
1

x
x

x
x

x
x

2
2
3
7

2
2
3
x

x
x
x

x
x
x

x
x
x

8
10
1
3
2
1

XLAT
PUSH
PUSH
PUSH
PUSH
PUSHF(D/Q)
PUSHA(D)
POP
POP
POP
POPF(D/Q)
POPA(D)
LAHF SAHF
SALC
LEA
LEA

r16,m
r32/64,m

3
1
1
2
2
3
19
1
3
2
9
18
1
3
2
1

LEA

r32/64,m

r32
r64
m
m

1
2
1
1
2
3

1
2

BSWAP
BSWAP
PREFETCHNTA
PREFETCHT0/1/2
LFENCE
MFENCE

r
i
m
(E/R)SP

r
(E/R)SP
m

1
1
1

1
1

x
x
x

x
x
x

x
x
x

x
x
x
x
x
x

x
x
x
1
x

x
x
x
x
x

1
1
1
2
1
1
8
1
1
2
1
8

1
1
1
1
1
8

1
1

0.5
0.5
1

3
~340
1
1
0-1

1
1
1
0.33
0.33
0.25

1
3
2

0.33
0.5
0.5

0.67
~0.8
1

2
25
7
3

1
1
2-4
1
3

1
2

1
1
43
43
4
36

may be
elimin.

64 b abs
address

may be
elimin.

implicit
lock
1
1
1
1
1
1
8
0.5
0.5
1
18
9
1
1
1
0.5

1
1


2
2
2

0.33
0.33
0.25

1
2

1
2
3

1
1
1

1
1
0-1

not 64 bit

not 64 bit
not 64 bit
1-2 components

3 components
or RIP

SFENCE
Arithmetic instructions
ADD SUB
ADD SUB
ADD SUB
ADC SBB
ADC SBB
ADC SBB
CMP
CMP
INC DEC NEG NOT
INC DEC NEG NOT
AAA AAS
DAA DAS
AAD
AAM
MUL IMUL
MUL IMUL
MUL IMUL
MUL IMUL
IMUL
IMUL
IMUL
IMUL
MUL IMUL
MUL IMUL
MUL IMUL
MUL IMUL
IMUL
IMUL
IMUL
IMUL
DIV
DIV
DIV
DIV
IDIV
IDIV
IDIV
IDIV
CBW
CWDE
CDQE
CWD
CDQ
CQO
POPCNT
POPCNT
CRC32

r,r/i
r,m
m,r/i
r,r/i
r,m
m,r/i
r,r/i
m,r/i
r
m

r8
r16
r32
r64
r,r
r16,r16,i
r32,r32,i
r64,r64,i
m8
m16
m32
m64
r,m
r16,m16,i
r32,m32,i
r64,m64,i
r8
r16
r32
r64
r8
r16
r32
r64

r,r
r,m
r,r

1
1
2
2
2
4
1
1
1
3
2
3
3
8
1
4
3
2
1
2
1
1
1
4
3
2
1
2
1
1
11
11
10
35-57
11
11
9
59-134
1
1
1
2
1
1
1
1
1

1
1
1
2
2
3
1
1
1
1
2
3
3
8
1
4
3
2
1
2
1
1
1
3
2
1
1
2
1
1
11
11
10
x
11
11
9
x

x
x
x
x
x
x
x
x
x
x
x

1
1
1
2
1
1
1
1
1

x
x
x
x
x
x

x
x
x
x
x
x
x
x
x
x
1

x
x
x
x
x
x
x
x
x
x
x

1
2

1
2

1
2

1
1
1
1
1
1
1
1

1
1
1

1
1
1

6
2
2
7-8
1
1
1
6
4
4
2
20
3
4
4
3
3
4
3
3
3

19-22
20-24
19-27
29-94
20-23
20-24
19-26
28-103


x
x
x
x

1
1
1

x
x
x
x
x
x

1
1
1
1
1
1
3
1
3

0.33
0.5
1
1
1
2
0.33
0.5
0.33
1

8
1
2
2
1
1
1
1
1
1
2
2
2
1
1
1
1
9
10
11
22-76
8
8
8-11
26-88

not 64 bit
not 64 bit
not 64 bit
not 64 bit

0.33

0.5
1
1
1

SSE4.2
SSE4.2
SSE4.2

CRC32
Logic instructions
AND OR XOR
AND OR XOR
AND OR XOR
TEST
TEST
SHR SHL SAR
SHR SHL SAR
SHR SHL SAR
SHR SHL SAR
ROR ROL
ROR ROL
ROR ROL
ROR ROL
ROR ROL
RCL RCR
RCL RCR
RCL RCR
RCL RCR
RCL RCR
SHRD SHLD
SHRD SHLD
SHRD SHLD
SHRD SHLD
BT
BT
BT
BTR BTS BTC
BTR BTS BTC
BTR BTS BTC
BSF BSR
BSF BSR
SETcc
SETcc
CLC
STC CMC
CLD STD

r,m

r,r/i
r,m
m,r/i
r,r/i
m,r/i
r,i
m,i
r,cl
m,cl
r,1
r,i
m,i
r,cl
m,cl
r,1
r,i
m,i
r,cl
m,cl
r,r,i
m,r,i
r,r,cl
m,r,cl
r,r/i
m,r
m,i
r,r/i
m,r
m,i
r,r
r,m
r
m

1
1
2
1
1
1
3
2
5
2
1
4
2
5
3
8
11
8
11
1
3
4
5
1
10
2
1
11
3
1
1
1
2
1
1
3

1
1
1
1
1
1
1
2
3
2
1
3
2
3
3
8
8
8
8
1
3
4
4
1
9
1
1
8
2
1
1
1
1
0
1
3

1
1
1
1

Control transfer instructions


JMP
short/near
JMP
r
JMP
m
Conditional jump
short/near
Fused arithmetic and
branch
J(E/R)CXZ
LOOP
LOOP(N)E

short
short
short

x
x
x
x
x
x

x
x
x
x
x

x
x
x
x
x
x

1
1
2

x
1
1
x

x
x
x
x
x
x
x
x
x
x
x
x
x
x
x

1
1
x
x

1
2
5
2

1
5

1
2
1
1
1
1
2
1

1
1
3

1
x
x

1
1

0.33
0.5
1
0.33
0.5
0.5
2
1
4
1
0.5
2
1
4
2
5
6
5
6
0.5
2
2
4
0.5
5
0.5
0.5
5
2
1
1
0.5
1
0.25
0.33
4

x
x

1
1
1
1

1
1
1
1

0
0
0
0

2
2
2
1-2

1-2

2
7
11

2
7
11


x
x

1
1
1
1
1

x
x
x
x
x
x

6
1

x
x

x
x

SSE4.2

x
x

x
x
x
x
x
x
x
x
x
x
x
x
x
x
x

1-2
4-5
6

short form

fast if no
jump
fast if no
jump

CALL
CALL
CALL
RET
RET
BOUND
INTO

2
2
3
2
3
15
4

1
1
1
1
2
13
4

String instructions
LODS
REP LODS
STOS
REP STOS

3
~5n
3
many

REP STOS

many

MOVS
REP MOVS

5
2n

REP MOVS

4/16B

SCAS
REP SCAS
CMPS
REP CMPS

3
~6n
5
~8n

4
8
7
5
9
14
18
22
24

3
5
5
3
6
11
15
19
21

Synchronization instructions
XADD
LOCK XADD
LOCK ADD
CMPXCHG
LOCK CMPXCHG
CMPXCHG8B
LOCK CMPXCHG8B

CMPXCHG16B
LOCK CMPXCHG16B

Other
NOP / Long NOP
PAUSE
ENTER
ENTER
LEAVE
XGETBV
CPUID
RDTSC
RDPMC
RDRAND

near
r
m
i
r,m

m,r
m,r
m,r
m,r
m,r
m,r
m,r
m,r
m,r

a,0
a,b

1
1
1
1
1

1
1
2
1
1
2

1
1
1

2
2
2
2
2
7
6

not 64 bit
not 64 bit

1
~2n
1

1
n

worst
case
best
case

1/16B
2

4
n

worst
case
best
case

1/16B
1
~2n
4
~2n

1
0
7
7
12
9
45+7b
3
2
8
37-82
21
35
13
12

x
x
x
x
x
x
x
x
x

x
x
x
x
x
x
x
x
x

x
x
x
x
x
x
x
x
x

1
2
1
2
2
2
2
2
2

1
1
1
1
1
1
1
1
1

7
22
22
7
22
7
22
16
27

0.25
10
8

1
84+3b

6
9

XGETBV

100-340

Floating point x87 instructions



27
39
104-117 RDRAND

Instruction

Move instructions
FLD
FLD
FLD
FBLD
FST(P)
FST(P)
FSTP
FBSTP
FXCH
FILD
FIST(P)
FISTTP
FLDZ
FLD1
FLDPI FLDL2E etc.
FCMOVcc
FNSTSW
FNSTSW
FLDCW
FNSTCW
FINCSTP FDECSTP
FFREE(P)
FNSAVE
FRSTOR
Arithmetic instructions
FADD(P) FSUB(R)(P)
FADD(P) FSUB(R)(P)
FMUL(P)
FMUL(P)
FDIV(R)(P)
FDIV(R)(P)
FABS
FCHS
FCOM(P) FUCOM
FCOM(P) FUCOM
FCOMPP FUCOMPP
FCOMI(P) FUCOMI(P)
FIADD FISUB(R)
FIMUL
FIDIV(R)
FICOM(P)
FTST
FXAM
FPREM
FPREM1
FRNDINT

Operands

r
m32/64
m80
m80
r
m32/m64
m80
m80
r
m
m
m

r
AX
m16
m16
m16
r
m
m

r
m
r
m
r
m

r
m
r
m
m
m
m

μops fused domain   μops unfused domain (p015 p0 p1 p5 p23 p4)   Latency   Reciprocal throughput   Comments

1
1
4
43
1
1
7
243
1
1
3
3
1
2
2
3
2
2
3
2
1
1
143
90

1
2
1
2
1
2
1
1
1
1
2
3
2
2
2
2
1
2
28
41
17

2
40
1

1
1
1
1
1
1
1
1
1
1
2
3
2
2
2
2
1
2
28

1
1

0
6
7
7

1
2
3

1
1
2

3
0
1
1
1
1
2
2
3
2
1
2
1
1
1

1
2

1
3
5
45
1
4
5

1
1
1

1
1
1

1
1
x

1
1
1
1
1
2
x
2
1
1
1

1
1

2
4
1
1
1

1
1

17

1
1

3
5
1
10-24
1
1
1

1
1
1
1

1
1
1
1
2
1
1
2
1
2

1
1
2
21
1
1
5
252
0.5
1
1
2

1
1
3
1

1
1

4
5
1
1
1
1

21-26
27-50
22

2
2
1
1
3
1
1
1
167
162

1
1
1
1
8-18
8-18
1
1
1
1
1
1
2
2
2
1
2
12
19
11

SSE3

Math
FSCALE
FXTRACT
FSQRT
FSIN
FCOS
FSINCOS
F2XM1
FYL2X
FYL2XP1
FPTAN
FPATAN

25
17
1
21-78
23-100
20-110
16-23
42
56
102
28-72

Other
FNOP
WAIT
FNCLEX
FNINIT

1
2
5
26

25
17
1

x
x
1
x
x
x
x
42 x
56 x
102 x
x

1
2
5
26

x
x
x

x
x

x
x

x
x
x
x
x
x
x
x

x
x
x
x
x
x
x
x

x
x
x

1
1
x
x

49
10
10-23
47-106
48-115
50-123
~68
90-106
82
130
94-150

49
10
8-17
47-106
48-115
50-123
~68

1
1
22
80

Integer MMX and XMM instructions


Instruction

Move instructions
MOVD
MOVD
MOVD
MOVD
MOVQ
MOVQ
MOVQ
MOVDQA MOVDQU
MOVDQA MOVDQU
MOVDQA MOVDQU
LDDQU
MOVDQ2Q
MOVQ2DQ
MOVNTQ
MOVNTDQ
MOVNTDQA
PACKSSWB/DW
PACKUSWB
PACKSSWB/DW
PACKUSWB
PACKSSWB/DW
PACKUSWB
PACKSSWB/DW
PACKUSWB
PACKUSDW

Operands

μops fused domain   μops unfused domain (p015 p0 p1 p5 p23 p4)   Latency   Reciprocal throughput   Comments

r32/64,(x)mm
m32/64,(x)mm
(x)mm,r32/64
(x)mm,m32/64
(x)mm,(x)mm
(x)mm,m64
m64, (x)mm
x,x
x, m128
m128, x
x, m128
mm, x
x,mm
m64,mm
m128,x
x, m128

1
1
1
1
1
1
1
1
1
1
1
2
1
1
1
1

mm,mm

mm,m64

x,x

x,m128
x,x

1
1

1
1

x
x

x
x

1
1

1
1

x
1
1

1
2
1

x
1
1
1

1
1
1
1


1
1

1
3
1
3
1
3
3
0-1
3
3
3
1
1
~360
~360
3

1
1
1
0.5
0.33
0.5
1
0.25
0.5
1
0.5
1
0.33
1
1
0.5

eliminat.

SSE3

SSE4.1

1
1

0.5

1
1

0.5
0.5

SSE4.1

PACKUSDW
PUNPCKH/LBW/WD/DQ
PUNPCKH/LBW/WD/DQ
PUNPCKH/LQDQ
PUNPCKH/LQDQ
PMOVSX/ZXBW
PMOVSX/ZXBW
PMOVSX/ZXBD
PMOVSX/ZXBD
PMOVSX/ZXBQ
PMOVSX/ZXBQ
PMOVSX/ZXWD
PMOVSX/ZXWD
PMOVSX/ZXWQ
PMOVSX/ZXWQ
PMOVSX/ZXDQ
PMOVSX/ZXDQ
PSHUFB
PSHUFB
PSHUFW
PSHUFW
PSHUFD
PSHUFD
PSHUFL/HW
PSHUFL/HW
PALIGNR
PALIGNR
PBLENDVB
PBLENDVB
PBLENDW
PBLENDW
MASKMOVQ
MASKMOVDQU
PMOVMSKB
PEXTRB
PEXTRB
PEXTRW
PEXTRW
PEXTRD
PEXTRD
PEXTRQ
PEXTRQ
PINSRB
PINSRB
PINSRW
PINSRW
PINSRD
PINSRD
PINSRQ
PINSRQ

x,m
(x)mm,(x)mm
(x)mm,m
x,x
x, m128
x,x
x,m64
x,x
x,m32
x,x
x,m16
x,x
x,m64
x,x
x,m32
x,x
x,m64
(x)mm,(x)mm
(x)mm,m
mm,mm,i
mm,m64,i
xmm,x,i
x,m128,i
x,x,i
x, m128,i
(x)mm,(x)mm,i
(x)mm,m,i
x,x,xmm0
x,m,xmm0
x,x,i
x,m,i
mm,mm
x,x
r32,(x)mm
r32,x,i
m8,x,i
r32,(x)mm,i
m16,(x)mm,i
r32,x,i
m32,x,i
r64,x,i
m64,x,i
x,r32,i
x,m8,i
(x)mm,r32,i
(x)mm,m16,i
x,r32,i
x,m32,i
x,r64,i
x,m64,i

1
1
1
1
2
1
1
1
1
1
1
1
1
1
1
1
1
1
2
1
2
1
2
1
2
1
2
2
3
1
2
4
10
1
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2

1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
2
2
1
1
1
4
1
2
1
1
1
2
1
2
1
2
1
2
1
1
1
1
1

1
x
1
1
1
1
1


x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
1
1
x
x

x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
1
1
x
x

x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x

x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x

1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
2
1
1
1
2
4

1
2
2
2

2
2
2
1

1
2

1
2
1
2
1
2
1

0.5
0.5
0.5
0.5
0.5
0.5
0.5
0.5
0.5
0.5
0.5
0.5
0.5
0.5
0.5
0.5
0.5
0.5
0.5
0.5
0.5
0.5
0.5
0.5
0.5
0.5
0.5
1
1
0.5
0.5
1
6
1
1
1
1
1
1
1
1
1
1
0.5
1
0.5
1
0.5
1
0.5

SSE4.1

SSE4.1
SSE4.1
SSE4.1
SSE4.1
SSE4.1
SSE4.1
SSE4.1
SSE4.1
SSE4.1
SSE4.1
SSE4.1
SSE4.1
SSSE3
SSSE3

SSSE3
SSSE3
SSE4.1
SSE4.1
SSE4.1
SSE4.1

SSE4.1
SSE4.1
SSE4.1
SSE4.1
SSE4.1
SSE4.1
SSE4.1
SSE4.1

SSE4.1
SSE4.1
SSE4.1
SSE4.1

Arithmetic instructions

PHADD/SUB(S)W/D
PHADD/SUB(S)W/D
PCMPEQ/GTB/W/D
PCMPEQ/GTB/W/D
PCMPEQQ
PCMPEQQ
PCMPGTQ
PCMPGTQ
PMULL/HW PMULHUW
PMULL/HW PMULHUW
PMULHRSW
PMULHRSW
PMULLD
PMULLD
PMULDQ
PMULDQ
PMULUDQ
PMULUDQ
PMADDWD
PMADDWD
PMADDUBSW
PMADDUBSW
PAVGB/W
PAVGB/W
PMIN/MAXSB
PMIN/MAXSB
PMIN/MAXUB
PMIN/MAXUB
PMIN/MAXSW
PMIN/MAXSW
PMIN/MAXUW
PMIN/MAXUW
PMIN/MAXU/SD
PMIN/MAXU/SD
PHMINPOSUW
PHMINPOSUW
PABSB/W/D
PABSB/W/D
PSIGNB/W/D
PSIGNB/W/D
PSADBW
PSADBW
MPSADBW
MPSADBW

(x)mm, (x)mm
(x)mm,m
(x)mm, (x)mm
(x)mm,m64
(x)mm,(x)mm
(x)mm,m
x,x
x,m128
x,x
x,m128
(x)mm,(x)mm
(x)mm,m
(x)mm,(x)mm
(x)mm,m
x,x
x,m128
x,x
x,m128
(x)mm,(x)mm
(x)mm,m
(x)mm,(x)mm
(x)mm,m
(x)mm,(x)mm
(x)mm,m
(x)mm,(x)mm
(x)mm,m
x,x
x,m128
(x)mm,(x)mm
(x)mm,m
(x)mm,(x)mm
(x)mm,m
x,x
x,m
x,x
x,m128
x,x
x,m128
(x)mm,(x)mm
(x)mm,m
(x)mm,(x)mm
(x)mm,m
(x)mm,(x)mm
(x)mm,m
x,x,i
x,m,i

1
1
3
4
1
1
1
1
1
1
1
1
1
1
1
2
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
3
4

1
1
3
3
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
3
3

Logic instructions
PAND(N) POR PXOR
PAND(N) POR PXOR
PTEST

(x)mm,(x)mm
(x)mm,m
x,x

1
1
2

1
1
2

PADD/SUB(U,S)B/W/D/Q
PADD/SUB(U,S)B/W/D/Q

x
x
x
x
x
x
x
x

x
x
x
x
x
x
x
x

1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1

3
1
1
1
1
1
5
1
5
1
5
1
5
1
5
1
5
1
5
1
5
1

x
x
x
x
x
x
x
x
x
x
x
x

x
x
x
x
x
x
x
x
x
x
x
x

1
1

1
1
1
1
1
1
1
1
1
1
1
1
5
1

x
x
x
x

x
x
x
x

1
1
1
1

1
1

1
1

x
x
1

x
x
x

x
x
x


1
1

1
1
1
1
5
1
6
1

1
1
1

0.5
0.5
1.5
1.5
0.5
0.5
0.5
0.5
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
0.5
0.5
0.5
0.5
0.5
0.5
0.5
0.5
0.5
0.5
0.5
0.5
1
1
0.5
0.5
0.5
0.5
1
1
1
1

0.33
0.5
1

SSSE3
SSSE3

SSE4.1
SSE4.1
SSE4.2
SSE4.2

SSSE3
SSSE3
SSE4.1
SSE4.1
SSE4.1
SSE4.1

SSSE3
SSSE3

SSE4.1
SSE4.1

SSE4.1
SSE4.1
SSE4.1
SSE4.1
SSE4.1
SSE4.1
SSSE3
SSSE3
SSSE3
SSSE3

SSE4.1
SSE4.1

SSE4.1

PTEST
PSLL/RL/RAW/D/Q
PSLL/RL/RAW/D/Q
PSLL/RL/RAW/D/Q
PSLL/RL/RAW/D/Q
PSLL/RL/RAW/D/Q
PSLL/RLDQ

x,m128
mm,mm/i
mm,m64
xmm,i
x,x
x,m128
x,i

3
1
1
1
2
3
1

2
1
1
1
2
2
1

1
1
1
1
1
1

String instructions
PCMPESTRI
PCMPESTRI
PCMPESTRM
PCMPESTRM
PCMPISTRI
PCMPISTRI
PCMPISTRM
PCMPISTRM

x,x,i
x,m128,i
x,x,i
x,m128,i
x,x,i
x,m128,i
x,x,i
x,m128,i

8
8
8
8
3
4
3
4

8
7
8
7
3
3
3
3

x,x,i
x,m,i

18
18

x,x

x,m
x,x
x,m
x,x,i
x,m,i

Encryption instructions
PCLMULQDQ
PCLMULQDQ
AESDEC, AESDECLAST,
AESENC, AESENCLAST

1
1
1
1
1
1
0.5

SSE4.1

4
4
4
4

SSE4.2
SSE4.2
SSE4.2
SSE4.2
SSE4.2
SSE4.2
SSE4.2
SSE4.2

14

8
8

CLMUL
CLMUL

AES

1
2
2
8
7

AES
AES
AES
AES
AES

1
1
x
x
x

x
x
x

3
3
3
3
3
3
3
3

1
1
1
1

4
3
4
3

18
17

x
x

x
x

x
x

3
2
3
11
11

2
2
2
11
10

1
2
2
x
x

31

31

1
2
1
1

4
1
12
1
3
1

3
11

AESDEC, AESDECLAST,
AESENC, AESENCLAST
AESIMC
AESIMC
AESKEYGENASSIST
AESKEYGENASSIST
Other
EMMS

x
x

x
x

1
14
1
10
1

18

Floating point XMM and YMM instructions


Instruction

Move instructions
MOVAPS/D
VMOVAPS/D
MOVAPS/D MOVUPS/D
VMOVAPS/D
VMOVUPS/D
MOVAPS/D MOVUPS/D
VMOVAPS/D
VMOVUPS/D
MOVSS/D

Operands

μops fused domain
μops unfused domain (p015 p0 p1 p5 p23 p4)
Latency
Reciprocal throughput
Comments

x,x
y,y
x,m128

1
1
1

1
1

1
1

y,m256
m128,x

1
1

1+
1

m256,y
x,x

1
1

1+

1

0-1
0-1
3

1
1
0.5

elimin.
elimin.

4
3

1
1

AVX

4
1

2
1

AVX

MOVSS/D
MOVSS/D
MOVHPS/D MOVLPS/D
MOVH/LPS/D
MOVLHPS MOVHLPS
MOVMSKPS/D
VMOVMSKPS/D
MOVNTPS/D
VMOVNTPS/D
SHUFPS/D
SHUFPS/D
VSHUFPS/D
VSHUFPS/D
VPERMILPS/PD
VPERMILPS/PD
VPERMILPS/PD
VPERMILPS/PD
VPERMILPS/PD
VPERMILPS/PD
VPERM2F128
VPERM2F128
BLENDPS/PD
BLENDPS/PD
VBLENDPS/PD
VBLENDPS/PD
BLENDVPS/PD
BLENDVPS/PD
VBLENDVPS/PD
VBLENDVPS/PD
MOVDDUP
MOVDDUP
VMOVDDUP
VMOVDDUP
VBROADCASTSS
VBROADCASTSS
VBROADCASTSD
VBROADCASTF128
MOVSH/LDUP
MOVSH/LDUP
VMOVSH/LDUP
VMOVSH/LDUP
UNPCKH/LPS/D
UNPCKH/LPS/D
VUNPCKH/LPS/D
VUNPCKH/LPS/D
EXTRACTPS
EXTRACTPS
VEXTRACTF128
VEXTRACTF128
INSERTPS
INSERTPS

x,m32/64
m32/64,x
x,m64
m64,x
x,x
r32,x
r32,y
m128,x
m256,y
x,x,i
x,m128,i
y,y,y,i
y, y,m256,i
x,x,x/i
y,y,y/i
x,x,m
y,y,m
x,m,i
y,m,i
y,y,y,i
y,y,m,i
x,x,i
x,m128,i
y,y,i
y,m256,i
x,x,xmm0
x,m,xmm0
y,y,y,y
y,y,m,y
x,x
x,m64
y,y
y,m256
x,m32
y,m32
y,m64
y,m128
x,x
x,m128
y,y
y,m256
x,x
x,m128
y,y,y
y,y,m256
r32,x,i
m32,x,i
x,y,i
m128,y,i
x,x,i
x,m32,i

1
1
1
1
1
1
1
1
1
1
2
1
2
1
1
2
2
2
2
1
2
1
2
1
2
2
3
2
3
1
1
1
1
1
2
2
2
1
1
1
1
1
1
1
1
2
3
1
2
1
2

1
1
1
1
1

1
1
1
1

1
1
1
1
1

1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
2
2
2
2
1

x
x
x
x
x
x
x
x

1
1
1
1
1
1
1
1
1
1
1
1
x
x
x
x
x
x
x
x
1

1
1+

1
1
1+
1
1
1
1+
1
1+
2
1+
1
1
1
1+
2
1
2
1+
1
3
1
3
4
5
5
5
1
3
1

1
1

1
1
1
1

1
1
1
1

1+
1
1
1
1
1

1
1
1
1
2
2
1
0
1
1

1
1
1
1
x
x
1

3
3
4
3
1
2
2
~380
~380
1

1+

x
x


1
1
1
1+
2
1
1

1
1

1
1

2
4
1

0.5
1
1
1
1
1
1
1
2
1
1
1
1
1
1
1
1
1
1
1
1
0.5
0.5
0.5
1
1
1
1
1
1
0.5
1
1
0.5
1
1
1
1
0.5
1
1
1
1
1
1
1
1
1
1
1
1

AVX

AVX
AVX
AVX
AVX
AVX
AVX
AVX
AVX
AVX
AVX
SSE4.1
SSE4.1
AVX
AVX
SSE4.1
SSE4.1
AVX
AVX
SSE3
SSE3
AVX
AVX
AVX
AVX
AVX
AVX
SSE3
SSE3
AVX
AVX
SSE3
SSE3
AVX
AVX
SSE4.1
SSE4.1
AVX
AVX
SSE4.1
SSE4.1

VINSERTF128
VINSERTF128
VMASKMOVPS/D
VMASKMOVPS/D
VMASKMOVPS/D
VMASKMOVPS/D

y,y,x,i
y,y,m128,i
x,x,m128
y,y,m256
m128,x,x
m256,y,y

1
2
3
3
4
4

1
1
2
2
2
2

Conversion
CVTPD2PS
CVTPD2PS
VCVTPD2PS
VCVTPD2PS
CVTSD2SS
CVTSD2SS
CVTPS2PD
CVTPS2PD
VCVTPS2PD
VCVTPS2PD
CVTSS2SD
CVTSS2SD
CVTDQ2PS
CVTDQ2PS
VCVTDQ2PS
VCVTDQ2PS
CVT(T) PS2DQ
CVT(T) PS2DQ
VCVT(T) PS2DQ
VCVT(T) PS2DQ
CVTDQ2PD
CVTDQ2PD
VCVTDQ2PD
VCVTDQ2PD
CVT(T)PD2DQ
CVT(T)PD2DQ
VCVT(T)PD2DQ
VCVT(T)PD2DQ
CVTPI2PS
CVTPI2PS
CVT(T)PS2PI
CVT(T)PS2PI
CVTPI2PD
CVTPI2PD
CVT(T) PD2PI
CVT(T) PD2PI
CVTSI2SS
CVTSI2SS
CVT(T)SS2SI
CVT(T)SS2SI
CVTSI2SD
CVTSI2SD
CVT(T)SD2SI
CVT(T)SD2SI

x,x
x,m128
x,y
x,m256
x,x
x,m64
x,x
x,m64
y,x
y,m128
x,x
x,m32
x,x
x,m128
y,y
y,m256
x,x
x,m128
y,y
y,m256
x,x
x,m64
y,x
y,m128
x,x
x,m128
x,y
x,m256
x,mm
x,m64
mm,x
mm,m128
x,mm
x,m64
mm,x
mm,m128
x,r32
x,m32
r32,x
r32,m32
x,r32
x,m32
r32,x
r32,m64

2
2
2
2
2
2
2
2
2
3
2
2
1
1
1
1
1
1
1
1
2
2
2
2
2
2
2
2
1
1
2
2
2
2
2
2
2
1
2
2
2
2
2
2

2
2
2
2
2
2
2
2
2
2
2
1
1
1
1
1
1
1
1
1
2
2
2
2
2
2
2
2
1
1
2
1
2
2
2
2
2
1
2
2
2
1
2
2

1
x
x

x
x
x
x

x
x

1
1
1
1
1
1
1
1
1
1
1
1

1
1

1
1


1
1
1
1
1
1
1

1
1
1+
1 1
1 1+

2
4
4
5

1
1
1
1
1
2

1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1

1
4
1+
4
1
1
1

1
1
1

4
1
2
1

1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1

3
1
3
1+
3
1
3
1+
1
1
1
1
1
1
1
1

4
1
5
1
4
1
5
1+
4
1

4
1

1
1
1
1
1

4
1
4
1
4
1
4
1

4
1
4
1

3
1
1
1
1
1
1
3
3
1
1
3
3
1
1

AVX
AVX
AVX
AVX
AVX
AVX

AVX
AVX

AVX
AVX

AVX
AVX

AVX
AVX

AVX
AVX

AVX
AVX

Arithmetic
ADDSS/D SUBSS/D
ADDSS/D SUBSS/D
ADDPS/D SUBPS/D
ADDPS/D SUBPS/D
VADDPS/D VSUBPS/D
VADDPS/D VSUBPS/D
ADDSUBPS/D
ADDSUBPS/D
VADDSUBPS/D
VADDSUBPS/D
HADDPS/D HSUBPS/D
HADDPS/D HSUBPS/D
VHADDPS/D
VHSUBPS/D
VHADDPS/D
VHSUBPS/D
MULSS MULPS
MULSS MULPS
VMULPS
VMULPS
MULSD MULPD
MULSD MULPD
VMULPD
VMULPD
DIVSS DIVPS
DIVSS DIVPS
VDIVPS
VDIVPS
DIVSD DIVPD
DIVSD DIVPD
VDIVPD
VDIVPD
RCPSS/PS
RCPSS/PS
VRCPPS
VRCPPS
CMPccSS/D CMPccPS/D

x,x
x,m32/64
x,x
x,m128
y,y,y
y,y,m256
x,x
x,m128
y,y,y
y,y,m256
x,x
x,m128

1
1
1
1
1
1
1
1
1
1
3
4

1
1
1
1
1
1
1
1
1
1
3
3

1
1
1
1
1
1
1
1
1
1
1
1

2
2

y,y,y

y,y,m256
x,x
x,m
y,y,y
y,y,m256
x,x
x,m
y,y,y
y,y,m256
x,x
x,m
y,y,y
y,y,m256
x,x
x,m
y,y,y
y,y,m256
x,x
x,m128
y,y
y,m256

4
1
1
1
1
1
1
1
1
1
1
3
4
1
1
3
4
1
1
3
4

3
1
1
1
1
1
1
1
1
1
1
3
3
1
1
3
3
1
1
3
3

x,x

x,m128
y,y,y
y,y,m256
x,x
x,m32/64
x,x
x,m32/64
x,x
x,m128
y,y,y
y,y,m256

2
1
2
2
2
1
1
1
1
1
1

1
1
1
2
2
1
1
1
1
1
1

1
1
1
1
1
1
1
1
1
1
1

1
1
1
1
1
1
1
1
1
1
2
2
1
1
2
2
1
1
2
2

3
1
3
1
3
1+
3
1
3
1+
5
1
5
1+
5
1
5
1+
5
1
5
1+
10-13
1
1
1

19-21
1+
10-20
1

1
1

20-35
1+
5
1

1
1

7
1+
3

1
1
1
1
1
1
1
1
1
1
2
2

AVX
AVX
SSE3
SSE3
AVX
AVX
SSE3
SSE3

AVX

2
1
1
1
1
1
1
1
1
7
7
14
14
8-14
8-14
16-28
16-28
1
1
2
2

AVX

AVX
AVX

AVX
AVX

AVX
AVX

AVX
AVX

AVX
AVX

CMPccSS/D CMPccPS/D
VCMPccPS/D
VCMPccPS/D
COMISS/D UCOMISS/D
COMISS/D UCOMISS/D
MAXSS/D MINSS/D
MAXSS/D MINSS/D
MAXPS/D MINPS/D
MAXPS/D MINPS/D
VMAXPS/D VMINPS/D
VMAXPS/D VMINPS/D

1
1


1
3
1+
1
3
1
3
1
3
1+

1
1
1
1
1
1
1
1
1
1
1

AVX
AVX

AVX
AVX

ROUNDSS/SD/PS/PD
ROUNDSS/SD/PS/PD
VROUNDSS/SD/PS/PD
VROUNDSS/SD/PS/PD
DPPS
DPPS
VDPPS
VDPPS
DPPD
DPPD

x,x,i
x,m128,i
y,y,i
y,m256,i
x,x,i
x,m128,i
y,y,y,i
y,m256,i
x,x,i
x,m128,i

1
2
1
2
4
6
4
6
3
4

1
1
1
1
4
5
4
5
3
3

1
1
1
1
2
2
2
2
1
1

1
1
1
1
1
1

Math
SQRTSS/PS
SQRTSS/PS
VSQRTPS
VSQRTPS
SQRTSD/PD
SQRTSD/PD
VSQRTPD
VSQRTPD
RSQRTSS/PS
RSQRTSS/PS
VRSQRTPS
VRSQRTPS

x,x
x,m128
y,y
y,m256
x,x
x,m128
y,y
y,m256
x,x
x,m128
y,y
y,m256

1
1
3
4
1
1
3
4
1
1
3
4

1
1
3
3
1
1
3
3
1
1
3
3

1
1
2
2
1
1
2
2
1
1
2
2

x,x
x,m128

1
1

1
1

1
1

y,y,y

y,y,m256

m32
m32
m4096
m4096
m

4
12
20
3
3
130
116
100-161

0
2
2
2
2

1
3
1+
1
2
1
2
1
1

12
1
12
1+
9
1

11
1
1
1

19
1+
16
1

1
1

28
1+
5
1

1
1

7
1+

1
1
1
1
2
4
2
4
1
1

7
7
14
14
8-14
8-14
16-28
16-28
1
1
2
2

SSE4.1
SSE4.1
AVX
AVX
SSE4.1
SSE4.1
AVX
AVX
SSE4.1
SSE4.1

AVX
AVX

AVX
AVX

AVX
AVX

Logic
AND/ANDN/OR/XORPS/PD
AND/ANDN/OR/XORPS/PD

VAND/ANDN/OR/XORPS/PD
VAND/ANDN/OR/XORPS/PD
Other
VZEROUPPER
VZEROALL
VZEROALL
LDMXCSR
STMXCSR
FXSAVE
FXRSTOR
XSAVEOPT

1
1


1
1

1
1

AVX

AVX

1
11
9
3
1
66
68

AVX
32 bit
64 bit

1+

1
1

6
7

60-500

Haswell

Intel Haswell
List of instruction timings and μop breakdown

Explanation of column headings:

Instruction: Name of instruction. Multiple names mean that these instructions have the same data. Instructions with or without the V name prefix behave the same unless otherwise noted.

Operands: i = immediate data, r = register, mm = 64 bit mmx register, x = 128 bit xmm register, (x)mm = mmx or xmm register, y = 256 bit ymm register, v = any vector register (mmx, xmm, ymm). same = same register for both operands. m = memory operand, m32 = 32-bit memory operand, etc.

μops fused domain: The number of μops at the decode, rename and allocate stages in the pipeline. Fused μops count as one.

μops unfused domain: The total number of μops for all execution ports. Fused μops count as two. Fused macro-ops count as one. The instruction has μop fusion if this number is higher than the number under fused domain. Some operations are not counted here if they do not go to any execution port or if the counters are inaccurate.

μops each port: The number of μops for each execution port. p0 means a μop to execution port 0. p01 means a μop that can go to either port 0 or port 1. p0 p1 means two μops going to port 0 and 1, respectively.
Port 0: Integer, f.p. and vector ALU, mul, div, branch
Port 1: Integer, f.p. and vector ALU
Port 2: Load
Port 3: Load
Port 4: Store
Port 5: Integer and vector ALU
Port 6: Integer ALU, branch
Port 7: Store address

Latency: This is the delay that the instruction generates in a dependency chain. The numbers are minimum values. Cache misses, misalignment, and exceptions may increase the clock counts considerably. Where hyperthreading is enabled, the use of the same execution units in the other thread leads to inferior performance. Denormal numbers, NaNs and infinity do not increase the latency. The time unit used is core clock cycles, not the reference clock cycles given by the time stamp counter.

Reciprocal throughput: The average number of core clock cycles per instruction for a series of independent instructions of the same kind in the same thread.
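As a rough illustration of how these columns are read, the following Python sketch counts the μops in a port string written in the notation above and estimates a lower bound on the clock cycles for a run of identical instructions. It is a minimal sketch, not part of the measured data: the function names are hypothetical, the figures passed to them are illustrative, and the estimates assume that nothing else limits execution (no cache misses, no competition from other instructions or threads).

```python
import re


def count_uops(ports: str) -> int:
    """Count the uops in a port string such as '2p0156 p23'.

    Each token is an optional repeat count followed by 'p' and the
    digits of the ports the uop can go to.
    """
    total = 0
    for token in ports.split():
        match = re.fullmatch(r"(\d*)p(\d+)", token)
        if match is None:
            raise ValueError(f"unrecognized port token: {token!r}")
        # A missing repeat count means a single uop.
        total += int(match.group(1) or 1)
    return total


def chain_cycles(n: int, latency: float) -> float:
    """Lower bound for n instructions that each depend on the previous result."""
    return n * latency


def independent_cycles(n: int, reciprocal_throughput: float) -> float:
    """Lower bound for n independent instructions of the same kind in one thread."""
    return n * reciprocal_throughput
```

For example, `count_uops("2p0156 p23")` gives 3. With an assumed latency of 3 and reciprocal throughput of 1, a chain of 100 dependent instructions needs at least 300 core clock cycles, while 100 independent instructions need at least 100.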

Integer instructions

Instruction
Move instructions
MOV
MOV
MOV
MOV
MOV
MOV
MOV

Operands

r,i
r8/16,r8/16
r32/64,r32/64
r8l,m
r8h,m
r16,m
r32/64,m

μops fused domain
μops unfused domain
μops each port
Latency
Reciprocal throughput
Comments

1
1
1
1
1
1
1

1
1
1
2
1
2
1

p0156
p0156
p0156
p23 p0156
p23
p23 p0156
p23


1
0-1

0.25
0.25
0.25
0.5
0.5
0.5
0.5

may be elim.

all addressing
modes

MOV
MOV
MOVNTI
MOVSX MOVZX
MOVSXD
MOVSX MOVZX
MOVSX MOVZX
MOVSXD
CMOVcc
CMOVcc
XCHG
XCHG
XLAT
PUSH
PUSH
PUSH
PUSH
PUSHF(D/Q)
PUSHA(D)
POP
POP
POP
POPF(D/Q)
POPA(D)
LAHF SAHF
SALC
LEA

m,r
m,i
m,r
r,r

1
1
2
1

2
2
2
1

p237 p4
p237 p4
p23 p4
p0156

r16,m8
r,m

1
1

2
1

p23 p0156
p23

r,r
r,m
r,r
r,m

2
3
3
8
3
2
2
3
3
4
19
1
3
3
9
18
1
3
2

2p0156
2p0156 p23
3p0156

r16,m

2
3
3
8
3
1
1
2
2
3
11
1
3
2
9
18
1
3
2

LEA

r32/64,m

LEA

r32/64,m

LEA
BSWAP
BSWAP
MOVBE
MOVBE
MOVBE
MOVBE
MOVBE
MOVBE
PREFETCHNTA/
0/1/2
LFENCE
MFENCE
SFENCE

3
~400
1

1
1
1
0.25
0.5
0.5

2
2
21
7
3

0.5
1
1
implicit lock

p23
p23 2p0156
2p237 p4

p06
3p0156
p1 p0156

1
1
4

2
1
1
1
1
1
8
0.5
4
1
18
9
1
1
1

p15

0.5

p1

r32/64,m

p1

r32
r64
r16,m16
r32,m32
r64,m64
m16,r16
m32,r32
m64,r64

1
2
3
2
3
2
2
3

1
2
3
2
3
3
3
4

p15
p06 p15
2p0156 p23
p15 p23
2p0156 p23
p06 p237 p4
p15 p237 p4
p06 p15 p237 p4

p23

0.5

2
2

none counted
p23 p4
p23 p4

4
33
5

r
i
m
stack pointer

r
stack pointer
m

2
3
2

p237 p4
p237 p4
p4 2p237
p0156 p237 p4
p1 p4 p237 p06

Arithmetic instructions

1
1
2

all other
combinations

0.5
1
0.5
0.5
0.5
1
1
1

not 64 bit

not 64 bit
not 64 bit
16 or 32 bit
address size
1 or 2 components in
address
3 components
in address
rip relative
address

MOVBE
MOVBE
MOVBE
MOVBE
MOVBE
MOVBE

ADD SUB
ADD SUB
ADD SUB

r,r/i
r,m
m,r/i

1
1
2

1
2
4

ADC SBB
ADC SBB
ADC SBB

r,r/i
r,m
m,r/i

2
2
4

2
3
6

CMP
CMP
INC DEC NEG
NOT
INC DEC NOT
NEG
AAA
AAS
DAA DAS
AAD
AAM
MUL IMUL
MUL IMUL
MUL IMUL
MUL IMUL
MUL IMUL
MUL IMUL
MUL IMUL
MUL IMUL
IMUL
IMUL
IMUL
IMUL
IMUL
IMUL
IMUL
IMUL
MULX
MULX
MULX
MULX
DIV
DIV
DIV
DIV
IDIV
IDIV
IDIV
IDIV
CBW
CWDE
CDQE
CWD

r,r/i
m,r/i
r

1
1
1

m
m

3
2
2
2
3
3
8
1
4
3
2
1
4
3
2
1
1
2
1
1
2
1
1
3
3
2
2
9
11
10
36
9
10
9
59
1
1
1
2

r8
r16
r32
r64
m8
m16
m32
m64
r,r
r,m
r16,r16,i
r32,r32,i
r64,r64,i
r16,m16,i
r32,m32,i
r64,m64,i
r32,r32,r32
r32,r32,m32
r64,r64,r64
r64,r64,m64
r8
r16
r32
r64
r8
r16
r32
r64

p0156
p0156 p23

2p0156 2p237 p4

2p0156
2p0156 p23

3p0156 2p237 p4

1
1
2

1
2
1

p0156
p0156 p23
p0156

1
1
1

0.25
0.5
0.25

4
4
2
2
3
3
8
1
4
3
2
2
5
4
3
1
2
2
1
1
3
2
2
3
4
2
3
9
11
10
36
9
10
9
59
1
1
1
2

p0156 2p237 p4
p0156 2p237 p4
p1 p0156
p1 p56
p1 2p0156
p1 2p0156
p0 p1 p5 p6
p1
p1 p0156
p1 p0156
p1 p6
p1
p1 p0156 p23
p1 p0156 p23
p1 p6 p23
p1
p1 p23
p1 p0156
p1
p1
p1 p0156 p23
p1 p23
p1 p23
p1 2p056
p1 2p056 p23
p1 p6
p1 p6 p23
p0 p1 p5 p6
p0 p1 p5 p6
p0 p1 p5 p6
p0 p1 p5 p6
p0 p1 p5 p6
p0 p1 p5 p6
p0 p1 p5 p6
p0 p1 p5 p6
p0156
p0156
p0156
p0156

6
6
4
6
4
4
21
3
4
4
3

1
1


3
4
3
3

4
4
22-25
23-26
22-29
32-96
23-26
23-26
22-29
39-103
1
1
1
1

0.25
0.5
1

8
1
2
2
1
1
2
2
1
1
1
1
1
1
1
1
1
1
1
1
1
9
9
9-11
21-74
8
8
8-11
24-81

not 64 bit
not 64 bit
not 64 bit
not 64 bit
not 64 bit

AVX2
AVX2
AVX2
AVX2

CDQ
CQO
POPCNT
POPCNT
CRC32
CRC32
Logic instructions
AND OR XOR
AND OR XOR
AND OR XOR
TEST
TEST
SHR SHL SAR
SHR SHL SAR
SHR SHL SAR
SHR SHL SAR
ROR ROL
ROR ROL
ROR ROL
ROR ROL
ROR ROL
RCR RCL
RCR RCL
RCR RCL
RCR RCL
RCR RCL
RCR RCL
SHRD SHLD
SHRD SHLD
SHLD
SHRD
SHRD SHLD
SHLX SHRX SARX
SHLX SHRX SARX

RORX
RORX
BT
BT
BT
BTR BTS BTC
BTR BTS BTC
BTR BTS BTC
BSF BSR
BSF BSR
SETcc
SETcc
CLC
STC

r,r
r,m
r,r
r,m

1
1
1
1
1
1

1
1
1
2
1
2

p06
p06
p1
p1 p23
p1
p1 p23

1
1
3

r,r/i
r,m
m,r/i

1
1
2

1
2
4

p0156
p0156 p23

2p0156 2p237 p4

r,r/i
m,r/i
r,i
m,i
r,cl
m,cl
r,1
r,i
m,i
r,cl
m,cl
r,1
m,1
r,i
m,i
r,cl
m,cl
r,r,i
m,r,i
r,r,cl
r,r,cl
m,r,cl
r,r,r
r,m,r
r,r,i
r,m,i
r,r/i
m,r
m,i
r,r/i
m,r
m,i
r,r
r,m
r
m

1
1
1
3
3
5
2
1
4
3
5
3
4
8
11
8
11
1
3
4
4
5
1
2
1
2
1
10
2
1
10
3
1
1
1
2
1
1

1
2
1
4
3
6
2
1
5
3
6
3
6
8
11
8
11
1
5
4
4
7
1
2
1
2
1
10
2
1
11
4
1
2
1
3
0
1

p0156
p0156 p23
p06
2p06 p237 p4
3p06
3p06 2p23 p4
2p06
p06
2p06 2p237 p4
3p06

1
1
1
1
2

2p06 p0156

p0156

p0156

p1

p0156
p0156

3
4

p06
p06 p23
p06
p06 p23
p06

p06 p23
p06
2p06 p23 p4
p1
p1 p23
p06
p06 p237 p4
none
p0156


1
1
2

1
1

3
1

1
1
1
1

SSE4.2
SSE4.2
SSE4.2
SSE4.2

0.25
0.5
1
0.25
0.5
0.5
2
2
4
1
0.5
2
2
4
2
3
6
6
6
6
1
2
2
2
4
0.5
0.5
0.5
0.5
0.5
5
0.5
0.5
5
2
1
1
0.5
1
0.25
0.25

short form

BMI2
BMI2
BMI2
BMI2

CMC
CLD STD
LZCNT
LZCNT
TZCNT
TZCNT
ANDN
ANDN
BLSI BLSMSK
BLSR
BLSI BLSMSK
BLSR
BEXTR
BEXTR
BZHI
BZHI
PDEP
PDEP
PEXT
PEXT

r,r
r,m
r,r
r,m
r,r,r
r,r,m
r,r

1
2
1
1
1
1
1
1
1

1
3
1
2
1
2
1
2
1

p0156
p15 p6
p1
p1 p23
p1
p1 p23
p15
p15 p23
p15

r,m

p15 p23

r,r,r
r,m,r
r,r,r
r,m,r
r,r,r
r,r,m
r,r,r
r,r,m

2
3
1
1
1
1
1
1

2
3
1
1
1
2
1
2

2p0156
2p0156 p23
p15
p15 p23
p1
p1 p23
p1
p1 p23

Control transfer instructions


JMP
short/near
JMP
r
JMP
m
Conditional jump
short/near

1
1
1
1

1
1
2
1

p6
p6
p23 p6
p6

1-2
2
2
1-2

Conditional jump

p06

0.5-1

p6

1-2

p06

0.5-1

2
7
11
2
2
3
1
3
15
4

2
7
11
3
3
4
2
4
15
4

p0156 p6

0.5-2
5
6
2
2
3
1
2
8
5

not 64 bit
not 64 bit

3
2
5n+12
3
<2n

3
2

2p0156 p23
p0156 p23

p23 p0156 p4

1
1
~2n
1
~0.5n

worst case

Fused arithmetic
and branch
Fused arithmetic
and branch
J(E/R)CXZ
LOOP
LOOP(N)E
CALL
CALL
CALL
RET
RET
BOUND
INTO
String instructions
LODSB/W
LODSD/Q
REP LODS
STOS
REP STOS

short/near

short
short
short
near
r
m
i
r,m

p237 p4 p6
p237 p4 p6
2p237 p4 p6
p237 p6
p23 2p6 p015


1
3
3
1
1
1

2
1
3
3

4
1
1
1
1
0.5
0.5
0.5

LZCNT
LZCNT
BMI1
BMI1
BMI1
BMI1
BMI1

0.5

BMI1

0.5
1
0.5
0.5
1
1
1
1

BMI1
BMI1
BMI2
BMI2
BMI2
BMI2
BMI2
BMI2

predicted
taken
predicted not
taken
predicted
taken
predicted not
taken

REP STOS

2.6/32B

MOVS
REP MOVS
REP MOVS

5
~2n
4/32B

SCAS
REP SCAS
CMPS
REP CMPS

3
6n
5
8n

p23 2p0156

2p23 3p0156

Synchronization instructions
XADD
m,r
LOCK XADD
m,r
LOCK ADD
m,r
CMPXCHG
m,r
LOCK CMPXCHG
m,r
CMPXCHG8B
m,r
LOCK CMPXCHG8B
m,r
CMPXCHG16B
m,r
LOCK CMPXCHG16B
m,r

4
9
8
5
10
15
19
22
24

5
9
8
6
10
15
19
22
24

1
1

0
0

Other
NOP (90)
Long NOP (0F
1F)
PAUSE
ENTER
ENTER
LEAVE
CPUID
XGETBV
RDTSC
RDPMC
RDRAND

a,0
a,b

1/32B
2p23 p4 2p0156

5
5
12
12
~14+7b ~45+7b
3
3
34-69 34-116
8
8
15
15
34
34
17
17

4
~1.5 n
1/32B

best case
aligned by 32
worst case
best case
aligned by 32

1
2n
4
2n

7
19
19
8
19
9
19
15
25

none
none

0.25
0.25

p05 3p6

9
8
~87+2b

2p0156 p23

p23 16p0156

6
100-250 100-250
9
24
37
~320

XGETBV

RDRAND

Floating point x87 instructions

Instruction
Move instructions
FLD
FLD
FLD
FBLD
FST(P)
FST(P)
FSTP
FBSTP

Operands

r
m32/64
m80
m80
r
m32/m64
m80
m80

μops fused domain
μops unfused domain
μops each port
Latency
Reciprocal throughput
Comments

1
1
4
43
1
1
7
238

1
1
4
43
1
2
7
226

p01
p23
2p01 2p23
p01
p4 p237
3p0156 2p23 2p4


1
3
4
47
1
4
1

0.5
0.5
2
22
0.5
1
5
265

FXCH
FILD
FIST(P)
FISTTP
FLDZ
FLD1
FLDPI FLDL2E etc.
FCMOVcc
FNSTSW
FNSTSW
FLDCW
FNSTCW
FINCSTP FDECSTP
FFREE(P)
FNSAVE
FRSTOR
Arithmetic instructions
FADD(P)
FSUB(R)(P)
FADD(P)
FSUB(R)(P)
FMUL(P)
FMUL(P)
FDIV(R)(P)
FDIV(R)(P)
FABS
FCHS
FCOM(P) FUCOM
FCOM(P) FUCOM

FCOMPP FUCOMPP
FCOMI(P) FUCOMI(P)
FIADD FISUB(R)
FIMUL
FIDIV(R)
FICOM(P)
FTST
FXAM
FPREM
FPREM1
FRNDINT
Math
FSCALE
FXTRACT
FSQRT
FSIN
FCOS
FSINCOS
F2XM1
FYL2X
FYL2XP1

r
m
m
m

0
2
3
3
1
2
2
3
2
3
3
3
1
1
147
90

none
p01 p23
p1 p23 p4
p1 p23 p4
p01
2p01
2p01
2p0 p5
p0 p0156
p0 p4 p237
p01 p23 p6
p237 p4 p6
p01
p01

0
6
7
7

r
m
m

2
1
3
3
1
2
2
3
2
2
3
2
1
1
147
90

p1

m
r
m
r
m

1
1
1
1
1
1
1
1
1
2
3
2
2
2
2
1
2
28
41
17

2
1
2
1
2
1
1
1
1
2
3
3
3
3
3
1
2
28
41
17

p1 p23
p0
p0 p23
p0
p0 p23
p0
p0
p1
p1 p23
2p01
3p01
2p1 p23
p0 p1 p23
p0 p1 p23
2p1 p23
p1
2p1

r
AX
m16
m16
m16

r
m
r
m
m
m
m

25-75
17
1
71-100
110
70-120
58-89
55-417
55-228

17
1


2
6
7
0

5
10-24
1
1

19
27
11

p0

49-125
15
10-23
47-106
112
52-123
63-68
58-680
58-360

0.5
1
1
2
1
2
2
2
1
1
2
1
0.5
0.5
150
164

1
1
1
1
8-18
8-18
1
1
1
1
1
1.5
2
2
2
1
2
13
17
23

11
8-17

SSE3

FPTAN
FPATAN

110-121
78-160

Other
FNOP
WAIT
FNCLEX
FNINIT

1
2
5
26

130
96-156

1
2
5
26

p01
p01
p0156

0.5
1
22
83

Integer MMX and XMM instructions

Instruction
Move instructions
MOVD
MOVD
MOVD
MOVD
MOVQ
MOVQ
MOVQ
MOVQ
MOVQ
MOVDQA/U
MOVDQA/U
MOVDQA/U
VMOVDQA/U
VMOVDQA/U
VMOVDQA/U
LDDQU
MOVDQ2Q
MOVQ2DQ
MOVNTQ
MOVNTDQ
VMOVNTDQ
MOVNTDQA
VMOVNTDQA
PACKSSWB/DW
PACKUSWB
PACKSSWB/DW
PACKUSWB
PACKSSWB/DW
PACKUSWB
PACKSSWB/DW
PACKUSWB
PACKUSDW
PACKUSDW
PUNPCKH/L
BW/WD/DQ

Operands

μops fused domain
μops unfused domain
μops each port
Latency
Reciprocal throughput
Comments

r32/64,(x)mm
m32/64,(x)mm
(x)mm,r32/64
(x)mm,m32/64
r64,(x)mm
(x)mm,r64
(x)mm,(x)mm
(x)mm,m64
m64, (x)mm
x,x
x, m128
m128, x

1
1
1
1
1
1
1
1
1
1
1
1

1
2
1
1
1
1
1
2
1
1
2

p0
p237 p4
p5
p23
p0
p5
p015
p23
p237 p4
p015
p23
p237 p4

1
3
1
3
1
1
1
3
3
0-1
3
3

1
1
1
0.5
1
1
0.33
0.5
1
0.33
0.5
1

y,y
y,m256
m256,y
x, m128
mm, x
x,mm
m64,mm
m128,x
m256,y
x, m128
y,m256

1
1
1
1
2
1
1
1
1
1

1
1
2
1
2
1
2
2
2
1
1

p015
p23
p237 p4
p23
p01 p5
p015
p237 p4
p237 p4
p237 p4
p23
p23

0-1
3
4
3
1
1
~400
~400
~400
3
3

0.33
1
1
0.5
1
0.33
1
1
1
0.5
0.5

mm,mm

p5

mm,m64

p23 2p5

x,x / y,y,y

p5

x,m / y,y,m
x,x / y,y,y
x,m / y,y,m

1
1
1

2
1
2

p23 p5
p5
p23 p5

1
1
1

v,v / v,v,v

p5


may be elim.

AVX
may be elim.
AVX
AVX
SSE3

AVX2
SSE4.1
AVX2

SSE4.1
SSE4.1

PUNPCKH/L
BW/WD/DQ
PUNPCKH/L
QDQ
PUNPCKH/L
QDQ

v,m / v,v,m

p23 p5

x,x / y,y,y

p5

x,m / y,y,m

p23 p5

PMOVSX/ZX BW
BD BQ DW DQ

x,x

p5

PMOVSX/ZX BW
BD BQ DW DQ

x,m

p23 p5

VPMOVSX/ZX BW
BD BQ DW DQ

y,x

p5

y,m
v,v / v,v,v
v,m / v,v,m
mm,mm,i
mm,m64,i
v,v,i
v,m,i
v,v,i
v,m,i
v,v,i / v,v,v,i
v,m,i / v,v,m,i
x,x,xmm0
x,m,xmm0
v,v,v,v
v,v,m,v
x,x,i / v,v,v,i
x,m,i / v,v,m,i
v,v,v,i
v,v,m,i
y,y,y
y,y,m
y,y,i
y,m,i
y,y,y,i
y,y,m,i
mm,mm
x,x
v,v,m
m,v,v
r,v
r32,x,i
m8,x,i
x,y,i
m,y,i
x,r32,i
x,m8,i
(x)mm,r32,i
(x)mm,m16,i
x,r32,i
x,m32,i
y,y,x,i

2
1
2
1
2
1
2
1
2
1
2
2
3
2
3
1
2
1
2
1
1
1
2
1
2
4
10
3
4
1
2
2
1
2
2
2
2
2
2
2
1

2
1
2
1
2
1
2
1
2
1
2
2
3
2
3
1
2
1
2
1
2
1
2
1
2
4
10
3
4
1
2
3
1
2
2
2
2
2
2
2
1

p5 p23
p5
p23 p5
p5
p23 p5
p5
p23 p5
p5
p23 p5
p5
p23 p5
2p5
2p5 p23
2p5
2p5 p23
p5
p23 p5
p015
p015 p23
p5
p5 p23
p5
p5 p23
p5
p5 p23
p0 p4 2p23
4p04 2p56 4p23
p23 2p5
p0 p1 p4 p23
p0
p0 p5
p23 p4 p5
p5
p23 p4
p5
p23 p5
p5
p23 p5
p5
p23 p5
p5

VPMOVSX/ZX BW
BD BQ DW DQ

PSHUFB
PSHUFB
PSHUFW
PSHUFW
PSHUFD
PSHUFD
PSHUFL/HW
PSHUFL/HW
PALIGNR
PALIGNR
PBLENDVB
PBLENDVB
VPBLENDVB
VPBLENDVB
PBLENDW
PBLENDW
VPBLENDD
VPBLENDD
VPERMD
VPERMD
VPERMQ
VPERMQ
VPERM2I128
VPERM2I128
MASKMOVQ
MASKMOVDQU
VPMASKMOVD/Q
VPMASKMOVD/Q

PMOVMSKB
PEXTRB/W/D/Q
PEXTRB/W/D/Q
VEXTRACTI128
VEXTRACTI128

PINSRB
PINSRB
PINSRW
PINSRW
PINSRD/Q
PINSRD/Q
VINSERTI128


1
1

1
1

1
1
1
1
1
2
2
1
1
3
3
3
13-413
14-438
4
13-14
3
2
3
4
2
2
2
3

SSE4.1

SSE4.1

AVX2

1
1
1
1
1
1
1
1
1
1
1
2
1
2
2
1
1
0.33
0.5
1
1
1
1
1
1
1
6
2
1
1
1
1
1
1
2
1
2
1
2
1
1

AVX2
SSSE3
SSSE3

SSSE3
SSSE3
SSE4.1
SSE4.1
AVX2
AVX2
SSE4.1
SSE4.1
AVX2
AVX2
AVX2
AVX2
AVX2
AVX2
AVX2
AVX2

AVX2
AVX2
SSE4.1
SSE4.1
AVX2
AVX2
SSE4.1
SSE4.1

SSE4.1
SSE4.1
AVX2

VINSERTI128
VPBROADCAST
B/W/D/Q
VPBROADCAST
B/W
VPBROADCAST
D/Q
VPBROADCAST
B/W/D/Q
VPBROADCAST
B/W
VPBROADCAST
D/Q

y,y,m,i

p015 p23

0.5

AVX2

x,x

p5

AVX2

x,m8/16

p01 p23 p5

AVX2

x,m32/64

p23

0.5

AVX2

y,x

p5

AVX2

y,m8/16

p01 p23 p5

AVX2

y,m32/64
y,m128
x,[r+s*x],x
y,[r+s*y],y
x,[r+s*x],x
x,[r+s*y],x
x,[r+s*x],x
y,[r+s*x],y
x,[r+s*x],x
y,[r+s*y],y

1
1
20
34
15
22
12
20
14
22

1
1
20
34
15
22
12
20
14
22

p23
p23

5
3

0.5
0.5
9
12
8
7
7
9
7
9

AVX2
AVX2
AVX2
AVX2
AVX2
AVX2
AVX2
AVX2
AVX2
AVX2

PADD/SUB(S,US)
B/W/D/Q

v,v / v,v,v

p15

0.5

PADD/SUB(S,US)
B/W/D/Q

v,m / v,v,m

p15 p23

v,v / v,v,v

p1 2p5

v,m / v,v,m

p1 2p5 p23

v,v / v,v,v

p15

v,m / v,v,m
v,v / v,v,v
v,m / v,v,m
v,v / v,v,v
v,m / v,v,m

1
1
1
1
1

2
1
2
1
2

p15 p23
p15
p15 p23
p0
p0 p23

v,v / v,v,v

p0

v,m / v,v,m
v,v / v,v,v
v,m / v,v,m
x,x / y,y,y
x,m / y,y,m
x,x / y,y,y
x,m / y,y,m
v,v / v,v,v
v,m / v,v,m
v,v / v,v,v

1
1
1
2
3
1
1
1
1
1

2
1
2
2
3
1
2
1
2
1

p0 p23
p0
p0 p23
2p0
2p0 p23
p0
p0 p23
p0
p0 p23
p0

VBROADCASTI128

VPGATHERDD
VPGATHERDD
VPGATHERQD
VPGATHERQD
VPGATHERDQ
VPGATHERDQ
VPGATHERQQ
VPGATHERQQ
Arithmetic instructions

PHADD(S)W/D
PHSUB(S)W/D
PHADD(S)W/D
PHSUB(S)W/D
PCMPEQB/W/D
PCMPGTB/W/D
PCMPEQB/W/D
PCMPGTB/W/D
PCMPEQQ
PCMPEQQ
PCMPGTQ
PCMPGTQ
PMULL/HW
PMULHUW
PMULL/HW
PMULHUW
PMULHRSW
PMULHRSW
PMULLD
PMULLD
PMULDQ
PMULDQ
PMULUDQ
PMULUDQ
PMADDWD


0.5
3

1
5

5
10
5
5
5

SSSE3

SSSE3

0.5
0.5
0.5
0.5
1
1

SSE4.1
SSE4.1
SSE4.2
SSE4.2

1
1
1
1
2
2
1
1
1
1
1

SSSE3
SSSE3
SSE4.1
SSE4.1
SSE4.1
SSE4.1

PMADDWD
PMADDUBSW
PMADDUBSW
PAVGB/W
PAVGB/W
PMIN/PMAX
SB/SW/SD
UB/UW/UD
PMIN/PMAX
SB/SW/SD
UB/UW/UD
PHMINPOSUW
PHMINPOSUW
PABSB/W/D
PABSB/W/D
PSIGNB/W/D
PSIGNB/W/D
PSADBW
PSADBW
MPSADBW
MPSADBW
Logic instructions
PAND PANDN
POR PXOR
PAND PANDN
POR PXOR
PTEST
PTEST
PSLLW/D/Q
PSRLW/D/Q
PSRAW/D/Q
PSLLW/D/Q
PSRLW/D/Q
PSRAW/D/Q
PSLLW/D/Q
PSRLW/D/Q
PSRAW/D/Q
PSLLW/D/Q
PSRLW/D/Q
PSRAW/D/Q
PSLLW/D/Q
PSRLW/D/Q
PSRAW/D/Q
VPSLLVD/Q
VPSRAVD
VPSRLVD/Q
VPSLLVD/Q
VPSRAVD
VPSRLVD/Q
PSLLDQ
PSRLDQ

v,m / v,v,m
v,v / v,v,v
v,m / v,v,m
v,v / v,v,v
v,m / v,v,m

1
1
1
1
1

2
1
2
1
2

p0 p23
p0
p0 p23
p15
p15 p23

x,x / y,y,y

p15

x,m / y,y,m
x,x
x,m128
v,v
v,m
v,v / v,v,v
v,m / v,v,m
v,v / v,v,v
v,m / v,v,m
x,x,i / v,v,v,i
x,m,i / v,v,m,i

1
1
1
1
1
1
1
1
1
3
4

2
1
2
1
2
1
2
1
2
3
4

p15 p23
p0
p1 p23
p15
p15 p23
p15
p15 p23
p0
p0 p23
p0 2p5
p0 2p5 p23

v,v / v,v,v

p015

0.33

v,m / v,v,m
v,v
v,m

1
2
2

2
2
3

p015 p23
p0 p5
p0 p5 p23

0.5
1
1

mm,mm

p0

mm,m64

p0 p23

x,x / v,v,x

p0 p5

x,m / v,v,m

p0 p23

v,i / v,v,i

p0

v,v,v

2p0 p5

AVX2

v,v,m

2p0 p5 p23

AVX2

x,i / v,v,i

p5


5
1

5
1
1
5
6

1
1
1
0.5
0.5

SSSE3
SSSE3

0.5

SSE4.1

0.5
1
1
0.5
0.5
0.5
0.5
1
1
2
2

SSE4.1
SSE4.1
SSE4.1
SSSE3
SSSE3
SSSE3
SSSE3

SSE4.1
SSE4.1

SSE4.1
SSE4.1

String instructions
PCMPESTRI
PCMPESTRI
PCMPESTRM
PCMPESTRM
PCMPISTRI
PCMPISTRI
PCMPISTRM
PCMPISTRM

x,x,i
x,m128,i
x,x,i
x,m128,i
x,x,i
x,m128,i
x,x,i
x,m128,i

Encryption instructions
PCLMULQDQ
x,x,i
PCLMULQDQ
x,m,i
AESDEC,
AESDECLAST,
AESENC,
AESENCLAST
x,x
AESDEC,
AESDECLAST,
AESENC,
AESENCLAST
x,m
AESIMC
x,x
AESIMC
x,m
AESKEYGENASSIST
x,x,i
AESKEYGENASSIST
x,m,i
Other
EMMS

8
8
9
9
3
4
3
4

8
8
9
9
3
4
3
4

6p05 2p16

11

4
4
5
5
3
3
3
3

SSE4.2
SSE4.2
SSE4.2
SSE4.2
SSE4.2
SSE4.2
SSE4.2
SSE4.2

3p0 2p16 4p5


6p05 2p16 p23
3p0
3p0 p23
3p0
3p0 p23

10

3
4

3
4

2p0 p5
2p0 p5 p23

2
2

CLMUL
CLMUL

p5

AES

2
2
3

2
2
3

p5 p23
2p5
2p5 p23

14

1.5
2
2

AES
AES
AES

10

10

2p0 8p5

10

AES

10

10

2p0 p23 7p5

AES

31

31

3p0 2p16 2p5 p23

11
10

13

Floating point XMM and YMM instructions

Instruction
Move instructions
MOVAPS/D
VMOVAPS/D

Operands

μops fused domain
μops unfused domain
μops each port
Latency
Reciprocal throughput
Comments

x,x
y,y

1
1

1
1

p5
p5

0-1
0-1

1
1

MOVAPS/D
MOVUPS/D

x,m128

p23

0.5

VMOVAPS/D
VMOVUPS/D

y,m256

p23

0.5

m128,x

p237 p4

m256,y
x,x
x,m32/64
m32/64,x
x,m64

1
1
1
1
1

2
1
1
2
2

p237 p4
p5
p23
p237 p4
p23 p5

4
1
3
3
4

1
1
0.5
1
1

MOVAPS/D
MOVUPS/D
VMOVAPS/D
VMOVUPS/D
MOVSS/D
MOVSS/D
MOVSS/D
MOVHPS/D


may be elim.
may be elim.

AVX

AVX

MOVHPS/D
MOVLPS/D
MOVLPS/D
MOVHLPS
MOVLHPS
MOVMSKPS/D
VMOVMSKPS/D
MOVNTPS/D
VMOVNTPS/D
SHUFPS/D
SHUFPS/D
VPERMILPS/PD
VPERMILPS/PD
VPERMILPS/PD
VPERMILPS/PD

VPERM2F128
VPERM2F128
VPERMPS
VPERMPS
VPERMPD
VPERMPD
BLENDPS/PD
BLENDPS/PD
BLENDVPS/PD
BLENDVPS/PD
VBLENDVPS/PD
VBLENDVPS/PD

MOVDDUP
MOVDDUP
VBROADCASTSS
VBROADCASTSS
VBROADCASTSS
VBROADCASTSS
VBROADCASTSD
VBROADCASTSD
VBROADCASTF128

MOVSH/LDUP
MOVSH/LDUP
UNPCKH/LPS/D
UNPCKH/LPS/D
EXTRACTPS
EXTRACTPS
VEXTRACTF128
VEXTRACTF128
INSERTPS
INSERTPS
VINSERTF128
VINSERTF128
VMASKMOVPS/D
VMASKMOVPS/D
VMASKMOVPS/D

VPGATHERDPS
VPGATHERDPS

m64,x
x,m64
m64,x
x,x
x,x
r32,x
r32,y
m128,x
m256,y
x,x,i / v,v,v,i
x,m,i / v,v,m,i
v,v,i
v,m,i
v,v,v
v,v,m
y,y,y,i
y,y,m,i
y,y,y
y,y,m
y,y,i
y,m,i
x,x,i / v,v,v,i
x,m,i / v,v,m,i
x,x,xmm0
x,m,xmm0
v,v,v,v
v,v,m,v
v,v
v,m
x,m32
y,m32
x,x
y,x
y,m64
y,x
y,m128
v,v
v,m
x,x / v,v,v
x,m / v,v,m
r32,x,i
m32,x,i
x,y,i
m128,y,i
x,x,i
x,m32,i
y,y,x,i
y,y,m128,i
v,v,m
m128,x,x
m256,y,y
x,[r+s*x],x
y,[r+s*y],y

1
1
1
1
1
1
1
1
1
1
2
1
2
1
2
1
2
1
1
1
2
1
2
2
3
2
3
1
1
1
1
1
1
1
1
1
1
1
1
1
2
3
1
2
1
2
1
2
3
4
4
20
34

2
2
2
1
1
1
1
2
2
1
2
1
2
1
2
1
2
1
2
1
2
1
2
2
3
2
3
1
1
1
1
1
1
1
1
1
1
1
1
2
2
3
1
2
1
2
1
2
3
4
4
20
34

p4 p237
p23 p5
p4 p237
p5
p5
p0
p0
p4 p237
p4 p237
p5
p5 p23
p5
p5 p23
p5
p5 p23
p5
p5 p23
p5
p5 p23
p5
p5 p23
p015
p015 p23
p5
2p5 p23
2p5
2p5 p23
p5
p23
p23
p23
p5
p5
p23
p5
p23
p5
p23
p5
p5 p23
p0 p5
p0 p5 p23
p5
p23 p4
p5
p23 p5
p5
p015 p23
2p5 p23
p0 p1 p4 p23
p0 p1 p4 p23


3
4
3
1
1
3
2
~400
~400
1
1
1
3
3
3
1
2
2
1
3
4
5
1
3
5
3
3
1
3
1

4
3
4
1
4
3
4
4
13
14

1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
0.33
0.5
2
2
2
2
1
0.5
0.5
0.5
1
1
0.5
1
0.5
1
0.5
1
1
1
1
1
1
1
1
1
1
2
1
2
9
12

AVX

AVX
AVX
AVX
AVX
AVX
AVX
AVX2
AVX2
AVX2
AVX2
SSE4.1
SSE4.1
SSE4.1
SSE4.1
AVX
AVX
SSE3
SSE3
AVX
AVX
AVX2
AVX2
AVX
AVX2
AVX
SSE3
SSE3
SSE3
SSE3
SSE4.1
SSE4.1
AVX
AVX
SSE4.1
SSE4.1
AVX
AVX
AVX
AVX
AVX
AVX2
AVX2

VPGATHERQPS
VPGATHERQPS
VPGATHERDPD
VPGATHERDPD
VPGATHERQPD
VPGATHERQPD

x,[r+s*x],x
x,[r+s*y],x
x,[r+s*x],x
y,[r+s*x],y
x,[r+s*x],x
y,[r+s*y],y

15
22
12
20
14
22

15
22
12
20
14
22

Conversion
CVTPD2PS
CVTPD2PS
VCVTPD2PS
VCVTPD2PS
CVTSD2SS
CVTSD2SS
CVTPS2PD
CVTPS2PD
VCVTPS2PD
VCVTPS2PD
CVTSS2SD
CVTSS2SD
CVTDQ2PS
CVTDQ2PS
VCVTDQ2PS
VCVTDQ2PS
CVT(T) PS2DQ
CVT(T) PS2DQ
VCVT(T) PS2DQ
VCVT(T) PS2DQ
CVTDQ2PD
CVTDQ2PD
VCVTDQ2PD
VCVTDQ2PD
CVT(T)PD2DQ
CVT(T)PD2DQ
VCVT(T)PD2DQ
VCVT(T)PD2DQ
CVTPI2PS
CVTPI2PS
CVT(T)PS2PI
CVT(T)PS2PI
CVTPI2PD
CVTPI2PD
CVT(T) PD2PI
CVT(T) PD2PI
CVTSI2SS
CVTSI2SS
CVT(T)SS2SI
CVT(T)SS2SI
CVTSI2SD
CVTSI2SD
CVT(T)SD2SI
CVT(T)SD2SI

x,x
x,m128
x,y
x,m256
x,x
x,m64
x,x
x,m64
y,x
y,m128
x,x
x,m32
x,x
x,m128
y,y
y,m256
x,x
x,m128
y,y
y,m256
x,x
x,m64
y,x
y,m128
x,x
x,m128
x,y
x,m256
x,mm
x,m64
mm,x
mm,m128
x,mm
x,m64
mm,x
mm,m128
x,r32
x,m32
r32,x
r32,m32
x,r32/64
x,m32
r32/64,x
r32,m64

2
2
2
2
2
3
2
2
2
2
2
2
1
1
1
1
1
1
1
1
2
2
2
2
2
2
2
2
1
1
2
2
2
2
2
2
2
1
2
2
2
2
2
2

2
3
2
3
2
3
2
2
2
2
2
2
1
2
1
2
1
2
1
2
2
2
2
2
2
3
2
3
1
2
2
2
2
2
2
3
2
2
2
3
2
2
2
3

8
7
7
9
7
9

p1 p5
p1 p5 p23
p1 p5
p1 p5 p23
p1 p5
p1 p5 p23
p0 p5
p0 p23
p0 p5
p0 p23
p0 p5
p0 p23
p1
p1 p23
p1
p1 p23
p1
p1 p23
p1
p1 p23
p1 p5
p1 p23
p1 p5
p1 p23
p1 p5
p1 p5 p23
p1 p5
p1 p5 p23
p1
p1 p23
p1 p5
p1 p23
p1 p5
p1 p23
p1 p5
p1 p5 p23
p1 p5
p1 p23
p0 p1
p0 p1 p23
p1 p5
p1 p23
p0 p1
p0 p1 p23

4
5
4
2
5
2
3
3
3
3
4
6
4
6
4
4
4
4
4
4
4
4

1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
4
3
1
1
1
1
1
1
3
3
1
1
3
3
1
1

AVX2
AVX2
AVX2
AVX2
AVX2
AVX2

AVX
AVX

AVX
AVX

AVX
AVX

AVX
AVX

AVX
AVX

AVX
AVX

VCVTPS2PH
VCVTPS2PH
VCVTPH2PS
VCVTPH2PS
Arithmetic
ADDSS/D PS/D
SUBSS/D PS/D
ADDSS/D PS/D
SUBSS/D PS/D
ADDSUBPS/D
ADDSUBPS/D
HADDPS/D
HSUBPS/D
HADDPS/D
HSUBPS/D
MULSS/D PS/D
MULSS/D PS/D
DIVSS DIVPS
DIVSS DIVPS
DIVSD DIVPD
DIVSD DIVPD
VDIVPS
VDIVPS
VDIVPD
VDIVPD
RCPSS/PS
RCPSS/PS
VRCPPS
VRCPPS
CMPccSS/D
CMPccPS/D
CMPccSS/D
CMPccPS/D
(U)COMISS/D
(U)COMISS/D
MAXSS/D PS/D
MINSS/D PS/D
MAXSS/D PS/D
MINSS/D PS/D
ROUNDSS/D PS/D
ROUNDSS/D PS/D

DPPS
DPPS
DPPD
DPPD
VFMADD...
(all FMA instr.)
VFMADD...
(all FMA instr.)
Math
SQRTSS/PS

x,v,i
m,v,i
v,x
v,m

2
4
2
2

2
4
2
2

p1 p5
p1 p4 p5 p23
p1 p5
p1 p23

x,x / v,v,v

p1

x,m / v,v,m
x,x / v,v,v
x,m / v,v,m

1
1
1

2
1
2

p1 p23
p1
p1 p23

1
1
1

SSE3
SSE3

x,x / v,v,v

p1 2p5

SSE3

x,m / v,v,m
x,x / v,v,v
x,m / v,v,m
x,x
x,m
x,x
x,m
y,y,y
y,y,m256
y,y,y
y,y,m256
x,x
x,m128
y,y
y,m256

4
1
1
1
1
1
1
3
4
3
4
1
1
3
4

4
1
2
1
2
1
2
3
4
3
4
1
2
3
4

p1 2p5 p23
p01
p01 p23
p0
p0 p23
p0
p0 p23
2p0 p15
2p0 p15 p23
2p0 p15
2p0 p15 p23
p0
p0 p23
2p0 p15
2p0 p15 p23

2
0.5
0.5
7
7
8-14
8-14
14
14
16-28
16-28
1
1
2
2

SSE3

x,x / v,v,v

p1

x,m / v,v,m
x,x
x,m32/64

2
1
2

2
1
2

p1 p23
p1
p1 p23

x,x / v,v,v

p1

x,m / v,v,m
v,v,i
v,m,i
x,x,i / v,v,v,i
x,m,i / v,v,m,i
x,x,i
x,m128,i

1
2
3
4
6
3
4

2
2
3
4
6
3
4

p1 p23
2p1
2p1 p23
2p0 p1 p5

v,v,v

v,v,m

x,x

5
10-13
10-20
18-21
19-35
5
7

F16C
F16C
F16C
F16C

AVX
AVX
AVX
AVX

AVX
AVX

1
1
1
1

6
14

2p0 p1 p5 p23 p6

p0 p1 p5
p0 p1 p5 p23

p01

p01 p23

p0

1
1
1
1

11

1
1
2
2
2
4
1
1

SSE4.1
SSE4.1
SSE4.1
SSE4.1
SSE4.1
SSE4.1

0.5

FMA

0.5

FMA

SQRTSS/PS
VSQRTPS
VSQRTPS
SQRTSD/PD
SQRTSD/PD
VSQRTPD
VSQRTPD
RSQRTSS/PS
RSQRTSS/PS
VRSQRTPS
VRSQRTPS

x,m128
y,y
y,m256
x,x
x,m128
y,y
y,m256
x,x
x,m128
y,y
y,m256

1
3
4
1
1
3
4
1
1
3
4

2
3
4
1
2
3
4
1
2
3
4

p0 p23
2p0 p15
2p0 p15 p23
p0
p0 p23
2p0 p15
2p0 p15 p23
p0
p0 p23
2p0 p15
2p0 p15 p23

28-29

AND/ANDN/OR/XORPS/PD

x,x / v,v,v

p5

AND/ANDN/OR/XORPS/PD

x,m / v,v,m

p5 p23

Other
VZEROUPPER

none

VZEROALL

12

12

none

10

VZEROALL
LDMXCSR
STMXCSR
VSTMXCSR
FXSAVE
FXRSTOR
XSAVE
XRSTOR
XSAVEOPT

20
3
3
3
130
116
224
173

20
3
4

none
p0 p6 p23
p0 p4 p6 p237

8
3
1
1
68
72
84
111

19
16

5
7

7
14
14
8-14
8-14
16-28
16-28
1
1
2
2

AVX
AVX

AVX
AVX

AVX
AVX

Logic

m32
m32
m32
m4096
m4096


6
7

AVX
AVX,
32 bit
AVX,
64 bit

AVX

Pentium 4

Intel Pentium 4
List of instruction timings and μop breakdown
This list is measured for a Pentium 4, model 2. Timings for model 3 may be more like the values for P4E, listed on the next sheet.

Explanation of column headings:

Instruction: Instruction name. cc means any condition code. For example, Jcc can be JB, JNE, etc.

Operands: i = immediate constant, r = any register, r32 = 32-bit register, etc., mm = 64 bit mmx register, xmm = 128 bit xmm register, sr = segment register, m = any memory operand including indirect operands, m64 means 64-bit memory operand, etc.

μops: Number of μops issued from instruction decoder and stored in trace cache.

Microcode: Number of additional μops issued from microcode ROM.

Latency: This is the delay that the instruction generates in a dependency chain if the next dependent instruction starts in the same execution unit. The numbers are minimum values. Cache misses, misalignment, and exceptions may increase the clock counts considerably. Floating point operands are presumed to be normal numbers. Denormal numbers, NaNs, infinity and exceptions increase the delays. The latency of moves to and from memory cannot be measured accurately because of the problem with memory intermediates explained above under "How the values were measured".

Additional latency: This number is added to the latency if the next dependent instruction is in a different execution unit. There is no additional latency between ALU0 and ALU1.

Reciprocal throughput: This is also called issue latency. This value indicates the number of clock cycles from when the execution of an instruction begins until a subsequent independent instruction can begin to execute in the same execution subunit. A value of 0.25 indicates 4 instructions per clock cycle in one thread.

Port: The port through which each μop goes to an execution unit. Two independent μops can start to execute simultaneously only if they are going through different ports.

Execution unit: Use this information to determine additional latency. When an instruction with more than one μop uses more than one execution unit, only the first and the last execution unit are listed.

Execution subunit: Throughput measures apply only to instructions executing in the same subunit.

Instruction set: Indicates the compatibility of an instruction with other 80x86 family microprocessors. The instruction can execute on microprocessors that support the instruction set indicated.
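
The way the Latency, Additional latency and Reciprocal throughput columns combine can be sketched in Python as follows. This is only an illustration of the definitions above: the function names are hypothetical and the numbers used with them are illustrative values, not figures for any particular instruction.

```python
def effective_latency(latency: float, additional_latency: float,
                      same_execution_unit: bool) -> float:
    """Delay seen by the next dependent instruction.

    The additional latency is added only when the dependent instruction
    runs in a different execution unit. The caller decides what counts as
    "the same unit"; per the text above, ALU0 and ALU1 have no additional
    latency between them.
    """
    if same_execution_unit:
        return latency
    return latency + additional_latency


def instructions_per_clock(reciprocal_throughput: float) -> float:
    """Convert a reciprocal throughput into instructions per clock cycle."""
    return 1.0 / reciprocal_throughput
```

For instance, an assumed latency of 4 with an additional latency of 1 gives an effective delay of 5 clock cycles when the dependent instruction goes to a different execution unit, and `instructions_per_clock(0.25)` gives 4.0, matching the statement that a value of 0.25 means 4 instructions per clock cycle.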

Integer instructions


alu0/1
alu0/1
load
load
store
store

86
86
86
86
86
86
86
86
sse2
386
386
386
386
ppro
86
86
86
86
186
86
86
86
186
86
86
86
86
186
86
86
386
386
386
86
86
86
86
486

alu0/1
load
alu0

alu0/1

alu0/1
alu0/1
int,alu
int,alu
int,alu
int
alu0/1
int
int,alu
86

sse
sse

Notes


Instruction set

r,m
r
r,r/i
m
m

Subunit

r,[r+r/i]
r,[r+r+i]
r,[r*i]
r,[r+r*i]
r,[r+r*i+i]

0
0,5 0.5-1 0,25 0/1
0
0,5 0.5-1 0,25 0/1
0
2
0
1
2
0
3
0
1
2
0
1
2
0
0
2 0,3
2
6
4
12
0
14
0
33
0
0,5 0.5-1 0,25 0/1
0
2
0
1
2
0
0,5 0.5-1 0,5 0
0
3 0.5-1 1 2,0
0
6
0
3
0
1,5 0.5-1 1 0/1
8 >100
0
3
0
1
2
0
1
2
0
2
4
7
4
10
10
19
0
1
0
1
8
14
5
13
8
52
16
14
0
0,5 0.5-1 0,25 0/1
0
1 0.5-1 0,5 0/1
0
4 0.5-1 1
1
0
4 0.5-1 1
1
0
4 0.5-1 1
1
0
4
0
4
1
0
0,5 0.5-1 0,5 0/1
0
5
0
1
1
7
15
0
7
0
2
64
>1000
2
6
2
6

Execution unit

r
m
sr

Port
Reciprocal throughput

r
i
m
sr

Additional latency

1
1
1
2
1
3
4
4
2
1
1
1
2
3
3
4
4
2
2
3
4
4
4
2
4
4
4
4
1
2
3
2
3
1
1
3
4
3
8
4
4

Latency

r,r
r,i
r32,m
r8/16,m
m,r
m,i
r,sr
sr,r/m
m,r32
r,r
r,m
r,r
r,m
r,r/m
r,r
r,m

Microcode

Move instructions
MOV
MOV
MOV
MOV
MOV
MOV
MOV
MOV
MOVNTI
MOVZX
MOVZX
MOVSX
MOVSX
CMOVcc
XCHG
XCHG
XLAT
PUSH
PUSH
PUSH
PUSH
PUSHF(D)
PUSHA(D)
POP
POP
POP
POPF(D)
POPA(D)
LEA
LEA
LEA
LEA
LEA
LAHF
SAHF
SALC
LDS, LES, ...
BSWAP
IN, OUT
PREFETCHNTA
PREFETCHT0/1/2

Operands

ops

Instruction

b, c

a, q
c
c
a, e

SFENCE
LFENCE
MFENCE
Arithmetic instructions
ADD, SUB
ADD, SUB
ADD, SUB
ADC, SBB
ADC, SBB
ADC, SBB
ADC, SBB
CMP
CMP
INC, DEC
INC, DEC
NEG
NEG
AAA, AAS
DAA, DAS
AAD
AAM
MUL, IMUL
MUL, IMUL
MUL, IMUL
MUL, IMUL
IMUL
IMUL
IMUL
IMUL
IMUL
IMUL
IMUL
DIV
DIV
DIV
IDIV
IDIV
IDIV
CBW
CWD, CDQ
CWDE
Logic instructions
AND, OR, XOR
AND, OR, XOR
AND, OR, XOR
TEST
TEST
NOT

4
4
4

r,r
r,m
m,r
r,r
r,i
r,m
m,r
r,r
r,m
r
m
r
m

r8/32
r16
m8/32
m16
r32,r
r32,(r),i
r16,r
r16,r,i
r16,m16
r32,m32
r,m,i
r8/m8
r16/m16
r32/m32
r8/m8
r16/m16
r32/m32

r,r
r,m
m,r
r,r
r,m
r

1
2
3
4
3
4
4
1
2
2
4
1
3
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
2
2
1

1
2
3
1
2
1

2
2
2

40
38
100

0
0,5 0.5-1 0,25
0
1 0.5-1 1
0
8
4
4
6
0
6
0
6
0
6
6
8
0
8
7
9
8
0
0,5 0.5-1 0,25
0
1 0.5-1 1
0
0,5 0.5-1 0,5
0
4
4
0
0,5 0.5-1 0,5
0
3
27 90
57 100
10 22
22 56
6
16
0
8
7
17
0
8
7-8 16
0
8
10 16
0
8
0
14
0
4,5
0
14
0
4,5
5
16
0
9
5
15
0
8
7
15
0
10
0
14
0
8
7
14
0
10
20 61
0
24
18 53
0
23
21 50
0
23
24 61
0
24
22 53
0
23
20 50
0
23
0
1 0.5-1 1
0
1 0.5-1 0,5
0
0,5 0.5-1 0,5

0
0
0
0
0
0

0,5
1
8
0,5
1
0,5

0.5-1 0,5
0.5-1 1
4
0.5-1 0,5
0.5-1 1
0.5-1 0,5


sse
sse2
sse2

0/1

alu0/1

1
1
1

int,alu
int,alu
int,alu

0/1

alu0/1

0/1

alu0/1

alu0

1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
0
0/1
0

int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
alu0
alu0/1
alu0

alu0

alu0

alu0

fpmul
fpdiv
fpmul
fpmul
fpmul
fpmul
fpmul
fpmul
fpmul
fpmul
fpmul
fpmul
fpmul
fpdiv
fpdiv
fpdiv
fpdiv
fpdiv
fpdiv

86
86
86
86
86
86
86
86
86
86
86
86
86
86
86
86
86
86
86
86
86
386
386
386
186
386
386
186
86
86
386
86
86
386
86
86
386

86
86
86
86
86
86

c
c
c

c
c

a
a
a
a
a

c
c
c
c
c

NOT
SHL, SHR, SAR
SHL, SHR, SAR
ROL, ROR
ROL, ROR
RCL, RCR
RCL, RCR
RCL, RCR
SHL,SHR,SAR,ROL,
ROR
RCL, RCR
RCL, RCR
SHLD, SHRD
SHLD, SHRD
BT
BT
BT
BT
BTR, BTS, BTC
BTR, BTS, BTC
BTR, BTS, BTC
BTR, BTS, BTC
BSF, BSR
BSF, BSR
SETcc
SETcc
CLC, STC
CMC
CLD
STD
CLI
STI

m
r,i
r,CL
r,i
r,CL
r,1
r,i
r,CL

4
1
2
1
2
1
4
4

0
0
0
0
0
0
15
15

1
0
1
0
1
0
0

4
1
1
1
1
1
15
14

m,i/CL
m,1
m,i/CL
r,r,i/CL
m,r,i/CL
r,i
r,r
m,i
m,r
r,i
r,r
m,i
m,r
r,r
r,m
r
m

4
4
4
4
4
3
2
4
4
3
2
4
4
2
3
3
4
3
3
4
4
4
4

7-8 10
0
7
10
0
18 18-28
14 14
0
18 14
0
0
4
0
0
4
0
0
4
0
12 12
0
0
6
0
0
6
0
7
18
0
15 14
0
0
4
0
0
4
0
0
5
0
0
5
0
0
10
0
0
10
0
7
52
0
5
48
0
5
35
12 43

1
4
3
3
4
1
4
4
3
4
4
4
4
4
4
4

0
28
0
0
31
0
4
4
0
34
4
4
38
0
0
33

Control transfer instructions


JMP
short/near
JMP
far
JMP
r
JMP
m(near)
JMP
m(far)
Jcc
short/near
J(E)CXZ
short
LOOP
short
CALL
near
CALL
far
CALL
r
CALL
m(near)
CALL
m(far)
RETN
RETN
i
RETF

4
6
4
6
4
16
16

0
118
4
4
11
0
0
0
2

8
9
2
2
11

1
1
1
1
1
1
1

int
int
int
int
int
int
int

mmxsh
mmxsh
mmxsh
mmxsh
mmxsh
mmxsh
mmxsh

10
10
14
14
14
2
1
2
12
2
4
8
14
2
3
1
3
2
2
52
48
35
43

1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1

int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int

mmxsh
mmxsh
mmxsh
mmxsh
mmxsh
mmxsh
mmxsh
mmxsh
mmxsh
mmxsh
mmxsh
mmxsh
mmxsh
mmxsh
mmxsh

1
118
4
4
11
2-4
2-4
2-4
2

0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0

alu0

branch

alu0
alu0

branch
branch

alu0
alu0
alu0
alu0

branch
branch
branch
branch

alu0
alu0

branch
branch

alu0
alu0

branch
branch

86
186
86
186
86
86
186
86
86
86
86
386
386
386
386
386
386
386
386
386
386
386
386
386
386
86
86
86
86
86
86

86
86
86
86
86
86
86
86
86
86
86
86
86
86
86
86

d
d
d
d
d
d
d
d
d

d
d
d
d

RETF
IRET
ENTER
ENTER
LEAVE
BOUND
INTO
INT

4
33 11
4
48 24
4
12 26
4 45+24n
4
0
3
4
14 14
4
5
18
4
84 644

i,0
i,n
m
i

String instructions
LODS
REP LODS
STOS
REP STOS
MOVS
REP MOVS
SCAS
REP SCAS
CMPS
REP CMPS
Other
NOP (90)
Long NOP (0F 1F)
PAUSE
CPUID
RDTSC
Notes:
a)
b)

c)
d)
e)
q)

4
4
4
4
4
4
4
4
4
4

0
0
26
128+16n
3
14
18

3
6
5n 4n+36
2
6
2n+3 3n+10
4
6
163+1.1n
3
40+6n
4n
5
50+8n
4n

1
0
0
1
0
0
4
2
4 39-81
4
7

86
86
186
186
186
186
86
86

86
86

86
86

86
86

86
86

86
86

0,25 0/1
0,25 0/1

alu0/1
alu0/1

200-500
80

86
ppro
sse2

p5
p5

a) Add 1 op if source is a memory operand.
b) Uses an extra op (port 3) if SIB byte used. A SIB byte is needed if the memory operand has more than one pointer register, or a scaled index, or ESP is used as base pointer.
c) Add 1 op if source or destination, but not both, is a high 8-bit register (AH, BH, CH, DH).
d) Has (false) dependence on the flags in most cases.
e) Not available on PMMX.
q) Latency is 12 in 16-bit real or virtual mode, 24 in 32-bit protected mode.
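
The SIB-byte rule in the notes above can be written as a small predicate. The following Python sketch is an illustration only: the function name and the `base`, `index` and `scale` parameters are hypothetical, and it models nothing beyond the three conditions stated in the note.

```python
def needs_sib_byte(base=None, index=None, scale=1):
    """Return True if a 32-bit memory operand needs a SIB byte.

    Conditions, as stated in the note: more than one pointer register,
    a scaled index, or ESP used as base pointer.
    """
    pointer_registers = [r for r in (base, index) if r is not None]
    more_than_one_pointer = len(pointer_registers) > 1
    scaled_index = index is not None and scale != 1
    esp_base = base is not None and base.upper() == "ESP"
    return more_than_one_pointer or scaled_index or esp_base


def extra_ops_for_sib(base=None, index=None, scale=1):
    """One extra op (port 3) when a SIB byte is used, otherwise none."""
    return 1 if needs_sib_byte(base, index, scale) else 0
```

For example, `[eax]` needs no SIB byte, while `[eax+ebx]`, `[ecx*4]` and `[esp]` each count as one extra op under this rule.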

Floating point x87 instructions

0
2

mov
load

Notes


1
1

Instruction set

0
0

Subunit

6
7

Execution unit

0
0

Port
Reciprocal throughput

1
1

Additional latency

r
m32/64

Latency

Microcode

Move instructions
FLD
FLD

Operands

ops

Instruction

87
87

FLD
FBLD
FST(P)
FST(P)
FSTP
FBSTP
FXCH
FILD
FILD
FIST
FIST
FISTP
FLDZ
FLD1
FCMOVcc
FFREE
FINCSTP, FDECSTP
FNSTSW
FSTSW
FNSTSW
FNSTCW
FLDCW
Arithmetic instructions
FADD(P),FSUB(R)(P)
FADD,FSUB(R)
FIADD,FISUB(R)
FIADD,FISUB(R)
FMUL(P)
FMUL
FIMUL
FIMUL
FDIV(R)(P)
FDIV(R)
FIDIV(R)
FIDIV(R)
FABS
FCHS
FCOM(P), FUCOM(P)
FCOM(P)
FCOMPP, FUCOMPP
FCOMI(P)
FICOM(P)
FICOM(P)
FTST
FXAM
FRNDINT
FPREM
FPREM1

m80
m80
r
m32/64
m80
m80
r
m16
m32/64
m16
m32/64
m

st0,r
r
AX
AX
m16
m16
m16

r
m
m16
m32
r
m
m16
m32
r
m
m16
m32

r
m
r
m16
m32

3
3
1
2
3
3
1
3
2
3
2
3
1
2
4
3
1
4
6
4
4
4

4
75
0
0
8
311
0
3
0
0
0
0
0
0
0
0
0
0
0
4
4
7

1
2
3
3
1
2
3
3
1
2
3
3
1
1
1
2
2
3
4
3
1
1
3
6
6

0
0
4
0
0
0
4
0
0
0
4
0
0
0
0
0
0
0
4
0
0
0
15
84
84

6
7

0
10
10
10
10
10

2-4

0
11
11

0
0
0

(3)

5
5
6
5
7
7
7
7
43
43
43
43
2
2
2
2
2
10

1
1
0
1
1
1
1
1
0
0
0
0
1
1
0
0
0
0

2
2
2
23
212
212

0
0
0
0


6
2
90
2
1
0
2-3 0
8
0
400 0
1
0
6
2
1
2
2-4 0
2-3 0
2-4 0
2
0
2
0
4
1
4
0
1
0
3
1
3
1
6
0
6
0
(8) 0,2

1
1
6
2
2
2
6
2
43
43
43
43
1
1
1
1
1
3
6
2
1
1
15

1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
0,1
1
1,2
1
1
0,1
1
1

load
load
mov
store
store
store
mov
load
load
store
store
store
mov
mov
fp
mov
mov

fp
fp
fp
fp
fp
fp
fp
fp
fp
fp
fp
fp
fp
fp
fp
fp
fp
fp
fp
fp
fp
fp
fp
fp

87
87
87
87
87
87
87
87
87
87
87
87
87
87
PPro
87
87
287
287
87
87
87

add
add
add
add
mul
mul
mul
mul
div
div
div
div
misc
misc
misc
misc
misc
misc
misc
misc
misc
misc

87
87
87
87
87
87
87
87
87
87
87
87
87
87
87
87
87
PPro
87
87
87
87
87
87
387

g, h
g, h
g, h
g, h

Math
FSQRT
FLDPI, etc.
FSIN
FCOS
FSINCOS
FPTAN
FPATAN
FSCALE
FXTRACT
F2XM1
FYL2X
FYL2XP1
Other
FNOP
(F)WAIT
FNCLEX
FNINIT
FNSAVE
FRSTOR
FXSAVE
FXRSTOR
Notes:
e)
f)

g)

h)
i)

1
2
6
6
7
6
3
3
3
3
3
11

0
43
0
150 180
175 207
178 216
160 230
92 187
24 57
15 20
45 165
60 200
134 242

1
2
4
6
4
4
4
4

0
0
4
29
174
96
69
94

0
0

1
0

43
3
170
207
211
200
153

66
20
63
90
220

456
528
132
208

1
1
1
1
1
1
1
1
1
1
1
1

fp
fp
fp
fp
fp
fp
fp
fp
fp
fp
fp
fp

1
0
1
0
96
1
172
420 0,1
532
96
208

div

87
87
387
387
387
87
87
87
87
87
87
87

mov
mov

87
87
87
87
87
87
sse
sse

g, h

i
i

e) Not available on PMMX.
f) The latency for FLDCW is 3 when the new value loaded is the same as the value of the control word before the preceding FLDCW, i.e. when alternating between the same two values. In all other cases, the latency and reciprocal throughput are 143.
g) Latency and reciprocal throughput depend on the precision setting in the F.P. control word. Single precision: 23, double precision: 38, long double precision (default): 43.
h) Throughput of FP-MUL unit is reduced during the use of the FP-DIV unit.
i) Takes 6 ops more and 40-80 clocks more when XMM registers are disabled.
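
The FLDCW rule in the notes above depends on the history of control word values, which is easier to follow as a short simulation. The Python sketch below is a literal reading of that note, not a model of the hardware: the function name is hypothetical, the values 3 and 143 are the ones quoted in the note, and every case the note does not single out is charged 143.

```python
def fldcw_costs(initial, loads):
    """Estimated latency of each FLDCW in `loads`.

    `initial` is the control word before the first FLDCW; `loads` is the
    sequence of values loaded. A load costs 3 when the new value equals
    the control word that was in effect before the preceding FLDCW,
    otherwise 143.
    """
    before_previous = None  # control word before the preceding FLDCW
    current = initial
    costs = []
    for new in loads:
        costs.append(3 if new == before_previous else 143)
        before_previous, current = current, new
    return costs


# Alternating between two values: only the first load is expensive.
alternating = fldcw_costs(0x037F, [0x0F7F, 0x037F, 0x0F7F])
```

Introducing a third control word value breaks the alternation, so code that switches rounding modes often would do best to stay with two values.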

Integer MMX and XMM instructions

0
1
2
0

fp
mmx
load
fp

alu

mmx
mmx
mmx
sse2

Notes


1
2
1
2

Instruction set

1
0
0
1

Subunit

5
2
8
10

Execution unit

0
0
0
0

Port
Reciprocal throughput

2
2
1
2

Additional latency

r32, mm
mm, r32
mm,m32
r32, xmm

Latency

Microcode

Move instructions
MOVD
MOVD
MOVD
MOVD

Operands

ops

Instruction

MOVD
MOVD
MOVD
MOVQ
MOVQ
MOVQ
MOVQ
MOVDQA
MOVDQA
MOVDQA
MOVDQU
MOVDQU
MOVDQ2Q
MOVQ2DQ
MOVNTQ
MOVNTDQ
PACKSSWB/DW
PACKUSWB
PACKSSWB/DW
PACKUSWB
PUNPCKH/LBW/WD/
DQ
PUNPCKHBW/WD/DQ/
QDQ
PUNPCKLBW/WD/DQ/Q
DQ
PSHUFD
PSHUFL/HW
PSHUFW
MASKMOVQ
MASKMOVDQU
PMOVMSKB
PEXTRW
PEXTRW
PINSRW
PINSRW
Arithmetic instructions
PADDB/W/D
PADD(U)SB/W
PSUBB/W/D
PSUB(U)SB/W
PADDQ, PSUBQ
PADDQ, PSUBQ
PCMPEQB/W/D
PCMPGTB/W/D
PMULLW PMULHW
PMULHUW
PMADDWD
PMULUDQ
PAVGB/W

xmm, r32
xmm,m32
m32, r
mm,mm
xmm,xmm
r,m64
m64,r
xmm,xmm
xmm,m
m,xmm
xmm,m
m,xmm
mm,xmm
xmm,mm
m,mm
m,xmm

2
1
2
1
1
1
2
1
1
2
4
4
3
2
3
2

0
0
0
0
0
0
0
0
0
0
0
6
0
0
0
0

6
8
8
6
2
8
8
6
8
8

2
1
2
1
2
1
2
1
1
2
2
2
2
2
75
18

1
2
0,1
0
1
2
0
0
2
0
2
0
0,1
0,1
0
0

mm,r/m

mmx

shift

mmx

xmm,r/m

mmx

shift

mmx

mm,r/m

mmx

shift

mmx

xmm,r/m

mmx

shift

sse2

xmm,r/m
xmm,xmm,i
xmm,xmm,i
mm,mm,i
mm,mm
xmm,xmm
r32,r
r32,mm,i
r32,xmm,i
mm,r32,i
xmm,r32,i

1
1
1
1
4
4
2
3
3
2
2

0
0
0
0
4
6
0
0
0
0
0

2
4
2
2

1
1
1
1

shift
shift
shift
shift

sse2
sse2
sse2
mmx
sse
sse2

1
1
1
1
1

1
1
1
1
0
0
0,1
1
1
1
1

mmx
mmx
mmx
mmx
mov
mov

7
8
9
3
4

2
2
2
1
7
10
3
2
2
2
2

mmx-alu0

mmx-int
mmx-int
int-mmx
int-mmx

sse
sse
sse2
sse
sse2

r,r/m

1,2

mmx

alu

mmx

a,j

r,r/m
mm,r/m
xmm,r/m

1
1
1

0
0
0

2
2
4

1
1
1

1,2
1
2

1
1
1

mmx
mmx
fp

alu
alu
add

mmx
sse2
sse2

a,j
a
a

r,r/m
r,r/m
r,r/m
r,r/m
r,r/m
r,r/m

1
1
1
1
1
1

0
0
0
0
0
0

2
6
6
6
6
2

1
1
1
1
1
1

1,2
1,2
1,2
1,2
1,2
1,2

1
1
1
1
1
1

mmx
fp
fp
fp
fp
mmx

alu
mul
mul
mul
mul
alu

mmx
mmx
sse
mmx
sse2
sse

a,j
a,j
a,j
a,j
a,j
a,j

8
8

1
0
0
1

1
1


mmx
load
mov
mmx
load
mov
mov
load
mov
load
mov
mov-mmx
mov-mmx

shift

shift

sse2
sse2
mmx
mmx
sse2
mmx
mmx
sse2
sse2
sse2
sse2
sse2

k
k

sse2
sse2

mov
mov

sse
sse2

PMIN/MAXUB
PMIN/MAXSW
PSADBW
Logic
PAND, PANDN
POR, PXOR
PSLL/RLW/D/Q,
PSRAW/D
PSLLDQ, PSRLDQ
Other
EMMS
Notes:
a)
j)
k)

r,r/m
r,r/m
r,r/m

1
1
1

0
0
0

2
2
4

1
1
1

1,2
1,2
1,2

1
1
1

mmx
mmx
mmx

alu
alu
alu

sse
sse
sse

a,j
a,j
a,j

r,r/m
r,r/m

1
1

0
0

2
2

1
1

1,2
1,2

1
1

mmx
mmx

alu
alu

mmx
mmx

a,j
a,j

r,i/r/m
xmm,i

1
1

0
0

2
4

1
1

1,2
2

1
1

mmx
mmx

shift
shift

mmx
sse2

a,j
a

11

12

12

mmx

a) Add 1 op if source is a memory operand.
j) Reciprocal throughput is 1 for 64 bit operands, and 2 for 128 bit operands.
k) It may be advantageous to replace this instruction by two 64-bit moves.

Floating point XMM instructions

2
2
2
1
1
1

0
2
0
0
2
0
1
1
2
0
1
1

sse

0
0
0
0
0
0

2
4
3
2
2
2

0
0
1
1
1
1

sse
sse/2
sse
sse
sse
sse

2
2
7

0
1
0

4
2

0
0

mov

mmx
mmx

shift
shift

mmx
mmx

shift
shift

sse
sse
sse
sse
sse
sse
sse
sse
sse
sse
sse
sse

MOVHPS/D, MOVLPS/D
MOVNTPS/D
MOVMSKPS/D
SHUFPS/D
UNPCKHPS/D
UNPCKLPS/D

6
4
4
2

1
1
1
1


fp
mmx
mmx
mmx

shift
shift
shift

Notes

m,r
m,r
r32,r
r,r/m,i
r,r/m
r,r/m

1
1
2
1
2
8
2
2
1
2
2
2

mov

Instruction set

0
0

Subunit

r,m

6
7
7
6

Execution unit

0
0
0
0
0
6
0
0
0
0
0
0

Port
Reciprocal throughput

1
1
2
1
4
4
1
1
1
2
1
1

Additional latency

r,r
r,m
m,r
r,r
r,m
m,r
r,r
r,r
r,m
m,r
r,r
r,r

Operands

Latency

Microcode

Move instructions
MOVAPS/D
MOVAPS/D
MOVAPS/D
MOVUPS/D
MOVUPS/D
MOVUPS/D
MOVSS
MOVSD
MOVSS, MOVSD
MOVSS, MOVSD
MOVHLPS
MOVLHPS
MOVHPS/D, MOVLPS/D

ops

Instruction

k
k

Conversion
CVTPS2PD
CVTPD2PS
CVTSD2SS
CVTSS2SD
CVTDQ2PS
CVTDQ2PD
CVT(T)PS2DQ
CVT(T)PD2DQ
CVTPI2PS
CVTPI2PD
CVT(T)PS2PI
CVT(T)PD2PI
CVTSI2SS
CVTSI2SD
CVT(T)SD2SI
CVT(T)SS2SI

r,r/m
r,r/m
r,r/m
r,r/m
r,r/m
r,r/m
r,r/m
r,r/m
xmm,mm
xmm,mm
mm,xmm
mm,xmm
xmm,r32
xmm,r32
r32,xmm
r32,xmm

4
2
4
4
1
3
1
2
4
4
3
3
3
4
2
2

0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0

7
10
14
10
4
9
4
9
10
11
7
11
10
15
8
8

1
1
1
1
1
1
1
1
1
1
0
1
1
1
1
1

4
2
6
6
2
4
2
2
4
5
2
3
3
6
2,5
2,5

1
1
1
1
1
1
1
1
1
1
0,1
0,1
1
1
1
1

mmx
fp-mmx
mmx
mmx
fp
mmx-fp
fp
fp-mmx
mmx
fp-mmx
fp-mmx
fp-mmx
fp-mmx
fp-mmx
fp
fp

shift
sse2
shift
shift

r,r/m
r,r/m
r,r/m
r,r/m
r,r/m
r,r/m
r,r/m
r,r/m

1
1
1
1
1
1
1
2

0
0
0
0
0
0
0
0

4
4
6
23
39
38
69
4

1
1
1
0
0
0
0
1

2
2
2
23
39
38
69
4

1
1
1
1
1
1
1
1

fp
fp
fp
fp
fp
fp
fp
mmx

r,r/m

r,r/m
r,r/m

1
2

0
0

4
6

1
1

2
3

Logic
ANDPS/D ANDNPS/D
ORPS/D XORPS/D

r,r/m

Math
SQRTSS
SQRTPS
SQRTSD
SQRTPD
RSQRTSS
RSQRTPS

r,r/m
r,r/m
r,r/m
r,r/m
r,r/m
r,r/m

1
1
1
1
2
2

0
0
0
0
0
0

23
39
38
69
4
4

0
0
0
0
1
1

m
m

4
4

8
4

98

Arithmetic
ADDPS/D ADDSS/D
SUBPS/D SUBSS/D
MULPS/D MULSS/D
DIVSS
DIVPS
DIVSD
DIVPD
RCPPS RCPSS
MAXPS/D
MAXSS/DMINPS/D
MINSS/D
CMPccPS/D
CMPccSS/D
COMISS/D UCOMISS/D

Other
LDMXCSR
STMXCSR
Notes:


sse2
a
sse2
sse2
sse2
a
sse2
a
sse
a
a
a
a
a
sse2
sse

add
add
mul
div
div
div
div

sse
sse
sse
sse
sse
sse2
sse2
sse

a
a
a
a,h
a,h
a,h
a,h
a

fp

add

sse

1
1

fp
fp

add
add

sse
sse

a
a

mmx

alu

sse

23
39
38
69
3
4

1
1
1
1
1
1

fp
fp
fp
fp
mmx
mmx

div
div
div
div

sse
sse
sse2
sse2
sse
sse

a,h
a,h
a,h
a,h
a
a

100
6

1
1

sse2
sse2
sse2
sse
sse2
sse
sse2

sse
sse

a
a
a
a
a

a
a

a) Add 1 op if source is a memory operand.
h) Throughput of FP-MUL unit is reduced during the use of the FP-DIV unit.
k) It may be advantageous to replace this instruction by two 64-bit moves.


Prescott

Intel Pentium 4 w. EM64T (Prescott)


List of instruction timings and op breakdown
Explanation of column headings:
Instruction: Instruction name. cc means any condition code. For example, Jcc can be JB, JNE, etc.

Operands: i = immediate constant, r = any register, r32 = 32-bit register, etc., mm = 64 bit mmx register, xmm = 128 bit xmm register, sr = segment register, m = any memory operand including indirect operands, m64 means 64-bit memory operand, etc., mabs = memory operand with 64-bit absolute address.

ops: Number of ops issued from instruction decoder and stored in trace cache.

Microcode: Number of additional ops issued from microcode ROM.

Latency: This is the delay that the instruction generates in a dependency chain if the next dependent instruction starts in the same execution unit. The numbers are minimum values. Cache misses, misalignment, and exceptions may increase the clock counts considerably. Floating point operands are presumed to be normal numbers. Denormal numbers, NaNs, infinity and exceptions increase the delays. The latency of moves to and from memory cannot be measured accurately because of the problem with memory intermediates explained above under How the values were measured.

Additional latency: This number is added to the latency if the next dependent instruction is in a different execution unit. There is no additional latency between ALU0 and ALU1.

Reciprocal throughput: This is also called issue latency. This value indicates the number of clock cycles from when the execution of an instruction begins until a subsequent independent instruction can begin to execute in the same execution subunit. A value of 0.25 indicates 4 instructions per clock cycle in one thread.

Port: The port through which each op goes to an execution unit. Two independent ops can start to execute simultaneously only if they are going through different ports.

Execution unit: Use this information to determine additional latency. When an instruction with more than one op uses more than one execution unit, only the first and the last execution units are listed.

Execution subunit: Throughput measures apply only to instructions executing in the same subunit.

Instruction set: Indicates the compatibility of an instruction with other 80x86 family microprocessors. The instruction can execute on microprocessors that support the instruction set indicated.
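
The rule given for the Port column, that two independent ops can start together only if they go through different ports, can be checked mechanically. The Python sketch below is a simplified illustration: the function names are hypothetical, and it assumes that a port entry such as 0/1 means the op may use either port 0 or port 1.

```python
def parse_ports(port_field):
    """Turn a port entry such as "0/1" or "2" into a set of port numbers.

    Assumption: "0/1" means the op can go through either port.
    """
    return {int(p) for p in port_field.split("/")}


def can_start_together(port_a, port_b):
    """True if two independent ops can be assigned to different ports."""
    ports_a, ports_b = parse_ports(port_a), parse_ports(port_b)
    return any(a != b for a in ports_a for b in ports_b)
```

Two ops that both list port 0/1 can be split across the two ports, whereas two ops that both list port 2 conflict. The sketch ignores reciprocal throughputs below 1, which allow more than one op per clock cycle on the same port.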

Integer instructions

86
86
x64

Notes

alu0/1
alu0/1
alu0/1

Instruction set


0,25 0/1
0,25 0/1
0,5 0/1

Subunit

0
0
0

Execution unit

1
1

Port
Reciprocal throughput

0
0
0

Additional latency

1
1
1

Latency

r,r
r8/16/32,i
r64,i32

Microcode

Move instructions
MOV
MOV
MOV

Operands

ops

Instruction

MOV
MOV
MOV
MOV
MOV
MOV
MOV
MOV
MOV
MOV
MOVNTI
MOVZX
MOVZX
MOVZX
MOVSX
MOVSX
MOVSX
MOVSXD
CMOVcc
XCHG
XCHG
XLAT
PUSH
PUSH
PUSH
PUSH
PUSHF(D/Q)
PUSHA(D)
POP
POP
POP
POPF(D/Q)
POPA(D)
LEA
LEA
LEA
LEA
LEA
LEA
LAHF
SAHF
SALC
LDS, LES, ...
LODS
REP LODS
STOS
REP STOS
MOVS
REP MOVSB
REP MOVSW

r64,i64
r8/16,m
r32/64,m
m,r
m,i
m64,i32
r,sr
sr,r/m
r,mabs
mabs,r
m,r32
r,r
r16,r8
r,m
r16,r8
r32/64,r8/16
r,m
r64,r32
r,r/m
r,r
r,m
r
i
m
sr

r
m
sr

r,[m]
r,[r+r/i]
r,[r+r+i]
r,[r*i]
r,[r+r*i]
r,[r+r*i+i]

r,m

2
0
0
2
0
3
0
1
0
2
0
1
0
2
0
2
0
1
2
1
8
3
0
3
0
2
0
1
0
1
0
2
0
2
0
1
0
2
0
2
0
2
0
1
0
1
0
2
0
3
0
1
0
1
0
3
0
9,5
0
3
0
2
0
2
6 100
4
0
6
2
0
2
2
0
2
3
0
2
1
3
1
3
1
9
2
0
1
0
2
6
1
8
1
8
2
16
1
0
1
0
2,5
0
2
0
3,5
0
3
0
3,5
0
2
0
3,5
0
3
0
3,5
0
1
0
4
0
1
0
5
0
2
0
0
2
10
1
3
8
1
5n 4n+50
1
2
8
1 2.5n 3n
1
4
8
9 .3n .3n
1 .5-1.1n .6-1.4n

1
1
1
2
2
2
8
27
1
2
2
0,25
1
1
1
0,5
1
0,5
3
1

2
2
2
9
9
16
1
10
30
70
15
0,25
0,25
0,5
1
1
1

1
28
8

1
2
2
0
0,3
0,3

alu1
load
load
store
store
store

0/1
0/1
2
0
0
2
0

alu0/1
alu0/1
load
alu0
alu0
load
alu0

0/1

alu0/1

0/1
0/1
0/1
1
0,1
1
1
0/1
1

alu0/1
alu0/1
alu0/1
alu
alu0,1
alu
int
alu0/1
int

x64
86
86
86
86
x64
86
86
x64
x64
sse2
386
386
386
386
386
386
x64
PPro
86
86
86
86
186
86
86
86
186
86
86
86
86
186
86
86
86
386
386
386
86
86
86
86
86

86
8

86
86

86
86
86

b,c

a,q
l
l
c
c
a,c,o
a,c,o
a
a,e

m
p

n
d,n
m
m

REP MOVSD
REP MOVSQ
BSWAP
IN, OUT
PREFETCHNTA
PREFETCHT0/1/2
SFENCE
LFENCE
MFENCE
Arithmetic instructions
ADD, SUB
ADD, SUB
ADD, SUB
ADC, SBB
ADC, SBB
ADC, SBB
ADC, SBB
CMP
CMP
INC, DEC
INC, DEC
NEG
NEG
AAA, AAS
DAA, DAS
AAD
AAM
MUL, IMUL
MUL, IMUL
MUL, IMUL
MUL, IMUL
MUL, IMUL
MUL, IMUL
MUL, IMUL
MUL, IMUL
IMUL
IMUL
IMUL
IMUL
IMUL
IMUL
IMUL
IMUL
IMUL
IMUL
DIV
DIV
DIV
DIV

r
r,r/i
m
m

r,r
r,m
m,r
r,r/i
r,m
m,r
m,i
r,r
r,m
r
m
r
m

r8
r16
r32
r64
m8
m16
m32
m64
r16,r16
r16,r16,i
r32,r32
r32,(r32),i
r64,r64
r64,(r64),i
r16,m16
r32,m32
r64,m64
r,m,i
r8/m8
r16/m16
r32/m32
r64/m64

1
1
1
1
1
1
1
1
1

1
2
3
3
2
2
3
1
2
2
4
1
3
1
1
2
2
1
4
3
1
2
2
3
2
1
2
1
1
1
1
2
2
2
3
1
1
1
1

1.1n 1.4 n

86
x64
alu

1.1n 1.4 n

0
52
0
0
2
2
4

0
0
0
0
5
6
5
0
0
0
0
0
0
10
16
5
17
0
0
0
5
0
5
0
6
0
0
0
0
0
0
0
0
0
0
20
19
21
31

1
1
5
10
10
20
22
1
1
1
5
1
5
26
29
13
71
10
11
11
11
10
11
11
11
10
11
10
10
10
10
10
10
10
10
74
73
76
63

0
0
0
0

0
0
0
0

0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0


1
>1000
1
1
50
50
124

86
sse
sse
sse
sse2
sse2

0,25 0/1
1
2
10
1
10
1
10
10
0,25 0/1
1
0,5 0/1
3
0,5 0
3

2,5
2,5
2,5
2,5
2,5
2,5
2,5
2,5
2,5
1-2.5

34
34
34
52

486

1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1

alu0/1

int,alu
int,alu

alu0/1
alu0/1
alu0

int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int

mul
fpdiv
mul
mul
mul
mul
mul
mul
mul
mul
mul
mul
mul
mul
mul
mul
mul
mul
mul
mul
fpdiv
fpdiv
fpdiv
fpdiv

86
86
86
86
86
86
86
86
86
86
86
86
86
86
86
86
86
86
86
86
x64
86
86
86
x64
386
186
386
386
x64
x64
386
386
x64
186
86
86
386
x64

c
c
c

c
c

m
m
m
m

a
a
a
a

IDIV
IDIV
IDIV
IDIV
CBW
CWD
CDQ
CQO
CWDE
CDQE
SCAS
REP SCAS
CMPS
REP CMPS
Logic
AND, OR, XOR
AND, OR, XOR
AND, OR, XOR
TEST
TEST
NOT
NOT
SHL
SHR, SAR
SHR, SAR
SHL
SHR, SAR
SHR, SAR
ROL, ROR
ROL, ROR
ROL, ROR
ROL, ROR
RCL, RCR
RCL
RCR
RCL
RCR
SHL, SHR, SAR
ROL. ROR
SHL, SHR, SAR
ROL. ROR
RCL, RCR
RCL, RCR
RCL, RCR
SHLD, SHRD
SHLD
SHRD
SHLD, SHRD
SHLD

r8/m8
r16/m16
r32/m32
r64/m64

r,r
r,m
m,r
r,r
r,m
r
m
r,i
r8/16/32,i
r64,i
r,CL
r8/16/32,CL
r64,CL
r8/16/32,i
r64,i
r8/16/32,CL
r64,CL
r,1
r,i
r,i
r,CL
r,CL
m8/16/32,i
m8/16/32,i
m8/16/32,cl
m8/16/32,cl
m8/16/32,1
m8/16/32,i
m8/16/32,cl
r8/16/32,r,i
r64,r64,i
r64,r64,i
r8/16/32,r,cl
r64,r64,cl

1
21
76
0
1
19
79
0
1
19
79
0
1
58
96
0
2
0
2
0
2
0
2
0
1
0
1
0
1
0
7
0
2
0
2
0
1
0
1
0
1
3
0
1 54+6n
4n
1
5
1 81+8n
5n

34
34
34
91
1
1
1
1
1
1
8

1
2
3
1
2
1
3
1
1
1
2
2
2
1
1
2
2
1
2
2
1
1
3
3
2
2
2
3
2
3
4
3
4
4

0,5
1
2
0,5
1
0,5
2
0,5
0,5
2
2
2

0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
11
11
11
11
6
6
6
6
5
13
13
0
5
7
0
5

1
1
5
1
1
1
5
1
1
7
2
2
8
1
7
2
8
7
31
25
31
25
10
10
10
10
27
38
37
8
10
10
9
14

0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0


1
1
1
1
0
0/1
0/1
0/1
0/1
0/1

int
int
int
int
alu0
alu0/1
alu0/1
alu0/1
alu0/1
alu0/1

fpdiv
fpdiv
fpdiv
fpdiv

86
86
386
x64
86
86
386
x64
386
x64
86

a
a
a
a

86
10

86
86

1
7
2
8
7
31
25
31
25

27
38
37
7

alu0

alu0

alu0

1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1

alu1
alu1
alu1
alu1
alu1
alu1
alu1
alu1
alu1
alu1
alu1
alu1
alu1
alu1
alu1
alu1
alu1
alu1
alu1
alu1
alu1
alu1
alu1
alu1
alu1
alu1
alu1

86
86
86
86
86
86
86
186
186
x64
86
86
x64
186
x64
86
x64
86
186
186
86
86
86
86
86
86
86
86
86
386
x64
x64
386
x64

c
c
c
c
c

d
d
d
d
d
d
d
d
d
d
d
d
d
d

SHRD
SHLD, SHRD
SHLD, SHRD
BT
BT
BT
BT
BTR, BTS, BTC
BTR, BTS, BTC
BTR, BTS, BTC
BTR, BTS, BTC
BSF, BSR
SETcc
SETcc
CLC, STC
CMC
CLD, STD

r64,r64,cl
m,r,i
m,r,CL
r,i
r,r
m,i
m,r
r,i
r,r
m,i
m,r
r,r/m
r
m

3
3
2
1
2
3
2
1
2
3
2
2
2
3
2
3
1

8
8
8
0
0
0
7
0
0
6
10
0
0
0
0
0
8

12
20
20
8
9
8
10
8
9
28
14
16
9
9

Control transfer instructions


JMP
short/near
JMP
far
JMP
r
JMP
m(near)
JMP
m(far)
Jcc
short/near
J(E)CXZ
short
LOOP
short
CALL
near
CALL
far
CALL
r
CALL
m(near)
CALL
m(far)
RETN
RETN
i
RETF
RETF
i
IRET
BOUND
m
INT
i
INTO

1
2
3
3
2
1
4
4
3
3
4
4
2
4
4
1
2
1
2
2
1

0
25
0
0
28
0
0
0
0
29
0
0
32
0
0
30
30
49
11
67
4

Other
NOP (90)
Long NOP (0F 1F)
PAUSE
LEAVE
CLI
STI
CPUID
RDTSC

1
0
1
0
1
2
4
0
1
5
1
11
1 49-90
1
12

15

0
0
5

0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0

10
10
8
9
8
10
8
9
10
14
4
1
2
8

1
1
1
1
1
1
1
1
1
1
1
1
1
1

x64
386
386
386
386
386
386
386
386
386
386
386
386
386
86
86
86

53

1
154
15
10
157
2-4
4
4
7
160
7
9
160
7
7
160
160
325
12
470
26

0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0

0,25 0/1
0,25 0/1
50
5
52
64
300-500
100


alu1
alu1
alu1
alu1
alu1
alu1
alu1
alu1
alu1
alu1
alu1
alu1
int
int

alu0

branch

alu0
alu0

branch
branch

alu0
alu0
alu0
alu0

branch
branch
branch
branch

alu0
alu0

branch
branch

alu0
alu0

branch
branch

alu0/1
alu0/1

86
86
86
86
86
86
86
86
86
86
86
86
86
86
86
86
86
86
186
86
86

86
ppro
sse2
186
86
86

p5
p5

d
d
d
d

m
m

RDPMC (bit 31 = 1)
RDPMC (bit 31 = 0)
MONITOR
MWAIT
Notes:
a)
b)
c)
d)
e)
l)
m)
n)
o)

p)

q)

1
4

37
154

100
240

p5
p5
(sse3)
(sse3)

a) Add 1 op if source is a memory operand.
b) Uses an extra op (port 3) if SIB byte used.
c) Add 1 op if source or destination, but not both, is a high 8-bit register (AH, BH, CH, DH).
d) Has (false) dependence on the flags in most cases.
e) Not available on PMMX.
l) Move accumulator to/from memory with 64 bit absolute address (opcode A0 A3).
m) Not available in 64 bit mode.
n) Not available in 64 bit mode on some processors.
o) MOVSX uses an extra op if the destination register is smaller than the biggest register size available. Use a 32 bit destination register in 16 bit and 32 bit mode, and a 64 bit destination register in 64 bit mode for optimal performance.
p) LEA with a direct memory operand has 1 op and a reciprocal throughput of 0.25. This also applies if there is a RIP-relative address in 64-bit mode. A sign-extended 32-bit direct memory operand in 64-bit mode without RIP-relative address takes 2 ops because of the SIB byte. The throughput is 1 in this case. You may use a MOV instead.
q) These values are measured in 32-bit mode. In 16-bit real mode there is 1 microcode op and a reciprocal throughput of 17.
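
The MOVSX advice in the notes above amounts to a table lookup on the processor mode. The following Python sketch illustrates it; the names `BIGGEST_REGISTER_BITS` and `movsx_extra_ops` are hypothetical, and the mapping of 16-bit mode to a 32-bit register follows the recommendation in the note.

```python
# Biggest register size available in each mode, per the MOVSX note:
# a 32-bit destination in 16-bit and 32-bit mode, 64-bit in 64-bit mode.
BIGGEST_REGISTER_BITS = {16: 32, 32: 32, 64: 64}


def movsx_extra_ops(dest_bits, mode_bits):
    """Extra ops for MOVSX: 1 if the destination register is smaller
    than the biggest register size available in this mode, else 0."""
    return 1 if dest_bits < BIGGEST_REGISTER_BITS[mode_bits] else 0


def recommended_movsx_destination(mode_bits):
    """Destination register size that avoids the extra op."""
    return BIGGEST_REGISTER_BITS[mode_bits]
```

Under this reading, a MOVSX with a 32-bit destination costs nothing extra in 32-bit mode but one extra op in 64-bit mode.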

Floating point x87 instructions


mov
load
load
load
mov
store
store
store
mov
load
load
store
store
mov

87
87
87
87
87
87
87
87
87
87
87
87
sse3
87

Notes

0
2
2
2
0
0
0
0
0
2
2
0
0
0

Instruction set

7
7

1
1
8
90
1
2
10
400
1
8
2
2,5
2,5
2

Subunit

0
0

Execution unit

Port
Reciprocal throughput

0
0
3
74
0
0
6
311
0
2
0
0
0
0

Additional latency

1
1
3
3
1
2
3
3
1
3
2
3
3
1

Latency

r
m32/64
m80
m80
r
m32/64
m80
m80
r
m16
m32/64
m
m

Microcode

Move instructions
FLD
FLD
FLD
FBLD
FST(P)
FST(P)
FSTP
FBSTP
FXCH
FILD
FILD
FIST(P)
FISTTP
FLDZ

Operands

ops

Instruction

FLD1
FCMOVcc
FFREE
FINCSTP, FDECSTP
FNSTSW
FSTSW
FNSTSW
FNSTCW
FLDCW
Arithmetic instructions
FADD(P),FSUB(R)(P)
FADD,FSUB(R)
FIADD,FISUB(R)
FIADD,FISUB(R)
FMUL(P)
FMUL
FIMUL
FIMUL
FDIV(R)(P)
FDIV(R)
FIDIV(R)
FIDIV(R)
FABS
FCHS
FCOM(P), FUCOM(P)
FCOM(P)
FCOMPP, FUCOMPP
FCOMI(P)
FICOM(P)
FICOM(P)
FTST
FXAM
FRNDINT
FPREM
FPREM1
Math
FSQRT
FLDPI, etc.
FSIN, FCOS
FSINCOS
FPTAN
FPATAN
FSCALE
FXTRACT
F2XM1
FYL2X
FYL2XP1

st0,r
r
AX
AX
m16
m16
m16

r
m
m16
m32
r
m
m16
m32
r
m
m16
m32

r
m
r
m16
m32

2
4
3
1
4
6
2
4
3

0
0
0
0
0
0
3
0
6

1
2
3
3
1
2
3
3
1
2
3
3
1
1
1
2
2
3
3
3
1
1
3
8
9

0
0
3
0
0
0
3
0
0
0
3
3
0
0
0
0
0
0
3
0
0
0
14
86
92

1
2
3
5
8
4
3
4
3
3
3

0
0
100
150
170
97
25
16
190
63
58

0
0
0

6
6
7
6
8
8
8
8
45
45
45
45
3
3
3
3
3

1
1
1
1
1
1
1
1
1
1
1
1
1
1
0
0
0

28
220
220

1
1
1

45

200
200
270
250
96
27
270
170
170


2
4
3
1
3
3
8
3
10

0
1
0
0
1
1
0
0
0,2

mov
fp
mov
mov

1
1
6
2
2
2
8
3
45
45
45
45
1
1
1
1
1
3
8
2
1
1
16

1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
0,1
1
1,2
1
1
0,1
1
1

fp
fp
fp
fp
fp
fp
fp
fp
fp
fp
fp
fp
fp
fp
fp
fp
fp
fp
fp
fp
fp
fp

1
1
1
1
1
1
1
1
1
1
1

fp
fp
fp
fp
fp
fp
fp
fp
fp
fp
fp

45
2
200
200
270
250

87
PPro
87
87
287
287
87
87
87

add
add
add
add
mul
mul
mul
mul
div
div
div
div
misc
misc
misc
misc
misc
misc
misc
misc
misc
misc

87
87
87
87
87
87
87
87
87
87
87
87
87
87
87
87
87
PPro
87
87
87
87
87
87
387

div

87
87
387
387
87
87
87
87
87
87
87

fp
fp

g,h
g,h
g,h
g,h

g,h

Other
FNOP
(F)WAIT
FNCLEX
FNINIT
FNSAVE
FRSTOR
FXSAVE
FXRSTOR
Notes:
e)
f)

g)

h)
i)

1
2
1
1
2
2
2
2

0
0
4
30
181
96
121
118

1
0

0
0

1
1
120
200

500
570

0
0
1

mov
mov

0,1
160
244

87
87
87
87
87
87
sse
sse

i
i

e) Not available on PMMX.
f) The latency for FLDCW is 3 when the new value loaded is the same as the value of the control word before the preceding FLDCW, i.e. when alternating between the same two values. In all other cases, the latency and reciprocal throughput are > 100.
g) Latency and reciprocal throughput depend on the precision setting in the F.P. control word. Single precision: 32, double precision: 40, long double precision (default): 45.
h) Throughput of FP-MUL unit is reduced during the use of the FP-DIV unit.
i) Takes fewer microcode ops when XMM registers are disabled, but the throughput is the same.

Integer MMX and XMM instructions

7
2

0
1

10


alu

shift

shift

sse2

mmx
mmx
mmx
sse2
sse2
sse2
mmx
mmx
sse2
mmx
mmx
sse2
sse2
sse2
sse2
sse2
sse3

Notes

1
1

0
fp
1
mmx
2
load
0
fp
1
mmx
2
load
0,1
0
mov
1
mmx
2
load
0
mov
0
mov
2
load
0
mov
2
load
0
mov
2
load
0,1 mov-mmx

Instruction set

7
4

1
1
1
1
2
1
2
1
2
1
2
1
1
2
23
8
2,5
2

Subunit

1
1

Execution unit

6
3

Port
Reciprocal throughput

0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
2
0
0

Additional latency

2
1
1
1
2
1
2
1
1
1
2
1
1
2
4
4
4
3

Latency

r32, mm
mm, r32
mm,m32
r32, xmm
xmm, r32
xmm,m32
m32, r
mm,mm
xmm,xmm
r,m64
m64,r
xmm,xmm
xmm,m
m,xmm
xmm,m
m,xmm
xmm,m
mm,xmm

Microcode

Move instructions
MOVD
MOVD
MOVD
MOVD
MOVD
MOVD
MOVD
MOVQ
MOVQ
MOVQ
MOVQ
MOVDQA
MOVDQA
MOVDQA
MOVDQU
MOVDQU
LDDQU
MOVDQ2Q

Operands

ops

Instruction

k
k

MOVQ2DQ
MOVNTQ
MOVNTDQ
MOVDDUP
MOVSHDUP
MOVSLDUP
PACKSSWB/DW
PACKUSWB
PACKSSWB/DW
PACKUSWB
PUNPCKH/LBW/WD/
DQ
PUNPCKHBW/WD/DQ/
QDQ
PUNPCKLBW/WD/DQ/Q
DQ
PSHUFD
PSHUFL/HW
PSHUFW
MASKMOVQ
MASKMOVDQU
PMOVMSKB
PEXTRW
PEXTRW
PINSRW
Arithmetic instructions
PADDB/W/D
PADD(U)SB/W
PSUBB/W/D
PSUB(U)SB/W
PADDQ, PSUBQ
PADDQ, PSUBQ
PCMPEQB/W/D
PCMPGTB/W/D
PMULLW PMULHW
PMULHUW
PMADDWD
PMULUDQ
PAVGB/W
PMIN/MAXUB
PMIN/MAXSW
PSADBW
Logic
PAND, PANDN
POR, PXOR
PSLL/RLW/D/Q,
PSRAW/D
PSLLDQ, PSRLDQ

xmm,mm
m,mm
m,xmm
xmm,xmm

2
3
2
1

0
0
0
0

10

2
4
4
2

0,1 mov-mmx
0
mov
0
mov
1
mmx

xmm,xmm

mm,r/m

xmm,r/m

mm,r/m

xmm,r/m

xmm,r/m
xmm,xmm,i
xmm,xmm,i
mm,mm,i
mm,mm
xmm,xmm
r32,r
r32,mm,i
r32,xmm,i
r,r32,i

1
1
1
1
1
1
2
2
2
2

r,r/m

sse2

shift

sse
sse2
sse3

mmx

shift

sse3

mmx

shift

mmx

mmx

shift

mmx

mmx

shift

mmx

mmx

shift

sse2

0
0
0
0
4
6
0
0
0
0

2
4
2
2

1
1
1
1

2
2
2
1
10
12
3
2
3
2

shift
shift
shift
shift

sse2
sse2
sse
sse
sse
sse2

1,2

mmx

alu

mmx

a,j

r,r/m
mm,r/m
xmm,r/m

1
1
1

0
0
0

2
2
5

1
1
1

1,2
1
2

1
1
1

mmx
mmx
fp

alu
alu
add

mmx
sse2
sse2

a,j
a
a

r,r/m
r,r/m
r,r/m
r,r/m
r,r/m
r,r/m
r,r/m
r,r/m
r,r/m

1
1
1
1
1
1
1
1
1

0
0
0
0
0
0
0
0
0

2
7
7
7
7
2
2
2
4

1
1
1
1
1
1
1
1
1

1,2
1,2
1,2
1,2
1,2
1,2
1,2
1,2
1,2

1
1
1
1
1
1
1
1
1

mmx
fp
fp
fp
fp
mmx
mmx
mmx
mmx

alu
mul
mul
mul
mul
alu
alu
alu
alu

mmx
mmx
sse
mmx
sse2
sse
sse
sse
sse

a,j
a,j
a,j
a,j
a,j
a,j
a,j
a,j
a,j

r,r/m
r,r/m

1
1

0
0

2
2

1
1

1,2
1,2

1
1

mmx
mmx

alu
alu

mmx
mmx

a,j
a,j

r,i/r/m
xmm,i

1
1

0
0

2
4

1
1

1,2
2

1
1

mmx
mmx

shift
shift

mmx
sse2

a,j

7
7
7
4


1
mmx
1
mmx
1
mmx
1
mmx
0
mov
0
mov
0,1 mmx-alu0
1 mmx-int
1 mmx-int
1 int-mmx

sse
sse
sse2
sse

Other
EMMS
Notes:
a)
j)
k)

10

10

12

mmx

a) Add 1 op if source is a memory operand.
j) Reciprocal throughput is 1 for 64 bit operands, and 2 for 128 bit operands.
k) It may be advantageous to replace this instruction by two 64-bit moves or LDDQU.

Floating point XMM instructions

Conversion
CVTPS2PD
CVTPD2PS
CVTSD2SS
CVTSS2SD
CVTDQ2PS
CVTDQ2PD
CVT(T)PS2DQ
CVT(T)PD2DQ
CVTPI2PS

r,r/m
r,r/m
r,r/m
r,r/m
r,r/m
r,r/m
r,r/m
r,r/m
xmm,mm

1
2
3
2
1
3
1
2
4

0
0
0
0
0
0
0
0
0

MOVHPS/D, MOVLPS/D

1
1
0

4
2

1
1

4
2

1
1

5
4
4
2

1
1
1
1

4
10
14
8
5
10
5
11
12

1
1
1
1
1
1
1
1
1

4
2
6
6
2
4
2
2
6


0
2
0
0
2
0
1
1
2
0
1
1
2
0
1
1
0
1
1
1
1

1
1
1
1
1
1
1
1
1

mov

mov

mmx
mmx

shift
shift

mmx
mmx

shift
shift

fp
mmx
mmx
mmx

shift
shift
shift

mmx
fp-mmx
mmx
mmx
fp
mmx-fp
fp
fp-mmx
mmx

shift
sse2
shift
shift
sse2
sse2

sse
sse
sse
sse
sse
sse
sse
sse
sse
sse
sse
sse
sse
sse
sse3
sse3
sse
sse
sse
sse
sse

sse2
a
sse2
sse2
sse2
a
sse2
a
sse

Notes

MOVHPS/D, MOVLPS/D

2
4

1
1
2
1
2
8
2
2
1
2
2
2
2
2
2
2
4
3
2
2
2

Instruction set

0
0

Subunit

0
0
0
0
0
2
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0

Execution unit

1
1
2
1
4
4
1
1
1
2
1
1
2
2
1
1
2
2
1
2
1

MOVSH/LDUP
MOVDDUP
MOVNTPS/D
MOVMSKPS/D
SHUFPS/D
UNPCKHPS/D
UNPCKLPS/D

r,r
r,m
m,r
r,r
r,m
m,r
r,r
r,r
r,m
m,r
r,r
r,r
r,m
m,r
r,r
r,r
m,r
r32,r
r,r/m,i
r,r/m
r,r/m

Port
Reciprocal throughput

Additional latency

Latency

Microcode

Move instructions
MOVAPS/D
MOVAPS/D
MOVAPS/D
MOVUPS/D
MOVUPS/D
MOVUPS/D
MOVSS
MOVSD
MOVSS, MOVSD
MOVSS, MOVSD
MOVHLPS
MOVLHPS

Operands

ops

Instruction

k
k

a
a
a
a
a
a

CVTPI2PD
CVT(T)PS2PI
CVT(T)PD2PI
CVTSI2SS
CVTSI2SD
CVT(T)SD2SI
CVT(T)SS2SI

xmm,mm
mm,xmm
mm,xmm
xmm,r32
xmm,r32
r32,xmm
r32,xmm

4
3
4
3
4
2
2

0
0
0
0
0
0
0

12
8
12
20
20
12
17

1
0
1
1
1
1
1

5
2
3
4
5
4
4

1
0,1
0,1
1
1
1
1

fp-mmx
fp-mmx
fp-mmx
fp-mmx
fp-mmx
fp
fp

sse2
sse
sse2
sse
sse2

r,r/m
r,r/m
r,r/m
r,r/m
r,r/m
r,r/m
r,r/m
r,r/m
r,r/m
r,r/m

1
1
1
3
1
1
1
1
1
2

0
0
0
0
0
0
0
0
0
0

5
5
5
13
7
32
41
40
71
6

1
1
1
1
1
1
1
1
1
1

2
2
2
5-6
2
23
41
40
71
4

1
1
1
1
1
1
1
1
1
1

fp
fp
fp
fp
fp
fp
fp
fp
fp
mmx

r,r/m

r,r/m
r,r/m

1
2

0
0

5
6

1
1

2
3

Logic
ANDPS/D ANDNPS/D
ORPS/D XORPS/D

r,r/m

Math
SQRTSS
SQRTPS
SQRTSD
SQRTPD
RSQRTSS
RSQRTPS

r,r/m
r,r/m
r,r/m
r,r/m
r,r/m
r,r/m

1
1
1
1
2
2

0
0
0
0
0
0

32
41
40
71
5
6

1
1
1
1
1
1

m
m

2
3

11
0

Arithmetic
ADDPS/D ADDSS/D
SUBPS/D SUBSS/D
ADDSUBPS/D
HADDPS/D HSUBPS/D
MULPS/D MULSS/D
DIVSS
DIVPS
DIVSD
DIVPD
RCPPS RCPSS
MAXPS/D MINPS/D
MAXSS/D MINSS/D
CMPccPS/D
CMPccSS/D
COMISS/D UCOMISS/D

Other
LDMXCSR
STMXCSR
Notes:
a)
h)
k)

a
a
a
a
a
sse2
sse

a
a

add
add
add
add
mul
div
div
div
div

sse
sse
sse3
sse3
sse
sse
sse
sse2
sse2
sse

a
a
a
a
a
a,h
a,h
a,h
a,h
a

fp

add

sse

1
1

fp
fp

add
add

sse
sse

a
a

mmx

alu

sse

32
41
40
71
3
4

1
1
1
1
1
1

fp
fp
fp
fp
mmx
mmx

div
div
div
div

sse
sse
sse2
sse2
sse
sse

a,h
a,h
a,h
a,h
a
a

13
3

1
1

sse
sse

a) Add 1 μop if source is a memory operand.
h) Throughput of FP-MUL unit is reduced during the use of the FP-DIV unit.
k) It may be advantageous to replace this instruction by two 64-bit moves or LDDQU.


Atom

Intel Atom
List of instruction timings and μop breakdown
Explanation of column headings:
Instruction: Instruction name. cc means any condition code. For example, Jcc can be JB, JNE, etc.

Operands: i = immediate data, r = register, mm = 64 bit mmx register, xmm = 128 bit xmm register, (x)mm = mmx or xmm register, sr = segment register, m = memory, m32 = 32-bit memory operand, etc.

ops: The number of micro-operations (μops) from the decoder or ROM.

Unit: Tells which execution unit is used. Instructions that use the same unit cannot execute simultaneously.
ALU0 and ALU1 mean integer unit 0 or 1, respectively.
ALU0/1 means that either unit can be used. ALU0+1 means that both units are used.
Mem means memory in/out unit.
FP0 means floating point unit 0 (includes multiply, divide and other SIMD instructions).
FP1 means floating point unit 1 (adder).
MUL means multiplier, shared between FP and integer units.
DIV means divider, shared between FP and integer units.
np means not pairable: cannot execute simultaneously with any other instruction.

Latency: This is the delay that the instruction generates in a dependency chain. The numbers are minimum values. Cache misses, misalignment, and exceptions may increase the clock counts considerably. Floating point operands are presumed to be normal numbers. Denormal numbers, NANs and infinity increase the delays very much, except in XMM move, shuffle and Boolean instructions. Floating point overflow, underflow, denormal or NAN results give a similar delay.

Reciprocal throughput: The average number of clock cycles per instruction for a series of independent instructions of the same kind in the same thread.
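As a rough illustration of how the latency and reciprocal throughput columns are meant to be read, the following Python sketch estimates the clock count for a run of identical instructions in two extreme cases: a dependency chain, which is bounded by latency, and fully independent instructions, which are bounded by reciprocal throughput. The function names are hypothetical, and the model deliberately ignores cache misses, decoding limits and every other bottleneck. The sample values (latency 1, reciprocal throughput 1/2) are the ones listed in the Atom tables for ANDPS/D with register operands.

```python
from fractions import Fraction


def chain_cycles(n, latency):
    """Estimated clocks for n instructions where each needs the previous result."""
    return n * latency


def independent_cycles(n, reciprocal_throughput):
    """Estimated clocks for n mutually independent instructions of the same kind."""
    return n * reciprocal_throughput


# ANDPS xmm,xmm on Atom: latency 1, reciprocal throughput 1/2 (table values).
andps_latency = 1
andps_recip_throughput = Fraction(1, 2)

dependent = chain_cycles(100, andps_latency)
independent = independent_cycles(100, andps_recip_throughput)
print(dependent, independent)
```

Real code usually falls somewhere between the two estimates, because only some of the instructions depend on each other.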

Integer instructions
Move instructions
MOV
MOV
MOV
MOV
MOV
MOV
MOV
MOV
MOV
MOVNTI
MOVSX MOVZX MOVSXD
CMOVcc
CMOVcc
XCHG

Operands

ops

Unit

r,r
r,i
r,m
m,r
m,i
r,sr
m,sr
sr,r
sr,m
m,r
r,r/m
r,r
r,m
r,r

1
1
1
1
1
1
2
7
8
1
1
1
1
3

ALU0/1
ALU0/1
ALU0, Mem
ALU0, Mem
ALU0, Mem

Latency Reciprocal
throughput
1
1
1-3
1
1

ALU0, Mem
ALU0
ALU0+1

1
2
6


1/2
1/2
1
1
1
1
5
21
26
2,5
1
2
3
6

Remarks

All addr. modes


All addr. modes

Atom
XCHG
XLAT
PUSH
PUSH
PUSH
PUSH
PUSHF(D/Q)
PUSHA(D)
POP
POP
POP
POP
POPF(D/Q)
POPA(D)
LAHF
SAHF
SALC
LEA
BSWAP
LDS LES LFS LGS LSS
PREFETCHNTA
PREFETCHT0/1/2
LFENCE
MFENCE
SFENCE
Arithmetic instructions
ADD SUB
ADD SUB
ADD SUB
ADC SBB
ADC SBB
ADC SBB
CMP
CMP
INC DEC NEG NOT
INC DEC NEG NOT
AAA
AAS
DAA
DAS
AAD
AAM
MUL IMUL
MUL IMUL
MUL IMUL
MUL IMUL
IMUL
IMUL

r,m
r
i
m
sr

r
(E/R)SP
m
sr

4
3
1
1
2
3
14
9
1
1
3
7
19
16
1
1
2

r,m
r
m
m
m

1
1
10
1
1
1
1
1

r,r/i
r,m
m,r/i
r,r/i
r,m
m,r/i
r,r/i
m,r/i
r
m

1
1
1
1
1
1
1
1
1
1
13
13
20
21
4
10
3
4
3
8
2
1

r8
r16
r32
r64
r16,r16
r32,r32

np
np

6
6
1

np
np

1
1

ALU0+1
ALU0/1

2
1
7

AGU1
ALU0

1-4
1
30

1
1
30
1
1
1/2
1
1

1/2
1
1
2
2
2
1/2
1
1/2

Mem
Mem

ALU0/1
ALU0/1, Mem

ALU0/1
ALU0/1

ALU0, Mul
ALU0, Mul
ALU0, Mul
ALU0, Mul
ALU0, Mul
ALU0, Mul

6
6
1
1
5
6
12
11
1
1
6
31
28
12
2
1/2
5

2
2
2
2
1
1
1
16
12
20
25
7
24
7
6
6
14
6
5

Implicit lock

Not in x64 mode

Not in x64 mode

Not in x64 mode


4 clock latency
on input register

Not in x64 mode


Not in x64 mode
Not in x64 mode
Not in x64 mode
Not in x64 mode
Not in x64 mode
7
6
6
14
5
2

Atom
IMUL
IMUL
IMUL
IMUL
MUL IMUL
MUL IMUL
MUL IMUL
MUL IMUL
DIV
DIV
DIV
DIV
IDIV
IDIV
IDIV
IDIV
CBW
CWDE
CDQE
CWD
CDQ
CQO
Logic instructions
AND OR XOR
AND OR XOR
AND OR XOR
TEST
TEST
SHR SHL SAR
SHR SHL SAR
ROR ROL
ROR ROL
RCR
RCL
RCR
RCL
SHLD
SHLD
SHLD
SHLD
SHLD
SHLD
SHRD
SHRD
SHRD
SHRD
SHRD
SHRD
BT

r64,r64
r16,r16,i
r32,r32,i
r64,r64,i
m8
m16
m32
m64
r/m8
r/m16
r/m32
r/m64
r/m8
r/m16
r/m32
r/m64

r,r/i
r,m
m,r/i
r,r/i
m,r/i
r,i/cl
m,i/cl
r,i/cl
m,i/cl
r,1
r,1
r/m,i/cl
r/m,i/cl
r16,r16,i
r32,r32,i
r64,r64,i
r16,r16,cl
r32,r32,cl
r64,r64,cl
r16,r16,i
r32,r32,i
r64,r64,i
r16,r16,cl
r32,r32,cl
r64,r64,cl
r,r/i

6
2
1
7
3
5
4
8
9
12
12
38
26
29
29
60
2
1
1
2
1
1

ALU0, Mul
ALU0, Mul
ALU0, Mul
ALU0, Mul
ALU0, Mul
ALU0, Mul
ALU0, Mul
ALU0, Mul
ALU0, Div
ALU0, Div
ALU0, Div
ALU0, Div
ALU0, Div
ALU0, Div
ALU0, Div
ALU0, Div
ALU0
ALU0
ALU0
ALU0
ALU0
ALU0

13
5
5
14
6
7
7
14
22
33
49
183
38
45
61
207
5
1
1
5
1
1

1
ALU0/1
1
1
ALU0/1, Mem
1
ALU0/1, Mem 1
1
ALU0/1
1
1
ALU0/1, Mem
1
ALU0
1
1
ALU0
1
1
ALU0
1
1
ALU0
1
5
ALU0
7
2
ALU0
1
12-17
ALU0
12-15
14-20
ALU0
14-18
10
ALU0
10
2
ALU0
5
10
ALU0
11
9
ALU0
9
2
ALU0
5
9
ALU0
10
8
ALU0
8
2
ALU0
5
10
ALU0
9
7
ALU0
8
2
ALU0
5
9
ALU0
9
1
ALU1
1

11
5
2
14

22
33
49
183
38
45
61
207

1/2
1
1
1/2
1
1
1
1
1

1-2 more if mem


1-2 more if mem
1-2 more if mem
1-2 more if mem
1-2 more if mem
1-2 more if mem
1-2 more if mem
1-2 more if mem
1-2 more if mem
1-2 more if mem
1-2 more if mem
1-2 more if mem
1

Atom
BT
BT
BTR BTS BTC
BTR BTS BTC
BTR BTS BTC
BSF BSR
SETcc
SETcc
CLC STC
CMC
CLD
STD

m,r
m,i
r,r/i
m,r
m,i
r,r/m
r
m

Control transfer instructions


JMP
short/near
JMP
far
JMP
r
JMP
m(near)
JMP
m(far)
Conditional jump
short/near
J(E/R)CXZ
short
LOOP
short
LOOP(N)E
short
CALL
near
CALL
far
CALL
r
CALL
m(near)
CALL
m(far)
RETN
RETN
i
RETF
RETF
i
BOUND
r,m
INTO
String instructions
LODS
REP LODS
STOS
REP STOS
MOVS
REP MOVS
SCAS
REP SCAS
CMPS
REP CMPS
Other
NOP (90)
Long NOP (0F 1F)

9
2
1
10
3
10
1
2
1
1
5
6

ALU1
ALU1
ALU1
ALU0+1
ALU0/1

1
29
1
2
30
1
3
8
8
1
37
1
2
38
1
1
36
36
11
4

ALU1

2
5
1/2
2
7
25

2
66
4
7
78
2
7
8
8
3
65
18
20
64
6
6
80
80
10
6

ALU1

np
np

3
5n+11
2
3n+10
4
4n+11
3
5n+16
5
6n+16

1
1

10
5
1
11
6
16
2

6
3n+50
5
2n+4
6
2n - 4n
6
3n+60
7
4n+40

ALU0/1
ALU0/1

Not in x64 mode

Not in x64 mode

Not in x64 mode


Not in x64 mode

fastest for high n

1/2
1/2

Atom
PAUSE
ENTER
ENTER
LEAVE
CPUID
RDTSC
RDPMC

a,0
a,b

5
14
20+6b
4
40-80
16
24

24
23
6
100-170
29
48

Floating point x87 instructions


Move instructions
FLD
FLD
FLD
FBLD
FST(P)
FST(P)
FSTP
FBSTP
FXCH
FILD
FIST(P)
FISTTP
FLDZ
FLD1
FLDPI FLDL2E etc.
FCMOVcc
FNSTSW
FNSTSW
FLDCW
FNSTCW
FINCSTP FDECSTP
FFREE(P)
FNSAVE
FRSTOR
Arithmetic instructions
FADD(P) FSUB(R)(P)
FMUL(P)
FDIV(R)(P)
FABS
FCHS
FCOM(P) FUCOM
FCOMPP FUCOMPP
FCOMI(P) FUCOMI(P)
FIADD FISUB(R)
FIMUL
FIDIV(R)

Operands

ops

r
m32/m64
m80
m80
r
m32/m64
m80
m80
r
m
m
m

1
1
4
52
1
3
8
189
1
1
3
3
1
2
2
3
4
4
2
3
1
1
166
83

1
3
9
92
1
7
12
221
1
7
11
11

1
1
1
1
1
1
1
5
3
3
3

5
5
71
1
1
1
1

r
AX
m16
m16
m16

m
m

r/m
r/m
r/m

r/m
r
m
m
m

Unit

Latency Reciprocal
throughput

1
321
177

Mul
Div

Mul
Div

1
1
10
92
1
9
13
221
1
6
9
9
1
8
10
9
10
10
8
9
1
1
321
177

1
2
71
1
1
1
1
10
9
9
73

Remarks

SSE3

Atom
FICOM(P)
FTST
FXAM
FPREM
FPREM1
FRNDINT

3
1
1
26
37
19

1
1
~110
~130
48

Math
FSCALE
FXTRACT
FSQRT
FSIN FCOS
FSINCOS
F2XM1
FYL2X FYL2XP1
FPTAN
FPATAN

30
15
1
9
112
25
63
100
91

56
24
71
~260
~260
~100
~220
~300
~300

Other
FNOP
WAIT
FNCLEX
FNINIT

1
2
4
23

Div

9
1
1

1
5
26

74

Integer MMX and XMM instructions


Move instructions
MOVD
MOVD
MOVD
MOVD
MOVQ
MOVQ
MOVQ
MOVDQA
MOVDQA
MOVDQA
MOVDQU
MOVDQU
LDDQU
MOVDQ2Q
MOVQ2DQ
MOVNTQ
MOVNTDQ
PACKSSWB/DW
PACKUSWB

Operands

ops

r32/64,(x)mm
m32/64,(x)mm
(x)mm,r32/64
(x)mm,m32/64
(x)mm, (x)mm
(x)mm,m64
m64, (x)mm
xmm, xmm
xmm, m128
m128, xmm
m128, xmm
xmm, m128
xmm, m128
mm, xmm
xmm,mm
m64,mm
m128,xmm

1
1
1
1
1
1
1
1
1
1
3
4
4
1
1
1
1

(x)mm, (x)mm

Unit

Latency Reciprocal
throughput

Mem
Mem

4
5
3
4
1
4
5
1
4
5
6
6
6
1
1
~400
~450

2
1
1
1
1/2
1
1
1/2
1
1
6
6
6
1
1
1
3

FP0

Mem
Mem
FP0/1
Mem
Mem
FP0/1
Mem
Mem
Mem
Mem
Mem


Remarks

Atom
PUNPCKH/LBW/WD/DQ
PUNPCKH/LQDQ
PSHUFB
PSHUFB
PSHUFW
PSHUFL/HW
PSHUFD
PALIGNR
MASKMOVQ
MASKMOVDQU
PMOVMSKB
PINSRW
PEXTRW

(x)mm, (x)mm
(x)mm, (x)mm
mm,mm
xmm,xmm
mm,mm,i
xmm,xmm,i
xmm,xmm,i
xmm, xmm,i
mm,mm
xmm,xmm
r32,(x)mm
(x)mm,r32,i
r32,(x)mm,i

1
1
1
4
1
1
1
1
1
2
1
1
2

Arithmetic instructions
PADD/SUB(U)(S)B/W/D
PADDQ PSUBQ
PHADD(S)W PHSUB(S)W
PHADDD PHSUBD
PCMPEQ/GTB/W/D
PMULL/HW PMULHUW
PMULL/HW PMULHUW
PMULHRSW
PMULHRSW
PMULUDQ
PMULUDQ
PMADDWD
PMADDWD
PMADDUBSW
PMADDUBSW
PSADBW
PSADBW
PAVGB/W
PMIN/MAXUB
PMIN/MAXSW
PABSB PABSW PABSD
PSIGNB PSIGNW PSIGND

(x)mm, (x)mm
(x)mm, (x)mm
(x)mm, (x)mm
(x)mm, (x)mm
(x)mm,(x)mm
mm,mm
xmm,xmm
mm,mm
xmm,xmm
mm,mm
xmm,xmm
mm,mm
xmm,xmm
mm,mm
xmm,xmm
mm,mm
xmm,xmm
(x)mm,(x)mm
(x)mm,(x)mm
(x)mm,(x)mm
(x)mm,(x)mm

1
2
7
3
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1

(x)mm,(x)mm

(x)mm,(x)mm
(x)mm,(x)mm
(x)mm,i
xmm,i

Logic instructions
PAND(N) POR PXOR
PSLL/RL/RAW/D/Q
PSLL/RL/RAW/D/Q
PSLL/RLDQ
Other
EMMS

FP0
FP0
FP0
FP0
FP0
FP0
FP0
Mem
Mem

1
1
1
6
1
1
1
1

4
3
5

FP0/1

1
1
1
6
1
1
1
1
2
7
2
1
5

1/2
5
8

FP0/1
FP0, Mul
FP0, Mul
FP0, Mul
FP0, Mul
FP0, Mul
FP0, Mul
FP0, Mul
FP0, Mul
FP0, Mul
FP0, Mul
FP0, Mul
FP0, Mul
FP0/1
FP0/1
FP0/1
FP0/1

1
5
8
6
1
4
5
4
5
4
5
4
5
4
5
4
5
1
1
1
1

FP0/1

1/2

1
2
1
1

FP0/1
FP0
FP0
FP0

1
5
1
1

1/2
5
1
1

1/2
1
2
1
2
1
2
1
2
1
2
1
2
1/2
1/2
1/2
1/2

Floating point XMM instructions



Atom
Operands

ops

Unit

xmm,xmm
xmm,m128
m128,xmm
xmm,m128
m128,xmm
xmm,xmm
xmm,m32/64
m32/64,xmm
xmm,m64
m64,xmm
m64,xmm
xmm,xmm
r32,xmm
m128,xmm
xmm,xmm,i
xmm,xmm,i
xmm,xmm
xmm,xmm
xmm,xmm
xmm,xmm

1
1
1
4
3
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1

FP0/1
Mem
Mem
Mem
Mem
FP0/1
Mem
Mem
Mem
Mem
Mem
FP0

Conversion
CVTPD2PS
CVTSD2SS
CVTPS2PD
CVTSS2SD
CVTDQ2PS
CVT(T)PS2DQ
CVTDQ2PD
CVT(T)PD2DQ
CVTPI2PS
CVT(T)PS2PI
CVTPI2PD
CVT(T)PD2PI
CVTSI2SS
CVT(T)SS2SI
CVTSI2SD
CVT(T)SD2SI

xmm,xmm
xmm,xmm
xmm,xmm
xmm,xmm
xmm,xmm
xmm,xmm
xmm,xmm
xmm,xmm
xmm,mm
mm,xmm
xmm,mm
mm,xmm
xmm,r32
r32,xmm
xmm,r32
r32,xmm

4
3
4
3
3
3
3
3
1
1
3
4
3
3
3
3

Arithmetic
ADDSS SUBSS
ADDSD SUBSD
ADDPS SUBPS
ADDPD SUBPD
ADDSUBPS
ADDSUBPD
HADDPS HSUBPS

xmm,xmm
xmm,xmm
xmm,xmm
xmm,xmm
xmm,xmm
xmm,xmm
xmm,xmm

1
1
1
3
1
3
5

Move instructions
MOVAPS/D
MOVAPS/D
MOVAPS/D
MOVUPS/D
MOVUPS/D
MOVSS/D
MOVSS/D
MOVSS/D
MOVHPS/D MOVLPS/D
MOVHPS/D
MOVLPS/D
MOVLHPS MOVHLPS
MOVMSKPS/D
MOVNTPS/D
SHUFPS
SHUFPD
MOVDDUP
MOVSH/LDUP
UNPCKH/LPS
UNPCKH/LPD

Mem
FP0
FP0
FP0
FP0
FP0
FP0

FP1
FP1
FP1
FP1
FP1
FP1
FP0+1

Latency Reciprocal
throughput
1
4
5
6
6
1
4
5
5
4
4
1
4
~500
1
1
1
1
1
1

1/2
1
1
6
6
1/2
1
1
1
1
1
1
2
3
1
1
1
1
1

11
10
7
6
6
6
7
6
6
4
7
7
7
10
8
10

11
10
6
6
6
6
6
6
5
1
6
7
6
8
6
8

5
5
5
6
5
6
8

1
1
1
6
1
6
7

Remarks

Atom
HADDPD HSUBPD
MULSS
MULSD
MULPS
MULPD
DIVSS
DIVSD
DIVPS
DIVPD
RCPSS
RCPPS
CMPccSS/D
CMPccPS/D
COMISS/D UCOMISS/D
MAXSS/D MINSS/D
MAXPS/D MINPS/D

xmm,xmm
xmm,xmm
xmm,xmm
xmm,xmm
xmm,xmm
xmm,xmm
xmm,xmm
xmm,xmm
xmm,xmm
xmm,xmm
xmm,xmm
xmm,xmm
xmm,xmm
xmm,xmm
xmm,xmm
xmm,xmm

5
1
1
1
6
3
3
6
6
1
5
1
3
4
1
3

FP0+1
FP0, Mul
FP0, Mul
FP0, Mul
FP0, Mul
FP0, Div
FP0, Div
FP0, Div
FP0, Div

FP0
FP0
FP0
FP0
FP0

8
4
5
5
9
31
60
64
122
4
9
5
6
9
5
6

7
1
2
2
9
31
60
64
122
1
8
1
6
9
1
6

Instruction   Operands   ops    Unit       Latency   Reciprocal throughput

Math
SQRTSS        xmm,xmm    3      FP0, Div   31        31
SQRTPS        xmm,xmm    5      FP0, Div   63        63
SQRTSD        xmm,xmm    3      FP0, Div   60        60
SQRTPD        xmm,xmm    5      FP0, Div   121       121
RSQRTSS       xmm,xmm    1      FP0        4         1
RSQRTPS       xmm,xmm    5      FP0        9         8

Logic
ANDPS/D       xmm,xmm    1      FP0/1      1         1/2
ANDNPS/D      xmm,xmm    1      FP0/1      1         1/2
ORPS/D        xmm,xmm    1      FP0/1      1         1/2
XORPS/D       xmm,xmm    1      FP0/1      1         1/2

Other
LDMXCSR       m32        4                 5         6
STMXCSR       m32        4                 14        15
FXSAVE        m4096      121               142       144
FXRSTOR       m4096      116               149       150
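The square root rows listed above can be compared on a per-element basis. The sketch below, a simple arithmetic illustration rather than a measurement, divides the reciprocal throughput of each instruction by the number of elements it processes in a 128-bit XMM register. The dictionary and function names are hypothetical; the clock counts are the Atom table values.

```python
# Reciprocal throughputs in clock cycles, taken from the Atom Math rows.
RECIP_THROUGHPUT = {"SQRTSS": 31, "SQRTPS": 63, "SQRTSD": 60, "SQRTPD": 121}

# Elements processed per instruction: scalar = 1, packed single = 4, packed double = 2.
ELEMENTS = {"SQRTSS": 1, "SQRTPS": 4, "SQRTSD": 1, "SQRTPD": 2}


def cycles_per_element(instruction):
    """Clock cycles spent per square root when instructions are independent."""
    return RECIP_THROUGHPUT[instruction] / ELEMENTS[instruction]


for name in RECIP_THROUGHPUT:
    print(name, cycles_per_element(name))
```

Under these figures the packed single precision form costs roughly half as much per element as the scalar form, while the packed double precision form gives essentially no gain over SQRTSD.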


VIA Nano 2000

VIA Nano 2000 series


List of instruction timings and μop breakdown
Explanation of column headings:
Operands: i = immediate data, r = register, mm = 64 bit mmx register, xmm = 128 bit xmm register, (x)mm = mmx or xmm register, sr = segment register, m = memory, m32 = 32-bit memory operand, etc.

ops: The number of micro-operations (μops) from the decoder or ROM. Note that the VIA Nano 2000 processor has no reliable performance monitor counter for μops. Therefore the number of μops cannot be determined except in simple cases.

Port: Tells which execution port or unit is used. Instructions that use the same port cannot execute simultaneously.
I1: Integer add, Boolean, shift, etc.
I2: Integer add, Boolean, move, jump.
I12: Can use either I1 or I2, whichever is vacant first.
MA: Multiply, divide and square root on all operand types.
MB: Various integer and floating point SIMD operations.
MBfadd: Floating point addition subunit under MB.
SA: Memory store address.
ST: Memory store.
LD: Memory load.

Latency: This is the delay that the instruction generates in a dependency chain. The numbers are minimum values. Cache misses, misalignment, and exceptions may increase the clock counts considerably. Floating point operands are presumed to be normal numbers. Denormal numbers, NANs and infinity increase the delays very much, except in XMM move, shuffle and Boolean instructions. Floating point overflow, underflow, denormal or NAN results give a similar delay.
Note: There is an additional latency for moving data from one unit or subunit to another. A table of these latencies is given in manual 3: "The microarchitecture of Intel, AMD and VIA CPUs". These additional latencies are not included in the listings below where the source and destination operands are of the same type.

Reciprocal throughput: The average number of clock cycles per instruction for a series of independent instructions of the same kind in the same thread.
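The rule that instructions using the same port cannot execute simultaneously, and that I12 takes whichever of I1 or I2 is vacant first, can be illustrated with a toy model. The Python sketch below is only a simplified assumption of how such port assignment might be counted: it treats every instruction as a single μop with no dependencies, lets each port accept one instruction per clock, and ignores the decoder and all other limits. The function and variable names are hypothetical.

```python
def schedule(instructions):
    """Greedy in-order port assignment.

    instructions: list of tuples of allowed port names, one tuple per
    single-μop instruction. Returns the number of clocks needed to issue
    all of them, assuming one instruction per port per clock.
    """
    next_free = {}  # port name -> first clock at which the port is vacant
    finish = 0
    for allowed in instructions:
        # Choose the allowed port that becomes vacant first.
        port = min(allowed, key=lambda p: next_free.get(p, 0))
        clock = next_free.get(port, 0)
        next_free[port] = clock + 1
        finish = max(finish, clock + 1)
    return finish


I1 = ("I1",)          # must use port I1
I12 = ("I1", "I2")    # may use either I1 or I2

print(schedule([I1] * 4))             # four instructions competing for I1
print(schedule([I12] * 4))            # four instructions spread over I1 and I2
print(schedule([I1, I1, I12, I12]))   # I12 instructions fill the idle I2 port
```

In this model, instructions restricted to one port give one instruction per clock, while I12 instructions give two per clock, which matches a reciprocal throughput of 1/2.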

Integer instructions
Operands ops

Port

Latency

Reciprocal
throughput

Remarks

Move instructions
MOV
MOV

r,r
r,i

1
1

I2
I2

1
1

1
1

MOV
MOV
MOV
MOV
MOV
MOV
MOV

r,m
m,r
m,i
r,sr
m,sr
sr,r
sr,m

1
1
1

LD
SA, ST
SA, ST

2
2

1
1,5
1,5
1
2
20
20

20
20

Latency 4 on
pointer register

VIA Nano 2000


MOVNTI
MOVSX MOVSXD
MOVZX
MOVSX MOVSXD
MOVZX
CMOVcc
CMOVcc
XCHG
XCHG
XLAT
PUSH
PUSH
PUSH
PUSH
PUSHF(D/Q)
PUSHA(D)
POP
POP
POP
POP
POPF(D/Q)
POPA(D)
LAHF
SAHF
SALC
LEA
BSWAP
LDS LES LFS LGS LSS
PREFETCHNTA
PREFETCHT0/1/2
LFENCE
MFENCE
SFENCE
Arithmetic instructions
ADD SUB
ADD SUB
ADD SUB
ADC SBB
ADC SBB
ADC SBB
CMP
CMP
INC DEC NEG NOT
INC DEC NEG NOT
AAA
AAS
DAA

m,r
r,r
r,m
r,m
r,r
r,m
r,r
r,m
m
r
i
m
sr

1
2
1
2
3

SA, ST

1,5

I2
LD, I2
LD
I1, I2
LD, I1
I2

1
3
2
2
5
3
20
6

1
1
1
1
2
3
20

SA, ST
SA, ST
LD, SA, ST
8

r
(E/R)SP
m
sr

LD

9
1
1
r,m

SA

1
1
9
1

I2

30

30
1-2
1-2
14
14
14

I12
LD I12

LD I12 SA ST

5
1

1/2
1
2
1
1
2
1/2
1
1/2

m
m
m

r,r/i
r,m
m,r/i
r,r/i
r,m
m,r/i
r,r/i
m,r/i
r
m

I1
I1

1-2
1-2
2
17
8
15
1,25
4
5
20
9
12
1
1
6
1

LD
LD

1
2
3
1
2
3
1
2
1
3

I1
LD I1
LD I1 SA ST
I12
LD I12
I12
LD I12 SA ST

5
1
1
5

37
37
22

Implicit lock

Not in x64 mode

Not in x64 mode

Not in x64 mode


3 clock latency on
input register

Not in x64 mode


Not in x64 mode
Not in x64 mode

VIA Nano 2000


DAS
AAD
AAM
MUL IMUL
MUL IMUL
MUL IMUL
MUL IMUL
IMUL
IMUL
IMUL
IMUL
IMUL
IMUL
DIV
DIV
DIV
DIV
IDIV
IDIV
IDIV
IDIV
CBW CWDE CDQE
CWD CDQ CQO
Logic instructions
AND OR XOR
AND OR XOR
AND OR XOR
TEST
TEST
SHR SHL SAR
ROR ROL
RCR RCL
RCR RCL
SHLD SHRD
SHLD SHRD
SHLD
SHRD
SHLD SHRD
SHLD SHRD
SHLD
SHRD
BT
BT
BT
BTR BTS BTC
BTR BTS BTC
BTR BTS BTC
BSF BSR

24
23
30
r8
r16
r32
r64
r16,r16
r32,r32
r64,r64
r16,r16,i
r32,r32,i
r64,r64,i
r8
r16
r32
r64
r8
r16
r32
r64
1
1

r,r/i
r,m
m,r/i
r,r/i
m,r/i
r,i/cl
r,i/cl
r,1
r,i/cl
r16,r16,i
r32,r32,i
r64,r64,i
r64,r64,i
r16,r16,cl
r32,r32,cl
r64,r64,cl
r64,r64,cl
r,r/i
m,r
m,i
r,r/i
m,r
m,i
r,r

1
2
3
1
2
1
1
1

1
2
2

MA
MA
MA
MA
MA
MA
MA
MA
MA
MA
MA
MA
MA
MA
MA
MA
MA
MA
I1
I1

7-9
7-9
7-9
8-10
4-6
4-6
5-7
4-6
4-6
5-7
26
27-35
25-41
148-183
26
27-35
23-39
187-222
1
1

I12
LD I12

LD I12 SA ST

5
1

I12
LD I12
I1
I1
I1
I1
I1
I1
I1
I1
I1
I1
I1
I1
I1
I1
I1
I1
I1
I1
I1

1
1
1
28+3n
11
7
33
43
11
7
33
43
1

2
10
8
3

1
1
2
1
1
2
26
27-35
25-41
148-183
26
27-35
23-39
187-222
1
1

1/2
1
2
1/2
1
1
1
1
28+3n
11
7
33
43
11
7
33
43
1
8
1
2
10
8
2

Not in x64 mode


Not in x64 mode
Not in x64 mode
Extra latency to
other ports
do.
do.
do.
do.
do.
do.
do.
do.
do.
do.
do.
do.
do.
do.
do.
do.
do.

VIA Nano 2000


SETcc
SETcc
CLC STC CMC
CLD STD

r
m

I1

1
1
3
3

I1

3
3

I2

3
58

I2

3
3
55
1-3-8

3
3
1-3-8

1 if not jumping.
3 if jumping.
8 if >2 jumps in 16
bytes block
do.
do.

Control transfer instructions


JMP
JMP

short/near
far

JMP
JMP
JMP
Conditional jump

r
m(near)
m(far)
short/near

J(E/R)CXZ
LOOP
LOOP(N)E

short
short
short

1-3-8
1-3-8
25

1-3-8
1-3-8
25

CALL
CALL

near
far

3
72

3
72

CALL
CALL
CALL

r
m(near)
m(far)

3
4
72

3
3
72

3
3
39
39

3
3
39
39
13
7

RETN
RETN
RETF
RETF
BOUND
INTO

i
i
r,m

String instructions
LODSB/W/D/Q
REP LODSB/W/D/Q
STOSB/W/D/Q
REP STOSB/W/D/Q

1
3n+22
1-2
Small: 2n+2,
Big: 6 bytes
per clock

MOVSB/W/D/Q
REP MOVSB/W/D/Q

2
Small: 2n+45,
Big: 6 bytes
per clock

SCASB/W/D/Q
REP SCASB

1
2.2n


8 if >2 jumps in 16
bytes block
Not in x64 mode
8 if >2 jumps in 16
bytes block
do.

8 if >2 jumps in 16
bytes block
Not in x64 mode
8 if >2 jumps in 16
bytes block
do.
8 if >2 jumps in 16
bytes block
do.

Not in x64 mode


Not in x64 mode

VIA Nano 2000


REP SCASW/D/Q

Small: 2n+50
Big: 5 bytes
per clock

CMPSB/W/D/Q
REP CMPSB/W/D/Q
Other
NOP (90)
Long NOP (0F 1F)
PAUSE
ENTER
ENTER
LEAVE
CPUID
RDTSC
RDPMC

6
2.4n+24

1
1

All
I12

a,0
a,b
4
53-173
40

1
1/2
25
23
52+5b
4

Blocks all ports

39
40

Floating point x87 instructions


Operands ops

Port and
Unit

Latency

Reciprocal
throughput

1
2
2

MB
LD MB
LD MB

1
3
3

MB
MB SA ST
MB SA ST

I2

1
4
4
54
1
5
5
125
0
7
5
5
6
5
5

1
1
1
54
1
1-2
1-2
125
1

MB

Move instructions
FLD
FLD
FLD
FBLD
FST(P)
FST(P)
FSTP
FBSTP
FXCH
FILD
FILD
FILD
FIST(T)(P)
FIST(T)(P)
FIST(T)(P)
FLDZ FLD1
FLDPI FLDL2E etc.
FCMOVcc
FNSTSW
FNSTSW
FLDCW
FNSTCW
FINCSTP FDECSTP
FFREE(P)
FNSAVE
FRSTOR

r
m32/m64
m80
m80
r
m32/m64
m80
m80
r
m16
m32
m64
m16
m32
m64

r
AX
m16
m16
m16

13
1
1

I2
MB

m
m

0
321
195

Arithmetic instructions

1
10
2
5
3
13
2
1
1
321
195

Remarks

VIA Nano 2000


FADD(P) FSUB(R)(P)
FMUL(P)
FDIV(R)(P)
FABS
FCHS
FCOM(P) FUCOM
FCOMPP FUCOMPP
FCOMI(P) FUCOMI(P)
FIADD FISUB(R)
FIMUL
FIDIV(R)
FICOM(P)
FTST
FXAM
FPREM
FPREM1
FRNDINT

r/m
r/m
r/m

r/m
r
m
m
m
m

1
1
1
1
1
1
1

1
1

MB
MA
MA
MB
MB
MB
MB
MB
MB

2
4
15-42
1
1

MB

1
2
15-42
1
1
1
1
1
2
4
42
2
1
41

Lower precision:
Lat: 4, Thr: 2

151-171
106-155
29

Math
FSCALE
FXTRACT
FSQRT
FSIN FCOS
FSINCOS
F2XM1
FYL2X
FYL2XP1
FPTAN
FPATAN

39
36-57
73
51-159
270-360
50-200
~60
~170
300-370
~170

Other
FNOP
WAIT
FNCLEX
FNINIT

1
1

MB
I12

1
1/2
57
85

Integer MMX and XMM instructions


Operands ops

Port and
Unit

Latency

Reciprocal
throughput

3
2-3
4
2-3
1
2-3
2-3
1
2-3

1
1-2
1
1
1
1
1-2
1
1

Move instructions
MOVD
MOVD
MOVD
MOVD
MOVQ
MOVQ
MOVQ
MOVDQA
MOVDQA

r32/64,(x)mm
m32/64,(x)mm
(x)mm,r32/64
(x)mm,m32/64
(x)mm, (x)mm
(x)mm,m64
m64, (x)mm
xmm, xmm
xmm, m128

1
1

SA ST

1
1
1
1
1
1

LD
MB
LD
SA ST
MB
LD

Remarks

VIA Nano 2000


MOVDQA
MOVDQU
MOVDQU
LDDQU
MOVDQ2Q
MOVQ2DQ
MOVNTQ
MOVNTDQ
PACKSSWB/DW
PACKUSWB
PUNPCKH/LBW/WD/DQ
PUNPCKH/LQDQ
PSHUFB
PSHUFW
PSHUFL/HW
PSHUFD
PALIGNR
MASKMOVQ
MASKMOVDQU
PMOVMSKB
PEXTRW
PINSRW

m128, xmm
m128, xmm
xmm, m128
xmm, m128
mm, xmm
xmm,mm
m64,mm
m128,xmm

1
1
1
1
1
1
3
3

SA ST
SA ST
LD
LD
MB
MB

2-3
2-3
2-3
2-3
1
1
~300
~300

1-2
1-2
1
1
1
1
2
2

v,v

MB

v,v
v,v
v,v
mm,mm,i
x,x,i
x,x,i
x,x,i
mm,mm
xmm,xmm
r32,(x)mm
r32,(x)mm,i
(x)mm,r32,i

1
1
1
1
1
1
1

MB
MB
MB
MB
MB
MB
MB

1
1
1
1
1
1
1

3
3
9

1
1
1
1
1
1
1
1-3
1-3
1
1
9

v,v
v,v

1
1

MB
MB

1
1

1
1

v,v
v,v
v,v
v,v
v,v
v,v
v,v
v,v
v,v
v,v
v,v
v,v

3
3
1
1
1
1

MB
MB
MB
MA
MA
MA

1
1
1

MB
MB
MB
MB

3
3
1
3
3
3
4
10
2
1
1
1

3
3
1
1
1
1
2
8
1
1
1
1

Arithmetic instructions
PADD/SUB(U)(S)B/W/D
PADDQ PSUBQ
PHADD(S)W
PHSUB(S)W
PHADDD PHSUBD
PCMPEQ/GTB/W/D
PMULL/HW PMULHUW
PMULHRSW
PMULUDQ
PMADDWD
PMADDUBSW
PSADBW
PAVGB/W
PMIN/MAXUB
PMIN/MAXSW
PABSB PABSW PABSD

v,v

MB

PSIGNB PSIGNW
PSIGND

v,v

MB

Logic instructions
PAND(N) POR PXOR
PSLL/RL/RAW/D/Q
PSLL/RL/RAW/D/Q
PSLL/RLDQ

v,v
v,v
v,i
x,i

1
1
1
1

MB
MB
MB
MB

1
1
1
1

1
1
1
1


VIA Nano 2000


Other
EMMS

MB

Floating point XMM instructions


Operands ops

Port and
Unit

Latency

Reciprocal
throughput

1
1
1
1
1
1
1
1

MB
LD
SA ST
LD
SA ST
MB
LD
SA ST

MB

1
1
1
1
1
1

MB
MB
MB
MB
MB
MB

1
2-3
2-3
2-3
2-3
1
2-3
2-3
6
6
6
2
1
3
~300
1
1
1
1
1
1

1
1
1-2
1
1-2
1
1
1-2
1
1
1-2
1-2
1
1
2,5
1
1
1
1
1
1

Move instructions
MOVAPS/D
MOVAPS/D
MOVAPS/D
MOVUPS/D
MOVUPS/D
MOVSS/D
MOVSS/D
MOVSS/D
MOVHPS/D
MOVLPS/D
MOVHPS/D
MOVLPS/D
MOVLHPS MOVHLPS
MOVMSKPS/D
MOVNTPS/D
SHUFPS
SHUFPD
MOVDDUP
MOVSH/LDUP
UNPCKH/LPS
UNPCKH/LPD

xmm,xmm
xmm,m128
m128,xmm
xmm,m128
m128,xmm
xmm,xmm
x,m32/64
m32/64,x
xmm,m64
xmm,m64
m64,xmm
m64,xmm
xmm,xmm
r32,xmm
m128,xmm
xmm,xmm,i
xmm,xmm,i
xmm,xmm
xmm,xmm
xmm,xmm
xmm,xmm

Conversion
CVTPD2PS
CVTSD2SS
CVTPS2PD
CVTSS2SD
CVTDQ2PS
CVT(T)PS2DQ
CVTDQ2PD
CVT(T)PD2DQ
CVTPI2PS
CVT(T)PS2PI
CVTPI2PD
CVT(T)PD2PI
CVTSI2SS
CVT(T)SS2SI
CVTSI2SD
CVT(T)SD2SI

xmm,xmm
xmm,xmm
xmm,xmm
xmm,xmm
xmm,xmm
xmm,xmm
xmm,xmm
xmm,xmm
xmm,mm
mm,xmm
xmm,mm
mm,xmm
xmm,r32
r32,xmm
xmm,r32
r32,xmm

Arithmetic
ADDSS SUBSS

xmm,xmm

3-4
15
3-4
15
3
2
4
3
4
3
4
3
5
4
5
4

MBfadd

2-3

Remarks

VIA Nano 2000


ADDSD SUBSD
ADDPS SUBPS
ADDPD SUBPD
ADDSUBPS
ADDSUBPD
HADDPS HSUBPS
HADDPD HSUBPD
MULSS
MULSD
MULPS
MULPD
DIVSS
DIVSD
DIVPS
DIVPD
RCPSS
RCPPS
CMPccSS/D
CMPccPS/D
COMISS/D UCOMISS/D

xmm,xmm
xmm,xmm
xmm,xmm
xmm,xmm
xmm,xmm
xmm,xmm
xmm,xmm
xmm,xmm
xmm,xmm
xmm,xmm
xmm,xmm
xmm,xmm
xmm,xmm
xmm,xmm
xmm,xmm
xmm,xmm
xmm,xmm
xmm,xmm
xmm,xmm

MAXSS/D MINSS/D
MAXPS/D MINPS/D

xmm,xmm
xmm,xmm
xmm,xmm

Math
SQRTSS
SQRTPS
SQRTSD
SQRTPD
RSQRTSS
RSQRTPS

xmm,xmm
xmm,xmm
xmm,xmm
xmm,xmm
xmm,xmm
xmm,xmm

Logic
ANDPS/D
ANDNPS/D
ORPS/D
XORPS/D

xmm,xmm
xmm,xmm
xmm,xmm
xmm,xmm

Other
LDMXCSR
STMXCSR
FXSAVE
FXRSTOR

m32
m32
m4096
m4096

1
1
1
1
1

1
1

MBfadd
MBfadd

2-3
2-3
2-3
2-3
2-3
5
5
3
4
3
4
15-22
15-36
42-82
24-70
5
14
2
2

1
1

MBfadd
MBfadd

3
2
2

1
1
1

MA
MA
MA
MA

33
126
62
122
5
14

33
126
62
122
5
11

MB
MB
MB
MB

1
1
1
1

1
1
1
1

45
13
208
232

29
13
208
232

1
1

1
1
1
1

MBfadd
MBfadd
MBfadd
MBfadd
MBfadd
MBfadd
MBfadd
MA
MA
MA
MA
MA
MA
MA
MA

1
1
1
1
1
3
3
1
2
1
2
15-22
15-36
42-82
24-70
5
11
1
1

VIA-specific instructions

Instruction      Conditions           Clock cycles, approximately
XSTORE           Data available       160-400 clock giving 8 bytes
XSTORE           No data available    50-80 clock giving 0 bytes
REP XSTORE       Quality factor = 0   4800 clock per 8 bytes
REP XSTORE       Quality factor > 0   19200 clock per 8 bytes
REP XCRYPTECB    128 bits key         44 clock per 16 bytes
REP XCRYPTECB    192 bits key         46 clock per 16 bytes
REP XCRYPTECB    256 bits key         48 clock per 16 bytes
REP XCRYPTCBC    128 bits key         54 clock per 16 bytes
REP XCRYPTCBC    192 bits key         59 clock per 16 bytes
REP XCRYPTCBC    256 bits key         63 clock per 16 bytes
REP XCRYPTCTR    128 bits key         43 clock per 16 bytes
REP XCRYPTCTR    192 bits key         46 clock per 16 bytes
REP XCRYPTCTR    256 bits key         48 clock per 16 bytes
REP XCRYPTCFB    128 bits key         54 clock per 16 bytes
REP XCRYPTCFB    192 bits key         59 clock per 16 bytes
REP XCRYPTCFB    256 bits key         63 clock per 16 bytes
REP XCRYPTOFB    128 bits key         54 clock per 16 bytes
REP XCRYPTOFB    192 bits key         59 clock per 16 bytes
REP XCRYPTOFB    256 bits key         63 clock per 16 bytes
REP XSHA1                             3 clock per byte
REP XSHA256                           4 clock per byte
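The clock counts for the VIA-specific instructions can be converted into a data rate once a clock frequency is chosen. The Python sketch below shows the arithmetic; the 1.6 GHz frequency is a purely hypothetical assumption and is not taken from these tables, and the function name is likewise made up for illustration.

```python
def bytes_per_second(clocks, bytes_per_unit, clock_hz):
    """Convert 'clocks per bytes_per_unit bytes' into bytes per second."""
    return clock_hz * bytes_per_unit / clocks


ASSUMED_CLOCK_HZ = 1.6e9  # hypothetical core clock, not a value from the tables

# REP XCRYPTECB with a 128-bit key: 44 clocks per 16 bytes (table value).
ecb128 = bytes_per_second(44, 16, ASSUMED_CLOCK_HZ)

# REP XSHA1: 3 clocks per byte (table value).
sha1 = bytes_per_second(3, 1, ASSUMED_CLOCK_HZ)

print(f"ECB-128: {ecb128 / 1e6:.0f} MB/s, SHA-1: {sha1 / 1e6:.0f} MB/s")
```

The result scales linearly with the assumed frequency, so the same function can be reused for any other row of the table.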


Nano 3000

VIA Nano 3000 series


List of instruction timings and μop breakdown
Explanation of column headings:
Operands: i = immediate data, r = register, mm = 64 bit mmx register, xmm = 128 bit xmm register, (x)mm = mmx or xmm register, sr = segment register, m = memory, m32 = 32-bit memory operand, etc.

ops: The number of micro-operations (μops) from the decoder or ROM. Note that the VIA Nano 3000 processor has no reliable performance monitor counter for μops. Therefore the number of μops cannot be determined except in simple cases.

Port: Tells which execution port or unit is used. Instructions that use the same port cannot execute simultaneously.
I1: Integer add, Boolean, shift, etc.
I2: Integer add, Boolean, move, jump.
I12: Can use either I1 or I2, whichever is vacant first.
MA: Multiply, divide and square root on all operand types.
MB: Various integer and floating point SIMD operations.
MBfadd: Floating point addition subunit under MB.
SA: Memory store address.
ST: Memory store.
LD: Memory load.

Latency: This is the delay that the instruction generates in a dependency chain. The numbers are minimum values. Cache misses, misalignment, and exceptions may increase the clock counts considerably. Floating point operands are presumed to be normal numbers. Denormal numbers, NANs and infinity increase the delays very much, except in XMM move, shuffle and Boolean instructions. Floating point overflow, underflow, denormal or NAN results give a similar delay.
Note: There is an additional latency for moving data from one unit or subunit to another. A table of these latencies is given in manual 3: "The microarchitecture of Intel, AMD and VIA CPUs". These additional latencies are not included in the listings below where the source and destination operands are of the same type.

Reciprocal throughput: The average number of clock cycles per instruction for a series of independent instructions of the same kind in the same thread.
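The note about additional latency between units means that the latency of a dependency chain is not simply the sum of the listed latencies when the chain crosses from one unit to another. The Python sketch below shows one way this bookkeeping could be done. The penalty values are placeholders chosen for illustration only; the measured transfer latencies are tabulated in manual 3, not here, and the function name and the sample chain are hypothetical.

```python
def chain_latency(chain, transfer_penalty):
    """Sum listed latencies along a dependency chain, adding a penalty each
    time consecutive instructions execute in different units.

    chain: list of (unit, listed_latency) tuples in dependency order.
    transfer_penalty: dict mapping (from_unit, to_unit) to extra clocks.
    """
    total = 0
    previous_unit = None
    for unit, latency in chain:
        if previous_unit is not None and unit != previous_unit:
            total += transfer_penalty.get((previous_unit, unit), 0)
        total += latency
        previous_unit = unit
    return total


# Placeholder penalties, NOT measured values; see manual 3 for the real table.
HYPOTHETICAL_PENALTY = {("MA", "MB"): 1, ("MB", "MA"): 1}

# A made-up chain: one MA instruction, two MB instructions, one MA instruction.
chain = [("MA", 3), ("MB", 1), ("MB", 1), ("MA", 3)]

with_transfers = chain_latency(chain, HYPOTHETICAL_PENALTY)
without_transfers = chain_latency(chain, {})
print(with_transfers, without_transfers)
```

The difference between the two totals is the cost of the two unit crossings in the sample chain.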

Integer instructions
Operands

ops

Port

Latency Reciprocal
throughput

MOV
MOV

r,r
r,i

1
1

I2
I12

1
1

1
1/2

MOV
MOV
MOV
MOV
MOV
MOV

r,m
m,r
m,i
r,sr
m,sr
sr,r

1
1
1

LD
SA, ST
SA, ST
I12

2
2

1
1,5
1,5
1/2
1,5
20

Remarks

Move instructions

20

Latency 4 on pointer
register

Nano 3000
MOV
MOVNTI
MOVSX MOVZX
MOVSXD
MOVSX MOVSXD
MOVZX
CMOVcc
CMOVcc
XCHG
XCHG
XLAT
PUSH
PUSH
PUSH
PUSH
PUSHF(D/Q)
PUSHA(D)
POP
POP
POP
POP
POPF(D/Q)
POPA(D)
LAHF
SAHF
SALC
LEA
BSWAP
LDS LES LFS LGS LSS
PREFETCHNTA
PREFETCHT0/1/2
LFENCE MFENCE
SFENCE
Arithmetic instructions
ADD SUB
ADD SUB
ADD SUB
ADC SBB
ADC SBB
ADC SBB
CMP
CMP
INC DEC NEG NOT
INC DEC NEG NOT
AAA
AAS
DAA

sr,m
m,r
r,r
r64,r32
r,m
r,m
r,r
r,m
r,r
r,m
m
r
i
m
sr

r
(E/R)SP
m
sr

1
1
2
1
1
3
3
1
1

3
9
2

SA, ST
I12
LD, I12
LD
I12
LD, I12
I12
LD, I1
SA, ST
SA, ST
LD, SA, ST

20
2
1
1
3
2
1
5
3
18
6

1
1
10

20
1,5
1/2
1
1
1
1/2
1
1,5
18
2
1-2
1-2
2
6
2
15
1,25
4
2
11
1
12
1
1
6

1
1

1
1

28

28
1
1

2
LD

3
3
16
1
1
2

r,m
r

1
1

m
m
m

12
1
1

I1
I1

SA
I2

LD
LD

Implicit lock

Not in x64 mode

Not in x64 mode

Not in x64 mode


Extra latency to other
ports

15

r,r/i
r,m
m,r/i
r,r/i
r,m
m,r/i
r,r/i
m,r/i
r
m

1
2
3
1
2
3
1
2
1
3
12
12
14

I12
LD I12

LD I12 SA ST

5
1

I1
LD I1
LD I1 SA ST
I12
LD I12
I12
LD I12 SA ST

5
1
1
5

1/2
1
2
1
1
2
1/2
1
1/2
37
22
22


Not in x64 mode


Not in x64 mode
Not in x64 mode

Nano 3000
DAS
AAD
AAM
MUL IMUL
MUL IMUL
MUL IMUL

r8
r16
r32

14
7
13
1
3
3

I2
I2
I2

2
3
3

MUL IMUL
IMUL
IMUL

r64
r16,r16
r32,r32

3
1
1

MA
I2
I2

8
2
2

8
1
1

IMUL
IMUL
IMUL

r64,r64
r16,r16,i
r32,r32,i

1
1
1

MA
I2
I2

5
2
2

2
1
1

IMUL
DIV
DIV
DIV
DIV
IDIV
IDIV
IDIV
IDIV
CBW CWDE CDQE
CWD CDQ CQO

r64,r64,i
r8
r16
r32
r64
r8
r16
r32
r64

MA
MA
MA
MA
MA
MA
MA
MA
MA
I2
I2

5
22-24
24-28
22-30
145-162
21-24
24-28
18-26
182-200
1
1

Logic instructions
AND OR XOR
AND OR XOR
AND OR XOR
TEST
TEST
SHR SHL SAR
ROR ROL
RCR RCL
RCR RCL
SHLD SHRD
SHLD SHRD
SHLD
SHRD
BT
BT
BT
BTR BTS BTC
BTR BTS BTC
BTR BTS BTC
BSF BSR
SETcc
SETcc

1
1

r,r/i
r,m
m,r/i
r,r/i
m,r/i
r,i/cl
r,i/cl
r,1
r,i/cl
r16,r16,i/cl
r32,r32,i/cl
r64,r64,i/cl
r64,r64,i/cl
r,r/i
m,r
m,i
r,r/i
m,r
m,i
r,r
r8
m

24
24
31

1
I12
2
LD I12
3
LD I12 SA ST
1
I12
2
LD I12
1
I12
1
I1
1
I1
5+2n
I1
2
I1
2
I1
16
I1
23
I1
1
I1
6
I1
2
I1
2
I1
8
I1
5
I1
2
I1
1
I1
2

1
5
1
1
1
1
28+3n
2
2
32
42
1

2
10
8
2
1

Not in x64 mode


Not in x64 mode
Not in x64 mode

Extra latency to other


ports

Extra latency to other


ports

Extra latency to other


ports
2
22-24
24-28
22-30
145-162
21-24
24-28
18-26
182-200
1
1

1/2
1
2
1/2
1
1/2
1
1
28+3n
2
2
32
42
1
8
1
2
10
8
2
1
2

Nano 3000
CLC STC CMC
CLD STD

3
3

I1
I1

3
3

3
3

Control transfer instructions


JMP
JMP

short/near
far

1
14

I2

3
50

JMP
JMP
JMP

r
m(near)
m(far)

2
2
17

I2

3
3

3
3
42

Conditional jump
J(E/R)CXZ
LOOP
LOOP(N)E

short/near
short
short
short

1
2
2
5

CALL
CALL

near
far

CALL
CALL
CALL

r
m(near)
m(far)

RETN
RETN
RETF
RETF
BOUND
INTO

i
i
r,m

I2

1-3-8
1-3-8
1-3-8
24

1-3-8
1-3-8
1-3-8
24

2
17

3
58

2
3
19

3
4

3
3
54

3
4
20
20
9
3

3
3

3
3
49
49
13
7

String instructions
LODSB/W/D/Q
REP LODSB/W/D/Q
STOSB/W/D/Q

2
3n
1

REP STOSB/W/D/Q
MOVSB/W/D/Q

1
3n+27
1-2
Small:
n+40, Big:
6-7
bytes/clk

2
Small:
2n+20,
Big: 6-7
bytes/clk

1
2.4n
Small:
2n+31,
Big: 5
bytes/clk

REP MOVSB/W/D/Q
SCASB/W/D/Q
REP SCASB

REP SCASW/D/Q

8 if >2 jumps in 16
bytes block
Not in x64 mode
8 if >2 jumps in 16
bytes block
do.
1 if not jumping.
3 if jumping.
8 if >2 jumps in 16
bytes block

8 if >2 jumps in 16
bytes block
Not in x64 mode
8 if >2 jumps in 16
bytes block
do.
8 if >2 jumps in 16
bytes block
do.

Not in x64 mode


Not in x64 mode

Nano 3000
CMPSB/W/D/Q
REP CMPSB/W/D/Q
Other
NOP (90)
Long NOP (0F 1F)
PAUSE
ENTER
ENTER
LEAVE
CPUID
RDTSC
RDPMC

a,0
a,b

0-1
0-1
2
10

6
2.2n+30

I12
I12

0
0

2
55-146

1/2
1/2
6
21
52+5b
2

Sometimes fused

37
40

Floating point x87 instructions


Operands

ops

Port

r
m32/m64
m80
m80
r
m32/m64
m80
m80
r
m16
m32
m64
m16
m32
m64

MB
LD MB
LD MB

m
m

1
2
2
36
1
3
3
80
1
3
2
2
3
3
3
1
3
1
1
3
5
3
1
1
122
115

r/m

Latency Reciprocal
throughput

Move instructions
FLD
FLD
FLD
FBLD
FST(P)
FST(P)
FSTP
FBSTP
FXCH
FILD
FILD
FILD
FIST(T)(P)
FIST(T)(P)
FIST(T)(P)
FLDZ FLD1
FLDPI FLDL2E etc.
FCMOVcc
FNSTSW
FNSTSW
FLDCW
FNSTCW
FINCSTP FDECSTP
FFREE(P)
FNSAVE
FRSTOR
Arithmetic instructions
FADD(P) FSUB(R)(P)

r
AX
m16
m16
m16

MB
MB SA ST
MB SA ST
I2

1
4
4
54
1
5
5
125
0
7
5
5
6
5
5

MB

319
196

1
10
2
1
2
8
2
1
1
319
196

MB

I2
MB

MB

1
1
1
54
1
1-2
1-2
125
1

Remarks

Nano 3000
FMUL(P)
FDIV(R)(P)
FABS
FCHS
FCOM(P) FUCOM
FCOMPP FUCOMPP
FCOMI(P) FUCOMI(P)
FIADD FISUB(R)
FIMUL
FIDIV(R)
FICOM(P)
FTST
FXAM
FPREM
FPREM1
FRNDINT

r/m
r/m

r/m
r
m
m
m
m

Math
FSCALE
FXTRACT

MA
MA
MB
MB
MB
MB
MB
MB

4
14-23
1
1

MB

11

2
38
~130
~130
27

22
13

37
57

1
1
1
1
1
3
3
3
3
1
15

FSQRT
FSIN FCOS
FSINCOS
F2XM1
FYL2X
FYL2XP1
FPTAN
FPATAN

2
14-23
1
1
1
1
1
2
4
16
2
1
38

Less at lower
precision

73
~150
270-360
50-200
~50
~50
300-370
~180

Other
FNOP
WAIT
FNCLEX
FNINIT

1
1

MB
I12

1
1/2
59
84

Integer MMX and XMM instructions


Operands

ops

Port

r,(x)mm
m,(x)mm
(x)mm,r
(x)mm,m
v,v
(x)mm,m64
m64, (x)mm
x,x

1
1
1
1
1
1
1
1

MB
SA ST
I2
LD
MB
LD
SA ST
MB

Latency Reciprocal
throughput

Move instructions
MOVD
MOVD
MOVD
MOVD
MOVQ
MOVQ
MOVQ
MOVDQA


3
2
4
2
1
2
2
1

1
1-2
1
1
1
1
1-2
1

Remarks

Nano 3000
MOVDQA
MOVDQA
MOVDQU
MOVDQU
LDDQU
MOVDQ2Q
MOVQ2DQ
MOVNTQ
MOVNTDQ
MOVNTDQA
PACKSSWB/DW
PACKUSWB
PACKUSDW
PUNPCKH/LBW/WD/DQ

PUNPCKH/LQDQ
PSHUFB
PSHUFW
PSHUFL/HW
PSHUFD
PBLENDVB
PBLENDW
PALIGNR
MASKMOVQ
MASKMOVDQU
PMOVMSKB
PEXTRW
PEXTRB/D/Q
PINSRW
PINSRB/D/Q
PMOVSX/ZXBW/BD/BQ/WD/WQ/DQ

x, m128
m128, x
m128, x
x, m128
x, m128
mm, x
x,mm
m64,mm
m128,x
x,m128

1
1
1
1
1
1
1
2
2
1

LD
SA ST
SA ST
LD
LD
MB
MB

2
2
2
2
2
1
1
~360
~360
2

1
1-2
1-2
1
1
1
1
2
2
1

v,v
x,x
v,v
v,v
v,v
mm,mm,i
x,x,i
x,x,i
x,x,xmm0
x,x,i
x,x,i
mm,mm
x,x
r32,(x)mm
r32,(x)mm,i
r32/64,x,i
(x)mm,r32,i
x,r32/64,i

1
1
1
1
1
1
1
1
1
1
1

MB
MB
MB
MB
MB
MB
MB
MB
MB
MB
MB

1
1
1
1
1
1
1
1
2
1
1

1
1
2
2

MB
MB
MB
MB

3
3
3
5
5

1
1
1
1
1
1
1
1
2
1
1
1-2
1-2
1
1
1
1
1

x,x

MB

v,v
v,v

1
1

MB
MB

1
1

1
1

v,v
v,v
v,v
x,x
v,v
v,v
x,x
v,v
x,x
v,v
v,v
v,v
x,x,i
v,v

3
3
1
1
1
1
1
1
1
1
7
1
1
1

MB
MB
MB
MB
MA
MA
MA
MA
MA
MA

3
3
1
1
3
3
3
3
3
4
10
2
2
1

3
3
1
1
1
1
1
1
1
2
8
1
1
1

Arithmetic instructions
PADD/SUB(U)(S)B/W/D
PADDQ PSUBQ
PHADD(S)W
PHSUB(S)W
PHADDD PHSUBD
PCMPEQ/GTB/W/D
PCMPEQQ
PMULL/HW PMULHUW
PMULHRSW
PMULLD
PMULUDQ
PMULDQ
PMADDWD
PMADDUBSW
PSADBW
MPSADBW
PAVGB/W

MB
MB
MB
PMIN/MAXSW
PMIN/MAXUB
PMIN/MAXSB/D
PMIN/MAXUW/D
PHMINPOSUW
PABSB PABSW PABSD
PSIGNB PSIGNW
PSIGND
Logic instructions
PAND(N) POR PXOR
PTEST
PSLL/RL/RAW/D/Q
PSLL/RL/RAW/D/Q
PSLL/RLDQ

v,v
v,v
x,x
x,x
x,x

1
1
1
1
1

MB
MB
MB
MB
MB

1
1
1
1
2

1
1
1
1
1

v,v

MB

v,v

MB

v,v
v,v
v,v
(x)mm,i
x,i

1
1
1
1
1

MB
MB
MB
MB
MB

1
3
1
1
1

1
1
1
1
1

MB

Operands

ops

Port

x,x
x,m128
m128,x
x,m128
m128,x
x,x
x,m32/64
m32/64,x
x,m64
x,m64
m64,x
m64,x
x,x
r32,x
m128,x
x,x,i
x,x,i
x,x
x,x
x,x
x,x

1
1
1
1
2
1
1
2
2
2
3
1
1

MB
LD
SA ST
LD
SA ST
MB
LD
SA ST

x,x
x,x
x,x

2
1
2

Other
EMMS

Floating point XMM instructions


Move instructions
MOVAPS/D
MOVAPS/D
MOVAPS/D
MOVUPS/D
MOVUPS/D
MOVSS/D
MOVSS/D
MOVSS/D
MOVHPS/D
MOVLPS/D
MOVHPS/D
MOVLPS/D
MOVLHPS MOVHLPS
MOVMSKPS/D
MOVNTPS/D
SHUFPS
SHUFPD
MOVDDUP
MOVSH/LDUP
UNPCKH/LPS
UNPCKH/LPD
Conversion
CVTPD2PS
CVTSD2SS
CVTPS2PD

2
1
1
1
1
1
1

MB
MB
MB
MB
MB
MB


Latency
Reciprocal throughput
1
2
2
2
2
1
2-3
2-3
6
6
6
2
1
3
~360
1
1
1
1
1
1

1
1
1
1
1
1
1
1-2
1
1
1-2
1-2
1
1
1-2
1
1
1
1
1
1

5
2
5

2
1

Remarks

CVTSS2SD
CVTDQ2PS
CVT(T)PS2DQ
CVTDQ2PD
CVT(T)PD2DQ
CVTPI2PS
CVT(T)PS2PI
CVTPI2PD
CVT(T)PD2PI
CVTSI2SS
CVT(T)SS2SI
CVTSI2SD
CVT(T)SD2SI

x,x
x,x
x,x
x,x
x,x
x,mm
mm,x
x,mm
mm,x
x,r32
r32,x
x,r32
r32,x

1
1
1
2

x,x
x,x
x,x
x,x
x,x
x,x
x,x
x,x
x,x
x,x
x,x
x,x
x,x
x,x
x,x
x,x
x,x
x,x
x,x
x,x

1
1
1
1
1
1
3
3
1
1
1
1
1
1
1
1
1
3
1
1

MBfadd
MBfadd
MBfadd
MBfadd
MBfadd
MBfadd
MBfadd
MBfadd
MA
MA
MA
MA
MA
MA
MA
MA
MA
MA
MBfadd
MBfadd

2
2
2
2
2
2
5
5
3
4
3
4
13
13-20
24
21-38
5
14
2
2

1
1
1
1
1
1
3
3
1
2
1
2
13
13-20
24
21-38
5
11
1
1

MAXSS/D MINSS/D
MAXPS/D MINPS/D

x,x
x,x
x,x

1
1
1

MBfadd
MBfadd
MBfadd

3
2
2

1
1
1

Math
SQRTSS
SQRTPS
SQRTSD
SQRTPD
RSQRTSS
RSQRTPS

x,x
x,x
x,x
x,x
x,x
x,x

1
1
1
1
1
3

MA
MA
MA
MA

33
64
62
122
5
14

33
64
62
122
5
11

Logic
ANDPS/D

x,x

MB

Arithmetic
ADDSS SUBSS
ADDSD SUBSD
ADDPS SUBPS
ADDPD SUBPD
ADDSUBPS
ADDSUBPD
HADDPS HSUBPS
HADDPD HSUBPD
MULSS
MULSD
MULPS
MULPD
DIVSS
DIVSD
DIVPS
DIVPD
RCPSS
RCPPS
CMPccSS/D
CMPccPS/D
COMISS/D UCOMISS/D

MB

2
1
2
2
2
1
2
1


2
3
2
5
4
5
4
4
4
5
4
5
4

1
1
1
2
2
1
1
2
1
1

ANDNPS/D
ORPS/D
XORPS/D

x,x
x,x
x,x

Other
LDMXCSR
STMXCSR
FXSAVE
FXRSTOR

m32
m32
m4096
m4096

1
1
1

MB
MB
MB

1
1
1

1
1
1

31
13
97
201

VIA-specific instructions
Instruction       Conditions            Clock cycles, approximately
XSTORE            Data available        160-400 clock giving 8 bytes
XSTORE            No data available     50-80 clock giving 0 bytes
REP XSTORE        Quality factor = 0    1300 clock per 8 bytes
REP XSTORE        Quality factor > 0    5455 clock per 8 bytes
REP XCRYPTECB     128 bits key          15 clock per 16 bytes
REP XCRYPTECB     192 bits key          17 clock per 16 bytes
REP XCRYPTECB     256 bits key          18 clock per 16 bytes
REP XCRYPTCBC     128 bits key          29 clock per 16 bytes
REP XCRYPTCBC     192 bits key          33 clock per 16 bytes
REP XCRYPTCBC     256 bits key          37 clock per 16 bytes
REP XCRYPTCTR     128 bits key          23 clock per 16 bytes
REP XCRYPTCTR     192 bits key          26 clock per 16 bytes
REP XCRYPTCTR     256 bits key          27 clock per 16 bytes
REP XCRYPTCFB     128 bits key          29 clock per 16 bytes
REP XCRYPTCFB     192 bits key          33 clock per 16 bytes
REP XCRYPTCFB     256 bits key          37 clock per 16 bytes
REP XCRYPTOFB     128 bits key          29 clock per 16 bytes
REP XCRYPTOFB     192 bits key          33 clock per 16 bytes
REP XCRYPTOFB     256 bits key          37 clock per 16 bytes
REP XSHA1                               5 clock per byte
REP XSHA256                             5 clock per byte
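
The block-cipher figures above are given in clock cycles per 16-byte block. A minimal sketch of how they might be converted to cycles per byte and to an approximate data rate is shown below. The function names and the clock frequency passed in are assumptions made for illustration only; they are not part of the measurements, and the results are only as accurate as the approximate figures in the table.

```python
# Approximate clock cycles per 16-byte block, copied from the
# VIA-specific instructions table (Nano 3000), keyed by (mode, key bits).
CLOCKS_PER_BLOCK = {
    ("ECB", 128): 15, ("ECB", 192): 17, ("ECB", 256): 18,
    ("CBC", 128): 29, ("CBC", 192): 33, ("CBC", 256): 37,
    ("CTR", 128): 23, ("CTR", 192): 26, ("CTR", 256): 27,
    ("CFB", 128): 29, ("CFB", 192): 33, ("CFB", 256): 37,
    ("OFB", 128): 29, ("OFB", 192): 33, ("OFB", 256): 37,
}

BLOCK_BYTES = 16  # the table quotes clocks per 16 bytes


def cycles_per_byte(mode, key_bits):
    """Approximate clock cycles spent per byte for REP XCRYPT<mode>."""
    return CLOCKS_PER_BLOCK[(mode, key_bits)] / BLOCK_BYTES


def bytes_per_second(mode, key_bits, clock_hz):
    """Approximate data rate; clock_hz is a hypothetical core frequency
    supplied by the caller, not a value taken from the table."""
    return clock_hz / cycles_per_byte(mode, key_bits)


if __name__ == "__main__":
    # Print cycles per byte for every mode and key size in the table.
    for (mode, bits), clocks in sorted(CLOCKS_PER_BLOCK.items()):
        print(f"REP XCRYPT{mode} {bits}-bit key: "
              f"{clocks / BLOCK_BYTES:.4f} cycles per byte")
```

These conversions ignore setup overhead and memory effects, so they are best read as rough upper bounds on sustained throughput rather than as predictions.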
