Instruction Tables

4. Instruction tables
Lists of instruction latencies, throughputs and micro-operation breakdowns for Intel, AMD
and VIA CPUs
By Agner Fog. Technical University of Denmark.

Copyright 1996 - 2014. Last updated 2014-02-19.
Introduction
This is the fourth in a series of five manuals:
1. Optimizing software in C++: An optimization guide for Windows, Linux and Mac
platforms.
2. Optimizing subroutines in assembly language: An optimization guide for x86 platforms.
3. The microarchitecture of Intel, AMD and VIA CPUs: An optimization guide for assembly
programmers and compiler makers.
4. Instruction tables: Lists of instruction latencies, throughputs and micro-operation
breakdowns for Intel, AMD and VIA CPUs.
5. Calling conventions for different C++ compilers and operating systems.
The latest versions of these manuals are always available from www.agner.org/optimize.
Copyright conditions are listed below.
The present manual contains tables of instruction latencies, throughputs and micro-operation
breakdown and other tables for x86 family microprocessors from Intel, AMD and VIA.
The figures in the instruction tables represent the results of my measurements rather than the official values published by microprocessor vendors. Some values in my tables are higher or lower than the values published elsewhere. The discrepancies can be explained by the following factors:
- My figures are experimental values while figures published by microprocessor vendors may be based on theory or simulations.
- My figures are obtained with a particular test method under particular conditions. It is possible that different values can be obtained under other conditions.
- Some latencies are difficult or impossible to measure accurately, especially for memory access and type conversions that cannot be chained.
- Latencies for moving data from one execution unit to another are listed explicitly in some of my tables while they are included in the general latencies in some tables published by Intel.
Most values are the same in all microprocessor modes (real, virtual, protected, 16-bit, 32-bit, 64-bit).
Values for far calls and interrupts may be different in different modes. Call gates have not been
tested.
Instructions with a LOCK prefix have a long latency that depends on cache organization and possibly RAM speed. If there are multiple processors or cores or direct memory access (DMA) devices
then all locked instructions will lock a cache line for exclusive access, which may involve RAM access. A LOCK prefix typically costs more than a hundred clock cycles, even on single-processor
systems. This also applies to the XCHG instruction with a memory operand.
If any text in the pdf version of this manual is unreadable, then please refer to the spreadsheet version.
Copyright notice
This series of five manuals is copyrighted by Agner Fog. Public distribution and
mirroring is not allowed. Non-public distribution to a limited audience for
educational purposes is allowed. The code examples in these manuals can be
used without restrictions. A GNU Free Documentation License shall automatically
come into force when I die. See www.gnu.org/copyleft/fdl.html
Definition of terms
Operands
Operands can be different types of registers, memory, or immediate constants. Abbreviations used in the tables are: i = immediate constant, r = any general purpose register, r32 = 32-bit register, etc., mm = 64 bit mmx register, x or xmm = 128 bit xmm register, y = 256 bit ymm register, v = any vector register, sr = segment register, m = any memory operand including indirect operands, m64 means 64-bit memory operand, etc.
Latency
The latency of an instruction is the delay that the instruction generates in a dependency chain. The measurement unit is clock cycles. Where the clock frequency is varied dynamically, the figures refer to the core clock frequency. The numbers listed are minimum values. Cache misses, misalignment, and exceptions may increase the clock counts considerably. Floating point operands are presumed to be normal numbers. Denormal numbers, NANs and infinity may increase the latencies by possibly more than 100 clock cycles on many processors, except in move, shuffle and Boolean instructions. Floating point overflow, underflow, denormal or NAN results may give a similar delay. A missing value in the table means that the value has not been measured or that it cannot be measured in a meaningful way.
Some processors have a pipelined execution unit that is smaller than the largest register size so that different parts of the operand are calculated at different times. Assume, for example, that we have a long dependency chain of 128-bit vector instructions running in a fully pipelined 64-bit execution unit with a latency of 4. The lower 64 bits of each operation will be calculated at times 0, 4, 8, 12, 16, etc. And the upper 64 bits of each operation will be calculated at times 1, 5, 9, 13, 17, etc. as shown in the figure below. If we look at one 128-bit instruction in isolation, the latency will be 5. But if we look at a long chain of 128-bit instructions, the total latency will be 4 clock cycles per instruction plus one extra clock cycle in the end. The latency in this case is listed as 4 in the tables because this is the value it adds to a dependency chain.
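The arithmetic of this example can be checked with a small sketch. The function names below are illustrative only and are not part of the manual; the model simply assumes that the upper half of each operation starts one clock cycle after the lower half, as in the example above.

```python
def half_start_times(n_instructions, latency=4):
    """Start times of the lower and upper 64-bit halves of each
    128-bit instruction in a dependency chain (simplified model)."""
    lower = [k * latency for k in range(n_instructions)]
    upper = [t + 1 for t in lower]  # upper half follows one clock later
    return lower, upper


def chain_total_cycles(n_instructions, latency=4):
    """Total time for the whole chain: the last upper half must finish."""
    _, upper = half_start_times(n_instructions, latency)
    return upper[-1] + latency


lower, upper = half_start_times(5)
print(lower, upper)
print(chain_total_cycles(1), chain_total_cycles(100))
```

A single instruction costs 5 clock cycles in this model, while a chain of 100 costs 4 per instruction plus one, which is why the tables would list 4.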
Reciprocal throughput
The throughput is the maximum number of instructions of the same kind that can be
executed per clock cycle when the operands of each instruction are independent of
the preceding instructions. The values listed are the reciprocals of the throughputs,
i.e. the average number of clock cycles per instruction when the instructions are not
part of a limiting dependency chain. For example, a reciprocal throughput of 2 for
FMUL means that a new FMUL instruction can start executing 2 clock cycles after a
previous FMUL. A reciprocal throughput of 0.33 for ADD means that the execution
units can handle 3 integer additions per clock cycle.
The reason for listing the reciprocal values is that this makes comparisons between latency and throughput easier. The reciprocal throughput is also called issue latency.
The values listed are for a single thread or a single core. A missing value in the table
means that the value has not been measured.
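As a rough illustration of how the two figures are used together, the following sketch converts a reciprocal throughput into instructions per clock cycle and gives a lower bound on the execution time of a sequence. The function names are hypothetical, and the model ignores every other bottleneck in the pipeline.

```python
from fractions import Fraction


def instructions_per_clock(recip_throughput):
    """Throughput is the reciprocal of the value listed in the tables."""
    return 1 / Fraction(recip_throughput)


def estimate_cycles(n, latency, recip_throughput, dependent):
    """Lower bound on clock cycles for n instructions of the same kind.

    dependent=True : each instruction needs the result of the previous
                     one, so the latency is the limiting factor.
    dependent=False: the instructions are independent, so the
                     reciprocal throughput is the limiting factor.
    """
    if dependent:
        return n * Fraction(latency)
    return n * Fraction(recip_throughput)


# An ADD with latency 1 and reciprocal throughput 1/3
add_rt = Fraction(1, 3)
print(instructions_per_clock(add_rt))
print(estimate_cycles(300, 1, add_rt, dependent=False))
print(estimate_cycles(300, 1, add_rt, dependent=True))
```

Using exact fractions avoids the rounding that a value such as 0.33 would introduce.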
μops
Uop or μop is an abbreviation for micro-operation. Processors with out-of-order cores are capable of splitting complex instructions into μops. For example, a read-modify instruction may be split into a read-μop and a modify-μop. The number of μops that an instruction generates is important when certain bottlenecks in the pipeline limit the number of μops per clock cycle.
Execution unit
The execution core of a microprocessor has several execution units. Each execution unit can handle a particular category of μops, for example floating point additions. The information about which execution unit a particular μop goes to can be useful for two purposes. Firstly, two μops cannot execute simultaneously if they need the same execution unit. And secondly, some processors have a latency of an extra clock cycle when the result of a μop executing in one execution unit is needed as input for a μop in another execution unit.
Execution port
The execution units are clustered around a few execution ports on most Intel processors. Each μop passes through an execution port to get to the right execution unit. An execution port can be a bottleneck because it can handle only one μop at a time. Two μops cannot execute simultaneously if they need the same execution port, even if they are going to different execution units.
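The port bottleneck can be illustrated with a deliberately simplified sketch. It assumes that each μop is bound to exactly one port (real processors often allow a choice of ports) and that each port accepts one μop per clock cycle; the port names and the function name are made up for the illustration.

```python
from collections import Counter


def min_cycles_for_ports(uop_ports):
    """Lower bound on clock cycles needed to issue the given uops,
    where uop_ports lists the port that each uop must use.
    The busiest port determines the bound."""
    return max(Counter(uop_ports).values(), default=0)


# Four uops, two of which compete for the same (hypothetical) port p0
print(min_cycles_for_ports(["p0", "p1", "p0", "p5"]))
```

Even though four μops are spread over three ports, the two μops that share a port cannot start in the same clock cycle, so at least two cycles are needed.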
Instruction set
This indicates which instruction set an instruction belongs to. The instruction is only
available in processors that support this instruction set. The different instruction sets
are listed at the end of this manual. Availability in processors prior to 80386 does not
apply for 32-bit and 64-bit operands. Availability in the MMX instruction set does not
apply to 128-bit packed integer instructions, which require SSE2. Availability in the
SSE instruction set does not apply to double precision floating point instructions,
which require SSE2.
32-bit instructions are available in 80386 and later. 64-bit instructions in general purpose registers are available only under 64-bit operating systems. Instructions that use
XMM registers (SSE and later) are only available under operating systems that support this register set. Instructions that use YMM registers (AVX and later) are only
available under operating systems that support this register set.
How the values were measured

The values in the tables are measured with the use of my own test programs, which are available
from www.agner.org/optimize/testp.zip
The time unit for all measurements is CPU clock cycles. An attempt is made to obtain the highest clock
frequency if the clock frequency varies with the workload. Many Intel processors have a performance counter named "core clock cycles". This counter gives measurements that are independent
of the varying clock frequency. Where no "core clock cycles" counter is available, the "time stamp
counter" is used (RDTSC instruction). In cases where this gives inconsistent results (e.g. in AMD
Bobcat) it is necessary to make the processor boost the clock frequency by executing a large number of instructions (> 1 million) or turn off the power-saving feature in the BIOS setup.
Instruction throughputs are measured with a long sequence of instructions of the same kind, where
subsequent instructions use different registers in order to avoid dependence of each instruction on
the previous one. The input registers are cleared in the cases where it is impossible to use different
registers. The test code is carefully constructed in each case to make sure that no other bottleneck
is limiting the throughput than the one that is being measured.
Instruction latencies are measured in a long dependency chain of identical instructions where the
output of each instruction is needed as input for the next instruction.
The sequence of instructions should be long, but not so long that it doesn't fit into the level-1 code
cache. A typical length is 100 instructions of the same type. This sequence is repeated in a loop if a
larger number of instructions is desired.
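The structure of such test sequences can be sketched as follows. This is not the author's test program (that is available at the address given above); it is a minimal illustration that only generates assembly text, and the register names, default length and function names are assumptions chosen for the example.

```python
def latency_chain(mnemonic, reg="eax", length=100):
    """Dependency chain: every instruction reads and writes the same
    register, so each one must wait for the previous result."""
    return [f"{mnemonic} {reg}, {reg}"] * length


def throughput_sequence(mnemonic,
                        regs=("eax", "ebx", "ecx", "edx", "esi", "edi"),
                        length=100):
    """Throughput test: consecutive instructions use different
    registers, so none depends on its immediate predecessor."""
    return [f"{mnemonic} {regs[i % len(regs)]}, {regs[i % len(regs)]}"
            for i in range(length)]


def cycles_per_instruction(total_cycles, length, repetitions=1):
    """Convert a measured clock count into a per-instruction figure."""
    return total_cycles / (length * repetitions)


print("\n".join(latency_chain("imul", length=3)))
print("\n".join(throughput_sequence("imul", length=3)))
print(cycles_per_instruction(400, 100))
```

Dividing the measured clock count of the first sequence by its length gives a latency estimate; doing the same for the second sequence gives a reciprocal throughput estimate, provided that no other bottleneck limits the result.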
It is not possible to measure the latency of a memory read or write instruction with software methods.
It is only possible to measure the combined latency of a memory write followed by a memory read
from the same address. What is measured here is not actually the cache access time, because in
most cases the microprocessor is smart enough to make a "store forwarding" directly from the write
unit to the read unit rather than waiting for the data to go to the cache and back again. The latency of
this store forwarding process is arbitrarily divided into a write latency and a read latency in the tables.
But in fact, the only value that makes sense to performance optimization is the sum of the write time
and the read time.
A similar problem occurs where the input and the output of an instruction use different types of registers. For example, the MOVD instruction can transfer data between general purpose registers and
XMM vector registers. The value that can be measured is the combined latency of data transfer from
one type of registers to another type and back again (A → B → A). The division of this latency between the A → B latency and the B → A latency is sometimes obvious, sometimes based on guesswork, μop counts, indirect evidence, or triangular sequences such as A → B → Memory → A. In
many cases, however, the division of the total latency between A → B latency and B → A latency is
arbitrary. However, what cannot be measured cannot matter for performance optimization. What
counts is the sum of the A → B latency and the B → A latency, not the individual terms.
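A short sketch shows why the division does not matter in practice. The latency values used here are hypothetical and are not taken from any table; the function names are made up for the illustration.

```python
def round_trip_latency(total_cycles, round_trips):
    """Combined latency per round trip (write + read, or A to B and
    back), derived from a measured chain of round trips."""
    return total_cycles / round_trips


def chain_cost(round_trips, first_leg, second_leg):
    """Cost of a dependency chain for a given split of the latency."""
    return round_trips * (first_leg + second_leg)


# Two different ways of dividing a round-trip latency of 6 clocks
print(chain_cost(100, 1, 5))
print(chain_cost(100, 3, 3))
print(round_trip_latency(600, 100))
```

Both divisions predict the same cost for the chain, so only the sum can be determined from the measurement and only the sum is relevant for optimization.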
The μop counts are usually measured with the use of the performance monitor counters (PMCs) that
are built into modern microprocessors. The PMCs for VIA processors are undocumented, and the interpretation of these PMCs is based on experimentation.
The execution ports and execution units that are used by each instruction or μop are detected in different ways depending on the particular microprocessor. Some microprocessors have PMCs that
can give this information directly. In other cases it is necessary to obtain this information indirectly by
testing whether a particular instruction or μop can execute simultaneously with another
instruction/μop that is known to go to a particular execution port or execution unit. On some processors, there is a delay for transmitting data from one execution unit (or cluster of execution units) to
another. This delay can be used for detecting whether two different instructions/μops are using the
same or different execution units.
Microprocessor versions tested

The tables in this manual are based on testing of the following microprocessors:

Processor name          Microarchitecture     Family  Model  Comment
                        code name             number  number
                                              (hex)   (hex)
AMD K7 Athlon                                 6       6      Step. 2, rev. A5
AMD K8 Opteron                                F       5      Stepping A
AMD K10 Opteron                               10      2      2350, step. 1
AMD Bulldozer           Bulldozer, Zambezi    15      1      FX-6100, step 2
AMD Piledriver          Piledriver            15      2      FX-8350, step 0. And others
AMD Steamroller         Steamroller, Kaveri   15      30     A10-7850K, step 1
AMD Bobcat              Bobcat                14      1      E350, step. 0
AMD Kabini              Jaguar                16      0      A4-5000, step 1
Intel Pentium           P5                    5       2
Intel Pentium MMX       P5                    5       4      Stepping 4
Intel Pentium II        P6                    6       6
Intel Pentium III       P6                    6       7
Intel Pentium 4         Netburst              F       2      Stepping 4, rev. B0
Intel Pentium 4 EM64T   Netburst, Prescott    F       4      Xeon. Stepping 1
Intel Pentium M         Dothan                6       D      Stepping 6, rev. B1
Intel Core Duo          Yonah                 6       E      Not fully tested
Intel Core 2 (65 nm)    Merom                 6       F      T5500, Step. 6, rev. B2
Intel Core 2 (45 nm)    Wolfdale              6       17     E8400, Step. 6
Intel Core i7           Nehalem               6       1A     i7-920, Step. 5, rev. D0
Intel 2nd gen. Core     Sandy Bridge          6       2A     i5-2500, Step 7
Intel 3rd gen. Core     Ivy Bridge            6       3A     i7-3770K, Step 9
Intel 4th gen. Core     Haswell               6       3C     i7-4770K, step. 3
Intel Atom 330          Diamondville          6       1C     Step. 2
VIA Nano L2200                                6       F      Step. 2
VIA Nano L3050          Isaiah                6       F      Step. 8 (prerel. sample)
AMD K7
List of instruction timings and macro-operation breakdown
Explanation of column headings:
Instruction: Instruction name. cc means any condition code. For example, Jcc can be JB, JNE, etc.
Operands: i = immediate constant, r = any register, r32 = 32-bit register, etc., mm = 64 bit mmx register, xmm = 128 bit xmm register, sr = segment register, m = any memory operand including indirect operands, m64 means 64-bit memory operand, etc.
Ops: Number of macro-operations issued from instruction decoder to schedulers. Instructions with more than 2 macro-operations use microcode.
Latency: This is the delay that the instruction generates in a dependency chain. The numbers are minimum values. Cache misses, misalignment, and exceptions may increase the clock counts considerably. Floating point operands are presumed to be normal numbers. Denormal numbers, NANs, infinity and exceptions increase the delays. The latency listed does not include the memory operand where the operand is listed as register or memory (r/m).
Reciprocal throughput: This is also called issue latency. This value indicates the average number of clock cycles from when the execution of an instruction begins until a subsequent independent instruction of the same kind can begin to execute. A value of 1/3 indicates that the execution units can handle 3 instructions per clock cycle in one thread. However, the throughput may be limited by other bottlenecks in the pipeline.
Execution unit: Indicates which execution unit is used for the macro-operations. ALU means any of the three integer ALUs. ALU0_1 means that ALU0 and ALU1 are both used. AGU means any of the three integer address generation units. FADD means floating point adder unit. FMUL means floating point multiplier unit. FMISC means floating point store and miscellaneous unit. FA/M means FADD or FMUL is used. FANY means any of the three floating point units can be used. Two macro-operations can execute simultaneously if they go to different execution units.
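The table entries below appear in several notations: plain integers, fractions such as 1/3, ranges such as 9-13, and values such as 2,5. A reader who wants to process the figures programmatically might normalize them along the following lines. This is only a sketch; the function name is made up, and it assumes that a comma is a decimal comma (so 2,5 means 2.5) and that a hyphen separates the low and high ends of a range.

```python
from fractions import Fraction


def parse_table_value(text):
    """Parse a latency or reciprocal-throughput entry into a
    (low, high) pair of exact fractions."""
    text = text.strip().replace(",", ".")  # assumed decimal comma
    if "-" in text:
        low, high = text.split("-", 1)
        return Fraction(low.strip()), Fraction(high.strip())
    value = Fraction(text)
    return value, value


for entry in ["1/3", "2,5", "9-13", "1/3 - 2"]:
    print(entry, parse_table_value(entry))
```

Single values are returned as a pair with equal ends so that all entries can be handled uniformly.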
Integer instructions
Instruction
Move instructions
MOV
MOV
Operands
r,r
r,i
Ops
1
1
Latency Reciprocal
throughput
1
1
1/3
1/3
MOV
MOV
MOV
MOV
r8,m8
r16,m16
r32,m32
m8,r8H
1
1
1
1
4
4
3
8
1/2
1/2
1/2
1/2
MOV
m8,r8L
1/2
MOV
MOV
MOV
m16/32,r
m,i
r,sr
1
1
1
2
2
2
1/2
1/2
1
Execution
unit
Notes
ALU
ALU
ALU, AGU
Any addr. mode. Add 1 clk if code segment base ≠ 0
do.
ALU, AGU
do.
AGU
AH,
BH,
CH, DH
AGU
Any other 8-bit
register
AGU
Any addressing
mode
AGU
AGU
MOV
MOVZX, MOVSX
MOVZX, MOVSX
CMOVcc
CMOVcc
XCHG
XCHG
XLAT
PUSH
PUSH
PUSH
PUSH
PUSHF(D)
PUSHA(D)
POP
POP
POP
POP
POPF(D)
POPA(D)
LEA
LEA
LAHF
SAHF
SALC
LDS, LES, ...
BSWAP
Arithmetic instructions
ADD, SUB
ADD, SUB
ADD, SUB
ADC, SBB
ADC, SBB
ADC, SBB
CMP
CMP
INC, DEC, NEG
INC, DEC, NEG
AAA, AAS
DAA
DAS
AAD
AAM
MUL, IMUL
MUL, IMUL
MUL, IMUL
IMUL
sr,r/m
r,r
r,m
r,r
r,m
r,r
6
1
1
1
1
3
9-13
1
4
1
r,m
3
2
1
1
2
2
1
9
2
3
6
9
2
9
2
1
4
2
1
10
1
16
5
1
1
7
1
1
7
1
r8/m8
1
1
1
1
1
1
1
1
1
1
9
12
16
4
31
3
1
7
5
6
7
5
13
3
r16/m16
r32/m32
r16,r16/m16
3
3
2
3
4
3
r
i
m
sr
r
m
DS/ES/FS/GS
SS
r16,[m]
r32,[m]
r,m
r
r,r/i
r,m
m,r
r,r/i
r,m
m,r/i
r,r/i
r,m
r
m
3
2
3
2
1
1
8
1/3
1/2
1/3
1/2
1
16
1
1
1
1
1
4
1
1
10
18
1
4
1
1/3
2
2
1
9
1/3
1/3
1/2
2,5
1/3
1/2
2,5
1/3
1/2
1/3
3
5
6
7
ALU
ALU, AGU
ALU
ALU, AGU
ALU
ALU, AGU
Timing depends on hw
ALU, AGU
ALU, AGU
ALU, AGU
ALU, AGU
ALU, AGU
ALU, AGU
ALU, AGU
ALU, AGU
ALU, AGU
ALU, AGU
ALU, AGU
ALU, AGU
ALU, AGU
AGU
Any addr. size
AGU
Any addr. size
ALU
ALU
ALU
ALU
ALU
ALU, AGU
ALU, AGU
ALU
ALU, AGU
ALU, AGU
ALU
ALU, AGU
ALU
ALU, AGU
ALU
ALU
ALU
ALU0
ALU
ALU0
2
3
2
ALU0_1
ALU0_1
ALU0
latency ax=3,
dx=4
IMUL
IMUL
IMUL
IMUL
IMUL
DIV
DIV
DIV
IDIV
IDIV
IDIV
IDIV
IDIV
IDIV
CBW, CWDE
CWD, CDQ
Logic instructions
AND, OR, XOR
AND, OR, XOR
AND, OR, XOR
TEST
TEST
NOT
NOT
SHL, SHR, SAR
ROL, ROR
RCL, RCR
RCL
RCR
RCL
RCR
SHL,SHR,SAR,ROL,ROR
RCL, RCR
RCL
RCR
RCL
RCR
SHLD, SHRD
SHLD, SHRD
SHLD, SHRD
BT
BT
BT
BTC, BTR, BTS
BTC
BTR, BTS
BTC, BTR, BTS
BSF
BSR
r32,r32/m32
r16,(r16),i
r32,(r32),i
r16,m16,i
r32,m32,i
r8/m8
r16/m16
r32/m32
r8
r16
r32
m8
m16
m32
2
2
2
3
3
32
47
79
41
56
88
42
57
89
1
1
r,r
r,m
m,r
r,r
r,m
r
m
r,i/CL
r,i/CL
r,1
r,i
r,i
r,CL
r,CL
m,i /CL
m,1
m,i
m,i
m,CL
m,CL
r,r,i
r,r,cl
m,r,i/CL
r,r/i
m,i
m,r
r,r/i
m,i
m,i
m,r
r,r
r,r
1
1
1
1
1
1
1
1
1
1
9
7
9
7
1
1
10
9
9
8
6
7
8
1
1
5
2
5
4
8
19
23
4
4
5
24
24
40
17
25
41
17
25
41
1
1
1
1
7
1
1
1
7
1
1
1
4
3
3
3
7
7
5
8
6
7
4
4
7
1
2
7
7
6
7
9
2,5
1
2
2
2
23
23
40
17
25
41
17
25
41
1/3
1/3
ALU0
ALU0
ALU0
ALU0
ALU0
ALU
ALU
ALU
ALU
ALU
ALU
ALU
ALU
ALU
ALU
ALU
1/3
1/2
2,5
1/3
1/2
1/3
2,5
1/3
1/3
1/3
4
3
3
3
3
4
4
4
4
3
2
3
3
1/3
1/2
2
1
2
2
3
7
9
ALU
ALU, AGU
ALU, AGU
ALU
ALU, AGU
ALU
ALU, AGU
ALU
ALU
ALU
ALU
ALU
ALU
ALU
ALU, AGU
ALU, AGU
ALU, AGU
ALU, AGU
ALU, AGU
ALU, AGU
ALU
ALU
ALU, AGU
ALU
ALU, AGU
ALU, AGU
ALU
ALU, AGU
ALU, AGU
ALU, AGU
ALU
ALU
BSF
BSR
SETcc
SETcc
CLC, STC
CMC
CLD
STD
r,m
r,m
r
m
Control transfer instructions

JMP
short/near
JMP
JMP
JMP
20
23
1
1
1
1
2
3
8
10
1
8
10
1/3
1/2
1/3
1/3
1
2
ALU, AGU
ALU, AGU
ALU
ALU, AGU
ALU
ALU
ALU
ALU
ALU
low values = real
mode
far
r
m(near)
16-20
1
1
23-32
m(far)
short/near
short
short
near
17-21
1
2
7
3
25-33
CALL
CALL
CALL
far
r
m(near)
16-22
4
5
23-32
3
3
CALL
RETN
RETN
m(far)
16-22
2
2
24-33
3
3
15-23
24-35
15-24
32
33
24-35
81
42
INTO
String instructions
LODS
REP LODS
STOS
REP STOS
MOVS
REP MOVS
SCAS
REP SCAS
CMPS
REP CMPS
4
5
4
3
7
4
5
5
7
6
JMP
Jcc
J(E)CXZ
LOOP
CALL
RETF
RETF
IRET
INT
BOUND
2
2
3-4
2
2
2
2
1
3
1-4
2
2
6
3-4
1/3 - 2
1/3 - 2
3-4
2
ALU
ALU, AGU
ALU
ALU
ALU
ALU
low values = real

mode
rcp. t.= 2 if jump
rcp. t.= 2 if jump
low values = real

mode
3
3
ALU
ALU, AGU
low values = real
mode
3
3
2
2
2
1
3
1-4
2
2
6
3-4
ALU
ALU
low values = real
mode
low values = real
mode
real mode
real mode
values are for no
jump
values are for no
jump
values per count

values per count
values per count
values per count
values per count
Other
NOP (90)
Long NOP (0F 1F)
ENTER
1
1
i,0
LEAVE
CLI
STI
CPUID
RDTSC
RDPMC
3
8-9
16-17
19-28
5
9
0
0
12
1/3
1/3
12
ALU
ALU
12
3 ops, 5 clk if 16
bit
3
5
27
44-74
11
11
Floating point x87 instructions

Instruction
Move instructions
FLD
FLD
FLD
FBLD
FST(P)
FST(P)
FSTP
FBSTP
FXCH
FILD
FIST(P)
FLDZ, FLD1
Operands
r
m32/64
m80
m80
r
m32/64
m80
m80
r
m
m
FCMOVcc
FFREE
FINCSTP, FDECSTP
st0,r
r
FNSTSW
FSTSW
FNSTSW
FNSTCW
Ops
1
1
7
30
1
1
10
260
1
1
1
1
Latency Reciprocal
throughput
2
4
16
41
2
3
7
0
9
7
1/2
1/2
4
39
1/2
1
5
188
0,4
1
1
1
Execution
unit
Notes
FA/M
FANY
FA/M
FMISC
FMISC
FMISC, FA/M
FMISC
42
FMISC, FA/M
Low latency immediately after FCOMI
FANY
FANY
FMISC, ALU
Low latency immediately after FCOM FTST
FMISC, ALU
do.
FMISC, ALU
do.
FMISC, ALU
FMISC, ALU
faster if unchanged
4
4
4
4
1
1-2
1
2
FADD
FADD,FMISC
FMUL
FMUL,FMISC
11-25
8-22
FMUL
9
1
1
AX
AX
m16
m16
2
3
2
3
6-12
6-12
FLDCW
m16
14
FADD(P),FSUB(R)(P)
FIADD,FISUB(R)
FMUL(P)
FIMUL
r/m
m
r/m
m
1
2
1
2
FDIV(R)(P)
r/m
5
1/3
1/3
12
12
8
1
Low values are for round divisors
FIDIV(R)
FABS, FCHS
FCOM(P), FUCOM(P)
FCOMPP, FUCOMPP
FCOMI(P)
FICOM(P)
FTST
FXAM
FRNDINT
FPREM
FPREM1
2
1
1
1
1
2
1
2
5
1
1
12-26
2
2
2
3
Math
FSQRT
FSIN
FCOS
FSINCOS
FPTAN
FPATAN
FSCALE
FXTRACT
F2XM1
FYL2X
FYL2XP1
1
44
51
76
46
72
5
7
8
49
63
Other
FNOP
(F)WAIT
FNCLEX
FNINIT
FNSAVE
FRSTOR
FXSAVE
FXRSTOR
1
1
7
25
76
65
44
85
r/m
r
m
9-23
1
1
1
1
1
1
2
3
8
8
FMUL,FMISC
FMUL
FADD
FADD
FADD
35
90-100
90-100
100-150
100-200
160-170
8
11
27
126
147
12
FMUL
0
0
1/3
1/3
24
92
147
120
59
87
FANY
ALU
FMISC
FMISC
2
10
7-10
8-11
do.
FADD, FMISC
FADD
FMISC, ALU
FMUL
FMUL
Integer MMX instructions

Instruction
Move instructions
MOVD
MOVD
MOVD
MOVD
MOVQ
MOVQ
MOVQ
MOVNTQ
PACKSSWB/DW
PACKUSWB
Operands
Ops
Latency Reciprocal
throughput
r32, mm
mm, r32
mm,m32
m32, r
mm,mm
mm,m64
m64,mm
m,mm
2
2
1
1
1
1
1
1
7
9
mm,r/m
Execution
unit
2
2
1/2
1
1/2
1/2
1
2
FMISC, ALU
FANY, ALU
FANY
FMISC
FA/M
FANY
FMISC
FMISC
FA/M
Notes
PUNPCKH/LBW/WD
PSHUFW
MASKMOVQ
PMOVMSKB
PEXTRW
PINSRW
mm,r/m
mm,mm,i
mm,mm
r32,mm
r32,mm,i
mm,r32,i
1
1
32
3
2
2
mm,r/m
mm,r/m
2
2
5
12
2
1/2
24
3
2
2
FA/M
FA/M
FADD
FMISC, ALU
FA/M
1
1
2
2
1/2
1/2
FA/M
FA/M
mm,r/m
mm,r/m
mm,r/m
mm,r/m
mm,r/m
1
1
1
1
1
3
3
2
2
3
1
1
1/2
1/2
1
FMUL
FMUL
FA/M
FA/M
FADD
mm,r/m
1/2
FA/M
mm,i/mm/m
1/2
FA/M
1/3
FANY
PADDB/W/D PADDSB/W
PADDUSB/W
PSUBB/W/D PSUBSB/W
PSUBUSB/W
PCMPEQ/GT B/W/D
PMULLW PMULHW
PMULHUW
PMADDWD
PAVGB/W
PMIN/MAX SW/UB
PSADBW
Logic
PAND PANDN POR
PXOR
PSLL/RLW/D/Q
PSRAW/D
Other
EMMS
Floating point XMM instructions

Instruction
Move instructions
MOVAPS
MOVAPS
MOVAPS
MOVUPS
MOVUPS
MOVUPS
MOVSS
MOVSS
MOVSS
MOVHLPS, MOVLHPS
MOVHPS, MOVLPS
MOVHPS, MOVLPS
MOVNTPS
MOVMSKPS
SHUFPS
Operands
Ops
r,r
r,m
m,r
r,r
r,m
m,r
r,r
r,m
m,r
r,r
r,m
m,r
m,r
r32,r
r,r/m,i
2
2
2
2
5
5
1
2
1
1
1
1
2
3
3
Latency Reciprocal
throughput
2
2
4
3
2
3
1
2
2
1
2
2
1
1
1
1/2
1/2
1
4
2
3
Execution
unit
FA/M
FMISC
FMISC
FA/M
FA/M
FANY FMISC
FMISC
FA/M
FMISC
FMISC
FMISC
FADD
FMUL
Notes
UNPCK H/L PS
Conversion
CVTPI2PS
CVT(T)PS2PI
CVTSI2SS
CVT(T)SS2SI
Arithmetic
ADDSS SUBSS
ADDPS SUBPS
MULSS
MULPS
r,r/m
xmm,mm
mm,xmm
xmm,r32
r32,xmm
1
1
4
2
4
6
r,r/m
r,r/m
r,r/m
r,r/m
1
2
1
2
4
4
4
4
FMUL
10
3
FMISC
FMISC
FMISC
FMISC
1
2
1
2
FADD
FADD
FMUL
FMUL
DIVSS
DIVPS
RCPSS
RCPPS
MAXSS MINSS
MAXPS MINPS
CMPccSS
CMPccPS
COMISS UCOMISS
r,r/m
r,r/m
r,r/m
r,r/m
r,r/m
r,r/m
r,r/m
r,r/m
r,r/m
1
2
1
2
1
2
1
2
1
11-16
18-30
3
3
2
2
2
2
2
8-13
18-30
1
2
1
2
1
2
1
FMUL
FMUL
FMUL
FMUL
FADD
FADD
FADD
FADD
FADD
Logic
ANDPS/D ANDNPS/D
ORPS/D XORPS/D
r,r/m
FMUL
Math
SQRTSS
SQRTPS
RSQRTSS
RSQRTPS
r,r/m
r,r/m
r,r/m
r,r/m
1
2
1
2
19
36
3
3
16
36
1
2
FMUL
FMUL
FMUL
FMUL
Other
LDMXCSR
STMXCSR
m
m
8
3
Low values are for round divisors, e.g. powers of 2.
do.
9
10
3DNow instructions (obsolete)

Instruction
Operands
Move and convert instructions

PREFETCH(W)
m
PF2ID
mm,mm
PI2FD
mm,mm
PF2IW
mm,mm
PI2FW
mm,mm
PSWAPD
mm,mm
Ops
1
1
1
1
1
1
Latency Reciprocal
throughput
5
5
5
5
2
1/2
1
1
1
1
1/2
Execution
unit
AGU
FMISC
FMISC
FMISC
FMISC
FA/M
Notes
3DNow E
3DNow E
3DNow E
PAVGUSB
PMULHRW
mm,mm
mm,mm
1
1
2
3
1/2
1
FA/M
FMUL
Floating point instructions

PFADD/SUB/SUBR
PFCMPEQ/GE/GT
PFMAX/MIN
PFMUL
PFACC
PFNACC, PFPNACC
PFRCP
PFRCPIT1/2
PFRSQRT
PFRSQIT1
mm,mm
mm,mm
mm,mm
mm,mm
mm,mm
mm,mm
mm,mm
mm,mm
mm,mm
mm,mm
1
1
1
1
1
1
1
1
1
1
4
2
2
4
4
4
3
4
3
4
1
1
1
1
1
1
1
1
1
1
FADD
FADD
FADD
FMUL
FADD
FADD
FMUL
FMUL
FMUL
FMUL
Other
FEMMS
mm,mm
1/3
FANY
3DNow E
AMD K8
Instruction: Instruction name. cc means any condition code. For example, Jcc can be JB, JNE, etc.
Operands: i = immediate constant, r = any register, r32 = 32-bit register, etc., mm = 64 bit mmx register, xmm = 128 bit xmm register, sr = segment register, m = any memory operand including indirect operands, m64 means 64-bit memory operand, etc.
Ops: Number of macro-operations issued from instruction decoder to schedulers. Instructions with more than 2 macro-operations use microcode.
Latency: This is the delay that the instruction generates in a dependency chain. The numbers are minimum values. Cache misses, misalignment, and exceptions may increase the clock counts considerably. Floating point operands are presumed to be normal numbers. Denormal numbers, NANs, infinity and exceptions increase the delays. The latency listed does not include the memory operand where the operand is listed as register or memory (r/m).
Reciprocal throughput: This is also called issue latency. This value indicates the average number of clock cycles from when the execution of an instruction begins until a subsequent independent instruction of the same kind can begin to execute. A value of 1/3 indicates that the execution units can handle 3 instructions per clock cycle in one thread. However, the throughput may be limited by other bottlenecks in the pipeline.
Execution unit: Indicates which execution unit is used for the macro-operations. ALU means any of the three integer ALUs. ALU0_1 means that ALU0 and ALU1 are both used. AGU means any of the three integer address generation units. FADD means floating point adder unit. FMUL means floating point multiplier unit. FMISC means floating point store and miscellaneous unit. FA/M means FADD or FMUL is used. FANY means any of the three floating point units can be used. Two macro-operations can execute simultaneously if they go to different execution units.
Instruction
Move instructions
MOV
MOV
MOV
MOV
MOV
MOV
MOV
MOV
MOV
MOV
MOV
MOV
MOV
MOVNTI
Operands
Ops
Latency Reciprocal Execution

throughput unit
r,r
r,i
r8,m8
r16,m16
r32,m32
r64,m64
m8,r8H
1
1
1
1
1
1
1
1
1
4
4
3
3
8
1/3
1/3
1/2
1/2
1/2
1/2
1/2
ALU
ALU
ALU, AGU
ALU, AGU
AGU
AGU
AGU
m8,r8L
m16/32/64,r
m,i
m64,i32
r,sr
sr,r/m
m,r
1
1
1
1
1
6
1
3
3
3
3
2
9-13
1/2
1/2
1/2
1/2
1/2-1
8
2-3
AGU
AGU
AGU
AGU
AGU
Notes
Any addressing mode. Add 1 clock if code segment base ≠ 0
AH, BH, CH, DH
Any other 8-bit register
Any addressing mode
MOVZX, MOVSX
MOVZX, MOVSX
MOVSXD
MOVSXD
CMOVcc
CMOVcc
XCHG
r,r
r,m
r64,r32
r64,m32
r,r
r,m
r,r
1
1
1
1
1
1
3
1
4
1
XCHG
XLAT
PUSH
PUSH
PUSH
PUSH
PUSHF(D/Q)
PUSHA(D)
POP
POP
POP
POP
POPF(D/Q)
POPA(D)
LEA
LEA
LEA
LAHF
SAHF
SALC
LDS, LES, ...
BSWAP
PREFETCHNTA
PREFETCHT0/1/2
SFENCE
LFENCE
MFENCE
IN
OUT
r,m
3
2
1
1
2
2
5
9
2
3
4-6
7-9
25
9
2
1
1
4
1
1
10
1
1
1
6
1
7
270
300
16
5
1
1
1
1
2
4
1
1
8
28
10
4
3
2
2
3
1
1
1
1
1
1
1
1
1
1
1
1
9
12
16
4
1
1
7
1
1
7
1
r
i
m
sr
r
m
DS/ES/FS/GS
ADD, SUB
ADD, SUB
ADD, SUB
ADC, SBB
ADC, SBB
ADC, SBB
CMP
CMP
INC, DEC, NEG
INC, DEC, NEG
AAA, AAS
DAA
DAS
AAD
SS
r16,[m]
r32,[m]
r64,[m]
r,m
r
m
m
r,i/DX
i/DX,r
r,r/i
r,m
m,r
r,r/i
r,m
m,r/i
r,r/i
r,m
r
m
1
2
1
7
5
6
7
5
1/3
1/2
1/3
1/2
1/3
1/2
1
ALU
ALU, AGU
ALU
ALU, AGU
ALU
ALU, AGU
ALU
16
ALU, AGU
ALU, AGU
ALU, AGU
ALU, AGU
ALU, AGU
ALU, AGU
ALU, AGU
ALU, AGU
ALU, AGU
ALU, AGU
ALU, AGU
ALU, AGU
ALU, AGU
ALU, AGU
AGU
AGU
AGU
ALU
ALU
ALU
1
1
1
1
2
4
1
1
8
28
10
4
1
1/3
1/3
2
1/3
1/3
9
1/3
1/2
1/2
8
5
16
1/3
1/2
2,5
1/3
1/2
2,5
1/3
1/2
1/3
3
5
6
7
ALU
AGU
AGU
ALU
ALU, AGU
ALU, AGU
ALU
ALU, AGU
ALU, AGU
ALU
ALU, AGU
ALU
ALU, AGU
ALU
ALU
ALU
ALU0
Timing depends on hw
Any address size

Any address size
Any address size
AAM
MUL, IMUL
MUL, IMUL
MUL, IMUL
MUL, IMUL
IMUL
IMUL
IMUL
IMUL
IMUL
IMUL
IMUL
IMUL
IMUL
DIV
DIV
DIV
DIV
IDIV
IDIV
IDIV
IDIV
IDIV
IDIV
IDIV
IDIV
CBW, CWDE, CDQE
CWD, CDQ, CQO
Logic instructions
AND, OR, XOR
AND, OR, XOR
AND, OR, XOR
TEST
TEST
NOT
NOT
SHL, SHR, SAR
ROL, ROR
RCL, RCR
RCL
RCR
RCL
RCR
SHL,SHR,SAR,ROL,ROR
RCL, RCR
RCL
RCR
RCL
RCR
SHLD, SHRD
SHLD, SHRD
31
1
3
2
2
1
1
1
2
1
1
3
3
3
31
46
78
143
40
55
87
152
41
56
88
153
1
1
13
3
3-4
3
4-5
3
3
4
4
3
4
r,r
r,m
m,r
r,r
r,m
r
m
r,i/CL
r,i/CL
r,1
r,i
r,i
r,CL
r,CL
m,i /CL
m,1
m,i
m,i
m,CL
m,CL
r,r,i
r,r,cl
r8/m8
r16/m16
r32/m32
r64/m64
r16,r16/m16
r32,r32/m32
r64,r64/m64
r16,(r16),i
r32,(r32),i
r64,(r64),i
r16,m16,i
r32,m32,i
r64,m64,i
r8/m8
r16/m16
r32/m32
r64/m64
r8
r16
r32
r64
m8
m16
m32
m64
15
23
39
71
17
25
41
73
17
25
41
73
1
1
1
2
1
2
1
1
2
1
1
2
2
2
2
15
23
39
71
17
25
41
73
17
25
41
73
1/3
1/3
ALU
ALU0
ALU0_1
ALU0_1
ALU0_1
ALU0
ALU0
ALU0_1
ALU0
ALU0
ALU0
ALU0
ALU0
ALU0_1
ALU
ALU
ALU
ALU
ALU
ALU
ALU
ALU
ALU
ALU
ALU
ALU
ALU
ALU
1
1
1
1
1
1
1
1
1
1
9
7
9
7
1
1
7
1
1
1
7
1
1
1
3
3
4
3
1/3
1/2
2,5
1/3
1/2
1/3
2,5
1/3
1/3
1/3
3
3
4
3
ALU
ALU, AGU
ALU, AGU
ALU
ALU, AGU
ALU
ALU, AGU
ALU
ALU
ALU
ALU
ALU
ALU
ALU
1
1
10
9
9
8
6
7
7
7
9
8
7
8
3
3
3
4
4
4
4
3
3
3
ALU, AGU
ALU, AGU
ALU, AGU
ALU, AGU
ALU, AGU
ALU, AGU
ALU
ALU
latency ax=3, dx=4

latency rax=4, rdx=5
SHLD, SHRD
BT
BT
BT
BTC, BTR, BTS
BTC
BTR, BTS
BTC
BTR, BTS
BSF
BSF
BSR
BSF
BSF
BSF
BSR
SETcc
SETcc
CLC, STC
CMC
CLD
STD
m,r,i/CL
r,r/i
m,i
m,r
r,r/i
m,i
m,i
m,r
m,r
r16/32,r
r64,r
r,r
r16,m
r32,m
r64,m
r,m
r
m

JMP
short/near
JMP
JMP
JMP
8
1
1
5
2
5
4
8
8
21
22
28
20
22
25
28
1
1
1
1
1
2
6
1
2
7
7
5
8
8
9
10
8
9
10
10
1
far
r
m(near)
16-20
1
1
23-32
m(far)
short/near
short
short
near
17-21
1
2
7
3
25-33
CALL
CALL
CALL
far
r
m(near)
16-22
4
5
23-32
3
3
CALL
RETN
RETN
m(far)
16-22
2
2
24-33
3
3
15-23
24-35
15-24
32
33
6
2
24-35
81
42
JMP
Jcc
J(E/R)CXZ
LOOP
CALL
RETF
RETF
IRET
INT
BOUND
INTO
i
i
m
3
1/3
1/2
2
1
2
2
5
3
8
9
10
8
9
10
10
1/3
1/2
1/3
1/3
1/3
1/3
ALU, AGU
ALU
ALU, AGU
ALU, AGU
ALU
ALU, AGU
ALU, AGU
ALU, AGU
ALU, AGU
ALU
ALU
ALU
ALU, AGU
ALU, AGU
ALU, AGU
ALU, AGU
ALU
ALU, AGU
ALU
ALU
ALU
ALU
ALU
low values = real
mode
2
2
3-4
2
1/3 - 2
1/3 - 2
3-4
2
ALU
ALU
ALU
ALU
low values = real

mode
recip. thrp.= 2 if jump
low values = real

mode
3
3
ALU
ALU, AGU
low values = real
mode
3
3
2
2
String instructions
ALU
ALU, AGU
ALU
ALU
low values = real
mode
low values = real
mode
real mode
real mode
values are for no jump
LODS
REP LODS
STOS
REP STOS
MOVS
REP MOVS
SCAS
REP SCAS
CMPS
REP CMPS
4
2
5
2
4
2
1.5 - 2 0.5 - 1
7
3
3
1-2
5
2
5
2
2
3
6
2
2
2
2
0.5 - 1
3
1-2
2
2
3
2
Other
NOP (90)
Long NOP (0F 1F)
ENTER
LEAVE
CLI
STI
CPUID
RDTSC
RDPMC
1
0
1
0
i,0
12
2
8-9
16-17
22-50 47-164
6
10
9
12
1/3
1/3
12
3
5
27
values are per count

ALU
ALU
12
3 ops, 5 clk if 16 bit
7
7

Instruction
Operands
Ops
r
m32/64
m80
m80
r
m32/64
m80
m80
r
m
m
1
1
7
30
1
1
10
260
1
1
1
1
2
4
16
41
2
3
7
173
0
9
7
1/2
1/2
4
39
1/2
1
5
160
0,4
1
1
1
FCMOVcc
FFREE
FINCSTP, FDECSTP
st0,r
r
9
1
1
4-15
4
2
1/3
FNSTSW
FSTSW
FNSTSW
FNSTCW
FLDCW
AX
AX
m16
m16
m16
2
3
2
3
18
6-12
6-12
Move instructions
FLD
FLD
FLD
FBLD
FST(P)
FST(P)
FSTP
FBSTP
FXCH
FILD
FIST(P)
FLDZ, FLD1

throughput unit
12
12
8
1
50
Notes
FA/M
FANY
FA/M
FMISC
FMISC
FMISC, FA/M
FMISC
FMISC, FA/M
Low latency immediately after FCOMI
FANY
FANY
FMISC, ALU
Low latency immediately after FCOM FTST
FMISC, ALU
do.
FMISC, ALU
do.
FMISC, ALU
FMISC, ALU
faster if unchanged
FADD(P),FSUB(R)(P)
FIADD,FISUB(R)
FMUL(P)
FIMUL
r/m
m
r/m
m
1
2
1
2
4
4
4
4
1
1-2
1
2
FDIV(R)(P)
FIDIV(R)
FABS, FCHS
FCOM(P), FUCOM(P)
FCOMPP, FUCOMPP
FCOMI(P)
FICOM(P)
FTST
FXAM
FRNDINT
FPREM
FPREM1
r/m
m
1
2
1
1
1
1
2
1
2
5
1
1
11-25
12-26
2
2
2
3
8-22
9-23
1
1
1
1
1
1
1
3
8
8
Math
FSQRT
FLDPI, etc.
FSIN
FCOS
FSINCOS
FPTAN
FPATAN
FSCALE
FXTRACT
F2XM1
FYL2X
FYL2XP1
1
1
66
73
98
67
97
5
7
53
72
75
27
Other
FNOP
(F)WAIT
FNCLEX
FNINIT
FNSAVE
FRSTOR
FXSAVE
FXRSTOR
1
1
8
26
77
70
61
101
r/m
r
m
2
10
7-10
8-11
140-190
150-190
170-200
150-180
217
8
12
126
179
175
0
0
12
1
FADD
FADD,FMISC
FMUL
FMUL,FMISC
Low values are for round divisors
FMUL
FMUL,FMISC
do.
FMUL
FADD
FADD
FADD
FADD, FMISC
FADD
FMISC, ALU
FMUL
FMUL
FMUL
FMISC
1/3
1/3
27
100
171
136
56
95
FANY
ALU
FMISC
FMISC
Integer MMX and XMM instructions

Instruction
Move instructions
MOVD
MOVD
MOVD
MOVD
MOVD
Operands
r32, mm
mm, r32
mm,m32
r32, xmm
xmm, r32
Ops
2
2
1
3
3

throughput unit
4
9
2
3
2
2
1/2
2
2
FMISC, ALU
FANY, ALU
FANY
FMISC, ALU
Notes
MOVD
MOVD
MOVD (MOVQ)
MOVD (MOVQ)
MOVD (MOVQ)
MOVQ
MOVQ
MOVQ
MOVQ
MOVQ
MOVDQA
MOVDQA
MOVDQA
MOVDQU
MOVDQU
MOVDQ2Q
MOVQ2DQ
MOVNTQ
MOVNTDQ
PACKSSWB/DW
PACKUSWB
PACKSSWB/DW
PACKUSWB
PUNPCKH/LBW/WD/DQ
PUNPCKH/LBW/WD/DQ
PUNPCKHQDQ
PUNPCKLQDQ
PSHUFD
PSHUFW
PSHUFL/HW
MASKMOVQ
MASKMOVDQU
PMOVMSKB
PEXTRW
PINSRW
PINSRW
xmm,m32
m32, r
2
1
1
1
r64,mm/xmm
mm,r64
xmm,r64
mm,mm
xmm,xmm
mm,m64
xmm,m64
m64,mm/x
xmm,xmm
xmm,m
m,xmm
xmm,m
m,xmm
mm,xmm
xmm,mm
m,mm
m,xmm
2
2
3
1
2
1
2
1
2
2
2
4
5
1
2
1
2
4
9
9
2
2
mm,r/m
FA/M
xmm,r/m
FA/M
mm,r/m
FA/M
xmm,r/m
xmm,r/m
xmm,r/m
xmm,xmm,i
mm,mm,i
xmm,xmm,i
mm,mm
xmm,xmm
r32,mm/xmm
r32,mm/x,i
mm,r32,i
xmm,r32,i
2
2
1
3
1
2
32
64
1
2
2
3
2
2
2
3
2
2
FA/M
FA/M
FA/M
FA/M
FA/M
FA/M
2
5
12
12
2
1
1/2
1.5
1/2
1
13
26
1
2
2
3
FADD
FMISC, ALU
FA/M
FA/M
2
2
2
2
2
1/2
1
1/2
1
1
1
2
2
2
2
1/2
1
2
3
FANY
FMISC
FMISC, ALU
Moves 64 bits. Name of instruction differs
FANY, ALU
do.
FANY, ALU
do.
FA/M
FA/M, FMISC
FANY
FANY, FMISC
FMISC
FA/M
FMISC
FMISC
FA/M
FA/M, FMISC
FMISC
FMISC
PADDB/W/D/Q
PADDSB/W
PADDUSB/W
PSUBB/W/D/Q
PSUBSB/W
PSUBUSB/W
mm,r/m
1/2
FA/M
PADDB/W/D/Q
PADDSB/W
PADDUSB/W
PSUBB/W/D/Q
PSUBSB/W
PSUBUSB/W
xmm,r/m
FA/M
PCMPEQ/GT B/W/D
PCMPEQ/GT B/W/D
PMULLW PMULHW
PMULHUW
PMULUDQ
PMULLW PMULHW
PMULHUW
PMULUDQ
PMADDWD
PMADDWD
PAVGB/W
PAVGB/W
PMIN/MAX SW/UB
PMIN/MAX SW/UB
PSADBW
PSADBW
Logic
PAND PANDN POR
PXOR
PAND PANDN POR
PXOR
PSLL/RL W/D/Q
PSRAW/D
PSLL/RL W/D/Q
PSRAW/D
PSLLDQ, PSRLDQ
mm,r/m
xmm,r/m
1
2
2
2
1/2
1
FA/M
FA/M
mm,r/m
FMUL
xmm,r/m
mm,r/m
xmm,r/m
mm,r/m
xmm,r/m
mm,r/m
xmm,r/m
mm,r/m
xmm,r/m
2
1
2
1
2
1
2
1
2
3
3
3
2
2
2
2
3
3
2
1
2
1/2
1
1/2
1
1
2
FMUL
FMUL
FMUL
FA/M
FA/M
FA/M
FA/M
FADD
FADD
mm,r/m
1/2
FA/M
xmm,r/m
FA/M
mm,i/mm/m
1/2
FA/M
x,i/x/m
xmm,i
2
2
2
2
1
1
FA/M
FA/M
1/3
FANY
Other
EMMS

Instruction
Move instructions
MOVAPS/D
MOVAPS/D
MOVAPS/D
MOVUPS/D
MOVUPS/D
MOVUPS/D
MOVSS/D
MOVSS/D
MOVSS/D
MOVHLPS, MOVLHPS
MOVHPS/D, MOVLPS/D
MOVHPS/D, MOVLPS/D
MOVDDUP
Operands
Ops

throughput unit
r,r
r,m
m,r
r,r
r,m
m,r
r,r
r,m
m,r
2
2
2
2
4
5
1
2
1
2
4
3
1
2
2
1
2
2
1
1
1
FA/M
FANY FMISC
FMISC
r,r
1/2
FA/M
r,m
FMISC
m,r
r,r
1
2
1
1
FMISC
2
Notes
FA/M
FMISC
FMISC
FA/M
SSE3
MOVSH/LDUP
MOVNTPS/D
MOVMSKPS/D
SHUFPS/D
UNPCK H/L PS/D
Conversion
CVTPS2PD
CVTPD2PS
CVTSD2SS
CVTSS2SD
CVTDQ2PS
CVTDQ2PD
CVT(T)PS2DQ
CVT(T)PD2DQ
CVTPI2PS
CVTPI2PD
CVT(T)PS2PI
CVT(T)PD2PI
CVTSI2SS
CVTSI2SD
CVT(T)SD2SI
CVT(T)SS2SI
Arithmetic
ADDSS/D SUBSS/D
ADDPS/D SUBPS/D
HADDPS/D
HSUBPS/D
MULSS/D
MULPS/D
DIVSS
DIVPS
DIVSD
DIVPD
RCPSS
RCPPS
MAXSS/D MINSS/D
MAXPS/D MINPS/D
CMPccSS/D
CMPccPS/D
COMISS/D
UCOMISS/D
r,r
m,r
r32,r
r,r/m,i
r,r/m
2
2
1
3
2
2
8
3
3
2
3
1
2
3
FMISC
FADD
FMUL
FMUL
r,r/m
r,r/m
r,r/m
r,r/m
r,r/m
r,r/m
r,r/m
r,r/m
xmm,mm
xmm,mm
mm,xmm
mm,xmm
xmm,r32
xmm,r32
r32,xmm
r32,xmm
SSE3
2
4
3
1
2
2
2
4
1
2
1
3
3
2
2
2
4
8
8
2
5
5
5
8
4
5
6
8
14
12
10
9
2
3
8
1
2
2
2
3
1
2
1
2
2
2
2
2
FMISC
FMISC
FMISC
FMISC
FMISC
FMISC
FMISC
FMISC
FMISC
FMISC
FMISC
FMISC
FMISC
FMISC
FMISC
FMISC
r,r/m
r,r/m
1
2
4
4
1
2
FADD
FADD
r,r/m
r,r/m
r,r/m
2
1
2
4
4
4
2
1
2
FADD
FMUL
FMUL
r,r/m
r,r/m
r,r/m
r,r/m
r,r/m
r,r/m
r,r/m
r,r/m
r,r/m
r,r/m
1
2
1
2
1
2
1
2
1
2
11-16
18-30
11-20
16-34
3
3
2
2
2
2
8-13
18-30
8-17
16-34
1
2
1
2
1
2
FMUL
FMUL
FMUL
FMUL
FMUL
FMUL
FADD
FADD
FADD
FADD
r,r/m
FADD
Logic
ANDPS/D ANDNPS/D
ORPS/D XORPS/D
r,r/m
FMUL
Math
SQRTSS
SQRTPS
r,r/m
r,r/m
1
2
19
36
16
36
FMUL
FMUL
SSE3
Low values are for round divisors, e.g. powers of 2.
do.
do.
do.
SQRTSD
SQRTPD
RSQRTSS
RSQRTPS
r,r/m
r,r/m
r,r/m
r,r/m
1
2
1
2
Other
LDMXCSR
STMXCSR
m
m
8
3
27
48
3
3
24
48
1
2
9
10
FMUL
FMUL
FMUL
FMUL
K10
AMD K10
Instruction:
Operands:
Ops:
Latency:
JNE, etc.
mmx register, xmm = 128 bit xmm register, sr = segment register, m = any memory operand including indirect operands, m64 means 64-bit memory operand, etc.
delays. The latency listed does not include the memory operand where the operand is listed as register or memory (r/m).
instruction of the same kind can begin to execute. A value of 1/3 indicates that the
Execution unit:
Indicates which execution unit is used for the macro-operations. ALU means any of the three integer ALUs. ALU0_1 means that ALU0 and ALU1 are both used. AGU means any of the three integer address generation units. FADD means floating point adder unit. FMUL means floating point multiplier unit. FMISC means floating point store and miscellaneous unit. FA/M means FADD or FMUL is used. FANY means any of the three floating point units can be used. Two macro-operations can execute simultaneously if they go to different execution units.
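As a rough illustration of how the latency and reciprocal throughput columns are usually read, the following Python sketch estimates clock counts for a run of identical instructions. It assumes that a chain of dependent instructions costs about one latency each, while independent instructions of the same kind cost about one reciprocal throughput each. The function names and the example values are hypothetical and are not taken from any particular table row. Decoding, retirement and memory effects are ignored.

```python
from fractions import Fraction


def parse_cell(text):
    """Parse a single-valued table cell such as '1/3', '0.5', '2,5' or '3'.

    Ranges ('15-30') and approximate values ('~50') are not handled here.
    """
    text = text.strip().replace(",", ".")  # some cells use a decimal comma
    return Fraction(text)


def chain_cycles(n, latency):
    """Estimated clocks for n instructions that each depend on the previous one."""
    return n * latency


def independent_cycles(n, recip_throughput):
    """Estimated clocks for n independent instructions of the same kind."""
    return n * recip_throughput


if __name__ == "__main__":
    # Hypothetical instruction: latency 1, reciprocal throughput 1/3.
    latency = parse_cell("1")
    recip = parse_cell("1/3")
    print("dependent chain of 12:", chain_cycles(12, latency))
    print("12 independent:", independent_cycles(12, recip))
```

With a reciprocal throughput of 1/3, twelve independent instructions are estimated at four clocks, which corresponds to all three units of one kind being busy in every cycle.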
Instruction
Move instructions
MOV
MOV
MOV
MOV
MOV
MOV
MOV
MOV
MOV
MOV
MOV
MOV
MOV
MOVNTI
MOVZX, MOVSX
Operands
r,r
r,i
r8,m8
r16,m16
r32,m32
r64,m64
m8,r8H
m8,r8L
m16/32/64,r
m,i
m64,i32
r,sr
sr,r/m
m,r
r,r
Ops
1
1
1
1
1
1
1
1
1
1
1
1
6
1
1

throughput unit
1
1
4
4
3
3
8
3
3
3
3
3-4
8-26
1
1/3
1/3
1/2
1/2
1/2
1/2
1/2
1/2
1/2
1/2
1/2
1/2
8
1
1/3
ALU
ALU
ALU, AGU
ALU, AGU
AGU
AGU
AGU
AGU
AGU
AGU
AGU
Notes
Any addr. mode. Add 1 clock if code segment base ≠ 0
AH, BH, CH, DH
Any other 8-bit reg.
Any addressing mode
from AMD manual

AGU
ALU
MOVZX, MOVSX
MOVSXD
MOVSXD
CMOVcc
CMOVcc
XCHG
XCHG
XLAT
PUSH
PUSH
PUSH
PUSH
PUSHF(D/Q)
PUSHA(D)
POP
POP
POP
POP
POPF(D/Q)
POPA(D)
LEA
LEA
LEA
LAHF
SAHF
SALC
LDS, LES, ...
BSWAP
PREFETCHNTA
PREFETCHT0/1/2
PREFETCH(W)
SFENCE
LFENCE
MFENCE
IN
OUT
1
1
1
1
1
2
2
2
r
1
i
1
m
2
sr
2
9
9
r
1
m
3
6
DS/ES/FS/GS
SS
10
28
9
r16,[m]
2
r32/64,[m]
1
r32/64,[m]
1
4
1
1
r,m
10
r
1
m
1
m
1
m
1
6
1
4
r,i/DX
~270
i/DX,r
~300
ADD, SUB
ADD, SUB
ADD, SUB
ADC, SBB
ADC, SBB
ADC, SBB
CMP
CMP
INC, DEC, NEG
INC, DEC, NEG
AAA, AAS
DAA
DAS
AAD
AAM
r,m
r64,r32
r64,m32
r,r
r,m
r,r
r,m
r,r/i
r,m
m,r
r,r/i
r,m
m,r/i
r,r/i
r,m
r
m
1
1
1
1
1
1
1
1
1
1
9
12
16
4
30
4
1
4
1
4
1
21
5
6
3
10
26
16
6
3
1
2
3
1
1
1
1
4
1
4
1
1
7
5
6
7
5
13
1/2
1/3
1/2
1/3
1/2
1
19
5
1/2
1/2
1
1
3
6
1/2
1
8
16
11
6
1
1/3
1/3
2
1/3
1
10
1/3
1/2
1/2
1/2
8
1
33
ALU, AGU
ALU
ALU, AGU
ALU
ALU, AGU
ALU
ALU, AGU
ALU, AGU
ALU, AGU
ALU, AGU
ALU, AGU
ALU, AGU
ALU, AGU
ALU, AGU
ALU, AGU
ALU, AGU
ALU, AGU
ALU, AGU
ALU, AGU
ALU, AGU
ALU, AGU
ALU
AGU
ALU
ALU
ALU
1/3
1/2
1
1/3
1/2
1
1/3
1/2
1/3
2
5
6
7
5
13
ALU
ALU, AGU
ALU, AGU
ALU
ALU, AGU
ALU, AGU
ALU
ALU, AGU
ALU
ALU, AGU
ALU
ALU
ALU
ALU0
ALU
ALU
AGU
AGU
AGU
Timing depends on hw
Any address size

2 source operands
W. scale or 3 opr.
3DNow
MUL, IMUL
MUL, IMUL
MUL, IMUL
MUL, IMUL
IMUL
IMUL
IMUL
IMUL
IMUL
IMUL
IMUL
IMUL
IMUL
DIV
IDIV
IDIV
DIV
DIV
DIV
IDIV
IDIV
IDIV
CBW, CWDE, CDQE
CWD, CDQ, CQO
r8/m8
r16/m16
r32/m32
r64/m64
r16,r16/m16
r32,r32/m32
r64,r64/m64
r16,(r16),i
r32,(r32),i
r64,(r64),i
r16,m16,i
r32,m32,i
r64,m64,i
r8/m8
r8
m8
r16/m16
r32/m32
r64/m64
r16/m16
r32/m32
r64/m64
Logic instructions
AND, OR, XOR
AND, OR, XOR
AND, OR, XOR
TEST
TEST
NOT
NOT
SHL, SHR, SAR
ROL, ROR
RCL, RCR
RCL
RCR
RCL
RCR
SHL,SHR,SAR,ROL,ROR
RCL, RCR
RCL
RCR
RCL
RCR
SHLD, SHRD
SHLD, SHRD
SHLD, SHRD
BT
BT
BT
BTC, BTR, BTS
1
3
2
2
1
1
1
2
1
1
3
3
3
1
1
r,r
r,m
m,r
r,r
r,m
r
m
r,i/CL
r,i/CL
r,1
r,i
r,i
r,CL
r,CL
m,i /CL
m,1
m,i
m,i
m,CL
m,CL
r,r,i
r,r,cl
m,r,i/CL
r,r/i
m,i
m,r
r,r/i
1
1
1
1
1
1
1
1
1
1
9
7
9
7
1
1
10
9
9
8
6
7
8
1
1
5
2
3
3
3
4
3
3
4
4
3
4
17
19
22
15-30
15-46
15-78
24-39
24-55
24-87
1
1
1
4
1
1
7
1
1
1
3
3
4
3
7
7
7
7
8
7
3
3
7,5
1
7
2
1
2
1
2
1
1
2
1
1
2
2
2
2
17
19
22
15-30
15-46
15-78
24-39
24-55
24-87
1/3
1/3
ALU0
ALU0_1
ALU0_1
ALU0_1
ALU0
ALU0
ALU0_1
ALU0
ALU0
ALU0
ALU0
ALU0
ALU0_1
ALU
ALU
ALU
ALU
ALU
ALU
ALU
ALU
ALU
ALU
ALU
1/3
1/2
1
1/3
1/2
1/3
1
1/3
1/3
1
3
3
4
3
1
1
5
6
6
5
2
3
6
1/3
1/2
2
1/3
ALU
ALU, AGU
ALU, AGU
ALU
ALU, AGU
ALU
ALU, AGU
ALU
ALU
ALU
ALU
ALU
ALU
ALU
ALU, AGU
ALU, AGU
ALU, AGU
ALU, AGU
ALU, AGU
ALU, AGU
ALU
ALU
ALU, AGU
ALU
ALU, AGU
ALU, AGU
ALU
latency ax=3, dx=4

Depends on number of significant bits in absolute value of dividend. See AMD software optimization guide.
BTC
BTR, BTS
BTC
BTR, BTS
BSF
BSR
BSF
BSR
POPCNT
LZCNT
SETcc
SETcc
CLC, STC
CMC
CLD
STD
m,i
m,i
m,r
m,r
r,r
r,r
r,m
r,m
r,r/m
r,r/m
r
m

JMP
short/near
JMP
far
JMP
r
JMP
m(near)
JMP
m(far)
Jcc
short/near
J(E/R)CXZ
short
LOOP
short
CALL
near
CALL
far
CALL
r
CALL
m(near)
CALL
m(far)
RETN
RETN
i
RETF
RETF
i
IRET
INT
i
BOUND
m
INTO
String instructions
LODS
REP LODS
STOS
REP STOS
MOVS
REP MOVS
SCAS
REP SCAS
CMPS
REP CMPS
5
4
8
8
6
7
7
8
1
1
1
1
1
1
1
2
1
16-20
1
1
17-21
1
2
7
3
16-22
4
5
16-22
2
2
15-23
15-24
32
33
6
2
4
5
4
2
7
3
5
5
7
3
9
9
8
8
4
4
7
7
2
2
1
1.5
1.5
10
7
3
3
3
3
1
1
1/3
1/2
1/3
1/3
1/3
2/3
ALU, AGU
ALU, AGU
ALU, AGU
ALU, AGU
ALU
ALU
ALU, AGU
ALU, AGU
ALU
ALU
ALU
ALU, AGU
ALU
ALU
ALU
ALU
ALU
2
2
ALU
ALU, AGU
1/3 - 2
2/3 - 2
3
2
ALU
ALU
ALU
ALU
3
3
ALU
ALU, AGU
23-32
low values = real mode
25-33
2
23-32
3
3
24-33
3
3
24-35
24-35
81
42
SSE4.A / SSE4.2
SSE4.A, AMD only
3
3
ALU
ALU
2
2
2
2
2
1
3
1
2
2
3
1

2
2
2
1
3
1
2
2
3
1
real mode
real mode

Other
NOP (90)
Long NOP (0F 1F)
ENTER
LEAVE
CLI
STI
CPUID
RDTSC
RDPMC
1
0
1
0
i,0
12
2
8-9
16-17
22-50 47-164
30
13
1/3
1/3
ALU
ALU
12
3
5
27
3 ops, 5 clk if 16 bit
67
5

Instruction
Operands
Ops
r
m32/64
m80
m80
r
m32/64
m80
m80
r
m
m
1
1
7
20
1
1
10
218
1
1
1
1
FCMOVcc
FFREE
FINCSTP, FDECSTP
st0,r
r
9
1
1
FNSTSW
FSTSW
FNSTSW
FNSTCW
FLDCW
AX
AX
m16
m16
m16
2
3
2
3
12
r/m
m
r/m
m
r/m
m
1
2
1
2
1
2
1
1
1
1
2
1
Move instructions
FLD
FLD
FLD
FBLD
FST(P)
FST(P)
FSTP
FBSTP
FXCH
FILD
FIST(P)
FLDZ, FLD1

throughput unit
2
4
13
94
2
2
8
167
0
6
4
1/2
1/2
4
30
1/2
1
7
163
1/3
1
1
1
Notes
FA/M
FANY
FA/M
FMISC
FMISC
FMISC
FMISC
1/3
1/3
FMISC, FA/M
Low latency immediately after FCOMI

FANY
FANY
16
14
9
2
14
FMISC, ALU
Low latency immediately after FCOM FTST
FMISC, ALU
do.
FMISC, ALU
do.
FMISC, ALU
FMISC, ALU
faster if unchanged
FADD(P),FSUB(R)(P)
FIADD,FISUB(R)
FMUL(P)
FIMUL
FDIV(R)(P)
FIDIV(R)
FABS, FCHS
FCOM(P), FUCOM(P)
FCOMPP, FUCOMPP
FCOMI(P)
FICOM(P)
FTST
r/m
r
m
4
4
?
31
2
1
4
1
4
24
24
2
1
1
1
1
1
FADD
FADD,FMISC
FMUL
FMUL,FMISC
FMUL
FMUL,FMISC
FMUL
FADD
FADD
FADD
FADD, FMISC
FADD
FXAM
FRNDINT
FPREM
FPREM1
2
6
1
1
Math
FSQRT
FLDPI, etc.
FSIN
FCOS
FSINCOS
FPTAN
FPATAN
FSCALE
FXTRACT
F2XM1
FYL2X
FYL2XP1
1
1
45
51
76
45
9
5
11
8
8
12
Other
FNOP
(F)WAIT
FNCLEX
FNINIT
FNSAVE
FRSTOR
FXSAVE
FXRSTOR
1
1
8
26
77
70
61
85
m
m
m
m
35
~51?
~90?
~125?
~119
151?
9
9
65
13
114
0
0
162
133
63
89
1
37
7
7
FMISC, ALU
35
1
FMUL
FMISC
FMUL
FMUL
45?
29
41
30?
30?
44?
1/3
1/3
28
103
149
149
58
79
FANY
ALU
FMISC
FMISC

Instruction
Operands
Ops

throughput unit
Move instructions
MOVD
MOVD
MOVD
MOVD
MOVD
MOVD
MOVD
r32, mm
mm, r32
mm,m32
r32, xmm
xmm, r32
xmm,m32
m32,mm/x
1
2
1
1
2
1
1
3
6
4
3
6
2
2
1
3
1/2
1
3
1/2
1
MOVD (MOVQ)
MOVD (MOVQ)
MOVD (MOVQ)
MOVQ
MOVQ
MOVQ
MOVQ
MOVQ
MOVDQA
MOVDQA
r64,(x)mm
mm,r64
xmm,r64
mm,mm
xmm,xmm
mm,m64
xmm,m64
m64,(x)mm
xmm,xmm
xmm,m
1
2
2
1
1
1
1
1
1
1
3
6
6
2
2.5
4
2
2
2.5
2
1
3
3
1/2
1/3
1/2
1/2
1
1/3
1/2
Notes
FADD
FANY
FADD
FMISC
Moves 64 bits. Name of instruction differs
do.
FMUL, ALU
do.
FA/M
FANY
FANY
?
FMISC
FANY
?
FADD
MOVDQA
MOVDQU
MOVDQU
MOVDQ2Q
MOVQ2DQ
MOVNTQ
MOVNTDQ
PACKSSWB/DW
PACKUSWB
PACKSSWB/DW
PACKUSWB
PUNPCKH/LBW/WD/DQ
PUNPCKH/LBW/WD/DQ
PUNPCKHQDQ
PUNPCKLQDQ
PSHUFD
PSHUFW
PSHUFL/HW
MASKMOVQ
MASKMOVDQU
PMOVMSKB
PEXTRW
PINSRW
INSERTQ
INSERTQ
EXTRQ
EXTRQ
m,xmm
xmm,m
m,xmm
mm,xmm
xmm,mm
m,mm
m,xmm
2
1
3
1
1
1
2
2
2
3
2
2
1
1/2
2
1/3
1/3
1
1
FMUL,FMISC
mm,r/m
1/2
FA/M
xmm,r/m
1/2
FA/M
mm,r/m
1/2
FA/M
xmm,r/m
xmm,r/m
xmm,r/m
xmm,xmm,i
mm,mm,i
xmm,xmm,i
mm,mm
xmm,xmm
r32,mm/xmm
r32,(x)mm,i
(x)mm,r32,i
xmm,xmm
xmm,xmm,i,i
xmm,xmm
xmm,xmm,i,i
1
1
1
1
1
1
32
64
1
2
2
3
3
1
1
3
3
3
3
2
2
FA/M
FA/M
FA/M
FA/M
FA/M
FA/M
3
6
9
6
6
2
2
1/2
1/2
1/2
1/2
1/2
1/2
13
24
1
1
3
2
2
1/2
1/2
1
1
2
2
1/2
1/2
FA/M
FA/M
1
1
1
1
1
3
3
2
2
3
1
1
1/2
1/2
1
FMUL
FMUL
FA/M
FA/M
FADD
mm/xmm,r/m
1/2
FA/M
mm,i/mm/m
1/2
FA/M
x,i/(x)mm
1/2
FA/M
PADDB/W/D/Q
PADDSB/W
PADDUSB/W
PSUBB/W/D/Q
PSUBSB/W
PSUBUSB/W
mm/xmm,r/m
PCMPEQ/GT B/W/D mm/xmm,r/m
PMULLW PMULHW
PMULHUW
PMULUDQ
mm/xmm,r/m
PMADDWD
mm/xmm,r/m
PAVGB/W
mm/xmm,r/m
PMIN/MAX SW/UB
mm/xmm,r/m
PSADBW
mm/xmm,r/m
Logic
PAND PANDN POR
PXOR
PSLL/RL W/D/Q
PSRAW/D
PSLL/RL W/D/Q
PSRAW/D
FANY
FANY
FMISC
FMUL,FMISC
FADD
FA/M
FA/M
FA/M
FA/M
FA/M
SSE4.A, AMD only

SSE4.A, AMD only
SSE4.A, AMD only
SSE4.A, AMD only
PSLLDQ, PSRLDQ
xmm,i
Other
EMMS
1/2
FA/M
1/3
FANY

Instruction
Move instructions
MOVAPS/D
MOVAPS/D
MOVAPS/D
MOVUPS/D
MOVUPS/D
MOVUPS/D
MOVSS/D
MOVSS/D
MOVSS/D
MOVHLPS, MOVLHPS
MOVHPS/D, MOVLPS/D
MOVHPS/D, MOVLPS/D
MOVNTPS/D
MOVNTSS/D
MOVMSKPS/D
SHUFPS/D
UNPCK H/L PS/D
Conversion
CVTPS2PD
CVTPD2PS
CVTSD2SS
CVTSS2SD
CVTDQ2PS
CVTDQ2PD
CVT(T)PS2DQ
CVT(T)PD2DQ
CVTPI2PS
CVTPI2PD
CVT(T)PS2PI
CVT(T)PD2PI
CVTSI2SS
CVTSI2SD
CVT(T)SD2SI
CVT(T)SS2SI
Arithmetic
ADDSS/D SUBSS/D
Operands
Ops

throughput unit
r,r
r,m
m,r
r,r
r,m
m,r
r,r
r,m
m,r
1
1
2
1
1
3
1
1
1
2.5
2
2
2.5
2
3
2
2
2
1/2
1/2
1
1/2
1/2
2
1/2
1/2
1
FANY
?
FMUL,FMISC
FANY
?
FMISC
FA/M
?
FMISC
r,r
1/2
FA/M
r,m
1/2
FA/M
m,r
m,r
m,r
r32,r
r,r/m,i
r,r/m
1
2
1
1
1
1
3
3
3
1
3
1
1
1/2
1/2
FMISC
FMUL,FMISC
FMISC
FADD
FA/M
FA/M
r,r/m
r,r/m
r,r/m
r,r/m
r,r/m
r,r/m
r,r/m
r,r/m
xmm,mm
xmm,mm
mm,xmm
mm,xmm
xmm,r32
xmm,r32
r32,xmm
r32,xmm
1
2
3
3
1
1
1
2
2
1
1
2
3
3
2
2
2
7
8
7
4
4
4
7
7
4
4
7
14
14
8
8
1
1
2
2
1
1
1
1
1
1
1
1
3
3
1
1
FMISC
FADD,FMISC
FADD,FMISC
r,r/m
FADD
FMISC
FMISC
FMISC
FMISC
FMISC
Notes
SSE4.A, AMD only
ADDPS/D SUBPS/D
MULSS/D
MULPS/D
DIVSS
DIVPS
DIVSD
DIVPD
RCPSS RCPPS
MAXSS/D MINSS/D
MAXPS/D MINPS/D
CMPccSS/D
CMPccPS/D
COMISS/D
UCOMISS/D
r,r/m
r,r/m
r,r/m
r,r/m
r,r/m
r,r/m
r,r/m
r,r/m
r,r/m
r,r/m
r,r/m
r,r/m
1
1
1
1
1
1
1
1
1
1
1
1
r,r/m
Logic
ANDPS/D ANDNPS/D
ORPS/D XORPS/D
r,r/m
Math
SQRTSS
SQRTPS
SQRTSD
SQRTPD
RSQRTSS
RSQRTPS
r,r/m
r,r/m
r,r/m
r,r/m
r,r/m
r,r/m
Other
LDMXCSR
STMXCSR
m
m
4
4
4
16
18
20
20
3
2
2
2
2
1
1
1
13
15
17
17
1
1
1
1
1
FADD
FMUL
FMUL
FMUL
FMUL
FMUL
FMUL
FMUL
FADD
FADD
FADD
FADD
FADD
1/2
FA/M
1
1
1
1
1
1
19
21
27
27
3
3
16
18
24
24
1
1
FMUL
FMUL
FMUL
FMUL
FMUL
FMUL
12
3
12
12
10
11
Obsolete 3DNow instructions

Instruction
Operands
Ops

throughput unit
Move and convert instructions

PF2ID
mm,mm
PI2FD
mm,mm
PF2IW
mm,mm
PI2FW
mm,mm
PSWAPD
mm,mm
1
1
1
1
1
5
5
5
5
2
1
1
1
1
1/2
FMISC
FMISC
FMISC
FMISC
FA/M
PAVGUSB
PMULHRW
mm,mm
mm,mm
1
1
2
3
1/2
1
FA/M
FMUL
Floating point instructions

PFADD/SUB/SUBR
mm,mm
PFCMPEQ/GE/GT
mm,mm
PFMAX/MIN
mm,mm
PFMUL
mm,mm
1
1
1
1
4
2
2
4
1
1
1
1
FADD
FADD
FADD
FMUL
Notes
3DNow extension
3DNow extension
3DNow extension
PFACC
PFNACC, PFPNACC
PFRCP
PFRCPIT1/2
PFRSQRT
PFRSQIT1
mm,mm
mm,mm
mm,mm
mm,mm
mm,mm
mm,mm
1
1
1
1
1
1
Other
FEMMS
mm,mm
4
4
3
4
3
4
1
1
1
1
1
1
FADD
FADD
FMUL
FMUL
FMUL
FMUL
1/3
FANY
Thank you to Xucheng Tang for doing the measurements on the K10.
3DNow extension
Bulldozer
AMD Bulldozer
Instruction:
Operands:
Ops:
Latency:
etc.
mmx register, x = 128 bit xmm register, y = 256 bit ymm register, m = any memory
operand including indirect operands, m64 means 64-bit memory operand, etc.
delays. The latency listed does not include the memory operand where the listing
for register and memory operand are joined (r/m).
cycles from the execution of an instruction begins to a subsequent independent instruction of the same kind can begin to execute. A value of 1/3 indicates that the
Execution pipe:
Indicates which execution pipe or unit is used for the macro-operations:

Integer pipes:
EX0: integer ALU, division
EX1: integer ALU, multiplication, jump
EX01: can use either EX0 or EX1
AG01: address generation unit 0 or 1
Floating point and vector pipes:
P0: floating point add, mul, div, convert, shuffle, shift
P1: floating point add, mul, div, shuffle, shift
P2: move, integer add, boolean
P3: move, integer add, boolean, store
P01: can use either P0 or P1
Two macro-operations can execute simultaneously if they go to different execution pipes.
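The rule that two macro-operations can execute simultaneously only if they go to different pipes can be sketched as a small assignment check. The Python code below is an illustration only: it assumes every macro-operation occupies exactly one pipe for one cycle, the unit names AG0 and AG1 are hypothetical expansions of AG01, and P23 (which appears in the tables) is read as "either P2 or P3" by analogy with P01. Entries that list several pipes for one instruction are not modelled.

```python
# Pipes that each table abbreviation may use (partly assumed, see text above).
PIPE_GROUPS = {
    "EX0": {"EX0"}, "EX1": {"EX1"}, "EX01": {"EX0", "EX1"},
    "AG01": {"AG0", "AG1"},
    "P0": {"P0"}, "P1": {"P1"}, "P2": {"P2"}, "P3": {"P3"},
    "P01": {"P0", "P1"}, "P23": {"P2", "P3"},
}


def can_issue_together(ops):
    """Return True if every macro-operation in ops can get its own pipe.

    ops is a list of abbreviations from PIPE_GROUPS. The search is a simple
    augmenting-path bipartite matching between operations and pipes.
    """
    assigned = {}  # pipe name -> index of the operation currently using it

    def place(i, seen):
        for pipe in sorted(PIPE_GROUPS[ops[i]]):
            if pipe in seen:
                continue
            seen.add(pipe)
            # Take a free pipe, or move the current holder to another pipe.
            if pipe not in assigned or place(assigned[pipe], seen):
                assigned[pipe] = i
                return True
        return False

    return all(place(i, set()) for i in range(len(ops)))


if __name__ == "__main__":
    print(can_issue_together(["P01", "P0"]))         # P01 can move to P1
    print(can_issue_together(["P0", "P0"]))          # both need P0
    print(can_issue_together(["P01", "P01", "P1"]))  # three ops, two pipes
```

The matching step matters for cases such as P01 together with P0: a greedy assignment that gives P0 to the first operation would wrongly report a conflict.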
Domain:
Tells which execution unit domain is used:

ivec: integer vector execution unit.
fp: floating point execution unit.
fma: floating point multiply/add subunit.
inherit: the output operand inherits the domain of the input operand.
ivec/fma means the input goes to the ivec domain and the output comes from the fma domain.
There is an additional latency of 1 clock cycle if the output of an ivec instruction goes to the input of an fp or fma instruction, and when the output of an fp or fma instruction goes to the input of an ivec or store instruction. There is no latency between the fp and fma units. All other latencies after memory load and before memory store instructions are included in the latency counts.
An fma instruction has a latency of 5 if the output goes to another fma instruction, 6 if the output goes to an fp instruction, and 6+1 if the output goes to an ivec or store instruction.
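The domain rules above can be written out as a small lookup. The following Python sketch is only an illustration of those rules. The function names are hypothetical, "store" is treated as a consumer domain of its own, and the "inherit" domain is not modelled.

```python
FP_LIKE = {"fp", "fma"}


def bypass_delay(producer, consumer):
    """Extra clock cycles when a result crosses between execution domains."""
    if producer == "ivec" and consumer in FP_LIKE:
        return 1
    if producer in FP_LIKE and consumer in {"ivec", "store"}:
        return 1
    return 0  # same domain, or between fp and fma


def fma_latency(consumer):
    """Latency of an fma-domain instruction as seen by its consumer."""
    base = 5 if consumer == "fma" else 6
    return base + bypass_delay("fma", consumer)


if __name__ == "__main__":
    for consumer in ("fma", "fp", "ivec", "store"):
        print(consumer, fma_latency(consumer))
```

This reproduces the three cases in the text: 5 clocks into another fma instruction, 6 into an fp instruction, and 6+1 into an ivec or store instruction.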
Instruction
Move instructions
MOV
MOV
MOV
MOV
MOV
MOVNTI
MOVZX, MOVSX
MOVSX
MOVZX
MOVSXD
MOVSXD
CMOVcc
CMOVcc
XCHG
XCHG
XLAT
PUSH
PUSH
PUSH
PUSHF(D/Q)
PUSHA(D)
POP
POP
POPF(D/Q)
POPA(D)
LEA
LEA
LEA
LEA
LAHF
SAHF
SALC
BSWAP
PREFETCHNTA
PREFETCHT0/1/2
PREFETCH/W
SFENCE
LFENCE
MFENCE
ADD, SUB
ADD, SUB
Operands
Ops
r,r
r,i
r,m
m,r
m,i
m,r
r,r
r,m
r,m
r64,r32
r64,m32
r,r
r,m
r,r
1
1
1
1
1
1
1
1
1
1
1
1
1
2
1
1
4
4
r,m
2
2
1
1
2
8
9
1
2
34
14
2
2
~50
6
1
1
4
2
1
1
1
1
1
6
1
6
2
1
3
2
1
1
1
1
1
1
r
i
m
r
m
r16,[m]
r32,[m]
r32/64,[m]
r32/64,[m]
r
m
m
m
r,r
r,i
Latency Reciprocal
throughput
5
1
5
4
1
5
1
1
0.5
0.5
0.5
1
1
2
0.5
0.5
0.5
0.5
0.5
0.5
0.5
1
EX01
EX01
AG01
EX01 AG01
~50
2
1
1
1.5
4
9
1
1
19
8
EX01
2-3
2-3
Execution
pipes
Notes
all addr. modes

all addr. modes
EX01
EX01
EX01
EX01
EX01
EX01
EX01
EX01
EX01
EX01
0.5
0.5
2
1
1
0.5
0.5
0.5
0.5
89
0.25
89
EX01
EX01
0.5
0.5
EX01
EX01
Timing depends on hw
any addr. size
16 bit addr. size
scale factor > 1 or 3 operands
all other cases
EX01
AMD 3DNow
ADD, SUB
ADD, SUB
ADD, SUB
ADC, SBB
ADC, SBB
ADC, SBB
ADC, SBB
ADC, SBB
CMP
CMP
CMP
INC, DEC, NEG
INC, DEC, NEG
AAA, AAS
DAA
DAS
AAD
AAM
MUL, IMUL
MUL, IMUL
MUL, IMUL
MUL, IMUL
IMUL
IMUL
IMUL
IMUL
IMUL
IMUL
IMUL
IMUL
IMUL
DIV
DIV
DIV
DIV
IDIV
IDIV
IDIV
IDIV
CBW, CWDE, CDQE
CDQ, CQO
CWD
Logic instructions
AND, OR, XOR
AND, OR, XOR
AND, OR, XOR
AND, OR, XOR
AND, OR, XOR
TEST
r,m
m,r
m,i
r,r
r,i
r,m
m,r
m,i
r,r
r,i
r,m
r
m
r8/m8
r16/m16
r32/m32
r64/m64
r16,r16/m16
r32,r32/m32
r64,r64/m64
r16,(r16),i
r32,(r32),i
r64,(r64),i
r16,m16,i
r32,m32,i
r64,m64,i
r8/m8
r16/m16
r32/m32
r64/m64
r8/m8
r16/m16
r32/m32
r64/m64
r,r
r,i
r,m
m,r
m,i
r,r
1
1
1
1
1
1
1
1
1
1
1
1
1
10
16
20
4
9
1
2
1
1
1
1
1
2
1
1
2
2
2
14
18
16
16
33
36
36
36
1
1
2
1
1
1
1
1
1
7-8
7-8
1
1
1
9
9
1
1
1
7-8
6
9
10
6
20
4
4
4
6
4
4
6
5
4
6
20
15-27
16-43
16-75
23
23-33
22-48
22-79
1
1
1
1
1
7-8
7-8
1
0.5
1
1
1
1
1
0.5
0.5
0.5
0.5
1
20
2
2
2
4
2
2
4
2
2
4
2
2
4
20
15-28
16-43
16-75
20
20-27
20-43
20-75
EX01
EX01
EX01
EX01
EX01
EX01
EX01
EX01
EX01
EX01
EX01
EX01
EX01
0.5
1
EX1
EX1
EX1
EX1
EX1
EX1
EX1
EX1
EX1
EX1
EX1
EX1
EX1
EX0
EX0
EX0
EX0
EX0
EX0
EX0
EX0
EX01
EX01
EX01
0.5
0.5
0.5
1
1
0.5
EX01
EX01
EX01
EX01
EX01
EX01
TEST
TEST
TEST
NOT
NOT
SHL, SHR, SAR
ROL, ROR
RCL
RCL
RCL
RCR
RCR
RCR
SHLD, SHRD
SHLD, SHRD
SHLD, SHRD
BT
BT
BT
BTC, BTR, BTS
BTC, BTR, BTS
BTC, BTR, BTS
BSF
BSF
BSR
BSR
LZCNT
POPCNT
SETcc
SETcc
CLC, STC
CMC
CLD
STD
POPCNT
POPCNT
LZCNT
EXTRQ
EXTRQ
INSERTQ
INSERTQ
r,i
m,r
m,i
r
m
r,i/CL
r,i/CL
r,1
r,i
r,cl
r,1
r,i
r,cl
r,r,i
r,r,cl
m,r,i/CL
r,r/i
m,i
m,r
r,r/i
m,i
m,r
r,r
r,m
r,r
r,m
r,r
r,r/m
r
m
r16/32,r16/32
r64,r64
r,r
x,i,i
x,x
x,x,i,i
x,x

JMP
short/near
JMP
r
JMP
m
Jcc
short/near
fused CMP+Jcc
short/near
J(E/R)CXZ
short
LOOP
short
LOOPE LOOPNE
short
1
1
1
1
1
1
1
1
16
17
1
15
16
6
7
8
1
1
7
2
4
10
6
8
7
9
1
1
1
1
1
1
2
2
1
1
2
1
1
1
1
1
7
1
1
1
8
9
1
8
8
3
4
1
3
4
4
2
4
1
0.5
0.5
0.5
0.5
1
0.5
0.5
3
3.5
3.5
0.5
0.5
3.5
1
2
5
3
4
4
5
2
2
0.5
1
0.5
4
4
2
3
3
3
3
1
1
1
1
1
1
1
1
EX01
EX01
EX01
EX01
EX01
EX01
EX01
EX01
EX01
EX01
EX01
EX01
EX01
EX01
EX01
EX01
EX01
EX01
EX01
EX01
EX01
EX01
EX01
EX01
EX01
EX01
EX0
EX1
EX01
EX01
EX01
EX01
SSE4.A
SSE4.2
3
4
2
4
2
1
1
1
1
P1
P1
P1
P1
SSE4A
SSE4A
SSE4A
SSE4A
SSE4A
SSE4A
SSE4A
2
2
2
1-2
1-2
1-2
1-2
1-2
EX1
EX1
EX1
EX1
EX1
EX1
EX1
EX1
2 if jumping
2 if jumping
2 if jumping
2 if jumping
2 if jumping
CALL
CALL
CALL
RET
RET
BOUND
INTO
near
r
m
i
m
String instructions
LODS
REP LODS
STOS
REP STOS
REP STOS
MOVS
REP MOVS
REP MOVS
SCAS
REP SCAS
CMPS
REP CMPS
Synchronization
LOCK ADD
XADD
LOCK XADD
CMPXCHG
LOCK CMPXCHG
CMPXCHG
LOCK CMPXCHG
CMPXCHG8B
LOCK CMPXCHG8B
CMPXCHG16B
LOCK CMPXCHG16B
Other
NOP (90)
Long NOP (0F 1F)
PAUSE
ENTER
ENTER
LEAVE
CPUID
RDTSC
RDPMC
CRC32
CRC32
CRC32
XGETBV
m,r
m,r
m,r
m8,r8
m8,r8
m,r16/32/64
m,r16/32/64
m64
m64
m128
m128
a,0
a,b
r32,r8
r32,r16
r32,r32
2
2
3
1
4
11
4
2
2
2
2
2-3
5
24
3
6n
3
2n
3 per 16B
5
2n
4 per 16B
3
7n
6
9n
3
3n
3
2n
3 per 16B
3
2n
3 per 16B
3
4n
3
4n
1
4
4
5
5
6
6
18
18
22
22
1
1
40
13
11+5b
2
37-63
36
22
3
5
5
4
EX1
EX1
EX1
EX1
EX1
for no jump
for no jump
small n
best case
small n
best case
~55
10
~51
15
~51
14
~52
15
~53
52
~94
3
5
6
0.25
0.25
43
22
16+4b
4
112-280
42
300
2
5
6
31
none
none

Instruction
Move instructions
FLD
FLD
FLD
FBLD
FST(P)
FST(P)
FSTP
FBSTP
FXCH
FILD
FIST(P)
FLDZ, FLD1
FCMOVcc
FFREE
FINCSTP, FDECSTP
FNSTSW
FNSTSW
FLDCW
FNSTCW
FADD(P),FSUB(R)(P)
FIADD,FISUB(R)
FMUL(P)
FIMUL
FDIV(R)(P)
FDIV(R)
FIDIV(R)
FABS, FCHS
FCOM(P), FUCOM(P)
FCOMPP, FUCOMPP
FCOMI(P)
FICOM(P)
FTST
FXAM
FRNDINT
FPREM
FPREM1
Math
FSQRT
FLDPI, etc.
FSIN
FCOS
Operands
Ops
r
m32/64
m80
m80
r
m32/64
m80
m80
r
m
m
1
1
8
60
1
2
13
239
1
1
2
1
8
1
1
4
3
1
3
st0,r
r
AX
m16
m16
m16
r/m
m
r/m
m
r
m
m
r/m
r
m
1
2
1
2
1
2
2
1
1
1
2
2
1
1
1
1
1
1
1
10-162
160-170
Latency Reciprocal
throughput
2
8
14
61
2
8
9
240
0
12
8
3
0
~13
~13
5-6
5-6
10-42
~20
4
19-62
19-65
0.5
1
4
40
0.5
1
20
244
0.5
1
1
0.5
3
0.25
0.25
22
19
3
2
1
2
1
2
5-18
0.5
0.5
0.5
1
1
0.5
0.5
1
10-53
65-210
~160
0.5
65-210
~160
Execution
pipes
Domain, notes
P01
fp
fp
fp
fp
fp
fp
fp
fp
inherit
fp
fp
fp
fp
P0 P1 P2 P3
P01
P0 P1 F3
P01
F3
P0 F3
P01
P0 P1 F3
none
none
P0 P2 P3
P0 P2 P3
P01
P01
P01
P01
P01
P01
P01
P01
P01
P01
P0 P1 F3
P01
P01
P01
P0
P0
P0
P01
P01
P0 P1 P3
P0 P1 P3
inherit
fma
fma
fma
fma
fp
fp
fp
fp
fp
fp
fp
fp
fp
fp
fp
fp
fp
FSINCOS
FPTAN
FPATAN
FSCALE
FXTRACT
F2XM1
FYL2X
FYL2XP1
12-166
11-190
10-355
8
12
10
10-175
10-175
Other
FNOP
(F)WAIT
FNCLEX
FNINIT
FNSAVE
FRSTOR
1
1
18
31
103
76
m864
m864
95-160
95-245
60-440
52
10
64-71
300
312
95-160
95-245
60-440
5
0.25
0.25
57
170
300
312
P0 P1 P3
P0 P1 P3
P0 P1 P3
P0 P1 P3
P0 P1 P3
P0 P1 P3
P0 P1 P3
P0 P1 P3
none
none
P0
P0
P0 P1 P2 P3
P0 P3

Instruction
Operands
Move instructions
MOVD
r32/64, mm/x
MOVD
mm/x, r32/64
MOVD
mm/x,m32
MOVD
m32,mm/x
MOVQ
mm/x,mm/x
MOVQ
mm/x,m64
MOVQ
m64,mm/x
MOVDQA
xmm,xmm
MOVDQA
xmm,m
MOVDQA
m,xmm
VMOVDQA
ymm,ymm
VMOVDQA
ymm,m256
VMOVDQA
m256,ymm
MOVDQU
xmm,xmm
MOVDQU
xmm,m
MOVDQU
m,xmm
LDDQU
xmm,m
VMOVDQU
ymm,m256
VMOVDQU
m256,ymm
MOVDQ2Q
mm,xmm
MOVQ2DQ
xmm,mm
MOVNTQ
m,mm
MOVNTDQ
m,xmm
MOVNTDQA
xmm,m
PACKSSWB/DW
(x)mm,r/m
PACKUSWB
(x)mm,r/m
PUNPCKH/LBW/WD/DQ
(x)mm,r/m
Ops
Latency Reciprocal
throughput
1
2
1
1
1
1
1
1
1
1
2
2
4
1
1
1
1
2
8
1
1
1
1
1
1
1
8
10
6
5
2
6
5
0
6
5
2
6
5
0
6
5
6
6
6
2
2
6
6
6
2
2
1
1
0.5
1
0.5
0.5
1
0.25
0.5
1
0.5
1
3
0.25
0.5
1
0.5
1-2
10
0.5
0.5
2
2
0.5
1
1
Execution
pipes
Notes
P23
P3
none
inherit domain
P3
P23
P3
none
P3
P2 P3
P23
P23
P3
P3
P1
P1
P1
inherit domain
PUNPCKHQDQ
xmm,r/m
PUNPCKLQDQ
xmm,r/m
PSHUFB
(x)mm,r/m
PSHUFD
xmm,xmm,i
PSHUFW
mm,mm,i
PSHUFL/HW
xmm,xmm,i
PALIGNR
(x)mm,r/m,i
PBLENDW
xmm,r/m
MASKMOVQ
mm,mm
MASKMOVDQU
xmm,xmm
PMOVMSKB
r32,mm/x
PEXTRB/W/D/Q
r,x/mm,i
PINSRB/W/D/Q
x/mm,r,i
PMOVSXBW/BD/BQ/WD/WQ/DQ
xmm,xmm
PMOVZXBW/BD/BQ/WD/WQ/DQ
xmm,xmm
VPCMOV
x,x,x,x/m
VPCMOV
y,y,y,y/m
VPPERM
x,x,x,x/m
1
1
1
1
1
1
1
1
31
64
2
2
2
2
2
3
2
2
2
2
2
38
48
10
10
12
1
1
1
1
1
1
1
0.5
37
61
1
1
2
P1
P1
P1
P1
P1
P1
P1
P23
P3
P1 P3
P1 P3
P1 P3
P1
P1
SSE4.1
1
1
2
1
2
2
2
2
1
1
2
1
P1
P1
P1
P1
SSE4.1
AMD XOP
AMD XOP
AMD XOP
(x)mm,r/m
0.5
P23
(x)mm,r/m
x,x
x,m
(x)mm,r/m
(x)mm,r/m
(x)mm,r/m
1
3
4
1
1
1
2
5
5
2
2
2
0.5
2
2
0.5
0.5
0.5
P23
P1 P23
P1 P23
P23
P23
P23
(x)mm,r/m
xmm,r/m
xmm,r/m
(x)mm,r/m
(x)mm,r/m
(x)mm,r/m
(x)mm,r/m
1
1
1
1
1
1
1
4
5
4
4
4
4
2
1
2
1
1
1
1
0.5
P0
P0
P0
P0
P0
P0
P23
(x)mm,r/m
xmm,r/m
(x)mm,r/m
(x)mm,r/m
(x)mm,r/m
x,x,i
1
2
1
1
2
8
2
4
2
2
4
8
0.5
1
0.5
0.5
1
4
P23
P1 P23
P23
P23
P23
P1 P23
VPCOMB/W/D/Q
x,x,x/m,i
0.5
P23
VPCOMUB/W/D/Q
x,x,x/m,i
0.5
P23
PADDB/W/D/Q/SB/SW/USB/USW
PSUBB/W/D/Q/SB/SW/USB/USW
PHADD/SUB(S)W/D
PHADD/SUB(S)W/D
PCMPEQ/GT B/W/D
PCMPEQQ
PCMPGTQ
PMULLW PMULHW
PMULHUW PMULUDQ
PMULLD
PMULDQ
PMULHRSW
PMADDWD
PMADDUBSW
PAVGB/W
PMIN/MAX SB/SW/SD/UB/UW/UD
PHMINPOSUW
PABSB/W/D
PSIGNB/W/D
PSADBW
MPSADBW
SSE4.1
AVX
SSSE3
SSSE3
SSE4.1
SSE4.2
SSE4.1
SSE4.1
SSSE3
SSE4.1
SSSE3
SSSE3
SSE4.1
AMD XOP
latency 0 if i=6,7
AMD XOP
latency 0 if i=6,7
VPHADDBW/BD/BQ/WD/WQ/DQ
VPHADDUBW/BD/BQ/WD/WQ/DQ
VPHSUBBW/WD/DQ
VPMACSWW/WD
VPMACSDD
VPMACSDQH/L
VPMACSSWW/WD
VPMACSSDD
VPMACSSDQH/L
VPMADCSWD
VPMADCSSWD
Logic
PAND PANDN POR
PXOR
PSLL/RL W/D/Q
PSRAW/D
PSLL/RL W/D/Q
PSRAW/D
PSLLDQ, PSRLDQ
PTEST
VPROTB/W/D/Q
VPROTB/W/D/Q
VPSHAB/W/D/Q
VPSHLB/W/D/Q
String instructions
PCMPESTRI
PCMPESTRM
PCMPISTRI
PCMPISTRM
Encryption
PCLMULQDQ
AESDEC
AESDECLAST
AESENC
AESENCLAST
AESIMC
AESKEYGENASSIST
x,x/m
0.5
P23
AMD XOP
x,x/m
x,x/m
x,x,x/m,x
x,x,x/m,x
x,x,x/m,x
x,x,x/m,x
x,x,x/m,x
x,x,x/m,x
x,x,x/m,x
x,x,x/m,x
1
1
1
1
1
1
1
1
1
1
2
2
4
5
4
4
5
4
4
4
0.5
0.5
1
2
1
1
2
1
1
1
P23
P23
P0
P0
P0
P0
P0
P0
P0
P0
AMD XOP
AMD XOP
AMD XOP
AMD XOP
AMD XOP
AMD XOP
AMD XOP
AMD XOP
AMD XOP
AMD XOP
(x)mm,r/m
0.5
P23
(x)mm,r/m
P1
(x)mm,i
xmm,i
xmm,r/m
x,x,x/m
x,x,i
x,x,x/m
x,x,x/m
1
1
2
1
1
1
1
2
2
3
2
3
3
1
1
1
1
1
1
1
P1
P1
P1 P3
P1
P1
P1
P1
SSE4.1
AMD XOP
AMD XOP
AMD XOP
AMD XOP
x,x,i
x,x,i
x,x,i
x,x,i
27
27
7
7
17
10
14
7
10
10
3
4
P1 P2 P3
P1 P2 P3
P1 P2 P3
P1 P2 P3
SSE4.2
SSE4.2
SSE4.2
SSE4.2
x,x/m,i
x,x
x,x
x,x
x,x
x,x
x,x,i
5
2
2
2
2
1
1
12
5
5
5
5
5
5
7
2
2
2
2
1
1
P1
P01
P01
P01
P01
P0
P0
pclmul
aes
aes
aes
aes
aes
aes
Execution
pipes
Domain, notes
Other
EMMS
0.25
Floating point XMM and YMM instructions

Instruction
Operands
Ops
Latency Reciprocal
throughput
Move instructions
MOVAPS/D
MOVUPS/D
VMOVAPS/D
MOVAPS/D
MOVUPS/D
VMOVAPS/D
VMOVUPS/D
MOVAPS/D
MOVUPS/D
VMOVAPS/D
VMOVUPS/D
MOVSS/D
MOVSS/D
MOVSS/D
MOVHPS/D
MOVLPS/D
MOVHPS/D
MOVLPS/D
MOVLHPS MOVHLPS
MOVMSKPS/D
VMOVMSKPS/D
MOVNTPS/D
VMOVNTPS/D
MOVNTSS/SD
SHUFPS/D
VSHUFPS/D
VPERMILPS/PD
VPERMILPS/PD
VPERMILPS/PD
VPERMILPS/PD
VPERM2F128
VPERM2F128
BLENDPS/PD
VBLENDPS/PD
BLENDVPS/PD
VBLENDVPS/PD
MOVDDUP
MOVDDUP
VMOVDDUP
VMOVDDUP
VBROADCASTSS
VBROADCASTSS
VBROADCASTSD
VBROADCASTF128
MOVSH/LDUP
MOVSH/LDUP
VMOVSH/LDUP
VMOVSH/LDUP
UNPCKH/LPS/D
VUNPCKH/LPS/D
EXTRACTPS
x,x
y,y
1
2
0
2
0.25
0.5
x,m128
0.5
y,m256
1-2
m128,x
m256,y
m256,y
x,x
x,m32/64
m32/64,x
1
4
8
1
1
1
5
5
6
2
6
5
1
3
10
0.5
0.5
1
x,m64
m64,x
m64,x
x,x
r32,x
r32,y
m128,x
m256,y
m,x
x,x/m,i
y,y,y/m,i
x,x,x/m
y,y,y/m
x,x/m,i
y,y/m,i
y,y,y,i
y,y,m,i
x,x/m,i
y,y,y/m,i
x,x/m,xmm0
y,y,y/m,y
x,x
x,m64
y,y
y,m256
x,m32
y,m32
y,m64
y,m128
x,x
x,m128
y,y
y,m256
x,x/m
y,y,y/m
r32,x,i
1
2
1
1
2
7
8
7
2
10
1
1
1
1
1
P1 P3
P3
P1
P1 P3
P3
4
1
2
1
2
1
2
3
4
0.5
1
1
2
1
0.5
2
1
0.5
0.5
0.5
0.5
1
0.5
2
1
1
2
1
P3
P1
P1
P1
P1
P1
P1
P23
P23
P23
P23
P1
P1
P1
SSE4A
ivec
ivec
ivec
ivec
ivec
ivec
ivec
ivec
ivec
ivec
ivec
ivec
ivec
P1
ivec
P23
P23
P23
P1
ivec
P1
ivec
P1
P1
P1 P3
ivec
ivec
1
1
2
1
2
1
2
8
10
1
2
1
2
1
1
2
2
1
2
2
2
1
1
2
2
1
2
2
2
2
3
3
2
2
4
2
2
2
2
2
2
6
6
6
6
2
2
2
2
10
none
P23
inherit domain
ivec
P3
P3
P2 P3
P01
fp
ivec
EXTRACTPS
VEXTRACTF128
VEXTRACTF128
INSERTPS
INSERTPS
VINSERTF128
VINSERTF128
VMASKMOVPS/D
VMASKMOVPS/D
VMASKMOVPS/D
VMASKMOVPS/D
m32,x,i
x,y,i
m128,y,i
x,x,i
x,m32,i
y,y,x,i
y,y,m128,i
x,x,m128
y,y,m256
m128,x,x
m256,y,y
2
1
2
1
1
2
2
1
2
18
34
14
2
7
2
2
9
9
9
22
25
1
1
1
1
1
1
1
0.5
1
7
13
P1 P3
P23
P23
P1
P1
P23
P23
P01
P01
P0 P1 P2 P3
P0 P1 P2 P3
Conversion
CVTPD2PS
VCVTPD2PS
CVTPS2PD
VCVTPS2PD
CVTSD2SS
CVTSS2SD
CVTDQ2PS
VCVTDQ2PS
CVT(T)PS2DQ
VCVT(T)PS2DQ
CVTDQ2PD
VCVTDQ2PD
CVT(T)PD2DQ
VCVT(T)PD2DQ
CVTPI2PS
CVT(T)PS2PI
CVTPI2PD
CVT(T)PD2PI
CVTSI2SS
CVT(T)SS2SI
CVTSI2SD
CVT(T)SD2SI
x,x
x,y
x,x
y,x
x,x
x,x
x,x
y,y
x,x
y,y
x,x
y,x
x,x
x,y
x,mm
mm,x
x,mm
mm,x
x,r32
r32,x
x,r32/64
r32/64,x
2
4
2
4
1
1
1
2
1
2
2
4
2
4
1
1
2
2
2
2
2
2
7
7
7
7
4
4
4
4
4
4
7
8
7
7
4
4
7
7
14
13
14
13
1
2
1
2
1
1
1
2
1
2
1
2
1
2
1
1
1
1
1
1
1
1
P01
P01
P01
P01
P0
P0
P0
P0
P0
P0
P01
P01
P01
P01
P0
P0
P0 P1
P0 P1
P0
P0
P0
P0
fp
fp
fp
fp
fp
fp
fp
fp
fp
fp
fp
fp
fp
fp
fp
fp
fp
fp
fp
fp
fp
fp
x,x/m
x,x/m
1
1
5-6
5-6
0.5
0.5
P01
P01
fma
fma
VADDPS/D VSUBPS/D
ADDSUBPS/D
VADDSUBPS/D
y,y,y/m
x,x/m
y,y,y/m
2
1
2
5-6
5-6
5-6
1
0.5
1
P01
P01
P01
fma
fma
fma
HADDPS/D HSUBPS/D
x,x
10
P01 P1
ivec/fma
HADDPS/D HSUBPS/D
VHADDPS/D
VHSUBPS/D
VHADDPS/D
VHSUBPS/D
x,m128
P01 P1
ivec/fma
y,y,y
P01 P1
ivec/fma
y,y,m
10
P01 P1
ivec/fma
Arithmetic
ADDSS/D SUBSS/D
ADDPS/D SUBPS/D
10
ivec
ivec
MULSS MULSD
MULPS MULPD
VMULPS VMULPD
DIVSS DIVPS
VDIVPS
DIVSD DIVPD
VDIVPD
RCPSS/PS
VRCPPS
CMPSS/D
CMPPS/D
VCMPPS/D
COMISS/D
UCOMISS/D
MAXSS/SD/PS/PD
MINSS/SD/PS/PD
x,x/m
x,x/m
y,y,y/m
x,x/m
y,y,y/m
x,x/m
y,y,y/m
x,x/m
y,y/m
1
1
2
1
2
1
2
1
2
5-6
5-6
5-6
9-24
9-24
9-27
9-27
5
5
0.5
0.5
1
4.5-9.5
9-19
4.5-11
9-22
1
2
P01
P01
P01
P01
P01
P01
P01
P01
P01
fma
fma
fma
fp
fp
fp
fp
fp
fp
x,x/m
y,y,y/m
1
2
2
2
0.5
1
P01
P01
fp
fp
x,x/m
P01 P3
fp
x,x/m
0.5
P01
fp
VMAXPS/D VMINPS/D
y,y,y/m
2
ROUNDSS/SD/PS/PD
x,x/m,i
1
VROUNDSS/SD/PS/PD
y,y/m,i
2
DPPS
x,x,i
16
DPPS
x,m128,i
18
VDPPS
y,y,y,i
25
VDPPS
y,m256,i
29
DPPD
x,x,i
15
DPPD
x,m128,i
17
VFMADDSS/SD
x,x,x,x/m
1
VFMADDPS/PD
x,x,x,x/m
1
VFMADDPS/PD
y,y,y,y/m
2
All other FMA4 instructions: same as above
2
4
1
1
P01
P0
fp
fp
4
25
5-6
5-6
5-6
2
6
7
13
13
5
6
0.5
0.5
1
P0
P01 P23
P01 P23
P01 P3
P01 P3
P01 P23
P01 P23
P01
P01
P01
fp
fma
fma
fma
fma
fma
fma
AMD FMA4
AMD FMA4
AMD FMA4
AMD FMA4
Math
SQRTSS/PS
VSQRTPS
SQRTSD/PD
VSQRTPD
RSQRTSS/PS
VRSQRTPS
VFRCZSS/SD/PS/PD
VFRCZSS/SD/PS/PD
27
15
x,x/m
y,y/m
x,x/m
y,y/m
x,x/m
y,y/m
x,x
x,m
1
2
1
2
1
2
2
3
14-15
14-15
24-26
24-26
5
5
10
10
4.5-12
9-24
4.5-16.5
9-33
1
2
2
2
P01
P01
P01
P01
P01
P01
P01
P01
fp
fp
fp
fp
fp
fp
AMD XOP
AMD XOP
AND/ANDN/OR/XORPS/PD
x,x/m
0.5
P23
ivec
VAND/ANDN/OR/XORPS/PD
y,y,y/m
P23
ivec
Logic
Other
VZEROUPPER
VZEROUPPER
9
16
4
5
32 bit mode
64 bit mode
VZEROALL
VZEROALL
LDMXCSR
STMXCSR
FXSAVE
FXRSTOR
XSAVE
XRSTOR
m32
m32
m4096
m4096
m
m
17
32
1
2
67
116
122
177
10
19
136
176
196
250
6
10
4
19
136
176
196
250
P2 P3
P2 P3
P0 P3
P0 P3
P0 P1 P2 P3
P0 P1 P2 P3
P0 P1 P2 P3
P0 P1 P2 P3
32 bit mode
64 bit mode
Piledriver
AMD Piledriver
Instruction:
Operands:
Ops:
Latency:
etc.
delays. The latency listed does not include the memory operand where the listing
for register and memory operand are joined (r/m).
Execution pipe:

Integer pipes:
P0: floating point add, mul, div, convert, shuffle, shift
P1: floating point add, mul, div, shuffle, shift
P2: move, integer add, boolean
P3: move, integer add, boolean, store
execution pipes
Domain:

fma domain.
store instruction.
Instruction
Move instructions
MOV
MOV
MOV
MOV
MOV
MOV
MOV
MOV
MOVNTI
MOVZX, MOVSX
MOVZX, MOVSX
MOVZX, MOVSX
MOVSX
MOVZX
MOVSXD
MOVSXD
CMOVcc
CMOVcc
XCHG
XCHG
XCHG
XCHG
XCHG
XLAT
PUSH
PUSH
PUSH
PUSHF(D/Q)
PUSHA(D)
POP
POP
POPF(D/Q)
POPA(D)
LEA
LEA
LEA
LEA
LAHF
SAHF
SALC
BSWAP
PREFETCHNTA
PREFETCHT0/1/2
PREFETCH/W
SFENCE
Operands
Ops
r8,r8
r16,r16
r32,r32
r64,r64
r,i
r,m
m,r
m,i
m,r
r16,r8
r32,r
r64,r
r,m
r,m
r64,r32
r64,m32
r,r
r,m
r8,r8
r16,r16
r32,r32
r64,r64
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
2
2
2
2
1
1
1
1
1
4
4
r,m
2
2
1
1
2
8
9
1
2
34
14
2
2
~40
6
1
1
4
2
1
1
1
1
1
7
2
1
3
2
1
1
r
i
m
r
m
r16,[m]
r32,[m]
r32/64,[m]
r32/64,[m]
r
m
m
m
Latency
Reciprocal throughput
4
1
1
1
5
4
1
5
1
1
1
1
1
0.5
0.5
0.3
0.3
0.5
0.5
1
1
2
1
0.5
0.5
0.5
0.5
0.5
0.5
0.5
0.5
1
1
0.5
0.5
EX01
EX01
EX01 or AG01
EX01 or AG01
EX01
AG01
EX01 AG01
~40
2
1
1
1
4
9
1
1
18
8
EX01
2-3
2-3
Execution
pipes
all addr. modes

all addr. modes
EX01
EX01
EX01
EX01
EX01
EX01
EX01
EX01
EX01
EX01
EX01
EX01
EX01
EX01
EX01
0.5
0.5
2
1
1
0.5
0.5
0.5
0.5
81
Notes
EX01
EX01
Timing depends on
hw
any addr. size

16 bit addr. size
scale factor > 1
or 3 operands
all other cases
EX01
PREFETCHW
LFENCE
MFENCE
1
7
ADD, SUB
r,r
ADD, SUB
r,i
ADD, SUB
r,m
ADD, SUB
m,r
ADD, SUB
m,i
ADC, SBB
r,r
ADC, SBB
r,i
ADC, SBB
r,m
ADC, SBB
m,r
ADC, SBB
m,i
CMP
r,r
CMP
r,i
CMP
r,m
CMP
m,i
INC, DEC, NEG
r
INC, DEC, NEG
m
AAA, AAS
DAA
DAS
AAD
AAM
MUL, IMUL
r8/m8
MUL, IMUL
r16/m16
MUL, IMUL
r32/m32
MUL, IMUL
r64/m64
IMUL
r16,r16/m16
IMUL
r32,r32/m32
IMUL
r64,r64/m64
IMUL
r16,(r16),i
IMUL
r32,(r32),i
IMUL
r64,(r64),i
IMUL
r16,m16,i
IMUL
r32,m32,i
IMUL
r64,m64,i
DIV
r8/m8
DIV
r16/m16
DIV
r32/m32
DIV
r64/m64
IDIV
r8/m8
IDIV
r16/m16
IDIV
r32/m32
IDIV
r64/m64
CBW, CWDE, CDQE
CDQ, CQO
CWD
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
10
16
20
4
10
1
2
1
1
1
1
1
2
1
1
2
2
2
9
7
2
2
9
7
2
2
1
1
2
17-22
13-26
12-40
13-71
17-21
13-26
13-40
13-71
1
1
1
Logic instructions
AND, OR, XOR
AND, OR, XOR
1
1
1
1
r,r
r,i
0.25
81
1
1
7-8
7-8
1
1
1
9
9
1
1
1
7-8
6
9
10
6
15
4
4
4
6
4
4
6
5
4
6
0.5
0.5
0.5
1
1
1
1
1
0.5
0.5
0.5
0.5
0.5
1
15
2
2
2
4
2
2
4
2
2
4
2
2
4
13-22
13-25
12-40
13-71
13-18
13-25
13-40
13-71
EX01
EX01
EX01
EX01
EX01
EX01
EX01
EX01
EX01
EX01
EX01
EX01
EX01
EX01
EX01
EX01
0.5
1
EX1
EX1
EX1
EX1
EX1
EX1
EX1
EX1
EX1
EX1
EX1
EX1
EX1
EX0
EX0
EX0
EX0
EX0
EX0
EX0
EX0
EX01
EX01
EX01
0.5
0.5
EX01
EX01
AND, OR, XOR
AND, OR, XOR
AND, OR, XOR
TEST
TEST
TEST
TEST
NOT
NOT
ANDN
SHL, SHR, SAR
ROL, ROR
RCL
RCL
RCL
RCR
RCR
RCR
SHLD, SHRD
SHLD, SHRD
SHLD, SHRD
BT
BT
BT
BTC, BTR, BTS
BTC, BTR, BTS
BTC, BTR, BTS
BSF
BSF
BSR
BSR
SETcc
SETcc
CLC, STC
CMC
CLD
STD
POPCNT
POPCNT
LZCNT
TZCNT
BEXTR
BEXTR
BLSI
BLSMSK
BLSR
BLCFILL
BLCI
BLCIC
BLCMSK
BLCS
BLSFILL
r,m
m,r
m,i
r,r
r,i
m,r
m,i
r
m
r,r,r
r,i/CL
r,i/CL
r,1
r,i
r,cl
r,1
r,i
r,cl
r,r,i
r,r,cl
m,r,i/CL
r,r/i
m,i
m,r
r,r/i
m,i
m,r
r,r
r,m
r,r
r,m
r
m
r16/32,r16/32
r64,r64
r,r
r,r
r,r,r
r,r,i
r,r
r,r
r,r
r,r
r,r
r,r
r,r
r,r
r,r
1
1
1
1
1
1
1
1
1
1
1
1
1
16
17
1
15
16
6
7
8
1
1
7
2
4
10
6
8
7
9
1
1
1
1
2
2
1
1
1
2
2
2
2
2
2
2
2
2
2
2
2
7-8
7-8
1
1
1
7-8
1
1
1
1
7
7
1
7
6
3
3
1
2
20
21
3
4
4
1
0.5
1
1
0.5
0.5
0.5
0.5
0.5
1
0.5
0.5
0.5
3
3
3.5
0.5
0.5
3.5
1
3
4
4
5
0.5
1
0.5
4
4
2
2
2
2
2
2
2
2
2
2
2
2
2
3
4
2
4
2
2
0.67
0.67
1
1
1
1
1
1
1
1
1
EX01
EX01
EX01
EX01
EX01
EX01
EX01
EX01
EX01
EX01
EX01
EX01
EX01
EX01
EX01
EX01
EX01
EX01
EX01
EX01
EX01
EX01
EX01
EX01
EX01
EX01
EX01
EX01
EX01
EX01
EX01
EX01
EX01
EX01
EX01
EX0
BMI1
SSE4.2
SSE4.2
LZCNT
BMI1
BMI1
AMD TBM
BMI1
BMI1
BMI1
AMD TBM
AMD TBM
AMD TBM
AMD TBM
AMD TBM
AMD TBM
BLSI
BLSIC
T1MSKC
TZMSK
r,r
r,r
r,r
r,r

JMP
short/near
JMP
r
JMP
m
Jcc
short/near
fused CMP+Jcc
short/near
J(E/R)CXZ
short
LOOP
short
LOOPE LOOPNE
short
CALL
near
CALL
r
CALL
m
RET
RET
i
BOUND
m
INTO
String instructions
LODS
REP LODS
REP LODS
STOS
REP STOS
REP STOS
MOVS
REP MOVS
REP MOVS
SCAS
REP SCAS
CMPS
REP CMPS
Synchronization
LOCK ADD
XADD
LOCK XADD
CMPXCHG
LOCK CMPXCHG
CMPXCHG
LOCK CMPXCHG
CMPXCHG8B
LOCK CMPXCHG8B
CMPXCHG16B
LOCK CMPXCHG16B
Other
NOP (90)
m8/m16
m32/m64
m,r
m,r
m,r
m,r8/16
m,r8/16
m,r32/64
m,r32/64
m64
m64
m128
m128
2
2
2
2
2
2
2
2
1
1
1
1
1
1
1
1
1
1
1
1
2
2
3
1
4
11
4
2
2
2
1-2
1-2
1-2
1-2
1-2
2
2
2
2
2
5
2
3
6n
6n
3
1n
3 per 16B
5
1-3n
4.5 per 16B
3
7n
6
9n
3
3n
2.5n
3
1n
3 per 16B
3
1n
3 per 16B
3
3-4n
3
4n
1
4
4
5
5
6
6
18
18
22
22
AMD TBM
AMD TBM
AMD TBM
AMD TBM
EX1
EX1
EX1
EX1
EX1
EX1
EX1
EX1
EX1
EX1
EX1
EX1
EX1
for no jump
for no jump
small n
best case
small n
best case
~40
20
~39
23
~40
20
~40
25
~42
66
~80
0.25
2 if jumping
2 if jumping
2 if jumping
2 if jumping
2 if jumping
none
Long NOP (0F 1F)
PAUSE
ENTER
ENTER
LEAVE
CPUID
XGETBV
RDTSC
RDPMC
CRC32
CRC32
CRC32
a,0
a,b
r32,r8
r32,r16
r32,r32
1
40
13
20+3b
2
38-64
4
36
21
3
5
5
3
5
6
0.25
40
21
16+4b
4
105-271
30
42
310
2
5
6
none

Instruction
Move instructions
FLD
FLD
FLD
FBLD
FST(P)
FST(P)
FSTP
FBSTP
FXCH
FILD
FIST(T)(P)
FLDZ, FLD1
FCMOVcc
FFREE
FINCSTP, FDECSTP
FNSTSW
FNSTSW
FLDCW
FNSTCW
FADD(P),FSUB(R)(P)
FIADD,FISUB(R)
FMUL(P)
FIMUL
FDIV(R)(P)
FDIV(R)
FIDIV(R)
FABS, FCHS
FCOM(P), FUCOM(P)
FCOMPP, FUCOMPP
FCOMI(P)
FICOM(P)
Operands
Ops
r
m32/64
m80
m80
r
m32/64
m80
m80
r
m
m
1
1
8
60
1
2
13
239
1
1
2
1
8
1
1
3
2
1
2
2
7
20
64
2
7
22
220
0
11
7
1
2
1
2
1
1
2
1
1
1
2
2
5-6
st0,r
r
AX
m16
m16
m16
r/m
m
r/m
m
r
m
m
r/m
r
m
Latency
Reciprocal throughput
3
0
5-6
9-40
0.5
1
4
35
0.5
1
20
0.5
1
1
0.5
3
0.25
0.25
19
17
3
2
1
2
1
2
4-16
0.5
0.5
0.5
1
1
Execution
pipes
Domain, notes
P01
fp
fp
fp
fp
fp
fp
fp
fp
inherit
fp
fp
fp
fp
P0 P1 P2 P3
P01
P0 P1 F3
P01
F3
P0 F3
P01
P0 P1 F3
none
none
P0 P2 P3
P0 P2 P3
P01
P01
P01
P01
P01
P01
P01
P01
P01
P01
P0 P1 F3
P01
inherit
fma
fma
fma
fma
fp
fp
fp
fp
fp
fp
fp
fp
FTST
FXAM
FRNDINT
FPREM
FPREM1
1
1
1
1
1
Math
FSQRT
FLDPI, etc.
FSIN
FCOS
FSINCOS
FPTAN
FPATAN
FSCALE
FXTRACT
F2XM1
FYL2X
FYL2XP1
Other
FNOP
(F)WAIT
FNCLEX
FNINIT
FNSAVE
FRSTOR
~20
4
17-60
17-60
1
14-50
1
10-162 60-210
160-170
~154
12-166 86-141
11-190 166-231
10-355 60-352
8
44
12
7
10
60-73
10-176
10-176
m864
m864
1
1
18
31
103
76
300
236
0.5
0.5
1
P01
P01
P0
P0
P0
5-20
0.5
60-146
~154
86-141
86-204
60-352
5
5
P01
P01
P0 P1 P3
P0 P1 P3
P0 P1 P3
P0 P1 P3
P0 P1 P3
P0 P1 P3
P0 P1 P3
P0 P1 P3
P0 P1 P3
P0 P1 P3
0.25
0.25
54
134
300
236
none
none
P0
P0
P0 P1 P2 P3
P0 P3
fp
fp
fp
fp
fp

Instruction
Move instructions
MOVD
MOVD
MOVD
MOVD
MOVQ
MOVQ
MOVQ
MOVDQA
MOVDQA
MOVDQA
VMOVDQA
VMOVDQA
VMOVDQA
MOVDQU
MOVDQU
MOVDQU
LDDQU
VMOVDQU
Operands
Ops
r32/64, mm/x
mm/x, r32/64
mm/x,m32
m32,mm/x
mm/x,mm/x
mm/x,m64
m64,mm/x
xmm,xmm
xmm,m
m,xmm
ymm,ymm
ymm,m256
m256,ymm
xmm,xmm
xmm,m
m,xmm
xmm,m
ymm,m256
1
2
1
1
1
1
1
1
1
1
2
2
4
1
1
1
1
2
Latency
Reciprocal throughput
8
10
6
5
2
6
5
0
6
5
2
6
11
0
6
5
6
6
1
1
0.5
1
0.5
0.5
1
0.25
0.5
1
0.5
1
17
0.25
0.5
1
0.5
1
Execution
pipes
Notes
P3
P3
P23
P3
none
inherit domain
P3
P23
P3
none
P3
inherit domain
VMOVDQU
MOVDQ2Q
MOVQ2DQ
MOVNTQ
MOVNTDQ
MOVNTDQA
PACKSSWB/DW
PACKUSWB
PUNPCKH/LBW/WD/D
Q
PUNPCKHQDQ
PUNPCKLQDQ
PSHUFB
PSHUFD
PSHUFW
PSHUFL/HW
PALIGNR
PBLENDW
MASKMOVQ
MASKMOVDQU
PMOVMSKB
PEXTRB/W/D/Q
PINSRB/W/D/Q
EXTRQ
EXTRQ
INSERTQ
INSERTQ
PMOVSXBW/BD/BQ/
WD/WQ/DQ
PMOVZXBW/BD/BQ/W
D/WQ/DQ
VPCMOV
VPCMOV
VPPERM
PADDB/W/D/Q/SB/SW
/USB/USW
PSUBB/W/D/Q/SB/SW/
USB/USW
PHADD/SUB(S)W/D
PHADD/SUB(S)W/D
PCMPEQ/GT B/W/D
PCMPEQQ
PCMPGTQ
PMULLW PMULHW
PMULHUW PMULUDQ
PMULLD
PMULDQ
PMULHRSW
PMADDWD
m256,ymm
mm,xmm
xmm,mm
m,mm
m,xmm
xmm,m
(x)mm,r/m
(x)mm,r/m
8
1
1
1
1
1
1
1
14
2
2
5
5
6
2
2
20
0.5
0.5
2
2
0.5
1
1
P2 P3
P23
P23
P3
P3
(x)mm,r/m
xmm,r/m
xmm,r/m
(x)mm,r/m
xmm,xmm,i
mm,mm,i
xmm,xmm,i
(x)mm,r/m,i
xmm,r/m
mm,mm
xmm,xmm
r32,mm/x
r,x/mm,i
x/mm,r,i
x,i,i
x,x
x,x,i,i
x,x
1
1
1
1
1
1
1
1
1
31
64
2
2
2
1
1
1
1
2
2
2
3
2
2
2
2
2
36
59
10
10
12
3
1
1
1
1
1
1
1
1
1
1
1
0.5
59
92
1
1
2
1
1
1
1
P1
P1
P1
P1
P1
P1
P1
P1
P23
P3
P1 P3
P1 P3
P1 P3
P1
P1
P1
P1
P1
AMD SSE4A
AMD SSE4A
AMD SSE4A
AMD SSE4A
x,x
P1
SSE4.1
x,x
x,x,x,x/m
y,y,y,y/m
x,x,x,x/m
1
1
2
1
2
2
2
2
1
1
2
1
P1
P1
P1
P1
SSE4.1
AMD XOP
AMD XOP
AMD XOP
(x)mm,r/m
0.5
P23
(x)mm,r/m
x,x
x,m
(x)mm,r/m
(x)mm,r/m
(x)mm,r/m
1
3
4
1
1
1
2
5
5
2
2
2
0.5
2
2
0.5
0.5
0.5
P23
P1 P23
P1 P23
P23
P23
P23
(x)mm,r/m
x,r/m
x,r/m
(x)mm,r/m
(x)mm,r/m
1
1
1
1
1
4
5
4
4
4
1
2
1
1
1
P0
P0
P0
P0
P0
P1
P1
SSE4.1
SSE4.1
SSSE3
SSSE3
SSE4.1
SSE4.2
SSE4.1
SSE4.1
SSSE3
PMADDUBSW
PAVGB/W
PMIN/MAX SB/SW/ SD
UB/UW/UD
PHMINPOSUW
PABSB/W/D
PSIGNB/W/D
PSADBW
MPSADBW
(x)mm,r/m
(x)mm,r/m
1
1
4
2
1
0.5
P0
P23
(x)mm,r/m
x,r/m
(x)mm,r/m
(x)mm,r/m
(x)mm,r/m
x,x,i
1
2
1
1
2
8
2
4
2
2
4
8
0.5
1
0.5
0.5
1
4
P23
P1 P23
P23
P23
P23
P1 P23
VPCOMB/W/D/Q
x,x,x/m,i
0.5
P23
VPCOMUB/W/D/Q
VPHADDBW/BD/BQ/
WD/WQ/DQ
VPHADDUBW/BD/BQ/
WD/WQ/DQ
VPHSUBBW/WD/DQ
VPMACSWW/WD
VPMACSDD
VPMACSDQH/L
VPMACSSWW/WD
VPMACSSDD
VPMACSSDQH/L
VPMADCSWD
VPMADCSSWD
x,x,x/m,i
0.5
P23
SSE4.1
AMD XOP
latency 0 if i=6,7
AMD XOP
latency 0 if i=6,7
x,x/m
0.5
P23
AMD XOP
x,x/m
x,x/m
x,x,x/m,x
x,x,x/m,x
x,x,x/m,x
x,x,x/m,x
x,x,x/m,x
x,x,x/m,x
x,x,x/m,x
x,x,x/m,x
1
1
1
1
1
1
1
1
1
1
2
2
4
5
4
4
5
4
4
4
0.5
0.5
1
2
1
1
2
1
1
1
P23
P23
P0
P0
P0
P0
P0
P0
P0
P0
AMD XOP
AMD XOP
AMD XOP
AMD XOP
AMD XOP
AMD XOP
AMD XOP
AMD XOP
AMD XOP
AMD XOP
(x)mm,r/m
0.5
P23
(x)mm,r/m
P1
(x)mm,i
x,i
x,r/m
x,x,x/m
x,x,i
x,x,x/m
x,x,x/m
1
1
2
1
1
1
1
2
2
3
2
3
3
1
1
1
1
1
1
1
P1
P1
P1 P3
P1
P1
P1
P1
SSE4.1
AMD XOP
AMD XOP
AMD XOP
AMD XOP
x,x,i
x,x,i
x,x,i
x,x,i
27
27
7
7
16
10
13
7
10
10
3
4
P1 P2 P3
P1 P2 P3
P1 P2 P3
P1 P2 P3
SSE4.2
SSE4.2
SSE4.2
SSE4.2
x,x/m,i
x,x,x,i
x,x,m,i
x,x
5
6
7
2
12
12
12
5
7
7
7
2
P1
P1
P1
P01
pclmul
pclmul
pclmul
aes
Logic
PAND PANDN POR
PXOR
PSLL/RL W/D/Q
PSRAW/D
PSLL/RL W/D/Q
PSRAW/D
PSLLDQ, PSRLDQ
PTEST
VPROTB/W/D/Q
VPROTB/W/D/Q
VPSHAB/W/D/Q
VPSHLB/W/D/Q
String instructions
PCMPESTRI
PCMPESTRM
PCMPISTRI
PCMPISTRM
Encryption
PCLMULQDQ
VPCLMULQDQ
PCLMULQDQ
AESDEC
SSE4.1
SSSE3
SSSE3
AESDECLAST
AESENC
AESENCLAST
AESIMC
AESKEYGENASSIST
x,x
x,x
x,x
x,x
x,x,i
Other
EMMS
2
2
2
1
1
5
5
5
5
5
2
2
2
1
1
P01
P01
P01
P0
P0
aes
aes
aes
aes
aes
Execution
pipes
Domain, notes
none
P23
inherit domain
ivec
P3
P3
P2 P3
P01
fp
0.25

Instruction
Move instructions
MOVAPS/D
MOVUPS/D
VMOVAPS/D
MOVAPS/D
MOVUPS/D
VMOVAPS/D
VMOVUPS/D
MOVAPS/D
MOVUPS/D
VMOVAPS/D
VMOVUPS/D
MOVSS/D
MOVSS/D
MOVSS/D
MOVHPS/D
MOVLPS/D
MOVHPS/D
MOVLPS/D
MOVLHPS MOVHLPS
MOVMSKPS/D
VMOVMSKPS/D
MOVNTPS/D
VMOVNTPS/D
MOVNTSS/SD
SHUFPS/D
VSHUFPS/D
VPERMILPS/PD
VPERMILPS/PD
VPERMILPS/PD
VPERMILPS/PD
VPERM2F128
VPERM2F128
BLENDPS/PD
VBLENDPS/PD
BLENDVPS/PD
VBLENDVPS/PD
MOVDDUP
Operands
Ops
Latency
Reciprocal throughput
x,x
y,y
1
2
0
2
0.25
0.5
x,m128
0.5
y,m256
m128,x
m256,y
m256,y
x,x
x,m32/64
m32/64,x
x,m64
x,m64
m64,x
m64,x
x,x
r32,x
r32,y
m128,x
m256,y
m,x
x,x/m,i
y,y,y/m,i
x,x,x/m
y,y,y/m
x,x/m,i
y,y/m,i
y,y,y,i
y,y,m,i
x,x/m,i
y,y,y/m,i
x,x/m,xmm0
y,y,y/m,y
x,x
1
4
8
1
1
1
1
1
2
1
1
2
2
1
4
1
1
2
1
2
1
2
8
10
1
2
1
2
1
5
11
15
2
6
5
8
7
7
6
2
10
1
17
20
0.5
0.5
1
1
0.5
1
1
1
1
1
2
18
4
1
2
1
2
1
2
3
4
0.5
1
1
2
1
2
2
3
3
2
2
4
2
2
2
2
2
P1
P01
P1 P3
P3
P1
P1 P3
ivec
P3
P3
P1
P1
P1
P1
P1
P1
P23
P23
P23
P23
P1
P1
P1
AMD SSE4A
ivec
ivec
ivec
ivec
ivec
ivec
ivec
ivec
ivec
ivec
ivec
ivec
ivec
MOVDDUP
VMOVDDUP
VMOVDDUP
VBROADCASTSS
VBROADCASTSS
VBROADCASTSD
VBROADCASTF128
MOVSH/LDUP
MOVSH/LDUP
VMOVSH/LDUP
VMOVSH/LDUP
UNPCKH/LPS/D
VUNPCKH/LPS/D
EXTRACTPS
EXTRACTPS
VEXTRACTF128
VEXTRACTF128
INSERTPS
INSERTPS
VINSERTF128
VINSERTF128
VMASKMOVPS/D
VMASKMOVPS/D
VMASKMOVPS/D
VMASKMOVPS/D
Conversion
CVTPD2PS
VCVTPD2PS
CVTPS2PD
VCVTPS2PD
CVTSD2SS
CVTSS2SD
CVTDQ2PS
VCVTDQ2PS
CVT(T) PS2DQ
VCVT(T) PS2DQ
CVTDQ2PD
VCVTDQ2PD
CVT(T)PD2DQ
VCVT(T)PD2DQ
CVTPI2PS
CVT(T)PS2PI
CVTPI2PD
CVT(T) PD2PI
CVTSI2SS
CVT(T)SS2SI
CVTSI2SD
CVT(T)SD2SI
VCVTPS2PH
VCVTPS2PH
x,m64
y,y
y,m256
x,m32
y,m32
y,m64
y,m128
x,x
x,m128
y,y
y,m256
x,x/m
y,y,y/m
r32,x,i
m32,x,i
x,y,i
m128,y,i
x,x,i
x,m32,i
y,y,x,i
y,y,m128,i
x,x,m128
y,y,m256
m128,x,x
m256,y,y
1
2
2
1
2
2
2
1
1
2
2
1
2
2
2
1
2
1
1
2
2
1
2
18
34
x,x
x,y
x,x
y,x
x,x
x,x
x,x
y,y
x,x
y,y
x,x
y,x
x,x
x,y
x,mm
mm,x
x,mm
mm,x
x,r32
r32,x
x,r32/64
r32/64,x
x/m,x,i
x/m,y,i
2
4
2
4
1
1
1
2
1
2
2
4
2
4
2
1
2
2
2
2
2
2
2
4
6
2
6
2
7
2
13
7
13
~100
~190
0.5
2
1
0.5
0.5
0.5
0.5
1
0.5
2
1
1
2
1
1
0.5
1
1
2
1
1
0.5
1
~90
~180
8
7
8
8
4
4
4
4
4
4
8
8
8
7
8
4
7
7
13
12
13
12
8
8
1
2
1
2
1
1
1
2
1
2
1
2
1
2
1
1
1
1
1
1
1
1
2
2
2
6
6
6
6
2
2
2
2
P1
ivec
P23
P23
P23
P1
ivec
P1
ivec
P1
P1
P1 P3
P1 P3
P23
P23
P1
P1
P23
P23
P01
P01
P0 P1 P2 P3
P0 P1 P2 P3
ivec
ivec
P01
P01
P01
P01
P0
P0
P0
P0
P0
P0
P01
P01
P01
P01
P0 P23
P0
P0 P1
P0 P1
P0
P0 P3
P0
P0 P3
P0 P1
P0 P1
ivec/fp
ivec/fp
ivec/fp
ivec/fp
fp
fp
fp
fp
fp
fp
ivec/fp
ivec/fp
fp/ivec
fp/ivec
ivec/fp
fp
ivec/fp
fp/ivec
fp
fp
fp
fp
F16C
F16C
ivec
ivec
VCVTPH2PS
VCVTPH2PS
x,x/m
y,x/m
2
4
8
8
2
2
P0 P1
P0 P1
F16C
F16C
Arithmetic
ADDSS/D SUBSS/D
ADDPS/D SUBPS/D
x,x/m
x,x/m
1
1
5-6
5-6
0.5
0.5
P01
P01
fma
fma
VADDPS/D VSUBPS/D
ADDSUBPS/D
VADDSUBPS/D
y,y,y/m
x,x/m
y,y,y/m
2
1
2
5-6
5-6
5-6
1
0.5
1
P01
P01
P01
fma
fma
fma
HADDPS/D HSUBPS/D
x,x
10
P01 P1
ivec/fma
HADDPS/D HSUBPS/D
VHADDPS/D
VHSUBPS/D
MULSS MULSD
MULPS MULPD
VMULPS VMULPD
DIVSS DIVPS
VDIVPS
DIVSD DIVPD
VDIVPD
RCPSS/PS
VRCPPS
CMPSS/D
CMPPS/D
VCMPPS/D
COMISS/D
UCOMISS/D
MAXSS/SD/PS/PD
MINSS/SD/PS/PD
x,m
P01 P1
ivec/fma
y,y,y/m
x,x/m
x,x/m
y,y,y/m
x,x/m
y,y,y/m
x,x/m
y,y,y/m
x,x/m
y,y/m
8
1
1
2
1
2
1
2
1
2
10
5-6
5-6
5-6
9-24
9-24
9-27
9-27
5
5
4
0.5
0.5
1
5-10
9-20
5-10
9-18
1
2
P01 P1
P01
P01
P01
P01
P01
P01
P01
P01
P01
ivec/fma
fma
fma
fma
fp
fp
fp
fp
fp
fp
x,x/m
y,y,y/m
1
2
2
2
0.5
1
P01
P01
fp
fp
x,x/m
P01 P3
fp
x,x/m
0.5
P01
fp
VMAXPS/D VMINPS/D
y,y,y/m
2
ROUNDSS/SD/PS/PD
x,x/m,i
1
VROUNDSS/SD/PS/
PD
y,y/m,i
2
DPPS
x,x,i
16
DPPS
x,m,i
18
VDPPS
y,y,y,i
25
VDPPS
y,m,i
29
DPPD
x,x,i
15
DPPD
x,m,i
17
VFMADD132SS/SD
x,x,x/m
1
VFMADD132PS/PD
x,x,x/m
1
VFMADD132PS/PD
y,y,y/m
2
VFMADDSS/SD
x,x,x,x/m
1
VFMADDPS/PD
x,x,x,x/m
1
VFMADDPS/PD
y,y,y,y/m
2
2
4
1
1
P01
P0
fp
fp
4
25
5-6
5-6
5-6
2
6
7
13
13
5
6
1
1
1
P0
P01 P23
P01 P23
P01 P3
P01 P3
P01 P23
P01 P23
P01
P01
P01
5-6
5-6
5-6
0.5
0.5
1
P01
P01
P01
fp
SSE4.1
SSE4.1
SSE4.1
SSE4.1
SSE4.1
SSE4.1
FMA3
FMA3
FMA3
FMA3
AMD FMA4
AMD FMA4
AMD FMA4
AMD FMA4
27
15
Math
SQRTSS/PS
VSQRTPS
SQRTSD/PD
VSQRTPD
RSQRTSS/PS
VRSQRTPS
VFRCZSS/SD/PS/PD
VFRCZSS/SD/PS/PD
x,x/m
y,y/m
x,x/m
y,y/m
x,x/m
y,y/m
x,x
x,m
1
2
1
2
1
2
2
3
13-15
14-15
24-26
24-26
5
5
10
10
5-12
9-24
5-15
9-29
1
2
2
2
P01
P01
P01
P01
P01
P01
P01
P01
fp
fp
fp
fp
fp
fp
AMD XOP
AMD XOP
AND/ANDN/OR/XORPS/
PD
x,x/m
0.5
P23
ivec
VAND/ANDN/OR/XOR
PS/PD
y,y,y/m
P23
ivec
136
176
196
250
4
5
6
10
34
17
136
176
196
250
P2 P3
P2 P3
P2 P3
P2 P3
P0 P3
P0 P3
P0 P1 P2 P3
P0 P1 P2 P3
P0 P1 P2 P3
P0 P1 P2 P3
32 bit mode
64 bit mode
32 bit mode
64 bit mode
m32
m32
m4096
m4096
m
m
9
16
17
32
7
2
67
116
122
177
Logic
Other
VZEROUPPER
VZEROUPPER
VZEROALL
VZEROALL
LDMXCSR
STMXCSR
FXSAVE
FXRSTOR
XSAVE
XRSTOR
Steamroller
AMD Steamroller
Instruction:
Operands:
Ops:
Latency:
etc.
This is the delay that the instruction generates in a dependency chain. The numbers are minimum values. Cache misses, misalignment, and exceptions may increase the clock counts considerably. The latency listed does not include the
memory operand where the listing for register and memory operand are joined
(r/m).
Execution pipe:

Integer pipes:
P0: floating point add, mul, div. Integer add, mul, bool
P1: floating point add, mul, div. Shuffle, shift, pack
P2: Integer add. Bool, store
execution pipes
Domain:

fma domain.
store instruction.
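As an illustration of the latency definition above, the following minimal sketch adds up the listed latencies along a chain of dependent instructions. It is not part of the measured data. The function name `min_chain_cycles`, the `load_latency` parameter, and the example numbers are hypothetical. The extra cost for a memory operand is a caller-supplied assumption, because rows that join register and memory operands (r/m) do not include it, and it only applies when the memory operand is itself part of the dependency chain.

```python
def min_chain_cycles(chain, load_latency=0):
    """Lower bound on clock cycles for a chain of dependent instructions.

    chain: list of (latency, has_memory_operand) tuples, one per instruction,
           where each instruction needs the result of the previous one.
    load_latency: assumed extra cycles for a memory operand on the chain;
           a caller-supplied guess, since r/m rows do not include it.
    """
    total = 0
    for latency, has_memory_operand in chain:
        total += latency
        if has_memory_operand:
            total += load_latency
    return total


if __name__ == "__main__":
    # Illustrative numbers only, not taken from the tables.
    chain = [(1, False), (5, False), (1, True)]
    print(min_chain_cycles(chain))                  # register-only estimate
    print(min_chain_cycles(chain, load_latency=4))  # with an assumed load cost
```

The result is a minimum only. Cache misses, misalignment and exceptions can make the real count considerably higher.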
Instruction
Move instructions
MOV
MOV
MOV
MOV
MOV
MOV
MOV
MOV
MOVNTI
MOVZX, MOVSX
MOVSX
MOVZX
MOVSXD
MOVSXD
CMOVcc
CMOVcc
XCHG
XCHG
XCHG
XCHG
XCHG
XLAT
PUSH
PUSH
PUSH
PUSHF(D/Q)
PUSHA(D)
POP
POP
POPF(D/Q)
POPA(D)
POP
LEA
LEA
LEA
LEA
LAHF
SAHF
SALC
BSWAP
PREFETCHNTA
PREFETCHT0/1/2
PREFETCH/W
SFENCE
LFENCE
MFENCE
Operands
Ops
r8,r8
r16,r16
r32,r32
r64,r64
r,i
r,m
m,r
m,i
m,r
r,r
r,m
r,m
r64,r32
r64,m32
r,r
r,m
r8,r8
r16,r16
r32,r32
r64,r64
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
2
2
2
2
1
1
1
1
1
3
4
r,m
2
2
1
1
2
8
9
1
2
34
14
1
2
1
~38
6
1
1
4
2
1
1
1
1
1
7
1
7
2
1
3
2
1
1
r
i
m
r
m
sp
r16,[m]
r32,[m]
r32/64,[m]
r32/64,[m]
r
m
m
m
Latency
Reciprocal throughput
4
1
5
4
1
5
1
1
1
1
1
0.5
0.5
0.25
0.25
0.5
0.5
1
1
1
0.5
0.5
0.5
0.5
0.5
0.5
0.5
1
1
0.5
0.5
EX01
EX01
EX01 or AG01
EX01 or AG01
EX01
AG01
EX01 AG01
~38
2
1
1
1
4
9
1
1
19
8
EX01
2
2-3
2
Execution
pipes
all addr. modes

all addr. modes
EX01
EX01
EX01
EX01
EX01
EX01
EX01
EX01
EX01
EX01
EX01
EX01
EX01
0.5
0.5
2
1
1
0.5
0.5
0.5
0.5
~80
0.25
~80
Notes
EX01
EX01
Timing depends on
hw
any addr. size

16 bit addr. size
scale factor > 1
or 3 operands
all other cases
EX01
PREFETCHW
ADD, SUB
ADD, SUB
ADD, SUB
ADD, SUB
ADD, SUB
ADC, SBB
ADC, SBB
ADC, SBB
ADC, SBB
ADC, SBB
CMP
CMP
CMP
CMP
INC, DEC, NEG
INC, DEC, NEG
AAA, AAS
DAA
DAS
AAD
AAM
MUL, IMUL
MUL, IMUL
MUL, IMUL
MUL, IMUL
IMUL
IMUL
IMUL
IMUL
IMUL
IMUL
IMUL
IMUL
IMUL
DIV
DIV
DIV
DIV
IDIV
IDIV
IDIV
IDIV
CBW, CWDE, CDQE
CDQ, CQO
CWD
Logic instructions
AND, OR, XOR
AND, OR, XOR
AND, OR, XOR
AND, OR, XOR
AND, OR, XOR
TEST
r,r
r,i
r,m
m,r
m,i
r,r
r,i
r,m
m,r
m,i
r,r
r,i
r,m
m,i
r
m
r8/m8
r16/m16
r32/m32
r64/m64
r16,r16/m16
r32,r32/m32
r64,r64/m64
r16,(r16),i
r32,(r32),i
r64,(r64),i
r16,m16,i
r32,m32,i
r64,m64,i
r8/m8
r16/m16
r32/m32
r64/m64
r8/m8
r16/m16
r32/m32
r64/m64
r,r
r,i
r,m
m,r
m,i
r,r
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
10
16
20
4
10
1
2
1
1
1
1
1
2
1
1
2
2
2
9
7
2
2
9
7
2
2
1
1
2
1
1
1
1
1
1
1
1
7
7
1
1
1
9
9
1
1
1
7
6
8
10
6
15
4
4
4
6
4
4
6
5
4
6
17-22
15-25
13-39
13-70
17-22
14-25
13-39
13-70
1
1
1
1
1
7
7
1
0.5
0.5
0.5
1
1
1
1
1
0.5
0.5
0.5
0.5
0.5
1
15
2
2
2
4
2
2
4
2
2
4
2
2
4
13-17
15-25
13-39
13-70
13-17
14-24
13-39
13-70
EX01
EX01
EX01
EX01
EX01
EX01
EX01
EX01
EX01
EX01
EX01
EX01
EX01
EX01
EX01
EX01
0.5
1
EX1
EX1
EX1
EX1
EX1
EX1
EX1
EX1
EX1
EX1
EX1
EX1
EX1
EX0
EX0
EX0
EX0
EX0
EX0
EX0
EX0
EX01
EX01
EX01
0.5
0.5
0.5
1
1
0.5
EX01
EX01
EX01
EX01
EX01
EX01
TEST
TEST
TEST
NOT
NOT
ANDN
SHL, SHR, SAR
ROL, ROR
RCL
RCL
RCL
RCR
RCR
RCR
SHLD, SHRD
SHLD, SHRD
SHLD, SHRD
BT
BT
BT
BTC, BTR, BTS
BTC, BTR, BTS
BTC, BTR, BTS
BSF
BSF
BSR
BSR
SETcc
SETcc
CLC, STC
CMC
CLD
STD
POPCNT
POPCNT
LZCNT
TZCNT
BEXTR
BEXTR
BLSI
BLSMSK
BLSR
BLCFILL
BLCI
BLCIC
BLCMSK
BLCS
BLSFILL
BLSI
BLSIC
T1MSKC
TZMSK
r,i
m,r
m,i
r
m
r,r,r
r,i/CL
r,i/CL
r,1
r,i
r,cl
r,1
r,i
r,cl
r,r,i
r,r,cl
m,r,i/CL
r,r/i
m,i
m,r
r,r/i
m,i
m,r
r,r
r,m
r,r
r,m
r
m
r16/32,r16/32
r64,r64
r,r
r,r
r,r,r
r,r,i
r,r
r,r
r,r
r,r
r,r
r,r
r,r
r,r
r,r
r,r
r,r
r,r
r,r
1
1
1
1
1
1
1
1
1
16
17
1
15
16
6
7-8
8
1
1
7
2
4
10
6
8
7
9
1
1
1
1
2
2
1
1
1
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
1
7
1
1
1
1
7
7
1
7
7
3
4
1
3
4
4
1
0.5
0.5
0.5
0.5
1
0.5
0.5
0.5
3
4
4
0.5
0.5
3.5
1
2
5
3
4
4
5
0.5
1
0.5
4
4
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
3
4
2
4
2
2
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
EX01
EX01
EX01
EX01
EX01
EX01
EX01
EX01
EX01
EX01
EX01
EX01
EX01
EX01
EX01
EX01
EX01
EX01
EX01
EX01
EX01
EX01
EX01
EX01
EX01
EX01
EX01
EX01
EX01
EX01
EX01
EX0
BMI1
SSE4.2
SSE4.2
LZCNT
BMI1
BMI1
AMD TBM
BMI1
BMI1
BMI1
AMD TBM
AMD TBM
AMD TBM
AMD TBM
AMD TBM
AMD TBM
AMD TBM
AMD TBM
AMD TBM
AMD TBM
JMP
short/near
JMP
r
JMP
m
Jcc
short/near
fused CMP+Jcc
short/near
J(E/R)CXZ
short
LOOP
short
LOOPE LOOPNE
short
CALL
near
CALL
r
CALL
m
RET
RET
i
BOUND
m
INTO
String instructions
LODS
REP LODS
REP LODS
STOS
REP STOS
REP STOS
MOVS
REP MOVS
REP MOVS
SCAS
REP SCAS
CMPS
REP CMPS
Synchronization
LOCK ADD
XADD
LOCK XADD
CMPXCHG
CMPXCHG
CMPXCHG
LOCK CMPXCHG
LOCK CMPXCHG
LOCK CMPXCHG
CMPXCHG8B
LOCK CMPXCHG8B
CMPXCHG16B
LOCK CMPXCHG16B
Other
NOP (90)
Long NOP (0F 1F)
PAUSE
m8/m16
m32/m64
m,r
m,r
m,r
m,r8
m,r16
m,r32/64
m8,r8
m16,r16
m,r32/64
m64
m64
m128
m128
1
1
1
1
1
1
1
1
2
2
3
1
4
11
4
2
2
2
1-2
1-2
1-2
1-2
1-2
2
2
2
2
2
5
2
3
6n
6n
3
1n
3 per 16B
5
~1n
4-5 per 16B
3
7n
6
9n
3
3n
2.5n
3
~1n
2 per 16B
3
~1n
~2 per 16B
3
3-4n
3
4n
1
4
4
5
6
6
5
6
6
18
18
24
24
EX1
EX1
EX1
EX1
EX1
EX1
EX1
EX1
EX1
EX1
EX1
EX1
EX1
for no jump
for no jump
small n
best case
small n
best case
~39
9-12
~39
15
15
13
~40
~40
~40
~14
~42
~47
~80
1
1
8
0.25
0.25
4
2 if jumping
2 if jumping
2 if jumping
2 if jumping
2 if jumping
none
none
ENTER
ENTER
LEAVE
CPUID
XGETBV
RDTSC
RDTSCP
RDPMC
CRC32
CRC32
CRC32
a,0
a,b
r32,r8
r32,r16
r32,r32
13
11+5b
2
38-64
4
44
44
22
3
5
7
3
5
6
21
20-30
3
100-300
30
78
105
360
2
5
6
rdtscp

Instruction
Move instructions
FLD
FLD
FLD
FBLD
FST(P)
FST(P)
FSTP
FBSTP
FXCH
FILD
FIST(T)(P)
FLDZ, FLD1
FCMOVcc
FFREE
FINCSTP, FDECSTP
FNSTSW
FNSTSW
FLDCW
FNSTCW
FADD(P),FSUB(R)(P)
FIADD,FISUB(R)
FMUL(P)
FIMUL
FDIV(R)(P)
FDIV(R)
FIDIV(R)
FABS, FCHS
FCOM(P), FUCOM(P)
FCOMPP, FUCOMPP
FCOMI(P)
FICOM(P)
FTST
Operands
Ops
r
m32/64
m80
m80
r
m32/64
m80
m80
r
m
m
1
1
8
60
1
2
13
239
1
1
2
1
8
1
1
3
2
1
2
2
7
11
52
2
7
14
222
0
11
7
1
2
1
2
1
1
2
1
1
1
2
2
1
st0,r
r
AX
m16
m16
m16
r/m
m
r/m
m
r
m
m
r/m
r
m
Latency
Reciprocal throughput
3
0
11
5
9-37
0.5
1
4
34
0.5
1
19
222
0.5
1
1
0.5
3
0.25
0.25
19
17
3
2
1
2
1
2
4-16
4
0.5
0.5
0.5
1
1
0.5
Execution
pipes
Domain, notes
P01
fp
fp
fp
fp
fp
fp
fp
fp
inherit
fp
fp
fp
fp
P0 P1 P2
P01
P0 P1 P2
P01
P01
P0 P2
P01
P0 P1 P2
none
none
P0 P2
P0 P2
P01
P01
P01
P01
P01
P01
P01
P01
P01
P01
P01 P2
P01
P01
inherit
fma
fma
fma
fma
fp
fp
fp
fp
fp
fp
fp
fp
fp
FXAM
FRNDINT
FPREM FPREM1
Math
FSQRT
FLDPI, etc.
FSIN
FCOS
FSINCOS
FPTAN
FPATAN
FSCALE
FXTRACT
F2XM1
FYL2X
FYL2XP1
Other
FNOP
(F)WAIT
FNCLEX
FNINIT
FNSAVE
FRSTOR
m864
m864
1
1
1
26
4
17-60
0.5
1
12-53
P01
P0
P0
1
1
10-164
18-166
12-168
11-192
10-365
10
12
10-18
9-183
206
10-50
5-20
0.5
60-165
P01
P01
P0 P1 P2
P0 P1 P2
P0 P1 P2
P0 P1 P2
P0 P1 P2
P0 P1 P2
P0 P1 P2
P0 P1 P2
P0 P1 P2
P0 P1 P2
1
1
18
31
98
73
60-210
76-158
90-245
60-440
49
8
60-74
60-280
~390
256
166
90-165
90-210
60-365
5
5
0.25
0.25
63
131
256
166
fp
fp
fp
none
none
P0
P0
P0 P1 P2
P0 P2

Instruction
Move instructions
MOVD
MOVD
MOVD
MOVD
MOVQ
MOVQ
MOVQ
MOVDQA
MOVDQA
MOVDQA
VMOVDQA
VMOVDQA
VMOVDQA
MOVDQU
MOVDQU
MOVDQU
LDDQU
VMOVDQU
VMOVDQU
MOVDQ2Q
MOVQ2DQ
Operands
Ops
r32/64, mm/x
mm/x, r32/64
mm/x,m32
m32,mm/x
mm/x,mm/x
mm/x,m64
m64,mm/x
xmm,xmm
xmm,m
m,xmm
ymm,ymm
ymm,m256
m256,ymm
xmm,xmm
xmm,m
m,xmm
xmm,m
ymm,m256
m256,ymm
mm,xmm
xmm,mm
1
2
1
1
1
1
1
1
1
1
2
2
2
1
1
1
1
2
2
1
1
Latency
Reciprocal throughput
4
5
2
3
2
2
3
0
2
3
2
3
4
0
2
3
2
3
4
1
1
1
1
0.5
1
0.5
0.5
1
0.25
0.5
1
0.5
1
1
0.25
0.5
1
0.5
1
1
0.5
0.5
Execution
pipes
Notes
P2
P02
none
inherit domain
P2
P02
P2
none
P2
P02
P02
inherit domain
MOVNTQ
m,mm
MOVNTDQ
m,xmm
MOVNTDQA
xmm,m
PACKSSWB/DW
(x)mm,r/m
PACKUSWB
(x)mm,r/m
PUNPCKH/LBW/WD/D
Q
(x)mm,r/m
PUNPCKHQDQ
xmm,r/m
PUNPCKLQDQ
xmm,r/m
PSHUFB
(x)mm,r/m
PSHUFD
xmm,xmm,i
PSHUFW
mm,mm,i
PSHUFL/HW
xmm,xmm,i
PALIGNR
(x)mm,r/m,i
PBLENDW
xmm,r/m
MASKMOVQ
mm,mm
MASKMOVDQU
xmm,xmm
PMOVMSKB
r32,mm/x
PEXTRB/W/D/Q
r,x/mm,i
PINSRB/W/D/Q
x/mm,r,i
EXTRQ
x,i,i
EXTRQ
x,x
INSERTQ
x,x,i,i
INSERTQ
x,x
PMOVSXBW/BD/BQ/
WD/WQ/DQ
x,x
PMOVZXBW/BD/BQ/W
D/WQ/DQ
x,x
VPCMOV
x,x,x,x/m
VPCMOV
y,y,y,y/m
VPPERM
x,x,x,x/m
PADDB/W/D/Q/SB/SW
/USB/USW
PSUBB/W/D/Q/SB/SW/
USB/USW
PHADD/SUB(S)W/D
PCMPEQ/GT B/W/D
PCMPEQQ
PCMPGTQ
PMULLW PMULHW
PMULHUW PMULUDQ
PMULLD
PMULDQ
PMULHRSW
PMADDWD
PMADDUBSW
PAVGB/W
PMIN/MAX SB/SW/ SD
UB/UW/UD
PHMINPOSUW
1
1
1
1
1
3
3
2
2
2
1
1
0.5
1
1
P2
P2
1
1
1
1
1
1
1
1
1
31
65
2
2
2
1
1
1
1
2
2
2
3
2
2
2
2
2
32
45
5
5
6
3
1
1
1
1
1
1
1
1
1
1
1
0.5
16
31
1
1
1
1
1
1
1
P1
P1
P1
P1
P1
P1
P1
P1
P02
P2
P0 P1 P2
P1 P2
P1 P2
P1
P1
P1
P1
P1
AMD SSE4A
AMD SSE4A
AMD SSE4A
AMD SSE4A
P1
SSE4.1
1
1
2
1
2
2
2
2
1
1
2
1
P1
P1
P1
P1
SSE4.1
AMD XOP
AMD XOP
AMD XOP
(x)mm,r/m
0.5
P02
(x)mm,r/m
x,x
(x)mm,r/m
(x)mm,r/m
(x)mm,r/m
1
3
1
1
1
2
5
2
2
2
0.5
2
0.5
0.5
0.5
P02
P02 2P1
P02
P02
P02
(x)mm,r/m
x,r/m
x,r/m
(x)mm,r/m
(x)mm,r/m
(x)mm,r/m
(x)mm,r/m
1
1
1
1
1
1
1
4
5
4
4
4
4
2
1
2
1
1
1
1
0.5
P0
P0
P0
P0
P0
P0
P02
(x)mm,r/m
x,r/m
1
2
2
4
0.5
1
P02
P1 P02
P1
P1
SSE4.1
SSE4.1
SSSE3
SSE4.1
SSE4.2
SSE4.1
SSE4.1
SSSE3
SSE4.1
PABSB/W/D
PSIGNB/W/D
PSADBW
MPSADBW
(x)mm,r/m
(x)mm,r/m
(x)mm,r/m
x,x,i
1
1
2
8
2
2
4
8
0.5
0.5
1
4
P02
P02
P02
P1 P02
VPCOMB/W/D/Q
x,x,x/m,i
0.5
P02
VPCOMUB/W/D/Q
VPHADDBW/BD/BQ/
WD/WQ/DQ
VPHADDUBW/BD/BQ/
WD/WQ/DQ
VPHSUBBW/WD/DQ
VPMACSWW/WD
VPMACSDD
VPMACSDQH/L
VPMACSSWW/WD
VPMACSSDD
VPMACSSDQH/L
VPMADCSWD
VPMADCSSWD
x,x,x/m,i
0.5
P02
SSE4.1
AMD XOP
latency 0 if i=6,7
AMD XOP
latency 0 if i=6,7
x,x/m
0.5
P02
AMD XOP
x,x/m
x,x/m
x,x,x/m,x
x,x,x/m,x
x,x,x/m,x
x,x,x/m,x
x,x,x/m,x
x,x,x/m,x
x,x,x/m,x
x,x,x/m,x
1
1
1
1
1
1
1
1
1
1
2
2
4
5
4
4
5
4
4
4
0.5
0.5
1
2
1
1
2
1
1
1
P02
P02
P0
P0
P0
P0
P0
P0
P0
P0
AMD XOP
AMD XOP
AMD XOP
AMD XOP
AMD XOP
AMD XOP
AMD XOP
AMD XOP
AMD XOP
AMD XOP
(x)mm,r/m
0.5
P02
(x)mm,r/m
P1
(x)mm,i
x,i
x,r/m
x,x,x/m
x,x,i
x,x,x/m
x,x,x/m
1
1
2
1
1
1
1
2
2
14
3
2
3
3
1
1
1
1
1
1
1
P1
P1
P1 P2
P1
P1
P1
P1
SSE4.1
AMD XOP
AMD XOP
AMD XOP
AMD XOP
x,x,i
x,x,i
x,x,i
x,x,i
30
30
9
8
11
10
5
6
11
10
5
6
P0 P1 P2
P0 P1 P2
P0 P1 P2
P0 P1 P2
SSE4.2
SSE4.2
SSE4.2
SSE4.2
x,x/m,i
x,x,x,i
x,x,m,i
x,x
x,x
x,x
x,x
x,x
x,x,i
7
7
8
2
2
2
2
1
1
11
11
7
7
7
1
1
1
1
1
1
P1
P1
P1
P01
P01
P01
P01
P0
P0
pclmul
pclmul
pclmul
aes
aes
aes
aes
aes
aes
Logic
PAND PANDN POR
PXOR
PSLL/RL W/D/Q
PSRAW/D
PSLL/RL W/D/Q
PSRAW/D
PSLLDQ, PSRLDQ
PTEST
VPROTB/W/D/Q
VPROTB/W/D/Q
VPSHAB/W/D/Q
VPSHLB/W/D/Q
String instructions
PCMPESTRI
PCMPESTRM
PCMPISTRI
PCMPISTRM
Encryption
PCLMULQDQ
VPCLMULQDQ
PCLMULQDQ
AESDEC
AESDECLAST
AESENC
AESENCLAST
AESIMC
AESKEYGENASSIST
5
5
5
5
5
5
SSSE3
SSSE3
Other
EMMS
0.25

Instruction
Move instructions
MOVAPS/D
MOVUPS/D
VMOVAPS/D
MOVAPS/D
MOVUPS/D
VMOVAPS/D
VMOVUPS/D
MOVAPS/D
MOVUPS/D
VMOVAPS/D
VMOVUPS/D
MOVSS/D
MOVSS/D
MOVSS/D
MOVHPS/D
MOVLPS/D
MOVHPS/D
MOVLPS/D
MOVLHPS MOVHLPS
MOVMSKPS/D
VMOVMSKPS/D
MOVNTPS/D
VMOVNTPS/D
MOVNTSS/SD
SHUFPS/D
VSHUFPS/D
VPERMILPS/PD
VPERMILPS/PD
VPERMILPS/PD
VPERMILPS/PD
VPERM2F128
VPERM2F128
BLENDPS/PD
VBLENDPS/PD
BLENDVPS/PD
VBLENDVPS/PD
MOVDDUP
MOVDDUP
VMOVDDUP
VMOVDDUP
VBROADCASTSS
VBROADCASTSS
Operands
Ops
Latency
Reciprocal throughput
x,x
y,y
1
2
0
2
0.25
0.5
x,m128
0.5
y,m256
m128,x
m256,y
m256,y
x,x
x,m32/64
m32/64,x
x,m64
x,m64
m64,x
m64,x
x,x
r32,x
r32,y
m128,x
m256,y
m,x
x,x/m,i
y,y,y/m,i
x,x,x/m
y,y,y/m
x,x/m,i
y,y/m,i
y,y,y,i
y,y,m,i
x,x/m,i
y,y,y/m,i
x,x/m,xmm0
y,y,y/m,y
x,x
x,m64
y,y
y,m256
x,m32
y,m32
1
2
2
1
1
1
1
1
2
1
1
2
2
1
2
1
1
2
1
2
1
2
8
12
1
2
1
2
1
1
2
2
1
2
3
3
3
2
2
3
3
3
4
3
2
5
15
3
3
1
2
2
0.5
0.5
1
1
0.5
1
1
1
1
1
1
2-3
3
1
2
1
2
1
2
3.5
4
0.5
1
0.5
1
1
0.5
2
1
0.5
0.5
2
2
3
3
2
2
4
2
2
2
2
2
2
8
8
Execution
pipes
Domain, notes
none
P02
inherit domain
ivec
P2
P2
P2
P01
fp
P2
P1
P01
P1 P2
P2
P1
P1 P2
P1 P2
P2
P2
P2
P2
P2
P1
P1
P1
P1
P0 P2
P0 P2
P01
P01
P01
P01
P1
P1
P02
ivec
AMD SSE4A
ivec
ivec
ivec
ivec
ivec
ivec
ivec
ivec
fp
fp
ivec
ivec
VBROADCASTSD
VBROADCASTF128
MOVSH/LDUP
MOVSH/LDUP
VMOVSH/LDUP
VMOVSH/LDUP
UNPCKH/LPS/D
VUNPCKH/LPS/D
EXTRACTPS
EXTRACTPS
VEXTRACTF128
VEXTRACTF128
INSERTPS
INSERTPS
VINSERTF128
VINSERTF128
VMASKMOVPS/D
VMASKMOVPS/D
VMASKMOVPS/D
VMASKMOVPS/D
Conversion
CVTPD2PS
VCVTPD2PS
CVTPS2PD
VCVTPS2PD
CVTSD2SS
CVTSS2SD
CVTDQ2PS
VCVTDQ2PS
CVT(T) PS2DQ
VCVT(T) PS2DQ
CVTDQ2PD
VCVTDQ2PD
CVT(T)PD2DQ
VCVT(T)PD2DQ
CVTPI2PS
CVT(T)PS2PI
CVTPI2PD
CVT(T) PD2PI
CVTSI2SS
CVT(T)SS2SI
CVTSI2SD
CVT(T)SD2SI
VCVTPS2PH
VCVTPS2PH
VCVTPH2PS
VCVTPH2PS
Arithmetic
ADDSS/D SUBSS/D
y,m64
y,m128
x,x
x,m128
y,y
y,m256
x,x/m
y,y,y/m
r32,x,i
m32,x,i
x,y,i
m128,y,i
x,x,i
x,m32,i
y,y,x,i
y,y,m128,i
x,x,m128
y,y,m256
m128,x,x
m256,y,y
2
2
1
1
2
2
1
2
2
2
1
2
1
1
2
2
1
2
20
41
x,x
x,y
x,x
y,x
x,x
x,x
x,x
y,y
x,x
y,y
x,x
y,x
x,x
x,y
x,mm
mm,x
x,mm
mm,x
x,r32
r32,x
x,r32/64
r32/64,x
x/m,x,i
x/m,y,i
x,x/m
y,x/m
x,x/m
8
8
2
10
2
10
2
9
2
10
9
9
~35
~35
0.5
0.5
1
0.5
2
1
1
2
1
1
0.5
1
1
2
1
1
0.5
1
8
16
2
4
2
4
1
1
1
2
1
2
2
4
2
4
2
1
2
2
2
2
2
2
2
4
2
4
6
6
6
6
4
4
4
4
4
4
7
7
7
7
6
5
7
7
13
12
12
12
7
7
7
7
5-6
2
2
2
P02
P02
P1
ivec
P1
ivec
P1
P1
P1 P2
P1 P2
P02
P0 P2
P1
P1
P02
P02
P01
P01
P0 P1 P2
P0 P1 P2
ivec
ivec
1
2
1
2
1
1
1
2
1
2
1
2
1
2
1
1
1
1
1
1
1
1
2
2
2
2
P01
P01
P01
P01
P0
P0
P0
P0
P0
P0
P01
P01
P01
P01
P0 P2
P0
P0 P1
P0 P1
P0
P0 P2
P0
P0 P2
P0 P1
P0 P1
P0 P1
P0 P1
ivec/fp
ivec/fp
ivec/fp
ivec/fp
fp
fp
fp
fp
fp
fp
ivec/fp
ivec/fp
fp/ivec
fp/ivec
ivec/fp
fp
ivec/fp
fp/ivec
fp
fp
fp
fp
F16C
F16C
F16C
F16C
P01
fma
ivec
ivec
ADDPS/D SUBPS/D
x,x/m
5-6
P01
fma
VADDPS/D VSUBPS/D
ADDSUBPS/D
VADDSUBPS/D
y,y,y/m
x,x/m
y,y,y/m
2
1
2
5-6
5-6
5-6
2
1
1
P01
P01
P01
fma
fma
fma
HADDPS/D HSUBPS/D
VHADDPS/D
VHSUBPS/D
MULSS MULSD
MULPS MULPD
VMULPS VMULPD
DIVSS DIVPS
VDIVPS
DIVSD DIVPD
VDIVPD
RCPSS/PS
VRCPPS
CMPSS/D
CMPPS/D
VCMPPS/D
COMISS/D
UCOMISS/D
MAXSS/SD/PS/PD
MINSS/SD/PS/PD
x,x
10
P0 P1
ivec/fma
y,y,y/m
x,x/m
x,x/m
y,y,y/m
x,x/m
y,y,y/m
x,x/m
y,y,y/m
x,x/m
y,y/m
8
1
1
2
1
2
1
2
1
2
10
5-6
5-6
5-6
9-17
9-17
9-32
9-32
5
5
4
0.5
0.5
1
4-6
9-12
4-13
9-27
1
2
P01 P1
P01
P01
P01
P01
P01
P01
P01
P01
P01
ivec/fma
fma
fma
fma
fp
fp
fp
fp
fp
fp
x,x/m
y,y,y/m
1
2
2
2
0.5
1
P01
P01
fp
fp
x,x/m
P01 P2
fp
x,x/m
0.5
P01
fp
VMAXPS/D VMINPS/D
y,y,y/m
2
ROUNDSS/SD/PS/PD
x,x/m,i
1
VROUNDSS/SD/PS/
PD
y,y/m,i
2
DPPS
x,x,i
9
DPPS
x,m,i
10
VDPPS
y,y,y,i
13
VDPPS
y,m,i
15
DPPD
x,x,i
7
DPPD
x,m,i
8
VFMADD132SS/SD
x,x,x/m
1
VFMADD132PS/PD
x,x,x/m
1
VFMADD132PS/PD
y,y,y/m
2
VFMADDSS/SD
x,x,x,x/m
1
VFMADDPS/PD
x,x,x,x/m
1
VFMADDPS/PD
y,y,y,y/m
2
2
4
1
1
P01
P0
fp
fp
4
25
5-6
5-6
5-6
2
4
5
8
8
3
4
0.5
0.5
1
P0
P0 P1
P0 P1
P0 P1
P0 P1
P0 P1
P0 P1
P01
P01
P01
5-6
5-6
5-6
0.5
0.5
1
P01
P01
P01
fp
SSE4.1
SSE4.1
SSE4.1
SSE4.1
SSE4.1
SSE4.1
FMA3
FMA3
FMA3
FMA3
AMD FMA4
AMD FMA4
AMD FMA4
AMD FMA4
12-13
12-13
26-29
27-28
5
5
4-9
9-18
4-18
9-37
1
2
P01
P01
P01
P01
P01
P01
Math
SQRTSS/PS
VSQRTPS
SQRTSD/PD
VSQRTPD
RSQRTSS/PS
VRSQRTPS
x,x/m
y,y/m
x,x/m
y,y/m
x,x/m
y,y/m
1
2
1
2
1
2
25
14
fp
fp
fp
fp
fp
fp
VFRCZSS/SD/PS/PD
VFRCZSS/SD/PS/PD
x,x
x,m
2
4
10
2
2
P01
P01
AMD XOP
AMD XOP
AND/ANDN/OR/XORPS/
PD
x,x/m
0.5
P02
ivec
VAND/ANDN/OR/XOR
PS/PD
y,y,y/m
P02
ivec
m32
m32
m4096
m4096
m
m
9
16
17
32
9
2
59-67
104-112
121-137
191-209
Logic
Other
VZEROUPPER
VZEROUPPER
VZEROALL
VZEROALL
LDMXCSR
STMXCSR
FXSAVE
FXRSTOR
XSAVE
XRSTOR
4
5
6
10
36
17
78
160
147-166
291-297
P02
P02
P0 P2
P0 P2
P0 P1 P2
P0 P1 P2
P0 P1 P2
P0 P1 P2
32 bit mode
64 bit mode
32 bit mode
64 bit mode
Bobcat
AMD Bobcat
Instruction:
Operands:
Ops:
Latency:
JNE, etc.
mmx register, xmm = 128 bit xmm register, m = any memory operand including
indirect operands, m64 means 64-bit memory operand, etc.
Number of micro-operations issued from instruction decoder to schedulers. Instructions with more than 2 micro-operations are micro-coded.
This is the delay that the instruction generates in a dependency chain. The numbers are minimum values. Cache misses, misalignment, and exceptions may increase the clock counts considerably. Floating point operands are presumed to
be normal numbers. Denormal numbers, NaNs, infinities and exceptions increase
the delays. The latencies listed do not include memory operands where the operand is listed as register or memory (r/m).
The clock frequency varies dynamically, which makes it difficult to measure latencies. The values listed are measured after the execution of millions of similar
instructions, assuming that this will make the processor boost the clock frequency
to the highest possible value.
This is also called issue latency. This value indicates the average number of
clock cycles from when the execution of an instruction begins until a subsequent independent instruction of the same kind can begin to execute. A value of 1/2 indicates that the execution units can handle 2 instructions per clock cycle in one
thread. However, the throughput may be limited by other bottlenecks in the pipeline.
Execution pipe:
Indicates which execution pipe is used for the micro-operations. I0 means integer
pipe 0. I0/1 means integer pipe 0 or 1. FP0 means floating point pipe 0 (ADD).
FP1 means floating point pipe 1 (MUL). FP0/1 means either one of the two floating point pipes. Two micro-operations can execute simultaneously if they go to
different execution pipes.
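The reciprocal throughput and execution pipe columns described above can be combined into a rough estimate. The sketch below is a minimal illustration under stated assumptions, not a model of the actual scheduler. The function names `can_execute_together` and `min_issue_cycles` are hypothetical, and the pipe sets in the example are written by hand from the notation above (I0/1 becomes the set {"I0", "I1"}).

```python
from fractions import Fraction
from itertools import product


def can_execute_together(pipes_a, pipes_b):
    """True if two micro-operations can be placed on different pipes.

    pipes_a, pipes_b: sets of pipe names each micro-operation may use,
    for example {"I0", "I1"} for an entry listed as I0/1.
    """
    return any(a != b for a, b in product(pipes_a, pipes_b))


def min_issue_cycles(count, reciprocal_throughput):
    """Lower bound on cycles to start `count` independent instructions
    of the same kind, ignoring all other bottlenecks in the pipeline."""
    return count * Fraction(reciprocal_throughput)


if __name__ == "__main__":
    # An I0/1 micro-operation next to an I0-only one: the first can use I1.
    print(can_execute_together({"I0", "I1"}, {"I0"}))
    # Two micro-operations that both need I0 must wait for each other.
    print(can_execute_together({"I0"}, {"I0"}))
    # Ten independent instructions with reciprocal throughput 1/2.
    print(min_issue_cycles(10, Fraction(1, 2)))
```

Using `Fraction` keeps values such as 1/2 exact. The estimate is a lower bound only, since the throughput may be limited by other bottlenecks in the pipeline.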
Instruction
Move instructions
MOV
MOV
MOV
MOV
MOV
MOV
MOVNTI
MOVZX, MOVSX
MOVZX, MOVSX
MOVSXD
MOVSXD
CMOVcc
CMOVcc
XCHG
XCHG
Operands
r,r
r,i
r,m
m,r
m8,r8H
m,i
m,r
r,r
r,m
r64,r32
r64,m32
r,r
r,m
r,r
r,m
Ops
Latency
Reciprocal throughput
1
1
1
1
1
1
1
1
1
1
1
1
1
2
3
1
4
4
7
6
1
5
1
5
1
1
20
0.5
0.5
1
1
1
1
1
0.5
1
0.5
1
0.5
1
1
Execution
pipe
I0/1
I0/1
AGU
AGU
AGU
AGU
AGU
I0/1
Notes
Any addr. mode

Any addr. mode
AH, BH, CH, DH
I0/1
I0/1
Timing dep. on hw
XLAT
PUSH
PUSH
PUSH
PUSHF(D/Q)
PUSHA(D)
POP
POP
POPF(D/Q)
POPA(D)
LEA
LEA
LEA
LEA
LAHF
SAHF
SALC
BSWAP
PREFETCHNTA
PREFETCHT0/1/2
PREFETCH
SFENCE
LFENCE
MFENCE
r
i
m
r
m
r16,[m]
r32/64,[m]
r32/64,[m]
r64,[m]
r
m
m
m
ADD, SUB
r,r/i
ADD, SUB
r,m
ADD, SUB
m,r
ADC, SBB
r,r/i
ADC, SBB
r,m
ADC, SBB
m,r/i
CMP
r,r/i
CMP
r,m
INC, DEC, NEG
r
INC, DEC, NEG
m
AAA
AAS
DAA
DAS
AAD
AAM
MUL, IMUL
r8/m8
MUL, IMUL
r16/m16
MUL, IMUL
r32/m32
MUL, IMUL
r64/m64
IMUL
r16,r16/m16
IMUL
r32,r32/m32
IMUL
r64,r64/m64
IMUL
r16,(r16),i
IMUL
r32,(r32),i
IMUL
r64,(r64),i
DIV
r8/m8
2
1
1
3
9
9
1
4
29
9
2
1
1
1
4
1
1
1
1
1
1
4
1
4
1
1
1
1
1
1
1
1
1
1
9
9
12
16
4
33
1
3
2
2
1
1
1
2
1
1
1
3
1
2-4
4
1
1
1
1
6-7
1
1
6
5
10
7
8
5
23
3
3-5
3-4
6-7
3
3
6
4
3
7
27
1
1
2
6
9
1
4
22
8
2
0.5
1
0.5
2
0.5
I0
I0/1
I0
I0/1
I0/1
0.5
1
1
1
~45
1
~45
I0/1
AGU
AGU
AGU
AGU
AGU
AGU
0.5
1
1
1
1
I0/1
0.5
1
0.5
I0/1
23
1
2
1
1
4
3
1
4
27
Any address size

no scale, no offset
w. scale or offset
RIP relative
AMD only
I0/1
I0/1
I0
I0
I0
I0
I0
I0
I0
I0
I0
I0
I0
latency ax=3, dx=5

latency eax=3, edx=4
DIV
DIV
DIV
IDIV
IDIV
IDIV
IDIV
CBW, CWDE, CDQE
CWD, CDQ, CQO
Logic instructions
AND, OR, XOR
AND, OR, XOR
AND, OR, XOR
TEST
TEST
NOT
NOT
SHL, SHR, SAR
ROL, ROR
RCL, RCR
RCL
RCR
RCL
RCR
SHL,SHR,SAR,ROL,
ROR
RCL, RCR
RCL
RCR
RCL
RCR
SHLD, SHRD
SHLD, SHRD
SHLD, SHRD
BT
BT
BT
BTC, BTR, BTS
BTC
BTR, BTS
BTC
BTR, BTS
BSF, BSR
BSF, BSR
POPCNT
LZCNT
SETcc
SETcc
CLC, STC
CMC
CLD
STD
r16/m16
r32/m32
r64/m64
r8/m8
r16/m16
r32/m32
r64/m64
1
1
1
1
1
1
1
1
1
33
49
81
29
37
55
81
1
1
33
49
81
29
37
55
81
I0
I0
I0
I0
I0
I0
I0
I0/1
I0/1
r,r
r,m
m,r
r,r
r,m
r
m
r,i/CL
r,i/CL
r,1
r,i
r,i
r,CL
r,CL
1
1
1
1
1
1
1
1
1
1
9
7
9
9
0.5
1
1
0.5
1
0.5
1
0.5
0.5
1
5
4
5
4
I0/1
m,i /CL
m,1
m,i
m,i
m,CL
m,CL
r,r,i
r,r,cl
m,r,i/CL
r,r/i
m,i
m,r
r,r/i
m,i
m,i
m,r
m,r
r,r
r,m
r,r/m
r,r/m
r
m
1
1
10
9
9
8
6
7
8
1
1
5
2
5
4-5
8
8
11
11
9
8
1
1
1
1
1
2
7
7
1
1
1
1
1
5
4
6
5
18
3
4
18
16
15
6
12
5
1
I0/1
I0/1
I0/1
I0/1
I0/1
1
1
~15
~14
15
15
3
4
15
0.5
1
3
1
15
15
13
15
6
6
5
0.5
1
0.5
0.5
1
2
SSE4.A/SSE4.2
SSE4.A, AMD only
I0/1
I0/1
I0
I0,I1
JMP
short/near
JMP
r
JMP
m(near)
Jcc
short/near
J(E/R)CXZ
short
LOOP
short
CALL
near
CALL
r
CALL
m(near)
RET
RET
i
BOUND
m
INTO
1
1
1
1
2
8
2
2
5
1
4
8
4
2
2
2
1/2 - 2
1-2
4
2
2
2
~3
~4
4
2
String instructions
LODS
REP LODS
STOS
REP STOS
MOVS
REP MOVS
SCAS
REP SCAS
CMPS
REP CMPS
4
5
4
2
7
2
5
6
7
6
~3
~3
2
Other
NOP (90)
Long NOP (0F 1F)
PAUSE
ENTER
ENTER
LEAVE
CPUID
RDTSC
RDPMC
recip. t. = 2 if jump
recip. t. = 2 if jump
values for no jump

values for no jump

best case 6-7 B/clk
5
best case 5 B/clk
3
3
4
3
1
0
1
0
6
i,0
12
a,b
10+6b
2
30-52 70-830
26
14
0.5
0.5
6

I0/1
I0/1
36
34+6b
32 bit mode
87
8

Instruction
Move instructions
FLD
FLD
FLD
FBLD
FST(P)
FST(P)
FSTP
FBSTP
Operands
r
m32/64
m80
m80
r
m32/64
m80
m80

throughput
1
1
7
21
1
1
16
217
2
6
14
30
2
6
19
177
0.5
1
5
35
0.5
1
9
180
Execution
pipe
FP0/1
FP0/1
FP0/1
FP1
Notes
FXCH
FILD
FIST(T)(P)
FLDZ, FLD1
FCMOVcc
FFREE
FINCSTP, FDECSTP
FNSTSW
FNSTSW
FNSTCW
FLDCW
FADD(P),FSUB(R)(P)
FADD(P),FSUB(R)(P)
FIADD,FISUB(R)
FMUL(P)
FMUL(P)
FIMUL
FDIV(R)(P)
FDIV(R)(P)
FIDIV(R)
FABS, FCHS
FCOM(P), FUCOM(P)
FCOM(P), FUCOM(P)
FCOMPP, FUCOMPP
FCOMI(P)
FICOM(P)
FTST
FXAM
FRNDINT
FPREM
FPREM1
Math
FSQRT
FLDPI, etc.
FSIN
FCOS
FSINCOS
FPTAN
FPATAN
FSCALE
FXTRACT
F2XM1
FYL2X
FYL2XP1
Other
FNOP
(F)WAIT
FNCLEX
FNINIT
r
m
m
st0,r
r
AX
m16
m16
m16
r
m
m
r
m
m
r
m
m
r
m
r
m
1
1
1
1
12
1
1
2
2
3
12
1
1
2
1
1
2
1
1
2
1
1
1
1
1
2
1
2
5
1
1
0
9
6
7
1
~20
~20
3
3
5
5
19
1
1
3
3
3
19
19
19
2
1
1
1
2
1
1
2
11
11-16
11-19
1
31
1
4-44 27-105
11-51 51-94
11-75 48-110
~45
~113
9-75 49-163
5
8
7
9
30-56
~60
8
29
12
44
1
1
9
26
1
1
1
1
7
1
1
10
10
2
10
0
0
1
27-105
51-94
48-110
~113
49-163
0.5
0.5
30
78
FP1
FP1
FP1
FP0/1
FP1
FP1
FP1
FP1
FP0
FP1
FP0
FP0
FP0,FP1
FP1
FP1
FP1
FP1
FP1
FP1
FP1
FP0
FP0
FP0
FP0
FP0, FP1
FP0
FP1
FP0, FP1
FP1
FP1
FP1
FP0
FP0, FP1
FP0, FP1
FP0, FP1
FP0, FP1
FP0, FP1
FP0, FP1
FP0, FP1
FP0, FP1
FP0, FP1
FP0, FP1
FP0, FP1
ALU
FP0, FP1
FP0, FP1
FNSAVE
FRSTOR
FXSAVE
FXRSTOR
m
m
m
m
85
80
71
111
163
123
105
118
FP0, FP1
FP0, FP1
FP0, FP1
FP0, FP1

Instruction
Operands
Move instructions
MOVD
MOVD
MOVD
MOVD
MOVD
MOVD
MOVD
r32, mm
mm, r32
mm,m32
r32, xmm
xmm, r32
xmm,m32
m32,(x)mm
1
1
1
1
3
2
1
7
7
5
6
6
5
6
1
3
1
1
3
1
2
FP0
FP0/1
FP0/1
FP0
FP1
FP1
FP1
r64,(x)mm
mm,r64
xmm,r64
mm,mm
xmm,xmm
mm,m64
xmm,m64
m64,(x)mm
xmm,xmm
xmm,m
m,xmm
xmm,m
m,xmm
mm,xmm
xmm,mm
m,mm
m,xmm
1
2
3
1
2
1
2
1
2
2
2
2
2
1
2
1
2
7
7
7
1
1
5
5
6
1
6
6
6-9
6-9
1
1
13
13
1
3
3
0.5
1
1
1
2
1
2
3
2-5.5
3-6
0.5
1
1.5
3
FP0
FP0/1
FP0/1
FP0/1
FP0/1
FP0/1
FP1
FP1
FP0/1
AGU
FP1
AGU
FP1
FP0/1
FP0/1
FP1
FP1
mm,r/m
0.5
FP0/1
xmm,r/m
FP0/1
mm,r/m
0.5
xmm,r/m
xmm,r/m
xmm,r/m
mm,mm
xmm,xmm
xmm,xmm,i
mm,mm,i
xmm,xmm,i
xmm,xmm,i
mm,mm
xmm,xmm
2
2
1
1
6
3
1
2
20
32
64
1
1
1
2
3
2
1
2
19
146-1400
279-3000
1
1
0.5
1
3
2
0.5
2
12
130-1170
260-2300
MOVD (MOVQ)
MOVD (MOVQ)
MOVD (MOVQ)
MOVQ
MOVQ
MOVQ
MOVQ
MOVQ
MOVDQA
MOVDQA
MOVDQA
MOVDQU, LDDQU
MOVDQU
MOVDQ2Q
MOVQ2DQ
MOVNTQ
MOVNTDQ
PACKSSWB/DW
PACKUSWB
PACKSSWB/DW
PACKUSWB
PUNPCKH/LBW/WD/D
Q
PUNPCKH/LBW/WD/D
Q
PUNPCKHQDQ
PUNPCKLQDQ
PSHUFB
PSHUFB
PSHUFD
PSHUFW
PSHUFL/HW
PALIGNR
MASKMOVQ
MASKMOVDQU

throughput
Execution
pipe
FP0, FP1
FP0/1
FP0/1
FP0/1
FP0/1
FP0/1
FP0/1
FP0/1
FP0, FP1
FP0, FP1
Notes
Moves 64 bits.
Name differs
do.
do.
Suppl. SSE3
Suppl. SSE3
Suppl. SSE3
PMOVMSKB
PEXTRW
PINSRW
PINSRW
INSERTQ
INSERTQ
EXTRQ
EXTRQ
PADDB/W/D/Q
PADDSB/W
PADDUSB/W
PSUBB/W/D/Q
PSUBSB/W
PSUBUSB/W
PADDB/W/D/Q
PADDSB/W
PADDUSB/W
PSUBB/W/D/Q
PSUBSB/W
PSUBUSB/W
PHADD/SUBW/SW/D
PHADD/SUBW/SW/D
PCMPEQ/GT B/W/D
PCMPEQ/GT B/W/D
PMULLW PMULHW
PMULHUW
PMULUDQ
PMULLW PMULHW
PMULHUW
PMULUDQ
PMULHRSW
PMULHRSW
PMADDWD
PMADDWD
PMADDUBSW
PMADDUBSW
PAVGB/W
PAVGB/W
PMIN/MAX SW/UB
PMIN/MAX SW/UB
PABSB/W/D
PABSB/W/D
PSIGNB/W/D
PSIGNB/W/D
PSADBW
PSADBW
Logic
PAND PANDN POR
PXOR
r32,(x)mm
r32,(x)mm,i
mm,r32,i
xmm,r32,i
xmm,xmm
xmm,xmm,i,i
xmm,xmm
xmm,xmm,i,i
1
2
2
3
3
3
1
1
8
12
10
10
3-4
3-4
1
2
2
2
6
3
3
1
2
FP0
FP0, FP1
FP0/1
FP0/1
FP0, FP1
FP0, FP1
FP0/1
FP0/1
mm,r/m
0.5
FP0/1
xmm,r/m
mm,r/m
xmm,r/m
mm,r/m
xmm,r/m
2
1
2
1
2
1
1
4
1
1
1
0.5
1
0.5
1
FP0/1
FP0/1
FP0/1
FP0/1
FP0/1
mm,r/m
FP0
xmm,r/m
mm,r/m
xmm,r/m
mm,r/m
xmm,r/m
mm,r/m
xmm,r/m
mm,r/m
xmm,r/m
mm,r/m
xmm,r/m
mm,r/m
xmm,r/m
mm,r/m
xmm,r/m
mm,r/m
xmm,r/m
2
1
2
1
2
1
2
1
2
1
2
1
2
1
2
1
2
2
2
2
2
2
2
2
1
1
1
1
1
1
1
1
2
2
2
1
2
1
2
1
2
0.5
1
0.5
1
0.5
1
0.5
1
2
2
FP0
FP0
FP0
FP0
FP0
FP0
FP0
FP0/1
FP0/1
FP0/1
FP0/1
FP0/1
FP0/1
FP0/1
FP0/1
FP0
FP0, FP1
mm,r/m
0.5
FP0/1
SSE4.A, AMD only

SSE4.A, AMD only
SSE4.A, AMD only
SSE4.A, AMD only
Suppl. SSE3
Suppl. SSE3
Suppl. SSE3
Suppl. SSE3
Suppl. SSE3
Suppl. SSE3
Suppl. SSE3
Suppl. SSE3
Suppl. SSE3
Suppl. SSE3
PAND PANDN POR
PXOR
PSLL/RL W/D/Q
PSRAW/D
PSLL/RL W/D/Q
PSRAW/D
PSLLDQ, PSRLDQ
xmm,r/m
FP0/1
mm,i/mm/m
FP0/1
xmm,i/xmm/m
2
2
1
1
1
1
FP0/1
FP0/1
0.5
FP0/1
xmm,i
Other
EMMS

Instruction
Move instructions
MOVAPS/D
MOVAPS/D
MOVAPS/D
MOVUPS/D
MOVUPS/D
MOVUPS/D
MOVSS/D
MOVSS/D
MOVSS/D
MOVHLPS, MOVLHPS
MOVHPS/D,
MOVLPS/D
MOVHPS/D,
MOVLPS/D
MOVNTPS/D
MOVNTSS/D
MOVDDUP
MOVDDUP
MOVSHDUP,
MOVSLDUP
MOVSHDUP,
MOVSLDUP
MOVMSKPS/D
SHUFPS/D
UNPCK H/L PS/D
Conversion
CVTPS2PD
CVTPD2PS
CVTSD2SS
CVTSS2SD
CVTDQ2PS
CVTDQ2PD
CVT(T)PS2DQ
CVT(T)PD2DQ
CVTPI2PS
Operands

throughput
Execution
pipe
r,r
r,m
m,r
r,r
r,m
m,r
r,r
r,m
m,r
2
2
2
2
2
2
1
2
1
1
6
6
1
6-9
6-9
1
6
5
1
2
3
1
2-6
3-6
0.5
2
2
FP0/1
AGU
FP1
FP0/1
AGU
FP1
FP0/1
FP1
FP1
r,r
0.5
FP0/1
r,m
AGU
m,r
m,r
m,r
r,r
r,m64
1
2
1
2
2
5
12
12
2
7
3
3
2
1
2
FP1
FP1
FP1
FP0/1
FP0/1
r,r
FP0/1
r,m
r32,r
r,r/m,i
r,r/m
2
1
3
2
12
~6
2
1
3
2
2
1
AGU
FP0
FP0/1
FP0/1
r,r/m
r,r/m
r,r/m
r,r/m
r,r/m
r,r/m
r,r/m
r,r/m
xmm,mm
2
4
3
1
2
2
2
4
1
5
5
5
4
4
5
4
6
4
2
3
3
1
4
2
4
3
2
FP1
FP0, FP1
FP0, FP1
FP1
FP1
FP1
FP1
FP0, FP1
FP1
Notes
SSE4.A, AMD only

SSE3
SSE3
CVTPI2PD
CVT(T)PS2PI
CVT(T)PD2PI
CVTSI2SS
CVTSI2SD
CVT(T)SS2SI
CVT(T)SD2SI
xmm,mm
mm,xmm
mm,xmm
xmm,r32
xmm,r32
r32,xmm
r32,xmm
2
1
3
3
2
2
2
5
4
6
12
11
12
11
2
1
2
3
3
1
1
FP1
FP1
FP0, FP1
FP0, FP1
FP1
FP0, FP1
FP0, FP1
r,r/m
r,r/m
r,r/m
1
2
2
3
3
3
1
2
2
FP0
FP0
FP0
r,r/m
r,r/m
r,r/m
r,r/m
r,r/m
r,r/m
r,r/m
r,r/m
r,r/m
r,r/m
r,r/m
r,r/m
r,r/m
r,r/m
r,r/m
2
1
1
2
2
1
2
1
2
1
2
1
2
1
2
3
2
4
2
4
13
38
17
34
3
3
2
2
2
2
2
1
2
2
4
13
38
17
34
1
2
1
2
1
2
FP0
FP1
FP1
FP1
FP1
FP1
FP1
FP1
FP1
FP1
FP1
FP0
FP0
FP0
FP0
r,r/m
FP0
Logic
ANDPS/D ANDNPS/D
ORPS/D XORPS/D
r,r/m
FP0/1
Math
SQRTSS
SQRTPS
SQRTSD
SQRTPD
RSQRTSS
RSQRTPS
r,r/m
r,r/m
r,r/m
r,r/m
r,r/m
r,r/m
1
2
1
2
1
2
14
48
24
48
3
3
14
48
24
48
1
2
FP1
FP1
FP1
FP1
FP1
FP1
Other
LDMXCSR
STMXCSR
m
m
12
3
10
11
FP0, FP1
FP0, FP1
Arithmetic
ADDSS/D SUBSS/D
ADDPS/D SUBPS/D
ADDSUBPS/D
HADDPS/D
HSUBPS/D
MULSS
MULSD
MULPS
MULPD
DIVSS
DIVPS
DIVSD
DIVPD
RCPSS
RCPPS
MAXSS/D MINSS/D
MAXPS/D MINPS/D
CMPccSS/D
CMPccPS/D
COMISS/D
UCOMISS/D
SSE3
SSE3
Jaguar
AMD Jaguar
Instruction: [...] JNE, etc.
Operands: [...] mmx register, xmm = 128 bit xmm register, m = any memory operand including indirect operands, m64 means 64-bit memory operand, etc.
Ops: Number of micro-operations issued from the instruction decoder to the schedulers. Instructions with more than 2 micro-operations are micro-coded.
Latency: This is the delay that the instruction generates in a dependency chain. The numbers are minimum values. Cache misses, misalignment, and exceptions may increase the clock counts considerably. Floating point operands are presumed to be normal numbers. Denormal numbers, NaNs, infinity and exceptions increase the delays. The latencies listed do not include memory operands where the operand is listed as register or memory (r/m).
The clock frequency varies dynamically, which makes it difficult to measure latencies. The values listed are measured after the execution of millions of similar instructions, assuming that this will make the processor boost the clock frequency to the highest possible value.
Reciprocal throughput: This is also called issue latency. This value indicates the average number of clock cycles from when the execution of an instruction begins until a subsequent independent instruction of the same kind can begin to execute. A value of 1/2 indicates that the execution units can handle 2 instructions per clock cycle in one thread. However, the throughput may be limited by other bottlenecks in the pipeline.
Execution pipe: Indicates which execution pipe is used for the micro-operations. I0 means integer pipe 0. I0/1 means integer pipe 0 or 1. FP0 means floating point pipe 0 (ADD). FP1 means floating point pipe 1 (MUL). FP0/1 means either one of the two floating point pipes. Two micro-operations can execute simultaneously if they go to different execution pipes.
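The statement that two micro-operations can execute simultaneously only if they go to different pipes implies a simple lower bound on execution time for a pair of pipes. The following is a minimal sketch under that assumption; the function name and arguments are hypothetical, and the estimate ignores latencies, decoding limits and memory access.

```python
import math

def min_cycles(fixed_pipe0, fixed_pipe1, either):
    """Lower bound on clock cycles for a set of micro-operations
    that share two execution pipes (for example I0 and I1, or FP0 and FP1).

    fixed_pipe0: micro-operations that can only use pipe 0
    fixed_pipe1: micro-operations that can only use pipe 1
    either:      micro-operations that can use either pipe (I0/1 or FP0/1)
    """
    total = fixed_pipe0 + fixed_pipe1 + either
    # Each pipe accepts at most one micro-operation per clock cycle, so the
    # busiest pipe and the average load over both pipes are both lower limits.
    return max(fixed_pipe0, fixed_pipe1, math.ceil(total / 2))

# Four micro-operations tied to pipe 0 plus two flexible ones:
# pipe 0 is the bottleneck even though pipe 1 has spare capacity.
print(min_cycles(4, 0, 2))
```

The bound shows why a sequence dominated by instructions tied to one pipe (for example I0 only) cannot reach the throughput of a sequence where the micro-operations may go to either pipe.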
Instruction
Move instructions
MOV
MOV
Operands

throughput
Execution
pipe
r,r
r,i
1
1
0.5
0.5
I0/1
I0/1
MOV
r8/16,m
AGU
MOV
m,r8/16
AGU
MOV
r32/64,m
AGU
MOV
MOV
MOVNTI
MOVZX, MOVSX
MOVZX, MOVSX
MOVSXD
m,r32/64
m,i
m,r
r,r
r,m
r64,r32
1
1
1
1
1
1
1
1
1
0.5
1
0.5
AGU
AGU
AGU
I0/1
6
1
4
1
Notes
Any addressing
mode
Any addressing
mode
Any addressing
mode
Any addressing
mode
MOVSXD
CMOVcc
CMOVcc
XCHG
XCHG
XCHG
XLAT
PUSH
PUSH
PUSH
PUSH
PUSHF(D/Q)
PUSHA(D)
POP
POP
POP
POPF(D/Q)
POPA(D)
LEA
LEA
LEA
LEA
LAHF
SAHF
SALC
BSWAP
MOVBE
MOVBE
PREFETCHNTA
PREFETCHT0/1/2
PREFETCHW
LFENCE
MFENCE
SFENCE
ADD, SUB
ADD, SUB
ADD, SUB
ADC, SBB
ADC, SBB
ADC, SBB
CMP
CMP
INC, DEC, NEG
INC, DEC, NEG
AAA
AAS
DAA
DAS
AAD
AAM
r64,m32
r,r
r,m
r8,r8
r,r
1
1
1
3
2
3
1
r,m
3
2
1
1
2
2
9
9
1
3
1
29
9
2
1
1
1
4
1
1
1
1
1
1
1
1
1
4
4
16
5
1
1
1
1
1
1
1
1
1
1
9
9
12
16
4
8
r
i
m
SP
r
m
SP
r16,[m]
r32/64,[m]
r32/64,[m]
r64,[m]
r
r,m
m,r
m
m
m
r,r/i
r,m
m,r
r,r/i
r,m
m,r/i
r,r/i
r,m
r
m
2
1
3
1
2
3
1
1
1
6
1
8
1
1
6
5
8
6
8
5
14
1
0.5
1
2
1
I0/1
I0/1
I0/1
Timing depends on
hw
3
1
1
1
1
6
8
1
2
2
18
8
2
0.5
1
0.5
2
0.5
1
0.5
1
1
~100
~100
~100
0.5
~45
~45
I0
I0/1
I0
I0/1
I0/1
I0/1
MOVBE
MOVBE
AGU
AGU
AGU
AGU
AGU
AGU
0.5
1
1
1
1
I0/1
0.5
1
0.5
1
I0/1
13
Any address size

1-2 comp., no scale
3 comp. or scale
RIP relative
I0/1
I0/1
MUL, IMUL
MUL, IMUL
MUL, IMUL
MUL, IMUL
IMUL
IMUL
IMUL
IMUL
IMUL
IMUL
DIV
DIV
DIV
DIV
IDIV
IDIV
IDIV
IDIV
CBW, CWDE, CDQE
CWD, CDQ, CQO
Logic instructions
AND, OR, XOR
AND, OR, XOR
AND, OR, XOR
AND, OR, XOR
ANDN
ANDN
TEST
TEST
TEST
NOT
NOT
SHL, SHR, SAR
ROL, ROR
RCL, RCR
RCL
RCR
RCL
RCR
SHL,SHR,SAR,ROL,
ROR
RCL, RCR
RCL
RCR
RCL
RCR
SHLD, SHRD
SHLD, SHRD
SHLD, SHRD
BT
BT
BT
r8/m8
r16/m16
r32/m32
r64/m64
r16,r16/m16
r32,r32/m32
r64,r64/m64
r16,(r16),i
r32,(r32),i
r64,(r64),i
r8/m8
r16/m16
r32/m32
r64/m64
r8/m8
r16/m16
r32/m32
r64/m64
1
3
2
2
1
1
1
2
1
1
1
2
2
2
1
2
2
2
1
1
3
3
3
6
3
3
6
4
3
6
11-14
12-19
12-27
12-43
11-14
12-19
12-27
12-43
1
1
1
3
2
5
1
1
4
1
1
4
11-14
12-19
12-27
12-43
11-14
12-19
12-27
12-43
I0
I0
I0
I0
I0
I0
I0
I0
I0
I0
I0
I0
I0
I0
I0
I0
I0
I0
I0/1
I0/1
r,i
r,r
r,m
m,r
r,r,r
r,r,m
r,i
r,r
r,m
r
m
r,i/CL
r,i/CL
r,1
r,i
r,i
r,CL
r,CL
1
1
1
1
1
2
1
1
1
1
1
1
1
1
9
7
9
7
1
1
0.5
0.5
1
1
0.5
1
0.5
0.5
1
0.5
1
0.5
0.5
1
5
4
5
4
I0/1
I0/1
m,i /CL
m,1
m,i
m,i
m,CL
m,CL
r,r,i
r,r,cl
m,r,i/CL
r,r/i
m,i
m,r
1
1
10
9
9
8
6
7
8
1
1
5
6
1
1
1
1
6
1
1
1
5
4
5
4
3
4
1
1
11
11
11
11
3
4
11
0.5
1
3
BMI1
BMI1
I0/1
I0/1
I0/1
I0/1
I0/1
I0/1
BTC, BTR, BTS
BTC
BTR, BTS
BTC, BTR, BTS
BSF
BSR
BSF, BSR
POPCNT
LZCNT
TZCNT
BLSI BLSR
BLSI BLSR
BLSMSK
BLSMSK
BEXTR
BEXTR
SETcc
SETcc
CLC, STC
CMC
CLD
STD
r,r/i
m,i
m,i
m,r
r,r
r,r
r,m
r,r/m
r,r
r,r
r,r
r,m
r,r
r,m
r,r,r
r,m,r
r
m
2
5
4
8
7
8
8
1
1
2
2
3
2
3
1
2
1
1
1
1
1
2
4
4
1
1
2
2
2
1
1
1
11
11
11
4
4
4
0.5
0.5
1
1
2
1
2
0.5
1
0.5
1
0.5
1
1
2

JMP
short/near
JMP
r
JMP
m(near)
Jcc
short/near
J(E/R)CXZ
short
LOOP
short
LOOPE LOOPNE
short
CALL
near
CALL
r
CALL
m(near)
RET
RET
i
1
1
1
1
2
8
10
2
2
5
1
4
2
2
2
0.5 - 2
1-2
5
6
2
2
2
3
3
BOUND
4
~5n
4
~2n
2/16B
7
~2n
2/16B
5
~6n
2
~3n
2
~n
1/16B
4
~1.5n
1/16B
3
~3n
INTO
String instructions
LODS
REP LODS
STOS
REP STOS
REP STOS
MOVS
REP MOVS
REP MOVS
SCAS
REP SCAS
SSE4A/SSE4.2
SSE4A/LZCNT
BMI1
BMI1
BMI1
BMI1
BMI1
BMI1
BMI1
I0/1
I0/1
I0
I0,I1
2 if jumping
2 if jumping
values are for no

jump
values are for no
jump
for small n
best case
for small n
best case
CMPS
REP CMPS
Synchronization
LOCK ADD
XADD
LOCK XADD
CMPXCHG
LOCK CMPXCHG
CMPXCHG
LOCK CMPXCHG
CMPXCHG8B
LOCK CMPXCHG8B
CMPXCHG16B
LOCK CMPXCHG16B
Other
NOP (90)
Long NOP (0F 1F)
PAUSE
ENTER
ENTER
LEAVE
CPUID
XGETBV
RDTSC
RDTSCP
RDPMC
CRC32
CRC32
7
~6n
m,r
m,r
m,r
m,r8
m,r8
m,r16/32/64
m,r16/32/64
m64
m64
m128
m128
r,r
r,m
1
4
4
5
5
6
6
18
18
28
28
4
~3n
19
11
16
11
16
11
17
11
19
32
38
1
1
37
i,0
12
a,b
10+6b
2
30-59 70-230
5
34
34
30
3
3
4
0.5
0.5
46
I0/1
I0/1
18
17+3b
3
32 bit mode
5
41
42
27
2
2
rdtscp

Instruction
Move instructions
FLD
FLD
FLD
FBLD
FST(P)
FST(P)
FSTP
FBSTP
FXCH
FILD
FIST(T)(P)
FLDZ, FLD1
FCMOVcc
FFREE
FINCSTP, FDECSTP
FNSTSW
FNSTSW
Operands
r
m32/64
m80
m80
r
m32/64
m80
m80
r
m
m
st0,r
r
AX
m16

throughput
1
1
7
21
1
1
10
217
1
1
1
1
12
1
1
2
2
2
4
9
24
2
3
9
167
0
8
4
7
1
0.5
1
5
29
0.5
1
7
168
1
1
1
1
7
1
1
11
11
Execution
pipe
FP0/1
FP0/1
FP0/1
FP1
FP1
FP1
FP1
FP1
FP0/1
FP1
FP1
FP1
FP1
Notes
FNSTCW
FLDCW
FADD(P),FSUB(R)(P)
FADD(P),FSUB(R)(P)
FIADD,FISUB(R)
FMUL(P)
FMUL(P)
FIMUL
FDIV(R)(P)
FDIV(R)(P)
FIDIV(R)
FABS, FCHS
FCOM(P), FUCOM(P)
FCOM(P), FUCOM(P)
FCOMPP, FUCOMPP
FCOMI(P)
FICOM(P)
FTST
FXAM
FRNDINT
FPREM
FPREM1
m16
m16
3
12
r
m
m
r
m
m
r
m
m
1
1
2
1
1
1
1
1
2
1
1
1
1
1
2
1
2
5
1
1
r
m
r
m
Math
FSQRT
FLDPI, etc.
FSIN
FCOS
FSINCOS
FPTAN
FPATAN
FSCALE
FXTRACT
F2XM1
FYL2X
FYL2XP1
Other
FNOP
(F)WAIT
FNCLEX
FNINIT
FNSAVE
FRSTOR
1
1
4-44
11-51
11-76
11-45
9-75
5
7
8
8-51
61
m
m
1
1
9
27
88
80
22
8
11-54
11-56
35
30-139
38-93
55-122
55-177
44-167
27
9
32-37
30-120
~160
138-150
136
2
9
FP0
FP1
1
1
2
3
3
FP0
FP0
FP0,FP1
FP1
FP1
FP1
FP1
FP1
FP1
FP1
FP0
FP0
FP0
FP0
FP0, FP1
FP0
1FP1
FP0, FP1
FP1
FP1
22
22
22
2
1
1
1
2
1
1
2
4
35
1
30-151
30-120
~160
FP1
FP0
FP0, FP1
FP0, FP1
FP0, FP1
FP0, FP1
FP0, FP1
FP0, FP1
FP0, FP1
FP0, FP1
FP0, FP1
FP0, FP1
0.5
0.5
32
78
138-150
136
FP0/1
ALU
FP0, FP1
FP0, FP1
FP0, FP1
FP0, FP1
55-180
55-177
44-167
6

Instruction
Move instructions
MOVD
Operands
r32, mm

throughput
1
4
Execution
pipe
FP0
Notes
MOVD
MOVD
MOVD
MOVD
MOVD
MOVD
mm, r32
mm,m32
r32, x
x, r32
x,m32
m32,(x)mm
2
1
1
2
1
1
6
4
4
6
4
3
1
1
1
1
1
1
FP0/1
AGU
FP0
FP1
AGU
FP1
MOVD / MOVQ
r64,(x)mm
MOVQ
mm,r64
MOVQ
x,r64
MOVQ
mm,mm
MOVQ
x,x
MOVQ
(x)mm,m64
MOVQ
m64,(x)mm
MOVDQA
x,x
VMOVDQA
y,y
MOVDQA
x,m
VMOVDQA
y,m
MOVDQA
m,x
VMOVDQA
m,y
MOVDQU, LDDQU
x,m
MOVDQU
m,x
MOVDQ2Q
mm,x
MOVQ2DQ
x,mm
MOVNTQ
m,mm
MOVNTDQ
m,x
PACKSSWB/DW
PACKUSWB
mm,r/m
PACKSSWB/DW
PACKUSWB
x,r/m
PUNPCKH/LBW/WD/D
Q
mm,r/m
PUNPCKH/LBW/WD/D
Q
x,r/m
PUNPCKH/LQDQ
x,r/m
PSHUFB
mm,mm
PSHUFB
x,x
PSHUFD
x,x,i
PSHUFW
mm,mm,i
PSHUFL/HW
x,x,i
PALIGNR
x,x,i
PBLENDW
x,r/m
MASKMOVQ
mm,mm
MASKMOVDQU
x,x
PMOVMSKB
r32,(x)mm
PEXTRW
r32,(x)mm,i
PINSRW
mm,r32,i
PINSRB/W/D/Q
x,r,i
PINSRB/W/D/Q
x,m,i
PEXTRB/W/D/Q
r,x,i
PEXTRB/W/D/Q
m,x,i
INSERTQ
x,x
1
2
2
1
1
1
1
1
2
1
2
1
2
1
1
1
1
1
1
4
6
6
1
1
4
3
1
1
4
4
3
3
4
3
1
1
429
429
1
1
1
0.5
0.5
1
1
0.5
1
1
2
1
2
1
1
0.5
0.5
2
2
FP0
FP0/1
FP0/1
FP0/1
FP0/1
AGU
FP1
FP0/1
FP0/1
AGU
AGU
FP1
FP1
AGU
FP1
FP0/1
FP0/1
FP1
FP1
0.5
FP0/1
0.5
FP0/1
0.5
FP0/1
1
1
1
3
1
1
1
1
1
32
64
1
1
2
2
1
1
1
3
2
2
1
4
2
1
1
2
1
432
43-2210
3
4
8
7
0.5
0.5
0.5
2
0.5
0.5
0.5
0.5
0.5
17
34
1
1
1
1
1
1
1
2
FP0/1
FP0/1
FP0/1
FP0/1
FP0/1
FP0/1
FP0/1
FP0/1
FP0/1
FP0, FP1
FP0, FP1
FP0
FP0
FP0/1
FP0/1
FP0/1
FP0
FP1
FP0, FP1
3
2
Moves 64 bits. Name
of instruction differs
do.
do.
AVX
AVX
AVX
Suppl. SSE3
Suppl. SSE3
Suppl. SSE3
SSE4.1
SSE4.1
SSE4.1
SSE4A, AMD only
INSERTQ
EXTRQ
EXTRQ
PMOVSXBW/BD/BQ/
WD/WQ/DQ
PMOVZXBW/BD/BQ/
WD/WQ/DQ
x,x,i,i
x,x
x,x,i,i
3
1
1
2
1
1
2
0.5
0.5
FP0, FP1
FP0/1
FP0/1
SSE4A, AMD only

SSE4A, AMD only
SSE4A, AMD only
x,x
0.5
FP0/1
SSE4.1
x,x
0.5
FP0/1
SSE4.1
1
1
1
1
1
1
1
1
1
2
1
1
1
1
0.5
0.5
0.5
0.5
0.5
0.5
0.5
FP0/1
FP0/1
FP0/1
FP0/1
FP0/1
FP0/1
FP0/1
1
3
1
1
1
1
1
1
1
1
1
3
2
4
2
2
2
2
1
1
1
1
2
4
1
2
1
1
1
1
0.5
0.5
0.5
0.5
0.5
1
FP0
FP0 FP1
FP0
FP0
FP0
FP0
FP0/1
FP0/1
FP0/1
FP0/1
FP0/1
FP0/1
(x)mm,r/m
0.5
FP0/1
mm,i/mm/m
0.5
FP0/1
x,x
0.5
FP0/1
x,i
x,i
x,x/m
1
1
1
1
2
3
0.5
0.5
1
FP0/1
FP0/1
FP0
SSE4.1
x,x,i
x,m,i
x,x,i
x,m,i
9
10
9
10
5
5
9
9
FP0/1
FP0/1
FP0/1
FP0/1
SSE4.2
SSE4.2
SSE4.2
SSE4.2
PADDB/W/D/Q
PADDSB/W
PADDUSB/W
PSUBB/W/D/Q
PSUBSB/W
PSUBUSB/W
(x)mm,r/m
PHADD/SUBW/SW/D
mm,r/m
PHADD/SUBW/SW/D
x,r/m
PCMPEQ/GT B/W/D
mm,r/m
PCMPEQ/GT B/W/D
x,r/m
PCMPEQQ
(x)mm,r/m
PCMPGTQ
(x)mm,r/m
PMULLW PMULHW
PMULHUW
PMULUDQ
(x)mm,r/m
PMULLD
x,r/m
PMULDQ
x,r/m
PMULHRSW
(x)mm,r/m
PMADDWD
(x)mm,r/m
PMADDUBSW
(x)mm,r/m
PAVGB/W
(x)mm,r/m
PMIN/MAX SW/UB
(x)mm,r/m
PABSB/W/D
(x)mm,r/m
PSIGNB/W/D
(x)mm,r/m
PSADBW
(x)mm,r/m
MPSADBW
x,x,i
Logic
PAND PANDN POR
PXOR
PSLL/RL W/D/Q
PSRAW/D
PSLL/RL W/D/Q
PSRAW/D
PSLL/RL W/D/Q
PSRAW/D
PSLLDQ, PSRLDQ
PTEST
String instructions
PCMPESTRI
PCMPESTRI
PCMPESTRM
PCMPESTRM
9
Suppl. SSE3
Suppl. SSE3
SSE4.1
SSE4.2
SSE4.1
SSE4.1
Suppl. SSE3
Suppl. SSE3
Suppl. SSE3
Suppl. SSE3
SSE4.1
PCMPISTRI
PCMPISTRI
PCMPISTRM
PCMPISTRM
Encryption
PCLMULQDQ
AESDEC
AESDECLAST
AESENC
AESENCLAST
AESIMC
AESKEYGENASSIST
x,x,i
x,m,i
x,x,i
x,m,i
3
4
3
4
x,x/m,i
x,x
x,x
x,x
x,x
x,x
x,x,i
1
2
2
2
2
1
1
3
5
5
5
5
2
2
Other
EMMS
2
2
8
2
FP0/1
FP0/1
FP0/1
FP0/1
SSE4.2
SSE4.2
SSE4.2
SSE4.2
1
1
1
1
1
1
1
FP0
FP0/1
FP0/1
FP0/1
FP0/1
FP0/1
FP0/1
PCLMUL
AES
AES
AES
AES
AES
AES
0.5
FP0/1

Instruction
Move instructions
MOVAPS/D
VMOVAPS/D
MOVAPS/D
VMOVAPS/D
MOVAPS/D
VMOVAPS/D
MOVUPS/D
VMOVUPS/D
MOVUPS/D
VMOVUPS/D
MOVUPS/D
VMOVUPS/D
MOVSS/D
MOVSS/D
MOVSS/D
MOVHLPS, MOVLHPS
MOVHPS/D,
MOVLPS/D
MOVHPS/D,
MOVLPS/D
MOVNTPS/D
MOVNTSS/D
MOVDDUP
MOVDDUP
VMOVDDUP
VMOVDDUP
MOVSH/LDUP
MOVSH/LDUP
VMOVSH/LDUP
Operands

throughput
Execution
pipe
x,x
y,y
x,m
y,m
m,x
m,y
x,x
y,y
x,m
y,m
m,x
m,y
x,x
x,m
m,x
1
2
1
2
1
2
1
2
1
2
1
2
1
1
1
1
1
4
4
3
3
1
1
4
4
3
3
1
4
3
0.5
1
1
2
1
2
0.5
1
1
2
1
2
0.5
1
1
FP0/1
FP0/1
AGU
AGU
FP1
FP1
FP0/1
FP0/1
AGU
AGU
FP1
FP1
FP0/1
AGU
FP1
x,x
FP0/1
x,m
FP0/1
m,x
m,x
m,x
x,x
x,m64
y,y
y,m
x,x
x,m
y,y
1
1
1
1
1
2
2
1
1
2
4
429
1
1
1
0.5
1
1
2
0.5
1
1
FP1
FP1
FP1
FP0/1
AGU
FP0/1
AGU
FP0/1
AGU
FP0/1
2
2
1
1
Notes
SSE4A, AMD only

SSE3
SSE3
AVX
AVX
AVX
VMOVSH/LDUP
MOVMSKPS/D
VMOVMSKPS/D
SHUFPS/D
VSHUFPS/D
UNPCK H/L PS/D
VUNPCK H/L PS/D
EXTRACTPS
EXTRACTPS
VEXTRACTF128
VEXTRACTF128
INSERTPS
INSERTPS
VINSERTF128
VINSERTF128
VMASKMOVPS/D
VMASKMOVPS/D
VMASKMOVPS/D
VMASKMOVPS/D
y,m
r32,x
r32,y
x,x/m,i
y,y,y,i
x,x/m
y,y,y
r32,x,i
m32,x,i
x,y,i
m128,y,i
x,x,i
x,m32,i
y,y,x,i
y,y,m128,i
x,x,m128
y,y,m256
m128,x,x
m256,y,y
2
1
1
1
2
1
2
1
1
1
1
1
1
2
2
1
2
19
36
Conversion
CVTPS2PD
VCVTPS2PD
CVTPD2PS
VCVTPD2PS
CVTSD2SS
CVTSS2SD
CVTDQ2PS/PD
VCVTDQ2PS/PD
CVT(T)PS2DQ
VCVT(T)PS2DQ
CVT(T)PD2DQ
VCVT(T)PD2DQ
CVTPI2PS
CVTPI2PD
CVT(T)PS2PI
CVT(T)PD2PI
CVTSI2SS
CVTSI2SD
CVT(T)SS2SI
CVT(T)SD2SI
VCVTPS2PH
VCVTPS2PH
VCVTPH2PS
VCVTPH2PS
x,x/m
y,x/m
x,x/m
x,y
x,x/m
x,x/m
x,x/m
y,y
x,x/m
y,y
x,x/m
y,y
xmm,mm
xmm,mm
mm,xmm
mm,xmm
xmm,r32
xmm,r32
r32,xmm
r32,xmm
x/m,x,i
x/m,y,i
x,x/m
y,x/m
x,x/m
x,x/m
y,y/m
x,x/m
y,y/m
Arithmetic
ADDSS/D SUBSS/D
ADDPS/D SUBPS/D
VADDPS/D VSUBPS/D
ADDSUBPS/D
VADDSUBPS/D
6
1
13
15
15
21
32
2
1
1
0.5
1
0.5
1
1
1
0.5
1
1
1
1
2
1
2
16
22
AGU
FP0
FP0
FP0/1
FP0/1
FP0/1
FP0/1
FP0
FP1
FP0/1
FP1
FP0/1
FP0/1
FP0/1
FP0/1
FP0/1
FP0/1
FP1
FP1
AVX
AVX
>300 clk if mask=0
>300 clk if mask=0
AVX
AVX
1
2
1
3
2
2
1
2
1
2
1
3
1
1
1
1
2
2
2
2
1
3
1
2
3
4
4
6
5
4
4
4
4
4
4
7
4
4
4
4
9
9
8
8
4
6
4
5
1
2
1
2
8
7
1
2
1
2
1
2
1
1
1
1
1
1
1
1
1
2
1
2
FP1
FP1
FP1
FP0, FP1
FP1
FP1
FP1
FP1
FP1
FP1
FP1
FP1
FP1
FP1
FP1
FP1
FP1
FP1
FP1
FP1
FP1
FP0, FP1
FP1
FP1
F16C
F16C
F16C
F16C
1
1
2
1
2
3
3
3
3
3
1
1
2
1
2
FP0
FP0
FP0
FP0
FP0
3
3
2
2
2
2
3
3
1
12
AVX
AVX
AVX
AVX
AVX
AVX
SSE3
HADD/SUBPS/D
VHADD/SUBPS/D
MULSS/PS
VMULPS
MULSD/PD
VMULPD
DIVSS
DIVPS
VDIVPS
DIVSD
DIVPD
VDIVPD
RCPSS
RCPPS
VRCPPS
MAXSS/D MINSS/D
MAXPS/D MINPS/D
VMAXPS/D VMINPS/D
CMPccSS/D
CMPccPS/D
VCMPccPS/D
(U)COMISS/D
ROUNDSS/SD/PS/PD
VROUNDSS/D/PS/D
DPPS
DPPS
VDPPS
VDPPS
DPPD
DPPD
x,x/m
y,y/m
x,x/m
y,y/m
x,x/m
y,y/m
x,x/m
x,x/m
y,y/m
x,x/m
x,x/m
y,y/m
x,x/m
x,x/m
y,y/m
x,x/m
x,x/m
y,y/m
x,x/m
x,x/m
y,y/m
x,x/m
x,x/m,i
y,y/m,i
x,x,i
x,m,i
y,y,y,i
y,m,i
x,x,i
x,m,i
1
2
1
2
1
2
1
1
2
1
1
2
1
1
2
1
1
2
1
1
2
1
1
2
5
6
10
12
3
4
4
4
2
2
4
4
14
19
38
19
19
38
2
2
2
2
2
2
2
2
2
Logic
ANDPS/D ANDNPS/D
ORPS/D XORPS/D
VANDPS/D, etc.
x,x/m
y,y/m
1
2
Math
SQRTSS
SQRTPS
VSQRTPS
SQRTSD
SQRTPD
VSQRTPD
RSQRTSS/PS
VRSQRTPS
x,x/m
x,x/m
y,y/m
x,x/m
x,x/m
y,y/m
x,x/m
y,y/m
m
m
Other
LDMXCSR
STMXCSR
VZEROUPPER
VZEROUPPER
VZEROALL
1
2
1
2
2
2
14
19
38
19
19
38
1
1
2
1
1
2
1
1
2
1
1
2
4
4
7
7
3
3
FP0
FP0
FP1
FP1
FP1
FP1
FP1
FP1
FP1
FP1
FP1
FP1
FP1
FP1
FP1
FP0
FP0
FP0
FP0
FP0
FP0
FP0
FP1
FP1
FP0, FP1
FP0, FP1
FP0, FP1
FP0, FP1
FP0, FP1
FP0, FP1
1
1
0.5
1
FP0/1
FP0/1
1
2
2
1
2
2
1
2
16
21
42
27
27
54
2
2
16
21
42
27
27
54
1
2
FP1
FP1
FP1
FP1
FP1
FP1
FP1
FP1
12
3
21
37
41
9
13
8
12
30
46
58
FP0, FP1
FP0, FP1
4
4
11
12
9
SSE3
SSE4.1
SSE4.1
SSE4.1
SSE4.1
SSE4.1
SSE4.1
32 bit mode
64 bit mode
32 bit mode
VZEROALL
FXSAVE
FXSAVE
FXRSTOR
FXRSTOR
XSAVE
XSAVE
XRSTOR
XRSTOR
73
66
58
115
123
130
114
219
251
66
58
189
198
145
129
342
375
90
66
58
189
197
145
129
342
375
64 bit mode
32 bit mode
64 bit mode
32 bit mode
64 bit mode
32 bit mode
64 bit mode
32 bit mode
64 bit mode
Intel Pentium
Intel Pentium and Pentium MMX

List of instruction timings
Operands: r = register, accum = al, ax or eax, m = memory, i = immediate data, sr = segment register, m32 = 32 bit memory operand, etc.
Clock cycles: The numbers are minimum values. Cache misses, misalignment, and exceptions may increase the clock counts considerably.
Pairability: u = pairable in u-pipe, v = pairable in v-pipe, uv = pairable in either pipe, np = not pairable.
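The pairability codes can be read as a simple rule for two consecutive instructions. The sketch below assumes that the first instruction of a pair executes in the u-pipe and the second in the v-pipe; the function name is hypothetical. It checks only the pairability column and ignores the further pairing conditions (such as dependencies between the two instructions) that are described in manual 3.

```python
def can_pair(first, second):
    """Return True if two consecutive instructions may pair, judged
    only by their pairability codes ("u", "v", "uv" or "np").

    Assumption: the first instruction goes to the u-pipe and the
    second one to the v-pipe.
    """
    return first in ("u", "uv") and second in ("v", "uv")

# ADC (u) followed by MOV (uv) may pair; the reverse order may not,
# because ADC cannot execute in the v-pipe.
print(can_pair("u", "uv"), can_pair("uv", "u"))
```

An instruction marked np always executes alone, regardless of its neighbors.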
Integer instructions (Pentium and Pentium MMX)

Instruction
NOP
MOV
MOV
MOV
MOV
XCHG
XCHG
XCHG
XLAT
PUSH
POP
PUSH
POP
PUSH
POP
PUSHF
POPF
PUSHA POPA
PUSHAD POPAD
LAHF SAHF
MOVSX MOVZX
LEA
LDS LES LFS LGS LSS
ADD SUB AND OR XOR
ADD SUB AND OR XOR
ADD SUB AND OR XOR
ADC SBB
ADC SBB
ADC SBB
CMP
CMP
TEST
TEST
TEST
TEST
INC DEC
Operands
r/m, r/m/i
r/m, sr
sr , r/m
m , accum
(E)AX, r
r,r
r,m
r/i
r
m
m
sr
sr
r , r/m
r,m
m
r , r/i
r,m
m , r/i
r , r/i
r,m
m , r/i
r , r/i
m , r/i
r,r
m,r
r,i
m,i
r
Clock cycles Pairability

1
uv
1
uv
1
np
>= 2 b)
np
1
uv h)
2
np
3
np
>15
np
4
np
1
uv
1
uv
2
np
3
np
1 b)
np
>= 3 b)
np
3-5
np
4-6
np
5-9 i)
np
5
np
2
np
3 a)
np
1
uv
4 c)
np
1
uv
2
uv
3
uv
1
u
2
u
3
u
1
uv
2
uv
1
uv
2
uv
1
f)
2
np
1
uv
INC DEC
NEG NOT
MUL IMUL
MUL IMUL
DIV
DIV
DIV
IDIV
IDIV
IDIV
CBW CWDE
CWD CDQ
SHR SHL SAR SAL
SHR SHL SAR SAL
SHR SHL SAR SAL
ROR ROL RCR RCL
ROR ROL
ROR ROL
RCR RCL
RCR RCL
SHLD SHRD
SHLD SHRD
BT
BT
BT
BTR BTS BTC
BTR BTS BTC
BTR BTS BTC
BSF BSR
SETcc
JMP CALL
JMP CALL
conditional jump
CALL JMP
RETN
RETN
RETF
RETF
J(E)CXZ
LOOP
BOUND
CLC STC CMC CLD STD
CLI STI
LODS
REP LODS
STOS
REP STOS
MOVS
REP MOVS
SCAS
m
r/m
r8/r16/m8/m16
all other versions
r8/m8
r16/m16
r32/m32
r8/m8
r16/m16
r32/m32
r,i
m,i
r/m, CL
r/m, 1
r/m, i(><1)
r/m, CL
r/m, i(><1)
r/m, CL
r, i/CL
m, i/CL
r, r/i
m, i
m, i
r, r/i
m, i
m, r
r , r/m
r/m
short/near
far
short/near
r/m
i
i
short
short
r,m
3
1/3
11
9 d)
17
25
41
22
30
46
3
2
1
3
4/5
1/3
1/3
4/5
8/10
7/9
4 a)
5 a)
4 a)
4 a)
9 a)
7 a)
8 a)
14 a)
7-73 a)
1/2 a)
1 e)
>= 3 e)
1/4/5/6 e)
2/5 e)
2/5 e)
3/6 e)
4/7 e)
5/8 e)
4-11 e)
5-10 e)
8
2
6-9
2
7+3*n g)
3
10+n g)
4
12+n g)
4
uv
np
np
np
np
np
np
np
np
np
np
np
u
u
np
u
np
np
np
np
np
np
np
np
np
np
np
np
np
np
v
np
v
np
np
np
np
np
np
np
np
np
np
np
np
np
np
np
np
np
Instruction      Clock cycles   Pairability
REP(N)E SCAS     9+4*n g)       np
CMPS             5              np
REP(N)E CMPS     8+4*n g)       np
BSWAP            1 a)           np
CPUID            13-16 a)       np
RDTSC            6-13 a) j)     np

Notes:
a) This instruction has a 0FH prefix which takes one clock cycle extra to decode on a P1 unless preceded by a multi-cycle instruction.
b) Versions with FS and GS have a 0FH prefix. See note a.
c) Versions with SS, FS, and GS have a 0FH prefix. See note a.
d) Versions with two operands and no immediate have a 0FH prefix. See note a.
e) High values are for mispredicted jumps/branches.
f) Only pairable if register is AL, AX or EAX.
g) Add one clock cycle for decoding the repeat prefix unless preceded by a multi-cycle instruction (such as CLD).
h) Pairs as if it were writing to the accumulator.
i) 9 if SP divisible by 4 (imperfect pairing).
j) On P1: 6 in privileged or real mode; 11 in non-privileged; error in virtual mode. On PMMX: 8 and 13 clocks respectively.
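As a worked example of how the clock formulas combine with note g, the sketch below evaluates the listed counts for REP(N)E SCAS (9+4*n) and REP(N)E CMPS (8+4*n). It assumes that n is the number of repetitions, which the table itself does not spell out; the function names and the flag are hypothetical.

```python
def repne_scas_clocks(n, prefix_decode_penalty=True):
    """Minimum clock count for REP(N)E SCAS with n repetitions (assumed
    meaning of n). Note g adds one clock for decoding the repeat prefix
    unless the instruction is preceded by a multi-cycle instruction."""
    return 9 + 4 * n + (1 if prefix_decode_penalty else 0)

def repne_cmps_clocks(n, prefix_decode_penalty=True):
    """Same estimate for REP(N)E CMPS, listed as 8+4*n."""
    return 8 + 4 * n + (1 if prefix_decode_penalty else 0)

# Scanning 10 elements, with and without the prefix decoding penalty.
print(repne_scas_clocks(10), repne_scas_clocks(10, prefix_decode_penalty=False))
```

These are minimum values; cache misses and misalignment may add considerably to the counts.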
Floating point instructions (Pentium and Pentium MMX)

Explanation of column headings
Operands
Clock cycles
Pairability
i-ov
fp-ov
Instruction
FLD
FLD
FBLD
FST(P)
FST(P)
FST(P)
FBSTP
FILD
FIST(P)
FLDZ FLD1
FLDPI FLDL2E etc.
FNSTSW
FLDCW
FNSTCW
Operands: r = register, m = memory, m32 = 32-bit memory operand, etc.
Clock cycles: The numbers are minimum values. Cache misses, misalignment, denormal operands, and exceptions may increase the clock counts considerably.
Pairability: + = pairable with FXCH, np = not pairable with FXCH.
i-ov: Overlap with integer instructions. i-ov = 4 means that the last four clock cycles can overlap with subsequent integer instructions.
fp-ov: Overlap with floating point instructions. fp-ov = 2 means that the last two clock cycles can overlap with subsequent floating point instructions. (WAIT is considered a floating point instruction here.)
Operand
r/m32/m64
m80
m80
r
m32/m64
m80
m80
m
m
AX/m16
m16
m16
Clock cycles Pairability

1
0
3
np
48-58
np
1
np
2 m)
np
3 m)
np
148-154
np
3
np
6
np
2
np
5 s)
np
6 q)
np
8
np
2
np
i-ov
0
0
0
0
0
0
0
2
0
0
2
0
0
0
fp-ov
0
0
0
0
0
0
0
2
0
0
2
0
0
0
FADD(P)
FSUB(R)(P)
FMUL(P)
FDIV(R)(P)
FCHS FABS
FCOM(P)(P) FUCOM
FIADD FISUB(R)
FIMUL
FIDIV(R)
FICOM
FTST
FXAM
FPREM
FPREM1
FRNDINT
FSCALE
FXTRACT
FSQRT
FSIN FCOS
FSINCOS
F2XM1
FYL2X
FYL2XP1
FPTAN
FPATAN
FNOP
FXCH
FINCSTP FDECSTP
FFREE
FNCLEX
FNINIT
FNSAVE
FRSTOR
WAIT
Notes:
m
n
o
p
q
r
s
r/m
r/m
r/m
r/m
r/m
m
m
m
m
r
r
m
m
3
3
3
19/33/39 p)
1
1
6
6
22/36/42 p)
4
1
17-21
16-64
20-70
9-20
20-32
12-66
70
65-100 r)
89-112 r)
53-59 r)
103 r)
105 r)
120-147 r)
112-134 r)
1
1
2
2
6-9
12-22
124-300
70-95
1
0
0
0
0
0
0
np
np
np
np
np
np
np
np
np
np
np
np
np
np
np
np
np
np
np
np
np
np
np
np
np
np
np
np
2
2
2
38 o)
0
0
2
2
38 o)
0
0
4
2
2
0
5
0
69 o)
2
2
2
2
2
36 o)
2
0
0
0
0
0
0
0
0
0
2
2
2 n)
2
0
0
2
2
2
0
0
0
2
2
0
0
0
2
2
2
2
2
2
0
2
0
0
0
0
0
0
0
0
0
m) The value to store is needed one clock cycle in advance.
n) 1 if the overlapping instruction is also an FMUL.
o) Cannot overlap integer multiplication instructions.
p) FDIV takes 19, 33, or 39 clock cycles for 24, 53, and 64 bit precision respectively. FIDIV takes 3 clocks more. The precision is defined by bit 8-9 of the floating point control word.
q) The first 4 clock cycles can overlap with preceding integer instructions.
r) Clock counts are typical. Trivial cases may be faster, extreme cases may be slower.
s) May be up to 3 clocks more when output needed for FST, FCHS, or FABS.
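The dependence of the FDIV timing on the precision setting can be sketched as a lookup on bits 8-9 of the floating point control word. The encoding of these two bits (00 = 24 bit, 10 = 53 bit, 11 = 64 bit, 01 reserved) comes from the x87 architecture rather than from this table, and the function name is hypothetical.

```python
# Precision control field (bits 8-9 of the x87 control word) -> FDIV clocks.
# The clock counts are those listed above for 24, 53 and 64 bit precision.
FDIV_CLOCKS = {0b00: 19, 0b10: 33, 0b11: 39}

def fdiv_clocks(control_word, fidiv=False):
    """Return the listed FDIV clock count for a given control word.
    FIDIV takes 3 clocks more."""
    precision_bits = (control_word >> 8) & 0b11
    if precision_bits not in FDIV_CLOCKS:
        raise ValueError("reserved precision control value")
    return FDIV_CLOCKS[precision_bits] + (3 if fidiv else 0)

# 0x037F selects 64 bit precision, 0x027F selects 53 bit precision.
print(fdiv_clocks(0x037F), fdiv_clocks(0x027F), fdiv_clocks(0x027F, fidiv=True))
```

Lowering the precision in the control word is therefore one way to shorten divisions when full precision is not needed.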
MMX instructions (Pentium MMX)
A list of MMX instruction timings is not needed because they all take one clock cycle, except the MMX
multiply instructions which take 3. MMX multiply instructions can be pipelined to yield a throughput of one
multiplication per clock cycle.
The EMMS instruction takes only one clock cycle, but the first floating point instruction after an EMMS
takes approximately 58 clocks extra, and the first MMX instruction after a floating point instruction takes
approximately 38 clocks extra. There is no penalty for an MMX instruction after EMMS on the PMMX.
There is no penalty for using a memory operand in an MMX instruction because the MMX arithmetic unit
is one step later in the pipeline than the load unit. But the penalty comes when you store data from an
MMX register to memory or to a 32-bit register: The data have to be ready one clock cycle in advance.
This is analogous to the floating point store instructions.
All MMX instructions except EMMS are pairable in either pipe. Pairing rules for MMX instructions are described in manual 3: "The microarchitecture of Intel, AMD and VIA CPUs".
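The timing rules in this section can be turned into a rough estimate. The sketch below assumes independent, fully pipelined MMX multiplications and uses the approximate switching penalties quoted above (58 and 38 clocks); the function names are hypothetical and the result is only an order-of-magnitude guide.

```python
def mmx_multiply_clocks(n, latency=3):
    """Clock cycles for n independent MMX multiplications when one
    multiplication can start every clock cycle and each takes 3 clocks."""
    if n == 0:
        return 0
    return latency + (n - 1)

def switch_penalty(fp_after_emms=0, mmx_after_fp=0):
    """Approximate extra clocks caused by switching between MMX and
    floating point code: about 58 for the first floating point instruction
    after EMMS, about 38 for the first MMX instruction after floating point."""
    return 58 * fp_after_emms + 38 * mmx_after_fp

# Eight pipelined multiplications, then one round trip to floating point code.
print(mmx_multiply_clocks(8), switch_penalty(fp_after_emms=1, mmx_after_fp=1))
```

The comparison suggests that frequent switching between MMX and floating point code can cost far more than the MMX arithmetic itself.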
Pentium II and III
Intel Pentium II and Pentium III

List of instruction timings and op breakdown
Operands: i = immediate data, r = register, mm = 64 bit mmx register, xmm = 128 bit xmm register, sr = segment register, m = memory, m32 = 32-bit memory operand, etc.
ops: The number of ops that the instruction generates for each execution port.
p0: Port 0: ALU, etc.
p1: Port 1: ALU, jumps
p01: Instructions that can go to either port 0 or 1, whichever is vacant first.
p2: Port 2: load data, etc.
p3: Port 3: address generation for store
p4: Port 4: store data
Latency: This is the delay that the instruction generates in a dependency chain. (This is not the same as the time spent in the execution unit. Values may be inaccurate in situations where they cannot be measured exactly, especially with memory operands.) The numbers are minimum values. Cache misses, misalignment, and exceptions may increase the clock counts considerably. Floating point operands are presumed to be normal numbers. Denormal numbers, NaNs and infinity increase the delays by 50-150 clocks, except in XMM move, shuffle and Boolean instructions. Floating point overflow, underflow, denormal or NaN results give a similar delay.
Reciprocal throughput: The average number of clock cycles per instruction for a series of independent instructions of the same kind.
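The per-port op counts can be combined into a rough lower bound on the execution time of a block of independent instructions. This is a minimal sketch, assuming that each port accepts one op per clock cycle and that p01 ops are distributed freely between ports 0 and 1; the function name is hypothetical, and decoding, retirement and dependency limits are ignored.

```python
def port_bound(p0=0, p1=0, p01=0, p2=0, p3=0, p4=0):
    """Lower bound on clock cycles from the op counts per execution port.

    p0, p1: ops that must go to port 0 or port 1 respectively
    p01:    ops that can go to either port 0 or port 1
    p2:     load ops, p3: store address ops, p4: store data ops
    """
    # Ports 0 and 1 share the p01 ops, so the busier of the two fixed
    # loads and the average load over both ports are both lower limits.
    alu = max(p0, p1, (p0 + p1 + p01) / 2)
    return max(alu, p2, p3, p4)

# Four ALU ops that may use either port plus one load: the two ALU ports
# are the bottleneck.
print(port_bound(p01=4, p2=1))
```

Summing the columns of the tables over a loop body and passing the totals to such a function gives a quick indication of which port limits the throughput.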
Integer instructions (Pentium Pro, Pentium II and Pentium III)

Instruction
Operands
p0
MOV
MOV
MOV
MOV
MOV
MOV
MOV
MOVSX MOVZX
MOVSX MOVZX
CMOVcc
CMOVcc
XCHG
XCHG
XLAT
PUSH
POP
POP
PUSH
POP
PUSH
POP
PUSHF(D)
r,r/i
r,m
m,r/i
r,sr
m,sr
sr,r
sr,m
r,r
r,m
r,r
r,m
r,r
r,m
p1
ops
p01 p2 p3
1
1
1
1
1
1
8
7
Latency
p4
1
1
5
8
1
1
1
1
1
r/i
r
(E)SP
m
m
sr
sr
3
1
1
3
4
1
1
1
2
1
5
2
8
11
1
1
1
1
1
1
1
1
1
1
1
1
1
high b)
Reciprocal
throughput
POPF(D)
PUSHA(D)
POPA(D)
LAHF SAHF
LEA
LDS LES LFS LGS
LSS
ADD SUB AND OR XOR
ADD SUB AND OR XOR
ADD SUB AND OR XOR
ADC SBB
ADC SBB
ADC SBB
CMP TEST
CMP TEST
INC DEC NEG NOT
INC DEC NEG NOT
AAA AAS DAA DAS
AAD
AAM
IMUL
IMUL
DIV IDIV
DIV IDIV
DIV IDIV
DIV IDIV
DIV IDIV
DIV IDIV
CBW CWDE
CWD CDQ
SHR SHL SAR ROR
ROL
SHR SHL SAR ROR
ROL
RCR RCL
RCR RCL
RCR RCL
RCR RCL
RCR RCL
RCR RCL
SHLD SHRD
SHLD SHRD
BT
BT
BTR BTS BTC
BTR BTS BTC
BSF BSR
BSF BSR
SETcc
SETcc
10
r,m
6
2
2
1
1
8
1 c)
m
r,r/i
r,m
m,r/i
r,r/i
r,m
m,r/i
r,r/i
m,r/i
r
m
8
1
1
1
2
2
3
1
1
1
1
3
1
1
1
1
1
1
r,(r),(i)
(r),m
r8
r16
r32
m8
m16
m32
1
1
1
1
2
3
3
2
2
2
2
2
4
15
4
4
19
23
39
19
23
39
1
1
1
1
1
1
1
1
1
1
1
1
r,i/CL
m,i/CL
r,1
r8,i/CL
r16/32,i/CL
m,1
m8,i/CL
m16/32,i/CL
r,r,i/CL
m,r,i/CL
r,r/i
m,r/i
r,r/i
m,r/i
r,r
r,m
r
m
1
1
4
3
1
4
4
2
2
1
4
3
2
3
2
1
1
1
1
1
1
6
1
6
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
12
21
37
12
21
37
JMP
JMP
JMP
JMP
JMP
conditional jump
CALL
CALL
CALL
CALL
CALL
RETN
RETN
RETF
RETF
J(E)CXZ
LOOP
LOOP(N)E
ENTER
ENTER
LEAVE
BOUND
CLC STC CMC
CLD STD
CLI
STI
INTO
LODS
REP LODS
STOS
REP STOS
MOVS
REP MOVS
SCAS
REP(N)E SCAS
CMPS
REP(N)E CMPS
BSWAP
NOP (90)
Long NOP (0F 1F)
CPUID
RDTSC
IN
OUT
PREFETCHNTA d)
PREFETCHT0/1/2 d)
SFENCE d)
Notes
a)
short/near
far
r
m(near)
m(far)
short/near
near
far
r
m(near)
m(far)
1
1
1
1
1
1
r,m
2
2
1
2
21
1
28
1
1
1
2
4
1
1
2
3
28
i
i
short
short
short
i,0
a,b
21
23
23
1
8
8
12
18 +4b
2
6
1
4
1
2
1
1
3
3
1
2
1
1
2
2
2
1
2
1
1
2
2
2
2
2
1
1
1
2
2
ca.
7
1
b-1
1
2b
1
2
9
17
5
2
10+6n
1
a)
3
a)
2
ca. 5n
1
ca. 6n
12+7n
12+9n
r
1
1
1
0,5
1
23-48
31
18
18
m
m
>300
>300
1
1
1
a) Faster under certain conditions: see manual 3: "The microarchitecture of Intel, AMD and VIA CPUs".
b) Has an implicit LOCK prefix.
c) 3 if constant without base or index register.
d) P3 only.
Floating point x87 instructions (Pentium Pro, II and III)

Instruction
Operands
p0
FLD
FLD
FLD
FBLD
FST(P)
FST(P)
FSTP
FBSTP
FXCH
FILD
FIST(P)
FLDZ
FLD1 FLDPI FLDL2E etc.
FCMOVcc
FNSTSW
FNSTSW
FLDCW
FNSTCW
FADD(P) FSUB(R)(P)
FADD(P) FSUB(R)(P)
FMUL(P)
FMUL(P)
FDIV(R)(P)
FDIV(R)(P)
FABS
FCHS
FCOM(P) FUCOM
FCOM(P) FUCOM
FCOMPP FUCOMPP
FCOMI(P) FUCOMI(P)
FCOMI(P) FUCOMI(P)
FIADD FISUB(R)
FIMUL
FIDIV(R)
FICOM(P)
FTST
FXAM
FPREM
FPREM1
FRNDINT
FSCALE
FXTRACT
r
m32/64
m80
m80
r
m32/m64
m80
m80
r
m
m
r
AX
m16
m16
m16
r
m
r
m
r
m
r
m
r
m
m
m
m
m
p1
ops
p01 p2 p3
Latency
p4
Reciprocal
throughput
1
1
2
2
2
38
1
1
2
2
2
165
3
2
1
2
2
3
1
1
1
1
1
1
1
1
1
1
3
1
1
1
1
1
6
6
6
6
1
1
23
33
30
56
15
1
2
2
1
1
0
5
5
f)
2
7
1
1
10
1
1
1
1
1
1
1
1
1
1
1
1
3
3-4
5
5-6
38 h)
38 h)
2
1
1
1
1
1
1
2
1
1
2 g)
2 g)
37
37
FSQRT
FSIN FCOS
FSINCOS
F2XM1
FYL2X
FYL2XP1
FPTAN
FPATAN
FNOP
FINCSTP FDECSTP
FFREE
FFREEP
FNCLEX
FNINIT
FNSAVE
FRSTOR
WAIT
Notes:
e)
f)
g)
h)
i)
r
r
1
17-97
18-110
17-48
36-54
31-53
21-102
25-86
1
1
1
2
27-103
29-130
66
103
98-107
13-143
44-143
69
e)
e)
e)
e)
e)
e)
e)
e,i)
3
13
141
72
2
e) Not pipelined.
f) FXCH generates 1 μop that is resolved by register renaming without going to any port.
g) FMUL uses the same circuitry as integer multiplication. Therefore, the combined throughput of mixed floating point and integer multiplications is 1 FMUL + 1 IMUL per 3 clock cycles.
h) FDIV latency depends on the precision specified in the control word: 64 bits precision gives latency 38, 53 bits precision gives latency 32, 24 bits precision gives latency 18. Division by a power of 2 takes 9 clocks. Throughput is 1/(latency-1), i.e. the reciprocal throughput is latency-1 clocks.
i) Faster for lower precision.
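The FDIV note can be turned into a small lookup. This is a minimal sketch: the function names are hypothetical, and it assumes that the throughput rule in the note means one division per latency-1 clocks. The power-of-2 case is given only as a latency, because the note does not state a throughput for it.

```python
# FDIV latency on Pentium Pro/II/III by precision-control setting,
# taken from the FDIV note above (precision bits -> clocks).
FDIV_LATENCY = {64: 38, 53: 32, 24: 18}
POWER_OF_2_LATENCY = 9


def fdiv_latency(precision_bits, divisor_is_power_of_2=False):
    """Latency in clocks for FDIV at the given precision setting."""
    if divisor_is_power_of_2:
        return POWER_OF_2_LATENCY
    return FDIV_LATENCY[precision_bits]


def fdiv_reciprocal_throughput(precision_bits):
    """Clocks per division for a series of independent FDIVs.

    Assumption: throughput is 1/(latency-1), so the reciprocal
    throughput is latency-1 clocks.
    """
    return FDIV_LATENCY[precision_bits] - 1


for bits in (64, 53, 24):
    print(bits, fdiv_latency(bits), fdiv_reciprocal_throughput(bits))
```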
Integer MMX instructions (Pentium II and Pentium III)

Instruction
Operands
p0
MOVD MOVQ
MOVD MOVQ
MOVD MOVQ
PADD PSUB PCMP
PADD PSUB PCMP
PMUL PMADD
PMUL PMADD
PAND(N) POR PXOR
PAND(N) POR PXOR
PSRA PSRL PSLL
PSRA PSRL PSLL
PACK PUNPCK
PACK PUNPCK
EMMS
MASKMOVQ d)
PMOVMSKB d)
MOVNTQ d)
r,r
mm,m32/64
m32/64,mm
mm,mm
mm,m64
mm,mm
mm,m64
mm,mm
mm,m64
mm,mm/i
mm,m64
mm,mm
mm,m64
p1
ops
p01 p2 p3
1
1
1
1
1
1
1
1
Latency
p4
1
1
1
3
3
1
1
1
1
1
1
1
1
1
1
1
1
1
11
mm,mm
r32,mm
m64,mm
6 k)
2-8
1
Reciprocal
throughput
0,5
1
1
0,5
1
1
1
0,5
1
1
1
1
1
2 - 30
1
1 - 30
PSHUFW d)
PSHUFW d)
PEXTRW d)
PINSRW d)
PINSRW d)
PAVGB PAVGW d)
PAVGB PAVGW d)
PMIN/MAXUB/SW d)
PMIN/MAXUB/SW d)
PMULHUW d)
PMULHUW d)
PSADBW d)
PSADBW d)
Notes:
d)
k)
mm,mm,i
mm,m64,i
r32,mm,i
mm,r32,i
mm,m16,i
mm,mm
mm,m64
mm,mm
mm,m64
mm,mm
mm,m64
mm,mm
mm,m64
1
1
1
1
1
1
2
2
1
2
1
2
1
2
3
4
5
6
1
1
1
1
1
1
1
1
1
2
2
1
1
1
1
1
1
1
1
1
1
0,5
1
0,5
1
1
1
2
2
d) P3 only.
k) The delay can be hidden by inserting other instructions between EMMS and any subsequent floating point instruction.
Floating point XMM instructions (Pentium III)

Instruction
Operands
p0
MOVAPS
MOVAPS
MOVAPS
MOVUPS
MOVUPS
MOVSS
MOVSS
MOVSS
MOVHPS MOVLPS
MOVHPS MOVLPS
MOVLHPS MOVHLPS
MOVMSKPS
MOVNTPS
CVTPI2PS
CVTPI2PS
CVT(T)PS2PI
CVTPS2PI
CVTSI2SS
CVTSI2SS
CVT(T)SS2SI
CVTSS2SI
ADDPS SUBPS
ADDPS SUBPS
ADDSS SUBSS
ADDSS SUBSS
MULPS
MULPS
xmm,xmm
xmm,m128
m128,xmm
xmm,m128
m128,xmm
xmm,xmm
xmm,m32
m32,xmm
xmm,m64
m64,xmm
xmm,xmm
r32,xmm
m128,xmm
xmm,mm
xmm,m64
mm,xmm
mm,m128
xmm,r32
xmm,m32
r32,xmm
r32,m128
xmm,xmm
xmm,m128
xmm,xmm
xmm,m32
xmm,xmm
xmm,m128
p1
ops
p01 p2 p3
2
2
2
4
4
1
1
1
1
1
1
1
Latency
p4
2
4
1
1
1
2
2
2
2
1
2
2
1
1
2
2
1
1
2
2
1
2
1
2
1
2
2
1
2
1
2
3
2
3
1
1
1
1
1
1
1
2
3
4
3
4
4
5
3
4
3
3
3
3
4
4
Reciprocal
throughput
1
2
2
4
4
1
1
1
1
1
1
1
2 - 15
1
2
1
1
2
2
1
2
2
2
1
1
2
2
MULSS
MULSS
DIVPS
DIVPS
DIVSS
DIVSS
AND(N)PS ORPS XORPS
AND(N)PS ORPS XORPS
MAXPS MINPS
MAXPS MINPS
MAXSS MINSS
MAXSS MINSS
CMPccPS
CMPccPS
CMPccSS
CMPccSS
COMISS UCOMISS
COMISS UCOMISS
SQRTPS
SQRTPS
SQRTSS
SQRTSS
RSQRTPS
RSQRTPS
RSQRTSS
RSQRTSS
RCPPS
RCPPS
RCPSS
RCPSS
SHUFPS
SHUFPS
UNPCKHPS UNPCKLPS
UNPCKHPS UNPCKLPS
LDMXCSR
STMXCSR
FXSAVE
FXRSTOR
xmm,xmm
xmm,m32
xmm,xmm
xmm,m128
xmm,xmm
xmm,m32
xmm,xmm
xmm,m128
xmm,xmm
xmm,m128
xmm,xmm
xmm,m32
xmm,xmm
xmm,m128
xmm,xmm
xmm,m32
xmm,xmm
xmm,m32
xmm,xmm
xmm,m128
xmm,xmm
xmm,m32
xmm,xmm
xmm,m128
xmm,xmm
xmm,m32
xmm,xmm
xmm,m128
xmm,xmm
xmm,m32
xmm,xmm,i
xmm,m128,i
xmm,xmm
xmm,m128
m32
m32
m4096
m4096
1
1
2
2
1
1
1
2
1
2
2
2
2
1
1
2
2
1
1
1
1
2
2
1
2
1
1
2
2
2
2
2
2
1
1
2
2
1
1
2
1
2
1
2
1
2
2
2
2
1
2
2
11
6
116
89
4
4
48
48
18
18
2
2
3
3
3
3
3
3
3
3
1
1
56
57
30
31
2
3
1
2
2
3
1
2
2
2
3
3
15
7
62
68
1
1
34
34
17
17
2
2
2
2
1
1
2
2
1
1
1
1
56
56
28
28
2
2
1
1
2
2
1
1
2
2
2
2
15
9
Pentium M
Intel Pentium M, Core Solo and Core Duo

Operands: i = immediate data, r = register, mm = 64 bit mmx register, xmm = 128 bit xmm register, sr = segment register, m = memory, m32 = 32-bit memory operand, etc.
μops fused domain: The number of μops at the decode, rename, allocate and retirement stages in the pipeline. Fused μops count as one.
μops unfused domain: The number of μops for each execution port. Fused μops count as two.
p0: Port 0: ALU, etc.
p1: Port 1: ALU, jumps
p01: Instructions that can go to either port 0 or 1, whichever is vacant first.
p2: Port 2: load data, etc.
p3: Port 3: address generation for store
p4: Port 4: store data
Latency: This is the delay that the instruction generates in a dependency chain. (This is not the same as the time spent in the execution unit. Values may be inaccurate in situations where they cannot be measured exactly, especially with memory operands). The numbers are minimum values. Cache misses, misalignment, and exceptions may increase the clock counts considerably. Floating point operands are presumed to be normal numbers. Denormal numbers, NANs and infinity increase the delays by 50-150 clocks, except in XMM move, shuffle and Boolean instructions. Floating point overflow, underflow, denormal or NAN results give a similar delay.
Reciprocal throughput: The average number of clock cycles per instruction for a series of independent instructions of the same kind.
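The per-port counts in the unfused domain can be combined into a simple lower bound on execution time. The following is a sketch under stated assumptions, not a method taken from the tables: it assumes each port accepts at most one micro-operation per clock, lets the p01 micro-operations go to whichever of port 0 and port 1 is less loaded, and ignores latencies, decoding and retirement limits. The function name and the example counts are hypothetical.

```python
def port_bound(p0=0, p1=0, p01=0, p2=0, p3=0, p4=0):
    """Lower bound on clocks for a block of code, from its summed
    unfused-domain counts per port.

    p0 and p1 are tied to their port; p01 can go to either port 0 or
    port 1, so the two ALU ports share the combined load at best evenly.
    """
    alu = max(p0, p1, (p0 + p1 + p01) / 2)
    return max(alu, p2, p3, p4)


# Hypothetical loop body: 1 op tied to port 0, 4 ops for port 0 or 1,
# 2 loads on port 2.
print(port_bound(p0=1, p01=4, p2=2))
```

A fractional result is meaningful as an average over many iterations of a loop.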
Instruction   Operands   μops fused domain   μops unfused domain (p0 p1 p01 p2 p3 p4)   Latency   Reciprocal throughput
Move instructions
MOV
MOV
MOV
MOV
MOV
MOV
MOV
MOV
MOVNTI
MOVSX MOVZX
MOVSX MOVZX
CMOVcc
CMOVcc
XCHG
r,r/i
r,m
m,r
m,i
r,sr
m,sr
sr,r
sr,m
m,r32
r,r
r,m
r,r
r,m
r,r
1
1
1
2
1
2
8
8
2
1
1
2
2
3
0,5
1
1
1
1
1
8
7
1
1
1
1
1
5
8
1
1
1
2
0,5
1
1,5
1,5
1
1
1
1
1
1
3
XCHG
XLAT
PUSH
PUSH
PUSH
PUSH
PUSHF(D)
PUSHA(D)
POP
POP
POP
POP
POPF(D)
POPA(D)
LAHF SAHF
SALC
LEA
BSWAP
LDS LES LFS LGS LSS
PREFETCHNTA
PREFETCHT0/1/2
SFENCE/LFENCE/MFENCE
IN
OUT
ADD SUB
ADD SUB
ADD SUB
ADC SBB
ADC SBB
ADC SBB
CMP
CMP
CMP
INC DEC NEG NOT
INC DEC NEG NOT
AAA AAS DAA DAS
AAD
AAM
MUL IMUL
MUL IMUL
IMUL
IMUL
MUL IMUL
MUL IMUL
IMUL
IMUL
DIV IDIV
DIV IDIV
DIV IDIV
r,m
r
i
m
sr
r
(E)SP
m
sr
r,m
r
m
m
m
7
2
1
2
2
2
16
18
1
3
2
10
17
10
1
2
1
2
11
1
1
2
4
1
1
1
1
1
11
2
2
9
6
2
1
10
1
1
1
1
1
1
1
1
8
high b)
1
1
1
1
1
8
1
1
1
1
1
8
1
1
2
1
1
1
1
6
8
7
1
1
1
1
8
3
1
1
1
r,r/i
r,m
m,r/i
r,r/i
r,m
m,r/i
r,r/i
m,r
m,i
r
m
r8
r16/r32
r,r
r,r,i
m8
m16/m32
r,m
r,m,i
r8
r16
r32
1
1
3
2
2
7
1
1
2
1
3
1
3
4
1
3
1
1
1
3
1
2
5
4
4
1
1
6
18
18
16
7
1
1
1
>300
>300
1
1
1
1
1
1
1
4
1
1
1
1
1
1
1
1
2
1
1
2
1
1
1
1
1
1
1
1
0,5
1
1
2
0,5
1
1
0,5
2
15
4
5
4
4
4
5
4
4
15-16 c)
15-24 c)
15-39 c)
1
1
1
1
1
1
1
1
12
12-20 c)
12-20 c)
1
1
1
1
3
1
1
1
3
1
1
4
3
3
2
2
1
1
1
1
1
1
1
DIV IDIV
DIV IDIV
DIV IDIV
CBW CWDE
CWD CDQ
Logic instructions
AND OR XOR
AND OR XOR
AND OR XOR
TEST
TEST
TEST
SHR SHL SAR ROR ROL
SHR SHL SAR ROR ROL
RCR RCL
RCR
RCL
RCR RCL
RCR RCL
RCR
RCL
RCR RCL
SHLD SHRD
SHLD SHRD
BT
BT
BT
BTR BTS BTC
BTR BTS BTC
BTR BTS BTC
BSF BSR
BSF BSR
SETcc
SETcc
CLC STC CMC
CLD STD
m8
m16
m32
r,r/i
r,m
m,r/i
r,r/i
m,r
m,i
r,i/CL
m,i/CL
r,1
r8,i/CL
r8,i/CL
r16/32,i/CL
m,1
m8,i/CL
m8,i/CL
m16/32,i/CL
r,r,i/CL
m,r,i/CL
r,r/i
m,r
m,i
r,r/i
m,r
m,i
r,r
r,m
r
m

JMP
short/near
JMP
far
JMP
r
JMP
m(near)
JMP
m(far)
conditional jump
short/near
J(E)CXZ
short
LOOP
short
LOOP(N)E
short
CALL
near
CALL
far
CALL
r
6
5
5
1
1
1
1
3
1
1
2
1
3
2
9
8
6
7
12
11
10
2
4
1
8
2
1
10
3
2
2
1
2
1
4
1
22
1
2
25
1
2
11
11
4
32
4
4
3
3
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
5
4
3
2
6
5
5
2
1
1
1
1
2
0,5
1
1
0,5
1
1
1
1
1
1
1
1
1
1
4
4
3
2
3
3
2
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
8
8
1
27
1
1
1
7
1
2
23
1
1
1
1
21
2
11
10
9
1
1
1
1
12
12-20 c)
12-20 c)
1
1
2
2
15-16 c)
15-24 c)
15-39 c)
1
1
1
2
1
1
2
1
1
28
1
2
31
1
1
6
6
2
27
9
CALL
CALL
RETN
RETN
RETF
RETF
BOUND
INTO
m(near)
m(far)
i
i
r,m
String instructions
LODS
REP LODS
STOS
REP STOS
MOVS
REP MOVS
SCAS
REP(N)E SCAS
CMPS
REP(N)E CMPS
Other
NOP (90)
Long NOP (0F 1F)
PAUSE
CLI
STI
ENTER
ENTER
LEAVE
CPUID
RDTSC
Notes:
a)
b)
c)
4
35
2
3
27
27
15
5
1
29
1
1
24
24
7
2
6n
3
5n
6
6n
3
7n
6
9n
2
1
6
5
1
2
1
1
3
3
2
1
2
1
2
2
30
2
2
30
30
8
4
4
0,5
1
0,7
0,7
0,5
1,3
0,6
0,7
0,5
10+6n
ca. 5n
1
ca. 6n
1
12+7n
4
12+9n
1
1
2
1
a)
3
a)
2
1
1
2
0,5
1
9
17
i,0
a,b
12
ca.
3
38-59
13
10
18 +4b
2
1
1
b-1 2b
1
38-59
13
ca. 130
42
a) Faster under certain conditions: see manual 3: "The microarchitecture of Intel, AMD and VIA CPUs".
c) High values are typical, low values are for round divisors. Core Solo/Duo is more efficient than Pentium M in cases with round values that allow an early-out algorithm.

Instruction
Move instructions
FLD
FLD
FLD
FBLD
FST(P)
Operands
r
m32/64
m80
m80
r
ops
fused
domain
1
1
4
40
1
ops unfused domain

p0
p1 p01 p2
1
2
38
1
1
2
2
p3
Latency Reciprocal
p4
throughput
1
1
FST(P)
FSTP
FBSTP
FXCH
FILD
FIST(P)
FISTTP g)
FLDZ
FLD1 FLDPI FLDL2E etc.
FCMOVcc
FNSTSW
FNSTSW
FLDCW
FNSTCW
FINCSTP FDECSTP
FFREE
FFREEP
FNSAVE
FRSTOR
FADD(P) FSUB(R)(P)
FADD(P) FSUB(R)(P)
FMUL(P)
FMUL(P)
FDIV(R)(P)
FDIV(R)(P)
FABS
FCHS
FCOM(P) FUCOM
FCOM(P) FUCOM
FCOMPP FUCOMPP
FCOMI(P) FUCOMI(P)
FIADD FISUB(R)
FIMUL
FIDIV(R)
FICOM(P)
FTST
FXAM
FPREM FPREM1
FRNDINT
Math
FSCALE
FXTRACT
FSQRT
FSIN FCOS
FSINCOS
F2XM1
FYL2X
FYL2XP1
m32/m64
m80
m80
r
m
m
m
r
AX
m16
m16
m16
r
r
r
m
r
m
r
m
r
m
r
m
m
m
m
1
6
169
1
4
4
4
1
2
2
3
2
3
3
1
1
2
142
72
1
1
1
1
1
1
1
1
1
1
2
1
6
6
6
6
1
1
26
15
28
15
1
80-100
90-110
~ 20
~ 40
~ 55
1
2
2
2
165
3
2
2
1
2
2
3
1
1
1
1
1
2
142
72
1
1
1
1
1
2
7
19
3
1
2
131
91
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
2
1
1
1
1
1
1
3
3
5
5
9-38 c)
9-38 c)
1
1
1
1
1
1
3
5
9-38 c)
26
15
15
1
80-100
90-110
~20
~40
~55
1
1
2
2
8-37 c)
8-37 c)
1
1
1
1
1
1
3
3
8-37 c)
4
1
1
37
19
28
0
5
5
5
3
167
0.33 f)
2
2
2
1
1
3
5
5
3
1
2
2
43
9
9 h)
80-110
100-130
~45
~60
~65
FPTAN
FPATAN
Other
FNOP
WAIT
FNCLEX
FNINIT
Notes:
c)
f)
g)
~ 100
~ 85
~100
~85
1
2
3
14
~140
~140
1
1
13
27
3
14
c) High values are typical, low values are for low precision or round divisors.
f) FXCH generates 1 μop that is resolved by register renaming without going to any port.
g) SSE3 instruction only available on Core Solo and Core Duo.

Instruction   Operands   μops fused domain   μops unfused domain (p0 p1 p01 p2 p3 p4)   Latency   Reciprocal throughput
Move instructions
MOVD
MOVD
MOVD
MOVD
MOVD
MOVD
MOVD
MOVD
MOVQ
MOVQ
MOVQ
MOVQ
MOVQ
MOVQ
MOVDQA
MOVDQA
MOVDQA
MOVDQU
MOVDQU
LDDQU g)
MOVDQ2Q
MOVQ2DQ
MOVNTQ
MOVNTDQ
PACKSSWB/DW
PACKUSWB
PACKSSWB/DW
PACKUSWB
PACKSSWB/DW
PACKUSWB
r32,mm
mm,r32
mm,m32
m32,mm
r32,xmm
xmm,r32
xmm,m32
m32, xmm
mm,mm
mm,m64
m64,mm
xmm,xmm
xmm,m64
m64, xmm
xmm, xmm
xmm, m128
m128, xmm
xmm, m128
m128, xmm
xmm, m128
mm, xmm
xmm,mm
m64,mm
m128,xmm
1
1
1
1
1
2
2
1
1
1
1
2
2
1
2
2
2
4
8
4
1
2
1
4
1
1
mm,mm
mm,m64
xmm,xmm
1
1
1
1
1
2
1
1
1
1
1
2
1
1
1
1
1
2
2
2
5-6
1
1
2
2-3 2-3
1
1
1
1
2
1
1
1
2
0,5
0,5
1
1
1
1
1
1
0,5
1
1
1
1
1
1
2
2
2-10
4-20
2
1
1
2
3
PACKSSWB/DW
PACKUSWB
PUNPCKH/LBW/WD/DQ
PUNPCKH/LBW/WD/DQ
PUNPCKH/LBW/WD/DQ
PUNPCKH/LBW/WD/DQ
PUNPCKHQDQ
PUNPCKHQDQ
PUNPCKLQDQ
PUNPCKLQDQ
PSHUFW
PSHUFW
PSHUFD
PSHUFD
PSHUFL/HW
PSHUFL/HW
MASKMOVQ
MASKMOVDQU
PMOVMSKB
PMOVMSKB
PEXTRW
PEXTRW
PINSRW
PINSRW
xmm,m128
mm,mm
mm,m64
xmm,xmm
xmm,m128
xmm,xmm
xmm, m128
xmm,xmm
xmm, m128
mm,mm,i
mm,m64,i
xmm,xmm,i
xmm,m128,i
xmm,xmm,i
xmm, m128,i
mm,mm
xmm,xmm
r32,mm
r32,xmm
r32,mm,i
r32,xmm,i
mm,r32,i
xmm,r32,i
4
1
1
2
3
2
3
1
1
1
2
3
4
2
3
3
8
1
1
2
4
1
2
PADD/SUB(U)(S)B/W/D
PADD/SUB(U)(S)B/W/D
PADD/SUB(U)(S)B/W/D
PADD/SUB(U)(S)B/W/D
PADDQ PSUBQ
PADDQ PSUBQ
PADDQ PSUBQ
PADDQ PSUBQ
PCMPEQ/GTB/W/D
PCMPEQ/GTB/W/D
PCMPEQ/GTB/W/D
PCMPEQ/GTB/W/D
PMULL/HW PMULHUW
PMULL/HW PMULHUW
PMULL/HW PMULHUW
PMULL/HW PMULHUW
PMULUDQ
PMULUDQ
PMULUDQ
PMULUDQ
PMADDWD
PMADDWD
PMADDWD
PMADDWD
PAVGB/W
mm,mm
mm,m64
xmm,xmm
xmm,m128
mm,mm
mm,m64
xmm,xmm
xmm,m128
mm,mm
mm,m64
xmm,xmm
xmm,m128
mm,mm
mm,m64
xmm,xmm
xmm,m128
mm,mm
mm,m64
xmm,xmm
xmm,m128
mm,mm
mm,m64
xmm,xmm
xmm,m128
mm,mm
1
1
2
4
2
2
4
6
1
1
2
2
1
1
2
4
1
1
2
4
1
1
2
4
1
1
1
1
2
1
2
1
1
2
2
1
1
1
1
2
1
1
1
1
2
1
1
1
1
1
1
1
1
2
2
2
1
1
2
1
1
1
1
2
1
2
j)
1
2
1
1
2
2
2
2
4
4
1
1
2
2
1
1
2
2
1
1
2
2
1
1
2
3
1
1
1
1
1
2
1
2
0,5
1
1
2
1
1
2
2
0,5
1
1
2
1
1
2
2
1
1
2
2
1
1
2
2
0,5
1
2
2
1
2
2
1
1
1
2
2
1
2
1
1
2
2
1
1
2
1
2
2
1
1
2
2
1
1
1
1
1
1
2
2
1
1
3
3
3
3
4
4
4
4
3
3
3
3
1
PAVGB/W
PAVGB/W
PAVGB/W
PMIN/MAXUB/SW
PMIN/MAXUB/SW
PMIN/MAXUB/SW
PMIN/MAXUB/SW
PSADBW
PSADBW
PSADBW
PSADBW
mm,m64
xmm,xmm
xmm,m128
mm,mm
mm,m64
xmm,xmm
xmm,m128
mm,mm
mm,m64
xmm,xmm
xmm,m128
1
2
4
1
1
2
4
2
2
4
6
1
2
2
1
1
2
2
2
2
4
4
Logic instructions
PAND(N) POR PXOR
PAND(N) POR PXOR
PAND(N) POR PXOR
PAND(N) POR PXOR
PSLL/RL/RAW/D/Q
PSLL/RL/RAW/D/Q
PSLL/RL/RAW/D/Q
PSLL/RL/RAW/D/Q
PSLL/RL/RAW/D/Q
PSLL/RLDQ
mm,mm
mm,m64
xmm,xmm
xmm,m128
mm,mm/i
mm,m64
xmm,i
xmm,xmm
xmm,m128
xmm,i
1
1
2
4
1
1
2
3
3
4
1
1
2
2
Other
EMMS
Notes:
g)
j)
k)
1
1
2
2
1
1
2
1
1
1
2
4
4
4
4
1
2
0,5
1
1
2
1
1
2
2
2
3
6 k)
1
1
2
1
1
2
2
1
1
1
11
11
1
1
2
0,5
1
1
2
1
1
2
2

j) Also uses some execution units under port 1.
k) You may hide the delay by inserting other instructions between EMMS and any subsequent floating point instruction.

Instruction   Operands   μops fused domain   μops unfused domain (p0 p1 p01 p2 p3 p4)   Latency   Reciprocal throughput
Move instructions
MOVAPS/D
MOVAPS/D
MOVAPS/D
MOVUPS/D
MOVUPS/D
MOVSS/D
MOVSS/D
MOVSS/D
MOVHPS/D MOVLPS/D
MOVHPS/D MOVLPS/D
MOVLHPS MOVHLPS
MOVMSKPS/D
xmm,xmm
xmm,m128
m128,xmm
xmm,m128
m128,xmm
xmm,xmm
xmm,m32/64
m32/64,xmm
xmm,m64
m64,xmm
xmm,xmm
r32,xmm
2
2
2
4
8
1
2
1
1
1
1
1
2
2
2
4
4
1
1
1
j)
1
2
3
2
3
1
1
1
1
1
1
2
1
2
2
2
4
1
1
1
1
1
1
1
MOVNTPS/D
SHUFPS/D
SHUFPS/D
MOVDDUP g)
MOVSH/LDUP g)
MOVSH/LDUP g)
UNPCKH/LPS
UNPCKH/LPS
UNPCKH/LPD
UNPCKH/LPD
m128,xmm
xmm,xmm,i
xmm,m128,i
xmm,xmm
xmm,xmm
xmm,m128
xmm,xmm
xmm,m128
xmm,xmm
xmm,m128
2
3
4
2
2
4
4
4
2
3
Conversion
CVTPS2PD
CVTPS2PD
CVTPD2PS
CVTPD2PS
CVTSD2SS
CVTSD2SS
CVTSS2SD
CVTSS2SD
CVTDQ2PS
CVTDQ2PS
CVT(T) PS2DQ
CVT(T) PS2DQ
CVTDQ2PD
CVTDQ2PD
CVT(T)PD2DQ
CVT(T)PD2DQ
CVTPI2PS
CVTPI2PS
CVT(T)PS2PI
CVT(T)PS2PI
CVTPI2PD
CVTPI2PD
CVT(T) PD2PI
CVT(T) PD2PI
CVTSI2SS
CVT(T)SS2SI
CVT(T)SS2SI
CVTSI2SD
CVTSI2SD
CVT(T)SD2SI
CVT(T)SD2SI
xmm,xmm
xmm,m64
xmm,xmm
xmm,m128
xmm,xmm
xmm,m64
xmm,xmm
xmm,m64
xmm,xmm
xmm,m128
xmm,xmm
xmm,m128
xmm,xmm
xmm,m64
xmm,xmm
xmm,m128
xmm,mm
xmm,m64
mm,xmm
mm,m128
xmm,mm
xmm,m64
mm,xmm
mm,m128
xmm,r32
r32,xmm
r32,m32
xmm,r32
xmm,m32
r32,xmm
r32,m64
4
4
4
6
2
3
2
3
2
4
2
4
4
5
4
6
1
2
1
2
4
5
3
5
2
2
3
2
3
2
3
xmm,xmm
xmm,m32/64
xmm,xmm
xmm,m128
xmm,xmm
xmm,xmm
1
2
2
4
2
6?
Arithmetic
ADDSS/D SUBSS/D
ADDSS/D SUBSS/D
ADDPS/D SUBPS/D
ADDPS/D SUBPS/D
ADDSUBPS/D g)
HADDPS HSUBPS g)
2
2
1
1
1
2
2
2
1
2
2
1
3
3
2
2
1
1
3-4
2
1
1
2
2
1
1
1
4
2
2
2
1
1
1
1
2
2
1
1
1
1
1
1
1
3
2
3
2
4
1
4
2
3
1
3
1
5
1
3
3
1
1
4
1
2
2
2
2
2
4
4
4
4
2
2
2
2
1
1
4
2
4
4
1
4
1
1
1
1
1
2
2
2
?
4
1
1
2
3
3
3
3
3
7
3
2
2
1
2
5
5
1
1
3
3
3
3
2
2
2
2
2
2
2
2
2
2
3
3
1
1
1
1
2
2
2
2
1
1
1
1
1
1
1
1
1
2
2
2
4
HADDPD HSUBPD g)
MULSS
MULSD
MULSS
MULSD
MULPS
MULPD
MULPS
MULPD
DIVSS
DIVSD
DIVSS
DIVSD
DIVPS
DIVPD
DIVPS
DIVPD
CMPccSS/D
CMPccSS/D
CMPccPS/D
CMPccPS/D
COMISS/D UCOMISS/D
COMISS/D UCOMISS/D
MAXSS/D MINSS/D
MAXSS/D MINSS/D
MAXPS/D MINPS/D
MAXPS/D MINPS/D
RCPSS
RCPSS
RCPPS
RCPPS
xmm,xmm
xmm,xmm
xmm,xmm
xmm,m32
xmm,m64
xmm,xmm
xmm,xmm
xmm,m128
xmm,m128
xmm,xmm
xmm,xmm
xmm,m32
xmm,m64
xmm,xmm
xmm,xmm
xmm,m128
xmm,m128
xmm,xmm
xmm,m32/64
xmm,xmm
xmm,m128
xmm,xmm
xmm,m32/64
xmm,xmm
xmm,m32/64
xmm,xmm
xmm,m128
xmm,xmm
xmm,m32
xmm,xmm
xmm,m128
3
1
1
2
2
2
2
4
4
1
1
2
2
2
2
4
4
1
2
2
4
1
2
1
2
2
4
1
2
2
4
Math
SQRTSS
SQRTSS
SQRTSD
SQRTSD
SQRTPS
SQRTPD
SQRTPS
SQRTPD
RSQRTSS
RSQRTSS
RSQRTPS
RSQRTPS
xmm,xmm
xmm,m32
xmm,xmm
xmm,m64
xmm,xmm
xmm,xmm
xmm,m128
xmm,m128
xmm,xmm
xmm,m32
xmm,xmm
xmm,m128
2
3
1
2
2
2
4
4
1
2
2
4
Logic
AND/ANDN/OR/XORPS/D
AND/ANDN/OR/XORPS/D
xmm,xmm
xmm,m128
2
4
3
1
1
1
1
2
2
2
2
1
1
1
1
2
2
2
2
1
1
2
2
1
1
2
2
1
1
2
2
1
1
1
1
2
2
3
2
1
2
3
3
3
3
3
1
3
2
2
2
1
1
2
2
2
2
1
1
1
2
2
Other
4
4
5
4
5
4
5
4
5
9-18 c)
9-32 c)
9-18 c)
9-32 c)
16-34 c)
16-62 c)
16-34 c)
16-62 c)
3
6-30
1
5-58
1
8-56
16-114
2
2
1
1
3
2
3
1
3
2
2
2
1
2
2
1
2
1
2
2
4
2
4
8-17 c)
8-31 c)
8-17 c)
8-31 c)
16-34 c)
16-62 c)
16-34 c)
16-62 c)
1
1
2
2
1
1
1
1
2
2
1
1
2
2
4-28
4-28
4-57
4-57
16-55
16-114
16-55
16-114
1
1
2
2
1
1
LDMXCSR
STMXCSR
FXSAVE
FXRSTOR
Notes:
c)
g)
j)
m32
m32
m4096
m4096
9
6
118
87
9
6
32
43
43
43
44
c) High values are typical, low values are for round divisors.
j) Also uses some execution units under port 1.
20
12
63
72
Merom
Intel Core 2 (Merom, 65nm)

Operands: i = immediate data, r = register, mm = 64 bit mmx register, xmm = 128 bit xmm register, (x)mm = mmx or xmm register, sr = segment register, m = memory, m32 = 32-bit memory operand, etc.
μops fused domain: The number of μops at the decode, rename, allocate and retirement stages in the pipeline. Fused μops count as one.
μops unfused domain: The number of μops for each execution port. Fused μops count as two. Fused macro-ops count as one. The instruction has μop fusion if the sum of the numbers listed under p015 + p2 + p3 + p4 exceeds the number listed under μops fused domain. An x under p0, p1 or p5 means that at least one of the μops listed under p015 can optionally go to this port. For example, a 1 under p015 and an x under p0 and p5 means one μop which can go to either port 0 or port 5, whichever is vacant first. A value listed under p015 but nothing under p0, p1 and p5 means that it is not known which of the three ports these μops go to.
p015: The total number of μops going to port 0, 1 and 5.
p0: The number of μops going to port 0 (execution units).
p1: The number of μops going to port 1 (execution units).
p5: The number of μops going to port 5 (execution units).
p2: The number of μops going to port 2 (memory read).
p3: The number of μops going to port 3 (memory write address).
p4: The number of μops going to port 4 (memory write data).
Unit: Tells which execution unit cluster is used. An additional delay of 1 clock cycle is generated if a register written by a μop in the integer unit (int) is read by a μop in the floating point unit (float) or vice versa. fltint means that an instruction with multiple μops receives the input in the float unit and delivers the output in the int unit. Delays for moving data between different units are included under latency when they are unavoidable. For example, movd eax,xmm0 has an extra 1 clock delay for moving from the XMM-integer unit to the general purpose integer unit. This is included under latency because it occurs regardless of which instruction comes next. Nothing listed under unit means that additional delays are either unlikely to occur or unavoidable and therefore included in the latency figure.
Latency: This is the delay that the instruction generates in a dependency chain. The numbers are minimum values. Cache misses, misalignment, and exceptions may increase the clock counts considerably. Floating point operands are presumed to be normal numbers. Denormal numbers, NANs and infinity increase the delays very much, except in XMM move, shuffle and Boolean instructions. Floating point overflow, underflow, denormal or NAN results give a similar delay. The time unit used is core clock cycles, not the reference clock cycles given by the time stamp counter.
Reciprocal throughput: The average number of core clock cycles per instruction for a series of independent instructions of the same kind in the same thread.
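Two of the rules above lend themselves to a short sketch: the test for fusion (the unfused-domain total p015 + p2 + p3 + p4 exceeds the fused-domain count) and the extra clock when a value crosses between the int and float units. The function names and the example figures are hypothetical, and the second function handles only the plain "int" and "float" units, not fltint instructions.

```python
def has_uop_fusion(fused_domain, p015, p2=0, p3=0, p4=0):
    """True if the unfused-domain total exceeds the fused-domain count."""
    return p015 + p2 + p3 + p4 > fused_domain


def chain_latency(steps):
    """Sum the latencies along a dependency chain, adding 1 clock each
    time the result passes from the int unit to the float unit or back.

    steps is a list of (latency, unit) pairs, unit being "int", "float"
    or None. A step with unit None neither incurs nor causes a delay.
    """
    total = 0
    previous_unit = None
    for latency, unit in steps:
        if unit is not None and previous_unit is not None and unit != previous_unit:
            total += 1  # data moves between execution unit clusters
        total += latency
        previous_unit = unit
    return total


# Hypothetical read-modify instruction: 1 fused-domain count, 1 under p015, 1 under p2.
print(has_uop_fusion(1, p015=1, p2=1))
# Hypothetical chain: int op (1 clock), float op (3 clocks), float op (1 clock).
print(chain_latency([(1, "int"), (3, "float"), (1, "float")]))
```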
Instruction   Operands   μops fused domain   μops unfused domain (p015 p0 p1 p5 p2 p3 p4)   Unit   Latency   Reciprocal throughput
Move instructions
MOV
MOV a)
MOV a)
MOV
MOV
MOV
MOV
MOV
MOVNTI
MOVSX MOVZX
MOVSXD
MOVSX MOVZX
CMOVcc
CMOVcc
XCHG
XCHG
XLAT
PUSH
PUSH
PUSH
PUSH
PUSHF(D/Q)
PUSHA(D) i)
POP
POP
POP
POP
POPF(D/Q)
POPA(D) i)
LAHF SAHF
SALC i)
LEA a)
BSWAP
LDS LES LFS LGS LSS
PREFETCHNTA
PREFETCHT0/1/2
LFENCE
MFENCE
SFENCE
CLFLUSH
IN
OUT
ADD SUB
ADD SUB
ADD SUB
ADC SBB
ADC SBB
ADC SBB
CMP
CMP
INC DEC NEG NOT
r,r/i
r,m
m,r
m,i
r,sr
m,sr
sr,r
sr,m
m,r
1
1
1
1
1
2
8
8
2
r,r
r,m
r,r
r,m
r,r
r,m
2
2
3
x
1
x
x
x
x
x
x
x
x
x
m8
1
1
2
2
3
7
2
1
1
2
2
17
18
1
4
2
10
24
10
1
2
1
2
11
1
1
2
2
2
4
r,r/i
r,m
m,r/i
r,r/i
r,m
m,r/i
r,r/i
m,r/i
r
1
1
2
2
2
4
1
1
1
r
i
m
sr
r
(E/R)SP
m
sr
r,m
r
m
m
m
x
1
4
3
x
x
x
x
1
1
4
5
1
1
1
1
1
1
1
1
1
1
15
9
3
9
23
2
1
2
1
2
11
x
x
1
1
x
x
x
x
1
1
1
1
1
8
1
1
1
1
1
1
1
1
1
1
1
8
1
1
1
1
1
1
1
2
2
3
1
1
1
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
1
1
1
1
1
1
1
1
1
1
1
1
int
int
int
int
int
int
int
int
int
1
2
3
3
0,33
1
1
1
1
1
16
16
2
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
0,33
1
1
int
int
int
int
int
int
int
int
int
2
2
high b)
4
3
2
1
1
1
1
1
7
8
1
1,5
17
20
1
4
1
4
240
1
6
2
2
7
1
1
1
7
0,33
1
1
1
17
1
1
8
9
9
117
0,33
1
1
2
2
0,33
1
0,33
INC DEC NEG NOT
AAA AAS DAA DAS i)
AAD i)
AAM i)
MUL IMUL
MUL IMUL
MUL IMUL
MUL IMUL
IMUL
IMUL
IMUL
IMUL
IMUL
IMUL
MUL IMUL
MUL IMUL
MUL IMUL
MUL IMUL
IMUL
IMUL
IMUL
IMUL
IMUL
IMUL
DIV IDIV
DIV IDIV
DIV IDIV
DIV
IDIV
DIV IDIV
DIV IDIV
DIV IDIV
DIV
IDIV
CBW CWDE CDQE
CWD CDQ CQO
Logic instructions
AND OR XOR
AND OR XOR
AND OR XOR
TEST
TEST
SHR SHL SAR
SHR SHL SAR
ROR ROL
ROR ROL
RCR RCL
RCR
RCL
RCR RCL
RCR RCL
RCR
r8
r16
r32
r64
r16,r16
r32,r32
r64,r64
r16,r16,i
r32,r32,i
r64,r64,i
m8
m16
m32
m64
r16,m16
r32,m32
r64,m64
r16,m16,i
r32,m32,i
r64,m64,i
r8
r16
r32
r64
r64
m8
m16
m32
m64
m64
r,r/i
r,m
m,r/i
r,r/i
m,r/i
r,i/cl
m,i/cl
r,i/cl
m,i/cl
r,1
r8,i/cl
r8,i/cl
r16/32/64,i/cl
m,1
m8,i/cl
3
1
3
4
1
3
3
3
1
1
1
1
1
1
1
3
3
3
1
1
1
1
1
1
3
5
4
32
56
4
6
5
32
56
1
1
1
1
3
4
1
3
3
3
1
1
1
1
1
1
1
3
3
2
1
1
1
1
1
1
3
5
4
32
56
3
5
4
31
55
1
1
1
1
2
1
1
1
3
1
3
2
9
8
6
4
12
1
1
1
1
1
1
2
1
2
2
9
8
6
3
9
x
x
x
x
x
x
1
x
1
x
x
x
1
1
x
x
x
1
1
1
1
x
x
2
1
x
x
x
x
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
1
1
1
1
1
1
1
1
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
17
3
5
5
7
3
3
5
3
3
5
3
5
5
7
3
3
5
18
18-26
18-42
29-61
39-72
18
18-26
18-42
29-61
39-72
1
1
1
6
1
1
6
1
6
2
12
11
11
7
14
1
1
1
1
1,5
1,5
4
1
1
2
1
1
2
1
1,5
1,5
4
1
1
2
2
1
2
12
12-20 c)
12-36 c)
18-37 c)
28-40 c)
12
12-20 c)
12-36 c)
18-37 c)
28-40 c)
0,33
1
1
0,33
1
0,5
1
1
1
2
RCL
RCR RCL
SHLD SHRD
SHLD SHRD
BT
BT
BT
BTR BTS BTC
BTR BTS BTC
BTR BTS BTC
BSF BSR
BSF BSR
SETcc
SETcc
CLC STC CMC
CLD
STD
m8,i/cl
m16/32/64,i/cl
r,r,i/cl
m,r,i/cl
r,r/i
m,r
m,i
r,r/i
m,r
m,i
r,r
r,m
r
m

JMP
short/near
JMP i)
far
JMP
r
JMP
m(near)
JMP
m(far)
Conditional jump
short/near
Fused compare/test and branch e,i)
J(E/R)CXZ
short
LOOP
short
LOOP(N)E
short
CALL
near
CALL i)
far
CALL
r
CALL
m(near)
CALL
m(far)
RETN
RETN
i
RETF
RETF
i
BOUND i)
r,m
INTO i)
String instructions
LODS
REP LODS
STOS
REP STOS
MOVS
REP MOVS
SCAS
REP(N)E SCAS
CMPS
REP(N)E CMPS
11
10
2
3
1
10
2
1
11
3
2
2
1
2
1
7
6
8
7
2
2
1
9
1
1
8
1
2
2
1
1
1
7
6
1
30
1
1
31
1
1
2
11
11
3
43
3
4
44
1
3
32
32
15
5
1
30
1
1
29
1
1
2
11
11
2
43
2
3
42
1
x
30
30
13
5
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
1
1
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
x
x
x
x
3
2
4+7n - 14+6n
4
2
8+5n - 20+1.2n
8
5
1
1
1
7+7n - 13+n
4
3
7+8n - 17+7n
7
5
7+10n - 7+9n
x
x
x
x
1
2
1
1
1
x
x
x
1
1
1
2
1
1
2
2
2
1
1
1
1
1
1
5
1
2
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
13
13
2
7
1
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
1
5
6
2
1
1
0
0
0
0
1
1
5
1
1
2
1
1
0,33
4
14
1-2
76
1-2
1-2
68
1
1
1-2
5
5
2
75
2
2
75
2
2
78
78
8
3
1
1+5n - 21+3n
1
7+2n - 0.55n
1+3n - 0.63n
1
3+8n - 23+6n
3
2+7n - 22+5n
Other
NOP (90)
Long NOP (0F 1F)
PAUSE
ENTER
ENTER
LEAVE
CPUID
RDTSC
RDPMC
Notes:
a)
b)
c)
e)
i)
i,0
a,b
1
1
3
12
1
1
3
10
3
46-100
29
23
x
x
x
x
x
x
x
x
x
1
int
int
int
int
int
int
int
int
int
0,33
1
8
8
180-215
64
54
a) Applies to all addressing modes.
c) Low values are for small results, high values for high results.
e) See manual 3: "The microarchitecture of Intel, AMD and VIA CPUs" for restrictions on macro-op fusion.
i) Not available in 64 bit mode.

Instruction
Move instructions
FLD
FLD
FLD
FBLD
FST(P)
FST(P)
FSTP
FBSTP
FXCH
FILD
FIST
FISTP
FISTTP g)
FLDZ
FLD1
FLDPI FLDL2E etc.
FCMOVcc
FNSTSW
FNSTSW
FLDCW
FNSTCW
FINCSTP FDECSTP
FFREE(P)
FNSAVE
FRSTOR
Operands
r
m32/64
m80
m80
r
m32/m64
m80
m80
r
m
m
m
m
r
AX
m16
m16
m16
r
m
m

μops fused domain
μops unfused domain: p015 p0 p1 p5 p2 p3 p4
Unit
1
1
4
40
1
1
7
170
1
1
2
3
3
1
2
2
2
1
2
2
3
1
2
142
78
1
1
2
38
1
3
166
0 f)
1
1
1
1
1
2
2
2
1
1
1
1
1
2
x
x
1
2
2
2
1
x
x
x
x
1
2
2
1
1
1
1
1
1
1
1
1
1
1
1
1
2
2
1
2
2
1
1
1
1
2
float
float
float
float
float
float
float
float
float
float
float
float
float
float
float
float
float
float
float
float
float
float
float
float
float
Latency
Reciprocal throughput
1
3
4
45
1
3
4
164
0
6
6
6
6
1
184
169
1
1
3
20
1
1
5
166
1
1
1
1
1
1
2
2
2
1
2
10
8
1
2
192
177
FADD(P) FSUB(R)(P)
FADD(P) FSUB(R)(P)
FMUL(P)
FMUL(P)
FDIV(R)(P)
FDIV(R)(P)
FABS
FCHS
FCOM(P) FUCOM
FCOM(P) FUCOM
FCOMPP FUCOMPP
FCOMI(P) FUCOMI(P)
FIADD FISUB(R)
FIMUL
FIDIV(R)
FICOM(P)
FTST
FXAM
FPREM FPREM1
FRNDINT
r
m
r
m
r
m
r
m
r
m
m
m
m
Math
FSCALE
FXTRACT
FSQRT
FSIN FCOS
FSINCOS
F2XM1
FYL2X FYL2XP1
FPTAN
FPATAN
Other
FNOP
WAIT
FNCLEX
FNINIT
Notes:
d)
f)
g)
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
2
2
1
1
2
2
2
2
2
2
2
2
1
1
1
1
21-27 21-27
7-15 7-15
27
82
1
~96
~100
~19
~53
~98
~70
27
82
1
~96
~100
~19
~53
~98
~70
1
2
4
15
1
2
4
15
1
1
1
1
1
1
1
1
1
1
2
2
1
float
float
float
float
float
float
float
float
float
float
float
float
float
float
float
float
float
float
float
float
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
float
float
float
float
float
float
float
float
float
1
1
5
2
2
6-38 d) 5-37 d)
5-37 d)
1
1
1
1
1
1
1
2
2
5-37 d)
2
1
1
16-56
22-29
41
170
6-69
~96
~115
~45
~96
~136
~119
float
float
float
float
1
1
15
63
d) Round divisors or low precision give low values.
f) Resolved by register renaming. Generates no μops in the unfused domain.
g) SSE3 instruction set.

Instruction
Operands
Move instructions
MOVD k)
MOVD k)
MOVD k)
r32/64,(x)mm
m32/64,(x)mm
(x)mm,r32/64

μops fused domain
μops unfused domain: p015 p0 p1 p5 p2 p3 p4
Unit
1
1
1
int
1
1
int
Latency
Reciprocal throughput
2
3
2
0,33
1
0,5
MOVD k)
(x)mm,m32/64
MOVQ
(x)mm, (x)mm
MOVQ
(x)mm,m64
MOVQ
m64, (x)mm
MOVDQA
xmm, xmm
MOVDQA
xmm, m128
MOVDQA
m128, xmm
MOVDQU
m128, xmm
MOVDQU
xmm, m128
LDDQU g)
xmm, m128
MOVDQ2Q
mm, xmm
MOVQ2DQ
xmm,mm
MOVNTQ
m64,mm
MOVNTDQ
m128,xmm
mm,mm
PACKSSWB/DW
PACKUSWB
mm,m64
xmm,xmm
PACKSSWB/DW
PACKUSWB
xmm,m128
PUNPCKH/LBW/WD/DQ
mm,mm
PUNPCKH/LBW/WD/DQ
mm,m64
PUNPCKH/LBW/WD/DQ
xmm,xmm
PUNPCKH/LBW/WD/DQ xmm,m128
PUNPCKH/LQDQ
xmm,xmm
PUNPCKH/LQDQ
xmm, m128
PSHUFB h)
mm,mm
PSHUFB h)
mm,m64
PSHUFB h)
xmm,xmm
PSHUFB h)
xmm,m128
PSHUFW
mm,mm,i
PSHUFW
mm,m64,i
PSHUFD
xmm,xmm,i
PSHUFD
xmm,m128,i
PSHUFL/HW
xmm,xmm,i
PSHUFL/HW
xmm, m128,i
PALIGNR h)
mm,mm,i
PALIGNR h)
mm,m64,i
PALIGNR h)
xmm,xmm,i
PALIGNR h)
xmm,m128,i
MASKMOVQ
mm,mm
MASKMOVDQU
xmm,xmm
PMOVMSKB
r32,(x)mm
PEXTRW
r32,mm,i
PEXTRW
r32,xmm,i
PINSRW
mm,r32,i
PINSRW
mm,m16,i
PINSRW
xmm,r32,i
PINSRW
xmm,m16,i
1
1
1
1
1
1
1
9
4
4
1
1
1
1
1
1
3
4
1
1
3
4
1
2
1
2
4
5
1
2
2
3
1
2
2
2
2
2
4
10
1
2
3
1
2
3
4
PADD/SUB(U)(S)B/W/D (x)mm, (x)mm
PADD/SUB(U)(S)B/W/D
(x)mm,m
PADDQ PSUBQ
(x)mm, (x)mm
PADDQ PSUBQ
(x)mm,m
1
1
2
2
1
1
int
int
int
x
1
1
int
int
1
4
2
2
1
1
x
x
x
x
x
x
x
x
x
x
x
x
1
2
2
1
2
int
int
int
int
1
1
1
1
3
3
1
1
3
3
1
1
1
1
4
4
1
1
2
2
1
1
2
2
2
2
1
1
1
2
3
1
1
3
3
1
1
2
2
x
x
x
x
1
1
1
1
1
1
1
1
1
1
1
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
1
1
1
1
1
1
x
x
x
x
1
1
x
x
x
x
x
x
1
1
1
1
1
1
1
1
1
1
2
2
1
2
3
1
2
3
3-8
2-8
2-8
1
1
1
1
int
int
fltint
int
int
int
fltint
int
int
int
int
int
int
int
int
int
fltint
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
3
1
3
1
1
3
1
3
1
2
2
2
3
5
2
6
1
0,33
1
1
0,33
1
1
4
2
2
0,33
0,33
2
2
1
1
2
2
1
1
2
2
1
1
1
1
2
2
1
1
1
1
1
1
1
1
1
1
2-5
6-10
1
1
1
1
1
1,5
1,5
0,5
1
1
1
PHADD(S)W
PHSUB(S)W h)
PHADD(S)W
PHSUB(S)W h)
PHADD(S)W
PHSUB(S)W h)
PHADD(S)W
PHSUB(S)W h)
PHADDD PHSUBD h)
PHADDD PHSUBD h)
PHADDD PHSUBD h)
PHADDD PHSUBD h)
PCMPEQ/GTB/W/D
PCMPEQ/GTB/W/D
PMULL/HW PMULHUW
PMULL/HW PMULHUW
PMULHRSW h)
PMULHRSW h)
PMULUDQ
PMULUDQ
PMADDWD
PMADDWD
PMADDUBSW h)
PMADDUBSW h)
PAVGB/W
PAVGB/W
PMIN/MAXUB/SW
PMIN/MAXUB/SW
PABSB PABSW PABSD
h)
PSIGNB PSIGNW
PSIGND h)
PSADBW
PSADBW
Logic instructions
PAND(N) POR PXOR
PAND(N) POR PXOR
PSLL/RL/RAW/D/Q
PSLL/RL/RAW/D/Q
PSLL/RL/RAW/D/Q
PSLL/RL/RAW/D/Q
PSLL/RL/RAW/D/Q
PSLL/RLDQ
Other
EMMS
Notes:
g)
h)
k)
mm,mm
int
mm,m64
xmm,xmm
xmm,m128
mm,mm
mm,m64
xmm,xmm
xmm,m128
(x)mm,(x)mm
(x)mm,m
(x)mm,(x)mm
(x)mm,m
(x)mm,(x)mm
(x)mm,m
(x)mm,(x)mm
(x)mm,m
(x)mm,(x)mm
(x)mm,m
(x)mm,(x)mm
(x)mm,m
(x)mm,(x)mm
(x)mm,m
(x)mm,(x)mm
(x)mm,m
(x)mm,(x)mm
(x)mm,m
(x)mm,(x)mm
(x)mm,m
(x)mm,(x)mm
(x)mm,m
8
3
4
5
6
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
7
3
3
5
5
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
(x)mm,(x)mm
(x)mm,m
mm,mm/i
mm,m64
xmm,i
xmm,xmm
xmm,m128
xmm,i
1
1
1
1
1
2
3
2
1
1
1
1
1
2
2
2
x
x
1
1
1
x
x
x
x
x
11
11
int
int
1
1
1
x
x
x
x
1
1
1
1
1
1
1
1
1
1
x
x
x
x
x
x
x
x
1
1
1
1
1
1
x
x
x
x
x
x
x
x
1
1
1
1
1
1
1
x
x
1
1
x
x
x
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
float
4
4
3
5
1
3
3
3
3
3
1
1
1
1
3
1
1
1
2
2
4
4
2
2
3
3
0,5
1
1
1
1
1
1
1
1
1
1
1
0,5
1
0,5
1
0,5
1
0,5
1
1
1
0,33
1
1
1
1
1
1
1

h) Supplementary SSE3 instruction set.
k) MASM uses the name MOVD rather than MOVQ for this instruction even when moving 64 bits.

Instruction
Operands
Move instructions
MOVAPS/D
MOVAPS/D
MOVAPS/D
MOVUPS/D
MOVUPS/D
MOVSS/D
MOVSS/D
MOVSS/D
MOVHPS/D MOVLPS/D
MOVHPS/D
MOVLPS/D
MOVLHPS MOVHLPS
MOVMSKPS/D
MOVNTPS/D
SHUFPS
SHUFPS
SHUFPD
SHUFPD
MOVDDUP g)
MOVDDUP g)
MOVSH/LDUP g)
MOVSH/LDUP g)
UNPCKH/LPS
UNPCKH/LPS
UNPCKH/LPD
UNPCKH/LPD
xmm,xmm
xmm,m128
m128,xmm
xmm,m128
m128,xmm
xmm,xmm
xmm,m32/64
m32/64,xmm
xmm,m64
m64,xmm
m64,xmm
xmm,xmm
r32,xmm
m128,xmm
xmm,xmm,i
xmm,m128,i
xmm,xmm,i
xmm,m128,i
xmm,xmm
xmm,m64
xmm,xmm
xmm,m128
xmm,xmm
xmm,m128
xmm,xmm
xmm,m128
1
1
1
4
9
1
1
1
2
2
1
1
1
1
3
4
1
2
1
2
1
2
3
4
1
2
xmm,xmm
xmm,m128
xmm,xmm
xmm,m64
xmm,xmm
xmm,m64
xmm,xmm
xmm,m32
xmm,xmm
xmm,m128
xmm,xmm
xmm,m128
xmm,xmm
xmm,m64
xmm,xmm
xmm,m128
2
2
2
2
2
2
2
2
1
1
1
1
2
3
2
2
2
2
2
2
2
2
2
2
1
1
1
1
2
2
2
2
Conversion
CVTPD2PS
CVTPD2PS
CVTSD2SS
CVTSD2SS
CVTPS2PD
CVTPS2PD
CVTSS2SD
CVTSS2SD
CVTDQ2PS
CVTDQ2PS
CVT(T)PS2DQ
CVT(T)PS2DQ
CVTDQ2PD
CVTDQ2PD
CVT(T)PD2DQ
CVT(T)PD2DQ

Unit
μops fused domain
μops unfused domain: p015 p0 p1 p5 p2 p3 p4
x
int
int
1
1
2
4
1
1
x
x
x
x
1
x
x
2
1
1
int
2
int
int
1
1
1
1
1
1
1
1
1
1
int
1
1
3
3
1
1
1
1
1
1
1
1
3
3
1
1
1
1
1
1
1
2
2
1
1
float
float
1
3
3
1
1
1
1
1
1
3
3
1
1
1
1
1
1
1
1
1
1
1
Latency
Reciprocal throughput
1
2
3
2-4
3-4
1
2
3
3
5
3
1
1
1
fltint
fltint
float
float
int
int
int
int
fltint
int
float
float
float
float
float
float
float
float
float
float
float
float
float
float
float
float
float
float
1
1
1
3
1
4
2
2
3
3
4
4
0,33
1
1
2
4
0,33
1
1
1
1
1
1
1
2-3
2
2
1
1
1
1
1
1
2
2
1
1
1
1
1
1
2
2
2
2
1
1
1
1
1
1
1
1
CVTPI2PS
CVTPI2PS
CVT(T)PS2PI
CVT(T)PS2PI
CVTPI2PD
CVTPI2PD
CVT(T)PD2PI
CVT(T)PD2PI
CVTSI2SS
CVTSI2SS
CVT(T)SS2SI
CVT(T)SS2SI
CVTSI2SD
CVTSI2SD
CVT(T)SD2SI
CVT(T)SD2SI
Arithmetic
ADDSS/D SUBSS/D
ADDSS/D SUBSS/D
ADDPS/D SUBPS/D
ADDPS/D SUBPS/D
ADDSUBPS/D g)
ADDSUBPS/D g)
HADDPS HSUBPS g)
HADDPS HSUBPS g)
HADDPD HSUBPD g)
HADDPD HSUBPD g)
MULSS
MULSS
MULSD
MULSD
MULPS
MULPS
MULPD
MULPD
DIVSS
DIVSS
DIVSD
DIVSD
DIVPS
DIVPS
DIVPD
DIVPD
RCPSS/PS
RCPSS/PS
CMPccSS/D
CMPccSS/D
CMPccPS/D
CMPccPS/D
COMISS/D UCOMISS/D
COMISS/D UCOMISS/D
MAXSS/D MINSS/D
xmm,mm
xmm,m64
mm,xmm
mm,m128
xmm,mm
xmm,m64
mm,xmm
mm,m128
xmm,r32
xmm,m32
r32,xmm
r32,m32
xmm,r32
xmm,m32
r32,xmm
r32,m64
1
1
1
1
2
2
2
2
1
1
1
1
2
2
1
1
1
1
1
1
2
2
2
2
1
1
1
1
2
1
1
1
xmm,xmm
xmm,m32/64
xmm,xmm
xmm,m128
xmm,xmm
xmm,m128
xmm,xmm
xmm,m128
xmm,xmm
xmm,m128
xmm,xmm
xmm,m32
xmm,xmm
xmm,m64
xmm,xmm
xmm,m128
xmm,xmm
xmm,m128
xmm,xmm
xmm,m32
xmm,xmm
xmm,m64
xmm,xmm
xmm,m128
xmm,xmm
xmm,m128
xmm,xmm
xmm,m
xmm,xmm
xmm,m32/64
xmm,xmm
xmm,m128
xmm,xmm
xmm,m32/64
xmm,xmm
1
1
1
1
1
1
6
7
3
4
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
6
6
3
3
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
float
float
float
float
float
float
float
float
float
float
float
float
float
float
float
float
float
float
float
float
float
float
float
float
float
float
float
float
float
float
float
float
float
float
float
float
float
float
float
float
float
float
float
float
float
float
float
float
float
float
float
3
4
4
4
3
4
3
3
3
9
5
4
5
4
5
6-18 d)
6-32 d)
6-18 d)
6-32 d)
3
3
3
3
3
3
3
1
1
1
1
1
1
3
3
1
1
3
3
1
1
1
1
1
1
1
1
3
3
2
2
1
1
1
1
1
1
1
1
5-17 d)
5-17 d)
5-31 d)
5-31 d)
5-17 d)
5-17 d)
5-31 d)
5-31 d)
2
2
1
1
1
1
1
1
1
MAXSS/D MINSS/D
MAXPS/D MINPS/D
MAXPS/D MINPS/D
xmm,m32/64
xmm,xmm
xmm,m128
1
1
1
1
1
1
xmm,xmm
xmm,m
xmm,xmm
xmm,m
xmm,xmm
xmm,m
1
2
1
2
1
1
1
1
1
1
1
1
1
1
1
1
Logic
AND/ANDN/OR/XORPS/D xmm,xmm
AND/ANDN/OR/XORPS/D xmm,m128
1
1
1
1
x
x
14
6
141
119
13
4
Math
SQRTSS/PS
SQRTSS/PS
SQRTSD/PD
SQRTSD/PD
RSQRTSS/PS
RSQRTSS/PS
Other
LDMXCSR
STMXCSR
FXSAVE
FXRSTOR
Notes:
d)
g)
m32
m32
m4096
m4096
1
1
1
float
float
float
6-29
float
float
float
float
float
float
int
int
1
1
1
1
x
x
x
x
6-58
3
1
1
1
145
164
Round divisors give low values.

1
1
1
6-29
6-29
6-58
6-58
2
2
0,33
1
42
19
145
164
Wolfdale
Intel Core 2 (Wolfdale, 45nm)

Operands: i = immediate data, r = register, mm = 64 bit mmx register, xmm = 128 bit xmm register, (x)mm = mmx or xmm register, sr = segment register, m = memory, m32 = 32-bit memory operand, etc.
μops fused domain:

μops unfused domain: macro-ops count as one. The instruction has μop fusion if the sum of the numbers listed under p015 + p2 + p3 + p4 exceeds the number listed under μops fused domain. An x under p0, p1 or p5 means that at least one of the μops listed under p015 can optionally go to this port. For example, a 1 under p015 and an x under p0 and p5 means one μop which can go to either port 0 or port 5, whichever is vacant first. A value listed under p015 but nothing under p0, p1 and p5 means that it is not known which of the three ports these μops go to.
p015:
p0:
p1:
p5:
p2:
p3:
p4:
Unit: Tells which execution unit cluster is used. An additional delay of 1 clock cycle is generated if a register written by a μop in the integer unit (int) is read by a μop in the floating point unit (float) or vice versa. fltint means that an instruction with multiple μops receives the input in the float unit and delivers the output in the int unit. Delays for moving data between different units are included under latency when they are unavoidable. For example, movd eax,xmm0 has an extra 1 clock delay for moving from the XMM-integer unit to the general purpose integer unit. This is included under latency because it occurs regardless of which instruction comes next. Nothing listed under unit means that additional delays are either unlikely to occur or unavoidable and therefore included in the latency figure.
Latency:
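The column conventions above can be illustrated with a small sketch. It is not part of the measurements: the `Row` record, its field names and both helper functions are hypothetical, and the example rows use made-up numbers rather than values taken from the tables. The sketch assumes that an empty unit entry means no additional delay needs to be added.

```python
from dataclasses import dataclass


@dataclass
class Row:
    """One table row; a hypothetical record whose fields mirror the column headings."""
    fused: int      # μops fused domain
    p015: int = 0   # μops unfused domain, ports 0, 1 and 5 combined
    p2: int = 0     # μops listed under p2
    p3: int = 0     # μops listed under p3
    p4: int = 0     # μops listed under p4
    unit: str = ""  # "int", "float", "fltint", or "" when nothing is listed


def has_uop_fusion(row: Row) -> bool:
    """True if p015 + p2 + p3 + p4 exceeds the fused-domain count."""
    return row.p015 + row.p2 + row.p3 + row.p4 > row.fused


def unit_crossing_delay(producer: Row, consumer: Row) -> int:
    """Extra clock cycles when the consumer reads a register written by the producer."""
    # fltint: input is received in the float unit, output is delivered in the int unit.
    out_unit = "int" if producer.unit == "fltint" else producer.unit
    in_unit = "float" if consumer.unit == "fltint" else consumer.unit
    if not out_unit or not in_unit:
        return 0  # nothing listed: assume no additional delay to add
    return 1 if out_unit != in_unit else 0


# Illustrative rows with made-up numbers (not taken from the tables).
store_like = Row(fused=2, p015=1, p2=1, p3=1, p4=1, unit="int")
simple_alu = Row(fused=1, p015=1, unit="int")
float_op = Row(fused=1, p015=1, unit="float")

print(has_uop_fusion(store_like))                 # True: 4 unfused μops > 2 fused μops
print(has_uop_fusion(simple_alu))                 # False
print(unit_crossing_delay(simple_alu, float_op))  # 1
```

A dependency chain that alternates between int and float rows would pick up one such extra cycle at every crossing, in addition to the latencies listed in the table.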
Instruction
Operands

Unit
μops fused domain
μops unfused domain: p015 p0 p1 p5 p2 p3 p4
Latency
Reciprocal throughput
Move instructions
MOV
MOV a)
MOV a)
MOV
MOV
MOV
MOV
MOV
MOVNTI
MOVSX MOVZX
MOVSXD
MOVSX MOVZX
MOVSX MOVSXD
CMOVcc
CMOVcc
XCHG
XCHG
XLAT
PUSH
PUSH
PUSH
PUSH
PUSHF(D/Q)
PUSHA(D) i)
POP
POP
POP
POP
POPF(D/Q)
POPA(D) i)
LAHF SAHF
SALC i)
LEA a)
BSWAP
LDS LES LFS LGS LSS
PREFETCHNTA
PREFETCHT0/1/2
LFENCE
MFENCE
SFENCE
CLFLUSH
IN
OUT
ADD SUB
ADD SUB
ADD SUB
ADC SBB
ADC SBB
ADC SBB
CMP
r,r/i
r,m
m,r
m,i
r,sr
m,sr
sr,r
sr,m
m,r
1
1
1
1
1
2
8
8
2
r,r
r16/32,m
r64,m
r,r
r,m
r,r
r,m
m8
1
1
2
2
2
3
7
2
1
1
2
2
17
18
1
4
2
10
24
10
1
2
1
2
11
1
1
2
2
2
4
r,r/i
r,m
m,r/i
r,r/i
r,m
m,r/i
r,r/i
1
1
2
2
2
4
1
r
i
m
sr
r
(E/R)SP
m
sr
r,m
r
m
m
m
x
1
4
3
1
2
2
3
x
1
x
x
x
x
x
x
x
x
x
x
x
x
x
x
1
1
4
5
x
x
x
x
x
x
1
1
x
x
x
x
0,33
1
1
1
2
1
3
9
23
2
1
2
1
2
11
0,33
1
1
1
1
1
16
16
2
1
1
1
1
1
1
1
1
15
9
1
1
1
2
3
3
1
1
1
1
1
8
1
1
1
1
1
1
1
1
1
1
1
8
2
high b)
4
3
2
1
20
1
4
1
4
1
1
1
2
2
3
1
x
x
x
x
x
x
x
1
1
1
1
x
x
x
x
x
x
x
x
x
x
x
x
x
x
1
1
1
1
1
7
8
1
1,5
17
1
1
1
1
1
1
1
120
1
1
1
1
1
6
2
2
7
1
7
0,33
1
1
1
17
1
1
8
6
9
90
0,33
1
1
2
2
0,33
CMP
INC DEC NEG NOT
INC DEC NEG NOT
AAA AAS DAA DAS i)
AAD i)
AAM i)
MUL IMUL
MUL IMUL
MUL IMUL
MUL IMUL
IMUL
IMUL
IMUL
IMUL
IMUL
IMUL
MUL IMUL
MUL IMUL
MUL IMUL
MUL IMUL
IMUL
IMUL
IMUL
IMUL
IMUL
IMUL
DIV IDIV
DIV IDIV
DIV IDIV
DIV
IDIV
DIV IDIV
DIV IDIV
DIV IDIV
DIV
IDIV
CBW CWDE CDQE
CWD CDQ CQO
Logic instructions
AND OR XOR
AND OR XOR
AND OR XOR
TEST
TEST
SHR SHL SAR
SHR SHL SAR
ROR ROL
ROR ROL
RCR RCL
RCR
RCL
RCR RCL
m,r/i
r
m
r8
r16
r32
r64
r16,r16
r32,r32
r64,r64
r16,r16,i
r32,r32,i
r64,r64,i
m8
m16
m32
m64
r16,m16
r32,m32
r64,m64
r16,m16,i
r32,m32,i
r64,m64,i
r8
r16
r32
r64
r64
m8
m16
m32
m64
m64
r,r/i
r,m
m,r/i
r,r/i
m,r/i
r,i/cl
m,i/cl
r,i/cl
m,i/cl
r,1
r8,i/cl
r8,i/cl
r,i/cl
1
1
3
1
3
5
1
3
3
3
1
1
1
1
1
1
1
3
3
3
1
1
1
1
1
1
4
7
7
32-38
56-62
4
7
7
32
56
1
1
1
1
2
1
1
1
3
1
3
2
9
8
6
1
1
1
1
3
5
1
3
3
3
1
1
1
1
1
1
1
3
3
2
1
1
1
1
1
1
4
7
7
x
x
x
x
x
x
x
x
x
x
x
1
x
x
1
x
x
x
1
1
x
x
x
1
1
1
1
x
x
2
1
x
x
x
x
1
1
1
1
1
1
1
1
1
1
1
2
1
2
2
9
8
6
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
1
1
6
17
3
5
5
7
3
3
5
3
3
5
3
5
5
7
3
3
5
x
x
x
3
7
6
31
55
1
1
56-62
x
x
1
1 2 1
x x x
2 3 2
9 10 13
x x x
1 2
2 3 2
x x x
x x x
x x x
x x x
x
x
32-38
1
1
1
1
1
1
1
1
1
1
1
0,33
1
1
1
1
1,5
1,5
4
1
1
2
1
1
2
1
1,5
1,5
4
1
1
2
2
1
2
9-18 c)
14-22 c)
14-23 c)
18-57 c)
34-88 c)
9-18
14-22 c)
14-23 c)
34-88 c)
39-72 c)
1
1
1
1
1
1
1
1
1
1
6
1
1
1
6
1
6
2
12
11
11
0,33
1
1
0,33
1
0,5
1
1
1
2
RCR RCL
RCR
RCL
RCR RCL
SHLD SHRD
SHLD SHRD
BT
BT
BT
BTR BTS BTC
BTR BTS BTC
BTR BTS BTC
BSF BSR
BSF BSR
SETcc
SETcc
CLC STC CMC
CLD
STD
m,1
m8,i/cl
m8,i/cl
m,i/cl
r,r,i/cl
m,r,i/cl
r,r/i
m,r
m,i
r,r/i
m,r
m,i
r,r
r,m
r
m

JMP
short/near
JMP i)
far
JMP
r
JMP
m(near)
JMP
m(far)
Conditional jump
short/near
Fused compare/test and branch e,i)
J(E/R)CXZ
short
LOOP
short
LOOP(N)E
short
CALL
near
CALL i)
far
CALL
r
CALL
m(near)
CALL
m(far)
RETN
RETN
i
RETF
RETF
i
BOUND i)
r,m
INTO i)
String instructions
LODS
REP LODS
STOS
REP STOS
MOVS
REP MOVS
SCAS
REP(N)E SCAS
4
12
11
10
2
3
1
9
3
1
10
3
2
2
1
2
1
6
6
3
9
8
7
2
2
1
8
2
1
7
1
2
2
1
1
1
6
6
1
30
1
1
31
1
1
2
11
11
3
43
3
4
44
1
3
32
32
15
5
1
30
1
1
29
1
1
2
11
11
2
43
2
3
42
1
1
30
30
13
5
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
1
1
x
x
x
x
x
x
x
x
x
3
2
4+7n - 14+6n
4
2
8+5n - 20+1.2n
8
5
1
1
1
7+7n - 13+n
4
3
7+8n - 17+7n
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
1
1
1
1
1
1
1
1
1
1
1
1
7
14
13
13
2
7
1
1
4
1
1
1
1
1
1
1
1
1
1
5
6
2
1
1
1
1
1
1
1
0
0
1
2
1
1
1
x
x
x
1
1
0
0
1
2
1
1
2
2
2
1
1
1
1
1
1
1
1
0,33
3
14
1-2
76
1-2
1-2
68
1
1
1-2
5
5
2
75
2
2
75
2
2
78
78
8
3
1
1+5n - 21+3n
1
1
7+2n - 0.55n
5
1+3n - 0.63n
1
1
3+8n - 23+6n
CMPS
REP(N)E CMPS
Other
NOP (90)
Long NOP (0F 1F)
PAUSE
ENTER
ENTER
LEAVE
CPUID
RDTSC
RDPMC
Notes:
a)
b)
c)
e)
i)
7
5
7+10n - 7+9n
i,0
a,b
1
1
3
12
1
1
3
10
3
53-117
13
23
3
2+7n - 22+5n
x
x
x
x
x
x
x
x
x
1
0,33
1
8
8
1
53-211
32
54

Low values are for small results, high values for high results. The reciprocal
throughput is only slightly less than the latency.

Instruction
Move instructions
FLD
FLD
FLD
FBLD
FST(P)
FST(P)
FSTP
FBSTP
FXCH
FILD
FIST
FISTP
FISTTP g)
FLDZ
FLD1
FLDPI FLDL2E etc.
FCMOVcc
FNSTSW
FNSTSW
FLDCW
FNSTCW
FINCSTP FDECSTP
Operands
r
m32/64
m80
m80
r
m32/m64
m80
m80
r
m
m
m
m
r
AX
m16
m16
m16

Unit
μops fused domain
μops unfused domain: p015 p0 p1 p5 p2 p3 p4
1
1
4
40
1
1
7
171
1
1
2
3
3
1
2
2
2
1
2
2
3
1
1
1
2
38
1
3
167
0 f)
1
1
1
1
1
2
2
2
1
1
1
1
1
1
2
x
1
x
x
x
x
x
x
1
1
1
1
1
1
1
2
2
1
2
2
1
2
2
1
1
1
1
1
1
1
2
2
1
1
1
1
1
float
float
float
float
float
float
float
float
float
float
float
float
float
float
float
float
float
float
float
float
float
float
Latency
Reciprocal throughput
1
3
4
45
1
3
4
164
0
6
6
6
6
1
1
3
20
1
1
5
166
1
1
1
1
1
1
2
2
2
1
2
10
8
1
FFREE(P)
FNSAVE
FRSTOR
FADD(P) FSUB(R)(P)
FADD(P) FSUB(R)(P)
FMUL(P)
FMUL(P)
FDIV(R)(P)
FDIV(R)(P)
FABS
FCHS
FCOM(P) FUCOM
FCOM(P) FUCOM
FCOMPP FUCOMPP
FCOMI(P) FUCOMI(P)
FIADD FISUB(R)
FIMUL
FIDIV(R)
FICOM(P)
FTST
FXAM
FPREM
FPREM1
FRNDINT
Math
FSCALE
FXTRACT
FSQRT
FSIN
FCOS
FSINCOS
F2XM1
FYL2X FYL2XP1
FPTAN
FPATAN
Other
FNOP
WAIT
FNCLEX
FNINIT
Notes:
d)
f)
g)
r
m
m
2
141
78
2
95
51
r
m
r
m
r
m
1
1
1
1
1
1
1
1
1
1
2
1
2
2
2
2
1
1
26-29
28-35
17-19
1
1
1
1
1
1
1
1
1
1
2
1
2
2
2
2
1
1
r
m
r
m
m
m
m
1
2
4
15
x
x
x
x
x 7 23 23
x 27
1
1
1
1
1
1
1
1
1
1
float
float
float
1
1
5
2
2
6-21 d) 5-20 d)
6-21 d) 5-20 d)
1
1
1
1
1
1
43
~170
6-20
32-85
70-100
38-107
45
50-100
40-130
55-130
x
x
x
x
x
1
x
x
x
x
x
x
x
x
x
x
float
float
float
float
float
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
float
float
float
float
float
x
x
x
float
float
float
float
1
1
1
x
x
x
x
x
1
1
1
1
2
142
177
float
float
float
float
float
float
float
float
float
float
float
float
float
float
float
float
float
float
float
float
float
1
1
1
1
2
1
1
2
1
1
x
x
x
x
x
x
28
28
53-84
1
1
18-85
76-100
18-105
19
19
57-65
19-100
23-87
1
2
4
15
x
x
x
3
5
6-21
1
2
2
5-20 d)
2
1
1
13-40
18-41
10-22
1
1
15
63


Instruction
Move instructions
MOVD k)
MOVD k)
MOVD k)
MOVD k)
MOVQ
MOVQ
MOVQ
MOVDQA
MOVDQA
MOVDQA
MOVDQU
MOVDQU
LDDQU g)
MOVDQ2Q
MOVQ2DQ
MOVNTQ
MOVNTDQ
MOVNTDQA j)
PACKSSWB/DW
PACKUSWB
PACKSSWB/DW
PACKUSWB
PACKSSWB/DW
PACKUSWB
PACKSSWB/DW
PACKUSWB
PACKUSDW j)
PACKUSDW j)
PUNPCKH/LBW/WD/DQ
PUNPCKH/LBW/WD/DQ
PUNPCKH/LBW/WD/DQ
PUNPCKH/LBW/WD/DQ
PUNPCKH/LQDQ
PUNPCKH/LQDQ
PMOVSX/ZXBW j)
PMOVSX/ZXBW j)
PMOVSX/ZXBD j)
PMOVSX/ZXBD j)
PMOVSX/ZXBQ j)
PMOVSX/ZXBQ j)
PMOVSX/ZXWD j)
PMOVSX/ZXWD j)
PMOVSX/ZXWQ j)
PMOVSX/ZXWQ j)
PMOVSX/ZXDQ j)
PMOVSX/ZXDQ j)
Operands

Unit
μops fused domain
μops unfused domain: p015 p0 p1 p5 p2 p3 p4
r,(x)mm
m,(x)mm
(x)mm,r
(x)mm,m
v,v
(x)mm,m64
m64, (x)mm
xmm, xmm
xmm, m128
m128, xmm
m128, xmm
xmm, m128
xmm, m128
mm, xmm
xmm,mm
m64,mm
m128,xmm
xmm, m128
1
1
1
1
1
1
1
1
1
1
9
4
4
1
1
1
1
1
mm,mm
mm,m64
xmm,xmm
xmm,m128
xmm,xmm
xmm,m
mm,mm
mm,m64
xmm,xmm
xmm,m128
xmm,xmm
xmm, m128
xmm,xmm
xmm,m64
xmm,xmm
xmm,m32
xmm,xmm
xmm,m16
xmm,xmm
xmm,m64
xmm,xmm
xmm,m32
xmm,xmm
xmm,m64
1
1
1
1
1
1
1
1
2
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
int
1
int
int
int
int
1
1
x
1
1
int
int
1
4
2
2
1
1
x
x
x
x
x
x
x
x
x
x
x
x
1
2
2
1
2
1
2
int
int
int
int
1
1
int
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
0,33
1
0,5
1
0,33
1
1
0,33
1
1
4
2
2
0,33
0,33
2
2
1
int
int
2
3
2
2
1
2
3
1
2
3
3-8
2-8
2-8
1
1
1
1
Latency
Reciprocal throughput
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
PSHUFB h)
PSHUFB h)
PSHUFB h)
PSHUFB h)
PSHUFW
PSHUFW
PSHUFD
PSHUFD
PSHUFL/HW
PSHUFL/HW
PALIGNR h)
PALIGNR h)
PALIGNR h)
PALIGNR h)
PBLENDVB j)
PBLENDVB j)
PBLENDW j)
PBLENDW j)
MASKMOVQ
MASKMOVDQU
PMOVMSKB
PEXTRB j)
PEXTRB j)
PEXTRW
PEXTRW j)
PEXTRD j)
PEXTRD j)
PEXTRQ j,m)
PEXTRQ j,m)
PINSRB j)
PINSRB j)
PINSRW
PINSRW
PINSRD j)
PINSRD j)
PINSRQ j,m)
PINSRQ j,m)
PADD/SUB(U)(S)B/W/D
PADD/SUB(U)(S)B/W/D
PADDQ PSUBQ
PADDQ PSUBQ
PHADD(S)W
PHSUB(S)W h)
PHADD(S)W
PHSUB(S)W h)
PHADDD PHSUBD h)
PHADDD PHSUBD h)
PCMPEQ/GTB/W/D
PCMPEQ/GTB/W/D
PCMPEQQ j)
PCMPEQQ j)
mm,mm
mm,m64
xmm,xmm
xmm,m128
mm,mm,i
mm,m64,i
xmm,xmm,i
xmm,m128,i
xmm,xmm,i
x, m128,i
mm,mm,i
mm,m64,i
xmm,xmm,i
xmm,m128,i
x,x,xmm0
x,m,xmm0
xmm,xmm,i
xmm,m,i
mm,mm
xmm,xmm
r32,(x)mm
r32,xmm,i
m8,xmm,i
r32,(x)mm,i
m16,(x)mm,i
r32,xmm,i
m32,xmm,i
r64,xmm,i
m64,xmm,i
xmm,r32,i
xmm,m8,i
(x)mm,r32,i
(x)mm,m16,i
xmm,r32,i
xmm,m32,i
xmm,r64,i
xmm,m64,i
1
2
1
1
1
2
1
2
1
2
2
3
1
1
2
2
1
1
4
10
1
2
2
2
2
2
2
2
2
1
2
1
2
1
2
1
2
1
1
1
1
1
1
1
1
1
1
2
3
1
1
2
2
1
1
1
4
1
2
2
2
2
2
1
2
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
2
3
1
1
2
2
1
1
v,v
(x)mm,m
v,v
(x)mm,m
1
1
2
2
1
1
2
2
x
x
x
x
x
x
x
x
v,v
(x)mm,m64
v,v
(x)mm,m64
v,v
(x)mm,m
xmm,xmm
xmm,m128
4
3
4
1
1
1
1
3
3
3
1
1
1
1
1
1
1
x
x
2
2
2
x
x
1
1
1
1
1
x
x
x
?
x
x
3
x
x
x
?
x
x
x
x
x
1
x
1
x
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
2
1
2
1
3
1
1
1
1
1
1
1
1
1
1
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
1
1
1
1
2
1
2
1
2
3
3
3
3
3
1
2
1
1
3
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
2
2
1
1
2-5
6-10
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
0,5
1
1
1
2
2
2
2
0,5
1
1
1
PMULL/HW PMULHUW
v,v
PMULL/HW PMULHUW
(x)mm,m
PMULHRSW h)
v,v
PMULHRSW h)
(x)mm,m
PMULLD j)
xmm,xmm
PMULLD j)
xmm,m128
PMULDQ j)
xmm,xmm
PMULDQ j)
xmm,m128
PMULUDQ
v,v
PMULUDQ
(x)mm,m
PMADDWD
v,v
PMADDWD
(x)mm,m
PMADDUBSW h)
v,v
PMADDUBSW h)
(x)mm,m
PAVGB/W
v,v
PAVGB/W
(x)mm,m
PMIN/MAXSB j)
xmm,xmm
PMIN/MAXSB j)
xmm,m128
PMIN/MAXUB
v,v
PMIN/MAXUB
(x)mm,m
PMIN/MAXSW
v,v
PMIN/MAXSW
(x)mm,m
PMIN/MAXUW j)
xmm,xmm
PMIN/MAXUW j)
xmm,m
PMIN/MAXSD j)
xmm,xmm
PMIN/MAXSD j)
xmm,m128
PMIN/MAXUD j)
xmm,xmm
PMIN/MAXUD j)
xmm,m128
PHMINPOSUW j)
xmm,xmm
PHMINPOSUW j)
xmm,m128
PABSB PABSW PABSD h)
v,v
PABSB PABSW PABSD
h)
(x)mm,m
PSIGNB PSIGNW
PSIGND h)
v,v
PSIGNB PSIGNW
PSIGND h)
(x)mm,m
PSADBW
v,v
PSADBW
(x)mm,m
MPSADBW j)
xmm,xmm,i
MPSADBW j)
xmm,m,i
Logic instructions
PAND(N) POR PXOR
PAND(N) POR PXOR
PTEST j)
PTEST j)
PSLL/RL/RAW/D/Q
PSLL/RL/RAW/D/Q
PSLL/RL/RAW/D/Q
PSLL/RL/RAW/D/Q
PSLL/RL/RAW/D/Q
PSLL/RLDQ
v,v
(x)mm,m
xmm,xmm
xmm,m128
mm,mm/i
mm,m64
xmm,i
xmm,xmm
xmm,m128
xmm,i
1
1
1
1
4
6
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
4
4
1
1
1
1
1
4
5
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
4
4
1
1
1
1
1
2
2
1
1
1
1
1
1
1
1
x
x
1
1
x
x
x
x
1
1
1
2
2
1
1
1
1
1
x
x
1
1
x
x
x
x
1
1
1
1
1
1
1
1
1
1
4
4
x
1
1
1
3
4
1
1
1
3
3
1
1
2
2
1
1
1
2
3
1
1
1
2
2
1
1
1
2
2
1
x
x
1
1
1
1
1
x
x
x
x
1
1
1
1
2
2
x
x
x
x
x
x
x
x
1
1
1
3
3
5
5
3
3
3
3
1
1
1
1
1
1
1
4
1
int
int
x
x
x
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
1
1
1
1
2
4
1
1
1
1
1
1
1
1
0,5
1
1
1
0,5
1
0,5
1
1
1
1
1
1
1
4
4
0,5
1
3
5
1
1
1
1
2
1
0,5
1
1
1
2
2
0,33
1
1
1
1
1
1
1
1
1
Other
EMMS
11
11
float
Notes:
g)
h)
j) SSE4.1 instruction set.
k) MASM uses the name MOVD rather than MOVQ for this instruction even when moving 64 bits.
m) Only available in 64 bit mode.

Instruction
Move instructions
MOVAPS/D
MOVAPS/D
MOVAPS/D
MOVUPS/D
MOVUPS/D
MOVSS/D
MOVSS/D
MOVSS/D
MOVHPS/D MOVLPS/D
MOVHPS/D
MOVLPS/D
MOVLHPS MOVHLPS
MOVMSKPS/D
MOVNTPS/D
SHUFPS
SHUFPS
SHUFPD
SHUFPD
BLENDPS/PD j)
BLENDPS/PD j)
BLENDVPS/PD j)
BLENDVPS/PD j)
MOVDDUP g)
MOVDDUP g)
MOVSH/LDUP g)
MOVSH/LDUP g)
UNPCKH/LPS
UNPCKH/LPS
UNPCKH/LPD
UNPCKH/LPD
EXTRACTPS j)
EXTRACTPS j)
INSERTPS j)
INSERTPS j)
Operands
xmm,xmm
xmm,m128
m128,xmm
xmm,m128
m128,xmm
xmm,xmm
x,m32/64
m32/64,x
xmm,m64
m64,xmm
m64,xmm
xmm,xmm
r32,xmm
m128,xmm
xmm,xmm,i
xmm,m128,i
xmm,xmm,i
xmm,m128,i
xmm,xmm,i
xmm,m128,i
x,x,xmm0
x,m,xmm0
xmm,xmm
xmm,m64
xmm,xmm
xmm,m128
xmm,xmm
xmm,m128
xmm,xmm
xmm,m128
r32,xmm,i
m32,xmm,i
xmm,xmm,i
xmm,m32,i

Unit
μops fused domain
μops unfused domain: p015 p0 p1 p5 p2 p3 p4
1
1
1
4
9
1
1
1
2
2
1
1
1
1
1
2
1
2
1
1
2
2
1
2
1
2
1
1
1
2
2
2
1
2
int
int
1
1
2
4
1
1
x
x
x
x
1
x
x
2
1
1
int
2
int
int
1
1
1
1
1
1
1
1
1
1
int
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
Conversion
x
1
1
1
1
1
1
2
3
2-4
3-4
1
2
3
3
5
3
1
1
1
1
1
2
2
1
1
x
1
1
float
float
1
1
1
1
1
1
1
2
2
1
1
1
1
1
1
1
1
2
1
1
1
Latency
Reciprocal throughput
int
int
float
float
int
int
int
int
int
int
int
int
int
int
float
float
int
int
int
int
1
1
1
2
1
1
1
1
4
1
0,33
1
1
2
4
0,33
1
1
1
1
1
1
1
2-3
1
1
1
1
1
1
2
2
1
1
1
1
1
1
1
1
1
1
1
1
CVTPD2PS
CVTPD2PS
CVTSD2SS
CVTSD2SS
CVTPS2PD
CVTPS2PD
CVTSS2SD
CVTSS2SD
CVTDQ2PS
CVTDQ2PS
CVT(T)PS2DQ
CVT(T)PS2DQ
CVTDQ2PD
CVTDQ2PD
CVT(T)PD2DQ
CVT(T)PD2DQ
CVTPI2PS
CVTPI2PS
CVT(T)PS2PI
CVT(T)PS2PI
CVTPI2PD
CVTPI2PD
CVT(T)PD2PI
CVT(T)PD2PI
CVTSI2SS
CVTSI2SS
CVT(T)SS2SI
CVT(T)SS2SI
CVTSI2SD
CVTSI2SD
CVT(T)SD2SI
CVT(T)SD2SI
xmm,xmm
xmm,m128
xmm,xmm
xmm,m64
xmm,xmm
xmm,m64
xmm,xmm
xmm,m32
xmm,xmm
xmm,m128
xmm,xmm
xmm,m128
xmm,xmm
xmm,m64
xmm,xmm
xmm,m128
xmm,mm
xmm,m64
mm,xmm
mm,m128
xmm,mm
xmm,m64
mm,xmm
mm,m128
xmm,r32
xmm,m32
r32,xmm
r32,m32
xmm,r32
xmm,m32
r32,xmm
r32,m64
2
2
2
2
2
2
2
2
1
1
1
1
2
2
2
2
1
1
1
1
2
2
2
2
1
1
1
1
2
2
1
1
2
2
2
2
2
2
2
2
1
1
1
1
2
2
2
2
1
1
1
1
2
2
2
2
1
1
1
1
2
1
1
1
Arithmetic
ADDSS/D SUBSS/D
ADDSS/D SUBSS/D
ADDPS/D SUBPS/D
ADDPS/D SUBPS/D
ADDSUBPS/D g)
ADDSUBPS/D g)
HADDPS HSUBPS g)
HADDPS HSUBPS g)
HADDPD HSUBPD g)
HADDPD HSUBPD g)
MULSS
MULSS
MULSD
MULSD
MULPS
MULPS
MULPD
MULPD
DIVSS
xmm,xmm
x,m32/64
xmm,xmm
xmm,m128
xmm,xmm
xmm,m128
xmm,xmm
xmm,m128
xmm,xmm
xmm,m128
xmm,xmm
xmm,m32
xmm,xmm
xmm,m64
xmm,xmm
xmm,m128
xmm,xmm
xmm,m128
xmm,xmm
1
1
1
1
1
1
3
4
3
4
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
3
3
3
3
1
1
1
1
1
1
1
1
1
1
1
1
1
2
2
2
2
1
1
1
1
1
1
1
1
x
x
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
x
x
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
2
2
x
x
1
1
1
1
1
1
float
float
float
float
float
float
float
float
float
float
float
float
float
float
float
float
float
float
float
float
float
float
float
float
float
float
float
float
float
float
float
float
float
float
float
float
float
float
float
float
float
float
float
float
float
float
float
float
float
float
float
4
2
2
3
3
4
4
3
3
4
4
4
3
4
3
1
1
1
1
2
2
2
2
1
1
1
1
1
1
1
1
3
3
1
1
1
1
1
1
3
3
1
1
3
3
1
1
1
1
3
1
1
3
1
1
7
3
3
6
1,5
1,5
4
1
1
5
1
1
4
1
1
5
1
1
6-13 d) 5-12 d)
DIVSS
DIVSD
DIVSD
DIVPS
DIVPS
DIVPD
DIVPD
RCPSS/PS
RCPSS/PS
CMPccSS/D
CMPccSS/D
CMPccPS/D
CMPccPS/D
COMISS/D UCOMISS/D
COMISS/D UCOMISS/D
MAXSS/D MINSS/D
MAXSS/D MINSS/D
MAXPS/D MINPS/D
MAXPS/D MINPS/D
ROUNDSS/D j)
ROUNDSS/D j)
ROUNDPS/D j)
ROUNDPS/D j)
DPPS j)
DPPS j)
DPPD j)
DPPD j)
Math
SQRTSS/PS
SQRTSS/PS
SQRTSD/PD
SQRTSD/PD
RSQRTSS/PS
RSQRTSS/PS
xmm,m32
xmm,xmm
xmm,m64
xmm,xmm
xmm,m128
xmm,xmm
xmm,m128
xmm,xmm
xmm,m
xmm,xmm
x,m32/64
xmm,xmm
xmm,m128
xmm,xmm
x,m32/64
xmm,xmm
x,m32/64
xmm,xmm
xmm,m128
xmm,xmm,i
xmm,m128,i
xmm,xmm,i
xmm,m128,i
xmm,xmm,i
xmm,m128,i
xmm,xmm,i
xmm,m128,i
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
4
4
4
4
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
4
4
4
4
1
1
1
1
1
1
1
xmm,xmm
xmm,m
xmm,xmm
xmm,m
xmm,xmm
xmm,m
1
2
1
2
1
1
1
1
1
1
1
1
1
1
1
1
xmm,xmm
xmm,m128
1
1
1
1
x
x
x
x
x
x
m32
m32
m4096
m4096
13
10
151
121
12
8
67
74
x
x
x
x
x
x
x
x
x 1
x
1 1
x 8 38 38
x 47
2
2
x
x
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
2
2
x
x
1
1
1
1
1
1
1
1
1
x
x
5-12 d)
6-21 d) 5-20 d)
5-20 d)
6-13 d) 5-12 d)
5-12 d)
6-21 d) 5-20 d)
5-20 d)
3
2
2
3
1
1
3
1
1
3
1
1
3
1
1
3
1
1
3
1
1
3
1
1
11
3
3
9
3
3
6-13
float
float
float
float
float
float
int
int
1
1
1
1
float
float
float
float
float
float
float
float
float
float
float
float
float
float
float
float
float
float
float
float
float
float
float
float
float
float
float
6-20
3
5-12
5-12
5-19
5-19
2
2
Logic
AND/ANDN/OR/XORPS/D
AND/ANDN/OR/XORPS/D
Other
LDMXCSR
STMXCSR
FXSAVE
FXRSTOR
Notes:
d) Round divisors give low values.
g)

0,33
1
38
20
145
150
Nehalem
Intel Nehalem
Operands:
μops fused domain:

μops unfused domain: macro-ops count as one. The instruction has μop fusion if the sum of the numbers listed under p015 + p2 + p3 + p4 exceeds the number listed under μops fused domain. An x under p0, p1 or p5 means that at least one of the μops
p015:
p0:
p1:
p5:
p2:
p3:
p4:
Domain:
Tells which execution unit domain is used: "int" = integer unit (general purpose registers), "ivec" = integer vector unit (SIMD), "fp" = floating point unit (XMM and x87 floating point). An additional "bypass delay" is generated if a register written by a μop in one domain is read by a μop in another domain. The bypass delay is 1 clock cycle between the "int" and "ivec" units, and 2 clock cycles between the "int" and "fp" units and between the "ivec" and "fp" units.
The bypass delay is indicated under latency only where it is unavoidable because either the source operand or the destination operand is in an unnatural domain, such as a general purpose register (e.g. eax) in the "ivec" domain. For example, the PEXTRW instruction executes in the "int" domain. The source operand is an xmm register and the destination operand is a general purpose register. The latency for this instruction is indicated as 2+1, where 2 is the latency of the instruction itself and 1 is the bypass delay, assuming that the xmm operand is most likely to come from the "ivec" domain. If the xmm operand comes from the "fp" domain then the bypass delay will be 2 rather than 1.
The flags register can also have a bypass delay. For example, the COMISS instruction (floating point compare) executes in the "fp" domain and returns the result in the integer flags. Almost all instructions that read these flags execute in the "int" domain. Here the latency is indicated as 1+2, where 1 is the latency of the instruction itself and 2 is the bypass delay from the "fp" domain to the "int" domain.
The bypass delays from the memory read unit to any other unit and from any unit to the memory write unit are included in the latency figures in the table.
Where the domain is not listed, the bypass delays are either unlikely to occur or unavoidable and therefore included in the latency figure.
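The bypass delay rules described above can be written down as a short sketch. The function and table names are hypothetical and chosen only for illustration. The parser handles simple latency entries such as "3" or "2+1"; it does not handle ranges or approximate entries such as "7-27" or "~270".

```python
# Bypass delays in clock cycles between execution domains, as stated in the text.
BYPASS_DELAY = {
    frozenset(("int", "ivec")): 1,
    frozenset(("int", "fp")): 2,
    frozenset(("ivec", "fp")): 2,
}


def bypass_delay(src_domain: str, dst_domain: str) -> int:
    """Delay when a register written in src_domain is read in dst_domain."""
    if src_domain == dst_domain:
        return 0
    return BYPASS_DELAY[frozenset((src_domain, dst_domain))]


def parse_latency(entry: str) -> tuple[int, int]:
    """Split an entry such as "2+1" into (instruction latency, listed bypass delay)."""
    base, _, extra = entry.partition("+")
    return int(base), int(extra) if extra else 0


def effective_latency(entry: str, src_domain: str, dst_domain: str) -> int:
    """Instruction latency plus the bypass delay for the domains actually involved."""
    base, _listed = parse_latency(entry)
    return base + bypass_delay(src_domain, dst_domain)


# PEXTRW is listed as 2+1, assuming the xmm operand comes from the "ivec" domain.
print(effective_latency("2+1", "ivec", "int"))  # 3
# The same instruction with an xmm operand coming from the "fp" domain.
print(effective_latency("2+1", "fp", "int"))    # 4
# COMISS is listed as 1+2: executes in "fp", flags are read in "int".
print(effective_latency("1+2", "fp", "int"))    # 3
```

Keeping the listed bypass delay separate from the instruction latency makes it possible to substitute the delay for the domain the operand really comes from, which is the adjustment the text describes for PEXTRW.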
Latency:
Instruction
Move instructions
MOV
MOV a)
MOV a)
MOV
MOV
MOV
MOV
MOV
MOVNTI
MOVSX MOVZX
MOVSXD
MOVSX MOVZX
MOVSXD
CMOVcc
CMOVcc
XCHG
XCHG
XLAT
PUSH
PUSH
PUSH
PUSH
PUSHF(D/Q)
PUSHA(D) i)
POP
POP
POP
POP
POPF(D/Q)
POPA(D) i)
LAHF SAHF
SALC i)
LEA a)
BSWAP
BSWAP
LDS LES LFS LGS LSS
PREFETCHNTA
Operands

μops fused domain
μops unfused domain: p015 p0 p1 p5 p2 p3 p4
Domain
r,r/i
r,m
m,r
m,i
r,sr
m,sr
sr,r
sr,m
m,r
1
1
1
1
1
2
6
6
2
r,r
r,m
r,r
r,m
r,r
r,m
1
2
2
3
7
2
1
1
2
2
3
18
1
3
2
7
8
10
1
2
1
1
1
9
1
r
i
m
sr
r
(E/R)SP
m
sr
r,m
r32
r64
m
m
x
1
3
2
x
x
x
x
1
1
3
4
1
1
1
1
x
1
2
2
3
x
1
x
x
x
x
x
x
x
x
x
1
1
1
1
1
2
2
x
x
x
1
x
x
2
7
2
1
2
1
1
1
3
x
x
x
x
1
1
1
x
x
x
1
1
1
5
1
8
6
1
1
1
1
1
1
1
8
1
1
1
1
1
1
8
Latency
Reciprocal throughput
int
int
int
int
int
int
int
int
int
~270
0.33
1
1
1
1
1
13
14
1
int
0.33
1
1
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
1
2
3
3
2
20 b)
5
3
1
4
1
1
3
2
1
1
1
1
1
1
8
1
5
1
15
14
8
0.33
1
1
1
1
15
1
PREFETCHT0/1/2
LFENCE
MFENCE
SFENCE
ADD SUB
ADD SUB
ADD SUB
ADC SBB
ADC SBB
ADC SBB
CMP
CMP
INC DEC NEG NOT
INC DEC NEG NOT
AAA AAS DAA DAS i)
AAD i)
AAM i)
MUL IMUL
MUL IMUL
MUL IMUL
MUL IMUL
IMUL
IMUL
IMUL
IMUL
IMUL
IMUL
MUL IMUL
MUL IMUL
MUL IMUL
MUL IMUL
IMUL
IMUL
IMUL
IMUL
IMUL
IMUL
DIV c)
DIV c)
DIV c)
DIV c)
IDIV c)
IDIV c)
IDIV c)
IDIV c)
CBW CWDE CDQE
CWD CDQ CQO
POPCNT )
POPCNT )
CRC32 )
CRC32 )
r,r/i
r,m
m,r/i
r,r/i
r,m
m,r/i
r,r/i
m,r/i
r
m
r8
r16
r32
r64
r16,r16
r32,r32
r64,r64
r16,r16,i
r32,r32,i
r64,r64,i
m8
m16
m32
m64
r16,m16
r32,m32
r64,m64
r16,m16,i
r32,m32,i
r64,m64,i
r8
r16
r32
r64
r8
r16
r32
r64
r,r
r,m
r,r
r,m
1
2
3
2
1
1
2
2
2
4
1
1
1
3
1
3
5
1
3
3
3
1
1
1
1
1
1
1
3
3
3
1
1
1
1
1
1
4
6
6
~40
4
8
7
~60
1
1
1
1
1
1
1
1
1
2
2
3
1
1
1
1
1
3
5
1
3
3
3
1
1
1
1
1
1
1
3
3
2
1
1
1
1
1
1
4
6
6
x
4
8
7
x
1
1
1
1
1
1
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
1
x
x
1
x
x
x
1
1
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
1
1
1
1
1
1
1
1
1
x
x
x
x
x
1
1
1
1
x
x
2
1
x
x
x
x
1
1
1
1
1
1
1
x
x
x
1
x
x
x
x
x
2
4
3
x
2
5
3
x
x
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
x
x
x
1
x
x
x
x
x
1
1
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
1
9
23
5
1
6
2
2
7
1
1
1
6
3
15
20
3
5
5
3
3
3
3
3
3
3
3
5
5
3
3
3
3
11-21
17-22
17-28
28-90
10-22
18-23
17-28
37-100
1
1
3
3
0.33
1
1
2
2
0.33
1
0.33
1
1
2
7
1
2
2
2
1
1
1
1
1
2
1
2
2
2
1
1
1
1
1
1
7-11
7-12
7-17
19-69
7-11
7-12
7-17
26-86
1
1
1
1
1
1
Logic instructions
AND OR XOR
AND OR XOR
AND OR XOR
TEST
TEST
SHR SHL SAR
SHR SHL SAR
ROR ROL
ROR ROL
RCR RCL
RCR
RCL
RCR RCL
RCR RCL
RCR
RCL
RCR RCL
SHLD
SHLD
SHRD
SHRD
BT
BT
BT
BTR BTS BTC
BTR BTS BTC
BTR BTS BTC
BSF BSR
BSF BSR
SETcc
SETcc
CLC STC CMC
CLD
STD
r,r/i
r,m
m,r/i
r,r/i
m,r/i
r,i/cl
m,i/cl
r,i/cl
m,i/cl
r,1
r8,i/cl
r8,i/cl
r16/32/64,i/cl
m,1
m8,i/cl
m8,i/cl
m16/32/64,i/cl
r,r,i/cl
m,r,i/cl
r,r,i/cl
m,r,i/cl
r,r/i
m,r
m,i
r,r/i
m,r
m,i
r,r
r,m
r
m

JMP
short/near
JMP i)
far
JMP
r
JMP
m(near)
JMP
m(far)
Conditional jump
short/near
Fused compare/test and branch e)
J(E/R)CXZ
short
LOOP
short
LOOP(N)E
short
CALL
near
CALL i)
far
CALL
r
CALL
m(near)
CALL
m(far)
1
1
2
1
1
1
3
1
3
2
9
8
6
4
12
11
10
2
3
2
3
1
9
2
1
10
3
1
2
1
2
1
2
2
1
1
1
1
1
1
2
1
2
2
9
8
6
3
9
8
7
2
2
2
2
1
8
2
1
7
3
1
1
1
1
1
2
2
1
31
1
1
31
1
1
2
6
11
2
46
3
4
47
1
31
1
1
31
1
1
2
6
11
2
46
2
3
47
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
1
1
x
x
x
x
x
x
x
x
x
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
x
x
x
x
x
1
1
1
x
x
x
?
x
x
x
?
1
11
1
1
1
x
x
1
1
1
1
1
9
?
?
?
?
1
1
1
1
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
1
6
1
1
6
1
6
2
13
11
12-13
7
16
14
15
3
8
4
9
1
1
6
6
3
3
1
1
0
0
0
0
0
0.33
1
1
0.33
1
0.5
1
1
1
2
12-13
1
1
1
5
1
1
1
1
1
1
0.33
4
5
2
67
2
2
73
2
2
2
4
7
2
74
2
2
79
RETN
RETN
RETF
RETF
BOUND i)
INTO i)
String instructions
LODS
REP LODS
STOS
REP STOS
REP STOS
MOVS
REP MOVS
REP MOVS
SCAS
REP SCAS
CMPS
REP CMPS
Other
NOP (90)
Long NOP (0F 1F)
PAUSE
ENTER
ENTER
LEAVE
CPUID
RDTSC
RDPMC
Notes:
a)
b)
c)
e)
i)
)
i
i
r,m
small n
large n
small n
large n
a,0
a,b
1
3
39
40
15
4
1
2
39
40
13
4
1
1
int
int
int
int
int
int
2
1
11+4n
3
1
60+n
2.5/16 bytes
5
2
13+6n
2/16 bytes
3
2
37+6n
5
3
65+8n
1
1
5
11
34+7b
3
25-100
22
28
1
1
5
9
1
1
x
x
x
x
x
x
x
x
x
x
x
x
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
2
2
120
124
7
5
1
40+12n
1
12+n
1 clk / 16 bytes
4
12+n
1 clk / 16 bytes
1
40+2n
4
42+2n
0.33
1
9
8
79+5b
~200
5
~200
24
40-60

Low values are for small results, high values for high results.
SSE4.2 instruction set.

Instruction
Move instructions
FLD
FLD
FLD
FBLD
FST(P)
FST(P)
Operands
r
m32/64
m80
m80
r
m32/m64

μops fused domain
μops unfused domain: p015 p0 p1 p5 p2 p3 p4
Domain
1
1
4
41
1
1
1
1
2
38
1
1
1
x
1
1
x
1
2
3
1
float
float
float
float
float
float
Latency
Reciprocal throughput
1
3
4
45
1
4
1
1
2
20
1
1
FSTP
FBSTP
FXCH
FILD
FIST(P)
FISTTP g)
FLDZ
FLD1
FLDPI FLDL2E etc.
FCMOVcc
FNSTSW
FNSTSW
FLDCW
FNSTCW
FINCSTP FDECSTP
FFREE(P)
FNSAVE
FRSTOR
FADD(P) FSUB(R)(P)
FADD(P) FSUB(R)(P)
FMUL(P)
FMUL(P)
FDIV(R)(P)
FDIV(R)(P)
FABS
FCHS
FCOM(P) FUCOM
FCOM(P) FUCOM
FCOMPP FUCOMPP
FCOMI(P) FUCOMI(P)
FIADD FISUB(R)
FIMUL
FIDIV(R)
FICOM(P)
FTST
FXAM
FPREM
FPREM1
FRNDINT
Math
FSCALE
FXTRACT
FSQRT
FSIN
FCOS
FSINCOS
F2XM1
FYL2X FYL2XP1
FPTAN
FPATAN
m80
m80
r
m
m
m
r
AX
m16
m16
m16
r
m
m
r
m
r
m
r
m
r
m
r
m
m
m
m
7
208
1
1
3
3
1
2
2
2
2
3
2
2
1
2
143
79
3
204
0 f)
1
1
1
1
2
2
2
2
2
1
1
1
2
89
52
1
1
1
1
1
1
1
1
1
1
2
1
2
2
2
2
1
1
25
35
17
1
1
1
1
1
1
1
1
1
1
2
1
2
2
2
2
1
1
25
35
17
24
17
1
~100
~100
~100
19
~55
~100
~82
24
17
1
~100
~100
~100
19
~55
~100
~82
x
x
x
x
x
x
1
1
1
1
1
2
2
2
2
1
1
1
1
1
2
2
1
1
1
1
x
x
x
x
x
x
x
x 8 23 23
x 27
1
1
1
1
1
1
1
1
1
1
x
x
x
x
x
1
x
x
x
x
x
x
x
1
1
1
1
1
1
2
1
1
2
1
1
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
1
1
1
1
float
float
float
float
float
float
float
float
float
float
float
float
float
float
float
float
float
float
5
242
0
6
7
7
2+2
7
5
1
178
156
5
245
1
1
1
1
1
2
2
2
1
2
31
1
1
4
178
156
float
3
1
float
1
float
5
1
float
1
float 7-27 d) 7-27 d)
float 7-27 d) 7-27 d)
float
1
1
float
1
1
float
1
float
1
float
1
float
1
float
3
2
float
5
2
float 7-27 d) 7-27 d)
float
1
float
1
float
1
float
14
float
19
float
22
float
12
float
13
float
~27
float 40-100
float 40-100
float ~110
float
58
float
~80
float ~115
float ~120
Other
FNOP
WAIT
FNCLEX
FNINIT
Notes:
d)
f)
g)
1
1
1
2
2
x
3
3
~190 ~190 x
x
x
x
float
float
float
float
x
x
x
1
1
17
77


Instruction
Move instructions
MOVD k)
MOVD k)
MOVD k)
MOVD k)
MOVQ
MOVQ
MOVQ
MOVDQA
MOVDQA
MOVDQA
MOVDQU
MOVDQU
LDDQU g)
MOVDQ2Q
MOVQ2DQ
MOVNTQ
MOVNTDQ
MOVNTDQA j)
PACKSSWB/DW
PACKUSWB
PACKSSWB/DW
PACKUSWB
PACKSSWB/DW
PACKUSWB
PACKSSWB/DW
PACKUSWB
PACKUSDW j)
PACKUSDW j)
PUNPCKH/LBW/WD/DQ
PUNPCKH/LBW/WD/DQ
PUNPCKH/LQDQ
PUNPCKH/LQDQ
PMOVSX/ZXBW j)
PMOVSX/ZXBW j)
PMOVSX/ZXBD j)
PMOVSX/ZXBD j)
Operands

μops fused domain
μops unfused domain: p015 p0 p1 p5 p2 p3 p4
Domain
r32/64,(x)mm
m32/64,(x)mm
(x)mm,r32/64
(x)mm,m32/64
(x)mm, (x)mm
(x)mm,m64
m64, (x)mm
xmm, xmm
xmm, m128
m128, xmm
xmm, m128
m128, xmm
xmm, m128
mm, xmm
xmm,mm
m64,mm
m128,xmm
xmm, m128
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
mm,mm
mm,m64
xmm,xmm
xmm,m128
xmm,xmm
xmm,m
(x)mm, (x)mm
(x)mm,m
xmm,xmm
xmm, m128
xmm,xmm
xmm,m64
xmm,xmm
xmm,m32
1
1
1
1
1
1
2
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
int
1
1
ivec
1
1
ivec
1
1
ivec
1
1
1
1
1
1
1
1
x
x
x
x
x
x
ivec
ivec
1
1
1
1
ivec
Latency
Reciprocal throughput
1+1
3
1+1
2
1
2
3
1
2
3
2
3
2
1
1
~270
~270
2
0.33
1
0.33
1
0.33
1
1
0.33
1
1
1
1
1
0.33
0.33
2
2
1
2
ivec
1
ivec
ivec
ivec
ivec
ivec
1
1
1
1
1
0.5
2
2
2
0.5
2
0.5
1
1
2
1
2
PMOVSX/ZXBQ j)
PMOVSX/ZXBQ j)
PMOVSX/ZXWD j)
PMOVSX/ZXWD j)
PMOVSX/ZXWQ j)
PMOVSX/ZXWQ j)
PMOVSX/ZXDQ j)
PMOVSX/ZXDQ j)
PSHUFB h)
PSHUFB h)
PSHUFW
PSHUFW
PSHUFD
PSHUFD
PSHUFL/HW
PSHUFL/HW
PALIGNR h)
PALIGNR h)
PBLENDVB j)
PBLENDVB j)
PBLENDW j)
PBLENDW j)
MASKMOVQ
MASKMOVDQU
PMOVMSKB
PEXTRB j)
PEXTRB j)
PEXTRW
PEXTRW j)
PEXTRD j)
PEXTRD j)
PEXTRQ j,m)
PEXTRQ j,m)
PINSRB j)
PINSRB j)
PINSRW
PINSRW
PINSRD j)
PINSRD j)
PINSRQ j,m)
PINSRQ j,m)
PADD/SUB(U)
(S)B/W/D/Q
PADD/SUB(U)
(S)B/W/D/Q
PHADD/SUB(S)W/D h)
PHADD/SUB(S)W/D h)
PCMPEQ/GTB/W/D
PCMPEQ/GTB/W/D
PCMPEQQ j)
PCMPEQQ j)
xmm,xmm
xmm,m16
xmm,xmm
xmm,m64
xmm,xmm
xmm,m32
xmm,xmm
xmm,m64
(x)mm, (x)mm
(x)mm,m
mm,mm,i
mm,m64,i
xmm,xmm,i
xmm,m128,i
xmm,xmm,i
xmm, m128,i
(x)mm,(x)mm,i
(x)mm,m,i
x,x,xmm0
xmm,m,xmm0
xmm,xmm,i
xmm,m,i
mm,mm
xmm,xmm
r32,(x)mm
r32,xmm,i
m8,xmm,i
r32,(x)mm,i
m16,(x)mm,i
r32,xmm,i
m32,xmm,i
r64,xmm,i
m64,xmm,i
xmm,r32,i
xmm,m8,i
(x)mm,r32,i
(x)mm,m16,i
xmm,r32,i
xmm,m32,i
xmm,r64,i
xmm,m64,i
1
1
1
1
1
1
1
1
1
2
1
2
1
2
1
2
1
2
2
3
1
2
4
10
1
2
2
2
2
2
2
2
2
1
2
1
2
1
2
1
2
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
2
2
1
1
1
4
1
2
2
2
2
2
1
2
1
1
1
1
1
1
1
1
1
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
1
1
x
x
1
x
1
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
(x)mm, (x)mm
(x)mm,m
(x)mm, (x)mm
(x)mm,m64
(x)mm,(x)mm
(x)mm,m
xmm,xmm
xmm,m128
1
3
4
1
1
1
1
1
3
3
1
1
1
1
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
1
1
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
ivec
ivec
ivec
ivec
ivec
ivec
ivec
ivec
ivec
ivec
ivec
ivec
ivec
float
ivec
2+2
2+1
ivec
2+1
ivec
2+1
ivec
2+1
ivec
1+1
ivec
1+1
ivec
1+1
ivec
1+1
ivec
ivec
ivec
ivec
1
1
1
1
1
1
1
1
1
1
1
1
2
1
2
1
x
1
1
1
1
1
1
1
1
1
2
1
2
1
2
1
2
0.5
1
0.5
1
0.5
1
0.5
1
1
1
1
1
0.5
1
2
7
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
0.5
2
1.5
3
0.5
2
0.5
2
PCMPGTQ l)
PCMPGTQ l)
PMULL/HW PMULHUW
PMULL/HW PMULHUW
PMULHRSW h)
PMULHRSW h)
PMULLD j)
PMULLD j)
PMULDQ j)
PMULDQ j)
PMULUDQ
PMULUDQ
PMADDWD
PMADDWD
PMADDUBSW h)
PMADDUBSW h)
PAVGB/W
PAVGB/W
PMIN/MAXSB j)
PMIN/MAXSB j)
PMIN/MAXUB
PMIN/MAXUB
PMIN/MAXSW
PMIN/MAXSW
PMIN/MAXUW j)
PMIN/MAXUW j)
PMIN/MAXU/SD j)
PMIN/MAXU/SD j)
PHMINPOSUW j)
PHMINPOSUW j)
PABSB PABSW PABSD
h)
PABSB PABSW PABSD
h)
PSIGNB PSIGNW
PSIGND h)
PSIGNB PSIGNW
PSIGND h)
PSADBW
PSADBW
MPSADBW j)
MPSADBW j)
PCLMULQDQ n)
AESDEC, AESDECLAST,
AESENC, AESENCLAST
n)
AESIMC n)
AESKEYGENASSIST n)
Logic instructions
PAND(N) POR PXOR
PAND(N) POR PXOR
xmm,xmm
xmm,m128
(x)mm,(x)mm
(x)mm,m
(x)mm,(x)mm
(x)mm,m
xmm,xmm
xmm,m128
xmm,xmm
xmm,m128
(x)mm,(x)mm
(x)mm,m
(x)mm,(x)mm
(x)mm,m
(x)mm,(x)mm
(x)mm,m
(x)mm,(x)mm
(x)mm,m
xmm,xmm
xmm,m128
(x)mm,(x)mm
(x)mm,m
(x)mm,(x)mm
(x)mm,m
xmm,xmm
xmm,m
xmm,xmm
xmm,m128
xmm,xmm
xmm,m128
1
1
1
1
1
1
2
3
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
2
2
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
2
2
1
1
1
1
1
1
1
1
(x)mm,(x)mm
(x)mm,m
(x)mm,(x)mm
(x)mm,m
(x)mm,(x)mm
(x)mm,m
xmm,xmm,i
xmm,m,i
xmm,xmm,i
1
1
1
3
4
1
1
1
3
3
x
x
x
x
x
x
x
x
x
x
x
x
ivec
ivec
ivec
ivec
ivec
ivec
ivec
ivec
ivec
ivec
ivec
ivec
ivec
ivec
ivec
1
1
1
1
1
1
1
1
1
1
1
1
1
1
ivec
ivec
ivec
1
1
1
1
x
x
x
x
x
x
ivec
0.5
12
~5
~5
~5
~2
~2
~2
0.33
1
0.5
2
1
3
1
2
8
1
x
x
1
1
1
1
1
1
1
1
0.5
1
1
2
0.5
2
0.5
2
1
2
1
2
1
3
xmm,xmm
xmm,xmm
xmm,xmm,i
(x)mm,(x)mm
(x)mm,m
1
1
1
1
1
1
2
1
1
x
x
x
x
x
x
x
x
x
x
x
x
x
x
1
1
x
x
ivec
PTEST j)
PTEST j)
PSLL/RL/RAW/D/Q
PSLL/RL/RAW/D/Q
PSLL/RL/RAW/D/Q
PSLL/RL/RAW/D/Q
PSLL/RL/RAW/D/Q
PSLL/RLDQ
xmm,xmm
xmm,m128
mm,mm/i
mm,m64
xmm,i
xmm,xmm
xmm,m128
xmm,i
2
2
1
1
1
2
3
1
2
2
1
1
1
2
2
1
x
x
String instructions
PCMPESTRI l)
PCMPESTRI l)
PCMPESTRM l)
PCMPESTRM l)
PCMPISTRI l)
PCMPISTRI l)
PCMPISTRM l)
PCMPISTRM l)
xmm,xmm,i
xmm,m128,i
xmm,xmm,i
xmm,m128,i
xmm,xmm,i
xmm,m128,i
xmm,xmm,i
xmm,m128,i
8
9
9
10
3
4
4
6
8
8
9
10
3
4
4
5
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
11
11
Other
EMMS
Notes:
g)
h)
j)
k)
l)
m)
n)
x
x
x
x
x
1
1
1
1
1
x
x
ivec
ivec
ivec
ivec
1
2
ivec
1
1
1
2
1
2
1
1
ivec
ivec
ivec
ivec
ivec
ivec
ivec
ivec
14
14
7
7
8
8
7
7
5
6
6
6
2
2
2
5
1
1
x
x
x
1
1
1
1
float

k) MASM uses the name MOVD rather than MOVQ for this instruction even when moving 64 bits.
m) Only available in 64 bit mode.
n) Only available on newer models.

Instruction
Move instructions
MOVAPS/D
MOVAPS/D
MOVAPS/D
MOVUPS/D
MOVUPS/D
MOVSS/D
MOVSS/D
MOVSS/D
MOVHPS/D MOVLPS/D
MOVH/LPS/D
MOVLHPS MOVHLPS
MOVMSKPS/D
MOVNTPS/D
SHUFPS/D
SHUFPS/D
Operands
xmm,xmm
xmm,m128
m128,xmm
xmm,m128
m128,xmm
xmm,xmm
xmm,m32/64
m32/64,xmm
xmm,m64
m64,xmm
xmm,xmm
r32,xmm
m128,xmm
xmm,xmm,i
xmm,m128,i

μops fused domain
μops unfused domain: p015 p0 p1 p5 p2 p3 p4
Domain
1
1
1
1
1
1
1
1
2
2
1
1
1
1
2
float
1
1
1
1
1
1
1
1
1
1
1
1
1
1
float
float
1
1
1
1
1
1
1
float
float
Latency
Reciprocal throughput
1
2
3
2
3
1
2
3
3
5
1
1+2
~270
1
1
1
1
1-4
1-3
1
1
1
2
1
1
1
2
1
1
BLENDPS/PD j)
BLENDPS/PD j)
BLENDVPS/PD j)
BLENDVPS/PD j)
MOVDDUP g)
MOVDDUP g)
MOVSH/LDUP g)
MOVSH/LDUP g)
UNPCKH/LPS/D
UNPCKH/LPS/D
EXTRACTPS j)
EXTRACTPS j)
INSERTPS j)
INSERTPS j)
xmm,xmm,i
xmm,m128,i
x,x,xmm0
xmm,m,xmm0
xmm,xmm
xmm,m64
xmm,xmm
xmm,m128
xmm,xmm
xmm,m128
r32,xmm,i
m32,xmm,i
xmm,xmm,i
xmm,m32,i
1
2
2
3
1
1
1
1
1
1
1
2
1
3
1
1
2
2
1
Conversion
CVTPD2PS
CVTPD2PS
CVTSD2SS
CVTSD2SS
CVTPS2PD
CVTPS2PD
CVTSS2SD
CVTSS2SD
CVTDQ2PS
CVTDQ2PS
CVT(T)PS2DQ
CVT(T)PS2DQ
CVTDQ2PD
CVTDQ2PD
CVT(T)PD2DQ
CVT(T)PD2DQ
CVTPI2PS
CVTPI2PS
CVT(T)PS2PI
CVT(T)PS2PI
CVTPI2PD
CVTPI2PD
CVT(T)PD2PI
CVT(T)PD2PI
CVTSI2SS
CVTSI2SS
CVT(T)SS2SI
CVT(T)SS2SI
CVTSI2SD
CVTSI2SD
CVT(T)SD2SI
CVT(T)SD2SI
xmm,xmm
xmm,m128
xmm,xmm
xmm,m64
xmm,xmm
xmm,m64
xmm,xmm
xmm,m32
xmm,xmm
xmm,m128
xmm,xmm
xmm,m128
xmm,xmm
xmm,m64
xmm,xmm
xmm,m128
xmm,mm
xmm,m64
mm,xmm
mm,m128
xmm,mm
xmm,m64
mm,xmm
mm,m128
xmm,r32
xmm,m32
r32,xmm
r32,m32
xmm,r32
xmm,m32
r32,xmm
r32,m64
2
2
2
2
2
2
1
1
1
1
1
1
2
2
2
2
1
1
1
1
2
2
2
2
1
1
1
1
2
2
1
1
2
2
2
2
2
2
1
1
1
1
1
1
2
2
2
2
1
1
1
1
2
2
2
2
1
1
1
1
2
1
1
1
Arithmetic
ADDSS/D SUBSS/D
xmm,xmm
1
1
2
2
1
float
float
float
float
float
1
1
1
float
1
2
1
2
1
1
1
1
1
1
1
2
1
1
1
1
1
2
?
1
1
1
1
x
x
1
1
1
?
1
1
1
1
1
1
?
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
x
x
float
float
float
1
1+2
1
float
float
float
float
float
float
float
float
float
float
float
float
float
float
float
float
float
float
float
float
float
float
4
4
2
1
3+2
3+2
4+2
4+2
3+2
3+2
ivec/float
float/ivec
float
float
float
float
float
float
float
float
3+2
float
1
1
1
1
1
1
3+2
4+2
3+2
1
1
2
2
1
1
1
1
1
1
1
1
1
2
1
1
1
1
1
1
1
2
1
1
1
1
1
1
1
1
3
3
1
1
1
1
1
1
3
3
1
1
3
3
1
1
ADDSS/D SUBSS/D
ADDPS/D SUBPS/D
ADDPS/D SUBPS/D
ADDSUBPS/D g)
ADDSUBPS/D g)
HADDPS HSUBPS g)
HADDPS HSUBPS g)
HADDPD HSUBPD g)
HADDPD HSUBPD g)
MULSS MULPS
MULSS MULPS
MULSD MULPD
MULSD MULPD
DIVSS DIVPS
DIVSS DIVPS
DIVSD DIVPD
DIVSD DIVPD
RCPSS/PS
RCPSS/PS
CMPccSS/D CMPccPS/D
xmm,m32/64
xmm,xmm
xmm,m128
xmm,xmm
xmm,m128
xmm,xmm
xmm,m128
xmm,xmm
xmm,m128
xmm,xmm
xmm,m
xmm,xmm
xmm,m
xmm,xmm
xmm,m
xmm,xmm
xmm,m
xmm,xmm
xmm,m
1
1
1
1
1
3
4
3
4
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
3
3
3
3
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
xmm,xmm
xmm,m
xmm,xmm
xmm,m32/64
xmm,xmm
xmm,m32/64
xmm,xmm
xmm,m128
2
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
xmm,xmm,i
xmm,m128,i
xmm,xmm,i
xmm,m128,i
xmm,xmm,i
xmm,m128,i
2
4
6
3
4
1
4
5
3
3
1
2
x
x
x
xmm,xmm
xmm,m
xmm,xmm
xmm,m
xmm,xmm
xmm,m
1
2
1
2
1
1
1
1
1
1
1
1
xmm,xmm
xmm,m128
1
1
1
1
1
1
1
2
2
2
2
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
float
float
float
float
float
float
float
float
float
float
float
float
float
float
float
float
float
float
float
float
3
3
5
3
4
5
7-14
7-22
3
1
1
1
1
1
2
2
2
2
1
1
1
1
7-14
7-14
7-22
7-22
2
2
1
CMPccSS/D CMPccPS/D
COMISS/D UCOMISS/D
COMISS/D UCOMISS/D
MAXSS/D MINSS/D
MAXSS/D MINSS/D
MAXPS/D MINPS/D
MAXPS/D MINPS/D
ROUNDSS/D
ROUNDPS/D j)
ROUNDSS/D
ROUNDPS/D j)
DPPS j)
DPPS j)
DPPD j)
DPPD j)
Math
SQRTSS/PS
SQRTSS/PS
SQRTSD/PD
SQRTSD/PD
RSQRTSS/PS
RSQRTSS/PS
1
x
x
x
1
1
1
1
float
float
float
float
float
float
float
float
1
1
x
x
x
1
1
1
1
3
3
11
1
2
1
3
7-18
float
float
float
float
float
float
7-18
7-18
7-32
7-32
2
2
float
float
1
1
1
1
float
float
float
float
float
1+2
1
1
1
1
1
1
1
7-32
3
Logic
AND/ANDN/OR/XORPS/D
AND/ANDN/OR/XORPS/D
Other
1
1
1
1
LDMXCSR
STMXCSR
FXSAVE
FXRSTOR
Notes:
g)
m32
m32
m4096
m4096
6
2
141
112
6
1
141
90
x
x
x
x
x 1
1
1 1
x 5 38 38
x 42
90
5
1
90
100
Sandy Bridge
Intel Sandy Bridge

Operands:
i = immediate data, r = register, mm = 64 bit mmx register, x = 128 bit xmm register, (x)mm = mmx or xmm register, y = 256 bit ymm register, same = same register for both operands. m = memory operand, m32 = 32-bit memory operand, etc.
μops fused domain:
... macro-ops count as one. The instruction has μop fusion if the sum of the numbers listed under p015 + p23 + p4 exceeds the number listed under μops fused domain. A number indicated as 1+ under a read or write port means a 256-bit read or write operation using two clock cycles for handling 128 bits each cycle. The port cannot receive another read or write μop in the second clock cycle, but a read port can receive an address-calculation μop in the second clock cycle. An x under p0, p1 or p5 means that at least one of the μops listed under p015 can optionally go to this port. For example, a 1 under p015 and an x under p0 and p5 means one μop which can go to either port 0 or port 5, whichever is vacant first. A value listed under p015 but nothing under p0, p1 and p5 means that it is not known which of the three ports these μops go to.
μops unfused domain:
p015: The total number of μops going to ports 0, 1 and 5.
p0: The number of μops going to port 0.
p1: The number of μops going to port 1.
p5: The number of μops going to port 5.
p23: The number of μops going to port 2 or 3 (memory read or address calculation).
p4: The number of μops going to port 4 (memory write).
Latency:
... may increase the clock counts considerably. Where hyperthreading is enabled, the use of the same execution units in the other thread leads to inferior performance. Denormal numbers, NANs and infinity do not increase the latency. The time unit used is core clock cycles, not the reference clock cycles given by the time stamp counter.
The latencies and throughputs listed below for addition and multiplication using full size YMM registers are obtained only after a warm-up period of a thousand instructions or more. The latencies may be one or two clock cycles longer and the reciprocal throughputs double the values for shorter sequences of code. There is no warm-up effect when vectors are 128 bits wide or less.
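The fusion rule above (the sum under p015 + p23 + p4 exceeding the fused-domain count) can be checked mechanically for any table row. The following is a minimal Python sketch and not part of the original tables. The function name and the example row values are hypothetical and chosen only for illustration.

```python
def has_uop_fusion(fused_domain: int, p015: int, p23: int, p4: int) -> bool:
    """Return True if a table row indicates micro-op fusion.

    Rule from the text: fusion is present when p015 + p23 + p4 exceeds
    the number listed under the fused-domain column.
    """
    return p015 + p23 + p4 > fused_domain


# Hypothetical row (illustrative values, not read from the tables):
# one fused-domain micro-op that splits into an execution micro-op
# and a memory-read micro-op.
print(has_uop_fusion(fused_domain=1, p015=1, p23=1, p4=0))  # True

# Hypothetical register-only row: one micro-op in both domains, no fusion.
print(has_uop_fusion(fused_domain=1, p015=1, p23=0, p4=0))  # False
```

The check only compares counts, so it says nothing about which ports are used when the row lists an x instead of a number.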
Instruction
Move instructions
MOV
Operands
r,r/i
μops fused domain
μops unfused domain: p015 p0 p1 p5 p23 p4
Latency
Reciprocal throughput
Comments
1
MOV
r,m
MOV
MOV
MOVNTI
MOVSX MOVZX
MOVSXD
MOVSX MOVZX
MOVSXD
CMOVcc
CMOVcc
XCHG
XCHG
m,r
m,i
m,r
r,r
1
1
2
1
1
1
1
r,m
r,r
r,m
r,r
r,m
2
2
3
8
2
2
3
x
3
1
1
2
3
16
1
1
2
9
18
1
3
1
1
XLAT
PUSH
PUSH
PUSH
PUSHF(D/Q)
PUSHA(D)
POP
POP
POP
POPF(D/Q)
POPA(D)
LAHF SAHF
SALC
LEA
LEA
BSWAP
BSWAP
PREFETCHNTA
PREFETCHT0/1/2
LFENCE
MFENCE
SFENCE
ADD SUB
ADD SUB
ADD SUB
SUB
ADC SBB
ADC SBB
ADC SBB
CMP
CMP
INC DEC NEG NOT
r
i
m
r
(E/R)SP
m
r,m
r,m
r32
r64
m
m
r,r/i
r,m
m,r/i
r,same
r,r/i
r,m
m,r/i
r,r/i
m,r/i
r
1
1
1
0.5
1
1
1
~350
1
2
0
x
x
x
x
x
x
x
x
x
0
8
10
1
3
1
1
1
2
1
1
2
3
2
1
2
1
1
2
1
2
2
4
1
1
1
1
1
1
0
2
2
3
1
1
1
1
1
1
2
1
8
1
1
2
1
8
1
1
1
1
8
2
25
7
3
2
1
1
1
1
3
1
2
1
2
1
1
1
1
1
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
0.5
1
2
1
1
1
1
1
1
implicit
lock
1
1
1
1
1
8
0.5
0.5
1
18
9
1
1
0.5
1
1
1
0.5
0.5
4
33
6
1
1
2
1
2
all addressing
modes
6
0
2
2
7
1
1
1
0.5
1
0.25
1
1
1.5
0.5
not 64 bit
not 64 bit
not 64 bit
simple
complex
or rip relative
INC DEC NEG NOT
AAA AAS
DAA DAS
AAD
AAM
MUL IMUL
MUL IMUL
MUL IMUL
MUL IMUL
IMUL
IMUL
IMUL
IMUL
MUL IMUL
MUL IMUL
MUL IMUL
MUL IMUL
IMUL
IMUL
IMUL
IMUL
DIV
DIV
DIV
DIV
IDIV
IDIV
IDIV
IDIV
CBW
CWDE
CDQE
CWD
CDQ
CQO
POPCNT
POPCNT
CRC32
CRC32
Logic instructions
AND OR XOR
AND OR XOR
AND OR XOR
XOR
TEST
TEST
SHR SHL SAR
SHR SHL SAR
3
2
1
2
3
3
8
1
4
3
2
1
2
1
1
1
3
2
1
1
2
1
1
10
11
10
x
10
10
9
x
r,r
r,m
r,r
r,m
3
3
8
1
4
3
2
1
2
1
1
1
4
3
2
1
2
1
1
10
11
10
34-56
10
10
9
59-138
1
1
1
2
1
1
1
1
1
1
r,r/i
r,m
m,r/i
r,same
r,r/i
m,r/i
r,i
m,i
1
1
2
1
1
1
1
3
1
1
1
0
1
1
1
1
r8
r16
r32
r64
r,r
r16,r16,i
r32,r32,i
r64,r64,i
m8
m16
m32
m64
r,m
r16,m16,i
r32,m32,i
r64,m64,i
r8
r16
r32
r64
r8
r16
r32
r64
4
2
20
3
4
4
3
3
4
3
3
3
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
6
4
20-24
21-25
20-28
30-94
21-24
21-25
20-27
40-103
1
1
1
2
1
1
1
1
1
1
1
1
1
1
1
1
3
1
1
1
1
1
3
1
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
2
not 64 bit
11
1
2
2
1
1
1
1
1
1
2
2
2
1
1
1
1
11-14
11-14
11-18
22-76
11-14
11-14
11-18
25-84
0.5
1
0.5
1
1
0.5
1
1
1
1
1
1
2
6
0
1
1
2
1
1
0.5
1
0.25
0.5
0.5
2
not 64 bit
not 64 bit
not 64 bit
SSE4.2
SSE4.2
SSE4.2
SSE4.2
SHR SHL SAR
SHR SHL SAR
ROR ROL
ROR ROL
ROR ROL
ROR ROL
RCR
RCR
RCR
RCR
RCR
RCR
RCL
RCL
RCL
RCL
RCL
SHRD SHLD
SHRD SHLD
SHRD SHLD
SHRD SHLD
BT
BT
BT
BTR BTS BTC
BTR BTS BTC
BTR BTS BTC
BSF BSR
BSF BSR
SETcc
SETcc
CLC
STC CMC
CLD STD
r,cl
m,cl
r,i
m,i
r,cl
m,cl
r8,1
r16/32/64,1
r,i
m,i
r,cl
m,cl
r,1
r,i
m,i
r,cl
m,cl
r,r,i
m,r,i
r,r,cl
m,r,cl
r,r/i
m,r
m,i
r,r/i
m,r
m,i
r,r
r,m
r
m

JMP
short/near
JMP
r
JMP
m
Conditional jump
short/near
Fused arithmetic and
branch
J(E/R)CXZ
LOOP
LOOP(N)E
CALL
CALL
CALL
RET
RET
short
short
short
near
r
m
i
3
5
1
4
3
5
high
3
8
11
8
11
3
8
11
8
11
1
3
4
5
1
10
2
1
11
3
1
1
1
2
1
1
3
3
3
1
3
3
3
1
1
1
1
1
1
1
1
1
1
1
1
2
7
11
3
2
3
2
3
2
7
11
2
1
2
2
2
1
2
2
high
2
5
3
8
7
8
7
3
8
7
8
7
1
4
3
1
8
1
1
7
1
1
1
1
1
0
1
3
5
2
6
x
2
1
x
1
1
x
2
x
1
3
1
x
x
x
1
x
x
1
1
2
4
1
2
2
4
high
2
5
6
5
6
2
6
6
6
6
0.5
2
2
4
0.5
5
0.5
0.5
5
2
1
1
0.5
1
0.25
1
4
1
1
1
1
1
1
1
2
1
1
1
1
1
0
0
0
0
2
2
2
1-2
1-2
2-4
5
5
2
2
2
2
2
fast if not
jumping
BOUND
INTO
r,m
15
4
13
4
String instructions
LODS
REP LODS
STOS
REP STOS
3
5n+12
3
2n
REP STOS
1.5/16B
MOVS
REP MOVS
5
2n
1.5 n
REP MOVS
3/16B
1/16B
SCAS
REP SCAS
CMPS
REP CMPS
3
6n+47
5
8n+80
Other
NOP (90)
Long NOP (0F 1F)
PAUSE
ENTER
ENTER
LEAVE
CPUID
RDTSC
RDPMC
not 64 bit
not 64 bit
1
~2n
1
n
worst
case
best case
1/16B
4
1
1
a,0
a,b
7
6
worst
case
best case
1
2n+45
4
2n+80
0
0
0.25
0.25
7
7
12
10
49+6b
3
3
31-75
21
35
decode
only 1
per clk
11
8
1
84+3b
7
100-250
28
42

Instruction
Move instructions
FLD
FLD
FLD
FBLD
FST(P)
FST(P)
FSTP
FBSTP
FXCH
FILD
FIST(P)
Operands
r
m32/64
m80
m80
r
m32/m64
m80
m80
r
m
m

μops fused domain
μops unfused domain: p015 p0 p1 p5 p23 p4
Latency
Reciprocal throughput
Comments
1
1
4
43
1
1
7
246
1
1
3
1
1
2
40
1
1
1
1
2
1
3
4
45
1
4
5
0
6
7
1
2
3
1
1
2
3
0
1
1
1
1
1
1
1
1
2
21
1
1
5
252
0.5
1
2
FISTTP
FLDZ
FLD1
FLDPI FLDL2E etc.
FCMOVcc
FNSTSW
FNSTSW
FLDCW
FNSTCW
FINCSTP FDECSTP
FFREE(P)
FNSAVE
FRSTOR
FADD(P) FSUB(R)(P)
FADD(P) FSUB(R)(P)
FMUL(P)
FMUL(P)
FDIV(R)(P)
FDIV(R)(P)
FABS
FCHS
FCOM(P) FUCOM
FCOM(P) FUCOM
FCOMPP FUCOMPP
FCOMI(P) FUCOMI(P)
FIADD FISUB(R)
FIMUL
FIDIV(R)
FICOM(P)
FTST
FXAM
FPREM
FPREM1
FRNDINT
Math
FSCALE
FXTRACT
FSQRT
FSIN
FCOS
FSINCOS
F2XM1
FYL2X
FYL2XP1
FPTAN
FPATAN
Other
FNOP
WAIT
r
AX
m16
m16
m16
r
m
m
r
m
r
m
r
m
r
m
r
m
m
m
m
3
1
2
2
3
2
2
3
2
1
1
143
90
1
1
2
2
3
2
1
2
1
1
1
1
2
1
1
1
1
1
1
1
1
2
3
2
2
2
2
1
2
28
41-87
17
1
2
1
1
1
1
1
1
1
1
2
3
2
2
2
2
1
2
28
1
1
1
1
2
3
2
1
1
1
1
1
1
1
1
1
8
5
1
3
1
1
1
1
1
1
1
1
?
5
1
10-24
1
1
1
1
?
1
1
2
1
1
1
1
3
1
?
4
1
1
1
1
17
21
26-50
22
27
27
17
17
1
1
64-100 x
20-110 x
20-110 x
53-118 x
12
10
10-24
47-100
47-115
43-123
61-69
102 102
28-91 x
1
2
1
2
2
2
2
2
2
1
1
1
1
1
166
165
1
1
1
1
10-24
10-24
1
1
1
1
1
1
1
1
2
1
2
21
26-50
130
93-146
1
1
SSE3
FNCLEX
FNINIT
5
26
5
26
22
81

Instruction
Move instructions
MOVD
MOVD
MOVD
MOVD
MOVQ
MOVQ
MOVQ
MOVDQA
MOVDQA
MOVDQA
MOVDQU
MOVDQU
LDDQU
MOVDQ2Q
MOVQ2DQ
MOVNTQ
MOVNTDQ
MOVNTDQA
PACKSSWB/DW
PACKUSWB
PACKSSWB/DW
PACKUSWB
PACKSSWB/DW
PACKUSWB
PACKSSWB/DW
PACKUSWB
PACKUSDW
PACKUSDW
PUNPCKH/LBW/WD/DQ
PUNPCKH/LBW/WD/DQ
PUNPCKH/LQDQ
PUNPCKH/LQDQ
PMOVSX/ZXBW
PMOVSX/ZXBW
PMOVSX/ZXBD
PMOVSX/ZXBD
PMOVSX/ZXBQ
PMOVSX/ZXBQ
PMOVSX/ZXWD
PMOVSX/ZXWD
PMOVSX/ZXWQ
PMOVSX/ZXWQ
PMOVSX/ZXDQ
Operands

μops fused domain
μops unfused domain: p015 p0 p1 p5 p23 p4
Latency
Reciprocal throughput
Comments
r32/64,(x)mm
m32/64,(x)mm
(x)mm,r32/64
(x)mm,m32/64
(x)mm,(x)mm
(x)mm,m64
m64, (x)mm
x,x
x, m128
m128, x
x, m128
m128, x
x, m128
mm, x
x,mm
m64,mm
m128,x
x, m128
1
1
1
1
1
1
1
1
1
1
1
1
1
2
1
1
1
1
mm,mm
mm,m64
x,x
x,m128
x,x
x,m
(x)mm,(x)mm
(x)mm,m
x,x
x, m128
x,x
x,m64
x,x
x,m32
x,x
x,m16
x,x
x,m64
x,x
x,m32
x,x
1
1
1
1
1
1
2
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
1
1
1
1
1
1
x
1
1
1
1
1
1
1
1
2
1
1
1
1
1
1
1
1
1
3
1
3
1
3
3
1
3
3
3
3
3
1
1
~300
~300
1
0.5
0.5
1
0.5
1
0.5
1
0.5
1
SSE3
1
0.5
0.5
SSE4.1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
0.5
0.5
0.5
0.5
0.5
0.5
0.5
0.5
0.5
0.5
0.5
0.5
0.5
0.5
0.5
0.5
0.5
0.5
SSE4.1
SSE4.1
SSE4.1
SSE4.1
SSE4.1
SSE4.1
SSE4.1
SSE4.1
SSE4.1
SSE4.1
SSE4.1
SSE4.1
SSE4.1
PMOVSX/ZXDQ
PSHUFB
PSHUFB
PSHUFW
PSHUFW
PSHUFD
PSHUFD
PSHUFL/HW
PSHUFL/HW
PALIGNR
PALIGNR
PBLENDVB
PBLENDVB
PBLENDW
PBLENDW
MASKMOVQ
MASKMOVDQU
PMOVMSKB
PEXTRB
PEXTRB
PEXTRW
PEXTRW
PEXTRD
PEXTRD
PEXTRQ
PEXTRQ
PINSRB
PINSRB
PINSRW
PINSRW
PINSRD
PINSRD
PINSRQ
PINSRQ
x,m64
(x)mm,(x)mm
(x)mm,m
mm,mm,i
mm,m64,i
xmm,x,i
x,m128,i
x,x,i
x, m128,i
(x)mm,(x)mm,i
(x)mm,m,i
x,x,xmm0
x,m,xmm0
x,x,i
x,m,i
mm,mm
x,x
r32,(x)mm
r32,x,i
m8,x,i
r32,(x)mm,i
m16,(x)mm,i
r32,x,i
m32,x,i
r64,x,i
m64,x,i
x,r32,i
x,m8,i
(x)mm,r32,i
(x)mm,m16,i
x,r32,i
x,m32,i
x,r64,i
x,m64,i
1
1
2
1
2
1
2
1
2
1
2
2
3
1
2
4
10
1
2
2
2
2
2
3
2
3
2
2
2
2
2
2
2
2
1
1
1
1
1
1
1
1
1
1
1
2
2
1
1
1
4
1
2
1
2
1
2
2
2
2
2
1
2
1
2
1
2
1
x
x
x
x
x
x
x
x
x
x
x
1
1
x
x
1
(x)mm, (x)mm
(x)mm,m
(x)mm, (x)mm
(x)mm,m64
(x)mm,(x)mm
(x)mm,m
x,x
x,m128
x,x
x,m128
x,same
x,same
(x)mm,(x)mm
(x)mm,m
(x)mm,(x)mm
(x)mm,m
x,x
1
1
3
4
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
3
3
1
1
1
1
1
1
0
1
1
1
1
1
1
x
x
x
x
x
x
x
x
1
1
1
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
1
1
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
1
1
1
1
1
1
1
1
1
1
1
2
1
1
1
2
4
1
x
2
2
2
2
1
1
2
1
2
1
2
1
2
1
2
1
0.5
0.5
0.5
0.5
0.5
0.5
0.5
0.5
0.5
0.5
0.5
1
1
0.5
0.5
1
6
1
1
1
1
2
1
1
1
1
1
0.5
1
0.5
1
0.5
1
0.5
SSE4.1
SSSE3
SSSE3
SSSE3
SSSE3
SSE4.1
SSE4.1
SSE4.1
SSE4.1
SSE4.1
SSE4.1
SSE4.1
SSE4.1
SSE4.1
SSE4.1,
64b
SSE4.1
SSE4.1
SSE4.1
SSE4.1
SSE4.1,
64 b
PADD/SUB(U,S)B/W/D/Q
PHADD/SUB(S)W/D
PHADD/SUB(S)W/D
PCMPEQ/GTB/W/D
PCMPEQ/GTB/W/D
PCMPEQQ
PCMPEQQ
PCMPGTQ
PCMPGTQ
PSUBxx, PCMPGTx
PCMPEQx
PMULL/HW PMULHUW
PMULL/HW PMULHUW
PMULHRSW
PMULHRSW
PMULLD
x
x
x
x
x
x
x
x
2
1
1
1
1
1
5
1
1
1
1
1
1
1
1
0
0
5
1
5
1
5
0.5
0.5
1.5
1.5
0.5
0.5
0.5
0.5
1
1
0.25
0.5
1
1
1
1
1
SSSE3
SSSE3
SSE4.1
SSE4.1
SSE4.2
SSE4.2
SSSE3
SSSE3
SSE4.1
PMULLD
PMULDQ
PMULDQ
PMULUDQ
PMULUDQ
PMADDWD
PMADDWD
PMADDUBSW
PMADDUBSW
PAVGB/W
PAVGB/W
PMIN/MAXSB
PMIN/MAXSB
PMIN/MAXUB
PMIN/MAXUB
PMIN/MAXSW
PMIN/MAXSW
PMIN/MAXUW
PMIN/MAXUW
PMIN/MAXU/SD
PMIN/MAXU/SD
PHMINPOSUW
PHMINPOSUW
PABSB/W/D
PABSB/W/D
PSIGNB/W/D
PSIGNB/W/D
PSADBW
PSADBW
MPSADBW
MPSADBW
x,m128
x,x
x,m128
(x)mm,(x)mm
(x)mm,m
(x)mm,(x)mm
(x)mm,m
(x)mm,(x)mm
(x)mm,m
(x)mm,(x)mm
(x)mm,m
x,x
x,m128
(x)mm,(x)mm
(x)mm,m
(x)mm,(x)mm
(x)mm,m
x,x
x,m
x,x
x,m128
x,x
x,m128
(x)mm,(x)mm
(x)mm,m
(x)mm,(x)mm
(x)mm,m
(x)mm,(x)mm
(x)mm,m
x,x,i
x,m,i
2
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
3
4
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
3
3
1
1
1
1
1
1
1
1
1
x
x
x
x
x
x
x
x
x
x
x
x
1
1
x
x
x
x
1
1
1
1
1
1
1
1
Logic instructions
PAND(N) POR PXOR
PAND(N) POR PXOR
PXOR
PTEST
PTEST
PSLL/RL/RAW/D/Q
PSLL/RL/RAW/D/Q
PSLL/RL/RAW/D/Q
PSLL/RL/RAW/D/Q
PSLL/RL/RAW/D/Q
PSLL/RLDQ
(x)mm,(x)mm
(x)mm,m
x,same
x,x
x,m128
mm,mm/i
mm,m64
xmm,i
x,x
x,m128
x,i
1
1
1
1
1
1
1
1
2
3
1
1
1
0
1
1
1
1
1
2
2
1
x
x
x
x
x
x
x,x,i
x,m128,i
x,x,i
x,m128,i
x,x,i
x,m128,i
x,x,i
8
8
8
8
3
4
3
8
7
8
7
3
3
3
String instructions
PCMPESTRI
PCMPESTRI
PCMPESTRM
PCMPESTRM
PCMPISTRI
PCMPISTRI
PCMPISTRM
1
5
1
5
1
5
1
5
1
x
x
x
x
x
x
x
x
x
x
x
x
1
1
1
1
1
1
1
1
1
1
1
1
5
1
x
x
x
x
1
1
1
1
5
1
6
1
SSE4.1
SSE4.1
SSE4.1
SSSE3
SSSE3
SSE4.1
SSE4.1
SSE4.1
SSE4.1
SSE4.1
SSE4.1
SSE4.1
SSE4.1
SSSE3
SSSE3
SSSE3
SSSE3
SSE4.1
SSE4.1
1
1
0
1
1
1
1
1
1
1
1
1
1
1
1
1
1
0.5
0.5
0.5
0.5
0.5
0.5
0.5
0.5
0.5
0.5
0.5
0.5
1
1
0.5
0.5
0.5
0.5
1
1
1
1
1
1
1
2
1
1
4
1
11-12
1
3
1
11
0.5
0.25
1
1
1
2
1
1
1
1
4
4
4
4
3
3
3
SSE4.1
SSE4.1
SSE4.2
SSE4.2
SSE4.2
SSE4.2
SSE4.2
SSE4.2
SSE4.2
PCMPISTRM
Encryption instructions
PCLMULQDQ
AESDEC, AESDECLAST,
AESENC, AESENCLAST
AESIMC
AESKEYGENASSIST
x,m128,i
x,x,i
18
18
x,x
x,x
x,x,i
2
2
11
2
2
11
31
31
Other
EMMS
SSE4.2
14
CLMUL
4
2
8
AES
AES
AES
18

Instruction
Move instructions
MOVAPS/D
VMOVAPS/D
MOVAPS/D MOVUPS/D
VMOVAPS/D
VMOVUPS/D
MOVAPS/D MOVUPS/D
VMOVAPS/D
VMOVUPS/D
MOVSS/D
MOVSS/D
MOVSS/D
MOVHPS/D MOVLPS/D
MOVH/LPS/D
MOVLHPS MOVHLPS
MOVMSKPS/D
VMOVMSKPS/D
MOVNTPS/D
VMOVNTPS/D
SHUFPS/D
SHUFPS/D
VSHUFPS/D
VSHUFPS/D
VPERMILPS/PD
VPERMILPS/PD
VPERMILPS/PD
VPERMILPS/PD
VPERMILPS/PD
VPERMILPS/PD
VPERM2F128
VPERM2F128
BLENDPS/PD
BLENDPS/PD
VBLENDPS/PD
Operands

μops fused domain
μops unfused domain: p015 p0 p1 p5 p23 p4
Latency
Reciprocal throughput
Comments
x,x
y,y
x,m128
1
1
1
y,m256
m128,x
1
1
m256,y
x,x
x,m32/64
m32/64,x
x,m64
m64,x
x,x
r32,x
r32,y
m128,x
m256,y
x,x,i
x,m128,i
y,y,y,i
y, y,m256,i
x,x,x/i
y,y,y/i
x,x,m
y,y,m
x,m,i
y,m,i
y,y,y,i
y,y,m,i
x,x,i
x,m128,i
y,y,i
1
1
1
1
1
1
1
1
1
1
1
1
2
1
2
1
1
2
2
2
2
1
2
1
2
1
1
1
1
1
1
1
1
1
1
1
1
3
1
1
0.5
1+
1
1
1
AVX
4
3
1+
3
1
3
3
3
3
1
2
2
~300
~300
1
1
1
0.5
1
1
1
1
1
1
1
25
1
1
1
1
1
1
1
1
1
1
1
1
0.5
0.5
1
AVX
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
x
x
x
1
1
1
1
1
1
1
1
1
1
1
1
x
x
x
1
4
1
1
1+
1
1
1
1+
1
1+
2
1+
1
1
1
AVX
AVX
AVX
AVX
AVX
AVX
AVX
AVX
AVX
AVX
AVX
AVX
SSE4.1
SSE4.1
AVX
VBLENDPS/PD
BLENDVPS/PD
BLENDVPS/PD
VBLENDVPS/PD
VBLENDVPS/PD
MOVDDUP
MOVDDUP
VMOVDDUP
VMOVDDUP
VBROADCASTSS
VBROADCASTSS
VBROADCASTSD
VBROADCASTF128
MOVSH/LDUP
MOVSH/LDUP
VMOVSH/LDUP
VMOVSH/LDUP
UNPCKH/LPS/D
UNPCKH/LPS/D
VUNPCKH/LPS/D
VUNPCKH/LPS/D
EXTRACTPS
EXTRACTPS
VEXTRACTF128
VEXTRACTF128
INSERTPS
INSERTPS
VINSERTF128
VINSERTF128
VMASKMOVPS/D
VMASKMOVPS/D
VMASKMOVPS/D
VMASKMOVPS/D
Conversion
CVTPD2PS
CVTPD2PS
VCVTPD2PS
VCVTPD2PS
CVTSD2SS
CVTSD2SS
CVTPS2PD
CVTPS2PD
VCVTPS2PD
VCVTPS2PD
CVTSS2SD
CVTSS2SD
CVTDQ2PS
CVTDQ2PS
VCVTDQ2PS
VCVTDQ2PS
CVT(T)PS2DQ
CVT(T)PS2DQ
y,m256,i
x,x,xmm0
x,m,xmm0
y,y,y,y
y,y,m,y
x,x
x,m64
y,y
y,m256
x,m32
y,m32
y,m64
y,m128
x,x
x,m128
y,y
y,m256
x,x
x,m128
y,y,y
y,y,m256
r32,x,i
m32,x,i
x,y,i
m128,y,i
x,x,i
x,m32,i
y,y,x,i
y,y,m128,i
x,x,m128
y,y,m256
m128,x,x
m256,y,y
2
2
3
2
3
1
1
1
1
1
2
2
2
1
1
1
1
1
1
1
1
2
3
1
2
1
2
1
2
3
3
4
4
1
2
2
2
2
1
x,x
x,m128
x,y
x,m256
x,x
x,m64
x,x
x,m64
y,x
y,m128
x,x
x,m32
x,x
x,m128
y,y
y,m256
x,x
x,m128
2
2
2
2
2
2
2
2
2
3
2
2
1
1
1
1
1
1
2
2
2
2
2
2
2
2
2
3
2
1
1
1
1
1
1
1
x
x
x
x
x
x
x
x
x
x
1
1+
2
1
2
1+
1
3
1
3
1
1
1
1
1
1
1
1
1
1
1+
1
1
1
1
1
3
1
4
1
1
1
1
1
1
1
2
2
1
1
1
1
1
1
2
2
2
2
1
1
1
1
1
1
1
1+
?
?
?
?
1
1
1
1
?
?
1
1
1
?
1
1
1
1
1
1
1
1
1
?
1
?
1
1
1
1
1
1
1
1
1
1+
2
1
2
1
1
2
1
1
1+
1 1
1 1+
4
1
4
1+
3
1
3
1
4
1
3
1
1
1
1
1
1
1
3
1
3
1+
3
1
1
1
1
1
1
1
0.5
1
1
1
1
1
1
1
0.5
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
2
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
AVX
SSE4.1
SSE4.1
AVX
AVX
SSE3
SSE3
AVX
AVX
AVX
AVX
AVX
AVX
SSE3
SSE3
AVX
AVX
SSE3
SSE3
AVX
AVX
SSE4.1
SSE4.1
AVX
AVX
SSE4.1
SSE4.1
AVX
AVX
AVX
AVX
AVX
AVX
AVX
AVX
AVX
AVX
AVX
AVX
VCVT(T)PS2DQ
VCVT(T)PS2DQ
CVTDQ2PD
CVTDQ2PD
VCVTDQ2PD
VCVTDQ2PD
CVT(T)PD2DQ
CVT(T)PD2DQ
VCVT(T)PD2DQ
VCVT(T)PD2DQ
CVTPI2PS
CVTPI2PS
CVT(T)PS2PI
CVT(T)PS2PI
CVTPI2PD
CVTPI2PD
CVT(T)PD2PI
CVT(T)PD2PI
CVTSI2SS
CVTSI2SS
CVT(T)SS2SI
CVT(T)SS2SI
CVTSI2SD
CVTSI2SD
CVT(T)SD2SI
CVT(T)SD2SI
Arithmetic
ADDSS/D SUBSS/D
ADDSS/D SUBSS/D
ADDPS/D SUBPS/D
ADDPS/D SUBPS/D
VADDPS/D VSUBPS/D
VADDPS/D VSUBPS/D
ADDSUBPS/D
ADDSUBPS/D
VADDSUBPS/D
VADDSUBPS/D
HADDPS/D HSUBPS/D
HADDPS/D HSUBPS/D
VHADDPS/D
VHSUBPS/D
VHADDPS/D
VHSUBPS/D
MULSS MULPS
MULSS MULPS
VMULPS
VMULPS
MULSD MULPD
MULSD MULPD
VMULPD
VMULPD
DIVSS DIVPS
y,y
y,m256
x,x
x,m64
y,x
y,m128
x,x
x,m128
x,y
x,m256
x,mm
x,m64
mm,x
mm,m128
x,mm
x,m64
mm,x
mm,m128
x,r32
x,m32
r32,x
r32,m32
x,r32
x,m32
r32,x
r32,m64
1
1
2
2
2
3
2
2
2
2
1
1
2
2
2
2
2
2
2
1
2
2
2
1
2
2
1
1
2
2
2
2
2
2
2
2
1
1
2
1
2
2
2
2
2
1
2
2
2
1
2
2
x,x
x,m32/64
x,x
x,m128
y,y,y
y,y,m256
x,x
x,m128
y,y,y
y,y,m256
x,x
x,m128
1
1
1
1
1
1
1
1
1
1
3
4
1
1
1
1
1
1
1
1
1
1
3
3
1
1
1
1
1
1
1
1
1
1
1
1
2
2
y,y,y
y,y,m256
x,x
x,m
y,y,y
y,y,m256
x,x
x,m
y,y,y
y,y,m256
x,x
4
1
1
1
1
1
1
1
1
1
3
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
3
1+
1
1
1
1
1
1
1
1
4
1
5
1
4
1
5
1+
4
1
4
1
1
1
4
1
4
1
?
?
?
1
?
?
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
4
1
?
?
4
1
4
1
?
?
4
1
3
1
3
1
3
1+
3
1
3
1+
5
1
5
1+
5
1
5
1+
5
1
5
1+
10-14
1
1
1
1
1
1
1
1
1
1
2
2
1
1
1
1
1
1
1.5
1.5
1
1
1.5
1.5
1
1
AVX
AVX
AVX
AVX
AVX
AVX
1
1
1
1
1
1
1
1
1
1
2
2
AVX
AVX
SSE3
SSE3
AVX
AVX
SSE3
SSE3
AVX
2
1
1
1
1
1
1
1
1
10-14
AVX
AVX
AVX
AVX
AVX
DIVSS DIVPS
VDIVPS
VDIVPS
DIVSD DIVPD
DIVSD DIVPD
VDIVPD
VDIVPD
RCPSS/PS
RCPSS/PS
VRCPPS
VRCPPS
CMPccSS/D CMPccPS/D
x,m
y,y,y
y,y,m256
x,x
x,m
y,y,y
y,y,m256
x,x
x,m128
y,y
y,m256
1
3
4
1
1
3
4
1
1
3
4
1
3
3
1
1
3
3
1
1
3
3
1
2
2
1
1
2
?
1
1
2
x,x
x,m128
y,y,y
y,y,m256
x,x
x,m32/64
x,x
x,m32/64
x,x
x,m128
y,y,y
y,y,m256
x,x,i
x,m128,i
y,y,i
y,m256,i
x,x,i
x,m128,i
y,y,y,i
y,m256,i
x,x,i
x,m128,i
2
1
2
2
2
1
1
1
1
1
1
1
2
1
2
4
6
4
6
3
4
1
1
1
2
2
1
1
1
1
1
1
1
1
1
1
4
5
4
5
3
3
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
2
x,x
x,m128
y,y
y,m256
x,x
x,m128
y,y
y,m256
x,x
x,m128
y,y
y,m256
1
1
3
4
1
2
3
4
1
1
3
4
1
1
3
3
1
1
3
3
1
1
3
3
x,x
x,m128
1
1
1
1
1
1
21-29
1+
10-22
1
1
?
21-45
1+
5
1
7
1+
3
10-14
20-28
20-28
10-22
10-22
20-44
20-44
1
1
2
2
AVX
AVX
AVX
AVX
AVX
AVX
CMPccSS/D CMPccPS/D
VCMPccPS/D
VCMPccPS/D
COMISS/D UCOMISS/D
COMISS/D UCOMISS/D
MAXSS/D MINSS/D
MAXSS/D MINSS/D
MAXPS/D MINPS/D
MAXPS/D MINPS/D
VMAXPS/D VMINPS/D
VMAXPS/D VMINPS/D
ROUNDSS/SD/PS/PD
ROUNDSS/SD/PS/PD
VROUNDSS/SD/PS/PD
VROUNDSS/SD/PS/PD
DPPS
DPPS
VDPPS
VDPPS
DPPD
DPPD
Math
SQRTSS/PS
SQRTSS/PS
VSQRTPS
VSQRTPS
SQRTSD/PD
SQRTSD/PD
VSQRTPD
VSQRTPD
RSQRTSS/PS
RSQRTSS/PS
VRSQRTPS
VRSQRTPS
1
1
1
3
1+
2
1
3
1
3
1
3
1+
3
1
3
1+
12
1
12
1+
9
1
1
1
10-14
1
1+
1
1
10-21
1
21-43
1+
1
1
5
1
7
1+
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
2
4
2
4
2
2
10-14
10-14
21-28
21-28
10-21
10-21
21-43
21-43
1
1
2
2
Logic
AND/ANDN/OR/XORPS/PD
1
1
1
1
1
1
AVX
AVX
AVX
AVX
SSE4.1
SSE4.1
AVX
AVX
SSE4.1
SSE4.1
AVX
AVX
SSE4.1
SSE4.1
AVX
AVX
AVX
AVX
AVX
AVX
VAND/ANDN/OR/XORPS/
PD
VAND/ANDN/OR/XORPS/
PD
(V)XORPS/PD
y,y,y
y,y,m256
x/y,x/y,same
1
1
1
0
Other
VZEROUPPER
VZEROALL
12
VZEROALL
LDMXCSR
STMXCSR
VSTMXCSR
FXSAVE
FXRSTOR
XSAVEOPT
m32
m32
m32
m4096
m4096
m
AVX
1
0.25
AVX
AVX
AVX,
32 bit
AVX,
64 bit
1+
11
20
3
3
3
3
3
3
130
116
100-161
?
?
?
?
1
1
1
1
1
9
3
1
1
68
72
1
1
60-500
AVX
Ivy Bridge
Intel Ivy Bridge

Operands:
i = immediate data, r = register, mm = 64 bit mmx register, x = 128 bit xmm register, (x)mm = mmx or xmm register, y = 256 bit ymm register, same = same register for both operands. m = memory operand, m32 = 32-bit memory operand, etc.
μops fused domain:
... macro-ops count as one. The instruction has μop fusion if the sum of the numbers listed under p015 + p23 + p4 exceeds the number listed under μops fused domain. A number indicated as 1+ under a read or write port means a 256-bit read or write operation using two clock cycles for handling 128 bits each cycle. The port cannot receive another read or write μop in the second clock cycle, but a read port can receive an address-calculation μop in the second clock cycle. An x under p0, p1 or p5 means that at least one of the μops listed under p015 can optionally go to this port.
μops unfused domain:
p015: The total number of μops going to ports 0, 1 and 5.
p0: The number of μops going to port 0.
p1: The number of μops going to port 1.
p5: The number of μops going to port 5.
p23: The number of μops going to port 2 or 3 (memory read or address calculation).
p4: The number of μops going to port 4 (memory write).
Latency:
... may increase the clock counts considerably. Where hyperthreading is enabled, the use of the same execution units in the other thread leads to inferior performance. Denormal numbers, NANs and infinity do not increase the latency. The time unit used is core clock cycles, not the reference clock cycles given by the time stamp counter.
The latencies and throughputs listed below for addition and multiplication using full size YMM registers are obtained only after a warm-up period of a thousand instructions or more. The latencies may be one or two clock cycles longer and the reciprocal throughputs double the values for shorter sequences of code. There is no warm-up effect when vectors are 128 bits wide or less.
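Because the tables count core clock cycles while the time stamp counter counts reference clock cycles, an interval measured with the time stamp counter has to be rescaled before it is compared with the table values. A minimal Python sketch of that rescaling follows. It assumes that both frequencies are known and stayed constant during the measurement. The function name and the example frequencies are hypothetical.

```python
def tsc_to_core_cycles(tsc_ticks: float, core_hz: float, tsc_hz: float) -> float:
    """Rescale a time stamp counter interval to core clock cycles.

    Assumes the core frequency was constant over the interval, which is
    not guaranteed when frequency scaling is active.
    """
    if tsc_hz <= 0:
        raise ValueError("tsc_hz must be positive")
    return tsc_ticks * core_hz / tsc_hz


# Hypothetical example: 3000 reference ticks on a machine whose time stamp
# counter runs at 3.0 GHz while the core runs at 3.6 GHz.
print(tsc_to_core_cycles(3000, core_hz=3.6e9, tsc_hz=3.0e9))  # 3600.0
```

When the core frequency cannot be pinned, counting core clock cycles directly with a performance monitor counter avoids the conversion altogether.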
Instruction
Operands
μops fused domain
μops unfused domain: p015 p0 p1 p5 p23 p4
Latency
Reciprocal throughput
Comments
Move instructions
MOV
MOV
MOV
r,i
r8/16,r8/16
r32/64,r32/64
1
1
1
1
1
1
x
x
x
x
x
x
x
x
x
MOV
MOV
MOV
r8/16,m8/16
r32/64,m32/64
r,m
1
1
1
MOV
MOV
MOVNTI
MOVSX MOVSXD
MOVZX
MOVZX
m,r
m,i
m,r
r,r
r16,r8
r32/64,r8
1
1
2
1
1
1
MOVZX
MOVSX MOVZX
MOVSX MOVZX
MOVSXD
CMOVcc
CMOVcc
XCHG
XCHG
r32/64,r16
r16,m8
r32/64,m
r,r
r,m
r,r
r,m
1
1
1
1
1
1
x
x
x
x
x
x
x
x
x
1
2
1
1
1
x
x
x
x
x
x
2
2
3
7
2
2
3
x
x
x
x
x
x
x
x
x
x
8
10
1
3
2
1
XLAT
PUSH
PUSH
PUSH
PUSH
PUSHF(D/Q)
PUSHA(D)
POP
POP
POP
POPF(D/Q)
POPA(D)
LAHF SAHF
SALC
LEA
LEA
r16,m
r32/64,m
3
1
1
2
2
3
19
1
3
2
9
18
1
3
2
1
LEA
r32/64,m
r32
r64
m
m
1
2
1
1
2
3
1
2
BSWAP
BSWAP
PREFETCHNTA
PREFETCHT0/1/2
LFENCE
MFENCE
r
i
m
(E/R)SP
r
(E/R)SP
m
1
1
1
1
1
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
1
x
x
x
x
x
x
1
1
1
2
1
1
8
1
1
2
1
8
1
1
1
1
1
8
1
1
0.5
0.5
1
3
~340
1
1
0-1
1
1
1
0.33
0.33
0.25
1
3
2
0.33
0.5
0.5
0.67
~0.8
1
2
25
7
3
1
1
2-4
1
3
1
2
1
1
43
43
4
36
may be
elimin.
64 b abs
address
may be
elimin.
implicit
lock
1
1
1
1
1
1
8
0.5
0.5
1
18
9
1
1
1
0.5
1
1
2
2
2
0.33
0.33
0.25
1
2
1
2
3
1
1
1
1
1
0-1
not 64 bit
not 64 bit
not 64 bit
1-2 components
3 components
or RIP
SFENCE
ADD SUB
ADD SUB
ADD SUB
ADC SBB
ADC SBB
ADC SBB
CMP
CMP
INC DEC NEG NOT
INC DEC NEG NOT
AAA AAS
DAA DAS
AAD
AAM
MUL IMUL
MUL IMUL
MUL IMUL
MUL IMUL
IMUL
IMUL
IMUL
IMUL
MUL IMUL
MUL IMUL
MUL IMUL
MUL IMUL
IMUL
IMUL
IMUL
IMUL
DIV
DIV
DIV
DIV
IDIV
IDIV
IDIV
IDIV
CBW
CWDE
CDQE
CWD
CDQ
CQO
POPCNT
POPCNT
CRC32
r,r/i
r,m
m,r/i
r,r/i
r,m
m,r/i
r,r/i
m,r/i
r
m
r8
r16
r32
r64
r,r
r16,r16,i
r32,r32,i
r64,r64,i
m8
m16
m32
m64
r,m
r16,m16,i
r32,m32,i
r64,m64,i
r8
r16
r32
r64
r8
r16
r32
r64
r,r
r,m
r,r
1
1
2
2
2
4
1
1
1
3
2
3
3
8
1
4
3
2
1
2
1
1
1
4
3
2
1
2
1
1
11
11
10
35-57
11
11
9
59-134
1
1
1
2
1
1
1
1
1
1
1
1
2
2
3
1
1
1
1
2
3
3
8
1
4
3
2
1
2
1
1
1
3
2
1
1
2
1
1
11
11
10
x
11
11
9
x
x
x
x
x
x
x
x
x
x
x
x
1
1
1
2
1
1
1
1
1
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
1
x
x
x
x
x
x
x
x
x
x
x
1
2
1
2
1
2
1
1
1
1
1
1
1
1
1
1
1
1
1
1
6
2
2
7-8
1
1
1
6
4
4
2
20
3
4
4
3
3
4
3
3
3
19-22
20-24
19-27
29-94
20-23
20-24
19-26
28-103
x
x
x
x
1
1
1
x
x
x
x
x
x
1
1
1
1
1
1
3
1
3
0.33
0.5
1
1
1
2
0.33
0.5
0.33
1
8
1
2
2
1
1
1
1
1
1
2
2
2
1
1
1
1
9
10
11
22-76
8
8
8-11
26-88
not 64 bit
not 64 bit
not 64 bit
not 64 bit
0.33
0.5
1
1
1
SSE4.2
SSE4.2
SSE4.2
CRC32
Logic instructions
AND OR XOR
AND OR XOR
AND OR XOR
TEST
TEST
SHR SHL SAR
SHR SHL SAR
SHR SHL SAR
SHR SHL SAR
ROR ROL
ROR ROL
ROR ROL
ROR ROL
ROR ROL
RCL RCR
RCL RCR
RCL RCR
RCL RCR
RCL RCR
SHRD SHLD
SHRD SHLD
SHRD SHLD
SHRD SHLD
BT
BT
BT
BTR BTS BTC
BTR BTS BTC
BTR BTS BTC
BSF BSR
BSF BSR
SETcc
SETcc
CLC
STC CMC
CLD STD
r,m
r,r/i
r,m
m,r/i
r,r/i
m,r/i
r,i
m,i
r,cl
m,cl
r,1
r,i
m,i
r,cl
m,cl
r,1
r,i
m,i
r,cl
m,cl
r,r,i
m,r,i
r,r,cl
m,r,cl
r,r/i
m,r
m,i
r,r/i
m,r
m,i
r,r
r,m
r
m
1
1
2
1
1
1
3
2
5
2
1
4
2
5
3
8
11
8
11
1
3
4
5
1
10
2
1
11
3
1
1
1
2
1
1
3
1
1
1
1
1
1
1
2
3
2
1
3
2
3
3
8
8
8
8
1
3
4
4
1
9
1
1
8
2
1
1
1
1
0
1
3
1
1
1
1

JMP
short/near
JMP
r
JMP
m
Conditional jump
short/near
Fused arithmetic and
branch
J(E/R)CXZ
LOOP
LOOP(N)E
short
short
short
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
1
1
2
x
1
1
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
1
1
x
x
1
2
5
2
1
5
1
2
1
1
1
1
2
1
1
1
3
1
x
x
1
1
0.33
0.5
1
0.33
0.5
0.5
2
1
4
1
0.5
2
1
4
2
5
6
5
6
0.5
2
2
4
0.5
5
0.5
0.5
5
2
1
1
0.5
1
0.25
0.33
4
x
x
1
1
1
1
1
1
1
1
0
0
0
0
2
2
2
1-2
1-2
2
7
11
2
7
11
x
x
1
1
1
1
1
x
x
x
x
x
x
6
1
x
x
x
x
SSE4.2
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
1-2
4-5
6
short form
fast if no
jump
fast if no
jump
CALL
CALL
CALL
RET
RET
BOUND
INTO
2
2
3
2
3
15
4
1
1
1
1
2
13
4
String instructions
LODS
REP LODS
STOS
REP STOS
3
~5n
3
many
REP STOS
many
MOVS
REP MOVS
5
2n
REP MOVS
4/16B
SCAS
REP SCAS
CMPS
REP CMPS
3
~6n
5
~8n
4
8
7
5
9
14
18
22
24
3
5
5
3
6
11
15
19
21
Synchronization instructions
XADD
LOCK XADD
LOCK ADD
CMPXCHG
LOCK CMPXCHG
CMPXCHG8B
LOCK CMPXCHG8B
CMPXCHG16B
LOCK CMPXCHG16B
Other
NOP / Long NOP
PAUSE
ENTER
ENTER
LEAVE
XGETBV
CPUID
RDTSC
RDPMC
RDRAND
near
r
m
i
r,m
m,r
m,r
m,r
m,r
m,r
m,r
m,r
m,r
m,r
a,0
a,b
1
1
1
1
1
1
1
2
1
1
2
1
1
1
2
2
2
2
2
7
6
not 64 bit
not 64 bit
1
~2n
1
1
n
worst
case
best
case
1/16B
2
4
n
worst
case
best
case
1/16B
1
~2n
4
~2n
1
0
7
7
12
9
45+7b
3
2
8
37-82
21
35
13
12
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
1
2
1
2
2
2
2
2
2
1
1
1
1
1
1
1
1
1
7
22
22
7
22
7
22
16
27
0.25
10
8
1
84+3b
6
9
XGETBV
100-340

27
39
104-117 RDRAND
Instruction
Move instructions
FLD
FLD
FLD
FBLD
FST(P)
FST(P)
FSTP
FBSTP
FXCH
FILD
FIST(P)
FISTTP
FLDZ
FLD1
FLDPI FLDL2E etc.
FCMOVcc
FNSTSW
FNSTSW
FLDCW
FNSTCW
FINCSTP FDECSTP
FFREE(P)
FNSAVE
FRSTOR
FADD(P) FSUB(R)(P)
FADD(P) FSUB(R)(P)
FMUL(P)
FMUL(P)
FDIV(R)(P)
FDIV(R)(P)
FABS
FCHS
FCOM(P) FUCOM
FCOM(P) FUCOM
FCOMPP FUCOMPP
FCOMI(P) FUCOMI(P)
FIADD FISUB(R)
FIMUL
FIDIV(R)
FICOM(P)
FTST
FXAM
FPREM
FPREM1
FRNDINT
Operands
r
m32/64
m80
m80
r
m32/m64
m80
m80
r
m
m
m
r
AX
m16
m16
m16
r
m
m
r
m
r
m
r
m
r
m
r
m
m
m
m

μops fused domain
μops unfused domain
Latency
Reciprocal throughput
Comments
1
1
4
43
1
1
7
243
1
1
3
3
1
2
2
3
2
2
3
2
1
1
143
90
1
2
1
2
1
2
1
1
1
1
2
3
2
2
2
2
1
2
28
41
17
2
40
1
1
1
1
1
1
1
1
1
1
1
2
3
2
2
2
2
1
2
28
1
1
0
6
7
7
1
2
3
1
1
2
3
0
1
1
1
1
2
2
3
2
1
2
1
1
1
1
2
1
3
5
45
1
4
5
1
1
1
1
1
1
1
1
x
1
1
1
1
1
2
x
2
1
1
1
1
1
2
4
1
1
1
1
1
17
1
1
3
5
1
10-24
1
1
1
1
1
1
1
1
1
1
1
2
1
1
2
1
2
1
1
2
21
1
1
5
252
0.5
1
1
2
1
1
3
1
1
1
4
5
1
1
1
1
21-26
27-50
22
2
2
1
1
3
1
1
1
167
162
1
1
1
1
8-18
8-18
1
1
1
1
1
1
2
2
2
1
2
12
19
11
SSE3
Math
FSCALE
FXTRACT
FSQRT
FSIN
FCOS
FSINCOS
F2XM1
FYL2X
FYL2XP1
FPTAN
FPATAN
25
17
1
21-78
23-100
20-110
16-23
42
56
102
28-72
Other
FNOP
WAIT
FNCLEX
FNINIT
1
2
5
26
25
17
1
x
x
1
x
x
x
x
42 x
56 x
102 x
x
1
2
5
26
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
1
1
x
x
49
10
10-23
47-106
48-115
50-123
~68
90-106
82
130
94-150
49
10
8-17
47-106
48-115
50-123
~68
1
1
22
80

Instruction
Move instructions
MOVD
MOVD
MOVD
MOVD
MOVQ
MOVQ
MOVQ
MOVDQA MOVDQU
MOVDQA MOVDQU
MOVDQA MOVDQU
LDDQU
MOVDQ2Q
MOVQ2DQ
MOVNTQ
MOVNTDQ
MOVNTDQA
PACKSSWB/DW
PACKUSWB
PACKSSWB/DW
PACKUSWB
PACKSSWB/DW
PACKUSWB
PACKSSWB/DW
PACKUSWB
PACKUSDW
Operands

μops fused domain
μops unfused domain
Latency
Reciprocal throughput
Comments
r32/64,(x)mm
m32/64,(x)mm
(x)mm,r32/64
(x)mm,m32/64
(x)mm,(x)mm
(x)mm,m64
m64, (x)mm
x,x
x, m128
m128, x
x, m128
mm, x
x,mm
m64,mm
m128,x
x, m128
1
1
1
1
1
1
1
1
1
1
1
2
1
1
1
1
mm,mm
mm,m64
x,x
x,m128
x,x
1
1
1
1
x
x
x
x
1
1
1
1
x
1
1
1
2
1
x
1
1
1
1
1
1
1
1
1
1
3
1
3
1
3
3
0-1
3
3
3
1
1
~360
~360
3
1
1
1
0.5
0.33
0.5
1
0.25
0.5
1
0.5
1
0.33
1
1
0.5
eliminat.
SSE3
SSE4.1
1
1
0.5
1
1
0.5
0.5
SSE4.1
PACKUSDW
PUNPCKH/LBW/WD/DQ
PUNPCKH/LBW/WD/DQ
PUNPCKH/LQDQ
PUNPCKH/LQDQ
PMOVSX/ZXBW
PMOVSX/ZXBW
PMOVSX/ZXBD
PMOVSX/ZXBD
PMOVSX/ZXBQ
PMOVSX/ZXBQ
PMOVSX/ZXWD
PMOVSX/ZXWD
PMOVSX/ZXWQ
PMOVSX/ZXWQ
PMOVSX/ZXDQ
PMOVSX/ZXDQ
PSHUFB
PSHUFB
PSHUFW
PSHUFW
PSHUFD
PSHUFD
PSHUFL/HW
PSHUFL/HW
PALIGNR
PALIGNR
PBLENDVB
PBLENDVB
PBLENDW
PBLENDW
MASKMOVQ
MASKMOVDQU
PMOVMSKB
PEXTRB
PEXTRB
PEXTRW
PEXTRW
PEXTRD
PEXTRD
PEXTRQ
PEXTRQ
PINSRB
PINSRB
PINSRW
PINSRW
PINSRD
PINSRD
PINSRQ
PINSRQ
x,m
(x)mm,(x)mm
(x)mm,m
x,x
x, m128
x,x
x,m64
x,x
x,m32
x,x
x,m16
x,x
x,m64
x,x
x,m32
x,x
x,m64
(x)mm,(x)mm
(x)mm,m
mm,mm,i
mm,m64,i
xmm,x,i
x,m128,i
x,x,i
x, m128,i
(x)mm,(x)mm,i
(x)mm,m,i
x,x,xmm0
x,m,xmm0
x,x,i
x,m,i
mm,mm
x,x
r32,(x)mm
r32,x,i
m8,x,i
r32,(x)mm,i
m16,(x)mm,i
r32,x,i
m32,x,i
r64,x,i
m64,x,i
x,r32,i
x,m8,i
(x)mm,r32,i
(x)mm,m16,i
x,r32,i
x,m32,i
x,r64,i
x,m64,i
1
1
1
1
2
1
1
1
1
1
1
1
1
1
1
1
1
1
2
1
2
1
2
1
2
1
2
2
3
1
2
4
10
1
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
2
2
1
1
1
4
1
2
1
1
1
2
1
2
1
2
1
2
1
1
1
1
1
1
x
1
1
1
1
1
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
1
1
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
1
1
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
2
1
1
1
2
4
1
2
2
2
2
2
2
1
1
2
1
2
1
2
1
2
1
0.5
0.5
0.5
0.5
0.5
0.5
0.5
0.5
0.5
0.5
0.5
0.5
0.5
0.5
0.5
0.5
0.5
0.5
0.5
0.5
0.5
0.5
0.5
0.5
0.5
0.5
0.5
1
1
0.5
0.5
1
6
1
1
1
1
1
1
1
1
1
1
0.5
1
0.5
1
0.5
1
0.5
SSE4.1
SSE4.1
SSE4.1
SSE4.1
SSE4.1
SSE4.1
SSE4.1
SSE4.1
SSE4.1
SSE4.1
SSE4.1
SSE4.1
SSE4.1
SSSE3
SSSE3
SSSE3
SSSE3
SSE4.1
SSE4.1
SSE4.1
SSE4.1
SSE4.1
SSE4.1
SSE4.1
SSE4.1
SSE4.1
SSE4.1
SSE4.1
SSE4.1
SSE4.1
SSE4.1
SSE4.1
SSE4.1
PHADD/SUB(S)W/D
PHADD/SUB(S)W/D
PCMPEQ/GTB/W/D
PCMPEQ/GTB/W/D
PCMPEQQ
PCMPEQQ
PCMPGTQ
PCMPGTQ
PMULL/HW PMULHUW
PMULL/HW PMULHUW
PMULHRSW
PMULHRSW
PMULLD
PMULLD
PMULDQ
PMULDQ
PMULUDQ
PMULUDQ
PMADDWD
PMADDWD
PMADDUBSW
PMADDUBSW
PAVGB/W
PAVGB/W
PMIN/MAXSB
PMIN/MAXSB
PMIN/MAXUB
PMIN/MAXUB
PMIN/MAXSW
PMIN/MAXSW
PMIN/MAXUW
PMIN/MAXUW
PMIN/MAXU/SD
PMIN/MAXU/SD
PHMINPOSUW
PHMINPOSUW
PABSB/W/D
PABSB/W/D
PSIGNB/W/D
PSIGNB/W/D
PSADBW
PSADBW
MPSADBW
MPSADBW
(x)mm, (x)mm
(x)mm,m
(x)mm, (x)mm
(x)mm,m64
(x)mm,(x)mm
(x)mm,m
x,x
x,m128
x,x
x,m128
(x)mm,(x)mm
(x)mm,m
(x)mm,(x)mm
(x)mm,m
x,x
x,m128
x,x
x,m128
(x)mm,(x)mm
(x)mm,m
(x)mm,(x)mm
(x)mm,m
(x)mm,(x)mm
(x)mm,m
(x)mm,(x)mm
(x)mm,m
x,x
x,m128
(x)mm,(x)mm
(x)mm,m
(x)mm,(x)mm
(x)mm,m
x,x
x,m
x,x
x,m128
x,x
x,m128
(x)mm,(x)mm
(x)mm,m
(x)mm,(x)mm
(x)mm,m
(x)mm,(x)mm
(x)mm,m
x,x,i
x,m,i
1
1
3
4
1
1
1
1
1
1
1
1
1
1
1
2
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
3
4
1
1
3
3
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
3
3
Logic instructions
PAND(N) POR PXOR
PAND(N) POR PXOR
PTEST
(x)mm,(x)mm
(x)mm,m
x,x
1
1
2
1
1
2
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
3
1
1
1
1
1
5
1
5
1
5
1
5
1
5
1
5
1
5
1
5
1
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
1
1
1
1
1
1
1
1
1
1
1
1
1
1
5
1
x
x
x
x
x
x
x
x
1
1
1
1
1
1
1
1
x
x
1
x
x
x
x
x
x
1
1
1
1
1
1
5
1
6
1
1
1
1
0.5
0.5
1.5
1.5
0.5
0.5
0.5
0.5
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
0.5
0.5
0.5
0.5
0.5
0.5
0.5
0.5
0.5
0.5
0.5
0.5
1
1
0.5
0.5
0.5
0.5
1
1
1
1
0.33
0.5
1
SSSE3
SSSE3
SSE4.1
SSE4.1
SSE4.2
SSE4.2
SSSE3
SSSE3
SSE4.1
SSE4.1
SSE4.1
SSE4.1
SSSE3
SSSE3
SSE4.1
SSE4.1
SSE4.1
SSE4.1
SSE4.1
SSE4.1
SSE4.1
SSE4.1
SSSE3
SSSE3
SSSE3
SSSE3
SSE4.1
SSE4.1
SSE4.1
PTEST
PSLL/RL/RAW/D/Q
PSLL/RL/RAW/D/Q
PSLL/RL/RAW/D/Q
PSLL/RL/RAW/D/Q
PSLL/RL/RAW/D/Q
PSLL/RLDQ
x,m128
mm,mm/i
mm,m64
xmm,i
x,x
x,m128
x,i
3
1
1
1
2
3
1
2
1
1
1
2
2
1
1
1
1
1
1
1
String instructions
PCMPESTRI
PCMPESTRI
PCMPESTRM
PCMPESTRM
PCMPISTRI
PCMPISTRI
PCMPISTRM
PCMPISTRM
x,x,i
x,m128,i
x,x,i
x,m128,i
x,x,i
x,m128,i
x,x,i
x,m128,i
8
8
8
8
3
4
3
4
8
7
8
7
3
3
3
3
x,x,i
x,m,i
18
18
x,x
x,m
x,x
x,m
x,x,i
x,m,i
PCLMULQDQ
PCLMULQDQ
AESDEC, AESDECLAST,
AESENC, AESENCLAST
1
1
1
1
1
1
0.5
SSE4.1
4
4
4
4
SSE4.2
SSE4.2
SSE4.2
SSE4.2
SSE4.2
SSE4.2
SSE4.2
SSE4.2
14
8
8
CLMUL
CLMUL
AES
1
2
2
8
7
AES
AES
AES
AES
AES
1
1
x
x
x
x
x
x
3
3
3
3
3
3
3
3
1
1
1
1
4
3
4
3
18
17
x
x
x
x
x
x
3
2
3
11
11
2
2
2
11
10
1
2
2
x
x
31
31
1
2
1
1
4
1
12
1
3
1
3
11
AESDEC, AESDECLAST,
AESENC, AESENCLAST
AESIMC
AESIMC
AESKEYGENASSIST
AESKEYGENASSIST
Other
EMMS
x
x
x
x
1
14
1
10
1
18

Instruction
Move instructions
MOVAPS/D
VMOVAPS/D
MOVAPS/D MOVUPS/D
VMOVAPS/D
VMOVUPS/D
MOVAPS/D MOVUPS/D
VMOVAPS/D
VMOVUPS/D
MOVSS/D
Operands

μops fused domain
μops unfused domain
Latency
Reciprocal throughput
Comments
x,x
y,y
x,m128
1
1
1
1
1
1
1
y,m256
m128,x
1
1
1+
1
m256,y
x,x
1
1
1+
1
0-1
0-1
3
1
1
0.5
elimin.
elimin.
4
3
1
1
AVX
4
1
2
1
AVX
MOVSS/D
MOVSS/D
MOVHPS/D MOVLPS/D
MOVH/LPS/D
MOVLHPS MOVHLPS
MOVMSKPS/D
VMOVMSKPS/D
MOVNTPS/D
VMOVNTPS/D
SHUFPS/D
SHUFPS/D
VSHUFPS/D
VSHUFPS/D
VPERMILPS/PD
VPERMILPS/PD
VPERMILPS/PD
VPERMILPS/PD
VPERMILPS/PD
VPERMILPS/PD
VPERM2F128
VPERM2F128
BLENDPS/PD
BLENDPS/PD
VBLENDPS/PD
VBLENDPS/PD
BLENDVPS/PD
BLENDVPS/PD
VBLENDVPS/PD
VBLENDVPS/PD
MOVDDUP
MOVDDUP
VMOVDDUP
VMOVDDUP
VBROADCASTSS
VBROADCASTSS
VBROADCASTSD
VBROADCASTF128
MOVSH/LDUP
MOVSH/LDUP
VMOVSH/LDUP
VMOVSH/LDUP
UNPCKH/LPS/D
UNPCKH/LPS/D
VUNPCKH/LPS/D
VUNPCKH/LPS/D
EXTRACTPS
EXTRACTPS
VEXTRACTF128
VEXTRACTF128
INSERTPS
INSERTPS
x,m32/64
m32/64,x
x,m64
m64,x
x,x
r32,x
r32,y
m128,x
m256,y
x,x,i
x,m128,i
y,y,y,i
y, y,m256,i
x,x,x/i
y,y,y/i
x,x,m
y,y,m
x,m,i
y,m,i
y,y,y,i
y,y,m,i
x,x,i
x,m128,i
y,y,i
y,m256,i
x,x,xmm0
x,m,xmm0
y,y,y,y
y,y,m,y
x,x
x,m64
y,y
y,m256
x,m32
y,m32
y,m64
y,m128
x,x
x,m128
y,y
y,m256
x,x
x,m128
y,y,y
y,y,m256
r32,x,i
m32,x,i
x,y,i
m128,y,i
x,x,i
x,m32,i
1
1
1
1
1
1
1
1
1
1
2
1
2
1
1
2
2
2
2
1
2
1
2
1
2
2
3
2
3
1
1
1
1
1
2
2
2
1
1
1
1
1
1
1
1
2
3
1
2
1
2
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
2
2
2
2
1
x
x
x
x
x
x
x
x
1
1
1
1
1
1
1
1
1
1
1
1
x
x
x
x
x
x
x
x
1
1
1+
1
1
1+
1
1
1
1+
1
1+
2
1+
1
1
1
1+
2
1
2
1+
1
3
1
3
4
5
5
5
1
3
1
1
1
1
1
1
1
1
1
1
1
1+
1
1
1
1
1
1
1
1
1
2
2
1
0
1
1
1
1
1
1
x
x
1
3
3
4
3
1
2
2
~380
~380
1
1+
x
x
1
1
1
1+
2
1
1
1
1
1
1
2
4
1
0.5
1
1
1
1
1
1
1
2
1
1
1
1
1
1
1
1
1
1
1
1
0.5
0.5
0.5
1
1
1
1
1
1
0.5
1
1
0.5
1
1
1
1
0.5
1
1
1
1
1
1
1
1
1
1
1
1
AVX
AVX
AVX
AVX
AVX
AVX
AVX
AVX
AVX
AVX
AVX
SSE4.1
SSE4.1
AVX
AVX
SSE4.1
SSE4.1
AVX
AVX
SSE3
SSE3
AVX
AVX
AVX
AVX
AVX
AVX
SSE3
SSE3
AVX
AVX
SSE3
SSE3
AVX
AVX
SSE4.1
SSE4.1
AVX
AVX
SSE4.1
SSE4.1
VINSERTF128
VINSERTF128
VMASKMOVPS/D
VMASKMOVPS/D
VMASKMOVPS/D
VMASKMOVPS/D
y,y,x,i
y,y,m128,i
x,x,m128
y,y,m256
m128,x,x
m256,y,y
1
2
3
3
4
4
1
1
2
2
2
2
Conversion
CVTPD2PS
CVTPD2PS
VCVTPD2PS
VCVTPD2PS
CVTSD2SS
CVTSD2SS
CVTPS2PD
CVTPS2PD
VCVTPS2PD
VCVTPS2PD
CVTSS2SD
CVTSS2SD
CVTDQ2PS
CVTDQ2PS
VCVTDQ2PS
VCVTDQ2PS
CVT(T) PS2DQ
CVT(T) PS2DQ
VCVT(T) PS2DQ
VCVT(T) PS2DQ
CVTDQ2PD
CVTDQ2PD
VCVTDQ2PD
VCVTDQ2PD
CVT(T)PD2DQ
CVT(T)PD2DQ
VCVT(T)PD2DQ
VCVT(T)PD2DQ
CVTPI2PS
CVTPI2PS
CVT(T)PS2PI
CVT(T)PS2PI
CVTPI2PD
CVTPI2PD
CVT(T) PD2PI
CVT(T) PD2PI
CVTSI2SS
CVTSI2SS
CVT(T)SS2SI
CVT(T)SS2SI
CVTSI2SD
CVTSI2SD
CVT(T)SD2SI
CVT(T)SD2SI
x,x
x,m128
x,y
x,m256
x,x
x,m64
x,x
x,m64
y,x
y,m128
x,x
x,m32
x,x
x,m128
y,y
y,m256
x,x
x,m128
y,y
y,m256
x,x
x,m64
y,x
y,m128
x,x
x,m128
x,y
x,m256
x,mm
x,m64
mm,x
mm,m128
x,mm
x,m64
mm,x
mm,m128
x,r32
x,m32
r32,x
r32,m32
x,r32
x,m32
r32,x
r32,m64
2
2
2
2
2
2
2
2
2
3
2
2
1
1
1
1
1
1
1
1
2
2
2
2
2
2
2
2
1
1
2
2
2
2
2
2
2
1
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
1
1
1
1
1
1
1
1
1
2
2
2
2
2
2
2
2
1
1
2
1
2
2
2
2
2
1
2
2
2
1
2
2
1
x
x
x
x
x
x
x
x
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1+
1 1
1 1+
2
4
4
5
1
1
1
1
1
2
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
4
1+
4
1
1
1
1
1
1
4
1
2
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
3
1
3
1+
3
1
3
1+
1
1
1
1
1
1
1
1
4
1
5
1
4
1
5
1+
4
1
4
1
1
1
1
1
1
4
1
4
1
4
1
4
1
4
1
4
1
3
1
1
1
1
1
1
3
3
1
1
3
3
1
1
AVX
AVX
AVX
AVX
AVX
AVX
AVX
AVX
AVX
AVX
AVX
AVX
AVX
AVX
AVX
AVX
AVX
AVX
Arithmetic
ADDSS/D SUBSS/D
ADDSS/D SUBSS/D
ADDPS/D SUBPS/D
ADDPS/D SUBPS/D
VADDPS/D VSUBPS/D
VADDPS/D VSUBPS/D
ADDSUBPS/D
ADDSUBPS/D
VADDSUBPS/D
VADDSUBPS/D
HADDPS/D HSUBPS/D
HADDPS/D HSUBPS/D
VHADDPS/D
VHSUBPS/D
VHADDPS/D
VHSUBPS/D
MULSS MULPS
MULSS MULPS
VMULPS
VMULPS
MULSD MULPD
MULSD MULPD
VMULPD
VMULPD
DIVSS DIVPS
DIVSS DIVPS
VDIVPS
VDIVPS
DIVSD DIVPD
DIVSD DIVPD
VDIVPD
VDIVPD
RCPSS/PS
RCPSS/PS
VRCPPS
VRCPPS
CMPccSS/D CMPccPS/D
x,x
x,m32/64
x,x
x,m128
y,y,y
y,y,m256
x,x
x,m128
y,y,y
y,y,m256
x,x
x,m128
1
1
1
1
1
1
1
1
1
1
3
4
1
1
1
1
1
1
1
1
1
1
3
3
1
1
1
1
1
1
1
1
1
1
1
1
2
2
y,y,y
y,y,m256
x,x
x,m
y,y,y
y,y,m256
x,x
x,m
y,y,y
y,y,m256
x,x
x,m
y,y,y
y,y,m256
x,x
x,m
y,y,y
y,y,m256
x,x
x,m128
y,y
y,m256
4
1
1
1
1
1
1
1
1
1
1
3
4
1
1
3
4
1
1
3
4
3
1
1
1
1
1
1
1
1
1
1
3
3
1
1
3
3
1
1
3
3
x,x
x,m128
y,y,y
y,y,m256
x,x
x,m32/64
x,x
x,m32/64
x,x
x,m128
y,y,y
y,y,m256
2
1
2
2
2
1
1
1
1
1
1
1
1
1
2
2
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
2
2
1
1
2
2
1
1
2
2
3
1
3
1
3
1+
3
1
3
1+
5
1
5
1+
5
1
5
1+
5
1
5
1+
10-13
1
1
1
19-21
1+
10-20
1
1
1
20-35
1+
5
1
1
1
7
1+
3
1
1
1
1
1
1
1
1
1
1
2
2
AVX
AVX
SSE3
SSE3
AVX
AVX
SSE3
SSE3
AVX
2
1
1
1
1
1
1
1
1
7
7
14
14
8-14
8-14
16-28
16-28
1
1
2
2
AVX
AVX
AVX
AVX
AVX
AVX
AVX
AVX
AVX
AVX
AVX
CMPccSS/D CMPccPS/D
VCMPccPS/D
VCMPccPS/D
COMISS/D UCOMISS/D
COMISS/D UCOMISS/D
MAXSS/D MINSS/D
MAXSS/D MINSS/D
MAXPS/D MINPS/D
MAXPS/D MINPS/D
VMAXPS/D VMINPS/D
VMAXPS/D VMINPS/D
1
1
1
3
1+
1
3
1
3
1
3
1+
1
1
1
1
1
1
1
1
1
1
1
AVX
AVX
AVX
AVX
ROUNDSS/SD/PS/PD
ROUNDSS/SD/PS/PD
VROUNDSS/SD/PS/PD
VROUNDSS/SD/PS/PD
DPPS
DPPS
VDPPS
VDPPS
DPPD
DPPD
x,x,i
x,m128,i
y,y,i
y,m256,i
x,x,i
x,m128,i
y,y,y,i
y,m256,i
x,x,i
x,m128,i
1
2
1
2
4
6
4
6
3
4
1
1
1
1
4
5
4
5
3
3
1
1
1
1
2
2
2
2
1
1
1
1
1
1
1
1
Math
SQRTSS/PS
SQRTSS/PS
VSQRTPS
VSQRTPS
SQRTSD/PD
SQRTSD/PD
VSQRTPD
VSQRTPD
RSQRTSS/PS
RSQRTSS/PS
VRSQRTPS
VRSQRTPS
x,x
x,m128
y,y
y,m256
x,x
x,m128
y,y
y,m256
x,x
x,m128
y,y
y,m256
1
1
3
4
1
1
3
4
1
1
3
4
1
1
3
3
1
1
3
3
1
1
3
3
1
1
2
2
1
1
2
2
1
1
2
2
x,x
x,m128
1
1
1
1
1
1
y,y,y
y,y,m256
m32
m32
m4096
m4096
m
4
12
20
3
3
130
116
100-161
0
2
2
2
2
1
3
1+
1
2
1
2
1
1
12
1
12
1+
9
1
11
1
1
1
19
1+
16
1
1
1
28
1+
5
1
1
1
7
1+
1
1
1
1
2
4
2
4
1
1
7
7
14
14
8-14
8-14
16-28
16-28
1
1
2
2
SSE4.1
SSE4.1
AVX
AVX
SSE4.1
SSE4.1
AVX
AVX
SSE4.1
SSE4.1
AVX
AVX
AVX
AVX
AVX
AVX
Logic
VAND/ANDN/OR/XORPS/PD
VAND/ANDN/OR/XORPS/PD
Other
VZEROUPPER
VZEROALL
VZEROALL
LDMXCSR
STMXCSR
FXSAVE
FXRSTOR
XSAVEOPT
1
1
1
1
1
1
AVX
AVX
1
11
9
3
1
66
68
AVX
32 bit
64 bit
1+
1
1
6
7
60-500
Haswell
Intel Haswell
Instruction:
Name of instruction. Multiple names mean that these instructions have the same data.
Instructions with or without V name prefix behave the same unless otherwise noted.
Operands:
i = immediate data, r = register, mm = 64 bit mmx register, x = 128 bit xmm register,
(x)mm = mmx or xmm register, y = 256 bit ymm register, v = any vector register (mmx,
xmm, ymm). same = same register for both operands. m = memory operand, m32 = 32-bit memory operand, etc.
μops fused domain:
The number of μops at the decode, rename and allocate stages in the pipeline. Fused
μops count as one.
μops unfused domain:
The total number of μops for all execution ports. Fused μops count as two. Fused macro-ops count as one. The instruction has μop fusion if this number is higher than the number under fused domain. Some operations are not counted here if they do not go to any
execution port or if the counters are inaccurate.
μops each port:
The number of μops for each execution port. p0 means a μop to execution port 0.
p01 means a μop that can go to either port 0 or port 1. p0 p1 means two μops going to
port 0 and 1, respectively.
Port 0: Integer, f.p. and vector ALU, mul, div, branch
Port 1: Integer, f.p. and vector ALU
Port 2: Load
Port 3: Load
Port 4: Store
Port 5: Integer and vector ALU
Port 6: Integer ALU, branch
Port 7: Store address
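The port notation used in the "μops each port" column can be read mechanically. The following Python sketch is an illustration only and not part of the measurement method. The function names parse_ports and total_uops are hypothetical, and the sketch assumes single-digit port numbers (0 to 7, as in the list above). It does not handle entries such as "none".

```python
import re


def parse_ports(notation):
    """Parse a 'μops each port' entry such as '2p0156 p23'.

    Returns a list of (count, ports) tuples. Each tuple says that
    `count` μops can each go to any one of the ports in `ports`.
    Assumption: every port number is a single digit.
    """
    result = []
    for token in notation.split():
        match = re.fullmatch(r"(\d*)p(\d+)", token)
        if match is None:
            raise ValueError(f"unrecognized port token: {token!r}")
        # A missing leading number means one μop.
        count = int(match.group(1)) if match.group(1) else 1
        ports = {int(digit) for digit in match.group(2)}
        result.append((count, ports))
    return result


def total_uops(notation):
    """Total number of μops named in a port entry."""
    return sum(count for count, _ in parse_ports(notation))


if __name__ == "__main__":
    for entry in ("p0", "p01", "p0 p1", "2p0156 p23"):
        print(entry, "->", parse_ports(entry), "total:", total_uops(entry))
```

With this reading, "p01" is one μop with a choice of two ports, while "p0 p1" is two μops, one bound to each port.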
Latency:
This is the delay that the instruction generates in a dependency chain. The numbers are
minimum values. Cache misses, misalignment, and exceptions may increase the clock
counts considerably. Where hyperthreading is enabled, the use of the same execution
units in the other thread leads to inferior performance. Denormal numbers, NaNs and infinity do not increase the latency. The time unit used is core clock cycles, not the reference clock cycles given by the time stamp counter.
Reciprocal throughput:
The average number of core clock cycles per instruction for a series of independent instructions of the same kind in the same thread.
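The difference between latency and reciprocal throughput can be illustrated with a rough estimate. The Python sketch below is a simplified model under stated assumptions, not a measurement: it assumes the instructions are split evenly over a number of independent dependency chains and that nothing else (decoding, memory access, other instructions) limits execution. The function name estimate_cycles and the numbers in the example are hypothetical.

```python
def estimate_cycles(n, latency, reciprocal_throughput, chains=1):
    """Rough lower bound, in core clock cycles, for n instructions of one kind.

    n                      total number of instructions
    latency                latency from the table, in core clock cycles
    reciprocal_throughput  reciprocal throughput from the table
    chains                 number of independent dependency chains (assumed
                           to share the n instructions evenly)
    """
    # Each chain must wait for the previous result: latency-bound part.
    latency_bound = n * latency / chains
    # The execution units cannot start instructions faster than this.
    throughput_bound = n * reciprocal_throughput
    return max(latency_bound, throughput_bound)


if __name__ == "__main__":
    # Illustrative values only, not taken from a specific table row.
    for chains in (1, 2, 3, 4):
        print(chains, "chain(s):", estimate_cycles(1000, 3, 1, chains))
```

In this model the latency stops being the limiting factor once the number of independent chains reaches the latency divided by the reciprocal throughput.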
Instruction   Operands        μops fused   μops unfused   μops each port   Latency   Reciprocal   Comments
                              domain       domain                                    throughput
Move instructions
MOV           r,i             1            1              p0156                      0.25
MOV           r8/16,r8/16     1            1              p0156            1         0.25
MOV           r32/64,r32/64   1            1              p0156            0-1       0.25         may be elim.
MOV           r8l,m           1            2              p23 p0156                  0.5
MOV           r8h,m           1            1              p23                        0.5
MOV           r16,m           1            2              p23 p0156                  0.5
MOV           r32/64,m        1            1              p23                        0.5          all addressing modes
MOV
MOV
MOVNTI
MOVSX MOVZX
MOVSXD
MOVSX MOVZX
MOVSX MOVZX
MOVSXD
CMOVcc
CMOVcc
XCHG
XCHG
XLAT
PUSH
PUSH
PUSH
PUSH
PUSHF(D/Q)
PUSHA(D)
POP
POP
POP
POPF(D/Q)
POPA(D)
LAHF SAHF
SALC
LEA
m,r
m,i
m,r
r,r
1
1
2
1
2
2
2
1
p237 p4
p237 p4
p23 p4
p0156
r16,m8
r,m
1
1
2
1
p23 p0156
p23
r,r
r,m
r,r
r,m
2
3
3
8
3
2
2
3
3
4
19
1
3
3
9
18
1
3
2
2p0156
2p0156 p23
3p0156
r16,m
2
3
3
8
3
1
1
2
2
3
11
1
3
2
9
18
1
3
2
LEA
r32/64,m
LEA
r32/64,m
LEA
BSWAP
BSWAP
MOVBE
MOVBE
MOVBE
MOVBE
MOVBE
MOVBE
PREFETCHNTA/0/1/2
LFENCE
MFENCE
SFENCE
3
~400
1
1
1
1
0.25
0.5
0.5
2
2
21
7
3
0.5
1
1
implicit lock
p23
p23 2p0156
2p237 p4
p06
3p0156
p1 p0156
1
1
4
2
1
1
1
1
1
8
0.5
4
1
18
9
1
1
1
p15
0.5
p1
r32/64,m
p1
r32
r64
r16,m16
r32,m32
r64,m64
m16,r16
m32,r32
m64,r64
1
2
3
2
3
2
2
3
1
2
3
2
3
3
3
4
p15
p06 p15
2p0156 p23
p15 p23
2p0156 p23
p06 p237 p4
p15 p237 p4
p06 p15 p237 p4
p23
0.5
2
2
none counted
p23 p4
p23 p4
4
33
5
r
i
m
stack pointer
r
stack pointer
m
2
3
2
p237 p4
p237 p4
p4 2p237
p0156 p237 p4
p1 p4 p237 p06
1
1
2
all other
combinations
0.5
1
0.5
0.5
0.5
1
1
1
not 64 bit
not 64 bit
not 64 bit
16 or 32 bit
address size
1 or 2 components in
address
3 components
in address
rip relative
address
MOVBE
MOVBE
MOVBE
MOVBE
MOVBE
MOVBE
ADD SUB
ADD SUB
ADD SUB
r,r/i
r,m
m,r/i
1
1
2
1
2
4
ADC SBB
ADC SBB
ADC SBB
r,r/i
r,m
m,r/i
2
2
4
2
3
6
CMP
CMP
INC DEC NEG
NOT
INC DEC NOT
NEG
AAA
AAS
DAA DAS
AAD
AAM
MUL IMUL
MUL IMUL
MUL IMUL
MUL IMUL
MUL IMUL
MUL IMUL
MUL IMUL
MUL IMUL
IMUL
IMUL
IMUL
IMUL
IMUL
IMUL
IMUL
IMUL
MULX
MULX
MULX
MULX
DIV
DIV
DIV
DIV
IDIV
IDIV
IDIV
IDIV
CBW
CWDE
CDQE
CWD
r,r/i
m,r/i
r
1
1
1
m
m
3
2
2
2
3
3
8
1
4
3
2
1
4
3
2
1
1
2
1
1
2
1
1
3
3
2
2
9
11
10
36
9
10
9
59
1
1
1
2
r8
r16
r32
r64
m8
m16
m32
m64
r,r
r,m
r16,r16,i
r32,r32,i
r64,r64,i
r16,m16,i
r32,m32,i
r64,m64,i
r32,r32,r32
r32,r32,m32
r64,r64,r64
r64,r64,m64
r8
r16
r32
r64
r8
r16
r32
r64
p0156
p0156 p23
2p0156 2p237 p4
2p0156
2p0156 p23
3p0156 2p237 p4
1
1
2
1
2
1
p0156
p0156 p23
p0156
1
1
1
0.25
0.5
0.25
4
4
2
2
3
3
8
1
4
3
2
2
5
4
3
1
2
2
1
1
3
2
2
3
4
2
3
9
11
10
36
9
10
9
59
1
1
1
2
p0156 2p237 p4
p0156 2p237 p4
p1 p0156
p1 p56
p1 2p0156
p1 2p0156
p0 p1 p5 p6
p1
p1 p0156
p1 p0156
p1 p6
p1
p1 p0156 p23
p1 p0156 p23
p1 p6 p23
p1
p1 p23
p1 p0156
p1
p1
p1 p0156 p23
p1 p23
p1 p23
p1 2p056
p1 2p056 p23
p1 p6
p1 p6 p23
p0 p1 p5 p6
p0 p1 p5 p6
p0 p1 p5 p6
p0 p1 p5 p6
p0 p1 p5 p6
p0 p1 p5 p6
p0 p1 p5 p6
p0 p1 p5 p6
p0156
p0156
p0156
p0156
6
6
4
6
4
4
21
3
4
4
3
1
1
3
4
3
3
4
4
22-25
23-26
22-29
32-96
23-26
23-26
22-29
39-103
1
1
1
1
0.25
0.5
1
8
1
2
2
1
1
2
2
1
1
1
1
1
1
1
1
1
1
1
1
1
9
9
9-11
21-74
8
8
8-11
24-81
not 64 bit
not 64 bit
not 64 bit
not 64 bit
not 64 bit
AVX2
AVX2
AVX2
AVX2
CDQ
CQO
POPCNT
POPCNT
CRC32
CRC32
Logic instructions
AND OR XOR
AND OR XOR
AND OR XOR
TEST
TEST
SHR SHL SAR
SHR SHL SAR
SHR SHL SAR
SHR SHL SAR
ROR ROL
ROR ROL
ROR ROL
ROR ROL
ROR ROL
RCR RCL
RCR RCL
RCR RCL
RCR RCL
RCR RCL
RCR RCL
SHRD SHLD
SHRD SHLD
SHLD
SHRD
SHRD SHLD
SHLX SHRX SARX
SHLX SHRX SARX
RORX
RORX
BT
BT
BT
BTR BTS BTC
BTR BTS BTC
BTR BTS BTC
BSF BSR
BSF BSR
SETcc
SETcc
CLC
STC
r,r
r,m
r,r
r,m
1
1
1
1
1
1
1
1
1
2
1
2
p06
p06
p1
p1 p23
p1
p1 p23
1
1
3
r,r/i
r,m
m,r/i
1
1
2
1
2
4
p0156
p0156 p23
2p0156 2p237 p4
r,r/i
m,r/i
r,i
m,i
r,cl
m,cl
r,1
r,i
m,i
r,cl
m,cl
r,1
m,1
r,i
m,i
r,cl
m,cl
r,r,i
m,r,i
r,r,cl
r,r,cl
m,r,cl
r,r,r
r,m,r
r,r,i
r,m,i
r,r/i
m,r
m,i
r,r/i
m,r
m,i
r,r
r,m
r
m
1
1
1
3
3
5
2
1
4
3
5
3
4
8
11
8
11
1
3
4
4
5
1
2
1
2
1
10
2
1
10
3
1
1
1
2
1
1
1
2
1
4
3
6
2
1
5
3
6
3
6
8
11
8
11
1
5
4
4
7
1
2
1
2
1
10
2
1
11
4
1
2
1
3
0
1
p0156
p0156 p23
p06
2p06 p237 p4
3p06
3p06 2p23 p4
2p06
p06
2p06 2p237 p4
3p06
1
1
1
1
2
2p06 p0156
p0156
p0156
p1
p0156
p0156
3
4
p06
p06 p23
p06
p06 p23
p06
p06 p23
p06
2p06 p23 p4
p1
p1 p23
p06
p06 p237 p4
none
p0156
1
1
2
1
1
3
1
1
1
1
1
SSE4.2
SSE4.2
SSE4.2
SSE4.2
0.25
0.5
1
0.25
0.5
0.5
2
2
4
1
0.5
2
2
4
2
3
6
6
6
6
1
2
2
2
4
0.5
0.5
0.5
0.5
0.5
5
0.5
0.5
5
2
1
1
0.5
1
0.25
0.25
short form
BMI2
BMI2
BMI2
BMI2
CMC
CLD STD
LZCNT
LZCNT
TZCNT
TZCNT
ANDN
ANDN
BLSI BLSMSK
BLSR
BLSI BLSMSK
BLSR
BEXTR
BEXTR
BZHI
BZHI
PDEP
PDEP
PEXT
PEXT
r,r
r,m
r,r
r,m
r,r,r
r,r,m
r,r
1
2
1
1
1
1
1
1
1
1
3
1
2
1
2
1
2
1
p0156
p15 p6
p1
p1 p23
p1
p1 p23
p15
p15 p23
p15
r,m
p15 p23
r,r,r
r,m,r
r,r,r
r,m,r
r,r,r
r,r,m
r,r,r
r,r,m
2
3
1
1
1
1
1
1
2
3
1
1
1
2
1
2
2p0156
2p0156 p23
p15
p15 p23
p1
p1 p23
p1
p1 p23

JMP
short/near
JMP
r
JMP
m
Conditional jump
short/near
1
1
1
1
1
1
2
1
p6
p6
p23 p6
p6
1-2
2
2
1-2
Conditional jump
p06
0.5-1
p6
1-2
p06
0.5-1
2
7
11
2
2
3
1
3
15
4
2
7
11
3
3
4
2
4
15
4
p0156 p6
0.5-2
5
6
2
2
3
1
2
8
5
not 64 bit
not 64 bit
3
2
5n+12
3
<2n
3
2
2p0156 p23
p0156 p23
p23 p0156 p4
1
1
~2n
1
~0.5n
worst case
Fused arithmetic
and branch
Fused arithmetic
and branch
J(E/R)CXZ
LOOP
LOOP(N)E
CALL
CALL
CALL
RET
RET
BOUND
INTO
String instructions
LODSB/W
LODSD/Q
REP LODS
STOS
REP STOS
short/near
short
short
short
near
r
m
i
r,m
p237 p4 p6
p237 p4 p6
2p237 p4 p6
p237 p6
p23 2p6 p015
1
3
3
1
1
1
2
1
3
3
4
1
1
1
1
0.5
0.5
0.5
LZCNT
LZCNT
BMI1
BMI1
BMI1
BMI1
BMI1
0.5
BMI1
0.5
1
0.5
0.5
1
1
1
1
BMI1
BMI1
BMI2
BMI2
BMI2
BMI2
BMI2
BMI2
predicted
taken
predicted not
taken
predicted
taken
predicted not
taken
REP STOS
2.6/32B
MOVS
REP MOVS
REP MOVS
5
~2n
4/32B
SCAS
REP SCAS
CMPS
REP CMPS
3
6n
5
8n
p23 2p0156
2p23 3p0156
Synchronization instructions
XADD
m,r
LOCK XADD
m,r
LOCK ADD
m,r
CMPXCHG
m,r
LOCK CMPXCHG
m,r
CMPXCHG8B
m,r
LOCK CMPXCHG8B
m,r
CMPXCHG16B
m,r
LOCK CMPXCHG16B
m,r
4
9
8
5
10
15
19
22
24
5
9
8
6
10
15
19
22
24
1
1
0
0
Other
NOP (90)
Long NOP (0F 1F)
PAUSE
ENTER
ENTER
LEAVE
CPUID
XGETBV
RDTSC
RDPMC
RDRAND
a,0
a,b
1/32B
2p23 p4 2p0156
5
5
12
12
~14+7b ~45+7b
3
3
34-69 34-116
8
8
15
15
34
34
17
17
4
~1.5 n
1/32B
best case
aligned by 32
worst case
best case
aligned by 32
1
2n
4
2n
7
19
19
8
19
9
19
15
25
none
none
0.25
0.25
p05 3p6
9
8
~87+2b
2p0156 p23
p23 16p0156
6
100-250 100-250
9
24
37
~320
XGETBV
RDRAND
Instruction
Move instructions
FLD
FLD
FLD
FBLD
FST(P)
FST(P)
FSTP
FBSTP
Operands
r
m32/64
m80
m80
r
m32/m64
m80
m80
μops fused domain
μops unfused domain
μops each port
Latency
Reciprocal throughput
Comments
1
1
4
43
1
1
7
238
1
1
4
43
1
2
7
226
p01
p23
2p01 2p23
p01
p4 p237
3p0156 2p23 2p4
1
3
4
47
1
4
1
0.5
0.5
2
22
0.5
1
5
265
FXCH
FILD
FIST(P)
FISTTP
FLDZ
FLD1
FLDPI FLDL2E etc.
FCMOVcc
FNSTSW
FNSTSW
FLDCW
FNSTCW
FINCSTP FDECSTP
FFREE(P)
FNSAVE
FRSTOR
FADD(P)
FSUB(R)(P)
FADD(P)
FSUB(R)(P)
FMUL(P)
FMUL(P)
FDIV(R)(P)
FDIV(R)(P)
FABS
FCHS
FCOM(P) FUCOM
FCOM(P) FUCOM
FCOMPP FUCOMPP
FCOMI(P) FUCOMI(P)
FIADD FISUB(R)
FIMUL
FIDIV(R)
FICOM(P)
FTST
FXAM
FPREM
FPREM1
FRNDINT
Math
FSCALE
FXTRACT
FSQRT
FSIN
FCOS
FSINCOS
F2XM1
FYL2X
FYL2XP1
r
m
m
m
0
2
3
3
1
2
2
3
2
3
3
3
1
1
147
90
none
p01 p23
p1 p23 p4
p1 p23 p4
p01
2p01
2p01
2p0 p5
p0 p0156
p0 p4 p237
p01 p23 p6
p237 p4 p6
p01
p01
0
6
7
7
r
m
m
2
1
3
3
1
2
2
3
2
2
3
2
1
1
147
90
p1
m
r
m
r
m
1
1
1
1
1
1
1
1
1
2
3
2
2
2
2
1
2
28
41
17
2
1
2
1
2
1
1
1
1
2
3
3
3
3
3
1
2
28
41
17
p1 p23
p0
p0 p23
p0
p0 p23
p0
p0
p1
p1 p23
2p01
3p01
2p1 p23
p0 p1 p23
p0 p1 p23
2p1 p23
p1
2p1
r
AX
m16
m16
m16
r
m
r
m
m
m
m
25-75
17
1
71-100
110
70-120
58-89
55-417
55-228
17
1
2
6
7
0
5
10-24
1
1
19
27
11
p0
49-125
15
10-23
47-106
112
52-123
63-68
58-680
58-360
0.5
1
1
2
1
2
2
2
1
1
2
1
0.5
0.5
150
164
1
1
1
1
8-18
8-18
1
1
1
1
1
1.5
2
2
2
1
2
13
17
23
11
8-17
SSE3
FPTAN
FPATAN
110-121
78-160
Other
FNOP
WAIT
FNCLEX
FNINIT
1
2
5
26
130
96-156
1
2
5
26
p01
p01
p0156
0.5
1
22
83
Instruction
Move instructions
MOVD
MOVD
MOVD
MOVD
MOVQ
MOVQ
MOVQ
MOVQ
MOVQ
MOVDQA/U
MOVDQA/U
MOVDQA/U
VMOVDQA/U
VMOVDQA/U
VMOVDQA/U
LDDQU
MOVDQ2Q
MOVQ2DQ
MOVNTQ
MOVNTDQ
VMOVNTDQ
MOVNTDQA
VMOVNTDQA
PACKSSWB/DW
PACKUSWB
PACKSSWB/DW
PACKUSWB
PACKSSWB/DW
PACKUSWB
PACKSSWB/DW
PACKUSWB
PACKUSDW
PACKUSDW
PUNPCKH/L
BW/WD/DQ
Operands
μops fused domain
μops unfused domain
μops each port
Latency
Reciprocal throughput
Comments
r32/64,(x)mm
m32/64,(x)mm
(x)mm,r32/64
(x)mm,m32/64
r64,(x)mm
(x)mm,r64
(x)mm,(x)mm
(x)mm,m64
m64, (x)mm
x,x
x, m128
m128, x
1
1
1
1
1
1
1
1
1
1
1
1
1
2
1
1
1
1
1
2
1
1
2
p0
p237 p4
p5
p23
p0
p5
p015
p23
p237 p4
p015
p23
p237 p4
1
3
1
3
1
1
1
3
3
0-1
3
3
1
1
1
0.5
1
1
0.33
0.5
1
0.33
0.5
1
y,y
y,m256
m256,y
x, m128
mm, x
x,mm
m64,mm
m128,x
m256,y
x, m128
y,m256
1
1
1
1
2
1
1
1
1
1
1
1
2
1
2
1
2
2
2
1
1
p015
p23
p237 p4
p23
p01 p5
p015
p237 p4
p237 p4
p237 p4
p23
p23
0-1
3
4
3
1
1
~400
~400
~400
3
3
0.33
1
1
0.5
1
0.33
1
1
1
0.5
0.5
mm,mm
p5
mm,m64
p23 2p5
x,x / y,y,y
p5
x,m / y,y,m
x,x / y,y,y
x,m / y,y,m
1
1
1
2
1
2
p23 p5
p5
p23 p5
1
1
1
v,v / v,v,v
p5
may be elim.
AVX
may be elim.
AVX
AVX
SSE3
AVX2
SSE4.1
AVX2
SSE4.1
SSE4.1
PUNPCKH/L
BW/WD/DQ
PUNPCKH/L
QDQ
PUNPCKH/L
QDQ
v,m / v,v,m
p23 p5
x,x / y,y,y
p5
x,m / y,y,m
p23 p5
PMOVSX/ZX BW
BD BQ DW DQ
x,x
p5
PMOVSX/ZX BW
BD BQ DW DQ
x,m
p23 p5
VPMOVSX/ZX BW
BD BQ DW DQ
y,x
p5
y,m
v,v / v,v,v
v,m / v,v,m
mm,mm,i
mm,m64,i
v,v,i
v,m,i
v,v,i
v,m,i
v,v,i / v,v,v,i
v,m,i / v,v,m,i
x,x,xmm0
x,m,xmm0
v,v,v,v
v,v,m,v
x,x,i / v,v,v,i
x,m,i / v,v,m,i
v,v,v,i
v,v,m,i
y,y,y
y,y,m
y,y,i
y,m,i
y,y,y,i
y,y,m,i
mm,mm
x,x
v,v,m
m,v,v
r,v
r32,x,i
m8,x,i
x,y,i
m,y,i
x,r32,i
x,m8,i
(x)mm,r32,i
(x)mm,m16,i
x,r32,i
x,m32,i
y,y,x,i
2
1
2
1
2
1
2
1
2
1
2
2
3
2
3
1
2
1
2
1
1
1
2
1
2
4
10
3
4
1
2
2
1
2
2
2
2
2
2
2
1
2
1
2
1
2
1
2
1
2
1
2
2
3
2
3
1
2
1
2
1
2
1
2
1
2
4
10
3
4
1
2
3
1
2
2
2
2
2
2
2
1
p5 p23
p5
p23 p5
p5
p23 p5
p5
p23 p5
p5
p23 p5
p5
p23 p5
2p5
2p5 p23
2p5
2p5 p23
p5
p23 p5
p015
p015 p23
p5
p5 p23
p5
p5 p23
p5
p5 p23
p0 p4 2p23
4p04 2p56 4p23
p23 2p5
p0 p1 p4 p23
p0
p0 p5
p23 p4 p5
p5
p23 p4
p5
p23 p5
p5
p23 p5
p5
p23 p5
p5
VPMOVSX/ZX BW
BD BQ DW DQ
PSHUFB
PSHUFB
PSHUFW
PSHUFW
PSHUFD
PSHUFD
PSHUFL/HW
PSHUFL/HW
PALIGNR
PALIGNR
PBLENDVB
PBLENDVB
VPBLENDVB
VPBLENDVB
PBLENDW
PBLENDW
VPBLENDD
VPBLENDD
VPERMD
VPERMD
VPERMQ
VPERMQ
VPERM2I128
VPERM2I128
MASKMOVQ
MASKMOVDQU
VPMASKMOVD/Q
VPMASKMOVD/Q
PMOVMSKB
PEXTRB/W/D/Q
PEXTRB/W/D/Q
VEXTRACTI128
VEXTRACTI128
PINSRB
PINSRB
PINSRW
PINSRW
PINSRD/Q
PINSRD/Q
VINSERTI128
1
1
1
1
1
1
1
1
1
2
2
1
1
3
3
3
13-413
14-438
4
13-14
3
2
3
4
2
2
2
3
SSE4.1
SSE4.1
AVX2
1
1
1
1
1
1
1
1
1
1
1
2
1
2
2
1
1
0.33
0.5
1
1
1
1
1
1
1
6
2
1
1
1
1
1
1
2
1
2
1
2
1
1
AVX2
SSSE3
SSSE3
SSSE3
SSSE3
SSE4.1
SSE4.1
AVX2
AVX2
SSE4.1
SSE4.1
AVX2
AVX2
AVX2
AVX2
AVX2
AVX2
AVX2
AVX2
AVX2
AVX2
SSE4.1
SSE4.1
AVX2
AVX2
SSE4.1
SSE4.1
SSE4.1
SSE4.1
AVX2
VINSERTI128
VPBROADCAST
B/W/D/Q
VPBROADCAST
B/W
VPBROADCAST
D/Q
VPBROADCAST
B/W/D/Q
VPBROADCAST
B/W
VPBROADCAST
D/Q
y,y,m,i
p015 p23
0.5
AVX2
x,x
p5
AVX2
x,m8/16
p01 p23 p5
AVX2
x,m32/64
p23
0.5
AVX2
y,x
p5
AVX2
y,m8/16
p01 p23 p5
AVX2
y,m32/64
y,m128
x,[r+s*x],x
y,[r+s*y],y
x,[r+s*x],x
x,[r+s*y],x
x,[r+s*x],x
y,[r+s*x],y
x,[r+s*x],x
y,[r+s*y],y
1
1
20
34
15
22
12
20
14
22
1
1
20
34
15
22
12
20
14
22
p23
p23
5
3
0.5
0.5
9
12
8
7
7
9
7
9
AVX2
AVX2
AVX2
AVX2
AVX2
AVX2
AVX2
AVX2
AVX2
AVX2
PADD/SUB(S,US)
B/W/D/Q
v,v / v,v,v
p15
0.5
PADD/SUB(S,US)
B/W/D/Q
v,m / v,v,m
p15 p23
v,v / v,v,v
p1 2p5
v,m / v,v,m
p1 2p5 p23
v,v / v,v,v
p15
v,m / v,v,m
v,v / v,v,v
v,m / v,v,m
v,v / v,v,v
v,m / v,v,m
1
1
1
1
1
2
1
2
1
2
p15 p23
p15
p15 p23
p0
p0 p23
v,v / v,v,v
p0
v,m / v,v,m
v,v / v,v,v
v,m / v,v,m
x,x / y,y,y
x,m / y,y,m
x,x / y,y,y
x,m / y,y,m
v,v / v,v,v
v,m / v,v,m
v,v / v,v,v
1
1
1
2
3
1
1
1
1
1
2
1
2
2
3
1
2
1
2
1
p0 p23
p0
p0 p23
2p0
2p0 p23
p0
p0 p23
p0
p0 p23
p0
VBROADCASTI128
VPGATHERDD
VPGATHERDD
VPGATHERQD
VPGATHERQD
VPGATHERDQ
VPGATHERDQ
VPGATHERQQ
VPGATHERQQ
PHADD(S)W/D
PHSUB(S)W/D
PHADD(S)W/D
PHSUB(S)W/D
PCMPEQB/W/D
PCMPGTB/W/D
PCMPEQB/W/D
PCMPGTB/W/D
PCMPEQQ
PCMPEQQ
PCMPGTQ
PCMPGTQ
PMULL/HW
PMULHUW
PMULL/HW
PMULHUW
PMULHRSW
PMULHRSW
PMULLD
PMULLD
PMULDQ
PMULDQ
PMULUDQ
PMULUDQ
PMADDWD
0.5
3
1
5
5
10
5
5
5
SSSE3
SSSE3
0.5
0.5
0.5
0.5
1
1
SSE4.1
SSE4.1
SSE4.2
SSE4.2
1
1
1
1
2
2
1
1
1
1
1
SSSE3
SSSE3
SSE4.1
SSE4.1
SSE4.1
SSE4.1
PMADDWD
PMADDUBSW
PMADDUBSW
PAVGB/W
PAVGB/W
PMIN/PMAX
SB/SW/SD
UB/UW/UD
PMIN/PMAX
SB/SW/SD
UB/UW/UD
PHMINPOSUW
PHMINPOSUW
PABSB/W/D
PABSB/W/D
PSIGNB/W/D
PSIGNB/W/D
PSADBW
PSADBW
MPSADBW
MPSADBW
Logic instructions
PAND PANDN
POR PXOR
PAND PANDN
POR PXOR
PTEST
PTEST
PSLLW/D/Q
PSRLW/D/Q
PSRAW/D/Q
PSLLW/D/Q
PSRLW/D/Q
PSRAW/D/Q
PSLLW/D/Q
PSRLW/D/Q
PSRAW/D/Q
PSLLW/D/Q
PSRLW/D/Q
PSRAW/D/Q
PSLLW/D/Q
PSRLW/D/Q
PSRAW/D/Q
VPSLLVD/Q
VPSRAVD
VPSRLVD/Q
VPSLLVD/Q
VPSRAVD
VPSRLVD/Q
PSLLDQ
PSRLDQ
v,m / v,v,m
v,v / v,v,v
v,m / v,v,m
v,v / v,v,v
v,m / v,v,m
1
1
1
1
1
2
1
2
1
2
p0 p23
p0
p0 p23
p15
p15 p23
x,x / y,y,y
p15
x,m / y,y,m
x,x
x,m128
v,v
v,m
v,v / v,v,v
v,m / v,v,m
v,v / v,v,v
v,m / v,v,m
x,x,i / v,v,v,i
x,m,i / v,v,m,i
1
1
1
1
1
1
1
1
1
3
4
2
1
2
1
2
1
2
1
2
3
4
p15 p23
p0
p1 p23
p15
p15 p23
p15
p15 p23
p0
p0 p23
p0 2p5
p0 2p5 p23
v,v / v,v,v
p015
0.33
v,m / v,v,m
v,v
v,m
1
2
2
2
2
3
p015 p23
p0 p5
p0 p5 p23
0.5
1
1
mm,mm
p0
mm,m64
p0 p23
x,x / v,v,x
p0 p5
x,m / v,v,m
p0 p23
v,i / v,v,i
p0
v,v,v
2p0 p5
AVX2
v,v,m
2p0 p5 p23
AVX2
x,i / v,v,i
p5
5
1
5
1
1
5
6
1
1
1
0.5
0.5
SSSE3
SSSE3
0.5
SSE4.1
0.5
1
1
0.5
0.5
0.5
0.5
1
1
2
2
SSE4.1
SSE4.1
SSE4.1
SSSE3
SSSE3
SSSE3
SSSE3
SSE4.1
SSE4.1
SSE4.1
SSE4.1
String instructions
PCMPESTRI
PCMPESTRI
PCMPESTRM
PCMPESTRM
PCMPISTRI
PCMPISTRI
PCMPISTRM
PCMPISTRM
x,x,i
x,m128,i
x,x,i
x,m128,i
x,x,i
x,m128,i
x,x,i
x,m128,i
PCLMULQDQ
x,x,i
PCLMULQDQ
x,m,i
AESDEC,
AESDECLAST,
AESENC,
AESENCLAST
x,x
AESDEC,
AESDECLAST,
AESENC,
AESENCLAST
x,m
AESIMC
x,x
AESIMC
x,m
AESKEYGENASSIST
x,x,i
AESKEYGENASSIST
x,m,i
Other
EMMS
8
8
9
9
3
4
3
4
8
8
9
9
3
4
3
4
6p05 2p16
11
4
4
5
5
3
3
3
3
SSE4.2
SSE4.2
SSE4.2
SSE4.2
SSE4.2
SSE4.2
SSE4.2
SSE4.2
3p0 2p16 4p5

6p05 2p16 p23
3p0
3p0 p23
3p0
3p0 p23
10
3
4
3
4
2p0 p5
2p0 p5 p23
2
2
CLMUL
CLMUL
p5
AES
2
2
3
2
2
3
p5 p23
2p5
2p5 p23
14
1.5
2
2
AES
AES
AES
10
10
2p0 8p5
10
AES
10
10
2p0 p23 7p5
AES
31
31
3p0 2p16 2p5 p23
11
10
13
Instruction
Move instructions
MOVAPS/D
VMOVAPS/D
Operands
μops fused domain
μops unfused domain
μops each port
Latency
Reciprocal throughput
Comments
x,x
y,y
1
1
1
1
p5
p5
0-1
0-1
1
1
MOVAPS/D
MOVUPS/D
x,m128
p23
0.5
VMOVAPS/D
VMOVUPS/D
y,m256
p23
0.5
m128,x
p237 p4
m256,y
x,x
x,m32/64
m32/64,x
x,m64
1
1
1
1
1
2
1
1
2
2
p237 p4
p5
p23
p237 p4
p23 p5
4
1
3
3
4
1
1
0.5
1
1
MOVAPS/D
MOVUPS/D
VMOVAPS/D
VMOVUPS/D
MOVSS/D
MOVSS/D
MOVSS/D
MOVHPS/D
may be elim.
may be elim.
AVX
AVX
MOVHPS/D
MOVLPS/D
MOVLPS/D
MOVHLPS
MOVLHPS
MOVMSKPS/D
VMOVMSKPS/D
MOVNTPS/D
VMOVNTPS/D
SHUFPS/D
SHUFPS/D
VPERMILPS/PD
VPERMILPS/PD
VPERMILPS/PD
VPERMILPS/PD
VPERM2F128
VPERM2F128
VPERMPS
VPERMPS
VPERMPD
VPERMPD
BLENDPS/PD
BLENDPS/PD
BLENDVPS/PD
BLENDVPS/PD
VBLENDVPS/PD
VBLENDVPS/PD
MOVDDUP
MOVDDUP
VBROADCASTSS
VBROADCASTSS
VBROADCASTSS
VBROADCASTSS
VBROADCASTSD
VBROADCASTSD
VBROADCASTF128
MOVSH/LDUP
MOVSH/LDUP
UNPCKH/LPS/D
UNPCKH/LPS/D
EXTRACTPS
EXTRACTPS
VEXTRACTF128
VEXTRACTF128
INSERTPS
INSERTPS
VINSERTF128
VINSERTF128
VMASKMOVPS/D
VMASKMOVPS/D
VMASKMOVPS/D
VPGATHERDPS
VPGATHERDPS
m64,x
x,m64
m64,x
x,x
x,x
r32,x
r32,y
m128,x
m256,y
x,x,i / v,v,v,i
x,m,i / v,v,m,i
v,v,i
v,m,i
v,v,v
v,v,m
y,y,y,i
y,y,m,i
y,y,y
y,y,m
y,y,i
y,m,i
x,x,i / v,v,v,i
x,m,i / v,v,m,i
x,x,xmm0
x,m,xmm0
v,v,v,v
v,v,m,v
v,v
v,m
x,m32
y,m32
x,x
y,x
y,m64
y,x
y,m128
v,v
v,m
x,x / v,v,v
x,m / v,v,m
r32,x,i
m32,x,i
x,y,i
m128,y,i
x,x,i
x,m32,i
y,y,x,i
y,y,m128,i
v,v,m
m128,x,x
m256,y,y
x,[r+s*x],x
y,[r+s*y],y
1
1
1
1
1
1
1
1
1
1
2
1
2
1
2
1
2
1
1
1
2
1
2
2
3
2
3
1
1
1
1
1
1
1
1
1
1
1
1
1
2
3
1
2
1
2
1
2
3
4
4
20
34
2
2
2
1
1
1
1
2
2
1
2
1
2
1
2
1
2
1
2
1
2
1
2
2
3
2
3
1
1
1
1
1
1
1
1
1
1
1
1
2
2
3
1
2
1
2
1
2
3
4
4
20
34
p4 p237
p23 p5
p4 p237
p5
p5
p0
p0
p4 p237
p4 p237
p5
p5 p23
p5
p5 p23
p5
p5 p23
p5
p5 p23
p5
p5 p23
p5
p5 p23
p015
p015 p23
p5
2p5 p23
2p5
2p5 p23
p5
p23
p23
p23
p5
p5
p23
p5
p23
p5
p23
p5
p5 p23
p0 p5
p0 p5 p23
p5
p23 p4
p5
p23 p5
p5
p015 p23
2p5 p23
p0 p1 p4 p23
p0 p1 p4 p23
3
4
3
1
1
3
2
~400
~400
1
1
1
3
3
3
1
2
2
1
3
4
5
1
3
5
3
3
1
3
1
4
3
4
1
4
3
4
4
13
14
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
0.33
0.5
2
2
2
2
1
0.5
0.5
0.5
1
1
0.5
1
0.5
1
0.5
1
1
1
1
1
1
1
1
1
1
2
1
2
9
12
AVX
AVX
AVX
AVX
AVX
AVX
AVX
AVX2
AVX2
AVX2
AVX2
SSE4.1
SSE4.1
SSE4.1
SSE4.1
AVX
AVX
SSE3
SSE3
AVX
AVX
AVX2
AVX2
AVX
AVX2
AVX
SSE3
SSE3
SSE3
SSE3
SSE4.1
SSE4.1
AVX
AVX
SSE4.1
SSE4.1
AVX
AVX
AVX
AVX
AVX
AVX2
AVX2
VPGATHERQPS
VPGATHERQPS
VPGATHERDPD
VPGATHERDPD
VPGATHERQPD
VPGATHERQPD
x,[r+s*x],x
x,[r+s*y],x
x,[r+s*x],x
y,[r+s*x],y
x,[r+s*x],x
y,[r+s*y],y
15
22
12
20
14
22
15
22
12
20
14
22
Conversion
CVTPD2PS
CVTPD2PS
VCVTPD2PS
VCVTPD2PS
CVTSD2SS
CVTSD2SS
CVTPS2PD
CVTPS2PD
VCVTPS2PD
VCVTPS2PD
CVTSS2SD
CVTSS2SD
CVTDQ2PS
CVTDQ2PS
VCVTDQ2PS
VCVTDQ2PS
CVT(T) PS2DQ
CVT(T) PS2DQ
VCVT(T) PS2DQ
VCVT(T) PS2DQ
CVTDQ2PD
CVTDQ2PD
VCVTDQ2PD
VCVTDQ2PD
CVT(T)PD2DQ
CVT(T)PD2DQ
VCVT(T)PD2DQ
VCVT(T)PD2DQ
CVTPI2PS
CVTPI2PS
CVT(T)PS2PI
CVT(T)PS2PI
CVTPI2PD
CVTPI2PD
CVT(T) PD2PI
CVT(T) PD2PI
CVTSI2SS
CVTSI2SS
CVT(T)SS2SI
CVT(T)SS2SI
CVTSI2SD
CVTSI2SD
CVT(T)SD2SI
CVT(T)SD2SI
x,x
x,m128
x,y
x,m256
x,x
x,m64
x,x
x,m64
y,x
y,m128
x,x
x,m32
x,x
x,m128
y,y
y,m256
x,x
x,m128
y,y
y,m256
x,x
x,m64
y,x
y,m128
x,x
x,m128
x,y
x,m256
x,mm
x,m64
mm,x
mm,m128
x,mm
x,m64
mm,x
mm,m128
x,r32
x,m32
r32,x
r32,m32
x,r32/64
x,m32
r32/64,x
r32,m64
2
2
2
2
2
3
2
2
2
2
2
2
1
1
1
1
1
1
1
1
2
2
2
2
2
2
2
2
1
1
2
2
2
2
2
2
2
1
2
2
2
2
2
2
2
3
2
3
2
3
2
2
2
2
2
2
1
2
1
2
1
2
1
2
2
2
2
2
2
3
2
3
1
2
2
2
2
2
2
3
2
2
2
3
2
2
2
3
8
7
7
9
7
9
p1 p5
p1 p5 p23
p1 p5
p1 p5 p23
p1 p5
p1 p5 p23
p0 p5
p0 p23
p0 p5
p0 p23
p0 p5
p0 p23
p1
p1 p23
p1
p1 p23
p1
p1 p23
p1
p1 p23
p1 p5
p1 p23
p1 p5
p1 p23
p1 p5
p1 p5 p23
p1 p5
p1 p5 p23
p1
p1 p23
p1 p5
p1 p23
p1 p5
p1 p23
p1 p5
p1 p5 p23
p1 p5
p1 p23
p0 p1
p0 p1 p23
p1 p5
p1 p23
p0 p1
p0 p1 p23
4
5
4
2
5
2
3
3
3
3
4
6
4
6
4
4
4
4
4
4
4
4
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
4
3
1
1
1
1
1
1
3
3
1
1
3
3
1
1
AVX2
AVX2
AVX2
AVX2
AVX2
AVX2
AVX
AVX
AVX
AVX
AVX
AVX
AVX
AVX
AVX
AVX
AVX
AVX
Haswell
VCVTPS2PH
VCVTPS2PH
VCVTPH2PS
VCVTPH2PS
Arithmetic
ADDSS/D PS/D
SUBSS/D PS/D
ADDSS/D PS/D
SUBSS/D PS/D
ADDSUBPS/D
ADDSUBPS/D
HADDPS/D
HSUBPS/D
HADDPS/D
HSUBPS/D
MULSS/D PS/D
MULSS/D PS/D
DIVSS DIVPS
DIVSS DIVPS
DIVSD DIVPD
DIVSD DIVPD
VDIVPS
VDIVPS
VDIVPD
VDIVPD
RCPSS/PS
RCPSS/PS
VRCPPS
VRCPPS
CMPccSS/D
CMPccPS/D
CMPccSS/D
CMPccPS/D
(U)COMISS/D
(U)COMISS/D
MAXSS/D PS/D
MINSS/D PS/D
MAXSS/D PS/D
MINSS/D PS/D
ROUNDSS/D PS/D
ROUNDSS/D PS/D
DPPS
DPPS
DPPD
DPPD
VFMADD...
(all FMA instr.)
VFMADD...
(all FMA instr.)
Math
SQRTSS/PS
x,v,i
m,v,i
v,x
v,m
2
4
2
2
2
4
2
2
p1 p5
p1 p4 p5 p23
p1 p5
p1 p23
x,x / v,v,v
p1
x,m / v,v,m
x,x / v,v,v
x,m / v,v,m
1
1
1
2
1
2
p1 p23
p1
p1 p23
1
1
1
SSE3
SSE3
x,x / v,v,v
p1 2p5
SSE3
x,m / v,v,m
x,x / v,v,v
x,m / v,v,m
x,x
x,m
x,x
x,m
y,y,y
y,y,m256
y,y,y
y,y,m256
x,x
x,m128
y,y
y,m256
4
1
1
1
1
1
1
3
4
3
4
1
1
3
4
4
1
2
1
2
1
2
3
4
3
4
1
2
3
4
p1 2p5 p23
p01
p01 p23
p0
p0 p23
p0
p0 p23
2p0 p15
2p0 p15 p23
2p0 p15
2p0 p15 p23
p0
p0 p23
2p0 p15
2p0 p15 p23
2
0.5
0.5
7
7
8-14
8-14
14
14
16-28
16-28
1
1
2
2
SSE3
x,x / v,v,v
p1
x,m / v,v,m
x,x
x,m32/64
2
1
2
2
1
2
p1 p23
p1
p1 p23
x,x / v,v,v
p1
x,m / v,v,m
v,v,i
v,m,i
x,x,i / v,v,v,i
x,m,i / v,v,m,i
x,x,i
x,m128,i
1
2
3
4
6
3
4
2
2
3
4
6
3
4
p1 p23
2p1
2p1 p23
2p0 p1 p5
v,v,v
v,v,m
x,x
5
10-13
10-20
18-21
19-35
5
7
F16C
F16C
F16C
F16C
AVX
AVX
AVX
AVX
AVX
AVX
1
1
1
1
6
14
2p0 p1 p5 p23 p6
p0 p1 p5
p0 p1 p5 p23
p01
p01 p23
p0
1
1
1
1
11
1
1
2
2
2
4
1
1
SSE4.1
SSE4.1
SSE4.1
SSE4.1
SSE4.1
SSE4.1
0.5
FMA
0.5
FMA
Haswell
SQRTSS/PS
VSQRTPS
VSQRTPS
SQRTSD/PD
SQRTSD/PD
VSQRTPD
VSQRTPD
RSQRTSS/PS
RSQRTSS/PS
VRSQRTPS
VRSQRTPS
x,m128
y,y
y,m256
x,x
x,m128
y,y
y,m256
x,x
x,m128
y,y
y,m256
1
3
4
1
1
3
4
1
1
3
4
2
3
4
1
2
3
4
1
2
3
4
p0 p23
2p0 p15
2p0 p15 p23
p0
p0 p23
2p0 p15
2p0 p15 p23
p0
p0 p23
2p0 p15
2p0 p15 p23
28-29
AND/ANDN/OR/XORPS/PD
x,x / v,v,v
p5
AND/ANDN/OR/XORPS/PD
x,m / v,v,m
p5 p23
Other
VZEROUPPER
none
VZEROALL
12
12
none
10
VZEROALL
LDMXCSR
STMXCSR
VSTMXCSR
FXSAVE
FXRSTOR
XSAVE
XRSTOR
XSAVEOPT
20
3
3
3
130
116
224
173
20
3
4
none
p0 p6 p23
p0 p4 p6 p237
8
3
1
1
68
72
84
111
19
16
5
7
7
14
14
8-14
8-14
16-28
16-28
1
1
2
2
AVX
AVX
AVX
AVX
AVX
AVX
Logic
m32
m32
m32
m4096
m4096
6
7
AVX
AVX, 32 bit
AVX, 64 bit
AVX
Pentium 4
Intel Pentium 4
This list is measured for a Pentium 4, model 2. Timings for model 3 may be more like the values for P4E, listed on the next sheet.

Instruction: ... JNE, etc.
Operands: ... memory operand including indirect operands, m64 means 64-bit memory operand, etc.
ops: Number of ops issued from instruction decoder and stored in trace cache.
Microcode: Number of additional ops issued from microcode ROM.
Latency: This is the delay that the instruction generates in a dependency chain if the next dependent instruction starts in the same execution unit. The numbers are minimum values. Cache misses, misalignment, and exceptions may increase the clock counts considerably. Floating point operands are presumed to be normal numbers. Denormal numbers, NANs, infinity and exceptions increase the delays. The latency of moves to and from memory cannot be measured accurately because of the problem with memory intermediates explained above under How the values were measured.
Additional latency: This number is added to the latency if the next dependent instruction is in a different execution unit. There is no additional latency between ALU0 and ALU1.
Reciprocal throughput: This is also called issue latency. This value indicates the number of clock cycles from when the execution of an instruction begins until a subsequent independent instruction can begin to execute in the same execution subunit. A value of 0.25 indicates 4 instructions per clock cycle in one thread.
Port: The port through which each op goes to an execution unit. Two independent ops can start to execute simultaneously only if they are going through different ports.
Execution unit: Use this information to determine additional latency. When an instruction with more than one op uses more than one execution unit, only the first and the last execution unit are listed.
Execution subunit: Throughput measures apply only to instructions executing in the same subunit.
Instruction set: Indicates the compatibility of an instruction with other 80x86 family microprocessors. The instruction can execute on microprocessors that support the instruction set indicated.
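As a rough illustration of how the latency and additional latency columns combine, the following Python sketch adds up a dependency chain. The function names, the tuple layout and the sample numbers are assumptions made for this example; they are not measured values from the tables, and instructions that span two execution units are not modelled.

```python
# Minimal sketch: estimate the length of a dependency chain from the
# "Latency", "Additional latency" and "Execution unit" columns.
# All names and sample numbers are illustrative assumptions.

def same_unit(a, b):
    """ALU0 and ALU1 are treated as one unit: no additional latency between them."""
    alu = {"alu0", "alu1"}
    return a == b or (a in alu and b in alu)

def chain_latency(chain):
    """chain: list of (latency, additional_latency, execution_unit) tuples,
    where each instruction depends on the one before it."""
    total = 0
    for i, (latency, additional, unit) in enumerate(chain):
        total += latency
        # The additional latency is paid only when the next dependent
        # instruction runs in a different execution unit.
        if i + 1 < len(chain) and not same_unit(unit, chain[i + 1][2]):
            total += additional
    return total

# Hypothetical chain: two ALU instructions followed by an MMX instruction.
example = [(0.5, 1, "alu0"), (0.5, 1, "alu1"), (2, 1, "mmx")]
print(chain_latency(example))
```

Treating the chain as a plain list keeps the sketch close to how the columns are read by hand: sum the latencies, then add the additional latency at every change of execution unit.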
alu0/1
alu0/1
load
load
store
store
86
86
86
86
86
86
86
86
sse2
386
386
386
386
ppro
86
86
86
86
186
86
86
86
186
86
86
86
86
186
86
86
386
386
386
86
86
86
86
486
alu0/1
load
alu0
alu0/1
alu0/1
alu0/1
int,alu
int,alu
int,alu
int
alu0/1
int
int,alu
86
sse
sse
Notes
Instruction set
r,m
r
r,r/i
m
m
Subunit
r,[r+r/i]
r,[r+r+i]
r,[r*i]
r,[r+r*i]
r,[r+r*i+i]
0
0,5 0.5-1 0,25 0/1
0
0,5 0.5-1 0,25 0/1
0
2
0
1
2
0
3
0
1
2
0
1
2
0
0
2 0,3
2
6
4
12
0
14
0
33
0
0,5 0.5-1 0,25 0/1
0
2
0
1
2
0
0,5 0.5-1 0,5 0
0
3 0.5-1 1 2,0
0
6
0
3
0
1,5 0.5-1 1 0/1
8 >100
0
3
0
1
2
0
1
2
0
2
4
7
4
10
10
19
0
1
0
1
8
14
5
13
8
52
16
14
0
0,5 0.5-1 0,25 0/1
0
1 0.5-1 0,5 0/1
0
4 0.5-1 1
1
0
4 0.5-1 1
1
0
4 0.5-1 1
1
0
4
0
4
1
0
0,5 0.5-1 0,5 0/1
0
5
0
1
1
7
15
0
7
0
2
64
>1000
2
6
2
6
Execution unit
r
m
sr
Port
Reciprocal throughput
r
i
m
sr
Additional latency
1
1
1
2
1
3
4
4
2
1
1
1
2
3
3
4
4
2
2
3
4
4
4
2
4
4
4
4
1
2
3
2
3
1
1
3
4
3
8
4
4
Latency
r,r
r,i
r32,m
r8/16,m
m,r
m,i
r,sr
sr,r/m
m,r32
r,r
r,m
r,r
r,m
r,r/m
r,r
r,m
Microcode
Move instructions
MOV
MOV
MOV
MOV
MOV
MOV
MOV
MOV
MOVNTI
MOVZX
MOVZX
MOVSX
MOVSX
CMOVcc
XCHG
XCHG
XLAT
PUSH
PUSH
PUSH
PUSH
PUSHF(D)
PUSHA(D)
POP
POP
POP
POPF(D)
POPA(D)
LEA
LEA
LEA
LEA
LEA
LAHF
SAHF
SALC
LDS, LES, ...
BSWAP
IN, OUT
PREFETCHNTA
PREFETCHT0/1/2
Operands
ops
Instruction
b, c
a, q
c
c
a, e
SFENCE
LFENCE
MFENCE
ADD, SUB
ADD, SUB
ADD, SUB
ADC, SBB
ADC, SBB
ADC, SBB
ADC, SBB
CMP
CMP
INC, DEC
INC, DEC
NEG
NEG
AAA, AAS
DAA, DAS
AAD
AAM
MUL, IMUL
MUL, IMUL
MUL, IMUL
MUL, IMUL
IMUL
IMUL
IMUL
IMUL
IMUL
IMUL
IMUL
DIV
DIV
DIV
IDIV
IDIV
IDIV
CBW
CWD, CDQ
CWDE
Logic instructions
AND, OR, XOR
AND, OR, XOR
AND, OR, XOR
TEST
TEST
NOT
4
4
4
r,r
r,m
m,r
r,r
r,i
r,m
m,r
r,r
r,m
r
m
r
m
r8/32
r16
m8/32
m16
r32,r
r32,(r),i
r16,r
r16,r,i
r16,m16
r32,m32
r,m,i
r8/m8
r16/m16
r32/m32
r8/m8
r16/m16
r32/m32
r,r
r,m
m,r
r,r
r,m
r
1
2
3
4
3
4
4
1
2
2
4
1
3
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
2
2
1
1
2
3
1
2
1
2
2
2
40
38
100
0
0,5 0.5-1 0,25
0
1 0.5-1 1
0
8
4
4
6
0
6
0
6
0
6
6
8
0
8
7
9
8
0
0,5 0.5-1 0,25
0
1 0.5-1 1
0
0,5 0.5-1 0,5
0
4
4
0
0,5 0.5-1 0,5
0
3
27 90
57 100
10 22
22 56
6
16
0
8
7
17
0
8
7-8 16
0
8
10 16
0
8
0
14
0
4,5
0
14
0
4,5
5
16
0
9
5
15
0
8
7
15
0
10
0
14
0
8
7
14
0
10
20 61
0
24
18 53
0
23
21 50
0
23
24 61
0
24
22 53
0
23
20 50
0
23
0
1 0.5-1 1
0
1 0.5-1 0,5
0
0,5 0.5-1 0,5
0
0
0
0
0
0
0,5
1
8
0,5
1
0,5
0.5-1 0,5
0.5-1 1
4
0.5-1 0,5
0.5-1 1
0.5-1 0,5
sse
sse2
sse2
0/1
alu0/1
1
1
1
int,alu
int,alu
int,alu
0/1
alu0/1
0/1
alu0/1
alu0
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
0
0/1
0
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
alu0
alu0/1
alu0
alu0
alu0
alu0
fpmul
fpdiv
fpmul
fpmul
fpmul
fpmul
fpmul
fpmul
fpmul
fpmul
fpmul
fpmul
fpmul
fpdiv
fpdiv
fpdiv
fpdiv
fpdiv
fpdiv
86
86
86
86
86
86
86
86
86
86
86
86
86
86
86
86
86
86
86
86
86
386
386
386
186
386
386
186
86
86
386
86
86
386
86
86
386
86
86
86
86
86
86
c
c
c
c
c
a
a
a
a
a
c
c
c
c
c
NOT
SHL, SHR, SAR
SHL, SHR, SAR
ROL, ROR
ROL, ROR
RCL, RCR
RCL, RCR
RCL, RCR
SHL, SHR, SAR, ROL, ROR
RCL, RCR
RCL, RCR
SHLD, SHRD
SHLD, SHRD
BT
BT
BT
BT
BTR, BTS, BTC
BTR, BTS, BTC
BTR, BTS, BTC
BTR, BTS, BTC
BSF, BSR
BSF, BSR
SETcc
SETcc
CLC, STC
CMC
CLD
STD
CLI
STI
m
r,i
r,CL
r,i
r,CL
r,1
r,i
r,CL
4
1
2
1
2
1
4
4
0
0
0
0
0
0
15
15
1
0
1
0
1
0
0
4
1
1
1
1
1
15
14
m,i/CL
m,1
m,i/CL
r,r,i/CL
m,r,i/CL
r,i
r,r
m,i
m,r
r,i
r,r
m,i
m,r
r,r
r,m
r
m
4
4
4
4
4
3
2
4
4
3
2
4
4
2
3
3
4
3
3
4
4
4
4
7-8 10
0
7
10
0
18 18-28
14 14
0
18 14
0
0
4
0
0
4
0
0
4
0
12 12
0
0
6
0
0
6
0
7
18
0
15 14
0
0
4
0
0
4
0
0
5
0
0
5
0
0
10
0
0
10
0
7
52
0
5
48
0
5
35
12 43
1
4
3
3
4
1
4
4
3
4
4
4
4
4
4
4
0
28
0
0
31
0
4
4
0
34
4
4
38
0
0
33

JMP
short/near
JMP
far
JMP
r
JMP
m(near)
JMP
m(far)
Jcc
short/near
J(E)CXZ
short
LOOP
short
CALL
near
CALL
far
CALL
r
CALL
m(near)
CALL
m(far)
RETN
RETN
i
RETF
4
6
4
6
4
16
16
0
118
4
4
11
0
0
0
2
8
9
2
2
11
1
1
1
1
1
1
1
int
int
int
int
int
int
int
mmxsh
mmxsh
mmxsh
mmxsh
mmxsh
mmxsh
mmxsh
10
10
14
14
14
2
1
2
12
2
4
8
14
2
3
1
3
2
2
52
48
35
43
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
mmxsh
mmxsh
mmxsh
mmxsh
mmxsh
mmxsh
mmxsh
mmxsh
mmxsh
mmxsh
mmxsh
mmxsh
mmxsh
mmxsh
mmxsh
1
118
4
4
11
2-4
2-4
2-4
2
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
alu0
branch
alu0
alu0
branch
branch
alu0
alu0
alu0
alu0
branch
branch
branch
branch
alu0
alu0
branch
branch
alu0
alu0
branch
branch
86
186
86
186
86
86
186
86
86
86
86
386
386
386
386
386
386
386
386
386
386
386
386
386
386
86
86
86
86
86
86
86
86
86
86
86
86
86
86
86
86
86
86
86
86
86
86
d
d
d
d
d
d
d
d
d
d
d
d
d
RETF
IRET
ENTER
ENTER
LEAVE
BOUND
INTO
INT
4
33 11
4
48 24
4
12 26
4 45+24n
4
0
3
4
14 14
4
5
18
4
84 644
i,0
i,n
m
i
String instructions
LODS
REP LODS
STOS
REP STOS
MOVS
REP MOVS
SCAS
REP SCAS
CMPS
REP CMPS
Other
NOP (90)
Long NOP (0F 1F)
PAUSE
CPUID
RDTSC
Notes:
a)
b)
c)
d)
e)
q)
4
4
4
4
4
4
4
4
4
4
0
0
26
128+16n
3
14
18
3
6
5n 4n+36
2
6
2n+3 3n+10
4
6
163+1.1n
3
40+6n
4n
5
50+8n
4n
1
0
0
1
0
0
4
2
4 39-81
4
7
86
86
186
186
186
186
86
86
86
86
86
86
86
86
86
86
86
86
0,25 0/1
0,25 0/1
alu0/1
alu0/1
200-500
80
86
ppro
sse2
p5
p5
a) Add 1 op if source is a memory operand.
b) Uses an extra op (port 3) if SIB byte used. A SIB byte is needed if the memory operand has more than one pointer register, or a scaled index, or ESP is used as base pointer.
c) Add 1 op if source or destination, but not both, is a high 8-bit register (AH, BH, CH, DH).
d) Has (false) dependence on the flags in most cases.
e) Not available on PMMX.
q) Latency is 12 in 16-bit real or virtual mode, 24 in 32-bit protected mode.
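The op-count adjustments in the notes above can be sketched as a small helper. This is only an illustration: the function and parameter names are hypothetical, and the adjustments apply only to instructions whose table rows carry the corresponding note.

```python
# Minimal sketch of the op-count adjustments described in the notes.
# Function and parameter names are hypothetical, chosen for this example.

HIGH_8BIT = {"ah", "bh", "ch", "dh"}

def needs_sib(base=None, index=None):
    """A SIB byte is needed with more than one pointer register, with a
    scaled index, or when ESP is used as base pointer."""
    if index is not None:
        return True
    return base is not None and base.lower() == "esp"

def extra_ops(src_is_memory=False, sib=False, src_reg=None, dst_reg=None):
    """Extra ops for an instruction to which all three notes apply."""
    extra = 0
    if src_is_memory:
        extra += 1          # memory source operand
    if sib:
        extra += 1          # SIB byte costs an extra op (port 3)
    src_high = src_reg is not None and src_reg.lower() in HIGH_8BIT
    dst_high = dst_reg is not None and dst_reg.lower() in HIGH_8BIT
    if src_high != dst_high:
        extra += 1          # exactly one operand is AH, BH, CH or DH
    return extra

print(needs_sib(base="esp"))                   # True
print(extra_ops(src_reg="ah", dst_reg="al"))   # 1
print(extra_ops(src_reg="ah", dst_reg="bh"))   # 0
```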
0
2
mov
load
Notes
1
1
Instruction set
0
0
Subunit
6
7
Execution unit
0
0
Port
1
1
Additional latency
r
m32/64
Latency
Microcode
Move instructions
FLD
FLD
Operands
ops
Instruction
87
87
FLD
FBLD
FST(P)
FST(P)
FSTP
FBSTP
FXCH
FILD
FILD
FIST
FIST
FISTP
FLDZ
FLD1
FCMOVcc
FFREE
FINCSTP, FDECSTP
FNSTSW
FSTSW
FNSTSW
FNSTCW
FLDCW
FADD(P),FSUB(R)(P)
FADD,FSUB(R)
FIADD,FISUB(R)
FIADD,FISUB(R)
FMUL(P)
FMUL
FIMUL
FIMUL
FDIV(R)(P)
FDIV(R)
FIDIV(R)
FIDIV(R)
FABS
FCHS
FCOM(P), FUCOM(P)
FCOM(P)
FCOMPP, FUCOMPP
FCOMI(P)
FICOM(P)
FICOM(P)
FTST
FXAM
FRNDINT
FPREM
FPREM1
m80
m80
r
m32/64
m80
m80
r
m16
m32/64
m16
m32/64
m
st0,r
r
AX
AX
m16
m16
m16
r
m
m16
m32
r
m
m16
m32
r
m
m16
m32
r
m
r
m16
m32
3
3
1
2
3
3
1
3
2
3
2
3
1
2
4
3
1
4
6
4
4
4
4
75
0
0
8
311
0
3
0
0
0
0
0
0
0
0
0
0
0
4
4
7
1
2
3
3
1
2
3
3
1
2
3
3
1
1
1
2
2
3
4
3
1
1
3
6
6
0
0
4
0
0
0
4
0
0
0
4
0
0
0
0
0
0
0
4
0
0
0
15
84
84
6
7
0
10
10
10
10
10
2-4
0
11
11
0
0
0
(3)
5
5
6
5
7
7
7
7
43
43
43
43
2
2
2
2
2
10
1
1
0
1
1
1
1
1
0
0
0
0
1
1
0
0
0
0
2
2
2
23
212
212
0
0
0
0
6
2
90
2
1
0
2-3 0
8
0
400 0
1
0
6
2
1
2
2-4 0
2-3 0
2-4 0
2
0
2
0
4
1
4
0
1
0
3
1
3
1
6
0
6
0
(8) 0,2
1
1
6
2
2
2
6
2
43
43
43
43
1
1
1
1
1
3
6
2
1
1
15
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
0,1
1
1,2
1
1
0,1
1
1
load
load
mov
store
store
store
mov
load
load
store
store
store
mov
mov
fp
mov
mov
fp
fp
fp
fp
fp
fp
fp
fp
fp
fp
fp
fp
fp
fp
fp
fp
fp
fp
fp
fp
fp
fp
fp
fp
87
87
87
87
87
87
87
87
87
87
87
87
87
87
PPro
87
87
287
287
87
87
87
add
add
add
add
mul
mul
mul
mul
div
div
div
div
misc
misc
misc
misc
misc
misc
misc
misc
misc
misc
87
87
87
87
87
87
87
87
87
87
87
87
87
87
87
87
87
PPro
87
87
87
87
87
87
387
g, h
g, h
g, h
g, h
Math
FSQRT
FLDPI, etc.
FSIN
FCOS
FSINCOS
FPTAN
FPATAN
FSCALE
FXTRACT
F2XM1
FYL2X
FYL2XP1
Other
FNOP
(F)WAIT
FNCLEX
FNINIT
FNSAVE
FRSTOR
FXSAVE
FXRSTOR
Notes:
e)
f)
g)
h)
i)
1
2
6
6
7
6
3
3
3
3
3
11
0
43
0
150 180
175 207
178 216
160 230
92 187
24 57
15 20
45 165
60 200
134 242
1
2
4
6
4
4
4
4
0
0
4
29
174
96
69
94
0
0
1
0
43
3
170
207
211
200
153
66
20
63
90
220
456
528
132
208
1
1
1
1
1
1
1
1
1
1
1
1
fp
fp
fp
fp
fp
fp
fp
fp
fp
fp
fp
fp
1
0
1
0
96
1
172
420 0,1
532
96
208
div
87
87
387
387
387
87
87
87
87
87
87
87
mov
mov
87
87
87
87
87
87
sse
sse
g, h
i
i

The latency for FLDCW is 3 when the new value loaded is the same as the value of the control word before the preceding FLDCW, i.e. when alternating between the same two values. In all other cases, the latency and reciprocal throughput are 143.
Latency and reciprocal throughput depend on the precision setting in the F.P. control word. Single precision: 23, double precision: 38, long double precision (default): 43.
Throughput of FP-MUL unit is reduced during the use of the FP-DIV unit.
Takes 6 ops more and 40-80 clocks more when XMM registers are disabled.
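The FLDCW rule in the notes above can be made concrete with a small model. The class below is a sketch under stated assumptions: the class name and the control word values in the example are illustrative, and only the two latencies (3 and 143) are taken from the note.

```python
# Minimal sketch of the FLDCW latency rule: the load is fast only when
# the new value equals the control word that was in effect before the
# preceding FLDCW. Class name and sample values are illustrative.

class FldcwModel:
    FAST = 3      # alternating between the same two values
    SLOW = 143    # all other cases

    def __init__(self, initial_control_word):
        self.current = initial_control_word
        self.before_preceding_load = None   # no FLDCW executed yet

    def load(self, new_value):
        if new_value == self.before_preceding_load:
            cost = self.FAST
        else:
            cost = self.SLOW
        # The value in effect before this load becomes the reference
        # for the next load.
        self.before_preceding_load = self.current
        self.current = new_value
        return cost

model = FldcwModel(0x037F)
costs = [model.load(v) for v in (0x0F7F, 0x037F, 0x0F7F, 0x077F)]
print(costs)   # [143, 3, 3, 143]
```

The example shows why code that switches the rounding mode back and forth between the same two control words stays fast, while introducing a third value makes the load slow again.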
0
1
2
0
fp
mmx
load
fp
alu
mmx
mmx
mmx
sse2
Notes
1
2
1
2
Instruction set
1
0
0
1
Subunit
5
2
8
10
Execution unit
0
0
0
0
Port
2
2
1
2
Additional latency
r32, mm
mm, r32
mm,m32
r32, xmm
Latency
Microcode
Move instructions
MOVD
MOVD
MOVD
MOVD
Operands
ops
Instruction
MOVD
MOVD
MOVD
MOVQ
MOVQ
MOVQ
MOVQ
MOVDQA
MOVDQA
MOVDQA
MOVDQU
MOVDQU
MOVDQ2Q
MOVQ2DQ
MOVNTQ
MOVNTDQ
PACKSSWB/DW
PACKUSWB
PACKSSWB/DW
PACKUSWB
PUNPCKH/LBW/WD/DQ
PUNPCKHBW/WD/DQ/QDQ
PUNPCKLBW/WD/DQ/QDQ
PSHUFD
PSHUFL/HW
PSHUFW
MASKMOVQ
MASKMOVDQU
PMOVMSKB
PEXTRW
PEXTRW
PINSRW
PINSRW
PADDB/W/D
PADD(U)SB/W
PSUBB/W/D
PSUB(U)SB/W
PADDQ, PSUBQ
PADDQ, PSUBQ
PCMPEQB/W/D
PCMPGTB/W/D
PMULLW PMULHW
PMULHUW
PMADDWD
PMULUDQ
PAVGB/W
xmm, r32
xmm,m32
m32, r
mm,mm
xmm,xmm
r,m64
m64,r
xmm,xmm
xmm,m
m,xmm
xmm,m
m,xmm
mm,xmm
xmm,mm
m,mm
m,xmm
2
1
2
1
1
1
2
1
1
2
4
4
3
2
3
2
0
0
0
0
0
0
0
0
0
0
0
6
0
0
0
0
6
8
8
6
2
8
8
6
8
8
2
1
2
1
2
1
2
1
1
2
2
2
2
2
75
18
1
2
0,1
0
1
2
0
0
2
0
2
0
0,1
0,1
0
0
mm,r/m
mmx
shift
mmx
xmm,r/m
mmx
shift
mmx
mm,r/m
mmx
shift
mmx
xmm,r/m
mmx
shift
sse2
xmm,r/m
xmm,xmm,i
xmm,xmm,i
mm,mm,i
mm,mm
xmm,xmm
r32,r
r32,mm,i
r32,xmm,i
mm,r32,i
xmm,r32,i
1
1
1
1
4
4
2
3
3
2
2
0
0
0
0
4
6
0
0
0
0
0
2
4
2
2
1
1
1
1
shift
shift
shift
shift
sse2
sse2
sse2
mmx
sse
sse2
1
1
1
1
1
1
1
1
1
0
0
0,1
1
1
1
1
mmx
mmx
mmx
mmx
mov
mov
7
8
9
3
4
2
2
2
1
7
10
3
2
2
2
2
mmx-alu0
mmx-int
mmx-int
int-mmx
int-mmx
sse
sse
sse2
sse
sse2
r,r/m
1,2
mmx
alu
mmx
a,j
r,r/m
mm,r/m
xmm,r/m
1
1
1
0
0
0
2
2
4
1
1
1
1,2
1
2
1
1
1
mmx
mmx
fp
alu
alu
add
mmx
sse2
sse2
a,j
a
a
r,r/m
r,r/m
r,r/m
r,r/m
r,r/m
r,r/m
1
1
1
1
1
1
0
0
0
0
0
0
2
6
6
6
6
2
1
1
1
1
1
1
1,2
1,2
1,2
1,2
1,2
1,2
1
1
1
1
1
1
mmx
fp
fp
fp
fp
mmx
alu
mul
mul
mul
mul
alu
mmx
mmx
sse
mmx
sse2
sse
a,j
a,j
a,j
a,j
a,j
a,j
8
8
1
0
0
1
1
1
mmx
load
mov
mmx
load
mov
mov
load
mov
load
mov
mov-mmx
mov-mmx
shift
shift
sse2
sse2
mmx
mmx
sse2
mmx
mmx
sse2
sse2
sse2
sse2
sse2
k
k
sse2
sse2
mov
mov
sse
sse2
PMIN/MAXUB
PMIN/MAXSW
PSADBW
Logic
PAND, PANDN
POR, PXOR
PSLL/RLW/D/Q,
PSRAW/D
PSLLDQ, PSRLDQ
Other
EMMS
Notes:
a)
j)
k)
r,r/m
r,r/m
r,r/m
1
1
1
0
0
0
2
2
4
1
1
1
1,2
1,2
1,2
1
1
1
mmx
mmx
mmx
alu
alu
alu
sse
sse
sse
a,j
a,j
a,j
r,r/m
r,r/m
1
1
0
0
2
2
1
1
1,2
1,2
1
1
mmx
mmx
alu
alu
mmx
mmx
a,j
a,j
r,i/r/m
xmm,i
1
1
0
0
2
4
1
1
1,2
2
1
1
mmx
mmx
shift
shift
mmx
sse2
a,j
a
11
12
12
mmx

Reciprocal throughput is 1 for 64 bit operands, and 2 for 128 bit operands.
It may be advantageous to replace this instruction by two 64-bit moves.
2
2
2
1
1
1
0
2
0
0
2
0
1
1
2
0
1
1
sse
0
0
0
0
0
0
2
4
3
2
2
2
0
0
1
1
1
1
sse
sse/2
sse
sse
sse
sse
2
2
7
0
1
0
4
2
0
0
mov
mmx
mmx
shift
shift
mmx
mmx
shift
shift
sse
sse
sse
sse
sse
sse
sse
sse
sse
sse
sse
sse
MOVHPS/D, MOVLPS/D
MOVNTPS/D
MOVMSKPS/D
SHUFPS/D
UNPCKHPS/D
UNPCKLPS/D
6
4
4
2
1
1
1
1
fp
mmx
mmx
mmx
shift
shift
shift
Notes
m,r
m,r
r32,r
r,r/m,i
r,r/m
r,r/m
1
1
2
1
2
8
2
2
1
2
2
2
mov
Instruction set
0
0
Subunit
r,m
6
7
7
6
Execution unit
0
0
0
0
0
6
0
0
0
0
0
0
Port
1
1
2
1
4
4
1
1
1
2
1
1
Additional latency
r,r
r,m
m,r
r,r
r,m
m,r
r,r
r,r
r,m
m,r
r,r
r,r
Operands
Latency
Microcode
Move instructions
MOVAPS/D
MOVAPS/D
MOVAPS/D
MOVUPS/D
MOVUPS/D
MOVUPS/D
MOVSS
MOVSD
MOVSS, MOVSD
MOVSS, MOVSD
MOVHLPS
MOVLHPS
MOVHPS/D, MOVLPS/D
ops
Instruction
k
k
Conversion
CVTPS2PD
CVTPD2PS
CVTSD2SS
CVTSS2SD
CVTDQ2PS
CVTDQ2PD
CVT(T)PS2DQ
CVT(T)PD2DQ
CVTPI2PS
CVTPI2PD
CVT(T)PS2PI
CVT(T)PD2PI
CVTSI2SS
CVTSI2SD
CVT(T)SD2SI
CVT(T)SS2SI
r,r/m
r,r/m
r,r/m
r,r/m
r,r/m
r,r/m
r,r/m
r,r/m
xmm,mm
xmm,mm
mm,xmm
mm,xmm
xmm,r32
xmm,r32
r32,xmm
r32,xmm
4
2
4
4
1
3
1
2
4
4
3
3
3
4
2
2
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
7
10
14
10
4
9
4
9
10
11
7
11
10
15
8
8
1
1
1
1
1
1
1
1
1
1
0
1
1
1
1
1
4
2
6
6
2
4
2
2
4
5
2
3
3
6
2,5
2,5
1
1
1
1
1
1
1
1
1
1
0,1
0,1
1
1
1
1
mmx
fp-mmx
mmx
mmx
fp
mmx-fp
fp
fp-mmx
mmx
fp-mmx
fp-mmx
fp-mmx
fp-mmx
fp-mmx
fp
fp
shift
sse2
shift
shift
r,r/m
r,r/m
r,r/m
r,r/m
r,r/m
r,r/m
r,r/m
r,r/m
1
1
1
1
1
1
1
2
0
0
0
0
0
0
0
0
4
4
6
23
39
38
69
4
1
1
1
0
0
0
0
1
2
2
2
23
39
38
69
4
1
1
1
1
1
1
1
1
fp
fp
fp
fp
fp
fp
fp
mmx
r,r/m
r,r/m
r,r/m
1
2
0
0
4
6
1
1
2
3
Logic
ANDPS/D ANDNPS/D
ORPS/D XORPS/D
r,r/m
Math
SQRTSS
SQRTPS
SQRTSD
SQRTPD
RSQRTSS
RSQRTPS
r,r/m
r,r/m
r,r/m
r,r/m
r,r/m
r,r/m
1
1
1
1
2
2
0
0
0
0
0
0
23
39
38
69
4
4
0
0
0
0
1
1
m
m
4
4
8
4
98
Arithmetic
ADDPS/D ADDSS/D
SUBPS/D SUBSS/D
MULPS/D MULSS/D
DIVSS
DIVPS
DIVSD
DIVPD
RCPPS RCPSS
MAXPS/D
MAXSS/D MINPS/D
MINSS/D
CMPccPS/D
CMPccSS/D
COMISS/D UCOMISS/D
Other
LDMXCSR
STMXCSR
Notes:
sse2
a
sse2
sse2
sse2
a
sse2
a
sse
a
a
a
a
a
sse2
sse
add
add
mul
div
div
div
div
sse
sse
sse
sse
sse
sse2
sse2
sse
a
a
a
a,h
a,h
a,h
a,h
a
fp
add
sse
1
1
fp
fp
add
add
sse
sse
a
a
mmx
alu
sse
23
39
38
69
3
4
1
1
1
1
1
1
fp
fp
fp
fp
mmx
mmx
div
div
div
div
sse
sse
sse2
sse2
sse
sse
a,h
a,h
a,h
a,h
a
a
100
6
1
1
sse2
sse2
sse2
sse
sse2
sse
sse2
sse
sse
a
a
a
a
a
a
a
a)
h)
k)

It may be advantageous to replace this instruction by two 64-bit moves.
Prescott
Intel Pentium 4 w. EM64T (Prescott)

Instruction: ... JNE, etc.
Operands: ... memory operand including indirect operands, m64 means 64-bit memory operand, etc., mabs = memory operand with 64-bit absolute address.
ops: Number of ops issued from instruction decoder and stored in trace cache.
Microcode: Number of additional ops issued from microcode ROM.
Latency: This is the delay that the instruction generates in a dependency chain if the next dependent instruction starts in the same execution unit. The numbers are minimum values. Cache misses, misalignment, and exceptions may increase the clock counts considerably. Floating point operands are presumed to be normal numbers. Denormal numbers, NANs, infinity and exceptions increase the delays. The latency of moves to and from memory cannot be measured accurately because of the problem with memory intermediates explained above under How the values were measured.
Additional latency: This number is added to the latency if the next dependent instruction is in a different execution unit. There is no additional latency between ALU0 and ALU1.
Reciprocal throughput: This is also called issue latency. This value indicates the number of clock cycles from when the execution of an instruction begins until a subsequent independent instruction can begin to execute in the same execution subunit. A value of 0.25 indicates 4 instructions per clock cycle in one thread.
Port: The port through which each op goes to an execution unit. Two independent ops can start to execute simultaneously only if they are going through different ports.
Execution unit: Use this information to determine additional latency. When an instruction with more than one op uses more than one execution unit, only the first and the last execution unit are listed.
Execution subunit: Throughput measures apply only to instructions executing in the same subunit.
Instruction set: Indicates the compatibility of an instruction with other 80x86 family microprocessors. The instruction can execute on microprocessors that support the instruction set indicated.
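The difference between latency and reciprocal throughput matters most when comparing dependent and independent instruction sequences. The following Python sketch shows the two estimates side by side. The function names are assumptions made for this example, and the estimates ignore every other bottleneck (decoding, trace cache, ports shared with other instructions).

```python
# Minimal sketch: latency-bound versus throughput-bound cycle estimates.
# Function names are illustrative; other bottlenecks are ignored.

def instructions_per_clock(reciprocal_throughput):
    """A reciprocal throughput of 0.25 means 4 instructions per clock cycle."""
    return 1.0 / reciprocal_throughput

def dependent_cycles(count, latency):
    """Each instruction must wait for the result of the previous one."""
    return count * latency

def independent_cycles(count, reciprocal_throughput):
    """Independent instructions executing in the same subunit."""
    return count * reciprocal_throughput

print(instructions_per_clock(0.25))    # 4.0
# Hypothetical instruction with latency 1 and reciprocal throughput 0.25:
print(dependent_cycles(100, 1))        # 100
print(independent_cycles(100, 0.25))   # 25.0
```

For such an instruction, breaking one long dependency chain into several independent chains can bring the estimate down from the latency bound towards the throughput bound.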
86
86
x64
Notes
alu0/1
alu0/1
alu0/1
Instruction set
0,25 0/1
0,25 0/1
0,5 0/1
Subunit
0
0
0
Execution unit
1
1
Port
0
0
0
Additional latency
1
1
1
Latency
r,r
r8/16/32,i
r64,i32
Microcode
Move instructions
MOV
MOV
MOV
Operands
ops
Instruction
MOV
MOV
MOV
MOV
MOV
MOV
MOV
MOV
MOV
MOV
MOVNTI
MOVZX
MOVZX
MOVZX
MOVSX
MOVSX
MOVSX
MOVSXD
CMOVcc
XCHG
XCHG
XLAT
PUSH
PUSH
PUSH
PUSH
PUSHF(D/Q)
PUSHA(D)
POP
POP
POP
POPF(D/Q)
POPA(D)
LEA
LEA
LEA
LEA
LEA
LEA
LAHF
SAHF
SALC
LDS, LES, ...
LODS
REP LODS
STOS
REP STOS
MOVS
REP MOVSB
REP MOVSW
r64,i64
r8/16,m
r32/64,m
m,r
m,i
m64,i32
r,sr
sr,r/m
r,mabs
mabs,r
m,r32
r,r
r16,r8
r,m
r16,r8
r32/64,r8/16
r,m
r64,r32
r,r/m
r,r
r,m
r
i
m
sr
r
m
sr
r,[m]
r,[r+r/i]
r,[r+r+i]
r,[r*i]
r,[r+r*i]
r,[r+r*i+i]
r,m
2
0
0
2
0
3
0
1
0
2
0
1
0
2
0
2
0
1
2
1
8
3
0
3
0
2
0
1
0
1
0
2
0
2
0
1
0
2
0
2
0
2
0
1
0
1
0
2
0
3
0
1
0
1
0
3
0
9,5
0
3
0
2
0
2
6 100
4
0
6
2
0
2
2
0
2
3
0
2
1
3
1
3
1
9
2
0
1
0
2
6
1
8
1
8
2
16
1
0
1
0
2,5
0
2
0
3,5
0
3
0
3,5
0
2
0
3,5
0
3
0
3,5
0
1
0
4
0
1
0
5
0
2
0
0
2
10
1
3
8
1
5n 4n+50
1
2
8
1 2.5n 3n
1
4
8
9 .3n .3n
1 .5-1.1n .6-1.4n
1
1
1
2
2
2
8
27
1
2
2
0,25
1
1
1
0,5
1
0,5
3
1
2
2
2
9
9
16
1
10
30
70
15
0,25
0,25
0,5
1
1
1
1
28
8
1
2
2
0
0,3
0,3
alu1
load
load
store
store
store
0/1
0/1
2
0
0
2
0
alu0/1
alu0/1
load
alu0
alu0
load
alu0
0/1
alu0/1
0/1
0/1
0/1
1
0,1
1
1
0/1
1
alu0/1
alu0/1
alu0/1
alu
alu0,1
alu
int
alu0/1
int
x64
86
86
86
86
x64
86
86
x64
x64
sse2
386
386
386
386
386
386
x64
PPro
86
86
86
86
186
86
86
86
186
86
86
86
86
186
86
86
86
386
386
386
86
86
86
86
86
86
8
86
86
86
86
86
b,c
a,q
l
l
c
c
a,c,o
a,c,o
a
a,e
m
p
n
d,n
m
m
REP MOVSD
REP MOVSQ
BSWAP
IN, OUT
PREFETCHNTA
PREFETCHT0/1/2
SFENCE
LFENCE
MFENCE
ADD, SUB
ADD, SUB
ADD, SUB
ADC, SBB
ADC, SBB
ADC, SBB
ADC, SBB
CMP
CMP
INC, DEC
INC, DEC
NEG
NEG
AAA, AAS
DAA, DAS
AAD
AAM
MUL, IMUL
MUL, IMUL
MUL, IMUL
MUL, IMUL
MUL, IMUL
MUL, IMUL
MUL, IMUL
MUL, IMUL
IMUL
IMUL
IMUL
IMUL
IMUL
IMUL
IMUL
IMUL
IMUL
IMUL
DIV
DIV
DIV
DIV
r
r,r/i
m
m
r,r
r,m
m,r
r,r/i
r,m
m,r
m,i
r,r
r,m
r
m
r
m
r8
r16
r32
r64
m8
m16
m32
m64
r16,r16
r16,r16,i
r32,r32
r32,(r32),i
r64,r64
r64,(r64),i
r16,m16
r32,m32
r64,m64
r,m,i
r8/m8
r16/m16
r32/m32
r64/m64
1
1
1
1
1
1
1
1
1
1
2
3
3
2
2
3
1
2
2
4
1
3
1
1
2
2
1
4
3
1
2
2
3
2
1
2
1
1
1
1
2
2
2
3
1
1
1
1
1.1n 1.4n
86
x64
alu
1.1n 1.4n
0
52
0
0
2
2
4
0
0
0
0
5
6
5
0
0
0
0
0
0
10
16
5
17
0
0
0
5
0
5
0
6
0
0
0
0
0
0
0
0
0
0
20
19
21
31
1
1
5
10
10
20
22
1
1
1
5
1
5
26
29
13
71
10
11
11
11
10
11
11
11
10
11
10
10
10
10
10
10
10
10
74
73
76
63
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
1
>1000
1
1
50
50
124
86
sse
sse
sse
sse2
sse2
0,25 0/1
1
2
10
1
10
1
10
10
0,25 0/1
1
0,5 0/1
3
0,5 0
3
2,5
2,5
2,5
2,5
2,5
2,5
2,5
2,5
2,5
1-2.5
34
34
34
52
486
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
alu0/1
int,alu
int,alu
alu0/1
alu0/1
alu0
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
mul
fpdiv
mul
mul
mul
mul
mul
mul
mul
mul
mul
mul
mul
mul
mul
mul
mul
mul
mul
mul
fpdiv
fpdiv
fpdiv
fpdiv
86
86
86
86
86
86
86
86
86
86
86
86
86
86
86
86
86
86
86
86
x64
86
86
86
x64
386
186
386
386
x64
x64
386
386
x64
186
86
86
386
x64
c
c
c
c
c
m
m
m
m
a
a
a
a
IDIV
IDIV
IDIV
IDIV
CBW
CWD
CDQ
CQO
CWDE
CDQE
SCAS
REP SCAS
CMPS
REP CMPS
Logic
AND, OR, XOR
AND, OR, XOR
AND, OR, XOR
TEST
TEST
NOT
NOT
SHL
SHR, SAR
SHR, SAR
SHL
SHR, SAR
SHR, SAR
ROL, ROR
ROL, ROR
ROL, ROR
ROL, ROR
RCL, RCR
RCL
RCR
RCL
RCR
SHL, SHR, SAR
ROL, ROR
SHL, SHR, SAR
ROL, ROR
RCL, RCR
RCL, RCR
RCL, RCR
SHLD, SHRD
SHLD
SHRD
SHLD, SHRD
SHLD
r8/m8
r16/m16
r32/m32
r64/m64
r,r
r,m
m,r
r,r
r,m
r
m
r,i
r8/16/32,i
r64,i
r,CL
r8/16/32,CL
r64,CL
r8/16/32,i
r64,i
r8/16/32,CL
r64,CL
r,1
r,i
r,i
r,CL
r,CL
m8/16/32,i
m8/16/32,i
m8/16/32,cl
m8/16/32,cl
m8/16/32,1
m8/16/32,i
m8/16/32,cl
r8/16/32,r,i
r64,r64,i
r64,r64,i
r8/16/32,r,cl
r64,r64,cl
1
21
76
0
1
19
79
0
1
19
79
0
1
58
96
0
2
0
2
0
2
0
2
0
1
0
1
0
1
0
7
0
2
0
2
0
1
0
1
0
1
3
0
1 54+6n
4n
1
5
1 81+8n
5n
34
34
34
91
1
1
1
1
1
1
8
1
2
3
1
2
1
3
1
1
1
2
2
2
1
1
2
2
1
2
2
1
1
3
3
2
2
2
3
2
3
4
3
4
4
0,5
1
2
0,5
1
0,5
2
0,5
0,5
2
2
2
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
11
11
11
11
6
6
6
6
5
13
13
0
5
7
0
5
1
1
5
1
1
1
5
1
1
7
2
2
8
1
7
2
8
7
31
25
31
25
10
10
10
10
27
38
37
8
10
10
9
14
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
1
1
1
1
0
0/1
0/1
0/1
0/1
0/1
int
int
int
int
alu0
alu0/1
alu0/1
alu0/1
alu0/1
alu0/1
fpdiv
fpdiv
fpdiv
fpdiv
86
86
386
x64
86
86
386
x64
386
x64
86
a
a
a
a
86
10
86
86
1
7
2
8
7
31
25
31
25
27
38
37
7
alu0
alu0
alu0
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
alu1
alu1
alu1
alu1
alu1
alu1
alu1
alu1
alu1
alu1
alu1
alu1
alu1
alu1
alu1
alu1
alu1
alu1
alu1
alu1
alu1
alu1
alu1
alu1
alu1
alu1
alu1
86
86
86
86
86
86
86
186
186
x64
86
86
x64
186
x64
86
x64
86
186
186
86
86
86
86
86
86
86
86
86
386
x64
x64
386
x64
c
c
c
c
c
d
d
d
d
d
d
d
d
d
d
d
d
d
d
SHRD
SHLD, SHRD
SHLD, SHRD
BT
BT
BT
BT
BTR, BTS, BTC
BTR, BTS, BTC
BTR, BTS, BTC
BTR, BTS, BTC
BSF, BSR
SETcc
SETcc
CLC, STC
CMC
CLD, STD
r64,r64,cl
m,r,i
m,r,CL
r,i
r,r
m,i
m,r
r,i
r,r
m,i
m,r
r,r/m
r
m
3
3
2
1
2
3
2
1
2
3
2
2
2
3
2
3
1
8
8
8
0
0
0
7
0
0
6
10
0
0
0
0
0
8
12
20
20
8
9
8
10
8
9
28
14
16
9
9

JMP
short/near
JMP
far
JMP
r
JMP
m(near)
JMP
m(far)
Jcc
short/near
J(E)CXZ
short
LOOP
short
CALL
near
CALL
far
CALL
r
CALL
m(near)
CALL
m(far)
RETN
RETN
i
RETF
RETF
i
IRET
BOUND
m
INT
i
INTO
1
2
3
3
2
1
4
4
3
3
4
4
2
4
4
1
2
1
2
2
1
0
25
0
0
28
0
0
0
0
29
0
0
32
0
0
30
30
49
11
67
4
Other
NOP (90)
Long NOP (0F 1F)
PAUSE
LEAVE
CLI
STI
CPUID
RDTSC
1
0
1
0
1
2
4
0
1
5
1
11
1 49-90
1
12
15
0
0
5
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
10
10
8
9
8
10
8
9
10
14
4
1
2
8
1
1
1
1
1
1
1
1
1
1
1
1
1
1
x64
386
386
386
386
386
386
386
386
386
386
386
386
386
86
86
86
53
1
154
15
10
157
2-4
4
4
7
160
7
9
160
7
7
160
160
325
12
470
26
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0,25 0/1
0,25 0/1
50
5
52
64
300-500
100
alu1
alu1
alu1
alu1
alu1
alu1
alu1
alu1
alu1
alu1
alu1
alu1
int
int
alu0
branch
alu0
alu0
branch
branch
alu0
alu0
alu0
alu0
branch
branch
branch
branch
alu0
alu0
branch
branch
alu0
alu0
branch
branch
alu0/1
alu0/1
86
86
86
86
86
86
86
86
86
86
86
86
86
86
86
86
86
86
186
86
86
86
ppro
sse2
186
86
86
p5
p5
d
d
d
d
m
m
RDPMC (bit 31 = 1)
RDPMC (bit 31 = 0)
MONITOR
MWAIT
Notes:
a)
b)
c)
d)
e)
l)
m)
n)
o)
p)
q)
1
4
37
154
100
240
p5
p5
(sse3)
(sse3)

Uses an extra op (port 3) if SIB byte used.
Add 1 op if source or destination, but not both, is a high 8-bit register (AH, BH, CH, DH).
Has (false) dependence on the flags in most cases.
Move accumulator to/from memory with 64 bit absolute address (opcode A0 A3).
Not available in 64 bit mode on some processors.
MOVSX uses an extra op if the destination register is smaller than the biggest register size available. Use a 32 bit destination register in 16 bit and 32 bit mode, and a 64 bit destination register in 64 bit mode for optimal performance.
LEA with a direct memory operand has 1 op and a reciprocal throughput of 0.25. This also applies if there is a RIP-relative address in 64-bit mode. A sign-extended 32-bit direct memory operand in 64-bit mode without RIP-relative address takes 2 ops because of the SIB byte. The throughput is 1 in this case. You may use a MOV instead.
These values are measured in 32-bit mode. In 16-bit real mode there is 1 microcode op and a reciprocal throughput of 17.
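The MOVSX and LEA notes above amount to two small decision rules, sketched here in Python. The function names and argument conventions are hypothetical; the op counts and reciprocal throughputs are the ones stated in the notes.

```python
# Minimal sketch of the MOVSX and LEA rules stated in the notes.
# Function names and argument conventions are hypothetical.

def movsx_needs_extra_op(dest_bits, mode_bits):
    """MOVSX uses an extra op if the destination register is smaller than
    the biggest register size available in the current mode."""
    biggest = 64 if mode_bits == 64 else 32   # 32 bits in 16 and 32 bit mode
    return dest_bits < biggest

def lea_direct_operand(mode_bits, rip_relative=False):
    """Return (ops, reciprocal throughput) for LEA with a direct memory operand."""
    if mode_bits == 64 and not rip_relative:
        # Sign-extended 32-bit address in 64-bit mode needs a SIB byte.
        return (2, 1.0)
    return (1, 0.25)

print(movsx_needs_extra_op(16, 32))                 # True
print(movsx_needs_extra_op(64, 64))                 # False
print(lea_direct_operand(64, rip_relative=True))    # (1, 0.25)
print(lea_direct_operand(64))                       # (2, 1.0)
```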
mov
load
load
load
mov
store
store
store
mov
load
load
store
store
mov
87
87
87
87
87
87
87
87
87
87
87
87
sse3
87
Notes
0
2
2
2
0
0
0
0
0
2
2
0
0
0
Instruction set
7
7
1
1
8
90
1
2
10
400
1
8
2
2,5
2,5
2
Subunit
0
0
Execution unit
Port
0
0
3
74
0
0
6
311
0
2
0
0
0
0
Additional latency
1
1
3
3
1
2
3
3
1
3
2
3
3
1
Latency
r
m32/64
m80
m80
r
m32/64
m80
m80
r
m16
m32/64
m
m
Microcode
Move instructions
FLD
FLD
FLD
FBLD
FST(P)
FST(P)
FSTP
FBSTP
FXCH
FILD
FILD
FIST(P)
FISTTP
FLDZ
Operands
ops
Instruction
FLD1
FCMOVcc
FFREE
FINCSTP, FDECSTP
FNSTSW
FSTSW
FNSTSW
FNSTCW
FLDCW
FADD(P),FSUB(R)(P)
FADD,FSUB(R)
FIADD,FISUB(R)
FIADD,FISUB(R)
FMUL(P)
FMUL
FIMUL
FIMUL
FDIV(R)(P)
FDIV(R)
FIDIV(R)
FIDIV(R)
FABS
FCHS
FCOM(P), FUCOM(P)
FCOM(P)
FCOMPP, FUCOMPP
FCOMI(P)
FICOM(P)
FICOM(P)
FTST
FXAM
FRNDINT
FPREM
FPREM1
Math
FSQRT
FLDPI, etc.
FSIN, FCOS
FSINCOS
FPTAN
FPATAN
FSCALE
FXTRACT
F2XM1
FYL2X
FYL2XP1
st0,r
r
AX
AX
m16
m16
m16
r
m
m16
m32
r
m
m16
m32
r
m
m16
m32
r
m
r
m16
m32
2
4
3
1
4
6
2
4
3
0
0
0
0
0
0
3
0
6
1
2
3
3
1
2
3
3
1
2
3
3
1
1
1
2
2
3
3
3
1
1
3
8
9
0
0
3
0
0
0
3
0
0
0
3
3
0
0
0
0
0
0
3
0
0
0
14
86
92
1
2
3
5
8
4
3
4
3
3
3
0
0
100
150
170
97
25
16
190
63
58
0
0
0
6
6
7
6
8
8
8
8
45
45
45
45
3
3
3
3
3
1
1
1
1
1
1
1
1
1
1
1
1
1
1
0
0
0
28
220
220
1
1
1
45
200
200
270
250
96
27
270
170
170
2
4
3
1
3
3
8
3
10
0
1
0
0
1
1
0
0
0,2
mov
fp
mov
mov
1
1
6
2
2
2
8
3
45
45
45
45
1
1
1
1
1
3
8
2
1
1
16
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
0,1
1
1,2
1
1
0,1
1
1
fp
fp
fp
fp
fp
fp
fp
fp
fp
fp
fp
fp
fp
fp
fp
fp
fp
fp
fp
fp
fp
fp
1
1
1
1
1
1
1
1
1
1
1
fp
fp
fp
fp
fp
fp
fp
fp
fp
fp
fp
45
2
200
200
270
250
87
PPro
87
87
287
287
87
87
87
add
add
add
add
mul
mul
mul
mul
div
div
div
div
misc
misc
misc
misc
misc
misc
misc
misc
misc
misc
87
87
87
87
87
87
87
87
87
87
87
87
87
87
87
87
87
PPro
87
87
87
87
87
87
387
div
87
87
387
387
87
87
87
87
87
87
87
fp
fp
g,h
g,h
g,h
g,h
g,h
Other
FNOP
(F)WAIT
FNCLEX
FNINIT
FNSAVE
FRSTOR
FXSAVE
FXRSTOR
Notes:
e)
f)
g)
h)
i)
1
2
1
1
2
2
2
2
0
0
4
30
181
96
121
118
1
0
0
0
1
1
120
200
500
570
0
0
1
mov
mov
0,1
160
244
87
87
87
87
87
87
sse
sse
i
i

The latency for FLDCW is 3 when the new value loaded is the same as the value of the control word before the preceding FLDCW, i.e. when alternating between the same two values. In all other cases, the latency and reciprocal throughput are > 100.
Latency and reciprocal throughput depend on the precision setting in the F.P. control word. Single precision: 32, double precision: 40, long double precision (default): 45.
Takes fewer microcode ops when XMM registers are disabled, but the throughput is the same.
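The dependence on the precision setting can be looked up directly from the control word. The sketch below assumes the standard x87 encoding of the precision-control field (bits 8 and 9 of the control word: 00 = single, 10 = double, 11 = long double), which comes from the x87 architecture rather than from this table; the function names are illustrative, and the cycle counts are the ones given in the note above.

```python
# Minimal sketch: map the x87 precision-control field to the clock counts
# given in the note. The bit layout is the standard x87 encoding (an
# assumption external to this table); names are illustrative.

PRECISION_BY_PC_FIELD = {0b00: "single", 0b10: "double", 0b11: "long double"}
PRESCOTT_CYCLES = {"single": 32, "double": 40, "long double": 45}

def precision_setting(control_word):
    pc = (control_word >> 8) & 0b11       # bits 8-9: precision control
    if pc not in PRECISION_BY_PC_FIELD:
        raise ValueError("reserved precision-control encoding")
    return PRECISION_BY_PC_FIELD[pc]

def precision_dependent_cycles(control_word):
    """Latency and reciprocal throughput for instructions carrying this note."""
    return PRESCOTT_CYCLES[precision_setting(control_word)]

print(precision_dependent_cycles(0x037F))   # 45 (default: long double)
print(precision_dependent_cycles(0x027F))   # 40
print(precision_dependent_cycles(0x007F))   # 32
```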
7
2
0
1
10
alu
shift
shift
sse2
mmx
mmx
mmx
sse2
sse2
sse2
mmx
mmx
sse2
mmx
mmx
sse2
sse2
sse2
sse2
sse2
sse3
Notes
1
1
0
fp
1
mmx
2
load
0
fp
1
mmx
2
load
0,1
0
mov
1
mmx
2
load
0
mov
0
mov
2
load
0
mov
2
load
0
mov
2
load
0,1 mov-mmx
Instruction set
7
4
1
1
1
1
2
1
2
1
2
1
2
1
1
2
23
8
2,5
2
Subunit
1
1
Execution unit
6
3
Port
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
2
0
0
Additional latency
2
1
1
1
2
1
2
1
1
1
2
1
1
2
4
4
4
3
Latency
r32, mm
mm, r32
mm,m32
r32, xmm
xmm, r32
xmm,m32
m32, r
mm,mm
xmm,xmm
r,m64
m64,r
xmm,xmm
xmm,m
m,xmm
xmm,m
m,xmm
xmm,m
mm,xmm
Microcode
Move instructions
MOVD
MOVD
MOVD
MOVD
MOVD
MOVD
MOVD
MOVQ
MOVQ
MOVQ
MOVQ
MOVDQA
MOVDQA
MOVDQA
MOVDQU
MOVDQU
LDDQU
MOVDQ2Q
Operands
ops
Instruction
k
k
MOVQ2DQ
MOVNTQ
MOVNTDQ
MOVDDUP
MOVSHDUP
MOVSLDUP
PACKSSWB/DW
PACKUSWB
PACKSSWB/DW
PACKUSWB
PUNPCKH/LBW/WD/DQ
PUNPCKHBW/WD/DQ/QDQ
PUNPCKLBW/WD/DQ/QDQ
PSHUFD
PSHUFL/HW
PSHUFW
MASKMOVQ
MASKMOVDQU
PMOVMSKB
PEXTRW
PEXTRW
PINSRW
PADDB/W/D
PADD(U)SB/W
PSUBB/W/D
PSUB(U)SB/W
PADDQ, PSUBQ
PADDQ, PSUBQ
PCMPEQB/W/D
PCMPGTB/W/D
PMULLW PMULHW
PMULHUW
PMADDWD
PMULUDQ
PAVGB/W
PMIN/MAXUB
PMIN/MAXSW
PSADBW
Logic
PAND, PANDN
POR, PXOR
PSLL/RLW/D/Q,
PSRAW/D
PSLLDQ, PSRLDQ
xmm,mm
m,mm
m,xmm
xmm,xmm
2
3
2
1
0
0
0
0
10
2
4
4
2
0,1 mov-mmx
0
mov
0
mov
1
mmx
xmm,xmm
mm,r/m
xmm,r/m
mm,r/m
xmm,r/m
xmm,r/m
xmm,xmm,i
xmm,xmm,i
mm,mm,i
mm,mm
xmm,xmm
r32,r
r32,mm,i
r32,xmm,i
r,r32,i
1
1
1
1
1
1
2
2
2
2
r,r/m
sse2
shift
sse
sse2
sse3
mmx
shift
sse3
mmx
shift
mmx
mmx
shift
mmx
mmx
shift
mmx
mmx
shift
sse2
0
0
0
0
4
6
0
0
0
0
2
4
2
2
1
1
1
1
2
2
2
1
10
12
3
2
3
2
shift
shift
shift
shift
sse2
sse2
sse
sse
sse
sse2
1,2
mmx
alu
mmx
a,j
r,r/m
mm,r/m
xmm,r/m
1
1
1
0
0
0
2
2
5
1
1
1
1,2
1
2
1
1
1
mmx
mmx
fp
alu
alu
add
mmx
sse2
sse2
a,j
a
a
r,r/m
r,r/m
r,r/m
r,r/m
r,r/m
r,r/m
r,r/m
r,r/m
r,r/m
1
1
1
1
1
1
1
1
1
0
0
0
0
0
0
0
0
0
2
7
7
7
7
2
2
2
4
1
1
1
1
1
1
1
1
1
1,2
1,2
1,2
1,2
1,2
1,2
1,2
1,2
1,2
1
1
1
1
1
1
1
1
1
mmx
fp
fp
fp
fp
mmx
mmx
mmx
mmx
alu
mul
mul
mul
mul
alu
alu
alu
alu
mmx
mmx
sse
mmx
sse2
sse
sse
sse
sse
a,j
a,j
a,j
a,j
a,j
a,j
a,j
a,j
a,j
r,r/m
r,r/m
1
1
0
0
2
2
1
1
1,2
1,2
1
1
mmx
mmx
alu
alu
mmx
mmx
a,j
a,j
r,i/r/m
xmm,i
1
1
0
0
2
4
1
1
1,2
2
1
1
mmx
mmx
shift
shift
mmx
sse2
a,j
7
7
7
4
1
mmx
1
mmx
1
mmx
1
mmx
0
mov
0
mov
0,1 mmx-alu0
1 mmx-int
1 mmx-int
1 int-mmx
sse
sse
sse2
sse
Other
EMMS
Notes:
a)
j)
k)
10
10
12
mmx

Reciprocal throughput is 1 for 64 bit operands, and 2 for 128 bit operands.
It may be advantageous to replace this instruction by two 64-bit moves or LDDQU.
Conversion
CVTPS2PD
CVTPD2PS
CVTSD2SS
CVTSS2SD
CVTDQ2PS
CVTDQ2PD
CVT(T)PS2DQ
CVT(T)PD2DQ
CVTPI2PS
r,r/m
r,r/m
r,r/m
r,r/m
r,r/m
r,r/m
r,r/m
r,r/m
xmm,mm
1
2
3
2
1
3
1
2
4
0
0
0
0
0
0
0
0
0
MOVHPS/D, MOVLPS/D
1
1
0
4
2
1
1
4
2
1
1
5
4
4
2
1
1
1
1
4
10
14
8
5
10
5
11
12
1
1
1
1
1
1
1
1
1
4
2
6
6
2
4
2
2
6
0
2
0
0
2
0
1
1
2
0
1
1
2
0
1
1
0
1
1
1
1
1
1
1
1
1
1
1
1
1
mov
mov
mmx
mmx
shift
shift
mmx
mmx
shift
shift
fp
mmx
mmx
mmx
shift
shift
shift
mmx
fp-mmx
mmx
mmx
fp
mmx-fp
fp
fp-mmx
mmx
shift
sse2
shift
shift
sse2
sse2
sse
sse
sse
sse
sse
sse
sse
sse
sse
sse
sse
sse
sse
sse
sse3
sse3
sse
sse
sse
sse
sse
sse2
a
sse2
sse2
sse2
a
sse2
a
sse
Notes
MOVHPS/D, MOVLPS/D
2
4
1
1
2
1
2
8
2
2
1
2
2
2
2
2
2
2
4
3
2
2
2
Instruction set
0
0
Subunit
0
0
0
0
0
2
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
Execution unit
1
1
2
1
4
4
1
1
1
2
1
1
2
2
1
1
2
2
1
2
1
MOVSH/LDUP
MOVDDUP
MOVNTPS/D
MOVMSKPS/D
SHUFPS/D
UNPCKHPS/D
UNPCKLPS/D
r,r
r,m
m,r
r,r
r,m
m,r
r,r
r,r
r,m
m,r
r,r
r,r
r,m
m,r
r,r
r,r
m,r
r32,r
r,r/m,i
r,r/m
r,r/m
Port
Additional latency
Latency
Microcode
Move instructions
MOVAPS/D
MOVAPS/D
MOVAPS/D
MOVUPS/D
MOVUPS/D
MOVUPS/D
MOVSS
MOVSD
MOVSS, MOVSD
MOVSS, MOVSD
MOVHLPS
MOVLHPS
Operands
ops
Instruction
k
k
a
a
a
a
a
a
CVTPI2PD
CVT(T)PS2PI
CVT(T)PD2PI
CVTSI2SS
CVTSI2SD
CVT(T)SD2SI
CVT(T)SS2SI
xmm,mm
mm,xmm
mm,xmm
xmm,r32
xmm,r32
r32,xmm
r32,xmm
4
3
4
3
4
2
2
0
0
0
0
0
0
0
12
8
12
20
20
12
17
1
0
1
1
1
1
1
5
2
3
4
5
4
4
1
0,1
0,1
1
1
1
1
fp-mmx
fp-mmx
fp-mmx
fp-mmx
fp-mmx
fp
fp
sse2
sse
sse2
sse
sse2
r,r/m
r,r/m
r,r/m
r,r/m
r,r/m
r,r/m
r,r/m
r,r/m
r,r/m
r,r/m
1
1
1
3
1
1
1
1
1
2
0
0
0
0
0
0
0
0
0
0
5
5
5
13
7
32
41
40
71
6
1
1
1
1
1
1
1
1
1
1
2
2
2
5-6
2
23
41
40
71
4
1
1
1
1
1
1
1
1
1
1
fp
fp
fp
fp
fp
fp
fp
fp
fp
mmx
r,r/m
r,r/m
r,r/m
1
2
0
0
5
6
1
1
2
3
Logic
ANDPS/D ANDNPS/D
ORPS/D XORPS/D
r,r/m
Math
SQRTSS
SQRTPS
SQRTSD
SQRTPD
RSQRTSS
RSQRTPS
r,r/m
r,r/m
r,r/m
r,r/m
r,r/m
r,r/m
1
1
1
1
2
2
0
0
0
0
0
0
32
41
40
71
5
6
1
1
1
1
1
1
m
m
2
3
11
0
Arithmetic
ADDPS/D ADDSS/D
SUBPS/D SUBSS/D
ADDSUBPS/D
HADDPS/D HSUBPS/D
MULPS/D MULSS/D
DIVSS
DIVPS
DIVSD
DIVPD
RCPPS RCPSS
MAXPS/D
MAXSS/D MINPS/D
MINSS/D
CMPccPS/D
CMPccSS/D
COMISS/D UCOMISS/D
Other
LDMXCSR
STMXCSR
Notes:
a)
h)
k)
a
a
a
a
a
sse2
sse
a
a
add
add
add
add
mul
div
div
div
div
sse
sse
sse3
sse3
sse
sse
sse
sse2
sse2
sse
a
a
a
a
a
a,h
a,h
a,h
a,h
a
fp
add
sse
1
1
fp
fp
add
add
sse
sse
a
a
mmx
alu
sse
32
41
40
71
3
4
1
1
1
1
1
1
fp
fp
fp
fp
mmx
mmx
div
div
div
div
sse
sse
sse2
sse2
sse
sse
a,h
a,h
a,h
a,h
a
a
13
3
1
1
sse
sse

It may be advantageous to replace this instruction by two 64-bit moves or LDDQU.
Atom
Intel Atom
Instruction: [...] JNE, etc.
Operands: i = immediate data, r = register, mm = 64 bit mmx register, xmm = 128 bit xmm register, (x)mm = mmx or xmm register, sr = segment register, m = memory, m32 = 32-bit memory operand, etc.
ops: The number of ops from the decoder or ROM.
Unit: Tells which execution unit is used. Instructions that use the same unit cannot execute simultaneously.
    ALU0 and ALU1 mean integer unit 0 or 1, respectively.
    ALU0/1 means that either unit can be used. ALU0+1 means that both units are used.
    Mem means memory in/out unit.
    FP0 means floating point unit 0 (includes multiply, divide and other SIMD instructions).
    FP1 means floating point unit 1 (adder).
    MUL means multiplier, shared between FP and integer units.
    DIV means divider, shared between FP and integer units.
    np means not pairable: cannot execute simultaneously with any other instruction.
Latency: [...] may increase the clock counts considerably. Floating point operands are presumed to be normal numbers. Denormal numbers, NANs and infinity increase the delays very much, except in XMM move, shuffle and Boolean instructions. Floating point overflow, underflow, denormal or NAN results give a similar delay.
Reciprocal throughput: The average number of clock cycles per instruction for a series of independent instructions of the same kind in the same thread.
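To make the difference between the latency and reciprocal throughput columns concrete, the following Python sketch estimates the clock count of a run of identical instructions in the two limiting cases. The function names and the figures used (latency 5, reciprocal throughput 1) are illustrative assumptions, not values taken from a specific table row, and the model ignores cache misses, decoding limits and unit conflicts.

```python
def dependent_chain_cycles(count, latency):
    """Each instruction needs the previous result: the latency column applies."""
    return count * latency


def independent_cycles(count, reciprocal_throughput):
    """No dependencies between instructions: the reciprocal throughput applies."""
    return count * reciprocal_throughput


# Hypothetical instruction with latency 5 and reciprocal throughput 1.
chain = dependent_chain_cycles(100, latency=5)
parallel = independent_cycles(100, reciprocal_throughput=1)
```

Under these assumptions the dependent chain takes five times as long as the same number of independent instructions, which is why breaking long dependency chains often pays off.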
Move instructions
MOV
MOV
MOV
MOV
MOV
MOV
MOV
MOV
MOV
MOVNTI
MOVSX MOVZX MOVSXD
CMOVcc
CMOVcc
XCHG
Operands
ops
Unit
r,r
r,i
r,m
m,r
m,i
r,sr
m,sr
sr,r
sr,m
m,r
r,r/m
r,r
r,m
r,r
1
1
1
1
1
1
2
7
8
1
1
1
1
3
ALU0/1
ALU0/1
ALU0, Mem
ALU0, Mem
ALU0, Mem
Latency Reciprocal
throughput
1
1
1-3
1
1
ALU0, Mem
ALU0
ALU0+1
1
2
6
1/2
1/2
1
1
1
1
5
21
26
2,5
1
2
3
6
Remarks
All addr. modes

All addr. modes
XCHG
XLAT
PUSH
PUSH
PUSH
PUSH
PUSHF(D/Q)
PUSHA(D)
POP
POP
POP
POP
POPF(D/Q)
POPA(D)
LAHF
SAHF
SALC
LEA
BSWAP
LDS LES LFS LGS LSS
PREFETCHNTA
PREFETCHT0/1/2
LFENCE
MFENCE
SFENCE
ADD SUB
ADD SUB
ADD SUB
ADC SBB
ADC SBB
ADC SBB
CMP
CMP
INC DEC NEG NOT
INC DEC NEG NOT
AAA
AAS
DAA
DAS
AAD
AAM
MUL IMUL
MUL IMUL
MUL IMUL
MUL IMUL
IMUL
IMUL
r,m
r
i
m
sr
r
(E/R)SP
m
sr
4
3
1
1
2
3
14
9
1
1
3
7
19
16
1
1
2
r,m
r
m
m
m
1
1
10
1
1
1
1
1
r,r/i
r,m
m,r/i
r,r/i
r,m
m,r/i
r,r/i
m,r/i
r
m
1
1
1
1
1
1
1
1
1
1
13
13
20
21
4
10
3
4
3
8
2
1
r8
r16
r32
r64
r16,r16
r32,r32
np
np
6
6
1
np
np
1
1
ALU0+1
ALU0/1
2
1
7
AGU1
ALU0
1-4
1
30
1
1
30
1
1
1/2
1
1
1/2
1
1
2
2
2
1/2
1
1/2
Mem
Mem
ALU0/1
ALU0/1, Mem
ALU0/1
ALU0/1
ALU0, Mul
ALU0, Mul
ALU0, Mul
ALU0, Mul
ALU0, Mul
ALU0, Mul
6
6
1
1
5
6
12
11
1
1
6
31
28
12
2
1/2
5
2
2
2
2
1
1
1
16
12
20
25
7
24
7
6
6
14
6
5
Implicit lock
Not in x64 mode
Not in x64 mode
Not in x64 mode

4 clock latency
on input register
Not in x64 mode

Not in x64 mode
Not in x64 mode
Not in x64 mode
Not in x64 mode
Not in x64 mode
7
6
6
14
5
2
IMUL
IMUL
IMUL
IMUL
MUL IMUL
MUL IMUL
MUL IMUL
MUL IMUL
DIV
DIV
DIV
DIV
IDIV
IDIV
IDIV
IDIV
CBW
CWDE
CDQE
CWD
CDQ
CQO
Logic instructions
AND OR XOR
AND OR XOR
AND OR XOR
TEST
TEST
SHR SHL SAR
SHR SHL SAR
ROR ROL
ROR ROL
RCR
RCL
RCR
RCL
SHLD
SHLD
SHLD
SHLD
SHLD
SHLD
SHRD
SHRD
SHRD
SHRD
SHRD
SHRD
BT
r64,r64
r16,r16,i
r32,r32,i
r64,r64,i
m8
m16
m32
m64
r/m8
r/m16
r/m32
r/m64
r/m8
r/m16
r/m32
r/m64
r,r/i
r,m
m,r/i
r,r/i
m,r/i
r,i/cl
m,i/cl
r,i/cl
m,i/cl
r,1
r,1
r/m,i/cl
r/m,i/cl
r16,r16,i
r32,r32,i
r64,r64,i
r16,r16,cl
r32,r32,cl
r64,r64,cl
r16,r16,i
r32,r32,i
r64,r64,i
r16,r16,cl
r32,r32,cl
r64,r64,cl
r,r/i
6
2
1
7
3
5
4
8
9
12
12
38
26
29
29
60
2
1
1
2
1
1
ALU0, Mul
ALU0, Mul
ALU0, Mul
ALU0, Mul
ALU0, Mul
ALU0, Mul
ALU0, Mul
ALU0, Mul
ALU0, Div
ALU0, Div
ALU0, Div
ALU0, Div
ALU0, Div
ALU0, Div
ALU0, Div
ALU0, Div
ALU0
ALU0
ALU0
ALU0
ALU0
ALU0
13
5
5
14
6
7
7
14
22
33
49
183
38
45
61
207
5
1
1
5
1
1
1
ALU0/1
1
1
ALU0/1, Mem
1
ALU0/1, Mem 1
1
ALU0/1
1
1
ALU0/1, Mem
1
ALU0
1
1
ALU0
1
1
ALU0
1
1
ALU0
1
5
ALU0
7
2
ALU0
1
12-17
ALU0
12-15
14-20
ALU0
14-18
10
ALU0
10
2
ALU0
5
10
ALU0
11
9
ALU0
9
2
ALU0
5
9
ALU0
10
8
ALU0
8
2
ALU0
5
10
ALU0
9
7
ALU0
8
2
ALU0
5
9
ALU0
9
1
ALU1
1
11
5
2
14
22
33
49
183
38
45
61
207
1/2
1
1
1/2
1
1
1
1
1
1-2 more if mem

1-2 more if mem
1-2 more if mem
1-2 more if mem
1-2 more if mem
1-2 more if mem
1-2 more if mem
1-2 more if mem
1-2 more if mem
1-2 more if mem
1-2 more if mem
1-2 more if mem
1
BT
BT
BTR BTS BTC
BTR BTS BTC
BTR BTS BTC
BSF BSR
SETcc
SETcc
CLC STC
CMC
CLD
STD
m,r
m,i
r,r/i
m,r
m,i
r,r/m
r
m

JMP
short/near
JMP
far
JMP
r
JMP
m(near)
JMP
m(far)
Conditional jump
short/near
J(E/R)CXZ
short
LOOP
short
LOOP(N)E
short
CALL
near
CALL
far
CALL
r
CALL
m(near)
CALL
m(far)
RETN
RETN
i
RETF
RETF
i
BOUND
r,m
INTO
String instructions
LODS
REP LODS
STOS
REP STOS
MOVS
REP MOVS
SCAS
REP SCAS
CMPS
REP CMPS
Other
NOP (90)
Long NOP (0F 1F)
9
2
1
10
3
10
1
2
1
1
5
6
ALU1
ALU1
ALU1
ALU0+1
ALU0/1
1
29
1
2
30
1
3
8
8
1
37
1
2
38
1
1
36
36
11
4
ALU1
2
5
1/2
2
7
25
2
66
4
7
78
2
7
8
8
3
65
18
20
64
6
6
80
80
10
6
ALU1
np
np
3
5n+11
2
3n+10
4
4n+11
3
5n+16
5
6n+16
1
1
10
5
1
11
6
16
2
6
3n+50
5
2n+4
6
2n - 4n
6
3n+60
7
4n+40
ALU0/1
ALU0/1
Not in x64 mode
Not in x64 mode
Not in x64 mode

Not in x64 mode
fastest for high n
1/2
1/2
PAUSE
ENTER
ENTER
LEAVE
CPUID
RDTSC
RDPMC
a,0
a,b
5
14
20+6b
4
40-80
16
24
24
23
6
100-170
29
48

Move instructions
FLD
FLD
FLD
FBLD
FST(P)
FST(P)
FSTP
FBSTP
FXCH
FILD
FIST(P)
FISTTP
FLDZ
FLD1
FLDPI FLDL2E etc.
FCMOVcc
FNSTSW
FNSTSW
FLDCW
FNSTCW
FINCSTP FDECSTP
FFREE(P)
FNSAVE
FRSTOR
FADD(P) FSUB(R)(P)
FMUL(P)
FDIV(R)(P)
FABS
FCHS
FCOM(P) FUCOM
FCOMPP FUCOMPP
FCOMI(P) FUCOMI(P)
FIADD FISUB(R)
FIMUL
FIDIV(R)
Operands
ops
r
m32/m64
m80
m80
r
m32/m64
m80
m80
r
m
m
m
1
1
4
52
1
3
8
189
1
1
3
3
1
2
2
3
4
4
2
3
1
1
166
83
1
3
9
92
1
7
12
221
1
7
11
11
1
1
1
1
1
1
1
5
3
3
3
5
5
71
1
1
1
1
r
AX
m16
m16
m16
m
m
r/m
r/m
r/m
r/m
r
m
m
m
Unit
Latency Reciprocal
throughput
1
321
177
Mul
Div
Mul
Div
1
1
10
92
1
9
13
221
1
6
9
9
1
8
10
9
10
10
8
9
1
1
321
177
1
2
71
1
1
1
1
10
9
9
73
Remarks
SSE3
FICOM(P)
FTST
FXAM
FPREM
FPREM1
FRNDINT
3
1
1
26
37
19
1
1
~110
~130
48
Math
FSCALE
FXTRACT
FSQRT
FSIN FCOS
FSINCOS
F2XM1
FYL2X FYL2XP1
FPTAN
FPATAN
30
15
1
9
112
25
63
100
91
56
24
71
~260
~260
~100
~220
~300
~300
Other
FNOP
WAIT
FNCLEX
FNINIT
1
2
4
23
Div
9
1
1
1
5
26
74

Move instructions
MOVD
MOVD
MOVD
MOVD
MOVQ
MOVQ
MOVQ
MOVDQA
MOVDQA
MOVDQA
MOVDQU
MOVDQU
LDDQU
MOVDQ2Q
MOVQ2DQ
MOVNTQ
MOVNTDQ
PACKSSWB/DW
PACKUSWB
Operands
ops
r32/64,(x)mm
m32/64,(x)mm
(x)mm,r32/64
(x)mm,m32/64
(x)mm, (x)mm
(x)mm,m64
m64, (x)mm
xmm, xmm
xmm, m128
m128, xmm
m128, xmm
xmm, m128
xmm, m128
mm, xmm
xmm,mm
m64,mm
m128,xmm
1
1
1
1
1
1
1
1
1
1
3
4
4
1
1
1
1
(x)mm, (x)mm
Unit
Latency Reciprocal
throughput
Mem
Mem
4
5
3
4
1
4
5
1
4
5
6
6
6
1
1
~400
~450
2
1
1
1
1/2
1
1
1/2
1
1
6
6
6
1
1
1
3
FP0
Mem
Mem
FP0/1
Mem
Mem
FP0/1
Mem
Mem
Mem
Mem
Mem
Remarks
PUNPCKH/LBW/WD/DQ
PUNPCKH/LQDQ
PSHUFB
PSHUFB
PSHUFW
PSHUFL/HW
PSHUFD
PALIGNR
MASKMOVQ
MASKMOVDQU
PMOVMSKB
PINSRW
PEXTRW
(x)mm, (x)mm
(x)mm, (x)mm
mm,mm
xmm,xmm
mm,mm,i
xmm,xmm,i
xmm,xmm,i
xmm, xmm,i
mm,mm
xmm,xmm
r32,(x)mm
(x)mm,r32,i
r32,(x)mm,i
1
1
1
4
1
1
1
1
1
2
1
1
2
PADD/SUB(U)(S)B/W/D
PADDQ PSUBQ
PHADD(S)W PHSUB(S)W
PHADDD PHSUBD
PCMPEQ/GTB/W/D
PMULL/HW PMULHUW
PMULL/HW PMULHUW
PMULHRSW
PMULHRSW
PMULUDQ
PMULUDQ
PMADDWD
PMADDWD
PMADDUBSW
PMADDUBSW
PSADBW
PSADBW
PAVGB/W
PMIN/MAXUB
PMIN/MAXSW
PABSB PABSW PABSD
PSIGNB PSIGNW PSIGND
(x)mm, (x)mm
(x)mm, (x)mm
(x)mm, (x)mm
(x)mm, (x)mm
(x)mm,(x)mm
mm,mm
xmm,xmm
mm,mm
xmm,xmm
mm,mm
xmm,xmm
mm,mm
xmm,xmm
mm,mm
xmm,xmm
mm,mm
xmm,xmm
(x)mm,(x)mm
(x)mm,(x)mm
(x)mm,(x)mm
(x)mm,(x)mm
1
2
7
3
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
(x)mm,(x)mm
(x)mm,(x)mm
(x)mm,(x)mm
(x)mm,i
xmm,i
Logic instructions
PAND(N) POR PXOR
PSLL/RL/RAW/D/Q
PSLL/RL/RAW/D/Q
PSLL/RLDQ
Other
EMMS
FP0
FP0
FP0
FP0
FP0
FP0
FP0
Mem
Mem
1
1
1
6
1
1
1
1
4
3
5
FP0/1
1
1
1
6
1
1
1
1
2
7
2
1
5
1/2
5
8
FP0/1
FP0, Mul
FP0, Mul
FP0, Mul
FP0, Mul
FP0, Mul
FP0, Mul
FP0, Mul
FP0, Mul
FP0, Mul
FP0, Mul
FP0, Mul
FP0, Mul
FP0/1
FP0/1
FP0/1
FP0/1
1
5
8
6
1
4
5
4
5
4
5
4
5
4
5
4
5
1
1
1
1
FP0/1
1/2
1
2
1
1
FP0/1
FP0
FP0
FP0
1
5
1
1
1/2
5
1
1
1/2
1
2
1
2
1
2
1
2
1
2
1
2
1/2
1/2
1/2
1/2

Operands
ops
Unit
xmm,xmm
xmm,m128
m128,xmm
xmm,m128
m128,xmm
xmm,xmm
xmm,m32/64
m32/64,xmm
xmm,m64
m64,xmm
m64,xmm
xmm,xmm
r32,xmm
m128,xmm
xmm,xmm,i
xmm,xmm,i
xmm,xmm
xmm,xmm
xmm,xmm
xmm,xmm
1
1
1
4
3
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
FP0/1
Mem
Mem
Mem
Mem
FP0/1
Mem
Mem
Mem
Mem
Mem
FP0
Conversion
CVTPD2PS
CVTSD2SS
CVTPS2PD
CVTSS2SD
CVTDQ2PS
CVT(T)PS2DQ
CVTDQ2PD
CVT(T)PD2DQ
CVTPI2PS
CVT(T)PS2PI
CVTPI2PD
CVT(T)PD2PI
CVTSI2SS
CVT(T)SS2SI
CVTSI2SD
CVT(T)SD2SI
xmm,xmm
xmm,xmm
xmm,xmm
xmm,xmm
xmm,xmm
xmm,xmm
xmm,xmm
xmm,xmm
xmm,mm
mm,xmm
xmm,mm
mm,xmm
xmm,r32
r32,xmm
xmm,r32
r32,xmm
4
3
4
3
3
3
3
3
1
1
3
4
3
3
3
3
Arithmetic
ADDSS SUBSS
ADDSD SUBSD
ADDPS SUBPS
ADDPD SUBPD
ADDSUBPS
ADDSUBPD
HADDPS HSUBPS
xmm,xmm
xmm,xmm
xmm,xmm
xmm,xmm
xmm,xmm
xmm,xmm
xmm,xmm
1
1
1
3
1
3
5
Move instructions
MOVAPS/D
MOVAPS/D
MOVAPS/D
MOVUPS/D
MOVUPS/D
MOVSS/D
MOVSS/D
MOVSS/D
MOVHPS/D MOVLPS/D
MOVHPS/D
MOVLPS/D
MOVLHPS MOVHLPS
MOVMSKPS/D
MOVNTPS/D
SHUFPS
SHUFPD
MOVDDUP
MOVSH/LDUP
UNPCKH/LPS
UNPCKH/LPD
Mem
FP0
FP0
FP0
FP0
FP0
FP0
FP1
FP1
FP1
FP1
FP1
FP1
FP0+1
Latency Reciprocal
throughput
1
4
5
6
6
1
4
5
5
4
4
1
4
~500
1
1
1
1
1
1
1/2
1
1
6
6
1/2
1
1
1
1
1
1
2
3
1
1
1
1
1
11
10
7
6
6
6
7
6
6
4
7
7
7
10
8
10
11
10
6
6
6
6
6
6
5
1
6
7
6
8
6
8
5
5
5
6
5
6
8
1
1
1
6
1
6
7
Remarks
HADDPD HSUBPD
MULSS
MULSD
MULPS
MULPD
DIVSS
DIVSD
DIVPS
DIVPD
RCPSS
RCPPS
CMPccSS/D
CMPccPS/D
COMISS/D UCOMISS/D
MAXSS/D MINSS/D
MAXPS/D MINPS/D
xmm,xmm
xmm,xmm
xmm,xmm
xmm,xmm
xmm,xmm
xmm,xmm
xmm,xmm
xmm,xmm
xmm,xmm
xmm,xmm
xmm,xmm
xmm,xmm
xmm,xmm
xmm,xmm
xmm,xmm
xmm,xmm
5
1
1
1
6
3
3
6
6
1
5
1
3
4
1
3
FP0+1
FP0, Mul
FP0, Mul
FP0, Mul
FP0, Mul
FP0, Div
FP0, Div
FP0, Div
FP0, Div
FP0
FP0
FP0
FP0
FP0
8
4
5
5
9
31
60
64
122
4
9
5
6
9
5
6
7
1
2
2
9
31
60
64
122
1
8
1
6
9
1
6
Math
SQRTSS
SQRTPS
SQRTSD
SQRTPD
RSQRTSS
RSQRTPS
xmm,xmm
xmm,xmm
xmm,xmm
xmm,xmm
xmm,xmm
xmm,xmm
3
5
3
5
1
5
FP0, Div
FP0, Div
FP0, Div
FP0, Div
FP0
FP0
31
63
60
121
4
9
31
63
60
121
1
8
Logic
ANDPS/D
ANDNPS/D
ORPS/D
XORPS/D
xmm,xmm
xmm,xmm
xmm,xmm
xmm,xmm
1
1
1
1
FP0/1
FP0/1
FP0/1
FP0/1
1
1
1
1
1/2
1/2
1/2
1/2
Other
LDMXCSR
STMXCSR
FXSAVE
FXRSTOR
m32
m32
m4096
m4096
4
4
121
116
5
14
142
149
6
15
144
150
VIA Nano 2000
VIA Nano 2000 series

Operands:
ops: The number of micro-operations from the decoder or ROM. Note that the VIA Nano 2000 processor has no reliable performance monitor counter for ops. Therefore the number of ops cannot be determined except in simple cases.
Port: Tells which execution port or unit is used. Instructions that use the same port cannot execute simultaneously.
    I1: Integer add, Boolean, shift, etc.
    I2: Integer add, Boolean, move, jump.
    I12: Can use either I1 or I2, whichever is vacant first.
    MA: Multiply, divide and square root on all operand types.
    MB: Various integer and floating point SIMD operations.
    MBfadd: Floating point addition subunit under MB.
    SA: Memory store address.
    ST: Memory store.
    LD: Memory load.
Latency: [...] Floating point overflow, underflow, denormal or NAN results give a similar delay.
Note: There is an additional latency for moving data from one unit or subunit to another. A table of these latencies is given in manual 3: "The microarchitecture of Intel, AMD and VIA CPUs". These additional latencies are not included in the listings below where the source and destination operands are of the same type.
Reciprocal throughput: [...] instructions of the same kind in the same thread.
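The port descriptions above imply that an instruction listed under I12 can reach a reciprocal throughput of 1/2, while one tied to a single port cannot go below 1. The Python sketch below illustrates this with a deliberately simple greedy model. The function name `cycles_for` and the tuple encoding of ports are hypothetical, all instructions are assumed to be independent single-op instructions, and latencies and decoder limits are ignored.

```python
def cycles_for(instructions, ports=("I1", "I2")):
    """Greedy estimate of clocks needed to issue independent one-op instructions.

    instructions: list of tuples naming the ports each instruction may use.
    Each port accepts one instruction per clock; an instruction goes to the
    allowed port that becomes vacant first (ties go to the first one listed).
    """
    next_free = {port: 0 for port in ports}
    for allowed in instructions:
        port = min(allowed, key=lambda p: next_free[p])
        next_free[port] += 1
    return max(next_free.values())


I1_ONLY = ("I1",)       # instruction that can only use port I1
I12 = ("I1", "I2")      # instruction that can use either integer port

only_i1 = cycles_for([I1_ONLY] * 4)                 # all four queue on I1
either = cycles_for([I12] * 4)                      # spread over I1 and I2
mixed = cycles_for([I1_ONLY, I12, I1_ONLY, I12])    # I12 work fills port I2
```

In this model four I1-only instructions need four clocks, whereas four I12 instructions, or an even mix of both kinds, finish in two.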
Operands ops
Port
Latency
Reciprocal
throughput
Remarks
Move instructions
MOV
MOV
r,r
r,i
1
1
I2
I2
1
1
1
1
MOV
MOV
MOV
MOV
MOV
MOV
MOV
r,m
m,r
m,i
r,sr
m,sr
sr,r
sr,m
1
1
1
LD
SA, ST
SA, ST
2
2
1
1,5
1,5
1
2
20
20
20
20
Latency 4 on
pointer register

MOVNTI
MOVSX MOVSXD
MOVZX
MOVSX MOVSXD
MOVZX
CMOVcc
CMOVcc
XCHG
XCHG
XLAT
PUSH
PUSH
PUSH
PUSH
PUSHF(D/Q)
PUSHA(D)
POP
POP
POP
POP
POPF(D/Q)
POPA(D)
LAHF
SAHF
SALC
LEA
BSWAP
LDS LES LFS LGS LSS
PREFETCHNTA
PREFETCHT0/1/2
LFENCE
MFENCE
SFENCE
ADD SUB
ADD SUB
ADD SUB
ADC SBB
ADC SBB
ADC SBB
CMP
CMP
INC DEC NEG NOT
INC DEC NEG NOT
AAA
AAS
DAA
m,r
r,r
r,m
r,m
r,r
r,m
r,r
r,m
m
r
i
m
sr
1
2
1
2
3
SA, ST
1,5
I2
LD, I2
LD
I1, I2
LD, I1
I2
1
3
2
2
5
3
20
6
1
1
1
1
2
3
20
SA, ST
SA, ST
LD, SA, ST
8
r
(E/R)SP
m
sr
LD
9
1
1
r,m
SA
1
1
9
1
I2
30
30
1-2
1-2
14
14
14
I12
LD I12
LD I12 SA ST
5
1
1/2
1
2
1
1
2
1/2
1
1/2
m
m
m
r,r/i
r,m
m,r/i
r,r/i
r,m
m,r/i
r,r/i
m,r/i
r
m
I1
I1
1-2
1-2
2
17
8
15
1,25
4
5
20
9
12
1
1
6
1
LD
LD
1
2
3
1
2
3
1
2
1
3
I1
LD I1
LD I1 SA ST
I12
LD I12
I12
LD I12 SA ST
5
1
1
5
37
37
22
Implicit lock
Not in x64 mode
Not in x64 mode
Not in x64 mode

3 clock latency on
input register
Not in x64 mode

Not in x64 mode
Not in x64 mode

DAS
AAD
AAM
MUL IMUL
MUL IMUL
MUL IMUL
MUL IMUL
IMUL
IMUL
IMUL
IMUL
IMUL
IMUL
DIV
DIV
DIV
DIV
IDIV
IDIV
IDIV
IDIV
CBW CWDE CDQE
CWD CDQ CQO
Logic instructions
AND OR XOR
AND OR XOR
AND OR XOR
TEST
TEST
SHR SHL SAR
ROR ROL
RCR RCL
RCR RCL
SHLD SHRD
SHLD SHRD
SHLD
SHRD
SHLD SHRD
SHLD SHRD
SHLD
SHRD
BT
BT
BT
BTR BTS BTC
BTR BTS BTC
BTR BTS BTC
BSF BSR
24
23
30
r8
r16
r32
r64
r16,r16
r32,r32
r64,r64
r16,r16,i
r32,r32,i
r64,r64,i
r8
r16
r32
r64
r8
r16
r32
r64
1
1
r,r/i
r,m
m,r/i
r,r/i
m,r/i
r,i/cl
r,i/cl
r,1
r,i/cl
r16,r16,i
r32,r32,i
r64,r64,i
r64,r64,i
r16,r16,cl
r32,r32,cl
r64,r64,cl
r64,r64,cl
r,r/i
m,r
m,i
r,r/i
m,r
m,i
r,r
1
2
3
1
2
1
1
1
1
2
2
MA
MA
MA
MA
MA
MA
MA
MA
MA
MA
MA
MA
MA
MA
MA
MA
MA
MA
I1
I1
7-9
7-9
7-9
8-10
4-6
4-6
5-7
4-6
4-6
5-7
26
27-35
25-41
148-183
26
27-35
23-39
187-222
1
1
I12
LD I12
LD I12 SA ST
5
1
I12
LD I12
I1
I1
I1
I1
I1
I1
I1
I1
I1
I1
I1
I1
I1
I1
I1
I1
I1
I1
I1
1
1
1
28+3n
11
7
33
43
11
7
33
43
1
2
10
8
3
1
1
2
1
1
2
26
27-35
25-41
148-183
26
27-35
23-39
187-222
1
1
1/2
1
2
1/2
1
1
1
1
28+3n
11
7
33
43
11
7
33
43
1
8
1
2
10
8
2
Not in x64 mode

Not in x64 mode
Not in x64 mode
Extra latency to
other ports
do.
do.
do.
do.
do.
do.
do.
do.
do.
do.
do.
do.
do.
do.
do.
do.
do.

SETcc
SETcc
CLC STC CMC
CLD STD
r
m
I1
1
1
3
3
I1
3
3
I2
3
58
I2
3
3
55
1-3-8
3
3
1-3-8
1 if not jumping.
3 if jumping.
8 if >2 jumps in 16
bytes block
do.
do.

JMP
JMP
short/near
far
JMP
JMP
JMP
Conditional jump
r
m(near)
m(far)
short/near
J(E/R)CXZ
LOOP
LOOP(N)E
short
short
short
1-3-8
1-3-8
25
1-3-8
1-3-8
25
CALL
CALL
near
far
3
72
3
72
CALL
CALL
CALL
r
m(near)
m(far)
3
4
72
3
3
72
3
3
39
39
3
3
39
39
13
7
RETN
RETN
RETF
RETF
BOUND
INTO
i
i
r,m
String instructions
LODSB/W/D/Q
REP LODSB/W/D/Q
STOSB/W/D/Q
REP STOSB/W/D/Q
1
3n+22
1-2
Small: 2n+2,
Big: 6 bytes
per clock
MOVSB/W/D/Q
REP MOVSB/W/D/Q
2
Small: 2n+45,
Big: 6 bytes
per clock
SCASB/W/D/Q
REP SCASB
1
2.2n
8 if >2 jumps in 16
bytes block
Not in x64 mode
8 if >2 jumps in 16
bytes block
do.
8 if >2 jumps in 16
bytes block
Not in x64 mode
8 if >2 jumps in 16
bytes block
do.
8 if >2 jumps in 16
bytes block
do.
Not in x64 mode

Not in x64 mode

REP SCASW/D/Q
Small: 2n+50
Big: 5 bytes
per clock
CMPSB/W/D/Q
REP CMPSB/W/D/Q
Other
NOP (90)
Long NOP (0F 1F)
PAUSE
ENTER
ENTER
LEAVE
CPUID
RDTSC
RDPMC
6
2.4n+24
1
1
All
I12
a,0
a,b
4
53-173
40
1
1/2
25
23
52+5b
4
Blocks all ports
39
40

Operands ops
Port and
Unit
Latency
Reciprocal
throughput
1
2
2
MB
LD MB
LD MB
1
3
3
MB
MB SA ST
MB SA ST
I2
1
4
4
54
1
5
5
125
0
7
5
5
6
5
5
1
1
1
54
1
1-2
1-2
125
1
MB
Move instructions
FLD
FLD
FLD
FBLD
FST(P)
FST(P)
FSTP
FBSTP
FXCH
FILD
FILD
FILD
FIST(T)(P)
FIST(T)(P)
FIST(T)(P)
FLDZ FLD1
FLDPI FLDL2E etc.
FCMOVcc
FNSTSW
FNSTSW
FLDCW
FNSTCW
FINCSTP FDECSTP
FFREE(P)
FNSAVE
FRSTOR
r
m32/m64
m80
m80
r
m32/m64
m80
m80
r
m16
m32
m64
m16
m32
m64
r
AX
m16
m16
m16
13
1
1
I2
MB
m
m
0
321
195
1
10
2
5
3
13
2
1
1
321
195
Remarks

FADD(P) FSUB(R)(P)
FMUL(P)
FDIV(R)(P)
FABS
FCHS
FCOM(P) FUCOM
FCOMPP FUCOMPP
FCOMI(P) FUCOMI(P)
FIADD FISUB(R)
FIMUL
FIDIV(R)
FICOM(P)
FTST
FXAM
FPREM
FPREM1
FRNDINT
r/m
r/m
r/m
r/m
r
m
m
m
m
1
1
1
1
1
1
1
1
1
MB
MA
MA
MB
MB
MB
MB
MB
MB
2
4
15-42
1
1
MB
1
2
15-42
1
1
1
1
1
2
4
42
2
1
41
Lower precision:
Lat: 4, Thr: 2
151-171
106-155
29
Math
FSCALE
FXTRACT
FSQRT
FSIN FCOS
FSINCOS
F2XM1
FYL2X
FYL2XP1
FPTAN
FPATAN
39
36-57
73
51-159
270-360
50-200
~60
~170
300-370
~170
Other
FNOP
WAIT
FNCLEX
FNINIT
1
1
MB
I12
1
1/2
57
85

Operands ops
Port and
Unit
Latency
Reciprocal
throughput
3
2-3
4
2-3
1
2-3
2-3
1
2-3
1
1-2
1
1
1
1
1-2
1
1
Move instructions
MOVD
MOVD
MOVD
MOVD
MOVQ
MOVQ
MOVQ
MOVDQA
MOVDQA
r32/64,(x)mm
m32/64,(x)mm
(x)mm,r32/64
(x)mm,m32/64
(x)mm, (x)mm
(x)mm,m64
m64, (x)mm
xmm, xmm
xmm, m128
1
1
SA ST
1
1
1
1
1
1
LD
MB
LD
SA ST
MB
LD
Remarks

MOVDQA
MOVDQU
MOVDQU
LDDQU
MOVDQ2Q
MOVQ2DQ
MOVNTQ
MOVNTDQ
PACKSSWB/DW
PACKUSWB
PUNPCKH/LBW/WD/
DQ
PUNPCKH/LQDQ
PSHUFB
PSHUFW
PSHUFL/HW
PSHUFD
PALIGNR
MASKMOVQ
MASKMOVDQU
PMOVMSKB
PEXTRW
PINSRW
m128, xmm
m128, xmm
xmm, m128
xmm, m128
mm, xmm
xmm,mm
m64,mm
m128,xmm
1
1
1
1
1
1
3
3
SA ST
SA ST
LD
LD
MB
MB
2-3
2-3
2-3
2-3
1
1
~300
~300
1-2
1-2
1
1
1
1
2
2
v,v
MB
v,v
v,v
v,v
mm,mm,i
x,x,i
x,x,i
x,x,i
mm,mm
xmm,xmm
r32,(x)mm
r32,(x)mm,i
(x)mm,r32,i
1
1
1
1
1
1
1
MB
MB
MB
MB
MB
MB
MB
1
1
1
1
1
1
1
3
3
9
1
1
1
1
1
1
1
1-3
1-3
1
1
9
v,v
v,v
1
1
MB
MB
1
1
1
1
v,v
v,v
v,v
v,v
v,v
v,v
v,v
v,v
v,v
v,v
v,v
v,v
3
3
1
1
1
1
MB
MB
MB
MA
MA
MA
1
1
1
MB
MB
MB
MB
3
3
1
3
3
3
4
10
2
1
1
1
3
3
1
1
1
1
2
8
1
1
1
1
PADD/SUB(U)(S)B/W/D
PADDQ PSUBQ
PHADD(S)W
PHSUB(S)W
PHADDD PHSUBD
PCMPEQ/GTB/W/D
PMULL/HW PMULHUW
PMULHRSW
PMULUDQ
PMADDWD
PMADDUBSW
PSADBW
PAVGB/W
PMIN/MAXUB
PMIN/MAXSW
PABSB PABSW PABSD
v,v
MB
PSIGNB PSIGNW
PSIGND
v,v
MB
Logic instructions
PAND(N) POR PXOR
PSLL/RL/RAW/D/Q
PSLL/RL/RAW/D/Q
PSLL/RLDQ
v,v
v,v
v,i
x,i
1
1
1
1
MB
MB
MB
MB
1
1
1
1
1
1
1
1

Other
EMMS
MB

Operands ops
Port and
Unit
Latency
Reciprocal
throughput
1
1
1
1
1
1
1
1
MB
LD
SA ST
LD
SA ST
MB
LD
SA ST
MB
1
1
1
1
1
1
MB
MB
MB
MB
MB
MB
1
2-3
2-3
2-3
2-3
1
2-3
2-3
6
6
6
2
1
3
~300
1
1
1
1
1
1
1
1
1-2
1
1-2
1
1
1-2
1
1
1-2
1-2
1
1
2,5
1
1
1
1
1
1
Move instructions
MOVAPS/D
MOVAPS/D
MOVAPS/D
MOVUPS/D
MOVUPS/D
MOVSS/D
MOVSS/D
MOVSS/D
MOVHPS/D
MOVLPS/D
MOVHPS/D
MOVLPS/D
MOVLHPS MOVHLPS
MOVMSKPS/D
MOVNTPS/D
SHUFPS
SHUFPD
MOVDDUP
MOVSH/LDUP
UNPCKH/LPS
UNPCKH/LPD
xmm,xmm
xmm,m128
m128,xmm
xmm,m128
m128,xmm
xmm,xmm
x,m32/64
m32/64,x
xmm,m64
xmm,m64
m64,xmm
m64,xmm
xmm,xmm
r32,xmm
m128,xmm
xmm,xmm,i
xmm,xmm,i
xmm,xmm
xmm,xmm
xmm,xmm
xmm,xmm
Conversion
CVTPD2PS
CVTSD2SS
CVTPS2PD
CVTSS2SD
CVTDQ2PS
CVT(T)PS2DQ
CVTDQ2PD
CVT(T)PD2DQ
CVTPI2PS
CVT(T)PS2PI
CVTPI2PD
CVT(T)PD2PI
CVTSI2SS
CVT(T)SS2SI
CVTSI2SD
CVT(T)SD2SI
xmm,xmm
xmm,xmm
xmm,xmm
xmm,xmm
xmm,xmm
xmm,xmm
xmm,xmm
xmm,xmm
xmm,mm
mm,xmm
xmm,mm
mm,xmm
xmm,r32
r32,xmm
xmm,r32
r32,xmm
Arithmetic
ADDSS SUBSS
xmm,xmm
3-4
15
3-4
15
3
2
4
3
4
3
4
3
5
4
5
4
MBfadd
2-3
Remarks

ADDSD SUBSD
ADDPS SUBPS
ADDPD SUBPD
ADDSUBPS
ADDSUBPD
HADDPS HSUBPS
HADDPD HSUBPD
MULSS
MULSD
MULPS
MULPD
DIVSS
DIVSD
DIVPS
DIVPD
RCPSS
RCPPS
CMPccSS/D
CMPccPS/D
COMISS/D UCOMISS/D
xmm,xmm
xmm,xmm
xmm,xmm
xmm,xmm
xmm,xmm
xmm,xmm
xmm,xmm
xmm,xmm
xmm,xmm
xmm,xmm
xmm,xmm
xmm,xmm
xmm,xmm
xmm,xmm
xmm,xmm
xmm,xmm
xmm,xmm
xmm,xmm
xmm,xmm
MAXSS/D MINSS/D
MAXPS/D MINPS/D
xmm,xmm
xmm,xmm
xmm,xmm
Math
SQRTSS
SQRTPS
SQRTSD
SQRTPD
RSQRTSS
RSQRTPS
xmm,xmm
xmm,xmm
xmm,xmm
xmm,xmm
xmm,xmm
xmm,xmm
Logic
ANDPS/D
ANDNPS/D
ORPS/D
XORPS/D
xmm,xmm
xmm,xmm
xmm,xmm
xmm,xmm
Other
LDMXCSR
STMXCSR
FXSAVE
FXRSTOR
m32
m32
m4096
m4096
1
1
1
1
1
1
1
MBfadd
MBfadd
2-3
2-3
2-3
2-3
2-3
5
5
3
4
3
4
15-22
15-36
42-82
24-70
5
14
2
2
1
1
MBfadd
MBfadd
3
2
2
1
1
1
MA
MA
MA
MA
33
126
62
122
5
14
33
126
62
122
5
11
MB
MB
MB
MB
1
1
1
1
1
1
1
1
45
13
208
232
29
13
208
232
1
1
1
1
1
1
MBfadd
MBfadd
MBfadd
MBfadd
MBfadd
MBfadd
MBfadd
MA
MA
MA
MA
MA
MA
MA
MA
1
1
1
1
1
3
3
1
2
1
2
15-22
15-36
42-82
24-70
5
11
1
1
VIA-specific instructions
Instruction
XSTORE
XSTORE
REP XSTORE
REP XSTORE
Conditions
Data available
No data available
Quality factor = 0
Quality factor > 0
Clock cycles, approximately

160-400 clock giving 8 bytes
4800 clock per 8 bytes

REP XCRYPTECB
REP XCRYPTECB
REP XCRYPTECB
REP XCRYPTCBC
REP XCRYPTCBC
REP XCRYPTCBC
REP XCRYPTCTR
REP XCRYPTCTR
REP XCRYPTCTR
REP XCRYPTCFB
REP XCRYPTCFB
REP XCRYPTCFB
REP XCRYPTOFB
REP XCRYPTOFB
REP XCRYPTOFB
REP XSHA1
REP XSHA256
128 bits key

192 bits key
256 bits key
128 bits key
192 bits key
256 bits key
128 bits key
192 bits key
256 bits key
128 bits key
192 bits key
256 bits key
128 bits key
192 bits key
256 bits key

3 clock per byte
4 clock per byte
Nano 3000
VIA Nano 3000 series

Operands:
ops: The number of micro-operations from the decoder or ROM. Note that the VIA Nano 3000 processor has no reliable performance monitor counter for ops. Therefore the number of ops cannot be determined except in simple cases.
Port: Tells which execution port or unit is used. Instructions that use the same port cannot execute simultaneously.
    I1: Integer add, Boolean, shift, etc.
    I2: Integer add, Boolean, move, jump.
    I12: Can use either I1 or I2, whichever is vacant first.
    MA: Multiply, divide and square root on all operand types.
    MB: Various integer and floating point SIMD operations.
    MBfadd: Floating point addition subunit under MB.
    SA: Memory store address.
    ST: Memory store.
    LD: Memory load.
Latency: [...] Floating point overflow, underflow, denormal or NAN results give a similar delay.
Note: There is an additional latency for moving data from one unit or subunit to another. A table of these latencies is given in manual 3: "The microarchitecture of Intel, AMD and VIA CPUs". These additional latencies are not included in the listings below where the source and destination operands are of the same type.
Reciprocal throughput: [...] instructions of the same kind in the same thread.
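The note on additional latency means that the latency of a dependency chain is not simply the sum of the listed latencies when consecutive instructions run in different units. The Python sketch below shows the bookkeeping. The function name `chain_latency` and the penalty values are hypothetical placeholders; the real transfer latencies are tabulated in manual 3 and are not reproduced here.

```python
def chain_latency(chain, transfer_penalty):
    """Sum listed latencies plus a penalty whenever consecutive steps change unit.

    chain: list of (unit, latency) pairs forming one dependency chain.
    transfer_penalty: dict mapping (from_unit, to_unit) to extra clocks.
    """
    total = 0
    previous_unit = None
    for unit, latency in chain:
        if previous_unit is not None and unit != previous_unit:
            total += transfer_penalty.get((previous_unit, unit), 0)
        total += latency
        previous_unit = unit
    return total


# Hypothetical penalties, for illustration only.
PENALTY = {("MA", "MB"): 1, ("MB", "MA"): 1}
same_unit = chain_latency([("MB", 1), ("MB", 1)], PENALTY)
crossing = chain_latency([("MA", 3), ("MB", 1), ("MA", 3)], PENALTY)
```

With these placeholder numbers the chain that stays in MB costs only its listed latencies, while the chain that alternates between MA and MB pays one extra clock at each crossing.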
Operands
ops
Port
Latency Reciprocal
throughput
MOV
MOV
r,r
r,i
1
1
I2
I12
1
1
1
1/2
MOV
MOV
MOV
MOV
MOV
MOV
r,m
m,r
m,i
r,sr
m,sr
sr,r
1
1
1
LD
SA, ST
SA, ST
I12
2
2
1
1,5
1,5
1/2
1,5
20
Remarks
Move instructions
20
Latency 4 on pointer
register
MOV
MOVNTI
MOVSX MOVZX
MOVSXD
MOVSX MOVSXD
MOVZX
CMOVcc
CMOVcc
XCHG
XCHG
XLAT
PUSH
PUSH
PUSH
PUSH
PUSHF(D/Q)
PUSHA(D)
POP
POP
POP
POP
POPF(D/Q)
POPA(D)
LAHF
SAHF
SALC
LEA
BSWAP
LDS LES LFS LGS LSS
PREFETCHNTA
PREFETCHT0/1/2
LFENCE MFENCE
SFENCE
ADD SUB
ADD SUB
ADD SUB
ADC SBB
ADC SBB
ADC SBB
CMP
CMP
INC DEC NEG NOT
INC DEC NEG NOT
AAA
AAS
DAA
sr,m
m,r
r,r
r64,r32
r,m
r,m
r,r
r,m
r,r
r,m
m
r
i
m
sr
r
(E/R)SP
m
sr
1
1
2
1
1
3
3
1
1
3
9
2
SA, ST
I12
LD, I12
LD
I12
LD, I12
I12
LD, I1
SA, ST
SA, ST
LD, SA, ST
20
2
1
1
3
2
1
5
3
18
6
1
1
10
20
1,5
1/2
1
1
1
1/2
1
1,5
18
2
1-2
1-2
2
6
2
15
1,25
4
2
11
1
12
1
1
6
1
1
1
1
28
28
1
1
2
LD
3
3
16
1
1
2
r,m
r
1
1
m
m
m
12
1
1
I1
I1
SA
I2
LD
LD
Implicit lock
Not in x64 mode
Not in x64 mode
Not in x64 mode

Extra latency to other
ports
15
r,r/i
r,m
m,r/i
r,r/i
r,m
m,r/i
r,r/i
m,r/i
r
m
1
2
3
1
2
3
1
2
1
3
12
12
14
I12
LD I12
LD I12 SA ST
5
1
I1
LD I1
LD I1 SA ST
I12
LD I12
I12
LD I12 SA ST
5
1
1
5
1/2
1
2
1
1
2
1/2
1
1/2
37
22
22
Not in x64 mode

Not in x64 mode
Not in x64 mode
DAS
AAD
AAM
MUL IMUL
MUL IMUL
MUL IMUL
r8
r16
r32
14
7
13
1
3
3
I2
I2
I2
2
3
3
MUL IMUL
IMUL
IMUL
r64
r16,r16
r32,r32
3
1
1
MA
I2
I2
8
2
2
8
1
1
IMUL
IMUL
IMUL
r64,r64
r16,r16,i
r32,r32,i
1
1
1
MA
I2
I2
5
2
2
2
1
1
IMUL
DIV
DIV
DIV
DIV
IDIV
IDIV
IDIV
IDIV
CBW CWDE CDQE
CWD CDQ CQO
r64,r64,i
r8
r16
r32
r64
r8
r16
r32
r64
MA
MA
MA
MA
MA
MA
MA
MA
MA
I2
I2
5
22-24
24-28
22-30
145-162
21-24
24-28
18-26
182-200
1
1
Logic instructions
AND OR XOR
AND OR XOR
AND OR XOR
TEST
TEST
SHR SHL SAR
ROR ROL
RCR RCL
RCR RCL
SHLD SHRD
SHLD SHRD
SHLD
SHRD
BT
BT
BT
BTR BTS BTC
BTR BTS BTC
BTR BTS BTC
BSF BSR
SETcc
SETcc
1
1
r,r/i
r,m
m,r/i
r,r/i
m,r/i
r,i/cl
r,i/cl
r,1
r,i/cl
r16,r16,i/cl
r32,r32,i/cl
r64,r64,i/cl
r64,r64,i/cl
r,r/i
m,r
m,i
r,r/i
m,r
m,i
r,r
r8
m
24
24
31
1
I12
2
LD I12
3
LD I12 SA ST
1
I12
2
LD I12
1
I12
1
I1
1
I1
5+2n
I1
2
I1
2
I1
16
I1
23
I1
1
I1
6
I1
2
I1
2
I1
8
I1
5
I1
2
I1
1
I1
2
1
5
1
1
1
1
28+3n
2
2
32
42
1
2
10
8
2
1
Not in x64 mode

Not in x64 mode
Not in x64 mode

ports

ports

ports
2
22-24
24-28
22-30
145-162
21-24
24-28
18-26
182-200
1
1
1/2
1
2
1/2
1
1/2
1
1
28+3n
2
2
32
42
1
8
1
2
10
8
2
1
2
CLC STC CMC
CLD STD
3
3
I1
I1
3
3
3
3

JMP
JMP
short/near
far
1
14
I2
3
50
JMP
JMP
JMP
r
m(near)
m(far)
2
2
17
I2
3
3
3
3
42
Conditional jump
J(E/R)CXZ
LOOP
LOOP(N)E
short/near
short
short
short
1
2
2
5
CALL
CALL
near
far
CALL
CALL
CALL
r
m(near)
m(far)
RETN
RETN
RETF
RETF
BOUND
INTO
i
i
r,m
I2
1-3-8
1-3-8
1-3-8
24
1-3-8
1-3-8
1-3-8
24
2
17
3
58
2
3
19
3
4
3
3
54
3
4
20
20
9
3
3
3
3
3
49
49
13
7
String instructions
LODSB/W/D/Q
REP LODSB/W/D/Q
STOSB/W/D/Q
2
3n
1
REP STOSB/W/D/Q
MOVSB/W/D/Q
1
3n+27
1-2
Small:
n+40, Big:
6-7
bytes/clk
2
Small:
2n+20,
Big: 6-7
bytes/clk
1
2.4n
Small:
2n+31,
Big: 5
bytes/clk
REP MOVSB/W/D/Q
SCASB/W/D/Q
REP SCASB
REP SCASW/D/Q
8 if >2 jumps in 16
bytes block
Not in x64 mode
8 if >2 jumps in 16
bytes block
do.
1 if not jumping.
3 if jumping.
8 if >2 jumps in 16
bytes block
8 if >2 jumps in 16
bytes block
Not in x64 mode
8 if >2 jumps in 16
bytes block
do.
8 if >2 jumps in 16
bytes block
do.
Not in x64 mode

Not in x64 mode
CMPSB/W/D/Q
REP CMPSB/W/D/Q
Other
NOP (90)
long NOP (0F 1F)
PAUSE
ENTER
ENTER
LEAVE
CPUID
RDTSC
RDPMC
a,0
a,b
0-1
0-1
2
10
6
2.2n+30
I12
I12
0
0
2
55-146
1/2
1/2
6
21
52+5b
2
Sometimes fused
37
40

Operands
ops
Port
r
m32/m64
m80
m80
r
m32/m64
m80
m80
r
m16
m32
m64
m16
m32
m64
MB
LD MB
LD MB
m
m
1
2
2
36
1
3
3
80
1
3
2
2
3
3
3
1
3
1
1
3
5
3
1
1
122
115
r/m
Latency Reciprocal
throughput
Move instructions
FLD
FLD
FLD
FBLD
FST(P)
FST(P)
FSTP
FBSTP
FXCH
FILD
FILD
FILD
FIST(T)(P)
FIST(T)(P)
FIST(T)(P)
FLDZ FLD1
FLDPI FLDL2E etc.
FCMOVcc
FNSTSW
FNSTSW
FLDCW
FNSTCW
FINCSTP FDECSTP
FFREE(P)
FNSAVE
FRSTOR
FADD(P) FSUB(R)(P)
r
AX
m16
m16
m16
MB
MB SA ST
MB SA ST
I2
1
4
4
54
1
5
5
125
0
7
5
5
6
5
5
MB
319
196
1
10
2
1
2
8
2
1
1
319
196
MB
I2
MB
MB
1
1
1
54
1
1-2
1-2
125
1
Remarks
FMUL(P)
FDIV(R)(P)
FABS
FCHS
FCOM(P) FUCOM
FCOMPP FUCOMPP
FCOMI(P) FUCOMI(P)
FIADD FISUB(R)
FIMUL
FIDIV(R)
FICOM(P)
FTST
FXAM
FPREM
FPREM1
FRNDINT
r/m
r/m
r/m
r
m
m
m
m
Math
FSCALE
FXTRACT
MA
MA
MB
MB
MB
MB
MB
MB
4
14-23
1
1
MB
11
2
38
~130
~130
27
22
13
37
57
1
1
1
1
1
3
3
3
3
1
15
FSQRT
FSIN FCOS
FSINCOS
F2XM1
FYL2X
FYL2XP1
FPTAN
FPATAN
2
14-23
1
1
1
1
1
2
4
16
2
1
38
Less at lower
precision
73
~150
270-360
50-200
~50
~50
300-370
~180
Other
FNOP
WAIT
FNCLEX
FNINIT
1
1
MB
I12
1
1/2
59
84

Operands
ops
Port
r,(x)mm
m,(x)mm
(x)mm,r
(x)mm,m
v,v
(x)mm,m64
m64, (x)mm
x,x
1
1
1
1
1
1
1
1
MB
SA ST
I2
LD
MB
LD
SA ST
MB
Latency Reciprocal
throughput
Move instructions
MOVD
MOVD
MOVD
MOVD
MOVQ
MOVQ
MOVQ
MOVDQA
3
2
4
2
1
2
2
1
1
1-2
1
1
1
1
1-2
1
Remarks
MOVDQA
MOVDQA
MOVDQU
MOVDQU
LDDQU
MOVDQ2Q
MOVQ2DQ
MOVNTQ
MOVNTDQ
MOVNTDQA
PACKSSWB/DW
PACKUSWB
PACKUSDW
PUNPCKH/LBW/WD/DQ
PUNPCKH/LQDQ
PSHUFB
PSHUFW
PSHUFL/HW
PSHUFD
PBLENDVB
PBLENDW
PALIGNR
MASKMOVQ
MASKMOVDQU
PMOVMSKB
PEXTRW
PEXTRB/D/Q
PINSRW
PINSRB/D/Q
PMOVSX/ZXBW/BD/
BQ/WD/WQ/DQ
x, m128
m128, x
m128, x
x, m128
x, m128
mm, x
x,mm
m64,mm
m128,x
x,m128
1
1
1
1
1
1
1
2
2
1
LD
SA ST
SA ST
LD
LD
MB
MB
2
2
2
2
2
1
1
~360
~360
2
1
1-2
1-2
1
1
1
1
2
2
1
v,v
x,x
v,v
v,v
v,v
mm,mm,i
x,x,i
x,x,i
x,x,xmm0
x,x,i
x,x,i
mm,mm
x,x
r32,(x)mm
r32,(x)mm,i
r32/64,x,i
(x)mm,r32,i
x,r32/64,i
1
1
1
1
1
1
1
1
1
1
1
MB
MB
MB
MB
MB
MB
MB
MB
MB
MB
MB
1
1
1
1
1
1
1
1
2
1
1
1
1
2
2
MB
MB
MB
MB
3
3
3
5
5
1
1
1
1
1
1
1
1
2
1
1
1-2
1-2
1
1
1
1
1
x,x
MB
v,v
v,v
1
1
MB
MB
1
1
1
1
v,v
v,v
v,v
x,x
v,v
v,v
x,x
v,v
x,x
v,v
v,v
v,v
x,x,i
v,v
3
3
1
1
1
1
1
1
1
1
7
1
1
1
MB
MB
MB
MB
MA
MA
MA
MA
MA
MA
3
3
1
1
3
3
3
3
3
4
10
2
2
1
3
3
1
1
1
1
1
1
1
2
8
1
1
1
PADD/SUB(U)(S)B/W/D
PADDQ PSUBQ
PHADD(S)W
PHSUB(S)W
PHADDD PHSUBD
PCMPEQ/GTB/W/D
PCMPEQQ
PMULL/HW PMULHUW
PMULHRSW
PMULLD
PMULUDQ
PMULDQ
PMADDWD
PMADDUBSW
PSADBW
MPSADBW
PAVGB/W
MB
MB
MB
PMIN/MAXSW
PMIN/MAXUB
PMIN/MAXSB/D
PMIN/MAXUW/D
PHMINPOSUW
PABSB PABSW PABSD
PSIGNB PSIGNW
PSIGND
Logic instructions
PAND(N) POR PXOR
PTEST
PSLL/RL/RAW/D/Q
PSLL/RL/RAW/D/Q
PSLL/RLDQ
v,v
v,v
x,x
x,x
x,x
1
1
1
1
1
MB
MB
MB
MB
MB
1
1
1
1
2
1
1
1
1
1
v,v
MB
v,v
MB
v,v
v,v
v,v
(x)mm,i
x,i
1
1
1
1
1
MB
MB
MB
MB
MB
1
3
1
1
1
1
1
1
1
1
MB
Operands
ops
Port
x,x
x,m128
m128,x
x,m128
m128,x
x,x
x,m32/64
m32/64,x
x,m64
x,m64
m64,x
m64,x
x,x
r32,x
m128,x
x,x,i
x,x,i
x,x
x,x
x,x
x,x
1
1
1
1
2
1
1
2
2
2
3
1
1
MB
LD
SA ST
LD
SA ST
MB
LD
SA ST
x,x
x,x
x,x
2
1
2
Other
EMMS

Move instructions
MOVAPS/D
MOVAPS/D
MOVAPS/D
MOVUPS/D
MOVUPS/D
MOVSS/D
MOVSS/D
MOVSS/D
MOVHPS/D
MOVLPS/D
MOVHPS/D
MOVLPS/D
MOVLHPS MOVHLPS
MOVMSKPS/D
MOVNTPS/D
SHUFPS
SHUFPD
MOVDDUP
MOVSH/LDUP
UNPCKH/LPS
UNPCKH/LPD
Conversion
CVTPD2PS
CVTSD2SS
CVTPS2PD
2
1
1
1
1
1
1
MB
MB
MB
MB
MB
MB
Latency
Reciprocal throughput
1
2
2
2
2
1
2-3
2-3
6
6
6
2
1
3
~360
1
1
1
1
1
1
1
1
1
1
1
1
1
1-2
1
1
1-2
1-2
1
1
1-2
1
1
1
1
1
1
5
2
5
2
1
Remarks
CVTSS2SD
CVTDQ2PS
CVT(T)PS2DQ
CVTDQ2PD
CVT(T)PD2DQ
CVTPI2PS
CVT(T)PS2PI
CVTPI2PD
CVT(T)PD2PI
CVTSI2SS
CVT(T)SS2SI
CVTSI2SD
CVT(T)SD2SI
x,x
x,x
x,x
x,x
x,x
x,mm
mm,x
x,mm
mm,x
x,r32
r32,x
x,r32
r32,x
1
1
1
2
x,x
x,x
x,x
x,x
x,x
x,x
x,x
x,x
x,x
x,x
x,x
x,x
x,x
x,x
x,x
x,x
x,x
x,x
x,x
x,x
1
1
1
1
1
1
3
3
1
1
1
1
1
1
1
1
1
3
1
1
MBfadd
MBfadd
MBfadd
MBfadd
MBfadd
MBfadd
MBfadd
MBfadd
MA
MA
MA
MA
MA
MA
MA
MA
MA
MA
MBfadd
MBfadd
2
2
2
2
2
2
5
5
3
4
3
4
13
13-20
24
21-38
5
14
2
2
1
1
1
1
1
1
3
3
1
2
1
2
13
13-20
24
21-38
5
11
1
1
MAXSS/D MINSS/D
MAXPS/D MINPS/D
x,x
x,x
x,x
1
1
1
MBfadd
MBfadd
MBfadd
3
2
2
1
1
1
Math
SQRTSS
SQRTPS
SQRTSD
SQRTPD
RSQRTSS
RSQRTPS
x,x
x,x
x,x
x,x
x,x
x,x
1
1
1
1
1
3
MA
MA
MA
MA
33
64
62
122
5
14
33
64
62
122
5
11
Logic
ANDPS/D
x,x
MB
Arithmetic
ADDSS SUBSS
ADDSD SUBSD
ADDPS SUBPS
ADDPD SUBPD
ADDSUBPS
ADDSUBPD
HADDPS HSUBPS
HADDPD HSUBPD
MULSS
MULSD
MULPS
MULPD
DIVSS
DIVSD
DIVPS
DIVPD
RCPSS
RCPPS
CMPccSS/D
CMPccPS/D
COMISS/D UCOMISS/D
MB
2
1
2
2
2
1
2
1
2
3
2
5
4
5
4
4
4
5
4
5
4
1
1
1
2
2
1
1
2
1
1
ANDNPS/D
ORPS/D
XORPS/D
x,x
x,x
x,x
Other
LDMXCSR
STMXCSR
FXSAVE
FXRSTOR
m32
m32
m4096
m4096
1
1
1
MB
MB
MB
1
1
1
1
1
1
31
13
97
201
VIA-specific instructions
Instruction       Conditions            Clock cycles, approximately
XSTORE            Data available
XSTORE            No data available
REP XSTORE        Quality factor = 0
REP XSTORE        Quality factor > 0
REP XCRYPTECB     128 bits key
REP XCRYPTECB     192 bits key
REP XCRYPTECB     256 bits key
REP XCRYPTCBC     128 bits key
REP XCRYPTCBC     192 bits key
REP XCRYPTCBC     256 bits key
REP XCRYPTCTR     128 bits key
REP XCRYPTCTR     192 bits key
REP XCRYPTCTR     256 bits key
REP XCRYPTCFB     128 bits key
REP XCRYPTCFB     192 bits key
REP XCRYPTCFB     256 bits key
REP XCRYPTOFB     128 bits key
REP XCRYPTOFB     192 bits key
REP XCRYPTOFB     256 bits key
REP XSHA1                               5 clock per byte
REP XSHA256                             5 clock per byte
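
As a rough illustration of how the per-byte figure for REP XSHA1 and REP XSHA256 can be used, the sketch below turns "5 clock per byte" into an estimated cycle count and run time for a buffer of a given size. The function names and the 1 GHz clock frequency are assumptions made for the example only and are not measured values from these tables. The estimate also ignores any fixed startup cost of the instruction.

```python
# Hypothetical helpers (names are illustrative, not from the manual).
# They scale the approximate "5 clock per byte" figure listed for
# REP XSHA1 / REP XSHA256 to a whole buffer.

CLOCKS_PER_BYTE = 5  # approximate value taken from the table


def estimate_hash_cycles(num_bytes, clocks_per_byte=CLOCKS_PER_BYTE):
    """Approximate clock cycles needed to hash num_bytes bytes."""
    if num_bytes < 0:
        raise ValueError("num_bytes must be non-negative")
    return num_bytes * clocks_per_byte


def estimate_hash_seconds(num_bytes, clock_hz, clocks_per_byte=CLOCKS_PER_BYTE):
    """Approximate run time in seconds at an assumed core clock frequency."""
    if clock_hz <= 0:
        raise ValueError("clock_hz must be positive")
    return estimate_hash_cycles(num_bytes, clocks_per_byte) / clock_hz


if __name__ == "__main__":
    size = 1024 * 1024   # 1 MiB buffer
    assumed_hz = 1.0e9   # assumption: 1 GHz clock, chosen only for illustration
    print("estimated cycles :", estimate_hash_cycles(size))
    print("estimated seconds:", estimate_hash_seconds(size, assumed_hz))
```

Because the table value is approximate, the result is only an order-of-magnitude guide. Measuring on the actual hardware is the only reliable way to get a figure for a particular buffer size.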
