CS424
KCS Murti, Chandra.kavuri@gmail.com
Overview
Concepts associated with the processor within an SoC, with reference to the Intel Atom processor: a 32-bit processor with a number of on-board peripherals. A cost-effective SoC is the objective for embedded systems (ES).
Flags
Modes of operation
Real-address mode: implements the programming environment of the Intel 8086 processor, with some extensions. It is the default mode after power-up or a reset.
Protected mode: the native state of the processor. Among its capabilities is the ability to directly execute real-address mode 8086 software in a protected, multitasking environment.
System management mode (SMM): the processor switches to a separate address space while saving the basic context of the currently running program or task. SMM provides an OS with a transparent mechanism for implementing platform-specific functions such as power management.
Virtual-8086 mode: allows the processor to execute 8086 software in a protected, multitasking environment.
Segment Registers
Privilege Levels
Privilege levels provide a mechanism that allows portions of the software to operate with differing levels of privilege. The Current Privilege Level (CPL) is used by the system to control access to resources and the execution of certain instructions. The CPL is stored in the lowest 2 bits of the code segment (CS) register. The I/O privilege level (IOPL) is stored in bits 12 and 13 of the FLAGS register. The highest privilege level is 0. If the CPL is numerically less than or equal to the current IOPL, the privileged I/O operation is allowed.
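The CPL/IOPL rule above can be sketched in C. This is a minimal model, assuming the caller already has the CS selector and EFLAGS values (for example, saved on an exception frame); the function names are hypothetical, and the protected-mode I/O permission bitmap, which can still grant IN/OUT when CPL > IOPL, is not modelled.

```c
#include <stdbool.h>
#include <stdint.h>

/* CPL: the low 2 bits of the selector currently loaded in CS. */
static unsigned cpl_from_cs(uint16_t cs_selector)
{
    return cs_selector & 0x3u;
}

/* IOPL: bits 12 and 13 of (E)FLAGS. */
static unsigned iopl_from_eflags(uint32_t eflags)
{
    return (eflags >> 12) & 0x3u;
}

/* I/O-sensitive instructions are allowed directly when CPL <= IOPL
   (numerically). The I/O permission bitmap check is not modelled. */
static bool io_allowed(uint16_t cs_selector, uint32_t eflags)
{
    return cpl_from_cs(cs_selector) <= iopl_from_eflags(eflags);
}
```

A kernel code selector such as 0x08 has CPL 0 and passes for any IOPL; a user selector such as 0x1B has CPL 3 and passes only when IOPL is 3.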
Floating-Point Units
Intel processors have two floating-point units. The x87 FPU operates on floating-point, integer, and binary-coded decimal (BCD) operands and supports 80-bit double extended-precision floating point. The Atom processor supports the Supplemental Streaming SIMD Extensions 3 (SSSE3) version of the SIMD instructions, which support integer, single-precision, and double-precision floating-point operations.
Processor Specifics
Embedded systems need some form of version management: you must establish exactly which features are supported on the particular processor version you are working with. Use the control and information registers, and the CPUID instruction, to get this information.
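As a sketch of version checking, the processor signature that CPUID leaf 1 returns in EAX can be decoded into family, model, and stepping. The decoder below is a hypothetical helper that takes the raw EAX value, so it runs on any host; obtaining the value (for example with GCC/Clang's `__get_cpuid` from `<cpuid.h>`) is left to the caller. The example value 0x000106C2 decodes to family 6, model 0x1C, the model number used by first-generation Atom parts.

```c
#include <stdint.h>

struct cpu_signature {
    unsigned family;
    unsigned model;
    unsigned stepping;
};

/* Decode the processor signature returned in EAX by CPUID leaf 1. */
static struct cpu_signature decode_signature(uint32_t eax)
{
    struct cpu_signature s;
    unsigned base_family = (eax >> 8) & 0xFu;   /* bits 11:8  */
    unsigned base_model  = (eax >> 4) & 0xFu;   /* bits 7:4   */
    unsigned ext_family  = (eax >> 20) & 0xFFu; /* bits 27:20 */
    unsigned ext_model   = (eax >> 16) & 0xFu;  /* bits 19:16 */

    s.stepping = eax & 0xFu;                    /* bits 3:0   */
    s.family = base_family;
    if (base_family == 0xFu)
        s.family += ext_family;
    s.model = base_model;
    if (base_family == 0x6u || base_family == 0xFu)
        s.model += ext_model << 4;
    return s;
}
```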
Calling conventions
cdecl: the default calling convention. The calling function performs the stack cleanup, which makes functions with a variable number of arguments possible. stdcall: supports only a fixed number of arguments per function; the stack cleanup is performed by the called function.
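A variadic function shows why the cdecl rule matters: only the caller knows how many arguments it pushed, so only the caller can pop them. A minimal sketch (the name `sum_ints` is hypothetical). On 32-bit x86 compilers functions default to cdecl, while a fixed-argument function could be declared `__stdcall` (MSVC) or `__attribute__((stdcall))` (GCC, 32-bit x86 only) to make the callee clean up instead.

```c
#include <stdarg.h>

/* The first argument tells the callee how many ints follow; the caller
   pushed them and, under cdecl, the caller also removes them. */
int sum_ints(int count, ...)
{
    va_list args;
    int total = 0;

    va_start(args, count);
    for (int i = 0; i < count; i++)
        total += va_arg(args, int);
    va_end(args);
    return total;
}
```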
Processor Instructions
Operands
Immediate operands: the data is encoded in the instruction itself, e.g. MOV EAX, 00.
Register operands: source and destination operands can be any of the following registers: 32-bit general-purpose registers (EAX, EBX, ECX, EDX, ESI, EDI, ESP, or EBP); 16-bit general-purpose registers (AX, BX, CX, DX, SI, DI, SP, or BP); 8-bit general-purpose registers (AH, BH, CH, DH, AL, BL, CL, or DL); segment registers; the EFLAGS register; MMX registers; control registers (CR0 through CR4); system table registers (such as the Interrupt Descriptor Table register); debug registers; and machine-specific registers.
Memory operands: referenced by means of a segment selector and an offset.
Memory operands
Source and destination operands in memory are referenced by means of a segment selector and an offset. Example: MOV [EBX], EAX moves the value in EAX to the address pointed to by EBX. The offset is computed from these components:
Base: a value in a general-purpose register.
Index: a value in a general-purpose register.
Scale factor: a value of 1, 2, 4, or 8 that is multiplied by the index value.
Displacement: an 8-, 16-, or 32-bit immediate value.
Offset = Base + (Index × Scale) + Displacement
The segment selector identifies the segment whose base address is added to this offset to form the linear address.
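The offset formula can be checked with a small C model. This is a sketch with hypothetical function names; it mirrors an operand such as `[EBX + ESI*4 + 8]`, and segment limit and access checks are omitted.

```c
#include <stdint.h>

/* Offset (effective address) = Base + Index * Scale + Displacement.
   Scale must be 1, 2, 4 or 8; arithmetic wraps modulo 2^32 as on IA-32. */
static uint32_t effective_address(uint32_t base, uint32_t index,
                                  uint32_t scale, int32_t displacement)
{
    return base + index * scale + (uint32_t)displacement;
}

/* Linear address = base address of the selected segment + offset. */
static uint32_t linear_address(uint32_t segment_base, uint32_t offset)
{
    return segment_base + offset;
}
```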
Offset computation
Data Types
Basic
Bit field
Pointer
General purpose
Same as x86. Some special instructions: CMOVE (conditional move if equal), BTS (bit test and set), ENTER (high-level procedure entry), LEAVE (high-level procedure exit).
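BTS can be modelled in C to make the "test" and the "set" explicit: the old bit value goes to the carry flag (returned here), then the bit is set. A sketch for the register-destination form only (function name hypothetical); with a memory destination the bit offset can address bits beyond the first doubleword, which this model does not cover.

```c
#include <stdbool.h>
#include <stdint.h>

/* Register-destination BTS: the bit offset is taken modulo 32,
   the selected bit is copied to CF (return value), then set to 1. */
static bool bit_test_and_set(uint32_t *value, unsigned bit)
{
    uint32_t mask = UINT32_C(1) << (bit % 32);
    bool old = (*value & mask) != 0;
    *value |= mask;
    return old;
}
```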
Example
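/* 32-bit MSVC inline assembly (the x64 MSVC compiler does not accept __asm).
   MOVAPS requires a, b and c to be 16-byte aligned. */
void add(float *a, float *b, float *c)
{
    __asm {
        mov eax, a                        ; address of a
        mov edx, b                        ; address of b
        mov ecx, c                        ; address of c
        movaps xmm0, XMMWORD PTR [eax]    ; load four floats from a
        addps  xmm0, XMMWORD PTR [edx]    ; add four floats from b
        movaps XMMWORD PTR [ecx], xmm0    ; store the four sums to c
    }
}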
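The same packed addition can be written with SSE compiler intrinsics, which also builds on x64 compilers that reject `__asm` blocks. A minimal sketch, assuming the `<xmmintrin.h>` intrinsics on x86 targets; `add4` is a hypothetical name, and the unaligned forms (`_mm_loadu_ps`/`_mm_storeu_ps`, i.e. MOVUPS) are used so callers need not guarantee 16-byte alignment. A scalar loop stands in on non-x86 targets.

```c
#if defined(__SSE__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 1)
#include <xmmintrin.h>
#define HAVE_SSE_INTRINSICS 1
#endif

/* c[i] = a[i] + b[i] for i = 0..3 */
void add4(const float *a, const float *b, float *c)
{
#ifdef HAVE_SSE_INTRINSICS
    __m128 va = _mm_loadu_ps(a);           /* MOVUPS load of four floats */
    __m128 vb = _mm_loadu_ps(b);
    _mm_storeu_ps(c, _mm_add_ps(va, vb));  /* ADDPS, then MOVUPS store */
#else
    for (int i = 0; i < 4; i++)            /* portable fallback without SSE */
        c[i] = a[i] + b[i];
#endif
}
```

When the data is known to be 16-byte aligned, `_mm_load_ps`/`_mm_store_ps` produce the MOVAPS form used in the assembly version.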
Comparison Instructions
FCOM: compare floating-point values; FCOMP: compare floating-point values and pop
Transcendental Instructions
FSIN: sine; FCOS: cosine
Control Instructions
FINCSTP: increment FPU register stack pointer; FDECSTP: decrement FPU register stack pointer
MMX Instructions
Single-instruction multiple-data (SIMD) operations are provided by MMX technology and the SSE, SSE2, and SSE3 extensions. MMX instructions operate on packed byte, word, doubleword, or quadword integer operands and use the eight 64-bit MMX registers as well as the general-purpose registers.
Data Transfer Instructions
move doubleword and quadword operands between MMX registers and between MMX registers and memory.
Conversion Instructions
pack and unpack bytes, words, and doublewords
Arithmetic Instructions
perform packed integer arithmetic
Comparison Instructions
compare packed bytes, words, or doublewords
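Packed integer arithmetic treats one 64-bit MMX register as several independent lanes. The sketch below models PADDUSB (packed add of unsigned bytes with saturation) on a `uint64_t`; the function name is hypothetical, and the lane loop stands in for what the hardware does in a single instruction.

```c
#include <stdint.h>

/* Eight independent unsigned byte additions; each lane saturates at
   0xFF instead of wrapping, and no carry crosses into the next lane. */
static uint64_t paddusb64(uint64_t a, uint64_t b)
{
    uint64_t result = 0;
    for (int lane = 0; lane < 8; lane++) {
        unsigned x = (unsigned)(a >> (8 * lane)) & 0xFFu;
        unsigned y = (unsigned)(b >> (8 * lane)) & 0xFFu;
        unsigned sum = x + y;
        if (sum > 0xFFu)
            sum = 0xFFu;
        result |= (uint64_t)sum << (8 * lane);
    }
    return result;
}
```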
SSE Instructions
Streaming SIMD Extensions (SSE) enhance the performance of IA-32 processors for advanced 2-D and 3-D graphics, motion video, image processing, speech recognition, audio synthesis, telephony, and video conferencing. They execute on Intel 64 and IA-32 processors that support the SSE extensions; detect that support with the CPUID instruction. Instruction groups: SIMD single-precision floating-point instructions that operate on the XMM registers; MXCSR state management instructions; 64-bit SIMD integer instructions that operate on the MMX registers; cacheability control (temporal and non-temporal), prefetch, and instruction ordering instructions.
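Detecting SSE support comes down to testing feature bits from CPUID leaf 1: MMX is EDX bit 23, SSE EDX bit 25, SSE2 EDX bit 26, SSE3 ECX bit 0, SSSE3 ECX bit 9, SSE4.1 ECX bit 19, SSE4.2 ECX bit 20. A sketch with a hypothetical decoder that takes raw register values; on GCC/Clang for x86, `__get_cpuid(1, &eax, &ebx, &ecx, &edx)` from `<cpuid.h>` can supply them.

```c
#include <stdbool.h>
#include <stdint.h>

struct simd_features {
    bool mmx, sse, sse2, sse3, ssse3, sse41, sse42;
};

/* Decode SIMD feature flags from the ECX/EDX values of CPUID leaf 1. */
static struct simd_features decode_simd_features(uint32_t ecx, uint32_t edx)
{
    struct simd_features f;
    f.mmx   = (edx >> 23) & 1u;
    f.sse   = (edx >> 25) & 1u;
    f.sse2  = (edx >> 26) & 1u;
    f.sse3  = (ecx >> 0) & 1u;
    f.ssse3 = (ecx >> 9) & 1u;
    f.sse41 = (ecx >> 19) & 1u;
    f.sse42 = (ecx >> 20) & 1u;
    return f;
}
```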
SSE2 Instructions
SSE2 instructions operate on packed double-precision floating-point operands and on packed byte, word, doubleword, and quadword operands located in the XMM registers. Instruction groups: packed and scalar double-precision floating-point instructions; packed single-precision floating-point conversion instructions; 128-bit SIMD integer instructions; cacheability-control and instruction ordering instructions.
SSE3 instructions
SSE3 accelerates the performance of the Streaming SIMD Extensions, SSE2, and x87 FPU math capabilities with: one x87 FPU instruction used in integer conversion (FISTTP); one SIMD integer instruction that addresses unaligned data loads (LDDQU); two SIMD floating-point packed ADD/SUB instructions (ADDSUBPS, ADDSUBPD); four SIMD floating-point horizontal ADD/SUB instructions (HADDPS, HSUBPS, HADDPD, HSUBPD); three SIMD floating-point LOAD/MOVE/DUPLICATE instructions (MOVSHDUP, MOVSLDUP, MOVDDUP); and two thread synchronization instructions (MONITOR, MWAIT).
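The horizontal add is easiest to understand from a scalar model. This sketch reproduces HADDPS semantics in plain C (the function name is hypothetical); on SSE3 hardware the corresponding intrinsic is `_mm_hadd_ps` from `<pmmintrin.h>`.

```c
/* HADDPS dst, src: adjacent pairs are summed within each operand.
   dst = { dst0+dst1, dst2+dst3, src0+src1, src2+src3 } */
static void haddps_model(float dst[4], const float src[4])
{
    float r0 = dst[0] + dst[1];
    float r1 = dst[2] + dst[3];
    float r2 = src[0] + src[1];
    float r3 = src[2] + src[3];
    dst[0] = r0;
    dst[1] = r1;
    dst[2] = r2;
    dst[3] = r3;
}
```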
SSE4 instructions
SSE4 improves the performance of media, imaging, and 3D workloads. Features include: dword multiply; floating-point dot product; streaming load hint; packed blending; floating-point round instructions with selectable rounding; insertion into and extraction from XMM registers; packed integer format conversions; string and text processing.
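The floating-point dot product (DPPS, SSE4.1) takes an 8-bit immediate that controls both which products are summed and where the sum is written. A scalar sketch of those semantics (hypothetical function name); the hardware intrinsic is `_mm_dp_ps` from `<smmintrin.h>`.

```c
#include <stdint.h>

/* DPPS dst, src, imm8: bits 7:4 of imm8 select which products enter
   the sum; bits 3:0 select which result lanes receive the sum
   (unselected lanes are set to zero). */
static void dpps_model(float dst[4], const float src[4], uint8_t imm8)
{
    float prod[4];
    for (int i = 0; i < 4; i++)
        prod[i] = ((imm8 >> (4 + i)) & 1u) ? dst[i] * src[i] : 0.0f;

    float sum = (prod[0] + prod[1]) + (prod[2] + prod[3]);
    for (int i = 0; i < 4; i++)
        dst[i] = ((imm8 >> i) & 1u) ? sum : 0.0f;
}
```

With imm8 = 0xF1 all four products are summed and only lane 0 receives the result, which is the usual choice for a plain 4-element dot product.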
Input/Output
I/O is accessed either through a separate I/O address space of 64K individually addressable 8-bit I/O ports, or through memory-mapped I/O: I/O devices that respond like memory components can be accessed through the processor's physical-memory address space. When using memory-mapped I/O, caching of the address space mapped for I/O operations must be prevented, for example with the memory type range registers (MTRRs). I/O privilege level: the IOPL field in the EFLAGS register controls access to the I/O address space by restricting use of selected instructions. The IN and OUT instructions can be executed directly only if the current privilege level (CPL) of the currently executing program or task is numerically less than or equal to the IOPL; otherwise, in protected mode, the processor checks the I/O permission bitmap in the TSS.
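In C, memory-mapped registers are usually accessed through `volatile` pointers so the compiler performs every read and write exactly as written. A minimal sketch with hypothetical helper names; `volatile` only constrains the compiler, so the region must still be marked uncacheable (for example via MTRRs), and on real hardware `base` would be the device's mapped register address rather than an ordinary array.

```c
#include <stddef.h>
#include <stdint.h>

/* Read register number `reg` (32-bit registers) from a device block. */
static inline uint32_t mmio_read32(volatile uint32_t *base, size_t reg)
{
    return base[reg];
}

/* Write register number `reg`; volatile keeps the store from being
   removed or merged with other stores by the compiler. */
static inline void mmio_write32(volatile uint32_t *base, size_t reg, uint32_t value)
{
    base[reg] = value;
}
```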
Processor Identification
When the CPUID instruction is executed, selected information is returned in the EAX, EBX, ECX, and EDX registers.
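For example, CPUID leaf 0 returns the vendor identification string in EBX, EDX, and ECX, four ASCII characters per register, lowest byte first. A sketch of decoding those registers (the helper names are hypothetical; the register values themselves would come from executing CPUID).

```c
#include <stdint.h>
#include <string.h>

/* Assemble the 12-character vendor string from EBX, EDX, ECX (in that order). */
static void cpuid_vendor(uint32_t ebx, uint32_t edx, uint32_t ecx, char out[13])
{
    const uint32_t regs[3] = { ebx, edx, ecx };
    for (int r = 0; r < 3; r++)
        for (int i = 0; i < 4; i++)
            out[4 * r + i] = (char)((regs[r] >> (8 * i)) & 0xFFu);
    out[12] = '\0';
}

static int cpuid_is_intel(uint32_t ebx, uint32_t edx, uint32_t ecx)
{
    char vendor[13];
    cpuid_vendor(ebx, edx, ecx, vendor);
    return strcmp(vendor, "GenuineIntel") == 0;
}
```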
Privilege levels
The segment-protection mechanism recognizes four privilege levels (0 to 3); greater numbers mean lesser privileges. The processor checks: Current privilege level (CPL): the privilege level of the currently executing program or task. Descriptor privilege level (DPL): the privilege level of a segment or gate. Requested privilege level (RPL): an override privilege level assigned to a segment selector (bits 0 and 1 of the selector).
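For data-segment access the processor combines these three values: access is granted only if the segment's DPL is numerically greater than or equal to both the CPL and the RPL. A minimal C model of that one rule (function name hypothetical; code-segment and gate checks follow different rules).

```c
#include <stdbool.h>

/* Data-segment access: the less privileged (numerically larger) of
   CPL and RPL must not be more privileged than the segment allows. */
static bool data_segment_access_allowed(unsigned cpl, unsigned rpl, unsigned dpl)
{
    unsigned effective = cpl > rpl ? cpl : rpl;
    return dpl >= effective;
}
```

The RPL case is why a kernel acting on behalf of a user program can set RPL = 3 in a selector it received: even though its own CPL is 0, the access is then checked as if made at level 3.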
Gate descriptors
Gates provide controlled access to code segments with different privilege levels. There are four kinds: call gates, trap gates, interrupt gates, and task gates.
CALLs
Stack Structure
Interrupt latency
Task Management
A task is a unit of work that a processor can dispatch, execute, and suspend. A task's execution space consists of a code segment, a stack segment, and one or more data segments. The task-state segment (TSS) specifies the segments that make up the task execution space and provides a storage place for task state information. The task register (TR) holds the segment selector and the descriptor information (base address, limit, and attributes) for the TSS of the current task; these are loaded into TR when a task is loaded into the processor for execution.
Task State
The task state consists of: the task's current execution space (the segment selectors in CS, DS, SS, ES, FS, and GS); the general-purpose registers; the EFLAGS register; the EIP register; control register CR3; the task register; the LDTR register; the I/O map base address and I/O map (contained in the TSS); the stack pointers to the privilege 0, 1, and 2 stacks (contained in the TSS); and the link to the previously executed task (contained in the TSS).
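Most of this state lives in the 32-bit TSS. A sketch of its 104-byte layout as a C struct, with compile-time checks on a few offsets; the struct and field names are hypothetical, and the upper 16 bits of each segment-selector slot are reserved.

```c
#include <stddef.h>
#include <stdint.h>

struct tss32 {
    uint32_t prev_task_link;          /* 0x00: link to previous task */
    uint32_t esp0, ss0;               /* 0x04, 0x08: privilege-0 stack */
    uint32_t esp1, ss1;               /* 0x0C, 0x10: privilege-1 stack */
    uint32_t esp2, ss2;               /* 0x14, 0x18: privilege-2 stack */
    uint32_t cr3;                     /* 0x1C: page-directory base */
    uint32_t eip, eflags;             /* 0x20, 0x24 */
    uint32_t eax, ecx, edx, ebx;      /* 0x28 to 0x34 */
    uint32_t esp, ebp, esi, edi;      /* 0x38 to 0x44 */
    uint32_t es, cs, ss, ds, fs, gs;  /* 0x48 to 0x5C: segment selectors */
    uint32_t ldt_selector;            /* 0x60 */
    uint16_t trap;                    /* 0x64: bit 0 is the debug trap flag */
    uint16_t io_map_base;             /* 0x66: offset of the I/O permission bitmap */
};

_Static_assert(offsetof(struct tss32, cr3) == 0x1C, "CR3 slot");
_Static_assert(offsetof(struct tss32, eip) == 0x20, "EIP slot");
_Static_assert(offsetof(struct tss32, io_map_base) == 0x66, "I/O map base slot");
_Static_assert(sizeof(struct tss32) == 104, "32-bit TSS is 104 bytes");
```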
Executing a Task
A task is dispatched by one of: an explicit call to a task with the CALL instruction; an explicit jump to a task with the JMP instruction; an implicit call (by the processor) to an interrupt-handler task; an implicit call to an exception-handler task; or a return (initiated with an IRET instruction) when the NT flag in the EFLAGS register is set. A task switch then occurs between the currently running task and the dispatched task: the execution environment of the currently executing task is saved in its TSS, and execution of that task is suspended. Tasks are not recursive: a task cannot call or jump to itself.
Task Linking
Task linking returns execution to the previous task. It uses the previous task link field of the TSS and the NT flag in the EFLAGS register.
IDT
Interrupt tasks
The interrupt handler is accessed through a task gate. Advantages: the entire context of the interrupted program or task is saved automatically; a new TSS permits the handler to use a new privilege-level 0 stack; and the handler can be further isolated from other tasks by giving it a separate address space. Disadvantages: higher interrupt latency. Because IA-32 tasks are not re-entrant, an interrupt-handler task must disable interrupts between completing the handling of the interrupt and executing IRET.
Memory management
The MMU provides: Address translation: per-process translation of linear (virtual) addresses to physical addresses. Protection: privilege checking and read/write protection of memory. Cache control: different memory regions require different cacheability attributes. A page-directory entry contains a present indicator and the base address of a page table; a page-table entry contains the physical address of the 4-KB page.
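With 32-bit paging and 4-KB pages, the linear address itself selects the entries: bits 31:22 index the page directory, bits 21:12 index the page table, and bits 11:0 are the offset within the page. A sketch of that split and the final combination (helper names hypothetical; the actual table walk in memory is not shown).

```c
#include <stdint.h>

struct linear_parts {
    uint32_t dir;     /* bits 31:22, page-directory index */
    uint32_t table;   /* bits 21:12, page-table index */
    uint32_t offset;  /* bits 11:0, byte offset in the 4-KB page */
};

static struct linear_parts split_linear(uint32_t linear)
{
    struct linear_parts p;
    p.dir = linear >> 22;
    p.table = (linear >> 12) & 0x3FFu;
    p.offset = linear & 0xFFFu;
    return p;
}

/* Physical address = page-frame base from the PTE (bits 31:12) + offset. */
static uint32_t physical_from_pte(uint32_t pte, uint32_t linear)
{
    return (pte & 0xFFFFF000u) | (linear & 0xFFFu);
}
```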
Directory descriptor (32-bit page-directory entry that references a page table)
Bit 0, P (Present): 1 if the entry is present and valid.
Bit 1, R/W (Read/write): if 0, writes may not be allowed to the 4-MB region controlled by this entry.
Bit 2, U/S (User/supervisor): if 0, accesses with CPL 3 are not allowed to the 4-MB region controlled by this entry.
Bit 3, PWT (Page-level write-through): indirectly determines the memory type used to access the page table referenced by this entry.
Bit 4, PCD (Page-level cache disable): indirectly determines the memory type used to access the page table referenced by this entry.
Bit 5, A (Accessed): indicates whether this entry has been used for linear-address translation.
Bit 6, D: ignored in an entry that references a page table.
Bit 7, PS (Page size): if CR4.PSE = 1, must be 0 (otherwise the entry maps a 4-MB page).
Bits 8-11: ignored.
Bits 12-31, Addr: physical address of the 4-KB aligned page table referenced by this entry.
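The bit layout of this entry can be expressed as masks. A minimal sketch with hypothetical macro and function names; the permission helper checks only this entry, while the page-table entry must also allow the access.

```c
#include <stdbool.h>
#include <stdint.h>

#define PDE_P    (1u << 0)  /* present */
#define PDE_RW   (1u << 1)  /* read/write */
#define PDE_US   (1u << 2)  /* user/supervisor */
#define PDE_PWT  (1u << 3)  /* page-level write-through */
#define PDE_PCD  (1u << 4)  /* page-level cache disable */
#define PDE_A    (1u << 5)  /* accessed */
#define PDE_PS   (1u << 7)  /* page size: 0 when pointing to a page table */

/* Bits 31:12: 4-KB aligned physical address of the page table. */
static uint32_t pde_page_table_base(uint32_t pde)
{
    return pde & 0xFFFFF000u;
}

/* Does this PDE permit a CPL-3 write? (PTE check not included.) */
static bool pde_allows_user_write(uint32_t pde)
{
    return (pde & PDE_P) && (pde & PDE_RW) && (pde & PDE_US);
}
```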
MMU-Additional descriptors
Write protection: a page fault is generated if an attempt is made to write to a write-protected (read-only) page (for supervisor-mode writes, only when CR0.WP = 1). Privilege: set privilege levels on pages (kernel space versus user space). Accessed: the accessed bit can be used to identify the age of a page-table entry. Dirty: set when the page is written; used by page-swapping algorithms. Modes of the MMU: 32-bit, physical address extension (PAE), and 64-bit. Nominal page sizes: 4 KB, 2 MB, and 4 MB.
Translation Caching
Performing the full translation for every single memory transaction would be very costly, so the translations are cached in translation look-aside buffers (TLBs). A TLB is constructed as a highly associative cache: the virtual address is compared against all cache entries, and on a hit the TLB entry is used for the translation. TLB structures for Atom: instruction TLB for 4-KB pages: 32 entries, fully associative; instruction TLB for large pages: 8 entries, four-way set associative; data TLB for 4-KB pages: a 16-entry-per-thread micro-TLB (fully associative), a 64-entry DTLB (four-way set associative), and a 16-entry page-directory-entry cache (fully associative). Different processes have different translations, so cached entries must be flushed on a process switch. Letting different processes live in the same linear address space avoids this flush, at the cost of security.
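A fully associative TLB can be sketched as a small array searched on every lookup. This is a hypothetical software model for illustration: the 16-entry size and FIFO replacement are assumptions chosen for the sketch, not a description of Atom's actual replacement policy, and the flush models what happens on a process switch.

```c
#include <stdbool.h>
#include <stdint.h>

#define TLB_ENTRIES 16  /* assumed size for the sketch */

struct tlb_entry {
    bool valid;
    uint32_t vpn;  /* virtual page number: linear address >> 12 */
    uint32_t pfn;  /* physical frame number */
};

struct tlb {
    struct tlb_entry e[TLB_ENTRIES];
    unsigned next_victim;  /* FIFO replacement pointer (assumption) */
};

/* Fully associative lookup: the tag is compared against every entry. */
static bool tlb_lookup(const struct tlb *t, uint32_t linear, uint32_t *phys)
{
    uint32_t vpn = linear >> 12;
    for (unsigned i = 0; i < TLB_ENTRIES; i++) {
        if (t->e[i].valid && t->e[i].vpn == vpn) {
            *phys = (t->e[i].pfn << 12) | (linear & 0xFFFu);
            return true;   /* hit: no page walk needed */
        }
    }
    return false;          /* miss: walk the page tables, then tlb_insert */
}

static void tlb_insert(struct tlb *t, uint32_t vpn, uint32_t pfn)
{
    struct tlb_entry *slot = &t->e[t->next_victim];
    slot->valid = true;
    slot->vpn = vpn;
    slot->pfn = pfn;
    t->next_victim = (t->next_victim + 1) % TLB_ENTRIES;
}

/* New process, new translations: invalidate all entries. */
static void tlb_flush(struct tlb *t)
{
    for (unsigned i = 0; i < TLB_ENTRIES; i++)
        t->e[i].valid = false;
}
```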
Cache in Atom
Hierarchy: 32-KB eight-way set-associative L1 instruction cache; 24-KB six-way set-associative L1 data cache; 512-KB eight-way set-associative unified instruction and data L2 cache. The cache line size is 64 bytes. Cache allocation policy: read-only allocate, allocation on a write transaction. Cache coherency: MESI.
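The cache parameters determine how an address maps to a set: sets = capacity / (ways × line size). For the L1 data cache that is 24 KB / (6 × 64 B) = 64 sets; the L1 instruction cache has 32 KB / (8 × 64 B) = 64 sets, and the L2 has 512 KB / (8 × 64 B) = 1024 sets. A sketch of that arithmetic (helper names hypothetical).

```c
#include <stdint.h>

/* sets = capacity / (ways * line size) */
static unsigned cache_sets(unsigned capacity_bytes, unsigned ways, unsigned line_bytes)
{
    return capacity_bytes / (ways * line_bytes);
}

/* Set index: drop the line-offset part, then reduce modulo the set count.
   With power-of-two sizes this extracts log2(sets) address bits. */
static unsigned cache_set_index(uint32_t addr, unsigned sets, unsigned line_bytes)
{
    return (addr / line_bytes) % sets;
}
```

With 64-byte lines and 64 sets, bits 5:0 of an address select the byte in the line and bits 11:6 select the set.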
Cacheability
Strong uncacheable (UC): system memory locations are not cached. Uncacheable (UC-): same as UC, but can be overridden by programming the MTRRs for the write-combining memory type. Write combining (WC): system memory locations are not cached and coherency is not enforced; writes may be delayed and combined in the write-combining buffer. Write-through (WT): all writes are written to a cache line and through to system memory. Write-back (WB): writes are delayed; they go to the cache and are written back to memory later. Write-protected (WP): reads come from cache lines when possible; writes are propagated to the system bus and cause the corresponding cache lines on all processors on the bus to be invalidated.
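These memory types are selected through the MTRRs and the page attribute table (PAT), each with a numeric encoding. A sketch listing the encodings as a C enum with hypothetical names; UC- (0x07) is available only through the PAT, and encodings 0x02 and 0x03 are reserved.

```c
#include <stddef.h>

enum mem_type {
    MT_UC       = 0x00,  /* strong uncacheable */
    MT_WC       = 0x01,  /* write combining */
    MT_WT       = 0x04,  /* write-through */
    MT_WP       = 0x05,  /* write-protected */
    MT_WB       = 0x06,  /* write-back */
    MT_UC_MINUS = 0x07   /* uncacheable, overridable by WC in the MTRRs (PAT only) */
};

/* Short name for an encoding, or NULL for a reserved value. */
static const char *mem_type_name(unsigned encoding)
{
    switch (encoding) {
    case MT_UC:       return "UC";
    case MT_WC:       return "WC";
    case MT_WT:       return "WT";
    case MT_WP:       return "WP";
    case MT_WB:       return "WB";
    case MT_UC_MINUS: return "UC-";
    default:          return NULL;
    }
}
```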
Micro-architecture
Architecture: the contract between the platform and the software. Micro-architecture: a specific implementation that complies with the architecture, tuned to fulfill specific optimizations such as core speed or power. Knowing the micro-architecture is necessary for tuning system performance.
Atom Micro-architecture
In-order, two-wide superscalar pipeline. Legend: MS: micro-sequencer; TLB: translation look-aside buffer (translates virtual to physical addresses); ILD: instruction length decoder; AGU: address generation unit; BIU: bus interface unit.
In a nutshell
Power-efficient performance. Single-micro-op instruction execution from decode to retirement. Sixteen-stage, in-order pipeline. Dual pipelines enable decode, issue, execution, and retirement of two instructions per cycle. The second-level cache is 512 KB with 8-way associativity. Efficient hardware prefetchers to L1 and L2 (speculative loading). Two issue ports for dispatching SIMD instructions to the execution units. Single-cycle throughput for most 128-bit integer SIMD instructions. Up to six floating-point operations per cycle. Up to two 128-bit SIMD integer operations per cycle.
Atom pipeline