Overview
Introduction to OpenCL
Design Goals of OpenCL
OpenCL Architecture
OpenCL Framework
Support for a wide diversity of applications, from embedded and mobile software through consumer applications to HPC solutions
Rapid deployment in the market: designed to run on the current and latest generations of GPU hardware
What is OpenCL:
OpenCL (Open Computing Language) is an open royalty-free standard for general purpose parallel programming across CPUs, GPUs and other processors, giving software developers portable and efficient access to the power of these heterogeneous processing platforms.
OpenCL stands for Open Computing Language. It is supported by Apple and several other vendors, and is developed by the Khronos Group, the consortium behind OpenGL. It is a cross-platform parallel computing API with a C-like language for heterogeneous computing devices.
A single OpenCL kernel will likely not achieve peak performance on all device types
OpenCL exposes CPUs, GPUs, and other accelerators as devices. Each device contains one or more compute units, i.e. cores, SMs, etc.
A kernel is the code for a work item. It is executed on OpenCL devices and is similar to a C function or a CUDA kernel. Kernels can be data-parallel or task-parallel.
The host program is executed on the host. It manages a collection of compute kernels and internal functions, analogous to a dynamic library.
An OpenCL program contains one or more kernels and any supporting routines that run on a target device.
An OpenCL kernel is the basic unit of parallel code that can be executed on a target device
OpenCL Kernels:
Kernel Execution:
The kernel body is instantiated once for each work item. An OpenCL work item is equivalent to a CUDA thread, and each work item gets a unique index.
E.g.:
    __kernel void vadd(__global const float *a,
                       __global const float *b,
                       __global float *result)
    {
        int id = get_global_id(0);
        result[id] = a[id] + b[id];
    }
The host program invokes a kernel over an index space called an NDRange. An NDRange (N-Dimensional Range) can be a 1D, 2D, or 3D space.
A single kernel instance at a point in the index space is called a work-item. Work-items have unique global IDs from the index space (CUDA: thread IDs). Work-items are further grouped into work-groups. Each work-group has a unique work-group ID (CUDA: block IDs), and each work-item has a unique local ID within its work-group.
For a 2D NDRange: total number of work-items = Gx * Gy, and size of each work-group = Sx * Sy. The global ID can be computed from the work-group ID and the local ID.
Kernels run over a global index range (the NDRange), broken up into work-groups and work-items. Work-items executing within the same work-group can synchronize with each other using barriers or memory fences. Work-items in different work-groups cannot synchronize with each other, except by launching a new kernel.
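As a sketch of within-group synchronization, the following OpenCL C kernel (the kernel name and logic are illustrative, not from the original) loads one element per work item into local memory; the barrier guarantees every item's store has completed before any item reads a neighbour's slot.

```c
/* OpenCL C kernel (device code, compiled at runtime), not host C.
   Requires the host to pass a __local buffer of get_local_size(0) floats. */
__kernel void rotate_within_group(__global const float *in,
                                  __global float *out,
                                  __local float *tile)
{
    int lid = get_local_id(0);
    int gid = get_global_id(0);
    int n   = get_local_size(0);

    tile[lid] = in[gid];
    barrier(CLK_LOCAL_MEM_FENCE);   /* synchronizes this work-group only */

    /* Safe: the neighbour's store is visible after the barrier */
    out[gid] = tile[(lid + 1) % n];
}
```

Note that the barrier has no effect across work-groups; a reduction over the whole NDRange still needs a second kernel launch.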
Static compilation: the code is compiled from source to machine code at a specific point before execution.
Dynamic compilation (also known as runtime compilation) has two steps. Step 1: the code is compiled to an Intermediate Representation (IR), which is usually the assembly of a virtual machine. Step 2: the IR is compiled to machine code for execution; this step is much shorter. In dynamic compilation, step 1 is usually done once and the IR is stored. The app loads the IR and performs step 2 during the app's runtime.
OpenCL Objects:
Setup
    Devices: GPU, CPU, Cell/B.E.
    Contexts: collection of devices
    Queues: submit work to the device
Memory
    Buffers: blocks of memory
    Images: 2D or 3D formatted images
Execution
    Programs: collections of kernels
    Kernels: argument/execution instances
Synchronization/profiling
    Events
OpenCL Framework:
The OpenCL framework allows applications to use a host and one or more OpenCL devices as a single heterogeneous parallel computer system. The framework contains the following components:
OpenCL Platform Layer: allows the host program to discover OpenCL devices and their capabilities, and to create contexts.
OpenCL Runtime: allows the host program to manipulate contexts once they have been created.
OpenCL Compiler: creates program executables that contain OpenCL kernels. The OpenCL C programming language implemented by the compiler supports a subset of the ISO C99 language with extensions for parallelism.
OpenCL Context
A context contains one or more devices. OpenCL memory objects are associated with a context, not with a specific device. clCreateBuffer() is the main data-object allocation function; it is an error if an allocation is too large for any device in the context. Each device needs its own work queue(s), and memory transfers are associated with a command queue (and thus with a specific device).
Host Code:
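A minimal host-side sketch for the vadd kernel shown earlier, assuming an OpenCL 1.x platform; error checking is omitted for brevity, and in real code every cl* call's status should be checked. It must be linked against an OpenCL runtime (e.g. -lOpenCL), so it will not run without a driver and device installed.

```c
#include <CL/cl.h>
#include <stdio.h>

#define N 1024

int main(void) {
    float a[N], b[N], result[N];
    for (int i = 0; i < N; ++i) { a[i] = (float)i; b[i] = 2.0f * i; }

    const char *src =
        "__kernel void vadd(__global const float *a,"
        "                   __global const float *b,"
        "                   __global float *result) {"
        "    int id = get_global_id(0);"
        "    result[id] = a[id] + b[id];"
        "}";

    /* Platform layer: discover a device and create a context */
    cl_platform_id platform; cl_device_id device;
    clGetPlatformIDs(1, &platform, NULL);
    clGetDeviceIDs(platform, CL_DEVICE_TYPE_DEFAULT, 1, &device, NULL);
    cl_context ctx = clCreateContext(NULL, 1, &device, NULL, NULL, NULL);
    cl_command_queue q = clCreateCommandQueue(ctx, device, 0, NULL);

    /* Runtime: buffers belong to the context, not to one device */
    cl_mem da = clCreateBuffer(ctx, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
                               sizeof a, a, NULL);
    cl_mem db = clCreateBuffer(ctx, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
                               sizeof b, b, NULL);
    cl_mem dr = clCreateBuffer(ctx, CL_MEM_WRITE_ONLY, sizeof result,
                               NULL, NULL);

    /* Compiler: build the kernel source for this device at runtime */
    cl_program prog = clCreateProgramWithSource(ctx, 1, &src, NULL, NULL);
    clBuildProgram(prog, 1, &device, NULL, NULL, NULL);
    cl_kernel k = clCreateKernel(prog, "vadd", NULL);

    clSetKernelArg(k, 0, sizeof da, &da);
    clSetKernelArg(k, 1, sizeof db, &db);
    clSetKernelArg(k, 2, sizeof dr, &dr);

    /* Launch over a 1D NDRange of N work items and read back */
    size_t global = N;
    clEnqueueNDRangeKernel(q, k, 1, NULL, &global, NULL, 0, NULL, NULL);
    clEnqueueReadBuffer(q, dr, CL_TRUE, 0, sizeof result, result,
                        0, NULL, NULL);

    printf("result[10] = %f\n", result[10]);  /* 10 + 20 = 30 */
    return 0;
}
```

Leaving the local work-group size NULL in clEnqueueNDRangeKernel lets the implementation pick one, which is the simplest portable choice for a first example.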
Future of OpenCL:
The future lies with OpenCL, as it is an open standard not restricted to one vendor or specific hardware. Another reason is that AMD is going to release a new processor called Fusion. AMD Fusion is a new approach to processor design and software development, delivering powerful CPU and GPU capabilities for HD, 3D, and data-intensive workloads in a single-die processor called an APU. APUs combine high-performance serial and parallel processing cores with other special-purpose hardware accelerators, enabling breakthroughs in visual computing, security, performance-per-watt, and device form factor. This processor is a natural fit for OpenCL, which does not care what type of processor is available, as long as it can be used.
Conclusion:
OpenCL should attract HPC programmers because it is a long-term strategy for GPUs and other accelerators. It may be a complicated language for short applications, but it is very useful for more complex ones. There are some restrictions in OpenCL, but they do not affect the language's reliability. There will be other implementations of OpenCL in higher-level languages that will be easier for ordinary programmers. In the end, you might find OpenCL very difficult, but once you master it, you will be a master of parallel computing. There are already some requests for OpenCL programmers from UK companies.
Thank You