Accerelate Framwork

Accelerate Framework
Fast and energy efficient computation
Session 713
Geoff Belter
Engineer, Vector and Numerics Group
These are confidential sessionsplease refrain from streaming, blogging, or taking pictures
What is it?
Easy access to a lot of functionality

Accurate
Fast with low energy usage
Works on both OS X and iOS
Optimized for all of generations of hardware
What operations are available?
Image processing (vImage)

Digital signal processing (vDSP)
Transcendental math functions (vForce, vMathLib)
Linear algebra (LAPACK, BLAS)
Session goals
How Accelerate helps you

Areas of your code likely to benefit from Accelerate
How you use Accelerate
Why is it fast?
Why is it fast?
SIMD instructions
SSE, AVX and NEON
Why is it fast?
SIMD instructions
SSE, AVX and NEON
Match the micro-architecture
Instruction selection and scheduling
Software pipelining
Loop unrolling
Why is it fast?
SIMD instructions
SSE, AVX and NEON
Match the micro-architecture
Instruction selection and scheduling
Software pipelining
Loop unrolling
Multi-threaded using GCD
Tips for Successful Use of Accelerate

Prepare your data
Contiguous
16-byte aligned

Prepare your data
Contiguous
16-byte aligned
Understand problem size

Prepare your data
Contiguous
16-byte aligned
Do setup once/destroy at the end
Using Accelerate Framework

Xcode

Xcode

Xcode

Xcode

Xcode

Xcode

Xcode

Xcode

Xcode

Command L=line

Command L=line
cc -framework Accelerate main.c

vImage
Vectorized image processing library
vImage
Whats available?
vImage
Whats available?
Additions and Improvements

Improved conversion support

vImage Buffer creation utilities

Resampling of 16-bit images

Resampling of 16-bit images
Streamlined Core Graphics interoperability
Core Graphics Interoperability

How do I use vImage with my CGImageRef?

How do I use vImage with my CGImageRef?
New utility functions
vImageBuffer_InitWithCGImage
vImageCreateCGImageFromBuffer

From CGImageRef to vImageBuffer
#include <Accelerate/Accelerate.h>
// Create and prepare CGImageRef
CGImageRef inImg;
// Specify Format
vImage_CGImageFormat format = {
.bitsPerComponent = 8,
.bitsPerPixel = 32,
.colorSpace = NULL,
.bitmapInfo = kCGImageAlphaFirst,
.version = 0,
.decode = NULL,
.renderingIntent = kCGRenderingIntentDefault,
};
// Create vImageBuffer
vImage_Buffer inBuffer;
vImageBuffer_InitWithCGImage(&inBuffer, &format, NULL, inImg, kvImageNoFlags);

CGImageRef inImg;
// Specify Format
.bitsPerPixel = 32,
.colorSpace = NULL,
.version = 0,
.decode = NULL,
};

CGImageRef inImg;
// Specify Format
.bitsPerPixel = 32,
.colorSpace = NULL,
.version = 0,
.decode = NULL,
};

CGImageRef inImg;
// Specify Format
.bitsPerPixel = 32,
.colorSpace = NULL,
.version = 0,
.decode = NULL,
};

From vImageBuffer to CGImageRef
// The output buffer
vImage_Buffer outBuffer;
// Create CGImageRef
vImage_Error error;
CGImageRef outImg = vImageCreateCGImageFromBuffer(&outBuffer, &format, NULL,
! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! NULL, kvImageNoFlags, &error);

vImage_Error error;

vImage_Error error;

vImageConvert_AnyToAny
Convert between vImage_CGImageFormat types

Tips for use
Create converter once
Use many times
ul
) pr
em
int1
6
BA
(u
RG
int1
6)
30
BA
(u
ul
) pr
em
Bigger is Better
RG
loa
t
BA
(f
RG
loa
t)
ul
rem
BA
(f
RG
BA
p
RG
BA
mu
ul
rem
pre
Bp
RA
RG
BG
ARG
ARG
MPixel/s
Software JPEG Encode Performance

iPhone 5
Old method
AnyToAny
20
10
ul
) pr
em
int1
6
BA
(u
RG
int1
6)
30
BA
(u
ul
) pr
em
Bigger is Better
RG
loa
t
BA
(f
RG
loa
t)
ul
rem
BA
(f
RG
BA
p
RG
BA
mu
ul
rem
pre
Bp
RA
RG
BG
ARG
ARG
MPixel/s

iPhone 5
Old method
AnyToAny
20
10
ul
) pr
em
int1
6
BA
(u
RG
int1
6)
30
BA
(u
ul
) pr
em
Bigger is Better
RG
loa
t
BA
(f
RG
loa
t)
ul
rem
BA
(f
RG
BA
p
RG
BA
mu
ul
rem
pre
Bp
RA
RG
BG
ARG
ARG
MPixel/s

iPhone 5
Old method
AnyToAny
20
10
ul
) pr
em
int1
6
BA
(u
RG
int1
6)
30
BA
(u
ul
) pr
em
Bigger is Better
RG
loa
t
BA
(f
RG
loa
t)
ul
rem
BA
(f
RG
BA
p
RG
BA
mu
ul
rem
pre
Bp
RA
RG
BG
ARG
ARG
MPixel/s

iPhone 5
Old method
AnyToAny
20
10
Conversion Example
Scaling with a premultiplied PlanarF image
vImage_Buffer src, dst, alpha;
...
// Premultiplied data -> Non-premultiplied data, works in-place
vImageUnpremultiplyData_PlanarF(&src, &alpha, &src, kvImageNoFlags);
// Resize the image
vImageScale_PlanarF(&src, &dst, NULL, kvImageNoFlags);
// Non-premultiplied data -> Premultiplied data, works in-place
vImagePremultiplyData_PlanarF(&dst, &alpha, &dst, kvImageNoFlags);
Conversion Example
...
// Resize the image
Conversion Example
...
// Resize the image
Conversion Example
...
// Resize the image
Conversion Example
...
// Resize the image
Conversions in Context
Scaling on a premultiplied PlanarF image
Unpremultiply
Scale
Premultiply
25
50
Percentage of time taken
75
100
Conversions in Context
Scaling on a premultiplied PlanarF image
Unpremultiply
1.10%
Scale
98.26%
Premultiply
0.64%
25
50
Percentage of time taken
75
100
vImage vs. OpenCV
OpenCV
The competition
Open source computer vision library

Image processing module
Points of Comparison
Execution time
Energy consumed
vImage Speedup over OpenCV

iPhone 5
Above one is better
Histogram
Max
Box Convolve
Affine Warp
0
10
15
Speedup over OpenCV
(lanczos)
20
25
vImage Speedup over OpenCV

iPhone 5
Above one is better
Histogram
1.62
Max
23.19
Box Convolve
7.88
Affine Warp
6.41
10
15
Speedup over OpenCV
(lanczos)
20
25
Energy Consumption and Battery Life

Fast code tends to
Decrease energy consumption
Increase battery life
Typical Energy Consumption Profile

Instantaneous power
Power
p2
p1
Energy
Idle power
p0
t0
Time
t1
Typical Energy Consumption Profile
Optim
ized
Power
Instantaneous power
Unoptimized
Idle power
Time
vImage Energy Savings over OpenCV

iPhone 5
Above one is better
Histogram
Max
Box Convolve
Affine Warp
0
4
Times less energy than OpenCV
(lanczos)
vImage Energy Savings over OpenCV

iPhone 5
Above one is better
Histogram
0.75
Max
6.71
Box Convolve
4.05
Affine Warp
6.96
4
Times less energy than OpenCV
(lanczos)
Using vImage from the Accelerate

framework to dynamically pre-render my
sprites. Its the only way to make it fast. ;-)
Twitter User

vDSP
Vectorized digital signal processing library
vDSP
Whats available?
Basic operations on arrays

Add, subtract, multiply, conversion, accumulation, etc.
Discrete Fourier Transform
Convolution and correlation
New and Improved in vDSP

Multi-channel IIR filter

Multi-channel IIR filter
Improved power of 2 support
Discrete Fourier Transform (DFT)
Discrete Cosine Transform (DCT)

Same operation, two entries based on number of points

Before
256
384
FFT
DFT

Before
OS X 10.9, iOS 7
256
384
FFT
DFT
256
384
DFT
DFT Example
// Create and prepare data:
float *Ir,*Ii,*Or,*Oi;
// Once at start:
vDSP_DFT_Setup setup = vDSP_DFT_zop_CreateSetup(0, 1024, vDSP_DFT_FORWARD);
...
vDSP_DFT_Execute(setup, Ir, Ii, Or, Oi);
...
// Once at end:
vDSP_DFT_DestroySetup(setup);
DFT Example
// Once at start:
...
...
// Once at end:
DFT Example
// Once at start:
...
...
// Once at end:
DFT Example
// Once at start:
...
...
// Once at end:
DFT Example
// Once at start:
...
...
// Once at end:
vDSP vs. FFTW
FFTW
Fastest Fourier Transform in the west
One and multi-dimensional transforms

Real and complex data
Parallel
vDSP Speedup over FFTW

DFT on iPhone 5

DFT on iPhone 5
Above one is better
3
Speedup
0
240
256
320
384
480
Number of Points
512
640
768
960

DFT on iPhone 5
Above one is better
3
Speedup
0
240
256
320
384
480
Number of Points
512
640
768
960
DFT In Use
AAC Enhanced Low Delay
Used in FaceTime
DFT one of many DSP routines
Percentage of time spent in DFT
DFT In Use
DFT In Use
Percent time spent in DFT
DFT
Everything Else
FFTW
47%
54%
DFT In Use
Percent time spent in DFT
DFT
Everything Else
FFTW
vDSP
30%
47%
54%
70%
Data Types in vDSP

Single and double precision
Real and complex
Support for strided data access
Wanna do FFT on iOS?

Use the Accelerate.framework.
Highly recommended. #ios
Twitter User

Fast Math
Luke Chang
Math for Every Data Length

Libm for scalar data
float

vMathLib for SIMD vectors
float
vFloat

vMathLib for SIMD vectors
vForce for array data
float
vFloat
float []
...
Libm
Standard C math library
Libm
Standard math library in C
Collection of transcendental functions
Operates on scalar data
exp[f ]
log[f ]
sin[f ]
cos[f ]
pow[f ]
Etc
Whats New in Libm?

Extensions to C11, prefixed with __
Added in both iOS 7 and OS X 10.9
__exp10[f ]
__sinpi[f ], __cospi[f ], __tanpi[f ]
__sincos[f ], __sincospi[f ]
Power of 10
__exp10[f]
Commonly used for decibel calculations

Faster than pow(10.0, x)
More accurate than exp(log(10) * x)
exp(log(10) * 5.0) =100000.0000000002
Trigonometry in Terms of PI
__sinpi[f], __cospi[f], __tanpi[f]
cospi(x) means cos(*x)

Faster because argument reduction is simpler
More accurate when working with degrees
cos(M_PI*0.5) returns 6.123233995736766e-17
__cospi(0.5) returns exactly 0.0
Sine-Cosine Pairs
__sincos[f], __sincospi[f]
Compute sine and cosine simultaneously

Faster because argument reduction is only done once
Clang will call __sincos[f ] when possible
C11 Features
Some complex values cant be specified as literals
(0.0 + INFINITY * I)
C11 adds CMPLX macro for this purpose
CMPLX(0.0, INFINITY)
CMPLXF and CMPLXL are also available
vMathLib
SIMD vector math library
vMathLib
Collection of transcendental functions for SIMD vectors
Operates on SIMD vectors
vexp[f ]
vlog[f ]
vsin[f ]
vcos[f ]
vpow[f ]
Etc
When to Use vMathLib?

Writing your own vector algorithm
Need transcendental functions in your vector code
vMathLib Example
Taking sine of a vector
Using Libm
#include <math.h>
vFloat vx =
vFloat vy;
...
float *px =
for( i = 0;
py[i] =
}
...
{ 1.f, 2.f, 3.f, 4.f };
(float *)&vx, *py = (float *)&vy;

i < sizeof(vx)/sizeof(px[0]); ++i ) {
sinf(px[i]);
vMathLib Example
Taking sine of a vector
Using vMathLib
vFloat vx = { 1.f, 2.f, 3.f, 4.f };
vFloat vy;
...
vy = vsinf(vx);
...
vForce
Vectorized math library
vForce
Collection of transcendental functions for arrays
Operates on array data
vvexp[f ]
vvlog[f ]
vvsin[f ]
vvcos[f ]
vvpow[f ]
Etc
vForce Example
Filling a buffer with sine wave using a for loop
#include <math.h>
float buffer[length];
float indices[length];
...
for (int i = 0; i < length; i++)
{
buffer[i] = sinf(indices[i]);
}
vForce Example
Filling a buffer with sine wave using vForce
float buffer[length];
float indices[length];
...
vvsinf(buffer, indices, &length);
Better Performance
Measured on iPhone 5
Better Performance
Bigger is better
vForce
56
For loop
24
00
1
10
2
20
3
30
Sines Computed per s
4
40
5
50
6
60
Less Energy
Smaller is better
vForce
For loop
00
41
2
8
3
12
nJ consumed per sine
4
16
5
20
6
24
Less Energy
Smaller is better
vForce
14.16
For loop
23.37
00
41
2
8
3
12
nJ consumed per sine
4
16
5
20
6
24
vForce Performance
truncf
logf
expf
powf
sinf
sincosf
0
40
vForce
For loop
80
120
Results computed per s
160
200
vForce Performance
truncf
826
161
logf
116
22
expf
29
powf
26.6
9.2
sinf
120
56
24
sincosf
10.6
28.8
40
vForce
For loop
80
120
Results computed per s
160
200
vForce in Detail
Supports both float and double
Handles edge cases correctly
Requires minimal data alignment
Supports in-place operation
Improves performance even with small arrays
Consider using vForce when more than 16 elements

LAPACK and BLAS

Linear Algebra PACKage and
Basic Linear Algebra Subprograms
Geoff Belter
LAPACK Operations
High-level linear algebra
Solve linear systems
Matrix factorizations
Eigenvalues and eigenvectors
LINPACK
How fast can you solve a system of linear equations?
1000 x 1000 matrices
LAPACK
LINPACK benchmark performance in Mflops
Bigger is better
Brand A
500
1000
Mflops
1500
LAPACK
Bigger is better
Brand A
40
500
1000
Mflops
1500
LAPACK
Bigger is better
Brand A
500
1000
Mflops
1500
LAPACK
Bigger is better
Brand A
500
1000
Mflops
1500
LAPACK
Bigger is better
Brand A
788
500
1000
Mflops
1500
LAPACK
Bigger is better
Accelerate
Brand A
788
500
1000
Mflops
1500
LAPACK
Bigger is better
Accelerate
1202
Brand A
788
500
1000
Mflops
1500
LAPACK
Bigger is better
Accelerate on iPhone 4S
1202
Brand A
788
500
1000
Mflops
1500
LAPACK
Bigger is better
1202
Brand A
788
500
1000
Mflops
1500
LAPACK
Bigger is better
Accelerate on iPhone 5
Brand A
788
500
1000
Mflops
1500
2000
LAPACK
Bigger is better
Accelerate
Accelerateon
oniPhone
iPhone4S
5
Brand A
788
500
1000
1500
2000
Mflops
2500
3000
3500
4000
LAPACK
Bigger is better
Accelerate
Accelerateon
oniPhone
iPhone4S
5
3446
Brand A
788
500
1000
1500
2000
Mflops
2500
3000
3500
4000
iPad
with Retina Display
Power Mac
VS.
G5
LAPACK
iPad with Retina display vs. Power Mac G5
Triumphant return
After 10 years
All fans blazing
LAPACK
Bigger is better
Accelerate
iPad with Retina
on iPhone
Display
4S
PowerMac G5
1000
2000
Mflops
3000
4000
LAPACK
Bigger is better
Accelerate
iPad with Retina
on iPhone
Display
4S
PowerMac G5
3643
1000
2000
Mflops
3000
4000
LAPACK
Bigger is better
Accelerate
iPad with Retina
on iPhone
Display
4S
PowerMac G5
3643
1000
2000
Mflops
3000
4000
LAPACK
Bigger is better
Accelerate
iPad with Retina
on iPhone
Display
4S
3686
PowerMac G5
3643
1000
2000
Mflops
3000
4000
LAPACK Example
Solve linear system
// Create and prepare input and output data
double *A, *B;
__CLPK_integer *ipiv;
// solve
dgesv_(&n, &nrhs, A, &n, ipiv, B, &n, &info);
LAPACK Example
Solve linear system
double *A, *B;
// solve
LAPACK Example
Solve linear system
double *A, *B;
// solve
BLAS Operations
Low-level linear algebra
Vector
Dot product, scalar product, vector sum
Matrix-vector
Matrix-vector product, outer product
Matrix-matrix
Matrix multiply
BLAS Example
Matrix Multiply
// Create and prepare data
double *A, *B, *C;
// C <-- A * B
cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
100, 100, 100,
1.0, A, 100, B, 100,
0.0, C, 100);
BLAS Example
Matrix Multiply
double *A, *B, *C;
// C <-- A * B
100, 100, 100,
1.0, A, 100, B, 100,
0.0, C, 100);
BLAS Example
Matrix Multiply
double *A, *B, *C;
// C <-- A * B
100, 100, 100,
1.0, A, 100, B, 100,
0.0, C, 100);
Data Types
Single and double precision
Real and complex
Multiple data layouts
Dense, banded, triangular, etc.
Transpose, conjugate transpose
Row and column major
Playing with the Accelerate.framework

today. Having a BLASt.
Twitter User
Summary
Lots of functionality

Features and benefits
Easy access to a lot of functionality

Accurate
Fast with low energy usage
Works on both OS X and iOS
Optimized for all of generations of hardware
To be successful
Prepare your data

Contiguous
16-byte aligned
Do setup once/destroy at the end
If You Need a Feature
If You Need a Feature

Request It
Discrete Cosine Transform was my

feature request that made it into the
Accelerate Framework. I feel so special!
Twitter User
Thanks Apple for making the

Accelerate Framework.
Twitter User
More Information
Paul Danbold
Core OS Technologies Evangelist
danbold@apple.com
George Warner
DTS Sr. Support Scientist
geowar@apple.com
Documentation
vImage Programming Guide
http://developer.apple.com/library/mac/#documentation/Performance/Conceptual/vImage/
Introduction/Introduction.html
vDSP Programming Guide
http://developer.apple.com/library/mac/#documentation/Performance/Conceptual/
vDSP_Programming_Guide/Introduction/Introduction.html
Apple Developer Forums

http://devforums.apple.com
Labs
Accelerate Lab
Core OS Lab A
Thursday 4:30PM

Accerelate Framwork

Enviado por

Dados do documento

Direitos autorais

Formatos disponíveis

Compartilhar este documento

Compartilhar ou incorporar documento

Opções de compartilhamento

Você considera este documento útil?

Este conteúdo é inapropriado?

Direitos autorais:

Formatos disponíveis

Accerelate Framwork

Enviado por

Direitos autorais:

Formatos disponíveis

Accelerate Framework

Fast and energy efficient computation

Easy access to a lot of functionality

Image processing (vImage)

How Accelerate helps you

SSE, AVX and NEON

Tips for Successful Use of Accelerate

Tips for Successful Use of Accelerate

Tips for Successful Use of Accelerate

Tips for Successful Use of Accelerate

Do setup once/destroy at the end

Using Accelerate Framework

Using Accelerate Framework

Using Accelerate Framework

Using Accelerate Framework

Using Accelerate Framework

Using Accelerate Framework

Using Accelerate Framework

Using Accelerate Framework

Using Accelerate Framework

Using Accelerate Framework

Using Accelerate Framework

cc -framework Accelerate main.c

Image processing (vImage)

Additions and Improvements

Additions and Improvements

Additions and Improvements

Additions and Improvements

Additions and Improvements

Core Graphics Interoperability

Core Graphics Interoperability

Core Graphics Interoperability

Core Graphics Interoperability

Core Graphics Interoperability

Core Graphics Interoperability

Core Graphics Interoperability

Core Graphics Interoperability

Core Graphics Interoperability

Core Graphics Interoperability

Core Graphics Interoperability

Convert between vImage_CGImageFormat types

Software JPEG Encode Performance

Software JPEG Encode Performance

Software JPEG Encode Performance

Software JPEG Encode Performance

vImage vs. OpenCV

Open source computer vision library

vImage Speedup over OpenCV

Speedup over OpenCV

vImage Speedup over OpenCV

Speedup over OpenCV

Energy Consumption and Battery Life

Typical Energy Consumption Profile

Typical Energy Consumption Profile

vImage Energy Savings over OpenCV

vImage Energy Savings over OpenCV

Using vImage from the Accelerate

Image processing (vImage)

Basic operations on arrays

Convolution and correlation

New and Improved in vDSP

New and Improved in vDSP

New and Improved in vDSP

Discrete Fourier Transform (DFT)

(float )&vx, py = (float *)&vy;