Escolar Documentos
Profissional Documentos
Cultura Documentos
Session 713
Geoff Belter
Engineer, Vector and Numerics Group
These are confidential sessionsplease refrain from streaming, blogging, or taking pictures
Accelerate Framework
What is it?
Accelerate Framework
What operations are available?
Accelerate Framework
Session goals
Accelerate Framework
Why is it fast?
Accelerate Framework
Why is it fast?
SIMD instructions
Accelerate Framework
Why is it fast?
SIMD instructions
SSE, AVX and NEON
Match the micro-architecture
Instruction selection and scheduling
Software pipelining
Loop unrolling
Accelerate Framework
Why is it fast?
SIMD instructions
SSE, AVX and NEON
Match the micro-architecture
Instruction selection and scheduling
Software pipelining
Loop unrolling
Multi-threaded using GCD
Accelerate Framework
What operations are available?
vImage
Vectorized image processing library
vImage
Whats available?
vImage
Whats available?
ul
) pr
em
int1
6
BA
(u
RG
int1
6)
30
BA
(u
ul
) pr
em
Bigger is Better
RG
loa
t
BA
(f
RG
loa
t)
ul
rem
BA
(f
RG
BA
p
RG
BA
mu
ul
rem
pre
Bp
RA
RG
BG
ARG
ARG
MPixel/s
Old method
AnyToAny
20
10
ul
) pr
em
int1
6
BA
(u
RG
int1
6)
30
BA
(u
ul
) pr
em
Bigger is Better
RG
loa
t
BA
(f
RG
loa
t)
ul
rem
BA
(f
RG
BA
p
RG
BA
mu
ul
rem
pre
Bp
RA
RG
BG
ARG
ARG
MPixel/s
Old method
AnyToAny
20
10
ul
) pr
em
int1
6
BA
(u
RG
int1
6)
30
BA
(u
ul
) pr
em
Bigger is Better
RG
loa
t
BA
(f
RG
loa
t)
ul
rem
BA
(f
RG
BA
p
RG
BA
mu
ul
rem
pre
Bp
RA
RG
BG
ARG
ARG
MPixel/s
Old method
AnyToAny
20
10
ul
) pr
em
int1
6
BA
(u
RG
int1
6)
30
BA
(u
ul
) pr
em
Bigger is Better
RG
loa
t
BA
(f
RG
loa
t)
ul
rem
BA
(f
RG
BA
p
RG
BA
mu
ul
rem
pre
Bp
RA
RG
BG
ARG
ARG
MPixel/s
Old method
AnyToAny
20
10
Conversion Example
Scaling with a premultiplied PlanarF image
#include <Accelerate/Accelerate.h>
vImage_Buffer src, dst, alpha;
...
// Premultiplied data -> Non-premultiplied data, works in-place
vImageUnpremultiplyData_PlanarF(&src, &alpha, &src, kvImageNoFlags);
// Resize the image
vImageScale_PlanarF(&src, &dst, NULL, kvImageNoFlags);
// Non-premultiplied data -> Premultiplied data, works in-place
vImagePremultiplyData_PlanarF(&dst, &alpha, &dst, kvImageNoFlags);
Conversion Example
Scaling with a premultiplied PlanarF image
#include <Accelerate/Accelerate.h>
vImage_Buffer src, dst, alpha;
...
// Premultiplied data -> Non-premultiplied data, works in-place
vImageUnpremultiplyData_PlanarF(&src, &alpha, &src, kvImageNoFlags);
// Resize the image
vImageScale_PlanarF(&src, &dst, NULL, kvImageNoFlags);
// Non-premultiplied data -> Premultiplied data, works in-place
vImagePremultiplyData_PlanarF(&dst, &alpha, &dst, kvImageNoFlags);
Conversion Example
Scaling with a premultiplied PlanarF image
#include <Accelerate/Accelerate.h>
vImage_Buffer src, dst, alpha;
...
// Premultiplied data -> Non-premultiplied data, works in-place
vImageUnpremultiplyData_PlanarF(&src, &alpha, &src, kvImageNoFlags);
// Resize the image
vImageScale_PlanarF(&src, &dst, NULL, kvImageNoFlags);
// Non-premultiplied data -> Premultiplied data, works in-place
vImagePremultiplyData_PlanarF(&dst, &alpha, &dst, kvImageNoFlags);
Conversion Example
Scaling with a premultiplied PlanarF image
#include <Accelerate/Accelerate.h>
vImage_Buffer src, dst, alpha;
...
// Premultiplied data -> Non-premultiplied data, works in-place
vImageUnpremultiplyData_PlanarF(&src, &alpha, &src, kvImageNoFlags);
// Resize the image
vImageScale_PlanarF(&src, &dst, NULL, kvImageNoFlags);
// Non-premultiplied data -> Premultiplied data, works in-place
vImagePremultiplyData_PlanarF(&dst, &alpha, &dst, kvImageNoFlags);
Conversion Example
Scaling with a premultiplied PlanarF image
#include <Accelerate/Accelerate.h>
vImage_Buffer src, dst, alpha;
...
// Premultiplied data -> Non-premultiplied data, works in-place
vImageUnpremultiplyData_PlanarF(&src, &alpha, &src, kvImageNoFlags);
// Resize the image
vImageScale_PlanarF(&src, &dst, NULL, kvImageNoFlags);
// Non-premultiplied data -> Premultiplied data, works in-place
vImagePremultiplyData_PlanarF(&dst, &alpha, &dst, kvImageNoFlags);
Conversions in Context
Scaling on a premultiplied PlanarF image
Unpremultiply
Scale
Premultiply
25
50
Percentage of time taken
75
100
Conversions in Context
Scaling on a premultiplied PlanarF image
Unpremultiply
1.10%
Scale
98.26%
Premultiply
0.64%
25
50
Percentage of time taken
75
100
OpenCV
The competition
Points of Comparison
Execution time
Energy consumed
Max
Box Convolve
Affine Warp
0
10
15
(lanczos)
20
25
1.62
Max
23.19
Box Convolve
7.88
Affine Warp
6.41
10
15
(lanczos)
20
25
Power
p2
p1
Energy
Idle power
p0
t0
Time
t1
Optim
ized
Power
Instantaneous power
Unoptimized
Idle power
Time
Max
Box Convolve
Affine Warp
0
4
Times less energy than OpenCV
(lanczos)
0.75
Max
6.71
Box Convolve
4.05
Affine Warp
6.96
4
Times less energy than OpenCV
(lanczos)
Accelerate Framework
What operations are available?
vDSP
Vectorized digital signal processing library
vDSP
Whats available?
256
384
FFT
DFT
OS X 10.9, iOS 7
256
384
FFT
DFT
256
384
DFT
DFT Example
#include <Accelerate/Accelerate.h>
// Create and prepare data:
float *Ir,*Ii,*Or,*Oi;
// Once at start:
vDSP_DFT_Setup setup = vDSP_DFT_zop_CreateSetup(0, 1024, vDSP_DFT_FORWARD);
...
vDSP_DFT_Execute(setup, Ir, Ii, Or, Oi);
...
// Once at end:
vDSP_DFT_DestroySetup(setup);
DFT Example
#include <Accelerate/Accelerate.h>
// Create and prepare data:
float *Ir,*Ii,*Or,*Oi;
// Once at start:
vDSP_DFT_Setup setup = vDSP_DFT_zop_CreateSetup(0, 1024, vDSP_DFT_FORWARD);
...
vDSP_DFT_Execute(setup, Ir, Ii, Or, Oi);
...
// Once at end:
vDSP_DFT_DestroySetup(setup);
DFT Example
#include <Accelerate/Accelerate.h>
// Create and prepare data:
float *Ir,*Ii,*Or,*Oi;
// Once at start:
vDSP_DFT_Setup setup = vDSP_DFT_zop_CreateSetup(0, 1024, vDSP_DFT_FORWARD);
...
vDSP_DFT_Execute(setup, Ir, Ii, Or, Oi);
...
// Once at end:
vDSP_DFT_DestroySetup(setup);
DFT Example
#include <Accelerate/Accelerate.h>
// Create and prepare data:
float *Ir,*Ii,*Or,*Oi;
// Once at start:
vDSP_DFT_Setup setup = vDSP_DFT_zop_CreateSetup(0, 1024, vDSP_DFT_FORWARD);
...
vDSP_DFT_Execute(setup, Ir, Ii, Or, Oi);
...
// Once at end:
vDSP_DFT_DestroySetup(setup);
DFT Example
#include <Accelerate/Accelerate.h>
// Create and prepare data:
float *Ir,*Ii,*Or,*Oi;
// Once at start:
vDSP_DFT_Setup setup = vDSP_DFT_zop_CreateSetup(0, 1024, vDSP_DFT_FORWARD);
...
vDSP_DFT_Execute(setup, Ir, Ii, Or, Oi);
...
// Once at end:
vDSP_DFT_DestroySetup(setup);
FFTW
Fastest Fourier Transform in the west
Speedup
0
240
256
320
384
480
Number of Points
512
640
768
960
Speedup
0
240
256
320
384
480
Number of Points
512
640
768
960
DFT In Use
AAC Enhanced Low Delay
Used in FaceTime
DFT one of many DSP routines
Percentage of time spent in DFT
DFT In Use
AAC Enhanced Low Delay
DFT In Use
AAC Enhanced Low Delay
Percent time spent in DFT
DFT
Everything Else
FFTW
47%
54%
DFT In Use
AAC Enhanced Low Delay
Percent time spent in DFT
DFT
Everything Else
FFTW
vDSP
30%
47%
54%
70%
Accelerate Framework
What operations are available?
Fast Math
Luke Chang
Engineer, Vector and Numerics Group
float
float
vFloat
float
vFloat
float []
...
Libm
Standard C math library
Libm
Standard math library in C
Collection of transcendental functions
Operates on scalar data
exp[f ]
log[f ]
sin[f ]
cos[f ]
pow[f ]
Etc
Power of 10
__exp10[f]
Trigonometry in Terms of PI
__sinpi[f], __cospi[f], __tanpi[f]
Sine-Cosine Pairs
__sincos[f], __sincospi[f]
C11 Features
Some complex values cant be specified as literals
(0.0 + INFINITY * I)
C11 adds CMPLX macro for this purpose
CMPLX(0.0, INFINITY)
CMPLXF and CMPLXL are also available
vMathLib
SIMD vector math library
vMathLib
Collection of transcendental functions for SIMD vectors
Operates on SIMD vectors
vexp[f ]
vlog[f ]
vsin[f ]
vcos[f ]
vpow[f ]
Etc
vMathLib Example
Taking sine of a vector
Using Libm
#include <math.h>
vFloat vx =
vFloat vy;
...
float *px =
for( i = 0;
py[i] =
}
...
vMathLib Example
Taking sine of a vector
Using vMathLib
#include <Accelerate/Accelerate.h>
vFloat vx = { 1.f, 2.f, 3.f, 4.f };
vFloat vy;
...
vy = vsinf(vx);
...
vForce
Vectorized math library
vForce
Collection of transcendental functions for arrays
Operates on array data
vvexp[f ]
vvlog[f ]
vvsin[f ]
vvcos[f ]
vvpow[f ]
Etc
vForce Example
Filling a buffer with sine wave using a for loop
#include <math.h>
float buffer[length];
float indices[length];
...
for (int i = 0; i < length; i++)
{
buffer[i] = sinf(indices[i]);
}
vForce Example
Filling a buffer with sine wave using vForce
#include <Accelerate/Accelerate.h>
float buffer[length];
float indices[length];
...
vvsinf(buffer, indices, &length);
Better Performance
Measured on iPhone 5
Better Performance
Measured on iPhone 5
Bigger is better
vForce
56
For loop
24
00
1
10
2
20
3
30
Sines Computed per s
4
40
5
50
6
60
Less Energy
Measured on iPhone 5
Smaller is better
vForce
For loop
00
41
2
8
3
12
nJ consumed per sine
4
16
5
20
6
24
Less Energy
Measured on iPhone 5
Smaller is better
vForce
14.16
For loop
23.37
00
41
2
8
3
12
nJ consumed per sine
4
16
5
20
6
24
vForce Performance
Measured on iPhone 5
truncf
logf
expf
powf
sinf
sincosf
0
40
vForce
For loop
80
120
160
200
vForce Performance
Measured on iPhone 5
truncf
826
161
logf
116
22
expf
29
powf
26.6
9.2
sinf
120
56
24
sincosf
10.6
28.8
40
vForce
For loop
80
120
160
200
vForce in Detail
Supports both float and double
Handles edge cases correctly
Requires minimal data alignment
Supports in-place operation
Improves performance even with small arrays
Accelerate Framework
What operations are available?
LAPACK Operations
High-level linear algebra
Solve linear systems
Matrix factorizations
Eigenvalues and eigenvectors
LINPACK
How fast can you solve a system of linear equations?
1000 x 1000 matrices
LAPACK
LINPACK benchmark performance in Mflops
Bigger is better
Brand A
500
1000
Mflops
1500
LAPACK
LINPACK benchmark performance in Mflops
Bigger is better
Brand A
40
500
1000
Mflops
1500
LAPACK
LINPACK benchmark performance in Mflops
Bigger is better
Brand A
500
1000
Mflops
1500
LAPACK
LINPACK benchmark performance in Mflops
Bigger is better
Brand A
500
1000
Mflops
1500
LAPACK
LINPACK benchmark performance in Mflops
Bigger is better
Brand A
788
500
1000
Mflops
1500
LAPACK
LINPACK benchmark performance in Mflops
Bigger is better
Accelerate
Brand A
788
500
1000
Mflops
1500
LAPACK
LINPACK benchmark performance in Mflops
Bigger is better
Accelerate
1202
Brand A
788
500
1000
Mflops
1500
LAPACK
LINPACK benchmark performance in Mflops
Bigger is better
Accelerate on iPhone 4S
1202
Brand A
788
500
1000
Mflops
1500
LAPACK
LINPACK benchmark performance in Mflops
Bigger is better
1202
Brand A
788
500
1000
Mflops
1500
LAPACK
LINPACK benchmark performance in Mflops
Bigger is better
Accelerate on iPhone 5
Brand A
788
500
1000
Mflops
1500
2000
LAPACK
LINPACK benchmark performance in Mflops
Bigger is better
Accelerate
Accelerateon
oniPhone
iPhone4S
5
Brand A
788
500
1000
1500
2000
Mflops
2500
3000
3500
4000
LAPACK
LINPACK benchmark performance in Mflops
Bigger is better
Accelerate
Accelerateon
oniPhone
iPhone4S
5
3446
Brand A
788
500
1000
1500
2000
Mflops
2500
3000
3500
4000
iPad
Power Mac
VS.
G5
LAPACK
iPad with Retina display vs. Power Mac G5
Triumphant return
After 10 years
All fans blazing
LAPACK
LINPACK benchmark performance in Mflops
Bigger is better
Accelerate
iPad with Retina
on iPhone
Display
4S
PowerMac G5
1000
2000
Mflops
3000
4000
LAPACK
LINPACK benchmark performance in Mflops
Bigger is better
Accelerate
iPad with Retina
on iPhone
Display
4S
PowerMac G5
3643
1000
2000
Mflops
3000
4000
LAPACK
LINPACK benchmark performance in Mflops
Bigger is better
Accelerate
iPad with Retina
on iPhone
Display
4S
PowerMac G5
3643
1000
2000
Mflops
3000
4000
LAPACK
LINPACK benchmark performance in Mflops
Bigger is better
Accelerate
iPad with Retina
on iPhone
Display
4S
3686
PowerMac G5
3643
1000
2000
Mflops
3000
4000
LAPACK Example
Solve linear system
#include <Accelerate/Accelerate.h>
// Create and prepare input and output data
double *A, *B;
__CLPK_integer *ipiv;
// solve
dgesv_(&n, &nrhs, A, &n, ipiv, B, &n, &info);
LAPACK Example
Solve linear system
#include <Accelerate/Accelerate.h>
// Create and prepare input and output data
double *A, *B;
__CLPK_integer *ipiv;
// solve
dgesv_(&n, &nrhs, A, &n, ipiv, B, &n, &info);
LAPACK Example
Solve linear system
#include <Accelerate/Accelerate.h>
// Create and prepare input and output data
double *A, *B;
__CLPK_integer *ipiv;
// solve
dgesv_(&n, &nrhs, A, &n, ipiv, B, &n, &info);
BLAS Operations
Low-level linear algebra
Vector
Dot product, scalar product, vector sum
Matrix-vector
Matrix-vector product, outer product
Matrix-matrix
Matrix multiply
BLAS Example
Matrix Multiply
#include <Accelerate/Accelerate.h>
// Create and prepare data
double *A, *B, *C;
// C <-- A * B
cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
100, 100, 100,
1.0, A, 100, B, 100,
0.0, C, 100);
BLAS Example
Matrix Multiply
#include <Accelerate/Accelerate.h>
// Create and prepare data
double *A, *B, *C;
// C <-- A * B
cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
100, 100, 100,
1.0, A, 100, B, 100,
0.0, C, 100);
BLAS Example
Matrix Multiply
#include <Accelerate/Accelerate.h>
// Create and prepare data
double *A, *B, *C;
// C <-- A * B
cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
100, 100, 100,
1.0, A, 100, B, 100,
0.0, C, 100);
Data Types
Single and double precision
Real and complex
Multiple data layouts
Dense, banded, triangular, etc.
Transpose, conjugate transpose
Row and column major
Summary
Accelerate Framework
Lots of functionality
Accelerate Framework
Features and benefits
Accelerate Framework
To be successful
More Information
Paul Danbold
Core OS Technologies Evangelist
danbold@apple.com
George Warner
DTS Sr. Support Scientist
geowar@apple.com
Documentation
vImage Programming Guide
http://developer.apple.com/library/mac/#documentation/Performance/Conceptual/vImage/
Introduction/Introduction.html
vDSP Programming Guide
http://developer.apple.com/library/mac/#documentation/Performance/Conceptual/
vDSP_Programming_Guide/Introduction/Introduction.html
Labs
Accelerate Lab
Core OS Lab A
Thursday 4:30PM