
Code Optimization

Word-Wide Optimization
Mixing C and Assembly
To mix C and assembly, it is necessary to know the register convention used by the compiler to pass arguments. This convention is illustrated in the figure. DP, the data-page pointer, points to the beginning of the .bss section, which contains all global and static variables. SP, the stack pointer, points to local variables. The stack grows from higher memory to lower memory, as indicated in the figure. Even/odd register pairs (an even register together with the next odd register) are used when passing 40-bit or 64-bit values.
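As an illustration of the convention (the register assignments in the comments follow the description above; the function itself is ordinary, portable C, and the name dotp is only an example):

```c
/* For this three-argument function, the C6x compiler passes
 * m in A4, n in B4, and count in A6; the int return value
 * comes back in A4. A 40-bit or 64-bit argument would occupy
 * an even/odd register pair such as A5:A4. */
int dotp(const short *m, const short *n, int count)
{
    int sum = 0;                    /* local variable, reached via SP if spilled */
    for (int i = 0; i < count; i++)
        sum += m[i] * n[i];
    return sum;                     /* returned in A4 */
}
```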

Software Pipelining
Software pipelining is a technique for writing highly efficient assembly loop code on the C6x processor. Using this technique, all functional units on the processor can be fully utilized within one cycle. However, writing hand-coded software-pipelined assembly requires a fair amount of coding effort, due to the complexity and number of steps involved. In particular, for the complex algorithms encountered in many communications and signal/image processing applications, hand-coded software pipelining considerably increases coding time. The C compiler at optimization levels 2 and 3 (-o2 and -o3) performs software pipelining to some degree. Compared with linear assembly, the gain in code efficiency from hand-coded software pipelining is relatively slight.
Linear Assembly

Linear assembly is a coding scheme that allows one to write efficient code (compared with C) with less coding effort (compared with hand-coded software-pipelined assembly). The assembly optimizer is the software tool that parallelizes linear assembly code across the eight functional units. It attempts to achieve a good compromise between code efficiency and coding effort. In linear assembly code, it is not required to specify functional units, registers, or NOPs.

The directives .proc and .endproc define the beginning and end, respectively, of the linear assembly procedure. The symbolic names p_m, p_n, m, n, count, prod, and sum are defined by the .reg directive. The names p_m, p_n, and count are associated with the registers A4, B4, and A6 by using the MV (move) instruction.
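A sketch of such a procedure for the dot product, built from the directives and symbolic names just described (the instruction sequence is a plausible reconstruction, not the book's exact listing):

```asm
dotp:      .proc    A4, B4, A6            ; arguments arrive in A4, B4, A6
           .reg     p_m, p_n, m, n, count, prod, sum

           MV       A4, p_m               ; p_m <- pointer to first array
           MV       B4, p_n               ; p_n <- pointer to second array
           MV       A6, count             ; count <- number of elements
           ZERO     sum                   ; clear the accumulator

loop:      LDH      *p_m++, m             ; load a 16-bit element of m
           LDH      *p_n++, n             ; load a 16-bit element of n
           MPY      m, n, prod            ; prod = m * n
           ADD      prod, sum, sum        ; sum += prod
           SUB      count, 1, count       ; decrement loop counter
  [count]  B        loop                  ; branch while count != 0

           .return  sum                   ; result passed back in A4
           .endproc
```

Note that no functional units, physical working registers, or NOPs appear; the assembly optimizer assigns them.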

Hand-Coded Software Pipelining

First let us review the pipeline concept. As can be seen from the figure, the functional units in the non-pipelined version are not fully utilized, leading to more cycles compared with the pipelined version. There are three stages to a pipelined code, named prolog, loop kernel, and epilog. The prolog corresponds to the instructions needed to build up the loop kernel, and the epilog to the instructions needed to complete all loop iterations. Once the loop kernel is established, the entire loop iteration is done in one cycle via one parallel instruction using the maximum number of functional units. This parallelism is what reduces the number of cycles.
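The same staging can be mimicked in portable C for the dot-product example (an illustrative sketch only; on the C6x the kernel would be a single parallel instruction packet, not C statements):

```c
/* Portable C sketch of the prolog / kernel / epilog structure of a
 * software-pipelined dot product. The kernel overlaps two iterations:
 * it retires the previous product while computing the next one. */
int dotp_pipelined(const short *m, const short *n, int count)
{
    int sum = 0;
    int prod = 0;

    /* prolog: prime the pipeline with the first product */
    if (count > 0)
        prod = m[0] * n[0];

    /* kernel: accumulate iteration i-1 while computing iteration i */
    for (int i = 1; i < count; i++) {
        sum += prod;            /* retire the previous product */
        prod = m[i] * n[i];     /* multiply ahead for the next cycle */
    }

    /* epilog: drain the last product left in flight */
    if (count > 0)
        sum += prod;

    return sum;
}
```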

Three steps are needed to produce hand-coded software-pipelined code from a linear assembly loop: (a) drawing a dependency graph, (b) setting up a scheduling table, and (c) deriving the pipelined code from the scheduling table.

In a dependency graph, the nodes denote instructions and symbolic variable names. The paths show the flow of data and are annotated with the latencies of their parent nodes. To draw a dependency graph for the loop part of the dot-product code, we start by drawing nodes for the instructions and symbolic variable names.

After the basic dependency graph is drawn, a functional unit is assigned to each node or instruction. Then, a line is drawn to split the workload between the A- and B-side data paths as equally as possible. It is apparent that one load should be done on each side, so this provides a good starting point. From there, the rest of the instructions are assigned so that the workload is divided as equally as possible between the A- and B-side functional units. The dependency graph for the dot-product example is shown in the figure.

The next step in handwriting a pipelined code is to set up a scheduling table. To do so, the longest path must be identified in order to determine how long the table should be. Counting the latencies along each side, we see that the longest path is 8 cycles. This means that 7 prolog columns are required before entering the loop kernel. Thus, as shown in the table, the scheduling table consists of 15 columns (7 for prolog, 1 for the loop kernel, 7 for epilog) and eight rows (one row for each functional unit). Epilog and prolog are of the same length. Finally, the code is handwritten directly from the scheduling table.

C64x Improvements
This section shows how the additional features of the C64x DSP can be used to further optimize the dot-product example. Figure (b) shows the C64x version of the dot-product loop kernel for multiplying two 16-bit values; the equivalent C code appears in Figure (a).

As shown in Figure 7-19(a), in C this is achieved by using the intrinsic _dotp2() and by casting shorts as integers. The equivalent loop kernel code generated by the compiler is shown in Figure 7-19(b); it is a double-cycle loop containing four 16 × 16 multiplications. The instruction LDW is used to bring in the required 32-bit values.

Considering that the C64x can bring in 64-bit data values by using the double-word load instruction LDDW, the foregoing code can be further improved by performing four 16 × 16 multiplications via two DOTP2 instructions within a single-cycle loop, as shown in Figure (b). This way the throughput is improved fourfold, since four 16 × 16 multiplications are done per cycle. To do this in C, we need to cast the short data types as doubles and to specify which 32 bits of the 64-bit data a DOTP2 is supposed to operate on. This is done by using the _lo() and _hi() intrinsics to specify the lower and the upper 32 bits of the 64-bit data, respectively. Figure (a) shows the equivalent C code.
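For readers without C64x hardware, the behavior of these intrinsics can be emulated in portable C (an illustrative emulation; on the C64x itself the compiler maps the intrinsics directly to DOTP2 and register-pair accesses):

```c
#include <stdint.h>

/* Emulation of the C64x _dotp2() intrinsic: multiply the two signed
 * 16-bit halves of each 32-bit word and add the two products. */
static int32_t dotp2(uint32_t a, uint32_t b)
{
    int16_t a_lo = (int16_t)(a & 0xFFFF), a_hi = (int16_t)(a >> 16);
    int16_t b_lo = (int16_t)(b & 0xFFFF), b_hi = (int16_t)(b >> 16);
    return (int32_t)a_lo * b_lo + (int32_t)a_hi * b_hi;
}

/* Emulations of _lo() and _hi(): select the lower or upper 32 bits of
 * a 64-bit value, as used when feeding LDDW-loaded data to DOTP2. */
static uint32_t lo32(uint64_t d) { return (uint32_t)d; }
static uint32_t hi32(uint64_t d) { return (uint32_t)(d >> 32); }
```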

Circular Buffering
In many DSP algorithms, such as filtering, adaptive filtering, or spectral analysis, we need to shift data or update samples; that is, we need to deal with a moving window. The direct method of shifting data is inefficient and uses many cycles. Circular buffering is an addressing mode by which a moving-window effect can be created without the overhead associated with data shifting. In a circular buffer, if a pointer pointing to the last element of the buffer is incremented, it automatically wraps around and points back to the first element of the buffer. This provides an easy mechanism for excluding the oldest sample while including the newest, creating the moving-window effect illustrated in the figure.

Some DSPs have dedicated hardware for doing this type of addressing. On the C6x processor, the arithmetic logic unit has the circular addressing mode capability built into it. To use circular buffering, first the circular buffer sizes need to be written into the BK0 and BK1 block-size fields of the Address Mode Register (AMR), as shown in the figure. The C6x allows two independent circular buffers whose sizes are powers of 2. The buffer size is 2^(N+1) bytes, where N indicates the value written to the BK0 or BK1 block-size field.

Then, the register to be used as the circular buffer pointer needs to be specified by setting the appropriate bits of AMR to 1. For example, as shown in the figure, to use A4 as a circular buffer pointer, bit 0 or bit 1 of AMR is set to 1 (selecting block size BK0 or BK1, respectively). Of the 32 registers on the C6x, 8 can be used as circular buffer pointers: A4 through A7 and B4 through B7. Note that linear addressing is the default mode of addressing for these registers.
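A software sketch of the wraparound that the hardware provides (the mask trick works because buffer sizes are powers of 2; BSIZE and the function name are illustrative):

```c
#define BSIZE 8   /* circular buffer size; must be a power of 2 */

/* Advance a buffer index by one, wrapping from the last element back
 * to the first. The AND mask replaces a modulo because BSIZE is a
 * power of 2; the C6x performs this wrap in hardware once AMR is
 * configured, with no extra instructions in the loop. */
static unsigned circ_next(unsigned index)
{
    return (index + 1) & (BSIZE - 1);
}
```

Writing each new sample at circ_next() of the previous write position overwrites the oldest sample, giving the moving-window effect without shifting any data.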

Adaptive Filtering
Adaptive filtering is used in many applications ranging from noise
cancellation to
system identification. In most cases, the coefficients of an FIR filter are
modified
according to an error signal in order to adapt to a desired signal. In this lab, a
system
identification example is implemented wherein an adaptive FIR filter is used
to
adapt to the output of a seventh-order IIR bandpass filter. The IIR filter is
designed
in MATLAB and implemented in C. The adaptive FIR is first implemented in C
and
later in assembly using circular buffering.
In system identification, the behavior of an unknown system is modeled by accessing its input and output. An adaptive FIR filter can be used to adapt to the output of the system based on the same input. The difference between the output of the system, d[n], and the output of the adaptive filter, y[n], constitutes the error term e[n], which is used to update the coefficients of the FIR filter.

The error term calculated from the difference of the outputs of the two systems is used to update each coefficient of the FIR filter according to the least mean square (LMS) algorithm [1]:

    h_k[n + 1] = h_k[n] + 2μ e[n] x[n − k]

where the h_k's denote the unit sample response or FIR filter coefficients. The output y[n] is required to approach d[n]. The term μ indicates the step size. A small step size will ensure convergence but results in a slow adaptation rate. A large step size, though faster, may lead to skipping over the solution.
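One LMS iteration can be sketched in C as follows (a minimal sketch; the filter length N, the step size, and the function name are illustrative, and x holds the N most recent input samples with x[0] the newest):

```c
#define N 4   /* number of FIR coefficients (illustrative) */

/* One LMS iteration: compute y[n] = sum_k h[k] * x[n-k], form the
 * error e[n] = d[n] - y[n], then update each coefficient with
 * h[k] += 2 * mu * e[n] * x[n-k]. Returns the error e[n]. */
static double lms_step(double h[N], const double x[N], double d, double mu)
{
    double y = 0.0;
    for (int k = 0; k < N; k++)
        y += h[k] * x[k];              /* adaptive filter output y[n] */

    double e = d - y;                  /* error against desired output d[n] */
    for (int k = 0; k < N; k++)
        h[k] += 2.0 * mu * e * x[k];   /* LMS coefficient update */

    return e;
}
```

Calling lms_step() once per sample drives y[n] toward d[n]; with a small μ the error shrinks steadily, illustrating the convergence/adaptation-rate trade-off described above.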
