

ABSTRACT

The TigerSHARC processor is the newest and most powerful member of this family, incorporating mechanisms such as SIMD, VLIW and short-vector memory access in a single processor. This is the first time that all of these have been combined in a real-time processor.

The TigerSHARC DSP is an ultra-high-performance static superscalar architecture that is optimized for telecommunications infrastructure and other computationally demanding applications.

The unique architecture combines elements of RISC, VLIW and standard DSP processors to provide native support for 8-, 16- and 32-bit fixed-point as well as floating-point data types on a single chip. Large on-chip memory, extremely high internal and external bandwidths and dual compute blocks provide the necessary capabilities to handle a vast array of computationally demanding, large signal processing tasks.
Contents
1. INTRODUCTION

1.1. Analog and digital signals


1.2. Signal processing
1.3. Digital signal processing
1.4. Development of DSP
1.5. Digital signal processors
2. ARCHITECTURE OF DIGITAL SIGNAL PROCESSORS
2.1. Von Neumann Architecture
2.2. Harvard Architecture
2.3. Super Harvard Architecture
3. THE TIGER SHARC PROCESSOR
3.1. Features
3.2. Benefits
3.3. Description
3.4. Tiger SHARC Processor families
3.5. Functional Block Diagram
3.5.1. Architectural Features
3.5.2. Adapts to evolving signal processing demands
3.5.3. Multiprocessor, general-purpose processing
3.5.4. Instruction Parallelism and SIMD operations
3.5.5. Independent, Parallel Computation Blocks
3.5.6. CLU (Communications Logic Unit)
3.5.7. Integer ALUs
3.5.8. Tiger SHARC memory Integration
3.5.9. Program Sequencer
3.5.10. Flexible Integrated Memory
3.5.11 DMA Controller
3.5.12. Link Ports

3.5.13. External Port

4. APPLICATIONS
5. ADVANTAGES
6. CONCLUSION
7. REFERENCES
1. INTRODUCTION
1.1 Analog and digital signals
In many cases, the signal of interest is initially in the form of an analog
electrical voltage or current, produced for example by a microphone or some other
type of transducer. An analog signal must be converted into digital form before DSP
techniques can be applied. An analog electrical voltage signal, for example, can be
digitized using an electronic circuit called an analog-to-digital converter or ADC.
This generates a digital output as a stream of binary numbers whose values represent
the electrical voltage input to the device at each sampling instant.

1.2 Signal processing

Signals commonly need to be processed in a variety of ways. For example,


the output signal from a transducer may well be contaminated with unwanted
electrical "noise". The electrodes attached to a patient's chest when an ECG is taken
measure tiny electrical voltage changes due to the activity of the heart and other
muscles. The signal is often strongly affected by "mains pickup" due to electrical
interference from the mains supply. Processing the signal using a filter circuit can
remove or at least reduce the unwanted part of the signal. Increasingly nowadays, the
filtering of signals to improve signal quality or to extract important information is
done by DSP techniques rather than by analog electronics.

1.3 Digital Signal Processing

Digital signal processing (DSP) is the study of signals in a digital


representation and the processing methods of these signals. DSP and analog signal
processing are subfields of signal processing. Digital signal processing is carried out
by mathematical operations. In comparison, word processing and similar programs
merely rearrange stored data. This means that computers designed for business and
other general applications are not optimized for algorithms such as digital filtering
and Fourier analysis. Digital Signal Processors are microprocessors specifically
designed to handle Digital Signal Processing tasks. These devices have seen
tremendous growth in the last decade, finding use in everything from cellular
telephones to advanced scientific instruments. In fact, hardware engineers use "DSP"
to mean Digital Signal Processor, just as algorithm developers use "DSP" to mean
Digital Signal Processing.

1.4 Development of DSP

The development of digital signal processing dates from the 1960's with the
use of mainframe digital computers for number-crunching applications such as the
Fast Fourier Transform (FFT), which allows the frequency spectrum of a signal to be
computed rapidly. These techniques were not widely used at that time, because
suitable computing equipment was generally available only in universities and other
scientific research institutions.

1.5 Digital Signal Processors (DSPs)

DSP processors are microprocessors designed to perform digital signal


processing: the mathematical manipulation of digitally represented signals. The
introduction of the microprocessor in the late 1970's and early 1980's made it
possible for DSP techniques to be used in a much wider range of applications.
However, general-purpose microprocessors such as the Intel x86 family are not
ideally suited to the numerically-intensive requirements of DSP, and during the
1980's the increasing importance of DSP led several major electronics manufacturers
(such as Texas Instruments, Analog Devices and Motorola) to develop Digital Signal
Processor chips - specialised microprocessors with architectures designed
specifically for the types of operations required in digital signal processing. (Note
that the acronym DSP can variously mean Digital Signal Processing, the term used
for a wide range of techniques for processing signals digitally, or Digital Signal
Processor, a specialised type of microprocessor chip). Like a general-purpose
microprocessor, a DSP is a programmable device, with its own native instruction
code. DSP chips are capable of carrying out millions of floating point operations per
second, and like their better-known general-purpose cousins, faster and more
powerful versions are continually being introduced. DSPs can also be embedded
within complex "system-on-chip" devices, often containing both analog and digital
circuitry.

Advantages over other microprocessors:

• Single-cycle multiply-accumulate (MAC) operations (see the FIR sketch after this list)
• Real-time performance
• Flexibility and reliability
• Increased system performance
• Reduced cost
• Harvard architecture
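As a minimal sketch of the first point, assuming nothing about any particular toolchain, the inner loop of an FIR filter is one multiply-accumulate per coefficient; a DSP maps each iteration onto a single-cycle MAC instruction, while a general-purpose processor typically needs several instructions and memory accesses per tap. The names below are illustrative only.

    #include <stddef.h>

    /* One output sample of an N-tap FIR filter: y = sum over k of coeff[k] * x[n-k].
       Each loop iteration is one multiply-accumulate; a DSP executes it as a
       single-cycle MAC while fetching the next sample and coefficient in parallel. */
    float fir_sample(const float *delay_line, const float *coeff, size_t taps)
    {
        float acc = 0.0f;
        for (size_t k = 0; k < taps; k++) {
            acc += delay_line[k] * coeff[k];   /* the MAC operation */
        }
        return acc;
    }

On a processor with single-cycle MACs, a 100-tap filter therefore costs on the order of 100 cycles per output sample.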
2. Architecture of the Digital Signal Processor

One of the biggest bottlenecks in executing DSP algorithms is transferring


information to and from memory. This includes data, such as samples from the input
signal and the filter coefficients, as well as program instructions, the binary codes
that go into the program sequencer. For example, suppose we need to multiply two
numbers that reside somewhere in memory. To do this, we must fetch three binary
values from memory, the numbers to be multiplied, plus the program instruction
describing what to do.

2.1 Von Neumann architecture

Figure 1(a) shows how this seemingly simple task is done in a traditional microprocessor. This is often called a Von Neumann architecture, after the brilliant Hungarian-American mathematician John von Neumann (1903-1957). Von Neumann guided
the mathematics of many important discoveries of the early twentieth century. His
many achievements include: developing the concept of a stored program computer,
formalizing the mathematics of quantum mechanics, and work on the atomic bomb.

As shown in (a), a Von Neumann architecture contains a single memory and


a single bus for transferring data into and out of the central processing unit (CPU).
Multiplying two numbers requires at least three clock cycles, one to transfer each of
the three numbers over the bus from the memory to the CPU. We don't count the
time to transfer the result back to memory, because we assume that it remains in the
CPU for additional manipulation (such as the sum of products in an FIR filter). The
Von Neumann design is quite satisfactory when you are content to execute all of the
required tasks in serial. In fact, most computers today are of the Von Neumann
design. When an instruction is processed in such a processor, units of the processor
not involved at each instruction phase wait idly until control is passed on to them.
Increase in processor speed is achieved by making the individual units operate faster,
but there is a limit on how fast they can be made to operate. So we need other
architectures when very fast processing is required, and we are willing to pay the
price of increased complexity.

2.2 Harvard architecture

This leads us to the Harvard architecture, shown in (b). This is named for the
work done at Harvard University in the 1940s under the leadership of Howard Aiken
(1900-1973). As shown in this illustration, Aiken insisted on separate memories for
data and program instructions, with separate buses for each. Since the buses operate
independently, program instructions and data can be fetched at the same time,
improving the speed over the single bus design. Most present day DSPs use this dual
bus architecture.

2.3 Super Harvard Architecture(SHARC)

Figure (c) illustrates the next level of sophistication, the Super Harvard
Architecture. This term was coined by Analog Devices to describe the internal
operation of their ADSP-2106x and new ADSP-211xx families of Digital Signal
Processors. These are called SHARC® DSPs, a contraction of the longer term, Super
Harvard ARChitecture. The idea is to build upon the Harvard architecture by adding
features to improve the throughput. While the SHARC DSPs are optimized in
dozens of ways, two areas are important enough to be included in Fig. (c): an
instruction cache, and an I/O controller.
A handicap of the basic Harvard design is that the data memory bus is
busier than the program memory bus. When two numbers are multiplied, two binary
values (the numbers) must be passed over the data memory bus, while only one
binary value (the program instruction) is passed over the program memory bus. To
improve upon this situation, we start by relocating part of the "data" to program
memory. For instance, we might place the filter coefficients in program memory,
while keeping the input signal in data memory. (This relocated data is called
"secondary data" in the illustration). At first glance, this doesn't seem to help the
situation; now we must transfer one value over the data memory bus (the input signal
sample), but two values over the program memory bus (the program instruction and
the coefficient). In fact, if we were executing random instructions, this situation
would be no better at all.

However, DSP algorithms generally spend most of their execution time in


loops. This means that the same set of program instructions will continually pass
from program memory to the CPU. The Super Harvard architecture takes advantage
of this situation by including an instruction cache in the CPU. This is a small
memory that contains about 32 of the most recent program instructions. The first
time through a loop, the program instructions must be passed over the program
memory bus. This results in slower operation because of the conflict with the
coefficients that must also be fetched along this path. However, on additional
executions of the loop, the program instructions can be pulled from the instruction
cache. This means that all of the memory to CPU information transfers can be
accomplished in a single cycle: the sample from the input signal comes over the data
memory bus, the coefficient comes over the program memory bus, and the program
instruction comes from the instruction cache. In the jargon of the field, this efficient
transfer of data is called a high memory-access bandwidth.
Figure 2. presents a more detailed view of the SHARC architecture, showing
the I/O controller connected to data memory. This is how the signals enter and exit
the system. For instance, the SHARC DSPs provides both serial and parallel
communications ports. These are extremely high speed connections. For example, at
a 40 MHz clock speed, there are two serial ports that operate at 40 Mbits/second
each, while six parallel ports each provide a 40 Mbytes/second data transfer. When
all six parallel ports are used together, the data transfer rate is an incredible 240
Mbytes/second.

Just as important, dedicated hardware allows these data streams to be


transferred directly into memory (Direct Memory Access, or DMA), without having
to pass through the CPU's registers. The main buses (program memory bus and data
memory bus) are also accessible from outside the chip, providing an additional
interface to off-chip memory and peripherals. This allows the SHARC DSPs to use a
four Gigaword (16 Gbyte) memory, accessible at 40 Mwords/second (160
Mbytes/second), for 32 bit data.
This type of high speed I/O is a key characteristic of DSPs. The
overriding goal is to move the data in, perform the math, and move the data out
before the next sample is available. Everything else is secondary. Some DSPs have
on-board analog-to-digital and digital-to-analog converters, a feature called mixed
signal. However, all DSPs can interface with external converters through serial or
parallel ports.
At the top of the diagram are two blocks labeled Data Address Generator
(DAG), one for each of the two memories. These control the addresses sent to the
program and data memories, specifying where the information is to be read from or
written to. In simpler microprocessors this task is handled as an inherent part of the
program sequencer, and is quite transparent to the programmer. However, DSPs are
designed to operate with circular buffers, and benefit from the extra hardware to
manage them efficiently. This avoids needing to use precious CPU clock cycles to
keep track of how the data are stored. For instance, in the SHARC DSPs, each of the
two DAGs can control eight circular buffers. This means that each DAG holds 32
variables (4 per buffer), plus the required logic.
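The bookkeeping that each DAG performs in hardware can be modelled in software as a simple index update with wraparound; the structure and function names below are illustrative and not part of any SHARC toolchain.

    #include <stddef.h>

    /* Software model of a DAG-managed circular buffer: advance an index
       through the buffer and wrap it back to the start when it reaches the
       end. The SHARC DAG performs this wrap as part of address generation,
       so the core spends no extra cycles on it. */
    typedef struct {
        float  *base;    /* start of the buffer */
        size_t  length;  /* number of elements  */
        size_t  index;   /* current position    */
    } circ_buffer;

    void circ_write(circ_buffer *b, float value)
    {
        b->base[b->index] = value;
        b->index++;
        if (b->index == b->length)   /* wraparound, done in hardware on the DSP */
            b->index = 0;
    }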

Some DSP algorithms are best carried out in stages. For instance, IIR filters
are more stable if implemented as a cascade of biquads (a stage containing two poles
and up to two zeros). Multiple stages require multiple circular buffers for the fastest
operation. The DAGs in the SHARC DSPs are also designed to efficiently carry out
the Fast Fourier transform. In this mode, the DAGs are configured to generate bit-
reversed addresses into the circular buffers, a necessary part of the FFT algorithm. In
addition, an abundance of circular buffers greatly simplifies DSP code generation-
both for the human programmer as well as high-level language compilers, such as C.
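As a hedged sketch of what bit-reversed addressing does, the helper below mirrors the low log2(N) bits of an index, which is exactly the reordering an in-place radix-2 FFT needs; on the SHARC the DAGs generate these addresses in hardware, so this loop costs nothing at run time. The routine is illustrative, not a vendor function.

    #include <stdint.h>
    #include <stdio.h>

    /* Reverse the low 'bits' bits of 'index'; e.g. for an 8-point FFT
       (bits = 3), index 1 (binary 001) maps to 4 (binary 100). */
    uint32_t bit_reverse(uint32_t index, unsigned bits)
    {
        uint32_t reversed = 0;
        for (unsigned i = 0; i < bits; i++) {
            reversed = (reversed << 1) | (index & 1u);
            index >>= 1;
        }
        return reversed;
    }

    int main(void)
    {
        for (uint32_t i = 0; i < 8; i++)
            printf("%u -> %u\n", (unsigned)i, (unsigned)bit_reverse(i, 3));
        return 0;
    }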

The data register section of the CPU is used in the same way as in
traditional microprocessors. In the ADSP-2106x SHARC DSPs, there are 16 general
purpose registers of 40 bits each. These can hold intermediate calculations, prepare
data for the math processor, serve as a buffer for data transfer, hold flags for program
control, and so on. If needed, these registers can also be used to control loops and
counters; however, the SHARC DSPs have extra hardware registers to carry out
many of these functions.
The math processing is broken into three sections, a multiplier, an
arithmetic logic unit (ALU), and a barrel shifter. The multiplier takes the values
from two registers, multiplies them, and places the result into another register. The
ALU performs addition, subtraction, absolute value, logical operations (AND, OR,
XOR, NOT), conversion between fixed and floating point formats, and similar
functions. Elementary binary operations are carried out by the barrel shifter, such as
shifting, rotating, extracting and depositing segments, and so on. A powerful feature
of the SHARC family is that the multiplier and the ALU can be accessed in parallel.
In a single clock cycle, data from registers 0-7 can be passed to the multiplier, data
from registers 8-15 can be passed to the ALU, and the two results returned to any of
the 16 registers.

There are also many important features of the SHARC family architecture
that aren't shown in this simplified illustration. For instance, an 80 bit accumulator is
built into the multiplier to reduce the round-off error associated with multiple fixed-
point math operations. Another interesting feature is the use of shadow registers for
all the CPU's key registers. These are duplicate registers that can be switched with
their counterparts in a single clock cycle. They are used for fast context switching,
the ability to handle interrupts quickly. When an interrupt occurs in traditional
microprocessors, all the internal data must be saved before the interrupt can be
handled. This usually involves pushing all of the occupied registers onto the stack,
one at a time. In comparison, an interrupt in the SHARC family is handled by
moving the internal data into the shadow registers in a single clock cycle. When the
interrupt routine is completed, the registers are just as quickly restored. This feature
allows step 4 on our list (managing the sample-ready interrupt) to be handled very
quickly and efficiently.
The SHARC has a 32-/40-bit floating-point and fixed-point core, a DMA controller and dual-ported SRAM to move data into and out of memory without wasting core cycles, and a high-performance computation unit. Its buses allow it, in a single cycle, to fetch the next instruction, access two data values and perform DMA for an I/O device.
3. The TigerSHARC Processor

TigerSHARC processors provide the highest performance density for multiprocessing applications, with peak performance well above a billion floating-point operations per second. One-Gbyte-per-second multiprocessing link ports gluelessly connect multiple TigerSHARC processors, and versions are available with up to 24 Mbits of integrated, on-chip memory.

Keeping pace with the accelerating march of architectural innovation in DSPs, Analog Devices (ADI) unveiled its third-generation floating-point DSP, TigerSHARC.

Its architect, Jose Fridman, described a complex, high-performance VLIW-based design incorporating unusually extensive single-instruction, multiple-data (SIMD) capabilities. Unlike its predecessors, which are primarily aimed at applications demanding floating-point arithmetic, TigerSHARC has excellent fixed-point capabilities and is better described as a 16-bit fixed-point DSP with floating-point support than as a floating-point DSP.

The TigerSHARC® Processor provides leading-edge system performance


while keeping the highest possible flexibility in software and hardware development.

The TigerSHARC Processor's balanced architecture utilizes characteristics


of RISC, VLIW, and DSP to provide a flexible, "all software" approach that adds
capacity while reducing costs and bill of materials.
3.1 FEATURES

• Static superscalar architecture
• Two 32-bit MACs per cycle with 80-bit accumulation
• Eight 16-bit MACs per cycle with 40-bit accumulation
• Two 16-bit complex MACs per cycle
• Add-subtract instruction and bit reversal in hardware for FFTs (see the butterfly sketch after this list)
• 64-bit generalised bit-manipulation unit
• Two billion MACs per second at 250 MHz
• 2 billion 16-bit MACs per second
• 500 million 32-bit MACs per second
• 12 GB/s of internal memory bandwidth for data and code
• 500 MHz, 2.0 ns instruction cycle rate
• 12 Mbits of internal on-chip DRAM memory
• Dual computation blocks, each containing an ALU, a multiplier, a shifter and a register file
• Dual integer ALUs, providing data addressing and pointer manipulation
• Single-precision IEEE 32-bit and extended-precision 40-bit floating-point data formats, and 8-, 16-, 32- and 64-bit fixed-point data formats
• Integrated I/O includes a 14-channel DMA controller, external port, programmable flag pins, two timers and a timer-expired pin for system integration
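To illustrate the add-subtract feature flagged in the list above: a radix-2 decimation-in-time FFT butterfly produces the sum and the difference of the same operand pair, which is the pattern the hardware add-subtract instruction computes in one step. The C99 sketch below is illustrative and assumes single-precision complex data.

    #include <complex.h>

    /* Radix-2 DIT butterfly: (a, b) -> (a + w*b, a - w*b). The simultaneous
       add and subtract of the same operands is what the TigerSHARC
       add-subtract instruction accelerates. */
    void butterfly(float complex *a, float complex *b, float complex w)
    {
        float complex t = w * (*b);   /* complex multiply by the twiddle factor */
        *b = *a - t;                  /* difference ...                         */
        *a = *a + t;                  /* ... and sum of the same operand pair   */
    }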
3.2 Key Benefits

• Provides high-performance static superscalar DSP operations, optimized for large, demanding multiprocessor DSP applications
• Performs exceptionally well on DSP algorithm and I/O benchmarks
• Supports low-overhead DMA transfers between internal memory, external memory, memory-mapped peripherals, host processors and other DSPs
• Eases programming through an extremely flexible instruction set and a high-level-language-friendly DSP architecture
• Enables scalable multiprocessing systems with low communication overhead

3.3 DESCRIPTION

The TigerSHARC processor is an ultrahigh-performance, static superscalar processor optimized for large signal processing tasks and communications infrastructure. The DSP combines very wide memory widths with dual computation blocks, supporting floating-point (IEEE 32-bit and extended-precision 40-bit) and fixed-point (8-, 16-, 32- and 64-bit) processing, to set a new standard of performance for digital signal processors. The TigerSHARC static superscalar architecture lets the DSP execute up to four instructions each cycle, performing 24 fixed-point (16-bit) operations. Four independent 128-bit-wide internal data buses, each connecting to the six 2-Mbit memory banks, enable quad-word data, instruction and I/O accesses and provide 28 Gbytes per second of internal memory bandwidth.

Like its competitor, Texas Instruments' TMS320C64x, TigerSHARC uses a very long instruction word (VLIW) load/store architecture. TigerSHARC executes as many as four instructions per cycle with its interlocking ten-stage pipeline and dual computation blocks. Each block contains a multiplier, an ALU and a 64-bit shifter, and can perform one 32 × 32-bit or four 16 × 16-bit multiply-accumulates (MACs) per cycle.

TigerSHARC is aimed at telecommunications infrastructure applications, such as cellular telephone base stations. As illustrated in the figure, the TigerSHARC architecture contains a program control unit, two computation units, two address generators, memory, various peripherals and a DMA controller. With its VLIW architecture, TigerSHARC is capable of executing up to four instructions in a single cycle, and its SIMD features enable it to perform arithmetic operations on multiple 32-bit floating-point values or multiple 32-, 16- or 8-bit fixed-point values. Each of TigerSHARC's computation units can perform two 32 × 32 = 64-bit fixed-point multiply-accumulates in a single cycle, using two operands each made up of two concatenated 32-bit registers. Thus, using both computation units, TigerSHARC can perform four 32 × 32 = 64-bit fixed-point multiply-accumulate operations in a single cycle. Alternatively, TigerSHARC can perform two 32-bit floating-point MAC operations per cycle.
In fixed-point DSP applications, the most common word width is 16 bits. With four 16-bit fixed-point elements concatenated in two 32-bit registers, one computation unit can in a single cycle perform four 16 × 16 = 32-bit multiply-accumulate operations (with 8 guard bits each to avoid overflow), twice as many as any currently available fixed- or floating-point DSP can perform.
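A hedged model of the guard-bit arithmetic described above: each 16 × 16 multiply produces a 32-bit product, and the 8 extra accumulator bits (40 bits in total) allow at least 2^8 = 256 worst-case products to be summed before the accumulator can overflow. The sketch emulates the wider accumulator with a 64-bit integer; the names are illustrative.

    #include <stdint.h>

    /* Sum of 16x16 -> 32-bit products accumulated into a register wider than
       32 bits. With 8 guard bits (a 40-bit accumulator), at least 256
       full-scale products fit before overflow becomes possible. */
    int64_t mac16_block(const int16_t *x, const int16_t *h, int n)
    {
        int64_t acc = 0;                   /* stands in for the 40-bit accumulator */
        for (int i = 0; i < n; i++) {
            acc += (int32_t)x[i] * h[i];   /* 16x16 -> 32-bit product */
        }
        return acc;
    }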

TigerSHARC uses SIMD features at two levels: two separate computation units, each of which operates on SIMD operands. The figure illustrates how the two SIMD computation units divide the registers into different data sizes.

TigerSHARC is the first of the new wave of VLIW-based DSPs to provide extensive SIMD capabilities. This approach provides greater parallelism than that of its Texas Instruments competitors.

On-chip memory is divided into three banks: one for software and two for data. ADI will not disclose the amount of on-chip memory in the first TigerSHARC devices, but we expect that the vendor will continue to be generous with on-chip memory; the predecessor SHARC and Hammerhead devices include 68K to 512K of on-chip memory.

Data is transferred between the computation units and on-chip memory in blocks of 32, 64 or 128 bits. When moving 64-bit or 128-bit data, TigerSHARC transfers data from consecutive memory locations to consecutive data registers, or vice versa. The smallest amount of data that can be transferred is 32 bits. If TigerSHARC programs use word sizes of 8 or 16 bits in a DSP algorithm, they cannot access individual words; any load or store will transfer at least four 8-bit or two 16-bit words.

The chip includes a data alignment buffer and a short data alignment buffer that allow 64 or 128 bits of data to be transferred from (but not to) any memory location aligned on a 16-bit word boundary. TigerSHARC provides more flexibility than most processors with SIMD features, which often require that data be aligned at memory locations divisible by the size of the data transfer.

3.4 TigerSHARC Processor families


Nr.  Processor Name       Description                                                          Manufacturer
1    ADSP-TS101-S         ADSP-TS101S TigerSHARC DSP                                           Analog Devices
2    ADSP-TS101S          300 MHz TigerSHARC Processor with 6 Mbit on-chip SRAM                Analog Devices
3    ADSP-TS101SAB1-000   300 MHz TigerSHARC Processor with 6 Mbit on-chip SRAM                Analog Devices
4    ADSP-TS101SAB1-100   300 MHz TigerSHARC Processor with 6 Mbit on-chip SRAM                Analog Devices
5    ADSP-TS201S          500/600 MHz TigerSHARC Processor with 24 Mbit on-chip embedded DRAM  Analog Devices
6    ADSP-TS201SABP-050   500/600 MHz TigerSHARC Processor with 24 Mbit on-chip embedded DRAM  Analog Devices
7    ADSP-TS202S          500 MHz TigerSHARC Processor with 12 Mbit on-chip embedded DRAM      Analog Devices
8    ADSP-TS202SABP-050   500 MHz TigerSHARC Processor with 12 Mbit on-chip embedded DRAM      Analog Devices
9    ADSP-TS202SABP-X     TigerSHARC Embedded Processor                                        Analog Devices
10   ADSP-TS202SABP-X     TigerSHARC Embedded Processor                                        Analog Devices
11   ADSP-TS203S          500 MHz TigerSHARC Processor with 4 Mbit on-chip embedded DRAM       Analog Devices
12   ADSP-TS203SABP-050   500 MHz TigerSHARC Processor with 4 Mbit on-chip embedded DRAM       Analog Devices
13   ADSP-TS203SABP-X     TigerSHARC Embedded Processor                                        Analog Devices
14   ADSP-TS203SABP-X     TigerSHARC Embedded Processor                                        Analog Devices
3.5. FUNCTIONAL BLOCK DIAGRAM
3.5.1 Architectural Features
Flexibility without compromise: the TigerSHARC® Processor provides leading-edge system performance while keeping the highest possible flexibility in software and hardware development.

The TigerSHARC Processor's balanced architecture utilizes characteristics of RISC, VLIW, and DSP to provide a flexible, "all software" approach that adds capacity while reducing costs and bill of materials.

The TigerSHARC® Processor is an ultra-high-performance static superscalar DSP optimized for multiprocessing applications requiring computationally demanding, large signal processing tasks. This document describes the key features of the TigerSHARC Processor architecture that combine to offer the highest performance, flexibility, efficiency and scalability available to equipment manufacturers in the marketplace today.

3.5.2 Adapts to evolving signal processing demands

The TigerSHARC's unique ability to process 1-, 8-, 16- and 32-bit fixed-point as
well as floating-point data types on a single chip allows original equipment
manufacturers to adapt to evolving telecommunications standards without
encountering the limitations of traditional hardware approaches. Having the highest
performance DSP for communications infrastructure and multiprocessing
applications available, TigerSHARC allows wireless infrastructure manufacturers to
continue evolving their design to meet the needs of their target system, while
deploying a highly optimized and effective Node B solution that will realize
significant overall cost savings.

3.5.3 Multiprocessor, general-purpose processing


The TigerSHARC Processor's balanced architecture optimizes system cost,
power, and density. A single TigerSHARC Processor, with its large on-chip
memory, zero overhead DMA engine, large I/O throughput, and integrated
multiprocessing support, has the necessary integration to be a complete node of a
multiprocessing system.

This enables a multiprocessor network exclusively made up of


TigerSHARCs without any expensive and power consuming external memories or
logic.

3.5.4 Instruction Parallelism and SIMD Operation

As a static superscalar DSP, the TigerSHARC Processor core can execute


simultaneously from one to four 32-bit instructions encoded in a single instruction
line. With a few exceptions, an instruction line, whether it contains one, two, three or
four 32-bit instructions, executes with a throughput of one cycle in an eight-deep
processor pipeline. The TigerSHARC Processor has a set of instruction parallelism
rules that programmers must follow when encoding an instruction line. In general,
the selection of instructions the DSP can execute in parallel each cycle depends on the instruction line resources each requires and on the source and destination registers used. The programmer has direct control of the three core components: the IALUs,
the Computation Blocks, and the Program Sequencer.

In most cases the TigerSHARC Processor has a two-cycle execution pipeline


that is fully interlocked, so whenever a computation result is unavailable for another
operation dependent on it, stall cycles are automatically inserted. Efficient
programming with dependency-free instructions can eliminate most computational
and memory transfer dependencies. All of the instruction parallel rules and data
dependencies are documented in the TigerSHARC Processor User's Guide.

The TigerSHARC Processor also has the capability of supporting single-


instruction, multiple-data (SIMD) operations through the use of both Computation
Blocks in parallel as well as the use of SIMD specific computations. The
programmer has the option of directing both Computation Blocks to operate on the
same data (broadcast distribution) or different data (merged distribution). In
addition, each Computation Block can execute four 16-bit or eight 8-bit SIMD
computations in parallel.

3.5.5. Independent, Parallel Computation Blocks

As mentioned above, the TigerSHARC Processor has two Computation


Blocks that can operate either independently, in parallel or as a SIMD engine. The
DSP can issue up to two compute instructions per Computation Block per cycle,
instructing the ALU, multiplier or shifter to perform independent, simultaneous
operations. The Computation Blocks each contain four computational units (an ALU, a multiplier, a 64-bit shifter and, on the ADSP-TS201S only, a CLU) plus a 32-bit register file.

The 32-bit word, multi-ported register files are used for transferring data
between the computational units and data buses, and for storing intermediate results.
Instructions can access the registers in the register file individually (word-aligned) or
in sets of two (dual-aligned) or four (quad-aligned). The ALU performs a standard
set of arithmetic operations in both fixed-point and floating-point formats, while also
performing logic operations. The multiplier performs both fixed-point and floating-point multiplication as well as fixed-point multiply-accumulates. The 64-bit
shifter performs logical and arithmetic shifts, bit and bit-stream manipulation, and
field deposit and extraction.
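A brief software model of the field extraction and deposit operations mentioned above, written with ordinary shifts and masks; the 64-bit shifter performs the same work in a single instruction. These helpers are illustrative, not TigerSHARC intrinsics, and assume a field length between 1 and 63 bits.

    #include <stdint.h>

    /* Extract 'len' bits starting at bit position 'pos' from 'word'. */
    uint64_t field_extract(uint64_t word, unsigned pos, unsigned len)
    {
        return (word >> pos) & ((1ULL << len) - 1u);
    }

    /* Deposit the low 'len' bits of 'value' into 'word' at bit position 'pos'. */
    uint64_t field_deposit(uint64_t word, uint64_t value, unsigned pos, unsigned len)
    {
        uint64_t mask = ((1ULL << len) - 1u) << pos;
        return (word & ~mask) | ((value << pos) & mask);
    }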

3.5.6. CLU (Communications Logic Unit)

The CLU on the ADSP-TS201S is a 128-bit unit which houses enhanced


acceleration instructions specifically targeted at increasing the number of complex multiplies per cycle and improving the decoding efficiency of the TigerSHARC
device. The CLU is not available on the ADSP-TS202S and ADSP-TS203S.

3.5.7 Integer ALUs

The TigerSHARC Processor has two integer ALUs (IALUs) that provide
powerful address generation capabilities and perform many general-purpose integer
operations. Each IALU has a multi-ported 31-word register file. As address
generators, the IALUs perform immediate or indirect (pre- and post-modify)
addressing. They perform modulus and bit-reverse operations with no constraints
placed on memory addresses for data buffer placement. Each IALU can specify
either a single, dual- or quad- word access from memory.

The TigerSHARC Processor IALUs enable implementation of circular


buffers in hardware. Circular buffers facilitate efficient programming of delay lines
and other data structures required in digital signal processing, and they are
commonly used in digital filters and Fourier transforms. Each IALU provides
registers for four circular buffers, so applications can set up a total of eight circular
buffers. The IALUs handle address pointer wraparound automatically, reducing
overhead, increasing performance, and simplifying implementation.
Circular buffers can start and end at any memory location. Because the
IALU's computational pipeline is one cycle deep, in most cases integer results are
available in the next cycle. Hardware (register dependency check) causes a stall if a
result is unavailable in a given cycle.

3.5.8 TigerSHARC Memory Integration

The large on-chip memory is divided into three separate blocks of equal size. Each block is 128 bits wide, offering the quad-word structure and four addresses for every row. For data accesses, the processor can address one 32-bit word, two 32-bit words (long) or four 32-bit words (quad) and transfer them to or from a single computational unit, or to both, in a single processor cycle. The user only has to ensure that the start addresses are modulo-two or modulo-four addresses when fetching long words and quad words, respectively (see the alignment sketch below). In applications that require computing data of a delay line in which the start address of the variable does not match the modulo requirements, or in other applications that require unaligned data fetches, a data alignment buffer (DAB) is provided. Once the DAB is filled, quad-word fetches can be made from it. Besides the internal memory, the TigerSHARC can access up to four giga-words of external memory, as shown in the memory map figure.
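As a minimal sketch of the modulo constraint just described: a long-word access (two 32-bit words) must start on an 8-byte boundary and a quad-word access (four 32-bit words) on a 16-byte boundary; otherwise the data alignment buffer has to be used. The helper names are illustrative.

    #include <stdbool.h>
    #include <stdint.h>

    /* A quad-word access moves four 32-bit words, so its start address must be
       a multiple of 16 bytes (a modulo-four word address); a long-word access
       moves two 32-bit words and needs 8-byte alignment. */
    bool quad_word_aligned(const void *addr) { return ((uintptr_t)addr % 16u) == 0u; }
    bool long_word_aligned(const void *addr) { return ((uintptr_t)addr % 8u)  == 0u; }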
3.5.9 Program Sequencer

The TigerSHARC Processor Program Sequencer manages program


structure and program flow by supplying addresses to memory for instruction
fetches. Contained within the Program Sequencer, the Instruction Alignment Buffer
(IAB) caches up to five fetched instruction lines waiting to execute. The Program
Sequencer extracts an instruction line from the IAB and distributes it to the
appropriate core component for execution. Other Program Sequencer functions include determining program flow according to instructions such as JUMP, CALL, RTI and RTS; decrementing loop counters; handling hardware interrupts; and using branch prediction and a 128-entry Branch Target Buffer (BTB) to reduce branch delays for efficient execution of conditional and unconditional branch instructions.

3.5.10. Flexible Integrated Memory

The ADSP-TS20xS family has three memory variants. The ADSP-


TS201S has 24Mbits of on-chip embedded DRAM memory, divided into six blocks
of 4Mbits (128 K words X 32-bits); the ADSP-TS202S has 12Mbits of on-chip
embedded DRAM memory, divided into six blocks of 2Mbits (64 K words X 32-
bits); the ADSP-TS203S has 4Mbits of on-chip embedded DRAM memory, divided
into four blocks of 1Mbit (16 K words X 32-bits). On all variants, each block can
store program memory, data memory or both, so programmers can configure the
memory to suit their specific needs. The six memory blocks connect to the four 128-
bit wide internal buses through a crossbar connection, enabling four memory
transfers in the same cycle. The internal bus architecture of the ADSP-TS20xS
family provides a total memory bandwidth of 32 Gbytes/second, enabling the core and I/O to access twelve 32-bit data words and four 32-bit instructions per cycle.
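As a rough consistency check on the 32 Gbytes/second figure: four internal buses, each 128 bits (16 bytes) wide and able to move data every cycle at the 500 MHz core rate, give 4 × 16 bytes × 500 MHz = 32 Gbytes per second.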
3.5.11. DMA Controller

The TigerSHARC Processor on-chip DMA controller, with fourteen DMA


channels, provides zero-overhead data transfers without processor intervention. The
DMA controller operates independently and invisibly to the DSP's core, enabling
DMA operations to occur while the core continues to execute program instructions.

The DMA controller performs routine functions such as external port block
transfers, link port transfers and AutoDMA transfers as well as additional features
such as Flyby transfers, DMA chaining and Two-dimensional transfers.

3.5.12. Link Ports

The ADSP-TS201S and ADSP-TS202S have four full-duplex link ports each
providing four-bit receive and four-bit transmit I/O capability, using Low-Voltage,
Differential-Signal (LVDS) technology. With the ability to operate at a double data
rate running at 500 MHz, each link can support up to 500 Mbytes per second per
direction, for a combined maximum throughput of 4 Gbytes per second.
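As a rough check of these figures for the ADSP-TS201S and ADSP-TS202S: four data bits clocked on both edges (double data rate) at 500 MHz give 4 bits × 2 × 500 MHz = 4 Gbits/s, i.e. 500 Mbytes per second in each direction, and four full-duplex links then total 4 × 2 × 500 Mbytes/s = 4 Gbytes per second.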

The ADSP-TS203S has two full-duplex link ports each providing four-bit
receive and four-bit transmit I/O capability, using Low-Voltage, Differential-Signal
(LVDS) technology. With the ability to operate at a double data rate running at 250
MHz, each link can support up to 500 Mbytes per second per direction, for a
combined maximum throughput of 4 Gbytes per second.

Each Link Port has its own triple-buffered quad-word input and double-
buffered quad-word output registers. The DSP's core can write directly to a Link
Port's transmit register and read from a receive register, or the DMA controller can
perform DMA transfers through eight dedicated Link Port DMA channels.

3.5.13. External Port


The external port on the TigerSHARC Processor is 64 bits wide and runs at up to 125 MHz. Using the external port, up to eight TigerSHARC Processors, a host, and global memory can be connected and shared without any external logic. This is the second way, in addition to the link ports, that the TigerSHARC DSP offers support for multiprocessor systems. SDRAM and SBSRAM controllers allow for a glueless interface to these types of memories. The external port also supports a flyby mode, which allows a host to access a global shared memory.
4. Applications

At a 250 MHz clock rate, the ADSP-TS101S [TigerSHARC] offers a


DSP industry-best 1500 MFLOPS peak performance and has native support
for 8, 16, 32, and 40-bit data types. With a 1.5 watt typical power dissipation,
6 Mbits of on-chip memory, 14 channel zero-overhead DMA engine,
integrated SDRAM controller, parallel host interface, cluster multiprocessing
support, and link port multiprocessing support, the TigerSHARC is ideal for heat
sensitive multiprocessing applications.

Here are some of the target applications for floating-point DSPs:

"TigerSHARC's exceptional speed and functionality are suited for applications in:
Defense - sonar, radar, digital maps, munitions guidance
Medical - ultrasound, CT scanners, MRI, digital X-ray
Industrial systems - data acquisition, control, test, and inspection systems
Video processing - editing, printers, copiers
Wireless Infrastructure - GSM, EDGE, and 3G cellular base stations."

5. Advantages of Tiger SHARC Processor


The Analog Devices TigerSHARC® Processor architecture provides the
greatest marriage of performance and flexibility enabling the most cost effective
solution for baseband processing and other applications within the Wireless
Infrastructure market space today. Wireless Infrastructure manufacturers can
consider many approaches when developing baseband modem solutions for third
generation wireless communications networks (3G); however, the TigerSHARC
Processor architecture provides the balance of attributes required to satisfy the entire
range of challenges facing their 3G deployments.

The TigerSHARC Processor is the heart of a software defined solution


for baseband modems where all of the implementation occurs in software rather than
in hardware as is the approach taken by ASIC and other competing DSP solutions.
The TigerSHARC Processor allows for the infrastructure vendor to establish a single
baseband processing platform for all of the 3G standards with easily implemented
software changes to update functionality and speed time to market.

The very powerful architecture of the TigerSHARC, combining the best


elements of RISC and DSP cores, is highly suited to deliver the performance
required for upcoming applications in 3G mobile communications, xDSL
technologies and imaging systems. The Static Superscalar architecture maintains
determinism for security-sensitive applications and the high number of internal
registers allows the efficient use of a high-level language, speeding up the
development process of the designers.

6. Conclusion

As a result of its "Load Balancing" capabilities, high internal and


external bandwidth, large integrated memory and unmatched level of flexibility, the
TigerSHARC Processor proves to be an unconventional but extremely effective
solution for baseband signal processing. In future generations of the TigerSHARC
Processor we intend to continue the trend towards reduced systems cost and
component count while increasing the functionality of the solution through clock
speed enhancements and an expanded instruction set.

7. References

• www.analog.com/processors/tigersharc
• www.analog.com/processors/teaching Resources
• www.ener.ucalgory.co/People/Smith/ECE-ADI-PROJECT
• www.answers.com
