
IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS-I: REGULAR PAPERS, VOL. 53, NO. 7, JULY 2006, p. 1477

Very Small FPGA Application-Specific Instruction Processor for AES


Tim Good, Student Member, IEEE, and Mohammed Benaissa, Member, IEEE
Abstract: This paper presents two low-area designs for the advanced encryption standard (AES) on field-programmable gate arrays (FPGAs). Both designs are believed to be the smallest to date. The first design is an 8-bit application-specific instruction processor (ASIP), which supports key expansion (currently programmed for a 128-bit key), encipher and decipher. The design utilizes less than 60% of the resources of the smallest available Xilinx Spartan-II FPGA (XC2S15). The average encipher-decipher throughput is 2.1 Mbps when clocked at 70 MHz. The design has numerous applications where low area and low power are priorities. The second design, using the Xilinx PicoBlaze soft core, is included to provide an embedded 8-bit microcontroller comparison baseline.

Index Terms: 8 bit, advanced encryption standard (AES), application-specific instruction processor (ASIP), field-programmable gate array (FPGA), low area.

Manuscript received June 3, 2005; revised September 9, 2005 and November 24, 2005. This work was supported by the U.K. Engineering and Physical Sciences Research Council (EPSRC). This paper was presented in part at the Workshop on Cryptographic Hardware and Embedded Systems, August 29-September 1, 2005, Edinburgh, U.K. This paper was recommended by Associate Editor Z. Wang. The authors are with the Electrical Engineering Department, University of Sheffield, Sheffield S1 3JD, U.K. (e-mail: m.benaissa@sheffield.ac.uk). Digital Object Identifier 10.1109/TCSI.2006.875179

1057-7122/$20.00 © 2006 IEEE

I. INTRODUCTION

In 2000, the U.S. Government announced the result of an open competition to select a replacement cipher for the aging data encryption standard (DES). The winner was the Rijndael cipher, developed by Vincent Rijmen and Joan Daemen, which was afforded the accolade of the advanced encryption standard (AES). The AES and its implementation in both application-specific integrated circuit (ASIC) and field-programmable gate array (FPGA) technologies have been the subject of much research and continue to be a topic of interest in both academic and commercial environments.

In recent years, there has been a trend towards using FPGAs in the production versions of electronic systems. It is no longer true that FPGAs are only used for prototyping. Their inclusion in the final version would at first appear more expensive; however, the ability to update the design and the reduced time to market are strong commercial drivers. This was furthered by the introduction, by the FPGA manufacturers, of effectively mask-programmed standard-cell versions of their technologies. This has resulted in an increased demand for optimal FPGA designs.

High-throughput AES FPGA designs [1]-[3] typically unroll the loops within the AES design followed by deep pipelining of a 128-bit datapath to achieve throughputs of the order of tens of gigabits per second. Such designs have utility in application areas such as hardware accelerator cards for e-commerce servers and secure trunk communications. Although the high-throughput designs can be very efficient in terms of throughput versus area, for numerous applications they would provide a throughput many times greater than required and thus utilize both excessive area and power. This gives rise to an alternative approach of developing low-area and/or low-power designs.

There are a number of reported low-area AES FPGA designs [4]-[8]; however, the designs of Chodowiec [5] and Rouvroy [6] are believed to be the best low-area FPGA designs to date. These opted to iteratively use a reduced fixed-width 32-bit datapath and yield a throughput of the order of hundreds of megabits per second. This throughput would appear to be greatly reduced compared to the 128-bit architectures; however, the designs are significantly smaller. Typical applications include hardware accelerators for fixed network applications.

There are a number of application areas seeking even lower area designs for block ciphers such as the AES in consumer electronics, for example mobile communications, which require modest data rates of the order of 1 Mbps. However, such implementations appear to be lacking in the literature. Consideration was given to how small a datapath could be used and 8 bits was selected as a practical minimum width (and thus the candidate for lowest area). Excluding the essentially software designs which use 8-bit microcontrollers, only one design, and then for ASIC, by Feldhofer [9], has been found that explores the possibility of utilizing an 8-bit datapath. However, even this design used a dedicated 32-bit implementation for the AES MixColumns operator.

The main thrust of this paper is an ASIP capable of performing AES encipher and decipher operations using a truly 8-bit datapath. The design utilizes a novel version of the SubBytes operation based on existing composite field arithmetic [10]; however, the three required multiplications are performed using a single resource-shared multiplier, with the commensurate area saving. In addition, the MixColumns and all remaining operations are performed using a dedicated 8-bit Galois Field multiply-accumulate architecture. An iterative approach to multiplication was taken by implementing hardware support for finite-field doubling, or finite-field multiplication by two (ffm2), halving and modulo-two addition. The ASIP achieves an average encipher-decipher throughput of 2.1 Mbps and utilizes less than two thirds of the resources of the smallest Xilinx Spartan-II part (XC2S15).

To complete the design space exploration, a second design is also presented in this paper which utilizes the PicoBlaze [11] complex state machine soft core processor to provide a comparative 8-bit embedded processor design. Due to the mismatch between the conventional binary arithmetic used by small microcontrollers and the Galois Field arithmetic used in AES, the essentially software implementations of AES do not yield promising results for low-area designs. The ASIP is shown to offer a threefold speed advantage for a similar area. Both designs are believed to be the smallest FPGA implementations to date.

FPGA technology contains a fixed set of resources, typically a two-dimensional array of four-input configurable look-up tables (LUTs) together with D-type flip-flops (FDs). These resources are interconnected to form the desired circuit using configurable switch networks contained within the routing cells. For Xilinx FPGAs, the elemental resource is a slice. Thus, the area utilization of a design is usually quoted in terms of the number of slices used. The results presented are for Xilinx FPGAs; however, the optimizations made are equally applicable to other vendors' FPGAs.

The comparison between FPGA designs which incorporate ROMs and those which do not is sometimes problematic. Here, this is solved by converting the amount of block memory used into an equivalent number of slices. This yields a single area figure for any design. The throughput-area ratio is frequently used as the academic measure of design efficiency; however, there are both economic and engineering savings to be made by striving towards the lowest possible area design which meets the overall system requirements. The designs in this paper are aimed at challenging the lowest area end of the design space whilst still providing a usable throughput.

The remainder of this paper is organized as follows. Section II briefly describes the AES, followed by a description of the design in Section III. Section IV details the ASIP hardware followed by, in Section V, the corresponding software.
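The results for FPGA implementation are given in Section VI. The paper ends by drawing some conclusions in Section VII, including a graphical representation of the low-area AES design space.

II. AES

The AES has been fully documented in the freely available U.S. government publication FIPS-197 [12]. Briefly, this block cipher can perform encipher and decipher operations using the repeated operation of a Substitute Permute Network (SPN) on 128 bits of data. Each time the SPN is used it is supplied with a different RoundKey. These are generated by a function known as KeyExpansion. Three different key lengths were specified, 128, 192, and 256 bits, which in turn require 10, 12, and 14 rounds of substitution and permutation. The first and final rounds differ from the middle rounds and the overall process is summarized in Fig. 1.

Fig. 1. Round structure of the AES.

The AES specification [12] provides two alternative designs for decipherment. In this design, the one where the RoundKeys are the same as for encipher was selected. These are applied in reverse order for decipherment. Each of the component operations is described below together with its respective inverse required for decipherment.

The ShiftRows operator is essentially a defined reordering of the bytes within the current state. The 128-bit data word, $a$, applied to ShiftRows, $S_R(a)$, and InvShiftRows, $S_R^{-1}(a)$, may be considered as a 4-by-4 matrix of 8-bit values, the bytes being ordered from $a_0$ to $a_{15}$ down successive columns:

$$ a = \begin{bmatrix} a_0 & a_4 & a_8 & a_{12} \\ a_1 & a_5 & a_9 & a_{13} \\ a_2 & a_6 & a_{10} & a_{14} \\ a_3 & a_7 & a_{11} & a_{15} \end{bmatrix}, \qquad S_R(a) = \begin{bmatrix} a_0 & a_4 & a_8 & a_{12} \\ a_5 & a_9 & a_{13} & a_1 \\ a_{10} & a_{14} & a_2 & a_6 \\ a_{15} & a_3 & a_7 & a_{11} \end{bmatrix} \tag{1} $$

$$ S_R^{-1}(a) = \begin{bmatrix} a_0 & a_4 & a_8 & a_{12} \\ a_{13} & a_1 & a_5 & a_9 \\ a_{10} & a_{14} & a_2 & a_6 \\ a_7 & a_{11} & a_{15} & a_3 \end{bmatrix} \tag{2} $$

SubBytes performs sixteen multiplicative inversions in $GF(2^8)$, with irreducible polynomial $m(x) = x^8 + x^4 + x^3 + x + 1$, each inversion being followed by a specific affine transform ($A_T$). The InvSubBytes operation can be similarly defined:

$$ S_B(a_i) = A_T\!\left(a_i^{-1}\right), \qquad i = 0, \ldots, 15 \tag{3} $$

$$ S_B^{-1}(a_i) = \left(A_T^{-1}(a_i)\right)^{-1}, \qquad i = 0, \ldots, 15 \tag{4} $$

where the inverse of zero is taken to be zero and, writing the bits of a byte $b$ as $b_7 \ldots b_0$, FIPS-197 [12] specifies the affine transform as $A_T(b)_j = b_j \oplus b_{(j+4) \bmod 8} \oplus b_{(j+5) \bmod 8} \oplus b_{(j+6) \bmod 8} \oplus b_{(j+7) \bmod 8} \oplus c_j$ with $c = \{63\}$, and its inverse as $A_T^{-1}(b)_j = b_{(j+2) \bmod 8} \oplus b_{(j+5) \bmod 8} \oplus b_{(j+7) \bmod 8} \oplus d_j$ with $d = \{05\}$.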
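The definition of SubBytes as an inversion followed by an affine transform can be checked with a short software model. The following is a minimal sketch only, not the hardware described in this paper: the function names (`gf_mul`, `gf_inv`, `sub_byte`, `inv_sub_byte`) are hypothetical, the inversion is done by brute-force exponentiation rather than composite field arithmetic, and the affine constants are those of FIPS-197.

```python
def gf_mul(a: int, b: int) -> int:
    """Multiply two bytes in GF(2^8) modulo x^8 + x^4 + x^3 + x + 1."""
    result = 0
    for _ in range(8):
        if b & 1:
            result ^= a
        carry = a & 0x80
        a = (a << 1) & 0xFF
        if carry:
            a ^= 0x1B  # reduce by the AES polynomial
        b >>= 1
    return result


def gf_inv(a: int) -> int:
    """Inverse as a^254; zero maps to zero, as FIPS-197 requires."""
    result = 1
    for _ in range(254):
        result = gf_mul(result, a)
    return result


def rotl8(x: int, n: int) -> int:
    """Rotate an 8-bit value left by n bits."""
    return ((x << n) | (x >> (8 - n))) & 0xFF


def sub_byte(a: int) -> int:
    """SubBytes on one byte: inversion, then the affine transform."""
    b = gf_inv(a)
    return b ^ rotl8(b, 1) ^ rotl8(b, 2) ^ rotl8(b, 3) ^ rotl8(b, 4) ^ 0x63


def inv_sub_byte(s: int) -> int:
    """InvSubBytes on one byte: inverse affine transform, then inversion."""
    b = rotl8(s, 1) ^ rotl8(s, 3) ^ rotl8(s, 6) ^ 0x05
    return gf_inv(b)
```

Applying `sub_byte` to all 256 byte values reproduces the 256-entry S-box that a look-up table implementation would store, which is the trade-off between memory and computation discussed later in the paper.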

GOOD AND BENAISSA: VERY SMALL FPGA ASIP FOR AES


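The MixColumns operator performs a set of fixed-value multiplications on each column of the state. Treating a column as a polynomial $a(x)$ with coefficients in $GF(2^8)$, FIPS-197 [12] defines

$$ b(x) = c(x)\,a(x) \bmod (x^4 + 1), \qquad c(x) = \{03\}x^3 + \{01\}x^2 + \{01\}x + \{02\} \tag{5} $$

This may be conveniently written in matrix form for each column to give the MixColumns and InvMixColumns operations, using multiplication modulo $m(x)$ represented by the symbol $\otimes$:

$$ \begin{bmatrix} b_0 \\ b_1 \\ b_2 \\ b_3 \end{bmatrix} = M \otimes \begin{bmatrix} a_0 \\ a_1 \\ a_2 \\ a_3 \end{bmatrix}, \qquad \begin{bmatrix} a_0 \\ a_1 \\ a_2 \\ a_3 \end{bmatrix} = M^{-1} \otimes \begin{bmatrix} b_0 \\ b_1 \\ b_2 \\ b_3 \end{bmatrix} $$

where

$$ M = \begin{bmatrix} 02 & 03 & 01 & 01 \\ 01 & 02 & 03 & 01 \\ 01 & 01 & 02 & 03 \\ 03 & 01 & 01 & 02 \end{bmatrix} \quad \text{and} \quad M^{-1} = \begin{bmatrix} 0e & 0b & 0d & 09 \\ 09 & 0e & 0b & 0d \\ 0d & 09 & 0e & 0b \\ 0b & 0d & 09 & 0e \end{bmatrix} \tag{6} $$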
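The matrix form of MixColumns and InvMixColumns can be exercised directly in software. The sketch below is illustrative only: the names `gf_mul`, `MIX`, `INV_MIX`, and `mat_mul_column` are hypothetical, the matrices are the constants from FIPS-197, and a general-purpose $GF(2^8)$ multiplier is used rather than the 8-bit multiply-accumulate unit developed in this paper.

```python
def gf_mul(a: int, b: int) -> int:
    """Multiply two bytes in GF(2^8) modulo x^8 + x^4 + x^3 + x + 1."""
    result = 0
    for _ in range(8):
        if b & 1:
            result ^= a
        carry = a & 0x80
        a = (a << 1) & 0xFF
        if carry:
            a ^= 0x1B
        b >>= 1
    return result


# Fixed matrices for MixColumns and InvMixColumns (FIPS-197).
MIX = [
    [0x02, 0x03, 0x01, 0x01],
    [0x01, 0x02, 0x03, 0x01],
    [0x01, 0x01, 0x02, 0x03],
    [0x03, 0x01, 0x01, 0x02],
]
INV_MIX = [
    [0x0E, 0x0B, 0x0D, 0x09],
    [0x09, 0x0E, 0x0B, 0x0D],
    [0x0D, 0x09, 0x0E, 0x0B],
    [0x0B, 0x0D, 0x09, 0x0E],
]


def mat_mul_column(matrix, column):
    """Multiply a 4x4 constant matrix by one state column over GF(2^8)."""
    out = []
    for row in matrix:
        acc = 0
        for coeff, byte in zip(row, column):
            acc ^= gf_mul(coeff, byte)  # addition in GF(2^8) is XOR
        out.append(acc)
    return out
```

For example, `mat_mul_column(MIX, [0xDB, 0x13, 0x53, 0x45])` gives the mixed column, and applying `INV_MIX` to that result recovers the original bytes.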

The final operation, AddRoundKey, is simply the bitwise exclusive-or (XOR) of the current state and the RoundKey.

The KeyExpansion utilizes four SubBytes operations followed by addition to yield the set of RoundKeys. Unfortunately, the order of use of the RoundKeys is reversed for the decipher datapath; thus, it is necessary to compute the final RoundKey before deciphering data can proceed. The only method of doing this is to commence with the initial key and run through all the intermediate RoundKeys to reach the final (starting) value. The expansion operation also incorporates a byte-wise rotation and the addition of a round-specific constant, Rcon. These constants can be derived using ffm2. For a 128-bit key, the $i$th RoundKey is composed of four columns $k_{i,0}, \ldots, k_{i,3}$ of byte values and is defined by the following equation:

$$ k_{i,0} = k_{i-1,0} \oplus \mathrm{SubBytes}\!\left(\mathrm{Rot}(k_{i-1,3})\right) \oplus (\mathrm{Rcon}_i, 0, 0, 0)^T $$

and for $j = 1, 2, 3$

$$ k_{i,j} = k_{i-1,j} \oplus k_{i,j-1} \tag{7} $$

III. ASIP DESIGN

The first decision was to select an appropriate datapath width for the processor. As already described in the introduction, a number of the previous low-resource designs had opted for a 32-bit datapath. Examination of the AES mathematics revealed the possibility of using an 8-bit datapath, which had not been previously explored. Using fewer than 8 bits is believed to be impractical as the AES predominantly uses 8-bit Galois Field arithmetic.

The design of the ASIP was an iterative process. The design was conceptually split into three principal areas: the hardware, the instruction set and the application program. The definition of the instruction set effectively formed a design partition between the software and hardware aspects. A number of design iterations were followed. This is the classical hardware-software co-design issue. The design process was started by defining functions with conveniently small hardware implementations. This was followed by defining instructions using the hardware, then considering the use of the instructions to form the required software program functionality. Next, the software program was examined to look for areas where changed hardware would result in optimization in terms of reducing the number of instructions. At each design iteration, consideration was given to the speed and area of the overall design. The design process continued, alternating between searching for software-led and hardware-led optimizations. Area reduction was the primary consideration with throughput as a secondary concern.

From the initial stages of the design, three key issues were identified which contributed to most of the area. The first concerned the computation of SubBytes, for which existing implementations vary from look-up tables to computing the function mathematically. The second was the definition of a suitable primitive operation (namely ffm-accumulate) to efficiently perform the Galois Field mathematics in the AES MixColumns, AddRoundKey, and KeyExpansion operations. The final issue was program ROM size reduction, for which the two traditional techniques of iteration and subroutines were considered. These three issues are discussed in detail in the following sections.

A. Low-Area SubBytes

The most obvious method for implementing the SubBytes operation on an FPGA was to use a look-up table (LUT) based around a block memory (the S-box). The table for the forward and inverse transformation would require 512 bytes (4 kbits). Given the dual-port nature of Xilinx block memories, this ROM could be used for two simultaneous operations. However, for this low-area case, this would in turn require additional data memory bandwidth (and thus area) to make use of such an optimization. The design of Chodowiec and Gaj [5] is such an example.

An alternative was to use distributed memory and implement SubBytes using LUTs. This would occupy a considerable number of slices. The resulting overall slice count would not normally be described as low area, so an alternative was required.

In Rouvroy's design [6], SubBytes was combined with MixColumns to form a 32-bit T-box LUT (18 kbit). This produced superior throughput; however, it still occupied a relatively large area when the size of the LUT was taken into account. For many applications, throughputs in hundreds of megabits per second would be considered excessive. Here, an alternative, lower area, solution was required. A number of existing works [1], [10] demonstrated how SubBytes may be computed using composite field mathematics rather than a LUT.


For a composite field value $a \in GF((2^n)^2)$

$$ a = a_h x + a_l, \qquad a_h, a_l \in GF(2^n) \tag{8} $$

with a primitive trinomial

$$ p(x) = x^2 + x + \lambda \tag{9} $$

then letting $d = a_h^2 \lambda + a_h a_l + a_l^2$, we have the inverse

$$ a^{-1} = \left(a_h d^{-1}\right) x + (a_h + a_l)\, d^{-1} \tag{10} $$

where $d^{-1}$ is the multiplicative inverse of $d$ in the subfield $GF(2^n)$.

This equation only utilizes arithmetic in the subfield, including an inversion which in turn can be similarly decomposed until the base field of $GF(2)$ is reached. The composite field multiplication of two values $a$ and $b$ may be represented in terms of subfield arithmetic as

$$ c = a\,b = c_h x + c_l \tag{11} $$

where

$$ c_h = a_h b_h + a_h b_l + a_l b_h, \qquad c_l = a_h b_h \lambda + a_l b_l \tag{12} $$

A further optimization can be made by describing this multiplication in Mastrovito's form [13] as

$$ \begin{bmatrix} c_h \\ c_l \end{bmatrix} = \begin{bmatrix} a_h + a_l & a_h \\ a_h \lambda & a_l \end{bmatrix} \begin{bmatrix} b_h \\ b_l \end{bmatrix} \tag{13} $$

In order to perform an equivalent inversion in composite field arithmetic, additional isomorphic transformations are required. These can be found using the method described in Paar [14]. The composite field theory was applied a number of times to construct a set of fields starting with the base field $GF(2)$ and building up successively to reach $GF(((2^2)^2)^2)$. Each stage has its own primitive trinomial and binary value format. Table I summarizes the field construction.

TABLE I COMPOSITE FIELD CONSTRUCTION

As a starting point for further optimization, the existing SubBytes design of Satoh [10] was synthesized using Xilinx ISE 6.3 and the area result obtained. The combined SubBytes-InvSubBytes operation occupied 58 slices on a Xilinx Spartan-II XC2S15 device. However, Satoh's design was aimed at attaining a high throughput rather than low area. An alternative design may be derived at the expense of throughput to further reduce the area cost.

The objective was to perform the multiplicative inverse of the supplied value in $GF(((2^2)^2)^2)$ over a number of cycles, sharing the $GF((2^2)^2)$ composite field multiplier. Here, the input byte $a$ is split into two 4-bit nibbles, $a_h$ and $a_l$. The inversion is then given by the following equation:

$$ b = a^{-1} = (a_h d)\, x + (a_h + a_l)\, d \tag{14} $$

where

$$ d = \left(a_h^2 \lambda + (a_h + a_l)\, a_l\right)^{-1} $$

Representing multiplication as $\otimes$ and addition as $\oplus$, the equation may be rewritten and separated into three stages to give

$$ d = \left(\lambda a_h^2 \oplus \left((a_h \oplus a_l) \otimes a_l\right)\right)^{-1}, \qquad b_h = a_h \otimes d, \qquad b_l = (a_h \oplus a_l) \otimes d \tag{15} $$

Each of these equations only contains one general multiplication in $GF((2^2)^2)$ (squaring and scaling by the constant $\lambda$ are linear operations), which enables the sharing of this relatively large resource. A circuit was developed where, in the first cycle, $d$ was calculated, followed in subsequent cycles by $b_h$ and $b_l$, respectively. The resulting inversion is the combination of the two 4-bit nibbles $b_h$ and $b_l$. Fig. 2 shows the circuit for this novel design.

Fig. 2. Block diagram of multiplicative inverse in $GF(((2^2)^2)^2)$.

However, the transformation matrices (isomorphisms) from $GF(2^8)$ to $GF(((2^2)^2)^2)$ are also needed. The SubBytes operation requires an affine transformation as part of its process. The isomorphism and affine transformation matrices can be combined, but here the affine transformation and its inverse conveniently pack together with the required MUXes in terms of LUT utilization within the FPGA slice, and thus were not combined with the isomorphisms (the forward mapping and its inverse).

The computational path of SubBytes was relatively long and this would dominate the cycle time of the entire processor. As the SubBytes operation was not the dominant operation in terms of quantity (as a fraction of the total instructions needed to perform the AES), this would have unduly limited the performance. Thus, SubBytes was split further into a total of five cycles to remove it from the critical path (Fig. 3).
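The three-stage inversion with a single shared multiplier can be illustrated with a small software model. This is a sketch under stated assumptions rather than the circuit of Fig. 2: the subfield is modelled as $GF(2^4)$ with polynomial $x^4 + x + 1$ instead of the tower $GF((2^2)^2)$, the constant `LAMBDA` is found by search (any value making $x^2 + x + \lambda$ irreducible is sufficient for the inversion formula), no isomorphism to the AES field is applied, and the names `mul4`, `inv4`, `mul8`, `inv8`, `ah`, and `al` are hypothetical.

```python
def mul4(a: int, b: int) -> int:
    """Multiply two nibbles in GF(2^4) modulo x^4 + x + 1 (assumed subfield)."""
    r = 0
    for _ in range(4):
        if b & 1:
            r ^= a
        b >>= 1
        a <<= 1
        if a & 0x10:
            a ^= 0x13
    return r


def inv4(a: int) -> int:
    """Subfield inverse as a^14; zero maps to zero."""
    r = 1
    for _ in range(14):
        r = mul4(r, a)
    return r


# Smallest constant for which x^2 + x + LAMBDA has no root in GF(2^4).
LAMBDA = next(lam for lam in range(1, 16)
              if all(mul4(t, t) ^ t ^ lam for t in range(16)))


def mul8(a: int, b: int) -> int:
    """Composite multiplication expressed in subfield arithmetic."""
    ah, al, bh, bl = a >> 4, a & 0xF, b >> 4, b & 0xF
    hh = mul4(ah, bh)
    ch = hh ^ mul4(ah, bl) ^ mul4(al, bh)
    cl = mul4(hh, LAMBDA) ^ mul4(al, bl)
    return (ch << 4) | cl


def inv8(a: int) -> int:
    """Composite inversion in three stages, one general multiply per stage."""
    ah, al = a >> 4, a & 0xF
    # Stage 1: squaring and scaling by LAMBDA are linear in hardware; they
    # are modelled with mul4 here only for brevity.
    d = inv4(mul4(mul4(ah, ah), LAMBDA) ^ mul4(ah ^ al, al))
    bh = mul4(ah, d)        # Stage 2: high nibble of the inverse
    bl = mul4(ah ^ al, d)   # Stage 3: low nibble of the inverse
    return (bh << 4) | bl
```

Multiplying every nonzero byte by its computed inverse with `mul8` should give the identity element `0x01`, which is a convenient self-check for any choice of subfield polynomial and constant.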


Fig. 3. Block diagram of new SubBytes circuit.

This approach reduced the total forward and inverse SubBytes circuit to 42 slices on an XC2S15, a reduction in size of 27% compared to the original high-throughput version [10].

B. 8-Bit ffm-Accumulate

The AES MixColumns operator is fundamentally a 32-bit operator and there have been a number of designs [5], [6] based around a 32-bit datapath. Only one design [9], for ASIC, was found which reported using an 8-bit datapath. However, the design married a 32-bit MixColumns to the 8-bit datapath by successively loading three 8-bit input registers in sequence to form the required 32-bit word, with a similar process at the output. Here, a truly 8-bit alternative is sought, with the corresponding area saving.

Examining the AES algorithm, a set of primitive operations was determined which covers the remaining operations of ShiftRows, MixColumns, and KeyExpansion. These were found to be ffm2 and XOR. For this design, the decipher function was also required and, as it is undesirable to store the entire set of RoundKeys, a further operation of finite-field halving, or finite-field division by two (ffd2), was needed for reverse KeyExpansion. The ShiftRows operator was implemented as a set of 8-bit data moves between memory locations. Hardware implementation of the ffm2 and halving is described by the following equations, where $a_7 \ldots a_0$ are the bits of the input byte:

$$ \begin{aligned} \mathrm{ffm2}(a) &= \{a_6,\; a_5,\; a_4,\; a_3 \oplus a_7,\; a_2 \oplus a_7,\; a_1,\; a_0 \oplus a_7,\; a_7\} \\ \mathrm{ffd2}(a) &= \{a_0,\; a_7,\; a_6,\; a_5,\; a_4 \oplus a_0,\; a_3 \oplus a_0,\; a_2,\; a_1 \oplus a_0\} \end{aligned} \tag{16} $$

There are numerous examples in the MixColumns and KeyExpansion calculations where the result of an 8-bit operation was further acted upon. This was either in terms of repeated finite-field addition or repeated ffm2s. Thus, the inclusion in the datapath of an accumulator reduced the demands placed on the data memory. These requirements led to the development of a multiply-accumulate architecture capable of supporting moving 8-bit data, 8-bit finite-field addition (XOR), and multiplication and division by two in $GF(2^8)$. An execution unit specific to this type of operation was developed and its circuit is presented in Fig. 4. This circuit can have numerous possible operations depending on the control line settings for the accumulator register and multiplexers. In order to reduce the complexity of the instruction decoder, only the minimal required set of these was included in the instruction set. Further reduction in control overhead was made by the creation of a special register containing flags which control the rarely changed overall mode of operation. This yielded a further saving on instruction decoder complexity.

Fig. 4. Circuit diagram for multiply-accumulate functions.

C. Program ROM Size Reduction

One of the critical design decisions was which looping constructs, if any, were to be supported. A very simple processor could be constructed which only permitted execution of linear code. However, once the cost of the large program ROM size was balanced against the area and performance penalties for implementing even the most limited forms of iteration, linear code was no longer a viable option.

Initial estimation of the program ROM size showed it would take a sequence of several thousand instructions to carry out the entire AES cipher process. If a linear code approach was used, this would require several kilobytes of ROM, which would dominate the area of the design. The standard techniques for reducing the size of a program are iteration and subroutines. However, both techniques require specialist support from the processor hardware; thus, their inclusion would increase the area cost and complexity of the processor.

General-purpose microprocessors frequently use stacks to implement the support for storing local variables and subroutine return addresses. However, a simpler, more limited version can be produced where only a single level of subroutines is supported. Given the relatively low complexity of the AES process, a single level of subroutines was found to be sufficient and yielded about a factor of three reduction in ROM size.

The AES operation is fundamentally round-based (iterative); thus, support for some form of iteration would result in a significant saving in terms of the number of instructions. Iteration alone would be of little utility without some form of indexed addressing, so the two are considered together. One level of iteration would result in approximately a factor of 10 reduction in ROM size. Two levels resulted in approximately a factor of 40 saving.

The final ASIP hardware provided support for one level of subroutines and two levels of iteration, with one of the loop counters being used to conditionally provide indexed addressing. This enabled programming of the entire AES cipher process using only a few hundred instructions from an instruction set consisting of only 15 instructions.

IV. ASIP HARDWARE

The traditional microcontroller architecture was adopted with separate program and data memories (i.e., Harvard architecture). Two levels of looping were supported using two dedicated four-bit counters, X and Y. The loading of these was performed using the LDLOOP instruction, and a single instruction, DJNZ, decrements a specified counter and performs a conditional jump if the value is nonzero.
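The effect of the two loop counters on program size can be sketched with a toy interpreter. This is a hedged illustration, not the processor's actual instruction encoding: the tuple-based program format, the `WORK` placeholder instruction, and the function name `run` are all hypothetical; only the LDLOOP and DJNZ behaviour (load a four-bit counter; decrement and jump if nonzero) is taken from the text.

```python
def run(program):
    """Execute a toy program and count how often the loop body runs."""
    counters = {"X": 0, "Y": 0}   # two dedicated four-bit loop counters
    pc = 0
    work_done = 0
    while pc < len(program):
        op, *args = program[pc]
        pc += 1
        if op == "LDLOOP":
            name, value = args
            counters[name] = value & 0xF
        elif op == "DJNZ":
            name, target = args
            counters[name] = (counters[name] - 1) & 0xF
            if counters[name] != 0:
                pc = target       # conditional jump back to the loop start
        elif op == "WORK":
            work_done += 1        # stands in for one useful instruction
    return work_done


# Two nested loops: 10 outer iterations of 4 inner iterations each.
PROGRAM = [
    ("LDLOOP", "X", 10),  # address 0
    ("LDLOOP", "Y", 4),   # address 1
    ("WORK",),            # address 2: loop body
    ("DJNZ", "Y", 2),     # address 3: inner loop
    ("DJNZ", "X", 1),     # address 4: outer loop
]
```

In this toy case five stored instructions stand in for forty executions of the body, which is the kind of ROM saving that motivates supporting iteration in hardware; the factors quoted in the text apply to the real AES program, not to this example.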


It was decided that a single four-bit index, conditionally applied to source and destination RAM addresses and associated with a loop counter (Y), was optimal. This addressing was enabled by the X flag and only operates on the lower 32 addresses (i.e., those associated with key and data and not the temporary storage). The value of the index was added modulo 16 or, if the R flag was set, negated prior to the addition.

The use of a single level of subroutines was supported using a dedicated return address register associated with the program counter. The JSR instruction calls a subroutine and the RETN instruction resumes execution at the instruction immediately after the previous JSR.

The data memory was required to store both key and data, each 16 bytes long (128 bits), together with temporary storage for 13 bytes of intermediate values, totaling 45 bytes. The RAM was implemented as a register file using a block memory but may be implemented in a distributed form, using LUTs, expending an extra 42 slices and saving one block memory.

The AES application program consisted of 255 instructions. An address word size of 8 bits was selected, capable of addressing a maximum of 256 instructions (each 16 bits long). At reset, program execution starts at one of two fixed addresses determined by the state of the external encode line. The program ROM was implemented using a block memory. Xilinx block memories use a precharge technique and so required clocking. This introduced an unwanted delay which was mitigated using pipelining to maintain execution of a new instruction each cycle.

The processor supported two dedicated buses: one for data and key input and one for data output. Data transfer to and from the processor's data memory was performed a byte at a time using the INP and OUTP instructions. Additionally, two flags, I and O, were used to generate handshaking signals for the message blocks.

The implementation of the SubBytes operation used a dedicated execution unit.
This was controlled by two instructions SBOX-S which was used to start the process and SBOX-C which must be repeated four times to continue the calculation. In the last of these instructions, all of which must be consecutive, the required destination address was set. For programming convenience this was implemented as an assembler macro instruction SBOX. Selection between SubBytes and InvSubBytes was made using the E ag. The second of the two principal execution units was used to perform multiply-accumulate operations using nite-eld arithmetic. This was a problematic area of the design and a balance between throughput and area had to be made. A number of different options were explored which nally yielded the architecture previously described in Fig. 4. There were numerous instructions which could be performed using the architecture however a minimal subset was selected. There are four instructions used to control the multiply accumulate execution unit. The rst two, ffmsrc and ffmacc, perform either ffm2 (E ag set) or ffd2 (E ag clear). The ffmsrc instruction obtains the value to be scaled from the source memory location whereas the ffmacc instruction uses the value currently stored in the units accumulator. In both cases, the accumulator is loaded with the scaled value. The nal two instructions XORsrc and XORacc perform the XOR of the source address contents with the accumulator and write the

TABLE II PROCESSOR INSTRUCTION SET

result to the destination memory location. Additionally, XORsrc updates the accumulator with the result. Some key instructions would at rst appear to be missing from the list, however, can be dened using convenient assembler macros which translate to use this terse instruction set. A HALT instruction was formed using a JSR instruction with the current PC value as the operand. The two counters, X and Y, may be decremented using the DJNZ instruction with a jump address equal to PC+1. A MOVE instruction to copy data from one memory location to another was formed by rst ensuring the Z ag was set then using the XORSRC instruction. The complete instruction set for the processor is summarized in Table II. Fig. 5 shows the architecture of the processor. It should be noted that due to the clocking requirement of block memories, instructions take multiple cycles. Pipelining was used to enhance performance so that one new instruction was executed each cycle. As the memory-write operation takes a cycle then an attempt to read the same location in the next operation will fail to receive the current value. Given the presence of an accumulator in the datapath this was not found to be a limitation and was avoided altogether by careful coding of the program assisted by some design rule checking in the assembler. Similarly, any branching decision must be made in the same cycle as the instruction is fetched and if a jump is required the next instruction (incorrectly fetched) would be replaced by a NOP. This was relatively straight forward to implement in hardware at the expense of a small drop in throughput. Fig. 6 is a pie chart depicting the split in the processor design between the various design modules. The area for each in terms

GOOD AND BENAISSA: VERY SMALL FPGA ASIP FOR AES

1483

TABLE III SEQUENCE OF OPERATIONS IN AES ALGORITHM

Fig. 5. ASIP architecture.

TABLE IV PROGRAM STRUCTURE, SIZE AND EXECUTION TIMES

Fig. 6. Distribution of area between design units (excluding memories).

of number of slices is shown around the chart. The datapath part of the processor occupies 74 of the 122 slices ( 60%) so the processor is not over burdened by control ( 40%). The cost associated with the block memories has been excluded from this gure. V. ASIP SOFTWARE The principal design constraint was the minimization of area. In hardware instantiation, the software is stored in ROM, which occupies area. Thus, our aim to minimize area translates into minimizing the program size. Although the program ROM was implemented using a dedicated block memory, as a measure of how much area an instruction occupies, the distributed ROM equivalent was considered. The instruction size was xed (16 bits) thus an equivalent LUT-based distributed ROM would result in approximately two instructions per slice. The good practice of meaningful subroutines and readability had to be sacriced in the pursuit of low area. With the goal of reducing operations subroutines were only dened to avoid duplication of instructions where code was used more than once. In the case of the proposed architecture, the one level prevented nesting of subroutines however this was not found to present any signicant limitation. A top level design approach was taken followed by optimization. At the top level, the AES algorithm may be described as

in Table III to give separate programs for encipher and decipher. Only the current RoundKey was stored and on-the-y key expansion used. The AES application program developed here performs a xed 128-key expansion however with a little reprogramming different key sizes may be supported. However, it should be recognized that a 128-bit key may be considered sufcient for many of the low-area applications. The process of transforming this denition into a set of assembly language subroutines was relatively straight forward. A number of factors were considered when electing to group instructions into a new subroutine. These included the number of times the sequence of operations was used, the applicability of iteration and the existence of alternative subroutine denitions. The set of subroutines resulting from this part of the design process is documented in Table IV This table includes the area cost in terms of number of instructions and the execution time in cycles for each subroutine. Additionally details for the two main programs (encipher and decipher) are included. A. Forward Key Expansion The key expansion, dened in the AES specication, can expressed as a set of operations which are performed each round

1484

IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMSI: REGULAR PAPERS, VOL. 53, NO. 7, JULY 2006

Fig. 7. Forward key expansion.

Fig. 9. Reverse key expansion.

TABLE V ACCUMULATION OF PARTIAL PRODUCTS

) and works backwards to starts with ( ). This time (Fig. 9), the Rcon nally yield ( value was propagated in the reverse direction using ffd2. C. Mix Columns
Fig. 8. Assembly language program for forward key expansion.

to generate the next RoundKey. This may then be rened further to operations suitable for an 8-bit datapath. The result of this process has been summarized in the Fig. 7. In this owgraph, the subscripts on represent the bytes within the current the next RoundKey in forward key expansion. RoundKey and The order of execution was selected such that the new key bytes may be directly written back to their original locations with the addition of only two temporary storage bytes. Three more similar owgraphs are required for the remaining key bytes excepting that Rcon was not applied. The Rcon value was updated for the next round using the ffm2 operation. Fig. 8 shows the assembly language listing for the forward key expansion. B. Reverse Key Expansion Reverse key expansion was approached using a similar method to the forward key expansion. However, the process

In an 8-bit environment, without a specialized 32-bit MixColumns operator, the operation must be derived using a number of instructions using the multiply-accumulate execution unit. First the partial products were calculated in an order to make optimal use of the multiply-accumulate architecture. Decipher is the more complex and so is shown in Table V. The resulting partial products were stored in temporary memory locations. Next, the ffms of the required terms were computed using the ffm-accumulate operations and again stored in temporary locations. Finally, these terms were used by a further stage of accumulation to produce either the required MixColumns or InvMixColumns outputs. The following equations describe the operations mathematically:

[Equations not recoverable from the source text.]
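To make the multiply-accumulate view of MixColumns concrete, the Python sketch below computes each output byte as an XOR-accumulated sum of finite-field products, using the standard coefficient rows from FIPS 197 ({02, 03, 01, 01} for MixColumns and {0e, 0b, 0d, 09} for InvMixColumns). It is only an illustration: it does not reproduce the partial-product ordering of Table V or the temporary-storage scheme described above, and the names `ffm`, `mac_row`, and `mix_column` are hypothetical.

```python
def ffm2(x):
    """Multiply by 2 in GF(2^8), reducing by the AES polynomial 0x11B."""
    x <<= 1
    return x ^ 0x11B if x & 0x100 else x


def ffm(a, b):
    """Finite-field multiply of two bytes via repeated ffm2 and XOR."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a = ffm2(a)
        b >>= 1
    return r


def mac_row(coeffs, col):
    """Multiply-accumulate: XOR together coeff * byte for one output byte."""
    acc = 0
    for c, byte in zip(coeffs, col):
        acc ^= ffm(c, byte)
    return acc


MIX_ROW = [0x02, 0x03, 0x01, 0x01]      # first row of the MixColumns matrix
INV_MIX_ROW = [0x0E, 0x0B, 0x0D, 0x09]  # first row of the InvMixColumns matrix


def mix_column(col, first_row):
    """Apply the circulant matrix whose first row is first_row to a 4-byte column."""
    # Row r of the matrix is the first row rotated right by r positions.
    return [mac_row(first_row[-r:] + first_row[:-r], col) for r in range(4)]
```

For example, `mix_column([0xDB, 0x13, 0x53, 0x45], MIX_ROW)` returns `[0x8E, 0x4D, 0xA1, 0xBC]`, and applying `mix_column` with `INV_MIX_ROW` to that result restores the original column. The inverse coefficients have more set bits, so each product takes more shift-and-add steps, which is consistent with decipher being the more complex case.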


TABLE VI RESULTS COMPARISON WITH OTHER SMALL FPGA DESIGNS
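The average-throughput figures compared in Table VI combine encipher and decipher, each including KeyExpansion. The Python sketch below shows one way such a figure can be derived from cycle counts. It assumes the average is taken as one 128-bit block divided by the mean of the encipher and decipher times, which is an interpretation rather than a definition given in the text. The function names are hypothetical, and the clock frequency is a caller-supplied parameter.

```python
BLOCK_BITS = 128  # AES block size in bits


def average_throughput_bps(enc_cycles, dec_cycles, f_clk_hz):
    """Throughput for one block at the mean of encipher and decipher cycle counts."""
    mean_cycles = (enc_cycles + dec_cycles) / 2
    return BLOCK_BITS * f_clk_hz / mean_cycles


def implied_mean_cycles(throughput_bps, f_clk_hz):
    """Mean cycles per block implied by a quoted throughput and clock rate."""
    return BLOCK_BITS * f_clk_hz / throughput_bps


def slowdown(reference_bps, design_bps):
    """How many times slower design_bps is than reference_bps."""
    return reference_bps / design_bps
```

Under this assumption, the quoted 2.1 Mbps at 70 MHz corresponds to `implied_mean_cycles(2.1e6, 70e6)`, about 4267 cycles per block, and `slowdown(70e6, 2.1e6)` is about 33, in line with the "30 times slower" comparison. The PicoBlaze cycle counts of 13546 and 18885 average to 16215.5 cycles, which can be passed to `average_throughput_bps` together with whatever clock rate applies.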

Fig. 10. Placement on XC2S15 FPGA.

VI. FPGA IMPLEMENTATION RESULTS

Fig. 10 shows that the placement of this design fits comfortably into the smallest Spartan-II device (XC2S15), occupying about 60% of the resources. The design required 122 slices (depending on user constraints) and two block memories. The block memory used as the register file was only partially utilized (360 bits), which gives rise to an alternative implementation using distributed memory, at a cost of 42 additional slices but saving one of the block memories. No comparable 8-bit FPGA designs were found, so comparison was made against the best 32-bit designs.

Additionally, a second design was developed using the freely available Xilinx PicoBlaze core. This was done to provide a small embedded software baseline for comparison in terms of throughput and area. A concession was made in implementing SubBytes as a ROM-based lookup table. The version III PicoBlaze (KCPSM3) was selected as it contains the internal scratch-pad memory (64 bytes) needed for data and key storage. The microcontroller uses a load-store architecture and contains sixteen internal 8-bit registers. The maximum supported program size is 1024 instructions, and the stack supports 31 levels of subroutine nesting. This processor was augmented with the required I/O interfacing and the S-box. In this design, one block memory was used for the program ROM and the second as the lookup table for the S-box (including its inverse).

The application program to perform encipher and decipher was generated using the PicoBlaze assembler. The same basic subroutine structure previously developed for the ASIP version was used. The resulting program required 365 instructions. KeyExpansion followed by encipher takes 13546 cycles, and decipher with KeyExpansion requires 18885 cycles.

For the previous designs [5], [6], the throughput figures quoted excluded the time taken for KeyExpansion. For fair comparison with these designs, the figures had to be recalculated to include KeyExpansion, and then an average for both encipher and decipher found. In the case of the 8-bit designs, the total time including I/O and key expansion was presented as the average throughput. It must be noted that the decipher operation takes longer due to having to process the KeyExpansion to reach
the final RoundKey before decryption can begin. Thus, the average-throughput figure quoted is the average for separate encipher and decipher operations inclusive of KeyExpansion.

Table VI shows comparative results between these designs and a selection of the best previously reported designs. For the new FPGA designs, the area and timing figures quoted are from the post-place-and-route phase of ISE version 6.3i. In a direct comparison with the PicoBlaze-based design, the ASIP is both smaller and faster. The increased performance (about three times faster) is a result of the novel implementation of SubBytes and the efficiency of the specialized ffm-accumulate datapath. The ASIP throughput can be compared with Chodowiec's design by making allowance for the key expansion for both encipher and decipher operations. This reduces the 166 Mbps to approximately 70 Mbps. Thus, the new ASIP design is approximately half the size at the cost of being about 30 times slower; however, it still achieves a combined encipher-decipher throughput of approximately 2.1 Mbps. This is believed to be useful for numerous applications. The 8-bit ASIP suffers greatly in terms of throughput and throughput per unit area compared to the optimal 32-bit designs. The same is true of 32-bit designs when compared with a fully loop-unrolled design [1].

VII. CONCLUSION

Both the ASIP and PicoBlaze-based designs are the smallest known FPGA implementations to date. Such designs have application across a wide range of areas, especially those needing a short time to market and relatively low power. The ASIP design will operate with an average encipher-decipher throughput (including key expansion) of 2.1 Mbps (70-MHz clock) using a truly 8-bit datapath and can be implemented using less than 60% of the resources in the smallest Xilinx Spartan-II part (XC2S15). The two best 32-bit designs are both considerably faster and more efficient in terms of speed per slice; however, they are larger. Similarly, the same 32-bit designs would appear inferior in throughput and throughput per unit area when compared to pipelined, fully loop-unrolled 128-bit architectures.

A speed of 2.1 Mbps is sufficient for numerous applications where area and power usage are more of a priority. Most wireless and consumer data links (GSM, UMTS, ADSL, etc.) have much lower bandwidth than the throughput that this design has attained. Thus, its smaller area would present a significant saving, including fitting into a smaller FPGA part than the 32-bit designs. There are a number of measures that can be taken to improve the ASIP performance, such as further pipelining at the outputs to the memories to reduce the currently unavoidable fan-outs. However, such improvements would increase speed with an approximately linear cost in terms of area.

The ASIP design compares very well with a more conventional embedded software approach utilizing the PicoBlaze. A threefold performance improvement for a comparable area was achieved. This was done using two novel execution units. The first was an 8-bit ffm-accumulate unit and the second a new architecture for performing the SubBytes operation using composite field arithmetic, which performed the three required multiplications by resource sharing a single multiplier. The throughput versus area results for the various low-area FPGA designs are presented graphically in Fig. 11. The graph shows that the ASIP design is a factor of two smaller than other designs while maintaining an adequate throughput for a number of applications.

Fig. 11. Throughput versus area for different low-area FPGA designs.

REFERENCES

[1] X. Zhang and K. K. Parhi, "High-speed VLSI architectures for the AES algorithm," IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 12, no. 9, pp. 957-967, Sep. 2004.
[2] A. Hodjat and I. Verbauwhede, "A 21.54 Gbits/s Fully Pipelined AES Processor on FPGA," in Proc. FCCM'04, Apr. 2004, pp. 308-309.
[3] J. Zambreno, D. Nguyen, and A. Choudhary, "Exploring Area/Delay Trade-offs in an AES FPGA Implementation," in Proc. LNCS FPL'04, Antwerp, Belgium, 2004, vol. 3203, pp. 575-585.
[4] N. Pramstaller and J. Wolkerstorfer, "A universal and efficient AES co-processor for field programmable logic arrays," in Proc. LNCS FPL'04, 2004, vol. 3203, pp. 565-574.
[5] P. Chodowiec and K. Gaj, "Very Compact FPGA Implementation of the AES Algorithm," in Proc. LNCS'03, 2003, vol. 2779, pp. 319-333.
[6] G. Rouvroy, F. X. Standaert, J. J. Quisquater, and J. D. Legat, "Compact and efficient encryption/decryption module for FPGA implementation of the AES Rijndael very well suited for small embedded applications," in Proc. ITCC'04, Apr. 2004, vol. 2, pp. 583-587.
[7] V. Fischer and M. Drutarovsky, "Two Methods of Rijndael Implementation in Reconfigurable Hardware," in Proc. CHES'01, 2001, vol. 2162, pp. 77-92.
[8] F. X. Standaert, G. Rouvroy, J. Quisquater, and J. Legat, "A Methodology to Implement Block Ciphers in Reconfigurable Hardware and its Application to Fast and Compact AES RIJNDAEL," in Proc. ACM FPGA'03, Monterey, CA, 2003, pp. 216-224.
[9] M. Feldhofer, S. Dominikus, and J. Wolkerstorfer, "Strong Authentication for RFID Systems Using the AES Algorithm," in Proc. LNCS CHES'04, 2004, pp. 357-370.
[10] A. Satoh, S. Morioka, K. Takano, and S. Munetoh, "A Compact Rijndael Hardware Architecture With S-Box Optimization," in Proc. LNCS ASIACRYPT'01, Dec. 2001, vol. 2248, pp. 239-254.
[11] K. Chapman, PicoBlaze 8-bit Microcontroller. Xilinx, 2002 [Online]. Available: http://www.xilinx.com/products/design_resources/proc_central/grouping/picoblaze.htm
[12] Advanced Encryption Standard (AES), National Institute of Standards and Technology (NIST), Federal Information Processing Standards (FIPS) Pub. 197, Nov. 2001.
[13] E. Mastrovito, "VLSI Architectures for Computations in Galois Fields," Ph.D. dissertation, Linköping Univ., Linköping, Sweden, 1991.
[14] C. Paar, "Efficient VLSI Architectures for Bit-Parallel Computation in Galois Fields," Ph.D. dissertation, Inst. for Experimental Mathematics, Univ. of Essen, Essen, Germany, Jun. 1994.

Tim Good (S'05) received the B.Eng. degree in electronic engineering from the University of Sheffield, Sheffield, U.K., in 1991. He is currently working toward the Ph.D. degree at the same university. Mr. Good is a Chartered Engineer, U.K.

Mohammed Benaissa (S'86-M'90) received the Ph.D. degree from the University of Newcastle, Newcastle, U.K., in 1990. He is currently a Senior Lecturer at the University of Sheffield, Sheffield, U.K. He has been actively working in the area of VLSI signal processing, error-control coding, and cryptography for the past 18 years and has published over 70 papers in recognised journals and conferences. His recent research has concentrated on investigating configurable and secure approaches to optimum hardware implementation of cryptographic and error-control coding techniques and their incorporation in systems-on-chip.
