
Department of Electrical and Computer Engineering

Title: Speech Recognition Using FPGA

Senior Design Project Report
Students: Tyler Havner, Ismael Perez
Technical Advisors: Dr. Reza Raeisi, Dr. Daniel Bukofzer, Dr. Sean Fulop

FALL 2012

Speech Recognition Using FPGA

TABLE OF CONTENTS

Course Evaluation Rubric
Definition of Key Terms
1. Problem and Its Setting
   1.1 Introduction
   1.2 General Statement of the Problem
   1.3 Objective Solution
   1.4 Scope of the Study
   1.5 Project Limitations
2. Background Theory
   2.1 Analog to Digital Converter (ADC)
   2.2 Frequency Spectrum
   2.3 Digital Filters
       2.3.1 Infinite Impulse Response (IIR) Filters
       2.3.2 Finite Impulse Response (FIR) Filters
   2.4 FPGA
   2.5 Summary
3. Monetary Costs of Project
4. Methodology
   4.1 Theoretical Concept
   4.2 Detail Algorithm and Design Approach
       4.2.1 Data Acquisition and the ADC
       4.2.2 Start of the Word Detection
       4.2.3 Frequency Analysis
       4.2.4 Fingerprint Generation
       4.2.5 Comparison Function
       4.2.6 Driving Outputs
       4.2.7 System Architecture
       4.2.8 Training the System
   4.3 Work Breakdown
5. Parts Ordering
6. Findings and Conclusion
   6.1 MatLab Findings
   6.2 Testing Results
   6.3 System Improvements
7. Conclusion
References
Appendix A: MatLab Code
Appendix B: Ordering Receipts
Appendix C: DE2 Board Code

ECE 186B Course Evaluation Rubric

1. How successfully you were able to convert your problem statement or project objectives to your own engineering domain (digital domain, control domain, microcontroller domain, etc.) in order to find an approach to a solution.

We were very successful at converting our project objectives into our own engineering domains. We created a MatLab model to observe the signal in both the digital time domain and the frequency domain. We also created the equivalent model in the microprocessor domain by programming the DE2 board in C.

2. How successfully you were able to determine the right engineering tools for the purpose of your project.

We were able to determine the right engineering tools through our years of familiarity with lab equipment and software. When it became clear that our project entailed a lot of digital signal processing (DSP), we quickly realized that MatLab would cut down design time by providing powerful DSP toolboxes to analyze each step of our design. The bench tools also allowed us to test all of our hardware.

3. The effectiveness of using the tools.

We were effective at using the engineering tools for our project. We had to do a lot of research into some of them, such as how to sample data from the microphone input channel in MatLab and how to program the DE2 board in C. Due to our sound foundation in those areas we were able to use those tools accurately and effectively.

4. Your experience on being able to develop a prototype and simulation of it.

We feel confident in our ability to use engineering tools to come up with the best solution within the given constraints. When developing a prototype many unanticipated issues arise and the design must be adapted. We feel that we made good design decisions in order to deliver a working prototype in a timely manner.

5. Overall correctness of your design.

Our design was correct with respect to the goals of the method we set out to test. There was some error associated with casting floating-point accumulations into integer values, but it was minor. The design could have been improved with a higher filter order, as originally designed, but that would have required even longer computation times and more memory.


Definition of Key Terms

The following is a list of key terms, along with their definitions from Wikipedia, that will be needed in order to grasp the concept of our proposed system [6].

- Analog Signal: any continuous signal for which the time-varying feature (variable) of the signal is a representation of some other time-varying quantity.
- Digital Signal: a physical signal that is a representation of a sequence of discrete values (a quantified discrete-time signal).
- Pulse-code Modulation (PCM): a PCM stream is a digital representation of an analog signal, in which the magnitude of the analog signal is sampled regularly at uniform intervals, with each sample being quantized to the nearest value within a range of digital steps.
- Frequency Spectrum: a representation of a time-domain signal in the frequency domain. The frequency spectrum can be generated via a Fourier transform of the signal, and the resulting values are usually presented as amplitude and phase, both plotted versus frequency.
- Low Pass Filter: an electronic filter that passes low-frequency signals but attenuates (reduces the amplitude of) signals with frequencies higher than the cutoff frequency.
- Band Pass Filter: an electronic filter that passes frequencies within a certain range and rejects (attenuates) frequencies outside that range.
- Digital Filter: a filter characterized by its transfer function or, equivalently, its difference equation.
- Logarithmic Scale: a scale of measurement using the logarithm of a physical quantity instead of the quantity itself. A simple example is a chart whose vertical axis has equally spaced increments labeled 1, 10, 100, 1000, instead of 1, 2, 3, 4. Each unit increase on the logarithmic scale thus represents an exponential increase in the underlying quantity for the given base (10, in this case).
- Decibel: a logarithmic unit that indicates the ratio of a physical quantity (usually power or intensity) relative to a specified or implied reference level. A ratio in decibels is ten times the logarithm to base 10 of the ratio of two power quantities.
- Accumulator: a register in which intermediate arithmetic and logic results are stored. Without a register like an accumulator, it would be necessary to write the result of each calculation (addition, multiplication, shift, etc.) to main memory.
- Aliasing: an effect that causes different signals to become indistinguishable (or aliases of one another) when sampled.
- FPGA: an integrated circuit designed to be configured by the customer or designer after manufacturing, hence "field-programmable". The FPGA configuration is generally specified using a hardware description language (HDL), similar to that used for an application-specific integrated circuit (ASIC).


Chapter 1: Problem and Its Setting


1.1 Introduction

Speech recognition is expanding its reach with modern technology. Calls made to large companies rely heavily on voice recognition to route calls efficiently to the proper department. Luxury cars incorporate systems that give the driver an interactive experience with the automobile. Smartphone programs like Siri are pushing the envelope of artificial intelligence using speech recognition. The next logical step is bringing speech recognition into the home.

1.2 General Statement of the Problem

According to the U.S. Census Bureau, 11 million Americans need personal assistance with everyday activities and over 3.3 million use a wheelchair [5]. For these Americans, simple tasks such as turning on a ceiling fan or opening a door become a chore. There is a clear need for voice recognition in household devices to create a hands-free environment.

1.3 Objective Solution

Our project will employ the programmability of a field-programmable gate array (FPGA) to turn ordinary household objects, like a door and a ceiling fan, into hands-free devices. For example, if the user needs to open a door, they can simply say "open" and the system will fully open the door. This system could alleviate some of the day-to-day struggles that physically disabled people face in their homes by allowing them to interact with household devices using only their voice, potentially making them more independent. With a successful implementation of our system, the speaker will be able to open and close a door, as well as turn a ceiling fan on and off, using four simple commands that will be discussed in detail in Chapter 4.

1.4 Scope of the Study

A broad engineering background is needed for this project. A strong understanding of signals and systems, with an emphasis on signal analysis via Fourier transform methods, will provide the basic foundation.
Digital signal processing (DSP) is where the core of the project lies

as well as programming in a hardware description language (HDL) and C. The hardware portion of the project will require a background in electric motors and the electronics to drive them. The courses we have taken to prepare for this project are as follows:

Tyler:
  Course Number   Course Description
  ECE 71          Programming in C
  ECE 121         Electromechanical Systems
  ECE 124         Signals and Systems
  ECE 134         Communications
  ECE 138         Electronics II

Ismael:
  Course Number   Course Description
  ECE 107         Digital Signal Processing
  ECE 124         Signals and Systems
  ECE 176         Verilog Coding
  ECE 178         Embedded Systems
  CSCI 150        Software Engineering

1.5 Project Limitations

Speech processing is a very challenging problem, and modern speech analysis is accomplished using complex probabilistic characterizations of words and sentence structures known as hidden Markov models [3]. The goal of our project is to create a reliable and accurate system that does not rely on such complex models; the system is kept deliberately simple in order to lay a basis for future consumer products. One restriction on our system is speaker dependence. Creating a system that does not rely on a specific speaker is a very complex problem, so our system will have a primary speaker (the homeowner). Also, our system will use single-word recognition, not continuous speech recognition. The speaker will first have to train our system with several versions of the same word, yielding a reference fingerprint: the set of values that results from averaging the three sets of values from the training words. Subsequent words can then be recognized based on how closely they match the saved reference fingerprint.


Chapter 2: Background Theory


In the following sections we will discuss the relevant aspects of our project, including analog to digital conversion, frequency spectrum analysis, and digital filters, and how each relates to our project. We will use an analog to digital converter to convert the voltage representation of our spoken words into digital information. Once we have a digital representation of the spoken word, we need to extract its significant frequency components so that a decision can be made about whether it is in fact one of the correct words.

2.1 Analog to Digital Converter (ADC)

Everything in the real world is analog; this includes sound, light, and even temperature [2]. Computers cannot handle analog information, so everything that needs to be manipulated by a computer has to be converted into digital form, that is, into strings of ones and zeros. To achieve this we need an analog to digital converter; for our purposes we will be converting sound into its digital version. (To convert a waveform back from digital to analog, a digital to analog converter is used.) The two most important variables that determine how closely a digitally sampled waveform follows the original continuous-time waveform are the sampling rate and the bit resolution. The ADC takes discrete points on the waveform at a specified rate, or sampling frequency. One of the most popular methods for analog to digital conversion is pulse code modulation (PCM). In PCM, the amplitude of the waveform (most commonly voltage) is quantized into discrete levels that have encoded binary representations. The number of quantization levels L is directly related to the number of binary bits n (the bit resolution) used to represent each sample, as shown in Equation 2.1.1.

L = 2^n    (2.1.1)

The size of the quantization levels is then based on the amplitude bounds and the number of levels. Equation 2.1.2 shows this relationship, with Δ equal to the size of the quantization levels, A the peak amplitude, and L the number of quantization levels.

Δ = 2A / L    (2.1.2)

Figure 2.1.1 shows a PCM encoded waveform with 3-bit resolution.

Figure 2.1.1: 3-bit PCM Encoded Waveform In Figure 2.1.1, the quantization levels are numbered on the left hand side of the y-axis while the binary encoded representation of the levels is shown on the right-hand side of the y-axis. The xaxis shows the encoded sampled values of the waveform. The sampling frequency that we will be using will be based on the Nyquist formula [1]. This formula says that in order to avoid aliasing of the reconstructed waveform the sampling frequency should be at least twice that of the highest frequency component or Nyquist frequency in the waveform. Equation 2.1.3 gives the Nyquist sampling theorem with equal to the Nyquist sampling rate and equal to the highest frequency component in the sampled waveform. (2.1.3)

2.2 Frequency Spectrum

Speech processing is explored by performing spectral analysis to characterize the time-varying properties of the signal [7]. In other words, speech processing requires a frequency-domain representation of the signal to be analyzed. The Fourier transform, shown in Equation 2.2.1, does exactly that and transforms a time-domain signal x(t) into its equivalent frequency-domain representation X(f).

X(f) = ∫ x(t) e^(-j2πft) dt    (2.2.1)

The Fourier transform often reveals characteristics of the signal that would not otherwise be readily apparent in the time domain. For example, the Fourier transform of the band limited rectangle function in the time domain becomes an infinite banded sinc function in the frequency domain as shown in Figure 2.2.1.

Figure 2.2.1: Fourier Transform of rect(t)

In the case of a sampled digital waveform x(n), the discrete Fourier transform (DFT) is used. It is described by Equation 2.2.2.

X(k) = Σ_{n=0}^{N-1} x(n) e^(-j2πkn/N)    (2.2.2)

The power spectral density (PSD) of a waveform builds on the Fourier transform and relates the energy of the signal to frequency [8]. This property is described by Rayleigh's Theorem and is shown in Equation 2.2.3.

∫ |x(t)|² dt = ∫ |X(f)|² df    (2.2.3)

Using Rayleigh's Theorem to obtain the PSD of the waveform, the significant frequency components of the waveform become more apparent [8]. For a discrete-time waveform, Rayleigh's Theorem becomes Equation 2.2.4.

Σ_{n=0}^{N-1} |x(n)|² = (1/N) Σ_{k=0}^{N-1} |X(k)|²    (2.2.4)

2.3 Digital Filters

There are two types of digital filters available for our purposes: infinite impulse response (IIR) filters and finite impulse response (FIR) filters. We studied the attributes of each and based our decision on them. The following discusses some of the major differences between the two.

2.3.1 IIR Filters

These filters are generally more difficult to control and can become unstable; we want filters that are both controllable and stable. IIR filters generally do not have a linear phase response and can exhibit limit cycles. Because the impulse response is infinite, these filters rely on feedback, and both the poles and the zeros of the transfer function affect their behavior. Since IIR filters require fewer coefficients than FIR filters, their cutoff will not be as sharp, but for the same reason they require less memory. Figure 2.3.1 shows the non-linear phase response of an IIR filter.

Figure 2.3.1: IIR Filter Phase Plot



2.3.2 FIR Filters

These filters always have a linear phase and therefore behave as one expects. Unlike IIR filters, FIR filters are always stable and do not exhibit limit cycles, because the output depends only on present and past values of the input. Another difference is that FIR filters have no analog heritage; IIR filters are typically derived from analog prototypes. We wanted filters that were completely digital in origin, so FIR filters were the logical choice. Because FIR filters are of higher order, they require more multiplications and additions than comparable IIR filters. Delays are easy to implement in FIR filters, but they require more memory than IIR filters, since they typically need more coefficients to achieve a sharp cutoff. The behavior of an FIR filter depends only on the zeros of its transfer function. Figure 2.3.2 shows the linear phase plot of an FIR filter.
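The stability argument above follows directly from the FIR structure: the output is a weighted sum of the current and past inputs only, with no feedback path. A minimal C sketch (our own illustrative code, not from the report):

```c
#include <assert.h>

/* One step of a direct-form FIR filter: y(n) = sum_{k=0}^{M} b[k] * x(n-k).
 * delay[] holds the current input and the past M inputs; since the output
 * never feeds back into the delay line, the filter cannot go unstable. */
static double fir_step(const double *b, int order, double *delay, double x_in) {
    for (int k = order; k > 0; k--)      /* age the stored inputs */
        delay[k] = delay[k - 1];
    delay[0] = x_in;                     /* newest input */
    double y = 0.0;
    for (int k = 0; k <= order; k++)
        y += b[k] * delay[k];
    return y;
}
```

For example, b = {0.5, 0.5} is a two-tap moving average: a bounded input always produces a bounded output.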

Figure 2.3.2: FIR Phase Plot

2.4 FPGA

An FPGA is an IC that contains an array of identical logic cells with programmable interconnections, also known as configurable logic blocks (CLBs) [9]. The user can program the function realized by each logic cell and the connections between the cells. A typical CLB contains two or more function generators, often referred to as look-up tables or LUTs, programmable multiplexers, and D-CE flip-flops. A D-CE flip-flop is just a normal D flip-flop

with a clock enable (CE) input: when CE is asserted the flip-flop behaves like a regular D flip-flop, and otherwise it holds its current value. Figure 2.4.1 shows a simplified version of a CLB.

Figure 2.4.1: Simplified Configurable Logic Block (CLB)

The CLB shown in Figure 2.4.1 contains two function generators, two flip-flops, and various multiplexers for routing signals within the CLB. Each function generator has four inputs and can implement any function of up to four variables. The function generators are implemented as lookup tables (LUTs). A four input LUT is essentially a reprogrammable read-only memory (ROM) with 16 1-bit words [9]. This ROM stores the truth table for the function being generated. The array of CLBs is then surrounded by a ring of input-output (I/O) interface blocks. Figure 2.4.2 shows the layout of part of a typical FPGA.
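The 16 x 1-bit ROM view of a four-input LUT can be modeled with a single 16-bit word in C. This is an illustrative software model (our own), not vendor code:

```c
#include <assert.h>
#include <stdint.h>

/* Model of a 4-input LUT: the truth table is a reprogrammable 16 x 1-bit
 * ROM packed into one uint16_t, where bit i holds the output for the input
 * combination i = (d c b a) read as a 4-bit address. */
static int lut4(uint16_t truth_table, int a, int b, int c, int d) {
    int address = (d << 3) | (c << 2) | (b << 1) | a;
    return (truth_table >> address) & 1;
}
```

Reprogramming the LUT is just storing a different 16-bit word: 0x8000 implements a 4-input AND, while 0xFFFE implements a 4-input OR.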

Figure 2.4.2: Typical FPGA Layout


The I/O blocks in Figure 2.4.2 connect the CLB signals directly to the IC pins [9]. An FPGA normally contains other components as well, such as memory blocks, clock generators, and tri-state buffers, along with other useful digital components. The user-defined flexibility, coupled with the large amount of memory on an FPGA, makes it a great choice to handle the immense amount of DSP that our project will entail. Unlike the custom application-specific integrated circuit (ASIC) approach, with FPGA technology one can start from a simple design and build on it. An ASIC is built for a specific project, while an FPGA is reprogrammable and can be used for various applications. The FPGA is programmed with the use of electrically programmable switches, similar to other programmable logic devices [10].

2.5 Summary

We have briefly covered the theory that will be essential to complete our project. An understanding of ADCs is needed to gain an accurate representation of the incoming speech signal, knowledge of the frequency spectrum is needed to characterize its significant components, and digital filtering will be used to perform the spectral analysis. Knowledge of the inner workings of the FPGA will also be vital to the implementation of our project. These techniques are demonstrated in detail in our methodology.

Chapter 3: Monetary Costs of Project


The budget for our system will be fairly small (under $400), with the bulk of the cost due to the FPGA board (about $320). We will also need a microphone with a 3.5mm plug so that we can simply plug it into the microphone input jack on the FPGA board. The other components will be used to drive our outputs: a DC motor will drive the scale-model fan propeller, and the door will be opened using a servo motor. In order to control the motors from the header pins on the board we will need MOSFETs to switch power. The servo motor will also need a 555 timer circuit to create the control signal it requires for position control. Table 1 lists the parts needed to implement our project along with their corresponding costs.


Table 1: Bill of Materials


Part Needed                    Manufacturer              Cost
FPGA Development Board         Altera                    $269.00
3.5mm Microphone               Logitech                  $6.80
Low Speed DC Motor             HobbyTech                 $2.95
Plastic Fan Propeller          HobbyTech                 $1.25
Servo Motor                    Futaba                    $15.85
LM 324N Quad Op-Amp (2)        Texas Instruments         $0.53
1N4148 Diode                   Fairchild Semiconductor   $0.20
NE555P 555 Timer (2)           Texas Instruments         $0.39
N-Channel MOSFET               Philips                   $1.04
Assorted Lab Resistors/Caps    (lab stock)

                               Subtotal                  $297.77
                               Taxes                     $23.82
                               Shipping/Handling         $45.55
                               Total                     $367.14

Chapter 4: Methodology
Our project will open a door or turn on a fan using voice recognition. A word is spoken into the microphone; once the word has been recognized and compared, a signal is sent to either the door or the fan to perform the specified operation. Extensive digital signal processing (DSP) is used to process the spoken word, and digital filters are necessary to accomplish the DSP.

4.1 Theoretical Concept

We need a simple method to obtain the significant frequency content of a speech signal. Spectrum analyzers are often used for this purpose, but these devices are bulky, expensive, and far more sophisticated than our project needs. A better, simpler approach to revealing the frequency content of a speech signal is a band pass filter bank. The Fourier transform of a waveform can be thought of as a series of band pass filters with infinitesimally small bandwidths and infinitesimally closely spaced center frequencies, so that the output of each filter would represent one point on the Fourier transform of the waveform [3]. Obviously this is an idealized system that is not realizable, but it does emphasize that a band pass filter bank

can be used in order to expose the frequency spectrum of a waveform. Figure 4.1.1 shows a magnitude plot of a realistic bank of band pass filters [8].

Figure 4.1.1: Band Pass Filter Bank [8]

This array of filters captures the frequencies that fall within each filter's bandwidth. Based upon the outputs of each filter we can make inferences about the frequency content in that band. The PSD gives a better understanding of the significant frequency content in the waveform; to obtain it, we can use Rayleigh's Theorem in Equation 2.2.3 and take the time average of the energy to get the power in each filter band [8]. The block diagram of such a system is shown in Figure 4.1.2.


Figure 4.1.2: Power Spectrum using a Filter Bank [8]

In Figure 4.1.2 the signal x(t) is routed through multiple band pass filters. Each filter's response is the part of the signal lying in that filter's frequency range. The output of each filter feeds a squarer block, which simply squares the signal; the output of any squarer is the part of the instantaneous signal power of the original x(t) that lies in the passband of its band pass filter. The time averager then computes the time-averaged signal power. Each output Px(fn) is a measure of the signal power of the original x(t) in a narrow band of frequencies centered at fn. Taken together, the Px(fn) values indicate the variation of signal power with frequency, i.e., the power spectrum. In the filter bank model of Figure 4.1.1 all the filters are linearly spaced, meaning they have the same bandwidth. This wastes a lot of bandwidth, because the human ear does not process all frequencies the same and actually has unique variations [8]. Figure 4.1.3 shows the average human ear's perception of the loudness of a constant-amplitude audio tone as a function of frequency [8].
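The squarer and time-averager blocks of Figure 4.1.2 reduce to a few lines of C per band. This sketch uses our own names and assumes the band pass filter outputs are already available as an array of samples:

```c
#include <assert.h>

/* Square each band-pass filter output sample and average over n samples,
 * giving Px(fn): the average signal power in that filter's band. */
static double band_power(const double *y, int n) {
    double acc = 0.0;
    for (int i = 0; i < n; i++)
        acc += y[i] * y[i];       /* squarer */
    return acc / n;               /* time averager */
}
```

Running this once per filter output yields the set of Px(fn) values that together approximate the power spectrum.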


Figure 4.1.3: Human Ear Perception of Loudness vs. Frequency [8]

Humans can only produce speech signals up to about 10 kHz [8]. From Figure 4.1.3 it is evident that the human ear has a nonlinear response to frequency: it is highly sensitive to frequency changes in the first 4 kHz, with a significant roll-off occurring thereafter. Therefore the filter bank model for speech analysis can be improved by logarithmically spacing the filters. Equations for the spacing of these filters are discussed in further detail in Section 4.2.3.

4.2 Detail Algorithm and Design Approach

In this section we discuss in detail the steps we have to take to complete our project successfully. At first we were thinking of using the fast Fourier transform as our design approach, but we decided on the simpler filter bank processing. We initially wanted to use 10 filters spanning about 10 kHz, but due to speed and memory limitations we opted for only 5 filters spanning about 8 kHz.

4.2.1 Data Acquisition and the ADC

We will be acquiring data by inputting an analog signal from a microphone connected to the mic-in port on the DE2 board. This port is connected to the 24-bit analog to digital converter (ADC) embedded on the board. The output of the ADC is the quantized

form of the input waveform: a long string of ones and zeros in which every 24 bits represent one point of the waveform. We did not need such high resolution for our project, so we decided to down-convert the 24-bit samples to 12 bits; keeping the full 24-bit resolution might cause the outputs of our filters to overflow. Figure 2.1.1 shows how a 3-bit ADC separates the values on the y-axis; our 12-bit samples will have values ranging from -2048 to 2047. Figure 4.2.1.1 shows an example of down-converting from 3-bit resolution to 2-bit resolution.

Figure 4.2.1.1: Signed Down Conversion

Since we are using the DE2 media computer system for our project, the default sampling rate is 48 kHz. For our design purposes this is oversampling, so we need to down-sample. We accomplished this by saving only every third value of the sampled waveform, down-sampling from 48 kHz to 16 kHz. Many projects we came across stated that a 16 kHz sampling rate is ideal for voice recognition. The mic-in ADC on the board saves audio in 2-channel stereo quality, containing a right channel and a left channel. Figure 4.2.1.2 shows the audio port registers into which the left and right channel data are stored.
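The down-sampling and bit-reduction steps described above can be sketched in C. The exact register access on the DE2 is board-specific, so this sketch (our own) just assumes sign-extended 24-bit samples arriving in an int32_t buffer, and an arithmetic right shift for signed values, which mainstream compilers provide:

```c
#include <assert.h>
#include <stdint.h>

/* Keep every third 48 kHz sample (giving 16 kHz) and drop the 12 least
 * significant bits of each sign-extended 24-bit sample, leaving 12-bit
 * values in -2048..2047. Returns the number of output samples written. */
static int downsample_and_reduce(const int32_t *in, int n_in, int16_t *out) {
    int n_out = 0;
    for (int i = 0; i < n_in; i += 3)          /* every third sample */
        out[n_out++] = (int16_t)(in[i] >> 12); /* 24-bit -> 12-bit   */
    return n_out;
}
```

Shifting out the low 12 bits coarsens the quantization in the same way Figure 4.2.1.1 shows for the 3-bit to 2-bit case.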


Figure 4.2.1.2: Audio port registers [11]

Since voice is mono, we only need to retrieve one channel; the other channel is simply a copy of the same data. We could have used either channel, and chose the left channel for our project.

4.2.2 Start of Word Detection

A crucial step in recognizing speech is locating the beginning of the spoken word (if there is one). In our system the ADC samples for 3 seconds after the button is pressed. We used a windowed approach in which the difference between the absolute averages of two adjacent windows of n points each is compared to a predefined threshold. Once the threshold is surpassed, a pointer is placed at the start of the previous window and samples are saved into memory from that point onward for 8 K samples, or half a second at a 16 kHz sampling rate. The flow chart in Figure 4.2.2.1 shows the design approach for programming the word detection.

Figure 4.2.2.1: Flow Chart for Word Detection



Equation 4.2.2.1 shows how to calculate the absolute average A1 of the first window, running from the initial sample a to the window endpoint b in the vector of sound samples s.

A1 = (1/(b - a)) Σ |s(i)|,  i = a … b    (4.2.2.1)

The average of the second window, A2, is computed from the sound samples starting at b and ending at c, where the number of points in each window equals the difference of b and a, or equivalently c and b. The computation for the second window is shown in Equation 4.2.2.2.

A2 = (1/(c - b)) Σ |s(i)|,  i = b … c    (4.2.2.2)

The difference between A2 and A1 is compared to the threshold value Th. If it is larger, then the spoken word is considered to start at a. If not, the average of the oldest window (A1) is discarded and replaced by A2. The algorithm then repeats until a word is detected or the end of the sound samples is reached, in which case no word was detected. The value of the threshold was determined empirically using MatLab, as seen in Appendix A.
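Our own C sketch of the windowed detection loop follows; the window length and threshold here are placeholders, since the real threshold was tuned empirically in MatLab:

```c
#include <assert.h>
#include <stdlib.h>

/* Slide two adjacent windows of win points over the samples s[0..n-1].
 * When the absolute average of the newer window exceeds that of the older
 * window by more than th, return the start index of the older window
 * (the word start); return -1 when no word is detected. */
static int find_word_start(const int *s, int n, int win, double th) {
    for (int start = 0; start + 2 * win <= n; start += win) {
        double a1 = 0.0, a2 = 0.0;
        for (int i = 0; i < win; i++) {
            a1 += abs(s[start + i]);         /* older window  */
            a2 += abs(s[start + win + i]);   /* newer window  */
        }
        a1 /= win;
        a2 /= win;
        if (a2 - a1 > th)
            return start;                    /* word starts here */
    }
    return -1;
}
```

Advancing by one full window per iteration is equivalent to discarding the oldest average and replacing it with the newer one, as the text describes.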

4.2.3 Frequency Analysis Before we can pass the voice samples through the band pass filter bank we must first pass the values through a pre-emphasis filter. Speech signals normally experience some spectral roll-off of about 6-dB per octave [3]. This means that the amplitude is halved for each doubling of frequency. This phenomenon occurs due to the radiation effects of the sound from the mouth [3]. As a result, the majority of the spectral energy is concentrated in the lower frequencies, which results in an inaccurate estimation of the higher formants. However, the information in the high frequencies is just as important in understanding the speech as the low frequencies. To reduce this effect, the speech signal is filtered prior to the filter bank processing. The pre-emphasis filter makes the outputs of the filters nearly uniform across the spectrum at the expense of lowering the

amplitudes slightly. Equation 4.2.3.1 shows how to calculate the output of the pre-emphasis filter.

y(n) = x(n) - a·x(n-1)    (4.2.3.1)

In Equation 4.2.3.1, a is a coefficient most commonly in the range of 0.95 to 0.98 for speech applications; we opted for a = 0.97 in our design. The magnitude response of our pre-emphasis filter is shown in Figure 4.2.3.1.
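Equation 4.2.3.1 with a = 0.97 is one line of C per sample. This sketch uses our own function name and assumes x(-1) = 0 for the first sample:

```c
#include <assert.h>
#include <math.h>

#define PRE_EMPH_A 0.97   /* coefficient chosen in the text */

/* Pre-emphasis filter (Equation 4.2.3.1): y(n) = x(n) - a * x(n-1).
 * Flattens the roughly 6 dB/octave spectral roll-off of speech. */
static void pre_emphasis(const double *x, double *y, int n) {
    double prev = 0.0;                  /* assume x(-1) = 0 */
    for (int i = 0; i < n; i++) {
        y[i] = x[i] - PRE_EMPH_A * prev;
        prev = x[i];
    }
}
```

A constant (DC) input is attenuated to 3% of its value while rapid sample-to-sample changes pass nearly unchanged, which is exactly the high-frequency boost seen in Figure 4.2.3.1.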

[Plot: Magnitude (dB) versus Frequency (Hz), 0 to 8000 Hz]
Figure 4.2.3.1: Pre-emphasis Filter Response

From the magnitude plot it is apparent that this filter attenuates the lower frequencies while amplifying the higher frequencies to counteract the -6 dB/octave roll-off. We originally intended to use 10 filters, creating a filter bank covering frequencies from 200 Hz to about 10 kHz. This proved too ambitious, however, and we had to remove half of our designed filters, leaving 5 filters spanning 300 Hz to about 7 kHz. The filters are logarithmically spaced because of the way the human voice behaves in the frequency domain. In general, for a spoken word most of the significant components are in the lower frequencies of the spectrum, which is why we need filters with smaller bandwidths in the lower part of the spectrum. The human voice normally falls in a range of about 300 Hz to 14 kHz, but for the most part the significant frequencies range from 300 Hz to 2 kHz [4]. We
17

Speech Recognition Using FPGA

decided to use FIR filters since they have a linear phase without compromising the ability to approximate the ideal magnitude, unlike IIR filters. Unfortunately, they are computationally more expensive in implementation as they require more coefficients for the equivalent IIR filter [3]. After deciding what type of filters we should use we needed to calculate the bandwidths and center frequencies of each filter. The main equations that were used for calculating the bandwidths and the center frequencies of the filters will be shown below. Equation 4.2.3.2 and 4.2.3.3 were used to calculate the bandwidths of each filter then with the results obtained equation 4.2.3.3 was used to calculate the center frequencies of each filter. In equation 4.2.3.2 C equals the bandwidth of the first filter, we decided on 440 Hz. Then bi is the bandwidth for a given filter and Q is the total number of filters to be used, which will be 5. The in equation 4.2.3.3 represents the logarithmic growth factor that typically falls between 1 and 2. The value for was calculated to be 1.45 which would allow us to fit 5 filters into the 7 kHz range. = C (4.2.3.2)

2iQ
(4.2.3.3)

= + +

(4.2.3.4)
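Equations 4.2.3.2 through 4.2.3.4 can be checked with a short script. The sketch below is in Python rather than the MatLab of Appendix A; the f_min parameter for placing the first center frequency is our assumption about where the band starts. With C = 100 Hz, alpha = 1.45, Q = 10, and a 200 Hz low edge it reproduces the 10-filter values hard-coded in Appendix A.

```python
def filter_bank_params(C, alpha, Q, f_min):
    """Bandwidths (Eqs. 4.2.3.2-4.2.3.3) and center frequencies (Eq. 4.2.3.4)
    of a logarithmically spaced band-pass filter bank."""
    b = [C * alpha ** i for i in range(Q)]            # b1 = C, bi = alpha * b(i-1)
    fc = [f_min + b[0] / 2]                           # first center: half of b1 above f_min
    for i in range(1, Q):
        fc.append(fc[i - 1] + (b[i - 1] + b[i]) / 2)  # fci = fc(i-1) + (b(i-1) + bi)/2
    return b, fc
```

With C = 440 Hz and Q = 5, the same function gives the bandwidths and centers of the 5-filter bank used in our final design.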

We obtained the coefficients for our filters from MatLab using the fir1 function, which designs a Hamming-window-based, linear-phase filter. Another critical choice was the order of each filter, which determines its sharpness, that is, its effectiveness at passing only the frequencies within its band. We opted for 50th-order filters, which give good sharpness. The transfer function of an FIR filter is shown in Equation 4.2.3.5.

H(z) = b_0 + b_1 * z^(-1) + b_2 * z^(-2) + ... + b_M * z^(-M)    (4.2.3.5)


The transfer function of an FIR filter possesses only a numerator, which corresponds to an all-zero filter. In this equation the b terms are the filter coefficients, z^(-1) is the delay element, and M is the order of the filter, which in our case is 50. Equation 4.2.3.6 gives the difference equation used to compute the output of the FIR filter.

y[n] = b_0 * x[n] + b_1 * x[n-1] + ... + b_M * x[n-M]    (4.2.3.6)
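A direct-form evaluation of Equation 4.2.3.6 can be sketched as follows. This is written for clarity rather than speed (the board implementation was written in C), and the zero-padding of samples before the start of the signal is our assumption.

```python
def fir_filter(x, b):
    """Direct-form FIR filter, Eq. 4.2.3.6: y[n] = sum over k of b[k] * x[n-k].
    Samples before the start of the signal are taken as zero."""
    M = len(b) - 1
    y = []
    for n in range(len(x)):
        acc = 0.0
        for k in range(min(n, M) + 1):  # only terms with n - k >= 0 contribute
            acc += b[k] * x[n - k]
        y.append(acc)
    return y
```

Feeding the filter a unit impulse returns the coefficient list itself, which is a handy sanity check that the delay line is indexed correctly.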

The direct form of the FIR filter structure is shown in Figure 4.2.3.2.

Figure 4.2.3.2: FIR Direct Form Structure

From Equation 4.2.3.6 and Figure 4.2.3.2 it is apparent that the output of the filter is obtained through a linear combination of the last M+1 input samples weighted by the b coefficients. Figure 4.2.3.3 shows 5 ideal filters that were generated using MatLab; Appendix A contains the MatLab code used to obtain the graph. As you can see, these filters have cutoffs that are impossible to implement with real filters. Those idealized band-pass filters are rectangular functions in the frequency domain, which (from Figure 2.2.1) become infinitely long sinc functions in the time domain. The sinc function is non-causal and has infinite delay, so it can only be approximated in the time domain. Observing Figure 4.2.3.3, you can see there are 5 filters logarithmically distributed from about 900 Hz to about 6.5 kHz; that is, each successive filter grows in bandwidth by the growth factor alpha shown in Equation 4.2.3.3. We had to limit the number of filters in our band-pass filter bank due to speed and memory constraints.



Figure 4.2.3.3: Idealized Logarithmically Spaced BPF Bank (magnitude vs. frequency in Hz, 0 to 8 kHz)

A more realistic way of implementing the filters that we are going to use in our project is shown in Figure 4.2.3.4. The figure shows the 5 filters with a somewhat sharp cutoff. The MatLab code for graphing Figure 4.2.3.4 is shown in Appendix A.
Figure 4.2.3.4: Realizable FIR BPF Bank (magnitude in dB vs. frequency in Hz, 0 to 8 kHz)

Since the actual filters are not ideal like those in Figure 4.2.3.3, there needs to be some overlap so that there is no spectral loss between the filters. The overlap has a downside, however: frequency smearing occurs, where frequency content appears in neighboring filters due to the

overlap. This will not really affect the recognition, because the training words saved to memory and the word to be recognized are subjected to the same spectral smearing.

4.2.4 Fingerprint Generation

Once we have obtained all the output points from the filters, we need to calculate the energy in each band. The energy is found using Equation 2.2.4, the cumulative summation of the squared output of each filter. A good rule of thumb is to use window lengths between 10 and 30 milliseconds. Since we are sampling at 16 kHz, a 10 millisecond window means the energy accumulated over every 160 points becomes one data point in the fingerprint representation for that filter. This is essentially the energy windowed over a fixed length. Since all of our keywords are short, we saved one half second of sound (8000 points) after detecting the beginning of a word. Once this is accomplished we have a 500-point representation (100 points for each of the 5 filters) of the energy in the banded spectrum of the filters at discrete points in time throughout the spoken word. Figure 4.2.4.1 shows a general flow chart of how our system generates the fingerprint for each word. First, the speech signal from the microphone passes through the ADC, which digitizes the waveform. The output of the ADC then passes through the pre-emphasis filter before reaching the filter bank. The output of each filter is then squared to obtain the instantaneous power, which is added to an accumulator to obtain the energy over 10 millisecond intervals.
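The windowed-energy accumulation described above can be sketched as follows. The nested-list layout (one output sequence per band-pass filter) and the parameterized window length are our assumptions about the data arrangement, not the board's memory layout.

```python
def fingerprint(filter_outputs, window_len=160):
    """Banded-energy fingerprint: for each band-pass filter's output sequence,
    accumulate the squared samples over consecutive windows.
    160 samples per window corresponds to 10 ms at a 16 kHz sampling rate."""
    points = []
    for y in filter_outputs:                    # one sequence per filter
        for start in range(0, len(y), window_len):
            window = y[start:start + window_len]
            points.append(sum(s * s for s in window))
    return points
```

Each entry of the result is the energy of one filter's output over one window, concatenated filter by filter.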


Figure 4.2.4.1: Flow Chart for Fingerprint Extraction

We have to make a reference fingerprint for every word that we store in memory. The reference fingerprint is the average of the individual fingerprints from each training trial. For our system the user says the keyword three times; the three individual fingerprints are then averaged together to create the reference fingerprint.

4.2.5 Comparison Function

In recognition mode the incoming fingerprint is compared to each reference fingerprint, and the closest match is recognized as the spoken word and displayed on the LCD. Thus we need a formula to calculate the difference between the reference fingerprint data points and those of the spoken word. The Euclidean distance formula, which is derived from the Pythagorean Theorem, gives the straight-line distance between two vectors of n points. Equation 4.2.5.1 shows how to calculate the distance between two vectors p and q.

d(p, q) = sqrt( (p_1 - q_1)^2 + (p_2 - q_2)^2 + ... + (p_n - q_n)^2 )    (4.2.5.1)

From Equation 4.2.5.1 we can see that the overall distance from p to q accumulates the squared differences between each point in p and the corresponding point in q.
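Equation 4.2.5.1 and the closest-match rule can be sketched together. The threshold parameter and the dictionary layout of reference fingerprints are our assumptions about how the stored words might be organized, not the board's actual data structures.

```python
import math

def euclidean(p, q):
    """Eq. 4.2.5.1: straight-line distance between two n-point vectors."""
    return math.sqrt(sum((pi - qi) ** 2 for pi, qi in zip(p, q)))

def best_match(fp, references, threshold):
    """Pick the stored word whose reference fingerprint is closest to fp,
    or None when even the closest is outside the acceptance threshold."""
    best_word, best_dist = None, float("inf")
    for word, ref in references.items():
        d = euclidean(fp, ref)
        if d < best_dist:
            best_word, best_dist = word, d
    return best_word if best_dist <= threshold else None
```

Returning None for an out-of-threshold input is what drives the WORD NOT RECOGNIZED message described later.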


4.2.6 Driving Outputs

We have planned to use only two outputs. One will be the DC motor, which will power a fan, and the other will be a servo motor that controls the opening of the door. The DC motor has only 2 terminals and simply requires a voltage across them. The servo motor has 3 terminals: two are connected to Vcc and ground, while the last is a position signal that requires a pulse-width modulated (PWM) voltage. These outputs will be controlled using the expansion header I/O ports on the DE2 board. The DE2 board provides two 40-pin expansion headers that connect directly to 36 pins on the Cyclone II FPGA, and also provides DC +5V (VCC5), DC +3.3V (VCC33), and two GND pins [11]. Each pin on the expansion headers is connected to two diodes and a resistor that provide protection from high and low voltages. Depending on which word is recognized, the corresponding pins should be set to either output or input. For example, if the word STOP is recognized, the pin controlling the fan should be set to input so that no voltage is supplied by that pin. Figure 4.2.6.1 shows the related schematics for one of the expansion headers (JP1).

Figure 4.2.6.1: Expansion Header I/O Ports [11]

Pins VCC5 and VCC33 are voltage-regulated power supplies that can provide higher currents, but they are always high and cannot be switched to turn the motor on or off. The other header I/O ports can be configured as digital outputs, but they are current limited: they can provide only up to 8 mA [11]. This is not nearly enough current to drive the DC motor, so we need a power circuit for it. A simple MOSFET can be used to control the movement
23

Speech Recognition Using FPGA

of DC motors or brushless stepper motors directly from computer logic [12]. As the motor load is inductive, a simple flywheel diode is connected across the inductive load to dissipate any back EMF generated by the motor when the MOSFET turns it off [12]. An additional silicon diode D1 can also be placed across the channel of a MOSFET switch when using inductive loads, suppressing overvoltage switching transients and noise and giving extra protection to the MOSFET switch if required [12]. Resistor R2 is used as a pull-down resistor to help pull the output voltage down to 0 V when the MOSFET is switched off [12]. We will use components from the lab to build this circuit. Figure 4.2.6.2 shows the DC motor control circuit.

Figure 4.2.6.2: DC Motor Control Using MOSFET

Referring to Figures 4.2.6.1 and 4.2.6.2, the VCC5 pin from the board will be connected to Vdd on the DC motor control circuit. One of the I/O pins, such as I/O A0, will be connected to VIN, and the circuit will be grounded using the GND pin. Similar to the DC motor, the servo motor will use the VCC5 pin and GND for power while an I/O pin controls the motor. The servo motor relies on PWM to control its position, so the I/O pin has to generate a PWM signal. Generally the minimum pulse width is about 1 millisecond and the maximum pulse width is 2 milliseconds, with a period of 40 milliseconds, though the period is not nearly as critical as the pulse widths. Figure 4.2.6.3 shows the position of the servo motor with respect to the pulse width.


Figure 4.2.6.3: Servo Motor Position vs. Duty Cycle

In order to open the door 90 degrees from the neutral position we want either a one or a two millisecond pulse, depending on which direction the door should open; this gives a full rotation of about 180 degrees. The simplest way to meet the PWM requirements for the servo control signal is to use a 555-timer circuit. The circuit in Figure 4.2.6.4 accomplishes the PWM requirements that we need to control the servo motor.
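The pulse-width-to-angle relationship of Figure 4.2.6.3 can be captured in a small helper. The 1 ms / 1.5 ms / 2 ms endpoints are the typical hobby-servo values assumed here; a particular servo may differ slightly.

```python
def servo_pulse_ms(angle_deg):
    """Map a servo angle (-90 to +90 degrees from neutral) to a pulse width
    in milliseconds, assuming 1.5 ms at neutral and 0.5 ms per 90 degrees."""
    if not -90 <= angle_deg <= 90:
        raise ValueError("angle outside the +/-90 degree range")
    return 1.5 + 0.5 * angle_deg / 90.0
```

Opening the door 90 degrees in either direction therefore corresponds to the 1 ms and 2 ms pulse widths quoted above.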

Figure 4.2.6.4: Servo Motor Control Circuit

Solving Equation 4.2.6.1, the standard 555 astable timing relation for the discharge (output-low) phase, gives a time-low of 40.54 milliseconds.

t_low = 0.693 * R_B * C    (4.2.6.1)


Using Equation 4.2.6.2, the timing relation for the charge (output-high) phase, to solve for the minimum time-high of the waveform, which occurs when the second timing resistor is shorted by the N-channel FET, produces a time-high of 1.039 milliseconds.

t_high = 0.693 * R_A * C    (4.2.6.2)

Solving Equation 4.2.6.2 for the maximum time-high, in which both timing resistors are in the charge path and equal to 10 kOhm, doubles R_A and gives a time-high of 2.078 milliseconds. Thus this circuit provides the necessary PWM range for positioning our servo.

4.2.7 System Architecture

For our system architecture the main component is the FPGA board. Connected to the input of the FPGA board is the microphone, and to the output the chip that controls the DC motor. The architecture of the FPGA board is too complicated to explain in full detail, but we will mention some of the components that we are going to use. The core of an FPGA consists of adaptive logic modules (ALMs). Figure 4.2.7.1 shows the structure of an ALM and its corresponding adders and registers.

Figure 4.2.7.1 Adaptive Logic Module of a typical FPGA [11]


The ALM is key to the speed of FPGA technology and to the efficiency of its architecture. An ALM can implement many functions because it has 8 inputs to its logic block, and it can also be divided into smaller LUTs. The components on the FPGA board that we use are the ADC, the audio-in port, and the memory, which includes SRAM, SDRAM, and FLASH. The board has a pre-configured system, which we ended up using since it already has the ADC configured: the media computer system available for the board. The original bit resolution of the media computer was 24-bit, but we needed 12-bit resolution. This was accomplished by shifting every 24-bit value 12 bits to the right. Figure 4.2.7.2 shows a top view of the DE2 board with the components labeled.
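The 24-bit to 12-bit reduction described above is a single arithmetic shift, sketched here in Python as an illustration (the board code performs the same operation in C on the media computer's sample words).

```python
def to_12_bit(sample):
    """Drop the low 12 bits of a signed 24-bit audio sample, keeping the
    top 12. An arithmetic right shift preserves the sign of negative
    samples."""
    return sample >> 12
```

Only the 12 most significant bits survive, so quantization detail below the new resolution is discarded.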

Figure 4.2.7.2: DE2 FPGA Board [11]

Figure 4.2.7.2 labels all of the components on the DE2 board that we will be using, including the SDRAM, mic-in, LCD module, and the toggle switches. The SRAM and FLASH memory locations can also be seen in the figure.


4.2.8 Training the System

We utilize a total of 12 of the switches embedded on the DE2 FPGA board, sw0 through sw11. Only one switch may be high at a time, except in recognizing mode; otherwise, if an undesired switch is high, the LCD display shows an error message. Recognizing mode is active when both sw0 and sw1 are high at the time the record button is pressed. Three switches are used for every word that is to be stored in memory. The reason for using three switches per word is that we need to record each word three different times and then average the values of the recordings to obtain our reference fingerprint for that word. This procedure has to be done once for each word, so four times in total since we have four different words. The average for each word is saved at an independent address in SDRAM. We save the words in SDRAM because we think the large quantity of values might overflow the SRAM. SDRAM may be a bit slower than SRAM, but due to the processor speed the delay is not significant. Aside from the 12 switches we also use two pushbuttons: key1 for recording and key2 for playing back the previously recorded word. We decided to have a playback function so that we can listen to a recording and make sure it was a fair enough sample of the word. Whenever we record we must push key1, and depending on which switch is high the appropriate target address is selected. When we train the system to store the fingerprint of each word we run the word three times and take the average banded energy to acquire the fingerprint. The user selects which word to train using the sliding switches on the board. Once all the words have been trained, the system can run in recognition mode.
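The averaging of the three training recordings described above can be sketched as a point-by-point mean; the list-of-lists layout for the trials is our assumed representation.

```python
def reference_fingerprint(trials):
    """Point-by-point average of the training fingerprints (three recordings
    of the same word) to form the stored reference fingerprint."""
    n = len(trials)
    length = len(trials[0])
    return [sum(t[i] for t in trials) / n for i in range(length)]
```

The result has the same number of points as each individual trial, so it can be compared directly against an incoming fingerprint with the Euclidean distance of Equation 4.2.5.1.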
For every word we will have 100 points from each of the 5 filters, or 500 points in total. Once we run the training three times we will have the sum of the energy for those 500 points; we then divide each value by 3 to obtain the average energy, which is our reference fingerprint. After we have our reference fingerprints, when a word is spoken into the microphone we use the distance formula mentioned before to measure the difference between the spoken word and the words stored in memory. If the spoken word matches one of the stored words within the distance threshold, then the board outputs a signal corresponding to the command that word represents. However, if the spoken word


does not match any of the stored words, the board should output a message notifying the user that the word has no match. Table 4.2.1 shows the words that we will use as inputs and the corresponding outputs for when the words are recognized. When there is no word match, the message WORD NOT RECOGNIZED is displayed on the LCD embedded on the DE2 board.

Table 4.2.1: Inputs and Outputs

Word Match    Output
GO            Turn fan on
STOP          Turn fan off
OPEN          Open door
CLOSE         Close door
NO MATCH      WORD NOT RECOGNIZED

A system diagram is shown in Figure 4.2.8.1, where the first image represents the microphone, which is connected to the audio-in port on the FPGA board. The second image, labeled FPGA board, represents the whole FPGA board, which contains the ADC and the audio-in port and is where we implemented our filtering design. The SRAM and SDRAM modules can be seen in the DE2 board system diagram; these modules are driven by the SRAM and SDRAM controllers, respectively, which also appear in the diagram. The 16x2 LCD display is controlled by the LCD port; we use the LCD display to show the messages for our program.


Figure 4.2.8.1: System Diagram [11]

At first we attempted to build our own system for the FPGA board using the Quartus II program, so that we could choose which components from the system diagram to use for our project. We ended up abandoning our own system because we were getting many errors that we could not fix when running it. After switching to the media computer system that we found on the Altera website, we started by testing the switches, pushbuttons, and the LCD display, and we successfully tested the ports we planned to use before actually running our program on the DE2 board. The Altera Monitor Program was very useful when we needed to find which memory locations to use for our values: using its memory tab we were able to see where the buffer stored all the values for the spoken words. The values in the buffer were the ones we needed to store in memory for future use.

4.3 Work Breakdown

Table 4.3.1 shows the division of work for our project. Each task shows its corresponding start and completion dates along with the team member(s) who participated in accomplishing that task. It is followed by the Gantt chart for our project.


Table 4.3.1: Division of Work


Task MatLab Implementation Data Acquisition Using Mic Using 'Analog Input' Function Establish Sampling Variables User Interface for Template Storage Prompts to Select Word Saves Word Template Storage Quantization Function Test Bit Resolutions Word Detection Window Averaging Function Threshold Calculations DSP Pre-emphasis Filter FIR Filters Cutoff Frequencies Filter Coefficients Downsampling Resolution Downconversion DE2 Board Implementation Data Acquisition Using Mic In User Interface(switches, buttons) Wolfson CODEC Memory Allocation DSP Pre-emphasis Filter Downsampling Resolution Downconversion Band pass Filter Bank Filter Coefficients Create FIR Funtion in C Compare to MatLab Outputs Fingerprint Generation Accumulation of Sampled Data Average of Multiple Trials Comparison Function Start Date 8/6/2012 9/10/2012 9/10/2012 9/17/2012 9/20/2012 9/20/2012 9/25/2012 10/1/2012 10/1/2012 10/10/2012 10/10/2012 10/10/2012 8/6/2012 11/12/2012 8/6/2012 8/6/2012 8/27/2012 10/10/2012 10/15/2012 8/27/2012 8/27/2012 End Date 11/17/2012 9/21/2012 9/21/2012 9/21/2012 9/29/2012 9/26/2012 9/29/2012 10/11/2012 10/11/2012 10/20/2012 10/20/2012 10/20/2012 11/17/2012 11/17/2012 8/29/2012 8/29/2012 8/29/2012 10/13/2012 10/19/2012 12/1/2012 9/25/2012 Team Member Tyler/Ismael Tyler Tyler Tyler Tyler Tyler Tyler Tyler Tyler/Ismael Tyler Tyler Tyler/Ismael Tyler/Ismael Tyler/Ismael Tyler/Ismael Tyler/Ismael Tyler/Ismael Tyler/Ismael Tyler/Ismael

Ismael Ismael Ismael Ismael Tyler/Ismael Tyler/Ismael Ismael Ismael Tyler/Ismael Tyler/Ismael Tyler/Ismael Tyler Tyler/Ismael Tyler/Ismael Tyler/Ismael Tyler/Ismael 31

8/27/2012 9/15/2012 9/14/2012 9/25/2012 9/4/2012 10/3/2012 8/27/2012 11/21/2012 11/16/2012 11/21/2012 9/25/2012 10/2/2012 10/10/2012 10/16/2012 8/27/2012 11/6/2012 8/27/2012 8/29/2012 9/24/2012 11/2/2012 11/2/2012 11/6/2012 11/5/2012 11/20/2012 11/5/2012 11/10/2012 11/12/2012 11/20/2012 10/12/2012 12/1/2012

Speech Recognition Using FPGA Euclidean Distance Function Word Matching Function Threshold Calculation Signal to Outputs Outputs DC Motor Configure GPIO Pins Opamp Buffer High Side MOSFET Switch Servo Motor 555 Timer for PWM Signal Configure GPIO Pins High Side MOSFET Switch Project Testing / Refinement Debugging & Refinement 10/12/2012 10/22/2012 11/20/2012 11/21/2012 10/15/2012 10/15/2012 10/15/2012 11/1/2012 11/1/2012 10/15/2012 10/22/2012 10/15/2012 11/1/2012 11/26/2012 11/26/2012 10/20/2012 10/26/2012 12/1/2012 11/23/2012 11/6/2012 11/6/2012 11/6/2012 11/6/2012 11/6/2012 11/6/2012 10/30/2012 10/25/2012 11/6/2012 12/7/2012 12/7/2012 Tyler/Ismael Ismael Ismael Ismael Tyler/Ismael Tyler Ismael Tyler Tyler Tyler/Ismael Tyler Ismael Tyler Tyler/Ismael Tyler/Ismael


Chapter 5: Parts Ordering


Table 5.1 shows the order and arrival dates of each of the parts we needed for our project. Receipts for all the parts ordered are shown in Appendix B.

Table 5.1: Shipping Status

Part Needed              Date Ordered      Date Received
Altera DE2 FPGA Board    May 5th, 2012     May 10th, 2012
Logitech Microphone      May 8th, 2012     May 20th, 2012
Low Speed DC Motor       May 5th, 2012     May 15th, 2012
MOSFETs and Op-Amps      May 5th, 2012     May 17th, 2012
Servo Motor              May 7th, 2012     May 11th, 2012

Chapter 6: Results and Findings


This chapter focuses on the outcome of our project. The first section covers the extensive analysis of our system and keywords in MatLab, and the second section presents the system testing results for our keywords on the Altera DE2 FPGA board. Lastly, we discuss some of the shortcomings of the project and our proposed improvements to alleviate those issues.

6.1 MatLab Findings

To add some clarity to our project we created a test bench in MatLab that modeled our system. If we had simply started coding the project onto the board we would have essentially been working blind, not knowing what to expect as far as frequency content, threshold values, and timing values. Using MatLab we were able to save word templates just as we would when training our board, which allowed us to modify code and view the changes to the reference fingerprints. This was a big issue at first because we were simply providing new sound samples through the microphone each time, and it became apparent that a word is never said exactly the same way twice. This is what makes speech recognition so difficult. We needed to fix some consistency, and saving three sample templates of each word was a start. Figure 6.1.1 is the time-domain plot of the word "Go," which will be used to start the DC motor.


Figure 6.1.1: "Go" in the Time Domain (amplitude vs. time, 0 to 0.5 s)

From the plot of "Go" it is evident that the word begins with the humming of the g sound, followed by the much more powerful o vowel. The output of the filters for the word "Go," shown in Figure 6.1.2, reveals significant signal content in the fifth filter (spanning about 4.8 kHz to 7 kHz) at the beginning of the word, consistent with the high-frequency consonant g. Significant content also appears in the second filter (spanning about 1.2 kHz to 2 kHz) at about 250 ms into the sample; this lower frequency is consistent with the o vowel.


Figure 6.1.2: Output of Filters vs. Time for "Go" (five panels, amplitude vs. time, 0 to 0.5 s)

Figure 6.1.3 is the time-domain plot of the word "Stop," which will be used to turn off the DC motor. The high-frequency hissing s sound is clearly visible at the start of the word, followed by the t and then the op phoneme.

Figure 6.1.3: "Stop" in the Time Domain (amplitude vs. time, 0 to 0.5 s)

The output of the filters for the word "Stop," shown in Figure 6.1.4, reveals significant signal content in the fifth filter (spanning about 4.8 kHz to 7 kHz) at the beginning of the word, consistent with the high-frequency s consonant. Also, the t-o-p sound is present in an appreciable amount across all of the filters at about 270 milliseconds.



Figure 6.1.4: Output of Filters vs. Time for "Stop" (five panels, amplitude vs. time, 0 to 0.5 s)

Figure 6.1.5 is the time-domain plot of the word "Open," which will be used to drive the servo motor to signify the opening of a door. Interestingly enough, the same op phoneme is present as at the end of the keyword "Stop." The word then ends with the low-frequency consonant n sound.

Figure 6.1.5: "Open" in the Time Domain (amplitude vs. time, 0 to 0.5 s)

The output of the filters for the word "Open" is shown in Figure 6.1.6. The beginning op sound is present across all of the filters in an appreciable amount, just as it was in the word "Stop." The

n sound does not even appear in our filter outputs. This is most likely because n is one of the lowest-frequency sounds and is being attenuated by the pre-emphasis filter.

Figure 6.1.6: Output of Filters vs. Time for "Open" (five panels, amplitude vs. time, 0 to 0.5 s)

Figure 6.1.7 is the time-domain plot of the word "Close," which will be used to return the servo motor to its original position, signifying the closing of a door. There is not much empty content in the word "Close," unlike the other keywords in which the different sounds were clearly visible.

Figure 6.1.7: Close in Time-Domain



The output of the filters for the word "Close" is shown in Figure 6.1.8. The beginning of the word has very low-amplitude outputs from the first two filters, followed by significant content in the last three filters.


Figure 6.1.8: Output of Filters vs. Time for "Close" (five panels, amplitude vs. time, 0 to 0.5 s)

6.2 Testing Results

When testing our system on the DE2 Altera board, each of us ran three rounds of ten trials for each word. The results of the voice recognition testing are shown in Table 6.2.1.

Table 6.2.1: Results of Test


From these tests we can infer that not all speakers are created equal in our system. Tyler finished with an average recognition rate of 62.5%, while Ismael finished with an average recognition rate of only 59.25%. Ismael's voice is definitely deeper, so we suspect this was a contributing factor to his lower recognition rate. The word with the highest accuracy, at an average rate of 73.5%, was "Stop," while the lowest, at an average rate of 53.5%, was "Open." In one of the rounds with "Stop" Tyler achieved a recognition rate of 90% (9 out of 10), but there were rates as low as 40% (4 out of 10) for "Open." While this accuracy is not practical for a real-world application, we are very pleased with these results given that we did not use any sophisticated pattern-matching algorithm. When testing, we could usually tell whether a word was going to be correctly identified just by how loudly we inflected our voice and by how consistently we matched the speed at which we had trained the word. We tested this theory briefly with our smart phones by recording an input word that yielded a correct output. When we played back the recorded word, the recognition rates were repeatedly in the 80% range, so the system was consistent given a consistent input.

6.3 System Improvements

Our system could become a practical solution given some improvements. The first area that needs to be addressed is processing speed. In its current state the board takes about one minute and 20 seconds to complete all the necessary computation and make a decision. Implementing the filters in a parallel hardware structure would greatly cut down on the latency that the serial software structure causes. Also, we could convert to IIR filters so that the equivalent filter order, and therefore the number of computations, would not need to be so high. In addition, using a fixed-point representation would be a great improvement over the floating-point math currently used by our filters.
A method to allow a variable length for the input sounds, instead of the fixed length we currently have in place, would drastically improve performance on very short or very long words. Another area with much room for improvement is our comparison function. Our system relies on accumulated energy over windowed bands but does not incorporate any type of pattern matching or regression technique to find the best alignment. The comparison function we have in place now is very susceptible to shifts in the timing of the words, which put the peaks and troughs of the fingerprints out of place,


introducing a lot of error. Lastly, a normalization technique would help to dampen the variability due to the loudness of the spoken word.

Chapter 7: Conclusion
After applying the background theory, analyzing a MatLab prototype, and implementing a prototype on the Altera DE2 board, it is evident that a speech recognition system can indeed be successfully implemented using FPGA technology. We achieved all of our proposed goals and objectives in the time allotted for our project. Improvements to our current system are needed in order to make it practical for the consumer. We set out to create a simple solution to speech recognition, and our results were modest. Speech is a robust problem, and, as is often the case, complex problems invariably require complex solutions to achieve accurate results. Regardless, we have learned a great deal about the density of speech and would like to further our interest in the subject by continuing to improve upon our system.


References

1. Ifeachor, Emmanuel, and Barrie Jervis. Digital Signal Processing: A Practical Approach. Prentice Hall, 2002. Print.
2. Torres, Gabriel. Hardware Secrets, LLC. April 21, 2006. Web. March 22, 2012. http://www.hardwaresecrets.com/article/317
3. Rabiner, Lawrence, and Biing-Hwang Juang. Fundamentals of Speech Recognition. Prentice-Hall International, Inc. Print.
4. "EVP Frequency Ranges." Web. March 29, 2012. http://www.paranormalghost.com/evp_frequency_ranges.htm
5. U.S. Census Bureau. Web. http://www.census.gov/newsroom/releases/archives/facts_for_features_special_editions/cb10-ff13.html
6. Wikipedia. Wikimedia Foundation, Inc. March 12, 2012. Web. March 28, 2012. http://www.wikipedia.org/
7. Yarlagadda, R. K. Rao. Analog and Digital Signals and Systems. New York: Springer, 2010. Print.
8. Roberts, Michael J. Signals and Systems: Analysis Using Transform Methods and MATLAB. New York: McGraw-Hill, 2012. Print.
9. Roth, Charles H., and Larry L. Kinney. Fundamentals of Logic Design. Stamford, CT: Cengage Learning, 2010. Print.
10. Rose, Jonathan. "Architecture of Field-Programmable Gate Arrays." Web. April 30, 2012. http://isl.stanford.edu/groups/elgamal/abbas_publications/J029.pdf
11. DE2 Development and Education Board User Manual. Altera Corporation, 2006. PDF.
12. "MOSFET as a Switch." Electronics Tutorials. Web. March 28, 2012. http://www.electronics-tutorials.ws/transistor/tran_7.html


APPENDIX A: MatLab Code

Code 1: Idealized Band Pass Filter Plots


%% Idealized Band Pass Filter Bank
alpha = 1.45;   % logarithmic growth coefficient of filters
% Bandwidths of each of the 10 filters
b = [100 145 210.25 304.863 442.05 640.97 929.41 1347.65 1954.09 2833.43];
% Center frequencies of each BPF
fc = [250 372.5 550.13 807.68 1181.14 1722.65 2507.84 3646.37 5297.24 7691];
f = 0:.5:10000;   % frequency range
%-------------------------------------------------------------------------
% Idealized magnitude responses
f1  = heaviside(f-(fc(1)-b(1)/2))  - heaviside(f-(fc(1)+b(1)/2));
f2  = heaviside(f-(fc(2)-b(2)/2))  - heaviside(f-(fc(2)+b(2)/2));
f3  = heaviside(f-(fc(3)-b(3)/2))  - heaviside(f-(fc(3)+b(3)/2));
f4  = heaviside(f-(fc(4)-b(4)/2))  - heaviside(f-(fc(4)+b(4)/2));
f5  = heaviside(f-(fc(5)-b(5)/2))  - heaviside(f-(fc(5)+b(5)/2));
f6  = heaviside(f-(fc(6)-b(6)/2))  - heaviside(f-(fc(6)+b(6)/2));
f7  = heaviside(f-(fc(7)-b(7)/2))  - heaviside(f-(fc(7)+b(7)/2));
f8  = heaviside(f-(fc(8)-b(8)/2))  - heaviside(f-(fc(8)+b(8)/2));
f9  = heaviside(f-(fc(9)-b(9)/2))  - heaviside(f-(fc(9)+b(9)/2));
f10 = heaviside(f-(fc(10)-b(10)/2)) - heaviside(f-(fc(10)+b(10)/2));
%-------------------------------------------------------------------------
plot(f,f1,f,f2,f,f3,f,f4,f,f5,f,f6,f,f7,f,f8,f,f9,f,f10);
axis([0 9500 0 2]);
xlabel('Frequency (Hertz)');
ylabel('Magnitude');

Code 2: FIR Filter Bank


%% FIR BandPass Filter Bank
%-------------------------------------------------------------------------
n = 5;          % Number of filters
alpha = 1.45;   % logarithmic growth coefficient of filters
% Bandwidths of each of the 5 filters
b = [442.05 640.97 929.41 1347.65 1954.09];
% Center frequencies of each BPF
fc = [1181.14 1722.65 2507.84 3646.37 5297.24 7450];
%-------------------------------------------------------------------------
% Calculate -3dB cutoff frequencies using center freqs & bandwidths
fcut = zeros(1,n+1);        % Preallocate cutoff freq vector
fcut(1) = fc(1) - b(1)/2;   % Solve for cutoff freqs using center
for i = 2:(n+1)             % freqs and bandwidths
    fcut(i) = fcut(i-1) + b(i-1);
end
%-------------------------------------------------------------------------
% Normalize frequencies for coefficient calculation
Nq = 8000;          % Nyquist frequency of 8 kHz
fs = 2*Nq;          % sampling frequency of 16 kHz
fn_c = fc/Nq;       % normalized center freqs
fn_ct = fcut/Nq;    % normalized -3dB cutoff freqs
%-------------------------------------------------------------------------
% Calc filter coefficients for Nth-order FIR bandpass filters
x = 500;            % Number of points for plots
N = 50;             % Order of filter
B = zeros(6,N+1);   % Preallocate coefficients array
%-------------------------------------------------------------------------
% 1st Filter
B1 = fir1(N,[fn_ct(1) fn_c(2)]);  % Calc coeffs for 1st filter
[H, F1] = freqz(B1,1,x,fs);       % Store into transfer function
H1 = abs(H);                      % Absolute value for mag resp
M1 = 20*log(H1);                  % Mag resp in dB
% 2nd Filter
B2 = fir1(N,[fn_ct(2) fn_c(3)]);
[H, F2] = freqz(B2,1,x,fs);
H2 = abs(H);
M2 = 20*log(H2);
% 3rd Filter
B3 = fir1(N,[fn_ct(3) fn_c(4)]);
[H, F3] = freqz(B3,1,x,fs);
H3 = abs(H);
M3 = 20*log(H3);
% 4th Filter
B4 = fir1(N,[fn_ct(4) fn_c(5)]);
[H, F4] = freqz(B4,1,x,fs);
H4 = abs(H);
M4 = 20*log(H4);
% 5th Filter
B5 = fir1(N,[fn_ct(5) fn_c(6)]);
[H, F5] = freqz(B5,1,x,fs);
H5 = abs(H);
M5 = 20*log(H5);
%-------------------------------------------------------------------------
% Collect all coefficient vectors into one array
for j = 1:length(B1)
    B(1,j) = B1(1,j);
    B(2,j) = B2(1,j);
    B(3,j) = B3(1,j);
    B(4,j) = B4(1,j);
    B(5,j) = B5(1,j);
end
% Plot all filters on the same plot
figure; hold on
plot(F1,M1)
plot(F2,M2)
plot(F3,M3)
plot(F4,M4)
plot(F5,M5)
hold off
title('FIR Filter Bank');
xlabel('Frequency (Hertz)');
ylabel('Magnitude (dB)');
ylim([-200 50])
%-------------------------------------------------------------------------
% Save coefficients
save('FIR_Co_16khz','B1','B2','B3','B4','B5','B','N');
fileID1 = fopen('FIR_Coeff','w');
for i = 1:5
    for j = 1:N+1
        if j == N+1
            fprintf(fileID1,'%f \n \n',B(i,j));
        else
            fprintf(fileID1,'%f,',B(i,j));
        end
        if mod(j,10) == 0
            fprintf(fileID1,'\n');
        end
    end
end
fclose(fileID1);

Code 3: Pre-emphasis Filter


% Pre-emphasis Filter
Fs = 16000;
N = 512;            % number of frequency points (N is carried over from the
                    % filter-bank script; defined here so the snippet runs alone)
B = [1 -0.97];      % pre-emphasis coefficients
[H,F] = freqz(B,1,N,Fs);
H = abs(H);         % absolute value for magnitude response
M = 20*log(H);      % magnitude response in dB
plot(F,M);
title('Pre-emphasis Filter Response');
xlabel('Frequency (Hertz)');
ylabel('Magnitude (dB)');
grid on;

Code 4: FFT Analysis


clear all close all close all hidden clc %------------------------------------------------------------------------% Variables Fs = 48000; % Sampling frequency (in hertz) Ts = 2; % Sampling time (in sec) l = 1; % Word size (in sec) ds = 2; % Downsampling conversion factor



trials = 3; % Number of word templates bRes = 12; % Bit resolution Q = 2/(2^(bRes)); % Quantization levels w = 100; % Window Length thres = 0.005; % Threshold of Word Recognition prompts1 = 'Press ENTER and begin saying training word. \n'; prompts2 = 'Press ENTER and repeat the training word. \n'; prompts3 = 'Press ENTER and say training word one last time. \n'; word = zeros(1,l*Fs*Ts/ds); % Zero pad word X2fft = zeros(1, Fs/(2*ds)); % Zero pad FFT plot %------------------------------------------------------------------------% Prompt User Which FFT to Create %========================================================================= disp('Voice Recognition Training.'); disp('[1] Open'); disp('[2] Close'); disp('[3] GO'); disp('[4] STOP'); keywd=int32(input('Please select which PSD to display \n')); %-------------------------------------------------------------------------% % Detect Beginning of the Word %========================================================================= for k = 1:trials % Check keyword and load corresponding word template %---------------------------------------------------------if keywd == 1 if k == 1 load('OPEN1') end if k == 2 load('OPEN2') end if k == 3 load('OPEN3') end qmic = OPEN; end %---------------------if keywd == 2 if k == 1 load('CLOSE1') end if k == 2 load('CLOSE2') end if k == 3 load('CLOSE3') end qmic = CLOSE; end %---------------------if keywd == 3 if k == 1 load('GO1') end if k == 2 load('GO2')



end if k == 3 load('GO3') end qmic = GO;

end %---------------------if keywd == 4 if k == 1 load('STOP1') end if k == 2 load('STOP2') end if k == 3 load('STOP3') end qmic = STOP; end ptr = 1; % Initialize pointer.

ave1 = mean(qmic(ptr:ptr+w)); % Initialization of average windows. ave2 = ave1; % Go through the sound until the difference between the average of two % adjacent windows is significant. check = 1; error = 1; while check if abs(ave1-ave2) > thres check = 0; end if (ptr + 2*w > Ts*Fs/ds) check = 0; disp '[!] Error: No Word Detected.'; error = 0; end if check ptr = ptr + w; ave2 = ave1; ave1 = mean(abs(qmic(ptr:ptr+w))); end end if error word = qmic(ptr:((ptr-1)+ l*Fs/ds)); % Store the detected word %-------------------------------------------------------------------------% % Perform DSP %========================================================================= Xfft = abs(fft(word)); % Find FFT of data X2fft = Xfft(1:end/2).^2 + X2fft; % Square FFT data to get PSD end end X2fft = X2fft/trials; % Average PSD's FFT's mag = 20*log10(X2fft); % Convert into dB magnitude if error



figure; plot(mag); if keywd == 1 title('PSD of Waveform "Open"'); end if keywd == 2 title('PSD of Waveform "Close"'); end if keywd == 3 title('PSD of Waveform "GO"'); end if keywd == 4 title('PSD of Waveform "STOP"'); end xlabel('Frequency (Hertz)'); ylabel('Magnitude (dB)'); end


Code 5: Saving Word Templates


%% Speech Storage and Waveform Testing %-------------------------------------------------------------------------% This m-file prompts the user to select a 'training' word. The training % word will be stated and stored 3 times. This allows us to have to have % consistent testing since words can never be iterated exactly the same. % This way we can perform FFT's and have a knowledge of what our filter % output should look like. %-------------------------------------------------------------------------% Clear old graphs and command history clear all close all close all hidden clc %-------------------------------------------------------------------------% Variables Fs = 48000; % Sampling frequency (in hertz) Ts = 3; % Sampling time (in sec) l = 1; % Length (sec) of stored word after detection ds = 2; % Downsampling conversion factor bRes = 12; % Bit resolution Q = 2/2^(bRes); % Number of Quantization levels trials = 1; % Number of recording trials sampL = Ts*Fs/ds; % Number of samples in recording OPEN = zeros(trials,sampL); % Zeros pad words CLOSE = zeros(trials,sampL); GO = zeros(trials,sampL); STOP = zeros(trials,sampL); prompts1 = 'Press ENTER and begin saying training word. \n'; prompts2 = 'Press ENTER and repeat the training word. \n'; prompts3 = 'Press ENTER and say training word one last time. \n'; %-------------------------------------------------------------------------% Sample Mic, Decimate, & Quantize %========================================================================= % Set up Mic Input AI = analoginput('winsound'); addchannel(AI, 1);



set (AI, 'SampleRate', Fs); set(AI, 'SamplesPerTrigger', Ts*Fs); disp('Voice Recognition Training.'); disp('[1] Open'); disp('[2] Close'); disp('[3] GO'); disp('[4] STOP'); disp('[5] FAN'); keywd=int32(input('Please select one of the above keywords to train. \n')); %------------------------------------------------------------------------start(AI); % start the acquisition mic = getdata(AI); % Retrieve all the data dmic = decimate(mic,ds); % Downsample the data [x, qmic] = quantiz(dmic, -1:Q:1-Q, -1:Q:1); % Quantize the sound if keywd == 1 OPEN(1:length(qmic))= qmic; plot(OPEN) end if keywd == 2 CLOSE(1:length(qmic))= qmic; plot(CLOSE) end if keywd == 3 GO(1:length(qmic))= qmic; plot(GO) end if keywd == 4 STOP(1:length(qmic))= qmic; plot(STOP) end if keywd == 5 FAN(1:length(qmic))= qmic; plot(FAN) end save('OPEN1', 'OPEN'); save('CLOSE1', 'CLOSE'); save('GO1', 'GO'); save('STOP1', 'STOP'); save('FAN1', 'FAN'); save('BEGIN1', 'BEGIN');

Code 6: FIR Filter Testing


%% FIR Filter Testing %-------------------------------------------------------------------------% Clear old graphs and command history clear all close all close all hidden clc %-------------------------------------------------------------------------% Load FIR Filter Coefficients load('FIR_Co_24khz'); % Coefficients and number of taps (N) %------------------------------------------------------------------------% Variables



Fs = 48000; % Sampling frequency (in hertz) Ts = 3; % Sampling time (in sec) l = 2; % Length (sec) of stored word after detection ds = 2; % Downsampling conversion factor trials = 3; bRes = 12; % Bit resolution rec = l*Fs/ds; % Number of samples in recording n = 5; % Number of FIR filters Q = 2/(2^(bRes)); % Quantization levels w = 100; % Window Length thres = 0.0005; % Threshold of Word Recognition pts = 100; % Number of Points for each filter FingerPrint win = floor((rec-(N+1))/(pts)+0.5); % Size of each window length word = zeros(1,Fs/ds); % Zero pad word FP1 = zeros(n,pts); % Zero Pad fingerprint arrays FP2 = zeros(n,pts); FP3 = zeros(n,pts); %------------------------------------------------------------------------% Prompt User Which Finger to Generate %========================================================================= disp('Voice Recognition Training.'); disp('[1] Open'); disp('[2] Close'); disp('[3] GO'); disp('[4] STOP'); disp('[5] FAN'); keywd=int32(input('Please Select a Reference Fingerprint to Generate\n')); %-------------------------------------------------------------------------% check = 1; error = 0; if (keywd == 1) load('OPEN1'); qmic = OPEN; end if (keywd == 2) load('CLOSE1'); qmic = CLOSE; end if (keywd == 3) load('GO1'); qmic = GO; end if (keywd == 4) load('STOP1'); qmic = STOP; end if (keywd == 5) load('FAN1'); qmic = FAN; end %-------------------------------------------------------------------------% % Detect Beginning of the Word %========================================================================= ptr = 1; % Initialize pointer.


ave1 = mean(qmic(ptr:ptr+w)); % Initialization of average windows. ave2 = ave1; % Go through the sound until the difference between the average of two % adjacent windows is significant. while check if abs(ave1-ave2) > thres check = 0; end if (ptr + 2*w > Ts*Fs/ds) check = 0; disp '[!] Error: No Word Detected.'; error = 1; end if check ptr = ptr + w; ave2 = ave1; ave1 = mean(abs(qmic(ptr:ptr+w))); end end if ~error word = qmic(ptr:((ptr-1)+ l*Fs/ds)); % Store the detected word %-------------------------------------------------------------------------% % FIR Filtering & Fingerprint Generation %========================================================================= % Apply Preemphasis Filter to Word % Note: Eliminates the -6dB per octave decay of the spectral energy for j = 2:rec s(j) = word(j) - 0.97*word(j - 1); end out1 = (filter(B1, 1, s)).^2; out2 = (filter(B2, 1, s)).^2; out3 = (filter(B3, 1, s)).^2; out4 = (filter(B4, 1, s)).^2; out5 = (filter(B5, 1, s)).^2; %out6 = (filter(B6, 1, s)).^2; % Display the reference fingerprint. % Note: only half of the fft is displayed since the fft of a real signal % is half redundant. figure('Name','Reference Fingerprint','NumberTitle','off'); subplot(n,1,1); plot(out1); subplot(n,1,2); plot(out2); subplot(n,1,3); ylabel ('Amplitude'); plot(out3); subplot(n,1,4); plot(out4); subplot(n,1,5); plot(out5); xlabel ('\omega \times N \div 4\pi'); end %------------------------------------------------------------------------%


APPENDIX B Ordering Receipts


APPENDIX C: DE2 Board Source Code

Globals.c


#include "globals.h"

/* global variables */
volatile int record, play, buffer_index;   // audio variables
volatile int left_buffer[BUF_SIZE];        // audio buffer
volatile int right_buffer[BUF_SIZE];       // audio buffer
volatile char byte1, byte2, byte3;         // PS/2 variables
volatile int timeout;                      // used to synchronize with the timer

Media Interrupt.c
#include "nios2_ctrl_reg_macros.h" /* these globals are written by interrupt service routines; we have to declare * these as volatile to avoid the compiler caching their values in registers */ extern volatile char byte1, byte2, byte3; /* modified by PS/2 interrupt service routine */ extern volatile int record, play, buffer_index; // used for audio extern volatile int timeout; // used to synchronize with the timer /* function prototypes */ void LCD_cursor( int, int ); void LCD_text( char * ); void LCD_cursor_off( void ); void VGA_text (int, int, char *); void VGA_box (int, int, int, int, short); void HEX_PS2(char, char, char); /* Start audio saving on SRAM address 08040000 */ /******************************************************************************** * This program demonstrates use of the media ports in the DE2 Media Computer * * It performs the following: * 1. records audio for about 10 seconds when an interrupt is generated by * pressing KEY[1]. LEDG[0] is lit while recording. Audio recording is * controlled by using interrupts * 2. plays the recorded audio when an interrupt is generated by pressing * KEY[2]. LEDG[1] is lit while playing. Audio playback is controlled by * using interrupts * 3. Draws a blue box on the VGA display, and places a text string inside * the box. Also, moves the word ALTERA around the display, "bouncing" off * the blue box and screen edges * 4. Shows a text message on the LCD display

54

Speech Recognition Using FPGA


* 5. Displays the last three bytes of data received from the PS/2 port * on the HEX displays on the DE2 board. The PS/2 port is handled using * interrupts * 6. The speed of scrolling the LCD display and of refreshing the VGA screen * are controlled by interrupts from the interval timer ********************************************************************************/ int main(void) { /* Declare volatile pointers to I/O registers (volatile means that IO load and store instructions will be used to access these pointer locations, instead of regular memory loads and stores) */ volatile int * interval_timer_ptr = (int *) 0x10002000; // interal timer base address volatile int * KEY_ptr = (int *) 0x10000050; // pushbutton KEY address volatile int * audio_ptr = (int *) 0x10003040; // audio port address volatile int * PS2_ptr = (int *) 0x10000100; // PS/2 port address volatile int * pin_ptr = (int *) 0x10000064; // header pins address /* initialize some variables */ byte1 = 0; byte2 = 0; byte3 = 0; record = 0; play = 0; buffer_index = 0; timeout = 0; synchronize with the timer // used to hold PS/2 data // used for audio record/playback //

/* these variables are used for a blue box and a "bouncing" ALTERA on the VGA screen */ int ALT_x1; int ALT_x2; int ALT_y; int ALT_inc_x; int ALT_inc_y; int blue_x1; int blue_y1; int blue_x2; int blue_y2; int screen_x; int screen_y; int char_buffer_x; int char_buffer_y; short color; /* set the interval timer period for scrolling the HEX displays */ int counter = 0x960000; // 1/(50 MHz) x (0x960000) ~= 200 msec *(interval_timer_ptr + 0x2) = (counter & 0xFFFF); *(interval_timer_ptr + 0x3) = (counter >> 16) & 0xFFFF; /* start interval timer, enable its interrupts */ *(interval_timer_ptr + 1) = 0x7; // STOP = 0, START = 1, CONT = 1, ITO = 1 *(KEY_ptr + 2) = 0xE; register, and bits to 1 (bit 0 is Nios II reset) */ *(PS2_ptr) = 0xFF; *(PS2_ptr + 1) = 0x1; enable interrupts */ NIOS2_WRITE_IENABLE( 0xC3 ); (interval (pushbuttons), 6 (audio), and 7 (PS/2) */ /* reset */ /* write to the PS/2 Control register to /* set interrupt mask bits for levels 0 * timer), 1 /* write to the pushbutton interrupt mask * set 3 mask



NIOS2_WRITE_STATUS( 1 ); // enable Nios II interrupts

/* create a messages to be displayed on the VGA and LCD displays */ char text_top_LCD[60] = "Audio Record \0"; char text_top_VGA[20] = "Altera DE2\0"; char text_bottom_VGA[20] = "Media Computer\0"; char text_ALTERA[10] = "ALTERA\0"; char text_erase[10] = " \0"; /* output text message to the LCD */ LCD_cursor (0,0); // set LCD cursor location to top row LCD_text (text_top_LCD); LCD_cursor_off (); *(pin_ptr) = 0xffffffff; // turn off the LCD cursor /* the following variables give the size of the pixel buffer */ screen_x = 319; screen_y = 239; color = 0x1863; // a dark grey color VGA_box (0, 0, screen_x, screen_y, color); // fill the screen with grey // draw a medium-blue box around the above text, based on the character buffer coordinates blue_x1 = 28; blue_x2 = 52; blue_y1 = 26; blue_y2 = 34; // character coords * 4 since characters are 4 x 4 pixel buffer coords (8 x 8 VGA coords) color = 0x187F; // a medium blue color VGA_box (blue_x1 * 4, blue_y1 * 4, blue_x2 * 4, blue_y2 * 4, color); /* output text message in the middle of the VGA monitor */ VGA_text (blue_x1 + 5, blue_y1 + 3, text_top_VGA); VGA_text (blue_x1 + 5, blue_y1 + 4, text_bottom_VGA); char_buffer_x = 79; char_buffer_y = 59; ALT_x1 = 0; ALT_x2 = 5/* ALTERA = 6 chars */; ALT_y = 0; ALT_inc_x = 1; ALT_inc_y = 1; VGA_text (ALT_x1, ALT_y, text_ALTERA); while (1) { while (!timeout) ; // wait to synchronize with timer /* move the ALTERA text around on the VGA screen */ VGA_text (ALT_x1, ALT_y, text_erase); // erase ALT_x1 += ALT_inc_x; ALT_x2 += ALT_inc_x; ALT_y += ALT_inc_y; if ( (ALT_y == char_buffer_y) || (ALT_y == 0) ) ALT_inc_y = -(ALT_inc_y); if ( (ALT_x2 == char_buffer_x) || (ALT_x1 == 0) ) ALT_inc_x = -(ALT_inc_x); if ( (ALT_y >= blue_y1 - 1) && (ALT_y <= blue_y2 + 1) ) { if ( ((ALT_x1 >= blue_x1 - 1) && (ALT_x1 <= blue_x2 + 1)) || ((ALT_x2 >= blue_x1 - 1) && (ALT_x2 <= blue_x2 + 1)) ) { if ( (ALT_y == (blue_y1 - 1)) || (ALT_y == (blue_y2 + 1)) )



ALT_inc_y = -(ALT_inc_y); else ALT_inc_x = -(ALT_inc_x); } } VGA_text (ALT_x1, ALT_y, text_ALTERA); /* display PS/2 data (from interrupt service routine) on HEX displays */ HEX_PS2 (byte1, byte2, byte3); timeout = 0; } } /**************************************************************************************** * Subroutine to move the LCD cursor ****************************************************************************************/ void LCD_cursor(int x, int y) { volatile char * LCD_display_ptr = (char *) 0x10003050; // 16x2 character display char instruction; instruction = x; if (y != 0) instruction |= 0x40; instruction |= 0x80; to set the cursor location *(LCD_display_ptr) = instruction; register } // set bit 6 for bottom row // need to set bit 7 // write to the LCD instruction

/**************************************************************************************** * Subroutine to send a string of text to the LCD ****************************************************************************************/ void LCD_text(char * text_ptr) { volatile char * LCD_display_ptr = (char *) 0x10003050; // 16x2 character display while ( *(text_ptr) ) { *(LCD_display_ptr + 1) = *(text_ptr); ++text_ptr; } } /**************************************************************************************** * Subroutine to turn off the LCD cursor ****************************************************************************************/ void LCD_cursor_off(void) { volatile char * LCD_display_ptr = (char *) 0x10003050; // 16x2 character display *(LCD_display_ptr) = 0x0C; // turn off the LCD cursor } /**************************************************************************************** * Subroutine to send a string of text to the VGA monitor ****************************************************************************************/ void VGA_text(int x, int y, char * text_ptr) {

// write to the LCD data register



int offset; volatile char * character_buffer = (char *) 0x09000000; // VGA character buffer /* assume that the text string fits on one line */ offset = (y << 7) + x; while ( *(text_ptr) ) { *(character_buffer + offset) = *(text_ptr); buffer ++text_ptr; ++offset; } } /**************************************************************************************** * Draw a filled rectangle on the VGA monitor ****************************************************************************************/ void VGA_box(int x1, int y1, int x2, int y2, short pixel_color) { int offset, row, col; volatile short * pixel_buffer = (short *) 0x08000000; // VGA pixel buffer /* assume that the box coordinates are valid */ for (row = y1; row <= y2; row++) { col = x1; while (col <= x2) { offset = (row << 9) + col; *(pixel_buffer + offset) = pixel_color; address, set pixel ++col; } } }

// write to the character

// compute halfword

/**************************************************************************************** * Subroutine to show a string of HEX data on the HEX displays ****************************************************************************************/ void HEX_PS2(char b1, char b2, char b3) { volatile int * HEX3_HEX0_ptr = (int *) 0x10000020; volatile int * HEX7_HEX4_ptr = (int *) 0x10000030; /* SEVEN_SEGMENT_DECODE_TABLE gives the on/off settings for all segments in * a single 7-seg display in the DE2 Media Computer, for the hex digits 0 - F */ unsigned char seven_seg_decode_table[] = { 0x3F, 0x06, 0x5B, 0x4F, 0x66, 0x6D, 0x7C, 0x07, 0x7F, 0x67, 0x77, 0x7C, 0x39, 0x5E, 0x79, 0x71 }; unsigned char hex_segs[] = { 0, 0, 0, 0, 0, 0, 0, 0 }; unsigned int shift_buffer, nibble; unsigned char code; int i; shift_buffer = (b1 << 16) | (b2 << 8) | b3; for ( i = 0; i < 6; ++i ) {



nibble = shift_buffer & 0x0000000F; rightmost nibble code = seven_seg_decode_table[nibble]; hex_segs[i] = code; shift_buffer = shift_buffer >> 4; } /* drive the hex displays */ *(HEX3_HEX0_ptr) = *(int *) (hex_segs); *(HEX7_HEX4_ptr) = *(int *) (hex_segs+4); } // character is in

Audio.c (Main)
#include "globals.h" #include <stdio.h> #include <math.h> /* globals used for audio record/playback */ extern volatile int record, play, buffer_index; extern volatile int left_buffer[]; extern volatile int right_buffer[]; void Euclidean_Dist(int i, int f_num, int *w, int *x, int *y); /* Function Prototype */ void PreEmphasis(int p, int *z); /* Function Prototype */ void averaging(int *a, int *b, int *c, int *d); int best_match(void); // function prototype void FIR_Filter(int trial, int samp_length, float B[], int *samp, int *out); volatile volatile volatile volatile volatile int int int int int * * * * * d1; d2; d3; d4; d5;

int taps = 50; int which_word; int trial; long int dist[5][2]; long int *d = &dist[0][0]; float B1[] = {-0.001085,-0.000904,-0.000504,-0.000093,-0.000067,-0.000898,-0.002737,0.004944,-0.005919,-0.003580,0.003497,0.014749,0.026864,0.034309,0.031310,0.014672,0.013812, -0.046694,-0.072710,-0.080748,-0.064499,0.025774,0.025145,0.072614,0.101135,0.101135,0.072614,0.025145,-0.025774,-0.064499,0.080748,-0.072710,-0.046694,-0.013812, 0.014672,0.031310,0.034309,0.026864,0.014749,0.003497,-0.003580,0.005919,-0.004944,-0.002737,-0.000898,-0.000067,-0.000093,-0.000504,-0.000904,0.001085}; float B2[] = {-0.001713,-0.001456,0.000013,0.002355,0.004189,0.003689,0.000365,0.003649,-0.004955,-0.002540,-0.000016,-0.002727,-0.010978,-0.016151,0.006105,0.022034,0.052881, 0.059114,0.022610,-0.045258,-0.103633,-0.107849,0.044031,0.055007,0.128266,0.128266,0.055007,-0.044031,-0.107849,-0.103633,0.045258,0.022610,0.059114,0.052881,



0.022034,-0.006105,-0.016151,-0.010978,-0.002727,-0.000016,0.002540,-0.004955,-0.003649,0.000365,0.003689,0.004189,0.002355,0.000013,-0.001456,0.001713}; float B3[] = {-0.001269,0.000886,0.001925,0.000690,-0.000448,0.000788,0.000779,0.004915,-0.009409,-0.000526,0.015642,0.015917,-0.003276,-0.014281,-0.004246,-0.002627,0.025212, -0.023980,0.041228,0.100548,0.039971,-0.111136,-0.165201,0.020003,0.168760,0.168760,-0.020003,-0.165201,-0.111136,0.039971,0.100548,0.041228,0.023980,-0.025212, -0.002627,-0.004246,-0.014281,-0.003276,0.015917,0.015642,0.000526,-0.009409,-0.004915,0.000779,0.000788,-0.000448,0.000690,0.001925,0.000886,0.001269}; float B4[] = { 0.001002,-0.001963,-0.000796,0.001198,-0.000086,0.002717,0.001160,0.009047,-0.000978,0.010347,-0.000102,0.002654,-0.001798,-0.027391,0.008830,0.041886,0.012494, -0.017569,-0.006177,-0.053593,0.060377,0.143398,-0.137512,0.202763,0.198274,0.198274,-0.202763,-0.137512,0.143398,0.060377,-0.053593,-0.006177,0.017569,-0.012494, 0.041886,0.008830,-0.027391,-0.001798,0.002654,0.000102,0.010347,-0.000978,-0.009047,0.001160,0.002717,-0.000086,0.001198,-0.000796,0.001963,0.001002}; float B5[] = {-0.000335,0.000150,-0.001840,0.003022,-0.001237,-0.000785,0.002136,0.006948,-0.004037,-0.004442,0.002871,0.009414,-0.008703,-0.012707,0.022384,0.000190,-0.012552, -0.026198,0.070681,-0.045170,-0.007926,-0.049572,0.228825,0.324275,0.157310,0.157310,-0.324275,0.228825,-0.049572,-0.007926,-0.045170,0.070681,0.026198,-0.012552, -0.000190,0.022384,-0.012707,-0.008703,0.009414,0.002871,0.004442,-0.004037,0.006948,-0.002136,-0.000785,-0.001237,0.003022,-0.001840,0.000150,0.000335}; /*************************************************************************************** * Pushbutton - Interrupt Service Routine * * This routine checks which KEY has been pressed. If it is KEY1 or KEY2, it writes this * value to the global variable key_pressed. 
If it is KEY3 then it loads the SW switch * values and stores in the variable pattern ****************************************************************************************/ void audio_ISR( void ) { volatile int * SW_ptr = (int *) 0x10000040; // SW slider switches base address volatile int * pin_ptr = (int *) 0x10000064; // expansion pins base address volatile int * red_LED_ptr = (int *) 0x10000000; // red LED address volatile int * audio_ptr = (int *) 0x10003040; // audio port address volatile int * green_LED_ptr = (int *) 0x10000010; // green LED address volatile int * initial = (int *) 0x130000; // Starting address for saving data

volatile volatile volatile volatile volatile volatile

int int int int int int

* * * * * *

temp_saving; l_start_saving; signal; temp; temp2; temp3;



    volatile int * starting;
    volatile int * recognize;
    volatile int * wordG1, * wordG2, * wordG3, * wordG4;
    volatile int * wordS1, * wordS2, * wordS3, * wordS4;
    volatile int * wordO1, * wordO2, * wordO3, * wordO4;
    volatile int * wordC1, * wordC2, * wordC3, * wordC4;
    volatile int * check;
    volatile int * P1, * P2, * P3, * P4, * P5, * P6, * P7, * P8, * P9, * P10,
                 * P11, * P12, * P13, * P14, * P15, * P16, * P17, * P18, * P19, * P20;
    temp2 = 0x0804027c;
    temp = 0x08040278;      // starting of word
    temp3 = 0x08040280;     // distance between two words
    check = 0x08040270;     // pre-emphasis filter check
    d1 = 0x8040290;         // difference of filter 1
    d2 = d1+1;              // difference of filter 2
    d3 = d2+1;              // difference of filter 3
    d4 = d3+1;              // difference of filter 4
    d5 = d4+1;              // difference of filter 5
    signal = 0x3df0;        // starting of buffer
    int SW_value;
    signed long int sum;
    signed long int sum2;
    int P_in;
    int i;
    int k;
    int m;
    int n;
    int Rmode;
    int dist;
    int distance;
    int dif1;
    int dif2;
    int dif3;
    int dif4;
    int matches;
    int fifospace, leftdata, rightdata;
    SW_value = *(SW_ptr);
    if (*(audio_ptr) & 0x100)       // check bit RI of the Control register
    {
        int shift;
        int shift2;
        m = 0;
        n = 0;
        Rmode = 0;
        matches = 0;
        P_in = 8000;
        P1 = 0x250000;
        P2 = P1 + P_in;
        P3 = P2 + P_in;
        P4 = P3 + P_in;
        P5 = P4 + P_in;
        P6 = P5 + P_in;
        P7 = P6 + P_in;
        P8 = P7 + P_in;
        P9 = P8 + P_in;
        P10 = P9 + P_in;
        P11 = P10 + P_in;
        P12 = P11 + P_in;
        P13 = P12 + P_in;
        P14 = P13 + P_in;
        P15 = P14 + P_in;
        P16 = P15 + P_in;
        P17 = P16 + P_in;
        P18 = P17 + P_in;
        P19 = P18 + P_in;
        P20 = P19 + P_in;
        wordG4 = 0x15ee00;
        wordS4 = 0x19d600;
        wordO4 = 0x1dbe00;
        wordC4 = 0x21a600;
        *(pin_ptr) = 0xffffffff;
        if (buffer_index == 0)
            temp_saving = 0x130000;     // starting address of saving words
        l_start_saving = temp_saving;
        if (SW_value == 0x1)
        {
            l_start_saving = temp_saving;
            *(red_LED_ptr) = 0x1;
            char text_top_LCD[60] = "Rec GO word \0";
            LCD_cursor (0,0);       // set LCD cursor location to top row
            LCD_text (text_top_LCD);
            LCD_cursor_off ();
            wordG1 = 0x130000;      // save a pointer for first word starting address
            which_word = 1;
            trial = 1;
        }
        else if (SW_value == 2)
        {
            temp_saving = temp_saving + 16000;      // fa00
            l_start_saving = temp_saving;
            *(red_LED_ptr) = 0x2;
            char text_top_LCD[60] = "Rec GO word \0";
            LCD_cursor (0,0);       // set LCD cursor location to top row
            LCD_text (text_top_LCD);
            LCD_cursor_off ();
            wordG2 = 0x13fa00;
            which_word = 1;
            trial = 2;
        }
        else if (SW_value == 4)
        {
            temp_saving = temp_saving + 32000;      // 1f400
            l_start_saving = temp_saving;
            *(red_LED_ptr) = 0x4;
            char text_top_LCD[60] = "Rec GO word \0";
            LCD_cursor (0,0);       // set LCD cursor location to top row
            LCD_text (text_top_LCD);
            LCD_cursor_off ();
            wordG3 = 0x14f400;
            which_word = 1;
            trial = 3;
        }
        // address of average result: temp_saving + 48000 ; 2ee00
        else if (SW_value == 0x8)
        {
            temp_saving = temp_saving + 64000;      // 3e800
            l_start_saving = temp_saving;
            *(red_LED_ptr) = 0x8;
            char text_top_LCD[60] = "Rec STOP word \0";
            LCD_cursor (0,0);       // set LCD cursor location to top row
            LCD_text (text_top_LCD);
            LCD_cursor_off ();
            wordS1 = 0x16e800;
            which_word = 2;
            trial = 1;
        }
        else if (SW_value == 0x10)
        {
            temp_saving = temp_saving + 80000;      // 4e200
            l_start_saving = temp_saving;
            *(red_LED_ptr) = 0x10;
            char text_top_LCD[60] = "Rec STOP word \0";
            LCD_cursor (0,0);       // set LCD cursor location to top row
            LCD_text (text_top_LCD);
            LCD_cursor_off ();
            wordS2 = 0x17e200;
            which_word = 2;
            trial = 2;
        }
        else if (SW_value == 0x20)
        {
            temp_saving = temp_saving + 96000;      // 5dc00
            l_start_saving = temp_saving;
            *(red_LED_ptr) = 0x20;
            char text_top_LCD[60] = "Rec STOP word \0";
            LCD_cursor (0,0);       // set LCD cursor location to top row
            LCD_text (text_top_LCD);
            LCD_cursor_off ();
            wordS3 = 0x18dc00;
            which_word = 2;
            trial = 3;
        }
        // address of average result: temp_saving + 112000 ; 6d600
        else if (SW_value == 0x40)
        {
            temp_saving = temp_saving + 128000;     // 7d000
            l_start_saving = temp_saving;
            *(red_LED_ptr) = 0x40;
            char text_top_LCD[60] = "Rec OPEN word \0";
            LCD_cursor (0,0);       // set LCD cursor location to top row
            LCD_text (text_top_LCD);
            LCD_cursor_off ();
            wordO1 = 0x1ad000;
        }
        else if (SW_value == 0x80)
        {
            temp_saving = temp_saving + 144000;     // 8ca00
            l_start_saving = temp_saving;
            *(red_LED_ptr) = 0x80;
            char text_top_LCD[60] = "Rec OPEN word \0";
            LCD_cursor (0,0);       // set LCD cursor location to top row
            LCD_text (text_top_LCD);
            LCD_cursor_off ();
            wordO2 = 0x1bca00;
        }
        else if (SW_value == 0x100)
        {
            temp_saving = temp_saving + 160000;     // 9c400
            l_start_saving = temp_saving;
            *(red_LED_ptr) = 0x100;
            char text_top_LCD[60] = "Rec OPEN word \0";
            LCD_cursor (0,0);       // set LCD cursor location to top row
            LCD_text (text_top_LCD);
            LCD_cursor_off ();
            wordO3 = 0x1cc400;
        }
        // address of average result: temp_saving + 176000 ; abe00
        else if (SW_value == 0x200)
        {
            temp_saving = temp_saving + 192000;     // bb800
            l_start_saving = temp_saving;
            *(red_LED_ptr) = 0x200;
            char text_top_LCD[60] = "Rec CLOSE word \0";
            LCD_cursor (0,0);       // set LCD cursor location to top row
            LCD_text (text_top_LCD);
            LCD_cursor_off ();
            wordC1 = 0x1eb800;
        }
        else if (SW_value == 0x400)
        {
            temp_saving = temp_saving + 208000;     // cb200
            l_start_saving = temp_saving;
            *(red_LED_ptr) = 0x400;
            char text_top_LCD[60] = "Rec CLOSE word \0";
            LCD_cursor (0,0);       // set LCD cursor location to top row
            LCD_text (text_top_LCD);
            LCD_cursor_off ();
            wordC2 = 0x1fb200;
        }
        else if (SW_value == 0x800)
        {
            temp_saving = temp_saving + 224000;     // dac00 + 130000 = 20ac00
            l_start_saving = temp_saving;
            *(red_LED_ptr) = 0x800;
            char text_top_LCD[60] = "Rec CLOSE word \0";
            LCD_cursor (0,0);       // set LCD cursor location to top row
            LCD_text (text_top_LCD);
            LCD_cursor_off ();
            wordC3 = 0x20ac00;
        }
        // address of average result: temp_saving + 240000 ; ea600
        else if (SW_value == 0x3)   // recognizing mode
        {
            l_start_saving = 0x240000;
            temp_saving = 0x240000;
            *(red_LED_ptr) = 0x3;
            char text_top_LCD[60] = "Speak Now \0";
            LCD_cursor (0,0);       // set LCD cursor location to top row
            LCD_text (text_top_LCD);
            LCD_cursor_off ();
            recognize = 0x240000;   // save a pointer for starting address in recognizing mode
            which_word = 5;
        }
        else
        {
            temp_saving = temp_saving + 256000;     // fa000 + 130000 = 22a000
            l_start_saving = temp_saving;
            *(red_LED_ptr) = SW_value;
            char text_top_LCD[60] = "SWITCH ERROR \0";
            LCD_cursor (0,0);       // set LCD cursor location to top row
            LCD_text (text_top_LCD);
            LCD_cursor_off ();
        }
        fifospace = *(audio_ptr + 1);   // read the audio port fifospace register
        // store data until the audio-in FIFO is empty or the buffer is full
        while ( (fifospace & 0x000000FF) && (buffer_index < BUF_SIZE) )
        {
            left_buffer[buffer_index] = *(audio_ptr + 2);
            right_buffer[buffer_index] = *(audio_ptr + 3);
            ++buffer_index;
            if (buffer_index == BUF_SIZE)
            {
                // done recording
                record = 0;
                *(green_LED_ptr) = 0x0;     // turn off LEDG
                *(audio_ptr) = 0x0;         // turn off interrupts
                *(red_LED_ptr) = 0x0;       // turn off red led
                buffer_index = 0;
                sum = 0;
                i = 0;
                sum2 = 0;
                // start address 0x2120
                // ending address 0x126f40
                while (i < 960)
                {
                    for (k = 0; k < 100; k++)
                    {
                        sum += abs(*signal);
                        signal++;
                        signal++;
                    }
                    sum = sum/100;
                    if (sum > sum2)
                    {
                        *(temp2) = sum;             // save average into temp2
                        sum2 = sum;
                        *(temp) = signal - 200;     // save starting address into temp
                        starting = *(temp);         // starting points to beginning of word
                    }
                    if (sum2 > 21050000)
                    {
                        char text_top_LCD[60] = "Word Detected \0";
                        LCD_cursor (0,0);           // set LCD cursor location to top row
                        LCD_text (text_top_LCD);
                        LCD_cursor_off ();
                        i = 961;
                    }
                    i++;
                }
                if (sum2 < 21050000)    // if word not detected display lcd message
                {
                    char text_top_LCD[60] = "No Word Spoken \0";
                    LCD_cursor (0,0);   // set LCD cursor location to top row
                    LCD_text (text_top_LCD);
                    LCD_cursor_off ();
                }
                buffer_index = *(temp);     // starting address of word is held by temp
                *(check) = *(starting+1) >> 10;
                while (m < P_in)    // save 1 second of word into memory for future use
                {
                    shift = *(starting);
                    shift2 = shift >> 10;   // shift value by 10 to the right
                    *(starting) = shift2;
                    *(l_start_saving) = *(starting);
                    l_start_saving++;
                    starting++;
                    starting++;             // word is down sampled
                    starting++;
                    m++;
                }
                PreEmphasis(P_in, temp_saving);     // scales values by the 0.95 pre-emphasis factor
                if (which_word == 1)
                {
                    char text_top_LCD[60] = "Processing \0";
                    LCD_cursor (0,0);   // set LCD cursor location to top row
                    LCD_text (text_top_LCD);
                    LCD_cursor_off ();
                    FIR_Filter(trial, P_in, B1, temp_saving, P1);
                    FIR_Filter(trial, P_in, B2, temp_saving, P2);
                    FIR_Filter(trial, P_in, B3, temp_saving, P3);
                    FIR_Filter(trial, P_in, B4, temp_saving, P4);
                    FIR_Filter(trial, P_in, B5, temp_saving, P5);
                }
                if (which_word == 2)
                {
                    char text_top_LCD[60] = "Processing \0";
                    LCD_cursor (0,0);   // set LCD cursor location to top row
                    LCD_text (text_top_LCD);
                    LCD_cursor_off ();
                    FIR_Filter(trial, P_in, B1, temp_saving, P6);
                    FIR_Filter(trial, P_in, B2, temp_saving, P7);
                    FIR_Filter(trial, P_in, B3, temp_saving, P8);
                    FIR_Filter(trial, P_in, B4, temp_saving, P9);
                    FIR_Filter(trial, P_in, B5, temp_saving, P10);
                }
                if (which_word == 5)
                {
                    char text_top_LCD[60] = "Processing \0";
                    LCD_cursor (0,0);   // set LCD cursor location to top row
                    LCD_text (text_top_LCD);
                    LCD_cursor_off ();
                    FIR_Filter(trial, P_in, B1, temp_saving, P11);
                    FIR_Filter(trial, P_in, B2, temp_saving, P12);
                    FIR_Filter(trial, P_in, B3, temp_saving, P13);
                    FIR_Filter(trial, P_in, B4, temp_saving, P14);
                    FIR_Filter(trial, P_in, B5, temp_saving, P15);
                }
                char text_top_LCD[60] = "Ready \0";
                LCD_cursor (0,0);   // set LCD cursor location to top row
                LCD_text (text_top_LCD);
                LCD_cursor_off ();
                //averaging(wordG1, wordG2, wordG3, wordG4);
                //averaging(wordS1, wordS2, wordS3, wordS4);
                //averaging(wordO1, wordO2, wordO3, wordO4);
                //averaging(wordC1, wordC2, wordC3, wordC4);
                if (SW_value == 0x3)    // time domain distance of each value
                {
                    Euclidean_Dist(P_in, 1, P1, P6, P11);
                    //*(temp3) = distance;
                    //dif1 = distance;
                    Euclidean_Dist(P_in, 2, P2, P7, P12);
                    //dif2 = distance;
                    Euclidean_Dist(P_in, 3, P3, P8, P13);
                    //dif3 = distance;
                    Euclidean_Dist(P_in, 4, P4, P9, P14);
                    //dif4 = distance;
                    Euclidean_Dist(P_in, 5, P5, P10, P15);

                    matches = best_match();
                    if (matches == 0)
                    {
                        char text_top_LCD[60] = "Detected Fan \0";
                        LCD_cursor (0,0);   // set LCD cursor location to top row
                        LCD_text (text_top_LCD);
                        LCD_cursor_off ();
                        *(pin_ptr) = 0xfffffffe;
                    }
                    if (matches == 1)
                    {
                        char text_top_LCD[60] = "Detected Stop \0";
                        LCD_cursor (0,0);   // set LCD cursor location to top row
                        LCD_text (text_top_LCD);
                        LCD_cursor_off ();
                        *(pin_ptr) = 0xffffffff;
                    }
                }
            }
            fifospace = *(audio_ptr + 1);   // read the audio port fifospace register
        }
    }
    if (*(audio_ptr) & 0x200)   // check bit WI of the Control register
    {
        if (buffer_index == 0)
            *(green_LED_ptr) = 0x2;     // turn on LEDG_1
        fifospace = *(audio_ptr + 1);   // read the audio port fifospace register
        // output data until the buffer is empty or the audio-out FIFO is full
        while ( (fifospace & 0x00FF0000) && (buffer_index < BUF_SIZE) )
        {
            *(audio_ptr + 2) = left_buffer[buffer_index];
            *(audio_ptr + 3) = right_buffer[buffer_index];
            ++buffer_index;
            if (buffer_index == BUF_SIZE)
            {
                // done playback
                play = 0;
                *(red_LED_ptr) = 0x0;       // turn off
                *(green_LED_ptr) = 0x0;     // turn off LEDG
                *(audio_ptr) = 0x0;         // turn off interrupts
                char text_top_LCD[60] = "Done Playback \0";
                LCD_cursor (0,0);           // set LCD cursor location to top row
                LCD_text (text_top_LCD);
                LCD_cursor_off ();
            }
            fifospace = *(audio_ptr + 1);   // read the audio port fifospace register
        }
    }
    return;
}

/**************************************************************/
/* difference equation for two words */
void Euclidean_Dist(int i, int f_num, int *w, int *x, int *y)
{
    /* This function calculates the Euclidean distance between two arrays of length i. */
    int j = 0;
    int *x2;
    int *y2;
    int *w2;
    long int temp_d1 = 0;
    long int temp_d2 = 0;
    x2 = x;
    y2 = y;
    w2 = w;
    //------------------------------------------------------
    // Loop to find cumulative difference then divide by i
    while (j < (i - 1))
    {
        temp_d1 += (*(w) - *(y)) * (*(w) - *(y));
        temp_d2 += (*(x) - *(y)) * (*(x) - *(y));
        w++;
        x++;
        y++;
        j++;
    }
    x = x2;
    y = y2;
    w = w2;
    //------------------------------------------------------
    dist[f_num-1][0] = abs(temp_d1)/(i);
    dist[f_num-1][1] = abs(temp_d2)/(i);
    return;
}
/**************************************************************/

/**************************************************************/
void PreEmphasis(int p, int *z)
{
    /* This function applies a PreEmphasis Filter to the array which
       eliminates the -6dB per octave decay of the spectral energy */
    int u = 0;
    float diff;
    int in_diff;
    int *save;
    int value = *z;
    //------------------------------------------------------
    save = z;
    while (u < p)
    {
        diff = *z * 0.95;
        *z = value;
        z++;
        in_diff = (int)diff;
        value = *z - in_diff;
        u++;
    }
    z = save;
    return;
}
/**************************************************************/

/**************************************************************/
void averaging(int *a, int *b, int *c, int *d)
{
    int summation;
    int avg;
    int i;
    int *a2;
    int *b2;
    int *c2;
    int *d2;
    a2 = a;
    b2 = b;
    c2 = c;
    d2 = d;
    i = 0;
    while (i < 8000)
    {
        summation = *(a) + *(b) + *(c);
        avg = summation/3;
        *(d) = avg;
        a++;
        b++;
        c++;
        d++;
        i++;
    }
    a = a2;
    b = b2;
    c = c2;
    d = d2;

    return;
}
/**************************************************************/

/**************************************************************/
int best_match(void)
{
    long int match1 = 0;
    long int match2 = 0;
    int match = 0;
    /******
    match1 = dist[0][0]+dist[1][0]+dist[2][0]+dist[3][0]+dist[4][0];
    match2 = dist[0][1]+dist[1][1]+dist[2][1]+dist[3][1]+dist[4][1];
    if (match1 < match2)
        match = 0;
    else
        match = 1;
    ******/

    if (dist[0][0] < dist[0][1])
    {
        *(d1) = dist[0][0] - dist[0][1];
        match1++;
    }
    else
        match2++;
    if (dist[1][0] < dist[1][1])
    {
        *(d2) = dist[1][0] - dist[1][1];
        match1++;
    }
    else
        match2++;
    if (dist[2][0] < dist[2][1])
    {
        *(d3) = dist[2][0] - dist[2][1];
        match1++;
    }
    else
        match2++;
    if (dist[3][0] < dist[3][1])
    {
        *(d4) = dist[3][0] - dist[3][1];
        match1++;
    }
    else
        match2++;
    if (dist[4][0] < dist[4][1])
    {
        *(d5) = dist[4][0] - dist[4][1];
        match1++;
    }
    else
        match2++;
    if (match1 < match2)
        match = 0;
    else
        match = 1;
    return match;
}
/**************************************************************/

/**************************************************************/
void FIR_Filter(int trial, int samp_length, float B[], int *samp, int *out)
{
    /* This function filters the samples pointed to by 'samp' and stores them
       in the location pointed to by 'out'.
       'samp_length': number of samples pointed to by 'samp'
       'B[]':         coefficient array for the filter
       'samp':        pointer to integer samples
       'out':         pointer to output storage */
    int *save_in;
    int *save_out;
    save_in = samp;
    save_out = out;
    int k = 0;
    int inc = 0;
    float val = 0;
    float y = 0;
    float f_out = 0;
    while (inc < taps)
    {
        while (k < (inc+1))
        {
            val = (float)*(samp-k);
            //printf("%f \n", val);
            y += val * B[k];
            k++;
        }
        //printf("%f \n", y);
        if (trial == 1)
        {
            y = ((y*y) + 0.5);
        }
        if (trial == 2)
        {
            f_out = (float)*out;
            y = ((y*y) + 0.5);
            y += f_out;
        }
        if (trial == 3)
        {
            f_out = (float)*out;
            y = ((y*y) + 0.5);
            y = (y + f_out)/3;
        }
        *out = (int)y;
        //printf("\t %i \n", *out);
        samp++;
        inc++;
        out++;
        y = 0;
        k = 0;
    }
    //printf("%i \t %f \t %f \n", inc, *samp, *out);
    //printf("%i \t %i \n", inc, samp_length);
    while (inc < samp_length)
    {
        while (k < taps)
        {
            val = (float)*(samp-k);
            y += val * B[k];
            //printf("%i \t %f \t %f \t", k, B[k], *(samp-k));
            k++;
        }
        if (trial == 1)
        {
            y = ((y*y) + 0.5);
        }
        if (trial == 2)
        {
            f_out = (float)*out;
            y = ((y*y) + 0.5);
            y += f_out;
        }
        if (trial == 3)
        {
            f_out = (float)*out;
            y = ((y*y) + 0.5);
            y = (y + f_out)/3;
        }
        *out = (int)y;
        //printf("%i %3.8f \n", inc, *out);
        // Inc pointers & counters
        samp++;
        inc++;
        out++;
        y = 0;
        k = 0;
    }
    // Return pointers to 0th element
    samp = save_in;
    out = save_out;
    return;
}
/*****************************************************************/

Pushbutton.c
extern volatile int buffer_index;

/***************************************************************************************
 * Pushbutton - Interrupt Service Routine
 *
 * This routine checks which KEY has been pressed. If it is KEY1 or KEY2, it writes this
 * value to the global variable key_pressed. If it is KEY3 then it loads the SW switch
 * values and stores them in the variable pattern.
 ****************************************************************************************/
void pushbutton_ISR( void )
{
    volatile int * KEY_ptr = (int *) 0x10000050;        // pushbuttons base address
    volatile int * audio_ptr = (int *) 0x10003040;      // audio port address
    volatile int * green_LED_ptr = (int *) 0x10000010;  // green LED address
    int KEY_value;
    KEY_value = *(KEY_ptr + 3);     // read the pushbutton interrupt register
    *(KEY_ptr + 3) = 0;             // Clear the interrupt
    if (KEY_value == 0x2)           // check KEY1
    {
        *(green_LED_ptr) = 0x2;     // turn on LEDG[1]
        // reset the buffer index to record
        buffer_index = 0;
        // clear audio-in FIFO
        *(audio_ptr) = 0x4;
        // turn off clear, and enable audio-in interrupts
        *(audio_ptr) = 0x1;
    }
    else if (KEY_value == 0x4)      // check KEY2
    {
        *(green_LED_ptr) = 0x4;     // turn on LEDG[2]
        // reset buffer index to record
        buffer_index = 0;
        // clear audio-out FIFO
        *(audio_ptr) = 0x8;
        // turn off clear, and enable audio-out interrupts
        *(audio_ptr) = 0x2;
    }
    /****
    else if (KEY_value == 0x8)      // check KEY3
    {
        *(green_LED_ptr) = 0x8;     // turn on LEDG[3]
        // reset buffer index to record
        buffer_index = 0;
        // clear audio-in FIFO
        *(audio_ptr) = 0x4;
        // turn off clear, and enable audio-in interrupts
        *(audio_ptr) = 0x3;
    }
    ****/
    return;
}
Interval Timer ISR.c


extern volatile int timeout;

/*****************************************************************************
 * Interval timer interrupt service routine
 *
 * Controls refresh of the VGA screen
 ******************************************************************************/
void interval_timer_ISR( )
{
    volatile int * interval_timer_ptr = (int *) 0x10002000;
    volatile char * LCD_display_ptr = (char *) 0x10003050;  // 16x2 character display
    *(interval_timer_ptr) = 0;      // clear the interrupt
    timeout = 1;                    // set global variable
    /* shift the LCD display to the left */
    //*(LCD_display_ptr) = 0x18;    // instruction = shift left
    return;
}

