
MAJOR PROJECT - I FINAL SUBMISSION REPORT (Year 2012)

DSP TOOLS IN WIRELESS COMMUNICATION

SUBMITTED TO: Mr. Hemant Kumar Meena

Presented by: Piyush Virmani (09102259), Palash Relan (09102262)

CERTIFICATE

This is to certify that the work titled DSP Tools in Wireless Communication submitted by Piyush Virmani & Palash Relan in partial fulfilment for the award of degree B.TECH of Jaypee Institute of Information Technology University, Noida has been carried out under my supervision. This work has not been submitted partially or wholly to any other University or Institute for the award of this or any other degree or diploma.

Signature of Supervisor: ..................
Name of Supervisor: ..................
Designation: ..................
Date: ..................

ACKNOWLEDGEMENT

We are highly obliged to our project supervisor, Mr. Hemant Kumar Meena, for assigning this work of study on the topic DSP Tools in Wireless Communication, which has helped us develop an understanding of speech processing. We are grateful to him for all his time, assistance and guidance, which motivated us to work on this topic and without which our major project would not have seen its end. We are also thankful to the external examiners, Mr. R. K. Dubey and Mr. V. K. Dwivedi, who helped us build a better understanding of the matter.

Date: ..................
Name of Students: Piyush Virmani (09102259), Palash Relan (09102262)

CONTENTS

1. Certificate
2. Acknowledgement
3. Contents
4. Abstract
   i. Wireless Communication for Voice Transmission
   ii. Digital Speech Processing
5. Applications of Digital Speech Processing
   i. Speech Coding
   ii. Text-to-Speech Synthesis
   iii. Speech Recognition and Pattern Matching
   iv. Other Applications
6. Human Speech
7. Properties of Speech
8. Speech Analysis
   i. Short-Time Energy
   ii. Short-Time Zero Crossing
   iii. Short-Time Autocorrelation Function
9. General Encoding of Arbitrary Waveforms
   i. Types of Vocoders
   ii. Vocoder Quality Measurement
10. Linear Predictive Analysis
    i. Introduction
    ii. LPC Model
    iii. LPC Analysis
        i. Input Speech
        ii. Pitch Period Estimation
        iii. Vocal Tract Filter
        iv. Voiced/Unvoiced Determination
        v. Levinson-Durbin Algorithm
    iv. LPC Synthesis/Decoding
    v. Transmission of Parameters
    vi. Applications of LPC
11. Full LPC Model and Implementation
    i. LPC Encoder Model
    ii. LPC Decoder Model
    iii. MATLAB Implementation
12. Discussion and Conclusion
13. References

Abstract
Wireless Communication for Voice Transmission
Wireless communications operators see phenomenal growth in consumer demand for high-quality, low-cost services. Since the physical spectrum for wireless services is limited, operators and equipment suppliers continually find ways to optimise bandwidth efficiency. Digital communications technology provides an efficiency advantage over analog wireless communications: multiplexing and filtering are easier, components are cheaper, encryption is more secure and network management is simpler. Additionally, digital technology provides more value-added services to customers (security, text and voice messages together, etc.). Today wireless communication is primarily voice. The operator meets the increasing need for services by combining digital technology and special encoding techniques for voice. These encoders ("vocoders") take advantage of predictable elements in human speech. Several low data rate encoders are described here with an assessment of their subjective quality. Test methods to determine voice quality are necessarily subjective. The most efficient vocoders have acceptable quality levels and data rates between 2 and 8 kbit/s. Higher data rate encoders (8-13 kbit/s) have improved quality, while 32 kbit/s coders have excellent quality (but use more network resources). The operator must engineer the proper balance between cost, quality and available resources to provide the optimum solution to the customer.

Digital Speech Processing


Since even before the time of Alexander Graham Bell's revolutionary invention, engineers and scientists have studied the phenomenon of speech communication with an eye on creating more efficient and effective systems of human-to-human and human-to-machine communication. Starting in the 1960s, digital signal processing (DSP) assumed a central role in speech studies, and today DSP is the key to realizing the fruits of the knowledge that has been gained through decades of research. Concomitant advances in integrated circuit technology and computer architecture have aligned to create a technological environment with virtually limitless opportunities for innovation in speech communication applications. In this project, we highlight the central role of DSP techniques in modern speech communication research and applications.

Applications of Digital Speech Processing


The first step in most applications of digital speech processing is to convert the acoustic waveform to a sequence of numbers. Most modern A-to-D converters operate by sampling at a very high rate, applying a sharp-cutoff digital lowpass filter whose cutoff is set to preserve a prescribed bandwidth, and then reducing the sampling rate to the desired rate, which can be as low as twice the cutoff frequency of the digital filter. This discrete-time representation is the starting point for most applications.
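As a rough illustration of this sample-rate conversion step, the MATLAB sketch below is a hypothetical example (it assumes the Signal Processing Toolbox and a placeholder 48 kHz capture, neither of which is part of this report's implementation):

% Sketch: reduce an oversampled capture to an 8 kHz speech representation.
fs_hi = 48000;                        % assumed original (oversampled) rate
fs_lo = 8000;                         % target speech sampling rate
x_hi  = randn(1, fs_hi);              % placeholder for one second of captured audio
% resample applies an anti-aliasing lowpass filter before reducing the rate,
% so the retained bandwidth is at most fs_lo/2 = 4 kHz.
x_lo  = resample(x_hi, fs_lo, fs_hi);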

Speech Coding
Perhaps the most widespread applications of digital speech processing technology occur in the areas of digital transmission and storage of speech signals. In these areas the centrality of the digital representation is obvious, since the goal is to compress the digital waveform representation of speech into a lower bit-rate representation. It is common to refer to this activity as speech coding or speech compression.

Speech coders enable a broad range of applications including narrowband and broadband wired telephony, cellular communications, voice over internet protocol (VoIP) (which utilizes the internet as a real-time communications medium), secure voice for privacy and encryption (for national security applications), extremely narrowband communications channels (such as battlefield applications using high frequency (HF) radio), and for storage of speech for telephone answering machines, interactive voice response (IVR) systems, and pre-recorded messages. Speech coders often utilize many aspects of both the speech production and speech perception processes, and hence may not be useful for more general audio signals such as music. Coders that are based on incorporating only aspects of sound perception generally do not achieve as much compression as those based on speech production, but they are more general and can be used for all types of audio signals. These coders are widely deployed in MP3 and AAC players and for audio in digital television systems.

Text-to-Speech Synthesis
For many years, scientists and engineers have studied the speech production process with the goal of building a system that can start with text and produce speech automatically. In a sense, a text-to-speech synthesizer such as the one depicted in the figure below is a digital simulation of the entire upper part of the speech chain diagram.

Text to Speech Synthesis Block Diagram

The input to the system is ordinary text such as an email message or an article from a newspaper or magazine. The first block in the text-to-speech synthesis system, labelled linguistic rules, has the job of converting the printed text input into a set of sounds that the machine must synthesize. The conversion from text to sounds involves a set of linguistic rules that must determine the appropriate set of sounds (perhaps including things like emphasis, pauses, rates of speaking, etc.) so that the resulting synthetic speech will express the words and intent of the text message in what passes for a natural voice that can be decoded accurately by human speech perception. Once the proper pronunciation of the text has been determined, the role of the synthesis algorithm is to create the appropriate sound sequence to represent the text message in the form of speech. In essence, the synthesis algorithm must simulate the action of the vocal tract system in creating the sounds of speech.

Speech Recognition and Other Pattern Matching Problems


Another large class of digital speech processing applications is concerned with the automatic extraction of information from the speech signal. Most such systems involve some sort of pattern matching. The figure shows a block diagram of a generic approach to pattern matching problems in speech processing. Such problems include the following: speech recognition, where the object is to extract the message from the speech signal; speaker recognition, where the goal is to identify who is speaking; speaker verification, where the goal is to verify a speaker's claimed identity from analysis of their speech signal; word spotting, which involves monitoring a speech signal for the occurrence of specified words or phrases; and automatic indexing of speech recordings based on recognition (or spotting) of spoken keywords.

The first block in the pattern matching system converts the analog speech waveform to digital form using an A-to-D converter. The feature analysis module converts the sampled speech signal to a set of feature vectors. Often, the same analysis techniques that are used in speech coding are also used to derive the feature vectors. The final block in the system, namely the pattern matching block, dynamically time aligns the set of feature vectors representing the speech signal with a concatenated set of stored patterns, and chooses the identity associated with the pattern which is the closest match to the time-aligned set of feature vectors of the speech signal. The symbolic output consists of a set of recognized words, in the case of speech recognition, or the identity of the best matching talker, in the case of speaker recognition, or a decision as to whether to accept or reject the identity claim of a speaker in the case of speaker verification.

Speech Recognition Block Diagram

The major areas where such a system finds applications include command and control of computer software, voice dictation to create letters, memos, and other documents, natural language voice dialogues with machines to enable help desks and call centres, and for agent services such as calendar entry and update, address list modification and entry, etc.

Other Speech Applications

Human Speech
The fundamental purpose of speech is communication, i.e., the transmission of messages. According to Shannon's information theory, a message represented as a sequence of discrete symbols can be quantified by its information content in bits, and the rate of transmission of information is measured in bits/second (bps). In speech production, as well as in many human-engineered electronic communication systems, the information to be transmitted is encoded in the form of a continuously varying (analog) waveform that can be transmitted, recorded, manipulated, and ultimately decoded by a human listener. In the case of speech, the fundamental analog form of the message is an acoustic waveform, which we call the speech signal. Speech signals can be converted to an electrical waveform by a microphone, further manipulated by both analog and digital signal processing, and then converted back to acoustic form by a loudspeaker, a telephone handset or headphone, as desired. This form of speech processing is, of course, the basis for Bell's telephone invention as well as today's multitude of devices for recording, transmitting, and manipulating speech and audio signals.

Properties of Speech
The two types of speech sounds, voiced and unvoiced, produce different sounds and spectra due to their differences in sound formation. With voiced speech, air pressure from the lungs forces the normally closed vocal cords to open and vibrate. The vibrational frequencies (pitch) vary from about 50 to 400 Hz (depending on the person's age and sex) and form resonances in the vocal tract at odd harmonics. These resonance peaks are called formants and can be seen in the voiced speech figures below.

Voiced Speech Sample

Power Spectral Density, Voiced Speech

Unvoiced sounds, called fricatives (e.g., s, f, sh), are formed by forcing air through an opening (hence the term, derived from the word friction). Fricatives do not vibrate the vocal cords and therefore do not produce the periodicity seen in the formant structure of voiced speech; unvoiced sounds appear more noise-like (see the figures below). The time-domain samples lose periodicity and the power spectral density does not display the clear resonant peaks that are found in voiced sounds.

Unvoiced Speech Sample

Power Spectral Density, Unvoiced Speech

The spectrum of speech (combined voiced and unvoiced sounds) has a total bandwidth of approximately 7000 Hz, with an average energy at about 3000 Hz. The auditory canal optimizes speech detection by acting as a resonant cavity at this average frequency. Note that the power of the speech spectrum and the periodic nature of formants drastically diminish above 3500 Hz. Speech encoding algorithms can therefore be less complex than general encoders by concentrating (through filters) on this region. Furthermore, since line-quality telecommunication circuits employ filters that pass frequencies up to only 3000-4000 Hz, high frequencies produced by fricatives are removed. A caller will often have to spell or otherwise distinguish these sounds to be understood (e.g., "F as in Frank").

Schematic Model of Vocal Tract System

Speech Analysis
Our goal is to extract the parameters of the model by analysis of the speech signal. To do this, it is common to assume structures (or representations) for both the excitation generator and the linear system. One such model uses a more detailed representation of the excitation in terms of separate source generators for voiced and unvoiced speech, as shown in the figure.

In this model the unvoiced excitation is assumed to be a random noise sequence, and the voiced excitation is assumed to be a periodic impulse train with impulses spaced by the pitch period (P0) rounded to the nearest sample. The pulses needed to model the glottal flow waveform during voiced speech are assumed to be combined (by convolution) with the impulse response of the linear system, which is assumed to be slowly time-varying (changing every 50-100 ms or so). By this we mean that over the timescale of phonemes, the impulse response, frequency response, and system function of the system remain relatively constant. For example, over time intervals of tens of milliseconds, the system can be described by the convolution expression below.
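A standard way of writing this short-time convolution, with n̂ marking the analysis block and e_n̂[n] the excitation (notation assumed here), is

s_{\hat n}[n] = \sum_{m} h_{\hat n}[m] \, e_{\hat n}[n-m]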

where the subscript n̂ denotes the analysis time, i.e., it points to the block of samples of the speech signal s[n] over which the impulse response h_n̂[m] applies. We use n for the time index within that interval, and m is the index of summation in the convolution sum. To simplify analysis, it is often assumed that the system is an all-pole system with a system function of the form given below.
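A common all-pole form of this system function, with gain G and predictor coefficients a_k (notation assumed here), is

H(z) = \frac{G}{1 - \sum_{k=1}^{p} a_k z^{-k}}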

Although the linear system is assumed to model the composite spectrum effects of radiation, vocal tract tube, and glottal excitation pulse shape (for voiced speech only) over a short time interval, the linear system in the model is commonly referred to as simply the vocal tract system and the corresponding impulse response is called the vocal tract impulse response. For all-pole linear systems, as represented by the equation, the input and output are related by a difference equation of the form:
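In the same assumed notation, the corresponding difference equation relating the output s[n] to the excitation e[n] is

s[n] = \sum_{k=1}^{p} a_k \, s[n-k] + G \, e[n]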

Short-Time Energy and Zero-Crossing Rate


Two basic short-time analysis functions useful for speech signals are the short-time energy and the short-time zero-crossing rate. These functions are simple to compute, and they are useful for estimating properties of the excitation function in the model. The short-time energy is defined as:
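In a standard formulation, with w[n] an analysis window and n̂ the analysis time (notation assumed here), the short-time energy is

E_{\hat n} = \sum_{m=-\infty}^{\infty} \left( x[m] \, w[\hat n - m] \right)^2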

Similarly, the short-time zero crossing rate is defined as the weighted average of the number of times the speech signal changes sign within the time window. Representing this operator in terms of linear filtering leads to:
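One common way of writing this, with an effective lowpass window w̃[n] and the sign function sgn(·) (notation assumed here), is

Z_{\hat n} = \sum_{m=-\infty}^{\infty} \tfrac{1}{2} \left| \operatorname{sgn}(x[m]) - \operatorname{sgn}(x[m-1]) \right| \tilde w[\hat n - m]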

The short-time energy and short-time zero-crossing rate are important because they abstract valuable information about the speech signal, and they are simple to compute. The short-time energy is an indication of the amplitude of the signal in the interval around the analysis time. From our model, we expect unvoiced regions to have lower short-time energy than voiced regions. Similarly, the short-time zero-crossing rate is a crude frequency analyzer: voiced signals have a high-frequency (HF) fall-off due to the lowpass nature of the glottal pulses, while unvoiced sounds have much more HF energy. Thus, the short-time energy and short-time zero-crossing rate can be the basis for an algorithm for deciding whether the speech signal is voiced or unvoiced at a particular time.
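As a rough MATLAB sketch of these two measurements (the frame length, window choice, and input vector are assumptions for illustration and are not taken from the report's own implementation):

% Sketch: frame-wise short-time energy and zero-crossing rate.
fs = 8000;
x  = randn(1, fs);                 % placeholder one-second speech vector
N  = 240;                          % 30 ms rectangular analysis window
nFrames = floor(length(x)/N);
energy  = zeros(1, nFrames);
zcr     = zeros(1, nFrames);
for k = 1:nFrames
    frame = x((k-1)*N + 1 : k*N);
    energy(k) = sum(frame.^2);     % short-time energy of the frame
    zcr(k) = sum(abs(sign(frame(2:end)) - sign(frame(1:end-1)))) / (2*N);
end
% High energy together with a low zero-crossing rate suggests a voiced frame;
% low energy with a high zero-crossing rate suggests an unvoiced frame.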

Short-Time Autocorrelation Function (STACF)


The autocorrelation function is often used as a means of detecting periodicity in signals, and it is also the basis for many spectrum analysis methods. This makes it a useful tool for short-time speech analysis. The STACF is defined as the deterministic autocorrelation function of the sequence x_n̂[m] = x[m] w[n̂ − m] that is selected by the window shifted to time n̂, i.e.,
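Written out in that notation (a standard form, assumed here), the short-time autocorrelation is

\phi_{\hat n}[k] = \sum_{m=-\infty}^{\infty} x_{\hat n}[m] \, x_{\hat n}[m+k]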

Voiced and Unvoiced Segments of speech and their corresponding Autocorrelation

General Encoding of Arbitrary Waveforms


Waveform encoders typically use time-domain or frequency-domain coding and attempt to accurately reproduce the original signal. These general encoders do not assume any previous knowledge about the signal, and the decoder output waveform is very similar to the signal input to the coder. Examples of these general encoders include uniform binary coding for music compact discs and pulse code modulation for telecommunications. Pulse code modulation (PCM) is a general encoder used in standard voice-grade circuits. PCM encodes into eight-bit words pulse-amplitude-modulated (PAM) signals that have been sampled at the Nyquist rate for the voice channel (8000 samples per second, or twice the channel bandwidth). The PCM signal therefore requires a 64 kbit/s transmission channel. However, this is not feasible over communication channels where bandwidth is at a premium. It is also inefficient when the communication is primarily voice, which exhibits a certain amount of predictability as seen in the periodic structure of formants. The increasing use of limited transmission media such as radio and satellite links and limited voice storage resources requires more efficient coding methods. Special encoders have been designed that assume the input signal is voice only. These vocoders use speech production models to reproduce only the intelligible quality of the original signal waveform. The most popular vocoders used in digital communications are presented below.

Types of Voice Encoders


i. Linear Predictive Coder (LPC)
ii. Regular Pulse Excited (RPE) Coder
iii. Code-Excited Linear Prediction (CELP) Coder

Vocoder Quality Measurements


There are several points on which to rate vocoder quality:
i. Cost/complexity
ii. Voice quality
iii. Data rate
iv. Transparency for non-voice signals
v. Tolerance of transmission errors
vi. Effects of tandem encodings
vii. Coding formats
viii. Signal processing requirements
It is suggested that the most important quality measures are voice quality, data rate, communication delay and coding algorithm complexity. While all of these can easily be measured and analysed, voice quality remains subjective.

Linear Predictive Analysis


Proposal
Linear predictive coding (LPC) is a digital method for encoding an analog signal in which a particular value is predicted by a linear function of the past values of the signal. It was first proposed as a method for encoding human speech by the United States Department of Defense in Federal Standard 1015, published in 1984. Human speech is produced in the vocal tract, which can be approximated as a tube of varying diameter. The LPC model is based on a mathematical approximation of the vocal tract represented by this tube. At a particular time, t, the speech sample s(t) is represented as a linear sum of the p previous samples. The most important aspect of LPC is the linear predictive filter, which allows the value of the next sample to be determined by a linear combination of previous samples. Under normal circumstances, speech is sampled at 8000 samples/second with 8 bits used to represent each sample, giving a rate of 64000 bits/second. Linear predictive coding reduces this to 2400 bits/second. At this reduced rate the speech has a distinctive synthetic sound and there is a noticeable loss of quality; however, the speech is still audible and can still be easily understood. Since there is information loss in linear predictive coding, it is a lossy form of compression.
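The prediction just described can be written compactly as follows (a standard form; ŝ[n] denotes the predicted sample and a_i the predictor coefficients, notation assumed here):

\hat s[n] = \sum_{i=1}^{p} a_i \, s[n-i]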

Introduction
There exist many different types of speech compression that make use of a variety of different techniques. However, most methods of speech compression exploit the fact that speech production occurs through slow anatomical movements and that the speech produced has a limited frequency range. The frequency of human speech production ranges from around 300 Hz to 3400 Hz. Speech compression is often referred to as speech coding, which is defined as a method for reducing the amount of information needed to represent a speech signal. Most forms of speech coding are based on a lossy algorithm. Lossy algorithms are considered acceptable when encoding speech because the loss of quality is often undetectable to the human ear. There are many other characteristics of speech production that can be exploited by speech coding algorithms. One fact that is often used is that periods of silence take up more than 50% of a conversation. An easy way to save bandwidth and reduce the amount of information needed to represent the speech signal is simply not to transmit the silence. Another fact about speech production that can be taken advantage of is that, mechanically, there is a high correlation between adjacent samples of speech. Most forms of speech compression are achieved by modelling the process of speech production as a linear digital filter. The digital filter and its slowly changing parameters are encoded to achieve compression of the speech signal. Linear Predictive Coding (LPC) is one of the methods of compression that models the process of speech production. Specifically, LPC models this process as a linear sum of earlier samples using a digital filter driven by an excitation signal. An alternative explanation is that linear prediction filters attempt to predict future values of the input signal based on past values. LPC models speech as an autoregressive process, and sends the parameters of the process rather than the speech itself.

All vocoders, including LPC vocoders, have four main attributes: bit rate, delay, complexity and quality. Any voice coder, regardless of the algorithm it uses, has to make trade-offs between these attributes. The first attribute, the bit rate, determines the degree of compression that a vocoder achieves. Uncompressed speech is usually transmitted at 64 kb/s using 8 bits/sample and a sampling rate of 8 kHz; any bit rate below 64 kb/s is considered compression. The linear predictive coder transmits speech at a bit rate of 2.4 kb/s, an excellent rate of compression. Delay is another important attribute for vocoders that are involved with the transmission of an encoded speech signal. Vocoders that are involved with the storage of compressed speech, as opposed to transmission, are not as concerned with delay. The general standard for transmitted speech conversations is that any delay greater than 300 ms is considered unacceptable. The third attribute of voice coders is the complexity of the algorithm used. The complexity affects both the cost and the power consumption of the vocoder. Linear predictive coding, because of its high compression rate, is very complex and involves executing millions of instructions per second. The general algorithm for linear predictive coding involves an analysis or encoding part and a synthesis or decoding part. In the encoding, LPC takes the speech signal in blocks or frames of speech and determines the input signal and the coefficients of the filter that will be capable of reproducing the current block of speech. This information is quantized and transmitted. In the decoding, LPC rebuilds the filter based on the coefficients received. The filter can be thought of as a tube which, when given an input signal, attempts to output speech. Additional information about the original speech signal is used by the decoder to determine the input or excitation signal that is sent to the filter for synthesis.

LPC Model
The particular source-filter model used in LPC is known as the linear predictive coding model. It has two key components: analysis or encoding, and synthesis or decoding. The analysis part of LPC involves examining the speech signal and breaking it down into segments or blocks. Each segment is then examined further to find the answers to several key questions:
i. Is the segment voiced or unvoiced?
ii. What is the pitch of the segment?
iii. What parameters are needed to build a filter that models the vocal tract for the current segment?

LPC analysis is usually conducted by a sender who answers these questions and usually transmits these answers onto a receiver. The receiver performs LPC synthesis by using the answers received to build a filter that when provided the correct input source will be able to accurately reproduce the original speech signal.

Essentially, LPC synthesis tries to imitate human speech production. The figure shows which parts of the receiver correspond to which parts of the human anatomy. This diagram is for a general voice or speech coder and is not specific to linear predictive coding. All voice coders tend to model two things: excitation and articulation. Excitation is the type of sound that is passed into the filter or vocal tract, and articulation is the transformation of the excitation signal into speech.

LPC Analysis/Encoding
Input Speech

The input signal is sampled at a rate of 8000 samples per second. This input signal is then broken up into segments or blocks, each of which is analysed and transmitted to the receiver. The 8000 samples in each second of speech are broken into 180-sample segments, so each segment represents 22.5 milliseconds of the input speech signal.

Voiced/Unvoiced Determination

According to the LPC-10 standard, before a speech segment is determined to be voiced or unvoiced it is first passed through a low-pass filter with a bandwidth of 1 kHz. Determining if a segment is voiced or unvoiced is important because voiced sounds have a different waveform than unvoiced sounds. The differences in the two waveforms create a need for two different input signals for the LPC filter in the synthesis or decoding: one input signal for voiced sounds and the other for unvoiced sounds. The LPC encoder notifies the decoder whether a segment is voiced or unvoiced by sending a single bit. Recall that voiced sounds are usually vowels and can be considered as pulses similar to periodic waveforms. These sounds have high average energy levels, which means that they have very large amplitudes. Voiced sounds also have distinct resonant or formant frequencies.

Pitch Period Estimation

Determining whether a segment is voiced or unvoiced is not all of the information that is needed by the LPC decoder to accurately reproduce a speech signal. In order to produce an input signal for the LPC filter, the decoder also needs another attribute of the current speech segment known as the pitch period. The period of any wave, including speech signals, can be defined as the time required for one wave cycle to completely pass a fixed position. For speech signals, the pitch period can be thought of as the period of the vocal cord vibration that occurs during the production of voiced speech. Therefore, the pitch period is only needed for the decoding of voiced segments; it is not required for unvoiced segments, since they are produced by turbulent air flow rather than vocal cord vibrations. It is very computationally intensive to determine the pitch period for a given segment of speech, and there are several different types of algorithms that could be used. One type of algorithm takes advantage of the fact that the autocorrelation of a periodic function, Rxx(k), will have a maximum when k is equal to the pitch period. These algorithms usually detect a maximum value by checking the autocorrelation value against a threshold value. One problem with algorithms that use autocorrelation is that the validity of their results is susceptible to interference from other resonances in the vocal tract; when interference occurs the algorithm cannot guarantee accurate results. Another problem arises because voiced speech is not entirely periodic, which means that the maximum will be lower than it would be for a truly periodic signal. LPC does not use an autocorrelation algorithm; instead it uses an algorithm called the average magnitude difference function (AMDF), which is defined as follows.
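A standard form of the AMDF over one analysis segment, with P the candidate pitch period and N the number of samples in the segment (notation assumed here), is

\mathrm{AMDF}(P) = \frac{1}{N} \sum_{n=1}^{N} \left| y_n - y_{n-P} \right|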

Since the pitch period, P, for humans is limited, the AMDF is evaluated only over a limited range of possible pitch period values. In LPC there is an assumption that the pitch period is between 2.5 and 19.5 milliseconds; if the signal is sampled at a rate of 8000 samples/second, then 20 < P < 160. For voiced segments we can consider the set of speech samples for the current segment, {yn}, as a periodic sequence with period P0. This means that samples that are P0 apart should have similar values, and that the AMDF will have a minimum at P0, that is, when P is equal to the pitch period.

An advantage of the AMDF is that it can also be used to determine whether a segment is voiced or unvoiced. When the AMDF is applied to an unvoiced signal, the difference between the minimum and the average values is very small compared with voiced signals, and this difference can be used to make the voiced/unvoiced determination. For unvoiced segments the AMDF also has minima; however, these minima are very close to the average value, which means that they are not very deep.

Voiced and Unvoiced Segments (figures)
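As a rough MATLAB sketch of the AMDF pitch search described above (the segment, search range, and test signal are assumptions for illustration; the report's own implementation in func_pitch.m uses autocorrelation instead):

% Sketch: AMDF pitch-period search over one 180-sample segment.
fs = 8000;
y  = sin(2*pi*100*(0:179)/fs);        % placeholder: 100 Hz tone, true period = 80 samples
Pmin = 20;  Pmax = 160;               % pitch range of roughly 2.5 ms to 20 ms
amdf = zeros(1, Pmax);
for P = Pmin:Pmax
    d = y(P+1:end) - y(1:end-P);      % differences between samples P apart
    amdf(P) = mean(abs(d));           % average magnitude difference for lag P
end
[~, idx] = min(amdf(Pmin:Pmax));
pitch_period = idx + Pmin - 1;        % deepest AMDF minimum gives the pitch period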

Vocal Tract Filter


The filter that is used by the decoder to recreate the original input signal is created based on a set of coefficients. These coefficients are extracted from the original signal during encoding and are transmitted to the receiver for use in decoding. Each speech segment has different filter coefficients or parameters that it uses to recreate the original sound. Not only are the parameters themselves different from segment to segment, but the number of parameters differs between voiced and unvoiced segments: voiced segments use 10 parameters to build the filter while unvoiced sounds use only 4. A filter with n parameters is referred to as an nth-order filter. In order to find the filter coefficients that best match the current segment being analysed, the encoder attempts to minimize the mean squared error. The mean squared error is expressed as:
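A standard way of writing this error, assuming a predictor of order M with coefficients {a_i} acting on the segment samples {y_n} (notation assumed here), is

E\!\left[ e_n^2 \right] = E\!\left[ \left( y_n - \sum_{i=1}^{M} a_i \, y_{n-i} \right)^{2} \right]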

where {yn} is the set of speech samples for the current segment and {ai} is the set of coefficients. In order to provide the most accurate coefficients, {ai} is chosen to minimize the average value of en² over all samples in the segment. The first step in minimizing the average mean squared error is to take the derivative with respect to each coefficient and set it to zero.

Taking the derivative produces a set of M equations. In order to solve for the filter coefficients, E[yn-i yn-j] has to be estimated. There are two approaches that can be used for this estimation: autocorrelation and autocovariance. Although there are versions of LPC that use both approaches, autocorrelation is the approach that will be explained in this report. Autocorrelation requires that several initial assumptions be made about the set or sequence of speech samples, {yn}, in the current segment. First, it requires that {yn} be stationary and, second, it requires that the {yn} sequence be zero outside of the current segment. In autocorrelation, each E[yn-i yn-j] is converted into an autocorrelation function of the form Ryy(|i-j|). The estimation of an autocorrelation function Ryy(k) can be expressed as:
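One common estimator, with N the number of samples in the segment (notation assumed here), is

R_{yy}(k) = \frac{1}{N} \sum_{n=k}^{N-1} y_n \, y_{n-k}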

Using Ryy(k), the M equations that were acquired from taking the derivative of the mean squared error can be written in matrix form RA = P where A contains the filter coefficients.

In order to determine the contents of A, the filter coefficients, the equation A = R^-1 P must be solved. This equation cannot be solved without first computing R^-1. This is an easy computation if one notices that R is symmetric and, more importantly, that all diagonals consist of the same element. This type of matrix is called a Toeplitz matrix and can be easily inverted. The Levinson-Durbin (L-D) algorithm is a recursive algorithm that is considered very computationally efficient since it takes advantage of the properties of R when determining the filter coefficients. The coefficients produced by this algorithm are denoted with a superscript, {ai^(j)}, for a jth-order filter, and the average mean squared error of a jth-order filter is denoted Ej instead of E[en²]. When applied to an Mth-order filter, the L-D algorithm computes all filters of order less than M; that is, it determines all order-N filters where N = 1, ..., M-1.

During the process of computing the filter coefficients {ai} a set of coefficients, {ki}, called reflection coefficients or partial correlation coefficients (PARCOR) are generated. These coefficients are used to solve potential problems in transmitting the filter coefficients. The quantization of the filter coefficients for transmission can create a major problem since errors in the filter coefficients can lead to instability in the vocal tract filter and create an inaccurate output signal. This potential problem is averted by quantizing and transmitting the reflection coefficients that are generated by the Levinson-Durbin algorithm. These coefficients can be used to rebuild the set of filter coefficients {ai} and can guarantee a stable filter if their magnitude is strictly less than one.
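As a quick cross-check of this step, the sketch below uses MATLAB's built-in levinson routine (this assumes the Signal Processing Toolbox; it is not the report's own func_lev_durb implementation):

% Sketch: LPC coefficients and reflection coefficients via Levinson-Durbin.
y = randn(1, 180);                     % placeholder 180-sample speech segment
M = 10;                                % predictor order
r = xcorr(y, M, 'biased');             % autocorrelation at lags -M..M
r = r(M+1:end);                        % keep lags 0..M
[a, Ep, k] = levinson(r, M);           % a = prediction-error filter, Ep = error power, k = reflection coeffs
% The synthesis filter 1/A(z) built from a is stable when every |k| is below one.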

Transmitting the Parameters


In an uncompressed form, speech is usually transmitted at 64,000 bits/second using 8 bits/sample and a sampling rate of 8 kHz. LPC reduces this rate to 2,400 bits/second by breaking the speech into segments and then sending the voiced/unvoiced information, the pitch period, and the coefficients for the filter that represents the vocal tract for each segment. The input signal used by the filter on the receiver end is determined by the classification of the speech segment as voiced or unvoiced and by the pitch period of the segment. The encoder sends a single bit to indicate whether the current segment is voiced or unvoiced. The pitch period is quantized using a log-companded quantizer to one of 60 possible values, so 6 bits are required to represent the pitch period. If the segment contains voiced speech then a 10th-order filter is used, meaning that 11 values are needed: 10 reflection coefficients and the gain. If the segment contains unvoiced speech then a 4th-order filter is used, meaning that 5 values are needed: 4 reflection coefficients and the gain. The reflection coefficients are denoted kn, where 1 ≤ n ≤ 10 for voiced speech filters and 1 ≤ n ≤ 4 for unvoiced filters.
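Putting these numbers together gives the per-frame bit budget implied by the rates above (a rough check, consistent with the 54 bits per frame used in the decoding section):

2400 \ \text{bit/s} \times 0.0225 \ \text{s/frame} = 54 \ \text{bits per 180-sample frame}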

LPC Synthesis/Decoding
The process of decoding a sequence of speech segments is the reverse of the encoding process. Each segment is decoded individually and the sequence of reproduced sound segments is joined together to represent the entire input speech signal. The decoding or synthesis of a speech segment is based on the 54 bits of information that are transmitted from the encoder. The speech segment is declared voiced or unvoiced based on the voiced/unvoiced determination bit. The decoder needs to know what type of signal the segment contains in order to determine what type of excitation signal will be given to the LPC filter. Unlike other speech compression algorithms such as CELP, which have a codebook of possible excitation signals, LPC has only two possible signals. For voiced segments a pulse is used as the excitation signal. This pulse consists of 40 samples and is locally stored by the decoder. A pulse is defined as "an isolated disturbance that travels through an otherwise undisturbed medium" [10]. For unvoiced segments, white noise produced by a pseudorandom number generator is used as the input to the filter. The pitch period for voiced segments is then used to determine whether the 40-sample pulse needs to be truncated or extended. If the pulse needs to be extended it is padded with zeros, since the definition of a pulse says that it travels through an undisturbed medium. This combination of voiced/unvoiced determination and pitch period is all that is needed to produce the excitation signal. Each segment of speech has a different LPC filter that is eventually produced using the reflection coefficients and the gain that are received from the encoder: 10 reflection coefficients are used for voiced segment filters and 4 reflection coefficients for unvoiced segments. These reflection coefficients are used to generate the vocal tract coefficients or parameters which are used to create the filter.

The final step of decoding a segment of speech is to pass the excitement signal through the filter to produce the synthesized speech signal.

LPC Applications
In general, the most common use of speech compression is in standard telephone systems; in fact, much of the technology used in speech compression was developed by the phone companies. Linear predictive coding finds its main application in secure telephony because of its low bit rate. Secure telephone systems require a low bit rate since speech is first digitized, then encrypted and transmitted. These systems have a primary goal of decreasing the bit rate as much as possible while maintaining a level of speech quality that is understandable. Other standards, such as the digital cellular standard and the international telephone network standard, have higher quality requirements and therefore require a higher bit rate; in these standards, understanding the speech is not good enough, the listener must also be able to recognize the speech as belonging to the original speaker. A second area in which linear predictive coding has been used is text-to-speech synthesis. In this type of synthesis the speech has to be generated from text. Since LPC synthesis involves the generation of speech based on a model of the vocal tract, it provides a natural method for generating speech from text. Further applications of LPC and other speech compression schemes are voice mail systems, telephone answering machines, and multimedia applications. Most multimedia applications, unlike telephone applications, involve one-way communication and involve storing the data. An example of a multimedia application that would involve speech is an application that allows voice annotations about a text document to be saved with the document. The method of speech compression used in multimedia applications depends on the desired speech quality and the limitations of storage space for the application. Linear Predictive Coding provides a favourable method of speech compression for multimedia applications since it requires the smallest storage space as a result of its low bit rate.

Full LPC Model and Implementation

MATLAB Implementation
Main.m

%MAIN BODY
clear all; clc;
disp('wavfile');
%INPUT
inpfilenm = 'sample1';
[x, fs] = wavread(inpfilenm);
%LENGTH (IN SEC) OF INPUT WAVEFILE
t = length(x)./fs;
sprintf('Processing the wavefile "%s"', inpfilenm)
sprintf('The wavefile is %3.2f seconds long', t)
%THE ALGORITHM STARTS HERE
M = 10;  %prediction order
[aCoeff, pitch_plot, voiced, gain] = f_ENCODER(x, fs, M);  %pitch_plot is pitch periods
synth_speech = f_DECODER(aCoeff, pitch_plot, voiced, gain);
%RESULTS
beep;
disp('Press a key to play the original sound!');
pause;
soundsc(x, fs);
disp('Press a key to play the LPC compressed sound!');
pause;
soundsc(synth_speech, fs);
figure;
subplot(2,1,1), plot(x);
title(['Original signal = "', inpfilenm, '"']);
subplot(2,1,2), plot(synth_speech);
title(['synthesized speech of "', inpfilenm, '" using LPC algo']);

f_ENCODER.m

function [aCoeff, pitch_plot, voiced, gain] = f_ENCODER(x, fs, M)
M = 10;  %prediction order = 10
b = 1;
fsize = 30e-3;  %frame size
frame_length = round(fs .* fsize);
N = frame_length - 1;
%VOICED/UNVOICED and PITCH [independent of frame segmentation]
[voiced, pitch_plot] = f_VOICED(x, fs, fsize);
%FRAME SEGMENTATION for aCoeff and GAIN
for b = 1 : frame_length : (length(x) - frame_length),
    y1 = x(b:b+N);
    y = filter([1 -.9378], 1, y1);  %pre-emphasis filtering
    %aCoeff [LEVINSON-DURBIN METHOD]
    [a, tcount_of_aCoeff, e] = func_lev_durb(y, M);
    aCoeff(b : (b + tcount_of_aCoeff - 1)) = a;
    %GAIN
    pitch_plot_b = pitch_plot(b);  %pitch period
    voiced_b = voiced(b);
    gain(b) = f_GAIN(e, voiced_b, pitch_plot_b);
end

func_lev_durbin.m
%function of levinsonDurbin
function [aCoeff, tcount_of_aCoeff, e] = func_lev_durb(y, M)
if (nargin < 2), M = 10; end
sk = 0;
a = [zeros(M+1); zeros(M+1)];
z = xcorr(y);
%finding array of R[l]
R = z(((length(z)+1)./2) : length(z));
s = 1;
J(1) = R(1);
%GETTING OTHER PARAMETERS OF PREDICTOR OF ORDER "(s-1)":
for s = 2:M+1,
    sk = 0;
    for i = 2:(s-1),
        sk = sk + a(i,(s-1)).*R(s-i+1);
    end
    k(s) = (R(s) + sk)./J(s-1);
    J(s) = J(s-1).*(1-(k(s)).^2);
    a(s,s) = -k(s);
    a(1,s) = 1;
    for i = 2:(s-1),
        a(i,s) = a(i,(s-1)) - k(s).*a((s-i+1),(s-1));
    end
end
aCoeff = a((1:s),s)';
tcount_of_aCoeff = length(aCoeff);
est_y = filter([0 -aCoeff(2:end)], 1, y);
e = y - est_y;

f_VOICED.m
%function_main of voiced/unvoiced detection
function [voiced, pitch_plot] = f_VOICED(x, fs, fsize)
f = 1;
b = 1;
frame_length = round(fs .* fsize);
N = frame_length - 1;
%FRAME SEGMENTATION:
for b = 1 : frame_length : (length(x) - frame_length),
    y1 = x(b:b+N);
    y = filter([1 -.9378], 1, y1);  %pre-emphasis filter
    msf(b:(b + N)) = func_vd_msf(y);
    zc(b:(b + N)) = func_vd_zc(y);
    pitch_plot(b:(b + N)) = func_pitch(y, fs);
end
thresh_msf = (((sum(msf)./length(msf)) - min(msf)) .* (0.67)) + min(msf);
voiced_msf = msf > thresh_msf;  %=1,0
thresh_zc = (((sum(zc)./length(zc)) - min(zc)) .* (1.5)) + min(zc);
voiced_zc = zc < thresh_zc;
thresh_pitch = (((sum(pitch_plot)./length(pitch_plot)) - min(pitch_plot)) .* (0.5)) + min(pitch_plot);
voiced_pitch = pitch_plot > thresh_pitch;
for b = 1:(length(x) - frame_length),
    if voiced_msf(b) .* voiced_pitch(b) .* voiced_zc(b) == 1,
    % if voiced_msf(b) + voiced_pitch(b) > 1,
        voiced(b) = 1;
    else
        voiced(b) = 0;
    end
end
voiced;
pitch_plot;

func_pitch.m
function pitch_period = func_pitch(y, fs)
clear pitch_period;
period_min = round(fs .* 2e-3);
period_max = round(fs .* 20e-3);
R = xcorr(y);
[R_max, R_mid] = max(R);
pitch_per_range = R(R_mid + period_min : R_mid + period_max);
[R_max, R_mid] = max(pitch_per_range);
pitch_period = R_mid + period_min;

func_vd_msf.m

function m_s_f = func_vd_msf(y)
clear m_s_f;
[B,A] = butter(9, .33, 'low');
y1 = filter(B, A, y);
m_s_f = sum(abs(y1));  %.5 or .33?

func_vd_zc.m
function ZC = func_vd_zc(y)
ZC = 0;
for n = 1:length(y),
    if n+1 > length(y)
        break
    end
    ZC = ZC + (1./2) .* abs(sign(y(n+1)) - sign(y(n)));
end
ZC;

f_GAIN.m
%function for calc gain per frame
function [gain_b, power_b] = f_GAIN(e, voiced_b, pitch_plot_b)
if voiced_b == 0,
    denom = length(e);
    power_b = sum(e(1:denom).^2) ./ denom;
    gain_b = sqrt(power_b);
else
    denom = (floor(length(e)./pitch_plot_b) .* pitch_plot_b);
    power_b = sum(e(1:denom).^2) ./ denom;
    gain_b = sqrt(pitch_plot_b .* power_b);
end
power_b;
gain_b;

f_DECODER.m
%DECODER PORTION
function synth_speech = f_DECODER(aCoeff, pitch_plot, voiced, gain)
frame_length = 1;
for i = 2:length(gain)
    if gain(i) == 0,
        frame_length = frame_length + 1;
    else
        break;
    end
end
%decoding starts here
for b = 1 : frame_length : (length(gain)),
    if voiced(b) == 1,  %voiced frame
        pitch_plot_b = pitch_plot(b);
        syn_y1 = f_SYN_V(aCoeff, gain, frame_length, pitch_plot_b, b);
    else
        syn_y1 = f_SYN_UV(aCoeff, gain, frame_length, b);  %unvoiced frame
    end
    synth_speech(b:b+frame_length-1) = syn_y1;
end

f_SYN_V.m
%a function of f_DECODER
function syn_y1 = f_SYN_V(aCoeff, gain, frame_length, pitch_plot_b, b)
%creating pulse train
for f = 1:frame_length
    if f./pitch_plot_b == floor(f./pitch_plot_b)
        ptrain(f) = 1;
    else
        ptrain(f) = 0;
    end
end
syn_y2 = filter(1, [1 aCoeff((b+1):(b+1+9))], ptrain);
syn_y1 = syn_y2 .* gain(b);

f_SYN_UV.m
%a function of f_DECODER
function syn_y1 = f_SYN_UV(aCoeff, gain, frame_length, b)
wn = randn(1, frame_length);
syn_y2 = filter(1, [1 aCoeff((b+1):(b+1+9))], wn);
syn_y1 = syn_y2 .* gain(b);

Discussion and Conclusion


Linear Predictive Coding is an analysis/synthesis technique for lossy speech compression that attempts to model the human production of sound instead of transmitting an estimate of the sound wave. Linear predictive coding achieves a bit rate of 2400 bits/second, which makes it ideal for use in secure telephone systems. Secure telephone systems are more concerned that the content and meaning of speech, rather than the quality of speech, be preserved. The trade-off for LPC's low bit rate is that it has some difficulty with certain sounds and it produces speech that sounds synthetic. Linear predictive coding encoders break a sound signal into segments and then send information on each segment to the decoder. The encoder sends information on whether the segment is voiced or unvoiced, along with the pitch period for voiced segments, which is used to create an excitation signal in the decoder. The encoder also sends information about the vocal tract which is used to build a filter on the decoder side which, when given the excitation signal as input, can reproduce the original speech.

References
[1] Lawrence R. Rabiner and Ronald W. Schafer. Introduction to Digital Speech Processing. Vol. 1, Nos. 1-2 (2007) 1-194.
[2] V. Hardman and O. Hodson. Internet/Mbone Audio (2000) 5-7.
[3] Scott C. Douglas. Introduction to Adaptive Filters, Digital Signal Processing Handbook (1999) 7-12.
[4] H. V. Poor, C. G. Looney, R. J. Marks II, S. Verdu, J. A. Thomas, and T. M. Cover. Information Theory, The Electrical Engineering Handbook (2000) 56-57.
[5] R. Sproat and J. Olive. Text-to-Speech Synthesis, Digital Signal Processing Handbook (1999) 9-11.
[6] Richard C. Dorf, et al. Broadcasting (2000) 44-47.
[7] Richard V. Cox. Speech Coding (1999) 5-8.
[8] Randy Goldberg and Lance Riek. A Practical Handbook of Speech Coders (1999) Chapter 2: 1-28, Chapter 4: 1-14, Chapter 9: 1-9, Chapter 10: 1-18.
[9] Mark Nelson and Jean-Loup Gailly. Speech Compression, The Data Compression Book (1995) 289-319.
[10] Khalid Sayood. Introduction to Data Compression (2000) 497-509.
[11] Richard Wolfson and Jay Pasachoff. Physics for Scientists and Engineers (1995) 376-377.
