
Xavier University Ateneo de Cagayan College of Engineering Department of Electronics Engineering

Research Method ACE 08 D

Date Submitted: September 29, 2011

Submitted by: Mark Kenneth Cario Arrogante, Edward Marlo Valledor Pacifico, Adrian Zayas

Remarks: Instructor: Engr. Mary Jean O. Apor Signature:

I. Research Title: Converting Filipino sign language into its equivalent in sound.

II. Name of Proponents: Mark Kenneth Cario Arrogante, Edward Marlo Valledor Pacifico, and Adrian Zayas

III. Background of the Study: Throughout their existence, humans have been physically and mentally challenged by various impediments that hinder their advancement on the economic, political, and social levels. Statistics around the world show a relatively high rate of people with speaking difficulties. Rates average around 9% of the whole population, whether in the Middle East, Europe, Africa, America, or the other continents of the world. Taking individual age groups into consideration, the speaking-impairment rate increases with age: the elderly (65 years and above) show a 25% rate, while people between 55 and 64 years of age have a rate of 15%. [1] It would be unjust for such people to be excluded from society just because they lack elementary means of communication. They are mentally capable individuals who deserve, and are even expected, to play an effective role in society. Society cannot dismiss whatever potential they have, simply because it needs it; it is known that society advances only with the collective effort of all of its members. To this end, technology is utilized to aid humans in their collective effort to overcome such impediments. The objective of the project is to provide a practical way of translating Filipino sign language into speech, offering people with vocal disabilities a means of communication with people incapable of understanding sign language. We first survey existing literature on systems dealing with the recognition of sign language and then propose a design that can achieve the desired objective. The hardware components of the system comprise a video camera, a relatively fast processor, and output speakers. The software components comprise different algorithms to detect and track the face and hand. Other algorithms are needed to capture frames from the camera, subtract their background, normalize them, and then compare them to a set of stored images in a reference database to come up with a decision about the meaning of the gesture made.

IV. Review of Related Literature: In this proposal, a system is presented that enables speaking-impaired Filipino people to further connect with their society and aids them in overcoming communication obstacles created by society's incapability of understanding sign language. The proposed system is based on translating motion into sound: it suggests an approach involving human gesture and sign language recognition via a computer processor that assimilates these gestures and signs to produce their equivalent in sound. Machine gesture and sign language recognition is, as the name suggests, the recognition of gestures and sign language using computers. The hardware technique for gathering information about body positioning described in this paper is image-based, using a digital camera as the interface. However, getting the data is only the first step. The second step, that of recognizing the sign or gesture once it has been captured, is much more challenging, especially in a continuous stream. To that end, this paper describes an effective system developed by Tony Heap [4] for tracking hand position using a video camera. After undergoing a normalization stage, this tracked hand motion is then compared automatically with thousands of images present in a database in the system's memory unit.

One technology for detecting the motion of, or the gestures made by, the hands is based on sensory detectors. In most sensory-based approaches, a cyber-glove is used as the way of interfacing with the processor. The tiny sensors integrated in each glove detect the position, and can in some cases measure the velocity, of corresponding segments of the hand. During processing, each sensor translates the coordinates of its point into a certain eigenspace relative to a certain reference. The input data from the sensors is manipulated by an algorithm that synthesizes it into either recognized or unrecognized gestures.

A second approach to the recognition of gestures is based on image processing, where frames are captured using a video camera. After the data is acquired, the frames are analyzed in order to detect the hands and face. The use of skin color facilitates the differentiation of the hand and face from the background colors using a specific color space. In some approaches, distinctly colored gloves are used in order to detect the hands faster. The motion of the hands or the face is tracked using parameters such as the centroid, area, and vertical and horizontal axes of the hands and face. Recognition of the gestures is sometimes done using Hidden Markov Models that use a set of stored images taken as reference.

The third kind of approach used by already developed systems combines sensors and image processing, each in a specific area. The sensors would be used to detect motion, while the image recognition method would be employed to detect still images. Since the sensors detect the exact angle at which the different body parts are positioned, a large number of sensors needs to be spread out over the entire body, a fact that makes a system based on such an approach less portable. In addition, a system performing recognition based on both image processing and sensors is more complex and costs more than either approach implemented alone, because it combines two technologies based on different principles that require distinct hardware and software components in order to function successfully.

After investigating the existing related designs and studying the available alternatives given several constraints, we decided that a design based on image processing is better suited to achieving the sought-after objective. The image processing approach has two main advantages over the sensor-based approach. The first is that a sensor-based approach requires the signer to be continuously connected to sensing devices, a condition that contradicts our objective of having a practical system that can be used in daily life without much complication. An image processing approach, on the other hand, does not require the signer to wear any kind of external device such as gloves. The second advantage lies in the fact that an image-processing approach allows the recognition of a wider variety of movements and gestures than those 'sensed' by a sensor-based approach. In a video-based approach, body movements and face gestures can be detected in an easier way using the same video frames used to detect hand signs.
This is in contrast to the sensor approach, which, first, cannot detect face gestures at all and, second, requires even more sensors added over the whole body in order to track body movements, which again works against the first advantage. Thus, the two advantages render the image processing approach the more effective and efficient one. With no additional devices added onto the signer and with a wider variety of gesture and movement detection, the image processing approach poses itself as the best design to achieve the objective. To build this design, the hardware components needed are a digital video camera to capture sequential frames, a relatively fast processor that can assimilate frames and data in real time, and a sound output module. The software algorithms needed are an algorithm to detect and track the hand across frames, an algorithm to recognize frames, and an algorithm to read the result of the comparison and output it as sound. The software algorithms can be written in either C++ or MATLAB. [4]
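As a concrete illustration of the skin-color differentiation step mentioned above, the following minimal sketch isolates candidate hand and face pixels by thresholding in the HSV color space, assuming OpenCV's C++ API. The function name and the HSV range values are illustrative assumptions, not part of the proposal, and would need tuning for real lighting conditions and skin tones.

// Sketch of skin-color based hand/face segmentation (illustrative only).
#include <opencv2/opencv.hpp>

cv::Mat segmentSkin(const cv::Mat& frameBGR)
{
    cv::Mat hsv, mask;
    // Work in HSV, a color space where skin tones cluster more tightly than in RGB,
    // so a simple range threshold becomes usable.
    cv::cvtColor(frameBGR, hsv, cv::COLOR_BGR2HSV);
    // Keep only pixels whose hue/saturation/value fall inside the assumed skin range.
    cv::inRange(hsv, cv::Scalar(0, 40, 60), cv::Scalar(25, 180, 255), mask);
    // Remove small speckles so that only the hand and face blobs remain.
    cv::morphologyEx(mask, mask, cv::MORPH_OPEN,
                     cv::getStructuringElement(cv::MORPH_ELLIPSE, cv::Size(5, 5)));
    return mask;  // binary image: white pixels are candidate skin pixels
}

int main()
{
    cv::VideoCapture cap(0);               // default PC camera
    if (!cap.isOpened()) return 1;
    cv::Mat frame;
    while (cap.read(frame)) {
        cv::imshow("skin mask", segmentSkin(frame));
        if (cv::waitKey(30) == 27) break;  // Esc to quit
    }
    return 0;
}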

V. Conceptual/Theoretical Framework of the Study: To build the system, a divide-and-conquer approach is followed in which the entire system is partitioned into several modules, so that small tasks can be focused on and completed in parallel. The modules are then assembled and their functionality is tested altogether. These modules are shown below in the block diagram of the overall system.

Figure 1. Block Diagram of the entire system

As stated earlier, the project is based on image recognition. To that end, the user interface is a digital camera, which provides portability to the system in addition to being an excellent medium for capturing continuous images in real time and facilitating the data acquisition process. As shown in the diagram, the user's hand and face are first detected and tracked. After this, the system starts capturing frames from the camera. The system then performs background subtraction and passes the modified images on to be normalized. The images are then sent to the comparator, where they are compared with reference images in the database. After the signs are recognized by an algorithm, a text reader reads the corresponding text and outputs it to the speakers.

In the hand detection and tracking process, it is highly important to continuously keep track of the hand after successfully detecting it. This is attained by the use of the CAMSHIFT (Continuously Adaptive Mean-Shift) algorithm described in the OpenCV library [7].

For the next process, data acquisition, the DirectShow framework will be used. DirectShow is particularly useful for processing image sequences or sequences captured using PC cameras. The DirectShow architecture relies on a filter architecture: the processing of a sequence is done using a series of filters connected together, the output of one filter becoming the input of the next one. The first filter is usually a decompressor that reads a file stream, and the last filter could be a renderer that displays the sequence in a window.

Following the processes of data acquisition and object detection, captured frames are processed as still images stored as data structures in memory, and background subtraction then takes place. The background subtraction method depends on the variation in brightness between the rear pixels that form the background of the image and the front pixels that form the gesture we need to recognize. The image's brightness and contrast are initially modified in order to further increase the variation in pixel brightness. The image is then passed through a threshold filter that filters out pixels with brightness below a certain threshold. The image is thereby converted into a binary image composed of bright foreground pixels only. After subtracting the background, which might otherwise act as noise or interference, the image is ready to be normalized.

Normalization is the process of setting the image to a certain spatial center or a certain normalized moment. Each image should be normalized before comparison with data from the reference database. A spatial difference corresponds to a difference in pixel filling and shape; without normalization, a similar image with a different spatial orientation or grey-level intensity would give a false negative recognition decision [11].

After the normalization stage, the next process is gesture recognition. In gesture recognition, the captured, normalized image is first compared to the first image in the database. If the comparator yields a mismatch, the comparison continues with the next image in the database, and so on until the image is found. If the comparator yields a mismatch against all the images in the database, the captured image is not in the database. The user interface provided in the system allows the user to save the image of the new sign to the database, as well as the corresponding text equivalent that can be read later by the text reader. In case of a match in gesture recognition, the meaning of the matched image is sent through a text-to-sound converter to generate it as sound.
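To make the flow between the modules of Figure 1 concrete, the following minimal skeleton chains the stages in C++ with OpenCV. The per-stage bodies are deliberately simplified stand-ins (a fixed tracking window, a plain brightness threshold, a resize for normalization, an L2 comparison with an illustrative threshold); they are assumptions for illustration, not the proposal's final implementations, and only the control flow between modules is the point here.

// Skeleton of the overall pipeline in Figure 1 (simplified stand-in stages).
#include <opencv2/opencv.hpp>
#include <iostream>
#include <string>
#include <vector>

struct ReferenceSign { cv::Mat image; std::string text; };

// 1. Hand detection/tracking: stand-in returning a fixed central window.
//    The real module would use the CAMSHIFT tracker described in the text.
cv::Rect trackHand(const cv::Mat& frame) {
    return cv::Rect(frame.cols / 4, frame.rows / 4, frame.cols / 2, frame.rows / 2);
}

// 3. Background subtraction: keep only bright foreground pixels.
cv::Mat subtractBackground(const cv::Mat& roi) {
    cv::Mat gray, binary;
    cv::cvtColor(roi, gray, cv::COLOR_BGR2GRAY);
    cv::threshold(gray, binary, 200, 255, cv::THRESH_BINARY);
    return binary;
}

// 4. Normalization: bring every gesture image to a common size before comparison.
cv::Mat normalizeImage(const cv::Mat& binary) {
    cv::Mat normalized;
    cv::resize(binary, normalized, cv::Size(64, 64));
    return normalized;
}

// 5. Comparison: return the index of the closest reference image, or -1 if none is close enough.
int matchAgainstDatabase(const cv::Mat& img, const std::vector<ReferenceSign>& db) {
    int best = -1; double bestDist = 1e12;
    for (size_t i = 0; i < db.size(); ++i) {
        double d = cv::norm(img, db[i].image, cv::NORM_L2);
        if (d < bestDist) { bestDist = d; best = static_cast<int>(i); }
    }
    const double matchThreshold = 2000.0;  // illustrative value
    return (best >= 0 && bestDist < matchThreshold) ? best : -1;
}

// 6. Text output: stand-in for the text-to-sound converter.
void speakText(const std::string& text) { std::cout << text << std::endl; }

int main() {
    cv::VideoCapture camera(0);
    std::vector<ReferenceSign> database;   // would be loaded from the reference database
    cv::Mat frame;
    while (camera.read(frame)) {
        cv::Mat roi = frame(trackHand(frame));                          // 1-2. track hand, capture region
        cv::Mat normalized = normalizeImage(subtractBackground(roi));   // 3-4. subtract background, normalize
        int id = matchAgainstDatabase(normalized, database);            // 5. compare with references
        if (id >= 0) speakText(database[id].text);                      // 6. read the matched text aloud
        if (cv::waitKey(30) == 27) break;
    }
    return 0;
}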

Costing:

Materials          Approximate Price (Php)
Digital Camera     20,000 - 30,000
Processor           7,000 - 10,000
Speakers            2,000 - 3,000

VI. Statement of the Problem: How can Filipino sign language be made readily understandable to people who do not understand the language?

VII. Assumptions:

VIII. Significance of the Study. The translation of sign language into speech opens a new communication channel between people incapable of understanding sign language and people incapable of audible articulation. This paper proposes an effective and efficient solution for such communication. Moreover, the existence of such a system will not only enable speaking-impaired individuals to be more involved in society, allowing them to live up to their true potential and be more productive; it will also allow society to overcome impediments created by miscommunication, or the lack of communication, and hence be able, as a whole, to make use of every existing human resource and keep advancing on the economic, political, and social levels. This project has a fundamental impact on society and, if further developed, can open the way to breakthrough technological applications.

IX. Scope and Limitation. The scope of the study is the design of a system that converts Filipino sign language into its equivalent in sound. It also includes a survey of related literature, which serves both as the basis of the design and as the source of an existing design that will be revised for the specific purpose of this study, namely prioritizing the conversion of Filipino sign language into sound. The study is limited to the design of the system and to the revision of algorithms used in other existing projects to make them suitable for converting Filipino sign language.

X. Definition of Terms.

OpenCV (Open Source Computer Vision Library). It is a library of programming functions mainly aimed at real-time computer vision, developed by Intel and now supported by Willow Garage. It is free for use under the open-source BSD license. The library is cross-platform and focuses mainly on real-time image processing. If the library finds Intel's Integrated Performance Primitives on the system, it will use these proprietary optimized routines to accelerate itself.

Hidden Markov model (HMM). It is a statistical Markov model in which the system being modeled is assumed to be a Markov process with unobserved (hidden) states. An HMM can be considered the simplest dynamic Bayesian network. Hidden Markov models are especially known for their applications in temporal pattern recognition such as speech, handwriting, gesture recognition, part-of-speech tagging, musical score following, partial discharges, and bioinformatics.

Normalization. It is the process of isolating statistical error in repeated measured data. A normalization is sometimes based on a property; quantile normalization, for instance, is normalization based on the magnitude (quantile) of the measures.

Data acquisition. It is the process of sampling signals that measure real-world physical conditions and converting the resulting samples into digital numeric values that can be manipulated by a computer. Data acquisition systems (abbreviated DAS or DAQ) typically convert analog waveforms into digital values for processing. The components of data acquisition systems include sensors that convert physical parameters to electrical signals, signal conditioning circuitry to convert sensor signals into a form that can be converted to digital values, and analog-to-digital converters, which convert conditioned sensor signals to digital values.

Background subtraction. It is a commonly used class of techniques for segmenting out objects of interest in a scene for applications such as surveillance. It involves comparing an observed image with an estimate of what the image would look like if it contained no objects of interest. The areas of the image plane where there is a significant difference between the observed and estimated images indicate the location of the objects of interest. The name "background subtraction" comes from the simple technique of subtracting the observed image from the estimated image and thresholding the result to generate the objects of interest.

Chereme. It is a basic unit of signed communication and is functionally and psychologically equivalent to the phonemes of oral languages, and has been replaced by that term in the academic literature. Cherology is the study of cheremes.

Epenthesis. In phonology, it is the addition of one or more sounds to a word, especially to the interior of a word. Epenthesis may be divided into two types: excrescence, for the addition of a consonant, and anaptyxis, for the addition of a vowel.

Expectation-maximization (EM) algorithm. It is a method for finding maximum likelihood or maximum a posteriori (MAP) estimates of parameters in statistical models, where the model depends on unobserved latent variables. EM is an iterative method that alternates between an expectation (E) step, which computes the expectation of the log-likelihood evaluated using the current estimate of the parameters, and a maximization (M) step, which computes parameters maximizing the expected log-likelihood found in the E step. These parameter estimates are then used to determine the distribution of the latent variables in the next E step.

Cyberglove. It is a glove-like input device for human-computer interaction, often in virtual reality environments. Various sensor technologies are used to capture physical data such as the bending of fingers. Often a motion tracker, such as a magnetic or inertial tracking device, is attached to capture the global position/rotation data of the glove. These movements are then interpreted by the software that accompanies the glove, so any one movement can mean any number of things. Gestures can then be categorized into useful information, such as recognizing sign language or other symbolic functions.

Eigenspace. The eigenspace of an eigenvalue is the vector space spanned by the eigenvectors associated with that eigenvalue. Its dimension is the number of linearly independent eigenvectors.

ODBC (Open Database Connectivity). It is a standard software interface for accessing database management systems (DBMS). The designers of ODBC aimed to make it independent of programming languages, database systems, and operating systems. Thus, any application can use ODBC to query data from a database, regardless of the platform it is on or the DBMS it uses. ODBC accomplishes platform and language independence by using an ODBC driver as a translation layer between the application and the DBMS. The application thus only needs to know ODBC syntax, and the driver can then pass the query to the DBMS in its native format, returning the data in a format the application can understand.

SQL (Structured Query Language). It is a programming language designed for managing data in relational database management systems (RDBMS). Originally based upon relational algebra and tuple relational calculus, its scope includes data insert, query, update and delete, schema creation and modification, and data access control. SQL was one of the first commercial languages for Edgar F. Codd's relational model, as described in his influential 1970 paper, "A Relational Model of Data for Large Shared Data Banks". Despite not adhering entirely to the relational model as described by Codd, it became the most widely used database language.

XI. Methodology. This project is based on image recognition, and the user interface is a digital camera, which provides portability to the system in addition to being an excellent medium for capturing continuous images in real time and facilitating the data acquisition process. The user's hand and face are first detected and tracked. After this, the system starts capturing frames from the camera. The system then performs background subtraction and passes the modified images on to be normalized. The images are then sent to the comparator, where they are compared with reference images in the database. After the signs are recognized by an algorithm, a text reader reads the corresponding text and outputs it to the speakers.

Hand Detection and Tracking
Since the application aims at recognizing hand gestures, it is highly important to continuously keep track of the hand after successfully detecting it. This is attained by the use of the CAMSHIFT (Continuously Adaptive Mean-Shift) algorithm described in the OpenCV library [7]. The algorithm is as follows. After frames are acquired from the camera, each frame (image) is converted into a color probability distribution image using a color histogram model. The color distribution of the object to be tracked is continuously distinguished and hence tracked. Originally, the algorithm requires us to determine the color of the object we plan to track and its initial position. This can be done either manually, by assigning the position of the hand (object) with the mouse and drawing a small window around it, or by implementing a motion-detection algorithm that determines the position, and hence the color, of the hand (object) we intend to track and automatically draws a window around it. With the latter method, the user is required to initially move his hand in a simple, smooth motion in order for the program to detect the position and color of the hand. After the position of the object is determined, its color, center, and size are found based on the color probability and color distribution of the image. The current size and location of the tracked hand are reported and used as an initial guess for determining the new location of the search window in the next frames. The process is repeated for each frame and, as a result, the program keeps tracking the object by finding its new position in each frame. [4]
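A minimal sketch of this CAMSHIFT-based tracking loop is given below, using OpenCV's C++ API (cv::CamShift). Here the initial hand window is assumed to be supplied by hand; as described above, it could instead come from a motion-detection step. The window size and histogram parameters are illustrative assumptions.

// Sketch of CAMSHIFT hand tracking with OpenCV (illustrative parameters).
#include <opencv2/opencv.hpp>

int main()
{
    cv::VideoCapture cap(0);
    if (!cap.isOpened()) return 1;

    cv::Mat frame, hsv, hueHist, backProj;
    cap >> frame;
    if (frame.empty()) return 1;
    // Assumed initial window around the hand (normally chosen by the user or by motion detection).
    cv::Rect window(frame.cols / 2 - 60, frame.rows / 2 - 60, 120, 120);

    // Build a hue histogram of the initial hand region: this is the color model
    // whose probability distribution CAMSHIFT follows from frame to frame.
    cv::cvtColor(frame, hsv, cv::COLOR_BGR2HSV);
    int histSize = 30; int channels[] = {0};
    float hueRange[] = {0, 180}; const float* ranges[] = {hueRange};
    cv::Mat roi(hsv, window);
    cv::calcHist(&roi, 1, channels, cv::Mat(), hueHist, 1, &histSize, ranges);
    cv::normalize(hueHist, hueHist, 0, 255, cv::NORM_MINMAX);

    while (cap.read(frame)) {
        cv::cvtColor(frame, hsv, cv::COLOR_BGR2HSV);
        // Convert the frame into a color-probability image for the hand's hue model.
        cv::calcBackProject(&hsv, 1, channels, hueHist, backProj, ranges);
        // CAMSHIFT shifts and resizes the search window toward the probability peak;
        // the updated window is reused as the initial guess for the next frame.
        cv::RotatedRect track = cv::CamShift(backProj, window,
            cv::TermCriteria(cv::TermCriteria::EPS | cv::TermCriteria::COUNT, 10, 1));
        if (track.size.width > 0 && track.size.height > 0)
            cv::ellipse(frame, track, cv::Scalar(0, 255, 0), 2);
        cv::imshow("hand tracking", frame);
        if (cv::waitKey(30) == 27) break;  // Esc to quit
    }
    return 0;
}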

Data Acquisition (Frame Capturing)
The DirectShow framework will be used for data acquisition. The OpenCV library collaborates with DirectShow technology, which is part of Microsoft DirectX technology [11]. DirectShow is particularly useful for processing image sequences or sequences captured using PC cameras. The DirectShow architecture relies on a filter architecture. The filters are basically of three types:
1. source filters that output video and/or audio signals;
2. transform filters that process an input signal and produce one (or several) output(s);
3. rendering filters that display or save a media signal.
The processing of a sequence is therefore done using a series of filters connected together, the output of one filter becoming the input of the next one. The first filter is usually a decompressor that reads a file stream, and the last filter could be a renderer that displays the sequence in a window. In DirectShow terminology, a series of filters is called a filter graph. DirectShow is first used to determine the type of filter graph required for acquiring the video stream, whether from a camera or from a file (MPEG, AVI, etc.). The filters are then implemented (or included) through the C++ OpenCV library in order to start capturing the video frames using the HighGUI component of the OpenCV library.

Background Subtraction
Following the processes of data acquisition and object detection, captured frames can be processed as still images stored as data structures in memory. The background subtraction method depends on the variation in brightness between the rear pixels that form the background of the image and the front pixels that form the gesture we need to recognize. The image's (frame's) brightness and contrast are initially modified using the ContrastBrightness() function in order to further increase the variation in pixel brightness. The image is then passed through a threshold filter that filters out pixels with brightness below a certain threshold, using the built-in function cvThreshold(src, dst, 200, 255, CV_THRESH_BINARY). The image is thereby converted into a binary image composed of bright foreground pixels only. After subtracting the background, which might otherwise act as noise or interference and cause errors in the output, the image is ready to be normalized.
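The sketch below illustrates this frame-capture and brightness-threshold foreground extraction, using the C++ equivalents of the C-style calls named above (cv::convertScaleAbs in place of ContrastBrightness(), cv::threshold in place of cvThreshold). The 200/255 threshold values follow the text; the contrast/brightness gains are illustrative assumptions.

// Sketch of the background-subtraction step (brightness/contrast adjustment + threshold).
#include <opencv2/opencv.hpp>

cv::Mat extractForeground(const cv::Mat& frameBGR)
{
    cv::Mat gray, adjusted, binary;
    cv::cvtColor(frameBGR, gray, cv::COLOR_BGR2GRAY);

    // Stretch contrast and lift brightness so that foreground (gesture) pixels
    // separate further from the darker background pixels. alpha/beta are illustrative.
    cv::convertScaleAbs(gray, adjusted, /*alpha=*/1.5, /*beta=*/20);

    // Filter out pixels with brightness below the threshold; the result is a
    // binary image made up of the bright foreground pixels only.
    cv::threshold(adjusted, binary, 200, 255, cv::THRESH_BINARY);
    return binary;
}

int main()
{
    cv::VideoCapture cap(0);  // frame capture from the PC camera
    cv::Mat frame;
    while (cap.read(frame)) {
        cv::imshow("foreground", extractForeground(frame));
        if (cv::waitKey(30) == 27) break;  // Esc to quit
    }
    return 0;
}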

Normalization Technique
Normalization is the process of setting the image to a certain spatial center or a certain normalized moment. Each image should be normalized before comparison with data from the reference database. A spatial difference corresponds to a difference in pixel filling and shape; without normalization, a similar image with a different spatial orientation or grey-level intensity would give a false negative recognition decision [11].

Gesture Recognition
Gesture recognition is based on image subtraction and threshold comparison. The captured, normalized image is first compared to the first image in the database. If the comparator yields a mismatch, the comparison continues with the next image in the database, and so on until the image is found. If the comparator yields a mismatch against all the images in the database, the captured image is not in the database. The user interface provided in the system allows the user to save the image of the new sign to the database, as well as the corresponding text equivalent that can be read later by the text reader. If the image is found, the comparator module returns the ID of the reference image to which the match was successful. This ID is then used in the text-to-sound mapping module in order to provide the corresponding sound output.

Reference Database
The database contains the images, their corresponding IDs, and the text meaning of each sign. To this end, Microsoft Access is used to build it because, in addition to its ease of use, it provides the ability to store images and sounds. An ODBC data source is used to connect the database to the C code. The C code reads, from the text file produced by the gesture recognition module, the ID of the image that yielded a successful match. A SELECT SQL statement is used to locate the image in the database in order to get the text meaning of the sign. The corresponding sound is generated after the text-to-sound converter reads from a text file that contains the meaning of the recognized sign.

Text-to-Sound Converter
To make the system dynamic and give the user control over it, the user is provided with the option of inserting new signs into the database. However, to simplify the addition of sound to the database and overcome the user's inability to do so directly, the feature of inserting the meaning of the new sign as a string of text into the database is made available. This text is then processed (by the user) using the text-to-sound module to produce a wave file of the corresponding sound. This sound file is added to the sound-files folder, and the system can now comprehend a new sign. [4]
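The following minimal sketch ties together the normalization and gesture-recognition steps just described, assuming OpenCV's C++ API. Moment-based centering is one concrete way to realize the "spatial center / normalized moment" normalization above; the mismatch threshold is an illustrative value; and a std::map stands in here for the Access/ODBC reference database and its SELECT lookup.

// Sketch of normalization (centroid centering) and sequential gesture matching.
#include <opencv2/opencv.hpp>
#include <map>
#include <string>
#include <vector>

// Normalization: shift the binary gesture image so that its centroid (first spatial
// moment) sits at the image center, reducing false negatives caused purely by a
// spatial offset between the captured and reference images.
cv::Mat centreOnCentroid(const cv::Mat& binary)
{
    cv::Moments m = cv::moments(binary, /*binaryImage=*/true);
    if (m.m00 == 0) return binary.clone();              // empty image: nothing to centre
    double dx = binary.cols / 2.0 - m.m10 / m.m00;
    double dy = binary.rows / 2.0 - m.m01 / m.m00;
    cv::Mat shift = (cv::Mat_<double>(2, 3) << 1, 0, dx, 0, 1, dy);
    cv::Mat centred;
    cv::warpAffine(binary, centred, shift, binary.size());
    return centred;
}

// Gesture recognition: compare the normalized capture against each reference image
// in turn (image subtraction + threshold on the residual), returning the ID of the
// first sufficiently close match, or -1 if no reference matches.
int recognise(const cv::Mat& capture, const std::vector<cv::Mat>& references)
{
    const double mismatchThreshold = 1500.0;             // illustrative value
    for (size_t id = 0; id < references.size(); ++id) {
        cv::Mat diff;
        cv::absdiff(capture, references[id], diff);
        if (cv::norm(diff, cv::NORM_L2) < mismatchThreshold)
            return static_cast<int>(id);
    }
    return -1;                                            // sign not found in the database
}

int main()
{
    std::vector<cv::Mat> references;                      // reference sign images (same size as captures)
    std::map<int, std::string> meaning;                   // ID -> text meaning, stand-in for the database
    cv::Mat capture = cv::Mat::zeros(64, 64, CV_8U);      // a background-subtracted binary gesture image

    int id = recognise(centreOnCentroid(capture), references);
    if (id >= 0) {
        // The matched text would next be passed to the text-to-sound converter.
        std::string text = meaning[id];
    }
    return 0;
}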

Business Plan
I. Objectives

Upon implementation of this project, the group will sell the product to speaking-impaired persons who are capable of Filipino sign language and to people, companies, or businessmen who are willing to employ Filipinos with vocal disabilities.
The product must be made competitive, especially with regard to its precision and price.
The capital must be recovered and a profit earned for the continuity of the business and for further development of the product.

II. Mission
The goal of the group is to help speaking-impaired persons, that is, to offer Filipinos with vocal disabilities a means of communication with people incapable of understanding Filipino sign language. It is also our goal that, through the use of this device, which converts Filipino sign language into its equivalent in speech, speaking-impaired persons, especially Filipinos, can have a decent job and a chance to better themselves. On the other hand, it is not our goal to profit from selling this device but to gather funds for its further development, which in the end will help Filipinos with vocal disabilities.

III. Keys to Success

The device must be affordable.
The functions of the device must be properly tested for precision.
The device must be durable for long-lasting use.
The device must be user-friendly and easy to use.
Cooperation of the group.
Highly detailed planning and execution.

IV. Market Consideration
The main targets of this device are all people who have vocal disabilities, and we need to research how we are going to spread information about the device so that they can be oriented to it. A budget must be allotted just for informing people about what we are going to sell, because, as electronics engineers who are experts in communication propagation, we need to study how to spread the information at the least possible cost. We must also consider that not all speaking-impaired persons can afford this kind of technology, but there could be people, businessmen, or companies that might want to buy this device for them and employ them afterwards.

V. Strategic Market
Technology needs: the need to develop more efficient methods for procuring products/services, training staff, and meeting other needs necessary to keep a business functioning well.
Financing needs: without financing during good times as well as bad, many small businesses would simply go out of business for want of cash, or because they could not finance and build the product.
Connections: many rural small businesses do not have the capital, financial, and marketing connections that their urban counterparts enjoy. We need the help of big companies for our information to be propagated.

VI. Strategy Implementation
Fundraising activity: by doing this we will be able to solicit money from anyone who is willing to help our group.

Competitive edge: this will help if someone in our group has expertise in sales talk, so that he can captivate the minds of the people and convince them to buy our product.

XII. Working Bibliography:
[1] http://books.google.com/books?id=BYiMgQytRU8C&pg=PA6&lpg=PA6&dq=%22hearing+mutism%22&source=web&ots=W2ypNqM5Zm&sig=abMOBzrb9WReh8MYU9lkVie9UnM&hl=en#v=onepage&q=%22hearing%20mutism%22&f=false
[2] http://www.linkedin.com/company/mute/statistics
[3] http://ieeexplore.ieee.org/Xplore/login.jsp?url=http%3A%2F%2Fieeexplore.ieee.org%2Fiel5%2F5028855%2F5069167%2F05069179.pdf%3Farnumber%3D5069179&authDecision=-203
[4] D. Chai and K. N. Ngan, "Locating facial region of a head-and-shoulders color image," in Proc. 3rd Int. Conf. Automatic Face and Gesture Recognition, 1998, pp. 124-129.
[5] BigEye: A Real-Time Video to MIDI Macintosh Software. [Online]. Available: http://www.steim.nl/bigeye.html
[6] M. Ohki, "The Sign Language Telephone," 7th World Telecommunication Forum, Vol. 1, pp. 3.91-395, 1995.
[7] V. Pavlovic, R. Sharma, and T. Huang, "Visual interpretation of hand gestures for human-computer interaction: A review," IEEE Transactions on Pattern Analysis and Machine Intelligence, 19(7):677-695, July 1997.
[8] T. Starner and A. Pentland, "Real-Time American Sign Language Recognition from Video Using Hidden Markov Models," Perceptual Computing Section, The Media Laboratory, Massachusetts Institute of Technology, IEEE 1995.
[9] J. Dias, P. Nande, N. Barata, and A. Correia, "O.G.R.E. Open Gestures Recognition Engine," ADETTI/ISCTE, Lisboa, Portugal, XVII Brazilian Symposium on Computer Graphics and Image Processing (SIBGRAPI 2004).
[10] M. Sarfraz, A. Syed Yusuf, and M. Zaeshan, "A System for Sign Language Recognition using Fuzzy Object Similarity Tracking," Information and Computer Science Department, King Fahd University of Petroleum & Minerals, Dhahran, Saudi Arabia, Ninth International Conference on Information Visualisation (IV 2005).
[11] Intel Corporation. (2006). Open Source Computer Vision Library. [Online]. Available: http://www.intel.com/technology/computing/opencv/index.htm
