
Gesture Recognition for Virtual Reality Applications Using Data Gloves and Neural Networks

John Weissmann, Department of Computer Science, University of Zurich, jody@ifi.unizh.ch
Ralf Salomon, Department of Computer Science, University of Zurich, salomon@ifi.unizh.ch

Abstract
This paper explores the use of hand gestures as a means of human-computer interaction for virtual reality applications. For the application, specific hand gestures, such as fist, index finger, and victory sign, have been defined. Most existing approaches use camera-based recognition systems, which are rather costly and very sensitive to environmental changes. In contrast, this paper explores a data glove as the input device, which provides 18 measurement values for the angles of different finger joints. The paper compares the performance of different neural network models, such as back-propagation and radial-basis functions, which the recognition system uses to recognise the actual gesture. Some network models achieve a recognition rate (in training as well as in generalisation) of up to 100% over a number of test subjects. Due to its good performance, this recognition system is a first step towards virtual reality applications in which program execution is controlled by a sign language.

Introduction
Currently, interactions with virtual reality (VR) applications are done in a rather simple way. Even when sophisticated devices such as space balls, 3D mice, or data gloves are present, they are mainly used as a means for pointing and grabbing, i.e. the same I/O paradigm as is used with 2D mice. However, it has been shown [1] that experienced users work more efficiently with word processors when using keyboard shortcuts than when using the mouse. Generalising this observation to three dimensions, our aim was to move away from the simple point-and-click paradigm to a more compact way of interaction. We therefore explore how hand gestures can be used to interact with VR applications in the form of a simple sign language.

In gesture recognition, it is more common to use a camera in combination with an image recognition system [2]. Such systems have the disadvantage that the image/gesture recognition is very sensitive to illumination, hand position, hand orientation, etc. To circumvent these problems, we decided to use a data glove as the input device.

Problem Description
The problem we faced was to find a way to map the set of angular measurements delivered by the data glove to a set of pre-defined hand gestures. Furthermore, it would be advantageous to have a system with a certain amount of flexibility, so that the same system can be used by different people.

Methods
In our experiments, we used the CyberGlove, distributed by Virtual Technologies Inc. [3], which measures the angles of 18 joints of the hand: two for each finger, one each for the angles between neighbouring fingers, as well as one each for thumb rotation, palm arch, wrist pitch, and wrist yaw. To design and train the neural networks, we used the Stuttgart Neural Network Simulator (SNNS) [4], a free software package. SNNS also provides a tool that converts a trained network into a C-code module, which can subsequently be included in an application.

For our experiments we chose a set of 20 static hand gestures, such as fist, index finger, gun, and victory sign. Accordingly, each neural network model had 18 input and 20 output nodes. The experiments were performed with three standard three-layered back-propagation networks using the logistic function

$f(\mathrm{net}_i^l) = \frac{1}{1 + \exp(-\mathrm{net}_i^l)}$

at each layer $l$, with

$\mathrm{net}_i^l = \sum_j w_{ij}\, o_j^{l-1}$,

where $o_j^{l-1}$ denotes the output of the units of the previous layer. Learning was performed with a constant learning rate of $\eta = 0.2$. For more information on back-propagation, see [5] or [6].

We collected a pattern set of 200 hand gestures from one person, which we divided into a training set of 140 patterns and a test set of 60 patterns. The structure of the networks can be described as follows:

(i) Network BPfull: all 30 hidden units are fully connected to all input units.
(ii) Network BPpair: each hidden unit is connected to the input units corresponding to the measurements of two fingers ("finger pairs"). Since we treat the measurements of thumb rotation, palm arch, wrist pitch, and wrist yaw as those of a sixth finger, this amounts to 15 units in the hidden layer.
(iii) Network BPtriple: each hidden unit is connected to the input units corresponding to the measurements of finger triples, which again leads to 15 hidden units.

The idea behind the architectures of BPpair and BPtriple is to exploit a (tentative) correlation between gestures and finger combinations. In all networks, all hidden units are fully connected to all output units, each of which is responsible for recognising a particular gesture (see Fig. 1).
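To make this architecture concrete, the following is a minimal NumPy sketch of the forward pass through a BPpair-style network. It is not the SNNS implementation used in the paper: the sensor-to-finger grouping, the random initialisation, and all names are illustrative assumptions; only the unit counts, the masked pair connectivity, and the logistic activation follow the description above.

```python
import numpy as np
from itertools import combinations

N_IN, N_HID, N_OUT = 18, 15, 20    # 18 joint angles, 15 finger pairs, 20 gestures

# Grouping of the 18 glove sensors into six "fingers". The paper treats thumb
# rotation, palm arch, wrist pitch, and wrist yaw as a sixth finger; the exact
# sensor-to-finger assignment below is an illustrative assumption.
FINGER_GROUPS = [
    [0, 1],               # thumb: two joint sensors
    [2, 3, 10],           # index: two joints + thumb/index abduction
    [4, 5, 11],           # middle: two joints + index/middle abduction
    [6, 7, 12],           # ring: two joints + middle/ring abduction
    [8, 9, 13],           # pinkie: two joints + ring/pinkie abduction
    [14, 15, 16, 17],     # "sixth finger": thumb rotation, palm arch, wrist pitch/yaw
]

def logistic(x):
    """Logistic activation f(net) = 1 / (1 + exp(-net))."""
    return 1.0 / (1.0 + np.exp(-x))

# Binary mask: hidden unit k is connected to the inputs of exactly one finger pair.
mask = np.zeros((N_HID, N_IN))
for k, (a, b) in enumerate(combinations(range(6), 2)):    # C(6,2) = 15 pairs
    mask[k, FINGER_GROUPS[a]] = 1.0
    mask[k, FINGER_GROUPS[b]] = 1.0

rng = np.random.default_rng(0)
W_hid = rng.normal(0.0, 0.1, (N_HID, N_IN)) * mask    # masked hidden weights
W_out = rng.normal(0.0, 0.1, (N_OUT, N_HID))          # fully connected output layer

def forward(angles):
    """Forward pass of the BPpair network for one 18-value glove sample."""
    hidden = logistic(W_hid @ angles)
    return logistic(W_out @ hidden)    # one output activation per gesture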

The recognised gesture is determined in a winner-takes-all fashion, provided that at least one output unit exceeds the (experimentally determined) threshold value $\theta = 0.8$; otherwise, the pattern is classified as unknown.
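Continuing the sketch above, this thresholded winner-takes-all read-out takes only a few lines (the label list is a hypothetical argument):

```python
THETA = 0.8    # experimentally determined threshold

def classify(outputs, labels):
    """Winner-takes-all read-out: the strongest output unit wins, but only
    if its activation exceeds THETA; otherwise the pattern is 'unknown'."""
    winner = int(np.argmax(outputs))
    return labels[winner] if outputs[winner] > THETA else "unknown"

# e.g. classify(forward(angles), GESTURE_LABELS), where GESTURE_LABELS is a
# hypothetical list of the 20 gesture names.
```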

First Results
The first network, BPfull, performed quite poorly (< 10%), whereas BPpair and BPtriple yielded high recognition rates of 99.5% and 92.0%, respectively, on the test set.

If a gesture recognition system is to be used productively, it must be flexible enough that different people can use it without having to go through a tedious data collection and training session. Obviously, the particular recognition rate depends significantly on the test person's hand geometry. To get a better idea of the generalisation capabilities of such networks, we took training and test sets from 5 different persons; again, all training sets consisted of 140 patterns. In a first experiment, we trained 5 networks (based on the finger-pair structure) with the 5 training sets and checked the recognition rate of each network on each of the 5 test sets. For this and the following experiments we restricted ourselves to the network architecture BPpair. The results are shown in Table 1:

Table 1
        Test Set A   Test Set B   Test Set C   Test Set D   Test Set E
Net A   1.00         0.92         0.87         0.85         0.78
Net B   0.82         1.00         0.93         0.90         0.77
Net C   0.98         0.90         0.98         0.88         0.75
Net D   0.95         0.88         0.97         1.00         0.77
Net E   0.67         0.80         0.77         0.67         0.98


As can be seen, the recognition rate on a network's own test set is practically 100%, the exceptions being Net C and Net E (0.98 each). The recognition rate on other persons' test sets varies strongly, between 67% and 98%, although in most cases it is higher than 85%. These results indicated that it might be possible to train a net in such a way that the gestures of any person are recognised with acceptable accuracy.
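The cross-subject matrix of Table 1 results from a straightforward double loop over persons. The sketch below assumes hypothetical train and classify helpers and per-person pattern sets; it is meant only to pin down the evaluation protocol, not to reproduce the SNNS workflow.

```python
def recognition_matrix(train_sets, test_sets, train, classify):
    """Train one net per person and evaluate it on every person's test set.

    train_sets / test_sets: dicts mapping person -> list of (angles, label)
    pairs; train and classify are assumed helpers (e.g. back-propagation with
    eta = 0.2, and the thresholded winner-takes-all read-out)."""
    persons = sorted(train_sets)
    matrix = {}
    for p in persons:
        net = train(train_sets[p])
        for q in persons:
            hits = sum(classify(net, x) == y for x, y in test_sets[q])
            matrix[p, q] = hits / len(test_sets[q])    # entry "Net p on Test Set q"
    return matrix
```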

Fig. 1: Structure of the finger-pair network. The nodes of the input layer are grouped by fingers. Each node of the hidden layer receives its input from exactly two finger groups. Each output node receives its input from all nodes of the hidden layer. For clarity, not all nodes and connections are shown.

Combined Training Sets


In the next experiment, we merged several combinations of the original 5 training sets into new training sets. Table 2 shows the recognition rates of five networks, each trained with a combination of 4 training sets. In this table, Net A denotes a net that has been trained with a combination of the training sets of persons B, C, D, and E, but not A.

Table 2
        Test Set A   Test Set B   Test Set C   Test Set D   Test Set E
Net A   1.00         0.98         1.00         1.00         1.00
Net B   1.00         0.98         1.00         0.98         1.00
Net C   1.00         1.00         1.00         0.98         1.00
Net D   1.00         0.98         1.00         0.97         1.00
Net E   1.00         1.00         1.00         1.00         0.88

A further net, trained with a combination of all five training sets, scored extremely well on the test sets: with the exception of test set B, for which the recognition rate was 98.3%, it showed a 100% recognition rate. We are, of course, aware that the data set we used is too small to permit significant statements about such a net's performance for all possible hand geometries; however, we believe the results achieved so far are encouraging. Nevertheless, it is conceivable that the combined net cannot cope with the gestures of a user whose hand geometry differs radically from those used to create the training sets. It would therefore be interesting to look at systems whose parameters can be changed at run time.
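Building the combined training sets used in this section, both the leave-one-person-out sets of Table 2 and the all-persons set, amounts to a simple merge; a minimal sketch, with the pattern-set layout of the earlier sketches:

```python
def combined_training_sets(train_sets):
    """Leave-one-person-out merge: for each person p, combine the training
    sets of all other persons (the nets of Table 2). Merging all values of
    train_sets without exclusion yields the all-persons net discussed above."""
    return {
        p: [pattern
            for q, patterns in train_sets.items() if q != p
            for pattern in patterns]
        for p in train_sets
    }
```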

Radial Basis Functions


Radial-basis function (RBF) networks consist of an input and an output layer in which each output unit is fully connected to all input units. Each output unit $o_j$ maintains an N-dimensional vector $\vec{c}_j$ (with N denoting the number of input units), which represents the centre of a Gaussian bump. Each output unit first calculates the distance

$d_j = \sum_{i=1}^{N} (c_{ji} - u_i)^2$

of its centre to the current input activation, denoted by the $u_i$'s. It then determines its activation

$\mathrm{act}(o_j) = \exp(-d_j / \sigma)$,

with $\sigma$ denoting a scaling factor. Further details on RBF networks can be found in [6].

The advantage of employing RBFs lies in the fact that they can easily be retrained at run time due to their linear character. A gesture recognition system based on RBFs could thus be adaptively retrained whenever it encounters a user whose hand geometry differs strongly from those in the training sets.
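Written out, an RBF output layer is one distance computation and one exponential per gesture. The sketch below uses $\sigma = 1.0$ as in our experiments; the adapt method only illustrates the kind of runtime adjustment just mentioned and is not a method from the paper.

```python
import numpy as np

class RBFLayer:
    """One Gaussian output unit per gesture: act_j = exp(-d_j / sigma),
    with d_j the squared distance of the input to the unit's centre."""

    def __init__(self, centres, sigma=1.0):               # sigma = 1.0 as in our experiments
        self.centres = np.asarray(centres, dtype=float)   # shape: (n_gestures, 18)
        self.sigma = sigma

    def activations(self, u):
        d = np.sum((self.centres - u) ** 2, axis=1)       # squared distances d_j
        return np.exp(-d / self.sigma)

    def adapt(self, u, j, rate=0.1):
        """Illustrative runtime adaptation (not the paper's method): nudge the
        centre of gesture j towards a confirmed sample u from a new user."""
        self.centres[j] += rate * (u - self.centres[j])
```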


For our experiments with radial-basis function systems, we employed the same training and test sets as for the back-propagation networks; as scaling factor we used $\sigma = 1.0$. The recognition rates of 5 simple RBFs (i.e. RBFs each trained with the training set of a single person) are shown in Table 3:

Table 3
        Test Set A   Test Set B   Test Set C   Test Set D   Test Set E
RBF A   0.98         0.53         0.70         0.66         0.45
RBF B   0.35         0.95         0.48         0.53         0.31
RBF C   0.53         0.51         0.98         0.53         0.43
RBF D   0.65         0.60         0.55         1.00         0.53
RBF E   0.33         0.35         0.40         0.46         0.98

It can be seen that the generalisation capabilities of the simple RBFs are somewhat inferior to those of the back-propagation networks trained with single training sets. However, by training RBFs with combinations of training sets, we achieve generalisation capabilities similar to those of the back-propagation networks trained with combined training sets. As in Table 2, RBF A denotes an RBF whose training set is a combination of the training sets B, C, D, and E, but not A.

Table 4
        Test Set A   Test Set B   Test Set C   Test Set D   Test Set E
RBF A   0.91         0.98         0.99         0.99         0.96
RBF B   0.99         0.86         1.00         0.99         0.93
RBF C   1.00         0.99         0.94         0.99         0.96
RBF D   0.99         0.99         0.99         0.97         0.96
RBF E   1.00         0.99         0.99         0.98         0.72

Applications and Future Work

In order to demonstrate the usability of a sign language as a means of controlling a program, we incorporated our gesture recognition system into a simple virtual reality application. The application consists of some objects in a 3-dimensional space and a robot hand (see Fig. 2). We assigned simple commands, such as "move robot hand forward", "rotate robot hand about x-axis", or "grab object", to some of the gestures. With a small learning effort, it is possible to navigate effectively in the virtual world and to manipulate the objects therein.

Fig. 2: The test application. The gesture-controlled robot hand is about to grab an object in virtual space.

We are currently working on the integration of our system as a means of interaction in a number of virtual reality applications developed at the University of Zurich, such as a virtual endoscopy application and a geographical information system. In the future, we plan to continue our work in the following directions:

- Exploiting the adaptive possibilities of RBF-based systems: this would enable changes to the system at run time, making it possible to retrain the system on the gestures of a new user whose gestures are not recognised.
- Recognition of dynamic gestures: gestures such as waving or wagging a finger can make a sign language much more intuitive. To recognise dynamic gestures correctly, the data glove must be equipped with a tracking device, such as the Ascension Flock of Birds [7] or the Polhemus Fastrak [8], which provides the system with position and orientation information.
- Use of both hands: in VR applications where a particular gesture of the right hand, such as "extended index finger", is assigned the command "move forward", gestures of the left hand could be used as modifiers, for example to regulate the speed.
- Recognition of gesture sequences: here the problem lies in detecting and eliminating unwanted intermediate gestures. If, for instance, the gesture "thumbs up" is followed by the gesture "extended index finger", the gesture "gun" (extended index finger plus thumb) might unintentionally be formed during the transition.

The application of a gesture recognition system as described in this paper need not be restricted to VR programs; once the points mentioned above have been solved, it would, for example, also open up the possibility of building a system for the translation of ASL (American Sign Language) into spoken English.
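The command assignment described above reduces to a lookup from a recognised gesture to a handler; a toy sketch in which the gesture-to-command bindings and the robot-hand API are invented for illustration:

```python
# Hypothetical gesture-to-command bindings; the robot-hand API is invented.
COMMANDS = {
    "index":   lambda hand: hand.move_forward(),
    "victory": lambda hand: hand.rotate_about_x(),
    "fist":    lambda hand: hand.grab(),
}

def dispatch(gesture, hand):
    """Execute the command bound to a recognised gesture; ignore 'unknown'."""
    action = COMMANDS.get(gesture)
    if action is not None:
        action(hand)
```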

Conclusion
This paper demonstrates that the chosen combination of data glove and neural networks achieves high recognition rates on a set of predefined gestures. It can therefore be considered a first step towards VR applications, or other types of applications, in which program execution is controlled by means of a sign language.

Acknowledgements
This work is supported in part by the Swiss National Science Foundation, grant #21-50684.97.

References
[1] G. d'Ydewalle et al., "Graphical versus Character-Based Word Processors: An Analysis of User Performance", Behaviour and Information Technology, vol. 14, no. 4, 1995, pp. 208-214.
[2] R. Kjeldsen, J. Kender, "Toward the Use of Gesture in Traditional User Interfaces", Proceedings of the Second International Conference on Automatic Face and Gesture Recognition, 1996, pp. 151-156.
[3] Virtual Technologies Inc., Palo Alto, CA 94306. Production and distribution of data gloves and related devices. www.virtex.com
[4] Web site of the Stuttgart Neural Network Simulator: www-ra.informatik.uni-tuebingen.de/SNNS/
[5] J. Hertz, A. Krogh, R. Palmer, Introduction to the Theory of Neural Computation, Santa Fe Institute Studies in the Sciences of Complexity, Lecture Notes vol. 1, Addison-Wesley.
[6] R. Rojas, Neural Networks: A Systematic Introduction, Springer-Verlag, Berlin, 1996.
[7] Ascension Technology Corporation, www.ascension-tech.com
[8] Polhemus Incorporated, www.polhemus.com
