1.1 Motivation
My motivations for choosing such a project stem from an interest in image and video processing techniques gained while working on various projects during a work placement at Sony Broadcast and Professional Research Labs. While looking for an area to apply these skills to, a quote by Douglas Engelbart, the inventor of the mouse, struck a chord with me:

"Then one day, it just dawned on me - BOOM - that complexity was the fundamental thing. Solving any significant problem would also be a complex thing. And it just went click. If in some way, you could contribute significantly to the way humans could handle complexity and urgency, that would be universally helpful." (1)

He was referring to his goal of helping to make the world a better place, and he decided that this could be achieved by utilising computers to present data clearly and allow it to be processed quickly. It was this idea that led him to develop the mouse and many other innovations in computer science. There are many obvious applications for improved desktop interfaces: systems which allow three dimensional data to be navigated and displayed efficiently, games, entertainment applications and many more.
3. Objectives
Measurable objectives for the project consist of the following:
- Reliable segmentation of skin colour regions from a real-time video feed. Without reliable segmentation the subsequent stages of the system will be far more difficult, or impossible. The reliability can be improved if needed through careful control of the lighting conditions and backgrounds against which the segmentation is to be performed.
- Recognition and tracking of a hand shape & other key features (fingertips). The primary interest is hand shapes and for the purposes of the project it can be assumed that the only skin colour regions likely to be found in front of the camera are of hands.
- Mapping a point on the plane of the desk surface to a point on the display. Implementation of this is a key part of the interface and is made possible by the assumption that the hand will remain on the surface of the desk and hence movement will be limited to a two dimensional plane which can be mapped to any other two dimensional plane.
- Projection of extra information onto the desk surface. This is the key stage in augmenting the desk surface. A suitable system for sending content to an external monitor/projector needs to be created.
- Interaction with the projections using hand movement/gestures. This goal sees the camera oriented goals and the projector oriented goals being combined into an overall system which represents the augmented desktop.
- Develop an application which demonstrates the use of this interface. Once the interface is developed, an application which takes advantage of its primary features should be created to showcase the system.
4. Constraints
The project is to be undertaken as a solo effort and will be carried out in parallel with the other requirements of a final year Digital Media Engineering student. As such, only two days, or 24 hours in total, per week are to be allocated to work relating to the project.
5. Planned Deliverables
The project deliverables will include a library of code with a clearly defined object oriented interface for using a webcam for the detection and tracking of multiple hands and for the projection of graphics onto a desk surface. The final application will highlight the features of the augmented desktop interface and provide a means of configuring the setup so it can be adapted for different locations, desk sizes, hand sizes and hand colours.

The total cost of the project is estimated to be significantly under the £70 budget, as the student already owns a webcam suitable for the project and the Electronic Engineering Department can supply a projector. All libraries used for the programming, as well as the developer software, are free for academic use.

The final system is intended to extend the functionality of a current desktop computer station. Using a readily available USB webcam and a small projector, the application developer can project images, video, documents, additional menu interfaces and more onto the physical desk top of the user. The user can then interact with these projections directly through hand movements/gestures. This work is expected to be carried out to the schedule described in the Gantt chart on the following page.
6. Work Plan
7.1 Capture
Capturing video input for the system requires very little algorithmic complexity. There are several libraries for C/C++ that provide interfaces for capturing video from an external camera; however, an obvious choice for both the capture of video and the subsequent processing is OpenCV. OpenCV is an open source computer vision library written in C/C++ and designed for computational efficiency and real time applications. It provides a framework for computer vision applications that extends far beyond simple video capture and can capture and calibrate multiple cameras or take input from video files. (2) Other computer vision libraries include OpenVIDIA, which implements computer vision algorithms on graphics hardware for increased speed and efficiency (3), and VXL, a collection of open source libraries written in C++. (4)

For the task of hand gesture recognition there are a number of camera configuration options. Common computer vision techniques use a single camera, stereo cameras or depth aware cameras. Other methods include controller feedback and gloved feedback systems. A single camera system for hand gesture recognition is proposed by Du and Li, in which they restrict movement of the hand to a two dimensional plane to remove the need for additional cameras. (5) This system bears many similarities to the constraints placed on a desktop projection system and reinforces the idea that a single camera will be sufficient for this purpose. The system proposed by Du and Li follows the pattern of segmentation, feature detection and gesture recognition.

Stereo cameras are used in the system proposed by Shimizu et al. in (6) to fit an arm model to the stereo images from the cameras. The system works well for tracking the arm as the arm only has 4 degrees of freedom (DOF) (ignoring the wrist); however, a hand has 27 DOF (7) and suffers far more from partial self-occlusion. Utsumi et al. (8) describe a method to detect hand position, posture and finger bending using four or more camera views with a high degree of accuracy; however, four or more cameras is not a practical solution for a desk based system as it would be costly and take up a considerable amount of space.
Fig. 1 Skin colour segmentation for hand and face detection (11)
A variety of methods exist for segmentation not specifically related to colour, as described in (12). These include matching contours with snakes, clustering, and graph and energy based methods. However, these are often too elaborate and computationally expensive for real time tracking purposes. (13)

In (13), Bradski presents the CAMSHIFT method for face tracking as the input to a perceptual user interface. CAMSHIFT is based on the mean shift algorithm, which uses probability distributions to determine whether a pixel is likely to match the model for the hue of the colour being segmented. The result of this histogram matching gives an array of probabilities that can then be used to determine which areas are of the correct colour. The system of histogram matching proposed by Bradski uses a one dimensional histogram concerned only with the hue component of each pixel. In theory this means the luminance does not need to be considered. However, with very high or very low luminance the hue is often not reliable, so pixels with high and low luminance values are ignored.

To enhance the segmentation stage, steps can be taken to remove noise. Morphological operations are common within image processing and the two most common of these are erosion and dilation. If we consider a binary image, where white pixels are the object and black pixels are the background, applying erosion would see any white pixel that borders a black pixel switching to a black pixel. This has the effect of making the object smaller. A dilation operation has the opposite effect, thereby making an object bigger. (14) In the above example of erosion and dilation a square 3x3 structuring element with an anchor point at (1, 1) (the middle pixel) is assumed. The anchor point marks the position of the pixel being processed relative to the structuring element.
Fig. 3 A selection of gestures available on the Apple Magic Trackpad (20)
The Magic Trackpad is probably the most advanced consumer multi touch device on the market. Along with the standard pointing and clicking behaviour of a normal trackpad, it recognises a number of multi-finger gestures for scrolling, skipping pages, rotating and zooming. It is this technology which is paving the way for computer vision based gesture recognition techniques in consumer desktop computing.

7.4.3 Voice control
Mobile technologies have also prompted a growth in voice recognition and control software. Many mobile phones allow basic functions to be executed with just the user's voice. This has proven particularly useful while driving, for making and receiving calls without taking hands off the wheel. The accuracy of speech-to-text software has also improved dramatically; however, it still isn't commonly used as a replacement for the keyboard.

7.4.4 Wii-mote
The Wii-mote is a handheld remote for the Nintendo Wii console. It uses a combination of an accelerometer and infra-red sensor technology to allow users to hold the remote as a pointer or perform gestures which translate into actions in games. A number of third party developers have used the Wii-mote in their own projects and produced technologies such as multi-touch whiteboards and finger tracking. (21)

7.4.5 Kinect
The Kinect from Microsoft is a peripheral for the Xbox 360 games console and uses an infrared, structured light 3D scanner to build a depth map of the scene in front of it. It can then segment the human body and track points to recognise gestures. It also contains a built-in microphone for recognising voice commands.
Fig. 4 Hand and finger tip tracking using the Kinect (22)
The Kinect is the first consumer product that looks likely to prompt the growth of natural gesture-based interfaces. Like the Wii, it has also prompted a number of third party development projects, including hand and finger tracking applications for a Minority Report style video wall. (22)
8.2 Set Up
The work area is a typical desktop computer set up, with a single web camera mounted on the top of the monitor display. The camera is angled down at the desktop to create a gesture area of approximately 0.2 m².
For initial experimental purposes a video clip of 850 frames of video was recorded (Fig. 6). The contents of the video consisted of a hand moving through a variety of gestures in the centre of the frame, set against a plain black background. This video was used to provide a consistent test in each experiment with segmentation.
All development work was carried out using C++ in Visual Studio 2010 on a PC with dual screens, an Intel dual core 1.67GHz processor, 2GB RAM and Windows 7 Professional OS. The original webcam was a cheap 3MP camera with autofocus lens, however this proved problematic as the autofocus functionality could not be turned off. It was replaced with a Logitech C160 which has the benefits of manual focus, brightness and exposure control.
8.3 OpenCV
OpenCV is an open source computer vision library designed for real time computer vision applications. It is written in C/C++ and has wrappers for Python, C# and other languages. For this project, OpenCV v2.1 was used (http://sourceforge.net/projects/opencvlibrary/files/opencvwin/2.1/); however, at the time of writing OpenCV is on v2.2. Code in this document relates to OpenCV 2.1 only.
8.4 Capture
Capturing images for processing is a trivial task in OpenCV. The two code snippets below demonstrate capturing from a webcam and a file.
// capture from a webcam
CvCapture* capture = cvCreateCameraCapture( 1 );
assert( capture );
IplImage* frame = cvQueryFrame( capture );

// capture from file
CvCapture* capture = cvCaptureFromAVI( "testClip1.avi" );
assert( capture );
IplImage* frame = cvQueryFrame( capture );
It is possible to read frame rate/size information from a file using the cvGetCaptureProperty() function and with this information playback of video is achieved by simply looping through with a delay and querying the capture object each time.
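As a minimal sketch of such a playback loop (the window name, the Esc-to-exit handling and the 33 ms fallback delay are illustrative choices rather than the project's actual code):

// read the frame rate from the file and play it back at roughly that speed
double fps = cvGetCaptureProperty( capture, CV_CAP_PROP_FPS );
int delay = (fps > 0) ? cvRound( 1000.0 / fps ) : 33;   // ms between frames

cvNamedWindow( "playback" );
IplImage* frame;
while ( (frame = cvQueryFrame( capture )) != NULL )
{
    cvShowImage( "playback", frame );
    if ( cvWaitKey( delay ) == 27 )   // Esc key stops playback
        break;
}
cvReleaseCapture( &capture );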
8.5.1 Gray Scale & Threshold
The first method considered is a very simple technique that categorises a pixel based on its intensity. The colour image is first converted to gray scale and then a threshold is performed to select only pixels above a certain value. This can be achieved for a single frame with the following lines of code.
// convert to grayscale and threshold
// image is a 3-channel BGR image
// gray is a 1-channel gray scale image
cvCvtColor(image, gray, CV_BGR2GRAY);
cvThreshold(gray, gray, threshold, 255, CV_THRESH_BINARY);
To convert a 3-channel colour image to a 1-channel grayscale image, OpenCV uses the formula in Equation 1, where Y is the luminance/grayscale value for a pixel and R, G and B are the red, green and blue values of the colour pixel.
Equation 1 RGB to grayscale colour conversion: Y = 0.299 R + 0.587 G + 0.114 B
Fig. 8 Selection of frames from testClip1.avi with gray scale and threshold = 60 method applied.
Threshold = 90
With the threshold increased to 90, a much clearer segmentation of the hand is achieved. However some noisy areas do still exist.
Fig. 9 Selection of frames from testClip1.avi with gray scale and threshold = 90 method applied.
Fig. 10 Selection of frames from testClip1.avi with gray scale and threshold = 120 method applied.
Conclusion
The gray scale & threshold method is an effective means of segmenting light coloured skin from a dark background. The method is very restrictive in the composition of the background (it must be uniformly dark for best results). In varying light conditions the threshold value will need to be updated to account for adjustments in the intensity of the hand pixels. However this would be of little concern if the lighting and background could be fixed and the threshold simply updated for each new user.
It proved challenging to find a suitable range despite spending a large amount of time adjusting the values manually. The following selection of frames from testClip1.avi was the best attempt at segmentation that could be achieved via this method.
Selection of frames from testClip1.avi with HSV threshold applied. Min = 90, 0, 200. Max = 140, 80, 255
8.5.2.1 Conclusion
The problem with this system is obvious from the frames shown above: a full segmentation of the hand could not be achieved. The method of thresholding the hue, saturation and value components of each pixel separately was a naive implementation of HSV segmentation. It did not take into account the cyclical nature of the hue value in the HSV colour space. As the hue represents a rotation, both 1 and 359 (or 179 in OpenCV's implementation, as the hue is halved to bring it within the range of 8 bit numbers) are very similar colours. As this was not taken into account when range testing, there is likely to be some error involved. The lighting in the testClip1.avi video was also too strong, causing the hand to be registered as nearly white rather than a skin coloured hue.

8.5.3 Hue Histogram Model
To deal with more complex backgrounds a more sophisticated segmentation method was required. By selecting an area of the image consisting entirely of hand pixels, a histogram model can be constructed. The histogram counts the number of pixels with hue values that fall into predetermined bins. From this histogram model of the skin pixels, a histogram back projection can be applied to each input frame; the resulting grayscale image highlights the query colour as high-intensity (bright) regions.
Fig. 11 shows a region selected by a user which consists entirely of pixels representing the hand. It includes a smooth expanse of skin as well as shading around the knuckles and provides a good model of a hand in the image. The hue histogram for this image would contain high energy components in the red portion of the histogram, and this should hold true for skin colour of all ethnicities. (24)
Fig. 12 Back projection of skin model on frame from camera (left) and thresholded (right)
Fig. 12 shows the results of back projecting the histogram model of the hand onto a new frame from the camera to produce a grayscale image where pixel intensity represents the probability of a pixel belonging to the hand colour set. This image can be thresholded to obtain the right hand image in Fig. 12 and as can be seen, this generates a good binary mask of the hand region with some additional noise.
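A minimal sketch of this back projection stage in the OpenCV 2.1 C API is shown below; the bin count, the backProjThreshold value and the variable names are illustrative assumptions rather than the project's actual code.

// build a hue histogram from the user-selected skin reference image
IplImage* refHsv = cvCreateImage(cvGetSize(skinImage), 8, 3);
IplImage* refHue = cvCreateImage(cvGetSize(skinImage), 8, 1);
cvCvtColor(skinImage, refHsv, CV_BGR2HSV);
cvSplit(refHsv, refHue, NULL, NULL, NULL);        // keep only the hue plane

int histSize = 30;                                 // number of hue bins (assumed)
float hueRange[] = { 0, 180 };                     // OpenCV stores hue as 0-179
float* ranges[] = { hueRange };
CvHistogram* hist = cvCreateHist(1, &histSize, CV_HIST_ARRAY, ranges, 1);
cvCalcHist(&refHue, hist, 0, NULL);
cvNormalizeHist(hist, 255);

// back project the model onto the current frame and threshold the result
IplImage* frameHsv = cvCreateImage(cvGetSize(frame), 8, 3);
IplImage* frameHue = cvCreateImage(cvGetSize(frame), 8, 1);
IplImage* backProj = cvCreateImage(cvGetSize(frame), 8, 1);
cvCvtColor(frame, frameHsv, CV_BGR2HSV);
cvSplit(frameHsv, frameHue, NULL, NULL, NULL);
cvCalcBackProject(&frameHue, backProj, hist);
cvThreshold(backProj, backProj, backProjThreshold, 255, CV_THRESH_BINARY);

The thresholded back projection then serves as the binary mask used in the noise removal and contour stages that follow.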
Fig. 13 - Fig. 16 show the results of segmentation using this method in the presence of changing background colours. Both the threshold and the histogram model of hand pixels remained unchanged for each test, and this gave rise to some poor results in the presence of different background colours. It is possible to obtain good segmentation of hands given a variety of backgrounds by altering the reference image and threshold parameters in the segmentation algorithm. A possible solution to take account of the background colour would be to obtain settings for different colours and then select the correct settings when the background changes, or to obtain an approximation of the settings through analysis of the background.

8.5.3.1 Additional Processing
The initial results show that the segmented image can still contain a number of false positives regardless of the threshold level used. On closer inspection these false positives correspond with very light (white) and very dark (black) areas in the image. Because of the cylindrical nature of the HSV colour space, whites and blacks can often cause problems and it would be advantageous to ignore them in the segmentation of the image. This is easily achieved by thresholding the grayscale frame before any processing, using a very low threshold to remove the blacks and then using a high threshold and inverting to remove the whites. These two masks can then be combined into a single mask which can be applied to the back projection using a logical AND operation, effectively removing the lights and darks from the final segmentation.
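A sketch of this luminance mask, assuming the low and high threshold values (lowThresh, highThresh here) are tuned by hand, might be:

// mask out very dark and very light pixels before the back projection is used
IplImage* gray     = cvCreateImage(cvGetSize(frame), 8, 1);
IplImage* notBlack = cvCreateImage(cvGetSize(frame), 8, 1);
IplImage* notWhite = cvCreateImage(cvGetSize(frame), 8, 1);
IplImage* lumMask  = cvCreateImage(cvGetSize(frame), 8, 1);

cvCvtColor(frame, gray, CV_BGR2GRAY);
cvThreshold(gray, notBlack, lowThresh, 255, CV_THRESH_BINARY);       // remove the blacks
cvThreshold(gray, notWhite, highThresh, 255, CV_THRESH_BINARY_INV);  // remove the whites
cvAnd(notBlack, notWhite, lumMask);                                  // combine the two masks

// AND the mask with the back projection so lights and darks are ignored
cvAnd(backProj, lumMask, backProj);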
Fig. 17 shows a mask using very low threshold values for white and black. Only white pixels in the mask will be considered as skin pixel candidates. It can be seen that no skin pixels are discarded, only the shadow created by the hand and fingers and some background elements are masked out.
Fig. 18 Final segmented images using Hue Histogram technique and noise removal
// Morphological operations in OpenCV
// erosion by a 6x6 rectangular structuring element
cvErode(gray, gray, cvCreateStructuringElementEx(6, 6, 3, 3, CV_SHAPE_RECT));
// closing by a 12x12 rectangular structuring element
cvMorphologyEx(gray, gray, 0, cvCreateStructuringElementEx(12, 12, 6, 6, CV_SHAPE_RECT), CV_MOP_CLOSE, 1);
// dilation by a 12x12 rectangular structuring element
cvDilate(gray, gray, cvCreateStructuringElementEx(12, 12, 6, 6, CV_SHAPE_RECT));
The following screenshots illustrate the effect of erosion. The image is frame 800 from the testClip1.avi video after gray scale and thresholding. Small groupings of pixels in the top left corner are removed and the hand shape is eroded slightly. (Fig. 19)
When a 12x12 structuring element is used, the effect of erosion is much more severe. (Fig. 20)
After removing patches of noise, it is often advantageous to dilate the remaining shape to fill in holes or ensure finger tips aren't separated from the hand. The screen shots in Fig. 21 show a frame after erosion and the same frame after dilation with a 12x12 rectangular structuring element. As can be seen, small holes in the hand shape have been removed, but at the expense of a loss of definition in the fingers. Careful selection of structuring elements and sizes must be practised to ensure a good outcome at the noise removal stage.
Fig. 21 Dilation by a 12x12 rectangular structuring element after erosion by a 4x4 rectangular element
8.5.4.1 Conclusion
Morphological operations are very effective at removing extraneous pixels from the binary image and also have some use for growing regions to obtain more consistent results. The degree of erosion or dilation required will need to be altered for different light levels and variations in skin/background colour/complexity. Other techniques could be explored to ensure accurate representation of finger tips, joining regions that have become separated (for instance, due to shadow, or rings on the finger) and filling large holes on the hand created by shadow/cuts/bruises.
Fig. 22 Contours of the binary image (green) with area. (Hand = 40766.5, NoiseTopLeftCorner = 182.5)
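The contours themselves can be obtained with OpenCV's cvFindContours. The sketch below is illustrative only; the variable names and the choice to keep the largest contour as the hand candidate are assumptions consistent with the area-based hand detection described later.

// extract external contours from the binary segmentation mask
// note: cvFindContours modifies the input image, so a copy may be needed
CvMemStorage* storage = cvCreateMemStorage(0);
CvSeq* contours = NULL;
cvFindContours(mask, storage, &contours, sizeof(CvContour),
               CV_RETR_EXTERNAL, CV_CHAIN_APPROX_SIMPLE, cvPoint(0, 0));

// keep the largest contour (by area) as the hand candidate
CvSeq* hand = NULL;
double maxArea = 0;
for (CvSeq* c = contours; c != NULL; c = c->h_next)
{
    double area = fabs(cvContourArea(c, CV_WHOLE_SEQ));
    if (area > maxArea) { maxArea = area; hand = c; }
}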
After acquiring the contours of the hand, the centre of gravity can be determined by taking moments of the shape and dividing by the area.
CvPoint calculateCog(CvSeq* contours)
{
    CvMoments moments;
    cvMoments(contours, &moments);
    double moment10 = cvGetSpatialMoment(&moments, 1, 0);
    double moment01 = cvGetSpatialMoment(&moments, 0, 1);
    double area = cvGetCentralMoment(&moments, 0, 0);
    int cogX = int(moment10 / area);
    int cogY = int(moment01 / area);
    return cvPoint(cogX, cogY);
}
Fig. 23 Selection of frames from testClip1.avi with centre of gravity and bounding box
8.6.2 Convex Hull From the contours a convex hull can be generated. A convex hull for a two dimensional object can be visualised as if a rubber band had been stretched over the shape.
Again, OpenCV has a built in function for determining the convex hull from the contours.
// cvConvexHull2 returns a sequence of points that describe the
// convex hull of the shape
CvSeq* hull = cvConvexHull2(contours, 0, CV_CLOCKWISE, 0);
8.6.3 Convexity Defects With both the contours and convex hull now determined, the convexity defects can be calculated.
// get defects between hull and contours
CvSeq* defects = cvConvexityDefects(contours, hull);
Fig. 25 Segmented hand shape with convexity defects > 40 pixels marked
Problems with this method include some gestures not being registered due to the depths of defects not being large enough. However, it is also possible for too many defects to be identified if the depth threshold is too small.
Fig. 26 No convexity defects are above the 40 pixel threshold so detecting the finger tip is impossible
To overcome these problems, a lower threshold must be used and an algorithm applied to the convexity defect points to determine if they are likely to be a finger tip.
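A sketch of such a filter is given below; the depth threshold (minDepth), the exclusion radius around the centre of gravity (minRadius) and the markFingerTip handler are illustrative assumptions rather than the project's exact logic.

// keep only defect points that are deep enough and lie well outside
// a circle centred on the hand's centre of gravity
CvPoint cog = calculateCog(contours);

for (int i = 0; i < defects->total; i++)
{
    CvConvexityDefect* d = (CvConvexityDefect*)cvGetSeqElem(defects, i);
    if (d->depth < minDepth)
        continue;                             // ignore shallow defects

    CvPoint tip = *(d->start);                // candidate finger tip point
    double dx = tip.x - cog.x;
    double dy = tip.y - cog.y;
    if (sqrt(dx * dx + dy * dy) > minRadius)
        markFingerTip(tip);                   // hypothetical handler for a detected tip
}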
Fig. 27 Finger tips are identified by numbers. Only points outside of the circle are considered.
The results of this logic are consistent and reliable. However, problems can still be experienced due to poor segmentation.

8.6.5 Conclusion
For the purposes of the augmented desktop project, the area of the segmented region is a good enough measure for detecting the presence of a hand. On a desk top there are unlikely to be skin coloured shapes of a similar area to a hand. Therefore, further methods of analysis are unnecessary for hand detection.
Fig. 28 Thumb is not recognised as it is separate from the main hand shape.
The convexity defect method for identifying finger tips is a reliable method given a few constraints. Sleeves, or a wrist band, must be worn by the user to prevent a defect being detected between the forearm and thumb/little finger. The hand segmentation must be a full and accurate representation of the hand. If regions become separated then they are not considered part of the hand and can lead to fingers not being detected. (Fig. 28)
To calculate the homography, four points must be selected from the camera view and four points defined for the display area (Fig. 29). The points Q1-4 are simply the corners of the rectangular image to be displayed in the display area.
The points P1-4 cannot be obtained so easily and vary with the camera orientation. The obvious solution is to have the points selected by the user prior to using the system and adding the constraint that the camera must remain in the same orientation throughout use. This requires a simple program which allows the user to click on a webcam image to mark the four points. Given the four pairs of corresponding points a matrix equation can be formed to obtain the eight unknown values in the 3x3 homography matrix.
Equation 2 shows how to derive the homography matrix from the four pairs of corresponding points. OpenCV also provides a function for doing this.

8.7.1 Homography Calibration App
To simplify the process of calculating the homography for the system each time it is taken down and re-assembled (leading to changes in orientation of the camera), an application was developed which allows the user to select the four corners of the display on a web cam feed and select the resolution and offset of the display area. The user can then simulate a finger placement and check that it corresponds to the correct point on the display.
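A sketch of how the calibrated points could be turned into the homography and applied to a finger tip using OpenCV is shown below; the corner ordering, displayWidth/displayHeight and the fingerTip variable are assumptions for illustration (cvFindHomography could equally be used).

// compute the camera-to-display homography from the four calibrated corners
// p1..p4 are the points clicked on the webcam image, in the same order as
// the display corners listed in dst
CvPoint2D32f src[4] = { p1, p2, p3, p4 };
CvPoint2D32f dst[4] = { cvPoint2D32f(0, 0),
                        cvPoint2D32f(displayWidth, 0),
                        cvPoint2D32f(displayWidth, displayHeight),
                        cvPoint2D32f(0, displayHeight) };

CvMat* H = cvCreateMat(3, 3, CV_32FC1);
cvGetPerspectiveTransform(src, dst, H);

// map a detected finger tip from camera coordinates to display coordinates
CvMat* fingerCam  = cvCreateMat(1, 1, CV_32FC2);
CvMat* fingerDisp = cvCreateMat(1, 1, CV_32FC2);
cvSet1D(fingerCam, 0, cvScalar(fingerTip.x, fingerTip.y));
cvPerspectiveTransform(fingerCam, fingerDisp, H);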
Fig. 31 Screenshot showing a simulated finger (large green circle) and its corresponding transformed point on the display
Fig. 32 shows the array of sliders for configuring the segmentation. The top three sliders control the back projection threshold and the black and white threshold. The next two sliders control the size of the opening and closing structuring element. The remaining sliders set the number of iterations of the morphological operations, the value used to normalise the histogram, a smoothing/blurring operation and RGB colour control of the display background. Creating a new segmentor object and loading the saved settings can then be achieved in just a few lines of code:
// create segmentor object
Segmentor segmentor;

// load settings and skin image
ifstream segmentorSettings("configFile.txt");
IplImage* skinImage = cvLoadImage("reference.jpg");

// send settings and image to segmentor
if (segmentorSettings.good())
    segmentorSettings >> &segmentor;
segmentor.setReferenceImage(skinImage);

// segment frame and set mask to the binary skin representation
segmentor.segment(frame, mask);
This makes the development of applications that use these settings very quick and has helped in the rapid prototyping of the applications described later in this report.
8.9 Projection/Display
With a system for segmenting skin colour and detecting finger tips in place, a means of displaying interactive content on the desk surface needed to be determined. In the early stages of the project, it was thought a projector would be suitable. Using a projector gives the potential for using the entire desk surface for interaction and display and would prevent the desk becoming cluttered with additional display technology/wires. However, after some initial testing of a projector, and facing problems with suspending it above a desk, the need for a dark room to display an image and the downward projection interfering with the segmentation, it was decided instead to use an LCD monitor laid flat on the desk as the display surface (Fig. 33).
Fig. 33 Equipment set up. LCD display on the desk and a webcam pointing at it from above
Fig. 34 Result of superimposing a red "true" segmentation and the blue actual segmentation
From analysis of the image in Fig. 34 the following results can be obtained:
Total Pixels: 307200 (640x480)
True Skin Pixels: 41442
Matched Pixels: 36477 (88.02%)
False Negatives: 4965 (11.98%)
False Positives: 1957 (5.09%)
False negatives are pixels that should have been skin but were not registered as such (the red pixels). False positives are pixels which were registered as skin when they should not have been (the blue pixels); this percentage is given relative to all pixels registered as skin. Using these percentages it is possible to quantitatively compare the effectiveness of different images and segmentation techniques.
Fig. 35 Example of good and bad results for finger tip detection tests
9.1.3 Grid Based Accuracy Test
To determine the accuracy with which a finger can be used to interact on the display area, a test was devised in which a numbered grid is displayed across the display area. The user is asked to place their finger over a grid square for three seconds, after which time the square will turn green, indicating a successful selection. By mapping the results onto a separate grid it is possible to build a map of areas which cannot be accurately selected. Using this information and by altering the grid size a good resolution can be selected and this forms the basis for selecting the size of interactive elements.
[Fig. 36: numbered 4x4 and 8x8 grids used for the grid based accuracy test]
Fig. 36 shows example results from 4x4 and 8x8 grids. At higher resolutions it can often be difficult to select areas close to the edge of the display area.
Each user was asked to use the photo viewer and checkerboard applications for five minutes. Each application was described beforehand and the relevant gestures demonstrated by an expert user.

9.1.5 Latency
For an interactive system, latency is an important test. During development it was noticeable that the system lagged when tracking finger tip movement. If a coloured blob was placed under a finger tip and the finger tip subsequently moved, the blob would seem to follow the finger tip rather than remain underneath it. To quantify this, timers were placed in the code to establish the time taken between capturing a new frame and updating the output. These timings were further broken down into segmentation and detection to establish bottlenecks in the code.
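A minimal sketch of this kind of instrumentation, using OpenCV's tick counter (the stage boundaries and the detectFingerTips call are illustrative assumptions):

// time the segmentation and detection stages for one frame
double t0 = (double)cvGetTickCount();
segmentor.segment(frame, mask);               // segmentation stage
double t1 = (double)cvGetTickCount();
detectFingerTips(mask);                       // hypothetical detection stage
double t2 = (double)cvGetTickCount();

// cvGetTickFrequency() returns ticks per microsecond
double ticksPerSec = cvGetTickFrequency() * 1.0e6;
printf("segmentation: %.3f s  detection: %.3f s\n",
       (t1 - t0) / ticksPerSec, (t2 - t1) / ticksPerSec);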
Test image        1       2       3       4       5       6       7       8
False negative    11.98%  9.00%   9.89%   9.97%   12.30%  11.51%  10.76%  15.31%
False positive    5.09%   2.54%   3.31%   4.68%   4.53%   2.27%   2.59%   3.39%
Averages: match 87.4%, false negative 12.6%, false positive 3.6%

Fig. 39 Percentage matches, false negatives and false positives in segmentation tests
Image   Fingers   Found   Difference   Comments
1       5         6       1            Bare arm
2       1         1       0            Good
3       1         1       0            Good
4       1         2       1            Knuckle
5       1         1       0            Good
6       1         1       0            Good
7       2         4       2            Bare arm & knuckle
8       2         6       4            Bare arm & knuckle
9       0         0       0            Good
10      1         3       2            Bare arm & knuckle
11      2         4       2            Bare arm & knuckle
12      1         1       0            Good
13      2         2       0            Good
14      3         2       -1           Tips too close together
15      4         4       0            Good
16      5         6       1            Bare arm & tips too close together
Image   Fingers   Found   Difference   Comments
1       5         5       0            Good
2       1         1       0            Good
3       1         1       0            Good
4       1         1       0            Good
5       1         1       0            Good
6       1         1       0            Good
7       2         2       0            Good
8       2         2       0            Good
9       0         0       0            Good
10      1         1       0            Good
11      2         2       0            Good
12      1         1       0            Good
13      2         2       0            Good
14      3         2       -1           Tips too close
15      4         3       -1           Tips too close
16      5         4       -1           Tips too close
[Figure: numbered 3x3, 4x4 and 5x5 grids used for the grid based accuracy test]
As shown in Fig. 44 the bottom edge of the screen was particularly difficult to select reliably. This was due to the hand disappearing from the camera view when the user went to select a lower grid square. To solve this problem, the camera height was increased, giving a wider view of the desk surface and providing a significant amount of space for the hand to be detected within. Fig. 45 shows the difference in camera views between the original 50cm height and the new 75cm height.
Fig. 45 Camera view at 50cm (left) and camera view at 75cm (right)
After increasing the camera height, the grid based accuracy test results improved considerably. Grid sizes up to 32x32 were tested and no test failures were encountered. In a 32x32 grid at a 1280x1024 screen resolution each grid square is 40x32 pixels in size, allowing a confident prediction that any interactive element this size or larger will be reliably selected by a user of this system. It should also be noted that the accuracy largely depends on the accuracy of the segmentation. In cases of poor segmentation the finger tip indicator can be seen to jump around randomly. This can be resolved to some degree by carefully configuring the segmentation and applying a smoothing/blurring filter to the mask.

9.2.4 Questionnaire
Of the 8 users who tested both applications, 7 gave overall positive responses, with each question receiving an average score of 3.25 or higher. Both the 'selecting interactive elements' question and the 'I enjoyed using this interface' question received the highest average scores (3.875). The questions regarding recognition of finger tips and gestures also received average ratings above 3.25. The questionnaires also brought to light some interesting issues regarding hand size.
The smoothing operation increases the average time for segmentation by 0.01s and the latency that is observed becomes significantly more pronounced when smoothing is utilised. The opening and closing operations have relatively little effect and no noticeable increase in latency is perceived by the user from these operations. Another factor influencing the latency of the system is the capture frame rate. At the frame rate of 24fps, there is 0.042s between each captured frame.
Fig. 46 Screen shot of the checkers game. The app runs full screen on the display monitor.
Fig. 47 Screen shot from the photo viewer application in debug mode.
Fig. 47 shows the photo viewer application in debug mode. The green circles show the current location of the finger tips while the blue line and circle show the start point and movement vector for the gesture which is currently in progress. After detecting a gesture by applying some basic rules such as location of the start and end points and the cessation of motion, the photos animate in a carousel motion.
These aims were largely achieved; in particular, the first aim has been clearly demonstrated through the applications that have been developed for the system. Both the photo viewer application and the checker board game show the clear advantages of new gesture based interfaces for human computer interaction.

The second aim, augmentation of the desktop surface with projection, changed as the project progressed. Projection onto the desk surface proved to be a more difficult problem to solve than the available time allowed. This led to the decision to use a computer monitor as the display surface, avoiding many of the problems that projection introduced (interfering with segmentation, requiring low light levels in the room). The computer monitor still enables graphics to be displayed and interacted with, but it does take up significant desk space. An improvement to this set up would be to integrate the screen into the desk surface.

The objectives for the project break down the aims into manageable goals:
- Reliable segmentation of skin colour regions from a real-time video feed.
- Recognition and tracking of a hand shape & other key features (fingertips).
- Mapping a point on the plane of the desk surface to a point on the display.
- Projection of extra information onto the desk surface.
- Interaction with the projections using hand movement/gestures.
- Develop an application which demonstrates the use of this interface.

The segmentation of skin colour from a real time video feed took up the greatest amount of development time. While reasonable segmentation under very controlled conditions was easily attained, it became clear that a poor segmentation would have severe repercussions for the accuracy of the tracking stage. For the tracking to function, the hand must be segmented as a single area and must not include extraneous noise which changes the contours of the shape. The results of the segmentation tests show the final system to be almost 90% accurate when compared to a human manually segmenting an image and, more importantly, the segmentation accurately retained fingers and other key contour shapes.

For recognition and tracking of hands and features, development went relatively smoothly. After reviewing a number of techniques for finger tip detection, the convexity defects method was selected.