1.1 Motivation
My motivations for choosing such a project stem from an interest in image and video processing techniques gained while working on various projects during a work placement at Sony Broadcast and Professional Research Labs. While looking for an area to apply these skills to, a quote by Douglas Engelbart, the inventor of the mouse, struck a chord with me:

"Then one day, it just dawned on me - BOOM - that complexity was the fundamental thing. Solving any significant problem would also be a complex thing. And it just went click. If in some way, you could contribute significantly to the way humans could handle complexity and urgency, that would be universally helpful." (1)

He was referring to his goal of helping to make the world a better place, and he decided that this could be achieved by utilising computers to present data clearly and allow it to be processed quickly. It was this idea that led him to develop the mouse and many other innovations in computer science. There are many obvious applications for improved desktop interfaces: systems which allow three dimensional data to be navigated and displayed efficiently, games, entertainment applications and many more.
3. Objectives
Measurable objectives for the project consist of the following:
- Reliable segmentation of skin colour regions from a real-time video feed. Without reliable segmentation the subsequent stages of the system will be far more difficult, or impossible. The reliability can be improved if needed through careful control of the lighting conditions and backgrounds against which the segmentation is to be performed.
- Recognition and tracking of a hand shape & other key features (fingertips). The primary interest is hand shapes and for the purposes of the project it can be assumed that the only skin colour regions likely to be found in front of the camera are of hands.
- Mapping a point on the plane of the desk surface to a point on the display. Implementation of this is a key part of the interface and is made possible by the assumption that the hand will remain on the surface of the desk and hence movement will be limited to a two dimensional plane which can be mapped to any other two dimensional plane.
- Projection of extra information onto the desk surface. This is the key stage in augmenting the desk surface. A suitable system for sending content to an external monitor/projector needs to be created.
- Interaction with the projections using hand movement/gestures. This goal sees the camera oriented goals and the projector oriented goals being combined into an overall system which represents the augmented desktop.
- Develop an application which demonstrates the use of this interface. Once the interface is developed, an application which takes advantage of its primary features should be created to showcase the system.
4. Constraints
The project is to be undertaken as a solo effort and will be carried out in parallel with the other requirements of a final year Digital Media Engineering student. As such, only two days, or 24 hours in total, per week are to be allocated to work relating to the project.
5. Planned Deliverables
The project deliverables will include a library of code with a clearly defined object oriented interface for using a webcam for the detection and tracking of multiple hands and for the projection of graphics onto a desk surface. The final application will highlight the features of the augmented desktop interface and provide a means of configuring the setup so it can be adapted for different locations, desk sizes, hand sizes and hand colours.

The total cost of the project is estimated to be significantly under the £70 budget, as the student already owns a webcam suitable for the project and the Electronic Engineering Department can supply a projector. All libraries used for the programming, as well as the developer software, are free for academic use.

The final system is intended to extend the functionality of a current desktop computer station. Using a readily available USB webcam and a small projector, the application developer can project images, video, documents, additional menu interfaces and more onto the physical desk top of the user. The user can then interact with these projections directly through hand movements/gestures. This work is expected to be carried out to the schedule described in the Gantt chart on the following page.
6. Work Plan
7.1 Capture
Capturing video input for the system requires very little algorithmic complexity. There are several libraries for C/C++ that provide interfaces for capturing video from an external camera; however, an obvious choice for both the capture of video and the subsequent processing is OpenCV. OpenCV is an open source computer vision library written in C/C++ and designed for computational efficiency and real time applications. It provides a framework for computer vision applications that extends far beyond simple video capture and can capture and calibrate multiple cameras or take input from video files. (2) Other computer vision libraries include OpenVIDIA, which implements computer vision algorithms on graphics hardware for increased speed and efficiency (3), and VXL, a collection of open source libraries written in C++. (4)

For the task of hand gesture recognition there are a number of camera configuration options. Common computer vision techniques use a single camera, stereo cameras or depth aware cameras. Other methods include controller feedback and gloved feedback systems. A single camera system for hand gesture recognition is proposed by Du and Li, in which they restrict movement of the hand to a two dimensional plane to remove the need for additional cameras. (5) This system bears many similarities to the constraints placed on a desktop projection system and reinforces the idea that a single camera will be sufficient for this purpose. The system proposed by Du and Li follows the pattern of segmentation, feature detection and gesture recognition.

Stereo cameras are used in the system proposed by Shimizu et al. in (6) to fit an arm model to the stereo images from the cameras. The system works well for tracking the arm as the arm only has 4 degrees of freedom (DOF) (ignoring the wrist); however, a hand has 27 DOF (7) and suffers far more from partial self-occlusion. Utsumi et al. (8) describe a method to detect hand position, posture and finger bending using four or more camera views with a high degree of accuracy; however, four or more cameras is not a practical solution for a desk based system as it would be costly and take up a considerable amount of space.
Fig. 1 Skin colour segmentation for hand and face detection (11)
A variety of methods exist for segmentation not specifically related to colour, as described in (12). These include matching contours with snakes, clustering, and graph and energy based methods. However, these are often too elaborate and computationally expensive for real time tracking purposes. (13)

In (13), Bradski presents the CAMSHIFT method for face tracking as the input to a perceptual user interface. CAMSHIFT is based on the mean shift algorithm, which uses probability distributions to determine whether a pixel is likely to match the model for the hue of the colour being segmented. The result of this histogram matching gives an array of probabilities that can then be used to determine which areas are of the correct colour. The system of histogram matching proposed by Bradski uses a one dimensional histogram concerned only with the hue component of each pixel. In theory this means the luminance does not need to be considered. However, with very high or very low luminance the hue is often not reliable, so pixels with high and low luminance values are ignored.

To enhance the segmentation stage, steps can be taken to remove noise. Morphological operations are common within image processing and the two most common of these are erosion and dilation. If we consider a binary image, where white pixels are the object and black pixels are the background, applying erosion would see any white pixel that borders a black pixel switching to a black pixel. This has the effect of making the object smaller. A dilation operation has the opposite effect, thereby making an object bigger. (14) In the above example of erosion and dilation a square 3x3 structuring element with an anchor point at (1, 1) (the middle pixel) is assumed. The anchor point marks the position of the pixel being processed relative to the structuring element.
Fig. 3 A selection of gestures available on the Apple Magic Trackpad (20)
The Magic Trackpad is probably the most advanced consumer multi touch device on the market. Along with the standard pointing and clicking behaviour of a normal trackpad, it recognises a number of multi-finger gestures for scrolling, skipping pages, rotating and zooming. It is this technology which is paving the way for computer vision based gesture recognition techniques in consumer desktop computing.

7.4.3 Voice control
Mobile technologies have also prompted a growth in voice recognition and control software. Many mobile phones allow basic functions to be executed with just the user's voice. This has proven particularly useful while driving, for making and receiving calls without taking hands off the wheel. The accuracy of speech-to-text software has also improved dramatically; however, it still isn't commonly used as a replacement for the keyboard.

7.4.4 Wii-mote
The Wii-mote is a handheld remote for the Nintendo Wii console. It uses a combination of an accelerometer and infra-red sensor technology to allow users to hold the remote as a pointer or perform gestures which translate into actions in games. A number of third party developers have used the Wii-mote in their own projects and produced technologies such as multi-touch whiteboards and finger tracking. (21)

7.4.5 Kinect
The Kinect from Microsoft is a peripheral for the Xbox 360 games console and uses an infrared, structured light 3D scanner to build a depth map of the scene in front of it. It can then segment the human body and track points to recognise gestures. It also contains a built-in microphone for recognising voice commands.
Fig. 4 Hand and finger tip tracking using the Kinect (22)
The Kinect is the first consumer product that looks likely to prompt the growth of natural gesture-based interfaces. Like the Wii, it has also prompted a number of third party development projects, including hand and finger tracking applications for a Minority Report style video wall. (22)
8.2 Set Up
The work area is a typical desktop computer set up, with a single web camera mounted on the top of the monitor display. The camera is angled down at the desktop to create a gesture area of approximately 0.2 m².
For initial experimental purposes a video clip of 850 frames of video was recorded (Fig. 6). The contents of the video consisted of a hand moving through a variety of gestures in the centre of the frame, set against a plain black background. This video was used to provide a consistent test in each experiment with segmentation.
All development work was carried out using C++ in Visual Studio 2010 on a PC with dual screens, an Intel dual core 1.67GHz processor, 2GB RAM and Windows 7 Professional OS. The original webcam was a cheap 3MP camera with autofocus lens, however this proved problematic as the autofocus functionality could not be turned off. It was replaced with a Logitech C160 which has the benefits of manual focus, brightness and exposure control.
8.3 OpenCV
OpenCV is an open source computer vision library designed for real time computer vision applications. It is written in C/C++ and has wrappers for Python, C# and other languages. For this project, OpenCV v2.1 was used (http://sourceforge.net/projects/opencvlibrary/files/opencvwin/2.1/); however, at the time of writing OpenCV is on v2.2. Code in this document relates to OpenCV 2.1 only.
8.4 Capture
Capturing images for processing is a trivial task in OpenCV. The two code snippets below demonstrate capturing from a webcam and a file.
// capture from a webcam
CvCapture* capture = cvCreateCameraCapture( 1 );
assert( capture );
IplImage* frame = cvQueryFrame( capture );

// capture from file
CvCapture* capture = cvCaptureFromAVI( "testClip1.avi" );
assert( capture );
IplImage* frame = cvQueryFrame( capture );
It is possible to read frame rate/size information from a file using the cvGetCaptureProperty() function and with this information playback of video is achieved by simply looping through with a delay and querying the capture object each time.
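As a minimal sketch of such a playback loop (the window name, the Esc-to-exit handling and the 33 ms fallback delay are illustrative choices rather than the project's actual code):

// read the frame rate from the file and play it back at roughly that speed
double fps = cvGetCaptureProperty( capture, CV_CAP_PROP_FPS );
int delay = (fps > 0) ? cvRound( 1000.0 / fps ) : 33;   // ms between frames

cvNamedWindow( "playback" );
IplImage* frame;
while ( (frame = cvQueryFrame( capture )) != NULL )
{
    cvShowImage( "playback", frame );
    if ( cvWaitKey( delay ) == 27 )   // Esc key stops playback
        break;
}
cvReleaseCapture( &capture );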
8.5.1 Gray Scale & Threshold
The first method considered is a very simple technique that categorises a pixel based on its intensity. The colour image is first converted to gray scale and then a threshold is performed to select only pixels above a certain value. This can be achieved for a single frame with the following lines of code.
// convert to grayscale and threshold
// image is a 3-channel BGR image
// gray is a 1-channel gray scale image
cvCvtColor(image, gray, CV_BGR2GRAY);
cvThreshold(gray, gray, threshold, 255, CV_THRESH_BINARY);
To convert a 3-channel colour image to a 1-channel grayscale image, OpenCV uses the formula in Equation 1, where Y is the luminance/grayscale value for a pixel and R, G and B are the red, green and blue values of the colour pixel.
Equation 1 RGB to grayscale colour conversion: Y = 0.299 R + 0.587 G + 0.114 B
Fig. 8 Selection of frames from testClip1.avi with gray scale and threshold = 60 method applied.
Threshold = 90
With the threshold increased to 90, a much clearer segmentation of the hand is achieved. However some noisy areas do still exist.
Fig. 9 Selection of frames from testClip1.avi with gray scale and threshold = 90 method applied.
Fig. 10 Selection of frames from testClip1.avi with gray scale and threshold = 120 method applied.
Conclusion
The gray scale & threshold method is an effective means of segmenting light coloured skin from a dark background. The method is very restrictive in the composition of the background (it must be uniformly dark for best results). In varying light conditions the threshold value will need to be updated to account for adjustments in the intensity of the hand pixels. However this would be of little concern if the lighting and background could be fixed and the threshold simply updated for each new user.
It proved challenging to find a suitable range despite spending a large amount of time adjusting the values manually. The following selection of frames from testClip1.avi was the best attempt at segmentation that could be achieved via this method.
Selection of frames from testClip1.avi with HSV threshold applied. Min = 90, 0, 200. Max = 140, 80, 255
8.5.2.1 Conclusion
The problem with this system is obvious from the frames shown above: a full segmentation of the hand could not be achieved. The method of thresholding the hue, saturation and value components of each pixel separately was a naive implementation of HSV segmentation. It did not take into account the cyclical nature of the hue value in the HSV colour space. As the hue represents a rotation, both 1 and 359 (or 179 in OpenCV's implementation, as the hue is halved to bring it within the range of 8 bit numbers) are very similar colours. As this was not taken into account when range testing, there is likely to be some error involved. The lighting in the testClip1.avi video was also too strong, causing the hand to be registered as nearly white rather than a skin coloured hue.

8.5.3 Hue Histogram Model
To deal with more complex backgrounds a more sophisticated segmentation method was required. By selecting an area of the image consisting entirely of hand pixels, a histogram model can be constructed. The histogram counts the number of pixels with hue values that fall into predetermined bins. From this histogram model of the skin pixels, a histogram back projection can be applied to each input frame; the resulting grayscale image highlights the query colour as high-intensity (bright) regions.
Fig. 11 shows a region selected by a user which consists entirely of pixels representing the hand. It includes a smooth expanse of skin as well as shading around the knuckles and provides a good model of a hand in the image. The hue histogram for this image would contain high energy components in the red portion of the histogram, and this should hold true for skin colour of all ethnicities. (24)
Fig. 12 Back projection of skin model on frame from camera (left) and thresholded (right)
Fig. 12 shows the results of back projecting the histogram model of the hand onto a new frame from the camera to produce a grayscale image where pixel intensity represents the probability of a pixel belonging to the hand colour set. This image can be thresholded to obtain the right hand image in Fig. 12 and as can be seen, this generates a good binary mask of the hand region with some additional noise.
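A minimal sketch of this back projection stage in the OpenCV 2.1 C API is shown below; the bin count, the backProjThreshold value and the variable names are illustrative assumptions rather than the project's actual code.

// build a hue histogram from the user-selected skin reference image
IplImage* refHsv = cvCreateImage(cvGetSize(skinImage), 8, 3);
IplImage* refHue = cvCreateImage(cvGetSize(skinImage), 8, 1);
cvCvtColor(skinImage, refHsv, CV_BGR2HSV);
cvSplit(refHsv, refHue, NULL, NULL, NULL);        // keep only the hue plane

int histSize = 30;                                 // number of hue bins (assumed)
float hueRange[] = { 0, 180 };                     // OpenCV stores hue as 0-179
float* ranges[] = { hueRange };
CvHistogram* hist = cvCreateHist(1, &histSize, CV_HIST_ARRAY, ranges, 1);
cvCalcHist(&refHue, hist, 0, NULL);
cvNormalizeHist(hist, 255);

// back project the model onto the current frame and threshold the result
IplImage* frameHsv = cvCreateImage(cvGetSize(frame), 8, 3);
IplImage* frameHue = cvCreateImage(cvGetSize(frame), 8, 1);
IplImage* backProj = cvCreateImage(cvGetSize(frame), 8, 1);
cvCvtColor(frame, frameHsv, CV_BGR2HSV);
cvSplit(frameHsv, frameHue, NULL, NULL, NULL);
cvCalcBackProject(&frameHue, backProj, hist);
cvThreshold(backProj, backProj, backProjThreshold, 255, CV_THRESH_BINARY);

The thresholded back projection then serves as the binary mask used in the noise removal and contour stages that follow.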
Fig. 13 - Fig. 16 show the results of segmentation using this method in the presence of changing background colours. Both the threshold and the histogram model of hand pixels remained unchanged for each test, and this gave rise to some poor results in the presence of different background colours. It is possible to obtain good segmentation of hands given a variety of backgrounds by altering the reference image and threshold parameters in the segmentation algorithm. A possible solution to take account of the background colour would be to obtain settings for different colours and then select the correct settings when the background changes, or to obtain an approximation of the settings through analysis of the background.

8.5.3.1 Additional Processing
The initial results show that the segmented image can still contain a number of false positives regardless of the threshold level used. On closer inspection these false positives correspond with very light (white) and very dark (black) areas in the image. Because of the cylindrical nature of the HSV colour space, whites and blacks can often cause problems and it would be advantageous to ignore them in the segmentation of the image. This is easily achieved by thresholding the grayscale frame before any processing, using a very low threshold to remove the blacks and then using a high threshold and inverting to remove the whites. These two masks can then be combined into a single mask which can be applied to the back projection using a logical AND operation, effectively removing the lights and darks from the final segmentation.
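A sketch of this luminance mask, assuming the low and high threshold values (lowThresh, highThresh here) are tuned by hand, might be:

// mask out very dark and very light pixels before the back projection is used
IplImage* gray     = cvCreateImage(cvGetSize(frame), 8, 1);
IplImage* notBlack = cvCreateImage(cvGetSize(frame), 8, 1);
IplImage* notWhite = cvCreateImage(cvGetSize(frame), 8, 1);
IplImage* lumMask  = cvCreateImage(cvGetSize(frame), 8, 1);

cvCvtColor(frame, gray, CV_BGR2GRAY);
cvThreshold(gray, notBlack, lowThresh, 255, CV_THRESH_BINARY);       // remove the blacks
cvThreshold(gray, notWhite, highThresh, 255, CV_THRESH_BINARY_INV);  // remove the whites
cvAnd(notBlack, notWhite, lumMask);                                  // combine the two masks

// AND the mask with the back projection so lights and darks are ignored
cvAnd(backProj, lumMask, backProj);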
Fig. 17 shows a mask using very low threshold values for white and black. Only white pixels in the mask will be considered as skin pixel candidates. It can be seen that no skin pixels are discarded, only the shadow created by the hand and fingers and some background elements are masked out.
Fig. 18 Final segmented images using Hue Histogram technique and noise removal
// Morphological operations in OpenCV
// erosion by a 6x6 rectangular structuring element
cvErode(gray, gray, cvCreateStructuringElementEx(6, 6, 3, 3, CV_SHAPE_RECT));
// closing by a 12x12 rectangular structuring element
cvMorphologyEx(gray, gray, 0, cvCreateStructuringElementEx(12, 12, 6, 6, CV_SHAPE_RECT), CV_MOP_CLOSE, 1);
// dilation by a 12x12 rectangular structuring element
cvDilate(gray, gray, cvCreateStructuringElementEx(12, 12, 6, 6, CV_SHAPE_RECT));
The following screenshots illustrate the effect of erosion. The image is frame 800 from the testClip1.avi video after gray scale and thresholding. Small groupings of pixels in the top left corner are removed and the hand shape is eroded slightly. (Fig. 19)
When a 12x12 structuring element is used, the effect of erosion is much more severe. (Fig. 20)
After removing patches of noise, it is often advantageous to dilate the remaining shape to fill in holes or ensure finger tips aren't separated from the hand. The screen shots in Fig. 21 show a frame after erosion and the same frame after dilation with a 12x12 rectangular structuring element. As can be seen, small holes in the hand shape have been removed, but at the expense of a loss of definition in the fingers. Careful selection of structuring elements and sizes must be practised to ensure a good outcome at the noise removal stage.
Fig. 21 Dilation by a 12x12 rectangular structuring element after erosion by a 4x4 rectangular element
8.5.4.1 Conclusion
Morphological operations are very effective at removing extraneous pixels from the binary image and also have some use for growing regions to obtain more consistent results. The degree of erosion or dilation required will need to be altered for different light levels and variations in skin/background colour/complexity. Other techniques could be explored to ensure accurate representation of finger tips, joining regions that have become separated (for instance, due to shadow, or rings on the finger) and filling large holes on the hand created by shadow/cuts/bruises.
Fig. 22 Contours of the binary image (green) with area. (Hand = 40766.5, NoiseTopLeftCorner = 182.5)
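The contours themselves can be obtained with OpenCV's cvFindContours. The sketch below is illustrative only; the variable names and the choice to keep the largest contour as the hand candidate are assumptions consistent with the area-based hand detection described later.

// extract external contours from the binary segmentation mask
// note: cvFindContours modifies the input image, so a copy may be needed
CvMemStorage* storage = cvCreateMemStorage(0);
CvSeq* contours = NULL;
cvFindContours(mask, storage, &contours, sizeof(CvContour),
               CV_RETR_EXTERNAL, CV_CHAIN_APPROX_SIMPLE, cvPoint(0, 0));

// keep the largest contour (by area) as the hand candidate
CvSeq* hand = NULL;
double maxArea = 0;
for (CvSeq* c = contours; c != NULL; c = c->h_next)
{
    double area = fabs(cvContourArea(c, CV_WHOLE_SEQ));
    if (area > maxArea) { maxArea = area; hand = c; }
}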
After acquiring the contours of the hand, the centre of gravity can be determined by taking moments of the shape and dividing by the area.
CvPoint calculateCog(CvSeq* contours)
{
    CvMoments moments;
    cvMoments(contours, &moments);
    double moment10 = cvGetSpatialMoment(&moments, 1, 0);
    double moment01 = cvGetSpatialMoment(&moments, 0, 1);
    double area = cvGetCentralMoment(&moments, 0, 0);
    int cogX = int(moment10 / area);
    int cogY = int(moment01 / area);
    return cvPoint(cogX, cogY);
}
Fig. 23 Selection of frames from testClip1.avi with centre of gravity and bounding box
8.6.2 Convex Hull From the contours a convex hull can be generated. A convex hull for a two dimensional object can be visualised as if a rubber band had been stretched over the shape.
Again, OpenCV has a built in function for determining the convex hull from the contours.
// cvConvexHull2 returns a sequence of points that describe the
// convex hull of the shape
CvSeq* hull = cvConvexHull2(contours, 0, CV_CLOCKWISE, 0);
8.6.3 Convexity Defects With both the contours and convex hull now determined, the convexity defects can be calculated.
// get defects between hull and contours
CvSeq* defects = cvConvexityDefects(contours, hull);
Fig. 25 Segmented hand shape with convexity defects > 40 pixels marked
Problems with this method include some gestures not being registered due to the depths of defects not being large enough. However, it is also possible for too many defects to be identified if the depth threshold is too small.
Fig. 26 No convexity defects are above the 40 pixel threshold so detecting the finger tip is impossible
To overcome these problems, a lower threshold must be used and an algorithm applied to the convexity defect points to determine if they are likely to be a finger tip.
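A sketch of such a filter is given below; the depth threshold (minDepth), the exclusion radius around the centre of gravity (minRadius) and the markFingerTip handler are illustrative assumptions rather than the project's exact logic.

// keep only defect points that are deep enough and lie well outside
// a circle centred on the hand's centre of gravity
CvPoint cog = calculateCog(contours);

for (int i = 0; i < defects->total; i++)
{
    CvConvexityDefect* d = (CvConvexityDefect*)cvGetSeqElem(defects, i);
    if (d->depth < minDepth)
        continue;                             // ignore shallow defects

    CvPoint tip = *(d->start);                // candidate finger tip point
    double dx = tip.x - cog.x;
    double dy = tip.y - cog.y;
    if (sqrt(dx * dx + dy * dy) > minRadius)
        markFingerTip(tip);                   // hypothetical handler for a detected tip
}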
Fig. 27 Finger tips are identified by numbers. Only points outside of the circle are considered.
The results of this logic are consistent and reliable. However, problems can still be experienced due to poor segmentation.

8.6.5 Conclusion
For the purposes of the augmented desktop project, the area of the segmented region is a good enough measure for detecting the presence of a hand. On a desk top there are unlikely to be skin coloured shapes of a similar area to a hand. Therefore, further methods of analysis are unnecessary for hand detection.
Fig. 28 Thumb is not recognised as it is separate from the main hand shape.
The convexity defect method for identifying finger tips is a reliable method given a few constraints. Sleeves, or a wrist band, must be worn by the user to prevent a defect being detected between the forearm and thumb/little finger. The hand segmentation must be a full and accurate representation of the hand. If regions become separated then they are not considered part of the hand and can lead to fingers not being detected. (Fig. 28)
To calculate the homography, four points must be selected from the camera view and four points defined for the display area (Fig. 29). The points Q1-4 are simply the corners of the rectangular image to be displayed in the display area.
The points P1-4 cannot be obtained so easily and vary with the camera orientation. The obvious solution is to have the points selected by the user prior to using the system and adding the constraint that the camera must remain in the same orientation throughout use. This requires a simple program which allows the user to click on a webcam image to mark the four points. Given the four pairs of corresponding points a matrix equation can be formed to obtain the eight unknown values in the 3x3 homography matrix.
Equation 2 shows how to derive the homography matrix from the four pairs of corresponding points. OpenCV also provides a function for doing this.

8.7.1 Homography Calibration App
To simplify the process of calculating the homography for the system each time it is taken down and re-assembled (leading to changes in orientation of the camera), an application was developed which allows the user to select the four corners of the display on a web cam feed and select the resolution and offset of the display area. The user can then simulate a finger placement and check that it corresponds to the correct point on the display.
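A sketch of how the calibrated points could be turned into the homography and applied to a finger tip using OpenCV is shown below; the corner ordering, displayWidth/displayHeight and the fingerTip variable are assumptions for illustration (cvFindHomography could equally be used).

// compute the camera-to-display homography from the four calibrated corners
// p1..p4 are the points clicked on the webcam image, in the same order as
// the display corners listed in dst
CvPoint2D32f src[4] = { p1, p2, p3, p4 };
CvPoint2D32f dst[4] = { cvPoint2D32f(0, 0),
                        cvPoint2D32f(displayWidth, 0),
                        cvPoint2D32f(displayWidth, displayHeight),
                        cvPoint2D32f(0, displayHeight) };

CvMat* H = cvCreateMat(3, 3, CV_32FC1);
cvGetPerspectiveTransform(src, dst, H);

// map a detected finger tip from camera coordinates to display coordinates
CvMat* fingerCam  = cvCreateMat(1, 1, CV_32FC2);
CvMat* fingerDisp = cvCreateMat(1, 1, CV_32FC2);
cvSet1D(fingerCam, 0, cvScalar(fingerTip.x, fingerTip.y));
cvPerspectiveTransform(fingerCam, fingerDisp, H);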
Fig. 31 Screenshot showing a simulated finger (large green circle) and its corresponding transformed point on the display
Fig. 32 shows the array of sliders for configuring the segmentation. The top three sliders control the back projection threshold and the black and white threshold. The next two sliders control the size of the opening and closing structuring element. The remaining sliders set the number of iterations of the morphological operations, the value used to normalise the histogram, a smoothing/blurring operation and RGB colour control of the display background. Creating a new segmentor object and loading the saved settings can then be achieved in just a few lines of code:
// create segmentor object
Segmentor segmentor;

// load settings and skin image
ifstream segmentorSettings("configFile.txt");
IplImage* skinImage = cvLoadImage("reference.jpg");

// send settings and image to segmentor
if (segmentorSettings.good())
    segmentorSettings >> &segmentor;
segmentor.setReferenceImage(skinImage);

// segment frame and set mask to the binary skin representation
segmentor.segment(frame, mask);
This makes the development of applications that use these settings very quick and has helped in the rapid prototyping of the applications described later in this report.
8.9 Projection/Display
With a system for segmenting skin colour and detecting finger tips in place, a means of displaying interactive content on the desk surface needed to be determined. In the early stages of the project, it was thought a projector would be suitable. Using a projector gives the potential for using the entire desk surface for interaction and display and would prevent the desk becoming cluttered with additional display technology/wires. However, after some initial testing of a projector, and facing problems with suspending it above a desk, the need for a dark room to display an image and the downward projection interfering with the segmentation, it was decided instead to use an LCD monitor laid flat on the desk as the display surface (Fig. 33).
Fig. 33 Equipment set up. LCD display on the desk and a webcam pointing at it from above
Fig. 34 Result of superimposing a red "true" segmentation and the blue actual segmentation
From analysis of the image in Fig. 34 the following results can be obtained:
Total Pixels: 307200 (640x480)
True Skin Pixels: 41442
Matched Pixels: 36477 (88.02%)
False Negatives: 4965 (11.98%)
False Positives: 1957 (5.09%)
False negatives are pixels that should have been skin but were not registered as such (the red pixels). False positives are pixels which were registered as skin when they should not have been (the blue pixels); this percentage is given relative to all pixels registered as skin. Using these percentages it is possible to quantitatively compare the effectiveness of different images and segmentation techniques.
Fig. 35 Example of good and bad results for finger tip detection tests
9.1.3 Grid Based Accuracy Test
To determine the accuracy with which a finger can be used to interact on the display area, a test was devised in which a numbered grid is displayed across the display area. The user is asked to place their finger over a grid square for three seconds, after which time the square will turn green, indicating a successful selection. By mapping the results onto a separate grid it is possible to build a map of areas which cannot be accurately selected. Using this information and by altering the grid size a good resolution can be selected and this forms the basis for selecting the size of interactive elements.
[Fig. 36: numbered 4x4 and 8x8 grids used for the grid based accuracy test]
Fig. 36 shows example results from 4x4 and 8x8 grids. At higher resolutions it can often be difficult to select areas close to the edge of the display area.
Each user was asked to use the photo viewer and checkerboard applications for five minutes. Each application was described beforehand and the relevant gestures demonstrated by an expert user.

9.1.5 Latency
For an interactive system, latency is an important test. During development it was noticeable that the system lagged when tracking finger tip movement. If a coloured blob was placed under a finger tip and the finger tip subsequently moved, the blob would seem to follow the finger tip rather than remain underneath it. To quantify this, timers were placed in the code to establish the time taken between capturing a new frame and updating the output. These timings were further broken down into segmentation and detection to establish bottlenecks in the code.
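A minimal sketch of this kind of instrumentation, using OpenCV's tick counter (the stage boundaries and the detectFingerTips call are illustrative assumptions):

// time the segmentation and detection stages for one frame
double t0 = (double)cvGetTickCount();
segmentor.segment(frame, mask);               // segmentation stage
double t1 = (double)cvGetTickCount();
detectFingerTips(mask);                       // hypothetical detection stage
double t2 = (double)cvGetTickCount();

// cvGetTickFrequency() returns ticks per microsecond
double ticksPerSec = cvGetTickFrequency() * 1.0e6;
printf("segmentation: %.3f s  detection: %.3f s\n",
       (t1 - t0) / ticksPerSec, (t2 - t1) / ticksPerSec);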
Test image        1       2       3       4       5       6       7       8
False negative    11.98%  9.00%   9.89%   9.97%   12.30%  11.51%  10.76%  15.31%
False positive    5.09%   2.54%   3.31%   4.68%   4.53%   2.27%   2.59%   3.39%
Averages: match 87.4%, false negative 12.6%, false positive 3.6%

Fig. 39 Percentage matches, false negatives and false positives in segmentation tests
Image   Fingers   Found   Difference   Comments
1       5         6       1            Bare arm
2       1         1       0            Good
3       1         1       0            Good
4       1         2       1            Knuckle
5       1         1       0            Good
6       1         1       0            Good
7       2         4       2            Bare arm & knuckle
8       2         6       4            Bare arm & knuckle
9       0         0       0            Good
10      1         3       2            Bare arm & knuckle
11      2         4       2            Bare arm & knuckle
12      1         1       0            Good
13      2         2       0            Good
14      3         2       -1           Tips too close together
15      4         4       0            Good
16      5         6       1            Bare arm & tips too close together
Image   Fingers   Found   Difference   Comments
1       5         5       0            Good
2       1         1       0            Good
3       1         1       0            Good
4       1         1       0            Good
5       1         1       0            Good
6       1         1       0            Good
7       2         2       0            Good
8       2         2       0            Good
9       0         0       0            Good
10      1         1       0            Good
11      2         2       0            Good
12      1         1       0            Good
13      2         2       0            Good
14      3         2       -1           Tips too close
15      4         3       -1           Tips too close
16      5         4       -1           Tips too close
[Figure: numbered 3x3, 4x4 and 5x5 grids used for the grid based accuracy test]
As shown in Fig. 44 the bottom edge of the screen was particularly difficult to select reliably. This was due to the hand disappearing from the camera view when the user went to select a lower grid square. To solve this problem, the camera height was increased, giving a wider view of the desk surface and providing a significant amount of space for the hand to be detected within. Fig. 45 shows the difference in camera views between the original 50cm height and the new 75cm height.
Fig. 45 Camera view at 50cm (left) and camera view at 75cm (right)
After increasing the camera height, the grid based accuracy test results improved considerably. Grid sizes up to 32x32 were tested and no test failures were encountered. In a 32x32 grid at a 1280x1024 screen resolution each grid square is 40x32 pixels in size, allowing a confident prediction that any interactive element this size or larger will be reliably selected by a user of this system. It should also be noted that the accuracy largely depends on the accuracy of the segmentation. In cases of poor segmentation the finger tip indicator can be seen to jump around randomly. This can be resolved to some degree by carefully configuring the segmentation and applying a smoothing/blurring filter to the mask.

9.2.4 Questionnaire
Of the 8 users who tested both applications, 7 gave overall positive responses, with each question receiving an average score of 3.25 or higher. Both the 'selecting interactive elements' question and the 'I enjoyed using this interface' question received the highest average scores (3.875). The questions regarding recognition of finger tips and gestures also received average ratings above 3.25. The questionnaires also brought to light some interesting issues regarding hand size.
The smoothing operation increases the average time for segmentation by 0.01s and the latency that is observed becomes significantly more pronounced when smoothing is utilised. The opening and closing operations have relatively little effect and no noticeable increase in latency is perceived by the user from these operations. Another factor influencing the latency of the system is the capture frame rate. At the frame rate of 24fps, there is 0.042s between each captured frame.
Fig. 46 Screen shot of the checkers game. The app runs full screen on the display monitor.
Fig. 47 Screen shot from the photo viewer application in debug mode.
Fig. 47 shows the photo viewer application in debug mode. The green circles show the current location of the finger tips while the blue line and circle show the start point and movement vector for the gesture which is currently in progress. After detecting a gesture by applying some basic rules such as location of the start and end points and the cessation of motion, the photos animate in a carousel motion.
These aims were largely achieved; in particular, the first aim has been clearly demonstrated through the applications that have been developed for the system. Both the photo viewer application and the checker board game show the clear advantages of new gesture based interfaces for human computer interaction.

The second aim, augmentation of the desktop surface with projection, changed as the project progressed. Projection onto the desk surface proved to be a more difficult problem to solve than the available time allowed. This led to the decision to use a computer monitor as the display surface, avoiding many of the problems that projection introduced (interfering with segmentation, requiring low light levels in the room). The computer monitor still enables graphics to be displayed and interacted with, but it does take up significant desk space. An improvement to this set up would be to integrate the screen into the desk surface.

The objectives for the project break down the aims into manageable goals:
- Reliable segmentation of skin colour regions from a real-time video feed.
- Recognition and tracking of a hand shape & other key features (fingertips).
- Mapping a point on the plane of the desk surface to a point on the display.
- Projection of extra information onto the desk surface.
- Interaction with the projections using hand movement/gestures.
- Develop an application which demonstrates the use of this interface.

The segmentation of skin colour from a real time video feed took up the greatest amount of development time. While reasonable segmentation under very controlled conditions was easily attained, it became clear that a poor segmentation would have severe repercussions for the accuracy of the tracking stage. For the tracking to function, the hand must be segmented as a single area and must not include extraneous noise which changes the contours of the shape. The results of the segmentation tests show the final system to be almost 90% accurate when compared to a human manually segmenting an image and, more importantly, the segmentation accurately retained fingers and other key contour shapes.

For recognition and tracking of hands and features, development went relatively smoothly. After reviewing a number of techniques for finger tip detection, the convexity defects method was selected.