
Universidade de Aveiro
Departamento de Electrónica, Telecomunicações e Informática
2014

Patrícia Nunes Aleixo

Deteção e reconhecimento de objetos para aplicações robóticas

Object detection and recognition for robotic applications

“A picture is worth a thousand words.”

— Unknown

Dissertation presented to the Universidade de Aveiro in fulfilment of the requirements for the degree of Master in Electronics and Telecommunications Engineering, carried out under the scientific supervision of Professor Doutor António José Ribeiro Neves and Professor Doutora Ana Maria Perfeito Tomé, Professors of the Departamento de Electrónica, Telecomunicações e Informática of the Universidade de Aveiro.
o júri / the jury

presidente / president Prof. Doutor José Luis Guimarães Oliveira
Associate Professor, Universidade de Aveiro

vogais / examiners committee Prof. Doutor Jaime dos Santos Cardoso
Assistant Professor, Universidade do Porto (main examiner)

Prof. Doutor António José Ribeiro Neves
Assistant Professor, Universidade de Aveiro (supervisor)
agradecimentos / acknowledgements To my supervisor, Doutor António Neves, for believing in my work and for all the motivation and guidance throughout this last year. To my co-supervisor, Doutora Ana Tomé, for her availability and for the revision of this document. I also thank the CAMBADA team for the opportunity I was given to work on this project.
To my family, father and mother, for all the confidence placed in me and in my work at every stage of my life. To my sister, for all the words of motivation and encouragement in the less good moments. To all the friendships made along my university path, thank you for all the companionship, help and good mood. I also leave a word of thanks to my long-standing friends who, even from afar, were also present at this stage of my life, celebrated my small achievements with me, and are proud of my path. A big thank you to everyone.
Palavras-chave Visão Robótica, deteção de objetos, deteção de contornos, SIFT, SURF,
transformada de Hough, Template Matching, OpenCv.
Resumo A visão por computador assume uma importante relevância no desenvolvimento de aplicações robóticas, na medida em que há robôs que precisam de usar a visão para detetar objetos, uma tarefa desafiadora e por vezes difícil. Esta dissertação foca-se no estudo e desenvolvimento de algoritmos para a deteção e identificação de objetos em imagem digital, para aplicar em robôs que serão usados em casos práticos.
São abordados três problemas: deteção e identificação de pedras decorativas para a indústria têxtil; deteção da bola em futebol robótico; deteção de objetos num robô de serviço, que opera em ambiente doméstico. Para cada caso, diferentes métodos são estudados e aplicados, tais como Template Matching, transformada de Hough e descritores visuais (como SIFT e SURF).
Optou-se pela biblioteca OpenCV com vista a utilizar as suas estruturas de dados para manipulação de imagem, bem como as demais estruturas para toda a informação gerada pelos sistemas de visão desenvolvidos. Sempre que possível utilizaram-se as implementações dos métodos descritos, tendo sido desenvolvidas novas abordagens, quer em termos de algoritmos de pré-processamento, quer em termos de alteração do código fonte das funções utilizadas. Como algoritmos de pré-processamento foram utilizados o detetor de arestas Canny, a deteção de contornos e a extração de informação de cor, entre outros.
Para os três problemas, são apresentados e discutidos resultados experimentais, de forma a avaliar o melhor método a aplicar em cada caso. O melhor método em cada aplicação encontra-se já integrado ou em fase de integração nos robôs descritos.
Keywords Robotic vision, object detection, contours detection, SIFT, SURF, Hough
transform, Template Matching, OpenCV.
Abstract Computer vision assumes an important role in the development of robotic applications, since in several applications robots need to use vision to detect objects, a challenging and sometimes difficult task. This thesis is focused on the study and development of algorithms for the detection and identification of objects in digital images, to be applied on robots used in practical cases.
Three problems are addressed: detection and identification of decorative stones for the textile industry; detection of the ball in robotic soccer; detection of objects by a service robot that operates in a domestic environment. In each case, different methods are studied and applied, such as Template Matching, the Hough transform and visual descriptors (like SIFT and SURF).
The OpenCV library was chosen in order to use its data structures for image manipulation, as well as the other structures for all the information generated by the developed vision systems. Whenever possible, the available implementations of the described methods were used, and new approaches were developed, both in terms of pre-processing algorithms and in terms of modifications to the source code of some of the used functions. Regarding the pre-processing algorithms, the Canny edge detector, contour detection and the extraction of color information, among others, were used.
For the three problems, experimental results are presented and discussed in order to evaluate the best method to apply in each case. The best method for each application is already integrated, or in the process of being integrated, in the described robots.
Contents

Contents i

1 Introduction 1
1.1 Objectives . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.2 Thesis structure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2

2 Object Detection on digital images 5


2.1 Basic concepts . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
2.1.1 Digital images . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
2.1.1.1 Types of images . . . . . . . . . . . . . . . . . . . . . . . . . 6
2.1.1.2 Digital cameras . . . . . . . . . . . . . . . . . . . . . . . . . 7
2.1.2 Image filtering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.1.3 OpenCv Library . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.2 Template Matching . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
2.3 Hough transform . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
2.3.1 Linear Hough transforms . . . . . . . . . . . . . . . . . . . . . . . . . 14
2.3.2 Circular Hough transforms . . . . . . . . . . . . . . . . . . . . . . . . 16
2.4 Visual descriptors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
2.5 Descriptors matching . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
2.5.1 Matching strategy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
2.5.2 Efficient matching . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
2.6 Location of object: the RANSAC algorithm . . . . . . . . . . . . . . . . . . . 21
2.7 Final remarks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22

3 Features-based descriptors 23
3.1 Scale Invariant Feature Transform descriptor - SIFT . . . . . . . . . . . . . . 23
3.1.1 Scale space construction . . . . . . . . . . . . . . . . . . . . . . . . . . 24
3.1.2 Keypoints detection . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
3.1.3 Keypoint descriptors . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
3.2 Speeded Up Robust Feature descriptor - SURF . . . . . . . . . . . . . . . . . 32
3.2.1 Pre-processing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
3.2.2 Box space construction . . . . . . . . . . . . . . . . . . . . . . . . . . 36
3.2.3 Keypoints detection . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
3.2.4 Keypoints descriptor . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
3.3 Features from Accelerated Segment Test - FAST . . . . . . . . . . . . . . . . 39
3.4 Binary Robust Independent Elementary Features - BRIEF . . . . . . . . . . . 41
3.5 Final remarks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42

4 Decorative stones detection and recognition for a textile industry robot 45


4.1 Study of the appropriate methodology . . . . . . . . . . . . . . . . . . . . . . 46
4.2 Developed detection system . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
4.2.1 Grayscale . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
4.2.2 Edges . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
4.2.3 Contours . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
4.3 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
4.4 Final remarks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57

5 Ball Detection for Robotic Soccer 59


5.1 Background implementation issues . . . . . . . . . . . . . . . . . . . . . . . . 60
5.2 Study of HoughCircles() parameters . . . . . . . . . . . . . . . . . . . . . . . 61
5.2.1 Upper threshold for the internal Canny edge detector . . . . . . . . . 62
5.2.2 Threshold for center detection . . . . . . . . . . . . . . . . . . . . . . 64
5.2.3 Maximum and minimum radius to be detected . . . . . . . . . . . . . 64
5.3 Team play with an arbitrary FIFA ball . . . . . . . . . . . . . . . . . . . . . . 64
5.3.1 Validation process . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
5.3.2 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67

5.4 Ball in the air . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
5.4.1 Validation process . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
5.4.2 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
5.5 Final remarks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71

6 Objects Detection and Recognition for a Service Robot 73


6.1 RoboCup environment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
6.2 Test scenario . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
6.3 Developed vision system . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
6.4 Efficiency of the visual descriptors . . . . . . . . . . . . . . . . . . . . . . . . 78
6.5 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79
6.6 Final remarks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83

7 Conclusions and future work 87


7.1 Future work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88

A Results: Decorative stones detection 89

Bibliography 93

Chapter 1

Introduction

In the last decades, technology has evolved exponentially. Robotics is an example of this fast development: in a few decades, scientific fantasy has turned into reality.
A robot can be defined as an agent in a specific environment, from which it can gather information and in which it can react with a specific goal. The information about the environment can be obtained by a variety of sensors. One of the richest types of sensors, and nowadays widely used, is the digital camera. Computer vision systems for robot applications allow an agent to understand the environment around it, to detect shapes and objects and to classify them. Performing this task in a time-constrained manner requires an efficient vision system, and the processing time is a relevant issue to be considered when a vision system is implemented. Moreover, several parameters affect the performance of a vision system. The type of camera or lenses plays an important role in the vision system; however, some external parameters, such as the illumination conditions in the environment where the robot works and the correct calibration of the vision system, among others, impose additional constraints when a vision system is developed.
Since the beginning, researchers have been trying to develop computer vision systems that imitate human vision, a complex system that acquires images through the eyes and then processes them in the brain, in order to react to external stimuli. In a computer vision system, images are acquired by a digital camera and processed in a computational system with the same objective as the human brain. The main goal in computer vision has been to duplicate the abilities of human vision by electronically perceiving and understanding an image. However, it is assumed in the scientific community that this goal is still far from being reached.

Computer vision is a topic of study and research in several areas. Currently, vision systems are used in industrial environments, medicine, astronomy, forensic studies, biometrics, among others. This dissertation focuses on the use of vision systems for robot applications, namely in three different applications: robotic soccer, industrial and domestic environments. All the applications focus on autonomous robots that perform behaviors or tasks with a high degree of autonomy, without human intervention.

1.1 Objectives

The objective of this work is to describe, understand and evaluate several vision algorithms to be used in robotic applications. In particular, we intend to study and develop algorithms for the analysis of shapes, features and color information in digital images in order to detect and classify objects. These algorithms are developed taking into consideration time constraints, so that they can be used in autonomous robots operating in industrial or domestic environments.
This thesis focuses on solutions for three practical applications:

• Development of a decorative stones detector for the textile industry, for the Tajiservi company;

• Integration of a new approach to ball detection in the vision system of the CAMBADA soccer team of the University of Aveiro;

• Development of an object detector to be integrated in the CAMBADA@Home autonomous service robot from the University of Aveiro.

When the development is concluded, each algorithm should be integrated and tested in the corresponding application.

1.2 Thesis structure

The remainder of the thesis is structured as follows:

• In Chapter 2, several known algorithms for object detection found in the literature are presented. Three different approaches are studied along this thesis in order to solve the several proposed problems. In this chapter, techniques used with the aim of object detection, such as Template Matching and the Hough transform, are presented.

• In Chapter 3, a deep analysis of visual descriptors is presented. The SIFT, SURF, FAST and BRIEF algorithms are explained.

• In Chapter 4, several solutions are proposed to solve a practical problem. A decorative stones detector for the textile industry was developed and the results are presented and discussed. In this chapter, several pre-processing steps are proposed and explained in order to achieve the best detection algorithm. Three algorithms are developed and their results compared and discussed. Techniques like the Canny edge detector, dilation and erosion, and contour representation are used.

• In Chapter 5, the problem of ball detection in robotic soccer is discussed. The circular Hough transform is applied in order to solve several challenges related to the CAMBADA team.

• In Chapter 6, a solution for object detection by an autonomous service robot, operating in a domestic environment, is proposed. The SIFT and SURF algorithms are tested and evaluated.

• In Chapter 7, the conclusions of this thesis are described, as well as ideas for future work in order to improve the proposed algorithms.

Chapter 2

Object Detection on digital images

An object recognition and detection system tries to find and classify objects of the real world in an image of the world, using object models which are known a priori. Humans recognize multiple objects in images with little effort. The image of the objects may vary for different view points, namely different sizes and scales, or even when they are rotated, and objects can still be recognized when they are partially obstructed from view. In robotics, finding an object and classifying it is a non-trivial task in cluttered environments. The problem of object detection and recognition is a notoriously difficult one in the computer vision and robotics communities. Developing an autonomous robot with the ability to perceive objects in a real-world environment is still a challenge for computer vision systems. Many approaches to perform this task have been implemented over the last decades.

From robotics to information retrieval, many desired applications demand the ability to
identify and localize categories, places and objects. Object detection algorithms typically use
extracted features and learning algorithms to recognize instances of an object category. It is
commonly used in applications such as image retrieval and security and surveillance systems.

A variety of approaches can be used, including: image segmentation and gradient-based models, through blob analysis and background subtraction; feature-based object detection, by detecting a reference object in a cluttered scene using feature extraction and matching; and Template Matching approaches, when the scale factor is not important, among others.

2.1 Basic concepts

This section presents the basic elements of image processing needed to accomplish the most fundamental tasks developed along this thesis.

2.1.1 Digital images

An image may be defined as a two-dimensional function, f (x, y), where x and y are spatial (plane) coordinates, and the amplitude of f at any pair of coordinates (x, y) is called the intensity of the image at that point. When x, y, and the intensity values of f are all finite, discrete quantities, the image is called a digital image. A digital image is composed of a finite number of elements, each of which has a particular location and value. These elements are called pixels. In Figure 2.1 a scheme of a digital image is shown.

Figure 2.1: Digital image representation: rectangular matrix of scalars or vectors, where the elements f (i, j) are named pixels, Nc refers to the number of columns and Nr to the number of rows.

2.1.1.1 Types of images

Digital images can be classified in four types: black and white, grayscale, color and indexed
images. In this section it is presented a brief explanation relatively to each type of image.

• Black and white images, also called binary images, are images whose pixels have only two possible intensity values. These images have been quantized into two values, usually 0 and 1 or 0 and 255, depending on the numeric type used to represent the information. Black and white images need low storage or transmission resources and simple processing. On the other hand, these images have limited applications, restricted to tasks where internal details are not required as a distinguishing characteristic. Black and white images are sometimes used in intermediate steps of a computer vision system.

• Grayscale images, also called monochromatic, denote the presence of only one value associated with each pixel. They are the result of measuring the intensity of light reflected by an object, or some other information, like temperature or distance, at each pixel in a single band of the electromagnetic spectrum. Grayscale images combine black and white in a continuum, producing a range of shades of gray. This range is represented from 0 to 2^b − 1, where b denotes the number of bits used to represent each pixel.

• Color images include the presence of several channels, normally three, to characterize
each pixel. For example, the human eye has three types of receptors that detect light
in red, green and blue bands. For the brain, each color corresponds to electromagnetic
waves with different wavelengths that are received by the eye. In the digital world,
color can be represented with a set of numbers, which are interpreted as coordinates in
a specific color space. The RGB, YCbCr and HSV are the most commonly used color
spaces in computer vision. They differ in the mathematical description of each color.

• Indexed images are a form of vector quantization compression of a color image. When
a color image is encoded, color information is stored in a database denominated palette
and the pixel data only contain the number (the index) that corresponds to a color in
the palette. This color table stores a limited number of distinct colors (typically 4, 16,
. . . , 256). Indexed images reduce the storage space and transmission time, but limits
the set of colors per images.

In Figure 2.2 it is possible to see an example of an image represented in the different types
described above.
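As a small illustration of how these image types relate in practice, the sketch below (assuming the OpenCV C++ API and a hypothetical input file name) converts a color image to a grayscale one and then quantizes it to a black and white image by thresholding; the threshold value 128 is only an example.

#include <opencv2/opencv.hpp>

int main()
{
    // Hypothetical input file; any color photograph works.
    cv::Mat color = cv::imread("example.png", cv::IMREAD_COLOR);      // 3-channel color image

    cv::Mat gray, blackAndWhite;
    cv::cvtColor(color, gray, cv::COLOR_BGR2GRAY);                    // single channel, 0..255
    cv::threshold(gray, blackAndWhite, 128, 255, cv::THRESH_BINARY);  // two values: 0 and 255

    cv::imwrite("gray.png", gray);
    cv::imwrite("black_and_white.png", blackAndWhite);
    return 0;
}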

2.1.1.2 Digital cameras

A digital camera can be seen as a device that captures the light from the environment
and produces a digital image that can be stored internally or transmitted to another device.


Figure 2.2: Types of images: (a) Black and white; (b) Grayscale; (c) Color; (d), (e) and (f)
Indexed images with different number of colors in the palette.

The electronic component responsible for this transformation is a chip composed of a light-
sensitive area made of crystal silicon in a photodiode which absorbs photons and releases
electrons through the photoelectric effect. The electrons are accumulated over the length of
the exposure. The charge that is generated is proportional to the number of photons that
hit the sensor. This electric charge is then transferred and converted to a voltage that is
amplified and then sent to an analog to digital converter where it is digitized.
There exist two main chip technologies in digital cameras: CCD (Charge Coupled Device) and CMOS (Complementary Metal Oxide Semiconductor). They perform similarly but differ in how the charge is transferred and where it is converted to a voltage. Depending on the application, one has advantages and disadvantages regarding the other.
When a camera takes a picture, the shutter opens briefly and each pixel on the image sensor records the light that falls on it by accumulating an electrical charge. To end the exposure, the shutter closes and the charge of each pixel is measured and converted into a digital number. Pixels on an image sensor only capture brightness, not color. In most color cameras, red, green and blue filters are placed over individual pixels on the image sensor, making it possible later to reconstruct a color image. This filter arrangement is called the Bayer matrix [1] and it is illustrated in Figure 2.3.

Figure 2.3: The Bayer arrangement of color filters on the pixel array of an image sensor (left).
Cross section to red, green and blue color (right) [2].

In the market there exists a variety of cameras to fulfill the requirements of a specific application. It is possible to find cameras with high resolution and long exposure time for scientific usage, infrared cameras for military and rescue applications, high dynamic range cameras for hard illumination scenarios, and stereo and 3D cameras to extract spatial information from an environment, among others. Robots can use different types of cameras, according to the final application.

2.1.2 Image filtering

When an image is acquired, it is often not used directly for object detection. Sometimes it is necessary to eliminate or transform the information present in an image. This process can be performed by an operation denoted filtering. When an image is filtered, its appearance, completely or just in a region, changes by altering the shades and colors of the pixels in some manner. Filters are commonly used for blurring, sharpening, edge extraction or noise removal.
A filtering process is a convolution operation between an image and another matrix called kernel. Each new pixel value is calculated as a function of the corresponding old pixel value and its neighborhood. The numbers in the kernel represent the weights by which each pixel of the neighborhood in the original image will be multiplied. The result of the convolution at each position is the sum of these products. Figure 2.4 illustrates a schematic that helps to understand how the convolution occurs.
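The following minimal sketch, assuming the OpenCV C++ API and a hypothetical input file, applies a 3 × 3 averaging kernel with cv::filter2D, which performs exactly the sliding weighted-sum operation described above (for a symmetric kernel such as this one, correlation and convolution coincide):

#include <opencv2/opencv.hpp>

int main()
{
    cv::Mat src = cv::imread("example.png");        // hypothetical input image

    // 3x3 averaging (blur) kernel: every neighbor has the same weight 1/9.
    cv::Mat kernel = cv::Mat::ones(3, 3, CV_32F) / 9.0f;

    cv::Mat dst;
    // filter2D slides the kernel over the image and replaces each pixel by the
    // weighted sum of its neighborhood, as illustrated in Figure 2.4.
    cv::filter2D(src, dst, -1 /* same depth as the source */, kernel);

    cv::imwrite("blurred.png", dst);
    return 0;
}

Replacing the kernel values changes the effect of the filter; for example, a kernel with a large positive center and negative neighbors produces sharpening instead of blurring.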

2.1.3 OpenCv Library

OpenCv (Open Source Computer Vision) is a computer vision library, launched by Intel.
The first version was released to the public at the IEEE Conference on Computer Vision and

Figure 2.4: Center element of the kernel is placed over the source pixel. The source pixel is
then replaced with a weighted sum of itself and nearby pixels [3].

Pattern Recognition in 2000. Currently, OpenCv is owned by a nonprofit foundation called


OpenCV.org.

It focuses mainly on real-time extraction and processing of meaningful data from images. OpenCv provides several hundred image processing and computer vision algorithms, which make developing advanced computer vision applications easy and efficient. Its primary interface is in C++, but it still retains a less comprehensive, though extensive, older C interface. There are now full interfaces in Python and Java. The OpenCv library runs under Linux, Windows and Mac OS X.
The OpenCv functionality is subdivided into modules, each providing a set of related functions. In this project several modules are used, such as: core, imgproc, highgui and calib3d.
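A minimal example of how these modules cooperate is sketched below (the file name is hypothetical, and the single opencv.hpp header is assumed to pull in the individual module headers): an image is loaded, converted to grayscale, processed with the Canny edge detector and displayed.

#include <opencv2/opencv.hpp>   // pulls in core, imgproc, highgui, ...

int main()
{
    cv::Mat image = cv::imread("example.png");      // highgui: image loading
    if (image.empty())
        return -1;

    cv::Mat gray, edges;
    cv::cvtColor(image, gray, cv::COLOR_BGR2GRAY);  // imgproc: color conversion
    cv::Canny(gray, edges, 50, 150);                // imgproc: Canny edge detector

    cv::imshow("edges", edges);                     // highgui: window display
    cv::waitKey(0);
    return 0;
}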

In this chapter we present different approaches for object detection and recognition and introduce some techniques that have been proposed in the literature. We discuss the different recognition tasks that a vision system may need to perform, analyze the complexity of these tasks and present useful approaches for the different phases of the recognition task.
Three approaches are considered, namely the Template Matching method, the Hough transform and visual descriptors based on the extraction of features.

2.2 Template Matching

Template matching is a technique in digital image processing used for finding small parts of an image which match a model image. This process involves calculating the correlation between the image under evaluation and the reference image, for all possible positions between the two images. They are compared using a technique called sliding window. This window moves pixel by pixel (from left to right and top to bottom) and a result matrix is created, where each element contains the result of a proximity measure that represents how “good” or “bad” the similarity between them is. Mathematically, the source image I is the image in which the reference (database) image T is searched. The result matrix R corresponds to the calculation carried out, and the position of each element of that matrix is designated (x, y). The image under evaluation I is of size W × H, and the reference image T of size w × h. The result matrix R will have size (W − w + 1) × (H − h + 1), with the condition (size of I) > (size of T).
It is possible to calculate the similarity of two images in several ways. The OpenCv library has six different calculation methods available [4]:

• Squared difference: This method calculates the squared difference. The best match is found at the global minimum, so a perfect match will be 0:

$$R(x, y) = \sum_{x', y'} \left( T(x', y') - I(x + x', y + y') \right)^2 \qquad (2.1)$$

• Cross correlation: This method uses multiplication, so a perfect match will be large and bad matches will be smaller or 0:

$$R(x, y) = \sum_{x', y'} \left( T(x', y') \cdot I(x + x', y + y') \right) \qquad (2.2)$$

• Correlation coefficient: In this method the mean is subtracted from both the reference image and the image under evaluation before computing the inner product. A perfect match will be 1:

$$R(x, y) = \sum_{x', y'} \left( T'(x', y') \cdot I'(x + x', y + y') \right) \qquad (2.3)$$

• Normalized methods: Normalization is a process that changes the range of pixel intensity values. It is useful, for example, in images with poor contrast due to glare:

$$R(x, y) = \frac{\sum_{x', y'} \left( T(x', y') - I(x + x', y + y') \right)^2}{\sqrt{\sum_{x', y'} T(x', y')^2 \cdot \sum_{x', y'} I(x + x', y + y')^2}} \qquad (2.4)$$

$$R(x, y) = \frac{\sum_{x', y'} \left( T(x', y') \cdot I(x + x', y + y') \right)}{\sqrt{\sum_{x', y'} T(x', y')^2 \cdot \sum_{x', y'} I(x + x', y + y')^2}} \qquad (2.5)$$

$$R(x, y) = \frac{\sum_{x', y'} \left( T'(x', y') \cdot I'(x + x', y + y') \right)}{\sqrt{\sum_{x', y'} T'(x', y')^2 \cdot \sum_{x', y'} I'(x + x', y + y')^2}} \qquad (2.6)$$

where in (2.3) and (2.6):

$$T'(x', y') = T(x', y') - \frac{1}{w \cdot h} \sum_{x'', y''} T(x'', y''),$$

$$I'(x + x', y + y') = I(x + x', y + y') - \frac{1}{w \cdot h} \sum_{x'', y''} I(x + x'', y + y'').$$

After the computation of the correlation between the reference image and all the possible regions of the image is finished, the best matches can be found. When Equation 2.1 or 2.4 is used, the match is found at the global minimum; for the remaining equations, the best matches correspond to a global maximum (Figure 2.5 illustrates the calculation of the result matrix). The position of this maximum or minimum value corresponds to the upper left corner of the matched region.
In Figure 2.6 it is possible to see an example of the application of the Template matching algorithm.
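The sketch below, assuming the OpenCV C++ API and hypothetical file names for the scene and for the reference (template) image, applies cv::matchTemplate with the normalized correlation coefficient (Equation 2.6) and locates the best match with cv::minMaxLoc:

#include <opencv2/opencv.hpp>

int main()
{
    // Hypothetical file names: the image under evaluation and the reference template.
    cv::Mat image = cv::imread("scene.png");
    cv::Mat templ = cv::imread("stone_template.png");

    // Result matrix of size (W - w + 1) x (H - h + 1): one similarity value per position.
    cv::Mat result;
    cv::matchTemplate(image, templ, result, cv::TM_CCOEFF_NORMED);   // Equation 2.6

    double minVal, maxVal;
    cv::Point minLoc, maxLoc;
    cv::minMaxLoc(result, &minVal, &maxVal, &minLoc, &maxLoc);

    // For the normalized correlation coefficient the best match is the global maximum;
    // for TM_SQDIFF / TM_SQDIFF_NORMED the global minimum would be used instead.
    cv::rectangle(image, maxLoc,
                  cv::Point(maxLoc.x + templ.cols, maxLoc.y + templ.rows),
                  cv::Scalar(0, 0, 255), 2);
    cv::imwrite("match.png", image);
    return 0;
}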

2.3 Hough transform

The Hough transform is a feature extraction technique used in image processing. With
this method it is possible to find lines, circles, or other simple forms in an image. This

Figure 2.5: Result matrix calculation for method 1 - Equation 2.1. I denotes the image under evaluation, T the reference image and R the result of the correlation between both matrices. The best match position corresponds to a global minimum.

Figure 2.6: Example of applying the Template matching algorithm. The image under evaluation is at the top right and below it is the illustration of the result matrix with the correlation values for method 6 - Equation 2.6 (the maximum corresponds to the location where the best match was found - marked with a red circle).

transform, developed by Hough in 1959, was used in physics experiments [5], and its use in computer vision was introduced by Duda and Hart in 1972 [6].
The Hough transform algorithm, initially developed to detect lines in images, can also be extended to detect other simple image structures. The line transform is a relatively fast way of searching a binary image for straight lines. In contrast, the detection of circular shapes requires a large computation time and memory for storage, increasing the complexity of extracting information from an image. The Hough transform requires a pre-processing filtering step to detect the edges in an image.

2.3.1 Linear Hough transforms

In the Cartesian coordinate system, lines can be represented using the following equation:

y = mx + b,

where (x, y) is a point through which the line passes, m is the slope of the line and b is the intercept value with the y axis. In the Hough transform, a different parametrization is used. Instead of the slope-intercept form of lines, the algorithm uses the normal form. In the polar coordinate system, a line is defined using two parameters: an angle θ and a distance ρ. ρ is the length of the normal from the origin (0, 0) onto the line and θ is the angle that this normal makes with the x axis (see Figure 2.7).

Figure 2.7: Hough space representation - ρ refers the length of the normal from the origin
(0, 0) onto the line and θ is the angle this normal makes with the x axis.

In this representation, the equation of the line is:

ρ = x cos θ + y sin θ.

For each pixel at (x, y) and its neighborhood, the Hough transform algorithm determines if there is enough evidence of a straight line at that pixel. For each point in the original space, it considers all the lines which go through that point at a particular discrete set of angles, chosen a priori. For each angle θ, the distance to the line through the point at that angle is calculated. Making a corresponding discretization of the Hough space results in a set of boxes in the Hough space. These boxes are called the Hough accumulators. For each line, a count is incremented in the Hough accumulator at point (θ, ρ). After considering all the lines through all the points, a Hough accumulator with a high value will probably correspond to a line of points. In Figure 2.8 it is possible to see an example of how the Hough transform works.
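A minimal sketch of the linear Hough transform using the OpenCV implementation is given below (the input file name and the accumulator threshold of 120 are illustrative assumptions). The Canny edge detector provides the required binary edge map and each detected line is returned in the (ρ, θ) parametrization discussed above:

#include <opencv2/opencv.hpp>
#include <cmath>
#include <vector>

int main()
{
    cv::Mat gray = cv::imread("example.png", cv::IMREAD_GRAYSCALE);

    cv::Mat edges;
    cv::Canny(gray, edges, 50, 150);            // the Hough transform works on a binary edge map

    std::vector<cv::Vec2f> lines;               // each line is returned as (rho, theta)
    cv::HoughLines(edges, lines, 1, CV_PI / 180, 120 /* accumulator threshold */);

    cv::Mat drawing;
    cv::cvtColor(gray, drawing, cv::COLOR_GRAY2BGR);
    for (size_t i = 0; i < lines.size(); ++i) {
        float rho = lines[i][0], theta = lines[i][1];
        // Convert (rho, theta) back to two distant points on the line for drawing.
        double a = std::cos(theta), b = std::sin(theta);
        cv::Point p1(cvRound(rho * a - 1000 * b), cvRound(rho * b + 1000 * a));
        cv::Point p2(cvRound(rho * a + 1000 * b), cvRound(rho * b - 1000 * a));
        cv::line(drawing, p1, p2, cv::Scalar(0, 0, 255), 1);
    }
    cv::imwrite("lines.png", drawing);
    return 0;
}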

Figure 2.8: Hough transform process for three points and with six angle groupings. In the
top, solid lines of different angles are plotted, all going through the first, second and third
point (left to right). For each solid line, the perpendicular which also bisects the origin is
found, these are shown as dashed lines. The length and angle of the dashed line are then
found. The values are shown in the table below the diagram. This is repeated for each of the
three points being transformed. The results are then plotted as a graph (bottom). The point
where the lines cross over one another indicates the angle and distance of the line formed by
the three points that were the input to the transform. The lines intersect at the pink point;
this corresponds to the solid pink line in the diagrams above, which passes through all three
points [7].

2.3.2 Circular Hough transforms

Circular shapes can be detected based on the equation of a circle:

$$r^2 = (x - a)^2 + (y - b)^2,$$

where (a, b) represents the coordinates of the center and r is the radius of the circle. The parametric representation of the circle is given by:

$$x = a + r \cos \theta, \qquad y = b + r \sin \theta.$$

For each edge point, a circle is drawn with that point as origin and radius r (if the radius is not known, then the locus of points in parameter space will fall on the surface of a cone, varying with the radius). In the circular case, the accumulator is a three-dimensional array, with two dimensions representing the coordinates of the center of the circle and the last one specifying the radius. The values in the accumulator are increased every time a circle is drawn with the desired radius over every edge point. The accumulator, which keeps count of how many circles pass through the coordinates of each edge point, is then used in a voting process to find the highest count. The coordinates of the centers of the circles in the image are the coordinates with the highest counts. In Figure 2.9 it is possible to see an example of the application of the Hough transform to circular detection.

Figure 2.9: Hough transform - example of circle detection. In an original image of a dark
circle (radius r), for each pixel a potential circle-center locus is defined by a circle with radius
r and center at that pixel. The highest-frequency pixel represents the center of the circle
(marked with red color) [8].
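The circular detection described above is available in OpenCV through the HoughCircles() function, which is studied in more detail in Chapter 5. The sketch below is a minimal usage example with a hypothetical input image; the parameter values are illustrative only, and in older OpenCV 2.4 releases the method constant may be spelled CV_HOUGH_GRADIENT instead of cv::HOUGH_GRADIENT:

#include <opencv2/opencv.hpp>
#include <vector>

int main()
{
    cv::Mat gray = cv::imread("ball.png", cv::IMREAD_GRAYSCALE);   // hypothetical input
    cv::GaussianBlur(gray, gray, cv::Size(9, 9), 2);               // smoothing reduces false detections

    std::vector<cv::Vec3f> circles;    // each detection: (center x, center y, radius)
    cv::HoughCircles(gray, circles, cv::HOUGH_GRADIENT,
                     1,              // dp: accumulator resolution (same as the image)
                     gray.rows / 8,  // minimum distance between detected centers
                     100,            // upper threshold of the internal Canny edge detector
                     30,             // accumulator threshold for center detection
                     0, 0);          // minimum and maximum radius (0 = not restricted)

    cv::Mat drawing;
    cv::cvtColor(gray, drawing, cv::COLOR_GRAY2BGR);
    for (size_t i = 0; i < circles.size(); ++i)
        cv::circle(drawing, cv::Point(cvRound(circles[i][0]), cvRound(circles[i][1])),
                   cvRound(circles[i][2]), cv::Scalar(0, 0, 255), 2);
    cv::imwrite("circles.png", drawing);
    return 0;
}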

2.4 Visual descriptors

The concept of feature detection refers to methods that aim at computing abstractions of image information and making local decisions at every image point.
This approach of performing matching using features usually involves 3 steps:

• Detection: Points of interest are identified in the image at distinctive locations.

• Description: The neighborhood of every point of interest is represented by a distinctive feature vector, which is later used for comparison.

• Matching: Feature vectors are extracted from two different images. Descriptor vectors between the images are matched on the basis of a nearest neighbor criterion.

In Chapter 3, several algorithms regarding the detection of points of interest and the extraction of feature vectors will be presented in detail.

2.5 Descriptors matching

Image matching is a fundamental aspect of many problems in computer vision, including object or scene recognition. With the several methods to extract keypoints, features are found, constructing a set of feature vectors for an image. The matching process consists of finding matches (pairs of keypoints) between different views of an object or scene. In Figure 2.10 it is possible to see the main principle of the matching approach.
There are different methods to calculate the distance between two vectors of the same length: the Euclidean, Manhattan and Hamming methods are explained below. The distance between two n-dimensional vectors a and b can be expressed by:

• Euclidean:
$$d = \sqrt{\sum_{j=1}^{n} (a_j - b_j)^2};$$

• Manhattan:
$$d = \sum_{j=1}^{n} |a_j - b_j|;$$

Figure 2.10: Features matching process: on the left, the object to be recognized; on the right, the scene where the object will be found. In the middle, the features of the two pictures are represented, with the respective matches indicated by black arrows.

• Hamming: the distance d between two vectors A and B ∈ F(n) is the number of coefficients in which they differ, where F is a finite field with q elements. It is usually used for binary descriptors.
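The three distances can be computed directly with cv::norm, as in the small numeric sketch below (the toy descriptor values are arbitrary examples):

#include <opencv2/opencv.hpp>
#include <iostream>

int main()
{
    // Two toy 4-dimensional descriptors stored as single-row matrices.
    cv::Mat a = (cv::Mat_<float>(1, 4) << 1, 2, 3, 4);
    cv::Mat b = (cv::Mat_<float>(1, 4) << 2, 2, 1, 4);

    double euclidean = cv::norm(a, b, cv::NORM_L2);     // sqrt(1 + 0 + 4 + 0) ~= 2.24
    double manhattan = cv::norm(a, b, cv::NORM_L1);     // 1 + 0 + 2 + 0 = 3

    // The Hamming distance is defined for binary descriptors (8-bit data).
    cv::Mat p = (cv::Mat_<uchar>(1, 2) << 0xF0, 0x0F);
    cv::Mat q = (cv::Mat_<uchar>(1, 2) << 0xF0, 0xFF);
    double hamming = cv::norm(p, q, cv::NORM_HAMMING);  // 4 differing bits

    std::cout << euclidean << " " << manhattan << " " << hamming << std::endl;
    return 0;
}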

The feature matching problem is usually divided into two separate components. The first
is to select a matching strategy, which determines which correspondences are passed on to
the next stage for further processing. The second is to devise efficient data structures and
algorithms to perform this matching as quickly as possible.

2.5.1 Matching strategy

Three different methods are considered and illustrated in Figure 2.11.

• Fixed Threshold: This is the simplest matching strategy. Two regions are matched if the distance between their descriptors is below a threshold;

• Nearest Neighbor (NN): Match the nearest neighbor in feature space;

• Nearest Neighbor Distance Ratio (NNDR): Compare the nearest neighbor distance to that of the second nearest neighbor.

Figure 2.11: Fixed threshold, nearest neighbor, and nearest neighbor distance ratio matching.
At a fixed distance threshold (dashed circles), descriptor F1 fails to match F2 , and F3
incorrectly matches F4 . If we pick the nearest neighbor, F1 correctly matches F2 , but F3
incorrectly matches F4 . Using nearest neighbor distance ratio (NNDR) matching, the small
NNDR d1 /d2 correctly matches F1 with F2 , and the large NNDR d′1 /d′2 correctly rejects
matches for F3 .

It is difficult to set a threshold value: a threshold that is too high results in too many false positives, i.e., incorrect matches being returned, while a threshold that is too low results in too many false negatives, i.e., too many correct matches being missed.
A descriptor can have several matches and several of them may be correct. In the case of
nearest neighbor-based matching, two regions A and B are matched if the descriptor DB is
the nearest neighbor to DA and if the distance between them is below a threshold. With this
approach, a descriptor has only one match.
The third matching strategy is similar to nearest neighbor matching, except that the threshold is applied to the distance ratio between the first and the second nearest neighbors. The regions are matched if:

$$NNDR = \frac{d_1}{d_2} = \frac{\|D_A - D_B\|}{\|D_A - D_C\|} < th,$$

where D_B is the first and D_C is the second nearest neighbor to D_A.
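A possible implementation of the NNDR strategy with the OpenCV brute-force matcher is sketched below; the 0.8 ratio threshold is an assumption taken as a typical value rather than a fixed rule, and the descriptor matrices are assumed to come from one of the extractors presented in Chapter 3:

#include <opencv2/opencv.hpp>
#include <vector>

// Keep only the matches that pass the nearest neighbor distance ratio (NNDR) test.
std::vector<cv::DMatch> ratioTestMatch(const cv::Mat& descriptorsObject,
                                       const cv::Mat& descriptorsScene,
                                       float ratioThreshold = 0.8f)
{
    cv::BFMatcher matcher(cv::NORM_L2);                  // exhaustive matcher for float descriptors
    std::vector<std::vector<cv::DMatch> > knnMatches;
    matcher.knnMatch(descriptorsObject, descriptorsScene, knnMatches, 2);  // 2 nearest neighbors

    std::vector<cv::DMatch> goodMatches;
    for (size_t i = 0; i < knnMatches.size(); ++i) {
        if (knnMatches[i].size() < 2)
            continue;
        // Accept only when d1/d2 is below the threshold (NNDR criterion).
        if (knnMatches[i][0].distance < ratioThreshold * knnMatches[i][1].distance)
            goodMatches.push_back(knnMatches[i][0]);
    }
    return goodMatches;
}

For binary descriptors such as BRIEF, the matcher would be constructed with cv::NORM_HAMMING instead of cv::NORM_L2.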

2.5.2 Efficient matching

Once we have decided on a matching strategy, we still need to efficiently search for potential candidates. It is necessary to compare all features against all other features in each pair of potentially matching images. As linear search is too costly for many applications, this has generated an interest in algorithms that perform approximate nearest neighbor search, in which non-optimal neighbors are sometimes returned. Such approximate algorithms can be orders of magnitude faster than exact search, while still providing near-optimal accuracy.
Several approaches have been studied to achieve efficient matching. Devising an indexing structure is a good approach to rapidly search for features near a given feature. Multidimensional search trees or hash tables are examples of indexing structure implementations. In this thesis, only a brief explanation about the randomized kd-tree and hierarchical k-means tree algorithms is presented. These algorithms are integrated in the FLANN library, explained below.

Fast Library for Approximate Nearest Neighbor - FLANN

FLANN is a library for performing fast approximate nearest neighbor searches in high
dimensional spaces. It contains a collection of algorithms found to work best for nearest
neighbor search and a system for automatically choosing the best algorithm and optimum
parameters depending on the dataset.

Marius Muja and David G. Lowe, in their experiments [9], obtained the best performance with two algorithms: the hierarchical k-means tree and multiple randomized kd-trees. A simple description of these algorithms is presented below.

The classical kd-tree algorithm [10] can be defined by a binary tree in which each node represents a subfile of the records in the file and a partitioning of that subfile. The root of the tree represents the entire file. Each non-terminal node has two successor nodes. These successor nodes represent the two sub-files defined by the partitioning. The terminal nodes represent mutually exclusive small subsets of the data records, which collectively form a partition of the record space. To find the nearest neighbor of a query point, a top-down searching procedure is performed from the root to the leaf nodes.

The nearest neighbor search in the high-dimensional case may require visiting a very large number of nodes, and the process may even cost linear time. The kd-tree splits the data in half at each level, and the randomized trees are built by choosing the split dimension randomly from the first dimensions in which the data has the greatest variance. When searching the trees, a single priority queue is maintained across all the randomized trees, so that the search can be ordered by increasing distance to each bin boundary. The degree of approximation is determined by examining a fixed number of leaf nodes, at which point the search is terminated and the best candidates returned.
The hierarchical k-means tree algorithm divides the dataset recursively into clusters (see
Figure 2.12). The k-means algorithm is used by setting k to two in order to divide the dataset
into two subsets. Then, the two subsets are divided again into two subsets by setting k to
two. The recursion terminates when the dataset is divided into single data points or a stop
criterion is reached.

Figure 2.12: Hierarchical k-means tree [11].
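A minimal sketch of how FLANN can be used through the OpenCV wrapper is shown below. The descriptor matrices are assumed to be float descriptors (e.g. SIFT or SURF) already computed elsewhere; by default the matcher builds a forest of randomized kd-trees, and a hierarchical k-means index could be requested instead through cv::flann::KMeansIndexParams.

#include <opencv2/opencv.hpp>
#include <vector>

// descriptorsObject / descriptorsScene: float descriptor matrices, one descriptor per row.
std::vector<cv::DMatch> flannMatch(const cv::Mat& descriptorsObject,
                                   const cv::Mat& descriptorsScene)
{
    // Default index: randomized kd-trees built over the scene descriptors.
    cv::FlannBasedMatcher matcher;
    std::vector<cv::DMatch> matches;
    matcher.match(descriptorsObject, descriptorsScene, matches);
    return matches;
}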

2.6 Location of object: the RANSAC algorithm

A classical approach to the object recognition problem is to extract local features from images (such as SIFT or SURF) and match them using some decision criterion. A last step of this procedure usually consists in detecting groups of local correspondences under a coherent geometric transformation. To estimate the homography and the correspondences which are consistent with this estimate, the RANdom SAmple Consensus (RANSAC) algorithm can be used.
The algorithm was first published by Fischler and Bolles in 1981 [12]. RANSAC is an iterative method to estimate the parameters of a mathematical model from a set of observed data. It is a non-deterministic algorithm: it produces a reasonable result only with a certain probability (this probability increases as more iterations are allowed).
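In practice, this estimation step can be performed with cv::findHomography using the RANSAC flag, as in the hedged sketch below (at least four correspondences are required; the 3 pixel reprojection threshold is an illustrative value, and the keypoints and matches are assumed to come from a previous matching step such as the NNDR filtering shown earlier):

#include <opencv2/opencv.hpp>
#include <vector>

cv::Mat estimateObjectHomography(const std::vector<cv::KeyPoint>& keypointsObject,
                                 const std::vector<cv::KeyPoint>& keypointsScene,
                                 const std::vector<cv::DMatch>& matches,
                                 std::vector<uchar>& inlierMask)
{
    std::vector<cv::Point2f> objPts, scenePts;
    for (size_t i = 0; i < matches.size(); ++i) {
        objPts.push_back(keypointsObject[matches[i].queryIdx].pt);
        scenePts.push_back(keypointsScene[matches[i].trainIdx].pt);
    }
    // RANSAC discards correspondences that are not consistent with a single planar
    // homography; inlierMask marks the matches that survive (at least 4 are needed).
    return cv::findHomography(objPts, scenePts, cv::RANSAC, 3.0, inlierMask);
}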
A detailed study of the RANSAC algorithm is not the objective of this thesis. More details about this algorithm are described in the original paper [12].

2.7 Final remarks

The Template Matching method tries to find small parts of an image which match a model image. The correlation process can be computed by different methods: squared difference, cross correlation, correlation coefficient and normalized methods. The normalization process divides each inner product by the square root of the energies of the reference image and of the image under evaluation. These methods are used because they can help reduce the effects of lighting differences between the images being compared. Regarding accuracy, the normalized methods produce more accurate matches. Regarding the processing time, simple methods (squared difference) are faster, while normalized methods are slower [13] [14].
The performance of the Hough transform is highly dependent on the results of the edge detector. This factor requires that the input image must be carefully chosen to maximize the edge detection. The object size in the picture and the distance between the objects cause variations in the results.

Chapter 3

Features-based descriptors

In this chapter, the concept of feature detection will be studied in more detail, taking into
consideration four well known algorithms: SIFT, SURF, FAST and BRIEF.

3.1 Scale Invariant Feature Transform descriptor - SIFT

Scale-invariant feature transform (or SIFT) is a popular image matching algorithm in computer vision used to detect and describe local features in an image. The algorithm was published by David Lowe [15].
The SIFT descriptor is invariant to translations, rotations and scaling transformations in the image domain. It is robust to moderate perspective transformations and illumination variations.
The SIFT method operates on a stack of gray-scale images with increasing blur, obtained by the convolution of the initial image with a variable-scale Gaussian. A differential operator is applied in the scale-space, and candidate keypoints are obtained by extracting the extrema of this differential. The position and scale of the detected points are refined, and possibly unstable detections are discarded. The SIFT descriptor is built based on the local image gradient magnitude and orientation at each sample point in a region around the keypoint location. The descriptor encodes the spatial gradient distribution over a keypoint neighborhood using a 128-dimensional vector.
An illustrative diagram of the SIFT algorithm is represented in Figure 3.1. The details of the algorithm are presented as follows.

Figure 3.1: Description scheme of the SIFT algorithm.

3.1.1 Scale space construction

A scale-space is constructed from the input image by repeated smoothing and subsampling. This process involves the convolution between a Gaussian kernel, G(x, y, σ), and an input image, I(x, y):

$$L(x, y, \sigma) = G(x, y, \sigma) * I(x, y),$$

where ∗ denotes the convolution operator and

$$G(x, y, \sigma) = \frac{1}{2\pi\sigma^2}\, e^{-(x^2 + y^2)/(2\sigma^2)},$$

σ denotes the standard deviation and σ² the variance of the Gaussian kernel.
The scale-space of an image is constructed from the initial image, repeatedly convolved with Gaussians with increasing levels of blur, to produce images separated by a constant factor k. A stack of images is constructed, each image with a different level of blur and a different sampling rate. This set is split into subsets where images share a common sampling rate. These subsets are called octaves. In the scale-space, the first octave is constructed by increasing the sampling rate by a factor of two. In the next octaves, the sampling rate is iteratively decreased by a factor of two and σ doubles.
The standard number of scales per octave is nspo = 3, but in practice the scale-space will also include three additional images per octave, necessary to apply the differential operator in the next step.
Figure 3.2 depicts the digital scale-space in terms of the sampling rates and levels of blur.

Figure 3.2: Each image is characterized by its level of blur and its sampling rate. Images are organized per octave. In the same octave, images share a common sampling rate, with increasing levels of blur separated by a constant factor k. For each image in an octave, the level of blur doubles compared with the image at the same position in the previous octave, σ_{o,1} = 2 × σ_{o−1,1} with o = 1, · · · , 4. L_in refers to the initial image, with level of blur σ_in = 0.5 and sampling rate S = 1 (by default). L_{o,s} denotes the localization of each image in the scale-space, with o the octave number and s the scale number within the octave.

In Figure 3.3 it is possible to see the scale-space construction applied to an image.

Difference of Gaussians

A Difference-of-Gaussians (DoG) is computed from the differences between adjacent levels (separated by a constant multiplicative factor k) in the Gaussian scale-space:

$$DoG(x, y, \sigma) = (G(x, y, k\sigma) - G(x, y, \sigma)) * I(x, y) = L(x, y, k\sigma) - L(x, y, \sigma).$$

The three additional images per octave are used in this step. The auxiliary images are necessary because the extreme levels of the DoG scale-space need another image for comparison (see Figures 3.4 and 3.5).
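The following simplified sketch illustrates the construction of one octave of the Gaussian and DoG stacks with OpenCV. It is only an illustration of the idea: each level is blurred directly from the input instead of incrementally, as a full SIFT implementation would do, and the file name and the base σ = 1.6 are assumptions.

#include <opencv2/opencv.hpp>
#include <cmath>
#include <vector>

int main()
{
    cv::Mat gray = cv::imread("input.png", cv::IMREAD_GRAYSCALE);   // hypothetical file
    gray.convertTo(gray, CV_32F, 1.0 / 255.0);

    const int nspo = 3;                              // scales per octave
    const double k = std::pow(2.0, 1.0 / nspo);      // constant factor between levels
    const double sigma = 1.6;                        // assumed base level of blur

    std::vector<cv::Mat> gaussians, dog;
    for (int s = 0; s < nspo + 3; ++s) {             // three auxiliary images per octave
        cv::Mat L;
        cv::GaussianBlur(gray, L, cv::Size(0, 0), sigma * std::pow(k, s));
        gaussians.push_back(L);
    }
    for (size_t s = 1; s < gaussians.size(); ++s) {
        cv::Mat d = gaussians[s] - gaussians[s - 1]; // DoG: difference of adjacent levels
        dog.push_back(d);
    }
    // The next octave would start from the image blurred with 2*sigma,
    // subsampled by a factor of two (e.g. with cv::resize or cv::pyrDown).
    return 0;
}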

Figure 3.3: In the top right, the color image represents the input image. The first octave is represented at the top and the successive octaves are presented below. The scale is reduced and the level of blur increases. These images were generated by a demo algorithm, available online, after adaptation, described in a paper [16].

Figure 3.4: Example of the DoG computation in the second octave. D_{o,s} denotes the localization of each DoG image in the scale-space, computed by D_{1,s} = L_{1,s+1} − L_{1,s} with s = 0, · · · , 4. Pairs of consecutive images are subtracted, and the difference between the images at scales kσ and σ is attributed a level of blur σ (the procedure is not centered). Auxiliary images are represented in red color.

Figure 3.5: Example of the DoG computation in the second octave. Auxiliary images are used in this step. These images were generated by the same demo algorithm used in Figure 3.3.

3.1.2 Keypoints detection

Detecting keypoints is not a trivial computation. After constructing the DoG scale-space, it is necessary to analyze all pixels to determine the extrema in each DoG image. Maximum and minimum values cannot be used directly: it is necessary to refine their location (via a quadratic interpolation). Finally, unstable candidate points are discarded. This process eliminates low contrasted extrema and candidate keypoints on edges. Details about these steps are presented below.

Extraction of candidate keypoints (Locate DoG extrema)

Interest points are obtained from the points at which the DoG values assume extrema with respect to both the spatial coordinates in the image domain and the neighboring pixels in the adjacent DoG scales. This step is illustrated in Figure 3.6.
The extrema detection is a rudimentary way to detect candidate keypoints of interest. This technique produces unstable and noise-sensitive detections.

Candidate keypoints refinement

The SIFT method uses a local interpolation to refine the position and scale of each sample point. Given a point in the DoG scale-space, D_{s,m,n}, located at the sample point (m, n), octave o and scale s within the octave, the quadratic function D_{s,m,n}(x) is given by:

Figure 3.6: Maxima and minima are computed in each DoG image by comparing a pixel (in red color) with its 26 neighbors: 8 pixels in the same DoG image and 9 × 2 in the adjacent scales (marked in gray color).

$$D_{s,m,n}(\mathbf{x}) = D_{s,m,n} + \mathbf{x}^T G_{s,m,n} + \frac{1}{2}\, \mathbf{x}^T H_{s,m,n}\, \mathbf{x},$$

where x = (x_1, x_2, x_3) is the offset from this point, and G_{s,m,n} and H_{s,m,n} are the gradient and the Hessian matrix, respectively:

$$G_{s,m,n} = \begin{pmatrix} (D_{s+1,m,n} - D_{s-1,m,n})/2 \\ (D_{s,m+1,n} - D_{s,m-1,n})/2 \\ (D_{s,m,n+1} - D_{s,m,n-1})/2 \end{pmatrix}, \qquad H_{s,m,n} = \begin{pmatrix} h_{11} & h_{12} & h_{13} \\ h_{21} & h_{22} & h_{23} \\ h_{31} & h_{32} & h_{33} \end{pmatrix},$$

with

$$h_{11} = D_{s+1,m,n} + D_{s-1,m,n} - 2D_{s,m,n}$$
$$h_{22} = D_{s,m+1,n} + D_{s,m-1,n} - 2D_{s,m,n}$$
$$h_{33} = D_{s,m,n+1} + D_{s,m,n-1} - 2D_{s,m,n}$$
$$h_{12} = (D_{s+1,m+1,n} - D_{s+1,m-1,n} - D_{s-1,m+1,n} + D_{s-1,m-1,n})/4$$
$$h_{13} = (D_{s+1,m,n+1} - D_{s+1,m,n-1} - D_{s-1,m,n+1} + D_{s-1,m,n-1})/4$$
$$h_{23} = (D_{s,m+1,n+1} - D_{s,m+1,n-1} - D_{s,m-1,n+1} + D_{s,m-1,n-1})/4$$

This quadratic function is an approximation of the second order Taylor expansion. The location of the extremum, x*, is determined by setting the derivative of this function with respect to x to zero, giving:

$$\mathbf{x}^* = -(H_{s,m,n})^{-1} G_{s,m,n}.$$

If each component of the offset x* is smaller than 0.5 in absolute value, the extremum is accepted and the corresponding keypoint coordinates are recalculated:

$$(\sigma, x, y) = \left( \frac{S_o}{S_{min}}\, \sigma_{min}\, 2^{(x_1^* + s)/n_{spo}},\; S_o (x_2^* + m),\; S_o (x_3^* + n) \right),$$

where S_o denotes the sampling rate of the octave, S_min = 0.5, and n_spo = 3 by default.
If this condition is not satisfied, the extremum falls closer to another sample point, and the interpolation is performed again, centered on that point. To get the interpolated estimate for the location of the extremum, the final offset x* is added to the location of its sample point, (s, m, n) + x*. This process is repeated until the interpolation is validated. If after five iterations the result is still not validated, the candidate keypoint is discarded.
This post-processing stage is particularly important to increase the accuracy of the scale estimates for the purpose of scale normalization.

Filter low contrast responses and edges

In order to discard low contrast extrema, the SIFT method uses a threshold value, applied to D_{s,m,n}(x*). In the standard case, with three scales per octave, the value of the threshold is Th_DoG = 0.03:

If |D_{s,m,n}(x*)| < Th_DoG, the candidate keypoint is discarded.

Some unstable candidate keypoints may still subsist after the last two processes. The DoG function has a strong response along edges, so the interpolation refinement and the threshold on the DoG value may not be sufficient.
Undesirable keypoints can be detected by computing a Hessian matrix (2 × 2) at the location and scale of the keypoint:

$$H_{s,m,n} = \begin{pmatrix} h_{11} & h_{12} \\ h_{21} & h_{22} \end{pmatrix},$$
where:

$$h_{11} = D_{s,m+1,n} + D_{s,m-1,n} - 2D_{s,m,n},$$
$$h_{22} = D_{s,m,n+1} + D_{s,m,n-1} - 2D_{s,m,n},$$
$$h_{12} = h_{21} = (D_{s,m+1,n+1} - D_{s,m+1,n-1} - D_{s,m-1,n+1} + D_{s,m-1,n-1})/4.$$

Edges present a large curvature orthogonal to the edge and a small one along the edge. By analyzing the eigenvalues of the Hessian matrix it is possible to detect whether a keypoint lies on an edge or not. A big ratio between the largest eigenvalue, λ_max, and the smallest one, λ_min, indicates the presence of an edge: r = λ_max / λ_min.
A second threshold value is considered. Keypoints are discarded if the ratio between the eigenvalues, r, is greater than a threshold Th_Edge (the standard value is Th_Edge = 10). The eigenvalues can be related to the determinant and trace of the Hessian matrix by:

$$tr(H_{s,m,n}) = h_{11} + h_{22} = \lambda_{max} + \lambda_{min},$$
$$det(H_{s,m,n}) = h_{11} h_{22} - h_{12}^2 = \lambda_{max} \lambda_{min}.$$

Then,

$$\frac{tr(H_{s,m,n})^2}{det(H_{s,m,n})} = \frac{(\lambda_{max} + \lambda_{min})^2}{\lambda_{max} \lambda_{min}} = \frac{(r + 1)^2}{r}.$$

This is known as the Harris-Stephens edge response [17].


Finally, to discard candidate keypoints on edges, the following test is applied:

If edgeness$(H_{s,m,n}) = \dfrac{tr(H_{s,m,n})^2}{det(H_{s,m,n})} > \dfrac{(Th_{Edge}+1)^2}{Th_{Edge}}$, then the candidate keypoint is discarded.
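The edge rejection test can be expressed with a few lines of code operating on the entries of the 2 × 2 Hessian, as in the sketch below (a standalone helper written for illustration, not taken from any particular SIFT implementation):

// Illustrative helper for the Harris-Stephens style edge rejection used by SIFT.
// h11, h22, h12 are the entries of the 2x2 Hessian computed at the keypoint.
// Returns true when the candidate keypoint should be discarded.
bool isOnEdge(double h11, double h22, double h12, double edgeThreshold = 10.0)
{
    const double trace = h11 + h22;                       // lambda_max + lambda_min
    const double det   = h11 * h22 - h12 * h12;           // lambda_max * lambda_min
    if (det <= 0.0)
        return true;                                      // curvatures of opposite signs
    const double edgeness = (trace * trace) / det;        // equals (r + 1)^2 / r
    const double limit = (edgeThreshold + 1.0) * (edgeThreshold + 1.0) / edgeThreshold;
    return edgeness > limit;
}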

3.1.3 Keypoint descriptors

At this point, legitimate keypoints have been found. They have been tested to be stable, and the scale at which each keypoint was detected is known (it is the same as the scale of the blurred image), so the algorithm achieves scale invariance. The next step is to assign an orientation to each keypoint in order to obtain rotation invariance.

Assign keypoints orientations

A Gaussian smoothed image, L_{o,s}, is selected according to the scale of the keypoint. The gradient magnitude, m(x, y), and orientation, θ(x, y), are computed around the keypoint considering:

$$m(x, y) = \sqrt{(L_{o,s}(x+1, y) - L_{o,s}(x-1, y))^2 + (L_{o,s}(x, y+1) - L_{o,s}(x, y-1))^2},$$

$$\theta(x, y) = \tan^{-1}\left( \frac{L_{o,s}(x, y+1) - L_{o,s}(x, y-1)}{L_{o,s}(x+1, y) - L_{o,s}(x-1, y)} \right).$$
An orientation histogram is then created. Each sample added to the histogram is weighted by its gradient magnitude and by a Gaussian-weighted circular window (with standard deviation equal to 1.5 × σ), reducing the contribution of distant pixels. In the histogram, the 360 degrees of orientation are subdivided into 36 bins (10 degrees each). The “amount” that is added to a bin is proportional to the magnitude of the gradient at that point. Once this is done for all pixels around the keypoint, the histogram will have a peak at some point. In addition to the largest mode, up to three other modes whose amplitude is within 80% of the largest one can also be considered. For this reason, a keypoint can generate several descriptors. Each new keypoint has the same location and scale as the original, but its orientation is equal to the other peak. In Figure 3.7 it is possible to see an example histogram.

Figure 3.7: Histogram of gradient magnitudes and orientations. In this example, a keypoint
will create two different orientations.

Keypoint descriptor

The goal of the SIFT algorithm is to generate a unique fingerprint for each keypoint. To do this, a 16 × 16 window around the keypoint is considered. This window is broken into sixteen 4 × 4 subwindows. For each 4 × 4 region, the gradient magnitudes and orientations are calculated and a histogram is constructed (see Figure 3.8).

Figure 3.8: Computation of the keypoint descriptor. The keypoint is marked in red color. The gradient magnitude and orientation are computed in a 16 × 16 window around the keypoint. These samples are accumulated into orientation histograms summarizing the contents of 4 × 4 subregions. The length of each arrow corresponds to the sum of all gradient magnitudes near that direction in the region. The amount added also depends on the distance from the keypoint (computed by the Gaussian weighting), so gradients that are far away from the keypoint add smaller values to the histogram [18].

The descriptor is a vector that contains the values of all orientation histogram entries.
Each histogram has 8 orientation bins, therefore the descriptor vector contains 8 × 16 = 128
elements.
In order to reduce the effects of illumination changes, the descriptor vector is normalized
to unit length. To reduce the effects of non-linear illumination, a threshold of 0.2 is applied
to the vector elements and the vector is normalized again.
Figure 3.9 shows all the keypoints detected by the SIFT algorithm.
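A minimal sketch of how the SIFT keypoints and their 128-element descriptors can be obtained with the OpenCv library is shown below (it assumes an OpenCV 2.4 build with the nonfree module; the file name is illustrative):

```cpp
#include <opencv2/core/core.hpp>
#include <opencv2/highgui/highgui.hpp>
#include <opencv2/nonfree/features2d.hpp>  // SIFT lives in the nonfree module in OpenCV 2.4
#include <iostream>
#include <vector>

int main() {
    cv::Mat image = cv::imread("object.png", CV_LOAD_IMAGE_GRAYSCALE);
    if (image.empty()) return -1;

    std::vector<cv::KeyPoint> keypoints;
    cv::Mat descriptors;                 // one 128-element row (CV_32F) per keypoint

    cv::SIFT sift;                       // default parameters
    sift(image, cv::Mat(), keypoints, descriptors);

    std::cout << keypoints.size() << " keypoints, descriptor size "
              << descriptors.cols << std::endl;   // expected: 128
    return 0;
}
```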

3.2 Speeded Up Robust Feature descriptor - SURF

Speeded Up Robust Feature (or SURF) is a fast and robust algorithm for local, similarity
invariant representation and comparison, first presented by Herbert Bay (2006) [19]. Similarly
to the SIFT approach, SURF is a detector and descriptor of local scale and rotation-invariant
image features.

Figure 3.9: Keypoints detected - SIFT algorithm by OpenCv library.

The SURF method uses integral images in the convolution process, which speeds up the
computation. The scale space is constructed with a box-filter approximation, by convolving
the initial image with box filters of several discrete sizes; for this reason, the term box-space
is used in this document. To select interest point candidates, the local maxima of the determinant
of the Hessian matrix are computed and a quadratic interpolation is used to refine the
location of candidate keypoints. The contrast sign of each interest point is stored along with
the keypoint descriptor. Finally, the dominant orientation of each keypoint is estimated and
the descriptor vector is computed.
The scheme of the SURF algorithm is illustrated in Figure 3.10 and the details of the algorithm
are presented as follows.

Figure 3.10: Description scheme of the SURF algorithm.

3.2.1 Pre-processing

To understand how the SURF algorithm works it is necessary to clarify some concepts. A
brief explanation of integral images and the box-Hessian approximation is presented below.

Integral images

An integral image is a data structure that allows the rapid summing of image subregions. The
integral image, denoted IP(x, y), at location (x, y) contains the sum of the pixel values above
and to the left of (x, y) in the input image I:

IP(x, y) = Σ_{i≤x} Σ_{j≤y} I(i, j).

Using integral images, it takes only three additions and four memory accesses to calculate
the sum of intensities inside a rectangular region of any size (see Figure 3.11).

(a) (b)

Figure 3.11: Integral image computation. (a) integral image representation; (b) area A
computation using integral images: A = L4 + L1 − (L2 + L3 ) [20].
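A minimal sketch of this constant-time rectangle sum with the OpenCv integral image is given below (the coordinates and file name are illustrative; cv::integral produces a (rows+1) × (cols+1) sum matrix, so the four look-ups follow the corner scheme of Figure 3.11):

```cpp
#include <opencv2/core/core.hpp>
#include <opencv2/imgproc/imgproc.hpp>
#include <opencv2/highgui/highgui.hpp>
#include <iostream>

int main() {
    cv::Mat image = cv::imread("frame.png", CV_LOAD_IMAGE_GRAYSCALE);
    if (image.empty()) return -1;

    cv::Mat sum;                          // integral image, size (rows+1) x (cols+1)
    cv::integral(image, sum, CV_32S);

    // Sum of the pixels inside the rectangle with top-left corner (x1, y1) (inclusive)
    // and bottom-right corner (x2, y2) (exclusive): A = L4 + L1 - (L2 + L3).
    int x1 = 10, y1 = 20, x2 = 60, y2 = 80;
    int area = sum.at<int>(y2, x2) + sum.at<int>(y1, x1)
             - sum.at<int>(y1, x2) - sum.at<int>(y2, x1);

    std::cout << "Sum of intensities in the rectangle: " << area << std::endl;
    return 0;
}
```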

Box-Hessian approximation

The SURF descriptor is based on the determinant of the Hessian matrix. The Hessian
matrix, H, is the matrix of second-order partial derivatives of the image I(x, y):

H(I(x, y)) = [ d²I/dx²   d²I/dxdy ; d²I/dxdy   d²I/dy² ].

The scale-normalized second-order Gaussian derivative is the filter chosen to be convolved with the
image I(x, y). With the Gaussian function it is possible to vary the amount of smoothing during
the scale-space construction and to build kernels for the Gaussian derivatives in the x, y and xy
directions. The Hessian matrix can now be calculated as a function of position (x, y) and scale σ:

H(x, y, σ) = [ Lxx(x, y, σ)   Lxy(x, y, σ) ; Lxy(x, y, σ)   Lyy(x, y, σ) ],

where Lxx is the convolution of the Gaussian second-order derivative with the image I at
point (x, y), and similarly for Lyy and Lxy.
The respective kernels can be represented by using box filters. The masks used are a very
crude approximation and are shown in Figure 3.12.

Figure 3.12: Laplacian of Gaussian approximation. Top: the Gaussian second order par-
tial derivative in x (Lxx ), y (Lyy ) and xy (Lxy ) directions, respectively; Bottom: box-filter
approximation in x (Dxx ), y (Dyy ) and xy (Dxy ) directions [21].

The black and white sections in each filter represent the weights applied. In the Dxx and
Dyy filters the black regions are weighted with a value of −2 and the white regions with 1. For
the Dxy filter, the black regions are weighted with −1 and the white regions with 1.
SURF algorithm uses the following approximation for the Hessian determinant:

det(Happrox ) = Dxx Dyy − (0.9Dxy )2 .

Box filters are kernels with constant elements in rectangular regions, and integral images
allow the sum of pixel values in a rectangular region to be computed very quickly. The use of box
filters therefore enables the use of integral images, which speeds up the convolution operation.

3.2.2 Box space construction

In the SIFT method, the scale-space is implemented by an iterative convolution between
the input image and a Gaussian kernel, with the image repeatedly sub-sampled: the image size
is reduced and the blur level increases. This process is not computationally efficient, since images
need to be resized in each layer and each layer relies on the previous one. To improve this step,
the SURF algorithm applies kernels of increasing size to the original image. All images in the
box-space can be created simultaneously, with the same processing time (the calculation time is
independent of the filter size). The original image remains unchanged and only the filter size varies.
The box-space is divided into octaves, similarly to SIFT. A new octave corresponds to a
doubling of the kernel size and a sub-sampling by a factor of two. Each octave is also divided into
several levels with increasing blur level.

3.2.3 Keypoints detection

A candidate keypoint is obtained by computing the local maxima of the determinant
of the box-Hessian matrix in the box-space, comparing each point with its 26 neighbors in the
box-space, just like in SIFT.
When a maximum is found, the area covered by the filter is rather large, which introduces
a significant error in the position of the interest point. In order to extract the exact location
of the interest point, a simple second-order interpolation is used.

3.2.4 Keypoints descriptor

The purpose of a descriptor is to provide a unique and robust description of a feature.
The SURF descriptor is based on Haar wavelet responses in the x and y directions and can be
calculated efficiently with integral images. The SURF descriptor uses only a 64-dimensional vector
to describe a feature.
In order to be invariant to rotation, it is necessary to fix a reproducible orientation based
on information from a circular region around the interest keypoint.

Assign keypoints orientations

The area considered to assign the keypoint orientation is a circular region
with radius 6s, where s denotes the scale at which the interest point was detected. At every
point in this neighborhood, responses to horizontal and vertical box filters are computed. The
Haar wavelet filters compute the responses in the x and y directions. The filters are illustrated in
Figure 3.13: the black segment has weight −1 and the white segment +1.

Figure 3.13: Wavelet filters to compute the responses in the x (left) and y (right) directions. The
Haar wavelets have a side length of 4s.

The gradient at this scale in this neighborhood is obtained by convolution with the
box filters illustrated above. The horizontal and vertical convolution results are called dx and dy,
respectively.
The interest area is weighted with a Gaussian (σ = 2s) centered at the interest point, to
give some robustness to deformations and translations.
The dominant orientation is then estimated. The Gaussian-weighted responses are
represented as points in a space with the horizontal response strength along the x axis and the vertical
response strength along the y axis. A sliding orientation window of size π/3 is considered
and the horizontal and vertical responses inside the window are summed, constructing a new
vector. This scheme is illustrated in Figure 3.14.
After computing the orientation of all such vectors, the longest vector over all windows defines the
orientation of the interest point. In the following step, the SURF descriptor is calculated.

Figure 3.14: Orientation assignment. A circular neighborhood around the interest keypoint is
considered. A sliding orientation window compute the dominant orientation of the Gaussian
weighted Haar wavelet response at every sample point in the region [22].

Keypoint descriptor

To extract the descriptor, a square region centered on the keypoint is constructed. This
square has a side of 20s and is oriented along the keypoint orientation. The region is split up
into 4 × 4 sub-regions. In each of these sub-regions, a set of four features is calculated from 5 × 5
regularly spaced grid points. These four features are the sums of the Haar wavelet responses in the
horizontal and vertical directions and of their absolute values (see Figure 3.15).

Figure 3.15: The keypoint is marked in red. The descriptor region is split up into 4 × 4 sub-
regions (left). In each 2 × 2 sub-division the sums dx, dy, |dx| and |dy| are calculated, computed
relative to the orientation of the grid (right) [19].

Each sub-region has a set of four features, resulting in a descriptor vector of
length 64 (16 sub-regions × 4 features).

In Figure 3.16 it is possible to see keypoints detected by SURF algorithm.

Figure 3.16: Keypoints detected - SURF algorithm by OpenCv library.
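Analogously to the SIFT example above, a minimal sketch for extracting SURF keypoints and their 64-element descriptors with OpenCv is shown below (again assuming an OpenCV 2.4 build with the nonfree module; the Hessian threshold value and file name are illustrative):

```cpp
#include <opencv2/core/core.hpp>
#include <opencv2/highgui/highgui.hpp>
#include <opencv2/nonfree/features2d.hpp>  // SURF lives in the nonfree module in OpenCV 2.4
#include <iostream>
#include <vector>

int main() {
    cv::Mat image = cv::imread("object.png", CV_LOAD_IMAGE_GRAYSCALE);
    if (image.empty()) return -1;

    std::vector<cv::KeyPoint> keypoints;
    cv::Mat descriptors;                 // one 64-element row (CV_32F) per keypoint

    cv::SURF surf(400.0);                // keep keypoints with Hessian response above 400
    surf(image, cv::Mat(), keypoints, descriptors);

    std::cout << keypoints.size() << " keypoints, descriptor size "
              << descriptors.cols << std::endl;   // expected: 64
    return 0;
}
```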

3.3 Features from Accelerated Segment Test - FAST

Features from Accelerated Segment Test (or FAST) is an algorithm originally proposed
by Rosten and Drummond [23] in 2006 for identifying corners in an image. The FAST algorithm
is an attempt to solve a common problem in robotics, related to processing time. When a robot
needs to extract information about the environment in real time in order to react to it, the
algorithms presented above (like SIFT and SURF) are effective, but too computationally
intensive for real-time applications of any complexity. Thus, such detection
schemes are not feasible in real-time machine vision tasks, despite the high-speed processing
capabilities of today's hardware. Unlike SIFT and SURF, the FAST algorithm only detects
corners/keypoints and does not produce descriptors. This detector can be combined with other
descriptor algorithms, being used only to detect the keypoints.
The FAST detector uses a circle of 16 pixels with radius 3 to classify whether a candidate
point ρ (with intensity Iρ) is actually a corner. Each pixel in the circle is labeled with an integer
from 1 to 16, clockwise (illustrated in Figure 3.17). If a set of n contiguous pixels in the

circle are all brighter than the intensity of the candidate pixel ρ plus a threshold value t, or all
darker than the intensity of the candidate pixel ρ minus the threshold value t, then ρ is classified as
a corner. n is usually chosen as 12 and all pixels in the circle are examined. Each pixel x can
take one of three states relative to ρ:

Darker: Ix ≤ Iρ − t,

Similar: Iρ − t < Ix < Iρ + t,

Brighter: Iρ + t ≤ Ix,

where Ix is the intensity of the pixel being analyzed.

Figure 3.17: FAST algorithm neighborhood: the pixel ρ is the center of a candidate corner.
A circle of 16 pixels with radius 3 is considered. The intensity of each pixel is evaluated.
The arc (dashed line) passes through 12 contiguous pixels which are brighter than ρ by more
than the threshold.

To make the algorithm fast, the intensity of four pixels (in Figure 3.17, the pixels numbered
1, 5, 9 and 13) is first compared with the intensity of the central pixel, Iρ. If at least three of
these four pixels are not above Iρ + t nor below Iρ − t, then ρ cannot be a corner. If
this test is passed, all 16 pixels are tested to check whether there are 12 contiguous pixels
that satisfy the criterion.
In this approach the detector has several weaknesses: the test does not generalize
well for n < 12, generating a very large number of keypoints; the position and choice of the

fast test pixels make certain implicit assumptions about the features; and many features are
detected close to each other.
A machine learning approach has been added to the algorithm to deal with these issues,
explained in the original paper [23].
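As an illustration, a minimal sketch of corner detection with the FAST implementation available in OpenCv is given below (the threshold value and file name are illustrative):

```cpp
#include <opencv2/core/core.hpp>
#include <opencv2/features2d/features2d.hpp>
#include <opencv2/highgui/highgui.hpp>
#include <iostream>
#include <vector>

int main() {
    cv::Mat image = cv::imread("frame.png", CV_LOAD_IMAGE_GRAYSCALE);
    if (image.empty()) return -1;

    std::vector<cv::KeyPoint> corners;
    int threshold = 30;              // intensity difference t
    bool nonmaxSuppression = true;   // suppress adjacent detections

    cv::FAST(image, corners, threshold, nonmaxSuppression);

    std::cout << corners.size() << " FAST corners detected" << std::endl;
    return 0;
}
```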

3.4 Binary Robust Independent Elementary Features - BRIEF

The BRIEF algorithm was presented in 2010 [24] and was the first binary descriptor
published, built on the basis of simple intensity difference tests. BRIEF takes only the information
at single pixel locations to build the descriptor, so to reduce its sensitivity to noise the
image is first smoothed by a Gaussian filter. The descriptor is built by picking pairs of pixels around
the keypoint, according to a random or nonrandom sampling pattern, and then comparing
the two intensities. If the intensity at the first location is smaller than the intensity at the second,
the test returns 1, and 0 otherwise. This is expressed by the following test τ for a patch p of size S × S:

τ(p; x, y) = 1 if p(x) < p(y), and 0 otherwise,

where p(x) is the smoothed intensity of the pixel at a sample point x = (u, v)ᵀ. A set of nd
pairs is defined, so as to generate an nd-dimensional bit string. The BRIEF descriptor is then
taken to be the nd-dimensional bit string:

Σ_{1≤i≤nd} 2^(i−1) τ(p; xi, yi).

Frequently nd = 128, 256 and 512 (good compromises between speed, space and accuracy).
The authors of BRIEF consider five methods to determine the vectors x and y:

(a) x and y are randomly uniformly sampled;

(b) x and y are randomly sampled using a Gaussian distribution, meaning that locations that
are closer to the center of the patch are preferred;

(c) x and y are randomly sampled using Gaussian distributions: first, xi is sampled with
a variance of 0.04S², and then each yi is sampled from a Gaussian distribution
with mean xi and variance 0.01S²;

(d) x and y are randomly sampled from discrete location of a coarse polar grid;

(e) For each i, xi is (0, 0) and yi takes all possible values on a coarse polar grid.

Figure 3.18, which illustrates examples of the five sampling strategies, helps to clarify
these definitions.

(a) (b) (c) (d) (e)

Figure 3.18: Choose the test locations [24].

The BRIEF descriptor is compact, easy to compute and highly discriminative. In the
matching process it uses the Hamming distance, which can be implemented very efficiently
with a bitwise XOR operation between the two descriptors followed by a bit count.
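A minimal sketch of this Hamming-distance computation between two binary descriptors stored as byte arrays is shown below (the descriptor length of 32 bytes, i.e. nd = 256, is an illustrative choice):

```cpp
#include <cstdint>
#include <cstddef>
#include <iostream>

// Hamming distance between two binary descriptors of `length` bytes:
// XOR the bytes and count the bits set to 1.
int hammingDistance(const uint8_t* a, const uint8_t* b, std::size_t length) {
    int distance = 0;
    for (std::size_t i = 0; i < length; ++i) {
        uint8_t diff = a[i] ^ b[i];   // bits that differ between the two descriptors
        while (diff) {                // count the set bits
            distance += diff & 1;
            diff >>= 1;
        }
    }
    return distance;
}

int main() {
    uint8_t d1[32] = {0};             // 256-bit descriptors (all zeros here)
    uint8_t d2[32] = {0};
    d2[0] = 0x0F;                     // 4 differing bits in the first byte
    std::cout << hammingDistance(d1, d2, 32) << std::endl;   // prints 4
    return 0;
}
```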

3.5 Final remarks

A SIFT keypoint is a selected image region with an associated descriptor. The descriptors
are stored in a vector that contains the information necessary to characterize a keypoint, providing
highly distinctive features that are useful in the matching process. The scale-space representation
provides SIFT with scale invariance. In order to achieve rotation invariance, each keypoint is
assigned a gradient magnitude and an orientation. The keypoints extracted by SIFT thus have a
scale and orientation associated with them, making this algorithm highly distinctive and robust.
The SURF algorithm has the same objective as the SIFT descriptor: keypoints are made
scale and rotation invariant in order to obtain distinctive features in an image. The SURF
descriptor is an improvement over SIFT with respect to processing speed. The combination of
integral images with the box-filter approximation of the Laplacian of Gaussian is an ingenious
construction to speed up the convolution operation.

More visual descriptors exist in the literature, such as BRISK, FREAK and ORB, among others,
but only SIFT and SURF are tested in this project.

Chapter 4

Decorative stones detection and recognition for a textile industry robot

This chapter describes a vision system that was developed for a textile industry robot,
based on the Template matching algorithm. The main objective of this vision system is to
detect the position of stones on a plate and classify them among the several available shapes. In
Figure 4.1 it is possible to see various shapes of stones and the robot prototype.

(a) (b)

Figure 4.1: (a) Various forms of decorative stones; (b) Robot prototype.

Decorative stones are placed on a plate, with a camera parallel to the plane where they
are located. After detecting and classifying a stone, a robotic arm picks it up in order to apply it
to the cloth.
Since the working plane is fixed with respect to the position of the camera, the computer vision
system does not have to deal with the scale of the objects. However, the objects can be
rotated with respect to the camera. With these assumptions, and the requirements of the robot
prototype, namely the use of a low cost camera and the capability of manipulating at least one
stone per second, it was decided to explore the use of Template matching algorithms.
Based on the Template matching algorithm, the detection system acquires images to be
evaluated and compares them to a previously created database with reference images of the stones.
The system returns the position where the best match was found (see Figure 4.2). The
details of the developed vision system are presented next.

Figure 4.2: Conceptual description of the vision system. “DB” refers to a database of images,
“FRAME” denotes the image under evaluation.

4.1 Study of the appropriate methodology

As a starting point, it was decided to explore the Template matching algorithms available
in the OpenCv library, which implements six different matching methods. To choose the
best method for this application, all the methods were tested on two types of stones,
"Big Star" and "Heart", since they have more complex shapes. In a pre-processing step, a
conversion to grayscale was applied to both images. Table 4.1 presents experimental
results regarding the detection rate for each type of stone. It is possible to conclude that the
processing time is slightly larger for the normalized methods; however, they provide more accurate
matches. In this application accuracy is an important requirement, so the method described
by Equation 2.6 was chosen. Several different approaches were developed and tested, as described
next, and a usage sketch of the Template matching call is given after Table 4.1.

Stone format                                Methods/equations
                                      2.1     2.4     2.2     2.5     2.3     2.6

Big Star   Stones detected/total      2/9     2/9     0/9     2/9     2/9     6/9
           Matching time (ms)         864     900     740     892     816     920

Heart      Stones detected/total      5/12    5/12    0/12    6/12    0/12    6/12
           Matching time (ms)         4338    4518    3744    4446    4050    4680

Table 4.1: Application of different Template matching methods to the "Big Star" and "Heart"
stones. Detection rate and processing time are presented, obtained using a laptop with an
AMD APU E450 1.65 GHz (dual core).
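The sketch below illustrates how a single template can be matched against a frame with OpenCv. Equation 2.6 is not reproduced here; the normalized correlation coefficient method, CV_TM_CCOEFF_NORMED, is assumed to be the chosen one, and the file names are illustrative:

```cpp
#include <opencv2/core/core.hpp>
#include <opencv2/imgproc/imgproc.hpp>
#include <opencv2/highgui/highgui.hpp>
#include <iostream>

int main() {
    cv::Mat frame = cv::imread("frame.png", CV_LOAD_IMAGE_GRAYSCALE);
    cv::Mat templ = cv::imread("big_star_0deg.png", CV_LOAD_IMAGE_GRAYSCALE);
    if (frame.empty() || templ.empty()) return -1;

    // Correlation map: one value per possible top-left position of the template.
    cv::Mat result;
    cv::matchTemplate(frame, templ, result, CV_TM_CCOEFF_NORMED);

    double minVal, maxVal;
    cv::Point minLoc, maxLoc;
    cv::minMaxLoc(result, &minVal, &maxVal, &minLoc, &maxLoc);

    // For CV_TM_CCOEFF_NORMED the best match is the maximum (close to 1 for a good match).
    std::cout << "Best match " << maxVal << " at " << maxLoc.x << "," << maxLoc.y << std::endl;
    return 0;
}
```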

4.2 Developed detection system

Videos of the stones were collected by a camera installed in the robot prototype, with
a resolution of 1024 × 768 pixels. Figure 4.3 illustrates one of the frames captured with
the camera for a specific stone, called "Big Star".
For each specific stone, a database was created with images of the stone rotated in steps of 20
degrees. This is necessary because the Template matching algorithm is not invariant to
rotation. Figure 4.4 shows the database used for the stone "Big Star". The number of images
in each database depends on the shape of the stone. Stones with more symmetry have their
shape repeated after a small number of rotations, while stones with less symmetry need a greater
number of rotations until they return to the initial position.
The shape of each stone is the main factor to take into account. For this reason, it
is necessary to simplify and reduce the complexity of the images before applying the Template
Figure 4.3: Original frame captured by the camera installed in robot - “Big Star” stone.

Figure 4.4: Database for stone “Big Star”.

matching algorithm. For this reason, both images (database image and image under evaluation)
pass through a pre-processing algorithm.
For the pre-processing, three different approaches were considered: the conversion
of the image to grayscale, edge detection, and finally an optimization of the latter in order
to obtain well-defined contours.

4.2.1 Grayscale

The Template matching algorithm can be used with color images; however, in this case study,
color is not an important parameter. For this reason, color images are not used
directly for Template matching.
The first approach to solve the problem consists in calculating the correlation on the gray
values of the images. Each video frame passes through a grayscale transformation.
The diagram that illustrates this first approach is presented in Figure 4.5, and Figure 4.6 shows
an image after the transformation.

4.2.2 Edges

In order to reduce the undesirable effects of light on the detection and classification
system, a second approach considering the application of a filter for edge detection before the

Figure 4.5: Diagram regarding the pre-processing algorithm “Grayscale”.

Figure 4.6: Application of “Grayscale” algorithm - “Big Star”.

application of Template matching was implemented and explored.


Edge detection is a process which aims at identifying points in a digital image that have
intensity discontinuities. In this process, it is important to find the approximate absolute gradient
magnitude at each point of a grayscale image. The points at which the image values change
sharply are typically organized into a set of curved line segments denoted edges.
Various algorithms have been proposed over the years to determine the edges in an image,
such as Sobel, Roberts Cross, Prewitt, Laplacian, and Canny [25], to name a few.
The Sobel operator computes an approximation of the gradient of the image intensity
function. For this, the image is convolved with a vertical and a horizontal kernel. Figure 4.7
shows how the algorithm produces separate measurements of the gradient component in each
orientation.
Combining both results, it is possible to find the absolute magnitude of the gradient
at each point by:

G = √(Gx² + Gy²).

It is also possible to calculate the gradient direction by:

Θ = atan2(Gy, Gx).

Figure 4.7: Horizontal Gx and vertical Gy kernels of the Sobel operator. I refers to the image
being operated on and ∗ denotes the 2-dimensional convolution operation.

The Roberts Cross and Prewitt operators are very similar to the Sobel operator. Figure 4.8
shows the vertical and horizontal kernels used in the convolution process for both edge detection
techniques.

Figure 4.8: Horizontal Gx and vertical Gy kernels of the Roberts Cross operator (top) and
the Prewitt operator (bottom). I refers to the image being operated on and ∗ denotes the 2-dimensional
convolution operation.

The Laplacian of an image highlights regions of rapid intensity change and is, therefore,
often used for edge detection. The Laplacian algorithm uses the second derivative to detect
edges, and is characterized by producing extremely thin edges. The Laplacian L(x, y) of an image with
pixel intensity values I(x, y) is given by:

L(x, y) = d²I/dx² + d²I/dy².

Unlike the other two edge detectors presented, the Laplacian edge detector only uses one
kernel. Two commonly used kernels are shown in Figure 4.9.
The Canny algorithm uses a multi-step process to detect the possibility of an edge. It

Figure 4.9: Example of kernels used by the Laplacian operator. I refers to the image being
operated on and ∗ denotes the 2-dimensional convolution operation.

is a first-derivative Gaussian operator that smooths the noise and finds edges. Initially, the
image is smoothed with a two-dimensional Gaussian function, subsequently separated in
the x and y directions. Thus, it is possible to calculate the gradient of the smoothed surface in the
convolved image (through the first derivative). After obtaining a measure of the gradient intensity at
each point in the image, the points of minimum intensity are discarded. A threshold then
defines the existence of an edge; the adjustable parameters are called the upper threshold and
lower threshold in the OpenCv function.

Figure 4.10 presents the images obtained with different edge detection algorithms.
Observing the results, the Canny algorithm detects well-defined edges and was therefore the
choice for the stone detector.

Figure 4.10: Application of Canny, Sobel and Laplacian algorithms for edges detection -
“Heart”.

The diagram illustrating this second approach is presented in Figure 4.11, and Figure 4.12
shows the frame after the Canny edge detection.

Figure 4.11: Diagram of the pre-processing algorithm using the “Canny” edge detector.

Figure 4.12: Application of “Canny” algorithm - “Big Star”.

4.2.3 Contours

By applying the Canny algorithm it was possible to observe that the resulting images do not have
perfect contours. In order to optimize the use of the Canny algorithm in this project, reducing
the possible noise and giving primary emphasis to the shape of each stone, it is necessary to refine
this method. The main goal is to fill the object of reference, based on its external contour.
An illustrative diagram of the algorithm for this third approach is represented in Figure 4.13.
Each of the steps of this third approach is described next, and a sketch of the full pipeline is given after the list.

• Dilation and erosion:


Dilation and erosion [22] are basic operations of mathematical morphology. In this case
study, it is useful to use these two processes together because the intention is to reduce
the noise of a binary image. In Figure 4.14 it is possible to see what happens with the

Figure 4.13: Diagram of the pre-processing algorithm using the “Optimized Canny” with
contour detection.

application of these two methods. First, the dilation process adds white pixels; subsequently,
the erosion process gives a "finer" shape to the edges. In this case, the edges partially fill the
shape of the stone.

Figure 4.14: Application of the algorithms for dilation and erosion - “Big Star”.

• Edges to Contours:
OpenCV provides an algorithm that returns the contours detected in a given image.
Thus, by using the findContours() method, which implements the algorithm described
in [26], it is possible to operate on the existing contours.

• Fill:
To fill each extracted contour, in order to obtain a better matching value between
the image taken by the camera and the images of the database, the drawContours() method,
which implements the algorithm described in [26], is used with the fill option.
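A minimal sketch of this pre-processing pipeline with OpenCv is shown below (the Canny thresholds and the structuring element size are illustrative and, as discussed next, would be tuned per stone):

```cpp
#include <opencv2/core/core.hpp>
#include <opencv2/imgproc/imgproc.hpp>
#include <opencv2/highgui/highgui.hpp>
#include <vector>

int main() {
    cv::Mat gray = cv::imread("frame.png", CV_LOAD_IMAGE_GRAYSCALE);
    if (gray.empty()) return -1;

    // 1. Edge detection.
    cv::Mat edges;
    cv::Canny(gray, edges, 50, 150);

    // 2. Dilation followed by erosion to connect broken edges.
    cv::Mat kernel = cv::getStructuringElement(cv::MORPH_RECT, cv::Size(5, 5));
    cv::dilate(edges, edges, kernel);
    cv::erode(edges, edges, kernel);

    // 3. Extract the external contours and fill them.
    std::vector<std::vector<cv::Point> > contours;
    cv::findContours(edges, contours, CV_RETR_EXTERNAL, CV_CHAIN_APPROX_SIMPLE);

    cv::Mat filled = cv::Mat::zeros(gray.size(), CV_8UC1);
    cv::drawContours(filled, contours, -1, cv::Scalar(255), CV_FILLED);

    cv::imwrite("filled.png", filled);   // image used as input to Template matching
    return 0;
}
```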

The parameters used in the dilation and erosion process are differentiated according
to the size and shape of each stone. This selection was done experimentally for each stone
separately. Figure 4.15 shows the application of the final algorithm to several stones.
After studying the problem with the video samples, the algorithms were implemented in the
robot prototype, developed for the Tajiservi company.

Figure 4.15: “Optimized Canny” algorithm - “Ovoid, Quadrate, Rectangle and Diamond”.

4.3 Results

The results are based on the application, to the provided sample videos, of the three algorithms
developed and presented before. To simulate the real situation, in which each detected stone
is removed by a robotic arm, the image is filled with a black rectangle where the
best match is detected, resembling the real case in which the stone is removed. The stones
are removed in order until the minimum numeric value for which the matching is still
valid is reached. At the same time, a square is drawn over the original image around the detected stone,
for debugging purposes.

Figure 4.16 shows the analysis of a frame for stones of various formats. At the
top left, the reference image from the database with the best match for each detected stone is
visible (only the current detection is presented).

(a) (b)

Figure 4.16: Original frame with stones detected marked: (a) “Big Star” (b) “Ovoid,
Quadrate, Rectangle and Diamond”.

Table 4.2 shows the number of stones detected for each format and for all the algorithms.
The stones were counted until the first "false" match occurred, not considering any "false
positive". The graph of Figure 4.17 shows the detection rate (obtained by normalizing
the values of Table 4.2).
In order for the robot to know that there is no stone left on the plate, it is necessary to obtain a
threshold value for the detection of each type of stone. The graphics (presented in Annex A,
in Figure A.1) illustrate the correlation values obtained in the detection of each stone. It
is possible to see that for the first detections the correlation value is close to 1, indicating a
"good" match. This value decreases as the stones are removed. For each algorithm,
the red color in the graphic corresponds to the first false match. The peak values illustrated
in the graphics correspond to cases in which overlapping occurs with the square drawn
to simulate the stone removal: the square "hides" the previous stone, and subsequently a higher
correlation value is calculated. In the real application this does not happen, since when a stone is
detected it is removed and a new image is acquired.

Stone format Grayscale Canny Optimized Canny

Big Star 6/9 7/9 9/9


Small Star 13/13 9/13 12/13
Heart 7/12 11/12 11/12
Drop 11/22 7/22 15/22
Diamond 9/22 8/22 12/22
Ovoid 1/13 10/13 11/13
Quadrate 4/10 3/10 10/10
Rectangle 7/17 4/17 12/17
Donut 8/11 11/11 10/11

Table 4.2: Experimental results (number of stones detected / total number of stones).

Figure 4.17: Results graph.

An experiment was performed with each algorithm applied to all the different forms of
stone (see Annex A). Observing Table A.1, it is possible to see that the total processing time
for each stone slightly increases with the optimized algorithm, as expected. The stones with
less symmetry have a larger number of images in the database, which also leads to an increase
in the processing time. The values of Table A.2 were obtained in the robot prototype, after the
integration of the best algorithm. In each detection, only a region of interest (in the frame
under evaluation) is taken into account, reducing the processing time. In the robot prototype
database each stone is rotated in steps of 2 degrees, forcing a great number of comparisons.

4.4 Final remarks

At the end of this chapter it is possible to conclude that the initial objectives for the vision
system to be used in the textile robot prototype were met. Overall, it is possible to verify
that, in the video samples, performance is better for large stones. Small stones are separated by a
small distance, which increases the difficulty of the simulation process when the black square is drawn.
Regarding the different algorithms implemented, the simplest case, in which only a conversion
from the color image to grayscale is performed, already yields promising results, and
some stones are detected.
The implementation of the Canny algorithm reduces the dependence on illumination;
however, each stone is represented by a small area, its edge, which is easily confused with the noise
covering the surface where the stones are located.
With the optimization of the Canny algorithm, the matching area is increased, allowing a correlation
value close to 1. This algorithm shows very positive results when no overlap of stones occurs.
Regarding the preparation of the databases, the stones are represented with 20-degree
rotations. In the robot prototype, the databases have the stones rotated with a step of 2
degrees, achieving better results in terms of manipulation. This leads, however, to a longer
processing time, which was solved by reducing the size of the analyzed frame (taking into account
a region of interest) and by using a better computer in the prototype. The database can also be built
by loading only one image into the program and rotating it in real time; this solution is used in
the robot prototype.
The team responsible for the development of the machine in the company integrated
the best algorithm presented in this chapter. At the time of writing, the machine is fully
functional.

Chapter 5

Ball Detection for Robotic Soccer

CAMBADA (acronym of Cooperative Autonomous Mobile roBots with Advanced Distributed
Architecture) is the RoboCup Middle Size League soccer team from the University
of Aveiro, Portugal. This project started officially in October 2003 and, since then, the team
has participated in several RoboCup competitions and Portuguese Robotics Festivals.

The current version of the vision system used in the robots is an omni-directional setup
based on a catadioptric configuration implemented with an Ethernet camera and a hyperbolic
mirror. In Figure 5.1 it is possible to see a robot of the team and an example of an image acquired
by the camera installed on the robot.

(a) (b)

Figure 5.1: (a) Robot of the CAMBADA team; (b) Example of image acquired by vision
system of the robot.

The detection of field lines, obstacles and the ball in robotic soccer is based on color
segmentation techniques and blob analysis, as currently used in the CAMBADA team [27]. For
RoboCup 2014, several challenges have been selected for the teams, and in this chapter one
in particular is studied: Challenge 6 - Team play with an arbitrary FIFA ball [28]. Another
issue addressed in this chapter is the problem of detecting the ball in the air, applied
to defender behavior, mainly for the goalkeeper, when the opponent team tries to score a goal.
To address these challenges, the Hough transform presented in Chapter 2 was used, and
the implementation made for the CAMBADA team is described in this chapter. Moreover,
experimental results are presented to show the effectiveness of the proposed approach.

5.1 Background implementation issues

OpenCv provides a function, HoughCircles(src_gray, circles, CV_HOUGH_GRADIENT,
dp, min_dist, param_1, param_2, min_radius, max_radius) [26], which implements the circular
Hough transform. Several input parameters are necessary; each one is explained below, and a usage
sketch is given after the list:

• src_gray: Input image (grayscale);

• circles: A vector that stores a set of 3 values (xc, yc, r) for each detected circle;

• CV_HOUGH_GRADIENT: Defines the detection method. Currently this is the only one available
in OpenCv;

• dp = 1: The inverse ratio of the accumulator resolution;

• min_dist = src_gray.rows/8: Minimum distance between detected centers;

• param_1 = 200: Upper threshold for the internal Canny edge detector;

• param_2 = 100: Threshold for center detection;

• min_radius = 0: Minimum radius to be detected. If unknown, zero is used as default;

• max_radius = 0: Maximum radius to be detected. If unknown, zero is used as default.
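A minimal sketch of how this function can be called is shown below (the parameter values are the ones discussed later in this chapter, and the file name is illustrative):

```cpp
#include <opencv2/core/core.hpp>
#include <opencv2/imgproc/imgproc.hpp>
#include <opencv2/highgui/highgui.hpp>
#include <vector>

int main() {
    cv::Mat frame = cv::imread("kinect_frame.png");
    if (frame.empty()) return -1;

    cv::Mat gray;
    cv::cvtColor(frame, gray, CV_BGR2GRAY);

    std::vector<cv::Vec3f> circles;             // each element: (xc, yc, r)
    cv::HoughCircles(gray, circles, CV_HOUGH_GRADIENT,
                     1,                // dp
                     gray.rows / 8,    // min_dist
                     40,               // param_1: upper Canny threshold
                     20,               // param_2: accumulator threshold
                     1, 30);           // min and max radius

    for (size_t i = 0; i < circles.size(); ++i) {
        cv::Point center(cvRound(circles[i][0]), cvRound(circles[i][1]));
        int radius = cvRound(circles[i][2]);
        cv::circle(frame, center, radius, cv::Scalar(0, 0, 255), 2);  // draw detection
    }
    cv::imwrite("detections.png", frame);
    return 0;
}
```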

This chapter presents experimental results based on the use of the OpenCv implementation
of the circular Hough transform, considering several approaches for its use, namely
pre-processing steps and validation algorithms.
To evaluate the performance of the developed approaches, sample videos were acquired
with the CAMBADA robots in the recently built robotic soccer field at the University of Aveiro.
A Kinect was used, placed on the goalkeeper robot. The depth component is not an object of
study in this project; for this reason, only the color images acquired by the Kinect are used.
Figure 5.2 shows a frame acquired by the Kinect.

Figure 5.2: Sample image acquired by the Kinect, located in the goalkeeper robot.

In Chapter 2, the algorithm and implementation of the circular Hough transform were
explained. Figure 5.3 presents the application of HoughCircles() and the corresponding
accumulator matrix obtained. It is possible to see the absolute maximum value of the accumulator
at the center of the detected circle, and other maxima located in the surroundings.

5.2 Study of HoughCircles() parameters

The choice of the value of each parameter is essential for the success of the detection. For
example, it is imprudent to try to find large radii in the field, since the radius of the ball is
small. For this reason, it is necessary to study the effect of each parameter individually.

(a) (b)

Figure 5.3: Hough circle transform application (param_1 = 40, param_2 = 20,
min_radius = 10, max_radius = 25): (a) Ball detection; (b) Respective accumulator
matrix representation.

5.2.1 Upper threshold for the internal Canny edge detector

Large intensity gradients are more likely to correspond to edges than small intensity
gradients. It is impossible, in most cases, to specify a threshold at which a given
intensity gradient corresponds to an edge. To understand the effect of the upper threshold,
the variation of the Canny edge detector parameter is illustrated in Figure 5.4. To show the
results, only a small region of the original frame is presented.
Choosing a low value for the Canny threshold, several false detections occur: the scenario
contains useless information, and with a low threshold many edges are extracted. A high threshold
value is also dangerous, since the extracted edges are not sufficient to form a perfect circle that is
easily detected with the Hough transform. To deal with this problem, a pre-processing technique
was tested: the extraction of the RGB channels and their individual analysis.

Extraction of RGB channels

Color channels refer to certain components of a color image. RGB is the most widely
used model, in which any pixel of a color image is represented by its red, green, and blue
components. Sometimes, the information of interest resides mainly in a single color
channel. In this case, we can consider splitting the color channels of the original
(a) (b)

(c) (d) (e)

(f) (g) (h)

Figure 5.4: Variation of Canny parameter: (a) Original image; (b) Gray image; (c)
thrCanny = 20; (d) thrCanny = 30; (e) thrCanny = 40; (f) thrCanny = 50; (g) thrCanny = 60;
(h) thrCanny = 70.

image and processing them separately. If a user wants to extract an object whose color is mainly
red from its background, better results can possibly be obtained if only the red channel is used.

In Figure 5.5 it is possible to see the extraction of three channels in the test image.

To evaluate the effect of Canny after color separation, the Canny algorithm was applied
in the red, green and blue channels. These tests are shown in Figure 5.6.

By observation, it is possible to see that the green channel provides the best solution, not
only because the ball contour is extracted with a nearly perfect shape, but also because this happens
over a large range of Canny thresholds. This result is expected because of the green color of the field.

(a) (b) (c)

Figure 5.5: Figure 4.3 (a) split in RGB channels: (a) Red channel; (b) Green channel; (c)
Blue channel.

5.2.2 Threshold for center detection

This is the accumulator threshold for the circle centers at the detection stage. The smaller it is,
the more false circles may be detected. Circles corresponding to the larger accumulator values
are returned first. The maxima of the accumulator are discarded if they are below the
threshold value. In the proposed experiments, with thrdetection = 20 a great number of false
detections is discarded.

5.2.3 Maximum and minimum radius to be detected

To choose the radius values, a simple test was made. A video was recorded, moving the ball
longitudinally, parallel to the camera, along a straight line. In the experiment, the maximum
value of the radius, when the ball is closest to the camera, is maxradius = 30. When the ball is
far away, the radius decreases, down to minradius = 1.
After understanding the effect of each parameter of the HoughCircles() transform, two separate
problems are studied in the following sections: team play with an arbitrary FIFA ball on the ground
plane, and detection of the ball in the air.

5.3 Team play with an arbitrary FIFA ball

Detecting an arbitrary ball (with unknown color) is possible by taking into account its
circular shape. With the circular Hough transform, edges are extracted, and only the shape is
relevant for detection. Three sample videos were recorded to evaluate the efficiency of the
developed algorithms:

Figure 5.6: Canny applied to the red, green and blue channels, varying the Canny threshold
(thrCanny = 20, 30, 40, 50, 60 and 70).

• Video 1: Blue ball moving along the ground plane;

• Video 2: Yellow ball moving along the ground plane;

• Video 3: Red ball moving along the ground plane;

In figure 5.7 it is possible to see several moments of a video sample.

Figure 5.7: Sample frame acquired by the Kinect with the blue ball.

5.3.1 Validation process

The distance of the ball to the robot can be related to the radius of the detected ball.
The distance can be calculated in pixels, since the ball only moves on the ground plane. A reference
position centerref = (xref, yref) is taken into account, where xref = frame.cols/2 and
yref = frame.rows, corresponding to the robot position. The center of the circle detection,
center = (x, y), is known, returned by the HoughCircles() function. It is therefore possible to
calculate the distance in pixels between the reference and the center of the detected ball:

d = √((x − xref)² + (y − yref)²).

Knowing the radius corresponding to each distance, it is possible to discard radii outside a defined range.
Figure 5.8 shows the variation of the radius of the detected ball with the distance to the robot.
It is possible to see that the detected radius decreases with distance, as expected.
In order to obtain a mathematical model that approximates the variation of the radius as a function
of the distance, an exponential approximation was computed, and an upper and a lower curve were
considered to accommodate the detection of balls in a real application. The functions Thrmin:
radius = 346.11 e^((−0.057×distance)−0.50) and Thrmax: radius = 346.11 e^((−0.057×distance)+0.50)
are used in the developed algorithm in order to produce a validation vector with the maximum
and minimum radius for each computed distance.

Figure 5.8: Radius variation with the distance of the ball. Data extracted from the analysis of
Video 1 with thrCanny = 60; thrdetection = 20; minradius = 1; maxradius = 30. The experimental
data are approximated by an exponential function. The maximum and minimum thresholds
are obtained from the exponential equation: Thrmin: radius = 346.11 e^((−0.057×distance)−0.50),
Thrmax: radius = 346.11 e^((−0.057×distance)+0.50).

In the practical case, only the longitudinal distance of the ball is considered. For this
reason, the distance computed in the proposed algorithm is calculated only as a function of the
rows of each frame: distance′ = frame.rows − y.
With this process of radius validation as a function of distance, the parameters of the
HoughCircles() function can be relaxed to produce a greater number of potential ball detections.
Figure 5.9 shows the results after validation, and a sketch of the validation test is given below.
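A minimal sketch of this radius validation is shown below, assuming the exponential thresholds given above and the row-based distance distance′ = frame.rows − y (the function name and example values are illustrative):

```cpp
#include <cmath>
#include <iostream>

// Returns true if a circle with the given radius, detected at row y of the frame,
// is compatible with the exponential radius/distance model.
bool validRadius(double y, double radius, int frameRows) {
    double distance = frameRows - y;   // longitudinal distance in pixels
    double thrMin = 346.11 * std::exp(-0.057 * distance - 0.50);
    double thrMax = 346.11 * std::exp(-0.057 * distance + 0.50);
    return radius >= thrMin && radius <= thrMax;
}

int main() {
    int frameRows = 480;
    std::cout << std::boolalpha
              << validRadius(438.0, 30.0, frameRows) << std::endl   // inside the band: true
              << validRadius(438.0, 5.0, frameRows) << std::endl;   // too small here: false
    return 0;
}
```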

5.3.2 Results

Table 5.1 shows the detection rate for each sample video. Each detection rate was obtained
by observing 60 frames, with the validation criteria explained above. There is no reference to
false positives since, in the test sequences, the ball yields less than 1% of false detections.

Figure 5.9: Experimental results after the validation process. Red circles represent radii
validated as "good"; violet circles correspond to discarded circle detections. With
thrCanny = 40; thrdetection = 17; minradius = 1; maxradius = 30.

Algorithm Blue ball Yellow ball Red ball

Grayscale 88% 23% 80%


Green channel 92% 43% 92%

Table 5.1: Experimental results regarding the detection rate. Test applied in 60 frames.

The yellow ball has a lower detection rate because it needs a lower Canny threshold
in order for all its edges to be detected. With the generic threshold, only a partial circle is drawn
for the yellow ball when the edge detector is applied, compromising the following steps of the
developed algorithm. The objective of the developed algorithm is to provide a generic solution
for ball detection. For this reason, the parameters of the HoughCircles() transform assume constant
values when tested with balls of different colors. The values selected for the final experiments were:
thrCanny = 40, thrdetection = 17, minradius = 1 and maxradius = 30.
To evaluate the performance of each algorithm with respect to processing time, Table 5.2
shows the time spent in each step. Both algorithms have similar processing times.

5.4 Ball in the air

In the study of ball detection in the air during a game, the color information of the ball is
known beforehand. With this color information, it is possible to complement the Hough transform,
which locates potential circle detections. Several false detections occur when the HoughCircles()
function is applied, and using color information it is possible to improve these results.

Algorithm       Pre-processing   HoughCircles()   Draw   Total

Grayscale       13.28            28.48            2.05   43.82

Green channel   12.35            29.37            2.35   44.07

Table 5.2: Processing time in ms: the pre-processing time refers to the grayscale conversion
in the first algorithm and to the RGB split in the second algorithm, where the green
channel was chosen; HoughCircles() refers to the ball detection time (including the Canny
processing); draw corresponds to the time spent validating the radius and displaying debug
information. Test applied to 60 frames of Video 1.

To study the detection of the ball in the air, three sample videos were recorded:

• Video 1: Blue ball jumping perpendicular to camera plane;

• Video 2: Yellow ball jumping perpendicular to camera plane;

• Video 3: Red ball jumping perpendicular to camera plane.

In Figure 5.10 it is possible to see several moments of a video sample.

Figure 5.10: Sample frames acquired by the Kinect with the ball in the air.

5.4.1 Validation process

To validate the presence of a ball in the scene, each circle detection is analyzed and evaluated.
For each circle detection, a region of interest (ROI) is obtained and color information
is extracted. In the ROI, all pixels are tested in order to study their color using the HSV color
space. Unlike RGB, HSV separates the image intensity from the chroma, or color, information.

This is very useful when the goal is the classification of a specific color according to a color
range defined beforehand; for this reason, it is the color space chosen for these experiments.
To validate the presence of the ball, reference ranges of hue are considered (initially, all
possible values of saturation and value are accepted):

• Blue: [50; 140];

• Yellow: [15; 40];

• Red: [0; 15] and [160; 180].

In order to detect the presence of the ball in each ROI (see Figure 5.11), all pixels in this
image are tested and evaluated. Pixels inside the considered range are counted. To decide whether
a dominant color exists in the ROI, a threshold is applied (a sketch of this color test follows the list):

• Blue: if contPixelB > 0.40 × totalROIPixels, a blue ball is present in the ROI;

• Yellow: if contPixelY > 0.40 × totalROIPixels, a yellow ball is present in the ROI;

• Red: if contPixelR > 0.40 × totalROIPixels, a red ball is present in the ROI.
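A minimal sketch of this color test for the blue ball is given below, using the hue range listed above together with the saturation/value constraint of Figure 5.12 (the exact limits and file name are otherwise illustrative):

```cpp
#include <opencv2/core/core.hpp>
#include <opencv2/imgproc/imgproc.hpp>
#include <opencv2/highgui/highgui.hpp>
#include <iostream>

// Returns true if the ROI around a circle detection contains a dominant blue region.
bool isBlueBall(const cv::Mat& roiBGR) {
    cv::Mat hsv;
    cv::cvtColor(roiBGR, hsv, CV_BGR2HSV);

    // Hue in [50, 140], saturation above 30 and value below 100 (to reject lamp reflections).
    cv::Mat mask;
    cv::inRange(hsv, cv::Scalar(50, 31, 0), cv::Scalar(140, 255, 99), mask);

    int bluePixels  = cv::countNonZero(mask);
    int totalPixels = roiBGR.rows * roiBGR.cols;
    return bluePixels > 0.40 * totalPixels;
}

int main() {
    cv::Mat roi = cv::imread("roi.png");   // ROI cropped around a circle detection
    if (roi.empty()) return -1;
    std::cout << std::boolalpha << isBlueBall(roi) << std::endl;
    return 0;
}
```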

Figure 5.11: Region of interest obtained after the Hough transform.

In the robotics laboratory, the light of the lamp located above the field, reflected in the image,
leads to false detections. To eliminate these false detections, it is necessary to filter the saturation
and value ranges that characterize the lamp color. These parameters can be adapted according to
the conditions of the environment where the robots play. In Figure 5.12 it is possible to view the
region of the color space considered for the blue color.

Figure 5.12: HSV color space representation. The discarded zone is marked with a red rectangle;
s > 30 && v < 100 is considered in the color test.

5.4.2 Results

Table 5.3 shows the detection rate for each sample video. In this situation, false positives
do not occur.

Blue ball Yellow ball Red ball

Rate detection 50% 36% 64%

Table 5.3: Experimental results regarding the detection rate. With thrCanny = 40;
thrdetection = 18; minradius = 10; maxradius = 25.

In these experiments, the Canny algorithm is applied to a grayscale image, without any other
pre-processing method. The objects located in the back of the room introduce considerable
noise in the analyzed image when the ball is in the air. Changing the Canny threshold is not
sufficient to improve this result.

5.5 Final remarks

The dependence on the edge detector is the principal performance factor of the Hough
transform. Scenes without noise produce positive results, but when clutter increases, the
detector gives worse results. It is also possible to observe that when the Canny algorithm detects
an imperfect circle, HoughCircles() handles the missing and occluded data successfully, producing
positive results.

The choice of parameters is a complex task. A compromise between them is necessary
in order to produce the best results in generic scenarios: different ball colors, different
illumination conditions in the scene, presence of furniture or other objects in the scene, among
others.

Chapter 6

Objects Detection and Recognition for a Service Robot

The rise and development of robot technology has already caused great transformations in
many fields of science and technology. Nowadays, robots not only play a significant
role in industry, but have also entered family houses, being used for service and
entertainment, and are gradually becoming an important part of people's daily lives. One can
easily expect that the applications of robots will expand to support our society in the 21st
century. Robots for personal use will coexist with humans and provide support such as
assistance with housework and care for the elderly and physically handicapped, in an aging society.
The development of service robots is a research focus of the current robotics domain. Creating
an autonomous agent, capable of taking decisions in a changing environment, is a hard task for
study and research in many areas.
In this chapter, the intention is to develop an efficient vision system for object detection,
using the visual descriptors explained in Chapter 3. To implement this system, the rules
of the RoboCup@Home league were taken into account [29].

6.1 RoboCup environment

RoboCup@Home is an annual international robotics challenge, integrated in the RoboCup
competition. RoboCup was founded in 1997, with the aim of promoting robotics research
by offering a publicly appealing challenge. RoboCup@Home appeared in 2006 and focuses
on the introduction of autonomous robots into human society. Figure 6.1 presents a
typical scenario used in the RoboCup@Home challenges. The RoboCup@Home arena is a realistic
home setup consisting of inter-connected rooms like, for instance, a living room, a kitchen, a
bathroom, and a bedroom.

Figure 6.1: Typical arena in RoboCup@Home challenge, 2013 [29].

The competition aims to develop technology to assist with domestic tasks in the
future. To evaluate the robots' abilities and performance, they are exposed to a set of tests in
a real-world scenario. The focus lies on the following main domains: Human-Robot Interaction
and Cooperation, Navigation and Mapping in dynamic environments, Computer Vision and
Object Recognition under natural light conditions, Object Manipulation, Adaptive Behaviors,
Behavior Integration, Ambient Intelligence, Standardization and System Integration.

The CAMBADA@Home project was created in January 2011, at the University of Aveiro,
following the past experience of the CAMBADA robotic soccer team [30]. The
CAMBADA@Home platform is designed as a three-layer mechanical/electronic platform
which can accommodate, in an effective way, the sensors and actuators needed to
perform the RoboCup@Home challenges. The vision system is located on top of the robot, using
a Kinect sensor [30]. In Figure 6.2 it is possible to see the CAMBADA@Home robot.

Figure 6.2: CAMBADA@Home robot.

6.2 Test scenario

Some tests in the RoboCup@Home league involve the manipulation of objects. These
objects resemble items usually found in household environments like, for instance, soda
cans, coffee mugs or books. In the competition, the Technical Committee compiles a list
of 25 objects. There are no restrictions on object size, appearance or weight; however, it can
be expected that the selected objects are easily manipulable by a human using a single hand.
To evaluate the efficiency of the developed vision system for a service robot, some experiments
were conducted with the CAMBADA@Home robot.
Some objects were selected and several videos were acquired using the Kinect installed
on the robot. The objects used in the experiments are represented in Figure 6.3. The chosen
scenario is a shelf, and the videos were recorded moving the robot around the shelf, varying the
scale and rotation of the camera relative to the position of the objects (see the scheme of Figure 6.4).
The objects are placed on the shelf, without occlusion, easily accessible to the robot, according to
the rules of RoboCup@Home.
In the experiments, only the color information acquired by the Kinect is used. The study of
the depth component is not an objective of this thesis.

Figure 6.3: Object 1: Cleaning stuff; Object 2: Yogurt; Object 3: Juice; Object 4: Book;
Object 5: Tin.

Figure 6.4: The gray rectangle represents the shelf with the objects. The dashed line represents
the movement of the robot. During the whole process, the Kinect was pointed at the shelf. Two
details can be noted: (1) variation in scale; (2) variation in rotation. Note that objects
4 and 5 undergo a larger rotation angle than the other objects.

Several scenarios were recorded:

• scenario 1: All objects in the scene;

• scenario 2: Missing object 1;

• scenario 3: Missing object 2;

• scenario 4: Missing object 3;

• scenario 5: Missing object 4;

• scenario 6: Missing object 5.

In Figure 6.5 it is possible to view several moments of a sample video.

Figure 6.5: Scenario 1: It is possible to observe the variation in scale and rotation.

6.3 Developed vision system

The approach studied for the vision system relies on the use of visual descriptors, in particular
two specific algorithms, SIFT and SURF, as described in Chapter 3.
The FAST algorithm is only a feature detector: it detects interesting points in the image but
does not compute a descriptor vector. BRIEF is a feature descriptor, but it does not provide
any method to find the features. Since the objective of this thesis is not the development of new
methods, the FAST and BRIEF algorithms are not applied in the practical case.
The sequence of operations of the developed vision system is illustrated in Figure 6.6 and
the details are presented as follows.

Figure 6.6: Description of the object detection system.

The detection and classification of an object follows the same procedure for both algorithms,
SIFT and SURF. The difference between the algorithms lies in how the keypoints are detected
and the features extracted; the following steps are identical. First, an image of each object of
interest is taken and a pre-processing step is performed to extract the reference keypoints and
features. For this reason, it is necessary to know beforehand the objects to be searched
for. The feature extraction process was described in Chapter 3.
The scene to be analyzed can be processed in real time. For the experiments in this thesis,
as described before, several videos were recorded and each frame was processed individually. The
same feature extraction process is repeated for each frame, and the distances between
features (object and scene) are computed using the FlannBasedMatcher class [26]. This
matcher is trained on a descriptor collection and calls its nearest-neighbor search methods to find
the best matches. KDTreeIndexParams is the default index parameter used in these
experiments. In this step, two values are obtained, representing the maximum, maxdist,
and minimum, mindist, distance between matches. In order to reduce the large number of
matches obtained, a reference threshold is applied, and only matches with distance less than
2 × mindist are considered a "good match". At the end of this process it is necessary to conclude
whether the object is present in the scene or not. Several parameters are taken into account:

• the number of matches: if goodMatches.size() <= 4, the object is not considered to be
in the scene;

• the average distance of the "good matches": a threshold is applied, for
both algorithms, SIFT and SURF.

If the object exists in the scene, the RANSAC algorithm is used to estimate its location. The
RANSAC algorithm needs more than four "good matches" to perform the localization; for
this reason, the first parameter taken into account is the size of the "good matches" vector.
In Figure 6.7 it is possible to see an application of the developed system, and a sketch of the
matching and localization steps is given below.
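A minimal sketch of these matching and localization steps with OpenCv is shown below. It assumes the descriptors of the object and of the scene have already been computed (e.g. with cv::SURF, as in the earlier example), and the thresholds follow the rules described above; the function name is illustrative:

```cpp
#include <opencv2/core/core.hpp>
#include <opencv2/features2d/features2d.hpp>
#include <opencv2/calib3d/calib3d.hpp>
#include <algorithm>
#include <vector>

// objKp/objDesc and sceneKp/sceneDesc: keypoints and descriptors of the reference
// object and of the current frame. Returns the 4 projected object corners in the
// scene, or an empty vector if the object is not considered to be present.
std::vector<cv::Point2f> locateObject(const std::vector<cv::KeyPoint>& objKp,
                                      const cv::Mat& objDesc,
                                      const std::vector<cv::KeyPoint>& sceneKp,
                                      const cv::Mat& sceneDesc,
                                      const cv::Size& objSize) {
    // 1. FLANN-based matching between object and scene descriptors.
    cv::FlannBasedMatcher matcher;
    std::vector<cv::DMatch> matches;
    matcher.match(objDesc, sceneDesc, matches);

    // 2. Keep only "good matches": distance below 2 * minimum distance.
    double minDist = 1e9;
    for (size_t i = 0; i < matches.size(); ++i)
        minDist = std::min(minDist, (double)matches[i].distance);

    std::vector<cv::DMatch> good;
    for (size_t i = 0; i < matches.size(); ++i)
        if (matches[i].distance < 2 * minDist) good.push_back(matches[i]);

    std::vector<cv::Point2f> corners;
    if (good.size() <= 4) return corners;       // object not found

    // 3. Estimate the object location with RANSAC (homography between matched points).
    std::vector<cv::Point2f> objPts, scenePts;
    for (size_t i = 0; i < good.size(); ++i) {
        objPts.push_back(objKp[good[i].queryIdx].pt);
        scenePts.push_back(sceneKp[good[i].trainIdx].pt);
    }
    cv::Mat H = cv::findHomography(objPts, scenePts, CV_RANSAC);

    // 4. Project the object corners into the scene (the green rectangle of Figure 6.7).
    std::vector<cv::Point2f> objCorners(4);
    objCorners[0] = cv::Point2f(0, 0);
    objCorners[1] = cv::Point2f((float)objSize.width, 0);
    objCorners[2] = cv::Point2f((float)objSize.width, (float)objSize.height);
    objCorners[3] = cv::Point2f(0, (float)objSize.height);
    cv::perspectiveTransform(objCorners, corners, H);
    return corners;
}
```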

6.4 Efficiency of the visual descriptors

To study the efficiency of the visual descriptors and to determine the threshold values referred to before, two situations are analyzed: searching for an object in a scene where the object

Figure 6.7: Object detection system: Color lines represent the “good matches”, green rect-
angle is generated by RANSAC algorithm. (a) Detection of object 3 to SURF algorithm; (b)
Detection of object 1 to SURF algorithm.

is present, and searching for it in a scene where it is not present. In each case, 130 frames are analyzed and the average distance of the “good matches” is extracted. In Figure 6.8 it is possible to see the comparison of the average values, in the sample videos, for the SIFT extractor. Figure 6.9 shows the number of “good matches” for each case.
Analyzing the graphs, it is possible to see that the average distance assumes different values when the object is present in the scene and when it is not. Looking at the average distance in graph (a), where all objects are in the scene, the values lie between 100 and 300. In graphs (b), (c), (d), (e) and (f), the missing object assumes average values above 250. Although the distance between matches increases when the object is not present, the number of “good matches” also increases, which is explained by the mindist value: it grows when the object is missing, so when the first threshold (2 × mindist) is applied, a greater number of “good matches” is obtained.
In Figures 6.10 and 6.11 it is possible to see the same results for the SURF algorithm. The average distance values and the size of the “good matches” vector show a less oscillatory behavior when compared with the SIFT extractor.

6.5 Results

The thresholds chosen for the SURF and SIFT descriptors were obtained by observing the average distance values. The threshold values chosen for these experiments are the


Figure 6.8: Average distance of the “good matches” for the SIFT extractor over 130 frames: (a) all objects in the scene; (b) missing object 1; (c) missing object 2; (d) missing object 3; (e) missing object 4; (f) missing object 5.

following: thrSIFT = 250 and thrSURF = 0.29. If the distance assumes values greater than the threshold, the system assumes that the object is not present in the scene. The results obtained with the detection system are presented in Tables 6.1 and 6.2.

In order to understand the variation of the detection rate for each object, the size of the descriptors (for each object) is presented in Table 6.3.


Figure 6.9: Number of “good matches” for the SIFT extractor over 130 frames: (a) all objects in the scene; (b) missing object 1; (c) missing object 2; (d) missing object 3; (e) missing object 4; (f) missing object 5.

It is possible to observe that:

• Object 4 has the largest number of descriptors, which justifies its best detection rate;

• Object 5 undergoes a large rotation angle relative to the camera. For this reason, at the end of the sample video, only a small number of detections occurs;

• The greater number of false positives for object 3 was expected in the case of the SIFT descriptor. In the initial study, illustrated in Figure 6.9 (d), the oscillation in the average distance value is noticeable.


Figure 6.10: Average distance of the “good matches” for the SURF extractor over 130 frames: (a) all objects in the scene; (b) missing object 1; (c) missing object 2; (d) missing object 3; (e) missing object 4; (f) missing object 5.

Regarding the processing time of the developed system, three measurements were taken into account: the time to extract the descriptors from the scene image (the pre-processing extraction of features from the reference image is not considered), the processing time of the FlannBasedMatcher, and the time to draw the results (matches and the location of the object, given by RANSAC). The last time is zero when the detection fails. Table 6.4 shows the average


Figure 6.11: Number of “good matches” for the SURF extractor over 130 frames: (a) all objects in the scene; (b) missing object 1; (c) missing object 2; (d) missing object 3; (e) missing object 4; (f) missing object 5.

processing times per frame for both extractors. As expected, the time necessary to evaluate each frame using the SIFT algorithm is roughly double that of SURF.

6.6 Final remarks

The SURF algorithm has a faster processing time than the SIFT extractor. The trade-off between processing time and detection rate shows that the SIFT extractor is not the best choice.
The distance between matches is not, by itself, sufficient to conclude on the presence of an

             Object 1   Object 2   Object 3   Object 4   Object 5

Scenario 1    86.9%      86%        86.2%      99.2%      92.3%
Scenario 2    0%*        84.6%      88.5%      100%       72.3%
Scenario 3    96.1%      0.8%*      91.5%      100%       80.7%
Scenario 4    100%       96.1%      52.3%*     100%       85.4%
Scenario 5    93.1%      98.5%      92.3%      39.2%*     70.8%
Scenario 6    98.5%      93.8%      80.8%      100%       55.4%*

Table 6.1: Experimental results: detection rate for the SIFT algorithm; the cells marked with * (the object missing in each scenario) correspond to false positives. 130 frames were analyzed.

             Object 1   Object 2   Object 3   Object 4   Object 5

Scenario 1    86%        80.7%      99.2%      97.7%      60%
Scenario 2    20%*       85.4%      95.4%      96.9%      65.4%
Scenario 3    92.3%      23.8%*     97.8%      98.5%      63.8%
Scenario 4    91.5%      73.8%      20.7%*     94.6%      83.8%
Scenario 5    92.3%      77.7%      93.8%      1.5%*      40.8%
Scenario 6    90.8%      93.8%      99.2%      95.4%      48.5%*

Table 6.2: Experimental results: detection rate for the SURF algorithm; the cells marked with * (the object missing in each scenario) correspond to false positives. 130 frames were analyzed.

                          Object 1      Object 2      Object 3      Object 4      Object 5

Descriptor size in SIFT   [128 × 350]   [128 × 133]   [128 × 328]   [128 × 588]   [128 × 241]
Descriptor size in SURF   [64 × 128]    [64 × 41]     [64 × 154]    [64 × 363]    [64 × 136]

Table 6.3: Experimental results: Size of the descriptor vector for each object in the SIFT and SURF algorithms.

object. Sometimes the minimum distance value occurs in false matches (in Figure 6.7 (a) it is possible to see that four false matches occur, and these assume the minimum distance values).

The invariance properties of the descriptors with respect to scale, rotation and illumination have some limitations. The best results are obtained when the camera is located in front of the

Algorithm   Extraction   FLANN   Visual   Total

SIFT              1438      47       18    1503
SURF               588      17       40     646

Table 6.4: Experimental results regarding the processing time of the SIFT and SURF algorithms. All times are in ms. “Extraction” refers to the feature extraction time, “FLANN” to the time required by the matching process, and “Visual” to the time spent drawing the matches, running the RANSAC algorithm and presenting the visual information. Values obtained by averaging over a sequence of 130 frames, considering Video 1 with all objects in the scene; the times in the table correspond to one frame. These values were obtained using a laptop with an AMD APU E450 1.65GHz (dual core).

objects, and close to them.


The method used to compute the distance between descriptors is the same for the SIFT and SURF extractors: the Euclidean distance. For each algorithm the distance assumes a different range of values, which is explained by the information present in each descriptor vector when it is generated.
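As a purely illustrative sketch (the function name and arguments are hypothetical), the distance for a single pair of descriptors corresponds to the L2 norm of their difference, which OpenCV exposes through cv::norm():

#include <opencv2/core/core.hpp>

// Euclidean (L2) distance between one object descriptor and one scene descriptor.
// descObject and descScene are CV_32F matrices with one descriptor per row
// (128 columns for SIFT, 64 for SURF, as in Table 6.3).
double l2Distance(const cv::Mat& descObject, int objRow,
                  const cv::Mat& descScene, int sceneRow)
{
    return cv::norm(descObject.row(objRow), descScene.row(sceneRow), cv::NORM_L2);
}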
The results could possibly be improved by using depth information. A novel descriptor called Binary Robust Appearance and Normals Descriptor (BRAND) [31] efficiently combines appearance and geometric shape information from RGB-D images, and is largely invariant to rotation and scale transformations. According to the authors, BRAND achieves better results than state-of-the-art descriptors based on texture, geometry, or a combination of both.

Chapter 7

Conclusions and future work

The main goal of the work developed in this thesis was the study of several approaches to object detection in the field of real-time computer vision for autonomous robots. Three different problems were presented and a usable solution for each practical application was proposed and analyzed.
In the development of a decorative stone detector for the textile industry, three algorithms were tested, differing in the pre-processing of the images to be analyzed by the Template Matching algorithm. In the simplest case, only a conversion from the color image to grayscale is performed. To improve this first approach, the Canny algorithm was applied, with the advantage of reducing the problem of illumination variation. The final algorithm includes dilation and erosion steps in order to reduce the noise in the images to be analyzed. Filling the contours was the solution adopted to improve the correlation between images, bringing its value closer to one. The Template Matching algorithm gives good results in this application, since the working plane is fixed with respect to the position of the camera and the computer vision system does not have to deal with the scale of the objects.
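A minimal sketch of this pre-processing chain, assuming a BGR input frame and purely illustrative Canny thresholds and kernel size (the values actually used are reported in Appendix A), could be written with OpenCV as follows:

#include <opencv2/core/core.hpp>
#include <opencv2/imgproc/imgproc.hpp>
#include <vector>

using namespace cv;

// Pre-process a frame before Template Matching: grayscale -> Canny edges ->
// dilation/erosion to reduce noise -> filled contours, so that the correlation
// with an equally pre-processed template approaches one.
Mat preprocessForTemplateMatching(const Mat& frameBGR, int cannyThr1, int cannyThr2)
{
    Mat gray, edges;
    cvtColor(frameBGR, gray, CV_BGR2GRAY);         // 1. color to grayscale
    Canny(gray, edges, cannyThr1, cannyThr2);      // 2. edges, less sensitive to illumination

    Mat kernel = getStructuringElement(MORPH_RECT, Size(3, 3));
    dilate(edges, edges, kernel);                  // 3. close small gaps in the contours
    erode(edges, edges, kernel);                   //    and remove isolated noise

    std::vector<std::vector<Point> > contours;
    findContours(edges, contours, CV_RETR_EXTERNAL, CV_CHAIN_APPROX_SIMPLE);

    Mat filled = Mat::zeros(frameBGR.size(), CV_8UC1);
    drawContours(filled, contours, -1, Scalar(255), CV_FILLED);  // 4. fill the contours
    return filled;                                 // input for matchTemplate()
}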
Regarding the use of the Hough transform for ball detection, it requires a pre-processing filtering step to detect the edges in an image. The performance of the Hough transform is highly dependent on the results of the edge detector. The object size in the image and the distance between objects cause variations in the results. In the practical application, the HoughCircles() parameters were chosen by observation, which makes it a complex task.
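A hedged sketch of this step is shown below; the smoothing kernel, Canny threshold, accumulator threshold and radius limits are illustrative placeholders, since, as noted above, the actual values had to be tuned by observation:

#include <opencv2/core/core.hpp>
#include <opencv2/imgproc/imgproc.hpp>
#include <vector>

using namespace cv;

// Detect circular candidates (e.g. the ball) in a grayscale frame.
std::vector<Vec3f> detectBallCandidates(const Mat& gray)
{
    Mat blurred;
    GaussianBlur(gray, blurred, Size(9, 9), 2, 2);   // smoothing reduces spurious edges

    std::vector<Vec3f> circles;                      // each circle: (x, y, radius)
    HoughCircles(blurred, circles, CV_HOUGH_GRADIENT,
                 1,              // inverse ratio of the accumulator resolution
                 gray.rows / 8,  // minimum distance between detected centers
                 100,            // upper Canny threshold (the lower one is half of it)
                 30,             // accumulator threshold: smaller values give more false circles
                 5, 100);        // minimum and maximum radius, in pixels
    return circles;
}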
In the development of an object detector to be applied in an autonomous service robot for a domestic environment, the SIFT and SURF algorithms were tested and evaluated. SIFT needs more time to compute the descriptor vector and its detection rate is not proportional to this time cost. For this reason, the SURF algorithm is the best choice. From the experiments it is possible to conclude that the invariance properties of the descriptors with respect to scale, rotation and illumination have some limitations.
By the end of this work, all the developed vision systems were being used or being integrated into the robot prototypes.

7.1 Future work

The Template Matching method is the simplest approach to object detection. New correlation algorithms can be studied in order to improve the correlation values between two images.
The use of depth information is not an objective of this thesis, but it is expected that the detection results can be improved when this information is used in the problems described in Chapters 4 and 5. The study of the BRAND descriptor [31] is an important basis for a new approach using depth information. Another example is the algorithm for ball detection that will be presented at RoboCup 2014 [32].
An efficient recognition system can use a set of reference images, extracting their features and storing them in a database. This model is called bag-of-words (BoW). It was introduced in 2003 and can be applied to image classification by treating image features as words [33]. The BoW concept is based on a visual dictionary, a library that contains distinct, compact image patterns, the so-called features. An image is represented by several patches, and as long as a significant portion of these patches can be matched between images, similarity can be determined.
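A minimal sketch of such a BoW pipeline with OpenCV is given below, assuming the SURF extractor already used in this work; the vocabulary size (500) is an arbitrary illustration and the classifier that would consume the resulting histograms (e.g. an SVM) is left out:

#include <opencv2/core/core.hpp>
#include <opencv2/features2d/features2d.hpp>
#include <opencv2/nonfree/features2d.hpp>   // SURF lives in the nonfree module (OpenCV 2.4)
#include <vector>

using namespace cv;

// Build a visual dictionary from a set of reference images and compute the
// BoW histogram (visual-word frequencies) of a query image.
Mat buildBowDescriptor(const std::vector<Mat>& referenceImages, const Mat& queryImage)
{
    Ptr<FeatureDetector>     detector  = new SURF(400);
    Ptr<DescriptorExtractor> extractor = new SURF(400);
    Ptr<DescriptorMatcher>   matcher   = new FlannBasedMatcher();

    // 1. Collect descriptors from all reference images and cluster them with k-means.
    BOWKMeansTrainer bowTrainer(500);                // 500 visual words (illustrative)
    for (size_t i = 0; i < referenceImages.size(); i++) {
        std::vector<KeyPoint> kp;
        Mat desc;
        detector->detect(referenceImages[i], kp);
        extractor->compute(referenceImages[i], kp, desc);
        if (!desc.empty())
            bowTrainer.add(desc);
    }
    Mat vocabulary = bowTrainer.cluster();           // the visual dictionary

    // 2. Represent the query image as a histogram of visual words.
    BOWImgDescriptorExtractor bowExtractor(extractor, matcher);
    bowExtractor.setVocabulary(vocabulary);

    std::vector<KeyPoint> kpQuery;
    Mat bowHistogram;
    detector->detect(queryImage, kpQuery);
    bowExtractor.compute(queryImage, kpQuery, bowHistogram);
    return bowHistogram;                             // input for a classifier (e.g. SVM)
}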

Appendix A

Results: Decorative stones detection


Figure A.1: Correlation vs. number of stone detected: (a) “Big Star”; (b) “Small Star”; (c) “Heart”; (d) “Drop”; (e) “Diamond”; (f)
“Ovoid”; (g) “Quadrate”; (h) “Rectangle”; (i) “Donut”.
                    Gray              Canny             Dilation and erosion   findContours()      drawContours()
Stone format    IUE(µs)  DB(µs)   IUE(µs)  DB(µs)    IUE(µs)  DB(µs)         IUE(µs)  DB(µs)     IUE(µs)  DB(µs)    TM(µs)    Total(ms)

Big Star          8000     288     34000    3200      19000     2000           3700      420       69000     400     920000       1060
Small Star        7000     152     35000    1800      15300     1292           4200      360      140000     520     740000        940
Heart             6900     612     34000    7110      15000     4860           4000     1458      132000    2160     460800       4820
Drop              7400     486     35000    5472      18700     3690           3800     1224      124000    1260    3384000       3580
Diamond           5000     333     35000    3411      18000     2295           4200      810      138000     900    2493000       2700
Ovoid             4660     810     33500    6210      15000     3060           4300      945      136000    1215    2079000       2280
Quadrate          4500     360     35600    3200      15400     1880           4500      480      127000     760    1380000       1570
Rectangle         8000     342     31700    3465      11000     1593           4200      630      216000     900    2538000       2800
Donut             3700      80     33500    1140       5400      355           5300      240      478000     560     250000        780

Table A.1: Processing times related to the detection of the first stone in a frame. The abbreviation “DB” refers to the processing time of the database images, “IUE” to the image under evaluation and “TM” to the time needed by the template matching algorithm. These values are approximate and were obtained using a laptop with an AMD APU E450 1.65GHz (dual core).
Stone format   Num. of     Mean        Min time   Max time   Total   Founds   Canny   Canny   Dilate   Erode   Angle
               DB images   time (ms)   (ms)       (ms)       runs             Thr1    Thr2                     step

Small Star        36          792        640        951       169      169      17      17       2       2      2
Heart            180         2799       2543       3089        16       16      41      30       2       2      2
Drop             180         2942       2528       3432        22       22      20      20       1       1      2
Diamond           90         1498       1310       1623        39       39      15      15       2       2      2
Ovoid             90         1540       1310       1825        41       41      25      25       0       0      2
Quadrate          45          714        593        983        79       71      20      20       0       0      2
Rectangle         90         1567       1342       1638        46       46      19      19       3       3      2
Donut              1           40         31        109       604      604      20      20       0       0      2

Table A.2: Data obtained from the integration of the algorithm in the robot prototype. In each detection, only a region of interest (in the frame under evaluation) is taken into account. Each stone in the database is rotated in steps of 2 degrees, forcing a large number of comparisons. The Canny thresholds and the dilate and erode parameters used are listed in the table. Authors: Sergio Martins, sfsm@ua.pt; Joao Silva, jmls.esmoriz@gmail.com. Machine: Intel(R) Core 2 Duo T5800 CPU @ 2.0GHz; RAM: 3GB.
Bibliography

[1] B. E. Bayer. Color imaging array. United States Patent, July 1976.

[2] Cburnett. Wikipedia: Bayer filter, http://en.wikipedia.org/ (last visited: 27/03/2014).

[3] iOS Developer Library. Performing convolution operations, https://developer.apple.com (last visited: 27/03/2014).

[4] OpenCV Documentation. Object Detection - matchTemplate, http://docs.opencv.org (last visited: 27/03/2014).

[5] P. V. C. Hough. Machine analysis of bubble chamber pictures. Proceedings of the Inter-
national Conference on High Energy Accelerators and Instrumentation, pages 554–556,
1959.

[6] R. O. Duda and P. E. Hart. Use of the Hough transformation to detect lines and curves
in pictures. Communications of the Association for Computing Machinery, pages 11–15,
1972.

[7] Mike. Wikipedia: Hough transform, http://en.wikipedia.org/ (last visited: 31/03/2014).

[8] Segmentation: Edge-based segmentation, http://www.engineering.uiowa.edu (last visited: 31/03/2014).

[9] M. Muja and D. G. Lowe. Fast approximate nearest neighbors with automatic algorithm
configuration. International Conference on Computer Vision Theory and Applications
(VISAPP), pages 331–340, 2009.

[10] J. H. Friedman, J. L. Bentley, and R. A. Finkel. An algorithm for finding best matches
in logarithmic expected time. ACM Trans. Math. Softw., 3(3), September 1977.

[11] A. Bocker, S. Derksen, E. Schmidt, and G. Schneider. Hierarchical k-means clustering, Universität Frankfurt am Main, 2004.

[12] M. A. Fischler and R. C. Bolles. Random sample consensus: A paradigm for model fitting
with applications to image analysis and automated cartography. SRI International, 1981.

[13] G. Bradski and A. Kaehler. Learning OpenCV: Computer Vision in C++ with the
OpenCV Library. O’Reilly Media, 2012.

[14] J. Martin and J. L. Crowley. Comparison of correlation techniques. Conference on Intelligent Autonomous Systems (IAS), Karlsruhe, pages 3–5, March 1995.

[15] D. G. Lowe. Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision, 60(2):91–110, November 2004.

[16] I. Rey-Otero and M. Delbracio. Anatomy of the sift method. Image Processing On Line
(IPOL), March 2014.

[17] C. Harris and M. Stephens. A combined corner and edge detector. Proceedings of the
Fourth Alvey Vision Conference, pages 147–151, 1988.

[18] SIFT - Scale Invariant Feature Transform, http://www.aishack.in (last visited: 27/03/2014).

[19] H. Bay, A. Ess, T. Tuytelaars, and L. Van Gool. Speeded-Up Robust Features (SURF). Computer Vision and Image Understanding, 110(3):346–359, June 2008.

[20] K. G. Derpanis. Integral image-based representations. Department of Computer Science and Engineering, York University, page 2, March 2007.

[21] C. Evans. Notes on the OpenSURF Library, University of Bristol. page 4, January 2009.

[22] S. Brahmbhatt. Practical OpenCV. Technology in Action, 2013.

[23] E. Rosten and T. Drummond. Machine learning for high speed corner detection. 9th
European Conference on Computer Vision (ECCV), 1:430–443, 2006.

[24] M. Calonder, V. Lepetit, C. Strecha, and P. Fua. BRIEF: Binary Robust Independent El-
ementary Features. 11th European Conference on Computer Vision (ECCV), Heraklion,
Crete. LNCS Springer, September 2010.

[25] R. Maini and H. Aggarwal. Study and comparison of various image edge detection techniques. International Journal of Image Processing (IJIP), 3(1):3–5.

[26] G. Bradski and A. Kaehler. Learning OpenCV: Computer Vision with the OpenCV
Library. O’Reilly Media, 2008.

[27] A. J. R. Neves, Armando J. Pinho, Daniel A. Martins, and Bernardo Cunha. An efficient
omnidirectional vision system for soccer robots: from calibration to object detection.
Mechatronics, 21(2):399–410, March 2011.

[28] Middle size robot league, rules and regulations for 2014.
http://wiki.robocup.org/images/d/d1/Msl-rules2014.pdf, last visited:
20/05/2014, 2014.

[29] Kai Chen, Dirk Holz, Caleb Rascon, Javier Ruiz-del-Solar, Amirhosein Shantia, Komei Sugiura, Jörg Stückler, and Sven Wachsmuth. RoboCup@Home 2014: Rules and regulations. http://www.robocupathome.org/rules/2014_rulebook.pdf, last visited: 20/05/2014, 2014.

[30] J. Cunha, J. Azevedo, M. Cunha, L. Ferreira, P. Fonseca, N. Lau, C. Martins, A. Neves, E. Pedrosa, A. Pereira, L. Santos, and A. Teixeira. CAMBADA@Home’2013: Team Description Paper. Transverse Activity on Intelligent Robotics IEETA/DETI, RoboCup Symposium 2013, 2013.

[31] E. R. Nascimento, G. L. Oliveira, M. Campos, A. W. Vieira, and W. Schwartz. BRAND: A Robust Appearance and Depth Descriptor for RGB-D Images. IEEE, Comput. Sci. Dept., Univ. Fed. de Minas Gerais, Brazil, 2012.

[32] P. Dias, J. Silva, R. Castro, and A. J. R. Neves. Detection of aerial balls using a kinect
sensor. In RoboCup 2014: Robot Soccer World Cup XVIII, Lecture Notes in Artificial
Intelligence. Springer, 2014.

[33] J. Sivic and A. Zisserman. Video Google: A text retrieval approach to object matching in videos. Proceedings of the International Conference on Computer Vision, 2:1470–1477, 2003.

