
Journal of Advances in Information Technology

ISSN 1798-2340

Volume 3, Number 3, August 2012


Contents

REGULAR PAPERS

Artificial Immune Network Clustering Approach for Anomaly Intrusion Detection .......... 147
Murad Abdo Rassam and Mohd. Aizaini Maarof

Segmentation Based Personalized Web Page Summarization Model .......... 155
K.S. Kuppusamy and G. Aghila

A Wavelet-Wavelet Based Processing Approach for Microcalcifications Detection in Mammograms .......... 162
Salim Lahmiri

Recognition of Tongueprint Textures for Personal Authentication: A Wavelet Approach .......... 168
Salim Lahmiri

The Need for Effective Information Security Awareness .......... 176
Fadi A. Aloul

A New Approach on Cluster based Call Scheduling for Mobile Networks .......... 184
P. K. Guha Thakurta, Saikat Basu, Sayan Goswami, and Subhansu Bandyopadhyay

Energy Efficient Cell Survival and Cell Merging Approaches for Auto-Configurable Wireless Sensor Networks .......... 191
M R Uddin, M A Matin, M K H Foraji, and B Hossain


Artificial Immune Network Clustering Approach
for Anomaly Intrusion Detection
Murad Abdo Rassam
Universiti Teknologi Malaysia,
Faculty of Computer Science and Information Systems,
81310, Skudai, Johor, Malaysia.
E-mail: eng.murad2009@gmail.com

Mohd. Aizaini Maarof
Universiti Teknologi Malaysia,
Faculty of Computer Science and Information Systems,
81310, Skudai, Johor, Malaysia.
E-mail: aizaini@utm.my



Abstract: Many intrusion detection approaches have been developed in the literature. Signature-based approaches must be updated with the latest attack signatures and are therefore impractical against unknown attacks. Anomaly-based approaches, on the other hand, suffer from high false alarm rates as well as low detection rates, and need a labeled dataset to construct the detection profile. In practice, such a labeled dataset cannot be obtained easily. In this paper, we investigate the application of a bio-inspired clustering approach, the Artificial Immune Network, to clustering attacks for intrusion detection systems. To reduce the dimension of the DARPA KDD Cup 1999 dataset, the Rough Set method was applied to obtain the most significant features of the dataset. Then the Artificial Immune Network clustering algorithm, aiNet, was applied to the reduced dataset. The results show that the detection rate was enhanced when the most significant features were used instead of the whole feature set. In addition, they show that the Artificial Immune Network is robust in detecting novel attacks.

Index Terms: IDS, Feature Reduction, Artificial Immune Network, Clustering.
I. INTRODUCTION
Intrusion Detection Systems (IDSs) as defined in [1],
are security tools that like other measures such as
antivirus software, firewalls, and access control schemes,
are intended to strengthen the security of information and
communication systems. The main function of such tools
is to differentiate between normal activities of the system
and any other deviations that could be intrusive.
Two main intrusion detection approaches have been identified in the literature: anomaly intrusion detection and misuse intrusion detection. The former focuses on unusual activity patterns and uses normal behavior patterns to identify any deviation from that behavior. The latter, in contrast, can recognize only known attack patterns and relies on predefined attack signatures.
Many studies in the literature, such as [2, 16, 18], have tackled the IDS problem as a pattern recognition problem, or more precisely as a learning problem. Redundant and irrelevant features need to be removed from learning systems in order to reduce complexity and increase the accuracy of classification [2, 17]. To this end, we need to reduce the representation space of such features in order to meet the requirements of accurate and inexpensive IDSs.
Bello et al. in [3] suggested that feature reduction is necessary to reduce the dimensionality of the training dataset.
Furthermore, they claimed that feature reduction helps to
enhance the speed of data manipulation and improves the
classification rate by reducing the influence of noise.
Many anomaly detection systems have been proposed
in the literature based on different soft computing and
machine learning techniques. Some studies apply single learning techniques, such as neural networks [19], genetic algorithms [20], support vector machines [21], bio-inspired algorithms [22], and many more.
Furthermore, some IDSs mentioned in [9, 24, 25] are
based on ensemble or a combination of different learning
techniques. All of these techniques in particular have been
developed to classify or recognize whether the incoming
network access is normal or an attack.
Inspired by biology, computing models can be designed to make use of the concepts, principles and mechanisms underlying biological systems. Some biologically inspired techniques are evolutionary algorithms, neural networks, molecular computing, quantum computing, and immunological computation. Recently, these bio-inspired systems have been receiving more attention because of their ability to adapt naturally to the environment in which they are applied. One of these systems is the human immune system, which provides the inspiration for solving a wide range of challenging problems [23].
The aim of this paper is to address the impact of feature reduction in designing the anomaly detection system. In addition, it introduces the use of the bio-inspired artificial immune network algorithm (aiNet) to detect novel attacks that have not been seen in the training patterns.
The rest of the paper is organized as follows. Section 2 gives an overview of the methods used in this study, namely rough set and the artificial immune network. Section 3 describes related work in both areas, namely feature reduction and unsupervised immune network clustering. In Section 4, the experiments using the KDD CUP 99 dataset are presented, together with an analysis of the results and a performance comparison against the k-Means method. We conclude the paper in Section 5.
II. BACKGROUND
A. Rough Set Theory
Rough set theory can be defined as a mathematical tool for approximate reasoning in decision support and is particularly well suited for the classification of objects [4]. It has been stated that this tool can also be used for feature reduction and feature extraction. The most attractive characteristic of rough sets is that they deal with inconsistency, uncertainty and incompleteness in data instances by determining an upper and a lower approximation of set membership. Rough sets have been successfully used in the literature as a selection tool to discover data dependencies, find all possible feature subsets, and remove redundant information. More theoretical definitions of rough sets can be found in [5].
B. Artificial Immune Network
Immune network theory was first proposed by Jerne in [6] and has been widely used in the development of Artificial Immune Systems (AIS) [7]. This theory suggests that, for each antibody molecule, there is a portion of its receptor that can be recognized by other antibody molecules. As a result, a communication network can arise within the immune system, called the Immune Network.
Network activation and network suppression are two important characteristics of the immune network. According to de Castro and Timmis [8], the recognition of an antigen by an antibody results in network activation, whereas the recognition of an antibody by another antibody results in network suppression. The antibody Ab2 is said to be the internal image of the antigen Ag, because Ab1 is capable of recognizing both the antigen and Ab2. According to the Immune Network theory, the receptor molecules contained in the surface of the immune cells present markers, named idiotopes, which can be recognized by receptors on other cells [8]. Fig. 1 gives a view of the immune network.

Figure 1. A view of the idiotypic Immune Network [8]

The Artificial Immune Network is a dynamic unsupervised learning method. The Artificial Immune Network model consists of a set of cells called antibodies, interconnected by links with certain strengths. These networked antibodies (the idiotypic network) represent internal images of the pathogens (input patterns) contained in the environment to which the network is exposed. The aiNet immune network algorithm is given below:
1. Load antigen population.
2. Initialize the Immune Network by randomly selecting
an antigen from antigen population as a seed for each
cluster.
3. While the termination condition is not true:
a. For each antigen pattern in the antigen population
i. Present an antigen to the network
ii. Determine the affinity of each antibody in
each cluster to the antigen
iii. Select the n highest affinity antibodies from
the network
iv. For each of these highest affinity antibodies:
1. If its affinity is greater than the affinity threshold d, then:
a. Reproduce the antibody
proportionally to its affinity
b. Each clone undergoes a
mutation inversely
proportional to its affinity.
c. Increase the fitness of those
antibodies
v. End for
vi. If none of the highest affinity antibodies
could bind the antigen then generate a new
cluster by using the antigen as a seed.
b. End for
c. Compute the affinity between antibody-antibody
within each cluster and do suppression.
d. Calculate affinity between cluster-cluster and do
suppression.
e. Delete the antibodies in each cluster whose fitness is less than a threshold f.
4. End while
5. Output each cluster in the network.
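For readers who want to experiment with the procedure, the following Python sketch implements a simplified version of the clustering loop above; the affinity function (a Gaussian of the Euclidean distance), the fixed thresholds, and the cluster bookkeeping are illustrative assumptions rather than the authors' exact implementation.

    # Simplified, illustrative aiNet-style clustering sketch (not the authors' code).
    import numpy as np

    def affinity(a, b):
        # Affinity between two patterns: higher means more similar (assumed form).
        return np.exp(-np.linalg.norm(a - b))

    def ainet_cluster(antigens, n_gen=10, n_best=4, aff_threshold=0.5,
                      suppress_threshold=0.3, seed=0):
        rng = np.random.default_rng(seed)
        # Step 2: initialize the network with one randomly chosen antigen as a cluster seed.
        clusters = [[antigens[rng.integers(len(antigens))].copy()]]
        for _ in range(n_gen):                      # Step 3: main loop
            for ag in antigens:                     # 3.a: for each antigen
                # 3.a.ii: affinity of every antibody (in every cluster) to the antigen
                scored = [(affinity(ab, ag), ci, ab)
                          for ci, cl in enumerate(clusters) for ab in cl]
                scored.sort(key=lambda t: t[0], reverse=True)
                best = scored[:n_best]              # 3.a.iii: n highest-affinity antibodies
                bound = False
                for aff, ci, ab in best:
                    if aff > aff_threshold:         # 3.a.iv.1: reproduce and mutate clones
                        bound = True
                        n_clones = max(1, int(round(aff * 5)))
                        for _ in range(n_clones):
                            clone = ab + rng.normal(scale=(1.0 - aff), size=ab.shape)
                            clusters[ci].append(clone)
                if not bound:                       # 3.a.vi: start a new cluster from the antigen
                    clusters.append([ag.copy()])
            # 3.c / 3.d: suppression; drop antibodies too similar to an already kept one
            for ci, cl in enumerate(clusters):
                kept = []
                for ab in cl:
                    if all(affinity(ab, k) < 1.0 - suppress_threshold for k in kept):
                        kept.append(ab)
                clusters[ci] = kept
            clusters = [cl for cl in clusters if cl]
        return clusters

    if __name__ == "__main__":
        data = np.vstack([np.random.default_rng(1).normal(0, 0.1, (20, 2)),
                          np.random.default_rng(2).normal(1, 0.1, (20, 2))])
        print("number of clusters:", len(ainet_cluster(data)))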
III. RELATED WORK
A. Feature Reduction in Intrusion Detection Systems
In the literature, most approaches for IDSs examine all features of the dataset to detect intrusions. In fact, some of the features may be redundant or may contribute little to the detection process. The purpose of this phase of the study is to identify the important input features in the IDS dataset that contribute to the detection process, and hence improve the efficiency and the effectiveness of our proposed model.
In [9], Chebrolu et al. investigated the performance of two feature reduction algorithms involving Bayesian Networks (BN) and Classification and Regression Trees (CART). In addition, they also investigated an ensemble of BN and CART. Their results indicated that feature reduction is important for designing an IDS that is efficient and effective for real-world detection. The use of Rough Set for feature reduction has been studied by Zhang et al. in [4]. Their study showed that this tool is capable of deriving classification rules to determine the category of an attack in an IDS. However, they did not show which features were used in the classification process.
According to [9], data reduction can be achieved in different ways, such as filtering, data clustering and feature selection. The authors in [10] stated that the capability of anomaly intrusion detection is often hindered by the inability to accurately classify a variation of normal behavior as an intrusion. In addition, that study also noted that network traffic data is huge, which causes a prohibitively high overhead and often becomes a major problem for IDSs. They also demonstrated that the elimination of unimportant and irrelevant features did not significantly lower the performance of the IDS. Chakraborty in [11] argued that the existence of irrelevant and redundant features generally affects the performance of machine learning or pattern classification algorithms in detecting attacks. Hassan et al. in [12] showed that proper selection of the feature set results in better classification performance.
B. Immune Network Clustering
The importance of IDSs lies not in their ability to tackle the huge number of vulnerabilities that have been identified in advance, but in their ability to detect an unknown number of unexposed vulnerabilities that may not be immediately available to experts for analysis and inclusion in the knowledge base [13]. In order to cope with this need, the authors in [13] introduced an unsupervised anomaly detection approach based on clustering. They argued that their approach increases the detection rate for different kinds of unknown attacks.
Generally, labeled data or purely normal data is not readily available, since it is time consuming and expensive to classify it manually. Purely normal data is also very hard to obtain in practice, since it is very hard to guarantee that there are no intrusions while network traffic is being collected [14]. To address these problems, an unsupervised anomaly detection approach using the artificial immune network is proposed, owing to the ability of this bio-inspired algorithm to adapt and to cluster normal and attack data without any prior knowledge.
IV. EXPERIMENTS AND RESULTS
In this section, we start by giving some information about the dataset used to validate our approach. After that, we show the experimental procedure used to implement our approach. The experiments were done in two phases: in the first phase, we apply the rough set tool to reduce the features of the dataset; then the feature subset obtained from the first phase is used as input to the second phase, immune network clustering, to separate normal data from attacks.
A. Dataset
KDD Cup 1999 is the dataset that is used to validate
the proposed approach. It is a common benchmark
dataset usually used by many researchers for evaluation
of intrusion detection techniques.
The original dataset contains 744 MB of data with about 4,940,000 records. However, most researchers deal only with a small part of the dataset (10%) which has been provided for conducting experiments. This 10% subset contains 494,021 records. The dataset has 41 features for each connection record plus one class label. Some features are derived features, which are useful in distinguishing normal connections from attacks. These features are either nominal or numeric.
The attacks in the KDD CUP dataset can be classified into four main categories. A brief description of each class is given below.
- Denial-of-Service (DoS): a class of attacks where an attacker makes some computing or memory resource too busy or too full to respond to requests.
- Probing: a class of attacks where an attacker scans a network to gather information about potential vulnerabilities in the network.
- User to Root (U2R): a class of attacks where an attacker with access to a normal user account on the system later gains root access to the system.
- Remote to User (R2L): a class of attacks where an attacker sends packets to a system over a network remotely and then obtains information about potential vulnerabilities in that system.
B. Feature Reduction using ROSETTA
For validating our proposed approaches, three different
samples of the dataset are used, each of which contains
10,000 instances. The distribution of data and the number
of instances for each class in these samples are shown in
Table 1.
TABLE 1
THE DISTRIBUTION OF ATTACKS IN THE DATA SAMPLES
Normal Probe DoS U2R R2L
2000 684 6907 34 375

After preparing the data samples, rough set is applied to them. Rough set is implemented using the ROSETTA (Rough Set Toolkit for Analysis of Data) system developed by Ohrn [15].
The experimental procedure can be outlined as follows. First, the raw data samples are transformed into tables recognized by ROSETTA. After the preprocessing of the data samples, each data sample is split into two parts, a training set and a testing set, based on a splitting factor determined by the user (e.g., a split factor of 0.4 means that 40% of the data sample is used for training and the remaining 60% for testing).
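As a simple illustration of the split factor (a sketch, not part of ROSETTA), a data sample can be partitioned as follows:

    # Illustrative only: partition a data sample according to a split factor.
    import random

    def split_sample(records, split_factor=0.4, seed=0):
        # Return (training, testing) with split_factor of the records used for training.
        shuffled = records[:]              # copy so the original order is preserved
        random.Random(seed).shuffle(shuffled)
        cut = int(len(shuffled) * split_factor)
        return shuffled[:cut], shuffled[cut:]

    train, test = split_sample(list(range(10000)), split_factor=0.4)
    print(len(train), len(test))           # 4000 6000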
Several algorithms can be used to reduce the data samples, e.g., the GA, Johnson, Holte1R, and Dynamic algorithms. The GA algorithm is used to reduce the data sample features in this study. We are interested in GA because, according to Ohrn [15], it finds minimal hitting sets and gives a smaller number of reducts than Johnson's algorithm.
The set of reducts obtained in this step is used to generate rules using the GA-based algorithm built into the ROSETTA tool. These rules are later used to classify the other part of the data sample, i.e., the testing part. After a number of experiments, the 8 most significant features were obtained, as shown in the following table.

TABLE 2
THE 8 MOST SIGNIFICANT FEATURES OBTAINED BY ROUGH SET IN THREE DIFFERENT SAMPLES OF DATA
Data
Sample
8 most significant features
Sample 1 C E F Y AD AF AG AI
Sample 2 C E F W AG AF AH AJ
Sample 3 C E F Y W AE AI AN

Table 2 shows that all samples share 3 common features, while the remaining features vary in their number of occurrences across samples. Features C, E, and F are common to all samples. Features AF and AG are common to sample 1 and sample 2. Features Y and AI are common to sample 1 and sample 3. Feature W is common to sample 2 and sample 3. Based on this commonality, the 8 most significant features over the three samples, and hence over the whole dataset, are those shown in the following table.

TABLE 3
THE 8 MOST SIGNIFICANT FEATURES OBTAINED BY ROUGH SET
C E F W Y AF AG AI

The corresponding network features and the description
of each feature are shown in table 4.










TABLE 4
THE CORRESPONDING NETWORK FEATURES AND THE DESCRIPTION OF THE OBTAINED FEATURES

Feature label | Corresponding network feature | Description of feature
C  | service                 | Type of service used to connect (e.g. finger, ftp, telnet, SSH, etc.).
E  | src_bytes               | Number of bytes sent from the host system to the destination system.
F  | dst_bytes               | Number of bytes sent from the destination system to the host system.
W  | count                   | Number of connections made to the same host system in a given interval of time.
AF | dst_host_count          | Number of connections from the same host to the destination during a specified time window.
AG | dst_host_srv_count      | Number of connections from the same host with the same service to the destination host during a specified time window.
AI | dst_host_diff_srv_rate  | Number of connections to different services from a destination host.
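To make the feature labels concrete, the following sketch assumes the usual alphabetic labeling of the 41 KDD columns (A for the first feature, B for the second, and so on) and selects the corresponding columns from a connection record; the mapping is an assumption made only for illustration.

    # Illustrative sketch: map the alphabetic feature labels used in the paper to
    # 0-based column indices, assuming A = first KDD feature, B = second, and so on.
    def label_to_index(label):
        idx = 0
        for ch in label:
            idx = idx * 26 + (ord(ch.upper()) - ord('A') + 1)
        return idx - 1   # 0-based

    SELECTED = ["C", "E", "F", "W", "Y", "AF", "AG", "AI"]
    COLUMNS = [label_to_index(l) for l in SELECTED]   # [2, 4, 5, 22, 24, 31, 32, 34]

    def reduce_record(record):
        # Keep only the 8 selected features of a 41-field KDD connection record.
        return [record[i] for i in COLUMNS]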

Besides its use in feature reduction, rough set was also applied to classify the data, in order to evaluate the performance of the rough set classifier before and after feature reduction. The results are shown in the following table.
TABLE 5
THE CLASSIFICATION ACCURACY OBTAINED BY ROUGH SET ON THREE DIFFERENT SAMPLES USING ALL 41 FEATURES

Type           | Sample 1 | Sample 2 | Sample 3 | Mean  | StDv
Normal         | 92.8%    | 95.2%    | 79%      | 92%   | 0.04
Attack: Probe  | 94.3%    | 100%     | 99.3%    | 97.9% | 0.03
Attack: DoS    | 99.9%    | 99.9%    | 100%     | 99.9% | 0.00
Attack: U2R    | 46.7%    | 66.7%    | 26.7%    | 46.7% | 0.20
Attack: R2L    | 92.5%    | 84.3%    | 94%      | 90.2% | 0.05
Table 5 shows the results of classifying the data samples using all 41 features of the dataset. From the table, we can see how the imbalanced classes U2R and R2L are often misclassified. These classes are rare in the main KDD CUP 99 dataset and their ratio in the dataset is very small; the data used in this study were therefore grouped into samples that maintain the original distribution of the main dataset.
After applying the rough set classifier to the dataset with all 41 features, we also applied it to the same data samples with the new reduced feature subset, to see the effect of feature reduction.




TABLE 6
THE CLASSIFICATION ACCURACY OBTAINED BY ROUGH SET ON THREE DIFFERENT SAMPLES USING ONLY THE 8 MOST SIGNIFICANT FEATURES

Type           | Sample 1 | Sample 2 | Sample 3 | Mean  | StDv
Normal         | 93.2%    | 97.5%    | 88.8%    | 93.2% | 0.643
Attack: Probe  | 95.5%    | 94.7%    | 96.4%    | 95.5% | 0.008
Attack: DoS    | 99.4%    | 99.7%    | 99.3%    | 99.4% | 0.002
Attack: U2R    | 34.3%    | 66.7%    | 80%      | 60.3% | 0.235
Attack: R2L    | 85%      | 84.9%    | 99.3%    | 90%   | 0.082

Looking at the classification results obtained with only the 8 most significant features, we notice that there is no great reduction in accuracy for some classes, while there is even an increase in accuracy for others. The reason is that the instances of the classes that occupy most of the data space (i.e., Normal and DoS) have redundant features that do not play any role in detecting these instances.
In addition, the features in these instances are less correlated. As a result, the feature reduction process did not affect the performance of the classifier on these classes. Meanwhile, the instances of the other classes (i.e., R2L and U2R), which are the imbalanced classes, have noisy and uncorrelated features that affect the classification accuracy. Furthermore, these classes contain attacks that are rare in the data space. The feature reduction process plays a role in eliminating the uncorrelated features and hence increases the accuracy of the classifier.
The features obtained by our model are compared with the features selected by Chebrolu et al. in [9] using the Bayesian Network (BN) approach, as shown in Fig. 2 below. We found that the 8 features obtained by our study were among the 12 features selected by their study, namely: C, E, F, L, W, X, Y, AB, AE, AF, AG, AI.

Figure 2. A comparison with BN approach in [9].
C. Immune Network Clustering using aiNet algorithm
In this phase, the same data samples that were used for feature reduction with rough set are used to examine the ability of the aiNet algorithm to cluster different classes of data. In these data samples, the distribution of attack and normal instances is as shown in Table 7.




TABLE 7
THE DISTRIBUTION OF THE NORMAL AND ATTACK INSTANCES IN THE DATA SAMPLES
Sample/Class Normal Probe DoS U2R R2L
All samples 2000 684 6907 34 375

Before the data samples are fed to the immune network model, a normalization process is applied. In the KDD CUP 99 dataset the attributes are either numerical or nominal. During normalization, the nominal attributes are converted into linear discrete values (integers); for example, the ftp protocol is represented by 1 and the http protocol by 2. The attributes then fall into two main types: discrete-valued and continuous-valued features. If one of the features has a large range, it can overpower the other features. Many methods can be used for normalization, such as the distance-based method and the Mean/Median scaling method, among others.
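A minimal sketch of this preprocessing step is given below; the integer codes assigned to nominal values and the min-max scaling are illustrative assumptions, not necessarily the exact normalization used in the experiments.

    # Illustrative preprocessing sketch: encode nominal KDD attributes as integers
    # and rescale numeric attributes so no single feature overpowers the others.
    def encode_nominal(values):
        # Map each distinct nominal value (e.g. 'ftp', 'http') to an integer code.
        codes = {}
        return [codes.setdefault(v, len(codes) + 1) for v in values]

    def min_max_scale(values):
        # Rescale a numeric column to the range [0, 1] (assumed scaling choice).
        lo, hi = min(values), max(values)
        span = (hi - lo) or 1.0              # avoid division by zero for constant columns
        return [(v - lo) / span for v in values]

    print(encode_nominal(["ftp", "http", "ftp", "smtp"]))   # [1, 2, 1, 3]
    print(min_max_scale([0, 5, 10]))                        # [0.0, 0.5, 1.0]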
After setting the parameters of the aiNet algorithm (Ngen = 10, affinity threshold d = 1, suppression threshold σs = 0.3, percentage of clones to be re-selected = 10%, and learning rate = 0.4) and applying it to the data samples, the results are as shown in Table 8.
TABLE 8
CLUSTERING RESULTS OBTAINED BY AINET CLUSTERING

Sample/Class | Normal | Probe | DoS  | U2R | R2L
Sample 1     | 3580   | 1598  | 4776 | 9   | 37
Sample 2     | 3495   | 640   | 5823 | 6   | 36
Sample 3     | 3420   | 1590  | 4949 | 5   | 36
Table 8 shows the result of clustering the data samples into 5 clusters using aiNet, where each class in the data sample is represented by one separate cluster. Comparing Table 7 with Table 8, we see that for each class the actual distribution of the data differs from the resulting clusters. This is common to all clustering methods, because clustering depends on the distances between data instances, and in our dataset there are similarities between normal traffic and attacks, and also between the attacks themselves. These similarities make it difficult to clearly differentiate between normal and attack instances.
The results shown in Table 8 can also be presented in a binary classification format, segregating normal traffic from anomalies (attacks), as shown in Table 9.

TABLE 9
THE RESULT OF CLUSTERING DATA SAMPLES INTO TWO CATEGORIES (NORMAL AND ANOMALIES)

Sample   | Normal | Anomalies (attacks)
Sample 1 | 3580   | 6420
Sample 2 | 3495   | 6505
Sample 3 | 3420   | 6580

Binary-classification representation is useful especially
in obtaining detection rate (DR) and false positive rate
(FPR). Table 10 shows DR and FPR based on our
experiments on three sample sets.

TABLE 10
DETECTION RATE AND FALSE POSITIVE RATE FOR THE CLUSTERING PROCESS DONE BY THE AINET ALGORITHM

Sample   | Detection Rate | False Positive Rate
Sample 1 | 80.25%         | 0.1975
Sample 2 | 81.31%         | 0.1868
Sample 3 | 82.25%         | 0.1775
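For clarity, DR and FPR can be computed from the binary assignments as in the following sketch (an illustrative calculation assuming DR = detected attacks / total attacks and FPR = normal instances flagged as attacks / total normal instances):

    # Illustrative sketch: detection rate and false positive rate from binary labels.
    def dr_fpr(true_labels, predicted_labels):
        # Labels are 'attack' or 'normal'; returns (detection_rate, false_positive_rate).
        pairs = list(zip(true_labels, predicted_labels))
        tp = sum(1 for t, p in pairs if t == "attack" and p == "attack")
        fn = sum(1 for t, p in pairs if t == "attack" and p == "normal")
        fp = sum(1 for t, p in pairs if t == "normal" and p == "attack")
        tn = sum(1 for t, p in pairs if t == "normal" and p == "normal")
        dr = tp / (tp + fn) if (tp + fn) else 0.0
        fpr = fp / (fp + tn) if (fp + tn) else 0.0
        return dr, fpr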

The relation between FPR and DR can be expressed using a ROC curve. The following figure shows the ROC curve for sample 1 of the dataset.

Figure 3. ROC curve for sample 1

From Figure 3, we find that the FPR is quite low compared with other approaches used for intrusion detection, as discussed later in the analysis. Based on the above results, aiNet seems robust enough to distinguish attacks from normal traffic. It is shown that aiNet can cluster attacks in the absence of labels and without any prior knowledge.
To further evaluate the performance of aiNet, a comparison was made with k-Means, a clustering method commonly used in many fields including intrusion detection. We applied the k-Means algorithm to the same data samples. Before applying k-Means, k, the number of clusters, has to be set, and the seeds of the k clusters are then chosen randomly.
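As a sketch only, assuming the preprocessed sample is available as a NumPy array X and using the scikit-learn implementation rather than the one used in this study, the k-Means baseline can be reproduced along the following lines:

    # Illustrative k-Means baseline: 5 clusters (Normal, Probe, DoS, U2R, R2L),
    # randomly seeded, applied to the preprocessed feature matrix X.
    import numpy as np
    from sklearn.cluster import KMeans

    X = np.random.rand(10000, 8)          # placeholder for the 8-feature data sample
    kmeans = KMeans(n_clusters=5, init="random", n_init=10, random_state=0)
    cluster_ids = kmeans.fit_predict(X)   # one cluster id (0..4) per connection record
    print(np.bincount(cluster_ids))       # how many records fell into each cluster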
The following figure shows the ROC curve for the performance of the k-Means method. It shows the relation between the DR and the FPR.

Figure 4. ROC curve for sample 1 of group 2 using k-Means.

Fig. 4 suggests that k-Means has a high FPR and a relatively low DR. This is due to the nature of intrusion detection data, where the distribution of attacks among the different classes is not balanced and there are similarities between instances from different classes. The results also indicate that k-Means, which relies heavily on a distance measure, may assign data poorly to the right clusters.


Figure 5. The comparison between the ROC curves of both aiNet and
K-Means methods for the same data sample.

In Fig. 5, we show the ROC comparison between
aiNet algorithm and the K-Means algorithm to indicate
the performance of aiNet relative to k-Means. We see
that aiNet performs better than K-Means in both
performance measures, DR and FPR.
Further investigation of the aiNet characteristics shows that aiNet is also efficient at compressing the dataset. Figure 6 shows the tradeoff between the suppression threshold σs and the number of output cells that aiNet produces at the end of the clustering process.

Figure 6. The tradeoff between the suppression threshold and the number of output cells produced by aiNet.

Fig. 6 shows that, besides its ability to cluster attacks, aiNet also has the ability to compress the dataset, which makes it more suitable for large-scale datasets.
V. CONCLUSION AND FUTURE WORK
In this paper, the use of a proper feature reduction tool was found to enhance detection accuracy and reduce false alarms in IDSs. In addition, novel unknown attacks that have not been seen during training can be detected using a bio-inspired immune network clustering approach. The experimental results show that detection accuracy was improved by using rough set for feature reduction. Meanwhile, the problem of detecting novel attacks was addressed by the artificial immune network clustering algorithm (aiNet). To demonstrate the viability of our approach, a comparison with the k-Means clustering approach was carried out and showed that our approach gives better results in terms of the detection accuracy of novel attacks. The findings also show that the immune network clustering approach is robust in detecting novel attacks in the absence of labels.
To make the usage of aiNet easier, our future work will include the automatic setting of its parameters. Another direction for future studies is to examine the feasibility of using a semi-supervised approach instead of an unsupervised one, introducing some labels into the clustering approach to enhance the detection accuracy.
REFERENCES
[1] P. García-Teodoro, J. Díaz-Verdejo, G. Maciá-Fernández, and E. Vázquez, "Anomaly-based network intrusion detection: Techniques, systems and challenges," Computers & Security, Vol. 28, Issue 1-2, pp. 18-28, March 2009.
[2] A. Zainal, M.A. Maarof, and S.M. Shamsuddin, "Feature selection using rough set in intrusion detection," in Proc. IEEE TENCON, 2006.
[3] R. Bello, Y. Caballero, A. Nowé, Y. Gómez, and P. Vrancx, "A model based on ant colony system and rough set theory for feature selection," in GECCO'05, Washington DC, United States, pp. 275-276, June 25-29, 2005.
[4] L. Zhang, G. Zhang, L. Yu, J. Zhang, and Y. Bai, "Intrusion detection using rough set classification," Journal of Zhejiang University Science, pp. 1076-1086, 2004.
[5] Z. Pawlak, Rough Sets: Theoretical Aspects of Reasoning about Data. Kluwer Academic Publishers, 1991.
[6] N.K. Jerne, "Towards a network theory of the immune system," Ann. Immunol. (Paris), 1974.
[7] L.N. de Castro and J. Timmis, "Artificial immune systems as a novel soft computing paradigm," Soft Computing, Vol. 7, pp. 526-544, Springer-Verlag, 2003.
[8] L.N. de Castro and J. Timmis, "An artificial immune network for multimodal function optimization," in Proceedings of the 2002 Congress on Evolutionary Computation, Vol. 1, pp. 699-704, 2002.
[9] S. Chebrolu, A. Abraham, and J.P. Thomas, "Feature deduction and ensemble design of intrusion detection systems," Computers & Security, Vol. 24, Issue 2, pp. 295-307, 2004.
[10] A.H. Sung and S. Mukkamala, "The feature selection and intrusion detection problems," LNCS, Vol. 3321, Springer, Heidelberg, pp. 468-48, 2004.
[11] B. Chakraborty, "Feature subset selection by neuro-rough hybridization," LNCS, Springer, Heidelberg, pp. 519-526, 2005.
[12] A. Hassan, M.S. Nabi Baksh, A.M. Shaharoun, and H. Jamaluddin, "Improved SPC chart pattern recognition using statistical features," International Journal of Production Research, Vol. 41, Issue 7, pp. 1587-1603, 2003.
[13] S. Zanero, "Improving self organizing map performance for network intrusion detection," in SDM 2005 Workshop on Clustering High Dimensional Data and its Applications, 2005.
[14] K. Leung and C. Leckie, "Unsupervised anomaly detection in network intrusion detection using clusters," in Proceedings of the 28th Australasian Computer Science Conference, The University of Newcastle, Australia, 2005.
[15] A. Ohrn and J. Komorowski, "A rough set toolkit for analysis of data," in Proceedings of the Third Joint Conference on Information Sciences, Vol. 3, pp. 403-407, USA, 1997.
[16] G. Liu, Z. Yi, and S. Yang, "A hierarchical intrusion detection model based on the PCA neural networks," Neurocomputing, Vol. 70, pp. 1561-1568, 2007.
[17] D.M. Farid, N. Harbi, and M.Z. Rahman, "Combining Naïve Bayes and decision tree for adaptive intrusion detection," International Journal of Network Security & Its Applications (IJNSA), Vol. 2, No. 2, pp. 12-25, 2010.
[18] L. Deng and D.Y. Gao, "Research on immune based adaptive intrusion detection system model," in Proceedings of the IEEE International Conference on Networks Security, Wireless Communications and Trusted Computing, pp. 488-491, 2009.
[19] A.K. Ghosh, J. Wanken, and F. Charron, "Detecting anomalous and unknown intrusions against programs," in Proceedings of the 1998 Annual Computer Security Applications Conference (ACSAC'98), December 1998.
[20] T. Shon, Y. Kim, C. Lee, and J. Moon, "A machine learning framework for network anomaly detection using SVM," in Information Assurance Workshop (IAW'05), 2005.
[21] D. Kim and J. Park, "Network-based intrusion detection with support vector machines," Lecture Notes in Computer Science, pp. 747-756, Springer, 2003.
[22] D. Dasgupta, "Immunity-based intrusion detection system: A general framework," in Proc. of the 22nd NISSC, 1999.
[23] X. Hang and H. Dai, "An immune network approach for web document clustering," in Proc. of WI, pp. 278-284, 2004.
[24] Y. Yasami, S. Khorsandi, S.P. Mozaffari, and A. Jalalian, "An unsupervised network anomaly detection approach by k-Means," in IEEE Symposium on Computers, 2008.
[25] X. Cheng, Y.P. Chin, and S.M. Lim, "Design of multiple-level hybrid classifier for intrusion detection system using Bayesian clustering and decision trees," Pattern Recognition Letters, Vol. 29, Issue 7, pp. 918-924, May 2008.

















Murad A. Rassam is currently a
Ph.D. student in the Information
Assurance & Security Research
Group (IASRG) at Universiti
Teknologi Malaysia in Skudai, Johor,
Malaysia. He received his B.Sc. in
Information Technology Engineering
from Tishreen University, Lattakia,
Syria, in 2005. He received his M.Sc. degree in Computer Science from Universiti Teknologi Malaysia, Skudai, Johor, Malaysia, in 2010. His research
interests include wireless sensor network intrusion detection,
network intrusion detection, and application of soft computing
techniques and machine learning to computer and network
security. He is involved as a reviewer for some international
journals.



Mohd A. Maarof, Ph.D., is a Professor at the Faculty of Computer Science and Information Systems, Universiti Teknologi Malaysia (UTM). He obtained his B.Sc. (Computer Science) and M.Sc. (Computer Science) in the U.S.A. and his Ph.D. from Aston University, Birmingham, United Kingdom, in the area of Information Technology (IT) Security. He is currently leading the Information Assurance & Security Research Group (IASRG) at UTM. His current research interests include Intrusion Detection Systems, Malware Detection, Web Content Filtering, and Cryptography.






Segmentation Based, Personalized Web Page
Summarization Model

K.S.Kuppusamy
Department of Computer Science, School of Engineering and Technology, Pondicherry University, Pondicherry, India
Email: kskuppu@gmail.com

G.Aghila
Department of Computer Science, School of Engineering and Technology, Pondicherry University, Pondicherry, India
Email: aghilaa@yahoo.com



Abstract: The process of web page summarization differs from traditional text summarization due to the inherent features of the structure of web pages compared with normal documents. This paper proposes a model for web page summarization based on a segmentation approach. The proposed model performs an inclusive summarization by representing entities from different portions of the web page, resulting in a miniature of the original page termed the Micro-page. With the incorporation of personalization in the summarization process, the micro-page can be tailored to the user's preferences. The empirical validation of the proposed model is carried out with the help of a prototype implementation, which shows encouraging results.

Index Terms: information retrieval, segmentation, web page summarization, personalization

I. INTRODUCTION
Grasping a lengthy document can be made simpler and faster by summarization. Web pages are a special kind of document with inherent additional features such as hyperlinks, visual markup and meta tags. The summarization approaches followed for traditional text documents need to be enriched with additional components so that they harness the features present in web pages, resulting in an enhanced output.
This paper proposes a model for web page
summarization based on the web page segmentation. The
segmentation process splits a web page into various
distinct portions. The summary would be more effective
if it includes representative items from these portions.
The proposed model encapsulates this benefit by creating
the summary as a bottom-up process from the segment
level to the page level.
Incorporating personalization in the summarization stage makes it possible to render user-specific results. Two different users looking at the same web page might expect different summaries based on various factors, including their areas of interest.
The proposed model captures the user interest in the
form of profile-keywords. These profile-keywords would
provide a valuable input during the summary creation
process.
The objectives of this research work include the
following:
- Proposing a model for web page summarization
based on the segmentation approach.
- Enhancing the proposed model with the inclusion
of personalization.
- Validation of the proposed segmentation based
personalized web page summarization model
with the help of prototype implementation.

The remainder of this paper is organized as follows:
Section II would provide the related works carried out in
this domain, which formed the basic motivation to
propose this model. Section III would illustrate the
proposed model and the algorithms. Section IV is about
the experiments conducted on the prototype
implementation and the result analysis. Section V would
illustrate the conclusions and future directions for this
work.
II. RELATED WORKS
This section would highlight the related works carried
out in this domain. The proposed model includes three
major active research domains which are as listed below:
- Web Page Segmentation
- Personalization
- Web page summarization
A. Web Page Segmentation
Web page segmentation is an active research topic in
the information retrieval domain in which a wide range of
experiments are being conducted. Web page
segmentation is the process of dividing a web page into
smaller units based on various criteria. The following are
four basic types of web page segmentation methods [1]:

- Fixed length page segmentation
- DOM based page segmentation
- Vision based page segmentation
- Combined / Hybrid method
A comparative study of these four types of segmentation is presented in [1]. Each of the above-mentioned segmentation methods has been studied in detail in the literature. Fixed length page segmentation is simple and less complex in terms of implementation, but its major problem is that it does not consider any semantics of the page while segmenting. In DOM-based page segmentation, the HTML tag tree (Document Object Model) is used during segmentation. An approach based on arbitrary passages is given in [2]. Vision based page segmentation (VIPS) parallels the way a human views a page. VIPS [3] is a popular segmentation algorithm which segments a page based on various visual features.
Apart from the above-mentioned segmentation methods, a few novel approaches have evolved during the last few years. An image processing based segmentation approach is illustrated in [4]. A segmentation process based on the text density of the contents is explained in [5]. A graph theory based approach to segmentation is presented in [6].

B. Personalization
Personalization is the process of customizing content based on user requirements and preferences. Many research works personalize based on user feedback. The work presented in [7] proposes a method which utilizes the experience of earlier usage. Generally, personalized result rendering is based upon feedback from end-users. There are two types of feedback, as listed below:
- Explicit feedback
- Implicit feedback
In the explicit feedback mechanism the user has to explicitly indicate the relevant and non-relevant items. In the case of implicit feedback, it is gathered automatically based on the actions performed by the user; the user is not required to explicitly mark an item as relevant or irrelevant. Both types of feedback are discussed in [8], [9], and [10].
An automatic personalization system based on usage mining is described in [11]. Aggregate usage profile based web personalization is explored in [12].

C. Web page summarization
Web page summarization is a sub-domain of text summarization, which is also an active research area. The process of summarization can be broadly divided into two types, as listed below:
- Extractive summarization
- Abstractive summarization
In extractive summarization, candidate sentences are chosen from the original text to form the summary. In the abstractive approach, novel sentences are created based on the semantics. The latter approach is more complicated and employs various Natural Language Processing (NLP) techniques [13].
The research work explained in [14] falls under the extractive summarization technique. In extractive summarization, candidate sentences are chosen based on their ranks; sentences with higher ranks are chosen as part of the summary according to the compression ratio.
There are certain additional features associated with web pages compared with normal documents, so a summarizer for web pages needs to exploit these features to provide a better summary. An approach based on the usage of click-through data while summarizing web pages is provided in [15].
The approach illustrated in [16] utilizes the hyperlinks in web pages to enhance the summarization process.

III. THE MODEL
This section illustrates the proposed model for web page summarization using segmentation. Fig. 1 shows the proposed model with its various components.
The proposed model receives the source web page as input. This source web page is segmented in order to carry out the summarization, as shown in (1).

P = {S_1, S_2, S_3, ..., S_n}   (1)

The segmentation is carried out so that the segments cover the entire page and there is no overlap among the segments. This is expressed in (2), (3) and (4) as two criteria.

Criteria 1: During segmentation the components are selected such that they are non-overlapping.

s_i ∩ s_j = ∅,  i ≠ j,  i, j ∈ {1, ..., k}   (2)

Criteria 2: Segmentation incorporates all parts of the web page.

s_1 ∪ s_2 ∪ s_3 ∪ ... ∪ s_k = P   (3)

P = ∪_{i=1}^{k} s_i   (4)

The above two criteria ensure that all portions of the web page are covered and that there is no overlap among the segments.
The summarizer takes these segments as input, and the summarization process is carried out on each segment individually. The summarization task on each segment incorporates four critical factors. They are as listed below:
- Segment Weight
- Luhn's Significance Factor
- Profile Keywords
- Compression ratio
The summarization of each segment is based on this quadruple, as shown in (5):

Ξ = (ω, Λ, K, η)   (5)

The four parameters in (5) represent the above four factors: the segment weight ω, the Luhn significance factor Λ, the profile keywords K, and the compression ratio η.









































Figure 1. The Proposed Web Page Summarizer Model

A. Segment Weight
To calculate the segment weight, a customized version of our earlier model [17] is used. The segment weight of each segment is calculated as the sum of four different weights, as shown in (6):

ω(s_i) = L + T + V + M   (6)

where
- L represents the link weight
- T represents the theme weight
- V represents the visual feature weight
- M represents the image weight

To calculate the above four weight factors, the following steps are followed.

Step 1: Remove the stop words from the web page:

P = P \ {w_1, w_2, ..., w_n}   (7)

where w_1, ..., w_n are the stop words.

Step 2: Sort the words in descending order of word occurrence. The occurrence count is indicated by the | | operator in (8):

P = {t_1, t_2, ..., t_n},  |t_i| ≥ |t_{i+1}|,  i = 1, ..., n-1   (8)

Step 3: Take the top N terms from the above list, termed the page seed array, as shown in (9). The value of N can be selected so that it reflects the top ten percent of the extracted terms:

φ = {t_1, t_2, ..., t_N}   (9)

The number of terms matching between the specified component and the page seed array is used to calculate the four weights specified in (6).

In addition to the terms selected from the content of the page, the keywords in the meta tag are also added to the page seed array φ. This component is added because the meta keywords are a good indicator of the theme of the document, so including them enriches the quality of the summarization process:

φ = φ ∪ {μ_1, μ_2, ..., μ_n}   (10)

where μ_1, ..., μ_n are the meta-tag keywords.

After the construction of the page seed array, the remaining weight components can be calculated based on it. The link weight is calculated as shown in (11):

L = { |l_i ∩ φ| + |syn(l_i) ∩ syn(φ)| / 2 }   (11)

where l_i represents the terms in the anchor text of an individual link and syn denotes the synonym operation. This process assigns a weight to each link tag; the top n links with the maximum weight are then chosen.
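A minimal sketch of this overlap-based weighting is given below; the tokenization and the synonym lookup (a stub here) are assumptions made only for illustration.

    # Illustrative sketch of the overlap-based link weight in (11).
    def synonyms(terms):
        # Stub synonym expansion; a real system would query a thesaurus such as WordNet.
        return set(terms)

    def link_weight(anchor_terms, page_seed):
        anchor, seed = set(anchor_terms), set(page_seed)
        direct = len(anchor & seed)                              # exact matches with the seed array
        expanded = len(synonyms(anchor) & synonyms(seed)) / 2.0  # synonym matches, half weight
        return direct + expanded

    print(link_weight(["intrusion", "detection"], ["detection", "network", "attack"]))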

The image weight is calculated as shown in (12):

M = { |m_i ∩ φ| + |syn(m_i) ∩ syn(φ)| / 2 }   (12)

where m_i represents the terms in the alt attribute of an image present in the segment. The image with the maximum weight can be chosen for the summary.

The visual feature weight is calculated as shown in (13):

V = { |e_i ∩ φ| · |vf| + (|syn(e_i) ∩ syn(φ)| · |vf|) / 2 }   (13)

where e_i represents the HTML elements present in the segment and |vf| represents the weight associated with the visual element. This visual feature weight gives more weight to elements that have additional visual emphasis. For example, text that appears inside a bold tag should be given more weight than normal text.

The title of a page is a good indicator of the theme of the page, so the contents of the segment that match the title should be given more importance for inclusion in the summary. This is depicted in (14):

T = { |e_i ∩ t| + |syn(e_i) ∩ syn(t)| / 2 }   (14)

In (14), t represents the set of terms in the title of the page and e_i represents the elements in the segment.

B. Profile Keywords
The profile keywords form a set that holds the keywords representing the user's areas of interest. The inclusion of profile keywords in the summarization process makes the summary tailor-made according to the preferences of the user.
The elements of the segment which contain terms from both the page seed array and the profile keywords should be given more weight in the summary creation process:

Φ = { |e_i ∩ φ ∩ K| + |syn(e_i) ∩ syn(φ) ∩ syn(K)| / 2 }   (15)

In (15), K represents the set of profile keywords.
C. Luhn's Significance Factor
Luhn's algorithm [18] for automatic summarization is a well-known statistics-based summarization method. Luhn's formula is used to calculate the importance of the sentences in a document based on the distance between important words in that document.
The proposed model uses Luhn's significance factor to select important sentences from the segment:

Ψ = { LS(S_i) }   (16)

In (16), LS represents Luhn's significance factor. The set Ψ holds the Luhn significance factor of each sentence in the segment.
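For illustration, a common formulation of Luhn's significance factor (the squared count of significant words in a cluster divided by the span of the cluster, with a maximum gap of four words assumed here) can be sketched as follows:

    # Illustrative sketch of Luhn's significance factor for one sentence.
    def luhn_significance(sentence_words, significant_words, max_gap=4):
        # Score = (significant words in cluster)^2 / cluster span, maximized over clusters.
        positions = [i for i, w in enumerate(sentence_words) if w in significant_words]
        if not positions:
            return 0.0
        best = 0.0
        cluster = [positions[0]]
        for pos in positions[1:] + [None]:
            if pos is not None and pos - cluster[-1] <= max_gap:
                cluster.append(pos)                  # still within the allowed word gap
            else:
                span = cluster[-1] - cluster[0] + 1  # words covered by this cluster
                best = max(best, len(cluster) ** 2 / span)
                if pos is not None:
                    cluster = [pos]
        return best

    words = "the immune network clusters attacks without labels in network traffic".split()
    print(luhn_significance(words, {"immune", "network", "attacks", "traffic"}))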

D. Compression Ratio
The compression ratio is an important factor that decides the final contents of the micro-page. In all the above steps, different sets have been derived, with weights associated with their elements. The compression ratio decides the number of entities to be selected from each of the derived sets.
The final summary is formed by selecting the top ranked items, whose count is decided by the compression ratio:

counts = { η|L|/100, η|T|/100, η|V|/100, η|M|/100, η|Φ|/100, η|Ψ|/100 }   (17)

The compression ratio η is multiplied by the number of items in each set and divided by one hundred. The resultant value is used as the threshold to select the top n items of the individual set.
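A small illustrative computation of these counts is shown below; the rounding and the minimum of one item per set are assumptions made for the sketch.

    # Illustrative sketch of (17): how many items to keep from each derived set.
    def items_to_keep(set_sizes, compression_ratio):
        # set_sizes maps a set name to its size; returns the number of top items per set.
        return {name: max(1, round(compression_ratio * size / 100))
                for name, size in set_sizes.items()}

    print(items_to_keep({"L": 14, "T": 25, "V": 25, "M": 4, "Phi": 10, "Psi": 25}, 20))
    # {'L': 3, 'T': 5, 'V': 5, 'M': 1, 'Phi': 2, 'Psi': 5}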
The algorithm for the above specified model is as
given below:






















































Algorithm SegmentSummarize
Input: Page P, Segment S_i, Profile Keywords K, Compression ratio η, Page Seed Array φ
Output: Segment Summary segsum

Begin
  For each link l_i in the segment
    Linkweight[l_i] = |l_i ∩ φ| + |syn(l_i) ∩ syn(φ)| / 2
  End for
  For each image m_i in the segment
    Imageweight[m_i] = |m_i ∩ φ| + |syn(m_i) ∩ syn(φ)| / 2
  End for
  // calculate the visual feature weight
  For each element e_i in the segment
    Visualweight[e_i] = |e_i ∩ φ| · |vf| + (|syn(e_i) ∩ syn(φ)| · |vf|) / 2
  End for
  For each element e_i in the segment
    Themeweight[e_i] = |e_i ∩ t| + |syn(e_i) ∩ syn(t)| / 2
  End for
  // calculate the Luhn significance factor for the segment sentences
  For each sentence sn_i in the segment
    Ψ[sn_i] = LS(sn_i)
  End for
  // identify the number of elements to be present in the summary
  links_count   = η * |L| / 100
  image_count   = η * |M| / 100
  vf_count      = η * |V| / 100
  theme_count   = η * |T| / 100
  profile_count = η * |Φ| / 100
  luhns_count   = η * |Ψ| / 100

  segsum = segsum + topn(links[links_count])
  segsum = segsum + topn(images[image_count])
  segsum = segsum + topn(V[vf_count])
  segsum = segsum + topn(T[theme_count])
  segsum = segsum + topn(Φ[profile_count])
  segsum = segsum + topn(Ψ[luhns_count])

  return segsum
End

The SegmentSummarize algorithm uses the page seed array φ as an input; the BuildPageSeedArray algorithm given below is used to construct the seed elements. The micro-page builder component receives the segsum produced by the SegmentSummarize algorithm as input and builds the target summary page.

Algorithm BuildPageSeedArray
Input: Page URL PU
Output: Page Seed Array φ

Begin
  // Fetch the page contents for PU
  P = fetch_contents(PU)

  // Extract the keywords from the meta tag
  M = {μ_1, μ_2, ..., μ_n}

  // Remove stop words from P
  P = P \ {w_1, w_2, ..., w_n}

  // Calculate the frequency of occurrence of each word
  freq_array = |occurrence(P)|

  // Sort the array in descending order
  freq_array = sort_descending(freq_array)

  // Fetch the top 10% of items from freq_array
  n = count(freq_array)
  for index = 0 to n * 0.1
    top[index] = freq_array[index]

  // Merge the array top with the meta tag keywords array
  φ = top ∪ M

  return φ
End
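A compact Python sketch of the BuildPageSeedArray procedure is given below; the tokenization, the toy stop-word list, and the use of plain text input instead of a fetched HTML page are simplifying assumptions.

    # Illustrative sketch of BuildPageSeedArray: top ~10% frequent terms plus meta keywords.
    from collections import Counter

    STOP_WORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "is", "for"}  # toy list

    def build_page_seed_array(page_text, meta_keywords):
        words = [w.lower() for w in page_text.split() if w.lower() not in STOP_WORDS]
        freq = Counter(words)                              # frequency of each remaining word
        ranked = [w for w, _ in freq.most_common()]        # sorted in descending frequency
        top = ranked[:max(1, int(len(ranked) * 0.1))]      # top 10% of the distinct terms
        return set(top) | {k.lower() for k in meta_keywords}

    seed = build_page_seed_array(
        "Immune network clustering detects network attacks in network traffic",
        ["intrusion", "detection"])
    print(seed)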


IV. EXPERIMENTS AND RESULTS
This section highlights the experimental setup used for the validation of the above model and algorithms. The prototype implementation uses a software stack consisting of Linux, Apache, MySQL and PHP; JavaScript is used for client-side scripting. With respect to the hardware, a dual-processor system with a 3 GHz clock speed and 4 GB of RAM is used. The internet connection used in the experimental setup is a 64 Mbps leased line.
Fig. 2 shows a sample page to be summarized. The screenshot in Fig. 3 shows the user interface of the proposed system, with a text box to get the URL from the user, a combo box with compression ratio values listed in multiples of 10, and a command button to initiate the summarization. The micro-page is displayed in the same screen once the server finishes the task and sends the output back to the client.


















Figure 2. The Source Web Page




















Figure 3. The Micro Page


To validate the proposed model, a set of experiments was conducted. The values listed in Table I correspond to twenty sample experiments for User I. The column SSP indicates Segments in Source Page, SMP indicates Segments in Micro-Page, ISP indicates Images in Source Page, IMP indicates Images in Micro-Page, LSP indicates Links in Source Page and LMP indicates Links in Micro-Page.




TABLE I. EXPERIMENTAL RESULTS USER I
Page SSP SMP ISP IMP LSP LMP
1 25 20 4 2 14 7
2 27 22 3 1 12 5
3 15 14 6 4 14 6
4 10 8 5 5 12 5
5 5 5 2 0 14 6
6 17 16 6 3 15 11
7 12 8 4 1 25 12
8 8 6 3 3 12 8
9 11 9 1 1 10 6
10 12 11 0 0 11 8
11 24 20 3 2 14 7
12 25 22 2 1 12 5
13 13 10 6 5 13 6
14 8 6 4 4 11 4
15 6 5 2 1 12 5
16 17 16 5 3 1 1
17 11 8 4 2 25 17
18 7 6 2 2 11 8
19 11 9 1 1 11 7
20 11 11 4 3 12 8

The experimental result analysis for User I is charted in Fig. 4. Since personalization is incorporated into the summarization process, users with different interests get different summaries. The purpose of the experiments is to check that the micro-page represents elements from various segments of the source page. The mean of the SSP and SMP values establishes that relevant portions of the majority of segments are carried over into the micro-page.













Figure 4. The Mean Analysis for User I

In Fig. 4, the values listed on top of SSP and SMP indicate the means after clustering. The data set is clustered into various groups. This clustering is done to illustrate that the proposed model works well both for pages with few segments and for pages with many segments. With respect to ISP and IMP, only the images satisfying the filtering criteria become part of the micro-page. A similar criterion applies to LSP and LMP as well.
For Users II and III the results are charted in Table II and Table III respectively.
TABLE II. EXPERIMENTAL RESULTS USER II
Page SSP SMP ISP IMP LSP LMP
1 25 21 4 4 14 10
2 27 25 3 2 12 6
3 15 13 6 5 14 7
4 10 9 5 5 12 6
5 5 5 2 1 14 8
6 17 13 6 4 15 12
7 12 10 4 2 25 13
8 8 7 3 2 12 8
9 11 9 1 0 10 7
10 12 11 0 0 11 8
11 24 21 3 1 14 13
12 25 23 2 1 12 11
13 13 11 6 4 13 8
14 8 7 4 3 11 4
15 6 4 2 2 12 11
16 17 15 5 5 1 1
17 11 10 4 3 25 10
18 7 5 2 1 11 6
19 11 10 1 0 11 4
20 11 9 4 2 12 9
TABLE III. EXPERIMENTAL RESULTS USER III
Page SSP SMP ISP IMP LSP LMP
1 25 20 4 3 14 12
2 27 24 3 1 12 6
3 15 11 6 4 14 8
4 10 8 5 4 12 10
5 5 4 2 0 14 11
6 17 16 6 5 15 13
7 12 11 4 3 25 13
8 8 6 3 1 12 8
9 11 10 1 0 10 6
10 12 10 0 0 11 8
11 24 22 3 2 14 12
12 25 22 2 1 12 10
13 13 11 6 5 13 6
14 8 6 4 2 11 10
15 6 5 2 1 12 8
16 17 15 5 4 1 1
17 11 8 4 3 25 14
18 7 6 2 1 11 5
19 11 8 1 0 11 7
20 11 9 4 2 12 10












Figure 5. The Mean Analysis for all three users


From the values listed in the three tables above, Table I, Table II and Table III, it can be observed that the values of SMP, LMP and IMP vary across users. The user profile-keywords thus play an important role in the summary creation process of the proposed model.
V. CONCLUSIONS AND FUTURE DIRECTIONS
The proposed model summarizes a page incorporating
both segmentation as well as personalization. The derived
conclusions are as listed below:
- The web page summarization process carried out
by associating segmentation creates a
representative micro page which incorporates
items from various portions of the web page.

- The summaries generated can be tailor made to
suit the needs and preferences of the user.

The future directions for this research work are as
listed below:
- Making the personalization more effective by following an ontology based data representation instead of the profile-keyword approach.
- Extending this work to include languages other
than English.
REFERENCES
[1] Deng Cai, Shipeng Yu, Ji-Rong Wen, and Wei-Ying Ma, "Block-based web search," in SIGIR '04: Proceedings of the 27th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 456-463, New York, NY, USA, 2004. ACM.
[2] M. Kaszkiel and J. Zobel, "Effective ranking with arbitrary passages," Journal of the American Society for Information Science, Vol. 52, No. 4, pp. 344-364, 2001.
[3] D. Cai, S. Yu, J. Wen, and W.-Y. Ma, "VIPS: A vision-based page segmentation algorithm," Tech. Rep. MSR-TR-2003-79, 2003.
[4] Jiuxin Cao, Bo Mao, and Junzhou Luo, "A segmentation method for web page analysis using shrinking and dividing," International Journal of Parallel, Emergent and Distributed Systems, Vol. 25, No. 2, pp. 93-104, 2010.
[5] C. Kohlschütter and W. Nejdl, "A densitometric approach to web page segmentation," in Proceedings of the 17th ACM Conference on Information and Knowledge Management (CIKM '08), Napa Valley, California, USA, pp. 1173-1182, ACM, New York, NY, 2008.
[6] Deepayan Chakrabarti, Ravi Kumar, and Kunal Punera, "A graph-theoretic approach to webpage segmentation," in Proceedings of the 17th International Conference on World Wide Web, Beijing, China, April 21-25, 2008.
[7] Barry Smyth and Evelyn Balfe, "Anonymous personalization in collaborative web search," Information Retrieval, Vol. 9, pp. 165-190, 2006.
[8] J. Rocchio, "Relevance feedback in information retrieval," in G. Salton (Ed.), The SMART Retrieval System: Experiments in Automatic Document Processing, pp. 313-323, Englewood Cliffs, NJ: Prentice-Hall, 1971.
[9] S. Fox, K. Karnawat, M. Mydland, S. Dumais, and T. White, "Evaluating implicit measures to improve web search," ACM Transactions on Information Systems, Vol. 23, No. 2, pp. 147-168, 2005.
[10] S. Jung, J.L. Herlocker, and J. Webster, "Click data as implicit relevance feedback in web search," Information Processing and Management, Vol. 43, pp. 791-807, 2007.
[11] Bamshad Mobasher, Robert Cooley, and Jaideep Srivastava, "Automatic personalization based on Web usage mining," Communications of the ACM, Vol. 43, No. 8, pp. 142-151, August 2000. DOI=10.1145/345124.345169, http://doi.acm.org/10.1145/345124.345169.
[12] Bamshad Mobasher, Tao Luo, and Miki Nakagawa, "Discovery and evaluation of aggregate usage profiles for web personalization," Data Mining and Knowledge Discovery, pp. 61-82, 2002.
[13] Ye et al., "Document concept lattice for text understanding and summarization," Information Processing & Management, Vol. 43, No. 6, pp. 1643-1662, 2007.
[14] T. Nomoto and Y. Matsumoto, "A new approach to unsupervised text summarization," in Proceedings of the 24th ACM SIGIR, pp. 26-34, 2001.
[15] Jian-Tao Sun, Dou Shen, Hua-Jun Zeng, Qiang Yang, Yuchang Lu, and Zheng Chen, "Web-page summarization using clickthrough data," in Proceedings of the 28th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '05), pp. 194-201, ACM, New York, NY, USA, 2005. http://doi.acm.org/10.1145/1076034.1076070
[16] J.-Y. Delort, B. Bouchon-Meunier, and M. Rifqi, "Enhanced web document summarization using hyperlinks," in Proceedings of the 14th ACM Conference on Hypertext and Hypermedia, pp. 208-215, New York, NY, USA, 2003. ACM Press.
[17] K.S. Kuppusamy and G. Aghila, "Museum: Multidimensional Segment Evaluation Model," Journal of Computing, Vol. 3, Issue 3, March 2011.
[18] H. Luhn, "The automatic creation of literature abstracts," IBM Journal of Research and Development, Vol. 2, No. 2, pp. 159-165, 1958.


K.S. Kuppusamy is an Assistant Professor at the Department of Computer Science, School of Engineering and Technology, Pondicherry University, Pondicherry, India. He obtained his Master's degree in Computer Science and Information Technology from Madurai Kamaraj University. He is currently pursuing his Ph.D. in the field of Intelligent Information Management. His research interests include Web Search Engines and the Semantic Web.


G. Aghila is a Professor at the Department of Computer Science, School of Engineering and Technology, Pondicherry University, Pondicherry, India. She has a total of 20 years of teaching experience. She received her M.E. (Computer Science and Engineering) and Ph.D. from Anna University, Chennai, India.
She has published nearly 40 research papers on web crawlers and ontology-based information retrieval. She is currently supervising 8 Ph.D. scholars. She received the Schrneiger award. She is an expert in ontology development. Her areas of interest include Intelligent Information Management, artificial intelligence, text mining and semantic web technologies.



A Wavelet-Wavelet Based Processing Approach
for Microcalcifications Detection in
Mammograms

Salim Lahmiri
University of Quebec at Montreal/Department of Computer Science, Montreal, Canada
Email: lahmiri.salim@courrier.uqam.ca



Abstract: A new methodology for the detection of microcalcifications (MCs) in mammograms is presented. Since MCs correspond to the high frequency components of the mammograms, a further multiresolution analysis is applied to these components. In particular, we seek to capture better high frequency features of a mammogram by performing a second analysis only on its high frequency components. In the first step, the discrete wavelet transform (DWT) is applied to the mammogram and its HH image is extracted. In the second step, the DWT is applied to the previous HH image. Then six statistical features are computed. Finally, principal component analysis is employed to reduce the number of features. The k-Nearest Neighbor (k-NN) algorithm is employed for the classification task using a cross-validation technique. A similar approach is adopted with use of the discrete Fourier transform (FT). The experimental results show strong evidence of the effectiveness of the proposed methodology for MCs detection in digital mammograms.

Index Terms: MCs, discrete Fourier transform, discrete wavelet transform, k-NN.

I. INTRODUCTION
In the Western World, breast cancer is the most
common form of cancer in the female population and
early detection of it is one of the most important factors
helping to recover from the disease. Indeed, the causes of
breast cancer remain unknown, and early detection is the
key to reduce the death rate [1]. This can be achieved
through mammography screening programs performed by
computer-assisted diagnosis (CAD) systems to eliminate
the operator dependency and improve the diagnostic
accuracy. Therefore, it is a valuable and beneficial tool
for breast cancer detection and classification. Symptoms
of breast cancer include clustered MCs, spiculated lesions, circumscribed masses, ill-defined masses, and architectural distortions [2]. MCs can be divided into benign (passive) and malignant (active). Pathologists consider benign cells not dangerous, whereas malignant (cancerous) cells can grow, invade the surrounding tissue, and cause death. The detection of MCs has received considerable attention in the literature. In particular, many multi-
resolution techniques have been employed to process the
mammograms in order to detect the clustered MCs. For
example, the discrete wavelet transform (DWT) was used
to characterize digital mammograms and a specific
threshold based on standard deviation of the decomposed
images coefficients is employed to select a subset of
representative features [3]. The employed wavelet bases
were the Haar, Daubechies4, Biorthogonal2.4, Coiflets2
and Symlets2. Then distance metrics are used to measure
the similarity between unknown mammogram and classes
signatures. The validation and the testing process are
respectively performed with 75% and 25% of the images.
The experiments performed show that successful
accuracy varies from 62.50% to 100% for the
classification problem between normal, MCs, spiculated,
and circumscribed areas. The authors conclude that
generally the distance metrics used for the classification
purpose present similar results. In addition, the Haar
wavelet achieves better results in all the tested classes.
Furthermore, the selection of features by threshold helps reduce the number of features used to form the signatures
of classes. In [4], different types of wavelet packets
including Daubechies, Symlet, Coiflet, and Biorthogonal
were employed, all at two-level expansion. Then, 18
features were computed from the high frequency sub-
band, and principal component analysis (PCA) was
performed to eliminate the features that contribute less
than 3% to the total variation of the data set. As a result,
only 7 features formed the inputs set to be fed to the
backpropagation neural networks for the classification
purpose. The performance was evaluated in terms of the
receiver operating characteristic (ROC). It shows that
best performance was achieved by the Coiflet wavelet
with areas under ROC curve ranging from 0.90 to 0.97. A
multiresolution approach to automated classification of
mammograms using Gabor filters was proposed in [5].
First, Gabor filters of different frequencies and
orientations have been applied on mammograms to
produce filtered outputs. Then, for each filtered output,
the mean and the standard deviation of the coefficient
magnitude are used as image features for classification.
Second, t-test statistic is performed on each feature to
select significant features. Finally, the selected features
are applied in mammogram classification using k-NN
algorithm. The classification rate obtained with 14 selected features is 80%, significantly higher than the 75% obtained with 48 features. The authors
concluded that Gabor filter is able to extract textural

patterns of mammograms, and that statistical t-test and its
p-value can be used to reduce feature space and speed up
the classification process while providing good
classification rates. More recently, dual-tree complex
wavelet, curvelet and contourlet were also successfully
applied to mammograms to extract features. For instance,
the authors in [6] used curvelet transform and extracted a
set of coefficient from each level of decomposition.
These coefficients are used to classify the mammogram
images into normal or cancer classes. The Euclidean
distance was used to design the classifier. At scale 3, a
100% successful classification rate was obtained in the benign class when a set of 10%, 20%, 60%, 70%, and 80%
of coefficients are used. The malignant class achieved
100% classification rate at 30%, 40%, 50%, 70%, 80%,
and 90% percentages of coefficients used. The authors in
[7] used dual-tree complex wavelet transform (DT-CWT)
as feature extraction technique and support vector
machine (SVM) to classify two classes of MCs: benign
and malignant. Using this methodology, the experimental
result achieved 88.64% classification accuracy. In [8], the
contourlet transform was employed for feature extraction.
Using the contourlet coefficients, the classification is
performed based on successive enhancement learning
(SEL) weighted SVM, support vector-based fuzzy neural
network (SVFNN), and kernel SVM. The obtained
correct classification accuracies are 96.6%, 91.5% and
82.1% respectively.
In an attempt to capture directional features in
mammograms, the authors in [9] combined the discrete
wavelet transform and the Gabor filter. In particular, the
two-dimensional discrete wavelet transform is employed
to process the mammogram and obtain its high-high (HH)
frequency sub-band image. Then, a Gabor filter bank is
applied to the latter at different frequencies and spatial
orientations to obtain new Gabor images from which the
average and standard deviation are computed. Finally,
these statistics are fed to a support vector machine with
polynomial kernel to classify normal versus cancer
mammograms. The obtained classification results using
ten-fold cross-validation showed the superiority of
their approach to the standard approach, which only uses
the discrete wavelet transform to extract features from
mammograms. Therefore, the authors concluded that high
frequency directional features are important to improve
the correct classification rate of MCs in mammograms.
Although the curvelet, contourlet, DT-CWT, and
Gabor transforms were employed for features extraction,
the discrete wavelet transform (DWT) remains the most
employed tool for processing mammograms [4] since it is able to perform signal analysis at different time and frequency scales. In addition, DWT offers a low computational cost in comparison with curvelet, contourlet,
DT-CWT and Gabor transform. The DWT decomposes
an image into four orthogonal sub-bands: low-low (LL),
high-low (HL), low-high (LH), and high-high (HH). The
sub-bands (sub-images) LL, HL, LH, and HH contain
respectively approximation, horizontal, vertical, and
diagonal information. In the next octave, the LL sub-band
is further decomposed in the same manner.
The purpose of this paper is to propose a simple
methodology for features extraction from mammograms
based on a further analysis of high frequency components
of the mammograms. We rely on high frequency
components of the mammograms since MCs are usually
found in dense biological tissue, which corresponds to
high frequencies in the frequency domain of the image
[4][2][10]. In particular, we aim to apply DWT to HH
sub-bands to extract further accurate high frequency
information. In particular, DWT is applied to the
mammogram and its high-high (HH) image is obtained.
Then, a second DWT is applied to the HH image obtained
in the previous step. The purpose of applying a second
DWT uniquely to HH image is to accurately capture high
frequency information from high frequency image. A
similar approach with use of discrete Fourier Transform
(FT) is proposed. For instance, FT is applied to the
mammogram to obtain a Fourier image. Then, a second
FT is applied to the Fourier image obtained in the
previous step. As in the first approach, the purpose is to
accurately capture high frequency information from high
frequency image. Therefore, a second FT is applied to the
Fourier image obtained in the previous step. We aim to
adopt Fourier transform in the second approach to check
the effectiveness of high frequency information in the
detection of MCs. To the best of our knowledge, no such
methodologies have been adopted in the literature to
extract better high frequency features from mammograms.
Therefore, we suggest examining the effectiveness of a
further analysis of high frequency components of a
mammogram to better characterize MCs.
The paper is organized as follows: The methodology is
described in Section II. The experimental results are
presented in Section III. Finally, conclusions are given in
Section IV.
II. METHODOLOGY
A. Discrete Fourier Transform
Fourier transform (FT) is an effective tool for signal
analysis. For instance, FT is more generic than power
spectral density (PSD) and phase spectrum approaches
and provides better recognition than the PSD approach
[11]. In addition, the DFT is efficient in detecting
periodicity in the frequency domain [12] and Fourier
descriptors are greatly immune to the noise [13].
Moreover, DFT is translation invariant with respect to the
spectrum [14]. The two-dimensional discrete Fourier
Transform is defined as:
$$F(u,v)=\frac{1}{MN}\sum_{m=0}^{M-1}\sum_{n=0}^{N-1} f(m,n)\exp\!\left(-j2\pi\left(\frac{mu}{M}+\frac{nv}{N}\right)\right) \qquad (1)$$

where f(m,n) is an image of size M×N, j = √(−1), u = 0, 1, ..., M−1 and v = 0, 1, ..., N−1. The values of F(u,v) are the Fourier coefficients of the expansion of the exponential into sines and cosines with the variables u and v. The features used to detect cancer images are extracted from the frequency image F(u,v).
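As a concrete illustration (not part of the original paper), the Fourier image can be computed with a fast Fourier transform; the sketch below assumes a grayscale mammogram stored in a NumPy array and uses the log-magnitude spectrum as the Fourier image, with the Fourier-Fourier image of experiment (2) obtained by applying the same transform a second time.

```python
import numpy as np

def fourier_image(img: np.ndarray) -> np.ndarray:
    """Log-magnitude of the centred 2-D DFT of a grayscale image (cf. Eq. 1)."""
    F = np.fft.fftshift(np.fft.fft2(img))   # centre the zero-frequency component
    return np.log1p(np.abs(F))              # log scaling to compress the dynamic range

def fourier_fourier_image(img: np.ndarray) -> np.ndarray:
    """Second FT applied to the previous Fourier image (one reading of 'Fourier-Fourier')."""
    return fourier_image(fourier_image(img))
```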


B. Discrete Wavelet Transform
Wavelet transform belongs to the multiresolution
transformation, performing the decomposition of the
signal on different levels. Then, the wavelet functions
have good localization abilities in both time and
frequency, enabling good representation of the local
features of the patterns [13]. As mentioned in the
introduction, the discrete wavelet transform (DWT)
decomposes an image into several sub-bands according to
a recursive process (see Figure. 1). These include LH1,
HL1 and HH1 which represent detail images and LL1
which corresponds to the approximation image. The
approximation and detail images are then decomposed
into second-level approximation and detail images, and
the process is repeated to achieve the desired level of the
multi-resolution analysis. The obtained coefficients
values for the approximation and detail sub-band images
are useful features for texture categorization [15][16]. To
obtain the set of features that characterize a given texture
image, the 2D-DWT wavelet transform is used to find its
spectral components. This allows transforming each
texture image into a local spatial/frequency representation
by convolving the image with a bank of filters. Then, the
image features are extracted from the obtained 2D-DWT
representation.


Figure 1: Two level 2D-DWT decomposition of an image.


A 1D-DWT is defined as follows:

$$f(x)=\sum_{i,j} c_{i,j}\,\psi_{i,j}(x) \qquad (2)$$

where $\psi_{i,j}(x)$ are the wavelet functions and $c_{i,j}$ are the DWT coefficients of f(x). They are defined by:

$$c_{i,j}=\int_{-\infty}^{+\infty} f(x)\,\psi_{i,j}(x)\,dx \qquad (3)$$

A mother wavelet $\psi(x)$ is used to generate the wavelet basis functions by using translation and dilation operations:

$$\psi_{i,j}(x)=2^{-i/2}\,\psi\!\left(2^{-i}x-j\right) \qquad (4)$$
where j and i are respectively the translation and dilation
parameters. The one-dimensional wavelet decomposition
can be extended to two-dimensional objects by separating
the row and column decompositions [10][17]. For
instance, the 2D-DWT is achieved by alternating row and
column filtering in each level with iteration from the LL
(low-pass/low-pass) subband as shown in Figure 1. For
instance, the 2-D wavelet analysis process (See Figure 2)
consists of filtering and down-sampling horizontally
using 1-D low-pass filter to each row in the image F(x, y)
to produce the coefficient matrices F
L
(x, y) and F
H
(x, y).
Then, vertical filtering and down-sampling are performed
using the low-pass and high-pass filters L and H to each
column in F
L
(x, y) and F
H
(x, y) to produce four sub-
images F
LL
(x, y), F
LH
(x, y), F
HL
(x, y) and F
HH
(x, y) for one
level of decomposition. The sub-images images F
LL
(x, y),
F
LH
(x, y), F
HL
(x, y) and F
HH
(x, y) correspond respectively
to LL, LH, HL, and HH images in Figure 1. A second
level decomposition is considered in this study. The
Daubechies-4 wavelet [18] is chosen as the mother
wavelet in this paper since it has the advantage of better
resolution for smoothly changing signals [19]. Finally,
features are extracted from HH2 sub-image since detail
coefficients in level 2 and 3 contain fine breast structure
and micro-calcifications [2].

Figure 2: 2D-DWT decomposition process of an image.
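For illustration only (not from the paper), the two-level decomposition of Figures 1 and 2 can be obtained with the PyWavelets package; the image below is a random placeholder, and the sub-band naming follows PyWavelets, whose diagonal detail corresponds to the HH band used here.

```python
import numpy as np
import pywt

img = np.random.rand(256, 256)                # placeholder for a grayscale mammogram

# Level 1: approximation LL1 and detail sub-bands (horizontal, vertical, diagonal = HH1)
ll1, (lh1, hl1, hh1) = pywt.dwt2(img, "db4")
# Level 2: the approximation LL1 is decomposed again, as in Figure 1
ll2, (lh2, hl2, hh2) = pywt.dwt2(ll1, "db4")  # hh2 is the HH2 sub-image used for features
```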

C. Features Extraction
The statistics used to describe the processed images are
the mean, standard deviation, smoothness, third moment,
uniformity, and entropy. They are chosen since they are widely used in pattern recognition [20]. The statistics are expressed as follows:

$$\text{Mean: } m=\sum_{i=0}^{L-1} z_i\,p(z_i) \qquad (5)$$

$$\text{St. Dev.: } \sigma=\sqrt{\sigma^{2}(z)} \qquad (6)$$

$$\text{Smoothness: } R=1-\frac{1}{1+\sigma^{2}} \qquad (7)$$

$$\text{Third Moment: } \mu_{3}=\sum_{i=0}^{L-1}\left(z_i-m\right)^{3} p(z_i) \qquad (8)$$

$$\text{Uniformity: } U=\sum_{i=0}^{L-1} p^{2}(z_i) \qquad (9)$$

$$\text{Entropy: } e=-\sum_{i=0}^{L-1} p(z_i)\log_{2} p(z_i) \qquad (10)$$

where z is a random variable indicating intensity, p(z_i) is the probability of the ith intensity level in the histogram, and L is the total number of intensity levels. Finally, principal
component analysis (PCA) is applied to the features set to
reduce the number of characteristics to be fed to the
classifier. For instance, k-NN is employed in this study
for classification.
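A minimal sketch of the six descriptors of Eqs. (5)-(10), computed from the normalized histogram of a processed sub-image (the bin count is an assumption, not specified in the paper):

```python
import numpy as np

def texture_stats(img: np.ndarray, bins: int = 256) -> np.ndarray:
    """Mean, standard deviation, smoothness, third moment, uniformity, entropy."""
    hist, edges = np.histogram(img.ravel(), bins=bins)
    p = hist / hist.sum()                            # p(z_i)
    z = 0.5 * (edges[:-1] + edges[1:])               # bin centres play the role of z_i
    m = np.sum(z * p)                                # Eq. (5)
    var = np.sum((z - m) ** 2 * p)
    std = np.sqrt(var)                               # Eq. (6)
    smooth = 1.0 - 1.0 / (1.0 + var)                 # Eq. (7)
    mu3 = np.sum((z - m) ** 3 * p)                   # Eq. (8)
    uniformity = np.sum(p ** 2)                      # Eq. (9)
    entropy = -np.sum(p[p > 0] * np.log2(p[p > 0]))  # Eq. (10)
    return np.array([m, std, smooth, mu3, uniformity, entropy])
```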

D. Design of Experiments
As mentioned in the introduction, the following
experiments are conducted:
(1) Fourier: Apply Fourier transform to mammogram.
Extract high frequency image. Then, compute features.
Finally, apply PCA. Employ k-NN for classification task.
(2) Fourier-Fourier: Apply Fourier transform to
mammogram. Extract Fourier processed image. Then,
apply Fourier transform to the previous Fourier processed
image to obtain the Fourier-Fourier image. Compute
features. Finally, apply PCA. Employ k-NN for
classification task.
(3) DWT: Apply wavelet transform to mammogram.
Extract high frequency image (HH2). Then, compute
features. Finally, apply PCA. Employ k-NN for
classification task.
(4) DWT-DWT: Apply DWT to the mammogram.
Extract high frequency image (HH2). Then, apply another
DWT to the previous high frequency image (HH2) to
obtain HH2*. Compute features. Finally, apply PCA.
Employ k-NN for classification task.

The experiments (1) to (4) are shown in Figures 3,4,5,
and 6 respectively.
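A hedged sketch of the two-stage decomposition behind experiment (4); the helper names are illustrative, and applying a full two-level DWT to the HH2 image follows the description given in the conclusion:

```python
import numpy as np
import pywt

def level2_hh(img: np.ndarray, wavelet: str = "db4") -> np.ndarray:
    """Level-2 diagonal detail (HH2) of a two-level 2-D DWT."""
    ll1, _ = pywt.dwt2(img, wavelet)
    _, (_, _, hh2) = pywt.dwt2(ll1, wavelet)
    return hh2

def dwt_dwt_image(img: np.ndarray) -> np.ndarray:
    """Experiment (4): HH2 of the mammogram, then HH2 of that image (HH2*)."""
    return level2_hh(level2_hh(img))

# The statistics of Eqs. (5)-(10) are then computed on HH2*, reduced by PCA,
# and classified with k-NN under 10-fold cross-validation.
```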



Figure 3. Experiment (1): Fourier approach.

E. The Classifier
The k-nearest neighbor algorithm (k-NN) [21] is
employed for classification task in this study. It is a
nonparametric method that assigns query data to the class
that the majority of its k-nearest neighbors belong to. For
instance, the k-NN algorithm uses the data directly for
classification without the need of an explicit model.


Figure 4. Experiment (2): Fourier-Fourier approach.



Figure 5. Experiment (3): DWT approach.



Figure 6. Experiment (4): DWT-DWT approach.

The performance of k-NN depends on the number of
the nearest neighbor k. In general, there is no solution to
find the optimal k. However, a trial-and-error approach is
usually used to find its optimal value. The main objective
is to find the value of k that maximizes the classification
accuracy. The main advantage of k-NN algorithm is the
ability to explain the classification results. On the other
hand, its major drawback is the need to find the optimal k
and to define the appropriate metric to measure the
distance between the query instance and the training
samples. In this paper, the distance metric chosen is the
Euclidean distance. The standard algorithm of k-NN is
given as follows:
(1) Calculate Euclidean distances between an unknown
object (o) and all the objects in the training set;
(2) Select k objects from the training set most similar to
object (o), according to the calculated distances;
(3) Classify object (o) with the group to which a majority
of the K objects belongs.
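A minimal sketch of this step with scikit-learn (an assumption; the paper only states that k-NN with Euclidean distance was used), on toy stand-in data:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Toy feature vectors (e.g., two PCA components) and labels: 0 = normal, 1 = cancer
X_train = np.array([[0.1, 0.2], [0.15, 0.22], [0.9, 0.8], [0.92, 0.85], [0.2, 0.1], [0.88, 0.9]])
y_train = np.array([0, 0, 1, 1, 0, 1])

knn = KNeighborsClassifier(n_neighbors=2, metric="euclidean").fit(X_train, y_train)
print(knn.predict([[0.12, 0.18]]))   # class of the majority of the 2 nearest neighbours
```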
III. DATA AND RESULTS
In order to test the proposed methodology, one hundred
digital mammograms were taken from The Digital
Database for Screening Mammography (DDSM) [22].
They consisted of fifty normal images and fifty cancer
images. An example of a digital mammogram is shown in
Figure 7. The DWT and DWT-DWT images of a normal
mammogram are shown in Figure 8, and its Fourier and
Fourier-Fourier images are shown in Figure 9.


Figure 7: Example of a normal mammogram.


Figure 8: Left: DWT. Right: DWT-DWT.


Figure 9: Left: Fourier. Right: Fourier-Fourier.

The features selected by PCA and their respective
cumulative variance proportion (CVP) are shown in
Table 1. It shows that the most significant features are
smoothness and uniformity for DWT-DWT images and
uniformity and entropy for Fourier-Fourier images. The
value of k used to perform the k-NN algorithm was varied
from 2 to 10. The optimal k that maximizes the average
recognition rate was found to be 2. The experiments were
conducted with 10-fold cross-validation. For each fold,
the correct classification rate (hit ratio) and standard
deviation are computed. Figure 10 provides the obtained
results.

Table 1: Features selected by PCA
Image Features CVP
DWT image smoothness, uniformity 99%
DWT-DWT image smoothness, uniformity 98%
Fourier image mean, entropy 92%
Fourier-Fourier image uniformity, entropy 98%
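For illustration only (scikit-learn usage and the variance threshold are assumptions suggested by the CVP values in Table 1, not the authors' code), the PCA reduction step might look as follows:

```python
import numpy as np
from sklearn.decomposition import PCA

X = np.random.rand(100, 6)                       # stand-in for the six statistics per mammogram
pca = PCA().fit(X)
cvp = np.cumsum(pca.explained_variance_ratio_)   # cumulative variance proportion (CVP)
n_keep = int(np.searchsorted(cvp, 0.98) + 1)     # smallest number of components reaching 98% CVP
X_reduced = PCA(n_components=n_keep).fit_transform(X)
```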

Figure 10. Classification results using k-NN: average accuracy (standard deviation) of 60.13% (0.1115) for Fourier, 87.78% (0.0303) for Fourier-Fourier, 88.05% (0.0370) for DWT, and 91.70% (0.0128) for DWT-DWT.

It shows that the feature extraction approach based on
DWT-DWT clearly improved the classification accuracy
with respect to using DWT only. For instance, when
using only DWT feature extraction, the achieved average
accuracy was 88.05% (0.0370), and when using DWT-DWT based feature extraction, it was 91.70% (0.0128). On the other hand, the feature extraction approach based on Fourier-Fourier clearly improved the classification accuracy with respect to using Fourier only. For instance, when using only Fourier feature extraction, the achieved average accuracy was 60.13% (0.1115),
and when using Fourier-Fourier based feature extraction,
it was 87.78% (0.0303). Then, the results show strong
evidence that the two-step based approaches are suitable
to extract high frequency features to better detect MCs in
digital mammograms.


In sum, the DWT-DWT approach provides higher
accuracy than the Fourier-Fourier approach. In addition,
both proposed approaches allow obtaining high frequency
characteristics needed to detect MCs. In other words, our
methodology based on a further analysis of high
frequency images using DWT or Fourier transform helps
extracting suitable features to detect MCs.

IV. CONCLUSION
The discrete wavelet transform (DWT) is the most
employed tool for processing mammograms to detect
microcalcifications (MCs) since it is able to perform
signal analysis at different time and frequency scales.
Indeed, MCs are usually found in dense biological tissue,
which corresponds to high frequencies in the frequency
domain of the image. Therefore, we seek to capture better
high frequency features of a mammogram by performing
a further analysis of its high frequency components. In
particular, the DWT is applied to the mammogram to
obtain its high-high (HH) image at level two. Then, a
further decomposition by DWT at level two is applied to
the HH image of the previous step to obtain HH*. Finally,
statistical features are computed from HH* and PCA is
performed to select the most significant features. The k-
NN algorithm is employed to classify normal versus
cancer images using cross-validation technique. Recall
this approach DWT-DWT. A similar approach with use
of discrete Fourier Transform (FT) is proposed. For
instance, FT is applied to the mammogram to obtain a
Fourier image. Then, a second FT is applied to the
Fourier image obtained in the previous step. We refer to this approach as Fourier-Fourier.
The simulation results show strong evidence of the
effectiveness of our two methodologies since the average
recognition rate improves and its standard deviation
decreases.
REFERENCES

[1]H.D Cheng, Juan Shan, Wen Ju, Yanhui Guo and
LingZhang, Automated breast cancer detection and
classification using ultrasound images: A survey, Pattern
Recognition, Vol. 43, pp.299 317, 2010.
[2]Bouyahia S., J. Mbainaibeye, and N. Ellouze, Wavelet
Based Microcalcifications Detection in Digitized
Mammograms, ICGST-GVIP Journal, Vol.8, pp.23-31,
2009.
[3]Ferreira Cristiane Bastos Rocha and Dibio Leandro Borges,
An Evaluation of Wavelet Features Subsets for
Mammogram Classification, Lecture Notes in Computer
Science, Vol.3773, pp.620-630, 2005.
[4]Sepehr M.H. Jamarani, Gholamali Rezai-Rad and Hamid
Behnam, A Novel Method for Breast Cancer Prognosis
Using Wavelet Packet Based Neural Network, Proceedings
of the IEEE Engineering in Medicine and Biology, Shanghai,
China, September 1-4, pp.3414- 3417, 2005.
[5]Dong A. and Wang B., Feature Selection and Analysis on
Mammogram Classification, Proceedings of IEEE Pacific
Rim Conference on Communications, Computers and Signal
Processing, Victoria, B.C., Canada, August 23-26, 2009.
[6]Eltoukhy M.M.M, I. Faye, and B.B Samir, Using Curvelet
Transform to Detect Breast Cancer in Digital Mammogram.
IEEE International Colloquium on Signal Processing & Its
Applications (CSPA), pp.340-345, 2009.
[7]Tirtajaya A. and D.D. Santika, Classification of
Microcalcification Using Dual-Tree Complex Wavelet
Transform and Support Vector Machine, IEEE International
Conference on Advances in Computing, Control, and
Telecommunication Technologies, pp.164-166, 2010.
[8]Moayedi F., Azimifar Z., and Boostani R., Katebi S.,
Contourlet-based mammography mass classification using
the SVM family, Computers in Biology and Medicine,
Vol.40, pp.373383, 2010.
[9]Lahmiri S. and Boukadoum M., Hybrid discrete wavelet
transform and Gabor filter banks processing for mammogram
features extraction Classifier, IEEE New Circuits and
Systems (NEWCAS) International Conference, June 26-29,
Bordeaux, France, pp.53-56, 2011.
[10]Sakka E., Prentza A., Lamprinos I.E. and Koutsouris D.,
Microcalcification Detection using Multiresolution Analysis
Based on Wavelet Transform, IEEE International Special
Topic Conference on Information Technology in
Biomedicine, Ioannina, Epirus, Greece, October 26-28, 2006.
[11]Gelman L., and Braun S., the optimal usage of the Fourier
transform for pattern recognition, Mech. Systems Signal
Process., Vol.15, pp.641-645, 2001.
[12]Liu F. and Picard R.W., Periodicity, directionality, and
randomness: Wold features for image modeling and
retrieval, IEEE Trans. Pattern Anal. Mach. Intell.,
Vol.18,pp.722733, 1996.
[13]Osowski Stanislaw and Dinh Nghia Do, Fourier and
wavelet descriptors for shape recognition using neural
networksa comparative study, Pattern Recognition,
Vol.35 pp.19491957, 2002.
[14]Chen Guangyi and Bui T.D, Invariant Fourier-wavelet
descriptor for pattern recognition, Pattern Recognition, Vol.
32, pp.1083-1088, 1999.
[15]Sengur Abdulkadir and Turkoglu Ibrahim, and M. Cevdet
Ince., Wavelet Packet Neural Networks for Texture
Classification, Expert Systems with Applications, Vol. 32,
pp.527-533, 2007.
[16]Wang Z-Z and Yong J-H., Texture Analysis and
Classification with Linear Regression Model Based on
Wavelet Transform, IEEE Transactions on Image
Processing, Vol.17, pp.1421-1430, 2008.
[17]Heil C.E., and Walnut D.F, "Continuous and Discrete
Wavelet Transforms", SIAM Review, vol. 31, no. 4,pp.628-
666, 1989.
[18]Daubechies I., Ten Lectures on Wavelets, SIAM,
Philadelphia, PA, 1992.
[19]Algorri M-E and Fernando F-M., Classification of
Anatomical Structures in MR Brain Images Using Fuzzy
Parameters, IEEE Transactions on biomedical engineering,
Vol. 51, No.9, pp.1599-1608, 2004.
[20]Sheshadri H.S and Kandaswamy A., Detection of breast
cancer by mammogram image segmentation, Journal of
Cancer Research and Therapeutics, Vol.1, pp :232-234, 2005.
[21]Dasarthy, B.V., Nearest Neighbor Classification
Techniques, IEEE Press, Hoboken(NJ), 1990.
[22]http://marathon.csee.usf.edu/Mammography/Database.html






Recognition of Tongueprint Textures for Personal
Authentication: A Wavelet Approach
Salim Lahmiri
University of Quebec at Montreal/Department of Computer Science, Montreal, Canada
lahmiri.salim@courrier.uqam.ca



Abstract: In order to verify tongueprint images, three approaches for texture analysis were considered and their performances compared. They are the wavelet transform,
Gabor filter, and spectral analysis. In all approaches, six
statistical measures are applied to the processed images to
extract features. They are the mean, standard deviation,
smoothness, third moment, uniformity, and entropy. Finally,
k-nearest neighbor algorithm (k-NN) is used to classify
tongue textures for verification purposes. The obtained
recognition rates show that features extracted from wavelet
analysis allow achieving the highest accuracy (92%) among
the other approaches. On the other hand, features extracted
from spectral images lead to the lowest recognition rate
(75%). Features extracted from Gabor filter banks achieved
83%. Therefore, we conclude that wavelet-based features
outperform Gabor and spectral-based features employed in
the literature.

Index Terms: Tongueprints, wavelet transform, Gabor filter, spectral analysis, k-NN.
I. INTRODUCTION
Biometric systems seek to automatically recognize
persons based on their physiological and/or behavioural
characteristics. In particular, the primary focus of biometric
technology is verification and identification of humans
using their naturally possessed biological (biometric)
properties [1]. Biometrics science is largely employed in
security and surveillance applications, forensics, secure
control access, and automatic banking, to name a few.
The most measurable traits proposed in literature for
authentication are fingerprints, palmprints, and face.
Other traits and corresponding technologies are described
in [2]. Recently, due to the importance of biometrics
many comprehensive surveys and works have been
published on fingerprint [3], palmprint [4], face [5][6],
retina [7], iris [8], speech recognition [9], gait [10], and
behavioral human-computer interaction (HCI) [11].
Recently, a novel biometric system for person
identification based on the tongueprint was proposed and
tested by Zhang et al. [12]. Indeed, the tongue is a unique
and distinctive organ because it is characterized by its
stable geometric features, crack features and texture
features [12]. In addition, unlike fingerprints and
palmprints which are exposed to external environment
changes, the tongue is well protected since it is contained
inside the mouth. Finally, the squirm of the human tongue
could be used to identify persons. The authors in [12]
employed a fusion approach that makes use of both static
(geometric, crack, and texture features) and dynamic
features (squirm) to achieve a 95% recognition rate.
In addition, the authors in [12] provided four
verification performances: 89.3% using curvature of the
contour of the tongue to detect the geometric shape,
79.4% and 72.5% using Gabor filter and spectral analysis
respectively to classify texture, and 71% using manifold
learning technologies [12] to classify squirm.
According to the authors in [12], tongue contour
extraction is difficult since the surface color of the tongue
is highly similar to that of the ambient biological tissue.
On the other hand, the authors employed spectral analysis
to process tongue textures. But, techniques of spectral
analysis suffer from artefacts and limited resolution [13].
In addition, the fusion system is not easy to implement
and data processing is time consuming. Furthermore,
Gabor filters have three major limitations [14][15]. First,
the outputs of Gabor filter banks are not mutually
orthogonal; then a significant correlation between texture
features may occur. Second, there is a need for an optimal
tuning of its parameters including the filter central
frequency, the filter bandwidth among the x and y-axis
and the filter orientation. Third, Gabor filter banks come
with high computational costs. The disadvantages of
Gabor filter can be avoided if the wavelet transform is
used [14][15]. For instance, wavelet transform provides a
precise analysis of a signal at different scales. In addition,
it uses a low pass and high pass filters that remain the
same between two consecutive scales. Therefore, it does
not require proper tuning of parameters as with Gabor
filters.
The purpose of this study is to design a biometric
authentication system based on the analysis of
tongueprint texture only. Indeed, the literature argues that
texture features are useful in the classification of tongue
images [12][16][17]. In addition, tongue contour
extraction is avoided since it is difficult to perform and
squirm features are ignored because of their poor
performance [12]. To overcome limitations of Gabor
filter banks and spectral analysis that were employed in
[12], discrete wavelet transform is used to process
tongueprint images. It is widely recognized as a powerful
technique for feature extraction in pattern recognition
[18][19]. In addition, relying on texture only and ignoring
geometric and crack features allows implementing a
simple and fast identification system. In sum, the main
hypothesis is that relying only on texture analysis using
wavelets transform may lead to higher classification

performance compared to Gabor filter and spectral
analysis that were used in [12].
The paper is organized as follows. In Section 2, the
review of literature is presented. Section 3 presents the
methods. The data and experimental results are presented
in Section 4. Finally, the conclusion and future directions
are given in Section 5.
II. RELATED WORKS
A very limited number of papers found in literature
examined the tongue texture for classification purposes
and mainly for medical diagnosis. This method of
diagnosis is important in Traditional Chinese Medicine
(TCM) [13]. In that work, the authors proposed a computerized tongue diagnosis system in which a Bayesian Network classifier based on chromatic and textural measurements classifies healthy and abnormal tongues (13 diseases) from a group of 455 patients. The correct classification
rate was 75.8%. Another paper [19] used Gabor Wavelet
Opponent Colour Features (GWOCF) to analyze tongue
images in order to perform a tongue diagnosis in TCM. In
particular, they employed colour information to pre-
classify the known texture image before extracting
GWOCF to achieve 89% recognition rate using patients
tongue images captured in Guangzhou Traditional
Chinese Medicine Hospital. In a subsequent paper,
authors in [20] computed the entropy and energy
functions to represent the texture features, and employed
a k-Means algorithm to select the clusters and finally used
3-D visualization to classify 11 normal tongue images
and 8 tongue images from patients with gastric cancer. The authors concluded that color and texture features are
sensitive to abnormal tongues. The authors in [21]
designed a Computerized Tongue Examination System
(CTES) to automate the diagnostic of tongue images
based on chromatic and spatial textural properties. In
particular, colors of substance and coating, thickness of
coating and the detection of grimy coating have been
measured. Indeed, textural features including the angular
second moment (ASM), contrast, correlation, variance
and entropy were used to determine the grimy coating of
the tongue. The k-NN algorithm successfully classified
86% of the tongue images. The authors concluded that
there is a real potential for computerized tongue diagnosis.
Authors in [22] proposed a Tongue-Computing Model
(TCoM) for the diagnosis of appendicitis. Chromatic and
textural metrics are the basic features that are jointly used
to classify tongue images. In addition, they have
proposed a new measurement called Grade of
Differentiation (GOD) to evaluate the classification
performance of different metrics. Then, the nearest
distance rule for the classification of each metric is used
to classify images. Finally, a survivor metric is employed
to obtain a final decision. The experimental results from
912 tongue images show that the ratio of the correct
classification is 92.98% and that of false classification is
8.52%.
As mentioned above, a novel biometric system for
person identification based on the tongueprint was
proposed and tested in [12]. According to the authors, the
major advantage of their proposed system is its non-
invasiveness. For instance, to recognize persons, [12]
employed a fusion approach that makes use of both static
(geometric, crack, and texture features) and dynamic
features (squirm). The obtained recognition rate was 95%.
This fusion approach performed better than using
separately geometric features, crack features, textures
features, or squirm features. The tongueprint is a
promising candidate for biometric identification and
worthy of further research. The authentication system
proposed by [12] uses both static and dynamic features.
They are contained in two modules. The first module is
called enrolment module where static and dynamic
features are extracted. The second module is called
recognition module that operates following two steps.
The first step is the liveness detection where information
regarding the squirm of the tongue is used to detect
whether the subject is alive. In the second step, both static
physiological features and dynamic squirm features are
extracted. The extraction of static physiological features
step is aimed to extract geometric features, crack features
and textural features. The geometric features are
measures of the width of the tongue, its thickness, and the
curvature of its contour. Combining these three measures
allows forming the geometric vector. The crack features
correspond to the lines found on the centre of the tongue
surface. Therefore, a region of interest (ROI) is defined
along the centre of the tongue surface, and a two-dimensional Gabor filter is applied to extract crack features from the ROI. Finally, frequency domain images are represented using the polar coordinate system and
energy is computed to form the textural feature. In the
step of the extraction of dynamic features, the tongue
squirm is captured in a sequence of continuous images.
Then, the authors used the orthogonal neighbourhood
preserving projections (ONPP) to reduce the tongue
squirm into a lower dimensional feature space. According
to the authors [12], the ONPP technique was used since it
is a linear dimensionality reduction that allows preserving
both local and global geometry of high dimensional data
samples. A dynamic descriptor is calculated using the
mean and the variance obtained from the analysis of the
low dimensional manifold obtained using the ONPP
technique. The dynamic descriptor is employed to
classify the input image sequence as valid or invalid.
Then, the squirms of all subjects are clustered. Finally,
static features and dynamic features are grouped to form
one vector of features for each subject. Then, the mean
and the variance of each vector are calculated. The
classification of subject is made by minimum distance
method using the computed mean and variance.
The proposed fusion system of [12] provides higher
classification rate (95%). However, this system comes
with several drawbacks, as mentioned in the introduction. The
tongue contour extraction is difficult since the surface
color of the tongue is highly similar to that of the ambient
biological tissue [12]. In addition, this system is not easy
to implement and data processing is time consuming.
Finally, spectral analysis was chosen to process the images and extract features, which is not an adequate approach.

Gabor filter banks are not mutually orthogonal, they require optimal tuning of their parameters, and they lead to high computational costs.
The purpose of this study is to compare the
performance of the k-NN classifier given the type of the
approach followed to extract texture features from
tongueprints: wavelet analysis, Gabor filter, and spectral
analysis approach. The main hypothesis is that relying
only on texture analysis using wavelets transform may
lead to higher classification performance compared to
Gabor filter banks and spectral analysis employed in [12].
In addition, relying on texture only and ignoring
geometric and crack features allow implementing a
simple and fast identification system. In addition, texture
analysis has been proven to be helpful for the extraction of
features [3][4][5][8]. Moreover, a system where textural
properties of a tongue are used can be implemented in
clinical medicine for diagnosis purposes [13].
In sum, the contribution of our paper is to
experimentally shed light on the benefit of the DWT-
based features in tongueprint textures recognition. Indeed,
in previous studies textural features were directly
extracted from tongueprint images [13] [20][21][22] or
from spectral and Gabor processed images [12][19].
However, these approaches come with serious drawbacks.
On one hand, features extracted directly from tongueprint
images do not contain high frequency information that
characterizes biological tissue. On the other hand, as
mentioned in the introduction, spectral analysis suffers from
artefacts and limited resolution, and Gabor transform
requires optimal tuning of its parameters and may lead to
correlated features. As a result, using DWT to extract
features would allow obtaining better high frequency
features than spectral and Gabor transform. Indeed, the
disadvantages of Gabor filter can be avoided if wavelet
transform is adopted since it provides a precise analysis
of a signal at different scales [14][15].
III. METHODOLOGY
The purpose is to compare the recognition ability of
wavelet features in comparison with Gabor processed
images and spectral features. The research methodology
consists of six parts as follows:
1) Extraction of region of interest (ROI) from the
original images.
2) Processing ROI using wavelet, Gabor, and spectral
analysis.
3) Computing statistics of the processed ROI.
4) Computing statistics of the non-processed ROI.
5) Classification of features using k-NN.
6) Comparison of the results.

A. Spectral Analysis
Based on the Fourier transform, spectral analysis
allows detecting high-energy bursts in the spectrum of
texture. The spectrum is expressed in polar coordinates to obtain a function S(r,θ), where S is the spectrum function and r and θ are the variables in the polar coordinate system. For each direction θ, the function S(r,θ) is considered as a one-dimensional function S_θ(r). Meanwhile, for each frequency r, the function S(r,θ) is considered as a one-dimensional function S_r(θ). Then, the spectral measures to describe texture are obtained by summing the functions S_θ(r) and S_r(θ) as follows:

$$S(r)=\sum_{\theta=0}^{\pi} S_{\theta}(r) \qquad (1)$$

$$S(\theta)=\sum_{r=1}^{r_0} S_{r}(\theta) \qquad (2)$$

where r_0 is the radius of a circle centered at the origin. For each tongueprint image, the features S(r) and S(θ) constitute the spectral energy that describes its entire texture, as in [12]. Then, they are fed to the classifier. The
biometric system based on texture spectra for
tongueprints verification is shown in Figure 1.
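A hedged sketch of these spectral features (the number of radial and angular bins is an assumption; the paper does not specify the discretization):

```python
import numpy as np

def spectral_features(img: np.ndarray, n_radii: int = 32, n_angles: int = 32):
    """Approximate radial and angular sums S(r) and S(theta) of Eqs. (1)-(2)."""
    spec = np.abs(np.fft.fftshift(np.fft.fft2(img)))
    h, w = spec.shape
    y, x = np.indices((h, w))
    r = np.hypot(y - h // 2, x - w // 2)                        # radius of each frequency bin
    theta = np.mod(np.arctan2(y - h // 2, x - w // 2), np.pi)   # direction folded into [0, pi)
    r_bins = np.linspace(0, r.max(), n_radii + 1)
    t_bins = np.linspace(0, np.pi, n_angles + 1)
    s_r = [spec[(r >= r_bins[i]) & (r < r_bins[i + 1])].sum() for i in range(n_radii)]
    s_t = [spec[(theta >= t_bins[i]) & (theta < t_bins[i + 1])].sum() for i in range(n_angles)]
    return np.array(s_r), np.array(s_t)
```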

B. Gabor Filter
The one dimension (1-D) Gabor filter was first
defined by Gabor [22] and was later extended to 2-D by
Daugman [23]. The Gabor filter is extensively used in
texture analysis since it decomposes an image into
components corresponding to different scales and
orientations [24-26].

Figure 1. Biometric system with spectra features.

Therefore, the two-dimensional (2D) Gabor is able to
capture visual properties such as spatial localization,
orientation selectivity, and spatial frequency. Thus, Gabor
filter is well-adapted for image processing applications;
especially texture analysis. The 2D Gabor filter is the
product of a 2D Gaussian and a complex exponential
function. For instance, the Gabor function is a complex
sinusoid centered at a given frequency and modulated by
a Gaussian envelope. The Gabor filter comprises both
real and imaginary parts. The general form of the real part
of a 2-D Gabor function is defined as follows:
$$G\left(x,y,\theta,f,\sigma_x,\sigma_y\right)=\exp\!\left\{-\frac{1}{2}\left[\left(\frac{x'}{\sigma_x}\right)^{2}+\left(\frac{y'}{\sigma_y}\right)^{2}\right]\right\}\cos\!\left(2\pi f x'\right) \qquad (3)$$

$$x' = x\cos\theta + y\sin\theta \qquad (4)$$

$$y' = y\cos\theta - x\sin\theta \qquad (5)$$

where σx and σy are respectively the standard deviations of the Gaussian envelope along the x and y axes. The parameters f and θ are respectively the central frequency and the rotation of the Gabor filter. Then, to obtain the Gabor-filtered image f(x,y) of a given input image I(x,y), the 2-D convolution operation (∗) is performed as follows:

$$f\left(x,y,\theta,f,\sigma_x,\sigma_y\right)=G\left(x,y,\theta,f,\sigma_x,\sigma_y\right)\ast I\left(x,y\right) \qquad (6)$$
The selection of the parameters (σx, σy, f, θ) is crucial and difficult. Bianconi and Fernández [25] investigated the effects of Gabor filter parameters on texture classification using a large set of experiments on different textures. They conclude that the number of orientations did not show significant effects on the percentage of correct classification. In other words, increasing the number of orientations would produce a considerable waste of computational time, without a tangible increase in the correct classification rate [25]. Thus, the value of θ is set to 0 in our study¹. On the other hand, they found that the best classification rate is obtained when the parameters σx, σy, and f are set to their lowest level (0.5). In other words, they concluded that low selectivity in frequency positively affects correct texture classification. Therefore, the standard deviations of the Gaussian envelope along the x and y axes and the central frequency of the Gabor filter are all set to one half in our study. The biometric system based on Gabor filter is shown in Figure 2.
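An illustrative sketch of the real Gabor kernel of Eqs. (3)-(5) and the filtering of Eq. (6), using the parameter values reported in this study (the kernel size and the pixel/cycles-per-pixel units are assumptions):

```python
import numpy as np
from scipy.signal import convolve2d

def gabor_real(size=15, f=0.5, theta=0.0, sigma_x=0.5, sigma_y=0.5):
    """Real part of the 2-D Gabor kernel, Eqs. (3)-(5)."""
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1].astype(float)
    xp = x * np.cos(theta) + y * np.sin(theta)    # Eq. (4)
    yp = y * np.cos(theta) - x * np.sin(theta)    # Eq. (5)
    return np.exp(-0.5 * ((xp / sigma_x) ** 2 + (yp / sigma_y) ** 2)) * np.cos(2 * np.pi * f * xp)

def gabor_filter(img):
    """Gabor-filtered image of Eq. (6) via 2-D convolution."""
    return convolve2d(img, gabor_real(), mode="same", boundary="symm")
```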


Figure 2. Biometric system with Gabor filter.

C. Wavelet Transform
The discrete wavelet transform (DWT) decomposes
an image into several sub-bands according to a recursive
process (Figure 3). These include LH1, HL1 and HH1
which represent detail images and LL1 which
corresponds to the approximation image. The
approximation and detail images are then decomposed
into second-level approximation and detail images, and
the process is repeated to achieve the desired level of the

¹ Features were also extracted from Gabor filter banks with three orientations, for instance θ = 0, π/4, π/2, following the design rules in [14]. However, the obtained classification results were similar to those of the approach we follow, which is proposed in [25].
multi-resolution analysis. The obtained coefficients
values for the approximation and detail sub-band images
are useful features for texture categorization [18][19]. To
obtain the set of features that characterize a given texture
image, the 2D-DWT wavelet transform is used to find its
spectral components. This allows transforming each
texture image into a local spatial/frequency representation
by convolving the image with a bank of filters. Then, the
image features are extracted from the obtained 2D-DWT
representation. A 1D-DWT is defined as follows:

$$f(x)=\sum_{i,j} c_{i,j}\,\psi_{i,j}(x) \qquad (7)$$

where $\psi_{i,j}(x)$ are the wavelet functions and $c_{i,j}$ are the DWT coefficients of f(x). They are defined by:

$$c_{i,j}=\int_{-\infty}^{+\infty} f(x)\,\psi_{i,j}(x)\,dx \qquad (8)$$

A mother wavelet $\psi(x)$ is used to generate the wavelet basis functions by using translation and dilation operations:

$$\psi_{i,j}(x)=2^{-i/2}\,\psi\!\left(2^{-i}x-j\right) \qquad (9)$$
where j and i are respectively the translation and dilation
parameters. The one-dimensional wavelet decomposition
can be extended to two-dimensional objects by separating
the row and column decompositions [27][28]. A second
level decomposition is considered.
For instance, features are extracted from HH2. The
Daubechies-4 wavelet [29] is chosen as the mother
wavelet since it has the advantage of better resolution for
smoothly changing signals [30]. The biometric system
based on wavelet analysis and used for tongueprints
verification is shown in Figure 4.



Figure 3. Two level 2-D DWT decomposition of an image.




Figure 4. Biometric system with wavelet features.
D. Features Extraction From DWT and Gabor Filter
The statistics used to describe the processed images
are the mean, standard deviation, smoothness, third
moment, uniformity, and entropy. They are chosen since
they are widely used in pattern recognition [31]. The statistics are expressed as follows:

$$\text{Mean: } m=\sum_{i=0}^{L-1} z_i\,p(z_i) \qquad (10)$$

$$\text{St. Dev.: } \sigma=\sqrt{\sigma^{2}(z)} \qquad (11)$$

$$\text{Smoothness: } R=1-\frac{1}{1+\sigma^{2}} \qquad (12)$$

$$\text{Third Moment: } \mu_{3}=\sum_{i=0}^{L-1}\left(z_i-m\right)^{3} p(z_i) \qquad (13)$$

$$\text{Uniformity: } U=\sum_{i=0}^{L-1} p^{2}(z_i) \qquad (14)$$

$$\text{Entropy: } e=-\sum_{i=0}^{L-1} p(z_i)\log_{2} p(z_i) \qquad (15)$$
where z is a random variable indicating intensity, p(z_i) is the probability of the ith intensity level in the histogram, and L
is the total number of intensity levels. A vector is
constructed with these statistics as feature components to
be fed to the k-NN classifiers.

E. The Classifier
The k-nearest neighbor algorithm (k-NN) was first
introduced by [32] and is a nonparametric method that
assigns query data to the class that the majority of its k-
nearest neighbors belong to. For instance, the k-NN
algorithm uses the data directly for classification without
the need of an explicit model. The performance of k-NN
depends on the number of the nearest neighbor k. In
general, there is no solution to find the optimal k.
However, trial and error approach is usually used to find
its optimal value. Therefore, our objective is to find the
value of k that maximizes the classification accuracy. The
main advantage of k-NN algorithm is the ability to
explain the classification results. On the other hand, its
major drawback is the need to find the optimal k and to
define the appropriate metric to measure the distance
between the query instance and the training samples. In
this paper, the distance metric chosen is the Euclidean
distance. The standard algorithm of k-NN is given as
follows:
(1) Calculate Euclidean distances between an unknown
object (o) and all the objects in the training set;
(2) Select k objects from the training set most similar to
object (o), according to the calculated distances;
(3) Classify object (o) with the group to which a majority
of the K objects belongs.
IV. EXPERIMENTAL RESULTS
The images are provided by the Biometric Research
Centre of The Hong Kong Polytechnic University [33]
(PolyU/HIT Tongue database). The PolyU/HIT Tongue
Database contains 12 color images in BMP image format,
which are used in [13][17][30]. According to [33] these
images are captured using a black box with special
lighting. Thus the capturing environment is stable. The
number of images in the database is twelve. An example
of the original images and ROI are given in Figures 5 and
6. The original tongue database in [12] contains 174
images. We were provided only with twelve images. Nine
images are used for training and three images for testing.
The training database is composed of six true images and
three impostor images. The images that form training and
testing databases were randomly selected. Finally, the
correct recognition rate is computed for each experiment.
The Matlab software is used to perform wavelet analysis,
Gabor filter, and spectral analysis and to train and test the
k-NN classifier. The classifier has been trained and
tested with k varying from one to five. The best
recognition rates were obtained for k=1 and k=2 which
give similar performances.


Figure 5. A sample from the original images.




Figure 6. An example of the ROI image 87 by 87 pixels.


Then, increasing k above 2 leads to increasing the error
rate of the classifier for all types of features. Therefore,
k=1 is chosen as the optimal value. Finally, three
performance measures are used. They are the correct
classification rate (CCR), sensitivity, and specificity. The
performance measures are given as follows:
$$CCR = CCS/CS \qquad (16)$$

$$\text{Sensitivity} = CCPS/TPS \qquad (17)$$

$$\text{Specificity} = CCNS/TNS \qquad (18)$$
where CCS is correctly classified samples, CS is
classified samples, CCPS is correctly classified positive
samples, TPS is true positive samples, CCNS is correctly
classified negative samples, and TNS is true negative
samples. The experimental results for each experiment
given the type of features are shown in Figures 7 and 8.
On the other hand, Figures 9, 10, and 11 show the scatter
plots of spectral, Gabor and DWT approach respectively.
By analyzing the results, the k-NN with wavelet-based features correctly classifies the testing data with 92% accuracy. On the other hand, the
correct detection rate obtained with Gabor filter features
is 83%. Finally, k-NN achieves the lowest performance
when features from spectral analysis are considered as
inputs (75%). Besides, the experimental results suggest
that all three approaches successfully detected the true
persons with 100% sensitivity. However, they all fail to
correctly detect impostors. For instance, the specificity
statistic obtained by DWT, Gabor and spectral analysis is
83%, 67%, and 50% respectively.
The results show evidence of two findings. First,
features obtained with wavelet transform perform much
better than features obtained with Gabor filter and spectral
analysis. Then, our choice for wavelet transform to
extract features from tongue texture is justified. Second,
features obtained with Gabor filter perform much better
than features obtained with spectral analysis. This result
is consistent with finding in [12].

Figure 7. Correct classification rates: 0.92 (DWT-HH2), 0.83 (Gabor filter), 0.75 (spectral analysis).

Figure 8. Sensitivity and specificity (sensitivity: 1.00 for all three approaches; specificity: DWT-HH2 0.83, Gabor filter 0.67, spectral analysis 0.50).


Figure 9. Scatter plot of the spectral analysis approach (true persons vs. impostors).


Figure 10. Scatter plot of the Gabor filter approach (true persons vs. impostors).


Figure 11. Scatter plot of the DWT approach (true persons vs. impostors).







V. CONCLUSION
The purpose of this study is to compare the performance
of the k-NN classifier given the approach used to extract
features from tongueprint texture for person verification.
Unlike previous literature [12] that uses Gabor filter and
spectral analysis for feature extraction from tongueprint
for person authentication, the wavelet transform is
considered in this study. The results show strong evidence
that features obtained with the wavelet transform help improve
the correct detection rate. In addition, features obtained
with spectral analysis perform the worst; indeed, features
extracted with Gabor filters perform much better than spectral
analysis features. This finding is consistent with the results
found in [12]. Finally, all feature extraction approaches
provide a 100% sensitivity statistic; therefore, they are all
capable of detecting true persons. On the other hand, the
wavelet approach gives the highest detection rate for
impostors. The wavelet-based system for tongueprint
authentication has proven its effectiveness and could be a
very promising approach for implementation in clinical
medicine for diagnosis purposes, mainly in Traditional
Chinese Medicine.
The spectral analysis, which is based on the Fourier
transform, characterizes the spatial-frequency distribution
but ignores the information in the spatial domain. On the
other hand, the Gabor filter jointly considers specific
frequency and orientation characteristics to analyze textured
images. Consequently, the performance of k-NN with Gabor
features is higher than with spectral features. However,
designing an optimal Gabor filter is very hard since it
depends on many parameters; thus, human intervention is
required to select the appropriate parameters for texture
analysis. Finally, the wavelet transform is useful for feature
extraction since it provides uncorrelated data and improves
recognition accuracy with information extracted from the HH
subbands. This indicates that high frequency channels contain
more information regarding the texture of the biological
tissue. Furthermore, the wavelet transform does not require as
many parameters as the Gabor filter: only the type of mother
wavelet and the level of resolution need to be selected.
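As an illustration of this feature-extraction step, the following Python sketch uses the PyWavelets library to take a two-level 2-D DWT of the ROI and compute simple statistics of the HH2 (diagonal detail) subband. The choice of a Daubechies-4 mother wavelet and the particular statistics are assumptions for the example, not necessarily those used in the experiments above.

import numpy as np
import pywt

def hh2_features(roi, wavelet="db4"):
    """Two-level 2-D DWT of a grayscale ROI; return simple statistics
    of the level-2 diagonal (HH2) subband as a feature vector."""
    # wavedec2 at level 2 returns [cA2, (cH2, cV2, cD2), (cH1, cV1, cD1)]
    coeffs = pywt.wavedec2(roi, wavelet=wavelet, level=2)
    hh2 = coeffs[1][2]                      # diagonal detail at level 2
    return np.array([hh2.mean(), hh2.std(),
                     np.abs(hh2).mean(), (hh2 ** 2).mean()])

# e.g. features = hh2_features(roi_87x87)   # hypothetical 87x87 ROI array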
In sum, it is suggested to use wavelet transform in
tongueprint texture identification and verification. For
future work, a larger database will be considered. In
addition, it is recommended to explore different types of
mother wavelets at different resolution levels to examine
their effects on the recognition performance of a
biometric system based on tongueprints.
REFERENCES
[1] S. N. Yanushkevich, Synthetic Biometrics: A Survey,
IEEE International Joint Conference on Neural Networks,
Vancouver, BC, Canada, July 16-21, pp.676-683,2006.
[2] Kresimir Delac, and Mislav Grgic, A Survey of Biometric
Recognition Methods, 46th International Symposium
Electronics in Marine, Zadar, Croatia, June 16-18, 2004.
[3] Ahmad, Fadzilah and Mohamad, Dzulkifli, A Review on
Fingerprint Classification Techniques, IEEE International
Conference on Computer Technology and Development,
Volume 2, Nov 13-15, pp.411-415, 2009.
[4] Adams Kong, David Zhang and Mohamed Kamel, A
Survey of Palmprint Recognition, Pattern Recognition, Vol.
42, pp.1408-1418, 2009.
[5] Elham Bagherian and Rahmita Wirza O.K. Rahmat, Facial
Feature Extraction for Face Recognition: A Review, IEEE
International Symposium on Information Technology, Vol.
2, Aug 26-28, pp.1-9, 2008.
[6] Ray K. C. Lai, Jack C. K. Tang, Angus K. Y. Wong, and
Philip I. S. Lei, Design and Implementation of an Online
Social Network with Face Recognition, Journal of
Advances in Information Technology, Vol 1, No 1, pp. 38-42,
2010.
[7] N. Radha, T. Rubya, and S. Karthikeyan, Securing Retinal
Template Using Quasigroups, Journal of Advances in
Information Technology, Vol 2, No 2, pp. 80-86, 2011.
[8] Richard Yew Fatt Ng, Yong Haur Tay, and Kai Ming Mok,
A Review of Iris Recognition Algorithms, IEEE
International Symposium on Information Technology, Vol.
2, pp.1-7, 2008.
[9] Potamianos Gerasimos, Audio-Visual Automatic Speech
Recognition and Related Bimodal Speech Technologies: A
Review of The State-of-The-Art and Open Problems, IEEE
Workshop on Automatic Speech Recognition &
Understanding, Vol. 17, pp.2222, 2009.
[10] Yi-Bo Li, Tian-Xiao Jiang, Zhi-Hua Qiao, and Hong-Juan
Qian, General Methods and Development Actuality of Gait
Recognition, International Conference on Wavelet Analysis
and Pattern Recognition, Vol. 3, pp.1333-1340, 2007.
[11] Roman V. Yampolskiy, Human Computer Interaction
Based Intrusion Detection, IEEE Fourth International
Conference on Information Technology, April 2-4, pp.837-
842, 2007.
[12] David Zhang, Zhi Liu, and Jing-Qi Yan, Dynamic
Tongueprint: A Novel Biometric Identifier, Pattern
Recognition, Vol. 43, pp. 1071-1082, 2010.
[13] Valeri Mikhnev, A Comparative Study of Data
Processing Algorithms for Radar Imaging of Stratified
Building Structures, International Symposium on Non-
Destructive Testing in Civil Engineering (NDT-CE), 2003.
[14] Buciu I. and Gacsadi A., Gabor Wavelet Based Features
for Medical Image Analysis and Classification, IEEE 2nd
International Symposium on Applied Sciences in Biomedical
and Communication Technologies (ISABEL), November 24-
27, Bratislava, Slovak Republic, pp. 1-4, 2009.
[15] Arivazhagan S. and Ganesan L., Texture Classification
using Wavelet Transform, Pattern Recognition Letters,
Vol.24, pp. 1513-1521, 2003.
[16] Bo Pang, David Zhang, Naimin Li, and Kuanquan Wang,
Computerized Tongue Diagnosis Based on Bayesian
Networks, IEEE Transactions on Biomedical Engineering,
No 10, Vol. 51, pp. 1803-1810, 2004.
[17] Bo Pang, David Zhang, and Kuanquan Wang, Tongue
Image Analysis for Appendicitis Diagnosis, Information
Sciences, Vol. 175, pp. 160-176, 2005.
[18] Abdulkadir Sengur, Ibrahim Turkoglu, and M. Cevdet Ince.,
Wavelet Packet Neural Networks for Texture
Classification, Expert Systems with Applications, Vol. 32,
Issue 2, pp. 527-533, 2007.
[19] Yuen P.C., Kuang Z.Y., Wu W., and Wu. Y.T., Tongue
Texture Analysis using Gabor Wavelet Opponent Colour
Features for Tongue Diagnosis in Traditional Chinese
Medicine, In Texture Analysis in Machine Vision. Series in
Machine Perception and Artificial Intelligence, Vol. 40, pp.
179-188, 2000.

[20] Yang Cai, A Novel Imaging System for Tongue
Inspection, Proceedings of the 19th IEEE Instrumentation
and Measurement Technology Conference, Vol. 1, pp. 159-
163, 2002.
[21] Chuang-Chien Chiu, A Novel Approach Based on
Computerized Image Analysis for Traditional Chinese
Medical Diagnosis of The Tongue, Computer Methods and
Programs in Biomedicine, Vol. 61, pp. 77-89, 2000.
[22] Gabor D., Theory of Communication, J IEE, Vol. 93,
Issue 26, pp.429-57, 1946.
[23] Daugman J.G., Uncertainty Relation for Resolution in
Space, Spatial-Frequency and Orientation Optimized by
Two-Dimensional Visual Cortical Filter, J Opt Soc Am, Vol.
2, Issue 7, pp. 1160-1169, 1985.
[24] Mohammed Al-Rawi and Jie Yang, Using Gabor Filter
for The Illumination Invariant Recognition of Color
Texture, Mathematics and Computers in Simulation, Vol.
77, pp. 550-555, 2008.
[25] Francesco Bianconi and Antonio Fernández, Evaluation
of The Effects of Gabor Filter Parameters on Texture
Classification, Pattern Recognition, Vol. 40, pp. 3325-3335,
2007.
[26] Wei Wang, Jianwei Li, Feifei Huang, and Hailiang Feng,
Design and Implementation of Log-Gabor Filter in
Fingerprint Image Enhancement, Pattern Recognition
Letters, Vol. 29, pp. 301-308, 2008.
[27] Sakka E., Prentza A., Lamprinos I.E. and Koutsouris D.,
Microcalcification Detection using Multiresolution Analysis
Based on Wavelet Transform, IEEE International Special
Topic Conference on Information Technology in
Biomedicine, Ioannina, Epirus, Greece, October 26-28, 2006.
[28] Heil C.E. and Walnut D.F., "Continuous and Discrete
Wavelet Transforms", SIAM Review, Vol. 31, No. 4, pp. 628-
666, 1989.
[29] Ingrid Daubechies, Ten Lectures on Wavelets. SIAM,
Philadelphia, PA, 1992.
[30] María-Elena Algorri and Fernando Flores-Mangas,
Classification of Anatomical Structures in MR Brain
Images Using Fuzzy Parameters, IEEE Transactions on
Biomedical Engineering, Vol. 51, No. 9, pp. 1599-1608, 2004.
[31] Holalu Seenappa Sheshadri and Arumugam Kandaswamy,
Breast Tissue Classification using Statistical Feature
Extraction of Mammograms, Medical Imaging and
Information Sciences, Vol. 23, No. 3, pp. 105-107, 2006.
[32] Fix E. and Hodges J.L., Discriminatory Analysis -
Nonparametric Discrimination: Consistency Properties,
Report No. 4, Randolph Field, Texas: US Air Force School of
Aviation Medicine, 1951.
[33] http://www.comp.polyu.edu.hk/~biometrics/



Salim Lahmiri (M.Eng, Ph.D, Canada) is interested in
biomedical signal and image processing, computer vision, and
pattern recognition.
The Need for Effective
Information Security Awareness

Fadi A. Aloul
Department of Computer Science & Engineering
American University of Sharjah, Sharjah, UAE
Email: faloul@aus.edu


Abstract - Security awareness is an often-overlooked factor
in an information security program. While organizations
expand their use of advanced security technology and
continuously train their security professionals, very little is
used to increase the security awareness among the normal
users, making them the weakest link in any organization. As
a result, today, organized cyber criminals are putting
significant efforts to research and develop advanced hacking
methods that can be used to steal money and information
from the general public. Furthermore, the high internet
penetration growth rate in the Middle East and the limited
security awareness among users is making it an attractive
target for cyber criminals.

In this paper, we will show the need for security awareness
programs in schools, universities, governments, and private
organizations in the Middle East by presenting results of
several security awareness studies conducted among
students and professionals in UAE in 2010. This includes a
comprehensive wireless security survey in which thousands
of access points were detected in Dubai and Sharjah most of
which are either unprotected or employ weak types of
protection. Another study focuses on evaluating the chances
of general users to fall victims to phishing attacks which can
be used to steal bank and personal information.
Furthermore, a study of the users' awareness of privacy
issues when using RFID technology is presented. Finally, we
discuss several key factors that are necessary to develop a
successful information security awareness program.

Index Terms - Information Security, Security Awareness,
Security Audits, Phishing Attacks, Wireless Security, RFID
Security, UAE.

I. INTRODUCTION
Internet users in the Middle East have been
continuously increasing in the past few years. According
to the World Internet Usage Statistics News [1], while the
Middle East constitutes 3.2% of the worldwide internet
users, it has registered an internet usage growth of 1825%
in the past 10 years, compared with the growth of 445%
in the rest of the world. It also reported that Bahrain,
UAE, and Qatar had the highest internet penetration rates
in the Middle East as of June 30, 2010 with rates
equivalent to 88%, 75.9%, and 51.8% of their population,
respectively. This growth has attracted hundreds of online
companies to conduct business in the Middle East and
allowed many existing sectors, such as education, health,
airline, and government, to move their operations online.
Another study by the Arab Advisors Group [2] showed
that the UAE had the highest e-commerce penetration
rate in 2008. Specifically, 21.5% of UAE, 14.3% of Saudi
Arabia, 10.7% of Kuwait, and 1.6% of Lebanon residents
engaged in web commerce and in most cases such
engagements required the use of credit cards. A study
conducted by Lafferty Group [3] showed that the number
of credit cards in the Middle East and North Africa region
jumped by 24% in 2006 to 6.23 million and is expected to
see a 51% increase in the number of credit card users in
2008 as compared to 2006.
The high internet penetration and credit card use growth,
fueled by advances in internet technology, has led to a
significant increase in the number of online transactions,
electronic data, and smart
mobile devices. However, the last few years have also
seen an increase in the number of cybercrime incidents in
the Middle East. Local media occasionally report
incidents of online fraud, attempts to hack banks, and
websites being shut down or defaced. For example, in
May 2008, Al-Khaleej Newspaper website, a reputable
newspaper based in UAE, was defaced by hackers [4].
Later that year, in October 2008, Arabiya.net website, a
reputable Middle East News Channel, was also defaced
[5]. In both incidents, the hackers claimed to have
conducted the attacks because of political reasons. In
May 2008, the Bahraini Telco company was targeted by
phishing attacks [6]. Later that year, the National Bank of
Kuwait was also targeted by phishing attacks [7]. In
January 2010, several UAE bank websites were a target
of phishing attacks as reported by ITP [8]. In April 2010,
it was reported that several users lost their UAE bank
savings through internet fraud attacks [9]. In April 2010,
the UAE Ministry of Education was infected by a
computer virus [10]. In June 2010, Saudi Arabia's Riyad
Bank website was hacked [11]. Also in June 2010, Al Jazeera
Sport's World Cup broadcasting was interrupted by
hackers [12].
The worldwide increase in information technology (IT)
security incidents is mainly due to the (1) increase in
electronic data, (2) increase in mobile devices, (3)
increase of organized cybercrime groups, (4) increase of
intelligent external and internal IT security threats, (5)
difficulty of tracing the attackers, (6) limited cybercrime
laws, and (7) limited IT security knowledge among
internet users. The hackers are also motivated by various
reasons for conducting their attacks. Examples include: (1)
spreading a political message, (2) gaining financially (i.e.
Theft), (3) stealing information, (4) causing damage and
disturbance, and (5) achieving self satisfaction and fame.
The increase in IT security incidents has alerted
governments to introduce federal laws to fight IT crimes,
also known as e-crime or cybercrime. Many countries in
North America, Europe, and Asia have already
implemented and enforced such laws. A few Middle
Eastern countries have already introduced such laws [13].
The UAE was one of the first Middle Eastern countries to
introduce a cybercrime federal law in January 2006. The
law consisted of 26 articles and covered the majority of
cybercrime incidents. The penalty ranged from fines up to
100,000 UAE Dirhams and/or 15 years of imprisonment.
Saudi Arabia followed by introducing a cybercrime
federal law in October 2006. Such laws helped reduce the
number of IT security incidents, but unfortunately
incidents still occur in the region and are mainly because
of the (1) lack of cybercrime laws in most of the Middle
East countries, (2) limited enforcement of cybercrime
laws, (3) lack of knowledge among residents of such
cybercrime laws, and (4) few computer incident forensics
teams that exist in the region.
Today, as organizations expand their use of advanced
secure technologies, hackers are attempting to break into
organizations by targeting the weakest link: the
uneducated computer user [14]. According to [15],
computer user mistakes are considered one of the top
threats to IT security in organizations. In this paper, we
will show the need for security awareness programs in
schools, universities, governments, and private
organizations in the Middle East by presenting results of
several security awareness studies conducted among
students and professionals in UAE in 2010. The first
study, presented in Section 2, focuses on studying the
chances of general users to fall victims to phishing
attacks which can be used to steal bank and personal
information. We present the results of an approved
phishing audit made without notice within an academic
organization. The study is the first-of-its-kind in UAE
and has shown to be very useful in increasing the general
security awareness. The second study, presented in
Section 3, involves a comprehensive wireless security
survey in which thousands of access points were detected
in Dubai and Sharjah most of which are either
unprotected or employ weak types of protection. In
Section 4, we discuss the level of RFID security
awareness in the UAE. In Section 5, we list the key
factors necessary to develop a successful security
awareness program in the Middle East. We finally
conclude by showing examples of recent Middle Eastern
governmental initiatives to spread security awareness
among its citizens.
II. PHISHING ATTACKS IN UAE
Phishing is a form of Internet fraud that aims at
stealing valuable information such as credit cards, social
security numbers, user IDs and passwords. The fraud
starts by creating a fake website that looks exactly like
that of a legitimate organization but with a slightly
different URL address. In many cases, the organizations
are financial institutions such as banks. An email is then
sent to thousands of internet users requesting them to
access the fake website, which is a replica of the trusted
site, to update their records by entering their personal
details, including security access codes. The page
generally looks genuine. Note that the email has a FROM
address that is identical to the original organization's
address, e.g. the Human Resources or IT director, to make
users believe that the email is authentic. However, the
FROM field in an email can easily be faked by a hacker,
and the email actually comes from the hacker's
computer. Once the user enters his or her personal
information into the fake website, the personal
information is sent to the hacker, and the user is
redirected to the legitimate website so that the fraud
attempt goes unnoticed.
According to the Anti-Phishing Working Group [16],
the number of unique fake phishing websites exceeded
42,000 pages per month in 2009, compared to 23,000
pages per month in 2008. That is almost one new
phishing website every one minute. The high number of
phishing websites reflects the effectiveness of the
phishing hacking method.
In the Middle East, cyber criminals are increasingly
targeting UAE residents with advanced hacking methods,
one of which is phishing scams [17]. Such scams have
caused UAE banks to raise their IT security services in
recent years. Although UAE's Cybercrime Law, Article
#10, imposes a fine and imprisonment on any person
who steals or transfers money using online fraud, several
phishing attacks against the UAE were detected in 2009 [18].
One of the detected attacks involved a duplicate website
of the UAE's Ministry of Labor, which had the URL:
http://www.uaeministryoflabour.tk. Note that the
authentic URL of the Ministry is http://www.mol.gov.ae.
The fake website was cheating people who wanted to find
a job in the UAE [19].
General user education is considered one of the most
important and widely-used approaches in fighting
phishing attacks. Several organizations have launched
awareness campaigns to educate the user on the meaning
of phishing attacks and how to detect such attacks and
avoid falling victims to them [20]. The campaigns can
include various formats of communication such as emails,
posters, in-class training, web seminars, games [21], etc.
While such campaigns help companies meet the
compliance requirements of security standards, such as
ISO [22] and NIST [23], recent studies have questioned
the effectiveness of such campaigns in protecting the
general users from falling victims to phishing attacks [21].
The educational campaigns typically keep track of the
users who took the training, i.e. the number of users who
attended the awareness sessions, the number of users who
passed the exams, etc. However, the campaigns fail to
identify the impact of the awareness sessions, i.e. the
number of users who might fall victims to real attacks
after the awareness sessions or the usefulness of the
awareness sessions.
In order to study the impact of awareness sessions and
the vulnerability of general users to phishing attacks,
several studies recommended the use of controlled in-
house phishing audits. In [24], the authors discussed the
urgent need for effective user privacy education to
counter social engineering attacks on secure computer
systems after they conducted a social engineering audit
among 33 employees in an organization, asking for their
usernames and passwords; 19 employees gave their
passwords. The study also noticed that the level of user
education against social engineering attacks was not
uniform across the organization's departments. Another
phishing audit was made among 576 office employees in
London in 2008 [25]. Results showed that 21% of the
respondents were willing to give their passwords out with
the lure of a chocolate bar and 58% would reveal their
password over the phone if the caller claimed he or she
was from the IT department. The audit also noted that
43% of the respondents rarely or never changed their
passwords and 31% of them used one password for all
their accounts. In [21], the authors conducted a phishing
audit among the employees of a Portuguese company that
was followed by phishing training and another phishing
audit. The authors noticed a failure rate of 42% in the
initial experiment and 12% in the latter experiment which
reflects the effectiveness of the phishing awareness
training. Another group conducted a similar two-phase
phishing experiment at the United States Military
Academy at West Point, New York [26]. Experiment
results also showed the ability of the participants to better
identify the phishing attacks after the training sessions.
The New York State Office of Cyber Security & Critical
Infrastructure Coordination conducted a similar two-
phase phishing experiment among their employees [27].
The experiment results also showed the ability of the
employees to better identify the phishing attacks after the
training sessions.
In order to study the vulnerability of users to phishing
attacks in the Middle East, a controlled phishing
experiment was conducted among the students, faculty,
and staff of the American University of Sharjah (AUS) in
UAE. The university consists of 5,000 students and 5,000
alumni in addition to 1,000 faculty and staff. The
students come from 80+ nationalities. The university was
founded in 1997 and offers 25 majors and 48 minors at
the undergraduate level and 13 master's degree programs
through four colleges (Arts and Science;
Engineering; Architecture, Art and Design; Business and
Management). The language of instruction at the
University is English. The experiment was performed by
three students and their advisor in coordination with the
AUS IT Director and the approval of the University's
Provost. No one else in the University knew about this
experiment. A fake website was set up to look identical to
an AUS website that is accessed by the users to change
their AUS passwords (see Figures 1 and 2). Note that the
phishing website URL address is different from the
original website URL address (https://passwords.
aus.edu). An email was sent to all AUS users asking them
to urgently change their passwords due to a security
breach. The AUS FROM address was faked to look
identical to the AUS IT Department email address. Once
the email was received by the users, they were requested
to click on a link https://passwords.aus.edu which
redirected the users to the phishing website URL. The
users were asked to enter their usernames and click on the
continue button. They were supposed to be taken to a
second page to enter their old and new passwords;
however, to ensure that no passwords were entered, the
users were directed to a second page with a timeout error
and a message asking them to try again after an hour due
to heavy system usage. A database was used to log all
entered usernames with the corresponding date and time.
User anonymity was ensured and no usernames were
revealed. The goal was only to count the number of
potential victims. The phishing website was left online
for 10 days. The AUS IT Department typically sends a
warning email to all AUS users whenever similar
phishing emails are sent to AUS users. The Department
also sends periodical emails alerting users to the latest IT
security threats. In the experiment's case, the IT
Department sent a warning email a few hours after the
original phishing email. Despite the warning emails, 954
users out of the 11,000 AUS users entered their
usernames to the phishing website. Of those, 96% were
students. The number of male and female victims was
almost equal. In terms of student levels, the victims also
ranged from all levels, freshman to senior students.
However, the highest number of victims was from the
freshman level. Interestingly, over 200 users fell victim
to the phishing experiment after the IT Department's
warning email was sent. This shows that, unfortunately,
some users ignore such warning emails and don't take
them seriously. Furthermore, if this sophisticated attack
was real or involved banking details, the consequences
would have been severe.
At the end of the experiment, an illustrative website
was setup explaining the details of the experiment,
discussing the results, and advising users on what a
phishing attack is and what they should do to avoid
falling victims in future. The website was announced to
all AUS users and published in the local media. The
experiment results were daunting and showed the need
for significant security awareness training, yet many
users, especially the victims, became more aware of
phishing attacks after the experiment.
To analyze the impact of the awareness sessions,
another phishing audit was conducted two weeks later.
Interestingly, only 220 users fell victim to the audit, all of
whom were students. The second audit showed a drop in
the number of victims from 9% to 2% which reflects the
effectiveness of the conducted awareness sessions. The
controlled phishing audit can identify the effectiveness of
the information security awareness campaign.
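The two victim rates quoted above follow directly from the raw counts; the small calculation below (illustrative only) reproduces them.

total_users = 11_000                       # AUS user population quoted above
victims_first, victims_second = 954, 220   # victims in the two audits

rate_first = victims_first / total_users   # ~0.087, reported above as ~9%
rate_second = victims_second / total_users # 0.020, reported above as 2%
print(f"first audit: {rate_first:.1%}, second audit: {rate_second:.1%}")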
Note that universities have always been a target for
cyber criminals since universities typically have a large
number of computing stations, fast internet bandwidth,
and allow guest access [14]. Yet, very few universities
are known to offer IT security awareness sessions to their
students and staff [28]. Recently, several researchers have
been exploring the factors that affect information security
awareness in universities [28, 29].
As users, today, are becoming familiar with phishing
attacks, hackers are launching more sophisticated
phishing attacks known as Spear Phishing. The idea is to
send a phishing email targeting specific names in
governments or financial enterprises. The emails typically
belong to senior executives and include personally
identifiable information that is collected from public
websites or social pages, e.g. Facebook or LinkedIn. Only
a limited number of emails are sent to make the emails
look credible and avoid publicizing the attack. Such
attacks usually end up with the victim passing his/her
personal information and passwords.
Another advanced phishing attack is referred to as
Pharming, where the hacker tampers with the domain name
system (DNS) such that the user is redirected to the fake
website while the original legitimate URL address is still
displayed. This makes it more
difficult for the user to identify the fraud attempt.
A new advanced phishing attack known as Real-Time
Phishing targets two-factor authentication systems, such
as one-time passwords or tokens. The attack works by
immediately using the captured password from the
phishing website to access the bank website and commit
the fraud as opposed to using the password at a later time.


Figure 1. Original Password Portal Website for the American
University of Sharjah.


Figure 2. Phishing Password Portal Website for the American
University of Sharjah. Note that the URL address is different from the
original URL address in Figure 1.
III. WIRELESS SECURITY IN UAE
Wireless internet users are on the rise. According to
Computer Industry Almanac Inc, 38.7% of the world
internet users used wireless networks in 2008 and the
number will increase to 65.7% in 2014 [30]. Wireless
networks allow for easy access to the internet and reduce
the need for wires. Most PDAs, phones, and laptops
today have wireless internet devices that allow users to
connect to wireless hotspots. Today, wireless access
points are sold in normal supermarkets for less than $100
and are deployed in most homes, companies, universities,
hospitals, airports, etc.
Nevertheless, using a wireless access point without
changing its default configuration allows data to be
exchanged between the access point and the wireless
device, e.g. a laptop, over the air unencrypted. In most
cases, users don't read the access point manual or spend
the time to change the default configuration. When no
encryption is used, an attacker can easily eavesdrop on
any exchanged communication and steal the user's private
data, such as emails or bank account information. An
attacker can also connect to the internet via the access
point and use it to avoid paying internet charges or, more
seriously, to conduct attacks against others while hiding
the attacker's identity.
Today, several wireless encryption systems exist.
Examples include WEP and WPA [31]. The WEP system,
which was introduced in 1999, has been shown to have
security flaws and security consultants are continuously
advising customers not to use WEP. An attacker can
easily break into the WEP system and identify the WEP
password with freely available tools. WPA is considered
the newest and most secure wireless encryption system.
In 2010, a wireless security assessment was conducted
in two cities of the UAE: Dubai and Sharjah. Residential and
commercial areas were assessed for the number of
wireless access points and the percentage of users that
employ any type of encryption. The study found 12,000
access points in the two cities, of which 40% employed
WPA encryption, 38% employed WEP encryption, and
22% had no encryption (see Figure 3).
A similar survey was conducted in 2008 in the three
cities of the UAE: Abu Dhabi, Dubai, and Sharjah [32]. The
2008 study found 15,000 access points in the three cities,
of which 35% employed WPA encryption, 33%
employed WEP encryption, and 32% had no encryption
(see Figure 4). While the share of access points with no
encryption clearly dropped by 10 percentage points, the share
of access points with the weak WEP encryption increased.
This shows the limited wireless security awareness
among some users and the need for additional education.
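To make the comparison concrete, the snippet below (illustrative only, using the percentages reported above) converts the two surveys into approximate access-point counts per encryption type; the percentage-point shift, not the raw count, is what the comparison above refers to.

surveys = {
    "2008 (Abu Dhabi, Dubai, Sharjah)": (15_000, {"WPA": 0.35, "WEP": 0.33, "None": 0.32}),
    "2010 (Dubai, Sharjah)":            (12_000, {"WPA": 0.40, "WEP": 0.38, "None": 0.22}),
}

for name, (total, shares) in surveys.items():
    # approximate counts derived from the reported percentages
    counts = {enc: round(total * p) for enc, p in shares.items()}
    print(name, counts)
# Unencrypted share: 32% -> 22% (down 10 percentage points);
# weak WEP share: 33% -> 38% (up 5 percentage points).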



Figure 3. Percentage of Wireless Access Point Encryption Types in the
2010 UAE Survey.


Figure 4. Percentage of Wireless Access Point Encryption Types in the
2008 UAE Survey.
IV. RFID SECURITY IN UAE
Radio Frequency Identification (RFID) is a technology
used for identifying and tracking objects that carry RFID
tags. The tags are typically attached to products, animals,
or humans. An advantage of the RFID tag is that it can be
identified beyond the line of sight of the reader and up to
hundreds of meters away.
Organizations from all over the world have been heavily
investing in RFID to help them reduce their operation
cost, improve their business, and increase their revenue.
The Middle East has seen a rise in RFID applications
in the past few years. Dubai in United Arab Emirates
started using RFID gates for e-tolling. The Saudi Post
Corporation uses RFID tags to track valuable mail.
Emirates Motor Company, the world's largest Mercedes-Benz
facility, uses RFID tags to reduce the amount of
time needed to locate the vehicles in its large service
centers. Jewelry shops use RFID tags for fast detection of
missing items. UAE Universities, such as the American
University of Sharjah, are placing RFID tags on the
diplomas that they issue to ensure the validity of the
certificate. Children's stores, such as Baroue in Kuwait,
use RFID tags to allow parents to track their children
while they play in the store. Several organizations in the oil
& gas, construction, and health industries are using RFID
tags for access management.
According to VDC Research, the market of RFID
services in the Middle East was estimated at $29.4
million in 2009 and expected to reach $69.1 million in
2012 [33]. In contrast, the RFID services markets in
North/South America and Asia-Pacific are expected to
reach $1.28 billion and $1.6 billion, respectively, in 2012.
Although the RFID market in the Middle East is still
small, the growth rate is high.
Unfortunately, the introduction of new technologies
always comes with a byproduct, which is the abuse of the
technology. Today, several security researchers have
already highlighted various security weaknesses in RFID
systems, mainly being the illicit tracking of RFID tags. In
addition to privacy concerns, RFID tags can be used for
user profiling without the user's knowledge. For example,
access to RFID tags can reveal reading habits in the case
of tagged books or the financial situation in the case of
tagged banknotes. Such weaknesses call for the need of
public awareness of RFID technology and the
understanding of its benefits, challenges, and risks.
According to [34], the general awareness regarding
RFID is low in both the United States and Europe. The
RFID awareness in the Middle East is also limited. The
last few years have seen a few RFID-based conferences
hosted in the Middle East, but more is needed to increase
the general awareness among the public with respect to
how the RFID technology works and its security and
privacy concerns.
V. CYBER SECURITY AWARENESS
Hackers are continuously identifying new means of
stealing information. Unfortunately, the presence of
uneducated users in an organization makes them an
easy target for hackers and vulnerable to privacy attacks
[35]. User education and training is a must to combat IT
security threats. Users should not only learn the material
but they should also apply it in their daily life. This is not
a simple task to achieve and not the sole responsibility of
the user or the organization. Many groups have to be
involved to produce an IT security-aware resident. We
summarize some of the recommendations below:

Governments should produce cybercrime laws and
enforce them. They should also work closely with
other governments since many attacks can be
conducted from abroad. They should also establish
Computer Emergency Response Teams (CERT) that
are dedicated to the detection, prevention, and
response of cyber security incidents.
Computer Emergency Response Teams (CERT)
should be established to enhance the security
awareness among residents. CERT can also help
establish new cybercrime laws, train computer
forensic teams, and assist organizations and users in
fighting cybercrime. They can declare a cyber
security awareness month that can help increase the
public's attention to the importance of cyber security
awareness. In the Middle East, UAE, Saudi Arabia,
and Qatar have recently introduced CERT centers
[36, 37, 38].
Police Departments should have computer
forensics teams that are specialized in obtaining,
recovering, examining, analyzing, and presenting
electronic evidence stored on computers or electronic
devices.
Enterprises should offer security training to their
employees and clients. This could be online or onsite
training or a combination of both. The training
should be done regularly, e.g. twice a year, since new
IT security threats appear constantly. The enterprise
management should endorse and financially support
the training. The enterprise should also have a local
awareness campaign by distributing posters and
emailing newsletters to alert users to the latest IT
security threats. Note that the method of preparing
the awareness material is very important and the
content needs to be customized for different users.
For example, language and culture must be taken
into consideration when preparing the awareness
material. Similarly, the method of delivering the
awareness material is very important and the
communication process needs to be customized for
different users. For example, in a university, social
media websites, such as Facebook, can be used to
deliver the awareness material since a large number
of students are likely to be using social media
websites. In order to be compliant with standards that
require awareness programs, e.g. ISO 27001,
learning management systems can be used to track
the users' learning activity. Audits, similar to the one
conducted at the American University of Sharjah and
reported in Section 2, should also be conducted to
measure the level of security awareness among the
users and the effectiveness of the security awareness
campaign and training. Such audits must be carefully
planned and implemented to meet their goals, while
protecting the privacy and the personal data of the
assessed users. A central point of contact regarding
IT security related matters should be established to
ease the communication with the users. The
education material should cover the organization's IT
security policies and the penalties for not following
the rules. Finally, enterprises should adopt a
proactive rather than a reactive approach to security
awareness.
Telecommunication companies (ISPs) should
offer advice on how to use the internet safely and how to
configure any internet device securely.
Media should continuously post IT security
advice, report IT security incidents, and publicize the
penalties that attackers received.
Users should train themselves by constantly
reading magazines, books and online articles on IT
security threats and what to do to protect themselves
from such threats.
Non-Governmental Organizations (NGOs)
should lead IT security awareness campaigns and
provide support for those who have questions or have
security problems.
Schools and Universities should offer security
awareness campaigns and integrate IT security topics
into their computer courses curriculum.

In 2008, a new initiative was proposed to fight
cyber terrorism by bringing governments, businesses, and
academia together from all over the world. The initiative,
known as the International Multilateral Partnership
Against Cyber Terrorism (IMPACT) [39], consists of the
international partnership of more than 30 countries to
study and respond to high-level cyber security threats.
VI. CONCLUSIONS
As Middle East organizations expand their use of
advanced security technology and use the latest hardware
and software, it is becoming more difficult to conduct
technical attacks. Similarly, the organizations are
developing well-written complete security policies and
hiring IT security experts that are also helping in reducing
the number of possible attacks. Unfortunately, little is
done to secure the weakest link, i.e. the users. This is
pushing attackers to gain unauthorized access to
information by exploiting users' trust and tendency to
help. The paper discussed the security awareness among
users in the Middle East and reported the findings of
several IT security awareness studies conducted among
students and professionals in UAE. It discussed the
importance of assessing the security awareness by
running controlled audits. Several key factors to help
increase the security awareness among users were also
presented.
ACKNOWLEDGMENT
The author would like to thank Ahmed El Zarka,
Arsalan Bhojani, Jamshaid Mohebzada, Maram Jibreel,
Rayan Al-Omran, and Rim Zakaria for collecting some of
the data.
REFERENCES
[1] Miniwatts Marketing Group, 2010 Internet World Stats.
Available at: http://www.internetworldstats.com/stats.htm.
[2] B2C e-commerce volume exceeded US$ 4.87 billion in
Kuwait, Lebanon, Saudi Arabia and UAE in 2007, Arab
Advisors Group Report, 2008. Available at:
http://www.arabadvisors.com/Pressers/presser-040208.
htm-0.
[3] Explosion in popularity of credit cards in the Middle
East, Lafferty Group Report, 2007. Available at:
http://www.lafferty.com/pdffiles/Lafferty%20MENA%20ca
rds%20-%20press%20release%20180607%20_3_.pdf.
[4] Leading UAE newspapers website hacked, Arabian
Business, 2008. Available at: http://www.arabianbusiness.
com/519982-leading-uae-newspapers-website-hacked.
[5] Al Arabiya hit by Sunni-Shiite hacking war, Al Arabiya
News Channel, 2008. Available at:
http://www.alarabiya.net/articles/2008/10/10/57995.html.
[6] Batelco internet subscribers targeted by phishing attack,
Arabian Business, 2008. Available at:
http://www.arabianbusiness.com/520459-batelco-internet-
subscribers-targeted-by-phishing-attack.
[7] NBK online banking customers targeted by phishing
attacks, Arabian Business, 2008. Available at:
http://www.arabianbusiness.com/522781-nbk-online-
banking-customers-targeted-by-phishing-attack.
[8] UAE bank targeted in major phishing attacks, ITP, 2010.
Available at: http://www.itp.net/579059-uae-bank-targeted
-in-major-phishing-attack.
[9] Phishing raid empties bank accounts, The National, 2010.
Available at:
http://www.thenational.ae/apps/pbcs .dll/article?AID=/201
00405/NATIONAL/704049912&SearchID=733987396980
56.
[10] Internet virus infects Ministry of Education, UAE Today,
2010. Available at: http://www.emaratalyoum. com/local-
section/accidents/2010-04-12-1.106891.
[11] Riyad bank website hacked, AMEinfo, 2010. Available at:
http://www.ameinfo.com/235378.html.
[12] Al Jazeera blames hackers for World Cup interruption,
The National, 2010. Available at http://www.thenational.
ae/apps/pbcs.dll/article?AID=/20100613/NATIONAL/7061
29867&SearchID=73402848695382.
[13] International Cyber Crime Law. Available at: http://www.
cybercrimelaw.net/ laws/alfabetic/s-t.html.
[14] F. Katz, The effect of a university information security
survey on instructing methods in information security, in
Proc. of the Annual Conference on Information Security
Curriculum Development, pp. 43-48, 2005.
[15] M. Whitman and H. Mattord, Principles of Information
Security, Course Technology, 2nd edition, 2007.
[16] APWG, Phishing Activity Trends Report, Q4 2009.
Available at: http://www.antiphishing.org/reports/apwg_
report_Q4_2009.pdf.
[17] Emirates vulnerable to internet attacks, The National,
2008. Available at: http://www.thenational.ae/apps
/pbcs.dll/article?AID=/20080814/NATIONAL/420302377
&SearchID=73402849474210.
[18] UAE cybercrime squad gunning forward, Arabian
Business, 2009. Available at: http://www.Arabianbusiness.
com/553470-uae-cybercrime-squad-gunning-forward.
[19] Phishing website of bogus recruitment agency blocked,
Gulf News, 2008. Available at: http://gulfnews.com
/news/gulf/uae/employment/phishing-website-of-bogus-
recruitment-agency-blocked-1.84296.
[20] D. Timko, The Social Engineering Threat, Information
Systems Security Association Journal (ISSA), January
2008.
[21] P. Kumaraguru, S. Sheng, A. Acquisti, L. Cranor, and J.
Hong, Lessons From a Real World Evaluation of Anti-
Phishing Training, in Proc of the IEEE eCrime
Researchers Summit, pp. 1-12, 2008.
[22] ISO/IEC 27001:2005 - Information technology Security
techniques Information security management systems
Requirements. Published by International Organization for
Standardization (ISO) and the International
Electrotechnical Commission (IEC), October 2005.
[23] NIST An Introduction to Computer Security. Published
by National Institute of Standards and Technology (NIST),
2004.
[24] G. Orgill, G. Romney, M. Bailey, and P. Orgill, The
urgency for effective user privacy-education to counter
social engineering attacks on secure computer systems, in
Proc. of the 5th Conference on Information Technology
Education, pp. 177-181, 2004.
[25] Women 4 times more likely than men to give passwords
for chocolate, Infosecurity Europe, 2008. Available at:
http://www.infosec.co.uk/page.cfm/T=m/Action=Press/Pre
ssID=1071
[26] R. Dodge, C. Carver, and A. Ferguson, Phishing for User
Security Awareness, Computers and Security, 26(1), pp.
73-80, February 2007.
[27] New York State Office of Cyber Security & Critical
Infrastructure Coordination, Gone Phishing. A Briefing
on the Anti-Phishing Exercise Initiative for New York
State Government. Aggregate Exercise Results for public
release, 2005.
[28] Y. Rezgui and A. Marks, Information security awareness
in higher education: An exploratory study, Computers and
Security, 27 (7-8), pp. 241-253, 2008.
[29] A. Marks, Exploring universities information systems
security awareness in a changing higher education
environment, Ph.D. Thesis, University of Salford, 2007.
[30] Computer Industry Almanac Inc, Wireless Internet Users,
http://www.c-i-a.com/pr032102.htm
[31] J. Edney and W. Arbaugh, Real 802.11 Security: Wi-Fi
Protected Access and 802.11i, Addison-Wisley, 2003.
[32] A. Kalbasi, O. Alomar, M. Hajipour, and F. Aloul,
Wireless security in UAE: A survey paper, in Proc. of
the IEEE GCC Conference, 2007.
[33] Middle East RFID market heats up, RFID Journal, 2009.
Available at: http://www.rfidjournal.com/article/view/4618
[34] RFID and Consumers: What European Consumers Think
About Radio Frequency Identification and the Implications
for Business, Cap Gemini, Paris, 2005. Available at:
http://www.us.capgemini.com/DownloadLibrary/requestfil
e.asp?ID=450.
[35] Z. Khattak, J. Manan, and S. Sulaiman, Analysis of Open
Environment Sign-in Schemes-Privacy Enhanced &
Trustworthy Approach, in Journal of Advances in
Information Technology, 2(2), pp. 109-121, May 2011.
[36] UAE-CERT. Available at: http://www.aecert.ae/.
[37] Saudi Arabia-CERT. Available at: http://www.cert.gov.sa/.
[38] Qatar-CERT. Available at: http://www.qcert.org/.
[39] International Multilateral Partnership Against Cyber
Terrorism (IMPACT). Available at: http://www.impact-
alliance.org/.










Fadi Aloul: Dr. Aloul is currently an
Associate Professor of Computer
Science & Engineering at the American
University of Sharjah, UAE. Dr. Aloul
holds Ph.D. and M.S. degrees in
Computer Science & Engineering from
the University of Michigan, Ann Arbor,
USA, and a B.S. degree in
Electrical Engineering summa cum
laude from Lawrence Technological
University, Michigan, USA. He is a Certified Information
Systems Security Professional (CISSP). He was a post-doc
research fellow at the University of Michigan during summer
2003 and a visiting researcher with the Advanced Technology
Group at Synopsys during summer 2005.
Dr. Aloul received a number of awards including the
prestigious Sheikh Khalifa (UAE's President) Award for Higher
Education, the Semiconductor Research Corporation Research
Fellowship, and the AUS CEN Excellence in Teaching Award.
He has 80+ publications (available at http://www.aloul.net) in
international journals and conferences, in addition to 1 US
patent. His current research interests are in the areas of design
automation, combinatorial optimization, and computer security.
He is a senior member of the Institute of Electrical and
Electronics Engineers (IEEE) and the Association for Computing
Machinery (ACM). He is the founder and chair of the UAE
IEEE Graduates of Last Decade (GOLD) group.

A New Approach on Cluster based Call
Scheduling for Mobile Networks

P. K. Guha Thakurta

Department of CSE, NIT, Durgapur-713209, India

Email: parag.nitdgp@gmail.com

Saikat Basu (1), Sayan Goswami (2), Subhansu Bandyopadhyay (3)
(1) Department of Computer Science, Louisiana State University, USA
(2) Sapient Global Markets, India
(3) Department of CSE, University of Calcutta, Kolkata-700009, India
Email: sbasu8@lsu.edu (1), sayan.nitd@gmail.com (2), subhansu@computer.org (3)


Abstract - An efficient cluster based approach to call
scheduling in mobile networks is proposed in this paper.
The dynamic threshold value () is formulated with
justification. The clusters are formed on the basis of defined
threshold value. The cluster head has been selected with
respect to different weight metrics for improving call
scheduling. The leader (cluster head) election procedure is
also described for link breakage and link emergence
respectively. After the formation of clusters, the subsequent
call scheduling algorithm has also been outlined in detail in
this work.

Index Terms - Mobile computing, clustering, call scheduling,
routing, dynamic thresholding.

I. INTRODUCTION
The rapid growth of cellular telephony needs efficient
resource allocation strategies. Hence, an effective call
selection procedure is also required at the same time.
During network congestion, Call Admission Control
(CAC) strategy is used to give permission to limited
number of users as well as deny service for rest of the
users [3]. Consequently, Quality of Service (QoS)
becomes an important factor for admitted users. It is
therefore necessary to consider two nearly contradictory
requirements, allocating resources as well as ensuring
Quality of Service (QoS), when all users are trying to
make requests at the same time.
Nodes communicate with each other using multi-hop
links in mobile cellular networks. Each node in the
network has call forwarding capability to other nodes. So,
various routing strategies [11] have been designed to
address the problem of finding the routing path. The
cluster based routing protocol proposed in [9] assumes
that the mobile nodes are location-aware. The procedure
for establishing the locations of the different neighbors
with respect to a specific node is beyond the
scope of this work. One of the most important parameters
to be considered for leader election in a cluster is the
congestion metric. However, a limitation in the proposed
routing protocol in [10] has been observed regarding this
issue. An efficient call scheduling procedure known as
Priority based Tree Generation for mobile networks
(PTGM) in [1] describes a tree based methodology built
on a unique path sequence. This tree based call scheduling
procedure has been mapped into a Cartesian coordinate
system, with the mobile terminal (MT) placed at the origin
(0,0). Hence, the other cells are represented as points in
the coordinate system following certain criteria [2]. This
coordinate based
routing protocol (CSTR) has been formulated with the
help of a tree structure and all possible routing paths
could also be enumerated in a simple manner. This
routing path analysis needs a more efficient methodology
to increase throughput and reduce network latency at the
same time.
In this paper, a new constraint based spatial clustering
algorithm has been designed for call scheduling. This
algorithm is based on a dynamic threshold value. This
threshold value has been formulated with respect to the
Euclidian distances between the cells of the coordinate
based system representation in [2]. On the basis of
threshold value, the different clusters of cells could be
formed and hence, a new clustering algorithm is proposed.
The positional identity (x,y) of each cell is broadcast to
all other cells in the network. Consequently, it causes
congestion known as Broadcast Storm. To prevent it, a
new multicast clustering algorithm is also proposed in
this paper. Once the clusters have been formed, the leader
election in a cluster is done with a weight based
algorithm to reduce the searching complexity to a large
extent. This weight is quantified with respect to different
performance metrics. Due to the emergence of new nodes
or disappearance of the existing ones in/from the network,
the cluster leader needs to be updated. Hence, two
algorithms named as link_emergence and link_breakage
have been proposed. The formation of clusters and call
scheduling through the clustered nodes have been
discussed in detail. The simulated result of performance
analysis for the system is also shown in terms of QoS of
the network.
The rest of this paper is organized as follows. In
section 2, a brief description of coordinate based routing
protocol (CSTR) is provided for completeness of the
work. The different issues of the proposed model are
described with various algorithms in section 3. The
experimental results are discussed in section 4. The
section 5 concludes the advantages of the proposed model
with future scope.
II. COORDINATE BASED SEARCH TREE GENERATION WITH
THE DETERMINATION OF ROUTING PATHS (CSTR) [2]
In this model, MT is denoted by (0,0) and each other
cell has a coordinate of the form (x,y). The cells covered
by the radius r (within transmission covering range [3]) of
MT are mapped as (x,y+1), (x+1,y+1) and (x+1,y) such that
x+1 ≤ r and y+1 ≤ r. For example, in Fig. 1(a), cell numbers
C13, C12 and C11 of r = 1 in [1] are mapped into (1,0),
(1,1) and (0,1) respectively. Therefore, the cellular
structure of mobile networks described in [1] could be
mapped into a coordinate based system as shown in Fig. 1(b).

Fig. 1(a): Cellular structure for Mobile Networks
for r=3

Fig. 1(b): Coordinate based representation of Fig.
1(a) (in [2])


III. THE PROPOSED MODEL
The model proposed in this paper is the collection of
several functional events in a sequential manner. These
are listed as follows:
(i) Determination of threshold value;
(ii) Formation of Constraint based Spatial Clustering
Algorithm and identification of broadcast storm;
(iii) Prevention of broadcast storm by using multicast
clustering algorithm;
(iv) Introduction of weight metric and subsequent
proposal of a leader election algorithm for the reduction
of searching complexity;
(v) The necessity of dynamic link handling using
weighted mobility-adaptive leader election algorithm;

A. Determination of Threshold Value
The threshold value selection for a cluster based
system reflects the number of clusters obtained. If the
threshold value is very small, then there would be a high
number of clusters and consequently each connected
component becomes quite small, which results in low
throughput. On the other hand, the reverse situation would
occur for a high threshold value, which leads to high
congestion and latency. The threshold value could be
defined as follows:

Threshold () =


where D denotes the distance between the two farthest
cells in radius r, N denotes the total number of cells in
radius r, and max and min denote the maximum and minimum
Euclidean distances between cells belonging to radii i and
j, respectively.
Considering Fig. 1(a) and Fig. 1(b), the threshold value
would be 0.56 for cells belonging to radius r = 2 and
1.618 for that of r = 1 and r = 2.

B. Constraint Based Spatial Clustering Algorithm
Once the threshold value is determined, the clustering
process [12] is initiated. Here, the metric used is the
positional identity (x,y) of the cell, where x denotes the
position of the cell along the X-axis and y denotes its
position along the Y-axis. Each cell advertises its
positional identity to adjacent cells in the form of
broadcast packets. It is performed in the network as
shown in Fig. 1(b). Now, the difference between the
positional identity values of the two cells is compared
with the predefined threshold values. If it is less than the
threshold value, an edge is constructed between the two
vertices (cells). Otherwise, no edge exists between these
two cells. Hence, a number of connected components
(clusters) are obtained, maintaining the tradeoff between
throughput and latency. This procedure is described by
the following algorithm.

Algorithm:
begin
  Set the threshold value
  for each cell i
    positional_identity_i ← (x_i, y_i)
    BROADCAST_PACKET(positional_identity_i)  /* sends a broadcast packet to all of its neighbors */
  end for
  for each cell j
    while RECEIVE_PACKET( ) is not NULL
      (x_i, y_i) ← positional_identity_i
      diff ← SQRT((x_i - x_j)^2 + (y_i - y_j)^2)
      if diff < threshold
        construct an edge e_ij between v_i and v_j
      else
        continue
      end if
    end while
  end for
end

The various functions used in the above algorithm are:
1) BROADCAST_PACKET( ) - Advertises the positional identity of a cell to all the adjacent cells in the network.
2) RECEIVE_PACKET( ) - A cell receives a packet from an adjacent cell using this routine.
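For illustration, the message exchange above can be condensed into a centralized Python sketch; the function name build_clusters and the dictionary-based cell representation are our own assumptions, not part of the proposed protocol. It constructs an edge between every pair of cells whose positional-identity distance is below the threshold and then extracts the connected components, i.e. the clusters:

import math
from collections import defaultdict

def build_clusters(cells, threshold):
    # cells: dict mapping cell id -> (x, y) positional identity
    adj = defaultdict(set)
    ids = list(cells)
    for i in range(len(ids)):
        for j in range(i + 1, len(ids)):
            (xi, yi), (xj, yj) = cells[ids[i]], cells[ids[j]]
            diff = math.hypot(xi - xj, yi - yj)   # SQRT((xi-xj)^2 + (yi-yj)^2)
            if diff < threshold:                  # construct the edge e_ij
                adj[ids[i]].add(ids[j])
                adj[ids[j]].add(ids[i])
    # connected components of the resulting graph are the clusters
    clusters, seen = [], set()
    for start in ids:
        if start in seen:
            continue
        comp, stack = set(), [start]
        while stack:
            v = stack.pop()
            if v in comp:
                continue
            comp.add(v)
            stack.extend(adj[v] - comp)
        seen |= comp
        clusters.append(comp)
    return clusters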

A major problem, known as the broadcast storm, has been identified in the above algorithm. Because of the broadcasting of packets, multiple copies of the same packet may reach a particular cell, which causes network congestion.

C. Prevention of Broadcast Storm
A multicast clustering algorithm has been proposed to prevent the broadcast storm. The key feature of this algorithm is that a cell sends multicast packets to all its neighbors except the one from which it received a multicast packet. This prevention procedure is described by the following multicast clustering algorithm.

Algorithm:
begin
  Set the threshold value
  for a particular cell i
    positional_identity_i ← (x_i, y_i)
    MULTICAST_PACKET(positional_identity_i)  /* sends a multicast packet to all the neighbors but one cell at a time */
  end for
  for each cell j in NEIGHBOR(i)
    while RECEIVE_PACKET( ) is not NULL
      (x_i, y_i) ← positional_identity_i
      diff ← SQRT((x_i - x_j)^2 + (y_i - y_j)^2)  /* SQRT returns the square root */
      if diff < threshold
        construct an edge e_ij between v_i and v_j
      else
        continue
      end if
      MULTICAST_PACKET(positional_identity_j)  /* repeat the process for each neighbor */
    end while
  end for
end
The functions used:
1) MULTICAST_PACKET( ) - Advertises the positional identity of a cell to all the adjacent cells in the network except the one from which it itself received the multicast packet.
2) RECEIVE_PACKET( ) - A cell receives a packet from an adjacent cell using this routine.

D. A New Approach towards Leader Election
Once the clusters have been formed, the next step is to
design an efficient leader election algorithm. Leader
election is a fundamental control problem in both wired
and wireless systems [4]. A leader is required in a group
communication system to handle the transmission of
messages to the members of the group. Here, it is
represented by a cluster-head of a cluster. The dynamic
cluster-head selection is required for the constant change
of the network topologies. The weights (W) are assigned
to each member of the cluster. The maximum weighted
member is to be considered as the cluster-head. These
weights are dependent on the following parameters.
1) Degree of a vertex (cell): This denotes the number of
cells (D) connected to that specific cell. If the degree of
the cell is higher, then it has a greater probability of being
elected as a cluster-head due to nearest neighborhood
principle [5].
2) Mean of the distances: This parameter represents the mean of the distances of a cell from its neighbors. If this mean distance is small, then the probability of the cell being elected as a cluster-head is higher.
3) Counter values associated with each cell (C): The
concept of counters was introduced in [6]. The counter
values are increased with respect to time and provide a
measure of the congestion level along a specific path.
Naturally, higher counter value of a particular cell means
high congestion. Subsequently, it has a lower probability
to be elected as the cluster-head.
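The exact weight expression is stated below only in qualitative terms; purely as an illustration, any combination that grows with the degree D and shrinks with the mean neighbor distance and the counter value C shows the intended behavior. A minimal sketch under that assumption (the specific ratio is ours, not the authors' formula):

def node_weight(degree, mean_distance, counter, eps=1e-9):
    # Illustrative only: W increases with the degree D and decreases with the
    # mean distance to the neighbors and with the congestion counter C.
    return degree / ((mean_distance + eps) * (counter + eps))

# The member with the maximum weight in a cluster is considered as the cluster-head.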
Once all the parameters have been quantified, the weight (W) can be defined as a function of these parameters: W increases with the degree D and decreases with both the mean distance to the neighbors and the counter value C.
The leader election algorithm runs recursively and
eventually terminates after electing a unique cluster-head
for each cluster. This algorithm is comprised of the
following procedures.
ELECT_HEAD ( ) -- Any cell which has no cluster-head begins a diffusing computation [7] and calls this procedure. This in turn sends the elect_head message to all its neighbors.
RECV_MSG ( ) -- When a cell receives an elect_head message through the RECV_MSG ( ) procedure, it sets the sending cell as its parent.
SEND_NODE ( ) -- It returns the cell (C_ij) which sent the last message to the receiving cell.
SEND_ACK ( ) -- If a cell receives an elect_head
message from a non-parent cell, then it returns an ACK
message to the cell. This message contains the
information about the current best valued cell and its
identity in the cluster. This is a recursive procedure which
continues until a unique leader is chosen for the cluster.
SEND_NAK ( ) -- This procedure is called when a
cell has received an elect_head message from another cell
that is designated as its parent.
BROADCAST_PACKET ( ) -- When the cell that initiated the diffusing computation eventually learns about the leader, it sends a broadcast packet to all other cells belonging to the cluster using this procedure.
On the basis of the defined procedures, an algorithm is
described for the cluster-head selection problem
considering the highest weighted node as follows.

Algorithm:
Leader_selection( )
begin
  for each cluster C_i do
    for each node A_i in C_i do
      while CLUSTER_HEAD(A_i) is NULL
        ELECT_HEAD(A_i)
      end while
    end for
    for any node A_j in C_i
      if RECV_MSG( ) == elect_head  /* elect_head message sent by the ELECT_HEAD( ) procedure */
        if SEND_NODE( ) == PARENT(A_j)  /* indicates the parent of A_j in the spanning tree */
          SEND_NAK(PARENT(A_j))
        else
          SEND_ACK(recv_node)  /* recv_node indicates the node from which the elect_head message was received by node A_j */
      else if RECV_ACK( ) == true for all nodes in CHILD(A_j)
        then SEND_ACK(PARENT(A_j))
      end if
      end if
    end for
    if RECV_ACK( ) == true for all nodes A_j in CHILD(A_i)  /* CHILD(A_i) returns the children of the node A_i in the spanning tree */
      then BROADCAST_PACKET(leader)  /* leader indicates the most valued node based on the weight */
    end if
  end for
end
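The net effect of the distributed exchange above is that every cluster agrees on its highest-weighted node. A minimal centralized sketch of that outcome (our simplification; the diffusing computation and the ACK/NAK message passing are deliberately omitted):

def leader_selection(clusters, weights):
    # clusters: list of iterables of node ids; weights: dict node id -> weight W
    # returns, for each cluster index, the highest-weighted node (ties broken by node id)
    return {idx: max(cluster, key=lambda n: (weights[n], n))
            for idx, cluster in enumerate(clusters)}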

E. Weighted Mobility Adaptive Leader Election
The leader election problem defined in the previous
section has been described from the static point of view.
The weights of each node are to be calculated during link
breakage (i.e., when one or more nodes are detached from
the cluster) or link emergence (i.e., when new nodes are
added to the cluster). Subsequently, the leader election
algorithm is updated, considering these two cases.
1) Link Breakage: If a link fails, then the nodes connected through that link are detached from the cluster.
In the proposed algorithm, two new message packets
namely ping and response are introduced. All nodes keep
on sending ping packets to adjacent nodes to check the
condition of their links. Whenever a node receives a ping
packet, it replies with a response packet. So, once a node
has sent a ping packet and has not received a response
from a particular node, this means the link has failed.
Consequently, the nodes participate in a leader election
mechanism to choose a new leader. This process is
described by the following algorithm.

Algorithm:
begin
  for each node A_i in C_i
    for every other node A_j in C_i
      SEND_PACKET(ping)
    end for
  end for
  for each node A_i in C_i
    if RECEIVE_PACKET( ) == response
      continue
    else
      Leader_selection( )  /* calls the main leader selection algorithm for determining the new leader */
    end if
  end for
end

The procedures used in the above algorithm are:
1) SEND_PACKET(ping) - Sends a ping packet to adjacent nodes to check the condition of the link.
2) RECEIVE_PACKET( ) - Receives a packet from adjacent nodes.
3) Leader_selection( ) - The main leader election algorithm proposed in Section III.D.
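A compact sketch of this link check, with the ping/response exchange abstracted into a callable (the function and parameter names below are ours, not part of the proposed protocol):

def check_links_and_reelect(cluster, is_link_alive, leader_selection):
    # is_link_alive(a, b) abstracts SEND_PACKET(ping)/RECEIVE_PACKET(response)
    # and returns True if node b answered node a's ping.
    for a in cluster:
        for b in cluster:
            if a != b and not is_link_alive(a, b):
                return leader_selection(cluster)   # a link failed: elect a new leader
    return None   # every link answered; the current leader is kept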

2) Link Emergence: When a new cell site appears in the network, a new base station is added to include the cell site in the cellular network, and a new link comes up that connects this base station to the network. In this situation, two cases may occur: either the new node (cell) has a weight less than that of the current cluster-head, or a greater one. If the weight of the newly arrived node is greater than that of the cluster-head, then a packet with its own value and the link information is broadcast to the other nodes in the cluster. Accordingly, the cluster-head is updated and, subsequently, all nodes of the cluster are informed of the change. Otherwise, the new node is simply registered in the cluster. This process is described by the following algorithm.

Algorithm:
begin
  for each node A_i added to C_i
    if weight_Ai < weight_clusterhead
      C_i ← C_i ∪ {A_i}
    else
      BROADCAST_PACKET(leader, A_i)  /* broadcasts a packet declaring itself as the leader and passing its control information to other nodes */
    end if
  end for
end

The procedure used:
BROADCAST_PACKET( ) - The node with the highest weight broadcasts a packet declaring itself as the leader and passing its control information to other nodes.
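The same rule can be sketched in a few lines (an illustration; the broadcast is abstracted into a callback, and the data layout is our assumption):

def handle_new_node(cluster, weights, cluster_head, new_node, broadcast_leader):
    # cluster: set of node ids; weights: dict node id -> weight W
    cluster.add(new_node)                       # register the new node in the cluster
    if weights[new_node] > weights[cluster_head]:
        broadcast_leader(new_node)              # declare the new node as the leader
        return new_node
    return cluster_head                         # otherwise the head is unchanged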

CALL SCHEDULING

Once the clusters are formed and maintained, they can be used to handle incoming call requests. When a mobile node sends a call request, the call is forwarded to the corresponding base station, which in turn forwards it to the cluster-head of its cluster. This cluster-head then forwards the call to the cluster-head of the cluster containing the destination. That cluster-head forwards the call to the base station of the call receiving node, which finally delivers it to the call receiving node.
So, the path of the call request can be represented as follows:

C_S → BS_S → Leader_S → Leader_R → BS_R → C_R

where C_S, BS_S and Leader_S denote the sender side and C_R, BS_R and Leader_R denote the receiver side.

CALL ROUTING BETWEEN LEADER_S AND LEADER_R

Once the Leader_S and Leader_R cells have been predetermined, the next step is to find the shortest path between Leader_S and Leader_R. This is done by using Kruskal's algorithm, which finds a minimum spanning tree for a connected weighted graph. So, to map our problem to Kruskal's algorithm, a connected weighted graph is constructed where the cluster-heads represent the vertices and the edges are weighted by the Euclidean distances between the nodes. For the sake of completeness, Kruskal's algorithm is presented next.

Algorithm:
Kruskal(G = <N, A>: graph; length: A → R+): set of edges
  Define an elementary cluster C(v) ← {v} for each vertex v.
  Initialize a priority queue Q to contain all edges in G, using the weights as keys.
  Define a forest T ← ∅   // T will ultimately contain the edges of the MST
  // n is the total number of vertices
  while T has fewer than n-1 edges do
    // edge (u,v) is the minimum weighted route from u to v
    (u,v) ← Q.removeMin()
    // prevent cycles in T: add (u,v) only if T does not already contain a path between u and v,
    // i.e. the vertices have not already been connected in the tree
    Let C(v) be the cluster containing v, and let C(u) be the cluster containing u.
    if C(v) ≠ C(u) then
      Add edge (v,u) to T.
      Merge C(v) and C(u) into one cluster, that is, union C(v) and C(u).
  return tree T

The above algorithm can be shown to run in O(E log E) time, or equivalently, O(E log V) time, all with simple data structures. These running times are equivalent because:
E is at most V^2, and log V^2 = 2 log V is O(log V).
If we ignore isolated vertices, which will each be their own component of the minimum spanning forest, V ≤ E + 1, so log V is O(log E).
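For reference, a compact executable version of the same algorithm, using a union-find structure in place of the explicit cluster sets, might look as follows (the (weight, u, v) edge format is our own choice):

def kruskal(num_vertices, edges):
    # edges: iterable of (weight, u, v) with vertices numbered 0 .. num_vertices-1
    parent = list(range(num_vertices))

    def find(v):                           # find the cluster C(v) containing v
        while parent[v] != v:
            parent[v] = parent[parent[v]]  # path compression
            v = parent[v]
        return v

    tree = []
    for weight, u, v in sorted(edges):     # priority queue ordered by edge weight
        ru, rv = find(u), find(v)
        if ru != rv:                       # C(u) != C(v): adding (u,v) creates no cycle
            parent[ru] = rv                # merge the two clusters
            tree.append((u, v, weight))
        if len(tree) == num_vertices - 1:
            break
    return tree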

Next the call scheduling algorithm is presented.


Algorithm:
Call_Schedule(Call t_i)
begin:
  forward call from calling node C_Si to base station BS_Si
  forward call from base station BS_Si to cluster-head H_Si
  G ← FORM_GRAPH(network N)
  min_span_tree T ← Kruskal(G)
  Edge set {E} ← SELECT_EDGE_SET(H_Si, H_Ri)
  forward call from cluster-head H_Si to cluster-head H_Ri
  forward call from cluster-head H_Ri to base station BS_Ri
  forward call from base station BS_Ri to receiver node C_Ri
end

The functions used are presented below:
1) FORM_GRAPH(network N) - Forms a connected weighted graph from network N with the mobile nodes as vertices and connecting links as edges having weights equal to the Euclidean distances between the nodes.
2) SELECT_EDGE_SET(H_I, H_J) - Selects the edge set joining the cells H_I and H_J in the minimum spanning tree T.
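SELECT_EDGE_SET amounts to extracting the unique path between the two cluster-heads in the spanning tree T. A small illustrative sketch, reusing the edge format of the Kruskal sketch above (the function name and traversal are our own):

from collections import defaultdict

def select_edge_set(tree_edges, src, dst):
    # tree_edges: (u, v, weight) edges of the minimum spanning tree T
    adj = defaultdict(list)
    for u, v, w in tree_edges:
        adj[u].append((v, w))
        adj[v].append((u, w))

    def dfs(node, visited, path):
        if node == dst:
            return list(path)
        visited.add(node)
        for nxt, w in adj[node]:
            if nxt not in visited:
                path.append((node, nxt, w))
                found = dfs(nxt, visited, path)
                if found is not None:
                    return found
                path.pop()
        return None

    return dfs(src, set(), [])   # list of edges on the path from src to dst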

The time required for cluster formation is already
evaluated in [13, 14]. So, the methodology proposed
here reduces the computation time to a large extent
with respect to previous approaches.

IV. EXPERIMENTAL RESULTS
The proposed model is simulated with Matlab 7.5.0. Here, the results show the effect of clustering on call scheduling. Fig. 2 shows the congestion in the network with respect to various cells, with and without clustering. The weight metric used is the counter values that were introduced in [6] and discussed in Section III.D. The counter values are an effective measure of the network congestion with respect to the traffic requests being handled by a given base station (cell) at a given point of time. The x-axis gives the ids of the various cells in the network; thus, e.g., 250 here refers to cell C250. Random calls were initiated with a pseudo-random number generator. The algorithm was executed for 100 iterations and the average values of the weight metric (C) were plotted in the graph shown in Fig. 2.


Fig. 2. Congestion in the network for various cells

Quality of Service (QoS)

The Quality of Service (QoS) is inversely proportional to the congestion in the network. So, the QoS metric (Q) increases when the congestion metric (C) decreases and vice versa. This can be mathematically represented as:

QoS(Q) ∝ 1/C

The QoS metric Q is then represented parametrically in terms of the logarithm of the congestion metric, offset by a constant equal to a predefined value of 0.5 plus the magnitude of the minimum value of the logarithmic term; this offset is chosen in such a way that the parameter Q assumes positive values only.
Hence we get the following graph for QoS versus the cells.


Fig. 3. Quality of Service (QoS) for various cells

V. CONCLUSION

The procedure for a cluster based call scheduling methodology has been described in this work. The searching cost is reduced by using the cluster-head in routing. At the same time, the leader (cluster-head) election algorithm has been enhanced with the inclusion of the link_breakage and link_emergence procedures, which increases the flexibility of dynamic call scheduling. The use of Kruskal's algorithm for finding the shortest path and the subsequent call scheduling algorithm proposed in this paper are reactive call scheduling strategies that, though efficient in terms of storage costs, require higher computational resources because the shortest paths have to be found each time a call is initiated. This problem can be solved by maintaining routing tables. Further study on extending this model for the construction of routing tables is in progress.
REFERENCES
[1] P.K.Guha Thakurta and Subhansu Bandyopadhyay, A
New Dynamic Pricing Scheme with Priority based Tree
Generation and Scheduling for Mobile Networks, IEEE
Advanced Computing Conference, March 2009.
[2] P.K.Guha Thakurta, Rajarshi Poddar and Subhansu
Bandyopadhyay, A New Approach on Co-ordinate based
Routing Protocol for Mobile Networks, IEEE Advanced
Computing Conference, February 2010 .
[3] Wen-Hwa Liao, Jang-Ping Sheu and Yu-Chee Tseng,
GRID: A Fully Location-Aware Routing Protocol for
Mobile Ad Hoc Networks, Journal on Telecommunication
Systems, Springer Netherlands, Vol 18, No. 1-3,
September 2001.
[4] Sudarshan Vasudevan, Jim Kurose and Don Towsley,
Design and Analysis of a Leader Election Algorithm for
Mobile Ad Hoc Networks, IEEE ICNP, 2004, page(s):
350-360.
[5] C. Bohm and F. Krebs, The k-nearest neighbour join:
Turbo charging the kdd process, Journal on Knowledge
and Information Systems, Vol. 6, No. 6, 2004.
[6] P.K.Guha Thakurta, Subhansu Bandyopadhyay, S. Basu
and S. Goswami, A new approach on Congestion Control
with Delay Reduction in Mobile Networks, Second
International Conference on Advances in Recent
Technologies in Communication and Computing
(ARTCom), October, 2010.
[7] E. W. Dijkstra and C. S. Scholten, Termination detection for diffusing computations, Information Processing Letters, Vol. 11, No. 1, pp. 1-4, August 1980.
[8] Joseph B. Kruskal, On the Shortest Spanning Subtree of a Graph and the Traveling Salesman Problem, In Proceedings of the American Mathematical Society, Vol. 7, No. 1 (Feb. 1956), pp. 48-50.
[9] Liliana M. Arboleda C. and Nidal Nasser, Cluster-based
routing protocol for mobile sensor networks. In
Proceedings of the 3rd international conference on Quality
of service in heterogeneous wired/wireless networks
(QShine '06). ACM 2006, New York, NY, USA, Article 24.
[10] M. Rezaee and M. Yaghmaee, Cluster based Routing
Protocol for Mobile Ad Hoc Networks, InfoComp, Vol 8,
No 1, March 2009, pp 30-36.
[11] Rana E. Ahmed, A Fault-Tolerant Routing Protocol for
Mobile Ad Hoc Networks, Journal of Advances in
Information Technology, Volume 2, Number 2, May 2011,
Page (s): 128 132.
[12] Subhash K. Shinde, Uday V. Kulkarni, Hybrid
Personalized Recommender System Using Fast K-medoids
Clustering Algorithm, Journal of Advances in Information
Technology, Volume 2, Number 3, August 2011, Page(s):
152 158.
[13] J. Usha , Ajay Kumar and A.D. Shaligram, "Clustering
Approach for Congestion in Mobile Networks", IJCSNS
International Journal of Computer Science and Network
Security, Volume 10 Number 2, February 2010, Page(s)
113-118.
[14] Mary Inaba, Naoki Katoh, Hiroshi Imai, "Applications of
weighted Voronoi diagrams and randomization to
variance-based k-clustering", In proceedings of the 10th
annual Symposium on Computational Geometry, New
York, USA, 1994, ACM Digital Library, ISBN:0-89791-
648-4.

P. K. Guha Thakurta: He received his B.Tech and M.Tech degrees in computer science & engineering from Kalyani University and Calcutta University in 2002 and 2004, respectively. He is currently working as an Assistant Professor in the CSE department of the National Institute of Technology, Durgapur, India. His research area is Mobile Computing.

Saikat Basu is a PhD student at the Computer Science Department at Louisiana State University, USA. He received his B.Tech degree in Computer Science from the National Institute of Technology Durgapur, India, in 2011. His research interests lie in the fields of Mobile Computing, Security and Video Surveillance.

Sayan Goswami is a junior associate at Sapient Global Markets. He received his B.Tech degree in Computer Science from the National Institute of Technology Durgapur, India, in 2011. His research interests lie in the fields of Mobile Computing and Embedded Systems.

Subhansu Bandyopadhyay is an eminent professor in the dept.
of Computer Science & Engineering, University of Calcutta.



Energy Efficient Cell Survival and Cell Merging
Approaches for Auto-Configurable Wireless
Sensor Networks

M R Uddin (1), M A Matin (2), M K H Foraji (1), and B Hossain (1)
(1) Department of EECS, North South University, Dhaka, Bangladesh
(2) Department of EEE, Faculty of Engineering, Institut Teknologi Brunei, Brunei Darussalam
Email: si.ijcnis@gmail.com



Abstract - In practice, sensor nodes (SN) are battery-powered devices, and the varying rate of power depletion in these devices can hamper the network's efficient operation, as the node lifetimes directly affect the network lifetime of a wireless sensor network (WSN). In practical systems, certain considerations are normally taken at each layer to prolong the network lifetime. This paper focuses on the topology control process. The authors propose a new auto-configurable algorithm to re-organize the network topology efficiently. The proposed algorithm merges both the cell merging and cell manager selection processes together and is applied to a clustered homogeneous cellular architecture. The simulation results show that the proposed algorithm consumes less power than the existing auto-configurable algorithm and consequently prolongs the network lifetime.

Index Terms - wireless sensor network, auto-configuring or self-configuring algorithms, cell merging, clustering
I. INTRODUCTION
A wireless sensor network can be divided into three major segments: sensor components, microelectronic systems, and a wireless network. The devices used in a wireless sensor network contain four fundamental units: a power supply unit, a central processing unit, a data communication unit, and a sensor component, which together form an embedded system. Nowadays, the advancement of microelectronics makes low-power, small-sized, inexpensive sensors possible. Sensors, which build the backbone of wireless sensor networks, operate on a predetermined set of instructions; the problem becomes very serious when their batteries are not rechargeable, since in practice a significant amount of power is spent on processing and on the transmission and reception of information. The varying rate of energy depletion in the nodes can seriously hamper the network's efficient operation and therefore its lifetime [1]. For this reason, WSNs have to be self-configuring or self-healing in case of any type of failure, and these processes should be optimized so that failures can be recovered from while consuming a small amount of energy. Therefore, energy should be maintained and used in a planned way by the sensor network through efficient survival algorithms. As a result, several algorithms and approaches have come out with energy efficient solutions for WSNs. The purpose of this paper is to demonstrate energy efficient algorithms for self-configuring WSNs which consume less energy than other existing algorithms and prolong the network lifetime. The concentration of this paper is on the cell manager selection process and the cell merging technique. For simplicity, the proposed algorithm is applied to homogeneous WSNs.
This paper is organized as follows: Section 2 provides
a detailed description of the existing algorithms and
approaches. The auto-configurable algorithm and its
extended version have been described in section 3.
Section 4 explains our proposed algorithm. Section 5
presents simulation results and performance evaluations.
Finally, section 6 concludes the paper.
II. RELATED WORKS
In a wireless sensor network, energy efficient communication is central to the survival of the network. As a result, researchers mostly focus on energy efficient communication, energy management, and extending the network lifetime. The Low-Energy
Adaptive Clustering Hierarchy (LEACH) has been
proposed in [2] that utilizes a randomized periodical
rotation of cluster heads for balancing the energy load
among the sensors. This LEACH is further modified in [3]
named LEACH-C (Centralized) which uses a centralized
controller for selecting cluster heads. The main
shortcomings of these algorithms are the selection of non-
automatic cluster head and the requirement that the
position of all sensors must be known. Another extended version of LEACH's stochastic algorithm is described in [4] with a deterministic cluster-head selection technique. Though this algorithm increases the network lifetime compared with the original LEACH protocol, it does not solve the previous shortcomings. In [5], the optimal cluster size and the optimal assignment of sensors to cluster heads have been determined using the Ad hoc Network Design Algorithm (ANDA). This maximizes the network lifetime, but a priori knowledge of the number of cluster heads, the number of sensors in the network, and the location of all sensors is required.

Figure 1. Classification of power savings approaches in WSN [24]

The Weighted Clustering algorithm (WCA) [6] considers some parameters while choosing clusters, such as the
number of neighbours, transmission power, mobility, and
battery usage. The number of sensors is limited in this
algorithm for a cluster so that cluster heads can handle
the load without degrading the network performance.
These clustering algorithms need synchronous clocking for exchanging information among sensor nodes, which limits them to smaller networks [9]. M. Bhardwaj and A. P. Chandrakasan in [10] derived upper bounds on the lifetime of sensor networks, while J. Zhu and S. Papavassiliou in [11] presented an analytical model to estimate and evaluate the network lifetime. The work in [12] provides a globally optimal solution, through a graph theoretic approach, to the problem of maximizing a static network lifetime. In [7, 8], the authors used a
decentralized algorithm for clustering an ad hoc sensor
network. Each sensor monitors communication among its
neighboring clusters and makes a decision (based on the
number of neighbors and a randomized timer) either to
join a nearby cluster, or else form a new cluster with
itself as cluster head. In [13-23], the authors proposed
several optimized cluster head or cell manager selection
algorithms and approaches. However, these algorithms
and approaches used some complex methods for
communicating and monitoring neighboring clusters and
can be classified as in Fig. 1. Following these existing
algorithms and approaches, we come up with a modified
algorithm in case of cell manager failure for enhancing
the cell survival time and an energy efficient cell or
cluster merging algorithm for self-configuring WSNs.
III. EXTENDED AUTO-CONFIGURING ALGORITHMS
First, the existing auto-configuring algorithm has been extended using a new cell manager selection process; later, a new algorithm is developed in the next section by combining the cell merging and cell manager selection processes together. When the residual energy of the cell manager is less than or equal to 20%, the cell manager will choose the node with the next highest energy (greater than or equal to 50% of its residual battery energy) and assign it as the new cell manager, using the energy list which is maintained by the cell manager itself and updated periodically from the messages sent by the member nodes. This is an energy efficient algorithm for self-organizing WSNs in case of cell manager failure (Fig. 2).
If no node is found whose residual battery energy is greater than or equal to 50% and which can therefore take over the responsibility of the cell manager, the cell merging activity will take place. The cell merging or cluster merging process is a high energy consuming technique for the survival of the clusters. To understand the cell merging process, a scenario is shown in Fig. 3. There are 9 cells or clusters, and the header of cell 5 is no longer available to perform its regular operations. So its members need to join a new cell header.
The cell merging process is carried out through the following steps:
- The neighboring cell managers broadcast a Join_in message to the sensor nodes in the event cell.
- To notify the available cell managers, the Join_in message of the neighboring cells is delivered to all of the sensor nodes in the event cell.
- Sensor nodes in the event cell select the appropriate neighboring cell to join by checking the minimum hop count and the residual energy of that neighboring cell manager. The nodes then reply with an acknowledge message to the selected cell manager once they have accepted it.
Now let us see how these steps are applied to our example. The following tasks take place:
- Each node in the event cell (e.g. cell 5) becomes aware of the unavailability of the cell manager. The nodes wait for the Join_in messages from their neighboring cells.
- The neighboring cell managers start to broadcast Join_in messages and wait for the acknowledge messages from the nodes of the event cell.
- After receiving a Join_in message, a designated node first checks whether it belongs to the event cell. If not, it modifies the hop count of the packet and rebroadcasts it.


Figure 2. Comparison between the autonomic and existing algorithms
of self-configuring WSN
- A node in the event cell (e.g. cell 5) records the following information about a cell manager upon receiving its Join_in message: cell_id, node_id, residual battery energy of the source cell manager, and the number of communication hops.
- When the node again receives a Join_in message from the same cell manager, it drops the packet to reduce redundant message transmissions in the network.
- If the node in the event cell receives Join_in messages from different cell managers (i.e. from cell 1, cell 2, cell 3, cell 4, cell 6, cell 7, cell 8 and cell 9), it selects the right cell manager by considering the maximum residual energy and the minimum hop count towards the source cell managers. Thus it selects cell 2, which has the fewest hops and sufficient residual energy to merge with, as indicated by the blue arrows in Fig. 3.
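The node-side selection rule can be summarized in one line of Python; the ordering (fewest hops first, then highest residual energy) follows the description above, while the record layout is our own assumption:

def select_cell_manager(offers):
    # offers: one dict per received Join_in message, with keys
    # 'cell_id', 'node_id', 'residual_energy' and 'hops'
    best = min(offers, key=lambda o: (o['hops'], -o['residual_energy']))
    return best['cell_id']   # e.g. cell 2 in the scenario of Fig. 3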
IV. PROPOSED ALGORITHM
In this paper, a new algorithm is proposed for enhancing the cell survival time and for an energy efficient cell merging process. In the extended auto-configuring algorithm, when the residual energy of the cell manager is low (less than or equal to 20% of its residual battery energy), the cell manager selects the node with the next highest energy and appoints it as the new cell manager, using the energy list which is periodically updated from the messages sent by the member nodes. A member node should have greater than or equal to 50% of its residual battery energy to be appointed as cell manager. If there is no node (with residual battery energy greater than or equal to 50%) to take over the cell manager's responsibility in that cell, the cell merging activity will take place, as shown in Fig. 3. In our proposed algorithm, we consider an additional condition for selecting a new cell manager in case of current cell manager failure, which is explained below and sketched in code after the list:
1) Search for a member node having energy greater than or equal to 50%.
2) If no such node is found, search for a member node having energy greater than or equal to 25%.
3) If no such node is found, initiate the (modified) cell merging procedure.
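A minimal sketch of this selection policy (the dictionary representation of member energies, as fractions of the full battery, is an assumption made for illustration):

def choose_new_cell_manager(members):
    # members: dict node id -> residual battery energy as a fraction (0..1)
    for threshold in (0.50, 0.25):
        candidates = {n: e for n, e in members.items() if e >= threshold}
        if candidates:
            return max(candidates, key=candidates.get)   # new cell manager
    return None   # no suitable node: initiate the (modified) cell merging procedure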

Figure 3. Clustering or cell merging process
The aim of considering the second condition is to extend the operational lifetime of a cell. Other recovery procedures, like physical replacement and maintenance, can be carried out during this extended survival time. For the third condition, we propose a modified cell merging process which consumes less energy. Cell merging is an energy consuming process because it involves all nodes of the event cell and all cell managers in the group in exchanging messages. Cell managers frequently send update messages to their group managers consisting of the residual energy levels of the cell members and the number of available nodes in the cell. After receiving and aggregating the update messages from the cell managers, the group manager gets an overview of its group status and constructs a topology map. Thus, the group manager is capable of taking proper actions (e.g. altering the cell formation) according to the events or changes in the group. In case of failure of a cell manager, the border nodes of virtual cells are capable of merging together to produce a larger cell. In the proposed algorithm, we propose some modifications to the cell merging technique which consume less energy. The proposed steps for cell merging are as follows (a sketch of the group-manager-side selection follows the list):
1) The cell manager will inform its group manager that there is no node to take over the cell manager's responsibility and appoint a border node to communicate with the neighboring cell manager.
2) The group manager will check the energy list, which contains the energy status of the cell managers of that group. Every cell has a unique id number. The group manager will search the energy list by cell id to find a cell which has the minimum hop count, i.e. is adjacent to the event cell, and whose cell manager has a higher and sufficient residual energy. Then the group manager will instruct that cell manager to broadcast a Join_in message to the event cell.
3) The appointed border node will start communicating with the selected cell as a merged cell.
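A sketch of the group-manager-side selection in step 2 (the energy-list layout and the 50% sufficiency threshold are illustrative assumptions; the text above only requires a higher and sufficient residual energy):

def select_merge_target(energy_list, min_energy=0.50):
    # energy_list: entries {'cell_id', 'hops_to_event_cell', 'manager_energy'}
    # kept by the group manager for the cell managers of its group
    eligible = [c for c in energy_list if c['manager_energy'] >= min_energy]
    if not eligible:
        return None
    best = min(eligible,
               key=lambda c: (c['hops_to_event_cell'], -c['manager_energy']))
    return best['cell_id']   # this cell manager is instructed to broadcast Join_in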

The detailed flow charts of the two proposed algorithms are given in Fig. 4 and 5.

V. PERFORMANCE EVALUATION

The performance of the proposed algorithm is evaluated using the network simulator NS2 [25, 26]; the results are given in Fig. 6 and 7. The number of sensor nodes is varied from 10 to 300. Each sensor is assumed to have an initial energy of 2000 mJ.
In the existing auto-configuring algorithm, when a cell manager's residual battery energy becomes low, it looks for a node whose energy is greater than or equal to 50% of its residual battery energy. If such a node exists, it will be designated as the new cell manager. If no such node is found, the traditional cell merging process will take place, which consumes high energy.



Figure 4. Flow chart for cell manager selection

In the proposed algorithm, when a cell manager's residual battery energy becomes low, it looks for a node whose energy is greater than or equal to 50% of its residual battery energy. If such a node exists, it will be designated as the new cell manager. If no such node is found, the cell manager looks for a node whose energy is greater than or equal to 25% of its residual battery energy. If such a node exists, it will be designated as the new cell manager. If no such node is found, the cell merging process will take place in our proposed scheme.




Figure 5. Flow chart for cell merging procedure




Figure 6. Example of cell merging using proposed scheme. Group
manager selects a cell from blue plane to merge with the non-functional
cell (red) in the green plane.





The performance graph is generated based on the energy consumption of the existing and the proposed algorithms during the cell merging process.



Figure 7. Comparison of existing and the proposed algorithm of self-
configuring WSN for cell merging process
VI. CONCLUSION
The main purpose of cell merging is to increase the
number of sensor nodes in a cell to distribute the
workload of the cell manager with neighboring cells to
form a larger cell. When the remaining residual battery
energy of a cell manager or cell head is below the preset
threshold, the cell merging process is invoked to combine
bordering or neighboring cells. As a result, the network
lifetime is extended. In this paper, energy efficient modified algorithms for wireless sensor networks have been proposed, based on the existing auto-configuring algorithm. The proposed algorithm can extend the cell
survival time by selecting a suitable sensor node to be a
cell manager and merge cells to reorganize the topology
efficiently.
REFERENCES
[1] B Paul, M A Matin, Optimal geometrical sink location
estimation for two-tiered wireless sensor networks IET
Wirel. Sens. Syst., 2011, vol. 1, no.2, pp.74-84.
[2] W. R. Heinzelman, A. Chandrakasan and H. Balakrishnan ,
Energy efficient communication protocol for wireless
microsensor networks, In Proceedings of IEEE HICSS,
January 2000.
[3] W. R. Heinzelman, A. Chandrakasan, H. Balakrishnan ,
An application specific protocol architecture for wireless
microsensor network, IEEE Transaction on Wireless
Communications, 1(4), 2002.
[4] M.J. Handy, M. Haase, D. Timmermann, Low energy
adaptive clustering hierarchy with deterministic cluster-
head selection, 4th International Workshop on Mobile and
Wireless Communications Network, pp. 9-11, September
2002.
[5] C.F. Chiasserini, I. Chlamtac, P. Monti and A. Nucci,
Energy efficient design of wireless ad hoc networks, In
Proceedings of European Wireless, February 2002.
[6] M. Chatterjee, S. K. Das, and D. Turgut, WCA: A
weighted clustering algorithm for mobile ad hoc
networks, Journal of Cluster Computing, Special issue on
Mobile Ad hoc Networking, No. 5, pp. 193-204, 2002.
[7] C.Y. Wen and W. A. Sethares, Automatic decentralized
clustering for wireless sensor networks, EURASIP
Journal on Wireless Communications and Networking,
Volume 2005, Issue 5, pp. 686-697.
[8] C. Y. Wen and W. A. Sethares, Adaptive Decentralized
Re-Clustering for Wireless Sensor Networks, in Proc. of
IEEE International Conference on Systems, Man, and
Cybernetics, Taipei, Taiwan, October 2006.
[9] J. Lundelius and N. Lynch. An upper and lower bound for
clock synchronization, Information and Control, Vol. 62
1984.
[10] M. Bhardwaj and A. P. Chandrakasan, , Bounding the
lifetime of sensor networks via optimal role assignments,
IEEE INFOCOM 2002, vol. 3, 2002, pp. 1587-1596.
[11] J. Zhu and S. Papavassiliou, On the energy-efficient
organization and the lifetime of multi-hop sensor
networks, IEEE Communications Letters, vol. 7, no. 11,
November 2003, pp. 537-539.
[12] I. Kang and R. Poovendran, Maximizing static network
lifetime of wireless broadcast ad hoc networks, IEEE
International Conference on Communications (ICC) 2003,
Anchorage, Alaska.
[13] A. Bharathidasan and V. Anand, Sensor networks: An
overview, Technical Report CA 95616.
[14] I. F. Akyildiz, W. Su, Y. Sankarasubramaniam, and E.
Cayirci, (2002), A survey on sensor networks, IEEE
Communication Magazine, pp. 102-114.
[15] F. Lewis, Wireless Sensor Networks, Smart
Environments: Technology, Protocols and Applications:
Wiley-InterScience.
[16] W. L. Lee, A. Datta, and R. Cardell-Oliver, Network
Management in Wireless Sensor Networks, Handbook of
Mobile Ad Hoc and Pervasive Communications: American
Scientific Publishers.
[17] L. Subramanian and R. H. Katz, An architecture for
building self-configurable systems, Proceeding of the
IEEE/ACM Workshop on Mobile Ad Hoc Networking and
Computing (MobiHOC 2000), pp. 63-73.
[18] G. Gupta and M. Younis, Load-Balanced Clustering in
Wireless Sensor Networks, Proceedings of International
Conference on Communication (ICC 2003) Anchorage,
AK.
[19] J. L. Chen, H. F. Lu, and C. A. Lee, Autonomic self-
organization architecture for wireless sensor
communications, International Journal of Network
Management, vol. 17, pp. 197-208.
[20] G. Venkataraman, S. Emmanuel, and S.Thambipillai,
Energy-efficient cluster-based scheme for failure
management in sensor networks, IET Communications,
vol. 2, pp. 528-537.
[21] M. Asim, H. Mokhtar and M. Merabti, A self-managing
fault management mechanism for wireless sensor
networks, International Journal of Wireless & Mobile
Networks (IJWMN), Vol.2, No.4, pp. 184-197.
[22] M.Asim, M. Yu, H.Mokhtar, and M.Merabti, A Self-
Configurable Architecture for Wireless Sensor Networks,
2010 Developments in E-systems Engineering, pp.76-81.
[23] M. Asim, H. Mokhtar, and M. Merabti, A cellular
approach to fault detection and recovery in wireless sensor
networks, The Third International Conference on Sensor
Technologies and Applications, SENSORCOMM 2009
Greece, pp. 352-357.
[24] G. Anastasi, M. Conti, M. Di Francesco and A Passarella,
Energy Conservation in Wireless Sensor Networks: a
Survey, Journal Ad Hoc Networks, vol. 7 issue 3, May,
2009
[25] The Network Simulator NS -2, http://isi.edu/nsnam/ns/
[26] XGraph, http://www.isi.edu/nsnam/xgraph/


Call for Papers and Special Issues

Aims and Scope
JAIT is intended to reflect new directions of research and report latest advances. It is a platform for rapid dissemination of high quality research /
application / work-in-progress articles on IT solutions for managing challenges and problems within the highlighted scope. JAIT encourages a
multidisciplinary approach towards solving problems by harnessing the power of IT in the following areas:
Healthcare and Biomedicine - advances in healthcare and biomedicine e.g. for fighting impending dangerous diseases - using IT to model
transmission patterns and effective management of patients records; expert systems to help diagnosis, etc.
Environmental Management - climate change management, environmental impacts of events such as rapid urbanization and mass migration,
air and water pollution (e.g. flow patterns of water or airborne pollutants), deforestation (e.g. processing and management of satellite imagery),
depletion of natural resources, exploration of resources (e.g. using geographic information system analysis).
Popularization of Ubiquitous Computing - foraging for computing / communication resources on the move (e.g. vehicular technology), smart
/ aware environments, security and privacy in these contexts; human-centric computing; possible legal and social implications.
Commercial, Industrial and Governmental Applications - how to use knowledge discovery to help improve productivity, resource
management, day-to-day operations, decision support, deployment of human expertise, etc. Best practices in e-commerce, e-
government, IT in construction/large project management, IT in agriculture (to improve crop yields and supply chain management), IT in
business administration and enterprise computing, etc. with potential for cross-fertilization.
Social and Demographic Changes - provide IT solutions that can help policy makers plan and manage issues such as rapid urbanization, mass
internal migration (from rural to urban environments), graying populations, etc.
IT in Education and Entertainment - complete end-to-end IT solutions for students of different abilities to learn better; best practices in e-
learning; personalized tutoring systems. IT solutions for storage, indexing, retrieval and distribution of multimedia data for the film and music
industry; virtual / augmented reality for entertainment purposes; restoration and management of old film/music archives.
Law and Order - using IT to coordinate different law enforcement agencies efforts so as to give them an edge over criminals and terrorists;
effective and secure sharing of intelligence across national and international agencies; using IT to combat corrupt practices and commercial
crimes such as frauds, rogue/unauthorized trading activities and accounting irregularities; traffic flow management and crowd control.
The main focus of the journal is on technical aspects (e.g. data mining, parallel computing, artificial intelligence, image processing (e.g. satellite
imagery), video sequence analysis (e.g. surveillance video), predictive models, etc.), although a small element of social implications/issues could be
allowed to put the technical aspects into perspective. In particular, we encourage a multidisciplinary / convergent approach based on the following
broadly based branches of computer science for the application areas highlighted above:

Special Issue Guidelines
Special issues feature specifically aimed and targeted topics of interest contributed by authors responding to a particular Call for Papers or by
invitation, edited by guest editor(s). We encourage you to submit proposals for creating special issues in areas that are of interest to the Journal.
Preference will be given to proposals that cover some unique aspect of the technology and ones that include subjects that are timely and useful to the
readers of the Journal. A Special Issue is typically made of 10 to 15 papers, with each paper 8 to 12 pages of length.
The following information should be included as part of the proposal:
Proposed title for the Special Issue
Description of the topic area to be focused upon and justification
Review process for the selection and rejection of papers.
Name, contact, position, affiliation, and biography of the Guest Editor(s)
List of potential reviewers
Potential authors to the issue
Tentative time-table for the call for papers and reviews

If a proposal is accepted, the guest editor will be responsible for:
Preparing the Call for Papers to be included on the Journals Web site.
Distribution of the Call for Papers broadly to various mailing lists and sites.
Getting submissions, arranging review process, making decisions, and carrying out all correspondence with the authors. Authors should be
informed the Instructions for Authors.
Providing us the completed and approved final versions of the papers formatted in the Journals style, together with all authors contact
information.
Writing a one- or two-page introductory editorial to be published in the Special Issue.

Special Issue for a Conference/Workshop
A special issue for a Conference/Workshop is usually released in association with the committee members of the Conference/Workshop like general
chairs and/or program chairs who are appointed as the Guest Editors of the Special Issue. Special Issue for a Conference/Workshop is typically made of
10 to 15 papers, with each paper 8 to 12 pages of length.
Guest Editors are involved in the following steps in guest-editing a Special Issue based on a Conference/Workshop:
Selecting a Title for the Special Issue, e.g. Special Issue: Selected Best Papers of XYZ Conference.
Sending us a formal Letter of Intent for the Special Issue.
Creating a Call for Papers for the Special Issue, posting it on the conference web site, and publicizing it to the conference attendees.
Information about the Journal and Academy Publisher can be included in the Call for Papers.
Establishing criteria for paper selection/rejections. The papers can be nominated based on multiple criteria, e.g. rank in review process plus the
evaluation from the Session Chairs and the feedback from the Conference attendees.
Selecting and inviting submissions, arranging review process, making decisions, and carrying out all correspondence with the authors. Authors
should be informed the Author Instructions. Usually, the Proceedings manuscripts should be expanded and enhanced.
Providing us the completed and approved final versions of the papers formatted in the Journals style, together with all authors contact
information.
Writing a one- or two-page introductory editorial to be published in the Special Issue.

More information is available on the web site at http://www.academypublisher.com/jait/.
