
Int. J. Inf. Secur. (2009) 8:25–35
DOI 10.1007/s10207-008-0061-2
REGULAR CONTRIBUTION
Large-scale network intrusion detection based on distributed
learning algorithm
Daxin Tian · Yanheng Liu · Yang Xiang
Published online: 14 November 2008
© Springer-Verlag 2008
Abstract As network traffic bandwidth is increasing at an exponential rate, it is impossible to keep up with the speed of networks by just increasing the speed of processors. Besides, increasingly complex intrusion detection methods only add further to the pressure on network intrusion detection system (NIDS) platforms, so the continuously increasing speed and throughput of networks pose new challenges to NIDS. To make NIDS usable in Gigabit Ethernet, the ideal policy is to use a load balancer to split the traffic data and forward it to different detection sensors, which can analyze the split data in parallel. In order to make each slice contain all the evidence necessary to detect a specific attack, the load balancer design must be complicated, and it becomes a new bottleneck of NIDS. To simplify the load balancer, this paper puts forward a distributed neural network learning algorithm (DNNL). Using DNNL, a large data set can be split randomly and each slice of data is presented to an independent neural network; these networks can be trained distributively, each one in parallel. Completeness analysis shows that DNNL's learning algorithm is equivalent to training by one neural network which uses the technique of regularization. Experiments to check the completeness and efficiency of DNNL are performed on the KDD99 data set, which is a standard intrusion detection benchmark. Compared with other approaches on the same benchmark, DNNL achieves a high detection rate and a low false alarm rate.
D. Tian · Y. Liu (✉)
College of Computer Science and Technology,
Jilin University, 130012 Changchun, China
e-mail: lyh_lb_lk@yahoo.com.cn

Y. Xiang
School of Management and Information Systems,
Central Queensland University,
Rockhampton, QLD 4702, Australia
Keywords Intrusion detection system · Distributed learning · Neural network · Network behavior
1 Introduction
With the widespread use of networked computers for critical systems, computer security is attracting increasing attention and intrusions have become a significant threat in recent years. As a second line of defense for computer and network systems, intrusion detection systems (IDS) have been deployed more and more widely along with network security techniques such as firewalls. Intrusion detection techniques can be classified into two categories: misuse detection and anomaly detection. Misuse detection looks for signatures of known attacks, and any matched activity is considered an attack; anomaly detection models a user's behaviors, and any significant deviation from the normal behaviors is considered the result of an attack. The main shortcoming of IDS is false alarms, which are caused by misinterpreting normal packets as an attack or misclassifying an intrusion as normal behavior. This problem is more severe under fast Ethernet, with the result that network IDS (NIDS) cannot be adapted to protect the backbone network. Since network traffic bandwidth is increasing at an exponential rate, it is impossible to keep up with the speed of networks by just increasing the speed of processors.
To resolve the problem and make NIDS usable in Gigabit Ethernet, one approach is to improve the detection speed by moving the matching away from the processor and onto an FPGA [1–4], using high-performance string matching algorithms [5–7], and reducing the dimensionality of the data, thereby minimizing computational time [8,9]. Another approach is to use both distributed and parallel detection methods; this is the best way to make NIDS keep up with the speed of networks. The main idea of distributed NIDS is splitting the traffic data and forwarding them to detection sensors, so that these sensors can analyze the data in parallel.
Paper [10] presents an approach which allows for meaningful slicing of the network traffic into portions of manageable size. However, their approach uses a simple round-robin algorithm for load balancing. The splitting algorithm of [11] ensures that a single slice contains all the evidence necessary to detect a specific attack, making sensor-to-sensor interaction unnecessary. Although the algorithm can dynamically balance the sensors' loads by choosing the sensor with the lightest load to process a new connection's packets, it may still lead to a sensor losing packets if the traffic of one connection is heavy. Paper [12] designs a flow-based dynamic load-balancing algorithm, which divides the data stream based on the current value of each analyzer's load function. Incoming data packets that belong to a new session are forwarded to the analyzer that currently has the least load. Paper [13] presents an active splitter architecture and three methods for improving performance: the first is early filtering/forwarding, where a fraction of the packets is processed on the splitter instead of the sensors; the second is the use of locality buffering, where the splitter reorders packets in a way that improves memory access locality on the sensors; the third is the use of cumulative acknowledgments, a method that optimizes the coordination between the traffic splitter and the sensors. The load balancer of SPANIDS [14] employs multiple levels of hashing and incorporates feedback from the sensor nodes to distribute network traffic over the sensors without overloading any of them. Although the methods of [12–14] reduce the load on the sensors, they complicate the splitting algorithm and make the splitter become the bottleneck of the system.
The traffic splitter is the key to a distributed intrusion detection system. An ideal splitting algorithm should satisfy these requirements: (1) the algorithm divides the whole traffic into slices of equal size; (2) each slice contains all the evidence necessary to detect a specific attack; (3) the algorithm is simple and efficient [11]. Through the above analysis we can find that the primary goal of a NIDS load balancer is to distribute network packets across a set of sensor hosts, thus reducing the load on each sensor to a level that the sensor can handle without dropping packets. However, the connection-oriented characteristic makes the load balancer of NIDS different from that of other environments such as web servers, distributed systems or clusters. In order to satisfy requirement (2), existing distributed intrusion detection systems pay more attention to the load balancer and thus cannot satisfy requirements (1) and (3). In this paper a distributed neural network learning algorithm (DNNL) is presented which can be used in a distributed anomaly detection system. The idea of DNNL is different from that of common distributed intrusion detection systems. While the usual methods try to satisfy requirement (2) by weakening requirements (1) and (3), DNNL takes the opposite approach: it first satisfies requirements (1) and (3) and then relies on the learning algorithm to satisfy requirement (2).
Two important characteristics of neural networks are: distributed, where knowledge representation is distributed across many processing units; and parallel, where computations take place in parallel across these distributed representations. Although each neural network can run in parallel, a group of neural networks cannot normally be run distributively to cope with one problem cooperatively, since the learning algorithm requires all the training data to be submitted to the network one by one until the network is stable after one or more epochs. This requirement becomes untenable when the amount of data exceeds the size of main memory, which is obviously possible for any realistic database, such as astronomy data [15], biomedical data [16], bioinformatics data [17], etc. DNNL is not only a parallel but also a distributed learning algorithm, which uses independent neural networks to process parts of the training data. These independent neural networks can run distributively, each one processing in parallel; thus DNNL can not only take advantage of the neural network's parallel character but also overcome the drawback of concentrated training. DNNL can also be used in mobile agents [18], distributed data mining [19], distributed monitoring [20] and ensemble systems [21].
The rest of this paper is organized as follows. Section 2 describes the main idea of DNNL and details the basic learning algorithm. Section 3 presents a metric embedding method and a dissimilarity measure algorithm to make DNNL suit data that contain categorical and numerical features. The experimental results on the KDDCUP99 data set are given in Sect. 4, and conclusions are drawn in Sect. 5.
2 DNNL
2.1 The process of DNNL
The main process of DNNL is: first, splitting the large sample data into small subsets and forwarding these slices to distributed sensors; second, each sensor's neural network is trained by its slice of data in parallel until all of them are stable; third, rebuilding the new training data based on the training results of each neural network (the new training data's amount is much less than the total amount of all the sliced data); last, a concentrated learning is carried out on the new training data. The process is shown in Fig. 1.
DNNL involves two learning phases. In the first phase (distributed learning), the large data set is split randomly and sent to independent neural networks; all the independent neural networks learn the knowledge of their slices distributively, each one in parallel. In the second phase (concentrated learning), the training data are built from the training results of the distributed neural networks. Since the new data set is much smaller than the original training data, it can be learned by one neural network in finite time and memory.
Fig. 1 The process of DNNL
2.2 Analysis of DNNL's completeness
The key issue in DNNL is how to build the new data to ensure that the training is complete, that is, that the result is equal to training on the whole data with one neural network. Next we first present the building method for the new data and then analyze the completeness of DNNL.
A stable neural network maintains the knowledge learned from the sample data in the weight matrix $W_{(m \times n)}$, where $m$ is the number of neurons and $n$ is the dimension of each neuron. In DNNL, the dimension of the neurons of the distributed neural network is equal to the dimension of the sample vector $x_{(1 \times n)}$. After the distributed neural network is stable, each row of $W$ can be regarded as one clustering center of the sliced data. The new data are generated from a Gaussian distribution around each point (row) of $W$. For example, suppose the whole original data set $X$ has $p \times q$ samples; $X$ is split into $p$ slices and each slice ${}^{(i)}X$ $(i = 1, \ldots, p)$ has $q$ samples. After the $i$th neural network trained by the $i$th slice of data ${}^{(i)}X$ is stable, its weight matrix ${}^{(i)}W$ is composed of $r$ rows ($r \ll q$). The $i$th slice of the new data set ${}^{(i)}\tilde{X}$ is generated from the Gaussian distribution around each point of ${}^{(i)}W$. After generation, ${}^{(i)}\tilde{X}$ contains $t$ ($r \le t \ll q$) samples. Since $t$ is also much less than $q$, the whole new data set $\tilde{X}$ is much smaller than the whole original data set $X$. From the above discussion we can find that, after training, each row of the $i$th neural network's weight matrix ${}^{(i)}W$ represents some samples of ${}^{(i)}X$ for which the distance between these samples and the corresponding row is lower than a threshold value. So we can write

${}^{(i)}W_j = {}^{(i)}X_k + A_k$   (1)

where $k$ identifies the samples whose clustering center is the $j$th row of ${}^{(i)}W$, and $A_k$ is the vector representing the difference between ${}^{(i)}W_j$ and ${}^{(i)}X_k$. The $i$th slice of the new data set ${}^{(i)}\tilde{X}$ is generated from the $i$th neural network as

${}^{(i)}\tilde{X}_l = {}^{(i)}W_j + B_m$   (2)

where $B_m$ is a random vector whose components are generated from a Gaussian distribution. Substituting Eq. (1) into Eq. (2) gives

${}^{(i)}\tilde{X}_l = {}^{(i)}X_k + A_k + B_m$   (3)
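As an illustration of this rebuilding step, the short C sketch below generates the new training slice by adding zero-mean Gaussian noise to each stable weight row, as in Eq. (2). It is only a minimal sketch of the idea: the noise standard deviation SIGMA and the number of samples drawn per neuron (SAMPLES_PER_NEURON) are illustrative assumptions, not values taken from the paper.

#include <math.h>
#include <stdlib.h>

#define DIM 41                 /* feature dimension (41 features in KDD99)      */
#define SAMPLES_PER_NEURON 5   /* assumed t/r: new samples drawn per weight row */
#define SIGMA 0.05             /* assumed std. dev. of the Gaussian noise B_m   */
#define PI 3.14159265358979323846

/* Box-Muller transform: one sample from N(0, sigma^2). */
static double gauss(double sigma)
{
    double u1 = (rand() + 1.0) / (RAND_MAX + 2.0);
    double u2 = (rand() + 1.0) / (RAND_MAX + 2.0);
    return sigma * sqrt(-2.0 * log(u1)) * cos(2.0 * PI * u2);
}

/* Rebuild the new slice: for every stable weight row W[j] emit
 * SAMPLES_PER_NEURON noisy copies, following Eq. (2).
 * W is r x DIM; newX must hold r * SAMPLES_PER_NEURON rows.        */
void rebuild_slice(const double W[][DIM], int r, double newX[][DIM])
{
    int out = 0;
    for (int j = 0; j < r; j++)
        for (int s = 0; s < SAMPLES_PER_NEURON; s++, out++)
            for (int d = 0; d < DIM; d++)
                newX[out][d] = W[j][d] + gauss(SIGMA);
}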
The neural network can be represented by a function $f(W, x)$. After training, given an unlabelled input $x$, the output of $f(\cdot)$ is close or equal to the future result $y$. For a feedforward network, the training is a process of finding

$W^{*} = \arg\min \sum_{i=1}^{m} \sum_{j=1}^{n} \left\| y_{ij} - f(W, x_i)_j \right\|$   (4)

A common choice of the error function is the least mean square error of the form

$C(x) = \sum_{i=1}^{m} \left\| y_i - f(W, x_i) \right\|^2$   (5)

and its expected value is

$E(C(x)) = \iint C(x)\, f_d(x, y)\, \mathrm{d}x\, \mathrm{d}y$   (6)

where the function $f_d(x, y)$ represents the probability density of the training data. Substituting Eq. (5) into Eq. (6) gives

$E(C(x)) = \sum_{i=1}^{m} \sum_{j=1}^{n} \iint \left( y_{ij} - f(W, x_i)_j \right)^2 f_d(x_i, y_i)\, \mathrm{d}x_i\, \mathrm{d}y_i$   (7)

When training with $\tilde{X}$ the function $f(\cdot)$ becomes $f(W, x + a + b)$, which can be expanded into a Taylor series:

$f(W, x + a + b) = f(W, x) + \nabla f(W, x + a + b)^{T}(a + b) + \frac{1}{2}(a + b)^{T} \nabla^{2} f(W, x + a + b)(a + b) + \cdots = f(W, x) + h(x)$   (8)

where $\nabla f(\cdot)$ is the gradient and $\nabla^{2} f(\cdot)$ is the Hessian matrix. The expected error value when training with the new data set $\tilde{X}$ can be written in the form

$E(C(\tilde{x})) = E(C(x)) + \Omega(f(W, x))$   (9)

where the penalty term $\Omega(f(W, x))$ is

$\Omega(f(W, x)) = \sum_{i=1}^{m} \sum_{j=1}^{n} \iiiint \left[ -2\left( y_{ij} - f(W, x_i)_j \right) h(x_i) + h(x_i)^2 \right] f_d(x_i, y_i)\, f_d(a_i)\, f_d(b_i)\, \mathrm{d}x_i\, \mathrm{d}y_i\, \mathrm{d}a_i\, \mathrm{d}b_i$   (10)

From Eq. (9) we can find that training with the new data set $\tilde{X}$ is equivalent to the technique of regularization, which adds a penalty term to the error function for controlling the bias and variance of a neural network [22].
This neural network learning rule can be considered as a gradient optimization process when an appropriate energy function $E(w)$ is selected; the gradient direction is

$\frac{\mathrm{d}w}{\mathrm{d}t} = -\frac{\partial E(w)}{\partial w}$   (11)

and the synaptic weights are adjusted in the gradient direction

$w(k + 1) = w(k) - \frac{\partial E(w)}{\partial w}$   (12)

If the data set S in Fig. 2 is trained following this process, then after the neural network is stable, that is, after the energy function E reaches a minimum (local or global), the synaptic weights are the black points in Fig. 2.

If the data set S is randomly split (that is, not following the partition boundary) into two data sets S1 and S2, which are shown in Figs. 3 and 4, distributed learning first trains S1 and S2 independently. After they are both stable, S1's energy function E1 and S2's energy function E2 have both reached a minimum, and the concentrated learning is carried out on the learning results of data sets S1 and S2.

During the concentrated learning, since the triangle class and the plus-sign class have been generalized very well by their synaptic weights, these weights will not be adjusted to a great degree to minimize the energy function; however, the cross class and the six-pointed-star class do not reach the optimal state, and therefore their synaptic weights generated during the distributed learning will continue to be adjusted until a minimum is reached. Although the training results (synaptic weight number and value) on the split data sets and on the whole data set may be different, they have the same generalization ability because they all aim to make S's energy function E reach a minimum.

Fig. 2 The data set S and its training result
Fig. 3 The data set S1 and its training result
Fig. 4 The data set S2 and its training result
2.3 Competitive learning algorithm based on kernel
function
In order to gain the advantage of being able to learn from new data, a neural network must be adaptive or exhibit plasticity, possibly allowing the creation of new neurons. On the other hand, if the training data structures are unstable and the most recently acquired piece of information can cause major reorganization, then it is difficult to ascribe much significance to any particular clustering description. This problem is even more serious in distributed data training. SOM [23], dART [24], RPCL [25], etc. have presented methods to overcome this problem; this paper introduces a competitive mechanism which absorbs the ideas of the above methods. The learning algorithm is based on Hebb learning and a kernel function. To prevent the knowledge included in different slices from being ignored, DNNL adopts the resonance mechanism of ART and adds neurons whenever the network in its current state does not sufficiently match the input. Thus the learning results of the sensors contain complete or partial knowledge, and the whole knowledge can be learned by the concentrated learning.
2.3.1 Hebb learning
In DNNL the learning algorithm is based on the Hebbian postulate, which states that "when an axon of cell A is near enough to excite a cell B and repeatedly or persistently takes part in firing it, some growth process or metabolic change takes place in one or both cells such that A's efficiency, as one of the cells firing B, is increased."
The learning rule for a single neuron can be derived from an energy function defined as

$E(w) = -\varphi\left(w^{T} x\right) + \frac{\beta}{2} \left\| w \right\|_2^2$   (13)

where $w$ is the synaptic weight vector (including a bias or threshold), $x$ is the input to the neuron, $\varphi(\cdot)$ is a differentiable function, and $\beta \ge 0$ is the forgetting factor. Also,

$y = \frac{\mathrm{d}\varphi(v)}{\mathrm{d}v} = f(v)$   (14)

is the output of the neuron, where $v = w^{T} x$ is the activity level of the neuron. Taking the steepest descent approach to derive the continuous-time learning rule

$\frac{\mathrm{d}w}{\mathrm{d}t} = -\mu \nabla_{w} E(w)$   (15)

where $\mu > 0$ is the learning rate parameter, we see that the gradient of the energy function in Eq. (13) must be computed with respect to the synaptic weight vector, that is, $\nabla_{w} E(w) = \partial E(w)/\partial w$. The gradient of Eq. (13) is

$\nabla_{w} E(w) = -f(v)\,\frac{\partial v}{\partial w} + \beta w = -yx + \beta w$   (16)

Therefore, by using the result in Eq. (16) along with that of Eq. (15), the continuous-time learning rule for a single neuron is

$\frac{\mathrm{d}w}{\mathrm{d}t} = \mu\left[ yx - \beta w \right]$   (17)

and the discrete-time learning rule (in vector form) is

$w(t + 1) = w(t) + \mu\left[ y(t + 1)x(t + 1) - \beta w(t) \right]$   (18)
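As a concrete reading of Eq. (18), the following C fragment applies one discrete-time Hebbian update with forgetting to a single neuron. It is a minimal sketch: the learning rate MU, the forgetting factor BETA and the identity activation used for f(v) are illustrative assumptions, not parameters reported in the paper.

#define DIM  41     /* feature dimension              */
#define MU   0.1    /* assumed learning rate mu       */
#define BETA 0.01   /* assumed forgetting factor beta */

/* One Hebbian step, Eq. (18): w <- w + mu * [ y*x - beta*w ],
 * where y = f(v) and v = w^T x; here f is taken to be the identity. */
void hebb_update(double w[DIM], const double x[DIM])
{
    double v = 0.0;                     /* activity level v = w^T x   */
    for (int i = 0; i < DIM; i++)
        v += w[i] * x[i];

    double y = v;                       /* output y = f(v), identity  */
    for (int i = 0; i < DIM; i++)
        w[i] += MU * (y * x[i] - BETA * w[i]);
}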
2.3.2 Competitive mechanism based on kernel function
To overcome the problem induced by the traffic splitter, an inverse distance kernel function is used in Hebb learning. The basic idea is that not only is the winner rewarded but all the losers are also penalized, at different rates which are calculated by the inverse distance function, whose input is the dissimilarity between the sample data and the neuron.
The dissimilarity measure function is the Minkowski metric:

$d_p(x, y) = \left( \sum_{i=1}^{l} w_i \left| x_i - y_i \right|^p \right)^{1/p}$   (19)

where $x_i$, $y_i$ are the $i$th coordinates of $x$ and $y$, $i = 1, \ldots, l$, and $w_i \ge 0$ is the $i$th weight coefficient.
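A direct transcription of Eq. (19) in C is sketched below; the choice p = 2 and unit feature weights mentioned in the closing comment are only illustrative assumptions.

#include <math.h>

/* Weighted Minkowski metric, Eq. (19):
 * d_p(x, y) = ( sum_i  w[i] * |x[i] - y[i]|^p )^(1/p)
 * n is the number of coordinates l; w holds the weight coefficients. */
double minkowski(const double *x, const double *y,
                 const double *w, int n, double p)
{
    double sum = 0.0;
    for (int i = 0; i < n; i++)
        sum += w[i] * pow(fabs(x[i] - y[i]), p);
    return pow(sum, 1.0 / p);
}
/* Example: with p = 2 and all w[i] = 1 this reduces to the ordinary
 * Euclidean distance between x and y.                                */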
When the $j$th neuron is most similar to the sample, the learning rule of the $i$th neuron is

$W_i(t + 1) = W_i(t) + \alpha_i \left[ x(t + 1) - W_i(t) \right]$   (20)

where

$\alpha_i = \begin{cases} 1, & \text{winner, } i = j \\ -K(d_i), & \text{others, } i = 1, \ldots, m \text{ and } i \ne j \end{cases}$   (21)

and $K(d_i)$ is the inverse distance kernel,

$K(d_i) = \frac{1}{1 + d_i^{\,p}}$   (22)

If the winner's dissimilarity measure $d < \delta$ ($\delta$ is the threshold of dissimilarity), then the synaptic weight is updated by the learning rule Eq. (20); otherwise a new neuron is added and its synaptic weight is set to $w = x$.
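To make the competition concrete, the sketch below applies Eqs. (20)-(22) to a small pool of neurons: the winner is pulled toward the sample and every other neuron is pushed away at a rate given by the inverse distance kernel. It is a minimal illustration under assumed settings (the dissimilarities are computed beforehand with Eq. (19)); it is not the authors' implementation.

#include <math.h>

#define DIM 41   /* feature dimension */

/* Inverse distance kernel, Eq. (22): K(d) = 1 / (1 + d^p). */
static double kernel(double d, double p)
{
    return 1.0 / (1.0 + pow(d, p));
}

/* One competitive update over m neurons for a single sample x.
 * d[i] holds the Minkowski dissimilarity of Eq. (19) between x and
 * neuron i, and winner is the index of the most similar neuron.     */
void competitive_update(double W[][DIM], const double d[], int m,
                        int winner, const double x[DIM], double p)
{
    for (int i = 0; i < m; i++) {
        /* Eq. (21): alpha = 1 for the winner, -K(d_i) for the losers. */
        double alpha = (i == winner) ? 1.0 : -kernel(d[i], p);
        for (int k = 0; k < DIM; k++)            /* Eq. (20)            */
            W[i][k] += alpha * (x[k] - W[i][k]);
    }
}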
2.4 Post-prune algorithm
One of the central issues in network training is finding the optimal model $f(\cdot)$. Judging the efficiency of $f(\cdot)$ can be broken into two fundamental aspects: bias and variance. Bias measures the expected value of the estimator relative to the true value, and variance measures the variability of the estimator about the expected value. Since DNNL determines the network size by adding neurons incrementally, it may model noisy data into $f(\cdot)$ and lead to high variance (the phenomenon of overfitting). To prevent overfitting, DNNL uses a post-prune method whose strategy is based on a distance threshold: if two weights are too similar they are substituted by a new weight. The new weight is calculated as
$W_{\mathrm{new}} = (W_{\mathrm{old1}}\, t_1 + W_{\mathrm{old2}}\, t_2)/(t_1 + t_2)$   (23)

where $t_1$ is the number of training times of $W_{\mathrm{old1}}$ and $t_2$ is the number of training times of $W_{\mathrm{old2}}$.
The pruning process is illustrated in Fig. 5: after pruning, E, F, A and B are aggregated to EF and AB. The prune algorithm is shown below:
Fig. 5 The pruning process
Step 0: If the old weights set (oldW) is empty then the algorithm is over, else proceed;
Step 1: Calculate the distance between the first weight (fw) and the other weights;
Step 2: Find the weight (sw) which is most similar to fw;
Step 3: If the distance between sw and fw is larger than the pruning threshold, then delete fw from oldW, add fw into the new weights set (newW) and go to Step 0; else continue;
Step 4: Get fw's training times value (ft) and sw's training times value (st);
Step 5: Calculate the new weight (nw) and nw's training times value (nt): nw = (fw × ft + sw × st)/(ft + st), nt = ft + st;
Step 6: Delete fw and sw from oldW and add nw into newW; go to Step 0.
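The C sketch below mirrors these steps over a simple array of weights; the fixed-size pools, the Euclidean distance and the PRUNE_THRESHOLD constant are illustrative assumptions rather than values from the paper.

#include <math.h>

#define DIM 41
#define PRUNE_THRESHOLD 0.5   /* assumed pruning threshold */

typedef struct {
    double w[DIM];
    long   times;             /* how many samples trained this weight */
} Neuron;

static double dist(const double *a, const double *b)
{
    double s = 0.0;
    for (int i = 0; i < DIM; i++)
        s += (a[i] - b[i]) * (a[i] - b[i]);
    return sqrt(s);
}

/* Post-prune: repeatedly take the first weight, find its nearest
 * neighbour and, if they are closer than the threshold, merge them
 * with Eq. (23); otherwise keep the weight unchanged.  Returns the
 * number of weights written to newW.                                */
int post_prune(Neuron oldW[], int n, Neuron newW[])
{
    int out = 0;
    while (n > 0) {                                   /* Step 0      */
        int nearest = -1;
        double best = 1e300;
        for (int i = 1; i < n; i++) {                 /* Steps 1-2   */
            double d = dist(oldW[0].w, oldW[i].w);
            if (d < best) { best = d; nearest = i; }
        }
        if (nearest < 0 || best > PRUNE_THRESHOLD) {  /* Step 3      */
            newW[out++] = oldW[0];
            oldW[0] = oldW[--n];
        } else {                                      /* Steps 4-6   */
            Neuron merged;
            long ft = oldW[0].times, st = oldW[nearest].times;
            for (int k = 0; k < DIM; k++)             /* Eq. (23)    */
                merged.w[k] = (oldW[0].w[k] * ft +
                               oldW[nearest].w[k] * st) / (ft + st);
            merged.times = ft + st;
            newW[out++] = merged;
            oldW[nearest] = oldW[--n];                /* remove sw   */
            oldW[0] = oldW[--n];                      /* remove fw   */
        }
    }
    return out;
}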
2.5 Learning algorithm of DNNL
The main learning process of DNNL is:
Step 0: Initialize the learning rate parameter $\mu$ and the threshold of dissimilarity $\delta$;
Step 1: Obtain the first input $x$ and set $w_0 = x$ as the initial weight;
Step 2: If training is not over, randomly take a feature vector $x$ from the feature sample set $X$ and compute the dissimilarity measure between $x$ and each synaptic weight using Eq. (19);
Step 3: Decide the winner neuron $j$ and test the tolerance: if $d_j \ge \delta$, add a new neuron, set its synaptic weight $w = x$ and go to Step 2; else continue;
Step 4: Compute $\alpha_i$ by using the result of the inverse distance kernel $K(d_i)$;
Step 5: Update the synaptic weights as in Eq. (20) and go to Step 2.
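Putting the pieces together, the sketch below shows how one sensor could run these steps on its slice: a neuron is added on a tolerance failure and otherwise the pool is updated with the competitive rule of Sect. 2.3.2. It is only an illustrative reading of the algorithm; the pool size, the Minkowski order p = 2 and the unit feature weights are assumptions, while the threshold 1.5 follows the benchmark setting used in Sect. 4.1.2.

#include <math.h>

#define DIM 41
#define MAX_NEURONS 512
#define DELTA 1.5            /* dissimilarity threshold, as in Sect. 4.1.2 */
#define P     2.0            /* assumed Minkowski order                    */

typedef struct { double w[DIM]; long times; } Neuron;

static double dissim(const double *x, const double *y)   /* Eq. (19), unit weights */
{
    double s = 0.0;
    for (int i = 0; i < DIM; i++)
        s += pow(fabs(x[i] - y[i]), P);
    return pow(s, 1.0 / P);
}

/* Train one sensor's network on its slice X (rows x DIM); returns the
 * number of neurons in the stable network.                             */
int dnnl_train_slice(const double X[][DIM], int rows, Neuron net[MAX_NEURONS])
{
    int m = 0;
    for (int r = 0; r < rows; r++) {                      /* Step 2       */
        const double *x = X[r];
        if (m == 0) {                                     /* Step 1       */
            for (int k = 0; k < DIM; k++) net[0].w[k] = x[k];
            net[0].times = 1; m = 1; continue;
        }
        double d[MAX_NEURONS];
        int j = 0;
        d[0] = dissim(x, net[0].w);
        for (int i = 1; i < m; i++) {                     /* find winner  */
            d[i] = dissim(x, net[i].w);
            if (d[i] < d[j]) j = i;
        }
        if (d[j] >= DELTA && m < MAX_NEURONS) {           /* Step 3       */
            for (int k = 0; k < DIM; k++) net[m].w[k] = x[k];
            net[m].times = 1; m++; continue;
        }
        for (int i = 0; i < m; i++) {                     /* Steps 4-5    */
            double alpha = (i == j) ? 1.0 : -1.0 / (1.0 + pow(d[i], P));
            for (int k = 0; k < DIM; k++)
                net[i].w[k] += alpha * (x[k] - net[i].w[k]);
        }
        net[j].times++;
    }
    return m;
}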
3 Data preprocessing
The KDDCUP 99 data was collected through a simulation on a U.S. military network by the 1998 DARPA Intrusion Detection Evaluation Program, aiming at obtaining a benchmark data set for the field of intrusion detection. The full data set contains training data consisting of 7 weeks of network-based intrusions inserted in normal data, and 2 weeks of network-based intrusions in normal data, for a total of 4,999,000 connection records described by 41 characteristics. The records are mainly divided into four types of attack: probe, denial of service (DOS), user-to-root (U2R) and remote-to-local (R2L).
3.1 Metric embedding
The set of features presented in the KDD Cup data set contains categorical and numerical features of different sources and scales. An essential step for handling such data is metric embedding, which transforms the data into a metric space. In this paper the categorical features are represented by the metric $A$: each categorical feature $A_i$ expressing $g$ possible categorical values is defined as $A_i = \{A_i^1, A_i^2, \ldots, A_i^g\}$; the numerical features are represented by $B$; then the metric space $X$ can be defined as $X = \{A_1, \ldots, A_m, B_1, \ldots, B_{n-m}\}$. That means each sample $X$ is described by $n$ features.
3.2 Dissimilarity measure
For numerical features, the value $|x_i - y_i|$ of the Minkowski metric can be calculated directly after normalization. But for categorical features, we need to define a new calculation method. The Hamming distance is often used to quantify the extent to which two strings of the same dimension differ. An early application was in the theory of error-correcting codes, where the Hamming distance measured the error introduced by noise over a channel when a message, typically a sequence of bits, is sent between its source and destination. In DNNL the calculation of $|x_i - y_i|$ for categorical features is similar to the Hamming distance. If $x_i$ and $y_i$ are categorical features, $x_i$ is the feature of the sample data, $y_i$ is the corresponding feature of one training neuron $N$, and $x_i = A_i^k$ $(k \in 1, \ldots, g)$, then

$|x_i - y_i| = 1 - \frac{c_k}{C}$   (24)

where $c_k$ is the number representing how many times $A_i^k$ has been learned by neuron $N$,

$c_k = \mathrm{num}\left(A_i^k\right)$   (25)

and $C$ is the total number of all the categorical features that have been learned by neuron $N$,

$C = \sum_{i=1}^{m} \sum_{j=1}^{g} \mathrm{num}\left(A_i^j\right)$   (26)

If the neuron $N$ is the winner of this training epoch, its value of $c_k$ is increased by 1. Using this method to calculate the Minkowski metric of categorical features, if the value of one sample's categorical feature $A_i$ is $A_i^k$, then the neurons with a larger $\mathrm{num}(A_i^k)$ are more similar to this sample regarding this categorical feature, that is, the value $|x_i - y_i|$ is smaller.
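A small C sketch of this counting scheme is given below: each neuron keeps a count per categorical value, and Eq. (24) turns those counts into a distance term that can be plugged into the Minkowski metric of Eq. (19). The fixed array sizes are illustrative assumptions.

#define MAX_CAT_FEATURES 7    /* assumed number of categorical features   */
#define MAX_CAT_VALUES   70   /* assumed maximum number g of values each  */

/* Per-neuron statistics for the categorical part of the sample. */
typedef struct {
    long counts[MAX_CAT_FEATURES][MAX_CAT_VALUES]; /* num(A_i^k), Eq. (25) */
    long total;                                    /* C, Eq. (26)          */
} CatStats;

/* Eq. (24): distance contribution of categorical feature i when the
 * sample takes value k for that feature.                               */
double cat_distance(const CatStats *s, int i, int k)
{
    if (s->total == 0)
        return 1.0;                 /* nothing learned yet: maximally far */
    return 1.0 - (double)s->counts[i][k] / (double)s->total;
}

/* Called for the winning neuron only: record that value k of feature i
 * has been learned once more.                                           */
void cat_learn(CatStats *s, int i, int k)
{
    s->counts[i][k]++;
    s->total++;
}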
4 Experiments
4.1 Benchmark test
In the KDDCUP 99 data set, a smaller data set consisting of 10% of the overall data is generally used to evaluate algorithm performance. The smaller data set contains 22 kinds of intrusion behaviors and 494,019 records, among which 97,276 are normal connection records. The test set is another data set which contains 37 kinds of intrusion behaviors and 311,029 records, among which 60,593 are normal.
4.1.1 Performance measures
The recording format of the test results is shown in Table 1. False alarms are partitioned into False Positives (FP, normal is detected as intrusion) and False Negatives (FN, intrusion is not detected). True detections are also partitioned into True Positives (TP, intrusion is detected correctly) and True Negatives (TN, normal is detected correctly).

Table 1 Recording format of test results

Actual behaviors    Detection results
                    Normal    Intrusion-1    ...    Intrusion-n
Normal              TN00      FP01           ...    FP0n
Intrusion-1         FN10      TP11           ...    FP1n
Intrusion-2         FN20      FP21           ...    FP2n
...                 ...       ...            ...    ...
Intrusion-n         FNn0      FPn1           ...    TPnn
Definition 1 The right detection rate of the $i$th behavior (TR) is

$\mathrm{TR} = T_{ii} \Big/ \sum_{j=0}^{n} R_{ij}$   (27)

where $T_{ii}$ is the value which lies in Table 1's $i$th row and $i$th column, and $R_{ij}$ is the value which lies in Table 1's $i$th row and $j$th column.

Definition 2 The right prediction rate of the $i$th behavior (PR) is

$\mathrm{PR} = T_{ii} \Big/ \sum_{j=0}^{n} R_{ji}$   (28)

where $T_{ii}$ is the value which lies in Table 1's $i$th row and $i$th column, and $R_{ji}$ is the value which lies in Table 1's $j$th row and $i$th column.

Definition 3 The detection rate (DR) is computed as the ratio between the number of correctly detected intrusions and the total number of intrusions. If we regard Table 1's record as an $(n+1) \times (n+1)$ matrix $R$, then

$\mathrm{DR} = \sum_{i=1}^{n} \sum_{j=1}^{n} R_{ij} \Big/ \sum_{i=1}^{n} \sum_{j=0}^{n} R_{ij}$   (29)

Definition 4 The false positive rate (FPR) is computed as the ratio between the number of normal behaviors that are incorrectly classified as intrusions and the total number of normal connections; according to Table 1's record,

$\mathrm{FPR} = \sum_{i=1}^{n} \mathrm{FP}_{0i} \Big/ \left( \sum_{i=1}^{n} \mathrm{FP}_{0i} + \mathrm{TN}_{00} \right)$   (30)
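For reference, the following C fragment computes TR, PR, DR and FPR directly from a confusion table R laid out as in Table 1 (row 0 and column 0 are the normal class); it is a straightforward sketch of Eqs. (27)-(30), not code from the paper.

#define NCLASS 5   /* 1 normal class + n intrusion classes (here n = 4) */

/* R[i][j]: actual behavior i classified as j; index 0 means "normal". */

double right_detection_rate(const long R[NCLASS][NCLASS], int i)   /* Eq. (27) */
{
    long row = 0;
    for (int j = 0; j < NCLASS; j++) row += R[i][j];
    return row ? (double)R[i][i] / row : 0.0;
}

double right_prediction_rate(const long R[NCLASS][NCLASS], int i)  /* Eq. (28) */
{
    long col = 0;
    for (int j = 0; j < NCLASS; j++) col += R[j][i];
    return col ? (double)R[i][i] / col : 0.0;
}

double detection_rate(const long R[NCLASS][NCLASS])                /* Eq. (29) */
{
    long hit = 0, intr = 0;
    for (int i = 1; i < NCLASS; i++)
        for (int j = 0; j < NCLASS; j++) {
            intr += R[i][j];
            if (j >= 1) hit += R[i][j];
        }
    return intr ? (double)hit / intr : 0.0;
}

double false_positive_rate(const long R[NCLASS][NCLASS])           /* Eq. (30) */
{
    long fp = 0;
    for (int j = 1; j < NCLASS; j++) fp += R[0][j];
    return (double)fp / (fp + R[0][0]);
}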
4.1.2 Experiment results
To test the performance of DNNL, we first divide the 494,019 records into 50 slices. Each slice contains 10,000 records except the last, which contains 4,019 records. With learning rate $\mu = 0.1$ and dissimilarity threshold $\delta = 1.5$, the learning results are shown in Figs. 6 and 7. In Fig. 6, the X-axis represents each slice and the Y-axis records the corresponding number of neurons after the neural networks are stable. In Fig. 7, the Y-axis records the corresponding number of behaviors included in each slice. From the results we can find that the distributions of neurons and behaviors are similar, which indicates that the sensors have learned the knowledge. Since the behaviors recorded in the 18th–36th and 43rd–46th slices are all smurf intrusions, the number of behaviors there is 1 and the number of neurons is 51. After the training on the distributed learning results, the knowledge is represented by 368 neurons.
Fig. 6 The number of neurons of each corresponding slice
Fig. 7 The number of behaviors of each corresponding slice
There are 37 kinds of intrusion behaviors in the test set. We first separate them into four kinds of attacks:
Probe: {portsweep, mscan, saint, satan, ipsweep, nmap}
DOS: {udpstorm, smurf, pod, land, processtable, warezmaster, apache2, mailbomb, Neptune, back, teardrop}
U2R: {httptunnel, ftp_write, sqlattack, xterm, multihop, buffer_overflow, perl, loadmodule, rootkit, ps}
R2L: {guess_passwd, phf, snmpguess, named, imap, snmpgetattack, xlock, sendmail, xsnoop, worm}
The test results are summarized in Table 2.
Comparing the result with the first winner of KDD CUP 99, we see that the TR of DNNL is almost equal to that of the first winner. There are two reasons leading to the low TR of U2R and R2L: first, the number of attack instances that pertain to U2R and R2L is much smaller than that of other types of attack; second, U2R and R2L are host-based attacks which exploit vulnerabilities of the operating systems, not of the network protocol, and therefore they are very similar to the normal data. Table 3 shows the DR and FPR of the first and second winners of the KDD CUP 99 competition, other approaches [21] and DNNL. From the comparison we can find that DNNL provides superior performance.
4.2 Prototype system test
4.2.1 Test environment
In order to address the problem of intrusion detection analysis in high-speed networks, the data stream on the high-speed network link is divided into several smaller slices that are fed into a number of distributed neural networks. In order to evaluate the effectiveness of DNNL, we developed a prototype IDS using Libpcap. The test environment is shown in Fig. 8. We used 12 PCs with 100 Mbps Ethernet cards to serve as background traffic generators, which could generate more than 1,000 Mbps of TCP and UDP streams with an average packet size of 1,024 bytes. One IBM server which runs Web services is the attack target, and one attacker sends attack packets to the server. All 14 computers are connected to 100 Mbps ports on a Huawei Quidway S3526C switch. All the packets through these ports are mirrored to a defined mirror port and then distributed to the neural networks.
4.2.2 Packet capture
Every station on a LAN hears every packet transmission, and there is a destination field and a source field in each packet. The Ethernet card can be in promiscuous mode or normal mode. Under promiscuous mode, the card will receive and deliver every packet. Under normal mode, if the packet destination address is identical to the station address, the card will receive the packet and pass it up to the software; if it is not, the card will just drop the packet (filter it). An IDS can run under the promiscuous mode of the Ethernet card to analyze every packet passing through the LAN. Libpcap is the library we use to grab packets from the network card directly. The main functions used are:
pcap_open_live() is used to obtain a packet capture descriptor to look at packets on the network.
pcap_lookupnet() is used to determine the network number and mask associated with a network device.
pcap_lookupdev() returns a pointer to a network device suitable for use with pcap_open_live() and pcap_lookupnet().
pcap_loop() is used to collect and process packets. Each captured packet is parsed to form the network behavior vector.
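A minimal capture loop built from these four calls might look like the sketch below; the device fallback string "eth0", the snapshot length and the timeout are illustrative choices, and the parsing into behavior vectors is reduced to a stub.

#include <pcap.h>
#include <stdio.h>

/* Stub standing in for the parsing of a captured packet into the
 * network behavior vector described in the text.                    */
static void on_packet(u_char *user, const struct pcap_pkthdr *hdr,
                      const u_char *bytes)
{
    (void)user; (void)bytes;
    printf("captured %u bytes\n", hdr->caplen);
}

int main(void)
{
    char errbuf[PCAP_ERRBUF_SIZE];
    bpf_u_int32 net = 0, mask = 0;

    char *dev = pcap_lookupdev(errbuf);          /* pick a device       */
    if (dev == NULL)
        dev = "eth0";                            /* assumed fallback    */

    pcap_lookupnet(dev, &net, &mask, errbuf);    /* network number/mask */

    /* 65535-byte snapshot, promiscuous mode, 1000 ms read timeout.     */
    pcap_t *handle = pcap_open_live(dev, 65535, 1, 1000, errbuf);
    if (handle == NULL) {
        fprintf(stderr, "pcap_open_live: %s\n", errbuf);
        return 1;
    }

    pcap_loop(handle, -1, on_packet, NULL);      /* process packets     */
    pcap_close(handle);
    return 0;
}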
Our method of parsing is based on the layered structure of network software.
Table 2 Testing results

Actual behaviors    Detection results
                    Normal    Probe    DoS       U2R    R2L     TR (%)
Normal              58120     927      649       64     833     96.0
Probe               357       3546     174       21     118     85.1
DoS                 256       5092     223518    52     435     97.2
U2R                 143       39       0         23     23      10.1
R2L                 14443     14       1         271    1460    9.0
PR (%)              79.3      36.9     99.6      5.3    50.9
Table 3 Comparison with other approaches

Algorithm                  Detection rate (DR) (%)    False positive rate (FPR) (%)
Winning entry              91.9                       0.5
Second place               91.5                       0.6
Best linear GP-FP rate     89.4                       0.7
Best GEdIDS-FP rate        91                         0.4
DNNL                       93.9                       0.4
Fig. 8 Test environment
In the TCP/IP reference model, the Internet layer defines an official packet format and protocol called IP (Internet Protocol). The layer above the Internet layer is the transport layer, where two end-to-end protocols have been defined. The first one, TCP (Transmission Control Protocol), is a reliable connection-oriented protocol that allows a byte stream originating on one machine to be delivered without error on any other machine in the Internet. The second protocol in this layer, UDP (User Datagram Protocol), is an unreliable, connectionless protocol. In the experiments we use the headers of the packets to define the data structure of a network behavior:
typedef struct _EthernetBehavior
{
    u_int8_t  ethernet_dest[12];  /* destination ethernet address */
    u_int8_t  ethernet_sour[12];  /* source ethernet address */
    u_int16_t ethernet_type;      /* packet type ID field */
} EthernetBehavior;

The ethernet_type field shows the nested structure of the protocol headers. It may be IP, ARP, or some other protocol. For instance, an IP header can be defined as:

typedef struct _IPBehavior
{
    unsigned int header_len;           /* the header length */
    unsigned int version;              /* version of the protocol */
    u_int8_t  tos;                     /* type of service */
    u_short   total_len;               /* total length of datagram */
    u_short   identification;          /* identification */
    u_int8_t  flag_off;                /* flags and fragment offset */
    u_int8_t  time_live;               /* the limit of packet lifetimes */
    u_int8_t  protocol;                /* TCP or UDP */
    u_int8_t  checksum;                /* header checksum */
    struct in_addr source_addr;        /* source address */
    struct in_addr destination_addr;   /* destination address */
} IPBehavior;

The protocol variable tells what type of protocol is used in the upper layer; it can be TCP, UDP, ICMP, etc. For example, the definition for TCP is:

typedef struct _TCPBehavior
{
    u_int16_t sour_port;    /* source port */
    u_int16_t dest_port;    /* destination port */
    tcp_seq   seq_num;      /* sequence number */
    tcp_seq   ack_num;      /* acknowledgement number */
    u_int16_t flag;         /* flags */
    u_int16_t win_size;     /* window size */
    u_int16_t check_sum;    /* header checksum */
    u_int16_t urg_pointer;  /* urgent pointer */
} TCPBehavior;
Fig. 9 Test result
During the training period, the IDS uses the behavior variables of the normal packets to form the binary behavior matrix. At detection time, if an intrusion is detected, the IDS will raise an alarm and display the detailed information of the intruder, which is parsed from the behavior variables.
4.2.3 Experiment result
In this experiment, we evaluate our proposed method. We first train the neural network with different normal features, then use the stable neural network to monitor the system while some abnormal behaviors are happening under the same environment. A series of experiments is conducted to analyze the effect of varying the intrusion threshold on system errors. The test results are graphically represented in Fig. 9.
We can find that the performance of the IDS is sensitive to the intrusion threshold. As the threshold value increases, false positive errors increase while false negative errors decrease. Since a false negative error is more serious in IDS, we need to concentrate on the decrease of false negative errors according to the change of the threshold value. The optimal threshold value is 1.5–1.6.
5 Conclusions
The bandwidth of networks increases faster than the speed of processors, so it is impossible to keep up with the speed of networks by just increasing the processor speed of NIDS. To resolve this problem, this paper presents DNNL, which can be used in anomaly detection methods. Completeness analysis shows that DNNL's learning algorithm is equivalent to training with one neural network which adds a penalty term to the error function for controlling the bias and variance of a neural network. The main contributions of this approach are: reducing the complexity of load balancing while still maintaining the completeness of the network behavior, putting forward a dissimilarity measure method for categorical and numerical features, and increasing the speed of the whole system. In the experiments, the KDD data set is used, which is the common data set used in IDS research papers. Training with one neural network takes 6–7 h, whereas DNNL takes less than 1 h. Comparisons with other approaches on the same benchmark show that DNNL's false alarm rate is very low.
Acknowledgments This research is supported by both the National Natural Science Foundation of China under Grant No. 60573128 and the National Research Foundation for the Doctoral Program of Higher Education of China under Grant No. 20060183043.
References
1. Song, H.Y., Lockwood, J.W.: Efficient packet classification for network intrusion detection using FPGA. In: Proceedings of the 13th International Symposium on Field-Programmable Gate Arrays, pp. 238–245. Monterey (2005)
2. Yang, W., Fang, B.X., Liu, B., Zhang, H.L.: Intrusion detection system for high-speed network. J. Comput. Commun. 27, 1288–1294 (2004)
3. Baker, Z.K., Prasanna, V.K.: Automatic synthesis of efficient intrusion detection systems on FPGAs. In: Proceedings of the 14th Field Programmable Logic and Application, pp. 311–321. Leuven, Belgium (2004)
4. Baker, Z.K., Prasanna, V.K.: A methodology for synthesis of efficient intrusion detection systems on FPGAs. In: Proceedings of the 12th Annual IEEE Symposium on Field-Programmable Custom Computing Machines (FCCM'04), pp. 135–144. Napa (2004)
5. McAlerney, J., Coit, C., Staniford, S.: Towards faster string matching for intrusion detection or exceeding the speed of snort. In: Proceedings of the DARPA Information Survivability Conference and Exposition, pp. 367–373. Anaheim (2001)
6. Tuck, N., Sherwood, T., Calder, B., Varghese, G.: Deterministic memory-efficient string matching algorithms for intrusion detection. In: Proceedings of the 23rd Conference of the IEEE Communications Society, pp. 2628–2639. Hong Kong (2004)
7. Tan, L., Sherwood, T.: A high throughput string matching architecture for intrusion detection and prevention. In: Proceedings of the 32nd International Symposium on Computer Architecture, pp. 112–122. Madison, Wisconsin (2005)
8. Aggarwal, C., Yu, S.: An effective and efficient algorithm for high-dimensional outlier detection. J. Int. J. Very Large Data Bases 14, 211–221 (2005)
9. Rawat, S., Pujari, A.K., Gulati, V.P.: On the use of singular value decomposition for a fast intrusion detection system. J. Electronic Notes Theor. Comput. Sci. 142, 215–228 (2006)
10. Kruegel, C., Valeur, F., Vigna, G., Kemmerer, R.: Stateful intrusion detection for high-speed networks. In: Proceedings of the IEEE Symposium on Security and Privacy, pp. 285–294. California (2002)
11. Lai, H.G., Cai, S.W., Huang, H., Xie, J.Y., Li, H.: A parallel intrusion detection system for high-speed networks. In: Proceedings of Applied Cryptography and Network Security: Second International Conference, ACNS 2004, pp. 439–451. Yellow Mountain (2004)
12. Jiang, W.B., Song, H., Dai, Y.Q.: Real-time intrusion detection for high-speed networks. J. Comput. Secur. 24, 287–294 (2005)
13. Xinidis, K., Charitakis, I., Antonatos, S., Anagnostakis, K.G., Markatos, E.P.: An active splitter architecture for intrusion detection and prevention. J. IEEE Trans. Dependable Secure Comput. 3, 31–44 (2006)
14. Schaelicke, L., Wheeler, K., Freeland, C.: SPANIDS: a scalable network intrusion detection loadbalancer. In: Proceedings of the 2nd Conference on Computing Frontiers, pp. 315–322. Ischia (2005)
15. Szalay, A., Gray, J.: The world-wide telescope. Science 293, 2037–2040 (2001)
16. Martone, M.E., Gupta, A., Ellisman, M.H.: E-neuroscience: challenges and triumphs in integrating distributed data from molecules to brains. Nature Neurosci. 7, 467–472 (2004)
17. Wroe, C., Goble, C., Greenwood, M., Lord, P., Miles, S., Papay, J., Payne, T., Moreau, L.: Automating experiments using semantic data on a bioinformatics grid. IEEE Intell. Syst. 19, 48–55 (2004)
18. Wang, Y.X., Behera, S.R., Wong, J., Helmer, G., Honavar, V., Miller, L., Lutz, R., Slagell, M.: Towards the automatic generation of mobile agents for distributed intrusion detection system. J. Syst. Softw. 79, 1–14 (2006)
19. Bala, J., Weng, Y., Williams, A., Gogia, B.K., Lesser, H.K.: Applications of distributed mining techniques for knowledge discovery in dispersed sensory data. In: Proceedings of the 7th Joint Conference on Information Sciences, pp. 1–4. Cary (2003)
20. Kourai, K., Chiba, S.: HyperSpector: virtual distributed monitoring environments for secure intrusion detection. In: Proceedings of the 1st ACM/USENIX International Conference on Virtual Execution Environments, pp. 197–207. Chicago (2005)
21. Folino, G., Pizzuti, C., Spezzano, G.: GP ensemble for distributed intrusion detection systems. In: Proceedings of the 3rd International Conference on Advances in Pattern Recognition, pp. 54–62. Bath, UK (2005)
22. Geman, S., Bienenstock, E., Doursat, R.: Neural networks and the bias/variance dilemma. Neural Comput. 4, 1–58 (1992)
23. Kuo, R.J., An, Y.L., Wang, H.S., Chung, W.J.: Integration of self-organizing feature maps neural network and genetic K-means algorithm for market segmentation. J. Expert Syst. Appl. 30, 313–324 (2006)
24. Carpenter, G.A., Milenova, B.L., Noeske, B.W.: Distributed ARTMAP: a neural network for fast distributed supervised learning. J. Neural Networks 11, 793–813 (1998)
25. Nair, T.M., Zheng, C.L., Fink, J.L., Stuart, R.O., Gribskov, M.: Rival penalized competitive learning (RPCL): a topology-determining algorithm for analyzing gene expression data. J. Comput. Biol. Chem. 27, 565–574 (2003)