One Shields Avenue, Davis, CA 95616
{yhliao, rvemuri}@ucdavis.edu
Forrest’s group at the University of New Mexico introduced the idea of using short sequences of system calls issued by running programs as the discriminator for intrusion detection [5]. The Linux program strace was used to capture system calls, and normal behavior was defined in terms of these short sequences.

Ye et al. attempted to compare the intrusion detection performance of methods that used system call frequencies and those that used the ordering of system calls [16]. The names of system calls were extracted from the audit data of both normal and intrusive runs, and labeled as normal and intrusive respectively. It is our impression that they did not separate the system calls based on individual processes.

In text categorization, a collection of documents is represented by a word-by-document matrix A, where each entry represents the occurrence of a word in a document, i.e., A = [a_ij], where a_ij is the weight of word i in document j. There are several ways of determining the weight a_ij. Let f_ij be the frequency of word i in document j, N the number of documents in the collection, M the number of distinct words in the collection, and n_i the total number of times word i occurs in the whole collection. The simplest approach is Boolean weighting, which sets the weight a_ij to 1 if the word occurs in the document and 0 otherwise. Another simple approach uses the frequency of the word in the document, i.e., a_ij = f_ij. A more common weighting approach is the so-called tf·idf (term frequency - inverse document frequency) weighting:

    a_ij = f_ij * log(N / n_i)    (1)

A slight variation [17] of the tf·idf weighting, which takes into account that documents may be of different lengths, is the following:

    a_ij = f_ij * log(N / n_i) / sqrt( sum_{k=1..M} [ f_kj * log(N / n_k) ]^2 )    (2)

The kNN classifier is based on the assumption that the classification of an instance is most similar to the classification of other instances that are nearby in the vector space. Compared to other text categorization methods such as the Bayesian classifier, kNN does not rely on prior probabilities, and it is computationally efficient. The main computation is the sorting of training documents in order to find the k nearest neighbors for the test document.

We seek to draw an analogy between a text document and the sequence of all system calls issued by a process, i.e., a program execution. The occurrences of system calls can be used to characterize program behavior and transform each process into a vector. Furthermore, it is assumed that processes belonging to the same class will cluster together in the vector space. Then it is straightforward to adapt text categorization techniques to modeling program behavior. Table 1 illustrates the similarity in some respects between text categorization and intrusion detection when applying the kNN classifier.

Table 1: Analogy between text categorization and intrusion detection when applying the kNN classifier.

    Symbol | Text categorization                 | Intrusion detection
    M      | total number of distinct words      | total number of distinct system calls
    n_i    | number of times ith word occurs     | number of times ith system call was issued
    f_ij   | frequency of ith word in document j | frequency of ith system call in process j
    D_j    | jth training document               | jth training process
    X      | test document                       | test process
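The weighting schemes above (Boolean, frequency, tf·idf, and the length-normalized variant of Equation (2)) can be sketched as follows. This is our own illustrative code, not from the paper; the function names and dictionary layout are assumptions.

```python
import math

def tfidf_weights(freq, N, n):
    """Equation (1): a_ij = f_ij * log(N / n_i).

    freq: dict mapping word i -> frequency f_ij in one document j
    N:    number of documents in the collection
    n:    dict mapping word i -> total occurrences n_i in the collection
    """
    return {i: f * math.log(N / n[i]) for i, f in freq.items()}

def tfidf_normalized(freq, N, n, vocab):
    """Length-normalized variant in the spirit of Equation (2): divide each
    tf-idf weight by the Euclidean norm of the document's tf-idf vector so
    that documents of different lengths become comparable."""
    raw = {i: freq.get(i, 0) * math.log(N / n[i]) for i in vocab}
    norm = math.sqrt(sum(v * v for v in raw.values()))
    return {i: (v / norm if norm else 0.0) for i, v in raw.items()}
```

With the normalized variant, every document vector has unit length, so the cosine similarity used later reduces to a plain dot product.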
There are some advantages to applying text categorization methods to intrusion detection. First and foremost, the size of the system-call vocabulary is very limited. There are fewer than 100 distinct system calls in the DARPA BSM data, while a typical text categorization problem could have over 15,000 unique words [9]. Thus the dimension of the word-by-document matrix A is significantly reduced, and it is not necessary to apply any dimensionality reduction techniques. Second, we can consider intrusion detection as a binary categorization problem, which makes adapting text categorization methods very straightforward.

4 Experiments

4.1 Data Set

We applied the k-Nearest Neighbor classifier to the 1998 DARPA data. The 1998 DARPA Intrusion Detection System Evaluation program provides a large sample of computer attacks embedded in normal background traffic [18]. The TCPDUMP and BSM audit data were collected on a network that simulated the traffic of an Air Force local area network. The audit logs contain seven weeks of training data and two weeks of testing data. There were 38 types of network-based attacks and several realistic intrusion scenarios conducted in the midst of normal background data.

We used the Basic Security Module (BSM) audit data collected from a victim Solaris machine inside the simulation network. The BSM audit logs contain information on system calls produced by programs running on the Solaris machine. See [19] for a detailed description of BSM events. We only recorded the names of system calls. Other attributes of BSM events, such as arguments to the system call, object path and attribute, return value, etc., were not used here, although they could be valuable for other methods.

The DARPA data was labeled with session numbers. Each session corresponds to a TCP/IP connection between two computers. Individual sessions can be programmatically extracted from the BSM audit data. Each session consists of one or more processes. A complete ordered list of system calls is generated for every process. A sample system call list is shown below. The first system call issued by Process 994 was close, execve was the next, then open, mmap, open and so on. The process ended with the system call exit.

Process ID: 994
close   execve  open    mmap    open
mmap    mmap    munmap  mmap    mmap
close   open    mmap    close   open
mmap    mmap    munmap  mmap    close
close   munmap  open    ioctl   access
chown   ioctl   access  chmod   close
close   close   close   close   exit

The numbers of occurrences of individual system calls during the execution of a process were counted. Then text weighting techniques were ready to transform the process into a vector. We used Equation (2) to encode the processes.

During our off-line data analysis, our data set included system calls executed by all processes except those of the Solaris operating system itself, such as inetd and the shells, which usually spanned several audit log files.
Table 2: List of 50 distinct system calls that appear in the training data set.

4.2 Anomaly Detection

First we implemented intrusion detection solely based on normal program behavior. A training set that covers normal behavior as completely as possible is desirable; on the other hand, a large training data set means large overhead in using a learning algorithm to model program behavior. There are 5 simulation days that were free of attacks during the seven-week training period. We arbitrarily picked 4 of them for training, and used the fifth one for testing. Our training normal data set consists of 606 distinct processes running on the victim Solaris machine during these 4 simulation days. There are 50 distinct system calls observed in the training data set, which means each process is transformed into a vector of size 50. Table 2 lists all 50 system calls.

    build the training normal data set A;
    for each process P in the test data do
        if P has an unknown system call then
            P is abnormal; exit;
        for each training process D_j in A do
            calculate sim(P, D_j);
            if sim(P, D_j) equals 1.0 then
                P is normal; exit;
        find the k biggest scores of sim(P, D_j);
        calculate sim_avg over the k nearest neighbors;
        if sim_avg is greater than the threshold then
            P is normal;
        else
            P is abnormal;

Figure 1: Pseudo code for the kNN classifier algorithm for anomaly detection.
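A runnable sketch of the Figure 1 procedure is given below. The paper provides only pseudocode; the function names, the vector representation, and the exact-match tolerance are our own assumptions.

```python
import math

def cosine(x, y):
    # Cosine similarity between two frequency vectors.
    dot = sum(a * b for a, b in zip(x, y))
    nx = math.sqrt(sum(a * a for a in x))
    ny = math.sqrt(sum(b * b for b in y))
    return dot / (nx * ny) if nx and ny else 0.0

def knn_is_normal(test_vec, train_vecs, k, threshold):
    """Figure 1, after the unknown-system-call check: a perfect match
    against any normal training process means normal; otherwise average
    the k highest similarities and compare against the threshold."""
    sims = [cosine(test_vec, d) for d in train_vecs]
    if any(abs(s - 1.0) < 1e-12 for s in sims):
        return True
    top_k = sorted(sims, reverse=True)[:k]
    return sum(top_k) / len(top_k) > threshold
```

The unknown-system-call test of Figure 1 is assumed to happen before vectorization, when the raw trace is compared against the training vocabulary.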
Once we have the training data set for normal behavior, the kNN text categorization method can be easily adapted for anomaly detection. We scan the test audit data and extract the system call sequence for each new process. The new process is also transformed into a vector with the same weighting method. Then the similarity between the new process X and each process D_j in the training normal process data set is calculated using the cosine similarity, Equation (3):

    sim(X, D_j) = (X · D_j) / (||X|| * ||D_j||)    (3)

If the similarity score of one training normal process is equal to 1, which means the system call frequencies of the new process and the training process match perfectly, then the new process is classified as normal immediately. Otherwise, the similarity scores are sorted and the k nearest neighbors are chosen to determine whether the new program execution is normal or not. We calculate the average similarity value of the k nearest neighbors (those with the highest similarity scores) and set a threshold. Only when the average similarity value is above the threshold is the new process considered normal. The pseudo code for the adapted kNN algorithm is presented in Figure 1.

In intrusion detection, the Receiver Operating Characteristic (ROC) curve is usually used to measure the performance of the method. The ROC curve is a plot of intrusion detection accuracy against the false positive probability. It can be obtained by varying the detection threshold. We formed a test data set to evaluate the performance of the kNN classifier algorithm. The BSM data of the third day of the seventh training week was chosen as part of the test data set (none of the training processes was from this day). There was no attack launched on this day. It contains 412 sessions and 5285 normal processes. The rest of the test data set consists of 55 intrusive sessions chosen from the seven-week DARPA training data. There are 35 clear or stealthy attack instances included in these intrusive sessions (some attacks involve multiple sessions), representing all types of attacks and intrusion scenarios in the seven-week training data. Stealthy attacks attempt to hide the perpetrator's actions from someone who is monitoring the system, or from the intrusion detection system. Some duplicate attack sessions of the types eject and warezclient were skipped and not included in the test data set. When a process is categorized as abnormal, the session that the process is associated with is classified as an attack session. The intrusion detection accuracy is calculated as the rate of detected attacks. Each attack counts as one detection, even with multiple sessions.
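The threshold sweep that produces the ROC points can be sketched as follows, assuming each process has already been scored with its average kNN similarity; the function name and score layout are our own.

```python
def roc_points(attack_scores, normal_scores, thresholds):
    """For each candidate threshold, a process whose average kNN
    similarity falls below the threshold is flagged abnormal.
    Detection rate is the fraction of attack processes flagged;
    false positive rate is the fraction of normal processes flagged."""
    points = []
    for t in thresholds:
        det = sum(s < t for s in attack_scores) / len(attack_scores)
        fpr = sum(s < t for s in normal_scores) / len(normal_scores)
        points.append((fpr, det))
    return points
```

Sweeping the threshold from low to high trades false positives for detections, tracing out the curves of Figure 2.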
Figure 2: Performance of the kNN classifier method expressed as ROC curves: false positive rate (%) vs. attack detection rate for k=5, 10, and 25. For k=10, the corresponding threshold values are annotated in parentheses along the curve.
Unlike the groups who participated in the 1998 DARPA Intrusion Detection Evaluation program [20], we define our false positive probability as the rate of mis-classified processes, instead of mis-classified sessions.

The performance of the kNN classifier algorithm also depends on the value of k, the number of nearest neighbors of the test process. Usually the optimal value of k is determined empirically. We varied k from 5 to 25. Figure 2 shows the ROC curves for different k values. For this particular data set, k=10 is a better choice than the other values in that the attack detection rate reaches 100% faster. For k=10, the kNN classifier algorithm can detect 10 of the 35 attacks with a zero false positive rate. The detection rate reaches 100% rapidly when the threshold is raised to 0.72, and the false positive rate remains as low as 0.44% (23 false alarms out of 5285 normal processes) for the whole simulation day.

The RSTCORP group gave good performance during the evaluation of the 1998 DARPA BSM data [20]. By learning normal sequences of system calls for more than 150 programs, their Elman neural networks [8] were able to detect 77.3% of all intrusions with no false positives, and 100% of all attacks with about 10% mis-classified normal sessions, which means 40 to 50 false positive alarms for a typical simulation day with 500 sessions. Their test data consisted of 139 normal sessions and 22 intrusive sessions. Since different test data sets were used, it is difficult to compare the performance of our kNN classifier with that of the Elman networks. Although the kNN classifier has a lower attack detection rate at a zero false positive rate, its attack detection rate reaches 100% quickly, and hence a low false alarm frequency can be achieved.

4.3 Anomaly Detection Combined with Signature Verification

We have just shown that the kNN classifier algorithm can be implemented for effective abnormality detection. The overall running time of the kNN method is O(N), where N is the number of processes in the training data set (usually k is a small constant). When N is large, this method could still be computationally expensive for some real-time intrusion detection systems. In order to detect attacks more effectively, the kNN anomaly detection can be easily integrated with signature verification. Malicious program behavior can be encoded into the training set of the classifier. After carefully studying the 35 attack instances within the seven-week DARPA training data, we generated a data set of 19 intrusive processes. This intrusion data set covers most attack types of the DARPA training data. It includes the most clearly malicious processes, including ejectexploit, formatexploit, ffbexploit and so on.

For the improved kNN algorithm, the training data set includes the 606 normal processes as well as the 19 aforementioned intrusive processes. The 606 normal processes are the same as the ones in subsection 4.2. Each new test process is compared to the intrusive processes first.
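The combined scheme, signature verification first with kNN anomaly detection as the fallback, can be sketched as below. This is our own illustrative code with hypothetical helper names, not the paper's implementation.

```python
import math

def cosine(x, y):
    # Cosine similarity between two frequency vectors.
    dot = sum(a * b for a, b in zip(x, y))
    nx = math.sqrt(sum(a * a for a in x))
    ny = math.sqrt(sum(b * b for b in y))
    return dot / (nx * ny) if nx and ny else 0.0

def classify(test_vec, intrusive_vecs, normal_vecs, k, threshold):
    """Signature verification first: a perfect cosine match against a
    known intrusive process labels the test process intrusive outright.
    Otherwise fall back to the Figure 1 kNN anomaly procedure against
    the normal training processes."""
    for d in intrusive_vecs:
        if abs(cosine(test_vec, d) - 1.0) < 1e-12:
            return "intrusive"
    sims = sorted((cosine(test_vec, d) for d in normal_vecs), reverse=True)
    if sims and abs(sims[0] - 1.0) < 1e-12:
        return "normal"
    top_k = sims[:k]
    return "normal" if sum(top_k) / len(top_k) > threshold else "abnormal"
```

Because the intrusive set is small (19 processes), the extra signature pass adds only a handful of similarity computations per test process.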
Whenever there is a perfect match, i.e., the cosine similarity is equal to 1.0, the new process is labeled as intrusive behavior (one could also check for near matches). Otherwise, the abnormal detection procedure in Figure 1 is performed. Due to the small number of intrusive processes in the training data set, this modification of the algorithm causes only minor additional calculation for normal testing processes.

The performance of the modified kNN classifier algorithm was evaluated with 24 attacks within the two-week DARPA testing audit data. The DARPA testing data contains some known attacks as well as novel ones. Some duplicate instances of the eject attack were not included in the test data set. The false positive rate was evaluated with the same 5285 testing normal processes as described in Section 4.2. Table 3 presents the attack detection accuracy for k=10 and a threshold of 0.8. The false positive rate is 0.59% (31 false alarms) when the threshold is adjusted to 0.8.

Table 3: Attack detection rate for DARPA testing data (k=10 and threshold=0.8) when anomaly detection is combined with signature verification.

    Attack         | Instances | Detected | Detection rate
    Known attacks  | 16        | 16       | 100%
    Novel attacks  | 8         | 6        | 75%
    Total          | 24        | 22       | 91.7%

The two missed attack instances were a new denial of service attack, called process table. They matched one of the training normal processes exactly, which made it impossible for the kNN algorithm to detect them. The process table attack was implemented by establishing connections to the telnet port of the victim machine every 4 seconds and exhausting its process table so that no new process could be launched [21]. Since this attack consists of abuse of a perfectly legal action, it did not show any abnormality when we analyzed individual processes. Characterized by an unusually large number of connections active on a particular port, this denial of service attack, however, could be easily identified by other intrusion detection methods.

Among the other 22 detected attacks, eight were captured with signature verification. These eight attacks could have been identified without signature verification as well. With signature verification, however, we did not have to compare them with each of the normal processes in the training data set.

5 Summary

In this paper we have proposed a new algorithm based on the k-Nearest Neighbor classifier method for modeling program behavior in intrusion detection. Our preliminary experiments with the 1998 DARPA BSM audit data have shown that this approach is able to effectively detect intrusive program behavior. Compared to other methods using short system call sequences, the kNN classifier does not have to learn individual program profiles separately, thus the calculation involved in classifying new program behavior is largely reduced. Our results also show that a low false positive rate can be achieved. While this result may not hold against a more sophisticated data set, the k-Nearest Neighbor classifier appears to be well applicable to the domain of intrusion detection.

The tf·idf text categorization weighting technique was adopted to transform each process into a vector. With the frequency-weighting method, where each entry is equal to the number of occurrences of a system call during the process execution, each process vector does not carry any information about other processes. A new training process can be easily added to the training data set without changing the weights of the existing training samples. This could make the kNN classifier method more suitable for dynamic environments that require frequent updates of the training data.

In our current implementation, we used all the system calls to represent program behavior. The dimension of the process vectors, and hence the classification cost, can be further reduced by using only the most relevant system calls.

6 Discussion

In spite of the encouraging initial results, there are several issues that require deeper analysis.

Our approach is predicated on the following properties: the frequencies of system calls issued by a program appear consistently across its normal executions, and unseen system calls or unusual frequencies of the invoked system calls will appear when the program is exploited. We believe these properties hold true for many programs. However, if an intrusion does not reveal any anomaly in the frequencies of system calls, our method would miss it. For example, attacks that consist of abuse of perfectly normal processes, such as process table, would not be identified by the kNN classifier.
With the kNN classifier method, each process is classified when it terminates. We argue that it could still be suitable for real-time intrusion detection. Each intrusive attack is usually conducted within one or more sessions, and every session contains several processes. Since the kNN classifier method monitors the execution of each process, it is highly likely that an attack can be detected while it is in operation. However, it is possible that an attacker can avoid being detected by not letting the process exit. Therefore, there is a need for effective classification during a process's execution, which is a significant issue for our future work.

7 Acknowledgment

The authors wish to thank Dr. Marc Zissman of Lincoln Laboratory at MIT for providing us the DARPA training and testing data. We also thank the reviewers for their valuable comments. Special thanks to Dr. Vern Paxson for his insightful comments that helped us to improve the quality and readability of the final version. This work is supported in part by the AFOSR grant F49620-01-1-0327 to the Center for Digital Security of the University of California, Davis.

References

[1] H.S. Javitz and A. Valdes, The NIDES Statistical Component: Description and Justification, Technical Report, Computer Science Laboratory, SRI International, Menlo Park, CA, March 1994.

[2] H.S. Vaccaro and G.E. Liepins, “Detection of Anomalous Computer Session Activity”, Proceedings of 1989 IEEE Symposium on Security and Privacy, 280-289, 1989.

[3] E. Lundin and E. Johnsson, “Anomaly-based intrusion detection: privacy concern and other problems”, Computer Networks, vol. 34, 623-640, 2000.

[4] V. Dao and V. R. Vemuri, “Computer Network Intrusion Detection: A Comparison of Neural Networks Methods”, Differential Equations and Dynamical Systems (Special Issue on Neural Networks, Part-2), vol. 10, no. 1&2, 2002.

[5] S. Forrest, S. A. Hofmeyr, A. Somayaji, and T. A. Longstaff, “A Sense of Self for Unix Processes”, Proceedings of 1996 IEEE Symposium on Security and Privacy, 120-128, 1996.

[6] C. Warrender, S. Forrest and B. Pearlmutter, “Detecting Intrusions Using System Calls: Alternative Data Models”, Proceedings of 1999 IEEE Symposium on Security and Privacy, 133-145, 1999.

[7] W. Lee, S. J. Stolfo and P. K. Chan, “Learning Patterns from Unix Process Execution Traces for Intrusion Detection”, Proceedings of AAAI97 Workshop on AI Methods in Fraud and Risk Management, 50-56, 1997.

[8] A. K. Ghosh, A. Schwartzbard and A. M. Shatz, “Learning Program Behavior Profiles for Intrusion Detection”, Proceedings of 1st USENIX Workshop on Intrusion Detection and Network Monitoring, Santa Clara, CA, April 1999.

[9] K. Aas and L. Eikvil, Text Categorisation: A Survey, http://citeseer.nj.nec.com/aas99text.html, 1999.

[10] Y. Yang, An Evaluation of Statistical Approaches to Text Categorization, Technical Report CMU-CS-97-127, Computer Science Department, Carnegie Mellon University, 1997.

[11] Y. Yang, “Expert Network: Effective and Efficient Learning from Human Decisions in Text Categorization and Retrieval”, Proceedings of 17th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR’94), 13-22, 1994.

[12] C. Ko, G. Fink and K. Levitt, “Automated Detection of Vulnerabilities in Privileged Programs by Execution Monitoring”, Proceedings of 10th Annual Computer Security Applications Conference, Orlando, FL, 134-144, December 1994.

[13] P. Uppuluri and R. Sekar, “Experiences with Specification-Based Intrusion Detection”, Recent Advances in Intrusion Detection (RAID 2001), LNCS 2212, Springer, 172-189, 2001.

[14] D. Wagner and D. Dean, “Intrusion Detection via Static Analysis”, Proceedings of IEEE Symposium on Research in Security and Privacy, Oakland, CA, 2001.

[15] M. Asaka, T. Onabuta, T. Inoue, S. Okazawa and S. Goto, “A New Intrusion Detection Method Based on Discriminant Analysis”, IEICE Transactions on Information and Systems, Vol. E84-D, No. 5, 570-577, 2001.
[16] N. Ye, X. Li, Q. Chen, S. M. Emran and M. Xu, “Probabilistic Techniques for Intrusion Detection Based on Computer Audit Data”, IEEE Transactions on Systems, Man, and Cybernetics, Part A, Vol. 31, No. 4, 266-274, 2001.
[17] J. T.-Y. Kwok, “Automatic Text Categorization
Using Support Vector Machine”, Proceedings of
International Conference on Neural Information
Processing, 347-351, 1998.
[18] MIT Lincoln Laboratory,
http://www.ll.mit.edu/IST/ideval/.