by
Bradley Schatz
October, 2007
Keywords
Digital evidence, computer based electronic evidence, digital forensics,
computer forensics, forensic computing, evidence provenance, evidence representation,
knowledge representation.
Abstract
The field of digital forensics is concerned with finding and presenting evidence
sourced from digital devices, such as computers and mobile phones. The complexity of
such digital evidence is constantly increasing, as is the volume of data which might
contain evidence. Current approaches to interpreting and assuring digital evidence rely
implicitly on the use of tools and representations made by experts in addressing the
concerns of juries and courts. Current forensic tools are best characterised as difficult to verify, poorly interoperable, and burdensome on human processes.
The tool-centric focus of current digital forensics practice impedes access to and transparency of the information represented within digital evidence as much as it assists, because of the tight binding between a particular tool and the information that it conveys. We hypothesise that a general and formal representational approach will
benefit digital forensics by enabling higher degrees of machine interpretation,
facilitating improvements in tool interoperability and validation. Additionally, such an
approach will increase human readability.
This dissertation summarises research which examines at a fundamental level
the nature of digital evidence and digital investigation, in order that improved
techniques which address investigation efficiency and assurance of evidence might be
identified. The work follows three themes related to this: representation, analysis
techniques, and information assurance.
The first set of results describes the application of a general purpose
representational formalism towards representing diverse information implicit in event
based evidence, as well as domain knowledge, and investigator hypotheses. This
representational approach is used as the foundation of a novel analysis technique which
uses a knowledge based approach to correlate related events into higher level events,
which correspond to situations of forensic interest.
The second set of results explores how digital forensic acquisition tools scale
and interoperate, while assuring evidence quality. An improved architecture is
proposed for storing digital evidence, analysis results and investigation documentation
in a manner that supports arbitrary composition into a larger corpus of evidence.
The final set of results focuses on assuring the reliability of evidence. In particular, these results focus on assuring that timestamps, which are pervasive in digital evidence, can be reliably interpreted relative to real world time. Empirical results are
presented which demonstrate how simple assumptions cannot be made about computer
clock behaviour. A novel analysis technique for inferring the temporal behaviour of a
computer clock is proposed and evaluated.
Table of Contents
2.4.2 Effective forensics tools and techniques ______________________ 33
2.4.3 Meeting the standard for scientific evidence ___________________ 34
2.5 Conclusions_________________________________________________ 36
Chapter 3. Related work ___________________________________________ 37
3.1 Event correlation for forensics __________________________________ 38
3.1.1 Approaches to modeling events _____________________________ 39
3.1.2 Event patterns and event pattern languages ____________________ 40
3.1.3 Observations____________________________________________ 42
3.2 Current approaches to evidence representation and format ____________ 42
3.2.1 Digital evidence container formats___________________________ 42
3.2.2 Representation of digital investigation documentation ___________ 46
3.2.3 Observations____________________________________________ 48
3.3 Reliable interpretation of time __________________________________ 48
3.3.1 An introduction to computer timekeeping _____________________ 48
3.3.2 Reliable time synchronization ______________________________ 49
3.3.3 Factors affecting timekeeping accuracy _______________________ 49
3.3.4 Usage of timestamps in forensics ____________________________ 50
3.3.5 Observations____________________________________________ 51
3.4 Conclusion _________________________________________________ 51
Chapter 4. Digital evidence representation: addressing the complexity and
volume problems of digital forensics ____________________________________ 53
4.1 Introduction_________________________________________________ 54
4.2 Background on knowledge representation _________________________ 56
4.2.1 Historical foundations ____________________________________ 56
4.2.2 Defining knowledge representation __________________________ 58
4.2.3 Hybrid approaches _______________________________________ 61
4.3 Semantic markup languages ____________________________________ 62
4.3.1 A basic Introduction to the RDF data model___________________ 64
4.3.2 RDF serialisation ________________________________________ 67
4.3.3 Adding semantics to published RDF data _____________________ 69
4.4 KR in digital forensics and IT security ____________________________ 72
4.5 A formal KR approach to investigation documentation and digital evidence
___________________________________________________________ 74
4.6 Conclusion _________________________________________________ 76
Chapter 5. Event representation in forensic event correlation ____________ 79
5.1 Introduction: Event correlation in digital forensics __________________ 80
5.2 Ontologies, KR and a new approach______________________________ 81
5.2.1 Knowledge representation framework ________________________ 82
5.2.2 Application architecture ___________________________________ 82
5.3 Implementation ______________________________________________ 83
5.3.1 The design of the event representation________________________ 83
5.3.2 Log parsers_____________________________________________ 85
5.3.3 A heuristic correlation language – FR3 _______________________ 86
5.4 Case study 1: Intrusion forensics ________________________________ 89
5.4.1 Investigation using FORE _________________________________ 90
5.4.2 Experimental results______________________________________ 95
5.5 Case study 2: Extending the approach to new domains _______________ 96
5.5.1 Integration of standard ontologies ___________________________ 97
5.5.2 Integrating new domains __________________________________ 98
5.5.3 Experimental results_____________________________________ 100
5.6 Conclusion ________________________________________________ 102
Chapter 6. Sealed digital evidence bags _____________________________ 107
6.1 Introduction________________________________________________ 108
6.2 Definitions ________________________________________________ 109
6.3 An extensible information architecture for digital evidence bags ______ 110
6.3.1 Storage container architecture _____________________________ 111
6.3.2 Information architecture__________________________________ 113
6.3.3 Integrity ______________________________________________ 117
6.3.4 Evidence assurance _____________________________________ 118
6.3.5 Clarifications __________________________________________ 118
6.4 Usage scenario: imaging and annotation _________________________ 118
6.5 Experimental results _________________________________________ 120
6.6 Conclusion and future work ___________________________________ 121
Chapter 7. Temporal provenance & uncertainty ______________________ 125
7.1 Introduction________________________________________________ 126
7.2 Characterising the behaviour of drifting clocks ____________________ 127
7.2.1 Experimental setup______________________________________ 127
7.2.2 Analysis and discussion of results __________________________ 128
7.3 Identifying computer timescales by correlation with corroborating sources _
__________________________________________________________ 133
7.3.1 Experimental setup ______________________________________ 134
7.3.2 Challenges in correlating browser and squid logs ______________ 135
7.3.3 Analysis methodology ___________________________________ 136
7.3.4 Clickstream correlation algorithm __________________________ 137
7.3.5 Results _______________________________________________ 139
7.3.6 Non-cached records correlation algorithm ____________________ 140
7.3.7 Results _______________________________________________ 141
7.4 Discussion _________________________________________________ 142
7.4.1 Relation to existing work _________________________________ 143
7.5 Conclusions________________________________________________ 144
Chapter 8. Conclusions and future work ____________________________ 147
8.1 Summary of contributions and achievements ______________________ 148
8.2 Discussion of main themes and conclusions _______________________ 149
8.2.1 Addressing complexity and volume of digital evidence _________ 149
8.2.2 Assurance of fundamental temporal information _______________ 150
8.3 Implications of Work ________________________________________ 150
8.4 Opportunities for further work _________________________________ 151
8.4.1 Document oriented evidence ______________________________ 151
8.4.2 Ontologies in digital forensics _____________________________ 152
8.4.3 Temporal assumptions underlying event correlation ____________ 153
8.4.4 Characterising temporal behavior of computers________________ 154
8.4.5 Event pattern languages __________________________________ 155
Chapter 9. Bibliography __________________________________________ 157
List of Tables
Table 1: Challenges in digital forensics - DFRWS 2006 keynote _______________ 31
Table 2: RDF/XML Serialisation of two triples_____________________________ 68
Table 3: RDF/XML serialisation using XML Namespace abbreviation __________ 68
Table 4: Alternative but semantically equivalent RDF syntax tailored to type definition
_____________________________________________________________ 68
Table 5: RDF/XML serialisation of statement “A Person named Kevin Bacon and a
Person named Sarah Jessica Parker starred in the Movie 'Footloose'." ______ 69
Table 6: N3 serialisation of statement from Table 5 _________________________ 69
Table 7: A simple Movie related ontology_________________________________ 72
Table 8: Web Session / Causality Correlation Rule __________________________ 89
Table 9: OSExploit Heuristic Rule_______________________________________ 91
Table 10: SAP Related Events __________________________________________ 99
Table 11: Identity Masquerade Rule ____________________________________ 100
Table 12: Door Entry- Login rule ______________________________________ 100
Table 13: The file content of a browser log SDEB _________________________ 113
Table 14: XML/RDF content of Investigation Documentation File named
jbloggs.cache.index.dat.rdf ______________________________________ 115
Table 15: Digital Evidence Bag instance data stored in the Tag File ___________ 117
Table 16: Evidence Content message digest property _______________________ 117
Table 17: Investigation Documentation Container Metadata stored in the Tag File. 118
Table 18: Annotated information from composing SDEB____________________ 121
List of Figures
Figure 1: Corresponding phases of linear process models of digital forensic
investigation___________________________________________________ 21
Figure 2: Event based digital investigation framework _______________________ 22
Figure 3: Digital crime scene specific investigation phases____________________ 23
Figure 4: Carrier's digital forensics tool abstraction layer model _______________ 28
Figure 5: Turner's digital evidence bag ___________________________________ 44
Figure 6: Trivial set of physical, digital and document evidence________________ 55
Figure 7: Current Semantic Web standards ________________________________ 64
Figure 8: Basic RDF node-arc-node triple _________________________________ 65
Figure 9: RDF statement "A person named Kevin Bacon starred in a movie named
'Footloose'"____________________________________________________ 65
Figure 10: Unambiguous meaning is given to concepts and instances through naming
with URI’s ____________________________________________________ 66
Figure 11: RDF Graph representing statement “A Person named Kevin Bacon and a
Person named Sarah Jessica Parker starred in the Movie ‘Footloose’.” _____ 67
Figure 12: The FORE Architecture ______________________________________ 83
Figure 13: Instance and Class/Subclass relationships between events____________ 85
Figure 14: Causal ancestry graph of exploit________________________________ 92
Figure 15: Related events remain unconnected because of surrogate proliferation __ 93
Figure 16: Correlated event graphs after proliferate surrogates merged __________ 94
Figure 17: Causal ancestry graph of identity masquerading scenario ___________ 101
Figure 18: Referencing nested and external digital evidence bags _____________ 111
Figure 19: Proposed sealed digital evidence bag structure ___________________ 112
Figure 20: RDF Graph relating original data object and image ________________ 116
Figure 21: RDF graph resulting from addition of new documentation to embedded DEB
____________________________________________________________ 120
Figure 22: Experimental setup for logging temporal behaviour of windows PC's in
small business network _________________________________________ 128
Figure 23: Clock skew of Domain Controller "Rome" offset from civil time. ____ 129
Figure 24: Clock skew of workstation "Florence" offset from civil time. ________ 130
Figure 25: Clock skew of workstation "Milan" offset from civil time (zoomed). __ 131
Figure 26: Clock skew of workstation “Trieste” offset from civil time. _________ 132
Figure 27: Clock skew of "Rome" vs. "Milan" offset from civil time (zoomed). __ 132
Figure 28: Experimental setup for correlation _____________________________ 134
Figure 29: Matching is complicated by only the most recent record present in the
history. ______________________________________________________ 136
Figure 30: Correlated skew (clickstream) vs. experimental skew (timeline) for host
“Milan” do not correlate because of presence of false positives.__________ 137
Figure 31: Correlated skew vs. experimental skew for host “Milan” correlates when
false positives are removed. ______________________________________ 138
Figure 32: "Pompeii" cache correlation. __________________________________ 139
Figure 33: History Correlation vs. Timescale. _____________________________ 141
Figure 34: Incomplete information______________________________________ 143
List of Abbreviations
AAFS: American Academy of Forensic Sciences
ACPO: (UK) Association of Chief Police Officers
AFF: Advanced Forensics Format
API: Application Programming Interface
APIC: Advanced Programmable Interrupt Controller
BIOS: Basic Input Output System
CART: Computer Analysis and Response Team
CDESF: Common Digital Evidence Storage Format
CEP: Complex Event Processing
CERN: The European Particle Physics Laboratory (Conseil Européen pour la Recherche Nucléaire)
CERT: Computer Emergency Response Team
CFSAP: Computer Forensics Secure Analyse Present
CFTT: Computer Forensics Tool Testing
DAML: DARPA Agent Markup Language
DARPA: Defense Advanced Research Projects Agency
DCO: Device Configuration Overlay
DC: (MS Windows) Domain Controller
DCS: Digital Crime Scene
DE: Digital Evidence
DEB: Digital Evidence Bag
DEID: Digital Evidence Identifier
DF: Digital forensics
DFRWS: Digital Forensics Research Workshop
DL: Description Logic
DMCA: Digital Millennium Copyright Act
DTD: Document Type Definition
DLG: Directed Labelled Graph
DO: Data Object
ERP: Enterprise Resource Planning
FBI: (US) Federal Bureau of Investigation
FOL: First Order (Predicate) Logic
FSM: Finite State Machine
FTK: Forensic Toolkit
HDD: Hard Disk Drive
HTML: Hypertext Markup Language
HPA: Host Protected Area
IDS: Intrusion Detection System
IE: Internet Explorer
IOCE: International Organisation on Computer Evidence
KIF: Knowledge Interchange Format
KR: Knowledge Representation
LSID: Life Sciences Identifier
MAC: Media Access Control
MD5: Message Digest 5
MRU: Most Recently Used
N3: Notation 3
NIJ: (US) National Institute of Justice
NIST: (US) National Institute of Standards and Technology
NL: Natural Language
NSRL: (US) National Software Reference Library
NTP: Network Time Protocol
OIL: Ontology Inference Layer
OWL: Web Ontology Language
P2P: Peer To Peer
PDA: Personal Digital Assistant
RAID: Redundant Array of Inexpensive Disks
RDF: Resource Description Framework
RDFS: RDF Schema
RTC: Real Time Clock
SDEB: Sealed Digital Evidence Bag
SGML: Standard Generalized Markup Language
SIM: Subscriber Identity Module
SNTP: Simple Network Time Protocol
SUO: Standard Upper Ontology
SUMO: Suggested Upper Merged Ontology
SWGDE: Scientific Working Group on Digital Evidence
SPARQL: Simple Protocol And RDF Query Language
TSK: The Sleuth Kit
UMM: Unified Modelling Methodology
URI: Uniform Resource Identifier
URL: Uniform Resource Locator
URN: Uniform Resource Name
UTC: Coordinated Universal Time
WWW: World Wide Web
W3C: World Wide Web Consortium
XML: Extensible Markup Language
XML-NS: XML Namespace
XSD: XML Schema Definition
Declaration
The work contained in this dissertation has not been previously submitted to
meet requirements for an award at this or any other higher education institution. To the
best of my knowledge and belief, this dissertation contains no material previously
published or written by any other person except where due reference is made.
Previously Published Material
The following papers have been published or presented, and contain material
based on the content of this dissertation.
Schatz, B., Mohay, G. and Clark, A. (2004) 'Rich Event Representation for Computer
Forensics', Proceedings of the 2004 Asia Pacific Industrial Engineering and
Management Systems (APIEMS 2004), Brisbane, Australia.
Schatz, B., Mohay, G. and Clark, A., (2004) ‘Generalising Event Forensics Across
Multiple Domains’ Proceedings of the 2004 Australian Computer Network and
Information Forensics Conference (ACNIFC 2004), Perth, Australia.
(revised version published as)
Schatz, B., Mohay, G. and Clark, A., (2005) ‘Generalising Event Correlation Across
Multiple Domains’, Journal of Information Warfare, vol 4, iss 1, pp. 69-79.
Schatz, B., Clark, A., (2006) ‘An information architecture for digital evidence
integration’ Proceedings of the 2006 Australian Security Response Team Annual
Conference (AUSCERT 2006), Gold Coast, Australia.
Schatz, B., Mohay, G. and Clark, A., (2006) ‘Establishing temporal provenance of
computer event log evidence’ Digital Investigation, 3 (Supplement 1), pp. 89-107.
(also published as)
Schatz, B., Mohay, G. and Clark, A., (2006) ‘Establishing temporal provenance of
computer event log evidence’ Proceedings of the 2006 Digital Forensics Research Workshop
(DFRWS 2006), West Lafayette, USA.
In loving memory of my father, Gregory Schatz.
Acknowledgements
This dissertation, like most, is the product of one author, yet has been shaped
by a cast of supporters, colleagues, friends and family. I would like to express my
sincere appreciation in the following paragraphs.
I would like to thank Adjunct Professor George Mohay, Dr. Andrew Clark and
Associate Professor Peter Best for their supervision. George and Andrew’s guidance
and inspiration have been instrumental in directing the course of this research. Thank
you both for giving me the opportunity to spend this time researching and for freely
sharing of your insight, time, energy and experience.
George deserves special mention for the methodical and focused attention
which he applied to my writing; it has been a true pleasure writing papers together.
Andrew’s patience and willingness to offer an alternative perspective has often helped
clarify otherwise murky waters.
The Information Security Institute (ISI) provided the resources and
environment for me to perform this research, contributions which I highly appreciate.
Without the help of Ed Dawson, Colin Boyd, and Mark Looi I would not have found
the opportunity to research at the ISI. Many thanks to the ISI staff who have helped
along the way. Additional thanks go to SAP Research, who supported some research
related to Chapter 5.
I would like to express my appreciation to Peter Best for providing important
direction related to the event correlation work presented in Chapter 5. Peter Kingsley
(Qld. Police Forensics Unit) provided valuable assistance in provenance related issues
which contributed towards the results in Chapter 6.
Many thanks go to my colleagues who have helped along the way by reading
drafts of papers and discussing ideas. I would especially like to thank Jason Smith
(ISI), Mark Branagan (ISI) and Julienne Vayssiere (SAP Research).
Finally, to my family, thank you all for your encouragement and understanding
during the period of this research. In particular I would like to give heartfelt thanks to
my wife Kelly, who has patiently supported me throughout this period, enduring far too
much seriousness and absence on my part.
Chapter 1. Introduction
All our lauded technological progress – our very civilization – is like
the axe in the hand of the pathological criminal.
(Albert Einstein)
According to Parker [100], the first successfully prosecuted case (in a federal US jurisdiction) involving the criminal use of a computer concluded on January 10, 1967. The defendant, a computer programmer, worked on a reporting system for overdrawn checking accounts for the National City Bank of Minneapolis. The defendant, whose personal checking account was with the same bank and subject to the same processing system, patched the program to hide a growing personal debt. The situation was discovered when a computer failure caused processing to revert to manual methods.
2 Slack space is an emergent artefact of filesystems, related to their block oriented allocation strategies. Where the final chunk of a file only partly fills the last block or cluster allocated to it, the remainder of that block or cluster remains unused; it is this unused area that is referred to as slack space.
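To make the arithmetic concrete, the following is a minimal illustrative sketch (in Python, not drawn from the dissertation) of how the amount of slack space follows from the file size and the cluster size:

    def slack_space(file_size, cluster_size=4096):
        """Return the number of unused bytes in the final cluster allocated to a file."""
        if file_size == 0:
            return 0
        remainder = file_size % cluster_size
        return 0 if remainder == 0 else cluster_size - remainder

    # Example: a 10,000 byte file stored in 4096 byte clusters occupies three
    # clusters (12,288 bytes), leaving 2,288 bytes of slack in the final cluster.
    print(slack_space(10000))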
In 1998, Sommer described the following basic principles for evaluating the acceptability of new types of evidence not previously considered by courts:
• accurate – the evidence should be “free from any reasonable doubt about the
quality of procedures used to collect the material, analyse the material if that is
appropriate and necessary and finally to introduce it into court - and produced
by someone who can explain what has been done. In the case of exhibits which
themselves contain statements - a letter or other document, for example – ‘accuracy’ must also encompass accuracy of content; and that normally requires the document’s originator to make a Witness Statement and be available for cross-examination”
• complete – “tells within its own terms a complete story of (a) particular set of
circumstances or events” [121]
• explainable – “in the case of material derived from sources with which most
people are not familiar quite extensive explanations may be needed”
• whether the technique has been “subjected to peer review and publication”,
• “the known or potential rate of error… and the existence of and maintenance
of standards controlling the technique’s operation”, and
• “general acceptance.”
3 http://www.guidancesoftware.com/
4 http://www.accessdata.com/
1.2 Contributions
New paradigms for interacting with, managing, processing and presenting
digital evidence are needed for achieving these efficiencies and reliability of findings.
Current approaches to digital investigation rely overly on human intervention as the glue which binds together the operation of disparate tools over opaque data. All the while, investigation documentation must be maintained and generated in sufficient detail to provide assurance of the authenticity and provenance of evidence and reproducibility of findings.
The aim of the work described in this dissertation is to investigate, at a fundamental level, the nature of information examined, inferred and reported in digital investigations, and to identify techniques which facilitate the documentation of digital investigations, the analysis of digital evidence, and the reporting of findings, while at the same time assuring the reliability and authenticity of digital evidence. Our research addresses the
complexity problem by supporting the expression of arbitrary information related to the
investigation, and the volume problem by enabling scalable approaches to digital
evidence.
This dissertation summarises a significant body of research performed
following three intertwined themes: representation, analysis techniques, and
information assurance. The original contributions contained within this dissertation are:
5 Defined in Chapter 5.
This chapter provides a brief introduction to the field of digital forensics, and the subject of that field: digital evidence. A brief summary of the challenges in digital forensics is presented, followed by a summary of the contributions of this dissertation, and finally, the dissertation roadmap.
Chapter 2: Background: Digital forensics
This chapter is a comprehensive review of digital forensics and digital evidence
from practice and research perspectives. The chapter begins by describing the historical
context and evolution of the field. The field is then characterised by presenting multiple
perspectives of it, including noted definitions, the nature of digital evidence and its
relation to digital forensics tools. Finally, current approaches to representing and
documenting digital evidence are described, and limitations in evidence representation
are identified.
Chapter 3: Related work
This chapter reviews background material and related work relevant to the
work described in this dissertation in Chapters 4 to 7. Section 3.1 describes the
literature related to event correlation, both specifically related to computer forensics
and in a more general context. Chapter 5 builds upon this background. Section 3.2
describes current approaches to maintaining investigation documentation and evidence
storage, which is a subject of Chapter 6. Both Sections 3.1 and 3.2 make observations related to representation, upon which Chapter 4 builds. Section 3.3 describes
work related to computer timekeeping, forming background information for Chapter 7.
Chapter 4: Digital evidence representation: addressing the complexity &
volume problems of digital forensics
Following from the observed limitations in evidence representation made in
Chapter 3, this chapter reviews literature in the fields of Knowledge Representation
(KR) and markup languages, with the goal of representing digital evidence. These are
currently the two primary approaches to representing and communicating knowledge
outside of natural language. The historical context of KR is described, followed by a
description of the major approaches to KR. The historical context of markup languages
is then relayed, leading to a description of the current state of KR, its influence on
markup languages, and the current research agenda towards building a “Semantic Web”
of knowledge. A brief introduction to the RDF/OWL representational formalism is then
presented. Finally, the chapter concludes by proposing that this formalism would be of
benefit towards solving the complexity and volume problems of computer forensics.
Chapter 5: Heuristic event correlation for forensics
This chapter addresses the themes of evidence representation and analysis
techniques. Following from the proposal of the RDF/OWL formalism as a
This chapter describes in detail the field of digital forensics. Section 2.1 begins
by describing the historical context and evolution of the field. Following this, Section
2.2 relates key definitions of digital forensics and digital evidence. Section 2.2.1
describes in detail the nature of digital evidence, and the following section, Section
2.2.2, describes the digital investigation process by surveying a number of process
models which have been proposed. The subject of Section 2.3 is digital forensics tools.
Finally, key research challenges in the field of digital forensics are outlined.
The FBI established a Magnetic Media Program, its first computer forensics initiative [9, 106]. The Magnetic Media Program later became the Computer Analysis and Response Team (CART).
The late 80’s and early 90’s saw the proliferation of the PC platform, and in the early 90’s, the widespread recognition that new techniques were required for preserving digital evidence. The first specific forensic imaging tool, IMDUMP, was developed in the USA; it was superseded in 1991 by a tool called Safeback [89]. In the UK in the same year, another disk imaging application called the Data Image Back-Up System (DIBS) was produced [9].
Computer forensics practitioners began to organise and evaluate their techniques and practices; in 1993 the First International Law Enforcement Conference
on Computer Evidence was hosted by the FBI. Subsequent conferences led to the 1995
formation of the International Organization on Computer Evidence (IOCE), and the
1997 meeting which resolved to develop best practice standards [21]. Around this time
audio and video technologies were moving from analogue to digital, which led
practitioners to consider whether the same principles of computer forensics applied to
all types of digital evidence [142].
Efforts to define the principles of computer forensics culminated in 1999 with the IOCE's adoption of proposals authored by two member organisations: the Scientific Working Group on Digital Evidence (SWGDE), from the USA, and the Association of Chief Police Officers (ACPO), from the UK [21]. The ACPO proposal has evolved into what is known as the “Good Practice Guide for Computer based Electronic Evidence” [6]. In 2002, based on the IOCE's 2000 submission, the G8 issued the “G8 Proposed
principles for the procedures relating to digital evidence”. In Australia, the move
towards formal standardisation of the management and treatment of digital evidence
has begun with the 2003 definition of “Guidelines for the management of IT evidence”
[1].
The academic history of computer forensics goes back to the late 80’s and early
90’s with work by Collier and Spaul [33], Sommer [120] and Spafford [123]. By the
late 90’s very little had been published in the open literature on computer forensics
[88]; however, the new millennium has seen an upturn in both DF targeted publications
and conferences, including the first two specifically targeted journals. The first digital
forensics targeted conference, The Digital Forensics Research Workshop, was
established in 2001, followed by the International Journal of Digital Evidence in 2002
and the International Journal of Digital Investigation in 2004.
That digital forensics has been made the subject of a recent special issue of the Communications of the Association for Computing Machinery (CACM) [5] is indicative of the transition of the field towards the mainstream.
• information stored or transmitted in binary form that may be relied upon in court
(IOCE) [57]
• any data stored or transmitted using a computer that support or refute a theory of how
an offence occurred or that address critical elements of the offence such as intent or
alibi (Casey) [27]
The above definitions are sourced from entities which primarily reside in the United States. The Association of Chief Police Officers of England, Wales and Northern Ireland defines Computer Based Electronic Evidence as:
The terms “digital evidence” and “computer based electronic evidence” are used synonymously. While the final definition parallels the earlier three, subtle differences remain. By defining computer evidence in relation to the investigative process, rather than in relation to the legal one, the ACPO definition addresses digital data from the time it becomes a part of an investigation. The other definitions, however, limit their subject to data which has been examined and found relevant towards
6 Probative value: “The extent to which evidence could rationally affect the assessment of the probability of the existence of a fact in issue” [68]
establishing some theory. The IOCE definitions address this shortfall between the two
styles of definitions by defining the term Data Objects:
• Objects or information of potential probative value that are associated with physical
items. Data objects may occur in different formats without altering the original
information.
For this dissertation, the author chooses to adopt the IOCE definition of the
term digital evidence: “information of probative value 7 that is stored or transmitted in
binary form”.
The term computer forensics was in informal use in academic publications from at least 1992 [121]; however, the term remained informally defined for many years. A commonly cited definition of the field in Australian literature is McKemmish’s 1999 definition of forensic computing:
The word forensic comes from the Latin word forensis: public; to the forum or public
discussion; argumentative, rhetorical, belonging to debate or discussion. From there it
is a small step to the modern definition of forensic as belonging to, used in or suitable
to courts of judicature, or to public discussion or debate. Forensic science is science
used in public, in a court or in the justice system. Any science, used for the purposes of
the law, is a forensic science. [2]
This broad definition of forensics, together with McKemmish’s earlier definition, informs the definition of computer forensics given by the Scientific Working Group on Digital Evidence (SWGDE):
The use of scientifically derived and proven methods toward the preservation,
collection, validation, identification, analysis, interpretation, documentation and
presentation of digital evidence derived from digital sources for the purpose of
facilitating or furthering the reconstruction of events found to be criminal, or helping
to anticipate unauthorized actions shown to be disruptive to planned operations. [98]
This broad definition reflects a change in the forums in which the techniques of computer forensics are being applied. While traditionally computer forensics was targeted exclusively at the legal forum, it is increasingly practised in non-legal contexts such as corporate investigations, intelligence and the military.
7 Probative value: “The extent to which evidence could rationally affect the assessment of the probability of the existence of a fact in issue” [68]
The terms digital forensics, forensic computing and computer forensics are
today arguably used interchangeably. Historically, computer forensics and forensic
computing 8 related to the interpretation of computer related evidence in courts of law.
Technology, however, does not stand still, nor does language, and the meaning of these terms has remained under continual negotiation. Two factors underlie this process: the changing uptake of digital technologies, and, with it, moves within organisations towards governing and regulating the use of information technology.
The late 80’s and early 90’s period was characterised by a mainstream of stand alone PCs: the internet was in its infancy, and used only in academic circles. In this
environment, the main subject of computer forensics was indeed the basic components
of computing: persistent storage, such as floppy disks and hard disks, and the software
itself. As computers became inter-networked and proliferated into small devices such
as PDAs and mobile phones, the field has broadened its scope beyond the computer to
include network forensics and small scale device forensics. Digital forensics more
accurately describes this new state of affairs.
New imperatives are further shaping the field; demand for computer forensics is rising outside of the traditional legal context. Today, digital forensics is practised by law enforcement, the military, intelligence agencies, and also within the corporate sector. Each of these sectors brings with it a divergent agenda, primarily related to the rigour required of any conclusions made by an investigation.
In the law enforcement context, the primary objective is the prosecution of an
alleged perpetrator of crimes. This necessarily dictates application of strict judicial
standards to the practice of computer forensics, because of the impacts on the freedoms
and liberties of the accused. In the military context, a secondary objective may be
prosecution; however this objective is subordinate to continuity of operations. In this
context, the practitioner of computer forensics is prepared to sacrifice accuracy for
immediate answers. This being the case, and under a time imperative, the conclusions
made by the practice of computer forensics in the military context cannot necessarily be
expected to be rigorous. The term digital investigation arguably reflects these subtle
changes in focus.
Despite the changes in agenda signified by the usage of the term digital investigation over computer forensics, it is the subject matter of the field which defines it and unifies the various definitions. Digital evidence remains at the centre of the field.
The next section examines the nature of digital evidence.
8 For the reader interested in the evolution of definitions of the field, see “To Revisit: What is Forensic Computing?” [53]
• Provenance: identifying the genesis of the digital crime scene, for example
Case Number, Examiner, Evidence Number, Unique Description, Acquisition
Time
• Integrity: records which may enable identifying if the digital crime scene has
been modified
The drive for establishing a recognised set of standards for performing digital
forensics has resulted in reflection on what tasks are performed in a digital
investigation, and to what end. Beyond the goals of standards setting, descriptions of
forensic processes are also useful for training and directing research.
Early descriptions of digital forensics processes, such as Mandia’s intrusion response oriented methodology and Farmer and Venema’s early guidelines, have been criticised as too specific, focusing on particular technologies rather than on generalised process [109]. Since these early attempts, a number of other processes and
frameworks have been proposed, which are described in the following sections.
• Examination: “make evidence visible and explain its origin and significance…
search for information… data reduction”
• Analysis: “looks at the product of the examination for its significance and
probative value to the case. Examination is a technical review that is the
province of the forensic practitioner, while analysis is performed by the
investigative team.”
A further preparatory phase was also implied by posing the question of whether the first responder’s unit had the requisite capability to perform the other phases.
The results of a review of the terminology used for describing the phases of
linear process models are presented in Figure 1. Phases which may be considered as
either equivalent in nature, or simply more specific are arranged in columns.
A few points may be observed. Firstly, despite similarities in the activities or
goals identified for particular phases, terminology remains varied, and the subtleties
implied are not clearly defined. The differences in terminology may be explained by
the granularity of the models. For example, the 2001 NIJ model prescribes a Collection phase, but the 2004 model [93] describes Assessment and Acquisition phases without reference to Collection. It would appear, however, that Assessment and
Acquisition would form sub-phases of a general Collection phase.
Regarding granularity, we observe that the least granular and most abstract of the models presented is the Computer Forensics Secure Analyse Present (CFSAP) [89] model of Mohay et al., and the most granular is Reith’s Abstract Digital Forensics Model, which describes 9 phases in all [109].
Beebe’s Hierarchical, Objectives-Based Framework addresses granularity
issues by proposing a hierarchical structure by which sub-phases may be related to less
granular, higher level phases [13]. Additionally, she relates a class of concerns, which she calls Principles, which overarch many or all of the phases and sub-phases of the investigative process. Two Principles she identifies are Evidence
Preservation and Documentation.
[Figure 1: Corresponding phases of linear process models of digital forensic investigation, comparing Casey (2000), DFRWS (2001) and NIJ (2004); equivalent phases are arranged in columns.]
Overarching principles
• Upon seizing digital evidence, actions taken should not change that evidence.
• When it is necessary for a person to access original digital evidence, that person must
be forensically competent.
• All activity relating to the seizure, access, storage, or transfer of digital evidence must
be fully documented, preserved, and available for review.
• An individual is responsible for all actions taken with respect to digital evidence while
the digital evidence is in their possession.
• Any agency that is responsible for seizing, accessing, storing, or transferring digital
evidence is responsible for compliance with these principles. [130]
Scene investigation, and Presentation. The goals related to each phase are described
below:
The phases described above are presented in Figure 2, with arrows indicating
flow through the process phases. We note here that the identification of a digital device
at a physical crime scene may instigate a digital crime scene investigation, and
transitively, that digital evidence found on a digital crime scene may instigate an
investigation at a newly identified physical location.
The sub-phases of the digital crime scene investigation phase are depicted in
Figure 3. The first two phases are similar to early phases of the linear process models; however, the final phase, Event Reconstruction & Documentation, proposes a set of sub-phases which attempt to prove or disprove hypotheses related to events that may
have caused digital evidence found in the crime scene.
maintenance of integrity by tying the physical evidence from which it is derived to the
digital crime scene. In some jurisdictions it is routine practice to print a copy of this
hash to use as a contemporaneous note establishing this link.
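As a hedged illustration of this practice, the following Python sketch computes a message digest over an acquired image file; the algorithm choice and file name are assumptions for the example only, not a prescription from this dissertation:

    import hashlib

    def image_digest(path, algorithm="md5", chunk_size=1024 * 1024):
        """Compute a message digest of an acquired image, reading it in chunks."""
        digest = hashlib.new(algorithm)
        with open(path, "rb") as image:
            for chunk in iter(lambda: image.read(chunk_size), b""):
                digest.update(chunk)
        return digest.hexdigest()

    # The resulting value can be recorded in contemporaneous notes and later
    # recomputed to demonstrate that the image has not changed since acquisition.
    print(image_digest("suspect_drive.dd"))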
As raw data cannot exist without being contained or encoded in some way or
another, acquisition tools need a container in which to store the image. This container is
typically some other piece of raw media, such as a hard drive, or the contents of a file.
Despite the apparent simplicity of the process described above, numerous
problems emerge which must be considered. For example:
• Can we prove the write blocking technology actually ensures the integrity of
the digital crime scene?
• Is the digital crime scene copy an accurate copy of the original? In operation,
hard drives often have bad sectors which are unreadable. How does the presence of these sectors affect the maintenance of integrity, and how do we record their presence? (A sketch of one way of recording them follows this list.)
• Is the digital crime scene copy a complete copy of the original? New drives contain special areas protected from regular access, which may also be relevant to the investigation at hand.
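The following sketch illustrates one common convention for the bad sector question above: copy the source sector by sector, zero-fill any sector that cannot be read, and record its address for the acquisition report. It is an illustrative assumption, not a description of any particular tool:

    import logging

    SECTOR_SIZE = 512

    def acquire(source_path, image_path, total_sectors):
        """Copy a source device sector by sector, zero-filling and logging unreadable sectors."""
        bad_sectors = []
        with open(source_path, "rb") as src, open(image_path, "wb") as dst:
            for lba in range(total_sectors):
                try:
                    src.seek(lba * SECTOR_SIZE)
                    data = src.read(SECTOR_SIZE)
                    if len(data) < SECTOR_SIZE:
                        raise IOError("short read")
                except (IOError, OSError):
                    data = b"\x00" * SECTOR_SIZE   # zero-fill the unreadable region
                    bad_sectors.append(lba)        # record its presence for the report
                    logging.warning("unreadable sector at LBA %d", lba)
                dst.write(data)
        return bad_sectors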
In the early days of forensics, the primary role of digital forensic examination
tools was to interpret raw data into information. As datasets containing potential
evidence have become larger, the information gleaned from interpretation tools has become increasingly unwieldy, leading to “needle in a haystack” problems. To
address this, integrated tools have emerged which provide various techniques for
searching, navigating, filtering and examining the information. This section describes
tools related to these concerns.
Four higher level strategies are typically employed for finding relevant
information within this raw data: structural interpretation, signature based searching,
file classification and event correlation.
Structural interpretation exploits the structured nature of most forms of digital data. For example, the average hard drive is structured into partitions,
then file systems, then files and so on. Digital data, and the software that acts upon it,
is organised into layers of abstraction to reduce complexity. For example, as it applies
to storage, general purpose operating systems have long provided the familiar file and
directory abstractions for storing and organising data. Details of hard drive sector
addressing, and file indexing, are hidden from the average user at lower layers of
abstraction. Similar abstraction layers exist in software architectures and in network
data communications. Much of the job of forensic analysis tools is to exploit the structure of raw binary data so that data objects of an appropriate abstraction layer, and of evidentiary value, may be found.
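A minimal sketch of structural interpretation at the media management layer follows; it assumes a classic DOS/MBR partitioned image and decodes only the four primary partition table entries, which is far simpler than what real media analysis tools do:

    import struct

    def read_mbr_partitions(image_path):
        """Decode the four primary partition entries of a DOS/MBR partition table."""
        with open(image_path, "rb") as image:
            mbr = image.read(512)
        if mbr[510:512] != b"\x55\xaa":
            raise ValueError("no MBR boot signature found")
        partitions = []
        for i in range(4):
            entry = mbr[446 + i * 16:446 + (i + 1) * 16]
            partition_type = entry[4]
            start_lba, sector_count = struct.unpack_from("<II", entry, 8)
            if partition_type != 0:       # type 0 marks an unused table entry
                partitions.append({"type": partition_type,
                                   "start_lba": start_lba,
                                   "sectors": sector_count})
        return partitions

Each decoded partition can then be handed to a filesystem-level interpreter, illustrating the layering described above.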
EnCase and FTK are storage media analysis tools which provide primarily
structural interpretation functions over a common set of evidence types. Similar functionality is provided in the open source tools The Coroner's Toolkit (TCT) 9 and The Sleuth Kit (TSK) 10, the latter of which is based on the former.
Signature based interpretation refers to a class of interpretation techniques
best exemplified by the class of tools known as file carving utilities, which search raw
digital data for characteristics which are unique to particular species of files. File
carving utilities such as scalpel [111] or foremost 11 are able to identify potential
instances of image files such as GIF and JPEG, and documents in Microsoft Word format, by identifying local structure, regardless of whether an underlying filesystem is present. Local structure is used to identify data objects rather than searching global structure. Such interpretation strategies are useful in instances where the global
structure has become corrupted.
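The following naive sketch conveys the signature idea: it scans a buffer for JPEG header and footer byte sequences and yields the bounded regions as carving candidates. Real carvers such as scalpel and foremost are considerably more sophisticated; the size limit and signatures here are illustrative assumptions:

    JPEG_HEADER = b"\xff\xd8\xff"
    JPEG_FOOTER = b"\xff\xd9"

    def carve_jpegs(data, max_size=10 * 1024 * 1024):
        """Yield (offset, candidate) pairs for regions bounded by JPEG magic values."""
        position = 0
        while True:
            start = data.find(JPEG_HEADER, position)
            if start == -1:
                break
            end = data.find(JPEG_FOOTER, start, start + max_size)
            if end != -1:
                yield start, data[start:end + 2]
            position = start + 1   # keep scanning; candidates may be false positives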
Recent investigations in characterising large corpuses of hard drives have
employed signature methods for identifying credit card numbers, social security
numbers and email addresses [44].
9 http://www.porcupine.org/forensics/tct.html
10 http://www.sleuthkit.org/
11 http://foremost.sourceforge.net/
12 Some architectures have used different storage paradigms. The Palm Pilot and IBM’s OS/390 do not use files, but rather a record oriented storage paradigm.
In the more mature area of media analysis, where most practical activity is
occurring in the field, a number of commercial products have emerged that combine
acquisition, analysis, and reporting functionality in one integrated tool. EnCase and
FTK are prime examples of this class of tool.
A number of task-specific features are found in integrated digital investigation
environments: these are Navigation, Search and Presentation.
Navigation features enable the investigator to visualise and explore the
structure of the digital crime scene. In practical terms, this feature is implemented in
EnCase as a tree-style user interface element which represents the structure of the
digital crime scene using abstractions at various layers of the media analysis stack.
Search features enable the identification of data objects which conform to
various criteria, such as keyword or regular expression equivalence, date ranges, or data
object classifications. These criteria are evaluated against the content of data objects (such as in free text search of documents) or the attributes of objects (as in finding all pictorial image files with a .jpg or .gif file extension). Filtering, which we have
mentioned previously, can be seen as the opposite of search. Filtering limits the
perspective of search, presentation and navigation functionality to the data objects not
matching the filter criteria.
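A hedged sketch of the distinction between searching and filtering follows, using a hypothetical minimal data object type; integrated tools evaluate far richer criteria than this:

    import re
    from dataclasses import dataclass

    @dataclass
    class DataObject:          # hypothetical stand-in for a file recovered from an image
        name: str
        content: bytes

    def search(objects, pattern):
        """Return data objects whose content matches a regular expression."""
        regex = re.compile(pattern)
        return [obj for obj in objects if regex.search(obj.content)]

    def apply_filter(objects, predicate):
        """Filtering removes objects matching the filter criteria from view."""
        return [obj for obj in objects if not predicate(obj)]

    documents = [DataObject("notes.txt", b"meet at the usual place"),
                 DataObject("readme.txt", b"nothing of interest")]
    hits = search(apply_filter(documents, lambda o: o.name.endswith(".gif")),
                  rb"usual place")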
Finally, presentation functionality is related to presenting data objects, their
attributes (such as file metadata), and content in meaningful ways. As digital data may
have multiple interpretations, multiple viewer types may be appropriate for interpreting
data object content. For example, it may be instructive to read the textual content of a
HTML page in some instances, and the rendered page in others.
This class of tools typically provides support for recording some investigation documentation, such as case IDs and investigator names.
Despite the pivotal nature of tools in digital forensics, little academic work has
focused on this subject.
Carrier has proposed a model of digital forensics examination and analysis
tools which characterises the digital forensic tool as an interpreter of data from one layer of
abstraction to data at another, higher layer of abstraction. In this model (presented in
Figure 4) a forensic tool implements a rule set which translates input data from one
layer of abstraction into output data at another layer of abstraction. In performing this
transformation, a tool may introduce an error.
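The following sketch restates this reading of the model in code; the names and the separation of abstraction error from implementation faults are illustrative assumptions rather than Carrier's formulation:

    from dataclasses import dataclass
    from typing import Any, Callable, List

    @dataclass
    class LayerOutput:
        objects: List[Any]          # data objects at the higher layer of abstraction
        abstraction_error: float    # error attributable to the tool's rule set

    # A tool, in this reading, is a rule set translating data at one abstraction
    # layer into data at a higher layer, possibly introducing error as it does so.
    Translator = Callable[[Any], LayerOutput]

    def lossless_decoder(rules):
        """Wrap a purely structural rule set: it may still harbour implementation
        faults, but by design it introduces no abstraction error."""
        return lambda data: LayerOutput(objects=rules(data), abstraction_error=0.0)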
Tools which purely decode structure, such as media analysis tools, may
inadvertently introduce errors caused by implementation faults; however, these types of error are difficult to measure and, in the commercial sphere, are not disclosed. File
carving tools introduce a different kind of error apart from implementation error:
abstraction error. An example of this is that signatures may inadvertently match data
that is not a valid file, leading to false positives 13 .
In practice, a tool may internally implement multiple abstraction layer
transformations. The open source forensics tools generally address only a few
abstraction layer transformations related to a particular class of structure. For example,
The Sleuth Kit (TSK) takes as input data from the media management
abstraction layer (which is concerned with volumes and partitions) and outputs data at
the file system layer of abstraction, such as files (both deleted and regular) and
directories. Commercial tools tend to be more monolithic in nature and integrate
abstraction layers from separate domains. For example, while EnCase includes
abstraction layer translators equivalent to those found in TSK, it additionally includes a
translation layer which translates Redundant Array of Inexpensive Disks (RAID)
images from the physical media layer to the media management layer.
13 A false positive is a test result that incorrectly indicates a positive finding. In this case a signature might match some data, indicating that a file of a particular type has been found, while in fact the file is not of that type.
Notable open source digital forensics investigation tools are the Autopsy
forensic browser [25], PyFLAG 14 and the TULP2G small scale device forensics
framework [137]. The architecture employed by Autopsy is a component oriented one,
with the user interface components running in separate processes from the filesystem
interpretation layer tools. The latter are sourced from separate projects, including the
related Sleuth Kit project. Theoretically, this separation of functionality enhances
robustness by limiting the effects of software faults to the implementing module, rather
than affecting the whole application. Autopsy provides limited support for the
maintenance of case related documentation.
While not open source in nature, the XIRAF digital forensics prototype [7] uses
a similar architecture to TSK, utilising wrappers around existing open source
interpretation tools.
A number of groups have begun experimenting with clustered computing
architectures as foundations for forensics tools, towards the goal of addressing the IO
and CPU bound processing issues inherent in current monolithic architectures. The
prototype Distributed Environment for Large-scale inVestigations (DELV) investigated
the feasibility of speeding up processing by spreading an entire hard disk image across
the RAM of a cluster of commodity PCs, moving the processing to each node [114].
The Open Computer Forensics Architecture (OCFA) employs a distributed processing
model for recursively processing data objects found within a digital crime scene [62].
Similar to XIRAF and Autopsy, interpretation is realised by wrapping existing
interpretation tools.
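The recursive processing idea can be conveyed with a short sketch: every data object taken from the digital crime scene is queued, and each extractor applied to it may yield further child objects (files within a filesystem, attachments within a mailbox), which are queued in turn. This is an illustrative assumption about the general pattern, not the OCFA implementation:

    from collections import deque

    def process_recursively(root_objects, extractors):
        """Breadth-first processing in which extractors may derive new child objects."""
        queue = deque(root_objects)
        processed = []
        while queue:
            data_object = queue.popleft()
            processed.append(data_object)
            for extract in extractors:
                queue.extend(extract(data_object))   # children re-enter the queue
        return processed

In a distributed setting the queue becomes the unit of work handed out to worker nodes.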
14 http://pyflag.sourceforge.net/
[Table 1: Challenges in digital forensics, from Lindsey's DFRWS 2006 keynote; the listed challenges include wireless technologies and anti-forensics, among others.]
These challenges as enumerated by Lindsey at DFRWS 2006 are a mix of: new
technologies (e.g. wireless, whole drive encryption), situational technology trends (e.g.
device diversity, volume of evidence, distributed evidence), and techniques (e.g. Live
response, usability & visualisation).
In 2005, the following list of challenges was presented by Mohay [87]:
• Embedded systems
• Tools
• Data volumes
• Counter forensics
• Networked evidence
• Tool testing
15 And have been doing so since around 1989.
leaving this information outside of the scope of the tool. Considering the storage
volume example given above, the integrated forensics tool EnCase has had its internal
model changed to include abstractions for RAID volumes and corresponding
component media regions, and the tool's interaction model has been tweaked to represent these abstractions. The Sleuth Kit, the open source forensics tool addressing similar analysis tasks, does not, however, address storage virtualisation. Rather, it leaves
management of this conceptual change in the hands of the human tool operator.
The complexity and volume problems are well illustrated in the Gorshkov case,
which involved credit card theft from at least 11 online entities, and subsequent
fraudulent use of those cards through PayPal 16 and eBay 17 [8]. Successful prosecution
of this case involved evidence drawn from multiple computers, under the control of
multiple company entities, in multiple jurisdictions (some of which was acquired over
the internet from Russia). The evidence was drawn from multiple sources, such as
backups, hard disk images, emails, archive copies, and hard copies. Interpreting the
evidence required numerous applications and multiple operating systems.
As we have said, digital investigation covers a very broad range of conceptual
entities, and any schema or model attempting to fully describe the domain quickly
becomes insufficient as technology inexorably marches on. In this light, a means of
representing evidence and related information expressive enough to represent all of the
information we wish, while not committing us to a particular data model, is desirable.
Furthermore, such a model should be extensible enough that new information may be
added by arbitrary means, as new tools and techniques emerge, without breaking
existing tools or violating the integrity of the existing information. Conversely, a means to declaratively attach semantics to data, without resorting to modifying the tools which operate over the information, holds promise for the integration of arbitrary and heterogeneous data.
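Chapter 4 develops this argument using the RDF/OWL formalism; as a foretaste, the following minimal sketch (using the rdflib library, with a hypothetical vocabulary, identifier and digest value) shows new statements being attached to an existing evidence description without disturbing the statements already made:

    from rdflib import Graph, Literal, Namespace, URIRef

    EX = Namespace("http://example.org/forensics#")      # hypothetical vocabulary
    image = URIRef("urn:example:evidence/image-001")     # hypothetical identifier

    g = Graph()
    # Statements made by an acquisition tool.
    g.add((image, EX.acquiredBy, Literal("J. Bloggs")))
    g.add((image, EX.md5, Literal("d41d8cd98f00b204e9800998ecf8427e")))

    # A later tool can attach entirely new properties without altering, or even
    # understanding, the statements and tools that came before it.
    g.add((image, EX.containsPartition, URIRef("urn:example:evidence/image-001/p1")))

    print(g.serialize(format="turtle"))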
Addressing the volume and complexity challenges requires new approaches to
building tools which acknowledge the rate of change of technology, and enable
continued tool functioning despite new sources of complexity. We hypothesise that by
focusing on the relationship between tools and formal representation, a key theme of
this research, new approaches might be identified which address these challenges.
16 http://www.paypal.com/
17 http://www.ebay.com/
changes. It is well acknowledged that new approaches to building tools are necessary:
this situation is reflected by a recent upturn in focus on research into new tools and
techniques.
The first DFRWS in 2001 focused primarily on frameworks and principles for
digital forensics, rather than on forensics tools and techniques [88]. Two speakers,
however, highlighted the need for tools and techniques to evolve:
• The social aspects of our analytical endeavors are in need of focus, too. We
need tools that zero in on truly useful information and quickly deduce whether
it is material to the investigation or not. We need to identify a social “end-
game”. Are we prepared to take serious action to thwart wrongdoing in all its
forms? (Spafford, DFRWS 2001).
By the time of DFRWS 2006, the research priorities of the field had indeed
begun to shift towards addressing these challenges. Papers were presented covering
tool validation, memory analysis, tool integration (twice) and evidence correlation (four
times). In all, 8 of the 17 papers presented at DFRWS 2006 related to the
development of new techniques and tools.
Analysis of the prevalence of specific technologies cited as challenges by
Lindsey, Mohay and Turner reinforces the need for new and more effective forensics
tools and techniques. To quote Mohay:
These tools need to target the ever increasing volume and heterogeneity of
digital evidence and its sources, and they need to be inter-operable [88].
In the early 90s, much of the focus of the field was on building effective
forensics tools and having them accepted in court. Frameworks for characterising the
field of forensics, such as forensics process models, and protocols for ensuring integrity
and chain of evidence were primary concerns. Today, there appears to be consensus on
appropriate methodologies and protocols for dealing with digital evidence, a conclusion
which can be implied by the widespread adoption of digital evidence in proceedings.
Despite the apparent need to trade rigour off against expediency or other factors in
some contexts of digital investigation, the need for rigour in the conclusions drawn is a
principal tenet. In particular, the traditional forensic sciences are based on the
application of reliable scientific methods – seeking to use techniques or tools only after
rigorous and thorough analysis. The field of digital forensics (at least in the United
States) is struggling to meet the court’s standards for scientific evidence [78].
At the first DFRWS, it was concluded that, for digital forensic science to be
considered a discipline, it must exhibit a number of defining characteristics [98].
2.5 Conclusions
The utility of the computer as a tool of production, communication, and
commerce has resulted in widespread adoption over the latter half of the twentieth
century and the start of the new millennium. Digital technology is now pervasive.
Network effects and the rapid pace of change in digital technology have led to a
situation where the employment of digital evidence is complicated by the burden of
large quantities of highly complex data. The challenge for digital forensics is to
increase reliability and rigour while at the same time increasing the efficiency of
investigation. New techniques for interpreting and analysing evidence and new
approaches to building interoperable forensics tools are required.
Addressing these key challenges requires new approaches to building tools
which acknowledge the rate of change of technology, and enable continued tool
functioning despite new sources of complexity. We hypothesise that by focusing on the
relationship between tools and formal representation, new approaches might be
identified which address these challenges.
Chapter 3. Related work
“The search for truth is in one way hard, and in another easy – for it
is evident that no one of us can master it fully, nor miss it wholly.
Each one of us adds a little to our knowledge of nature, and from all
the facts assembled arises a certain grandeur.”
(Aristotle)
Employing a trivial example, the canonical form worked well for making
statements such as
The data model of the TSOA canonical form alone, however, prevented
expressing even slightly more complex statements (which unfortunately are on the
lower end of conceptual complexity when considering event logs) such as the
following:
“at 12:00 on the 1st January john logged into the host www”
operate at a higher level of abstraction than rule based approaches by using declarative
languages that model different aspects of a situation. Both signature and rule based
techniques typically entail specifying event signatures or rules using some kind of
event language.
A number of signature based alert correlation languages aim to correlate events
based on abstract models of intrusion goals. The LAMBDA correlation language is a
signature based approach that matches signatures of event consequences against event
prerequisites, generating Prolog based correlation rules [35]. This language uses an ad
hoc combination of XML and Prolog syntax to model both Attacks and Alerts.
JIGSAW uses a similar technique for correlating pre and post conditions, focusing
more on language syntax [132]. Its authors model pre and post conditions as “requires” and
“provides” relationships of events. Ning et al. criticise JIGSAW as overly restrictive,
and weaken the requires/provides relation in Hyper-Alerts to allow correlation in
absence of certain prerequisites [95]. Similarly to LAMBDA, both JIGSAW and
Hyper-Alerts are translatable to rules.
CEP [101] employs a rule language called RAPIDE for event pattern
recognition and correlation. This language contains features for matching on parameters
such as causal ancestry and repetition, as well as simple property based comparisons.
STATL uses finite state machine (FSM) models to specify signatures. Doyle et al.
critique the use of FSMs: “Representing events as transitions through a single chain of
states precludes recognizing the achievement of a set of attack preconditions that have
no innate required time order.” (p. 21) [37]. Techniques for translating FSMs into rules
are well established.
The line where rule and signature based approaches become expert systems is
blurred. Two differentiators would be the use of dynamic, object based knowledge
models, and a translation stage between signature specifications and underlying rule
representations.
A number of approaches have applied expert systems and logic based
reasoning to event correlation. The EMERALD IDS [69] uses an expert system
implemented using the P-BEST rule language to specify intrusion rules. Doyle et al
criticise P-BEST for lacking any concepts specific to event recognition [37]. The rule
language employed by Stallard and Levitt, called JESS, has a similar heritage to P-
BEST [125].
Of the correlation languages reviewed, only STATL was available with source
code. RAPIDE is available only as executables, and has not been updated since 1998.
JIGSAW has no implementation. Hyper-Alerts has been implemented, but the
implementation is not publicly available; nor is an implementation of LAMBDA
available. P-BEST is only available embedded in the
EMERALD product [69]; it is not easily modified, nor is it straightforward to
access the P-BEST functionality directly. It has few features differentiating it from the
class of languages based on the CLIPS expert system [143].
3.1.3 Observations
A number of containers for images are in common usage, such as simple binary
data files produced by the venerable UNIX copy tool, dd, and the EnCase Expert
Witness file format, produced by EnCase’s imaging tools. The latter, besides serving as
a container for images and Palm Pilot memory [31], additionally contains checksums, a
hash for verifying the integrity of the contained image, error information describing bad
sectors on the source media, and metadata related to provenance. The Advanced
Forensics Format (AFF) is a disk image container which supports storing arbitrary
metadata as name/value pairs [45].
There is, however, little standardisation of storage containers or consideration
of how to record aspects such as those described above. The current state of the art has
given rise to a variety of ad hoc and proprietary formats for storing evidence content,
and related evidence metadata. Conversion between the evidence formats utilised and
produced by the current generation of forensic tools is complicated: the process is time
consuming and manual in nature, and it may produce incorrect evidence data or lose
metadata [30]. Validation of the results produced is
hindered by this lack of format standardisation.
It is with these concerns in mind that calls have been made for a universal
container for the capture of digital evidence. Recently, the term “digital evidence bags”
was proposed to refer to a container for digital crime scene artefacts, metadata, integrity
information, and access and usage audit records [135]. Subsequently, the Digital
Forensics Research Workshop (DFRWS) formed a working group with the goal
of defining a standardised Common Digital Evidence Storage Format (CDESF) for
storing digital evidence and associated metadata [30].
The Advanced Forensics Format (AFF), recently proposed as a disk image
storage format, includes storage of acquisition related metadata in the same container as
the disk image. Garfinkel et al. describe the AFF and summarise the key characteristics
of nine different forensic file formats. They also outline the desirable characteristics for
an image storage container [45]. They conclude that the AFF is the only publicly
disclosed forensic format which supports storage of arbitrary metadata. The metadata
storage mechanism in the AFF is, however, limited to name/value pairs and makes no
provision for attaching semantics to the name.
EnCase 18 uses a monolithic case file for storing case related metadata and
stores filesystem images in separate and potentially segmented files. The format of the
case file is proprietary.
Turner’s Digital Evidence Bag (DEB) attempts to replicate the key features of
physical evidence bags, which are used for traditional evidence capture. The key
structural components of a physical evidence bag are the bag itself, a means of bag
identification (potentially a serial number), an area for recording evidence related
information (which Turner refers to as a tag), and optionally, a tamper evident security
seal.
The key features of physical evidence bags are categorised as follows:
Evidence Metadata Records: Standard evidence metadata includes a
description of the evidence, and the location, date and time of its acquisition.
18 http://www.guidancesoftware.com/
It is worth noting here that the use of the features listed above varies depending
on jurisdiction.
Turner’s proposal translates a number of aspects of the above features of the
physical evidence bag into the digital realm. A file archive structure is proposed which
defines a specific naming scheme for files containing digital evidence, separate files
containing evidence metadata, and a singular file which contains evidence integrity,
provenance and identification information. Figure 5 depicts the structure of Turner’s
digital evidence bag.
[Figure 5: structure of Turner's digital evidence bag (key: digital evidence files, evidence metadata, tag)]
A DEB is a collection of the Digital Evidence files, Index Files and a single
Tag File. Turner does not detail the implementation of the container grouping these
evidence files; however, we expect that in practice the container layer of the DEB
would be an archive format such as tar or zip.
Individual elements of digital evidence collected (such as filesystem images,
network traces, or the contents of image files) are stored in digital evidence files, which
are identified by a file extension .bagNN. The NN refers to a unique number.
Correspondingly, evidence metadata, such as file last access times, is stored in similarly
named files with the extension .indexNN. The pairing of a single digital evidence file
with its corresponding evidence metadata file is referred to by Turner as an evidence
unit. Turner does not describe the naming of the files other than the extensions defined.
It is unclear as to whether or not multiple pieces of content are stored in a .bagNN file.
Integrity, provenance and identification information are stored as unstructured
text within the tag file, which is identified by the file extension .tag. The tag file also
enumerates the names of all of the Evidence Units.
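To make the naming scheme concrete, the listing below sketches, in Python, how the evidence units of such a DEB might be enumerated by pairing .bagNN files with their .indexNN counterparts. The directory layout and file names are hypothetical illustrations of the convention described above, not part of Turner's proposal.

import os
import re

def evidence_units(deb_dir):
    # Pair each .bagNN evidence file with its .indexNN metadata file.
    # Turner's proposal defines only the extensions, so pairing here is
    # by the NN suffix alone; base names are assumed, not specified.
    bags, indexes = {}, {}
    for name in os.listdir(deb_dir):
        match = re.match(r".*\.bag(\d+)$", name)
        if match:
            bags[match.group(1)] = name
        match = re.match(r".*\.index(\d+)$", name)
        if match:
            indexes[match.group(1)] = name
    return [(bags[n], indexes.get(n)) for n in sorted(bags)]

# Hypothetical bag: image.bag01, image.index01, trace.bag02, trace.index02, case.tag
for evidence_file, metadata_file in evidence_units("deb"):
    print(evidence_file, "->", metadata_file)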
The architecture of Turner’s digital evidence bags is oriented towards a single
monolithic digital evidence bag being used in a case, as a container for all digital
evidence acquired. Secondary evidence (evidence derived from the analysis of earlier
acquired evidence, such as files extracted from a filesystem image) would appear in
this scheme to be added to the same digital evidence bag as the original image. This
involves modification to the tag file and the addition of new files to the evidence bag.
Integrity is assured by the onion-like use of hashing over the contents of the tag file.
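Turner does not specify the hashing construction in detail; one plausible reading of this onion-like scheme is a hash chain in which each revision of the tag file incorporates the hash of the previous revision, so that retrospective modification invalidates every later hash. The sketch below illustrates only that general idea, with function and field names of our own invention.

import hashlib

def seal(previous_seal, new_tag_entry):
    # Chain the previous seal into the new one, onion fashion.
    digest = hashlib.sha256()
    digest.update(previous_seal.encode("utf-8"))
    digest.update(new_tag_entry.encode("utf-8"))
    return digest.hexdigest()

seal_1 = seal("", "unit 01: image.bag01 acquired 2006-01-01")
seal_2 = seal(seal_1, "unit 02: trace.bag02 added 2006-01-02")
# Any later change to the first entry changes seal_1 and, therefore, seal_2.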
A potentially confusing aspect of Turner’s DEB proposal is that modification
of the tag file, and the addition of new files to the DEB may lead the layman to the
conclusion that the monolithic bag is never sealed, thus raising doubts as to the
integrity of the evidence. While this may be seen as little more than an impedance mismatch in
translating the evidence bag metaphor, we suggest an alternate architecture for digital
evidence bags, which is presented in Chapter 6. The architecture we present favours
treating evidence bags as immutable objects. Addition of information is achieved
outside the bag, in much the same way that information is added to the tag of a physical
evidence bag without breaking the tamper evident tape.
Turner’s structure does not define a scheme for referencing of evidence and
metadata between digital evidence bags. Therefore the ability to compose multiple
evidence bags into a corpus is not addressed.
The format and vocabulary of the investigation documentation maintained in
the DEB have no formally defined syntax, data model or semantics. The syntax appears
ad hoc and the vocabulary overly abbreviated. In this context, little attention has been
paid to the nature of the metadata that is being stored, with no consideration given
to the relationship between the metadata and wider case related information, nor to
information found within the digital crime scene.
models of timelines, hypotheses regarding the incident, and supporting evidence. The
bookmark feature of EnCase could be seen as an example of this class of model
concept.
Bogen’s work is significant in that it addresses conceptualising digital
investigations from a number of crosscutting perspectives. While the work proposes the
use of a graphical language for modelling these domains, it stops short of presenting
any actual models, or exploring means for tools to use these models.
A number of authors have proposed the definition of domain specific languages
for the purpose of representing and describing digital investigation related information.
Prior to proposing the Unified Modelling Methodology described above, Bogen and
Dampier proposed that a Computer Forensics Experience Modelling Language
(CFEML) would be of use in modelling of experiences, lessons learned, and knowledge
discovered in the course of an investigation [17]. While the purpose and need for a
CFEML was only proposed in an abstract sense, in the same year Stephenson proposed
the Digital Investigation Process Language (DIPL) [126]. This language, whose syntax
is based on the Common Intrusion Specification Language (CISL) of the IDS field, and
ultimately on LISP S-expressions [113], focuses on modelling the forensic
investigation process and on the entities involved in it with a heavy emphasis on
intrusion response. Using the perspectives defined by Bogen’s UMM, it focuses on the
investigative process view, and domain case view, and aims to be suitable for
describing in a narrative the actions performed in an investigation, and the entities
being acted upon. It is, however, strongly influenced by a network incident response
viewpoint, and lacks a means for ascribing semantics to vocabulary in an extensible
and machine readable way.
Both XIRAF [7] and TULP2G [137] process digital crime scenes into single
document based XML tree representations, operating primarily in the case domain,
whereas DIPL appears to operate more in both the case domain and investigative
process domain. An important contribution of the XIRAF work is in conceptualising
the XML representation of information extracted from the digital crime scene (DCS) as
annotations of particular byte ranges within the DCS, which also implies defining a
composable addressing scheme. This approach maintains a direct linkage between
information and its source, addressing the provenance of the information. Both
approaches avoid describing the semantics of their XML based representations, and
address information integration by tree manipulation.
The alternate approach to persistent, machine readable representations of case
related information is the dynamic generation model. EnCase falls into this category of
approach. We suspect that similar to TSK, which uses C-structure based data models
internally, EnCase employs internal structural models in interpreting digital data. The
only way to interact with or otherwise view these models is, however, through the GUI
representation of this structure, or, to a limited degree, through the use of a custom
scripting language and limited object model.
3.2.3 Observations
A battery powered real time clock (RTC) (also called BIOS or CMOS clock) is
used to keep time while a computer is switched off. While the RTC is used as the basis
for determining time when the computer boots, the interpretation of this time is
operating system specific. For example, the family of Windows operating systems
interpret the RTC as civil time 19 , whereas the Linux operating system may interpret the
RTC as either civil time or UTC by configuration.
Commonly, UNIX operating systems implement a software clock (called the
system clock) by setting a counter from the RTC at boot, and employ a hardware timer
(such as RTC timer interrupts, an advanced programmable interrupt controller (APIC)
19 Civil time refers to the government mandated time in a particular jurisdiction, incorporating region-specific offsets such as daylight saving time.
or other means) as an oscillator. Stevens suggests that all instances of the Windows OS
base their timescale on the RTC throughout operation [128]. There is, however,
evidence to suggest that, similar to UNIX implementations, Windows 2000 and above
employ a software clock rather than using the RTC directly [79, 81, 115].
over time. In addition, implementing the correct local time offsets for civil time is
complicated by changes in region-specific time zones. The importance of this is
illustrated by the flurry of patches related to the Melbourne Commonwealth Games
daylight saving time extension in early 2006 [80].
Clock configuration It is common to see Windows workstations with the time
zone left set to the default installation time zone. Another common configuration error
is a BIOS time that has not been set correctly.
Tampering The practice of setting computer clocks back or forward for
reasons such as evading digital rights management or misdirection of investigation is
often referred to as tampering. Timestamps, like any data, are subject to the possibility
of deliberate modification.
Synchronisation Protocol The Windows time synchronisation protocol is
based on SNTP and is only designed to keep computers synchronised to within 2
seconds in a particular site and 20 seconds within a distributed enterprise. Furthermore,
computers using NTP and SNTP without cryptographic authentication are subject to
protocol based attacks.
Misinterpretation Timestamps are related to a particular frame of reference,
and their correct interpretation requires knowledge of that context. For example, to
interpret the time to which an Internet Explorer timestamp corresponds, in the civil
time where it was generated, one needs to know the time zone offset (a small
illustration follows below). Other sources of uncertainty include whether a timestamp
refers to the start or the end of a particular event, and whether it was generated at the
time of the event or at the time of writing to the event log.
Bugs Software errors in the implementation of software clocks, or in the
algorithms which convert the in-memory clock to a timestamp, can adversely affect
timekeeping accuracy.
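To illustrate the misinterpretation problem, the fragment below renders a single stored UTC timestamp under two different assumed offsets; without knowledge of the correct time zone, the same value yields two different civil times. The timestamp and offsets are arbitrary examples.

from datetime import datetime, timedelta, timezone

# One timestamp, stored by the system in UTC (value chosen for illustration).
stored = datetime(2006, 3, 26, 2, 0, 0, tzinfo=timezone.utc)

# Interpreting it under two candidate offsets yields different civil times.
utc_plus_10 = stored.astimezone(timezone(timedelta(hours=10)))
utc_plus_0 = stored.astimezone(timezone(timedelta(hours=0)))

print(utc_plus_10.isoformat())   # 2006-03-26T12:00:00+10:00
print(utc_plus_0.isoformat())    # 2006-03-26T02:00:00+00:00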
[128]. In this model, a base clock is set to UTC, and subordinate clocks are defined by
skews from parent clocks with additional skews further generated from time drift rates.
Gladyshev and Patel propose using corroborating sources of time to find the
time bounds of events with an unknown time of occurrence by examining ordering
relationships with events with known times. They define both a formalism and an
algorithm for determining these temporal bounds [47].
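The intuition behind this technique can be sketched simply: an event of unknown time that is known to follow some events of known time, and to precede others, is bounded below by the latest of the former and above by the earliest of the latter. The fragment below is our own illustration of that intuition, not Gladyshev and Patel's formalism.

def time_bounds(known_times_before, known_times_after):
    # Events in known_times_before are known to precede the unknown event;
    # events in known_times_after are known to follow it.
    lower = max(known_times_before) if known_times_before else None
    upper = min(known_times_after) if known_times_after else None
    return lower, upper

# The unknown event follows events at t=100 and t=130 and precedes one at t=180.
print(time_bounds([100, 130], [180]))   # (130, 180)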
Weil argues for dynamic analysis of the temporal behaviour of suspect
systems, proposing correlation of timestamps embedded within locally cached web
pages with the modified and accessed times (MAC times) of the cached files [141].
3.3.5 Observations
Computer clocks are inherently unreliable, which casts doubt on the usage of
timestamps in forensic investigations. Methods of post-hoc characterisation of the
behaviour of a particular computer’s clock are of interest in assuring the correct
interpretation of timestamps.
3.4 Conclusion
This chapter has summarised the literature related to event correlation in
forensics, which is the focus of Chapter 5; current approaches to evidence representation
and storage, and to representation of digital investigation related documentation, which
are related to Chapter 6; and reliable interpretation of time, which is related to Chapter 7.
Analysis of the literature as it relates to event correlation and digital evidence
has led to the hypothesis that digital forensics tools would benefit from a formal
approach to representation. The next chapter describes and contextualises the field of
knowledge representation (KR), where we look for inspiration and formalisms with
which to address the representational challenges in digital forensics.
Chapter 4. Digital evidence representation:
addressing the complexity and volume
problems of digital forensics
“If scientific reasoning were limited to the logical processes of
arithmetic, we should not get very far in our understanding of the
physical world. One might as well attempt to grasp the game of poker
entirely by the use of the mathematics of probability.”
(Vannevar Bush)
Analysis of the field of digital forensics has indicated that examining the nature
of the information which it operates on may help address the complexity and volume
problems described in Section 2.4.1. This chapter looks to the field of knowledge
representation (KR) for inspiration, and proposes that a KR based approach to digital
evidence representation will yield benefits in solving these problems. In particular,
Semantic markup languages, which are described in Section 4.3, are employed towards
solving these problems in Chapters 5 and 6.
The chapter is structured as follows. Section 4.1 introduces the representational
challenges involved in digital evidence, describing why the current natural language
based approach to documenting investigations hinders tool interoperability and
potentially introduces errors. Section 4.2 provides background on the field of
knowledge representation, Section 4.2.1 describes its historical foundations, Section
4.2.2 describes key definitions, and Section 4.2.3 describes hybrid approaches to KR.
Section 4.3 describes the synthesis of markup languages and KR which has led to the
current generation of semantic markup languages, the Resource Description
Framework (RDF) and the Web Ontology Language (OWL). Section 4.3.1 introduces
RDF; Section 4.3.2 describes the XML serialisation of RDF, which is intended for
publishing and machine interpretation; and Section 4.3.3 introduces ontology
languages, and OWL. Section 4.4 reviews the literature of digital forensics and
computer security for knowledge representation related themes, and finally, Section 4.5
puts forward the proposition that the field of forensics would benefit from a formal
approach to representing evidence and related investigative information.
4.1 Introduction
The simplest of digital forensics investigations will involve numerous
documentary artefacts as evidence. Examples of these are printouts of data objects
identified as evidence, evidence manifests, and investigation reports. In the course of
investigation, other documents may be kept or produced, including chain of custody
documentation, file notes recording analysis activities and results, and provenance
documentation.
The current state of affairs is that much of the information related to digital
forensics investigations is recorded in documents such as these in natural language. The
vocabulary employed in these documents is drawn from multiple domains: law
enforcement, legal, computing, and general spoken English. This situation is similar
within the digital crime scene (defined previously in Section 2.2.2), where voluminous
amounts of information are stored in free text and semi-structured text form.
Despite much research into Natural Language Processing (NLP), such textual
information is still unsuitable for machines to reason with. For example, consider the
following two trivial sentences:
While the two statements are both syntactically and grammatically valid
English, their meaning is at first glance an oxymoronic state of affairs. Only by treating
the word “pen” as having two meanings in this context, as a fenced area and then as a
writing instrument, can one resolve the spatial contradiction first observed.
Machine interpretation of natural language is complicated not only by the free
form nature of English grammar and syntax, but also by the context dependence of
interpreting semantics of language terms. This dependence on context, and the
additional real world knowledge and reasoning which are required to resolve
ambiguities, are some of the lower level problems in NL understanding. Machine
understanding of natural language remains, today, one of the grand challenges in
computing.
The preoccupation of computer forensics has, until recently, been on the
immediate goals of interpreting binary data. While this is of fundamental importance, it
cannot be forgotten that the function of computer forensics is not only to glean
knowledge from digital evidence, but to communicate and analyse such knowledge in
a rigorous and verifiable manner.
This communication problem is best demonstrated with the following example.
Consider the simple set of evidence depicted in Figure 6. Such a set of evidence may be
presented in a case where inappropriate content of some description is found on a
computing resource. The figure depicts two pieces of physical evidence, which are the
containers of the two sets of digital evidence, the digital crime scene, and a set of
extracted files. The digital crime scene is a bitwise image taken of a hard drive, and the
extracted files are files found within the filesystem of the digital crime scene. The
analysis report, imaging records, and chain of custody, are all regular textual
documents, and the evidence printouts/visual aids are visual printouts of the extracted
files. The blue lines connecting pieces of evidence in the figure indicate where
references must be made from one piece of evidence to another. Red lines indicate a
“part of” relationship: the extracted files are contained on the CD, and are also
contained within the digital crime scene. Finally, the digital crime scene is contained
within the hard drive.
Evaluating this evidence involves verifying that the digital crime scene exactly
matches the crime scene referred to in the imaging records, and verifying that the files
found on the CD are found in the digital crime scene, in the locations described.
Performing these verifications requires human interaction, which is necessary
because of the use of natural language in the Analysis Report and Imaging Records.
The references to the particular hard drive, the names and paths of the files on the CD,
and the hash of the digital crime scene all need to be located by the analyst and
interpreted to refer to the correct artefacts, tools selected, and then employed to perform
the verification.
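The mechanical portion of this task, comparing the hash of the acquired image with the value recorded in the imaging records, is trivially expressed in code; what resists automation is locating and interpreting the recorded hash and the identity of the image within natural language documents. The sketch below, in which the file name and recorded hash are placeholders, covers only the mechanical step.

import hashlib

def image_digest(path, algorithm="md5"):
    # Hash a disk image in chunks so that large images do not exhaust memory.
    digest = hashlib.new(algorithm)
    with open(path, "rb") as image:
        for chunk in iter(lambda: image.read(1024 * 1024), b""):
            digest.update(chunk)
    return digest.hexdigest()

recorded = "9e107d9d372bb6826bd81d3542a419d6"   # transcribed from the imaging records
if image_digest("digital_crime_scene.dd") == recorded:
    print("image matches the value in the imaging records")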
Such a corpus of evidence falls squarely within the purview of media forensics,
the most established area of the field. While such verification actions might seem trivial
to perform at first glance, in practice they are complicated by numerous factors. For
example, if the files were found by file carving, how does one document the location of
the files? How does one validate that those locations form a valid file? What if the
digital crime scene was striped across multiple disks, as in a RAID array? How does
one document the RAID array configuration? What if the investigation is dealing with
thousands of files, and hundreds of digital crime scenes? The basic task of verifying
simple claims becomes under these circumstances a laborious manual exercise, due to
communication problems related to natural language.
This natural language problem may additionally be seen within the digital
crime scene. Event logs have in the past been employed in computing for a number of
purposes, including auditing system activity, recording performance information, and
recording system state for post mortem debugging. As such, they
record information about the computing environment, referring to entities such as hosts,
users, software agents, and activities with an ad-hoc vocabulary, irregular syntax, and
varying naming schemes. While such event log records are more structured than natural
language, interpreting them by machine is arguably as difficult as interpreting natural
language, given the considerable amount of domain knowledge required to infer their
semantics.
Beyond the problems preventing practical machine interpretation of natural
language, further problems confound the use of natural language as a common
language for documenting all aspects of a digital investigation. Producing suitably
complete and precise documentation over the course of the investigation requires
repetitive and methodical attention to detail. As such, it carries with it the threat of
unintentional introduction of errors and the omission of important details.
The concept of knowledge representation has been a persistent one at the centre
of the field of artificial intelligence (AI) since its founding conference in the mid 50’s.
In the early years it was, however, not explicitly recognised as an important issue in its
own right [75]. Early approaches in this period to representing knowledge in “thinking
machines” and automated problem solving are best characterised as ad hoc, with formal
semantics remaining absent. Consider, for example, the language LISP, which was the
mainstay of the AI field at the time. LISP's basic tree-like list data structure, with the
addition of cross links, forms a malleable basis for organising data into hierarchical
and graph based structures. It lacks, however, any foundations for intelligent reasoning;
rather, its foundations are computational. Any intelligent reasoning that may be
embodied in such programs must be implicit in the procedural code of the application.
Knowledge representation emerged as a field in its own right in the mid 60s,
and over the following two decades a number of approaches to knowledge representation
developed, with frames, production systems, and logic based approaches being
the predominant varieties.
The logic based approach takes the view that machine reasoning may be
realized by implementing programs which use the language of mathematical logic.
These approaches share the common approach of representing a domain of interest as a
set of propositions which embody specific information. Knowledge is encoded by
axioms which define logical implications which may be made about the information.
The earliest attempts used first order predicate logic (FOL) as their basis [50],
which was seen as appealing due both to its general expressive power and its well
defined semantics [41]. The use of FOL has persisted since. FOL is, however,
computationally intractable, which led to experimentation with smaller, more tractable
subsets. This led to the PROLOG language, first introduced in 1972.
PROLOG supports declaratively specifying information as symbol-value pairs and
enables axiom definition using a restricted form of FOL. This style of logic
based inferencing, based on the declarative specification of logical rules, has been used to
implement numerous expert systems.
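The flavour of this style of inferencing, declaratively stated facts plus rules that derive new facts, can be conveyed by a small forward chaining sketch; the facts and the single rule below are invented purely for illustration and are written in Python rather than PROLOG.

# Facts: declaratively specified ground assertions, as (predicate, arg1, arg2) triples.
facts = {("parent", "tom", "bob"), ("parent", "bob", "ann")}

def apply_grandparent_rule(facts):
    # Rule: parent(X, Y) and parent(Y, Z) implies grandparent(X, Z).
    derived = set(facts)
    for (p1, x, y1) in facts:
        for (p2, y2, z) in facts:
            if p1 == "parent" and p2 == "parent" and y1 == y2:
                derived.add(("grandparent", x, z))
    return derived

print(apply_grandparent_rule(facts))
# includes ("grandparent", "tom", "ann")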
Logical approaches are criticised for being unable to deal with exceptions to
rules, or to exploit approximate or heuristic models of knowledge. The expression of
meta-knowledge (description of what the knowledge can be used for) is also a
limitation [85]. Nor do such approaches allow for incomplete or contradictory knowledge,
or for subjective or time dependent knowledge [10].
A number of approaches commonly based on unstructured graph based
representations emerged in the early 1960’s and came to be known as semantic net
based representation schemes. The common points to these schemes were a graph
structure representing concepts and instances (or objects) and a set of inference
procedures which operate over these nodes [75]. Three types of edges are defined:
property edges, which assign properties (such as age) to the source concepts, IS-A
edges, which define class/subclass relationships between concepts, and instance
relationships between objects and classes. Semantic nets are criticised for lacking
formal semantics, leaving the meaning of the network to the intuition of the users
and programmers who use these network based representations [10].
At direct odds with the viewpoint of the logic based approach, the Frame based
approach [84] attempts to imitate how the human mind works, drawing its inspiration
from psychology and linguistics:
Whenever one encounters a new situation (or makes a substantial change in one’s
viewpoint) he selects from memory a structure called a frame, a remembered
framework to be adapted to fit reality by changing details as necessary.
A frame is a data structure for representing a stereotyped situation, like being in a
room, or going to a child’s birthday party. Attached to each frame are several kinds of
information. Some of this information is about how to use the frame. Some is about
what one can expect to happen next. Some is about what to do if these expectations are
not confirmed. [84]
Despite early criticism exchanged between the communities surrounding the frame and logic
based approaches to KR, it became apparent that effective machine reasoning would benefit
from hybrid approaches involving the application of multiple theories of intelligent
reasoning within the same representation. Within the various schools of KR,
research was directed towards addressing observed deficiencies. The limited
expressiveness of FOL based approaches has been addressed by proposing hybrid
logics, for example nonmonotonic logics and modal logics [10], and by limiting the
context in which FOL applies.
An awareness of the syntactic problems and undecidable nature of using FOL
as a representation language led to the development of so called Description Logics
(DL). This branch of logic began by attempting to address these problems by adopting
a semantic nets inspired model (which has been shown to be directly translatable to
FOL) and restricting the expressive power of the language to a decidable subset of
FOL. Description Logics model the world as atomic concepts (unary predicates) and
roles (binary predicates), using a small number of epistemologically adequate
constructors to build complex concepts and roles [10].
Focusing on the wider goal of building practical knowledge based systems, the
CYC project (named after the stressed syllable of the word encyclopaedia), embarked
upon in 1984, attempts to implement “the commonsense knowledge of a human being”
[67]. The knowledge representation language employed by CYC, called CYC-L,
addresses the vocabulary and syntax issues of representing
instance based knowledge, and the semantic linkages needed for defining ontologies.
CYC-L is based on FOL, and intelligent reasoning is implemented in the large by first
order logic theorem provers, and in the small, by domain specific micro-reasoners. In
this system, the axioms (rules) representing general theories about the world are
assumed true by default, and where exception based knowledge applies it is limited in
application by context.
The DARPA Knowledge Sharing Effort, initiated circa 1990 [91], researched
means of enabling knowledge sharing between computer systems. Its central theme was
that knowledge sharing required communication between systems, and that this, in turn,
As the focus on the use of the WWW began to shift from information
dissemination to information exchange, numerous parties began to find a need for
publishing machine usable descriptions of collections of distributed information. For
example, Microsoft proposed the Channel Definition Format (CDF) for describing push
based web content, the Platform for Internet Content Selection (PICS) for rating web
content [66], and Netscape proposed the Metadata Content Framework (MCF) for
generally describing metadata content [52]. XML alone proved insufficient for
addressing these needs, especially in the areas of schematic expressiveness and
evolution, and the integration of data from heterogeneous sources. The Resource
Description Framework (RDF) [63] arose out of these efforts.
In the very late 90’s, Berners-Lee, now a leading figure within the standards
body which produced RDF, the World Wide Web Consortium (W3C), began to
enunciate a vision to create a universal medium for sharing information and exchanging
data, which he referred to as the semantic web. A semantic web activity was initiated
within the W3C to pursue this goal, drawing on many lessons learned in building the
WWW and applying them to the task of knowledge representation.
Berners-Lee opines that the centralised, “all knowledge about my thing is
contained here” approach taken by most existing knowledge representation systems is
stifling and unmanageable, and proposes that these shortcomings might be addressed
by adopting a decentralised approach, in much the same way as was employed with
hypertext [16]. A key architectural principle of the web which enabled it to scale
where hypertext failed to scale was the notion that all information did not have to be
published in the same place. The definition of the Uniform Resource Locator (URL), a
hypertext link spanning arbitrary information servers, was the key to enabling
distributed and interconnected documents. The URL forms the foundation of the RDF
approach.
The W3C has standardised a number of technologies towards the goal of the
semantic web. These technologies are layered in a stack like manner, similar to that
observed in networking. The current semantic web stack comprises three logical layers:
a data layer, an ontology layer, and a query layer (presented below in Figure 7).
[Figure 7: the semantic web stack, with the ontology layer (OWL Full, OWL DL, OWL Lite) layered above RDFS]
The RDF is a framework defining a data model based on the directed, labelled
graph (DLG), and can be seen to be influenced by both the semantic nets and frames
KR approaches described earlier. This section presents a basic introduction to the RDF
data model, as it is used as a representational format in Chapters 5 and 6. For a more
comprehensive introduction see [110].
In the DLG model of RDF, graph nodes are either resources (such as things or
entities, i.e. people, places, events, or other) or values (such as numeric values, times, or
other resources). The directed nature of the graph corresponds to a constraint that the
subject, the node from which a graph edge originates, may only be a resource,
while target nodes (nodes at which graph edges terminate) may be either a resource or a
value. Graph edges correspond to properties (or attributes) of the subject. An example
of a simple graphical depiction of a RDF graph is presented in Figure 8 20 .
Figure 9: RDF statement "A person named Kevin Bacon starred in a movie named
'Footloose'"
20 We note that this is not a legal RDF graph, as the identifiers of the nodes and arcs are not legal URIs, but we present it for simplicity of discussion.
relationship 21. We do the same for the person with the name Kevin Bacon
and the class Person.
A fundamental premise of the RDF data model is that everything is named with
a Uniform Resource Identifier (URI) [15]. URIs are a generalised addressing scheme,
of which a subset is the Uniform Resource Locator (URL) which is used to link
together web documents. When modelling data with RDF, concepts, instances,
properties, and even data types are all named using URIs 22 .
The use of URIs enables reuse of common concepts and instances. Returning to
our example, we wish to create an unambiguous identifier for the node whose “name”
property connects to “Kevin Bacon”. In this case, for convenience we turn to a
canonical source of information related to movies, The Internet Movie Database
(IMDB) 23, and use the URL for this actor's details page:
http://www.imdb.com/name/nm0000102/. We do the same for the movie
“Footloose”. The modified RDF graph is presented in Figure 10.
Figure 10: Unambiguous meaning is given to concepts and instances through naming with
URIs
21 This is an abbreviation, so that “rdf:type” is to be interpreted as “the URI for the type predicate drawn from the ‘rdf’ vocabulary”. In practice this means taking the “rdf” namespace URI defined at the top of the document and concatenating it with the predicate, to form an actual URI. The URI for this predicate is thus the URL http://www.w3.org/1999/02/22-rdf-syntax-ns#type.
22 This is not strictly true. Nodes may also not have a name, in which case they are known as blank nodes.
23 http://www.imdb.com/
24 We note here that this namespace does not necessarily need to resolve to a web page. Rather, it is a scope in which to define terminological names used in the particular conceptualisation that we are defining.
separately evolve within an area of expertise, yet be used outside of their original
purview.
RDF supports integration of information published in separate documents by
merging together uniquely named nodes and arcs. Returning again to our example,
suppose we publish the graph representing “A person named Kevin Bacon starred in
the Movie Footloose” in one document, and in a similar document a graph representing
the statement “A person named Sarah Jessica Parker starred in the movie named
‘Footloose’”. How can we combine the information from the two RDF graphs?
In combining our “Kevin Bacon” graph with our “Sarah Jessica Parker” graph,
an RDF implementation will preserve uniqueness of nodes based on their identifiers,
and merge nodes with the same identifiers, leading to the graph in Figure 11:
Figure 11: RDF Graph representing statement “A Person named Kevin Bacon and a
Person named Sarah Jessica Parker starred in the Movie ‘Footloose’.”
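This merging behaviour can be reproduced with any RDF toolkit. The sketch below uses the Python rdflib library, rather than the JENA framework employed later in this dissertation, purely as an illustration; the property URIs follow the example above.

from rdflib import Graph, Literal, Namespace, URIRef

ISIMV = Namespace("http://www.isi.qut.edu.au/Movie/")
footloose = URIRef("http://www.imdb.com/title/tt0087277/")
bacon = URIRef("http://www.imdb.com/name/nm0000102/")
parker = URIRef("http://www.imdb.com/name/nm0000572/")

# Two separately published documents, modelled as two graphs.
g1 = Graph()
g1.add((bacon, ISIMV["name"], Literal("Kevin Bacon")))
g1.add((bacon, ISIMV["starredIn"], footloose))
g1.add((footloose, ISIMV["name"], Literal("Footloose")))

g2 = Graph()
g2.add((parker, ISIMV["name"], Literal("Sarah Jessica Parker")))
g2.add((parker, ISIMV["starredIn"], footloose))

# Merging: nodes with the same URI are identified, so the combined graph has a
# single Footloose node with two incoming starredIn edges.
merged = Graph()
merged += g1
merged += g2
print(len(merged))                      # 5 triples
print(merged.serialize(format="n3"))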
The graph model defined by the RDF, while useful both for description in
terms of model theory, and for visualisation, is not suited for publishing as it is an
abstract graph. For this reason, a serialization of RDF to XML was defined at the time
of the definition of RDF. The serialisation (and subsequent ones) are based on
converting the graphical representation into a set of 3-tuples (or triples), where each
25 Note that for brevity this graph is still not a valid RDF graph, as we have not yet given URI identifiers to the properties used in the graph.
triple is a unique (resource, property, value) tuple corresponding to two nodes joined by
an edge. The triples for the fragment of the graph beginning with the movie would be:
(http://www.imdb.com/title/tt0087277/, http://www.isi.qut.edu.au/Movie/name, "Footloose")
(http://www.imdb.com/title/tt0087277/, rdf:type, http://www.isi.qut.edu.au/Movie/Movie)
These two triples would map to the RDF/XML serialization presented below in
Table 2.
<?xml version="1.0"?>
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" >
  <rdf:Description rdf:about="http://www.imdb.com/title/tt0087277/"
      http://www.isi.qut.edu.au/Movie/name="Footloose">
    <rdf:type rdf:resource="http://www.isi.qut.edu.au/Movie/Movie"/>
  </rdf:Description>
</rdf:RDF>
<?xml version="1.0"?>
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:isimv="http://www.isi.qut.edu.au/Movie/"
         xml:base="http://www.isi.qut.edu.au/Movie/" >
  <rdf:Description rdf:about="http://www.imdb.com/title/tt0087277/"
      isimv:name="Footloose">
    <rdf:type rdf:resource="Movie"/>
  </rdf:Description>
</rdf:RDF>
An alternate and equivalent syntax for the above, which is tailored to declaring
instances based on their Class, is presented in Table 4:
Table 4: Alternative but semantically equivalent RDF syntax tailored to type definition
<?xml version="1.0"?>
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:isimv="http://www.isi.qut.edu.au/Movie/"
         xml:base="http://www.isi.qut.edu.au/Movie/" >
  <isimv:Movie rdf:about="http://www.imdb.com/title/tt0087277/"
      isimv:name="Footloose" />
</rdf:RDF>
Table 5: RDF/XML serialisation of statement “A Person named Kevin Bacon and a Person
named Sarah Jessica Parker starred in the Movie 'Footloose'."
<?xml version="1.0"?>
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:isimv="http://www.isi.qut.edu.au/Movie/"
         xml:base="http://www.isi.qut.edu.au/Movie/" >
The RDF/XML serialization of RDF has received criticism for being too
unwieldy and difficult to read, and for good reason. This has led to numerous efforts to
produce more human usable serializations. One of the earliest of these is the N3
serialization. Table 6 presents the same RDF graph above, serialized to N3 triples. In
this serialization, multiple property-value pairs may be associated with a single
definition of a resource. A number of shorthand terms are defined. In this example, we
see the “a” shorthand for “rdf:type” used.
imdb:name/nm0000102/ isimv:name "Kevin Bacon" ;
    isimv:starredIn imdb:title/tt0087277/ ;
    a isimv:Person .
imdb:name/nm0000572/ isimv:name "Sarah Jessica Parker" ;
    isimv:starredIn imdb:title/tt0087277/ ;
    a isimv:Person .
<?xml version="1.0"?>
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#"
         xmlns:owl="http://www.w3.org/2002/07/owl#"
         xml:base="http://www.isi.qut.edu.au/Movie/"
         xmlns:isimv="http://www.isi.qut.edu.au/Movie/" >
  <owl:Class rdf:ID="Person" />
  <owl:Class rdf:ID="Movie" />
  <owl:DatatypeProperty rdf:ID="name" >
    <rdfs:domain>
      <owl:Class rdf:about="#Person" />
    </rdfs:domain>
  </owl:DatatypeProperty>
  <owl:ObjectProperty rdf:ID="starredIn" >
    <rdfs:domain>
      <owl:Class rdf:about="#Person" />
    </rdfs:domain>
    <rdfs:range>
      <owl:Class rdf:about="#Movie" />
    </rdfs:range>
  </owl:ObjectProperty>
</rdf:RDF>
and 6, there is little published research investigating the wider role of knowledge
representation in computer forensics. Below is a survey of such related work.
Stephenson’s PhD dissertation work [127] focused on describing and
representing actual and ideal digital investigations, and validating digital investigation
processes against formal models. The representational approach taken by Stepenson
was to define a process language called the Digital Investigation Process Language
(DIPL), which employ’s Rivest’s S-Expressions as a syntax, defines a range of
vocabulary related to digital investigations, and finally, represents the digital
investigation as a process. Stephenson’s primary contribution is the proposal and
demonstration of the use of formal mathematical methods to validate actual
investigation procedures, described in the language, against investigative process standards
represented as formal Petri net models. While Stephenson's goal of describing
investigations is a shared motivation with the work described in this dissertation, we
have chosen to adopt a representational approach based on formal knowledge
representation roots. This is due to the insufficiency of S-Expressions in the areas of
schematic expressiveness and evolution, and integration of data from heterogeneous
sources.
Brinson et al. recently proposed a taxonomy scheme characterising the field of
cyber forensics 26 , which they called a Cyber forensics ontology [22]. This approach
takes a liberal interpretation of the meaning of the term ontology, avoiding a “formal
explicit description of concepts” [97] in the cyber forensics domain. The model
proposed organises concepts from the cyber forensics domain into a hierarchy; however
the relationships between concepts appear to be neither consistent nor specified.
Regardless of these omissions, this work could form a useful starting point for
developing formal ontologies related to cyber forensics.
Slay and Schulz [119] have, since the research described in this dissertation was
completed, employed ontologies as a means of describing a specific conceptualisation
of files and suspicious media content in a computer forensics application. Their
conceptualisation contains various categories of media files, and the idea of
suspiciousness, and is employed by a filesystem search application. While the work
concerns itself with categorising files as suspicious based on properties and
relationships of files (i.e. file size and topological proximity to other files), their
ontology does not attempt to encompass these concepts, presumably leaving these
concerns to implementation in code.
26 The term “cyber forensics” is not defined in this paper. On examination, the authors appear to be referring to the digital forensics field.
desirable in order that usability is not hampered by debate over terminology and
conceptual granularity.
It is well known that formal specification of systems aids implementation and
correctness. For example, formal specification of software has led to significant
outcomes in producing provably correct software. We propose that the field of
computer forensics would similarly benefit from a formal approach, but in this case a
formal approach to representing knowledge about investigations, and information
within the digital crime scene. Such a formal approach would form a middle ground
between machine and human understanding, by adopting a common language with
extensible vocabulary, clearly defined semantics, and a regular syntax. We summarise
below the attributes of the proposed approach in the computer forensics context, by
relating the advantages of modelling information using RDF/OWL identified by
Reynolds et al. in [110].
Integration of arbitrary information: The representation should employ a
simple and consistent model of data, in combination with a globally unique naming
scheme, in order that separately documented information may be easily combined into
a consistent and larger whole. Such a model of data and naming scheme will enable a
corpus of forensic evidence to be decomposed and composed. The benefits of this
relate to the volume problem, by enabling the sharing of evidence in small pieces and by
enabling scalable approaches to processing evidence (due to the elimination of large
shared resources such as databases as a container of information). The naming scheme
should enable arbitrary information to be expressed and arbitrary vocabulary terms to
be created, and in addition, enable reuse of existing vocabulary terms. Related to the
complexity problem, such a representation would enable addition of new types of
information to a corpus of evidence without need to modify existing tools.
Support for semi-structured data: The representation should allow
information to be represented in the data model without need for considering or
deciding upon a particular conceptual model a priori, in order that information may be
rapidly integrated, without becoming bogged down in issues of semantics. At a later
point semantics may be attached through relating entities to elements of an ontology.
The complexity problem is addressed by rapidly enabling integration of new and
arbitrary information.
Extensibility and resilience to change: The complexity problem in computer
forensics indicates that forensics tools must address ever increasing complexity. A
tool that exhibits backwards compatibility would, in light of this complexity, retain the
ability to interpret prior generations of information and models, despite changing
definitions of terminology over time. A representation that exhibits forwards
compatibility should be evolvable: existing tools should remain able to interpret newer
generations of information expressed in the representation. For example, if a new
storage technology is developed, an imaging application which operates on this kind of
storage may record further information related to the source of evidence. Existing tools
which operate over images must still be able to interpret the image despite the presence
of the new information.
Classification and Inference: Such a representation should enable describing
the world not only by names, but by relationships between entities, and inclusion in
classes of things. Such a representation should additionally be conducive to inferring
new knowledge based on existing knowledge regarding a concept’s relationships.
Provenance: A representation should enable the ability not only to express
information, but also to express information about where the information came from.
This, in particular, is important in the forensics context, where any facts identified must
be substantiated by evidence. This need for substantiation is a considerable burden in
computer forensics given the amount of natural language currently required to describe
these provenance issues.
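As a small illustration of this attribute, the fragment below records not only an extracted fact but also the image from which it was derived and the tool used, employing hypothetical vocabulary terms (extractedFrom, toolUsed) rather than any existing standard.

from rdflib import Graph, Literal, Namespace, URIRef

EX = Namespace("http://example.org/forensics/")   # hypothetical vocabulary
g = Graph()

extracted_file = URIRef("http://example.org/case1/file/42")
source_image = URIRef("http://example.org/case1/image/hdd1")

g.add((extracted_file, EX["fileName"], Literal("plans.doc")))
# Provenance: the extracted file is linked back to the image it was found in.
g.add((extracted_file, EX["extractedFrom"], source_image))
g.add((extracted_file, EX["toolUsed"], Literal("carving tool v1.0")))

print(g.serialize(format="n3"))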
The research described in Chapters 5 and 6 applies this proposed approach to
reducing the volume and complexity of events sourced from computer and network
event logs, and to easing the construction of corpora of digital evidence.
4.6 Conclusion
In Section 4.1 the motivation for knowledge representations was presented in
the context of documenting digital investigations. Section 4.2 described in broad brush
strokes the history of the field of knowledge representation, discussing in turn the goals
of the field. In short, the field attempts to answer the questions “How can we
express what we know?” and “How can we reason with what we express?”
A key theme which comes through on analysing the field is its preoccupation
with reasoning. Indeed, today the field tends to refer to itself as Knowledge
Representation & Reasoning, rather than simply Knowledge Representation. This
reflects a realisation that, when addressing the goals of artificial intelligence, both the
representational formalism and the model of intelligent reasoning impact factors such as
expressiveness, computational tractability and pragmatic usefulness. The field remains
an active research area today.
Section 4.3 described recent standardisation efforts on semantic information
markup, then indicated areas where the knowledge representation field has influenced
these efforts. Such efforts have multiple stakeholders advancing their varied research
and development agendas. In turn these vary from addressing the more lofty AI related
ambitions of a globally published knowledge base, to the more pragmatic, “soft AI”
goals of publishing information in a manner that it may be unambiguously interpreted
and further intermixed.
Section 4.4 described KR related work in the IT security and forensics fields,
and concluded that representation has to date been an implicit subject in forensics.
Section 4.5 puts forward the proposition that the field of forensics would
benefit from a formal approach to representation, both in documenting investigations
and in automating reasoning about evidence. The section summarises the scientific
premise motivating much of the work described in this thesis.
The next chapter investigates whether semantic web KR formalisms are
suitable for use as the basis for developing DF analysis tools.
Chapter 5. Event representation in forensic
event correlation
“Nature herself cannot err, because she makes no statements. It is
men who may fall into error, when they formulate propositions”
(Bertrand Russell)
Schatz, B., Mohay, G. and Clark, A. (2004) 'Rich Event Representation for Computer
Forensics', Proceedings of the 2004 Asia Pacific Industrial Engineering and
Management Systems (APIEMS 2004), Gold Coast, Australia.
correlation rules, correlation rule parser, event browser and the JENA semantic web
framework (see Figure 12).
Raw event logs are parsed into RDF event instances by the generic log parser
and inserted into the knowledge base. Correlation rules (expressed in a language called
FR3, which we describe later) are parsed into the native format of JENA, and applied to
event instances by the JENA inference engine. Investigators may interact with the
knowledge base containing the events and entity information using the event browser.
The components implementing the architecture are described in detail in the
following sections.
5.3 Implementation
5.3.1 The design of the event representation
Our ontology is rooted in two base classes, an Entity class that represents
tangible “objects” in our world, and an Event class that represents changes in state over
time.
At the time of this work, some guidance existed towards modelling time in the
form of the OWL-S time ontology [99]. This work, performed in the context of agent
oriented computing, was related to describing the time of occurrence of events both as
instants and durations, and the topological relationships related to this model of time.
This model of time however had no implementation of a temporal reasoner. For this
reason, we adopted a simple instant based model of time that assumes basic events to
happen at an instant of time. Our basic temporal ordering property is thus supported by
reasoning over the startTime owl:DatatypeProperty of the Event class. We assume a
<owl:ObjectProperty rdf:ID="causality">
<rdfs:range rdf:resource="#Event"/>
<rdfs:domain rdf:resource="#Event"/>
</owl:ObjectProperty>
Composite events are implemented by creating a new event of a more abstract,
generalised concept when an event pattern is successfully matched and a correlation
rule fires. For example, we define a DownloadExecutionEvent. This class is a
composite event; its semantics are that a user has executed content that was
previously downloaded, for example with a web browser or as an email attachment.
This event composes lower level events: a FileReceiveEvent and a subsequent
ExecutionEvent. The inter-instance and class/subclass relationships are depicted
below in Figure 13.
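To make the composition step concrete, the following Python sketch pairs a FileReceiveEvent with a subsequent ExecutionEvent over the same content to create a DownloadExecutionEvent. It is illustrative only; the prototype performs this matching with FR3 rules over the JENA knowledge base, and the field names used here are simplified assumptions.

from datetime import datetime, timedelta

# Simplified event records; in the prototype these are RDF/OWL instances in the
# knowledge base (the field names used here are illustrative assumptions).
events = [
    {"type": "FileReceiveEvent", "content": "nastytool.exe",
     "startTime": datetime(2002, 3, 4, 20, 31)},
    {"type": "ExecutionEvent", "content": "nastytool.exe",
     "startTime": datetime(2002, 3, 4, 20, 40)},
]

def correlate_download_execution(events, window=timedelta(hours=1)):
    """Create a DownloadExecutionEvent whenever received content is later executed."""
    composites = []
    received = [e for e in events if e["type"] == "FileReceiveEvent"]
    executed = [e for e in events if e["type"] == "ExecutionEvent"]
    for recv in received:
        for run in executed:
            gap = run["startTime"] - recv["startTime"]
            if recv["content"] == run["content"] and timedelta(0) < gap <= window:
                composites.append({
                    "type": "DownloadExecutionEvent",
                    "startTime": recv["startTime"],
                    "causality": [recv, run],  # links back to the lower level events
                })
    return composites

print(correlate_download_execution(events))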
<Win32ConsoleLoginEvent rdf:ID="loginInstance1">
<startTime rdf:datatype="http://www.w3.org/2001/XMLSchema#dateTime"
>2002-03-04T20:30:00Z</startTime>
<user rdf:resource="#user1" />
<host rdf:resource="#host1" />
</Win32ConsoleLoginEvent>
<Win32DomainAccount rdf:ID="user1">
<userName>jbloggs</userName>
<domain>DSTO</domain>
</Win32DomainAccount>
<Win32Host rdf:ID="host1">
<hostName>s3</hostName>
</Win32Host>
Notable points here are the usage of XML Schema (XSD) for encoding the
time of the event in the representation, and the use of RDF resource references to link
the event with instances of entities representing the user and the host. The lack of
precision caused by the omission of a year in syslog based timestamps is addressed by
a declarative feature at the event parser layer to skew time data to the correct year. This
may be used to address timing irregularities where the irregularity is quantifiable and
regular.
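A sketch of how such a declarative correction might operate at the parser layer follows. The syslog line and the form of the correction table are illustrative assumptions rather than the prototype's actual parser specification syntax.

from datetime import datetime

# Declarative corrections declared per log source: syslog omits the year, so the
# parser is told which year to assume (an illustrative mechanism, not FORE syntax).
SOURCE_CORRECTIONS = {"hostA-syslog": {"assume_year": 2004}}

def parse_syslog_timestamp(line, source):
    """Parse a year-less syslog timestamp and apply the declared year correction."""
    # e.g. "Mar  4 20:30:00 hostA sshd[123]: Accepted password for jbloggs"
    stamp = " ".join(line.split()[:3])
    parsed = datetime.strptime(stamp, "%b %d %H:%M:%S")
    year = SOURCE_CORRECTIONS[source]["assume_year"]
    return parsed.replace(year=year)

print(parse_syslog_timestamp(
    "Mar  4 20:30:00 hostA sshd[123]: Accepted password for jbloggs",
    "hostA-syslog"))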
A further simplification is that we assume the integrity of the event sources has
not been compromised.
Namespace support
FR3 has specific support for XML namespaces and resource identifiers.
Resource identifiers follow the OWL standard of URIs [15]. Namespaces are declared in a
clause of the form abbreviation := namespace, as follows:
fore :="http://www.isrc.qut.edu.au/fore#".
The usage of namespaces (along with a number of OWL features we will not
discuss here) enables integration of concepts from separate ontologies. The fore
namespace declaration resolves the concepts used in FR3 rules to concepts specified in
the FORE forensic ontology. Similarly, an RDF namespace declaration enables FR3
rules to reason over type information.
Object shorthand
This is a convenient form of syntax which, in the head of a rule, enables the
assignment of values to the properties of an object without repeated mention of the
object. The equivalent form of these clauses in JENA's rule language would be:
object.property = value;
object.property2 = value2;
In the tail of a rule this form is interpreted as an equality test, whereas in the
head of a rule it is interpreted as variable assignment.
Heuristic rules
The antecedents of a rule (also known as the tail of the rule) must occur in the
knowledge base for the IF part of the rule to be satisfied. Variables are introduced by
including a question mark at the beginning of an identifier.
Reasoning
The JENA toolkit is employed as the knowledge base, RDF/OWL parser, and
reasoner. We use the RETE [42] based forward chaining reasoning engine to
implement our rule language. The RETE algorithm is a speed-efficient but space-expensive
pattern-matching algorithm with a long history of use in expert systems and
rule languages.
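The RETE network itself is too involved to reproduce here, but a naive forward chaining loop conveys the evaluation model that the rule engine implements. The sketch below is illustrative Python; the rule shown (relating a user to a host named in the same event) is a hypothetical example, not one of the FR3 rules.

def forward_chain(facts, rules):
    """Naive forward chaining: fire rules until a fixed point is reached (not RETE)."""
    facts = set(facts)
    while True:
        new = set()
        for rule in rules:
            new |= rule(facts) - facts
        if not new:
            return facts
        facts |= new

# Hypothetical rule: if an event names both a user and a host,
# assert that the user was active on that host.
def user_active_on_host(facts):
    users = {(e, u) for (e, p, u) in facts if p == "user"}
    hosts = {(e, h) for (e, p, h) in facts if p == "host"}
    return {(u, "activeOn", h)
            for (e1, u) in users for (e2, h) in hosts if e1 == e2}

facts = {("login1", "user", "jbloggs"), ("login1", "host", "s3")}
print(forward_chain(facts, [user_active_on_host]))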
At the time of this work, JENA's OWL implementation did not support rules
matching on all subtypes of an abstract type. For example, a rule matching events of
class LoginEvent would not fire when a Win32TerminalLoginEvent was added to the
knowledge base. The machinery of JENA that implemented the semantics of OWL type
hierarchy inference relied on a hybrid implementation involving both forward (RETE)
and backward chaining (Prolog-like) reasoners. The inferred types of a
Win32TerminalLoginEvent were not available as facts to the forward chaining rule
engine, as they were only computed backwards in response to a query. The OWL
implementation of JENA was modified to pre-compute the type hierarchy information
using the RETE engine so that these facts were available.
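The essence of that modification is to materialise the subclass closure as explicit rdf:type facts before the forward rules run. The following sketch illustrates the idea only; the class names are drawn from the prototype ontology, while the code itself is illustrative rather than the JENA implementation.

# Asserted subclass relationships (a subset of the prototype event ontology).
subclass_of = {
    "Win32TerminalLoginEvent": "LoginEvent",
    "Win32ConsoleLoginEvent": "LoginEvent",
    "LoginEvent": "Event",
}

def supertypes(cls):
    """All ancestors of a class, following subClassOf transitively."""
    ancestors = []
    while cls in subclass_of:
        cls = subclass_of[cls]
        ancestors.append(cls)
    return ancestors

def materialise_types(facts):
    """Add an rdf:type fact for every inferred supertype of each typed instance."""
    inferred = set(facts)
    for (subj, pred, obj) in facts:
        if pred == "rdf:type":
            for ancestor in supertypes(obj):
                inferred.add((subj, "rdf:type", ancestor))
    return inferred

print(materialise_types({("login1", "rdf:type", "Win32TerminalLoginEvent")}))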
All of the event classes mentioned refer to concepts defined in the document
identified in the fore: namespace declaration.
Event browser
The GUI event browser provides a number of methods of interacting with the
events in the knowledge base. Two views form the basis of the user interface, the event
causality view, and the entity view.
The event causality view provides a display for all events matching a certain
context, and displaying the properties of each event in a drill down manner. It further
provides means to drill down, following the causal ancestry of a sequence of events.
We implement a simple query interface for finding sets of event instances based on
type and property values.
The entity view presents all entities identified in the event base, along with
their properties. Entities selected in this view may be used as the basis of a query of all
related events. The Entity View provides an operation enabling the investigator to
hypothesise an identity equivalence relationship between otherwise distinct instances.
This is discussed in Section 5.4.1.
known value. The various heterogeneous event logs from which each event is sourced
are identified in parentheses.
In this scenario, a user either noticing the server rebooting, or a user being
unable to log in as Administrator would most likely alert the investigator. While it is
likely the attacker would cover their tracks by deleting the event log, one still can
envisage finding the log entries either via forensic analysis of the disk where the event
log is located, or from a secured log host.
This scenario sourced events from the following security log types:
We are not focusing on cross-domain correlation, so will not address the door
log in this case.
Meaning: Match an event instance of class Win32RebootEvent with an event instance of class
DownloadExecutionEvent that occurs before it.
If the user which caused these events is not the Administrator user, and the latter
two events occurred within 10 minutes of each other, create a new OSExploitEvent,
and link its causality property to the DownloadExecutionEvent.
This method of investigation will only succeed in cases where the name of the
client computer in the Apache web server logs is the same as the name of the
computer in the related Windows security log events. This is often not the case, as
Windows uses a host name in security log events, whereas Apache, by default, uses IP
addresses. We discuss below a method of addressing this shortcoming.
between surrogates. For example, one might hypothesise that separate individual entities
represent the same individual.
Consider the following unrelated sets of correlated events in Figure 15. The
previous web server related rule in Table 8 would not correlate the
ApacheWebFileDownloadEvent with the Win32LoginSessionEvent, as the surrogate
hosts (the client in the web log and the host in the LoginSessionEvent) are not the same.
Similarly, in an unrelated scenario where we were interested in correlating remote shell
sessions with the activities on a client computer, the SSHPasswordAuthenticationEvent
would not correlate with the Win32ProcessCreationEvent that executed the SSH client,
putty.exe.
If the investigator looks in the entities view of the event browser he will see all
of the individual entities that have been identified from the event logs, including the
Host with IP address 131.181.6.167 and another host with name “DSTO”. Through
examining DNS logs, or by other means, the investigator may hypothesise that the Host
“DSTO” and the Host with IP address 131.181.6.167 are in fact the same host. The
investigator can select the two entities and invoke the sameAs operation on the two.
The individual entities representing the two Hosts are now treated by the OWL
implementation as one single entity, a single instance of Host that combines all
properties of the prior two.
The sameAs operation relies on the underlying semantics of the OWL
individual equivalence mechanism, owl:sameAs. This language feature may be used to
state that seemingly different individuals (or classes) are actually the same. This single
(now merged) individual will now suffice to fire the WebFileDownload-LoginSession
causality rule discussed previously, and causally correlate the
ApacheWebFileDownloadEvent to the Win32LoginSessionEvent, via a different rule as
shown in Figure 16. Additional rules may now correlate the connection initiated by the
“putty” SSH client to another UNIX host.
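The merging behaviour itself is provided by the OWL reasoner's owl:sameAs semantics. The following Python sketch approximates the observable effect outside of a reasoner, mapping each surrogate to a canonical individual whose properties are the union of those asserted against either surrogate; it illustrates the semantics rather than the prototype's implementation.

class SurrogateStore:
    """Approximates owl:sameAs by mapping surrogates onto a canonical individual."""

    def __init__(self):
        self.canonical = {}   # surrogate id -> canonical id
        self.properties = {}  # canonical id -> merged property dict

    def canonical_id(self, entity):
        return self.canonical.get(entity, entity)

    def assert_properties(self, entity, props):
        self.properties.setdefault(self.canonical_id(entity), {}).update(props)

    def same_as(self, a, b):
        """Merge two surrogates into one individual carrying the properties of both."""
        ra, rb = self.canonical_id(a), self.canonical_id(b)
        if ra == rb:
            return
        self.properties.setdefault(ra, {}).update(self.properties.pop(rb, {}))
        for surrogate, root in list(self.canonical.items()):
            if root == rb:
                self.canonical[surrogate] = ra
        self.canonical[rb] = ra

store = SurrogateStore()
store.assert_properties("host_DSTO", {"hostName": "DSTO"})
store.assert_properties("host_131.181.6.167", {"ipAddress": "131.181.6.167"})
store.same_as("host_DSTO", "host_131.181.6.167")
print(store.properties[store.canonical_id("host_131.181.6.167")])
# {'hostName': 'DSTO', 'ipAddress': '131.181.6.167'}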
With these causal links correlated, the rules for the OSExploitEvent will now
be satisfied, driving the creation of an OSExploitEvent and its subsequent display in
the event view of the user interface.
is a bootable image file that, in this case, contains a utility to wipe the Windows
administrator password, resetting it to a known value.
Both the ECF and FORE approaches support querying event data in the
exploration of a hypothesis. FORE differs from ECF in that the underlying event base,
in the case of FORE, contains many linkages that have been inferred by event
correlation rules at the time events are loaded into the system. These causal linkages
enable an investigator to explore simple relationships without manual inference. ECF
however, contains none of these causal linkages. Following causal linkage in ECF
involves the operator inferring the linkages, and expressing the conditions required in
SQL queries. For example, for the scenario explored previously, the investigator would
have to write a series of SQL queries to successively narrow the set of events in
question. Investigation using ECF thus requires considerably more human inference
than using FORE.
FORE adds investigation features not found in ECF. In addition to providing a
general purpose KR framework and a complementary event correlation language, our
approach introduces the ability to hypothetically unify entities of equivalent
identity, further enhancing the effectiveness of existing rule based correlation
27 http://www.sap.com/
4. P logs into the SAP application as Q, which fails (SAP Security Audit
Log)
5. P logs off the Windows workstation (Win32 System Log)
Detection of this scenario could indicate a user mistyping their username or
password. It could, however, also indicate a user attempting (or succeeding) to log in
as another user, or to an account which they are not authorised to use. Persistent
recurrence of this event could indicate the user methodically guessing the
password of another user.
<nerd:Host>
<nerd:hasInterface>
<nerd:Interface>
<nerd:hasIPSetup>
<nerd:IPSetup>
<nerd:hasIPAddress>
<nerd:IPAddress>
<nerd:ipaddress
>131.181.6.3</nerd:ipaddress>
</nerd:IPAddress>
</nerd:hasIPAddress>
</nerd:IPSetup>
</nerd:hasIPSetup>
</nerd:Interface>
</nerd:hasInterface>
</nerd:Host>
This introduces many more entities into the system per log entry, which could
quickly overload the information conveyed in the entity view. In response to this, we
modified the presentation layer of our GUI to only present the outermost enclosing
instance, with the child properties connected to anonymous instances represented as
path elements. For example, in our entity viewer, we would represent the Host above
as:
[hasInterface.hasIPSetup.hasIPaddress.ipaddress=131.181.6.3]
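A sketch of the flattening performed by the presentation layer is given below; the nested dictionary stands in for the anonymous RDF instances, and the code is illustrative rather than the prototype's GUI code.

def path_elements(instance, prefix=""):
    """Flatten nested anonymous instances into dotted property-path strings."""
    paths = []
    for prop, value in instance.items():
        key = f"{prefix}.{prop}" if prefix else prop
        if isinstance(value, dict):
            paths.extend(path_elements(value, key))
        else:
            paths.append(f"{key}={value}")
    return paths

host = {"hasInterface": {"hasIPSetup": {"hasIPAddress": {"ipaddress": "131.181.6.3"}}}}
print("[" + ", ".join(path_elements(host)) + "]")
# -> [hasInterface.hasIPSetup.hasIPAddress.ipaddress=131.181.6.3]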
The door log entries contain the date, time, card id, name of assigned owner,
the door name, and the zone. In our case, the door is named by both the room it controls
access to and the building containing the room. Integrating this knowledge into our
prototype first involves identifying the concepts implicit in the event log data, and then
determining an appropriate place for the concepts in our ontologies.
As we wish to represent Rooms and Buildings, we hook our Room concept in
by inheriting from the SOUPA class SpacedInAFixedStructure. Similarly, we inherit
Building from FixedStructure. We hooked a DoorEvent into our existing ad hoc event
ontology by inheriting it from our existing Event class. We next wrote an event parser
specification specific to the door logs, which matches the door log syntax and declares
the OWL instances necessary to represent a door entry. Below we present an
example door log event, as created by the parser:
<fore:DoorEvent>
<fore:building>
<fore:Building rdf:ID="building1">
<spc:name>GP. S BLOCK</spc:name>
</fore:Building>
</fore:building>
<fore:room>
<fore:Room rdf:ID="room0">
<spc:name>GP. S BLOCK RM S826A</spc:name>
<spc:spatiallySubsumedBy>
<fore:Building rdf:about="#building1"/>
</spc:spatiallySubsumedBy>
</fore:Room>
</fore:room>
<fore:user>
<fore:DoorSwipeCard rdf:ID="doorcard1">
<fore:cardID>42281</fore:cardID>
<fore:name>RICCO LEE</fore:name>
</fore:DoorSwipeCard>
</fore:user>
<fore:startTime rdf:datatype="http://www.w3.org/2001/XMLSchema#dateTime"
>2004-03-04T20:30:00Z</fore:startTime>
</fore:DoorEvent>
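The parser specification itself is declarative; the following Python sketch conveys the equivalent procedural behaviour for a single door log line. The line layout assumed here is hypothetical, as is the simplified output structure (the prototype emits OWL instances such as the one above).

import re
from datetime import datetime

# Hypothetical door log line layout: date, time, card id, card holder, door, zone.
LINE = "04/03/2004 20:30:00 42281 RICCO LEE,GP. S BLOCK RM S826A,ZONE 3"
PATTERN = re.compile(
    r"(?P<date>\S+) (?P<time>\S+) (?P<card>\d+) (?P<name>[^,]+),(?P<door>[^,]+),(?P<zone>.+)")

def parse_door_event(line):
    """Match the (assumed) door log syntax and build a simplified DoorEvent record."""
    fields = PATTERN.match(line).groupdict()
    return {
        "type": "DoorEvent",
        "startTime": datetime.strptime(
            fields["date"] + " " + fields["time"], "%d/%m/%Y %H:%M:%S"),
        "room": fields["door"],
        "card": {"cardID": fields["card"], "name": fields["name"]},
    }

print(parse_door_event(LINE))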
SAP Security Audit Logs record, among other things, the success or failure of
logins to SAP, along with the date and time of the event, and the host (or in SAP
terminology, terminal) that the user attempted to login from. Addition of SAP related
events specific to our scenario required the addition of the following new concepts to
our ontology (presented in Table 10):
Class: Meaning
ServiceAuthenticationEvent: Authentication of a user to a resource; specifically, a resource that is a service.
SAPAuthenticationEvent: Authentication of a user by SAP (login success or failure). Inherits ServiceAuthenticationEvent.
SAPClientLoginSuccessEvent: Successful login to SAP. Inherits SAPAuthenticationEvent.
SAPClientLoginFailureEvent: Unsuccessful login to SAP. Inherits SAPAuthenticationEvent.
SAPClientProcessCreationEvent: The SAP client program has been run on a client terminal.
IdentityMasqueradeEvent: Multiple login names have been used to access a service from the context of a single login account.
higher level abstraction which represents a user’s interactive login session on a host. In
Table 11 we present a correlation rule in our language FR3, which detects instances of
this scenario:
views which limit the set of events displayed to higher level concepts closer to the
concerns and vocabulary of the investigator. The user interface enables the investigator
to “drill down” to the events which caused it. In this example, the
MultipleIdentitiesUsedEvent has causal links to the LoginSessionEvent and the
SAPAuthenticationEvent that triggered its creation.
In Figure 17 we present a graph of events that correspond to the scenario,
which can be explored by an investigator using the drill-down feature of the interface.
The causal relationships correlated by the rules above are presented in bold.
Other links are correlated by rules not presented here.
[Figure 17: event graph for the scenario. An IdentityMasqueradeEvent is linked to a
SAPClientProcessCreationEvent (user=P, host=F) and a SAPClientLoginSuccessEvent
(user=Q, terminal=F), which are in turn linked to a LoginSessionEvent (user=P), its
constituent TerminalLoginEvent and TerminalLogoutEvent (user=P), and a DoorEvent
(user=P).]
In our test environment, like many real world deployments of SAP, the SAP
username is not necessarily the same as the OS username for the same user. The
preceding rule presented in Table 11 resulted in many false positives, as the test for
inequality fires the rule for minor differences in username. For example, “jsmith” and
“j.smith” are treated as separate users.
In this case the surrogate proliferation problem creates false positives. To
resolve this, we explicitly select the users in question and indicate that they should be
treated as representing the same thing, again using the sameAs functionality provided
by the OWL semantics. As a result, MultipleIdentitiesUsedEvents based on this kind of
identity failure are removed from the knowledge base and event viewer. This approach
to hypothetically resolving identity between a user identified from a door log and a
user identified in a login similarly allowed us to causally correlate door logs with
logins to computers.
5.6 Conclusion
The FORE prototype holds the promise of collaborative development of
correlation rules that correlate events across and within domains, reducing the amount
of manual inference and query tasks, and assisting in interactive investigation. At a
higher level, we have demonstrated that correlation rules can automatically correlate
whole forensic scenarios without interactive investigation by human operators.
The four contributions of this chapter are aligned with themes of representation
and analysis techniques.
Firstly, the work investigates whether the RDF/OWL formalism is a useful
general representation upon which a digital forensics application requiring a wide
representational scope might be built. An experimental result of case study 1 (Section
5.4) is that RDF/OWL is a useful formalism for representing low level
computer security and systems related events, composite and abstract events related to
higher order suspicious situations, and the entities referred to in those events. The instance
model of RDF enabled the definition of a surrogate per event and entity, and the class
based model of abstraction enabled ascribing semantics to these event and entity
instances. The experiment demonstrates that the representation is of use in addressing
the complexity problem by enabling integration of arbitrary information from various
computer security and systems event logs.
The RDF/OWL representation is not, however, sufficiently expressive
to describe heuristic knowledge involving complex relationships with temporal
constraints, instance matching, and the declaration of new property values or new
instances. This necessitated shifting outside the knowledge representation and
employing a rule language (FR3) for these purposes.
In case study 2 (Section 5.5) we have demonstrated that the representation is
extensible and generalisable to support reasoning across multiple heterogeneous
domains. We do so by successfully applying the prototype to a forensic scenario that
involves both ERP security transaction logs, and door logs, in addition to computer
security logs such as those which we have considered in our previous efforts.
Furthermore, we demonstrate that our approach can scale, by supporting the separate
development and subsequent integration of domain models, event parsers, and
correlation rules, by experts in their respective domains.
In this case, however, we addressed integration of information with differing
ontological commitments, by integrating information modelled by an existing network
intrusion detection related ontology, in addition to events sourced from the enterprise
resource planning system, SAP. The extensibility of the representational approach was
demonstrated by the ease with which an existing domain model was integrated into our
prior prototype.
The RDF/OWL language alone was not, however, expressive enough to
provide the language tools to address the areas of impedance mismatch between our
prior (ad-hoc) ontology and the intrusion related ontology. In this case the mismatch
was resolved by modifying our existing ontology and heuristics to operate at the same
level of granularity and commitments of the new ontology. An alternate approach
would have been to adopt rules to bridge across these mismatches.
In practice, the approach and implementation described carried with it an
ontological commitment which focused on modelling of situations and entities,
simplifying the subtle relationships between events and their occurrent time. This
simplified model of time carries with it the assumption that all of the clocks on the
separate machines are synchronised. While network time infrastructure such as NTP
facilitates synchronisation of computer clocks down to the millisecond, we expect that
in practice all but the simplest of forensic investigations will involve multiple computer
time sources in various states of de-synchronisation. Further work is required in
adapting event correlation techniques to work with models of time which incorporate
notions of multiple independent timelines, hypothetical specifications of clock
timescale behaviour, and automated methods for identifying the temporal behaviour of
computer clocks from event logs. This last theme is investigated further in Chapter 7.
An additional time related simplification is the embedded assumption that the
attribute values of entities remain invariant over time, whereas in reality attribute values
vary over time. For example, it may be broadly true that a particular person's name
remains the same over the period of their life; this assumption fails to hold, however,
when one considers events such as marriage, or an officially sanctioned name change via
deed poll. Models of entity attribute values which account for different values over
time require further investigation.
Another limitation of the prototype described here is that it eschews
maintenance of provenance information. The parser, in its current state, does not record
the source of the event instances that it generates. Further, the rule engine does not
record which successful rule firings led to which new inferred composite events.
Documentation of both of these is important, as any automated conclusions must be
verifiable and traceable back to the original evidence.
The second contribution of this chapter is the demonstration of a novel analysis
technique for automated detection of a computer forensic situation, based upon
information automatically derived from digital event logs. We present a heuristic rule
based approach that has the ability to manage the scalability and semantic issues arising
in such inter-domain forensics.
Such rule based approaches have a number of shortcomings. While abstraction
goes some way towards reducing the number of rules required for automated detection,
rules must still be authored by experts. Research into automated means of identifying
potential rules and associations is warranted; approaches such as data mining hold
promise. Furthermore, rules are by nature crisp in their definition, precluding
incorporation of fuzzy concepts. For example, in the OSExploit detection rules, we
implied a causal relationship by requiring that the Win32RebootEvent and LoginEvent
be within 10 minutes of each other, under the hypothesis that an attacker would operate
quickly, and to avoid the complication of the rule matching every login after a reboot.
Intuitively, the further apart the events which correlate to the OSExploitEvent are from
each other, the less likely they are to be causally related. Where one draws the
line on the relatedness of two events of these types is by nature subjective, and could
benefit from techniques which acknowledge this.
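One direction such techniques might take, sketched here purely as an illustration and not part of the prototype, is to replace the crisp window with a relatedness score that decays as the separation between the two events grows:

from datetime import timedelta

def relatedness(delta, half_life=timedelta(minutes=10)):
    """Confidence that two events are causally related, halving every half_life."""
    if delta < timedelta(0):
        return 0.0
    return 0.5 ** (delta / half_life)

print(relatedness(timedelta(minutes=5)))   # ~0.71
print(relatedness(timedelta(minutes=30)))  # 0.125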
The third contribution is the identification of a novel means of resolving the
problem of surrogate proliferation in interpreting names in event logs, which is
described in Section 5.4.1. Surrogate proliferation refers, in this case, to the problem
which arises from a single real (or virtual) world entity having multiple names by
which it is referred to, which leads to the necessary creation of one surrogate per name.
For example, while some event log entries (such as those taken from firewall logs) may
describe events related to a host by referring to its IP address, other event log entries
may refer to the same entity by its DNS name. This abundance of multiple names for
the same entity grows the quantity of entities which must be considered in interpreting
and correlating event logs.
This problem of surrogate proliferation can be observed throughout the digital
forensics domain. The event normalising task in Stephenson’s End to End Digital
Investigation (EEDI) methodology [127] refers to this problem in the context of
resolving records referring to the same network event being received from multiple
sensors. Similar problems may be observed in ascribing identity to particular versions
of files (e.g. operating system files) found across multiple digital crime scenes.
The technique addressing this problem (described in Section 5.4.1) exploits a
general feature of the RDF/OWL formalism: the owl:sameAs language term and its
associated OWL defined semantics. That the general reasoning machinery of the
knowledge representation is employed to solve this problem demonstrates the
Chapter 6. Sealed digital evidence bags
The previous chapter proposed and demonstrated the use of formal knowledge
representation in automating correlation of digital event oriented evidence, to facilitate
identifying situations of interest from heterogeneous and disparate domains. This
chapter addresses themes of representation and assurance in addressing how forensics
tools might scale and interoperate in an automated fashion, while assuring evidence
quality. The chapter considers the problem of sharing of digital evidence between tools
or even more widely, between organisations.
The chapter is structured as follows. Section 6.1 introduces the problem of
digital evidence storage formats, the related literature for which is described in Section
3.2. Section 6.2 enumerates a number of definitions of terms related to digital evidence
and related documentary artefacts. Section 6.3 proposes a novel integrated storage
container architecture and KR based information architecture for digital evidence bags,
which we call sealed digital evidence bags (SDEB). This approach supports arbitrary
composition of evidence units, and related information, into a larger corpus of evidence.
Section 6.4 describes the compositional nature of the architecture in the context of a
usage scenario: building digital forensics tools and acquiring digital evidence from hard
disks. Section 6.5 describes experimental results validating the compositional nature of
the prototype approach, and Section 6.6 presents the conclusions of the chapter and
relates opportunities for future work.
The research work described in this chapter has led to the publication of the
following paper:
6.1 Introduction
The rapid pace of innovation in digital technologies presents substantial
challenges to digital forensics. New memory and storage devices and refinements in
existing ones provide constant challenges for the acquisition of digital evidence. The
proliferation of competing file formats and communications protocols challenges one’s
ability to extract meaning from the arrangement of ones and zeros within. Overarching
these challenges are the concerns of assuring the integrity of any evidence found, and
reliably explaining any conclusions drawn.
Researchers and practitioners in the field of digital forensics have responded to
these challenges by producing tools for acquisition and analysis of evidence. To date,
these efforts have resulted in a variety of ad hoc and proprietary formats for storing
evidence content, analysis results, and evidence metadata, such as integrity and
provenance information. Conversion between the evidence formats utilized and
produced by the current generation of forensic tools is complicated. The process is time
consuming and manual in nature, and there exists the potential that it may produce
incorrect evidence data, or lose metadata [30].
It is with these concerns in mind that calls have been made for a universal
container format for the capture and storage of digital evidence. Recently, the term
“Digital evidence bags” was proposed to refer to a container for digital evidence,
evidence metadata, integrity information, and access and usage audit records [135].
Subsequently, the DFRWS formed a working group with a goal of defining a
standardised Common Digital Evidence Storage Format (CDESF) for storing digital
evidence and associated metadata [30]. For further background on digital evidence
container formats, see Section 3.2.1.
Another source of complications related to the ad hoc nature of forensic tools is
the absence of a common representational format for Investigation Documentation.
This includes a number of generally related classes of information, such as Continuity
of Evidence, Provenance, Integrity, and Contemporaneous Notes (see Section 2.2). This
is not a trivial problem owing to the nature of the forensics domain, which deals with
massive conceptual complexity within multiple layers of abstraction. The challenge
here is to identify a means that decouples the evidence container formats and
investigation documentation used by forensics tools from the implementation logic of
these tools. Furthermore, this needs to be accomplished in a manner that facilitates the
assurance of provenance and maintains integrity.
This problem of evidence representation is not simply limited to the challenge
of tool interoperability. In outlining the “Big Computer Forensic Challenges”, Spafford
observes that practitioners and researchers in the field of digital forensics do not use
standard terminology [98], and indeed it is clear that there is limited attention paid to
the formal definition of taxonomies or ontologies describing this domain.
We propose the use of ontologies in addressing these terminological and
representational problems. We have produced a number of basic ontologies modelling
the domain of digital evidence acquisition, computer hardware, and networks, and
described these ontologies using the Web Ontology Language (OWL). In combination
with semantic markup languages such as RDF, ontologies encourage knowledge
sharing and reuse within a domain, which has the potential to lead towards a
convergence of vocabulary in the forensics domain.
In this chapter we propose an extensible architecture for integrating digital
evidence by applying an ontology based approach to Turner’s digital evidence bags
concept. We enumerate the representational requirements for the investigation
information component of an open common digital evidence storage format, and
formalise the domain by describing it with an ontology. An architecture for digital
evidence bags is demonstrated which facilitates modular composition of forensic tools
by way of an extensible information architecture. Further, a novel means of identifying
digital evidence, and digital evidence bags is proposed which supports arbitrary
referencing of information within and between digital evidence bags. The proposal
modifies Turner’s design to strengthen evidence assurance, proposing a sealed
(immutable) bag metaphor.
6.2 Definitions
Our concerns involve representation and terminology. To avoid confusion, the
following terms used throughout the chapter, and in our digital evidence ontology, are
defined below. As the subject is digital evidence, we omit the use of the word digital in
our definitions.
Continuity of Evidence Documentation: Information maintained to track
who has handled evidence since it was preserved.
Digital Evidence: A term which loosely refers to a related set of Evidence
Content or Secondary Evidence and Investigation Documentation.
Evidence Content: Stream of bytes of computer data: typically data which is
stored in a file, or a stream of a file, or in raw storage, such as the ordered sectors of a
disk.
Evidence Content File: A file containing evidence content.
Image: A contiguous sequence of bytes, which is a copy of a digital crime
scene.
by recursively embedding digital evidence bags within digital evidence bags, as well as
by intra-bag reference, which we depict in Figure 18. We call the architecture the
Sealed Digital Evidence Bag (SDEB) in reference to Turner's proposal of the DEB.
The storage container architecture describes how data streams containing data
objects, investigation documentation, and evidence bag documentation are contained in
one archive.
Sealable digital evidence bags follow a similar structure to Turner's bags. The
key difference is the use of RDF/XML to represent the Tag and Investigation
Documentation related information, in order to facilitate an interoperable
representation. The Tag File of any digital evidence bag is called Tag.rdf. The naming
of the Investigation Documentation files is tool or user determined; however, the
extension is .rdf to signify that the format of the file is RDF.
The RDF/XML format does not support recursive definition of RDF/XML
content within the content of another RDF/XML content block, and makes no provision
for arbitrary text outside the XML syntax. This leads us to maintain
integrity information regarding the content of the Tag in a file external to the Tag,
unlike the DEB proposal. Turner's DEB uses an onion like approach where a hash of
the previous contents of the Tag is recursively appended to the Tag. We instead define
a Tag Integrity File, called Tag.rdf.sig, which contains integrity information pertaining
to the Tag.
Sealable digital evidence bags are designed to be created and populated with
evidence and investigation documentation, then sealed exactly once. The Tag of an
SDEB is immutable after the Tag Integrity File has been added to the SDEB; before
that point the bag is unsealed and mutable.
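The sealing step itself is mechanically simple. A minimal Python sketch follows, using the file names defined above and SHA-1 as elsewhere in this chapter; a production implementation would use whatever digest or signature scheme underlies the Tag Signature.

import hashlib
from pathlib import Path

def seal_bag(bag_dir):
    """Seal an SDEB by writing a digest of Tag.rdf into Tag.rdf.sig."""
    tag = Path(bag_dir) / "Tag.rdf"
    digest = hashlib.sha1(tag.read_bytes()).hexdigest()
    (Path(bag_dir) / "Tag.rdf.sig").write_text(digest + "\n")
    return digest

def verify_seal(bag_dir):
    """Check that Tag.rdf has not changed since the bag was sealed."""
    tag = Path(bag_dir) / "Tag.rdf"
    recorded = (Path(bag_dir) / "Tag.rdf.sig").read_text().strip()
    return hashlib.sha1(tag.read_bytes()).hexdigest() == recorded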
The structure of the SDEB is presented in Figure 19.
jbloggs.history.MSHist012006010420060105.index.dat.rdf
jbloggs.history.MSHist012006010420060105.index.dat
jbloggs.history.MSHist012006010320060104.index.dat.rdf
jbloggs.history.MSHist012006010320060104.index.dat
jbloggs.history.MSHist012005121220051219.index.dat.rdf
jbloggs.history.MSHist012005121220051219.index.dat
jbloggs.history.MSHist012005121920051226.index.dat.rdf
jbloggs.history.MSHist012005121920051226.index.dat
jbloggs.cache.index.dat.rdf
jbloggs.cache.index.dat
jbloggs.history.index.dat.rdf
jbloggs.history.index.dat
Tag.rdf
Tag.rdf.sig
Recalling that in RDF (see Section 4.3.1), Subjects, Predicates and Objects are
named using a URI, we use a special category of URI called a Uniform Resource Name
(URN) [86] for identifying digital evidence bags, investigation documentation, and
arbitrary secondary evidence instances. URNs are intended to serve as persistent,
location-independent resource identifiers.
Following work performed in the life sciences area in uniquely identifying
proteins in distributed databases (which has resulted in the definition of the Life
Sciences Identifier (LSID) standard [117]), we propose a digital evidence specific URN
scheme. This scheme, which we call Digital Evidence IDentifier (DEID) is based on
the organisation of the tool user, and employs message digest algorithms as a globally
unique identifier. The format of a Digital Evidence Identifier is as follows:
urn:deid:organisation:digestalgorithm:digest:discriminator
urn:deid:isi.qut.edu.au:sha1:dc04e8f06b2a32e7d673c380c4d2c8a1d5ea17d4:image
The string “deid” 28 is used to provide a unique namespace for digital evidence
identifiers. We provide scoping information in the organisation field, which would
potentially enable one to resolve a URN back to a set of information or an evidence bag,
as has been employed in the LSID work. The digestalgorithm field refers to the
message digest algorithm used to generate the text in the following field. The
discriminator field is provided for further addition of naming terms. It should be noted
that we rely on the collision-free nature of message digest algorithms to assure globally
unique names. Given that flaws may be found in cryptographic hashes over time, our
proposal provides for the use of other digest algorithms.
28 We follow the LSID convention, which uses a lower case string “lsid” in the URN.
Of course these identifiers are long and unwieldy and not suited for use as
names for the evidence we are concerned with. Evidence may be given more human
friendly, case specific names by asserting further RDF triples which have the identifier
as the subject. An example of this kind of usage is given in the case study in Section
6.4.
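A sketch of DEID construction and of asserting a friendlier name follows. The organisation value is taken from the example above; the property name used for the friendly name is a hypothetical illustration rather than a term defined in our ontology.

import hashlib

def make_deid(content, organisation="isi.qut.edu.au", discriminator="image"):
    """Construct a Digital Evidence IDentifier URN from evidence content bytes."""
    digest = hashlib.sha1(content).hexdigest()
    return "urn:deid:%s:sha1:%s:%s" % (organisation, digest, discriminator)

deid = make_deid(b"example evidence content")
# A human friendly, case specific name can then be asserted as a further RDF
# triple with the DEID as subject (the property name here is hypothetical).
friendly_name = (deid, "de:caseName", "Bloggs browser cache")
print(deid)
print(friendly_name)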
Where it is necessary to refer to the contents of a particular file, for example a
digital evidence container file in the same DEB, the DEB implementation interprets the
standard URI file protocol (i.e. file://./foo) to find the file.
<de:FileImage
rdf:about="urn:deid:isi.qut.edu.au:sha1:4056e4786fc460d9adbe98a0bc19b29a2104c476
:image">
<de:imageContainer rdf:resource="file:///./jbloggs.cache.index.dat"/>
<de:imageOf
rdf:resource="urn:deid:isi.qut.edu.au:sha1:4056e4786f…b29a2104c476:original"/>
<de:acquisitionTool>
<de:OnlineAcquisitionTool rdf:about="http://www.isi.qut.edu.au/2005/acquireIELogs.py">
<de:name>acquireIELogs.py</de:name>
<de:version>0.1</de:version>
</de:OnlineAcquisitionTool>
</de:acquisitionTool>
</de:FileImage>
<wb:BrowserCacheFile rdf:about =
"urn:deid:isi.qut.edu.au:sha1:4056e4786f….b29a2104c476:original">
<fs:filePath>D:\Documents and...Files\Content.IE5\\index.dat</fs:filePath>
<de:messageDigest rdf:datatype="http://www.w3.org/2000/09/xmldsig#sha1"
>4056e4786fc460d9adbe98a0bc19b29a2104c476</de:messageDigest>
</wb:BrowserCacheFile>
This file (Table 14) contains RDF instance data which asserts two top level
instances: a FileImage and a WebBrowserCacheFile. The instances describe the
relationship between the Evidence Content (the content of an Evidence Content File in
the digital evidence bag) and the original data object, which is a Web Browser Cache
File, located on a particular host.
Our ontology here discriminates between the original data object, the web
browser cache file (which at one point in time resided on some piece(s) of physical
storage media), and the image of that file. As the contents of these two files are, from
the digital perspective, identical, this results in a DEID URN with the same message
digest value. We discriminate between the two instances by using the labels "image"
and "original" in the discriminator field of the DEID URN. This distinguishes between
the FileImage and the BrowserCacheFile. The de:imageContainer property links the
Figure 20: RDF Graph relating original data object and image
The tag file contains the RDF data representing the SDEB's contents and related
integrity information. The DEID of the deb:DigitalEvidenceBag instance is based on
the hash of the content of the Investigation Documentation Files, in the order in which
they are defined in Table 15. The deb:bagContents property is an ordered list which
refers to instances of digital investigation documentation contained in the digital
investigation documentation files.
Table 15: Digital Evidence Bag instance data stored in the Tag File
<deb:DigitalEvidenceBag
rdf:about="urn:deid:isi.qut.edu.au:sha1:44bc23235f5e797aae992e5de09524e9071fd8c6
">
<deb:bagContents>
<rdf:Seq>
<rdf:li
rdf:resource="urn:deid:isi.qut.edu.au:sha1:dc04e8f06b2a32e7d673c380c4d2c8a1d5ea1
7d4:image"/>
<rdf:li
rdf:resource="urn:deid:isi.qut.edu.au:sha1:4a03ed30ebdf919004d4b40222b721c4771ad
ee9:image"/>
<rdf:li
rdf:resource="urn:deid:isi.qut.edu.au:sha1:c117652d98a4f612979c19f5701d278e02574
9fa:image"/>
<rdf:li
rdf:resource="urn:deid:isi.qut.edu.au:sha1:05de1243f67753150334968a2effcc4f8114e
f45:image"/>
<rdf:li
rdf:resource="urn:deid:isi.qut.edu.au:sha1:f3a9fd3fcc017d822f10bc4466b6d19ddbdd5
042:image"/>
<rdf:li
rdf:resource="urn:deid:isi.qut.edu.au:sha1:4056e4786fc460d9adbe98a0bc19b29a2104c
476:image"/>
</rdf:Seq>
</deb:bagContents>
</deb:DigitalEvidenceBag>
6.3.3 Integrity
Current best practice for ensuring the integrity of digital evidence involves the
use of collision resistant message digest functions. Typically a message digest is taken
of the original evidence, and recorded in a manner that asserts the time of the digest
being taken (often via contemporaneous notes or printouts). The integrity of subsequent
images made, or copies of images made may then be ensured by taking the message
digest of the image or copy, and comparing with the original message digest.
In this proposal, integrity of evidence and investigation documentation is
ensured by the use of chained message digests. Besides using the message digest of
each piece of Evidence Content as a component of a unique identifier for both the
Evidence Content Documentation instance and the Digital Investigation Documentation
instance, we also define a property of the de:EvidenceContext class called
de:messageDigest. This property is presented in context in Table 16.
<wb:IEBrowserCacheFile
rdf:about="urn:deid:isi.qut.edu.au:sha1:4056e4786fc460d9adbe98a0bc19b29a2104c476:original">
<de:messageDigest rdf:datatype="http://www.w3.org/2000/09/xmldsig#sha1"
>4056e4786fc460d9adbe98a0bc19b29a2104c476</de:messageDigest>
</wb:IEBrowserCacheFile>
The value of the de:messageDigest property is the hash of the Digital Evidence
Content obtained from the file. Work in the xml signature area has already defined a
datatype representing a SHA-1 message digest, and defined a URI representing this
Table 17: Investigation Documentation Container Metadata stored in the Tag File.
<deb:InvestigationDocumentationContainer>
<deb:contains
rdf:resource="urn:deid:isi.qut.edu.au:sha1:dc04e8f06b2a32e7d673c380c4d2c8a1d5ea1
7d4:image" />
<de:messageDigest rdf:datatype="http://www.w3.org/2000/09/xmldsig#sha1"
>731251ae7216b935cccf51a4018a00d8d89a89cd</de:messageDigest>
<fs:filePath>file:///./jbloggs.history.index.dat.rdf</fs:filePath>
</deb:InvestigationDocumentationContainer>
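Verification of the digest chain can then proceed from the Tag outwards. The following sketch assumes the Tag has already been parsed into (file path, recorded digest) pairs such as those in Table 17; it illustrates the checking step only.

import hashlib
from pathlib import Path

def verify_contents(bag_dir, recorded):
    """Compare each contained file's SHA-1 digest against the digest recorded in the Tag.

    `recorded` is a list of (relative file path, hex digest) pairs extracted from
    the InvestigationDocumentationContainer entries of Tag.rdf.
    """
    failures = []
    for rel_path, expected in recorded:
        actual = hashlib.sha1((Path(bag_dir) / rel_path).read_bytes()).hexdigest()
        if actual != expected:
            failures.append(rel_path)
    return failures

# Example use (digest value as recorded in Table 17):
# verify_contents("bag01", [("jbloggs.history.index.dat.rdf",
#                            "731251ae7216b935cccf51a4018a00d8d89a89cd")])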
6.3.5 Clarifications
In this case, the examiner uses a DEB enabled hard drive imaging application
for acquiring the evidence image. This tool is scripted together from a variant of the
UNIX dd 29 tool and the Linux hdparm utility 30. The examiner acquires the hard drive
using this utility, resulting in a digital evidence bag containing an Evidence Content
File, called hda.dd, an Investigation Documentation File, called hda.dd.rdf, as well as
Tag.rdf. The imaging application is designed to be as simple as possible, and to produce a
sealed digital evidence bag. It automatically generates a message digest of the Tag.rdf
file and stores it in the Tag Integrity File, Tag.rdf.sig. At this point the evidence bag is
sealed and considered immutable, depending on the underlying scheme of
implementation of the Tag Signature.
The examiner has further data associated with this digital evidence bag, namely
the Job ID, a case specific name, the examiner's name and identifying details, and
perhaps the serial number printed on the drive. An evidence annotation program is used
by the examiner to create a new, unsealed digital evidence bag, with the original digital
evidence bag embedded within it. A new Tag File is created within this new bag by the
annotation application. The additional data is entered using the annotation user
interface, and added to the Tag File. In this case the annotation editor eschews creating
a new Investigation Documentation File, as no new evidence has been acquired.
There are two distinct activities involved in the above scenario: evidence
acquisition and evidence annotation. By the former, we refer to the process of making
an exact copy of a piece of digital evidence, for example a hard disk. The latter refers to
the act of recording details relevant to the acquisition process and the evidence source.
By modularising these two tasks, individual tool complexity is reduced, which has the
potential to increase reliability and enable testing at a more granular level. Bugs in the
consuming forensic tool (the annotation tool) are less likely to jeopardise the
integrity of the product of the evidence acquisition task.
The tool annotates the information in the original sealed digital evidence bag
by asserting new properties and their values, related to the DEID of the particular piece
of information from the subject bag, as new RDF triples. These triples are stored in the
Tag File of the new unsealed DEB. In reference to the above example, the new data is
related to the instance representing the hard disk by means of its unique identifier. A
depiction of a portion of the RDF graph formed from the new information as well as
the original investigation documentation is presented in Figure 21.
29 A low level block oriented copying tool found on most UNIX variants.
30 A utility which queries information such as serial numbers, size, and addressing information from hard disks.
Figure 21: RDF graph resulting from addition of new documentation to embedded DEB
<de:Image
rdf:about="urn:deid:isi.qut.edu.au:sha1:4056e4786fc460d9adbe98a0bc19b29a2104c476:image">
<de:acquiredBy>
<foaf:Person>
<foaf:name>Bradley Schatz</foaf:name>
<foaf:mbox rdf:resource="mailto:bradley@blschatz.org"/>
</foaf:Person>
</de:acquiredBy>
</de:Image>
31 The prototype implementation and ontology use the term 'Evidence Metadata' where we now use 'Investigation Documentation'. This refinement in terminology is intended to signify the arbitrary information which may be related to the evidence by multiple layers of abstraction.
concerns of digital forensics. Reification will assist in making possible statements such
as “The pasco tool interpreted the following statements from file X”.
The second contribution is a conceptual advance, proposing an improvement
on the Digital Evidence Bag (DEB) proposal of Turner. Our proposal, which we call
the Sealed Digital Evidence Bag (SDEB) enables arbitrary composition of evidence
bags and information within evidence bags, without modifying any data in original
evidence bags. This proposal improves upon the DEB proposal by simplifying aspects
of evidence authentication.
Central to the compositional approach is our proposal of a globally unique
identification scheme for identifying digital evidence and related information, which
we dub Digital Evidence IDentifiers (DEID). This unique naming scheme enables
automated integration of information from separate evidence bags by the
implementation of the underlying knowledge representation layer. This demonstrates
that employing a knowledge representation as a common language for documenting the
digital investigation provides immediate benefits towards solving the complexity
inherent in integrating this information.
The final benefit of the SDEB approach is that it enables granular composition
and decomposition of evidence into a corpus of inter-related evidence bags, which
addresses the volume problem by facilitating automated validation and scalable
processing of evidence.
The next chapter addresses the theme of analysis techniques and evidence
assurance.
Chapter 7. Temporal provenance &
uncertainty
“I used to be Snow White, but I drifted.”
(Mae West)
2006), Lafayette, USA, and published as Digital Investigation, 3 (Supplement 1), pp. 89-
107.
7.1 Introduction
The use of timestamps in digital investigations is fundamental and pervasive.
Timestamps are used to relate events which happen in the digital realm to each other
and to events which happen in the physical realm, helping to establish event ordering
and cause and effect. A well known difficulty with timestamps, however, is how to
interpret and relate the timestamps generated by separate computer clocks when they
are not known to be synchronized [128]. Commonly observed differences in time from
computer to computer are caused by location specific time variations (such as time
zones), the rate of drift of the hardware clocks in modern computers, and
misconfiguration or inadequate synchronisation.
Current approaches to inferring the real world interpretation of timestamps
assume idealised models of computer clock time, eschewing influences such as
synchronisation and deliberate clock tampering. For example, to determine the clock
skew of a computer being seized, it is commonly recommended that a record be made
“of the CMOS time on seized or examined system units in relation to actual time,
obtainable using radio signal clocks or via the Internet using reliable time servers.”
[20]. CERT recommend that, “As you collect a suspicious system’s current date, time
and command history … determine if there is any discrepancy between the collected
time and date and the actual time and date within your time zone” [96].
While this approach will approximately identify the skew between the local
time and the observed computer time at the time of the check, it says nothing about the
passage of time on the computer’s clock prior to that point [141]. Uncertainty remains
as to the behaviour of the clock of the suspect computer prior to seizure. This further
leads to uncertainty as to what real world time to ascribe to any timestamp based on
this clock.
In this work we explore two themes related to this uncertainty. Firstly, we
investigate whether it is reasonable to assume uniform behaviour of computer clocks
over time, and test this assumption by attempting to characterise how computer clocks
behave in practice. Secondly, we investigate the feasibility of automatically identifying
the local time on a computer by correlating timestamps embedded in digital evidence
with corroborative time sources.
The subject of our case study is a network of machines in active use by a small
business. The network consists of a Windows 2000 domain, containing one Windows
2000 server, a Domain Controller (DC), and a variety of Windows XP and 2000
workstations. Access to the internet is provided by a Linux based firewall. In this case,
the Windows 2000 DC (the server) has not been configured to synchronize with any
reliable time source, and as such has been drifting away from the civil timescale for
some time. The Linux firewall also provides both a squid 32 web proxy server, and an
NTP server, which is synchronised with a stratum 2 NTP server 33 . All workstations are
configured to use the squid proxy cache for web access.
Our goal here is to observe both the temporal behaviour of the Windows 2000
DC, and the effects of synchronization on the subordinate workstation computers. We
would expect that the timescales of the workstation computers would approximate that
of the DC, because of the use of SNTP in this network arrangement (see Chapter 3).
To observe this behaviour, we have constructed a simple service that logs both
the system time of the host computer and the civil time for the location, which we
obtain via SNTP from the local NTP server. The program samples both sources of time
and logs the results to a file. Figure 22 depicts the network topology and time related
infrastructure for this experiment.
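A minimal sketch of such a logging loop is shown below. The prototype was implemented as a Windows service in Python; the sketch assumes the third-party ntplib package and an illustrative NTP server address, and omits the Windows service plumbing.

import time
from datetime import datetime, timezone

import ntplib  # third-party package; assumed available

def log_skew(ntp_server="192.168.0.1", interval_seconds=600, logfile="skew.log"):
    """Periodically record system time alongside civil time obtained via (S)NTP."""
    client = ntplib.NTPClient()
    with open(logfile, "a") as out:
        while True:
            response = client.request(ntp_server, version=3)
            civil = datetime.fromtimestamp(response.tx_time, tz=timezone.utc)
            system = datetime.now(timezone.utc)
            skew = (system - civil).total_seconds()
            out.write(f"{civil.isoformat()} {system.isoformat()} {skew:.3f}\n")
            out.flush()
            time.sleep(interval_seconds)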
32 http://www.squid-cache.org/
33 Stratum refers to the distance from a reference clock.
Figure 22: Experimental setup for logging temporal behaviour of windows PC's in small
business network
The logging program was deployed on all workstations and the server on the 1st
February 2006, and the results checked in mid March. Unfortunately, the program
proved short-lived, as a bug in the Windows service implementation of
Python (the implementation language) saw the log service crash after writing 4 KB of
debug messages to the standard output stream. On fixing the bug, a new version was
redeployed on the 21st March 2006 for 20 days (until the 10th April), and the results
then collected.
The graphs presented below are based on the sampled timescales taken from
machines in the subject network. The x-axis is the time and date of the sample, taken
from the civil timescale, as served by the NTP server. The y-axis is the difference in
time between the system time and civil time at that moment, in seconds.
Figure 23 is the graph of results taken from the domain controller of the
Windows 2000 server based network. The solid line of samples shows a uniform drift
of the system time away from civil time for the period 21st March through 10th
April. The other two sets of samples, from the 1st February through the 21st March
2006, are samples taken by the initial version of the program in the time after a boot
(before the program crashed). In Figure 23 two clusters are visible outside the
aforementioned line, one around the 1st February, and one around the 13th March. These two
clusters indicate reboots at those times. Extrapolating the solid line shows the drift of the
server to be at a near uniform rate.
Figure 23: Clock skew of Domain Controller "Rome" offset from civil time.
Figure 24: Clock skew of workstation "Florence" offset from civil time.
The scale of the graph is misleading as to the number of outlier values present
from 8:19:34AM through 8:25:56AM on the 20th February. The cross at 0 skew
actually represents 38 outlier values, which do not fit a model of time where the clock
is synchronised to the DC. It seems highly irregular that during this period the machine
became synchronised to within one second of civil time (a time stream to which the
network in question has no configured reference).
The default auditing configuration of the Windows network did not audit use of the
necessary privilege, so it was not possible to identify whether this was user instigated. The accuracy to
which the clock became synchronised with civil time leads us to suspect that this was
not the result of user interaction; rather that it was the action of some program which
had access to an external, reliable time source. The Windows update service was active
during this period; we speculate that the cause of synchronisation with the Civil
Timescale during this period was the Windows Update service.
The graph presented in Figure 25 shows the skew data taken from a Windows
XP workstation named Milan. 34 Again the drift rate generally remains constant and
correlated with that of the server; however, there are two sets of anomalies which
deviate from this general trend. Immediately noticeable are the almost vertical lines
which indicate a resynchronisation with the DC timescale from wide time skews. We
speculate that these features indicate a computer reboot immediately before. 35 The
second anomaly is the two peaks on the graph around the 6th and 7th April.
34 The scale of this graph differs from the previous graphs to present a clearer view of the
features under discussion. We note that the overall form of the graph when taken at the previous
scale follows the same gradient and offset.
On closer investigation, the vertical line on the 4th of April reveals that over a
period of 22 minutes and 0 seconds of real time, the system clock advanced only 20
minutes 51 seconds; in total, the system clock loses 1 minute 9 seconds over this period.
This behaviour occurred in small, incremental changes and is consistent with the
disciplining of a skewed clock back into synchronisation with a trusted source.
Figure 25: Clock skew of workstation "Milan" offset from civil time (zoomed).
35 As the focus of this section is on describing observed deviations of Windows based clocks
from the ideal, we leave experiments which conclusively determine the behaviour of the
Windows XP clock at boot to others.
Figure 26: Clock skew of workstation “Trieste” offset from civil time.
The graph in Figure 27 combines data from the DC and Milan (Figures 23 and
25) about the period where peaks are seen in the skew graph 36. We can see here that the
DC was maintaining a stable timescale (part of its data, the points forming a thin line, is
showing through under the peak) for the period, with Milan drifting away sharply at the
peaks. At the start of the peak we can see that Milan began drifting away from the DC
at a rate of around 1 second every 14 minutes, before re-synchronising with the DC.
Figure 27: Clock skew of "Rome" vs. "Milan" offset from civil time (zoomed).
36 Note that colour would help with this graph. Rome data is plotted as points and Milan as
crosses.
events of interest were found. This drift is unlikely to have been based on a single
operator action, as the corresponding change in skew would have been immediately
visible, with a discontinuity between the two points.
The remaining three workstations stayed synchronised with the DC, with no
temporal anomalies observed. The skew timelines for these were similar to Figure 23.
For reasons of brevity they are not reproduced here.
From these results we draw a number of conclusions. In general, we find that
Windows hosts (2K and XP) integrated with a Windows based time synchronisation
network will stay synchronised. The anomalies observed above, however, indicate that
making reliable statements about the timescale of a particular workstation computer
within a Windows Domain network (and as such the interpretation of timestamps from
these workstations) is problematic.
Windows computers not in a Domain network, either untethered from reliable
sources of time (such as those running Windows 2000) or loosely tethered (such as
those running the XP OS), may suffer from the same problem. Indeed, as XP hosts
synchronise with time.windows.com on a far less frequent basis (weekly), there will be
longer periods of de-synchronisation. The observation that the host “Milan” became
synchronised with civil time for a period, and the further observation of it drifting
away from the DC timescale and civil time (for no observable reason), indicate that
other factors are influencing the behaviour of the clock.
relate to a suspect computer may be obtained from the ISP which has served as the
computer’s gateway to the Internet.
We assume that these records on the proxy would be produced by a computer
which is synchronized with an accurate time source. While this might not at present be
a generalisable assumption, we look towards a near future where the provenance of
audit records receives closer attention from ISPs and business in general, as forensic
preparedness finds its way onto the agenda for compliance and other reasons.
Our experimental setup uses the same infrastructure which was used in the first
study. Relevant to this experiment is the deployment of the Microsoft Internet Explorer
web browser on all Windows based machines, and the presence of a Squid HTTP proxy
on the firewall, which the computers are configured to use to access the web. The
experimental setup is depicted in Figure 28.
This experiment takes the browser records from the machines in the network
and correlates them with the entries in the Squid proxy log, to determine the temporal
behaviour of the Windows machines on the network. The correctness of the correlation
techniques is evaluated using the data collected from the previous experiment.
IE stores records of browsing access in two subsystems: the cache and history.
These records are all stored in separate files, called index.dat, but located in different
directories.
The IE cache subsystem stores locally cached copies of web content, such as
pages and images, in local files (with extensions such as .jpg, amongst others). An
index mapping web addresses to these locally stored copies is kept in a file called
index.dat. The cache index files contain entries for all cacheable resources visited,
including the component files of a particular viewable page (for example images,
sounds, and flash animations).
The history subsystem creates a historical record of URLs visited over time in a
set of index.dat files. Three separate types of history file are kept: the root history, daily
sort history and weekly sort history. Within these files are records of visits to top level
viewable pages:
• Pages visited by typing a URL
• Pages visited by clicking on a hypertext link
• Documents opened within Windows Explorer by double clicking (i.e.
.xls, .doc…)
The cache and history index.dat files all share a similar undocumented binary
file format. Despite the lack of documentation, a number of reverse engineering
analyses of the file format have been published, and a number of tools are available
which will interpret the content of these files. For a good description of the file format,
especially notable in distinguishing some subtle semantic differences in interpreting the
timestamps in these records, see [20].
We initially used the Pasco [59] tool for extracting the data contained in these
files. We chose this tool as it had freely available source code. In practice, however, we
suspected that it was generating spurious results. This prompted us to perform our own
reverse engineering effort. Our new tool identified a bug in the Pasco tool where a
spurious record was generated from an unchecked file read at an offset outside the
bounds of the file 37.
The Squid proxy cache logs a record of all web transactions which it processes
in a textual log file called access.log. The fields of interest to us are the resource access
time (which, like the timestamps in the IE index files, marks the end of the transaction)
and the URL visited.
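As an illustration, the following sketch extracts these two fields from a Squid access.log in its default native format, where the first field is a UNIX timestamp (with a millisecond fraction) marking the completion of the transaction and the seventh field is the requested URL; the field positions are assumptions based on the default Squid configuration rather than part of the experiment described here.

```python
# Sketch: extract (end-of-transaction time, URL) pairs from a Squid access.log
# in the default "native" log format (field positions are an assumption).
from datetime import datetime

def parse_squid_log(path):
    records = []
    with open(path) as log:
        for line in log:
            fields = line.split()
            if len(fields) < 7:
                continue                      # skip malformed lines
            end_time = float(fields[0])       # UNIX epoch seconds, end of transaction
            url = fields[6]                   # requested URL
            records.append((datetime.utcfromtimestamp(end_time), url))
    return records
```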
37 Our new parsing tool, imaginatively named pasco2, is available at
http://www.bschatz.org/2006/pasco2/
Our experiment involves translating the web browser records and squid logs
into a common representation and matching entries from the two sources based on the
URL visited. We assume that the last accessed time from the squid record is relative to
civil time (kept tightly synchronised using, for example, NTP), and compare that time
with the last accessed time from the corresponding history or cache record.
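A minimal sketch of one way this matching step might be implemented is shown below; it is not the specific algorithm evaluated later in this section. It assumes both sources have already been translated into (timestamp, URL) tuples, and for each IE record it simply finds the Squid entry for the same URL whose timestamp is closest, reporting the difference as the apparent skew of the workstation clock.

```python
# Sketch: match IE records to Squid records by URL and compute apparent skew.
# ie_records and squid_records are lists of (timestamp, url) tuples in a
# common representation (timestamps as datetime objects).
from collections import defaultdict

def correlate(ie_records, squid_records):
    by_url = defaultdict(list)
    for ts, url in squid_records:
        by_url[url].append(ts)

    matches = []
    for ie_ts, url in ie_records:
        candidates = by_url.get(url)
        if not candidates:
            continue                                   # URL never seen by the proxy
        squid_ts = min(candidates,
                       key=lambda ts: abs((ts - ie_ts).total_seconds()))
        skew = (ie_ts - squid_ts).total_seconds()      # positive: IE clock ahead of civil time
        matches.append((squid_ts, url, skew))
    return matches
```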
The primary challenge related to correlation is in determining which entry in
the Squid cache log corresponds to a particular entry in the cache or history records. As
IE records are most recently used (MRU) records, there will not be a one to one
mapping between history entries and Squid events. We illustrate this with the following
example.
Figure 29: Matching is complicated by only the most recent record present in the history.
For the two algorithms explored, the sampled timescales from the previous
experiment in Section 7.2 are used as a baseline for determining which matches are true
or false. True positives are data points output by the correlation algorithm which
correlate with the timescale identified in Section 7.2. False positives are matches
generated by the algorithm which do not correlate with the timescale. True negatives
are prospective matches that are rightly discarded by the algorithm. False negatives are
data points which would correlate with the timescale, but which the algorithm
misclassifies as not correlated.
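As a sketch of how this classification might be automated (not the procedure actually used, and with an assumed tolerance of 2 seconds), the fragment below compares each correlated skew value against the skew of the nearest baseline sample from Section 7.2, treating matches within the tolerance as true positives.

```python
# Sketch: classify correlation results against the experimentally sampled
# baseline timescale. baseline is a list of (civil_time, skew_seconds) samples;
# matches is the output of the correlation step. TOLERANCE is an assumption.
TOLERANCE = 2.0  # seconds

def classify(matches, baseline):
    true_pos, false_pos = [], []
    for civil_ts, url, skew in matches:
        nearest = min(baseline,
                      key=lambda sample: abs((sample[0] - civil_ts).total_seconds()))
        nearest_skew = nearest[1]
        if abs(skew - nearest_skew) <= TOLERANCE:
            true_pos.append((civil_ts, url, skew))
        else:
            false_pos.append((civil_ts, url, skew))
    return true_pos, false_pos
```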
Figure 30: Correlated skew (clickstream) vs. experimental skew (timeline) for host
“Milan” do not correlate because of presence of false positives.
Figure 30 and Figure 31 show a clickstream correlation run, graphed with the
timescale log of the workstation “Milan”. The clickstream correlation dataset in Figure
30 is graphed as crosses. It contains 75 results, of which we can see 4 clusters of
clickstream results 38. Clearly there is conflicting data. The two clusters visible off the
timeline actually contain 5 false positive values, which are causing the problem. These
5 values are false positives because we know from our earlier experiment what the
actual time was on the computer clock at that point, which is plotted on the graph as
dots. Removing these false positives from the result set produces the graph labelled
Figure 31, where we can see tight correlation with the workstation’s timescale.
Figure 31: Correlated skew vs. experimental skew for host “Milan” correlates when false
positives are removed.
The results of running the same correlation algorithm on the host “Pompeii”,
which generated far less web traffic over the period, are presented in Figure 32. In
this case the clickstream correlation algorithm produces no false positives.
38 We note here that a colour graph would be more illuminating, as the timeline values on the
graph dominate. The apparent line at y-axis 0 is actually the individual timeline samples (which
are graphed as dots) merging to form a solid line. Three clusters are visible about the 11/04 x-
axis coordinate, and a further cluster is visible just after 07/04 on the 0 y-axis.
7.3.5 Results
In practice the rate of false positives increased when comparing intra-hit times
below one second. We expect that this is caused by measurement error becoming more
pronounced as the intra-hit time becomes smaller. Values of around 20 minutes for the
maximum intra-hit time and values of over 1 second for the minimum produced
clickstreams with the best uniqueness properties (as measured by a reduction in the rate
of false positives).
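The clickstream formation step can be pictured with the following sketch, which is one plausible reading of the approach rather than the implementation used here: time-ordered records for a host are grouped into a clickstream only while the gap between successive hits stays within the minimum and maximum intra-hit times discussed above (1 second and 20 minutes).

```python
# Sketch (one plausible reading of clickstream formation): group time-ordered
# records into clickstreams while the gap between successive hits lies between
# the minimum and maximum intra-hit times.
MIN_INTRA_HIT = 1.0          # seconds
MAX_INTRA_HIT = 20 * 60.0    # seconds

def form_clickstreams(records):
    """records: list of (timestamp, url) tuples, assumed sorted by timestamp."""
    streams, current = [], []
    for ts, url in records:
        if current:
            gap = (ts - current[-1][0]).total_seconds()
            if not (MIN_INTRA_HIT <= gap <= MAX_INTRA_HIT):
                if len(current) > 1:
                    streams.append(current)     # close off the current clickstream
                current = []
        current.append((ts, url))
    if len(current) > 1:
        streams.append(current)
    return streams
```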
Modifying the algorithm to filter clickstream acceptance based on clickstream
length produced a similar effect on the false positive rate. With larger required
clickstream sizes, however, the rate of true positives falls off quickly and the rate of
false negatives becomes high.
The algorithm performed far better on cache records than on history records.
We expect that this is caused by the difference in granularity of record keeping in the
sources. As the cache stores cache records both for top level web pages and component
content such as images, style sheets and the like, clickstreams are more likely to be
formed. The IE history subsystem only records the top level page views, so is less
likely to produce long clickstreams in situations where users do not heavily explore
websites.
Designing an algorithm which eliminates these false positives is complicated
by the fact that the last accessed timestamp of any particular cache record is unreliable,
as the resource may have been accessed more recently by the user (before the cached
content expired). In this case, no corresponding Squid event would be logged even
though the cache record timestamp is updated, thus introducing a skew to the expected
offset of the matching Squid event.
For this reason, we set about finding a means of identifying IE records which
must have been requested via the Squid proxy and not served from the local cache.
7.3.7 Results
In practice this algorithm produces a set of data which correlates well with the
timescales produced by our previous experiment. For example, Figure 33 is a graph of
the output of the algorithm described above overlaid over the timescale for host Milan
obtained from the previous experiment.
Of 1188 unique history records, 821 history-squid tuples were identified. One
would expect the number of history-squid tuples to be higher; however, URLs with
encoded GET requests are not matched because of Squid’s anonymised logging of this
kind of URL.
In practice there is a sufficiently high proportion of non-cached hits in the
history for our algorithm to work effectively. The algorithm identifies 304 potential
non-cached matches, and from these a base set of 110 matches. In total the algorithm
generates 134 data points (see Figure 33).
7.4 Discussion
In this section the two algorithms are compared, and the general problems
related to correlating these types of event logs are outlined.
Of the two algorithms, the history correlation algorithm performed better. It
generates results which cover a far wider period of time than the cache oriented
algorithm, giving greater insight into the temporal behaviour of the computer.
Furthermore, the ratio of true to false positives is far higher.
The history algorithm was originally the worse performing of the two
approaches. Determining whether the high rate of false positives was caused by a tool
implementation error or an error in the correlation algorithm was problematic. Boyd’s
paper [20] was at that point essential in identifying that our interpretation of the weekly
history timestamps was mistaken.
Despite having re-implemented a new set of index.dat file parsers (and
discovered a third timestamp in the history records 39), we had still used the semantics
defined by the Pasco tool. Our model was corrected to treat the first timestamp in the
weekly sort history record as the accessed time, offset by the local time zone offset in
operation. This resulted in the high rate of true positives and low rate of false positives
seen previously in Figure 33.
Both approaches to developing a correlation algorithm outlined above make a
closed world assumption – that the algorithm has access to all of the information that it
needs. In practice, development of the algorithm was complicated by this not being the
case. Consider for example Figure 34, which was generated using the same history
correlation algorithm as that seen in Figure 33. The input to the algorithm was,
however, a dataset which omitted a particular Squid access log.
Strong correlation with the computer’s timescale is evident; however, there are
in this case false positives at the extremes of the graph. Examination of the false
positives indicated that they were related to records from the particular Squid access
log which had been omitted from the correlation run for processing speed reasons. The
omission of these records resulted in the algorithm picking a match from another Squid
log file, resulting in a far greater offset. Adding the excluded log produces the results
seen previously in Figure 33.
39 A 32 bit MS-DOS timestamp was identified at offset 0x50 within the history record. Within
the root history file, this timestamp is interpreted as the last accessed time, as is apparent by
comparison with the last access time shown in the Internet Explorer history viewer. In practice,
the value is always a small amount after the 64 bit FileTime based last accessed time.
We compare our approach here to the two closest approaches identified in the
literature, which are summarised in Section 3.3.4.
The approach of Gladyshev and Patel [47] differs from ours in that we deal
predominantly with events which do have a timestamp, but where there is uncertainty
as to the real world time to which it corresponds. The approach taken by Gladyshev and
Patel instead tries to find the temporal bounds of an event which may or may not have a
timestamp associated with it.
Our work has similar objectives but differs significantly from Weil [141] in
two respects. Firstly, we investigate to what degree timescales are unstable. Secondly,
Weil’s approach relies on manual classification of cached web pages as dynamically or
statically generated. This is because the technique relies specifically on dynamic
content in order for the embedded timestamps to be interpreted. In addition, we
present two algorithms which enable the automatic determination of the behaviour of a
suspect computer’s clock by comparison with a commonly logged corroborative
source.
7.5 Conclusions
This chapter has investigated a key problem which lies at the foundations of
evidence representation: how to assure the reliability of timestamps found in digital
evidence. The contributions of this chapter are aligned with the theme of assurance, and
tangentially, representation.
The first contribution is an analysis of the temporal behaviour of PC clocks as
generally implemented in the Windows operating system, and empirical results
demonstrating the unreliability of timestamps sourced from Windows based computers.
This was presented in Section 7.2.
The second contribution, presented in Section 7.3, demonstrated the feasibility
of automatically characterising the temporal behaviour of a computer by correlating
timestamps embedded in digital evidence with corroborative time sources. Two
algorithms were proposed and evaluated, and experimental results were presented
which demonstrate that the latter algorithm produces outputs which correlate
reasonably with the timescales of the subject computers. We have additionally
described how the history correlation algorithm could be modified to produce a higher
rate of true positives.
There are a number of areas where future work is warranted. First, in order that
results based on this kind of correlation may be more clearly interpreted and explained
in forums such as courts of law, a means of qualifying and quantifying the error
involved would be of use. Second, in order that the resolution of the characterised
timescales may increase, improved algorithms which incorporate uncertainty in record
matching should be investigated. Finally, the Internet Explorer index.dat file format is
still not fully understood. We expect that a clearer understanding of the file format
would lead to a reduction in errors.
Chapter 8. Conclusions and future work
“(I am) acutely aware of the difficulties created by saying that when
Aristotle and Galileo looked at swinging stones, the first saw
constrained fall, the second a pendulum. Nevertheless, I am
convinced that we must learn to make sense of sentences that at least
resemble these.”
(The Structure of Scientific Revolutions, Thomas Kuhn)
Chapter 3 concluded that the field of digital forensics might benefit from the
application of formal knowledge representation to digital evidence and digital
investigations. Chapter 4 investigated the history of formal representation, in the
context of Knowledge Representation and semantic markup languages, introduced the
RDF/OWL formalism, and proposed that this formalism would be of benefit in
addressing the complexity and volume challenges in forensic event correlation.
Chapter 5 investigated using this formalism in the context of event correlation
for forensic purposes. The primary outcome of this chapter was to show that the
RDF/OWL formalism is useful as a general representation and is expressive enough to
represent and integrate digital evidence sourced from disparate arbitrary event oriented
sources, composite and abstract events corresponding to higher level situations, and
entities referred to in those events. This was demonstrated by building tools which
translated heterogeneous event logs into the formalism.
The second outcome was to show that the formalism is useful for building tools
which analyse such information. This was demonstrated by building automated
correlation tools which automatically identified forensically interesting scenarios from
event log based evidence using heuristic rules. It was additionally demonstrated
by the ease with which investigator hypotheses regarding entity identity could be used
to solve the problem of surrogate proliferation, reducing the volume of entities under
consideration.
A final outcome of this chapter was the identification of areas where the
formalism is insufficiently expressive.
Chapter 6 showed that formal representation is useful in documenting digital
investigations and sharing digital evidence. This was demonstrated by the proposal of
an improved approach to digital evidence containers which enables more scalable
processing of evidence, extensible integration of arbitrary information, and simplified
evidence authentication.
architecture is interoperable were presented, this work has only covered a small portion
of possible tool integration scenarios.
One implication of the results is the impact that an ontology and document
oriented approach to digital evidence might have upon firming up the terminology used
in the field. We are not the first to argue for terminological precision in forensics; a
number of parties have observed that the terminology in the field is used in differing
ways. Some have proposed the use of ontologies as a useful tool for discussing and
defining the field from a theoretical standpoint, from the top down. The practical
employment of ontologies in approaches such as those described in this dissertation
has the potential to shape the terminology of the field from the bottom up, with human
readable results expressed using a semantically grounded vocabulary, passively shaping
the investigator’s conception of digital evidence and of the information interpreted and
derived from it.
The implications of the results showing the unreliability of a Windows based
time synchronisation infrastructure are clear. In cases where establishing the precise
time at which a computer event occurred is important, one cannot assume that
computers running MS Windows 2000 or XP have behaved in uniform ways with
respect to keeping time. Where precision is not so necessary, it would be expected that
corroborating sources of timestamped evidence might be useful in characterising the
behaviour of computer clocks, and thus enabling one to challenge the acceptability of
blanket assumptions made about clock behaviour. Where one expects to depend on the
correctness of timestamps, other, more reliable, measures must be taken towards
assuring synchronised computer clocks. Areas such as real time stock trading and
banking would be potential areas where this kind of forensic preparedness could be
warranted.
While the work described in Chapters 5 and 6 validates that the representation
is expressive and extensible and that the SDEB information architecture is
interoperable, it has only covered a small portion of possible tool integration
scenarios.
Future work is required to ascertain how to best integrate evidence and
information with conflicting ontological commitments, the impact of part/whole
knowledge related to a case, and as an information format compatible with rule based
reasoning. Another form of reasoning which is possible with description logics is
categorization, which is performed by description logic reasoners (in this work we did
not employ these, as detailed in Section 5.2.1). From data asserted in RDF and an OWL
ontology, a description logic reasoner can classify instances of information as
belonging to a class or category, based on the relationships between the individual and
other classes or instances. Future work is necessary to determine the extent to which
this kind of reasoning could automatically identify situations or information of interest.
The ontologies used in the course of this research have been developed in an ad
hoc manner and built only for the purpose at hand. There has been no attempt to create
a comprehensive digital forensics ontology. Such an ontology would be of worth both
in building consensus on the meaning of the digital forensics related vocabulary,
highlighting areas where language is used in inexact or confusing ways, and as machine
readable semantics for tool interoperability. Building such an ontology is, however,
complicated by established linguistic conventions (e.g. UK usage of the term “computer
based electronic evidence” vs. US usage of “digital evidence”), the context dependent
nature of terminology, and the difficulty of limiting the scope of the ontology 40 .
Future work applying automated ontology construction methods (i.e. “ontology
learning” [72]) could potentially produce a digital forensics ontology with low human
time, energy and consensus costs, and at the same time identify areas of the digital
forensics vocabulary which are used in divergent ways.
40 For a good survey of approaches to building ontologies, see [102].
negatives, which may be acceptable in the IDS context however not in the forensics
one.
Temporal correlation methods such as those we have proposed in Chapter 7
imply models of time more complex than can be described using terms such as offset
and drift; our research visually depicted the relationship between our reference timeline
(UTC) and the subject computer. The initial results characterising the temporal
behaviour of a particular clock showed that this relationship might be described by
successive time offsets and drift rates. The problem is that in practice there are no
events stored on the computer which allow one to see changes in rate of the passage of
time at this micro level granularity.
The presence of false positives in the results generated by the correlation
method precludes its direct use as a means of interpreting unreliable timestamps to
corresponding times on a reference timescale. Upstream tools seeking to work with
events with unreliable timestamps may use results generated by this correlation
method; however, the raw results would need to be manually interpreted into a set of
assumptions about the passage of time relative to the reference timescale.
Assuming that the false positives problem might be solved by a more thorough
reverse engineering of the IE cache and history file format, automated upstream tool
use of the correlation results is still complicated by the granularity of the correlation
results, and the likely limited period which the results would cover. For this reason,
extending the concrete results of correlation requires production of a set of assumptions
about the passage of time in between the samples. These assumptions, or temporal
theories, about the passage of time could be used by a correlation tool to ascribe a
theoretical real world time to an event.
The results regarding temporal provenance indicate that event correlation
processes would benefit from richer models of temporal progress including timescale
deviation, event time uncertainty, and orthogonal to this, assumptions about these.
What effect such notions might have upon event pattern languages is an open question.
It would appear likely that their effect on the algorithmic complexity of correlation
approaches would be highly adverse.
Future work is needed in characterising the behaviour of Windows PCs that are either
untethered from or loosely tethered to reliable time sources, and also on the behaviour
of UNIX and RTOS variants.
This work addresses means for analyzing event log based evidence, utilizing
RDF/OWL for representing entities and events, and rules for expressing correlation
relationships. In this context correlation refers to an abstraction relationship between a
set of events and a higher level event or situation. This correlation relationship is in
turn dependent upon a variety of relationships between the lower level events,
including temporal constraints and constraints over property relationships with entities
involved in the events and the wider environment.
The problem of event correlation and event pattern languages in particular lies
in how to describe these events, relationships and constraints. This work relied on
OWL for modeling events and relationships; however its expressiveness is insufficient
for describing temporal constraints. This led to the employment of a rule language for
declaring these. Some work has been performed on extending description logics to
incorporate temporal descriptions; however the work is preliminary.
Future investigations of event pattern languages would benefit from working
with abstract notions of time such as before, after, coincident, and during, rather than
reasoning with time as a single discrete numerical value. How a language incorporating
these notions would interact with temporal models such as those mentioned in Section
7.3 similarly requires further investigation.
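As a small illustration of what such abstract notions might look like when reduced to code (a sketch only, over simple numeric (start, end) intervals rather than the uncertain timescales discussed above):

```python
# Sketch: abstract temporal relations over (start, end) intervals.
def before(a, b):
    return a[1] < b[0]                       # a ends before b starts

def after(a, b):
    return before(b, a)

def during(a, b):
    return b[0] <= a[0] and a[1] <= b[1]     # a falls wholly within b

def coincident(a, b, tolerance=0.0):
    return abs(a[0] - b[0]) <= tolerance and abs(a[1] - b[1]) <= tolerance
```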
Chapter 9. Bibliography
[3] Abbott, J., J. Bell, A. Clark, O. de Vel, and G. Mohay. Computer
forensics (CF): Automated recognition of event scenarios for digital
forensics. in 2006 ACM Symposium on Applied Computing. 2006. Dijon,
France: ACM Press.
[6] ACPO. Good Practice Guide for Computer based Electronic Evidence.
2006 [Viewed 19 Oct 2006]; Available from:
http://www.acpo.police.uk/asp/policies/Data/gpg_computer_based_evidence_v3.pdf.
[7] Alink, W., R.A.F. Bhoedjang, P.A. Boncz, and A.P. de Vries, XIRAF
– XML-based indexing and querying for digital forensics. Digital
Investigation (6th Digital Forensics Research Workshop), 2006.
3(Supplement 1): p. 89-107.
[8] Attfield, P., United States v Gorshkov detailed forensics and case study:
expert witness perspective, in 1st International Workshop on Systematic
Approaches to Digital Forensic Engineering. 2005: Taipei, Taiwan. p.
3-24.
[18] Bogen, A.C. and D.A. Dampier. Unifying computer forensics modeling
approaches: a software engineering perspective. in 1st International
Workshop on Systematic Approaches to Digital Forensic Engineering.
2005.
[19] Borgida, A., R.J. Brachman, D.L. McGuinness, and L.A. Resnick,
CLASSIC: A Structural Data Model for Objects, in ACM SIGMOD
International Conference on Management of Data. 1989: Portland,
Oregon.
[20] Boyd, C. and P. Forster, Time and date issues in forensic computing – a
case study. Digital Investigation, 2004: p. 18-23.
[21] Brill, A.E., M. Pollitt, and C.M. Whitcomb, The Evolution of Computer
Forensic Best Practices: An Update on Programs and Publications.
Journal of Digital Forensic Practice, 2006. 1(1): p. 2-11.
[23] Carrier, B. Open Source Digital Forensics Tools: The Legal Argument.
@stake Research Report 2002 [Viewed Dec 2006]; Available from:
http://www.digital-evidence.org/papers/opensrc_legal.pdf.
[25] Carrier, B. The sleuth kit & autopsy: Forensics tools for linux and other
unixes. 2006 [Viewed 29 Nov 2006]; Available from:
http://www.sleuthkit.org/.
[27] Casey, E., Digital evidence and computer crime. 2000, San Diego, Calif:
Academic Press.
[28] Casey, E., State of the field: growth, growth, growth. Digital
Investigation, 2004. 1(4): p. 241-309.
[29] Casey, E., Digital arms race – The need for speed. Digital Investigation,
2005. 2(4): p. 229-280.
[31] CDESF. Survey of Disk Image Storage Formats. 2006 [Viewed Dec
2006]; Available from: http://www.dfrws.org/CDESF/survey-dfrws-
cdesf-diskimg-01.pdf.
[33] Collier, P.A. and B.J. Spaul, A Forensic Methodology for Countering
Computer Crime. Journal of Forensic Science, 1992. 32(1).
[40] Fikes, R., J. Jenkins, and G. Frank, JTP: A System Architecture and
Component Library for Hybrid Reasoning, in Proceedings of the
Seventh World Multiconference on Systemics, Cybernetics, and
Informatics. 2003: Orlando, Florida.
[42] Forgy, C., Rete: A Fast Algorithm for the Many Patterns/Many Objects
Match Problem. Artificial Intelligence, 1982. 19(1): p. 17-37.
[43] Friedman-Hill, E. JESS: The Rule Engine for the Java™ Platform.
2003 [Viewed Nov 2003]; Available from:
http://herzberg.ca.sandia.gov/jess/.
[45] Garfinkel, S.L., D.J. Malan, K.-A. Dubec, C.C. Stevens, and C. Pham,
Disk Imaging with the Advanced Forensics Format, Library and Tools.
Advances in Digital Forensics (2nd Annual IFIP WG 11.9 International
Conference on Digital Forensics), 2006.
[49] Gray, J. and D. Patterson, A conversation with Jim Gray. ACM Queue,
2003. 1(4).
[51] Gruber, T.R., Toward principles for the design of ontologies used for
knowledge sharing? International Journal of Human Computer Studies,
1995. 43(5-6): p. 907-928.
[52] Guha, R.V. and T. Bray. Meta Content Framework Using XML. 1997
[Viewed 20 Dec 2006]; Available from:
http://www.w3.org/TR/NOTE-MCF-XML/.
[55] Horrocks, I. The FaCT system. 1999 [Viewed Nov 2003]; Available
from: http://www.cs.man.ac.uk/~horrocks/FaCT/.
[56] Horrocks, I., P.F. Patel-Schneider, and F. van Harmelen, From SHIQ
and RDF to OWL: The making of a web ontology language. Journal of
Web Semantics, 2003. 1(1): p. 7-26.
[58] ISO, ISO 8879:1986 Information processing — Text and office systems
— Standard Generalized Markup Language (SGML). 1986.
[60] Kenneally, E.E., Gatekeeping Out Of The Box: Open Source Software As
A Mechanism To Assess Reliability For Digital Evidence. Virginia
Journal of Law and Technology, 2001. 6(3).
[61] Kifer, M., G. Lausen, and J. Wu, Logical Foundations for Object-
Oriented and Frame-Based Languages. Journal of the Association of
Computing Machinery, 1995. 42(3): p. 741-843.
[65] Kornblum, J., Identifying almost identical files using context triggered
piecewise hashing. Digital Investigation (6th Digital Forensics Research
Workshop), 2006. 3(Supplement 1): p. 91-97.
[69] Lindqvist, U. and P.A. Porras. Detecting computer and network misuse
through the production-based expert system toolset (P-BEST). in IEEE
Symposium on Security and Privacy. 1999. Berkeley, California.
[71] Luckham, D., The Power of Events. 2002, Indianapolis, Indiana: Pearson
Education.
[72] Maedche, A. and S. Staab, Ontology learning for the Semantic Web.
IEEE Intelligent Systems, 2001. 16(2): p. 72 - 79
[73] McBride, B., Jena: a semantic web toolkit. IEEE Internet Computing,
2002. 6(6): p. 55-59.
[79] Microsoft. How Windows Keeps Track of the Date and Time. 2006
[Viewed April 2006]; Available from:
http://support.microsoft.com/?kbid=232488.
[81] Microsoft. The system clock may run fast when you use the ACPI power
management timer as a high-resolution counter on Windows 2000-
based, Windows XP-based, and Windows Server 2003-based computers.
2006 [Viewed April 2006]; Available from:
http://support.microsoft.com/?kbid=821893.
[86] Moats, R. URN Syntax. 1997 [Viewed 6 Jan 2006]; Available from:
http://www.ietf.org/rfc/rfc2141.txt.
[90] NCI. The National Cancer Institute Thesaurus in OWL. 2003 [Viewed
Jan 2007]; Available from:
http://www.mindswap.org/2003/CancerOntology/.
[91] Neches, R., R. Fikes, T.W. Finin, T.R. Gruber, R. Patil, T.E. Senator,
and W.R. Swartout, Enabling Technology for Knowledge Sharing. AI
Magazine, 1991. 12(3): p. 36-56.
[95] Ning, P., Y. Cui, and D. Reeves, Constructing attack scenarios through
correlation of intrusion alerts, in 9th ACM conference on Computer and
Communications Security. 2002: Washington, DC.
[97] Noy, N.F. and D.L. McGuinness. Ontology Development 101: A Guide
to Creating Your First Ontology. 2001 [Viewed 2004]; Available from:
http://www.ksl.stanford.edu/people/dlm/papers/ontology101/ontology10
1-noy-mcguinness.html.
[98] Palmer, G. (ed), A Road Map for Digital Forensic Research, in First
Digital Forensic Research Workshop, G. Palmer, Editor. 2001: Utica,
New York.
[99] Pan, F. and J.R. Hobbs, Time in OWL-S, in 2004 AAAI Spring
Symposium Series - Semantic Web Services. 2004: Stanford University.
[101] Perrochon, L., E. Jang, S. Kasriel, and D.C. Luckham, Enlisting Event
Patterns for Cyber Battlefield Awareness, in DARPA Information
Survivability Conference & Exposition. 2000: Hilton Head, South
Carolina.
[102] Pinto, H.S. and J.P. Martins, Ontologies: How can They be Built?
Knowledge and Information Systems, 2004. 6(4).
[103] Pollack, J. U.S. v Plaza, Acosta (Cr. No. 98-362-10, 11, 12). 7 Jan 2002.
Available from:
http://www.paed.uscourts.gov/documents/opinions/02d0046p.pdf.
[107] Redgrave, L.M., A.S. Prasad, J.B. Fliegel, T.S. Hiser, and J.H. Jessen,
The Sedona Principles: Best Practices Recommendations & Principles
for Addressing Electronic Document Production in The Sedona
Conference Working Group Series. 2004, The Sedona Conference.
[108] Reed, S.L. and D.B. Lenat, Mapping Ontologies into Cyc, in AAAI
workshop on Ontologies and the Semantic Web. 2002: Edmonton,
Canada.
[111] Richard III, G.G. and V. Roussev, Scalpel: A Frugal, High Performance
File Carver, in Digital Forensics Research Workshop. 2005: New
Orleans, LA.
[114] Roussev, V. and G.G. Richard III, Breaking the Performance Wall: The Case
for Distributed Digital Forensics, in 5th Digital Forensics Workshop.
2005: New Orleans, LA.
[123] Spafford, E.H. and S.A. Weeber, Software forensics: can we track code
to its authors? Computers and Security, 1993. 12(6): p. 585-595.
[128] Stevens, M.W., Unification of relative time frames for digital forensics.
Digital Investigation, 2004. 1: p. 225-239.
[131] SWGDE. SWGDE and SWGIT Glossary of Terms. 2005 [Viewed 25 Aug 2007];
Available from:
http://68.156.151.124/documents/swgde2005/SWGDE%20and%20SWGIT%20Combined%20Master%20Glossary%20of%20Terms%20-July%2020..pdf.
[133] TGO. The Gene Ontology. 2006 [Viewed Jan 2007]; Available from:
http://www.geneontology.org/.
[137] van den Bos, J. and R. van der Knijff, TULP2G–An Open Source
Forensic Software Framework for Acquiring and Decoding Data Stored
in Electronic Devices. International Journal of Digital Evidence, 2005.
4(2).
[140] W3C. RDF Vocabulary Description Language 1.0: RDF Schema. 2004
[Viewed 21 Dec 2006]; Available from: http://www.w3.org/TR/rdf-
schema/.
[141] Weil, C., Dynamic Time & Date Stamp Analysis. International Journal of
Digital Evidence, 2002. 1(2).
[143] Yemini, S.A. and S. Kliger, High Speed and Robust Event Correlation.
IEEE Communications, 1996: p. 433-450.