by
Bradley Schatz
October, 2007
Keywords
Digital evidence, computer based electronic evidence, digital forensics,
computer forensics, forensic computing, evidence provenance, evidence representation,
knowledge representation.
Abstract
The field of digital forensics is concerned with finding and presenting evidence
sourced from digital devices, such as computers and mobile phones. The complexity of
such digital evidence is constantly increasing, as is the volume of data which might
contain evidence. Current approaches to interpreting and assuring digital evidence rely
implicitly on the use of tools and representations made by experts in addressing the
concerns of juries and courts. Current forensic tools are best characterised as difficult to verify, poorly interoperable, and burdensome on human processes.
The tool-centric focus of current digital forensics practice impedes access to and transparency of the information represented within digital evidence as much as it assists, because of the tight binding between a particular tool and the information that it conveys. We hypothesise that a general and formal representational approach will
benefit digital forensics by enabling higher degrees of machine interpretation,
facilitating improvements in tool interoperability and validation. Additionally, such an
approach will increase human readability.
This dissertation summarises research which examines at a fundamental level
the nature of digital evidence and digital investigation, in order that improved
techniques which address investigation efficiency and assurance of evidence might be
identified. The work follows three themes related to this: representation, analysis
techniques, and information assurance.
The first set of results describes the application of a general purpose
representational formalism towards representing diverse information implicit in event
based evidence, as well as domain knowledge, and investigator hypotheses. This
representational approach is used as the foundation of a novel analysis technique which
uses a knowledge based approach to correlate related events into higher level events,
which correspond to situations of forensic interest.
The second set of results explores how digital forensic acquisition tools scale
and interoperate, while assuring evidence quality. An improved architecture is
proposed for storing digital evidence, analysis results and investigation documentation
in a manner that supports arbitrary composition into a larger corpus of evidence.
The final set of results focuses on assuring the reliability of evidence. In particular, these results focus on assuring that timestamps, which are pervasive in digital evidence, can be reliably interpreted relative to real world time. Empirical results are
presented which demonstrate how simple assumptions cannot be made about computer
clock behaviour. A novel analysis technique for inferring the temporal behaviour of a
computer clock is proposed and evaluated.
Table of Contents
2.4.2 Effective forensics tools and techniques ______________________ 33
2.4.3 Meeting the standard for scientific evidence ___________________ 34
2.5 Conclusions_________________________________________________ 36
Chapter 3. Related work ___________________________________________ 37
3.1 Event correlation for forensics __________________________________ 38
3.1.1 Approaches to modeling events _____________________________ 39
3.1.2 Event patterns and event pattern languages ____________________ 40
3.1.3 Observations____________________________________________ 42
3.2 Current approaches to evidence representation and format ____________ 42
3.2.1 Digital evidence container formats___________________________ 42
3.2.2 Representation of digital investigation documentation ___________ 46
3.2.3 Observations____________________________________________ 48
3.3 Reliable interpretation of time __________________________________ 48
3.3.1 An introduction to computer timekeeping _____________________ 48
3.3.2 Reliable time synchronization ______________________________ 49
3.3.3 Factors affecting timekeeping accuracy _______________________ 49
3.3.4 Usage of timestamps in forensics ____________________________ 50
3.3.5 Observations____________________________________________ 51
3.4 Conclusion _________________________________________________ 51
Chapter 4. Digital evidence representation: addressing the complexity and
volume problems of digital forensics ____________________________________ 53
4.1 Introduction_________________________________________________ 54
4.2 Background on knowledge representation _________________________ 56
4.2.1 Historical foundations ____________________________________ 56
4.2.2 Defining knowledge representation __________________________ 58
4.2.3 Hybrid approaches _______________________________________ 61
4.3 Semantic markup languages ____________________________________ 62
4.3.1 A basic Introduction to the RDF data model___________________ 64
4.3.2 RDF serialisation ________________________________________ 67
4.3.3 Adding semantics to published RDF data _____________________ 69
4.4 KR in digital forensics and IT security ____________________________ 72
4.5 A formal KR approach to investigation documentation and digital evidence
___________________________________________________________ 74
4.6 Conclusion _________________________________________________ 76
Chapter 5. Event representation in forensic event correlation ____________ 79
5.1 Introduction: Event correlation in digital forensics __________________ 80
5.2 Ontologies, KR and a new approach______________________________ 81
5.2.1 Knowledge representation framework ________________________ 82
5.2.2 Application architecture ___________________________________ 82
5.3 Implementation ______________________________________________ 83
5.3.1 The design of the event representation________________________ 83
5.3.2 Log parsers_____________________________________________ 85
5.3.3 A heuristic correlation language – FR3 _______________________ 86
5.4 Case study 1: Intrusion forensics ________________________________ 89
5.4.1 Investigation using FORE _________________________________ 90
5.4.2 Experimental results______________________________________ 95
5.5 Case study 2: Extending the approach to new domains _______________ 96
5.5.1 Integration of standard ontologies ___________________________ 97
5.5.2 Integrating new domains __________________________________ 98
5.5.3 Experimental results_____________________________________ 100
5.6 Conclusion ________________________________________________ 102
Chapter 6. Sealed digital evidence bags _____________________________ 107
6.1 Introduction________________________________________________ 108
6.2 Definitions ________________________________________________ 109
6.3 An extensible information architecture for digital evidence bags ______ 110
6.3.1 Storage container architecture _____________________________ 111
6.3.2 Information architecture__________________________________ 113
6.3.3 Integrity ______________________________________________ 117
6.3.4 Evidence assurance _____________________________________ 118
6.3.5 Clarifications __________________________________________ 118
6.4 Usage scenario: imaging and annotation _________________________ 118
6.5 Experimental results _________________________________________ 120
6.6 Conclusion and future work ___________________________________ 121
Chapter 7. Temporal provenance & uncertainty ______________________ 125
7.1 Introduction________________________________________________ 126
7.2 Characterising the behaviour of drifting clocks ____________________ 127
7.2.1 Experimental setup______________________________________ 127
7.2.2 Analysis and discussion of results __________________________ 128
7.3 Identifying computer timescales by correlation with corroborating sources _
__________________________________________________________ 133
7.3.1 Experimental setup ______________________________________ 134
7.3.2 Challenges in correlating browser and squid logs ______________ 135
7.3.3 Analysis methodology ___________________________________ 136
7.3.4 Clickstream correlation algorithm __________________________ 137
7.3.5 Results _______________________________________________ 139
7.3.6 Non-cached records correlation algorithm ____________________ 140
7.3.7 Results _______________________________________________ 141
7.4 Discussion _________________________________________________ 142
7.4.1 Relation to existing work _________________________________ 143
7.5 Conclusions________________________________________________ 144
Chapter 8. Conclusions and future work ____________________________ 147
8.1 Summary of contributions and achievements ______________________ 148
8.2 Discussion of main themes and conclusions _______________________ 149
8.2.1 Addressing complexity and volume of digital evidence _________ 149
8.2.2 Assurance of fundamental temporal information _______________ 150
8.3 Implications of Work ________________________________________ 150
8.4 Opportunities for further work _________________________________ 151
8.4.1 Document oriented evidence ______________________________ 151
8.4.2 Ontologies in digital forensics _____________________________ 152
8.4.3 Temporal assumptions underlying event correlation ____________ 153
8.4.4 Characterising temporal behavior of computers________________ 154
8.4.5 Event pattern languages __________________________________ 155
Chapter 9. Bibliography __________________________________________ 157
List of Tables
Table 1: Challenges in digital forensics - DFRWS 2006 keynote _______________ 31
Table 2: RDF/XML Serialisation of two triples_____________________________ 68
Table 3: RDF/XML serialisation using XML Namespace abbreviation __________ 68
Table 4: Alternative but semantically equivalent RDF syntax tailored to type definition
_____________________________________________________________ 68
Table 5: RDF/XML serialisation of statement “A Person named Kevin Bacon and a
Person named Sarah Jessica Parker starred in the Movie 'Footloose'." ______ 69
Table 6: N3 serialisation of statement from Table 5 _________________________ 69
Table 7: A simple Movie related ontology_________________________________ 72
Table 8: Web Session / Causality Correlation Rule __________________________ 89
Table 9: OSExploit Heuristic Rule_______________________________________ 91
Table 10: SAP Related Events __________________________________________ 99
Table 11: Identity Masquerade Rule ____________________________________ 100
Table 12: Door Entry- Login rule ______________________________________ 100
Table 13: The file content of a browser log SDEB _________________________ 113
Table 14: XML/RDF content of Investigation Documentation File named
jbloggs.cache.index.dat.rdf ______________________________________ 115
Table 15: Digital Evidence Bag instance data stored in the Tag File ___________ 117
Table 16: Evidence Content message digest property _______________________ 117
Table 17: Investigation Documentation Container Metadata stored in the Tag File. 118
Table 18: Annotated information from composing SDEB____________________ 121
List of Figures
Figure 1: Corresponding phases of linear process models of digital forensic
investigation___________________________________________________ 21
Figure 2: Event based digital investigation framework _______________________ 22
Figure 3: Digital crime scene specific investigation phases____________________ 23
Figure 4: Carrier's digital forensics tool abstraction layer model _______________ 28
Figure 5: Turner's digital evidence bag ___________________________________ 44
Figure 6: Trivial set of physical, digital and document evidence________________ 55
Figure 7: Current Semantic Web standards ________________________________ 64
Figure 8: Basic RDF node-arc-node triple _________________________________ 65
Figure 9: RDF statement "A person named Kevin Bacon starred in a movie named
'Footloose'"____________________________________________________ 65
Figure 10: Unambiguous meaning is given to concepts and instances through naming
with URI’s ____________________________________________________ 66
Figure 11: RDF Graph representing statement “A Person named Kevin Bacon and a
Person named Sarah Jessica Parker starred in the Movie ‘Footloose’.” _____ 67
Figure 12: The FORE Architecture ______________________________________ 83
Figure 13: Instance and Class/Subclass relationships between events____________ 85
Figure 14: Causal ancestry graph of exploit________________________________ 92
Figure 15: Related events remain unconnected because of surrogate proliferation __ 93
Figure 16: Correlated event graphs after proliferate surrogates merged __________ 94
Figure 17: Causal ancestry graph of identity masquerading scenario ___________ 101
Figure 18: Referencing nested and external digital evidence bags _____________ 111
Figure 19: Proposed sealed digital evidence bag structure ___________________ 112
Figure 20: RDF Graph relating original data object and image ________________ 116
Figure 21: RDF graph resulting from addition of new documentation to embedded DEB
____________________________________________________________ 120
Figure 22: Experimental setup for logging temporal behaviour of windows PC's in
small business network _________________________________________ 128
Figure 23: Clock skew of Domain Controller "Rome" offset from civil time. ____ 129
Figure 24: Clock skew of workstation "Florence" offset from civil time. ________ 130
Figure 25: Clock skew of workstation "Milan" offset from civil time (zoomed). __ 131
Figure 26: Clock skew of workstation “Trieste” offset from civil time. _________ 132
Figure 27: Clock skew of "Rome" vs. "Milan" offset from civil time (zoomed). __ 132
Figure 28: Experimental setup for correlation _____________________________ 134
Figure 29: Matching is complicated by only the most recent record present in the
history. ______________________________________________________ 136
Figure 30: Correlated skew (clickstream) vs. experimental skew (timeline) for host
“Milan” do not correlate because of presence of false positives.__________ 137
Figure 31: Correlated skew vs. experimental skew for host “Milan” correlates when
false positives are removed. ______________________________________ 138
Figure 32: "Pompeii" cache correlation. __________________________________ 139
Figure 33: History Correlation vs. Timescale. _____________________________ 141
Figure 34: Incomplete information______________________________________ 143
List of Abbreviations
AAFS: American Academy of Forensic Sciences
ACPO: (UK) Association of Chief Police Officers
AFF: Advanced Forensics Format
API: Application Programming Interface
APIC: Advanced Programmable Interrupt Controller
BIOS: Basic Input Output System
CART: Computer Analysis and Response Team
CDESF: Common Digital Evidence Storage Format
CEP: Complex Event Processing
CERN: The European Particle Physics Laboratory (Conseil Européen pour la Recherche Nucléaire)
CERT: Computer Emergency Response Team
CFSAP: Computer Forensics Secure Analyse Present
CFTT: Computer Forensics Tool Testing
DAML: DARPA Agent Markup Language
DARPA: Defense Advanced Research Projects Agency
DCO: Device Configuration Overlay
DC: (MS Windows) Domain Controller
DCS: Digital Crime Scene
DE: Digital Evidence
DEB: Digital Evidence Bag
DEID: Digital Evidence Identifier
DF: Digital forensics
DFRWS: Digital Forensics Research Workshop
DL: Description Logic
DMCA: Digital Millennium Copyright Act
DTD: Document Type Definition
DLG: Directed Labelled Graph
DO: Data Object
ERP: Enterprise Resource Planning
FBI: (US) Federal Bureau of Investigation
FOL: First Order (Predicate) Logic
FSM: Finite State Machine
FTK: Forensic Toolkit
HDD: Hard Disk Drive
HTML: Hypertext Markup Language
HPA: Host Protected Area
IDS: Intrusion Detection System
IE: Internet Explorer
IOCE: International Organisation on Computer Evidence
KIF: Knowledge Interchange Format
KR: Knowledge Representation
LSID: Life Sciences Identifier
MAC: Media Access Control
MD5: Message Digest 5
MRU: Most Recently Used
N3: Notation 3
NIJ: (US) National Institute of Justice
NIST: (US) National Institute of Standards and Technology
NL: Natural Language
NSRL: (US) National Software Reference Library
NTP: Network Time Protocol
OIL: Ontology Inference Layer
OWL: Web Ontology Language
P2P: Peer To Peer
PDA: Personal Digital Assistant
RAID: Redundant Array of Inexpensive Disks
RDF: Resource Description Framework
RDFS: RDF Schema
RTC: Real Time Clock
SDEB: Sealed Digital Evidence Bag
SGML: Standard Generalized Markup Language
SIM: Subscriber Identity Module
SNTP: Simple Network Time Protocol
SUO: Standard Upper Ontology
SUMO: Suggested Upper Merged Ontology
SWGDE: Scientific Working Group on Digital Evidence
SPARQL: Simple Protocol And RDF Query Language
TSK: The Sleuth Kit
UMM: Unified Modelling Methodology
URI: Uniform Resource Identifier
URL: Uniform Resource Locator
URN: Uniform Resource Name
UTC: Coordinated Universal Time
WWW: World Wide Web
W3C: World Wide Web Consortium
XML: Extensible Markup Language
XML-NS: XML Namespace
XSD: XML Schema Definition
Declaration
The work contained in this dissertation has not been previously submitted to
meet requirements for an award at this or any other higher education institution. To the
best of my knowledge and belief, this dissertation contains no material previously
published or written by any other person except where due reference is made.
Previously Published Material
The following papers have been published or presented, and contain material
based on the content of this dissertation.
Schatz, B., Mohay, G. and Clark, A. (2004) 'Rich Event Representation for Computer
Forensics', Proceedings of the 2004 Asia Pacific Industrial Engineering and
Management Systems (APIEMS 2004), Brisbane, Australia.
Schatz, B., Mohay, G. and Clark, A., (2004) ‘Generalising Event Forensics Across
Multiple Domains’ Proceedings of the 2004 Australian Computer Network and
Information Forensics Conference (ACNIFC 2004), Perth, Australia.
(revised version published as)
Schatz, B., Mohay, G. and Clark, A., (2005) ‘Generalising Event Correlation Across
Multiple Domains’, Journal of Information Warfare, vol 4, iss 1, pp. 69-79.
Schatz, B., Clark, A., (2006) ‘An information architecture for digital evidence
integration’ Proceedings of the 2006 Australian Security Response Team Annual
Conference (AUSCERT 2006), Gold Coast, Australia.
Schatz, B., Mohay, G. and Clark, A., (2006) ‘Establishing temporal provenance of
computer event log evidence’ Digital Investigation, 3 (Supplement 1), pp. 89-107.
(also published as)
Schatz, B., Mohay, G. and Clark, A., (2006) ‘Establishing temporal provenance of
computer event log evidence’ Proceedings of the 2006 Digital Forensics Research Workshop
(DFRWS 2006), West Lafayette, USA.
In loving memory of my father, Gregory Schatz.
Acknowledgements
This dissertation, like most, is the product of one author, yet has been shaped
by a cast of supporters, colleagues, friends and family. I would like to express my
sincere appreciation in the following paragraphs.
I would like to thank Adjunct Professor George Mohay, Dr. Andrew Clark and
Associate Professor Peter Best for their supervision. George and Andrew’s guidance
and inspiration have been instrumental in directing the course of this research. Thank
you both for giving me the opportunity to spend this time researching and for freely
sharing of your insight, time, energy and experience.
George deserves special mention for the methodical and focused attention
which he applied to my writing; it has been a true pleasure writing papers together.
Andrew’s patience and willingness to offer an alternative perspective has often helped
clarify otherwise murky waters.
The Information Security Institute (ISI) provided the resources and
environment for me to perform this research, contributions which I highly appreciate.
Without the help of Ed Dawson, Colin Boyd, and Mark Looi I would not have found
the opportunity to research at the ISI. Many thanks to the ISI staff who have helped
along the way. Additional thanks go to SAP Research, who supported some research
related to Chapter 5.
I would like to express my appreciation to Peter Best for providing important
direction related to the event correlation work presented in Chapter 5. Peter Kingsley
(Qld. Police Forensics Unit) provided valuable assistance in provenance related issues
which contributed towards the results in Chapter 6.
Many thanks go to my colleagues who have helped along the way by reading
drafts of papers and discussing ideas. I would especially like to thank Jason Smith
(ISI), Mark Branagan (ISI) and Julienne Vayssiere (SAP Research).
Finally, to my family, thank you all for your encouragement and understanding
during the period of this research. In particular I would like to give heartfelt thanks to
my wife Kelly, who has patiently supported me throughout this period, enduring far too
much seriousness and absence on my part.
Chapter 1. Introduction
All our lauded technological progress – our very civilization – is like
the axe in the hand of the pathological criminal.
(Albert Einstein)
According to Parker [100], the first successfully prosecuted case (in a federal US jurisdiction) involving the criminal use of a computer concluded on January 10, 1967. The defendant, a computer programmer, worked on a reporting system for overdrawn checking accounts for the National City Bank of Minneapolis. The defendant, whose personal checking account was with the same bank and subject to the same processing system, patched the program to hide a growing personal debt. The situation was discovered when a computer failure caused processing to revert to manual methods.
2 Slack space is an emergent artefact of filesystems, related to their block oriented allocation strategies. Where the final chunk of a file only partly fills the last block or cluster allocated to it, the remainder of that block or cluster remains unused; it is this unused area that is referred to as slack space.
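To make the arithmetic concrete, the following is a minimal illustrative sketch (in Python, not drawn from the dissertation) of how the amount of slack space follows from the file size and the cluster size:

    def slack_space(file_size, cluster_size=4096):
        """Return the number of unused bytes in the final cluster allocated to a file."""
        if file_size == 0:
            return 0
        remainder = file_size % cluster_size
        return 0 if remainder == 0 else cluster_size - remainder

    # Example: a 10,000 byte file stored in 4096 byte clusters occupies three
    # clusters (12,288 bytes), leaving 2,288 bytes of slack in the final cluster.
    print(slack_space(10000))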
In 1998, Sommer described the following basic principles for evaluating the acceptability of new types of evidence not previously considered by courts:
• accurate – the evidence should be “free from any reasonable doubt about the
quality of procedures used to collect the material, analyse the material if that is
appropriate and necessary and finally to introduce it into court - and produced
by someone who can explain what has been done. In the case of exhibits which
themselves contain statements - a letter or other document, for example – ‘accuracy’ must also encompass accuracy of content; and that normally requires the document’s originator to make a Witness Statement and be available for cross-examination”
• complete – “tells within its own terms a complete story of (a) particular set of
circumstances or events” [121]
• explainable – “in the case of material derived from sources with which most
people are not familiar quite extensive explanations may be needed”
• whether the technique has been “subjected to peer review and publication”,
• “the known or potential rate of error… and the existence of and maintenance
of standards controlling the technique’s operation”, and
• “general acceptance.”
3 http://www.guidancesoftware.com/
4 http://www.accessdata.com/
1.2 Contributions
New paradigms for interacting with, managing, processing and presenting
digital evidence are needed for achieving these efficiencies and reliability of findings.
Current approaches to digital investigation rely overly on human intervention as the glue which binds together the operation of disparate tools over opaque data. All the while, investigation documentation must be maintained and generated in sufficient detail to provide assurance of the authenticity and provenance of evidence and reproducibility of findings.
The aim of the work described in this dissertation is to investigate, at a fundamental level, the nature of information examined, inferred and reported in digital investigations, and to identify techniques which facilitate the documentation of digital investigations, the analysis of digital evidence, and the reporting of findings, while at the same time assuring the reliability and authenticity of digital evidence. Our research addresses the
complexity problem by supporting the expression of arbitrary information related to the
investigation, and the volume problem by enabling scalable approaches to digital
evidence.
This dissertation summarises a significant body of research performed
following three intertwined themes: representation, analysis techniques, and
information assurance. The original contributions contained within this dissertation are:
5 Defined in Chapter 5.
This chapter provides a brief introduction to the field of digital forensics, and the subject of that field: digital evidence. A brief summary of the challenges in digital forensics is presented, followed by a summary of the contributions of this dissertation, and finally, the dissertation roadmap.
Chapter 2: Background: Digital forensics
This chapter is a comprehensive review of digital forensics and digital evidence
from practice and research perspectives. The chapter begins by describing the historical
context and evolution of the field. The field is then characterised by presenting multiple
perspectives of it, including noted definitions, the nature of digital evidence and its
relation to digital forensics tools. Finally, current approaches to representing and
documenting digital evidence are described, and limitations in evidence representation
are identified.
Chapter 3: Related work
This chapter reviews background material and related work relevant to the
work described in this dissertation in Chapters 4 to 7. Section 3.1 describes the
literature related to event correlation, both specifically related to computer forensics
and in a more general context. Chapter 5 builds upon this background. Section 3.2
describes current approaches to maintaining investigation documentation and evidence
storage, which is a subject of Chapter 6. Both Sections 3.1 and 3.2 make observations related to representation, upon which Chapter 4 builds. Section 3.3 describes
work related to computer timekeeping, forming background information for Chapter 7.
Chapter 4: Digital evidence representation: addressing the complexity &
volume problems of digital forensics
Following from the observed limitations in evidence representation made in
Chapter 3, this chapter reviews literature in the fields of Knowledge Representation
(KR) and markup languages, with the goal of representing digital evidence. These are
currently the two primary approaches to representing and communicating knowledge
outside of natural language. The historical context of KR is described, followed by a
description of the major approaches to KR. The historical context of markup languages
is then relayed, leading to a description of the current state of KR, its influence on
markup languages, and the current research agenda towards building a “Semantic Web”
of knowledge. A brief introduction to the RDF/OWL representational formalism is then
presented. Finally, the chapter concludes by proposing that this formalism would be of
benefit towards solving the complexity and volume problems of computer forensics.
Chapter 5: Heuristic event correlation for forensics
This chapter addresses the themes of evidence representation and analysis
techniques. Following from the proposal of the RDF/OWL formalism as a
This chapter describes in detail the field of digital forensics. Section 2.1 begins
by describing the historical context and evolution of the field. Following this, Section
2.2 relates key definitions of digital forensics and digital evidence. Section 2.2.1
describes in detail the nature of digital evidence, and the following section, Section
2.2.2, describes the digital investigation process by surveying a number of process
models which have been proposed. The subject of Section 2.3 is digital forensics tools.
Finally, key research challenges in the field of digital forensics are outlined.
The FBI established a Magnetic Media Program, its first computer forensics initiative [9, 106]. The Magnetic Media Program later became the Computer Analysis and Response Team (CART).
The late 80’s and early 90’s saw the proliferation of the PC platform, and in the early 90’s, the widespread recognition that new techniques were required for preserving digital evidence. The first specific forensic imaging tool, IMDUMP, was developed in the USA; it was superseded in 1991 by a tool called Safeback [89]. In the UK in the same year, another disk imaging application called the Data Image Back-Up System (DIBS) was produced [9].
Computer forensics practitioners began to organise and evaluate their techniques and practices; in 1993 the First International Law Enforcement Conference
on Computer Evidence was hosted by the FBI. Subsequent conferences led to the 1995
formation of the International Organization on Computer Evidence (IOCE), and the
1997 meeting which resolved to develop best practice standards [21]. Around this time
audio and video technologies were moving from analogue to digital, which led
practitioners to consider whether the same principles of computer forensics applied to
all types of digital evidence [142].
Efforts to define the principles of computer forensics culminated in 1999 with the IOCE's adoption of proposals authored by two member organisations: the Scientific Working Group on Digital Evidence (SWGDE), from the USA, and the Association of Chief Police Officers (ACPO), from the UK [21]. The ACPO proposal has evolved into what is known as the “Good Practice Guide for Computer based Electronic Evidence” [6]. In 2002, based on the IOCE's 2000 submission, the G8 issued the “G8 Proposed
principles for the procedures relating to digital evidence”. In Australia, the move
towards formal standardisation of the management and treatment of digital evidence
has begun with the 2003 definition of “Guidelines for the management of IT evidence”
[1].
The academic history of computer forensics goes back to the late 80’s and early
90’s with work by Collier and Spaul [33], Sommer [120] and Spafford [123]. By the
late 90’s very little had been published in the open literature on computer forensics
[88]; however, the new millennium has seen an upturn in both DF targeted publications
and conferences, including the first two specifically targeted journals. The first digital
forensics targeted conference, The Digital Forensics Research Workshop, was
established in 2001, followed by the International Journal of Digital Evidence in 2002
and the International Journal of Digital Investigation in 2004.
That digital forensics has been made the subject of a recent special issue of the Communications of the Association for Computing Machinery (CACM) [5] is indicative of the transition of the field towards the mainstream.
• information stored or transmitted in binary form that may be relied upon in court
(IOCE) [57]
• any data stored or transmitted using a computer that support or refute a theory of how
an offence occurred or that address critical elements of the offence such as intent or
alibi (Casey) [27]
The above definitions are sourced from entities which primarily reside in the United States. The Association of Chief Police Officers of England, Wales and Northern Ireland defines Computer Based Electronic Evidence as:
The terms “digital evidence” and “computer based electronic evidence” are used synonymously. While the final definition parallels the earlier three, subtle differences remain. By defining computer evidence in relation to the investigative process, rather than in relation to the legal one, the ACPO definition addresses digital data from the time it becomes a part of an investigation. The other definitions, however, limit their subject to data which has been examined and found relevant towards
6 Probative value: “The extent to which evidence could rationally affect the assessment of the probability of the existence of a fact in issue” [68]
establishing some theory. The IOCE definitions address this shortfall between the two
styles of definitions by defining the term Data Objects:
• Objects or information of potential probative value that are associated with physical
items. Data objects may occur in different formats without altering the original
information.
For this dissertation, the author chooses to adopt the IOCE definition of the
term digital evidence: “information of probative value 7 that is stored or transmitted in
binary form”.
The term computer forensics was in informal use in academic publications from at least 1992 [121]; however, the term remained informally defined for many years. A commonly cited definition of the field in Australian literature is McKemmish’s 1999 definition of forensic computing:
The word forensic comes from the Latin word forensis: public; to the forum or public
discussion; argumentative, rhetorical, belonging to debate or discussion. From there it
is a small step to the modern definition of forensic as belonging to, used in or suitable
to courts of judicature, or to public discussion or debate. Forensic science is science
used in public, in a court or in the justice system. Any science, used for the purposes of
the law, is a forensic science. [2]
This broad definition of forensics, together with McKemmish’s earlier definition, informs the definition of computer forensics given by the Scientific Working Group on Digital Evidence (SWGDE):
The use of scientifically derived and proven methods toward the preservation,
collection, validation, identification, analysis, interpretation, documentation and
presentation of digital evidence derived from digital sources for the purpose of
facilitating or furthering the reconstruction of events found to be criminal, or helping
to anticipate unauthorized actions shown to be disruptive to planned operations. [98]
This broad definition reflects a change in the forums in which the techniques of computer forensics are being applied. While traditionally computer forensics was targeted exclusively at the legal forum, it is increasingly practised in non-legal contexts such as corporate investigations, intelligence and the military.
7 Probative value: “The extent to which evidence could rationally affect the assessment of the probability of the existence of a fact in issue” [68]
The terms digital forensics, forensic computing and computer forensics are
today arguably used interchangeably. Historically, computer forensics and forensic
computing 8 related to the interpretation of computer related evidence in courts of law.
Technology, however, does not stand still, nor does language, and the meaning of these terms has remained under continual negotiation. Two factors underlie this process: the changing uptake of digital technologies, and, with it, moves within organisations towards governing and regulating the use of information technology.
The late 80’s and early 90’s period was characterised by a mainstream of stand alone PCs: the internet was in its infancy, and used only in academic circles. In this
environment, the main subject of computer forensics was indeed the basic components
of computing: persistent storage, such as floppy disks and hard disks, and the software
itself. As computers became inter-networked and proliferated into small devices such
as PDAs and mobile phones, the field has broadened its scope beyond the computer to
include network forensics and small scale device forensics. Digital forensics more
accurately describes this new state of affairs.
New imperatives are further shaping the field; demand for computer forensics is rising outside of the traditional legal context. Today, digital forensics is practised by law enforcement, the military, intelligence agencies, and also within the corporate sector. Each of these sectors brings with it a divergent agenda, primarily related to the rigour required of any conclusions made by an investigation.
In the law enforcement context, the primary objective is the prosecution of an
alleged perpetrator of crimes. This necessarily dictates application of strict judicial
standards to the practice of computer forensics, because of the impacts on the freedoms
and liberties of the accused. In the military context, a secondary objective may be
prosecution; however this objective is subordinate to continuity of operations. In this
context, the practitioner of computer forensics is prepared to sacrifice accuracy for
immediate answers. This being the case, and under a time imperative, the conclusions
made by the practice of computer forensics in the military context cannot necessarily be
expected to be rigorous. The term digital investigation arguably reflects these subtle
changes in focus.
Despite the changes in agenda signified by the usage of the term digital investigation over computer forensics, it is the subject matter of the field which defines it and unifies the various definitions. Digital evidence remains at the centre of the field.
The next section examines the nature of digital evidence.
8 For the reader interested in the evolution of definitions of the field, see “To Revisit: What is Forensic Computing?” [53]
• Provenance: identifying the genesis of the digital crime scene, for example
Case Number, Examiner, Evidence Number, Unique Description, Acquisition
Time
• Integrity: records which may enable identifying if the digital crime scene has
been modified
The drive for establishing a recognised set of standards for performing digital
forensics has resulted in reflection on what tasks are performed in a digital
investigation, and to what end. Beyond the goals of standards setting, descriptions of
forensic processes are also useful for training and directing research.
Early descriptions of digital forensics processes, such as Mandia’s intrusion response oriented methodology and Farmer and Venema’s early guidelines, have been criticised as too specific, focusing on particular technologies rather than on generalised process [109]. Since these early attempts, a number of other processes and
frameworks have been proposed, which are described in the following sections.
• Examination: “make evidence visible and explain its origin and significance…
search for information… data reduction”
• Analysis: “looks at the product of the examination for its significance and
probative value to the case. Examination is a technical review that is the
province of the forensic practitioner, while analysis is performed by the
investigative team.”
A further preparatory phase was also implied by posing the question of whether the first responder’s unit had the requisite capability to perform the other phases.
The results of a review of the terminology used for describing the phases of
linear process models are presented in Figure 1. Phases which may be considered as
either equivalent in nature, or simply more specific are arranged in columns.
A few points may be observed. Firstly, despite similarities in the activities or
goals identified for particular phases, terminology remains varied, and the subtleties
implied are not clearly defined. The differences in terminology may be explained by
the granularity of the models. For example, the 2001 NIJ model prescribes a Collection phase, but the 2004 model [93] describes Assessment and Acquisition phases without reference to Collection. It would appear, however, that Assessment and
Acquisition would form sub-phases of a general Collection phase.
Regarding granularity, we observe that the least granular and most abstract of the models presented is the Computer Forensics Secure Analyse Present (CFSAP) [89] model of Mohay et al., and the most granular is Reith’s Abstract Digital Forensics Model, which describes 9 phases in all [109].
Beebe’s Hierarchical, Objectives-Based Framework addresses granularity
issues by proposing a hierarchical structure by which sub-phases may be related to less
granular, higher level phases [13]. Additionally, she relates a class of concerns, which she calls Principles, which overarch many or all of the phases and sub-phases of the investigative process. Two Principles she identifies are Evidence
Preservation and Documentation.
[Figure 1: Corresponding phases of linear process models of digital forensic investigation, comparing Casey (2000), DFRWS (2001) and NIJ (2004); equivalent phases are arranged in columns.]
Overarching principles
• Upon seizing digital evidence, actions taken should not change that evidence.
• When it is necessary for a person to access original digital evidence, that person must
be forensically competent.
• All activity relating to the seizure, access, storage, or transfer of digital evidence must
be fully documented, preserved, and available for review.
• An individual is responsible for all actions taken with respect to digital evidence while
the digital evidence is in their possession.
• Any agency that is responsible for seizing, accessing, storing, or transferring digital
evidence is responsible for compliance with these principles. [130]
Scene investigation, and Presentation. The goals related to each phase are described
below:
The phases described above are presented in Figure 2, with arrows indicating
flow through the process phases. We note here that the identification of a digital device
at a physical crime scene may instigate a digital crime scene investigation, and
transitively, that digital evidence found on a digital crime scene may instigate an
investigation at a newly identified physical location.
The sub-phases of the digital crime scene investigation phase are depicted in
Figure 3. The first two phases are similar to early phases of the linear process models; however, the final phase, Event Reconstruction & Documentation, proposes a set of sub-phases which attempt to prove or disprove hypotheses related to events that may
have caused digital evidence found in the crime scene.
maintenance of integrity by tying the physical evidence from which it is derived to the
digital crime scene. In some jurisdictions it is routine practice to print a copy of this
hash to use as a contemporaneous note establishing this link.
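As a hedged illustration of this practice, the following Python sketch computes a message digest over an acquired image file; the algorithm choice and file name are assumptions for the example only, not a prescription from this dissertation:

    import hashlib

    def image_digest(path, algorithm="md5", chunk_size=1024 * 1024):
        """Compute a message digest of an acquired image, reading it in chunks."""
        digest = hashlib.new(algorithm)
        with open(path, "rb") as image:
            for chunk in iter(lambda: image.read(chunk_size), b""):
                digest.update(chunk)
        return digest.hexdigest()

    # The resulting value can be recorded in contemporaneous notes and later
    # recomputed to demonstrate that the image has not changed since acquisition.
    print(image_digest("suspect_drive.dd"))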
As raw data cannot exist without being contained or encoded in some way or
another, acquisition tools need a container in which to store the image. This container is
typically some other piece of raw media, such as a hard drive, or the contents of a file.
Despite the apparent simplicity of the process described above, numerous
problems emerge which must be considered. For example:
• Can we prove the write blocking technology actually ensures the integrity of
the digital crime scene?
• Is the digital crime scene copy an accurate copy of the original? In operation,
hard drives often have bad sectors which are unreadable. How does the presence of these sectors affect the maintenance of integrity, and how do we record their presence? (A sketch of one way of recording them follows this list.)
• Is the digital crime scene copy a complete copy of the original? New drives contain special areas protected from regular access, which may also be relevant to the investigation at hand.
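The following sketch illustrates one common convention for the bad sector question above: copy the source sector by sector, zero-fill any sector that cannot be read, and record its address for the acquisition report. It is an illustrative assumption, not a description of any particular tool:

    import logging

    SECTOR_SIZE = 512

    def acquire(source_path, image_path, total_sectors):
        """Copy a source device sector by sector, zero-filling and logging unreadable sectors."""
        bad_sectors = []
        with open(source_path, "rb") as src, open(image_path, "wb") as dst:
            for lba in range(total_sectors):
                try:
                    src.seek(lba * SECTOR_SIZE)
                    data = src.read(SECTOR_SIZE)
                    if len(data) < SECTOR_SIZE:
                        raise IOError("short read")
                except (IOError, OSError):
                    data = b"\x00" * SECTOR_SIZE   # zero-fill the unreadable region
                    bad_sectors.append(lba)        # record its presence for the report
                    logging.warning("unreadable sector at LBA %d", lba)
                dst.write(data)
        return bad_sectors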
In the early days of forensics, the primary role of digital forensic examination
tools was to interpret raw data into information. As datasets containing potential
evidence have become larger, the information gleaned from interpretation tools has become increasingly unwieldy, leading to “needle in a haystack” problems. To
address this, integrated tools have emerged which provide various techniques for
searching, navigating, filtering and examining the information. This section describes
tools related to these concerns.
Four higher level strategies are typically employed for finding relevant
information within this raw data: structural interpretation, signature based searching,
file classification and event correlation.
Structural interpretation exploits the structured nature of most forms of digital data. For example, the average hard drive is structured into partitions,
then file systems, then files and so on. Digital data, and the software that acts upon it,
is organised into layers of abstraction to reduce complexity. For example, as it applies
to storage, general purpose operating systems have long provided the familiar file and
directory abstractions for storing and organising data. Details of hard drive sector
addressing, and file indexing, are hidden from the average user at lower layers of
abstraction. Similar abstraction layers exist in software architectures and in network
data communications. Much of the job of forensic analysis tools is to exploit the structure of raw binary data so that data objects of an appropriate abstraction layer, and of evidentiary value, may be found.
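A minimal sketch of structural interpretation at the media management layer follows; it assumes a classic DOS/MBR partitioned image and decodes only the four primary partition table entries, which is far simpler than what real media analysis tools do:

    import struct

    def read_mbr_partitions(image_path):
        """Decode the four primary partition entries of a DOS/MBR partition table."""
        with open(image_path, "rb") as image:
            mbr = image.read(512)
        if mbr[510:512] != b"\x55\xaa":
            raise ValueError("no MBR boot signature found")
        partitions = []
        for i in range(4):
            entry = mbr[446 + i * 16:446 + (i + 1) * 16]
            partition_type = entry[4]
            start_lba, sector_count = struct.unpack_from("<II", entry, 8)
            if partition_type != 0:       # type 0 marks an unused table entry
                partitions.append({"type": partition_type,
                                   "start_lba": start_lba,
                                   "sectors": sector_count})
        return partitions

Each decoded partition can then be handed to a filesystem-level interpreter, illustrating the layering described above.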
EnCase and FTK are storage media analysis tools which provide primarily
structural interpretation functions over a common set of evidence types. Similar functionality is provided in the open source tools The Coroner's Toolkit (TCT) 9 and The Sleuth Kit (TSK) 10, the latter of which is based on the former.
Signature based interpretation refers to a class of interpretation techniques
best exemplified by the class of tools known as file carving utilities, which search raw
digital data for characteristics which are unique to particular species of files. File
carving utilities such as scalpel [111] or foremost 11 are able to identify potential
instances of image files such as GIF and JPEG, and documents in Microsoft Word format, by identifying local structure, regardless of whether an underlying filesystem is present. Local structure is used to identify data objects rather than searching global structure. Such interpretation strategies are useful in instances where the global
structure has become corrupted.
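The following naive sketch conveys the signature idea: it scans a buffer for JPEG header and footer byte sequences and yields the bounded regions as carving candidates. Real carvers such as scalpel and foremost are considerably more sophisticated; the size limit and signatures here are illustrative assumptions:

    JPEG_HEADER = b"\xff\xd8\xff"
    JPEG_FOOTER = b"\xff\xd9"

    def carve_jpegs(data, max_size=10 * 1024 * 1024):
        """Yield (offset, candidate) pairs for regions bounded by JPEG magic values."""
        position = 0
        while True:
            start = data.find(JPEG_HEADER, position)
            if start == -1:
                break
            end = data.find(JPEG_FOOTER, start, start + max_size)
            if end != -1:
                yield start, data[start:end + 2]
            position = start + 1   # keep scanning; candidates may be false positives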
Recent investigations in characterising large corpuses of hard drives have
employed signature methods for identifying credit card numbers, social security
numbers and email addresses [44].
9 http://www.porcupine.org/forensics/tct.html
10 http://www.sleuthkit.org/
11 http://foremost.sourceforge.net/
12 Some architectures have used different storage paradigms. The Palm Pilot and IBM’s OS/390 do not use files, but rather a record oriented storage paradigm.
In the more mature area of media analysis, where most practical activity is
occurring in the field, a number of commercial products have emerged that combine
acquisition, analysis, and reporting functionality in one integrated tool. EnCase and
FTK are prime examples of this class of tool.
A number of task-specific features are found in integrated digital investigation
environments: these are Navigation, Search and Presentation.
Navigation features enable the investigator to visualise and explore the
structure of the digital crime scene. In practical terms, this feature is implemented in
EnCase as a tree-style user interface element which represents the structure of the
digital crime scene using abstractions at various layers of the media analysis stack.
Search features enable the identification of data objects which conform to
various criteria, such as keyword or regular expression equivalence, date ranges, or data
object classifications. These criteria are evaluated against the content of data objects (such as in free text search of documents) or the attributes of objects (as in finding all pictorial image files with a .jpg or .gif file extension). Filtering, which we have
mentioned previously, can be seen as the opposite of search. Filtering limits the
perspective of search, presentation and navigation functionality to the data objects not
matching the filter criteria.
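A hedged sketch of the distinction between searching and filtering follows, using a hypothetical minimal data object type; integrated tools evaluate far richer criteria than this:

    import re
    from dataclasses import dataclass

    @dataclass
    class DataObject:          # hypothetical stand-in for a file recovered from an image
        name: str
        content: bytes

    def search(objects, pattern):
        """Return data objects whose content matches a regular expression."""
        regex = re.compile(pattern)
        return [obj for obj in objects if regex.search(obj.content)]

    def apply_filter(objects, predicate):
        """Filtering removes objects matching the filter criteria from view."""
        return [obj for obj in objects if not predicate(obj)]

    documents = [DataObject("notes.txt", b"meet at the usual place"),
                 DataObject("readme.txt", b"nothing of interest")]
    hits = search(apply_filter(documents, lambda o: o.name.endswith(".gif")),
                  rb"usual place")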
Finally, presentation functionality is related to presenting data objects, their
attributes (such as file metadata), and content in meaningful ways. As digital data may
have multiple interpretations, multiple viewer types may be appropriate for interpreting
data object content. For example, it may be instructive to read the textual content of a
HTML page in some instances, and the rendered page in others.
This class of tools typically provides support for recording some investigation documentation, such as case IDs and investigator names.
Despite the pivotal nature of tools in digital forensics, little academic work has
focused on this subject.
Carrier has proposed a model of digital forensics examination and analysis
tools which characterises the digital forensic tool as an interpreter of data from one layer of
abstraction to data at another, higher layer of abstraction. In this model (presented in
Figure 4) a forensic tool implements a rule set which translates input data from one
layer of abstraction into output data at another layer of abstraction. In performing this
transformation, a tool may introduce an error.
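The following sketch restates this reading of the model in code; the names and the separation of abstraction error from implementation faults are illustrative assumptions rather than Carrier's formulation:

    from dataclasses import dataclass
    from typing import Any, Callable, List

    @dataclass
    class LayerOutput:
        objects: List[Any]          # data objects at the higher layer of abstraction
        abstraction_error: float    # error attributable to the tool's rule set

    # A tool, in this reading, is a rule set translating data at one abstraction
    # layer into data at a higher layer, possibly introducing error as it does so.
    Translator = Callable[[Any], LayerOutput]

    def lossless_decoder(rules):
        """Wrap a purely structural rule set: it may still harbour implementation
        faults, but by design it introduces no abstraction error."""
        return lambda data: LayerOutput(objects=rules(data), abstraction_error=0.0)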
Tools which purely decode structure, such as media analysis tools, may
inadvertently introduce errors caused by implementation faults; however, these types of error are difficult to measure and, in the commercial sphere, are not disclosed. File
carving tools introduce a different kind of error apart from implementation error:
abstraction error. An example of this is that signatures may inadvertently match data
that is not a valid file, leading to false positives 13 .
In practice, a tool may internally implement multiple abstraction layer
transformations. The open source forensics tools generally address only a few
abstraction layer transformations related to a particular class of structure. For example,
The Sleuth Kit (TSK) takes as input data from the media management
abstraction layer (which is concerned with volumes and partitions) and outputs data at
the file system layer of abstraction, such as files (both deleted and regular) and
directories. Commercial tools tend to be more monolithic in nature and integrate
abstraction layers from separate domains. For example, while EnCase includes
abstraction layer translators equivalent to those found in TSK, it additionally includes a
translation layer which translates Redundant Array of Inexpensive Disks (RAID)
images from the physical media layer to the media management layer.
13 A false positive is a test result that incorrectly indicates a positive finding. In this case a signature might match some data, indicating that a file of a particular type has been found, while in fact the file is not of that type.
Notable open source digital forensics investigation tools are the Autopsy
forensic browser [25], PyFLAG 14 and the TULP2G small scale device forensics
framework [137]. The architecture employed by Autopsy is a component oriented one,
with the user interface components running in separate processes from the filesystem
interpretation layer tools. The latter are sourced from separate projects, including the
related Sleuth Kit project. Theoretically, this separation of functionality enhances
robustness by limiting the effects of software faults to the implementing module, rather
than affecting the whole application. Autopsy provides limited support for the
maintenance of case related documentation.
While not open source in nature, the XIRAF digital forensics prototype [7] uses
a similar architecture to TSK, utilising wrappers around existing open source
interpretation tools.
A number of groups have begun experimenting with clustered computing
architectures as foundations for forensics tools, towards the goal of addressing the IO
and CPU bound processing issues inherent in current monolithic architectures. The
prototype Distributed Environment for Large-scale inVestigations (DELV) investigated
the feasibility of speeding up processing by spreading an entire hard disk image across
the RAM of a cluster of commodity PCs, moving the processing to each node [114].
The Open Computer Forensics Architecture (OCFA) employs a distributed processing
model for recursively processing data objects found within a digital crime scene [62].
Similar to XIRAF and Autopsy, interpretation is realised by wrapping existing
interpretation tools.
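The recursive processing idea can be conveyed with a short sketch: every data object taken from the digital crime scene is queued, and each extractor applied to it may yield further child objects (files within a filesystem, attachments within a mailbox), which are queued in turn. This is an illustrative assumption about the general pattern, not the OCFA implementation:

    from collections import deque

    def process_recursively(root_objects, extractors):
        """Breadth-first processing in which extractors may derive new child objects."""
        queue = deque(root_objects)
        processed = []
        while queue:
            data_object = queue.popleft()
            processed.append(data_object)
            for extract in extractors:
                queue.extend(extract(data_object))   # children re-enter the queue
        return processed

In a distributed setting the queue becomes the unit of work handed out to worker nodes.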
14 http://pyflag.sourceforge.net/
[Table 1: Challenges in digital forensics, from Lindsey's DFRWS 2006 keynote; the listed challenges include wireless technologies and anti-forensics, among others.]
These challenges as enumerated by Lindsey at DFRWS 2006 are a mix of: new
technologies (e.g. wireless, whole drive encryption), situational technology trends (e.g.
device diversity, volume of evidence, distributed evidence), and techniques (e.g. Live
response, usability & visualisation).
In 2005, the following list of challenges was presented by Mohay [87]:
• Embedded systems
• Tools
• Data volumes
• Counter forensics
• Networked evidence
• Tool testing
15 And have been doing so since around 1989.
leaving this information outside of the scope of the tool. Considering the storage
volume example given above, the integrated forensics tool EnCase has had its internal
model changed to include abstractions for RAID volumes and corresponding
component media regions, and the tool's interaction model has been tweaked to represent these abstractions. The Sleuth Kit, the open source forensics tool addressing similar analysis tasks, does not, however, address storage virtualisation. Rather, it leaves
management of this conceptual change in the hands of the human tool operator.
The complexity and volume problems are well illustrated in the Gorshkov case,
which involved credit card theft from at least 11 online entities, and subsequent
fraudulent use of those cards through PayPal 16 and eBay 17 [8]. Successful prosecution
of this case involved evidence drawn from multiple computers, under the control of
multiple company entities, in multiple jurisdictions (some of which was acquired over
the internet from Russia). The evidence was drawn from multiple sources, such as
backups, hard disk images, emails, archive copies, and hard copies. Interpreting the
evidence required numerous applications and multiple operating systems.
As we have said, digital investigation covers a very broad range of conceptual
entities, and any schema or model attempting to fully describe the domain quickly
becomes insufficient as technology inexorably marches on. In this light, a means of
representing evidence and related information expressive enough to represent all of the
information we wish, while not committing us to a particular data model, is desirable.
Furthermore, such a model should be extensible enough that new information may be
added by arbitrary means, as new tools and techniques emerge, without breaking
existing tools or violating the integrity of the existing information. Conversely, a means to declaratively attach semantics to data, without resorting to modifying the tools which operate over the information, holds promise for the integration of arbitrary and heterogeneous data.
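Chapter 4 develops this argument using the RDF/OWL formalism; as a foretaste, the following minimal sketch (using the rdflib library, with a hypothetical vocabulary, identifier and digest value) shows new statements being attached to an existing evidence description without disturbing the statements already made:

    from rdflib import Graph, Literal, Namespace, URIRef

    EX = Namespace("http://example.org/forensics#")      # hypothetical vocabulary
    image = URIRef("urn:example:evidence/image-001")     # hypothetical identifier

    g = Graph()
    # Statements made by an acquisition tool.
    g.add((image, EX.acquiredBy, Literal("J. Bloggs")))
    g.add((image, EX.md5, Literal("d41d8cd98f00b204e9800998ecf8427e")))

    # A later tool can attach entirely new properties without altering, or even
    # understanding, the statements and tools that came before it.
    g.add((image, EX.containsPartition, URIRef("urn:example:evidence/image-001/p1")))

    print(g.serialize(format="turtle"))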
Addressing the volume and complexity challenges requires new approaches to
building tools which acknowledge the rate of change of technology, and enable
continued tool functioning despite new sources of complexity. We hypothesise that by
focusing on the relationship between tools and formal representation, a key theme of
this research, new approaches might be identified which address these challenges.
16 http://www.paypal.com/
17 http://www.ebay.com/
changes. It is well acknowledged that new approaches to building tools are necessary:
this situation is reflected by a recent upturn in focus on research into new tools and
techniques.
The first DFRWS in 2001 focused primarily on frameworks and principles for
digital forensics, rather than on forensics tools and techniques [88]. Two speakers,
however, highlighted the need for tools and techniques to evolve:
• The social aspects of our analytical endeavors are in need of focus, too. We
need tools that zero in on truly useful information and quickly deduce whether
it is material to the investigation or not. We need to identify a social “end-
game”. Are we prepared to take serious action to thwart wrongdoing in all its
forms? (Spafford, DFRWS 2001).
By the time of DFRWS 2006, the research priorities of the field had indeed
begun to shift towards addressing these challenges. Papers were presented covering
tool validation, memory analysis, tool integration (twice) and evidence correlation (four
times). In all, 8 of the 17 papers presented at DFRWS 2006 related to the
development of new techniques and tools.
Analysis of the prevalence of specific technologies cited as challenges by
Lindsey, Mohay and Turner reinforces the need for new and more effective forensics
tools and techniques. To quote Mohay:
These tools need to target the ever increasing volume and heterogeneity of
digital evidence and its sources, and they need to be inter-operable [88].
In the early 90s, much of the focus of the field was on building effective
forensics tools and having them accepted in court. Frameworks for characterising the
field of forensics, such as forensics process models, and protocols for ensuring integrity
and chain of evidence were primary concerns. Today, there appears to be consensus on
appropriate methodologies and protocols for dealing with digital evidence, a conclusion
which can be implied by the widespread adoption of digital evidence in proceedings.
Despite the apparent need to trade rigour off against expediency or other factors in
some contexts of digital investigation, the need for rigour in the conclusions drawn is a
principal tenet. In particular, the traditional forensic sciences are based on the
application of reliable scientific methods – seeking to use techniques or tools only after
rigorous and thorough analysis. The field of digital forensics (at least in the United
States) is struggling to meet the court’s standards for scientific evidence [78].
At the first DFRWS, it was concluded that, for digital forensic science to be
considered a discipline, it must exhibit a number of defining characteristics [98].
2.5 Conclusions
The utility of the computer as a tool of production, communication, and
commerce has resulted in widespread adoption over the latter half of the twentieth
century and the start of the new millennium. Digital technology is now pervasive.
Network effects and the rapid pace of change in digital technology have led to a
situation where the employment of digital evidence is complicated by the burden of
large quantities of highly complex data. The challenge for digital forensics is to
increase reliability and rigour while at the same time increasing the efficiency of
investigation. New techniques for interpreting and analysing evidence and new
approaches to building interoperable forensics tools are required.
Addressing these key challenges requires new approaches to building tools
which acknowledge the rate of change of technology, and enable continued tool
functioning despite new sources of complexity. We hypothesise that by focusing on the
relationship between tools and formal representation, new approaches might be
identified which address these challenges.
Chapter 3. Related work
“The search for truth is in one way hard, and in another easy – for it
is evident that no one of us can master it fully, nor miss it wholly.
Each one of us adds a little to our knowledge of nature, and from all
the facts assembled arises a certain grandeur.”
(Aristotle)
Employing a trivial example, the canonical form worked well for making
statements such as
The data model of the TSOA canonical form alone, however, prevented
expressing even slightly more complex statements (which unfortunately are on the
lower end of conceptual complexity when considering event logs) such as the
following:
“at 12:00 on the 1st January john logged into the host www”
operate at a higher level of abstraction than rule based approaches by using declarative
languages that model different aspects of a situation. Both signature and rule based
techniques typically entail specifying event signatures or rules using some kind of
event language.
A number of signature based alert correlation languages aim to correlate events
based on abstract models of intrusion goals. The LAMBDA correlation language is a
signature based approach that matches signatures of event consequences against event
prerequisites, generating Prolog based correlation rules [35]. This language uses an ad
hoc combination of XML and Prolog syntax to model both Attacks and Alerts.
JIGSAW uses a similar technique for correlating pre and post conditions, focusing
more on language syntax [132]. Its authors model pre and post conditions as “requires” and
“provides” relationships of events. Ning et al. criticise JIGSAW as overly restrictive,
and weaken the requires/provides relation in Hyper-Alerts to allow correlation in
absence of certain prerequisites [95]. Similarly to LAMBDA, both JIGSAW and
Hyper-Alerts are translatable to rules.
CEP [101] employs a rule language called RAPIDE for event pattern
recognition and correlation. This language contains features for matching on parameters
such as causal ancestry and repetition, as well as simple property based comparisons.
STATL uses finite state machine (FSM) models to specify signatures. Doyle et al.
critique the use of FSMs: “Representing events as transitions through a single chain of
states precludes recognizing the achievement of a set of attack preconditions that have
no innate required time order.” (p. 21) [37]. Techniques for translating FSMs into rules
are well established.
The line where rule and signature based approaches become expert systems is
blurred. Two differentiators would be the use of dynamic, object based knowledge
models, and a translation stage between signature specifications and underlying rule
representations.
A number of approaches have applied expert systems and logic based
reasoning to event correlation. The EMERALD IDS [69] uses an expert system
implemented using the P-BEST rule language to specify intrusion rules. Doyle et al
criticise P-BEST for lacking any concepts specific to event recognition [37]. The rule
language employed by Stallard and Levitt, called JESS, has a similar heritage to P-
BEST [125].
Of the correlation languages reviewed, only STATL was available with source
code. RAPIDE is available only as executables, and has not been updated since 1998.
JIGSAW has no implementation. Hyper-Alerts has been implemented, but the
implementation is not publicly available; nor is an implementation of LAMBDA
available. P-BEST is only available embedded in the
EMERALD product [69]; it is not easily modified, nor is it straightforward to
access the P-BEST functionality directly. It has few features differentiating it from the
class of languages based on the CLIPS expert system [143].
3.1.3 Observations
A number of containers for images are in common usage, such as simple binary
data files produced by the venerable UNIX copy tool, dd, and the EnCase Expert
Witness file format, produced by EnCase’s imaging tools. The latter, besides serving as
a container for images and Palm Pilot memory [31], additionally contains checksums, a
hash for verifying the integrity of the contained image, error information describing bad
sectors on the source media, and metadata related to provenance. The Advanced
Forensics Format (AFF) is a disk image container which supports storing arbitrary
metadata as name/value pairs [45].
There is, however, little standardisation of storage containers or consideration
of how to record aspects such as those described above. The current state of the art has
given rise to a variety of ad hoc and proprietary formats for storing evidence content,
and related evidence metadata. Conversion between the evidence formats utilised and
produced by the current generation of forensic tools is complicated: the process is time
consuming and manual in nature, and it may produce incorrect evidence data or lose
metadata [30]. Validation of the results produced is
hindered by this lack of format standardisation.
It is with these concerns in mind that calls have been made for a universal
container for the capture of digital evidence. Recently, the term “digital evidence bags”
was proposed to refer to a container for digital crime scene artefacts, metadata, integrity
information, and access and usage audit records [135]. Subsequently, the Digital
Forensics Research Workshop (DFRWS) formed a working group with the goal
of defining a standardised Common Digital Evidence Storage Format (CDESF) for
storing digital evidence and associated metadata [30].
The Advanced Forensics Format (AFF), recently proposed as a disk image
storage format, includes storage of acquisition related metadata in the same container as
the disk image. Garfinkel et al. describe the AFF and summarise the key characteristics
of nine different forensic file formats. They also outline the desirable characteristics for
an image storage container [45]. They conclude that the AFF is the only publicly
disclosed forensic format which supports storage of arbitrary metadata. The metadata
storage mechanism in the AFF is, however, limited to name/value pairs and makes no
provision for attaching semantics to the name.
EnCase 18 uses a monolithic case file for storing case related metadata and
stores filesystem images in separate and potentially segmented files. The format of the
case file is proprietary.
Turner’s Digital Evidence Bag (DEB) attempts to replicate the key features of
physical evidence bags, which are used for traditional evidence capture. The key
structural components of a physical evidence bag are the bag itself, a means of bag
identification (potentially a serial number), an area for recording evidence related
information (which Turner refers to as a tag), and optionally, a tamper evident security
seal.
The key features of physical evidence bags are categorised as follows:
Evidence Metadata Records: Standard evidence metadata includes a
description of the evidence, and the location, date and time of its acquisition.
18 http://www.guidancesoftware.com/
It is worth noting here that the use of the features listed above varies depending
on jurisdiction.
Turner’s proposal translates a number of aspects of the above features of the
physical evidence bag into the digital realm. A file archive structure is proposed which
defines a specific naming scheme for files containing digital evidence, separate files
containing evidence metadata, and a singular file which contains evidence integrity,
provenance and identification information. Figure 5 depicts the structure of Turner’s
digital evidence bag.
[Figure 5: structure of Turner's digital evidence bag (key: digital evidence files, evidence metadata, tag)]
A DEB is a collection of the Digital Evidence files, Index Files and a single
Tag File. Turner does not detail the implementation of the container grouping these
evidence files; however, we expect that in practice the container layer of the DEB
would be an archive format such as tar or zip.
Individual elements of digital evidence collected (such as filesystem images,
network traces, or the contents of image files) are stored in digital evidence files, which
are identified by a file extension .bagNN. The NN refers to a unique number.
Correspondingly, evidence metadata, such as file last access times, is stored in similarly
named files with the extension .indexNN. The pairing of a single digital evidence file
with its corresponding evidence metadata file is referred to by Turner as an evidence
unit. Turner does not describe the naming of the files other than the extensions defined.
It is unclear as to whether or not multiple pieces of content are stored in a .bagNN file.
Integrity, provenance and identification information are stored as unstructured
text within the tag file, which is identified by the file extension .tag. The tag file also
enumerates the names of all of the Evidence Units.
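To make the naming scheme concrete, the listing below sketches, in Python, how the evidence units of such a DEB might be enumerated by pairing .bagNN files with their .indexNN counterparts. The directory layout and file names are hypothetical illustrations of the convention described above, not part of Turner's proposal.

import os
import re

def evidence_units(deb_dir):
    # Pair each .bagNN evidence file with its .indexNN metadata file.
    # Turner's proposal defines only the extensions, so pairing here is
    # by the NN suffix alone; base names are assumed, not specified.
    bags, indexes = {}, {}
    for name in os.listdir(deb_dir):
        match = re.match(r".*\.bag(\d+)$", name)
        if match:
            bags[match.group(1)] = name
        match = re.match(r".*\.index(\d+)$", name)
        if match:
            indexes[match.group(1)] = name
    return [(bags[n], indexes.get(n)) for n in sorted(bags)]

# Hypothetical bag: image.bag01, image.index01, trace.bag02, trace.index02, case.tag
for evidence_file, metadata_file in evidence_units("deb"):
    print(evidence_file, "->", metadata_file)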
The architecture of Turner’s digital evidence bags is oriented towards a single
monolithic digital evidence bag being used in a case, as a container for all digital
evidence acquired. Secondary evidence (evidence derived from the analysis of earlier
acquired evidence, such as files extracted from a filesystem image) would appear in
this scheme to be added to the same digital evidence bag as the original image. This
involves modification to the tag file and the addition of new files to the evidence bag.
Integrity is assured by the onion-like use of hashing over the contents of the tag file.
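Turner does not specify the hashing construction in detail; one plausible reading of this onion-like scheme is a hash chain in which each revision of the tag file incorporates the hash of the previous revision, so that retrospective modification invalidates every later hash. The sketch below illustrates only that general idea, with function and field names of our own invention.

import hashlib

def seal(previous_seal, new_tag_entry):
    # Chain the previous seal into the new one, onion fashion.
    digest = hashlib.sha256()
    digest.update(previous_seal.encode("utf-8"))
    digest.update(new_tag_entry.encode("utf-8"))
    return digest.hexdigest()

seal_1 = seal("", "unit 01: image.bag01 acquired 2006-01-01")
seal_2 = seal(seal_1, "unit 02: trace.bag02 added 2006-01-02")
# Any later change to the first entry changes seal_1 and, therefore, seal_2.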
A potentially confusing aspect of Turner’s DEB proposal is that modification
of the tag file, and the addition of new files to the DEB may lead the layman to the
conclusion that the monolithic bag is never sealed, thus raising doubts as to the
integrity of the evidence. While this may be seen as little more than an impedance mismatch in
translating the evidence bag metaphor, we suggest an alternate architecture for digital
evidence bags, which is presented in Chapter 6. The architecture we present favours
treating evidence bags as immutable objects. Addition of information is achieved
outside the bag, in much the same way that information is added to the tag of a physical
evidence bag without breaking the tamper evident tape.
Turner’s structure does not define a scheme for referencing of evidence and
metadata between digital evidence bags. Therefore the ability to compose multiple
evidence bags into a corpus is not addressed.
The format and vocabulary of the investigation documentation maintained in
the DEB have no formally defined syntax, data model or semantics. The syntax appears
ad hoc and the vocabulary overly abbreviated. In this context, little attention has been
paid to the nature of the metadata that is being stored, with no consideration given
to the relationship between the metadata and wider case related information, nor to
information found within the digital crime scene.
models of timelines, hypotheses regarding the incident, and supporting evidence. The
bookmark feature of EnCase could be seen as an example of this class of model
concept.
Bogen’s work is significant in that it addresses conceptualising digital
investigations from a number of crosscutting perspectives. While the work proposes the
use of a graphical language for modelling these domains, it stops short of presenting
any actual models, or exploring means for tools to use these models.
A number of authors have proposed the definition of domain specific languages
for the purpose of representing and describing digital investigation related information.
Prior to proposing the Unified Modelling Methodology described above, Bogen and
Dampier proposed that a Computer Forensics Experience Modelling Language
(CFEML) would be of use in modelling of experiences, lessons learned, and knowledge
discovered in the course of an investigation [17]. While the purpose and need for a
CFEML was only proposed in an abstract sense, in the same year Stephenson proposed
the Digital Investigation Process Language (DIPL) [126]. This language, whose syntax
is based on the Common Intrusion Specification Language (CISL) of the IDS field, and
ultimately on LISP S-expressions [113], focuses on modelling the forensic
investigation process and on the entities involved in it with a heavy emphasis on
intrusion response. Using the perspectives defined by Bogen’s UMM, it focuses on the
investigative process view, and domain case view, and aims to be suitable for
describing in a narrative the actions performed in an investigation, and the entities
being acted upon. It is, however, strongly influenced by a network incident response
viewpoint, and lacks a means for ascribing semantics to vocabulary in an extensible
and machine readable way.
Both XIRAF [7] and TULP2G [137] process digital crime scenes into single
document based XML tree representations, operating primarily in the case domain,
whereas DIPL appears to operate more in both the case domain and investigative
process domain. An important contribution of the XIRAF work is in conceptualising
the XML representation of information extracted from the digital crime scene (DCS) as
annotations of particular byte ranges within the DCS, which also implies defining a
composable addressing scheme. This approach maintains a direct linkage between
information and its source, addressing the provenance of the information. Both
approaches avoid describing the semantics of their XML based representations, and
address information integration by tree manipulation.
The alternate approach to persistent, machine readable representations of case
related information is the dynamic generation model. EnCase falls into this category of
approach. We suspect that similar to TSK, which uses C-structure based data models
internally, EnCase employs internal structural models in interpreting digital data. The
only way to interact with or otherwise view these models is, however, through the GUI
representation of this structure, or, to a limited degree, through the use of a custom
scripting language and limited object model.
3.2.3 Observations
A battery powered real time clock (RTC) (also called BIOS or CMOS clock) is
used to keep time while a computer is switched off. While the RTC is used as the basis
for determining time when the computer boots, the interpretation of this time is
operating system specific. For example, the family of Windows operating systems
interpret the RTC as civil time 19 , whereas the Linux operating system may interpret the
RTC as either civil time or UTC by configuration.
Commonly, UNIX operating systems implement a software clock (called the
system clock) by setting a counter from the RTC at boot, and employ a hardware timer
(such as RTC timer interrupts, an advanced programmable interrupt controller (APIC)
19 Civil time refers to the government mandated time in a particular jurisdiction, incorporating region-specific offsets such as daylight saving time.
or other means) as an oscillator. Stevens suggests that all instances of the Windows OS
base their timescale on the RTC throughout operation [128]. There is, however,
evidence to suggest that, similar to UNIX implementations, Windows 2000 and above
employ a software clock rather than using the RTC directly [79, 81, 115].
over time. In addition, implementing the correct local time offsets for civil time is
complicated by changes in region-specific time zones. The importance of this is
illustrated by the flurry of patches related to the Melbourne Commonwealth Games
daylight saving time extension in early 2006 [80].
Clock configuration It is common to see Windows workstations with the time
zone left set to the default installation time zone. Another common configuration error
is a BIOS time that has not been set correctly.
Tampering The practice of setting computer clocks back or forward for
reasons such as evading digital rights management or misdirection of investigation is
often referred to as tampering. Timestamps, like any data, are subject to the possibility
of deliberate modification.
Synchronisation Protocol The Windows time synchronisation protocol is
based on SNTP and is only designed to keep computers synchronised to within 2
seconds in a particular site and 20 seconds within a distributed enterprise. Furthermore,
computers using NTP and SNTP without cryptographic authentication are subject to
protocol based attacks.
Misinterpretation Timestamps are related to a particular frame of reference,
and their correct interpretation requires knowledge of that context. For example, to
interpret the time to which an Internet Explorer timestamp corresponds, in the civil
time where it was generated, one needs to know the time zone offset (a small
illustration follows below). Other sources of uncertainty include whether a timestamp
refers to the start or the end of a particular event, and whether it was generated at the
time of the event or at the time of writing to the event log.
Bugs Software errors in the implementation of software clocks, or in the
algorithms which convert the in-memory clock to a timestamp, can adversely affect
timekeeping accuracy.
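To illustrate the misinterpretation problem, the fragment below renders a single stored UTC timestamp under two different assumed offsets; without knowledge of the correct time zone, the same value yields two different civil times. The timestamp and offsets are arbitrary examples.

from datetime import datetime, timedelta, timezone

# One timestamp, stored by the system in UTC (value chosen for illustration).
stored = datetime(2006, 3, 26, 2, 0, 0, tzinfo=timezone.utc)

# Interpreting it under two candidate offsets yields different civil times.
utc_plus_10 = stored.astimezone(timezone(timedelta(hours=10)))
utc_plus_0 = stored.astimezone(timezone(timedelta(hours=0)))

print(utc_plus_10.isoformat())   # 2006-03-26T12:00:00+10:00
print(utc_plus_0.isoformat())    # 2006-03-26T02:00:00+00:00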
[128]. In this model, a base clock is set to UTC, and subordinate clocks are defined by
skews from parent clocks with additional skews further generated from time drift rates.
Gladyshev and Patel propose using corroborating sources of time to find the
time bounds of events with an unknown time of occurrence by examining ordering
relationships with events with known times. They define both a formalism and an
algorithm for determining these temporal bounds [47].
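The intuition behind this technique can be sketched simply: an event of unknown time that is known to follow some events of known time, and to precede others, is bounded below by the latest of the former and above by the earliest of the latter. The fragment below is our own illustration of that intuition, not Gladyshev and Patel's formalism.

def time_bounds(known_times_before, known_times_after):
    # Events in known_times_before are known to precede the unknown event;
    # events in known_times_after are known to follow it.
    lower = max(known_times_before) if known_times_before else None
    upper = min(known_times_after) if known_times_after else None
    return lower, upper

# The unknown event follows events at t=100 and t=130 and precedes one at t=180.
print(time_bounds([100, 130], [180]))   # (130, 180)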
Weil argues for dynamic analysis of the temporal behaviour of suspect
systems, proposing correlation of timestamps embedded within locally cached web
pages with the modified and accessed times (MAC times) of the cached files [141].
3.3.5 Observations
Computer clocks are inherently unreliable, which casts doubt on the usage of
timestamps in forensic investigations. Methods of post-hoc characterisation of the
behaviour of a particular computer’s clock are of interest in assuring the correct
interpretation of timestamps.
3.4 Conclusion
This chapter has summarised the literature related to event correlation in
forensics, which is the focus of Chapter 5; current approaches to evidence representation
and storage, and to representation of digital investigation related documentation, which
are related to Chapter 6; and reliable interpretation of time, which is related to Chapter 7.
Analysis of the literature as it relates to event correlation and digital evidence
has led to the hypothesis that digital forensics tools would benefit from a formal
approach to representation. The next chapter describes and contextualises the field of
knowledge representation (KR), where we look for inspiration and formalisms with
which to address the representational challenges in digital forensics.
Chapter 4. Digital evidence representation:
addressing the complexity and volume
problems of digital forensics
“If scientific reasoning were limited to the logical processes of
arithmetic, we should not get very far in our understanding of the
physical world. One might as well attempt to grasp the game of poker
entirely by the use of the mathematics of probability.”
(Vannevar Bush)
Analysis of the field of digital forensics has indicated that examining the nature
of the information which it operates on may help address the complexity and volume
problems described in Section 2.4.1. This chapter looks to the field of knowledge
representation (KR) for inspiration, and proposes that a KR based approach to digital
evidence representation will yield benefits in solving these problems. In particular,
Semantic markup languages, which are described in Section 4.3, are employed towards
solving these problems in Chapters 5 and 6.
The chapter is structured as follows. Section 4.1 introduces the representational
challenges involved in digital evidence, describing why the current natural language
based approach to documenting investigations hinders tool interoperability and
potentially introduces errors. Section 4.2 provides background on the field of
knowledge representation, Section 4.2.1 describes its historical foundations, Section
4.2.2 describes key definitions, and Section 4.2.3 describes hybrid approaches to KR.
Section 4.3 describes the synthesis of markup languages and KR which has led to the
current generation of semantic markup languages, the Resource Description
Framework (RDF) and the Web Ontology Language (OWL). Section 4.3.1 introduces
RDF; Section 4.3.2 describes the XML serialisation of RDF, which is intended for
publishing and machine interpretation; and Section 4.3.3 introduces ontology
languages, and OWL. Section 4.4 reviews the literature of digital forensics and
computer security for knowledge representation related themes, and finally, Section 4.5
puts forward the proposition that the field of forensics would benefit from a formal
approach to representing evidence and related investigative information.
4.1 Introduction
The simplest of digital forensics investigations will involve numerous
documentary artefacts as evidence. Examples of these are printouts of data objects
identified as evidence, evidence manifests, and investigation reports. In the course of
investigation, other documents may be kept or produced, including chain of custody
documentation, file notes recording analysis activities and results, and provenance
documentation.
The current state of affairs is that much of the information related to digital
forensics investigations is recorded in documents such as these in natural language. The
vocabulary employed in these documents is drawn from multiple domains: law
enforcement, legal, computing, and general spoken English. This situation is similar
within the digital crime scene (defined previously in Section 2.2.2), where voluminous
amounts of information are stored in free text and semi-structured text form.
Despite much research into Natural Language Processing (NLP), such textual
information is still unsuitable for machines to reason with. For example, consider the
following two trivial sentences:
While the two statements are both syntactically and grammatically valid
English, their meaning is at first glance an oxymoronic state of affairs. Only by treating
the word “pen” as having two meanings in this context, as a fenced area and then as a
writing instrument, can one resolve the spatial contradiction first observed.
Machine interpretation of natural language is complicated not only by the free
form nature of English grammar and syntax, but also by the context dependence of
interpreting semantics of language terms. This dependence on context, and the
additional real world knowledge and reasoning which are required to resolve
ambiguities, are some of the lower level problems in NL understanding. Machine
understanding of natural language remains, today, one of the grand challenges in
computing.
The preoccupation of computer forensics has, until recently, been on the
immediate goals of interpreting binary data. While this is of fundamental importance, it
cannot be forgotten that the function of computer forensics is not only to glean
knowledge from digital evidence, but to communicate and analyse such knowledge in
a rigorous and verifiable manner.
This communication problem is best demonstrated with the following example.
Consider the simple set of evidence depicted in Figure 6. Such a set of evidence may be
presented in a case where inappropriate content of some description is found on a
computing resource. The figure depicts two pieces of physical evidence, which are the
containers of the two sets of digital evidence, the digital crime scene, and a set of
extracted files. The digital crime scene is a bitwise image taken of a hard drive, and the
extracted files are files found within the filesystem of the digital crime scene. The
analysis report, imaging records, and chain of custody, are all regular textual
documents, and the evidence printouts/visual aids are visual printouts of the extracted
files. The blue lines connecting pieces of evidence in the figure indicate where
references must be made from one piece of evidence to another. Red lines indicate a
“part of” relationship: the extracted files are contained on the CD, and are also
contained within the digital crime scene. Finally, the digital crime scene is contained
within the hard drive.
Evaluating this evidence involves verifying that the digital crime scene exactly
matches the crime scene referred to in the imaging records, and verifying that the files
found on the CD are found in the digital crime scene, in the locations described.
Performing these verifications requires human interaction, which is necessary
because of the use of natural language in the Analysis Report and Imaging Records.
The references to the particular hard drive, the names and paths of the files on the CD,
and the hash of the digital crime scene all need to be located by the analyst and
interpreted to refer to the correct artefacts, tools selected, and then employed to perform
the verification.
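The mechanical portion of this task, comparing the hash of the acquired image with the value recorded in the imaging records, is trivially expressed in code; what resists automation is locating and interpreting the recorded hash and the identity of the image within natural language documents. The sketch below, in which the file name and recorded hash are placeholders, covers only the mechanical step.

import hashlib

def image_digest(path, algorithm="md5"):
    # Hash a disk image in chunks so that large images do not exhaust memory.
    digest = hashlib.new(algorithm)
    with open(path, "rb") as image:
        for chunk in iter(lambda: image.read(1024 * 1024), b""):
            digest.update(chunk)
    return digest.hexdigest()

recorded = "9e107d9d372bb6826bd81d3542a419d6"   # transcribed from the imaging records
if image_digest("digital_crime_scene.dd") == recorded:
    print("image matches the value in the imaging records")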
Such a corpus of evidence falls squarely within the purview of media forensics,
the most established area of the field. While such verification actions might seem trivial
to perform at first glance, in practice they are complicated by numerous factors. For
example, if the files were found by file carving, how does one document the location of
the files? How does one validate that those locations form a valid file? What if the
digital crime scene was striped across multiple disks, as in a RAID array? How does
one document the RAID array configuration? What if the investigation is dealing with
thousands of files, and hundreds of digital crime scenes? The basic task of verifying
simple claims becomes under these circumstances a laborious manual exercise, due to
communication problems related to natural language.
This natural language problem may additionally be seen within the digital
crime scene. Event logs have in the past been employed in computing for a number of
purposes, including auditing system activity, recording performance information, and
recording system state for post mortem debugging. As such, they
record information about the computing environment, referring to entities such as hosts,
users, software agents, and activities with an ad-hoc vocabulary, irregular syntax, and
varying naming schemes. While such event log records are more structured than natural
language, interpreting them by machine is arguably as difficult as interpreting natural
language, given the considerable amount of domain knowledge required to infer their
semantics.
Beyond the problems preventing practical machine interpretation of natural
language, further problems confound the use of natural language as a common
language for documenting all aspects of a digital investigation. Producing suitably
complete and precise documentation over the course of the investigation requires
repetitive and methodical attention to detail. As such, it carries with it the threat of
unintentional introduction of errors and the omission of important details.
The concept of knowledge representation has been a persistent one at the centre
of the field of artificial intelligence (AI) since its founding conference in the mid 50’s.
In the early years it was, however, not explicitly recognised as an important issue in its
own right [75]. Early approaches in this period to representing knowledge in “thinking
machines” and automated problem solving are best characterised as ad hoc, with formal
semantics remaining absent. Consider, for example, the language LISP, which was the
mainstay of the AI field at the time. LISP's basic tree-like list data structure, with the
addition of cross links, forms a malleable basis for organising data into hierarchical
and graph based structures. It lacks, however, any foundations for intelligent reasoning;
rather, its foundations are computational. Any intelligent reasoning that may be
embodied in such programs must be implicit in the procedural code of the application.
Knowledge representation emerged as a field in its own right in the mid 60s,
and over the following two decades a number of approaches to knowledge representation
developed, with frames, production systems, and logic based approaches being
the predominant varieties.
The logic based approach takes the view that machine reasoning may be
realized by implementing programs which use the language of mathematical logic.
These approaches share the common approach of representing a domain of interest as a
set of propositions which embody specific information. Knowledge is encoded by
axioms which define logical implications which may be made about the information.
The earliest attempts used first order predicate logic (FOL) as their basis [50],
which was seen as appealing due both to its general expressive power and its well
defined semantics [41]. The use of FOL has persisted since. FOL is, however,
computationally intractable, which led to experimentation with smaller, more tractable
subsets. This led to the PROLOG language, first introduced in 1972.
PROLOG supports declaratively specifying information as symbol-value pairs and
enables axiom definition using a restricted form of FOL. This style of logic
based inferencing, based on the declarative specification of logical rules, has been used to
implement numerous expert systems.
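The flavour of this style of inferencing, declaratively stated facts plus rules that derive new facts, can be conveyed by a small forward chaining sketch; the facts and the single rule below are invented purely for illustration and are written in Python rather than PROLOG.

# Facts: declaratively specified ground assertions, as (predicate, arg1, arg2) triples.
facts = {("parent", "tom", "bob"), ("parent", "bob", "ann")}

def apply_grandparent_rule(facts):
    # Rule: parent(X, Y) and parent(Y, Z) implies grandparent(X, Z).
    derived = set(facts)
    for (p1, x, y1) in facts:
        for (p2, y2, z) in facts:
            if p1 == "parent" and p2 == "parent" and y1 == y2:
                derived.add(("grandparent", x, z))
    return derived

print(apply_grandparent_rule(facts))
# includes ("grandparent", "tom", "ann")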
Logical approaches are criticised for being unable to deal with exceptions to
rules, or to exploit approximate or heuristic models of knowledge. The expression of
meta-knowledge (description of what the knowledge can be used for) is also a
limitation [85]. Nor do such approaches allow for incomplete or contradictory knowledge,
or for subjective or time dependent knowledge [10].
A number of approaches commonly based on unstructured graph based
representations emerged in the early 1960’s and came to be known as semantic net
based representation schemes. The common points to these schemes were a graph
structure representing concepts and instances (or objects) and a set of inference
procedures which operate over these nodes [75]. Three types of edges are defined:
property edges, which assign properties (such as age) to the source concepts, IS-A
edges, which define class/subclass relationships between concepts, and instance
relationships between objects and classes. Semantic nets are criticised for lacking
formal semantics, leaving the meaning of the network to the intuition of the users
and programmers who use these network based representations [10].
At direct odds with the viewpoint of the logic based approach, the Frame based
approach [84] attempts to imitate how the human mind works, drawing its inspiration
from psychology and linguistics:
Whenever one encounters a new situation (or makes a substantial change in one’s
viewpoint) he selects from memory a structure called a frame, a remembered
framework to be adapted to fit reality by changing details as necessary.
A frame is a data structure for representing a stereotyped situation, like being in a
room, or going to a child’s birthday party. Attached to each frame are several kinds of
information. Some of this information is about how to use the frame. Some is about
what one can expect to happen next. Some is about what to do if these expectations are
not confirmed. [84]
Despite early criticism exchanged between the communities surrounding the frame and logic
based approaches to KR, it became apparent that effective machine reasoning would benefit
from hybrid approaches involving the application of multiple theories of intelligent
reasoning within the same representation. Within the various schools of KR,
research was directed towards addressing observed deficiencies. The limited
expressiveness of FOL based approaches has been addressed by proposing hybrid
logics, for example nonmonotonic logics and modal logics [10], and by limiting the
context in which FOL applies.
An awareness of the syntactic problems and undecidable nature of using FOL
as a representation language led to the development of so called Description Logics
(DL). This branch of logic began by attempting to address these problems by adopting
a semantic nets inspired model (which has been shown to be directly translatable to
FOL) and restricting the expressive power of the language to a decidable subset of
FOL. Description Logics model the world as atomic concepts (unary predicates) and
roles (binary predicates), using a small number of epistemologically adequate
constructors to build complex concepts and roles [10].
Focusing on the wider goal of building practical knowledge based systems, the
CYC project (named after the stressed syllable of the word encyclopaedia), embarked
upon in 1984, attempts to implement “the commonsense knowledge of a human being”
[67]. The knowledge representation language employed by CYC, called CYC-L,
addresses the vocabulary and syntax issues of representing
instance based knowledge, and the semantic linkages needed for defining ontologies.
CYC-L is based on FOL, and intelligent reasoning is implemented in the large by first
order logic theorem provers, and in the small, by domain specific micro-reasoners. In
this system, the axioms (rules) representing general theories about the world are
assumed true by default, and where exception based knowledge applies it is limited in
application by context.
The DARPA Knowledge Sharing Effort, initiated circa 1990 [91], researched
means of enabling knowledge sharing between computer systems. Its central theme was
that knowledge sharing required communication between systems, and that this, in turn,
As the focus on the use of the WWW began to shift from information
dissemination to information exchange, numerous parties began to find a need for
publishing machine usable descriptions of collections of distributed information. For
example, Microsoft proposed the Channel Definition Format (CDF) for describing push
based web content, the Platform for Internet Content Selection (PICS) for rating web
content [66], and Netscape proposed the Metadata Content Framework (MCF) for
generally describing metadata content [52]. XML alone proved insufficient for
addressing these needs, especially in the areas of schematic expressiveness and
evolution, and the integration of data from heterogeneous sources. The Resource
Description Framework (RDF) [63] arose out of these efforts.
In the very late 90’s, Berners-Lee, now a leading figure within the standards
body which produced RDF, the World Wide Web Consortium (W3C), began to
enunciate a vision to create a universal medium for sharing information and exchanging
data, which he referred to as the semantic web. A semantic web activity was initiated
within the W3C to pursue this goal, drawing on many lessons learned in building the
WWW and applying them to the task of knowledge representation.
Berners-Lee opines that the centralised, “all knowledge about my thing is
contained here” approach taken by most existing knowledge representation systems is
stifling and unmanageable, and proposes that these shortcomings might be addressed
by adopting a decentralised approach, in much the same way as was employed with
hypertext [16]. A key architectural principle of the web which enabled it to scale
where hypertext failed to scale was the notion that all information did not have to be
published in the same place. The definition of the Uniform Resource Locator (URL), a
hypertext link spanning arbitrary information servers, was the key to enabling
distributed and interconnected documents. The URL forms the foundation of the RDF
approach.
The W3C has standardised a number of technologies towards the goal of the
semantic web. These technologies are layered in a stack like manner, similar to that
observed in networking. The current semantic web stack comprises three logical layers:
a data layer, an ontology layer, and a query layer (presented below in Figure 7).
[Figure 7: the semantic web stack, with the ontology layer (OWL Full, OWL DL, OWL Lite) layered above RDFS]
The RDF is a framework defining a data model based on the directed, labelled
graph (DLG), and can be seen to be influenced by both the semantic nets and frames
KR approaches described earlier. This section presents a basic introduction to the RDF
data model, as it is used as a representational format in Chapters 5 and 6. For a more
comprehensive introduction see [110].
In the DLG model of RDF, graph nodes are either resources (such as things or
entities, i.e. people, places, events, or other) or values (such as numeric values, times, or
other resources). The directed nature of the graph corresponds to a constraint that the
subject, the node from which a graph edge originates, may only be a resource,
while target nodes (nodes at which graph edges terminate) may be either a resource or a
value. Graph edges correspond to properties (or attributes) of the subject. An example
of a simple graphical depiction of a RDF graph is presented in Figure 8 20 .
Figure 9: RDF statement "A person named Kevin Bacon starred in a movie named
'Footloose'"
20 We note that this is not a legal RDF graph, as the identifiers of the nodes and arcs are not legal URIs, but we present it for simplicity of discussion.
relationship 21. We do the same for the person with the name Kevin Bacon
and the class Person.
A fundamental premise of the RDF data model is that everything is named with
a Uniform Resource Identifier (URI) [15]. URIs are a generalised addressing scheme,
of which a subset is the Uniform Resource Locator (URL) which is used to link
together web documents. When modelling data with RDF, concepts, instances,
properties, and even data types are all named using URIs 22 .
The use of URIs enables reuse of common concepts and instances. Returning to
our example, we wish to create an unambiguous identifier for the node whose “name”
property connects to “Kevin Bacon”. In this case, for convenience we turn to a
canonical source of information related to movies, The Internet Movie Database
(IMDB) 23, and use the URL for this actor's details page:
http://www.imdb.com/name/nm0000102/. We do the same for the movie
“Footloose”. The modified RDF graph is presented in Figure 10.
Figure 10: Unambiguous meaning is given to concepts and instances through naming with
URIs
21 This is an abbreviation, so that “rdf:type” is to be interpreted as “the URI for the type predicate drawn from the ‘rdf’ vocabulary”. In practice this means taking the “rdf” namespace URI defined at the top of the document and concatenating it with the predicate, to form an actual URI. The URI for this predicate is thus the URL http://www.w3.org/1999/02/22-rdf-syntax-ns#type.
22 This is not strictly true. Nodes may also not have a name, in which case they are known as blank nodes.
23 http://www.imdb.com/
24 We note here that this namespace does not necessarily need to resolve to a web page. Rather, it is a scope in which to define terminological names used in the particular conceptualisation that we are defining.
separately evolve within an area of expertise, yet be used outside of their original
purview.
RDF supports integration of information published in separate documents by
merging together uniquely named nodes and arcs. Returning again to our example,
suppose we publish the graph representing “A person named Kevin Bacon starred in
the Movie Footloose” in one document, and in a similar document a graph representing
the statement “A person named Sarah Jessica Parker starred in the movie named
‘Footloose’”. How can we combine the information from the two RDF graphs?
In combining our “Kevin Bacon” graph with our “Sarah Jessica Parker” graph,
an RDF implementation will preserve uniqueness of nodes based on their identifiers,
and merge nodes with the same identifiers, leading to the graph in Figure 11:
Figure 11: RDF Graph representing statement “A Person named Kevin Bacon and a
Person named Sarah Jessica Parker starred in the Movie ‘Footloose’.”
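This merging behaviour can be reproduced with any RDF toolkit. The sketch below uses the Python rdflib library, rather than the JENA framework employed later in this dissertation, purely as an illustration; the property URIs follow the example above.

from rdflib import Graph, Literal, Namespace, URIRef

ISIMV = Namespace("http://www.isi.qut.edu.au/Movie/")
footloose = URIRef("http://www.imdb.com/title/tt0087277/")
bacon = URIRef("http://www.imdb.com/name/nm0000102/")
parker = URIRef("http://www.imdb.com/name/nm0000572/")

# Two separately published documents, modelled as two graphs.
g1 = Graph()
g1.add((bacon, ISIMV["name"], Literal("Kevin Bacon")))
g1.add((bacon, ISIMV["starredIn"], footloose))
g1.add((footloose, ISIMV["name"], Literal("Footloose")))

g2 = Graph()
g2.add((parker, ISIMV["name"], Literal("Sarah Jessica Parker")))
g2.add((parker, ISIMV["starredIn"], footloose))

# Merging: nodes with the same URI are identified, so the combined graph has a
# single Footloose node with two incoming starredIn edges.
merged = Graph()
merged += g1
merged += g2
print(len(merged))                      # 5 triples
print(merged.serialize(format="n3"))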
The graph model defined by the RDF, while useful both for description in
terms of model theory, and for visualisation, is not suited for publishing as it is an
abstract graph. For this reason, a serialization of RDF to XML was defined at the time
of the definition of RDF. The serialisation (and subsequent ones) are based on
converting the graphical representation into a set of 3-tuples (or triples), where each
25 Note that for brevity this graph is still not a valid RDF graph, as we have not yet given URI identifiers to the properties used in the graph.
triple is a unique (resource, property, value) tuple corresponding to two nodes joined by
an edge. The triples for the fragment of the graph beginning with the movie would be:
(http://www.imdb.com/title/tt0087277/, http://www.isi.qut.edu.au/Movie/name, "Footloose")
(http://www.imdb.com/title/tt0087277/, rdf:type, http://www.isi.qut.edu.au/Movie/Movie)
These two triples would map to the RDF/XML serialization presented below in
Table 2.
<?xml version="1.0"?>
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" >
  <rdf:Description rdf:about="http://www.imdb.com/title/tt0087277/"
      http://www.isi.qut.edu.au/Movie/name="Footloose">
    <rdf:type rdf:resource="http://www.isi.qut.edu.au/Movie/Movie"/>
  </rdf:Description>
</rdf:RDF>
<?xml version="1.0"?>
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:isimv="http://www.isi.qut.edu.au/Movie/"
         xml:base="http://www.isi.qut.edu.au/Movie/" >
  <rdf:Description rdf:about="http://www.imdb.com/title/tt0087277/"
      isimv:name="Footloose">
    <rdf:type rdf:resource="Movie"/>
  </rdf:Description>
</rdf:RDF>
An alternate and equivalent syntax for the above, which is tailored to declaring
instances based on their Class, is presented in Table 4:
Table 4: Alternative but semantically equivalent RDF syntax tailored to type definition
<?xml version="1.0"?>
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:isimv="http://www.isi.qut.edu.au/Movie/"
         xml:base="http://www.isi.qut.edu.au/Movie/" >
  <isimv:Movie rdf:about="http://www.imdb.com/title/tt0087277/"
      isimv:name="Footloose" />
</rdf:RDF>
Table 5: RDF/XML serialisation of statement “A Person named Kevin Bacon and a Person
named Sarah Jessica Parker starred in the Movie 'Footloose'."
<?xml version="1.0"?>
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:isimv="http://www.isi.qut.edu.au/Movie/"
         xml:base="http://www.isi.qut.edu.au/Movie/" >
The RDF/XML serialization of RDF has received criticism for being too
unwieldy and difficult to read, and for good reason. This has led to numerous efforts to
produce more human usable serializations. One of the earliest of these is the N3
serialization. Table 6 presents the same RDF graph above, serialized to N3 triples. In
this serialization, multiple property-value pairs may be associated with a single
definition of a resource. A number of shorthand terms are defined. In this example, we
see the “a” shorthand for “rdf:type” used.
imdb:name/nm0000102/ isimv:name "Kevin Bacon" ;
    isimv:starredIn imdb:title/tt0087277/ ;
    a isimv:Person .
imdb:name/nm0000572/ isimv:name "Sarah Jessica Parker" ;
    isimv:starredIn imdb:title/tt0087277/ ;
    a isimv:Person .
<?xml version="1.0"?>
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#"
         xmlns:owl="http://www.w3.org/2002/07/owl#"
         xml:base="http://www.isi.qut.edu.au/Movie/"
         xmlns:isimv="http://www.isi.qut.edu.au/Movie/" >
  <owl:Class rdf:ID="Person" />
  <owl:Class rdf:ID="Movie" />
  <owl:DatatypeProperty rdf:ID="name" >
    <rdfs:domain>
      <owl:Class rdf:about="#Person" />
    </rdfs:domain>
  </owl:DatatypeProperty>
  <owl:ObjectProperty rdf:ID="starredIn" >
    <rdfs:domain>
      <owl:Class rdf:about="#Person" />
    </rdfs:domain>
    <rdfs:range>
      <owl:Class rdf:about="#Movie" />
    </rdfs:range>
  </owl:ObjectProperty>
</rdf:RDF>
and 6, there is little published research investigating the wider role of knowledge
representation in computer forensics. Below is a survey of such related work.
Stephenson’s PhD dissertation work [127] focused on describing and
representing actual and ideal digital investigations, and validating digital investigation
processes against formal models. The representational approach taken by Stepenson
was to define a process language called the Digital Investigation Process Language
(DIPL), which employ’s Rivest’s S-Expressions as a syntax, defines a range of
vocabulary related to digital investigations, and finally, represents the digital
investigation as a process. Stephenson’s primary contribution is the proposal and
demonstration of the use of formal mathematical methods to validate actual
investigation procedures, described in the language, against investigative process standards
represented as formal Petri net models. While Stephenson's goal of describing
investigations is a shared motivation with the work described in this dissertation, we
have chosen to adopt a representational approach based on formal knowledge
representation roots. This is due to the insufficiency of S-Expressions in the areas of
schematic expressiveness and evolution, and integration of data from heterogeneous
sources.
Brinson et al. recently proposed a taxonomy scheme characterising the field of
cyber forensics 26 , which they called a Cyber forensics ontology [22]. This approach
takes a liberal interpretation of the meaning of the term ontology, avoiding a “formal
explicit description of concepts” [97] in the cyber forensics domain. The model
proposed organises concepts from the cyber forensics domain into a hierarchy; however
the relationships between concepts appear to be neither consistent nor specified.
Regardless of these omissions, this work could form a useful starting point for
developing formal ontologies related to cyber forensics.
Slay and Schulz [119] have, since the research described in this dissertation was
completed, employed ontologies as a means of describing a specific conceptualisation
of files and suspicious media content in a computer forensics application. Their
conceptualisation contains various categories of media files, and the idea of
suspiciousness, and is employed by a filesystem search application. While the work
concerns itself with categorising files as suspicious based on properties and
relationships of files (i.e. file size and topological proximity to other files), their
ontology does not attempt to encompass these concepts, presumably leaving these
concerns to implementation in code.
26 The term “cyber forensics” is not defined in this paper. On examination, the authors appear to be referring to the digital forensics field.
desirable in order that usability is not hampered by debate over terminology and
conceptual granularity.
It is well known that formal specification of systems aids implementation and
correctness. For example, formal specification of software has led to significant
outcomes in producing provably correct software. We propose that the field of
computer forensics would similarly benefit from a formal approach, but in this case a
formal approach to representing knowledge about investigations, and information
within the digital crime scene. Such a formal approach would form a middle ground
between machine and human understanding, by adopting a common language with
extensible vocabulary, clearly defined semantics, and a regular syntax. We summarise
below the attributes of the proposed approach in the computer forensics context, by
relating the advantages of modelling information using RDF/OWL identified by
Reynolds et al. in [110].
Integration of arbitrary information: The representation should employ a
simple and consistent model of data, in combination with a globally unique naming
scheme, in order that separately documented information may be easily combined into
a consistent and larger whole. Such a model of data and naming scheme will enable a
corpus of forensic evidence to be decomposed and composed. The benefits of this
relate to the volume problem, by enabling the sharing of evidence in small pieces and by
enabling scalable approaches to processing evidence (due to the elimination of large
shared resources such as databases as a container of information). The naming scheme
should enable arbitrary information to be expressed and arbitrary vocabulary terms to
be created, and in addition, enable reuse of existing vocabulary terms. Related to the
complexity problem, such a representation would enable addition of new types of
information to a corpus of evidence without need to modify existing tools.
Support for semi-structured data: The representation should allow
information to be represented in the data model without need for considering or
deciding upon a particular conceptual model a priori, in order that information may be
rapidly integrated, without becoming bogged down in issues of semantics. At a later
point semantics may be attached through relating entities to elements of an ontology.
The complexity problem is addressed by rapidly enabling integration of new and
arbitrary information.
Extensibility and resilience to change: The complexity problem in computer
forensics indicates that forensics tools must address ever increasing complexity. A
tool that exhibits backwards compatibility would, in light of this complexity, retain the
ability to interpret prior generations of information and models, despite changing
definitions of terminology over time. A representation that exhibits forwards
compatibility should be evolvable: existing tools should remain able to interpret newer
generations of information expressed in the representation. For example, if a new
storage technology is developed, an imaging application which operates on this kind of
storage may record further information related to the source of evidence. Existing tools
which operate over images must still be able to interpret the image despite the presence
of the new information.
Classification and Inference: Such a representation should enable describing
the world not only by names, but by relationships between entities, and inclusion in
classes of things. Such a representation should additionally be conducive to inferring
new knowledge based on existing knowledge regarding a concept’s relationships.
Provenance: A representation should enable the ability not only to express
information, but also to express information about where the information came from.
This, in particular, is important in the forensics context, where any facts identified must
be substantiated by evidence. This need for substantiation is a considerable burden in
computer forensics given the amount of natural language currently required to describe
these provenance issues.
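As a small illustration of this attribute, the fragment below records not only an extracted fact but also the image from which it was derived and the tool used, employing hypothetical vocabulary terms (extractedFrom, toolUsed) rather than any existing standard.

from rdflib import Graph, Literal, Namespace, URIRef

EX = Namespace("http://example.org/forensics/")   # hypothetical vocabulary
g = Graph()

extracted_file = URIRef("http://example.org/case1/file/42")
source_image = URIRef("http://example.org/case1/image/hdd1")

g.add((extracted_file, EX["fileName"], Literal("plans.doc")))
# Provenance: the extracted file is linked back to the image it was found in.
g.add((extracted_file, EX["extractedFrom"], source_image))
g.add((extracted_file, EX["toolUsed"], Literal("carving tool v1.0")))

print(g.serialize(format="n3"))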
The research described in Chapters 5 and 6 applies this proposed approach to
reducing the volume and complexity of events sourced from computer and network
event logs, and to easing the construction of corpora of digital evidence.
4.6 Conclusion
In Section 4.1 the motivation for knowledge representations was presented in
the context of documenting digital investigations. Section 4.2 described in broad brush
strokes the history of the field of knowledge representation, discussing in turn the goals
of the field. In short, the field attempts to answer the questions “How can we
express what we know?” and “How can we reason with what we express?”
A key theme which comes through on analysing the field is its preoccupation
with reasoning. Indeed, today the field tends to refer to itself as Knowledge
Representation & Reasoning, rather than simply Knowledge Representation. This
reflects a realisation that, when addressing the goals of artificial intelligence, both the
representational formalism and the model of intelligent reasoning impact factors such as
expressiveness, computational tractability and pragmatic usefulness. The field remains
an active research area today.
Section 4.3 described recent standardisation efforts on semantic information
markup, then indicated areas where the knowledge representation field has influenced
these efforts. Such efforts have multiple stakeholders advancing their varied research
and development agendas. In turn these vary from addressing the more lofty AI related
ambitions of a globally published knowledge base, to the more pragmatic, “soft AI”
goals of publishing information in a manner that it may be unambiguously interpreted
and further intermixed.
Section 4.4 described KR related work in the IT security and forensics fields,
and concluded that representation has to date been an implicit subject in forensics.
Section 4.5 puts forward the proposition that the field of forensics would
benefit from a formal approach to representation, both in documenting investigations
and in automating reasoning about evidence. The section summarises the scientific
premise motivating much of the work described in this thesis.
The next chapter investigates whether semantic web KR formalisms are
suitable for use as the basis for developing DF analysis tools.
Chapter 5. Event representation in forensic
event correlation
“Nature herself cannot err, because she makes no statements. It is
men who may fall into error, when they formulate propositions”
(Bertrand Russell)
Schatz, B., Mohay, G. and Clark, A. (2004) 'Rich Event Representation for Computer
Forensics', Proceedings of the 2004 Asia Pacific Industrial Engineering and
Management Systems (APIEMS 2004), Gold Coast, Australia.
correlation rules, correlation rule parser, event browser and the JENA semantic web
framework (see Figure 12).
Raw event logs are parsed into RDF event instances by the generic log parser
and inserted into the knowledge base. Correlation rules (expressed in a language called
FR3, which we describe later) are parsed into the native format of JENA, and applied to
event instances by the JENA inference engine. Investigators may interact with the
knowledge base containing the events and entity information using the event browser.
The components implementing the architecture are described in detail in the
following sections.
5.3 Implementation
5.3.1 The design of the event representation
Our ontology is rooted in two base classes, an Entity class that represents
tangible “objects” in our world, and an Event class that represents changes in state over
time.
At the time of this work, some guidance existed towards modelling time in the
form of the OWL-S time ontology [99]. This work, performed in the context of agent
oriented computing, was related to describing the time of occurrence of events both as
instants and durations, and the topological relationships related to this model of time.
This model of time however had no implementation of a temporal reasoner. For this
reason, we adopted a simple instant based model of time that assumes basic events to
happen at an instant of time. Our basic temporal ordering property is thus supported by
reasoning over the startTime owl:DatatypeProperty of the Event class. We assume a
<owl:ObjectProperty rdf:ID="causality">
<rdfs:range rdf:resource="#Event"/>
<rdfs:domain rdf:resource="#Event"/>
</owl:ObjectProperty>
Composite events are implemented by creating a new event of a more abstract,
generalised concept when an event pattern is successfully matched and a correlation
rule fires. For example, we define a DownloadExecutionEvent. This class is a
composite event; its semantics are that a user has executed content that was
previously downloaded, for example with a web browser or as an email attachment.
This event composes lower level events: a FileReceiveEvent and a subsequent
ExecutionEvent. The inter-instance and class/subclass relationships are depicted
below in Figure 13.
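To make the composition step concrete, the following Python sketch pairs a FileReceiveEvent with a subsequent ExecutionEvent over the same content to create a DownloadExecutionEvent. It is illustrative only; the prototype performs this matching with FR3 rules over the JENA knowledge base, and the field names used here are simplified assumptions.

from datetime import datetime, timedelta

# Simplified event records; in the prototype these are RDF/OWL instances in the
# knowledge base (the field names used here are illustrative assumptions).
events = [
    {"type": "FileReceiveEvent", "content": "nastytool.exe",
     "startTime": datetime(2002, 3, 4, 20, 31)},
    {"type": "ExecutionEvent", "content": "nastytool.exe",
     "startTime": datetime(2002, 3, 4, 20, 40)},
]

def correlate_download_execution(events, window=timedelta(hours=1)):
    """Create a DownloadExecutionEvent whenever received content is later executed."""
    composites = []
    received = [e for e in events if e["type"] == "FileReceiveEvent"]
    executed = [e for e in events if e["type"] == "ExecutionEvent"]
    for recv in received:
        for run in executed:
            gap = run["startTime"] - recv["startTime"]
            if recv["content"] == run["content"] and timedelta(0) < gap <= window:
                composites.append({
                    "type": "DownloadExecutionEvent",
                    "startTime": recv["startTime"],
                    "causality": [recv, run],  # links back to the lower level events
                })
    return composites

print(correlate_download_execution(events))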
<Win32ConsoleLoginEvent rdf:ID="loginInstance1">
<startTime rdf:datatype="http://www.w3.org/2001/XMLSchema#dateTime"
>2002-03-04T20:30:00Z</startTime>
<user rdf:resource="#user1" />
<host rdf:resource="#host1" />
</Win32ConsoleLoginEvent>
<Win32DomainAccount rdf:ID="user1">
<userName>jbloggs</userName>
<domain>DSTO</domain>
</Win32DomainAccount>
<Win32Host rdf:ID="host1">
<hostName>s3</hostName>
</Win32Host>
Notable points here are the usage of XML Schema (XSD) for encoding the
time of the event in the representation, and the use of RDF resource references to link
the event with instances of entities representing the user and the host. The lack of
precision caused by the omission of a year in syslog based timestamps is addressed by
a declarative feature at the event parser layer to skew time data to the correct year. This
may be used to address timing irregularities where the irregularity is quantifiable and
regular.
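A sketch of how such a declarative correction might operate at the parser layer follows. The syslog line and the form of the correction table are illustrative assumptions rather than the prototype's actual parser specification syntax.

from datetime import datetime

# Declarative corrections declared per log source: syslog omits the year, so the
# parser is told which year to assume (an illustrative mechanism, not FORE syntax).
SOURCE_CORRECTIONS = {"hostA-syslog": {"assume_year": 2004}}

def parse_syslog_timestamp(line, source):
    """Parse a year-less syslog timestamp and apply the declared year correction."""
    # e.g. "Mar  4 20:30:00 hostA sshd[123]: Accepted password for jbloggs"
    stamp = " ".join(line.split()[:3])
    parsed = datetime.strptime(stamp, "%b %d %H:%M:%S")
    year = SOURCE_CORRECTIONS[source]["assume_year"]
    return parsed.replace(year=year)

print(parse_syslog_timestamp(
    "Mar  4 20:30:00 hostA sshd[123]: Accepted password for jbloggs",
    "hostA-syslog"))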
A further simplification is that we assume the integrity of the event sources has
not been compromised.
Namespace support
FR3 has specific support for XML namespaces and resource identifiers.
Resource identifiers follow the OWL standard of URIs [15]. Namespaces are declared in a
clause of the form abbreviation := namespace, as follows:
fore :="http://www.isrc.qut.edu.au/fore#".
The usage of namespaces (along with a number of OWL features we will not
discuss here) enables integration of concepts from separate ontologies. The fore
namespace declaration resolves the concepts used in FR3 rules to concepts specified in
the FORE forensic ontology. Similarly, an RDF namespace declaration enables FR3
rules to reason over type information.
Object shorthand
This is a convenient form of syntax which, in the head of a rule, enables the
assignment of values to the properties of an object without repeated mention of the
object. The equivalent form of these clauses in JENA's rule language would be:
object.property = value;
object.property2 = value2;
In the tail of a rule this form is interpreted as an equality test, whereas in the
head of a rule it is interpreted as variable assignment.
Heuristic rules
The antecedents of a rule (also known as the tail of the rule) must occur in the
knowledge base for the IF part of the rule to be satisfied. Variables are introduced by
including a question mark at the beginning of an identifier.
Reasoning
The JENA toolkit is employed as the knowledge base, RDF/OWL parser, and
reasoner. We use the RETE [42] based forward chaining reasoning engine to
implement our rule language. The RETE algorithm is a speed-efficient but space-expensive
pattern-matching algorithm with a long history of use in expert systems and
rule languages.
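The RETE network itself is too involved to reproduce here, but a naive forward chaining loop conveys the evaluation model that the rule engine implements. The sketch below is illustrative Python; the rule shown (relating a user to a host named in the same event) is a hypothetical example, not one of the FR3 rules.

def forward_chain(facts, rules):
    """Naive forward chaining: fire rules until a fixed point is reached (not RETE)."""
    facts = set(facts)
    while True:
        new = set()
        for rule in rules:
            new |= rule(facts) - facts
        if not new:
            return facts
        facts |= new

# Hypothetical rule: if an event names both a user and a host,
# assert that the user was active on that host.
def user_active_on_host(facts):
    users = {(e, u) for (e, p, u) in facts if p == "user"}
    hosts = {(e, h) for (e, p, h) in facts if p == "host"}
    return {(u, "activeOn", h)
            for (e1, u) in users for (e2, h) in hosts if e1 == e2}

facts = {("login1", "user", "jbloggs"), ("login1", "host", "s3")}
print(forward_chain(facts, [user_active_on_host]))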
At the time of this work, JENA's OWL implementation did not support rules
matching on all subtypes of an abstract type. For example, a rule matching events of
class LoginEvent would not fire when a Win32TerminalLoginEvent was added to the
knowledge base. The machinery of JENA that implemented the semantics of OWL type
hierarchy inference relied on a hybrid implementation involving both forward (RETE)
and backward chaining (Prolog-like) reasoners. The inferred types of a
Win32TerminalLoginEvent were not available as facts to the forward chaining rule
engine, as they were only computed backwards in response to a query. The OWL
implementation of JENA was modified to pre-compute the type hierarchy information
using the RETE engine so that these facts were available.
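The essence of that modification is to materialise the subclass closure as explicit rdf:type facts before the forward rules run. The following sketch illustrates the idea only; the class names are drawn from the prototype ontology, while the code itself is illustrative rather than the JENA implementation.

# Asserted subclass relationships (a subset of the prototype event ontology).
subclass_of = {
    "Win32TerminalLoginEvent": "LoginEvent",
    "Win32ConsoleLoginEvent": "LoginEvent",
    "LoginEvent": "Event",
}

def supertypes(cls):
    """All ancestors of a class, following subClassOf transitively."""
    ancestors = []
    while cls in subclass_of:
        cls = subclass_of[cls]
        ancestors.append(cls)
    return ancestors

def materialise_types(facts):
    """Add an rdf:type fact for every inferred supertype of each typed instance."""
    inferred = set(facts)
    for (subj, pred, obj) in facts:
        if pred == "rdf:type":
            for ancestor in supertypes(obj):
                inferred.add((subj, "rdf:type", ancestor))
    return inferred

print(materialise_types({("login1", "rdf:type", "Win32TerminalLoginEvent")}))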
All of the event classes mentioned refer to concepts defined in the document
identified in the fore: namespace declaration.
Event browser
The GUI event browser provides a number of methods of interacting with the
events in the knowledge base. Two views form the basis of the user interface, the event
causality view, and the entity view.
The event causality view provides a display for all events matching a certain
context, and displaying the properties of each event in a drill down manner. It further
provides means to drill down, following the causal ancestry of a sequence of events.
We implement a simple query interface for finding sets of event instances based on
type and property values.
The entity view presents all entities identified in the event base, along with
their properties. Entities selected in this view may be used as the basis of a query of all
related events. The Entity View provides an operation enabling the investigator to
hypothesise an identity equivalence relationship between otherwise distinct instances.
This is discussed in Section 5.4.1.
known value. The various heterogeneous event logs from which each event is sourced
are identified in parentheses.
In this scenario, a user either noticing the server rebooting, or a user being
unable to log in as Administrator would most likely alert the investigator. While it is
likely the attacker would cover their tracks by deleting the event log, one still can
envisage finding the log entries either via forensic analysis of the disk where the event
log is located, or from a secured log host.
This scenario sourced events from the following security log types:
We are not focusing on cross-domain correlation, so will not address the door
log in this case.
Meaning: Match an event instance of class Win32RebootEvent with an event instance of class
DownloadExecutionEvent that occurs before it.
If the user which caused these events is not the Administrator user, and the latter
two events occurred within 10 minutes of each other, create a new OSExploitEvent,
and link its causality property to the DownloadExecutionEvent.
This method of investigation will only succeed in cases where the name of the
client computer in the Apache web server logs is the same as the name of the
computer in the related Windows security log events. This is often not the case, as
Windows uses a host name in security log events, whereas Apache, by default, uses IP
addresses. We discuss below a method of addressing this shortcoming.
between surrogates. For example, one might hypothesise that separate individual entities
represent the same individual.
Consider the following unrelated sets of correlated events in Figure 15. The
previous web server related rule in Table 8 would not correlate the
ApacheWebFileDownloadEvent with the Win32LoginSessionEvent, as the surrogate
hosts (the client in the web log and the host in the LoginSessionEvent) are not the same.
Similarly, in an unrelated scenario where we were interested in correlating remote shell
sessions with the activities on a client computer, the SSHPasswordAuthenticationEvent
would not correlate with the Win32ProcessCreationEvent that executed the SSH client,
putty.exe.
If the investigator looks in the entities view of the event browser he will see all
of the individual entities that have been identified from the event logs, including the
Host with IP address 131.181.6.167 and another host with name “DSTO”. Through
examining DNS logs, or by other means, the investigator may hypothesise that the Host
“DSTO” and the Host with IP address 131.181.6.167 are in fact the same host. The
investigator can select the two entities and invoke the sameAs operation on the two.
The individual entities representing the two Hosts are now treated by the OWL
implementation as one single entity, a single instance of Host that combines all
properties of the prior two.
The sameAs operation relies on the underlying semantics of the OWL
individual equivalence mechanism, owl:sameAs. This language feature may be used to
state that seemingly different individuals (or classes) are actually the same. This single
(now merged) individual will now suffice to fire the WebFileDownload-LoginSession
causality rule discussed previously, and causally correlate the
ApacheWebFileDownloadEvent to the Win32LoginSessionEvent, via a different rule as
shown in Figure 16. Additional rules may now correlate the connection initiated by the
“putty” SSH client to another UNIX host.
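The merging behaviour itself is provided by the OWL reasoner's owl:sameAs semantics. The following Python sketch approximates the observable effect outside of a reasoner, mapping each surrogate to a canonical individual whose properties are the union of those asserted against either surrogate; it illustrates the semantics rather than the prototype's implementation.

class SurrogateStore:
    """Approximates owl:sameAs by mapping surrogates onto a canonical individual."""

    def __init__(self):
        self.canonical = {}   # surrogate id -> canonical id
        self.properties = {}  # canonical id -> merged property dict

    def canonical_id(self, entity):
        return self.canonical.get(entity, entity)

    def assert_properties(self, entity, props):
        self.properties.setdefault(self.canonical_id(entity), {}).update(props)

    def same_as(self, a, b):
        """Merge two surrogates into one individual carrying the properties of both."""
        ra, rb = self.canonical_id(a), self.canonical_id(b)
        if ra == rb:
            return
        self.properties.setdefault(ra, {}).update(self.properties.pop(rb, {}))
        for surrogate, root in list(self.canonical.items()):
            if root == rb:
                self.canonical[surrogate] = ra
        self.canonical[rb] = ra

store = SurrogateStore()
store.assert_properties("host_DSTO", {"hostName": "DSTO"})
store.assert_properties("host_131.181.6.167", {"ipAddress": "131.181.6.167"})
store.same_as("host_DSTO", "host_131.181.6.167")
print(store.properties[store.canonical_id("host_131.181.6.167")])
# {'hostName': 'DSTO', 'ipAddress': '131.181.6.167'}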
With these causal links correlated, the rules for the OSExploitEvent will now
be satisfied, driving the creation of an OSExploitEvent and its subsequent display in
the event view of the user interface.
is a bootable image file that, in this case, contains a utility to wipe the Windows
administrator password, resetting it to a known value.
Both the ECF and FORE approaches support querying event data in the
exploration of a hypothesis. FORE differs from ECF in that the underlying event base,
in the case of FORE, contains many linkages that have been inferred by event
correlation rules at the time events are loaded into the system. These causal linkages
enable an investigator to explore simple relationships without manual inference. ECF
however, contains none of these causal linkages. Following causal linkage in ECF
involves the operator inferring the linkages, and expressing the conditions required in
SQL queries. For example, for the scenario explored previously, the investigator would
have to write a series of SQL queries to successively narrow the set of events in
question. Investigation using ECF thus requires considerably more human inference
than using FORE.
FORE adds investigation features not found in ECF. In addition to providing a
general purpose KR framework and a complementary event correlation language, our
approach introduces the ability to hypothetically unify entities of equivalent
identity, further enhancing the effectiveness of existing rule based correlation
27 http://www.sap.com/
4. P logs into the SAP application as Q, which fails (SAP Security Audit
Log)
5. P logs off the Windows workstation (Win32 System Log)
Detection of this scenario could indicate a user mistyping their username or
password. It could, however, also indicate a user attempting (or succeeding) to log in
as another user, or to an account which they are not authorised to use. Persistent
recurrence of this event could indicate the user methodically guessing the
password of another user.
<nerd:Host>
<nerd:hasInterface>
<nerd:Interface>
<nerd:hasIPSetup>
<nerd:IPSetup>
<nerd:hasIPAddress>
<nerd:IPAddress>
<nerd:ipaddress
>131.181.6.3</nerd:ipaddress>
</nerd:IPAddress>
</nerd:hasIPAddress>
</nerd:IPSetup>
</nerd:hasIPSetup>
</nerd:Interface>
</nerd:hasInterface>
</nerd:Host>
This introduces many more entities into the system per log entry, which could
quickly overload the information conveyed in the entity view. In response to this, we
modified the presentation layer of our GUI to only present the outermost enclosing
instance, with the child properties connected to anonymous instances represented as
path elements. For example, in our entity viewer, we would represent the Host above
as:
[hasInterface.hasIPSetup.hasIPaddress.ipaddress=131.181.6.3]
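A sketch of the flattening performed by the presentation layer is given below; the nested dictionary stands in for the anonymous RDF instances, and the code is illustrative rather than the prototype's GUI code.

def path_elements(instance, prefix=""):
    """Flatten nested anonymous instances into dotted property-path strings."""
    paths = []
    for prop, value in instance.items():
        key = f"{prefix}.{prop}" if prefix else prop
        if isinstance(value, dict):
            paths.extend(path_elements(value, key))
        else:
            paths.append(f"{key}={value}")
    return paths

host = {"hasInterface": {"hasIPSetup": {"hasIPAddress": {"ipaddress": "131.181.6.3"}}}}
print("[" + ", ".join(path_elements(host)) + "]")
# -> [hasInterface.hasIPSetup.hasIPAddress.ipaddress=131.181.6.3]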
The door log entries contain the date, time, card id, name of assigned owner,
the door name, and the zone. In our case, the door is named by both the room it controls
access to and the building containing the room. Integrating this knowledge into our
prototype first involves identifying the concepts implicit in the event log data, and then
determining an appropriate place for the concepts in our ontologies.
As we wish to represent Rooms and Buildings, we hook our Room concept in
by inheriting from the SOUPA class SpacedInAFixedStructure. Similarly, we inherit
Building from FixedStructure. We hooked a DoorEvent into our existing ad hoc event
ontology by inheriting it from our existing Event class. We next wrote an event parser
specification specific to the door logs, which matches the door log syntax and declares
the OWL instances necessary to represent a door entry. Below we present an
example door log event, as created by the parser:
<fore:DoorEvent>
<fore:building>
<fore:Building rdf:ID="building1">
<spc:name>GP. S BLOCK</spc:name>
</fore:Building>
</fore:building>
<fore:room>
<fore:Room rdf:ID="room0">
<spc:name>GP. S BLOCK RM S826A</spc:name>
<spc:spatiallySubsumedBy>
<fore:Building rdf:about="#building1"/>
</spc:spatiallySubsumedBy>
</fore:Room>
</fore:room>
<fore:user>
<fore:DoorSwipeCard rdf:ID="doorcard1">
<fore:cardID>42281</fore:cardID>
<fore:name>RICCO LEE</fore:name>
</fore:DoorSwipeCard>
</fore:user>
<fore:startTime rdf:datatype="http://www.w3.org/2001/XMLSchema#dateTime"
>2004-03-04T20:30:00Z</fore:startTime>
</fore:DoorEvent>
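The parser specification itself is declarative; the following Python sketch conveys the equivalent procedural behaviour for a single door log line. The line layout assumed here is hypothetical, as is the simplified output structure (the prototype emits OWL instances such as the one above).

import re
from datetime import datetime

# Hypothetical door log line layout: date, time, card id, card holder, door, zone.
LINE = "04/03/2004 20:30:00 42281 RICCO LEE,GP. S BLOCK RM S826A,ZONE 3"
PATTERN = re.compile(
    r"(?P<date>\S+) (?P<time>\S+) (?P<card>\d+) (?P<name>[^,]+),(?P<door>[^,]+),(?P<zone>.+)")

def parse_door_event(line):
    """Match the (assumed) door log syntax and build a simplified DoorEvent record."""
    fields = PATTERN.match(line).groupdict()
    return {
        "type": "DoorEvent",
        "startTime": datetime.strptime(
            fields["date"] + " " + fields["time"], "%d/%m/%Y %H:%M:%S"),
        "room": fields["door"],
        "card": {"cardID": fields["card"], "name": fields["name"]},
    }

print(parse_door_event(LINE))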
SAP Security Audit Logs record, among other things, the success or failure of
logins to SAP, along with the date and time of the event, and the host (or in SAP
terminology, terminal) that the user attempted to login from. Addition of SAP related
events specific to our scenario required the addition of the following new concepts to
our ontology (presented in Table 10):
Class: Meaning
ServiceAuthenticationEvent: Authentication of a user to a resource; specifically, a resource that is a service.
SAPAuthenticationEvent: Authentication of a user by SAP (login success or failure). Inherits ServiceAuthenticationEvent.
SAPClientLoginSuccessEvent: Successful login to SAP. Inherits SAPAuthenticationEvent.
SAPClientLoginFailureEvent: Unsuccessful login to SAP. Inherits SAPAuthenticationEvent.
SAPClientProcessCreationEvent: The SAP client program has been run on a client terminal.
IdentityMasqueradeEvent: Multiple login names have been used to access a service from the context of a single login account.
higher level abstraction which represents a user’s interactive login session on a host. In
Table 11 we present a correlation rule in our language FR3, which detects instances of
this scenario:
views which limit the set of events displayed to higher level concepts closer to the
concerns and vocabulary of the investigator. The user interface enables the investigator
to “drill down” to the events which caused it. In this example, the
MultipleIdentitiesUsedEvent has causal links to the LoginSessionEvent and the
SAPAuthenticationEvent that triggered its creation.
In Figure 17 we present a graph of events that correspond to the scenario,
which can be explored by an investigator using the drill-down feature of the interface.
The causal relationships correlated by the rules above are presented in bold.
Other links are correlated by rules not presented here.
[Figure 17: event graph for the scenario. An IdentityMasqueradeEvent is linked to a
SAPClientProcessCreationEvent (user=P, host=F) and a SAPClientLoginSuccessEvent
(user=Q, terminal=F), which are in turn linked to a LoginSessionEvent (user=P), its
constituent TerminalLoginEvent and TerminalLogoutEvent (user=P), and a DoorEvent
(user=P).]
In our test environment, like many real world deployments of SAP, the SAP
username is not necessarily the same as the OS username for the same user. The
preceding rule presented in Table 11 resulted in many false positives, as the test for
inequality fires the rule for minor differences in username. For example, “jsmith” and
“j.smith” are treated as separate users.
In this case the surrogate proliferation problem creates false positives. To
resolve this, we explicitly select the users in question and indicate that they should be
treated as representing the same thing, again using the sameAs functionality provided
by the OWL semantics. As a result, MultipleIdentitiesUsedEvents based on this kind of
identity failure are removed from the knowledge base and event viewer. This approach
to hypothetically resolving identity between a user identified from a door log and a
user identified in a login similarly allowed us to causally correlate door logs with
logins to computers.
5.6 Conclusion
The FORE prototype holds the promise of collaborative development of
correlation rules that correlate events across and within domains, reducing the amount
of manual inference and query tasks, and assisting in interactive investigation. At a
higher level, we have demonstrated that correlation rules can automatically correlate
whole forensic scenarios without interactive investigation by human operators.
The four contributions of this chapter are aligned with themes of representation
and analysis techniques.
Firstly, the work investigates whether the RDF/OWL formalism is a useful
general representation upon which a digital forensics application requiring a wide
representational scope might be built. An experimental result of case study 1 (Section
5.4) is that RDF/OWL is a useful formalism for representing low level
computer security and systems related events, composite and abstract events related to
higher order suspicious situations, and the entities referred to in those events. The instance
model of RDF enabled the definition of a surrogate per event and entity, and the class
based model of abstraction enabled ascribing semantics to these event and entity
instances. The experiment demonstrates that the representation is of use in addressing
the complexity problem by enabling integration of arbitrary information from various
computer security and systems event logs.
The RDF/OWL representation is not, however, sufficiently expressive
to describe heuristic knowledge involving complex relationships with temporal
constraints, instance matching, and the declaration of new property values or new
instances. This necessitated shifting outside the knowledge representation and
employing a rule language (FR3) for these purposes.
In case study 2 (Section 5.5) we have demonstrated that the representation is
extensible and generalisable to support reasoning across multiple heterogeneous
domains. We do so by successfully applying the prototype to a forensic scenario that
involves both ERP security transaction logs, and door logs, in addition to computer
security logs such as those which we have considered in our previous efforts.
Furthermore, we demonstrate that our approach can scale, by supporting the separate
development and subsequent integration of domain models, event parsers, and
correlation rules, by experts in their respective domains.
In this case, however, we addressed integration of information with differing
ontological commitments, by integrating information modelled by an existing network
intrusion detection related ontology, in addition to events sourced from the enterprise
resource planning system, SAP. The extensibility of the representational approach was
demonstrated by the ease with which an existing domain model was integrated into our
prior prototype.
The RDF/OWL language alone was not, however, expressive enough to
provide the language tools to address the areas of impedance mismatch between our
prior (ad-hoc) ontology and the intrusion related ontology. In this case the mismatch
was resolved by modifying our existing ontology and heuristics to operate at the same
level of granularity and commitments of the new ontology. An alternate approach
would have been to adopt rules to bridge across these mismatches.
In practice, the approach and implementation described carried with it an
ontological commitment which focused on modelling of situations and entities,
simplifying the subtle relationships between events and their occurrent time. This
simplified model of time carries with it the assumption that all of the clocks on the
separate machines are synchronised. While network time infrastructure such as NTP
facilitates synchronisation of computer clocks down to the millisecond, we expect that
in practice all but the simplest of forensic investigations will involve multiple computer
time sources in various states of de-synchronisation. Further work is required in
adapting event correlation techniques to work with models of time which incorporate
notions of multiple independent timelines, hypothetical specifications of clock
timescale behaviour, and automated methods for identifying the temporal behaviour of
computer clocks from event logs. This last theme is investigated further in Chapter 7.
An additional time related simplification is the embedded assumption that the
attribute values of entities remain invariant over time, whereas in reality attribute values
vary over time. For example, it may be broadly true that a particular person's name
remains the same over the period of their life; this assumption fails to hold, however,
when one considers events such as marriage, or an officially sanctioned name change via
deed poll. Models of entity attribute values which account for different values over
time require further investigation.
Another limitation of the prototype described here is that it eschews
maintenance of provenance information. The parser, in its current state, does not record
the source of the event instances that it generates. Further, the rule engine does not
record which successful rule firings led to which new inferred composite events.
Documentation of both of these is important, as any automated conclusions must be
verifiable and traceable back to the original evidence.
The second contribution of this chapter is the demonstration of a novel analysis
technique for automated detection of a computer forensic situation, based upon
information automatically derived from digital event logs. We present a heuristic rule
based approach that has the ability to manage the scalability and semantic issues arising
in such inter-domain forensics.
Such rule based approaches have a number of shortcomings. While abstraction
goes some way towards reducing the number of rules required for automated detection,
rules must still be authored by experts. Research into automated means of identifying
potential rules and associations is warranted; approaches such as data mining hold
promise. Furthermore, rules are by nature crisp in their definition, precluding
incorporation of fuzzy concepts. For example, in the OSExploit detection rules, we
implied a causal relationship by requiring that the Win32RebootEvent and LoginEvent
be within 10 minutes of each other, under the hypothesis that an attacker would operate
quickly, and to avoid the complication of the rule matching every login after a reboot.
Intuitively, the further apart the events which correlate to the OSExploitEvent are from
each other, the less likely they are to be causally related. Where one draws the
line on the relatedness of two events of these types is by nature subjective, and could
benefit from techniques which acknowledge this.
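One direction such techniques might take, sketched here purely as an illustration and not part of the prototype, is to replace the crisp window with a relatedness score that decays as the separation between the two events grows:

from datetime import timedelta

def relatedness(delta, half_life=timedelta(minutes=10)):
    """Confidence that two events are causally related, halving every half_life."""
    if delta < timedelta(0):
        return 0.0
    return 0.5 ** (delta / half_life)

print(relatedness(timedelta(minutes=5)))   # ~0.71
print(relatedness(timedelta(minutes=30)))  # 0.125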
The third contribution is the identification of a novel means of resolving the
problem of surrogate proliferation in interpreting names in event logs, which is
described in Section 5.4.1. Surrogate proliferation refers, in this case, to the problem
which arises from a single real (or virtual) world entity having multiple names by
which it is referred to, which leads to the necessary creation of one surrogate per name.
For example, while some event log entries (such as those taken from firewall logs) may
describe events related to a host by referring to its IP address, other event log entries
may refer to the same entity by its DNS name. This abundance of multiple names for
the same entity grows the quantity of entities which must be considered in interpreting
and correlating event logs.
This problem of surrogate proliferation can be observed throughout the digital
forensics domain. The event normalising task in Stephenson’s End to End Digital
Investigation (EEDI) methodology [127] refers to this problem in the context of
resolving records referring to the same network event being received from multiple
sensors. Similar problems may be observed in ascribing identity to particular versions
of files (e.g. operating system files) found across multiple digital crime scenes.
The technique addressing this problem (described in Section 5.4.1) exploits a
general feature of the RDF/OWL formalism: the owl:sameAs language term and its
associated OWL defined semantics. That the general reasoning machinery of the
knowledge representation is employed to solve this problem demonstrates the
Chapter 6. Sealed digital evidence bags
The previous chapter proposed and demonstrated the use of formal knowledge
representation in automating correlation of digital event oriented evidence, to facilitate
identifying situations of interest from heterogeneous and disparate domains. This
chapter addresses themes of representation and assurance in addressing how forensics
tools might scale and interoperate in an automated fashion, while assuring evidence
quality. The chapter considers the problem of sharing of digital evidence between tools
or even more widely, between organisations.
The chapter is structured as follows. Section 6.1 introduces the problem of
digital evidence storage formats, the related literature for which is described in Section
3.2. Section 6.2 enumerates a number of definitions of terms related to digital evidence
and related documentary artefacts. Section 6.3 proposes a novel integrated storage
container architecture and KR based information architecture for digital evidence bags,
which we call sealed digital evidence bags (SDEB). This approach supports arbitrary
composition of evidence units, and related information, into a larger corpus of evidence.
Section 6.4 describes the compositional nature of the architecture in the context of a
usage scenario: building digital forensics tools and acquiring digital evidence from hard
disks. Section 6.5 describes experimental results validating the compositional nature of
the prototype approach, and Section 6.6 presents the conclusions of the chapter and
relates opportunities for future work.
The research work described in this chapter has led to the publication of the
following paper:
6.1 Introduction
The rapid pace of innovation in digital technologies presents substantial
challenges to digital forensics. New memory and storage devices and refinements in
existing ones provide constant challenges for the acquisition of digital evidence. The
proliferation of competing file formats and communications protocols challenges one’s
ability to extract meaning from the arrangement of ones and zeros within. Overarching
these challenges are the concerns of assuring the integrity of any evidence found, and
reliably explaining any conclusions drawn.
Researchers and practitioners in the field of digital forensics have responded to
these challenges by producing tools for acquisition and analysis of evidence. To date,
these efforts have resulted in a variety of ad hoc and proprietary formats for storing
evidence content, analysis results, and evidence metadata, such as integrity and
provenance information. Conversion between the evidence formats utilized and
produced by the current generation of forensic tools is complicated. The process is time
consuming and manual in nature, and there exists the potential that it may produce
incorrect evidence data, or lose metadata [30].
It is with these concerns in mind that calls have been made for a universal
container format for the capture and storage of digital evidence. Recently, the term
“Digital evidence bags” was proposed to refer to a container for digital evidence,
evidence metadata, integrity information, and access and usage audit records [135].
Subsequently, the DFRWS formed a working group with a goal of defining a
standardised Common Digital Evidence Storage Format (CDESF) for storing digital
evidence and associated metadata [30]. For further background on digital evidence
container formats, see Section 3.2.1.
Another source of complications related to the ad hoc nature of forensic tools is
the absence of a common representational format for Investigation Documentation.
This includes a number of generally related classes of information, such as Continuity
of Evidence, Provenance, Integrity, and Contemporaneous Notes (see Section 2.2). This
is not a trivial problem owing to the nature of the forensics domain, which deals with
massive conceptual complexity within multiple layers of abstraction. The challenge
here is to identify a means that decouples the evidence container formats and
investigation documentation used by forensics tools from the implementation logic of
these tools. Furthermore, this needs to be accomplished in a manner that facilitates the
assurance of provenance and maintains integrity.
This problem of evidence representation is not simply limited to the challenge
of tool interoperability. In outlining the “Big Computer Forensic Challenges”, Spafford
observes that practitioners and researchers in the field of digital forensics do not use
standard terminology [98], and indeed it is clear that there is limited attention paid to
the formal definition of taxonomies or ontologies describing this domain.
We propose the use of ontologies in addressing these terminological and
representational problems. We have produced a number of basic ontologies modelling
the domain of digital evidence acquisition, computer hardware, and networks, and
described these ontologies using the Web Ontology Language (OWL). In combination
with semantic markup languages such as RDF, ontologies encourage knowledge
sharing and reuse within a domain, which has the potential to lead towards a
convergence of vocabulary in the forensics domain.
In this chapter we propose an extensible architecture for integrating digital
evidence by applying an ontology based approach to Turner’s digital evidence bags
concept. We enumerate the representational requirements for the investigation
information component of an open common digital evidence storage format, and
formalise the domain by describing it with an ontology. An architecture for digital
evidence bags is demonstrated which facilitates modular composition of forensic tools
by way of an extensible information architecture. Further, a novel means of identifying
digital evidence, and digital evidence bags is proposed which supports arbitrary
referencing of information within and between digital evidence bags. The proposal
modifies Turner’s design to strengthen evidence assurance, proposing a sealed
(immutable) bag metaphor.
6.2 Definitions
Our concerns involve representation and terminology. To avoid confusion, the
following terms used throughout the chapter, and in our digital evidence ontology, are
defined below. As the subject is digital evidence, we omit the use of the word digital in
our definitions.
Continuity of Evidence Documentation: Information maintained to track
who has handled evidence since it was preserved.
Digital Evidence: A term which loosely refers to a related set of Evidence
Content or Secondary Evidence and Investigation Documentation.
Evidence Content: Stream of bytes of computer data: typically data which is
stored in a file, or a stream of a file, or in raw storage, such as the ordered sectors of a
disk.
Evidence Content File: A file containing evidence content.
Image: A contiguous sequence of bytes, which is a copy of a digital crime
scene.
by recursively embedding digital evidence bags within digital evidence bags, as well as
by intra-bag reference, which we depict in Figure 18. We call the architecture the
Sealed Digital Evidence Bag (SDEB) in reference to Turner's proposal of the DEB.
The storage container architecture describes how data streams containing data
objects, investigation documentation, and evidence bag documentation are contained in
one archive.
Sealable digital evidence bags follow a similar structure to Turner's bags. The
key difference is the use of RDF/XML to represent the Tag and Investigation
Documentation related information, in order to facilitate an interoperable
representation. The Tag File of any digital evidence bag is called Tag.rdf. The naming
of the Investigation Documentation files is tool or user determined; however, the
extension is .rdf to signify that the format of the file is RDF.
The RDF/XML format does not support recursive definition of RDF/XML
content within the content of another RDF/XML content block, and makes no provision
for arbitrary text outside the XML syntax. This leads us to maintain
integrity information regarding the content of the Tag in a file external to the Tag,
unlike the DEB proposal. Turner's DEB uses an onion like approach where a hash of
the previous contents of the Tag is recursively appended to the Tag. We instead define
a Tag Integrity File, called Tag.rdf.sig, which contains integrity information pertaining
to the Tag.
Sealable digital evidence bags are designed to be created and populated with
evidence and investigation documentation, then sealed exactly once. The Tag of an
SDEB is immutable after the Tag Integrity File has been added to the SDEB; before
that point the bag is unsealed and mutable.
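The sealing step itself is mechanically simple. A minimal Python sketch follows, using the file names defined above and SHA-1 as elsewhere in this chapter; a production implementation would use whatever digest or signature scheme underlies the Tag Signature.

import hashlib
from pathlib import Path

def seal_bag(bag_dir):
    """Seal an SDEB by writing a digest of Tag.rdf into Tag.rdf.sig."""
    tag = Path(bag_dir) / "Tag.rdf"
    digest = hashlib.sha1(tag.read_bytes()).hexdigest()
    (Path(bag_dir) / "Tag.rdf.sig").write_text(digest + "\n")
    return digest

def verify_seal(bag_dir):
    """Check that Tag.rdf has not changed since the bag was sealed."""
    tag = Path(bag_dir) / "Tag.rdf"
    recorded = (Path(bag_dir) / "Tag.rdf.sig").read_text().strip()
    return hashlib.sha1(tag.read_bytes()).hexdigest() == recorded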
The structure of the SDEB is presented in Figure 19.
jbloggs.history.MSHist012006010420060105.index.dat.rdf
jbloggs.history.MSHist012006010420060105.index.dat
jbloggs.history.MSHist012006010320060104.index.dat.rdf
jbloggs.history.MSHist012006010320060104.index.dat
jbloggs.history.MSHist012005121220051219.index.dat.rdf
jbloggs.history.MSHist012005121220051219.index.dat
jbloggs.history.MSHist012005121920051226.index.dat.rdf
jbloggs.history.MSHist012005121920051226.index.dat
jbloggs.cache.index.dat.rdf
jbloggs.cache.index.dat
jbloggs.history.index.dat.rdf
jbloggs.history.index.dat
Tag.rdf
Tag.rdf.sig
Recalling that in RDF (see Section 4.3.1), Subjects, Predicates and Objects are
named using a URI, we use a special category of URI called a Uniform Resource Name
(URN) [86] for identifying digital evidence bags, investigation documentation, and
arbitrary secondary evidence instances. URNs are intended to serve as persistent,
location-independent resource identifiers.
Following work performed in the life sciences area in uniquely identifying
proteins in distributed databases (which has resulted in the definition of the Life
Sciences Identifier (LSID) standard [117]), we propose a digital evidence specific URN
scheme. This scheme, which we call Digital Evidence IDentifier (DEID) is based on
the organisation of the tool user, and employs message digest algorithms as a globally
unique identifier. The format of a Digital Evidence Identifier is as follows:
urn:deid:organisation:digestalgorithm:digest:discriminator
urn:deid:isi.qut.edu.au:sha1:dc04e8f06b2a32e7d673c380c4d2c8a1d5ea17d4:image
The string “deid” 28 is used to provide a unique namespace for digital evidence
identifiers. We provide scoping information in the organisation field, which would
potentially enable one to resolve a URN back to a set of information or an evidence bag,
as has been employed in the LSID work. The digestalgorithm field refers to the
message digest algorithm used to generate the text in the following field. The
discriminator field is provided for further addition of naming terms. It should be noted
that we rely on the collision-free nature of message digest algorithms to assure globally
unique names. Given that flaws may be found in cryptographic hashes over time, our
proposal provides for the use of other digest algorithms.
28 We follow the LSID convention, which uses a lower case string “lsid” in the URN.
Of course these identifiers are long and unwieldy and not suited for use as
names for the evidence we are concerned with. Evidence may be given more human
friendly, case specific names by asserting further RDF triples which have the identifier
as the subject. An example of this kind of usage is given in the case study in Section
6.4.
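A sketch of DEID construction and of asserting a friendlier name follows. The organisation value is taken from the example above; the property name used for the friendly name is a hypothetical illustration rather than a term defined in our ontology.

import hashlib

def make_deid(content, organisation="isi.qut.edu.au", discriminator="image"):
    """Construct a Digital Evidence IDentifier URN from evidence content bytes."""
    digest = hashlib.sha1(content).hexdigest()
    return "urn:deid:%s:sha1:%s:%s" % (organisation, digest, discriminator)

deid = make_deid(b"example evidence content")
# A human friendly, case specific name can then be asserted as a further RDF
# triple with the DEID as subject (the property name here is hypothetical).
friendly_name = (deid, "de:caseName", "Bloggs browser cache")
print(deid)
print(friendly_name)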
Where it is necessary to refer to the contents of a particular file, for example a
digital evidence container file in the same DEB, the DEB implementation interprets the
standard URI file protocol (i.e. file://./foo) to find the file.
<de:FileImage
rdf:about="urn:deid:isi.qut.edu.au:sha1:4056e4786fc460d9adbe98a0bc19b29a2104c476
:image">
<de:imageContainer rdf:resource="file:///./jbloggs.cache.index.dat"/>
<de:imageOf
rdf:resource="urn:deid:isi.qut.edu.au:sha1:4056e4786f…b29a2104c476:original"/>
<de:acquisitionTool>
<de:OnlineAcquisitionTool rdf:about="http://www.isi.qut.edu.au/2005/acquireIELogs.py">
<de:name>acquireIELogs.py</de:name>
<de:version>0.1</de:version>
</de:OnlineAcquisitionTool>
</de:acquisitionTool>
</de:FileImage>
<wb:BrowserCacheFile rdf:about =
"urn:deid:isi.qut.edu.au:sha1:4056e4786f….b29a2104c476:original">
<fs:filePath>D:\Documents and...Files\Content.IE5\\index.dat</fs:filePath>
<de:messageDigest rdf:datatype="http://www.w3.org/2000/09/xmldsig#sha1"
>4056e4786fc460d9adbe98a0bc19b29a2104c476</de:messageDigest>
</wb:BrowserCacheFile>
This file (Table 14) contains RDF instance data which asserts two top level
instances: a FileImage and a WebBrowserCacheFile. The instances describe the
relationship between the Evidence Content (the content of an Evidence Content File in
the digital evidence bag) and the original data object, which is a Web Browser Cache
File, located on a particular host.
Our ontology here discriminates between the original data object, the web
browser cache file (which at one point in time resided on some piece(s) of physical
storage media), and the image of that file. As the contents of these two files are, from
the digital perspective, identical, this results in a DEID URN with the same message
digest value. We discriminate between the two instances by using the labels "image"
and "original" in the discriminator field of the DEID URN. This distinguishes between
the FileImage and the BrowserCacheFile. The de:imageContainer property links the
Figure 20: RDF Graph relating original data object and image
The tag file contains the RDF data representing the SDEB's contents and related
integrity information. The DEID of the deb:DigitalEvidenceBag instance is based on
the hash of the content of the Investigation Documentation Files, in the order in which
they are defined in Table 15. The deb:bagContents property is an ordered list which
refers to instances of digital investigation documentation contained in the digital
investigation documentation files.
Table 15: Digital Evidence Bag instance data stored in the Tag File
<deb:DigitalEvidenceBag
rdf:about="urn:deid:isi.qut.edu.au:sha1:44bc23235f5e797aae992e5de09524e9071fd8c6
">
<deb:bagContents>
<rdf:Seq>
<rdf:li
rdf:resource="urn:deid:isi.qut.edu.au:sha1:dc04e8f06b2a32e7d673c380c4d2c8a1d5ea1
7d4:image"/>
<rdf:li
rdf:resource="urn:deid:isi.qut.edu.au:sha1:4a03ed30ebdf919004d4b40222b721c4771ad
ee9:image"/>
<rdf:li
rdf:resource="urn:deid:isi.qut.edu.au:sha1:c117652d98a4f612979c19f5701d278e02574
9fa:image"/>
<rdf:li
rdf:resource="urn:deid:isi.qut.edu.au:sha1:05de1243f67753150334968a2effcc4f8114e
f45:image"/>
<rdf:li
rdf:resource="urn:deid:isi.qut.edu.au:sha1:f3a9fd3fcc017d822f10bc4466b6d19ddbdd5
042:image"/>
<rdf:li
rdf:resource="urn:deid:isi.qut.edu.au:sha1:4056e4786fc460d9adbe98a0bc19b29a2104c
476:image"/>
</rdf:Seq>
</deb:bagContents>
</deb:DigitalEvidenceBag>
6.3.3 Integrity
Current best practice for ensuring the integrity of digital evidence involves the
use of collision resistant message digest functions. Typically a message digest is taken
of the original evidence, and recorded in a manner that asserts the time of the digest
being taken (often via contemporaneous notes or printouts). The integrity of subsequent
images made, or copies of images made may then be ensured by taking the message
digest of the image or copy, and comparing with the original message digest.
In this proposal, integrity of evidence and investigation documentation is
ensured by the use of chained message digests. Besides using the message digest of
each piece of Evidence Content as a component of a unique identifier for both the
Evidence Content Documentation instance and the Digital Investigation Documentation
instance, we also define a property of the de:EvidenceContext class called
de:messageDigest. This property is presented in context in Table 16.
<wb:IEBrowserCacheFile
rdf:about="urn:deid:isi.qut.edu.au:sha1:4056e4786fc460d9adbe98a0bc19b29a2104c476:original">
<de:messageDigest rdf:datatype="http://www.w3.org/2000/09/xmldsig#sha1"
>4056e4786fc460d9adbe98a0bc19b29a2104c476</de:messageDigest>
</wb:IEBrowserCacheFile>
The value of the de:messageDigest property is the hash of the Digital Evidence
Content obtained from the file. Work in the xml signature area has already defined a
datatype representing a SHA-1 message digest, and defined a URI representing this
Table 17: Investigation Documentation Container Metadata stored in the Tag File.
<deb:InvestigationDocumentationContainer>
<deb:contains
rdf:resource="urn:deid:isi.qut.edu.au:sha1:dc04e8f06b2a32e7d673c380c4d2c8a1d5ea1
7d4:image" />
<de:messageDigest rdf:datatype="http://www.w3.org/2000/09/xmldsig#sha1"
>731251ae7216b935cccf51a4018a00d8d89a89cd</de:messageDigest>
<fs:filePath>file:///./jbloggs.history.index.dat.rdf</fs:filePath>
</deb:InvestigationDocumentationContainer>
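Verification of the digest chain can then proceed from the Tag outwards. The following sketch assumes the Tag has already been parsed into (file path, recorded digest) pairs such as those in Table 17; it illustrates the checking step only.

import hashlib
from pathlib import Path

def verify_contents(bag_dir, recorded):
    """Compare each contained file's SHA-1 digest against the digest recorded in the Tag.

    `recorded` is a list of (relative file path, hex digest) pairs extracted from
    the InvestigationDocumentationContainer entries of Tag.rdf.
    """
    failures = []
    for rel_path, expected in recorded:
        actual = hashlib.sha1((Path(bag_dir) / rel_path).read_bytes()).hexdigest()
        if actual != expected:
            failures.append(rel_path)
    return failures

# Example use (digest value as recorded in Table 17):
# verify_contents("bag01", [("jbloggs.history.index.dat.rdf",
#                            "731251ae7216b935cccf51a4018a00d8d89a89cd")])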
6.3.5 Clarifications
In this case, the examiner uses a DEB enabled hard drive imaging application
for acquiring the evidence image. This tool is scripted together from a variant of the
UNIX dd 29 tool and the Linux hdparm utility 30. The examiner acquires the hard drive
using this utility, resulting in a digital evidence bag containing an Evidence Content
File, called hda.dd, an Investigation Documentation File, called hda.dd.rdf, as well as
Tag.rdf. The imaging application is designed to be as simple as possible, and to produce a
sealed digital evidence bag. It automatically generates a message digest of the Tag.rdf
file and stores it in the Tag Integrity File, Tag.rdf.sig. At this point the evidence bag is
sealed and considered immutable, depending on the underlying scheme of
implementation of the Tag Signature.
The examiner has further data associated with this digital evidence bag, namely
the Job ID, a case specific name, the examiner's name and identifying details, and
perhaps the serial number printed on the drive. An evidence annotation program is used
by the examiner to create a new, unsealed digital evidence bag, with the original digital
evidence bag embedded within it. A new Tag File is created within this new bag by the
annotation application. The additional data is entered using the annotation user
interface, and added to the Tag File. In this case the annotation editor eschews creating
a new Investigation Documentation File, as no new evidence has been acquired.
There are two distinct activities involved in the above scenario: evidence
acquisition and evidence annotation. By the former, we refer to the process of making
an exact copy of a piece of digital evidence, for example a hard disk. The latter refers to
the act of recording details relevant to the acquisition process and the evidence source.
By modularising these two tasks, individual tool complexity is reduced, which has the
potential to increase reliability and enable testing at a more granular level. Bugs in the
consuming forensic tool (the annotation tool) are less likely to jeopardise the
integrity of the product of the evidence acquisition task.
The tool annotates the information in the original sealed digital evidence bag
by asserting new properties and their values, related to the DEID of the particular piece
of information from the subject bag, as new RDF triples. These triples are stored in the
Tag File of the new unsealed DEB. In reference to the above example, the new data is
related to the instance representing the hard disk by means of its unique identifier. A
depiction of a portion of the RDF graph formed from the new information as well as
the original investigation documentation is presented in Figure 21.
29 A low level block oriented copying tool found on most UNIX variants.
30 A utility which queries information such as serial numbers, size, and addressing information from hard disks.
Figure 21: RDF graph resulting from addition of new documentation to embedded DEB
<de:Image
rdf:about="urn:deid:isi.qut.edu.au:sha1:4056e4786fc460d9adbe98a0bc19b29a2104c476:image">
<de:acquiredBy>
<foaf:Person>
<foaf:name>Bradley Schatz</foaf:name>
<foaf:mbox rdf:resource="mailto:bradley@blschatz.org"/>
</foaf:Person>
</de:acquiredBy>
</de:Image>
31 The prototype implementation and ontology use the term 'Evidence Metadata' where we now use 'Investigation Documentation'. This refinement in terminology is intended to signify the arbitrary information which may be related to the evidence by multiple layers of abstraction.
concerns of digital forensics. Reification will assist in making possible statements such
as “The pasco tool interpreted the following statements from file X”.
The second contribution is a conceptual advance, proposing an improvement
on the Digital Evidence Bag (DEB) proposal of Turner. Our proposal, which we call
the Sealed Digital Evidence Bag (SDEB) enables arbitrary composition of evidence
bags and information within evidence bags, without modifying any data in original
evidence bags. This proposal improves upon the DEB proposal by simplifying aspects
of evidence authentication.
Central to the compositional approach is our proposal of a globally unique
identification scheme for identifying digital evidence and related information, which
we dub Digital Evidence IDentifiers (DEID). This unique naming scheme enables
automated integration of information from separate evidence bags by the
implementation of the underlying knowledge representation layer. This demonstrates
that employing a knowledge representation as a common language for documenting the
digital investigation provides immediate benefits towards solving the complexity
inherent in integrating this information.
The final benefit of the SDEB approach is that it enables granular composition
and decomposition of evidence into a corpus of inter-related evidence bags, which
addresses the volume problem by facilitating automated validation and scalable
processing of evidence.
The next chapter addresses the theme of analysis techniques and evidence
assurance.
Chapter 7. Temporal provenance &
uncertainty
“I used to be Snow White, but I drifted.”
(Mae West)
2006), Lafayette, USA, and published as Digital Investigation, 3 (Supplement 1), pp. 89-
107.
7.1 Introduction
The use of timestamps in digital investigations is fundamental and pervasive.
Timestamps are used to relate events which happen in the digital realm to each other
and to events which happen in the physical realm, helping to establish event ordering
and cause and effect. A well known difficulty with timestamps, however, is how to
interpret and relate the timestamps generated by separate computer clocks when they
are not known to be synchronized [128]. Commonly observed differences in time from
computer to computer are caused by location specific time variations (such as time
zones), the rate of drift of the hardware clocks in modern computers, and
misconfiguration or inadequate synchronisation.
Current approaches to inferring the real world interpretation of timestamps
assume idealised models of computer clock time, eschewing influences such as
synchronisation and deliberate clock tampering. For example, to determine the clock
skew of a computer being seized, it is commonly recommended that a record be made
“of the CMOS time on seized or examined system units in relation to actual time,
obtainable using radio signal clocks or via the Internet using reliable time servers.”
[20]. CERT recommend that, “As you collect a suspicious system’s current date, time
and command history … determine if there is any discrepancy between the collected
time and date and the actual time and date within your time zone” [96].
While this approach will approximately identify the skew between the local
time and the observed computer time at the time of the check, it says nothing about the
passage of time on the computer’s clock prior to that point [141]. Uncertainty remains
as to the behaviour of the clock of the suspect computer prior to seizure. This further
leads to uncertainty as to what real world time to ascribe to any timestamp based on
this clock.
In this work we explore two themes related to this uncertainty. Firstly, we
investigate whether it is reasonable to assume uniform behaviour of computer clocks
over time, and test this assumption by attempting to characterise how computer clocks
behave in practice. Secondly, we investigate the feasibility of automatically identifying
the local time on a computer by correlating timestamps embedded in digital evidence
with corroborative time sources.
The subject of our case study is a network of machines in active use by a small
business. The network consists of a Windows 2000 domain, containing one Windows
2000 server, a Domain Controller (DC), and a variety of Windows XP and 2000
workstations. Access to the internet is provided by a Linux based firewall. In this case,
the Windows 2000 DC (the server) has not been configured to synchronize with any
reliable time source, and as such has been drifting away from the civil timescale for
some time. The Linux firewall also provides both a squid 32 web proxy server, and an
NTP server, which is synchronised with a stratum 2 NTP server 33 . All workstations are
configured to use the squid proxy cache for web access.
Our goal here is to observe both the temporal behaviour of the Windows 2000
DC, and the effects of synchronization on the subordinate workstation computers. We
would expect that the timescales of the workstation computers would approximate that
of the DC, because of the use of SNTP in this network arrangement (see Chapter 3).
To observe this behaviour, we have constructed a simple service that logs both
the system time of the host computer and the civil time for the location, which we
obtain via SNTP from the local NTP server. The program samples both sources of time
and logs the results to a file. Figure 22 depicts the network topology and time related
infrastructure for this experiment.
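A minimal sketch of such a logging loop is shown below. The prototype was implemented as a Windows service in Python; the sketch assumes the third-party ntplib package and an illustrative NTP server address, and omits the Windows service plumbing.

import time
from datetime import datetime, timezone

import ntplib  # third-party package; assumed available

def log_skew(ntp_server="192.168.0.1", interval_seconds=600, logfile="skew.log"):
    """Periodically record system time alongside civil time obtained via (S)NTP."""
    client = ntplib.NTPClient()
    with open(logfile, "a") as out:
        while True:
            response = client.request(ntp_server, version=3)
            civil = datetime.fromtimestamp(response.tx_time, tz=timezone.utc)
            system = datetime.now(timezone.utc)
            skew = (system - civil).total_seconds()
            out.write(f"{civil.isoformat()} {system.isoformat()} {skew:.3f}\n")
            out.flush()
            time.sleep(interval_seconds)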
32 http://www.squid-cache.org/
33 Stratum refers to the distance from a reference clock.
Figure 22: Experimental setup for logging temporal behaviour of windows PC's in small
business network
The logging program was deployed on all workstations and the server on the 1st
February 2006, and the results checked in mid March. Unfortunately, the program
proved short-lived, as a bug in the Windows service implementation of
Python (the implementation language) saw the log service crash after writing 4 KB of
debug messages to the standard output stream. On fixing the bug, a new version was
redeployed on the 21st March 2006 for 20 days (until the 10th April), and the results
then collected.
The graphs presented below are based on the sampled timescales taken from
machines in the subject network. The x-axis is the time and date of the sample, taken
from the civil timescale, as served by the NTP server. The y-axis is the difference in
time between the system time and civil time at that moment, in seconds.
Figure 23 is the graph of results taken from the domain controller of the
Windows 2000 server based network. The solid line of samples shows a uniform drift
of the system time away from civil time for the period 21st March through 10th
April. The other two sets of samples, from the 1st February through the 21st March
2006, are samples taken by the initial version of the program in the time after a boot
(before the program crashed). In Figure 23 two clusters are visible outside the
aforementioned line, one around the 1st February, and one around the 13th March. These two
clusters indicate reboots at those times. Extrapolating the solid line shows the drift of the
server to be at a near uniform rate.
Figure 23: Clock skew of Domain Controller "Rome" offset from civil time.
Figure 24: Clock skew of workstation "Florence" offset from civil time.
The scale of the graph is misleading as to the number of outlier values present
from 8:19:34AM through 8:25:56AM on the 20th February. The cross at 0 skew
actually represents 38 outlier values, which do not fit a model of time where the clock
is synchronised to the DC. It seems highly irregular that during this period the machine
became synchronised to within one second of civil time (a time stream to which the
network in question has no configured reference).
The default auditing configuration of the Windows network did not audit use of the
necessary privilege, so it was not possible to identify whether this was user instigated. The accuracy to
which the clock became synchronised with civil time leads us to suspect that this was
not the result of user interaction; rather that it was the action of some program which
had access to an external, reliable time source. The Windows update service was active
during this period; we speculate that the cause of synchronisation with the Civil
Timescale during this period was the Windows Update service.
The graph presented in Figure 25 shows the skew data taken from a Windows
XP workstation named Milan. 34 Again the drift rate generally remains constant and
correlated with that of the server; however, there are two sets of anomalies which
deviate from this general trend. Immediately noticeable are the almost vertical lines
which indicate a resynchronisation with the DC timescale from wide time skews. We
speculate that these features indicate a computer reboot immediately before. 35 The
second anomaly is the two peaks on the graph around the 6th and 7th April.
34 The scale of this graph differs from the previous graphs to present a clearer view of the
features under discussion. We note that the overall form of the graph when taken at the previous
scale follows the same gradient and offset.
On closer investigation, the vertical line on the 4th of April reveals that over a
period of 22 minutes and 0 seconds of real time, the system clock advanced only 20
minutes 51 seconds; in total, the system clock loses 1 minute 9 seconds over this period.
This behaviour occurred in small, incremental changes and is consistent with the
disciplining of a skewed clock back into synchronisation with a trusted source.
Figure 25: Clock skew of workstation "Milan" offset from civil time (zoomed).
35 As the focus of this section is on describing observed deviations of Windows based clocks
from the ideal, we leave experiments which conclusively determine the behaviour of the
Windows XP clock at boot to others.
Figure 26: Clock skew of workstation “Trieste” offset from civil time.
The graph in Figure 27 combines data from the DC and Milan (Figures 23 and
25) about the period where peaks are seen in the skew graph 36. We can see here that the
DC was maintaining a stable timescale (part of its data, the points forming a thin line, is
showing through under the peak) for the period, with Milan drifting away sharply at the
peaks. At the start of the peak we can see that Milan began drifting away from the DC
at a rate of around 1 second every 14 minutes, before re-synchronising with the DC.
Figure 27: Clock skew of "Rome" vs. "Milan" offset from civil time (zoomed).
36 Note that colour would help with this graph. Rome data is plotted as points and Milan as
crosses.
events of interest were found. This drift is unlikely to have been based on a single
operator action, as the corresponding change in skew would have been immediately
visible, with a discontinuity between the two points.
The remaining three workstations stayed synchronised with the DC, with no
temporal anomalies observed. The skew timelines for these were similar to Figure 23.
For reasons of brevity they are not reproduced here.
From these results we draw a number of conclusions. In general, we find that
Windows hosts (2K and XP) integrated with a Windows based time synchronisation
network will stay synchronised. The anomalies observed above, however, indicate that
making reliable statements about the timescale of a particular workstation computer
within a Windows Domain network (and as such the interpretation of timestamps from
these workstations) is problematic.
Windows computers not in a Domain network, either untethered from reliable
sources of time (such as those running Windows 2000) or loosely tethered (such as
those running the XP OS), may suffer from the same problem. Indeed, as XP hosts
synchronise with time.windows.com on a far less frequent basis (weekly), there will be
longer periods of de-synchronisation. The observation that the host “Milan” became
synchronised with civil time for a period, and the further observation of it drifting
away from the DC timescale and civil time (for no observable reason), indicate that
other factors are influencing the behaviour of the clock.
relate to a suspect computer may be obtained from the ISP which has served as the
computer’s gateway to the Internet.
We assume that these records on the proxy would be produced by a computer
which is synchronized with an accurate time source. While this might not at present be
a generalisable assumption, we look towards a near future where the provenance of
audit records receives closer attention from ISPs and business in general, as forensic
preparedness finds its way onto the agenda for compliance and other reasons.
Our experimental setup uses the same infrastructure which was used in the first
study. Relevant to this experiment is the deployment of the Microsoft Internet Explorer
web browser on all Windows based machines, and the presence of a Squid HTTP proxy
on the firewall, which the computers are configured to use to access the web. The
experimental setup is depicted in Figure 28.
This experiment takes the browser records from the machines in the network
and correlates them with the entries in the Squid proxy log, to determine the temporal
behaviour of the Windows machines on the network. The correctness of the correlation
techniques is evaluated using the data collected from the previous experiment.
IE stores records of browsing access in two subsystems: the cache and history.
These records are all stored in separate files, called index.dat, but located in different
directories.
The IE cache subsystem stores locally cached copies of web content, such as
pages and images, in local files (with extensions such as .jpg, amongst others). An
index mapping web addresses to these locally stored copies is kept in a file called
index.dat. The cache index files contain entries for all cacheable resources visited,
including the component files of a particular viewable page (for example images,
sounds, and flash animations).
The history subsystem creates a historical record of URLs visited over time in a
set of index.dat files. Three separate types of history file are kept: the root history, daily
sort history and weekly sort history. Within these files are records of visits to top level
viewable pages:
• Pages visited by typing a URL
• Pages visited by clicking on a hypertext link
• Documents opened within Windows Explorer by double clicking (i.e.
.xls, .doc…)
The cache and history index.dat files all share a similar undocumented binary
file format. Despite the lack of documentation, a number of reverse engineering
analyses of the file format have been published, and a number of tools are available
which will interpret the content of these files. For a good description of the file format,
especially notable in distinguishing some subtle semantic differences in interpreting the
timestamps in these records, see [20].
We initially used the Pasco [59] tool for extracting the data contained in these
files. We chose this tool as it had freely available source code. In practice, however, we
suspected that it was generating spurious results. This prompted us to perform our own
reverse engineering effort. Our new tool identified a bug in the Pasco tool where a
spurious record was generated from an unchecked file read at an offset outside the
bounds of the file 37.
The Squid proxy cache logs a record of all web transactions which it processes
in a textual log file called access.log. The fields of interest to us are the resource access
time (which, like the timestamps in the IE index files, marks the end of the transaction)
and the URL visited.
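As an illustration, the following sketch extracts these two fields from a Squid access.log in its default native format, where the first field is a UNIX timestamp (with a millisecond fraction) marking the completion of the transaction and the seventh field is the requested URL; the field positions are assumptions based on the default Squid configuration rather than part of the experiment described here.

```python
# Sketch: extract (end-of-transaction time, URL) pairs from a Squid access.log
# in the default "native" log format (field positions are an assumption).
from datetime import datetime

def parse_squid_log(path):
    records = []
    with open(path) as log:
        for line in log:
            fields = line.split()
            if len(fields) < 7:
                continue                      # skip malformed lines
            end_time = float(fields[0])       # UNIX epoch seconds, end of transaction
            url = fields[6]                   # requested URL
            records.append((datetime.utcfromtimestamp(end_time), url))
    return records
```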
37 Our new parsing tool, imaginatively named pasco2, is available at
http://www.bschatz.org/2006/pasco2/
Our experiment involves translating the web browser records and squid logs
into a common representation and matching entries from the two sources based on the
URL visited. We assume that the last accessed time from the squid record is relative to
civil time (kept tightly synchronised using, for example, NTP), and compare that time
with the last accessed time from the corresponding history or cache record.
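A minimal sketch of one way this matching step might be implemented is shown below; it is not the specific algorithm evaluated later in this section. It assumes both sources have already been translated into (timestamp, URL) tuples, and for each IE record it simply finds the Squid entry for the same URL whose timestamp is closest, reporting the difference as the apparent skew of the workstation clock.

```python
# Sketch: match IE records to Squid records by URL and compute apparent skew.
# ie_records and squid_records are lists of (timestamp, url) tuples in a
# common representation (timestamps as datetime objects).
from collections import defaultdict

def correlate(ie_records, squid_records):
    by_url = defaultdict(list)
    for ts, url in squid_records:
        by_url[url].append(ts)

    matches = []
    for ie_ts, url in ie_records:
        candidates = by_url.get(url)
        if not candidates:
            continue                                   # URL never seen by the proxy
        squid_ts = min(candidates,
                       key=lambda ts: abs((ts - ie_ts).total_seconds()))
        skew = (ie_ts - squid_ts).total_seconds()      # positive: IE clock ahead of civil time
        matches.append((squid_ts, url, skew))
    return matches
```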
The primary challenge related to correlation is in determining which entry in
the Squid cache log corresponds to a particular entry in the cache or history records. As
IE records are most recently used (MRU) records, there will not be a one to one
mapping between history entries and Squid events. We illustrate this with the following
example.
Figure 29: Matching is complicated by only the most recent record present in the history.
For the two algorithms explored, the sampled timescales from the previous
experiment in Section 7.2 are used as a baseline for determining which matches are true
or false. True positives are data points output by the correlation algorithm which
correlate with the timescale identified in Section 7.2. False positives are matches
generated by the algorithm which do not correlate with the timescale. True negatives
are prospective matches that are rightly discarded by the algorithm. False negatives are
data points which would correlate with the timescale, but which the algorithm
misclassifies as not correlated.
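As a sketch of how this classification might be automated (not the procedure actually used, and with an assumed tolerance of 2 seconds), the fragment below compares each correlated skew value against the skew of the nearest baseline sample from Section 7.2, treating matches within the tolerance as true positives.

```python
# Sketch: classify correlation results against the experimentally sampled
# baseline timescale. baseline is a list of (civil_time, skew_seconds) samples;
# matches is the output of the correlation step. TOLERANCE is an assumption.
TOLERANCE = 2.0  # seconds

def classify(matches, baseline):
    true_pos, false_pos = [], []
    for civil_ts, url, skew in matches:
        nearest = min(baseline,
                      key=lambda sample: abs((sample[0] - civil_ts).total_seconds()))
        nearest_skew = nearest[1]
        if abs(skew - nearest_skew) <= TOLERANCE:
            true_pos.append((civil_ts, url, skew))
        else:
            false_pos.append((civil_ts, url, skew))
    return true_pos, false_pos
```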
Figure 30: Correlated skew (clickstream) vs. experimental skew (timeline) for host
“Milan” do not correlate because of presence of false positives.
Figure 30 and Figure 31 show a clickstream correlation run, graphed with the
timescale log of the workstation “Milan”. The clickstream correlation dataset in Figure
30 is graphed as crosses. It contains 75 results, of which we can see 4 clusters of
clickstream results 38. Clearly there is conflicting data. The two clusters visible off the
timeline actually contain 5 false positive values, which are causing the problem. These
5 values are false positives because we know from our earlier experiment what the
actual time was on the computer clock at that point, which is plotted on the graph as
dots. Removing these false positives from the result set produces the graph labelled
Figure 31, where we can see tight correlation with the workstation’s timescale.
Figure 31: Correlated skew vs. experimental skew for host “Milan” correlates when false
positives are removed.
The results of running the same correlation algorithm on the host “Pompeii”,
which generated far less web traffic over the period, are presented in Figure 32. In
this case the clickstream correlation algorithm produces no false positives.
38 We note here that a colour graph would be more illuminating, as the timeline values on the
graph dominate. The apparent line at y-axis 0 is actually the individual timeline samples (which
are graphed as dots) merging to form a solid line. Three clusters are visible about the 11/04 x-
axis coordinate, and a further cluster is visible just after 07/04 on the 0 y-axis.
7.3.5 Results
In practice the rate of false positives increased when comparing intra-hit times
below one second. We expect that this is caused by measurement error becoming more
pronounced as the intra-hit time becomes smaller. Values of around 20 minutes for the
maximum intra-hit time and values of over 1 second for the minimum produced
clickstreams with the best uniqueness properties (as measured by a reduction in the rate
of false positives).
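The clickstream formation step can be pictured with the following sketch, which is one plausible reading of the approach rather than the implementation used here: time-ordered records for a host are grouped into a clickstream only while the gap between successive hits stays within the minimum and maximum intra-hit times discussed above (1 second and 20 minutes).

```python
# Sketch (one plausible reading of clickstream formation): group time-ordered
# records into clickstreams while the gap between successive hits lies between
# the minimum and maximum intra-hit times.
MIN_INTRA_HIT = 1.0          # seconds
MAX_INTRA_HIT = 20 * 60.0    # seconds

def form_clickstreams(records):
    """records: list of (timestamp, url) tuples, assumed sorted by timestamp."""
    streams, current = [], []
    for ts, url in records:
        if current:
            gap = (ts - current[-1][0]).total_seconds()
            if not (MIN_INTRA_HIT <= gap <= MAX_INTRA_HIT):
                if len(current) > 1:
                    streams.append(current)     # close off the current clickstream
                current = []
        current.append((ts, url))
    if len(current) > 1:
        streams.append(current)
    return streams
```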
Modifying the algorithm to filter clickstream acceptance based on clickstream
length produced a similar effect on the false positive rate. With larger required
clickstream sizes, however, the rate of true positives falls off quickly and the rate of
false negatives becomes high.
The algorithm performed far better on cache records than on history records.
We expect that this is caused by the difference in granularity of record keeping in the
sources. As the cache stores cache records both for top level web pages and component
content such as images, style sheets and the like, clickstreams are more likely to be
formed. The IE history subsystem only records the top level page views, so is less
likely to produce long clickstreams in situations where users do not heavily explore
websites.
Designing an algorithm which eliminates these false positives is complicated
by the fact that the last accessed timestamp of any particular cache record is unreliable,
as the resource may have been accessed more recently by the user (before the cached
content expired). In this case, no corresponding Squid event would be logged even
though the cache record timestamp is updated, thus introducing a skew to the expected
offset of the matching Squid event.
For this reason, we set about finding a means of identifying IE records which
must have been requested via the Squid proxy and not served from the local cache.
7.3.7 Results
In practice this algorithm produces a set of data which correlates well with the
timescales produced by our previous experiment. For example, Figure 33 is a graph of
the output of the algorithm described above overlaid over the timescale for host Milan
obtained from the previous experiment.
Of 1188 unique history records, 821 history-squid tuples were identified. One
would expect the number of history-squid tuples to be higher; however, URLs with
encoded GET requests are not matched because of Squid’s anonymised logging of this
kind of URL.
In practice there is a sufficiently high proportion of non-cached hits in the
history for our algorithm to work effectively. The algorithm identifies 304 potential
non-cached matches, and from these a base set of 110 matches. In total the algorithm
generates 134 data points (see Figure 33).
7.4 Discussion
In this section the two algorithms are compared, and the general problems
related to correlating these types of event logs are outlined.
Of the two algorithms, the history correlation algorithm performed better. It
generates results which cover a far wider period of time than the cache oriented
algorithm, giving greater insight into the temporal behaviour of the computer.
Furthermore, the ratio of true to false positives is far higher.
The history algorithm was originally the worse performing of the two
approaches. Determining whether the high rate of false positives was caused by a tool
implementation error or an error in the correlation algorithm was problematic. Boyd’s
paper [20] was at that point essential in identifying that our interpretation of the weekly
history timestamps was mistaken.
Despite having re-implemented a new set of index.dat file parsers (and
discovered a third timestamp in the history records 39), we had still used the semantics
defined by the Pasco tool. Our model was corrected to treat the first timestamp in the
weekly sort history record as the accessed time, offset by the local time zone offset in
operation. This resulted in the high rate of true positives and low rate of false positives
seen previously in Figure 33.
Both approaches to developing a correlation algorithm outlined above make a
closed world assumption – that the algorithm has access to all of the information that it
needs. In practice, development of the algorithm was complicated by this not being the
case. Consider for example Figure 34, which was generated using the same history
correlation algorithm as that seen in Figure 33. The input to the algorithm was,
however, a dataset which omitted a particular Squid access log.
Strong correlation with the computer’s timescale is evident; however, there are
in this case false positives at the extremes of the graph. Examination of the false
positives indicated that they were related to records from the particular Squid access
log which had been omitted from the correlation run for processing speed reasons. The
omission of these records resulted in the algorithm picking a match from another Squid
log file, resulting in a far greater offset. Adding the excluded log produces the results
seen previously in Figure 33.
39 A 32 bit MS-DOS timestamp was identified at offset 0x50 within the history record. Within
the root history file, this timestamp is interpreted as the last accessed time, as is apparent by
comparison with the last access time shown in the Internet Explorer history viewer. In practice,
the value is always a small amount after the 64 bit FileTime based last accessed time.
We compare our approach here to the two closest approaches identified in the
literature, which are summarised in Section 3.3.4.
The approach of Gladyshev and Patel [47] differs from ours in that we deal
predominantly with events which do have a timestamp, but where there is uncertainty
as to the real world time to which it corresponds. The approach taken by Gladyshev and
Patel instead tries to find the temporal bounds of an event which may or may not have a
timestamp associated with it.
Our work has similar objectives but differs significantly from Weil [141] in
two respects. Firstly, we investigate to what degree timescales are unstable. Secondly,
Weil’s approach relies on manual classification of cached web pages as dynamically or
statically generated. This is because the technique relies specifically on dynamic
content in order for the embedded timestamps to be interpreted. In addition, we
present two algorithms which enable the automatic determination of the behaviour of a
suspect computer’s clock by comparison with a commonly logged corroborative
source.
7.5 Conclusions
This chapter has investigated a key problem which lies at the foundations of
evidence representation: how to assure the reliability of timestamps found in digital
evidence. The contributions of this chapter are aligned with the theme of assurance, and
tangentially, representation.
The first contribution is an analysis of the temporal behaviour of PC clocks as
generally implemented in the Windows operating system, and empirical results
demonstrating the unreliability of timestamps sourced from Windows based computers.
This was presented in Section 7.2.
The second contribution, presented in Section 7.3, demonstrated the feasibility
of automatically characterising the temporal behaviour of a computer by correlating
timestamps embedded in digital evidence with corroborative time sources. Two
algorithms were proposed and evaluated, and experimental results were presented
which demonstrate that the latter algorithm produces outputs which correlate
reasonably with the timescales of the subject computers. We have additionally
described how the history correlation algorithm could be modified to produce a higher
rate of true positives.
There are a number of areas where future work is warranted. First, in order that
results based on this kind of correlation may be more clearly interpreted and explained
in forums such as courts of law, a means of qualifying and quantifying the error
involved would be of use. Second, in order that the resolution of the characterised
timescales may increase, improved algorithms which incorporate uncertainty in record
matching should be investigated. Finally, the Internet Explorer index.dat file format is
still not fully understood. We expect that a clearer understanding of the file format
would lead to a reduction in errors.
Chapter 8. Conclusions and future work
“(I am) acutely aware of the difficulties created by saying that when
Aristotle and Galileo looked at swinging stones, the first saw
constrained fall, the second a pendulum. Nevertheless, I am
convinced that we must learn to make sense of sentences that at least
resemble these.”
(The Structure of Scientific Revolutions, Thomas Kuhn)
Chapter 3 concluded that the field of digital forensics might benefit from the
application of formal knowledge representation to digital evidence and digital
investigations. Chapter 4 investigated the history of formal representation, in the
context of Knowledge Representation and semantic markup languages, introduced the
RDF/OWL formalism, and proposed that this formalism would be of benefit in
addressing the complexity and volume challenges in forensic event correlation.
Chapter 5 investigated using this formalism in the context of event correlation
for forensic purposes. The primary outcome of this chapter was to show that the
RDF/OWL formalism is useful as a general representation and is expressive enough to
represent and integrate digital evidence sourced from disparate arbitrary event oriented
sources, composite and abstract events corresponding to higher level situations, and
entities referred to in those events. This was demonstrated by building tools which
translated heterogeneous event logs into the formalism.
The second outcome was to show that the formalism is useful for building tools
which analyse such information. This was demonstrated by building automated
correlation tools which automatically identified forensically interesting scenarios from
event log based evidence using heuristic rules. It was additionally demonstrated
by the ease with which investigator hypotheses regarding entity identity could be used
to solve the problem of surrogate proliferation, reducing the volume of entities under
consideration.
A final outcome of this chapter was the identification of areas where the
formalism is insufficiently expressive.
Chapter 6 showed that formal representation is useful in documenting digital
investigations and sharing digital evidence. This was demonstrated by the proposal of
an improved approach to digital evidence containers which enables more scalable
processing of evidence, extensible integration of arbitrary information, and simplified
evidence authentication.
architecture is interoperable were presented, this work has only covered a small portion
of possible tool integration scenarios.
One implication of the results is the impact that an ontology and document
oriented approach to digital evidence might have upon firming up the terminology used
in the field. We are not the first to argue for terminological precision in forensics; a
number of parties have observed that the terminology in the field is used in differing
ways. Some have proposed the use of ontologies as a useful tool for discussing and
defining the field from a theoretical standpoint, from the top down. The practical
employment of ontologies in approaches such as those described in this dissertation
has the potential to shape the terminology of the field from the bottom up, with human
readable results expressed using a semantically grounded vocabulary, passively shaping
the investigator’s conception of digital evidence and of the information interpreted and
derived from it.
The implications of the results showing the unreliability of a Windows based
time synchronisation infrastructure are clear. In cases where establishing the precise
time at which a computer event occurred is important, one cannot assume that
computers running MS Windows 2000 or XP have behaved in uniform ways with
respect to keeping time. Where precision is not so necessary, it would be expected that
corroborating sources of timestamped evidence might be useful in characterising the
behaviour of computer clocks, and thus enabling one to challenge the acceptability of
blanket assumptions made about clock behaviour. Where one expects to depend on the
correctness of timestamps, other, more reliable, measures must be taken towards
assuring synchronised computer clocks. Areas such as real time stock trading and
banking would be potential areas where this kind of forensic preparedness could be
warranted.
While the work described in Chapters 5 and 6 validates that the representation
is expressive and extensible and that the SDEB information architecture is
interoperable, it has only covered a small portion of possible tool integration
scenarios.
Future work is required to ascertain how to best integrate evidence and
information with conflicting ontological commitments, the impact of part/whole
knowledge related to a case, and as an information format compatible with rule based
reasoning. Another form of reasoning which is possible with description logics is
categorization, which is performed by description logic reasoners (in this work we did
not employ these, as detailed in Section 5.2.1). From data asserted in RDF and an OWL
ontology, a description logic reasoner can classify instances of information as
belonging to a class or category, based on the relationships between the individual and
other classes or instances. Future work is necessary to determine the extent to which
this kind of reasoning could automatically identify situations or information of interest.
The ontologies used in the course of this research have been developed in an ad
hoc manner and built only for the purpose at hand. There has been no attempt to create
a comprehensive digital forensics ontology. Such an ontology would be of worth both
in building consensus on the meaning of the digital forensics related vocabulary,
highlighting areas where language is used in inexact or confusing ways, and as machine
readable semantics for tool interoperability. Building such an ontology is, however,
complicated by established linguistic conventions (e.g. UK usage of the term “computer
based electronic evidence” vs. US usage of “digital evidence”), the context dependent
nature of terminology, and the difficulty of limiting the scope of the ontology 40 .
Future work applying automated ontology construction methods (i.e. “ontology
learning” [72]) could potentially produce a digital forensics ontology with low human
time, energy and consensus costs, and at the same time identify areas of the digital
forensics vocabulary which are used in divergent ways.
40 For a good survey of approaches to building ontologies, see [102].
negatives, which may be acceptable in the IDS context however not in the forensics
one.
Temporal correlation methods such as those we have proposed in Chapter 7
imply models of time more complex than can be described using terms such as offset
and drift; our research visually depicted the relationship between our reference timeline
(UTC) and the subject computer. The initial results characterising the temporal
behaviour of a particular clock showed that this relationship might be described by
successive time offsets and drift rates. The problem is that in practice there are no
events stored on the computer which allow one to see changes in rate of the passage of
time at this micro level granularity.
The presence of false positives in the results generated by the correlation
method precludes its direct use as a means of interpreting unreliable timestamps to
corresponding times on a reference timescale. Upstream tools seeking to work with
events with unreliable timestamps may use results generated by this correlation
method; however, the raw results would need to be manually interpreted into a set of
assumptions about the passage of time relative to the reference timescale.
Assuming that the false positives problem might be solved by a more thorough
reverse engineering of the IE cache and history file format, automated upstream tool
use of the correlation results is still complicated by the granularity of the correlation
results, and the likely limited period which the results would cover. For this reason,
extending the concrete results of correlation requires production of a set of assumptions
about the passage of time in between the samples. These assumptions, or temporal
theories, about the passage of time could be used by a correlation tool to ascribe a
theoretical real world time to an event.
The results regarding temporal provenance indicate that event correlation
processes would benefit from richer models of temporal progress including timescale
deviation, event time uncertainty, and orthogonal to this, assumptions about these.
What effect such notions might have upon event pattern languages is an open question.
It would appear likely that their effect on the algorithmic complexity of correlation
approaches would be highly adverse.
Future work is needed in characterising the behaviour of Windows PCs that are either
untethered from or loosely tethered to reliable time sources, and also on the behaviour
of UNIX and RTOS variants.
This work addresses means for analyzing event log based evidence, utilizing
RDF/OWL for representing entities and events, and rules for expressing correlation
relationships. In this context correlation refers to an abstraction relationship between a
set of events and a higher level event or situation. This correlation relationship is in
turn dependent upon a variety of relationships between the lower level events,
including temporal constraints and constraints over property relationships with entities
involved in the events and the wider environment.
The problem of event correlation and event pattern languages in particular lies
in how to describe these events, relationships and constraints. This work relied on
OWL for modeling events and relationships; however its expressiveness is insufficient
for describing temporal constraints. This led to the employment of a rule language for
declaring these. Some work has been performed on extending description logics to
incorporate temporal descriptions; however the work is preliminary.
Future investigations of event pattern languages would benefit from working
with abstract notions of time such as before, after, coincident, and during, rather than
reasoning with time as a single discrete numerical value. How a language incorporating
these notions would interact with temporal models such as those mentioned in Section
7.3 similarly requires further investigation.
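As a small illustration of what such abstract notions might look like when reduced to code (a sketch only, over simple numeric (start, end) intervals rather than the uncertain timescales discussed above):

```python
# Sketch: abstract temporal relations over (start, end) intervals.
def before(a, b):
    return a[1] < b[0]                       # a ends before b starts

def after(a, b):
    return before(b, a)

def during(a, b):
    return b[0] <= a[0] and a[1] <= b[1]     # a falls wholly within b

def coincident(a, b, tolerance=0.0):
    return abs(a[0] - b[0]) <= tolerance and abs(a[1] - b[1]) <= tolerance
```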
Chapter 9. Bibliography
[3] Abbott, J., J. Bell, A. Clark, O. de Vel, and G. Mohay. Computer
forensics (CF): Automated recognition of event scenarios for digital
forensics. in 2006 ACM Symposium on Applied Computing. 2006. Dijon,
France: ACM Press.
[6] ACPO. Good Practice Guide for Computer based Electronic Evidence.
2006 [Viewed 19 Oct 2006]; Available from:
http://www.acpo.police.uk/asp/policies/Data/gpg_computer_based_evidence_v3.pdf.
[7] Alink, W., R.A.F. Bhoedjang, P.A. Boncz, and A.P. de Vries, XIRAF
– XML-based indexing and querying for digital forensics. Digital
Investigation (6th Digital Forensics Research Workshop), 2006.
3(Supplement 1): p. 89-107.
[8] Attfield, P., United States v Gorshkov detailed forensics and case study:
expert witness perspective, in 1st International Workshop on Systematic
Approaches to Digital Forensic Engineering. 2005: Taipei, Taiwan. p.
3-24.
[18] Bogen, A.C. and D.A. Dampier. Unifying computer forensics modeling
approaches: a software engineering perspective. in 1st International
Workshop on Systematic Approaches to Digital Forensic Engineering.
2005.
[19] Borgida, A., R.J. Brachman, D.L. McGuinness, and L.A. Resnick,
CLASSIC: A Structural Data Model for Objects, in ACM SIGMOD
International Conference on Management of Data. 1989: Portland,
Oregon.
[20] Boyd, C. and P. Forster, Time and date issues in forensic computing – a
case study. Digital Investigation, 2004: p. 18-23.
[21] Brill, A.E., M. Pollitt, and C.M. Whitcomb, The Evolution of Computer
Forensic Best Practices: An Update on Programs and Publications.
Journal of Digital Forensic Practice, 2006. 1(1): p. 2-11.
[23] Carrier, B. Open Source Digital Forensics Tools: The Legal Argument.
@stake Research Report 2002 [Viewed Dec 2006]; Available from:
http://www.digital-evidence.org/papers/opensrc_legal.pdf.
[25] Carrier, B. The sleuth kit & autopsy: Forensics tools for linux and other
unixes. 2006 [Viewed 29 Nov 2006]; Available from:
http://www.sleuthkit.org/.
[27] Casey, E., Digital evidence and computer crime. 2000, San Diego, Calif:
Academic Press.
[28] Casey, E., State of the field: growth, growth, growth. Digital
Investigation, 2004. 1(4): p. 241-309.
[29] Casey, E., Digital arms race – The need for speed. Digital Investigation,
2005. 2(4): p. 229-280.
[31] CDESF. Survey of Disk Image Storage Formats. 2006 [Viewed Dec
2006]; Available from: http://www.dfrws.org/CDESF/survey-dfrws-
cdesf-diskimg-01.pdf.
[33] Collier, P.A. and B.J. Spaul, A Forensic Methodology for Countering
Computer Crime. Journal of Forensic Science, 1992. 32(1).
[40] Fikes, R., J. Jenkins, and G. Frank, JTP: A System Architecture and
Component Library for Hybrid Reasoning, in Proceedings of the
Seventh World Multiconference on Systemics, Cybernetics, and
Informatics. 2003: Orlando, Florida.
[42] Forgy, C., Rete: A Fast Algorithm for the Many Patterns/Many Objects
Match Problem. Artificial Intelligence, 1982. 19(1): p. 17-37.
[43] Friedman-Hill, E. JESS: The Rule Engine for the Java™ Platform.
2003 [Viewed Nov 2003]; Available from:
http://herzberg.ca.sandia.gov/jess/.
[45] Garfinkel, S.L., D.J. Malan, K.-A. Dubec, C.C. Stevens, and C. Pham,
Disk Imaging with the Advanced Forensics Format, Library and Tools.
Advances in Digital Forensics (2nd Annual IFIP WG 11.9 International
Conference on Digital Forensics), 2006.
[49] Gray, J. and D. Patterson, A conversation with Jim Gray. ACM Queue,
2003. 1(4).
[51] Gruber, T.R., Toward principles for the design of ontologies used for
knowledge sharing? International Journal of Human Computer Studies,
1995. 43(5-6): p. 907-928.
[52] Guha, R.V. and T. Bray. Meta Content Framework Using XML. 1997
[Viewed 20 Dec 2006]; Available from:
http://www.w3.org/TR/NOTE-MCF-XML/.
[55] Horrocks, I. The FaCT system. 1999 [Viewed Nov 2003]; Available
from: http://www.cs.man.ac.uk/~horrocks/FaCT/.
[56] Horrocks, I., P.F. Patel-Schneider, and F. van Harmelen, From SHIQ
and RDF to OWL: The making of a web ontology language. Journal of
Web Semantics, 2003. 1(1): p. 7-26.
[58] ISO, ISO 8879:1986 Information processing — Text and office systems
— Standard Generalized Markup Language (SGML). 1986.
[60] Kenneally, E.E., Gatekeeping Out Of The Box: Open Source Software As
A Mechanism To Assess Reliability For Digital Evidence. Virginia
Journal of Law and Technology, 2001. 6(3).
[61] Kifer, M., G. Lausen, and J. Wu, Logical Foundations for Object-
Oriented and Frame-Based Languages. Journal of the Association of
Computing Machinery, 1995. 42(3): p. 741-843.
[65] Kornblum, J., Identifying almost identical files using context triggered
piecewise hashing. Digital Investigation (6th Digital Forensics Research
Workshop), 2006. 3(Supplement 1): p. 91-97.
[69] Lindqvist, U. and P.A. Porras. Detecting computer and network misuse
through the production-based expert system toolset (P-BEST). in IEEE
Symposium on Security and Privacy. 1999. Berkeley, California.
[71] Luckham, D., The Power of Events. 2002, Indianapolis, Indiana: Pearson
Education.
[72] Maedche, A. and S. Staab, Ontology learning for the Semantic Web.
IEEE Intelligent Systems, 2001. 16(2): p. 72 - 79
[73] McBride, B., Jena: a semantic web toolkit. IEEE Internet Computing,
2002. 6(6): p. 55-59.
[79] Microsoft. How Windows Keeps Track of the Date and Time. 2006
[Viewed April 2006]; Available from:
http://support.microsoft.com/?kbid=232488.
[81] Microsoft. The system clock may run fast when you use the ACPI power
management timer as a high-resolution counter on Windows 2000-
based, Windows XP-based, and Windows Server 2003-based computers.
2006 [Viewed April 2006]; Available from:
http://support.microsoft.com/?kbid=821893.
[86] Moats, R. URN Syntax. 1997 [Viewed 6 Jan 2006]; Available from:
http://www.ietf.org/rfc/rfc2141.txt.
[90] NCI. The National Cancer Institute Thesaurus in OWL. 2003 [Viewed
Jan 2007]; Available from:
http://www.mindswap.org/2003/CancerOntology/.
[91] Neches, R., R. Fikes, T.W. Finin, T.R. Gruber, R. Patil, T.E. Senator,
and W.R. Swartout, Enabling Technology for Knowledge Sharing. AI
Magazine, 1991. 12(3): p. 36-56.
[95] Ning, P., Y. Cui, and D. Reeves, Constructing attack scenarios through
correlation of intrusion alerts, in 9th ACM conference on Computer and
Communications Security. 2002: Washington, DC.
[97] Noy, N.F. and D.L. McGuinness. Ontology Development 101: A Guide
to Creating Your First Ontology. 2001 [Viewed 2004]; Available from:
http://www.ksl.stanford.edu/people/dlm/papers/ontology101/ontology10
1-noy-mcguinness.html.
[98] Palmer, G. (ed), A Road Map for Digital Forensic Research, in First
Digital Forensic Research Workshop, G. Palmer, Editor. 2001: Utica,
New York.
[99] Pan, F. and J.R. Hobbs, Time in OWL-S, in 2004 AAAI Spring
Symposium Series - Semantic Web Services. 2004: Stanford University.
[101] Perrochon, L., E. Jang, S. Kasriel, and D.C. Luckham, Enlisting Event
Patterns for Cyber Battlefield Awareness, in DARPA Information
Survivability Conference & Exposition. 2000: Hilton Head, South
Carolina.
[102] Pinto, H.S. and J.P. Martins, Ontologies: How can They be Built?
Knowledge and Information Systems, 2004. 6(4).
[103] Pollack, J. U.S. v Plaza, Acosta (Cr. No. 98-362-10, 11, 12). 7 Jan 2002.
Available from:
http://www.paed.uscourts.gov/documents/opinions/02d0046p.pdf.
[107] Redgrave, L.M., A.S. Prasad, J.B. Fliegel, T.S. Hiser, and J.H. Jessen,
The Sedona Principles: Best Practices Recommendations & Principles
for Addressing Electronic Document Production in The Sedona
Conference Working Group Series. 2004, The Sedona Conference.
[108] Reed, S.L. and D.B. Lenat, Mapping Ontologies into Cyc, in AAAI
workshop on Ontologies and the Semantic Web. 2002: Edmonton,
Canada.
[111] Richard III, G.G. and V. Roussev, Scalpel: A Frugal, High Performance
File Carver, in Digital Forensics Research Workshop. 2005: New
Orleans, LA.
[114] Roussev, V. and G.G. Richard III, Breaking the Performance Wall: The Case
for Distributed Digital Forensics, in 5th Digital Forensics Workshop.
2005: New Orleans, LA.
[123] Spafford, E.H. and S.A. Weeber, Software forensics: can we track code
to its authors? Computers and Security, 1993. 12(6): p. 585-595.
[128] Stevens, M.W., Unification of relative time frames for digital forensics.
Digital Investigation, 2004. 1: p. 225-239.
[131] SWGDE. SWGDE and SWGIT Glossary of Terms. 2005 [Viewed 25 Aug 2007];
Available from:
http://68.156.151.124/documents/swgde2005/SWGDE%20and%20SWGIT%20Combined%20Master%20Glossary%20of%20Terms%20-July%2020..pdf.
[133] TGO. The Gene Ontology. 2006 [Viewed Jan 2007]; Available from:
http://www.geneontology.org/.
[137] van den Bos, J. and R. van der Knijff, TULP2G–An Open Source
Forensic Software Framework for Acquiring and Decoding Data Stored
in Electronic Devices. International Journal of Digital Evidence, 2005.
4(2).
[140] W3C. RDF Vocabulary Description Language 1.0: RDF Schema. 2004
[Viewed 21 Dec 2006]; Available from: http://www.w3.org/TR/rdf-
schema/.
[141] Weil, C., Dynamic Time & Date Stamp Analysis. International Journal of
Digital Evidence, 2002. 1(2).
[143] Yemini, S.A. and S. Kliger, High Speed and Robust Event Correlation.
IEEE Communications, 1996: p. 433-450.