
SDSC Digital Preservation Project with NARA

Reagan W. Moore
San Diego Supercomputer Center
moore@sdsc.edu
http://www.npaci.edu/DICE/

National Partnership for Advanced Computational Infrastructure

San Diego Supercomputer Center

Data and Knowledge Systems Group


Staff
Reagan Moore, Ilkai Altintas, Chaitan Baru, Sheau Yen Chen, Charles Cowart, Amarnath Gupta, George Kremenek, M. Kulrul, Bertram Ludäscher, Richard Marciano, A. Memon, XuFei Qian, Roman Olshanowsky, Arcot Rajasekar, Abe Singer, Michael Wan, Ilya Zaslavsky, Bing Zhu

Graduate Students
A. Bagchi, S. Bansal, A. Behere, R. Bharath, S. Bharath, L. Sui, N. Cotofana, D. Le, J. Trang, L. Yin +/- NN

Undergraduate Interns


Topics
Digital preservation approach
Levels of abstraction
Application to NARA collections


Persistent Archive Approach


Preservation of authentic documents
Create archivable form for digital entity
Define context by assembling a collection
Create archivable form for collection
Manage persistent archive
Support self-instantiating archive
Support discovery and presentation

ERA Concept Model: Archival Components

[Figure: ERA archival components concept, built on the Storage Resource Broker / Extensible Meta-data CATalog and the Grid Security Infrastructure. Labeled components: Accessioning Workbench, Reference Workbench, Archival Repository (collection disks and tapes), Records Schedules, Archival Research Catalog, Order Fulfillment System, Internet. Labeled operations: Accession, Describe, Wrap & Containerize, Verify, Rebuild, Query, Present, Mediation of Information using XML.]

Fundamental Challenge Technology Evolution


Data is a sequence of bits
Presentation applications are needed to display a digital entity, based upon a data model
Applications issue I/O calls to operating systems
Operating systems send commands to storage and display systems

Presentation of Digital Objects


Application

Operating System

Storage System

Display System

Digital Object

Technology Management - Emulation


Old Application
Wrap Application

New Operating System

New Storage System

New Display System

Digital Object

Technology Management - SDSC


New Application

New Operating System


Wrap Storage System

Wrap Display System

Old Storage System


Migrate Encoding Format

Old Display System

Digital Object

Specifying Levels of Abstraction


Technology management becomes simpler if the persistent archive infrastructure operates on abstractions, rather than an explicit physical implementation of a resource
Need abstractions for
Digital objects
Repositories

Managing Distributed Storage


Separate the organization of digital objects from their physical storage
Logical Name Space to manage attributes about the digital objects
Data handling system to manage interactions with remote storage systems

Create storage abstraction layer
Storage Resource Broker (SRB) provides data management system
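The separation of logical organization from physical storage can be sketched in a few lines. This is an illustrative toy, not the SRB interface: the class names `InMemoryStore` and `LogicalNameSpace`, the method names, and the sample paths are all hypothetical. It shows only that a reader using the logical name is unaffected when the object moves to a new storage system.

```python
from dataclasses import dataclass, field


class InMemoryStore:
    """Hypothetical stand-in for one storage system (archive, file system, database)."""

    def __init__(self, name):
        self.name = name
        self._blobs = {}

    def put(self, physical_path, data):
        self._blobs[physical_path] = data

    def get(self, physical_path):
        return self._blobs[physical_path]


@dataclass
class Entry:
    attributes: dict = field(default_factory=dict)
    replicas: list = field(default_factory=list)  # (store name, physical path) pairs


class LogicalNameSpace:
    """Maps logical names to attributes and to physical replicas."""

    def __init__(self, stores):
        self._stores = {store.name: store for store in stores}
        self._entries = {}

    def register(self, logical_name, data, store_name, physical_path, **attributes):
        self._stores[store_name].put(physical_path, data)
        entry = self._entries.setdefault(logical_name, Entry())
        entry.attributes.update(attributes)
        entry.replicas.append((store_name, physical_path))

    def locate(self, logical_name):
        return list(self._entries[logical_name].replicas)

    def read(self, logical_name):
        # Read from the first registered replica.
        store_name, physical_path = self._entries[logical_name].replicas[0]
        return self._stores[store_name].get(physical_path)

    def migrate(self, logical_name, new_store, new_path):
        # Copy the bits to the new system; the logical name does not change.
        data = self.read(logical_name)
        self._stores[new_store].put(new_path, data)
        self._entries[logical_name].replicas = [(new_store, new_path)]


old_disk = InMemoryStore("old-disk")
new_archive = InMemoryStore("new-archive")
names = LogicalNameSpace([old_disk, new_archive])

names.register("/nara/email/msg-0001", b"hello", "old-disk", "/vol1/a17", owner="moore")
before = names.read("/nara/email/msg-0001")
names.migrate("/nara/email/msg-0001", "new-archive", "tape7/0003")
after = names.read("/nara/email/msg-0001")
```

After the migration, `names.locate("/nara/email/msg-0001")` points at the new storage system while `before` and `after` hold the same bytes.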

Information Management - Logical Name Space


Set of attributes to describe digital entities that are registered into the logical name space
SRB metadata - Unix file system semantics
Provenance metadata - Dublin Core
Resource metadata - User access control lists
Discipline metadata - User defined attributes

Each digital entity may have unique attributes



SDSC Storage Resource Broker & Meta-data Catalog Levels of Abstraction


Application
C, C++ Libraries; Linux I/O; Unix Shell; Java, NT Browsers; DLL / Python; Prolog Predicate; Web; WSDL Clients

Consistency Management / Authorization-Authentication

Logical Name Space
Latency Management
Data Transport
Metadata Transport

Catalog Abstraction
Databases: DB2, Oracle, Sybase

Storage Abstraction
Archives: HPSS, ADSM, HRM, UniTree, DMF
File Systems: Unix, NT, Mac OSX
Databases: DB2, Oracle, Postgres

Servers
Prime Server

Types of Digital Entity Abstractions


Logical representation
What does the digital entity represent? What is the associated meaning?

Physical representation
What is the physical structure of the digital entity?


Levels of Abstraction for Bits


Abstraction for Digital Entity
Logical: I-nodes
Physical: Track / Sector
Digital Entity: Bit Stream

Abstraction for Repository
Logical: File Name
Physical: File System (NFS/AFS/NTFS)
Repository: Disk

Levels of Abstraction for Data


Abstraction for Digital Entity
Logical: Data Model (units, semantics)
Physical: Encoding Format (syntax, structure)
Digital Entity: Files

Abstraction for Repository
Logical: Name Space
Physical: Data Handling System - SRB/MCAT
Repository: File System, Archive

Information Management
Abstraction layer for interacting with information repositories
Manage the schema and physical table structures of a database
Extensible schema
User defined attributes

Extensible Metadata CATalog (EMCAT) manages collections
mySRB.html interface supports dynamic collection creation

Levels of Abstraction for Information


Abstraction for Digital Entity
Logical: Collection Schema
Physical: XML Syntax
Digital Entity: Metadata Attributes

Abstraction for Repository
Logical: Database Schema
Physical: EMCAT/CWM
Repository: Database

Knowledge Management - Characterizing Properties of Collections


Characterization of relationships between attributes
Semantic / logical - cross-walks
Procedural / temporal - records management
Structural / spatial - GIS

Characterization of knowledge repository operations
Mapping from collection attributes to discipline concepts
Mapping from knowledge relationships to rules for application in inference engines

Levels of Abstraction for Knowledge


Abstraction for Digital Entity
Logical: Relationship Schema
Physical: ER/UML/XMI/RDF syntax
Digital Entity: Concept Space (ontology instance)

Abstraction for Repository
Logical: Knowledge Repository Schema
Physical: Model-based Mediation System
Repository: Knowledge Repository

Persistent Archives
Storage system abstraction
Logical name space and data manipulations

Information repository abstraction


Logical schema and physical table structure

Knowledge repository abstraction


Topic maps and inference rules

Digital object abstraction


Data model and encoding format

NARA Prototype
Demonstrate ability to ingest, archive, recreate, query, and present a digital object from a 1 million record E-mail collection (RFC1036)
2.5 GB of data
6 required fields
13 optional fields
User defined fields (over 1000)

Determine resources required to scale size of collection



XML DTD for E-mail

<!ELEMENT rfc1036_mesg (headers, body)>
<!ELEMENT headers (required_headers, optional_headers, other_headers)>
<!ELEMENT body (#PCDATA)>

<!ELEMENT required_headers (From, Date, Newsgroups, Subject, Message-ID, Path)>
<!ELEMENT optional_headers (Followup-To?, Expires?, Reply-To?, Sender?, References?, Control?, Distribution?, Keywords?, Summary?, Approved?, Lines?, Xref?, Organization?)>
<!ELEMENT other_headers (other+)>

<!-- 6 required header keywords -->
<!ELEMENT From (#PCDATA)>
<!ELEMENT Date (#PCDATA)>
<!ELEMENT Newsgroups (#PCDATA)>
<!ELEMENT Subject (#PCDATA)>
<!ELEMENT Message-ID (#PCDATA)>
<!ELEMENT Path (#PCDATA)>

<!ATTLIST From seqno CDATA #REQUIRED>
<!ATTLIST Date seqno CDATA #REQUIRED>
<!ATTLIST Newsgroups seqno CDATA #REQUIRED>
<!ATTLIST Subject seqno CDATA #REQUIRED>
<!ATTLIST Message-ID seqno CDATA #REQUIRED>
<!ATTLIST Path seqno CDATA #REQUIRED>
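A minimal sketch of how one message might be tagged into the element structure of the DTD, using only the Python standard library. The function name `message_to_xml`, the sample message, and the `name` attribute on `other` elements are assumptions for illustration; the DTD excerpt does not declare `other`, and the sketch does not validate its output against the DTD.

```python
import email
import xml.etree.ElementTree as ET

# Header names follow RFC 1036: 6 required, 13 optional.
REQUIRED = ["From", "Date", "Newsgroups", "Subject", "Message-ID", "Path"]
OPTIONAL = ["Followup-To", "Expires", "Reply-To", "Sender", "References",
            "Control", "Distribution", "Keywords", "Summary", "Approved",
            "Lines", "Xref", "Organization"]


def message_to_xml(raw_text, seqno):
    """Tag one message as an rfc1036_mesg element (hypothetical helper)."""
    msg = email.message_from_string(raw_text)
    root = ET.Element("rfc1036_mesg")
    headers = ET.SubElement(root, "headers")
    required = ET.SubElement(headers, "required_headers")
    optional = ET.SubElement(headers, "optional_headers")
    other = ET.SubElement(headers, "other_headers")

    for name in REQUIRED:
        value = msg[name]
        if value is None:
            raise ValueError(f"missing required header: {name}")
        ET.SubElement(required, name, seqno=str(seqno)).text = value

    for name in OPTIONAL:
        if msg[name] is not None:
            ET.SubElement(optional, name).text = msg[name]

    # Everything else is a user defined field; the "name" attribute is assumed.
    known = {name.lower() for name in REQUIRED + OPTIONAL}
    for name, value in msg.items():
        if name.lower() not in known:
            ET.SubElement(other, "other", name=name).text = value

    ET.SubElement(root, "body").text = msg.get_payload()
    return root


SAMPLE = (
    "From: user@example.org\n"
    "Date: Mon, 1 Jan 2001 00:00:00 GMT\n"
    "Newsgroups: misc.test\n"
    "Subject: test\n"
    "Message-ID: <1@example.org>\n"
    "Path: example.org!user\n"
    "Organization: Example\n"
    "X-Custom: 42\n"
    "\n"
    "Hello, archive.\n"
)

doc = message_to_xml(SAMPLE, seqno=1)
xml_text = ET.tostring(doc, encoding="unicode")
```

Keeping the required, optional, and other headers in separate containers mirrors the `headers` content model, so a missing required field is detected at ingestion rather than at query time.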

Formatted Message Using XML DTD


Web-based Interface for Accessing the E-mail Collection


Automation of Ingestion Process


Application of an Accessioning Template
Defines the concepts, policies or acceptance of the collection

Creation of attributes that represent the accessioning template concepts
Analysis of attributes for anomalies and implied inherent knowledge

Information Generation Processes


Create occurrence index
(Occurrence, attribute, value)
This is needed to be able to recreate original form of digital object

Analyze completeness of information
Inverse index of attribute values
Identifies unexpected values - consistency

Analyze closure of collection
Are additional concepts needed to represent inverse index value ranges?
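The occurrence index and the inverse index can be illustrated with a small sketch. The sample records, the function names, and the allowed-value check are assumptions made for illustration; the slides do not specify how the indices were implemented.

```python
from collections import defaultdict

# Three toy records standing in for tagged messages.
records = [
    {"From": "a@example.org", "Newsgroups": "misc.test", "Subject": "hello"},
    {"From": "b@example.org", "Newsgroups": "misc.test", "Subject": "re: hello"},
    {"From": "a@example.org", "Newsgroups": "MISC.TEST", "Subject": "bye"},
]

# Occurrence index: one (occurrence, attribute, value) triple per tagged value.
occurrence_index = [
    (occurrence, attribute, value)
    for occurrence, record in enumerate(records)
    for attribute, value in record.items()
]


def rebuild(index):
    """Recreate the original records from the occurrence index alone."""
    rebuilt = defaultdict(dict)
    for occurrence, attribute, value in index:
        rebuilt[occurrence][attribute] = value
    return [rebuilt[i] for i in sorted(rebuilt)]


def inverse_index(index):
    """Map attribute -> value -> set of occurrences holding that value."""
    inverse = defaultdict(lambda: defaultdict(set))
    for occurrence, attribute, value in index:
        inverse[attribute][value].add(occurrence)
    return inverse


def unexpected_values(inverse, attribute, allowed):
    """Report values of one attribute that fall outside the expected set."""
    return {
        value: sorted(occurrences)
        for value, occurrences in inverse[attribute].items()
        if value not in allowed
    }


inverse = inverse_index(occurrence_index)
odd = unexpected_values(inverse, "Newsgroups", allowed={"misc.test"})
```

Here `rebuild(occurrence_index)` returns the original records, and `odd` singles out the record whose newsgroup name uses inconsistent capitalization.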

Ingestion Processes for Collection

Aggregation of original objects into containers for storage

Data Organization

Data Storage
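Aggregating objects into containers might look like the following sketch. The greedy packing policy, the function names, and the catalog layout (container number, offset, length) are assumptions chosen for brevity, not a description of the SRB container format.

```python
def pack_into_containers(objects, container_size):
    """Greedily pack (name, bytes) pairs into containers, in arrival order.

    An object larger than container_size gets a container of its own.
    """
    containers = []  # each container is a bytearray
    catalog = {}     # name -> (container number, offset, length)
    for name, data in objects:
        current_is_full = (
            bool(containers)
            and len(containers[-1]) > 0
            and len(containers[-1]) + len(data) > container_size
        )
        if not containers or current_is_full:
            containers.append(bytearray())
        catalog[name] = (len(containers) - 1, len(containers[-1]), len(data))
        containers[-1].extend(data)
    return containers, catalog


def extract(containers, catalog, name):
    """Recover one original object from its container."""
    number, offset, length = catalog[name]
    return bytes(containers[number][offset:offset + length])


objects = [("m1", b"aaaa"), ("m2", b"bbb"), ("m3", b"cccccc")]
containers, catalog = pack_into_containers(objects, container_size=8)
second = extract(containers, catalog, "m2")
```

The catalog is what lets a single object be retrieved later without unpacking the whole container.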

Ingestion Processes for Collection


Migration of objects into a standard representation
Information Generation
Attribute Selection
Attribute Tagging

Data Organization

Collection Storage

Ingestion Processes for Collection


[Figure: ingestion pipeline. Stage labels: Knowledge Generation, Information Generation, Data Organization, Collection Storage. Step labels: Accession Template, Closure, Concept/Attribute, Attribute Inverse Indexing, Attribute Selection, Attribute Tagging, Occurrence Tagging, View Management.]

Persistent Collection
Define context for archiving data - annotate information content
Create archivable form - standard encoding format
Archive information content along with data
Test closure of the collection - all digital objects that can be discovered in the collection are members of the collection
Test completeness of the collection - inherent relationships within the collection can be cast in terms of attributes generated from the annotated information
Differentiate between inherent knowledge and anomalies / artifacts
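One way to picture the closure test is to follow references between objects and check that every target is itself a member. The sketch below assumes that references are carried by Message-ID values (as in a References header); that choice, the function name, and the sample data are illustrative assumptions.

```python
def closure_gaps(collection):
    """Return, per object, the referenced identifiers that are not members.

    collection maps each Message-ID to the list of Message-IDs it references.
    An empty result means the collection is closed under these references.
    """
    members = set(collection)
    gaps = {}
    for message_id, references in collection.items():
        missing = [ref for ref in references if ref not in members]
        if missing:
            gaps[message_id] = missing
    return gaps


collection = {
    "<1@example.org>": [],
    "<2@example.org>": ["<1@example.org>"],
    "<3@example.org>": ["<2@example.org>", "<9@example.org>"],
}
gaps = closure_gaps(collection)
```

In this toy collection the third message points at an object that was never accessioned, so the collection is not closed.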

Self-Instantiating Archive
Archive the processes that are used to control the ingestion process
Conversion to archivable form
Annotation of information content

When accessing the collection, retrieve the processes and the original digital objects
Apply the processing steps to re-create the information content
Query the result to discover desired digital objects

A self-instantiating archive is a virtual data grid
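The idea of archiving the ingestion processes with the original objects, then re-running them on access, can be sketched as follows. Everything here is hypothetical: the step names, the `STEPS` registry, and the archive layout. The sketch stores only the names of the steps, so it assumes their implementations are still available at access time; a real archive would have to preserve the process descriptions themselves.

```python
import json


def to_lower(text):
    """Conversion step: normalize the encoding of the text."""
    return text.lower()


def tag_words(text):
    """Annotation step: tag each word as an attribute value."""
    return [{"attribute": "word", "value": word} for word in text.split()]


# Registry of processing steps that an archived process description can name.
STEPS = {"to_lower": to_lower, "tag_words": tag_words}


def ingest(raw_objects, step_names):
    """Archive the original objects together with the process description."""
    return {"process": json.dumps(step_names), "objects": list(raw_objects)}


def instantiate(archive):
    """Retrieve the process and re-apply it to the original objects."""
    step_names = json.loads(archive["process"])
    collection = []
    for obj in archive["objects"]:
        result = obj
        for name in step_names:
            result = STEPS[name](result)
        collection.append(result)
    return collection


def query(collection, value):
    """Return the positions of objects carrying the given attribute value."""
    return [
        position
        for position, attributes in enumerate(collection)
        if any(attribute["value"] == value for attribute in attributes)
    ]


archive = ingest(["Hello Archive", "Goodbye"], ["to_lower", "tag_words"])
collection = instantiate(archive)
hits = query(collection, "archive")
```

The information content is never stored in its derived form; it is regenerated from the originals each time the archive is instantiated.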



Differentiating between Data, Information, and Knowledge


Data
Digital object
Objects are streams of bits

Information
Any tagged data, which is treated as an attribute
Attributes may be tagged data within the digital object, or tagged data that is associated with the digital object

Knowledge
Relationships between attributes
Relationships can be procedural/temporal, structural/spatial, logical/semantic, functional

Types of Knowledge Relationships


Logical / semantic
Digital Library cross-walks

Temporal / procedural
Workflow systems

Spatial / structural
GIS systems

Functional / algorithmic
Scientific feature analysis
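A logical / semantic relationship of the cross-walk kind can be written down as a mapping between attribute names. The mapping below, from message header names to Dublin Core element names, is an illustrative assumption and not the cross-walk used in the project.

```python
# Hypothetical cross-walk from message headers to Dublin Core elements.
CROSSWALK = {
    "From": "creator",
    "Date": "date",
    "Subject": "title",
    "Keywords": "subject",
    "Message-ID": "identifier",
}


def apply_crosswalk(record, crosswalk):
    """Split a record into mapped target elements and unmapped attributes."""
    mapped, unmapped = {}, {}
    for attribute, value in record.items():
        target = crosswalk.get(attribute)
        if target is None:
            unmapped[attribute] = value
        else:
            mapped.setdefault(target, []).append(value)
    return mapped, unmapped


record = {"From": "a@example.org", "Subject": "hello", "Path": "example.org!a"}
mapped, unmapped = apply_crosswalk(record, CROSSWALK)
```

Attributes left in `unmapped` are the ones for which the cross-walk states no relationship, which is one place where additional concepts may be needed.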

Knowledge Based Data Grids


[Figure: three layers (Knowledge, Information, Data), each spanning Ingest Services, Management, and Access Services]

Knowledge
Ingest Services: Relationships Between Concepts (XTM DTD)
Management: Knowledge Repository for Rules (Rules - KQL)
Access Services: Knowledge or Topic-Based Query / Browse

(Model-based Access)

Information
Ingest Services: Attributes, Semantics (XML DTD)
Management: Information Repository (SDLIP)
Access Services: Attribute-based Query

(Data Handling System - SRB)

Data
Ingest Services: Fields, Containers, Folders (MCAT/HDF)
Management: Storage (Replicas, Persistent IDs) (Grids)
Access Services: Feature-based Query

Further Information
http://www.npaci.edu/DICE
