
International Journal of Digital Information and Wireless Communications (IJDIWC) 1(1): 161-174

The Society of Digital Information and Wireless Communications, 2011 (ISSN 2225-658X)

A Metadata Extraction Approach for Selecting Migration Solutions

Feng Luan and Mads Nygård
Department of Computer and Information Science, Norwegian University of Science and Technology
NO-7491, Trondheim, Norway
Email: {luan,mads}@idi.ntnu.no

ABSTRACT
Preservation has become an important infrastructural service for information systems, and much research has been done on it in the past decades. The most popular preservation approach is migration, which transfers and/or transforms digital objects between two computers or two generations of computer technology. However, it is difficult for custodians to decide which migration solution should be chosen, because the selection depends on both the old situation (e.g., digital objects, technical infrastructure and restriction rules) and the current situation (e.g., system requirements and organization requirements). In order to capture the old situation of an information system, we in this paper design a new solution that retrieves information about the old situation from stored metadata. The viability and efficiency of our approach are evaluated in an experiment in which several sets of image files are to be migrated.

KEYWORDS
Migration, Preservation Plan, Metadata Extraction, Long-Term Preservation, Digital Library

1 INTRODUCTION
As a variety of objects are born in digital form or are digitized, information systems are required to keep these digital objects usable for the long term. In 2006, the Norwegian Research Council established the research project LongRec to explore approaches for ensuring that digital objects can be read, retrieved, understood, and trusted. The project involves many partners from government departments and business companies, and our work is part of it. In a literature review on digital preservation, we found that many preservation approaches have been proposed and analyzed in [1-7], such as migration, emulation, the universal virtual computer, encapsulation and the computer museum. Amongst these approaches, migration is the most often used, and it is also deemed the most promising. In [8], migration is defined as a set of organized tasks designed to achieve the periodic transfer of digital materials from one hardware/software configuration to another, or from one generation of computer technology to a subsequent generation. Based on the preservation experience at the BBC, migration is classified in [9] into minimum preservation, minimum migration, preservation migration, recreation, human conversion migration and automatic conversion migration. The OAIS standard [1] defines four categories of migration, namely refreshment, replication, repackaging and transformation.


Besides these conceptual approaches, many migration implementations have appeared in the past decades, such as Migration on Request [10], PANIC [11], DAITSS [12], CRiB [13] and PLANETS [14]. No matter how a migration solution is implemented, custodians always need much information about the digital objects, the system requirements, the restriction rules and their organization requirements in order to select the best migration solution. Hence, we focus on the problem of how to obtain this information. Several tools have been designed for this purpose. They can scan a file system and extract the needed information from stored objects. However, these tools have some drawbacks: 1) they spend much time extracting information; 2) the extracted information may not be accurate; and 3) the custodians cannot get complete information. Therefore, in this paper we propose a new approach to overcome these drawbacks, called the migration metadata extraction tool (MMET). The remainder of the paper is organized as follows. We first review related work on this kind of information extraction and present our research motivation in Section 2. Second, we propose our solution MMET in Section 3. Third, we run several tests on image files using MMET and JHOVE; the experiment results are shown in Section 4. Finally, we discuss when our solution and other solutions should be used in Section 5, and conclude in Section 6.

2 RELATED WORK AND MOTIVATION

Currently, many solutions can generate information about content characteristics, old techniques, restrictions and/or provenance from stored objects. We classify these solutions into three classes. The first class extracts content characteristics of a given object. For example, the eXtensible Characterization Language (XCL) [15] can extract characteristics of a digital object and can further use the XCL ontology to compare these characteristics before and after migration; ExifTool [16] can read, write and modify metadata embedded in digital objects; and Tika [17] can extract metadata and structured text content from different types of digital objects. The second class identifies a format and retrieves format information from a registry. To identify a format, some tools use the file extension as a clue. However, since the file extension can be maliciously or unintentionally modified, this judgment cannot always be trusted, and custodians should use other approaches. For instance, many file formats embed a unique identifier in their head section, so the FILE command on Linux can use these identifiers to determine which format an object belongs to. A second example is DROID [18], which uses internal and external signatures to identify formats. These signatures are stored in a file downloaded from the format registry PRONOM [19]. Using the DROID signature, custodians can query a given format in PRONOM and then view the technical context for this format. Another program, Fido [20], converts the PRONOM signatures into regular expressions to obtain good performance in the format identification task.



The last class combines the functions of the above two classes. JHOVE [21] is an example. It is designed to identify a format, validate a format, extract format metadata, and audit a preservation system. JHOVE supports 12 formats, namely AIFF, ASCII, BYTESTREAM, GIF, HTML, JPEG, JPEG 2000, PDF, TIFF, UTF-8, WAVE and XML. In addition, JHOVE provides an interface through which developers can design modules for other formats. Many projects have integrated JHOVE into their solutions. For example, AIHT [22] is a preservation assessment project; in its assessment procedure, JHOVE is used to identify the formats in a preservation system and to count the number of files for each format. PreScan [23] is another tool extended from JHOVE; using PreScan, preservation metadata can be created and maintained both automatically and manually. A last example is FITS (File Information Tool Set) [24], which wraps a variety of third-party open source tools, such as ExifTool, JHOVE, DROID and the FILE command. The above solutions output metadata mainly about formats and characteristics. This is clearly not sufficient in terms of our previous work on quality requirements for migration metadata [25]. Lacking sufficient metadata can cause problems when custodians design a migration procedure. For instance, 1) the migration procedure may fail because digital objects cannot be decrypted; 2) the custodians may overestimate the migration time and choose a fast but expensive solution; or 3) the custodians may underestimate the migration time, so that a new migration has to be started before the old one has finished.

Moreover, since characteristics are used to assess migration success, much time is spent extracting the old and new characteristics. For instance, PreScan [23] needs about 10 hours to extract characteristics metadata from 100,000 files. Hence, we in this paper try to design a solution that realizes two basic objectives: it should be more efficient than current solutions, and it should obtain more metadata than current solutions.

3 OUR SOLUTION
Since current solutions are slow and their outputs are incomplete, we design a new solution in this paper. When surveying preservation systems, we found that most information systems store many metadata together with the digital objects when these objects are ingested into the systems. These metadata provide descriptive, structural, and administrative information about the objects. Hence, we decide to use these stored metadata to retrieve the information necessary for migration. Our solution is a migration metadata extraction tool, which we therefore call MMET. The abstract architecture of MMET is depicted in Figure 1, which provides a framework for discussion by pointing out the fundamental functions. In total, the architecture consists of four parts:



Operating system layer: This layer lists examples of operating systems; custodians can execute MMET under any of them.
Display layer: This layer contains one function, MMETManager, which provides both a graphical interface and a command-line interface through which custodians point out a location to scan and view the scanning result.
Business layer: This layer contains three functions, namely MMETScanner, MMETExtractor and MMETSummary. These functions work together to extract the necessary migration metadata from the stored metadata.
Data layer: This layer contains two data sets. One is the metadata stored with each associated digital object; these metadata are often stored in an XML file rather than in a database. The other is a mapping table that specifies the relationship between the migration metadata and the stored metadata.

Figure 1. Abstract Architecture of MMET

MMET is constructed from the four functions of the display layer and the business layer. These functions are implemented in Java, so MMET can run under any operating system without modification to the code. Figure 2 illustrates the workflow of MMET, which consists of seven steps. In the following sub-sections, we describe the design of these steps in terms of the functions.

Figure 2. Workflow of MMET

3.1 MMETManager
MMETManager provides a graphical interface and a command-line interface that custodians use to carry out a metadata extraction procedure. At the beginning of the procedure (i.e., Step 1), the custodians use the interface to choose a folder that contains a set of XML files together with the digital objects that are going to be migrated. At the end of the procedure (Step 7), the custodians use MMETManager to view the analysis report for this folder. MMET is implemented in Java, so the Swing library is used to develop the graphical interface. Swing offers many programming interfaces and is part of the standard Java libraries. In addition, the development tool NetBeans provides a graphical interface designer for Swing. We therefore use NetBeans and Swing to develop the MMET graphical user interface. Figure 3 illustrates an example of the MMET output.

Algorithm: Scanner()
Input: File file

if file.isFile() then
    if file.extension != "xml" then
        return
    end if
    AnalyzeFile(file)
else if file.isDirectory() then
    File[] files = file.listFiles()
    for each file f in files do
        Scanner(f)
    end for
end if

Figure 4. Scanning Algorithm used in Step 2

Figure 3. Example of the MMET Output

3.2 MMETScanner
MMETScanner implements the functions of Step 2 and Step 3. In Step 2, MMET recursively reads metadata files from the folder specified by the custodian. We use a basic recursive algorithm to realize this function, as demonstrated in Figure 4 and sketched in Java below.
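The following Java fragment is a minimal sketch of the recursive scan in Figure 4; the class and method names are illustrative assumptions, not the actual MMET code.

    import java.io.File;

    public class MetsScanner {

        // Recursively walk the folder chosen in Step 1 and hand every
        // XML file over to the analysis of Step 3.
        public void scan(File file) {
            if (file.isFile()) {
                // Only XML files are assumed to hold stored metadata.
                if (!file.getName().toLowerCase().endsWith(".xml")) {
                    return;
                }
                analyzeFile(file);
            } else if (file.isDirectory()) {
                File[] children = file.listFiles();
                if (children == null) {
                    return; // unreadable directory
                }
                for (File child : children) {
                    scan(child);
                }
            }
        }

        // Placeholder for Step 3: parse the METS file and pass the
        // fileSec/amdSec pairs on to the extractor.
        private void analyzeFile(File metsFile) {
            System.out.println("Analyzing " + metsFile.getAbsolutePath());
        }

        public static void main(String[] args) {
            new MetsScanner().scan(new File(args[0]));
        }
    }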

When it meets an XML file, MMET moves to Step 3 to analyze possible migration metadata. In our experiment, the METS metadata schema [26] is used to organize the stored metadata. Each METS file contains seven sub-parts: metsHdr, describing the METS file itself; dmdSec, describing the files within the preservation package; amdSec, providing administrative information about these files; fileSec, providing the location of these files; structMap, providing the organizational structure of these files; structLink, defining hyperlinks between these files; and behaviorSec, defining software behaviors necessary for viewing or interacting with them. An abridged METS file is attached in the Appendix at the end of this paper. As the migration metadata reside in amdSec and the file information resides in fileSec, we use an open source Java library1 to parse every METS file. This library is developed at the Australian National University and provides a dedicated Java class for each sub-part of METS, so it is easy for us to obtain amdSec and fileSec.
1 From the Australian National University, http://sourceforge.net/projects/mets-api/.
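The API of the METS library is not reproduced here; the fragment below is a simplified sketch that performs a comparable lookup with the standard Java DOM API only, listing every file element in fileSec together with the ADMID identifiers that point into amdSec. The class name and output format are illustrative assumptions.

    import java.io.File;
    import javax.xml.parsers.DocumentBuilderFactory;
    import org.w3c.dom.Document;
    import org.w3c.dom.Element;
    import org.w3c.dom.NodeList;

    public class MetsParser {

        private static final String METS_NS = "http://www.loc.gov/METS/";

        public static void main(String[] args) throws Exception {
            DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
            dbf.setNamespaceAware(true);
            Document mets = dbf.newDocumentBuilder().parse(new File(args[0]));

            // Every <mets:file> element inside fileSec.
            NodeList files = mets.getElementsByTagNameNS(METS_NS, "file");
            for (int i = 0; i < files.getLength(); i++) {
                Element file = (Element) files.item(i);
                String admIds = file.getAttribute("ADMID");
                if (admIds.isEmpty()) {
                    continue;
                }
                // ADMID holds the identifiers of the techMD/rightsMD/digiprovMD/
                // sourceMD entries in amdSec that describe this file.
                System.out.println(file.getAttribute("ID") + " -> " + admIds);
            }
        }
    }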



In amdSec, there are techMD, rightsMD, digiprovMD and sourceMD elements. Each of them contains either a wrapper (mdWrap) or a reference (mdRef) linking to an XML file, and both the wrapper and the referenced file contain a set of migration metadata. In fileSec, custodians define different groups, represented by the fileGrp element. Within a fileGrp, each file is described by several attributes and sub-elements. The ADMID attribute holds identifiers that connect the file to specific techMD, rightsMD, digiprovMD and sourceMD entries in amdSec. Figure 5 displays the relationship between fileSec and amdSec.

Figure 5. Relationship between fileSec and amdSec

Therefore, the procedure to analyze a METS file in Step 3 is: 1) the fileSec is retrieved from the METS file, 2) every fileGrp in this fileSec is retrieved, 3) every file in each fileGrp is retrieved, and 4) each pair of a file element and the corresponding amdSec entries is passed to MMETExtractor for extracting migration metadata.

3.3 MMETExtractor
MMETExtractor realizes the function of Step 4, i.e., migration metadata are extracted from techMD, rightsMD, digiprovMD and sourceMD. Because different sets of preservation metadata may use different element names for migration metadata, a relation table between the migration metadata and the stored metadata must first be created. Afterwards, MMET can retrieve the migration metadata based on this relation table. In our experiment, two metadata schemas are used in techMD, rightsMD, digiprovMD and sourceMD, namely PREMIS-v1.0 and MIX-v1.0. PREMIS-v1.0 defines a set of preservation metadata for a single object, the archive package of this object, the rights of this object, the events of this object, and the agents that have manipulated this object. Based on our previous study on quality requirements for migration metadata [25], we create a mapping table between the requirements and the metadata elements (illustrated in Table 1). MIX-v1.0 mainly provides characteristics metadata for digital images. This information is only related to R14, so the relation table for MIX-v1.0 has a single entry, i.e., MIX-v1.0 -> R14. Once the relation table is designed, the next task in Step 4 is to extract the migration metadata. The easiest way to do this retrieval is to store the relations in a database. However, due to technical constraints in our testing environment, we cannot use a database to store the mapping table. Hence, the database solution is replaced by a Java interface solution. In the Java interface solution, eight abstract operations map to the eight categories of Table 1, and each Java class that implements this interface provides a detailed extraction approach. For instance, in our implementation we create a PREMIS-v1.0 Java class in which the XML Path Language (XPath) [27] is used to query and retrieve the migration metadata. Finally, the retrieved migration metadata are wrapped together and transferred to MMETSummary.
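A simplified sketch of this Java interface solution is given below; the interface, the class and the XPath expressions are illustrative assumptions rather than the actual MMET code, and only two of the eight category operations are shown. The local-name() predicates keep the sketch independent of namespace prefixes.

    import javax.xml.xpath.XPath;
    import javax.xml.xpath.XPathConstants;
    import javax.xml.xpath.XPathFactory;
    import org.w3c.dom.Node;
    import org.w3c.dom.NodeList;

    // One abstract operation per category of Table 1 (only two shown here).
    interface MigrationMetadataExtractor {
        NodeList extractStorage(Node techMd) throws Exception;
        NodeList extractFormatSpecification(Node techMd) throws Exception;
        // ... remaining category operations ...
    }

    // PREMIS-v1.0 implementation: XPath queries against the mdWrap content.
    class PremisV1Extractor implements MigrationMetadataExtractor {
        private final XPath xpath = XPathFactory.newInstance().newXPath();

        public NodeList extractStorage(Node techMd) throws Exception {
            return (NodeList) xpath.evaluate(
                    ".//*[local-name()='storage']/*[local-name()='storageMedium']",
                    techMd, XPathConstants.NODESET);
        }

        public NodeList extractFormatSpecification(Node techMd) throws Exception {
            return (NodeList) xpath.evaluate(
                    ".//*[local-name()='objectCharacteristics']/*[local-name()='format']"
                    + "/*[local-name()='formatDesignation']",
                    techMd, XPathConstants.NODESET);
        }
    }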

Table 1. Mapping Table between PREMIS-v1.0 and our Quality Requirements

Category         Quality Req.                                  Elements in PREMIS-v1.0
Storage          Storage medium                                Storage.storageMedium
                 Storage medium player                         n/a
                 Storage medium application                    n/a
Hardware         Microprocessor                                Environment.hardware.{hwName, hwType, hwOtherInformation}
                 Memory stick                                  Environment.hardware.{hwName, hwType, hwOtherInformation}
                 Motherboard                                   Environment.hardware.{hwName, hwType, hwOtherInformation}
                 Peripherals                                   Environment.hardware.{hwName, hwType, hwOtherInformation}
Application      Interpretation application                    Environment.software.{swName, swVersion, swType, swOtherInformation, swDependency}; CreatingApplication.{creatingApplicationName, creatingApplicationVersion, dateCreatedByApplication, creatingApplicationExtension}
Specification    Format specification                          objectCharacteristics.format.formatDesignation.{formatName, formatVersion}
                 Identifier specification                      objectIdentifier.objectIdentifierType
                 Hyperlink specification                       relationship.relatedObjectIdentification.relatedObjectIdentifierType; relationship.relatedEventIdentification.relatedEventIdentifierType; linkingEventIdentifier.relatedEventIdentifierType; linkingIntellectualEntityIdentifier.linkingIntellectualEntityIdentifierType; linkingPermissionStatementIdentifier.linkingPermissionStatementIdentifierType
                 Encryption specification                      objectCharacteristics.inhibitors.{inhibitorType, inhibitorTarget}
                 Fixity specification                          objectCharacteristics.fixity.{messageDigestAlgorithm, messageDigestOriginator}
Characteristics  Content characteristics                       objectCharacteristics.significantProperties
                 Appearance characteristics                    objectCharacteristics.significantProperties
                 Behaviors characteristics                     objectCharacteristics.significantProperties
                 Reference characteristics                     objectCharacteristics.significantProperties
Provenance       Migration event, Changed parts                eventType; eventDateTime; linkingAgentIdentifier.{linkingAgentIdentifierType, linkingAgentIdentifierValue, linkingAgentRole}; eventOutcomeInformation.{eventOutcome, eventOutcomeDetail}
                 IPRs, Law, Modification rights, Retention rights    permissionStatement.*
                 Preservation level                            preservationLevel
                 Important factors to characteristics, Assessment algorithm    n/a

*. All sub-elements of a given element should be provided.



3.4 MMETSummary
MMETSummary manages the extracted metadata so that custodians can view the scanning results. It covers two steps, Step 5 and Step 6. In Step 5, MMETSummary outputs the migration metadata for each digital object, so that custodians can view the necessary migration information for every object. Since we cannot use a database, we use XML files to store these metadata. In MMET, the migration metadata are held in a Document Object Model (DOM) data structure, and we use the standard Java library to write this DOM structure out to an XML file. When all files have been scanned, an overall report is created in Step 6. This report is also stored as an XML file with three levels. In the first level, there are eight elements, one for each category of the migration metadata requirements. In the second level, there are requirement instances, organized according to the classification of these requirements; for example, pdf is in the format category, and MD5 is in the fixity category. In the third level, there are identifiers referring to the digital objects that satisfy a given requirement, and every identifier has an attribute that specifies the location of the extracted migration metadata file.
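The following fragment is a minimal sketch, using only the standard Java XML libraries, of how such a three-level report could be built as a DOM and written to an XML file; the element and attribute names are illustrative assumptions, not the actual MMET report schema.

    import java.io.File;
    import javax.xml.parsers.DocumentBuilderFactory;
    import javax.xml.transform.OutputKeys;
    import javax.xml.transform.Transformer;
    import javax.xml.transform.TransformerFactory;
    import javax.xml.transform.dom.DOMSource;
    import javax.xml.transform.stream.StreamResult;
    import org.w3c.dom.Document;
    import org.w3c.dom.Element;

    public class SummaryWriter {

        // Build a tiny report DOM (category -> requirement instance -> object id)
        // and serialize it with the standard JAXP transformer.
        public static void main(String[] args) throws Exception {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder().newDocument();

            Element report = doc.createElement("report");
            doc.appendChild(report);

            Element format = doc.createElement("format");      // level 1: category
            report.appendChild(format);

            Element pdf = doc.createElement("requirement");    // level 2: instance
            pdf.setAttribute("value", "pdf");
            format.appendChild(pdf);

            Element obj = doc.createElement("object");         // level 3: identifier
            obj.setAttribute("id", "URN:NBN:example-object");  // hypothetical id
            obj.setAttribute("metadataFile", "out/example-object.xml");
            pdf.appendChild(obj);

            Transformer t = TransformerFactory.newInstance().newTransformer();
            t.setOutputProperty(OutputKeys.INDENT, "yes");
            t.transform(new DOMSource(doc), new StreamResult(new File("report.xml")));
        }
    }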

4 EVALUATIONS
We test MMET on sets of digitized books stored in the system of the National Library of Norway (NB). Each page of a book is digitized as a digital object, and the NB uses three formats to store a page:
JPEG: The files are small, so this format is used to distribute the content on the Internet.
JPEG 2000: This format preserves the content of a digital object with good quality, but the files are larger than JPEG. The NB uses it as the primary preservation format.
XML: This format stores only the text of a page. Its main function is to enable full-text search over a digitized book.
In the next two sub-sections, we evaluate the speed of MMET and the quality and quantity of the metadata it outputs.

4.1 Speed Evaluation
To evaluate the speed of MMET, we create five data sets of different sizes. Table 2 summarizes the average running times of MMET. These running times show that the overall running time grows roughly linearly. For 1 million METS files, MMET needs nearly 7.8 hours to extract the migration metadata; scanning and extracting 10 million METS files would take more than 3 days. We therefore stop testing at the scale of one million files. Table 2 also shows that Step 4 accounts for almost 79% of the whole running time. We tried replacing XPath with the Java Architecture for XML Binding (JAXB) to extract the metadata, but the tests showed that JAXB is slower than XPath, so we keep XPath.

International Journal of Digital Information and Wireless Communications (IJDIWC) 1(1): 161-174

The Society of Digital Information and Wireless Communications, 2011(ISSN 2225-658X)

Table 2. Speed of MMET (in sec)

Files      10^2    10^3    10^4     10^5      10^6
Step 3     0.72    2.91    23.11    250.65    2448.72 (40.8 min)
Step 4     3.37    22.46   204.78   2044.84   20337.21 (5.7 hr)
Step 5     0.35    3.81    38.28    445.36    3898.90 (1.1 hr)
Step 6     0.31    1.63    14.30    147.54    1451.18 (24.2 min)
Other      0.05    0.09    0.52     5.22      66.39 (1.1 min)
Overall    4.81    30.91   280.98   2893.61   28202.39 (7.8 hr)

Table 3. Speed of MMET, JHOVE and JHOVE Audit* (in hr)

Dataset      MMET   JHOVE   JHOVE Audit
303.3 GB     1.3    52.0    1.1
606.6 GB     2.6    n/a     2.3
909.9 GB     3.9    n/a     3.3
1213.2 GB    5.4    n/a     4.5

*. JHOVE means that JHOVE performs the characteristics extraction function, whilst JHOVE Audit means that JHOVE only performs the audit function.

Besides evaluating the speed of MMET, we compare it against the speed of JHOVE. We use JHOVE as the competitor because it is often used in preservation systems to extract metadata. When we run the full JHOVE, it takes around 52 hours to finish the 303.3 GB data set, so we did not run it on the other data sets. Instead, we only run the audit function of JHOVE (written as JHOVE Audit) on them. This function validates file formats and creates an inventory of the file system. Table 3 illustrates the experiment results. Amongst the three solutions, JHOVE Audit is 15%-16% faster than MMET, and JHOVE is by far the slowest.

4.2 Quantity and Quality Evaluation of the Output Metadata

For the quantity and quality evaluation, we again compare JHOVE Audit, JHOVE and MMET. JHOVE Audit creates few metadata: it only reports the validity status, the format types in the MIME classification, and the number of files for a given format and folder. For instance, for the 303.3 GB dataset, JHOVE Audit reports that all files are valid and that there are four kinds of formats, i.e., image/jp2, image/jpg, text/plain with the US-ASCII charset, and text/plain with the UTF-8 charset2. However, this information is not accurate: JHOVE Audit recognizes most of the XML files that use UTF-8 as US-ASCII. JHOVE creates more metadata than JHOVE Audit.
2 text/plain with the US-ASCII or UTF-8 charset refers to an XML format. Since our test environment has no Internet access, the XML module of JHOVE cannot be used.



For each file, JHOVE shows not only the validity and the MIME format type, but also retrieves the basic metadata embedded in the file and creates characteristics metadata based on the content. For instance, JHOVE uses MIX-v1.0 to store the characteristics of images. MMET provides more metadata than both JHOVE Audit and JHOVE. The MMET report contains metadata about storage, software, format, identifier, reference, fixity, preservation level, provenance, and the characteristics schema. As for the format metadata, MMET reports that there are JPEG 2000, JPEG-1.01 and XML-1.0 in the system, which matches the real situation. Therefore, with respect to the quality and quantity of the output metadata, MMET is the best in our evaluation.

5 DISCUSSIONS
There are two sets of solutions for obtaining the information needed to design a migration plan. The first set, the object-based solutions, directly analyze digital objects, like JHOVE. The second set, the metadata-based solutions, retrieve information from the preserved metadata, like MMET. At different points in time, these two sets play different roles. For instance, when a digital material is ingested into the preservation system, there are few metadata, so the metadata-based solution will not work at all and the object-based solution should be used. However, during the preservation period, the metadata-based solution works better than the object-based solution, and the object-based solution should then only be used for some simple functions, such as identifying formats.

This is because 1) the object-based solution is slow when it realizes a complex function, e.g., characteristics extraction; 2) the extracted metadata may not reflect the real situation; and 3) the object-based solution may also process many redundant files that were once used but are no longer needed. The metadata-based solution works well in the preservation period. It can retrieve many metadata, and the retrieved metadata are more accurate than those produced by the object-based solution. Moreover, the metadata-based solution does not need to access the preserved digital objects when the custodians design a migration plan. This advantage helps to increase the security of the preservation system and makes it possible to outsource the migration plan design; for instance, a third-party institution can assess the risks in the preservation system and design corresponding solutions. However, the metadata-based solution also has limitations: 1) the quality and quantity of the preserved metadata determine the quality and quantity of the migration metadata; and 2) manual intervention is involved, e.g., for defining the mapping relations. Speed remains a big challenge for both the metadata-based and the object-based solutions. In our test, the 1213.2 GB dataset contains 1280 digitized books with 57380 pages in total. MMET needs around 5.4 hours to retrieve the metadata, and JHOVE Audit needs 4.5 hours. However, large preservation systems, such as national libraries or national archives, hold hundreds of thousands of books. When all these books are digitized, it may take days or months to retrieve the metadata. In this situation, both the metadata-based solution and the object-based solution are inadequate. Possible remedies are 1) to use parallel computing techniques in the metadata-based and object-based solutions, and 2) to move the management of the metadata from the application level to the system level; for example, [28] describes a preservation-aware storage to which such metadata can be added.

6 CONCLUSION
When custodians plan a migration, they need much knowledge about the digital objects, the old techniques, the preservation system, the current organization requirements and the latest techniques. In order to know which old techniques are being used and how many objects use them, the custodians have to scan the whole system and analyze every file stored in it. This procedure is very time-consuming. Thus, we in this paper design a new approach in which custodians get this information from stored metadata. We have implemented our approach as a program called MMET. In our experiment, MMET runs fast and outputs many metadata that are useful for migration. However, in terms of our quality requirements, some metadata still cannot be retrieved, because the preservation system does not store them at all.

7 ACKNOWLEDGEMENTS
The research in this paper is funded by the Norwegian Research Council and our industry partners under the LongRec project. We would also like to thank our partners in LongRec, especially the National Library of Norway, for providing the experiment environment and technical support.

8 REFERENCES


1. The Consultative Committee for Space Data Systems: The Reference Model for an Open Archival Information System (OAIS). Available from: http://public.ccsds.org/publications/archive/650x0b1.PDF (2002)
2. Lee, K.-H., Slattery, O., Lu, R., Tang, X., McCrary, V.: The State of the Art and Practice in Digital Preservation. Journal of Research of the National Institute of Standards and Technology. 107(1): p. 93-106 (2002)
3. Thibodeau, K.: Overview of Technological Approaches to Digital Preservation and Challenges in Coming Years. CLIR Reports, Conference Proceedings of The State of Digital Preservation: An International Perspective (2002)
4. Wheatley, P.: Migration: a CAMiLEON discussion paper. Ariadne. 29(2) (2001)
5. Granger, S.: Emulation as a Digital Preservation Strategy. D-Lib Magazine. 6(10) (2000)
6. Lorie, R.A.: A Methodology and System for Preserving Digital Data. In: Proceedings of the 2nd ACM/IEEE-CS Joint Conference on Digital Libraries. Portland, Oregon, USA: ACM (2002)
7. Borghoff, U., Rödig, P., Schmitz, L., Scheffczyk, J.: Migration: Current Research and Development. In: Long-Term Preservation of Digital Documents, Springer Berlin Heidelberg. p. 171-206 (2006)
8. Waters, D., Garrett, J.: Preserving Digital Information. Report of the Task Force on Archiving of Digital Information (1996)
9. Wheatley, P.: Migration: a CAMiLEON discussion paper. Ariadne. 29(2) (2001)
10. Mellor, P., Wheatley, P., Sergeant, D.M.: Migration on Request, a Practical Technique for Preservation. In: Proceedings of the 6th European Conference on Research and Advanced Technology for Digital Libraries. Springer-Verlag (2002)
11. Hunter, J., Choudhury, S.: PANIC: An Integrated Approach to the Preservation of Composite Digital Objects Using Semantic Web Services. International Journal on Digital Libraries. 6(2): p. 174-183 (2006)



12. Caplan, P.: The Florida Digital Archive and DAITSS: A Working Preservation Repository Based on Format Migration. International Journal on Digital Libraries. 6(4) (2007)
13. Ferreira, M., Baptista, A., Ramalho, J.: An Intelligent Decision Support System for Digital Preservation. International Journal on Digital Libraries. 6(4): p. 295-304 (2007)
14. Becker, C., Kulovits, H., Guttenbrunner, M., Strodl, S., Rauber, A., Hofman, H.: Systematic Planning for Digital Preservation: Evaluating Potential Strategies and Building Preservation Plans. International Journal on Digital Libraries. 10(4): p. 157 (2009)
15. Thaller, M., Heydegger, V., Schnasse, J., Beyl, S., Chudobkaite, E.: Significant Characteristics to Abstract Content: Long Term Preservation of Information. In: Research and Advanced Technology for Digital Libraries, B. Christensen-Dalsgaard, et al., Editors, Springer Berlin / Heidelberg. p. 41-49 (2008)
16. ExifTool. Available from: http://www.sno.phy.queensu.ca/~phil/exiftool/.
17. Tika. Available from: http://tika.apache.org/.
18. DROID. Available from: http://sourceforge.net/apps/mediawiki/droid/index.php?title=Main_Page.
19. PRONOM. Available from: http://www.nationalarchives.gov.uk/pronom/.
20. Fido. Available from: https://github.com/openplanets/fido.
21. Abrams, S.L.: The Role of Format in Digital Preservation. Vine. 34: p. 49-55 (2004)
22. Anderson, R., Frost, H., Hoebelheinrich, N., Johnson, K.: The AIHT at Stanford University: Automated Preservation Assessment of Heterogeneous Digital Collections. D-Lib Magazine. 11: p. 12 (2005)
23. Marketakis, Y., Tzanakis, M., Tzitzikas, Y.: PreScan: Towards Automating the Preservation of Digital Objects. In: Proceedings of the International Conference on Management of Emergent Digital EcoSystems (2009)
24. File Information Tool Set (FITS). Available from: http://code.google.com/p/fits/.
25. Anonymous: Quality Requirements of Migration Metadata in Long-Term Digital Preservation Systems. In: Metadata and Semantic Research, S. Sánchez-Alonso and I.N. Athanasiadis, Editors, Springer Berlin Heidelberg. p. 172-182 (2010)
26. McDonough, J.: METS: Standardized Encoding for Digital Library Objects. International Journal on Digital Libraries. 6(2): p. 148-158 (2006)
27. Clark, J., DeRose, S.: XML Path Language (XPath) Version 1.0. W3C Recommendation REC-xpath-19991116, World Wide Web Consortium (1999)
28. Factor, M., Naor, D., Rabinovici-Cohen, S., Ramati, L., Reshef, P., Satran, J., Giaretta, D.L.: Preservation DataStores: Architecture for Preservation Aware Storage. In: Proceedings of the 24th IEEE Conference on Mass Storage Systems and Technologies (MSST 2007) (2007)

APPENDIX: AN ABRIDGED METS FILE


<?xml version="1.0" encoding="UTF-8"?>
<!--XSLT transformer: Apache Software Foundation (Xalan XSLTC), Version: 1.0-->
<mets:mets xmlns:mets="http://www.loc.gov/METS/"
           xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
           xsi:schemaLocation="http://www.loc.gov/METS/ http://www.loc.gov/standards/mets/mets.xsd">
  <mets:amdSec>
    <mets:techMD ID="OBJ_001">
      <mets:mdWrap MDTYPE="PREMIS"> ... </mets:mdWrap>
    </mets:techMD>
    <mets:techMD ID="OBJ_002">
      <mets:mdWrap MDTYPE="PREMIS"> ... </mets:mdWrap>
    </mets:techMD>
    <mets:techMD ID="OBJ_003">
      <mets:mdWrap MDTYPE="PREMIS"> ... </mets:mdWrap>
    </mets:techMD>
    <mets:techMD ID="OBJ_004">
      <mets:mdWrap MDTYPE="PREMIS"> ... </mets:mdWrap>
    </mets:techMD>
    <mets:techMD ID="MIX_001">
      <mets:mdWrap MDTYPE="NISOIMG"> ... </mets:mdWrap>
    </mets:techMD>
    <mets:techMD ID="MIX_002">
      <mets:mdWrap MDTYPE="NISOIMG"> ... </mets:mdWrap>
    </mets:techMD>
    <mets:techMD ID="MIX_003">
      <mets:mdWrap MDTYPE="NISOIMG"> ... </mets:mdWrap>
    </mets:techMD>
    <mets:digiprovMD ID="EVT_001">
      <mets:mdWrap MDTYPE="PREMIS"> ... </mets:mdWrap>
    </mets:digiprovMD>
    <mets:digiprovMD ID="EVT_002">
      <mets:mdWrap MDTYPE="PREMIS"> ... </mets:mdWrap>
    </mets:digiprovMD>
    <mets:digiprovMD ID="EVT_003">
      <mets:mdWrap MDTYPE="PREMIS"> ... </mets:mdWrap>
    </mets:digiprovMD>
    <mets:digiprovMD ID="EVT_004">
      <mets:mdWrap MDTYPE="PREMIS"> ... </mets:mdWrap>
    </mets:digiprovMD>
    <mets:digiprovMD ID="EVT_005">
      <mets:mdWrap MDTYPE="PREMIS"> ... </mets:mdWrap>
    </mets:digiprovMD>
  </mets:amdSec>
  <mets:fileSec>
    <mets:fileGrp ID="DSMGRP">
      <mets:file ID="DO_0001">
        <mets:file USE="PRESERVATION" SEQ="1" MIMETYPE="image/jp2" ID="JP2_0001" SIZE="4124926"
                   ADMID="OBJ_003 MIX_002 EVT_003"
                   CHECKSUM="ab04608d39ebcca81b0d63337a458b34" CHECKSUMTYPE="MD5">
          <mets:FLocat LOCTYPE="URN" xmlns:xlink="http://www.w3.org/1999/xlink"
                       xlink:href="URN:NBN:no-nb_digibok_2008020400079_I3"/>
        </mets:file>
        <mets:file USE="LAYOUT" SEQ="2" MIMETYPE="text/xml" ID="XML_0001" SIZE="1555"
                   ADMID="OBJ_002 EVT_002"
                   CHECKSUM="b895ca94c5f0ecc6c6a87d3c69262a8c" CHECKSUMTYPE="MD5">
          <mets:FLocat LOCTYPE="URN" xmlns:xlink="http://www.w3.org/1999/xlink"
                       xlink:href="URN:NBN:no-nb_digibok_2008020400079_I3"/>
        </mets:file>
        <mets:file USE="BROWSING" SEQ="3" MIMETYPE="image/jpeg" ID="JPG_0001" SIZE="131645"
                   ADMID="OBJ_004 MIX_003 EVT_004"
                   CHECKSUM="8a8c698eae9ba85652776644ad6fc1ab" CHECKSUMTYPE="MD5">
          <mets:FLocat LOCTYPE="URN" xmlns:xlink="http://www.w3.org/1999/xlink"
                       xlink:href="URN:NBN:no-nb_digibok_2008020400079_I3"/>
        </mets:file>
      </mets:file>
    </mets:fileGrp>
  </mets:fileSec>
  <mets:structMap TYPE="PHYSICAL">
    <mets:div TYPE="BOOKS:PAGE">
      <mets:fptr FILEID="JP2_0001"/>
      <mets:fptr FILEID="XML_0001"/>
      <mets:fptr FILEID="JPG_0001"/>
    </mets:div>
  </mets:structMap>
</mets:mets>

