
New Review of Hypermedia and Multimedia, Vol. 13, No. 1, July 2007, 55-75

Technical Note

Annotating Web archives: structure, provenance, and context through archival cataloguing
P. H. J. WU*, A. K. H. HEOK and I. P. TAMSIR
Nanyang Technological University, 31 Nanyang Link, Singapore 637718

Despite the success of Internet access via search technology, such ease of access is still not available in Web archives, because accessing Web archives requires a greater amount of relevant contextual information. The degree of relevance of the contextual information has to be customized to suit research on culture and heritage over time. Information scientists have long been struggling to find a system that can help them organize Web archives so that users can have access to complete and coherent collections. Lessons can be learned from archivists, who have an established tradition of linking materials to their origin and ownership, or what is termed provenance. In this paper, we demonstrate how Web Annotation for Web Intelligence, more than just an intuitive way of expressing one's thoughts on the materials under study, is in fact an appropriate tool for cataloguing Web archives in order to ensure a high quality of access for users. Informed by the theory of the Records Continuum, a demonstration of access to archived Web materials is presented. We then recommend an effective way of allowing the continual organization of Web archives based on several design principles for a Web annotation system. This system would preserve the evidence and context of the cataloguing process. Such a tool would also help facilitate collaboration among information professionals in organizing complex Web archives. Implementing the recommended Web annotation system will help ensure better-quality archives, with more evidence and contextual information preserved within the system.

1. Introduction

Web users are accustomed to instant access to information, thanks to the success of Web search technology. However, because a Web archive keeps different versions of websites, a greater effort to catalogue its materials is needed to provide the easy access that Web users expect. There has been increasing interest in providing a more complex information architecture for leverage, such as taxonomy, metadata, ontology, and the integration of different modes of access, including searching, browsing, and routing. This paper examines a particular case for accessing Web archives, which contain complex materials that can serve distinct communities, including social scientists and historians. We present a perspective in which websites are more than mere publications. They should be seen as evidence of the cultural activities of contemporary society. As such, a collection of websites should be managed differently, as an archive would manage its holdings, preserving the contextual evidence of its content.

*Corresponding author. Email: hjwu@ntu.edu.sg
New Review of Hypermedia and Multimedia ISSN 1361-4568 print/ISSN 1740-7842 online © 2007 Taylor & Francis http://www.tandf.co.uk/journals DOI: 10.1080/13614560701423620

In a previous paper (Wu et al. 2006), we demonstrated a bibliographic approach to cataloguing Web archives and showed how metadata produced by Web annotation can serve as points of access to Web archives. In that paper, a short survey of the various library Web archive models around the world also pointed to a pressing inadequacy in the available methods of organizing their materials. These models usually employ bibliocentric cataloguing, which treats each website as an entity without any relationship to the other materials in the collection. The inadequacy lies in the fact that the contextual and provenancial information of these collections, which is essential for social scientists and historians, is not made apparent, with much of it buried deep within the archives. A more suitable model being developed is the Arizona Model (Pearce-Moses and Kaczmarek 2005), where archival principles of provenance and original order are adopted. This approach may prove more useful in presenting a Web archive's holdings to facilitate knowledge discovery.

The technological challenge then becomes one of how Web annotation can be effectively extended to help organize contextual and provenancial relationships based on bibliographic metadata. We explain the need for these requirements with a concrete case in Section 2, from a post-custodian approach. In Section 3, a context-aware Web annotation system, termed Web Annotation for Web Intelligence (WAWI), is introduced. The WAWI system ensures the capture of evidence and contextual information in the Web archive catalogue. WAWI is part of a joint project between the National Library Board of Singapore and Nanyang Technological University to catalogue and archive Singapore websites.
Before explaining how context-aware annotation works, we review the difference between context-less and context-aware systems. Context-less annotation does not provide the relationship between the metadata and the Web content (the context which the metadata content is describing). Thus, it is difficult for a third party who was not involved in the original annotation to confirm whether the annotated metadata are consistent with the Web content. Without such verification, the evidence, that is, the selected parts of the Web content used to annotate the metadata, cannot be corroborated with the annotation. This compromises the annotation and renders it unreliable. Context-aware annotation, however, establishes the relationship between the metadata, the content of the Web material, and the social context in which the content was produced. A context-aware annotation system can thus help librarians ensure the quality of the records more effectively by being able to:
• relate semantic content in the metadata to Web content;
• render agreement, disagreement, and different granularity of evidence;
• provide flexible and precise annotation of the evidence;
• relate ontology to metadata in relational metadata.

2. Post-custodian approach to Web archives cataloguing


The tagging movement allows actors other than the creator of the Web materials to structure meaning into the materials. This collaborative approach to organizing information has been shared by professional archivists in the development of the Records Continuum Theory (RCT) for organizing records and archives (Upward 1998). RCT challenges the custodial role of the archives. It advocates that, in a post-custodial paradigm, archivists must become more than mere physical caretakers and take on the role of identifying, controlling, and making electronic records continually accessible to society at large. As professionals in preserving information, archivists should take as much care in cataloguing their active holdings, to facilitate access to public records, as they do in preserving them. Similarly, in the context of a Web archive, the Web archivist should take on a more proactive role in transforming the Web archive into one that allows greater and easier access to its materials. In the current Web environment, public users could also be encouraged to help collaboratively make sense of the informal Web materials that are being preserved, as exemplified by the participants of the tagging movement.

To illustrate how contextually organized materials can facilitate access to holdings in a Web archive, we use the example of the website of the Ministry of Manpower (MOM) in Singapore (www.mom.gov.sg). MOM's mission is to achieve a globally competitive workforce and a great workplace for a cohesive society and a secure economic future for all Singaporeans. One of the ways it sets out to accomplish this aim is through an Occupational Safety and Health (OSH) Division that promotes OSH at the national level. The division works with employers, employees, and all other stakeholders to identify, assess, and manage workplace safety and health risks so as to eliminate death, injury, and ill health.
The department within the OSH Division that focuses on reducing safety and health hazards is the OSH Inspectorate. It does so by providing advice and guidance through inspections of workplaces, investigating accidents, and enforcing the relevant laws. The hierarchical relationship between the various offices can be found in the interactive government online directory at http://www.sgdi.gov.sg/. A snapshot of the relevant page is presented in figure 1.

Figure 1. Organizational chart as reflected in the Singapore Government Directory interactive.

In a typical work process, such as communicating information to the public about an industrial accident, both the division in charge of the policy area (OSH in this case) and the corporate communications department (CCD) would put up a joint draft, which goes through the PS to the Minister for approval, depending on the nature of the subject to be announced. Such cross-divisional collaboration means that records of the drafting and approval process would be kept at both divisions. OSH holds a series of case files relating to a particular subject or case (e.g. industrial accidents, public education on occupational safety issues, reports on occupational health, etc.); these case files contain all the drafts prepared for submission up to the divisional director. CCD holds all drafts of press releases it receives from each divisional director and the subsequent changes after vetting by the bureaucratic and political masters. However, because all Web communication comes under the purview of CCD, information based on the Web should be filed under CCD. To facilitate the different categories of CCD's work, the materials are divided into events, marketing, public education, publication, press releases, speeches, etc., and these are further subdivided by subject area, division, or department, mirroring the organization chart. Following an archival arrangement of materials, the MOM fonds would contain all fonds of the various divisions and files of the different departments, as presented in figure 2.

Figure 2. MOM's organization chart derived from the Singapore Government Directory interactive.

In the case of an industrial accident, the department most intimately involved would be the Investigation Branch under the OSH Inspectorate Department, which comes under the Occupational Safety & Health Division. Consider a scenario in which a public policy scholar examines how the Ministry of Manpower in Singapore handled an industrial accident, specifically the Nicoll Highway Collapse Incident. As this was an industrial accident, the OSH Inspectorate was the agency legislated to oversee investigations. To review the events from the government's point of view, the scholar can visit the OSH group of documents. He will be pointed to files containing the various public communication activities (http://www.mom.gov.sg/NewOSHFrameworkandInvestigationsonNicollHighwayCollapse). These files include speeches by the minister (in parliament, for the amendment of the Factories Act), commission reports, press releases, and even a list of Frequently Asked Questions (FAQ).

However, these files may not all be available from the current website, because the importance of information emanating from the government may change as events unfold. This change can be seen by comparing the website then and now in figures 3a and 3b. For example, the FAQ section, one of the key documents available in 2004 to help the public understand and interpret the information on the site, was missing by 2006. This helpful information is no longer available at the live MOM website. The researcher can no longer learn via the FAQ how the reports were made or about the various degrees of commissions that the government appointed.

Figure 3. (a) MOM circa 2004 from Web Archives, with FAQ. (b) MOM circa 2006 in the current website, without FAQ.

However, with the creation of a Web archive where such materials are organized into collections, and with the arrangement of records made possible by annotation tools, changes in public communication patterns can be made more apparent. Not only will researchers benefit from being able to access evidence of changing trends, but so will ordinary citizens who want to find out about the accident at a later date. In addition, by relating the files to each other, one also discovers that not only MOM but also the Ministry of National Development (MND) and the Building and Construction Authority (BCA) were involved in offering joint reports on the event. Their insights helped to mould the new policies that came out of such reports and led to the creation of a new OSH Framework.

From these observations, context-aware Web annotation is important not only for current use but, even more crucially, for the lasting heritage and cultural value of Web materials. It is also important for organizing Web materials as records to be carried across time (Wu and Theng 2005, Wu and Heok 2006). Most of the current approaches to Web archive cataloguing surveyed in our last paper (Wu et al. 2006) fall short of the requirements to provide evidential and contextual organization to facilitate effective access.

3. Web annotation system in service of Web archive cataloguing


As demonstrated in section 2, a context-aware Web annotation system can facilitate effective information discovery. In this section, we introduce the Web Annotation for Web Intelligence (WAWI) system. We also demonstrate how four design principles are implemented to achieve the objectives of preserving the evidence and context in cataloguing and arranging Web archives. The system needs to be able to:
• relate semantic content in the metadata to the Web content;
• render agreement, disagreement, and different granularities of evidence;
• provide flexible and precise annotation of the evidence;
• relate ontology to metadata in relational metadata.

The WAWI annotation system is integrated with the Web archiving platform developed by the International Internet Preservation Consortium (IIPC) (http://www.netpreserve.org/about/index.php), which comprises Web harvesting and access components: Heritrix (http://crawler.archive.org/), NutchWax (http://archive-access.sourceforge.net/projects/nutch/), and Wera (http://archive-access.sourceforge.net/projects/wera/). The system architecture resulting from the incorporation of annotation in the cataloguing process is shown in figure 4.

Figure 4. WAWI annotation and cataloguing system integrated with the IIPC Web Archive platform.


Details and a demonstration of the WAWI system are discussed in section 3.5. In sections 3.1 to 3.4, we focus on the design principles of the WAWI system, with reference to Annotea (Kahan et al. 2001) and CREAM (Handschuh et al. 2001) as model systems.
3.1 Relating semantic content of the metadata to Web content

As briefly mentioned in section 1, there are two different kinds of annotation systems: one provides the relationship between the semantic content of the metadata and the Web content, and the other does not. Examples of context-less annotation systems developed in the Web archiving community can be found in Schneider et al. (2002) and Lampos et al. (2004). In Schneider et al. (2002), annotated metadata were used for browsing; in Lampos et al. (2004), annotation was meant to be implemented as an automatic tagging system. Context-aware annotation establishes the relationship between the metadata and the content of the Web material. The Annotea project of the World Wide Web Consortium (W3C) is an example of a context-aware system (Kahan et al. 2001). It provides a relationship between the semantic content and the document content through two properties, annotates and context, in its namespace (defined at http://www.w3.org/2000/10/annotation-ns#). The WAWI annotation system adopts the Annotation Graph (AG) schema (Bird and Liberman 1999). The XML document fragments corresponding to the items highlighted in figure 5 are presented below:

<annoschema id="{GUID0}" datecreated="23 09 2005" createdby="ichsan" type="ontology" datemodified="" modifiedby="" url="http://app.sgdi.gov.sg/listing.asp?agency_subtype=dept&amp;agency_id=0000000011">
  <Division Title="OrganizationHealthSafty" id="{GUID1}" begin="566" end="577" value="Organizational Health and Safety" meta="Organizational Safety and Health"></Division>
  <Division Title="ForeignManpwer" id="{GUID2}" begin="987" end="1004" value="Foreign Manpower Policy" meta="Foreign Manpower Policy"></Division>
</annoschema>

Each annotation schema contains several annotation attributes and elements. The id attribute contains the system generated unique id for the schema; the url attribute denotes the Web page that is annotated as support of the schema; other self-explanatory attributes include datecreated, datemodified, modifiedby and createdby.


Figure 5. Annotation schema, an ontology reflecting the MOM organization chart, and its supporting Web page at the Singapore Government Directory interactive (SGDi) (only partially shown).

Each annotation element, such as Division, contains a begin and an end attribute, whose values are the page coordinates (see the discussion in Section 3.3) of the text portion of the DOM tree of the Web page. The value attribute contains the text of the Web page that is delimited by the begin and end page coordinates and that was highlighted as evidence (or context, in Annotea's terms). The meta attribute contains the metadata that are assigned to the element and supported by the evidence. In the MOM example discussed earlier, we created the annotation schema, an ontology, that relates to the MOM organization chart found on the SGDi website.
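As a minimal sketch of how such a schema document could be read, the following Python fragment parses an annotation schema and lists each element's page coordinates, evidence, and metadata. It is an illustration only and not part of WAWI: the attribute quoting and the &amp; escape in the url are assumptions made so that the sample is well-formed XML, and the function name read_annotations is hypothetical.

```python
import xml.etree.ElementTree as ET

# Sample schema modelled on the paper's ontology example.
# Quoting and the &amp; escape are assumptions needed for well-formed XML.
SCHEMA_XML = """\
<annoschema id="{GUID0}" type="ontology" createdby="ichsan"
            datecreated="23 09 2005" datemodified="" modifiedby=""
            url="http://app.sgdi.gov.sg/listing.asp?agency_subtype=dept&amp;agency_id=0000000011">
  <Division Title="OrganizationHealthSafty" id="{GUID1}" begin="566" end="577"
            value="Organizational Health and Safety"
            meta="Organizational Safety and Health"></Division>
  <Division Title="ForeignManpwer" id="{GUID2}" begin="987" end="1004"
            value="Foreign Manpower Policy"
            meta="Foreign Manpower Policy"></Division>
</annoschema>
"""


def read_annotations(xml_text):
    """Return (page_url, records) for one annotation schema document."""
    root = ET.fromstring(xml_text)
    records = []
    for element in root:  # direct children are the annotation elements
        records.append({
            "element": element.tag,
            "begin": int(element.get("begin")),   # start page coordinate
            "end": int(element.get("end")),       # end page coordinate
            "evidence": element.get("value"),     # highlighted text
            "metadata": element.get("meta"),      # value assigned by the cataloguer
        })
    return root.get("url"), records


url, records = read_annotations(SCHEMA_XML)
for r in records:
    print(r["element"], r["begin"], r["end"], r["evidence"], "->", r["metadata"])
```

Keeping the evidence and the metadata side by side in each record is what allows a third party to compare the two later.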
3.2 Rendering agreement, disagreement, and different granularity of evidence

In Annotea, annotations are simply rendered as pencil symbols (Kahan et al. 2001). The pencil symbol model is limited, as it can only indicate the starting point, not the extent of the annotation. On the other hand, the AG model of annotation in WAWI encompasses the whole extent of the annotation. When disagreement and different granularities of evidence occur, various overlapping patterns of the extent will result. This is where the need for rendering complex patterns of annotation comes in.


Figure 6. Multiple-evidence overlapping annotations in WAWI.

As demonstrated in figure 6, two disagreeing metadata records are shown by the overlapped annotation (evidence) of the OSH vision. With the highlighted patterns, the metadata records can then be verified and consolidated to a unified and agreeable metadata record as discussed in Section 1.
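The overlapping patterns described in this section can be sketched with simple interval arithmetic. The Python fragment below is an illustration under stated assumptions, not the WAWI rendering code: each annotation is reduced to a (begin, end) pair of page coordinates, the end coordinate is assumed to be exclusive, and the labels returned by the hypothetical classify function are our own.

```python
def overlap(a, b):
    """Return the shared (begin, end) extent of two annotations, or None."""
    begin = max(a[0], b[0])
    end = min(a[1], b[1])
    return (begin, end) if begin < end else None


def classify(a, b):
    """Describe how two evidence extents relate to each other."""
    shared = overlap(a, b)
    if shared is None:
        return "disjoint"    # the cataloguers relied on unrelated passages
    if a == b:
        return "identical"   # same evidence; only the metadata can differ
    if shared == a or shared == b:
        return "nested"      # one extent is a finer-grained part of the other
    return "partial"         # extents cross, so the evidence only partly agrees


first = (10, 50)    # extent highlighted by one cataloguer (hypothetical values)
second = (30, 80)   # extent highlighted by another cataloguer
print(classify(first, second), overlap(first, second))
```

A renderer could then highlight the shared extent in one style and the non-shared parts in another, which is the kind of pattern figure 6 shows.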
3.3 Providing flexible and precise annotation of the evidence

Annotea uses XPointer to define how an annotation is related to the document. The location of the annotated text in the document is expressed with XPath, which uses the page element structure to point to a specific part of the document. However, XPointer can only point to text at an element boundary; it does not point to a specific text position. In Annotea, the annotation does not include the extent of the annotation and is unable to point to a passage that crosses element boundaries. In our WAWI annotation system, the page coordinate approach was developed to provide these features. The page coordinate approach works by serializing the document as a sequence of text, omitting the document element structure. With this sequence of text as a coordinate system, the precise position and extent of an annotation are recorded as the start and end positions of the text in the document.
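The page coordinate idea can be sketched as follows. This Python fragment is a simplified illustration, not the WAWI implementation: it assumes that serialization simply concatenates the text nodes in document order, that coordinates are character offsets with an exclusive end, and that whitespace is left untouched. The names TextSerializer, serialize, and locate are hypothetical.

```python
from html.parser import HTMLParser


class TextSerializer(HTMLParser):
    """Collect only the text nodes, discarding the element structure."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def serialize(html):
    """Flatten a page into the text sequence used as the coordinate space."""
    parser = TextSerializer()
    parser.feed(html)
    parser.close()
    return "".join(parser.parts)


def locate(text, phrase):
    """Return (begin, end) offsets of the first occurrence of phrase, or None."""
    begin = text.find(phrase)
    return None if begin < 0 else (begin, begin + len(phrase))


# The highlighted phrase crosses the boundary of the <b> element.
html = "<p>Occupational <b>Safety</b> and Health Division</p>"
text = serialize(html)
span = locate(text, "Safety and Health")
print(span, repr(text[span[0]:span[1]]))
```

Because the element structure is dropped before the offsets are computed, a passage that starts inside one element and ends in another is addressed in the same way as any other passage.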


Figure 7. Speech, press release, and FAQ files in the Web archives of the Occupational Safety and Health (OSH) Division of MOM circa 2004.

3.4 Relate ontology to metadata in relational metadata

As shown in figure 7, the OSH archived Webpages circa 2004 have three metadata records corresponding to Speech, Press Releases, and FAQ files of OSH. The FAQ metadata record for the FAQ files of the Web page in figure 8 is demonstrated below; the url and datecreated attributes indicate that it was archived in 2004:
<annoschema id="{guid}" type="metadata" datecreated="23 09 2004" datemodified="23 09 2004" createdby="ichsan" modifiedby="ichsan" url="http://web.archives/2004/www.mom.gov.sg/OSHD/">
  <ref>
    <nodeid>{GUID17}</nodeid>
    <nodename>FAQ File</nodename>
  </ref>
  <annoElements>
    <Title id="1" begin="34" end="63" value="Nicoll Highway Investigations" meta="Industrial Accident"></Title>
    <Subject id="4" begin="752" end="777" value="Frequently Asked Questions" meta="Frequently Asked Questions"></Subject>
  </annoElements>
</annoschema>

Note that the additional <ref> element, like the CREAM <ref> attribute (Handschuh et al. 2001), provides a pointer to the ontology node FAQ File with {GUID17}. This constitutes the additional relational metadata that link the metadata to the ontology. As shown in figure 6, each node of the ontology (displayed in the left-hand frame) has its corresponding metadata, which are displayed in the right-hand frame. The referring path to the FAQ File node above is then: MOM → OccupationalHealthSafety → OSHInspectorate → IndustrialAccidents → NicollHighwayCollapse → FAQ.
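To illustrate how a nodeid could be resolved to such a referring path, the Python sketch below walks from an ontology node up to the root. The data layout is an assumption made for illustration: the ontology is held as a flat table of node id to (name, parent id), and every identifier other than {GUID17} is a hypothetical placeholder.

```python
# Hypothetical ontology table: node id -> (node name, parent node id).
# Only {GUID17} appears in the paper; the other ids are placeholders.
NODES = {
    "n1": ("MOM", None),
    "n2": ("OccupationalHealthSafety", "n1"),
    "n3": ("OSHInspectorate", "n2"),
    "n4": ("IndustrialAccidents", "n3"),
    "n5": ("NicollHighwayCollapse", "n4"),
    "{GUID17}": ("FAQ", "n5"),
}


def referring_path(node_id, nodes):
    """Return the node names from the ontology root down to node_id."""
    names = []
    seen = set()
    while node_id is not None:
        if node_id in seen:  # guard against a malformed, cyclic ontology
            raise ValueError("cycle detected in ontology")
        seen.add(node_id)
        name, parent = nodes[node_id]
        names.append(name)
        node_id = parent
    return list(reversed(names))


print(" -> ".join(referring_path("{GUID17}", NODES)))
```

Since the metadata record stores only the node reference, the same ontology path can lead to records archived in different years.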

Figure 8. FAQ node in the MOM ontology and its linking metadata record.


The View Page button allows the user to see the related Web page with the metadata and the evidence, as shown in figure 8. The ontology remains the same for the current Web archives in 2005 (figure 3b). As discussed in section 2, although there is no FAQ on the current website, a user accessing it can still rely on the corresponding ontology to reach the archived FAQ Web materials from 2004. This access allows the user to research various cultural and heritage concerns, including how Singapore's MOM conducted its public education programme on the public hearing of the OSH Inspectorate's committee reports.
3.5 WAWI Cataloguing and Annotation System for Web Archives

An overview of the WAWI system is given here, followed by a demonstration in the following sections. The WAWI annotation and cataloguing system works with annotation schema defined by the schema creator. A librarian in a library or the moderator in a user community can create schemas for both ontology and metadata. Annotation schema is represented as an XML document. It is platform-independent, and supports the hierarchical structure (multi-level), whereby one node can be drilled down to several sub-nodes for more detailed annotation. Using Dublin Core elements as an example, Date can be drilled down to CreatedDate and IssuedDate; Coverage can be drilled down to Temporal and Spatial. The annotation schema then serves as a template for cataloguing of Web pages and annotation of the evidence. Section 3.5.1 further explains the annotation schema. For public records or materials that require specialized knowledge to organize, professional cataloguers may be enlisted. Otherwise, for materials that do not require specific skills or knowledge, users in a community can catalogue the websites and provide the evidence for cataloguing using the WAWI system. Librarians, managers, or moderators who are required to ensure the consistency and quality of the catalogue will then confirm this evidence against the catalogue records. Section 3.5.2 further explains the process and ways to administer and use the WAWI cataloguing and annotation system. The result of the annotation is captured by filling specific values in the template based on the annotation schema.
3.5.1 WAWI annotation schema. Conceptually, the WAWI annotation schema consists of the following major components:
• annotation title;
• annotated text (or the evidence);
• user input and comment (or the metadata value);
• permission or access rights.

Other information specified in the annotation schema includes id, name, datecreated, datemodified, createdby, modifiedby, editable, and url. All of this information is found in the attributes of the annotation schema. The id attribute is a unique, system-generated GUID.


Each element contains begin, end, value, and meta attributes. Begin and end attributes are used to denote the page coordinates of the annotation in a Web page. The annotated text is stored in the value attribute, and the meta attribute is stored with the metadata, input by users. The text in the value attribute then serves as evidence to the metadata in meta attribute. The following XML document is a sample annotation schema that contains Product, Company, and Price information and further details under them:
<?xml version="1.0" encoding="utf-8"?>
<annoschema id="{guid}" type="metadata" name="CatalogueTask" datecreated="23 09 2005" datemodified="23 09 2005" createdby="ichsan" modifiedby="ichsan" editable="yes" url="">
  <product id="1" begin="" end="" value="" meta="">
    <category id="2" begin="" end="" value="" meta=""></category>
    <model id="3" begin="" end="" value="" meta=""></model>
    <name id="4" begin="" end="" value="" meta=""></name>
  </product>
  <company id="5" begin="" end="" value="" meta="">
    <name id="6" begin="" end="" value="" meta=""></name>
    <address id="7" begin="" end="" value="" meta="">
      <building id="" begin="" end="" value="" meta=""></building>
      <postalcode id="10" begin="" end="" value="" meta=""></postalcode>
    </address>
  </company>
  <price id="11" begin="" end="" value="" meta="" />
</annoschema>
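As an illustration of the template idea, the Python sketch below builds an empty multi-level schema for the Dublin Core drill-down mentioned in section 3.5 and then fills one element with an annotation result. It is a sketch under assumptions rather than the WAWI code: the schema name DublinCoreTask, the helper names, and the sample coordinates and values are all hypothetical.

```python
import xml.etree.ElementTree as ET


def new_element(parent, tag, element_id):
    """Add an empty annotation element carrying the four annotation attributes."""
    return ET.SubElement(parent, tag, {
        "id": str(element_id), "begin": "", "end": "", "value": "", "meta": "",
    })


def build_template():
    """Build an empty two-level schema: Date and Coverage with sub-nodes."""
    schema = ET.Element("annoschema", {"type": "metadata", "name": "DublinCoreTask"})
    date = new_element(schema, "Date", 1)
    new_element(date, "CreatedDate", 2)
    new_element(date, "IssuedDate", 3)
    coverage = new_element(schema, "Coverage", 4)
    new_element(coverage, "Temporal", 5)
    new_element(coverage, "Spatial", 6)
    return schema


def fill(schema, path, begin, end, value, meta):
    """Record one annotation result in the template element found at path."""
    node = schema.find(path)
    if node is None:
        raise KeyError(path)
    node.set("begin", str(begin))
    node.set("end", str(end))
    node.set("value", value)  # evidence highlighted on the page
    node.set("meta", meta)    # metadata entered by the cataloguer


schema = build_template()
fill(schema, "Date/IssuedDate", 120, 130, "23 09 2004", "2004-09-23")
print(ET.tostring(schema, encoding="unicode"))
```

The template stays the same for every page catalogued under the schema; only the attribute values change from one annotation result to the next.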

3.5.2 Cataloguing and annotation process based on the WAWI system. Overall, the flow of the system is divided into three stages. The first stage is the schema preparation stage. An annotation schema is created and saved in an XML database. The annotation schema can be modified. The second stage is the annotation process. At this stage, the annotation schema is loaded in the browser, together with the target Web page to be catalogued and annotated, which is retrieved from the archives repository. By clicking and dragging, the targeted portion of the text under consideration is highlighted and captured in the annotation schema template. After users have finished with the annotation, it is saved at the server side for subsequent retrieval and verification of catalogue records. The third stage is to search and retrieve the metadata and evidence from the previous cataloguing and annotation process and to confirm the metadata against the evidence in the catalogue records.

Figure 9. Context diagram of the WAWI annotation and cataloguing system. [Diagram labels recovered from the figure. Actors: Librarian, Cataloguer, Managers. Flows to and from the Web Annotation System: Create/Edit/Delete Schema; Annotate based on the schema (Annotation Result); Annotated Page; Reports.]

For a better understanding of the system, the context and use-case diagrams of the WAWI cataloguing and annotation process are given in figures 9 and 10. The context diagram shows that there are three types of actors interacting with the annotation system. The Librarian is the actor who creates, edits, and deletes an annotation schema. The Librarian can also catalogue and annotate the Web archive materials. The Cataloguer is the actor who annotates the Web pages based on the annotation schema created by the Librarian. Cataloguers can also retrieve, view, and modify the annotations. The last actor is the Manager, who can view the annotations made by cataloguers and generate reports from the annotation data. The detailed processes and interactions of each actor in the annotation system are shown in the use-case diagram in figure 10. The annotation data can be used by the Manager actor to produce management reports. We shall discuss the details of the reporting system and reports in a separate paper.
3.5.3 System demonstration. Based on the description in section 3.5.2, the system demonstration is divided into three parts: (1) Schema Preparation, (2) Annotation/Cataloguing Process, and (3) Retrieval and Verification. The system is implemented with a Web-based client/server architecture. The client side requires only a Web browser with JavaScript enabled. The server side requires a Web server (Apache), together with programming of a database server (Berkeley XML Database) and a servlet container (Tomcat).

Figure 10. Use-case diagram of the WAWI annotation and cataloguing system. [Diagram labels recovered from the figure. Actors: Librarian, Cataloguer, Manager. Use cases: Create Schema, Modify Schema, Delete Schema, View Annotated Page, Retrieve Annotated Page, Retrieve Schema, Annotate Page, Save Annotation, Generate Report.]

3.5.3.1 Schema preparation. Librarian actors use the Annotation Schema Manager to create annotation schemas. As shown in figure 11, the annotation schema is represented in a Tree view. Librarians can click the Save Schema button to indicate that they have finished creating or modifying the schema. The system then converts this tree view to an XML document and stores it in the database.

Figure 11. Annotation schema manager.

3.5.3.2 Annotation/cataloguing process. The cataloguer actor uses the Annotation Panel to annotate Web pages. As shown in figure 12, the panel has two frames. The left-hand frame displays the Web page, while the right-hand frame displays the annotation schema. When the user selects an annotation schema from the drop-down list in the right-hand frame, it is retrieved from the XML database and sent to the client as an XML document. The XML document is converted to a DOM object and rendered in a Tree view. Next to each tree node is a textbox for users to enter the metadata. The annotation evidence is automatically extracted and copied to the textbox, and users are free to change the value in the textbox. The left-hand frame displays the archived Web page and the associated annotation. The Web page displayed can be further annotated using the devices available in the right-hand frame. Lastly, during the verification stage, the left-hand frame also displays the different overlapping effects of evidence. As shown in figure 13, two opposing annotations have been entered by two cataloguer actors in the right-hand frame. The left-hand frame then renders overlapping and non-overlapping highlighted text, indicating how the disagreement may have arisen from the evidence applied.

3.5.3.3 Relate the metadata to the ontology. Metadata and ontology are related through the <ref> element in the metadata annotation schema, as demonstrated in figure 14. At the user interface, there is a ref node, which consists of nodeid and nodename nodes, whereby the user can relate these metadata to a specific node of the ontology. Clicking on the "..." button next to the nodeid node brings up the ontology window, in which the user can select the node of the ontology to which these metadata relate.


Figure 12. Annotation process and result on a Web page.

Figure 13. Overlap annotation reflecting disagreeing evidence in collaborative cataloguing.

Figure 14. Relating the metadata with ontology.

3.5.3.4 Metadata and evidence search

All the metadata and evidence captured during the annotation process can be searched. The search engine displays all the fields of the results that match the search parameters. Each search is translated into an XQuery query against the XML database. As shown in figure 15, the search panel is divided into two frames. The left frame lists all the searchable fields, each with a textbox in which the user can enter search keywords. When the user clicks the Search button, the system performs the search and displays the results in the right frame. In the result frame, each entry has a link in the Title column. This link takes the user to the archived Web page and its associated annotation, as shown in figure 13. The search function in the search panel can easily be extended to perform a browse function. An integrated search on Web archive materials and metadata may be useful at times. To achieve this, the free-text and URL search available in WERA via NutchWax can be integrated into the WAWI metadata search engine.

Figure 15. Search and search-result user interface.
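To make the translation of a field-by-field search into XQuery more concrete, here is a minimal sketch in Python that builds a query string from the keywords entered in the search panel. It is not the WAWI implementation: the collection name, the metadata element, the field names and both helper functions are assumptions for illustration, and only standard XQuery functions (collection, contains, lower-case, string-join) are used.

```python
from typing import Mapping


def xquery_string_literal(value: str) -> str:
    """Quote a Python string as an XQuery string literal."""
    # In XQuery, '&' must be written as an entity and '"' is escaped by doubling.
    escaped = value.replace("&", "&amp;").replace('"', '""')
    return f'"{escaped}"'


def build_metadata_query(criteria: Mapping[str, str],
                         collection: str = "wawi-annotations") -> str:
    """Build a case-insensitive keyword search over hypothetical metadata records."""
    clauses = []
    for field, keyword in criteria.items():
        keyword = keyword.strip()
        if not keyword:
            continue  # an empty textbox places no constraint on the search
        if not field.isidentifier():
            raise ValueError(f"unsupported field name: {field!r}")
        literal = xquery_string_literal(keyword.lower())
        clauses.append(
            f"contains(lower-case(string-join($m/{field}//text(), ' ')), {literal})"
        )
    query = f"for $m in collection({xquery_string_literal(collection)})//metadata"
    if clauses:
        query += "\nwhere " + "\n  and ".join(clauses)
    return query + "\nreturn $m"


if __name__ == "__main__":
    # Hypothetical search: one keyword in the title field, other fields left empty.
    print(build_metadata_query({"title": "Accident", "creator": ""}))
```

Escaping user input before it reaches the query string matters in any such translation, since keywords typed into the textboxes could otherwise alter the structure of the query.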
4. Conclusion

Cataloguing is a timeless and fundamental practice for organizing information, regardless of the type of material. However, the growth of the Internet continues to outpace attempts to describe it. With the help of Internet technologies and the proposed WAWI system, it is hoped that more collaborative efforts among information professionals, and even the public, can be effectively mobilized to help catalogue the Web. Web annotation is one of the most intuitive ways to transform the Web into one that allows greater interaction between systems. This paper proposes a context-aware Web annotation system that can provide evidence for, and preserve the context of, the catalogued records of the materials within a Web archive. It enumerates how such a system can help archivists ensure the quality of the records by being able to:

• relate semantic content in the metadata to Web contents;
• render agreement, disagreement and different granularities of evidence;
• provide flexible yet precise annotation of the evidence;
• relate the ontology to the metadata in the form of relational metadata.


Such a system is also congruent with the tagging movement, exemplified by Technorati, Flickr, and del.icio.us, which itself reflects a growing trend of leveraging collective efforts to organize materials on the Internet. A context-aware annotation system helps assure the quality of the materials being organized in a Web archive: the reasoning behind each decision to annotate Web materials is made visually obvious, and an inconsistency resolution mechanism like those found in Wikipedia can be invoked to resolve discrepancies immediately or to reserve them for future resolution. A review of existing Web archive cataloguing and access practices was carried out to assess whether the WAWI Web annotation system is comparable with state-of-the-art ways of organizing Web archive materials. By linking Web-archived and current materials via an ontology, we also concretely demonstrated how better-quality access can be achieved to facilitate a historical understanding of a government's handling of accidents on a national scale. With evidence and context annotation in the cataloguing process, the collaborative efforts of a community of users and archivists to maintain the catalogue are facilitated. This effectively opens up new horizons for creating a Web archive that is at once more research-oriented and more flexible in its approach, and that copes with the changing needs of users. All this is achieved while the archive remains robust enough to present its holdings meaningfully through time.
References
S. Bird and M. Liberman, Annotation Graphs as a Framework for Multidimensional Linguistic Data Analysis, in Proceedings of the ACL '99 Workshop Towards Standards and Tools for Discourse Tagging, College Park, MD, 21 June 1999, pp. 1-10.
S. Handschuh, S. Staab and A. Maedche, CREAM - Creating relational metadata with a component-based, ontology-driven annotation framework, in Workshop on Knowledge Markup and Semantic Annotation at the First International Conference on Knowledge Capture (K-CAP2001), Victoria, BC, Canada.
J. Kahan, M.R. Koivunen, E. Prud'Hommeaux and R. Swick, Annotea: An Open RDF Infrastructure for Shared Web Annotations, WWW10, 1-5 May 2001.
C. Lampos, M. Eirinaki, D. Jevtuchova and M. Vazirgiannis, Archiving the Greek Web, 2004. Available online at: http://ww.iwaw.net/04/proceedings/Lampos.pdf (accessed 5 June 2006).
R. Pearce-Moses and J. Kaczmarek, An Arizona Model for Preservation and Access of Web Documents, DttP: Documents to the People, 33:1, pp. 17-24, 2005.
S. Schneider, K. Foot, M. Kimpton and G. Jones, Building Thematic Web Collections: Challenges and Experiences from the September 11 Web Archive and the Election 2002 Web Archive, 2002. Available online at: http://bibnum.bnf.fr/ECDL/2003/proceedings.php?f0schneider (accessed 5 June 2006).
F. Upward, Structuration Theory and Recordkeeping, 1998. Available online at: http://www.sims.monash.edu.au/research/rcrg/publications/recordscontinuum/fupp2.html (accessed 5 June 2006).
P. Wu and A. Heok, Is Web Archives A Misnomer - How Web Archives Can Become Digital Archives?, in Proceedings of the Asia-Pacific Conference on Library & Information Education & Practice: Preparing Information Professionals for Leadership in the New Age, C. Khoo, D. Singh and A. Chaudhry, 2006, pp. 298-350.
P. Wu, I. Tamsir and A. Heok, Applying context-sensitive Web annotation in evidence-based, collaborative Web archives cataloguing, in Proceedings of the International Workshop on Archiving Web, 2006. Available online at: http://www.iwaw.net/06/PDF/iwaw06-proceedings.pdf
P. Wu and Y.L. Theng, Weblog Archives: Achieving the recordness of Web archiving, in Proceedings in the Ninth International Cultural Heritage Informatics Meeting, 21-23 September, ICHIM 05, Paris, 2005.
