
Alexandru Ioan Cuza University of Iași, Faculty of Computer Science

Practical Semantic Works Bridging the Web of Users and the Web of Data
M.Sc. candidate: Anca-Paula Luca
Scientific coordinator: dr. Sabin Corneliu Buraga

Iași, June 2009

Abstract

In the current times, of passing from the Social Web, the read-write web, to the data web, where information will be equally accessible to humans and machines, multiple initiatives have emerged to make this transition smoother, to enable access to data for computers. Unfortunately, a general trend can be noticed of disregarding the user in this whole process: the data web level is built as a one-way next step of the older users' web. In this thesis we analyze existing approaches and explore the problem of creating a system which enables users as peers in the data web communication, with particular concern for not requiring technical background from the user while, at the same time, preserving the rigorousness and denotative characteristics of the data web; we materialize these ideas in a semantic data retrieval and reuse tool.

Contents
1 Motivation
  1.1 Users web and data web: living together
  1.2 Making the Semantic Web relevant in a Social Web context
2 Modelling knowledge on the web
  2.1 The Semantic Web
  2.2 Semantics on the Social Web
  2.3 Semantic descriptions and natural language processing
3 Employing the semantics practically
  3.1 Retrieving the information
    3.1.1 Semantic search
    3.1.2 Search for the user: Wolfram Alpha, Google Squared
    3.1.3 Searching the Web of linked data
    3.1.4 Ontology alignment for user language property resolution
  3.2 Presenting the results to the user
  3.3 Reusing the semantics
    3.3.1 Translating semantic web resources in RDFa and eRDF
    3.3.2 Generating microformats from semantic web described resources
    3.3.3 Maximizing user intent
4 Putting it all together
5 Position with respect to related approaches
6 Proof of concept and case study
7 Conclusions and further work

Chapter 1

Motivation
The current stage of the web is the Social Web [15], where the users are the peers in the communication graph, creating and exchanging information freely, while the machines act only as the storage and transportation means for this content, with little involvement in data creation and consumption. The idea of the next step, the data web, is that all information manipulated on the web by machines is also understandable to them, so that they can participate in the communication process from positions comparable to those of humans, easily accessing and reusing the data. The data web initiatives define formalisms and models for human knowledge, with formats for describing these formalisms as well as tools for processing them, in order to encapsulate the mechanisms and assets of the social web in machine-understandable form.

1.1 Users web and data web: living together

The general tendency is to look at this transition as a one-way evolution to the next level: the users' web is to be transformed into the data web, but without a link back. Currently, the Social Web, the existing human-oriented layer of the web, is regarded mostly as a transport layer for the Semantic Web, or as a source of data to be extracted, stored and used by the data layers. The initiatives are oriented towards encapsulating the data of the semantic web in the plain HTML layer of the users' web, such that it can be extracted further by the tools which can understand and process this kind of data, or towards extracting / inferring semantics from human natural language works on the web. But the social web has its own methods of semantic markup, along with the tools to process them, approaching semantics on the web from the perspective of the user, regarding the semantic units as created "for humans first", a phrase which expresses not only the readability of such data but also its usability, with practice as the final purpose and measure of necessity. Markup and processing are to be done only if they can bring a benefit to

the user, and in such a way that they will not impede the user from understanding the data, always starting from simple, purpose-specific cases, almost never reaching general expressivity, since practice rarely requires it. Also, this approach does not intend to redefine everything in a machine-readable form, but to use existing standards and tools, along with conventions, to create machine-readable data. Existing data web initiatives focus on extracting these social web semantics and, upon giving them back to the web, they generally fail to provide a good user interface for access to this data for non-trained users (data web protocols and formats, technical language, non-intuitive UIs, etc.). We believe that the reverse tools can and should be offered: an interface to address the data web information with social web tools, to enable users to fetch information from the data layer without necessarily having to understand its means and formalisms, in a human-accessible form, to render the semantic web relevant to users. Also, upon reusing this data in the read-write web, producing content further on the web in so-called real-world formats [4], the semantics of the retrieved data should be preserved, but without requiring awareness from the users in this respect, and without recreating the semantics from scratch: the reverse tools, which transform the data layer semantics into the users' web lower-case semantics1, should be available.

1.2 Making the Semantic Web relevant in a Social Web context

There exist multiple initiatives for making domain-specific knowledge available to humans and computers equally, where users as well as computers employ the same formalisms and the communication interfaces are well defined and implemented, but these approaches represent only islands of the web where the two layers are well mixed together. We focused our research on semantic search tools available to users, as representing a general purpose mechanism of user access to the Semantic Web, along with the means for users to reuse the information thus obtained in a way which is, in its turn, relevant to the Semantic Web, by preserving the semantics. Specifically, what we had in mind when exploring the topics to follow is a semantic search tool which would allow the user to find a missing piece of information (the author of The Catcher in the Rye, or all papers of Tim Berners-Lee) and, upon reusing it in a page he is creating, a blog article for example, to be able to preserve its semantics in a way which is relevant for the created social web content, and to allow other tools to propagate the semantic annotation of the data. Otherwise put, we are exploring the creation of a tool
1 Simple semantics embedded in (X)HTML, evolutionary not revolutionary, designed with human usage as first usage in mind; see [4].

which would offer the mechanisms for social peers to become participants in the Semantic Web, without changing their habits. Such a system has several key components to address, which we will follow throughout this paper and try to provide solutions for:

- we need a method to access the Semantic Web, as a whole, for search purposes, using a user-accessible interface which is able to provide answers to queries, not only to look up documents that contain keywords;

- upon retrieving the results, they need to be presented to the user in a relevant manner, in such a way that it encourages him to reuse the information along with its semantics;

- when one of the retrieved results is reused in the user-created content, the semantic markup must be preserved.

Figure 1.1 provides an overview of today's landscape of semantic web initiatives: while multiple approaches exist to extract semantic data from the social web, along with domain-specific applications, there is still room for improvement in allowing users to comfortably access the knowledge of the semantic web as a whole. Also, on the data publishing side, while users can easily read and write content on the social web, the mechanisms for data publishing are still to evolve. In this paper we will focus on the components left unanswered, handling the data flow from the semantic web to the social content through the users in a homogeneous manner, as a user read-and-write process on the data web.

Figure 1.1: Overview of the web semantics approaches

In the next chapter we will provide context information about the current directions for dealing with knowledge rather than documents on the web; we then discuss, in the third chapter, how the above issues can be handled in this context. The fourth chapter closes the circle by providing an overview

architecture of the designed system, followed by a presentation of the related tools and how this proposal stands out. A case study is detailed in the sixth chapter along with a proof of concept implementation. In the end, we draw the conclusions and present the directions for further research on this matter.

Chapter 2

Modelling knowledge on the web


2.1 The Semantic Web

The target of the Semantic Web is to define a framework that makes the data available on the web understandable and addressable by machines. The initiative focuses on the definition of formalisms, data formats and mechanisms to understand this data, along with the aim to preserve the web content as a graph, but this time a graph of data rather than one of documents. Its building block is the Resource Description Framework [39] which, in the RDF Primer [40], is defined as:

Definition 1 The Resource Description Framework (RDF) is a language for representing information about resources in the World Wide Web.

The core concept is that of making statements about (web) resources (the so-called RDF triples), of the form:

Subject has a property whose value is an object.

in which the key components are:

- the subject: the resource being described (for example, a web site);

- the property, or predicate, which is the actual characteristic of the subject on which the statement focuses (for example, creator);

- the object, which provides the actual value of the property for the described subject (for example, John Smith).
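To make the statement model concrete, the example above ("the web site has a creator whose value is John Smith") can be written as a single triple. The following is a minimal sketch using the Python rdflib library and the Dublin Core creator property; the subject URI is an example of ours, not part of the RDF specification.

from rdflib import Graph, Literal, Namespace, URIRef

DC = Namespace("http://purl.org/dc/elements/1.1/")

g = Graph()
# Subject: the web site; predicate: creator; object: the literal "John Smith".
g.add((URIRef("http://www.example.org/"), DC.creator, Literal("John Smith")))

# RDF defines several serializations of the same triple; XML is one of them.
print(g.serialize(format="xml"))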

Using these simple assertions, for which RDF defines diverse means of serialization (in XML, N3, etc.), complex representations of things can be built by composing them and interlinking resource descriptions. In the context of describing data, RDF and the Semantic Web promote the idea of an Open World, where descriptions are not meant to be interpreted as exhaustive but only as statements about the knowledge from one point of view: the fact that some things are not contained in a description does not mean they don't exist, only that the description author does not know them, or cannot state anything about them; another source could provide those descriptions ("Anyone can say anything about any topic"). To enhance knowledge sharing and interconnectivity, the Semantic Web proposes linking these descriptions in a web of linked data, achievable by following the next guidelines, as described by Tim Berners-Lee [2, 3]:

- URIs are to be used to name things on the web, such that every thing (practical or abstract) on the web is uniquely identifiable;

- the URIs should be HTTP URIs, allowing to look up those names on the web. This is also known as the dereferenceable URIs rule: besides having a unique identifier for anything on the web, the URI that identifies the thing should direct to the place where more information about that thing is found;

- when such a URI is looked up, useful information must be provided. In other words, useful descriptions (complete, correctly described to be machine understandable) should be found at the things' locations;

- descriptions should include links to other URIs to allow discovery of more things: when something on the web relates to something else on the web, the links to the specific resource identifiers should be used, to create, overall, a graph of knowledge.

These ideas are applied and supported by the Linked Data [62] initiative, which tries to provide a platform for interlinking the existing semantic resources on the web, centralizing the information and tools to help create the web of linked data. Under this umbrella, there are numerous approaches focused solely on creating the needed descriptions, on "triplifying" the web, to transform as much content on the web into described resources as possible. A minimal sketch of dereferencing such a URI follows.
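The sketch below illustrates the dereferenceable URIs rule with rdflib: the same URI that names the thing is looked up to load its description. It assumes the server performs content negotiation and returns an RDF representation, which is the case for repositories such as DBpedia.

from rdflib import Graph

g = Graph()
# Dereference the name: the URI that identifies the resource
# also locates its machine-readable description.
g.parse("http://dbpedia.org/resource/The_Catcher_in_the_Rye")

print(len(g), "triples found in the retrieved description")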

The next step in the description stack are the ontologies, which evolved from the need to represent concepts and relations between concepts, to enable formal definitions of domain knowledge and reasoning about that knowledge, in a rigorous (potentially machine understandable) form. The simplest and most widely cited definition of an ontology in computing is the one in [14], presented below.

Definition 2 An ontology is the specification of a conceptualization.

This definition encompasses two main aspects: the fact that it offers an abstract, simplified version of the world (conceptualization) in a formal and declarative representation, implicitly processable by a computer [13]. In the context of ontologies for web data, Ian Horrocks [20] describes an ontology as "[...] an engineering artefact, usually a model of (some aspect of) the world; it introduces vocabulary describing various aspects of the domain being modelled, and provides an explicit specification of the intended meaning of the vocabulary." Finally, a formal definition of the ontology, suitable for its use in logic or in formal algorithms and tools operating with the ontology, is provided by [10].

Definition 3 An ontology is a tuple o = ⟨C, I, R, T, V, ≤, ⊥, ∈, =⟩, such that:

- C is the set of classes;
- I is the set of individuals;
- R is the set of relations;
- T is the set of datatypes;
- V is the set of values (C, I, R, T, V being pairwise disjoint);
- ≤ is a relation on (C × C) ∪ (R × R) ∪ (T × T) called specialization;
- ⊥ is a relation on (C × C) ∪ (R × R) ∪ (T × T) called exclusion;
- ∈ is a relation over (I × C) ∪ (V × T) called instantiation;
- = is a relation over I × R × (I ∪ V) called assignment.

To represent these formalizations on the web, RDFS (the RDF Schema) [41] was defined as an extension of RDF to include basic features for describing simple ontologies (vocabularies, taxonomies) by assigning additional meaning to special resources (e.g. rdfs:Class, rdfs:subClassOf, rdfs:range, rdfs:domain, etc.) [20]. For higher expressivity, web ontology languages were defined to allow the description of more complex relations, such as the cardinality of a property, characteristics of properties (symmetry, inverse), class relations (such as disjunction), or the equivalence of concepts. From multiple proposals for a language to formalize all these, OWL (Web Ontology Language) [38] has emerged to the stage of Web standard today, being a W3 Consortium Recommendation, based on RDF and RDFS and using an XML serialization format, to be easily exchangeable on the web. A small vocabulary sketch follows.
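As an illustration of these RDFS primitives, the sketch below declares a two-class taxonomy and a typed property with rdflib; the vocabulary URIs are invented for the example, except the FOAF class.

from rdflib import Graph, Namespace, URIRef
from rdflib.namespace import RDF, RDFS

EX = Namespace("http://example.org/vocab#")

g = Graph()
g.add((EX.Book, RDF.type, RDFS.Class))
g.add((EX.Novel, RDF.type, RDFS.Class))
g.add((EX.Novel, RDFS.subClassOf, EX.Book))   # taxonomy: a novel is a book
g.add((EX.author, RDF.type, RDF.Property))
g.add((EX.author, RDFS.domain, EX.Book))      # what the property describes
g.add((EX.author, RDFS.range, URIRef("http://xmlns.com/foaf/0.1/Person")))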

Along with the mechanisms to represent the knowledge on the web, tools have been developed to access these new formats in a structured manner. From multiple early initiatives, SPARQL (SPARQL Protocol and RDF Query Language) [43] is specified by the W3 Consortium as the query language for RDF data, allowing access to resources in the triple statement data model and defining the access guidelines to expose such data through services. For the higher level of the ontologies, the OWL Query Language (OWL-QL) [11] is proposed as a sophisticated knowledge query and answering framework for data represented in ontologic formalisms. As a result of transforming the web into a web of (described) data, approaches to using the data available on the semantic web have appeared as well. One of the most straightforward ideas is that of exchanging and reusing the described data between services: successful approaches are the ones which use specific data repositories, taking advantage of the semantics of the data (such as DBpedia Mobile [49] or the Researchers Map mashup [71]), but also the ones which use the available semantic data to enrich an existing application with supplemental information (the way Revyu [72] enhances its reviews about movies with information about the specific movie from the DBpedia [48] repository). Also, the idea of faceted browsing has evolved into real applications, allowing the user to view different characteristics of the content based on the semantics of the data (such as Simile Longwell [78]), and multiple initiatives try to provide a data browser rather than a document one (as the Tabulator [85] extension transforms Firefox). A detailed description of the current state of the field of initiatives related to the linked data web is provided by [3]. Of special interest for our purpose are the semantic search tools, which we will discuss in the next chapter, in the context of the targeted system design. A minimal query sketch follows.
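The sketch below shows such structured access, querying the public DBpedia SPARQL endpoint through the Python SPARQLWrapper library; the dbo:author property and the English rdfs:label used in the pattern are one plausible way DBpedia models this data, chosen for illustration.

from SPARQLWrapper import SPARQLWrapper, JSON

sparql = SPARQLWrapper("http://dbpedia.org/sparql")
sparql.setQuery("""
    PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
    PREFIX dbo:  <http://dbpedia.org/ontology/>
    SELECT ?author WHERE {
        ?book rdfs:label "The Catcher in the Rye"@en ;
              dbo:author ?author .
    }
""")
sparql.setReturnFormat(JSON)

for row in sparql.query().convert()["results"]["bindings"]:
    print(row["author"]["value"])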

2.2 Semantics on the Social Web

In the real-world web, the Social Web, the linked data initiatives have found their expression in mechanisms for annotating the content with semantics together with the HTML markup of the data: either creating links to data description files from the HTML documents (such as FOAF [52] works), or embedding the data description in the HTML markup itself. This latter category has received much attention lately, regarded as an accessible way for developers to create semantic markup for the content of their websites, and targeted by important initiatives such as the W3 Consortium GRDDL Specification [37], which aims to provide a general mechanism for the (real world) semantics to be transformed universally into described resources (RDF). There are three major approaches in this direction, which we will discuss next.


RDFa

In work as a Recommendation at the W3 Consortium, RDFa defines a set of extensions to the standard XHTML elements, using additional attributes (property, datatype, about, typeof, etc.), such that the descriptions of the data are contained in the XHTML itself, by mixing the XHTML and the RDF vocabularies in the same serialization. While the clear advantage is that it allows embedding actual RDF in the XHTML and carrying the semantics along with the representation, the disadvantage, compared to microformats, for example, is that it addresses generic problems and lacks focus on solving the semantic needs of the social web: it does not come as a practical variant of RDF but rather uses the XHTML layer as a transport layer for RDF.

Microformats

The microformats initiative [63], started in 2005, aims to provide simple conventions for regular XHTML markup on the web, to create a framework for frequent semantics in the web content to be expressed in a uniform manner, such that they can be extracted by automated tools. Stated to be designed "for humans first" and focused on keeping the content human presentable, they use the standard XHTML elements and classes, defining conventions for the elements' encapsulation and attribute values to express specific semantics (e.g. the hCard format for marking persons, hCalendar for marking events, geo for marking geographical positions, etc.), the main difference from the RDF initiatives being that they don't aim to provide a general framework for semantics to be expressed in (X)HTML, but rather to highlight the semantics where they exist on the web. Together with the related POSH (Plain Old Semantic HTML) [68] initiative (which preaches the idea of using XHTML elements according to their meaning and not for presentational purposes), they aim to elaborate guidelines and simple conventions that transform XHTML into a semantic layer rather than embed the semantic layer in XHTML. Because of their precise semantics and accessibility for developers (for which the term "lower-case semantic web" has been coined [4]), they are widespread in social web content nowadays and used by multiple user oriented tools, compared to other mechanisms; also see [26] for an overview of microformats and their usage.

eRDF

The eRDF idea [50], incubated at Talis1 by Ian Davis, is an approach that combines the two above, embedding RDF in XHTML using standard HTML element and attribute names as the microformats do. The idea is that,
1 http://www.talis.com/


instead of using specific attribute names as the microformats do to express precise semantics, generic conventions for defining names from any ontology can be used in the same manner, combining the microformats approach for data extraction and the RDFa approach for data interpretation in a mechanism as accessible as microformats and as expressive as RDFa.

The concern for real-world semantic markup is also propagated in the design of new protocols for the content web, materialized in the Microdata proposal for the HTML 5 standard [44], in work at the Web Hypertext Application Technology Working Group.

Even if, in the end, the target is the web of linked data, where information is all under the control and authority of the Semantic Web (tools would just enable users to create and consume data), until that point is reached and the tools are all ready to offer a proper interface to the semantic web for humans, the collaboration between the two is very important, as the means of making the web of data possible: the information is with the users nowadays and machines not only need a way to extract it, but also a way to give it back, to make further creation of data possible, to keep the Social Web, with its weak and ambiguous semantics, alive as part of the foundation on which the data web is built. Information, in its circuit on the world wide web, will always go through the users, and, until the semantic web tools are ready to create the adequate user interface in their own formalisms, the social web tools must be enhanced and given importance, as they represent the way data is relevant to users and encourage users to produce it.

In this direction, the important players on the web have recently turned their attention to the embedded semantics: since version 3.0, the Firefox browser implements a programming interface for extracting microformats from web pages; the Google search engine uses the hCard and hReview microformats, as well as data-vocabulary [46] RDFa descriptions, to detect the significant objects to be indexed in a page, and other Google tools mark their data with microformats; Yahoo produces microformats in all its applications and has also implemented a microformats search engine.

Traditionally, the way users produce semantic markup on the social web is by creating structured data in applications and relying on the latter to mark the data correctly. However, solutions exist for the cases of unstructured content: wikis which encourage the users to define data types and create data according to their own defined types (OpenRecord [66], Knoodl [59] or XWiki [93]), or which try to facilitate semantic annotation of content in the wiki syntax (Semantic MediaWiki [73]) but which currently fail to correctly mark up this data upon rendering it on the web. For blogs and content management systems, Zemanta [94] is a notable initiative which allows users to retrieve and link data semantically connected with the content they produce, to be discussed in detail in the dedicated chapter. A minimal extraction sketch follows.
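To illustrate how such convention-based markup is consumed, the sketch below pulls hCard data out of a page fragment with Python and BeautifulSoup; real extractors (such as the Firefox 3 API mentioned above) are far more thorough, this only shows the principle.

from bs4 import BeautifulSoup

html = """
<div class="vcard">
  <span class="fn">John Smith</span>
  <a class="url" href="http://www.example.org/~jsmith">home page</a>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
# hCard relies on conventional class names (vcard, fn, url, ...) on plain XHTML.
for card in soup.find_all(class_="vcard"):
    name = card.find(class_="fn")
    url = card.find(class_="url")
    print(name.get_text(strip=True), "-", url["href"] if url else "no URL")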

2.3 Semantic descriptions and natural language processing

Although, over the years of research in information retrieval and in the journeys towards making the web accessible to machines (where the main issue has always been understanding the human produced content), natural language processing has been a temptation and a promising path, the tendency nowadays is to avoid it, by building tools which allow semantics to be created along with the data rather than extracted from the data afterwards, in the spirit of the semantic web principles. The data web is designed as an answer to the need for natural language processing: it is to be a network of linked data with respect to which humans and machines are equal, not a data layer rooted in the users' layer and created by processing user language. For these reasons, we will address the problem of mediating the communication of data between users and computers using tools exclusively designed for described data (the semantic means), and not relying on natural language processing. Even if in some situations natural language processing tools could provide greater benefit, we will insist on doing it the semantic way.

In the described space of tools and problems, our proposals respond to the need of the user to access the semantic web as an equal peer, and insist upon passing the semantics from the retrieved data to the created content, through the users' hands, transparently, so that at no point is the rigorousness of the data web replaced by the connotation of the user's natural expression. We will describe, in the following chapter, a model of interaction, along with the ideas and approaches to achieve it by orchestrating existing data, services, formats and tools.


Chapter 3

Employing the semantics practically


We will discuss, in this chapter, the steps to create the three components of the system: data retrieval, the presentation of this data to the user, and the mechanisms to preserve the semantics in the social web content created by the users.

3.1 Retrieving the information

To reach the goal of allowing users to access the knowledge on the semantic web as a whole, with reuse of the results, we have a few characteristics in mind when analyzing the data retrieval options:

1. the search tool needs to be based on, or at least able to consume, descriptions of resources (RDF or another format), to constitute an element in the web of linked data chain and to present the guarantee of searching knowledge rather than searching text;

2. search is to be done in a natural-language-like expression: the user should be able to freely specify a resource and a property to find for that resource, all without being aware of the underlying technologies;

3. the search should also be able to take into account methods for a potential disambiguation of the user's intents, but these mechanisms are only to be used as additional help, not to be relied on;

4. the retrieved result should be structured and expressible as a described resource, using open standards (even if the human accessible UI would present it in a different form), such that the semantics of the original content from which the answer was retrieved is not lost on passing through the search system.


The target, when analyzing the search input, is to be able to handle a user query of the following type (considering that the separation of the Subject and property strings is handled by external tools, out of the scope of this paper):

What is the value of a property for a Subject?

where the Subject and the property are provided by the user, in non-technical language: the subject as a string, interpreted as a keyword, and the property as a natural language expression of his inquiry. Although more sophisticated needs can be imagined for the users, it is easy to notice that we modelled this data retrieval as the minimal description unit, a statement (as in the RDF model). The tools that can satisfy the above requirements are all united under the general concept of semantic search, which we will survey in the following.

3.1.1 Semantic search

A particular place in the semantic web applications landscape is taken by the semantic search tools. From the very beginning, researchers and companies have been interested in the way data description on the web could help information retrieval, and in how semantics would interfere with the search process. In the context of the two types of search, navigational and research directed [16], the semantic features predominantly facilitate the second, although initiatives also exist in the direction of navigational search. A survey of the existing approaches in the direction of semantic search was done in [19], and is maintained online at [74], where the term "semantic search" is described as covering all applications where semantics is used at any of the phases of the search process, using syntactic matching and semantic matching to represent the two ways in which a user queried string can be interpreted: as a keyword to be matched or as a query to be interpreted. Although multiple interpretations can be assigned to the term, a widely accepted definition of semantic search can be formulated as follows:

Definition 4 Semantic search is a process used to improve online searching by using data from semantic networks to disambiguate queries and web text in order to generate more relevant results.

A category apart among these tools are the search engines that allow users to find semantic web resources and documents, such as ontologies, amongst which OntoSearch [34], Swoogle [12, 84] or OntoKhoj [29]. This approach has little relevance for regular users: it addresses the technically trained specialist in search of datatype descriptions on the web rather than a user looking for a piece of information.

Also, a widespread type of tools are the ones focused on searching a specific repository of data with specific domain knowledge (such as Squiggle [5, 82], AquaLog [22, 45], SemSearch [23] or Kim [58]), and / or requiring the user to specify a priori the type of information he is searching for (as done in Kim, SHOE [18, 79], QuizRDF [6] or the semantic search built for OntoWiki [1, 65]). These tools are of no interest for a general purpose, user-accessible search, since we are analyzing flexible user interaction with the semantic web as a whole. In the early phases of the semantic search initiatives, the approach to semantic search was based on indexing plain web content, after inferring a structure or a data type from the document itself using various means related to standard information retrieval and text processing (as OWLIR [30] proposes), and then allowing the user to search for a specific type of data, or enhancing the results presentation with the data types thus extracted (as Squiggle or Freebase [53] present their results). These tools, although somewhat useful, have limited capacities of operating with semantics and are not built to handle general knowledge (being focused most of the time on modelling web documents), or don't provide semantic assistance until after the search was made. Today's general usage search tools which present semantic features fall into two categories: those which focus on inferring semantics from the user query and on the presentation of results in a semantic manner, and those which rely on the knowledge on the web to retrieve answers from data rather than documents.

3.1.2 Search for the user: Wolfram Alpha, Google Squared

On the side of the fully user oriented tools, WolframAlpha [88] was launched in May 2009, aiming to make the world's knowledge computable. It is designed as a computational knowledge engine, allowing sophisticated natural-language-like queries about various domains and providing the answer from its internal repository of knowledge, potentially not linked with the data web. In the same direction of structuring the knowledge on the web for users' answers, Google Squared [54] was launched at the beginning of June 2009 as an initiative to structure search responses on the fly, as retrieved objects with attached properties and values for those properties. The users are also allowed to define their own properties to search for, or objects to retrieve information about. However, the sources of the responses seem to be clustered data from regular document indexing combined with text processing, rather than described resources on the web, which, even if it makes the data sources wider and the probability of having responses bigger, brings a significant accuracy penalty. Since these tools provide structured answers to natural language queries, we analyzed the possibility of using them as a data retrieval backend, but

identified a series of impediments. First of all, although well set up as user instruments, none of these tools yet provides access as a service, through a programming API, to represent a reusable resource on the web. The second issue is that, although the answer is well structured, this structuring is only presentational: no semantic markup is attached to the answers, to be used by machines understanding the retrieved data. This second issue, especially, makes them unusable, since inferring the answer semantics without support from the service itself reduces to the problem of indexing knowledge from non-semantically marked content on the web. However, current trends in the evolution of the web tools make us believe that these aspects will be addressed soon, turning these services into powerful information retrieval tools, for humans as well as for machines. Unfortunately, even in the eventuality of an answer provided as structured data and of indexing resources of the semantic web, an intermediary knowledge model of the search engine (not based on the linked data principle) can deviate the responses from the original semantics of the sources on the web where they were originally found. To address these impediments, we look at the semantic web (the data graph) as a resource for resolving user queries, operating in the semantic data field and therefore preserving the semantics of all resources involved in the process.

3.1.3 Searching the Web of linked data

The Semantic Web approach towards data retrieval can be intuitively started (as done in [25, 70]) by using a structured query language (such as SPARQL) to access RDF data in a set of data repositories, potentially with their own defined ontologies, and by using conventions for choosing the correct denotation of the query components with respect to the queried data source. In the description formalism, the Subject is an object which can be identified by matching the user inserted value against RDF properties that provide a user accessible illustration of the object, such as the RDF Schema rdfs:label, or properties customized for the source data repository, which are known to be more relevant for the specific repository; a minimal sketch of such a label-matching lookup follows this paragraph. An approach giving full access to the semantic web as a whole through a query engine (SPARQL) is the Semantic Web Client Library [76], which is built as a programmer's universal access to the semantic web. These approaches however, based on SPARQL queries without proper infrastructure, are limited, and have already been taken one step further by today's second generation semantic search engines.
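A minimal sketch of the label-matching convention against a single repository; DBpedia is used as the example source, and the exact match on the English label is a simplification of the matching described above.

from SPARQLWrapper import SPARQLWrapper, JSON

def subject_candidates(keyword, endpoint="http://dbpedia.org/sparql", limit=10):
    """Resolve a user keyword to resource URIs by matching rdfs:label."""
    sparql = SPARQLWrapper(endpoint)
    sparql.setQuery("""
        PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
        SELECT DISTINCT ?s WHERE {
            ?s rdfs:label "%s"@en .
        } LIMIT %d
    """ % (keyword, limit))
    sparql.setReturnFormat(JSON)
    rows = sparql.query().convert()["results"]["bindings"]
    return [row["s"]["value"] for row in rows]

print(subject_candidates("The Catcher in the Rye"))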


Linked data search engines: SWSE, Sindice, Falcons

The actual step of the semantic search tools is represented by the search engines rooted in the Linked Data repositories and annotated documents (using GRDDL or their own means to parse embedded semantics), regarding the web as a collection of such (as opposed to trying to infer data descriptions from the user created documents). Three initiatives in this direction are the early SWSE [77, 17], initiated in 2005, the later Falcons [51] and Sindice [33, 81], focused on retrieving described resources (as RDF documents or fragments) from the semantic web as a whole, based on a keyword. However, the interface of these tools is rather technical: the users need to have at least notions of semantic web formalisms (ontologies, properties, prefixes, URIs) and, further, of the mechanisms and tools to process them, in order to be able to query and interpret the results. They are, therefore, to be regarded more as building blocks than as finished semantic search tools. Their resolving of the search subject is done using full text indexes of the text values that appear in the structured data crawled from the web. While SWSE looks for any kind of information and returns locally stored objects created from the aggregated information concerning a specific aspect, using its own ontologies to define objects on the web, either in the linked data or not, Sindice exposes a specific Term Search, which identifies locations on the web (dereferenceable URIs) where the specified search string is to be found, and Falcons uses a hybrid approach, exposing locally stored data and URIs to resources, all these results being ordered by relevance. Regardless of the structure, all these results are useful since they allow the identification of a described object for a search string, the user requested Subject (matching items 1 and 4, and partially 2, in the requirements list above). For the property value search, SWSE doesn't offer any specific solution, being limited to keyword search, while Falcons only identifies a property name by keyword. This aspect has been specifically handled by Sindice, with the Property Search and an advanced query language, both aimed at responding with missing values in triples as terms. The search is therefore imagined as a search for a subject which has a specific value for a property. Serious user communication limitations are though imposed by the requirement that the property be given through its URI (or shortly specified using predefined prefixes), and the object through its specific value in the triple: it can be a string if the value is an atomic value, but it needs to be a URI if the value is an object. Also, there is no means to invert this query, to look for the values of a subject and a property, all of which render the property search in Sindice useless for our purposes. To target advanced search, SWSE makes available a SPARQL endpoint which, even if highly powerful as an information extraction point, has a limited query time of 20 seconds. This, combined with the complexity of queries designed to perform several denotations in a single pass, makes its usage in real-world cases practically impossible. The strategy, given this tools landscape, is to devise a system which resolves the two components independently:


- subject resolution: the identification, from the user expressed string Subject, of an object matching the description best. For this, data repositories on the web, or semantic search services like Sindice, SWSE or Falcons, are queried independently. The retrieved URIs are dereferenced and the descriptions found are taken into account as potential subjects for the search. If such services retrieve multiple answers, the order of the answers is considered to be the relevance order and, after a duplicate elimination based on URIs, they are to be trimmed to a number of m and stored, as s1, s2, ..., sm (see the sketch at the end of this subsection);

- property resolution: the identification, from the user expressed property, of the property of the retrieved object whose value is to be retrieved as a result. As opposed to the previous problem, no satisfactory approaches exist for this second issue (the Falcons proposal being only based on a keyword search), therefore a proposal will be made in the following.

Inferring user needs

However, even on the Semantic Web, where resources are precisely marked and semantics unambiguously specified, the user input can still bring a degree of ambiguity into the process. For example, assuming that the user search subject is Firefox, it can refer either to the movie Firefox or to the software product with the same name. We can assume that, whichever semantic search we use, both these results are going to be in the results list. In particular cases, the differentiation can be made based on the property the user is requesting, which would be present only for one of the objects. But in the general case, when the property is ambiguous too (in our example, searching for the property author), we are faced with the situation where we would be unable to say for sure what the user refers to and which is the correct description we need to provide as an answer. Excluding the solution of making the user choose, there are a few strategies to determine the user's intentions, to disambiguate his search and place it in a user semantic context:

- the history of the search session can be used: whatever semantic content he has searched before, and the ontology of the previous answers, can be used as an indicator of his preferences: if there is a previously retrieved object from the movies ontology, for example, then Firefox the movie is the right answer;

- to enlarge the knowledge about the user, not only to his search but also to his content interests, semantic data can be retrieved from the pages the user is viewing, using gleaning methods [37] to unify all types of semantic markup (an idea materialized by Simile's Piggy Bank [80]). This

technique admits two flavours: either only the current page's resources and ontologies are used to determine the context, or his entire local store is used. While the first is more limited in knowledge, it has the power to provide better results due to the localization of the interests of the user (even if, in general, he is interested in software, in this particular instance it is about the movies).

For the discussions below, we will use the term (user) context ontology to refer to this knowledge, regardless of the way it is collected, comprising the instance data and the types and relations referred to by these instances. While the structure information (types or properties) might not be present in the collected data itself, it is to be retrieved in case of need by dereferencing the resources' URIs where possible. Also, this knowledge might reside in multiple domains, or the ontologies retrieved might overlap, but we will force the term context ontology to refer to all these united descriptions of data, as they are found, with no reconciliation or logical consistency ensuring mechanism to be used. None of the studied search engines allows suggestions for the type of subject to be retrieved, but the context awareness can be enforced within the property resolution, in a way which we will present in the following section.
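A minimal sketch of the bookkeeping described above: trimming the subject candidates to the first m distinct URIs (the relevance order of the services is preserved) and collecting the rdf:type values from the gleaned session data as the user context ontology. The function names are ours.

from rdflib import Graph
from rdflib.namespace import RDF

def trim_subjects(candidate_uris, m=5):
    """Keep the first m distinct candidate URIs: s1, s2, ..., sm."""
    seen, subjects = set(), []
    for uri in candidate_uris:
        if uri not in seen:
            seen.add(uri)
            subjects.append(uri)
        if len(subjects) == m:
            break
    return subjects

def context_types(history: Graph):
    """The classes instantiated in the gleaned session data (context ontology)."""
    return {o for _, _, o in history.triples((None, RDF.type, None))}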

Property identification

Although the problem of identifying the user referred property from natural language expressions can be stated as a matter of natural language processing, we approached it from the perspective of ontology matching, modelling the user expression as a property request in his own model (perceived ontology) of the data, which needs to be aligned with the ontologies of the data on the web to determine which characteristic to retrieve. Note that, while some of the alignment techniques to be described in the following also use natural language mechanisms, we made use of them as alignment techniques and not as natural language processing, since the results are the same and they keep us in the space of the same formalism. The starting approach ([25, 70]) is to handle the property to be looked for by using an association between user expressions and properties in the description language. Since an exhaustive hardcoded mapping is very limiting, a rule needs to be devised so that any user inserted string can finally be semantically associated with a property. An efficient, though simplistic, such convention is to associate a default ontology (prefix, in the syntax) with any queried source and to consider that the user inserted property is the actual name of the RDF property (a minimal sketch follows this discussion). This approach is a particular case of an idea to be evolved in the following, and has surprising success in practice as a result of the good design of ontologies, which use suggestive names for the properties,


on one hand, and of a high cohesion between the ontologies used by data repositories on the web, the instances in one repository and the users' interest in that repository, on the other hand (otherwise put, most of the data in a repository will belong to the same ontology, and users' interest in a repository will be determined by that ontology with its properties, so it can be used as the default prefix). As a further step, we will show, in the following, how a fragment of ontology can be inferred from the user expressed query, and we will propose methods to find the associations for this fragment (such that an answer for the query can be determined) using ontology alignment techniques.
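A minimal sketch of the default-prefix convention; the DBpedia namespace is only an example of a default, and the cleaning rule is a simplified stand-in for the URI / RDF cleaning rules cited later.

import re

def property_candidate(user_property,
                       default_ns="http://dbpedia.org/ontology/"):
    """Map a user expressed property name onto the repository's default ontology."""
    # Reduce the user string to a URI-safe local name (simplified cleaning).
    local_name = re.sub(r"[^A-Za-z0-9_]", "",
                        user_property.strip().replace(" ", "_"))
    return default_ns + local_name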

3.1.4 Ontology alignment for user language property resolution

Diverse terminology is used in the field of ontology alignment ([8], [10]) but, in an agreed form, an ontology alignment is a set of correspondences between the various entities (concepts, objects, relations, etc.) in two ontologies o and o′. Following the ontology definition in chapter 2, a correspondence is formally defined as follows (adapted from [10]):

Definition 5 Given two ontologies o and o′, a set of alignment relations Θ and a confidence structure over Ξ, a correspondence is a 5-tuple ⟨id, e, e′, r, n⟩ such that:

- id is a unique identifier of the given correspondence;
- e ∈ o and e′ ∈ o′;
- r ∈ Θ;
- n ∈ Ξ.

A confidence structure, in turn, is described by the next definition.

Definition 6 A confidence structure is an ordered set of degrees Ξ, for which there exists a greatest element ⊤ and a smallest element ⊥.

For example, the boolean set {true, false} is the simplest confidence structure. The most frequent confidence structure used in algorithms is the [0, 1] interval, with the natural order. Otherwise put, such a correspondence states the existence of a relation (equality, subsumption, etc.) between two entities in the two ontologies, with a specified (comparable) confidence; a minimal data structure sketch follows.
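Definition 5 translates directly into a record type. A minimal sketch, with the [0, 1] interval as the confidence structure; the field names mirror the definition.

from dataclasses import dataclass

@dataclass
class Correspondence:
    id: str           # unique identifier of the correspondence
    e: str            # URI of the entity in ontology o
    e_prime: str      # URI of the entity in ontology o'
    r: str            # alignment relation, e.g. "=" or "<=" (subsumption)
    n: float          # confidence degree, here taken from [0, 1]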

Other related terms associate with the alignment problem: ontology matching is the process of finding such an alignment; ontology mapping is the oriented version of an alignment, allowing the rewriting of all the knowledge in one ontology into another; ontology merging, integration and reconciliation are about creating a new ontology from two others, by modifying none, one or both of them, for various purposes: dealing with overlapping described domains, sinking one ontology into the other, or adapting the two ontologies to communicate well together (see [10] for detailed terminology and definitions). Note that, even if at a first look the (one-to-one) mapping seems to be a good solution for our problem, it can add unnecessary restrictions, since we are interested in all data potentially associated with the user request, and not in the rewriting of data from one ontology to the other. For this reason, the particularization of the ontology matching problem for query rewriting [8] was found to be unnecessarily rigorous for our needs. Employing ontology alignments, the steps to find a value for a user inserted property p are the following:

- formalization of the query: a formal representation of the user query needs to be found for the alignment to be possible (expressing the user required property in a formalism and representing it);

- identifying the targeted ontologies: the structures of the data where the potential answer data is found must be identified, extracted and fetched, in order to be able to find a match for p among the sources;

- finding an alignment between the query formalization and the target ontologies, namely a set C of correspondences between the property p and properties in targeted data web ontologies, belonging to various classes, with various relations and confidence values;

- selecting the results to present to the user: choosing, from the found correspondences, according to the confidence value and the relations they contain, a set of valid correspondences whose matched property values are to be extracted and presented to the user; for details about the presentation methods, see section 3.2.

Query formalization

Since the only information we have about the user query is the user expression of the property he is looking for, we create a representation of this information in the resource description formalization as follows:

- first we extract, from the user inserted value of p, a value to assign to the identifier of this property, according to the cleaning rules defined by the URI specification [35] and the RDF specification; call this value pc;

- then we create the actual representation of this knowledge, as RDF(S):


<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#"
         xml:base="local-uri-userontology">
  <rdf:Property rdf:ID="pc">
    <rdfs:label xml:lang="en">p</rdfs:label>
    <rdfs:comment>p</rdfs:comment>
  </rdf:Property>
</rdf:RDF>

Note that, in the above representation, we tried to encapsulate as much as we know about the user model: we know the described entity is a property, which the user recognizes through its label p; this information also goes in the comment, to take as much advantage as possible of the alignment algorithms. Even more, we assume that the identifier is also p, with the mentioned restrictions. At this point we know nothing about the range or the domain of the property, so no such information can be specified. This description aligns with the Semantic Web in the spirit of the open world: knowledge which is known is described, and no assumptions are made about the information we do not know.

If context information is found for the current query, as presented in the dedicated section, the current property description can be augmented with domain information: for each class (of object) identified as associated with the search subject (either the entire page ontology, or the classes of the objects in the history or page associated with the current subject), a property description will be created:

<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#"
         xml:base="local-uri-userontology">
  <rdf:Property rdf:ID="source-pc">
    <rdfs:label xml:lang="en">p</rdfs:label>
    <rdfs:comment>p</rdfs:comment>
    <rdfs:domain rdf:resource="source-class-uri" />
  </rdf:Property>
</rdf:RDF>

with the possibility of extending this assumption to the user created data: the answers he expects to be retrieved are also part of the context

ontology, therefore all the classes in the context ontology can also be considered as target concepts:

<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#"
         xml:base="local-uri-userontology">
  <rdf:Property rdf:ID="source-target-pc">
    <rdfs:label xml:lang="en">p</rdfs:label>
    <rdfs:comment>p</rdfs:comment>
    <rdfs:domain rdf:resource="source-class-uri" />
    <rdfs:range rdf:resource="target-class-uri" />
  </rdf:Property>
</rdf:RDF>

Note that, even if these descriptions of the user targeted properties may not be consistent together, as an ontology, with respect to the user ontology (since we are basically trying all possibilities), fed to the alignment algorithm they will be differentiated by the correspondence scores, to present the most relevant to the user. All these descriptions, the ones which can be created, are united and stored together as O1, the user ontology; a minimal sketch of building these descriptions follows.
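The three variants of the description above differ only in the presence of the domain and range, so they can be produced by one routine. A minimal rdflib sketch; the base URI and the inline cleaning rule are placeholders for the ones discussed above.

import re
from rdflib import Graph, Literal, Namespace, URIRef
from rdflib.namespace import RDF, RDFS

BASE = "local-uri-userontology#"

def user_property_description(p, domain_uri=None, range_uri=None):
    """Build one RDF(S) description of the user perceived property p."""
    g = Graph()
    pc = re.sub(r"[^A-Za-z0-9_-]", "", p.replace(" ", "_"))  # simplified cleaning
    prop = URIRef(BASE + pc)
    g.add((prop, RDF.type, RDF.Property))
    g.add((prop, RDFS.label, Literal(p, lang="en")))
    g.add((prop, RDFS.comment, Literal(p)))
    if domain_uri:                      # only state what we know (open world)
        g.add((prop, RDFS.domain, URIRef(domain_uri)))
    if range_uri:
        g.add((prop, RDFS.range, URIRef(range_uri)))
    return g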

Identifying target ontologies

To extract the data ontology, O2, following the models of search we proposed, we can proceed in the following ways:

- from previous knowledge acquired about the data repositories taken into account for search: once a new repository is taken into account, its structure is fetched (classes, relations, axioms) and stored locally by the answering system, in an ontology O2. Note that serious scalability problems are to be resolved in this case if a subset of interest cannot be identified in O2 for the current search;

- from the structures already associated with the search answers for the Subject query: all the distinct classes and properties of the subject search results s1, s2, ..., sm are fetched into an ad-hoc ontology O2, created as their raw reunion (no integration, merging or reconciliation is to be taken into account).

Since, in the case of the first solution, property identification in all the classes of the ontology has to be passed through a process of subject filtering as well, we can reduce the first case to the second one, doing the subject filtering first in all cases. For example, assuming the search for the subject The Wall retrieved three results: a Last.fm [60] scrobbled track, the

movie in the DBpedia repository and the album in the DBpedia repository, the class descriptions of these terms (the rdf:type property), mo:Track1, dbpedia-owl:Album2 and dbpedia-owl:Film3, along with the properties which were specified for the identified terms, are fetched and stored together in the data ontology, O2.

Finding an alignment

At this point, ontology matching is executed on the identified O1 and O2, to find potential associations between the representation of the user expressed property p and several properties in the ontologies of the retrieved answers for the subject Subject. Formally, we expect a set of correspondences:

{⟨id, p, pi, r, n⟩ | p ∈ O1, pi ∈ O2, r ∈ Θ, n ∈ Ξ}

In the following, we will discuss how ontology matching algorithms operate on O1 and O2, stressing the techniques which are significant for our case. Ontology alignment methods are highly connected to general semantic reconciliation and database schemata alignment methods (see [21] for a discussion on the topic) but, since the ontology matching strategies of today comprise the schemata methods, we will discuss only these aspects. A fork of the approaches, relevant for the current situation, is given by [7] (further detailed by [8] and [10]), as rule-based approaches and learning-based approaches, the first using the structure of the information alone (classes, types, properties) while the second relies on populated repositories to infer associations between the properties based on the actual values. Since the structure is the only thing we can infer from the user ontology when querying, we will first explore the rule-based methods, followed by a strategy to acquire instances for a learning method.

Rule-based alignment approaches

From the rule-based methods, the name-based techniques are the ones which will prove most useful for our case: the ones that use elements' IDs, labels or comments for comparison between entities. These techniques measure the similarity of the names of the compared entities and provide it as the confidence of the alignment; in this case, the most successful are the syntax based similarities (equality, Hamming or edit distance, on the names themselves or on derived names, such as the composed name of the compared entity and its related entities), the language based methods, which rely on normalization of the natural language expressions in the names before comparison, and the ones based on external thesauri for determining the semantic equivalence of the names. For this reason, when creating the description for the user
1 http://purl.org/ontology/mo/Track
2 http://dbpedia.org/ontology/Album
3 http://dbpedia.org/ontology/Film


knowledge property to be aligned, we tried identifying as many of an entity's names as possible, to stimulate the alignment based on any of these criteria. From the structure based techniques, useful in our case only when there is a context to associate with the current property, the domain and range similarity measures will prove to be relevant, as all the others imply verification of more complex structures than a single property (the entire structure of a class, or cardinality, which we cannot infer). These measures take into account the equality of the domain and / or range, as well as subsumption relations on them, and will lead, in our case, to favouring the properties whose domain and range correspond to the user context domain and range. Also, the structural techniques should take into account the descendance relation between properties, as what we infer in O1 is an rdf:Property while the actual properties in the targeted ontology could be the more specialized owl:DatatypeProperty or owl:ObjectProperty. Naturally, in all these systems, multiple matchers are used for the same entities and combined, to create the confidence value of the alignment, using various methods: weighted averages, products, multidimensional distances (see [10] for a detailed list of these methods), either chained or iterated, to have the structural matchers take advantage of the previously found alignments (see [8] and [10]). In our case, the second method proves useless, since the graph to align is very simple, therefore the aligned neighbours that could influence the currently aligned entities are few. The weighted average is a good measure, with well chosen, practice inspired weights. Moreover, [8, 9] propose using a Sigmoid function to augment the similarity values, with particular advantages in the case of syntactic alignment, where the similarity is not linear in the number of letters that differ, for example. As an example, the above mechanism allows finding associations between a property expressed by the user as authors and properties like dbpedia-owl:author4, dc:creator5, or the hasAuthor property in the BibTex ontology6, based on either a syntactic or a semantic match of the searched string. The matching then also takes into account the context of the search, with respect to the retrieved subject types and the locally stored interesting types: if the retrieved subjects are foaf:Person7 types and the locally stored user history highlights a predominance of dbpedia-owl:Film8, then the DBpedia property will have the highest alignment confidence of the three. A minimal sketch of such a combined matcher follows the footnotes below.
4 5

http://dbpedia.org/ontology/author http://purl.org/dc/elements/1.1/creator 6 http://zeitkunst.org/bibtex/0.1/bibtex.owl 7 http://xmlns.com/foaf/0.1/Person 8 http://dbpedia.org/ontology/Film

26

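A minimal sketch of such a combination, assuming illustrative weights and sigmoid parameters (neither is fixed by this thesis):

import math

def sigmoid(x: float, slope: float = 12.0, center: float = 0.5) -> float:
    """Logistic curve pushing near-matches towards 1 and weak ones towards 0."""
    return 1.0 / (1.0 + math.exp(-slope * (x - center)))

def combined_confidence(similarities: dict, weights: dict) -> float:
    """Weighted average of sigmoid-augmented matcher outputs."""
    total = sum(weights.values())
    return sum(weights[m] * sigmoid(s) for m, s in similarities.items()) / total

sims = {"syntactic": 0.85, "thesaurus": 0.60, "domain_range": 1.0}
wgts = {"syntactic": 0.5, "thesaurus": 0.3, "domain_range": 0.2}
print(round(combined_confidence(sims, wgts), 3))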
Learning-based approaches

The learning-based alignment techniques rely on populated ontologies to determine the similarity of the structures from the values. They are able to identify identical instances and then use this information to establish the similarity of two classes from the number of instances they share, or to compare the values of two properties across multiple instances and, upon high similarity, establish a similarity between the properties themselves (for more details, see [8] and [10]). These methods are of high practical importance, especially in our case, because they can associate entities which are equivalent from the usage point of view, rather than from the theoretical one. In our case though, for the ontology of the user model, no instances are available a priori (for the target ontology O2 we assume that some instances can be retrieved along with the ontologies, in the spirit of the linked data web). A simple method to create these instances is the following: every time a set of answers is presented to the user and he chooses one of them, the instance of the completed triple, in the user's knowledge model, is stored in a local repository to be used in further alignments. Namely, for property p, multiple aligned properties p1, p2, ... will be identified, along with their subjects s1, s2, ... and values v1, v2, .... Upon the user confirming a particular result, corresponding to the pk property, the illustration of this choice is created and stored in the local user knowledge model:

<rdf:Description rdf:about="subject-k-uri">
  <uo:p xmlns:uo="local-uri-userontology"
        rdf:resource="value-k-uri"/>
</rdf:Description>

or, if the value of the retrieved property is a literal:

<rdf:Description rdf:about="subject-k-uri">
  <uo:p xmlns:uo="local-uri-userontology">value k</uo:p>
</rdf:Description>

These describe the fact that what the user perceives through the property p is established between the subject sk and the value vk. We notice again the advantages of the open-world, linked-data principle: we can create arbitrary connections with old data to represent newly learned properties. Note that these triples can also be used to retrieve the value of a property for the same subject right away in further queries.

This approach, of illustrating user knowledge for further alignment rather than storing the retrieved associations right away (once the user has chosen pk as the desired result source, store the fixed association between p and pk), has several advantages:

- a fixed association potentially locks a user property p to a specific associated pk, in a specific ontology, which might not be valid for further subject resolutions;
- it allows flexibility: in a further request, the user might need to choose a different answer; fixing the association right away, from the user's first choice, hides the other properties from further alignments, thus providing the user with a wrong answer;

- it better simulates the user knowledge model: for different types of subject (and potentially for different instances too), the same property name might mean different things to the user. A fixed learning of the associated property makes general assumptions about structure, while an instance-based association represents the exact facts that have been observed: the user associated property p with subject sk and value vk. Ideally, in a consistent user knowledge model, the associated property pk will eventually converge.

Machine learning, with a neural nets model, can also be employed at the ontology matcher level, by communicating the user feedback back to the matcher to stimulate learning of the similarity measure weights used for the final computation of the confidence (see [9] for details of this approach).

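A minimal sketch of this silent feedback collection, using the rdflib library; the namespace URI and the output file name are illustrative placeholders:

from rdflib import Graph, Literal, Namespace, URIRef

UO = Namespace("http://example.org/user-ontology#")  # hypothetical local URI

def record_choice(graph: Graph, subject_uri: str,
                  user_property: str, value, value_is_uri: bool) -> None:
    """Store the triple (sk, p, vk) illustrating the user's confirmed choice."""
    obj = URIRef(value) if value_is_uri else Literal(value)
    graph.add((URIRef(subject_uri), UO[user_property], obj))

g = Graph()
record_choice(g, "http://dbpedia.org/resource/Depeche_Mode",
              "members", "http://dbpedia.org/resource/Dave_Gahan", True)
g.serialize(destination="user-knowledge.rdf", format="xml")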
Selecting the results to present to the user

Once the alignment is found, the actual results to be presented as an answer to the user need to be detected. Since the user is not interested in the identified matching property pk but in its value, some preprocessing is done to identify the actual results:

- for each correspondence ⟨p, pk, r, n⟩, if the relation r does not express "pk is a p" (equality or subsumption of pk to p), in other words, that pk can take the place of p in a statement, the correspondence is discarded. Note that this step can be skipped, at the price of accepting wrong results. The relation is then eliminated from the remaining correspondence descriptions;

- the values of the correspondences are identified: for each ⟨p, pk, n⟩, let vk be the value of the property pk for the found term sk, transforming the correspondence into ⟨p, vk, n⟩, illustrating that the property p has the value vk with confidence n;

- duplicates are eliminated from the remaining correspondence set, as follows: if the value is a literal, all correspondences with the same value are united into a single one, with the confidence n set to the maximum value in the set. If the value is a concept instance, the URI is used for this identification, with n determined by the same method. Note that, while the owl:sameAs property also provides a good identification of objects in the linked data space, we do not filter on it, since we need to allow the same object to appear twice in the results but modeled in different ontologies, seen from different points of view;

- significant results are then selected from the remaining ⟨p, vk, n⟩ set by applying various thresholding techniques: the highest confidence alone is kept, all correspondences with a confidence above a fixed value N0 are kept, all correspondences within an Nmargin distance from the highest confidence are kept, or the top N% are kept (for a well illustrated list, see [8]). In this case, since the aim is to provide precise answers to the user, we favour outstandingly high confidence; therefore thresholding is done by computing a top margin ntop as a fraction c of the highest confidence nmax, and considering as results all correspondences between ntop and nmax, with c chosen experimentally. This way, if there is a correspondence with a confidence much higher than the others, it will be returned alone; if the top consists of answers with comparable confidences, more answers will be provided and the user will choose.

This model, of presenting multiple results and leaving the choice to the user, together with the learning techniques presented in the previous discussion, addresses two major issues in ontology alignment: it is generally believed that fully automated semantic mapping cannot be done (clearly stated in [21], but in most references too), and this silent feedback collection, well integrated with the user's achievement of his purpose, is a good answer to the user involvement problem identified in [31]. The actual user interface decision of which value to present to the user will be discussed in the following section.
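A sketch of this top-margin selection, with an illustrative value for the experimentally chosen constant c:

def select_results(correspondences, c: float = 0.9):
    """Keep all (property, value, confidence) triples within the top margin."""
    if not correspondences:
        return []
    n_max = max(n for _, _, n in correspondences)
    n_top = c * n_max
    return [t for t in correspondences if t[2] >= n_top]

# One outstanding confidence is returned alone; close scores go to the user.
print(select_results([("authors", "D. Adams", 0.95), ("editor", "X", 0.40)]))
print(select_results([("authors", "A", 0.80), ("creator", "B", 0.78)]))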

3.2 Presenting the results to the user

For the actual presentation of the found answer vk to the user, a simple rule can be used: if the answer is a literal value, such as a string, a number or a date, it is presented to the user as it is. If the answer is an object (a described resource, instance of a concept), then the presented value will be the rdfs:label property of the instance, along with an optional rdfs:comment, since these represent the human readable form of the answer. For reusing this answer, however, different strategies are applied, to be discussed in the next section. Ideally, a single answer from the retrieving component is expected, since the semantic querying provides precise answers to the searched properties. In some cases though, multiple answers will be retrieved (for example, when the search subject refers to multiple objects of the same type and the searched property is resolved correctly for all of them), for which reason we went for presenting the results as a simple (HTML) list, allowing the user to navigate and choose the correct answer.

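A minimal sketch of this presentation rule, assuming the answer's RDF description has been loaded into an rdflib graph:

from rdflib import Graph, RDFS, URIRef

def display_value(graph: Graph, answer) -> str:
    """Literals are shown as-is; resources through rdfs:label (+ comment)."""
    if not isinstance(answer, URIRef):
        return str(answer)
    label = graph.value(answer, RDFS.label) or str(answer)
    comment = graph.value(answer, RDFS.comment)
    return f"{label} ({comment})" if comment else str(label)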
As for the type of application chosen to implement the instrument as a whole, since it is designed to be an interface for the user to make the semantic web relevant in his own, social environment on the web, we naturally propose the implementation as a browser extension. This way the user is able to access semantic web information from his social context (to issue queries without being aware of accessing the semantic web); it also allows the tool to technically intervene while the user is creating content (anywhere on the web) and to ensure that semantically marked content is used. This choice also allows simple, generic access to the tool, regardless of the (type of) web application. For the query issuing interface, we chose a flexible predefined command format, using connecting particles between the command's keywords (Subject and property) to create a fluent, intuitive natural language-like expression, with handling of missing components and a suggestions-based composing interface to manage the diversity of the user input. While not exhaustive, this approach's coverage is sufficient in most cases and is a satisfactory trade-off between the left-out cases and the complexity of complete natural language query processing, in the spirit of the 80/20 rule [28].
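One possible grammar for such commands is sketched below; the particles and the pattern are illustrative, since the actual command relies on Ubiquity's own parsing and suggestion mechanisms (see chapter 6):

import re

COMMAND = re.compile(
    r"^(?P<property>.+?)\s+of\s+(?P<subject>.+?)"
    r"(?:\s+from\s+(?P<repository>\S+))?$",
    re.IGNORECASE)

def parse_query(text: str):
    """Split a 'property of Subject [from repository]' expression."""
    match = COMMAND.match(text.strip())
    return match.groupdict() if match else None   # None: ask for suggestions

print(parse_query("members of Depeche Mode"))
print(parse_query("author of The Pragmatic Programmer from dbpedia"))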

3.3 Reusing the semantics

The other task, upon reusing the results, is attaching semantic markup to the answer, as a social web block compatible with the documents web. This need comes from the hypothesis that the retrieved information will be reused by the user in the content he's creating, and we need to ensure that when information on the web is operated on by the users it does not lose its semantics. For example, assume a user has retrieved the desired information, representing a member of a band, properly described as such in an ontology, to use while writing his blog post about music. What we aim to investigate now is how a fragment of raw HTML can be created for the user to insert in the edited content, to represent the information he just fetched instead of the plain text name of the artist.

3.3.1 Translating semantic web resources in RDFa and eRDF

The straightforward approach is to use RDFa and eRDF to embed the RDF markup of the retrieved answer directly in the produced HTML. The advantage of this method is that it is technically easy and provides a one-to-one mapping of the structured answer, therefore fully preserving its semantics: the generating tool only needs to provide a transformation (in the XSLT [36] spirit) of an RDF fragment to an HTML fragment
which also illustrates the data in the RDF block. However, there are disadvantages from a practical standpoint, RDFa and eRDF being little used in real-world social web applications as a source of data. The semantically marked information is carried over to the semantic web tools that crawl the documents, but it is not relevant to real users or to the social web tools they operate and use to access information. For example, the following fragment of a FOAF [52] description of a person:

<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:foaf="http://xmlns.com/foaf/0.1/">
  <foaf:Person>
    <foaf:givenname>Anca</foaf:givenname>
    <foaf:family_name>Luca</foaf:family_name>
    <foaf:img rdf:resource=
      "http://students.info.uaic.ro/~lucaa/lucaa.png" />
  </foaf:Person>
</rdf:RDF>

can be transformed simply into the following XHTML with RDFa:

<div typeof="foaf:Person" xmlns:foaf="http://xmlns.com/foaf/0.1/">
  <span property="foaf:givenname">Anca</span>
  <span property="foaf:family_name">Luca</span>
  image
  <a rel="foaf:img"
     href="http://students.info.uaic.ro/~lucaa/lucaa.png">
    http://students.info.uaic.ro/~lucaa/lucaa.png
  </a>
</div>

The only conventions that need to be made concern the container XHTML elements and the attachment of the RDF property labels to their values generated in XHTML (in this case we considered that the label is attached only for fields whose values are not strings but URIs).

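A sketch of such a generator over an rdflib graph, emitting spans for literal values and anchors for resources, following the conventions just mentioned; prefix handling is delegated to rdflib's qname computation:

from rdflib import Graph, Literal, RDF

def rdf_to_rdfa(graph: Graph, subject, type_curie: str) -> str:
    """One-to-one mapping of the subject's triples to an RDFa fragment."""
    xmlns = " ".join(f'xmlns:{p}="{u}"'
                     for p, u in graph.namespaces() if p)
    parts = [f'<div typeof="{type_curie}" {xmlns}>']
    for _, pred, obj in graph.triples((subject, None, None)):
        if pred == RDF.type:
            continue                     # already carried by typeof
        curie = graph.namespace_manager.qname(pred)
        if isinstance(obj, Literal):
            parts.append(f'  <span property="{curie}">{obj}</span>')
        else:
            parts.append(f'  <a rel="{curie}" href="{obj}">{obj}</a>')
    parts.append("</div>")
    return "\n".join(parts)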
3.3.2 Generating microformats from semantic web described resources

To address the practical advantages, the processing tools and data reuse on the social web, we turned to experimenting with the automatic creation of microformats from described data on the web. First of all, we note that, since the semantics associated with microformats are specific, fixed and not extensible, there is a limited range of things that
can be represented with microformats, creating the need to accept information loss when transforming general RDF descriptions of data into microformat descriptions. For example, in the case we assumed, a microformat cannot provide markup for an artist, since no such semantic is defined by their vocabulary. However, an artist is an entity (a person, an organization, something that has identity) for which the microformats do provide support, through the hCard format. In general, there is some loss of meaning when illustrating general data web descriptions on the social web, but this loss is only relative to the semantic web, since for the social web the representation has maximum utility (the transformation is made to the most compatible format). Upon the information being read back by semantic web tools, techniques aligned with the GRDDL initiative ensure that the semantics are completely extracted from microformatted blocks. Since the HTML embedded markup defined by these methods is loose, combining two or more of them is perfectly feasible, allowing a double markup solution: preserving full information with RDFa and creating the practical microformats markup in the same block.

As opposed to producing RDFa or eRDF, straightforward from an RDF representation of the answer, microformats raise special issues, since they also need a transformation of semantics. To make such an operation possible, we need to address the following concerns:

- a formalization of the microformats semantics is needed, in order to have automated tools operate with them;
- then we need to detect the format to be used for an RDF description of a retrieved answer, the one with the matching semantic, such that, in the end, we are able to associate a value to each of the microformat's properties;
- finally, we need to serialize the format and its properties, for humans first: make sure that the produced HTML serialization is a text which is as natural to users as possible.

Formalization of microformats

One solution, as suggested by the extraction mechanism (the reverse operation), is to employ a mapping of microformats to existing ontologies, each to its specific semantic, associating to every microformat a class in a (well-known) ontology to be used as the representative of that format. Unfortunately, there is no means to do this in an automated manner, so the mapping needs to be done manually, on a microformat per microformat basis, which also implies that the third issue above is resolved manually, for each particular ontology-to-format transformation. There are projects
for resolving this association optimally [64], since this is the principle behind the extracting tools (GRDDL, search engines such as Sindice).

The second solution is to base the formalization on a generic mechanism for describing microformats, the standard in the microformats world, the XMDP [91] (XHTML MetaData Profiles) format, and to try to infer an ontology of microformats from these descriptions. This allows handling all microformats in a generic manner, with the possibility of adding a new microformat at any point. XMDP is a microformat itself, creating an HTML Profile [57] for describing in XHTML the attributes used by the various microformats, the allowed values and how they should be interpreted. Such profiles are defined for all microformats, to be linked from the HTML header of documents that use microformats, to mark this usage. For example, the profile for the xFolk [89] microformat (used for annotating bookmarks) is the following (with simplified explanations, for formatting reasons; for the complete specification see [90]):

<dl class="profile">
  <dt id="class">class</dt>
  <dd>
    <p>[http://www.w3.org/TR/html401/struct/global.html#adef-class
    HTML4 definition of the class attribute]. This meta data profile
    defines some class attribute values (class names) and their
    meanings as suggested by a
    [http://www.w3.org/TR/WD-htmllink-970328#profile draft of
    "Hypertext Links in HTML"]:</p>
    <dl>
      <dt id="xfolkentry">xfolkentry</dt>
      <dd>Indicates a container element for an xFolk entry. [...]</dd>
      <dt id="taggedlink">taggedlink</dt>
      <dd>An <a> tag of class taggedlink indicates the URL of the item
      the xFolk entry bookmarks. One and only one <a> element of class
      taggedlink must occur within each element of class xfolkentry.
      [...]</dd>
      <dt id="description">description</dt>
      <dd>A further description of the xFolk entry. [...]</dd>
      <dt id="extended">extended</dt>
      <dd>A deprecated class value that is equivalent to
      <code>description</code>.</dd>
    </dl>
  </dd>
</dl>

What the above description amounts to is that the xFolk format is
marked by using a container element with class value xfolkentry, in which a tagged link and a description of the link are contained, both detected through elements with the corresponding values of the class attribute, and all expressed in a manner accessible to human readers (XHTML). Even if this format contains all the information needed to detect the structure of a microformat correctly, unfortunately not all of it is accessible to machines, so as to be able to infer a formalization for microformats. Namely, some drawbacks can be noticed (see [92] for more details):

- no machine understandable information about the structure is available for parsing tools: in our example, there is no way for a machine to infer the fact that taggedlink and description are properties of a general object identified by xfolkentry. More generally, properties that are to be enclosed by a container element are not marked as such;
- although a human readable specification of the cardinality or of mandatory properties is present, no such information is machine parsable;
- related to the first issue, if multiple such profiles are specified by an HTML document, and an attribute value is described by multiple profiles, there is no disambiguation rule stating to which microformat a specific identified element is to be associated;
- there is no machine parsable information about how the values of the properties are represented in the elements carrying the property attributes.

Although the microformats initiative states that this format is not designed as a tool to enable automatic parsing, we believe that the descriptive power of this microformat can be harnessed to lead to a proper formalization.

From XMDP to ontologies

We propose here an extension of the XMDP format, along with the semantics and methods to parse it, which enables automatic generation of a microformats ontology based on their profile descriptions. Namely, our extension specifies the nesting of profile lists under attribute value descriptions, to represent containment. Thus, any dd element corresponding to a dt element representing a possible value of an attribute can (and sometimes must) contain a dl class="profile" element, by the following rules:

- the attribute values described, in the XMDP, by an inner profile are meant to appear in HTML only as descendants of an element specifying the attribute value in whose description the inner profile appears;
- if such an attribute is found outside of an element with the parent's reserved attribute value, it is not to be given meaning as part of the current profile (other profiles might specify other meanings for it). Otherwise put, attribute values specified in an inner profile only make sense if they appear in the context in which they were specified;
- top-level attribute values (the ones described by the root profile) are allowed to appear anywhere in the HTML and are to be interpreted according to their description;
- every time a microformat uses a reserved attribute value only as a container for other properties, it must use an inner profile to properly describe this containment.

This extension addresses issues 1 and 3 in the above list; for the second issue we will use the semantic web principle of open worlds, and the fourth will be addressed by conventions presented in the section on serialization. With this extension, the xFolk profile presented above would be transformed to:

<dl class="profile">
  <dt id="class">class</dt>
  <dd><p>[...]</p>
    <dl>
      <dt id="xfolkentry">xfolkentry</dt>
      <dd>Indicates a container element for an xFolk entry. [...]
        <dl class="profile">
          <dt id="class">class</dt>
          <dd><dl>
            <dt id="taggedlink">taggedlink</dt>
            <dd>An <a> tag of class taggedlink indicates
            the URL [...]</dd>
            <dt id="description">description</dt>
            <dd>A further description of the xFolk entry. [...]</dd>
            <dt id="extended">extended</dt>
            <dd>A deprecated class value [...]</dd>
          </dl></dd>
        </dl>
      </dd>
    </dl>
  </dd>
</dl>

which better represents the containment relation between the xfolkentry element and its taggedlink and description, and gives meaning to the latter two only in the context of an xfolkentry. Using this extension, it is easy to infer an associated microformat description structure from its XMDP profile, by the following rules:

- any reserved attribute value describes an ontology entity, to be named by its dt term and characterized by its dd description;
- all reserved attribute values in a profile represent properties, characterizing the resource that contains them: for the top-level profiles, the HTML page that contains them (which might represent an object in its turn), while the inner profiles describe their container element. These properties take their ids and labels from the reserved attribute value, and their comments from the descriptions;
- any reserved attribute value which contains an inner profile defines a concept, refined by that inner profile. The property represented by this attribute value is now to be interpreted as a property whose value is a new resource, described by the inner profile. These classes get their ids and labels from the reserved attribute value and their comments from the descriptions;
- no assumptions are made about the cardinality of these properties, nor about the top-level property domains, nor about the property ranges for reserved attribute values which don't describe an inner profile.

To exemplify this process, we use the xFolk example again. For the extended profile described above, the inferred RDFS model description is the following:

<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#"
         xml:base="local-uri-uF">
  <!--define xfolkentry as a property-->
  <rdf:Property rdf:ID="prop-xfolkentry">
    <rdfs:label xml:lang="en">xfolkentry</rdfs:label>
    <rdfs:comment>Indicates a container element for
    an xFolk entry. [...]</rdfs:comment>
    <rdfs:range rdf:resource="#class-xfolkentry" />
  </rdf:Property>
  <!--define xfolkentry as a class-->
  <rdfs:Class rdf:ID="class-xfolkentry">
    <rdfs:label xml:lang="en">xfolkentry</rdfs:label>
    <rdfs:comment>Indicates a container element for
    an xFolk entry. [...]</rdfs:comment>
  </rdfs:Class>
  <!--define the inner profile properties-->
  <rdf:Property rdf:ID="taggedlink">
    <rdfs:label xml:lang="en">taggedlink</rdfs:label>
    <rdfs:comment>An <a> tag of class taggedlink indicates
    the URL [...]</rdfs:comment>
    <rdfs:domain rdf:resource="#class-xfolkentry" />
  </rdf:Property>
  <rdf:Property rdf:ID="description">
    <rdfs:label xml:lang="en">description</rdfs:label>
    <rdfs:comment>A further description of the xFolk
    entry. [...]</rdfs:comment>
    <rdfs:domain rdf:resource="#class-xfolkentry" />
  </rdf:Property>
</rdf:RDF>

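A sketch of a parser applying these inference rules, using the BeautifulSoup library; the output is kept as plain records rather than RDFS for brevity, and the input file name is illustrative:

from bs4 import BeautifulSoup

def parse_profile(profile_dl, domain, out):
    """Walk one dl.profile: each dt under it names an HTML attribute and
    its dd holds a dl of reserved values; a value whose dd nests another
    dl.profile is a container, hence also a class (the inner domain)."""
    for attr_dd in profile_dl.find_all("dd", recursive=False):
        for values_dl in attr_dd.find_all("dl", recursive=False):
            for dt in values_dl.find_all("dt", recursive=False):
                name = dt.get_text(strip=True)
                dd = dt.find_next_sibling("dd")
                out.append({"term": name, "kind": "property", "domain": domain})
                inner = (dd.find("dl", class_="profile", recursive=False)
                         if dd else None)
                if inner is not None:          # container value: also a class
                    out.append({"term": name, "kind": "class", "domain": None})
                    parse_profile(inner, domain=name, out=out)

soup = BeautifulSoup(open("xfolk-profile.html").read(), "html.parser")
entities = []
parse_profile(soup.find("dl", class_="profile"), domain=None, out=entities)
print(entities)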
Employing this general technique of creating a knowledge formalization for the microformats vocabulary, we now propose a solution based on ontology matching for creating the microformats automatically from the retrieved answer, preserving their semantics.

Detecting the appropriate microformat

Once an answer has been retrieved, to detect the correct microformat to use, ontology alignment (as described in the dedicated section) is used once again, to find a matching microformat for the retrieved response. The source ontology, O1, will be, in this case, the microformats ontology, created from the extended XMDP profiles of all microformats. The target ontology, O2, will contain the user chosen answer, along with the description of his choice in both the data source ontology and the user ontology, to represent the two facts known by the system. Thus, assuming vk to be the value chosen by the user, sk the subject corresponding to this answer, p and pk the user property and its match as above, the ontology to align the microformats with will be composed of:

- the full description of the answer type Cvk;
- the full description of the property pk, as retrieved from the data source ontology;
- the description of the corresponding aligned property p in the user ontology model, as generated from the user query in section 3.1.4;

feeding as much knowledge about the context as possible to the semantic matching process.

Note that the value vk corresponding to the answer can be a literal value (a string, a number or a properly formatted date), in which case the first item in the list above is omitted. During the alignment process, in this case, besides the name-based techniques, the structure-based mechanisms [10, 8] will bring a greater benefit, type similarity detection being stimulated by the number of matching properties, which can in turn influence range and domain matching of the searched properties with the microformats descriptions. Also, articulation rules [10, 8] will be detected, such as the fact that, though not equal, a microformat is a subtype or a supertype of a response type. Note that in this case, as more complex structures are to be aligned, an iterated matching process will prove useful (as [8, 9] propose). The correspondences found are type correspondences between the answer type and a microformat type, such as ⟨Cmi, Cvk, r, n⟩ and, especially for the cases when the retrieved answer is a literal value vk, property correspondences between microformat properties and the searched property p or pk.

A combined thresholding is then applied, to select the needed microformat and detect the values of its properties, for all alignments above a minimum confidence level N0, with the following proposed algorithm:

1. from all the alignments for the retrieved value class ⟨Cmi, Cvk, r, n⟩, the ones for which the relation r expands to a "Cvk is a Cmi" statement are kept, all the rest being discarded. This step has the role of selecting only the microformats which can be used to describe, in a semantically consistent way, the type of the value;

2. if any alignment ⟨Cmi, Cvk, r, n⟩ is left after the above selection, the one with the top confidence level is selected;

3. for the found class correspondence, the matches for the microformat's properties are retrieved, as follows: for each property pmi of the Cmi type, the alignment with a pvk of the type Cvk with the maximum confidence score and a consistent relation (equality or "pvk is a pmi") is detected and stored;

4. if no such class correspondence exists (also in the case of a literal vk), the top confidence alignment of the queried property p or pk with a microformat property pmi is retrieved;

5. the microformat type corresponding to the aligned property is identified, for the two kinds of microformat properties: the top-level properties, describing the HTML document in which they appear, and the microformat types' properties:

(a) if pmi is a top-level property, and there exists a type defined for that microformat, the class Cmi is selected. If the property is an inner profile microformat property, its domain is identified as Cmi. The properties of this class are then identified by the same process presented at step 3, if the value type Cvk exists;

(b) if, in the first case above, the class does not exist, the microformat property is stored.

It is possible that, because of a nonexistent alignment or confidences under the threshold N0, at one of the steps above a consistent correspondence cannot be built (no Cmi found by any method, and no microformat property either); then microformats generation can be abandoned and either semantic markup is not used at all, or the process falls back on a one-to-one mapping with RDFa or eRDF. Also, it is possible that not all the microformat's properties find corresponding response type properties, which is acceptable under the open world descriptions principle.

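A compact sketch of this selection, under the simplifying assumptions that alignments arrive as (source, target, relation, confidence) tuples, that "is_a" stands for equality or subsumption, and that steps 3 and 5 (recovering the matched class's properties) are handled separately:

N0 = 0.5   # illustrative minimum confidence level

def select_microformat(class_aligns, prop_aligns):
    """Return ('class', (C_mi, n)), ('property', (p_mi, n)) or None."""
    usable = [(c_mi, n) for c_mi, _, rel, n in class_aligns
              if rel == "is_a" and n >= N0]                # step 1
    if usable:
        return ("class", max(usable, key=lambda t: t[1]))  # step 2
    props = [(p_mi, n) for p_mi, _, rel, n in prop_aligns
             if rel in ("=", "is_a") and n >= N0]          # step 4
    if props:
        return ("property", max(props, key=lambda t: t[1]))
    return None    # abandon microformats; fall back on RDFa or eRDF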
Serializing the microformat

After the microformat to be used has been identified, the HTML block needs to be created, in a human readable form and preserving the semantics. For this, the HTML block is created with a container element (a span or div element), holding the value of the attribute corresponding to the identified matching microformat, mi, either as a matching type Cmi or as a matching property pmi. For the serialization of the properties of this microformat, two cases can be distinguished: the one where the value vk is a literal value (a string, number, etc.) and the one where vk is a concept instance.

For the first case, following the semantic matching process described above, two situations are possible:

- a microformat class Cmi was identified for the value vk (step 5a). The property of this microformat which has the greatest alignment score with the looked-up property (p or pk) is identified (the one that best represents the looked-up information), and the value vk is serialized as the value of this property;

- the identified microformat is a top-level property pmi (step 5b), in which case the value is serialized directly as the content of the microformat element.

In the second case, when the retrieved value is a structured answer, if the identified correspondence is a microformat property (5b), the displayed value (label, comment) of the answer instance is serialized directly as the content of the microformat element. If there is a class Cmi identified, along with the matching properties pvk of the response type for every pmi of the
microformat type (3 or 5a), the values of the pvk properties are extracted and serialized inside the microformat container element as follows:

- if the value is a literal value, it is encapsulated in an element with the appropriate attribute value set (as per the microformat's XMDP) and added to the microformat container element. If this value has the format of a URI, its container will be an anchor element, and the value will also be set as its hyper reference (href attribute). If the value to be serialized here is found to have been serialized already from another property, the attribute and its value are only appended to the previously serialized element and the data is not duplicated;

- if the value is a resource, then an alignment with a microformat type is searched for, by the same algorithm, for this object to be serialized as an embedded microformat. If such a microformat is found, this property is serialized as an inner microformat HTML block. This step can be skipped for performance reasons, the formatting then being done exclusively as described in the next case;

- if no alignment was found between the type of the resource and a microformat value, the corresponding block will be an anchor, with the appropriate reserved attribute value (as per the corresponding XMDP), the resource URI as href and the display value of the object as a label. The non-duplication rule presented for the first case applies here too.

However, this process can be time consuming, so less complex solutions can be adopted, such as always treating the retrieved answer as a literal value, based on its display value, using a priori associations of the properties to retrieve with microformat templates, or even allowing the user to specify the semantic of the answer he's expecting (as [25, 70] operates). An important component of this semantics reusing process are the user interface mechanisms that stimulate the reuse of data, discussed in the preceding section; a proposed implementation is offered in chapter 6.

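A sketch of the literal-value branch of these rules, including the non-duplication behaviour; the mapping from microformat class values to answer values is assumed to come from the alignment step, and the helper names are illustrative:

def looks_like_uri(value: str) -> bool:
    return value.startswith(("http://", "https://"))

def serialize_microformat(container_class: str, values: dict) -> str:
    """values maps reserved class values (per the XMDP) to literal values."""
    merged = {}                     # value -> list of class values (no dupes)
    for class_value, value in values.items():
        merged.setdefault(value, []).append(class_value)
    parts = [f'<div class="{container_class}">']
    for value, classes in merged.items():
        cls = " ".join(classes)
        if looks_like_uri(value):   # URI literals become anchors
            parts.append(f'  <a class="{cls}" href="{value}">{value}</a>')
        else:
            parts.append(f'  <span class="{cls}">{value}</span>')
    parts.append("</div>")
    return "\n".join(parts)

print(serialize_microformat("vcard", {
    "given-name": "Anca", "family-name": "Luca", "fn": "Luca",
    "nickname": "lucaa", "bday": "10/14/1984",
    "photo": "http://students.info.uaic.ro/~lucaa/lucaa.png"}))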
To exemplify this process, let us consider the example of the result of an author search, retrieved as a foaf:Person (http://xmlns.com/foaf/0.1/Person):

<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:foaf="http://xmlns.com/foaf/0.1/">
  <foaf:Person>
    <foaf:givenname>Anca</foaf:givenname>
    <foaf:family_name>Luca</foaf:family_name>
    <foaf:title>MSc. Student</foaf:title>
    <foaf:nick>lucaa</foaf:nick>
    <foaf:birthday>10/14/1984</foaf:birthday>
    <foaf:img rdf:resource=
      "http://students.info.uaic.ro/~lucaa/lucaa.png" />
  </foaf:Person>
</rdf:RDF>

The hCard [55] microformat, used to annotate entities, has the following XMDP, as per the extended version proposed above (although the hCard profile contains many more class values, we present here only the ones relevant for this example; for the complete version see [56]):

<dl class="profile">
  <dt>class</dt>
  <dd><p>HTML4 definition of the class attribute. [...]</p>
    <dl>
      <dt>vcard</dt>
      <dd>A container for the rest of the class names [...]
        <dl class="profile">
          <dt>class</dt>
          <dd><dl>
            <dt>fn</dt>
            <dd>"Formatted name", See section 3.1.1 of RFC 2426.</dd>
            <dt>family-name</dt>
            <dd>See "Family Name" in section 3.1.2 of RFC 2426.</dd>
            <dt>given-name</dt>
            <dd>See "Given Name" in section 3.1.2 of RFC 2426.</dd>
            <dt>nickname</dt>
            <dd>"Nickname", See section 3.1.3 of RFC 2426.</dd>
            <dt>photo</dt>
            <dd>"Image or photograph", See section 3.1.4 of
            RFC 2426. [...]</dd>
            <dt>bday</dt>
            <dd>"Birth date", See section 3.1.5 of RFC 2426. [...]</dd>
            <dt>title</dt>
            <dd>"Job title, Functional position or function",
            See section 3.5.1 of RFC 2426.</dd>
            [...]
          </dl></dd>
        </dl>
      </dd>
    </dl>
  </dd>
</dl>

As a result of the alignment process between the foaf:Person ontology, fetched as described in section 3.1.3, and the microformats generated ontology (as described and exemplified in 3.3.2), alignments are found between the two sets of properties: ⟨givenname_foaf, given-name_vcard, =, n_givenname⟩, ⟨family_name_foaf, family-name_vcard, =, n_familyname⟩, ⟨title_foaf, title_vcard, =, n_title⟩, ⟨nick_foaf, nickname_vcard, =, n_nickname⟩, ⟨birthday_foaf, bday_vcard, =, n_bday⟩ and ⟨family_name_foaf, fn_vcard, =, n_fn⟩, based on name techniques (the latter with some ambiguity, with low confidence and close to the other person name properties); then ⟨img_foaf, photo_vcard, =, n_photo⟩ will also be found, based on semantic methods, and finally the two types will be found similar based on structure (the common set of identified properties): ⟨foaf:Person, class-vcard, =, n_type⟩. Note that multiple alignments will be found between other entities in the two ontologies; what we listed are the alignments selected as a result of the algorithm presented in 3.3.2. The values of these properties are then identified in the object, and the microformat is serialized as follows, with the literal values serialized in span sections, a compacted representation of fn and family-name, and the found URI property illustrated as an anchor:

<div class="vcard">
  <span class="given-name">Anca</span>
  <span class="fn family-name">Luca</span>
  <span class="title">MSc. Student</span>
  <span class="nickname">lucaa</span>
  <span class="bday">10/14/1984</span>
  <a class="photo"
     href="http://students.info.uaic.ro/~lucaa/lucaa.png">
    http://students.info.uaic.ro/~lucaa/lucaa.png
  </a>
</div>

Note that this microformat (as well as the RDFa exemplified earlier), displayed by a browser, is a regular, human readable person description as a string.

3.3.3 Maximizing user intent

Some disambiguation of the retrieved information can also be done at this level, by maximizing the coherence measured for the result: after the answer is retrieved and the result created (as a microformat fragment), a tool for microformats based content recommendation [27] can be used to estimate the interest of the user in one constructed item or another, based on his determined personal preferences, as if the content were already part of the page. Whichever interest is bigger is considered the one that would lead to a more consistent presence of the user on the Social Web, so it is likely the answer he's looking for and the content he wants to create.

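A minimal sketch, assuming a scoring function supplied by the recommending tool of [27]:

def pick_by_intent(fragments, interest_score):
    """Choose the candidate microformat fragment with the highest
    estimated user interest, as if it were already part of the page."""
    return max(fragments, key=interest_score)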
We have seen, in this chapter, how an instrument can be built to allow users access to semantic web resources, with an accessible operating interface, and how the data obtained by such means can be reused by humans in a manner consistent with the current semantic markup initiatives and with the social web. Namely, we've designed an approach towards making the Semantic Web, as a whole, relevant to the Social Web, as an inverse operation of the tools existing nowadays. The next chapters will draw the big picture of this proposed approach, place it in the context of resembling tools and present a proof of concept implementation.


Chapter 4

Putting it all together

Figure 4.1: Overview of the proposed system

The big picture of the proposed system can be seen in figure 4.1,
illustrating the following flow:

- while editing content on the Social Web, a user issues a search, to retrieve an answer from the Semantic Web and use the value in the content he's editing;
- to resolve this query, the system first identifies the subject as a structured object (RDF description), using a semantic search backend;
- to resolve the user property, a user ontology is inferred from the query and the context of the page or the user history;
- the data retrieved about the search subject and the inferred user ontology are then aligned, and correspondences are identified for the searched property, allowing a value to be associated to the user searched property;
- the values thus obtained are presented to the user, who chooses a result to reuse in the context of his editing;
- in order to produce semantic markup upon reuse in the social web content, as microformats, an ontology inferred from the microformat descriptions (XMDP) is used;
- the result ontology, along with the user property description, is aligned with the microformats ontology, and the values for the microformat's properties are identified from the resulting correspondences;
- the microformat thus created is serialized and inserted in the user edited page as the result of his initial inquiry, enriching the content created by the user on the Social Web.

Note that additional steps are taken, which have been omitted from the scheme above for presentation purposes: upon the user's choice of a value to use, feedback is collected, as described by the learning approaches in section 3.1.4, to be used in further alignments; when the markup is created in the edited document, if no satisfactory microformat alignment exists, a fallback on RDFa or eRDF can be used.


Chapter 5

Position with respect to related approaches


A rst category of related applications are the semantic search tools, either the ones that extract semantic from documents to use at query resolution time and to provide an answer rather than a location, the ones that search semantic knowledge bases, or the mixt solutions. Above all these approaches, the current one also raises the problem of handling the response semantics: while the search tools only provide the response to the users as plain text, allowing them to read the web, our proposal also handles what happens with the response after this moment, dealing with the envisaged following write operations, omnipresent on the Social Web. As opposed to initiatives like Wolfram Alpha [88], Google Squared [54] or the hybrid Powerset [69], the data retrieval backend is fully based on the Semantic Web, harnessing the power of the knowledge, not that of the documents, to bring a plus of accuracy. Compared to the search tools rooted in the Semantic Web (like Sindice [81], SWSE [77], Falcons [51] or the domain specic ones), it brings user accessibility, avoiding the formalisms of the Semantic Web in its user interface, and generality, allowing any types of searches, in the entire semantic web. In the same time, for the implementation of this natural language-like querying system, it relies on mechanisms of the semantic web, knowledge processing rather than language processing, as an advantage over question answering engines or hybrid approaches. In the eld of ontology based query answering, [8] cites SWAP [32, 83] (Semantic Web and Peer to Peer) as a representative usage. SWAP uses ontology matching to unify the knowledge in a peer-to-peer data sharing network, to allow communication in a decentralized network in which peers handle independently their knowledge formats. While there are components of user expression processing, the accent in SWAP is on mapping existing known ontologies rather than discovering and aligning them, as in our case. For the specic purpose of handling user queries through ontologies,

46

AquaLog [22, 45] is a project aiming to provide a exible semantic search for any underlying knowledge base: it is installable over any knowledge repository, handling its specic ontology seamlessly, to respond natural language user queries stated against a user queries domain ontology. Two main improvements of our system can be noticed already: while AquaLog only provides access for a single data repository and handles a domain specic ontology, our approach is designed as an interface to the entire semantic web, and it also does not require an input ontology to analyze the user queries against. In addition, AquaLog stresses heavily on the query processing component, employing complex pure natural language processing methods for this handling, while, even with the price of less exibility, our proposal doesnt require such. On the same direction, of allowing semantic user queries but without requiring natural language processing, SemSearch [23] handles user queries where the data types and data keywords are specied in user language. In similar manner, they propose a solution for mapping the user inserted values to ontology entities, but which is limited to the matching made by a text indexing tool for the identiers of the entities (labels, etc). By tackling this task through the syntactic and semantic components of the ontology alignment algorithms, our proposal takes this approach one step further, designing user query recognition based on knowledge instead of text. A second category of related tools are those dealing with the user created content on the web, trying to semantically enrich it. The notable approach is Zemanta [94], which allows to connect related information to the content the user is currently creating, through specically adapted interfaces of common blogging tools, content management systems, web-based email services. These references (links, inserted images, etc.) are suggested automatically, based on natural language processing and text analysis techniques, from a xed set of resources on the web (image sharing services, online shops, music services, etc.), with little semantic web involvement in the process. One dierence can already be noted, in the degree of generality of the addressed data, which is superior in our approach. In the same line, the basis of the connection creation mechanism is the natural language processing not the knowledge representation initiatives. Also, Zemanta is not at all concerned the semantics of the enriched content, the additions to the original content being plain HTML, without any semantic markup. From our knowledge, the proposal presented here is unique in the exibility of semantic markup and in particular microformats creation, oering a transparent tool for the user (not the developer) to create them, in a generic manner, able to handle any type of resource to be represented in any matching social web semantic. Also, the above analysis highlights another particularity of this system, its approach to handling search on the web as data retrieval for reusage, read and write on the web as two connected processes, which should be handled together to facilitate semantic data circulation on the web. 47

Chapter 6

Proof of concept and case study


To prove the proposed system, and to prototype the user interaction, we have completed the first iteration of the implementation, as a contextual semantic search command for the Firefox browser, on top of the Ubiquity [87] extension; the queries are issued as commands of the latter, and the reuse of the response is done on the execution of the command. The technique used for data retrieval is the simple approach presented in the previous chapter: SPARQL queries are issued against semantic data repositories available as such (the current implementation uses DBpedia [48], the DBLP Berlin database [47] and the Linked Movie Database [61]) through their SPARQL endpoints, for subjects and properties resolved through conventions and flexible mapping rules. This proof of concept is based on the PSW script [25, 70], which participated in the Scripting Challenge during the 5th Workshop on Scripting and Development for the Semantic Web, collocated with the European Semantic Web Conference, 2009.

To detail the functionality of such a system, let us consider the example of a user editing a wiki page, therefore producing unstructured content on the web, about his musical preferences. He will open a rich text editor normally and, when the information is needed, invoke a search command through Ubiquity, as Figure 6.1 shows. It is worth remarking how the user interaction takes place, in a non-intrusive manner, accessible in all the contexts in which the user is reading or writing the web: the access to search does not depend on the page or application the user is editing. Also to be noted is the natural language-like form of the search command, in the spirit of Ubiquity, which aims to make web services accessible in natural language, along with the command compose suggestions, provided by the latter's mechanisms to help command processing. For this kind of query expression, an answer will be computed, as mentioned in the approach section.

Figure 6.1: Invoking a search command

Using DBpedia as a knowledge repository (the default, if nothing is specified), the user subject string, Depeche Mode, is looked up in the foaf:name (http://xmlns.com/foaf/0.1/name) property of the objects in the repository, the identifier, in human accessible format, by convention, of the objects in the DBpedia repository, and, for the found objects, an RDF property associated with members is retrieved. A previous assignation resolves this property to the dbpedia-owl:currentMembers (http://dbpedia.org/ontology/currentMembers) RDF property, which is evaluated, and the rdfs:label (http://www.w3.org/2000/01/rdf-schema#label) is retrieved for the identified values, if the values are objects. Note, though, that all these technical details are well hidden from the user, under an easy to understand and straightforward interface. Once the answers are retrieved, the user chooses his answer from the list of values for the property and executes the command (by hitting enter). The result, in its human readable form, is then automatically inserted in the content he was editing on the web, at the position he last left off; see figure 6.2.

Now, upon the user saving and publishing the content on the web, in the resulting HTML, a human accessible representation, the microformats detection tools identify the semantics, namely a person, and offer contextual actions related to the content; the behaviour of the leading microformats processing application in Firefox, Operator [67], can be seen in figure 6.3. This is possible because, when the information was retrieved for the user query and inserted in the edited content as a string, a microformat was actually inserted in the underlying edited HTML, determined to be an hCard for the semantic of the retrieved value, a dbpedia-owl:Person (http://dbpedia.org/ontology/Person).

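The lookup behind this command can be sketched as the following SPARQL query, here issued through the SPARQLWrapper library; the query shape is illustrative of the conventions described above, not a verbatim excerpt of the prototype:

from SPARQLWrapper import SPARQLWrapper, JSON

QUERY = """
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
PREFIX dbpedia-owl: <http://dbpedia.org/ontology/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT ?label WHERE {
  ?band foaf:name "Depeche Mode"@en .
  ?band dbpedia-owl:currentMembers ?member .
  ?member rdfs:label ?label .
  FILTER (lang(?label) = "en")
}
"""

endpoint = SPARQLWrapper("http://dbpedia.org/sparql")
endpoint.setQuery(QUERY)
endpoint.setReturnFormat(JSON)
for row in endpoint.query().convert()["results"]["bindings"]:
    print(row["label"]["value"])   # human readable answers for the list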
Figure 6.2: Receiving the results and reusing the value in the edited content

Figure 6.3: Operator extension detects the semantics

In this prototype, the microformat resolution is made through a flexible association between the property to be retrieved and the semantic of the answer. The same process, of retrieval and reuse, can be iterated multiple times, for all the information the user needs to retrieve (a birth date, the other members of the band or the band URL). On publishing the content, all the retrieved data has its semantics preserved and accessible to microformats processing tools, as can be seen for Tails [86] in figure 6.4. Still, some human queries can target properties which are not to be found in the repository ontology, but their inverses, such as the query for the bands an artist belonged to, seen in figure 6.5, which targets the dbpedia-owl:pastMembers (http://dbpedia.org/ontology/pastMembers) property while looking for the subjects that have the user inserted keyword as a value (David Bowie). In this case, the response is easily retrieved with a flag specifying that the RDF property associated with the user string is to be searched the other way around. The potential limitation imposed by a predefined mapping is handled by allowing the user to specify any property to search for, and by trying to resolve that property as a property name in the default ontology of the targeted repository. This makes possible queries like the ones in figure 6.6, and also allows users to specify explicitly how the response is to be interpreted for these retrieved answers (figure 6.7).

Figure 6.4: Tails extension detects the inserted semantics

Figure 6.5: Retrieving the subject as a value of a mapped property

An overall view of the created content, along with Operator's detection of the embedded semantics, can be seen in Figure 6.8. These results, obtained by a simple proof-of-concept implementation, evaluated as a combination of the tool's human usability and its achievements as a web semantics consumer and producer, put our design efforts on a correct, successful path towards a complete implementation.


Figure 6.6: Searching for additional properties

Figure 6.7: Specifying the type of the response

Figure 6.8: Created content along with the detected semantics

Chapter 7

Conclusions and further work


Motivated by the missing link in todays semantic web initiatives, we have proposed, in this paper, a system to make the rigorous descriptions of data in the Semantic Web, along with its principles, relevant to non-technically trained users, and to provide them with the method of using this data without alterating its value for the machines. This idea is materialized in a search tool rooted in the semantic web data, enhanced with the mechanisms to automatically create semantic markup on the social web when the data such fetched is reused in the content created by the users. After an introduction in the context of the semantics on the web, in its data and social avours, we designed our search system by aggregating the existent approaches, and enhancing their functionality through harnessing the power of the semantic web tools (knowledge models, operations, formalisms) to resolve natural language-like expressions rather than standard natural language processing techniques. After framing the user inquiries in a sintactic exible format transposed to a semantic pattern, we proposed a method to infer a user ontology for the expressions he is using when retrieving information, and, by relying solely on ontology alignment mechanisms, to identify the precise answer in the knowledge graph of the Semantic Web. Further research is to be made in this direction to extend the semantic pattern of the user queries to more sophisticated forms, to allow dierent types of questions and variations, but still preserve the semantic web tools approach to resolving such queries and not go in the natural language processing eld. This can be done by extending the format to additional components and use an advanced solving engine, to handle complex relations by using reasoners to deduce data from exising knowledge (dealing with inverse, symmetric or transitive property characteristics, composition of multiple properties, etc.) rather than matchers to nd it, as the current stage proposes. Also, rigorous usability studies need to be made to conrm

53

the proposed tool as an instrument for non-technical users. Still on the search side, the emerging initiatives of large scale semantic search approaches rooted in standard information retrieval or query answering engines need to be taken into account for future experiments, as a substitute for the natural language command processing, resolving this aspect at the (externalized) search backend level.

For the semantics reusing component, our proposal innovates by putting the problem of retrieval in terms of reuse and by devising a generic mechanism to rewrite these semantics for the social web as microformats, based on a semantic alignment approach for the mapping tool. As a building block, we devised an extension of the microformats description format, to be used as support for inferring a general microformats ontology, flexible to the definition of new microformats or to modifications of existing ones, and in line with the existing mechanisms in the field, built in the spirit of the microformats design principles. We then used this ontology to automatically detect, based on alignment techniques and a proposed mapping model, the corresponding microformat to be used for a retrieved answer, to preserve its semantic. As a future direction, this extension proposal needs to be revised and discussed with the microformats community, for integrating its contributions in the specifications. Also, future improvements on this side must take into account the automatic generation of human targeted content, by better integration of the data semantics with the XHTML semantics and better synchronization of the generated elements with the user relevant markup. With the growth in semantic complexity, both on the data retrieving side and in the data reuse techniques, specific performance issues will need attention, to remain responsive and reasonable from a practical point of view.

At the current stage, the proposal described in this paper explores a view of the desired relation between the Semantic Web and the Social Web, and draws the lines for building it by orchestrating modern services, protocols and formats of the semantic layer, in the spirit of today's decentralized web. It also explores a model of user interaction, addressing the current patterns of user operation and information circulation on the web, and trying to ameliorate the positive discrimination that the Semantic Web applies in favour of the machines, by consolidating the humans' position as peers in the data communication graph.


Bibliography
[1] Sören Auer, Sebastian Dietzold, Thomas Riechert, OntoWiki - A Tool for Social, Semantic Collaboration, The Semantic Web - ISWC 2006, 5th International Semantic Web Conference, 2006

[2] Tim Berners-Lee, Design Issues: Linked Data, http://www.w3.org/DesignIssues/LinkedData.html, 2006

[3] Christian Bizer, Tom Heath, Tim Berners-Lee, Linked Data - The Story So Far, in Heath, T., Hepp, M., and Bizer, C. (eds.), Special Issue on Linked Data, International Journal on Semantic Web and Information Systems (IJSWIS), 2009, to appear

[4] Tantek Çelik, Kevin Marks, Real World Semantics, ETech 2004, http://tantek.com/presentations/2004etech/realworldsemanticspres.html

[5] Irene Celino, Emanuele Della Valle, Dario Cerizza, Andrea Turati, Squiggle: a Semantic Search Engine for Indexing and Retrieval of Multimedia Content, in Proceedings of the 1st International Workshop on Semantic-Enhanced Multimedia Presentation Systems (SEMPS-2006), Athens, Greece, December 6, 2006

[6] John Davies, Richard Weeks, Uwe Krohn, QuizRDF: Search Technology for the Semantic Web, Proceedings of the 37th Annual Hawaii International Conference on System Sciences, 2004

[7] AnHai Doan, Alon Halevy, Semantic integration research in the database community: A brief survey, AI Magazine, Special issue on Semantic integration, 26(1):83-94, 2005

[8] Marc Ehrig, Ontology Alignment: Bridging the Semantic Gap (Semantic Web and Beyond), Springer, October 2006

[9] Marc Ehrig, York Sure, Ontology Mapping - An Integrated Approach, The Semantic Web: Research and Applications, Springer, 2004

[10] Jerome Euzenat, Pavel Shvaiko, Ontology Matching, Springer, 2007


[11] Richard Fikes, Patrick Hayes, Ian Horrocks, OWL-QL: A Language for Deductive Query Answering on the Semantic Web, Journal of Web Semantics, 2(1), 2005

[12] Tim Finin, Li Ding, Rong Pan, Anupam Joshi, Pranam Kolari, Akshay Java, Yun Peng, Swoogle: Searching for knowledge on the Semantic Web, CIKM '04: Proceedings of the thirteenth ACM conference on Information and knowledge management, pp. 652-659, 2004

[13] Dragan Gašević, Dragan Djurić, Vladan Devedžić, Model Driven Engineering and Ontology Development, Second Edition, Springer, 2009

[14] T.R. Gruber, A translation approach to portable ontology specifications, Knowledge Acquisition, Vol. 5, no. 2, pp. 199-220, 1993

[15] Tom Gruber, Where the Social Web Meets the Semantic Web, The 5th International Semantic Web Conference, Keynote Presentation, 2006, http://videolectures.net/iswc06_gruber_wswms/

[16] R. Guha, Rob McCool, Eric Miller, Semantic Search, WWW '03: Proceedings of the 12th international conference on World Wide Web, pp. 700-709, 2003

[17] Andreas Harth, Jürgen Umbrich, Stefan Decker, MultiCrawler: A Pipelined Architecture for Crawling and Indexing Semantic Web Data, 5th International Semantic Web Conference, Athens, GA, USA, November 5-9, 2006

[18] Jeff Heflin, James Hendler, Searching the Web with SHOE, Artificial Intelligence for Web Search, Papers from the AAAI Workshop, pp. 35-40, AAAI Press, 2000

[19] M. Hildebrand, J.R. van Ossenbruggen, L. Hardman, An analysis of search-based user interaction on the Semantic Web, CWI INS Technical report, 2007

[20] Ian Horrocks, Ontologies and the Semantic Web, Communications of the ACM, Vol. 51, no. 12, pp. 58-67, December 2008

[21] Yannis Kalfoglou, Marco Schorlemmer, Ontology mapping: the state of the art, The Knowledge Engineering Review, Volume 18, Issue 1, pp. 1-31, January 2003

[22] Vanessa Lopez, Victoria Uren, Enrico Motta, Michele Pasin, AquaLog: An ontology-driven question answering system for organizational semantic intranets, Journal of Web Semantics, 5, 2, pp. 72-105, Elsevier, 2007


[23] Yuangui Lei, Victoria Uren, Enrico Motta, SemSearch: A Search Engine for the Semantic Web, Lecture Notes in Computer Science: Managing Knowledge in a World of Networks, Volume 4248/2006, pp. 238–245, Springer, 2006
[24] Atanas Kiryakov, Borislav Popov, Ivan Terziev, Dimitar Manov, Damyan Ognyanoff, Semantic annotation, indexing, and retrieval, Web Semantics: Science, Services and Agents on the World Wide Web, Volume 2, Issue 1, pp. 49–79, 1 December 2004
[25] Anca Luca, Practical Semantic Works - a Bridge from the Users' Web to the Semantic Web, 5th Workshop on Scripting and Development for the Semantic Web, May 2009
[26] Anca Luca, Recomandări Web prin intermediul microformatelor (Web recommendations by means of microformats), in Sabin Buraga (coord.), Programarea în Web 2.0, Polirom, 2007
[27] Anca Paula Luca, Sabin Buraga, Microformats Enabled Navigation Assistant, International Conference on Intelligent Systems Design and Applications, INSTICC Press, 2007
[28] Vilfredo Pareto, Alfred N. Page, Translation of Manuale di economia politica (Manual of Political Economy), A.M. Kelley, 1971
[29] Chintan Patel, Kaustubh Supekar, Yugyung Lee, E.K. Park, OntoKhoj: A Semantic Web Portal for Ontology Searching, Ranking and Classification, WIDM '03: Proceedings of the 5th ACM international workshop on Web information and data management, pp. 58–61, 2003
[30] Urvi Shah, Tim Finin, Anupam Joshi, R. Scott Cost, James Mayfield, Information Retrieval on the Semantic Web, CIKM '02: Proceedings of the eleventh international conference on Information and knowledge management, pp. 461–468, 2002
[31] Pavel Shvaiko, Jérôme Euzenat, Ten Challenges of Ontology Matching, Technical Report, University of Trento, August 2008
[32] Steffen Staab, Heiner Stuckenschmidt (Eds.), Semantic Web and Peer-to-Peer: Decentralized Management and Exchange of Knowledge and Information, Springer, 2006
[33] Giovanni Tummarello, Renaud Delbru, Eyal Oren, Sindice.com: Weaving the Open Linked Data, Proceedings of the International Semantic Web Conference (ISWC 2007), 2007
[34] Yi Zhang, Wamberto Vasconcelos, Derek Sleeman, OntoSearch: An Ontology Search Engine, 2004


[35] Network Working Group, URI RFC, http://www.ietf.org/rfc/rfc2396.txt
[36] W3 Consortium, Extensible Stylesheet Language Transformation, http://www.w3.org/TR/xslt
[37] W3 Consortium, Gleaning Resource Descriptions from Dialects of Languages (GRDDL), http://www.w3.org/TR/grddl/
[38] W3 Consortium, OWL Web Ontology Language Reference, http://www.w3.org/TR/owl-ref/
[39] W3 Consortium, Resource Description Framework, http://www.w3.org/RDF/
[40] W3 Consortium, RDF Primer, http://www.w3.org/TR/REC-rdf-syntax/
[41] W3 Consortium, RDF Schema, http://www.w3.org/TR/rdf-schema/
[42] W3 Consortium, RDFa Primer, http://www.w3.org/TR/xhtml-rdfa-primer/
[43] W3 Consortium, SPARQL Query Language for RDF, http://www.w3.org/TR/rdf-sparql-query/
[44] Web Hypertext Application Technology Working Group, Microdata in HTML 5, http://www.whatwg.org/specs/web-apps/current-work/multipage/microdata.html
[45] * * *, AquaLog, http://technologies.kmi.open.ac.uk/aqualog/
[46] * * *, Data Vocabulary, http://www.data-vocabulary.org/
[47] * * *, DBLP Berlin Semantic Repository, http://www4.wiwiss.fu-berlin.de/dblp/
[48] * * *, DBpedia Initiative, http://dbpedia.org/
[49] * * *, DBpedia Mobile, http://wiki.dbpedia.org/DBpediaMobile
[50] * * *, Embedded RDF, http://research.talis.com/2005/erdf/wiki/Main/RdfInHtml
[51] * * *, Falcons Semantic Search, http://iws.seu.edu.cn/services/falcons/
[52] * * *, The FOAF Project, http://www.foaf-project.org/


[53] * * *, Freebase, http://www.freebase.com/
[54] * * *, Google Squared, http://www.google.com/squared
[55] * * *, hCard microformat, http://microformats.org/wiki/hcard
[56] * * *, hCard microformat XMDP, http://microformats.org/wiki/hcard-profile
[57] * * *, HTML Meta Data Profiles, http://www.w3.org/TR/1999/REC-html401-19991224/struct/global.html#h-7.4.4.3
[58] * * *, The KIM Platform: Knowledge & Information Management, http://www.ontotext.com/kim/index.html
[59] * * *, Knoodl, http://knoodl.com
[60] * * *, Last.fm, http://www.last.fm
[61] * * *, Linked Movie Database, http://www.linkedmdb.org/
[62] * * *, Linked Data Initiative, http://linkeddata.org/
[63] * * *, The Microformats Initiative, http://microformats.org
[64] * * *, Microformats in RDF, http://semanticweb.org/wiki/Microformats_in_RDF
[65] * * *, OntoWiki, http://ontowiki.net/
[66] * * *, Open Record, http://www.openrecord.org/
[67] * * *, Operator Firefox Extension, http://www.kaply.com/weblog/operator/
[68] * * *, Plain Old Semantic HTML, http://microformats.org/wiki/posh
[69] * * *, Powerset, http://www.powerset.com/
[70] * * *, Practical Semantic Works, http://students.info.uaic.ro/~lucaa/psw
[71] * * *, Researchers Map in Germany, http://researchersmap.informatik.hu-berlin.de/
[72] * * *, Revyu, http://revyu.com/
[73] * * *, Semantic Media Wiki, http://semantic-mediawiki.org


[74] * * *, Semantic Search Overview on the Semantic Web User Interaction website, http://swuiwiki.webscience.org/index.php/Semantic_Search_Overview
[75] * * *, Semantic Search on Wikipedia, http://en.wikipedia.org/wiki/Semantic_search
[76] * * *, Semantic Web Client Library, http://www4.wiwiss.fu-berlin.de/bizer/ng4j/semwebclient/index.html
[77] * * *, Semantic Web Search Engine, http://www.swse.org/
[78] * * *, Simile Longwell, http://simile.mit.edu/wiki/Longwell
[79] * * *, SHOE Search Engine, http://www.cs.umd.edu/projects/plus/SHOE/search/
[80] * * *, Simile Piggy Bank Firefox Extension, http://simile.mit.edu/wiki/Piggy_Bank
[81] * * *, Sindice, The Semantic Web Index, http://sindice.com/
[82] * * *, Squiggle Semantic Search and Conceptual Indexing, http://swa.cefriel.it/Squiggle
[83] * * *, SWAP: Semantic Web and Peer-to-Peer, http://swap.semanticweb.org/public/index.htm
[84] * * *, Swoogle Semantic Web Search, http://swoogle.umbc.edu/
[85] * * *, Tabulator Firefox Extension, http://dig.csail.mit.edu/2007/tab/
[86] * * *, Tails Firefox Extension, https://addons.mozilla.org/firefox/addon/2240
[87] * * *, Ubiquity Firefox Extension, http://labs.mozilla.com/projects/ubiquity/
[88] * * *, Wolfram Alpha, http://www.wolframalpha.com/
[89] * * *, xFolk microformat, http://microformats.org/wiki/xfolk
[90] * * *, xFolk microformat XMDP, http://microformats.org/wiki/xfolk-profile
[91] * * *, XHTML MetaData Profile, http://gmpg.org/xmdp/
[92] * * *, XMDP Discussions, http://microformats.org/wiki/xmdp-brainstorming
[93] * * *, XWiki, http://www.xwiki.org/
[94] * * *, Zemanta, http://www.zemanta.com/

