
ARTICLE IN PRESS

Information Systems 30 (2005) 543–563


www.elsevier.com/locate/infosys

On the query refinement in the ontology-based searching for information

Nenad Stojanovic
Institute AIFB, University of Karlsruhe, 76128 Karlsruhe, Germany

Abstract

One of the main problems in (web) information retrieval is the ambiguity of users' queries, since users tend to post very short queries which do not express their information need clearly. This also seems to hold for ontology-based information retrieval, in which the domain ontology is used as the backbone of the searching process. In this paper, we present a novel approach for determining possible refinements of an ontology-based query. The approach is based on measuring the ambiguity of a query with respect to the user's original information need. We define several types of ambiguity concerning the structure of the underlying ontology and the content of the information repository. These ambiguities are interpreted with regard to the user's information need, which we infer from the user's behaviour in the searching process. Finally, a ranked list of the potentially useful refinements of her query is provided to the user. We present a small evaluation study that shows the advantages of the proposed approach.
© 2004 Elsevier Ltd. All rights reserved.

1. Introduction

An increasing trend in developing (web) information portals is the usage of ontologies as a semantic means for describing the information content [1]. Indeed, an ontology, as the explicit specification of the conceptualisation of a domain [2], supports structuring information related to that domain, so that the communication between providers and users of information is significantly improved. From the user's point of view, the main benefit is a more efficient process of searching for information. First of all, the retrieval process is organised as an inferencing process, which implies that all retrieved results are relevant for the user's query, i.e., the precision of the retrieval process is 1. Moreover, the recall of the process is quite high, since the inferencing enables the derivation of new relevant information based on the user's query and background knowledge [3].
Obviously, these estimations are valid only under the assumption that the user's query perfectly corresponds to what the user is searching for, i.e., that the meaning of the query cannot be misinterpreted.

E-mail address: nst@aifb.uni-karlsruhe.de (N. Stojanovic).

0306-4379/$ - see front matter © 2004 Elsevier Ltd. All rights reserved.
doi:10.1016/j.is.2004.11.004

Otherwise, in a misinterpreted case, the results can be highly relevant for the user's query, but not for the user herself. Although the formally defined and commonly agreed meaning of the ontology terms reduces the possible (mis)interpretations of a query, and the user can express her information need more precisely, the problem of the perfect matching between the user's query and her need remains in ontology-based information retrieval as well.¹ One of the main reasons is the usage of short queries, whose meaning can be easily misinterpreted.
Moreover, the usage of a common vocabulary (i.e., an ontology) for describing the user's query introduces a new source of ambiguity in interpreting the meaning of the query: the ontology structure. First of all, the hierarchical structuring of information requires the usage of the right level of abstraction in users' queries. Otherwise, the meaning of a query can be too general (broad) and some results will be irrelevant for the user's need. Secondly, a part of a query can be redundant, because it can be derived, using the domain theory, from the other parts of the query. This introduces ambiguity in interpreting what the user is searching for and why she expressed her information need in such a redundant manner, e.g. whether she misunderstood the vocabulary used or described the need inappropriately. In either case, the user gets a list of results that potentially contains some resources irrelevant for her information need. A similar discussion applies to other types of query ambiguity, e.g. when the user posts a query which is unsatisfiable because two conditions in the query imply contradictory information. In all the above-mentioned cases, the user responds by trying to change something in the query in order to obtain results that are more relevant. This process is called query refinement, and it should support the user in closing the gap between her information need and the query she formed.
In such a refinement, the main questions are (i) what the user's initial information need was that she tried to express in an (ambiguous) query and (ii) which part of the user's query to change in order to disambiguate its meaning with regard to the user's need. This implies the need for a system that can discover and measure ambiguities in a query and support the user in resolving these ambiguities efficiently, according to her need. Such a system will enable more efficient searching for information in an ontology-based portal, i.e., it will increase the real precision and recall of the retrieval process.
However, although research in ontology-based information retrieval has made significant progress recently, especially in the Web domain [5], none of these efforts has focused on such a system for discovering and resolving ambiguities in users' queries.
In this paper, we propose a query refinement approach which inspects the ambiguities in the user's query and helps her in a step-by-step modification of the query until the most appropriate query for her information need is achieved. The approach is implemented in a system for managing querying in ontology-based information portals called Librarian Agent [6], which reflects the analogy to a human librarian who helps users find appropriate books in a library. The agent measures query ambiguities regarding the ontology structure and the content of the knowledge base, the so-called structure- and content-related ambiguity. These ambiguities are interpreted from the point of view of the user's need, which is implicitly induced by analysing the user's behaviour. According to these interpretations, the agent defines the neighbourhood of the user's query, which encompasses the similar queries that can be used for replacing (refining) the user's query. Moreover, these refinements are ranked with regard to the user's information need. Therefore, the query refinement process is treated as moving through the neighbourhood of a query in order to decrease its ambiguity with regard to the user's need. We present a small evaluation study that shows the advantages of the proposed approach.

¹ Indeed, from the early beginning of IR research, the user's query has been treated just as an approximation of the, often ill-defined, user's information need [4].

The paper is organised as follows: in Section 2, we give a motivating example for the proposed approach, which is described in Section 3. A small evaluation study is presented in Section 4. In Section 5, we discuss related work, whereas Section 6 contains concluding remarks.

2. Motivating example

In order to make the description of our approach more understandable, in this section we give several examples of problems, regarding the ambiguity of the user's query, which can occur in an ontology-based search for information.
Let us consider the following example related to the activities of an institute, referred to as the Research ontology in the rest of the text:

knowledge base:

researcher(rst)ᵃ                workIn(rst, KM)ᵇ             teaches(rst, Ontology)ᶜ
researcher(nst)                 workIn(nst, KM)              teaches(nst, Ontology)
researcher(ysu)                 workIn(ysu, KM)              teaches(ysu, Ontology)
researcher(jan)                 workIn(jan, KM)              teaches(jan, Ontology)
researcher(meh)                 workIn(meh, KM)              teaches(meh, Ontology)
isA(phDStudent, researcher)ᵈ    researchGroup(KM)            teaches(meh, DM)
isA(professor, researcher)      researchArea(KMsystem)ᵉ      project(OntoWeb)
phDStudent(nst)                 researchArea(CBR)ᶠ           head(rst, OntoWeb)
phDStudent(ysu)                 groupArea(KM, KMsystem)      manages(ysu, OntoWeb)
professor(rst)                  researchIn(rst, CBR)         participate(meh, OntoWeb)
professor(jan)                  lecture(Ontology)            topic(OntoWeb, KM)

ᵃ It means that rst is a researcher.
ᵇ It means that rst works in the group KM.
ᶜ It means that rst teaches a course about Ontology.
ᵈ It means that phDStudent is a subtype of researcher.
ᵉ It means that the KM group is focused on research in KM systems.
ᶠ It means that rst researches in CBR.

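axioms:

\[ \forall x, y, z:\ \mathrm{researchIn}(x, z) \leftarrow \mathrm{workIn}(x, y) \wedge \mathrm{groupArea}(y, z) \tag{1} \]

\[ \forall x, y:\ \mathrm{professor}(x) \leftarrow \mathrm{head}(x, y) \wedge \mathrm{project}(y) \tag{2} \]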
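The two axioms can be read as forward-chaining rules over the facts of the knowledge base. The following is a minimal Python sketch of that reading, not the inference engine used in the paper: the triple encoding of facts and the function name `apply_axioms` are assumptions made for illustration, and only a small extract of the knowledge base is included.

```python
# Facts are (predicate, subject, object) triples; unary facts use None as object.
facts = {
    ("workIn", "rst", "KM"),
    ("workIn", "nst", "KM"),
    ("groupArea", "KM", "KMsystem"),
    ("project", "OntoWeb", None),
    ("head", "rst", "OntoWeb"),
}

def apply_axioms(facts):
    """Return the closure of `facts` under axioms (1) and (2)."""
    closure = set(facts)
    while True:
        new = set()
        for pred, x, y in closure:
            if pred == "workIn":
                # Axiom (1): workIn(x, y) and groupArea(y, z) imply researchIn(x, z).
                for pred2, group, z in closure:
                    if pred2 == "groupArea" and group == y:
                        new.add(("researchIn", x, z))
            elif pred == "head" and ("project", y, None) in closure:
                # Axiom (2): head(x, y) and project(y) imply professor(x).
                new.add(("professor", x, None))
        if new <= closure:  # fixpoint reached: nothing new was derived
            return closure
        closure |= new

closure = apply_axioms(facts)
```

In this extract, every member of the KM group obtains a derived researchIn fact for KMsystem, and only rst, who heads a project, is derived to be a professor.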


In the following, we analyse several problems that can occur in querying this repository:
Case 1 – redundant queries (I): Suppose that a user posts the following query:
\[ \forall x:\ \mathrm{workIn}(x, \mathrm{KM}) \wedge \mathrm{researchIn}(x, \mathrm{KMsystem}) \]
which searches for all researchers who work in the group KM and who research in the research topic KMsystem.
According to axiom (1) and the fact groupArea(KM, KMsystem), it follows that all researchers who work in the KM group research in the topic KMsystem as well, which implies that the constraint researchIn(x, KMsystem) is redundant for this query. Consequently, the user should be informed that this constraint has no effect on the search. This information should help the user understand how her information need is interpreted in the query space, and she will probably want to refine the query. For example, if the user's information need was to cluster members of the KM group by their research topics (and therefore she used the condition researchIn(x, KMsystem)), she could try to cluster them based on some subtopics of the topic KMsystem. However, performing this refinement in a trial-and-error manner can be a very time-consuming process. Therefore, a query refinement system should analyse the ontology and the information repository in order to recommend to the user the refinements that are the most appropriate for her information need.
Case 2 – imprecise queries: Suppose the query:

\[ \forall x, y:\ \mathrm{researcher}(x) \wedge \mathrm{head}(x, y). \]

Since, according to axiom (2), only a professor can be the head of a project, this query will be evaluated only for professors. The user should be informed about such an interpretation of her query, since she explicitly expressed the need for a researcher,² probably expecting a longer list of results. If the user finds the list of answers too short, the query refinement system should recommend to her the most suitable refinements with regard to her initial information need, i.e., a refinement that reflects the user's explicit need for researchers. For example, assuming that the user is interested in clustering the researchers with regard to the management of a project, an adequate refinement can be the replacement of the relation head with the relation manages, since each researcher can manage a project.
Case 3 – redundant queries (II): Suppose the following query:

\[ \forall x:\ \mathrm{workIn}(x, \mathrm{KM}) \wedge \mathrm{teaches}(x, \mathrm{Ontology}). \]

Although there are no additional constraints implied by the ontological structure (i.e., rules), the system should discover that in the underlying information repository all researchers who work in the KM group are involved in the Ontology lecture. Therefore, the constraints posted by the user are redundant, and she should be informed and guided on how to resolve that problem, similarly to Case 1.
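The content-based redundancy of Case 3 can be detected by checking whether dropping a constraint changes the list of results. The sketch below illustrates this check under simplifying assumptions: queries have a single variable, each constraint is a (relation, value) pair, and the names `answers` and `redundant_constraints` are hypothetical helpers rather than part of the Librarian Agent.

```python
# Toy extract of the knowledge base as (predicate, subject, object) triples.
members = ["rst", "nst", "ysu", "jan", "meh"]
kb = {("workIn", m, "KM") for m in members}
kb |= {("teaches", m, "Ontology") for m in members}
kb.add(("teaches", "meh", "DM"))

def answers(constraints, kb):
    """Individuals x for which every constraint pred(x, value) is a fact in kb."""
    individuals = {subject for _, subject, _ in kb}
    return {x for x in individuals
            if all((pred, x, value) in kb for pred, value in constraints)}

def redundant_constraints(constraints, kb):
    """Constraints whose removal leaves the result list unchanged."""
    full = answers(constraints, kb)
    redundant = []
    for c in constraints:
        rest = [d for d in constraints if d != c]
        if rest and answers(rest, kb) == full:
            redundant.append(c)
    return redundant

query = [("workIn", "KM"), ("teaches", "Ontology")]
flagged = redundant_constraints(query, kb)
```

For the query of Case 3, each of the two constraints is flagged, since either one alone already returns all five members of the KM group.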
Case 4 – misformulated queries: Suppose the following query:

\[ \forall x:\ \mathrm{researcher}(x) \wedge \mathrm{project}(\mathrm{OntoWeb}). \]

Influenced by Boolean searching, in which a query consists of terms syntactically correlated using AND and/or OR connectors, users tend to forget the semantic relations which exist in an ontology, and to make Boolean-like queries in ontology-based searching. In such cases, the meaning of the query is not clear, since the query can be interpreted in several (many) ways. For example, the above-mentioned query could express several relations between researchers and projects. Assuming that the user needs information about a specific relation which she failed to name properly or to find on her own, the task of a query refinement system is to support the user in finding this specific relation. Since the user sets no further constraints for the desired relation, the system should enable her to judge all possible relations between researchers and projects.³ This means that the user should be provided with a step-by-step refinement of the query, which enables inspecting the most general relations first.
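To make Case 4 concrete, the candidate relations between researchers and projects can be enumerated directly from the facts. The sketch below is only an illustration on a toy extract of the knowledge base; the function name `candidate_relations` is an assumption, and the sketch does not order the relations by generality, which the refinement process would additionally require.

```python
# Toy extract of the knowledge base; unary facts use None as object.
kb = {
    ("researcher", "rst", None),
    ("researcher", "ysu", None),
    ("researcher", "meh", None),
    ("project", "OntoWeb", None),
    ("head", "rst", "OntoWeb"),
    ("manages", "ysu", "OntoWeb"),
    ("participate", "meh", "OntoWeb"),
    ("workIn", "rst", "KM"),
}

def candidate_relations(kb, domain_concept, range_concept):
    """Names of binary relations linking an instance of domain_concept
    to an instance of range_concept."""
    domain = {s for p, s, o in kb if p == domain_concept and o is None}
    range_ = {s for p, s, o in kb if p == range_concept and o is None}
    return sorted({p for p, s, o in kb if s in domain and o in range_})

relations = candidate_relations(kb, "researcher", "project")
```

Here the relations head, manages and participate are found, whereas workIn is excluded because KM is not a project.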
It is clear that resolving these situations requires a careful analysis of the underlying ontology and knowledge base with regard to the user's query, in order to suggest the most suitable refinements from the user's point of view. In the next section, we present an approach for such a query refinement, called the Librarian Agent Query Refinement Process.

² Note that the query ∀x researcher(x) ∧ head(x, OntoWeb) returns the same results as the query ∀x head(x, OntoWeb).
³ For a user, it is more difficult to express clearly what she is searching for than to recognise a relevant (searched for) item once it is shown to her.

3. Librarian agent query refinement process

The goal of the Librarian Agent Query Refinement Process is to enable the user to efficiently find the results relevant for her information need, even if some of the problems sketched in Section 2 appear in the searching process. Indeed, this query refinement process can be seen as a way to deal with (to compensate for) problems which exist in the structure (i.e., vocabulary, information repository, query mechanism) of a portal. These problems lead to misinterpretations of the user's needs in the query, so that many irrelevant results and/or only few relevant results are retrieved. In the Librarian Agent Query Refinement Process, potential ambiguities⁴ (i.e., misinterpretations) of the initial query are first discovered and assessed (the so-called Ambiguity Discovery phase). Next, these ambiguities are interpreted with regard to the user's information need, in order to estimate the effect of an ambiguity on the fulfilment of the user's goals (the so-called Ambiguity Interpretation phase). Finally, the recommendations for refining the given query are ranked according to their relevance for fulfilling the user's information need and according to the disambiguation of the meaning of the query (the so-called Query Refinement phase). In that way, the user is provided with a list of relevant queries, sorted according to their capability to decrease the number of irrelevant results and/or increase the number of results relevant⁵ for the user's need.
In the next three subsections, we explain these three phases in more detail.

3.1. Ambiguity discovery

We define the query ambiguity as an indicator of the gap between the user's information need and the query that results from the need. The more ambiguous a query is, the more (mis)interpretations it has, i.e., the probability that the query will be interpreted in the right manner (according to the user's expectations) decreases. Users often characterise the ambiguity of a query through the number of results: many results can be an indicator that there are some irrelevant results, i.e., that some other information needs are covered by the query. In most existing IR systems, the user gets only this information, the number of results, as the characterisation of the ambiguity. However, the ambiguity of a query is a more complex category, and handling it requires a more formalised approach.
We have found two main factors which affect the ambiguity of a query:
(a) The vocabulary (ontology): e.g. if in the vocabulary the concept researcher is modelled through three subconcepts, phDStudent, postDoc and professor, the query ∀x researcher(x) can be (mis)interpreted as the information need for information resources about (i) researchers, or (ii) phDStudents, or (iii) postDocs or (iv) professors.
(b) The information repository: e.g. if adding the term topic(y, KM) to the query ∀x, y researcher(x) and project(y) and participate(x, y) is redundant (i.e., adding it does not change the list of results), this query can be (mis)interpreted as the information need for information resources about (i) researchers and projects or about (ii) researchers and projects and KM.
In the first case, the query ambiguities are defined on the schema level, i.e., without considering the instances of the ontology. In the second case, the instantiations of the schema for a particular domain are analysed, in order to find possible misinterpretations of the user's query. By measuring these ambiguities, the sources of the misinterpretations (problems) can be discovered. Consequently, they have to be resolved in the refinement process.
⁴ The term ambiguity depicts the inability of the user to define her information need in a clear manner (the so-called ill-defined information need), so that the created query can be interpreted in various ways. We do not consider an inherent ambiguity of the query language (the so-called ill-formed queries) as query ambiguity in this research.
⁵ I.e., to increase the precision and recall of the retrieval process.

Therefore, we define two types of ambiguity in the interpretation of a query: (i) the semantic ambiguity, as a characteristic of the ontology used, and (ii) the content-related ambiguity, as a characteristic of the repository. In the next two subsections, we give more details.

3.1.1. Semantic ambiguity


The goal of an ontology-based query is to retrieve the set of all instances which fulfil all constraints given in the query. In such a logic query, the constraints are applied to the query variables. For example, in the query
∀x researcher(x) and workIn(x, OntoWeb)
x is a query variable and workIn(x, OntoWeb) is a query constraint. The stronger these constraints are, the more relevant the retrieved instances are for the user's information need (note that the query is just an approximation of the user's need and an unambiguous query can be treated as a very good approximation of the need). Therefore, in order to define the semantic ambiguity of a query, we need a measure to estimate how strongly each of the query variables is constrained.
Since an instance in an ontology is described through (i) the concept it belongs to and (ii) its relations to other instances, we see two factors which determine the semantic ambiguity of a query variable:
• the concept hierarchy: how general the concept (type, class) which the variable belongs to is (how many interpretations it has);
  e.g. the query ∀x researcher(x) is more ambiguous than the query ∀x phDStudent(x);
• the relation instantiation: how descriptive (strong) the constraints applied to the variable are;
  e.g. the query ∀x researcher(x) and workIn(x, OntoWeb) is more ambiguous than the query ∀x researcher(x) and workIn(x, OntoWeb) and researchIn(x, KM).

In other words, to define the semantic ambiguity of a query, we examine how many constraints are applied to each query variable and how specific each of these constraints is. Consequently, we define two parameters to estimate these values:
(1) VariableGenerality as
\[ \mathrm{VariableGenerality}(X) = \mathrm{Subconcepts}(\mathrm{Type}(X)) + 1, \]
where Type(X) is the concept the variable X belongs to and Subconcepts(C) is the number of subconcepts of the concept C.

For example, for the query ∀x researcher(x), VariableGenerality(X) = 4 if there are three subconcepts of the concept researcher.
(2) VariableAmbiguity as
\[ \mathrm{VariableAmbiguity}(X, Q) = \frac{|\mathrm{Relation}(\mathrm{Type}(X))| + 1}{|\mathrm{AssignedRelation}(\mathrm{Type}(X), Q)| + 1} \times \frac{1}{|\mathrm{AssignedRelation}(X, Q)| - |\mathrm{AssignedRelation}(\mathrm{Type}(X), Q)| + 1}, \]
where Relation(C) is the set of all relations defined for the concept C in the ontology, and AssignedRelation(C, Q) is the set of all relations from Relation(C) which appear in the query Q. Note that |AssignedRelation(C, Q)| ≥ 0. AssignedRelation(X, Q) is the set of all relations related to the variable X which appear in the query Q. Note that |AssignedConstraints(X, Q)| > 1. |A| denotes the cardinality of the set A.
Note that VariableAmbiguity(X, Q) > 0 and that smaller values indicate less ambiguity in interpreting the variable; e.g. for the query ∀x project(x) and topic(x, KM) and head(x, rst), VariableAmbiguity(X) = 6/3, assuming that there are five relations defined for the concept project, i.e., |Relation(project)| = 5.
On the other hand, for the query ∀x project(x) and topic(x, KM) and topic(x, KMsystem), VariableAmbiguity(X) = (6/2)(1/3).
Note that although in both cases the same number of constraints (= 2) is defined on the query variable, the values of the VariableAmbiguity parameter in these queries are different. The ambiguity in the first case is lower, since the constraints are related to different relations from the ontology. In this way, we give a slight advantage to the uniform constraining of the meaning of a query variable in the ontology space (i.e., constraining the meaning using as many different ontology relations as possible). However, the query ∀x project(x) and topic(x, KM) is more ambiguous than the query ∀x project(x) and topic(x, KM) and topic(x, KMsystem).
The total ambiguity of a variable is calculated as the product of these two parameters, in order to uniformly model the directly proportional effect of both parameters on the ambiguity. Note that the second parameter can be less than 1.
\[ \mathrm{Ambiguity}(X, Q) = \mathrm{VariableGenerality}(X) \times \mathrm{VariableAmbiguity}(X, Q). \]
Finally, for the query Q, the semantic ambiguity is calculated as follows:
\[ \mathrm{SemanticAmbiguity}(Q) = \sum_{x \in \mathrm{Variable}(Q)} \mathrm{Ambiguity}(x, Q), \]
where Variable(Q) represents the set of variables that appear in the query Q.
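The calculation of the semantic ambiguity can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the implementation of the Librarian Agent: the function signatures and the dictionary layout are hypothetical, and the second factor of VariableAmbiguity is assumed to be 1 / (number of constraints on the variable − number of distinct relations used + 1), which is one reading of the definition.

```python
from fractions import Fraction

def variable_generality(num_subconcepts):
    """VariableGenerality(X) = Subconcepts(Type(X)) + 1."""
    return num_subconcepts + 1

def variable_ambiguity(num_relations, relations_used):
    """VariableAmbiguity(X, Q) under the assumption stated above.

    num_relations:  |Relation(Type(X))|, relations defined for the concept
    relations_used: one relation name per constraint on X in the query
    """
    distinct = len(set(relations_used))   # distinct relations of Type(X) used in Q
    total = len(relations_used)           # all relation constraints on X in Q
    first = Fraction(num_relations + 1, distinct + 1)
    second = Fraction(1, total - distinct + 1)
    return first * second

def semantic_ambiguity(variables):
    """Sum of VariableGenerality * VariableAmbiguity over all query variables."""
    return sum(
        variable_generality(v["subconcepts"])
        * variable_ambiguity(v["relations"], v["used"])
        for v in variables
    )

# Query: project(x) and topic(x, KM) and head(x, rst); five relations are
# defined for project, and project is assumed to have no subconcepts.
q = [{"subconcepts": 0, "relations": 5, "used": ["topic", "head"]}]
total_ambiguity = semantic_ambiguity(q)
```

Exact fractions are used so that values such as 6/3 can be compared without rounding. Under the stated assumption, a variable constrained by a single topic relation is more ambiguous than one constrained by two topic relations, in line with the ordering described in the text.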


Moreover, by analysing these parameters, it is possible to discover which of the query variables introduces most of the ambiguity in the query. Consequently, this variable should be refined in the query refinement phase. However, the previous analysis is based only on the structure of the ontology and the assumption that each structure in the ontology is instantiated. Indeed, it is possible that, according to the structure of a query, a query variable should be further constrained, but such constraints do not exist (are not instantiated) in the information repository. For example, in the query ∀x researcher(x) and researchIn(x, KM), the variable x has a high value for VariableGenerality and should be specified further, e.g. by constraining its generality using the concept phDStudent. However, if there is no phDStudent who researches in KM, this refinement should not be recommended to the user. Therefore, in order to recommend only the refinements which are useful for the user's query, we have to measure the ambiguity of the query with regard to the content of the information repository.

3.1.2. The content-related ambiguity


As we have seen in the last example, the ontology structure is just one component for determining candidates which can replace an ambiguous query, in order to close the gap between the query and the user's real information need. One can say that the semantic ambiguity estimates the theoretical ambiguity of a query. Moreover, it enables quantifying the ambiguity very precisely and recommending some refinements by determining which parts of the query are more ambiguous than others. However, an ontology defines just a model of how the entities from a real domain should be structured. If there is a part of the model to which no entities from the real domain belong, i.e., a part which is not instantiated in the domain, then that part of the model cannot be used for calculating ambiguity. Moreover, an ontology is a commonly shared model, usually developed manually, in a collaborative process, for a community of users. It is possible that some dependencies which exist in the real domain are missing in the ontology.⁶ Therefore, we use the content of the knowledge base to prune the results of the ontology-related analyses of the user's query.
From the ontology point of view, we have analysed the structure of a query without actually considering the results of the query. From the content point of view, on the other hand, the results of a query are used for defining potential ambiguities which arise in the querying process. For example, if two queries have the same list of results, then the list of results can be treated as an ambiguous entity: it can be (mis)interpreted as the result of two different queries. However, since the user posts a query and wants to refine a query, and not directly the list of results, we will interpret all content-related ambiguities on the level of the user's query. Regarding the previous example, two queries that return the same list of results are treated as (structurally) equivalent queries (although it is possible that these queries do not have any query term in common), i.e., after posting a query, a list of structurally equivalent queries is presented to the user, in order to quantify the content-related ambiguity of her query. For the precise interpretation of this equivalence, see Definition 3.
Therefore, the content-related ambiguity of a query can be defined by comparing the results of the given query with the results of other queries. In the rest of this subsection, we first define several relations between queries, which are thereafter used for estimating the content-related ambiguity of a query.

3.1.2.1. Query neighbourhood. Due to the lack of space, we omit here the formal definition of an ontology, which we use in this research and which can be found in [7]. We introduce here only the notation for ontology-related entities that are used in the rest of this subsection:
• Q(O) is a query defined against the ontology O. The setting in this paper encompasses positive conjunctive queries. However, the approach can be easily extended to queries that include negation and disjunction.
• Ω(O) is the set of all possible elementary queries⁷ for an ontology O.
• KB(O) is the set of all relation instances (facts) which can be proven in the given ontology O. It is called the knowledge base.
• A(Q(O)) is an answer for the query Q regarding the ontology O.

Definition 1 (Ontology-based information repository). An ontology-based information repository IR is the structure (R, O, ann), where:
• R is a set of elements r_i that are called resources, R = {r_i}, 1 ≤ i ≤ n;
• O is an ontology which defines the vocabulary used for annotating these resources. We say that the information repository is annotated with the ontology O, i.e., with its knowledge base KB(O);
• ann⁸ is a binary relation between the set of resources and the set of facts from the knowledge base KB(O), ann ⊆ R × KB(O). We write ann(r, m_i), meaning that the metadata m_i is assigned to the resource r (i.e., the resource r is annotated with the metadata m_i).
Definition 2 (Query-Answering pair (the user's request)). A Query-Answering pair B in an information repository IR = (R, O, ann) which is annotated using the ontology O is a tuple B = (M′, R′), where
• M′ ⊆ Ω(O) is called the set of B_constraints,
• R′ ⊆ R is called the set of B_resources. It follows that
\[ R' = \{\, r \in R \mid \forall m \in A(M'):\ \mathrm{ann}(r, m) \,\}. \]

⁶ An ontology evolution process is needed in order to keep an ontology (logically) consistent and aligned with the users' needs.
⁷ An elementary query contains only one query constraint.
⁸ ann stands for annotation (e.g. providing metainformation).

Note that a B_constraint corresponds to a constraint introduced in Section 3.1.1.
Note that a Query-Answering pair corresponds to the user's request. In that context, B_constraints are features of a resource the user is interested in and B_resources are resources that fulfil the request. In order to simplify the notation, instead of the term Query-Answering pair, in the rest of the text we will use the term "user's query", which means: the user's query against a given ontology-based repository.
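As an illustration of Definition 2, the set of B_resources can be computed by keeping exactly those resources that are annotated with every fact in the answer of the constraints. The following Python sketch assumes that the annotation relation is stored as a dictionary from resources to sets of facts; the resource names, the string encoding of facts and the function name `b_resources` are hypothetical.

```python
# Hypothetical annotation relation ann: resource -> facts (metadata) assigned to it.
ann = {
    "paper1": {"researcher(nst)", "workIn(nst, KM)"},
    "paper2": {"researcher(nst)"},
    "report": {"workIn(nst, KM)", "project(OntoWeb)"},
}

def b_resources(answer_facts, ann):
    """R' = { r in R | for all m in A(M'): ann(r, m) }."""
    required = set(answer_facts)
    return {r for r, metadata in ann.items() if required <= metadata}

# `request` stands for A(M'), the facts that answer the constraints M'.
request = {"researcher(nst)", "workIn(nst, KM)"}
resources = b_resources(request, ann)
```

Only paper1 carries both facts, so it is the single B_resource of this request. An empty set of facts is satisfied by every resource.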
In the following, we define the relations between users' queries, in order to measure the content-related ambiguity of a query. The overall goal is to determine which refinements of the user's initial query can be recommended to the user, in order to retrieve more results that are (more) relevant for her need.
Definition 3. Structural equivalence (=) between two user's queries B_1, B_2 is defined by
\[ (M'_1, R'_1) = (M'_2, R'_2) \iff R'_1 = R'_2, \]
which can be written as B_1 = B_2 ⇔ R′_1 = R′_2.
It means that two user's queries are structurally equivalent if their sets of resources (B_resources) are the same. Note that the structural equivalence is not based on the syntax or semantics of the two queries, but on the equivalence of their results.
Definition 4 (Cluster of user's queries (in the rest of the text: Query cluster)). The set of all structurally equivalent user's queries forms a Query cluster D = (M_x, R_y), where
• M_x ⊆ Ω(O) is called the set of D_constraints (set of constraints) and contains the union of the constraints of all queries that are equivalent. For the user's query B_u, it is calculated in the following manner:
\[ M_x = \bigcup_{\forall B_i:\ B_i = B_u} B_i\_\mathrm{constraints}. \]
It holds: ∀B_1, B_2: B_1 = B_2 ∧ B_1_constraints ⊆ M_x → B_2_constraints ⊆ M_x.
• R_y ⊆ R is called the set of D_resources (resource set) and is equal to the B_resources set of the query Q_x. Formally: R_y = { r ∈ R | ∀m ∈ A(M_x) → ann(r, m) }. Note that the set of resources is equal in all (mutually) equivalent queries.

The Query cluster which contains all the existing resources in IR (i.e., a cluster for which R_y = R) is called the root cluster.
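Definitions 3 and 4 amount to grouping queries by their result sets and taking the union of their constraints. The sketch below shows this grouping in Python under the assumption that each user's query is given as a pair (constraints, resources); the function name `query_clusters` and the sample data are illustrative only.

```python
from collections import defaultdict

def query_clusters(queries):
    """Group user's queries (M', R') by their resource set R'.

    Returns a dict that maps each resource set R_y to M_x, the union of the
    constraints of all queries that return exactly R_y.
    """
    clusters = defaultdict(set)
    for constraints, resources in queries:
        clusters[frozenset(resources)] |= set(constraints)
    return {res: frozenset(cons) for res, cons in clusters.items()}

queries = [
    ({"workIn(x, KM)"}, {"r1", "r2"}),
    ({"teaches(x, Ontology)"}, {"r1", "r2"}),   # same results: equivalent query
    ({"teaches(x, DM)"}, {"r2"}),
]
clusters = query_clusters(queries)
```

The first two queries are structurally equivalent and therefore end up in one cluster whose constraint set contains both of their constraints, while the third query forms a cluster of its own.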
The set of all Query clusters D is denoted by D(IR). We define the following relations on this set.

Definition 5. Structural subsumption (parent–child relation) (<) is defined by
\[ (M_{x_1}, R_{y_1}) < (M_{x_2}, R_{y_2}) \iff R_{y_1} \subset R_{y_2} \quad\text{or}\quad (M_{x_1}, R_{y_1}) < (M_{x_2}, R_{y_2}) \iff M_{x_1} \supset M_{x_2}. \]
A Query cluster D_2 subsumes another cluster D_1 if the set of D_resources of D_2 subsumes the resource set of the cluster D_1, or if the set of D_constraints of D_1 is subsumed by the set of constraints of the cluster D_2. Note that this relation is irreflexive, anti-symmetric and transitive.
We define two special subsumption relations on the set of Query clusters:
• DirectParents as: D_1 <_dir D_2 ⇔ D_1 < D_2 ∧ ¬∃D_i: D_1 < D_i < D_2. In that case, we call D_2 a direct_parent cluster of D_1.
• DirectChildren as: D_1 >_dir D_2 ⇔ D_2 < D_1 ∧ ¬∃D_i: D_2 < D_i < D_1. In that case, we call D_2 a direct_child cluster of D_1.
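The direct parents and direct children of a cluster can be computed from the resource sets alone, since structural subsumption corresponds to proper inclusion of the resource sets. The following Python sketch assumes that each cluster is represented only by its resource set as a frozenset; the function names are illustrative.

```python
def direct_parents(cluster, clusters):
    """Clusters D_i with cluster < D_i and no other cluster strictly in between."""
    parents = [c for c in clusters if cluster < c]   # `<` is proper subset
    return [p for p in parents if not any(cluster < q < p for q in clusters)]

def direct_children(cluster, clusters):
    """Clusters D_i with D_i < cluster and no other cluster strictly in between."""
    children = [c for c in clusters if c < cluster]
    return [c for c in children if not any(c < q < cluster for q in clusters)]

clusters = [
    frozenset({"r1"}),
    frozenset({"r1", "r2"}),
    frozenset({"r1", "r3"}),
    frozenset({"r1", "r2", "r3"}),   # root cluster: contains all resources
]
parents_of_smallest = direct_parents(frozenset({"r1"}), clusters)
```

In this example, the cluster {r1} has two direct parents, {r1, r2} and {r1, r3}; the root cluster subsumes it as well, but only indirectly.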

Definition 6 (Query Neighbourhood (Map)). The Neighbourhood of the user's query B_u is the structure N := (E, P, C, γ, η), where
• E := {B_i | B_i = B_u}, i.e., the set of all user's queries equivalent to B_u. Consequently, all queries from E form the starting cluster for B_u, in notation D_start(B_u), i.e., E = D_start(B_u);
• P := {D_i | D_i >_dir D_start(B_u)}, i.e., the set of all direct_parent clusters of D_start(B_u), DirectParents(D_start(B_u));
• C := {D_i | D_i <_dir D_start(B_u)}, i.e., the set of all direct_child clusters of D_start(B_u), DirectChildren(D_start(B_u));
• γ: P → ℜ is the relevance function for direct_parent clusters (it is used for ranking direct_parent clusters);
• η: C → ℜ is the relevance function for direct_child clusters (it is used for ranking direct_child clusters)
(ℜ denotes the set of real numbers).

3.1.2.2. Quantifying content-related ambiguity. For a user's query B_a, we define two properties which characterise its content ambiguity: Largest_equivalent_query and Smallest_equivalent_query.
For a Query cluster D_a, we define four properties which characterise its content ambiguity: Uniqueness, NecessarySetOfObjects, Covering and CoveringTerms.
The Largest_equivalent_query of the query Ba is the set of constraints found in its equivalent query with
the most query_terms, Ba max : Ba max is equal to the largest Query cluster that contains Ba ; the so-called
starting cluster Dstart Ba ; or Da as a shorthand here. This cluster is calcu lated as Da M xa ; Rya ; so that
M xa  M 0 a ^ :9Di ; Da odir Di ; M xi  M 0 a : Note that M 0a is the set of attributes in the query Ba :
The Smallest equivalent query is the set of constraints found in the smallest equivalent query for the given
users query, Ba min : There can be several such queries. They are calculated in the following way: Ba min 2
f M xi \ M xa ; Rya jDa odir Di ; i 1; . . . ; ng; where Da Dstart Ba :
Note that both types of equivalent queries (Largest and Smallest ) retrieve the same list of results as the
original query Ba :
For the query cluster $D_a$, it is possible to define a subset of resources which are unique for the cluster, i.e., they cannot be obtained for any direct_child cluster. We call that the Uniqueness of the cluster, and the ambiguity is inversely proportional to it.
$\mathrm{Uniqueness}(D_a) = R_{y_a} \setminus \bigcup \{R_{y_i} \mid D_i <_{dir} D_a,\ i = 1, \ldots, n\}.$
The set of objects which have to be contained in the resource set of a cluster is called the NecessarySetOfObjects of the cluster. It means that by excluding these resources from the resource set of a query, the cluster becomes equivalent to one of the direct_child clusters. For a query cluster, there can be several sets of NecessarySetOfObjects, which are calculated as follows:
$\mathrm{NecessarySetOfObjects}(D_a) \in \{R_{y_a} \setminus R_{y_i} \mid D_i <_{dir} D_a,\ i = 1, \ldots, n\}.$
Covering and CoveringTerms are measures which define the percentage of identical D_resources and D_constraints, respectively, in two query clusters. More formally, for two clusters $D_a$ and $D_b$, we define:
$\mathrm{Covering}(D_a, D_b) = |R_{y_a} \cap R_{y_b}| / \max\{|R_{y_a}|, |R_{y_b}|\},$

$\mathrm{CoveringTerms}(D_a, D_b) = |M_{x_a} \cap M_{x_b}| / \max\{|M_{x_a}|, |M_{x_b}|\}.$
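The cluster-level parameters can be illustrated with a short Python sketch. It follows the prose definitions (resources not obtainable from any direct_child cluster, resources to exclude to reach one child, share of identical elements); the function names, the handling of empty sets and the toy data are assumptions made for this illustration only.

```python
from typing import FrozenSet, List


def uniqueness(resources: FrozenSet[str],
               child_resources: List[FrozenSet[str]]) -> FrozenSet[str]:
    """Resources of a cluster that no direct_child cluster retrieves."""
    covered = frozenset().union(*child_resources)
    return resources - covered


def necessary_sets(resources: FrozenSet[str],
                   child_resources: List[FrozenSet[str]]) -> List[FrozenSet[str]]:
    """One candidate set per direct_child: what must be excluded to reach that child."""
    return [resources - child for child in child_resources]


def covering(a: FrozenSet[str], b: FrozenSet[str]) -> float:
    """|a & b| / max(|a|, |b|); usable for resource sets and for constraint sets."""
    largest = max(len(a), len(b))
    # Returning 0.0 for two empty sets is an assumption; the text does not cover it.
    return len(a & b) / largest if largest else 0.0


# Hypothetical resource sets of a cluster and its two direct_child clusters.
parent = frozenset({"r1", "r2", "r3", "r4"})
children = [frozenset({"r1", "r2"}), frozenset({"r2", "r3"})]
```

Here only `r4` is unique to the parent cluster, and the parent shares half of its resources with the first child, so `covering(parent, children[0])` is 0.5.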


It is clear that the calculation of all above-mentioned parameters (Sections 3.1.1 and 3.1.2) could be time-
consuming. In order to make this calculation more effective, we use methods developed in the formal
concept analysis (FCA) [8] for organising data in the so-called concept lattices which correspond to the
multi-inheritance hierarchical clusters. Each of these clusters can be considered a query-answering pair and,
consequently, the lattice represents the clustering of a query space.
Due to the lack of space, we omit here the detailed introduction of the FCA, which can be found in [8].
We mention only the main concepts needed for the understanding of our approach. The formal concept
analysis (FCA) is a technique derived from the lattice theory that has been successfully used for various
analysis purposes. The organisation of the data is achieved via a mathematical entity called a formal
context. A formal context is a triple (G, L, I), where G is a set of objects, L is a set of attributes, and I is a

Table 1
A part of the ResearchInstitute ontology given in Section 2

Attributes (columns, in order): Researcher; Professor; Project; workIn->>LA (LA:Project); Research area; researchIn->>CBR (CBR:ResearchArea); researchIn->>KM (KM:ResearchArea). Objects are listed as rows.

rst x x x x x x x
nst x x x x x x
ysu x x x x x x
jan x x x x x x
meh x x x x x
sha x

binary relation between the objects and the attributes. A formal concept of a formal context (G, L, I) is a pair (A, B) where $A \subseteq G$, $B \subseteq L$, $A = B' = \{g \in G \mid \forall l \in B: (g, l) \in I\}$ and $B = A' = \{l \in L \mid \forall g \in A: (g, l) \in I\}$. For a formal concept (A, B), A is called the extent, and is the set of all objects that have all the attributes defined in B. Similarly, B is called the intent, and is the set of all attributes possessed by all the objects in A. As the number of attributes in B increases, the concept becomes more specific, i.e., a specialisation ordering is defined over the concepts of a formal context (Table 1).
In this representation, more specific concepts have larger intents and are considered less than (<) concepts with smaller intents. The same partial ordering is achieved by considering extents, in which case more specific concepts have smaller extents. The partial ordering over concepts is always a lattice.
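The two derivation operators and the notion of a formal concept can be made concrete with a brute-force Python sketch. It is not the algorithm of [8] or of the paper: it simply closes every attribute subset, which is exponential and only suitable for toy contexts. The function names and the three-object context are hypothetical.

```python
from itertools import combinations
from typing import Dict, FrozenSet, Set, Tuple

Context = Dict[str, FrozenSet[str]]  # object -> set of attributes it has
Concept = Tuple[FrozenSet[str], FrozenSet[str]]  # (extent, intent)


def extent(context: Context, attrs: FrozenSet[str]) -> FrozenSet[str]:
    """B': all objects that have every attribute in attrs."""
    return frozenset(g for g, has in context.items() if attrs <= has)


def intent(context: Context, objs: FrozenSet[str]) -> FrozenSet[str]:
    """A': all attributes shared by every object in objs."""
    shared = frozenset().union(*context.values())
    for g in objs:
        shared = shared & context[g]
    return shared


def formal_concepts(context: Context) -> Set[Concept]:
    """Enumerate all formal concepts by closing every subset of attributes."""
    all_attrs = sorted(frozenset().union(*context.values()))
    concepts: Set[Concept] = set()
    for k in range(len(all_attrs) + 1):
        for combo in combinations(all_attrs, k):
            a = extent(context, frozenset(combo))
            concepts.add((a, intent(context, a)))
    return concepts


# Hypothetical toy context (not Table 1 of the paper).
toy_context: Context = {
    "o1": frozenset({"Researcher", "Professor"}),
    "o2": frozenset({"Researcher"}),
    "o3": frozenset({"Researcher", "Project"}),
}
toy_concepts = formal_concepts(toy_context)
```

Read as query states, each pair in `toy_concepts` is a query (the intent) together with its answers (the extent); the concept with the largest extent has the smallest intent, as stated above.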
Note: Since an ontology uses the three-dimensional space for presenting information (object-attribute-value), a transformation into the two-dimensional space (attribute-value) is needed. Due to the lack of space, we omit here the discussion of this transformation. For example, the information workIn(rst, LA) is represented as the pair (rst, workIn->>LA) in the table. In order to enhance the readability of the table, we replace the relations with the name of the range of the relation (for example, LA:Project is the replacement for workIn->>LA, because the relation workIn has the concept Project for the range).
Such a representation enables a very intuitive interpretation of a query: one can see a formal concept as a representation of a query state, where the intent of the formal concept represents the query itself, and the extent represents all resources that match the query. For example, the ontology-based query ∀x, y researcher(x) and workIn(x, y) and project(y) and researchIn(x, KM) will be mapped into the formal concept described as ({Project, LA:Project, KM:Research_Area}, {meh}) in the concept lattice. Note that a formal concept encompasses all objects from its super-concepts, i.e., the (attribute, object) set for that formal concept is: ({Researcher, Project, LA:Project, KM:Research_Area}, {meh, jan, nst, rst, ysu}).
Such an ordering in the query space enables a very easy interpretation of query results regarding their ambiguity. Moreover, the values for the content-related ambiguity parameters can be read directly from the concept lattice. For the given query ∀x, y researcher(x) and workIn(x, y) and project(y) and researchIn(x, KM), these parameters are as follows:
Largest equivalent query: ∀x, y researcher(x) and workIn(x, y) and project(y) and researchIn(x, KM) and workIn(x, LA)
Smallest equivalent query: ∀x, y researcher(x) and workIn(x, y) and project(y)
Uniqueness: meh
NecessarySetOfObjects: meh
Covering for upper formal concept: 1/2
CoveringTerms for upper formal concept: 5/4

These parameters are useful for estimating the ambiguity. A user is provided with this information, in order to determine the position of her query with respect to other queries. That can enhance the efficiency of the query refinement process. For the given example, according to the Largest equivalent query, expanding the initial query with the term ResearchArea will not cause any changes in the set of answers. Moreover, the Smallest equivalent query, ∀x, y researcher(x) and workIn(x, y) and project(y), means that the constraint researchIn(x, KM) in the query ∀x, y researcher(x) and workIn(x, y) and project(y) and researchIn(x, KM) is redundant, because all researchers research in the KM research area. Further, according to the Covering parameter, almost all results from the query ∀x researcher(x) are contained in the results of the query ∀x, y researcher(x) and workIn(x, y) and project(y) and researchIn(x, KM), which means that the importance of the query constraints related to Project and the KM research area is not high.
By reusing methods for creating an FCA lattice, our approach enables an efficient calculation of the above-mentioned ambiguity parameters. More details about these calculations can be found in [9].

3.2. Interpretation of query ambiguities w.r.t. user's needs

The previously defined parameters estimate the ambiguity of a query regarding the underlying ontology and information repository. However, the problems in the meaning of a query have to be analysed/discovered regarding the user's needs, i.e., regarding the resources the user is searching for. For example, it is possible that the ambiguous query ∀x researcher(x) matches very precisely what the user is searching for.
This discussion implies the need to interpret query ambiguities with respect to the user's need. However, modelling the need of the user is a non-trivial task, especially when, as in our approach, all the user's interactions are anonymous. Moreover, since users are reluctant to provide explicit feedback about the relevance of resources, the model of the user's preferences in the current searching session has to be developed implicitly, i.e., by analysing the so-called implicit relevance feedback [10], whose main idea is to infer the information need of the user by analysing her interaction with the portal.
We found three types of user behaviour whose analysis can indicate the intention of the user in searching:
1. the process of forming a query that corresponds to her information need,
2. the process of changing the query, in order to conform it to her information need and
3. the process of interacting with some resources that are relevant for her information need.

It means that by analysing (i) the structure of the user's query, (ii) the way the user changed her query and (iii) the actions the user performed regarding retrieved results, we can discover what the information need of the user in querying is. In the rest of this subsection, we present such analyses.

3.2.1. User's query


A general assumption is that a user forms a query according to her current information need, i.e., that all parts of the query correspond, to some extent, to her need. Although it is possible that some constraints the user introduced in a query are results of a mistake, we do not consider such a case in our analysis, since it can be very difficult to distinguish between errors and intentions in the user's query. Therefore, each constraint the user made is important for expressing her information need. Even in case the user makes a redundant constraint, which will not be used in the information retrieval process, it should be used for discovering the user's information need. For example, in the query ∀x researcher(x) and professor(x), the first constraint will not be used in the retrieval process, since all Professors are Researchers as well. However, this redundant constraint can be an indicator of what the intention of the user in querying is, which she failed to express in an unambiguous manner. For example, an interpretation might be that the user is interested in the research-related (not lecture-related) aspects of a professor. Consequently, the query should be expanded in such a manner.
In order to discover such signs of hidden user's needs, we analyse two content-related ambiguity parameters: the Smallest equivalent query and the Largest equivalent query. The difference between the user's query and these min- and max-equivalent queries corresponds to the user's needs she did not express in the query.
Let us analyse the example given in Section 3.1.2.2. The user formed the query ∀x, y researcher(x) and workIn(x, y) and project(y) and researchIn(x, KM). The Largest equivalent query is ∀x, y researcher(x) and workIn(x, y) and project(y) and researchIn(x, KM) and workIn(x, LA), and the Smallest equivalent query is ∀x, y researcher(x) and workIn(x, y) and project(y).
Since the smallest equivalent query does not contain the constraint researchIn(x, KM), it means that this part of the user's need is not used in the information retrieval process, i.e., the query results (i.e., researchers) are not constrained regarding the research area the corresponding researcher is involved in. It is possible that such a constraint is very important for the user's need, but the user failed to express it clearly. In other words, the relation researchIn might be an important constraint, but the research area should be replaced with another research area which clusters researchers in a better way. We call such a constraint an unsatisfied constraint. Consequently, a potential refinement of the query should take into account this information. For example, considering Fig. 1, the most suitable refinement for the user's query should be CBR:researchArea, since it contains a refinement of a hidden constraint.
Moreover, the largest equivalent query is an expansion of the user's original query with constraints that do not influence the results of searching. It encompasses a set of equivalent queries from the usage point of view. Therefore, this query will replace the original one in the further query refinement steps. However, it is possible that a constraint from the largest equivalent query does not influence the user's need at all, i.e., it neither matches nor disturbs the user's information need. We call such a constraint a weak constraint and it should not influence the query refinement process, i.e., the refinement of the constraint does not help in closing the gap between the user's query and her information need. For the given case, the constraint that the researcher works in the project LA is irrelevant (but not contradictory) for the user's information need.
In order to model this kind of user preference, we introduce a parameter called Importance, which relates each constraint in a query to a number between 0 and 1. By default, Importance(constraint) = 0.5, which indicates an average importance. In case of an unsatisfied constraint, Importance = 1. On the other hand, for the weak constraints, Importance = 0.1.
Note that a constraint corresponds to a B_constraints of a user's request (see Definition 2).
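The assignment of Importance values can be sketched as follows. This is a simplification under stated assumptions, not the paper's procedure: every constraint that only the Largest equivalent query contributes is treated as weak, and every constraint of the user's query that is absent from the Smallest equivalent query is treated as unsatisfied. The function and constant names are hypothetical.

```python
from typing import Dict, Set

DEFAULT_IMPORTANCE = 0.5      # average importance
UNSATISFIED_IMPORTANCE = 1.0  # stated by the user, but not needed for retrieval
WEAK_IMPORTANCE = 0.1         # contributed only by the Largest equivalent query


def importance(query: Set[str],
               smallest_equivalent: Set[str],
               largest_equivalent: Set[str]) -> Dict[str, float]:
    """Map each constraint of the Largest equivalent query to an Importance value."""
    scores: Dict[str, float] = {}
    for constraint in largest_equivalent:
        if constraint not in query:
            scores[constraint] = WEAK_IMPORTANCE          # assumption: always weak
        elif constraint not in smallest_equivalent:
            scores[constraint] = UNSATISFIED_IMPORTANCE   # assumption: always unsatisfied
        else:
            scores[constraint] = DEFAULT_IMPORTANCE
    return scores


# The running example of this section, written as plain strings.
user_query = {"researcher(x)", "workIn(x, y)", "project(y)", "researchIn(x, KM)"}
smallest = {"researcher(x)", "workIn(x, y)", "project(y)"}
largest = user_query | {"workIn(x, LA)"}
example_scores = importance(user_query, smallest, largest)
```

For the running example, `researchIn(x, KM)` receives 1, `workIn(x, LA)` receives 0.1, and the remaining constraints keep the default 0.5.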

3.2.2. Process of changing a query


In an information portal, users are often unfamiliar with the content of the repository and try to
make short queries, in order to be sure that some results will be retrieved. In the next step, they try to
expand the query with more constraints, until the quality/quantity of the results corresponds to their
expectation. It means that users form a session of queries that results in a list of resources relevant for
their needs.

Fig. 1. An example showing the process of generating a concept lattice from the set of data given in Table 1. The concepts represented in the lattice should be read as in the following example: the leftmost concept, ({Prof.}, {jan}), corresponds to the objects (jan, rst) and attributes (Researcher, Prof., Project, LA:Project, KM:ResearchArea); some attributes are inherited from upper formal concepts.

By analysing the sequence in which a query was formed, some useful information about the current user's need can be inferred. For example, if the current query is ∀x, y project(x) and researcher(y) and head(x, y) and workIn(y, KM), then two different interpretations of the current user's focus can be generated in case the query session was:
(1)
∀x project(x)
∀x, y project(x) and researcher(y) and head(x, y)
∀x, y project(x) and researcher(y) and head(x, y) and workIn(y, KM)
or
(2)
∀y researcher(y)
∀y researcher(y) and workIn(y, KM)
∀x, y project(x) and researcher(y) and head(x, y) and workIn(y, KM)
In the first case, it is more likely that the constraints related to project(x) should not be refined further, since in the last steps the user tried to give more information related to the constraint researcher(y).
In the second case, by contrast, the user's interest in project(x) was introduced recently and could be refined further.
In order to model these dependencies between the query's constraints and the query session, we introduce the function Actuality(constraint), which defines the actuality of a constraint for the user's need by assigning a real number between 0 and 1 to each constraint.
A simple calculation of the Actuality is as follows:
$\mathrm{Actuality}(constraint) = 1/(num\_steps + 1),$
where num_steps is the number of session steps the constraint is involved in.
Finally, we combine the Importance and the Actuality of a constraint X in order to get the total importance:

$\mathrm{Actual\_Importance}(X) = \mathrm{Actuality}(X) \cdot \mathrm{Importance}(X).$
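A small Python sketch can make the session-based weighting concrete. It rests on two assumptions about the formulas, whose operators are not fully legible in this copy: Actuality is read as 1/(num_steps + 1), which keeps the value between 0 and 1, and the combination with Importance is read as a product. All function names are hypothetical.

```python
from typing import Dict, List, Set


def steps_per_constraint(session: List[Set[str]]) -> Dict[str, int]:
    """Count in how many session steps (queries) each constraint appears."""
    counts: Dict[str, int] = {}
    for query in session:
        for constraint in query:
            counts[constraint] = counts.get(constraint, 0) + 1
    return counts


def actuality(num_steps: int) -> float:
    """Assumed reading of the formula: 1 / (num_steps + 1)."""
    return 1.0 / (num_steps + 1)


def actual_importance(num_steps: int, importance: float) -> float:
    """Assumed combination: Actuality multiplied by Importance."""
    return actuality(num_steps) * importance


# Session (1) from the example above, one set of constraints per step.
session_1 = [
    {"project(x)"},
    {"project(x)", "researcher(y)", "head(x, y)"},
    {"project(x)", "researcher(y)", "head(x, y)", "workIn(y, KM)"},
]
steps = steps_per_constraint(session_1)
```

Under these assumptions, `project(x)` (present in all three steps) gets the lowest Actuality, 0.25, while the most recently added constraint `workIn(y, KM)` gets 0.5.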



3.2.3. User's action


The theory of implicit relevance feedback is based on the assumption that if the user selects a resource from the list of retrieved results, then this resource corresponds, to some extent, to the user's information need. It means that, by analysing the commonalities in the properties of selected resources, we can infer more information about the intention of the user in the current querying.
For example, continuing the example from Section 3.2.1, if there are ten results for the query ∀x, y researcher(x) and workIn(x, y) and project(y) and researchIn(x, KM) and the user selects three of them, all of which are PhDStudents, it can be inferred that the user is more interested in PhDStudents than in the other two subtypes of researchers, namely professors and postDocs. Consequently, such a refinement should be recommended to the user with a higher priority. We call such a constraint the desired constraint.
We model this phenomenon by introducing the relation Implicit_Relevance(constraint), which assigns a real number between 0 and 1 to each constraint. The relevance of a desired constraint is high (about 1).

3.3. Query refinement

The goal of this phase is to support the user in changing her initial query in such a way that the modified query expresses the user's information need more clearly. This support is provided in the form of recommendations for the query's modification, from which the user selects the most suitable one. These recommendations are based on the problems (i.e., ambiguities) regarding the structure/meaning of the query, which are discovered in the previous phase. In particular, the neighbourhood of the current query is presented to the user, including the information about the query ambiguities. This list of recommendations is complete and minimal, i.e., none of these neighbour queries can be deleted and no query has to be added, in order to enable a step-by-step query refinement, i.e., all possible query refinements can be achieved from this set. This property results from the FCA clustering of the queries and, due to the lack of space, we omit the proof here.
In [9], we presented how the existence of extreme values of some ambiguity parameters can influence the query refinement process. For example, too high a value of the Covering/CoveringTerms parameter means that the query gives similar results to a query from the neighbourhood, and if the user needs more results, then in the query refinement process the query should be moved in the direction of the similar query.
Since the list of potential refinements can be large, we focus the attention of the user on the refinements which are likely to help her in finding what she is searching for, i.e., we rank the recommendations according to their ambiguities and relevance for the user's information need.
We assume that with every decision taken by the user more information is gained about the user's intentions and background. Each transition (refinement) from the Query cluster $D_1 = (d, R_{y_1})$ to the cluster $D_2 = (e, R_{y_2})$ involves a conscious choice by the user by preferring the constraints contained in e to the other options in the constraint d. Note that according to Definition 5: $e \subset d$. The probability of a constraint d being the target of the navigation process is related to the distance fluctuations while travelling the navigation path. Recent fluctuations have more impact than past fluctuations. Typically, the farther removed a constraint is from the path, the less likely it is to be the target of the search. This is captured by the probability of d being the search target, after having traversed search path p. For this purpose we will transform the search space into a transition network, allowing the use of Markov chain theory. First, the set of states is defined as the set of query constraints augmented with a special state called stop. This state represents the termination of the search process. The transitions between states are defined as follows: if constraints x and y are connected regarding the ontology structure, then they are connected in the transition network.

We assume for each transition e a probability $q_e > 0$ of occurring. In a transition network, for each state a: $\sum_{b: a \to b} q_{a \to b} = 1$. The transition matrix T is defined as

$T(x, y) = \begin{cases} q_s & \text{if } x \in \mathrm{Related}(y), \\ 0 & \text{otherwise,} \end{cases}$

where Related(y) is the set of constraints that are in a relation with the constraint y regarding the underlying ontology. For example, Related(PhDStudent(x)) = {Researcher(x)}, since PhDStudent is a subconcept of Researcher.
Further, $T^i(x, y)$ is the probability of reaching y starting from x in i transitions. $T^0 = I$, the identity matrix. Thus, the sum $\sum_{i=0}^{\infty} T^i(e, d)$ is the probability of reaching d from e in any number of steps.
Next, we focus on the probability Pr(d|e) of a constraint d being the search target of a navigation path. The destination probability for d after traversing a path from e is defined as follows:

$\Pr(d \mid e) = \sum_{i=0}^{\infty} T^i(e, d) \cdot T(d, stop).$

The infinite sum converges to $(I - T)^{-1}(e, d)$ (see [11]). This requires the calculation of the inverse matrix of an $|\Omega(O)| \times |\Omega(O)|$ matrix, where $\Omega(O)$ is the set of elementary constraints regarding the given ontology O. Note that this operation can be calculated off-line.
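The destination probabilities can be illustrated with a dependency-free Python sketch. It inverts I - T over the constraint states by Gauss-Jordan elimination and multiplies each entry by the probability of stopping at the destination. The matrix layout (row = source, column = target), the function names and the two-constraint example are assumptions made for illustration; the paper does not prescribe an implementation.

```python
from typing import List

Matrix = List[List[float]]


def invert(matrix: Matrix) -> Matrix:
    """Gauss-Jordan inversion with partial pivoting for a small dense matrix."""
    n = len(matrix)
    aug = [list(row) + [1.0 if i == j else 0.0 for j in range(n)]
           for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("matrix is singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(n):
            if r != col:
                f = aug[r][col]
                aug[r] = [v - f * w for v, w in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def destination_probabilities(T: Matrix, to_stop: List[float]) -> Matrix:
    """result[e][d] = (I - T)^-1 (e, d) * T(d, stop), with T[i][j] = P(i -> j)."""
    n = len(T)
    i_minus_t = [[(1.0 if i == j else 0.0) - T[i][j] for j in range(n)]
                 for i in range(n)]
    fundamental = invert(i_minus_t)
    return [[fundamental[e][d] * to_stop[d] for d in range(n)] for e in range(n)]


# Hypothetical network with two constraints A (index 0) and B (index 1):
# from A: 0.5 to B, 0.5 to stop; from B: 0.25 to A, 0.75 to stop.
T_example = [[0.0, 0.5], [0.25, 0.0]]
stop_example = [0.5, 0.75]
pr = destination_probabilities(T_example, stop_example)
```

Because every walk in this example eventually stops somewhere, each row of `pr` sums to 1; starting from A, the search ends at B with probability 3/7.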
However, in the step-by-step refinement a user is navigating through query clusters that can contain several constraints. In that case we expand the probability to include the cluster information as follows:

$\Pr(D_d \mid D_e) = \frac{1}{|M_{x_d}|} \sum_{d \in M_{x_d}} \frac{1}{|M_{x_e}|} \sum_{e \in M_{x_e}} \Pr(d \mid e).$
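Read as a double average over the constraints of the two clusters, the cluster-level probability can be sketched as below. The dictionary layout `pr[(d, e)]` for Pr(d|e), the function name and the numbers are hypothetical and serve only as an illustration.

```python
from typing import Dict, List, Tuple


def cluster_probability(pr: Dict[Tuple[str, str], float],
                        m_d: List[str],
                        m_e: List[str]) -> float:
    """Average Pr(d|e) over all constraints d of the target and e of the source cluster."""
    total = 0.0
    for d in m_d:
        total += sum(pr[(d, e)] for e in m_e) / len(m_e)
    return total / len(m_d)


# Hypothetical constraint-level probabilities Pr(d|e), keyed as (d, e).
pr_example = {
    ("a", "x"): 0.2, ("a", "y"): 0.4,
    ("b", "x"): 0.6, ("b", "y"): 0.8,
}
value = cluster_probability(pr_example, ["a", "b"], ["x", "y"])
```

With these numbers the per-constraint averages are 0.3 and 0.7, so the cluster-level value is 0.5 up to floating-point rounding.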

We use this destination probability function as a starting point for computing which neighbours bring the user most directly towards the most probable destination constraint. This is formalised by assigning the coefficient Relevance to each destination cluster $D_d$ that belongs to the direct_child clusters of a query $D_e$, as follows:

$\Pr(D_d \mid D_e) = \frac{1}{|M_{x_d}|} \sum_{d \in M_{x_d}} \frac{1}{|M_{x_e}|} \left( \sum_{e \in M_{x_e}} \Pr(d \mid e) \cdot \mathrm{Implicit\_Relevance}(d) \cdot \mathrm{Actual\_Importance}(e) \cdot \mathrm{AmbiguityVariable}(e)_{D_e} \right).$

In that way, our approach prioritises highly relevant refinements, i.e., the refinements that are related to the highly ambiguous variables and tailored to the user's need.

4. Evaluation

The research presented in this paper is a part of the Librarian Agent, a management system we have developed for the improvement of searching in an information portal. The Librarian Agent is developed using the KAON ontology engineering framework (kaon.semanticweb.org). As a test bed for the presented research, we use the VISION Portal (www.km-vision.org), a semantics-driven portal that allows browsing and querying of state-of-the-art information (researchers, projects, software, etc.) related to knowledge management. It is developed in the scope of the EU-funded VISION project, which should provide a strategic roadmap towards the next-generation organisational knowledge management. The backbone of the system is the VISION ontology, which includes the ResearchInstitute ontology presented in Section 2. It is used as a common vocabulary for providing and searching for information. The ontology lexical layer contains about 1000 terms, and the information repository consists of about 500 information resources (the web pages of concrete persons, projects, etc.). Each of the information resources is related to a concrete instance in the ontology (e.g. to the person John Taylor). The query refinement system is implemented as an additional support in the searching process. When the refinement support is turned on, after posting a query, the user gets the query placed in its neighbourhood.
There are several techniques for query refinement which are successfully applied in traditional information retrieval (e.g. local context analysis (LC) [12]). However, all of them are based on analysing the content of the resources retrieved for a query, in order to recommend expansion terms for the query. Since ontology-based information retrieval is focused on facts, i.e., formal statements, the traditional query refinement methods cannot be applied straightforwardly, and we did not consider them in the evaluation experiment. Although our approach can be seen as an adaptation of the traditional methods for ontology-based querying, it employs much more semantics about the queries and the user's needs, due to the background knowledge provided by the domain ontology. Consequently, qualitatively new results can be achieved in the refinement, as illustrated in Section 2.
The goal of our experiment was to evaluate how the effectiveness of ontology-based querying (i.e., inferencing) changes when the query process is enhanced with the presented refinement facility. More precisely, we evaluated the ability of our system to help the user define her information need more precisely. To obtain the basic retrieval system with which to compare our system, we simply turned off the query-refinement support.
For the experiment, we randomly selected 20 queries which cannot be expressed precisely using the defined vocabulary, but whose answers are contained in the information repository. For example, a question was: Find researchers with diverse experiences about Semantic Web, which cannot be directly expressed using the given ontology vocabulary, but can be answered by considering the information repository. For example, there are two persons who work in five projects related to the Semantic Web. They can be treated as the broadly experienced experts for the Semantic Web.
We tested six subjects in the experiment. The subjects were computer science students with little knowledge of the ontology domain and no prior knowledge of the system. The six subjects were asked to retrieve the resources relevant to ten queries in one session, using the two retrieval methods. For assigning the queries to the methods, we used a repeated-measures design, in which each subject searched each query using each method. To minimise sequence effects, we varied the order of the two methods. The subjects were asked to confirm explicitly when they found a relevant answer. Otherwise, the searching was treated as unsuccessful.
For each search, we considered four measures: success, quality, number of queries, and search time (i.e., the time needed by the user to perform her task). The quality (0 to 1) is the subjective judgment of three domain experts about the relevance of the results which are proclaimed by the user as a success. The results are displayed in Table 2. The table shows that searching with the query refinement support results in better evaluation scores for all measures. These results are not surprising, because our approach complements the basic capabilities of a retrieval system with additional useful features. In particular, it allows a smooth query refinement/enlargement, which is likely to be the key factor for obtaining the improvement in the searching time [13]. Moreover, the experiment shows that our system can play the role of a query assistant which, according to the user's query, provides more (quantified) information about the queries around the initial query, making the process of expressing/satisfying the user's needs more efficient (about 85% of searching was highly relevant).

Table 2
Average values of retrieval performance measures

Method    Success for the session    Quality for the session    Number of queries per question    Search time (s) for session
Boolean   57%                        0.6                        10.3                              2023
Our       85.7%                      0.9                        5.2                               1203

5. Related work

The research presented in this paper can be related to several research areas, some of which we present in this section.

5.1. Query ambiguity

The determination of an ambiguity in a query, as well as the sources of such an ambiguity, is the prerequisite for efficient searching for information. Word sense disambiguation of the terms in the input query and of the words in the documents has been shown to be useful for improving both the precision and recall of an IR system [14]. In [15], a set of experiments using lexical relations from WordNet for query expansion is described, but without treating the query ambiguity. Although some work has recently been done in quantifying the query ambiguity based on the language model of the knowledge repository [16,17], the IR research community has not explored the problem of using a rich domain model in modelling the querying. Some very important results in query analysis can be found in the deductive database community [18], namely the semantic query optimisation. That approach, although revolutionary for using domain knowledge for the optimal compilation of queries, does not consider the ambiguity of the query regarding the user's information need at all.

5.2. Query refinement

There is a lot of research devoted to query refinement in the Web IR community. In general, we see two directions of modifying queries or query results to the needs of users: query expansion and recommendation systems, respectively. Query expansion is aimed at helping users make a better query, i.e., it attempts to improve the retrieval effectiveness by replacing or adding extra terms to an initial query. Interactive query expansion supports such an expansion task by suggesting candidate expansion terms to users, usually based on the hyper-index [19] or on concept hierarchies automatically constructed from the document repository. In [20] the model of the query-document space is used for interactive query expansion. Recommendation systems [21] try to recommend items similar to those a given user has liked in the past (content-based recommendation), or try to identify users whose tastes are similar to those of the given user, and recommend items they have liked (collaborative recommendation). Personalised web agents, e.g. WebWatcher [22], track the user's browsing, and formulate user profiles which are used in suggesting which links are worth following from the current web page. However, none of these approaches uses a rich domain model for the refinement of a query, i.e., the reasons for doing a refinement are not based on a deep understanding of the structure of a query, or a deep exploration of the interrelationships in the information repository. Moreover, none of them tries to determine (measure) the ambiguity in a query, and to suggest a refinement which will decrease such an ambiguity.
In [13], the authors described an approach named REFINER, which combines Boolean information retrieval and content-based navigation with concept lattices. For a Boolean query, REFINER builds and displays a portion of the concept lattice associated with the documents being searched, centred around the user's query. The cluster network displayed by the system shows the result of the query along with a set of minimal query refinements/enlargements. A similar approach is proposed in [23], by adding the size of the query result as an additional factor of the navigation. Moreover, the distance between queries in the lattice is used for similarity ranking.

The main drawback of these approaches is that both start from the characteristics of the concept lattice and try to map these characteristics into the query refinement process. In that way, practically, they constrain the queries which are analysed to the set of queries which correspond to the formal concepts in a formal context (i.e., a query is treated only when it is the largest query of another query in the repository). Thus, they do not model the query refinement process and miss some very important properties for the refinement (such as a set of equivalent queries, Covering, CoveringTerms, etc.). Another problem is that they do not consider the partial ordering of the metadata into a vocabulary, losing the possibility to find a real specialisation/generalisation of a query, which is one of the very frequently used refinements. Moreover, the ambiguity of the query, as the crucial reason for the query refinement process, is not treated at all.
Conceptually, the approach most similar to our query refinement system is Query By Navigation [24],
an approach for navigating through a hyperindex of query terms. The hyperindex search engine [19]
leads users to add, delete or substitute a term in the initial query by providing the minimal query
refinements/enlargements. It is designed specifically to (i) help users provide a precise description
of their information need and (ii) reduce information overload by presenting the search result at a
higher level of abstraction. Moreover, in [25] the analogy between the lithoid, a crystalline structure
which organises document descriptions (and may be used to support searchers in formulating their
information demands via Query By Navigation), and the formal concept lattice is shown and used in
phrase searching.
Although Query By Navigation treats each query the user posts separately, it lacks a variety of analyses,
especially regarding the ambiguity of the query and query equivalence. Since the approach is
not based on a controlled vocabulary, the query transformations regarding the query's siblings are not
explicitly recognised.
A lot of work on improving the cooperative behaviour of information systems has been done in the fields
of database query answering systems, logic programming, and deductive databases (see [26] for a review).
One common concern is to recover from a failing query (i.e., a query producing an empty answer set) by
extending relational and deductive database systems with facilities to find minimal failing, maximal
succeeding, and minimal conflicting subqueries. Compared to our approach, the work done in this area
relies on different assumptions (powerful query languages) and has a different scope.
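As an illustration of what a minimal failing subquery is, the sketch below enumerates the subqueries of a small conjunctive query by brute force. It is a sketch under stated assumptions, not the method of the cooperative answering systems reviewed in [26]: the table `ROWS`, the attribute names and both function names are hypothetical, and queries are plain sets of (attribute, value) constraints.

```python
from itertools import combinations

# Hypothetical toy table; each row is one resource description.
ROWS = [
    {"role": "professor", "topic": "KM", "city": "Karlsruhe"},
    {"role": "phd", "topic": "KM", "city": "Karlsruhe"},
    {"role": "professor", "topic": "IR", "city": "Berlin"},
]


def answers(constraints):
    """Rows satisfying every (attribute, value) constraint."""
    return [r for r in ROWS if all(r.get(a) == v for a, v in constraints)]


def minimal_failing_subqueries(query):
    """Failing subqueries none of whose proper subsets fail."""
    failing = []
    # Enumerate by increasing size, so smaller failures are found first.
    for size in range(1, len(query) + 1):
        for sub in combinations(sorted(query), size):
            candidate = frozenset(sub)
            if any(f <= candidate for f in failing):
                continue  # contains a smaller failing subquery: not minimal
            if not answers(candidate):
                failing.append(candidate)
    return failing


query = {("role", "phd"), ("topic", "IR"), ("city", "Karlsruhe")}
for sub in minimal_failing_subqueries(query):
    print(sorted(sub))
```

The enumeration is exponential in the number of constraints, which is acceptable for the short queries discussed here but would need pruning for larger ones.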
General remark: By not quantifying the query refinement process, none of the approaches enables guiding
the user through the query's neighbourhood, which is one of the primary tasks of a human shop assistant in
brick-and-mortar searching.

6. Conclusion

In this paper, we presented an approach for query refinement in ontology-based IR systems. The
system treats a library scenario in which a user searches for knowledge resources in a repository.
Consequently, the so-called Librarian Agent plays the role of the human librarian in the traditional
library: it uses all possible information about the domain vocabulary, the behaviour of users and the
capacity of the knowledge repository in order to help users find the resources they are interested in. Based
on various analyses, the agent guides the users, through an interactive dialogue, towards more efficient searching
for information. In particular, for the query given by the user, the agent measures its ambiguity with regard to
the underlying vocabulary (i.e., ontology) as well as the content (capacity) of the information repository. In
the case of high ambiguity, the agent suggests to the user the most effective reformulation of the query. The
recommendations are ranked according to their relevance to the user's information need. We presented an
evaluation study, which showed that this approach decreases the time and enhances the precision of the
retrieval process.
We find that our approach represents an important step in simulating the brick-and-mortar environment
by applying the practical results obtained in that environment to searching for information in the virtual
world. Moreover, this approach leads to self-adaptive knowledge portals, which can automatically discover, by
analysing users' interactions with the system, some drawbacks in their structure and evolve
their structure correspondingly.
One very important benefit that we did not elaborate in this paper is the possibility of using this
approach for more cooperative question-answering scenarios, such as suggesting to the user how to change
(not just refine) her query in order to get a better result that exists in the neighbourhood of the query. For
example, if the user is searching for a professor who is an expert in KM, but the repository contains only
one KM-expert professor and five such Ph.D. students, then the system should recommend that the
user replace the original constraint with a constraint related to Ph.D. students. This will be part of our
future work.
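The professor/Ph.D. student example can be sketched as follows. This is only an illustration of the idea, since the paper leaves the mechanism to future work. The repository contents, the `SIBLINGS` table and the rule "suggest a sibling concept that yields more results" are all assumptions made for this sketch.

```python
# Hypothetical repository of (resource id, concept, expertise) facts:
# one KM-expert professor and five KM-expert Ph.D. students.
REPOSITORY = [("p1", "Professor", "KM")] + [
    (f"s{i}", "PhDStudent", "KM") for i in range(1, 6)
]

# Hypothetical sibling concepts in the domain ontology.
SIBLINGS = {"Professor": ["PhDStudent"], "PhDStudent": ["Professor"]}


def count_results(concept, expertise):
    """Number of resources of the given concept with the given expertise."""
    return sum(1 for _, c, e in REPOSITORY if c == concept and e == expertise)


def suggest_replacements(concept, expertise):
    """Sibling concepts yielding more results, largest result set first."""
    base = count_results(concept, expertise)
    better = [
        (sib, count_results(sib, expertise))
        for sib in SIBLINGS.get(concept, [])
        if count_results(sib, expertise) > base
    ]
    return sorted(better, key=lambda pair: -pair[1])


print(suggest_replacements("Professor", "KM"))  # [('PhDStudent', 5)]
```

A real system would presumably weigh such a replacement against the user's inferred information need rather than rely on result counts alone.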

Acknowledgements

The research presented in this paper would not have been possible without our colleagues and students at
the Institute AIFB and the FZI, University of Karlsruhe. We would like to thank the anonymous reviewers for
their useful comments. The research for this paper was partially financed by BMBF in the project SemIPort
(08C5939).

References

[1] N. Stojanovic, A. Maedche, S. Staab, R. Studer, Y. Sure, SEAL - A Framework for Developing SEmantic PortALs, ACM
K-CAP 2001, October, Vancouver, 2001.
[2] N. Guarino, P. Giaretta, Ontologies and knowledge bases: towards a terminological clarification, in: N. Mars (Ed.), Towards
Very Large Knowledge Bases: Knowledge Building and Knowledge Sharing, IOS Press, Amsterdam, 1995, pp. 25-32.
[3] N. Guarino, C. Masolo, G. Vetere, OntoSeek: content-based access to the web, IEEE Intell. Syst. 14 (3) (1999) 70-80.
[4] T. Saracevic, Relevance: a review of and a framework for the thinking on the notion in information science, J. Amer. Soc. Inform.
Sci. 26 (6) (1975) 321-343.
[5] U. Shah, T. Finin, A. Joshi, S. Cost, J. Mayfield, Information Retrieval on the Semantic Web, ACM Conference on Information
and Knowledge Management CIKM'02, McLean, USA, 2002.
[6] N. Stojanovic, An approach for using query ambiguity for query refinement: the Librarian Agent Approach, 22nd International
Conference on Conceptual Modeling (ER 2003), Chicago, IL, USA, Springer, Berlin, 2003.
[7] N. Stojanovic, On the query refinement in the ontology-based searching for information, International Conference on
Advanced Information Systems Engineering CAiSE'03, Springer, Berlin, 2003.
[8] B. Ganter, R. Wille, Formal Concept Analysis: Mathematical Foundations, Springer, Berlin, 1999.
[9] N. Stojanovic, Information-need driven query refinement, The 2003 IEEE/WIC Conference on Web Intelligence (WI 2003),
Halifax, Canada, IEEE Press, New York, 2003.
[10] R. Baeza-Yates, B. Ribeiro-Neto, Modern Information Retrieval, Addison-Wesley-Longman Publishing Co., Reading, MA,
1999.
[11] K.S. Trivedi, Probability and Statistics with Reliability, Queuing and Computer Science Applications, Prentice-Hall, Englewood
Cliffs, NJ, 1982.
[12] J. Xu, W.B. Croft, Improving the effectiveness of information retrieval with local context analysis, ACM Trans. Inform. Syst. 18
(1) (2000) 79-112.
[13] C. Carpineto, G. Romano, Effective reformulation of Boolean queries with concept lattices, Flexible Query Answering Systems
FQAS'98, Springer, Berlin, 1998, pp. 277-291.
[14] M. Rila, The Use of WordNet in information retrieval, Annual Meeting of the Association for Computational Linguistics (ACL)
Workshop on the Usage of WordNet in Natural Language Processing Systems, 1998, pp. 31-37.
[15] E. Voorhees, Query expansion using lexical-semantic relations, Proceedings of the 17th ACM/SIGIR, Dublin, 1994.
[16] J. Ponte, W.B. Croft, A language modeling approach to information retrieval, Proceedings of the 21st ACM/SIGIR'98, 1998,
pp. 275-281.
[17] S. Cronen-Townsend, W.B. Croft, Quantifying Query Ambiguity, Human Language Technologies HLT 2002, pp. 94-98.
[18] U. Chakravarthy, J. Grant, J. Minker, Logic-based approach to semantic query optimization, ACM Trans. Database Syst. 15 (2)
(1990) 162-207.
[19] P.D. Bruza, S. Dennis, Query Reformulation on the Internet: Empirical Data and the Hyperindex Search Engine, Computer-Assisted
Information Searching on Internet RIAO'97, Montreal, 1997.
[20] J.-R. Wen, J.-Y. Nie, H.-J. Zhang, Clustering User Queries of a Search Engine, World Wide Web Conference WWW10, ACM,
May, Hong Kong, 2001.
[21] M. Balabanovic, Y. Shoham, Content based collaborative recommendation, Commun. of the ACM CACM 40 (3) (1997) 66-72.
[22] T. Joachims, D. Freitag, T. Mitchell, WebWatcher: a tour guide for the World Wide Web, International Joint Conference on
Artificial Intelligence IJCAI-97, 1997.
[23] P. Becker, P. Eklund, Prospects for document retrieval using formal concept analysis, Proceedings of the Sixth Australasian
Document Computing Symposium, Coffs Harbour, Australia, December, 2001.
[24] P. Bruza, T. van der Weide, Stratified hypermedia structures for information disclosure, Comput. J. 35 (3) (1992) 208-220.
[25] F. Grootjen, Employing semantical issues in syntactical navigation, Proceedings of the BCS-IRSG 2000 Colloquium on IR
Research, 2000.
[26] T. Gaasterland, P. Godfrey, J. Minker, An overview of cooperative answering, J. Intell. Inform. Syst. 1 (2) (1992) 123-157.
