
The INQUERY Retrieval System

James P. Callan, W. Bruce Croft, and Stephen M. Harding


Department of Computer Science
University of Massachusetts
Amherst, Massachusetts 01003, USA
croft@cs.umass.edu
Abstract

As larger and more heterogeneous text databases become available, information retrieval research will depend on the development of powerful, efficient and flexible retrieval engines. In this paper, we describe a retrieval system (INQUERY) that is based on a probabilistic retrieval model and provides support for sophisticated indexing and complex query formulation. INQUERY has been used successfully with databases containing nearly 400,000 documents.

1 Introduction
The increasing interest in sophisticated information retrieval (IR) techniques has led to a number of large text databases becoming available for research. The size of these databases, both in terms of the number of documents in them and the length of the documents, which are typically full text, has presented significant challenges to IR researchers who are used to experimenting with two or three thousand document abstracts. In order to carry out research with different types of text representations, retrieval models, learning techniques, and interfaces, a new generation of powerful, flexible, and efficient retrieval engines needs to be implemented. At the Information Retrieval Laboratory at the University of Massachusetts, we have been developing such a system for the past two years. The INQUERY system is based on a form of probabilistic retrieval model called the inference net. This model is powerful in the sense that it can represent many approaches to IR and combine them in a single framework [8]. It also provides the ability to specify complex representations of information needs and compare them to document representations.

In this paper, we focus on the architecture and implementation of the INQUERY system, which has been designed for experiments with large databases. We start by giving a brief description of the underlying inference net model in the next section. We then present an overview of INQUERY's architecture, followed by more detailed descriptions of the important system components. Throughout this description, we will give timing figures from recent experiences with a 1 Gigabyte database that contains nearly 400,000 documents varying in length from short abstracts to 150 page reports. We conclude by discussing current research and development issues.

2 The Inference Network Model

Bayesian inference networks are probabilistic models of evidential reasoning that have become widely used in recent years [1; 6]. A Bayesian inference network, or Bayes net, is a directed acyclic graph (DAG) in which nodes represent propositional variables and arcs represent dependencies. A node's value is a function of the values of the nodes it depends upon. Leaf nodes typically represent propositions whose values can be determined by observation. Other nodes typically represent propositions whose values must be determined by inference.
The notable feature of Bayes nets is that dependencies are not necessarily absolute. Certainty or probability can be represented by weights on arcs.

INQUERY is based upon a type of Bayes net called a document retrieval inference network [9; 8]. A document retrieval inference network, or inference net, consists of two component networks: one for documents and one for queries (see Figure 1). Nodes in an inference net are either true or false. Values assigned to arcs range from 0 to 1, and are interpreted as belief.

2.1 The Document Network

A document network can represent a set of documents with different representation techniques and at varying levels of abstraction. Figure 1 shows a simple document network with two levels of abstraction: the document text level d and the content representation level r. Additional levels of abstraction are possible, for example audio or video representations, but are not currently needed by INQUERY.

A document node d_i represents the proposition that a document satisfies a user query. Document nodes are assigned the value true. The value on an arc between a document text node d_i and a content representation node r_k is the conditional probability P(r_k | d_i). A document's prior probability P(d_i) is 1/(number of documents).

A content representation node r_k represents the proposition that a concept has been observed. The node may be either true or false. The value on an arc between a content representation node r_k and a query concept node c_l is the belief in the proposition.

INQUERY uses several types of content representation nodes. The simplest corresponds to a single word of the document text, while more complex concepts include numbers, dates, and company names. Section 4 describes in more detail the types of content representation nodes created, and the methods used to create them.

2.2 The Query Network

The query network represents a need for information. Figure 1 shows the network for a query with two levels of abstraction: the query level q and the concept level c. Additional levels of abstraction are possible, but are not currently needed by INQUERY.

Query nodes represent the proposition that an information need is met. Query nodes are always true. Concept nodes represent the proposition that a concept is observed in a document. Concept nodes may be either true or false.

The query network is attached to the document network by arcs between concept nodes and content representation nodes. The mapping is not always one-to-one, because concept nodes may define concepts not explicitly represented in the document network. For example, INQUERY's phrase operator can be used to define a concept that is not represented explicitly in the document network. The ability to specify query concepts at run-time is one of the characteristics that distinguishes intelligent information retrieval from database retrieval.

2.3 The Link Matrix

Document retrieval inference networks, like the Bayes networks from which they were derived, enable one to specify arbitrarily complex functions to compute the belief in a proposition given the beliefs in its parent nodes. These functions are sometimes called link matrices. If the belief for each combination of evidence were specified directly, a link matrix for a node with n parents would be of size 2 x 2^n. This problem can be avoided by restricting the ways in which evidence is combined. INQUERY uses a small set of operators, described in Section 6, for which closed-form expressions can be found.
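To make the cost argument concrete, the following sketch (ours, in Python; INQUERY itself is implemented differently) evaluates an AND node two ways: by enumerating a full link matrix over all 2^n parent configurations, and by the closed-form product that INQUERY's #and operator uses (Equation 3 in Section 6). Both give the same belief, but the closed form is O(n) rather than O(2^n).

    from itertools import product

    def and_via_link_matrix(parent_beliefs):
        """Exhaustive evaluation: P(Q) = sum over all parent configurations
        of P(Q | config) * P(config). For AND, P(Q | config) is 1 only in
        the single row where every parent is true."""
        total = 0.0
        for config in product([False, True], repeat=len(parent_beliefs)):
            p_config = 1.0
            for belief, value in zip(parent_beliefs, config):
                p_config *= belief if value else 1.0 - belief
            if all(config):              # the only row with P(Q | config) = 1
                total += p_config
        return total

    def and_closed_form(parent_beliefs):
        """Closed-form equivalent: p_1 * p_2 * ... * p_n, in O(n)."""
        result = 1.0
        for p in parent_beliefs:
            result *= p
        return result

    beliefs = [0.8, 0.9, 0.7]
    assert abs(and_via_link_matrix(beliefs) - and_closed_form(beliefs)) < 1e-12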
Figure 1: A simple document retrieval inference network. (Figure not reproduced: a DAG whose layers are document nodes d_1 ... d_j, content representation nodes r_1 ... r_k, query concept nodes c_1 ... c_m, and the query node q.)

3 Overview of the Architecture

The major tasks performed by the INQUERY system are creation of the document network, creation of the query network, and use of the networks to retrieve documents. The document network is created automatically by mapping documents onto content representation nodes, and storing the nodes in an inverted file for efficient retrieval. Query networks are specified by a user through a user interface. Document retrieval is performed by using recursive inference to propagate belief values through the inference net, and then retrieving the documents that are ranked highest. Figure 2 shows the major components of the INQUERY system, and how information flows between them. The following sections discuss each component in more detail.

4 The Parsing Subsystem

The first task in building a document network is to map each document onto a set of content representation nodes. This mapping process is referred to as parsing the document, and consists of five components: lexical analysis, syntactic analysis, concept identification, dictionary storage, and transaction generation. It is important that each of these components be efficient, because construction of the document network is one of the most time-consuming parts of building and using inference nets. The current set of INQUERY parsers, without high-level concept recognition, requires 19.8 CPU hours on a Sun SPARCserver 490 with 128 MBytes of memory to parse a 1 GByte document collection. The following subsections describe how each of the parsing components is implemented.

4.1 Lexical and Syntactic Analysis

There are three distinct uses of lexical analysis in INQUERY. The parser's lexical analyzer provides lexical tokens (usually words or field markers) to the syntactic analyzer. The database builder stores the document text in a database for use by the user interface. Concept analyzers identify higher-level concepts, for example dates and names, that occur in the text. The activities of these lexical analyzers are loosely coordinated by a lexical analysis manager.

One reason that it is desirable to have so many lexical analyzers is that INQUERY currently contains parsers for six different document formats. The burden of supporting many document formats is minimized by keeping the database builder and concept analyzers ignorant of the document format. The lexical analysis manager enforces this ignorance by controlling access to the input stream. The manager reads large blocks of text into an internal buffer, from which the lexical analyzers read. When a new document is encountered, the parser's analyzer is given exclusive access to the document, as shown in Figure 3a. The parser's analyzer is responsible for converting into canonical format all field markers found in the document. When the parser's analyzer reaches the end of the document, the other analyzers are given access to the document, as shown in Figure 3b.
Figure 2: The architecture of the INQUERY information retrieval system. (Figure not reproduced: documents flow into the document parser, which feeds the document database, the concept dictionary, and transaction files; file inversion produces the inverted file; queries flow from the user interface through query text processing to the retrieval subsystem, which returns document rankings and relevant documents.)

The parser's analyzer has two important duties besides converting the document to canonical format. It is responsible for providing tokens, usually words, numbers, or field markers, to the syntactic analyzer. It is also responsible for converting words to lower case, discarding user-specified stop words (e.g. 'a', 'and', 'the'), and optionally removing word endings (e.g. '-ed', '-ing') before notifying the transaction manager (discussed below) about the occurrence of each word.

The principal use of syntactic analysis in INQUERY is to ensure that a document is in the expected format, and to provide error recovery if it is not. All of INQUERY's syntactic analyzers are created by YACC [3].
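A minimal sketch (ours, not INQUERY's code) of the token normalization just described: lowercasing, stop word removal, and optional removal of word endings. The stop list and suffix rule here are illustrative stand-ins; INQUERY's stop list is user-specified and its stemming is more careful.

    import re

    STOP_WORDS = {"a", "and", "the"}     # sample entries only
    SUFFIXES = ("ing", "ed")             # crude stand-in for real stemming

    def tokens(text, strip_endings=False):
        """Yield normalized word tokens in document order."""
        for match in re.finditer(r"[A-Za-z]+", text):
            word = match.group().lower()             # convert to lower case
            if word in STOP_WORDS:                   # discard stop words
                continue
            if strip_endings:                        # optionally strip endings
                for suffix in SUFFIXES:
                    if word.endswith(suffix) and len(word) > len(suffix) + 2:
                        word = word[: -len(suffix)]
                        break
            yield word

    print(list(tokens("The parsing and indexing of the text", strip_endings=True)))
    # ['pars', 'index', 'of', 'text']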
4.2 Concept Recognizers

INQUERY is currently capable of recognizing and transforming into canonical format four types of concepts: numbers, dates, person names and company names. INQUERY also contains a concept recognizer to recognize and record the locations of sentence and paragraph boundaries. Concept recognizers tend to be complex [5; 7], so it is desirable to implement them as efficiently as possible. All of INQUERY's concept recognizers are currently finite state automata created by LEX [4]. In principle, it is possible to combine the recognizers into a single finite state automaton; however, LEX cannot create automata of the required size.

The number and date recognizers use grammars similar to Mauldin's [5]. The major difference is INQUERY's use of string arithmetic to avoid roundoff errors in the number recognizer. The recognizers map different expressions of a concept (e.g. 1 million, or 1000000, or 1,000,000) into a canonical format.

The company name recognizer is similar to, but less sophisticated than, Rau's [7]. It looks for strings of capitalized words that end with one of the legal identifiers that often accompany company names (e.g. "Co", "Inc", "Ltd", or "SpA"). If the company name occurs once with a legal identifier, the recognizer can usually recognize all other occurrences of the name in the document. This strategy performs reasonably well on our test collections.

The person name recognizer uses a strategy similar to the company name recognizer, except that it looks for occupation titles and honorific titles. This strategy performs poorly on our test collections. We are contemplating replacing the current algorithm with one that relies more heavily on a large database of known names.

The sentence and paragraph boundary recognizer is currently only able to recognize boundaries that are explicitly tagged with field markers. The locations of these boundaries are saved in a file, for use in a planned project on paragraph- and sentence-level retrieval from large document collections.
Figure 3: (a) The parser's lexical analyzer converts field markers to canonical form and provides tokens to the syntactic analyzer. (b) When the parser's analyzer reaches the end of the document, the database builder and concept recognizers are allowed to read the document, one at a time.

In principle, there is no limit to the number and complexity of concept recognizers that can be added to INQUERY. For example, we are investigating the use of stochastic tagging [2] to automatically identify phrases. The main consequence of additional concept recognizers is the overhead that they add to the parsing process. The current set of recognizers slows parsing by about 25%.
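As a rough illustration of the surface pattern the company name recognizer looks for, here is a regular-expression sketch (ours; the real recognizers are LEX-generated finite state automata and considerably more careful):

    import re

    # Capitalized words ending with a legal identifier such as Co/Inc/Ltd/SpA.
    COMPANY = re.compile(r"(?:[A-Z][A-Za-z]*\s+)+(?:Co|Inc|Ltd|SpA)\b")

    text = "Shares of Acme Widget Co fell, but Acme Widget later recovered."
    full_name = COMPANY.search(text).group()         # 'Acme Widget Co'

    # Once the name is seen with its identifier, later occurrences of the
    # bare name can be recognized as well:
    bare_name = full_name.rsplit(None, 1)[0]         # 'Acme Widget'
    print(full_name, "/", text.count(bare_name))     # Acme Widget Co / 2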
4.3 Concept Storage

The lexical analyzers are designed to work efficiently with strings of characters, but the rest of INQUERY is not. When a decision is made to index a document by a word or higher-level concept, the string of characters is replaced with its entry number in a term dictionary. A reference to an entry number takes less space and can be manipulated much more efficiently than a reference to a string of characters. If the word already exists, the number of the existing entry is returned; otherwise a new entry is created.

INQUERY originally stored its dictionary in a B-tree data structure. However, performance analysis showed the dictionary to be a bottleneck. The current version of INQUERY stores its dictionary in a hash table. This change alone reduced the time required to parse a 339 MByte document collection from 16.8 CPU hours to 8.2 CPU hours.
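A minimal sketch of the term dictionary's interface, assuming a hash table keyed on the term string (the names are ours, not INQUERY's):

    class TermDictionary:
        """Map term strings to small integer entry numbers."""

        def __init__(self):
            self._entries = {}      # term string -> entry number (hash table)

        def entry_number(self, term):
            """Return the existing entry number, or create a new one."""
            number = self._entries.get(term)
            if number is None:
                number = len(self._entries)
                self._entries[term] = number
            return number

    terms = TermDictionary()
    print(terms.entry_number("retrieval"))   # 0 (new entry)
    print(terms.entry_number("network"))     # 1 (new entry)
    print(terms.entry_number("retrieval"))   # 0 (existing entry)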
4.4 Transaction Generation

Each time a term is identified, whether by the parser's lexical analyzer or a concept recognizer, its location is reported to the transaction manager. When the end of the document is reached, the transaction manager writes to disk a set of indexing transactions that record, for each term, the frequency and locations of its occurrence in that document.

Transactions are currently stored in text files using a suboptimal encoding method. Our experiments with large collections have produced more transactions than fit on one of our disks. The transaction manager copes with this problem by creating a new transaction file each time an INQUERY document parser is invoked. (One invocation of a parser may parse many documents.) The periodic creation of new transaction files enables us to scatter them across several disks.
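The paper does not specify the transaction encoding beyond the fields it records, so the following record layout is an assumption (ours) that captures those fields: the term, the document, and the frequency and locations of the term's occurrences in that document.

    from dataclasses import dataclass
    from typing import List

    @dataclass
    class Transaction:
        """One term's occurrences within one document."""
        term_id: int           # entry number from the term dictionary
        doc_id: int
        frequency: int
        locations: List[int]   # word positions within the document

        def to_line(self):
            """One transaction per text-file line (illustrative encoding)."""
            positions = ",".join(str(p) for p in self.locations)
            return f"{self.term_id} {self.doc_id} {self.frequency} {positions}"

    t = Transaction(term_id=42, doc_id=7, frequency=3, locations=[5, 19, 60])
    print(t.to_line())         # '42 7 3 5,19,60'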
5 File Inversion

Each transaction represents a link between a document node and a content representation node. The entire document network is represented by the set of transaction files produced during parsing. The task addressed after parsing is to organize the network so that evidence may be propagated through it rapidly and efficiently.

The value of an internal network node is a function of the values assigned to its parents. INQUERY avoids instantiating the entire document network during retrieval by using recursive inference to determine a node's value. The speed of recursive inference depends upon how fast information about a node and its links can be obtained. INQUERY provides fast access to this information by storing it as an inverted file in a B-tree data structure.

The inverted file is constructed most efficiently if the transactions for a term are processed together. Therefore the transaction files are sorted before the inverted file is constructed. The sorting procedure involves several steps, both for efficiency and because transactions may be stored in multiple files that do not all fit on one disk.

We begin by using the UNIX sort program to sort each transaction file by term and document identifiers. If the transactions all fit on a single disk, we merge-sort the sorted transaction files. Otherwise we partition the sorted transaction files at some term (e.g. the 10,000th) and merge-sort partitions covering the same ranges of terms.

Sorting is one of the most time-consuming tasks in building a document network. In tests with a 1 GByte document collection, sorting, partitioning and merge-sorting 1.3 GBytes of transactions required 13.6 CPU hours on the Sun SPARCserver 490.

After the transactions are sorted, the inverted file can be constructed in O(n) time. The keys to the inverted file are term ids. The records in the inverted file store the term's collection frequency, the number of documents in which the term occurs, and the transactions in which the term occurs. The inverted file is stored in binary format, which makes it smaller than the transaction files from which it is assembled. The 1.3 GBytes of transactions referred to above were converted to an 880 MByte inverted file in 2.5 CPU hours.
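With the transactions sorted by term and document identifier, the inverted file can be built in a single linear pass that groups each term's transactions. A sketch under those assumptions, reusing the Transaction record from Section 4.4 (the in-memory layout is ours):

    from itertools import groupby

    def build_inverted_file(sorted_transactions):
        """sorted_transactions: Transactions sorted by (term_id, doc_id).
        Returns {term_id: (collection_freq, doc_freq, postings)} in O(n)."""
        inverted = {}
        for term_id, group in groupby(sorted_transactions,
                                      key=lambda t: t.term_id):
            postings = [(t.doc_id, t.frequency, t.locations) for t in group]
            collection_freq = sum(f for _, f, _ in postings)
            doc_freq = len(postings)   # number of documents containing term
            inverted[term_id] = (collection_freq, doc_freq, postings)
        return inverted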
6 The Retrieval Subsystem

The retrieval subsystem converts query text into a query network, and then evaluates the query network in the context of the previously constructed document network.

6.1 Building a Query Network

Queries can be made to INQUERY by using either natural language or a structured query language. Natural language queries are converted to the structured query language by applying the #sum operator to the terms in the query. Table 1 describes #sum and the other operators in INQUERY's query language. Query operators permit the user to provide structural information in the query, including phrase and proximity requirements. Query text is converted to lower case, possibly checked for stopwords or stemmed to canonical word form, and compared to the concept dictionary before being converted into a query net. Query net nodes correspond to structured language operators and query terms. The information contained in a node varies, depending on its type. The attachment of the query net to the pre-existing document net occurs at the term nodes.
document network. ( 1 1+ 2 2+ + )
belwsum( Q ) =( 1 + 2 + + ) (5)
w p

w
w p

w
:::

:::
wn pn wq

wn

6.1 Building a Query Network belsum( ) = ( 1 + 2 + + )


Q
p p
(6) ::: pn

Queries can be made to INQUERY by using ei-


n

ther natural language or a structured query lan- The basic structures from which all compu-
guage. Natural language queries are converted tations of document node belief are derived are
to the structured query language by applying the proximity lists and belief lists. A proximity list
#sum operator to the terms in the query. Table contains statistical and proximity (term position)
1 describes #sum and the other operators in IN- information by document on a term speci c basis.
QUERY's query language. Query operators per- The belief list is a list of documents and associated
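The following is a direct transcription of Equations 1-6 (a Python sketch for clarity; INQUERY's evaluation routines are not written this way):

    def bel_not(p):                    # Equation 1
        return 1.0 - p[0]

    def bel_or(p):                     # Equation 2
        result = 1.0
        for p_i in p:
            result *= 1.0 - p_i
        return 1.0 - result

    def bel_and(p):                    # Equation 3
        result = 1.0
        for p_i in p:
            result *= p_i
        return result

    def bel_max(p):                    # Equation 4
        return max(p)

    def bel_wsum(p, w, w_q=1.0):       # Equation 5
        return w_q * sum(w_i * p_i for w_i, p_i in zip(w, p)) / sum(w)

    def bel_sum(p):                    # Equation 6
        return sum(p) / len(p)

    beliefs = [0.8, 0.5]
    print(bel_and(beliefs), bel_or(beliefs), bel_sum(beliefs))   # 0.4 0.9 0.65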
OPERATOR  ACTION
#and      AND the terms in the scope of the operator.
#or       OR the terms in the scope of the operator.
#not      NEGATE the term in the scope of the operator.
#sum      Value is the mean of the beliefs in the arguments.
#wsum     Value is the sum of weighted beliefs in the arguments, scaled by the sum
          of the weights. An additional scale factor may be supplied by the user.
#max      The belief is the maximum of the beliefs in the arguments.
#n        A match occurs whenever all of the arguments are found, in order, with
          no more than n words separating adjacent arguments. For example,
          #3 (A B) matches "A B", "A c B" and "A c c B".
#phrase   Value is a function of the beliefs returned by the #3 and #sum
          operators. The intent is to rely upon full phrase occurrences when they
          are present, and to rely upon individual words when full phrases are
          rare or absent.
#syn      The argument terms are to be considered synonymous.

Table 1: The operators in INQUERY's query language.

The basic structures from which all computations of document node belief are derived are proximity lists and belief lists. A proximity list contains statistical and proximity (term position) information, by document, on a term-specific basis. The belief list is a list of documents and associated belief values at a given node, as well as default beliefs and weights used when combining belief lists from different nodes. The belief list will contain the cumulative probability of a document's relevance to the query given the values of the parents. Belief lists may be computed from proximity lists, but the reverse derivation is not possible. This limitation imposes some restrictions on query form: the query form must not produce a proximity-list node that is acted upon by a routine expecting a belief list. Proximity lists are transformed into belief values using the information in the list, and combined using weighting or scoring functions.
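To make the proximity list idea concrete, here is a sketch (ours) of the #n operator of Table 1 evaluated over per-term position lists for one document, interpreting the table's examples as requiring each adjacent argument to appear at most n word positions after the previous one:

    def prox_match(positions_per_arg, n):
        """positions_per_arg: one sorted list of word positions per argument.
        True if all arguments occur in order, adjacent ones at most n apart."""
        def extend(prev_pos, remaining):
            if not remaining:
                return True
            return any(extend(pos, remaining[1:])
                       for pos in remaining[0]
                       if prev_pos < pos <= prev_pos + n)

        first, rest = positions_per_arg[0], positions_per_arg[1:]
        return any(extend(pos, rest) for pos in first)

    # "#3 (A B)": A at position 10, B at position 12 ("A c B") matches.
    print(prox_match([[10], [12]], n=3))   # True
    print(prox_match([[10], [15]], n=3))   # False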
Node belief scores are calculated as a combination of term frequency (tf) and inverse document frequency (idf) weights. The values are normalized to remain between 0 and 1, and are further modified by tf and belief default values which the user may define at program invocation.

Calculation of a belief for a given node depends on the type of node and on the number of, and belief in, its parents, as presented in Equations 1-6. The probability combinations are achieved via belief list merges and negation.
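The paper does not give the scoring formula itself, so the following is only a plausible shape for a tf-idf belief of the kind described: both components normalized to [0, 1] and lifted by a default belief that the user may override. The constants and normalizations are our assumptions, not INQUERY's published formula.

    import math

    def term_belief(tf, max_tf, df, num_docs, default_belief=0.4):
        """Illustrative tf-idf belief in [0, 1]."""
        ntf = tf / max_tf                  # tf normalized by the document's
                                           # maximum term frequency
        nidf = math.log(num_docs / df) / math.log(num_docs)
        return default_belief + (1.0 - default_belief) * ntf * nidf

    print(round(term_belief(tf=3, max_tf=10, df=50, num_docs=400000), 3))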
6.3 Retrieval Performance

Typical query processing time is 3 to 60 seconds on a 1 GByte document collection. Processing time varies according to query complexity, the number of terms in the query, and their frequency in the collection. Terms with high collection frequencies are likely to add to processing time due to the length of the associated proximity lists. Retrieval performance is much improved over Boolean and conventional probabilistic retrieval; the reader is referred to [9] for details.
7 Interfaces

INQUERY offers batch and interactive methods of query processing, and an application programmers interface (API) to support development of customized front-ends to the retrieval engine. Each of these interfaces is discussed below.

7.1 Application Programmers Interface

The INQUERY application programmers interface (API) is a set of routines that allow programmers to develop interfaces of their own to the INQUERY retrieval engine. The API functions open and close INQUERY databases and files, convert query text into query nets, evaluate query nets, and retrieve documents.
7.2 Batch Interface

The batch program takes command line arguments in the form of input file names and switches. The output of the program is a ranked list of documents by weight (the calculated probability of relevance) in a file format readable by an evaluation program, which can produce standard recall-precision tables on retrieval performance. A file of relevance judgments for the submitted queries is required as input for the batch program. This arrangement allows queries to be run repeatedly, so that changes to the system may be evaluated.

7.3 User Interface

The interactive user interface supports queries in natural language or structured form, and was produced using routines from the API. Query results are displayed on the screen in the form of a ranked document list. The user may browse through the retrieved documents to determine their relevance to the query. A file containing the session results may also be produced.
8 Current Status

The INQUERY system has been tested on both standard information retrieval collections [9; 8] and a heterogeneous 1 GByte collection. We continue to conduct research on intelligent information retrieval with the INQUERY system, and encourage others to do so. INQUERY version 1.3, described in this paper, is distributed by a technology transfer agency of the University of Massachusetts for a nominal fee.

Current work on INQUERY addresses both software engineering and research issues. One recent improvement was the addition of encoding methods to reduce the sizes of both the inverted file and the user-interface indices. The inverted file index has been reduced to 40% of its previous size, while the user-interface index has been reduced to 5% of its previous size. This improvement will enable us to install a 2 GByte document collection on our current hardware during the summer of 1992.

We are also studying the use of relevance feedback in INQUERY. Relevance feedback enables a user to identify those retrieved documents that are most relevant to the user's information need. The system then analyzes those documents, produces a revised query based upon the analysis, and retrieves a new set of documents.

A colleague is developing a Japanese version of INQUERY, called JINQUERY. The only differences between INQUERY and JINQUERY are the lexical and syntactic analyzers, and the user interface. Japanese documents are particularly challenging because word boundaries are implicit. JINQUERY currently indexes documents with Kanji characters and Katakana words. A segmenter that divides a stream of Kanji characters into words is being tested.

Finally, research is underway to provide better support for queries expressed in natural language. INQUERY and JINQUERY currently handle natural language by summing the beliefs contributed by the individual query words. We believe that improvements can be made by automatically identifying phrases, incorporating words from thesauruses, running the concept recognizers on queries, and performing other types of morphological processing.

Acknowledgements

This research was supported in part by the Air Force Office of Scientific Research under contract AFOSR-91-0324.

References

[1] Eugene Charniak. Bayesian networks without tears. AI Magazine, 12(4), Winter 1991.

[2] Kenneth Church. A stochastic parts program and noun phrase parser for unrestricted text. In Proceedings of the 2nd Conference on Applied Natural Language Processing, pages 136-143, 1988.

[3] Stephen C. Johnson. Yacc: Yet another compiler compiler. In UNIX Programmer's Manual. Bell Telephone Laboratories, Inc., Murray Hill, NJ, 1979.

[4] M. E. Lesk and E. Schmidt. Lex - a lexical analyzer generator. In UNIX Programmer's Manual. Bell Telephone Laboratories, Inc., Murray Hill, NJ, 1979.

[5] Michael Loren Mauldin. Information retrieval by text skimming. PhD thesis, School of Computer Science, Carnegie Mellon University, Pittsburgh, PA, 1989.

[6] Judea Pearl. Probabilistic reasoning in intelligent systems: Networks of plausible inference. Morgan Kaufmann, San Mateo, CA, 1988.
[7] Lisa F. Rau. Extracting company names from text. In Proceedings of the Sixth IEEE Conference on Artificial Intelligence Applications, 1991.

[8] Howard Turtle and W. Bruce Croft. Evaluation of an inference network-based retrieval model. ACM Transactions on Information Systems, 9(3), July 1991.

[9] Howard R. Turtle and W. Bruce Croft. Efficient probabilistic inference for text retrieval. In RIAO '91 Conference Proceedings, pages 644-661, Barcelona, Spain, April 1991.