Using a Relational Database for an Inverted Text Index
Steve Putz Xerox Palo Alto Research Center
SSL-91-20
[P91-00158]
Copyright January 1991 Xerox Corporation. All rights reserved.
System Sciences Laboratory Palo Alto Research Center 3333 Coyote Hill Road Palo Alto, California 94304

Abstract
Inverted indices for free-text search are commonly stored and updated using B-Trees. This paper shows how to efficiently maintain a dynamic inverted index for rapidly evolving text document sets using a suitable SQL-based relational database management system. Using a relational system provides performance and reliability features such as efficient index access and maintenance, caching, multi-user transactions, access control, back-up and error recovery.
Table of Contents
1.0 Introduction
2.0 Inverted Index Storage Optimizations
3.0 Relational System Requirements
4.0 Structure of the Relational Database Tables
    4.1 Structure of the Postings Table
    4.2 Specifying the SQL Table and Index
    4.3 Using Separate Word and Postings Tables
5.0 Using the Inverted Index
    5.1 Retrieval of Postings
    5.2 Retrieval from Separate Word and Postings Tables
    5.3 Retrieval with a Limited Document Identifier Range
6.0 Insertion of Postings
    6.1 Update Optimizations
    6.2 Appending to Existing Postings Lists
    6.3 Removal of Postings
7.0 Sybase Implementation Statistics
    7.1 Space Used
    7.2 Indexing Speed
    7.3 Search Speed
8.0 Conclusion
9.0 References
1.0 Introduction
Efficient implementations of free-text search algorithms often require an inverted index. An inverted index is a data structure that maps a search key (e.g. a word) to a postings list enumerating the documents containing the key. It is often useful for the postings list to also include the locations of each word occurrence within each document. An index of this type allows efficient implementation of boolean, extended boolean, proximity and relevance search algorithms [5].
Because it allows efficient insertion, lookup and deletion, the file-based B-Tree data structure is useful for implementing a large dynamically updated inverted index [1]. Cutting and Pedersen have developed several space and time optimizations for efficiently maintaining an inverted index using a B-Tree and a heap file [2].
Unfortunately, good B-Tree packages are not generally available for most computer languages and operating systems, and implementation of efficient and reliable B-Tree software is difficult. The algorithms for B-Tree insertion and deletion can be complex and difficult to debug, especially for B-Trees with variable length records. Adding the capability for multi-user transactions and error recovery is even more difficult.
A relational database management system is a kind of general purpose database system that represents data as the contents of one or more tables. Relational systems typically use B-Trees to maintain indices on the contents of the database tables. The tables can be searched in a wide variety of ways using a general purpose query language such as SQL [6]. Several relational systems are commercially available on a wide variety of computer hardware and operating systems. Many relational systems have excellent performance and reliability, and they provide valuable features such as distributed database access, multi-user transactions, access control, back-up, and error recovery. Many have application programming interfaces that allow convenient integration of database access into computer programs.
Although relational systems can maintain indices on the contents of database tables, they are not directly suitable for free-text search of large document sets. Many relational systems cannot store large (or even small) text documents, and the relational systems that can store large documents do not provide for efficient searching of them. However, a full-text information retrieval system can be built using the database tables in a suitable relational system as a B-Tree-like database structure, eliminating the need for separate B-Tree software and gaining the performance, robustness and additional features provided by the relational system.
The methods described in this paper will not turn a relational system by itself into a text retrieval system. Rather, the relational system serves as an efficient and robust data storage and retrieval system upon which text indexing and retrieval application programs can be built. In situations where a good B-Tree package is available, its use (with the optimizations described in section 2.0) may be more appropriate than using a relational system. Most of the text indexing and search algorithms described in this paper can be applied to implementations using either a B-Tree package or a relational system.
2.0 Inverted Index Storage Optimizations

Cutting and Pedersen have developed several space and time optimizations for maintaining an inverted text index using a B-Tree and a heap file [2]. These optimizations can be adapted for implementation with a suitable relational system.
Cutting and Pedersen use a B-Tree to store a short postings list for each indexed word as shown in Figure 1. When a postings list becomes too large for the B-Tree (e.g. more than 16 postings), portions of it are pulsed to a separate heap file. The heap is a binary memory file with contiguous chunks allocated as necessary for the overflow postings lists. For very long postings lists (several hundred postings), heap chunks are linked together with pointers. Heap management software handles allocation and deallocation and keeps track of deallocated chunks for reuse.
Figure 1. An Inverted Index using a B-Tree and Heap
[Diagram labels: B-Tree (entries pairing a word with a postings list); Heap file (postings lists).]
Cutting and Pedersen use several techniques to minimize the space required for storing postings lists. A word's postings list is encoded as a list of integer document identifier and word frequency pairs, followed by a list of integer word positions for each document as shown in Figure 2.
Figure 2. Postings List Encoding Example
                                   document list                                positions list
                                   doc A id  doc A freq  doc B id  doc B freq   doc A pos  doc B pos  doc B pos  doc B pos
integer values:                    515       1           676       3            12         2          19         29
with delta encoding:               515       1           161       3            12         2          17         10
with variable length bytes (hex):  83, 04    01          A1, 01    03           0C         02         11         0A
Document identifiers (after the first) within a document list are delta encoded as a difference from the previous identifier. For example, the sequence of document identifiers (515, 676, 786, 881) becomes (515, 161, 110, 95) when delta encoded. Word positions within a document are also delta encoded.
The integer values are each byte encoded using a variable-length format such that small values require less storage space than large values as shown in Table 1. The high bit of each encoded byte indicates whether additional bytes follow. The low seven bits of each byte encode a portion of the integer with least significant bytes first.
Table 1. Space Required to Store Integer Values
min value   max value    bits encoded   bytes required
0           127          7              1
128         16383        14             2
16384       2097151      21             3
2097152     268435455    28             4
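The variable-length byte format described above can be sketched in Python as follows. This is only an illustration of the stated rules (seven data bits per byte, high bit set when more bytes follow, least significant group first); the function names encode_varint and decode_varint are invented here and are not taken from the original C programs.

```python
def encode_varint(value):
    """Encode a non-negative integer, least significant seven bits first."""
    if value < 0:
        raise ValueError("value must be non-negative")
    out = bytearray()
    while True:
        low7 = value & 0x7F
        value >>= 7
        if value:
            out.append(low7 | 0x80)  # high bit set: more bytes follow
        else:
            out.append(low7)         # high bit clear: last byte
            return bytes(out)


def decode_varint(data, offset=0):
    """Return (value, next_offset) for the integer starting at offset."""
    value = 0
    shift = 0
    while True:
        byte = data[offset]
        offset += 1
        value |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return value, offset
        shift += 7
```

Under these rules the identifier 515 encodes as the two bytes 83, 04, matching Figure 2, and the byte counts at the range boundaries agree with Table 1.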
Table 2 shows a sequence of document identifiers along with their delta encodings and variable length byte encodings. The byte values are shown in hexadecimal notation.
Table 2. Byte and Delta Encoded Document Identifiers
         identifier   encoding (hex)   delta id   encoding (hex)
doc A    515          83, 04           515        83, 04
doc B    676          A4, 05           161        A1, 01
doc C    786          92, 06           110        6E
doc D    881          F1, 06           95         5F
doc E    1150         FE, 08           269        8D, 02
doc F    1182         9E, 09           32         20
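As a worked check of Table 2, the following Python sketch delta encodes the six identifiers and compares the encoded sizes. The helper names are illustrative assumptions, and the byte encoder simply restates the format of Table 1.

```python
def encode_varint(value):
    """Variable-length byte encoding: seven bits per byte, low bits first."""
    out = bytearray()
    while True:
        low7 = value & 0x7F
        value >>= 7
        if value:
            out.append(low7 | 0x80)
        else:
            out.append(low7)
            return bytes(out)


def delta_encode(doc_ids):
    """Replace each identifier after the first by its difference from the previous one."""
    deltas = []
    previous = 0
    for doc_id in doc_ids:
        deltas.append(doc_id - previous)
        previous = doc_id
    return deltas


def delta_decode(deltas):
    """Rebuild the original identifiers by keeping a running sum."""
    doc_ids = []
    total = 0
    for delta in deltas:
        total += delta
        doc_ids.append(total)
    return doc_ids


DOC_IDS = [515, 676, 786, 881, 1150, 1182]
plain = b"".join(encode_varint(d) for d in DOC_IDS)
compact = b"".join(encode_varint(d) for d in delta_encode(DOC_IDS))
```

For these six identifiers the plain encoding occupies 12 bytes and the delta encoding 9 bytes. The saving grows as identifiers become larger while the gaps between them stay small.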
3.0 Relational System Requirements

Cutting and Pedersen's encoding for postings lists can be adapted for use with relational systems that have an applications programming interface and provide an efficient variable length binary data type capable of storing values up to approximately 250 bytes long.¹ The performance and space efficiency of the resulting text retrieval system depend on various properties of the relational system. For example, space efficiency will be very poor using a relational system that consumes a fixed amount of memory for every table value regardless of the value lengths.
The methods described in this paper have been implemented and tested using the SQL Server database management system available from Sybase Incorporated using application programs written in the C programming language.² In order to use a relational system to implement an efficient inverted text index, several algorithms must be implemented in application programs. The application programs are necessary to encode, decode and manipulate the postings lists stored in the relational system. The Sybase DB-Library is a set of programming language routines and macros that allow an application program to interact with the SQL Server [4].
The Sybase SQL Server supports Transact-SQL, an enhanced version of the SQL relational database language [3]. Transact-SQL provides a variable length character data type called varchar and a variable length binary data type called varbinary. These data types can store values from 1 to 255 bytes long. For a large index it is important that the storage space used by these data types reflects the actual lengths of the stored values, not the maximum length.
The amount of storage space required to store an inverted text index in a relational system using these methods can vary widely depending on the relational system used and other factors such as the choice of non-indexed drop words. In experiments with the Sybase SQL Server and a list of 72 drop words, the index size tends to be between 30% and 50% of the size of the original documents for document sets between 10 and 64 million bytes.
4.0 Structure of the Relational Database Tables

It is possible to store the postings lists of an inverted text index as rows in a relational database table rather than using a B-Tree and heap file as described in section 2.0. When an appropriate SQL index is created for the table, the SQL Server provides the same search and update capabilities as the B-Tree implementation (in fact the SQL Server creates a B-Tree to accomplish this).
Rather than allowing postings lists of arbitrary length or linking postings lists together with pointers, long postings lists are split into pieces no longer than 255 bytes and stored in adjacent database table rows. Since most words in a full text inverted index occur relatively infrequently, most postings lists are short and a single table row per word is sufficient to contain the postings for most words. Depending on the document set size, about 5% of the words will require two or more table rows to store all the postings. In a large document set, some words may require a hundred or more table rows for their postings.
¹ A character data type can be used if character values are allowed to contain any of the 8-bit character codes from 0 through 255. Although many relational systems provide separate data types for storing long character or binary data strings, the methods described in this paper do not require them. Note that special long data types provided by some relational systems may be inefficient when used to store small or medium size data strings.
² SYBASE, SQL Server, Transact-SQL, and DB-Library are trademarks of Sybase Inc.
4.1 Structure of the Postings Table

An inverted text index can be efficiently stored using a single relational database table with the record structure shown in Table 3. The table's word column contains an indexed word or other key. When a word's postings do not fit in a single postings block, multiple rows will contain the same word, each with a block value containing a portion of the complete postings.
Table 3. Columns for Combined Word and Postings Table
column name   data type   bytes   description
word          varchar     100     contains an indexed word
firstdoc      integer     4       contains the lowest document id referenced in the block
flags         tinyint     1       indicates the block type, length of doc list and/or sequence number
block         varbinary   255     contains encoded document and/or position postings for the word
The firstdoc column contains the lowest document identifier for which postings are contained in the corresponding block value. There are three types of block encodings, which are distinguished by the value stored in the flags column.
When 0 < flags < 128, the block value contains the complete encoded postings list as described in section 2.0. The flags value contains the number of document/frequency pairs in the document list. With this block type the positions list is never split; it is always contained entirely in the block. Figure 3 shows the column values for the example from section 2.0.
Figure 3. Postings Example Using a Single Table Row (0 < flags < 128)
word: apple    firstdoc: 515 (515 encodes as 83, 04)    flags: 2
block:
document list (hex):    doc A id   doc A freq   doc B id   doc B freq
                        83, 04     01           A1, 01     03
positions list (hex):   doc A pos  doc B pos    doc B pos  doc B pos
                        0C         02           11         0A
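The single-row case can be illustrated with a short Python sketch that builds the word, firstdoc, flags and block values for a small postings list. It is a minimal sketch under the encoding rules of section 2.0, not the author's C code; the names encode_postings and single_row are hypothetical, and the sketch simply refuses input that would need more than one row.

```python
def encode_varint(value):
    """Variable-length byte encoding: seven bits per byte, low bits first."""
    out = bytearray()
    while True:
        low7 = value & 0x7F
        value >>= 7
        if value:
            out.append(low7 | 0x80)
        else:
            out.append(low7)
            return bytes(out)


def encode_postings(postings):
    """Encode [(doc_id, [positions...]), ...] as (document list, positions list)."""
    doc_list = bytearray()
    pos_list = bytearray()
    prev_doc = 0
    for doc_id, positions in postings:
        doc_list += encode_varint(doc_id - prev_doc)   # delta encoded identifier
        doc_list += encode_varint(len(positions))      # word frequency
        prev_doc = doc_id
        prev_pos = 0                                   # position deltas restart per document
        for pos in positions:
            pos_list += encode_varint(pos - prev_pos)
            prev_pos = pos
    return bytes(doc_list), bytes(pos_list)


def single_row(word, postings, max_block=255):
    """Return (word, firstdoc, flags, block) for a list that fits in one row."""
    doc_list, pos_list = encode_postings(postings)
    block = doc_list + pos_list
    if not 0 < len(postings) < 128 or len(block) > max_block:
        raise ValueError("postings do not fit in a single combined row")
    return word, postings[0][0], len(postings), block
```

Calling single_row("apple", [(515, [12]), (676, [2, 19, 29])]) yields firstdoc 515, flags 2 and the ten block bytes shown in Figure 3.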
When flags = 0, the block contains only a document list, encoded as described in section 2.0. The number of identifier/frequency pairs in the document list can be inferred from the length of the block. Any row with flags = 0 must be immediately followed by a positions list row that has the same word and firstdoc values and a flags value of 128.
Figure 4. Postings Example Using Two Table Rows (flags = 0 and flags = 128)
word: box    firstdoc: 515 (515 encodes as 83, 04)    flags: 0
block, document list (hex):
doc A id   doc A freq   doc B id   doc B freq   doc C id   doc C freq   doc D id   doc D freq
83, 04     01           A1, 01     02           6E         02           5F         03
word: box    firstdoc: 515    flags: 128
block, positions list (hex):
doc A pos  doc B pos  doc B pos  doc C pos  doc C pos  doc D pos  doc D pos  doc D pos
1F         B1, 02     6B         42         07         91, 01     32         0E
When flags ≥ 128, the block contains only a positions list, encoded as described in section 2.0. A row with flags ≥ 128 is always associated with the document list from a previous row with flags = 0. Figure 4 shows the relationship between the document list encoded in one row and the positions list encoded in the following row.
When a document list becomes too long to encode within a single block, it is split and additional database rows are used. The corresponding positions list is also split. If they are small enough, the remaining portions of the document list and positions list can be combined into a new third row (with 0 < flags < 128). Otherwise two new rows are created, one with flags = 0 for the left-over document list and the other with flags = 128 for the corresponding positions list. For example, assuming the maximum block size is 12 bytes (rather than 255), adding new postings to the example of Figure 4 would require the creation of a new row as shown in Figure 5. The rows shown in Figure 4 would remain unchanged.
Figure 5. Added Table Row after a Document List Overflow
word: box    firstdoc: 1150 (1150 encodes as FE, 08)    flags: 2
block:
document list (hex):    doc E id   doc E freq   doc F id   doc F freq
                        FE, 08     02           20         01
positions list (hex):   doc E pos  doc E pos    doc F pos
                        55         62           11
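To make the three block types concrete, the sketch below decodes the rows of Figures 4 and 5 back into document identifiers and word positions. It is an illustration written for this text, not the original C code: the function names are hypothetical, and it assumes each flags = 0 row is followed by exactly one positions row, so continuation rows are not handled.

```python
def read_varint(data, offset):
    """Return (value, next_offset) for the variable-length integer at offset."""
    value = 0
    shift = 0
    while True:
        byte = data[offset]
        offset += 1
        value |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return value, offset
        shift += 7


def decode_doc_list(block, count=None):
    """Decode (doc_id, freq) pairs: `count` pairs, or to the end of the block."""
    docs, offset, doc_id = [], 0, 0
    while offset < len(block) and (count is None or len(docs) < count):
        delta, offset = read_varint(block, offset)
        freq, offset = read_varint(block, offset)
        doc_id += delta
        docs.append((doc_id, freq))
    return docs, offset


def decode_positions(docs, data):
    """Attach delta decoded positions to each (doc_id, freq) pair."""
    postings, offset = [], 0
    for doc_id, freq in docs:
        positions, pos = [], 0
        for _ in range(freq):
            delta, offset = read_varint(data, offset)
            pos += delta
            positions.append(pos)
        postings.append((doc_id, positions))
    return postings


def decode_rows(rows):
    """rows: (firstdoc, flags, block) tuples in index order for one word."""
    postings, pending = [], None
    for _firstdoc, flags, block in rows:
        if flags == 0:                       # document list only
            pending, _ = decode_doc_list(block)
        elif flags < 128:                    # combined row, flags = pair count
            docs, offset = decode_doc_list(block, flags)
            postings += decode_positions(docs, block[offset:])
        else:                                # positions for the pending list
            if pending is None:
                raise ValueError("continuation rows are not handled in this sketch")
            postings += decode_positions(pending, block)
            pending = None
    return postings


BOX_ROWS = [
    (515, 0, bytes.fromhex("830401A101026E025F03")),
    (515, 128, bytes.fromhex("1FB1026B42079101320E")),
    (1150, 2, bytes.fromhex("FE08022001556211")),
]
```

Decoding BOX_ROWS gives six documents (515, 676, 786, 881, 1150 and 1182), with the positions of each document restored from their deltas.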
When a positions list overflows, its block is split, but the corresponding document list block is not split. A document list may refer to postings in several consecutive table rows. These continuation rows are given a firstdoc value equal to the document identifier in effect where the split occurred in the positions list. The flags value of a continuation row is set to 128 unless the firstdoc value is the same as that of the previous row, in which case the flags value is set to the previous row's flags value plus one. This is necessary to guarantee unambiguous ordering of the continuation blocks. An example of continuation rows is given in Figure 6.
Figure 6. Postings Example Using Continuation Rows
word: crate    firstdoc: 515 (515 encodes as 83, 04)    flags: 0
block, document list (hex):
doc A id   doc A freq   doc B id   doc B freq   doc C id   doc C freq   doc D id   doc D freq
83, 04     01           A1, 01     02           6E         0C           5F         04
word: crate    firstdoc: 515    flags: 128
block, positions list (hex):
doc A pos  doc B pos  doc B pos  doc C pos  doc C pos  doc C pos
82, 01     72         9E, 01     BC, 01     D7, 02     93, 01
word: crate    firstdoc: 786 (doc C's id is 786)    flags: 128
block, positions list (hex):
doc C pos  doc C pos  doc C pos  doc C pos  doc C pos  doc C pos
E3, 02     86, 01     F3, 03     4D         89, 02     8A, 02
word: crate    firstdoc: 786    flags: 129
block, positions list (hex):
doc C pos  doc C pos  doc C pos  doc D pos  doc D pos  doc D pos  doc D pos
DD, 01     5E         6B         42         17         2F         36
4.2 Specifying the SQL Table and Index

In order to get efficient access to information in a large relational database table, the relational system must be instructed to create an index. The following SQL commands create an empty postings table called postings and build an appropriate index for it.
    CREATE TABLE postings
        (word varchar(100) NOT NULL,
         firstdoc int NOT NULL,
         flags tinyint NOT NULL,
         block varchar(255) NOT NULL)
    CREATE CLUSTERED INDEX clust_index
        ON postings (word, firstdoc, flags)
It is essential that the index be on the word, firstdoc, and flags fields in that order. This causes most relational systems to create a B-Tree index using these three fields as the key. When the table rows for a given word are retrieved using the index, they will be enumerated in ascending order of the firstdoc field. Rows with the same firstdoc field will be enumerated in ascending order of the flags field. Making the index CLUSTERED tells the relational system to place the indexed fields directly into the B-Tree, saving a level of indirection. Retrievals will be faster if the index is clustered.
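For readers without access to a Sybase server, the table layout can be tried with Python's built-in sqlite3 module. This is a stand-in, not the configuration used in the paper: SQLite has no CLUSTERED keyword, so the sketch assumes a WITHOUT ROWID table with a composite primary key, which likewise stores the rows in a B-Tree keyed on (word, firstdoc, flags). The helper name rows_for is hypothetical.

```python
import sqlite3

# Rows from Figures 3, 4 and 5, deliberately listed out of key order.
ROWS = [
    ("box", 1150, 2, bytes.fromhex("FE08022001556211")),
    ("box", 515, 128, bytes.fromhex("1FB1026B42079101320E")),
    ("apple", 515, 2, bytes.fromhex("830401A101030C02110A")),
    ("box", 515, 0, bytes.fromhex("830401A101026E025F03")),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE postings (
        word     TEXT    NOT NULL,
        firstdoc INTEGER NOT NULL,
        flags    INTEGER NOT NULL,
        block    BLOB    NOT NULL,
        PRIMARY KEY (word, firstdoc, flags)  -- SQLite substitute for the clustered index
    ) WITHOUT ROWID
    """
)
conn.executemany("INSERT INTO postings VALUES (?, ?, ?, ?)", ROWS)


def rows_for(conn, word):
    """Return (firstdoc, flags, block) rows for one word in key order."""
    cursor = conn.execute(
        "SELECT firstdoc, flags, block FROM postings "
        "WHERE word = ? ORDER BY firstdoc, flags",
        (word,),
    )
    return cursor.fetchall()
```

The rows for "box" come back ordered by firstdoc and then flags, so the document list row (flags = 0) precedes its positions row (flags = 128) regardless of insertion order.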
4.3 Using Separate Word and Postings Tables

As a variation, the words may be held in a separate table which is joined to the postings table for queries via an integer wordnum key. The separate words table has the advantage of providing a place to store information about each word such as number of occurrences of each word as shown in Table 4.
Table 4. Columns for Separate Word Table
column name   type      bytes   description
word          varchar   100     contains an indexed word
wordnum       integer   4       contains a unique word number
doc_count     integer   4       optional field for number of documents containing the word
word_count    integer   4       optional field for total number of occurrences of the word
The postings table is the same as described in section 4.1 except it has a wordnum column instead of a word column as shown in Table 5.
Table 5. Columns for Separate Postings Table
column name   type        bytes   description
wordnum       integer     4       contains the word number for an indexed word
firstdoc      integer     4       contains the lowest document id referenced in the block
flags         tinyint     1       indicates the block type, length of doc list and/or sequence number
block         varbinary   255     contains encoded document and/or position postings for the word
The following SQL commands create an empty words table called wordlist and a postings table called numpostings, and build an appropriate index for each.
    CREATE TABLE wordlist
        (word varchar(100) NOT NULL,
         wordnum integer NOT NULL,
         doc_count integer NOT NULL,
         word_count integer NOT NULL)
    CREATE CLUSTERED INDEX clust_index
        ON wordlist (word, wordnum, doc_count, word_count)
    CREATE TABLE numpostings
        (wordnum integer NOT NULL,
         firstdoc int NOT NULL,
         flags tinyint NOT NULL,
         block varchar(255) NOT NULL)
    CREATE CLUSTERED INDEX clust_index
        ON numpostings (wordnum, firstdoc, flags)
In practice, additional database tables may be used to store other useful information such as document lengths, document creation dates, and access control information.
5.0 Using the Inverted Index

In order to perform a text search operation, an application program must perform several steps. It must retrieve the table rows containing postings for one or more words and decode the posting blocks. The postings lists from each word are compared or combined as appropriate for the type of text search being performed. For boolean searches, the document lists for each word are combined using set intersection and union operations. Proximity search algorithms use the word positions information to select just the documents in which the queried words occur in a particular relationship. In both kinds of searches, the resulting documents can be priority ranked according to the number of times the queried words occurred, possibly weighted by parameters such as the document lengths and importance of each query word.
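The combination step for boolean searches can be sketched in a few lines of Python. The sketch assumes each word's document list has already been decoded into a mapping from document identifier to word frequency; the function names and the simple frequency-sum ranking are illustrative choices, not the weighting scheme of any particular system.

```python
def boolean_and(*doc_lists):
    """Identifiers of documents that contain every queried word."""
    ids = set(doc_lists[0])
    for docs in doc_lists[1:]:
        ids &= set(docs)
    return sorted(ids)


def boolean_or(*doc_lists):
    """Identifiers of documents that contain at least one queried word."""
    ids = set()
    for docs in doc_lists:
        ids |= set(docs)
    return sorted(ids)


def rank(doc_ids, *doc_lists):
    """Order doc_ids by total occurrences of the queried words, highest first."""
    def score(doc_id):
        return sum(docs.get(doc_id, 0) for docs in doc_lists)
    return sorted(doc_ids, key=lambda doc_id: (-score(doc_id), doc_id))


# Decoded document lists (doc_id -> frequency) for the example words.
BOX = {515: 1, 676: 2, 786: 2, 881: 3, 1150: 2, 1182: 1}
APPLE = {515: 1, 676: 3}
```

With these lists, boolean_and(BOX, APPLE) selects documents 515 and 676, and rank places 676 first because the two words occur there five times in total.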
5.1 Retrieval of Postings

The postings for a given word are retrieved from the relational system using an appropriate SQL query expression. For example, the SQL query below retrieves the rows for the word "box" from the postings table named postings:
    SELECT word, firstdoc, flags, block
    FROM postings
    WHERE word = 'box'
    ORDER BY firstdoc, flags
In practice, the ORDER BY clause is not required because the rows are already properly ordered in the clustered index. With some relational systems, using ORDER BY may impose a performance penalty. The resulting rows are shown in Table 6. The first row contains a document list (flags = 0), the second contains the corresponding word positions (flags ≥ 128), and the third row contains a combined document and positions list for the last two documents (flags = 2).
Table 6. Results of Search for Postings of the Word "box"
word   firstdoc   flags   block
box    515        0       830401A101026E025F03
box    515        128     1FB1026B42079101320E
box    1150       2       FE08022001556211
Many text search queries do not require word position information (e.g. simple boolean queries). For these queries, the SQL expression can be modified to retrieve only those rows that contain document lists (i.e. with flags < 128). Also it is not necessary to retrieve the word field, since it is known in advance. The rows retrieved by the following SQL query are shown in Table 7:
    SELECT firstdoc, flags, block
    FROM postings
    WHERE word = 'box' AND flags < 128
Table 7. Results of Search for Document Lists Only
firstdoc   flags   block
515        0       830401A101026E025F03
1150       2       FE08022001556211
5.2 Retrieval from Separate Word and Postings Tables

When separate word and postings tables are used as described in section 4.3, the two tables are joined as in the following SQL query, which produces the same results as the previous example:
    SELECT firstdoc, flags, block
    FROM wordlist, numpostings
    WHERE word = 'box'
      AND wordlist.wordnum = numpostings.wordnum
      AND flags < 128
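The join can be exercised with Python's sqlite3 module as a stand-in for the SQL Server. The table definitions below are SQLite approximations of Tables 4 and 5 rather than the Sybase originals, the word number 7 is an arbitrary value chosen for the example, and document_list_rows is a hypothetical helper name. An ORDER BY clause is included because SQLite does not promise row order without one.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE wordlist (
        word       TEXT    NOT NULL PRIMARY KEY,
        wordnum    INTEGER NOT NULL,
        doc_count  INTEGER NOT NULL,
        word_count INTEGER NOT NULL
    );
    CREATE TABLE numpostings (
        wordnum  INTEGER NOT NULL,
        firstdoc INTEGER NOT NULL,
        flags    INTEGER NOT NULL,
        block    BLOB    NOT NULL,
        PRIMARY KEY (wordnum, firstdoc, flags)
    ) WITHOUT ROWID;
    """
)
# "box" occurs 11 times in 6 documents in the running example.
conn.execute("INSERT INTO wordlist VALUES ('box', 7, 6, 11)")
conn.executemany(
    "INSERT INTO numpostings VALUES (7, ?, ?, ?)",
    [
        (515, 0, bytes.fromhex("830401A101026E025F03")),
        (515, 128, bytes.fromhex("1FB1026B42079101320E")),
        (1150, 2, bytes.fromhex("FE08022001556211")),
    ],
)


def document_list_rows(conn, word):
    """Rows holding document lists (flags < 128) for one word, via the join."""
    query = """
        SELECT firstdoc, flags, block
        FROM wordlist, numpostings
        WHERE word = ?
          AND wordlist.wordnum = numpostings.wordnum
          AND flags < 128
        ORDER BY firstdoc, flags
    """
    return conn.execute(query, (word,)).fetchall()
```

For the word "box" this returns the same two rows as Table 7; the positions row with flags = 128 is filtered out.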
5.3 Retrieval with a Limited Document Identifier Range

By specifying a constraint on the firstdoc column, it is possible to construct an SQL query to retrieve word postings that occur within a specified range of document identifiers. This is useful if document identifiers are assigned in a meaningful order, such as by creation or entry date. For example, the following SQL query retrieves only rows containing document identifiers between 1160 and 1170:
    SELECT firstdoc, flags, block
    FROM postings
    WHERE word = 'box'
      AND firstdoc >= (SELECT max (firstdoc)
                       FROM postings
                       WHERE word = 'box' AND flags < 128 AND firstdoc <= 1160)
      AND firstdoc < 1170
Table 8 shows the results of this query. The requested document identifier range falls within the results (along with a few postings outside the range).
Table 8. Results of Search for "box" in Documents 1160 through 1170
firstdoc   flags   block
1150       2       FE08022001556211
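The range query can be checked against the example rows using Python's sqlite3 module as a stand-in for the SQL Server. The table definition is a SQLite approximation, and the helper name rows_in_range and the named parameters are introduced here for illustration only.

```python
import sqlite3

BOX_ROWS = [
    ("box", 515, 0, bytes.fromhex("830401A101026E025F03")),
    ("box", 515, 128, bytes.fromhex("1FB1026B42079101320E")),
    ("box", 1150, 2, bytes.fromhex("FE08022001556211")),
]

RANGE_QUERY = """
    SELECT firstdoc, flags, block
    FROM postings
    WHERE word = :word
      AND firstdoc >= (SELECT max(firstdoc)
                       FROM postings
                       WHERE word = :word AND flags < 128 AND firstdoc <= :low)
      AND firstdoc < :high
    ORDER BY firstdoc, flags
"""

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE postings (word TEXT NOT NULL, firstdoc INTEGER NOT NULL, "
    "flags INTEGER NOT NULL, block BLOB NOT NULL, "
    "PRIMARY KEY (word, firstdoc, flags)) WITHOUT ROWID"
)
conn.executemany("INSERT INTO postings VALUES (?, ?, ?, ?)", BOX_ROWS)


def rows_in_range(conn, word, low, high):
    """Rows whose blocks may hold postings for identifiers in [low, high)."""
    params = {"word": word, "low": low, "high": high}
    return conn.execute(RANGE_QUERY, params).fetchall()
```

For the range 1160 to 1170 the query returns the single row with firstdoc 1150, as in Table 8. The application still has to discard decoded identifiers that fall outside the range: this block holds documents 1150 and 1182, neither of which lies between 1160 and 1170. In SQLite, when no document list row has firstdoc at or below the lower bound, the sub-query yields NULL and no rows are returned.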
When document identifiers are assigned in date order and the correspondence between identifiers and key dates is stored in another database table, the firstdoc column can be used to implement a date range constraint. For example, the document identifiers associated with key dates can be stored in a two-column table as in Table 9.
Table 9. Example of Date and Document Identifier Correspondence Table
date          firstdoc
1 Aug 1990    1
1 Sep 1990    480
1 Oct 1990    829
1 Nov 1990    1160
1 Dec 1990    1170
1 Jan 1991    1302
Given such a table named docdates, the document identifier values 1160 and 1170 in the previous SQL query can be replaced with appropriate SQL sub-queries as shown below to find documents indexed during November 1990:
    SELECT firstdoc, flags, block
    FROM postings
    WHERE word = 'box'
      AND firstdoc >= (SELECT max (firstdoc)
                       FROM postings
                       WHERE word = 'box' AND flags < 128
                         AND firstdoc <= (SELECT max (firstdoc)
                                          FROM docdates
                                          WHERE date <= '1 Nov 1990'))
      AND firstdoc < (SELECT min (firstdoc)
                      FROM docdates
                      WHERE date >= '1 Dec 1990')
6.0 Insertion of Postings

The inverted index implementation described here is designed so postings from new documents can be efficiently added to the index at any time. A limitation of this implementation is that document identifiers must be added to the index in ascending order.
6.1 Update Optimizations

Cutting and Pedersen have developed efficient algorithms for creating and updating an inverted index by buffering a large number of postings lists in main memory and periodically merging them into the externally stored index [2]. This merge update optimization dramatically reduces the number of secondary storage accesses required to index new documents.
This optimization is equally important when storing the inverted index in a relational system. At any given time, the final row(s) containing the document list and positions list for a large number of words should be kept in main memory. Whenever a row's block becomes filled (to its maximum of 255 bytes), it is stored into the database and a new empty row begun. If the relational system's application programming interface provides efficient bulk data copy primitives, they should be used for inserting new rows into the database tables.
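The buffering idea can be outlined with a small Python class. This is a deliberately simplified sketch and not the merge algorithm of [2] or the author's C program: it buffers decoded postings rather than encoded rows, flushes everything once a posting count is reached, and hands each word's postings to a caller-supplied store function. The class name, the max_postings threshold and the store callback are all assumptions made for the example.

```python
from collections import defaultdict


class PostingsBuffer:
    """Accumulate postings in main memory and write them out in batches."""

    def __init__(self, store, max_postings=100000):
        self.store = store                  # callable(word, [(doc_id, [positions])])
        self.max_postings = max_postings    # flush threshold (assumed tuning knob)
        self.pending = defaultdict(dict)    # word -> {doc_id: [positions]}
        self.count = 0
        self.last_doc = 0

    def add(self, word, doc_id, position):
        """Buffer one word occurrence; documents must arrive in ascending order."""
        if doc_id < self.last_doc:
            raise ValueError("document identifiers must be added in ascending order")
        self.last_doc = doc_id
        self.pending[word].setdefault(doc_id, []).append(position)
        self.count += 1
        if self.count >= self.max_postings:
            self.flush()

    def flush(self):
        """Write buffered postings word by word, in index key order."""
        for word in sorted(self.pending):
            self.store(word, sorted(self.pending[word].items()))
        self.pending.clear()
        self.count = 0
```

Writing the words in sorted order means the stored index is visited in key order during a flush, and each word's rows are fetched and rewritten once per batch instead of once per occurrence.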
6.2 Appending to Existing Postings Lists

In order to append new postings to those already stored for a word, the table rows containing the last document list and positions list for the word must first be retrieved into main memory. The new postings are appended to the retrieved lists and the updated table rows eventually stored back to the database. The following SQL expression retrieves just the rows containing the last document list and positions list for the word "box". The resulting row from the postings table is shown in Table 10.
    SELECT firstdoc, flags, block
    FROM postings
    WHERE word = 'box'
      AND firstdoc >= (SELECT max (firstdoc)
                       FROM postings
                       WHERE word = 'box' AND flags < 128)
The result may either be a single row containing document and positions lists (with 0 < flags < 128) or multiple rows, the first containing the document list (with flags = 0) and the rest containing the positions lists (with flags ≥ 128). Only the last row of positions needs to be kept in main memory.
Table 10. Results of SQL Query to Retrieve Last Document and Positions Lists for the Word "box"
firstdoc   flags   block
1150       2       FE08022001556211
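Appending one new document to a retrieved combined row might look as follows in Python. The sketch covers only the simplest case, a last row with 0 < flags < 128, and when the enlarged block would exceed the limit it starts a fresh combined row instead of performing the full split described in section 4.1. The function name append_document is hypothetical, and a new document too large for one block is not handled.

```python
def encode_varint(value):
    """Variable-length byte encoding: seven bits per byte, low bits first."""
    out = bytearray()
    while True:
        low7 = value & 0x7F
        value >>= 7
        if value:
            out.append(low7 | 0x80)
        else:
            out.append(low7)
            return bytes(out)


def read_varint(data, offset):
    """Return (value, next_offset) for the integer starting at offset."""
    value, shift = 0, 0
    while True:
        byte = data[offset]
        offset += 1
        value |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return value, offset
        shift += 7


def append_document(row, doc_id, positions, max_block=255):
    """Return the rows that replace `row` after adding one document's postings."""
    firstdoc, flags, block = row
    if not 0 < flags < 128:
        raise ValueError("this sketch only handles combined rows")
    # Walk the document list to find where it ends and the last identifier.
    offset, last_doc = 0, 0
    for _ in range(flags):
        delta, offset = read_varint(block, offset)
        _freq, offset = read_varint(block, offset)
        last_doc += delta
    if doc_id <= last_doc:
        raise ValueError("document identifiers must be added in ascending order")
    new_positions, prev = bytearray(), 0
    for pos in positions:
        new_positions += encode_varint(pos - prev)
        prev = pos
    freq = encode_varint(len(positions))
    # New pair goes at the end of the document list, positions at the very end.
    merged = (block[:offset] + encode_varint(doc_id - last_doc) + freq
              + block[offset:] + bytes(new_positions))
    if len(merged) <= max_block and flags + 1 < 128:
        return [(firstdoc, flags + 1, merged)]
    # Otherwise leave the old row untouched and begin a new combined row.
    fresh = encode_varint(doc_id) + freq + bytes(new_positions)
    return [row, (doc_id, 1, fresh)]
```

Starting from a row holding only document 1150 and appending document 1182 with position 17 reproduces the block of Table 10. With an artificially small max_block the old row is kept and a second row with firstdoc 1182 is created.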
6.3 Removal of Postings

The fact that the index is inverted according to individual words makes the removal of a single document inefficient. The technique described in section 5.3 can be used to reduce the number of rows which must be examined looking for the document's identifier, but single document removal is still inefficient. Note that removing a range of document identifiers from the index can be done for about the same cost as removing a single document.
If document removal is infrequent, an auxiliary table can be used to hold the removed document identifiers. Often there is already a table containing an entry for each document. If so, this feature can be added for very little cost.
When it is known in advance what identifier ranges will be removed, the insertion algorithm can be modified to arrange the database so the range boundaries always occur between rows. This will cause less efficient space utilization, but removing all documents between the range boundaries can then be done with a single SQL expression rather than an exhaustive search of individual word postings. For example, if the indexing software started new table rows for 1991's documents (starting with document identifier 1302), the following SQL expression will remove all document postings from the year 1990 or earlier:
    DELETE postings
    WHERE firstdoc < (SELECT min (firstdoc)
                      FROM docdates
                      WHERE date >= '1 Jan 1991')
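The range removal can be tried with Python's sqlite3 module as a stand-in for the SQL Server. Several details are assumptions made for the example: SQLite requires DELETE FROM, the dates of Table 9 are stored as ISO strings so that they compare correctly as text, and the row with firstdoc 1302 is an invented posting for a 1991 document. The helper name remove_before is hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE postings (
        word TEXT NOT NULL, firstdoc INTEGER NOT NULL,
        flags INTEGER NOT NULL, block BLOB NOT NULL,
        PRIMARY KEY (word, firstdoc, flags)
    ) WITHOUT ROWID;
    CREATE TABLE docdates (date TEXT NOT NULL, firstdoc INTEGER NOT NULL);
    """
)
conn.executemany(
    "INSERT INTO docdates VALUES (?, ?)",
    [("1990-08-01", 1), ("1990-09-01", 480), ("1990-10-01", 829),
     ("1990-11-01", 1160), ("1990-12-01", 1170), ("1991-01-01", 1302)],
)
conn.executemany(
    "INSERT INTO postings VALUES (?, ?, ?, ?)",
    [
        ("box", 515, 0, bytes.fromhex("830401A101026E025F03")),
        ("box", 515, 128, bytes.fromhex("1FB1026B42079101320E")),
        ("box", 1150, 2, bytes.fromhex("FE08022001556211")),
        ("box", 1302, 1, bytes.fromhex("960A0105")),  # hypothetical 1991 posting
    ],
)


def remove_before(conn, date):
    """Delete every row that starts before the first document on or after `date`."""
    cursor = conn.execute(
        "DELETE FROM postings WHERE firstdoc < "
        "(SELECT min(firstdoc) FROM docdates WHERE date >= ?)",
        (date,),
    )
    return cursor.rowcount
```

Calling remove_before(conn, "1991-01-01") deletes the three rows from 1990 and leaves the row that begins at document 1302. This works only because the boundary at 1302 falls between rows, which is the precondition stated above.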
7.0 Sybase Implementation Statistics

The methods described in this paper have been implemented on a Sun Unix workstation using a Sybase SQL Server. Text indexing and search programs were implemented in the C programming language using the Sybase DB-Library package. The indexing program is 2000 lines long and the boolean text search program is 1200 lines long.
Two electronic encyclopedia databases were used to test the software: the Alphapedia section of the 1990 Random House Encyclopedia (18,000 articles, 126,000 words) and Grolier's Academic Encyclopedia (31,000 articles, 780,000 words). An inverted index was created for the words (except for a list of 72 drop words) occurring in each encyclopedia. A separate index was created for the article titles in each encyclopedia, as well as an index for the Random House article categories.
7.1 Space Used

The amount of storage space used by a relational system to store an inverted index depends on the efficiency of its internal data and index storage. For each of the two encyclopedias, Table 11 shows the original text size, the number of indexed words, the number of distinct words, and the number of table rows and space required to encode the inverted index.
Table 11. Size of Full Text Encoded Inverted Indices
table name   text size   words     vocabulary   rows     table size (%text)   Sybase-1   Sybase-2 (%text)
RH_word      7856 KB     844753    59141        65334    3632 KB (46%)        6374 KB    4398 KB (56%)
GR_word      63256 KB    5035874   139870       191148   17816 KB (28%)       31142 KB   20908 KB (33%)
Three index sizes are given. The column labeled "table size" is the amount of storage space required using the encodings described in this paper, assuming no additional storage or indexing overhead. The column labeled "Sybase-1" is the actual database table size reported by the Sybase SQL Server after the table rows were added to the relational database by the document indexing program. It was discovered that for the Sybase SQL Server, 30% less space is required for the same database tables when they are copied using the Sybase bulk database copy program, apparently because the data is then stored more compactly. The column labeled "Sybase-2" shows the reduced amount of space resulting from this technique. It may be possible to achieve similar space savings in the incremental indexing program by using the available DB-Library bulk copy routines.
The encoding techniques presented here are most efficient for indexing the words of text documents; however, the same database structure can also be used for a smaller inverted index of document titles, keywords, or category codes. Table 12 shows the space required to encode inverted indices for the encyclopedia article titles and category codes. Complete article titles were indexed separately from individual title words.
Table 12. Size of Title and Category Encoded Inverted Indices
table name     text size   words   vocabulary   rows    index size (%text)   Sybase-1   Sybase-2 (%text)
RH_category    105 KB      17938   55           317     58 KB (55%)          112 KB     80 KB (76%)
RH_titleword   290 KB      34762   17006        17050   392 KB (135%)        766 KB     528 KB (182%)
RH_title       290 KB      17938   17632        17632   456 KB (157%)        892 KB     620 KB (214%)
GR_titleword   590 KB      73899   26170        26396   672 KB (114%)        1292 KB    908 KB (154%)
GR_title       590 KB      30762   30388        30388   880 KB (149%)        1658 KB    1164 KB (197%)
7.2 Indexing Speed

Indexing speed depends almost entirely on the speed with which rows can be retrieved and stored into the relational system. Our indexing program was able to create an inverted index for the Random House Encyclopedia in 37 minutes¹. This is the equivalent of 380 words per second or about 14 megabytes of text per hour. Due to overhead associated with writing out and later retrieving buffered postings when main memory becomes full, the larger Grolier's Encyclopedia was indexed at a rate of 140 words per second (6.4 megabytes of text per hour). Better buffering and the use of bulk copy routines would improve the indexing speed.
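The words-per-second figure follows directly from the word count in Table 11 and the elapsed time. A minimal sketch of the arithmetic in Python (the function names are illustrative; the total indexing time for the larger encyclopedia is not stated in this paper and is only inferred here from the quoted rate):

```python
RH_WORDS = 844753          # indexed words, Random House Encyclopedia (Table 11)
RH_MINUTES = 37            # reported indexing time
GR_WORDS = 5035874         # indexed words, Grolier's Encyclopedia (Table 11)
GR_WORDS_PER_SECOND = 140  # reported indexing rate


def words_per_second(words, minutes):
    """Average indexing rate over the whole run."""
    return words / (minutes * 60)


def hours_at_rate(words, rate_per_second):
    """Time implied by a word count and a rate (an inference, not a measurement)."""
    return words / rate_per_second / 3600


print(f"Random House: {words_per_second(RH_WORDS, RH_MINUTES):.1f} words/s")
print(f"Grolier's: about {hours_at_rate(GR_WORDS, GR_WORDS_PER_SECOND):.1f} hours implied")
```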
7.3 Search Speed

Text search speed using the inverted index depends on the speed with which rows can be retrieved from the relational system and the postings lists decoded and combined by the search program. Our boolean text search program performs separate SQL queries for each word in the search query, decodes the results, and combines the document identifier lists using union and intersection operations². The resulting document list is sorted so the documents with the most word matches occur at the head of the list.

Table 13 lists several boolean text search queries performed using the Grolier's Encyclopedia index, and the amount of time taken to compute the matching document identifiers. The % at the end of a search query term is the SQL wildcard character used to match multiple words with the same prefix (e.g. newt% matches newt, newton, newtonian, etc.). The & character indicates the boolean AND operator and the | character indicates the OR operator. For each search, Table 13 lists the number of matching documents, the number of SQL queries required, the number of table rows retrieved, and the total number of document identifiers processed.

The column labeled 1st time indicates the number of seconds required for a query that has not been performed recently. Due to caching in the SQL server, immediately repeating the query results in a faster response, as shown in the column labeled 2nd time. The timings shown include the time to construct and perform the SQL queries, decode the results, and combine the document identifier lists from multiple rows, but do not include the time required to initialize the SQL server connection and rank the document lists.
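The per-term query-and-combine approach can be sketched as follows, using Python's built-in sqlite3 module as a stand-in for the SQL server. The postings table here is a hypothetical simplification with one row per (word, document) pair; it does not reproduce the encoded postings lists used in this paper, and the sample data and function names are illustrative only.

```python
import sqlite3
from collections import Counter

# Hypothetical, simplified schema: one row per (word, doc_id) pair.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE postings (word TEXT, doc_id INTEGER)")
conn.executemany(
    "INSERT INTO postings VALUES (?, ?)",
    [("war", 1), ("war", 2), ("war", 3),
     ("century", 2), ("century", 3), ("century", 4),
     ("newt", 5), ("newton", 3), ("newtonian", 6)],
)


def docs_for_term(term):
    """One SQL query per term; a trailing % matches every word with that prefix."""
    rows = conn.execute(
        "SELECT DISTINCT doc_id FROM postings WHERE word LIKE ?", (term,)
    )
    return {doc_id for (doc_id,) in rows}


def search_and(terms):
    """Boolean AND: intersect the document sets of all terms."""
    return set.intersection(*(docs_for_term(t) for t in terms))


def search_or(terms):
    """Boolean OR: union the document sets of all terms."""
    return set.union(*(docs_for_term(t) for t in terms))


def rank(doc_ids, terms):
    """Sort so that documents matching the most query terms come first."""
    hits = Counter()
    for term in terms:
        for doc_id in docs_for_term(term) & doc_ids:
            hits[doc_id] += 1
    return sorted(doc_ids, key=lambda d: (-hits[d], d))


print(search_and(["war", "century"]))
print(rank(search_or(["war", "century"]), ["war", "century"]))
print(docs_for_term("newt%"))
```

In this sketch each term costs one round trip to the database, which mirrors the number of SQL queries counted per search in Table 13.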
1. Running on a Sun 4/490 SPARC processor with 32 megabytes of main memory.
2. Multiple term queries could be made faster by retrieving the postings for all the terms using a single SQL query.
Table 13. Boolean Text Search Speed

text search query                                        matches  queries  rows   docs  1st time  2nd time
xxxxx                                                          0        1     0      0      0.09      0.02
newt                                                           7        1     1      7      0.13      0.02
newt%                                                        205        1    11    205      0.36      0.07
fight                                                        197        1     2    197      0.18      0.06
battle                                                       758        1     7    758      0.11      0.02
california                                                   785        1     7    785      0.11      0.03
general                                                     2235        1    18   2235      0.29      0.07
war                                                         4170        1    34   4170      0.37      0.11
century                                                     5471        1    44   5471      0.37      0.12
war & century                                               1257        2    78   9641      0.59      0.36
california & war & century                                    79        3    85  10426      0.89      0.38
fight & battle & california & general & war & century          4        6   112  13616      1.30      0.54
(fight | battle) & california & general & war & century       14        6   112  13616      1.14      0.56
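The rows and docs columns for the multiple-term queries in Table 13 are the sums of the corresponding single-term values, since each term is retrieved by its own SQL query before the lists are combined. A minimal sketch in Python that rechecks this from the table's figures (the dictionary and function name are illustrative):

```python
# (rows retrieved, document identifiers processed) for single-term queries,
# copied from Table 13.
SINGLE_TERM = {
    "fight": (2, 197),
    "battle": (7, 758),
    "california": (7, 785),
    "general": (18, 2235),
    "war": (34, 4170),
    "century": (44, 5471),
}


def work_for(terms):
    """Total rows and document identifiers handled when each term is queried separately."""
    rows = sum(SINGLE_TERM[t][0] for t in terms)
    docs = sum(SINGLE_TERM[t][1] for t in terms)
    return rows, docs


print(work_for(["war", "century"]))
print(work_for(["california", "war", "century"]))
print(work_for(list(SINGLE_TERM)))
```

The totals match the table: the work grows with the combined length of the postings lists, not with the number of documents that finally match.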
8.0 Conclusion
Information retrieval systems often provide full text search using B-Trees and heap files, but reliable high performance B-Tree software is not available for many computing environments and it is difficult to implement. It is possible to implement an efficient inverted index using database tables in a relational database management system as a substitute for B-Trees and heap files. Using a relational system's application programming interface, it is possible to create text indexing and search programs with performance and space efficiency similar to that obtained using a B-Tree package. Available relational systems provide high performance along with valuable features not normally found in a B-Tree package, such as multi-user transactions, access control, reliable back-up, and automatic error recovery.
9.0 References
[1] R. Bayer and E. McCreight. Organization and Maintenance of Large Ordered Indexes. Acta Informatica, 1:173-189, 1972.
[2] D. Cutting and J. Pedersen. Optimizations for Dynamic Inverted Index Maintenance. Proceedings of the 13th International Conference on Research and Development in Information Retrieval, pp. 405-411, September 1990.
[3] M. Darnovsky and J. Bowman. Transact-SQL User's Guide, SYBASE Release 4.0, Sybase Inc., May 1989.
[4] S. Goodman and A. Cohen. Open Client DB-Library Reference Manual, SYBASE Release 4.0, Sybase Inc., May 1989.
[5] G. Salton and M. McGill. Introduction to Modern Information Retrieval. McGraw Hill, 1983.
[6] J. H. Trimble, Jr. and D. Chappell. A Visual Introduction to SQL. John Wiley & Sons, 1989.
