
ISCL Winter Semester 2007, IR Midterm Exam

17 December 2007. SOLUTIONS. Non-electronic documents and calculators are authorized.
Name:            Semester:

Exercise 1: Definitions
Define the following terms:

tokenization: segmentation of a document in order to produce a list of items (deals with punctuation, acronyms, dates, etc.)

permuterm index: index mapping all rotations of a given word (including a terminal delimiter, e.g. word$) to that word; used for wildcard queries

champion list: pre-computed list of the r most relevant documents with respect to a given term
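As an illustration of the permuterm idea, the rotations stored for a term can be generated with a few lines of Python (a sketch, not part of the exam):

```python
def permuterm_rotations(term: str) -> list:
    # All rotations of term + '$'; each rotation becomes a dictionary key
    # pointing back to the original term.
    augmented = term + "$"          # '$' marks the end of the term
    return [augmented[i:] + augmented[:i] for i in range(len(augmented))]

rotations = permuterm_rotations("hello")
print(rotations)
# A wildcard query such as hel*o is rotated to o$hel* and answered by prefix search.
```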

Exercise 2: Characteristics of a collection and its index


Consider a collection made of 500,000 documents, each containing on average 800 words. The number of different words (i.e. not counting duplicates) is estimated at 700,000. For all questions, give your computation.

What is the size (in mega- or gigabytes) of the collection when stored (uncompressed) on disc?
500,000 × 800 × 6 = 2,400,000,000 bytes = 2.4 GB (assuming 6 bytes per word on average).

With the best reduction rate of the dictionary achieved when using linguistic preprocessing (noise-word removal, stemming), what is the size (number of terms) of the dictionary?
Best reduction rate: 50 %, so 700,000 / 2 = 350,000 keywords.

Consider an index where the average length of a non-positional posting list is 200. What is the estimate of the total number of postings of this index?
350,000 × 200 = 70,000,000 postings.

How many bytes do you allow respectively for encoding (without compression) a dictionary term? A non-positional posting?
Dictionary term: 40 bytes for the keyword (Unicode charset, 2 bytes per character, maximum 20 characters per word), 4 bytes for the keyword frequency, and 3 bytes for the pointer to the posting list (log2(350,000) ≈ 19 bits). Total: 40 + 4 + 3 = 47 bytes.
Posting: 500,000 documents to refer to, so log2(500,000) ≈ 19 bits ≈ 3 bytes.
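The arithmetic above can be double-checked with a short Python sketch (the 6-bytes-per-word figure and the 50 % reduction rate are the solution's assumptions):

```python
import math

N_DOCS = 500_000
AVG_WORDS_PER_DOC = 800
BYTES_PER_WORD = 6             # assumption used by the solution
DISTINCT_WORDS = 700_000

collection_bytes = N_DOCS * AVG_WORDS_PER_DOC * BYTES_PER_WORD   # 2.4 GB
dictionary_terms = DISTINCT_WORDS // 2                           # 50 % reduction
postings = dictionary_terms * 200                                # avg list length 200
posting_bits = math.ceil(math.log2(N_DOCS))                      # 19 bits -> 3 bytes

print(collection_bytes, dictionary_terms, postings, posting_bits)
```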

What are the sizes (in mega- or gigabytes) of the resulting dictionary and posting lists?
Dictionary: 350,000 × 47 = 16,450,000 bytes ≈ 16.45 MB.
Postings: 70,000,000 × 3 = 210,000,000 bytes = 210 MB.

If you compress your dictionary using the dictionary-as-a-string method, what is the new size of the dictionary?
350,000 × (4 + 3 + 3 + 2 × 8) = 9,100,000 bytes = 9.1 MB (4 bytes for the term frequency, 3 bytes for the pointer to the posting list, 3 bytes for the pointer into the string, and 8 characters per word on average, each encoded with 2 bytes).
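The same per-entry byte counts can be sketched as a quick check:

```python
TERMS = 350_000

# Fixed-width entry: 40 bytes term + 4 bytes frequency + 3 bytes postings pointer.
fixed_bytes = TERMS * (40 + 4 + 3)

# Dictionary-as-a-string: 4 bytes frequency + 3 bytes postings pointer
# + 3 bytes pointer into the string + 8 average characters at 2 bytes each.
string_bytes = TERMS * (4 + 3 + 3 + 2 * 8)

print(fixed_bytes, string_bytes)   # 16450000 9100000
```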

Exercise 3: Querying an index


What kinds of queries can be applied to the collection, and for each of these, what index is needed?

boolean queries: non-positional index
phrase queries: positional index
wildcard queries: permuterm index or n-gram index
similarity queries: frequency index

Exercise 4: Linguistic preprocessing


Are the following statements true or false (justify your answer)?

a) Stemming increases retrieval precision.
False. Stemming decreases precision: since the inflection of words is ignored, many documents are retrieved even if they do not relate to the query (e.g. golden retriever vs. gold retrieval).

b) Stemming only slightly reduces the size of the dictionary.
False. Stemming can in some cases reduce the size of the dictionary by 33 to 50 %.

c) Stop lists contain all of the most frequent terms.
False. Stop lists contain only some of the most frequent terms (a counter-example is the word water in English, which is among the most frequent but not included in stop lists).

Exercise 5: Porter stemming


What would be the result of the Porter stemmer applied to the following words?

busses → buss
rely → reli
realised → realis

What is the Porter measure of the following words (give your computation)?

crepuscular: cr (C) ep (VC) usc (VC) ul (VC) ar (VC), so the form is C(VC)^4 and m = 4
rigorous: r (C) ig (VC) or (VC) ous (VC), so the form is C(VC)^3 and m = 3
placement: pl (C) ac (VC) em (VC) ent (VC), so the form is C(VC)^3 and m = 3
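The measure can be computed mechanically; here is a simplified Python sketch (it treats y as a consonant, which suffices for these three words, whereas the full Porter rules handle y context-sensitively):

```python
import re

def porter_measure(word: str) -> int:
    # Map each letter to V or C (simplification: 'y' is always a consonant).
    pattern = "".join("V" if ch in "aeiou" else "C" for ch in word.lower())
    collapsed = re.sub(r"(.)\1+", r"\1", pattern)   # collapse runs: CCVCC -> CVC
    return collapsed.count("VC")                    # m in the form [C](VC)^m[V]

for word in ("crepuscular", "rigorous", "placement"):
    print(word, porter_measure(word))
```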

Exercise 6: Index architecture


Propose a MapReduce architecture for creating language-specific indexes from a heterogeneous collection. You can illustrate this architecture using a figure.
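One way to realise such an architecture can be sketched as an in-memory MapReduce simulation: mappers detect each document's language and emit ((language, term), docID) pairs, the shuffle groups by key, and reducers build one posting list per (language, term), yielding one index per language. The `detect_language` stub below is a hypothetical stand-in for a real language classifier:

```python
from collections import defaultdict

# Hypothetical stand-in for a real language classifier.
def detect_language(doc: dict) -> str:
    return doc["lang"]

def mapper(doc_id, doc):
    # Map phase: emit ((language, term), docID) for every token.
    lang = detect_language(doc)
    for term in doc["text"].lower().split():
        yield (lang, term), doc_id

def reducer(key, doc_ids):
    # Reduce phase: turn the grouped docIDs into a sorted posting list.
    return key, sorted(set(doc_ids))

def map_reduce(docs):
    grouped = defaultdict(list)              # shuffle: group values by key
    for doc_id, doc in docs.items():
        for key, value in mapper(doc_id, doc):
            grouped[key].append(value)
    indexes = defaultdict(dict)              # one inverted index per language
    for key, values in grouped.items():
        (lang, term), postings = reducer(key, values)
        indexes[lang][term] = postings
    return indexes

docs = {
    1: {"lang": "en", "text": "the cat"},
    2: {"lang": "de", "text": "die katze"},
    3: {"lang": "en", "text": "the dog"},
}
indexes = map_reduce(docs)
print(indexes["en"]["the"])   # [1, 3]
```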

Exercise 7: Index compression


What is the largest gap that can be encoded in 2 bytes using variable-byte encoding?
With 2 bytes, we use 2 continuation bits, and 14 bits are available for gap encoding (2^0 to 2^13). Hence, the largest gap that can be encoded is 2^14 − 1 = 16383 (when all 14 bits are set to 1).

What is the posting list that can be decoded from the variable-byte code 10001001 00000001 10000010 11111111?

10001001 → 9
00000001 10000010 → 130
11111111 → 127

The decoded gap list is therefore 9, 130, 127.
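A minimal decoder for this scheme (following the convention above: the high bit marks the final byte of each gap, the lower 7 bits carry the payload):

```python
def vb_decode(byte_stream):
    # High bit set marks the last byte of a number; lower 7 bits are the payload.
    gaps, n = [], 0
    for byte in byte_stream:
        if byte < 128:                      # continuation byte
            n = n * 128 + byte
        else:                               # terminating byte
            gaps.append(n * 128 + (byte - 128))
            n = 0
    return gaps

print(vb_decode([0b10001001, 0b00000001, 0b10000010, 0b11111111]))   # [9, 130, 127]
```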

What would be the encoding of the same posting list using a γ-code?
9 → 1110,001
130 → 11111110,0000010
127 → 1111110,111111
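A sketch of the γ-encoder producing the codes above (offset = binary representation minus the leading 1, preceded by the unary code of the offset length):

```python
def gamma_encode(n: int) -> str:
    # Offset: binary n without its leading 1 bit; length: unary code of len(offset).
    offset = bin(n)[3:]                     # strip '0b' and the leading 1
    return "1" * len(offset) + "0" + offset

for gap in (9, 130, 127):
    print(gap, gamma_encode(gap))
```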

Exercise 8: Vector Space Model


Consider a collection made of the documents d1, d2, d3, whose characteristics are the following:

Term      tf_d1   tf_d2   tf_d3    df
actor       12      35      55    123
movie       15      24      48    240
trailer     52      13      12     85

Compute the vector representations of d1, d2 and d3 using the tf-idf_{t,d} weighting and Euclidean normalisation.
Estimation of the collection size: either you define your own (symbolic or not) collection size, or you use a heuristic such as: the 3 keywords appear only together in d1, d2 and d3. With the latter, the collection size is N = 123 + 240 + 85 − 3 = 445.

v (d1 ) =

445 12log10 ( 123 ) D 445 15log10 ( 240 ) D 445 52log10 ( 85 ) D 445 35log10 ( 123 ) D 445 24log10 ( 240 ) D ) 13log10 ( 445 85 D

where D = where D = where D =

445 2 445 2 2 (12 log( 445 123 )) + (15 log ( 240 )) + (52 log ( 85 ))

v (d2 ) =

445 2 445 2 2 (35 log( 123 )) + (24 log( 445 240 )) + (13 log ( 85 ))

v (d3 ) =

53log10 ( 445 ) 123 D 445 48log10 ( 240 ) D ) 12log10 ( 445 85 D

445 2 445 2 2 (53 log( 123 )) + (48 log( 445 240 )) + (12 log ( 85 ))

Compute the cosine similarities between these documents.

Since the vectors are already normalised, the cosine similarity is simply the dot product (with Di denoting the Euclidean norm of di's tf-idf weight vector):

s(d1, d2) = v(d1)·v(d2) = [ 12·log10(445/123) × 35·log10(445/123) + 15·log10(445/240) × 24·log10(445/240) + 52·log10(445/85) × 13·log10(445/85) ] / (D1·D2)

s(d1, d3) = v(d1)·v(d3) = [ 12·log10(445/123) × 55·log10(445/123) + 15·log10(445/240) × 48·log10(445/240) + 52·log10(445/85) × 12·log10(445/85) ] / (D1·D3)

s(d2, d3) = v(d2)·v(d3) = [ 35·log10(445/123) × 55·log10(445/123) + 24·log10(445/240) × 48·log10(445/240) + 13·log10(445/85) × 12·log10(445/85) ] / (D2·D3)

Give the ranking retrieved by the system for the query movie trailer.

We need a vector representation for the query q = movie trailer. We can use the following: v(q) = (0, 1, 1). Then we compute the score of each document and rank the documents by decreasing score:
score(q, d1) = v(q)·v(d1)
score(q, d2) = v(q)·v(d2)
score(q, d3) = v(q)·v(d3)
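The whole exercise can be checked numerically with a short Python sketch (rows of the tf table are the terms actor, movie, trailer, as above):

```python
import math

N = 445                                   # estimated collection size
tf = {"d1": [12, 15, 52], "d2": [35, 24, 13], "d3": [55, 48, 12]}
df = [123, 240, 85]                       # rows: actor, movie, trailer

def tfidf_normalised(tf_row):
    # tf x idf weights, then Euclidean (cosine) normalisation.
    w = [t * math.log10(N / d) for t, d in zip(tf_row, df)]
    norm = math.sqrt(sum(x * x for x in w))
    return [x / norm for x in w]

vecs = {d: tfidf_normalised(row) for d, row in tf.items()}

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

q = [0, 1, 1]                             # query "movie trailer"
ranking = sorted(tf, key=lambda d: dot(q, vecs[d]), reverse=True)
print(ranking)
```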

Exercise 9: Term weighting


Compute the vector representations of the documents introduced in the previous exercise using the ltn weighting scheme. By ltn, we mean the following measure (cf. lecture 6):

l (logarithmic term frequency): 1 + log10(tf_{t,d})
t (idf): idf_t = log10(N / df_t)
n (normalisation): 1 (no normalisation)

Hence, we obtain:

v(d1) = ( (1 + log10(12)) · log10(445/123) , (1 + log10(15)) · log10(445/240) , (1 + log10(52)) · log10(445/85) )

idem for v (d2 ) and v (d3 ).

Exercise 10: Index architecture (extra credit)


Consider a hashtable as a structure mapping keys to values using a hash function h such that h(key) gives the slot where the value is stored. What problem may arise from such a structure when inserting new key-value pairs?

For large collections of data, it may be hard (if not impossible) to guarantee the injectivity of the hash function: two different keys may be hashed to the same slot. In other words, the mapping encoded in the hashtable is likely to have to deal with collisions, i.e. keys x ≠ y with h(x) = h(y).

What workaround would you propose for this insertion? Give an algorithm for inserting a key-value pair.

A workaround for the insertion of key-value pairs whose hash values are identical consists of using a primary mapping and a secondary mapping. The latter contains the colliding pairs (i.e. the pairs with identical hash values), which are chained to the main pair in the primary index.

In this context, the insertion algorithm checks the slot for the pair to be inserted in the primary hashtable. If it is unset, the pair is stored there; otherwise the pair is stored at the end of the linked list of pairs in the secondary hashtable.

proc insert(key k, value v, hashtable H, hashfunction h)
    int i = h(k)
    if (H[i].isUnset()) then
        H[i].key = k
        H[i].value = v
        H[i].next = -1
    else
        int j = H[i].next
        int m = H.nextFree()
        int n = i
        while (j != -1)          // traverse the linked list to its end
            n = j
            j = H[j].next
        endwhile
        H[m].key = k             // store the colliding pair
        H[m].value = v           // in the first free slot
        H[m].next = -1
        H[n].next = m            // link the previous end of the list to the new pair
    endif
endproc
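A runnable Python version of the same chaining idea (a sketch, not the exam's exact structure: Python lists per slot stand in for the secondary hashtable's linked lists):

```python
class ChainedHashTable:
    """Separate chaining: colliding pairs go into a per-slot chain."""

    def __init__(self, size=8):
        self.slots = [[] for _ in range(size)]

    def _hash(self, key):
        return hash(key) % len(self.slots)

    def insert(self, key, value):
        bucket = self.slots[self._hash(key)]
        for i, (k, _) in enumerate(bucket):
            if k == key:                   # key already present: overwrite
                bucket[i] = (key, value)
                return
        bucket.append((key, value))        # empty slot or collision: extend chain

    def lookup(self, key):
        for k, v in self.slots[self._hash(key)]:
            if k == key:
                return v
        return None

table = ChainedHashTable(size=2)           # tiny table to force collisions
table.insert("actor", 1)
table.insert("movie", 2)
table.insert("trailer", 3)
print(table.lookup("movie"))               # 2
```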
