It refers to the process of finding and retrieving information on the Internet.

Major Differences between IR Systems and WWW Search Engines
1. WWW documents are distributed around the Internet, while documents in an IR system are centrally located.
2. The number of WWW documents is much greater than the number of documents in an IR system.
3. WWW documents are more dynamic and heterogeneous than documents in an IR system.
4. WWW documents are structured with HTML, while the documents in an IR system are normally plain text.
5. WWW search engines are used by more users, and more frequently, than IR systems.

Implications of WWW Documents Being Dynamic and Heterogeneous
1. A huge vocabulary must be used to cope with the large number of documents.
2. It is difficult to make use of domain knowledge to improve retrieval effectiveness, as documents come from many different domains.
3. Document frequency cannot be obtained for calculating term weights, as the Web database is built progressively and is never complete.
4. The vector space model is not suitable because document size varies and this model favors short documents.
5. The index must be updated constantly, as the documents change constantly.
6. The search engine must be robust to cope with the unpredictable nature of documents and Web servers.

General Structure of WWW Search Engines
All search engines have three major elements:
1. The first is the spider, also called the crawler or robot. The spider visits a Web page, reads it, and then follows links to other pages within the site. The spider may return to the site on a regular basis, such as every month, to look for changes.
2. The second part of a search engine is the index. The index, sometimes called the catalog, is like a giant book containing a copy of every Web page that the spider finds. If a Web page changes, this book is updated with the new information. It can take a while for new pages or changes that the spider finds to be added to the index. Thus, a Web page may have been "spidered" but not yet "indexed." Until it is indexed (added to the index), it is not available to those searching with the search engine.
3. The third part of a search engine is the search engine software. This is the program that sifts through the millions of pages recorded in the index to find matches to a search and rank them in order of what it estimates to be most relevant.

Different search engines use different similarity measures and ranking functions. However, they all use term frequency and term location in one way or another.
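
To make the index and the search software more concrete, here is a minimal sketch in Python. It is an illustration only, not how any real search engine is built. The class name TinyIndex, the page URLs, and the title_bonus weight are all made up for the example. The score combines term frequency (how often a query term occurs in the page body) with term location (a bonus when the term also appears in the title). The spider is left out: pages are passed in by hand as if they had already been fetched.

```python
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


class TinyIndex:
    """Toy inverted index: term -> {url: (count_in_body, appears_in_title)}."""

    def __init__(self):
        self.postings = defaultdict(dict)

    def remove_page(self, url):
        """Drop every posting for a page (used before re-indexing it)."""
        for pages in self.postings.values():
            pages.pop(url, None)

    def add_page(self, url, title, body):
        """Index, or re-index, one page that the spider has fetched."""
        self.remove_page(url)
        title_terms = set(tokenize(title))
        counts = Counter(tokenize(body))
        for term in title_terms | set(counts):
            self.postings[term][url] = (counts[term], term in title_terms)

    def search(self, query, title_bonus=2.0):
        """Rank pages by term frequency plus a bonus for title matches.

        title_bonus is an arbitrary weight chosen for this sketch.
        """
        scores = defaultdict(float)
        for term in tokenize(query):
            for url, (count, in_title) in self.postings.get(term, {}).items():
                scores[url] += count + (title_bonus if in_title else 0.0)
        # Highest score first; ties are broken by URL for a stable order.
        return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


index = TinyIndex()
index.add_page("a.html", "Search engines",
               "A search engine uses a spider and an index.")
index.add_page("b.html", "Cooking",
               "Search for a recipe, then search the index of the book.")
print(index.search("search index"))
# prints [('a.html', 4.0), ('b.html', 3.0)]
```

Page b.html contains "search" more often, but a.html ranks first because the term also appears in its title. A page that has not been passed to add_page cannot be found at all, which matches the "spidered but not yet indexed" situation described above.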

Basic Concepts of ASR
An automatic speech recognition (ASR) system operates in two stages: training and pattern matching.
Training stage: features of each speech unit are extracted and stored in the system.
Recognition stage (pattern matching): features of an input speech unit are extracted and compared with each of the stored features, and the speech unit with the best-matching features is taken as the recognized unit.
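
The two stages can be sketched in a few lines of Python. This is a simplified illustration under strong assumptions: feature extraction is skipped, each speech unit is represented by a short, made-up feature vector, training stores the average vector per unit, and "best matching" is taken to mean the smallest Euclidean distance. The function names train and recognize are hypothetical, and real recognizers use much richer features and matching methods.

```python
import math


def train(examples):
    """Training stage: store one averaged feature vector per speech unit.

    examples maps a unit label to a non-empty list of equal-length
    feature vectors (assumed to be already extracted).
    """
    templates = {}
    for unit, vectors in examples.items():
        n = len(vectors)
        templates[unit] = [sum(column) / n for column in zip(*vectors)]
    return templates


def recognize(templates, features):
    """Recognition stage: return the unit whose stored features are closest."""
    return min(templates, key=lambda unit: math.dist(templates[unit], features))


# Made-up two-dimensional feature vectors for two speech units.
examples = {
    "yes": [[1.0, 0.2], [0.8, 0.4]],
    "no": [[0.1, 0.9], [0.3, 1.1]],
}
templates = train(examples)
print(recognize(templates, [0.85, 0.35]))  # prints yes
print(recognize(templates, [0.25, 0.80]))  # prints no
```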
Describe the four common approaches to image retrieval. What are their strengths and weaknesses?
1. Image contents are modeled as a set of attributes that are extracted manually and managed within the framework of conventional database management systems.
2. An integrated feature-extraction/object-recognition subsystem is used. This subsystem automates feature extraction and object recognition. However, automated approaches to object recognition are computationally expensive, difficult, and tend to be domain specific.
3. Free text is used to describe (annotate) images, and IR techniques are employed to carry out image retrieval.
4. Low-level image features such as color and texture are used to index and retrieve images. The advantage of this approach is that the indexing and retrieval process is carried out automatically and is easily implemented. It has been shown that this approach produces quite good retrieval performance.
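
As a rough illustration of the approach based on low-level features, the sketch below indexes images by a coarse color histogram and ranks them by histogram distance to a query image. It covers color only, not texture. The images are tiny made-up pixel lists, and the bin count, the L1 distance, and the function names are assumptions chosen for the example rather than a method taken from these notes.

```python
def color_histogram(pixels, levels=2):
    """Normalized histogram over levels**3 coarse RGB bins.

    pixels is a non-empty list of (r, g, b) tuples with values in 0..255.
    """
    hist = [0.0] * (levels ** 3)
    for r, g, b in pixels:
        # Map each 0..255 channel value to a coarse level 0..levels-1.
        ri, gi, bi = (min(c * levels // 256, levels - 1) for c in (r, g, b))
        hist[(ri * levels + gi) * levels + bi] += 1.0
    total = len(pixels)
    return [count / total for count in hist]


def l1_distance(h1, h2):
    """Sum of absolute bin differences; 0 means identical histograms."""
    return sum(abs(a - b) for a, b in zip(h1, h2))


def rank_images(query_pixels, database, levels=2):
    """Return image names ordered from most to least similar in color."""
    query_hist = color_histogram(query_pixels, levels)
    scored = [
        (l1_distance(query_hist, color_histogram(pixels, levels)), name)
        for name, pixels in database.items()
    ]
    return [name for _, name in sorted(scored)]


# Made-up four-pixel "images".
database = {
    "sunset": [(250, 80, 20)] * 3 + [(20, 20, 60)],
    "forest": [(20, 200, 30)] * 4,
    "sea": [(10, 40, 220)] * 4,
}
query = [(240, 60, 30)] * 4  # a mostly red query image
print(rank_images(query, database))
# prints ['sunset', 'forest', 'sea']
```

No manual annotation is needed at any point, which is the advantage the fourth approach claims: both indexing (computing the histogram) and retrieval (comparing histograms) run automatically.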