
Statement by the Candidate

We wish to state that the work embodied in this Special Topic titled “SEARCH ENGINE” forms our
own contribution to the work carried out under the guidance of Dr. B. B. Meshram at the ‘Veermata
Jijabai Technological Institute’, Matunga, Mumbai-19. This work has not been submitted for any
other Degree or Diploma of any University / Institute. Wherever references have been made to the
previous works of others, they have been clearly indicated.

Vivekanand Holkar (061080019)

Ajay Raut (061080039)

Santosh Rode (061080040)

Ashish Shelke (061080047)

Vishal Thakre (061080055)

Signature of Candidate

CERTIFICATE

This is to certify that Vivekanand Holkar (061080019)

Ajay Raut (061080039)

Santosh Rode (061080040)

Ashish Shelke (061080047)

Vishal Thakre (061080055)

students of B.Tech (Information Technology), VJTI Mumbai, have successfully completed the
Special Seminar on “SEARCH ENGINE” under the guidance of Dr. B. B. Meshram.

(Dr. B. B. Meshram)

Head of Department,

Department of Computer Technology,

VJTI, Mumbai.

Contents

1.PROBLEM STATEMENT....................................................................................................6
1.1) Administrator.................................................................................................................6
1.2) User.............................................................................................................................6
2.Video Segmentation............................................................................................................7
2.1 Shot clustering for browsing:.............................................................................................10
2.2 Scene Change Detection Algorithm:....................................................................................11
2.3 Histogram-Based Systems.................................................................................................13
2.4 Color Layout-Based Systems.............................................................................................14
2.5 Region-Based Systems.....................................................................................................14
2.6 Similarity Measures for Content.........................................................................................15
2.6.1 Histogram color measure:...............................................................................................15
2.6.2 Shape measures:..........................................................................................................15
2.6.3 Sketch measure:...........................................................................................................15
3 Video Indexing.................................................................................................................16
4. Extraction of Texture........................................................................................................19
4.1 Extraction and Description of Perceptual Texture Features:........................................................20
4.2. Coarseness...................................................................................................................20
4.3. Contrast......................................................................................................................20
4.4 Directionality................................................................................................................20
4.5 Line-likeness................................................................................................................20
4.5.1 Gray co-occurrence matrix..............................................................21
4.6. Regularity....................................................................................................................22
4.7. Roughness...................................................................................................................22
4.8 Texture retrieval descriptor of MPEG-7 (TRD)......................................................................24
4.8.1. Feature extraction:.......................................................................................................24
4.8.2. Deriving the layout for browsing......................................................................................25
4.9. Texture Browsing Descriptor of MPEG-7 (TBD)....................................................................25
5.Audio Segmentation:.........................................................................................................26
5.1 Basics of transforms........................................................................................................27
5.2 Outline of indexing scheme:..............................................................................................27

5.3 Blocking and segmentation:..............................................................28
5.4 Advantages of transform-based indexing...............................................28
5.5 Search algorithm.............................................................................29
5.5.1 Simple search algorithm.................................................................29
5.5.2 Algorithm is given as follows:..........................................................29
6. Shape descriptor..............................................................................................................30
6.1 Shape features...............................................................................................................30
6.2 Pattern Recognition.........................................................................................................31
6.3 Curvature Scale Space Matching.........................................................31
6.4 Shape Context descriptor.................................................................................................32
6.5 shape context in matching,...............................................................................................32
7.Color extraction................................................................................................................33
7.1 Color Quantization..........................................................................................................34
7.2 COLOR HISTOGRAM....................................................................................................34
A. Definition......................................................................................................................35
7.3 Similarity Calculation :....................................................................................................35
8. INTRODUCTION TO SEARCH ENGINE.............................................................................36
8.1 TYPES OF SEARCH ENGINES........................................................................................36
8.2 WORKING OF SEARCH ENGINE....................................................................................38
9. WEB CRAWLING.............................................................................................................39
9.1 Crawling Applications......................................................................40
9.2 Basic Crawler Structure....................................................................................................41
9.3 Requirements for a Crawler...............................................................................................42
9.4 Crawling policies............................................................................................................43
9.4.1 Selection policy...........................................................................................................44
9.4.2 Re-visit policy.............................................................................................................45
9.4.3 Politeness policy..........................................................................................................46
9.4.4 Parallelization policy.....................................................................................................47
10. INDEXING..................................................................................................................49
10.1 RANKING..................................................................................................................49
10.2 STORAGE COSTS AND CRAWLING TIME.....................................................................50
10.3 Features of different Search Engines :.................................................................................51
10.3.1 AltaVista..................................................................................................................51
10.3.2 Lycos......................................................................................................................52
10.3.3 Webcrawler...............................................................................................................53

10.3.4 HotBot.....................................................................................................................54
10.3.5 Yahoo......................................................................................................................55
11. GOOGLE SEARCH ENGINE...........................................................................................56
11.2 GOOGLE ARCHITECTURE...........................................................................................56
11.3 GOOGLE CRAWLER...................................................................................................57
11.4 GOOGLE DATA STRUCTURES.....................................................................................57
11.5 GOOGLE INDEXING...................................................................................................60
11.6 GOOGLE RANKING SYSTEM.......................................................................................61
11.7 GOOGLE QUERY EVALUATION...................................................................................62
12.Reference:.....................................................................................................................63
12.1 Video Segmentation.......................................................................................................63
12.2 Texture:......................................................................................................................63
12.3 Audio:........................................................................................64
12.4 Colour:.......................................................................................64
12.5 Search Engine, Web Crawler, Indexing, Google Search Engine...............................65

1.PROBLEM STATEMENT
Our primary goal is to develop a Video Storage and Retrieval System that stores and
manages a large amount of video data and allows users to retrieve information from the database
efficiently.

The objective is to develop an interactive website that takes keywords from users and
retrieves the information from the database. The database consists of all kinds of video data: still
images, audio and video. Retrieval should be based on the description as well as the content of the
video object.

Our project provides different functionality for its two main clients:

1.1) Administrator
✔ The administrator is responsible for controlling the entire database.
✔ Administrator authentication ensures that the database is safe from outside users.
✔ The administrator can add new video data to the database, update existing data, or
delete unwanted data from the database.

1.2) User
✔ The user cannot view the database.
✔ The user is only provided with search options for images, audio and video.
✔ Retrieval is based on descriptions of the video data such as color, shape, etc.
✔ The user can also search for videos based on the technical metadata of the object; for
example, the user can enter a query image or audio clip to find all videos in the ‘.mpg’
file format.

2.Video Segmentation

• Video segmentation aims to segment moving objects in video sequences. Roughly speaking,
video segmentation initially segments the first image frame, the initial frame (IF), into
a set of moving objects and then tracks the evolution of those objects in the subsequent
image frames. Once objects have been segmented in each image frame, they can serve
many applications, such as surveillance, object manipulation, scene composition, and video
retrieval.

• Video is created by taking a set of shots and composing them together using specified
composition operators.
• Extracting structural primitives is the task of video segmentation, which involves detecting
the temporal boundaries between scenes and between shots. A robust video segmentation
algorithm should be able to detect all kinds of shot and scene boundaries with good accuracy.

The problem of video segmentation has been approached from several different sides. These can be
broadly divided into two categories:
1. ‘Low-level’ approaches that use information at the pixel level, such as optical flow, gradient,
color and texture. Areas of similar measure are then grouped to form regions.

2. ‘High-level’ approaches where some knowledge of the structure of real-world scenes is embedded
in an attempt to produce a more ‘natural’ solution to segmentation. These include model-based
approaches and segmentation based on three-dimensional structure.

Low-level techniques suffer from a ‘myopia of content’, where the segmented shapes may not
actually correspond to semantically meaningful objects. These forms of segmentation are generally
used for increasing coding efficiency. Grouping together regions of similar motion works very
effectively for objects moving continuously over static backgrounds. However, in natural
video scenes, accurately estimating motion parameters is difficult.

Higher-level approaches include using structure-from-motion (SFM) algorithms to estimate the 3D
scene structure, which is then segmented, leading to a corresponding 2D segmentation. If done
accurately, this yields a far more meaningful segmentation of the sequence. SFM is notoriously
difficult to compute, however, and is subject to much ambiguity. In order to get
a reliable segmentation, many unnatural restrictions have to be imposed on the structure, such as
rigidity, connectivity and the number of independent motions. 3D estimation is also a very
computationally intensive process.

Model-based coding generally imposes some form of constraint on the shape and nature of the object
to be segmented. A common example is segmentation of the face, where a detailed model of
the shape and location of the facial features is usually built up. These algorithms are not easily
generalized to deal with other forms of segmentation.

We propose the following structured representation scheme.


• Sequence
  Sequence-ID: x (a unique index key)
  Scenes: {Scene(1), Scene(2), ..., Scene(L)}
• Scene
  Scene-ID: yl (a unique index key of scene l)
  Shots: {Shot(1), Shot(2), ..., Shot(M)}
  Key-frames: {KF(1), KF(2), ..., KF(I)}
• Shot
  Shot-ID: zm (a unique index key of shot m)
  Primitives: {f(1), f(2), ..., f(N)}
  Key-frames: {KFm(1), KFm(2), ..., KFm(J)}

This representation scheme maintains the hierarchical structure of video: sequences, scenes and
shots, each of which is defined by a set of primitives. Therefore, this structure meets the need for random and
nonlinear browsing: one can easily move from one level of detail to the next and skip from one
segment of video to another. Note that the bolded items are primitives based on visual abstractions of
a sequence, which are not usually contained explicitly in conventional video representation schemes.
That is, an abstracted representation is associated with each level of the structure hierarchy:
highlights are associated with the top level, and two levels of key-frames are associated with the scene and shot
level, respectively. These primitives are included to make the representation especially useful in
content-based video browsing.

Key-frames are still images extracted from original video data that best represent the content of shots
in an abstract manner. Key-frames have been frequently used to supplement the text of a video log,
though they were selected manually in the past. Key-frames, if extracted properly, are a very
effective visual abstract of video contents and are very useful for fast video browsing. A video
summary, such as a movie preview, is a set of selected segments from a long video program that
highlight the video content, and it is best suited for sequential browsing of long video programs. Just
like keywords and abstracts in text document browsing and indexing, these two forms of video
abstraction represent the landscape or structure of video sequences and entries to shots, scenes or
stories that make fast and content-based video browsing possible. Apart from browsing, key-frames
can also be used in video retrieval: a video index may be constructed based on the visual
features of key-frames, and queries may be directed at key-frames using query-by-image retrieval
algorithms.

The first step for video-content analysis and content-based video browsing and retrieval is the
partitioning of a video sequence into shots. A shot is defined as an image sequence that presents
continuous action captured from a single operation of a single camera. Shots are joined
together in the editing stage of video production to form the complete sequence. Shots can be
effectively considered as the smallest indexing unit where no changes in scene content can be
perceived and higher level concepts are often constructed by combining and analyzing the inter and
intra shot relationships. There are two different types of transitions that can occur between shots:
abrupt (discontinuous) shot transitions, also referred to as cuts; or gradual (continuous) shot transitions,
which include video editing special effects (fade-in, fade-out, dissolving, wiping, etc.).

KEY FRAME EXTRACTION

Key frames provide a suitable abstraction and framework for video indexing, browsing and retrieval.
One of the most common ways of representing video segments is by representing each video
segment, such as a shot, by a sequence of key frame(s), hoping that a “meaningful” frame can capture
the main contents of the shot. This method is particularly helpful for browsing video contents
because users are provided with visual information about each video segment indexed. During query
or search, an image can be compared with the key frames using similarity distance measurement.
Thus, the selection of key frames is very important and there are many ways to automate the process.

There exist different approaches for key frame extraction: shot boundary based, visual content based,
shot activity based and clustering based. Clustering is a powerful technique used in various
disciplines, such as pattern recognition, speech analysis, and information retrieval. Yang and Lin
introduce a clustering approach based on a statistical model. This method is based on the similarity
of the current frame to its neighbors: a frame is important if many temporally
consecutive frames are spatially similar to it. The principal advantage of this method is
that the clustering threshold is set by a statistical model.
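As an illustration only (the feature vectors, the cosine similarity and the fixed threshold below are our own placeholders, not the statistical model of Yang and Lin), the neighbourhood-similarity idea can be sketched in Python as:

    import numpy as np

    def count_similar_neighbours(features, sim_threshold=0.8):
        # For each frame, count how many immediately following frames stay
        # similar to it; long runs mark frames that summarise their neighbourhood.
        def cosine(a, b):
            return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))
        counts = []
        for i, f in enumerate(features):
            run = 0
            for g in features[i + 1:]:
                if cosine(f, g) >= sim_threshold:
                    run += 1
                else:
                    break
            counts.append(run)
        return counts

    def pick_key_frames(features, sim_threshold=0.8, top_k=3):
        counts = count_similar_neighbours(features, sim_threshold)
        # frames supported by the longest runs of similar neighbours become key frames
        return sorted(int(i) for i in np.argsort(counts)[-top_k:])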

2.1 Shot clustering for browsing:


Clustering video segments, such as shots, into groups or classes, each of which contains similar
content, is essential to building an index of shots for content-based retrieval and browsing. The first
step in clustering is to define content similarity between video sequences. Video content similarity
can be defined based on key-frame based features, shot-based temporal and motion features, object-
based features, or a combination of the three. With key-frame-based similarity, if two shots are
denoted as Si and Sj, their key-frame sets as Ki and Kj , then the similarity between the two shots
can be defined as

Sim(Si, Sj) = max{ Sk(p, q) : p ∈ Ki, q ∈ Kj }

where Sk is a similarity metric between two images defined by any one or a combination of the
image features.

This definition assumes that the similarity between two shots can be determined by the pair of key-frames
which are most similar, and it will guarantee that if there is a pair of similar key-frames in
two shots, then the shots are considered similar. Another measure of the similarity between two
shots is the average of the most similar pairs of key-frames. The key-frame-based similarity measure
can be further combined with the shot-based similarity measure to make the comparison more
meaningful for video.

For the purpose of clustering a large number of shots to build an index with different levels of
abstraction, partitioning-clustering methods are most suitable. This is because they are capable of
finding an optimal clustering at each level and are better suited to obtaining a good abstraction of the
data items. Such an approach has been proposed for video shot grouping. This approach is flexible in that
different feature sets, similarity metrics and iterative clustering algorithms can be applied at different
levels.
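A minimal Python sketch of the key-frame-based shot similarity above, with a greedy stand-in for the clustering step (the histogram-intersection metric for Sk and the grouping threshold are assumptions, not part of the cited approach):

    import numpy as np

    def image_similarity(hist_p, hist_q):
        # stand-in for Sk: histogram intersection of two normalised histograms, in [0, 1]
        return float(np.minimum(hist_p, hist_q).sum())

    def shot_similarity(key_frames_i, key_frames_j):
        # Sim(Si, Sj) = max over all key-frame pairs (p, q) of Sk(p, q)
        return max(image_similarity(p, q) for p in key_frames_i for q in key_frames_j)

    def group_shots(shots, threshold=0.7):
        # Greedy grouping used here only as a stand-in for an iterative
        # partitioning-clustering algorithm: a shot joins the first group whose
        # representative (first) shot it is sufficiently similar to.
        groups = []
        for key_frames in shots:
            for group in groups:
                if shot_similarity(key_frames, group[0]) >= threshold:
                    group.append(key_frames)
                    break
            else:
                groups.append([key_frames])
        return groups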

2.2 Scene Change Detection Algorithm:


First, the video input is split into an audio stream and a video stream. Then, silence segments are
detected on the basis of the signal energy. The non-silence segments are further classified into
speech, music, and environmental sound by evaluating the ZCR (zero crossing rate), the SCR (silence
crossing rate) and frequency tracking of the audio samples. Speech data are segmented into elements
corresponding to different speakers using the Bayesian Information Criterion. At the same time, shot
detection is performed on the video stream. Then, color correlation analysis among shots is performed
and a so-called expanding window grouping algorithm is applied, so that shots whose objects or
background are closely correlated, for instance shots occurring in the same environment, are grouped.

At the next step, we combine these audio segments and visual segments together for a more robust
video segmentation scheme. H. Sundaram and S.F. Chang determine constraints on a scene model by
using filmmaking rules and experimental results in the psychology of audition. We use these
constraints along with ten hours of video data to come up with three broad categories of computable
scenes:

• AV-Scenes:
AV-Scenes are characterized by a long-term consistency of chromatic composition, lighting
conditions and sound.

• A-Scenes:
A-Scenes, e.g. montage/MTV scenes, are characterized by widely different visuals (differences
in location, time or lighting conditions) which create a unity of theme through the manner
in which they have been juxtaposed. A-Scenes contain several consecutive visual segments
within a single audio segment.

• Dialogs:
Dialogs, which usually occur in the same environment, have several audio segments within a
visual segment. Since each audio segment can contain only one person’s speech and a dialog
involves more than one speaker, a dialog often has a few audio segments.

We apply unsupervised clustering, such as standard k-means, to all frames of a sequence, with
each frame represented by a feature vector. We classify all frames of a sequence into between 1 and
N clusters, where N is the maximum allowed number of partitions within a sequence.
• Steps are as follows:
1. Specify the frame proximity parameter T.
2. Compute a colour histogram metric for each pair of consecutive video frames. The histogram difference or the chi-square statistic can be used:

D(i, i+1) = Σb |Hi(b) - Hi+1(b)|

or

χ2(i, i+1) = Σb (Hi(b) - Hi+1(b))^2 / Hi+1(b)

where Hi(b) is the count of bin b in the colour histogram of frame i.
3. Cluster the frame dissimilarities into 2 classes using the k-means algorithm:
3a. Arbitrarily select initial cluster means µ1, µ2 (k = 2).
3b. Assign each sample to the class of the nearest mean.
3c. Update each cluster mean as the sample mean of all samples s(x) in that cluster.
3d. If the cluster means no longer change, terminate the procedure and go to Step 4; otherwise repeat from Step 3b.
4. Label all frames in the cluster with the largest mean as scene changes.
5. Label the frames between two successive scene changes if the change points are closer than T frames.
6. Mark the shot boundaries.
• The main advantages of the algorithm are its simplicity and speed, which allow it to run on large
video sequences.
• This algorithm is followed by the selection of key frames to represent the visual content of each
cluster unit.
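A compact Python sketch of the steps above, assuming grey-level histograms, the chi-square metric and a plain two-means loop (a real system would also apply the proximity parameter T and smooth the metric):

    import numpy as np

    def frame_histogram(frame, bins=64):
        # grey-level histogram, normalised so frames of different sizes compare fairly
        h, _ = np.histogram(frame, bins=bins, range=(0, 256))
        return h / max(h.sum(), 1)

    def chi_square(h1, h2):
        # step 2: dissimilarity between the histograms of consecutive frames
        return float(((h1 - h2) ** 2 / (h1 + h2 + 1e-9)).sum())

    def detect_scene_changes(frames, iters=50):
        d = np.array([chi_square(frame_histogram(a), frame_histogram(b))
                      for a, b in zip(frames, frames[1:])])
        mu = np.array([d.min(), d.max()])                                   # 3a: initial means
        for _ in range(iters):
            labels = np.abs(d[:, None] - mu[None, :]).argmin(axis=1)        # 3b: nearest mean
            new_mu = np.array([d[labels == k].mean() if (labels == k).any() else mu[k]
                               for k in range(2)])                          # 3c: update means
            if np.allclose(new_mu, mu):                                     # 3d: converged
                break
            mu = new_mu
        change_class = int(mu.argmax())                                     # step 4
        return [i + 1 for i, lab in enumerate(labels) if lab == change_class]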

2.3 Histogram-Based Systems

A histogram of an image conveys the relative quantities of colors in an image. Each pixel in the
image is represented as a weighted amount of three colors: red, green, and blue. Each of these three
colors is quantized into m divisions, yielding a total of m^3 composite colors, or histogram bins.
The total number of pixels that fall into each bin is counted. The signature of the image is a
vector containing these counts and may then be used to build an index. When a query image is
presented, its signature is extracted and compared with entries in the index. The histogram-based
systems suffer from several limitations.

First, these systems disregard the shape, texture, and object location in an image, leading to a high
rate of return of semantically unrelated images. For instance, the histograms of the two images in
Figure 1 (two images with opposite colors but identical average histogram intensities) are identical,
yet the images are obviously different. The color quantization also introduces an additional source of
error. Histogram-based systems may improve efficiency by using fewer bins than the number of
colors, so that many colors may get quantized into a single bin. The first problem is that similar
colors near the division line may get quantized into different bins, and for a given bin the extreme colors may be quite
line may get quantized into different bins and for a given bin, the extreme colors may be quite
different. Second, given a set of three color channels, perceptual sensitivity to variations within
colors is not equal for all three channels. However, histogram quantization may incorrectly use a
uniform divisor for all three color channels. Finally, a query image may contain colors that are
similar to the colors of a particular image in the index, but may result in a large distance if they are
not close enough to fall into the same bins.
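A minimal histogram-signature extractor in this style, assuming m = 4 divisions per channel and a simple L1 comparison (both choices are illustrative, not prescribed by any particular system):

    import numpy as np

    def rgb_histogram_signature(image, m=4):
        # image: H x W x 3 uint8 array. Quantise each channel into m divisions,
        # giving m**3 composite-colour bins, and count how many pixels fall in each.
        q = (image.astype(np.int32) * m) // 256                  # per-channel bin in [0, m)
        bin_index = q[..., 0] * m * m + q[..., 1] * m + q[..., 2]
        counts = np.bincount(bin_index.ravel(), minlength=m ** 3)
        return counts / counts.sum()                             # normalised signature

    def histogram_distance(sig_a, sig_b):
        # L1 distance between signatures; smaller means more similar
        return float(np.abs(sig_a - sig_b).sum())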

2.4 Color Layout-Based Systems

Color layout-based systems extract signatures from images that are similar to low resolution copies
of the images. An image is divided into a number of small blocks and the average color of each block
is stored. Some systems, such as Walrus, utilize significant wavelet coefficients instead of average
values in order to capture sharp color variations within a block.

The traditional color layout systems are limited because of their intolerance of object translation and
scaling. The location of an object in an image is frequently helpful in identifying semantically similar
images. For instance, the histogram of a query image containing green grass and blue sky may have a
close distance to a green house with a blue roof, while a color layout scheme may discard this image.
The Walrus system uses a wavelet-based color layout method to overcome the translation and scaling
limitation. For each image in the index, a variable number of signatures (typically thousands) are
computed and clustered. Each signature is computed on a square area of the image, with the size and
location of each square varying according to prescribed parameters. The squares with similar
signatures are clustered in order to speed up the search during a query. While this system is tolerant of
image translation and scaling, the computational complexity is dramatically increased.
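A sketch of the basic (non-Walrus) colour layout signature, assuming a fixed 8 x 8 grid of blocks and the mean colour of each block:

    import numpy as np

    def color_layout_signature(image, grid=(8, 8)):
        # Divide the image into grid[0] x grid[1] blocks and keep the mean colour of
        # each block, i.e. a low-resolution copy of the image used as the signature.
        h, w, _ = image.shape
        gy, gx = grid
        signature = np.zeros((gy, gx, 3))
        for by in range(gy):
            for bx in range(gx):
                block = image[by * h // gy:(by + 1) * h // gy,
                              bx * w // gx:(bx + 1) * w // gx]
                signature[by, bx] = block.reshape(-1, 3).mean(axis=0)
        return signature.ravel()      # compare signatures with e.g. Euclidean distance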

2.5 Region-Based Systems

Region-based systems use local properties of regions as opposed to the use of global properties of the
entire image. A fundamental stumbling block for these systems is that objects may span
multiple regions, each of which inadequately identifies the object. Examples of region-based systems
include QBIC, SaFe, Blobworld, and SIMPLIcity.

QBIC uses both local and global properties and incorporates both region-based and histogram
properties. It identifies objects in images using semiautomatic outlining tools. SaFe is a complex
system that automatically extracts regions and allows queries based on specified arrangement of
regions. It uses a color-set back-projection method to extract regions.

For each region, it stores characteristics such as color, shape, texture, and location. Then, it performs
a separate search for each region in the query image. Blobworld is a region-based system that defines
regions, or blobs, within an image using the Expectation-Maximization algorithm on vectors
containing color and texture information for each pixel. For each blob, it computes and stores the
anisotropy, orientation, contrast, and two dominant colors. SIMPLIcity is a region-based system that partitions images into predetermined
semantic classes prior to extracting the signature. It varies signature construction and distance
formulations according to the semantic class. It uses the k-means algorithm and Haar wavelet to
segment the image into regions.

2.6 Similarity Measures for Content

The queries are based on similarity and not on exact match. Results are shown as a list of matches
ordered by their measure of similarity. To do the matching, we use similarity functions (or inversely,
distance functions) for each feature or feature set. These similarity/distance functions are normalized

so that they can be meaningfully combined. Most of the similarity/distance functions are based on a
weighted Euclidean distance in the corresponding feature space (e.g., three-dimensional average
Munsell color, three-dimensional texture, or 20-dimensional shape). The weights are the inverse
variances of each component over the samples in the database. Other similarity measures are:

2.6.1 Histogram color measure:


For histogram matching, we use a quadratic-form distance measure defined as d(X, Y) = Z^T S Z, where
X is the query histogram, Y is the histogram of an item in the database (both normalized), Z = X - Y,
and S is a symmetric color similarity matrix in which s(i, j) indicates the similarity of colors i
and j in the histograms. This method accounts for both the perceptual distance between different
pairs of colors (e.g., orange and red are less different than orange and blue) and the difference in the
amounts of a given color (e.g., a particular shade of red).
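In code, with Z = X - Y, the quadratic-form distance looks as follows; the example similarity matrix S built from bin-centre distances is a placeholder, not the matrix used by QBIC:

    import numpy as np

    def quadratic_form_distance(x_hist, y_hist, similarity):
        # d = Z^T S Z with Z = X - Y; similarity[i, j] is the similarity of colours i and j
        z = x_hist - y_hist
        return float(z @ similarity @ z)

    def toy_similarity_matrix(bin_centres):
        # one possible S: similarity falls off linearly with the distance between bin centres
        d = np.abs(bin_centres[:, None] - bin_centres[None, :])
        return 1.0 - d / d.max()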

2.6.2 Shape measures:


Besides the weighted Euclidean distance mentioned earlier, the distance between two shapes
represented by parametric spline curves and their derivatives is computed from the spline control
points. This computation involves a combination of Euclidean distance and the evaluation of
quadratic forms with pre-computed terms. Another distance, between curves
represented by sets of turning angles, is computed using a dynamic programming algorithm.
Algorithm descriptions are given.

2.6.3 Sketch measure:


In the full scene sketch matching, a user roughly draws a set of dominant lines or edges in a drawing
area. Images from the database with a similar set of edges are retrieved. A flexible matching method
compares the user drawn edges to automatically extracted edges from the images in the database.

3 Video Indexing
Video indexing is a process of tagging videos and organizing them in an effective manner for fast
access and retrieval. Automation of indexing can significantly reduce processing cost while
eliminating tedious work.

Techniques can be generally classified into four categories based on different cues that algorithms
use in video indexing and retrieval, including visual features, auditory features, texts, and task
models.
• Visual Feature Based Approach
This method normally indexes videos based on their visual features such as shape, texture, and color
histograms. It involves extensive image processing. Some algorithms use key frames automatically
extracted from videos as indices. The basic idea is that every video clip has a representative frame
that provides a visual cue to its content. Those representative frames (key frames) are automatically
extracted from the original videos based on their image features. Visual-feature-based approaches
normally use image queries. The video retrieval relies on a set of similarity measures between the image
features of a query and those of key frames, which can be performed at three abstraction levels: the raw
data level, the feature level, and the semantic level. Some indexing algorithms use objects and their
attributes, as well as spatial and/or temporal relations among objects in a video, to label and index
video sequences. For example, Gunsel et al. introduced an approach to temporal video partitioning
and content-based video indexing, in which the basic indexing unit was “life-span of a video object,
rather than a camera shot or story unit.” They indexed motion and shape information of video object
planes tracked at each frame and provided an object-based access to video data.

The disadvantages of a visual-feature based approach are that users usually do not have an image
handy to formulate a query and that content-based image retrieval has not reached a semantic level
that is directly useful to users. When a video has few scene changes, as in a typical videotaped
lecture, a visual-feature-based approach will encounter serious problems in extracting key frames.
In addition, defining quantitative measures of key frame similarity still remains a challenging
research topic.

• Auditory Feature Based Indexing


Sound is an essential component of a video. The audio track provides a rich source of information to
supplement understanding of any video content. Audio information can also be used in video
indexing. Image/sound relationships are critical to the perception and understanding of video
content. In a number of studies, both the auditory and visual information of videos have been used to
extract high-level semantic information as indices. The parsing and indexing of audio-source and
video-source often lead to the extraction of a speaker label and of a talking-face mapping of the
source over time. Integration of these audio and visual mappings constrained by interaction rules
results in higher levels of video abstraction and even partial detection of its context. There are
several useful methods for classifying and indexing sounds, such as simile (one sound is similar to
another sound or a group of sounds in terms of certain characteristics), acoustical features (e.g.,
pitch, timbre, and loudness), and subjective features (e.g., describing sounds using personal
descriptive language). Cambridge University has developed retrieval methods for video mail based
on keyword spotting in the soundtrack by integrating speech recognition methods and information
retrieval technology. In addition, the audio track of a video is also often used to generate a text
transcript by means of a speech recognition system for text-based analysis and retrieval.

• Text Based Approach


In this approach, videos are indexed by keywords that are automatically extracted from texts related
to videos. The retrieval methods rely on keyword search in the free text obtained either from closed
captions or from transcriptions of the video sound track via speech recognition [26]. Other methods
use text identified from video images for video indexing. For example, Lienhart [27] proposed a
method to automatically recognize text appearing in videos by using OCR software for video
indexing. After a user specified a search string, video sequences were retrieved through either exact
substring match or approximate substring match. Text-based video indexing is straightforward and

easy to implement. It allows random access to specific points within a video when a particular
keyword appears. The major disadvantage of this approach is the loss of the context of search terms.
For example, let us consider that one of two videos in a video digital library describes the symptoms
of skin cancer and the other explains how to prevent skin cancer. Although both videos contain the
same term “skin cancer”, they have disparate contexts and address different questions. It will be
difficult to distinguish those two videos by using a keyword-spotting approach.

• Task Based Approach


In addition to visual, auditory, and textual cues, the semantic features of tasks can also be used to
create video indexes. In general, there have been two ways to apply this approach. One is to create a
structured content frame for each video clip as index. A frame is a data object that plays a role
similar to that of a record in a relational database. It often consists of a set of fields, usually called
slots, that provide various semantic information about a video clip. Burke and Kass developed a video
indexing scheme based on “Universal Indexing Frame” to retrieve video clips for presentation in a
case-based teaching environment. The index frame contained slots such as “Anomaly”, “Theme”,
“Goal”, and “Plan”, which explicitly indicated the points of interest or anomalies in a video story.

The other type of method is to build a task model for a collection of videos within a particular
domain. Researchers decompose a task into multiple subtasks or subgoals and generate a hierarchical
structure, called a task model, for video indexing and retrieval. For example, in order to train novice
transportation planners, Johnson et al. developed a Trans-ASK system that contained 21 hours of
video detailing the experience of United States Transportation Command personnel in planning for
operations such as Desert Shield and Desert Storm. The videos were segmented into a collection of
video clips in which experts told “war stories” of their actual experiences. In that system, a six-level
hierarchy of objectives and targets was manually created, ranging from national security objectives
to individual targets. Video clips were indexed according to a hierarchy of questions in the task
model. There are some limitations of current task-based approaches. First, a task model is domain-
dependent and inflexible. Second, task frames or models are mainly created by human experts, which can
be very time-consuming and inefficient.

Representative scenes are extracted from the videos based on the audio-visual features. Segments of
cuts, camera-work, and telop entries are extracted as visual events. Segments of speech and music
are extracted as audio events. These events and their time-codes are used to label the images.

WebClip is a complete working prototype for editing and browsing MPEG-1 and MPEG-2 compressed
video over the World Wide Web. It uses a general system architecture to store, retrieve, and edit
MPEG-1 or MPEG-2 compressed video over the network. It emphasizes a distributed network
support architecture. It also uses unique CVEPS (Compressed Video Editing, Parsing, and Search)
technologies. Major features of WebClip include compressed domain video editing, content-based
video retrieval, and multi-resolution access. The compressed domain approach has great synergy
with the network editing environment, in which compressed video sources are retrieved and edited to
produce new video content, which is also represented in compressed form.
WebClip includes a video content analyzer for automatic extraction of visual
features from compressed MPEG videos. Video features and stream data are stored in the server
database with efficient indexing structures. The editing engine and the search engine include
programs for rendering special effects and processing queries requested by users. On the client side,
the content-based video search tools allow for formulation of video query directly using video
features and objects. The hierarchical browser allows for rapid visualization of important video
content in video sequences. The shot-level editor includes tools and interfaces for performing fast,
initial video editing, while the frame-level editor provides efficient JAVA-based tools for inserting
basic editing functions and special effects at arbitrary frame locations. To achieve portability, current
implementations also include client interfaces written in JAVA, C, and Netscape plugins.

In order to accommodate needs of different types of users, WebClip includes editing facilities at
various levels. The idea is to preserve the highest level of interactivity and responsiveness in any
arbitrary editing platform. The shot-level editor is intended for platforms with low bandwidth and
computing power, such as light-weight computers or notebooks with Internet access. The frame-level
editor includes sophisticated special effects, such as dissolve, motion effects, and cropping. It is
intended for high end workstations with high communication bandwidth and computation power.
Before the editing process is started, users usually need to browse through or search for videos of
interest. WebClip provides a hierarchical video browser allowing for efficient content preview. A
top-down hierarchical clustering process is used to group related video segments into the clusters
according to their visual similarity, semantic relations, or temporal orders. For example, in the news
domain, icons of key frames of video shots belonging to the same story can be clustered together.
Then, users may quickly view the clusters at different levels of the hierarchical tree. Upper level
nodes in the tree represent a news story or group of stories, while the terminal nodes of the tree
correspond to individual video shots. The networked editing environment, which takes compressed
video input and produces compressed video output, also makes the compressed-domain approach
very desirable. The editing engine of WebClip uses compressed domain algorithms to create basic
video editing effects, such as cut/paste, dissolve, and various motion effects. Compressed-domain
algorithms do not require full decoding of the compressed video input and thus provide great
potential in achieving significant performance speedup. However, sophisticated editing functions,
such as morphing, may incur significant computational overheads in the compressed domain due to
the restricted formats used in existing video compression standards.

The conventional features used in most existing video retrieval systems are color, texture, shape,
motion, objects, faces, audio, genre, etc. Clearly, the more features used to represent the data, the
better the retrieval accuracy. However, since the feature vector dimension grows with the number of
features, there is a trade-off between retrieval accuracy and complexity. So it is essential to have a
minimal set of features that represents the videos compactly. We use low-level and high-level
features such as color, texture, faces present in the image, and speech/music details obtained from the
audio stream for indexing. We believe that these features offer a good balance of retrieval accuracy
and complexity while being informative enough for the identification task and robust to likely
distortions as well.

In our CBVR system, we consider the following features (a sketch combining them into one compact index entry follows the list):


1. Colour
2. Texture
3. Audio
4. Shapes etc.
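As a toy illustration of how such features could be packed into one compact index entry per key frame (the field sizes and the binary speech/music/face flags are our own assumptions, not a prescribed format):

    import numpy as np

    def index_entry(key_frame, texture_features, has_speech, has_music, has_face):
        # Concatenate a small colour histogram, a handful of texture features and
        # three binary audio/face flags into one compact vector per key frame.
        colour, _ = np.histogram(key_frame, bins=32, range=(0, 256))
        colour = colour / max(colour.sum(), 1)
        flags = np.array([has_speech, has_music, has_face], dtype=float)
        return np.concatenate([colour, np.asarray(texture_features, dtype=float), flags])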

4. Extraction of Texture

Texture can be defined as the visual patterns that have properties of homogeneity that do not result
from the presence of only a single color or intensity.

The methods for texture feature extraction and description can usually be categorized into structural,
statistical, spectral and transform methods. A relatively comprehensive overview of these methods is
given in Materka and Strzelecki (1998). Structural methods, such as mathematical morphology,
represent texture by well-defined primitives (micro texture) and a hierarchy of spatial arrangements
(macro texture) of those primitives. Mathematical morphology is used for texture feature extraction
and description of bone image to detect changes in bone microstructure (Chen and Dougherty, 1994).
Statistical methods, such as gray co-occurrence matrix, represent the texture indirectly by the non-
deterministic properties that govern the distributions and relationships between the grey levels of an
image.

During the critical period of visual development, the feature detectors in the human primary visual
cortex are shaped largely by the visual environment. In fact, exposure to an impoverished
environment during this period can lead to serious visual deficits, including the inability to perceive
some low-level stimuli. There is evidence that the feature detectors in the human visual system are
“tuned” to frequently-occurring visual stimuli by a process known as Hebbian learning. It therefore
seems plausible that feature detectors would be particularly sensitized to the most commonly
encountered textures in the visual environment. Perhaps these textures would then serve as basis
functions for the perception of textures later in life. In fact Leung uses a clustering approach with
commonly-encountered textures to derive a set of about 100 “3D textons”, which he suggests can be
used to characterize other textures.

By combining basis functions (such as Leung’s 3D textons) it might be possible to characterize many
complex textures in a manner similar to the human visual system. However, the human visual system
typically characterizes and recognizes textures at the subconscious level. Because of this we often
don’t have adequate words to describe the textures that we perceive and recognize.

Fortunately, some of the more important (and more salient) textures that we perceive become
associated with higher level content, and words associated with them can be found by studying a
lexicon. For example, English words that represent characteristic textures of particular objects
include basket, braid, brickwork, canvas, cauliflower, concrete, feather, foam, and frost. These
textures have potential for facilitating high level CBIR. There are also adjectives that are associated
with characteristics of scenes, such as alpine, aquatic, arctic, cloudy, cosmic, or dusty, and objects
can be characterized as amoeboid, bleeding, colorful, or crumpled. The fact that these particular
features have been given names seems to indicate that they are especially significant, and might be
more perceptually salient. Thus, they might provide a means for evaluating the similarity of images.

4.1 Extraction and Description of Perceptual Texture Features:


• Coarseness
• Contrast
• Directionality
• Regularity
• Line-likeness
• Roughness

• Some additional measures are: uniformity, skewness, entropy, etc.

Tamura et al (1978) proposed a texture feature extraction and description method based on
psychological studies of human perception. The method consists of six statistical features, namely
coarseness, contrast, directionality, line-likeness, regularity and roughness, to describe various texture
properties.

4.2. Coarseness is the most fundamental feature in texture analysis. It refers to texture granularity,
that is, the size and number of texture primitives. A coarse texture contains a small number of large
primitives, whereas a fine texture contains a large number of small primitives. Suppose f(x, y)
denotes an image of size n x n; coarseness (fcrs) can be computed as

fcrs = (1/n^2) Σx Σy 2^k(x, y)

where k(x, y) is obtained, at each pixel, as the value which maximizes the differences of the moving
averages (1/2^2k) Σ p(i, j), taken over a 2^k x 2^k neighborhood, along the horizontal and vertical directions.

4.3. Contrast refers to the difference in intensity among neighboring pixels. A texture with high contrast
has large differences in intensity among neighboring pixels, whereas a texture with low contrast has
small differences. Contrast (fcon) can be computed as follows:

fcon = s / (u4 / s^4)^(1/4)

where s is the standard deviation of the image and u4 is its fourth central moment (so u4/s^4 is the kurtosis).
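A direct transcription of the contrast formula, assuming a grey-scale image stored as a NumPy array:

    import numpy as np

    def tamura_contrast(image):
        # fcon = s / (u4 / s^4) ** 0.25, i.e. the standard deviation divided by the
        # fourth root of the kurtosis of the grey-level distribution
        x = image.astype(np.float64).ravel()
        variance = x.var()
        if variance == 0:
            return 0.0
        u4 = ((x - x.mean()) ** 4).mean()          # fourth central moment
        kurtosis = u4 / variance ** 2
        return float(np.sqrt(variance) / kurtosis ** 0.25)
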
4.4 Directionality refers to the shape of texture primitives and their placement rule. A directional
texture has one or more recognizable directions of primitives, whereas an isotropic texture has no
recognizable direction of primitives. Directionality (fdir) can be computed as follows:

fdir = 1 - r np Σp Σ(φ ∈ wp) (φ - φp)^2 HD(φ)

where HD is the local direction histogram, np is the number of peaks of HD, φp is the pth peak
position of HD, wp is the range of the pth peak between valleys, r is a normalizing factor, and φ is the
quantized direction code.

4.5 Line-likeness refers only to the shape of the texture primitives. A line-like texture has straight or
wave-like primitives whose direction may not be fixed. Often a line-like texture is simultaneously
directional. Line-likeness (flin) can be computed as follows:

flin = [Σi Σj PDd(i, j) cos((i - j) 2π/n)] / [Σi Σj PDd(i, j)]

where PDd(i, j) denotes the n x n local direction co-occurrence matrix of distance d in the image.

4.5.1Gray co-occurrence matrix

The gray co-occurrence matrix is one of the most elementary and important methods for texture feature
extraction and description. Its original idea was first proposed in Julesz (1975). Julesz found through
his famous experiments on human visual perception of texture, that for a large class of textures no
texture pair can be discriminated if they agree in their second-order statistics. Then Julesz used the

definition of the joint probability distributions of pairs of pixels for texture feature extraction and
description, and first used gray level spatial dependence co-occurrence statistics in texture
discrimination experiments. Weid et al. (1970) used one-dimensional co-occurrence for a medical
application. Haralick et al (1973) suggested two-dimensional spatial dependence of the gray levels in
a co-occurrence matrix for each fixed distance and/or angular spatial relationship, and used statistics
of this matrix as measures of texture in an image.
Given an image f(x, y) of size Lr × Lc with a set of Ng gray levels, if P(k, j, d, θ) denotes the
estimate of the joint probability of two pixels a distance d apart along a given direction θ having the
particular values k and j, it is called the gray co-occurrence matrix of the image with distance d along
direction θ and is denoted as follows:

P(k, j, d, θ) = N{ ((x1, y1), (x2, y2)) : f(x1, y1) = k, f(x2, y2) = j, (x2, y2) is at distance d from (x1, y1) along direction θ }      (1)

where the parameter d denotes the distance between pixels (x1, y1) and (x2, y2) in the medical
image, and the parameter θ denotes the direction aligning (x1, y1) and (x2, y2). If the conditional co-occurrence
probabilities are based on the undirected distances typically used in symmetric co-occurrence
probabilities, θ ∈ [0°, 180°). And N{.} denotes the number of elements in the set.
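A small NumPy version of formula (1) for a single (d, θ) pair, restricted to the four standard directions; normalising the counts to probabilities is optional:

    import numpy as np

    def gray_cooccurrence(image, d=1, theta=0, levels=256, normalise=True):
        # P(k, j, d, theta): count pixel pairs a distance d apart along direction
        # theta (0, 45, 90 or 135 degrees) whose grey levels are k and j.
        offsets = {0: (0, d), 45: (-d, d), 90: (-d, 0), 135: (-d, -d)}
        dy, dx = offsets[theta]
        img = image.astype(np.intp)
        h, w = img.shape
        P = np.zeros((levels, levels), dtype=np.float64)
        for y in range(max(0, -dy), min(h, h - dy)):
            for x in range(max(0, -dx), min(w, w - dx)):
                P[img[y, x], img[y + dy, x + dx]] += 1
        if normalise and P.sum() > 0:
            P /= P.sum()      # turn the counts N{.} into joint-probability estimates
        return P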

When the gray co-occurrence matrix is used for texture feature extraction and description, it is
susceptible to noise and entity rotation in an image. Furthermore, it is proposed in Zhang et al (2007)
that directions 0°, 45°, 90° and 135° aren’t usually dominant directions for images of specific parts in
a medical image database, and dominant directions of images of different parts are usually different.
So a directed filter for dominant directions is proposed in Zhang et al (2007).

Figure: Gray co-occurrence of direction θ and distance d

The gray co-occurrence matrix encodes the intensity relationship between pixels in different
positions. The texture feature extraction and description method is susceptible to noise and changes in
illumination. A texture feature extraction and description method using motif co-occurrence matrix is
proposed in Jhanwar et al (2004). The method doesn’t encode the intensity relationship between
pixels in different positions, but rather the relationship between texture primitives in different
positions. The method can reduce the effect of noise and change in illumination effectively.
Moreover, the actual gray levels in the image are irrelevant.

The method first defines six texture primitives over a 2 x 2 grid. Each texture primitive depicts
a distinct sequence of pixels starting from the top-left corner. If the texture primitives are rotated 90°,
180° and 270° respectively, and the pixel sequences are started from the top-right corner, the bottom
right corner and the bottom left corner respectively, twenty-four texture primitives are computed
including the original texture primitives. All the texture primitives form a texture primitive set.

A co-occurrence matrix is then computed for the motif-transformed image to extract and describe the texture
features. The parameters k and j of P(k, j, d, θ) in formula (1) denote two motifs. The parameter d
denotes the distance from motif k to motif j, and the parameter θ is used to constrain the
angle from motif k to motif j. P(k, j, d, θ) then denotes the estimate of the joint probability of
two motifs at a distance d along a given direction θ having the particular values k and j.

4.6. Regularity refers to variations of the texture-primitive placement. A regular texture is composed
of identical or similar primitives, which are regularly or almost regularly arranged. An irregular
texture is composed of various primitives, which are irregularly or randomly arranged. Regularity (
freg) can be computed as follows.

freg = 1 - r (Scrs + Scon + Sdir + Slin)


where r is a normalizing factor and Sxxx denotes the standard deviation of fxxx.
4.7. Roughness refers to tactile variations of a physical surface. A rough texture contains angular
primitives, whereas a smooth texture contains rounded, blurred primitives. Roughness (frgh) can be
computed as follows:

frgh = fcrs + fcon

1. Relative smoothness: R = 1 - 1 / (1 + S^2), where S^2 is the variance; R = 0 for constant intensity and R → 1 for large S^2.
2. Uniformity: U = Σi P^2(zi), where P(zi) is the probability of grey level zi.
3. Skewness: the third moment of the input image.

Coarseness, contrast, and directionality achieve successful correspondence with psychological
measurements, but line-likeness, regularity, and roughness require further improvement due to their
discrepancies with psychological measurements. Based on the research results of Tamura, Flickner et
al. used three of the Tamura features, namely coarseness, contrast, and directionality, to design the
QBIC system, in which a texture image is described by these three features. The image comparison
is achieved by evaluating the weighted Euclidean distance in the 3D feature
space, where each weight is the inverse variance of the feature.

Such a representation can effectively model the subjectivity of human perception when finding similar
textures, and it produces compact texture descriptions while preserving the perceptual attributes of
textures. Furthermore, MPEG proposed a texture feature extraction and representation method,
which consists of a perceptual browsing component (PBC) and a similarity retrieval component
(SRC), in the MPEG-7 standard (Wu et al, 1999). It is a perceptual texture feature extraction
and description method, which uses regularity, directionality and coarseness to describe various
texture information.

After a Gabor wavelet transform is applied to an image, the image is decomposed into a set of filtered
images. Each filtered image represents the image information at a certain scale and a certain direction.
After observing these images, researchers found that structured textures usually consist of dominant
periodic patterns; that a periodic or repetitive pattern, if it exists, is captured by the filtered
images, usually in more than one filtered output; and that the dominant scale
and direction information can also be captured by analyzing projections of the filtered images. Based
on these observations, the Texture Browsing Component was defined as follows:

PBC = [v1, v2, v3, v4, v5]

The computation of the browsing descriptor is described in detail, and the descriptor was used for
image classification, in Wu et al (2000). It is a compact descriptor that requires only 12 bits to
characterize the texture features of an image: regularity is described by 2 bits, directionality by 6 bits
and coarseness by 4 bits. Moreover, a texture may have more than one dominant direction and
associated scale, so the MPEG-7 specification allows a maximum of two different directions and
coarseness values.
The regularity of a texture is graded on a scale of 0 to 3, with 0 indicating an irregular or random
texture. A value of 3 indicates a periodic pattern with well-defined directionality and coarseness
values.

There is some flexibility (or implied ambiguity) in the two values in between. The regularity of a
texture is closely related to its directionality. In Manjunath et al (2001), the regularity of a texture
having a well-defined directionality but no perceivable micro-pattern (texture primitives) is compared
with that of a texture that lacks directionality and periodicity but whose individual micro-patterns are
clearly identified; the former is found to be more regular than the latter. The directionality of a
texture is quantized to six values, ranging from 0° to 150° in steps of 30°. Three bits are used to
represent the different directions: the value 0 signals textures that do not have any dominant
directionality, and the remaining directions are represented by values from 1 to 6. Associated with
each dominant direction is a coarseness component. Coarseness is related to image scale or
resolution and is quantized to four levels, with 0 indicating a fine-grain texture and 3
indicating a coarse texture.
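A compact code of this kind can be illustrated with simple bit packing. The Python sketch below only illustrates the bit budget described above (2 + 6 + 4 bits); the field order and exact value ranges are assumptions made for the example, not the MPEG-7 binary syntax.

    def pack_pbc(v1, v2, v3, v4, v5):
        """Pack a PBC = [v1..v5] vector into 12 bits.

        v1: regularity in 0..3 (2 bits)
        v2, v3: dominant directions in 0..6 (3 bits each)
        v4, v5: coarseness per dominant direction in 0..3 (2 bits each)
        """
        assert 0 <= v1 <= 3 and 0 <= v4 <= 3 and 0 <= v5 <= 3
        assert 0 <= v2 <= 6 and 0 <= v3 <= 6
        return (v1 << 10) | (v2 << 7) | (v3 << 4) | (v4 << 2) | v5

    def unpack_pbc(code):
        """Inverse of pack_pbc."""
        return ((code >> 10) & 0x3, (code >> 7) & 0x7,
                (code >> 4) & 0x7, (code >> 2) & 0x3, code & 0x3)

    print(unpack_pbc(pack_pbc(3, 2, 5, 1, 0)))  # -> (3, 2, 5, 1, 0)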

4.8 Texture retrieval descriptor of MPEG-7 (TRD)

4.8.1. Feature extraction:


The discrete Gabor transform of an image I(x, y) is given by the convolution

Gm,n(x, y) = Σs Σt I(x − s, y − t) g*m,n(s, t)

where Gm,n is the filtered image at scale m and orientation n, K is the filter mask size, and g*m,n is
the complex conjugate of the scaled and rotated Gabor mother wavelet gm,n,

where a > 1, θ = nπ/N, and M and N are respectively the maximum number of scales and orientations.
The frequency response of the texture at different scales and orientations is captured by filters with
different m and θ. The values of the constants are Ul = 0.05, Uh = 0.4, K = 60, M = 4 and N = 6.
The energy of the filtered image at scale m and orientation n is the sum of the magnitudes of the
filtered image,

E(m, n) = Σx=1..P Σy=1..Q |Gm,n(x, y)|

where P and Q are the image width and height.


The response of the texture at each scale and orientation can be summarised as the ratio of the energy
levels. The feature vector is the ratio of the energy level at each scale and orientation, and it is used
as the index of the texture.

The dissimilarity between two textures is defined as the dissimilarity between their indexes, calculated
using a distance metric such as L1 or the Earth Mover's Distance (EMD) proposed by Rubner.
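The following Python sketch illustrates this kind of energy-based texture index under simplified assumptions: a small real-valued Gabor kernel built by hand, dyadic centre frequencies spanning Ul = 0.05 to Uh = 0.4, and the L1 metric for comparison. It is an illustration of the idea, not the MPEG-7 reference extraction.

    import numpy as np
    from scipy.signal import fftconvolve  # standard SciPy convolution

    def gabor_kernel(freq, theta, sigma=4.0, size=31):
        """A simple real-valued Gabor kernel; the parameter choices are illustrative."""
        half = size // 2
        y, x = np.mgrid[-half:half + 1, -half:half + 1]
        xr = x * np.cos(theta) + y * np.sin(theta)
        yr = -x * np.sin(theta) + y * np.cos(theta)
        envelope = np.exp(-(xr ** 2 + yr ** 2) / (2.0 * sigma ** 2))
        carrier = np.cos(2.0 * np.pi * freq * xr)
        return envelope * carrier

    def trd_index(image, scales=4, orientations=6):
        """Energy-ratio feature vector over a bank of scales x orientations."""
        energies = []
        for m in range(scales):
            freq = 0.05 * (2 ** m)           # assumed dyadic spacing from 0.05 to 0.4
            for n in range(orientations):
                theta = n * np.pi / orientations
                g = fftconvolve(image, gabor_kernel(freq, theta), mode="same")
                energies.append(np.sum(np.abs(g)))
        energies = np.asarray(energies)
        return energies / energies.sum()     # ratio of the energy levels

    def l1_distance(idx_a, idx_b):
        """L1 dissimilarity between two texture indexes."""
        return float(np.sum(np.abs(idx_a - idx_b)))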

4.8.2. Deriving the layout for browsing.


To browse the database, one can use proximity visualisation, in which the proximity of objects is
dictated by the dissimilarities of the objects as measured by a distance metric. Once the features
representing the objects and a distance metric are selected, a layout can be generated using Principal
Component Analysis, Multidimensional Scaling (MDS) or genetic algorithms. The use of MDS can be
illustrated with the following example. Given a map, finding the distances between all towns on the
map is trivial. However, if the question is reversed (given the distances between all towns, find the
layout of the towns), the task is no longer trivial. MDS is used to answer such a question. The layout
it produces can then be used to draw a map of the towns, a proximity visualisation. As mentioned
before, the dissimilarity between two TRD indexes can be calculated using L1 or the Earth Mover's
Distance (EMD) [3]. The problem with EMD is that it is computationally expensive: calculating the
distance between two indexes using EMD on an Intel P4, 1.4 GHz PC running Linux 2.4 takes
0.28 ms, whereas using L1 it takes a negligible 1e−7 ms. If the calculation involves only two indexes,
the speed gained by using L1 is meaningless. However, MDS algorithms calculate the distances many
times, normally of one index against every other index, and this needs to be done iteratively until a
suitable layout is found. So, in this case, using L1 will speed up the process of finding a layout.
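A minimal sketch of generating such a layout, assuming scikit-learn is available and using L1 dissimilarities between TRD indexes with metric MDS:

    import numpy as np
    from sklearn.manifold import MDS  # assumes scikit-learn is installed

    def browsing_layout(indexes):
        """2-D proximity layout of TRD indexes from L1 distances and MDS."""
        indexes = np.asarray(indexes)
        # Pairwise L1 dissimilarity matrix between all texture indexes.
        dist = np.abs(indexes[:, None, :] - indexes[None, :, :]).sum(axis=2)
        mds = MDS(n_components=2, dissimilarity="precomputed", random_state=0)
        return mds.fit_transform(dist)   # (n, 2) coordinates for plotting

    # Usage: coords = browsing_layout([trd_index(img) for img in images])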

4.9. Texture Browsing Descriptor of MPEG-7 (TBD)

Like the TRD, the TBD is also extracted from the convolution of an image with Gabor filters, as
described above. A graphical representation of the relationship between TRD and TBD is given in
Fig. 1. The TBD is intended to conform with human perception of texture similarity in terms of
regularity (structuredness), coarseness and directionality, and it is defined as:

v1 ∈ {1, . . . , 4}: four classes of texture describing the regularity or structuredness of the texture
(a higher value being more regular or structured).

v2, v3 ∈ {1, . . . , 6}: the two dominant directions of the texture. It is vague what v3 means if there
is only one dominant direction. The TBD is derived from Gabor filters with 6 orientations, so the
maximum value of v2 and v3 is six.

v4, v5 ∈ {1, . . . , 4}: the quantised dominant scales of the texture along the two dominant directions
(a higher value being coarser).

Although the browsing feature is defined, the browsing process is not. It is unclear how to use the
TBD to browse a database, because the monitor is a two-dimensional medium and the TBD is a
five-dimensional feature vector. To display the images on a two-dimensional medium, the vectors
must be reduced to two dimensions as well. The TBD can be reduced to two dimensions by using
Multidimensional Scaling or by selecting only the more meaningful features. The following sections
describe the two methods.

5.Audio Segmentation:

The past five years have seen considerable attention devoted to the problem of content-based image
indexing in the research literature and very little to content-based audio indexing. The skewed focus
is partly explained by the larger number of applications that are anticipated to need image indexing,
as well as the relatively large size of image data. However, audio applications will also increase in
number as multimedia databases proliferate and as specialized audio applications gain currency.
Audio can serve as an independent data type or as part of multimedia data. These specialized
applications originate in domains such as the medical industry (ultrasound, electrocardiography),
the entertainment industry (music, speech recordings, sound databases for virtual reality
environments) and security (voice recognition, sound identification).

The general problem can be described as follows.


We are given a large collection of digitized audio sounds, each sound typically longer than a few
seconds in length. It is convenient to think of each audio sound as a file consisting of digitized
samples taken at a particular sampling rate. A query consists of a small snippet of sound, most often
a few seconds in length. The goal is to identify those files which contain sounds "similar" to the
query. Here, the notion of "similar" in a practical context can be somewhat subjective and
user-dependent. We do not resolve the general problem, also found in image databases, of determining
exactly whether a piece of audio is a suitable match to a query. Instead, we use a simple distance
measure such as mean-square difference, and in our testing we use queries that are actually extracted
from the sounds existing in the data set. In this manner, we are able to make sure that a query at least
returns the data that it should. The percentage of files reported by the search which actually contain
data matching the given query is called the hit-ratio. While a high hit-ratio is desirable, low false-hit
and false-dismissal counts are also important. A false-hit occurs when the search process reports a
file which in fact contains no resemblance to the query at any portion of the data. A false-dismissal
occurs when the search does not report a file which does in fact contain a match of the query. It
should be noted that false-hits are more tolerable than false-dismissals.

Since coarse visual features may cause false matches and cannot accurately locate an instance, fine
audio features are introduced in this section in order to locate the instance more accurately and also
for verification. From the available audio features, we choose audio loudness and SFM (spectral
flatness measure) features, which have demonstrated good performance in previous work. In our
method, both features are estimated directly from the 32 sub-band coefficients of the compressed
audio granules, without decompression. In MPEG-1 layer 1/2 audio compression, each
non-overlapping compressed audio granule represents a temporal window that contains 384 audio
samples. Both loudness and SFM features are calculated over 4 frequency bands of equal length.
Therefore, in total 8 coefficients (#Loud/#SFM = 4/4) represent an individual compressed audio
granule, instead of its 32 sub-band coefficients. Finally, an n × 8 matrix is generated to represent
the whole clip, where n is the clip length, namely the number of granules.
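A hedged sketch of this feature-extraction step is given below. It assumes the 32 sub-band coefficients per granule are already available as an array, groups them into 4 equal bands, and approximates loudness by band energy, which is a simplification of an actual loudness model.

    import numpy as np

    def granule_features(subbands):
        """Per-granule loudness and spectral-flatness features.

        `subbands` is assumed to be an (n_granules, 32) array of MPEG-1 layer 1/2
        sub-band coefficients for one clip.  Returns an (n_granules, 8) matrix:
        4 loudness values followed by 4 SFM values per granule.
        """
        n, _ = subbands.shape
        power = subbands.astype(float) ** 2 + 1e-12      # avoid log/division by zero
        bands = power.reshape(n, 4, 8)                   # 4 bands x 8 sub-bands each
        loudness = bands.sum(axis=2)                     # band energy as loudness proxy
        geo_mean = np.exp(np.log(bands).mean(axis=2))    # geometric mean per band
        arith_mean = bands.mean(axis=2)                  # arithmetic mean per band
        sfm = geo_mean / arith_mean                      # spectral flatness per band
        return np.hstack([loudness, sfm])                # (n, 8) feature matrix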

5.1 Basics of transforms

The data sizes for audio are huge, so using the raw data as an index results in increased storage
requirements and slow searches. Pitch changes and other acoustic features have been used as indices
for audio databases, but these also require considerable computation. We propose an indexing
scheme based on the transform techniques used in signal processing and data compression.

Transforms are applied to signals (time-domain signals such as audio, or spatial signals such as
images) to transform the data to the frequency domain. This offers several advantages, such as easy
noise removal and compression, and it facilitates several kinds of processing. Specifically, given a
vector X = (x1, x2, . . . , xN) representing a discrete signal, the application of a transform yields a
vector Y = (y1, y2, . . . , yN) of transform coefficients, and the original signal X can be recovered
from Y by applying the inverse transform. The transform/inverse-transform pairs for the DFT and
DCT can be found in any book on signal processing.

The standard DFT pair is given by:

Y(k) = Σ n=1..N x(n) · e^(−j·2π·(k−1)(n−1)/N), for k = 1, . . . , N

x(n) = (1/N) Σ k=1..N Y(k) · e^(j·2π·(k−1)(n−1)/N), for n = 1, . . . , N
5.2 Outline of indexing scheme:
Each audio file or stream is divided into small blocks of contiguous samples, and a transform such as
the discrete Fourier transform is applied to each block. This yields a set of coefficients in the
frequency domain. An index entry is created by selecting an appropriate subset of the frequency
coefficients and retaining a pointer to the original audio block. Thus, the index occupies less space
than the data and allows for faster searching.

Next, a query is similarly divided into blocks, the transform is applied to each block, and a subset of
transform coefficients is selected. This forms the pattern. Then, the index is searched for an
occurrence of this pattern. Two strings are matched if they are within a small enough
"root-mean-square distance" of each other.
Suppose A = a1 a2 . . . an represents the discrete samples of the original audio signal, and
Q = q1 q2 . . . qm represents the samples of a given query. Both the original signal and the query are
divided into blocks of size L.

Consider a block Ai = ai1, ai2, . . . , aiL of the original signal. Application of a transform (say FFT,
DCT or any similar transform) to Ai yields a new sequence of values Yi = yi1, yi2, . . . , yiL, where
Yi = T · Ai and T is the transform matrix, which is independent of the input signal. With a suitable
transform, a few significant values of Yi (the first few values by position (zonal selection) or the
largest few values by magnitude (threshold selection)) are usually enough to reconstruct a good
approximation of the original data. Suppose k significant values of each block are retained to serve
as the index for the original data. Specifically, let yi1, yi2, . . . , yik be the index for block Ai,
denoted by DBCi. With threshold selection, we also need to remember the locations (positions) of the
coefficients, and these are saved in DBCLi. There are N such indices for A, one for each block of A.
Together, these form the index set for the original data. Similarly, application of the same transform
to a block Qi of the query yields a sequence of values QBCi = z1, z2, . . . , zL, where QBCi = T · Qi.
The appropriate k values of QBCi are compared against the index sets to determine a match (exact
or close).
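The following Python sketch illustrates the indexing side of this scheme under assumed parameters (block size L = 512, k = 8 retained coefficients, a real-input FFT as the transform, and threshold selection); it is a sketch of the idea rather than the exact scheme.

    import numpy as np

    BLOCK = 512   # block size L (illustrative choice)
    K = 8         # number of significant coefficients kept per block

    def block_index(samples, block=BLOCK, k=K):
        """Transform-based index: for each block keep the k largest-magnitude
        DFT coefficients (threshold selection) and their positions."""
        n_blocks = len(samples) // block
        dbc, dbcl = [], []
        for i in range(n_blocks):
            y = np.fft.rfft(samples[i * block:(i + 1) * block])  # frequency-domain block
            pos = np.argsort(np.abs(y))[-k:]                     # largest-magnitude coeffs
            dbcl.append(pos)
            dbc.append(y[pos])
        return np.array(dbc), np.array(dbcl)                     # DBC and DBCL

    def query_coefficients(query, block=BLOCK):
        """Full DFT coefficients for each query block (QBC)."""
        n_blocks = len(query) // block
        return np.array([np.fft.rfft(query[i * block:(i + 1) * block])
                         for i in range(n_blocks)])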

5.3 Blocking and segmentation :


To derive the transform-based index, the audio data (signal) is divided into fixed-size units called
blocks, a process referred to as blocking. A suitable transform is then applied to these individual
blocks.
The advantages of blocking are the following:
• When transforms are applied to the whole signal, the transform coefficients capture global
averages but not the finer details.
• Blocks of appropriate size contain samples which are highly intercorrelated, so that when
transforms are applied there is more energy compaction, and thus fewer transform coefficients
adequately describe the data.
• The transforms on the individual blocks can be carried out in parallel.
In segmentation, on the other hand, the audio data is divided into variable-length units called
segments. The data within a segment does not vary much; the positions in the audio data where very
sharp changes occur define the segment boundaries.

5.4 Advantages of transform-based indexing
The transform-based indexing scheme has several advantages:
1. After the application of a transform (FFT, DCT, etc.) to a signal, we have quantities in the
frequency domain. It is known that frequencies are less sensitive to noise than the signal amplitude.
So applying transforms to the audio data and to the query makes them less sensitive to noise and
results in better search success for index searches than for raw-data searches.
2. If the data content is stored with different sampling rates or resolutions, then the raw-data search
scheme, or schemes based on acoustic attributes, may fail to identify a given query in the data files.
However, the transforms make the query and data insensitive to sampling rates.
3. By using a good transform, only a few transform coefficients suffice as the index, thus reducing
the storage space for the index.

Items (1) and (2) result in better hit ratios, while (3) results in lower search times.

5.5 Search algorithm

In the index searches, the transform coefficients of the query are compared with the corresponding
coefficients of the data blocks and the distance between them is determined. If the distance is below
an experimentally determined threshold, it is accepted as a match. The first of the two index search
schemes, called simple search, assumes that the query block boundaries are aligned with the data
block boundaries. The other algorithm is called robust search.

5.5.1 Simple search algorithm


• It assumes that the query block boundaries are aligned with the data block boundaries.
In the following algorithm and its analysis, the following notation is used:
L : length of a block (number of samples).
N : number of blocks of the data.
M : number of blocks of the query.
k : number of significant transform coefficients per block retained as index.
QBC : Query Block Coefficients (obtained by applying the transform to the query blocks).
DBC : Data Block Coefficients (obtained by applying the transform to the data blocks).
DBCL : Data Block Coefficient Locations.
RBC : Reconstructed Data Block Coefficients.
RBCL : Reconstructed Data Block Coefficient Locations.

Note that each block of QBC contains L elements and each block of DBC and DBCL contains k
elements.

The simple search algorithm is given as follows:

Simple search(QBC[1 : M, 1 : L], DBC[1 : N, 1 : k], DBCL[1 : N, 1 : k])

1. begin
2. for i = 1 to N − M do
3.     dist = 0;
4.     for j = 1 to M do
5.         dist = dist + Euclidean distance between the corresponding coefficients of QBC[j] and DBC[i + j − 1];
6.     endfor
7.     if (dist < Threshold)
8.         report "Match found" at block i;
9.     endif
10. endfor
11. end

Time Complexity
Consider the simple search algorithm given above. Applying the transform to the query takes
O(M · L log L) time. Step 5 of the algorithm takes O(k) time. In the worst case, the loops are
executed (N − M) · M times. The complexity of the algorithm is therefore:
O(M · L log L + (N − M) · M · k) ≈ O(M L log L + N M k), since M << N.
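For reference, a runnable Python version of the simple search is sketched below; it assumes index arrays like those produced by the blocking sketch in Section 5.2, and the threshold is left to experiment, as stated above.

    import numpy as np

    def simple_search(qbc, dbc, dbcl, threshold):
        """Report the data positions where the query's block coefficients match the index.

        qbc:  (M, L') complex DFT coefficients of the query blocks
        dbc:  (N, k) retained coefficients per data block
        dbcl: (N, k) positions of the retained coefficients
        """
        n, m = len(dbc), len(qbc)
        matches = []
        for i in range(n - m + 1):
            dist = 0.0
            for j in range(m):
                # Compare only the k coefficients retained for data block i + j.
                q_sel = qbc[j][dbcl[i + j]]
                dist += np.linalg.norm(q_sel - dbc[i + j])
            if dist < threshold:
                matches.append(i)
        return matches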

6. Shape descriptor
The shape of a video object can be an arbitrary polygon, or an oval of arbitrary shape and size. The
visual palette allows the user to sketch an arbitrary polygon with the cursor; other well-known shapes
such as circles, ellipses, and rectangles are predefined and are easily inserted and manipulated.

One of the biggest challenges with using shape as a feature is representing the object while retaining
a computationally efficient metric to compare two shapes. One approach is to use geometric
invariants to represent shape: invariants on the coefficients of the implicit polynomial used to
represent the shape of the object. However, these coefficients need to be calculated very accurately,
as the implicit-polynomial representation is very sensitive to perturbations. Additionally, generating
these coefficients is computationally intensive.

Shape attribute techniques fall into two distinct categories: measurement-based methods, ranging
from simple, primitive measures such as area and circularity to the more sophisticated measures of
various moment invariants; and transformation-based methods, ranging from functional
transformations such as Fourier descriptors to structural transformations such as chain codes and
curvature scale space feature vectors. Attempts to compare the various shape representation schemes
have been made in the literature. The features which characterize the shape of an image object can be
classified into the following two categories. Global shape features are general in nature and depend
on the characteristics of the entire image object; area, perimeter, and the major axis direction of the
corresponding image region are examples of such features. Local shape features are based on the
low-level characteristics of image objects, and their determination usually requires more involved
computation; curvatures, boundary segments, and corner points around the boundary of the
corresponding image region are examples of such features.

6.1 Shape features
The "statistical pattern recognition" approach to shape recognition has been prevalent for many years.
We make a set of measurements which independently characterize some aspects of the shape. Ideally,
we have a large collection of examples, so we may then characterize the shape statistically. Suppose,
for example, the mission of the project is to distinguish between sharks and sting rays. Measurements
may include properties of a region such as area, perimeter, aspect ratio, eigenvalues, convex
discrepancy, and various central moments. Though the computation of such features can be
challenging, we do not discuss the actual computational process here, but rather refer the reader to
texts on computational geometry. Before we can consider the use of simple geometric features, we
must discuss briefly how such features might be used, which is, in turn, a pattern recognition problem.

6.2 Pattern Recognition

The approach is to restrict the problem to "to which shape in the database is the observed shape most
similar?" Thus, we are not classifying the shape as a shark or a sting ray, and we do not assume that
we have any statistical information about the properties of the typical shark or sting ray. This
restriction eliminates most of the "statistical" part of statistical pattern recognition, and leaves us
with the problem of finding, not the class to which an observation belongs, but the prototype it most
resembles. We expound on this in the next subsection.

6.3 Curvature Scale Space Matching

Curvature Scale Space (CSS) is a shape representation method introduced by Mokhtarian and
Mackworth. CSS has also been adopted in the MPEG-7 standard as a contour shape descriptor. The
CSS representation is a multi-scale organization of the curvature zero-crossing points of a planar
curve. The authors define the representation in the continuous domain but sample it later.

CSS descriptors are translation-invariant because of the use of curvature. Scale invariance is obtained
by re-sampling the curve to a fixed number of boundary points. Rotation and starting-point changes
cause circular shifts in the CSS image and are compensated for during the matching process.

Various shape descriptors are boundary, edge, thickness, etc. A variety of statistical methods can be
used with such features, such as the Shape Context algorithm and other shape matching algorithms.
A boundary can be either continuous, parameterized (typically) by arc length, or discrete,
parameterized by an index, say i. Our objective is to compare two boundaries, C_i and C_j. In the
discrete form, boundary i is an ordered set of points in the plane, C_i = {C_i,1, C_i,2, . . . , C_i,N},
where C_i,k = [x_i,k, y_i,k]^T. When we use the continuous form, we denote the i'th curve as C_i(s),
using s to denote arc length. For a parametric representation, consider a situation in which two
measurements (the length and width of a region) have been made for samples drawn from two
different classes, and it is impossible to draw a simple straight line which separates the two classes.
The correct decision then depends on the underlying distribution of the data from which the samples
were drawn.

Figure 6.1: two-class example (length and width measurements of regions from two classes)

In the nonparametric case, one of the more popular representations is the K-nearest-neighbor (K-NN)
form. In that case, it turns out that the best decision is to simply assign the unknown observation to
the class to which the K nearest neighbors of the observation belong. For each element of the
database, say element i, calculate the feature vector H_i = [h_i,1, h_i,2, . . . , h_i,7]^T. Make the
same measurements on the unknown region, O = [O_1, O_2, . . . , O_7]^T, and then decide that the
observation O belongs to class i iff ||O − H_i|| < ||O − H_j|| for all j ≠ i. Here || · || denotes some norm.
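A minimal nearest-prototype sketch of this decision rule (K = 1, Euclidean norm) might look as follows; the class labels and 7-dimensional feature vectors are placeholders.

    import numpy as np

    def nearest_prototype(observation, prototypes):
        """Assign an observation to the class of the nearest prototype feature vector.

        `prototypes` maps a class label to its 7-element feature vector H_i;
        the Euclidean norm is used here, but any norm could be substituted.
        """
        o = np.asarray(observation, dtype=float)
        return min(prototypes,
                   key=lambda label: np.linalg.norm(o - np.asarray(prototypes[label])))

    # Usage with hypothetical 7-dimensional shape features:
    # label = nearest_prototype(O, {"shark": H_shark, "sting_ray": H_ray})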

6.4 Shape Context: a descriptor for matching 2D shapes. Given a contour C_i, assumed sampled as
C_i = {C_i,1, C_i,2, . . . , C_i,N}, let P belong to C_i. For any element of C_i we can compute the
vector from that point to P, and thus construct an ordered set of such vectors,
{P − C_i,1, P − C_i,2, . . . , P − C_i,N}. The elements of this set are converted to log-polar
coordinates (θ_j, r_j) and coarsely quantized, and then a two-dimensional histogram is constructed,
defining the shape context of point P:

h(P, k) = #{ j : (θ_j, r_j) falls in bin k }, for j = 1, . . . , N

The histogram is simply a count of how many times a particular (angle, log-distance) pair occurs on
this contour. We refer to h(P, k) as the shape context of point P on curve C_i.

6.5 Shape context in matching. We use the shape context to find a measure which characterizes how
well C_i matches C_j. Let P denote a point on curve C_i and Q a point on C_j, and denote the
corresponding histograms by h(P, k) and h(Q, k). For any particular P and Q, we may match their
shape contexts by matching individual points along the two curves and constructing a matrix of
matching costs: the cost of assigning P to correspond to Q is defined as

C_PQ = (1/2) Σ_k [h(P, k) − h(Q, k)]^2 / [h(P, k) + h(Q, k)]

Based on this cost matrix, the matching between two shapes is determined.
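A small Python sketch of the shape-context computation and the resulting cost matrix is given below; the numbers of angle and log-radius bins are illustrative choices, and no attempt is made here to solve the subsequent assignment problem.

    import numpy as np

    def shape_context(points, p_index, n_theta=12, n_r=5):
        """Log-polar histogram (shape context) of point P relative to the other contour points."""
        pts = np.asarray(points, dtype=float)
        diff = np.delete(pts, p_index, axis=0) - pts[p_index]
        r = np.log(np.linalg.norm(diff, axis=1) + 1e-9)
        theta = np.arctan2(diff[:, 1], diff[:, 0])
        hist, _, _ = np.histogram2d(theta, r, bins=(n_theta, n_r))
        return hist.ravel()

    def chi2_cost(h_p, h_q):
        """Chi-square assignment cost C_PQ between two shape contexts."""
        denom = h_p + h_q
        mask = denom > 0
        return 0.5 * np.sum((h_p[mask] - h_q[mask]) ** 2 / denom[mask])

    def cost_matrix(contour_a, contour_b):
        """Matrix of assignment costs between all points of two sampled contours."""
        ha = [shape_context(contour_a, i) for i in range(len(contour_a))]
        hb = [shape_context(contour_b, j) for j in range(len(contour_b))]
        return np.array([[chi2_cost(a, b) for b in hb] for a in ha])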

7.Color extraction

Color is chosen as the major segmentation feature because of its consistency under varying
conditions. Since the boundaries of color regions may not be accurate due to noise, each frame of
the video shot is filtered before color-region merging is done. Edge information is also incorporated
into the segmentation process to improve accuracy. Optical flow is utilized to project and track color
regions through a video sequence.

The representative color is taken in the quantized CIE-LUV space. It is important to bear in mind
that the quantization is not static: the quantization palette changes with each video shot, and the
quantization is calculated anew for each sequence with the help of a self-organizing map.

Color attributes can be represented as a histogram of the pixel colors. Based on a fixed partition of
the image, an image can be indexed by the color of the whole image and a set of inter-hierarchical
distances which encode the spatial color information. The Color-WISE system partitions an image
into 8×8 blocks, with each block indexed by its dominant hue and saturation values. A histogram
refinement technique partitions histogram bins based on the spatial coherence of pixels: a pixel is
coherent if it is part of some sizable region of similar color, and incoherent otherwise.

A proper choice of colour space is imperative in colour video database indexing and retrieval.
Comparing two colours requires the calculation of a meaningful distance between them in the chosen
colour space. The RGB 3-D orthogonal space and its tri-stimulus principle are convenient for
generating different colours for display on standard monitors and for calculating Euclidean distances.
Unfortunately, the RGB colour space is not well suited to human visual perception of colours. The
HSI (Hue-Saturation-Intensity) colour space is known to reflect human visual perception better. Hue
is a colour sensation, measured by the angle around the colour circle. Saturation is a measure of how
pure or dense the colour is, and is measured by the radius; for example, bright red is highly saturated
with little white, while pink is desaturated by white. Intensity is a measure of lightness from white
through grey to black, measured along the vertical axis, e.g. light red compared to dark red. The
conversion between RGB and HSI is non-linear, and although it is relatively simple to convert from
RGB to HSI, the opposite operation is rather complicated. The perceptual closeness between two
colours should really be measured in the HSI space, in which, unfortunately, the Euclidean distance
does not have a direct meaning.

• Color is an intuitive feature, and color features play a major role in image matching.
• Our approach focuses on the HSV color space, color quantization, and the color histogram.

HSV Color Space


To provide an intuitive representation in user interfaces, programmers prefer the HSL color space
(hue, saturation, lightness) or the HSV color space (hue, saturation, value). The HSV color space is
the one used most often because of its accordance with human visual perception. The HSV color
space has two distinct characteristics: one is that the lightness component is independent of the color
information of the image; the other is that the hue and saturation components correlate with the
manner of human visual perception. So programmers often transform the RGB color space to the
HSV color space in image indexing.

7.1 Color Quantization
• Color quantization is a process that reduces the number of distinct colors used in an image.
• In order to reduce the number of colors before color feature extraction, we convert all colors into a
subset; this conversion is called quantization.
• Following an analysis of color features in HSV space, we propose a non-uniform dividing method
that quantizes the color space into 12 × 4 × 4 colors.
• On the hue ring, the colors are not distributed uniformly: the three primaries red, green and blue
occupy more space than the other colors (their ranges are the widest), the secondary colors take
second place, and the ranges of the remaining colors are the narrowest. So we adopt a non-uniform
quantization method.
• Hue is quantized into the 12 intervals (25,45], (45,75], (75,95], (95,145], (145,165], (165,195],
(195,215], (215,265], (265,285], (285,315], (315,335], (335,25].
• Saturation and value adopt uniform quantization: saturation is quantized into four parts, and value
is also quantized into four parts.

• S = 0 for s in [0, 0.25); S = 1 for s in [0.25, 0.5); S = 2 for s in [0.5, 0.75); S = 3 for s in [0.75, 1].

• V = 0 for v in [0, 0.25); V = 1 for v in [0.25, 0.5); V = 2 for v in [0.5, 0.75); V = 3 for v in [0.75, 1].
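A sketch of this 12 × 4 × 4 quantization in Python is shown below, using the standard library's colorsys module for the RGB-to-HSV conversion; the mapping of a quantized triple to a single palette index is our own convention for illustration.

    import colorsys  # standard library RGB <-> HSV conversion

    # Hue bin edges in degrees, following the non-uniform intervals above;
    # the final bin (335, 25] wraps around 0 degrees.
    HUE_EDGES = [25, 45, 75, 95, 145, 165, 195, 215, 265, 285, 315, 335]

    def quantize_hsv(r, g, b):
        """Map an RGB pixel (0..255) to a quantized (hue, saturation, value)
        triple in the 12 x 4 x 4 palette described above."""
        h, s, v = colorsys.rgb_to_hsv(r / 255.0, g / 255.0, b / 255.0)
        hue_deg = h * 360.0
        hue_bin = 11                      # default: the wrap-around bin (335, 25]
        for i in range(len(HUE_EDGES) - 1):
            if HUE_EDGES[i] < hue_deg <= HUE_EDGES[i + 1]:
                hue_bin = i
                break
        s_bin = min(int(s * 4), 3)        # uniform 4-level saturation
        v_bin = min(int(v * 4), 3)        # uniform 4-level value
        return hue_bin, s_bin, v_bin

    def color_code(hue_bin, s_bin, v_bin):
        """Single index in [0, 191] for the quantized color (12 * 4 * 4 = 192 colors)."""
        return hue_bin * 16 + s_bin * 4 + v_bin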

7.2 COLOR HISTOGRAM :


Color is usually represented by a color histogram, color correlogram, color coherence vector, or
color moments under a certain color space. The color histogram serves as an effective representation
of the color content of an image if the color pattern is unique compared with the rest of the data set.
The color histogram is easy to compute and effective in characterizing both the global and local
distribution of colors in an image. In addition, it is robust to translation and rotation about the view
axis, and it changes only slowly with scale, occlusion and viewing angle. Color histograms are
therefore widely used for content-based image retrieval.

A. Definition
First, we quantize the colors in the image using the above method. Then we build the color histogram:
A(j) = i + H(i), for i in [0, Nc), j in [0, Nb), H(i) ≠ 0,
where Nc is the number of quantized colors, Nb is the number of nonzero entries of H(i), and i is the
index of the i'th color. We use A(j) to store only those histogram entries which are nonzero; A(j) is a
dynamic array.

7.3 Similarity Calculation :


INT(A(j)) is a function which returns the integer part of A(j), discarding the decimal fraction. Ap(i)
denotes A(i) in picture P and Aq(i) denotes A(i) in picture Q. Sim(P,Q) denotes the similarity between
image P and image Q.

Ncp is the number of nonzero entries in the color histogram of image P, and Ncq is the number of
nonzero entries in the color histogram of image Q.

1. i = 0, j = 0;
2. Sim(P,Q) = 0;
3. while (i < Ncp && j < Ncq)
4. do {
5.   if (INT(Ap(i)) == INT(Aq(j))) {
6.     Sim(P,Q) = Sim(P,Q) + min(Hp(i), Hq(j));
7.     i++;
8.     j++;
9.   } else if (INT(Ap(i)) < INT(Aq(j)))
10.    i++;
11.  else
12.    j++;
13. }

Two histograms are considered similar when this similarity value Sim(P,Q) is large; the query returns the images with the largest similarity.
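The same similarity can be written compactly if the nonzero histogram entries are kept in a dictionary keyed by the quantized color index; the Python sketch below is equivalent in spirit to the merge-style loop above.

    def histogram_similarity(hist_p, hist_q):
        """Histogram-intersection similarity between two sparse color histograms.

        Each histogram is assumed to be a dict mapping a quantized color index
        (the integer part stored in A) to its normalized count H(i)."""
        common = set(hist_p) & set(hist_q)
        return sum(min(hist_p[c], hist_q[c]) for c in common)

    # Usage with the quantizer sketched earlier:
    # hist = {}
    # for (r, g, b) in pixels:
    #     c = color_code(*quantize_hsv(r, g, b))
    #     hist[c] = hist.get(c, 0) + 1.0 / len(pixels)
    # score = histogram_similarity(hist_query, hist_database_image)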

8. INTRODUCTION TO SEARCH ENGINE
A search engine or search service is a program designed to help find information stored on a
computer system, such as the World Wide Web, a corporate or proprietary network, or a personal
computer. The search engine allows one to ask for content meeting specific criteria (typically content
containing a given word or phrase) and retrieves a list of references that match those criteria.

Search engines use regularly updated indexes to operate quickly and efficiently. "Search engine"
usually refers to a Web search engine, which searches for information on the public Web. However,
there are other kinds of search engines: enterprise search engines, which search intranets; personal
search engines, which search individual personal computers; and mobile search engines. Some search
engines also mine data available in newsgroups, large databases, or open directories like DMOZ.org.
Unlike Web directories, which are maintained by human editors, search engines operate
algorithmically. Many web sites which call themselves search engines are actually front ends to
search engines owned by other companies.

Any search engine has the following parts:

1) Crawler: downloads web pages from web sites on the Internet, including sites distributing
animations.
2) Metadata Extractor: extracts the metadata present in non-text animation files, e.g. SWF or
CLASS files.
3) Indexer: indexes the downloaded pages according to their HTML content as well as the metadata
extracted from the animation files.
4) Ranking Criteria: decides which pages are given a higher rank compared to others, depending
upon some predefined algorithms.
5) Searcher: accepts the user query, searches for pages satisfying the query, and returns results
ordered by the ranking criteria.

8.1 TYPES OF SEARCH ENGINES


Although the term "search engine" is often used indiscriminately to describe crawler-based search
engines, human-powered directories, and everything in between, they are not all the same. Each
type of "search engine" gathers and ranks listings in radically different ways.
1. Crawler-Based
Crawler-based search engines such as Google, compile their listings automatically. They
"crawl" or "spider" the web, and people search through their listings. These listings are what
make up the search engine's index or catalog. You can think of the index as a massive
electronic filing cabinet containing a copy of every web page the spider finds. Because
spiders scour the web on a regular basis, any changes you make to a web site may affect your
search engine ranking. It is also important to remember that it may take a while for a spidered
page to be added to the index. Until that happens, it is not available to those searching with
the search engine.

2.Directories

Directories such as Open Directory depend on human editors to compile their listings. Webmasters
submit an address, title, and a brief description of their site, and then editors review the submission.
Unless you sign up for a paid inclusion program, it may take months for your web site to be
reviewed. Even then, there's no guarantee that your web site will be accepted.

After a web site makes it into a directory however, it is generally very difficult to change its search
engine ranking. So before you submit to a directory, spend some time working on your titles and
descriptions. Moreover, make sure your pages have solid well-written content.
3. Mixed Results / Hybrid Search Engines

Some search engines offer both crawler-based results and human-compiled listings. These hybrid
search engines will typically favor one type of listing over the other, however. Yahoo, for example,
usually displays human-powered listings; however, since it draws secondary results from Google, it
may also display crawler-based results for more obscure queries. Many search engines today combine
a spider engine with a directory service. The directory normally contains pages that have already
been reviewed and assessed.

4.Pay Per Click


More recently, search engines have been offering very cost-effective programs to ensure that your
ads appear when a visitor enters one of your keywords. The new trend is to charge on a cost-per-click
(CPC) model; the listings are comprised entirely of advertisers who have paid to be there. With
services such as Yahoo SM, Google AdWords, and FindWhat, bids determine search engine ranking:
to get top ranking, an advertiser just has to outbid the competition.
5.Metacrawlers Or Meta Search Engines

Metasearch engines search, accumulate and screen the results of multiple primary search engines
(i.e. they are search engines that search search engines). Unlike regular search engines, metacrawlers
do not crawl the web themselves to build listings. Instead, they allow searches to be sent to several
search engines all at once; the results are then blended together onto one page.

Search engines can be categorized as follows based on the application for which they are used:

1. Primary Search Engines:
They scan entire sections of the World Wide Web and produce their results from databases of Web
page content, automatically created by computers.
2. Subject Guides:
They are like indexes in the back of a book. They involve human intervention in selecting and
organizing resources, so they cover fewer resources and topics but provide more focus and guidance.
3. People Search Engines:
They search for names, addresses, telephone numbers and e-mail addresses.
4. Business and Services Search Engines:
They are essentially national yellow-pages directories.
5. Employment and Job Search Engines:
They either (a) provide potential employers access to the resumes of people interested in working
for them, or (b) provide prospective employees with information on job availability.
6. Finance-Oriented Search Engines:
They facilitate searches for specific information about companies (officers, annual reports, SEC
filings, etc.).
7. News Search Engines:
They search newspaper and news web site archives for the selected information.
8. Image Search Engines:
They help you search the WWW for images of all kinds.
9. Specialized Search Engines:
They search specialized databases, allow you to enter your search terms in a particularly easy way,
look for low prices on items you are interested in purchasing, and even give you access to real, live
human beings to answer your questions.

8.2 WORKING OF SEARCH ENGINE


Search engines use automated software programs known as spiders or bots to survey the Web and
build their databases. Web documents are retrieved and analyzed by these programs, and the data
collected from each web page is then added to the search engine index. When you enter a query at a
search engine site, your input is checked against the search engine's index of all the web pages it has
analyzed. The best URLs are then returned to you as hits, ranked in order with the best results at the
top. The working of a search engine can thus be divided into three steps:

1. Web Crawling
2. Indexing
3. Ranking

9. WEB CRAWLING
Web crawlers are automated programs used by search engines. Web crawlers explore new pages
through the hyperlinked network structure and discover updates to pages the engine already knows
about. This process can be done without human involvement, and it has already helped create many
popular search engines such as Google.

The huge amount of data on the Internet gave birth to web search engines, which are becoming more
and more indispensable as the primary means of locating relevant information. Such search engines
rely on massive collections of web pages that are acquired by the work of web crawlers, also known
as web robots or spiders.

A web crawler is a program which browses the World Wide Web in a methodical, automated manner.
Web crawlers are mainly used for automating maintenance tasks by gathering information
automatically. Typically, a crawler begins with a set of given Web pages, called seeds, and follows
all the hyperlinks it encounters along the way, to eventually traverse the entire Web. General
crawlers insert the URLs into a tree structure and visit them in a breadth-first manner. There has been
some recent academic interest in new types of crawling techniques, such as focused crawling based
on the semantic web, cooperative crawling, distributed web crawling, and intelligent crawling, and in
the significance of soft computing comprising fuzzy logic (FL), artificial neural networks (ANNs),
genetic algorithms (GAs), and rough sets (RSs).

The behavior of a web crawler is the outcome of a combination of policies:

1. A selection policy that states which pages to download.

2. A re-visit policy that states when to check for changes to the pages.

3. A politeness policy that states how to avoid overloading websites.

4. A parallelization policy that states how to coordinate distributed web crawlers.

To avoid repeated work, crawlers keep a record of the web pages which have already been
downloaded, for example in a hash table. After crawling, search engines store numerous pages in
their databases, and the harder task is that the crawling and storing work must be repeated
periodically. Taking the most popular search engine, Google, as an example: in 2003 Google's
crawler crawled every month, but now it crawls every 2 or 3 days. Crawling the massive number of
pages at such a frequency incurs a huge cost in network resources and storage. This is exactly the
motivation of this work: since we have to run a crawler to fetch numerous pages of data at an
enormous cost of machine hours and storage, why not take full advantage of it and try to get more
useful information in the form of metadata, which is data about data?

A web crawler (also known as a web spider or web robot) is a program or automated script which
browses the World Wide Web in a methodical, automated manner. Other, less frequently used names
for web crawlers are ants, automatic indexers, bots, and worms. This process is called web crawling
or spidering. Many legitimate sites, in particular search engines, use spidering as a means of
providing up-to-date data. Web crawlers are mainly used to create a copy of all the visited pages for
later processing by a search engine that will index the downloaded pages to provide fast searches.
Crawlers can also be used for automating maintenance tasks on a web site, such as checking links or
validating HTML code, and they can be used to gather specific types of information from Web pages,
such as harvesting e-mail addresses (usually for spam).

A web crawler is one type of bot, or software agent. In general, it starts with a list of URLs to visit,
called the seeds. As the crawler visits these URLs, it identifies all the hyperlinks in the page and adds
them to the list of URLs to visit, called the crawl frontier. URLs from the frontier are recursively
visited according to a set of policies.
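To make the seed-and-frontier process concrete, here is a minimal breadth-first crawler sketch in Python using only the standard library; the page limit, timeout and URL filtering are illustrative choices, and it deliberately omits the politeness and robots.txt handling discussed later in this chapter.

    from collections import deque
    from html.parser import HTMLParser
    from urllib.parse import urljoin
    from urllib.request import urlopen

    class LinkExtractor(HTMLParser):
        """Collects the href attributes of <a> tags from an HTML page."""
        def __init__(self):
            super().__init__()
            self.links = []
        def handle_starttag(self, tag, attrs):
            if tag == "a":
                for name, value in attrs:
                    if name == "href" and value:
                        self.links.append(value)

    def crawl(seeds, max_pages=100):
        """Breadth-first crawl from a list of seed URLs."""
        frontier = deque(seeds)
        seen = set(seeds)             # the "URL seen" structure
        pages = {}
        while frontier and len(pages) < max_pages:
            url = frontier.popleft()
            try:
                with urlopen(url, timeout=10) as resp:
                    html = resp.read().decode("utf-8", errors="replace")
            except Exception:
                continue              # tolerate bad servers and broken links
            pages[url] = html
            parser = LinkExtractor()
            parser.feed(html)
            for link in parser.links:
                absolute = urljoin(url, link)
                if absolute.startswith("http") and absolute not in seen:
                    seen.add(absolute)     # add new URLs to the crawl frontier
                    frontier.append(absolute)
        return pages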

9.1Crawling Applications

There are a number of different scenarios in which crawlers are used for data acquisition. We now
describe a few examples and how they differ in the crawling strategies used.

Breadth-First Crawler:

In order to build a major search engine or a large repository such as the Internet Archive, high-
performance crawlers start out at a small set of pages and then explore other pages by following links
in a “breadth first-like” fashion. In reality, the web pages are often not traversed in a strict breadth-
first fashion, but using a variety of policies, e.g., for pruning crawls inside a web site, or for crawling
more important pages first.

Recrawling Pages for Updates:

After pages are initially acquired, they may have to be periodically recrawled and checked for
updates. In the simplest case, this could be done by starting another broad breadth-first crawl, or by
simply requesting all URLs in the collection again. However, a variety of heuristics can be employed
to recrawl more important pages, sites, or domains more frequently. Good recrawling strategies are
crucial for maintaining an up-to-date search index with limited crawling bandwidth, and recent work
by Cho and Garcia-Molina has studied techniques for optimizing the “freshness” of such collections
given observations about a page’s update history.

Focused Crawling:

More specialized search engines may use crawling policies that attempt to focus only on certain
types of pages, e.g., pages on a particular topic or in a particular language, images, mp3 files, or
computer science research papers. In addition to heuristics, more general approaches have been
proposed based on link structure analysis and machine learning techniques. The goal of a focused
crawler is to find many pages of interest without using a lot of bandwidth. Thus, most of the previous
work does not use a high-performance crawler, although doing so could support large specialized
collections that are significantly more up-to-date than a broad search engine.

Random Walking and Sampling:

Several techniques have been studied that use random walks on the web graph (or a slightly modified
graph) to sample pages or estimate the size and quality of search engines .

Crawling the “Hidden Web”:

A lot of the data accessible via the web actually resides in databases and can only be retrieved by
posting appropriate queries and/or filling out forms on web pages. Recently, a lot of interest has
focused on automatic access to this data, also called the “Hidden Web”, “DeepWeb”, or “Federated
Facts and Figures”. Work in has looked at techniques for crawling this data. A crawler such as the
one described here could be extended and used as an efficient front-end for such a system. We note,
however, that there are many other challenges associated with access to the hidden web, and the
efficiency of the front end is probably not the most important issue.

9.2 Basic Crawler Structure


Given these scenarios, we would like to design a flexible system that can be adapted to different
applications and strategies with a reasonable amount of work. Note that there are significant
differences between the scenarios. For example, a broad breadth-first crawler has to keep track of
which pages have been crawled already; this is commonly done using a “URL seen” data structure
that may have to reside on disk for large crawls. A link analysis-based focused crawler, on the other
hand, may use an additional data structure to represent the graph structure of the crawled part of the
web, and a classifier to judge the relevance of a page , but the size of the structures may be much
smaller. On the other hand, there are a number of common tasks that need to be done in all or most
scenarios, such as enforcement of robot exclusion, crawl speed control, or DNS resolution.

For simplicity, we separate our crawler design into two main components, referred to as the crawling
application and the crawling system. The crawling application decides what page to request next,
given the current state and the previously crawled pages, and issues a stream of requests (URLs) to
the crawling system. The crawling system (eventually) downloads the requested pages and supplies
them to the crawling application for analysis and storage. The crawling system is in charge of tasks
such as robot exclusion, speed control, and DNS resolution that are common to most scenarios, while
the application implements crawling strategies such as "breadth-first" or "focused". Thus, to
implement a focused crawler instead of a breadth-first crawler, we would use the same crawling
system (with a few different parameter settings) but a significantly different application component,
written using a library of functions for common tasks such as parsing, maintenance of the "URL
seen" structure, and communication with the crawling system and storage.

[Figure: Web crawler]

At first glance, implementation of the crawling system may appear trivial. This is, however, not true
in the high-performance case, where several hundred or even a thousand pages have to be downloaded
per second. In fact, our crawling system itself consists of several components that can be replicated
for higher performance. Both the crawling system and the application can also be replicated
independently, and several different applications could issue requests to the same crawling system,
showing another motivation for the design.

9.3 Requirements for a Crawler


We now discuss the requirements for a good crawler, and approaches for achieving them. Details on
our solutions are given in the subsequent sections.

Flexibility:

As mentioned, we would like to be able to use the system in a variety of scenarios, with as few
modifications as possible.

Low Cost and High Performance:

The system should scale to at least several hundred pages per second and hundreds of millions of
pages per run, and should run on low cost hardware. Note that efficient use of disk access is crucial
to maintain a high speed after the main data structures, such as the “URL seen” structure and crawl
frontier, become too large for main memory. This will only happen after downloading several
million pages.

Robustness:

There are several aspects here. First, since the system will interact with millions of servers, it has to
tolerate bad HTML, strange server behavior and configurations, and many other odd issues. Our goal
here is to err on the side of caution, and if necessary ignore pages and even entire servers with odd
behavior, since in many applications we can only download a subset of the pages anyway. Secondly,
since a crawl may take weeks or months, the system needs to be able to tolerate crashes and network
interruptions without losing (too much of) the data. Thus, the state of the system needs to be kept on
disk. We note that we do not really require strict ACID properties. Instead, we decided to periodically
synchronize the main structures to disk, and to recrawl a limited number of pages after a crash.

Etiquette and Speed Control:

It is extremely important to follow the standard conventions for robot exclusion (robots.txt and
robots meta tags), to supply a contact URL for the crawler, and to supervise the crawl. In addition,
we need to be able to control access speed in several different ways. We have to avoid putting too
much load on a single server; we do this by contacting each site only once every 30 seconds unless
specified otherwise. It is also desirable to throttle the speed at the domain level, in order not to
overload small domains, and for other reasons to be explained later. Finally, since we are in a
campus environment where our connection is shared with many other users, we also need to control
the total download rate of our crawler. In particular, we crawl at low speed during the peak usage
hours of the day, and at a much higher speed during the late night and early morning, limited mainly
by the load tolerated by the main campus router.

Manageability and Reconfigurability:

An appropriate interface is needed to monitor the crawl, including the speed of the crawler, statistics
about hosts and pages, and the sizes of the main data sets. The administrator should be able to adjust
the speed, add and remove components, shut down the system, force a checkpoint, or add hosts and
domains to a "blacklist" of places that the crawler should avoid. After a crash or shutdown, the
software of the system may be modified to fix problems, and we may want to continue the crawl
using a different machine configuration. In fact, the software at the end of our first huge crawl was
significantly different from that at the start, due to the need for numerous fixes and extensions that
became apparent only after tens of millions of pages had been downloaded.

9.4 Crawling policies


There are two important characteristics of the Web that generate a scenario in which web crawling is
very difficult: its large volume and its rate of change, as there are a huge number of pages being
added, changed and removed every day. Also, network speed has improved less than current
processing speeds and storage capacities.

The large volume implies that the crawler can only download a fraction of the Web pages within a
given time, so it needs to prioritize its downloads. The high rate of change implies that by the time a
crawler is downloading the last pages from a site, it is very likely that new pages have been added to
the site, or that pages have already been updated or even deleted. A crawler must therefore carefully
choose at each step which pages to visit next.

The behavior of a web crawler is the outcome of a combination of policies:

• A selection policy that states which pages to download.


• A re-visit policy that states when to check for changes to the pages.
• A politeness policy that states how to avoid overloading websites.
• A parallelization policy that states how to coordinate distributed web crawlers.

9.4.1 Selection policy


Given the current size of the Web, even large search engines cover only a portion of the publicly
available content; no search engine indexes more than 16% of the Web. As a crawler always
downloads just a fraction of the Web pages, it is highly desirable that the downloaded fraction
contains the most relevant pages, and not just a random sample of the Web.

This requires a metric of importance for prioritizing Web pages. The importance of a page is a
function of its intrinsic quality, its popularity in terms of links or visits, and even of its URL (the
latter is the case of vertical search engines restricted to a single top-level domain, or search engines
restricted to a fixed Website). Designing a good selection policy has an added difficulty: it must
work with partial information, as the complete set of Web pages is not known during crawling.

Various selection policies are:

A crawler may only want to seek out HTML pages and avoid all other MIME types. In order to
request only HTML resources, a crawler may make an HTTP HEAD request to determine a web
resource's MIME type before requesting the entire resource with a GET request. To avoid making
numerous HEAD requests, a crawler may alternatively examine the URL and only request the
resource if the URL ends with .html, .htm or a slash. This strategy may cause numerous HTML web
resources to be unintentionally skipped.

Some crawlers may also avoid requesting any resources that have a "?" in them (are dynamically
produced) in order to avoid spider traps which may cause the crawler to download an infinite number
of URLs from a website.
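As a small illustration of the two filters just described (the HEAD-based MIME-type check and the URL heuristics), the following Python sketch uses only the standard library; the timeout and suffix list are illustrative assumptions.

    from urllib.request import Request, urlopen

    def is_probably_html(url):
        """Issue an HTTP HEAD request and check the Content-Type header before
        deciding whether to fetch the full resource with GET."""
        try:
            head = Request(url, method="HEAD")
            with urlopen(head, timeout=10) as response:
                content_type = response.headers.get("Content-Type", "")
            return content_type.startswith("text/html")
        except Exception:
            return False

    def should_enqueue(url):
        """Cheap URL-based filter: skip dynamic pages and obvious non-HTML suffixes."""
        if "?" in url:                      # avoid potential spider traps
            return False
        return url.endswith((".html", ".htm", "/"))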

Path-ascending crawling

Some crawlers intend to download as many resources as possible from a particular web site. Cothey
(Cothey, 2004) introduced a path-ascending crawler that would ascend to every path in each URL
that it intends to crawl. For example, when given a seed URL of http://foo.org/a/b/page.html, it will
attempt to crawl /a/b/, /a/, and /. Cothey found that a path-ascending crawler was very effective in
finding isolated resources, or resources for which no inbound link would have been found in regular
crawling.

Focused crawling

The importance of a page for a crawler can also be expressed as a function of the similarity of the
page to a given query. Web crawlers that attempt to download pages that are similar to each other are
called focused crawlers or topical crawlers. Focused crawling was first introduced by Chakrabarti et
al. The main problem in focused crawling is that, in the context of a web crawler, we would like to
be able to predict the similarity of the text of a given page to the query before actually downloading
the page. The performance of focused crawling depends mostly on the richness of links in the
specific topic being searched, and focused crawling usually relies on a general Web search engine
for providing starting points.

Crawling the Deep Web

A vast amount of web pages lie in the deep or invisible web. These pages are typically only
accessible by submitting queries to a database, and regular crawlers are unable to find these pages if
there are no links that point to them. Google’s Sitemap Protocol and mod_oai (Nelson et al., 2005)
are intended to allow discovery of these deep-web resources.

9.4.2 Re-visit policy


The Web has a very dynamic nature, and crawling a fraction of the Web can take a long time, usually
measured in weeks or months. By the time a web crawler has finished its crawl, many events could
have happened, including creations, updates and deletions. From the search engine's point of view,
there is a cost associated with not detecting an event and thus having an outdated copy of a resource.
The most used cost functions, introduced in (Cho and Garcia-Molina, 2000), are freshness and age.

Freshness: This is a binary measure that indicates whether the local copy is accurate or not. The
freshness of a page p in the repository at time t is Fp(t) = 1 if the local copy of p is identical to the
live copy at time t, and Fp(t) = 0 otherwise.

Age: This is a measure that indicates how outdated the local copy is. The age of a page p in the
repository at time t is Ap(t) = 0 if the local copy is still up to date at time t, and otherwise
Ap(t) = t minus the time at which p was first modified after its last download.

[Figure: evolution of freshness and age in Web crawling]

The objective of the crawler is to keep the average freshness of pages in its collection as high as
possible, or to keep the average age of pages as low as possible. These objectives are not equivalent:
in the first case, the crawler is just concerned with how many pages are out-dated, while in the
second case, the crawler is concerned with how old the local copies of pages are.

Two simple re-visiting policies were studied by Cho and Garcia-Molina:

Uniform policy:

This involves re-visiting all pages in the collection with the same frequency, regardless of their rates
of change.

Proportional policy:

This involves re-visiting more often the pages that change more frequently; the visiting frequency is
directly proportional to the (estimated) change frequency. In terms of average freshness, the uniform
policy outperforms the proportional policy, in both a simulated Web and a real Web crawl. The
explanation for this result is that, when a page changes too often, the crawler will waste time trying
to re-crawl it too fast and still will not be able to keep its copy of the page fresh.

9.4.3 Politeness policy


Crawlers can retrieve data much more quickly and in greater depth than human searchers, so they can
have a crippling impact on the performance of a site. Needless to say, if a single crawler performs
multiple requests per second and/or downloads large files, a server will have a hard time keeping up
with requests from multiple crawlers.

As noted by Koster (Koster, 1995), the use of Web crawlers is useful for a number of tasks, but
comes with a price for the general community. The costs of using Web crawlers include:

• Network resources, as crawlers require considerable bandwidth and operate with a high degree of
parallelism during a long period of time.
• Server overload, especially if the frequency of accesses to a given server is too high.
• Poorly written crawlers, which can crash servers or routers, or which download pages they cannot
handle.
• Personal crawlers that, if deployed by too many users, can disrupt networks and Web servers.

A partial solution to these problems is the robots exclusion protocol, also known as the robots.txt
protocol (Koster, 1996), which is a standard that lets administrators indicate which parts of their Web
servers should not be accessed by crawlers. This standard does not include a suggestion for the
interval of visits to the same server, even though this interval is the most effective way of avoiding
server overload. The non-standard "Crawl-delay:" parameter in robots.txt may be used to indicate the
number of seconds to delay between requests, and some commercial search engines such as MSN and
Yahoo adhere to this interval.
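A minimal sketch of honouring robots.txt and the Crawl-delay parameter using Python's standard urllib.robotparser module (the site URL and user-agent name below are placeholders, not real ones):

    import time
    import urllib.robotparser

    USER_AGENT = "ExampleCrawler"                      # placeholder agent name

    rp = urllib.robotparser.RobotFileParser()
    rp.set_url("http://www.example.com/robots.txt")    # placeholder site
    rp.read()

    delay = rp.crawl_delay(USER_AGENT) or 1            # fall back to 1 second

    if rp.can_fetch(USER_AGENT, "http://www.example.com/some/page.html"):
        # ... fetch the page here ...
        time.sleep(delay)                              # pause before the next request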

Anecdotal evidence from access logs shows that access intervals from known crawlers vary between
20 seconds and 3-4 minutes. It is worth noting that even when a crawler is very polite and takes all the
safeguards to avoid overloading Web servers, some complaints from Web server administrators are
still received.

9.4.4 Parallelization policy


Main article: Distributed web crawling

A parallel crawler is a crawler that runs multiple processes in parallel. The goal is to maximize the
download rate while minimizing the overhead from parallelization and to avoid repeated downloads
of the same page. To avoid downloading the same page more than once, the crawling system requires
a policy for assigning the new URLs discovered during the crawling process, as the same URL can
be found by two different crawling processes. Cho and Garcia-Molina (Cho and Garcia-Molina,
2002) studied two types of policies:

Dynamic assignment:

With this type of policy, a central server assigns new URLs to different crawlers dynamically. This
allows the central server to, for instance, dynamically balance the load of each crawler. With dynamic
assignment, the system can typically also add or remove downloader processes. The central server
may become the bottleneck, so most of the workload must be transferred to the distributed crawling
processes for large crawls. Two configurations of crawling architectures with dynamic
assignment have been described by Shkapenyuk and Suel (Shkapenyuk and Suel, 2002):

A small crawler configuration, in which there is a central DNS resolver and central queues per
website, and distributed downloaders.

A large crawler configuration, in which the DNS resolver and the queues are also distributed.

Static assignment:

With this type of policy, there is a fixed rule stated from the beginning of the crawl that defines how
to assign new URLs to the crawlers.

For static assignment, a hashing function can be used to transform URLs (or, even better, complete
website names) into a number that corresponds to the index of the corresponding crawling process.
As there are external links that will go from a website assigned to one crawling process to a website
assigned to a different crawling process, some exchange of URLs must occur.
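A minimal sketch of such a static assignment, hashing the host name rather than the full URL so that all pages of one website go to the same process (the number of processes and the use of MD5 are illustrative; Python's built-in hash() is randomized between runs and would not give a fixed rule):

    import hashlib
    from urllib.parse import urlsplit

    NUM_CRAWLERS = 8   # illustrative number of crawling processes

    def assigned_crawler(url: str) -> int:
        """Map the URL's host to a fixed crawling-process index."""
        host = urlsplit(url).netloc.lower()
        digest = hashlib.md5(host.encode("utf-8")).digest()
        return int.from_bytes(digest[:4], "big") % NUM_CRAWLERS

    # A URL discovered by one process but assigned to another must be forwarded.
    print(assigned_crawler("http://www.example.com/index.html"))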

To reduce the overhead due to the exchange of URLs between crawling processes, the exchange
should be done in batch, several URLs at a time, and the most cited URLs in the collection should be
known by all crawling processes before the crawl (e.g.: using data from a previous crawl) (Cho and
Garcia-Molina, 2002).

An effective assignment function must have three main properties: each crawling process should get
approximately the same number of hosts (balancing property); if the number of crawling processes
grows, the number of hosts assigned to each process must shrink (contra-variance property); and the
assignment must be able to add and remove crawling processes dynamically. Boldi et al. (Boldi et
al., 2004) propose the use of consistent hashing, which replicates the buckets so that adding or
removing a bucket does not require re-hashing the whole table, to achieve all of the desired
properties. In this sense, crawling acts as an ongoing synchronization process between the Web and
the search engine.

URL normalization

Crawlers usually perform some type of URL normalization in order to avoid crawling the same
resource more than once. The term URL normalization, also called URL canonicalization, refers to
the process of modifying and standardizing a URL in a consistent manner. There are several types of
normalization that may be performed, including conversion of the URL to lowercase, removal of "."
and ".." path segments, and adding trailing slashes to the non-empty path component.

Crawler identification

Web crawlers typically identify themselves to a web server by using the User-agent field of an HTTP
request. Website administrators typically examine their web servers' logs and use the user agent field
to determine which crawlers have visited the web server and how often. The user agent field may
include a URL where the website administrator can find more information about the crawler.
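For illustration, a crawler written in Python might identify itself by sending a descriptive User-agent header with every request (the crawler name, contact URL and target URL below are placeholders):

    import urllib.request

    USER_AGENT = "ExampleBot/1.0 (+http://www.example.com/bot-info.html)"   # placeholder

    req = urllib.request.Request(
        "http://www.example.com/",                  # placeholder target page
        headers={"User-Agent": USER_AGENT},
    )
    with urllib.request.urlopen(req) as response:
        html = response.read()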

Spambots and other malicious web crawlers are unlikely to place identifying information in the user
agent field, or they may mask their identity as a browser or other well-known crawler.

It is important for web crawlers to identify themselves so website administrators can contact the
owner if needed. In some cases, crawlers may be accidentally trapped in a crawler trap or they may
be overloading a web server with requests, and the owner needs to stop the crawler. Identification is
also useful for administrators that are interested in knowing when they may expect their web pages to
be indexed by a particular search engine.

10. INDEXING
The web pages fetched by the Web crawlers have to be indexed for future use, so that
whenever a query is issued the relevant results can be produced by looking up the
index. The index stores the URL along with the keywords relevant to the page. It may also
store information such as the date on which the page was last visited and the date on which
the page should be revisited under the re-visit policy, along with the age and freshness
factors.
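As a minimal sketch (our own simplification, not the structure of any particular engine), such an index can be modelled as a mapping from keywords to the URLs that contain them, with per-page metadata for the re-visit policy kept alongside:

    from collections import defaultdict
    from datetime import date

    inverted_index = defaultdict(set)   # keyword -> set of URLs containing it
    page_info = {}                      # URL -> metadata used by the re-visit policy

    def index_page(url, keywords, last_visited, revisit_after_days=30):
        for word in keywords:
            inverted_index[word.lower()].add(url)
        page_info[url] = {"last_visited": last_visited,
                          "revisit_after_days": revisit_after_days}

    # Illustrative page and keywords
    index_page("http://www.example.com/diabetes.html",
               ["diabetes", "insulin", "treatment"],
               last_visited=date(2009, 3, 1))
    print(inverted_index["diabetes"])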

Given a Web page P, the importance of that page, I(P), can be evaluated in one of the following
ways:

1. Similarity to a Driving Query Q. Based on a query Q which drives the crawling process, IS(P) is
defined to be the textual similarity between P and Q.

2. Backlink Count. The value of IB(P) is the number of links to P that appear over the entire Web.
Intuitively, a page P that is linked to by many pages is more important than one that is seldom
referenced. A crawler may estimate the number of links to P that have been seen so far, which is
IB'(P).

3. PageRank. The IB(P) metric treats all links equally. Thus, a link from the Yahoo home page
counts the same as a link from some individual's home page. The PageRank backlink metric, IR(P),
recursively defines the importance of a page to be the weighted sum of the backlinks to it. The
weighted backlink count of page P is given by IR(P) = (1-d) + d ( IR(T1)/c1 + ... + IR(Tn)/cn ),
where T1, ..., Tn are the pages that link to P, ci is the number of links going out of page Ti, and d is
a damping factor (commonly set to 0.85). A small iterative sketch of this computation is given after
this list.

4. Forward Link Count. A metric IF(P) counts the number of links that emanate from P. Under this
metric, a page with many outgoing links is very valuable, since it may be a Web directory. This
metric can be computed directly from P.

5. Location Metric. The IL(P) importance of page P is a function of its location, not of its contents.
If URL u leads to P, then IL(P) is a function of u. For example, URLs ending in ".com" or containing
the string "home" may be deemed more useful.

10.1 RANKING

Most of the search engines return results with confidence or relevancy rankings. In other
words, they list the hits according to how closely they think the results match the query.
However, these lists often leave users shaking their heads in confusion, since, to the user,
the results may seem completely irrelevant.

Basically this happens because search engine technology has not yet reached the point where
humans and computers understand each other well enough to communicate clearly.

Most search engines use search term frequency as a primary way of determining whether a
document is relevant. If you're researching diabetes and the word "diabetes" appears multiple
times in a Web document, it's reasonable to assume that the document will contain useful
information. Therefore, a document that repeats the word "diabetes" over and over is likely
to turn up near the top of your list.

If your keyword is a common one, or if it has multiple other meanings, you could end up with
a lot of irrelevant hits. And if your keyword is a subject about which you desire information,
you don't need to see it repeated over and over--it's the information about that word that
you're interested in, not the word itself.

Some search engines consider both the frequency and the positioning of keywords to
determine relevancy, reasoning that if the keywords appear early in the document, or in the
headers, this increases the likelihood that the document is on target. For example, one
method is to rank hits according to how many times your keywords appear and in which
fields they appear (i.e., in headers, titles or plain text). Another method is to determine which
documents are most frequently linked to by other documents on the Web. The reasoning here is
that if other folks consider certain pages important, you should, too.
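A toy sketch of frequency-and-position scoring of the kind just described (the field weights are invented for illustration; real engines tune such weights far more carefully):

    import re

    FIELD_WEIGHTS = {"title": 3.0, "headers": 2.0, "body": 1.0}   # illustrative weights

    def score(document, query_terms):
        """document: dict with 'title', 'headers' and 'body' text fields."""
        total = 0.0
        for field, weight in FIELD_WEIGHTS.items():
            words = re.findall(r"\w+", document.get(field, "").lower())
            if not words:
                continue
            for term in query_terms:
                # term frequency in the field, normalized by field length
                total += weight * words.count(term.lower()) / len(words)
        return total

    doc = {"title": "Living with diabetes",
           "headers": "Diet and diabetes",
           "body": "Diabetes is a chronic condition that affects ..."}
    print(score(doc, ["diabetes"]))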

If you use the advanced query form on AltaVista, you can assign relevance weights to your
query terms before conducting a search. Although this takes some practice, it essentially
allows you to have a stronger say in what results you will get back.

As far as the user is concerned, relevancy ranking is critical, and becomes more so as the
sheer volume of information on the Web grows. Most of us don't have the time to sift
through scores of hits to determine which hyperlinks we should actually explore. The more
clearly relevant the results are, the more we're likely to value the search engine.

10.2 STORAGE COSTS AND CRAWLING TIME

Storage costs are not the limiting resource in search engine implementation. Simply storing 10
billion pages of 10kbytes each (compressed) requires 100TB and another 100TB or so for indexes,
giving a total hardware cost of under $200k: 400-500GB disk drives on 100 cheap PCs.

However, a public search engine requires considerably more resources than this to calculate query
results and to provide high availability. And the costs of operating a large server farm are not trivial.

Crawling 10B pages with 100 machines crawling at 100 pages/second would take 1M seconds, or
11.6 days on a very high capacity Internet connection. Most search engines crawl a small fraction of
the web (10-20% pages) at around this frequency or better, but also crawl dynamic web sites (e.g.
news sites and blogs ) at a much higher frequency.
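The crawl-time figure above follows directly from the numbers given; a quick back-of-the-envelope check:

    pages = 10_000_000_000              # 10 billion pages
    machines = 100
    rate = 100                          # pages per second per machine

    seconds = pages / (machines * rate) # 1,000,000 seconds
    days = seconds / (60 * 60 * 24)     # about 11.6 days
    print(seconds, round(days, 1))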

10.3 Features of different Search Engines :

10.3.1 AltaVista
Alta Vista is a fast, powerful search engine with enough bells and whistles to do an
extremely complex search, but first you have to master all its options. If you're serious about
Web searching, however, mastering Alta Vista is a wise policy.

Type of search: Keyword

Search options: Simple or Advanced search, search refining.

Domains searched: Web, Usenet

Search refining: Boolean "AND," "OR" and "NOT," plus the proximal locator
"NEAR." Allows wildcards and "backwards" searching (i.e., you can find all the
other web sites that link to another page). You can decide how search terms should
be weighed, and where in the document to look for them. Powerful search refining
tools, and the more refining you do, the better your results are.

Relevance ranking: Ranks according to how many of your search terms a page
contains, where in the document, and how close to one another the search terms are.

Results presented as: First several lines of document. "Detailed" summaries don't
appear any more detailed than "standard" ones.

User interface: Reasonably good, but not very friendly to the casual user.
Advanced query now allows you to further refine your search at the end of each
results page. You can also visit specialized zones or channels in areas like finance,
travel, news.

Help files: Complete, but confusing. Too much thrown at you at once. More
clarity and more explanation of options would be appreciated!

Good points: Fast searches, capitalization and proper nouns recognized, largest
database; finds things others don't. Alta Vista searches both the Web and Usenet. It
will search on both words and on phrases, including names and titles. You can even
search to discover how many people have linked their site to yours. You can also
have the resulting pages of your searches translated into several other languages.

Bad points: Multiple pages from the same site show up too frequently; some
curious relevancy rankings, especially on Simple search.

Overall Rating: A-

10.3.2 Lycos
Type of search: Keyword, but Lycos is gradually becoming less of a search engine,
it seems, and more of a Yahoo-like subject index. Has recently had a cool graphical
facelift. Proud of its ability to search on image and sound files.

Search options: Basic or Advanced

Domains searched: Web, Usenet, News, Stocks, Weather, Multimedia.

Search refining : Lycos now has full Boolean capabilities (using choices on drop-
down forms).

Relevance ranking: Lycos no longer provides a relevancy ranking.

Results presented as: First 100 or so words in simple search; in advanced search
you choose--summary, full results or short version.

User interface: Clean, clear, focuses more on directory now than on simple search.

Help files: Good, informative, graphical help screens are easy to understand.

Good points: Large database. Comprehensive results given--i.e., the date of the
document, its size, etc. Lycos indexes the frequency with which documents are
linked to by other documents to make sure the most popular web sites are found and
indexed before the less popular ones.

Overall Rating: B+

10.3.3 Webcrawler
AOL owns Webcrawler, but AOL's new deal with Excite means that the
Webcrawler search engine and directory will be incorporated into Excite.

Type of search: Keyword

Search options: Simple, refined

Domains searched: Web, Usenet

Search refining : Uses either "and" or "any." Webcrawler has added full Boolean
search term capability, including AND, OR, AND NOT, ADJ, (adjacent) and
NEAR.

Relevance ranking: Yes--frequency calculated--computes the total number of times
your keywords appear in the document and divides it by the total number of words
in the document. Webcrawler returns surprisingly relevant results.

Results presented as: lists of hyperlinks or summaries, as the user chooses.

User interface: Good--easy and fun to use

Help files: Useful tips and FAQ.

Good points: Easy to use. Popular on the Web because it belongs to AOL and
there are a lot of websurfers who sign on from AOL. Publishes usage statistics on
their site. Also provides a service by which you can check to see whether a
particular URL is in their index, and, if so, when it was last visited by their
"spider." There is also some fascinating information about how Webcrawler's
search strategy works.

Bad points: Speed seems to be slowing down a little recently. Its previous
weakness--no way to refine search--has been eliminated with the addition of
Boolean operators.

Overall Rating: B-

10.3.4 HotBot
Type of search: Keyword

Search options: Simple, Modified, Expert

Domains searched: Web

Search refining: Multiple types, including by phrase, person and Boolean-like
choices in pull-down boxes. No proximal operators at present. In Expert searches
you can search by date and even by different media types (Java, JavaScript ,
Shockwave, VRML, etc.).

Relevance ranking: Yes. Methods used--search terms in the title will be ranked
higher than search terms in the text. Frequency also counts, and will result in higher
rankings when search terms appear more frequently in short documents than when
they appear frequently in very long documents. (This sounds sensible and useful.)

Results presented as: Relevancy score and URL

User interface: Very cool and lively. Some users have complained about the bright
green background, but we kind of like it.

Help files: A FAQ that answers users' questions, but not a lot of serious help files.

Good points: Claims to be fast because of the use of parallel processing, which
distributes the load of queries as well as the database over several work stations.

Bad points: Some limitations still on Boolean operators, and the help files still
aren't very good.

Overall Rating: B

10.3.5 Yahoo
Although not precisely a search engine site, Yahoo is an important Web resource. It
works as a hierarchical subject index, allowing you to drill down from the general
to the specific. Yahoo is an attempt to organize and catalogue the Web.

Yahoo also has search capabilities. You can search the Yahoo index (note: when
you do this you are not searching the entire Web). If your query gets no hits in this
manner, Yahoo offers you the option of searching Alta Vista, which does search
the entire Web.

Yahoo will also automatically feed your query into the other major search engine
sites if you so desire. Thus, Yahoo has the capacity to act as a kind of meta-search
engine.

Type of search: Keyword

Search options: Simple, Advanced

Domains searched: Yahoo's index, Usenet, E-mail addresses. Yahoo searches
titles, URLs and the brief comments or descriptions of the Web sites Yahoo indexes.

Search refining: Boolean AND and OR. Yahoo is case insensitive.

Relevance ranking: Since Yahoo returns relatively few hits (it will never return
more than 100), it's not clear how results are ranked.

Results presented as: Yahoo tells you the category where a hit is found, then gives
you a two-line description of the site.

User interface: Excellent, easy-to-use

Help files: Not very complete, but since there aren't a lot of search options, detailed
help files are not necessary.

Good points: Easy-to-navigate subject catalogue. If you know what you want to
find, Yahoo should be your first stop on the Web.

Bad points: Only a small portion of the Web has actually been catalogued by
Yahoo.

Overall rating: A (This rating refers simply to Yahoo's quality as a directory--
searches of the entire Web are not possible.)

11. GOOGLE SEARCH ENGINE


GOOGLE is one of the most popular, widely used and efficient search engines. Hence
here we discuss the various policies used by Google in web crawling, indexing and
ranking.

11.2 GOOGLE ARCHITECTURE

Most of Google is implemented in C or C++ for efficiency and can run on either Solaris or Linux.

Figure 1. High Level Google Architecture

A. Crawlers: In Google, the web crawling (downloading of web pages) is done by several
distributed crawlers.
B. URL Servers: The URLserver sends lists of URLs to be fetched to the crawlers. The web
pages that are fetched are then sent to the storeserver.
C. Store Server: The storeserver compresses and stores the web pages into a repository. Every
web page has an associated ID number called a docID which is assigned whenever a new
URL is parsed out of a web page.
D. Indexer: The indexing function is performed by the indexer and the sorter. The indexer
performs a number of functions. It reads the repository, uncompresses the documents, and
parses them. Each document is converted into a set of word occurrences called hits. The hits
record the word, position in document, an approximation of font size, and capitalization. The
indexer distributes these hits into a set of "barrels", creating a partially sorted forward index.
The indexer performs another important function. It parses out all the links in every web page
and stores important information about them in an anchors file. This file contains enough
information to determine where each link points from and to, and the text of the link.
E. URL Resolver: The URLresolver reads the anchors file, converts relative URLs
into absolute URLs and in turn into docIDs. It puts the anchor text into the forward index,
associated with the docID that the anchor points to. It also generates a database of links,
which are pairs of docIDs. The links database is used to compute PageRanks for all the
documents.
F. Sorter: The sorter takes the barrels, which are sorted by docID and resorts them by wordID
to generate the inverted index. This is done in place so that little temporary space is needed
for this operation. The sorter also produces a list of wordIDs and offsets into the inverted
index.
G. Lexicon: A program called DumpLexicon takes this list together with the lexicon produced
by the indexer and generates a new lexicon to be used by the searcher. The searcher is run by

a web server and uses the lexicon built by DumpLexicon together with the inverted index and
the PageRanks to answer queries.

11.3 GOOGLE CRAWLER


In order to scale to hundreds of millions of web pages, Google has a fast distributed crawling system.
A single URLserver serves lists of URLs to a number of crawlers (we typically ran about 3). Both
the URLserver and the crawlers are implemented in Python. Each crawler keeps roughly 300
connections open at once. This is necessary to retrieve web pages at a fast enough pace. At peak
speeds, the system can crawl over 100 web pages per second using four crawlers. This amounts to
roughly 600K per second of data. A major performance stress is DNS lookup. Each crawler
maintains a its own DNS cache so it does not need to do a DNS lookup before crawling each
document. Each of the hundreds of connections can be in a number of different states: looking up
DNS, connecting to host, sending request, and receiving response. These factors make the crawler a
complex component of the system. It uses asynchronous IO to manage events, and a number of
queues to move page fetches from state to state.
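As a rough, modern illustration of this style of event-driven fetching (this is not Google's code; it uses the third-party aiohttp library together with asyncio, and the connection limit and URLs are placeholders):

    import asyncio
    import aiohttp   # third-party library, used here purely for illustration

    MAX_CONNECTIONS = 300                              # illustrative per-crawler limit

    async def fetch(session, semaphore, url):
        async with semaphore:                          # cap simultaneous connections
            async with session.get(url) as resp:       # connect, send request, receive response
                return url, await resp.text()

    async def crawl(urls):
        semaphore = asyncio.Semaphore(MAX_CONNECTIONS)
        async with aiohttp.ClientSession() as session:
            tasks = [fetch(session, semaphore, u) for u in urls]
            return await asyncio.gather(*tasks, return_exceptions=True)

    # asyncio.run(crawl(["http://www.example.com/"]))  # placeholder URL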

11.4 GOOGLE DATA STRUCTURES


Google's data structures are optimized so that a large document collection can be crawled, indexed,
and searched with little cost. Although CPUs and bulk input/output rates have improved
dramatically over the years, a disk seek still requires about 10 ms to complete. Google is designed to
avoid disk seeks whenever possible, and this has had a considerable influence on the design of the
data structures.
The data structures maintained by Google are as follows:
1. BigFiles:
BigFiles are virtual files spanning multiple file systems and are addressable by 64 bit
integers. The allocation among multiple file systems is handled automatically. The BigFiles
package also handles allocation and deallocation of file descriptors, since the operating systems
do not provide enough for our needs. BigFiles also support rudimentary compression options.
2. Repository: The repository contains the full HTML of every web page. Each page is
compressed using zlib (see RFC1950). The choice of compression technique is a tradeoff
between speed and compression ratio. In the repository, the documents are stored one after
the other and are prefixed by docID, length, and URL. The repository requires no other data
structures to be used in order to access it. This helps with data consistency and makes
development much easier; we can rebuild all the other data structures from only the
repository and a file which lists crawler errors.

[Figure: Repository data structure]
3. Document Index:

The document index keeps information about each document. It is a fixed width ISAM
(Index sequential access mode) index, ordered by docID. The information stored in each entry
includes the current document status, a pointer into the repository, a document checksum, and
various statistics. If the document has been crawled, it also contains a pointer into a variable width
file called docinfo which contains its URL and title. Otherwise the pointer points into the URLlist
which contains just the URL. This design decision was driven by the desire to have a reasonably
compact data structure, and the ability to fetch a record in one disk seek during a search.

Additionally, there is a file which is used to convert URLs into docIDs. It is a list of URL checksums
with their corresponding docIDs and is sorted by checksum. In order to find the docID of a particular
URL, the URL's checksum is computed and a binary search is performed on the checksums file to
find its docID. URLs may be converted into docIDs in batch by doing a merge with this file. This is
the technique the URLresolver uses to turn URLs into docIDs. This batch mode of update is crucial
because otherwise we must perform one seek for every link, which, assuming one disk, would take
more than a month for our 322 million link dataset.
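A simplified sketch of this checksum-based lookup (in-memory lists stand in for the on-disk checksum file, and CRC32 stands in for whatever checksum function is actually used):

    import bisect
    import zlib

    checksums = []   # sorted URL checksums
    docids = []      # docIDs in the same order

    def url_checksum(url: str) -> int:
        return zlib.crc32(url.encode("utf-8"))

    def add_url(url, docid):
        c = url_checksum(url)
        pos = bisect.bisect_left(checksums, c)
        checksums.insert(pos, c)
        docids.insert(pos, docid)

    def lookup_docid(url):
        c = url_checksum(url)
        pos = bisect.bisect_left(checksums, c)
        if pos < len(checksums) and checksums[pos] == c:
            return docids[pos]
        return None                                    # URL not yet known

    add_url("http://www.example.com/", 42)
    print(lookup_docid("http://www.example.com/"))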
4. Lexicon:
The lexicon has several different forms. One important change from earlier systems is that
the lexicon can fit in memory for a reasonable price. It is implemented in two parts -- a list of the
words (concatenated together but separated by nulls) and a hash table of pointers. For various
functions, the list of words has some auxiliary information.
5. Hit List:
A hit list corresponds to a list of occurrences of a particular word in a particular document
including position, font, and capitalization information. Hit lists account for most of the space
used in both the forward and the inverted indices. Because of this, it is important to represent
them as efficiently as possible. We considered several alternatives for encoding position, font,
and capitalization -- simple encoding (a triple of integers), a compact encoding (a hand optimized
allocation of bits), and Huffman coding. In the end we chose a hand optimized compact encoding
since it required far less space than the simple encoding and far less bit manipulation than
Huffman coding.

The compact encoding uses two bytes for every hit. There are two types of hits: fancy hits and plain hits.
Fancy hits include hits occurring in a URL, title, anchor text, or meta tag. Plain hits include
everything else. A plain hit consists of a capitalization bit, font size, and 12 bits of word position in a
document (all positions higher than 4095 are labeled 4096). Font size is represented relative to the
rest of the document using three bits (only 7 values are actually used because 111 is the flag that
signals a fancy hit). A fancy hit consists of a capitalization bit, the font size set to 7 to indicate it is a
fancy hit, 4 bits to encode the type of fancy hit, and 8 bits of position. For anchor hits, the 8 bits of
position are split into 4 bits for position in anchor and 4 bits for a hash of the docID the anchor
occurs in. This gives us some limited phrase searching as long as there are not that many anchors for
a particular word. We expect to update the way that anchor hits are stored to allow for greater
resolution in the position and docIDhash fields. We use font size relative to the rest of the document
because when searching, you do not want to rank otherwise identical documents differently just
because one of the documents is in a larger font. The length of a hit list is stored before the hits
themselves. To save space, the length of the hit list is combined with the wordID in the forward
index and the docID in the inverted index. This limits it to 8 and 5 bits respectively (there are some
tricks which allow 8 bits to be borrowed from the wordID). If the length is longer than would fit in
that many bits, an escape code is used in those bits, and the next two bytes contain the actual length.
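A sketch of packing a plain hit into two bytes along the lines described above (the exact bit layout is not fully specified by the description, so the field order chosen here is our own):

    def pack_plain_hit(capitalized: bool, font_size: int, position: int) -> int:
        """1 capitalization bit, 3 font-size bits, 12 position bits."""
        assert 0 <= font_size <= 6            # 7 is reserved to flag a fancy hit
        position = min(position, 4095)        # positions beyond the 12-bit range are capped
        return (int(capitalized) << 15) | (font_size << 12) | position

    def unpack_plain_hit(hit: int):
        return bool(hit >> 15), (hit >> 12) & 0b111, hit & 0xFFF

    h = pack_plain_hit(True, 3, 120)
    print(h.to_bytes(2, "big"), unpack_plain_hit(h))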

6. Forward Index:
The forward index is actually already partially sorted. It is stored in a number of barrels (we
used 64). Each barrel holds a range of wordID's. If a document contains words that fall into a
particular barrel, the docID is recorded into the barrel, followed by a list of wordID's with hitlists
which correspond to those words. This scheme requires slightly more storage because of
duplicated docIDs but the difference is very small for a reasonable number of buckets and saves
considerable time and coding complexity in the final indexing phase done by the sorter.
Furthermore, instead of storing actual wordID's, we store each wordID as a relative difference
from the minimum wordID that falls into the barrel the wordID is in. This way, we can use just
24 bits for the wordID's in the unsorted barrels, leaving 8 bits for the hit list length.
7. Inverted Index:
The inverted index consists of the same barrels as the forward index, except that they have been
processed by the sorter. For every valid wordID, the lexicon contains a pointer into the barrel that
wordID falls into. It points to a doclist of docID's together with their corresponding hit lists. This
doclist represents all the occurrences of that word in all documents.

An important issue is in what order the docID's should appear in the doclist. One simple solution is
to store them sorted by docID. This allows for quick merging of different doclists for multiple word
queries. Another option is to store them sorted by a ranking of the occurrence of the word in each
document. This makes answering one word queries trivial and makes it likely that the answers to
multiple word queries are near the start. However, merging is much more difficult. Also, this makes
development much more difficult in that a change to the ranking function requires a rebuild of the
index. We chose a compromise between these options, keeping two sets of inverted barrels -- one set
for hit lists which include title or anchor hits and another set for all hit lists. This way, we check the
first set of barrels first and if there are not enough matches within those barrels we check the larger
ones.
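A sketch of the docID-ordered merge that makes multi-word queries cheap (plain Python lists stand in for the on-disk doclists; the docIDs are made up):

    def intersect_doclists(list_a, list_b):
        """Both inputs are lists of docIDs sorted in increasing order."""
        result = []
        i = j = 0
        while i < len(list_a) and j < len(list_b):
            if list_a[i] == list_b[j]:
                result.append(list_a[i])
                i += 1
                j += 1
            elif list_a[i] < list_b[j]:
                i += 1
            else:
                j += 1
        return result

    # docIDs of documents containing "search" and "engine" respectively
    print(intersect_doclists([2, 5, 9, 14, 30], [5, 8, 14, 21]))   # -> [5, 14]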

11.5 GOOGLE INDEXING
• Parsing -- Any parser which is designed to run on the entire Web must handle a huge array
of possible errors. These range from typos in HTML tags to kilobytes of zeros in the middle
of a tag, non-ASCII characters, HTML tags nested hundreds deep, and a great variety of other
errors that challenge anyone's imagination to come up with equally creative ones. For
maximum speed, instead of using YACC to generate a CFG parser, we use flex to generate a
lexical analyzer which we outfit with its own stack. Developing this parser which runs at a
reasonable speed and is very robust involved a fair amount of work.
• Indexing Documents into Barrels -- After each document is parsed, it is encoded into a
number of barrels. Every word is converted into a wordID by using an in-memory hash table
-- the lexicon. New additions to the lexicon hash table are logged to a file. Once the words are
converted into wordID's, their occurrences in the current document are translated into hit lists
and are written into the forward barrels. The main difficulty with parallelization of the
indexing phase is that the lexicon needs to be shared. Instead of sharing the lexicon, we took
the approach of writing a log of all the extra words that were not in a base lexicon, which we
fixed at 14 million words. That way multiple indexers can run in parallel and then the small
log file of extra words can be processed by one final indexer.
• Sorting -- In order to generate the inverted index, the sorter takes each of the forward barrels
and sorts it by wordID to produce an inverted barrel for title and anchor hits and a full text
inverted barrel. This process happens one barrel at a time, thus requiring little temporary
storage. Also, we parallelize the sorting phase to use as many machines as we have simply by
running multiple sorters, which can process different buckets at the same time. Since the
barrels don't fit into main memory, the sorter further subdivides them into baskets which do
fit into memory based on wordID and docID. Then the sorter loads each basket into memory,
sorts it and writes its contents into the short inverted barrel and the full inverted barrel.

11.6 GOOGLE RANKING SYSTEM

Google maintains much more information about web documents than typical search engines. Every
hitlist includes position, font, and capitalization information. Additionally, Google factors in hits
from anchor text and the PageRank of the document. Combining all of this information into a rank is
difficult. Google's ranking function does not allow a particular factor to have too much influence.
First, consider the simplest case -- a single word query. In order to rank a document with a single
word query, Google looks at that document's hit list for that word. Google considers each hit to be
one of several different types (title, anchor, URL, plain text large font, plain text small font, ...), each
of which has its own type-weight. The type-weights make up a vector indexed by type. Google
counts the number of hits of each type in the hit list. Then every count is converted into a count-
weight. Count-weights increase linearly with counts at first but quickly taper off so that more than a
certain count will not help. Then Google takes the dot product of the vector of count-weights with
the vector of type-weights to compute an IR score for the document. Finally, the IR score is
combined with PageRank to give a final rank to the document.
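A toy version of the single-word scoring just described (the type-weights, the tapering function and the way PageRank is mixed in are all invented for illustration; Google does not publish these values):

    import math

    TYPE_WEIGHTS = {"title": 8.0, "anchor": 6.0, "url": 4.0,
                    "large_font": 2.0, "plain": 1.0}            # illustrative values

    def count_weight(count):
        """Grows with the count at first, then tapers off."""
        return math.log1p(count)

    def ir_score(hit_counts):
        """hit_counts: dict mapping hit type -> number of hits of that type."""
        return sum(TYPE_WEIGHTS[t] * count_weight(c) for t, c in hit_counts.items())

    def final_rank(hit_counts, pagerank, mix=0.5):
        return mix * ir_score(hit_counts) + (1 - mix) * pagerank   # illustrative combination

    print(final_rank({"title": 1, "plain": 12}, pagerank=3.2))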

For a multi-word search, the situation is more complicated. Now multiple hit lists must be scanned
through at once so that hits occurring close together in a document are weighted higher than hits
occurring far apart. The hits from the multiple hit lists are matched up so that nearby hits are
matched together. For every matched set of hits, a proximity is computed. The proximity is based on
how far apart the hits are in the document (or anchor) but is classified into 10 different value "bins"
ranging from a phrase match to "not even close". Counts are computed not only for every type of hit
but for every type and proximity. Every type and proximity pair has a type-prox-weight. The counts
are converted into count-weights and the dot product of the count-weights and the type-prox-weights
is used to compute an IR score. All of these numbers and matrices can be displayed with the
search results using a special debug mode. These displays have been very helpful in developing the
ranking system.

11.7 GOOGLE QUERY EVALUATION


1. Parse the query.
2. Convert words into wordIDs.
3. Seek to the start of the doclist in the short barrel for every word.
4. Scan through the doclists until there is a document that matches all the search terms.
5. Compute the rank of that document for the query.
6. If we are in the short barrels and at the end of any doclist, seek to the start of the doclist in the
full barrel for every word and go to step 4.
7. If we are not at the end of any doclist, go to step 4.
8. Sort the documents that have matched by rank and return the top k.
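Tying the steps together, a highly simplified sketch (dictionaries stand in for the lexicon and the short/full barrels, and ranking is delegated to a caller-supplied function):

    def evaluate(query, lexicon, short_barrels, full_barrels, rank, k=10):
        """lexicon: word -> wordID; each barrel: wordID -> sorted list of docIDs."""
        word_ids = [lexicon[w] for w in query.lower().split() if w in lexicon]
        if not word_ids:
            return []

        matches = set()
        for barrels in (short_barrels, full_barrels):      # short barrels first
            doclists = [barrels.get(wid, []) for wid in word_ids]
            if all(doclists):
                matches |= set(doclists[0]).intersection(*doclists[1:])
            if len(matches) >= k:                          # enough matches already
                break

        scored = sorted(matches, key=lambda doc: rank(doc, word_ids), reverse=True)
        return scored[:k]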

12. References:

12.1 Video Segmentation


Video Browsing and Retrieval Based on Multimodal Integration
Yingying Zhu, Dongru Zhou
College of Computer Science, Wuhan University, Luojia Hill, Wuhan ,430072, P.R.China
E-mail: breezeeyy@sohu.com , wuzy541@sina.com

12.2 Texture:
[1] Artificial Intelligence for Maximizing Content Based Image Retrieval

Zongmin Ma Northeastern University, China


Review on Texture Feature Extraction and Description Methods in Content-Based Medical
Image Retrieval
Gang Zhang, College of Information Science and Engineering, Northeastern University,
China
Z. M. Ma, College of Information Science and Engineering, Northeastern University, China
Li Yan, College of Information Science and Engineering, Northeastern University, China
Ji-feng Zhu, College of Information Science and Engineering, Northeastern University, China

[2] A Method for Evaluating the Performance of Content-Based Image Retrieval Systems


[1] Chang, E. Y., Li Beitao, and Li Chen. "Toward Perception-
Based Image Retrieval." Proceedings IEEE Workshop on
Content-Based Access of Image and Video Libraries. IEEE
Comput. Soc Los Alamitos CA USA, 2000. viii+119.
[2] Frese, T., C. A. Bouman, and J. P. Allebach. "Methodology
for Designing Image Similarity Metrics Based on Human
Visual System Models." Proceedings of the SPIE The Intl
Society for Optical Engineering 3016 (1997): 472-83.

12.3Audio:

Transform-Based Indexing of Audio Data for Multimedia Databases


S.R. Subramanya , Dept. of EE & CS, George Washington University

Rahul Simha B. Narahari Abdou Youssef , Dept. of CS Dept. of EE&CS ,College of George
Washington , William and Mary University

[1] Marven, C. and Ewers, G. A Simple Approach to Digital
Signal Processing, Wiley-Interscience, 1996.
[2] A. D. Alexandrov et al. Adaptive filtering and indexing
for image databases. SPIE, Vol. 2420, pp. 12-22.
[3] M. D'Allegrand. Handbook of image storage and retrieval
systems. Van Nostrand Reinhold, New York, 1992.
[4] W. Grosky and R. Mehrotra, eds. Special issue on image
database management. IEEE Computer, Vol. 22, No. 12, Dec 1989.
[5] V. Gudivada and V. Raghavan. Special issue on
content-based image retrieval systems. IEEE Computer,
Sept. 1995, Vol. 28, No. 9.
[6] R. Jain et al. Similarity measures for image
databases. SPIE, Vol. 2420, pp. 58-61.
[7] A. D. Narasimhalu, ed. Special issue on content-based
retrieval. ACM Multimedia Systems, Vol. 3, No. 1, Feb 1995.
[8] Hawley, M. J. Structure of Sound, Ph.D. Thesis, MIT,
Sept. 1993.
[9] E. Wold et al. Content-based classification, search and
retrieval of audio data. IEEE Multimedia Magazine, 1996.
[10] A. Ghias et al. Query by humming. Proc. ACM Multimedia
Conf., 1995.

12.4 Colour:

An Improving Technique of Color Histogram in Segmentation-based Image


Retrieval
[1] R. Brunelli, and O. Mich , “Histograms Analysis for Image Retrieval”,
Pattern Recognition, Vol. 34, 2001, pp.1625-1637.
[2] Y. K. Chan, and C. C. Chang, “A Color Image Retrieval System
Based on the Run-Length Representation”, Pattern Recognition Letter,
Vol. 22, 2001, pp. 447-455.
[3] W. J. Kuo, "Study on Image Retrieval and Ultrasonic Diagnosis of
Breast Tumors," Dissertation, Department of Computer Science and
Information Engineering, National Chung Cheng University, Chiayi,
Taiwan, R.O.C, January 2001.

12.5 Search Engine ,Web Crawler ,Indexing , Google Search Engine


The following list contains the web sites that were used as references by our project group.

1. http://www.serachtools.com
2. http://www.press.umich.edu/jep/07-01/bergman.com
3. http://www.robotstxt.org
4. http://www.archives/eprints.org
5. http://en.wikipedia.org/wiki/Web_crawler
6. http://www.searchengineshowdown.com/features/google/review.html
7. http://en.wikipedia.org/wiki/Search_engine
8. http://www.openarchives.org/registar/browsesites

