
SKF 2523: Information Retrieval System
Lecture 8: Automatic indexing, Abstracting and Bibliographies

Automatic indexing and file organization

Introduction

Definition of automatic indexing: when the assignment of content identifiers is carried out with the aid of modern computing equipment, the operation becomes automatic indexing.
The subject of a document can be derived by a mechanical analysis of the words in the document and by their arrangement in the text.

The Process of Indexing

The advantages of automatic indexing:

1. The level of consistency in indexing can be maintained.
2. Index entries can be produced at a lower cost in the long run.
3. Indexing time can be reduced.
4. Better retrieval effectiveness can be achieved.

The process of indexing

The idea of analyzing the subject of a document through automatic counting of term occurrences was first put forward by H. P. Luhn in 1957. Luhn proposed that:

1. The frequency of word occurrence in an article furnishes a useful measure of word significance.
2. The relative position of a word within a sentence furnishes a useful measurement for determining the significance of sentences.
3. The significance factor of a sentence will be based on a combination of these two requirements.

Process of indexing

The basic idea behind Luhn's theory was that the more
frequent the occurrence of a term in a given document,
the more significant is that term in denoting the subject
content of the document. Therefore, by counting the
frequency of occurrences of all the words in a given
document one can identify the most significant words
that can represent the subject of the document.
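
Luhn's counting idea can be sketched in a few lines of Python. This is only an illustration: the function name word_frequencies and the sample text are assumptions made here, not part of Luhn's work or of this lecture.

```python
import re
from collections import Counter


def word_frequencies(text: str) -> Counter:
    """Count how often each word occurs in a document (case-insensitive)."""
    words = re.findall(r"[a-z]+", text.lower())
    return Counter(words)


# Hypothetical sample document
doc = ("Indexing systems index documents. Automatic indexing counts words; "
       "indexing is fast.")
freq = word_frequencies(doc)
print(freq.most_common(1))  # [('indexing', 3)]
```

In longer texts the top of such a list is usually dominated by function words such as "the" and "of", which is why the indexing steps on the next slide remove them with a stop-word list before weighting.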

The process of indexing

The steps for preparing an automatic index are:

1. Identification of all words occurring in all the documents in a given collection.
2. Deletion of function words by consulting a stop-word list.
3. Preparation of word stems by a suffix-stripping method, which helps decrease the number of words by producing a common stripped form for the words that have the same root (e.g. COMPUT for COMPUTING, COMPUTER, COMPUTATION, COMPUTATIONAL, etc.).
4. Computation of the weight of each term in each document, so that terms with a low weight can be eliminated.
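
The four steps above can be sketched as follows. This is a minimal sketch under stated assumptions: the stop-word list is a tiny illustrative one, the suffix list is a crude stand-in for a real stemming algorithm, and the weight used (term frequency times the logarithm of inverse document frequency) is one common choice that the lecture does not prescribe. All names are hypothetical.

```python
import math
import re
from collections import Counter

# Assumption: a tiny illustrative stop-word list
STOP_WORDS = {"the", "of", "a", "an", "and", "in", "is", "for", "to", "by"}
# Assumption: crude suffix list, longest first; not a real stemming algorithm
SUFFIXES = ("ational", "ation", "ing", "er", "s")


def stem(word: str) -> str:
    """Strip the first matching suffix, keeping at least three letters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def index_collection(docs: dict, min_weight: float = 0.0) -> dict:
    # Steps 1-3: identify words, delete stop words, reduce words to stems
    stems_per_doc = {}
    for doc_id, text in docs.items():
        words = re.findall(r"[a-z]+", text.lower())
        stems_per_doc[doc_id] = Counter(
            stem(w) for w in words if w not in STOP_WORDS
        )
    # Step 4: weight = tf * log(N / df), then eliminate low-weight terms
    n_docs = len(docs)
    doc_freq = Counter()
    for counts in stems_per_doc.values():
        doc_freq.update(counts.keys())
    index = {}
    for doc_id, counts in stems_per_doc.items():
        weights = {
            term: tf * math.log(n_docs / doc_freq[term])
            for term, tf in counts.items()
        }
        index[doc_id] = {t: w for t, w in weights.items() if w > min_weight}
    return index


# Hypothetical three-document collection
docs = {
    "d1": "Computing the computation of a computer",
    "d2": "The computer stores documents",
    "d3": "Indexing of documents",
}
index = index_collection(docs, min_weight=0.5)
print(index)
```

With this sample, COMPUTING, COMPUTATION and COMPUTER in d1 all collapse to the stem "comput", and terms that occur in two of the three documents fall below the threshold in d2 and d3.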

Automatic indexing in the SMART retrieval project

1. Identification of individual word occurrences from document or query texts.
2. Deletion of common function words and suffix stripping for generation of word forms.
3. Computation of term weights for all the terms.
4. Formation of phrases for high-frequency terms with inadequate term weights, based on term co-occurrences in the original sentences.
5. Formation of thesaurus classes for low-frequency terms with inadequate term weights, using the single-link term classification process.
6. Use of the remaining terms, term phrases, and thesaurus classes, with appropriate term weights, to construct document and query vectors.
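
Once documents and queries are represented as weighted term vectors, they can be compared numerically. The sketch below uses the cosine measure, which is commonly associated with the vector space model; the lecture does not say which measure is used, so treat this choice, the function name cosine, and the sample weights as assumptions.

```python
import math


def cosine(u: dict, v: dict) -> float:
    """Cosine of the angle between two sparse term-weight vectors."""
    dot = sum(w * v.get(term, 0.0) for term, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical document vectors (term -> weight) and a one-term query
doc_vectors = {
    "d1": {"index": 2.0, "retriev": 1.0},
    "d2": {"abstract": 3.0},
}
query = {"index": 1.0}

# Rank documents by decreasing similarity to the query
ranked = sorted(doc_vectors,
                key=lambda d: cosine(query, doc_vectors[d]),
                reverse=True)
print(ranked)  # ['d1', 'd2']
```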

Inverted file techniques for organizing the index file are used in all the major IR systems available today, including online and CD-ROM databases.
As the sophistication of IR systems increased, several improvements took place in the organization of, and access to, inverted files.
Indexing and file organization have drawn much attention from researchers since the introduction of computers in IR systems. With the development of the internet and the WWW, a number of researchers are now engaged in developing improved indexing and file organization techniques for managing digital information resources.
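
For readers who have not seen one, an inverted file maps each index term to the list of documents that contain it, so a search looks up terms instead of scanning documents. The following is a minimal sketch with hypothetical names and data; production systems add term positions, compression, and on-disk storage.

```python
from collections import defaultdict


def build_inverted_file(docs: dict) -> dict:
    """Map each term to the sorted list of document ids that contain it."""
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            postings[term].add(doc_id)
    return {term: sorted(ids) for term, ids in postings.items()}


# Hypothetical collection
docs = {
    "d1": "automatic indexing",
    "d2": "automatic abstracting",
    "d3": "manual indexing",
}
inverted = build_inverted_file(docs)
print(inverted["automatic"])  # ['d1', 'd2']

# A Boolean AND query is an intersection of two posting lists
hits = sorted(set(inverted["automatic"]) & set(inverted["indexing"]))
print(hits)  # ['d1']
```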

Abstract and abstracting

Definitions

An abstract is a summary or an epitome of a book, scientific article, or legal document. Ranganathan defines an abstract as a summary, usually by a professional other than the author, of the essential contents of a work, usually an article in a periodical, together with the specification of its original.
Ashworth defines the term abstract as a précis of information, which in its narrower sense now usually refers to the information contained in an article in a periodical, short pamphlet, or serial publication.
According to the ISO, the term abstract signifies an abbreviated, accurate representation of a document without added interpretation or criticism and without distinction as to who wrote the abstract.
Lancaster defines an abstract as a brief but accurate representation of the contents of a document.
Rowley defines an abstract as a concise and accurate representation of the contents of a document in a style similar to that of the original document.

Types of Abstract

Abstracts may differ according to their writer, purpose, and style. Guinchat and Menou suggest that the various types of abstract can be distinguished by:

- Their length, which normally ranges from a few dozen to several hundred words, and is occasionally over a thousand.
- The amount of detail: certain abstracts (known as indicative abstracts) simply provide a brief summary, whereas others (known as informative abstracts) include a varying number of points that are likely to interest the user.
- The inclusion of judgments or critical analysis, which may amount to some form of evaluation of the document.
- Whether the indexer deals with the whole document or only with aspects that are likely to interest the user (known as slanted abstracts).
- Whether the author of the abstract is the author of the original document or some other person.
- The language used, which may be a natural language or a more formalized (artificial) language.

Kinds of abstract

Abstract by writer

Abstracts may be written by authors, by subject experts, or by professional abstractors. Thus, we may categorize them as author-prepared abstracts, expert-prepared abstracts, and professional-prepared abstracts.

Abstracts

Abstracts by purposes

Abstracts are written with certain purposes in mind, and therefore there may be different abstracts to serve different purposes. Borko and Bernier have identified four different types of abstracts: the indicative abstract, the informative abstract, the critical abstract, and the special purpose abstract.

- Informative abstract: provides readers with the quantitative and qualitative information presented in the parent document.
- Indicative abstract: indicates what the parent document is all about.
- Critical abstract: a kind of critical comment or review by the abstractor.
- Slanted or special purpose abstract: written to serve a special purpose or with a specific category of users in mind.
- Modular abstract: the abstractor is expected to prepare different kinds of abstracts (indicative, informative, critical, and so on).

Abstracts by form

Three other kinds of abstract can also be identified:

- Structured abstract: may have a frame and slots that are to be filled in with information taken from the original document. This type of abstract is valuable in handbooks.
- Mini abstract: a highly structured abstract designed primarily for searching by computer; it is a cross between an index and an abstract and can be called a machine-readable index.
- Telegraphic abstract: an abstract that contains brief statements.

Qualities of abstracts
An abstract should possess the following qualities:

1. Concision: however long the abstract is, care should be taken to avoid expressions or circumlocutions that can be replaced by single words, but this should not be done at the expense of precision.
2. Precision: one should use expressions that are exact and as specific as possible without exceeding the abstract's requested length.
3. Self-sufficiency: the description of the document should be complete in itself and fully understandable without reference to any other document.
4. Objectivity: there must not be any personal interpretation or value judgment on the part of the abstractor (obviously this does not apply to critical abstracts).

Three major functions of abstracts:

1. Dissemination of information
2. Selection of information by the end-user
3. Retrieval of information, especially in computerized information retrieval systems

Major uses of abstracts

1. They promote current awareness: abstracts repackage the information contained in the original document into a more condensed form, so they take less time to read and make it easier to keep up to date.
2. They save reading time: abstracts are much smaller than the original document, and yet can provide as much information as the user needs without going into the full text.
3. They facilitate selection: in an information retrieval environment one may retrieve a large number of items, and reading the full texts of all of them may be impossible or very time-consuming. In such cases the user may consult the abstracts of the retrieved items in order to be selective.

Major uses of abstracts

4. They help overcome the language barrier: most abstracting journals cover more than one language, and therefore the user can find out what studies and research have been published in languages that he or she cannot read, which would have been impossible otherwise even if the original documents were available.
5. They facilitate literature searches: without indexed abstracts, searches of open and classified literature would be impossible due to the huge volume of material available.
6. They improve indexing efficiency, as they can be indexed much more rapidly than the original document. The rate of indexing can be improved by a factor of two to four, and the cost of preparing the index is reduced with little or no reduction in quality.
7. They aid in the preparation of reviews and can be of much help in the preparation of bibliographies and so on.

Guidelines for abstracts

1. Read the whole document at least once to obtain a clear idea of its essential contents and special features, such as tables, illustrations, and list of references.
2. After reading, examine the original document carefully, with emphasis on the author abstract, if any, the first and the last paragraphs, and the key sections, such as introduction, purpose, conclusion, summary and recommendations. Also take note of the footnotes, if any.
3. Underline, if necessary, the key phrases and sentences while reading and examining the original document.
4. Write the abstract in the style and manner in which you feel most comfortable and efficient, while remaining consistent with the principles and rules formulated for the purpose.

Guidelines

5. Initially, prepare a rough draft of the abstract.
6. Make use of the author abstract, if it exists, to work out the draft abstract.
7. Choose, if suitable, direct excerpts from the original to work out the draft abstract.
8. Make references to directly related abstracts in earlier issues and in the same issue of the compilation. The different subject categories in which one and the same abstract falls are also deemed to be directly related.
9. Revise and edit the draft abstract to prune redundancy and improve quality before finalizing the abstract.
10. Put your initials at the end of the abstract.

Automatic abstracting

Edmundson identifies four possible methods for the creation of automatic abstracts:

1. The key method is similar to the method of word frequency: each sentence is given a weight that is the sum of the weights of its component words.
2. The cue method is based on cue words that determine the significance of a given sentence within a text. In this method a cue dictionary lists words that receive a positive or a negative weight, and the value of a sentence is the sum of the values of its words.

Automatic abstracts

3. The title method is based on the assumption that words occurring in titles and subheads are good indicators of the content. Sentences are given a significance value based on the number of title and subhead words that they contain.
4. In the location method, weight is given to a sentence on the basis of where it appears in a document. Sentences appearing in certain sections of the text are assumed to be more indicative of content than others.
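
The four methods can be combined into a single sentence score. The sketch below adds the four values with equal weights; the cue dictionary, the rule that only the first and last sentences earn a location bonus, and all names are assumptions made for illustration, not Edmundson's actual parameters.

```python
import re
from collections import Counter

# Assumption: a toy cue dictionary with positive and negative weights
CUE_WEIGHTS = {"significant": 1.0, "conclude": 1.0, "hardly": -1.0}


def tokenize(text: str) -> list:
    return re.findall(r"[a-z]+", text.lower())


def score_sentences(title: str, sentences: list,
                    weights=(1.0, 1.0, 1.0, 1.0)) -> list:
    """Score each sentence as a weighted sum of key, cue, title, location."""
    w_key, w_cue, w_title, w_loc = weights
    freq = Counter(tok for s in sentences for tok in tokenize(s))
    title_words = set(tokenize(title))
    last = len(sentences) - 1
    scores = []
    for i, sentence in enumerate(sentences):
        toks = tokenize(sentence)
        # Key method: sum of word weights (raw frequencies here)
        key = sum(freq[t] for t in toks)
        # Cue method: sum of cue-dictionary values
        cue = sum(CUE_WEIGHTS.get(t, 0.0) for t in toks)
        # Title method: number of title words in the sentence
        title_score = sum(1 for t in toks if t in title_words)
        # Location method (assumed rule): bonus for first and last sentence
        location = 1.0 if i in (0, last) else 0.0
        scores.append(w_key * key + w_cue * cue
                      + w_title * title_score + w_loc * location)
    return scores


# Hypothetical document
title = "Automatic indexing"
sentences = [
    "Automatic indexing assigns terms by computer.",
    "It is hardly new.",
    "We conclude that indexing is significant.",
]
scores = score_sentences(title, sentences)
print(scores)  # [10.0, 4.0, 12.0]
```

An automatic abstract would then be assembled from the highest-scoring sentences. In practice the key method would ignore stop words, otherwise long sentences full of function words score too highly.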

Abstracting

With the rapid increase in the availability of full-text and multimedia information in digital form, the need for automatic abstracts or summaries as a filtering tool is becoming extremely important.
These areas have drawn much research interest. Related research areas include information extraction and text mining.

Bibliographies

A bibliography is a list of related library materials or resources, usually subject-related. The term bibliography can mean:
- the writing of books
- the systematic description and history of books, their authorship, printing, publication, edition, etc.
- a book containing such details
- a list of books of a particular author, printer or country, or of the literature of a subject, etc.
Bibliographies are important because they list items of interest to users. Working in an information agency often involves preparing bibliographies or reading lists for clients, either on request or in anticipation of demand.

Preparing a bibliography

The following bibliographic elements are needed to help users identify items in a bibliography:
- books (monographs): author/s, editor/s, compiler/s or the institution responsible for the work; title and subtitle; edition; place of publication; publisher; date; series.
- journal (serial) articles: author/s of article; title of article; title of journal; issue details; page numbers.
- non-book items: author/s, editor/s, compiler/s or the organization responsible for the work; title and subtitle; edition; type of medium, e.g. videocassette, map; series; publisher; place of publication.
- electronic sources: author/s, editor/s, compiler/s or the institution responsible for the work; title and subtitle; edition; type of medium, e.g. online, CD-ROM; information supplier if appropriate; full address to find the item, e.g. URL; date of access.
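
As an illustration of how the elements for a book might be held in a record and turned into a bibliography entry, here is a minimal sketch. The class name BookEntry, the selection of fields, and the author-date layout are assumptions; libraries normally follow an established citation or cataloguing standard. The sample values describe Chowdhury's Introduction to modern information retrieval.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class BookEntry:
    """Hypothetical record holding some of the elements listed for books."""
    authors: str
    year: int
    title: str
    place: str
    publisher: str
    edition: Optional[str] = None

    def format(self) -> str:
        # Assumed author-date layout: Author. Year. Title. Edition. Place: Publisher.
        parts = [f"{self.authors} {self.year}.", f"{self.title}."]
        if self.edition:
            parts.append(f"{self.edition}.")
        parts.append(f"{self.place}: {self.publisher}.")
        return " ".join(parts)


entry = BookEntry(
    authors="Chowdhury, G. G.",
    year=2004,
    title="Introduction to modern information retrieval",
    place="London",
    publisher="Facet Publishing",
    edition="2nd ed",
)
print(entry.format())
```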

Bibliographies

Bibliographies provide bibliographic descriptions to help users or librarians find sources on a particular topic or by a particular author.
They can be used to check that the bibliographic details are accurate and the description of the item is unique.
In libraries, this bibliographic verification is important for acquisitions, interlibrary loans and reference work, as well as in preparing bibliographies for users.

Types of bibliographies

- National bibliographies provide a systematic list of publications published in one country or in one language; include items received under legal deposit and cataloged by a national agency; and are usually arranged in classified order with detailed indexes.
- Trade bibliographies are produced from information supplied by publishers and include items for sale and in print. Their information may not conform to library description standards.
- Subject bibliographies list material on a particular topic.
- Bibliographies by information package or type of publication, e.g. government publications.

Vocabulary Control

Introduction

Vocabulary control is one of the most important components of an information retrieval system. In order to match the contents of the user requirements (the search terms) with the contents of the stored documents, one must follow a vocabulary that is common to both.
User requirements need to be translated and put to the retrieval system in the same language (using the same terms, for example) as was used to express the contents of the document records.
This leads us to the concept of using a standard or controlled vocabulary in an information retrieval environment.

Vocabulary control tools have been designed over the years: they differ in their structure and design features, but they all have the same purpose in an information retrieval environment.
The availability of vocabulary control helps both indexers, i.e. the people who create document records, particularly those who create subject representations for the documents (by using keywords in a post-coordinate system, for example), and end-users in the formulation of their search expressions.

Lancaster identifies two major objectives of vocabulary control in an information retrieval environment:

1. To promote the consistent representation of subject matter by indexers and searchers, thereby avoiding the dispersion of related materials. This is achieved through the control (merging) of synonymous and nearly synonymous expressions and by distinguishing among homographs.
2. To facilitate the conduct of a comprehensive search on some topic by linking together terms whose meanings are related paradigmatically or syntagmatically.
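
The first objective, merging synonyms and distinguishing homographs, can be sketched as a simple lookup that indexers and searchers share. The tables and the function name control are hypothetical and the vocabulary is a toy example.

```python
# Assumption: toy mapping from entry terms to a single preferred term
PREFERRED = {
    "car": "automobiles",
    "cars": "automobiles",
    "motor car": "automobiles",
    "automobile": "automobiles",
}

# Assumption: homographs are split by a parenthetical qualifier
HOMOGRAPHS = {
    "mercury": {
        "chemistry": "mercury (element)",
        "astronomy": "mercury (planet)",
    },
}


def control(term: str, context: str = "") -> str:
    """Translate a free-text term into the controlled vocabulary."""
    term = term.lower().strip()
    if term in HOMOGRAPHS:
        senses = HOMOGRAPHS[term]
        if context not in senses:
            raise ValueError(f"'{term}' is ambiguous; give one of {sorted(senses)}")
        return senses[context]
    # Synonyms and near-synonyms merge into one preferred term
    return PREFERRED.get(term, term)


print(control("Cars"))                  # automobiles
print(control("motor car"))             # automobiles
print(control("Mercury", "astronomy"))  # mercury (planet)
```

Because the indexer and the searcher both pass their terms through the same table, a document indexed under "motor car" is found by a search for "cars".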

Vocabulary control tools


Subject headings lists

The subject headings used in the dictionary catalogues of the Library of Congress (LCSH) are the most important and broadest general list of subject headings, covering all known subjects.
The Sears List of Subject Headings is a smaller work designed for small to medium-sized libraries.
Each entry may be accompanied by all or some of the following:

1. A scope note showing how the term may be used
2. A list of headings to which see also references may be made
3. A list of headings from which see references may be made
4. A list of headings from which see also references may be made

Thesauri

Rowley defines a thesaurus as a compilation of words and phrases showing synonyms and hierarchical and other relationships and dependencies, the function of which is to provide a standardized vocabulary for information storage and retrieval systems.

(a) Purpose: A thesaurus serves four major purposes:

1. To control the terms used in indexing, providing a means of translating the natural language of authors, indexers and enquirers into the more constrained language used for indexing and retrieval.
2. To ensure, through the provision of a controlled language, consistent practice between different indexers employed by the same agency, or between indexers employed by different agencies in a cooperative network.
3. To limit the number of terms that need to be assigned to a document. The terms assigned to a document should represent, as specifically as possible, the concepts described by the author, but they need not include terms of broader connotation, since these can be displayed effectively in a thesaurus.
4. To serve as a search aid in retrieval, including retrieval from free-text systems.

(b) Structure: A thesaurus displays, through its structure, the synonymous, hierarchical and other relationships between the terms which together comprise an indexing language.
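
The structure of a thesaurus can be modelled as a table of terms and typed relationships. The sketch below uses the conventional labels BT (broader term), NT (narrower term), RT (related term) and UF (used for); these labels, the sample entries, and the function name expand are assumptions added here, since the lecture does not show a thesaurus entry.

```python
# Assumption: a toy thesaurus keyed by preferred term
THESAURUS = {
    "vehicles": {"NT": ["automobiles", "bicycles"], "RT": ["transport"]},
    "automobiles": {"BT": ["vehicles"], "NT": ["electric cars"], "UF": ["cars"]},
    "bicycles": {"BT": ["vehicles"]},
    "electric cars": {"BT": ["automobiles"]},
}

# Entry vocabulary: non-preferred term -> preferred term (derived from UF)
USE = {
    non_preferred: term
    for term, relations in THESAURUS.items()
    for non_preferred in relations.get("UF", [])
}


def expand(term: str) -> list:
    """Return the preferred term followed by all of its narrower terms."""
    term = USE.get(term, term)
    result = [term]
    for narrower in THESAURUS.get(term, {}).get("NT", []):
        result.extend(expand(narrower))
    return result


print(expand("vehicles"))  # ['vehicles', 'automobiles', 'electric cars', 'bicycles']
print(expand("cars"))      # ['automobiles', 'electric cars']
```

Expanding a query through the hierarchy in this way is one use of a thesaurus as a search aid: a search on a broad term can also pick up documents indexed under its narrower terms.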

Indexing tends to be more consistent when the vocabulary used is controlled, because indexers are more likely to agree on the terms needed to describe a particular topic if they are selected from a pre-established list.
From the searcher's point of view, it is easier to identify the terms appropriate to information needs if these terms must be selected from a definitive list.
Thus a controlled vocabulary tends to match the language of indexers and searchers.

Summary

Information is organized and managed through cataloging, classification, indexing, abstracting and the compilation of bibliographies.
Vocabulary control assists the matching of queries with the items that have been organized.
As the importance of information to the user is relative and depends on changing situations, times and needs, information is most valuable when it is quickly and easily available. Effective organization will ensure efficient retrieval.

References

Chowdhury, G. G. 2004. Introduction to modern information retrieval. 2nd ed. London: Facet Publishing.
Gosling, Mary and Hopgood, Elizabeth. 1999. Learn about information. 2nd ed. Canberra: DocMatrix.
