
Text Mining

Data Mining is typically concerned with the detection of patterns in numeric data, but very
often important (e.g., business-critical) information is stored in the form of text. Unlike
numeric data, text is often amorphous and difficult to deal with. Text Mining generally
consists of analyzing (multiple) text documents by extracting key phrases, concepts,
etc., and preparing the text processed in this manner for further analysis with
numeric Data Mining techniques (e.g., to determine co-occurrences of concepts, key
phrases, names, addresses, product names, etc.).
Representation of Text Documents
In Text Mining studies, a document is generally used as the basic unit of analysis. A
document is a sequence of words and punctuation, following the grammatical rules of the
language, containing any relevant segment of text, and can be of any length. It can be a
paper, an essay, a book, a web page, an email, etc., depending on the type of analysis being
performed and on the goals of the researcher. In some cases, a document may
contain only a chapter, a single paragraph, or even a single sentence. The fundamental unit
of text is a word. A term is usually a word, but it can also be a word-pair or phrase. In this
chapter, we will use term and word interchangeably. Words are composed of characters and
are the basic units from which meaning is constructed. By combining words with
grammatical structure, a sentence is made. Sentences are the basic unit of action in text,
containing information about the action of some subject. Paragraphs are the fundamental
unit of composition and contain a related series of ideas or actions. As the length of a text
increases, additional structural forms become relevant, including sections, chapters,
entire documents, and finally, a corpus of documents. A corpus is a collection of
documents, and a lexicon is the set of all unique words in the corpus.
In Text Mining studies, a sentence is regarded simply as a set of words, or a bag of
words, and the order of words can be changed without impacting the outcome of the
analysis. The syntactic structure of a sentence or paragraph is intentionally ignored in
order to handle the text efficiently. The bag-of-words concept is also referred to as
exchangeability in the generative language model.
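The bag-of-words view can be illustrated with a short sketch (the helper name and sentences here are purely illustrative):

```python
from collections import Counter

def bag_of_words(sentence):
    """Reduce a sentence to an unordered bag (multiset) of lowercase words."""
    return Counter(sentence.lower().split())

# Word order is discarded, so these two sentences yield the same bag,
# and any analysis built on the bag cannot tell them apart.
a = bag_of_words("the dog chased the cat")
b = bag_of_words("the cat chased the dog")
print(a == b)  # True
```

This is exactly the exchangeability property: permuting the words of a sentence leaves its representation unchanged.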
Text Mining Techniques
Text Mining is an interdisciplinary field that utilizes techniques from the general field of
Data Mining and additionally, combines methodologies from various other areas such as
Information Extraction, Information Retrieval, Computational Linguistics, Categorization,
Clustering, Summarization, Topic Tracking and Concept Linkage. In the following
sections, we will discuss each of these technologies and the role that they play in Text
Mining.
a) Information Extraction: Information Extraction (IE) is the process of automatically
extracting structured information from unstructured and/or semi-structured machine-readable
documents, processing human-language texts by means of NLP. The final
output of the extraction process is some type of database, obtained by looking for
predefined sequences in the text, a process called pattern matching.


Tasks performed by IE systems include:

Term analysis, which identifies the terms appearing in a document. This is
especially useful for documents that contain many complex multi-word terms,
such as scientific research papers.

Named-entity recognition, which identifies the names appearing in a document,
such as names of people or organizations. Some systems are also able to
recognize dates and expressions of time, quantities and associated units,
percentages, and so on.

Fact extraction, which identifies and extracts complex facts from documents.
Such facts could be relationships between entities or events.

Figure 1: Overview of IE-based text mining framework


IE transforms a corpus of textual documents into a more structured database; the database
constructed by an IE module can then be provided to the KDD module for further mining
of knowledge, as illustrated in Figure 1.
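As a minimal sketch of the pattern-matching step, the fragment below scans raw text for two of the predefined sequence types named above (dates and percentages) and collects the matches into a structured record. The patterns and the example sentence are illustrative assumptions, not taken from any particular IE system:

```python
import re

# Hypothetical patterns for two entity types; a real IE system
# would define many more predefined sequences than these.
PATTERNS = {
    "date": re.compile(
        r"\b\d{1,2}\s+(?:January|February|March|April|May|June|July|"
        r"August|September|October|November|December)\s+\d{4}\b"),
    "percentage": re.compile(r"\b\d+(?:\.\d+)?%"),
}

def extract(text):
    """Pattern matching: scan the text for each predefined sequence
    and collect the matches into a structured record."""
    return {label: [m.group(0) for m in pat.finditer(text)]
            for label, pat in PATTERNS.items()}

record = extract("Yields rose 4.5% after 12 March 2010, and 7% more by 2011.")
print(record["date"], record["percentage"])
```

The resulting record, rather than the raw text, is what gets loaded into the database for downstream mining.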
b) Information Retrieval: Retrieval of text-based information, also termed Information
Retrieval (IR), has become a topic of great interest with the advent of text search
engines on the Internet. Text is considered to be composed of two fundamental units,
namely the document (book, journal paper, chapter, section, paragraph, Web
page, computer source code, and so forth) and the term (word, word-pair, or
phrase within a document). Traditionally in IR, text queries and documents are both
represented in a unified manner, as sets of terms, so that distances between
queries and documents can be computed, providing a framework within which to directly
implement simple text retrieval algorithms.
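A minimal sketch of this sets-of-terms framework, using Jaccard overlap as a simple query-document similarity (the documents and query here are made up for illustration):

```python
def terms(text):
    """Represent a query or document in a unified manner: as a set of terms."""
    return set(text.lower().split())

def jaccard(q, d):
    """Overlap between two term sets: 0 if disjoint, 1 if identical."""
    return len(q & d) / len(q | d)

# Rank documents by similarity to the query's term set.
docs = ["mining numeric data", "text mining of documents", "crop yield reports"]
query = terms("text mining")
ranked = sorted(docs, key=lambda d: jaccard(query, terms(d)), reverse=True)
print(ranked[0])  # text mining of documents
```

Real IR systems refine this with term weighting (e.g., tf-idf), but the unified set-of-terms representation is the same.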

Winter School on "Data Mining Techniques and Tools for Knowledge Discovery in Agricultural Datasets
357


c) Computational Linguistics / Natural Language Processing: Natural Language
Processing (NLP) is a theoretically motivated range of computational techniques for
analyzing and representing naturally occurring texts at one or more levels of
linguistic analysis, for the purpose of achieving human-like language processing for a
range of tasks or applications. The goal of NLP is to
design and build computer systems that analyze, understand, and generate
natural human languages. Applications of NLP include machine translation of text from one
human language to another; generation of human-language text such as fiction,
manuals, and general descriptions; interfacing to other systems such as databases and
robotic systems, thus enabling the use of human-language commands and
queries; and understanding human-language text to provide a summary or to draw
conclusions. An NLP system performs the following tasks:

Parse a sentence to determine its syntax.
Determine the semantic meaning of a sentence.
Analyze the text's context to determine its true meaning, for comparison with other
text.

The role of NLP in Text Mining is to provide the systems in the information
extraction phase with the linguistic data they need to perform their task. Often this is
done by annotating documents with information such as sentence boundaries, part-of-speech
tags, and parsing results, which can then be read by the information extraction
tools.
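Such annotation can be sketched in miniature, with a regex sentence-boundary splitter and a suffix-rule part-of-speech guesser standing in for real NLP components (all rules below are toy assumptions for illustration only):

```python
import re

def sentence_boundaries(text):
    """Crude sentence-boundary annotation: split after terminal punctuation."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def crude_pos(token):
    """Toy part-of-speech guess driven by suffix rules (illustration only;
    a real tagger uses lexicons and context)."""
    if token.endswith("ing"):
        return "VERB-ING"
    if token.endswith("s"):
        return "NOUN-PL"
    return "OTHER"

# Annotate each sentence with (token, tag) pairs for downstream IE tools.
text = "Parsing determines syntax. Semantics gives meaning."
annotated = [[(tok, crude_pos(tok)) for tok in re.findall(r"\w+", s)]
             for s in sentence_boundaries(text)]
print(annotated)
```

The output is the kind of layered annotation (boundaries plus tags) that information extraction tools consume.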

d) Categorization: Categorization is the process of recognizing, differentiating, and
understanding ideas and objects in order to group them into categories for a specific
purpose. Ideally, a category illuminates a relationship between the subjects and
objects of knowledge. Categorization is fundamental in language, prediction,
inference, decision making, and all kinds of environmental interaction.
There are many categorization theories and techniques. In a broader historical view,
however, three general approaches to categorization may be identified:

Classical categorization - According to the classical view, categories should be
clearly defined, mutually exclusive, and collectively exhaustive, with every object
belonging to one, and only one, of the proposed categories.

Conceptual clustering - A modern variation of the classical approach, in which
classes (clusters of entities) are generated by first formulating their conceptual
descriptions and then classifying the entities according to these descriptions.
Conceptual clustering is closely related to fuzzy set theory, in which objects may
belong to one or more groups, in varying degrees of fitness.

Prototype theory - Categorization can also be viewed as the process of grouping
things based on prototypes. Categorization based on prototypes is the basis for
human development, and relies on learning about the world via embodiment.

e) Topic Tracking: A topic tracking system works by keeping user profiles and,
based on the documents the user views, predicting other documents of interest to the
user. Yahoo offers a free topic tracking tool (www.alerts.yahoo.com) that allows
users to choose keywords and notifies them when news relating to those topics
becomes available.
Topic tracking technology has limitations, however. For example, if a user sets up
an alert for Text Mining, s/he will receive several news stories on mining for
minerals and very few that are actually about Text Mining. Some of the better Text
Mining tools let users select particular categories of interest, or the software
can even infer the user's interests automatically from his/her reading history
and click-through information. Keyword extraction has become the basis of several
Text Mining applications such as search engines, text categorization, summarization,
and topic detection. Nelken et al. proposed a disambiguation system that separates
the on-topic occurrences and filters them from the potential multitude of references
to unrelated entities.
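The disambiguation idea can be sketched as a keyword profile that only fires when a profile keyword co-occurs with a disambiguating context word (the profile, context words, and stories below are made-up illustrations):

```python
def on_topic(story, keywords, context):
    """Flag a story only when a profile keyword appears together with at
    least one disambiguating context word, filtering out off-topic uses
    of an ambiguous keyword such as 'mining'."""
    tokens = set(story.lower().split())
    return bool(tokens & keywords) and bool(tokens & context)

profile_keywords = {"mining"}
profile_context = {"text", "document", "corpus"}

stories = ["new text mining corpus released", "copper mining output falls"]
alerts = [s for s in stories if on_topic(s, profile_keywords, profile_context)]
print(alerts)  # ['new text mining corpus released']
```

The mineral-mining story matches the keyword but lacks any context word, so no alert is raised for it.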
f) Clustering: Clustering is a technique in which objects with logically similar
properties are physically placed together in one class of objects, so that a single access
to the disk makes the entire class available. There are many clustering methods
available, and each of them may give a different grouping of a dataset. The choice
of a particular method depends on the type of output desired, the known
performance of the method with particular types of data, the hardware and software
facilities available, and the size of the dataset. In general, clustering methods may be
divided into two categories based on the cluster structure they produce.
Non-hierarchical methods divide a dataset of N objects into M clusters, with or
without overlap. These methods comprise partitioning methods, in which the
classes are mutually exclusive, and the less common clumping methods, in which
overlap is allowed. Each object is a member of the cluster with which it is most
similar; however, the threshold of similarity has to be defined. Hierarchical
methods produce a set of nested clusters in which each pair of objects or clusters is
progressively nested in a larger cluster until only one cluster remains.
Hierarchical methods can be further divided into agglomerative and divisive methods.
In agglomerative methods, the hierarchy is built up in a series of N-1
agglomerations, or fusions, of pairs of objects, beginning with the un-clustered
dataset. The less common divisive methods begin with all objects in a single cluster
and, at each of N-1 steps, divide some cluster into two smaller clusters, until each
object resides in its own cluster.
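The agglomerative procedure can be sketched as follows, using single-link distance over bag-of-words term sets and stopping early at a requested number of clusters rather than running all N-1 fusions (the document strings are made up; a real system would use richer features):

```python
def jaccard_dist(a, b):
    """Distance between two term sets: 0 identical, 1 disjoint."""
    return 1 - len(a & b) / len(a | b)

def agglomerate(docs, n_clusters):
    """Single-link agglomerative clustering: start from the un-clustered
    dataset and repeatedly fuse the closest pair of clusters."""
    clusters = [[d] for d in docs]
    while len(clusters) > n_clusters:
        # Find the pair of clusters whose closest members are nearest.
        i, j = min(((i, j) for i in range(len(clusters))
                    for j in range(i + 1, len(clusters))),
                   key=lambda ij: min(jaccard_dist(set(x.split()), set(y.split()))
                                      for x in clusters[ij[0]]
                                      for y in clusters[ij[1]]))
        clusters[i] += clusters.pop(j)  # one fusion
    return clusters

docs = ["wheat yield data", "wheat crop data", "text mining tools"]
print(agglomerate(docs, 2))
```

Letting the loop run until a single cluster remains would produce the full nested hierarchy described above.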

g) Concept Linkage: Concept linkage identifies related documents based on the
concepts they share. The primary goal of concept linkage is to
provide browsing for information rather than searching for it, as in IR. For example,
a Text Mining software solution may easily identify a link between topics X and Y,
and between Y and Z. Concept linkage could then
detect a potential link between X and Z, something that a human researcher may
not have come across because of the large volume of information s/he would have to sort
through to make the connection. Concept linkage is particularly useful for identifying links
between diseases and treatments. In the near future, Text Mining tools with concept
linkage capabilities will be beneficial in the biomedical field, helping researchers
discover new treatments by associating treatments that have been used in related
fields.
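The X-Y, Y-Z therefore possibly X-Z pattern can be sketched over topics represented as concept sets; the topic names and concepts below are hypothetical (they echo the well-known magnesium/migraine literature-linkage example):

```python
def indirect_links(doc_concepts):
    """Find topic pairs (X, Z) that share no concept directly but are both
    linked to some intermediate topic Y."""
    def linked(a, b):
        return bool(doc_concepts[a] & doc_concepts[b])
    topics = list(doc_concepts)
    return {(x, z) for x in topics for z in topics
            if x < z and not linked(x, z)
            and any(linked(x, y) and linked(y, z)
                    for y in topics if y not in (x, z))}

literature = {
    "X": {"magnesium"},              # hypothetical topic X
    "Y": {"magnesium", "migraine"},  # Y shares a concept with both X and Z
    "Z": {"migraine"},               # hypothetical topic Z
}
print(indirect_links(literature))  # {('X', 'Z')}
```

The pair (X, Z) is surfaced for a researcher to browse, even though no single document connects the two topics directly.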

h) Information Visualization: Visual Text Mining, or information visualization, puts
large textual sources into a visual hierarchy or map and provides browsing
capabilities in addition to simple searching (e.g., Informatik V's DocMiner). The
user can interact with the document map by zooming, scaling, and creating
sub-maps. Governments can use information visualization to identify terrorist
networks or to find information about crimes that may previously have been
thought unconnected. It could provide them with a map of possible relationships
between suspicious activities so that they can investigate connections that they
would not have come up with on their own. Text Mining with information
visualization has also been shown to be useful in academic areas, where it can allow an
author to easily identify and explore papers in which s/he is referenced. It helps
users narrow down a broad range of documents and explore related topics.
i) Summarization: A summary is a text that is produced from one or more texts, that
contains a significant portion of the information in the original text(s), and that is no
longer than half of the original text(s). Text here includes multimedia documents,
on-line documents, hypertexts, etc. The many types of summary that have been
identified include indicative summaries (which give an idea of what the text is
about without giving any of its content) and informative ones (which do provide a
shortened version of the content). Extracts are summaries created by reusing
portions (words, sentences, etc.) of the input text verbatim, while abstracts are
created by regenerating the extracted content. A generic summary is not tied to a
specific topic, while a query-based summary discusses the topic mentioned in a
given query. A summary may also be created for a single document or for multiple
documents.
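A minimal extract, in the sense just defined, can be produced by scoring sentences on overall word frequency and reusing the top sentence verbatim (the scoring rule and sample text are illustrative assumptions; real summarizers are far more sophisticated):

```python
from collections import Counter
import re

def words(s):
    return re.findall(r"[a-z]+", s.lower())

def extract_summary(text, n_sentences=1):
    """Build an extract: reuse the highest-scoring sentences verbatim,
    scoring each sentence by the overall frequency of its words."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    freq = Counter(w for s in sentences for w in words(s))
    ranked = sorted(sentences, key=lambda s: sum(freq[w] for w in words(s)),
                    reverse=True)
    return ranked[:n_sentences]

text = "Crops need water. Water and soil both feed crops. Rockets fly."
print(extract_summary(text))  # ['Water and soil both feed crops.']
```

An abstract, by contrast, would regenerate new wording from the extracted content rather than reusing sentences verbatim.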

Domains of applications of Text Mining


Regarded as the next wave of knowledge discovery, Text Mining has a very high
commercial value. It is an emerging technology for analyzing large collections of
unstructured documents for the purposes of extracting interesting and non-trivial patterns
or knowledge. Text Mining applications can be broadly organized into two groups:

Document exploration tools: They organize documents based on their text content
and provide an environment for a user to navigate and browse a document or
concept space. A popular approach is to perform clustering on the documents based
on their similarities in content and present the groups, or clusters, of documents
in some graphical representation.

Document analysis tools: They analyze the text content of the documents and
discover relationships between the concepts or entities described in them.
They are mainly based on natural language processing techniques, including text
analysis, text categorization, information extraction, and summarization.

There are many possible application domains based on Text Mining technology. We
briefly mention a few below:

Customer profile analysis, e.g., mining incoming emails for customers' complaints
and feedback.

Patent analysis, e.g., analyzing patent databases for major technology players,
trends, and opportunities.

Information dissemination, e.g., organizing and summarizing trade news and
reports for personalized information services.

Company resource planning, e.g., mining a company's reports and correspondence
for activities, status, and problems reported.

Security issues, e.g., analyzing plain-text sources such as Internet news. This also
involves the study of text encryption.

Open-ended survey responses, e.g., analyzing a certain set of words or terms that
are commonly used by respondents to describe the pros and cons of a product or
service (under investigation), suggesting common misconceptions or confusion
regarding the items in the study.

Text classification, e.g., automatically filtering out undesirable junk email
based on certain terms or words that are not likely to appear in legitimate messages.

Competitive intelligence, e.g., enabling companies to organize and adapt their
strategies to present market demands and opportunities, based on the information
they collect about themselves, the market, and their competitors, and to manage
the enormous amount of data that must be analyzed to make plans.

Customer Relationship Management (CRM), e.g., automatically rerouting specific
requests to the appropriate service or supplying immediate answers to the
most frequently asked questions.

Multilingual applications of Natural Language Processing, e.g., identifying and
analyzing web pages published in different languages.

Technology watch, e.g., identifying the relevant Science & Technology literature
and extracting the required information from it efficiently.

Text summarization, e.g., creating a condensed version of a document or a
document collection (multi-document summarization) that should contain its most
important topics.

Bio-entity recognition, e.g., identifying and classifying technical terms in the
domain of molecular biology corresponding to concept instances that are of
interest to biologists. Examples of such entities include the names of proteins
and genes and their locations of activity, such as cell or organism names.

Organizing repositories of document-related meta-information, e.g., automatic text
categorization methods are used to create structured metadata for searching
and retrieving relevant documents based on a query.

Gaining insights about trends and relations between people, places, and organizations,
e.g., aggregating and comparing information extracted automatically from documents of
a certain type, like incoming mail, customer letters, news wires, and so on.



Architecture of a Text Mining System


A Text Mining system takes as input a collection of documents and preprocesses each
document by checking its format and character set. Next, these preprocessed documents
go through a text analysis phase, sometimes repeating techniques, until the required
information is extracted. Three text analysis techniques are shown in Figure 2, but many
other combinations of techniques could be used depending on the goals of the organization.
The resulting extracted information can be input to a management information system,
yielding an abundant amount of knowledge for the user of that system. Figure 3 shows
the detailed processing steps followed in a Text Mining system.

Figure 3: Text Mining Process


The different steps of the Text Mining process shown above are briefly discussed below:
a) Document files of different formats, such as PDF files, txt files, or flat files, are
collected from different sources such as online chat, SMS, emails, message boards,
newsgroups, blogs, wikis, and web pages. This unstructured dataset of documents is
pre-processed to perform the following three tasks:

Tokenize the file into individual tokens using space as the delimiter.
Remove the stop words, which do not convey any meaning.
Use the Porter stemmer algorithm to reduce words with a common root to that root word.
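The three pre-processing tasks above can be sketched as a tiny pipeline. The stop-word list is a made-up fragment, and the suffix stripper is a crude stand-in for the full Porter stemmer, not the real algorithm:

```python
STOP_WORDS = {"the", "is", "a", "of", "and", "to", "in"}  # tiny illustrative list

def crude_stem(word):
    """Strip a few common suffixes: a rough stand-in for the Porter stemmer."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def preprocess(text):
    tokens = text.lower().split()                        # 1. tokenize on spaces
    tokens = [t for t in tokens if t not in STOP_WORDS]  # 2. drop stop words
    return [crude_stem(t) for t in tokens]               # 3. stem to root forms

print(preprocess("the miners are mining the mines"))
```

Note how "miners", "mining", and "mines" collapse toward a shared root, which is the whole point of the stemming step.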

b) Feature Generation and Feature Selection are performed on these
retrieved and preprocessed documents to represent the unstructured text documents
in a more structured, spreadsheet-like format. Feature Selection algorithms help to
identify the important features, which in principle requires an exhaustive search of all
subsets of features of a chosen cardinality. When a large number of features is available,
this is impractical, so supervised learning algorithms search for a satisfactory set of
features instead of an optimal set.
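Both steps can be sketched in a few lines: feature generation as a document-term matrix (the spreadsheet-style representation), and a satisfactory, non-exhaustive selection rule that simply keeps the terms occurring in the most documents (the documents and the cut-off rule are illustrative assumptions):

```python
from collections import Counter

def document_term_matrix(docs):
    """Feature generation: turn unstructured documents into structured
    rows of term counts over a shared vocabulary."""
    vocab = sorted({w for d in docs for w in d.split()})
    rows = [[Counter(d.split())[w] for w in vocab] for d in docs]
    return vocab, rows

def select_features(docs, k):
    """Satisfactory (not exhaustive) feature selection: keep the k terms
    that occur in the most documents."""
    df = Counter(w for d in docs for w in set(d.split()))
    return [w for w, _ in df.most_common(k)]

docs = ["wheat yield up", "wheat yield down", "rice yield up"]
vocab, rows = document_term_matrix(docs)
print(vocab, rows)
```

Scanning document frequencies once is linear in the corpus size, whereas searching all feature subsets of a given cardinality grows combinatorially, which is exactly why the satisfactory search is preferred.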
c) After the appropriate features have been selected, Text Mining techniques are
applied for applications like Information Retrieval, Information Extraction,
Summarization, and Topic Discovery to carry out the necessary knowledge
discovery process.
As Figure 3 depicts, the resulting knowledge is stored in, and retrieved from, the
management information system.

