Data Mining is typically concerned with the detection of patterns in numeric data, but very
often important (e.g., critical to business) information is stored in the form of text. Unlike
numeric data, text is often amorphous, and difficult to deal with. Text Mining generally
consists of the analysis of (multiple) text documents by extracting key phrases, concepts,
etc. and the preparation of the text processed in that manner for further analyses with
numeric Data Mining techniques (e.g., to determine co-occurrences of concepts, key
phrases, names, addresses, product names, etc.).
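As a rough illustration of the co-occurrence analysis mentioned above, the following minimal Python sketch counts how many documents contain each pair of terms. The toy documents and the function name count_cooccurrences are assumptions made for illustration; lower-casing and whitespace splitting stand in for real key-phrase extraction.

```python
from collections import Counter
from itertools import combinations


def count_cooccurrences(documents):
    """Count, for each unordered pair of terms, the documents containing both."""
    pair_counts = Counter()
    for text in documents:
        # Crude term extraction (assumption): lower-case and split on whitespace.
        terms = sorted(set(text.lower().split()))
        # Every unordered pair of distinct terms in this document co-occurs once.
        pair_counts.update(combinations(terms, 2))
    return pair_counts


# Hypothetical toy corpus.
docs = [
    "wheat yield rainfall",
    "wheat rainfall forecast",
    "rice yield",
]
cooc = count_cooccurrences(docs)
print(cooc[("rainfall", "wheat")])
```

Pairs are stored in alphabetical order, so a lookup should use the same ordering, e.g. ("rainfall", "wheat") rather than ("wheat", "rainfall").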
Representation of Text Documents
In Text Mining studies, a document is generally used as the basic unit of analysis. A
document is a sequence of words and punctuation that follows the grammatical rules of the
language. It contains any relevant segment of text and can be of any length: a
paper, an essay, a book, a web page, an email, etc., depending on the type of analysis being
performed and on the goals of the researcher. In some cases, a document may
contain only a chapter, a single paragraph, or even a single sentence. The fundamental unit
of text is a word. A term is usually a word, but it can also be a word pair or a phrase. In this
thesis, we will use term and word interchangeably. Words are composed of characters and
are the basic units from which meaning is constructed. Combining words with
grammatical structure yields a sentence. Sentences are the basic unit of action in text,
containing information about the action of some subject. Paragraphs are the fundamental
unit of composition and contain a related series of ideas or actions. As the length of text
increases, additional structural forms become relevant, often including sections, chapters,
entire documents, and finally, a corpus of documents. A corpus is a collection of
documents, and a lexicon is the set of all unique words in the corpus.
In Text Mining studies, a sentence is regarded simply as an unordered collection of words, or a bag of
words: the order of the words can be changed without affecting the outcome of the
analysis. The syntactic structure of a sentence or paragraph is intentionally ignored in
order to handle the text efficiently. The bag-of-words concept is also referred to as
exchangeability in generative language models.
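The definitions above (corpus, lexicon, bag of words) can be sketched in a few lines of Python. This is a minimal illustration under simple assumptions: the two-sentence corpus is invented, and words are obtained by lower-casing and splitting on whitespace.

```python
from collections import Counter


def bag_of_words(sentence):
    """Return term counts for a sentence, ignoring word order."""
    return Counter(sentence.lower().split())


# Hypothetical two-document corpus; the second is a reordering of the first.
corpus = ["the farmer sold the crop", "the crop sold the farmer"]

# The lexicon is the set of all unique words in the corpus.
lexicon = sorted({word for doc in corpus for word in doc.lower().split()})

bags = [bag_of_words(doc) for doc in corpus]
# Reordering the words leaves the bag unchanged (exchangeability).
same_bag = bags[0] == bags[1]
# Each document becomes a count vector indexed by the lexicon.
vectors = [[bag[word] for word in lexicon] for bag in bags]
```

The two sentences mean different things, yet they map to the same vector. That loss of word order is the price paid for the efficiency of the representation.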
Text Mining Techniques
Text Mining is an interdisciplinary field that draws on techniques from the general field of
Data Mining and, additionally, combines methodologies from various other areas such as
Information Extraction, Information Retrieval, Computational Linguistics, Categorization,
Clustering, Summarization, Topic Tracking and Concept Linkage. In the following
sections, we will discuss each of these technologies and the role that it plays in Text
Mining.
a)
Text Mining
Fact extraction, which identifies and extracts complex facts from documents.
Such facts could be relationships between entities or events.
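A very simple form of fact extraction can be sketched with a hand-written pattern. The example below is only an illustration: the "X acquired Y" pattern, the notion of an entity as a run of capitalized words, and the sample sentence are all assumptions, and practical systems rely on far richer linguistic analysis.

```python
import re

# Hypothetical pattern (assumption): "<Entity> acquired <Entity>", where an
# entity is one or more capitalized words.
ENTITY = r"[A-Z][a-z]+(?: [A-Z][a-z]+)*"
ACQUISITION = re.compile(rf"({ENTITY}) acquired ({ENTITY})")


def extract_acquisitions(text):
    """Return (buyer, target) pairs for every match of the pattern."""
    return ACQUISITION.findall(text)


# Invented sample text.
sample = "Last year Acme Foods acquired Green Valley. Prices rose afterwards."
facts = extract_acquisitions(sample)
```

Each extracted pair is a relationship between two entities, which is the kind of fact described above.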
Winter School on "Data Mining Techniques and Tools for Knowledge Discovery in Agricultural Datasets"
c)
d)
e)
becomes available.
Topic tracking technology has limitations, however. For example, if a user sets up
an alert for Text Mining, s/he will receive several news stories on mining for
minerals, and very few that are actually on Text Mining. Some of the better Text
Mining tools let users select particular categories of interest, or the software
can even infer the user's interests automatically from his/her reading history
and click-through information. Keyword extraction has become the basis of several
Text Mining applications such as search engines, text categorization, summarization,
and topic detection. Nelken et al. proposed a disambiguation system that separates
the on-topic occurrences of a term from the potential multitude of references
to unrelated entities.
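The limitation described above is easy to reproduce. The sketch below contrasts a naive keyword alert with a slightly stricter phrase alert. Both functions and the sample stories are illustrative assumptions, and this is not the disambiguation system of Nelken et al.

```python
def keyword_alert(story, keywords):
    """Naive alert: fire if any keyword appears as a word in the story."""
    words = set(story.lower().split())
    return any(k in words for k in keywords)


def phrase_alert(story, phrase):
    """Stricter alert: fire only if the whole phrase appears contiguously.

    This is a plain substring test, so it is still only a rough filter.
    """
    return phrase.lower() in story.lower()


# Invented news headlines.
stories = [
    "new copper mining project approved",
    "text mining tools for survey analysis",
]
naive_hits = [s for s in stories if keyword_alert(s, ["text", "mining"])]
phrase_hits = [s for s in stories if phrase_alert(s, "text mining")]
```

The naive alert fires on both stories, including the one about copper mining, while the phrase alert keeps only the story that is actually about Text Mining.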
f)
g)
h)
Summarization: A summary is a text that is produced from one or more texts, that
contains a significant portion of the information in the original text(s), and that is no
longer than half of the original text(s). Text here includes multimedia documents,
on-line documents, hypertexts, etc. The types of summary that have been
identified include indicative summaries (that provide an idea of what the text is
about without giving any content) and informative ones (that do provide some
shortened version of the content). Extracts are summaries created by reusing
portions (words, sentences, etc.) of the input text verbatim, while abstracts are
created by regenerating the extracted content. A generic summary is not tied to a
specific topic, while a query-based summary discusses the
topic mentioned in the given query. A summary can also be created for a single
document or for multiple documents.
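An extract, as defined above, can be sketched with a simple frequency heuristic: score each sentence by the average corpus frequency of its words and reuse the best sentences verbatim, without exceeding half the length of the original. The scoring rule, the naive sentence splitting on full stops, and the sample text are assumptions chosen for brevity; this is one simple technique, not a definitive summarizer.

```python
from collections import Counter


def extractive_summary(text, max_fraction=0.5):
    """Return verbatim sentences whose total length stays within the word budget."""
    # Naive sentence split on full stops (assumption).
    sentences = [s.strip() for s in text.split(".") if s.strip()]
    word_freq = Counter(text.lower().replace(".", " ").split())

    def score(sentence):
        # Average frequency of the sentence's words across the whole text.
        words = sentence.lower().split()
        return sum(word_freq[w] for w in words) / len(words)

    # Budget in words: at most max_fraction of the original length.
    budget = int(len(text.split()) * max_fraction)
    chosen, used = set(), 0
    ranked = sorted(range(len(sentences)), key=lambda i: score(sentences[i]), reverse=True)
    for i in ranked:
        length = len(sentences[i].split())
        if used + length <= budget:
            chosen.add(i)
            used += length
    # Emit the chosen sentences in their original order.
    return [sentences[i] for i in sorted(chosen)]


# Invented sample text.
sample = "Wheat prices rose sharply. Rain delayed the wheat harvest. Officials met on Tuesday."
summary = extractive_summary(sample)
```

Because the sentences are copied unchanged, the result is an extract rather than an abstract; producing an abstract would require regenerating the content.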
Document exploration tools: These organize documents based on their text content
and provide an environment in which a user can navigate and browse a document or
concept space. A popular approach is to cluster the documents based
on their similarity in content and to present the resulting groups or clusters
in a graphical representation.
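Clustering documents by content similarity, as described above, can be sketched with cosine similarity over bag-of-words counts. The single-pass grouping strategy, the threshold of 0.3, and the sample documents are assumptions chosen to keep the example short; real exploration tools typically use more robust clustering algorithms.

```python
import math
from collections import Counter


def cosine_similarity(a, b):
    """Cosine of the angle between two term-count vectors (Counters)."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def cluster_documents(documents, threshold=0.3):
    """Single-pass clustering: join the first cluster whose seed is similar enough."""
    bags = [Counter(d.lower().split()) for d in documents]
    clusters = []  # each cluster is a list of document indices; index 0 is the seed
    for i, bag in enumerate(bags):
        for cluster in clusters:
            if cosine_similarity(bag, bags[cluster[0]]) >= threshold:
                cluster.append(i)
                break
        else:
            # No existing cluster was close enough, so start a new one.
            clusters.append([i])
    return clusters


# Invented sample documents.
docs = [
    "wheat harvest rainfall report",
    "rainfall delays wheat harvest",
    "patent filing for seed drill",
]
groups = cluster_documents(docs)
```

The first two documents share three of their four words and fall into one group, while the third shares none and forms its own.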
Document analysis tools: These analyze the text content of the documents and
discover the relationships between concepts or entities described in them.
They are mainly based on natural language processing techniques, including text
analysis, text categorization, information extraction, and summarization.
There are many possible application domains based on Text Mining technology. We
briefly mention a few below:
Customer profile analysis, e.g., mining incoming emails for customer complaints
and feedback.
Patent analysis, e.g., analyzing patent databases for major technology players,
trends, and opportunities.
Security issues, e.g., analyzing plain text sources such as Internet news. This area
also involves the study of text encryption.
Open-ended survey responses, e.g., analyzing a certain set of words or terms that
are commonly used by respondents to describe the pros and cons of a product or
service (under investigation), suggesting common misconceptions or confusion
regarding the items in the study.
Text classification, e.g., filtering out most undesirable junk email automatically
based on certain terms or words that are not likely to appear in legitimate messages.
Technology watch, e.g., identifying the relevant Science & Technology literature,
and extracting the required information from this literature efficiently.
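The text classification item in the list above can be illustrated with a minimal term-based junk filter. The list of junk terms, the rule of requiring two distinct hits, and the sample messages are hypothetical; practical filters learn such terms from labeled mail rather than hard-coding them.

```python
# Hypothetical list of terms assumed to be rare in legitimate mail.
JUNK_TERMS = {"lottery", "winner", "prize"}


def is_junk(message, junk_terms=JUNK_TERMS, min_hits=2):
    """Flag a message when it contains at least min_hits distinct junk terms."""
    # Strip common punctuation so that "winner!" matches "winner".
    words = {w.strip(".,!?:;").lower() for w in message.split()}
    return len(words & junk_terms) >= min_hits


# Invented sample messages.
inbox = [
    "Congratulations winner! Claim your lottery prize now",
    "Minutes of the field trial meeting are attached",
]
kept = [m for m in inbox if not is_junk(m)]
```

Requiring more than one hit is a design choice that reduces the chance of discarding a legitimate message that happens to contain a single suspicious word.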
Tokenize the file into individual tokens using space as the delimiter.
Apply the Porter stemming algorithm to reduce words that share a common root to that root.
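The two steps above can be sketched as follows. Tokenization splits on the space character, as stated. For stemming, the sketch implements only step 1a of the Porter algorithm (plural suffixes) and is therefore not a complete Porter stemmer; the function names are illustrative assumptions, and a full implementation from an established library would normally be used instead.

```python
def tokenize(text):
    """Split text into tokens, using the space character as the delimiter."""
    return [tok for tok in text.split(" ") if tok]


def porter_step_1a(word):
    """Apply only step 1a of the Porter algorithm (plural suffixes)."""
    if word.endswith("sses"):
        return word[:-2]  # sses -> ss
    if word.endswith("ies"):
        return word[:-2]  # ies -> i
    if word.endswith("ss"):
        return word       # ss -> ss
    if word.endswith("s"):
        return word[:-1]  # s -> (removed)
    return word


tokens = tokenize("caresses ponies caress cats")
stems = [porter_step_1a(t.lower()) for t in tokens]
```

The stem need not be a dictionary word ("ponies" becomes "poni"); what matters is that words sharing a root are mapped to the same string.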
impractical for supervised learning algorithms, the search is for a satisfactory set of
features instead of an optimal set.
c) After the appropriate selection of features, the Text Mining techniques are
applied to applications such as Information Retrieval, Information
Extraction, Summarization and Topic Discovery for the necessary knowledge
discovery process.
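The idea of settling for a satisfactory rather than an optimal feature set, mentioned in the steps above, can be illustrated with a cheap filter that avoids searching over subsets of features altogether. The sketch below keeps terms by document frequency; this particular criterion, its thresholds, and the sample documents are assumptions for illustration and not necessarily the method intended by the source.

```python
from collections import Counter


def select_features(documents, min_df=2, max_df_ratio=0.9):
    """Keep terms found in at least min_df documents but not in nearly all of them."""
    doc_freq = Counter()
    for text in documents:
        # Count each term once per document.
        doc_freq.update(set(text.lower().split()))
    max_df = max_df_ratio * len(documents)
    return sorted(t for t, df in doc_freq.items() if min_df <= df <= max_df)


# Invented sample documents.
docs = [
    "the wheat crop failed",
    "the rice crop thrived",
    "the maize price rose",
]
features = select_features(docs)
```

Terms that occur in a single document are too rare to generalize from, and terms that occur in every document (here "the") do not discriminate between documents, so both are dropped in a single pass over the corpus.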
Figure 3 depicts the management information system in which the knowledge is stored
and from which it is retrieved.