Escolar Documentos
Profissional Documentos
Cultura Documentos
Text mining, also referred to as text data mining, independently or in conjunction with query and analysis
roughly equivalent to text analytics, is the process of de- of elded, numerical data. It is a truism that 80 percent of
riving high-quality information from text. High-quality business-relevant information originates in unstructured
information is typically derived through the devising of form, primarily text.[5] These techniques and processes
patterns and trends through means such as statistical pat- discover and present knowledge facts, business rules,
tern learning. Text mining usually involves the pro- and relationships that is otherwise locked in textual
cess of structuring the input text (usually parsing, along form, impenetrable to automated processing.
with the addition of some derived linguistic features and
the removal of others, and subsequent insertion into a
database), deriving patterns within the structured data, 2 History
and nally evaluation and interpretation of the output.
'High quality' in text mining usually refers to some combi-
Labor-intensive manual text mining approaches rst sur-
nation of relevance, novelty, and interestingness. Typical
faced in the mid-1980s,[6] but technological advances
text mining tasks include text categorization, text cluster-
have enabled the eld to advance during the past decade.
ing, concept/entity extraction, production of granular tax-
Text mining is an interdisciplinary eld that draws on
onomies, sentiment analysis, document summarization,
information retrieval, data mining, machine learning,
and entity relation modeling (i.e., learning relations be-
statistics, and computational linguistics. As most infor-
tween named entities).
mation (common estimates say over 80%)[5] is currently
Text analysis involves information retrieval, lexical anal- stored as text, text mining is believed to have a high com-
ysis to study word frequency distributions, pattern recog- mercial potential value. Increasing interest is being paid
nition, tagging/annotation, information extraction, data to multilingual data mining: the ability to gain informa-
mining techniques including link and association analy- tion across languages and cluster similar items from dif-
sis, visualization, and predictive analytics. The overarch- ferent linguistic sources according to their meaning.
ing goal is, essentially, to turn text into data for analysis,
The challenge of exploiting the large proportion of enter-
via application of natural language processing (NLP) and
prise information that originates in unstructured form
analytical methods.
has been recognized for decades.[7] It is recognized in the
A typical application is to scan a set of documents writ- earliest denition of business intelligence (BI), in an Oc-
ten in a natural language and either model the document tober 1958 IBM Journal article by H.P. Luhn, A Business
set for predictive classication purposes or populate a Intelligence System, which describes a system that will:
database or search index with the information extracted.
"...utilize data-processing machines for
auto-abstracting and auto-encoding of docu-
ments and for creating interest proles for each
1 Text analytics of the 'action points in an organization. Both
incoming and internally generated documents
The term text analytics describes a set of linguistic, are automatically abstracted, characterized by
statistical, and machine learning techniques that model a word pattern, and sent automatically to ap-
and structure the information content of textual sources propriate action points.
for business intelligence, exploratory data analysis,
research, or investigation.[1] The term is roughly synony- Yet as management information systems developed start-
mous with text mining; indeed, Ronen Feldman modied ing in the 1960s, and as BI emerged in the '80s and '90s
a 2000 description of text mining[2] in 2004 to describe as a software category and eld of practice, the empha-
text analytics.[3] The latter term is now used more fre- sis was on numerical data stored in relational databases.
quently in business settings while text mining is used This is not surprising: text in unstructured documents
in some of the earliest application areas, dating to the is hard to process. The emergence of text analytics in its
1980s,[4] notably life-sciences research and government current form stems from a refocusing of research in the
intelligence. late 1990s from algorithm development to application, as
The term text analytics also describes that application of described by Prof. Marti A. Hearst in the paper Untan-
text analytics to respond to business problems, whether gling Text Data Mining:[8]
1
2 4 APPLICATIONS
For almost a decade the computational lin- opinion, mood, and emotion. Text analytics tech-
guistics community has viewed large text col- niques are helpful in analyzing, sentiment at the en-
lections as a resource to be tapped in order tity, concept, or topic level and in distinguishing
to produce better text analysis algorithms. In opinion holder and opinion object.[9]
this paper, I have attempted to suggest a new
emphasis: the use of large online text collec- Quantitative text analysis is a set of techniques stem-
tions to discover new facts and trends about the ming from the social sciences where either a human
world itself. I suggest that to make progress we judge or a computer extracts semantic or grammat-
do not need fully articial intelligent text analy- ical relationships between words in order to nd out
sis; rather, a mixture of computationally-driven the meaning or stylistic patterns of, usually, a casual
and user-guided analysis may open the door to personal text for the purpose of psychological pro-
exciting new results. ling etc.[10]
One online text mining application in the biomedical lit- 4.7 Academic applications
erature is PubGene that combines biomedical text mining
with network visualization as an Internet service.[13][14] The issue of text mining is of importance to publishers
who hold large databases of information needing indexing
GoPubMed is a knowledge-based search engine for
for retrieval. This is especially true in scientic disci-
biomedical texts.
plines, in which highly specic information is often con-
tained within written text. Therefore, initiatives have
been taken such as Natures proposal for an Open Text
4.3 Software applications Mining Interface (OTMI) and the National Institutes of
Health's common Journal Publishing Document Type
Text mining methods and software is also being re- Denition (DTD) that would provide semantic cues to
searched and developed by major rms, including IBM machines to answer specic queries contained within text
and Microsoft, to further automate the mining and anal- without removing publisher barriers to public access.
ysis processes, and by dierent rms working in the area Academic institutions have also become involved in the
of search and indexing in general as a way to improve text mining initiative:
their results. Within public sector much eort has been
concentrated on creating software for tracking and mon- The National Centre for Text Mining (NaCTeM),
itoring terrorist activities.[15] is the rst publicly funded text mining centre in the
world. NaCTeM is operated by the University of
Manchester[24] in close collaboration with the Tsujii
Lab,[25] University of Tokyo.[26] NaCTeM provides
4.4 Online media applications customised tools, research facilities and oers ad-
vice to the academic community. They are funded
Text mining is being used by large media companies, such by the Joint Information Systems Committee (JISC)
as the Tribune Company, to clarify information and to and two of the UK Research Councils (EPSRC &
provide readers with greater search experiences, which in BBSRC). With an initial focus on text mining in
turn increases site stickiness and revenue. Additionally, the biological and biomedical sciences, research has
on the back end, editors are beneting by being able to since expanded into the areas of social sciences.
share, associate and package news across properties, sig-
In the United States, the School of Information at
nicantly increasing opportunities to monetize content.
University of California, Berkeley is developing a
program called BioText to assist biology researchers
in text mining and analysis.
4.5 Business and marketing applications The Text Analysis Portal for Research (TAPoR),
currently housed at the University of Alberta, is a
Text mining is starting to be used in marketing as scholarly project to catalogue text analysis applica-
well, more specically in analytical customer relation- tions and create a gateway for researchers new to the
ship management.[16] Coussement and Van den Poel practice.
(2008)[17][18] apply it to improve predictive analyt-
ics models for customer churn (customer attrition).[17]
Text mining is also being applied in stock returns 4.8 Digital humanities and computational
prediction.[19] sociology
The automatic analysis of vast textual corpora has created
the possibility for scholars to analyse millions of docu-
4.6 Sentiment analysis ments in multiple languages with very limited manual in-
tervention. Key enabling technologies have been parsing,
Sentiment analysis may involve analysis of movie reviews machine translation, topic categorization, and machine
for estimating how favorable a review is for a movie.[20] learning.
Such an analysis may need a labeled data set or labeling The automatic parsing of textual corpora has enabled the
of the aectivity of words. Resources for aectivity of extraction of actors and their relational networks on a
words and concepts have been made for WordNet[21] and vast scale, turning textual data into network data. The re-
ConceptNet,[22] respectively. sulting networks, which can contain thousands of nodes,
Text has been used to detect emotions in the related area are then analysed by using tools from network theory
of aective computing.[23] Text based approaches to af- to identify the key actors, the key communities or par-
fective computing have been used on multiple corpora ties, and general properties such as robustness or struc-
such as students evaluations, children stories and news tural stability of the overall network, or centrality of cer-
stories. tain nodes.[28] This automates the approach introduced by
4 7 IMPLICATIONS
[9] Full Circle Sentiment Analysis. Breakthrough Analysis. [24] The University of Manchester. Manchester.ac.uk. Re-
Retrieved 2015-02-23. trieved 2015-02-23.
6 10 EXTERNAL LINKS
11.2 Images
File:Lock-green.svg Source: https://upload.wikimedia.org/wikipedia/commons/6/65/Lock-green.svg License: CC0 Contributors: en:File:
Free-to-read_lock_75.svg Original artist: User:Trappist the monk
File:Open_Access_logo_PLoS_transparent.svg Source: https://upload.wikimedia.org/wikipedia/commons/7/77/Open_Access_logo_
PLoS_transparent.svg License: CC0 Contributors: http://www.plos.org/ Original artist: art designer at PLoS, modied by Wikipedia users
Nina, Beao, and JakobVoss
File:Tripletsnew2012.png Source: https://upload.wikimedia.org/wikipedia/commons/4/43/Tripletsnew2012.png License: CC BY 4.0
Contributors: Own work Original artist: Thinkbig-project