Outline
Text Mining and Information Extraction
Introduction to Text Mining
Methods of Information Extraction
Applications to Information Extraction
Advanced Topics
Mobile Search
LiveClassifier
References
Marti Hearst, What is Text Mining, http://www.sims.berkeley.edu/~hearst/text-mining.html
Marti Hearst, Untangling Text Data Mining, ACL 1999.
Douglas E. Appelt and David J. Israel, Introduction to Information Extraction Technology, IJCAI 1999 Tutorial.
Ion Muslea, Extraction Patterns for Information Extraction Tasks: A Survey, AAAI 1999 Workshop on Machine Learning for Information Extraction.
Andrew McCallum and William Cohen, Information Extraction from the World Wide Web, KDD 2003 tutorial. (Also earlier version in NIPS 2002 tutorial)
Hamish Cunningham, Named Entity Recognition, RANLP 2003 tutorial.
Text Mining
Text Mining is the discovery by computer of new, previously unknown information, by automatically extracting information from different written resources [Marti Hearst]
IE
NAME               TITLE     ORGANIZATION
Bill Gates         CEO       Microsoft
Bill Veghte        VP        Microsoft
Richard Stallman   founder   Free Soft..
As a family of techniques:
Information Extraction =
segmentation + classification + association + clustering
aka named entity extraction
Microsoft Corporation
CEO
Bill Gates
Microsoft
Gates
Microsoft
Bill Veghte
Microsoft
VP
Richard Stallman
founder
Free Software Foundation
As a family of techniques:
Information Extraction =
segmentation + classification + association + clustering

Today, Microsoft claims to "love" the open-source concept, by which software code is made public to encourage improvement and development by outside programmers. Gates himself says Microsoft will gladly disclose its crown jewels--the coveted code behind the Windows operating system--to select customers.

"We can be open source. We love the concept of shared source," said Bill Veghte, a Microsoft VP. "That's a super-important shift for us in terms of code access."

Richard Stallman, founder of the Free Software Foundation, countered saying

* Microsoft Corporation
CEO
Bill Gates
* Microsoft
Gates
* Microsoft
Bill Veghte
* Microsoft
VP
Richard Stallman
founder
Free Software Foundation
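
The four steps named in the slide can be sketched on a shortened version of the passage. This is only a toy illustration: the lexicons, the sentence-level association rule, and the clustering heuristic (last name for people, dropping "Corporation" for organizations) are assumptions made for this example, not the method of any system cited in the slides.

```python
import re

# Shortened version of the passage above; the lexicons are hypothetical toy data.
TEXT = (
    "Gates himself says Microsoft will gladly disclose its crown jewels. "
    '"We love the concept of shared source," said Bill Veghte, a Microsoft VP. '
    "Richard Stallman, founder of the Free Software Foundation, countered."
)
LEXICONS = {
    "ORGANIZATION": ["Microsoft Corporation", "Microsoft", "Free Software Foundation"],
    "TITLE": ["CEO", "VP", "founder"],
    "NAME": ["Bill Gates", "Gates", "Bill Veghte", "Richard Stallman"],
}

def segment_and_classify(text):
    """Segmentation + classification: locate spans and label each one."""
    spans = []
    for label, phrases in LEXICONS.items():
        for phrase in phrases:
            for m in re.finditer(r"\b" + re.escape(phrase) + r"\b", text):
                spans.append((m.start(), m.end(), label, phrase))
    # Scan left to right; at equal start positions the longest span wins.
    spans.sort(key=lambda s: (s[0], s[0] - s[1]))
    kept, last_end = [], 0
    for span in spans:
        if span[0] >= last_end:
            kept.append(span)
            last_end = span[1]
    return kept

def cluster_key(label, phrase):
    """Clustering: map mentions of the same entity to one key (toy heuristic)."""
    if label == "NAME":
        return phrase.split()[-1]              # "Bill Gates", "Gates" -> "Gates"
    if label == "ORGANIZATION":
        return phrase.replace(" Corporation", "")
    return phrase

def extract(text):
    """Association: entities found in the same sentence form one record."""
    records = []
    for sentence in re.split(r"(?<=[.!?])\s+", text):
        record = {}
        for _, _, label, phrase in segment_and_classify(sentence):
            record.setdefault(label, cluster_key(label, phrase))
        if "NAME" in record:
            records.append(record)
    return records

for row in extract(TEXT):
    print(row)
```

Each printed record corresponds to one row of a NAME / TITLE / ORGANIZATION table; a real system would replace every heuristic here with a trained or hand-engineered component.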
IE in Context
[Figure: Document collection -> Spider -> Filter by relevance -> IE (Segment, Classify, Associate, Cluster) -> Load DB -> Database -> Query, Search / Data mine; Create ontology shown alongside the pipeline]
Lexicons
Abraham Lincoln was born in Kentucky.
member? Alabama, Alaska, ..., Wisconsin, Wyoming
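
A lexicon test amounts to a set-membership check on each token. The sketch below assumes whitespace tokenization and a five-entry sample of the state list; both are simplifications chosen for this example.

```python
import string

# Sample of the lexicon; the slide lists Alabama through Wyoming.
US_STATES = {"Alabama", "Alaska", "Kentucky", "Wisconsin", "Wyoming"}

def lexicon_matches(sentence, lexicon):
    """Return the tokens of `sentence` that are members of `lexicon`."""
    tokens = [tok.strip(string.punctuation) for tok in sentence.split()]
    return [tok for tok in tokens if tok in lexicon]

print(lexicon_matches("Abraham Lincoln was born in Kentucky.", US_STATES))
# prints ['Kentucky']
```

Single-token lookup misses multi-word entries such as "New York"; handling those requires testing longer token spans.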
Boundary Models
Abraham Lincoln was born in Kentucky.
Classifier: which class? (BEGIN / END)

Sliding Window
Abraham Lincoln was born in Kentucky.
Classifier: which class?
Try alternate window sizes.

[Figure: parse tree of the same sentence with node labels NNP, NP, VP, PP, S. Most likely parse?]
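
The sliding-window idea can be sketched as follows: enumerate every window up to a maximum size and ask a classifier for its class. The classifier here is a hypothetical stand-in (two consecutive capitalized tokens count as a PERSON), used only so the example runs; in practice it would be a trained model.

```python
def sliding_windows(tokens, max_size):
    """Yield (start, end, window) for every window of 1..max_size tokens."""
    for size in range(1, max_size + 1):
        for start in range(len(tokens) - size + 1):
            yield start, start + size, tokens[start:start + size]

def toy_classifier(window):
    """Hypothetical stand-in for a trained classifier: which class?"""
    if len(window) == 2 and all(tok[0].isupper() for tok in window):
        return "PERSON"
    return None

tokens = "Abraham Lincoln was born in Kentucky".split()
labelled = [
    (" ".join(window), toy_classifier(window))
    for _, _, window in sliding_windows(tokens, max_size=3)
    if toy_classifier(window) is not None
]
print(labelled)
# prints [('Abraham Lincoln', 'PERSON')]
```

With six tokens and window sizes 1 to 3 the classifier is queried on 15 windows, which is why the cost of this approach grows with the largest window size tried.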
Methods of Information Extraction
Three Approaches
NLP (Linguistic) approach
  Named entity recognition
Extraction pattern/template approach
  Wrapper generation/induction
Statistical approach
  Class-based language model
  PAT-tree-based
MUC-7 Tasks
NE: Named Entity recognition and
typing
CO: co-reference resolution
TE: Template Elements (attributes)
TR: Template Relations
ST: Scenario Templates
An Example
Performance Levels
Vary according to text type, domain, scenario, language
NE: up to 97% (tested in English, Spanish, Japanese, Chinese)
CO: 60-70% resolution
TE: 80%
TR: 75-80%
ST: 60% (but human level may be only 80%)
Problems in NE
Variation of NEs, e.g. John Smith, Mr Smith, John.
Ambiguity of NE types:
  John Smith (company vs. person)
  May (person vs. month)
  Washington (person vs. location)
  1945 (date vs. time)
Ambiguity with common words, e.g. "may"
More complex problems in NE
Issues of style, structure, domain, genre etc.
Punctuation, spelling, spacing, formatting, ...
Precision = (Correct + ½ × Partially correct) / (Correct + Incorrect + Partial)
Recall = (Correct + ½ × Partially correct) / (Correct + Missing + Partial)
Why: NE boundaries are often misplaced, so some results are only partially correct
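
A small worked example of the scoring formulas. The function name and the counts are made up for illustration, and `partial_weight` is a parameter introduced here for the share of credit a partially correct answer receives (MUC-style scoring conventionally uses one half).

```python
def ne_scores(correct, partial, incorrect, missing, partial_weight):
    """Precision and recall with partial credit for partially correct NEs."""
    credit = correct + partial_weight * partial
    precision = credit / (correct + incorrect + partial)
    recall = credit / (correct + missing + partial)
    return precision, recall

# Illustrative counts only: 80 correct, 10 partial, 10 incorrect, 20 missing.
precision, recall = ne_scores(80, 10, 10, 20, partial_weight=0.5)
print(f"precision={precision:.3f} recall={recall:.3f}")
# prints precision=0.850 recall=0.773
```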
Word segmentation
Syntactic analysis
Parsing
Domain-specific module
Coreference
Merging partial results
Rule based
Developed by experienced language engineers, linguistic resources required
Make use of human intuition
Requires only small amount of training data
Development could be very time consuming
Good performance
Some changes may be hard to accommodate
Learning Systems
Rule-based Examples
FACILE - used in MUC-7 [Black et al 98]
ANNIE - part of GATE, Sheffield's open-source infrastructure for language processing
Gazetteer Lists for Rule-based NE
Internal location indicators, e.g. {river, mountain, forest} for natural locations; {street, road, crescent, place, square, ...} for address locations
Internal organization indicators, e.g. company designators {GmbH, Ltd, Inc, ...}
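
One way such indicator lists could drive a rule is sketched below. The rule (any token of the phrase matching an indicator decides the type) and the sample phrases are assumptions for illustration; systems such as those named above use richer grammars.

```python
# Indicator lists taken from the slide.
LOCATION_INDICATORS = {"river", "mountain", "forest",
                       "street", "road", "crescent", "place", "square"}
COMPANY_DESIGNATORS = {"GmbH", "Ltd", "Inc"}

def tag_by_indicator(phrase):
    """Guess an NE type from an internal indicator word, or return None."""
    tokens = [tok.strip(".,") for tok in phrase.split()]
    if any(tok in COMPANY_DESIGNATORS for tok in tokens):
        return "ORGANIZATION"
    if any(tok.lower() in LOCATION_INDICATORS for tok in tokens):
        return "LOCATION"
    return None

# Hypothetical candidate phrases.
for phrase in ["Acme Ltd.", "Baker Street", "Mississippi River", "John Smith"]:
    print(phrase, "->", tag_by_indicator(phrase))
```

Company designators are matched case-sensitively because they are conventionally written in a fixed form, while location indicators are lowercased before lookup.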
Example approaches
IdentiFinder [Bikel et al 99] (Hidden Markov Models)
MENE [Borthwick et al 98], combining rule-based and ML NE (Maximum Entropy)
NE Recognition without Gazetteers [Mikheev et al 99], combining rule-based grammars and statistical (MaxEnt) models
Fine-grained Classification of NEs [Fleischman 02]
Ex: Person classification into 8 sub-categories: athlete, politician/government, clergy, businessperson, entertainer/artist, lawyer, doctor/scientist, police
Boosting
Bootstrapping
Class-based Language Model
Conditional Markov Model
Decision Tree
Hidden Markov Model
Maximum Entropy Model
Memory-based Learning
Stacking
Support Vector Machine
Transformation-based Learning
Voted Perceptron
Others
Extraction Pattern/Template Approach
Types of extraction patterns (rules)
  Syntactic/semantic constraints
  Delimiter-based
  Combination of both
Types of documents to be extracted
  Semi-structured documents: Web pages, ...
  Unstructured documents: free text
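
A delimiter-based rule for a semi-structured page can be as small as a pair of left and right delimiters. The page fragment and the helper `delimiter_rule` below are hypothetical, written only to illustrate the idea; wrapper-induction systems learn such delimiters from labelled pages instead of hard-coding them.

```python
import re

# Hypothetical semi-structured page fragment.
PAGE = ("<li><b>Bill Gates</b> - <i>CEO</i></li>"
        "<li><b>Bill Veghte</b> - <i>VP</i></li>")

def delimiter_rule(left, right):
    """Build a rule that captures the text between two delimiters."""
    return re.compile(re.escape(left) + r"(.*?)" + re.escape(right))

name_rule = delimiter_rule("<b>", "</b>")
title_rule = delimiter_rule("<i>", "</i>")

# Pair the i-th name with the i-th title.
records = list(zip(name_rule.findall(PAGE), title_rule.findall(PAGE)))
print(records)
# prints [('Bill Gates', 'CEO'), ('Bill Veghte', 'VP')]
```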
Statistical Approach to IE
Class-based language model [Brown et al. 1992] for Chinese NE [COLING 2002]
\[ (C^*, W^*) = \arg\max_{C,W} P(C, W) = \arg\max_{C,W} P(W \mid C)\, P(C) \]
\[ P(W \mid C) = P(w_1 \ldots w_m \mid c_1 \ldots c_m) \approx \prod_{i=1}^{m} P(w_i \mid c_i) \]
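
The decoding rule can be illustrated with a brute-force search over class sequences. Several simplifications are assumed here: the word sequence W is taken as given, so only C is searched; P(C) is factored as a class bigram model; and all probabilities are invented toy values, in English rather than Chinese.

```python
from itertools import product

CLASSES = ["PER", "LOC", "O"]
# Hypothetical toy tables. EMISSION[c][w] plays the role of P(w | c).
EMISSION = {
    "PER": {"Washington": 0.3, "Lincoln": 0.7},
    "LOC": {"Washington": 0.6, "Kentucky": 0.4},
    "O": {"visited": 0.5, "in": 0.5},
}
# TRANSITION[prev][c] plays the role of P(c | prev); "<s>" marks the start.
TRANSITION = {
    "<s>": {"PER": 0.5, "LOC": 0.2, "O": 0.3},
    "PER": {"PER": 0.2, "LOC": 0.1, "O": 0.7},
    "LOC": {"PER": 0.1, "LOC": 0.2, "O": 0.7},
    "O": {"PER": 0.3, "LOC": 0.4, "O": 0.3},
}

def joint(words, classes):
    """P(W | C) * P(C) under the per-word and bigram assumptions."""
    prob, prev = 1.0, "<s>"
    for word, cls in zip(words, classes):
        prob *= TRANSITION[prev][cls] * EMISSION[cls].get(word, 0.0)
        prev = cls
    return prob

def decode(words):
    """Return the class sequence maximizing the joint probability."""
    best = max(product(CLASSES, repeat=len(words)),
               key=lambda classes: joint(words, classes))
    return best, joint(words, best)

print(decode(["Lincoln", "visited", "Washington"]))
```

Exhaustive enumeration is exponential in sentence length; it is used here only to keep the sketch short, and dynamic programming would normally replace it.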
PAT-tree-based Approach
SCP (Symmetric Conditional Probability)
Cohesion holding the words together
Low frequency n-grams tend to be discarded
CD (Context Dependency)
Dependence on the left- or right- adjacent
word/character
Low frequency n-grams can be extracted
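Association Measure
\[ SCP(w_1 \ldots w_n) = \frac{p(w_1 \ldots w_n)^2}{\frac{1}{n-1}\sum_{i=1}^{n-1} p(w_1 \ldots w_i)\, p(w_{i+1} \ldots w_n)} = \frac{freq(w_1 \ldots w_n)^2}{\frac{1}{n-1}\sum_{i=1}^{n-1} freq(w_1 \ldots w_i)\, freq(w_{i+1} \ldots w_n)} \]
\[ CD(w_1 \ldots w_n) = \frac{LC(w_1 \ldots w_n)\, RC(w_1 \ldots w_n)}{freq(w_1 \ldots w_n)^2} \]
\[ SCPCD(w_1 \ldots w_n) = SCP(w_1 \ldots w_n) \times CD(w_1 \ldots w_n) = \frac{LC(w_1 \ldots w_n)\, RC(w_1 \ldots w_n)}{\frac{1}{n-1}\sum_{i=1}^{n-1} freq(w_1 \ldots w_i)\, freq(w_{i+1} \ldots w_n)} \]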
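
The three measures can be computed directly from n-gram counts. The sketch below uses a plain counter instead of a PAT tree and assumes that LC and RC are the numbers of distinct tokens seen immediately to the left and right of the n-gram; that reading of LC and RC, and the toy corpus, are assumptions made for this example.

```python
from collections import Counter, defaultdict

def ngram_stats(tokens, max_n):
    """Count n-grams and collect their distinct left/right neighbours."""
    freq = Counter()
    left, right = defaultdict(set), defaultdict(set)
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            gram = tuple(tokens[i:i + n])
            freq[gram] += 1
            if i > 0:
                left[gram].add(tokens[i - 1])
            if i + n < len(tokens):
                right[gram].add(tokens[i + n])
    return freq, left, right

def scp(gram, freq):
    """Symmetric conditional probability of an observed n-gram, n >= 2."""
    n = len(gram)
    avg = sum(freq[gram[:i]] * freq[gram[i:]] for i in range(1, n)) / (n - 1)
    return freq[gram] ** 2 / avg

def cd(gram, freq, left, right):
    """Context dependency: LC * RC / freq^2."""
    return len(left[gram]) * len(right[gram]) / freq[gram] ** 2

def scpcd(gram, freq, left, right):
    """Combined measure SCP * CD."""
    return scp(gram, freq) * cd(gram, freq, left, right)

tokens = "new york is big and new york is old".split()
freq, left, right = ngram_stats(tokens, max_n=3)
gram = ("new", "york")
print(scp(gram, freq), cd(gram, freq, left, right), scpcd(gram, freq, left, right))
# prints 1.0 0.25 0.25
```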
Association Measure   Precision   Recall   Avg. R-P
CD                    68.1 %      5.9 %    37.0 %
SCP                   62.6 %      63.3 %   63.0 %
SCPCD                 79.3 %      78.2 %   78.7 %
Speed Performance
Table 2. The obtained average speed performance of different term extraction methods.
Term Extraction Method            Time for Preprocessing   Time for Extraction
[unlabeled]                       0.87 s                   0.99 s
PATtree+LocalMaxs (Web Queries)   2.30 s                   0.61 s
[unlabeled]                       63.47 s                  4,851.67 s
PATtree+LocalMaxs (1,367 docs)    840.90 s                 71.24 s
[unlabeled]                       47,247.55 s              350,495.65 s
PATtree+LocalMaxs (5,357 docs)    11,086.67 s              759.32 s
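Broader View
some other issues
[Figure: the IE-in-context pipeline (Document collection, Spider, Filter by relevance, Tokenize, IE: Segment, Classify, Associate, Cluster, Load DB, Database, Query/Search, Data mine, Create ontology), annotated with issue numbers 1, 2, 3 (Create ontology) and 5 (Data mine)]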
Applications to Information
Extraction
LiveTrans
LiveClassifier
LiveTrans (Academia Sinica)
[Figure: LiveTrans engine, the Web, search, search results, target translations]
LiveTrans: http://livetrans.iis.sinica.edu.tw/lt.html
Anchor-text set
  A set of anchor texts pointing to the same page (URL)
Anchor-text-set corpus
  A collection of anchor-text sets
Multilingual translations
[Figure: anchor texts from pages in Korea, Japan, Taiwan, China and elsewhere (e.g. "Yahoo Search Engine", "Yahoo! America", "Yahoo!") all pointing to http://www.yahoo.com; some of the anchor texts are noise]
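
Building an anchor-text-set corpus amounts to grouping anchor texts by the URL they point to. The link pairs and the second URL below are hypothetical sample data standing in for what a spider would collect.

```python
from collections import defaultdict

# Hypothetical (anchor text, target URL) pairs as a spider might collect them.
LINKS = [
    ("Yahoo!", "http://www.yahoo.com"),
    ("Yahoo! America", "http://www.yahoo.com"),
    ("Yahoo Search Engine", "http://www.yahoo.com"),
    ("National Palace Museum", "http://example.org/museum"),
]

def build_anchor_text_corpus(links):
    """Group anchor texts into one set per target page (URL)."""
    corpus = defaultdict(set)
    for anchor_text, url in links:
        corpus[url].add(anchor_text)
    return dict(corpus)

corpus = build_anchor_text_corpus(LINKS)
for url, anchor_set in corpus.items():
    print(url, sorted(anchor_set))
```

Anchor texts written in different languages that land in the same set are candidate translations of one another, which is what makes the corpus useful for translation extraction.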
[Figure: system components. Web, Spider, Anchor-Text Corpus, Term Extraction, Search Engine, Search-Result Pages, Similarity Estimation; input: Source Query (e.g. National Palace Museum); output: Target Translation]
IE Resources
Data
  RISE, http://www.isi.edu/~muslea/RISE/index.html
  Linguistic Data Consortium (LDC): Penn Treebank, Named Entities, Relations, etc.
  http://www.biostat.wisc.edu/~craven/ie
  http://www.cs.umass.edu/~mccallum/data
Code
  TextPro, http://www.ai.sri.com/~appelt/TextPro
  MALLET, http://www.cs.umass.edu/~mccallum/mallet
Both
  http://www.cis.upenn.edu/~adwait/penntools.html
  http://www.cs.umass.edu/~mccallum/ie
Others
Other Workshops
Workshop on Text Mining and Link Analysis
  TextLink 2007, in IJCAI 2007
  TextLink 2003, in IJCAI 2003
Data Provider
Repositories
Metadata
[Figure: User, Service Provider, and Data Providers with their repositories; Request and Response pass between the Service Provider and a Data Provider]
Bronze Script
Seal Script
Goal
We intend to develop an integrated technology to facilitate easier processing of large amounts of missing characters
This includes the input, representation, font generation, display, distribution, and search for all the missing characters
We propose an effective composite approach to handle the formation and basic components of Chinese characters
Advanced Topics
Mobile Search
LiveClassifier
Concept Search
Mobile Search
References
R. Schusteritsch, S. Rao, and K. Rodden, Mobile Search with Text Messages: Designing the User Experience for Google SMS, Proceedings of CHI 2005, pp. 1777-1780 (poster).
B. Miller, China's Internet Portals and Content Providers Look to a Future Beyond Mobile Text Messaging, The Yankee Group Report, Feb. 2004.
Communications of the ACM, Vol. 48, No. 7, Designing for the Mobile Devices, Jul. 2005.
Content Explodes
News alerts
Sports news
Weather
Special interest information
Ex.: Yao Ming
http://mobile.google.com/ (XHTML, WML)
(Images)
(Mobile Web)
Google SMS
http://www.google.com/sms/
http://www.google.com/gmm/
Yahoo (92466)
4INFO (44636)
AOL
MSN
Synfonic, UpSNAP, ...
Google SMS
Specialized information
Business listings
Residential listings
Product prices
Dictionary definitions
Area codes
Zip codes
Design Constraints
Conceptual Model
Communicating affordances
Minimal and prominent sequence of instructions and the features that are considered most useful are shown on the Google SMS home page
Sending the message "help" should return a concise set of instructions on how to use it
Collaborate with PR team to mention the "help" command and highlight the most important features
Work with Marketing team to develop a wallet-size instruction card
Alternatively, SMS
LiveClassifier
LiveClassifier
Exploiting the structure information inherent for training
[Figure: the Web as training source for LiveClassifier; class hierarchy with People, Place, Subjects, Sub-subjects]
LiveClassifier
Classifying documents into classes
Classifying short texts into classes
Concept Search
Conventional search
Concept-level search
[Figure: documents linked to concepts such as professor, researcher, AI, NTU, neural network, Taiwan; one is marked "Interesting document"]
LiveTrans + LiveClassifier