Text Retrieval and Applications

More Advanced Topics

J. H. Wang
May 20, 2008
Outline
Text Mining and Information Extraction
Introduction to Text Mining
Methods of Information Extraction
Applications to Information Extraction
Applications in Digital Libraries

OAI
Unencoded character problem
Advanced Topics
Mobile Search
LiveClassifier
Text Mining and Information Extraction
Introduction to Text Mining
Methods of Information Extraction
Applications to Information Extraction
References
Marti Hearst, What is Text Mining,
http://www.sims.berkeley.edu/~hearst/text-mining.html
Marti Hearst, Untangling Text Data Mining, ACL 1999.
Douglas E. Appelt and David J. Israel, Introduction to
Information Extraction Technology, IJCAI 1999 Tutorial.
Ion Muslea, Extraction Patterns for Information
Extraction Tasks: A Survey, AAAI 1999 Workshop on
Machine Learning for Information Extraction.
Andrew McCallum and William Cohen, Information
Extraction from the World Wide Web, KDD 2003 tutorial.
(Also earlier version in NIPS 2002 tutorial)
Hamish Cunningham, Named Entity Recognition, RANLP
2003 tutorial.
Text Mining
Text Mining is the discovery by
computer of new, previously
unknown information, by
automatically extracting
information from different written
resources [Marti Hearst]
Text Mining vs. Web Search

In search, the user is typically
looking for something that is
already known and has been
written by someone else
In text mining, the goal is to
discover heretofore unknown
information, something that no one
yet knows and so could not have
yet written down
Text Mining vs. Data Mining

Data mining tries to find
interesting (non-trivial, implicit,
previously unknown, potentially
useful) patterns from large
databases
In text mining, the patterns are
extracted from natural language
text rather than from structured
databases of facts
Text Mining vs. Computational Linguistics (or NLP)
NLP is making a lot of progress in
doing small subtasks in text
analysis
Word segmentation, part-of-speech
tagging, word sense disambiguation, ...
Text understanding vs. mining
Text Mining vs. Information Extraction
There are programs that can, with
reasonable accuracy, extract
information from text with
somewhat regularized structure
Discovering new knowledge vs.
showing trends
What is Information Extraction

As a task:
Filling slots in a database from sub-segments of text.
October 14, 2002, 4:00 a.m. PT

For years, Microsoft Corporation CEO Bill
Gates railed against the economic philosophy
of open-source software with Orwellian fervor,
denouncing its communal licensing as a
"cancer" that stifled technological innovation.
Today, Microsoft claims to "love" the open-source concept, by which software code is
made public to encourage improvement and
development by outside programmers. Gates
himself says Microsoft will gladly disclose its
crown jewels--the coveted code behind the
Windows operating system--to select
customers.
"We can be open source. We love the concept
of shared source," said Bill Veghte, a
Microsoft VP. "That's a super-important shift
for us in terms of code access."
Richard Stallman, founder of the Free
Software Foundation, countered saying

IE output (slots filled from the article above):
NAME              TITLE    ORGANIZATION
Bill Gates        CEO      Microsoft
Bill Veghte       VP       Microsoft
Richard Stallman  founder  Free Soft..
As a family of techniques:
Information Extraction =
segmentation + classification + association + clustering
Segments found in the article (aka named entity extraction):
Microsoft Corporation, CEO, Bill Gates, Microsoft, Gates, Microsoft,
Bill Veghte, Microsoft, VP, Richard Stallman, founder,
Free Software Foundation
Clustering groups mentions of the same entity (marked *):
* Microsoft Corporation
* Microsoft
IE in Context
Document collection -> Spider -> Filter by relevance -> IE -> Load DB -> Database -> Query, Search; Data mine
IE steps: Segment, Classify, Associate, Cluster
Supporting steps: Create ontology; Label training data -> Train extraction models
Information Extraction Tasks

Information Extraction (IE) pulls facts and
structured information from the content of
large text collections
Unstructured or semi-structured -> structured
Federal government funded research

MUC: Message Understanding Conferences (1987-1998) by DARPA
TIPSTER (1991-1998) by DARPA
ACE: Automatic Content Extraction (1999-) by NIST
http://www.nist.gov/speech/tests/ace/
http://www.ldc.upenn.edu/Projects/ACE/
Landscape of IE Techniques: Models

Example sentence: Abraham Lincoln was born in Kentucky.
Lexicons: is the candidate a member of a list (Alabama, Alaska, ..., Wisconsin, Wyoming)?
Classify Pre-segmented Candidates: a classifier decides which class each candidate belongs to
Sliding Window: a classifier decides which class each window belongs to; try alternate window sizes
Boundary Models: classifiers mark BEGIN and END boundaries
Finite State Machines: most likely state sequence?
Context Free Grammars: most likely parse? (NNP, NP, PP, VP, S)
Any of these models can be used to capture words, formatting or both.
Methods of Information Extraction
Three Approaches
NLP (Linguistic) approach
Named entity recognition
Extraction pattern/template approach
Wrapper generation/induction
Statistical approach
Class-based language model
PAT-tree-based
NLP (Linguistic) Approach to IE

MUC-7 tasks
Preprocessing
Two kinds of approaches
Baseline
Rule-based approach
Learning-based approach
MUC-7 Tasks
NE: Named Entity recognition and
typing
CO: co-reference resolution
TE: Template Elements (attributes)
TR: Template Relations
ST: Scenario Templates
An Example
The shiny red rocket was fired on Tuesday. It is the
brainchild of Dr. Big Head. Dr. Head is a staff scientist
at We Build Rockets Inc.
NE: entities are "rocket", "Tuesday", "Dr. Head" and "We Build Rockets"
CO: "it" refers to the rocket; "Dr. Head" and "Dr. Big Head" are the same
TE: the rocket is "shiny red" and Head's "brainchild".
TR: Dr. Head works for We Build Rockets Inc.
ST: a rocket launching event occurred with the
Performance Levels
Vary according to text type, domain,
scenario, language
NE: up to 97% (tested in English,
Spanish, Japanese, Chinese)
CO: 60-70% resolution
TE: 80%
TR: 75-80%
ST: 60% (but human level may be only
80%)
What are Named Entities?

NER involves identification of proper
names in texts, and classification into a
set of predefined categories of interest
Person names
Organizations (companies, government
organizations, committees, etc)
Locations (cities, countries, rivers, etc)
Date and time expressions
What are Named Entities (2)

Other common types: measures (percent,
money, weight etc), email addresses, Web
addresses, street addresses, etc.
Some domain-specific entities: names of
drugs, medical conditions, names of ships,
bibliographic references etc.
MUC-7 entity definition guidelines
[Chinchor97]
http://www.itl.nist.gov/iaui/894.02/related_projects/muc/proceedings/ne_task.html
Problems in NE
Variation of NEs e.g. John Smith, Mr Smith,
John.
Ambiguity of NE types: John Smith (company
vs. person)
May (person vs. month)
Washington (person vs. location)
1945 (date vs. time)
Ambiguity with common words, e.g. "may"
More complex problems in NE
Issues of style, structure, domain, genre etc.
Punctuation, spelling, spacing,
formatting, ...
The Evaluation Metric

Precision vs. recall
$$F = \frac{(\beta^2 + 1)\,P R}{\beta^2 R + P}$$
[van Rijsbergen 75]
β reflects the weighting between precision and
recall, typically β = 1
We may also want to take account of partially
correct answers:
Precision =
(Correct + ½ Partially correct) / (Correct + Incorrect
+ Partial)
Recall =
(Correct + ½ Partially correct) / (Correct + Missing +
Partial)
Why: NE boundaries are often misplaced, so
some results are only partially correct
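The scoring above can be sketched in a few lines of Python. This is only an illustration: the function names and the example counts are made up, the F-measure places β² exactly as the slide does (with the default β = 1 it reduces to 2PR / (P + R)), and the half credit for partial matches is an assumption following the MUC scoring convention.

```python
def f_measure(precision: float, recall: float, beta: float = 1.0) -> float:
    """F-measure with beta^2 placed as on the slide: (b^2 + 1)PR / (b^2 R + P)."""
    b2 = beta * beta
    denominator = b2 * recall + precision
    if denominator == 0:
        return 0.0
    return (b2 + 1) * precision * recall / denominator


def partial_credit_scores(correct, partial, incorrect, missing, partial_weight=0.5):
    """Precision and recall where a partially correct answer earns partial_weight.

    partial_weight=0.5 is an assumption (half credit per partial match).
    """
    credit = correct + partial_weight * partial
    precision = credit / (correct + incorrect + partial)
    recall = credit / (correct + missing + partial)
    return precision, recall


# Hypothetical counts for one NE evaluation run.
p, r = partial_credit_scores(correct=80, partial=10, incorrect=10, missing=20)
print(f"P={p:.3f} R={r:.3f} F1={f_measure(p, r):.3f}")
```

Raising or lowering `partial_weight` shows how sensitive the reported scores are to misplaced entity boundaries.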
Pre-processing for NE Recognition

Tokenization
Word segmentation
Lexical or morphological processing

Part of speech tagging
Word sense tagging
Syntactic analysis
Parsing
Domain-specific module
Coreference
Merging partial results
Two kinds of NE approaches

Knowledge Engineering
Rule based
Developed by experienced
language engineers,
linguistic resources required
Make use of human intuition
Requires only small amount
of training data
Development could be very
time consuming
Good performance
Some changes may be hard
to accommodate
Learning Systems
Use statistics or other

machine learning
Developers do not need NE
expertise
Domain independent
Requires large amounts of
annotated training data,
which may be difficult to
obtain
Some changes may require
re-annotation of the entire
training corpus
Annotators are cheap (but
you get what you pay for!)
Baseline: List Lookup Approach

System that recognizes only entities stored in
its lists (gazetteers)
Online phone directories and yellow pages for person
and organisation names (e.g. [Paskaleva02])
Locations lists: US GEOnet Names Server (GNS) data
3.9 million locations with 5.37 million names (e.g.,
[Manov03])
Automatic collection from annotated training data
Advantages: simple, fast, language
independent, easy to retarget (just create lists)
Disadvantages: impossible to enumerate all
names, collection and maintenance of lists,
cannot deal with name variants, cannot
resolve ambiguity
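A minimal sketch of the list lookup baseline, assuming tiny hand-made gazetteers and a greedy longest-match strategy (both are illustrative choices, not taken from the systems cited above):

```python
import re

# Toy gazetteers; real lists would hold thousands to millions of names.
GAZETTEERS = {
    "LOCATION": {"kentucky", "alabama", "alaska", "wisconsin", "wyoming"},
    "ORGANIZATION": {"microsoft", "free software foundation"},
}


def find_label(phrase, gazetteers):
    """Return the label of the first gazetteer that contains the phrase."""
    for label, names in gazetteers.items():
        if phrase.lower() in names:
            return label
    return None


def lookup_entities(text, gazetteers=GAZETTEERS, max_len=3):
    """Greedy longest-match lookup of token sequences in the gazetteers."""
    tokens = re.findall(r"[A-Za-z]+", text)
    found, i = [], 0
    while i < len(tokens):
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            phrase = " ".join(tokens[i:i + n])
            label = find_label(phrase, gazetteers)
            if label:
                found.append((phrase, label))
                i += n
                break
        else:
            i += 1  # no list entry starts at this token
    return found


print(lookup_entities("Abraham Lincoln was born in Kentucky."))
```

"Abraham Lincoln" is missed because it is in no list, which is exactly the enumeration problem named among the disadvantages.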
Rule-based: Shallow Parsing Approach (Internal Structure)
Internal evidence: names often have
internal structure. These components
can be either stored or guessed, e.g.
location:
Cap. Word + {City, Forest,
Center, River}
e.g. Sherwood Forest
Cap. Word + {Street, Boulevard,
Avenue, Crescent, Road}
e.g. Portobello Street
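The two location rules can be approximated with a regular expression. This is a sketch only: "Cap. Word" is assumed to mean a single capitalized alphabetic word, and the function name is hypothetical.

```python
import re

# Trigger words copied from the two rules above.
LOCATION_TRIGGERS = [
    "City", "Forest", "Center", "River",
    "Street", "Boulevard", "Avenue", "Crescent", "Road",
]

# Cap. Word + trigger word, e.g. "Sherwood Forest".
LOCATION_RULE = re.compile(
    rf"\b([A-Z][a-z]+ (?:{'|'.join(map(re.escape, LOCATION_TRIGGERS))}))\b"
)


def tag_locations(text):
    """Return (span, label) pairs for every match of the internal-structure rule."""
    return [(m.group(1), "LOCATION") for m in LOCATION_RULE.finditer(text)]


print(tag_locations("He walked from Sherwood Forest to Portobello Street."))
```

A rule this simple has no way to tell a location from any other capitalized word followed by a trigger word, so it over-generates on ambiguous names.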
Problems with the Shallow Parsing Approach
Ambiguously capitalized words (first word in sentence)
[All American Bank] vs. All [State Police]
Semantic ambiguity
"John F. Kennedy" = airport (location)
"Philip Morris" = organization
Structural ambiguity
[Cable and Wireless] vs. [Microsoft] and
[Dell];
[Center for Computational Linguistics] vs.
message from [City Hospital] for [John Smith]
Shallow Parsing Approach with Context
Use of context-based patterns is helpful
in ambiguous cases
"David Walton" and "Goldman Sachs" are
indistinguishable
But with the phrase "David Walton of
Goldman Sachs" and the Person entity
"David Walton" recognized, we can use the
pattern "[Person] of [Organization]" to
identify "Goldman Sachs" correctly
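As a rough illustration of the "[Person] of [Organization]" pattern, the sketch below takes a set of already recognized person names and labels the capitalized sequence that follows "of". The helper name and the definition of a capitalized sequence are assumptions made for this example.

```python
import re

# One or more capitalized words, e.g. "Goldman Sachs".
CAP_SEQUENCE = r"[A-Z][a-z]+(?: [A-Z][a-z]+)*"


def classify_with_context(text, known_persons):
    """Apply the context pattern '[Person] of [Organization]'."""
    found = {}
    for person in known_persons:
        pattern = re.compile(re.escape(person) + r" of (" + CAP_SEQUENCE + r")")
        for match in pattern.finditer(text):
            found[match.group(1)] = "ORGANIZATION"
    return found


sentence = "David Walton of Goldman Sachs said rates may rise."
print(classify_with_context(sentence, {"David Walton"}))
```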
Examples of Context Patterns
[PERSON] earns [MONEY]

[PERSON] joined [ORGANIZATION]
[PERSON] left [ORGANIZATION]
[PERSON] joined [ORGANIZATION] as [JOBTITLE]
[ORGANIZATION]'s [JOBTITLE] [PERSON]
[ORGANIZATION] [JOBTITLE] [PERSON]
the [ORGANIZATION] [JOBTITLE]
part of the [ORGANIZATION]
[ORGANIZATION] headquarters in [LOCATION]
price of [ORGANIZATION]
sale of [ORGANIZATION]
investors in [ORGANIZATION]
[ORGANIZATION] is worth [MONEY]
[JOBTITLE] [PERSON]
[PERSON], [JOBTITLE]
Rule-based Examples
FACILE - used in MUC-7 [Black et al 98]
ANNIE - part of GATE, Sheffield's open-source infrastructure for language
processing
Gazetteer Lists for Rule-based NE
Internal location indicators, e.g. {river,
mountain, forest} for natural locations;
{street, road, crescent, place, square, ...}
for address locations
Internal organization indicators, e.g.
company designators {GmbH, Ltd, Inc, ...}
Using Co-reference to Classify Ambiguous NEs
Improves NE results by assigning entity
type to previously unclassified names,
based on relations with classified NEs
Classification of unknown entities is very
useful for surnames which match a full
name, or abbreviations, e.g. [Bonfield]
will match [Sir Peter Bonfield];
[International Business Machines Ltd.]
will match [IBM]
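The two matching heuristics mentioned above (surname against full name, abbreviation against full name) might look like the following sketch. The function names, the designator list, and the rule that an acronym is built from capitalized initials are assumptions for illustration.

```python
# Company designators are skipped when building an acronym (assumed list).
COMPANY_DESIGNATORS = {"GmbH", "Ltd", "Ltd.", "Inc", "Inc."}


def acronym(name):
    """Initial letters of the capitalized words, ignoring company designators."""
    words = [w for w in name.split() if w not in COMPANY_DESIGNATORS]
    return "".join(w[0] for w in words if w[0].isupper())


def classify_by_coreference(unknown, classified):
    """Give an unclassified name the type of a classified NE it co-refers with."""
    for full_name, label in classified.items():
        # A bare surname matches the last word of a known person name.
        if label == "PERSON" and unknown == full_name.split()[-1]:
            return label
        # An abbreviation matches the acronym of a known organization name.
        if label == "ORGANIZATION" and unknown == acronym(full_name):
            return label
    return None


classified = {
    "Sir Peter Bonfield": "PERSON",
    "International Business Machines Ltd.": "ORGANIZATION",
}
print(classify_by_coreference("Bonfield", classified))
print(classify_by_coreference("IBM", classified))
```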
Machine Learning Approaches

ML approaches frequently break down the NE task in
two parts:
Recognizing the entity boundaries
Classifying the entities in the NE categories
Example approaches
IdentiFinder [Bikel et al 99] (Hidden Markov Models)
MENE [Borthwick et al 98], combining rule-based and ML NE
(Maximum Entropy)
NE Recognition without Gazetteers [Mikheev et al 99],
combining rule-based grammars and statistical (MaxEnt)
models
Fine-grained Classification of NEs [Fleischman 02]
Ex: Person classification into 8 sub-categories: athlete,
politician/government, clergy, businessperson,
entertainer/artist, lawyer, doctor/scientist, police
ML Approaches to Named Entity Identification
Boosting
Bootstrapping
Class-based Language Model
Conditional Markov Model
Decision Tree
Hidden Markov Model
Maximum Entropy Model
Memory-based Learning
Stacking
Support Vector Machine
Transform-based Learning
Voted Perceptron
Others
Extraction Pattern/Template Approach
Types of extraction patterns (rules)
Syntactic/semantic constraints
Delimiter-based
Combination of both
Types of documents to be extracted
Semi-structured documents: Web pages, ...
Unstructured documents: free text
IE from Free Text

The parliament was bombed by the
guerrillas.
AutoSlog [Riloff 1993]
LIEP [Huffman 1996]
PALKA [Kim & Moldovan 1995]
CRYSTAL [Soderland et al. 1995]
CRYSTAL+Webfoot [Soderland 1997]
HASTEN [Krupka 1995]
IE from Online Documents

WHISK [Soderland 1999]
A special type of regular expression
RAPIER [Califf & Mooney 1997]

Robust Automated Production of
Information Extraction Rules
SRV [Freitag 1998]

First-order logic
Wrapper Induction Systems

Wrapper: a procedure for extracting a
particular resource's content
WIEN [Kushmerick, Weld & Doorenbos, 1997]
First wrapper induction system
SoftMealy [Hsu & Dung, 1998]
Finite State Transducer (FST)
STALKER [Muslea, Minton & Knoblock, 1999]
Hierarchical information extraction
BWI [Freitag & Kushmerick 2000]
Boosting
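To make the notion of a wrapper concrete, here is a hand-written delimiter-based wrapper for a made-up page. It does not reproduce any of the systems listed above; those systems learn such delimiters from labeled pages, whereas here the delimiters, the sample page, and the function names are all assumed for illustration.

```python
def extract_between(page, left, right):
    """Return every substring of page enclosed by the left and right delimiters."""
    values, pos = [], 0
    while True:
        start = page.find(left, pos)
        if start == -1:
            break
        start += len(left)
        end = page.find(right, start)
        if end == -1:
            break
        values.append(page[start:end])
        pos = end + len(right)
    return values


def apply_wrapper(page, delimiters):
    """One (left, right) delimiter pair per attribute; rows are zipped together."""
    columns = [extract_between(page, left, right) for left, right in delimiters]
    return list(zip(*columns))


# Hypothetical semi-structured page listing country names and phone codes.
PAGE = "<li><b>Congo</b> <i>242</i></li><li><b>Egypt</b> <i>20</i></li>"
WRAPPER = [("<b>", "</b>"), ("<i>", "</i>")]
print(apply_wrapper(PAGE, WRAPPER))
```

Such a wrapper is accurate only as long as the site keeps its layout, which is why each site needs its own wrapper.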
Statistical Approach to IE
Class-based language model [Brown et
al. 1992] for Chinese NE
[COLING 2002]
PAT-tree-based [SIGIR 1997]

Long, repeated patterns without length
limitation
Space and time efficient
Incremental
http://pattree.openfoundry.org/
Class-based Language Model

Types of classes
Person, location, organization, terms in
dictionary
Number of classes = |V|+3 if the size of
vocabulary is |V|
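$$C^*, W^* = \arg\max_{C,W} P(C, W) = \arg\max_{C,W} P(W \mid C)\, P(C)$$
$$P(C) = P(c_1 \ldots c_m) \approx P(c_1 \mid \langle s \rangle)\, P(c_2 \mid c_1, \langle s \rangle) \left\{ \prod_{i=3}^{m} P(c_i \mid c_{i-2}, c_{i-1}) \right\} P(\langle /s \rangle \mid c_m, c_{m-1})$$
$$P(W \mid C) = P(w_1 \ldots w_m \mid c_1 \ldots c_m) \approx \prod_{i=1}^{m} P(w_i \mid c_i)$$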
PAT-tree-based Approach
SCP (Symmetric Conditional Probability)
Cohesion holding the words together
Low frequency n-grams tend to be discarded
CD (Context Dependency)
Dependence on the left- or right- adjacent
word/character
Low frequency n-grams can be extracted
SCPCD: a combination of the two

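Association Measure
$$SCP(w_1 \ldots w_n) = \frac{p(w_1 \ldots w_n)^2}{\frac{1}{n-1} \sum_{i=1}^{n-1} p(w_1 \ldots w_i)\, p(w_{i+1} \ldots w_n)} = \frac{freq(w_1 \ldots w_n)^2}{\frac{1}{n-1} \sum_{i=1}^{n-1} freq(w_1 \ldots w_i)\, freq(w_{i+1} \ldots w_n)}$$
$$CD(w_1 \ldots w_n) = \frac{LC(w_1 \ldots w_n)\, RC(w_1 \ldots w_n)}{freq(w_1 \ldots w_n)^2}$$
$$SCPCD(w_1 \ldots w_n) = SCP(w_1 \ldots w_n) \times CD(w_1 \ldots w_n) = \frac{LC(w_1 \ldots w_n)\, RC(w_1 \ldots w_n)}{\frac{1}{n-1} \sum_{i=1}^{n-1} freq(w_1 \ldots w_i)\, freq(w_{i+1} \ldots w_n)}$$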
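The three measures can be computed directly from n-gram counts, as in the sketch below. It uses plain dictionaries rather than a PAT tree, and it assumes that LC and RC are the numbers of distinct left- and right-adjacent words of the n-gram; the slides do not define them, so treat that reading as an assumption. The toy corpus and function names are made up.

```python
from collections import Counter, defaultdict


def ngram_stats(sentences, max_n):
    """Count n-grams and collect their distinct left and right neighbours."""
    freq, left, right = Counter(), defaultdict(set), defaultdict(set)
    for tokens in sentences:
        for n in range(1, max_n + 1):
            for i in range(len(tokens) - n + 1):
                gram = tuple(tokens[i:i + n])
                freq[gram] += 1
                if i > 0:
                    left[gram].add(tokens[i - 1])
                if i + n < len(tokens):
                    right[gram].add(tokens[i + n])
    return freq, left, right


def avg_split_product(gram, freq):
    """(1 / (n - 1)) * sum over split points of freq(prefix) * freq(suffix)."""
    n = len(gram)
    return sum(freq[gram[:i]] * freq[gram[i:]] for i in range(1, n)) / (n - 1)


def scp(gram, freq):
    return freq[gram] ** 2 / avg_split_product(gram, freq)


def cd(gram, freq, left, right):
    # LC and RC taken as counts of distinct neighbours (assumption).
    return len(left[gram]) * len(right[gram]) / freq[gram] ** 2


def scpcd(gram, freq, left, right):
    return len(left[gram]) * len(right[gram]) / avg_split_product(gram, freq)


corpus = [["a", "new", "york", "b"], ["c", "new", "york", "d"]]
freq, left, right = ngram_stats(corpus, max_n=2)
gram = ("new", "york")
print(scp(gram, freq), cd(gram, freq, left, right), scpcd(gram, freq, left, right))
```

The functions expect an n-gram of at least two words that occurs in the corpus; unigrams have no split point.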
Term Extraction Performance

Table 1. The obtained extraction accuracy including precision,
recall, and average recall-precision of auto-extracted
translation candidates using different methods.

Association Measure   Precision   Recall   Avg. R-P
CD                    68.1 %      5.9 %    37.0 %
SCP                   62.6 %      63.3 %   63.0 %
SCPCD                 79.3 %      78.2 %   78.7 %
Speed Performance
Table 2. The obtained average speed performance of different
term extraction methods.

Term Extraction Method             Time for Preprocessing   Time for Extraction
LocalMaxs (Web Queries)            0.87 s                   0.99 s
PATtree+LocalMaxs (Web Queries)    2.30 s                   0.61 s
LocalMaxs (1,367 docs)             63.47 s                  4,851.67 s
PATtree+LocalMaxs (1,367 docs)     840.90 s                 71.24 s
LocalMaxs (5,357 docs)             47,247.55 s              350,495.65 s
PATtree+LocalMaxs (5,357 docs)     11,086.67 s              759.32 s
State of the Art Performance

Person, Location, Organization, ...
F1 in high 80s or low- to mid-90s
Binary relation extraction
Contained-in (Location1, Location2)
Member-of (Person1, Organization1)
F1 in 60s (events) or 70s (facts) or 80s
(attributes)
Wrapper induction
Extremely accurate performance obtainable
Human effort (~30min) required on each
site
Broader View
Some other issues, numbered on the pipeline diagram:
Document collection -> Spider -> Filter by relevance -> Tokenize -> IE (Segment, Classify, Associate, Cluster) -> Load DB -> Database -> Query, Search
1, 2: at the Tokenize and IE steps
3: Create ontology
4: Train extraction models (from labeled training data)
5: Data mine
Applications to Information
Extraction
LiveTrans
LiveClassifier
LiveTrans: Cross-language Web Search
Web Mining Approach to Term Translation Extraction
Figure: a source query (e.g. Academia Sinica) is sent to the LiveTrans Engine, which mines anchor texts and search results from the Web and returns target translations
LiveTrans: http://livetrans.iis.sinica.edu.tw/lt.html
National Palace Museum vs.

Search-Result Page
Noises
Mixed-language characteristic in Chinese pages

How to extract translation candidates?
Which candidates to choose?
Yahoo vs. -- Anchor-Text Set
Anchor text (link text)
The descriptive text of a link on a Web page
Anchor-text set
A set of anchor texts pointing to the same page (URL)
Multilingual translations
Figure: pages from Korea, Japan, Taiwan, China and the USA point to http://www.yahoo.com with anchor texts such as "Yahoo!" and "Yahoo! America"
Anchor-text-set corpus
A collection of anchor-text sets
Term Translation Extraction from Different Resources
Figure: two resources feed the same steps. The Web is crawled by a Spider into an Anchor-Text Corpus, and a Source Query (e.g. National Palace Museum) is sent to a Search Engine to obtain Search-Result Pages. Term Extraction and Similarity Estimation over these resources yield the Target Translation.
IE Resources
Data
RISE, http://www.isi.edu/~muslea/RISE/index.html
Linguistic Data Consortium (LDC)
Penn Treebank, Named Entities, Relations, etc.
http://www.biostat.wisc.edu/~craven/ie
http://www.cs.umass.edu/~mccallum/data
Code
TextPro, http://www.ai.sri.com/~appelt/TextPro
MALLET, http://www.cs.umass.edu/~mccallum/mallet
Both
http://www.cis.upenn.edu/~adwait/penntools.html
http://www.cs.umass.edu/~mccallum/ie
Text Mining Related Workshops

Workshops in Data Mining conferences
KDD 2000, Workshop on Text Mining (TextKDD 2000)
ICDM 2001, Workshop on Text Mining (TextDM 2001)
PAKDD 2002, Workshop on Text Mining
SDM 2001-2003, 2006-2008, Workshop on Text Mining
Workshops in Machine Learning conferences
ECML 1998, Workshop on Text Mining
ICML 1999, Workshop on Machine Learning in Text Data Analysis
ICML 2002, Workshop on Text Learning (TextML 2002)
Others
RANLP 2005 Text Mining Workshop
Workshops on Machine Learning and IE

AAAI 1999, Workshop on Machine Learning for
Information Extraction
ECAI 2000, Workshop on Machine Learning for
Information Extraction
IJCAI 2001, Workshop on Adaptive Text
Extraction and Mining (ATEM 2001)
ECML 2003, Workshop on Adaptive Text
AAAI 2004, Workshop on Adaptive Text
EACL 2006, Workshop on Adaptive Text
Other Workshops
Workshop on Text Mining and Link
Analysis
TextLink 2007, in IJCAI 2007
TextLink 2003, in IJCAI 2003
Workshop on Link Analysis

LinkKDD 2003-2006, in KDD 2003-2006
AAAI 2005 workshop on Link Analysis
Applications in Digital Libraries

OAI
Unencoded Character Problem
Some Problems in Digital Libraries

Variety in objects
Museums, libraries,
Difficult to integrate metadata
Ancient characters in archives

Difficult to input, display, distribute,
OAI (Open Archives Initiative)

Starting from Oct. 1999
OAI-PMH (Open Archives Initiative
Protocol for Metadata Harvesting)
version 2.0 of 2002
Service Provider
Harvesters
Data Provider
Repositories
Metadata
Service Provider vs. Data Provider
User -> Service Provider -> Request -> Data Provider (repository)
Data Provider (repository) -> Response -> Service Provider
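The request/response exchange can be illustrated with a small harvester sketch. The repository address and the sample response below are invented for the example and no network call is made; the `ListRecords` verb, the `oai_dc` metadata prefix, and the XML namespaces are those defined by OAI-PMH 2.0 and Dublin Core.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "oai_dc": "http://www.openarchives.org/OAI/2.0/oai_dc/",
    "dc": "http://purl.org/dc/elements/1.1/",
}


def list_records_url(base_url, metadata_prefix="oai_dc"):
    """Build the HTTP GET request a service provider sends to a data provider."""
    query = urlencode({"verb": "ListRecords", "metadataPrefix": metadata_prefix})
    return base_url + "?" + query


# Hand-written sample of what a repository might return (hypothetical record).
SAMPLE_RESPONSE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header><identifier>oai:example.org:1</identifier></header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>Bronze Script rubbing</dc:title>
          <dc:creator>Unknown</dc:creator>
        </oai_dc:dc>
      </metadata>
    </record>
  </ListRecords>
</OAI-PMH>"""


def harvest_titles(xml_text):
    """Extract (identifier, title) pairs from a ListRecords response."""
    root = ET.fromstring(xml_text)
    pairs = []
    for record in root.findall("oai:ListRecords/oai:record", NS):
        identifier = record.findtext("oai:header/oai:identifier", namespaces=NS)
        title = record.findtext("oai:metadata/oai_dc:dc/dc:title", namespaces=NS)
        pairs.append((identifier, title))
    return pairs


print(list_records_url("http://example.org/oai"))
print(harvest_titles(SAMPLE_RESPONSE))
```

A real harvester would also handle resumption tokens and error responses, which are left out here to keep the sketch short.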
OAI-PMH vs. Z39.50

What is the relationship between the
OAI-PMH and other protocols such as
Z39.50?
The OAI technical framework is intentionally
simple
Providing a low barrier for participants
Easy-to-implement and easy-to-deploy alternative,
not intended to replace other approaches
Protocols such as Z39.50 have more complete

functionality
Session management, results sets, specification of
predicates that filter the records returned
An increase in difficulty of implementation and cost
Why Dublin Core?

Why does the protocol mandate a common
metadata format (and why is that common format
Dublin Core)?
Mapping among multiple metadata formats
Creating services such as common search interfaces across

heterogeneous metadata formats
A less burdensome and ultimately more deployable

solution is to require repositories to map to a simple and
common metadata format
The fifteen elements in Dublin Core as a de facto
standard for simple cross-discipline metadata
Dublin Core Metadata Element Set (DCMES)
Cooperation between the OAI and the Dublin Core
Metadata Initiative (DCMI) has led to a common XML
schema for unqualified Dublin Core that is available at
http://dublincore.org/schemas/xmls/simpledc20020312.xsd
Unencoded Character Problem

Large numbers of Chinese characters exist in
different forms such as Bronze Script
and Seal Script
People can better appreciate the long
history of the Chinese character
evolution process and the Chinese
culture in general
However, digitization of the heritage
materials brings a big problem: these
characters are not included in common
character encodings in computers (the
missing characters)
Example Unencoded Characters
Bronze Script
Seal Script
Goal
We intend to develop an integrated
technology to facilitate easier
processing of large amounts of missing
characters
This includes the input, representation,
font generation, display, distribution,
and search for all the missing
characters
We propose an effective composite
approach to handle the formation
and basic components of Chinese characters
Composite Approach to Unencoded Chinese Characters [JCDL 2005]
Advanced Topics
Mobile Search
LiveClassifier
Concept Search
Mobile Search
Introduction to Mobile Search

Existing Services
Google Mobile/SMS
Issues
References
R. Schusteritsch, S. Rao, and K. Rodden, Mobile
Search with Text Messages: Designing the User
Experience for Google SMS, Proceedings of CHI
2005, pp. 1777-1780 (poster).
B. Miller, China's Internet Portals and Content
Providers Look to a Future Beyond Mobile Text
Messaging, The Yankee Group Report, Feb.
2004.
Communications of the ACM, Vol.48, No.7,
Designing for the Mobile Devices, Jul. 2005.
Introduction to Mobile Search

Internet users (as of 2003)
US: 170 million (population: 292 million)
China: 80 million (population: 1.3 billion)
Japan: 70 million (population: 126 million)
Currently, there are more than two

billion mobile phone users worldwide,
which is more than three times the
number of PC users
More than a half-billion mobile phones
sold each year (in 2004)
Internet and Mobile Users in China
SMS-based Mobile Internet

SMS (Short Message Service)
1 billion messages worldwide every day

Nearly 200 billion messages in China in
2003
Content Explodes
News alerts
Sports news
Weather
Special interest information
Ex.: Yao Ming
From SMS to search

Two Common Modes of Mobile Search
Mobile Web browsing
Text Messaging (SMS)
Google Mobile (1/2)
(XHTML)
http://mobile.google.com/
(WML)
Google Mobile (2/2)
(Images)
(Mobile Web)
Google SMS
http://www.google.com/sms/
Google Maps for Mobile
http://www.google.com/gmm/
Existing Mobile Search Services

Google (46645)
SMS short code
Google Mobile, http://mobile.google.com/

Google SMS, http://www.google.com/sms/
Google Maps for Mobile, http://www.google.com/gmm/
Yahoo (92466)
Yahoo! Mobile, http://mobile.yahoo.com/

Yahoo! Go, http://go.yahoo.com
4INFO (44636)
AOL
AOL Mobile Search, http://mobile.aolsearch.com/
MSN
MSN Mobile, http://mobile.msn.com/
Synfonic, UpSNAP, ...
Google SMS
Specialized information
Business listings
Residential listings
Product prices
Dictionary definitions
Area codes
Zip codes
Google SMS attempts to return the desired

information directly, rather than returning
hyperlinks
Design Constraints
Conceptual Model
1-to-1 communication vs. mobile search

Abbreviation interpretation
Inherent Limitations of Mobile Devices and

SMS
Text entry is slow

Charge per message sent
Small, low-resolution screens
SMS message size limitation: 160 characters
No guarantee for the receiving order
SMS interface: text-only, one-dimensional with no
menus, forms, or buttons to help users understand
its affordances
Not possible for the system to offer instructions or a
prompt
To get any feedback, the user must wait a new
Addressing Users' Existing Conceptual Models
Most users had some initial problems
understanding how SMS could be used
for search
Users' existing conceptual model of
Google searching also caused some
initial problems
Changes to message interpretation
froogle, shopping, yellow pages,
white pages, dictionary combined with
help, tips, instructions
price or prices vs. product search
Communicating affordances
A minimal and prominent sequence of
instructions and the features that are
considered most useful are shown on the
Google SMS home page
Sending the message "help" should return a
concise set of instructions on how to use it
Collaborate with PR team to mention the
"help" command and highlight the most
important features
Work with Marketing team to develop a
wallet-size instruction card
Addressing the Limitations of Mobile Devices and SMS
Order of messages
1of2
Limited input technology

Query refinement is more difficult (cofee)
Immediately returning search results for the
closest match
Limited output technology

No results, help
No more than 3 messages in response to
each help request
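The ordering and size constraints can be illustrated by a sketch that splits a long response into numbered SMS messages. The "1of2" marker and the 160-character limit come from the slides; applying the three-message cap to every response, cutting at fixed character positions, and the function name are assumptions made for this example.

```python
SMS_LIMIT = 160      # maximum characters per SMS message
MAX_MESSAGES = 3     # cap on messages per response (assumed to apply generally)


def split_response(text, limit=SMS_LIMIT, max_messages=MAX_MESSAGES):
    """Split text into at most max_messages messages prefixed '1of2 ', '2of2 ', ..."""
    if len(text) <= limit:
        return [text]
    # Prefixes such as "1of3 " are 5 characters long while there are fewer than 10 parts.
    room = limit - len("1of3 ")
    chunks = [text[i:i + room] for i in range(0, len(text), room)][:max_messages]
    total = len(chunks)
    return [f"{k}of{total} {chunk}" for k, chunk in enumerate(chunks, start=1)]


for message in split_response("x" * 300):
    print(len(message), message[:10])
```

The numbered prefix lets the user reassemble messages that arrive out of order; anything beyond the cap is dropped, so a real service would have to shorten the content first.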
Issues in Mobile Search

User interface for input
Mobile search result output
Context-aware, location-based
service
User preference
To allow users to get updated or additional information

with fewer keystrokes
To transform the search experience found on a PC to a
mobile device
Tiny screens, network bandwidth,
Wireless Application Protocol (WAP) to shrink Web pages
down to more manageable sizes
Google WML
Alternatively, SMS
To provide users who are not working with a mouse or a

keyboard with a simple way to enter queries
Shortcuts (for example, w for weather)
To integrate the search vendor's local and mobile search
functions
To integrate commerce and mobile search applications
Possible Related Directions

Context-aware Retrieval
Ubiquitous Computing
Communications of the ACM, Vol.48,
No.3, The Disappearing Computer,
Mar. 2005.
Communications of the ACM, Vol.45,
No.12, Issues and Challenges in
Ubiquitous Computing, Dec. 2002.
LiveClassifier
A system that creates classifiers through Web mining
Users create topic hierarchies and define classes/keywords
Exploiting the structure information inherent in the Web for training
Auto-extracted training data; no manually-labeled data provided
People, Place, Subjects, Sub-subjects
Classifying documents into classes
Classifying short texts into classes
Concept Search
Conventional search
Keyword search for researcher and AI and Taiwan
Concept-level search
Interesting document: professor, neural network, NTU (matching the concepts researcher, AI, Taiwan)
LiveTrans + LiveClassifier
Thanks for Your Attention!
