Text Retrieval and Applications

More Advanced Topics


J. H. Wang
May 20, 2008

Outline
Text Mining and Information Extraction
  Introduction to Text Mining
  Methods of Information Extraction
  Applications to Information Extraction
Applications in Digital Libraries
  OAI
  Unencoded character problem
Advanced Topics
  Mobile Search
  LiveClassifier

Text Mining and Information Extraction
Introduction to Text Mining
Methods of Information Extraction
Applications to Information Extraction

References
Marti Hearst, What is Text Mining, http://www.sims.berkeley.edu/~hearst/text-mining.html
Marti Hearst, Untangling Text Data Mining, ACL 1999.
Douglas E. Appelt and David J. Israel, Introduction to Information Extraction Technology, IJCAI 1999 tutorial.
Ion Muslea, Extraction Patterns for Information Extraction Tasks: A Survey, AAAI 1999 Workshop on Machine Learning for Information Extraction.
Andrew McCallum and William Cohen, Information Extraction from the World Wide Web, KDD 2003 tutorial (an earlier version appeared as a NIPS 2002 tutorial).
Hamish Cunningham, Named Entity Recognition, RANLP 2003 tutorial.

Text Mining
Text mining is the discovery by computer of new, previously unknown information, by automatically extracting information from different written resources [Marti Hearst]

Text Mining vs. Web Search
In search, the user is typically looking for something that is already known and has been written by someone else
In text mining, the goal is to discover heretofore unknown information, something that no one yet knows and so could not have yet written down

Text Mining vs. Data Mining
Data mining tries to find interesting (non-trivial, implicit, previously unknown, potentially useful) patterns from large databases
In text mining, the patterns are extracted from natural language text rather than from structured databases of facts

Text Mining vs. Computational Linguistics (or NLP)
NLP is making a lot of progress in doing small subtasks in text analysis
  Word segmentation, part-of-speech tagging, word sense disambiguation, ...
Text understanding vs. mining

Text Mining vs. Information Extraction
There are programs that can, with reasonable accuracy, extract information from text with somewhat regularized structure
Discovering new knowledge vs. showing trends

What is Information Extraction
As a task: filling slots in a database from sub-segments of text.

October 14, 2002, 4:00 a.m. PT

For years, Microsoft Corporation CEO Bill Gates railed against the economic philosophy of open-source software with Orwellian fervor, denouncing its communal licensing as a "cancer" that stifled technological innovation.

Today, Microsoft claims to "love" the open-source concept, by which software code is made public to encourage improvement and development by outside programmers. Gates himself says Microsoft will gladly disclose its crown jewels--the coveted code behind the Windows operating system--to select customers.

"We can be open source. We love the concept of shared source," said Bill Veghte, a Microsoft VP. "That's a super-important shift for us in terms of code access."

Richard Stallman, founder of the Free Software Foundation, countered saying

IE output:

NAME              TITLE    ORGANIZATION
Bill Gates        CEO      Microsoft
Bill Veghte       VP       Microsoft
Richard Stallman  founder  Free Soft..

What is Information Extraction
As a family of techniques:
Information Extraction = segmentation + classification + association + clustering

Mentions found in the article above (aka named entity extraction):
  Microsoft Corporation
  CEO
  Bill Gates
  Microsoft
  Gates
  Microsoft
  Bill Veghte
  Microsoft
  VP
  Richard Stallman
  founder
  Free Software Foundation

What is Information Extraction
As a family of techniques:
Information Extraction = segmentation + classification + association + clustering

Clustering groups mentions of the same entity (marked *):
  * Microsoft Corporation
  CEO
  Bill Gates
  * Microsoft
  Gates
  * Microsoft
  Bill Veghte
  * Microsoft
  VP
  Richard Stallman
  founder
  Free Software Foundation

Resulting database:

NAME              TITLE    ORGANIZATION
Bill Gates        CEO      Microsoft
Bill Veghte       VP       Microsoft
Richard Stallman  founder  Free Soft..
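The clustering step can be sketched in a few lines of Python. This is only an illustration under a strong assumption that the slides do not state: two mentions are taken to refer to the same entity when the word set of one is contained in the word set of the other. The function name `cluster_mentions` is introduced here for the example.

```python
def cluster_mentions(mentions):
    """Group mentions whose word sets are in a subset relation (naive heuristic)."""
    clusters = []
    for mention in mentions:
        words = set(mention.split())
        for cluster in clusters:
            # Join the first cluster containing a mention that subsumes,
            # or is subsumed by, the current mention.
            if any(words <= set(other.split()) or set(other.split()) <= words
                   for other in cluster):
                cluster.append(mention)
                break
        else:
            clusters.append([mention])
    return clusters


mentions = [
    "Microsoft Corporation", "Bill Gates", "Microsoft", "Gates",
    "Microsoft", "Bill Veghte", "Microsoft", "Richard Stallman",
    "Free Software Foundation",
]
for cluster in cluster_mentions(mentions):
    print(cluster)
```

On the article's mentions this heuristic puts "Microsoft Corporation" and every "Microsoft" in one cluster and "Bill Gates" with "Gates", while "Bill Veghte" stays separate. It would fail on unrelated names that share a word, which is why real systems use richer evidence.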

IE in Context
[Figure: the IE pipeline in context. Main flow: Document collection, Spider, Filter by relevance, IE (Segment, Classify, Associate, Cluster), Load DB, Database, then Query, Search and Data mine. Supporting steps: Create ontology, Label training data, Train extraction models.]

Information Extraction Tasks
Information Extraction (IE) pulls facts and structured information from the content of large text collections
  Unstructured or semi-structured text -> structured data
Federal government funded research
  MUC: Message Understanding Conferences (1987-1998) by DARPA
  TIPSTER (1991-1998) by DARPA
  ACE: Automatic Content Extraction (1999-) by NIST
    http://www.nist.gov/speech/tests/ace/
    http://www.ldc.upenn.edu/Projects/ACE/

Landscape of IE Techniques: Models
[Figure: six model families, each illustrated on the sentence "Abraham Lincoln was born in Kentucky."
  Lexicons: is the candidate a member of a list (Alabama, Alaska, ..., Wisconsin, Wyoming)?
  Classify Pre-segmented Candidates: a classifier decides which class each candidate belongs to
  Sliding Window: a classifier decides which class each window belongs to; try alternate window sizes
  Boundary Models: classifiers mark BEGIN and END boundaries
  Finite State Machines: most likely state sequence?
  Context Free Grammars: most likely parse? (node labels NNP, NP, PP, VP, S)]
Any of these models can be used to capture words, formatting or both.

Methods of Information Extraction
Three Approaches
  NLP (Linguistic) approach
    Named entity recognition
  Extraction pattern/template approach
    Wrapper generation/induction
  Statistical approach
    Class-based language model
    PAT-tree-based

NLP (Linguistic) Approach to IE


MUC-7 tasks
Named entity recognition
Preprocessing
Two kinds of approaches
Baseline
Rule-based approach
Learning-based approach

MUC-7 Tasks
NE: Named Entity recognition and typing
CO: Co-reference resolution
TE: Template Elements (attributes)
TR: Template Relations
ST: Scenario Templates

An Example
The shiny red rocket was fired on Tuesday. It is the brainchild of Dr. Big Head. Dr. Head is a staff scientist at We Build Rockets Inc.

NE: entities are "rocket", "Tuesday", "Dr. Head" and "We Build Rockets"
CO: "it" refers to the rocket; "Dr. Head" and "Dr. Big Head" are the same
TE: the rocket is "shiny red" and Head's "brainchild".
TR: Dr. Head works for We Build Rockets Inc.
ST: a rocket launching event occurred with the

Performance Levels
Vary according to text type, domain, scenario, language
NE: up to 97% (tested in English, Spanish, Japanese, Chinese)
CO: 60-70% resolution
TE: 80%
TR: 75-80%
ST: 60% (but human level may be only 80%)

What are Named Entities?
NER involves identification of proper names in texts, and classification into a set of predefined categories of interest
  Person names
  Organizations (companies, government organizations, committees, etc)
  Locations (cities, countries, rivers, etc)
  Date and time expressions

What are Named Entities (2)
Other common types: measures (percent, money, weight etc), email addresses, Web addresses, street addresses, etc.
Some domain-specific entities: names of drugs, medical conditions, names of ships, bibliographic references etc.
MUC-7 entity definition guidelines [Chinchor97]
  http://www.itl.nist.gov/iaui/894.02/related_projects/muc/proceedings/ne_task.html

Problems in NE
Variation of NEs, e.g. John Smith, Mr Smith, John.
Ambiguity of NE types
  John Smith (company vs. person)
  May (person vs. month)
  Washington (person vs. location)
  1945 (date vs. time)
Ambiguity with common words, e.g. "may"
More complex problems in NE
  Issues of style, structure, domain, genre etc.
  Punctuation, spelling, spacing, formatting, ...

The Evaluation Metric
Precision vs. recall
F-measure [van Rijsbergen 75]:
$$F_\beta = \frac{(\beta^2 + 1)\, P R}{\beta^2 P + R}$$
  $\beta$ reflects the weighting between precision and recall, typically $\beta = 1$
We may also want to take account of partially correct answers:
  Precision = (Correct + Partially correct) / (Correct + Incorrect + Partial)
  Recall = (Correct + Partially correct) / (Correct + Missing + Partial)
  Why: NE boundaries are often misplaced, so some results are only partially correct
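As a worked illustration, the metrics can be computed as follows. This is a sketch, assuming the standard form F = (β² + 1)PR / (β²P + R). The `partial_weight` parameter is an addition for this example: its default of 1.0 follows the formulas as written on the slide, while some scoring schemes give partial matches only half credit (0.5).

```python
def precision_recall(correct, partial, incorrect, missing, partial_weight=1.0):
    """Precision and recall with credit for partially correct answers."""
    matched = correct + partial_weight * partial
    precision = matched / (correct + incorrect + partial)
    recall = matched / (correct + missing + partial)
    return precision, recall


def f_measure(precision, recall, beta=1.0):
    """F-measure; beta = 1 weights precision and recall equally."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta * beta
    return (b2 + 1) * precision * recall / (b2 * precision + recall)


# Hypothetical counts: 6 correct, 2 partial, 2 incorrect, 4 missing.
p, r = precision_recall(correct=6, partial=2, incorrect=2, missing=4)
print(round(p, 3), round(r, 3), round(f_measure(p, r), 3))
```

With these counts precision is 8/10 and recall is 8/12, so the balanced F-measure is about 0.727.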

Pre-processing for NE Recognition
Tokenization
  Word segmentation
Lexical or morphological processing
  Part of speech tagging
  Word sense tagging
Syntactic analysis
  Parsing
Domain-specific module
  Coreference
  Merging partial results

Two kinds of NE approaches
Knowledge Engineering
  Rule based
  Developed by experienced language engineers, linguistic resources required
  Make use of human intuition
  Requires only small amount of training data
  Development could be very time consuming
  Good performance
  Some changes may be hard to accommodate
Learning Systems
  Use statistics or other machine learning
  Developers do not need NE expertise
  Domain independent
  Requires large amounts of annotated training data, which may be difficult to obtain
  Some changes may require re-annotation of the entire training corpus
  Annotators are cheap (but you get what you pay for!)

Baseline: List Lookup Approach
System that recognizes only entities stored in its lists (gazetteers)
  Online phone directories and yellow pages for person and organisation names (e.g. [Paskaleva02])
  Locations lists: US GEOnet Names Server (GNS) data, 3.9 million locations with 5.37 million names (e.g., [Manov03])
  Automatic collection from annotated training data
Advantages - simple, fast, language independent, easy to retarget (just create lists)
Disadvantages - impossible to enumerate all names, collection and maintenance of lists, cannot deal with name variants, cannot resolve ambiguity
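A list-lookup baseline can be sketched as a longest-match search over token n-grams. The tiny gazetteer, the `max_len` limit and the function name below are assumptions made for illustration, not part of any cited system.

```python
def gazetteer_lookup(tokens, gazetteer, max_len=4):
    """Return (start, end, type) spans found by longest-match list lookup."""
    spans = []
    i = 0
    while i < len(tokens):
        # Try the longest candidate first so "New York" beats "New".
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            candidate = " ".join(tokens[i:i + n])
            if candidate in gazetteer:
                spans.append((i, i + n, gazetteer[candidate]))
                i += n
                break
        else:
            i += 1  # no entry starts at this token
    return spans


# Hypothetical gazetteer entries.
gazetteer = {
    "Abraham Lincoln": "PERSON",
    "Kentucky": "LOCATION",
    "New York": "LOCATION",
}
tokens = "Abraham Lincoln was born in Kentucky .".split()
print(gazetteer_lookup(tokens, gazetteer))
```

The sketch shows the listed disadvantages directly: a variant such as "Mr Lincoln" is not found, and an ambiguous entry always receives the single type stored in the list.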

Rule-based: Shallow Parsing Approach (Internal Structure)
Internal evidence - names often have internal structure. These components can be either stored or guessed, e.g. location:
  Cap. Word + {City, Forest, Center, River}
    e.g. Sherwood Forest
  Cap. Word + {Street, Boulevard, Avenue, Crescent, Road}
    e.g. Portobello Street
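The two location rules can be written as a single regular expression. This is a minimal sketch that treats "Cap. Word" as one capitalized ASCII word; the names `LOCATION_RE` and `find_locations` are introduced here for illustration.

```python
import re

NATURAL = ("City", "Forest", "Center", "River")
ADDRESS = ("Street", "Boulevard", "Avenue", "Crescent", "Road")

# One capitalized word followed by one of the location indicator words.
LOCATION_RE = re.compile(
    r"\b[A-Z][a-z]+ (?:%s)\b" % "|".join(NATURAL + ADDRESS)
)


def find_locations(text):
    """Return all substrings that match the internal-structure rules."""
    return LOCATION_RE.findall(text)


print(find_locations("We walked from Sherwood Forest to Portobello Street."))
```

A rule this shallow also fires on any capitalized word that happens to precede an indicator, for example at the start of a sentence.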

Problems with the Shallow Parsing Approach
Ambiguously capitalized words (first word in sentence)
  [All American Bank] vs. All [State Police]
Semantic ambiguity
  "John F. Kennedy" = airport (location)
  "Philip Morris" = organization
Structural ambiguity
  [Cable and Wireless] vs. [Microsoft] and [Dell]
  [Center for Computational Linguistics] vs. message from [City Hospital] for [John Smith]

Shallow Parsing Approach with Context
Use of context-based patterns is helpful in ambiguous cases
  "David Walton" and "Goldman Sachs" are indistinguishable
  But with the phrase "David Walton of Goldman Sachs" and the Person entity "David Walton" recognized, we can use the pattern "[Person] of [Organization]" to identify "Goldman Sachs" correctly

Examples of Context Patterns
[PERSON] earns [MONEY]
[PERSON] joined [ORGANIZATION]
[PERSON] left [ORGANIZATION]
[PERSON] joined [ORGANIZATION] as [JOBTITLE]
[ORGANIZATION]'s [JOBTITLE] [PERSON]
[ORGANIZATION] [JOBTITLE] [PERSON]
the [ORGANIZATION] [JOBTITLE]
part of the [ORGANIZATION]
[ORGANIZATION] headquarters in [LOCATION]
price of [ORGANIZATION]
sale of [ORGANIZATION]
investors in [ORGANIZATION]
[ORGANIZATION] is worth [MONEY]
[JOBTITLE] [PERSON]
[PERSON], [JOBTITLE]
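A context pattern such as "[PERSON] of [ORGANIZATION]" can be applied once one side is already recognized. The sketch below assumes the person names are known and approximates a candidate name as a run of capitalized words; `classify_with_context` is a hypothetical helper, not part of any cited system.

```python
import re

# A run of one or more capitalized words, used as a candidate name.
CAP_SEQ = r"[A-Z][a-z]+(?: [A-Z][a-z]+)*"


def classify_with_context(text, known_persons):
    """Label the capitalized phrase after '<known person> of' as ORGANIZATION."""
    found = {}
    for person in known_persons:
        pattern = re.compile(re.escape(person) + r" of (" + CAP_SEQ + r")")
        for match in pattern.finditer(text):
            found[match.group(1)] = "ORGANIZATION"
    return found


text = "David Walton of Goldman Sachs said rates may rise."
print(classify_with_context(text, ["David Walton"]))
```

The other patterns in the list could be handled the same way, with one regular expression per pattern and the already recognized slot filled in literally.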

Rule-based Examples
FACILE - used in MUC-7 [Black et al 98]
ANNIE - part of GATE, Sheffield's open-source infrastructure for language processing
Gazetteer Lists for Rule-based NE
  Internal location indicators, e.g., {river, mountain, forest} for natural locations; {street, road, crescent, place, square, ...} for address locations
  Internal organization indicators, e.g., company designators {GmbH, Ltd, Inc, ...}

Using Co-reference to Classify Ambiguous NEs
Improves NE results by assigning entity type to previously unclassified names, based on relations with classified NEs
Classification of unknown entities very useful for surnames which match a full name, or abbreviations, e.g. [Bonfield] will match [Sir Peter Bonfield]; [International Business Machines Ltd.] will match [IBM]
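The two matching cases above (a surname matching a full name, and an abbreviation matching a long name) can be sketched as follows. The matching rules and the short list of company designators are simplifying assumptions for illustration only.

```python
DESIGNATORS = {"Ltd.", "Ltd", "Inc.", "Inc", "GmbH"}


def acronym(name):
    """Initial letters of the capitalized words, skipping company designators."""
    return "".join(
        word[0] for word in name.split()
        if word not in DESIGNATORS and word[0].isupper()
    )


def corefers(unknown, known):
    """True if `unknown` is one word of `known` or the acronym of `known`."""
    return unknown in known.split() or unknown == acronym(known)


def propagate_types(unknown_names, classified):
    """Copy the entity type of a classified NE to matching unclassified names."""
    result = {}
    for name in unknown_names:
        for full_name, entity_type in classified.items():
            if corefers(name, full_name):
                result[name] = entity_type
                break
    return result


classified = {
    "Sir Peter Bonfield": "PERSON",
    "International Business Machines Ltd.": "ORGANIZATION",
}
print(propagate_types(["Bonfield", "IBM", "Acme"], classified))
```

Names with no classified counterpart, such as the made-up "Acme" here, simply stay unclassified.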

Machine Learning Approaches
ML approaches frequently break down the NE task into two parts:
  Recognizing the entity boundaries
  Classifying the entities in the NE categories
Example approaches
  IdentiFinder [Bikel et al 99] (Hidden Markov Models)
  MENE [Borthwick et al 98], combining rule-based and ML NE (Maximum Entropy)
  NE Recognition without Gazetteers [Mikheev et al 99], combining rule-based grammars and statistical (MaxEnt) models
  Fine-grained Classification of NEs [Fleischman 02]
    Ex: Person classification into 8 sub-categories: athlete, politician/government, clergy, businessperson, entertainer/artist, lawyer, doctor/scientist, police

ML Approaches to Named Entity Identification
Boosting
Bootstrapping
Class-based Language Model
Conditional Markov Model
Decision Tree
Hidden Markov Model
Maximum Entropy Model
Memory-based Learning
Stacking
Support Vector Machine
Transformation-based Learning
Voted Perceptron
Others

Extraction Pattern/Template Approach
Types of extraction patterns (rules)
  Syntactic/semantic constraints
  Delimiter-based
  Combination of both
Types of documents to be extracted
  Semi-structured documents: Web pages, ...
  Unstructured documents: free text

IE from Free Text
Example sentence: The parliament was bombed by the guerrillas.
AutoSlog [Riloff 1993]
LIEP [Huffman 1996]
PALKA [Kim & Moldovan 1995]
CRYSTAL [Soderland et al. 1995]
CRYSTAL+Webfoot [Soderland 1997]
HASTEN [Krupka 1995]

IE from Online Documents
WHISK [Soderland 1999]
  A special type of regular expression
RAPIER [Califf & Mooney 1997]
  Robust Automated Production of Information Extraction Rules
SRV [Freitag 1998]
  First-order logic

Wrapper Induction Systems
Wrapper: a procedure for extracting a particular resource's content
WIEN [Kushmerick, Weld & Doorenbos, 1997]
  First wrapper induction system
SoftMealy [Hsu & Dung, 1998]
  Finite State Transducer (FST)
STALKER [Muslea, Minton & Knoblock, 1999]
  Hierarchical information extraction
BWI [Freitag & Kushmerick 2000]
  Boosting
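To make "wrapper" concrete, here is a sketch of a delimiter-based wrapper in the spirit of the left-right wrappers studied in the wrapper induction literature. The delimiters are written by hand here, whereas an induction system would learn them from labeled pages. The page fragment and the function name are invented for the example.

```python
def lr_extract(page, delimiters):
    """Apply a left-right wrapper: one (left, right) delimiter pair per attribute."""
    rows = []
    pos = 0
    while True:
        row = []
        for left, right in delimiters:
            start = page.find(left, pos)
            if start == -1:
                return rows  # no further tuples on the page
            start += len(left)
            end = page.find(right, start)
            if end == -1:
                return rows
            row.append(page[start:end])
            pos = end + len(right)
        rows.append(tuple(row))


# Hypothetical page listing countries and their phone codes.
page = "<li><b>Congo</b> <i>242</i></li><li><b>Egypt</b> <i>20</i></li>"
print(lr_extract(page, [("<b>", "</b>"), ("<i>", "</i>")]))
```

The wrapper is accurate only as long as the site keeps its formatting, which is one reason per-site human effort is still needed.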

Statistical Approach to IE
Class-based language model [Brown et al. 1992] for Chinese NE [COLING 2002]
PAT-tree-based [SIGIR 1997]
  Long, repeated patterns without length limitation
  Space and time efficient
  Incremental
  http://pattree.openfoundry.org/

Class-based Language Model
Types of classes
  Person, location, organization, terms in dictionary
  Number of classes = |V|+3 if the size of vocabulary is |V|

$$(C^*, W^*) = \arg\max_{C,W} P(C, W) = \arg\max_{C,W} P(W \mid C)\, P(C)$$

$$P(C) = P(c_1 \ldots c_m) \approx P(c_1 \mid \langle s \rangle)\, P(c_2 \mid \langle s \rangle, c_1) \left\{ \prod_{i=3}^{m} P(c_i \mid c_{i-2}, c_{i-1}) \right\} P(\langle /s \rangle \mid c_{m-1}, c_m)$$

$$P(W \mid C) = P(w_1 \ldots w_m \mid c_1 \ldots c_m) \approx \prod_{i=1}^{m} P(w_i \mid c_i)$$

PAT-tree-based Approach
SCP (Symmetric Conditional Probability)
  Cohesion holding the words together
  Low frequency n-grams tend to be discarded
CD (Context Dependency)
  Dependence on the left- or right-adjacent word/character
  Low frequency n-grams can be extracted
SCPCD: a combination of the two

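Association Measure

$$SCP(w_1 \ldots w_n) = \frac{p(w_1 \ldots w_n)^2}{\frac{1}{n-1} \sum_{i=1}^{n-1} p(w_1 \ldots w_i)\, p(w_{i+1} \ldots w_n)} = \frac{freq(w_1 \ldots w_n)^2}{\frac{1}{n-1} \sum_{i=1}^{n-1} freq(w_1 \ldots w_i)\, freq(w_{i+1} \ldots w_n)}$$

$$CD(w_1 \ldots w_n) = \frac{LC(w_1 \ldots w_n)\, RC(w_1 \ldots w_n)}{freq(w_1 \ldots w_n)^2}$$

$$SCPCD(w_1 \ldots w_n) = SCP(w_1 \ldots w_n) \cdot CD(w_1 \ldots w_n) = \frac{LC(w_1 \ldots w_n)\, RC(w_1 \ldots w_n)}{\frac{1}{n-1} \sum_{i=1}^{n-1} freq(w_1 \ldots w_i)\, freq(w_{i+1} \ldots w_n)}$$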
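The measures can be computed directly from n-gram counts. The sketch below makes two assumptions that the slides do not spell out: LC and RC are taken to be the numbers of distinct left and right neighbours of the n-gram, and counts come from a plain token list rather than a PAT tree. All function names are introduced here.

```python
from collections import Counter


def ngram_counts(tokens, max_n):
    """Frequencies of all n-grams up to length max_n, keyed by tuple."""
    counts = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts


def avg_split_product(ngram, freq):
    """Average of freq(prefix) * freq(suffix) over the n-1 split points (n >= 2)."""
    n = len(ngram)
    return sum(freq[ngram[:i]] * freq[ngram[i:]] for i in range(1, n)) / (n - 1)


def scp(ngram, freq):
    return freq[ngram] ** 2 / avg_split_product(ngram, freq)


def context_counts(ngram, tokens):
    """Assumed LC/RC: number of distinct left and right adjacent tokens."""
    n = len(ngram)
    left, right = set(), set()
    for i in range(len(tokens) - n + 1):
        if tuple(tokens[i:i + n]) == ngram:
            if i > 0:
                left.add(tokens[i - 1])
            if i + n < len(tokens):
                right.add(tokens[i + n])
    return len(left), len(right)


def scpcd(ngram, freq, tokens):
    lc, rc = context_counts(ngram, tokens)
    return lc * rc / avg_split_product(ngram, freq)


tokens = "new york is big and new york is old".split()
freq = ngram_counts(tokens, max_n=2)
print(scp(("new", "york"), freq), scp(("is", "big"), freq))
print(scpcd(("new", "york"), freq, tokens))
```

In this toy text "new york" always occurs as a unit, so its SCP is 1.0, while "is big" scores 0.5 because "is" also occurs in other contexts.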

Term Extraction Performance
Table 1. The obtained extraction accuracy including precision, recall, and average recall-precision of auto-extracted translation candidates using different methods.

Association Measure   Precision   Recall    Avg. R-P
CD                    68.1 %      5.9 %     37.0 %
SCP                   62.6 %      63.3 %    63.0 %
SCPCD                 79.3 %      78.2 %    78.7 %

Speed Performance
Table 2. The obtained average speed performance of different term extraction methods.

Term Extraction Method             Time for Preprocessing   Time for Extraction
LocalMaxs (Web Queries)            0.87 s                   0.99 s
PATtree+LocalMaxs (Web Queries)    2.30 s                   0.61 s
LocalMaxs (1,367 docs)             63.47 s                  4,851.67 s
PATtree+LocalMaxs (1,367 docs)     840.90 s                 71.24 s
LocalMaxs (5,357 docs)             47,247.55 s              350,495.65 s
PATtree+LocalMaxs (5,357 docs)     11,086.67 s              759.32 s

State of the Art Performance
Named entity recognition
  Person, Location, Organization, ...
  F1 in high 80s or low- to mid-90s
Binary relation extraction
  Contained-in (Location1, Location2)
  Member-of (Person1, Organization1)
  F1 in 60s (events) or 70s (facts) or 80s (attributes)
Wrapper induction
  Extremely accurate performance obtainable
  Human effort (~30min) required on each site

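Broader View
[Figure: the pipeline from "IE in Context" (Document collection, Spider, Filter by relevance, Tokenize, IE with Segment, Classify, Associate and Cluster, Load DB, Database, Query, Search) annotated with some other issues, numbered 1 to 5. Recoverable labels: 3 Create ontology, 4 Train extraction models (from labeled training data), 5 Data mine; markers 1 and 2 appear next to Tokenize and the IE block.]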

Applications to Information Extraction
LiveTrans
LiveClassifier

LiveTrans: Cross-language Web Search

Web Mining Approach to Term Translation Extraction
[Figure: a source query (example: Academia Sinica) is sent to the LiveTrans engine, which uses anchor texts and search results from the Web to produce target translations.]
LiveTrans: http://livetrans.iis.sinica.edu.tw/lt.html

National Palace Museum vs. -- Search-Result Page
Noise
Mixed-language characteristic in Chinese pages
How to extract translation candidates?
Which candidates to choose?

Yahoo vs. -- Anchor-Text Set
Anchor text (link text)
  The descriptive text of a link on a Web page
Anchor-text set
  A set of anchor texts pointing to the same page (URL)
  Multilingual translations
    Yahoo/ /
    America/ /
Anchor-text-set corpus
  A collection of anchor-text sets
[Figure: pages from Korea, Japan, Taiwan and China link to http://www.yahoo.com; recoverable anchor texts include "Yahoo", "Yahoo!", "Yahoo! America", "Yahoo Search Engine" and one ending in "-USA".]
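The anchor-text set and the anchor-text-set corpus are easy to picture as a data structure. The following sketch groups (anchor text, URL) pairs by target URL; the link data and the function name are hypothetical, and a real corpus would be built by a Web spider.

```python
from collections import defaultdict


def build_anchor_text_sets(links):
    """Map each target URL to the set of anchor texts that point to it."""
    sets = defaultdict(set)
    for anchor_text, url in links:
        sets[url].add(anchor_text)
    return dict(sets)


# Hypothetical crawl output: (anchor text, target URL) pairs.
links = [
    ("Yahoo", "http://www.yahoo.com"),
    ("Yahoo! America", "http://www.yahoo.com"),
    ("Yahoo Search Engine", "http://www.yahoo.com"),
    ("Yahoo", "http://www.yahoo.com"),
]
corpus = build_anchor_text_sets(links)
print(sorted(corpus["http://www.yahoo.com"]))
```

When the pointing pages are written in different languages, one set naturally contains the same name in several languages, and these co-occurring names are the translation candidates.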

Term Translation Extraction from Different Resources
[Figure: a Web spider builds the anchor-text corpus, and a search engine returns search-result pages for the source query (example: National Palace Museum); term extraction and similarity estimation over these resources produce the target translation.]

IE Resources
Data
  RISE, http://www.isi.edu/~muslea/RISE/index.html
  Linguistic Data Consortium (LDC)
    Penn Treebank, Named Entities, Relations, etc.
  http://www.biostat.wisc.edu/~craven/ie
  http://www.cs.umass.edu/~mccallum/data
Code
  TextPro, http://www.ai.sri.com/~appelt/TextPro
  MALLET, http://www.cs.umass.edu/~mccallum/mallet
Both
  http://www.cis.upenn.edu/~adwait/penntools.html
  http://www.cs.umass.edu/~mccallum/ie

Text Mining Related Workshops
Workshops in Data Mining conferences
  KDD 2000, Workshop on Text Mining (TextKDD 2000)
  ICDM 2001, Workshop on Text Mining (TextDM 2001)
  PAKDD 2002, Workshop on Text Mining
  SDM 2001-2003, 2006-2008, Workshop on Text Mining
Workshops in Machine Learning conferences
  ECML 1998, Workshop on Text Mining
  ICML 1999, Workshop on Machine Learning in Text Data Analysis
  ICML 2002, Workshop on Text Learning (TextML 2002)
Others
  RANLP 2005 Text Mining Workshop

Workshops on Machine Learning and IE
AAAI 1999, Workshop on Machine Learning for Information Extraction
ECAI 2000, Workshop on Machine Learning for Information Extraction
IJCAI 2001, Workshop on Adaptive Text Extraction and Mining (ATEM 2001)
ECML 2003, Workshop on Adaptive Text Extraction and Mining (ATEM 2003)
AAAI 2004, Workshop on Adaptive Text Extraction and Mining (ATEM 2004)
EACL 2006, Workshop on Adaptive Text Extraction and Mining (ATEM 2006)

Other Workshops
Workshop on Text Mining and Link Analysis
  TextLink 2007, in IJCAI 2007
  TextLink 2003, in IJCAI 2003
Workshop on Link Analysis
  LinkKDD 2003-2006, in KDD 2003-2006
  AAAI 2005 workshop on Link Analysis

Applications in Digital Libraries
OAI
Unencoded Character Problem

Some Problems in Digital Libraries
Variety in objects
  Museums, libraries, ...
  Difficult to integrate metadata
Ancient characters in archives
  Difficult to input, display, distribute, ...

OAI (Open Archives Initiative)
Starting from Oct. 1999
OAI-PMH (Open Archives Initiative Protocol for Metadata Harvesting), version 2.0 of 2002
Service Provider
  Harvesters
Data Provider
  Repositories
  Metadata

Service Provider vs. Data Provider
[Figure: a user works with a service provider; the service provider sends requests to data providers and receives responses; each data provider exposes a repository.]
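In OAI-PMH, a harvester's request is an HTTP request whose `verb` argument names the operation. The sketch below only builds such a request URL; the base URL is a made-up placeholder, and `oai_request_url` is a helper introduced for this example. `ListRecords` and the `oai_dc` metadata prefix are defined by the protocol.

```python
from urllib.parse import urlencode


def oai_request_url(base_url, verb, **arguments):
    """Build an OAI-PMH request URL for the given verb and arguments."""
    return base_url + "?" + urlencode({"verb": verb, **arguments})


# Hypothetical repository base URL.
BASE_URL = "http://repository.example.org/oai"

url = oai_request_url(BASE_URL, "ListRecords", metadataPrefix="oai_dc")
print(url)
```

A service provider would fetch this URL, parse the XML response, and repeat the request with the returned resumption token until the repository has no more records to send.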

OAI-PMH vs. Z39.50
What is the relationship between the OAI-PMH and other protocols such as Z39.50?
  The OAI technical framework is intentionally simple
    Providing a low barrier for participants
    Easy-to-implement and easy-to-deploy alternative, not intended to replace other approaches
  Protocols such as Z39.50 have more complete functionality
    Session management, result sets, specification of predicates that filter the records returned
    An increase in difficulty of implementation and cost

Why Dublin Core?
Why does the protocol mandate a common metadata format (and why is that common format Dublin Core)?
  Mapping among multiple metadata formats
    Creating services such as common search interfaces across heterogeneous metadata formats
  A less burdensome and ultimately more deployable solution is to require repositories to map to a simple and common metadata format
  The fifteen elements in Dublin Core as a de facto standard for simple cross-discipline metadata
    Dublin Core Metadata Element Set (DCMES)
  Cooperation between the OAI and the Dublin Core Metadata Initiative (DCMI) has led to a common XML schema for unqualified Dublin Core that is available at http://dublincore.org/schemas/xmls/simpledc20020312.xsd
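Because every repository maps to the same simple format, a harvester can read any record with the same few lines. The sketch below parses one unqualified Dublin Core record using the standard library. The record content is invented for illustration; the two namespace URIs are those of oai_dc and the Dublin Core element set.

```python
import xml.etree.ElementTree as ET

DC_NS = "http://purl.org/dc/elements/1.1/"

# Hypothetical record, as it might appear inside an OAI-PMH response.
RECORD = """<oai_dc:dc
    xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
    xmlns:dc="http://purl.org/dc/elements/1.1/">
  <dc:title>Sample bronze inscription rubbing</dc:title>
  <dc:creator>Unknown</dc:creator>
  <dc:type>Image</dc:type>
</oai_dc:dc>"""


def dc_fields(xml_text):
    """Return {element name: [values]} for the Dublin Core elements of a record."""
    root = ET.fromstring(xml_text)
    fields = {}
    for child in root:
        # ElementTree tags look like "{namespace-uri}localname".
        uri, _, name = child.tag[1:].partition("}")
        if uri == DC_NS:
            fields.setdefault(name, []).append((child.text or "").strip())
    return fields


print(dc_fields(RECORD))
```

Values are kept in lists because every Dublin Core element is optional and repeatable.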

Unencoded Character Problem
Large amounts of Chinese characters in different forms such as Bronze Script and Seal Script
People can better appreciate the long history of the Chinese character evolution process and the Chinese culture in general
However, digitization of the heritage materials brings a big problem: these characters are not included in common character encodings in computers (the missing characters)

Example Unencoded Characters
[Figure: example characters in Bronze Script and Seal Script]

Goal
We intend to develop an integrated technology to facilitate easier processing of large amounts of missing characters
  This includes the input, representation, font generation, display, distribution, and search for all the missing characters
We propose an effective composite approach to handle the formation and basic components of Chinese characters

Composite Approach to Unencoded Chinese Characters [JCDL 2005]

Advanced Topics
Mobile Search
LiveClassifier
Concept Search

Mobile Search
Introduction to Mobile Search
Existing Services
  Google Mobile/SMS
Issues

References
R. Schusteritsch, S. Rao, and K. Rodden, Mobile Search with Text Messages: Designing the User Experience for Google SMS, Proceedings of CHI 2005, pp. 1777-1780 (poster).
B. Miller, China's Internet Portals and Content Providers Look to a Future Beyond Mobile Text Messaging, The Yankee Group Report, Feb. 2004.
Communications of the ACM, Vol.48, No.7, Designing for the Mobile Devices, Jul. 2005.

Introduction to Mobile Search
Internet users (as of 2003)
  US: 170 million (population: 292 million)
  China: 80 million (population: 1.3 billion)
  Japan: 70 million (population: 126 million)
Currently, there are more than two billion mobile phone users worldwide, which is more than three times the number of PC users
More than a half-billion mobile phones sold each year (in 2004)

Internet and Mobile Users in China


SMS-based Mobile Internet
SMS (Short Message Service)
  1 billion messages worldwide every day
  Nearly 200 billion messages in China in 2003
Content Explodes
  News alerts
  Sports news
  Weather
  Special interest information
    Ex.: Yao Ming
From SMS to search

Two Common Modes of Mobile Search
Mobile Web browsing
Text Messaging (SMS)

Google Mobile (1/2)
http://mobile.google.com/
[Screenshots: XHTML and WML versions]

Google Mobile (2/2)
[Screenshots: Images and Mobile Web search]

Google SMS
http://www.google.com/sms/

Google Maps for Mobile
http://www.google.com/gmm/

Existing Mobile Search Services
Google (SMS short code: 46645)
  Google Mobile, http://mobile.google.com/
  Google SMS, http://www.google.com/sms/
  Google Maps for Mobile, http://www.google.com/gmm/
Yahoo (92466)
  Yahoo! Mobile, http://mobile.yahoo.com/
  Yahoo! Go, http://go.yahoo.com
4INFO (44636)
AOL
  AOL Mobile Search, http://mobile.aolsearch.com/
MSN
  MSN Mobile, http://mobile.msn.com/
Synfonic, UpSNAP, ...

Google SMS
Specialized information
  Business listings
  Residential listings
  Product prices
  Dictionary definitions
  Area codes
  Zip codes
Google SMS attempts to return the desired information directly, rather than returning hyperlinks

Design Constraints
Conceptual Model
  1-to-1 communication vs. mobile search
  Abbreviation interpretation
Inherent Limitations of Mobile Devices and SMS
  Text entry is slow
  Charge per message sent
  Small, low-resolution screens
  SMS message size limitation: 160 characters
  No guarantee for the receiving order
  SMS interface: text-only, one-dimensional with no menus, forms, or buttons to help users understand its affordances
  Not possible for the system to offer instructions or a prompt
  To get any feedback, the user must wait a new

Addressing Users' Existing Conceptual Models
Most users had some initial problems understanding how SMS could be used for search
Users' existing conceptual model of Google searching also caused some initial problems
Changes to message interpretation
  froogle, shopping, yellow pages, white pages, dictionary combined with help, tips, instructions
  price or prices vs. product search

Communicating affordances
Minimal and prominent sequence of instructions and the features that are considered most useful are shown on the Google SMS home page
Sending the message "help" should return a concise set of instructions on how to use it
Collaborate with PR team to mention the "help" command and highlight the most important features
Work with Marketing team to develop a wallet-size instruction card

Addressing the Limitations of Mobile Devices and SMS
Order of messages
  1of2
Limited input technology
  Query refinement is more difficult (cofee)
  Immediately returning search results for the closest match
Limited output technology
  No results, help
  No more than 3 messages in response to each help request
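Two of these measures can be sketched in code: labelling multi-part replies so the user can restore their order, and answering a misspelled query with the closest known term. The exact label format, the 5-character label budget (which limits a reply to 9 parts) and the small vocabulary are assumptions for illustration; `difflib` from the Python standard library stands in for whatever matching the real service used.

```python
import difflib

SMS_LIMIT = 160  # maximum characters per SMS message


def split_sms(text, limit=SMS_LIMIT):
    """Split text into parts labelled '1of2 ', '2of2 ', ... (at most 9 parts)."""
    if len(text) <= limit:
        return [text]
    size = limit - len("1of2 ")  # room left after the label
    chunks = [text[i:i + size] for i in range(0, len(text), size)]
    if len(chunks) > 9:
        raise ValueError("text too long for single-digit part labels")
    return [f"{k}of{len(chunks)} {chunk}" for k, chunk in enumerate(chunks, 1)]


def reassemble(parts):
    """Restore the original text from labelled parts received in any order."""
    ordered = sorted(parts, key=lambda part: int(part.split("of")[0]))
    return "".join(part.split(" ", 1)[1] for part in ordered)


def closest_query(query, vocabulary):
    """Return the closest known term to the query, or the query itself."""
    matches = difflib.get_close_matches(query, vocabulary, n=1)
    return matches[0] if matches else query


parts = split_sms("x" * 300)
print([len(part) for part in parts], parts[0][:5])
print(closest_query("cofee", ["coffee", "toffee", "copy"]))
```

Searching immediately for the closest match saves the user a second slow, separately charged message just to correct a typo.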

Issues in Mobile Search
User interface for input
Mobile search result output
Context-aware, location-based service
User preference

To allow users to get updated or additional information with fewer keystrokes
To transform the search experience found on a PC to a mobile device
  Tiny screens, network bandwidth, ...
  Wireless Application Protocol (WAP) to shrink Web pages down to more manageable sizes
    Google WML
  Alternatively, SMS
To provide users who are not working with a mouse or a keyboard with a simple way to enter queries
  Shortcuts (for example, w for weather)
To integrate the search vendor's local and mobile search functions
To integrate commerce and mobile search applications

Possible Related Directions
Context-aware Retrieval
Ubiquitous Computing
  Communications of the ACM, Vol.48, No.3, The Disappearing Computer, Mar. 2005.
  Communications of the ACM, Vol.45, No.12, Issues and Challenges in Ubiquitous Computing, Dec. 2002.

LiveClassifier

A system that creates classifiers through Web mining


LiveClassifier
Users create topic hierarchies and define classes/keywords

LiveClassifier
Exploiting the structure information inherent for training
Web
Auto-extracted training data; no manually-labeled data provided

LiveClassifier
[Figure: example topic hierarchies: People, Place, Subjects, Sub-subjects]

LiveClassifier
Classifying documents into classes

LiveClassifier
Classifying short texts into classes

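Concept Search
Conventional search
  Keyword search for "researcher" and "AI" and "Taiwan"
Concept-level search
[Figure: conventional search matches a document only when it contains the keywords researcher, AI and Taiwan; concept-level search also reaches an interesting document that contains related terms such as professor, neural network and NTU.]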

LiveTrans + LiveClassifier


Thanks for Your Attention!
