
1

Unsolved problems in machine learning associated with the Open Mind Initiative

David G. Stork
Ricoh Silicon Valley
stork@OpenMind.org
2

The Open Mind Initiative: A collaborative framework for collecting and learning netizen contributions to open source 'intelligent' software

David G. Stork
Ricoh Silicon Valley
stork@OpenMind.org
3

Outline

One-sentence description of Open Mind


Background
Open Mind Initiative
Projects
Unsolved problems in machine learning
Conclusions
4

Open Mind Initiative

A collaborative framework (based on traditional open source methodology)
for developing “intelligent” software, where...
» domain experts provide algorithms,
» tool developers provide software
infrastructure and tools, and
» non-expert netizens provide raw data.

Description Background Open Mind Projects Unsolved problems Conclusions


5

E-community & Open Source waves

GNU, SendMail, emacs, Apache


Linux
» 10M lines; 20M seats; doubling time 6 months
» 10^5 contributors
Newhoo! dmoz.org
» Open web directory
» 1.5M sites; 22,229 editors; 218,651 categories
Infomedia
» Open source encyclopedia



6

Growth of new software methods

1990 10^5 programmers → 1995 Linux
1995 10^6 web authors → 1999 Newhoo!
1999 10^9 netizens → 2003 Open Mind

New communication allows communities and collaboration, and thus new software methods
Opportunities expand to less-skilled users



7
Pattern recognition
and intelligent systems

Recognizer = Theory + Model + Data


Theory: excellent
Models: depend on problem
Data: there’s never enough!
» “the group with the most data wins”
» the internet lowers the cost of data
collection



8

Software tools

Tools for customization/experimentation


» CSLU
» Nuance
» HTK
Non-experts can use many of these!



9

Open Mind Initiative

Three main functions, provided by


» Domain Experts
– fundamental algorithms, process control,
education/proselytizing, ...
» Tool developers
– software infrastructure, tools, ...
» Netizens
– raw data, low-level bug reports, ...



10

Domain Experts

Provide algorithms (e.g., grammatical inference, HMMs, neural nets, ...)
Process control, data truthing
» detect outliers for review/rejection
» data “voting”
» catch trials
» signal detection theory (d′)
» bias avoidance
Trend to publish data and algorithms on the web
More university work is being done in open source
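
As a hedged illustration of the signal detection theory item above: d′ is conventionally the difference between the z-transformed hit rate and false-alarm rate. The sketch assumes a netizen's answers on trials with known truth have already been tallied into those two rates. The function name `d_prime` and the clipping constant `eps` are illustrative choices, not part of Open Mind.

```python
from statistics import NormalDist


def d_prime(hit_rate, false_alarm_rate, eps=1e-6):
    """Sensitivity index d' = z(hit rate) - z(false-alarm rate)."""
    # Clip the rates away from 0 and 1 so the inverse normal CDF stays finite.
    h = min(max(hit_rate, eps), 1 - eps)
    f = min(max(false_alarm_rate, eps), 1 - eps)
    z = NormalDist().inv_cdf  # inverse CDF of the standard normal
    return z(h) - z(f)
```

A netizen who answers at chance (equal hit and false-alarm rates) gets d′ = 0, while larger values indicate better discrimination independent of response bias.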



11

Tool/infrastructure developers

Get maximum information for minimum netizen effort
» learning with queries (e.g., informative
patterns)
Make it easy (fast) for contributors
Web infrastructure
Collaborative software (version control)
Establish communities
Reward contributors
12

Netizens
Incentives
» benefits in used system
» fun (games: 10^13 clicks on Solitaire; Marathon, MUDD, ...)
» recognition (post names by amount of info. accepted,
organized by different criteria)
» general interest (note progress: data and performance)
» altruism/philanthropy (cf. OED, SETI 10^6 hits/day, ...)
» education (linguistics in schools, ...)
» lottery, gifts, frequent flier miles, ...
» money
1.5M inmates, 1M in nursing homes, ...



13

“Generic project” structure

Isolated character recognition

Recognizer: “off the shelf”


Isolated characters displayed on netizens’
browsers
Synthetic data (noise, rate, ...)
Learning with queries (present informative
patterns); each pattern more valuable than an
iid sampled one
Improved recognition



14

Collecting labels of isolated character

[Figure: the Open Mind host exchanging isolated characters (4s and 9s) and their labels with netizens.]
15

Relation to traditional open source


Open Source                                    Open Mind
• no netizens                                  • netizens crucial
• expert knowledge (C++filt, gdbm)             • informal knowledge (read, hear)
• machine learning irrelevant                  • machine learning essential
• web infrastructure useful                    • web infrastructure essential
• most work is directly on the final software  • most work is on the infrastructure
• hacker culture (10^5)                        • netizen culture (10^8)
• software released                            • software and data released



16

Relation to data mining


Data Mining                                    Open Mind
• type of data may not be available            • data tailored to the project desired
  for the project desired (e.g., OCR)            (e.g., OCR)
• no interactive queries                       • interactive queries
    slower learning                                faster learning
    ambiguities not resolved                       ambiguities resolved
• relatively fixed amount of data              • new data encouraged
• model data as it exists                      • collect data for best classifier
• little or no netizen support                 • netizen support



17

Open framework allows cross-project integration

Use Open Mind linguistic constraints for Open Mind OCR
Use Open Mind speech as front end for
games used in other projects
Use Open Mind common sense for
Open Mind language understanding



18

Open Mind handwriting

University of Nijmegen (Netherlands)


Segmentation data capture software
Transcription capture software
Seven-classifier system
Large set of unlabelled characters
Porting to the web and bulletproofing



21

Open Mind speech

U. Sherbrooke (Canada) & Carnegie Mellon


Engine: Sphinx2 system
First target: isolated Linux commands
Database software
Generic tools:
» HMM, FFT, LPC, FIR, IIR
» k-means clustering
» maximum mutual information VQ
Working software:
» speaker identification
» gender identification
23

Open Mind common sense

MIT Media Lab


Netizens contribute:
» Assertions (“grass is green”)
» Ontology information (“all chairs are
furniture”)
» Inferencing rules (“if all A are B and all B
are C, then all A are C”)
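
The inferencing rule quoted above ("if all A are B and all B are C, then all A are C") amounts to taking a transitive closure over contributed ontology facts. The following is a minimal sketch under that reading. The pair representation, the function name `close_is_a`, and the example fact ("furniture", "artifact") are assumptions for illustration, not the project's actual data format.

```python
def close_is_a(facts):
    """Apply "all A are B, all B are C => all A are C" until nothing new appears.

    facts: set of (A, B) pairs, each read as "all A are B".
    Returns a new set containing the original and all derived pairs.
    """
    closed = set(facts)
    changed = True
    while changed:
        changed = False
        for a, b in list(closed):
            for c, d in list(closed):
                # Chain (a, b) with (b, d) to derive (a, d).
                if b == c and (a, d) not in closed:
                    closed.add((a, d))
                    changed = True
    return closed


facts = {("chair", "furniture"), ("furniture", "artifact")}
derived = close_is_a(facts) - facts
```

Here `derived` holds the single inferred fact that all chairs are artifacts. The quadratic loop is chosen for readability; a large fact base would need an indexed graph traversal instead.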



24

Possible Project (1)

Open Mind text-to-speech

Text-to-speech generator on desktops


Netizens type their favorite sentences,
spoken by 2 models having different
parameters (prosody, pitch, ...)
Two-alternative forced choice
Parameters set via learning with queries
Preferences may cluster
More natural interfaces
25

Possible Project (2)

Open Mind spam filter

Netizens forward spam and non-spam to the Open Mind spam site
Classifier learns features
» “!”, “$”, “free”, maximum length of ALL
CAPS, average length of ALL CAPS, ...
Semantics
Better spam filter
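
The surface features listed above can be sketched as a small extractor that turns a message into numbers a classifier could learn from. This is only an illustration: the function name `spam_features`, the dictionary keys, and the definition of an "ALL CAPS" run as two or more consecutive capital letters are assumptions, not something the slide specifies.

```python
import re


def spam_features(text):
    """Count the surface cues named on the slide for one message."""
    # Assumption: an ALL CAPS run is 2+ consecutive ASCII capital letters.
    caps_runs = [len(run) for run in re.findall(r"[A-Z]{2,}", text)]
    return {
        "exclamations": text.count("!"),
        "dollars": text.count("$"),
        "free": len(re.findall(r"\bfree\b", text, flags=re.IGNORECASE)),
        "max_caps_run": max(caps_runs, default=0),
        "mean_caps_run": sum(caps_runs) / len(caps_runs) if caps_runs else 0.0,
    }


features = spam_features("FREE money!!! Win $$$ NOW, it is free")
```

Each forwarded message, paired with the netizen's spam or non-spam label, would then give one training example for whichever classifier the domain experts supply.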



26

Possible Project (3)

Open Mind chatbot

Netizen answers questions online


» complete sentences
» choose most natural paragraph
Game interface (Dungeons & Dragons)
» choose “most natural” paragraph
Better text generation; Loebner Prize



27

Possible Project (4)

Open Mind object recognition

Netizens submit digital photos and labels (“cat sitting,” “horse running,” ...)
Semi-automatic segmentation
3D models trained
Better object recognition software; search by
image content



28

Possible Project (5)

Open Mind GO

Contributors read tutorial and take test


Score board positions taken from database
Netizens play games against each other and
the current system
Scores used to guide massive search
System implemented in parallel on netizens’
computers, networked
» user interest!
Better GO software ($1M prize)
29

Problems in machine learning

Goal: Optimal classifier/AI system


Differs from traditional learning in noise (but
both usable)
» Interactive learning means we want to reward
good data, punish bad
» on-line (rapid) so as to improve interactive queries
Learn reliability of netizens
Game theory/decision theory/machine
learning/pattern recognition
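
One simple way to "learn reliability of netizens" is to track how often each contributor agrees with items whose answer is already known and smooth that fraction with a prior. The sketch below uses the posterior mean of a Beta-Bernoulli model. That model, the uniform default prior, and the name `reliability_estimate` are assumptions chosen for illustration; the slide leaves the method open.

```python
def reliability_estimate(correct, total, prior_right=1.0, prior_wrong=1.0):
    """Posterior mean probability that a netizen answers correctly.

    correct / total: counts on items with known answers.
    prior_right / prior_wrong: Beta prior pseudo-counts (uniform by default).
    """
    return (correct + prior_right) / (total + prior_right + prior_wrong)
```

With no history the estimate is 0.5, and it moves toward the observed fraction as more checked answers accumulate, which suits the on-line setting described above.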



30

General issues

Relative value of learning with queries vs. iid samples


Data truthing/outlier detection
Optimal learning strategies given...
» Bayes error
» probability of hostile data
» probability of data error
Somewhat like Educational Testing Service: checks
the predictability of an SAT question by correlating
responses with those on other test questions.



31

Open Mind Animals (Stork & Lam, 00)


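[Figure: binary decision tree of yes/no questions for the Animals game. Root question: “2 legs?”; further questions: “can fly?”, “can swim?”, “feathers?”, “mane?”; leaves include human, parrot, bat, elephant, dog, and horse.]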



32

Ensuring data quality in Open Mind Animals

Must correct errors early!


Misspelled animal
» check in pre-compiled lexicon, also...
Detection of questionable data (bug report)
» lock node; notice to domain expert who arbitrates
Submission of animal already in tree
» warning and lowest parent node question listed; if player
persists, both nodes locked until arbitrated by 3rd player
Submission collisions
» lock node for 30 seconds
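
Two of the checks above (the lexicon lookup for misspellings and the test for an animal already in the tree) can be sketched on a small yes/no question tree. This is a simplified illustration, not the Open Mind Animals code: the `Node` class, `animals_in`, and `add_animal` are hypothetical names. Node locking and arbitration are not modeled, and a duplicate is rejected outright rather than triggering the warning the slide describes.

```python
class Node:
    """Internal nodes hold a yes/no question; leaves hold an animal name."""

    def __init__(self, text, yes=None, no=None):
        self.text = text
        self.yes = yes
        self.no = no

    def is_leaf(self):
        return self.yes is None and self.no is None


def animals_in(node):
    """Collect the animal names stored at the leaves below `node`."""
    if node.is_leaf():
        return {node.text}
    return animals_in(node.yes) | animals_in(node.no)


def add_animal(root, leaf, animal, question, answer_is_yes, lexicon):
    """Turn `leaf` into a question separating the new animal from the old one."""
    if animal not in lexicon:
        # Misspelled animal: not found in the pre-compiled lexicon.
        raise ValueError(f"unknown or misspelled animal: {animal!r}")
    if animal in animals_in(root):
        # Submission of an animal already in the tree.
        raise ValueError(f"{animal!r} is already in the tree")
    old, new = Node(leaf.text), Node(animal)
    leaf.text = question
    leaf.yes, leaf.no = (new, old) if answer_is_yes else (old, new)


lexicon = {"human", "dog", "horse"}
root = Node("2 legs?", Node("human"), Node("dog"))
add_animal(root, root.no, "horse", "mane?", True, lexicon)
```

After the call, the former "dog" leaf has become the question "mane?" with "horse" on the yes branch and "dog" on the no branch.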



33

General data “voting”

Present each query to k netizens, accept iff all k agree
k large → small amount of reliable data
k small → large amount of unreliable data
Estimate netizen reliabilities
Find optimal, k*, for different reliabilities,
state of classifier, Bayes error, ...
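
The trade-off in the voting scheme above can be made concrete under strong simplifying assumptions: a two-class problem, and k netizens who each answer correctly with the same probability r, independently of one another. Under those assumptions (which are this sketch's, not the slide's), a query is accepted when all k are right or all k are wrong. The function name `unanimous_vote_stats` is illustrative.

```python
def unanimous_vote_stats(r, k):
    """Return (acceptance probability, accuracy of accepted labels).

    Assumes binary labels and k independent netizens, each correct with
    probability r, so unanimous agreement means all right or all wrong.
    """
    all_right = r ** k
    all_wrong = (1 - r) ** k
    p_accept = all_right + all_wrong
    accuracy = all_right / p_accept if p_accept > 0 else float("nan")
    return p_accept, accuracy


for k in (1, 2, 3, 5):
    p_accept, accuracy = unanimous_vote_stats(0.9, k)
    print(f"k={k}: accept {p_accept:.3f}, accuracy {accuracy:.4f}")
```

As k grows, fewer queries survive but the surviving labels are cleaner. Finding k* would mean weighing that yield against the value of label accuracy for the current classifier.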
34

“Catch” trials

If the submission rate is low, we do not have k netizens online
simultaneously and must estimate reliability individually
Make 1 out of every q samples unambiguous (given by the classifier
or a precompiled set); if the netizen fails on this “catch” trial,
he is deemed unreliable and his data are discarded
q small → small amount of reliable data
q large → large amount of unreliable data
What is optimal, q*?
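
The catch-trial procedure above can be sketched as a filter over one netizen's session. The data layout is an assumption made for illustration: responses arrive as (item, label) pairs in presentation order, every q-th position is a catch trial, and `truth` holds the known labels for those items. The name `screen_session` is hypothetical.

```python
def screen_session(responses, q, truth):
    """Keep a netizen's data only if every catch trial was answered correctly.

    responses: list of (item_id, label) in the order shown to the netizen.
    q: every q-th item (positions q, 2q, ... counting from 1) is a catch trial.
    truth: dict mapping catch item_id -> known correct label.
    Returns the non-catch responses, or [] if any catch trial was failed.
    """
    kept = []
    for position, (item_id, label) in enumerate(responses, start=1):
        if position % q == 0:
            if truth[item_id] != label:
                return []  # failed a catch trial: discard the whole session
        else:
            kept.append((item_id, label))
    return kept


session = [("a", "4"), ("b", "9"), ("c", "4"), ("d", "9")]
good = screen_session(session, q=2, truth={"b": "9", "d": "9"})
bad = screen_session(session, q=2, truth={"b": "9", "d": "4"})
```

With q = 2, half of the netizen's effort goes to catch trials, which shows the cost side of the trade-off: a smaller q screens harder but yields less new data per session.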



35

Exploratory learning (Thrun, 95)

Learning with queries (not iid!)


Decision theory: each action (query) has an
expected cost/payoff. Choose the query
which, when answered, will lead to the
greatest improvement in the classifier.
How does it depend upon the state of the
classifier? Netizen reliabilities?
Classifier sensitivity analysis
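
Computing the query with the greatest expected improvement is generally expensive, so a common stand-in is to ask about the pattern on which the current classifier is least certain. The sketch below uses that proxy (maximum predictive entropy). This is a substituted heuristic, not the decision-theoretic criterion stated above, and `pick_query` and `predict_proba` are hypothetical names.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of a discrete distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def pick_query(candidates, predict_proba):
    """Return the candidate whose predicted label distribution is most uncertain.

    predict_proba: callable mapping a candidate to a list of class probabilities
    from the current classifier (an assumed interface).
    """
    return max(candidates, key=lambda x: entropy(predict_proba(x)))


# Toy classifier output for three unlabeled patterns.
toy = {"a": [0.99, 0.01], "b": [0.5, 0.5], "c": [0.8, 0.2]}
chosen = pick_query(["a", "b", "c"], toy.__getitem__)
```

Pattern "b" is chosen because the toy classifier is evenly split on it. Accounting for netizen reliabilities, as the slide asks, would require weighting each candidate by how trustworthy the expected answer is.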



36

Sensitivity analysis
37

Game theory

Seek strategy to reward/teach netizens to give “good” data
But “good” depends upon the
classifier...which depends upon the
data...
But adversaries learn too!



38

Take-home messages

Era of large data sets


» needed for further progress
Open software development
» leads to high-quality software
» integrate components
» can be used for many projects in pattern recognition and AI
Netizens
» can contribute large amounts of informal data
Data collection
» opportunities for algorithms and theory



39

Open Mind is inevitable


• Need is here • Web is here • Theory/Machine learning is here
• (Networked) computer power and memory are here

[Figure: Open Mind sits between netizens’ knowledge and intelligent systems.]

This collaboration is going to happen!


» Less radical than Richard Stallman or Linus Torvalds...



40

Initiative status

www.OpenMind.org, mailing list


Logo (Imageworks, Inc.)
Homepage and project template designed (Diamond
Bullet Designs)
Public relations (Ruder Finn)
Legal counsel
Solicited corporate donations (e.g., books, CDs, ...)
Demonstration project and infrastructure: Animals
Three projects, several infrastructure developers



41

Summary

Open Mind
» Collaborative framework for developing
“intelligent systems”
» Experts, tool developers, netizens
New areas in machine learning
Vision of the future



42

Questions/Comments...

“teaching computers the stuff we all know”


www.OpenMind.org

to subscribe to mailing list, send mail to: majordomo@OpenMind.org


in message body: subscribe openmind-general <your e-mail>
