6/16/15
Topics to Be Covered
Started with language
Analyzer
Stemming
Language Identification
Identify words
Normalizing tokens
Reducing words to their root form
Stemming issue
Lemmatization
Types of stemmer
Goals
What is happening behind the search
engine
Lowercase tokens
The > the
Analyzer
The English analyzer removes the possessive 's
John's > John
Question ?
Analyzer : Configuration
Language analyzers can be used without
configuration. An analyzer also allows some
behaviour to be controlled
Stem-word exclusion
For example: organ and organization are both
stemmed to the same root, organ
Custom stopwords
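
A minimal sketch of both options, assuming the built-in english analyzer and its stem_exclusion and stopwords parameters. The analyzer name my_english and the word lists are hypothetical, and the settings are only printed as JSON rather than sent to a cluster.

```python
import json

# Index settings for a configured English analyzer.
# "my_english" and both word lists are hypothetical, chosen for illustration.
settings = {
    "settings": {
        "analysis": {
            "analyzer": {
                "my_english": {
                    "type": "english",
                    # Words listed here are left unstemmed.
                    "stem_exclusion": ["organization", "organizations"],
                    # Custom stopword list used instead of the default one.
                    "stopwords": ["a", "an", "and", "the"],
                }
            }
        }
    }
}

# Request body that would be used when creating the index.
print(json.dumps(settings, indent=2))
```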
Incorrect stemming
Stemming rules differ from language to
language
We cannot use one set of stemming rules for
all languages
Reducing a word to its root sometimes changes
the meaning of the actual word
For example : ebay.co.uk
ebay.de
Identify Language
We know the language of our own documents
External documents may contain different
languages
We can use a language detector to identify
the language
For example: Compact Language Detector
from Google
It can detect 160+ languages
It can detect multiple languages within a single
line of text.
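
To make the idea of language detection concrete, here is a toy sketch that guesses a language by counting stopword overlap. It is not how the Compact Language Detector works; the stopword samples and the function name guess_language are assumptions made for illustration only.

```python
# Tiny hand-picked stopword samples (hypothetical); a real detector
# relies on much richer statistical models.
STOPWORDS = {
    "english": {"the", "and", "is", "of", "to", "in"},
    "german": {"der", "die", "das", "und", "ist", "nicht"},
    "dutch": {"de", "het", "een", "en", "van", "niet"},
}


def guess_language(text):
    """Return the language whose stopwords overlap most with the text, or None."""
    tokens = set(text.lower().split())
    scores = {lang: len(tokens & words) for lang, words in STOPWORDS.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None


print(guess_language("The fox is in the garden"))
print(guess_language("Der Fuchs ist nicht im Garten"))
```

A sketch like this fails on short or mixed-language text, which is where a dedicated detector is needed.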
Identifying Words
Words
Words are separated by whitespace or
punctuation
Even English has some controversial
cases: o'clock, cooperate, eyewitness
German and Dutch have some
combined (compound) words
Some Asian languages have no whitespace
between words
Dedicated analyzers exist for many languages
Standard Analyzer
The standard analyzer is used by default
We can reimplement the standard analyzer as a
custom analyzer
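
A rough sketch of the standard analyzer written as a custom analyzer, assuming it amounts to the standard tokenizer followed by the lowercase and stop token filters (in the built-in analyzer the stop filter is disabled by default). The analyzer name std_custom is hypothetical.

```python
import json

# Approximate custom-analyzer equivalent of the standard analyzer.
# "std_custom" is a hypothetical name used only for this example.
settings = {
    "analysis": {
        "analyzer": {
            "std_custom": {
                "type": "custom",
                "tokenizer": "standard",
                "filter": ["lowercase", "stop"],
            }
        }
    }
}

print(json.dumps(settings, indent=2))
```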
Standard Tokenizer
Takes a string as input, processes it,
and breaks it into individual words
The whitespace tokenizer simply breaks on
whitespace
The standard tokenizer also drops punctuation:
You are the 1st runner home!
You, are, the, 1st, runner, home
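
The difference between the two tokenizers can be sketched in plain Python on the sample sentence. The regular expression is only an approximation of word-boundary tokenization, not the algorithm Elasticsearch uses.

```python
import re

text = "You are the 1st runner home!"

# Whitespace tokenization: split on whitespace only, punctuation stays attached.
whitespace_tokens = text.split()

# Rough stand-in for word-boundary tokenization: keep runs of word characters.
word_tokens = re.findall(r"\w+", text)

print(whitespace_tokens)  # ['You', 'are', 'the', '1st', 'runner', 'home!']
print(word_tokens)        # ['You', 'are', 'the', '1st', 'runner', 'home']
```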
Overstemming
Lemmatization
A lemma is the canonical form of a set of related words
For example: the lemma of paying, paid, pays is pay
Types of stemmer
Algorithmic stemmers:
They apply an algorithm (a set of rules)
Easy to use
Fast, use little memory
Good for regular words
Dictionary stemmers:
They look words up in a dictionary
They use more memory
Have to load all words
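
The trade-off between the two types can be sketched with toy versions of each. The suffix rules and the dictionary entries below are hypothetical and far smaller than what a real stemmer uses.

```python
# Toy algorithmic stemmer: strip a few common suffixes (hypothetical rules).
SUFFIXES = ("ing", "ed", "s")


def algorithmic_stem(word):
    """Strip the first matching suffix, keeping a stem of at least 3 letters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


# Toy dictionary stemmer: every word form must be loaded up front.
DICTIONARY = {"paying": "pay", "paid": "pay", "pays": "pay"}


def dictionary_stem(word):
    """Look the word up, falling back to the word itself when unknown."""
    return DICTIONARY.get(word, word)


for w in ("paying", "pays", "paid"):
    print(w, algorithmic_stem(w), dictionary_stem(w))
```

The rule-based version handles the regular forms paying and pays but leaves the irregular form paid untouched, while the dictionary version handles all three at the cost of storing every word form.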
Question?
Which stemmer should we use?
Stopwords
Performance Vs Precision
Default stopwords
The default English stopwords used in
Elasticsearch are as follows:
a, an, and, are, as, at, be, but, by, for, if, in,
into, is, it, no, not, of, on, or, such, that, the,
their, then, there, these, they, this, to, was,
will, with
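
A minimal sketch of stopword removal using the list above. Tokenization here is a plain lowercase whitespace split, which is a simplification, and the function name remove_stopwords is hypothetical.

```python
# The default English stopwords listed above.
DEFAULT_STOPWORDS = set(
    "a an and are as at be but by for if in into is it no not of on or "
    "such that the their then there these they this to was will with".split()
)


def remove_stopwords(text):
    """Lowercase the text, split on whitespace, and drop stopwords."""
    return [t for t in text.lower().split() if t not in DEFAULT_STOPWORDS]


print(remove_stopwords("The quick fox"))  # ['quick', 'fox']
```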
Pros:
Performance
Search for fox instead of the fox
Synonyms
Synonyms are listed as comma-separated
values
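
A small sketch of how comma-separated synonym lines could be parsed and applied to a token stream. The rules and the helper names parse_synonyms and expand are hypothetical; this is not the Elasticsearch synonym filter itself.

```python
def parse_synonyms(rules):
    """Map each word to the full group of words listed on the same line."""
    table = {}
    for rule in rules:
        group = [w.strip() for w in rule.split(",") if w.strip()]
        for word in group:
            table[word] = group
    return table


def expand(tokens, table):
    """Replace each token with its synonym group, or keep it unchanged."""
    out = []
    for tok in tokens:
        out.extend(table.get(tok, [tok]))
    return out


rules = ["jump,leap,hop", "quick,fast"]  # hypothetical synonym lines
table = parse_synonyms(rules)
print(expand(["quick", "fox"], table))  # ['quick', 'fast', 'fox']
```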
Phonetic Matching
Words sound similar, though the spelling may differ
Algorithms for converting a word to a phonetic
representation:
Soundex is the granddaddy of them all
Metaphone and Double Metaphone for
English
Caverphone for matching names in New
Zealand
Kölner Phonetik for better handling of
German words.
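
As an illustration of a phonetic representation, here is a sketch of the classic American Soundex rules in plain Python. It is written from the commonly described rules and is not taken from any Elasticsearch plugin; the function name soundex is an assumption.

```python
# Consonant groups and their Soundex digits.
CODES = {}
for letters, digit in (
    ("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
    ("l", "4"), ("mn", "5"), ("r", "6"),
):
    for ch in letters:
        CODES[ch] = digit


def soundex(word):
    """Return the 4-character Soundex code of a word ('' if it has no letters)."""
    letters = [c for c in word.lower() if c.isalpha()]
    if not letters:
        return ""
    result = letters[0].upper()
    prev = CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "hw":
            continue  # h and w do not separate letters with the same code
        code = CODES.get(ch, "")
        if code and code != prev:
            result += code
        prev = code  # vowels reset prev, so a repeated code is kept after them
    return (result + "000")[:4]


print(soundex("Smith"), soundex("Smyth"))    # S530 S530
print(soundex("Robert"), soundex("Rupert"))  # R163 R163
```

Words that sound alike but are spelled differently end up with the same code, which is what makes phonetic matching possible.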
Q&A
Summary
Analyzer
Stemming
Identify Words
Normalizing Token
Types of Stemmer
Stopwords
Synonyms, Misspelling
Thank you