Elasticsearch: Dealing with Human Language

Elasticsearch-3
Dealing with Human Language

Mohammad Aminul Islam
11103812
Cologne University of Applied Science
6/16/15
Topics to Be Covered
Getting started with languages
Analyzer
Stemming
Language identification
Identifying words
Normalizing tokens
Reducing words to their root form
Stemming issues
Lemmatization
Types of stemmer
Stopwords: Performance vs Precision
Synonyms, Typos and Misspellings
Goals
What is happening behind the search engine
I know all the words, but that sentence makes no sense to me
Matt Groening
Getting Started with Languages: Analyzers
Elasticsearch has a collection of language analyzers
These analyzers perform four tasks:
Tokenize text into individual words
The quick brown foxes > The, quick, brown, foxes
Lowercase tokens
The > the
Remove common stopwords
the, quick, brown, foxes > quick, brown, foxes
Reduce tokens to their root form
foxes > fox
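The four steps can be sketched in a few lines of Python. This is only a toy illustration, not how Elasticsearch implements its analyzers: the stopword set is a small hand-picked subset, and `toy_stem` is a hypothetical helper that merely strips a plural ending.

```python
import re

# Assumption: a tiny illustrative stopword set, not the full Elasticsearch list.
STOPWORDS = {"a", "an", "and", "the", "of", "to"}


def toy_stem(token):
    """Hypothetical, deliberately naive stemmer: strip a plural 'es' or 's'."""
    if token.endswith("es") and len(token) > 4:
        return token[:-2]
    if token.endswith("s") and not token.endswith("ss") and len(token) > 3:
        return token[:-1]
    return token


def analyze(text):
    tokens = re.findall(r"\w+", text)                     # 1. tokenize
    tokens = [t.lower() for t in tokens]                  # 2. lowercase
    tokens = [t for t in tokens if t not in STOPWORDS]    # 3. remove stopwords
    return [toy_stem(t) for t in tokens]                  # 4. reduce to root form


print(analyze("The quick brown foxes"))
```

The order matters: stopwords are matched after lowercasing, so "The" is removed just like "the".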
Analyzer
The english analyzer removes the possessive 's
John's > John
The french analyzer removes elisions like l' and qu'
The german analyzer normalizes terms: ä and ae > a, ß > ss
Question?
I am not happy with HTC mobile > I, am, happy, HTC, mobile
What can we do now?
Analyzer: Configuration
Language analyzers can be used without configuration, but they allow some of their behaviour to be controlled
Stem-word exclusion
For example: organ and organization are stemmed to the same root word
Custom stopwords
For example: not and no can be considered important words
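As a sketch of what such a configuration might look like, the Python snippet below builds the index settings body as a dictionary. The analyzer name `my_english` is made up for illustration; `stem_exclusion` and `stopwords` are the two language-analyzer parameters discussed above. The stopword list here is the default English list with "no" and "not" taken out, so that negations stay searchable.

```python
import json

# Default English stopwords, as listed later in these slides.
DEFAULT_ENGLISH_STOPWORDS = [
    "a", "an", "and", "are", "as", "at", "be", "but", "by", "for", "if", "in",
    "into", "is", "it", "no", "not", "of", "on", "or", "such", "that", "the",
    "their", "then", "there", "these", "they", "this", "to", "was", "will", "with",
]

# Keep "no" and "not" in the index by removing them from the stopword list.
custom_stopwords = [w for w in DEFAULT_ENGLISH_STOPWORDS if w not in ("no", "not")]

index_settings = {
    "settings": {
        "analysis": {
            "analyzer": {
                "my_english": {  # hypothetical analyzer name
                    "type": "english",
                    # Words listed here are not stemmed.
                    "stem_exclusion": ["organization", "organizations"],
                    "stopwords": custom_stopwords,
                }
            }
        }
    }
}

# This JSON would be sent as the body when creating the index.
print(json.dumps(index_settings, indent=2))
```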
Incorrect stemming
Stemming rules differ from language to language
We cannot use one stemming rule for all languages
The root word sometimes changes the meaning of the actual word
For example: ebay.co.uk, ebay.de
Identify Language
We know the language of our own documents
External documents may contain different languages
We can use a language detector to identify the language
For example: Compact Language Detector from Google
It can detect 160+ languages
It can detect multiple languages within a single line of text
Identifying Words
Words
Words are separated by whitespace or punctuation
In English some words are controversial: o'clock, cooperate, eyewitness
In German or Dutch there are compound words
Some Asian languages have no whitespace between words
There are dedicated analyzers for many languages
Standard Analyzer
By default, the standard analyzer is used
We can also define the standard analyzer as a custom analyzer
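A rough sketch of such a definition, written as a Python dictionary holding the settings body. The name `my_standard` is hypothetical, and the filter chain is an assumption kept to the minimum (standard tokenizer plus lowercasing); a stop filter could be appended if stopword removal is wanted.

```python
import json

index_settings = {
    "settings": {
        "analysis": {
            "analyzer": {
                "my_standard": {            # hypothetical analyzer name
                    "type": "custom",
                    "tokenizer": "standard",  # the standard tokenizer
                    "filter": ["lowercase"],  # assumption: lowercase only
                }
            }
        }
    }
}

print(json.dumps(index_settings, indent=2))
```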
Standard Tokenizer
Takes a string as input, processes it and breaks it into individual words
The whitespace tokenizer simply breaks on whitespace
You are the 1st runner home!
You, are, the, 1st, runner, home!
The letter tokenizer breaks on any character that is not a letter
You, are, the, st, runner, home
The standard tokenizer uses the Unicode Text Segmentation algorithm, so it handles text containing a mixture of languages
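The difference between the whitespace and letter tokenizers can be imitated in plain Python. These two functions are stand-ins written for illustration, not the Elasticsearch implementations; the standard tokenizer's Unicode segmentation rules are more involved and are not reproduced here.

```python
import re


def whitespace_tokenize(text):
    """Break on whitespace only; punctuation stays attached to the token."""
    return text.split()


def letter_tokenize(text):
    """Break on every character that is not a letter (digits are dropped)."""
    return re.findall(r"[^\W\d_]+", text)


sentence = "You are the 1st runner home!"
print(whitespace_tokenize(sentence))
print(letter_tokenize(sentence))
```

Note that the whitespace version keeps "1st" and "home!" intact, while the letter version reduces them to "st" and "home".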
Reducing Words to Their Root Form
Words can change their form
Number: fox, foxes
Tense: pay, paid, paying
Gender: waiter, waitress
Person: hear, hears
Stemming tries to remove the differences between the inflected forms of a word and its root
For example: foxes > fox
Problem: the root form does not always have the same meaning
Two Issues of Stemming
Understemming:
Failure to reduce words with the same meaning to the same root
For example: jumped and jumps > jump, but jumping > jumpi
Relevant documents are not returned
Overstemming:
Failure to keep two words with distinct meanings separate
For example: general and generate > gener
Irrelevant documents are returned
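Both failure modes show up in even a very small suffix-stripping stemmer. The sketch below is a deliberately naive rule set invented for this illustration (it is not the Porter stemmer or any stemmer shipped with Elasticsearch); its suffix list was chosen so that it reproduces the examples above.

```python
# Hypothetical rule set: suffixes are tried in this order, first match wins.
SUFFIXES = ["ate", "al", "ed", "ng", "s"]


def naive_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


# Understemming: "jumping" ends up with a different root than "jumped"/"jumps".
print([naive_stem(w) for w in ("jumped", "jumps", "jumping")])

# Overstemming: two unrelated words collapse to the same root.
print([naive_stem(w) for w in ("general", "generate")])
```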
Lemmatization
A lemma is the canonical (dictionary) form of a set of related words
For example: the lemma of paying, paid, pays is pay
Lemmatization can group words by their word sense
For example: wake and wake up have different senses
Lemmatization is a much more complicated and expensive process
Types of stemmer
Algorithmic stemmers:
Apply a series of rules to each word
Easy to use
Fast, use little memory
Work well for regular words
Dictionary stemmers:
Look words up in a dictionary
Use more memory
Have to load all words
Question?
Which stemmer should we use?
Stopwords
Performance Vs Precision
For search purposes some words are more important than others. For better indexing we need to find the valuable words
Low-frequency terms:
Words that rarely appear in documents have a high value or weight
High-frequency terms:
Common words that appear in many documents have a lower value or weight, such as the, an
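One common way to turn this idea into a number is inverse document frequency (IDF). The sketch below uses the textbook form log(N / df) on three made-up documents; the exact formula used by Lucene differs in detail, so treat this as an illustration of the principle only.

```python
import math


def inverse_document_frequency(term, documents):
    """log(N / df): the fewer documents contain the term, the higher the weight."""
    containing = sum(1 for doc in documents if term in doc.lower().split())
    if containing == 0:
        return 0.0  # term never seen; no weight can be assigned
    return math.log(len(documents) / containing)


# Made-up example documents.
documents = ["the quick brown fox", "the lazy dog", "the fox and the dog"]

for term in ("the", "fox", "quick"):
    print(term, round(inverse_document_frequency(term, documents), 3))
```

"the" appears in every document and gets a weight of zero, while "quick" appears in only one and gets the highest weight.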
Default stopwords
The default English stopwords used in
Elasticsearch are as follows:
a, an, and, are, as, at, be, but, by, for, if, in,
into, is, it, no, not, of, on, or, such, that, the,
their, then, there, these, they, this, to, was,
will, with
These stopwords are filtered out before indexing, with little negative impact on retrieval. Is it a good idea?
Pros & Cons of Stopwords

Cons:
Distinguishing happy from not happy
Finding Shakespeare's quotation "To be, or not to be"
Using the country code for Norway: no
Pros:
Performance
Search for fox instead of the fox
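The cons are easy to reproduce with a small stopword filter. The snippet below is a plain-Python imitation (simple letter-based tokenizing plus the default English stopword list from the earlier slide), not the Elasticsearch stop filter itself.

```python
import re

DEFAULT_ENGLISH_STOPWORDS = {
    "a", "an", "and", "are", "as", "at", "be", "but", "by", "for", "if", "in",
    "into", "is", "it", "no", "not", "of", "on", "or", "such", "that", "the",
    "their", "then", "there", "these", "they", "this", "to", "was", "will", "with",
}


def remove_stopwords(text):
    """Lowercase, split into runs of letters, and drop the default stopwords."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in DEFAULT_ENGLISH_STOPWORDS]


print(remove_stopwords("To be, or not to be"))            # nothing is left
print(remove_stopwords("I am not happy with HTC mobile"))  # the negation is lost
print(remove_stopwords("the fox"))
```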
Synonyms
Synonyms are listed as comma-separated values
With the => syntax, it is possible to specify a list of terms to match (on the left side) and a list of replacements (on the right side)
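To make the two rule formats concrete, here is a small Python sketch that reads a rule and returns which terms each input term maps to. `parse_synonym_rule` is a hypothetical helper and the sample rules are invented; the real synonym token filter has additional options that influence how rules are expanded, so this only mirrors the basic idea.

```python
def parse_synonym_rule(rule):
    """Return a dict mapping each input term to its list of replacement terms."""
    if "=>" in rule:
        # Explicit mapping: terms on the left are replaced by terms on the right.
        left, right = rule.split("=>", 1)
        sources = [t.strip() for t in left.split(",") if t.strip()]
        targets = [t.strip() for t in right.split(",") if t.strip()]
    else:
        # Comma-separated list: assume every term is equivalent to all the others.
        sources = targets = [t.strip() for t in rule.split(",") if t.strip()]
    return {source: list(targets) for source in sources}


print(parse_synonym_rule("jump,leap,hop"))
print(parse_synonym_rule("u s a,united states => usa"))
```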
Typos & Misspellings
80% of human misspellings have an edit distance of 1
In other words, 80% of misspellings could be corrected with a single edit
The maximum edit distance is specified by the fuzziness parameter, which can be at most 2
With fuzziness set to AUTO, the maximum edit distance depends on the length of the term:
0 for strings of one or two characters
1 for strings of three, four, or five characters
2 for strings of more than five characters
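Edit distance and the length-based thresholds can be sketched as follows. The distance function is the classic Levenshtein dynamic-programming algorithm (insert, delete, substitute); it does not count a transposition of two adjacent characters as a single edit, which some fuzzy matchers do. `auto_fuzziness` is a hypothetical helper that simply encodes the three thresholds listed above.

```python
def levenshtein(a, b):
    """Minimum number of single-character inserts, deletes and substitutions."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # delete from a
                               current[j - 1] + 1,       # insert into a
                               previous[j - 1] + cost))  # substitute
        previous = current
    return previous[-1]


def auto_fuzziness(term):
    """Maximum allowed edit distance for a term, following the thresholds above."""
    if len(term) <= 2:
        return 0
    if len(term) <= 5:
        return 1
    return 2


def fuzzy_match(query, candidate):
    return levenshtein(query, candidate) <= auto_fuzziness(query)


print(levenshtein("surprize", "surprise"))
print(fuzzy_match("surprize", "surprise"))
print(fuzzy_match("at", "it"))
```

The last call returns False: for very short terms a single edit changes too much of the word, so no edits are allowed.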
Phonetic Matching
Matches words that sound similar even if their spelling differs
Algorithms for converting words to a phonetic representation:
Soundex, the granddaddy of them all
Metaphone and Double Metaphone for English
Caverphone for matching names in New Zealand
Kölner Phonetik for better handling of German words
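As an illustration of what a phonetic representation looks like, below is a compact Python version of the classic American Soundex rules: keep the first letter, encode the following consonants as digits, collapse repeats, and pad to four characters. It is a sketch for explanation and may differ in edge cases from the phonetic plugin used with Elasticsearch.

```python
# Consonant groups and their Soundex digits; vowels, h, w and y have no code.
CODES = {}
for letters, digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                       ("l", "4"), ("mn", "5"), ("r", "6")):
    for letter in letters:
        CODES[letter] = digit


def soundex(word):
    word = "".join(ch for ch in word.lower() if ch.isalpha())
    if not word:
        return ""
    encoded = []
    previous = CODES.get(word[0], "")
    for ch in word[1:]:
        if ch in "hw":
            continue  # h and w do not separate letters that share a code
        code = CODES.get(ch, "")
        if code and code != previous:
            encoded.append(code)
        previous = code  # a vowel resets this, so a repeated code is kept
    return (word[0].upper() + "".join(encoded) + "000")[:4]


# Names that sound alike share a code even though they are spelled differently.
print(soundex("Robert"), soundex("Rupert"))
```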
Q&A
Summary
Analyzer
Stemming
Identify Words
Normalizing Token
Types of Stemmer
Stopwords
Synonyms, Misspellings
Thank you
