
Myanmar Language Web Search Engine Implementation History

Thura Aung
BCIS (Comp & Info Sys)
Outline
Abstract

1. Introduction

2. Purpose

3. Development history of Myanmar search engine

4. Problems occurring in Myanmar search engine development

5. Proposed Myanmar search engine architecture

6. Conclusion

7. References
Abstract

• Myanmar, the mother language, has no white spaces or delimiters to segment words.
• Users need to find information regarding Myanmar culture, history and tourism.
• Myanmar should also have its own language search engine.
• The aim is a Myanmar search engine on par with other search engines such as Google.
Introduction
• Search engines for the Myanmar language are still being developed.
• The names of former Myanmar search engines are as follows.
Sr. No. | Website URL
1 | http://www.searchmyanmar.com
2 | http://www.myanmarcrawler.com
3 | http://search.mymyanmar.net
4 | http://www.etrademyanmar.com
5 | http://dir.yahoo.com/Regional/Countries/Myanmar__Burma/_
6 | http://www.dmoz.org/Regional/Asia/Myanmar
7 | http://myanmar-myanmar.com
Purpose

• To describe the initial development of the Myanmar search engine.
• To analyze previous developers' ideas relating to search algorithms and word segmentation.
• To explore the many problems found during the implementation process.
Development history of Myanmar search engine
Date | Developers' Name | Title
May 04, 2006 | Mr. William W.L.K | About Myanmar Unicode Implementation Standards
2006 | Hla Hla Htay, Kavi Narayana Murthy | Myanmar Word Segmentation, ICCA 2006
October 02, 2007 | Tun Thura Thet; Jin-Cheon Na; Wunna Ko Ko | Word segmentation for the Myanmar language
2008 | Zin Maung Maung, Yoshiki Mikami | A Rule-based Syllable Segmentation of Myanmar Text
February, 2008 | Pann Yu Mon, Maung Maung Thant, Ohnmar Htun Pe, San Ko Oo, Yoshiki Mikami | Statistical Analysis of Myanmar Words on the World Wide Web for Search Engine Development, ICCA 2009
Development history of Myanmar search engine
1. Title: Mr. William W.L.K, About Myanmar Unicode Implementation Standards
   Findings:
   - Defines a tokenizing algorithm for Myanmar text for use in searching.
   - Searching applications can use the tokenizing algorithm and the syllable-based line-breaking algorithm.
   - Uses a search engine library, e.g. Apache Lucene (Java & C++).
   - Processes text for lexicon analysis, e.g. Machine Translation (MT engine).

2. Title: Tun Thura Thet; Jin-Cheon Na; Wunna Ko Ko, Word segmentation for the Myanmar language
   Findings, syllable segmentation:
   - Considers a base character together with a pre-base character, a post-base character, an above-base character and a below-base character.
   - Uses a dictionary-based statistical approach for syllable merging and a rule-based heuristic approach for syllable segmentation.
   - Reports 99.05% recall, 98.94% precision and 98.99% F-measure.
   Findings, word segmentation:
   - Text is broken down into sentences and phrases by looking at punctuation marks and spaces.
   - Search engines require documents to be indexed by words; the words of a query are compared with the indexed words of the documents.
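The recall, precision and F-measure figures quoted above for syllable segmentation can be cross-checked against each other, assuming the reported F-measure is the balanced F1 score (the harmonic mean of precision and recall). The function name below is illustrative and is not taken from the cited paper.

```python
def f_measure(precision: float, recall: float) -> float:
    """Balanced F-measure (F1): harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Figures reported on the slide, in percent.
f1 = f_measure(precision=98.94, recall=99.05)
print(round(f1, 2))  # 98.99
```

Under that assumption the three reported numbers are consistent with one another.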
3. Title: Hla Hla Htay, Kavi Narayana Murthy, Myanmar Word Segmentation, ICCA 2006
   Findings, word segmentation:
   - Myanmar sentences can be tokenized by eliminating stop words.
   - Longest-sentence matching and stop-word recognition lead to correct word segmentation.
   - Two approaches to segmentation: one based on a list of stop words, the other using n-grams of syllables.
   - Collected 1,216 stop words and 4,550 syllable words, with 99% accuracy.
   - Collected about 100,000 words.
   - Achieved about 65% accuracy in word hypothesis.

4. Title: Zin Maung Maung, Yoshiki Mikami, A Rule-based Syllable Segmentation of Myanmar Text
   Findings, syllable segmentation:
   - The syllable segmentation algorithm is based on the syllable structure of Myanmar script and uses a rule-based approach.
   - The corpus contains a total of 32,238 Myanmar syllables.
   - Accuracy rate of 99.96% for segmentation.
   Findings, word segmentation:
   - The program converts the input text string into an equivalent sequence of categories, using CMCACV for Myanmar.
   - Uses dictionary-dependent techniques of word segmentation.
   - The segmentation program decides breaks between pairs of characters by comparing the input character sequence with tables.
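To give a feel for what rule-based syllable segmentation looks like, here is a minimal sketch. It is not the algorithm of any paper listed above. It assumes text stored in standard Unicode order (not Zawgyi) and applies a single simplified rule: a consonant starts a new syllable unless it is stacked under the previous consonant or is closed by an asat or virama. The names `SYLLABLE_START` and `break_syllables` are illustrative.

```python
import re

# Assumption: input is in standard Myanmar Unicode order, not Zawgyi.
CONSONANTS = "\u1000-\u1021"  # MYANMAR LETTER KA .. MYANMAR LETTER A
ASAT = "\u103A"               # MYANMAR SIGN ASAT (closes a syllable)
VIRAMA = "\u1039"             # MYANMAR SIGN VIRAMA (stacks consonants)

# A consonant opens a syllable unless it follows a virama (stacked)
# or is itself followed by asat/virama (it ends the current syllable).
SYLLABLE_START = re.compile(
    "(?<!" + VIRAMA + ")[" + CONSONANTS + "](?![" + ASAT + VIRAMA + "])"
)


def break_syllables(text: str) -> list:
    """Split text at every position where a new syllable starts."""
    cuts = [m.start() for m in SYLLABLE_START.finditer(text)]
    bounds = sorted(set([0] + cuts + [len(text)]))
    return [text[a:b] for a, b in zip(bounds, bounds[1:])]


# "mran-ma-sa" (Myanmar language), written with escapes for clarity.
word = "\u1019\u103C\u1014\u103A\u1019\u102C\u1005\u102C"
syllables = break_syllables(word)
print(len(syllables))  # 3
```

A production segmenter would need further rules for independent vowels, digits, punctuation and mixed-script text.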
5. Title: Pann Yu Mon, Language Specific Crawler for Myanmar Web Pages
   Findings:
   - The purpose of the Language Specific Crawler (LSC) is to collect as many web pages as possible.
   - The crawler is multi-threaded and saves URLs in a CSV file.
   - Page content is saved in a Derby database.
   - The crawler accuracy rate is 93%.
   - Indexing is applied to the documents of a corpus.
   - Keywords are saved for each web page and stored in the database.
   - Myanmar language content needs its encoding transliterated to Myanmar Unicode.

6. Title: Pann Yu Mon, Maung Maung Thant, Ohnmar Htun Pe, San Ko Oo, Yoshiki Mikami, Statistical Analysis of Myanmar Words on the World Wide Web for Search Engine Development, ICCA 2009
   Findings:
   - Collected Myanmar words from documents on websites to learn which words are frequently used; the research was based on ASCII format.
   - Used the index words of a Myanmar-English dictionary.
   Findings, word segmentation:
   - The word segmentation program for Myanmar text is based on the longest string matching algorithm.
   - Identified a total of 766,892 Myanmar words (12,211 unique headwords).
   - 5,861 words (0.76%) were not identified. Accuracy is 99.24%.
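Longest string matching, as named in the findings above, can be sketched as a greedy scan: at each position take the longest run of syllables that appears in a dictionary, and fall back to a single syllable otherwise. This is a generic illustration, not the cited authors' program. The romanized syllables, the tiny lexicon and the `max_len` parameter are hypothetical.

```python
def segment_longest_match(syllables, lexicon, max_len=4):
    """Greedy longest-match word segmentation over a syllable list."""
    words = []
    i = 0
    while i < len(syllables):
        # Try the longest candidate first, then shrink.
        for n in range(min(max_len, len(syllables) - i), 0, -1):
            candidate = "".join(syllables[i:i + n])
            # A single syllable is always accepted as a fallback,
            # so the loop is guaranteed to advance.
            if n == 1 or candidate in lexicon:
                words.append(candidate)
                i += n
                break
    return words


# Hypothetical romanized data, for illustration only.
lexicon = {"myanma", "myanmasa", "sa"}
result = segment_longest_match(["myan", "ma", "sa", "thin"], lexicon)
print(result)  # ['myanmasa', 'thin']
```

Unmatched fallbacks such as `thin` here correspond to the "not identified" words counted in the statistics above.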
7. Title: Myanmar Search Engine and Myanmar Web Directory Website, developed by Myanmar .Net
   Findings:
   - The Myanmar search engine was developed on top of the Google API (Application Programming Interface).
   - It is a bilingual Myanmar-English search engine.
   - For a web search query typed in the ZawGyi font, the search engine returns web pages written in the ZawGyi font.
   - For a query typed in Myanmar 3 or another Myanmar Unicode font, the search engine returns web pages written in Myanmar 3 and other Unicode fonts.
   - Details are available at http://myanmar-myanmar.com.
Problems occurring in Myanmar Search Engine
Problems or requirements | How to overcome these problems
Some web sites use different non-standard Unicode encodings. | A converting program is required.
Development of a web crawler. | The web crawler program needs to crawl dynamic Myanmar web pages written in a variety of Myanmar fonts, and to show the crawling depth or level of the website.
Indexing technique for Myanmar words. | Lucene indexing based on an analyzer; choose an indexing method for Myanmar pages (i.e. first-character index structure).
Link ranking system for the search engine. | A Lucene scoring system needs to be developed for Myanmar pages.
Sufficient financial support is needed. | MCPA established a competition for a Myanmar search engine and will give a reward to the winner.
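The crawler requirement above mentions tracking the crawling depth or level of a website. A minimal sketch of that idea is a breadth-first traversal that records the depth at which each URL is first reached. Everything here is an assumption for illustration: the `crawl` function, the `get_links` callback and the in-memory `LINKS` table, which stands in for real page fetching and link extraction.

```python
from collections import deque


def crawl(seed, get_links, max_depth):
    """Breadth-first crawl that records the depth of each discovered URL."""
    depth_of = {seed: 0}
    queue = deque([seed])
    while queue:
        url = queue.popleft()
        if depth_of[url] == max_depth:
            continue  # do not follow links beyond the depth limit
        for link in get_links(url):
            if link not in depth_of:
                depth_of[link] = depth_of[url] + 1
                queue.append(link)
    return depth_of


# Hypothetical link graph standing in for real web pages.
LINKS = {
    "http://example.org/": ["http://example.org/a", "http://example.org/b"],
    "http://example.org/a": ["http://example.org/a/1"],
    "http://example.org/a/1": ["http://example.org/a/1/x"],
}
depths = crawl("http://example.org/", lambda u: LINKS.get(u, []), max_depth=2)
print(depths["http://example.org/a/1"])  # 2
```

A real crawler would add fetching, encoding detection and politeness rules on top of this traversal.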
Proposed Myanmar Search Engine Architecture
• Software for Myanmar Search Engine

• Word Segmentation System and Font Converter

• Input Method Editor and Myanmar Unicode font

• Maintenance and Updating


Software for Myanmar Search Engine
[Figure: data is gathered from the web, the file system, the database (DB) and manual input; Lucene indexes the documents into an index; the application gets the user's query, searches the index and presents the search results to the user.]

• Proposed system architecture of the Myanmar search engine.
• Lucene, an open-source indexing library, is a good fit for the search engine.
• Lucene is a high-performance, scalable Information Retrieval (IR) library.
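The gather, index and search flow of the proposed architecture can be illustrated with a toy inverted index. This is a plain-Python stand-in and does not use or imitate Lucene's actual API. The class `TinyIndex`, the whitespace `analyze` function and the sample documents are all hypothetical.

```python
from collections import defaultdict


class TinyIndex:
    """Toy inverted index: term -> set of document ids."""

    def __init__(self, analyze):
        self.analyze = analyze
        self.postings = defaultdict(set)

    def add(self, doc_id, text):
        # Indexing step: analyze the document and record each term.
        for term in self.analyze(text):
            self.postings[term].add(doc_id)

    def search(self, query):
        # Search step: documents that contain every query term.
        terms = self.analyze(query)
        if not terms:
            return set()
        result = set(self.postings.get(terms[0], set()))
        for term in terms[1:]:
            result &= self.postings.get(term, set())
        return result


def analyze(text):
    # Placeholder analysis: lowercase and split on whitespace.
    return text.lower().split()


index = TinyIndex(analyze)
index.add(1, "Bagan temple history")
index.add(2, "Yangon travel guide")
index.add(3, "History of Yangon")
hits = sorted(index.search("yangon history"))
print(hits)  # [3]
```

For Myanmar text, whitespace splitting would have to be replaced by a word segmenter, since the script does not separate words with spaces.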
Lucene Architecture

[Figure: Lucene API components by area. Analysis: Analyzer class. Document: Field, Document. Index: Index Writer. Query: Query, Query Parser. Search: Index Searcher, Hits.]
Analyzers in Lucene

• An analyzer is used when adding documents to the index.

• An analyzer is used when retrieving documents.

• An analyzer is a collection of operations.
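The idea that an analyzer is a collection of operations, applied both when indexing and when retrieving, can be sketched as a small function pipeline. This is not Lucene's Analyzer class. The helper names `make_analyzer`, `lowercase` and `drop_stop_words`, and the English sample text, are assumptions chosen for readability.

```python
def make_analyzer(tokenize, filters):
    """Compose a tokenizer and a list of token filters into one analyzer."""
    def analyze(text):
        tokens = tokenize(text)
        for token_filter in filters:
            tokens = token_filter(tokens)
        return tokens
    return analyze


def lowercase(tokens):
    return [t.lower() for t in tokens]


def drop_stop_words(stop_words):
    # Returns a filter that removes the given stop words.
    return lambda tokens: [t for t in tokens if t not in stop_words]


analyzer = make_analyzer(str.split, [lowercase, drop_stop_words({"the", "of"})])

# The same analyzer runs at indexing time and at query time,
# so both sides produce comparable terms.
indexed_terms = analyzer("The History of Bagan")
query_terms = analyzer("history of BAGAN")
print(indexed_terms == query_terms)  # True
```

Using one analyzer on both sides is what lets a query term match an indexed term even when the raw text differs in case or filler words.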


Word Segmentation System

[Figure: the input query is checked for syntax errors, passes through 1. word segmentation, 2. part-of-speech, 3. stemming, ……, and goes to the Lucene search engine, which returns the search result.]

• Supports MS Windows, Linux and Apple Mac operating systems.

• The main functions of the system are stop-word removal, syllable breaking and word-breaking rules.
• Works as an analyzer for Lucene between the Unicode font and the operating system, for searching on the WWW.
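As a rough illustration of how stop-word removal and syllable breaking can work together, the sketch below splits a syllable sequence into word candidates at each stop word. It assumes the text is already broken into syllables and is in standard Unicode order. The function name and the three-item stop list are illustrative only and much smaller than a real list.

```python
def split_on_stop_words(syllables, stop_words):
    """Drop stop words and join the syllables between them into chunks."""
    chunks, current = [], []
    for syllable in syllables:
        if syllable in stop_words:
            if current:
                chunks.append("".join(current))
                current = []
        else:
            current.append(syllable)
    if current:
        chunks.append("".join(current))
    return chunks


# Illustrative stop words (common particles), written with escapes.
THI = "\u101E\u100A\u103A"  # sentence/subject marker
KO = "\u1000\u102D\u102F"   # object marker
I = "\u104F"                # possessive marker
STOP_WORDS = {THI, KO, I}

# "thu thi sa ko hpat thi" (he reads the text), already syllable-broken.
THU, SA, HPAT = "\u101E\u1030", "\u1005\u102C", "\u1016\u1010\u103A"
chunks = split_on_stop_words([THU, THI, SA, KO, HPAT, THI], STOP_WORDS)
print(len(chunks))  # 3
```

The remaining chunks are only word candidates; a dictionary or statistical step would still be needed to confirm them.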
Tokenization or word segmentation for indexing process
[Figure: original words, synonyms and antonyms, singular and plural words, alternative words, combined words and technical terms pass through the normalizing, stemming and lemmatization programs to produce Myanmar tokens for the search engine.]
Unicode Converting Program
[Figure: web pages using previous-version Unicode fonts (e.g. v.4.0) and web pages using non-Unicode fonts pass through the converting program to Unicode Standard encoding, and then go on to crawling and indexing.]

• All Unicode fonts are based upon the Myanmar Unicode Standard (Zawgyi-One not included).
• Some web pages use non-Unicode fonts that are based upon Latin ASCII codes.

• A Unicode converting program targeting Unicode Standard 5.2 is required.
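A converting program of this kind is often organised as an ordered table of rewrite rules. The sketch below shows that structure with a single illustrative rule and is in no way a complete converter. It assumes Zawgyi-style input, where the vowel sign E (U+1031) is stored before its consonant, whereas standard Unicode stores it after. The names `RULES` and `convert` are hypothetical.

```python
import re

E_VOWEL = "\u1031"               # MYANMAR VOWEL SIGN E
CONSONANT = "[\u1000-\u1021]"    # MYANMAR LETTER KA .. MYANMAR LETTER A

# Ordered (pattern, replacement) rules. Only one illustrative rule is
# given; a real converter needs many more (medials, stacked forms, etc.).
RULES = [
    # Move a leading vowel sign E to after the consonant it belongs to.
    (re.compile(E_VOWEL + "(" + CONSONANT + ")"), "\\1" + E_VOWEL),
]


def convert(text: str) -> str:
    """Apply each rewrite rule in order."""
    for pattern, replacement in RULES:
        text = pattern.sub(replacement, text)
    return text


zawgyi_style = "\u1031\u1000"    # vowel sign E stored before KA
converted = convert(zawgyi_style)
print(converted == "\u1000\u1031")  # True
```

Text that contains no matching sequence passes through unchanged, so the rule table can be grown and tested one rule at a time.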
Operating System and Internet Browser
The Myanmar character set needs to be supported by default in the OS and browser.
• Microsoft: Windows CE, Windows NT, Windows 2K, Windows XP and Windows .Net Server 2003
• Apple: Mac OS X 10.1, Mac OS X Server, ATSUI
• SUN
• SCO: UnixWare 7.1.0
• Linux OSs

Myanmar Unicode fonts are supported by the following Internet browsers.

• Internet Explorer 5.5, 6.0, 7.0 beta: partial support, implemented by Microsoft; embedded-fonts feature supported.

• Opera, Netscape and other browsers: partial support; fonts need to be installed.

• Mozilla Firefox: full support; language pack available (ver 1.4).
Myanmar Unicode font and Input Method Editor

Unicode Standard compliant fonts such as Myanmar3, Padauk, Parabike, Zawgyi-One.

The Linux Myanmar Pango Module font name is Masterpiece UniSan.
Analysis of Myanmar Words
1. Stemming words or newly derived words (for example big (ၾကီးေသာ) [kji], bigger (ပိုၾကီးေသာ) [pou kji], biggest (အၾကီးဆံုး) [a kji hsoun])

2. Alternative words, generic words or morphological variations of existing words (for example Osamar (အိုစမာ), Osamar Bin Lar Din (အိုစမာဗင္လာဒင္), Terrorist Leader (အၾကမ္းဖက္ေခါင္းေဆာင္))

3. Synonyms and antonyms (for example big (ၾကီးသည္) [a kji], small (ေသးငယ္သည္) [a tha])

4. Technical terms or technical words (for example a traditional medicine name (cough tablet (ေခ်ာင္းဆိုးေပ်ာင္ေဆး) [chaun: hsou: hsei:]), a traditional product name (mohinga (မုန္႔ဟင္းခါး) [mou. hin: ga:]), a traditional name for an engineering term (tape measure (ေပၾကိဳး) [pei gjou:]))

5. Combined words (for example, "cook rice" is not simply the combination of "cook" and "rice") (ထမင္းခ်က္ျခင္းသည္ ခ်က္ျပဳတ္ျခင္း ႏွင့္ ထမင္း ေပါင္းထားျခင္း မဟုတ္)

6. Loan words (for example computer (ကြန္ပ်ဴတာ) [kun pju ta], sub-committee (ဆပ္ေကာ္မတီ) [hsa _ ko ma ti], cherry (ခ်ယ္ရီ) [che ri], bureaucracy (ဗ်ဴရိုကေရစီ) [bju rou karei si], order (ေအာ္ဒါ) [o da], opera (ေအာ္ပရာ) [o para])

7. Lemmatization words (for example play (ကစားသည္) [gaza], will play (ကစားလိမ့္မည္) [gaza mji], playground (ကစားကြင္း) [gaza: gwin], game (ကစားပြဲ) [gaza pwe])
List of possible stop words
No. | Part of Speech | Example
1 | Subject personal pronouns | I (ကၽြန္ေတာ္) [kja no], we (ကၽြႏု္တို႔) [kja no do], he (သူ) [thu], she (သူမ) [thu ma], it (ထိုအရာ) [htou]
2 | Object personal pronouns | Me (ကၽြန္ေတာ္) [kjanou’ ko], us (ကၽြႏုပိတို႔ကို) [thu tou ko], him (သူကို) [thu ko], her (သူမကို) [thu ma ko]
3 | Possessive pronouns and adjectives | Mine (က်ေနာ္၏) [kjanou’ i.], your (သူ၏) [thu i.], his (သူ၏) [thu i.], her (သူမ၏) [thu ma i.]
4 | Reflexive personal pronouns | Myself (မိမိကိုယ္တိုင္), ourselves (ကိုယ့္ကိုယ့္ကို), himself (သူကိုယ္တိုင္), herself (သူမကိုယ္တိုင္), yourself (မင္းတို႔ကိုယ္တိုင္), itself (ထိုအရာပင္)
5 | Relative pronouns | Next, then (ထိုေနာက္) [htou nau], therefore, because of (ထိုေၾကာင့္) [htou gjaum]
6 | Indefinite pronouns and adjectives | Some (အခ်ိဳ႕) [a.chou], few (အနည္းငယ္) [a ne: nge], none (စိုးစဥ္မွ်)
7 | Demonstrative pronouns and adjectives | This (ဤအရာ) [i. ha], that (ထိုအရာ) [htou ha], these (ေဟာဒီ), those (ဟိုဟာ)
8 | Interrogative pronouns and questions | Who (မည္သူ) [be thu], when (ဘယ္ေနရာ) [be], how (ဘယ္လို) [be lou], what (ဘယ္လဲ) [be le]
Maintenance and Updating
• Every language has new words (slang) that are in daily use.

• The Myanmar language also has such words (Myanmar slang, or Myanmar Ban Sar Kar).

• The maintenance team will perform the following duties:
1) Regular updating of the index.
2) Checking disk space and CPU utilization.
3) Regular backups, along with other server files.
4) Test scripting to uncover any corruption or problems.
5) Rebuilds to remove obsolete data and improve efficiency.
References
1) Myanmar-Thai Co-Workshop on Myanmar Language Implementation, Application Development Efforts related to Myanmar Unicode, Mr. Ngwe Tun, Solveware.

2) Myanmar Unicode Implementation Standards, The 5th Myanmar ICT Week 2006, William W.L.K.

3) Word segmentation for the Myanmar language, Tun Thura Thet; Jin-Cheon Na; Wunna Ko Ko.

4) Statistical Analysis of Myanmar Words on the World Wide Web for Search Engine Development, Pann Yu Mon; Maung Maung Thant; Ohnmar Htun Pe; San Ko Oo; Yoshiki Mikami, Management and Information Systems Engineering Department, Nagaoka University of Technology, International University of Japan.

5) Myanmar Word Segmentation, Hla Hla Htay; Kavi Narayana Murthy, Department of Computer and Information Sciences, University of Hyderabad.

6) Myanmar Language Enabled Applications in Windows & Linux, © 2006 Myanmar NLP Research Project, collected by Ngwe Tun.

7) Lucene in Action, Erik Hatcher; Otis Gospodnetic, Manning, Greenwich.

8) Myanmar IT Professionals Forum, http://myanmaritpros.com/

9) A Rule-based Syllable Segmentation of Myanmar Text, Zin Maung Maung; Yoshiki Mikami.
