
Unicode Searching Algorithm Using Multilevel Binary Tree Applied on Bangla Unicode


T. Sobh (ed.), Innovations and Advanced Techniques in Computer and Information Sciences and Engineering, 321-326.
2007 Springer.

Abstract- A Unicode searching algorithm using a multilevel binary tree is proposed to search Unicode in an efficient way. The algorithm is applied to Bangla Unicode searching to convert a Bijoy string into a Unicode string. First, the algorithm builds a multilevel binary tree based on ASCII codes with their corresponding Unicode. The tree is built from multilevel binary sorted data containing the ASCII codes and their corresponding Unicode. The data must be sorted based on the ASCII codes. The algorithm takes a Bangla Bijoy string as its input value and outputs the same string in Unicode format. The input Bijoy string must be in Unicode readable format.

I. INTRODUCTION

Fundamentally, computers just deal with numbers. They store letters and other characters by assigning a number to each one. Before Unicode was invented, there were hundreds of different encoding systems for assigning these numbers. These encoding systems also conflict with one another. That is, two encodings can use the same number for two different characters, or use different numbers for the same character. Any given computer (especially a server) needs to support many different encodings; yet whenever data is passed between different encodings or platforms, that data always runs the risk of corruption. Unicode is changing all of this. Unicode provides a unique number for every character, no matter what the platform, no matter what the program, no matter what the language.

Bangla is the mother tongue of the Bangladeshi people and the second most popular language in India. It is a rich language, and more than 10% of the people in the world speak Bangla. But in the field of Computer Science, research on this language is not yet sufficient. There exist some Bangla writing software packages such as Bijoy, Avro, Akkhor etc. Bijoy is the most popular and oldest of these, and it is used to write plain text only. Avro uses Unicode to write Bangla sentences. Nowadays the Unicode format is used to write any language, and sometimes it is necessary to convert plain text into Unicode format. So searching Unicode is necessary. In Unicode, there are 65,535 distinct characters that cover all modern languages of the world. So Unicode searching must be efficient and reliable for all languages. Some Unicode searching algorithms, especially for Bangla, consist only of if-else conditions. Others use a hash table to develop a Unicode searching method. Here a multilevel tree based Unicode searching algorithm is proposed, which will be more efficient for Unicode searching in different languages.

II. PRELIMINARY STUDIES

A. Bijoy String to Unicode Readable Format String

There exist 11 independent characters (vowels) and 39 dependent characters (consonants) [6] in Bangla literature. There also exist some independent and dependent character symbols, called Kar and Fala respectively. These symbols must be used with a character. A large number of Complex Characters (combinations of two or more characters) exist in the Bangla language. A single Complex Character may contain two or more independent or dependent characters, which must be joined with a symbol named Hasanta (্).

In plain text all the characters and symbols may be placed independently anywhere in a sentence. Bijoy follows this rule. But Unicode maintains a unique format for using the symbols with a character. In Unicode the symbols must be used after a character with no gap between them, i.e. character + symbol. But in some cases, such as Chandrabindu or Ref, the placement is different. Chandrabindu must be used directly after a character if no symbol exists with that character, i.e. character + Chandrabindu. If any symbol exists with a character, then the Chandrabindu is used after the symbol, i.e. character + symbol + Chandrabindu. Ref must be placed before the character.

Figure 1 shows some examples of Bijoy plain text and its Unicode readable format for Bangla sentences. Figure 1(a) contains a complex character that is formed from two Bangla dependent characters (consonants).
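The placement rules of Section II.A (character + symbol + Chandrabindu, with Ref placed before the character) can be illustrated with a minimal Python sketch. It assumes the caller has already identified the base character, an optional symbol, and whether Chandrabindu or Ref is present. The function name `unicode_order` and its parameters are hypothetical, and writing Ref as Ra followed by Hasanta reflects the usual Unicode encoding of Bangla rather than anything stated in this paper.

```python
# Bangla code points used in this sketch.
CHANDRABINDU = "\u0981"  # Chandrabindu sign
RA = "\u09B0"            # letter Ra
HASANTA = "\u09CD"       # Hasanta (virama) sign


def unicode_order(base, symbol="", chandrabindu=False, ref=False):
    """Assemble one character cluster in Unicode order (hypothetical helper)."""
    parts = []
    if ref:
        # Ref goes before the character; assumed here to be Ra + Hasanta.
        parts.append(RA + HASANTA)
    parts.append(base)    # the character itself
    parts.append(symbol)  # character + symbol, with no gap between them
    if chandrabindu:
        # Chandrabindu follows the symbol if there is one, else the character.
        parts.append(CHANDRABINDU)
    return "".join(parts)


ka = "\u0995"      # letter Ka
aa_kar = "\u09BE"  # vowel sign Aa (a Kar)
print(unicode_order(ka, aa_kar, chandrabindu=True))
```

The sketch only fixes the order of the pieces; mapping Bijoy ASCII codes to these code points is the job of the search algorithm described later.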

Fig. 1. Bijoy plain text and its Unicode readable formatted text (panels a, b and c), showing the ASCII and Unicode values of the corresponding characters.

B. Multi Level Binary Sort

Binary search is a well known and efficient search algorithm in the field of Computer Science. To apply this algorithm to a list, the list must be sorted in increasing or decreasing order. There exist many sorting algorithms that sort data in non-increasing or non-decreasing order. Here the term Binary Sort is used because the method follows the technique of the Binary Search algorithm and rearranges the data into an in-order tree format, where the first value is the middle value of the data list, the second value is the middle value of the first half of the previous data list, and so on.

Fig. 2. Binary Sorted data generated from a simple non-decreasing sorted list.

Figure 2 shows the Binary Sorted data generated from a non-decreasing sorted list. In the Binary Sorted data, the first value, 8, is the middle value of the Simple sorted data. The second value, 4, is the middle value of the first division (1 to 7), and the third value, 2, is the middle value of the first division (1 to 3) of the previous first division of the Simple sorted data.
The term Multi Level is used with the term Binary Sort because this method is applied to multilevel sorted data. Figure 3 shows the formation of Multi Level Binary Sorted data.

Fig. 3. Three level Binary Sorted data generation from three level non-decreasing sorted data.

The first level of the Simple non-decreasing data has 5 twos and 3 fives, and each group is treated as a single integer. So, applying Binary Sort on the basis of the first level, the sequence of 2 and 5 remains unchanged. Now come to the second level. Here there are 2 threes under the common value 2 of the first level, and they are treated as a single three. Applying Binary Sort to the values under the common two and five of the first level rearranges those values, while the sequence of the threes of the second level remains unchanged. In the third level the sort is applied to the values under the common three of the second level. At the end of the process the three level non-decreasing data takes the last format shown in figure 3.
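A minimal Python sketch of Binary Sort and its multilevel extension may make the rearrangement concrete. The function names `binary_order` and `multilevel_binary_sort`, the tuple representation of multilevel data, and the sample rows are assumptions made for illustration; they are not the data of Figure 3. For even-length lists the sketch takes the lower middle element, which is an assumption chosen so that the two-value sequence 2, 5 stays unchanged as described above.

```python
def binary_order(values):
    """Rearrange a sorted list: middle first, then left half, then right half."""
    if not values:
        return []
    mid = (len(values) - 1) // 2  # lower middle for even lengths (assumption)
    return ([values[mid]]
            + binary_order(values[:mid])
            + binary_order(values[mid + 1:]))


def multilevel_binary_sort(rows, level=0):
    """Apply binary_order level by level to tuples of per-level keys."""
    finished = [row for row in rows if len(row) <= level]  # no key at this level
    groups = {}
    for row in rows:
        if len(row) > level:
            # Equal keys on one level are treated as a single value.
            groups.setdefault(row[level], []).append(row)
    result = list(finished)
    for key in binary_order(sorted(groups)):
        result.extend(multilevel_binary_sort(groups[key], level + 1))
    return result


# Single level: the list 1..15 starts with 8, 4, 2 as in Figure 2.
print(binary_order(list(range(1, 16))))
# [8, 4, 2, 1, 3, 6, 5, 7, 12, 10, 9, 11, 14, 13, 15]

# Hypothetical three level data: 5 twos and 3 fives on the first level.
rows = [(2, 1, 7), (2, 3, 4), (2, 3, 9), (2, 6, 1), (2, 8, 5),
        (5, 2, 2), (5, 4, 6), (5, 9, 3)]
print(multilevel_binary_sort(rows))
```

Inserting the rearranged rows one after another into an initially empty binary search tree yields a balanced tree on each level, which is the reason for sorting the data this way before building.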

Now if the multilevel Binary Sort is applied to the data of the first table in figure 4, based on the ASCII value, the result will be the same as the data of the second table in figure 4.

Fig. 4. Multilevel Binary Sort applied to the ASCII based Unicode data list.

C. Binary Tree and Binary Search Tree

A binary tree is a basic architecture in which each node has up to two children of its own. A Binary Search tree is organized as an in-order binary tree. The tree may be represented by a linked data structure in which each node is an object. Figure 5 shows a graphical representation of a complete binary tree based on the Binary Sorted (in-order) data shown in figure 2.

Fig. 5. A complete binary tree structure

A binary search tree has the following property. If x is a node in a binary search tree and y is a left child of x, then y < x. If z is a right child of x, then z > x or z = x. The time complexity of searching a binary search tree is O(h), where h is the height of the tree. In figure 5 the tree has height 4.

III. MULTILEVEL BINARY TREE

A binary search tree can be defined in two ways, based on the two conditions z > x and z >= x [here z and x denote the right and root nodes respectively]. But in the Multilevel Binary Tree structure only z > x is considered, because all equal values within a single level of the Multilevel Binary Tree are treated as a single node. If a desired value is found in any node of a level, control transfers to that node's branch node, which holds the root of the next level tree of that node. The same condition is true for each node of a level of the tree.
Suppose T is a tree. Then T is called a Multilevel Binary tree if each node N of T has the following properties:

• N must have two leaf nodes and one branch node. The leaf nodes may hold sub trees of N's own level, and the branch node must hold the next level tree of that node.
• The value of N must be greater than every value in its left sub tree (L) and must be less than every value in its right sub tree (R), i.e. N > L and N < R.
• If any value V is equal to N at that level, then V's corresponding values form the next level tree with the same properties.

Figure 6 shows the 3 level Binary tree formed from the values of the 3 level Binary Sorted data. The first box indicates the first level tree, the second box the second level tree, and the third box the third level tree. In the tree, 97 and 170 are not repeated, in order to maintain the properties of the Multi Level Binary Search Tree (MLBST).

Fig. 6. Three level Binary Sorted Tree


A. Algorithm
First, a MLBST_Node class is declared, which holds the elements and links of a node.
The class MLBST_Node contains two variables that hold the searched Unicode data, followed by three MLBST_Node type objects that represent the two leaf nodes and a branch node.
The Buield_MLBST method builds the MLBST (Multilevel Binary Sorted Tree, i.e. in-order tree) from the Multilevel Binary Sorted data.
----------------------------------------------------------------------
Declaring MLBST_Node Class
Start:
    AsciiCode := 0 as an integer value.
    UniCode := Null as a string value.
    LeftNode := Null as a MLBST_Node type
    RightNode := Null as a MLBST_Node type
    BranchNode := Null as a MLBST_Node type
End:
----------------------------------------------------------------------
Declaring Buield_MLBST method with Five Parameters (
    Parameter1: Integer array containing the ASCII codes of a single character (simple/complex).
    Parameter2: Integer value containing the length of the array.
    Parameter3: Integer value indicating the point of that array from where the ASCII codes will be used.
    Parameter4: String value containing the corresponding Unicode.
    Parameter5: MLBST_Node type value indicating the Parent Node.
)
Start:
    If Parameter3 is greater than or equal to Parameter2 then return TRUE.
    If Parameter5 is an empty node (newly created, no AsciiCode assigned yet) then
    Start:
        AsciiCode of Parameter5 := value of Parameter1 at the position of Parameter3.
        If Parameter3 is equal to Parameter2 minus One (the last ASCII code) then
        Start:
            UniCode of Parameter5 := value of Parameter4
        End:
        Create a BranchNode of Parameter5.
        Increment Parameter3 by One.
        Call Buield_MLBST method with the BranchNode of Parameter5 and return.
    End:
    If the value of Parameter1 at the position of Parameter3 is greater than AsciiCode of Parameter5 then
    Start:
        If RightNode of Parameter5 is NULL then
        Start:
            Create a RightNode of Parameter5
        End:
        Call Buield_MLBST method with the RightNode of Parameter5 and return.
    End:
    Else If the value of Parameter1 at the position of Parameter3 is less than AsciiCode of Parameter5 then
    Start:
        If LeftNode of Parameter5 is NULL then
        Start:
            Create a LeftNode of Parameter5
        End:
        Call Buield_MLBST method with the LeftNode of Parameter5 and return.
    End:
    Else If the value of Parameter1 at the position of Parameter3 is equal to AsciiCode of Parameter5 then
    Start:
        If BranchNode of Parameter5 is NULL then
        Start:
            Create a BranchNode of Parameter5
        End:
        Increment Parameter3 by One.
        Call Buield_MLBST method with the BranchNode of Parameter5 and return.
    End:
End:
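The node layout and the build steps above can be sketched in Python as follows. This is an illustrative restructuring under stated assumptions, not the authors' implementation: `MLBSTNode`, `insert` and `lookup` are hypothetical names, `insert` returns the root of the sub tree instead of filling in an empty parent node, `lookup` is a simplified exact-match helper added only so that the built tree can be checked, and the ASCII sequences and the Unicode strings paired with them are invented sample data rather than real Bijoy mappings.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class MLBSTNode:
    ascii_code: int
    unicode_text: Optional[str] = None     # set when a code sequence ends here
    left: Optional["MLBSTNode"] = None     # smaller codes on the same level
    right: Optional["MLBSTNode"] = None    # larger codes on the same level
    branch: Optional["MLBSTNode"] = None   # root of the next level tree


def insert(node, codes, text, pos=0):
    """Insert codes[pos:] below node and return the root of that level tree."""
    if node is None:
        node = MLBSTNode(codes[pos])
    if codes[pos] > node.ascii_code:
        node.right = insert(node.right, codes, text, pos)
    elif codes[pos] < node.ascii_code:
        node.left = insert(node.left, codes, text, pos)
    elif pos == len(codes) - 1:
        node.unicode_text = text           # last ASCII code: store the Unicode
    else:
        node.branch = insert(node.branch, codes, text, pos + 1)
    return node


def lookup(node, codes):
    """Return the Unicode stored for an exact code sequence, or None."""
    pos = 0
    while node is not None:
        if codes[pos] > node.ascii_code:
            node = node.right
        elif codes[pos] < node.ascii_code:
            node = node.left
        elif pos == len(codes) - 1:
            return node.unicode_text
        else:
            node, pos = node.branch, pos + 1  # descend to the next level tree
    return None


# Hypothetical entries, already in multilevel binary sorted order.
entries = [((100,), "\u0995"), ((97,), "\u0996"),
           ((97, 170), "\u0997"), ((120,), "\u0998")]
root = None
for codes, text in entries:
    root = insert(root, codes, text)

print(lookup(root, (97, 170)) == "\u0997")
```

Because the entries arrive in multilevel binary sorted order, each level tree stays balanced without any rebalancing step.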
----------------------------------------------------------------------
Declare UniCodeTemp as a string type global variable.
Declare StartIndex as an integer type global variable.
----------------------------------------------------------------------
Declaring Search_MLBST method with Four Parameters (
    Parameter1: Integer array containing the ASCII codes of a single character (simple/complex).
    Parameter2: Integer value containing the length of the array.
    Parameter3: Integer value indicating the point of that array from where the ASCII codes will be used.
    Parameter4: MLBST_Node type value indicating the Parent Node.
)
Start:
    If Parameter3 is greater than or equal to Parameter2 then return the UniCode of that Node.

    If the value of Parameter1 at the position of Parameter3 is greater than AsciiCode of Parameter4 then
    Start:
        If RightNode of Parameter4 is NULL then return FALSE.
        Else
        Call Search_MLBST method with the RightNode of Parameter4 and return.
    End:
    Else If the value of Parameter1 at the position of Parameter3 is less than AsciiCode of Parameter4 then
    Start:
        If LeftNode of Parameter4 is NULL then return FALSE.
        Else
        Call Search_MLBST method with the LeftNode of Parameter4 and return.
    End:
    Else If the value of Parameter1 at the position of Parameter3 is equal to AsciiCode of Parameter4 then
    Start:
        UniCodeTemp := UniCode of Parameter4.
        Increment StartIndex by One.
        Increment Parameter3 by One.
        Call Search_MLBST method with the BranchNode of Parameter4 and return.
    End:
End:
----------------------------------------------------------------------

IV. COMPLEXITY ANALYSIS

A. Build Tree
MLBST is like a binary tree, so each node of this tree has two child nodes (the branch node holds its next level tree, so the branch node is not counted). Let the height of each level tree of the MLBST be h. Then each level tree holds $(2^h - 1)$ nodes.
Each node has a branch node that holds the root of its next level tree. If the MLBST has two level trees, then the total number of nodes becomes $2^h \cdot 2^h = 2^{2h}$. Here the $-1$ is omitted for large values of h. If the MLBST has n level trees, then the total number of nodes will be $2^{nh}$ and the complexity will be $O(2^{nh})$.

B. Search Tree
Let the height of each level tree be h and let the MLBST have n level trees. In the worst case, the complexity of searching for a desired data item becomes $O(nh)$.
The best case occurs when the searched data exists in the root of the first level tree.

C. Complexity of MLBST applied on Bangla Unicode
In Bangla the total number of characters is approximately 300, including characters (vowels, consonants, complex characters) and symbols. Again, for a character the ASCII code level is not larger than 4. Figure 7 shows a Bangla complex character having three level ASCII values and its corresponding Unicode.

Fig. 7. Bangla complex character having three level ASCII.

Considering the ASCII levels, the MLBST has 4 level trees, i.e. $n = 4$.
Let the first level tree hold 300 nodes with the first level ASCII values. Each second level tree contains not more than 6 nodes of second level ASCII values (consider 10 rather than 6). The third and fourth level trees have a small number of nodes with the 3rd and 4th level ASCII values. Let that number be 4.
So, for the first level tree, $300 = 2^h - 1$. Taking the logarithm of both sides, the equation becomes $\log 301 = h \log 2$, i.e. $h = 8.233$. So for the 1st level tree $h_1 = 9$ (ceiling), i.e. the height of the first level tree is 9. Similarly, for the 2nd level tree $h_2 = 4$ (ceiling), and for the 3rd and 4th level trees $h_3 = h_4 = 3$ (ceiling). So the total number of nodes of the MLBST becomes

$(2^9 - 1)(2^4 - 1)(2^3 - 1)(2^3 - 1)$.

Omitting the $-1$ terms, the tree building complexity becomes $O(2^{19})$ and the searching complexity becomes $O(19)$.
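The arithmetic of this estimate can be reproduced with a short Python check. The helper name `level_height` is hypothetical; the node counts 300, 10, 4 and 4 are the per-level figures assumed in the text above.

```python
import math


def level_height(node_count):
    """Smallest h such that a tree of height h holds node_count nodes (2**h - 1 >= node_count)."""
    # 2**h - 1 >= c is the same as 2**h > c, which is c.bit_length() for c >= 1.
    return node_count.bit_length()


node_counts = [300, 10, 4, 4]  # nodes per level tree, as assumed in the text
heights = [level_height(count) for count in node_counts]
total_nodes = math.prod(2 ** h - 1 for h in heights)

print(heights)            # [9, 4, 3, 3]
print(sum(heights))       # 19, the worst-case search bound
print(total_nodes)        # 375585
print(2 ** sum(heights))  # 524288, the bound after omitting the -1 terms
```

The exact product of the four factors is smaller than $2^{19}$, so dropping the $-1$ terms gives an upper bound rather than an exact node count.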


D. Representation of total number of nodes of a MLBST
Let the MLBST have n level trees. If the level trees have different heights, then the total number of nodes of the MLBST can be represented as

$$N = (2^{h_1} - 1)(2^{h_2} - 1)(2^{h_3} - 1) \cdots (2^{h_n} - 1)$$

Here N indicates the total number of nodes, $h_1$ is the height of the first level tree, $h_2$ is the height of each second level tree, and so on. The equation can also be represented as

$$N = 2^{\sum_{b=1}^{n} h_b} - \sum_{a=1}^{n} 2^{\left(\sum_{b=1}^{n} h_b\right) - h_a} + \sum_{a=1}^{n-1} \sum_{m=a+1}^{n} 2^{\left(\sum_{b=1}^{n} h_b\right) - (h_a + h_m)} - \sum_{a=1}^{n-2} \sum_{m=a+1}^{n-1} \sum_{k=m+1}^{n} 2^{\left(\sum_{b=1}^{n} h_b\right) - (h_a + h_m + h_k)} + \ldots \pm \sum_{a=1}^{n} 2^{h_a} \mp 1$$

The last two terms carry the signs (+/-) together. This indicates that a term in an even position has the sign - (negative) and a term in an odd position has the sign + (positive).

V. CONCLUSION

The algorithm is applied to Bangla Unicode, but it may also be applied to related problems. It can also be used for Unicode searching in various other languages. The main drawback of the algorithm is that it needs Multilevel Binary Sorted data to build the MLBST and its tree building complexity is $O(2^{nh})$, but the MLBST builds its tree only once, the first time.

ACKNOWLEDGEMENT

First of all, I would like to express my thanks to almighty Allah, who created us and gave us the power to live in this world. Thanks also go to my respected parents and my younger brother, and to the honorable teachers of IIUC.

REFERENCES

[1] Thomas H. Cormen, Charles E. Leiserson, Ronald L. Rivest, Clifford Stein, Introduction to Algorithms (Second Edition).
[2] http://www.acm.uiuc.edu/conference/index.php
[3] http://www.jorendorff.com/articles/index.html
[4] http://www.unicode.org/unicode/reports/tr10/tr108.html
[5] http://www.connect-bangladesh.org/bangla/webbangla.html
[6] http://www.betelco.com/bd/bangla/bangla.html
