
INFORMATION THEORY

Željko Jeričević, dr. sc.


Department of Computer Science, Faculty of Engineering &
Department of Biology and Medical Genetics, Faculty of Medicine
51000 Rijeka, Croatia
Phone: (+385) 51-651 594
E-mail: zeljko.jericevic@riteh.hr
http://www.riteh.uniri.hr/~zeljkoj/Zeljko_Jericevic.html

Information theory
From the material so far we know that information has to be prepared
before it is sent through a channel. This is done by transforming the
information into a form whose entropy is close to the maximum, which
brings the transmission efficiency close to the maximum. It can be
achieved by lossless compression, e.g. arithmetic coding.
The second transformation concerns transmission reliability: the
information is translated into a form in which automatic correction
of a certain type of errors is possible (e.g. Hamming coding).
10 February 2012

zeljko.jericevic@riteh.hr

Compression (sažimanje)


Entropy coding:
Kraft's inequality (in Huffman & Shannon-Fano)

1.4.1 The Kraft inequality


We shall prove the existence of efficient source codes by actually constructing some
codes that are important in applications. However, getting to these results requires some
intermediate steps.
A binary variable-length source code is described as a mapping from the source
alphabet A to a set of finite strings, C, from the binary code alphabet, which we always
denote {0, 1}. Since we allow the strings in the code to have different lengths, it is
important that we can carry out the reverse mapping in a unique way. A simple way of
ensuring this property is to use a prefix code, a set of strings chosen in such a way that
no string is also the beginning (prefix) of another string. Thus, when the current string
belongs to C, we know that we have reached the end, and we can start processing the
following symbols as a new code string. In Example 1.5 an example of a simple prefix
code is given.
If c_i is a string in C and l(c_i) its length in binary symbols, the expected length of the
source code per source symbol is

L(C) = \sum_{i=1}^{N} P(c_i)\, l(c_i).
If the set of lengths of the code is {l(c_i)}, any prefix code must satisfy the following
important condition, known as the Kraft inequality:

\sum_i 2^{-l(c_i)} \le 1.   (1.10)
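A quick numeric check of inequality (1.10), e.g. in Python:

```python
# Kraft inequality: sum of 2^-l over all codeword lengths must not exceed 1.
def kraft_sum(lengths):
    return sum(2 ** -l for l in lengths)

# The prefix code {0, 10, 110, 111} of Example 1.5 has lengths 1, 2, 3, 3:
assert kraft_sum([1, 2, 3, 3]) == 1.0   # satisfied with equality
# Lengths that no prefix code can realize:
assert kraft_sum([1, 1, 2]) > 1
```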


Entropy coding:
Kraft's inequality
1.4.1 The Kraft inequality
The code can be described as a binary search tree: starting from the root, two branches
are labelled 0 and 1, and each node is either a leaf that corresponds to the end of a string,
or a node that can be assumed to have two continuing branches. Let l_m be the maximal
length of a string. If a string has length l(c), it follows from the prefix condition that
none of the 2^{l_m - l(c)} extensions of this string are in the code. Also, two extensions of
different code strings are never equal, since this would violate the prefix condition. Thus
by summing over all codewords we get

\sum_i 2^{l_m - l(c_i)} \le 2^{l_m}
and the inequality follows. It may further be proven that any uniquely decodable code
must satisfy (1.10) and that if this is the case there exists a prefix code with the
same set of code lengths. Thus restriction to prefix codes imposes no loss in coding
performance.
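The converse direction can also be made concrete: given lengths that satisfy (1.10), a prefix code with exactly those lengths can be built greedily, shortest first. A sketch:

```python
# Given lengths satisfying the Kraft inequality, assign codewords in order
# of increasing length: each codeword is the running sum of 2^-l over the
# earlier words, written out as an l-bit binary fraction.
def prefix_code(lengths):
    codes, acc = [], 0.0
    for l in sorted(lengths):
        codes.append(format(int(acc * 2 ** l), "0{}b".format(l)))
        acc += 2 ** -l
    return codes

code = prefix_code([1, 2, 3, 3])
assert code == ["0", "10", "110", "111"]
# No codeword is a prefix of another:
assert all(not b.startswith(a) for a in code for b in code if a != b)
```

This reproduces the code of Example 1.5 from its length profile alone.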

Entropy coding:
Kraft's inequality

1.4.1 The Kraft inequality


Example 1.5 (A simple code). The code {0, 10, 110, 111} is a prefix code for an
alphabet of four symbols. If the probability distribution of the source is
(1/2, 1/4, 1/8, 1/8), the average length of the code strings is
1 \cdot 1/2 + 2 \cdot 1/4 + 3 \cdot 1/8 + 3 \cdot 1/8 = 7/4, which is
also the entropy of the source.
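The arithmetic of Example 1.5 can be verified directly:

```python
from math import log2

# Example 1.5: probabilities (1/2, 1/4, 1/8, 1/8), code lengths (1, 2, 3, 3).
p = [0.5, 0.25, 0.125, 0.125]
l = [1, 2, 3, 3]

L = sum(pi * li for pi, li in zip(p, l))     # expected code length
H = -sum(pi * log2(pi) for pi in p)          # source entropy
assert L == H == 1.75                        # both equal 7/4
```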

Entropy coding:
Kraft's inequality

1.4.1 The Kraft inequality


If all the numbers -\log P(c_i) were integers, we could choose these as the lengths
l(c_i). In this way the Kraft inequality would be satisfied with equality, and furthermore

L = \sum_i P(c_i)\, l(c_i) = -\sum_i P(c_i) \log P(c_i) = H(X)

and thus the expected code length would equal the entropy. Such a case is shown in
Example 1.5. However, in general we have to select code strings that only approximate
the optimal values. If we round -\log P(c_i) up to the nearest integer, \lceil -\log P(c_i) \rceil,
the lengths satisfy the Kraft inequality, and by summing we get an upper bound on the
code lengths

l(c_i) = \lceil -\log P(c_i) \rceil \le -\log P(c_i) + 1.   (1.11)

The difference between the entropy and the average code length may be evaluated
from

H(X) - L = \sum_i P(c_i) \left( -\log P(c_i) - l_i \right) = \sum_i P(c_i) \log \frac{2^{-l_i}}{P(c_i)} \le \log \sum_i 2^{-l_i} \le 0,

where the inequalities are those established by Jensen and Kraft, respectively. This gives

H(X) \le L \le H(X) + 1,   (1.12)

where the right-hand side is obtained by taking the average of (1.11).
The loss due to the integer rounding may give a disappointing result when the coding is
done on single source symbols. However, if we apply the result to strings of N symbols,
we find an expected code length of at most NH + 1, and the result per source symbol
becomes at most H + 1/N. Thus, for sources with independent symbols, we can get an
expected code length close to the entropy by encoding sufficiently long strings of source
symbols.
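The rounding rule (1.11) and the bound (1.12) are easy to exercise numerically; the distribution below is the one used in the arithmetic-coding example later in the deck:

```python
from math import ceil, log2

# Shannon code lengths l_i = ceil(-log2 p_i) satisfy the Kraft inequality
# and give H <= L <= H + 1, as in (1.11)-(1.12).
p = [0.6, 0.2, 0.1, 0.1]
l = [ceil(-log2(pi)) for pi in p]          # -> [1, 3, 4, 4]
assert sum(2 ** -li for li in l) <= 1      # Kraft holds

H = -sum(pi * log2(pi) for pi in p)
L = sum(pi * li for pi, li in zip(p, l))
assert H <= L <= H + 1                     # bound (1.12)
```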


Arithmetic coding
Suppose we want to send a message built from 3 letters,
A, B & C, with equal probabilities of occurrence.
Using 2 bits per symbol is inefficient: one of the bit
combinations would never be used.
A better idea is to use real numbers between 0 & 1 in the
base-3 number system, where each digit represents one
symbol.
For example, the sequence ABBCAB becomes 0.011201 (with
A=0, B=1, C=2)

Arithmetic coding
Converting the real number 0.011201 from base 3 to
binary gives 0.001011001
Using 2 bits per symbol requires 12 bits for the
sequence ABBCAB, while the binary representation of 0.011201
(base 3) requires 9 bits, a saving of 25%.
The method relies on efficient in-place algorithms
for converting numbers from one base to another
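The mapping and base conversion described above can be sketched in a few lines (the 9-bit truncation mirrors the slide; the helper names are ad hoc):

```python
# Map the message to a base-3 fraction (A=0, B=1, C=2), then re-express
# that fraction in binary, as the slides do with bc.
digits = {"A": 0, "B": 1, "C": 2}

def to_fraction(msg):
    # 0.d1 d2 d3 ... interpreted in base 3
    return sum(digits[s] * 3 ** -(i + 1) for i, s in enumerate(msg))

def to_binary(x, bits):
    out = []
    for _ in range(bits):   # peel off one binary digit per doubling
        x *= 2
        out.append(str(int(x)))
        x -= int(x)
    return "0." + "".join(out)

x = to_fraction("ABBCAB")              # 0.011201 in base 3
assert to_binary(x, 9) == "0.001011001"
```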

Fast conversion from one base to another


The Linux/Unix bc program
Examples:
echo "ibase=2; 0.1" | bc
.5
echo "ibase=3; 0.1000000" | bc
.3333333
echo "ibase=3; obase=2; 0.011201" | bc
.00101100100110010001
echo "ibase=2; obase=3; .001011001" | bc
.0112002011101011210, which rounds to .011201

Arithmetic decoding
With arithmetic coding we can get a result close to the
optimum (the optimum is -log2 p bits for each symbol of
probability p).
An example with four symbols, the arithmetic code 0.538, and
the following probability distribution (D marks the end of the
message):

Symbol       A    B    C    D
Probability  0.6  0.2  0.1  0.1

The arithmetic code of the sequence is 0.538 (ACD)


First step: divide the initial interval [0, 1) into subintervals
proportional to the probabilities:

Symbol    A         B           C           D
Interval  [0, 0.6)  [0.6, 0.8)  [0.8, 0.9)  [0.9, 1)

0.538 falls into the first interval (symbol A)


The arithmetic code of the sequence is 0.538 (ACD)


Second step: divide the interval [0, 0.6) chosen in the first step
into subintervals proportional to the probabilities:

Symbol    A          B             C             D
Interval  [0, 0.36)  [0.36, 0.48)  [0.48, 0.54)  [0.54, 0.6)

0.538 falls into the third subinterval (symbol C)


The arithmetic code of the sequence is 0.538 (ACD)


Third step: divide the interval [0.48, 0.54) chosen in the second
step into subintervals proportional to the probabilities:

Symbol    A              B               C               D
Interval  [0.48, 0.516)  [0.516, 0.528)  [0.528, 0.534)  [0.534, 0.54)

0.538 falls into the fourth subinterval (symbol D, which is
also the end-of-message symbol)
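The three decoding steps above can be rolled into one loop; the probabilities are those from the table (D terminates the message):

```python
# Decode the arithmetic code 0.538 step by step: repeatedly find the
# subinterval containing the code value, rescale, and stop at symbol D.
probs = [("A", 0.6), ("B", 0.2), ("C", 0.1), ("D", 0.1)]

def decode(value):
    low, high, out = 0.0, 1.0, ""
    while True:
        span = high - low
        cum = low
        for sym, p in probs:
            if cum <= value < cum + p * span:
                out += sym
                low, high = cum, cum + p * span
                break
            cum += p * span
        if out[-1] == "D":          # D marks the end of the message
            return out

assert decode(0.538) == "ACD"
```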

The arithmetic code of the sequence is 0.538 (ACD)


Graphical view of arithmetic decoding (figure)


The arithmetic code of the sequence is 0.538 (ACD)


(Non-)uniqueness: the same sequence could also be represented as
0.534, 0.535, 0.536, 0.537 or 0.539. Using decimal
instead of binary digits introduces inefficiency.
The information content of three decimal digits is about 9.966
bits (why?)
The same message can be encoded in binary as 0.10001010,
which corresponds to 0.5390625 in decimal and requires 8 bits.
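The claim about the 8-bit binary code is easy to verify:

```python
# 0.10001010 in binary is 0.5390625 and lies in the final interval
# [0.534, 0.54) for ACD, so these 8 bits decode to the same message.
bits = "10001010"
value = sum(int(b) * 2 ** -(i + 1) for i, b in enumerate(bits))
assert value == 0.5390625
assert 0.534 <= value < 0.54
```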


The arithmetic code of the sequence is 0.538 (ACD)


8 bits is more than the actual entropy of the message (about 1.58
bits per symbol) because the message is short and the assumed
distribution is wrong. If the actual distribution of the symbols in
the message is taken into account, the message can be encoded using
the intervals [0, 1/3); [1/9, 2/9); [5/27, 6/27); i.e. the binary
interval [0.00101111..., 0.00111000...).
The result of the encoding is the message 111, i.e. 3 bits
Correct message statistics are crucial for coding
efficiency!

Arithmetic coding
Iterative decoding of a message


Arithmetic coding
Iterative encoding of a message


Arithmetic coding
Two symbols with occurrence probabilities px=2/3
& py=1/3


Arithmetic coding
Three symbols with occurrence probabilities px=2/3 &
py=1/3


Arithmetic coding

