
Principles of Data Compression:

Theory and Applications

Dr. Daniel Leon-Salas

Motivation
The Information Revolution


Motivation
Consider a 3-minute song:
assuming two channels, 16-bit resolution, and a
sampling rate of 48 kHz, it will take about 33 MB of disk
space to store the song.

Consider a 5-megapixel camera:
assuming an 8-bit resolution per pixel, it will take
5 MB of disk space to store one picture.

One second of video in the CCIR 601
format (720 × 485) needs more than 30
megabytes of storage space.

Introduction
If data generation is growing at an explosive
rate, why not focus on improving transmission
and storage technologies?
Transmission and storage technologies are
improving, but not at the same rate at which
data is generated.
This is especially true for wireless
communications where the radio spectrum is
limited.

Introduction
Data compression is the art or science of
representing information in a compact form.
Data compression is performed by identifying
and exploiting structure and redundancies in
the data.
Data can be audio samples, images, or text
files; it can also be generated by sensors,
scientific instruments, social networks,
markets, etc.

Introduction
Consider Morse code, developed in the 19th
century, in which letters are encoded with dots
and dashes:
some letters (e and a) occur more often than others (q
and j).
letters that occur more frequently are encoded using
shorter sequences: e → "." and a → ".-"
letters that occur less frequently are encoded using
longer sequences: q → "--.-" and j → ".---"

In this case the statistical structure of the data
was exploited.

Introduction
There are many other types of structure in
data that can be exploited to achieve
compression.
In speech, the physical structure of our vocal
tract determines the kinds of sounds that we
can produce. Instead of sending speech
samples, we can send information about the
vocal tract to the receiver.
We can also exploit characteristics of the end
user of the data.

Introduction
In many cases, when transmitting images or
audio, the end user is a human.
Humans have limited hearing and vision
abilities.
We can exploit the limitations of human
perception to discard irrelevant information
and obtain higher compression.


Compression and Reconstruction


[Diagram: the original data is mapped by the compression step into a
compressed representation; the reconstruction (decompression) step recovers
the reconstructed data from it. The two steps together form the compression
algorithm.]

Lossless Compression
Lossless compression involves no loss of
information.
The recovered data is an exact copy of the
original.
Useful in applications that cannot tolerate any
difference:
medical images
scientific data
financial records
computer programs

Lossy Compression
In lossy compression some loss of information is
tolerated.
The original data cannot be recovered exactly, but
higher compression ratios can be achieved.
Useful in applications where some loss of
information is not critical:
speech coding
telephone communications
video coding
digital photography

Compression Performance
Compression ratio (CR):

CR = (# bits required to represent the data without compression) /
     (# bits required to represent the data with compression)

Rate: average number of bits per sample or symbol

Distortion (for lossy compression):

MSE = (1/N) Σ (x_i - x̂_i)^2

PSNR (dB) = 10 log10( x_max^2 / MSE )

where x_i are the original samples, x̂_i the reconstructed samples, and
x_max is the peak value of the signal.
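As a quick illustration of these metrics, here is a minimal sketch (not from the slides; the sample arrays and the 8-bit peak value of 255 are assumptions) that computes MSE and PSNR:

```python
import numpy as np

def mse(original, reconstructed):
    # Mean squared error between the original and reconstructed samples
    original = np.asarray(original, dtype=np.float64)
    reconstructed = np.asarray(reconstructed, dtype=np.float64)
    return np.mean((original - reconstructed) ** 2)

def psnr(original, reconstructed, peak=255.0):
    # Peak signal-to-noise ratio in dB; peak=255 assumes 8-bit samples
    return 10.0 * np.log10(peak ** 2 / mse(original, reconstructed))

x  = np.array([52, 55, 61, 66, 70, 61, 64, 73])   # original samples (made up)
xr = np.array([52, 54, 61, 67, 70, 60, 65, 72])   # reconstructed samples (made up)
print(mse(x, xr), psnr(x, xr))
```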


Example 1
Let's consider the following input sequence:
x = [9, 11, 11, 11, 14, 13, 15, 17, 16, 17, 20, 21]

To encode this sequence using a plain binary code, we would need to use 5
bits per number and a total of 60 bits.
K. Sayood, Introduction to Data Compression, 2nd edition, Morgan Kaufmann


Example 1
If we use the model:
x̂_n = n + 8

and compute the residual e_n = x_n - x̂_n = [0, 1, 0, -1, 1, -1, 0, 1, -1, -1, 1, 1]

the residual consists of only three values {-1, 0, 1}, which can be
encoded using 2 bits per number, for a total of 24 bits.
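A minimal sketch of this residual computation (the model x̂_n = n + 8 is the one above; the particular 2-bit code assignment is an assumption for illustration):

```python
x = [9, 11, 11, 11, 14, 13, 15, 17, 16, 17, 20, 21]

# Model prediction xhat_n = n + 8, with n starting at 1
prediction = [n + 8 for n in range(1, len(x) + 1)]

# Residual e_n = x_n - xhat_n; only the values -1, 0, 1 appear
residual = [xn - pn for xn, pn in zip(x, prediction)]
print(residual)            # [0, 1, 0, -1, 1, -1, 0, 1, -1, -1, 1, 1]

# A fixed 2-bit code for the three residual values (illustrative assignment)
code = {-1: '00', 0: '01', 1: '10'}
bits = ''.join(code[e] for e in residual)
print(len(bits))           # 24 bits instead of 60
```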


Example 2
Input sequence:
a_barayaran_array_ran_far_faar_faaar_away
The sequence is made of eight different characters (symbols):
a, b, f, n, r, w, y, _
Hence, we can use three bits per symbol to encode the
sequence, resulting in a total of 41 × 3 = 123 bits for the entire
sequence.
However, we can use fewer bits if we realize that some
symbols occur more frequently than others.
We can use fewer bits to encode the more frequent symbols.
K. Sayood, Introduction to Data Compression, 2nd edition, Morgan Kaufmann


Example 2
Input sequence: a_barayaran_array_ran_far_faar_faaar_away

[Table: for each of the eight input characters, its frequency in the
sequence (a: 16, r: 8, _: 7, f: 3, y: 3, n: 2, b: 1, w: 1), a 3-bit
fixed-length codeword, and a variable-length codeword in which the more
frequent characters receive the shorter codewords.]

Using variable-length codes we can encode the sequence using only 97 bits.

Statistical Redundancy
Statistical redundancy was employed in
Example 2 to build a code to encode the input
sequence.
When compressing text, statistical redundancy
can be exploited not only for characters but
also for words (the dictionary technique).
Examples of compression solutions that use
the dictionary technique include the Lempel-Ziv (LZ) algorithm LZ77, gzip, Zip, PNG, and PKZip.

Information and Entropy


Information can be defined as a message that helps to resolve
uncertainty.
In Information Theory, information is taken as a sequence of
symbols from an alphabet.
Entropy is a measure of information.

[Diagram: a source with alphabet A = {a1, a2, ..., an} emits a message,
i.e., a sequence of symbols such as a1 a2 a3 a6 a8 a5 a3 a4.]

First-order entropy of the source:

H(S) = - Σ (i = 1 to n) P(ai) log P(ai)

Entropy

H(S) = - Σ (i = 1 to n) P(ai) log P(ai)

If the base of the logarithm is 2, the units of entropy are bits. If the base is
10, the units are hartleys. If the base is e, the units are nats.
The first-order entropy assumes that the symbols occur independently of
each other.
The entropy is a measure of the average number of bits needed to
encode the output of the source.
Claude Shannon showed that the best rate that a lossless compression
algorithm can achieve is equal to the entropy of the source.
Example:
Let's consider a source with an alphabet consisting of four symbols: a1, a2, a3, a4.
P(a1) = 1/2, P(a2) = 1/4, P(a3) = 1/8, P(a4) = 1/8
H = -(1/2 log2(1/2) + 1/4 log2(1/4) + 1/8 log2(1/8) + 1/8 log2(1/8)) = 1.75
bits/symbol.
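A minimal sketch of this first-order entropy calculation (the probabilities are those of the example above; the function itself is an illustration, not part of the slides):

```python
import math

def first_order_entropy(probabilities):
    # H(S) = -sum(P(ai) * log2(P(ai))), in bits per symbol
    return -sum(p * math.log2(p) for p in probabilities if p > 0)

print(first_order_entropy([1/2, 1/4, 1/8, 1/8]))   # 1.75 bits/symbol
```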

Coding
Coding is the process of assigning binary sequences to symbols of
an alphabet.
Example:
Let's consider a source with a four-symbol alphabet such that: P(a1) = 1/2,
P(a2) = 1/4, P(a3) = 1/8, P(a4) = 1/8
H = 1.75 bits/symbol.

Symbol   Probability   Code 1   Code 2   Code 3   Code 4
a1       0.5           0        0        0        0
a2       0.25          0        1        10       01
a3       0.125         1        00       110      011
a4       0.125         10       11       111      0111
Average length         1.125    1.25     1.75     1.875
                       bits     bits     bits     bits

Codes 3 and 4 are uniquely decodable codes.

Prefix Codes
Consider two codewords: C1, which is k bits long, and C2, which is n bits
long, with k < n. If the first k bits of C2 are identical to C1, then we say
that C1 is a prefix of C2; the remaining n - k bits of C2 are called the
dangling suffix.

If the dangling suffix is itself a codeword, the code is not uniquely
decodable.
A prefix code is a code in which no codeword is a prefix of
another codeword.
Prefix codes are uniquely decodable.
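A minimal sketch of checking the prefix condition for a set of codewords (the code sets used here are Codes 3 and 4 from the table above; the function name is an assumption):

```python
def is_prefix_code(codewords):
    # A prefix code: no codeword is a prefix of another codeword
    for c1 in codewords:
        for c2 in codewords:
            if c1 != c2 and c2.startswith(c1):
                return False
    return True

print(is_prefix_code(['0', '10', '110', '111']))    # Code 3: True (prefix code)
print(is_prefix_code(['0', '01', '011', '0111']))   # Code 4: False (uniquely decodable, but not prefix)
```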

Huffman Coding
Huffman coding is an algorithm for building optimum prefix
codes.
It was developed as a class assignment in the first class on
information theory taught by Robert Fano at MIT in 1950.
Huffman coding assumes that the probabilities of the source are
known.
Huffman coding is based on the following observations about
optimum prefix codes:
Symbols with higher probability have shorter codewords than
less probable symbols.
The two symbols with the lowest probabilities have codewords of the same
length (proof by contradiction).
In a Huffman code the codewords corresponding to the two
symbols with the lowest probabilities differ only in the last bit.

Huffman Coding
Example:
Let's build a Huffman code for a source with a four-symbol alphabet
such that: P(a1) = 0.5, P(a2) = 0.25, P(a3) = 0.125, P(a4) = 0.125

Step 1: list the symbols in order of probability: a1 (0.5), a2 (0.25),
a3 (0.125), a4 (0.125).
Step 2: merge the two least probable symbols, a3 and a4, into a single node
with probability 0.125 + 0.125 = 0.25, labeling its two branches 0 and 1.

Huffman Coding
Step 3: the two least probable nodes are now a2 (0.25) and the {a3, a4} node
(0.25); merge them into a node with probability 0.5, again labeling the two
branches 0 and 1.

Huffman Coding
Step 4: merge the last two nodes, a1 (0.5) and the {a2, a3, a4} node (0.5),
into the root with total probability 1.0. Reading the branch labels from the
root down to each leaf gives the codewords:

Symbol   Probability   Codeword
a1       0.5           0
a2       0.25          10
a3       0.125         110
a4       0.125         111

Average codeword length:
lavg = 0.5×1 + 0.25×2 + 0.125×3 + 0.125×3 = 1.75 bits

It can be shown that for Huffman codes:
H(S) ≤ lavg < H(S) + 1
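A minimal sketch of building a Huffman code with a priority queue (the probabilities are those of the example; the tie-breaking order, and therefore the exact bit labels, may differ from the tree described above, but the codeword lengths are the same):

```python
import heapq
from itertools import count

def huffman_code(probabilities):
    # probabilities: dict mapping symbol -> probability
    tiebreak = count()   # unique counter so the heap never compares the dicts
    heap = [(p, next(tiebreak), {sym: ''}) for sym, p in probabilities.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        p0, _, code0 = heapq.heappop(heap)   # two least probable nodes
        p1, _, code1 = heapq.heappop(heap)
        merged = {s: '0' + c for s, c in code0.items()}
        merged.update({s: '1' + c for s, c in code1.items()})
        heapq.heappush(heap, (p0 + p1, next(tiebreak), merged))
    return heap[0][2]

code = huffman_code({'a1': 0.5, 'a2': 0.25, 'a3': 0.125, 'a4': 0.125})
print(code)   # e.g. {'a1': '0', 'a2': '10', 'a3': '110', 'a4': '111'} up to relabeling
```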


Decoding Huffman Codes


Example:
Decode the following message using the Huffman code from the
previous example: 0110101110

Reading the encoded message bit by bit and following the code tree
(a1 = 0, a2 = 10, a3 = 110, a4 = 111), the message splits into

0 | 110 | 10 | 111 | 0

Decoded message: a1 a3 a2 a4 a1
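A minimal sketch of this decoding loop, matching codewords greedily bit by bit (the code table is the one from the example; this works because a Huffman code is a prefix code):

```python
def huffman_decode(bits, code):
    # Invert the symbol -> codeword table and match prefixes greedily
    inverse = {cw: sym for sym, cw in code.items()}
    decoded, current = [], ''
    for b in bits:
        current += b
        if current in inverse:
            decoded.append(inverse[current])
            current = ''
    return decoded

code = {'a1': '0', 'a2': '10', 'a3': '110', 'a4': '111'}
print(huffman_decode('0110101110', code))   # ['a1', 'a3', 'a2', 'a4', 'a1']
```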

Adaptive Huffman Codes


Huffman coding requires knowledge of the probabilities of the source.
If this knowledge is not available, Huffman coding becomes a two-pass
procedure:
a first pass to compute the probabilities, and
a second pass to encode the output of the source.
The adaptive Huffman coding algorithm converts this two-pass
procedure into a single-pass procedure.
In adaptive Huffman coding, the transmitter and the receiver start with
a code tree that has a single node corresponding to all the symbols not
yet transmitted (NYT).
As transmission progresses, nodes corresponding to transmitted
symbols are added to the tree.
The first time a symbol is transmitted, the code for NYT is transmitted
first, followed by a fixed (non-adaptive) code agreed upon by the transmitter
and the receiver before transmission starts.

Golomb-Rice Codes
The Golomb-Rice codes are a family of codes commonly used in data
compression applications due to their low complexity and good
compression performance.
The JPEG committee and the Consultative Committee for Space Data
Systems (CCSDS), for instance, have adopted the Golomb-Rice codes as
part of their standards.
Golomb-Rice codes have also been adopted in lossless audio
compression (a related Exp-Golomb code is used in the H.264 video coding
standard) and are already used in many commercial audio compression programs.
The Golomb-Rice codes have their origin in the pioneering work of
Golomb, who proposed a method to encode run lengths of events of a
binary source when po^m = 1/2, where po is the probability of the events and m
is an integer.


Golomb-Rice Codes
[Diagram: a binary source with alphabet A = {0, 1} emits a stream such as
100001000100001000000010001001; po is the probability of a 1, with
po^m = 1/2 for some integer m. The run lengths n between 1s (non-negative
integers) follow a geometric distribution P(n).]

Golomb-Rice Codes
The Golomb-Rice codes consider the special case when m = 2^k (k ≥ 0).

Encoding procedure for a non-negative integer n:
quotient  q = floor( n / 2^k )  →  encoded in unary (q ones followed by a 0)
remainder r = n mod 2^k         →  encoded in natural binary using k bits
The codeword is the unary part followed by the k-bit binary part (e.g., for
k = 4 an 8-bit input b7 b6 b5 b4 b3 b2 b1 b0 is sent as the unary code of its
upper four bits followed by b3 b2 b1 b0).

Example: n = 17 (00010001)
k = 0  codeword = 111111111111111110
k = 1  codeword = 1111111101
k = 2  codeword = 1111001
k = 3  codeword = 110001
k = 4  codeword = 100001
k = 5  codeword = 010001
k = 6  codeword = 0010001
k = 7  codeword = 00010001
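A minimal sketch of this encoding procedure (the function name is illustrative; it reproduces the codewords for n = 17 listed above):

```python
def rice_encode(n, k):
    # Golomb-Rice codeword for a non-negative integer n with parameter m = 2**k
    q, r = n >> k, n & ((1 << k) - 1)        # quotient and remainder
    unary = '1' * q + '0'                    # q ones terminated by a zero
    binary = format(r, '0{}b'.format(k)) if k > 0 else ''
    return unary + binary

for k in range(4):
    print(k, rice_encode(17, k))
# 0 111111111111111110
# 1 1111111101
# 2 1111001
# 3 110001
```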

Golomb-Rice Codes
Practical sources produce positive and negative numbers
(a double-sided distribution P(n) centered around zero).

Use the following mapping before Golomb-Rice coding:

M(n) = 2n        if n ≥ 0
M(n) = 2|n| - 1  if n < 0

It maps non-negative input numbers to even integers and negative
input numbers to odd integers.
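A one-line sketch of this mapping (the helper name is an assumption):

```python
def signed_to_unsigned(n):
    # M(n) = 2n for n >= 0, 2|n| - 1 for n < 0
    return 2 * n if n >= 0 else 2 * abs(n) - 1

print([signed_to_unsigned(n) for n in [0, -1, 1, -2, 2, -3]])   # [0, 1, 2, 3, 4, 5]
```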

Adaptive Golomb-Rice Codes

[Block diagram: source → G-R coder → codeword; an adaptive algorithm observes
the coding process and adjusts the G-R coder's parameter k.]

Adaptive Golomb-Rice Codes

1) Initialize k to kini;
2) Reset counter;
3) Read input n and encode it using parameter k;
4) If (unary code > 1) increment counter;
5) If (unary code = 0) decrement counter;
6) If (counter value ≥ M) k++; Goto 2;
7) If (counter value ≤ -M) k--; Goto 2;
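A minimal sketch of this adaptation loop wrapped around the rice_encode function from the earlier example (kini and M follow the steps above; the k_max bound and treating the quotient value as the "unary code" are assumptions):

```python
def adaptive_rice_encode(samples, k_init=2, M=4, k_max=15):
    # Encodes a stream of non-negative integers, adapting k as in steps 1-7
    k, counter, codewords = k_init, 0, []
    for n in samples:
        codewords.append(rice_encode(n, k))     # step 3
        q = n >> k                              # value of the unary part
        if q > 1:                               # step 4: codeword getting long
            counter += 1
        elif q == 0:                            # step 5: k may be too large
            counter -= 1
        if counter >= M and k < k_max:          # step 6
            k, counter = k + 1, 0
        elif counter <= -M and k > 0:           # step 7
            k, counter = k - 1, 0
    return codewords
```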


Entropy Coding
If the source has a narrow distribution P(n), an entropy encoder (Huffman,
Golomb-Rice, arithmetic) can be used directly:

source → entropy encoder → compressed output

Otherwise, a decorrelation step might be necessary:

source → decorrelation (predictive coding, transform coding,
subband coding) → entropy encoder → compressed output

Predictive Coding - Decorrelation


In an image, a pixel generally has a value close to one of its neighbors.

Example neighborhood (raster scan), with the current pixel X = 64:

55 57 59 63
58 61 63 69
60 64

A pixel prediction X̂ is formed from the previously seen neighbors, and only
the small prediction residual e = X - X̂ needs to be encoded.
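A minimal sketch of a simple left-neighbor (previous-pixel) predictor for one image row; this particular predictor is an illustrative choice, not the one used by any specific standard:

```python
def predict_row(row):
    # Predict each pixel from its left neighbor; the first pixel is sent as-is
    residuals = [row[0]]
    for i in range(1, len(row)):
        residuals.append(row[i] - row[i - 1])   # e = X - Xhat, with Xhat = left pixel
    return residuals

print(predict_row([55, 57, 59, 63]))   # [55, 2, 2, 4] -- small residuals
```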

Predictive Coding - Decorrelation


[Figure: an original image and its prediction residual, together with their
histograms; the residual histogram is narrow and concentrated around zero.]


Context Adaptive Lossless Image Compression (CALIC)
Pixel neighborhood: the neighboring pixels N, W, NE, NW, NN, WW, and NNE are
available to both the encoder and the decoder (assuming a raster scan).

To get an idea of the boundaries present in the neighborhood, compute the
horizontal and vertical activity measures:

dh = |W - WW| + |N - NW| + |N - NE|
dv = |W - NW| + |N - NN| + |NE - NNE|

Initial pixel prediction:

if (dv - dh > 80)       X̂ = W
else if (dh - dv > 80)  X̂ = N
else {
    X̂ = (W + N)/2 + (NE - NW)/4
    if (dv - dh > 32)       X̂ = (X̂ + W)/2
    else if (dh - dv > 32)  X̂ = (X̂ + N)/2
    else if (dv - dh > 8)   X̂ = (3X̂ + W)/4
    else if (dh - dv > 8)   X̂ = (3X̂ + N)/4
}

The initial prediction is refined based on the relationships of the pixels in
the neighborhood (contexts). For each context we keep track of how much
prediction error is generated and use it to refine the initial prediction.
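A minimal sketch of this gradient-adjusted prediction step (the context-based refinement that follows it in CALIC is not shown; the thresholds 80, 32, and 8 are the ones listed above):

```python
def gap_predict(N, W, NE, NW, NN, WW, NNE):
    # Gradient-adjusted prediction used as the initial CALIC prediction
    dh = abs(W - WW) + abs(N - NW) + abs(N - NE)    # horizontal activity
    dv = abs(W - NW) + abs(N - NN) + abs(NE - NNE)  # vertical activity
    if dv - dh > 80:        # sharp horizontal edge -> predict from W
        return W
    if dh - dv > 80:        # sharp vertical edge -> predict from N
        return N
    pred = (W + N) / 2 + (NE - NW) / 4
    if dv - dh > 32:
        pred = (pred + W) / 2
    elif dh - dv > 32:
        pred = (pred + N) / 2
    elif dv - dh > 8:
        pred = (3 * pred + W) / 4
    elif dh - dv > 8:
        pred = (3 * pred + N) / 4
    return pred
```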

Transform Coding
In transform coding the input sequence is transformed into another sequence in
which most of the information is contained in only a few elements.
For a 1D signal x, such as audio or speech, the forward transform is defined as:
θ = A x
and the inverse transform is defined as:
x = A^T θ
The transforms are orthonormal transforms: A A^T = A^T A = I.

For 2D signals such as images, a two-dimensional separable transform is used.
In a separable transform, we can take a 1D transform in one dimension and
another 1D transform in the other dimension.
In matrix notation:
Θ = A X A^T
and the inverse transform is given by:
X = A^T Θ A


Transform Coding
In the JPEG standard, the forward transform is the Discrete Cosine Transform
(DCT) and the inverse transform is the Inverse Discrete Cosine Transform (IDCT).
The N×N DCT transform matrix is defined as:

C(i,j) = sqrt(1/N) cos( (2j+1) i π / (2N) )   for i = 0,            j = 0, 1, ..., N-1
C(i,j) = sqrt(2/N) cos( (2j+1) i π / (2N) )   for i = 1, ..., N-1,  j = 0, 1, ..., N-1
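A minimal numpy sketch that builds this DCT matrix and applies the separable 2D transform Θ = C X C^T (the 8×8 size matches JPEG; function names are illustrative):

```python
import numpy as np

def dct_matrix(N=8):
    # C[i, j] = sqrt(1/N) cos((2j+1) i pi / (2N)) for i = 0, sqrt(2/N) otherwise
    C = np.zeros((N, N))
    for i in range(N):
        scale = np.sqrt(1.0 / N) if i == 0 else np.sqrt(2.0 / N)
        for j in range(N):
            C[i, j] = scale * np.cos((2 * j + 1) * i * np.pi / (2 * N))
    return C

def dct2(block):
    # Separable 2D transform: Theta = C X C^T; the inverse is C^T Theta C
    C = dct_matrix(block.shape[0])
    return C @ block @ C.T
```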

[JPEG encoder block diagram: input image → DCT → quantization (using a
quantization table); the quantized DC coefficients are coded with DPCM and
the AC coefficients with run-length coding (RLC), followed by an entropy
encoder that produces the compressed image.]

Transform Coding - DCT


Input 8×8 block:

183 177 147  79  41  34  35  43
189 153  63  39  38  37  39  44
187  99  37  38  42  41  46  46
101  42  36  39  61  63  59  44
 41  41  38  45  57  73  52  47
 44  49  49  50  54  60  58  54
 51  58  55  50  55  57  58  54
 44  50  52  54  55  59  67  63

DCT coefficients (the top-left value is the DC coefficient; the remaining
values are the AC coefficients):

502.0  119.5   83.8   48.3    6.0    0.0   -0.1   -0.3
 88.6  173.4   90.9   22.5   11.5   -1.8   -0.2   -0.8
 62.0   78.7   22.2  -44.9  -19.8   -9.4   -7.3   -1.1
 12.2    4.7  -37.1  -44.6  -30.2  -12.2    5.0   -3.0
  3.5  -22.5  -36.9  -20.3  -13.0    4.1   11.5    5.1
 12.1    9.7   -7.0   -6.6    2.6   11.3    8.5   11.5
  9.2    7.9    3.7   -6.4    6.3   10.1    3.8    1.8
  2.6    9.8    1.4   -2.0    0.3   -1.2    2.3   -5.1

Most of the energy is concentrated in the low-frequency (top-left)
coefficients.

Quantization of DCT Coefficients


DCT coefficients Θ(i,j):

502.0  119.5   83.8   48.3    6.0    0.0   -0.1   -0.3
 88.6  173.4   90.9   22.5   11.5   -1.8   -0.2   -0.8
 62.0   78.7   22.2  -44.9  -19.8   -9.4   -7.3   -1.1
 12.2    4.7  -37.1  -44.6  -30.2  -12.2    5.0   -3.0
  3.5  -22.5  -36.9  -20.3  -13.0    4.1   11.5    5.1
 12.1    9.7   -7.0   -6.6    2.6   11.3    8.5   11.5
  9.2    7.9    3.7   -6.4    6.3   10.1    3.8    1.8
  2.6    9.8    1.4   -2.0    0.3   -1.2    2.3   -5.1

Quantization table Q(i,j) (the JPEG luminance table):

16  11  10  16  24  40  51  61
12  12  14  19  26  58  60  55
14  13  16  24  40  57  69  56
14  17  22  29  51  87  80  62
18  22  37  56  68 109 103  77
24  35  55  64  81 104 113  92
49  64  78  87 103 121 120 101
72  92  95  98 112 100 103  99

Each coefficient is quantized to the label l(i,j) = round( Θ(i,j) / Q(i,j) );
the quantized (reconstructed) coefficients l(i,j) × Q(i,j) are:

496  121   80   48    0    0    0    0
 84  168   84   19    0    0    0    0
 56   78   16  -48    0    0    0    0
 14    0  -44  -58  -51    0    0    0
  0  -22  -37    0    0    0    0    0
 24    0    0    0    0    0    0    0
  0    0    0    0    0    0    0    0
  0    0    0    0    0    0    0    0

After quantization the DCT coefficients are transmitted
following a zig-zag pattern. The coefficients are
encoded using a Huffman
code.
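A minimal numpy sketch of this quantization step and its inverse (Q is the table shown above; function names are illustrative):

```python
import numpy as np

Q = np.array([
    [16, 11, 10, 16, 24, 40, 51, 61],
    [12, 12, 14, 19, 26, 58, 60, 55],
    [14, 13, 16, 24, 40, 57, 69, 56],
    [14, 17, 22, 29, 51, 87, 80, 62],
    [18, 22, 37, 56, 68, 109, 103, 77],
    [24, 35, 55, 64, 81, 104, 113, 92],
    [49, 64, 78, 87, 103, 121, 120, 101],
    [72, 92, 95, 98, 112, 100, 103, 99],
])

def quantize(theta, Q):
    # Labels l = round(theta / Q); these small integers are what gets entropy coded
    return np.round(theta / Q).astype(int)

def dequantize(labels, Q):
    # Reconstructed coefficient values l * Q used by the decoder
    return labels * Q
```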

Transform Coding - DCT


[Figure: the original image and the image coded using the DCT.]

Sub-band Coding
In sub-band coding the input signal is decomposed into several sub-bands
using an analysis filter bank.
Depending on the signal, different sub-bands will contain different
amounts of information.
Sub-bands with lots of information are encoded using more bits, while
sub-bands with little information are encoded using fewer bits.
At the decoder side, the signal is reconstructed using a bank of synthesis
filters.

[Figure: the signal spectrum split into frequency bands f1, f2, f3, ..., fM.]
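A minimal two-band sketch of the analysis/synthesis idea using Haar filters (a deliberately simple filter choice for illustration; practical codecs use longer filter banks):

```python
def haar_analysis(x):
    # Split the signal into a low-pass (average) and a high-pass (difference) sub-band
    low  = [(x[i] + x[i + 1]) / 2 for i in range(0, len(x) - 1, 2)]
    high = [(x[i] - x[i + 1]) / 2 for i in range(0, len(x) - 1, 2)]
    return low, high

def haar_synthesis(low, high):
    # Perfect reconstruction from the two sub-bands
    x = []
    for l, h in zip(low, high):
        x.extend([l + h, l - h])
    return x

low, high = haar_analysis([10, 12, 14, 14, 40, 42, 8, 6])
print(low, high)                     # the low band carries most of the information
print(haar_synthesis(low, high))     # [10, 12, 14, 14, 40, 42, 8, 6]
```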

Sub-band Coding
[Block diagram: the input is passed through analysis filters 1 to M, each
followed by an entropy encoder; at the decoder, entropy decoders 1 to M feed
the corresponding synthesis filters, whose outputs are combined to form the
output signal.]

Further Reading
Khalid Sayood, Introduction to Data Compression, 4th edition, Morgan
Kaufmann, San Francisco, 2012.
G. Held and T. R. Marshall, Data Compression, 3rd edition, John Wiley
and Sons, New York, 1991.
N. S. Jayant and P. Noll, Digital Coding of Waveforms, Prentice Hall,
Englewood Cliffs, 1984.
B. E. Usevitch, "A tutorial on modern lossy wavelet image compression:
foundations of JPEG 2000," IEEE Signal Processing Magazine, vol. 18, no.
5, 2001.
D. Pan, "Digital audio compression," Digital Technical Journal, vol. 5, no.
2, 1993.
M. Hans and R. W. Schafer, "Lossless compression of digital audio," IEEE
Signal Processing Magazine, vol. 18, no. 4, 2001.
G. E. Blelloch, Introduction to Data Compression, course notes,
Computer Science Department, Carnegie Mellon University.
