
How File Compression Works

If you download many programs and files off the Internet, you've probably encountered ZIP files before. This compression system is a very handy invention, especially for Web users, because it lets you reduce the overall number of bits and bytes in a file so it can be transmitted faster over slower Internet connections, or take up less space on a disk. Once you download the file, your computer uses a program such as WinZip or Stuffit to expand the file back to its original size. If everything works correctly, the expanded file is identical to the original file before it was compressed.

At first glance, this seems very mysterious. How can you reduce the number of bits and bytes and then add those exact bits and bytes back later? As it turns out, the basic idea behind the process is fairly straightforward. In this article, we'll examine this simple method as we take a very small file through the basic process of compression.

Finding Redundancy

Most types of computer files are fairly redundant -- they have the same information listed over and over again. File-compression programs simply get rid of the redundancy. Instead of listing a piece of information over and over again, a file-compression program lists that information once and then refers back to it whenever it appears in the original program.

As an example, let's look at a type of information we're all familiar with: words. In John F. Kennedy's 1961 inaugural address, he delivered this famous line: "Ask not what your country can do for you -- ask what you can do for your country." The quote has 17 words, made up of 61 letters, 16 spaces, one dash and one period. If each letter, space or punctuation mark takes up one unit of memory, we get a total file size of 79 units. To get the file size down, we need to look for redundancies. Immediately, we notice that:

"ask" appears two times
"what" appears two times
"your" appears two times
"country" appears two times
"can" appears two times
"do" appears two times
"for" appears two times
"you" appears two times

Ignoring the difference between capital and lower-case letters, roughly half of the phrase is redundant. Nine words -- ask, not, what, your, country, can, do, for, you -- give us almost everything we need for the entire quote. To construct the second half of the phrase, we just point to the words in the first half and fill in the spaces and punctuation. In the next section, we'll see how file-compression systems accomplish this.
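The word counts above are easy to check mechanically. The following is a minimal sketch in Python, not part of the original article; the variable names are illustrative, and it assumes that a "word" is any run of letters once the quote has been lower-cased.

```python
import re
from collections import Counter

quote = ("Ask not what your country can do for you -- "
         "ask what you can do for your country.")

# Lower-case the quote (the example ignores capitalization) and
# keep only runs of letters, so the dash and the period drop out.
words = re.findall(r"[a-z]+", quote.lower())

counts = Counter(words)
# Words that occur more than once, in order of first appearance.
repeated = [word for word, n in counts.items() if n > 1]

print(len(words))    # total words in the quote
print(len(counts))   # distinct words
print(repeated)      # the redundant words
```

Only "not" appears a single time, so the other eight distinct words are all candidates for a dictionary.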

Looking it Up

Most compression programs use a variation of the LZ adaptive dictionary-based algorithm to shrink files. "LZ" refers to Lempel and Ziv, the algorithm's creators, and "dictionary" refers to the method of cataloging pieces of data. The system for arranging dictionaries varies, but it could be as simple as a numbered list. When we go through Kennedy's famous words, we pick out the words that are repeated and put them into the numbered index. Then, we simply write the number instead of writing out the whole word. So, if this is our dictionary:

1. ask
2. what
3. your
4. country
5. can
6. do
7. for
8. you

Our sentence now reads:

"1 not 2 3 4 5 6 7 8 -- 1 2 8 5 6 7 3 4"
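The substitution can be sketched in a few lines of Python. This is only an illustration of the word-for-number idea, not how a real compression program works; the `encode` function and its names are hypothetical, and the closing period is left off the quote so the output matches the sentence shown above.

```python
# The numbered dictionary from the example: position 1 is "ask", and so on.
dictionary = ["ask", "what", "your", "country", "can", "do", "for", "you"]
index = {word: str(number) for number, word in enumerate(dictionary, start=1)}

def encode(text):
    """Replace every dictionary word with its number; keep other tokens."""
    tokens = text.lower().split()  # the example ignores capitalization
    return " ".join(index.get(token, token) for token in tokens)

quote = ("Ask not what your country can do for you -- "
         "ask what you can do for your country")
encoded = encode(quote)
print(encoded)
```

Tokens that are not in the dictionary, such as "not" and the dash, pass through unchanged.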

If you knew the system, you could easily reconstruct the original phrase using only this dictionary and number pattern. This is what the expansion program on your computer does when it expands a downloaded file. You might also have encountered compressed files that open themselves up. To create this sort of file, the programmer includes a simple expansion program with the compressed file. It automatically reconstructs the original file once it's downloaded.

But how much space have we actually saved with this system? "1 not 2 3 4 5 6 7 8 -- 1 2 8 5 6 7 3 4" is certainly shorter than "Ask not what your country can do for you; ask what you can do for your country;" but keep in mind that we need to save the dictionary itself along with the file.

In an actual compression scheme, figuring out the various file requirements would be fairly complicated; but for our purposes, let's go back to the idea that every character and every space takes up one unit of memory. We already saw that the full phrase takes up 79 units. Our compressed sentence (including spaces) takes up 37 units, and the dictionary (words and numbers) also takes up 37 units. This gives us a file size of 74, so we haven't reduced the file size by very much.

But this is only one sentence! You can imagine that if the compression program worked through the rest of Kennedy's speech, it would find these words and others repeated many more times. And, as we'll see in the next section, it would also be rewriting the dictionary to get the most efficient organization possible.
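Both the expansion step and the size bookkeeping can be sketched in Python. This is a rough illustration under stated assumptions, not a real file format: the `decode` function is hypothetical, each character counts as one unit, and the double hyphen is counted as a single dash, as in the tally above.

```python
dictionary = {1: "ask", 2: "what", 3: "your", 4: "country",
              5: "can", 6: "do", 7: "for", 8: "you"}
encoded = "1 not 2 3 4 5 6 7 8 -- 1 2 8 5 6 7 3 4"

def decode(encoded, dictionary):
    """Swap each number back for its dictionary word."""
    words = []
    for token in encoded.split():
        words.append(dictionary[int(token)] if token.isdigit() else token)
    return " ".join(words)

decoded = decode(encoded, dictionary)
print(decoded)

# Assumption: "--" stands for one dash, so it costs one unit.
sentence_units = len(encoded.replace("--", "-"))
# Each dictionary entry costs its number plus its word.
dictionary_units = sum(len(str(number)) + len(word)
                       for number, word in dictionary.items())
total_units = sentence_units + dictionary_units
print(sentence_units, dictionary_units, total_units)
```

The decoded text comes back in lower case and without the final period, because the encoded sentence never stored either one.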

Searching for Patterns


In our example, we picked out all the repeated words and put those in a dictionary. To us, this is the most obvious way to write a dictionary. But a compression program sees it quite differently: It doesn't have any concept of separate words -- it only looks for patterns. And in order to reduce the file size as much as possible, it carefully selects which patterns to include in the dictionary.

If we approach the phrase from this perspective, we end up with a completely different dictionary. If the compression program scanned Kennedy's phrase, the first redundancy it would come across would be only a couple of letters long. In "ask not what your," there is a repeated pattern of the letter "t" followed by a space -- in "not" and "what." If the compression program wrote this to the dictionary, it could write a "1" every time a "t" were followed by a space. But in this short phrase, this pattern doesn't occur enough to make it a worthwhile entry, so the program would eventually overwrite it.

The next thing the program might notice is "ou," which appears in both "your" and "country." If this were a longer document, writing this pattern to the dictionary could save a lot of space -- "ou" is a fairly common combination in the English language.
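To see what "looking for patterns" means without any notion of words, one can count every two-character pair in the phrase. The sketch below is a simple illustration in Python, not the search a real LZ implementation performs; it works on the lower-cased quote and treats spaces like any other character.

```python
from collections import Counter

text = ("ask not what your country can do for you -- "
        "ask what you can do for your country.")

# Slide a two-character window across the text and count each pair.
pairs = Counter(text[i:i + 2] for i in range(len(text) - 1))

# Keep only the pairs that occur more than once.
repeated = {pair: n for pair, n in pairs.items() if n > 1}

print(pairs["t "])   # "t" followed by a space
print(pairs["ou"])   # the "ou" in "your", "country" and "you"
```

Across the whole quote, "t" followed by a space turns up three times and "ou" six times, which is why the longer, more frequent pattern looks like the better dictionary entry.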
But as the compression program worked through this sentence, it would quickly discover a better choice for a dictionary entry: Not only is "ou" repeated, but the entire words "your" and "country" are both repeated, and they are actually repeated together, as the phrase "your country." In this case, the program would overwrite the dictionary entry for "ou" with the entry for "your country."

The phrase "can do for" is also repeated, one time followed by "your" and one time followed by "you," giving us a repeated pattern of "can do for you." This lets us write 15 characters (including spaces) with one number value, while "your country" only lets us write 13 characters (with spaces) with one number value, so the program would overwrite the "your country" entry as just "r country," and then write a separate entry for "can do for you." The program proceeds in this way, picking up all repeated bits of information and then calculating which patterns it should write to the dictionary. This ability to rewrite the dictionary is the "adaptive" part of LZ adaptive dictionary-based algorithm. The way a program actually does this is fairly complicated, as you can see by the discussions on Data-Compression.com.

No matter what specific method you use, this in-depth searching system lets you compress the file much more efficiently than you could by just picking out words. Using the patterns we picked out above, and adding "__" for spaces, we come up with this larger dictionary:

1. ask__
2. what__
3. you
4. r__country
5. __can__do__for__you

And this smaller sentence:

"1not__2345__--__12354"

The sentence now takes up 18 units of memory, and our dictionary takes up 41 units. So we've compressed the total file size from 79 units to 59 units! This is just one way of compressing the phrase, and not necessarily the most efficient one. (See if you can find a better way!) In the next section, we'll see some of the ways in which compression percentage might vary.

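The pattern dictionary can be checked by expanding the short sentence back out. The Python sketch below is illustrative only: the `expand` function is hypothetical, real spaces stand in for the "__" markers, and the character-by-character substitution only works because the phrase contains no digits of its own.

```python
# Pattern dictionary from the example, with "__" written as a real space.
dictionary = {
    "1": "ask ",
    "2": "what ",
    "3": "you",
    "4": "r country",
    "5": " can do for you",
}
sentence = "1not 2345 -- 12354"

def expand(sentence, dictionary):
    """Replace each digit with its pattern; copy everything else as-is."""
    return "".join(dictionary.get(ch, ch) for ch in sentence)

expanded = expand(sentence, dictionary)
print(expanded)

# One unit for each entry number plus one unit per pattern character.
dictionary_units = sum(len(number) + len(pattern)
                       for number, pattern in dictionary.items())
print(dictionary_units)
```

Note how entries 3 and 4 combine to spell "your country," while entry 3 on its own also serves as the tail of "you can do for you."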
How Much Can You Trim?


So how good is this system? The file-reduction ratio depends on a number of factors, including file type, file size and compression scheme. In most languages of the world, certain letters and words often appear together in the same pattern. Because of this high rate of redundancy, text files compress very well. A reduction of 50 percent or more is typical for a good-sized text file. Most programming languages are also very redundant because they use a relatively small collection of commands, which frequently go together in a set pattern. Files that include a lot of unique information, such as graphics or MP3 files, cannot be compressed much with this system because they don't repeat many patterns (more on this in the next section).

If a file has a lot of repeated patterns, the rate of reduction typically increases with file size. You can see this just by looking at our example -- if we had more of Kennedy's speech, we would be able to refer to the patterns in our dictionary more often, and so get more out of each entry's file space. Also, more pervasive patterns might emerge in the longer work, allowing us to create a more efficient dictionary.

This efficiency also depends on the specific algorithm used by the compression program. Some programs are particularly suited to picking up patterns in certain types of files, and so may compress them more succinctly. Others have dictionaries within dictionaries, which might compress efficiently for larger files but not for smaller ones. While all compression programs of this sort work with the same basic idea, there is actually a good deal of variation in the manner of execution. Programmers are always trying to build a better system.
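These effects can be observed with an off-the-shelf compressor. The sketch below uses Python's standard `zlib` module, which implements DEFLATE (LZ77 combined with Huffman coding); it is a stand-in for the dictionary scheme described here, not the same algorithm. Repeating the quote 100 times is an artificially extreme case of redundancy, and random bytes are an assumed stand-in for a file full of unique information.

```python
import os
import zlib

quote = (b"Ask not what your country can do for you -- "
         b"ask what you can do for your country. ")

def ratio(data):
    """Compressed size divided by original size (smaller is better)."""
    return len(zlib.compress(data)) / len(data)

short_text = quote                          # one sentence
long_text = quote * 100                     # same patterns, many times over
random_data = os.urandom(len(long_text))    # no patterns to exploit

print(f"short text: {ratio(short_text):.2f}")
print(f"long text:  {ratio(long_text):.2f}")
print(f"random:     {ratio(random_data):.2f}")
```

The exact figures depend on the zlib version and settings, but the ordering should hold: the long repetitive text shrinks far more than the single sentence, and the random bytes barely shrink at all.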

Lossy and Lossless

The type of compression we've been discussing here is called lossless compression, because it lets you recreate the original file exactly. All lossless compression is based on the idea of breaking a file into a "smaller" form for transmission or storage and then putting it back together on the other end so it can be used again.

Lossy compression works very differently. These programs simply eliminate "unnecessary" bits of information, tailoring the file so that it is smaller. This type of compression is used a lot for reducing the file size of bitmap pictures, which tend to be fairly bulky. To see how this works, let's consider how your computer might compress a scanned photograph.

A lossless compression program can't do much with this type of file. While large parts of the picture may look the same -- the whole sky is blue, for example -- most of the individual pixels are a little bit different. To make this picture smaller without compromising the resolution, you have to change the color value for certain pixels. If the picture had a lot of blue sky, the program would pick one color of blue that could be used for every pixel. Then, the program rewrites the file so that the value for every sky pixel refers back to this information. If the compression scheme works well, you won't notice the change, but the file size will be significantly reduced.

Of course, with lossy compression, you can't get the original file back after it has been compressed. You're stuck with the compression program's reinterpretation of the original. For this reason, you can't use this sort of compression for anything that needs to be reproduced exactly, including software applications, databases and presidential inauguration speeches.
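The blue-sky example can be imitated with a toy picture. The Python sketch below is a loose illustration and not how real image formats such as JPEG work: the "sky" is an assumed list of 1,000 slightly different RGB pixels, the single replacement blue is simply their average, and `zlib` stands in for the lossless step that stores the result.

```python
import random
import zlib

rng = random.Random(0)  # fixed seed so the toy picture is repeatable

# A patch of sky: every pixel is blue, but each one is a little different.
sky = [(100 + rng.randrange(16), 150 + rng.randrange(16), 220 + rng.randrange(16))
       for _ in range(1000)]

def average_color(pixels):
    """Pick one color to stand for every pixel (the lossy step)."""
    n = len(pixels)
    return tuple(sum(pixel[c] for pixel in pixels) // n for c in range(3))

def packed_size(pixels):
    """Bytes needed after lossless compression of the raw pixel values."""
    raw = bytes(value for pixel in pixels for value in pixel)
    return len(zlib.compress(raw))

blue = average_color(sky)
lossy_sky = [blue] * len(sky)   # every sky pixel now refers to one blue

print(packed_size(sky), packed_size(lossy_sky))
```

Once every pixel has been replaced by the same blue, the small differences between the original pixels are gone for good, which is exactly why the result is so much easier to compress and why the original cannot be recovered.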
